You can sort them and then just scan the sorted list. So either of these solutions would work. And of course, this is a very simple problem, one that you might encounter in a very first algorithms course. The standard solution is to use a hash table; you could also use sorting. A hash table is just the same thing as a symbol table: as you encounter elements, you put them into the hash table, and obviously, the moment you encounter a duplicate, you realize it.
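As a rough illustration, a minimal version of both approaches might look like the following Python sketch (my own example, with a built-in set standing in for the hash or symbol table):

def has_duplicate(elements):
    """Hash-table approach: insert elements as you see them and report a
    duplicate the moment an element is already present."""
    seen = set()
    for x in elements:
        if x in seen:
            return True
        seen.add(x)
    return False

def has_duplicate_by_sorting(elements):
    """Sorting approach: sort, then scan the sorted list for equal neighbors."""
    s = sorted(elements)
    return any(a == b for a, b in zip(s, s[1:]))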
Okay, so this is a very simple problem. What makes this problem very hard is: what happens if the list is very big? What if you had to do this not on 1,000 elements, but on a billion elements, or a trillion elements? Then all of a sudden, all of these simple approaches, like using a hash table or doing sorting, go out the window. Number one, you need a whole lot of memory to actually keep a hash table. Modern computers are perhaps capable of holding hash tables even of this size, but at the scales we're talking about it becomes a real stretch. And sorting such a huge list takes a very long time.
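Just to get a rough sense of the memory issue, here is a quick back-of-envelope calculation (my own illustrative numbers, not figures from the lecture):

# Raw key storage alone, at an assumed 8 bytes per key, ignoring all
# hash-table overhead (a real table needs considerably more).
for n in (1_000, 10**9, 10**12):
    print(f"{n:>16,} keys -> {n * 8 / 10**9:,.6f} GB")
# 1,000 keys is trivial; a billion keys is already about 8 GB; a trillion
# keys is about 8,000 GB (8 TB), which does not fit in the memory of a
# single ordinary machine.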
So, why would you care about a problem like this? Well, think about Google, for example. Some statistics they released earlier this year say that Google, in its entire lifetime, has seen something like 30 trillion URLs. They crawl something like twenty billion pages a day, and they serve about three billion searches a day, which means about 100 billion searches a month. So that's the scale at which they're operating. And if you had to answer the simplest kinds of questions about what they do, like how many queries did they get in a day? How many distinct queries did they get in a day? How many distinct queries did they get in a month, or a year? You're faced with solving what seems to be a very simple problem, but one that is very difficult just because of the immense scale at which you need to operate.
Of course, answering this question of how many distinct queries you have is a very simple statistic that you might want to compute and keep track of as time goes by. But you might want more sophisticated statistics, like how much did the queries change from day to day? How different are they today compared to yesterday? How different are they this month compared to the previous month, or this month compared to the same month last year, and so on and so forth. And all of these very, very simple questions just seem incredibly difficult when you're dealing with data at these sizes.
There are other settings where you have lots of data and you'd like to come up with algorithms to compute really simple things. A network router is one example. It sees a lot of packets whizzing by at high speed: current home routers are capable of speeds of about one gigabit per second, and commercial routers are capable of speeds of something like ten gigabits per second. The router is a really simple computing device; all it does is receive packets and forward them along. It could perhaps do some very simple sort of analysis on the data as it goes by. How could you use this very, very simple computing device to do simple monitoring? Like, can you tell me how many active flows there are in the router? Which are the flows that are contributing most of the traffic?
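To make one of these monitoring questions concrete, here is a small single-pass sketch for finding candidate heavy flows using only a handful of counters, in the spirit of the Misra-Gries algorithm (my own illustrative example, not something described in the lecture):

def heavy_flows(packets, k):
    """Keep at most k - 1 counters; the surviving flow ids are candidates
    that may each account for more than a 1/k fraction of the packets."""
    counters = {}
    for flow_id in packets:
        if flow_id in counters:
            counters[flow_id] += 1
        elif len(counters) < k - 1:
            counters[flow_id] = 1
        else:
            # No room for a new counter: decrement all counters and drop
            # any that reach zero.
            for f in list(counters):
                counters[f] -= 1
                if counters[f] == 0:
                    del counters[f]
    return counters  # candidates only; exact counts need a second check

# Example on a tiny packet trace with k = 3: flow "a" dominates and survives.
print(heavy_flows(["a", "b", "a", "c", "a", "a", "b", "a"], 3))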
And these sorts of monitoring questions are important because answering them is essential to various security measures, like early detection of denial-of-service attacks. It's important to keep track of what's going on, what sort of traffic you're seeing, because changes in this might signal that something strange is happening and you might want to take some security measures. So very, very simple statistics like this are a huge challenge, just because of the amount of data that this router is seeing and the fact that the router itself is a very, very simple device.
So, in general, there's talk of big data everywhere. If you read the news or read magazines, every other day there's an article about how we're surrounded with data and we need to find ways to make sense of it. What I'm really going to talk about is what are some of the algorithmic techniques that we can use to tame these big data problems.
Okay, alright. So that's the introduction to what the problems are. Let me talk about what models we could envision for designing algorithms. Clearly, our traditional notion of having all of the data available to us, having an algorithm with random access to the data, and fitting all the data in memory doesn't really apply when we're dealing with data at such a large scale. So the two operative constraints are these. Number one, the data size is very large; it far exceeds the available memory. That's just a reality we have to face: we can't actually store all the data in memory.
And the other thing is that there's a new emphasis on efficiency. Usually, when you design algorithms, polynomial-time algorithms are considered efficient and implementable. But in reality, when you're dealing with data at this scale, your algorithms have to run in linear or perhaps near-linear time. Even something like quadratic time, at the scales we've talked about, is out of the question. So those are, sort of, the ground rules.
So one kind of model that's used, and that we'll see later on in the lecture, is the idea of taking our data and somehow reducing it to a smaller size. It turns out that in many cases we can replace the original data by some kind of compact summary, such that by discarding the original data and looking only at this compact summary, we can actually answer the questions that we care about. Now, it would be wonderful if you could do this, because you would greatly reduce the data size, and it turns out that in many cases this is actually possible. One thing you can be willing to give up on is the idea that you really have to solve the problem exactly. In most cases, like if you want to find out how many distinct elements there were among the queries, do you really care whether I give you a one percent approximate answer or the exact answer? It doesn't really matter. So in most cases of interest it's okay to not give the exact answer, but to give something that is fairly accurate. And it turns out that this relaxation, moving from getting the exact answer to tolerating some small error, is something that actually makes many problems tractable. We'll see some examples of this idea.
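To make this compact-summary-with-a-small-error idea concrete, here is a minimal one-pass distinct-count estimator in the spirit of the Flajolet-Martin sketch (an illustrative example of mine, not the specific algorithm this lecture builds up to):

import hashlib

def _trailing_zeros(x):
    """Number of trailing zero bits in a 64-bit hash value."""
    if x == 0:
        return 64
    count = 0
    while x & 1 == 0:
        x >>= 1
        count += 1
    return count

def approx_distinct(stream):
    """One pass, constant memory: remember only the largest number of
    trailing zeros seen among the hashed items; 2**that is the estimate."""
    max_zeros = 0
    for item in stream:
        h = int(hashlib.blake2b(str(item).encode(), digest_size=8).hexdigest(), 16)
        max_zeros = max(max_zeros, _trailing_zeros(h))
    return 2 ** max_zeros

# A single estimator like this has high variance; in practice one keeps many
# such sketches (or uses HyperLogLog) and combines them to get within a few
# percent of the true count, using only kilobytes of state.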
Another idea that's been very useful is a certain model called the streaming model. The streaming model is the following: you have to design your algorithm so that it operates this way. Imagine the data as being one stream, and you make one pass over the data, or perhaps a few passes over the data, but you can remember only a small amount of information relative to the size of the data. So you can afford to remember very little, and you're only allowed to see the data in one pass, or in several passes. You can't, for example, zig-zag back and forth; you don't have random access to the data. And the point is, if you've actually designed an algorithm that works in this very, very restrictive way, you can actually run it on very large data sets, and that's very efficient.
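One way to picture the model is as a small object with a tiny amount of state that gets fed the items one at a time; the interface below is a hypothetical sketch of mine, not code from the lecture:

class StreamingAlgorithm:
    """Skeleton of the streaming model: small state, items seen one at a
    time in stream order, no random access to the data."""
    def __init__(self):
        self.state = None  # must stay tiny relative to the stream size

    def process(self, item):
        """Called once per item as it streams by."""
        raise NotImplementedError

    def query(self):
        """Answer the question of interest from the small state alone."""
        raise NotImplementedError

def run_one_pass(algorithm, stream):
    for item in stream:          # a single pass; no going back
        algorithm.process(item)
    return algorithm.query()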
Okay. So it seems like we're really tying our hands and putting a blindfold on. How could we possibly do anything in this very, very restricted mode of operation? But it turns out that there are very interesting things that you can do even when you restrict yourself to interacting with the data in this very controlled way. Just a little bit of history of where this notion of streaming algorithms came from: the idea of trying to design algorithms that work in one pass over the data, remembering very little.