If I roll down, if I drill down, am I done?
No, I can continue exporting my data.
I'm looking at my data,
and I see that's, okay, I have Arizona,
California Maryland, can I roll this up more?
In this case I can say, "Well,
why don't you roll-up location Arizona more,
because I want to see where Arizona is in the US."
In this example, my original table had Arizona in only one entry,
and when I roll-up in Arizona,
I get one entry in southwest.
So I know that one of my users is in the southwest.
So if you can see,
drill-down and roll-up is a tool that I can use
to add either more details to my database or my data set,
or to remove details from my data.
This way, I can either get better more higher level patterns,
I can see more high level patterns in my data,
or I can see more details in my data.
This is a very strong data exploration tool.
But, roll-up and drill-down requires that in addition to the data table,
you also have access to metadata
hierarchies that can support either roll-up or that can support drill-down.
Next, let's take a look at frequent items, sketches or summaries.
In this course, in the set of slides,
we'll mainly focus on summaries.
We'll revisit frequent items and sketches in the future. So what is a summary?
Remember when we were doing roll-up operation,
we were clustering objects together.
Now, when we were clustering objects together,
we were giving specific information about what we want to roll-up,
or what we want to drill-down.
For example, in the last example, we said, "Well,
why don't we take the Arizona entry,
and then why don't we roll-up on Arizona."
So, we gave very specifications about what we want to roll-up,
how we want to cluster.
In the case of generating summaries however,
we are again given possibly a data table and possibly additional metadata,
but instead of being stated ahead of the time,
how we want to roll-up,
how we want to go high in the hierarchy,
we are given a target number of rows.
So, in this example,
what I'm saying is, "Well you know what,
my original data table had one,
two, three, four, five, six, rows.
That's too much data for me.
Why don't you summarize it down to only two rows, can you do that?"
So, I'm user, I tell the system,
"I'm exploring the data.
I need help. Please summarize the data into a smaller set of rows."
Now, in this case the system has,
I didn't tell the system how to summarize.
So this system can do different things.
So, in this example,
we see one summarization of the table.
There are two tuples,
so what the system had decided to do was,
keep this entry intact,
it kept that entry intact as you can see,
and summarized all of the other remaining entries into one single row.
Now, let's take a look into that.
The name, I cannot,
I'm summarizing the data so the different rows have different names,
so I cannot really tell what the name is.
So, I'm going to replace it with a placeholder value, in this case a dash.
For age, as you can see,
I am going up to summarize,
to put all of this in the same row.
I have to put 12, 19,
22, 22, and 27 into one single row.
So, the only way I can do that is if I go all the way up in the age metadata.
Which means that I have to replace the age entry with a star.
So I'm going all the way to the upper level.
What about location? So let's take a look into that.
For the location, once again,
I have to put Phoenix, San Diego,
Baltimore, Frederick, and another Baltimore entry into one single row.
And the only way I can do it is if I roll-up all the way to the US entry.
So for this summary,
I have replaced the entry with US.
Essentially, in this case this row marked with
a blue arrow correspond to five rows in my original table.
This second row might be the orange arrow
correspond to only one entry in my original table. A summary.
So, the summary in this case is created by the system.
I only told the system, "Well,
create a summary to me with two entries."
And the system created it.
Now, as you will see,
this is one summary,
but there can be other summaries as well.
So for example, there's an alternative summary of the same table.
Again, I have two rows,
but in this example,
each row correspond to three tuples.
Instead of one row corresponding to five tuples in my database,
and the other one corresponding to only one tuple, in this case,
I have three rows in one row,
and three rows in the other summarized row.
So, how does the system achieve this?
Well, essentially what the system had to do is for the row marked in blue,
it had to roll-up to one star,
and roll-up to Southwest.
And for the row represented with the orange arrow,
this system had to roll-up to two star in Asia attribute,
and Maryland in the location attribute,
and that was sufficient.
Now, hope with this you can see these two summarizations are not identical.
And in fact, one summerization may be more desirable than the other summarization.
In this case, if I had to choose,
I would probably choose this summarization as a better summarization.
Because instead of having one row with a lot of data in it,
summarize a lot of data and it and another one that summarize enough,
I get the same amount of summarization in both rows, in the output.
So, how do we summarize the data effectively?
It is the data system's responsibility to
find a good summarization to support effective exploration.
We will not discuss this anymore in this class,
but data clustering, data summarization,
is an important challenge.
And in the machine learning lecture that we are covering,
courses that we are covering,
we'll discuss more about how to cluster the data together, more effectively.
What about aggregate queries and iceberg queries, what are they?
So let's basically take a look into them a little bit more carefully.
Let's go back to our last example,
where we had taken our data with six tuples,
and we had summarized them into a summarized table with only two rows in them.
By the way I'm using tuples and rows interchangeably,
so they are the same thing.
In a relational database,
usually the row are called tuples.
So I'm using them interchangeably.
Okay. So, this is a summarized data table.
What I can do is, I can add to
this summarized data table additional aggregation operations.
So for example, I can take a look at my summarized table,
here I see that age is one star,
for the other row age is two star.
But this doesn't tell me much about what is the age distribution inside?
I don't know for example what is the maximum age inside.
It could be 12, it could be 19.
If summarized, I don't know what's inside.
So what aggregation operations are doing,
if given a summary,
given a cluster, or given a grouping of the data,
they enable me to get more information about the data distribution in that cluster,
data distribution in that grouping,
or data distribution in the summary.
In this example, if I asked the system to tell me about the maximum age,
the system is telling me, "Well,
in this grouping, the maximum age was 19.
In this grouping, the maximum age was 27." Look what happened.
I still have summarized data,
but about the summaries,
I have more information.
So this we call aggregation.
What about iceberg queries?
Iceberg queries are one more step further.
In this case, what I had done was,
I told the system, "Well,
please create me a summary,
and for each group give me the maximum age."
Great. But let's assume in this example that I have only two tuples, right?
So but let's assume that somehow I had after doing this,
I want to reduce the data more.
And what impact to get what I want is that,
I want to get the results where the maximum age is greater than 20.
So this is called an iceberg query.
Why is it called iceberg query?
Well, essentially what I'm doing if you think about it, I have a condition.
The condition clusters from the data,
and I want to basically put a constraint on the condition,
and I want to basically find the results where the maximum age is greater than 20.
So it's like creating an iceberg if you sort of think about it.
So in this example,
my iceberg query returns to me only one row,
and that row is the second row of my summary.
The first row of my summary got eliminated.
Why did it get eliminated?
Because my condition is,
maximum age is greater than 20.
And in this example,
there's only one row,
marked with the orange arrow where the maximum age is greater than 20.
So Iceberg queries essentially combine,
summarization, aggregation, and additional filtering.
In one query and that basically potentially
help me reduce the data so that I can more effectively explore the data.
Great. Now finally for this lecture,
let's look at Skyline queries and let's define what
Skyline queries are and how they are used
in decision support and how they are used in data exploration.
Now in summarization we have seen that what we are
doing is we are reducing the data by telling the system we want,
say, five summarized tuples or two summarized tuples.
Skylines as you will see also reduce data.
So our goal is similar.
We are reducing the data, but instead of telling
the system we want these many results in the output,
we give another condition and I will explain what that condition is in just a second.
So let's take a look at another example.
This example is a little bit different and instead of having a table with hierarchies,
we have a table represented in the form of a vector space again.
So in this vector space,
in this case, we have a table of restaurants,
we have a number of restaurants and for each restaurant I have a record of
the average menu price and I have also a record of the rate.
For each restaurant, I know what is the sort of how much I will spend in
a dinner for a person let's say and what is the average rating as of late.
Obviously, the higher the rating the better the restaurant.
I want to go to a highly rated restaurant.
But again obviously the cheaper the price the better.
I don't want to spend too much money.
If I can get a good food at a cheaper price obviously I want that.
Right. So if that is the case,
if I'm given restaurants with the rating information and average price and if
I'm also given the information that
the high rating the better and the cheaper the food the better,
can you reduce the data
such that it only shows me the restaurants that I should consider.
This is called the Skyline Query.
Essentially what this Skyline Query is trying to
do is it tries to eliminate from the datasets
those restaurants that are both expensive and don't serve good food.
So in this example for instance,
this restaurant over here may not be very attractive.
Why? Well, because it has a low rating and it is also quite expensive.
What about this restaurant here.
It's pretty good, because it has a very high rating and it's also quite cheap.
So what the Skyline query is doing is to have me decide where I should go for dinner.
It helps more effectively explore the data by eliminating restaurants that are
not good in terms of their price and rating relationship.
So, by the way,
this is also known as the Maximum Vector Problem and it was coined by
Skylines in a 2001 paper by Börzsönyi, Kossman and Stocker.
But essentially, what this kind of an operation does is it
identifies a subset of
the objects in the database that dominates the other objects,
but are not dominated by each other.
So for example, let's see,
for this object is not in the Skyline. It should be eliminated.
Why? Well, because this restaurant or here is more
expensive and has less average rating than this restaurant over here.
The same way this restaurant that's called restaurant A is more
expensive and has a poorer rating than this restaurant over here.
What we call this is domination.
So what you are saying is that the restaurant is dominated by
other restaurants which means that the restaurant A should be eliminated,
shouldn't even be considered.
On the other hand, let's focus on say a restaurant B over here.
Restaurant B is not dominated by any other restaurant in our dataset.
There is no other restaurant in our database that has a better rating price relationship.
Yes there are some restaurants such as
restaurants C which has a better rating. That is true.
However restaurant C is more expensive than restaurant B.
So in terms of their price rating relationship they are not compatible.
I might decide to go to restaurant B or restaurant C depending
on whether I want to eat better food or whether I want to eat cheaper food.
What about restaurant D over here.
Well if you take a look at restaurant D you will see that it is
very expensive but it stores very good food.
Once again it is not comparable to Restaurant B or
restaurant C. Yes restaurant B and restaurant C are serving cheaper food,
however if I really want to have
an excellent dinner I might decide to go to restaurant D. So,
this is what we mean by Skylines.
Essentially a Skyline means give me the dataset,
give me a decision criteria,
in this case two decision criteria.
And what I will do is I will find the set
of data elements that are dominated by others and I will eliminate them.
This means that my decision now is only limited to a small number of elements.
I need to explore on a small number of elements to be able to decide.
Very powerful tool and is used very
effectively and very often in decision support systems.
Great. Finally, as I
had mentioned earlier there are data sketches as well.
So what I will do next is I will give just a hint of what we mean by a data sketch.
I will not provide the definition.
I will also not provide a very detailed example,
but that you will get some sense about what the data sketch is.
And this is the example that I'm sure that you are
familiar with and or you have seen it as before.
So let's assume that we have a document collection.
From this document collection we can create what is known as a tag cloud or a term cloud.
Essentially what we are doing is we are looking at the keywords that are
occurring in the documents and I'm just counting them.
These counts gives us essentially some indication as to what are the common keywords,
common terms or common tags in the database.
So we can use this term cloud or tag cloud as a sketch of our document collection.
Great. But how can I use this for exploring my data.
Well I can use this for exploring my data for example,
what I can do is I can link the terms in my term cloud to the documents in the database.
For example, in this case I can link the vacation and I
can allow the user to use this vacation term as a filter condition.
So in this case word user says vacation,
I can eliminate from the collection those documents that do not have that term.
Now if the user is adding another condition,
another filter condition such as Japan,
this enables me further eliminate documents from my collection.
So essentially what I am doing is that I'm using
the sketch as a way to explore the data in
my document collection and to potentially reduce the number
of objects or documents that I need to further explore.
A very simple but a very powerful tool.
We call this data sketch and there are different forms of data sketches.
Some of these we've all learned throughout the data exploration course.
See you in the next lecture.