Covering the tools and techniques of both multivariate and geographical analysis, this course provides hands-on experience visualizing data that represents multiple variables. This course will use statistical techniques and software to develop and analyze geographical knowledge.

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency

Huan Liu

Professor: Computer Science and Engineering School of Computing, Informatics, and Decision Systems Engineering (CASCADE)

So in this module,

what we want to talk about is how to apply methods of visualizing

discrete data values using area plots that we call mosaic plots.

Mosaic plots are different from scatter plots,

parallel coordinate plots, and Chernoff faces,

in that we're really trying to take a piece of screen real estate

and break it up into chunks to allow us to

examine the relationship among categorical variables.

So we talked about some categorical variables in parallel sets,

and we're going to go back to that example of the Titanic,

where essentially we want to think about how we

can demonstrate different categories of data.

And so, in the Titanic data set, remember, we had gender.

We had male and female.

We had categories of first class,

second class, third class, and crew.

We also had survivor versus non-survivor.

And so we can create a whole set of data

like this where we know how many females were in first class,

how many survived, and how many didn't.

And so by adding the survivors versus the non-survivors,

I know how many females were in first class, and so forth.
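The counting described here can be sketched in a few lines of Python. This is only an illustration: the records below are made-up numbers, not the real Titanic counts, and the helper name `total` is hypothetical.

```python
from collections import Counter

# Made-up records (NOT the real Titanic counts): (gender, class, survived)
records = (
    [("female", "first", True)] * 9
    + [("female", "first", False)] * 1
    + [("male", "first", True)] * 3
    + [("male", "first", False)] * 7
)

# One count per combination of gender, class, and survival
counts = Counter(records)


def total(gender, klass):
    """Add survivors and non-survivors to get everyone in that gender/class."""
    return counts[(gender, klass, True)] + counts[(gender, klass, False)]


females_in_first = total("female", "first")
males_in_first = total("male", "first")
```

Adding the survived and did-not-survive cells gives the size of the group, which is the number the mosaic plot will later turn into a length.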

And what a Mosaic plot is going to do is it's going

to take a square piece of real estate,

and we're going to split this square up.

We are going to divide it into either horizontal bars or vertical bars first,

then we're going to alternate back and forth between these horizontal and vertical splits.

And we're going to wind up with something like this.

So, here's the actual Titanic data set.

So, we can count how many people were in first class.

So right away, we can count.

We have 197 people in first class.

So let's talk about how we interpret this Mosaic plot.

So this length is,

in theory, the same as this length.

And what we're doing is splitting this up along sort of different chunks.

First, we want to count how many people are in our entire data set.

So to count how many people were on the Titanic,

we add up all of these numbers.

We also have to add all of the children as well.

Let's erase some of this so we can see this a little bit more.

So that tells me how many people were on the boat.

So let's say I add all these up,

and let's just say it works out.

We had a thousand people on the Titanic.

It doesn't, and if you add this up,

you'll get a different number.

But let's just pretend for now we've added every number up here.

So this is the total population of the Titanic.

So what a Mosaic plot is going to do is we want to know

what percentage of this population was female,

and what percentage was male.

And so let's say that of the total population on the Titanic,

700 were male, and 300 were female.

So what we do on this gender axis,

and we get to pick how we want to split this.

So if I say my first split is going to be on gender,

I take this axis,

and I split it up into zero to one.

So, some percentage of this axis represents female.

Since we had 300 females out of a thousand,

this is 0.3.

And then the rest of the area,

the rest of the length of this line,

is my other category male.
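As a rough sketch of that first split, the code below turns category counts into segments of a zero-to-one axis. It uses the made-up numbers from this lecture (300 female, 700 male); the function name `split_axis` is just an illustrative choice.

```python
def split_axis(counts):
    """Map each category to a (start, end) segment of a 0-to-1 axis,
    with segment length proportional to the category's count."""
    total = sum(counts.values())
    segments = {}
    start = 0.0
    for category, n in counts.items():
        end = start + n / total
        segments[category] = (start, end)
        start = end
    return segments


# Made-up lecture numbers, not the real Titanic totals
gender_counts = {"female": 300, "male": 700}
segments = split_axis(gender_counts)
```

The same function works unchanged with more than two categories, for example regions of the world instead of gender.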

And I could have had multiple categories here.

Instead of male and female,

we could have done this with a different data set for countries.

We could have had like North American countries, South American, Asian,

whatever, and we would split this up into more breaks.

And now, this box in orangish-yellow is the percent of females.

This box is the percent of males.

And I split this with my vertical line.

My next split is on how many people survived.

And so what I can do is I can count for females,

how many survivors we had.

So I add up my survivors.

I add up my non-survivors.

So let's say, for females,

we had 300 females,

and let's say we had 200 survive,

and 100 not survive.

So, that means, again, I go from zero to one on my axis,

and I've got a total of 300.

So 100 out of 300 were not survivors according to our made up numbers here.
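Here is a small sketch of that second split, again with the made-up numbers (300 females, 200 survived, 100 did not). The box layout `(x0, y0, x1, y1)` and the choice to put the not-survived box at the bottom are assumptions made for illustration.

```python
# Width of the female column from the first split (300 out of 1000)
female_share = 300 / 1000

# Made-up lecture numbers for the second split, within females only
female = {"not_survived": 100, "survived": 200}

# Where the horizontal cut falls on the column's own 0-to-1 axis
cut = female["not_survived"] / sum(female.values())

# Boxes as (x0, y0, x1, y1) inside the unit square
boxes = {
    "not_survived": (0.0, 0.0, female_share, cut),
    "survived": (0.0, cut, female_share, 1.0),
}


def area(box):
    x0, y0, x1, y1 = box
    return (x1 - x0) * (y1 - y0)


survived_area = area(boxes["survived"])
```

Under these assumptions, the area of the survived box works out to 200 out of 1000, so each box's area is that group's share of the whole population, while its height is the share within females.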

So, that's where this line comes in.

This goes from zero to one for my split.

So I had more people survive,

and that's what this category shows.

One means survived, zero means did not survive.

And for male, I do the same thing except now this is the zero category.

A larger chunk of males didn't survive than survived.

So I go back to my data set,

I add up how many males survived,

and how many males didn't survive.

I did vertical.

Then I did horizontal. I can also now add other chunks like class.

So, for example, I can split this by what class the passengers were in.

So this is how many people were in third class,

second class, first class.

Then I can look at survivors versus non-survivors.

And we can see that the proportion of people that survived gets smaller as we go down in class.

So, first class had the most survivors.

But I can also then split this again by gender.

So now, I've got first class,

second class, third class.

I've got male, female,

and survive, and not survive.

So, for female, my zero and one are here.

So, the blue is not survive,

the yellow is survive.

So, in first class, almost all the females survived.

In second class, still the majority survived.

And in third class, still some survived, but fewer.

With male, I've got zero to one.

So I can see in first class,

many did not survive.

In second class, again,

many did not survive.

And in third class, the majority did not survive.

And so, I can again do these splits by just rotating horizontal-vertical,

horizontal-vertical, counting these up,

and putting these different percentages in my boxes.
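The alternating splits can be sketched as a short recursive function. This is a minimal illustration under stated assumptions, not how any particular plotting library does it: the names `leaf_total` and `mosaic` are hypothetical, the female counts are the made-up lecture numbers, and the male counts (150 and 550) are invented here only so the totals add up to 700.

```python
def leaf_total(node):
    """Sum every count stored under a nested dict (or return a bare count)."""
    if isinstance(node, dict):
        return sum(leaf_total(child) for child in node.values())
    return node


def mosaic(node, rect=(0.0, 0.0, 1.0, 1.0), vertical=True):
    """Return a list of (path, (x0, y0, x1, y1)) tiles.

    vertical=True cuts the rectangle with vertical lines (splitting x);
    each deeper level flips the direction.
    """
    x0, y0, x1, y1 = rect
    whole = leaf_total(node)
    span = (x1 - x0) if vertical else (y1 - y0)
    pos = x0 if vertical else y0
    tiles = []
    for label, child in node.items():
        # Share of this category within its parent box (0 if the box is empty)
        share = leaf_total(child) / whole if whole else 0.0
        size = span * share
        if vertical:
            sub = (pos, y0, pos + size, y1)
        else:
            sub = (x0, pos, x1, pos + size)
        pos += size
        if isinstance(child, dict):
            for path, tile in mosaic(child, sub, not vertical):
                tiles.append(((label,) + path, tile))
        else:
            tiles.append(((label,), sub))
    return tiles


# Female counts are the lecture's made-up numbers; male counts are assumed
counts = {
    "female": {"not_survived": 100, "survived": 200},
    "male": {"not_survived": 550, "survived": 150},
}
tiles = dict(mosaic(counts))
```

Adding a third level, such as class, only means nesting the dictionaries one level deeper; the function keeps flipping between vertical and horizontal cuts on its own.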

Now, the question you should be asking is, well,

remember our lecture about which visual variables are most salient to people,

and Bill Cleveland would have said, "Well,

you're comparing area between these different boxes."

Area is not a very good visual variable.

So, do people really use these mosaic plots?

Well, it's tempting to dismiss these mosaic plots because they

represent counts of categories as rectangular areas.

That seems to provide a distorted perceptual encoding.

But the important thing to realize is that we're really encoding the length.

So, remember I'm splitting across the length here,

and I'm always splitting along the length.

So at each stage, the comparison of interest is the length of

the sides of the different boxes, not really the area.

The problem, as you can see,

is as we have more and more splits,

this gets harder and harder to read,

harder and harder to label,

and sometimes we can get some long and skinny boxes.

And you can imagine that sometimes it may be that a count is so small,

I can't even see it. Or what would happen if a count was zero?

How do I represent that in this type of mosaic plot?

So, while this is a really interesting plot to look at,

we're actually going to see there's a similar variation of this.

There is a Treemap as well.

Later on, we'll talk about hierarchical data.

So this is yet another tool we can put in our belt for looking