Covering the tools and techniques of both multivariate and geographical analysis, this course provides hands-on experience visualizing data that represents multiple variables. This course will use statistical techniques and software to develop and analyze geographical knowledge.

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency

Huan Liu

Professor: Computer Science and Engineering School of Computing, Informatics, and Decision Systems Engineering (CASCADE)

In this module, we want to talk about a unique sort

of visualization space called parallel coordinate plots.

We're going to look at different attributes of this multivariate data visualization.

So much like the scatter plots and the other multivariate data visualization lectures,

again we're thinking about a dataset.

For example we could have our players again,

we could have a batting average,

we could have percent on base,

we could have countries,

we could have income,

GDP, population, and so forth.

So we've got data sets with lots of variables,

lots of rows, lots of information,

and we want to think about how to create a plot

that lets people explore trends and relationships between those.

And in previous modules we talked about the scatter plot where we can

compare things like population size and GDP for example,

and each dot might be a particular country.

We might see some sort of trends.

We can extend those by changing size and shape

and color and adding more variables into our data sets.

And in parallel coordinate plots,

we have a similar sort of idea,

where now different variables can take different values with different ranges.

And what's going to happen is if we have a data set for example like country,

GDP, population, let's have some sort of measure like wellness,

maybe it's the average age people live to, life expectancy, and things like that.

Let's think of some other variables.

We will just call them variable one, variable two, and variable three.

So you can have a ton of variables about a country.

We could have size for example,

what's the landmass of the country,

number of provinces, or states or whatever in a country and so forth.

So we have all these different measures.

Well, for each measure, we have an axis.

So I've got population,

and maybe population ranges from zero to one and a half billion.

Now there's only a few countries that have one and a half billion

and there's a bunch of countries that maybe have several millions here,

so we wind up with this skewed distribution.

Now the other weird thing is that population goes from zero to one and a half billion.

But if we do life expectancy,

it sort of maybe goes from zero to 100.

So how do I match 100 to one and a half billion?

These axes have such drastically different values,

so oftentimes we might try to normalize these from zero to one.

What I mean by that is I might find the maximum and I might

divide everything by the max in this column to try and normalize the data.
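As a minimal sketch of that divide-by-the-max idea, assuming non-negative values, the snippet below scales each column so its largest value becomes one. The function name `normalize_by_max` and the numbers are made up for illustration, not real statistics.

```python
def normalize_by_max(column):
    """Divide every value by the column maximum so the largest value becomes 1.

    Assumes non-negative values; an all-zero column maps to zeros.
    """
    largest = max(column)
    if largest == 0:
        return [0.0 for _ in column]
    return [value / largest for value in column]


# Made-up illustrative values, not real country statistics.
population = [1_500_000_000, 300_000_000, 5_000_000]
life_expectancy = [75, 80, 60]

population_scaled = normalize_by_max(population)
life_expectancy_scaled = normalize_by_max(life_expectancy)
```

After this step both columns live on the same zero-to-one range, so they can share the same axis height. Note that a skewed column such as population stays skewed; dividing by the max only changes the scale.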

And now what you're starting to see when I draw my graph like this,

is for each variable, I can create a line.

So for GDP, I can create a line.

For population, I can create a line.

For life expectancy, I can create a line.

For variable one, I create a line.

And the more variables we get,

the more of these axes we're going to have.

So these are my parallel coordinate axes.

Now, I want to plot countries for example.

So for every unique country,

the unique country has a GDP.

And remember I can just do the 1D case,

so every dot on this line is already a country's GDP,

and every dot on this line is a country's population.

Parallel coordinate plots connect the same countries together across each of these lines.

So this line now represents a single country,

and where it crosses is the values in our data table.

So we can figure out what that country's GDP is,

we can figure out what its population is and so forth tracking those.

And a parallel coordinate plot has a line for every country.
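One way to picture this in code: each record becomes a list of points, one per axis, and a drawing library would then connect them. This is only a sketch; the helper `polyline_for_record`, the country labels, and the values are hypothetical, and it reuses the divide-by-max normalization.

```python
def polyline_for_record(record, axis_order, maxima):
    """Return (x, y) vertices: x is the axis position, y the max-normalized value."""
    points = []
    for position, name in enumerate(axis_order):
        y = record[name] / maxima[name] if maxima[name] else 0.0
        points.append((position, y))
    return points


# Hypothetical countries with made-up values.
countries = {
    "A": {"gdp": 200.0, "population": 50.0, "life_expectancy": 80.0},
    "B": {"gdp": 100.0, "population": 100.0, "life_expectancy": 60.0},
}
axis_order = ["gdp", "population", "life_expectancy"]

# Column maxima, used to put every axis on a zero-to-one scale.
maxima = {name: max(c[name] for c in countries.values()) for name in axis_order}

# One polyline per country: where it crosses each axis is that country's value.
polylines = {
    label: polyline_for_record(record, axis_order, maxima)
    for label, record in countries.items()
}
```

Changing `axis_order` changes only the x positions, which is why the same data can look very different under a different axis arrangement.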

And again, what we're seeing is

these pairwise combinations on a parallel coordinate plot.

And so we're just putting all of our variables on different axes and

then we can connect based on our categorical variable of interest or things like that.

So if this is eight quizzes in class,

each line could be a student representing that student's record.

Now the thing is with parallel coordinate plots,

I didn't have to put V8 next to V7.

I could have moved V8 over to V4 and V4 back to V8.

What happens in a parallel coordinate plot is even though I can now

see all of the data records in this single view,

I can only see sort of pairwise correlations.

So here I can see most of the data from V1 to V2 has a downward trend.

Meaning that in general, V1,

if I do have a high value in V1,

I generally have a lower value in V2.

If I look at V4 to V5,

if I have a lower value in V4,

I have a higher value in V5.

So these trends let me still look at pairwise comparison and it

can even show V3 to V4 and V4 to V5.

So I can sort of look at two correlations at once,

I can reorder things,

but the order of these axes is going to greatly

influence what this visualization looks like.
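To make the effect of ordering concrete, here is a small sketch that computes the Pearson correlation only for axes that sit next to each other, since those are the pairs a reader can actually see. The function names and the columns V1 to V3 are illustrative assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length columns (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def adjacent_correlations(columns, axis_order):
    """Correlation for each pair of neighboring axes in the given order."""
    return [
        (left, right, pearson(columns[left], columns[right]))
        for left, right in zip(axis_order, axis_order[1:])
    ]


# Made-up columns: V2 falls as V1 rises, V3 rises with V1.
columns = {"V1": [1, 2, 3, 4], "V2": [4, 3, 2, 1], "V3": [2, 4, 6, 8]}

original = adjacent_correlations(columns, ["V1", "V2", "V3"])
reordered = adjacent_correlations(columns, ["V2", "V1", "V3"])
```

In the original order both visible pairs are negatively correlated, so every gap shows crossing lines. After swapping V1 and V2, the V1 to V3 relationship becomes visible as a set of non-crossing lines.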

Likewise, what's really important actually is the angle of the lines between the axes,

because these angles represent this level of correlation,

and we keep coming back and talking about correlation.

The reason why correlation is important is because it

allows me to create some sort of mathematical formula,

like X is equal to MY.

And if I can make some formula like this,

if I know Y,

I can predict X.
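As a rough sketch of that X equals M times Y idea, the snippet below estimates M by least squares for a line through the origin and then predicts X for a new Y. The data are made up, and a real analysis would usually also fit an intercept and check how well the line fits.

```python
def fit_slope_through_origin(ys, xs):
    """Least-squares estimate of m in x = m * y (no intercept term)."""
    denominator = sum(y * y for y in ys)
    if denominator == 0:
        raise ValueError("cannot fit a slope when every y is zero")
    return sum(x * y for x, y in zip(xs, ys)) / denominator


def predict_x(m, y):
    """Predict x from a known y using the fitted slope."""
    return m * y


# Made-up observations where x is exactly twice y.
y_known = [1.0, 2.0, 3.0]
x_known = [2.0, 4.0, 6.0]

m = fit_slope_through_origin(y_known, x_known)
x_guess = predict_x(m, 5.0)  # a y value we have not seen before
```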

And if I can predict X, if I know something about the future,

if I know something about other things I haven't seen,

so for example if I'm trying to hire

a new baseball player and I want to guess how he might do and I don't have

any measurements on their batting average

but I have lots of measurements on their on base percent,

I can guess their batting average from this information.

For stock markets, if I can make some sort of

equation like this where I know how much stock Y has made,

I can guess how much stock X will make

and I might be able to predict things in the future,

I might be able to classify unknown information.

So that's why looking and exploring data is really important,

and parallel coordinate plots let us

see different correlations between subsets of the data.

So we can even start thinking about is there a subset of the data,

like this subset here that is correlated and why is that

subset correlated where other subsets are not.

So we can start reasoning and exploring different chunks and we can see

which correlations along two axes are of interest.

And again here we're looking at a car data set.

We have the year the car was made,

its horsepower, its acceleration,

the number of cylinders.

So we can start exploring and extracting information.

And what's interesting is we can even think about looking at how these elements cluster,

and we can see the visual clustering of the data in the parallel coordinate plot,

and so we can apply color and opacity based on line density.

So the more elements across a particular chunk,

the more dense they can get.

We can compute local density for each line and average the density values,

and we can apply color and opacity based on user specifications.
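Here is one possible way to sketch density-based opacity, under some assumptions of my own: values are already normalized to zero to one, local density on an axis is the number of lines in the same bin, and opacity scales linearly between a user-chosen low and high. The function names and bin count are hypothetical choices.

```python
def line_densities(rows, bins=10):
    """Average, over all axes, of how many lines share each line's bin.

    rows: one list per line, holding a value in [0, 1] for every axis.
    """
    n_axes = len(rows[0])

    def bin_index(value):
        # Clamp so that a value of exactly 1.0 lands in the last bin.
        return min(int(value * bins), bins - 1)

    # Count the lines falling into each bin on each axis.
    counts = [dict() for _ in range(n_axes)]
    for row in rows:
        for axis, value in enumerate(row):
            b = bin_index(value)
            counts[axis][b] = counts[axis].get(b, 0) + 1

    # A line's density is the mean count of the bins it passes through.
    densities = []
    for row in rows:
        per_axis = [counts[axis][bin_index(v)] for axis, v in enumerate(row)]
        densities.append(sum(per_axis) / n_axes)
    return densities


def opacities(densities, low=0.1, high=1.0):
    """Map densities linearly to opacity; the densest line gets `high`."""
    peak = max(densities)
    return [low + (high - low) * d / peak for d in densities]


# Made-up normalized data: three lines travel together, one is an outlier.
rows = [[0.12, 0.88], [0.14, 0.86], [0.16, 0.84], [0.95, 0.05]]
densities = line_densities(rows)
alpha = opacities(densities)
```

With these settings the three bundled lines come out fully opaque and the outlier is faded. Flipping `low` and `high` would instead highlight the outliers, which is the kind of user specification mentioned above.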

So we can start filtering things out, looking for trends.

We don't even have to draw straight lines.

As you see here, we can try to curve things

to show different patterns and get different visual aesthetics,

because one of the big challenges with parallel coordinate plots is,

if I have a really large set of

lines and a really large set of countries or baseball players,

I can wind up with plots like this where it's just really hard to see anything,

I have too many lines that overlap.

So we started thinking about how we might be able to bundle these together.

How can we use color and opacity to help show trends?

How can we allow the user to select different things that are important?

They may say well, I'm only really interested in countries with

a low GDP and a high life expectancy,

because those things are sort of interesting.

I wonder why that might occur.

And so again we can allow user interaction to filter to give information on the tooltip,

to even reorganize axes in the parallel coordinate plot.

And just like we talked about scatter plots,

we can look at screen-space-based metrics to calculate insight into these plots.

In scatter plots we talked about things like skewness, clumpiness,

and striations. With scatter plots,

once we draw the plot,

we're trying to measure what this geometry looks like,

and the way the geometry looks,

may give us insight into whether this plot is interesting for humans to look at.

By that same token,

with parallel coordinate plots,

once we draw our geometry,

we can start doing some sort of Screen Space Metrics

to try to understand what this might look like.

Likewise, we can also think about how we

could create lower dimensional projections of the data.

So, taking all of these N variables,

and reducing them to just the two or three most important variables,

so that we only have to plot the ones that would give the maximum insight into the data.

This would help us optimize the parameter space

for things like pixel based orientations and visualizations.

And we can have metrics then based on particular views of the parallel coordinate plots.

Problem is this also depends on the size of the display,

and the space between the axes is where interesting patterns occur.

But the more variables I have,

the more axes I have to draw.

And sometimes then if I have a lot of axes I have to draw,

I may not have a lot of space.

Likewise, if this gets really long

but I have a lot of screen space,

I might have screen space up here that I don't even wind up using.

I can take this and I can rotate it,

so I don't have to draw my parallel axes vertically.

I can do my parallel coordinate plots this direction,

and I can draw my lines as well.

But now I'm losing out on this screen space over here.

So again, thinking about trade offs between

these different visualizations of scatter plots and parallel coordinate plots.

And with parallel coordinate plots, basically again,

each connection between a pair of axes

gives us some information about the data.

Well, we can use a variety of metrics to try and optimize the use of screen space.

For example, we can look at histogram distance,

like recording the slope of the lines between the axes.

We can use paired histograms or histogram

of all the lines covering both axes to try to determine,

you know, should I put these two axes further apart,

and these two axes closer together?

What sort of information is there?

Can I delete this axis altogether because it's not interesting?
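One possible reading of recording the slopes between two axes is a simple slope histogram, sketched below. It assumes normalized values and unit spacing between the axes, so each slope is just the right value minus the left value and falls between minus one and one. The function name and bin count are illustrative choices, not a published metric definition.

```python
def slope_histogram(left, right, bins=4):
    """Histogram of line slopes between two adjacent axes.

    left, right: normalized values in [0, 1] for the same records.
    Slopes lie in [-1, 1] when the axes are one unit apart.
    """
    histogram = [0] * bins
    for a, b in zip(left, right):
        slope = b - a
        # Map [-1, 1] onto bin indices 0..bins-1, clamping the top edge.
        index = min(int((slope + 1.0) / 2.0 * bins), bins - 1)
        histogram[index] += 1
    return histogram


# Made-up values: one steep rise, one nearly flat line, one steep fall.
left_axis = [0.0, 0.2, 1.0]
right_axis = [1.0, 0.3, 0.0]
hist = slope_histogram(left_axis, right_axis)
```

A histogram concentrated in one bin suggests mostly parallel lines and a clear relationship, while a flat histogram suggests the pair of axes carries little visible structure.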

So, we can start using these different metrics

and information to again sort of analyze the data first,

present what's interesting, let the user filter, analyze again,

and try to help them explore,

form hypotheses and understand what's going on within the data set.

And again there's a variety of metrics to optimize the use of

screen space like line crossing.

So we can interpret each line between a pair of axes as a directed interval,

and sort of count the number of times that the lines cross.

And again, think about it this way.

If we have no line crossings between two axes,

this means that every time a value goes from low to high,

or could have been from high to low,

we easily can sort of sense this pattern.
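Counting crossings between two axes can be sketched with a simple pairwise check: two lines cross exactly when their order on the left axis is the reverse of their order on the right axis. The function name and data are illustrative, and this quadratic loop is only meant for small examples.

```python
def count_crossings(left, right):
    """Number of line pairs that cross between two adjacent axes.

    left[i] and right[i] are the values of record i on the two axes.
    Ties on either axis are not counted as crossings.
    """
    crossings = 0
    n = len(left)
    for i in range(n):
        for j in range(i + 1, n):
            # Opposite ordering on the two axes means the lines intersect.
            if (left[i] - left[j]) * (right[i] - right[j]) < 0:
                crossings += 1
    return crossings


# Made-up values for three records.
no_crossings = count_crossings([1, 2, 3], [10, 20, 30])   # same order on both axes
all_crossings = count_crossings([1, 2, 3], [30, 20, 10])  # fully reversed order
```

Zero crossings goes with a consistent low-to-low, high-to-high pattern, and the maximum number of crossings goes with a fully inverted one.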

We can also use the angles of crossing, measuring the angles between crossing lines.

So if I have two axes and I'm getting all of these crossed lines,

I can measure the angle between each crossing pair,

and use that as some sort of metric as well.
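A small sketch of the crossing angle for one pair of lines follows. It assumes each line is given as its left and right value in the same normalized units and that the axes are `axis_gap` units apart; on a real screen the angle also depends on the aspect ratio, so treat the spacing as an assumption.

```python
import math


def crossing_angle_degrees(line_a, line_b, axis_gap=1.0):
    """Angle in degrees between two lines drawn between adjacent axes.

    Each line is a (left_value, right_value) pair.
    """
    slope_a = (line_a[1] - line_a[0]) / axis_gap
    slope_b = (line_b[1] - line_b[0]) / axis_gap
    return abs(math.degrees(math.atan(slope_a) - math.atan(slope_b)))


# One line rises from 0 to 1, the other falls from 1 to 0.
angle = crossing_angle_degrees((0.0, 1.0), (1.0, 0.0))
```

Averaging this value over all crossing pairs would give one candidate screen-space metric for a pair of axes.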

And we don't have to do this only for quantitative data,

we can actually do this for sets of data as well.

And so, Robert Kosara introduced this idea of parallel sets,

where we can adopt

this parallel coordinate layout by using a frequency based interpretation.

So for example, there's a nice data set about the Titanic.

So, how many male passengers were in first class and survived the Titanic,

versus second class, or third class, or so on.

So our data set winds up looking something like this.

So, we have first class,

second class, third class.

We have male, we have female,

we may also have some sort of a secondary role like survived or not survived.

And so, I might know how many survivors were in first class,

maybe this was 30,

and maybe not survived was one,

and maybe this was 25,

and not survived was zero.

But then by the time we get to third class,

it may have been 300,

and 279 did not survive,

and something like 300 and 100 didn't survive.

So again we have this sort of data set,

and we have this sort of frequency based representation.

So, the width of our line here represents the number of males that were in first class.

This bar sort of shows us how many males were on the boat, and how many females.

We split this axis into its overall count.

So I can count the number of men and women on the boat by

adding up all of these different boxes.
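The frequency-based idea can be sketched with plain counting: the length of each category box is a count over one column, and the width of each ribbon is a count over a pair of columns. The records below are a tiny made-up sample, not the actual Titanic data, and the function names are hypothetical.

```python
from collections import Counter

# Tiny made-up sample of (class, sex, outcome) records; NOT the real Titanic data.
records = [
    ("first", "female", "survived"),
    ("first", "male", "survived"),
    ("first", "male", "died"),
    ("third", "male", "died"),
    ("third", "male", "died"),
    ("third", "female", "survived"),
]


def category_totals(rows, dimension):
    """Length of each category box on one axis: how many rows fall in it."""
    return Counter(row[dimension] for row in rows)


def ribbon_counts(rows, dim_a, dim_b):
    """Width of each ribbon between two axes: rows sharing both categories."""
    return Counter((row[dim_a], row[dim_b]) for row in rows)


sex_totals = category_totals(records, 1)
class_to_sex = ribbon_counts(records, 0, 1)
```

The ribbon widths leaving a category always add up to that category's box length, which is what lets a reader recover the totals by adding up the boxes.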

And so the length of this part of the axis is the number of males,

length here's the number of females,

the length here is the number of people that didn't survive,

the length here is the number of people that did

survive and the length here is the number of people in first class,

second and third class, and then crew.

So I can quickly get an overview of

first class had far fewer passengers than second and third added together,

and the crew had a similar size to the second and third class.

I can look and see how many people in first class were male,

how many people in first class were female, and how many survived.

And we can see actually all the females in first class,

none of them had a green line to not survive.

So, all the women in first class survived,

only a small chunk of men in first class didn't survive.

But when we get to second class,

we see a large chunk of men who were in second and third class,

and the survival rate was really low.

And even for females,

even though we had a pretty decent survival rate,

we have a much higher death rate than we had in first class.

And this parallel set lets us sort of compare

and reason with these data sets to see how these sets of

data overlap and flow through their different variables.

And there's been work on how we might update parallel sets with other variables,

so looking at distributions along axes.

So, one of the parallel axes could be for a family data set.

So, market, family type income,

and looking at, like, unemployment distributions for

different classes of people on a different axis.

So, there's all sorts of ways we can try to think

about how we can enhance and interpret this data.

And we don't have to lay this out in a horizontal or vertical pattern,

we can even think about laying out these axes in a circular pattern,

to create what we call star plots.

So again now each axis is still next to a neighbor.

So this is our baseball example,

so we might have data about At bats,

Runs, RBI, Batting Average.

And for each player,

I can again draw my line.

And I could put all the players on one plot if I want.

But I quickly get sort of too much noise,

but I could also make a glyph now,

a single plot for each baseball player to let me compare.
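A star plot glyph can be sketched as a polygon: spread the axes evenly around a circle and push each vertex out by the normalized value on that axis. The function name and the player values below are made up for illustration.

```python
import math


def star_vertices(values):
    """Polygon vertices for one star glyph.

    values: one normalized value in [0, 1] per axis, in axis order.
    Axis k points at angle 2*pi*k/n; the value is the distance from the center.
    """
    n = len(values)
    vertices = []
    for k, radius in enumerate(values):
        angle = 2.0 * math.pi * k / n
        vertices.append((radius * math.cos(angle), radius * math.sin(angle)))
    return vertices


# Hypothetical player, normalized: at bats, runs, RBI, batting average.
player = [1.0, 0.5, 1.0, 0.5]
glyph = star_vertices(player)
```

Connecting the vertices in order and closing the polygon gives the shape for one player; repeating this per player gives one small glyph each.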

So for example we could have drawn Gonzales, or Votto,

or Infante, and see different sizes and shapes.

And we can see Gonzales was very well rounded in all of these,

while Infante may have been good at Batting Average but he had a low RBI for example.

And these glyphs allow us to compare between different players.

We could have done this for countries,

we could do this for quiz scores in class,

all sorts of different things.

But we don't have to just stick with star plots,

and with these single views.

People have also tried this for other sorts of things.

Thinking well, humans are really good at facial recognition.

So what if I encode data into a face.

Or what if I try to draw a plant

with petals and stem rings and those sorts of things equal to the data.

And so Chernoff Faces are an example,

where we have a whole lot of different ways we can encode the data.

So what if we have a car data set where we have variables on miles per gallon,

the weight of the car, the year each car was made, horsepower, and cylinders?

And so we can start thinking about drawing a face that represents this car data set.

So for example, we could say okay,

the further apart the eyes,

the more miles per gallon we have.

The more rounded the head is,

the more the car weighs.

The eye size could correspond to the year.

The mouth will correspond to the horsepower, and so forth.

And so we can map our different variables to

all of these different possible combinations for Chernoff Faces,

and then for each car we get a different face.
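The mapping step can be sketched as a lookup from data variables to facial features plus a linear rescale. Everything named here is an assumption for illustration: the data ranges, the feature ranges in arbitrary drawing units, and the choice of mouth width as the mouth property. Actually drawing the face is left out.

```python
# Assumed value ranges for each variable (made up for this sketch).
DATA_RANGES = {
    "mpg": (10.0, 50.0),
    "weight": (1500.0, 5500.0),
    "year": (70.0, 82.0),
    "horsepower": (50.0, 250.0),
}

# Assumed ranges for each facial feature, in arbitrary drawing units.
FACE_RANGES = {
    "eye_spacing": (0.2, 0.6),
    "head_roundness": (0.5, 1.0),
    "eye_size": (0.05, 0.15),
    "mouth_width": (0.2, 0.8),
}

# Which variable drives which feature, following the example in the lecture.
ENCODING = {
    "mpg": "eye_spacing",
    "weight": "head_roundness",
    "year": "eye_size",
    "horsepower": "mouth_width",
}


def face_for(car):
    """Turn one data row into facial feature parameters by linear rescaling."""
    face = {}
    for variable, feature in ENCODING.items():
        low, high = DATA_RANGES[variable]
        out_low, out_high = FACE_RANGES[feature]
        t = (car[variable] - low) / (high - low)
        t = min(max(t, 0.0), 1.0)  # clamp values outside the assumed range
        face[feature] = out_low + t * (out_high - out_low)
    return face


# Hypothetical car, not taken from any real data set.
car = {"mpg": 30.0, "weight": 1500.0, "year": 82.0, "horsepower": 150.0}
face = face_for(car)
```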

Then we can try to organize cars based on face similarity,

to have different groups and clusters,

and we can do the same thing with star plots as well.

And so we can get this again,

concept of small multiples,

so each row in our data set has its own face.

So we can start comparing these, clustering and grouping them together.

And so hopefully you can see how scatter plots and parallel coordinate plots relate,

how we can go and transfer these to other shapes such as star plots,

and how we can even really think about different sorts of elements

like a face for Chernoff Faces for encoding data to all sorts of things.

And again we can think about this for a flower, right?

So, we could have the number of petals could represent a particular variable,

the distance, the angle between petals could represent things,

the number of dots on the petal face,

and all sorts of different visual elements

have been explored to try to help people look at,

quickly overview, understand their data,

and there's pros and cons to all of these.

Some of the things like Chernoff Faces,

there's papers trying to compare how well people interpret these.

People have problems with interpreting some of these data.

It's hard to get an exact value if I ask you exactly how

far apart are the eyes spaced. It's hard to tell.

But you can probably tell us which face,

if I'd give you two different faces,

which one has them further apart than the other,

just not how much further.

And so when I think about these trade offs,

when I think about how we can compare correlations and data sets,

how we can extract information,

and what are other sorts of visualizations we might be able to use to

explore and draw this multivariate data. Thank you.