This course answers the questions "What is data visualization?" and "What is the power of visualization?" It also introduces core concepts such as dataset elements, data warehouses and exploratory querying, and combinations of visual variables for graphic usefulness, as well as the types of statistical graphs, tools that are essential to exploratory data analysis.

From this lesson

STATISTICAL GRAPHICS: DESIGN PRINCIPLES FOR Box Charts and QQ Plots

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency

K. Selcuk Candan

Professor of Computer Science and Engineering, Director of ASU’s Center for Assured and Scalable Data Engineering (CASCADE)

In previous modules, we discussed quantiles, quartiles and

visual representations of those using box plots.

And we discussed how quantiles and

box plots can be used to compare between different data distributions.

Again allowing us to sort of get an idea of skewness, median, and

how samples are organized within different data sets.

However, it may often be visually difficult to directly

compare two box plots.

Especially if we're trying to ask ourselves the question, well,

do these two groups come from the same underlying distribution?

And this is where the concept of Q-Q plots comes in.

And in this lecture, we're going to discuss how to create and

use these box plots and

Q-Q plots to better compare between two different data distributions.

So a Q-Q plot is exactly what it sounds like,

it's just plotting the quantiles of one data set against the quantiles of another.

And so we decide how many points we want in a Q-Q plot based

on the number of quantiles we want to calculate for a data set.

So remember, in our first example we said, okay, we wanted the number

of quantiles to be equal to 5.

And so we calculated the values below which 20%,

40%, 60%, 80%, and 100% of the data fall.
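As a rough sketch of that calculation, the snippet below sorts a sample and picks out the five cut points. The function name `quantile_cut_points`, the sample values, and the nearest-rank rule it uses are all assumptions made for illustration; other quantile conventions (for example, ones that interpolate between samples) would give slightly different numbers.

```python
def quantile_cut_points(data, k):
    """Return k values; the i-th is the smallest sample such that at least
    i/k of the samples are less than or equal to it (nearest-rank rule)."""
    ordered = sorted(data)
    n = len(ordered)
    points = []
    for i in range(1, k + 1):
        rank = (i * n + k - 1) // k  # ceil(i * n / k) using integers only
        points.append(ordered[rank - 1])
    return points


# Hypothetical sample, made up for illustration.
sample = [12, 7, 3, 18, 25, 9, 14, 21, 5, 30]
cuts = quantile_cut_points(sample, 5)
print(cuts)  # cut points for 20%, 40%, 60%, 80%, 100%: [5, 9, 14, 21, 30]
```

With five quantiles, the last cut point is always the largest sample, since 100% of the data is at or below it.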

If I calculate the same quantile set for two different data sets,

I can create a plot where this is Q1 and this is Q1,

and this is data set one, and this is data set two.

And as I plot how the quantiles map,

I may get some sort of picture like this, like we're seeing here.

Ideally, what a quantile-quantile, or Q-Q, plot shows us

is that if the data falls along this 45 degree line, it means the data

came from samples that have a similar statistical distribution.
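One way to picture the 45 degree line idea is to pair up the quantiles of two data sets and look at how far each point sits from the line y = x. This is a minimal sketch: the helper names, the three data sets, and the nearest-rank quantile rule are assumptions for illustration, and the raw gap it reports depends on the units of the data, so it is only a rough indicator.

```python
def quantile_cut_points(data, k):
    """Nearest-rank cut points for i/k, i = 1..k (an assumed convention)."""
    ordered = sorted(data)
    n = len(ordered)
    return [ordered[(i * n + k - 1) // k - 1] for i in range(1, k + 1)]


def qq_pairs(data_one, data_two, k):
    """Pair the i-th quantile of data set one with the i-th of data set two."""
    return list(zip(quantile_cut_points(data_one, k),
                    quantile_cut_points(data_two, k)))


def max_gap_from_diagonal(pairs):
    """Largest |y - x| over the points; 0 means every point is on y = x."""
    return max(abs(y - x) for x, y in pairs)


# Hypothetical data sets, made up for illustration.
data_one = [3, 5, 7, 9, 12, 14, 18, 21, 25, 30]
data_two = [4, 5, 8, 9, 11, 15, 17, 22, 24, 31]  # similar spread
data_three = [1, 1, 2, 2, 3, 3, 4, 40, 60, 90]   # heavily skewed

similar_gap = max_gap_from_diagonal(qq_pairs(data_one, data_two, 5))
skewed_gap = max_gap_from_diagonal(qq_pairs(data_one, data_three, 5))
print(similar_gap)  # 1
print(skewed_gap)   # 60
```

The pairs returned by `qq_pairs` are exactly the x,y points of the Q-Q plot, so they can be handed to any scatter-plot tool.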

And so what people often do is they'll plot the quantiles for

a normal distribution against the quantiles for a different data set,

to see if a particular data set does have an underlying assumption of normality.

So again, trying to think about what are the different properties of our data set?

Is this multimodal?

Is it normal?

Is it skewed?

How can we explore and understand this?

Now this is a lot more powerful for comparing two different

distributions than histograms are.

So for example,

if I'm looking at a histogram that sort of winds up maybe looking like this,

versus a histogram that sort of ends up looking like this, how similar are those?

It's a little bit difficult to tell.

On top of that, this one may have been created from a sample size of 100.

That one may have used a sample size of 1,000.

What we could be comparing is library number one versus library number two,

and we want to see what the distribution of page counts for

books is in library one and library two.

The libraries have different books.

They have different budgets.

But if we're using a Q-Q plot, sample sizes don't need to be equal.

Because again,

we're just calculating the quantiles for the data set from library one,

the data set from library two, and then plotting the quantiles in an x,y plot.
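The point about unequal sample sizes can be sketched with the standard library's `statistics.quantiles`. The page counts below are synthetic, drawn from a random generator purely for illustration (the mean of 300 and spread of 80 are invented, not taken from the lecture), and `statistics.quantiles` uses its own default interpolation rule, which is one of several reasonable conventions.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic page counts, invented for illustration; the sample sizes differ.
library_one = [rng.gauss(300, 80) for _ in range(100)]
library_two = [rng.gauss(300, 80) for _ in range(1000)]

k = 10  # number of quantile groups; yields k - 1 interior cut points
q_one = statistics.quantiles(library_one, n=k)
q_two = statistics.quantiles(library_two, n=k)

# One (x, y) point per quantile, regardless of how many books each library has.
points = list(zip(q_one, q_two))
for x, y in points:
    print(f"{x:7.1f} {y:7.1f}")
```

Both libraries end up with the same number of cut points, which is what lets 100 books be compared against 1,000.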

Another alternative is a probability plot that we're not going to talk about now,

but again, these are different tools to be aware of for your data detective role.

And so to create a Q-Q plot, again, we need two different distributions.

So in this case, we're looking at a normal distribution

versus the quantiles of a particular RBI data set.

And so, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.

So notice we have 10 dots.

So 10 points, which means our number of quantiles was equal to 10.

So this dot, this is the position at which 10% of the data samples of

a normal distribution are going to be less than this value.

Remember, a normal distribution is often centered at 0, so

it has a mean of 0.

And a standard deviation of 1 gives you your common, standard Gaussian distribution.

So for our Gaussian distribution, values less than negative 1.7, about 10%

of the values in the normal distribution are going to be less than negative 1.7.
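The normal quantiles can be checked numerically with `statistics.NormalDist` from the Python standard library. Under a standard normal, the 10% point comes out near negative 1.28, while negative 1.7 is closer to the 5% point. That would be consistent with the plot placing its ten points at the midpoints 5%, 15%, and so on up to 95%, but that plotting convention is an assumption here; the lecture does not say which one was used.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Value below which 10% of a standard normal falls.
q10 = std_normal.inv_cdf(0.10)
# Fraction of a standard normal that falls below -1.7.
frac_below = std_normal.cdf(-1.7)

# Midpoint plotting positions (i - 0.5) / n for n = 10: 5%, 15%, ..., 95%.
# This convention is an assumption, not something stated in the lecture.
n = 10
midpoint_quantiles = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

print(round(q10, 2))                    # about -1.28
print(round(frac_below, 3))             # about 0.045
print(round(midpoint_quantiles[0], 2))  # about -1.64
```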

For our RBI, how do we calculate the quantiles?

First we sort the data.

And so, we see that 10% of our samples are going to be at or below the smallest value,

that's 41.

So that's why we've got this 41 versus this negative 1.7.

So this is Q1, Q2, Q3, and so forth, all the way to Q10.

So all we're doing is plotting the quantiles of data set A,

our RBIs, versus the quantiles of a normal distribution.

And had this mapped to a nice 45-degree straight line,

then we would know that the RBIs were distributed relatively normally.

Here, we can see we wind up with this sort of curve,

sort of S shape, meaning we're not quite normally distributed.

Maybe we have some skewness or other underlying properties of the data.
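A normal Q-Q comparison along these lines might be sketched as follows. The RBI values are hypothetical (only the minimum of 41 comes from the lecture), and the midpoint plotting positions are an assumed convention. One detail worth noting: raw values plotted against standard normal quantiles fall on a straight line with slope equal to the standard deviation and intercept equal to the mean, so the sketch standardizes the sample first to make the 45 degree line y = x the right reference.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical RBI values, made up for illustration (only the minimum, 41,
# comes from the lecture).
rbi = [41, 52, 58, 63, 70, 77, 85, 96, 110, 130]

ordered = sorted(rbi)
n = len(ordered)
m, s = mean(ordered), stdev(ordered)

# Standardize so a normal sample would track the 45-degree line y = x.
z_scores = [(value - m) / s for value in ordered]

# Theoretical quantiles at midpoint positions (an assumed convention).
theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Columns: normal quantile, standardized sample quantile, gap from y = x.
for x, y in zip(theoretical, z_scores):
    print(f"{x:6.2f} {y:6.2f} {y - x:+6.2f}")
```

Gaps that change sign systematically from one end of the table to the other are the numeric counterpart of the curved, S-like shape on the plot.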

But this again helps us begin reasoning about this,

asking questions about, do we expect RBIs to be normally distributed?

Is this what we expect in the data?

Are there problems underlying our data?

Are there transformations we should be doing to analyze this?

And so forth, for understanding the resulting statistics.

And so Q-Q plots are just another tool that we can add to our bag of tricks

to directly compare between two different data distributions.

They're most commonly used to compare a normal distribution

to an unknown data distribution that we have a bunch of samples of.

But we can also just take two different sample data sets and

compare them to each other to explore different properties and

to understand relationships between our data variables.