After you've collected all your data, what do you do with it? The main technique

you'll learn in this video is how to compare rates. Rate comparison is my pick

as a first statistical test to learn in HCI because it's easy to do and relevant

to many real world activities like comparing click through rates on websites.

Before we get to the details, let's begin at a higher level. Here are three really

important questions that you can ask and answer by analyzing your data. For

starters, what does my data look like? To do this, explore your data graphically.

Plot all of your data. And then, once you see some patterns that may be interesting,

try to look at aggregate summaries of different sorts. Second, after you've

looked at your data graphically, what are the overall numbers? These aggregate

statistics help give you a quick summary of what you've seen in your experiment.

Simple things to look at initially are the mean, or average, and the standard

deviation, that is, how much variation there was within each condition. And third, are the

differences that you see, say in the means, real? Separating real differences

from mirage differences is the goal of statistical testing, and you'll learn a

technique for doing that in this lecture. Say, for example, I have a coin, and we'd

like to know whether it's loaded, that is, whether it's not equally likely to produce heads and

tails. So I have a coin right here and I can toss it a whole bunch of times. Tails,

heads, and I can keep going. Let's say I toss it twenty times and thirteen of those

twenty tosses turn up heads. What would we expect to get? Well if we have an even

coin that is equally likely to produce heads and tails, that would say that our

expected value for twenty tosses would be ten heads. Is three extra heads out of

twenty a significant difference? That is, is it weird enough that we would be pretty

likely to say that our coin is biased and not an even coin. Well to figure this out,

we are going to use a test statistic. And what attributes does our test statistic

need? Well, one thing that we'll want to encode is the difference from the expected

value, the fact that we got three extra heads here. And we want to know how many that's

out of, in a couple of ways. First, in a ratio sense: three extra heads is less

material if it's out of a thousand. And second, in a number-of-trials sense: if we had

a thousand trials and we saw a 30 percent rise in the number of heads, we would say

that's more unusual than a 30 percent rise over just a couple of trials,

because that's only a few. And so in this lecture, we are going to use the

Pearson's Chi-Squared Test. This is a fancy name for the totally standard test for

comparing an observed rate to an expected rate. And

it's going to use exactly the parameters that we decided were necessary. So it's

going to compare our observed to our expected values, and it's going to do that in a

way that takes the number of trials into account: if we have a sizable difference, more trials will give

us increased confidence that the difference is robust and significant. And

it's called the Chi-Squared Test because the value that we're going to get out of

this test is called the chi-squared value. To compute it, we're going to take the

difference between the observed and the expected value and square it. Squaring makes it always positive, so

that a divergence in either direction counts, and it also means that large

divergences from the expected value are more notable than smaller ones. And then to get that as

a proportion, we'll divide that by the expected value. And we're going to

sum this quantity over each of the possible outcome values: chi-squared = sum of (observed - expected)^2 / expected. In our coin, we

only have two values, but you can imagine a die or other options where there will be

more possible values of the outcome. Now before we get to the outcome of our test,

I'd like to introduce some additional machinery first. Imagine a

normal coin, one that's not loaded. If you toss it twenty times, sometimes it will

come up heads ten out of twenty times, right at the expected value. But it wouldn't be

shocking if it came up nine heads out of twenty, or eleven heads out of twenty.
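
To make "wouldn't be shocking" concrete, here is a minimal sketch that computes the exact chance of each possible number of heads in twenty tosses of a fair coin. The function name `heads_probability` is just an illustrative choice, and the sketch assumes independent tosses with a 50 percent chance of heads.

```python
from math import comb

def heads_probability(heads: int, tosses: int = 20, p_heads: float = 0.5) -> float:
    """Exact probability of seeing exactly `heads` heads in `tosses` independent tosses."""
    return comb(tosses, heads) * p_heads**heads * (1 - p_heads) ** (tosses - heads)

# Probability of every possible outcome, from 0 heads up to 20 heads.
distribution = [heads_probability(k) for k in range(21)]

for k in (9, 10, 11, 13):
    print(f"{k} heads out of 20: {distribution[k]:.3f}")
```

Ten heads is the single most likely outcome, but it only happens about 17.6 percent of the time, and nine or eleven heads are each nearly as likely.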

And this distribution of outcomes for something like a coin will approximately follow

what's called a normal, or Gaussian, distribution. So you've got the expected

value or the mean in the middle and that's going to be the most probable outcome and

then it's going to slowly fall off, becoming increasingly unlikely as you head

out towards the tails. And the area under our hill is gonna sum to 100%. The sum of

all the possible probabilities adds up to 100%. And out here on the end are the two

tails and we're gonna call the area that's in those very edges, the really unusual

behavior. So to fill this in a little bit, in our coin case, our expected value is

going to be ten so our mean will be ten, and we'll have some observed value here in

our example thirteen. And the question that we're gonna ask with our statistical

techniques is whether that observed value of thirteen is sufficiently weird. Whether

it's sufficiently far out into the tail as to qualify as unlikely to have occurred by

chance. As for those hinterlands of unlikely to have occurred by chance: by convention,

we're gonna use the portion at the two ends of the tails that together forms

five percent of the distribution. This would be for a two-tailed example. And so,

if you've ever read in a scientific publication that the

probability was p less than .05, that p value is the probability of seeing a value at least as far out as the one we're

observing. And if it falls far enough out in the tails, we're going to say that was

unlikely to have occurred by chance. The second piece of machinery that we're going

to need to be able to do our statistical test is an idea called the Null

Hypothesis. And what the Null Hypothesis means is that our opening bid in any

statistical test is going to be, we don't think there's a difference between the two

conditions, or however many conditions you have. So in the case of our coin, what

that would mean is your opening bid would be the Null Hypothesis, which would be, the

coin is not loaded, or, to be a little bit more precise about it, the behavior of

the coin does not differ significantly from that of a normal, unloaded coin. And

what our statistical test is going to do is check whether the data

falsifies the null hypothesis. So, that's a fancy way of saying that, if our opening

bid is that this coin's behavior doesn't diverge significantly from that of a normal

unloaded coin, and you got, say, twenty heads out of twenty, you'd reasonably say

this isn't right. So in that case our data would allow us to say that we falsified

the null hypothesis. We reject the bid that the coin's behavior is

normal. And the very last thing that we're going to need out of our statistical test

is, in the case of a chi squared, we're going to need to know, what's our p value?

What's the probability that behavior at least this unusual could have been generated by a

normal coin? So as our probability goes down, as it becomes extremely unlikely that

the behavior was generated by a normal coin, once we drop below our magic threshold

of .05, we're going to say we reject the null hypothesis: that's a loaded coin. So

the thing to take away from this table is that as your chi-squared number gets bigger, as

that gap between expected and observed gets larger, or the number of trials gets

larger at the same ratio, that's going to up your chi-squared and in turn, that's going to make it increasingly

unlikely that this behavior could have been generated by an unbiased coin. So now

we can return to our example and do so a little bit more formally. And we can ask,

twenty tosses, thirteen heads, at p<0.05, can we reject the null

hypothesis that there's no difference between this coin and an unbiased coin? So

let's work it out. We've got (13 - 10)^2 / 10 for our heads: the

thirteen observed minus the ten expected, squared, divided by ten. And then we're going to add the other side

of the coin, which is seven observed tails minus ten expected tails, again squared and

divided by ten: (7 - 10)^2 / 10. When we sum that all up, we get 0.9 + 0.9, a value of 1.8. The other thing we

need to figure out is the degrees of freedom. The degrees of freedom is the

number of choices that you have minus one. So for a coin, we have two sides, so

two choices minus one gives us one degree of freedom. And that's because, you

know, you're really only picking one thing: once you know the number of heads, the number of tails is fixed. If you had a die, for example, a six-sided

die, your degrees of freedom would be five, because there are six faces, and six minus one

gives you five. So we can go to our table, and remember, as we go further to the right

it becomes increasingly unlikely that we have a normal coin, so that's our loaded coin.
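
The arithmetic for the twenty-toss coin can be sketched in a few lines. This is only an illustration: the helper name `chi_squared` is hypothetical, and the cutoff of 3.841 is the standard table value for p = 0.05 with one degree of freedom, not a number taken from the lecture's slides.

```python
def chi_squared(observed, expected):
    """Pearson's chi-squared statistic: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Twenty tosses: 13 heads and 7 tails observed, 10 of each expected from a fair coin.
observed = [13, 7]
expected = [10, 10]

statistic = chi_squared(observed, expected)

# Number of possible outcomes minus one.
degrees_of_freedom = len(observed) - 1

# Standard table cutoff for p = 0.05 with one degree of freedom (assumed here).
CRITICAL_VALUE_DF1_P05 = 3.841
reject_null = statistic > CRITICAL_VALUE_DF1_P05

print(f"chi-squared = {statistic:.2f}, df = {degrees_of_freedom}, reject null: {reject_null}")
```

The same helper works for a six-sided die by passing six observed counts and six expected counts, which gives five degrees of freedom and needs a different table cutoff.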

And with one degree of freedom and a chi-squared value of 1.8, we can see that

we have coin behavior that is slightly unusual for an even coin. But not out of

the realm of reasonable. It's between ten and 25 percent of the time that you toss a

coin twenty times, you will see a divergence from the mean of this

magnitude. And so, because our chi-squared statistic doesn't

show up as sufficiently unusual, that is to say that it doesn't make it to the .05

level, in this case, we can't reject the null hypothesis. So that's a real fancy

way of saying that we can't yet stand up and say, this coin is a loaded coin. If we

really cared, the thing to do would be to gather more data. So, let's say we keep

going. We're going to now toss it 60 times and we see this same ratio continue, 39

out of 60 times it shows up heads, so that's going to give us (39 - 30)^2 / 30

for heads, plus (21 - 30)^2 / 30 for tails, and that's going to give us a bigger

chi-squared value. Even though the ratio is the same, because we have more

trials, that's increasing our confidence, and the chi-squared value goes up. So now

it's up to 5.4. And we can look up in our table again, same coin, still a coin, so

degrees of freedom is still one, but we see now that our chi-squared value is way

over to the right and so the equivalent p value that pops out of this table is

somewhere around .02. So, we can reject the null hypothesis that our coin is no

different than an unbiased one with 98 percent confidence. One thing I'd like to

point out is that, if your trend is robust, if the ratio continues, increasing

your sample size by, in this case, a factor of three triples the chi-squared value, from 1.8 to 5.4, and decreases your

p value by roughly a factor of nine, from about .18 to about .02. So, that's all well and good, you might say. You now know

how to walk into a gambling hall and check whether the coin that somebody presents

you is fair or not. But, what does this have to do with HCI? Well, the mechanism

that we've just learned for comparing rates, that holds for coins, also holds

for things like click-through rates on websites. Let's say we have a website that

has a button labeled Sign Up, and ten percent of visitors click that button. To

try and improve traffic to that button and get more conversions, we might change the

button to Learn More and then start gathering data. Over a week, there are

1,000 visitors to the site, and 119 of them click the Learn More button. Can

we say with confidence that the Learn More button has a higher click-through rate

than the Sign Up button did? Let's work it through. We have 119 observed

click-throughs minus 100 expected, squared and divided by 100, and we have 881 observed

non-clicks minus 900 expected, squared and divided by 900. Add it all up

and you get about 4.01 as the chi-squared value and again we have one degree of

freedom because we have two choices clicked and didn't click, one degree of

freedom. And when we look it up on our table, what we see is that the

chi-squared value is just slightly larger than the threshold for p<0.05, which is about 3.84. And so

we can say that this change indeed probably did influence the click rate. So

the chi-squared test, and statistical testing as a general methodology, gives you

two really powerful tools. First, it gives you a way of formalizing "we're pretty

sure." And second, deeply intertwined with this, it gives you a way of generalizing from

small samples. Are the differences that I've observed on a small scale likely to

generalize, if I were to scale this up? And this idea of inference from small

samples owes a lot to beer. In 1908, William Gosset was a chemist at the

Guinness Brewery in Dublin. At the time, Guinness was hiring top graduates from

Oxford and Cambridge to apply biochemistry and statistics to Guinness' industrial

processes, and Gosset devised the t-test as a cheap way of monitoring the quality of

stout. He published this test in Biometrika, but he used a pen name in the

journal so that Guinness could keep its use of statistics as a trade secret. For

Gosset and Guinness there were really two important benefits of being able to make

broad quality estimates from small samples. First, if Gosset's testing

consumed all the beer, there would be none left to sell. Second, many statisticians

find it difficult to do mathematics after having consumed a large quantity of

Guinness. And so the quality of the results is importantly contingent on

testing on just a small sample. And this general strategy of working from a small

sample and performing significance testing can be done under a variety of frameworks.

Today we talked about the chi-squared test, but there are several others. For

example, if you have continuous data as opposed to discrete rate data, there's the

t-test, and if you have more than two conditions, there's a test called

ANOVA, or the Analysis of Variance. And these all work for the same normal,

Gaussian data that we've been looking at so far. So for example, these tests can

help you compare which vacuum cleaner gets things cleaner if you have a measure of

cleanliness or which running shoes help you run faster or which input device,

trackpad, mouse, or stylus, is fastest for input. And while handling it is beyond the

scope of this class, I think it's important to point out that data often

isn't normally distributed. Data could be bimodal: if everybody falls into one of

two camps with nobody in the middle, then you don't have the nice big blob of a

normal distribution. Or it might be shifted over to one side, as with

anything that's time-based: you can be infinitely slow, but you can't be

infinitely fast. And right now I just wanted to point out that that's something

to watch out for. In practice, a lot of the tests that you'll come across are

reasonably robust to modest deviations from being normal. And so, especially for

practical purposes, things often work out. And so, knowing what those assumptions are

and whether a particular test is appropriate for your data is half the

battle. Another clever technique that I picked up from Ron Kohavi is to use A/A

tests, which is to say, take one condition, like all of the people who got the Sign Up

button. Divide that in half and see if you see a statistically significant difference

between one half and the other half. That can be a good warning sign about whether

you're seeing mirages. Or, and again this is way beyond what we'll cover here, you

can use techniques like randomization testing which make no explicit model of

your underlying data and rely on repeated simulations as a way of modeling the data.
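
The flavor of leaning on repeated simulation can be shown on the coin example. To be clear about what this is: the sketch below is not a full randomization (permutation) test. It is a simpler Monte Carlo simulation that does assume a fair-coin model, tosses that simulated coin many times, and counts how often the result lands at least as far from the expected value as what was observed. The function name, the number of simulations, and the fixed seed are all illustrative choices.

```python
import random

def simulated_p_value(observed_heads, tosses, simulations=20_000, seed=0):
    """Estimate how often a fair coin diverges from tosses / 2 by at least the observed gap."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    expected = tosses / 2
    observed_gap = abs(observed_heads - expected)
    at_least_as_extreme = 0
    for _ in range(simulations):
        # One simulated experiment: count the heads in `tosses` fair tosses.
        heads = sum(rng.random() < 0.5 for _ in range(tosses))
        # Two-tailed: a divergence in either direction counts.
        if abs(heads - expected) >= observed_gap:
            at_least_as_extreme += 1
    return at_least_as_extreme / simulations

p_20 = simulated_p_value(observed_heads=13, tosses=20)
p_60 = simulated_p_value(observed_heads=39, tosses=60)

print(f"13 of 20 heads: simulated p is about {p_20:.3f}")
print(f"39 of 60 heads: simulated p is about {p_60:.3f}")
```

The simulated values land in the same neighborhood as the table lookups but are not identical to them, since the chi-squared table is itself an approximation for small counts. The conclusions match: thirteen of twenty is unremarkable, while thirty-nine of sixty falls below .05.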

So to pop back up, what we learned today is that, to get a feel for your data,

graph it all. We also saw how statistics offers us tools that help you distinguish

real trends from mirages. And we learned a common technique, the chi-squared test,

for comparing rates. And here, as with other lectures, my goal is to give you both an

introduction to an area and a concrete skill that you can put to regular use. The

web has provided huge increases in the quantity of available data and also made

it much easier for you to run experiments online. So, my hope is that many of you

will use the experimental skills you've learned here all the time. And while

there's nothing fancy in this video, you may find it useful to review it once or

twice when you first use these skills for your own work. And really we've just scratched the

surface. If you'd like to learn more, as a next step I highly recommend

Jake Wobbrock's course on Practical Statistics for HCI; he's got a series of

online materials. If you'd like to learn about practical strategies for doing

experiments in the general sense, I highly recommend David Martin's Doing Psychology

Experiments. If you'd like to learn the philosophy behind statistical testing,

there's a great book called Statistics as Principled Argument. And if you'd like a

nice flow chart of which test should you use when, I recommend the book Learning to

Use Statistical Tests in Psychology.
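
As a recap of the three worked examples from this lecture, here is a small sketch that computes each chi-squared value and, instead of reading a printed table, the matching p value. It leans on the fact that for one degree of freedom the upper-tail probability of a chi-squared value x equals erfc(sqrt(x / 2)). The helper names are hypothetical, and the shortcut only applies to two-outcome cases like heads and tails or clicked and didn't click.

```python
from math import erfc, sqrt

def chi_squared(observed, expected):
    """Pearson's chi-squared statistic: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_one_df(statistic):
    """Upper-tail probability of a chi-squared value with exactly one degree of freedom."""
    return erfc(sqrt(statistic / 2))

# (observed counts, expected counts) for each example in the lecture.
examples = {
    "20 tosses, 13 heads": ([13, 7], [10, 10]),
    "60 tosses, 39 heads": ([39, 21], [30, 30]),
    "1,000 visitors, 119 clicks": ([119, 881], [100, 900]),
}

results = {}
for label, (observed, expected) in examples.items():
    statistic = chi_squared(observed, expected)
    results[label] = (statistic, p_value_one_df(statistic))
    print(f"{label}: chi-squared = {statistic:.2f}, p = {results[label][1]:.3f}")
```

Only the first example stays above the .05 threshold. The other two fall below it, which matches the table lookups in the lecture.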