Covering the tools and techniques of both multivariate and geographical analysis, this course provides hands-on experience visualizing data that represents multiple variables. This course will use statistical techniques and software to develop and analyze geographical knowledge.

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency

Huan Liu

Professor: Computer Science and Engineering School of Computing, Informatics, and Decision Systems Engineering (CASCADE)

In this module, we want to talk about means of analyzing geographical distributions.

Before, we've talked about "detect the expected, discover the unexpected,"

and the visual analytics mantra of "analyze first, show what's important."

A lot of times, as we go through this sort of data exploration,

with some lectures on visual design and other modules on data mining,

we are trying to combine these different techniques to give an overview of

how people can explore data to find things that are interesting.

And for spatial data,

we have a bunch of different measures for sort of

identifying interesting patterns across

space and really I'm talking about geographical data in this case.

So we have Tobler's First Law of Geography where Tobler describes that,

"Everything is related to everything else,

but near things are more related than distant things."

What Tobler is trying to discuss here is that if something occurs at point A,

there may be cases where it spreads out.

And so the neighboring regions are going to be more affected than things further away.

Imagine somebody gets sick, goes to your school, and coughs on something.

The people who go to that school are more

likely to get sick than people who don't,

who are farther away in a different building.

So this is the sort of concept behind that and there's

different measures and statistics we can do to try

to help identify the co-variation of properties within this geographic space.

If correlation exists either positively or negatively,

then we sort of can have

three different possible explanations about what might be going on in a spatial region.

It could be there's just a simple spatial correlation relationship.

So it means something is next to something,

they wind up being correlated.

It could be spatial causality, meaning that something that occurred in A

causes something to occur in B. Or it could be

some sort of interaction effect, where the regions are interacting.

And so spatial statistics are designed to try to help us identify

these correlations so we can start reasoning about

what might have occurred in the data set.

And again, imagine you have a huge data set where you have all of the counties

in the U.S., and inside each county

you have a whole bunch of different variables.

We might want to see whether there is

some relationship between the spatial distributions of two of these variables.

One may be cancer rates and the other may

be the number of refineries

in a particular region.

Is there a correlation between these types of variables?

What other correlations might we find?

What are the spatial relationships?

And so we've talked a lot about correlation and measurements of

correlation, and about doing this for time series, for example, in our time series module,

trying to see how things line up. If we have a line that looks like this,

and I have another data set that looks like this and they follow the same pattern,

these can be considered correlated time series.

Or take a scatter plot.

If I have some data that's distributed like this and I could fit a nice line to it,

that data is considered correlated.

Covariance has a very similar formula.

We're trying to measure the pattern of common variation observed in

a collection of two or more data sets or partitions of data.

So with covariance, we have two data sets, X and Y.

We have the mean of data set X and

all the measurements x_i, and the mean of data set Y and the measurements y_i.

So we can think of this as data set X and data set Y.

We can figure out each mean, and the individual points are x_1,

x_2, x_3, y_1, y_2,

y_3 and so forth.

And so once I calculate the means for X and Y,

I can find the contribution to the covariance between X and Y for i = 1.

I put y_1 here and x_1 here,

subtract the corresponding mean from each, multiply the two differences, and add this up over my entire data set.

And that value gives us some indication of the covariance between those.
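The covariance steps just described can be sketched in a few lines of Python. This is a minimal illustration, not the formula from the slide: it assumes the population form (dividing the summed products by n), and the function name `covariance` and the sample numbers are made up for the example.

```python
def covariance(xs, ys):
    """Population covariance of two paired data sets (assumes division by n)."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # For each pair: subtract the means, multiply the two differences, add them up.
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / n


# Hypothetical paired measurements, e.g. one value per county.
cancer_rate = [2.0, 4.0, 6.0, 8.0]
refineries = [1.0, 3.0, 2.0, 6.0]
cov_xy = covariance(cancer_rate, refineries)
print(cov_xy)
```

A positive result means the two variables tend to be above or below their means together. Note that nothing in this calculation knows where the counties are.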

And this is fine for this sort of data where we have time series data or we have

relational patterns but it's not taking into

account space and what we're really interested in is how do we take into account space.

And again, we had other measures like

correlation, so we can compute the correlation coefficient:

how correlated, how similar, are two or more paired data sets?

Again, we take no space into account here.

We have a very similar formula.

Notice this chunk of our formula

looks identical to this chunk here. We should have a bar here.

And what we're looking for is just normalizing this.

So correlation ranges from minus one to one.

And this gives us some measure of similarity between

two or more paired data sets, and there are

measures to test how significant that pairing is and so forth.

Oftentimes we calculate this correlation because we want to show people this correlation.
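As a rough sketch of that normalization, the correlation coefficient divides the summed products of deviations by the square root of the product of the two sums of squared deviations. The code below assumes the standard Pearson form; the name `pearson_r` and the data are illustrative only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two paired data sets."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Same numerator as the covariance sum.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations used for the normalization.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant data set")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired measurements.
r = pearson_r([2.0, 4.0, 6.0, 8.0], [1.0, 3.0, 2.0, 6.0])
print(r)
```

Because of the normalization, the result always falls between minus one and one, whatever the units of the two variables.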

With space, we still need to modify this to handle space.

There are other metrics, such as entropy, which is a measure of

the amount of pattern, disorder, or information in a set of data x.

So we have some probability p_i, the proportion of events or values occurring in

the ith class or range, and we can calculate Shannon's entropy using this sort of formula.

And all of these give us some sort of information about the distribution

between two data sets. Or, in a single data set, we can calculate things like diversity,

that is, the entropy standardized by the number of classes in the data set, for example.
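Here is a small sketch of entropy and diversity for a list of class labels. It assumes Shannon's entropy with a base-2 logarithm, and it assumes that "standardized by the number of classes" means dividing by the log of the number of classes, so that diversity runs from zero to one. The function names and labels are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy in bits: minus the sum of p_i * log2(p_i) over classes."""
    n = len(labels)
    counts = Counter(labels)
    # p_i is the proportion of values falling in the ith class.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def diversity(labels):
    """Entropy divided by log2(k), where k is the number of classes present."""
    k = len(set(labels))
    if k < 2:
        return 0.0  # a single class has no diversity
    return shannon_entropy(labels) / math.log2(k)


# Hypothetical class labels, e.g. the class assigned to each region on a map.
classes = ["low", "low", "low", "high"]
print(shannon_entropy(classes), diversity(classes))
```

When every class is equally common, diversity comes out as one; when a single class dominates, it drops toward zero.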

And so now that we're hearing the word "class," we might

start thinking about our choropleth maps. We have

different classes in the choropleth maps, so we can start thinking about diversity there.

But for spatial data,

we want to find related regions.

You know our eyes are drawn to these clusters,

these clumps of different data in different regions.

Are those statistically correlated?

And can I do an analysis prior to the visualization to find areas that are

statistically correlated and maybe help the visualization pop those out?

And one means of doing this is adapting some of those measures I

showed earlier for spatial autocorrelation.

Now there's a whole lot of issues in spatial statistics ranging

from scales of the data to how do we sample,

to logical fallacies and ecological fallacies.

And in analyzing our spatial data,

we have to be aware of these different issues.

Try to report those to our users so that we're clear about what's going on in

the data set and try to see if we can come up with ways to overcome those.

Now one critical thing in measurements in space is distance and direction.

So we know the location of, let's say, events. We

can look at a data set of all the crimes that occurred in your town.

You can go to the local police blotter and

collect all this information, and knowledge of the locations

allows the analyst to determine the distance and direction between different locations.

We can have traffic and trajectories,

we can download the New York taxicab data for example.

And a lot of spatial analysis requires the calculation of

a table expressing the relative proximity of pairs of places.

So if I have a bunch of events,

what's the distance between those events?

And so I might make a table with

crime number one, crime number two,

crime number three along the rows, and crime one, crime two,

crime three and so forth along the columns,

and each cell holds the distance between those two crimes.

So if I know the latitude and longitude of crime two and crime one,

I can find the distance between one and two.

These don't have to be crime events; they can be counties as well.

So it could be the distance between all the counties.

It could be the distance between police stations,

the distance between bars,

all sorts of different things can be

expressed as this relative proximity of pairs of places.
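A proximity table like this can be sketched directly from latitude and longitude pairs. The example below assumes great-circle distance via the haversine formula with a mean Earth radius of 6371 km, which is one reasonable choice among several; the coordinates and function names are made up for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def distance_table(points):
    """Square table where entry [i][j] is the distance between point i and point j."""
    return [[haversine_km(p, q) for q in points] for p in points]


# Hypothetical crime locations as (latitude, longitude).
crimes = [(33.4484, -112.0740), (33.4255, -111.9400), (33.5092, -111.8990)]
table = distance_table(crimes)
for row in table:
    print([round(d, 2) for d in row])
```

The table is symmetric with zeros on the diagonal, and the same function works for county centroids, police stations, or any other set of places.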

We can also create what's called a weights matrix.

And so with a weights matrix,

we may have a geographic region divided into different counties, for example.

And each polygon I'm showing here is a particular county.

So we have county one,

two, three, four, five,

six, and this matrix W is going to tell us how these different counties are connected.

So we are going to create a matrix that's got all of the counties

and what goes inside of this matrix is whether or

not the counties share a common boundary for example.

It could also represent the length of the common boundary.

It could be the distance between the centroids.

So the weights matrix can be complicated.

So let's just do a simple one where, if two counties share a common boundary,

we're going to have a one; otherwise it's a zero.

So county one has a common boundary with two and three.

Right? So it has a common boundary with two,

a common boundary with three,

and it does not have a common boundary with itself.

In fact, none of them have common boundaries with themselves.

Okay, now how about county number two?

County two shares a common boundary with one,

three, five, six, and here we notice it even touches four.

So in fact county two shares a common boundary with every other county. County three has a

common boundary with one, two,

four and five: one,

two, four, five, but not six.

County four has two,

three and five, and county six has just two and five.

And I skipped county five.

Let's go back to county five: it has six,

it has two, it has three, it has four.

So we get this sort of matrix.
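The walkthrough above can be written down as a small Python sketch. The neighbor lists are taken from the six-county example in this lecture; the dictionary layout and the function name `binary_weights` are just one hypothetical way to organize it.

```python
# Neighbor lists for the six example counties, as read off in the lecture.
neighbors = {
    1: [2, 3],
    2: [1, 3, 4, 5, 6],
    3: [1, 2, 4, 5],
    4: [2, 3, 5],
    5: [2, 3, 4, 6],
    6: [2, 5],
}


def binary_weights(neighbors):
    """Build a 0/1 weights matrix: 1 if two counties share a boundary, else 0."""
    ids = sorted(neighbors)
    index = {county: i for i, county in enumerate(ids)}
    W = [[0] * len(ids) for _ in ids]
    for county, adjacent in neighbors.items():
        for other in adjacent:
            W[index[county]][index[other]] = 1
    return W


W = binary_weights(neighbors)
for row in W:
    print(row)
```

Row one corresponds to county one, row two to county two, and so on. The diagonal stays at zero because no county borders itself, and the matrix is symmetric because sharing a boundary goes both ways.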

And again, we can do

all sorts of different things, like the length of any common boundary or

a decreasing function of the distance between places.

So we can create this matrix in lots of different ways, and this weights matrix becomes

critically important in lots of calculations for spatial statistics.
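As one example of "a decreasing function of the distance between places," here is a sketch of inverse-distance weights. The choice of 1 / d^power is an assumption for illustration (other decay functions are equally valid), and the distances are hypothetical centroid-to-centroid values.

```python
def inverse_distance_weights(dist, power=1.0):
    """Weights that shrink as distance grows: w_ij = 1 / d_ij ** power, with a zero diagonal.

    Assumes every off-diagonal distance is greater than zero.
    """
    n = len(dist)
    return [
        [0.0 if i == j else 1.0 / dist[i][j] ** power for j in range(n)]
        for i in range(n)
    ]


# Hypothetical distances (km) between the centroids of three counties.
dist = [
    [0.0, 10.0, 20.0],
    [10.0, 0.0, 5.0],
    [20.0, 5.0, 0.0],
]
W_dist = inverse_distance_weights(dist)
for row in W_dist:
    print(row)
```

Nearby counties get large weights and distant ones get small weights, which lines up with the idea that near things are more related than distant things.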

So we're not going to go into depth on calculating a lot of

these spatial statistics, but once you understand how to build

this weights matrix, you can then fold it

into some of the different equations we've talked about to modify them.

So this weights matrix is going to let us modify

these different equations to start calculating spatial covariance,

spatial correlation, and other sorts of metrics that we might be

interested in. If things are correlated in space,

how do we calculate that, and how do we use

that information to determine what might be important for a user?
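To make "folding in the weights matrix" concrete, here is a sketch of Moran's I, one widely used spatial autocorrelation statistic. The lecture does not name a specific statistic, so treat this as an illustrative choice rather than the method from the slides. It reuses the covariance-style products of deviations, but weights each pair of regions by w_ij. The rates below are hypothetical.

```python
def morans_i(values, W):
    """Moran's I: (n / S0) * sum_ij w_ij * (x_i - mean) * (x_j - mean) / sum_i (x_i - mean)^2."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in W)   # sum of all weights
    denom = sum(d * d for d in dev)   # sum of squared deviations
    if s0 == 0 or denom == 0:
        raise ValueError("Moran's I is undefined for empty weights or constant values")
    # Cross-products of deviations, counted only for pairs the weights matrix connects.
    num = sum(W[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / s0) * (num / denom)


# Binary shared-boundary weights for the six example counties from the lecture.
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 1, 1],
    [1, 1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 1, 0],
]
rates = [12.0, 10.0, 11.0, 4.0, 3.0, 5.0]  # hypothetical value per county
print(morans_i(rates, W))
```

Positive values suggest that neighboring regions tend to have similar values, and negative values suggest that neighbors tend to differ. Whether a given value is significant needs a separate test, which this sketch does not cover.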

And if we're using our packages in Python,

we can use the PySAL library to calculate these different measures and

metrics to help us get an idea of what might be interesting and