Data repositories in which cases are related to subcases are identified as hierarchical. This course covers the representation schemes of hierarchies and the algorithms that enable analysis of hierarchical data, and it provides opportunities to apply several methods of analysis.

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency

K. Selcuk Candan

Professor of Computer Science and Engineering Director of ASU’s Center for Assured and Scalable Data Engineering (CASCADE)

Great. Are we done?

Do we have everything that we care about covered?

Unfortunately, not.

Because the ARIMA model that we've discussed so far doesn't do one thing.

What it doesn't capture is the cyclic behavior that we have seen earlier.

So, if we look at the cyclic behavior,

you will see that what happens at this point can possibly

be more easily described if I look at the past.

But not the immediate past,

not two time instants back,

not three time instants back,

but maybe 100 time instants back.

What will happen right

now is better described by a point far back,

at a distance that we usually call a gap or a lag.

So in this time series that I drew here,

the lag that would best describe what's happening right now is 100 units.

Again, the same thing here.

What's happening here may be best described by 100 units back.

So this is a cyclic behavior with a 100-unit lag.

So if I can spell that out,

if I can come and say, well, you know what?

Don't look at the immediate past,

don't look at two units in the past,

don't look at three units in the past,

but look at S units in the past.

That is, if I can say,

what is happening right

now is determined by S units, or 100 units, in the past,

a seasonal past, then I can better characterize this time series.

So essentially, to be able to account for that,

the mathematical tool that we use is called seasonal differencing.

Essentially, in the case of seasonal differencing,

we create a new time series.

In this case, it's a series called S. This

seasonal time series is

obtained by taking the current value and subtracting from it

the value of the time series that is S units in the past.

Look at what would happen if you do the seasonal differencing on this time series.

What would happen? Well, what would happen is the following, right?

At the beginning, so for the first 100 days,

we don't really have any past.

So, essentially,

let's say, we get this value.

We don't have the past, so there is nothing to subtract.

But beyond that point,

what we are doing is we are taking the current value and subtracting from it the value 100 units in the past.

In this case, if this is the ideal case,

if it's a perfect curve,

if there is no random event that's happening,

essentially what would happen is that

we will basically get 0.

So the value here would be 0.

The next value, if you take the difference again, would be 0.

The next value again would be 0.

The next value again would be 0.

Essentially, what we have done is that we have taken this complex time series that shows

this complex cyclic behavior and we created from that a stationary time series.

That is, everything stays the same.

Of course, in the real world,

when we have cycles, you usually have some random changes. So it is not perfect.

Usually, you have some random component in the time series

that comes from external events or external variations.

So what would remain essentially is a purely random time series.

So differencing, essentially, seasonal differencing,

or differencing with a gap, with a large gap,

essentially enables us to take a complex cyclic time series,

a non-stationary cyclic series, and convert it into a simple stationary time series.
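
The seasonal differencing described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the sine-shaped pattern, the season length of 100, the noise level, and the helper name `seasonal_difference` are all made up for the example and do not come from the lecture slides.

```python
import math
import random

def seasonal_difference(x, s):
    """Return the series x[t] - x[t - s] for t = s, ..., len(x) - 1."""
    return [x[t] - x[t - s] for t in range(s, len(x))]

s = 100                      # assumed season length, as in the lecture's example
n = 5 * s                    # five full cycles

# A perfect cycle: the same 100-unit pattern repeated with no random events.
pattern = [math.sin(2 * math.pi * t / s) for t in range(s)]
perfect = [pattern[t % s] for t in range(n)]
diff_perfect = seasonal_difference(perfect, s)

# The same cycle with a small random component added at every time step.
rng = random.Random(0)
noisy = [v + rng.gauss(0.0, 0.1) for v in perfect]
diff_noisy = seasonal_difference(noisy, s)

print(max(abs(d) for d in diff_perfect))   # 0.0: the perfect cycle cancels out
print(len(diff_perfect))                   # n - s values: the first s have no past
print(max(abs(d) for d in diff_noisy))     # small: only the random part remains
```

For the perfect cycle every difference is zero, and for the noisy cycle only the random component survives, which is the stationary series the lecture refers to.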

So it's actually a very powerful tool, which essentially means that if we

expect that our data is showing cyclic behavior,

we might as well add into our ARIMA model

cyclic or seasonal terms as well.

So, essentially, in this case what we are doing is that we are taking the time series X.

We do a seasonal or cyclic differencing

to obtain our new time series,

S. Then we create a new ARIMA model on top

of the newly created time series and we call this the seasonal ARIMA model.
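
A minimal sketch of this two-step idea, under stated assumptions: the data are synthetic, the season length is 12, and the model placed on top of the differenced series is just a single autoregressive term fitted by least squares. A real seasonal ARIMA fit would estimate more terms; the variable names here are hypothetical.

```python
import random

rng = random.Random(1)
s, n, phi = 12, 3000, 0.6    # assumed season length, series length, AR(1) coefficient

# d is an AR(1) series: d[t] = phi * d[t - 1] + random event.
d = [0.0]
for _ in range(n - 1):
    d.append(phi * d[-1] + rng.gauss(0.0, 1.0))

# x repeats its value from s steps ago, plus the AR(1) component.
x = [float(t % s) for t in range(s)]          # one arbitrary starting season
for t in range(s, n):
    x.append(x[t - s] + d[t])

# Step 1: seasonal differencing of X gives the new series S (here called sd).
sd = [x[t] - x[t - s] for t in range(s, n)]

# Step 2: fit one AR coefficient to the differenced series by least squares.
num = sum(sd[t] * sd[t - 1] for t in range(1, len(sd)))
den = sum(sd[t - 1] ** 2 for t in range(1, len(sd)))
phi_hat = num / den
print(round(phi_hat, 2))     # should land near the assumed phi
```

Because the synthetic series was built as "value one season ago plus an AR(1) term", the seasonal difference recovers that AR(1) term and the fitted coefficient comes out close to the one used to generate it.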

So essentially, what we see is that the time series that we

observe has an ARIMA component and a seasonal component.

Usually, there are different techniques,

different models, and some of them would say,

well, the current observations,

the observations we make, show

an additive behavior, which means that essentially we can look at the data

as an ARIMA time series superimposed on, and added to, a seasonal time series.

In some more complex scenarios, you say,

well, what we are observing at a given point in time has two components.

It has an ARIMA component.

It has a seasonal component,

and these are combined multiplicatively.

Of course, using multiplicative seasonal behavior

leads to more complex models and

more difficult discovery algorithms than an additive model.

Usually, additive behavior is easier to discover

and easier to account for than multiplicative behavior.
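
One way to write the two cases down, as a sketch only: the symbols $A_t$ for the ARIMA component and $C_t$ for the seasonal (cyclic) component are notation assumed here, not taken from the slides, and the slides may state the multiplicative case differently.

```latex
\begin{align*}
\text{additive:} \quad & X_t = A_t + C_t \\
\text{multiplicative:} \quad & X_t = A_t \cdot C_t
\end{align*}
```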

To wrap up, we started from a very simple random model,

a very simple model where the only thing that we have to account for

is the current external events and nothing else.

We learned autoregressive models where we are looking at past observations.

We learned about moving average models where we look at past random events.

We said that these may need to be combined for real-world data,

and that we may also need to account for things like speed of change and degree of acceleration.

We came up with the ARIMA models to do that, and then we said that,

well, if our data show seasonal or cyclic behaviors,

you might need to add seasonal differencing, and that leads to these

more complex, but richer

and in many applications more useful, ARIMA models with seasonal behavior.

The goal essentially then is, given a time series,

given a complex time series,

what we want to do is take

this time series and basically convert it

into a closed-form formula where essentially we have the AR components discovered,

the MA components discovered,

the differencing components discovered,

and the seasonal components discovered.

The next question we'll try to answer is, how do we do that?

Given a time series,

how do we do the model discovery process?

In particular, I will outline

a procedure that is commonly known as the Box-Jenkins procedure.

The Box-Jenkins procedure, as we will see,

is not really an algorithm per se,

but it is an approach to model discovery.

It is an approach to discovering the characteristics of a time series

first and then using those characteristics to

discover a closed-form formula for the given time series.

So let's start. So, essentially,

what we are doing right now is that we are learning model-based time series analysis.

Essentially, the goal, as I said in the previous slide,

is to discover a model that describes the given time series well.

We also say that such a model fits our time series well.

So this is usually referred to as the model fitting problem.

So we try to find a closed-form formula that fits the observed time series.

Usually, there are a couple of characteristics here,

or a couple of [inaudible] ,

a couple of things that we want to be able to have from these models.

The first thing that we usually want is that the model should be as simple as possible.

That is, it should contain as few terms as possible.

The reason why we want as few terms as possible is that, as I described before,

if you have more and more terms,

first of all, discovering these terms becomes more difficult.

Secondly, for decision-makers, the more complex the model is,

the less useful or the less interpretable the resulting model is.

So, usually, if it can fit the observed data well,

a simpler model is preferred.

So, we have two criteria,

the first one is the fit,

the second one is the complexity.

As we will see, usually,

there's a trade-off between these.

There's a strong tradeoff between how well you fit and how complex the model is.

Unfortunately, there's really no simple answer to which one you should prefer,

whether you should prefer to have

a better fit or whether you should try to find simpler models,

and that's really an application-dependent issue.

Sometimes you want to fit the data very,

very well, even though you might end up with a

more complex model, because you

need to be able to predict the results very accurately.

Sometimes not fitting the data well is

okay as long as the resulting model is easy to interpret and easy to understand.

So this is really an application-dependent challenge,

and I'm not going to go into details about the

trade-off between them in the next few slides,

but I want you to remember that model fitting usually has two criteria.

One of them is the degree of fit and the second one is the complexity of the model.

The higher the fit and the lower the complexity, the better.

Good. So let's go over the steps of this Box-Jenkins procedure.

The first step in the procedure is essentially to try to remove

any seasonal patterns or deterministic trends in

the data to obtain

a time series that is as stationary as possible.

If you remember, one of the first things that we said in

the earlier slides was that stationary data is easier to analyze.

Complex data with trends or cyclic patterns is difficult to analyze.

So, essentially, because of that,

the Box-Jenkins procedure starts with taking a complex time series

and trying to convert it into

a stationary time series through differencing operations.

Let's see an example.

So, here, we have a complex time series.

Note that this complex time series is not stationary.

First of all, the mean of the series changes over time.

Sometimes it is changing.

Sometimes it's constant over time.

Sometimes it shows a trend.

So, this is clearly not a stationary time series.

However, if we apply differencing on this time series, that is,

if we represent the series not with the observed values

but with the differences of consecutive observed values in time,

we might be able to get a more stationary time series.

So, differencing usually helps us take

complex data and convert it into simpler stationary data.

Because of that, the Box-Jenkins procedure usually starts with differencing.

So you take the data and you apply differencing to see whether you

can make the data as stationary as possible.
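
As a small illustration of this first step, here is a sketch of plain (lag-1) differencing on a synthetic series with a linear trend. The slope, the noise level, and the series length are assumptions chosen for the example; the series on the slide is not reproduced here.

```python
import random

rng = random.Random(2)
n, slope = 200, 0.5          # assumed length and trend slope

# A series whose mean drifts upward: linear trend plus random events.
x = [slope * t + rng.gauss(0.0, 1.0) for t in range(n)]

# First differencing: replace each value by its change from the previous one.
dx = [x[t] - x[t - 1] for t in range(1, n)]

def mean(values):
    return sum(values) / len(values)

half = n // 2
print(mean(x[:half]), mean(x[half:]))     # the two halves have very different means
print(mean(dx[:half]), mean(dx[half:]))   # both halves stay close to the slope
```

The mean of the original series changes over time, while the mean of the differenced series stays roughly constant, which is the sense in which differencing makes the data "more stationary".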

Now, once you obtain the stationary data,

as stationary as possible,

then the next step essentially is to do what is known as plot analysis.

That is, we will draw certain plots,

and we'll visually analyze those plots to see whether we

can discover the parameters of the underlying model.

But what are we plotting?

What are these things that we will be plotting?

We will not be plotting the time series themselves.

We'll be plotting the statistical properties of the time series.

The first statistical property of the time series that we care about is known

as the autocorrelation function of the time series.

The autocorrelation function essentially is a function that helps us observe

linear relationships between lagged values of a given time series.

So this is the closed-form formula of the autocorrelation function.

Essentially, what you will see here

is something

from earlier lectures.

You will remember that this is a correlation.

This essentially is the correlation between two time series.

In this case, we're finding the correlation between

the time series itself and the lagged values of the time series.

That is, the shifted version of the time series.
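
The formula on the slide is not captured in this transcript. As a point of reference, a commonly used sample estimate of the autocorrelation at lag $k$ is sketched below, assuming $n$ observations $X_1, \dots, X_n$ with mean $\bar{X}$; the slide may use a different normalization.

```latex
\[
\mathrm{ACF}(k) \;=\;
\frac{\sum_{t=1}^{n-k} \left(X_t - \bar{X}\right)\left(X_{t+k} - \bar{X}\right)}
     {\sum_{t=1}^{n} \left(X_t - \bar{X}\right)^2}
\]
```

With $k = 0$ the numerator and denominator coincide, so $\mathrm{ACF}(0) = 1$.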

So we can visualize it like this.

So I'm given the time series.

So this is my time series.

It's sort of a small part of the time series.

There is the past, and there is the sort of the future.

So, it's a longer time series.

So I'm here basically.

Let's assume that the colors correspond to the values.

Darker means a higher value.

Lighter means, say, a lower value.

Essentially, what we do is we take

lagged values of the time series and we compare them.

In this case, for example,

we can look at when the lag is equal to 0.

Essentially, what we are doing is we're comparing the time series with itself.

We find the correlation of the time series with itself.

We're comparing value by value,

and we are seeing whether these values are similar.

That's what correlation does, if you remember.

Obviously, if you have no lag in the data,

if you don't shift the data,

we will have a very high correlation because this time series here,

which is X_t, looks very similar to this time series,

which is X_t plus 0, which is again X_t.

In fact, the correlation,

in this case, will be 1.

The ACF with lag equal to 0 will have the value 1.

It is going to be a perfect match,

a perfect alignment, when the lag is zero.

Next, what we will do is we will apply a shift.

We'll shift the time series by one unit.

In this case, our lag is equal to 1.

Note that when we apply a shift,

lag equal to 1, that is, when

we shift the time series a bit, just a bit,

what's happening is that the aligned observations

start differing from each other at the same positions.

Now, of course, since the shift is small, usually,

what we would expect is that if the lag is small, if the shift is small,

even though we might not get perfect correlation,

even though we might not get a correlation value of one,

we might get a high correlation,

say, a 0.9 correlation, because we expect that

shifting the time series

just a bit is not going to make a very big change.

So, we will expect that.

But again, not all datasets will have that.

But usually, you kind of expect that.

So, what if I shift it a bit more? What happens?

So what happens in this case?

You will see that if I take lag equal to 2,

the difference between the aligned values increases.

As I am shifting the time series further and further,

what's happening is essentially that

I am getting maybe much lower autocorrelations.

At lag equal to 2,

the series start looking very different from each other.

Since I'm getting very different values,

I can now start seeing, oh, look,

it looks like

the autocorrelation is dropping

as the lag grows, as I look further back.

Now, if my data is cyclic, however, eventually,

if I shift my data sufficiently, in this case,

if I shift my series by seven units,

I might again find a position at which the correlation becomes high.

So what happened is that at lag equal to 0,

my correlation was high.

My correlation starts going down and eventually,

my correlation again becomes close to one.

So, essentially, what's happening is that, as you can see,

this autocorrelation function is telling me something about

the shape of the curve, the shape of the time series.

In this example, if I see some behavior like this,

where I start with a high autocorrelation,

it comes down, and it goes up,

and eventually, it reaches the same value,

I can actually even say that, wow,

it looks like this may be cyclic.

So this may be seasonal.

So I can maybe learn that.

So, as you can see, the autocorrelation function that I plot can give

me some information about the shape of the time series, and we will use that.
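
The shifting-and-comparing procedure described above can be sketched as follows. This is an illustration under assumptions: the series is a synthetic sine wave with a cycle of 7 units, and the helper `acf` uses one common sample estimate of the autocorrelation, which may be normalized differently from the formula on the slide.

```python
import math

def acf(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag."""
    n = len(x)
    m = sum(x) / n
    denom = sum((v - m) ** 2 for v in x)
    return [
        sum((x[t] - m) * (x[t + k] - m) for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]

period = 7                                   # assumed cycle length, as in the lecture
x = [math.sin(2 * math.pi * t / period) for t in range(100 * period)]

r = acf(x, 2 * period)
for k, value in enumerate(r):
    print(k, round(value, 2))
```

Printing the values shows the pattern the lecture describes: the correlation is 1 at lag 0, falls as the lag grows, and climbs back close to 1 at lag 7 and again at lag 14.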

We will use this idea for the autocorrelation function to understand the time series,

and we will basically learn how to do that.

The second tool that we use a lot when we try to

understand the high-level characteristics of

a time series is called the partial autocorrelation function.

We are again studying the autocorrelation.

We are again taking the time series and comparing it to its shifted values.

We are again doing that.

The difference, however, is that when we are using the partial autocorrelation function,

we adjust the correlation for

the presence of the intermediate values between X_t and its lagged value.

Now, I'm not actually going to define how this is done.

I'm only going to say what the partial autocorrelation function does.

Essentially, it is the correlation between X_t

and its lagged value, conditioned on the intermediate values.

So I'm not really going to go into the actual formula to compute that,

but what I will do instead is to show you guys how to use the ACF,

the autocorrelation function, and the PACF,

the partial autocorrelation function, to understand the time series data.

So we'll actually learn this plot analysis procedure with some examples.
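
The lecture does not give a formula for the PACF. For readers who want to experiment, one standard way to obtain it from the autocorrelations is the Durbin-Levinson recursion, sketched below. The helper names `acf` and `pacf`, the synthetic AR(1) series, and its coefficient 0.7 are assumptions made for this example.

```python
import random

def acf(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag."""
    n = len(x)
    m = sum(x) / n
    denom = sum((v - m) ** 2 for v in x)
    return [
        sum((x[t] - m) * (x[t + k] - m) for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]

def pacf(x, max_lag):
    """Partial autocorrelations for lags 1..max_lag (Durbin-Levinson recursion)."""
    rho = acf(x, max_lag)
    result = []
    prev = []                                # coefficients of the order k-1 fit
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den                   # correlation left after intermediate lags
        cur = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)]
        cur.append(phi_kk)
        result.append(phi_kk)
        prev = cur
    return result

# Synthetic series where each value depends directly only on the previous one.
rng = random.Random(3)
x = [0.0]
for _ in range(4999):
    x.append(0.7 * x[-1] + rng.gauss(0.0, 1.0))

p = pacf(x, 4)
print([round(v, 2) for v in p])
```

In this synthetic series the value two steps back influences the present only through the value one step back, so the partial autocorrelation is large at lag 1 and close to zero at the later lags once the intermediate values are accounted for.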