Data repositories in which cases are related to subcases are identified as hierarchical. This course covers representation schemes for hierarchies and algorithms that enable analysis of hierarchical data, and provides opportunities to apply several methods of analysis.

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency

K. Selcuk Candan

Professor of Computer Science and Engineering and Director of ASU’s Center for Assured and Scalable Data Engineering (CASCADE)

So, one thing that I would like to highlight is that

the Euclidean distance and the correlation similarity,

while they measure different things,

they are kind of related to each other.

They are related to each other in the sense that both of them are synchronized measures.

They are synchronized in the sense that, in both cases,

we are assuming that the time is synchronized.

We are basically looking at

the relationship between the two series at the same synchronized time instances.

That is, in the case of Euclidean distance, we are saying,

"Oh, at this time instance

they are different from each other this much."

In the case of correlation we are again doing the same thing.

Basically, to create the graphs that we then analyze for correlations,

we are essentially looking at pairs of values at the same time instance.

Then, once we have the pairs of values at the same time instance,

we draw these correlation graphs and

analyze them for positive or negative correlation.

So we call these synchronized measures.
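
To make the pairing concrete, here is a minimal sketch of both measures in plain Python. The function names and the two toy series are assumptions made for illustration, not something shown in the lecture. Note that both functions only ever combine values that share the same index, that is, the same time instance.

```python
import math

def euclidean_distance(x, y):
    """Distance built from the differences x[t] - y[t] at each shared time instance t."""
    if len(x) != len(y):
        raise ValueError("synchronized measures need series of equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    """Correlation built from the pairs (x[t], y[t]) at each shared time instance t."""
    if len(x) != len(y):
        raise ValueError("synchronized measures need series of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # Not defined for a constant series (the denominator would be zero).
    return cov / (spread_x * spread_y)

# Hypothetical toy data: series_b is series_a scaled by two.
series_a = [1.0, 2.0, 3.0, 4.0, 5.0]
series_b = [2.0, 4.0, 6.0, 8.0, 10.0]

distance_ab = euclidean_distance(series_a, series_b)
correlation_ab = pearson_correlation(series_a, series_b)
print(distance_ab, correlation_ab)
```

The two series rise together, so the correlation comes out close to one even though the Euclidean distance is not zero. That is the sense in which the two measures look at different things while sharing the same synchronized pairing.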

The problem with the synchronized measures is

that they assume that the time is synchronized.

That is, that the two time series are

on the same time scale and within the same time frame,

which doesn't always hold.

For example, on the left-hand side,

we see two time series.

In this case, I have a brown time series and a red time series.

If you look at them, they look very similar to each other.

The two time series have very similar behavior.

On the other hand, the red time series is time shifted relative to the brown time series.

So they are not occurring at the same time.

So if we compute correlation,

or if we compute Euclidean distance,

we will see that these two time series are very different from each other.

Why? Because at this time instance,

their values are very far apart, a very large difference.

The same way when this is decreasing,

the other one is increasing.

As you can see, these two time series are different from each other.

If you look at Euclidean distance,

you will find that they are different from each other.

If you take a look at the correlation,

you will again find that they are different from each other, right?

On the other hand, in terms of their shape,

they are similar to each other.

Simply because one of them is shifted in time.

So, if we want our measure to be shift invariant,

we cannot use the Euclidean distance.

We also cannot use the correlation measure.

We need something else.
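
The effect of a time shift can be reproduced with a small sketch. Everything here is an assumption chosen for illustration: the sine-shaped series, the half-period shift, and the helper names. It only shows that the synchronized measures report a large difference for two series with the same shape, and that the difference disappears once the known shift is undone. It is not the shift-invariant measure itself.

```python
import math

def euclidean_distance(x, y):
    """Pointwise (synchronized) Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    """Pointwise (synchronized) Pearson correlation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

n = 40       # samples in one full period (hypothetical choice)
shift = 20   # half a period, so one series rises while the other falls

brown = [math.sin(2 * math.pi * t / n) for t in range(n)]
# red has the same shape as brown, but delayed by `shift` samples (wrapping around).
red = [brown[(t - shift) % n] for t in range(n)]

shifted_distance = euclidean_distance(brown, red)
shifted_correlation = pearson_correlation(brown, red)

# Undo the known shift and compare again.
red_realigned = [red[(t + shift) % n] for t in range(n)]
realigned_distance = euclidean_distance(brown, red_realigned)

print(shifted_distance, shifted_correlation, realigned_distance)
```

With this particular shift the correlation is strongly negative and the distance is large, while the realigned distance is zero. In practice the shift is not known in advance, which is why a different kind of measure is needed.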

There are other kinds of asynchrony as well.

Another possibility could be stretch.

So in this case,

on the right-hand side,

we have two functions.

These two functions are actually similar to each other.

They both show an increase and decrease behavior.

The only difference is that the red curve shows a very fast increase and decrease.

Whereas, the brown curve shows a slow increase and decrease.

That is, you can sort of think of this as basically

here on the left-hand side the time is shifted.

On the right-hand side the time is stretched.

So in some applications you don't care about the time stretch.

You want to say "Well,

do these two curves show a similar behavior?"

Independent of how fast or how slow they show that behavior, right?

If your application needs stretch invariance,

once again, you cannot use Euclidean distance,

you cannot use the correlation similarity.

You need something else, right?
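
The stretch case can be sketched in the same way. The triangular "increase then decrease" shape, the series lengths, and the `resample` helper are all assumptions made for this illustration. The sketch stretches the fast curve by a factor that is known in advance, using linear interpolation, only to show that the two curves share a shape. It should not be read as the stretch-invariant measure itself.

```python
import math

def euclidean_distance(x, y):
    """Pointwise (synchronized) Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def triangle(u):
    """Rise from 0 to 1 and fall back to 0 as u goes from 0 to 1."""
    return 1.0 - abs(2.0 * u - 1.0)

def resample(series, new_len):
    """Stretch a series to new_len samples (new_len > 1) by linear interpolation."""
    old_len = len(series)
    out = []
    for i in range(new_len):
        pos = i * (old_len - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, old_len - 1)
        frac = pos - lo
        out.append(series[lo] * (1.0 - frac) + series[hi] * frac)
    return out

window = 41       # samples in the observation window (hypothetical)
fast_length = 11  # the fast curve finishes its rise and fall in 11 samples

# brown: slow increase and decrease over the whole window.
brown = [triangle(t / (window - 1)) for t in range(window)]
# red: the same shape completed quickly, then flat at zero.
red = [triangle(t / (fast_length - 1)) if t < fast_length else 0.0
       for t in range(window)]

pointwise_distance = euclidean_distance(red, brown)

# Stretch the active part of red over the whole window and compare again.
red_stretched = resample(red[:fast_length], window)
stretched_distance = euclidean_distance(red_stretched, brown)

print(pointwise_distance, stretched_distance)
```

The pointwise distance is large, while the distance after stretching is essentially zero, up to floating-point rounding. As with the shift, the stretch factor is usually unknown, so a measure that does not assume synchronized time is still required.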

So those are basically the major concerns that we have with synchronized measures.

So then the question essentially becomes well,

if basically we have this problem with the synchronized measures,

how can we come up with mechanisms to quantify

the similarity or difference when we don't assume time synchrony?