
Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency

K. Selcuk Candan

Professor of Computer Science and Engineering Director of ASU’s Center for Assured and Scalable Data Engineering (CASCADE)

So, if you look at what we have done so far: in the first algorithm, we segmented the query pattern; in the second algorithm, we moved a window over the data pattern. These are the two clusters of approaches: segmentation and windowing.

In general, this idea of moving windows has a specific name, because there are many algorithms in sequence processing, sequence matching, or sequence exploration that rely on moving windows. These are usually called w-grams, or sometimes they are referred to as n-gram algorithms.

Essentially, given a sequence, its w-grams are obtained by sliding a window of length w over the stream, or the sequence, and collecting all the resulting subsequences.

So, to compute w-grams, I take a string, say P, I move a window of length W to each and every position, and I collect all of the windows into a set called the w-grams of P. Essentially, the previous algorithm that we learned was working on the m-grams of P, because we were moving a window of length M over the string P.
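As a quick illustration, here is a minimal Python sketch of w-gram extraction; the function name `w_grams` is my own, not from the lecture:

```python
def w_grams(s, w):
    """Collect all length-w substrings (w-grams) of s into a set."""
    return {s[i:i + w] for i in range(len(s) - w + 1)}

# Sliding a window of length 3 over a string of length 7
# yields 7 - 3 + 1 = 5 w-grams.
print(sorted(w_grams("abcdefg", 3)))  # ['abc', 'bcd', 'cde', 'def', 'efg']
```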

the next few algorithm that we will learn,

will also rely on this idea of moving windows,

but instead of basically moving windows of length M,

where M is the query pattern and that can be costly if the query pattern is long,

we will basically use what we call w-grams,

where W is a user provided parameter.

So, W could be five,

W could be ten, W could be 100.

It's an application dependent parameters.

We'll see that basically,

we can use w-grams for searching for matches.

So, let's see an algorithm. The algorithm that we will learn is called common w-gram counting. Let's see how it works. Once again, we are given a string P of length N and a string Q of length M. You can think of P as the data string and Q as the pattern string.

But in the problem that we will look at, there is actually not much difference between the two, because the goal will be to see whether P may match Q with at most K errors. Note that this problem is slightly different from the problems that we looked at before. There, we looked at the containment problem; in this case, we are looking at a matching problem, which is more similar to computing a distance.

So, we are given two long strings, one of length N and the other of length M, and we are trying to see if they may match each other. We know that the naive way of implementing this, say using edit distance, has order N times M cost, and we don't want to pay that expensive price.
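For context, the order N times M cost comes from the classic dynamic-programming edit distance; here is a minimal sketch (function name mine) of exactly the expensive computation the filter will try to avoid:

```python
def edit_distance(p, q):
    """Classic dynamic-programming edit distance: O(N * M) time and space."""
    n, m = len(p), len(q)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete all of p[:i]
    for j in range(m + 1):
        d[0][j] = j  # insert all of q[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if p[i - 1] == q[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d[n][m]

print(edit_distance("kitten", "sitting"))  # 3
```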

So, how can we use w-grams to help us filter out possible mismatches?

So, the idea is as follows. I will now use the other screen to explain the process, but we may need to switch back and forth between the two windows: one window will present the algorithm, and in the other window I will try to explain it. So, let's see how this will work.

What we have, once again, is a string P of length N and another string Q of length M; possibly both of them are long.

As the first step, we will identify all w-grams of the query string Q, the query pattern. Note that if the query pattern is of length M and the windows are of length W, the number of w-grams we can identify is equal to M minus W plus one.

The reason is that when the window reaches the end of the pattern, the window positions that would reach past the end do not produce w-grams; those simply do not exist. So, if the pattern length is M and the window length is W, there are at most M minus W plus one w-grams of the query string Q. Now, remember, in this problem we are trying to see if there is a match between P and Q with at most K mismatches.

So, let's assume that there is a mismatch somewhere on the query string Q. The first thing you will notice is that this mismatch will be aligned with multiple w-grams, but it is not going to be aligned with any w-gram that finishes before it, and it is not going to be aligned with any w-gram that starts after that position.

So, in general, it is easy to see which w-grams each mismatch will affect: here is a mismatch, and here is a window that is affected. The number of windows that are affected by a given mismatch is at most W.

This is easy to see, because if the window length is W, say W is equal to five, the mismatch may fall at the first position of a window, at the second position, at the third, at the fourth, or at the fifth. So, if W is equal to five, the number of w-grams that may be affected by a single mismatch is also five. Generally, each mismatch can affect at most W w-grams.

But remember, I am allowing for at most K mismatches. This means that the maximum number of w-grams that contain a mismatch is K times W. So, out of the M minus W plus one w-grams that I have in my query string, the number of w-grams that may contain a mismatch is at most KW.

So, what does this mean? This means that a total of M minus W plus one minus K times W w-grams must have an exact match in string P, because my mismatches cannot affect those; the K mismatches cannot affect more than KW w-grams.
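A quick worked example, with numbers of my own choosing: for a query of length M equal to ten, a window length W of three, and at most K equal to two errors, the counts come out as follows:

```python
M, W, K = 10, 3, 2                        # example parameters (mine, not from the lecture)
num_wgrams = M - W + 1                    # 8 w-grams in the query string
max_with_mismatch = K * W                 # at most 6 w-grams can contain a mismatch
must_match_exactly = num_wgrams - K * W   # so at least 2 must match exactly in P
print(num_wgrams, max_with_mismatch, must_match_exactly)  # 8 6 2
```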

This means that, after I identify the M minus W plus one w-grams of Q, given an upper bound of K errors, I search for these w-grams in P; if I find at least M minus W plus one minus K times W of them, then there is a chance that these two strings may contain an approximate match.

If, however, I cannot find this many w-grams, then we are guaranteed that these two strings do not contain an approximate match within the given error bound. So, we can prune this pair away.

Note that, again, I didn't compute any distance between the two strings. I simply divided my query string into its w-grams and searched for those w-grams in my string P; if I can find at least this many w-grams with an exact match, then I know there is a chance these strings might be similar, and I should go and pay the expensive distance cost.
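Putting the pieces together, the whole common w-gram counting filter can be sketched in Python as follows. The function name is my own, and the naive `in` substring search stands in for whatever index a real implementation would use; `False` means the pair is safely pruned, while `True` only means the expensive distance computation is still needed:

```python
def may_match(p, q, w, k):
    """Common w-gram counting filter: can P match Q with at most k errors?"""
    num_wgrams = len(q) - w + 1
    threshold = num_wgrams - k * w
    if threshold <= 0:
        return True  # the filter has no pruning power for these parameters
    # Count the positional w-grams of Q that occur exactly in P.
    common = sum(1 for i in range(num_wgrams) if q[i:i + w] in p)
    return common >= threshold

# One substitution: enough w-grams still match exactly, so the pair survives.
print(may_match("abcdefgh", "abcxefgh", w=3, k=1))  # True
# A hopeless pair: no common w-grams, so it is pruned without computing a distance.
print(may_match("abcdefgh", "zzzzzzzz", w=3, k=1))  # False
```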

The good news is that, although we will not learn that algorithm now, finding the w-grams of Q in P is a substring search in P, and there are efficient data structures, such as suffix trees, that implement this w-gram counting efficiently. We are not going to learn that here, but there are very efficient ways to search for the w-grams of Q in the string P. The strings can be very long, but the search is going to be linear time rather than quadratic.

That is a big gain, and it is very important in data exploration. We need to make sure that these searches can be done quickly, so that the user can be provided with the data for exploration purposes in near real-time.