So, let's return to

our general formula for the probability of a sequence.

We said that we need to constrain the complexity

somehow and consider Markov models as one possible solution.

We also said that incorporating medium or

long-term memory is quite problematic with Markov models.

We also said that because

the observed value y is itself stochastic, conditioning directly on its past values

may produce more noise

in predictions than we are really ready to tolerate.

So what are ideas for sequence modeling that we would consider?

One very popular idea is inspired by

latent variable models that we looked at in

the previous videos for non-sequential data.

It turns out that a suitable generalization of such models for

sequential data is provided by dynamic hidden variable models.

The way they work is shown in this diagram.

This time we have a sequence of hidden states, which is the same as viewing

the hidden variable as a random process x_t rather than a random variable on its own.

This hidden process is Markov,

which means that its next value depends only on the previous value and nothing else.

The only way for us to know anything about

this hidden process is to look at the observed values

of y_t and try to make inference about hidden states based on these observations.

Now the observed signal y_t at time t is assumed to be generated from

the contemporary value x_t according to some model specific emission probability.

Note that it does not depend on the previous values of y.

All dependencies between consecutive values of y are induced only

by dependencies between values of x_t, which are determined by the Markov dynamics.

Now what are our potential benefits from introducing

such a dynamic hidden state into Markov dynamics?

The main idea here is that the current hidden state

captures all relevant information for predicting the next value of y.

In other words, the hidden state x_t is what is

called a sufficient statistic for this problem.

The hope here is that the hidden state x_t somehow

smooths out the noise in data leading to better predictions.

So conditioning on x_t replaces conditioning on the past values

of the observed signal that we had in the regular Markov model.
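As a hedged illustration, the sufficient-statistic idea above can be written as a conditional independence statement. The notation p(.|.) for conditional densities is an assumption here; the lecture does not write this formula out.

```latex
% Given the current hidden state x_t, the past observations carry no extra
% information about the next observation y_{t+1}:
p(y_{t+1} \mid x_t, y_1, \dots, y_t)
  = p(y_{t+1} \mid x_t)
  = \int p(y_{t+1} \mid x_{t+1}) \, p(x_{t+1} \mid x_t) \, dx_{t+1}
```

The last equality uses only the two ingredients of the model: the Markov transition for x and the emission of y from the contemporary x.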

Now let's talk about how specifically we can

implement latent variable models for sequential data.

We will be talking about three major classes of dynamic latent variable models.

The first class is called state-space models or SSM for short.

The dynamics in these models are as we saw in

the previous diagram for general dynamic hidden variable models.

The only additional specification here is that both the hidden

and the observed signals are continuous in state-space models.

So to reiterate state-space models are Markov

latent variable models with continuous hidden and observed states.

They are also an example of short-memory models, as they only look one step back.

We will talk more about these models shortly

but for now let's continue with the higher level picture.
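To make the state-space picture concrete, here is a minimal simulation sketch. It assumes the simplest possible case, a scalar linear model with Gaussian noise; the function name `simulate_ssm` and the parameters `a`, `state_noise`, and `obs_noise` are hypothetical and are not taken from the lecture.

```python
import random


def simulate_ssm(n_steps, a=0.9, state_noise=0.5, obs_noise=1.0, seed=0):
    """Simulate a scalar linear-Gaussian state-space model (illustrative only).

    Hidden state:  x_t = a * x_{t-1} + Gaussian noise   (first-order Markov)
    Observation:   y_t = x_t + Gaussian noise           (depends on x_t only)
    """
    rng = random.Random(seed)
    x = 0.0  # assumed initial hidden state
    xs, ys = [], []
    for _ in range(n_steps):
        # The next hidden value depends only on the previous hidden value.
        x = a * x + rng.gauss(0.0, state_noise)
        # The observation is emitted from the contemporary hidden value.
        y = x + rng.gauss(0.0, obs_noise)
        xs.append(x)
        ys.append(y)
    return xs, ys


hidden, observed = simulate_ssm(200)
```

Both `hidden` and `observed` are continuous-valued sequences, and any dependence between successive observations comes only through the hidden sequence.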

The next class of dynamic latent variable models is formed

by the class of Hidden Markov Models or HMM for short.

Unlike state-space models, Hidden Markov Models have a discrete hidden state.

Its discrete-state dynamics is called Markov chain dynamics.

So the hidden state in an HMM is discrete,

but the observed state in

an HMM can be anything, depending on what you want to model.

If the observed signal is continuous, you can

use Gaussian emission probabilities to model it.

If it's discrete, you can use a multinomial distribution, for example, to model it.

In the same way as state-space models, Hidden Markov Models are

examples of short-memory models that only remember the immediate past.
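A small sampling sketch may help fix the HMM picture: a discrete hidden state following a Markov chain, with a Gaussian emission for a continuous observed signal. The two-state setup and every number below are illustrative assumptions, not values from the lecture.

```python
import random

# Hypothetical two-state example; all numbers are illustrative assumptions.
TRANSITION = [
    [0.95, 0.05],  # P(next state | current state 0)
    [0.10, 0.90],  # P(next state | current state 1)
]
EMISSION_MEAN = [0.0, 3.0]  # Gaussian emission mean for each hidden state
EMISSION_STD = [1.0, 1.0]   # Gaussian emission std for each hidden state


def sample_hmm(n_steps, seed=0):
    """Sample hidden states and continuous observations from the toy HMM."""
    rng = random.Random(seed)
    state = 0  # assumed initial hidden state
    states, observations = [], []
    for _ in range(n_steps):
        # Markov chain step: the next state depends only on the current one.
        state = rng.choices([0, 1], weights=TRANSITION[state])[0]
        # Emission: the observation depends only on the current hidden state.
        y = rng.gauss(EMISSION_MEAN[state], EMISSION_STD[state])
        states.append(state)
        observations.append(y)
    return states, observations


states, observations = sample_hmm(100)
```

For a discrete observed signal, the Gaussian draw would be replaced by a draw from a per-state categorical (multinomial) distribution.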

Finally, I want to introduce the last class of

latent variable models that we will be talking about going forward.

These are models that are implemented via neural network architectures.

Examples that we will study in more detail later include

the so-called Recurrent Neural Networks or RNN and

the Long Short Term Memory or LSTM networks as well as Bayesian neural networks.

There are two common themes to all of these models that make them quite

different from both state-space models and Hidden Markov Models.

First, these models are non-parametric models, unlike the other two,

which are typically built as parametric models.

Second, neural network based models such as RNN or

LSTM can have both short-term and long-term memory, which is exactly what the name LSTM,

or Long Short-Term Memory, stands for.

In particular, they can keep track of up

to 100 previous states in practice,

and even more, all the way to infinity, at least in theory.

This is potentially very useful in financial applications

where we want to capture long memory effects of financial markets,

assuming of course that they exist.

We will talk more about how it's done in the future.

Well for now let's take

our traditional quick recess break before moving on to the next topic.

Finally, I want to briefly discuss how we can estimate dynamic latent variable models.

This part is easy at least conceptually,

you simply have to maximize the log-likelihood of data.

Now, how do we compute the likelihood?

The dynamics in the diagram for either an SSM

or an HMM is first-order Markov in the hidden variable.

The causal structure in this graph means that the probability of seeing

the whole sequence of pairs x and y is given by a product over time steps

of the transition probabilities for the x states times

the emission probabilities of y conditional on the current value of x.
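Written out, the factorization just described looks roughly as follows. The sequence length T and the convention that p(x_1 | x_0) stands for the initial distribution p(x_1) are notational assumptions made here for compactness.

```latex
% Joint probability of hidden states and observations:
% transition probability times emission probability at each time step.
p(x_1, \dots, x_T, \, y_1, \dots, y_T)
  = \prod_{t=1}^{T} p(x_t \mid x_{t-1}) \, p(y_t \mid x_t),
\qquad p(x_1 \mid x_0) \equiv p(x_1)
```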

Now you can use this expression to compute the log-likelihood of observed data.

To do this, we should integrate over, or marginalize out, the hidden variables x.

Therefore, the log-likelihood is the logarithm of the integral

over all x_t of the product over t of these two probabilities.
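In symbols, and with the same assumed notation (sequence length T, and p(x_1 | x_0) standing for the initial distribution of x_1), the log-likelihood of the observed data reads:

```latex
% Log-likelihood of the observations: the hidden path x_1, ..., x_T is
% integrated out, so the integral sits between the log and the product.
\log p(y_1, \dots, y_T)
  = \log \int \prod_{t=1}^{T} p(x_t \mid x_{t-1}) \, p(y_t \mid x_t)
    \; dx_1 \cdots dx_T
```

For a discrete hidden state, as in an HMM, the integral becomes a sum over all hidden paths.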

Now, if not for the integral sign,

this would be the logarithm of a product which is equal to a sum of logarithms.

This is an additive function that is relatively easy to maximize,

but in our case we have this integral sign

between the logarithm and the product signs,

so that the previous method of decomposing the result into a sum doesn't work anymore.

This makes the log-likelihood maximization in the presence of

latent factors much more difficult than in models without hidden factors.
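The difficulty can be seen in a toy computation. The sketch below assumes a tiny discrete HMM, so the integral over hidden variables becomes a sum over hidden paths, and evaluates the log-likelihood by brute force. All probabilities, including the initial distribution `INITIAL`, are made-up illustrative values, and the brute-force sum is used only to show the structure; it is not a practical estimation method.

```python
import itertools
import math

# Hypothetical toy HMM with 2 hidden states and 2 observable symbols.
INITIAL = [0.6, 0.4]                  # assumed P(x_1)
TRANSITION = [[0.7, 0.3], [0.2, 0.8]]  # TRANSITION[i][j] = P(x_t = j | x_{t-1} = i)
EMISSION = [[0.9, 0.1], [0.3, 0.7]]    # EMISSION[x][y] = P(y_t = y | x_t = x)


def joint_probability(xs, ys):
    """Probability of one hidden path xs together with the observations ys."""
    p = INITIAL[xs[0]] * EMISSION[xs[0]][ys[0]]
    for t in range(1, len(ys)):
        # Transition probability times emission probability at step t.
        p *= TRANSITION[xs[t - 1]][xs[t]] * EMISSION[xs[t]][ys[t]]
    return p


def log_likelihood(ys):
    """Log of the sum over ALL hidden paths of the joint probability."""
    n_states = len(INITIAL)
    total = sum(
        joint_probability(xs, ys)
        for xs in itertools.product(range(n_states), repeat=len(ys))
    )
    # The sum sits inside the logarithm, so the log does not split into
    # a sum of per-step terms.
    return math.log(total)


value = log_likelihood([0, 0, 1, 1])
```

The number of hidden paths grows as the number of states raised to the sequence length, which is one reason a direct approach like this does not scale.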

As we will discuss in more detail in the next video,

our good old friend,

the EM algorithm can come to the rescue in such cases.