In stochastic signal processing, we are interested in describing and manipulating random signals. Random signals are realizations of stochastic processes, also called random processes. The best way to think of a random process is as a machine that contains an infinite number of random variables indexed by a time index n; these random variables, which could all be completely different, are triggered in turn and produce the set of samples that we could actually measure if we ran this machine.

As a simple example, imagine a process where the random variables are all the same and map the outcome of flipping a fair coin: the nth sample of the realization takes the value plus one if the nth toss is heads, and minus one if it is tails. If we run the machine once, we obtain one realization; running it a second time gives a different realization, a third time yet another, and so on and so forth. So a discrete-time random process generates an infinite-length sequence of random values.

The question is: how do we describe this process statistically? We need to know the statistical distribution of each value, but also the interdependence between the infinitely many random variables that appear inside this process-generation machine. Remember from the previous discussion that if we want to characterize the behavior of more than one random variable statistically, we need the joint probability density function of all the variables. Here we have an infinite set of random variables, so we would need to specify the PDF for any subset of random variables that take part in the generation of the process.
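The "machine" metaphor can be sketched in a few lines of Python with NumPy; the function name and the mapping heads → +1, tails → −1 follow the description above, and the seed is an arbitrary choice:

```python
import numpy as np

def coin_toss_realization(n_samples, rng):
    # Each sample maps an independent fair coin toss: +1 for heads, -1 for tails
    return rng.choice([1, -1], size=n_samples)

rng = np.random.default_rng(0)

# Each "run of the machine" produces a different realization
r1 = coin_toss_realization(10, rng)
r2 = coin_toss_realization(10, rng)
```

Calling the function repeatedly corresponds to switching the machine on again: the statistics are the same, but the samples differ.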
In other words, you should be able to pick any set of time indices, take the random variables associated to those indices, and provide the PDF for this subset of random variables. This is clearly too much to handle, so we need to simplify the problem.

A first type of simplification involves considering descriptions only up to a certain order. A first-order description of the process provides all the PDFs associated to the individual random variables used in the process generation. With this kind of description you can compute, for instance, a time-varying mean, which is the expectation of x[n], the random variable associated to time index n. A second-order description provides all first- and second-order PDFs; the second-order PDFs involve any two time indices. With this kind of description we can compute the time-varying mean and also the time-varying autocorrelation, which depends on the chosen pair of time indices. A third-order description continues in this vein and adds a joint probability density function for all possible triples of time indices. In all these cases we still need to provide an infinite number of probability density functions, so we are still dealing with an unmanageable amount of data.

Things start to become more reasonable in the case of stationary processes. For a stationary process, all partial-order descriptions are time-invariant: if you take the joint probability density function for any set of time indices and shift these indices by M samples, the probability density function does not change. Physically, this is the same idea as time invariance for filters, and it means that it does not really matter when you switch the signal-generating machine on: its internal properties translate uniformly in time.
So it is a reasonable requirement to impose on a random process. For a stationary random process, the mean becomes time-invariant, because it no longer depends on when you switch on the machine, and the autocorrelation depends only on the time lag, namely on the difference between the two chosen indices. So the expectation of the product of the random variable at time n and the random variable at time m depends only on the time difference n − m. Higher-order descriptions, likewise, depend only on the relative time differences between the chosen time indices. So in a fully stationary random process, the descriptions at all orders are time-invariant.

Things become even simpler if we only care about the first- and second-order descriptions. A wide-sense stationary process, or WSS for short, is characterized by a time-invariant mean and an autocorrelation that depends only on the difference between the time indices. Why is this enough for us? We will see that most techniques in stochastic signal processing use a quadratic cost function, and this requires only the first and second moments. A quadratic cost function is also very well behaved mathematically, and it makes sense in terms of minimizing an error measure. We will see this in the next module.

A very common type of wide-sense stationary process is white noise. A white noise process is characterized by having zero mean and by being uncorrelated, namely, the expectation of the product of any two distinct random variables in the process is equal to the product of their expectations. Since the process is zero-mean, the autocorrelation sequence is equal to zero everywhere except at lag zero, where it equals the variance of the process. The variance depends on the underlying distribution of the random variables that describe the process.
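As an illustrative sketch, not part of the original lecture, we can generate Gaussian and uniform white noise with the same variance and check that the autocorrelation is concentrated at lag zero; the estimator used here is a simple sample average over one long realization:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000

# Two white noise processes with identical second-order description:
gaussian_wn = rng.normal(0.0, 1.0, N)                     # variance 1
uniform_wn = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), N)  # also variance 1

def autocorr(x, k):
    # Sample estimate of r[k] = E[x[n] x[n + k]]
    return float(np.mean(x[:len(x) - k] * x[k:])) if k else float(np.mean(x * x))

# r[0] is close to the variance; r[k] for k != 0 is close to zero
```

The two processes have the same mean and autocorrelation, so they are indistinguishable from a second-order (wide-sense) point of view, even though their sample distributions differ.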
So if each sample is a Gaussian random variable we have Gaussian white noise, or we could have uniform white noise. The coin-toss process that we saw before is also a white process. Each new sample is independent, because we assume each sample is generated by tossing an independent coin, and each sample has 50 percent probability of being plus one or minus one, so its PDF is (δ(x + 1) + δ(x − 1))/2, where the deltas are Dirac deltas. The autocorrelation is the discrete-time delta: it is zero everywhere except at lag zero, where it equals 1, the variance of this random variable.

So how do we compute the moments of a wide-sense stationary process? If we have access to the theoretical first- and second-order probability density functions of the process, we can simply apply the definitions: the mean is the expectation of the generic random variable, and it is time-invariant because of wide-sense stationarity. For the autocorrelation, we pick an index k representing the time lag and compute the expectation of the product of the random variables at, say, time zero and time k; the actual time values are immaterial, all that matters is the difference between them. Since we have access to the bivariate PDF, we can compute this double integral and obtain the autocorrelation sequence.

Of course, this is just in theory, because in practice we do not have access to the theoretical probability density functions; what we have access to, if we are lucky, is a set of realizations of the process. So suppose we have access to the machine that generates the stochastic signal: we can run this machine several times and collect, say, M realizations of the wide-sense stationary process.
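For the coin-toss process the theoretical computation is elementary, since the PDF is a pair of Dirac deltas; a small sketch, using plain Python sums in place of the integrals:

```python
# PMF of each sample: P(x = +1) = P(x = -1) = 1/2
outcomes = [1.0, -1.0]
probs = [0.5, 0.5]

# Mean: E[x] = sum over outcomes of x * P(x)
mean = sum(p * x for x, p in zip(outcomes, probs))                    # 0.0

# Variance: E[(x - mean)^2]
variance = sum(p * (x - mean) ** 2 for x, p in zip(outcomes, probs))  # 1.0

# For distinct tosses, independence gives E[x[n] x[m]] = E[x[n]] E[x[m]] = 0,
# so the autocorrelation is r[k] = variance * delta[k]: the discrete-time delta
```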
At that point we can compute the mean as an ensemble average and the autocorrelation as an ensemble autocorrelation. To compute the mean, we average across realizations at a given time index; it does not matter which index we choose, because the mean is time-invariant. To compute the autocorrelation, we pick a lag k, take two samples of each realization that are k samples apart, multiply them together, sum the products across realizations, and normalize by the number of realizations.

If we are even less lucky, we do not have access to the generation machine at all, but only to a single realization of the stochastic process, as in the case, for instance, of someone uttering a sentence: a speech signal of which we have only one realization. At that point, what we try to do is take time averages. Given the single realization, to compute the mean we sum all the samples together and divide by the number of samples. To compute the autocorrelation, we choose a time lag k, multiply each sample by the sample k positions away, sum all these products together, and normalize.

In reality we should be performing ensemble averages, and instead we are performing time averages. For some processes this works, and the class of processes for which we can replace ensemble averages with time averages is called ergodic. In practice, we should use a number of samples in our realization that is at least four times larger than the maximum lag in the autocorrelation sequence that we will need in our application.
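The two estimation strategies can be compared on simulated data; here is a sketch using the coin-toss process, which is ergodic, so ensemble and time averages should agree (the sizes M and N and the lag k are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, k = 500, 1000, 3           # M realizations of length N, lag k

# Each row is one run of the coin-toss "machine"
X = rng.choice([1.0, -1.0], size=(M, N))

# Ensemble estimates: average across realizations at fixed time indices
ensemble_mean = X[:, 0].mean()
ensemble_autocorr = (X[:, 0] * X[:, k]).mean()

# Time estimates from a single realization (valid because the process is ergodic)
x = X[0]
time_mean = x.mean()
time_autocorr = (x[:N - k] * x[k:]).mean()

# All four estimates should be close to the theoretical values: mean 0, r[3] = 0
```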
As a last observation, consider the formula for the correlation between two random variables computed as an ensemble average: apart from the normalization factor, it is very close to the formula for the inner product between two sequences in ℓ2(Z). By analogy, we say that if the correlation between two variables is zero, the two variables are orthogonal. The interpretation of this orthogonality is the same as in ℓ2(Z): it means that there is no linear relationship between the variables and that the variables are maximally different. This concept will be very useful in adaptive signal processing.
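The analogy with the inner product can be checked numerically; a minimal sketch, assuming zero-mean Gaussian variables purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100_000

# Two independent (hence uncorrelated) zero-mean sequences
x = rng.normal(0.0, 1.0, N)
y = rng.normal(0.0, 1.0, N)

# The sample correlation is the l2 inner product divided by N
corr_xy = np.dot(x, y) / N   # close to 0: x and y are "orthogonal"
corr_xx = np.dot(x, x) / N   # close to var(x) = 1: x correlates with itself
```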