All right, so it's good to start by saying a couple of words about how sound propagates in rooms. Here we're going to have to deal with two effects, mainly. The first one is something that we call free space propagation, and this is something that you observe in practice, right? The further you are from the source of sound, the lower the volume. It turns out that for a point source of sound, this can be described by the following statement: sound pressure is inversely proportional to the distance that the sound had to travel. Since sound travels at a constant speed, this is equivalent to saying that it is inversely proportional to the time that the sound had to travel, just with a different constant. The other effect is reflections. In rooms, sound reflects off the walls, and every reflection attenuates the sound. In practice, these attenuations depend on the frequency and on some other things, like the different materials of the different walls. Well, we are going to model them using a single coefficient, alpha, here. So we're going to say that every reflection attenuates the sound by alpha, and what we mean by that is that if the pressure of the sound wave just before hitting the wall is P, then the sound pressure just after bouncing off the wall is going to be alpha times P. Okay, very simple. So implicitly, this alpha is strictly smaller than one. So it turns out that we can describe the room as a linear filter. And what does this mean? Well, this means that if we want to describe the system between the source of sound and some sound sink, for example a microphone, then we can simply write out what the microphone picks up: the signal recorded by the microphone is the emitted sound, the input sound, convolved with some impulse response corresponding to the room, describing the room. We call this impulse response the room impulse response, or RIR. Okay, this is a very common abbreviation.
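As a minimal sketch of the statement that the microphone records the input sound convolved with the room impulse response, the following toy example uses plain Python lists. The impulse response `h` (a direct path plus a single echo three samples later) and the value of `alpha` are made-up illustrations, not measurements from the lecture.

```python
def convolve(x, h):
    """Full linear convolution of two finite sequences (lists of floats)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

# Hypothetical toy room: direct path plus one echo 3 samples later,
# attenuated by alpha (illustrative values only).
alpha = 0.5
h = [1.0, 0.0, 0.0, alpha]

x = [1.0, 2.0, 3.0]   # the "dry" emitted sound
y = convolve(x, h)    # what the microphone would record in this toy room
print(y)              # the dry samples followed by their attenuated echo
```

The output is the dry signal followed by a copy of itself scaled by `alpha` and delayed by three samples, which is all that "the room is a linear filter" means in this toy setting.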
Now this is good news because it means that the room is just a linear shift-invariant system, once we fix the locations of the source and the receiver. And this means in turn that we can use all of the tools that we've developed so far for linear shift-invariant systems, for example the discrete-time Fourier transform or the z-transform, to deal with rooms. Fabulous. Okay, so let's start by hearing how some very different rooms sound, and we start with an anechoic chamber. This room is designed in a special way so that there are no reflections off the walls. There are no echoes, hence the name anechoic. And, you know, having a conversation in such a room is really weird, because you really have to look at the speaker: once you turn your head away, there are no reflections to carry the sound. We also say that this room sounds very dry. Let us listen to a sound sample recorded in this room. >> [FOREIGN] >> All right. So, without any context, you might say, okay, this was recorded somewhere, I don't know, in some room. But in fact, you might have noticed that it is extremely dry. All right, now we move on to the next room, which is not anechoic anymore. This is a small classroom that we emptied here at EPFL, and we were running some experiments in it, so we have the impulse response measurements. Now, just for fun, we can try listening to the impulse response itself, so to the signal h. It sounds like this. [SOUND] As if someone fired a small starter pistol or something. You might notice that it's relatively short. We say that this room has a relatively short reverberation time. And the same sound sample from earlier, reproduced in this room, sounds like this. >> [FOREIGN] >> So it is definitely more natural, and you can hear that it's not as dry. There are some echoes in this room, and the sound is richer.
Finally, we can take the extreme example of a cathedral, and we know from experience that these large buildings have very long reverberation times. So let us listen to the impulse response of a cathedral now. All right, here it comes. [NOISE] So it is very different from the impulse response of a small classroom, right? It's much, much longer. These are very reflective surfaces and there's a huge volume, so the reverberation time is very long. And the same sound sample reproduced in the cathedral sounds like this. >> [FOREIGN] >> As mentioned earlier, these sound samples were obtained simply by convolving the original sample from the anechoic chamber with the impulse response of the corresponding room, so either the small classroom or the cathedral. Now we can start having some fun. Let us analyze a very simple room that will allow us to write some formulas explicitly. This is not really a room. So imagine that you're standing halfway between two very long walls. Say that these walls are infinite, or just very long, and that they are at a distance d from one another, okay? You're holding a microphone, and the fact that you're standing exactly halfway between the walls means that the reflection from this wall will arrive at the microphone at exactly the same time as the reflection from this other wall. Well, approximately, but we're going to think of it as exactly the same time. This simplifies some things, okay? And the impulse response of this room, at the place where you're standing, is given by this formula. Notice that it is just a bunch of shifted delta functions. Each delta function models one reflection. And we know that the k-th reflection must be scaled by alpha to the k, because it was attenuated k times by a wall, and there is also this free space propagation. So notice that in the denominator, there is this k times T term, which models the free space propagation. And this epsilon, here, it just helps us.
It's like a patch for the formula to avoid division by zero, or if you want, it models the first, direct path to the microphone, because the microphone is close to the mouth but not exactly colocated with it. And what is T? Well, it is just the time necessary for the sound to go from the source, from the mouth, to the wall and back. So what is it? The distance that the sound has to travel is exactly two times d over two, so it's d, and the time it takes is then d over c, where c is the speed of sound. Okay, so T is equal to d over c. So, what is capital N here? It's the time, measured in samples, that it takes for one reflection to occur, okay? It has to be an integer because we're working in discrete time, and we want everything to be on the grid, so we want each delta to be shifted by an integer number of samples. So we just round, okay? We round the time in seconds multiplied by the sampling frequency, which gives the number of samples. And finally, you should not see this formula as an exact model of anything. But it's a good model: it describes quite well what happens in this situation, and it's going to serve us to derive some interesting things. In fact, this formula is still a bit too complicated. We don't really like this 1/kT term; it will wreak havoc in the z-transform. So what we're going to do is just get rid of it. It's difficult to handle in the z-domain, it will make us struggle, so why not simplify further? We're going to assume that the dominant attenuation is due to reflections only, and arrive at this approximate impulse response, which is the one we're going to use to begin with. Okay, we just dropped the free space term, and this approximation is not very good. But we'll see that even if it's not very good, it gives some very good results. And the nice thing is that it has a simple z-transform. Okay, so now we want to hear how these things sound and see what they look like.
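The two impulse responses are only described verbally in this transcript. Reconstructed from that description, the realistic one is h[n] = sum over k of alpha^k / (kT + eps) times delta[n - kN], and the simplified one is h[n] = sum over k of alpha^k times delta[n - kN], with T = d/c and N = round(T times the sampling frequency). The sketch below builds both under that reading. The function name, the truncation to a finite number of reflections, and all numeric values are assumptions made for illustration.

```python
def two_wall_responses(d, fs, alpha, eps, num_reflections, c=343.0):
    """Return (N, h_approx, h_real) for the two-wall room.

    The sums are truncated to a finite number of reflections; the model
    described in the lecture has infinitely many. Assumes d and fs are
    large enough that N >= 1.
    """
    T = d / c            # seconds for one trip to the wall and back
    N = round(T * fs)    # the same delay, rounded to whole samples
    length = num_reflections * N + 1
    h_approx = [0.0] * length
    h_real = [0.0] * length
    for k in range(num_reflections + 1):
        h_approx[k * N] = alpha ** k                # wall attenuation only
        h_real[k * N] = alpha ** k / (k * T + eps)  # plus free space decay
    return N, h_approx, h_real

# Hypothetical values chosen only for illustration.
N, h_approx, h_real = two_wall_responses(
    d=3.43, fs=8000, alpha=0.8, eps=0.005, num_reflections=5)

print(N)                                    # delay between reflections, in samples
print([round(v, 3) for v in h_approx[::N]]) # taps of the simplified response
print([round(v, 1) for v in h_real[::N]])   # taps of the realistic response
```

Printing every N-th sample shows only the nonzero taps: a pure geometric decay for the simplified response, and a faster decay for the realistic one because of the extra 1/(kT + eps) factor.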
Okay, so this here is the simplified impulse response. And this here, on the right-hand side, is the realistic impulse response with the 1/kT term. We can see that they are quite different, and they also sound quite different. So here, we can first listen to the original sound; this is going to be our benchmark sound. >> One, two, three, four. [MUSIC] >> Okay, it's some voice and some guitar. Then, if you play this sound and convolve it with the approximate room, it sounds like this. >> One, two, three, four. [MUSIC] >> Sounds bad. It sounds as if the room is really large, so we can actually hear the individual reflections, and the walls are quite reflective. And in what we call a realistic room, it'll sound like this. >> One, two, three, four. [MUSIC] >> Okay, so it is much more natural, even though, obviously, it is not a real room. Our goal now is to invert the room. So we have the reverberated sound, and we want to get rid of the room's influence: we want to remove the echoes, the reverberation, from this sound. And we're going to do it using signal processing, of course. So the reverberated sound is given as a convolution, okay, and here we say that it is given as a convolution between the input sound x and the approximate impulse response, okay? And our goal is to design a filter that dereverberates this sound. We're going to have a simple linear scheme, nothing complicated. So we want to design a new filter that we call the inverse filter, h_i here, that when convolved with the output signal, with the reverberated sound, gives us back the dry sound, the original input signal, okay? So let's play with this expression and get a very simple solution to this problem. We want the inverse filter convolved with y to give us x, okay? But we have the expression for y: it's just the input signal convolved with the room, okay? And we convolve it with the inverse filter.
And now we use the properties of the convolution, okay, and the particular one that we use here is associativity. So we put the parentheses differently, we just parenthesize these guys. And so we see that what we actually ask for is that x is equal to x convolved with something, okay? For this to hold for every input x, that something, the inverse filter convolved with the room, must be a delta sequence, which in the z-domain means that the z-transform of the inverse filter is 1 over the z-transform of the room impulse response. So let us compute the z-transform of the room. Okay, so first, we simply write out the definition of the z-transform. Then we plug in what we had computed for the approximate room impulse response, okay? And here we just use the properties of the delta function, of the delta sequence. We can switch the sums, so this here is actually equal to, first, the sum over k, and then we can put alpha to the k here, and then we can put the sum over n, and whatever depends on n, inside. So this is just z to the -n, delta of n-kN. All right, and now the delta sequence will sieve out the value of whatever it is multiplied with at kN, right? So we can write this out as being equal to the sum over k of alpha to the k times z to the -kN, okay? And this is exactly this expression here, with the summation index renamed for whatever reason. Okay, what remains to be done is just to sum up this geometric series, and we did it many times, so we know how to do it: we get 1 over 1 minus alpha z to the -N. All right, now we can invert the room, and as we said, the z-transform of our inverse filter is just 1 over the z-transform of the room impulse response, okay. And luckily, it turns out to be a very simple filter. It's a finite impulse response filter that has only two taps different from 0, one at position 0 and another one at position capital N. And even if this observation might seem very innocent, that finite impulse response filters cancel exponential impulse responses, it is in fact at the basis of some modern sampling theories, of something that is called finite rate of innovation sampling, which you might want to look up if you're interested.
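The two-tap inverse filter just derived, with a 1 at position 0 and minus alpha at position N, can be checked numerically. The sketch below convolves it with the approximate room impulse response truncated to K reflections; the truncation and the values of `alpha`, `N` and `K` are assumptions for illustration. Because of the truncation, one small leftover tap appears at the very end; with infinitely many reflections it would not be there.

```python
def convolve(x, h):
    """Full linear convolution of two finite sequences (lists of floats)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

# Hypothetical values for illustration.
alpha, N, K = 0.5, 4, 5

# Approximate room: alpha**k at sample k*N, truncated to K reflections.
h = [0.0] * (K * N + 1)
for k in range(K + 1):
    h[k * N] = alpha ** k

# Inverse filter with z-transform 1 - alpha * z**(-N): two nonzero taps.
h_inv = [1.0] + [0.0] * (N - 1) + [-alpha]

g = convolve(h_inv, h)   # equalized impulse response
print(g[0])              # the direct path survives
print(g[1:-1])           # every reflection is cancelled
print(g[-1])             # leftover tap caused only by truncating the sum
```

Each reflection alpha^k at sample kN is cancelled by minus alpha times the previous reflection, which is the time-domain view of the geometric series argument above.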
Now, here we show the equalized room impulse responses. So this is just a convolution between the inverse filter and the approximate room impulse response on the left-hand side, and the realistic room impulse response on the right-hand side, okay? And as we designed our inverse filter exactly for the approximate impulse response, it comes as no surprise that we get a delta function on the left-hand side. What may be surprising is that even on the right-hand side, we get something that's not too far from it, even though the two room impulse responses look very different. The bottom part shows the magnitude of the DTFT of these equalized room impulse responses. We would expect this to be constant, to be close to 1. And we see that this is indeed, of course, the case for the approximate impulse response, but even when we apply the filter to the realistic impulse response, we get something that somehow stays close to one. And now let's hear how these things sound. So first let us hear how it sounds if we convolve the sound with the approximate impulse response, and then apply our inverse filter to it. Okay, here it comes. >> One, two, three, four. [MUSIC] It sounds exactly like the original sound, and this is absolutely no surprise, since we can see that the equalized room impulse response is a delta function. And what if the sound was convolved with the realistic impulse response, but then we apply the filter that was designed for a different RIR, that was designed for the approximate one? Then it sounds like this. >> One, two, three, four. [MUSIC] It is slightly different, but perceptually, I mean, it is extremely close to the original sound. And here is where we see the power of approximation. So even if we heavily approximate the real room impulse response, perceptually the result is not too bad. What happens if we have a different kind of model mismatch?
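A rough sketch of these equalization experiments follows. It applies the two-tap inverse filter to the approximate response and to a realistic one, and also covers the kind of mismatch raised in the last question: a filter designed for a delay that is off by about 1%. The realistic weights follow the alpha^k / (kT + eps) reading of the lecture's formula, scaled so the direct path equals 1. That scaling, the truncation to K reflections, the helper names and every numeric value are assumptions, so the printed numbers only indicate a trend.

```python
def convolve(x, h):
    """Full linear convolution of two finite sequences (lists of floats)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def comb(weights, N):
    """Place weights[k] at sample k*N, zeros elsewhere."""
    h = [0.0] * ((len(weights) - 1) * N + 1)
    for k, w in enumerate(weights):
        h[k * N] = w
    return h

def two_tap_inverse(alpha, N):
    """FIR inverse of the approximate room: 1 at sample 0, -alpha at N."""
    return [1.0] + [0.0] * (N - 1) + [-alpha]

def residual_energy(g):
    """Energy outside sample 0; it is 0 for a perfect delta."""
    return sum(v * v for v in g[1:])

# Hypothetical values; time is measured in units of one reflection delay.
alpha, N, K = 0.5, 100, 8
T, eps = 1.0, 0.5

w_approx = [alpha ** k for k in range(K + 1)]
w_real = [alpha ** k * eps / (k * T + eps) for k in range(K + 1)]

h_inv = two_tap_inverse(alpha, N)
N_wrong = round(1.01 * N)                      # about 1% error in the delay
h_inv_wrong = two_tap_inverse(alpha, N_wrong)

e_approx = residual_energy(convolve(h_inv, comb(w_approx, N)))
e_real = residual_energy(convolve(h_inv, comb(w_real, N)))
e_wrong = residual_energy(convolve(h_inv_wrong, comb(w_approx, N)))
print(e_approx, e_real, e_wrong)
```

With these assumed values, the leftover energy is smallest in the matched case (only the truncation tail remains), larger for the realistic response, and largest when the filter's delay is wrong, because the second tap then no longer lines up with any reflection.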
Assume that we designed everything almost perfectly, but somehow, when we were designing our inverse filter, we thought that the room had a different size than what it has in reality. If we made just a 1% error in the room size, then things would sound like this. Okay, first I will play the original sound, just so that you remember how it sounded. One, two, three, four. [MUSIC] Okay, and now we equalize it with a filter that was correctly designed, but for a slightly different room, with a 1% size error. >> One, two, three, four. [MUSIC] It's just bad. And you can see what happened if you look at the equalized impulse response: it looks nothing like the delta sequence. Also, the frequency domain plot shows that some frequencies are strongly amplified around here, and some frequencies very close to these amplified ones are strongly attenuated. It's nothing like the constant right around one that it should be. So it's clear that something bad happened, and it's also something to think about. It tells us that our design method is not really robust; it's quite brittle, actually. And now we'll quickly take a look at a different application, where we have two people talking over the phone, or over the Internet, say over Skype, as is shown in this figure. The person in Room 1 listens to the sound over headphones, and the person in Room 2 over a loudspeaker. Okay. And what happens now? When the person in Room 1 speaks into the microphone, the sound gets transmitted with some delay to the person in Room 2, and it gets reproduced over the loudspeaker. Notice also that what gets transmitted from Room 1 to Room 2 is not only the voice directed into the microphone: when the person in Room 1 speaks, their voice also bounces off the walls, so it gets reflected. It gets convolved with Room 1, and this is what gets transmitted with some delay to Room 2.
Now in Room 2 it's reproduced over the loudspeaker, and then this sound is again convolved with Room 2, picked up by the microphone and transmitted back to Room 1 with some delay, okay. So what the person in Room 1 hears in his headphones is his own voice, I mean, coming from his mouth, then he hears the person in Room 2 talking, but he also hears a delayed version of his own voice, colored by Room 1 and Room 2, which is extremely annoying. And this is how it sounds. One, one, two, three, three, four. [MUSIC] Here, for simplicity, we use the room impulse responses that we explained and computed earlier. So how do we get rid of this annoying echo? Well, the most natural idea that first comes to mind is this: we know what gets transmitted from Room 1 to Room 2. We know what comes in to be reproduced over the loudspeaker. So call this s of n. So why not just subtract s of n from whatever is being sent back to Room 1? And this is a very nice idea. It makes a lot of sense, except that it does not sound very good. So here's how it sounds. >> One, one, two, three, four, four. [MUSIC] It doesn't help much. The reason why it doesn't help much is that not only s of n gets transmitted back to Room 1, but also the reflections of s of n off the walls in Room 2, okay? So s of n gets convolved with Room 2, and this is what gets transmitted back. So in order to do the echo cancellation correctly, we must first estimate the impulse response of Room 2, and this is the role of the estimated filter here, okay. So we must first estimate the impulse response of Room 2, convolve s of n with this impulse response, and then subtract this convolved signal from whatever is being sent back to Room 1. The situation is actually a bit more complicated, since we also need to estimate the impulse response of, for example, the loudspeaker, which is not just a simple delta, okay? And it's even further complicated by the fact that the conditions in Room 2 change.
So people move around, the temperature changes and so on, so we must re-estimate the impulse response all the time. If we take all these things into account, then after properly doing the echo cancellation, this is the sound that we get. >> One, two, three, four. [MUSIC] As expected, the result is near perfect, right? Because we assumed perfect knowledge of Room 2. And so the only room effect left in this sound now comes from the fact that it is convolved with Room 1.
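The difference between the naive subtraction and the proper echo cancellation can be sketched with toy signals. Everything here is a simplifying assumption: the signals and the Room 2 impulse response `h2` are made up, transmission delays and the loudspeaker's own response are ignored, and the estimate `h2_hat` is simply set equal to the true response, which mirrors the lecture's perfect-knowledge case rather than a real estimation procedure.

```python
def convolve(x, h):
    """Full linear convolution of two finite sequences (lists of floats)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def add_scaled(a, b, scale):
    """Return a + scale * b, zero-padding the shorter sequence."""
    out = [0.0] * max(len(a), len(b))
    for i, ai in enumerate(a):
        out[i] += ai
    for i, bi in enumerate(b):
        out[i] += scale * bi
    return out

# Hypothetical toy data.
s = [1.0, -0.5, 0.25, 0.0]                           # signal arriving from Room 1
h2 = [1.0, 0.0, 0.5, 0.0, 0.25]                      # impulse response of Room 2
v = [0.0, 0.25, 0.0, -0.125, 0.0, 0.0, 0.0, 0.0]     # person talking in Room 2

echo = convolve(s, h2)            # s[n] after being played back in Room 2
mic = add_scaled(echo, v, 1.0)    # what the Room 2 microphone picks up

# Naive idea: subtract s[n] itself. The reflections of s[n] remain.
naive = add_scaled(mic, s, -1.0)

# Proper cancellation: subtract s[n] convolved with the estimate of Room 2.
h2_hat = list(h2)                 # assumed perfect estimate
cancelled = add_scaled(mic, convolve(s, h2_hat), -1.0)

print(naive)       # still contains echo components
print(cancelled)   # only the Room 2 talker is left
```

In a real system `h2_hat` would have to be estimated and updated as conditions in Room 2 change; this sketch only illustrates why subtracting the convolved signal, rather than s of n itself, removes the echo.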