Hello, in this video we'll be talking about transformations. In the last video we talked about resilient distributed datasets, and we noticed that those datasets are immutable. Immutable means that you can never modify an RDD in place, and you cannot modify a chunk of an RDD. This is essential for keeping track of all the processing that has been applied to our dataset. And so we describe our data analysis pipeline as a chain of transformations, one after the other: we have an initial RDD, which is transformed over several steps into many other RDDs until we get to our final result.

An important thing in Spark is that all these transformations are lazy. That means they are not executed straight away. So when we apply a transformation, nothing happens immediately; we are just preparing our computation to happen later. Once we have described all of our computation up to the final step, Spark will take care of choosing the best way to execute this computation, and then start all the necessary tasks on our worker nodes.

So let's look at an example. This is the same example we looked at before, the word count example. We create an RDD from the local file system, and this is going to create an RDD where each element is a line of the file.

Let's take a look at probably the simplest transformation, which is map. A map applies the function that we provide to each element of an RDD, so this is a one-to-one transformation: one element of the input RDD is transformed into one element of the output RDD. In this example, the function that we want to apply is lower, so the purpose is to make each line lowercase. The input is one line of text with any kind of capitalization, and the output is the same line, all lowercase.

And let's take a look at how this actually happens on our worker nodes. In this example we have two worker nodes: the orange boxes are worker nodes, and the black boxes are partitions of our dataset. Remember, this is a difference between Spark and MapReduce: we work by partition, not by element. A partition is just a chunk of our data, some number of elements. A map goes from one partition to another partition of the same size. So this is an operation which is completely local: each node is independent of the others, there is no communication, and the processing is applied locally on each node.

There are other transformations; the slide is just for reference. We will look at two of them in more detail right now. One is flatMap. flatMap is very similar to map. The difference is that in the map case we had one element as input and one element as output, while flatMap accepts a function that might produce any number of outputs for one input, okay? In this case the function is split_words, which takes a line as input, which is one element, and outputs each word as a single element. So depending on the length of the line, we might get 1 word, or 5 words, or 20 words. And what we want to do with this output is to flatten it out, so that we get a simple one-dimensional dataset of words. Let's see how this works in the example: the output is just the list of all of the words, and if we have multiple lines it's going to be a longer list of many, many words. So what happens here? It's still a local operation, which in Spark terms is called a narrow transformation: it's local to each node, and each input partition maps to one output partition. The difference here is that the output partitions might be of different sizes, okay? So you see that the partitions on the left are all the same height, while the partitions on the right have different heights.
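To make this concrete, here is a minimal PySpark sketch of the map and flatMap steps we just discussed. The file path and the inline lambdas stand in for the lecture's lower and split_words functions; they are assumptions for illustration, not the exact code from the slides.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "word-count-sketch")

# Each element of this RDD is one line of the file.
lines = sc.textFile("file:///tmp/input.txt")  # hypothetical path

# map: one input element -> exactly one output element.
# Here we lowercase each line (the lecture's "lower" function).
lower_lines = lines.map(lambda line: line.lower())

# flatMap: one input element -> zero or more output elements,
# flattened into a single one-dimensional RDD of words
# (the lecture's "split_words" function).
words = lower_lines.flatMap(lambda line: line.split())

# Nothing has executed yet: map and flatMap are lazy transformations.
# Only an action, such as take() or collect(), triggers the actual work.
print(words.take(10))
```

Note how the last line is the only one that starts any tasks on the workers; everything above it just builds up the chain of transformations.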
Another very important transformation is filter. A lot of the time we're interested in just a subset of our data, or we want to get rid of bad data. Filter is the operation that takes a function, and this function takes one element of my RDD and, for each element, returns either true or false: true if we want to keep the element, false if we don't. In this example, the filtering function is starts_with_a. starts_with_a takes the input word, transforms it to lowercase, and then checks whether it starts with "a". If it starts with "a", it returns true; if not, it returns false. So what is this doing? It is filtering out everything that doesn't start with "a". You see at the bottom of the slide the output of this operation.

What happens here is, again, another narrow transformation, so it's local. And of course, the sizes of our partitions are going to change, and depending on your application you can have very different output. In particular, this sometimes causes trouble because, while at the beginning you had very even partitions, now you have uneven partitions: some that are very, very small, and some that may be larger. So sometimes it's convenient to merge some of those partitions into a lower number of partitions that are easier to process, which can allow some performance gain. This operation is called coalesce.

Let's look at an example. Here I also introduce another function, glom, which is very useful for debugging purposes because it turns the content of each partition into an array. You see here we call parallelize on the range of numbers between 0 and 9, and we specify in parallelize that we want 4 partitions. So Spark is going to transfer our data from the driver program to our worker nodes in 4 partitions. And by calling glom and then collect, we can find out how this dataset was split across the 4 partitions. With the coalesce transformation, we can reduce the number of partitions. As I was telling you before, this is used a lot after filtering, where you have reduced the size of your dataset and it's not very useful to have a large number of partitions; it's better to reduce it to a more manageable number. And you see, in the second case, when we call glom and collect, we find out there are now only 2 partitions.

So let's take a look at what happens on our worker nodes. We have some RDD partitions at the beginning that are uneven, and then we call coalesce and reduce their number, so that the data is more evenly distributed. Generally this is a local operation. In some conditions it might cause some communication between nodes, but it's still considered a narrow transformation.
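Here is a short sketch of filter, coalesce, and glom, assuming the same SparkContext sc as in the previous sketch. The sample words are made up for illustration, and the exact partition splits shown in the comments are examples only; the actual split can differ from run to run.

```python
def starts_with_a(word):
    # Keep the element when this returns True, drop it when False.
    return word.lower().startswith("a")

# 10 numbers spread over 4 partitions, as in the lecture's example.
rdd = sc.parallelize(range(10), 4)
print(rdd.glom().collect())       # e.g. [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]

# coalesce merges partitions into a smaller number, locally where possible.
coalesced = rdd.coalesce(2)
print(coalesced.glom().collect()) # e.g. [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]

# filter is a narrow transformation; output partitions can shrink unevenly.
words = sc.parallelize(["Apple", "banana", "avocado", "cherry"])
print(words.filter(starts_with_a).collect())  # ['Apple', 'avocado']
```

Next, we'll be talking about wide transformations.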