Welcome back. In this video we will be talking about Resilient Distributed Datasets. Resilient Distributed Datasets, usually called RDDs for short, are the data containers, so they are the way Spark stores your data. Let's start from the definition and look at one term at a time.

The first term is dataset. This is like a variable or an object in programming, and it is generally created by reading from some storage, which can be HDFS or S3. It's also very convenient if you are running on your local laptop: you can get some local text files into Spark, or even a hierarchy of folders. Once Spark reads those data in, you can reference them with an RDD. The other way to create an RDD is by transforming another RDD. RDDs are immutable, so you cannot change just a section of them; instead, every transformation of an RDD creates a new one, and by chaining a series of transformations you can write your data analysis pipeline.

The second term is distributed. Of course, we are running distributed across a number of machines, maybe hundreds of instances on Amazon, and that complexity is hidden behind a very simple interface: when we want to execute something, we just send commands that reference this single object. Data are also divided into partitions. For example, if you are working with text files, Spark doesn't work line by line but in groups of tens or hundreds of lines, and this can be changed depending on the application so that you can optimize performance. In this way, partitions are spread across all of your machines and are the atomic chunks processed by your analysis pipeline.

The last term is resilient, and it is very important because in a cloud environment it is pretty common to have node failures or very slow processes, so it's essential to be able to recover from these situations without losing any work already done. The technique used by Spark is to track the history of each partition: at every point in your calculation, Spark knows which partitions are needed to recreate a given partition in case it gets lost. If that happens, Spark automatically figures out where it can restart from and what's the minimum amount of processing needed to recover the lost partition.

Let's now go to our console. You can open PySpark on the Cloudera VM, and let's see how you can play interactively with RDDs. A very interesting function to start with is parallelize, which is mostly used for testing purposes. Basically, it takes a data structure, generally a list, on your driver program (the driver program is your PySpark console) and distributes it across your cluster with some number of partitions. In this case the number three is the number of partitions, and the sc.parallelize function gives you back a reference to your RDD.

Then, for example, the simplest operation we can do is just collect our data back to the driver program. This is what you generally do once you have completed all your calculations: you will want to copy your final results back to your session so that you can see them. Of course, you need to make sure that you're not trying to collect a very big RDD, because it needs to fit in memory on your driver program. In this case, the output of this operation is the initial list, because we just copied it over.
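Below is a minimal sketch of the sc.parallelize and collect calls just described, assuming the PySpark shell on the Cloudera VM, where the SparkContext is already available as sc; the variable name integer_RDD and the example list are illustrative, not necessarily the exact ones shown in the video.

    # The PySpark shell provides the SparkContext as `sc`.
    # parallelize distributes a local Python list across the cluster;
    # the second argument asks for 3 partitions, as in the example above.
    integer_RDD = sc.parallelize(range(10), 3)

    # collect copies the distributed data back to the driver program.
    # Only collect results small enough to fit in the driver's memory.
    print(integer_RDD.collect())
    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]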
That single collect call hides the complexity of a distributed operation, in which Spark has communicated with all the nodes and asked them to send back their section of the RDD. Another thing we sometimes want to do is check exactly how the data are partitioned across our cluster of nodes, and you can achieve that with the glom function: you call glom and then collect, and in this case we get back our data, but separated by partition. For debugging in Spark, this is a pretty useful function.

Then we can move on to reading a text file. For example, in the last module we created a test file called testfile1 in our home directory on the Cloudera VM. You can read it either as a local file, thanks to the file: prefix followed by three forward slashes, or from HDFS with the syntax you see below. One simple operation you can do with this RDD, which is now text_RDD, is take, which copies the first element of your dataset back to your driver program; in this case, it's going to be the first line.

Let's now repeat the word count example that we ran in Hadoop MapReduce, using Python and the Hadoop Streaming interface, now in Spark. The first step is the mapping phase. In the mapping phase we were actually doing two different operations: we were splitting each line into words, and then we were creating key-value pairs where the key is the word and the value is one. This was the output of mapper.py in the Hadoop MapReduce example.

Here, we can implement those two operations as two very simple functions: the first is the split_words function and the second is create_pair. Then we apply them to our input RDD: we flatMap first. flatMap is the same kind of operation as map, so it applies your function to each of the elements of our RDD, but in this case the output of split_words is going to be a list of words, and we want to flatten it out so that our output is just an RDD with all of our words; this is exactly what flatMap does. The output of this is going to be a very long list of words, and then we apply create_pair to create the pairs of word and 1. Thanks to the fact that PySpark is very interactive, we can take a look at exactly what this intermediate stage does: we call the collect operation on this pairs RDD, and this is the output. You see that for each word in our input text file we have a key-value pair with one as the value.

Then we want to sum all of those. The operation we want to do is a sum, so we implement a simple function, sum_counts. We want to apply it key by key, so that, for example, all the instances of the word "far" are summed together; in this case the result is going to be two. So we call the reduceByKey operation and give it the reduction function, sum_counts, and this is actually the output of our word count example. Then we collect to take a look at the result on our driver program, and this is the result, which is the same one we were getting with Hadoop MapReduce.

Next time, we will talk in more detail about all the different kinds of transformations you can apply to the data, like flatMap and map, and all the actions you can apply to the data, like reduce. See you next time.
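For reference, here is a minimal sketch of the word-count pipeline walked through above, again assuming the PySpark shell on the Cloudera VM with sc already defined. The names split_words, create_pair, sum_counts, and text_RDD follow the ones mentioned in the video; the input path and the sample output are illustrative.

    # Read the input file; the path below is illustrative.
    # A local file would use the file:/// prefix instead of an HDFS path.
    text_RDD = sc.textFile("/user/cloudera/input/testfile1")
    print(text_RDD.take(1))          # first line of the file

    def split_words(line):
        # split one line of text into a list of words
        return line.split()

    def create_pair(word):
        # build a (word, 1) key-value pair, like mapper.py in Hadoop Streaming
        return (word, 1)

    # flatMap flattens the per-line word lists into one RDD of words,
    # then map turns each word into a (word, 1) pair.
    pairs_RDD = text_RDD.flatMap(split_words).map(create_pair)
    print(pairs_RDD.collect())       # e.g. [('A', 1), ('quick', 1), ...]

    def sum_counts(a, b):
        # reduction function: add up the counts for the same word
        return a + b

    # reduceByKey applies sum_counts key by key, giving the final word counts.
    wordcounts_RDD = pairs_RDD.reduceByKey(sum_counts)
    print(wordcounts_RDD.collect())  # same result as the Hadoop MapReduce job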