In the last lecture we learned that constructing data structures in parallel requires an efficient combine operation, and we saw that for most data structures the combine operation cannot be implemented efficiently. This lecture has a more constructive character. We will learn about parallel two-phase data structure construction, a technique used to implement combiners efficiently. In fact, it turns out that most data structures can be constructed in parallel using the so-called two-phase construction.

Previously we insisted that a combiner and the resulting collection have the same underlying data structure. For example, we assumed that a combiner that produces an array must internally contain an array at the point when its combine method is called. In two-phase construction, the combiner instead uses an intermediate data structure as its internal representation. This intermediate data structure is different from the resulting data structure, as we will see, and it has the following properties. First, the intermediate data structure has an efficient combine method, whose running time is O(log n + log m) or better. Second, the intermediate data structure has an efficient += method, which ensures that individual processors can efficiently add elements to it. For sequences, += means appending an element to the sequence, while for sets, += is standard set addition. Finally, it must be possible to convert the intermediate data structure into the resulting data structure in O(n/P) time, where n is the size of the data structure and P is the number of processors. In other words, the result method is allowed to copy the entire intermediate data structure, but this copying must be parallelizable.

Together these properties allow building the resulting data structure in two phases. In the first phase, different processors build intermediate data structures in parallel by invoking the += method. These intermediate data structures are then combined in a parallel reduction tree until a single intermediate data structure remains at the root. In the second phase, the result method uses that intermediate data structure to create the final data structure in parallel. In our illustration, the final result is some array-like data structure whose subintervals are populated in parallel by different processors. Ignoring constant factors, the total running time in our example with four processors is n/4 for the first phase, plus two invocations of combine costing c(n) each for the reduction tree, plus n/4 for the second phase. We expect c(n) to be small, so we can ignore it, leaving roughly n/2 computational steps in total. In other words, four processors halve the running time of constructing the final data structure.

Having seen the high-level overview of how two-phase parallel construction works, we now turn to a concrete example, a two-phase array combiner. To keep things simple, we will limit our ArrayCombiner class to reference objects, expressed with a type bound on the type parameter T. We also add the ClassTag context bound, to be able to instantiate the resulting array, and a parallelism level argument. Internally, the ArrayCombiner keeps a field numElems, which stores the number of elements in the combiner, and a field buffers, which holds nested ArrayBuffers in which the actual elements are stored. We use nested ArrayBuffers instead of a single flat one for reasons that will soon become apparent. We start with +=. This method finds the last nested array buffer in buffers and appends the element x to it. If the last nested ArrayBuffer ever gets full, it is expanded to accommodate more elements. As we learned previously, appending to an array buffer takes amortized constant time.
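As a point of reference, a minimal sketch of this skeleton could look as follows. The parallelism argument and the ClassTag bound are only needed later, by result; a complete sketch appears after the walkthrough of combine and result below.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag

// Skeleton of the ArrayCombiner described above: an element count plus nested
// ArrayBuffers; += always appends to the last nested buffer.
class ArrayCombiner[T <: AnyRef: ClassTag](val parallelism: Int) {
  private var numElems = 0
  private val buffers = new ArrayBuffer[ArrayBuffer[T]]
  buffers += new ArrayBuffer[T]      // start with a single empty nested buffer

  def +=(x: T): this.type = {
    buffers.last += x                // amortized constant time append
    numElems += 1
    this
  }
}
```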
Next we implement the combine method. Here the reason for using nested array buffers becomes obvious. The combine method simply copies the references from the argument combiner's buffers to its own buffers field. It does not need to copy the actual contents of those nested buffers, only pointers to them. What is the running time of this combine? The number of computational steps is equal to the number of nested array buffers in the argument combiner. Since every array combiner is created with exactly one nested array buffer, and there are exactly P array combiners created in the reduction tree, one for each processor, the buffers field never has more than P entries. For this reason, the running time of this combine method is O(P). Typical desktop computers today have around four processors, and the most powerful workstations have several dozen, so P is usually negligible compared to the number of elements in the data structure, and this is an acceptable running time for the combine operation.

Finally, we can implement the result method. Once we have the root intermediate data structure, we know the required size of the array from the numElems field, so we allocate the resulting array. We then divide the array indices into chunks, pairs of starting and ending indices that the parallel tasks should each copy. We start these tasks, wait for their completion, and then return the array.

We can now test the performance of the array combiner on a simple parallel transformation operation. To correctly measure the combiner's performance, it is important that we do not stress this operation with extra computational workload; most of the workload should come from invoking the combiner operations themselves. For this reason, the operation merely traverses all the elements and produces a duplicate of the original array in parallel. We use the aggregate operation for this purpose.

So let's take a look at the demo. Before we run the benchmark, we will take a look at the contents of our source file. Here we can see the implementation of the ArrayCombiner class. Due to type variance in the Combiner class, the combine method has a slightly different signature from what we saw in the lecture, but its semantics and implementation remain the same. We can also see the implementation of the copyTo method, used to copy the respective subintervals of the array; we leave you to examine its details if you so desire. The program starts by instantiating the ScalaMeter configuration object with the default warmer. It then uses the ArrayCombiner with the aggregate method at different levels of parallelism: 1, 2, 4, and 8. For each parallelism level, the program reports the running time. Let's run the benchmark through SBT and examine the output. ScalaMeter starts by executing 24 warm-up runs. Once a steady state is detected, ScalaMeter reports the average running time. For one processor, this takes 30 milliseconds. For two processors the speed-up is approximately linear; the time it takes is about 15 milliseconds. For four processors the speed-up is no longer linear; the time it takes is 12.6 milliseconds. The reason for this is the memory bottleneck that we described and discussed in the previous lectures. At parallelism level eight, we even see some loss in performance.
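For reference, here is a self-contained sketch of the combiner we just benchmarked, pulling the whole walkthrough together. It repeats the skeleton shown earlier and adds combine and result. The lecture's source uses the course's task construct for the second phase and leaves copyTo for you to inspect, so in this sketch plain Futures stand in for tasks and copyTo is our own reconstruction of the chunked copy.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.reflect.ClassTag

// Sketch of the two-phase ArrayCombiner walked through above.
class ArrayCombiner[T <: AnyRef: ClassTag](val parallelism: Int) {
  private var numElems = 0
  private val buffers = new ArrayBuffer[ArrayBuffer[T]]
  buffers += new ArrayBuffer[T]

  def +=(x: T): this.type = {
    buffers.last += x          // amortized constant time
    numElems += 1
    this
  }

  def combine(that: ArrayCombiner[T]): this.type = {
    // O(P) steps: copy references to the argument's nested buffers, not their contents.
    buffers ++= that.buffers
    numElems += that.numElems
    this
  }

  // Copy the elements with flat indices in [from, end) into their slots of `array`.
  private def copyTo(array: Array[T], from: Int, end: Int): Unit = {
    var b = 0          // index of the nested buffer holding flat index `from`
    var skipped = 0    // number of elements in the buffers skipped so far
    while (skipped + buffers(b).length <= from) {
      skipped += buffers(b).length
      b += 1
    }
    var offset = from - skipped
    var i = from
    while (i < end) {
      array(i) = buffers(b)(offset)
      i += 1
      offset += 1
      while (i < end && offset == buffers(b).length) {
        b += 1          // advance to the next nested buffer, skipping empty ones
        offset = 0
      }
    }
  }

  def result(): Array[T] = {
    val array = new Array[T](numElems)
    // Divide the array indices into roughly equal chunks, one task per chunk.
    val step = math.max(1, numElems / parallelism)
    val starts = (0 until numElems by step) :+ numElems
    val tasks = for ((from, end) <- starts.zip(starts.tail))
      yield Future { copyTo(array, from, end) }
    tasks.foreach(Await.ready(_, Duration.Inf))
    array
  }
}
```

Conceptually, the duplicated array in the benchmark could be produced with something like `strings.par.aggregate(new ArrayCombiner[String](p))(_ += _, _ combine _).result()`, assuming a parallel-collections aggregate whose zero argument is re-evaluated for each partition; the exact harness in the demo (ScalaMeter configuration, parallelism setup) differs in its details.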
Let's take a step back and try to understand why this technique works. Doing so can help us generalize it to other data structures. In two-phase construction, the crucial step is picking the right intermediate data structure, one that partitions the element space into buckets. The array combiner partitioned the elements according to their indices into disjoint subintervals, and it turned out that these subintervals can be combined easily. For other data structures, we need to find alternative ways of partitioning the element space.

For hash tables, we can separate the elements according to their hash code prefix. For example, if we want to add the element 9 to the hash table, we first compute its hash code. Let's assume that the hash code of the element 9 is 1001 in binary. Since it has the prefix 10, it must be added to the corresponding bucket. Each bucket is a linked list of arrays; once an array in the linked list becomes full, a new array is added to the end. When calling the combine method, the linked lists of arrays from the corresponding buckets get merged, simply by updating their tail pointers, so combine takes as many steps as the number of buckets we choose. Once the reduction completes, we allocate a hash table of the required size. The hash table has the useful property that elements with the same hash code prefix occur in the same contiguous subinterval of the table. This means that independent processors can write to these non-overlapping intervals without paying the cost of synchronization.

In search trees, elements are keyed according to some total ordering. To split the elements into buckets, pivot elements for the partition must be chosen. Assume that the processors have independently produced their own sets of elements; they could then agree on pivot elements, say 3, 7, and 9. Once the pivot elements are chosen, the processors partition their respective elements into the buckets in parallel. During the reduction, corresponding buckets from different processors are linked together, and eventually we end up with a single intermediate data structure. In the second phase, each processor constructs a search tree from one of the buckets, and we end up with four search trees that now have to be merged. Hold on a second. Isn't this the same problem we originally started with? Didn't we assert that almost all balanced search trees cannot be efficiently merged? It is true that arbitrary search trees cannot be merged efficiently, but these are not arbitrary trees. We know that they contain sets of elements that do not overlap, that is, disjoint intervals of elements. Unlike arbitrary search trees, search trees over disjoint intervals can be merged in logarithmic time. We will not dive into the details of how to merge such trees, but if you're interested in learning more, we encourage you to investigate online resources yourself.

For spatial data structures, such as quad trees, the elements are partitioned using their spatial coordinates. You will implement a spatial data structure in one of the exercises of this course, so we leave you with only this high-level idea.
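To make the bucketing idea more concrete, here is a small, hypothetical sketch of a hash-based intermediate data structure. The name HashBucketCombiner and its details are illustrative rather than taken from the course, and nested ArrayBuffers stand in for the linked lists of arrays described above.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch: an intermediate data structure for hash sets that partitions
// elements into buckets by the top bits of their hash code. Nested ArrayBuffers play
// the role of the lecture's linked lists of arrays.
class HashBucketCombiner[T](val prefixBits: Int) {
  require(prefixBits >= 1 && prefixBits <= 30)
  val numBuckets = 1 << prefixBits                  // e.g. prefixBits = 2 gives 4 buckets

  // buckets(i) holds chunks of elements whose hash code starts with prefix i
  private val buckets: Array[ArrayBuffer[ArrayBuffer[T]]] =
    Array.fill(numBuckets)(ArrayBuffer(new ArrayBuffer[T]))

  private def bucketOf(x: T): Int =
    x.## >>> (32 - prefixBits)                      // take the highest prefixBits bits

  def +=(x: T): this.type = {
    buckets(bucketOf(x)).last += x                  // append to the bucket's current chunk
    this
  }

  def combine(that: HashBucketCombiner[T]): this.type = {
    // O(numBuckets) steps: splice the other combiner's chunks onto the matching buckets,
    // copying only references, never the elements themselves.
    var i = 0
    while (i < numBuckets) {
      buckets(i) ++= that.buckets(i)
      i += 1
    }
    this
  }
}
```

In the second phase, each bucket then maps to a contiguous, non-overlapping region of the final hash table, which is exactly what lets independent processors fill the table in without synchronization, as described above.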
Having learned about two-phase construction, we ask ourselves: are there other reasonable ways to implement combiners? The answer is yes. An alternative is to rely on a data structure that supports efficient concatenation or union. Such data structures are less common, but they do exist. A third approach is to use a concurrent data structure. Here, combiners share the same underlying memory area but rely on synchronization to ensure that concurrent modifications do not corrupt the data structure. In the next lecture we will focus on the second approach: we will implement a specific data structure with efficient concatenation, one that is better suited for parallel computations.