User four has just two friends, one and three,
they already have, reversed copies for both these edges in our list.
If you look carefully,
we will see that for the rest of users,
it has 5, 6, and 7,
they already have all possible or both copies for their edges.
each edge. User five has two friends,
two and three these are reversed copies.
<2,5> and <3, 5> added already to edge list.
User six, as well as seven has only one friend, user three.
reversed copies, <3,6>, and <3,7> are also added.
Now, let's count the amount of mutual friends for each pair of users using the Edge list.
Let's use the Spark data frame API to represent our Edge list.
Each edge will be a separate row of dataframe.
First element of each pair will go to column A,
and second, will go to column B.
If you look carefully,
you will see that vertices A and B are sorted in a scaling order.
It's important to remember this.
Remember our final goal,
for each user, we want to know,
how many mutual friends he or she has this all other users.
We want to know how many mutual friends,
user one has this all other users,
user two has this all other users and so on.
Thus, for user one,
will find the full result of computation.
This user two, he has only one mutual friend, user three.
This user three, he has two mutual friends, users two and four.
This user four, he has only one mutual friend, user three.
This user five, he has two mutual friends,
users two and three.
This user six, he has only one mutual friend, user three.
And this user seven,
he has one mutual friend, again, user three.
We will see a combinatorial explosion.
The amount of people,
this whom user one,
has at least one mutual friend is much bigger than
the amount of people user one knows directly.
So, we need to write our code in an effective way.
How can we receive the desirable results from an Edge list?
Was it pairs of vertices?
Let's call them A and C,
we will count the amount of intermittent vertices,
people who shared connections with both A and C. Let's
call them B. Dataframe discount's A and B,
we will name abDF.
Let's duplicate it first.
Second, Let's rename the column A as column B,
and column B as column C, into duplicated dataframe.
Finally, let's give this copy dataframes a name, bcDF.
So, now to find the structural pattern,
ABC to vertices A and C connected through a third
vertex B. we should in adjoin abDF to bcDF,
using the condition that abDF's column B is equal to bcDF's column B.
It's time to show you the code.
Here, we are importing all necessary libraries.
Here, we are creating the edge list for our graph.
Here, we are creating a schema for our dataframe,
namely columns A and B.
Here, we are creating the abDF dataframe.
In these two rows,
we are creating a duplicate of abDF or name in
column A to column B and column B to column C. Here,
you can see two newly born dataframes, abDF and bcDF.
Aren't they cute?
After in adjoin abDF to bcDF,
using the conditions that abDF's column B is equal to bcDF's column B,
we will receive the abcDF data frame.
This column's AC as a duplicated B,
one from abDF and second from bcDF.
Then, we can throw away the friend in the middle, column B,
because for each pair of users,
we need only the amount of their mutual friends,
and we don't need the IDs of their mutual friends.
Now, let's group our data on columns A and C,
and count the amount of times we receive each pair of users.
Finally, let's filter from the result only the rows where column A is equal to one.
We will get the next result, but wait.
Why it is that user one doesn't have any mutual friends with
user two and have just one mutual friend who's user three?
Maybe he's just too picky to make friends randomly. I don't think so.
The answer is, for each pair of vertices,
they have just the one edge.
For example, they have an edge two,
three but we don't have as symmetrical edge three, two.
After joining abDF to bcDF using
the condition that abDF column B is equal to bcDF column B,
they will not receive a row,
one, three, three, two.
That's why in the final result,
we don't receive row one, two, one.
We'll add in a symmetric copy for each edge,
fix our problem with the missing one mutual friend for the pair of vertices, one, three.
Let's check. For each edge in abDF dataframe via agent its reversed copying,
duplicate the new doubled abDF again,
renames the column AB to BC,
give a new copy name bcDF,
and perform and in a join again.
And let's take a look at this intermediate result.
For user one, we now have this.
Looks promising.
We have row one, two, two, three.
So after throwing the friend in the middle two,
we will receive the missing result that one and three have a mutual friends.
Let's finish the computation.
In the new abcDF,
let's throw away the friend in the middle,
will becomes, group all data by columns A and C,
count the amount of times we receive each pair of users,
and finally, let's filter only the rows where column A is equal to one.
We will get.
Hooray. Now, we can see that the result is correct,
and we have side effect.
We know the user one has three mutual friends.
This is user one, which is actually true because
user one and user one is the same person.
This side effect can be easily filtered in the last step of our algorithm.
In summary, we count the amount of mutual friends in five steps.
First, for each edge of abDF,
add it's reversed copy.
Create a copy of abDF and rename columns bcDF.
Join abDF to bcDF using the condition abDF column B is equal to bcDF column B.
Throw out friend in the middle B,
for each pair of vertices,
count the number of occurrences,
filter the pairs consisting of identical vertices.
What are the bottleneck of this calculation?
First of all, we double the data on our first step.
Second, we shuffle our data twice,
once on step two,
and once on step four.
Letters very expensive in terms of sending data through a network.