Hello everyone, my name is Pavel.
And now, we'll talk about how to find visitor sessions.
This is very important task should be solved when you analyze user's behavior.
How do people usually browse the internet?
Imagine, someone goes to the Facebook,
he scrolls the feed, reads the news,
writes something, then he gets tired of it,
he close Facebook and goes chatting with people and enjoys the life.
And after a couple of hours, he comes back,
uploads the new photos and check their status,
surfs a bit and leaves again.
In fact, most people visit websites in this way,
enters them, reading something interesting and leave.
Web analysis has even a special term for such wizards.
It's called sessions.
On some sites, the session last two three clicks,
people read some articles and leave.
And online games, there can be thousands of clicks but only one session.
Practically, the concept of session for site owners
is more understandable and important as individual clicks.
And first of all,
they are interested to know,
how many sessions a person makes on their site?
Let's learn how to answer this question.
It could seem an easy thing to do.
You take all the clicks of the visitor find
the delays between clicks that are greater than 30 minutes.
Each of them is treated as a new session.
But, frankly speaking, for a long time,
I have been thinking that new sessions couldn't be found in SQL.
And I'm sure that half of big data engineers doesn't know how this can be done.
And earlier, I took Python,
loaded all clicks from SQL as the list,
process them in the loop,
found the session, and load it back in the SQL.
But, you can use SQL or our beloved data frame API for this now.
I will show you how to do this,
for this, you need to use window functions.
And then in this video, you will recall what window functions are and how they work.
What types of windows there are and why is that needed?
And as a result, you will solve the task of counting sessions gracefully.
So, window functions are a special kind of aggregation functions.
However, when you apply them,
the size of the source table doesn't change,
a new value is added to it as a new column.
A typical window function is defined inside the selects method and looks like this.
This transformation also consists of two parts the same as aggregation.
The grouping condition, which is called the window and segregation function.
This example counts the number of clicks for
each IP address and adds to the end of the list.
You can verify that for the first IP address,
the number of clicks is calculated correctly.
There are two clicks.
Window functions are similar to aggregates but
there are differences between them. Let's take a look.
Firstly, the window is applied not to the whole table but to a separate column.
Secondly, the number of values doesn't decrease as output, as an aggregation.
Thirdly, for window function,
the grouping condition is written at the end.
And the fourth, which is very important.
The values in the window functions can be ordered and that can be used.
Let's talk in the details about how the windows are arranged.
In window functions, the grouping of values is specified using the window.
Using groupBy, you can choose the columns you need.
Using partitionBy, you select the lines can be grouped by IP address.
This divides events into groups with the same IP.
For example, you have a visitor who did nine clicks.
And here is the time of each of them.
And there is a grouping and in the simplest type of window,
the elements are not organized.
But, unlike groupings, you can set order in the windows using orderBy.
And since the function is applied to each line,
it also has a sequence number of the current line of the list.
And this allows you to use completely new types of functions,
such as gives the first value,
gives the last value,
takes the previous value,
takes the following value,
returns a number of current line as a window.
And all the aggregation functions like the min(), max(),
sum() also continue to work.
To make it more clear,
let's take a window grouped by users and organized according to timestamps.
And for each click, based on this window,
you'll find the sequence number of the event in the window,
time of the previous event,
time of the next event.
In fact, it is very convenient.
You can see that for first click,
the previous value is NaN.
In Pandas, it's designated as not a number.
The same applies to the next value of the last click as a group.
Function lead, that I used here,
will help us to find sessions.
As I said before,
to find the session,
you need to compare the time to the current and the next click.
You have thrown out all unnecessary columns from
the query by subtracting unixtime column from lead.
You have found the time between clicks.
When the time is NaN, this means that it was the most recent visitor click.
You are interested in finding two types of events.
When the time difference was greater than half an hour or
when a person made his last click and the time difference is NaN.
You count number of sessions for each person,
grouping them by IP.
You did it. I wonder who had the most sessions?
Let's see, for this,
you sort the results by a number of events and take the top.
In our little piece of the log,
there're three people with six sessions.
Wow. Now, let's check the accuracy of our calculation.
It is always important to check your results before presenting it.
Let's take some IP address and check the number of sessions in it.
For example, you chose IP with three sessions and visualize them.
To do this, filter all the events from from
access_log for this IP address and store them as a Panda's data frame.
The time will be on the X-axis,
for this you need to make the time stamp index of the data frame.
On the igrek axis,
you will set some random values.
So that will separate close points on the chart.
Every thing is ready to visualize
our time series and you can be sure that there are real three sessions.
You found that right.
So, at this lesson,
you revised window functions,
talked about the types of windows,
and solved a counting sessions problem by them.
In the next video,
we will talk about
two-dimensional aggregation using pivot tables. Do you know what is it?