Welcome back. This is the last lecture of the whole course,
Spatial Data Science and Applications.
In this lecture, you will study
spatial big data management and analytics using taxi trajectory data,
which is a typical example of spatial big data.
Assume that you have a client, a taxi call-service company,
that wants to provide a new service to taxi drivers named "Passenger Finder",
which can guide them to the places where more passengers are waiting for taxi cabs.
The solution would have both private and public value: each taxi driver could expect more income,
and the municipal government would be able to provide better taxi service to citizens.
Key development issues include the size and noisiness of the taxi trajectory data,
and the appropriate design of the spatial data analysis to solve the problem.
We will address each issue one by one throughout this lecture.
For the given problem, the proposed solution is summarized in a flow chart.
The noisy taxi trajectories are first preprocessed to filter outliers and map-match them to the road network.
The drop-off/pick-up locations are then spatially matched to regions, and the time information is properly stored.
Next, a polygon-based hotspot analysis is conducted for every hour of the week,
and finally hotspots are delivered for any given time of the week.
The solution would require the functionalities
of all the disciplines related to spatial data science.
So, it fits well into a solution structure of spatial big data management and analytics.
All four disciplines are used to implement the proposed solution.
The big data system is well suited for filtering outliers and map-matching the taxi trajectory data.
The spatial DBMS stores all the refined datasets of pick-ups and drop-offs,
and basic spatial operations, such as spatial join, can be conducted there.
A data analysis tool is used for the hotspot analysis, and finally
the outcome of the solution is visualized in GIS after a series of data processing steps.
Now, let's go through a step-by-step procedure to
implement the proposed solution for the "Passenger Finder" service.
The first step is to prepare the datasets for the problem:
taxi trajectories, a road network layer, and the administrative districts of the project site,
which is Seoul, Korea.
The specification of the taxi trajectory data is summarized in the slide.
The data is provided by a company named Navicall,
spans more than one year, is collected from 6,500 cabs,
and amounts to about 400 megabytes per day.
The attributes of the taxi trajectory data are described in the table.
There are 11 attributes, including x, y, and a time tag that form the trajectory,
the taxi status (empty, pick-up, drop-off, or hired),
car ID, driver ID, company ID, and so on.
The road network layer around the city of Seoul is presented here,
along with the administrative districts of Seoul.
The region polygon layer will be used as the spatial unit of the hotspot analysis.
Now that the datasets are all ready, let's start data processing.
The first data processing step is to filter out
outliers and keep only the trajectory data around Seoul.
The second preprocessing step is to map-match the filtered trajectories.
Fortunately, both filtering and map-matching can be implemented in a distributed manner,
and the data size is rather big,
so Hadoop MapReduce is a good solution for these two preprocessing steps.
Taxi trajectory data is subject to noise and outliers, as the figure illustrates.
That is true of most big data,
so preprocessing is an inevitable step in big data analysis.
For the given problem,
you will analyze only the taxi trajectories inside Seoul,
because the "Passenger Finder" service will be offered to taxi drivers in Seoul.
In Korea, taxi drivers are registered to only one municipal district,
and they can run their taxi business only within that registered district.
For this reason, trajectories outside of Seoul
would be meaningless to the target taxi drivers.
So, bounding box filtering is applied to the taxi trajectory data,
which you learned about in big data systems in the fourth week.
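As a rough illustration only, a minimal bounding box filter in Python might look like the sketch below; the coordinates and record layout are illustrative assumptions, not the actual Navicall schema.

    # Minimal sketch of bounding box filtering (coordinates and record layout are
    # illustrative assumptions, not the actual Navicall schema).
    SEOUL_BBOX = (126.76, 37.41, 127.18, 37.70)  # approximate (min_lon, min_lat, max_lon, max_lat)

    def inside_bbox(lon, lat, bbox=SEOUL_BBOX):
        min_lon, min_lat, max_lon, max_lat = bbox
        return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

    def filter_trajectory(records):
        """records: iterable of dicts with 'x' (longitude) and 'y' (latitude) keys."""
        return [r for r in records if inside_bbox(r["x"], r["y"])]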
The next preprocessing step is map-matching,
which you learned about in network analysis in the fifth week.
One quick question: what would the value of map-matching be for this problem?
Basically, it improves data accuracy
by locating every point on an appropriate map link,
and it ultimately provides more analytic power.
For example, it can reveal the driving direction of a trajectory,
enable traffic volume estimation for each link, and so on.
Map-matching algorithms were already discussed in the fifth week.
The difference here is that we will use Hadoop MapReduce.
That requires splitting the whole trajectory dataset
into multiple subsets using the key-value pairs of MapReduce.
Then, for each subset, geometric map-matching is conducted.
The MapReduce code for implementing the map-matching algorithm is given here.
In the Map function, bounding box filtering is performed
and the entire dataset is split by car ID, which serves as the key.
In the Reduce function,
the point-to-curve geometric map-matching algorithm is run,
and the output of the map-matching is written to HDFS.
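The actual code is on the slide; as a simplified illustration only, a Hadoop Streaming style sketch of the same idea in Python could look like this, where the field positions, the road segment data, and the bounding box are assumptions for illustration.

    # Simplified Hadoop Streaming style sketch (not the lecture's actual MapReduce code).
    # Field positions, the segment data, and the bounding box are illustrative assumptions.
    import sys

    def mapper(stdin=sys.stdin):
        """Emit (car_id, record) pairs for points inside a rough Seoul bounding box."""
        for line in stdin:
            fields = line.rstrip("\n").split(",")
            car_id, x, y = fields[0], float(fields[1]), float(fields[2])  # assumed positions
            if 126.76 <= x <= 127.18 and 37.41 <= y <= 37.70:
                print(f"{car_id}\t{line.rstrip()}")

    def point_to_segment(px, py, seg):
        """Point-to-curve matching: distance from a point to one road segment."""
        (x1, y1), (x2, y2) = seg
        dx, dy = x2 - x1, y2 - y1
        denom = dx * dx + dy * dy
        t = 0.0 if denom == 0 else max(0.0, min(1.0, ((px - x1) * dx + (py - y1) * dy) / denom))
        cx, cy = x1 + t * dx, y1 + t * dy
        return ((px - cx) ** 2 + (py - cy) ** 2) ** 0.5

    def reducer(stdin=sys.stdin, segments=None):
        """Attach the nearest road segment ID and the distance to each record."""
        segments = segments or {}  # {segment_id: ((x1, y1), (x2, y2))}, loaded from a side file in practice
        for line in stdin:
            _, record = line.rstrip("\n").split("\t", 1)
            fields = record.split(",")
            px, py = float(fields[1]), float(fields[2])
            seg_id, dist = min(((sid, point_to_segment(px, py, seg)) for sid, seg in segments.items()),
                               key=lambda p: p[1], default=(None, None))
            print(f"{record},{seg_id},{dist}")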
You are looking at the outcome of the map-matching algorithm.
The first 11 columns are identical to
the original attributes of the input trajectory data, and two columns are added:
the line segment ID, which is the result of the map-matching,
and the distance between the trajectory point and the corresponding line segment.
Also note that only pick-up and drop-off points are selected,
because only pick-ups and drop-offs will be considered in the hotspot analysis.
The outcome of the two preprocessing steps in Hadoop MapReduce
is converted and stored in the spatial DBMS,
and a spatial join is conducted between the
drop-off/pick-up locations and the administrative district regions.
The slide presents the step-by-step procedure for this data processing.
Please note that Sqoop is used to transfer the data from
HDFS to PostgreSQL, which you learned about in the Hadoop ecosystem in the fourth week.
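As an illustration of the spatial join step, here is a minimal PostGIS sketch run from Python; the table names, column names, and connection details are hypothetical placeholders, not the actual schema from the slide.

    # Minimal sketch of the spatial join between pick-up/drop-off points and district
    # polygons in PostGIS. Table names, column names, and connection details are hypothetical.
    import psycopg2

    conn = psycopg2.connect(dbname="taxi", user="postgres")  # connection details assumed
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE point_region AS
        SELECT p.*, d.region_id
        FROM pickup_dropoff AS p
        JOIN districts AS d
          ON ST_Contains(d.geom, ST_SetSRID(ST_MakePoint(p.x, p.y), 4326));
    """)
    conn.commit()
    cur.close()
    conn.close()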
You are looking at the query that groups
drop-offs/pick-ups by region for the time of interest.
With this query, the numbers of drop-offs and pick-ups on the morning of November 26,
2016 are retrieved and exported to a text file.
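The exact query is shown on the slide; a simplified version, assuming the joined table from the previous sketch, a status column, a timestamp column, and a 06:00 am to 09:00 am morning window, might look like this.

    # Simplified version of the grouping query (table, column names, and the morning
    # window are assumptions). Counts drop-offs per region and writes a text file.
    import csv
    import psycopg2

    conn = psycopg2.connect(dbname="taxi", user="postgres")
    cur = conn.cursor()
    cur.execute("""
        SELECT region_id, COUNT(*) AS dropoffs
        FROM point_region
        WHERE status = 'drop-off'
          AND ts >= '2016-11-26 06:00:00' AND ts < '2016-11-26 09:00:00'
        GROUP BY region_id
        ORDER BY region_id;
    """)
    with open("dropoffs_by_region.txt", "w", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(cur.fetchall())
    cur.close()
    conn.close()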
You are looking at the output text file of the query.
The first column denotes the region ID
and the final column represents the number of drop-offs.
You are looking at the R code of the hotspot analysis using Getis-Ord Gi*, which is
a polygon-based hotspot analysis that you already learned in the fourth week.
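The lecture uses R for this step; as an equivalent sketch only, the same polygon-based Gi* statistic can be computed in Python with libpysal and esda, where the file names and the region_id column are assumptions.

    # Equivalent Getis-Ord Gi* sketch in Python with libpysal/esda (the lecture uses R).
    # File names and column names are illustrative assumptions.
    import geopandas as gpd
    import pandas as pd
    from libpysal.weights import Queen
    from esda.getisord import G_Local

    regions = gpd.read_file("seoul_districts.shp")                  # region polygons
    counts = pd.read_csv("dropoffs_by_region.txt", sep="\t",
                         names=["region_id", "dropoffs"])
    regions = regions.merge(counts, on="region_id", how="left").fillna({"dropoffs": 0})

    w = Queen.from_dataframe(regions)        # contiguity-based spatial weights
    w.transform = "R"                        # row-standardized weights
    gi = G_Local(regions["dropoffs"].values, w, star=True)  # Gi*: region itself included
    pd.DataFrame({"region_id": regions["region_id"], "gi_z": gi.Zs}) \
        .to_csv("dropoff_gi_star.txt", sep="\t", index=False)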
For visualization of the final result, QGIS is used.
After the hotspot analysis is completed in R, the Gi* output text file
is joined to the region shapefile, and it can then be visualized in QGIS.
You are looking at the outcome after the Gi* values and the region shapefile are joined by region ID.
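This join can also be scripted; a minimal geopandas sketch, with assumed file names and an assumed region_id column, would be:

    # Minimal sketch of joining the Gi* output text file to the region shapefile
    # so it can be styled in QGIS. File and column names are assumptions.
    import geopandas as gpd
    import pandas as pd

    regions = gpd.read_file("seoul_districts.shp")
    gi = pd.read_csv("dropoff_gi_star.txt", sep="\t")   # columns: region_id, gi_z
    joined = regions.merge(gi, on="region_id", how="left")
    joined.to_file("dropoff_hotspots.shp")              # open this layer in QGIS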
Now, let's take a look at visualizations of
the hotspot analysis for specific periods of time.
The first set of visualizations is from a regular day,
Thursday, November 24, 2016.
The first outcome of the hotspot analysis shows pick-up hotspots from 06:00 am to 09:00 am.
The second outcome shows drop-off hotspots and coldspots
during the same time, 06:00 am to 09:00 am.
The next outcome shows pick-up hotspots from 09:00 pm to midnight.
It is an interesting outcome that drop-offs in
the morning and pick-ups at night have a very similar pattern.
The next outcome shows drop-off hotspots during the same time,
09:00 pm to midnight.
Many drop-off hotspots are located on the outskirts of Seoul.
Now let's take a look at the hotspot analysis on the day of a mass protest event,
Saturday, November 26, 2016.
The first outcome shows pick-up hotspots and coldspots from 06:00 am to 09:00 am,
which is very different from a regular day.
A lot of pick-ups took place downtown, shown in red,
where the mass protest was planned to occur in the evening.
The next outcome shows drop-off hotspots and
coldspots during the same time, 06:00 am to 09:00 am.
Downtown is a drop-off coldspot,
which collectively means that people were leaving the downtown area by taxi in the morning.
The next outcome shows pick-up hotspots and coldspots from
03:00 pm to 06:00 pm, just before the event started.
Very little pick-up activity took place downtown,
and likewise, very little drop-off activity took place downtown as well.
That makes sense, because most participants of
the event used public transit to get downtown for the protest.
These outcomes can be averaged with respect to the time of
the week and a few other parameters, such as weather conditions.
The results could then be compiled and delivered to
taxi drivers through "Passenger Finder".
All right, you and I have gone through a step-by-step
procedure to come up with a solution to the question:
where are more passengers waiting for taxi cabs?
We have a solution structure and tangible outcomes for the question.
However, the proposed solution has some issues and limitations.
First of all, it does not really fit a big data system, because the data size is not that big.
Secondly, region-based hotspot analysis may not be the best solution;
the solution is somewhat basic.
For a more realistic solution,
we should consider demand and supply together, and competition as well.