Missed me? I'm going to close the loop on basic Hive skills.
This video is about DML, or data manipulation language.
Now, you have experience creating tables on
top of existing data in HDFS with Hive DDL.
You have also seen how to execute
simple HiveQL queries without investigating how they work under the hood.
In this video, you'll move data in and
out and inspect HiveQL queries a little further.
First, if you have data in an HDFS folder,
then you can move it into the Hive warehouse location with the following DML query.
I intentionally call it data moving.
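A minimal sketch of such a query, assuming a hypothetical table called logs and a made-up HDFS path:

  LOAD DATA INPATH '/user/me/raw_logs' INTO TABLE logs;

Hive simply moves the files under the table's warehouse location; nothing is parsed or validated at this point.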
Hive, unlike traditional relational databases, supports a schema-on-read strategy.
If you have a schema-on-write database,
then all the data that you try to load into
the database is automatically validated against the table schema.
To be fair, the same holds true
when you write into a Hive table with the help of a SQL query.
On the other hand,
Hive on its own doesn't validate the data
during the loading procedure that you see on the slide.
Consequently, it will throw an exception during
data reading if your data is not aligned with the table schema.
That is why it is called schema-on-read.
If you move data into a Hive table,
then it is your responsibility to make sure the data matches the table schema.
In addition to moving data from one HDFS folder to another,
you can load data into a Hive table from the local file system.
You only need to add the keyword LOCAL,
and provide a path in the local file system.
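For example, with the same hypothetical logs table and a made-up local path:

  LOAD DATA LOCAL INPATH '/home/me/raw_logs' INTO TABLE logs;

With LOCAL, the files are copied from your machine into HDFS rather than moved within HDFS.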
By default, data that you move into the Hive warehouse location
is added alongside the existing files in the HDFS folder.
If there are files with the same names,
then they will be overwritten.
If you want to truncate the HDFS folder's content during
this step, to make sure there are no obsolete pieces of data left,
then you can add the keyword OVERWRITE.
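For example, to replace the current content of the hypothetical logs table instead of appending to it:

  LOAD DATA INPATH '/user/me/raw_logs' OVERWRITE INTO TABLE logs;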
The other way around, you can export data into
a local folder or an HDFS folder with the help of HiveQL queries.
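A sketch of both variants, again with made-up paths and the hypothetical logs table; the LOCAL keyword switches between the local file system and HDFS:

  INSERT OVERWRITE LOCAL DIRECTORY '/home/me/export' SELECT * FROM logs;
  INSERT OVERWRITE DIRECTORY '/user/me/export' SELECT * FROM logs;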
And one piece of Hive functionality that differentiates it from
traditional relational databases is the multiple-insert statement.
By querying one table,
Hive can optimize the execution plan in a way where it reads
the HDFS data once and does the in-memory calculations for different SELECT statements in parallel.
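A minimal multiple-insert sketch, assuming the hypothetical logs table has a level column and that two target tables already exist:

  FROM logs
  INSERT OVERWRITE TABLE error_logs SELECT * WHERE level = 'ERROR'
  INSERT OVERWRITE TABLE warn_logs SELECT * WHERE level = 'WARN';

The source table is scanned only once, and both filters are evaluated on that single pass.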
You can create and populate one Hive table with the help of others,
and from my personal experience,
this is the most common workflow.
You are already familiar with the INSERT OVERWRITE TABLE statement.
The only change is that you don't move the data;
you generate it from other Hive tables with a HiveQL query.
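For example, populating a hypothetical daily_stats table from the logs table (both names are assumptions for illustration):

  INSERT OVERWRITE TABLE daily_stats
  SELECT ip, COUNT(*) FROM logs GROUP BY ip;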
To make a new table,
you can use a special construction, CREATE TABLE AS SELECT;
due to its popularity, it has got the special acronym CTAS.
You can take an arbitrarily complex HiveQL query and save
the outcome in a Hive table for future processing or analysis.
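A minimal CTAS sketch with the same made-up names:

  CREATE TABLE top_ips AS
  SELECT ip, COUNT(*) AS cnt FROM logs GROUP BY ip;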
So, you know all the basic components of the Hive warehouse solution.
They are the Hive metastore, the data definition language,
the data manipulation language, and the query language.
I bet you wonder how it all works on top of the MapReduce execution layer.
In the first videos,
you saw the following SQL-like HiveQL statements.
I can't make you wait any longer.
Let us go through them one by one.
In a SELECT ... FROM statement,
you only choose a subset of the existing columns.
So, you will be able to do it within the map phase.
The WHERE statement, again, is applicable during data reading.
So, it is also a map phase.
GROUP BY is implemented by the Hadoop MapReduce framework within the shuffle and sort phase,
and you can control it with the help of the partitioner.
Consequently, HAVING is applied during the reduce phase.
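To tie the clauses to the phases, here is an annotated sketch over the hypothetical logs table:

  SELECT ip, COUNT(*) AS cnt   -- column projection: map phase
  FROM logs
  WHERE level = 'ERROR'        -- filtering while reading: map phase
  GROUP BY ip                  -- grouping: shuffle and sort phase
  HAVING COUNT(*) > 10;        -- filtering groups: reduce phase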
We had an intensive discussion about joins in the first course,
and we have touched upon this question during this lesson.
There are map-side and reduce-side joins in Hive.
Therefore, a join can be calculated during either the map or the reduce phase.
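For example, you can ask Hive for a map-side join explicitly with the MAPJOIN hint, assuming a small hypothetical users dimension table that fits in memory:

  SELECT /*+ MAPJOIN(u) */ l.ip, u.name
  FROM logs l JOIN users u ON l.user_id = u.id;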
The ORDER BY statement is tricky.
You rarely need to order the whole distributed dataset.
That is why the Hive developers provide you with the SORT BY statement.
SORT BY guarantees sorting within each file in the HDFS folder,
where each file corresponds to a particular reduce worker.
Therefore, you can have an arbitrary number of
reducers and a higher level of parallelism.
ORDER BY guarantees total ordering and is therefore much more expensive.
So, please use it with care.
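Side by side, with the hypothetical top_ips table from before:

  SELECT ip, cnt FROM top_ips SORT BY cnt;   -- ordered within each reducer's output
  SELECT ip, cnt FROM top_ips ORDER BY cnt;  -- total ordering, forced into one reducer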
Want to get a detailed execution plan for your query?
You can get one with the EXPLAIN statement.
In the output, you'll see how your query is parsed into an abstract syntax tree,
how many stages there are, and what the connections between them are.
And most importantly, you'll see detailed information about each stage.
So, you should be able to figure out exactly what will happen during execution,
and where you will have map or reduce phases.
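For example, prefix any query with EXPLAIN to see its stages without running it:

  EXPLAIN
  SELECT ip, COUNT(*) FROM logs GROUP BY ip;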
Let me summarize this video's outcomes.
You know what DML stands for and how to move data into and out of the Hive warehouse.
You can populate existing tables and you can create
a new Hive table on the fly with the help of the CTAS approach.
You can explain where the map or reduce phases are used for an arbitrary HiveQL query.
Moreover, you can get a detailed execution plan for any HiveQL query.
Feel free to use official documentation when you are doing your homework.