本课程致力于介绍如何在商业分析中运用关系数据库。你将会学到关系数据库是如何工作的，以及如何运用实体-联系图来展示数据结构。这将会帮助你理解商业环境下数据需要怎样被收集，而且如果你的工作涉及到完成新数据的归类，此课程也会帮助你

Loading...

來自 Duke University 的課程

用MySQL管理大数据

1918 個評分

本课程致力于介绍如何在商业分析中运用关系数据库。你将会学到关系数据库是如何工作的，以及如何运用实体-联系图来展示数据结构。这将会帮助你理解商业环境下数据需要怎样被收集，而且如果你的工作涉及到完成新数据的归类，此课程也会帮助你

從本節課中

Queries to Summarize Groups of Data from Multiple Tables

<p>Welcome to week 3! This week, we are going to learn the SQL syntax that allows you to segment your data into separate categories and segment. We are also going to learn how to combine data stored in separate tables.</p><p>By the end of the week, you will be able to:</p><ul><li>Summarize values across entire columns, and break those summaries up according to specific variables or values in others columns using GROUP BY and HAVING clauses</li><li>Combine information from multiple tables using inner and outer joins</li><li>Use strategies to manage joins between tables with duplicate rows, many-to-many relationships, and atypical configurations</li><li>Practice one of the slightly more challenging use cases of aggregation functions, and</li><li>Work with the Dognition database to learn more about how MySQL handles mismatched aggregation levels.</li></ul><p>Make sure to watch the videos about joins, and complete both the MySQL and the Teradata exercises. At the end of the week, you will test your understanding of the SQL syntax introduced this week by completing the Week 3 graded quiz.</p><p>We strongly encourage you to use the course Discussions to help each other with questions. </p> <p>To get started, please begin with the video 'Welcome to Week 3.’</p><p>I hope you enjoy this week’s materials!</p>

- Daniel EggerExecutive in Residence and Director, Center for Quantitative Modeling

Pratt School of Engineering, Duke University - Jana Schaich BorgAssistant Research Professor

Social Science Research Institute

We've talked a lot about how relational databases work by breaking large data sets

up into smaller tables of data with unified themes.

We generally try to organize our themes so

that we replicate individual data entries as rarely as possible.

Of course,

this strategy only works if we can put the tables back together when we need to.

We achieve this using a class of SQL commands called joins.

Joins have a bad reputation among new SQL users, but

in my experiences joins are only hard to understand if

you don't have a firm grasp of how relational databases are set up.

Since you mastered the material in week one of this course,

you do have a firm grasp of how relational databases are set up.

So you will master joins very quickly.

By the end of this video, you'll understand how joins work and

why there are different kinds of them.

To understand what joins are,

let's consider the problem they are meant to solve.

Recalling the tables we made for our Egger's roast coffee database.

Let's focus on just two of those tables, the orders table and

the distribution center table, and let's zoom in a little bit.

You can see that in order for us to learn any information about the distribution

center of the items in the orders table,

you have to link each row of the orders table data to the appropriate row of

the distribution center table data by matching up the distribution location ID.

So far, we have talked about linking tables as a general concept for

organizing databases, but now we have to think about the details of

how the database can actually make that look.

Take a second to think about at least

one strategy a database could use to

link these two tables together.

One of the first strategies that might come to mind is that the database could go

through the distribution center ID column of the order's table row by row and ask,

which distribution center ID is this?

It's dl1?

Okay, in that case,

it matches up with all the data in the dl1 row of the distribution center.

You could implement this strategy in Excel with a VLOOKUP function, and

it would work very well if your data set wasn't too large.

But what if you had to copy the VLOOKUP function to 100 million

rows spread across multiple spreadsheets in Excel?

Even if you're an Excel expert, it would take a bit of time to even paste

the VLOOKUP formulas into all the appropriate cells,

although a computer is much quicker than a human at these types of operations.

A similar concept still holds for a database.

Operations that make databases look information up

are always relatively slow compared to other types of mathematical operations.

And the more times a database has to look up that information or the wider the pool

of information it has to search, the slower the database query will be.

So the row by row strategy is going to be inefficient.

Is there another strategy we could use?

For those of you who've done a lot of analysis using MATLAB or

R software you might be thinking isn't there someway we could do this using

a matrix algebra if we treat each table as a matrix?

Well I sympathize with that statement, but

I'm not sure what the strategy would be, given that in most cases the tables you

will be combining will not be of the same length or width.

Here's where set theory comes to the rescue.

The theoretical way set theory combines two sets is by multiplying them,

resulting in what is called the Cartesian product, or cross product.

In the case of sets, multiplication doesn't mean multiplying numerical values.

The Cartesian product is a set that contains all the possible pairs of items

in two tables.

If this was our orders table, or set, and this was our distribution table, or

set, this would be the Cartesian product of the two sets.

You can see that the Cartesian product gives every possible combination

of the rows.

The way databases combine tables is by first filtering the columns you are using

to link tables together by the criteria you specify in your query,

then it makes the Cartesian products of these filtered columns.

And finally, it appends any additional columns that you asked for in your query.

So the way you can join tables in your queries to get data you want is to specify

I only want the Cartesian product of rows that have the same IDs and

a unique ID column.

You need to know about cartesian products because sometimes you get

unexpected outputs when you join tables that have duplicate rows or

columns that have many to many relationships.

These outputs will only make sense if you remember the fundamental notion

that Cartesian products underline joins.

We're going to work our way up to understanding those types of situations.

But let's start by thinking about joins in the perfect idyllic situation when you

have no duplicate rows, and

each row in one table is only connected to one row on another table.

Imagine that you have the following two tables that are linked by department ID.

There are at least four different ways you can combine these tables.

The first would be if you wanted to find out all the employees who work in

a current department.

In this case, you want a list of all the employee rows which have

matching department IDs in the department table.

This type of join, where a row is only included if its value is contained in both

unique columns of a table is called an inner join.

The result of an inner join of these two tables will look like this.

Since there is no department ID of 4 or 5 in the department table, and

no department ID of 3 in the employee table, the name Jessica, the name Daisy,

and the department of furniture will be left out of the result.

Also notice that the employee Brian and the department Cosmetics were dropped from

the results because NULL values can never be used to link rows in a join.

Null values are not linked up to other linked values.

They're simply excluded.

What if you wanted a list of all the employees and their departments regardless

of whether the employee's department was in the current department table?

In this case, you could use an outer join.

When you write an outer join,

the order you enter the tables into your query matters.

The name of the join defines which table will have all of it's values included.

Let's assume we will enter the employee table first and

the apartment table second in our query.

In this case, the result of the LEFT outer join of these two tables would result in

all of the values of the table on the left,

in the Employees table being included.

Values in the table on the right would only be included if they have

a value in their linking column that matches up with the linking column in

the table on the left.

Here's what a result of a left join on our tables would look like.

You see here that all of our employees are included in the output, but

the only departments that are included are those that have a matching

value in the Department ID column.

Notice that wherever there was a row in the employee table that didn't have

a matching department ID in the Department table, the department is listed as NULL.

The road that had a null value in the employees table is included in the output

table this time because there is no criterion that it had to much up with

a value in another column.

You could also retrieve a list of all the current departments and

their employees, regardless of whether any employees in the employee table

were pointed to a department in the department table.

If we once again include the employee table first and the department table

second in our query, we could acquire this list using a right outer join.

In right joins, all the results of the table on the right are included.

Values of the table on the left are included only if they have a value in

their linking column that matches up with the linking column in table on the right.

So the results would look like this.

This time, all of the department rows are included in the output.

However, the furniture row has a null value in the employee table,

because there is no department ID 3 listed anywhere in the department ID

column of the employees table.

The row of data in the department table had a null value and

department ID is included in the output.

On the other hand, the rows of data from the employee table that do not have

department IDs that appear in the department table are excluded.

At this point you might be thinking, wait, couldn't I get the same result

if I ran the join we just described using a left join but

reversed the order of the tables I enter into the query?

The answer is yes, yes you could.

Left and right outer joins are fairly redundant.

They just give you flexibility in how you write your query.

A left outer join with the tables written in this order gives you the same result

as a right outer join with the tables written in this order.

There is one more type of join we haven't talked about.

Full outer joins.

Full outer joins would give you all the employees and departments list in one or

the other table,

regardless of whether they match up with values in the other table.

You might use this if you simply wanted to extract all of the data that meet

a certain criterion, so

that you can examine every single piece of raw data in your analysis.

The result of a full outer join between our two tables would look like this.

You can see all rows are included, and whenever there was a row that didn't have

a matching value in the department ID column, no values are entered.

One important thing to know is that not all database management systems support

full outer joins.

MySQL in particular does not support full joins while Teradata does.

The fact that not all database management systems support full outer joins gives you

a sense of how rarely they are used.

So those are the basic concepts behind joins.

They are summarized in this diagram that you might want to keep for

future reference.

Not too bad so far, right, I agree.

That said, so far we have only looked at the results of joining tables that have

columns that have only one match per value in their linking column.

The results do get slightly more complicated when you apply outer joints to

tables who relationships have cardinality constraints that are greater than one.

Or if some of your tables have duplicate rows.

I hope you will join me and your tables in the next video to find out why.