Now you probably created a new notebook and you copied and pasted it.

Another thing that we could do is just go ahead and

clear all the cells out of this notebook.

So at this point,

this is just the original notebook with none of the outputs there.

And at this point, I can click on each cell in turn and

run it by pressing Shift+Enter, so that's what I'm going to be doing.

And it's going to show you which cell is being executed, so

this is the cell that's being executed.

So I'll do that, and now I have just imported the data. The next thing that I

want to do: I know that there is this table called nyc-tlc.green.trips_2015.

nyc-tlc is the project name,

green is the dataset, and trips_2015

is the table name, and I want to query the schema of that.

So I basically go ahead and Shift+Enter that, and

we find out that it has columns named pickup_datetime,

dropoff_datetime, store_and_fwd_flag, and so on.

This is all the information in the taxicab data.
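As a rough sketch of what that schema lookup could look like outside the notebook: the snippet below assumes the `google-cloud-bigquery` client library (not necessarily what this notebook uses), working credentials, and that the table is still reachable as `nyc-tlc.green.trips_2015`. The helper name `describe_schema` is made up for illustration.

```python
# Table ID read as project.dataset.table; an assumption based on the names in the walkthrough.
TABLE_ID = "nyc-tlc.green.trips_2015"


def describe_schema(table_id=TABLE_ID):
    """Return (column name, column type) pairs for a BigQuery table."""
    # Imported lazily so this file still loads without the client library installed.
    from google.cloud import bigquery

    client = bigquery.Client()  # assumes default credentials and project are configured
    table = client.get_table(table_id)
    return [(field.name, field.field_type) for field in table.schema]
```

Calling `describe_schema()` with credentials in place should list the columns mentioned here, such as the pickup and dropoff timestamps.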

The next thing that I want to do is look at

what days of the year are available.

So I'm just going to do a quick query:

get the pickup_datetime, get the day of the year,

call that daynumber, from this New York taxicab data, limiting ourselves

to 10 items in that list, and this is now going to run the query.

And we see that there's day number 120, day number 150, 150, 150.

So there are three rides that all happened on the 150th day of the year,

and so on.

So each of these day numbers corresponds to one of the rides.

There are millions of these rides; we're just showing ten of them.
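A quick query like the one just described might look roughly like this. It is a sketch in BigQuery standard SQL, assuming the column is called `pickup_datetime` and the table ID is `nyc-tlc.green.trips_2015`; the function names and the use of the `google-cloud-bigquery` client are my own choices, not the notebook's.

```python
def daynumber_sample_sql(table_id="nyc-tlc.green.trips_2015", limit=10):
    """Build a query that returns the day of the year for a handful of rides."""
    return (
        "SELECT EXTRACT(DAYOFYEAR FROM pickup_datetime) AS daynumber\n"
        f"FROM `{table_id}`\n"
        f"LIMIT {int(limit)}"
    )


def run_query(sql):
    """Run a query and return the result as a pandas DataFrame (needs credentials)."""
    from google.cloud import bigquery  # lazy import; assumed client library

    return bigquery.Client().query(sql).to_dataframe()


print(daynumber_sample_sql())
```

Each returned row is one ride, which is why the same day number can show up several times.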

Next thing that we can do is we can make a SQL module.

So I'm going to make a module called taxiquery, and

I'm going to get the day of the year.

And for every day, I'm going to get the total number of trips on

that day because I'm grouping this by the day number.

So if I do this, Shift+Enter,

and then I run the query with bq.query,

we see that on day number 1, which is January 1,

there were 62,943 trips.

On day number 2, there were 43,000, so there were obviously a lot

more trips on January 1, probably after a party night the previous day.

January 3rd, 53,866. You can see that there

is a difference here: the number of rides just changes from day to day.
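The per-day count could be sketched as a grouped query. This uses a plain Python function instead of the notebook's SQL module, and it assumes each year's rides live in a table named `trips_<year>` under `nyc-tlc.green`; the output column name `numtrips` is also an assumption.

```python
def trips_per_day_sql(year=2015):
    """Build a query that counts rides for each day of the year."""
    year = int(year)  # keeps arbitrary text out of the table name
    return (
        "SELECT EXTRACT(DAYOFYEAR FROM pickup_datetime) AS daynumber,\n"
        "       COUNT(1) AS numtrips\n"
        f"FROM `nyc-tlc.green.trips_{year}`\n"
        "GROUP BY daynumber\n"
        "ORDER BY daynumber"
    )


print(trips_per_day_sql(2015))
```

Grouping by `daynumber` is what turns one row per ride into one row per day.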

And we can basically ask what the average number of trips on any day is, so

that's the average; np is NumPy.

Notice that we're able to use trips, which is a pandas DataFrame, and

we're able to use all the Python data analysis tools very straightforwardly.

And we learn that there's an average of around 55,000 trips every day.
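To make the averaging step concrete, here is a small sketch. The DataFrame is a hand-typed stand-in that holds only the three daily counts quoted earlier (the day 2 figure is rounded), so its mean will not match the roughly 55,000 you get from the full year. The column name `numtrips` is assumed.

```python
import numpy as np
import pandas as pd

# Stand-in for the query result: just the three counts mentioned in the walkthrough.
trips = pd.DataFrame({"daynumber": [1, 2, 3], "numtrips": [62943, 43000, 53866]})

# NumPy functions accept a pandas column directly.
avg = np.mean(trips["numtrips"])
print(f"Average trips per day in this sample: {avg:.0f}")
```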

Now, another dataset that exists is weather_gsod.

This is the weather data, and we can go ahead and

find all the stations in New York where the name contains La Guardia.

La Guardia is one of the airports in New York City, and

we find that there is a station at LaGuardia Airport.

That's the latitude and longitude, and it's in New York.
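The station lookup might be written along these lines. The fully qualified table name is left as a parameter because it is not spelled out here, and the column names `state` and `name` are guesses about the stations table's schema; `my-project` in the example call is a placeholder.

```python
def station_lookup_sql(stations_table, state="NY", name_fragment="LA GUARDIA"):
    """Build a query for stations in one state whose name contains a fragment."""
    # Very simple guard for this sketch; real code should use query parameters.
    for value in (stations_table, state, name_fragment):
        if "'" in value or "`" in value:
            raise ValueError("quotes are not allowed in this simple sketch")
    return (
        "SELECT *\n"
        f"FROM `{stations_table}`\n"
        f"WHERE state = '{state}'\n"
        f"  AND name LIKE '%{name_fragment}%'"
    )


# Placeholder table ID, for illustration only.
print(station_lookup_sql("my-project.weather_gsod.stations"))
```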

So now that we know the weather station number,

we can basically go ahead and put that number in,

725030, to go ahead and get the minimum and maximum daily temperatures.

So what we're getting is the day number.

We are basically getting the day of the week,

we're getting the minimum temperature, the maximum temperature, and

the amount of rainfall from the New York airport.

So that is my SQL query; I'm going to run it and get it into my weather DataFrame.
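A weather query of this shape could be sketched as below. Only the station number 725030 and the year come from the walkthrough. The table is a parameter, and the column names (`stn`, `mo`, `da`, `min`, `max`, `prcp`), the 99.99 missing-rain sentinel, and the descending sort are assumptions about the weather table rather than confirmed details.

```python
def daily_weather_sql(gsod_table, station="725030", year=2015):
    """Build a query for per-day min/max temperature and rain at one station."""
    year = int(year)
    if not station.isdigit():
        raise ValueError("station id should be digits only")
    # Assumed: month and day are stored as zero-padded strings in columns mo and da.
    day = f"CAST(CONCAT('{year}', '-', mo, '-', da) AS TIMESTAMP)"
    return (
        f"SELECT EXTRACT(DAYOFYEAR FROM {day}) AS daynumber,\n"
        f"       EXTRACT(DAYOFWEEK FROM {day}) AS dayofweek,\n"
        "       MIN(min) AS mintemp,\n"
        "       MAX(max) AS maxtemp,\n"
        "       MAX(IF(prcp = 99.99, 0, prcp)) AS rain\n"  # 99.99 assumed to mean "missing"
        f"FROM `{gsod_table}`\n"
        f"WHERE stn = '{station}'\n"
        "GROUP BY daynumber, dayofweek\n"
        "ORDER BY daynumber DESC"  # assumed, so the last day of the year comes first
    )


# Placeholder table ID, for illustration only.
print(daily_weather_sql("my-project.weather_gsod.gsod2015"))
```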

And we basically see that on the last day of the year, the day of the week was five,

the minimum temperature was 46, and the maximum temperature was 48.

The day of the week is five: Sunday is one, Monday is two, etc., right?

And then we basically had 0.17 inches of rain. And the 364, 363, etc., these are each

of the days of the year 2015, because that's the year we queried.
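The numbering convention is easy to check with plain Python. This sketch reproduces the day number and the Sunday-is-one day of the week for a calendar date; the helper name `day_features` is made up for illustration.

```python
from datetime import date


def day_features(d):
    """Return (daynumber, dayofweek) with Sunday=1 through Saturday=7."""
    daynumber = d.timetuple().tm_yday
    # isoweekday() gives Monday=1 ... Sunday=7; shift it so Sunday becomes 1.
    dayofweek = d.isoweekday() % 7 + 1
    return daynumber, dayofweek


print(day_features(date(2015, 12, 31)))  # (365, 5): the last day of 2015 was a Thursday
```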

And now we have two datasets: we have one dataset which is the trips,

we have another dataset which is the weather.

Let's go ahead and combine both of them on the daynumber.

So this is the pandas call, pd is pandas, so

I'm just going to merge them, and that's my new dataset.

For every daynumber, I have the day of the week,

I have the temperature, I have the rain, I also have the number of trips.

So these (day of the week, temperature, rain) are my predictors in the machine learning model,

and this, the number of trips, is the thing that I'm trying to predict.

I want to be able to predict, given the weather, how many trips will happen

in New York City on a particular day.
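Here is a small sketch of that merge. The trip counts are the three quoted earlier and the days of the week are the real ones for January 1 to 3, 2015, but the temperature and rain values are placeholders I typed in, not real measurements. The column names are assumptions too.

```python
import pandas as pd

trips = pd.DataFrame({"daynumber": [1, 2, 3], "numtrips": [62943, 43000, 53866]})

weather = pd.DataFrame({
    "daynumber": [1, 2, 3],
    "dayofweek": [5, 6, 7],        # Thursday, Friday, Saturday (Sunday = 1)
    "mintemp": [30.0, 35.0, 33.0],  # placeholder values, not real measurements
    "maxtemp": [40.0, 42.0, 41.0],  # placeholder values, not real measurements
    "rain": [0.0, 0.0, 0.5],        # placeholder values, not real measurements
})

# Inner join on the shared key: one row per day present in both frames.
data = pd.merge(weather, trips, on="daynumber")

predictors = data[["dayofweek", "mintemp", "maxtemp", "rain"]]
target = data["numtrips"]
print(data)
```

Splitting the merged frame into `predictors` and `target` mirrors the inputs and the quantity to predict.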

So we can look at this data again; we're now using Python plotting programs.

We can look at the data, we can plot it as a scatter plot, and

we find that there is actually a pretty good relationship between the day of

the week and the number of trips.

There are a lot more trips on weekends than on weekdays, and more trips towards

the end of the week than the beginning of the week.

It seems that on Mondays, Tuesdays, and Wednesdays people like to walk.

On Thursdays and Fridays people prefer to take the taxi, go figure.
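A scatter plot like that can be drawn straight from a DataFrame. This is a sketch assuming pandas with matplotlib installed, using only the three daily counts quoted earlier, so it shows the mechanics and not the actual weekday pattern.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook

import pandas as pd

# Three days only: January 1 to 3, 2015 (Thursday, Friday, Saturday; Sunday = 1).
data = pd.DataFrame({"dayofweek": [5, 6, 7], "numtrips": [62943, 43000, 53866]})

# pandas hands the drawing to matplotlib and returns the Axes object.
ax = data.plot(kind="scatter", x="dayofweek", y="numtrips")
```

In a notebook the figure shows up under the cell; elsewhere you could call `ax.figure.savefig(...)` with a path of your choosing.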

So we can also add in 2014 data and

we can concatenate that data; let's go ahead and do that.
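Concatenating the two years might look like this sketch. Since no 2014 numbers are quoted, the 2014 frame below simply reuses the 2015 rows as a stand-in. The `year` column and the helper name `combine_years` are my own additions for illustration.

```python
import pandas as pd


def combine_years(frames_by_year):
    """Stack per-year DataFrames into one, tagging each row with its year."""
    tagged = [df.assign(year=year) for year, df in sorted(frames_by_year.items())]
    return pd.concat(tagged, ignore_index=True)


data_2015 = pd.DataFrame({"daynumber": [1, 2], "numtrips": [62943, 43000]})
data_2014 = data_2015.copy()  # stand-in only; not real 2014 counts

combined = combine_years({2015: data_2015, 2014: data_2014})
print(combined)
```

Keeping a `year` column is optional, but it makes it possible to tell the two years apart after stacking.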

And at this point now, we'll have our data set and

we are ready to carry out machine learning.

And that's something that we're going to do in the next section.