In this video, we'll be talking about descriptive statistics.

When you begin to analyze data,

it's important to first explore your data

before you spend time building complicated models.

One easy way to do so is to

calculate some descriptive statistics for your data.

Descriptive statistical analysis helps

to describe basic features of

a dataset and obtains

a short summary about the sample and measures of the data.

Let's show you a couple different useful methods.

One way in which we can do this is by

using the describe function in pandas.

Using the describe function and applying it on your DataFrame,

a describe function automatically computes

basic statistics for all numerical variables.

It shows the mean, the total number of data points,

the standard deviation, the quartiles,

and the extreme values.

Any NaN values are automatically skipped in these statistics.

This function will give you a clearer idea of

the distribution of your different variables.

You could have also categorical variables in your dataset.

These are variables that can be divided up into

different categories or groups and have discrete values.

For example, in our dataset,

we have the drive system as a categorical variable,

which consists of the categories,

forward-wheel drive, rear-wheel drive, and four-wheel drive.

One way you can summarize the categorical data is

by using the function value underscore counts.

We can change the name of the column to make it easier to read.

We see that we have 118 cars in the front-wheel drive category,

75 cars in the rear-wheel drive category,

and eight cars in the four-wheel drive category.

Box plots are great way to visualize numeric data,

since you can visualize the various distributions of the data.

The main features that the box plot

shows are the median of the data,

which represents where the middle data point is.

The upper quartile shows where the 75th percentile is.

The lower quartile shows where the 25th percentile is.

The data between the upper and lower quartile

represents the interquartile range.

Next, you have the lower and upper extremes.

These are calculated as 1.5 times the interquartile range above

the 75th percentile and as

1.5 times the IQR below the 25th percentile.

Finally, box plots also display outliers as

individual dots that occur outside the upper and lower extremes.

With box plots, you can easily spot

outliers and also see the distribution and skewness of the data.

Box plots make it easy to compare between groups.

In this example, using box plot,

we can see the distribution of different categories

of the drive wheels feature over price feature.

We can see that the distribution of price between

the rear-wheel drive and the other categories are distinct,

but the price for front-wheel drive and

four-wheel drive are almost indistinguishable.

Oftentimes, we tend to see continuous variables in our data.

These data points are numbers contained in some range.

For example, in our dataset,

price and engine size are continuous variables.

What if we want to understand

the relationship between engine size and price?

Could engine size possibly predict the price of a car?

One good way to visualize this is using a scatter plot.

Each observation in the scatter plot is represented as a point.

This plot shows the relationship between two variables.

The predictor variable is

the variable that you are using to predict an outcome.

In this case, our predictor variable is the engine size.

The target variable is

the variable that you are trying to predict.

In this case, our target variable is the price,

since this would be the outcome.

In a scatter plot,

we typically set the predictor variable on

the x-axis or horizontal axis,

and we set the target variable on the y-axis or vertical axis.

In this case, we will thus plot the engine size

on the x-axis and the price on the y-axis.

We are using the matplotlib function scatter here,

taking an x and a y variable.

Something to note is that it's always important to

label your axis and write a general plot title,

so that you know what you're looking at.

Now, how is the variable engine size related to price?

From the scatter plot, we see that as the engine size goes up,

the price of the car also goes up.

This is giving us an initial indication that there is

a positive linear relationship between these two variables.