For data with lots and lots of observations, you often want to omit the outliers, because otherwise you get this big mash of black from over-plotting and you can't see individual points. And there's no point in calling them outliers anymore if there are hundreds of them.

Here's an example of a bad box plot.

I'll just give you some R code to generate a bad box plot. The box is all squished together, and there are too many outliers being displayed; the fact that there are so many outliers means that all the interesting aspects, the meat of the data, are obscured by the outlier points.
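The lecture's own code isn't shown here, so as a sketch with assumed data: a heavy-tailed sample makes the default box plot flag hundreds of points as "outliers" and squishes the box.

```r
# Assumed illustration (not the lecture's code): heavy-tailed data produces
# hundreds of "outliers" under the default 1.5*IQR rule
set.seed(1)
x <- rt(10000, df = 2)          # t-distribution with 2 df: heavy tails
length(boxplot.stats(x)$out)    # hundreds of points flagged as outliers
boxplot(x)                      # box squished, outliers mash together
boxplot(x, outline = FALSE)     # one fix: suppress the outlier points
```

Setting `outline = FALSE` is one simple way to keep the box readable when the "outliers" are really just the tails of a large sample.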

So that's box plots.

So box plots give you a kind of density estimate by grabbing a bunch of quantiles from the data and summarizing them.

Kernel density estimates, on the other hand, are direct density estimates in

the same way histograms are direct density estimates.

But kernel density estimates maybe are a little bit better.

And the idea is that you're weighting observations according to a kernel (in most cases that kernel is a Gaussian density), and then you have to pick a parameter, called the bandwidth, that determines how smooth or jiggly your density estimate is going to be.

And your density estimate is itself a statistical estimate.

It has variability that you should probably investigate as well.

And you should investigate how the bandwidth impacts that variability and

the estimate itself.
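To make the bandwidth's effect concrete, here's an illustration of my own (the specific values are assumed, not from the lecture): the same data with a small and a large bandwidth gives a jiggly versus an over-smoothed estimate.

```r
# Assumed illustration: same data, two bandwidths
data(faithful)
d_small <- density(faithful$eruptions, bw = 0.05)  # jiggly estimate
d_large <- density(faithful$eruptions, bw = 1)     # heavily smoothed estimate
plot(d_small)
lines(d_large, lty = 2)   # overlay the smooth one for comparison
```

Too small a bandwidth chases noise in the sample; too large a bandwidth can smooth away real features, like the two modes in this data.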

But this isn't something unique to kernel density estimates. For example, if you take a histogram, the width and construction of the bins play the same role as the bandwidth, so you still have that tuning parameter to work with.

And again,

the width and the number of bins in a histogram can impact what it looks like.
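As an illustration of that point (my example, not from the lecture), the suggested number of bins in `hist` plays the same tuning role as the bandwidth:

```r
# Assumed illustration: bin count plays the bandwidth's role in a histogram.
# Note that `breaks` is a suggestion; R's pretty() picks the actual bin edges.
data(faithful)
par(mfrow = c(1, 2))
hist(faithful$eruptions, breaks = 5,  main = "coarse bins")  # smooth but crude
hist(faithful$eruptions, breaks = 50, main = "fine bins")    # jiggly
```

Just as with the bandwidth, coarse bins can hide features and fine bins can show noise.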

But in addition, a histogram is also an estimate with noise. So kernel density estimates, histograms, and so on are all statistical estimates that have variation, and it's maybe unfortunate (and I'm going to do this as well) that when you plot these things, you don't explicitly acknowledge the uncertainty in the density estimate.

So that is kind of a problem.

But maybe the solutions to it are a little bit above the discussion for this class.

So anyway, the R function density can be used to create a density estimate.

So here are the eruption durations and the waiting times in minutes between eruptions of the Old Faithful geyser in Yellowstone National Park.

You can grab this data by just doing data(faithful); d is the density estimate, this bandwidth argument gives a specific rule for selecting the bandwidth of the density estimate, and then plot creates the plot.
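A reconstruction of the snippet being described might look like this; the lecture doesn't show which bandwidth rule was used, so `bw = "SJ"` (the Sheather-Jones rule) is an assumed choice:

```r
# Reconstruction of the described snippet; bw = "SJ" is an assumed
# bandwidth-selection rule, since the lecture doesn't specify one
data(faithful)
d <- density(faithful$eruptions, bw = "SJ")
plot(d)   # the plot labels report the bandwidth that was actually used
```

Printing `d` also shows the sample size and the numeric bandwidth the rule produced.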

So there's our density estimate, and

it actually gives you the specific bandwidth it used.

And you can see there's an incredibly obvious feature in this data set, at around 4.5 minutes, let's say: the eruptions seem to occur in two clusters of durations.

But you also get a sense of the variation around those eruption times as well.

So anyways, kernel density estimates are a nice way to estimate a density, and maybe are an improvement over a histogram, I think, by smoothing out the data.