and this is sometimes also called the development set.

And for brevity I'm just going to call this the dev set, but

all of these terms mean roughly the same thing.

And then you might carve out some final portion of it to be your test set.

And so the workflow is that you keep on training algorithms on your training set,

and use your dev set, or hold-out cross-validation set, to see which

of many different models performs best on your dev set.

And then after having done this long enough,

when you have a final model that you want to evaluate,

you can take the best model you have found and evaluate it on your test set,

in order to get an unbiased estimate of how well your algorithm is doing.

So in the previous era of machine learning, it was common practice

to take all your data and split it according to maybe a 70/30 ratio,

what people often call the 70/30 train/test split

if you don't have an explicit dev set, or maybe a 60/20/20%

split in terms of 60% train, 20% dev and 20% test.

And several years ago this was widely considered best practice

in machine learning.

If you have maybe 100 examples in total,

maybe 1,000 examples, or maybe 10,000 examples,

these sorts of ratios were perfectly reasonable rules of thumb.
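As a concrete sketch of that traditional split, here is a minimal Python example. The function name, parameters, and the 0.2/0.2 defaults (which give the 60/20/20 split mentioned above) are illustrative choices, not something prescribed by the lecture:

```python
import random

def train_dev_test_split(n_examples, dev_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle example indices, then carve out train/dev/test index lists.

    The 0.2/0.2 defaults mirror the traditional 60/20/20 split;
    all names here are illustrative.
    """
    rng = random.Random(seed)
    idx = list(range(n_examples))
    rng.shuffle(idx)                      # shuffle before carving out splits
    n_dev = int(n_examples * dev_frac)
    n_test = int(n_examples * test_frac)
    test_idx = idx[:n_test]
    dev_idx = idx[n_test:n_test + n_dev]
    train_idx = idx[n_test + n_dev:]      # everything else goes to training
    return train_idx, dev_idx, test_idx

# With 10,000 examples this yields 6,000 train / 2,000 dev / 2,000 test.
train_idx, dev_idx, test_idx = train_dev_test_split(10_000)
```

Shuffling first matters: if the data file is sorted in any way, slicing without shuffling would give the three sets different distributions.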

But in the modern big data era, where, for example,

you might have a million examples in total, then the trend is that your dev and

test sets have become a much smaller percentage of the total.

Because remember, the goal of the dev set or the development set is that you're

going to test different algorithms on it and see which algorithm works better.

So the dev set just needs to be big enough for

you to evaluate, say, two different algorithm choices or

ten different algorithm choices and quickly decide which one is doing better.

And you might not need a whole 20% of your data for that.

So, for example, if you have a million training examples you might decide that

just having 10,000 examples in your dev set is more than enough

to evaluate which of, say, one or two algorithms does better.

And in a similar vein, the main goal of your test set is, given your final

classifier, to give you a pretty confident estimate of how well it's doing.

And again, if you have a million examples, you might decide that 10,000

examples is more than enough in order to evaluate a single classifier and

give you a good estimate of how well it's doing.

So in this example where you have a million examples,

if you need just 10,000 for your dev and 10,000 for your test,

your ratio will be more like this: 10,000 is 1% of 1 million, so

you'll have 98% train, 1% dev, 1% test.

And I've also seen applications where,

if you have even more than a million examples, you might end up

with 99.5% train and 0.25% dev, 0.25% test.

Or maybe 0.4% dev and 0.1% test.

So just to recap, when setting up your machine learning problem,

I'll often split it up into train, dev and test sets, and

if you have a relatively small dataset, these traditional ratios might be okay.

But if you have a much larger dataset, it's also fine to set your dev and

test sets to be much smaller than 20% or even 10% of your data.

We'll give more specific guidelines on the sizes of dev and

test sets later in this specialization.

One other trend we're seeing in the era of modern deep learning is that more and

more people train on mismatched train and test distributions.

Let's say you're building an app that lets users upload a lot of pictures and

your goal is to find pictures of cats in order to show your users.

Maybe all your users are cat lovers.

Maybe your training set comes from cat pictures downloaded off the Internet, but

your dev and test sets might comprise cat pictures from users using your app.

So maybe your training set has a lot of pictures crawled off the Internet but

the dev and test sets are pictures uploaded by users.

Turns out a lot of webpages have very high resolution, very professional,

very nicely framed pictures of cats.

But maybe your users are uploading blurrier,

lower-res images just taken with a cell phone camera in more casual conditions.

And so these two distributions of data may be different.

The rule of thumb I'd encourage you to follow in this case is to

make sure that the dev and test sets come from the same distribution.
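A minimal sketch of that rule of thumb (the function and file names are hypothetical, not from the lecture): the training set may come from the web-crawl distribution, but dev and test are both carved from the same pool of user uploads, so they match each other:

```python
import random

def make_splits(web_examples, user_examples, dev_frac=0.5, seed=0):
    """Train on web-crawled pictures, but draw dev AND test from the
    same user-upload pool, so those two sets share one distribution.

    All names and the 50/50 dev/test carve are illustrative assumptions.
    """
    rng = random.Random(seed)
    user = list(user_examples)
    rng.shuffle(user)                     # shuffle before carving dev/test
    n_dev = int(len(user) * dev_frac)
    train = list(web_examples)            # mismatched train distribution is OK
    return train, user[:n_dev], user[n_dev:]

train, dev, test = make_splits(
    [f"web_cat_{i}.jpg" for i in range(100)],   # high-res crawled images
    [f"user_cat_{i}.jpg" for i in range(20)],   # blurrier phone uploads
)
```

Here dev and test partition the user uploads between them, so tuning on dev and reporting on test measure performance on the same distribution, which is the one your users actually care about.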