I think the key challenge in pretty much any data analysis was well characterized by Dan Meyer, a mathematics educator who taught high school mathematics.

In his TED talk he said, "Ask yourselves, what problem have you solved, ever, that was worth solving, where you knew all the given information in advance? Where you didn't have a surplus of information and have to filter it out, or you had insufficient information and had to go find some?"

And so I think that's a key element of data analysis: typically, you don't have all the facts, or you have too much information and have to go through it, and a lot of the process of data analysis is sorting through all this stuff.

And so the first part, the important part of data analysis that you want to start with, is to define a question.

And not every data analysis starts with a very specific or coherent question.

But the more effort you can put into coming up with a reasonable question, the less effort you'll spend having to filter through a lot of stuff.

And the reason why is that defining a question is the most powerful dimension reduction tool you can ever employ.

Because if you're interested in a specific variable, like height or weight, then you can remove a lot of other variables that don't really pertain to that at all.

But if you're interested in a different type

of variable then you can remove another subset.

And so the idea is, if you can narrow down your question as specifically as possible, then that will serve to reduce the noise that you'll have to deal with when you're going through a potentially very large data set.
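
To make the idea of a question acting as a dimension reduction tool a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the records, the column names, the wording of the questions, and the helper `reduce_to_question` are all hypothetical. It only shows that once a question is fixed, most of the recorded variables can be set aside.

```python
# Hypothetical records: each subject has many measured variables.
records = [
    {"id": 1, "height_cm": 170, "weight_kg": 65, "eye_color": "brown",
     "zip_code": "21205", "shoe_size": 42},
    {"id": 2, "height_cm": 182, "weight_kg": 80, "eye_color": "blue",
     "zip_code": "10001", "shoe_size": 45},
    {"id": 3, "height_cm": 158, "weight_kg": 54, "eye_color": "green",
     "zip_code": "94110", "shoe_size": 37},
]

# Hypothetical mapping from a question to the variables it actually needs.
QUESTION_VARIABLES = {
    "Is height related to weight?": ["height_cm", "weight_kg"],
    "Does shoe size track height?": ["height_cm", "shoe_size"],
}


def reduce_to_question(rows, question):
    """Keep only the variables that pertain to the chosen question."""
    keep = QUESTION_VARIABLES[question]
    return [{name: row[name] for name in keep} for row in rows]


reduced = reduce_to_question(records, "Is height related to weight?")

# Six recorded variables per subject shrink to the two the question needs.
print(len(records[0]), "variables before,", len(reduced[0]), "after")
print(reduced[0])
```

A different question selects a different subset, which is the point: the narrowing comes from the question, not from the data.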

Now sometimes you just want to look at a data set and see what's inside it.

And then you'll have to explore all kinds of things in a large data set.

But if you can narrow down your interest to a specific type of question, then that can be extremely useful for simplifying your problem.

So I encourage you to think about what type of question you're interested in answering before you go delving into all the details of your data set.

So, the science, generally speaking, will determine

what type of question you're interested in asking.

And that will lead you to the data.

Which may lead you to applied statistics, which you use to analyze the data.

And then if you get really ambitious, you might want to think of some theoretical statistics that will generalize the methods that you apply to different types of data.

Now of course there are relatively few people who can do that, and so that would not be expected of everyone.

So the part that's in the red bracket, that's number one, is typically referred to as statistical methods development.

The part that's in the purple bracket here, number two, which is just the application of statistics to raw data without any sense of the science, is what I would refer to as the danger zone, a term I take from a Venn diagram of data science created by Drew Conway.

The idea is, if you just randomly apply statistical methods to data sets to find an interesting answer, you will almost certainly find something interesting, but it may not be reproducible and it may not be really meaningful.
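
As a rough illustration of why undirected searching finds "interesting" results that may not hold up, here is a small simulation sketch in Python. All of it is an assumption for illustration: the sample size, the number of variables, the seed, and the helper names are made up, and every variable is pure random noise by construction. It picks the noise variable most correlated with a noise outcome, then checks that same variable on a fresh draw of data.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def noise_dataset(rng, n_obs, n_vars):
    """Return an outcome and n_vars predictors that are all unrelated noise."""
    outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
    predictors = [[rng.gauss(0, 1) for _ in range(n_obs)]
                  for _ in range(n_vars)]
    return outcome, predictors


rng = random.Random(0)      # fixed seed so the run is repeatable
N_OBS, N_VARS = 30, 200     # illustrative sizes, not taken from the lecture

# "Analysis without a question": scan every variable and keep the best one.
outcome, predictors = noise_dataset(rng, N_OBS, N_VARS)
correlations = [pearson(p, outcome) for p in predictors]
best = max(range(N_VARS), key=lambda j: abs(correlations[j]))
best_r = correlations[best]

# Reproducibility check: same variable index, freshly generated data.
new_outcome, new_predictors = noise_dataset(rng, N_OBS, N_VARS)
replicated_r = pearson(new_predictors[best], new_outcome)

print(f"strongest correlation found by scanning: {best_r:+.2f}")
print(f"same variable on fresh data:             {replicated_r:+.2f}")
```

Because the data are noise, any strong correlation found in the first pass is an artifact of having looked at many variables, so there is no reason to expect it to reappear in the fresh data.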

And so I think a proper data analysis has a scientific context, and it hopefully has at least some general question that we're trying to investigate, which will narrow down the dimensionality of the problem.

And then we'll apply the appropriate statistical methods to the appropriate data.