It comes up because of different types,

the way they're organized,

and the multiple channels of communication.

We may be getting inputs

from many different channels

and we are trying to coordinate them.

So that's part of the data complexity issue.

Now, increasing complexity, does it give benefits?

We already saw, for example,

in the car example,

car price example, we said, look,

there could in the future be a variable which is

the condition monitoring of

the car, which is available to you.

Can you use it? How can you use it?

How much extra value

does it add when you're pricing a car?

So why do we call it a curse?

We're going to talk a bit about it and talk about

some very simple ways of taking care of the curse.

Finally, we're going to say, "Okay,

if our objective is to

extract meaning from data like this,

are there some tools,

some methods which will help?"

We will talk about very simple methods like

a scatter plot and more complicated methods.

I'm going to explain only one, but

I'll at least name a few more,

which you can go and read about

in the references to this module.

You might have seen this chart somewhere.

Somebody would have already said that what

has changed about Big Data is

the velocity with which it is coming through,

the volume in which it's coming through,

the variety of data which is available,

and also the veracity, what with fake news and all:

some data you can trust, some was

verified, and some was triple-verified.

So given that, we know the data comes in

so many different ways

and this is one way of thinking about it.

But there are more precise ways of looking at it.

Data comes in numbers.

You've seen numbers already, you've seen statistics.

You have seen weight,

you've seen maybe distributions.

So we have seen numeric data of all kinds.

Data also comes in what we'd call an ordered fashion,

where the different classes are

not equidistant, so you could have small, medium, large.

You could have low,

medium, high income buckets.

Right. You could have how relevant the data is:

not very relevant, more relevant, and most relevant.

Right. So there is an ordering.

There is another kind of data which is symbolic,

where there's no order.

Red, green and blue.

Right. There is the state, the country,

and the region, so basically there are ways in which

you can just label things.

So these are labels. Right. So what

we saw was different types of data.
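The three types just described (numeric, ordered, and symbolic) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record, its field names, and the helper functions `ordinal_rank` and `one_hot` are made up for this example and are not from the course material. Ordered values are mapped to ranks, which keeps the order without claiming the classes are equidistant; symbolic values are one-hot encoded so that no order is implied.

```python
# Ordered classes: only the order is meaningful, not the spacing between ranks.
SIZE_ORDER = ["small", "medium", "large"]

# Symbolic labels: no order at all.
COLORS = ["red", "green", "blue"]


def ordinal_rank(value, order):
    """Return the position of an ordered value within its known ordering."""
    return order.index(value)


def one_hot(value, labels):
    """Return a 0/1 indicator per label so that no ordering is implied."""
    return [1 if value == label else 0 for label in labels]


# Hypothetical record mixing all three types.
record = {"weight": 72.5, "size": "medium", "color": "green"}

encoded = (
    [record["weight"]]                           # numeric: used as-is
    + [ordinal_rank(record["size"], SIZE_ORDER)]  # ordered: rank only
    + one_hot(record["color"], COLORS)            # symbolic: indicators
)
print(encoded)
```

The rank for "medium" says only that it sits between "small" and "large"; treating the gap between ranks as a real distance would be an extra assumption.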

Data also differs, as you would have seen in

your first course, in how it is stored.

So the common way is to store it as a table,

with rows and various features as columns.

In fact, all the data we have explored so

far is of that kind. It is tabular.

Or it could be a basket,

like a shopping basket of data,

a market basket, or a list of keywords.

Now, what's the big difference?

The big difference is that a basket is high-dimensional, highly sparse data

compared to a table.

The table has fewer dimensions and is very

dense compared to a basket of items that you purchased.

We will see that in one of

our examples in one of the modules.

Another kind of data

is simply whatever you've got in your bag,

three eggs, four cereals,

two soap bars, five milks.
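To make the "high-dimensional and sparse" point concrete, here is a minimal sketch that lays the bag above out as one row of a table with a column per catalog item. The ten-item `catalog` is a hypothetical stand-in; a real store would have thousands of items, which makes the row far sparser.

```python
# The shopper's bag as described: item -> count.
bag = {"eggs": 3, "cereal": 4, "soap bar": 2, "milk": 5}

# Hypothetical store catalog (an assumption for illustration only).
catalog = ["bread", "butter", "cereal", "coffee", "eggs", "milk",
           "rice", "salt", "soap bar", "tea"]

# One table row with a column for every catalog item; absent items become 0.
row = [bag.get(item, 0) for item in catalog]

# Fraction of columns that actually hold a purchase.
nonzero = sum(1 for count in row if count != 0)
density = nonzero / len(catalog)

print(row)
print(density)
```

Storing the bag as a dictionary keeps only the four items bought, while the table row has to carry a zero for everything else in the catalog.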

We could do the same thing

with a document and say, okay,

does this document have

five words which say "you win," three words which

say "immediately," and 10

times it says "lucky"? Then you know it's spam,

right, compared to a document which says

you are requested to be

at such and such a place for such and such meeting.

So you've got "meeting" mentioned once,

"you" mentioned once, "place" mentioned once.

So the Bag of Words

is meaningful depending on what it contains.
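A Bag of Words can be sketched with the standard library's `collections.Counter`. The two messages below are invented for illustration (they are not the lecturer's examples), and the tokenizer is a deliberately crude regular expression; treat this as a sketch rather than a production text pipeline.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each word."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Hypothetical messages, made up for this sketch.
spam = "You win! You win now. Lucky you, lucky day: claim immediately."
meeting = "You are requested to be at the main office for the budget meeting."

spam_bag = bag_of_words(spam)
meeting_bag = bag_of_words(meeting)

# Word order is discarded; only the counts remain.
print(spam_bag["win"], spam_bag["lucky"], spam_bag["immediately"])
print(meeting_bag["meeting"], meeting_bag["win"])
```

A `Counter` returns 0 for words that never occur, which matches the idea that most vocabulary entries are absent from any single document.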

So we could organize data in different ways.

So you have different types of data

and different ways of organizing it.

So the Bag of Words is considered

high-dimensional and sparse data.

Moving on. My colleague who

shared some of these slides with me uses

the word modality here; I'm using the word structure.

Data could be structured in the sense it

could be in fixed columns.

Right. It could be numeric,

it could be symbolic,

whereas data could also be unstructured.

So it could be speech, like what I'm doing,

it could be ticker symbols moving,

it could be a text document.

So just look at the complexity now: you have types of data,

how they're organized, how they're structured.

It creates a sort of variety which we can use

to do the same classification or prediction.

So the question for us is: is more better?

Obviously, we would like to believe it is.

Right. But we run into two problems.

The first problem is distance.

So if your data,

your predictor variables, are categorical,

how do you model distances?

It becomes a problem because for

categorical variables distance is not well-defined.
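One common workaround, offered here as an assumption rather than as the method used in this course, is the simple matching (Hamming-style) distance: count the attributes on which two records disagree and divide by the number of attributes. The car records and the function name `simple_matching_distance` are hypothetical.

```python
def simple_matching_distance(a, b):
    """Fraction of categorical attributes on which two records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


# Hypothetical records: (color, body type, fuel).
car_a = ("red", "sedan", "petrol")
car_b = ("blue", "sedan", "petrol")
car_c = ("blue", "suv", "diesel")

print(simple_matching_distance(car_a, car_b))  # differ on 1 of 3 attributes
print(simple_matching_distance(car_a, car_c))  # differ on all 3 attributes
```

This treats every mismatch as equally large: "red" versus "blue" counts the same as "sedan" versus "suv". That is exactly the sense in which distance for categorical variables is not well-defined; the measure is a convention, not a fact about the data.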

The other problem is, on top of that,

add text. So I have numerical variables,

I have categorical variables, I have text.

How do I even model something with it?

So obviously I have to find a way

of extracting the features in

the text and putting them in

a nice modelable format so that I can then predict.
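As a rough sketch of what "a nice modelable format" could look like, the function below turns one record with numerical, categorical, and text fields into a single numeric vector. Everything specific here is hypothetical: the field names, the `REGIONS` label set, and the `KEYWORDS` vocabulary are assumptions chosen for illustration, and keyword counting is only the simplest possible form of text feature extraction.

```python
import re

REGIONS = ["north", "south", "east", "west"]      # hypothetical label set
KEYWORDS = ["promise", "dispute", "unreachable"]  # hypothetical vocabulary


def featurize(record):
    """Concatenate numeric, one-hot categorical, and keyword-count features."""
    numeric = [float(record["amount_due"]), float(record["days_late"])]
    categorical = [1.0 if record["region"] == r else 0.0 for r in REGIONS]
    tokens = re.findall(r"[a-z]+", record["notes"].lower())
    text = [float(tokens.count(word)) for word in KEYWORDS]
    return numeric + categorical + text


# Hypothetical record, made up for this sketch.
record = {
    "amount_due": 1200,
    "days_late": 45,
    "region": "south",
    "notes": "Customer made a promise to pay; earlier dispute resolved.",
}

vector = featurize(record)
print(vector)
```

Once every record is a fixed-length vector like this, it can be fed to whatever prediction model is being used; how much the extra text columns actually help is an empirical question.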

We're asking: is more better?

This is a case study shared with

me by the same person, who is

now one of the top machine-learning experts

with one of the big companies in India.

He said in his first job he saw this example.

So you see a graph here: there's a line in black and

a line in red, and basically, higher is better.

So the first part:

the bank wanted to predict whether calling

a customer would help in improving collections on a loan.

So if you used just the numerical variables,

there was a bit of accuracy.

What my friend, whose name is

Sailesh Kumar, found was this: let us say I recorded

the call center conversation,

looked at the notes that the call center representative

was making, extracted information from

the text, and added it to the numerical variables.

Then my prediction accuracy goes up.

Okay. So the idea is

that more is better, but how much more?

That's the question we're trying to ask.