
In this lesson, we're going to discuss Association Rules for performing

a Market-Basket Analysis, commonly abbreviated MBA.

So to perform this Market-Basket Analysis,

we're going to look at how we can determine which items are frequently

purchased together using the Apriori algorithm.

And through that, we're going to see how to determine association rules.

And moreover, we're going to define

some important terms like support, confidence, and lift.

All interesting things have a great story and so we'll start off with one.

You may or may not be familiar with the beer and diaper story,

but here it goes.

So, a store decided to do a formal analysis on

their data and found that men between the ages of 30 and 40 while

shopping between the hours of 5:00 and 7:00 pm on Fridays were

considerably more likely to purchase beer if they already had diapers in their cart.

Armed with this knowledge, the store relocated the beer section closer to

the diaper section and saw an increase in sales of both by 35 percent.

While it's a great story, unfortunately, it is not true.

The true story is that there is a company called Osco Drug which examined 1.2 million

transactions across 25 stores and identified around 5,000 slow-moving items.

And then they removed them.

And then the result was really quite measurable.

By removing that inventory,

it made it easier for customers to find what they wanted,

and customers thought selection had increased.

So, therefore, their sales increased and the company saved

money by reducing their inventory overhead.

But let's see how we can perform a real Market-Basket Analysis.

So, we're going to go ahead and install some dependencies,

go ahead and import those dependencies,

and before we get into the apriori algorithm and association rules,

first, we are going to talk about how we need to model our data.

Let's imagine that we have a super small store that only sells five items.

We sell beer, chips,

salsa, chocolate, and diapers.

Here, we have a schema that lists all of our transactions, where each entry is

a customer's shopping cart, and each entry inside

of the shopping cart says what they bought and how many of those items they bought.

So, it's very easy for us to go ahead and take

this variable and put it into a data frame.

And you can see right here,

that's what this data looks like in a tabular format.
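To make that step concrete, here's a minimal sketch, assuming a hypothetical `carts` list where each cart maps an item name to the quantity purchased:

```python
import pandas as pd

# Hypothetical carts for the five-item store; each dict is one transaction,
# mapping item -> quantity purchased.
carts = [
    {"beer": 2, "chips": 1, "diapers": 1},
    {"beer": 1, "chips": 2, "diapers": 2},
    {"beer": 1, "chips": 1, "salsa": 1},
    {"chips": 3, "salsa": 2, "chocolate": 1},
    {"chips": 1, "chocolate": 2},
    {"beer": 6, "diapers": 1},
]

df = pd.DataFrame(carts)
# Items a customer didn't buy show up as NaN in the tabular view.
print(df)
```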

Okay, so we have this transaction table but in order to build association rules,

the first thing we're going to do is get rid of all these NaN values,

all these not a number values.

Fortunately for us, we can just use

pandas' built-in fillna function and replace all those NaNs with zeros.

Now that all of our data is numerical, we need to one-hot encode our data.

And that simply means that we need to represent whether something was present or not.

And so, what we're going to do is we're just going to go through

all the different values and if anything is greater than zero,

we're just going to set it to one. And there we go.

Now, our data demonstrates whether or not someone purchased something.
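As a sketch, continuing with the same hypothetical `carts` data, the fillna and one-hot steps look like this:

```python
import pandas as pd

# Hypothetical carts: each dict maps item -> quantity purchased.
carts = [
    {"beer": 2, "chips": 1, "diapers": 1},
    {"beer": 1, "chips": 2, "diapers": 2},
    {"beer": 1, "chips": 1, "salsa": 1},
    {"chips": 3, "salsa": 2, "chocolate": 1},
    {"chips": 1, "chocolate": 2},
    {"beer": 6, "diapers": 1},
]

df = pd.DataFrame(carts).fillna(0)   # replace every NaN with zero
basket = (df > 0).astype(int)        # one-hot: 1 if the item was present at all
print(basket)
```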

Keep in mind that one-hot encoding something

like a trait can be a little bit more tricky.

So, for example, say we had gender.

Let's say gender was restricted to just male and

female, and we represented that as binary with a one or a zero.

In order to one-hot encode that,

we'd really need to create two separate columns,

one for female and one for male,

and then represent it like so because we're really trying to represent

the presence or lack of presence for some feature.
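One way to sketch that with pandas is get_dummies, which expands a single categorical column into one presence column per value; the `people` table here is hypothetical:

```python
import pandas as pd

# Hypothetical table with gender stored as a single categorical column.
people = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# One column per value, marking presence (1) or absence (0) of the feature.
onehot = pd.get_dummies(people["gender"], prefix="gender").astype(int)
print(onehot)
```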

Great, now that we know what our data needs to look like,

let's define some terms and then look at a little bit

of the math for Market-Basket Analysis.

With association rules, when we define relationships,

we use the terms antecedent and consequents,

and then we also have some characteristic terms like support, confidence, and lift.

When we define a rule we say that something implies something else so A implies B.

And this arrow actually means "implies."

So, we have a rule that says chips implies beer.

We say that chips is the antecedent and beer is the consequent.

And these are the two terms for defining the relationships in an association rule.

Support is the first term we're going to use to

describe the characteristics of a relationship.

Support is simply how often an item occurs among all transactions.

So, for example, since there are five occurrences of chips across all six transactions,

chips would have a support of 0.833.

And beer appears four times across six transactions so,

it has a support of 0.667.
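Those support numbers are easy to check in pandas; here's a minimal sketch, using a hypothetical one-hot basket matching the example:

```python
import pandas as pd

# Hypothetical one-hot basket for the six example transactions.
basket = pd.DataFrame([
    {"beer": 1, "chips": 1, "salsa": 0, "chocolate": 0, "diapers": 1},
    {"beer": 1, "chips": 1, "salsa": 0, "chocolate": 0, "diapers": 1},
    {"beer": 1, "chips": 1, "salsa": 1, "chocolate": 0, "diapers": 0},
    {"beer": 0, "chips": 1, "salsa": 1, "chocolate": 1, "diapers": 0},
    {"beer": 0, "chips": 1, "salsa": 0, "chocolate": 1, "diapers": 0},
    {"beer": 1, "chips": 0, "salsa": 0, "chocolate": 0, "diapers": 1},
])

# Support of a single item is just the column mean of the 0/1 values.
support = basket.mean()
print(round(support["chips"], 3))  # 0.833
print(round(support["beer"], 3))   # 0.667
```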

And we would do this for every single combination of

items referred to commonly as item sets.

So, we'd start with all item sets of size one,

then all item sets of size two,

all item sets of size three, et cetera.

Now when you do this, you typically define some kind of minimum amount

of support to avoid exploring extremely uncommon pairings.

In this example, I've limited the minimum threshold to 0.5.

The next term is confidence,

and confidence is the likelihood that

some item set B will be bought together with an item set A.

It is calculated by dividing the support of item set [A, B] by the support of item set A.

So, the confidence that the antecedent,

chips, implies the consequent, beer,

is 60 percent because the support for chips and beer

is 0.5 and the support for chips by itself is 0.833,

divide them, and you get 0.6.

So, 60 percent of the time that chips were bought,

beer was also bought.
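That calculation can be sketched directly, reusing the hypothetical one-hot basket from the example:

```python
import pandas as pd

# Hypothetical one-hot basket for the six example transactions.
basket = pd.DataFrame([
    {"beer": 1, "chips": 1, "salsa": 0, "chocolate": 0, "diapers": 1},
    {"beer": 1, "chips": 1, "salsa": 0, "chocolate": 0, "diapers": 1},
    {"beer": 1, "chips": 1, "salsa": 1, "chocolate": 0, "diapers": 0},
    {"beer": 0, "chips": 1, "salsa": 1, "chocolate": 1, "diapers": 0},
    {"beer": 0, "chips": 1, "salsa": 0, "chocolate": 1, "diapers": 0},
    {"beer": 1, "chips": 0, "salsa": 0, "chocolate": 0, "diapers": 1},
])

# Support of the combined item set {chips, beer}: rows where both are present.
support_chips_beer = ((basket["chips"] == 1) & (basket["beer"] == 1)).mean()

# confidence(chips -> beer) = support({chips, beer}) / support({chips})
confidence = support_chips_beer / basket["chips"].mean()
print(round(confidence, 3))  # 0.6
```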

Now, confidence can be a very good indicator, but it also has a major drawback.

If the consequent is popular, confidence does not take this into account,

and can suggest an implication where there really isn't one.

And the last characteristic we're going to look at is lift.

And lift is how likely an item set B was purchased when item set A was purchased.

So, unlike confidence, lift takes into account the popularity of item set B,

and it's calculated by dividing the support of item set [A, B] by

the product of the support of item set A and the support of item set B.

So, the lift for the rule chips implies beer would be 0.9.

Now, lift values of one imply no association,

values greater than one imply a positive association,

and values less than one imply a negative association.
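As a sketch, the lift of chips implies beer can be computed the same way from the hypothetical one-hot basket:

```python
import pandas as pd

# Hypothetical one-hot basket for the six example transactions.
basket = pd.DataFrame([
    {"beer": 1, "chips": 1, "salsa": 0, "chocolate": 0, "diapers": 1},
    {"beer": 1, "chips": 1, "salsa": 0, "chocolate": 0, "diapers": 1},
    {"beer": 1, "chips": 1, "salsa": 1, "chocolate": 0, "diapers": 0},
    {"beer": 0, "chips": 1, "salsa": 1, "chocolate": 1, "diapers": 0},
    {"beer": 0, "chips": 1, "salsa": 0, "chocolate": 1, "diapers": 0},
    {"beer": 1, "chips": 0, "salsa": 0, "chocolate": 0, "diapers": 1},
])

# lift(chips -> beer) = support({chips, beer}) / (support({chips}) * support({beer}))
support_chips_beer = ((basket["chips"] == 1) & (basket["beer"] == 1)).mean()
lift = support_chips_beer / (basket["chips"].mean() * basket["beer"].mean())
print(round(lift, 3))  # 0.9
```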

Now that we have these terms defined,

let's go ahead and look back at the code.

To calculate these different values,

we're going to use two different methods from the MLxtend machine learning library.

We are going to use the apriori method and the association rules method.

First, we're going to build our associations with the apriori method,

and as you can see, I am setting my minimum threshold for support to 0.5.

Now, keep in mind, as the data gets larger,

you may need to decrease this threshold.

And as you can see, here are the different item sets with their different support values.

Now we can pass those associations to the association rules method.

Here, I'm using a minimum threshold of 0.5.

In reality, we'd probably want something greater than one,

but I'm doing this so that we can see all of the associations.

And here, you can see our different association rules.

We have our different antecedent item sets,

our different consequent item sets with their support, confidence, and lift.

Moreover, you can see that our diaper and beer story

is true for this data set and you can also see

that chips implies beer with

a lift of 0.9 just like when we calculated it manually earlier.

Okay, now that we understand the basics of association rules,

let's go and do the same thing with a much larger data set.

First, we are going to connect to our cluster with PyMongo.

And in this dataset, we have documents like this where we have

a purchases array with embedded documents describing each purchase,

and we really want to convert this into a format like this where

we have every product ID,

or I guess it's a stock ID,

and then whether or not someone purchased something.

To do this, we're going to use a $replaceRoot stage, mapping over

all of the different object keys and setting them to one for every stock code.

And that's going to be our only stage in our pipeline.

And then very simply, we can go ahead and

exhaust that cursor and shove it into a data frame.
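The stage described above can be sketched like this; note that the `purchases` and `stock_code` field names, and the collection names in the comments, are assumptions about the document shape rather than details confirmed from the source:

```python
# Hypothetical document shape:
#   {"purchases": [{"stock_code": "22697", "quantity": 2}, ...]}
# The $replaceRoot stage turns each document into {<stock_code>: 1, ...}.
pipeline = [
    {"$replaceRoot": {"newRoot": {
        "$arrayToObject": {
            "$map": {
                "input": "$purchases",
                "as": "item",
                # every stock code becomes a key whose value is simply 1
                "in": {"k": "$$item.stock_code", "v": 1},
            }
        }
    }}}
]

# With a live connection, you'd exhaust the cursor into a DataFrame:
# import pandas as pd
# from pymongo import MongoClient
# collection = MongoClient(uri).retail.transactions  # hypothetical names
# df = pd.DataFrame(list(collection.aggregate(pipeline)))
```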

And like before, we have a bunch of not a number values.

So again, we're going to go ahead and use the DataFrame's fillna function.

And here, we're replacing all those NaN's with zero.

And now, like before, we can go ahead and use the apriori function.

Now notice, I have a much lower minimum support, and that's because we

have a little over 3,600 different stock codes in our data set.

So we'll go ahead and get those associations, and we'll go ahead and look at them,

and here are all the different support values for all the different item sets.

And then we can go ahead and, like before,

pass these associations to the association rules function.

Here, I'm giving a minimum threshold of three.

This time, we don't want to look at every possible rule.

We really only want to look at the strongest rules.

And now, we can go ahead and print them out,

our very top rule, with a lift of 24.22, says that stock codes

22698 alongside 22699 are frequently purchased with 22697.

We can go ahead and create a simple aggregation pipeline to see what these products were,

and it makes sense.

People were buying tea cups and saucers of different colors together.

So, knowing this information,

maybe we'd want to go ahead and package these items together in our store.

Okay, let's summarize what we've learned.

We saw how Market-Basket Analysis works using the apriori algorithm,

and we saw how to get that data out of MongoDB so we could pass it into the appropriate functions.

And moreover, we saw the different terms for

these association rules, what these different terms mean,

and how each of them is calculated.