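In this course, we will learn the basic concepts of data mining and its fundamental methods and applications, then go deeper into one subfield of data mining, pattern discovery, and study its concepts, methods, and applications in depth. We will also introduce methods for pattern-based classification and some interesting applications of pattern discovery. The course gives you the opportunity to learn skills and practice applying scalable pattern discovery methods to massive transactional data, to discuss pattern evaluation measures, and to study methods for mining diverse kinds of patterns, sequential patterns, and subgraph patterns.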


A course from University of Illinois at Urbana-Champaign

Pattern Discovery in Data Mining

137 ratings

From this lesson

Module 2

Module 2 covers two lessons: Lessons 3 and 4. In Lesson 3, we discuss pattern evaluation and learn what kinds of interestingness measures should be used in pattern analysis. We show that the support-confidence framework is inadequate for pattern evaluation, and that even the widely used lift and chi-square measures may not work well in certain situations. We introduce the concept of null-invariance and a new null-invariant measure for pattern evaluation. In Lesson 4, we examine issues in mining a diverse spectrum of patterns. We learn the concepts of, and mining methods for, multiple-level associations, multi-dimensional associations, quantitative associations, negative correlations, compressed patterns, and redundancy-aware patterns.

- Jiawei Han, Abel Bliss Professor

Department of Computer Science

We have already seen that lift and chi-square may not be very good measures

when examining transaction data that contain lots of null transactions.

So, what we would like to know is,

what are good measures,

ones that are not influenced much by the number of null transactions?

Let's look at those different measures.

Some measures have a property called null-invariance,

which means their values do not change with the number of null transactions.

Let's see which measures are null-invariant

and which measures are not.

We already know that chi-square and lift

are not null-invariant.

Their values change with the number of null transactions.

But people have found that the following five measures,

if you check their formulas,

their definitions, are actually null-invariant.

You probably know the Jaccard coefficient and the cosine measure quite well.

These two measures are widely used, and they are null-invariant.

Then there is all-confidence, which takes

the larger of the supports of A and B as the denominator, so it gives the smaller of the two confidences;

the numerator is just the transaction support of the rule.

Max-confidence instead takes the larger of the two confidences.

These two were actually proposed in studies on measuring association rules.

The Kulczynski measure was proposed around 2007 by our group.

We originally called it the balanced measure, but later

a reviewer pointed out that this measure had actually been proposed

by a Polish mathematician named Kulczynski in 1927.

So we changed the name of this measure to the Kulczynski measure.
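As a minimal sketch of the five measures just named, the function below computes them from support counts using their standard textbook definitions. The function name, argument names, and example counts are illustrative assumptions, not something shown in the lecture. Note that the total number of transactions never appears, which is exactly why these measures cannot change when null transactions are added.

```python
import math


def null_invariant_measures(sup_a, sup_b, sup_ab):
    """Return the five null-invariant measures for itemsets A and B.

    sup_a, sup_b: number of transactions containing A, and containing B
    sup_ab:       number of transactions containing both A and B
    The total transaction count (and so the null count) is never used.
    """
    conf_a_given_b = sup_ab / sup_b  # P(A|B)
    conf_b_given_a = sup_ab / sup_a  # P(B|A)
    return {
        # Larger support in the denominator = smaller of the two confidences.
        "all_confidence": sup_ab / max(sup_a, sup_b),
        # Larger of the two confidences.
        "max_confidence": max(conf_a_given_b, conf_b_given_a),
        # Intersection over union of the two transaction sets.
        "jaccard": sup_ab / (sup_a + sup_b - sup_ab),
        # Geometric-mean style normalization.
        "cosine": sup_ab / math.sqrt(sup_a * sup_b),
        # Arithmetic mean of the two confidences.
        "kulczynski": 0.5 * (conf_a_given_b + conf_b_given_a),
    }


# Hypothetical counts: A in 11,000 transactions, B in 2,000, both in 1,000.
measures = null_invariant_measures(11_000, 2_000, 1_000)

if __name__ == "__main__":
    for name, value in measures.items():
        print(f"{name}: {value:.4f}")
```

With these assumed counts the measures disagree noticeably (max-confidence is 0.5 while all-confidence is about 0.09), because the two itemsets have very different supports.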

Let's first look at null-invariance

and why it is so important.

That is, why, in analyzing massive transaction data,

null-invariance is so critical.

It is because, among many, many transactions,

the chance that a transaction contains a particular set of items

is actually very low. Take Walmart transactions, for example:

they may well contain neither milk nor coffee.

We will analyze milk and coffee using the following contingency table.

Here, mc means the number of transactions that contain both milk and coffee.

¬m¬c means the number of transactions that contain neither milk nor coffee.

So ¬m¬c is the number of null transactions.

Then we see that lift and chi-square

are not null-invariant.

So they are not good at evaluating data that

contain too many or too few null transactions.
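To see why lift and chi-square depend on the null count, here is a small sketch that computes both from the four cells of a 2x2 contingency table. The function names and the counts (400, 100, 100) are hypothetical, chosen only for illustration, and the chi-square here is the plain Pearson statistic without continuity correction.

```python
def lift(mc, not_m_c, m_not_c, null):
    """Lift of milk and coffee: P(m and c) / (P(m) * P(c))."""
    n = mc + not_m_c + m_not_c + null  # total transactions, nulls included
    sup_m = mc + m_not_c               # transactions containing milk
    sup_c = mc + not_m_c               # transactions containing coffee
    return (mc / n) / ((sup_m / n) * (sup_c / n))


def chi_square(mc, not_m_c, m_not_c, null):
    """Pearson chi-square statistic for the 2x2 table (no correction)."""
    observed = [[mc, m_not_c],      # row: milk
                [not_m_c, null]]    # row: no milk
    n = mc + not_m_c + m_not_c + null
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            total += (observed[i][j] - expected) ** 2 / expected
    return total


if __name__ == "__main__":
    # Same milk/coffee counts every time; only the null count changes.
    for null in (0, 400, 9_400):
        print(null, round(lift(400, 100, 100, null), 2),
              round(chi_square(400, 100, 100, null), 1))
```

Because the total `n` includes the null transactions, both functions change when only `null` changes: with these assumed counts, lift moves from 0.96 (no nulls) to 16 (9,400 nulls) while the milk and coffee counts stay fixed.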

For example, look at this:

for dataset D1,

mc means the number of transactions that contain both milk and coffee;

¬mc means the number of transactions that contain coffee but not milk;

m¬c means the number of transactions that contain milk but not coffee;

¬m¬c means the number of null transactions,

which contain neither milk nor coffee.

So, if you go to a store like Walmart,

that kind of supermarket,

you may well see cases where you get

10,000 transactions that contain both milk and coffee,

but 1,000 that contain coffee but not milk,

and 1,000 that contain milk but not coffee.

In that case, you would probably say that if people buy milk,

they are likely to buy coffee as well, because there are 10,000 such cases.

Cases where people buy only one of them number only 1,000.

But if you have a lot of null transactions,

these measures can come out strongly positive.

If you have very few null transactions,

they say milk and coffee are independent,

even though, looking at the counts, you can see they are not independent.

On the other hand, suppose there are only 100 cases

where milk and coffee are bought together,

and many more cases

where they are bought alone.

Once you have many null transactions,

the measure still turns out to be very positive.

The number is quite big.

So, whether you have very many null transactions

or very few null transactions,

something may go wrong.

So we do need to analyze such data using null-invariant measures.

We'll examine this in more detail.
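The scenarios described above can be sketched numerically. The lecture gives the 10,000 / 1,000 / 1,000 counts and the 100 together-purchases; the null counts (100,000 and 100), the 1,000 alone-purchases in the third scenario, and all names below are assumptions chosen for illustration. Lift is compared against one null-invariant measure, Kulczynski.

```python
def lift(mc, not_m_c, m_not_c, null):
    """Lift: P(m and c) / (P(m) * P(c)); depends on the null count via n."""
    n = mc + not_m_c + m_not_c + null
    sup_m = mc + m_not_c
    sup_c = mc + not_m_c
    return (mc / n) / ((sup_m / n) * (sup_c / n))


def kulczynski(mc, not_m_c, m_not_c, null=0):
    """Kulczynski: mean of P(c|m) and P(m|c); `null` is accepted but unused."""
    sup_m = mc + m_not_c
    sup_c = mc + not_m_c
    return 0.5 * (mc / sup_m + mc / sup_c)


# (mc, not-m c, m not-c, null); null counts are assumed, not from the lecture.
scenarios = {
    "strong, many nulls": (10_000, 1_000, 1_000, 100_000),
    "strong, few nulls": (10_000, 1_000, 1_000, 100),
    "weak, many nulls": (100, 1_000, 1_000, 100_000),
}

results = {name: (lift(*cells), kulczynski(*cells))
           for name, cells in scenarios.items()}

if __name__ == "__main__":
    for name, (lift_value, kulc_value) in results.items():
        print(f"{name}: lift={lift_value:.2f}, kulczynski={kulc_value:.2f}")
```

Under these assumed counts, lift is roughly 9.3 and 1.0 for the two strongly associated datasets, which differ only in their null counts, and roughly 8.4 for the weakly associated one. Kulczynski stays at about 0.91 for both strong datasets and drops to about 0.09 for the weak one, which matches the intuition from the raw counts.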