Tuesday, December 4, 2012

Exploring Data :: association rules and frequent item sets

If I had to choose a propositional (i.e. attribute-based) method that is close to inductive logic programming, it would be association rules. Both methods produce a set of rules that describe the data, although ILP, being logic based, is more powerful: association rules are conjunctions of has_a(_)-type predicates, while ILP rules can use any predicates you define, such as before(_,_), distance_between(_,_), etc. One great thing about association rules is that you don't need any negative examples to contrast the learning against, and you don't need your data to be labeled. All you need is your data, which consists of sets of items (i.e. this is not a numeric prediction method), and some idea of the minimum accuracy (confidence) for the resulting rules, which you can experiment with. So this method is a great exploratory tool, frequently used for basket analysis (ex. when you need to find out whether buying milk leads to the purchase of bread and peanut butter).
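To make the support and confidence bookkeeping concrete, here is a small Java sketch. The transactions and item names are made up for illustration; support is the fraction of transactions containing a set of items, and the confidence of a rule X → Y is support(X ∪ Y) / support(X):

```java
import java.util.List;
import java.util.Set;

public class RuleStats {
    // Support: fraction of transactions that contain every one of the given items.
    static double support(List<Set<String>> transactions, Set<String> items) {
        long hits = transactions.stream().filter(t -> t.containsAll(items)).count();
        return (double) hits / transactions.size();
    }

    public static void main(String[] args) {
        // Toy basket data, invented for this example.
        List<Set<String>> transactions = List.of(
            Set.of("milk", "bread", "peanut butter"),
            Set.of("milk", "bread"),
            Set.of("milk", "eggs"),
            Set.of("bread", "peanut butter"));

        // confidence(milk -> bread) = support({milk, bread}) / support({milk})
        double confidence = support(transactions, Set.of("milk", "bread"))
                / support(transactions, Set.of("milk"));
        System.out.println(confidence); // 2 of the 3 milk baskets also contain bread
    }
}
```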

When we are mining for association rules, we have no idea in advance which attribute we are trying to predict or which attributes and values it will depend on, so bluntly going through all the data looking at every possible combination of attributes and their values is not feasible. The first step in finding association rules is therefore finding frequent item sets. In other words, we are just looking for the items that frequently appear together, without inferring any relationship between them. Taking a grocery store example, we'll be looking for an item set like {milk, bread, peanut butter}. There are several algorithms that find frequent item sets; the most common is Apriori (Agrawal, Srikant; VLDB 1994). I'm not going to go through the details of the algo, since it is described pretty well in the paper and on wiki, but generally all you need besides your data is a minimum support for a set (i.e. the minimum number of times an item set must appear in the data). The algo starts with one-item sets, i.e. looking at each individual item and rejecting those that don't meet the minimum support. With every iteration, it grows the surviving item sets by one item, again rejecting those that don't meet the minimum support, and it stops when no larger item sets that meet the minimum support can be found.
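The level-by-level loop described above can be sketched in Java as follows. This is a simplified illustration, not the implementation from my repo: it generates candidates by joining pairs of frequent k-sets and recounts each candidate against the data, omitting the subset-pruning and counting optimizations of the real Apriori paper. The transactions are made up.

```java
import java.util.*;

public class Apriori {
    // Returns every item set appearing in at least minSupport transactions.
    static Set<Set<String>> frequentItemSets(List<Set<String>> transactions, int minSupport) {
        Set<Set<String>> result = new HashSet<>();

        // Level 1: individual items that meet the minimum support.
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> t : transactions)
            for (String item : t) counts.merge(item, 1, Integer::sum);
        Set<Set<String>> current = new HashSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() >= minSupport) current.add(Set.of(e.getKey()));

        while (!current.isEmpty()) {
            result.addAll(current);
            // Grow by one item: join pairs of frequent k-sets into (k+1)-set candidates.
            Set<Set<String>> candidates = new HashSet<>();
            for (Set<String> a : current)
                for (Set<String> b : current) {
                    Set<String> union = new HashSet<>(a);
                    union.addAll(b);
                    if (union.size() == a.size() + 1) candidates.add(union);
                }
            // Keep only candidates that still meet the minimum support.
            Set<Set<String>> next = new HashSet<>();
            for (Set<String> c : candidates) {
                long count = transactions.stream().filter(t -> t.containsAll(c)).count();
                if (count >= minSupport) next.add(c);
            }
            current = next; // stop when no larger frequent set can be found
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy data: with minSupport = 3, {milk, bread} and {bread, peanut butter}
        // survive, but {milk, bread, peanut butter} appears only twice and is dropped.
        List<Set<String>> transactions = List.of(
            Set.of("milk", "bread", "peanut butter"),
            Set.of("milk", "bread"),
            Set.of("bread", "peanut butter"),
            Set.of("milk", "peanut butter", "bread"));
        System.out.println(frequentItemSets(transactions, 3));
    }
}
```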

Once the frequent item sets are found, each one is turned into one or more association rules with a specified minimum accuracy (confidence). But I think that in most cases finding the frequent item sets already solves the data analysis problem, and there is no need to turn the sets into rules.
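If you do want the rules, here is a minimal sketch of the second stage for a single frequent set. To keep it short it only enumerates single-consequent splits (X \ {c} → {c}); a full implementation would consider every antecedent/consequent partition. The data is again invented for illustration.

```java
import java.util.*;

public class RulesFromSet {
    static double support(List<Set<String>> transactions, Set<String> items) {
        return (double) transactions.stream().filter(t -> t.containsAll(items)).count()
                / transactions.size();
    }

    // From one frequent set, keep the single-consequent rules whose
    // confidence = support(set) / support(antecedent) meets the threshold.
    static List<String> rules(List<Set<String>> transactions, Set<String> frequentSet,
                              double minConfidence) {
        List<String> kept = new ArrayList<>();
        double setSupport = support(transactions, frequentSet);
        for (String consequent : frequentSet) {
            Set<String> antecedent = new HashSet<>(frequentSet);
            antecedent.remove(consequent);
            if (antecedent.isEmpty()) continue;
            double confidence = setSupport / support(transactions, antecedent);
            if (confidence >= minConfidence)
                kept.add(antecedent + " -> " + consequent);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Toy baskets; both rules from {milk, bread} have confidence 2/3 here.
        List<Set<String>> transactions = List.of(
            Set.of("milk", "bread"),
            Set.of("milk", "bread", "peanut butter"),
            Set.of("milk", "peanut butter"),
            Set.of("bread", "peanut butter"));
        System.out.println(rules(transactions, Set.of("milk", "bread"), 0.6));
    }
}
```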

Nitty-gritty

For a Java implementation of Apriori with a small example in main, see my GitHub repo.
