Thursday, November 29, 2012

Inductive Logic Programming :: applying logic to discover patterns in data

Inductive Logic Programming (or ILP) is an interesting, if not unusual technique to computationally analyze data. To get a feel of what it is, let’s start with name. Inductive, or induction is reasoning from specific to general. The way I think of it is that it is the opposite of deduction. Remember Sherlock? He solved the crimes by knowing the "rules of the game" (i.e. how people think, etc.) and getting his facts in order. Then the solution, i.e. the outcome of an action is logically deduced. With induction, you know the outcome, you know the facts, what you need to find are the rules that got you from the facts to the outcome.

As to the Logic Programming part, the good example of it is Prolog programming language. It differs from the object oriented or procedural languages like Java or C in the way that you don't specify step by step instructions of how the program should be executed. You specify the facts and the rules, and then you "ask questions" about your newly created world. For instance, you can specify facts like:
parent(john,ann); parent(john, ben); ... - John is a parent of Ann; John is a parent of Ben
and then you define what sibling means:
sibling(X,Y) :- parent(A,X), parent(A,Y) - two people (X,Y) are siblings if they have the same parent
Then you can ask questions like:
  • Is Ben sibling of Ann?: sibling(ben,ann). And you will get TRUE as your answer.
  • List all siblings?: sibling(X,Y). And you will get all sibling combinations. I our case, only (ann,ben)

So Prolog (and logic programming in general) is deductive. When you put inductive together with logic programming, what you get is the situation when you have facts (parent(john,ann);...) and you have examples (i.e. answers to the questions like sibling(ben,ann)), what you need to find are the rules (i.e. definition of the sibling().).

When is ILP useful? When you have data, some general rules that apply to it (i.e. basic knowledge) and you need to find out a theory that describes your data. For instance, let's say you are conducting a cancer study, and as a result you have a set of DNA sequences from the cancer cohort and from the healthy cohort. Each sequence contains some markers that you are interested in (transcription factor binding sites, etc.). You also know some general rules that apply to markers/sequence, i.e. two markers can be at some distance from each other; one marker can be located before another; sequence can be on some chromosome, etc. What you'd love to find out is what's so special about the cancer cohort, compared to the healthy cohort.  That's where ILP comes in. Machine learning people may recognize this as a classification problem. ILP is a machine learning method, but what is different it compared to, for instance, Decision Trees or Support Vector Machines is that data is represented as logic programs, as opposed to a set of attributes. This could be beneficial when the relationship between the attributes is significant. Another difference or I would even say advantage of ILP is that it is able to describe your data and discover new knowledge about it. I'll illustrate this in an example.

Let's take the cancer study and see how we would use ILP. First, we represent our data as logic programs like these:
sequence(<name of the sequence>, <chromosome it is located on>)  - we describe all sequences we have in this form
has_marker(<sequence name>, <marker name>, <location of marker in the sequence>) - describe each marker in the sequence

We also specify which sequences are cancer and which are not (positive and negative examples).

Then we give our ILP engine a set of basic rules of interaction that can describe our data, like:
has_marker(<marker name>)
before(<marker 1>, <marker 2>)
distance_interval(<marker1>, <marker2>, <distance>, <offset>)

When the ILP executes, we might get the following theory:
 has_feature(m1), before(m9,m16), distance_interval(m9, m16, 60, 10).
What this means is that a cancer sequence should contain marker m1 AND marker m9 should come before m16 AND the distance between m9 and m16 should be 60 +/- 10.

You can see that the new knowledge is discovered by concatenation of basic rules. It's unlikely that you would be looking for such a specific rule after the experiment, while ILP may discover it for you.

The advantages of ILP is the ability to represent input data more expressively using logic (granted, it may not always be needed) and  the ability to not only classify, but describe the data in human readable form (I've seen people who are not familiar with programming learn how to read the rules in 15 minutes). The disadvantage, as you might have guesses, is the speed of execution. So if you have large amount of data and you do have your own theories about it, you're better off testing them first. However, if you have exhausted all the possibilities and still have no clue, ILP can provide you with possible theories. If you consider that some biological experiments take months or even years (not to mention the cost), then a few days of computer processing is peanuts.

Nitty-gritty

To try to use this method, start with picking and installing an ILP engine. I've used Aleph, but there are lots of others available.

No comments:

Post a Comment