Wednesday 26 September 2012

Recently I entered the Kaggle 'Detecting Insults in Social Commentary' competition. I used Andrew McCallum's Mallet tool in Java http://mallet.cs.umass.edu and the Snowball analyser in the Lucene distribution.

I used Scala at the top level because it has superior functional manipulation of data structures. I used JVM-based tools instead of R or Python because, in my previous experience, they seem more scalable and can easily be turned into a map-reduce implementation.

I finished 5th of 118 in the milestone and 8th on the verification set, so I think my approach generalized well. Overall, I think the competition had some issues, and possibly any of the top 10 players could have won it with different parameters.

The evolution of my approach was as follows. I assumed most of the text was English, so I used the Lucene Snowball English analyser, which parses the text and does some stemming. Initially, I tried Naive Bayes and an L2-regularized maximum entropy classifier, and started by submitting the hard predictions (0 or 1). The Naive Bayes classifier came close to the benchmark figure of about 0.8. I then tried submitting the predicted probabilities of being in the insult class instead, which improved the AUC score to 0.87.

I then considered an ensemble of other models. The first one I chose was a simple character 3-gram model, on the assumption that there were unusual character sequences in this data, such as smileys, which would be stripped out by the word parser. This improved things to 0.89. I then used InfoGain to select the most significant 3-grams, which improved things to 0.90. I then added in all n-grams up to 6, which also improved things, so I continued with this approach and added another model of word n-grams to capture frequent patterns such as 'you are a'. Again the score improved.

The final model consisted of 4 classifiers and had a final score of 0.91423, which was 0.00468 behind the leader. Moreover, I could increase this to 0.9199 by tuning parameters after the deadline had passed.
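
The gain from submitting probabilities rather than hard 0/1 labels is what one would expect from a ranking metric like AUC: thresholding ties together every comment on the same side of the cut. A minimal sketch with made-up scores (not competition data) follows; the `auc` helper is a hypothetical pairwise implementation written for illustration, not the scorer Kaggle used.

```python
def auc(labels, scores):
    """Area under the ROC curve as the fraction of (positive, negative)
    pairs ranked correctly; a tied pair counts as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 1 = insult, 0 = not an insult (illustrative values only).
labels = [1, 1, 1, 0, 0, 0, 0]
probs = [0.9, 0.6, 0.4, 0.55, 0.3, 0.2, 0.1]

# Hard predictions discard the ordering within each side of the threshold.
hard = [1 if p >= 0.5 else 0 for p in probs]

auc_probs = auc(labels, probs)  # 11 of 12 pairs ranked correctly
auc_hard = auc(labels, hard)    # 8.5 of 12 once ties are split
print(auc_probs, auc_hard)
```

On this toy set the same underlying model scores about 0.92 with probabilities and about 0.71 with hard labels, purely because of the ties the threshold introduces.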

The final mixing is as follows:

{0.33*Words, 0.17*4..6chargrams, 0.17*1..3chargrams, 0.17*2-wordgrams, 0.17*3-wordgrams}
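
As a rough illustration of how such a mix could be applied to one comment, here is a sketch that takes each model's predicted insult probability and returns the weighted average. The dictionary keys are hypothetical labels for the components listed above. The listed weights sum to 1.01, presumably from rounding, so the sketch divides by the total weight; that normalisation is an assumption, not something stated in the post.

```python
# Weights as listed in the post; the key names are hypothetical labels.
WEIGHTS = {
    "words": 0.33,
    "char_4_6": 0.17,
    "char_1_3": 0.17,
    "word_2grams": 0.17,
    "word_3grams": 0.17,
}


def blend(model_probs, weights=WEIGHTS):
    """Weighted average of per-model insult probabilities for one comment.

    Dividing by the total weight keeps the result in [0, 1] even when the
    weights do not sum exactly to 1 (assumed normalisation).
    """
    total = sum(weights[name] for name in model_probs)
    weighted = sum(weights[name] * p for name, p in model_probs.items())
    return weighted / total


# Illustrative per-model probabilities for a single comment.
example = {
    "words": 0.9,
    "char_4_6": 0.7,
    "char_1_3": 0.6,
    "word_2grams": 0.8,
    "word_3grams": 0.5,
}
print(blend(example))
```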

It's interesting to compare this with a similar result given here: Andreas' blog

Things that did work:

* Weighted sum of different models.
* MaxEnt approaches
* L1 regularization with character n-grams
* Using Info Gain to reduce features improves performance.
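
To make the character n-gram and Info Gain items concrete, here is a small standalone sketch. It is not Mallet's InfoGain implementation; the function names are hypothetical, features are treated as binary (n-gram present or absent in a comment), and the information gain is the usual class entropy minus the conditional entropy given the feature.

```python
import math


def char_ngrams(text, n_min, n_max):
    """Set of character n-grams of length n_min..n_max, taken from the raw
    string so that punctuation and smileys are kept."""
    return {text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)}


def entropy(pos, neg):
    """Binary entropy in bits for a split of pos/neg counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def info_gain(docs, labels, feature):
    """H(class) - H(class | feature present or absent)."""
    n = len(docs)
    present = [y for d, y in zip(docs, labels) if feature in d]
    absent = [y for d, y in zip(docs, labels) if feature not in d]
    h_class = entropy(sum(labels), n - sum(labels))
    h_cond = 0.0
    for part in (present, absent):
        if part:
            h_cond += len(part) / n * entropy(sum(part), len(part) - sum(part))
    return h_class - h_cond


def top_features(texts, labels, k, n_min=3, n_max=3):
    """Rank all observed n-grams by information gain and keep the best k.
    Ties are broken alphabetically so the result is deterministic."""
    docs = [char_ngrams(t, n_min, n_max) for t in texts]
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda f: (-info_gain(docs, labels, f), f))
    return ranked[:k]
```

The selected n-grams would then be the only features passed to the classifier. The post selected 3-grams first and later extended to n-grams up to 6, which corresponds to changing `n_min` and `n_max` here.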

Things that didn't work:

* Removing the extended characters - the score dropped.
* Naive Bayes - not good for predicting probabilities in unbalanced set.
* Transductive learning adding in the unlabelled points.

Things I didn't try:

* Bad word lists

Friday 4 February 2011

Musings on Semi-Supervised Graph Learning

Marginal Distribution P supported by low dimensional manifold M

f* = argmin 1/n sum(E(x,y)) + norm
= sum a.K(x,y)

If P is known, introduce another regularizer term, a penalty (Riemannian = Laplace operator) that reflects the intrinsic structure.

f* = argmin 1/n sum(E(x,y)) + amb.norm + intr.norm

In most cases P is unknown, so we need estimates of P and of the norm from unlabeled examples

f* = argmin 1/n sum(E(x,y)) + amb.norm + intr.norm / (u +l)^2.f'Lf

Can be solved by a regularized least squares algorithm
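
The shorthand above appears to follow the standard manifold regularization setup, so a restatement in conventional notation may help. Everything symbolic here is an assumption about what the notes abbreviate: $V$ is the loss written as `E`, $\gamma_A$ and $\gamma_I$ are the `amb` and `intr` weights, $l$ and $u$ are the numbers of labeled and unlabeled points (taking the `n` in the notes to be $l$), $L$ is the graph Laplacian, and $\mathbf{f} = (f(x_1), \dots, f(x_{l+u}))^\top$.

```latex
% Supervised case: regularized risk in an RKHS with kernel K,
% with the representer-theorem expansion of the minimizer.
f^* = \arg\min_{f \in \mathcal{H}_K} \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f)
      + \gamma_A \|f\|_K^2,
\qquad
f^*(x) = \sum_{i=1}^{l} \alpha_i K(x_i, x)

% P known: add an intrinsic penalty reflecting the geometry of the manifold.
f^* = \arg\min_{f \in \mathcal{H}_K} \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f)
      + \gamma_A \|f\|_K^2 + \gamma_I \|f\|_I^2

% P unknown: estimate the intrinsic penalty with the graph Laplacian
% built from the l labeled and u unlabeled points.
f^* = \arg\min_{f \in \mathcal{H}_K} \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f)
      + \gamma_A \|f\|_K^2 + \frac{\gamma_I}{(u+l)^2} \, \mathbf{f}^\top L \, \mathbf{f}

% Squared loss (regularized least squares): closed-form coefficients, where
% J = diag(1, ..., 1, 0, ..., 0) has l ones and Y = (y_1, ..., y_l, 0, ..., 0)^T.
\alpha^* = \left( J K + \gamma_A l \, I + \frac{\gamma_I l}{(u+l)^2} \, L K \right)^{-1} Y
```

With unlabeled points included, the expansion of $f^*$ runs over all $l+u$ points, which is why the last line is an $(l+u) \times (l+u)$ linear system.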

If we disregard the labeled data it becomes:

f* = argmin amb.norm + f'Lf s.t. sum f(x) = 0 and sum f(x)^2 = 1

Which gives the generalized eigen problem

P(amb.K + K.L.K)Pv = lam.P.K^2.P.v
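
In the same assumed notation ($\gamma$ for the `amb` weight, $K$ the kernel matrix over all points, $L$ the graph Laplacian), the unsupervised problem and its eigenproblem can be written as below. The notes do not define `P`; in the usual treatment it is a projection that enforces the zero-sum constraint, which is taken as an assumption here.

```latex
% Unsupervised objective: smoothness in the RKHS plus smoothness on the graph,
% constrained to avoid the trivial constant solution.
f^* = \arg\min_{\substack{\sum_i f(x_i) = 0 \\ \sum_i f(x_i)^2 = 1}}
      \gamma \|f\|_K^2 + \mathbf{f}^\top L \, \mathbf{f}

% Substituting f(x) = \sum_i \alpha_i K(x_i, x) gives the generalized eigenproblem
P (\gamma K + K L K) P \, v = \lambda \, P K^2 P \, v
```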

Monday 21 June 2010

Just a couple of days ago the odds for winning the 10/11 Premier League were published.

Chelsea were favourites at 6/4 followed by Manchester United. I'm not a betting man but I was curious about the numbers. The first thing I noticed was that the odds for other clubs decreased in what looked like a logarithmic fashion. I thought I'd investigate further! The first thing I did was average all the odds from different companies and convert the odds to probabilities to win:

.380228 Chelsea
.294118 Man Utd
.125000 Man City
.111111 Arsenal
.066667 Liverpool
.019608 Tottenham
.003984 Everton
.003984 Aston Villa
.000666 Sunderland
.000500 Fulham
.000400 West Ham
.000400 Birmingham
.000400 Stoke
.000200 Blackburn
.000200 Bolton
.000133 Wigan Athletic
.000133 Wolves

They add up to roughly 1 (about 1.008), which is good (I've omitted the promoted sides).
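
The conversion from odds to probabilities can be sketched as follows, assuming fractional odds written as `a/b` and the usual reading that odds of a/b against imply a win probability of b / (a + b). The function names and the example quotes are hypothetical; the averaging order (average the odds, then convert) follows the description above.

```python
from fractions import Fraction


def odds_ratio(odds):
    """Parse fractional odds such as '6/4' into the exact ratio a/b."""
    a, b = odds.split("/")
    return Fraction(int(a), int(b))


def implied_probability(ratio):
    """Odds of a/b against imply a win probability of 1 / (1 + a/b)."""
    return float(1 / (1 + ratio))


def averaged_probability(quotes):
    """Average the odds across bookmakers first, then convert."""
    ratios = [odds_ratio(q) for q in quotes]
    return implied_probability(sum(ratios) / len(ratios))


print(implied_probability(odds_ratio("6/4")))    # 6/4 -> 0.4
print(implied_probability(odds_ratio("7/1")))    # 7/1 -> 0.125
print(averaged_probability(["6/4", "7/4"]))      # two hypothetical quotes
```

For example, 7/1 gives 0.125 and 250/1 gives about 0.003984, which are figures that appear in the table.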

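Then I added in the finishing positions from the 09/10 season, converted the probabilities to log probabilities, and plotted the two against each other on a scatter plot with a linear best-fit line.

[Figure: scatter plot of log win probability against 09/10 finishing position, with a linear best-fit line]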

There seems to be a clear correlation. Chelsea and Man Utd are below the line, so they are expected to perform slightly less well; Man City and Liverpool are expected to do significantly better than last year, and Aston Villa and Everton worse. Blackburn to get relegated!
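
The best-fit line and the "above or below the line" reading can be reproduced with an ordinary least-squares fit. This is a sketch under assumptions: finishing position is taken as the x axis and log probability as the y axis (the post does not say which way round the plot was), and the function names are hypothetical. The finishing positions are not listed in the post, so the demonstration uses synthetic values.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(positions, probs):
    """Log probability minus the fitted value for each club.

    A positive residual means the odds rate the club higher than its
    previous finishing position alone would suggest (under the assumed axes).
    """
    logs = [math.log(p) for p in probs]
    slope, intercept = fit_line(positions, logs)
    return [y - (slope * x + intercept) for x, y in zip(positions, logs)]


# Synthetic check: probabilities that decay exactly exponentially with
# position lie on the line, so every residual is (numerically) zero.
positions = [1, 2, 3, 4]
probs = [math.exp(-p) for p in positions]
print(residuals(positions, probs))
```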