Wednesday, 26 September 2012

Recently I entered the Kaggle 'Detecting Insults in Social Commentary' competition. I used Andrew McCallum's Mallet toolkit for Java (http://mallet.cs.umass.edu) and the Snowball analyser from the Lucene distribution.

I used Scala at the top level because of its superior functional manipulation of data structures. I used JVM-based tools instead of R or Python because, in my previous experience, they seem more scalable and can easily be turned into a map-reduce implementation.

I finished 5th/118 in the milestone and 8th in the verification set, so I think my approach generalized well. Overall, I think the competition had some issues, and possibly any of the top 10 players could have won it with different parameters.

The evolution of my approach was as follows. I assumed most of the text was English, so I used the Lucene Snowball English analyser, which tokenizes the text and does some stemming. Initially, I tried Naive Bayes and an L2 Maximum Entropy classifier, and started by submitting the hard predictions (0 or 1). The Naive Bayes classifier came close to the benchmark figure of about 0.8. I then tried submitting the predicted probabilities of being in the insult class instead. This improved the AUC score to 0.87.

I then considered an ensemble with other models. The first one I chose was a simple character 3-gram model, on the assumption that this data contained odd character sequences, such as smileys, which would be stripped out by the word parser. This improved the score to 0.89. I then used InfoGain to select the most significant 3-grams, which improved the score to 0.90. I then added in all character n-grams up to length 6, which also helped.

I continued with this approach and added another model of word n-grams to capture frequent patterns such as 'you are a', and again the score improved. The final model consisted of 4 classifiers and had a final score of 0.91423, which was 0.00468 behind the leader. Moreover, I could increase this to 0.9199 by tuning parameters after the deadline had passed.
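
The jump from submitting hard 0/1 predictions to submitting probabilities is what one would expect from a ranking metric like AUC: hard labels tie most comments together, so much of the ranking information is lost. Here is a minimal sketch in Python (chosen for brevity; it is not the Scala/Mallet code used in the competition). The `auc` helper, the labels and the scores are all made up for illustration.

```python
def auc(labels, scores):
    """AUC as the fraction of (insult, non-insult) pairs ranked correctly.

    A tie between an insult and a non-insult counts as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented toy data: 1 = insult, 0 = not an insult.
labels = [1, 1, 0, 0, 0, 0]
probs = [0.9, 0.4, 0.45, 0.3, 0.2, 0.1]

# Thresholding at 0.5 throws away the ordering among the low-scoring comments.
hard = [1 if p >= 0.5 else 0 for p in probs]

print(auc(labels, probs))  # 0.875
print(auc(labels, hard))   # 0.75
```

In this toy case the second insult scores only 0.4, but it still outranks three of the four non-insults. The probabilities get credit for that, while the hard labels reduce it to a four-way tie.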

The final mixing is as follows:

{0.33*Words, 0.17*4..6chargrams, 0.17*1..3chargrams, 0.17*2-wordgrams, 0.17*3-wordgrams}
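
The mix above is a weighted average of each model's predicted insult probability. Here is a minimal sketch, in Python for brevity rather than the Scala used for the competition, assuming each model outputs one probability per comment. The `blend` helper, the model names and the probabilities are hypothetical. Because the listed weights sum to 1.01, the sketch divides by the total weight; that normalization is an assumption, not something stated in the post.

```python
def blend(model_probs, weights):
    """Weighted average of per-model insult probabilities for each comment."""
    total = sum(weights.values())
    n_comments = len(next(iter(model_probs.values())))
    return [
        sum(weights[m] * model_probs[m][i] for m in weights) / total
        for i in range(n_comments)
    ]


# Weights taken from the mix above; the keys are illustrative names.
weights = {
    "words": 0.33,
    "chargrams_4_6": 0.17,
    "chargrams_1_3": 0.17,
    "wordgrams_2": 0.17,
    "wordgrams_3": 0.17,
}

# Made-up probabilities for three comments, one list per model.
model_probs = {
    "words": [0.90, 0.20, 0.60],
    "chargrams_4_6": [0.80, 0.10, 0.70],
    "chargrams_1_3": [0.70, 0.30, 0.40],
    "wordgrams_2": [0.95, 0.15, 0.50],
    "wordgrams_3": [0.85, 0.25, 0.50],
}

print(blend(model_probs, weights))
```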

It's interesting to compare this with a similar result given here: Andreas' blog

Things that did work:

* Weighted sum of different models.
* MaxEnt approaches
* L1 regularization with character n-grams
* Using Info Gain to reduce features improves performance.
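
The character n-gram features and the information-gain selection listed above can be sketched roughly as follows. This is Python for brevity, not the Scala/Mallet code used in the competition. Information gain is computed here from scratch as the reduction in label entropy from knowing whether an n-gram is present, which may differ in detail from Mallet's implementation. The helper names and the toy comments are invented for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=6):
    """Return every character n-gram of length n_min..n_max in text."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def entropy(pos, neg):
    """Entropy in bits of a two-class count split (0 for an empty split)."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def info_gain(docs, labels, n_min=3, n_max=3):
    """Score each n-gram by the information gain of its presence/absence."""
    n_docs = len(docs)
    n_pos = sum(labels)
    base = entropy(n_pos, n_docs - n_pos)
    df = Counter()      # documents containing the n-gram
    df_pos = Counter()  # insult documents containing the n-gram
    for text, y in zip(docs, labels):
        for gram in set(char_ngrams(text, n_min, n_max)):
            df[gram] += 1
            df_pos[gram] += y
    scores = {}
    for gram, with_g in df.items():
        without_g = n_docs - with_g
        pos_with = df_pos[gram]
        pos_without = n_pos - pos_with
        conditional = (
            (with_g / n_docs) * entropy(pos_with, with_g - pos_with)
            + (without_g / n_docs) * entropy(pos_without, without_g - pos_without)
        )
        scores[gram] = base - conditional
    return scores


# Toy comments, invented for illustration (1 = insult, 0 = not an insult).
docs = ["you are a moron :-P", "thanks for the link",
        "you are an idiot", "nice post :)"]
labels = [1, 0, 1, 0]

scores = info_gain(docs, labels)
top = sorted(scores.items(), key=lambda kv: -kv[1])[:5]
print(top)
```

Keeping only the highest-scoring n-grams gives the kind of feature reduction described above. Note that sequences such as `:-P` survive as features here, whereas a word-level tokenizer would typically drop them.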

Things that didn't work:

* Removing the extended characters - the score dropped.
* Naive Bayes - not good for predicting probabilities on an unbalanced set.
* Transductive learning, adding in the unlabelled points.

Things I didn't try:

* Bad word lists