Accuracy and performance of Google's Compact Language Detector

To get a sense of the accuracy and performance of Google's Compact Language Detector (CLD), I ran some tests against two other packages: Apache Tika and language-detection.

For the test corpus I used the corpus described here, created by the author of language-detection. It contains 1000 texts from each of 21 languages, randomly sampled from the Europarl corpus.

It's not a perfect test (no test ever is!): the content is already very clean plain text; there are no domain, language, or encoding hints to apply (which you'd normally have with HTML content loaded over HTTP); and it "only" covers 21 languages (versus at least 76 that CLD can detect).

CLD and language-detection cover all 21 languages, but Tika is missing Bulgarian (bg), Czech (cs), Lithuanian (lt) and Latvian (lv), so I only tested on the remaining subset of 17 languages that all three detectors support. This works out to 17,000 texts totalling 2.8 MB.
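As a rough illustration of that filtering step (not the code used for these tests), here is a minimal Python sketch. The 21 language codes are the 17 that appear in the results below plus the four Tika lacks; the function names and the (language, text) sample format are assumptions made for the example.

```python
# The 21 corpus languages: the 17 tested ones plus the 4 that Tika lacks.
CORPUS_LANGS = {
    "bg", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "hu",
    "it", "lt", "lv", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
}
TIKA_MISSING = {"bg", "cs", "lt", "lv"}


def common_languages(corpus_langs, missing):
    """Return the sorted language codes that every detector supports."""
    return sorted(corpus_langs - missing)


def keep_common(samples, langs):
    """Keep only (language, text) pairs whose language is in langs.

    The (language, text) tuple layout is a hypothetical format.
    """
    allowed = set(langs)
    return [(lang, text) for lang, text in samples if lang in allowed]


COMMON = common_languages(CORPUS_LANGS, TIKA_MISSING)
print(len(COMMON))  # 21 - 4 = 17 languages remain
```

With 1000 texts per language, the 17 remaining languages give the 17,000 texts used in the test.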

Many of the texts are very short, making the test challenging: the shortest is 25 bytes, and 290 (1.7%) of the 17000 are 30 bytes or less.
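Statistics like these are easy to reproduce on any list of texts. The sketch below is an illustration only: it assumes sizes are measured as UTF-8 bytes (the encoding is not stated above), and the function name is hypothetical.

```python
def short_text_stats(texts, threshold=30):
    """Return (shortest size, count at or below threshold, percentage).

    Sizes are measured in UTF-8 bytes, which is an assumption here.
    """
    sizes = [len(t.encode("utf-8")) for t in texts]
    short = sum(1 for s in sizes if s <= threshold)
    return min(sizes), short, 100.0 * short / len(sizes)


# Tiny made-up sample: sizes are 25, 30, 31 and 40 bytes.
sample = ["a" * 25, "b" * 30, "c" * 31, "é" * 20]
print(short_text_stats(sample))
```

For the corpus here, the percentage is just 290 / 17000, which rounds to 1.7%.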

In addition to the challenges of the corpus, the differences in the detectors make the comparison somewhat apples to oranges. For example, CLD detects at least 76 languages, while language-detection detects 53 and Tika detects 27, so this biases against CLD, and language-detection to a lesser extent, since their classification task is harder relative to Tika's.

For CLD, I disabled its option to abstain (removeWeakMatches), so that it always guesses at the language even when confidence is low, to match the other two detectors. I also turned off pickSummaryLanguage, as it was hurting accuracy too; now CLD simply picks the highest scoring match as the detected language.

For language-detection, I ran with the default ALPHA of 0.5, and set the random seed to 0.

Here are the raw results:

CLD results (total 98.82% = 16800 / 17000):
     da  93.4%   da=934  nb=54  sv=5  fr=2  eu=2  is=1  hr=1  en=1  
     de  99.6%   de=996  en=2  ga=1  cy=1          
     el  100.0%   el=1000                
     en  100.0%   en=1000                
     es  98.3%   es=983  pt=4  gl=3  en=3  it=2  eu=2  id=1  fi=1  da=1
     et  99.6%   et=996  ro=1  id=1  fi=1  en=1        
     fi  100.0%   fi=1000                
     fr  99.2%   fr=992  en=4  sq=2  de=1  ca=1        
     hu  99.9%   hu=999  it=1              
     it  99.5%   it=995  ro=1  mt=1  id=1  fr=1  eu=1      
     nl  99.5%   nl=995  af=3  sv=1  et=1          
     pl  99.6%   pl=996  tr=1  sw=1  nb=1  en=1        
     pt  98.7%   pt=987  gl=4  es=3  mt=1  it=1  is=1  ht=1  fi=1  en=1
     ro  99.8%   ro=998  da=1  ca=1            
     sk  98.8%   sk=988  cs=9  en=2  de=1          
     sl  95.1%   sl=951  hr=32  sr=8  sk=5  en=2  id=1  cs=1    
     sv  99.0%   sv=990  nb=9  en=1            

Tika results (total 97.12% = 16510 / 17000):
     da  87.6%   da=876  no=112  nl=4  sv=3  it=1  fr=1  et=1  en=1  de=1        
     de  98.5%   de=985  nl=3  it=3  da=3  sv=2  fr=2  sl=1  ca=1          
     el  100.0%   el=1000                        
     en  96.9%   en=969  no=10  it=6  ro=4  sk=3  fr=3  hu=2  et=2  sv=1        
     es  89.8%   es=898  gl=47  pt=22  ca=15  it=6  eo=4  fr=3  fi=2  sk=1  nl=1  et=1    
     et  99.1%   et=991  fi=4  fr=2  sl=1  no=1  ca=1              
     fi  99.4%   fi=994  et=5  hu=1                    
     fr  98.0%   fr=980  sl=6  eo=3  et=2  sk=1  ro=1  no=1  it=1  gl=1  fi=1  es=1  de=1  ca=1
     hu  99.9%   hu=999  ca=1                      
     it  99.4%   it=994  eo=4  pt=1  fr=1                  
     nl  97.8%   nl=978  no=8  de=3  da=3  sl=2  ro=2  pl=1  it=1  gl=1  et=1      
     pl  99.1%   pl=991  sl=3  sk=2  ro=1  it=1  hu=1  fi=1            
     pt  94.4%   pt=944  gl=48  hu=2  ca=2  it=1  et=1  es=1  en=1          
     ro  99.3%   ro=993  is=2  sl=1  pl=1  it=1  hu=1  fr=1            
     sk  96.2%   sk=962  sl=21  pl=13  it=2  ro=1  et=1              
     sl  98.5%   sl=985  sk=7  et=4  it=2  pt=1  no=1              
     sv  97.1%   sv=971  no=15  nl=6  da=6  de=1  ca=1              

Language-detection results (total 99.22% = 16868 / 17000):
     da  97.1%   da=971  no=28  en=1      
     de  99.8%   de=998  da=1  af=1      
     el  100.0%   el=1000          
     en  99.7%   en=997  nl=1  fr=1  af=1    
     es  99.5%   es=995  pt=4  en=1      
     et  99.6%   et=996  fi=2  de=1  af=1    
     fi  99.8%   fi=998  et=2        
     fr  99.8%   fr=998  sv=1  it=1      
     hu  99.9%   hu=999  id=1        
     it  99.8%   it=998  es=2        
     nl  97.7%   nl=977  af=21  sv=1  de=1    
     pl  99.9%   pl=999  nl=1        
     pt  99.4%   pt=994  es=3  it=1  hu=1  en=1  
     ro  99.9%   ro=999  fr=1        
     sk  98.7%   sk=987  cs=8  sl=2  ro=1  lt=1  et=1
     sl  97.2%   sl=972  hr=27  en=1      
     sv  99.0%   sv=990  no=8  da=2      

Some quick analysis:

  • The language-detection library gets the best accuracy, at 99.22%, followed by CLD, at 98.82%, followed by Tika at 97.12%. Net/net these accuracies are very good, especially considering how short some of the texts are!
  • The difficult languages are Danish (confused with Norwegian), Slovene (confused with Croatian) and Dutch (for Tika and language-detection). Tika in particular has trouble with Spanish (confuses it with Galician). These confusions are to be expected: the languages are very similar.
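The totals and the main confusions can be read straight off the raw result rows. Here is a small sketch that parses rows in the layout shown above (expected language, accuracy, then lang=count pairs); the function names are hypothetical and this is not the script that produced the tables.

```python
def parse_row(line):
    """Split a result row into (expected language, {detected: count})."""
    fields = line.split()
    expected = fields[0]
    # fields[1] is the printed percentage; the rest are lang=count pairs.
    counts = {}
    for field in fields[2:]:
        lang, n = field.split("=")
        counts[lang] = int(n)
    return expected, counts


def summarize(rows):
    """Return (correct, total, {expected: (top wrong language, count)})."""
    correct = total = 0
    confusions = {}
    for line in rows:
        expected, counts = parse_row(line)
        correct += counts.get(expected, 0)
        total += sum(counts.values())
        wrong = {l: n for l, n in counts.items() if l != expected}
        if wrong:
            confusions[expected] = max(wrong.items(), key=lambda kv: kv[1])
    return correct, total, confusions


# Three rows copied from the CLD results above.
rows = [
    "da  93.4%   da=934  nb=54  sv=5  fr=2  eu=2  is=1  hr=1  en=1",
    "sl  95.1%   sl=951  hr=32  sr=8  sk=5  en=2  id=1  cs=1",
    "el  100.0%   el=1000",
]
correct, total, confusions = summarize(rows)
print(correct, total, confusions)
```

On these three rows the top confusions come out as Danish with Norwegian (nb) and Slovene with Croatian (hr), matching the observation above.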

When language-detection was wrong, Tika was also wrong 37% of the time and CLD was also wrong 23% of the time. These numbers are quite low! It tells us that the errors are somewhat orthogonal, i.e. the libraries tend to get different test cases wrong. For example, it's not the case that they are all always wrong on the short texts.
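To make the overlap figure concrete: language-detection made 17000 - 16868 = 132 errors, so 37% and 23% correspond to roughly 49 and 30 shared errors. Below is a minimal sketch of how such a conditional error rate could be computed from per-text predictions. The parallel-list layout and the function name are assumptions, and the sample data is made up.

```python
def overlap_when_wrong(gold, primary, other):
    """Fraction of primary's errors on which other is also wrong.

    gold, primary and other are parallel lists of language codes,
    one entry per test text (a hypothetical layout).
    """
    primary_wrong = [i for i, (g, p) in enumerate(zip(gold, primary)) if g != p]
    if not primary_wrong:
        return 0.0
    both = sum(1 for i in primary_wrong if other[i] != gold[i])
    return both / len(primary_wrong)


# Made-up example: primary is wrong on texts 0, 1, 2; other on texts 1, 3.
gold = ["da", "da", "sl", "nl"]
primary = ["no", "no", "hr", "nl"]
other = ["da", "no", "sl", "af"]
print(overlap_when_wrong(gold, primary, other))  # one shared error out of three
```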

This means the libraries are using different overall signals to achieve their classification (for example, perhaps they were trained on different training texts). This is encouraging since it means, in theory, one could build a language detection library combining the signals of all of these libraries and achieve better overall accuracy.

You could also make a simple majority-rules voting system across these (and other) libraries. I tried exactly that approach: if any language receives 2 or more votes from the three detectors, select that as the detected language; otherwise, go with language-detection's choice. This gives the best accuracy of all: total 99.59% (= 16930 / 17000)!
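The voting rule fits in a few lines. This is a sketch of the rule as described, not the original test code; the function name is hypothetical, and it assumes the three detectors' language codes have already been made comparable (for example, CLD reports Norwegian as nb while Tika and language-detection report no in the tables above).

```python
from collections import Counter


def vote(cld, tika, langdetect):
    """Majority vote over three detectors, falling back to language-detection.

    Each argument is the language code one detector returned for a text.
    """
    lang, votes = Counter([cld, tika, langdetect]).most_common(1)[0]
    # With three voters, at most one language can reach 2 or more votes.
    return lang if votes >= 2 else langdetect


print(vote("da", "da", "no"))  # two votes for da win
print(vote("nb", "no", "da"))  # no agreement, so language-detection's da is used
```

Falling back to language-detection when all three disagree makes sense because it had the best stand-alone accuracy.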

Finally, I also separately tested the run time for each package. Each time is the best of 10 runs through the full corpus:

Detector            Time (msec)  Throughput (MB/sec)
CLD                         171               16.331
language-detection         2367                1.180
Tika                      42219                0.066

CLD is incredibly fast! language-detection is an order of magnitude slower, and Tika is another order of magnitude slower (not sure why).
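For reference, a best-of-N timing loop and the throughput arithmetic might look like the sketch below. It is an illustration rather than the harness used here: the function names are hypothetical, run stands for one full pass of a detector over the corpus, and 1 MB is assumed to be 1024 * 1024 bytes.

```python
import time


def best_of(run, repeats=10):
    """Call run() repeats times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        best = min(best, time.perf_counter() - start)
    return best


def throughput_mb_per_sec(total_bytes, seconds):
    """Throughput in MB/sec, assuming 1 MB = 1024 * 1024 bytes."""
    return total_bytes / (1024 * 1024) / seconds


# Placeholder workload standing in for one pass over the corpus.
elapsed = best_of(lambda: sum(range(1000)), repeats=10)
print(elapsed)
print(throughput_mb_per_sec(2 * 1024 * 1024, 0.5))  # 2 MB in half a second
```

Taking the best of several runs reduces noise from warm-up and other activity on the machine. As a sanity check on the table, 2.8 MB in 171 msec works out to a bit over 16 MB/sec.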

I used the 09-13-2011 release of language-detection, the current trunk (svn revision 1187915) of Apache Tika, and the current trunk (hg revision b0adee43f3b1) of CLD. All sources for the performance tests are available from here.

Source: http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

