Language Detection Shiny App

Sébastien Pujadas
26th February 2017

Overview

This app estimates which language(s) a user-supplied text is written in, then lets the user select one of the top-scoring languages to visualise word frequencies in the text and in that language's corpus.
All the plots in the app are interactive.

Supported languages (and corresponding two-letter ISO 639-1 codes)

Afrikaans (af), Breton (br), Bosnian (bs), Catalan (ca), Czech (cs), Danish (da), German (de), English (en), Esperanto (eo), Spanish (es), Estonian (et), Basque (eu), Finnish (fi), French (fr), Galician (gl), Croatian (hr), Hungarian (hu), Indonesian (id), Icelandic (is), Italian (it), Lithuanian (lt), Latvian (lv), Malay (ms), Dutch (nl), Norwegian (no), Polish (pl), Portuguese (pt), Romanian (ro), Slovak (sk), Slovene (sl), Albanian (sq), Serbian (sr), Swedish (sv), Tagalog (tl), Turkish (tr)

Data source for the language corpora

2016 OpenSubtitles Frequency Word Lists

Source code

Published on GitHub at https://github.com/spujadas/coursera-ddp-shiny

Scoring the text

For each supported language, a score between 0 and 1 is assigned using a simple algorithm: the relative frequency of each word in the text is calculated, and then the frequencies of the words that also appear among the n (default: 500) most frequent words of that language are added together.
- For Bosnian, Croatian, and Serbian, the number of words used to detect the language should be increased, as the default 500 words may not be enough to discriminate between these three similar languages.
The highest scores indicate the languages that the text is probably written in.
The top 10 scores are shown as a bar graph in the app.
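
The scoring idea described above can be sketched in a few lines. This is a minimal illustration in Python, not the app's R code: the function names, the tokenisation rule (lower-cased runs of letters) and the toy word lists are all assumptions made for the example.

```python
import re
from collections import Counter


def language_score(text, ranked_words, top_n=500):
    """Return the share (0 to 1) of the text's word occurrences that are
    among the top_n most frequent words of one language.

    ranked_words is assumed to list that language's corpus words from most
    to least frequent (hypothetical input format).
    """
    # Assumption: words are lower-cased runs of letters.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return 0.0
    top = set(ranked_words[:top_n])
    counts = Counter(words)
    # Adding up the relative frequencies of the matching words is the same
    # as dividing the number of matching occurrences by the text length.
    matched = sum(n for word, n in counts.items() if word in top)
    return matched / len(words)


def rank_languages(text, corpora, top_n=500):
    """Score the text against every language, highest score first."""
    scores = {
        lang: language_score(text, ranked, top_n)
        for lang, ranked in corpora.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy word lists made up for illustration (not the OpenSubtitles data).
corpora = {
    "af": ["die", "is", "en", "op", "het"],
    "en": ["the", "is", "and", "on", "have"],
}
ranking = rank_languages("Die kat is op die mat", corpora)
print(ranking)
```

In this toy run, four of the six words in the text are in the Afrikaans list and only one is in the English list, so Afrikaans ranks first. Raising `top_n` corresponds to increasing the number of words used for detection, as suggested above for Bosnian, Croatian, and Serbian.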

Example for the text Oom Gert Vertel en Ander Gedigte:

   language      score                   langName
1        af 0.69075677      Afrikaans (Afrikaans)
24       nl 0.48370466 Dutch (Nederlands, Vlaams)
8        en 0.17943289                    English
7        de 0.16350580           German (Deutsch)
4        ca 0.12437857           Catalan (català)
2        br 0.12378015         Breton (brezhoneg)
15       gl 0.10688639          Galician (galego)
33       sv 0.10127048          Swedish (svenska)
6        da 0.09514822             Danish (dansk)
14       fr 0.08833548          French (français)

The highest-scoring language is Afrikaans, which is indeed the language the text is written in.

Properties of the word frequencies in the corpus and in the text

In the language that the text is written in, the frequencies of the words in the corpus and in the text tend to be similar, especially for the most frequent words, as can be seen in the word frequencies plot.
- In the actual language of the example text (leftmost plot), there are many points, and they tend to be grouped around the line of equal frequencies, especially at higher frequencies.
- In the next two best candidate languages, there are fewer points and they are more scattered.

[Figure: word frequencies in the corpus versus in the example text, for the three top-scoring languages]
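
The two observations about the word frequencies plot (number of points, and how closely they follow the line of equal frequencies) can be put into numbers. The Python sketch below is an illustration only: the function name, the use of a mean absolute log ratio as a measure of scatter, and the toy frequencies are assumptions, not something the app computes.

```python
import math


def compare_frequencies(text_freq, corpus_freq):
    """Pair up the frequencies of the words found in both the text and the corpus.

    Both arguments are assumed to map words to strictly positive relative
    frequencies. Returns the paired points and their mean absolute log10
    ratio, which is 0 when every point lies on the line of equal frequencies.
    """
    shared = sorted(set(text_freq) & set(corpus_freq))
    points = [(word, text_freq[word], corpus_freq[word]) for word in shared]
    if not points:
        return points, None
    spread = sum(abs(math.log10(t / c)) for _, t, c in points) / len(points)
    return points, spread


# Toy frequencies made up for illustration.
text_freq = {"die": 0.08, "is": 0.03, "kat": 0.01}
matching_corpus = {"die": 0.07, "is": 0.03, "en": 0.02}
other_corpus = {"die": 0.0008, "the": 0.05}

close_points, close_spread = compare_frequencies(text_freq, matching_corpus)
far_points, far_spread = compare_frequencies(text_freq, other_corpus)
print(len(close_points), close_spread)
print(len(far_points), far_spread)
```

With these made-up values, the matching corpus shares more words with the text and yields a much smaller spread than the other corpus, mirroring the difference between the leftmost plot and the other two.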

Zipf's law states that in a natural language, the frequency of any word is approximately inversely proportional to its rank in the frequency table.
- The leftmost part of the plot below shows the word frequencies in the previous example text and in the Afrikaans corpus, according to their rank: the distributions of the frequencies are similar.
- The rightmost part of the plot replaces the example text with an artificial text consisting of the 500 most frequent words in Afrikaans, each used exactly once: the distributions are completely different.

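[Figure: word frequencies by rank in the Afrikaans corpus, compared with the example text (left) and with an artificial text made of the 500 most frequent Afrikaans words (right)]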
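
The contrast between a Zipf-like text and the artificial text can be reproduced with a small Python sketch. The function name and both toy texts are assumptions made for illustration: real texts only follow Zipf's law approximately, whereas the first toy text is built to follow it exactly.

```python
from collections import Counter


def rank_frequencies(words):
    """Return the relative word frequencies, ordered from rank 1 downwards."""
    counts = Counter(words)
    total = sum(counts.values())
    return [n / total for _, n in counts.most_common()]


# Toy Zipf-like text: the word of rank r occurs 60 // r times (60, 30, 20, 15, 12).
zipf_like = [f"w{r}" for r in range(1, 6) for _ in range(60 // r)]
# Artificial text: 500 distinct words, each used exactly once.
flat = [f"w{r}" for r in range(1, 501)]

zipf_freqs = rank_frequencies(zipf_like)
flat_freqs = rank_frequencies(flat)

# Under Zipf's law, frequency * rank is roughly constant.
print([round(f * r, 3) for r, f in enumerate(zipf_freqs, start=1)])
# In the artificial text, every word has the same frequency whatever its rank.
print(set(flat_freqs))
```

For the Zipf-like text the product of frequency and rank is the same at every rank, while the artificial text has a single flat frequency of 1/500, which is why its distribution looks nothing like that of a corpus.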

Going further

Scoring algorithm

The algorithm used in this app obtains accurate results; however, the scoring function¹ could be improved to:
- Further discriminate between similar languages (e.g. Bosnian, Croatian, and Serbian)
- Support languages that use non-Latin scripts (e.g. Chinese, Arabic, Hindi)

Corpora

The corpora for the supported languages are based on subtitles of TV series and films, and are therefore biased towards the spoken form of the languages.

For instance, first- and second-person pronouns such as “I” or “you” appear more frequently than in traditional written works.

Additional or alternative corpora could be created² to handle other forms of the languages (e.g. non-fiction and fiction written works, online forums and chats) and obtain better results depending on the type of input that is fed to the app.

¹ languageScore() function in the language-detection.R source file.
² The comments accompanying the readLanguageFrequencyData() function describe the structure of the corpus files.