Sébastien Pujadas
26th February 2017
This app determines the language(s) that a user-supplied text is possibly written in, and then enables the user to select one of the top-scoring languages to visualise word frequencies in the text and in the language's corpus
All the plots in the app are interactive
Supported languages (and corresponding two-letter ISO 639-1 code)
Afrikaans (af
), Breton (br
), Bosnian (bs
), Catalan (ca
), Czech (cs
), Danish (da
), German (de
), English (en
), Esperanto (eo
), Spanish (es
), Estonian (et
), Basque (eu
), Finnish (fi
), French (fr
), Galician (gl
), Croatian (hr
), Hungarian (hu
), Indonesian (id
), Icelandic (is
), Italian (it
), Lithuanian (lt
), Latvian (lv
), Malay (ms
), Dutch (nl
), Norwegian (no
), Polish (pl
), Portuguese (pt
), Romanian (ro
), Slovak (sk
), Slovene (sl
), Albanian (sq
), Serbian (sr
), Swedish (sv
), Tagalog (tl
), Turkish (tr
)
Data source for the language corpora
Source code
Published on GitHub at https://github.com/spujadas/coursera-ddp-shiny
For each supported language, a score between 0 and 1 is assigned using a simple algorithm: the frequency of all the words in the text is calculated, and then the frequencies of the words that also appear in the top n (default: 500) most frequent words of the supported languages are added together
The highest scores determine the languages that the text is probably written in
The top 10 scores are shown as a bar graph in the app
Example for the text Oom Gert Vertel en Ander Gedigte:
language score langName
1 af 0.69075677 Afrikaans (Afrikaans)
24 nl 0.48370466 Dutch (Nederlands, Vlaams)
8 en 0.17943289 English
7 de 0.16350580 German (Deutsch)
4 ca 0.12437857 Catalan (català)
2 br 0.12378015 Breton (brezhoneg)
15 gl 0.10688639 Galician (galego)
33 sv 0.10127048 Swedish (svenska)
6 da 0.09514822 Danish (dansk)
14 fr 0.08833548 French (français)
The highest scoring language is Afrikaans, and the text is indeed actually written in Afrikaans.
In the language that the text is written in, the frequencies of the words in the corpus and in the text tend to be similar, especially for the most frequent words, as can be seen in the word frequencies plot
Zipf's law states that in a natural language, the frequency of any word is inversely proportional to its rank in the frequency table
Scoring algorithm
The algorithm used in this app obtains accurate results, the scoring function1 could however be improved to:
Corpora
The corpora for the supported languages are based on subtitles of TV series and films, and are therefore biased towards the spoken form of the languages.
For instance, first and second person pronouns such as “I” or “you” appear more frequently than in traditional written works
Additional or alternative corpora could be created2 to handle other forms of the languages (e.g. non-fiction and fiction written works, online forums and chats) and obtain better results depending on the type of input that is fed to the app.
languageScore()
function in the language-detection.R
source file.
readLanguageFrequencyData()
function decribe the structure of the corpus files.