In the first article we created a simple tokenizer. Today we’re going to create a language detector that identifies Toki Pona text among texts in other languages.
First I want to say that there are at least a few good Ruby libraries for detecting natural languages. But those target mainstream languages: French, English, German… We want Toki Pona! Also, since we are focused on Toki Pona only, we can get much more precise results.
My initial idea was dead simple: since there are only 125 Toki Pona words, we can simply check whether a token is a Toki Pona word or something else. Then it’s easy to calculate a density of Toki Pona words in a given phrase and compare it against some threshold, where 0 < threshold <= 1.
Here is how such an algorithm would work with threshold = 0.75 on the phrase “mi moka e kala suli”: four of the five tokens (mi, e, kala, suli) are Toki Pona words, so the density is 4/5 = 0.8, which is above the threshold, and the phrase is classified as Toki Pona.
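A sketch of this naive detector in Ruby (WORDS below is a small illustrative subset of the 125-word vocabulary, not the full list):

```ruby
# Naive detection: density of known Toki Pona words vs. a threshold.
# WORDS is an illustrative subset of the vocabulary.
WORDS = %w[mi moku e kala suli pona].freeze
THRESHOLD = 0.75

def toki_pona?(text)
  tokens = text.downcase.scan(/[a-z]+/)
  return false if tokens.empty?

  toki_count = tokens.count { |t| WORDS.include?(t) }
  density = toki_count.to_f / tokens.size
  density >= THRESHOLD
end

toki_pona?("mi moka e kala suli") # 4 of 5 tokens match => density 0.8 => true
```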
The phrase has a typo: instead of the word moka there should be the word moku. “mi moku e kala suli” means “I am eating a big fish”.
I decided to make the algorithm face real data: I picked ~100 random messages from #tokipona IRC channels and split them into three groups:
- Messages in Toki Pona
- Messages in other languages (mostly English)
- Mixed messages (half Toki Pona and half English)
I wrote a spec: I expected LanguageDetector.toki_pona?(text) to return true for the first group of messages and false for the others. After playing with the value of the threshold, I made it work for almost all cases.
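In essence the spec asserted the following; the detector and the sample messages here are simplified stand-ins for the real code and data:

```ruby
require "set"

# Simplified stand-in for the real LanguageDetector
# (illustrative vocabulary subset and threshold).
class LanguageDetector
  WORDS = Set.new(%w[mi sina moku pona e kala suli])
  THRESHOLD = 0.75

  def self.toki_pona?(text)
    words = text.downcase.scan(/[a-z]+/)
    return false if words.empty?

    words.count { |w| WORDS.include?(w) }.fdiv(words.size) >= THRESHOLD
  end
end

# The spec's expectations, in essence:
["mi moku e kala suli", "sina pona"].each do |msg|
  raise "expected Toki Pona: #{msg}" unless LanguageDetector.toki_pona?(msg)
end
["what does that even mean"].each do |msg|
  raise "expected not Toki Pona: #{msg}" if LanguageDetector.toki_pona?(msg)
end
```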
One of the failing messages was a pure Toki Pona phrase followed by an “xD” smile. It’s obviously pure Toki Pona, so what’s wrong with it?
The issue was in the Tokenizer that we had implemented in the previous article. For this message it returned 4 tokens: ["Moku", "pona", "x", "D"]. While Moku and pona belong to the Toki Pona vocabulary, x and D do not. We, as humans, can see that xD must actually be a single token, and that it’s not a regular word but a smile. So I had to update the Tokenizer to distinguish words from smiles.
That’s the point where I had to introduce the difference between tokens and lexemes.
As the famous Dragon Book says about lexemes:
A lexeme is a sequence of characters in the source program that matches the pattern
for a token and is identified by the lexical analyzer as an instance of that token.
And about tokens:
A token is a pair consisting of a token name and an optional attribute value.
The token name is an abstract symbol representing a kind of lexical unit, e.g.,
a particular keyword, or sequence of input characters denoting an identifier.
The token names are the input symbols that the parser processes.
So I’ve updated the Tokenizer to return an array of hashes, each holding a token type and a lexeme. For now there are 3 types of tokens: word, smile and punctuation.
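Assuming :type and :lexeme keys (my naming, not necessarily the original one), the output for a message like “Moku pona! xD” might look like this:

```ruby
# Assumed token format; the :type/:lexeme key names are illustrative.
tokens = [
  { type: :word,        lexeme: "Moku" },
  { type: :word,        lexeme: "pona" },
  { type: :punctuation, lexeme: "!"    },
  { type: :smile,       lexeme: "xD"   }
]
```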
The implementation of the Tokenizer was reworked accordingly.
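A sketch of how such a Tokenizer could look; the regular expressions and the :type/:lexeme keys are assumptions based on the description, not the exact original code:

```ruby
# Sketch of a lexeme-based Tokenizer. Smiles have to be matched before
# words, since a smile like "xD" would otherwise match WORD_REGEXP;
# whatever is neither whitespace, a smile, nor a word is punctuation.
class Tokenizer
  SPACE_REGEXP       = /\A\s+/
  SMILE_REGEXP       = /\A(?:[:;=8][-']?[()DPp]|[xX]D)/
  WORD_REGEXP        = /\A[a-zA-Z]+/
  PUNCTUATION_REGEXP = /\A[^a-zA-Z\s]+/

  def self.tokenize(text)
    tokens = []
    until text.empty?
      type, lexeme = next_lexeme(text)
      tokens << { type: type, lexeme: lexeme } unless type == :space
      text = text[lexeme.size..-1]
    end
    tokens
  end

  # The order of the checks defines the priority of the token types.
  def self.next_lexeme(text)
    if    (lexeme = text[SPACE_REGEXP]) then [:space, lexeme]
    elsif (lexeme = text[SMILE_REGEXP]) then [:smile, lexeme]
    elsif (lexeme = text[WORD_REGEXP])  then [:word, lexeme]
    else  [:punctuation, text[PUNCTUATION_REGEXP]]
    end
  end
end

Tokenizer.tokenize("Moku pona! xD")
# => [{type: :word, lexeme: "Moku"}, {type: :word, lexeme: "pona"},
#     {type: :punctuation, lexeme: "!"}, {type: :smile, lexeme: "xD"}]
```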
Now the tokenization process consists of two stages: detecting lexemes and assigning tokens to those lexemes. Note that the order in which the regular expressions are tried is important: since WORD_REGEXP and SMILE_REGEXP can overlap (e.g. on “XD”), SMILE_REGEXP has to have higher priority. The same goes for punctuation: everything that is not a smile or a word is considered punctuation.
So I updated the language detection algorithm to count only word tokens. Still, there are a number of things to improve.
What if some words are meant to be Toki Pona words but they contain a typo? Actually we can detect them and adjust the algorithm to count them as well.
So to detect a word with a typo we need to somehow measure word similarity. What we need is called the Levenshtein distance.
Levenshtein distance between two words is the minimum number of single-character
edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
Levenshtein distance is used in computer science (e.g. for spell checkers), in genetics (comparison of genes) and likely in some other areas. The algorithm isn’t the simplest one, but it’s not hard to understand, so I encourage you to take a look at Wikipedia to understand it better.
Now back to our case: how different are the words moka and moku? The Levenshtein distance between them is 1, because only one edit (substituting a with u) needs to be performed to turn moka into moku.
We can implement it ourselves based on the algorithm (actually I did it for fun, but then replaced it with this implementation) or google for existing solutions. One of the surprises was to find it in the rubygems gem itself. I guess it’s used when one mistypes the name of a gem, to power the “Did you mean …?” feature.
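For reference, here is a straightforward dynamic-programming implementation of the Levenshtein distance (a sketch, not the exact code used in the project):

```ruby
# Classic dynamic-programming Levenshtein distance.
# d[i][j] = distance between the first i chars of a and the first j chars of b.
def levenshtein(a, b)
  d = Array.new(a.size + 1) { Array.new(b.size + 1, 0) }
  (0..a.size).each { |i| d[i][0] = i }
  (0..b.size).each { |j| d[0][j] = j }

  (1..a.size).each do |i|
    (1..b.size).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      d[i][j] = [
        d[i - 1][j] + 1,        # deletion
        d[i][j - 1] + 1,        # insertion
        d[i - 1][j - 1] + cost  # substitution
      ].min
    end
  end
  d[a.size][b.size]
end

levenshtein("moka", "moku") # => 1
```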
Updating the detection algorithm
Now we can adjust the LanguageDetector to be not so strict about typos, and give a word a score of 0.5 if it has a Levenshtein distance of 1 to one of the Toki Pona words. Considering the previous example with the phrase “mi moka e kala suli”, the density will be (1 + 0.5 + 1 + 1 + 1) / 5 = 0.9.
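A sketch of the adjusted scoring, again with an illustrative subset of the vocabulary, and with a Levenshtein implementation inlined to keep the example self-contained:

```ruby
# Typo-tolerant density: 1.0 for an exact Toki Pona word,
# 0.5 for a word at Levenshtein distance 1 from one.
# WORDS is an illustrative subset of the vocabulary.
WORDS = %w[mi moku e kala suli pona].freeze

# Minimal Levenshtein distance (inlined for self-containment).
def distance(a, b)
  d = Array.new(a.size + 1) { |i| [i] + [0] * b.size }
  (0..b.size).each { |j| d[0][j] = j }
  (1..a.size).each do |i|
    (1..b.size).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      d[i][j] = [d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost].min
    end
  end
  d[a.size][b.size]
end

def word_score(word)
  return 1.0 if WORDS.include?(word)
  return 0.5 if WORDS.any? { |w| distance(word, w) == 1 }
  0.0
end

def density(text)
  words = text.downcase.scan(/[a-z]+/)
  return 0.0 if words.empty?

  words.sum { |w| word_score(w) } / words.size
end

density("mi moka e kala suli") # => 0.9
```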
In reality, apart from the mentioned 125 words, the Toki Pona vocabulary includes names of languages, countries and cities, so the real detector is slightly more complicated. You can check it here.
By the way, the entire implementation of what is described here can be found in the project on GitHub: greyblake/tokipona.
That’s it for now! In the next part we will try to implement a grammar and a parser for Toki Pona!