In the first version of this post I showed how to build a simple search engine in 100 lines of Python. I'll build on that to create a highly efficient full-text search library in Rust.
In my last blog post I described how to build a basic search engine from scratch. In this post I’ll go into one part of that process, which is converting the documents into tokens.
Our goal is to explore the possible throughput of the tokenisation part of the pipeline.
When we are working with tokenisation we are in the domain of NLP. We’re processing natural language.
Let's first assume that words are the only tokens we can have.
What does “a word” mean?
Processing text seems easy at first. You just split on anything that’s not a letter. 99% of the time that works as expected, but there are a lot of cases where it does not.
For now let's assume that we're processing exclusively English text. A language like Norwegian has a lot of quirks, like the frequent compounding of words, that make it harder, but in English the main problem is phrases like “we're”. If you split that with the rules above you end up with [we, re], which is not optimal. Fortunately we don't really care about either of those tokens when searching (because they don't mean much), but in general NLP that is a problem.
The second issue is that “cars” and “car” are fundamentally the same word, but since text search looks for tokens that match exactly, it's not something we can easily handle. There is a quirk here: with a Finite State Transducer you can efficiently find the words within a given edit distance of a query, so non-exact matches are possible. At the same time, it's not as efficient as exact matching.
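To make “edit distance” concrete, here is the classic dynamic-programming Levenshtein distance, the metric that fuzzy matching is based on. This is only an illustration of the metric itself; an FST-backed index finds matches within a distance bound without comparing every pair of words like this:

```rust
// Levenshtein distance between two words: the minimum number of
// single-character insertions, deletions, and substitutions needed
// to turn one into the other. O(len(a) * len(b)) time.
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev holds the previous row of the DP table.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            cur.push((prev[j] + cost) // substitution (or match)
                .min(prev[j + 1] + 1) // deletion
                .min(cur[j] + 1));    // insertion
        }
        prev = cur;
    }
    prev[b.len()]
}

fn main() {
    println!("{}", edit_distance("cars", "car")); // 1
}
```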
The third issue is that the sentence “i was in san francisco” is very similar to “i am in san francisco”, but the tokens end up being completely different. Mapping words like “was” and “am” to their base form “be” is called lemmatisation. Generally it's not that useful for search, but it's something to keep in mind. We will focus on tokenisation and stemming algorithms.
A last issue is words that don't really mean much. Take “the”: it's the most common word in English, but it provides almost no value for search. We can add a stop word filter to remove it. Removing it has some downsides, like no longer being able to match exact phrases containing “the”, but for a lot of cases it makes sense. Stop word lists can also include profanities.
Implementing an efficient library for text tokenisation
Let's first consider the interface of the library. We may want to split words in a certain way, then stem them, then remove stop words. This is where the concept of a tokenisation pipeline comes in.
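As a sketch of what such an interface could look like (the trait and type names here are illustrative, not from an existing library):

```rust
// A hypothetical pipeline interface: each stage transforms the token
// stream, and stages run in order. Names are illustrative only.
trait TokenFilter {
    fn apply(&self, tokens: Vec<String>) -> Vec<String>;
}

// One simple stage: lowercase every token.
struct Lowercase;

impl TokenFilter for Lowercase {
    fn apply(&self, tokens: Vec<String>) -> Vec<String> {
        tokens.into_iter().map(|t| t.to_lowercase()).collect()
    }
}

struct Pipeline {
    filters: Vec<Box<dyn TokenFilter>>,
}

impl Pipeline {
    fn run(&self, text: &str) -> Vec<String> {
        // Split first, then run every filter in order.
        let mut tokens: Vec<String> =
            text.split_whitespace().map(str::to_string).collect();
        for f in &self.filters {
            tokens = f.apply(tokens);
        }
        tokens
    }
}

fn main() {
    let pipeline = Pipeline { filters: vec![Box::new(Lowercase)] };
    println!("{:?}", pipeline.run("Hello World")); // ["hello", "world"]
}
```

A stemmer or stop word filter would then just be another `TokenFilter` appended to the list, which keeps the speed/quality tradeoff a per-index configuration choice.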
Splitting words can be as simple as calling `split_whitespace`.
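A minimal std-only sketch of the two simplest tokenisers: splitting on whitespace, and the “split on anything that's not a letter” rule from above:

```rust
// Tokenise by splitting on whitespace.
fn whitespace_tokens(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(|t| t.to_lowercase())
        .collect()
}

// Tokenise by splitting on any non-letter character.
fn letter_tokens(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphabetic())
        .filter(|t| !t.is_empty()) // drop empty slices between separators
        .map(|t| t.to_lowercase())
        .collect()
}

fn main() {
    // Note how the second variant splits "we're" into [we, re].
    println!("{:?}", whitespace_tokens("We're processing text"));
    println!("{:?}", letter_tokens("We're processing text"));
}
```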
In this section we’re going to look at the performance of that approach, and how to scale it to more advanced things.
Time spent (whitespace): 0.25ns
Time spent (regex): 0.75ns
When it comes to stemming there is a tradeoff between accuracy and speed. A good algorithm is Porter's algorithm. It consists of 5 steps which, executed in order, stem most words successfully. It's also very simple, so it's very efficient.
You can read the original paper here:
Porter stemming algorithm
M.F. Porter, 1980. Originally published in Program, 14 no. 3, pp. 130-137, July 1980. Removing suffixes by automatic means…
There are more advanced approaches, but we are looking at making this fast, and Porter's algorithm is one of the fastest.
Time spent: 200ns
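To give a flavour of how the algorithm works, here is a simplified sketch of just step 1a of Porter's algorithm, which rewrites plural suffixes. The full algorithm has five ordered steps with conditions based on the “measure” of the stem; this is not a complete implementation:

```rust
// Step 1a of Porter's algorithm (simplified): the rules are tried in
// order and only the first matching suffix rule applies.
fn porter_step_1a(word: &str) -> String {
    if let Some(stem) = word.strip_suffix("sses") {
        format!("{stem}ss") // caresses -> caress
    } else if let Some(stem) = word.strip_suffix("ies") {
        format!("{stem}i") // ponies -> poni
    } else if word.ends_with("ss") {
        word.to_string() // caress -> caress (unchanged)
    } else if let Some(stem) = word.strip_suffix('s') {
        stem.to_string() // cats -> cat
    } else {
        word.to_string()
    }
}

fn main() {
    println!("{}", porter_step_1a("caresses")); // caress
    println!("{}", porter_step_1a("cats"));     // cat
}
```

Because each step is just a small set of ordered suffix rules, the whole algorithm runs in a handful of string comparisons per word, which is why it benchmarks so well.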
I’ll expand this section later, but for now, this is a great article.
Time spent: ~2ms
From my initial testing with tries and hashsets, hashsets come out as the clear winner, with about 10x the performance.
Conclusion: use a hashset.
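A hashset-backed stop word filter is only a few lines (the stop word list below is a tiny illustrative sample, not a real list):

```rust
use std::collections::HashSet;

// Drop every token that appears in the stop word set. HashSet lookup
// is O(1) per token, which is why it beats a trie here.
fn remove_stop_words(tokens: Vec<String>, stop_words: &HashSet<&str>) -> Vec<String> {
    tokens
        .into_iter()
        .filter(|t| !stop_words.contains(t.as_str()))
        .collect()
}

fn main() {
    let stop_words: HashSet<&str> = ["the", "a", "is", "it"].into_iter().collect();
    let tokens: Vec<String> = ["the", "car", "is", "fast"]
        .into_iter()
        .map(str::to_string)
        .collect();
    println!("{:?}", remove_stop_words(tokens, &stop_words)); // ["car", "fast"]
}
```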
Introduction to Stemming - GeeksforGeeks
Stemming is the process of producing morphological variants of a root/base word. Stemming programs are commonly…
Time spent: 20ns
There are machine learning approaches. I’ll expand this section later.
Provides an implementation of today's most used tokenizers, with a focus on performance and versatility. Train new…
Taking ML to production with Rust: a 25x speedup | A learning journal
If we look at the big picture, butchering all the little details, there are two constants in Machine Learning…
State-of-the-art Multilingual Lemmatization
An analysis of state-of-the-art lemmatizers that work for tens of languages
When we are building a pipeline we need to keep the speed-to-quality tradeoff in mind. The standard approach of just splitting on whitespace is almost free; using a regex costs about 3x as much, but is still very cheap. Removing stop words is also pretty cheap, but still about 20x more expensive than the regex. Stemming costs about 10x that again. There is no good library for lemmatisation in Rust, so I'd have to write one myself, but it would probably be far more expensive than all of these.
With each of the steps we have to weigh the cost of indexing against the quality of queries. We might also have to use multiple indexes with different pipelines for different purposes. For example, stemming, stop word removal, and lemmatisation all break exact-phrase queries.