Optimising tokenizers for search engines

Halvor Fladsrud Bø
4 min read · May 9, 2021
Photo by Brett Jordan on Unsplash

In the first version of this post I showed how we can make a simple search engine in 100 lines of Python. I’ll build on this to create a highly efficient full text search library in Rust.

In my last blog post I described how to build a basic search engine from scratch. In this post I’ll go into one part of that process, which is converting the documents into tokens.

Our goal will be to explore the possible throughput of the tokenisation part of the pipeline.

When we are working with tokenisation we are in the domain of NLP. We’re processing natural language.

Let’s first assume that words are the only tokens we can have.

What does “a word” mean?

Processing text seems easy at first. You just split on anything that’s not a letter. 99% of the time that works as expected, but there are a lot of cases where it does not.

Splitting words

For now let’s assume that we’re processing exclusively English text. A language like Norwegian has a lot of quirks, like the frequent combining of words into compounds, that make it harder, but in English the main problem is contractions like ‘we’re’. If you split that with the rule above you end up with [we, re], which is not ideal. Fortunately we don’t really care about either of those tokens when searching (because they don’t mean much), but in general NLP that is a problem.
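To make that concrete, here is a minimal Rust sketch of the naive rule (split on anything that is not a letter). The function name is made up for illustration:

fn naive_tokens(text: &str) -> Vec<&str> {
    // Split on any non-letter character and drop the empty pieces.
    text.split(|c: char| !c.is_alphabetic())
        .filter(|t| !t.is_empty())
        .collect()
}

fn main() {
    // The contraction "we're" falls apart into two tokens.
    assert_eq!(naive_tokens("we're here"), vec!["we", "re", "here"]);
}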

Stemming

The second issue is that “Cars” and “car” are fundamentally the same word, but since full-text search looks for tokens that match exactly, treating them as the same is not something we can easily do. There is a quirk here: with a Finite State Transducer you can easily get the edit distance between two words, so non-exact matches are possible. At the same time it’s not as efficient as exact matching.

Lemmatisation

The third issue is that the sentence “i was in san francisco” is very similar to “i am in san francisco”, but they end up as completely different tokens. Mapping inflected forms like “was” and “am” back to a common base form is called lemmatisation. Generally it’s not that useful for search, but it’s something to keep in mind. We will focus on tokenisation and stemming algorithms.

Stop words

A last issue is words that don’t really mean much. Take “the”: it’s the most common word in English, but it provides little value. We can add a stop-word filter to remove it. Removing it has some downsides, like no longer being able to match exact phrases containing “the”, but in a lot of cases it makes sense.

The stop-word list can also include things like profanities.

Implementing an efficient library for text tokenization

Let’s first consider the interface of the library. We may want to split words in a certain way, then stem them, then remove stop words. This is where the concept of a tokenisation pipeline comes in.
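One possible shape for that interface, sketched in Rust below. The trait, struct and method names are placeholders of my own rather than a finished design; the point is just that an ordered list of stages can rewrite or drop each token:

trait Stage {
    // Return None to drop the token (e.g. a stop word), or Some(new) to keep it.
    fn apply(&self, token: String) -> Option<String>;
}

struct Lowercase;

impl Stage for Lowercase {
    fn apply(&self, token: String) -> Option<String> {
        Some(token.to_lowercase())
    }
}

struct Pipeline {
    stages: Vec<Box<dyn Stage>>,
}

impl Pipeline {
    fn analyze(&self, text: &str) -> Vec<String> {
        // Naive split into candidate tokens, then run each one through the stages.
        text.split(|c: char| !c.is_alphabetic())
            .filter(|t| !t.is_empty())
            .filter_map(|t| {
                let mut token = Some(t.to_string());
                for stage in &self.stages {
                    token = token.and_then(|tok| stage.apply(tok));
                }
                token
            })
            .collect()
    }
}

fn main() {
    let stages: Vec<Box<dyn Stage>> = vec![Box::new(Lowercase)];
    let pipeline = Pipeline { stages };
    assert_eq!(pipeline.analyze("Cars are FAST"), vec!["cars", "are", "fast"]);
}

A stemmer or stop-word filter would just be another Stage pushed onto the list.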

Splitting words

Splitting words can be as simple as:

article.split()

In this section we’re going to look at the performance of that approach, and how to scale it to more advanced things.

Time spent whitespace: 0.25ns

Time spent regex: 0.75ns
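The two variants behind those numbers correspond roughly to the following Rust sketch. The use of the regex crate and the exact pattern are assumptions on my part:

use regex::Regex; // assumes the regex crate is available as a dependency

fn main() {
    let article = "Cars are fast, we're not";

    // Variant 1: split on whitespace only.
    let by_whitespace: Vec<&str> = article.split_whitespace().collect();

    // Variant 2: split on runs of non-letter characters with a regex.
    let non_letters = Regex::new(r"[^a-zA-Z]+").unwrap();
    let by_regex: Vec<&str> = non_letters
        .split(article)
        .filter(|t| !t.is_empty())
        .collect();

    println!("{:?}", by_whitespace); // ["Cars", "are", "fast,", "we're", "not"]
    println!("{:?}", by_regex);      // ["Cars", "are", "fast", "we", "re", "not"]
}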

Stemming

When it comes to stemming there is a tradeoff between accuracy and speed. A good algorithm is Porter’s algorithm. It consists of 5 steps that, executed in order, stem most words successfully. It’s also very simple, so it’s very efficient.

You can read the original paper here:

There are more advanced approaches, but we are looking at making this fast, and Porter’s algorithm is one of the fastest.

Time spent: 200ns
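To give a flavour of how Porter’s algorithm works, here is a heavily simplified Rust sketch covering roughly the first rule group (step 1a). The remaining steps and their conditions are omitted, and in practice you would reach for an existing Porter/Snowball implementation rather than writing this by hand:

fn toy_stem(word: &str) -> String {
    // Lowercasing is not part of Porter itself, but is typical in a search pipeline.
    let w = word.to_lowercase();
    // Ordered suffix rules, first match wins; the real algorithm has four more
    // steps, each guarded by conditions on the remaining stem.
    for (suffix, replacement) in [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")] {
        if let Some(stripped) = w.strip_suffix(suffix) {
            if !stripped.is_empty() {
                return format!("{}{}", stripped, replacement);
            }
        }
    }
    w
}

fn main() {
    assert_eq!(toy_stem("Cars"), "car");
    assert_eq!(toy_stem("ponies"), "poni");
    assert_eq!(toy_stem("caress"), "caress");
}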

Lemmatisation

I’ll expand this section later, but for now, this is a great article.

Time spent: ~2ms
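In the meantime, here is a toy Rust illustration of the idea: a lookup table that maps inflected forms back to a base form. Real lemmatisers rely on part-of-speech information and much larger dictionaries, so treat the table and output here as illustrative only:

use std::collections::HashMap;

fn main() {
    // Map a handful of inflected forms to a shared base form.
    let lemmas = HashMap::from([("was", "be"), ("am", "be"), ("is", "be")]);

    for word in ["i", "was", "in", "san", "francisco"] {
        // Unknown words pass through unchanged.
        let lemma = lemmas.get(word).copied().unwrap_or(word);
        print!("{} ", lemma);
    }
    println!(); // prints: i be in san francisco
}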

Stop words

From my initial testing with tries and hashsets, hashsets come out as the clear winner, with about 10x the performance.

Conclusion. Use a hashset.

Time spent: 20ns
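For reference, a hash-set-backed stop-word filter can be as small as this Rust sketch. The word list is just a sample:

use std::collections::HashSet;

fn main() {
    let stop_words = HashSet::from(["the", "a", "an", "is", "it"]);

    let tokens = ["the", "car", "is", "fast"];
    let kept: Vec<&str> = tokens
        .iter()
        .copied()
        .filter(|t| !stop_words.contains(t))
        .collect();

    assert_eq!(kept, vec!["car", "fast"]);
}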

Machine learning

There are machine learning approaches. I’ll expand this section later.

Conclusion

When we are building a pipeline we need to keep in mind the speed-to-quality tradeoff. The baseline of just splitting on whitespace is almost free; using a regex costs about 3x as much, but is still very cheap. Removing stop words is also pretty cheap, but it’s still about 20x more expensive than the regex. Stemming costs roughly 10x that again. There is no good library for lemmatisation in Rust, so I’ll have to write one myself, but it’s probably far more expensive.

With each of these steps we have to weigh the cost of indexing against the quality of queries. We might also have to use multiple indexes with different pipelines for different purposes. For example, stemming, stop-word removal and lemmatisation all break exact queries.


Halvor Fladsrud Bø

Software Engineer. Interested in security, distributed systems and databases. Using this as a scratchpad.