Optimising tokenizers for search engines


In the first version of this post I showed how to build a simple search engine in 100 lines of Python. Here I'll build on that to create a highly efficient full-text search library in Rust.

In my last blog post I described how to build a basic search engine from scratch. In this post I'll dig into one part of that process: converting documents into tokens.

Our goal is to explore how much throughput we can get out of the tokenisation part of the pipeline.

When we work with tokenisation we are in the domain of NLP: we're processing natural language.

Let's first assume that words are the only kind of token we care about. That immediately raises a question.

What does “a word” mean?

Answering that takes us through the main steps of a tokenisation pipeline, each of which we'll look at below:

Splitting words

Stemming

Lemmatisation

Stop words

And the text we process isn't always tidy prose: it can also contain profanities.

Implementing an efficient library for text tokenization

Splitting words

In the Python version we simply split on whitespace:

article.split()

In this section we'll look at the performance of that approach, and how to scale it up to more advanced splitting, such as a regex-based tokenizer.

Time spent (whitespace split): 0.25ns

Time spent (regex split): 0.75ns
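
To make the comparison concrete, here is a minimal sketch of the two approaches in Rust. The whitespace version uses split_whitespace from the standard library; the regex version assumes the regex crate and a simple \w+ pattern, which may well differ from what a production tokenizer would use.

use regex::Regex;

// Baseline: split on runs of Unicode whitespace. Punctuation stays attached to words.
fn whitespace_tokens(article: &str) -> Vec<&str> {
    article.split_whitespace().collect()
}

// Regex variant: emit runs of word characters, so "fox," becomes "fox".
// Compiling the regex is expensive, so it is done once and passed in.
fn regex_tokens<'a>(re: &Regex, article: &'a str) -> Vec<&'a str> {
    re.find_iter(article).map(|m| m.as_str()).collect()
}

fn main() {
    let article = "The quick brown fox, jumped over the lazy dog.";
    let re = Regex::new(r"\w+").unwrap();
    println!("{:?}", whitespace_tokens(article));
    println!("{:?}", regex_tokens(&re, article));
}

The regex splitter does more work per byte, which is consistent with the gap in the numbers above.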

Stemming

You can read Porter's original paper, “An algorithm for suffix stripping” (1980), for the full details.

There are more advanced approaches, but we are optimising for speed, and Porter's algorithm is one of the fastest.

Time spent: 200ns
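
One way to get a Porter-family stemmer in Rust is the rust-stemmers crate, a Snowball implementation; whether the library uses that crate or a hand-rolled Porter stemmer is an assumption on my part, and the snippet is only meant to show the shape of the API.

use rust_stemmers::{Algorithm, Stemmer};

fn main() {
    // English (Porter2/Snowball) stemmer: strips suffixes with a fixed set of rules
    // and no dictionary lookups, which is what keeps it in the nanosecond range.
    let stemmer = Stemmer::create(Algorithm::English);
    for word in ["searching", "engines", "relational"] {
        println!("{} -> {}", word, stemmer.stem(word));
    }
}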

Lemmatisation

Time spent: ~2ms
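
Lemmatisation maps each word to its dictionary form (its lemma), for example “mice” to “mouse”, so unlike stemming it needs a vocabulary or a full morphological model, which is why it sits in the millisecond range rather than nanoseconds. As a sketch only, here is the simplest possible dictionary-based lemmatiser; the lookup table is hypothetical, and a real one would hold vastly more entries.

use std::collections::HashMap;

fn main() {
    // Hypothetical, tiny lemma table purely for illustration.
    let lemmas: HashMap<&str, &str> =
        [("mice", "mouse"), ("was", "be"), ("better", "good")].into_iter().collect();

    for word in ["mice", "was", "better", "search"] {
        // Fall back to the surface form when the word is not in the table.
        let lemma = lemmas.get(word).copied().unwrap_or(word);
        println!("{} -> {}", word, lemma);
    }
}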

Stop words

The conclusion here is simple: keep the stop words in a hash set and drop any token found in it.

Time spent: 20ns
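
A minimal sketch of that approach in Rust; the word list below is a tiny illustrative subset, not a real stop-word list.

use std::collections::HashSet;

fn main() {
    // A handful of stop words; real lists usually have a few hundred entries.
    let stop_words: HashSet<&str> =
        ["the", "a", "an", "and", "of", "to"].into_iter().collect();

    let tokens = ["the", "quick", "brown", "fox"];
    let kept: Vec<&str> = tokens
        .iter()
        .copied()
        .filter(|t| !stop_words.contains(t)) // O(1) average lookup per token
        .collect();

    println!("{:?}", kept); // ["quick", "brown", "fox"]
}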

Machine learning

Conclusion

With each of these steps we have to weigh the cost at indexing time against the quality of query results. We might also have to use multiple indexes with different pipelines for different purposes: for example, stemming, lemmatisation and stop-word removal all break exact-match queries.
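
To make that trade-off concrete, one way to structure the code is a single tokenisation function whose steps can be switched on and off per index; the shape below is an illustrative sketch, not the library's actual API, and it reuses the rust-stemmers crate assumed earlier.

use std::collections::HashSet;
use rust_stemmers::{Algorithm, Stemmer};

// One pipeline, configurable per index: an exact-match index passes `None` for the
// stemmer and an empty stop-word set, a normalising index enables both.
fn tokenize(article: &str, stemmer: Option<&Stemmer>, stop_words: &HashSet<&str>) -> Vec<String> {
    article
        .split_whitespace()
        .filter(|t| !stop_words.contains(t))
        .map(|t| match stemmer {
            Some(s) => s.stem(t).into_owned(),
            None => t.to_string(),
        })
        .collect()
}

fn main() {
    let stop_words: HashSet<&str> = ["the", "is", "a"].into_iter().collect();
    let stemmer = Stemmer::create(Algorithm::English);
    let doc = "the quick brown fox is running";

    // Normalising pipeline: better recall, but breaks exact-match queries.
    println!("{:?}", tokenize(doc, Some(&stemmer), &stop_words));
    // Exact pipeline: keeps the original tokens.
    println!("{:?}", tokenize(doc, None, &HashSet::new()));
}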

