How indexing is done in a search engine

Halvor Fladsrud Bø
6 min read · May 16, 2021

Have you ever wondered what data structures are used in search engines, and why? If so, you’ll probably find this article interesting. We’ll cover the theory behind indexing, what makes a data structure well-suited for search applications, which data structures are used in the popular open source search engines, and the relationship between physical hardware and search performance.

If you have not read my introduction post, you can read it here. I’ll not explain the basic terms in this post.

Boring theory

When working with search, the central data structure is the inverted index.

An inverted index is, formally, a data structure with the following methods:

first(term)

last(term)

next(term, current)

prev(term, current)

I showed in my previous post how these methods can be efficiently supported by a hash map from each term to a sorted list of its positions.
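As a concrete sketch of that idea (the corpus, positions, and function names are my own illustration, following the interface above — note that `next` shadows Python’s built-in of the same name):

```python
import bisect

# A toy inverted index: term -> sorted list of positions in the corpus.
index = {
    "search": [2, 14, 40],
    "engine": [3, 41],
}

def first(term):
    """First position of term, or None if the term is absent."""
    positions = index.get(term)
    return positions[0] if positions else None

def last(term):
    """Last position of term, or None if the term is absent."""
    positions = index.get(term)
    return positions[-1] if positions else None

def next(term, current):
    """Smallest position of term strictly after current, or None."""
    positions = index.get(term, [])
    i = bisect.bisect_right(positions, current)
    return positions[i] if i < len(positions) else None

def prev(term, current):
    """Largest position of term strictly before current, or None."""
    positions = index.get(term, [])
    i = bisect.bisect_left(positions, current)
    return positions[i - 1] if i > 0 else None
```

Because each position list is sorted, `next` and `prev` are a single binary search, which is what gives the logarithmic lookup time discussed below.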

What makes for a good inverted index?

There are many ways to implement an inverted index, and which data structure is best depends on the queries you are making and how much data you are searching. Generally, we want data structures that offer a high degree of compression for “the common case” in search, while maintaining logarithmic lookup time.

Any data structure that efficiently implements the “map” interface will work as an inverted index. When searching a small amount of data, which data structure you use is not that important. At scale, however, your choice starts to matter, specifically once the inverted index no longer fits in memory. Commonly, multiple data structures are used together to implement the interface, and the actual implementation is highly optimised.

A high degree of compression is important in search because it allows both storing more data in memory and reading less data from disk. In most cases we can split the inverted index into two parts. The first is the term dictionary. For this data structure the level of compression is important, but so is its flexibility when performing regex and fuzzy matches. The other part is the term frequency and term position data structures. In real inverted index implementations these are almost always kept separate, because the term frequency index is roughly 2–3x smaller than the position index. Commonly these are also stored on disk. The requirement here is a data structure in which a single term or document is easy to look up, while maintaining a close to perfect level of compression.

So, a good inverted index stores data in a compressed format while maintaining the ability to be easily queried.

How do we store the mapping from terms to documents?

As mentioned above, one of these data structures stores the mapping from a term to its location in the backing position and term frequency structures. We need a data structure that can efficiently store a mapping from a string to a number. For this, the most popular libraries use a Finite State Transducer. Finite State Transducers are a very specialised data structure you will almost never run into outside search, but for search they are perfect.

Let’s break down what a Finite State Transducer is. First, we start with a state machine: you have states and transitions, and applying a transition moves you to a new state. “Finite” simply means that the state machine has a finite number of states. “Transducer” means that the machine produces an output: as we move through the state machine, we accumulate an output value.
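As a minimal sketch of the concept (the terms, values, and dict layout are made up for illustration, not taken from any real FST library), a transducer can be modeled as states whose transitions each carry an output increment; the value for a term is the sum of the increments along its path:

```python
# A tiny transducer as nested dicts: state -> {char: (next_state, output_increment)}.
# A None key marks an accepting state and carries a final output increment.
# Here "cat" maps to 5 and "cats" to 7; they share the prefix path c-a-t.
fst = {
    0: {"c": (1, 5)},           # the shared prefix carries the common output 5
    1: {"a": (2, 0)},
    2: {"t": (3, 0)},
    3: {"s": (4, 2), None: 0},  # accept "cat" here, or continue to "cats"
    4: {None: 0},               # accept "cats": 5 + 2 = 7
}

def fst_lookup(fst, term):
    """Walk the transducer, summing outputs along the way; None means no match."""
    state, output = 0, 0
    for ch in term:
        if ch not in fst.get(state, {}):
            return None
        state, inc = fst[state][ch]
        output += inc
    if None in fst.get(state, {}):
        return output + fst[state][None]
    return None
```

Because the output is spread along the transitions, shared paths can serve many terms while each term still resolves to its own distinct number.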

http://blog.mikemccandless.com/2010/12/using-finite-state-transducers-in.html

What makes it powerful is how the data structure is implemented. If you want the full story, read the article linked above.

The short story is that we can build a prefix trie and a suffix trie in the same data structure: the most efficient representation of an FST is one in which no states are duplicated. That means that if two words have the same ending, they can share that path. Since we accumulate output as we go through the state machine, two words can pass through the same state while still keeping track of distinct output values.

The article linked above is a much better explanation, but warning: it’s very long.

How do we store the term positions and frequencies?

The next layer of data structure commonly used is some kind of sorted list, where each element represents an output value in the FST. For this, Lucene uses a skip list.

Each element in this list contains a compressed list of integers. Again, I’m not going to give a full explanation of the topic, because it’s huge, but here is a good article:

In short, we often store the deltas between successive document ids and term positions rather than the absolute values.
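A minimal sketch of the idea (the doc ids are made up; the varint follows the common 7-bits-per-byte scheme, where small numbers, which deltas usually are, take a single byte):

```python
def delta_encode(doc_ids):
    """Store the gaps between sorted doc ids; gaps are small and compress well."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def varint(n):
    """Variable-length encode: 7 bits per byte, high bit set means 'more bytes'."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)
```

Real engines pack whole blocks of deltas at once with bit-packing schemes rather than one integer at a time, but the principle is the same: sorted ids turn into small gaps, and small gaps turn into few bytes.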


The performance of compression and decompression is very important, because it can be the bottleneck. We therefore use SIMD instructions. The compression will not be perfect, but the speed is much higher. Lucene does not do this, while Tantivy does.

So why all this trouble?

What we get from all the complicated machinery above is the ability to fill up a machine’s disk with term positions and frequencies while still querying terms efficiently through the FST.

Memory is on the order of 100x more expensive per byte than disk, but it is also orders of magnitude faster.

What other options exist?

There are a lot of different search libraries out there for different applications. Some support a lot of query types, some only support basic lookups. It’s all about the use-case. There are also hundreds of papers on different ways to optimise for common use-cases. I have presented the most basic ideas from the Tantivy and Lucene universe only. They are general purpose and built to scale. Other libraries are built to different specifications.


Generally there are four criteria, where gaining on one means losing on some of the others:

  • How quickly can we index?
  • How much text can we store on a single machine?
  • How quickly can we query the data?
  • What query types are supported?

For example, we can simply not store term positions. Indexing speed goes up, the amount we can store goes up, query speed stays the same, but we no longer support positional queries like exact phrase matches.
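To see why positions matter, here is a hedged sketch (names and numbers are my own) of the core check behind a phrase query; without per-term position lists, there is simply no data to run it on:

```python
def phrase_match(pos_a, pos_b):
    """True if some occurrence of term B immediately follows term A.

    pos_a and pos_b are the position lists of the two terms within one document.
    """
    next_positions = {p + 1 for p in pos_a}
    return any(p in next_positions for p in pos_b)
```

An index that keeps only document ids and frequencies can tell you both words occur in a document, but not whether they are adjacent.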

Generally, when using any of the search libraries above, the way we choose to build our indexes should reflect our needs. The default settings will not give us the best possible performance or query support.

Conclusion

Full text search indexing is all about tradeoffs. Certain data structures perform well enough in 99% of cases, so they are used in the general purpose libraries, but at a high level full text search requires a lot of optimisation and tricks.

