Building a Full Text Search Engine

Hey Everyone,

Today we’ll be talking about

  • Building a Full Text search engine

    • What’s Full Text Search and real world examples of FTS engines

    • The Inverted Index data structure

    • Text Analysis and Querying

  • Plus, a few awesome tech snippets on

    • An amazing GitHub repo on Security Engineering

    • A 4-part blog series on how Google Chrome works

    • How Shopify rewrote their Ruby on Rails monolith to improve Storefront response times

    • Writing a File System from scratch in Rust

We have a solution to our last coding interview question on linked lists, plus a new question from Google.

Quastor Daily is a free Software Engineering newsletter that sends out FAANG Interview questions (with detailed solutions), Technical Deep Dives and summaries of Engineering Blog Posts.

Artem Krylysov wrote an awesome blog post where he builds a Full-Text Search Engine.

The corpus for the search engine is 600,000 pages from Wikipedia and the FTS engine can search through all the documents for keywords in less than a millisecond.

Here’s a summary.

What is a Full Text Search Engine

The Full-text Search problem is where you have a collection of documents (or a single document) and you want to search through all the text in those documents for some keywords.

A document refers to any piece of text. It could be a web page, a book, or just a string.

Full-text Search is different from searches based on metadata or on pieces of the original text (like titles or selected regions) because a full text search engine examines all the words in every stored document.

The most well-known FTS engine is Lucene. Elasticsearch, Solr and OpenSearch are all FTS engines built on top of Lucene.

The Inverted Index Data Structure

The core of an FTS engine is a data structure called the Inverted Index.

The Inverted Index associates every word that appears in the documents with the documents that contain that word.

Let’s say you have 3 documents, where each document is a string.

Every document has a unique integer label.

The Inverted Index for those 3 documents would be a hash table (sketched below).

The keys in the Inverted Index are the words from all the documents in the corpus.

The values in the Inverted Index are sorted arrays of the labels of the documents that contain that word.
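To make that concrete, here’s a tiny illustration in Python (the three documents and their contents are made up for this example; the real corpus is Wikipedia pages):

```python
# Three documents, each with a unique integer label.
documents = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "a quick brown dog",
}

# The Inverted Index maps every word to a sorted array of the
# labels of the documents that contain it.
inverted_index = {
    "the":   [1, 2],
    "quick": [1, 3],
    "brown": [1, 3],
    "fox":   [1],
    "lazy":  [2],
    "a":     [3],
    "dog":   [2, 3],
}
```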

The Inverted Index data structure is actually inspired by the index section of a textbook.

Here, every page in the textbook is a document. The page number is the document’s label.

The keys are the important words/phrases in the textbook.

Text Analyzer

In order to build the Inverted Index, you have to go through a process called text analysis.

Text analysis consists of tokenizing the text and then applying a set of filters.

Tokenizer

The tokenizer splits each document into a list of tokens.

Tokens can be words, characters or subwords. You pick the format of your tokens based on the problem you’re solving.

For this problem, it makes sense for our tokens to be individual words, so we can tokenize based on word boundaries (spaces, periods, etc.) and remove any punctuation.
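For word-level tokens, a minimal tokenizer could look something like this (a sketch, not the implementation from the original post):

```python
import re

def tokenize(text):
    # Split on word boundaries: keep only runs of letters and
    # digits as tokens, dropping punctuation and whitespace.
    return re.findall(r"[A-Za-z0-9]+", text)

tokenize("The quick brown fox, jumping over the lazy dog.")
# ['The', 'quick', 'brown', 'fox', 'jumping', 'over', 'the', 'lazy', 'dog']
```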

Filters

To make the text easier to index and search, you can do additional normalization by applying some filters to the tokens. A rough sketch of these filters in code follows the list below.

  • Lowercase - We want the FTS engine to be case-insensitive. Therefore, the lowercase filter will convert all the tokens to lowercase.

  • Dropping Common Words - We don’t want common words like “a”, “and”, “the”, etc. in our Inverted Index. Otherwise, a query like “A Cat” would match every single document that has the word “a” in it (lots of false positives). Therefore, we can remove the most popular words in the English language (often called stop words) from our Inverted Index. You can just download a ranking of the top 15 most popular words and remove them.

  • Stemming - Because of English grammar rules, documents may include different forms of the same word. Stemming reduces words to their base form. “Fishing” and “fishes” could both be reduced to the base form “fish”.
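Here’s what those three filters might look like, sketched in Python (the stop-word list is a tiny stand-in and the stemmer is deliberately naive; a real engine would use a full stop-word list and a proper stemming algorithm like Porter or Snowball):

```python
def lowercase_filter(tokens):
    # Make the index case-insensitive.
    return [token.lower() for token in tokens]

# A tiny stand-in stop-word list; in practice you'd download a
# longer list of the most common English words.
STOP_WORDS = {"a", "an", "and", "be", "have", "i", "in", "of", "that", "the", "to"}

def stopword_filter(tokens):
    # Drop common words that would match almost every document.
    return [token for token in tokens if token not in STOP_WORDS]

def stemmer_filter(tokens):
    # Deliberately naive stemming: strip a few common suffixes.
    # A real engine would use Porter/Snowball stemming instead.
    def stem(token):
        for suffix in ("ing", "es", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[:-len(suffix)]
        return token
    return [stem(token) for token in tokens]
```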

So, the text analyzer works by going through each document, looking at the raw text, tokenizing it and applying the filters.

Afterwards, you can add each token to the Inverted Index as a key, with the document’s label added to that token’s list of labels.
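Putting the pieces together, the analyzer and index construction might look like this, reusing the tokenizer and filters sketched above:

```python
def analyze(text):
    # Tokenize the raw text, then run the tokens through each filter.
    tokens = tokenize(text)
    tokens = lowercase_filter(tokens)
    tokens = stopword_filter(tokens)
    tokens = stemmer_filter(tokens)
    return tokens

def build_index(documents):
    # documents maps an integer label to that document's raw text.
    index = {}
    for label in sorted(documents):
        for token in analyze(documents[label]):
            labels = index.setdefault(token, [])
            # Append each label at most once; the lists stay sorted
            # because we visit the documents in increasing label order.
            if not labels or labels[-1] != label:
                labels.append(label)
    return index
```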

Querying

To query the index, you apply the same tokenizer and filters to the query string and then look up the matching documents.

You can search for documents that contain any of the words in the query string or narrow your search to documents that include all the words in the query string.

To search for the latter, you can compute the intersection of the lists of matching document labels for each word in the query string.
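Here’s a sketch of an all-words (AND) query built on the pieces above, intersecting the label lists for each token:

```python
def search_all(index, query):
    # Run the query through the same analyzer, then intersect the
    # document label lists for each resulting token.
    result = None
    for token in analyze(query):
        labels = set(index.get(token, []))
        result = labels if result is None else result & labels
    return sorted(result) if result else []

index = build_index(documents)
search_all(index, "quick dogs")  # -> [3]; only document 3 contains both words
```

Swapping the intersection for a union would give you the “any of the words” behavior instead.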

To read the full source code and get a more detailed explanation, read the full blog post.


Tech Snippets

  • How to Secure Anything - This is an amazing GitHub repo on Security Engineering. It goes through various security models, security tools (cryptography, tamper detection, access control, etc.), and also links resources on how real world systems are secured. It turns out there’s an entire series of textbooks on how to secure nuclear facilities!

  • An Inside Look at Modern Web Browsers - A fantastic 4-part blog series on how the Chrome browser works. It goes from the high level architecture to the inner workings of the rendering process and the compositor.

  • How Shopify reduced Storefront Response Times - This is a great blog post by Shopify Engineering on how they rewrote their Storefront Renderer to reduce response times. Shopify builds a platform where you can use their software to build an E-Commerce store. The Storefront Renderer is responsible for loading a Shopify merchant’s storefront theme, along with the data required to serve the request (product data, inventory information, images, etc.). Their architecture was a Ruby on Rails monolith that handled almost all kinds of traffic. Now, they have a separate app that handles storefront traffic while the monolith still handles checkout, admin and API traffic.

  • Writing a File System from scratch in Rust - An awesome post where you can learn about some of the concepts of file systems and write a very basic file system with Rust.

Interview Question

You are given two integers, n and k.

Return all possible combinations of k numbers out of the range [1, n].

You can return the answer in any order.


Previous Solution

As a reminder, here’s our last question:

You are given the head of a linked list and an integer k.

Rotate the linked list to the right by k places.

Solution

So, first of all, it’s possible for k to be bigger than the length of the linked list.

To make the problem easier, let’s adjust k so we can avoid that scenario.

We’ll find the length of the linked list and then set k to k % length, since rotating a list by its length (or any multiple of it) leaves it unchanged.

Now that k is less than the length of the linked list, we can find the last k nodes in our list.

We’ll do that by setting a slow and fast pointer, both equal to head.

Then, the fast pointer will traverse k nodes in our linked list.

Now, we’ll traverse both slow and fast through the linked list until fast reaches the last node.

The slow pointer is now pointing to the node right before the last k nodes in our linked list.

We’ll split the linked list there (right after the slow pointer) and then create a new linked list with the last k nodes first, followed by the earlier nodes.

This new linked list is our answer.

What’s the time and space complexity of our solution? Reply back with your answer and we’ll tell you if you’re right or wrong.
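Here’s a minimal Python 3 sketch of the approach described above (it assumes a standard singly linked ListNode class with val and next fields):

```python
class ListNode:
    def __init__(self, val=0, next=None):
        self.val = val
        self.next = next

def rotate_right(head, k):
    if head is None or head.next is None or k == 0:
        return head

    # Find the length of the list and reduce k so that k < length.
    length = 1
    node = head
    while node.next:
        node = node.next
        length += 1
    k %= length
    if k == 0:
        return head

    # Move the fast pointer k nodes ahead of the slow pointer.
    slow = fast = head
    for _ in range(k):
        fast = fast.next

    # Advance both pointers until fast sits on the last node.
    # slow then sits on the node right before the last k nodes.
    while fast.next:
        slow = slow.next
        fast = fast.next

    # Split after slow and stitch the last k nodes onto the front.
    new_head = slow.next
    slow.next = None
    fast.next = head
    return new_head
```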