Manual Reference Source

Design Document

This design document has the aim to explain the details of MiniSearch design and implementation to library developers that intend to contribute to this project, or that are simply curious about the internals.

Latest update: Feb. 11, 2019

Goals (and non-goals)

MiniSearch is aimed at providing rich full-text search functionalities in a local setup (e.g. client side, in the browser). It is therefore optimized for:

  1. Small memory footprint of the index data structure
  2. Fast indexing of documents
  3. Versatile and performant search features, to the extent possible while meeting goals 1 and 2
  4. Small and simple API surface, on top of which more specific solutions can be built by application developers
  5. Possibility to add and remove documents from the index at any time

MiniSearch is therefore NOT directly aimed at offering:

For these points listed as non-goals, other solutions exist that should be preferred to MiniSearch. Adapting MiniSearch to support those goals would in fact necessarily go against the primary project goals.

Technical design

MiniSearch is composed of two layers:

  1. A compact and versatile data structure for indexing terms, providing prefix and fuzzy lookup
  2. An API layer on top of this data structure, providing the search features

Here follows a description of these two layers.

Index data structure

The data structure chosen for the index is a radix tree, which is a prefix tree where nodes with no siblings are merged with the parent node. The reason for choosing this data structure follows from the project goals:

The class implementing the radix tree is called SearchableMap, because it implements the standard JavaScript Map interface, adding on top of it some key lookup methods:

As a trade-off for offering these additional features, SearchableMap is restricted to use only string keys.

The SearchableMap data type is part of the public API of MiniSearch, exposed as MiniSearch.SearchableMap. Its usefulness is in fact not limited to providing a data structure for the inverted index, and developers can use it as a building block for other solutions. When modifying this class, one should think about it in terms of a generic data structure, that could in principle be released as a separate library.

Fuzzy search algorithm

The algorithm used to provide fuzzy search of keys within a maximum Levenshtein distance from a given term is the following:

Note that this algorithm can get complex if the maximum edit distance is large, as many paths would be followed. The reason why this algorithm is employed is a trade-off:

Search API layer

The search API layer offers a small and simple API surface for application developers. It does not assume that a specific locale is used in the indexed documents, therefore no stemming nor stop-word filtering is performed, but instead offers easy options for developers to provide their own implementation. This heuristic will be followed in future development too: rather than providing an opinionated solution, the project will offer simple building blocks for application developers to implement their own solutions.

The inverted index is implemented with SearchableMap, and posting lists are stored as values in the Map. This way, the same data structure provides both the inverted index and the set of indexed terms. Different document fields are indexed within the same index, to further save space. The index is therefore structured as following:

term -> field -> { document frequency, posting list }

When performing a search, the entries corresponding to the search term are looked up in the index (optionally searching the index with prefix or fuzzy search), then the documents are scored with a variant of Tf-Idf, and finally results for different search terms are merged with the given combinator function (by default OR, but AND can be specified).

As the document IDs necessarily occur many times in the posting list, as a space optimization they are substituted by short generated IDs. An index of short ID to original ID is maintained alongside the search index, to reconstruct the original IDs. A similar optimization is applied to the field names.