Manual Reference Source

MiniSearch

Build Status Coverage Status Minzipped Size

MiniSearch is a tiny but powerful in-memory fulltext search engine written in JavaScript. It is respectful of resources, and it can comfortably run both in Node and in the browser.

Try out the demo application.

Find the complete documentation and API reference here, and more background about MiniSearch, including a comparison with other similar libraries, in this blog post.

Use case

MiniSearch addresses use cases where full-text search features are needed (e.g. prefix search, fuzzy search, ranking, boosting of fields…), but the data to be indexed can fit locally in the process memory. While you won't index the whole Internet with it, there are surprisingly many use cases that are served well by MiniSearch. By storing the index in local memory, MiniSearch can work offline, and can process queries quickly, without network latency.

A prominent use-case is real time search "as you type" in web and mobile applications, where keeping the index on the client enables fast and reactive UIs, removing the need to make requests to a search server.

Features

  • Memory-efficient index, designed to support memory-constrained use cases like mobile browsers.

  • Exact match, prefix search, fuzzy match, field boosting

  • Auto-suggestion engine, for auto-completion of search queries

  • Documents can be added and removed from the index at any time

  • Zero external dependencies

MiniSearch strives to expose a simple API that provides the building blocks to build custom solutions, while keeping a small and well tested codebase.

Installation

With npm:

npm install --save minisearch

With yarn:

yarn add minisearch

Then require or import it in your project.

Alternatively, if you prefer to use a <script> tag, you can require MiniSearch from a CDN:

<script src="https://cdn.jsdelivr.net/npm/minisearch@2.0.6/dist/minisearch.min.js"></script>

Finally, if you want to manually build the library, clone the repository and run yarn build (or yarn build-minified for a minified version + source maps). The compiled source will be created in the dist folder.

Usage

Basic usage

// A collection of documents for our examples
const documents = [
  { id: 1, title: 'Moby Dick', text: 'Call me Ishmael. Some years ago...' },
  { id: 2, title: 'Zen and the Art of Motorcycle Maintenance', text: 'I can see by my watch...' },
  { id: 3, title: 'Neuromancer', text: 'The sky above the port was...' },
  { id: 4, title: 'Zen and the Art of Archery', text: 'At first sight it must seem...' },
  // ...and more
]

let miniSearch = new MiniSearch({ fields: ['title', 'text'] })

// Index all documents
miniSearch.addAll(documents)

// Search with default options
let results = miniSearch.search('zen art motorcycle')
// => [ { id: 2, score: 2.77258, match: { ... } }, { id: 4, score: 1.38629, match: { ... } } ]

Search options

MiniSearch supports several options for more advanced search behavior:

// Search only specific fields
miniSearch.search('zen', { fields: ['title'] })

// Boost some fields (here "title")
miniSearch.search('zen', { boost: { title: 2 } })

// Prefix search (so that 'moto' will match 'motorcycle')
miniSearch.search('moto', { prefix: true })

// Fuzzy search, in this example, with a max edit distance of 0.2 * term length,
// rounded to nearest integer. The mispelled 'ismael' will match 'ishmael'.
miniSearch.search('ismael', { fuzzy: 0.2 })

// You can set the default search options upon initialization
miniSearch = new MiniSearch({
  fields: ['title', 'text'],
  searchOptions: {
    boost: { title: 2 },
    fuzzy: 0.2
  }
})
miniSearch.addAll(documents)

// It will now by default perform fuzzy search and boost "title":
miniSearch.search('zen and motorcycles')

Auto suggestions

MiniSearch can suggest search queries given an incomplete query:

miniSearch.autoSuggest('zen ar')
// => [ { suggestion: 'zen archery art', terms: [ 'zen', 'archery', 'art' ], score: 1.73332 },
//      { suggestion: 'zen art', terms: [ 'zen', 'art' ], score: 1.21313 } ]

The autoSuggest method takes the same options as the search method, so you can get suggestions for misspelled words using fuzzy search:

miniSearch.autoSuggest('neromancer', { fuzzy: 0.2 })
// => [ { suggestion: 'neuromancer', terms: [ 'neuromancer' ], score: 1.03998 } ]

Suggestions are ranked by the relevance of the documents that would be returned by that search.

Field extraction

By default, documents are assumed to be plain key-value objects with field names as keys and field values as string values. In order to support custom field extraction logic (for example for nested fields, or non-string field values needing processing before tokenization), a custom field extractor function can be passed as the extractField option:

// Assuming that our documents look like:
const documents = [
  { id: 1, title: 'Moby Dick', author: { name: 'Herman Melville' }, tags: ['fiction', 'whale'] },
  { id: 2, title: 'Zen and the Art of Motorcycle Maintenance', author: { name: 'Robert Pirsig' }, tags: ['fiction', 'zen'] },
  { id: 3, title: 'Neuromancer', author: { name: 'William Gibson' }, tags: ['fiction', 'cyberpunk'] },
  { id: 4, title: 'Zen and the Art of Archery', author: { name: 'Eugen Herrigel' }, tags: ['non-fiction', 'zen'] },
  // ...and more
]

// We can support nested fields (author.name) and array fields (tags) with a
// custom `extractField` function:
let miniSearch = new MiniSearch({
  fields: ['title', 'author.name', 'tags'],
  extractField: (document, fieldName) => {
    // Access nested fields
    const value = fieldName.split('.').reduce((doc, key) => doc && doc[key], document)
    // If field value is an array, join by space
    return Array.isArray(value) ? value.join(' ') : value
  }
})

The default field extractor can be obtained by calling MiniSearch.getDefault('extractField').

Tokenization

By default, documents are tokenized by splitting on Unicode space or punctuation characters. The tokenization logic can be easily changed by passing a custom tokenizer function as the tokenize option:

// Tokenize splitting by hyphen
let miniSearch = new MiniSearch({
  fields: ['title', 'text'],
  tokenize: (string, _fieldName) => string.split('-')
})

Upon search, the same tokenization is used by default, but it is possible to pass a tokenize search option in case a different search-time tokenization is necessary:

// Tokenize splitting by hyphen
let miniSearch = new MiniSearch({
  fields: ['title', 'text'],
  tokenize: (string) => string.split('-'), // indexing tokenizer
  searchOptions: {
    tokenize: (string) => string.split(/[\s-]+/) // search query tokenizer
  }
})

The default tokenizer can be obtained by calling MiniSearch.getDefault('tokenize').

Term processing

Terms are downcased by default. No stemming is performed, and no stop-word list is applied. To customize how the terms are processed upon indexing, for example to normalize them, filter them, or to apply stemming, the processTerm option can be used. The processTerm function should return the processed term as a string, or a falsy value if the term should be discarded:

let stopWords = new Set(['and', 'or', 'to', 'in', 'a', 'the', /* ...and more */ ])

// Perform custom term processing (here discarding stop words and downcasing)
let miniSearch = new MiniSearch({
  fields: ['title', 'text'],
  processTerm: (term, _fieldName) =>
    stopWords.has(term) ? null : term.toLowerCase()
})

By default, the same processing is applied to search queries. In order to apply a different processing to search queries, supply a processTerm search option:

let miniSearch = new MiniSearch({
  fields: ['title', 'text'],
  processTerm: (term) =>
    stopWords.has(term) ? null : term.toLowerCase(), // index term processing
  searchOptions: {
    processTerm: (term) => term.toLowerCase() // search query processing
  }
})

The default term processor can be obtained by calling MiniSearch.getDefault('processTerm').

API Documentation

Refer to the API documentation for details about configuration options and methods.

Browser compatibility

MiniSearch natively supports all modern browsers implementing JavaScript standards, but requires a polyfill when used in Internet Explorer, as it makes use functions like Object.entries, Array.includes, and Array.from, which are standard but not available on older browsers. The package core-js is one such polyfill that can be used to provide those functions.

Contributing

Contributions to MiniSearch are welcome! Please read the contributions guidelines. Reading the design document is also useful to understand the project goals and the technical implementation.

Contributing

Contributions to MiniSearch are welcome :) Just follow the guidelines below.

Bugs and feature requests

If you have an idea for a feature, or spotted a bug, please open an issue on GitHub.

  • Check if your issue was already reported. In that case, better to comment on the existing issue rather than open a duplicate one.

  • When reporting a bug, whenever possible, provide some example code to reproduce the bug: that will make the process of fixing it much faster.

  • Remember this is an open-source project. Your feature requests will be discussed and taken into consideration, but please do not take it personally if the feature does not get implemented. This project uses a permissive license, so pull requests are welcome and forks are permitted.

  • Always be respectful of others when discussing issues. The project maintainer has the right and responsibility to remove or close discussions where the tone or language is derogatory, harassing, or offensive toward others. Keep the Code of Conduct in mind.

Pull requests

Thinking about sending a pull request? That is great :) Here is how you can setup your development environment:

  1. Clone the repository

  2. Install the development dependencies with yarn install

  3. Run the tests with yarn test. You can also automatically trigger a run of relevant tests for the code that you change by running yarn test-watch

  4. If you are working on optimising the performance, you can run the performance benchmarks with yarn benchmark

  5. If you are improving the documentation, you can build the docs with yarn build-docs

In order to understand implementation details and design goals, read the design document.

Also, please follow these guidelines:

  • Add tests for your code. This ensures that your feature won't be broken by further code changes. If you are not sure how to test, feel free to send the pull request and ask for advices in the comment.

  • Don't change the version number. That will be done by the mainteiner upon releasing a new version.

  • Make sure that the full test suite passes before sending the pull request.

Code of Conduct

Our Pledge

In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

Our Standards

Examples of behavior that contributes to creating a positive environment include:

  • Using welcoming and inclusive language
  • Being respectful of differing viewpoints and experiences
  • Gracefully accepting constructive criticism
  • Focusing on what is best for the community
  • Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

  • The use of sexualized language or imagery and unwelcome sexual attention or advances
  • Trolling, insulting/derogatory comments, and personal or political attacks
  • Public or private harassment
  • Publishing others' private information, such as a physical or electronic address, without explicit permission
  • Other conduct which could reasonably be considered inappropriate in a professional setting

Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.

Scope

This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.

Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project maintainer at mail[AT]lucaongaro.eu. All complaints will be reviewed and investigated and will result in a response that is deemed necessary and appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project's leadership.

Attribution

This Code of Conduct is adapted from the Contributor Covenant, version 1.4, available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

For answers to common questions about this code of conduct, see https://www.contributor-covenant.org/faq

Design Document

This design document has the aim to explain the details of MiniSearch design and implementation to library developers that intend to contribute to this project, or that are simply curious about the internals.

Latest update: Feb. 11, 2019

Goals (and non-goals)

MiniSearch is aimed at providing rich full-text search functionalities in a local setup (e.g. client side, in the browser). It is therefore optimized for:

  1. Small memory footprint of the index data structure
  2. Fast indexing of documents
  3. Versatile and performant search features, to the extent possible while meeting goals 1 and 2
  4. Small and simple API surface, on top of which more specific solutions can be built by application developers
  5. Possibility to add and remove documents from the index at any time

MiniSearch is therefore NOT directly aimed at offering:

  • A solution for use cases requiring large index data structure size
  • Distributed setup where the index resides on multiple nodes and need to be kept in sync
  • Turn-key opinionated solutions (e.g. supporting multiple locales): MiniSearch enables developer to build these on top of the core API, but does not provide it out of the box.

For these points listed as non-goals, other solutions exist that should be preferred to MiniSearch. Adapting MiniSearch to support those goals would in fact necessarily go against the primary project goals.

Technical design

MiniSearch is composed of two layers:

  1. A compact and versatile data structure for indexing terms, providing prefix and fuzzy lookup
  2. An API layer on top of this data structure, providing the search features

Here follows a description of these two layers.

Index data structure

The data structure chosen for the index is a radix tree, which is a prefix tree where nodes with no siblings are merged with the parent node. The reason for choosing this data structure follows from the project goals:

  • The radix tree minimizes the memory footprint of the index, because common prefixes are stored only once, and nodes are compressed into a single multi-character node whenever possible.
  • Radix trees offer fast key lookup, with performance proportional to the key length, and fast lookup of subtrees sharing the same key prefix. These properties make it possible to offer exact match and prefix search.
  • On top of a radix tree it is possible to implement lookup of keys that are within a certain maximum edit distance from a given key. This search rapidly becomes complex as the maximum distance grows, but for practical search use-cases the maximum distance is small enough for this algorithm to be performant. Other more performant solutions for fuzzy search would require more space (e.g. n-gram indexes).

The class implementing the radix tree is called SearchableMap, because it implements the standard JavaScript Map interface, adding on top of it some key lookup methods:

  • SearchableMap.prototype.atPrefix(prefix), returning another SearchableMap representing a mutable view of the original one, containing only entries where the keys share the given prefix.
  • SearchableMap.prototype.fuzzyGet(searchKey, maxEditDistance), returning all the entries where the key is within the given edit (Levenshtein) distance from searchKey.

As a trade-off for offering these additional features, SearchableMap is restricted to use only string keys.

The SearchableMap data type is part of the public API of MiniSearch, exposed as MiniSearch.SearchableMap. Its usefulness is in fact not limited to providing a data structure for the inverted index, and developers can use it as a building block for other solutions. When modifying this class, one should think about it in terms of a generic data structure, that could in principle be released as a separate library.

Fuzzy search algorithm

The algorithm used to provide fuzzy search of keys within a maximum Levenshtein distance from a given term is the following:

  • The search starts with a budget of edit distance, initially equal to the given maximum distance.
  • The radix tree is traversed, starting from the root, visiting each path and propagating the remaining budget along each path, but quitting any search path along which the budget is exhausted.
  • For each visited node in the radix tree, the string contained in the node is traversed character by character using cursors that are kept on a stack.
  • Each cursor has: a pointer to a position in the node string; a pointer to a corresponding position in the search string; the type of the last edit, either deletion, or insertion, or change, or none; a budget of "available edits". This budget is decremented whenever an edit is required. The budget is passed from parent to children cursors.
  • The algorithm pulls cursors from the stack, and compares the pointed character in the node string with the pointed character in the search string:
    • if they are the same, one single child cursor is created, advancing both pointers of 1 position. No edit was necessary, so the last edit type is none.
    • if they are not the same, and the remaining budget is higher than zero, up to three children cursors are created: one corresponding to a character change, where both pointers are incremented by 1; one corresponding to a deletion, where only the search string pointer is incremented; one corresponding to an insertion, where only the node string pointer is incremented. Each of the children cursors have a budget that is one less the parent budget.
    • Some special cases are considered to avoid creating unnecessary cursors. A sequence of adjacent deletion-insertion, or insertion-deletion, would have the same effect of a change, but would consume more budget: therefore, a delete cursor is never created after a insertion cursor, and vice-versa. Similarily, adjacent change-deletion and deletion-change, or change-insertion and insertion-change, are equivalent. Therefore, only one of these cases is generated, by never producing a change cursor after a deletion or insertion one.
  • Whenever the algorithm finds a leaf node, it reports it as a result.

Note that this algorithm can get complex if the maximum edit distance is large, as many paths would be followed. The reason why this algorithm is employed is a trade-off:

  • For full-text search purposes, the maximum edit distance is small, so the algorithm is performant enough
  • The alternatives (e.g. trigram indexes), would require much more space
  • As MiniSearch is optimized for local and possibly memory-constrained setup, higher computation complexity is traded in exchange for smaller space requirement for the index.

Search API layer

The search API layer offers a small and simple API surface for application developers. It does not assume that a specific locale is used in the indexed documents, therefore no stemming nor stop-word filtering is performed, but instead offers easy options for developers to provide their own implementation. This heuristic will be followed in future development too: rather than providing an opinionated solution, the project will offer simple building blocks for application developers to implement their own solutions.

The inverted index is implemented with SearchableMap, and posting lists are stored as values in the Map. This way, the same data structure provides both the inverted index and the set of indexed terms. Different document fields are indexed within the same index, to further save space. The index is therefore structured as following:

term -> field -> { document frequency, posting list }

When performing a search, the entries corresponding to the search term are looked up in the index (optionally searching the index with prefix or fuzzy search), then the documents are scored with a variant of Tf-Idf, and finally results for different search terms are merged with the given combinator function (by default OR, but AND can be specified).

As the document IDs necessarily occur many times in the posting list, as a space optimization they are substituted by short generated IDs. An index of short ID to original ID is maintained alongside the search index, to reconstruct the original IDs. A similar optimization is applied to the field names.