Inverted Index

An inverted index is a data structure that search engines use to record which words appear in which documents. A conventional (forward) index lists each document together with the words it contains; an inverted index flips this around, listing each word together with the documents in which it appears. This lets a search engine locate matching documents quickly, without scanning every document at query time. It also supports specialized uses such as vertical search and matching related terms through latent semantic indexing (LSI).

How Does an Inverted Index Work?

An inverted index stores words and creates a map showing where each word appears. Here’s a breakdown of how it works:

  1. Text Tokenization
    The search engine scans each document and breaks it down into individual words, or “tokens.” This process is called tokenization. Very common words like “the” and “is” (known as stop words) are then usually removed, leaving a list of core terms for each document.
  2. Building a Term List
    The inverted index organizes these terms and notes each document in which they appear. For instance, if “SEO” appears in three different articles, the inverted index links each of those articles to the term “SEO.”
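
The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine implements it: the function names, the tiny stop-word list, and the choice to lowercase every token are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Illustrative stop-word list (an assumption); real systems use longer,
# language-specific lists.
STOP_WORDS = {"the", "is", "and", "are", "of", "a", "an"}

def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_inverted_index(documents):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}
```

Because tokens are lowercased here, “SEO” is stored under the key "seo". Storing document IDs in a set means a term that occurs several times in one document is still listed once for that document.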

Example of an Inverted Index in Action

Imagine we have three documents:

  • Document 1: “SEO helps improve website visibility.”
  • Document 2: “Inverted indexes improve search efficiency.”
  • Document 3: “SEO and indexing are part of search optimization.”

The inverted index would map terms to documents as follows (only the terms that appear in more than one document are shown):

  • “SEO”: Document 1, Document 3
  • “improve”: Document 1, Document 2
  • “search”: Document 2, Document 3

If a user searches for “SEO,” the search engine looks up that term in the index and immediately finds Document 1 and Document 3, without rescanning the text of any document. With LSI, documents containing related terms like “search optimization” might also surface, making results more comprehensive.
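
The lookup described above can be tried out on the three example documents. The following Python sketch is only an illustration: the names `build_index` and `search` are made up for this example, stop words are kept for simplicity, and matching is exact-term only (no LSI and no stemming, so “indexes” and “indexing” count as different terms).

```python
import re
from collections import defaultdict

DOCUMENTS = {
    1: "SEO helps improve website visibility.",
    2: "Inverted indexes improve search efficiency.",
    3: "SEO and indexing are part of search optimization.",
}

def build_index(documents):
    """Map each lowercased term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the IDs of documents that contain every term in the query."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        # Intersect posting sets so only documents with all terms remain.
        result &= index.get(term, set())
    return sorted(result)

index = build_index(DOCUMENTS)
print(search(index, "SEO"))             # [1, 3]
print(search(index, "improve search"))  # [2]
```

A single-term query is one dictionary lookup. A multi-term query intersects the document sets for each term, which is why “improve search” returns only Document 2.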