LlamaIndex Usage Pattern

The general usage pattern of LlamaIndex is as follows:

  1. Load in documents (either manually, or through a data loader)

  2. Parse the Documents into Nodes

  3. Construct Index (from Nodes or Documents)

  4. [Optional, Advanced] Building indices on top of other indices

  5. Query the index

1. Load in Documents

The first step is to load in data. This data is represented in the form of Document objects. We provide a variety of data loaders which will load in Documents through the load_data function, e.g.:

from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader('data').load_data()

You can also choose to construct documents manually. LlamaIndex exposes the Document struct.

from llama_index import Document

text_list = [text1, text2, ...]
documents = [Document(t) for t in text_list]

A Document represents a lightweight container around the data source. You can now choose to proceed with one of the following steps:

  1. Feed the Document object directly into the index (see section 3).

  2. First convert the Document into Node objects (see section 2).

2. Parse the Documents into Nodes

The next step is to parse these Document objects into Node objects. Nodes represent “chunks” of source Documents, whether that is a text chunk, an image, or more. They also contain metadata and relationship information with other nodes and index structures.

Nodes are first-class citizens in LlamaIndex. You can choose to define Nodes and all their attributes directly. You may also choose to “parse” source Documents into Nodes through our NodeParser classes.

For instance, you can do

from llama_index.node_parser import SimpleNodeParser

parser = SimpleNodeParser()

nodes = parser.get_nodes_from_documents(documents)

You can also choose to construct Node objects manually and skip the first section. For instance,

from llama_index.data_structs.node import Node, DocumentRelationship

node1 = Node(text="<text_chunk>", doc_id="<node_id>")
node2 = Node(text="<text_chunk>", doc_id="<node_id>")
# set relationships
node1.relationships[DocumentRelationship.NEXT] = node2.get_doc_id()
node2.relationships[DocumentRelationship.PREVIOUS] = node1.get_doc_id()
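
To make the idea of "chunks with relationships" concrete, here is a minimal, library-free sketch. It is not how SimpleNodeParser is implemented; ToyNode, split_into_nodes, the word-based chunking, and the "next"/"previous" keys are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyNode:
    """Hypothetical stand-in for a Node: a text chunk plus relationship links."""
    text: str
    doc_id: str
    relationships: Dict[str, str] = field(default_factory=dict)


def split_into_nodes(text: str, chunk_size: int = 4, overlap: int = 1) -> List[ToyNode]:
    """Split text into overlapping word-based chunks and link neighbouring chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    nodes: List[ToyNode] = []
    for i, start in enumerate(range(0, len(words), step)):
        chunk = words[start:start + chunk_size]
        nodes.append(ToyNode(text=" ".join(chunk), doc_id=f"node-{i}"))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    # record previous/next links between adjacent chunks
    for prev, nxt in zip(nodes, nodes[1:]):
        prev.relationships["next"] = nxt.doc_id
        nxt.relationships["previous"] = prev.doc_id
    return nodes


toy_nodes = split_into_nodes("a b c d e f g h i j", chunk_size=4, overlap=1)
for node in toy_nodes:
    print(node.doc_id, repr(node.text), node.relationships)
```

The overlap means neighbouring chunks share a word, which is one common way to avoid cutting context exactly at a chunk boundary.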

3. Index Construction

We can now build an index over these Document objects. The simplest high-level abstraction is to load in the Document objects during index initialization (this is relevant if you came directly from step 1 and skipped step 2).

from llama_index import GPTVectorStoreIndex

index = GPTVectorStoreIndex.from_documents(documents)

You can also choose to build an index over a set of Node objects directly (this is a continuation of step 2).

from llama_index import GPTVectorStoreIndex

index = GPTVectorStoreIndex(nodes)

Depending on which index you use, LlamaIndex may make LLM calls in order to build the index.

Reusing Nodes across Index Structures

If you have multiple Node objects defined, and wish to share these Node objects across multiple index structures, you can do that. Simply instantiate a StorageContext object, add the Node objects to the underlying DocumentStore, and pass the StorageContext around.

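from llama_index import GPTListIndex, GPTVectorStoreIndex, StorageContext

storage_context = StorageContext.from_defaults()
# add the nodes to the shared document store
storage_context.docstore.add_documents(nodes)

index1 = GPTVectorStoreIndex(nodes, storage_context=storage_context)
index2 = GPTListIndex(nodes, storage_context=storage_context)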

NOTE: If the storage_context argument isn’t specified, then it is implicitly created for each index during index construction. You can access the docstore associated with a given index through index.storage_context.

Inserting Documents or Nodes

You can also take advantage of the insert capability of indices to insert Document objects one at a time instead of during index construction.

from llama_index import GPTVectorStoreIndex

index = GPTVectorStoreIndex([])
for doc in documents:
    index.insert(doc)

If you want to insert nodes directly, you can use the insert_nodes function instead.

from llama_index import GPTVectorStoreIndex

# nodes: Sequence[Node]
index = GPTVectorStoreIndex([])
index.insert_nodes(nodes)

See the Update Index How-To for details and an example notebook.

Customizing LLMs

By default, we use OpenAI’s text-davinci-003 model. You may choose to use another LLM when constructing an index.

from llama_index import LLMPredictor, GPTVectorStoreIndex, PromptHelper, ServiceContext
from langchain import OpenAI


# define LLM
llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-davinci-003"))

# define prompt helper
# set maximum input size
max_input_size = 4096
# set number of output tokens
num_output = 256
# set maximum chunk overlap
max_chunk_overlap = 20
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)

service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)

index = GPTVectorStoreIndex.from_documents(
    documents, service_context=service_context
)

See the Custom LLM’s How-To for more details.
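
As a rough illustration of how the three prompt helper settings interact, the sketch below works through the token budget arithmetic. It is not PromptHelper's actual algorithm; the function names, the 100-token prompt template, and the 512-token chunk size are assumptions.

```python
def available_context_tokens(max_input_size: int, num_output: int, prompt_tokens: int) -> int:
    """Tokens left for retrieved text after reserving room for the prompt template and the answer."""
    remaining = max_input_size - num_output - prompt_tokens
    if remaining <= 0:
        raise ValueError("no room left for context; lower num_output or shorten the prompt")
    return remaining


def chunks_that_fit(budget: int, chunk_tokens: int) -> int:
    """How many whole chunks of a given size fit in the remaining budget."""
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    return budget // chunk_tokens


# 4096 and 256 mirror the values above; 100 and 512 are hypothetical.
budget = available_context_tokens(max_input_size=4096, num_output=256, prompt_tokens=100)
print(budget)                        # 3740
print(chunks_that_fit(budget, 512))  # 7
```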

Customizing Prompts

Depending on the index used, we use default prompt templates for constructing the index (and also for insertion and querying). See Custom Prompts How-To for more details on how to customize your prompt.

Customizing embeddings

For embedding-based indices, you can choose to pass in a custom embedding model. See Custom Embeddings How-To for more details.

Cost Predictor

Creating an index, inserting into an index, and querying an index may use tokens. We can track token usage through the outputs of these operations. When running operations, the token usage is printed. You can also fetch the token usage through index.llm_predictor.last_token_usage. See Cost Predictor How-To for more details.
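
The idea of tracking usage per call can be sketched without the library. In this hypothetical example, CountingLLM wraps any prompt-to-text callable and records the last and total usage. Whitespace-separated words stand in for real tokens, which is an assumption; a real tokenizer would give different counts.

```python
from typing import Callable


class CountingLLM:
    """Hypothetical wrapper that records approximate token usage per call."""

    def __init__(self, llm: Callable[[str], str]) -> None:
        self._llm = llm
        self.last_token_usage = 0
        self.total_token_usage = 0

    def __call__(self, prompt: str) -> str:
        completion = self._llm(prompt)
        # approximate tokens as whitespace-separated words (assumption)
        self.last_token_usage = len(prompt.split()) + len(completion.split())
        self.total_token_usage += self.last_token_usage
        return completion


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model; always returns the same two-word answer."""
    return "stub answer"


counting_llm = CountingLLM(stub_llm)
counting_llm("What did the author do growing up?")
first_usage = counting_llm.last_token_usage
counting_llm("Summarize the document")
print(first_usage, counting_llm.last_token_usage, counting_llm.total_token_usage)
```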

[Optional] Save the index for future use

By default, data is stored in-memory. To persist to disk:

index.storage_context.persist(persist_dir="<persist_dir>")

You may omit persist_dir to persist to ./storage by default.

To reload from disk:

from llama_index import StorageContext, load_index_from_storage

# rebuild storage context
storage_context = StorageContext.from_defaults(persist_dir="<persist_dir>")

# load index
index = load_index_from_storage(storage_context)

NOTE: If you had initialized the index with a custom ServiceContext object, you will also need to pass in the same ServiceContext during load_index_from_storage.

service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)

# when first building the index
index = GPTVectorStoreIndex.from_documents(
    documents, service_context=service_context
)

# when loading the index from disk
index = load_index_from_storage(
    storage_context, service_context=service_context
)
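
The persist-then-reload round trip can be pictured with a small standard-library sketch. This is not the library's storage format; persist_nodes, load_nodes, and the docstore.json file name are hypothetical and only illustrate writing state to a directory and rebuilding it later.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def persist_nodes(nodes: Dict[str, str], persist_dir: str = "./storage") -> Path:
    """Write node texts to <persist_dir>/docstore.json, creating the directory if needed."""
    directory = Path(persist_dir)
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / "docstore.json"
    target.write_text(json.dumps(nodes), encoding="utf-8")
    return target


def load_nodes(persist_dir: str = "./storage") -> Dict[str, str]:
    """Rebuild the in-memory mapping from the persisted file."""
    source = Path(persist_dir) / "docstore.json"
    return json.loads(source.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp_dir:
    original = {"node-0": "first chunk", "node-1": "second chunk"}
    persist_nodes(original, tmp_dir)
    restored = load_nodes(tmp_dir)

print(restored == original)
```

Anything that is configuration rather than data (here, nothing; in the library, the ServiceContext) is not part of the persisted state and has to be supplied again at load time.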

4. [Optional, Advanced] Building indices on top of other indices

You can build indices on top of other indices! Composability gives you greater power in indexing your heterogeneous sources of data. For a discussion on relevant use cases, see our Query Use Cases. For technical details and examples, see our Composability How-To.

5. Query the index

After building the index, you can now query it with a QueryEngine. Note that a “query” is simply an input to an LLM - this means that you can use the index for question-answering, but you can also do more than that!

High-level API

To start, you can query an index with the default QueryEngine (i.e., using default configs), as follows:

query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")

response = query_engine.query("Write an email to the user given their background information.")

Low-level API

We also support a low-level composition API that gives you more granular control over the query logic. Below we highlight a few of the possible customizations.

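from llama_index import (
    GPTVectorStoreIndex,
    ResponseSynthesizer,
)
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.indices.postprocessor import SimilarityPostprocessor

# build index
index = GPTVectorStoreIndex.from_documents(documents)

# configure retriever (similarity_top_k=2 is an illustrative value)
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=2,
)

# configure response synthesizer (similarity_cutoff=0.7 is an illustrative value)
response_synthesizer = ResponseSynthesizer.from_args(
    node_postprocessors=[
        SimilarityPostprocessor(similarity_cutoff=0.7)
    ]
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)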

# query
response = query_engine.query("What did the author do growing up?")

You may also add your own retrieval, response synthesis, and overall query logic, by implementing the corresponding interfaces.

For a full list of implemented components and the supported configurations, please see the detailed reference docs.

In the following, we discuss some commonly used configurations in detail.

Configuring retriever

An index can have a variety of index-specific retrieval modes. For instance, a list index supports the default ListIndexRetriever that retrieves all nodes, and ListIndexEmbeddingRetriever that retrieves the top-k nodes by embedding similarity.

For convenience, you can also use the following shorthand:

    # ListIndexRetriever
    retriever = index.as_retriever(retriever_mode='default')
    # ListIndexEmbeddingRetriever
    retriever = index.as_retriever(retriever_mode='embedding')

After choosing your desired retriever, you can construct your query engine:

query_engine = RetrieverQueryEngine(retriever)
response = query_engine.query("What did the author do growing up?")

The full list of retrievers for each index (and their shorthand) is documented in the Query Reference.
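
The difference between the two retrieval modes can be sketched without the library. The example below is a toy under stated assumptions: bag-of-words counts stand in for real embeddings, and retrieve_all and retrieve_top_k are hypothetical names, not LlamaIndex classes.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system would use a learned model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_all(chunks: List[str], query: str) -> List[str]:
    """'default' style: return every chunk, ignoring the query."""
    return list(chunks)


def retrieve_top_k(chunks: List[str], query: str, k: int = 2) -> List[str]:
    """'embedding' style: return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(embed(chunk), query_vec), reverse=True)
    return ranked[:k]


chunks = [
    "the author grew up writing short stories",
    "the company was founded in 2005",
    "growing up the author also programmed",
]
top = retrieve_top_k(chunks, "what did the author do growing up", k=2)
print(top)
```

Returning everything is simple but sends more text to the LLM; top-k retrieval trades recall for fewer, more relevant chunks.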

Configuring response synthesis

After a retriever fetches relevant nodes, a ResponseSynthesizer synthesizes the final response by combining the information.

You can configure it via

query_engine = RetrieverQueryEngine.from_args(retriever, response_mode=<response_mode>)

Right now, we support the following options:

  • default: “create and refine” an answer by sequentially going through each retrieved Node. This makes a separate LLM call per Node. Good for more detailed answers.

  • compact: “compact” the prompt during each LLM call by stuffing in as many Node text chunks as can fit within the maximum prompt size. If there are too many chunks to stuff into one prompt, “create and refine” an answer by going through multiple prompts.

  • tree_summarize: Given a set of Node objects and the query, recursively construct a tree and return the root node as the response. Good for summarization purposes.

index = GPTListIndex.from_documents(documents)
retriever = index.as_retriever()

# default
query_engine = RetrieverQueryEngine.from_args(retriever, response_mode='default')
response = query_engine.query("What did the author do growing up?")

# compact
query_engine = RetrieverQueryEngine.from_args(retriever, response_mode='compact')
response = query_engine.query("What did the author do growing up?")

# tree summarize
query_engine = RetrieverQueryEngine.from_args(retriever, response_mode='tree_summarize')
response = query_engine.query("What did the author do growing up?")
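
To compare how many LLM calls each mode implies, here is a simplified, library-free sketch. It is not the ResponseSynthesizer implementation: the stub LLM, the prompt wording, the word-based size limit (max_words), and the fanout parameter are all assumptions.

```python
from typing import Callable, Dict, List, Tuple

LLM = Callable[[str], str]


def make_counting_llm() -> Tuple[LLM, List[str]]:
    """Return a stub LLM plus the list of prompts it has received."""
    prompts: List[str] = []

    def llm(prompt: str) -> str:
        prompts.append(prompt)
        return f"answer-{len(prompts)}"

    return llm, prompts


def refine(llm: LLM, query: str, chunks: List[str]) -> str:
    """'default' style: one call per chunk, each refining the previous answer."""
    answer = ""
    for chunk in chunks:
        answer = llm(f"Query: {query}\nCurrent answer: {answer}\nContext: {chunk}")
    return answer


def compact(llm: LLM, query: str, chunks: List[str], max_words: int = 8) -> str:
    """'compact' style: pack chunks up to a size limit, then refine over the packs."""
    packed: List[str] = []
    current: List[str] = []
    for chunk in chunks:
        candidate = " ".join(current + [chunk])
        if current and len(candidate.split()) > max_words:
            packed.append(" ".join(current))
            current = []
        current.append(chunk)
    if current:
        packed.append(" ".join(current))
    return refine(llm, query, packed)


def tree_summarize(llm: LLM, query: str, chunks: List[str], fanout: int = 2) -> str:
    """'tree_summarize' style: summarize groups of chunks level by level until one remains."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = list(chunks)
    if not level:
        return ""
    while True:
        groups = [" ".join(level[i:i + fanout]) for i in range(0, len(level), fanout)]
        level = [llm(f"Query: {query}\nSummarize: {group}") for group in groups]
        if len(level) == 1:
            return level[0]


chunks = ["one two three", "four five six", "seven eight nine", "ten eleven twelve"]
query = "What did the author do growing up?"
call_counts: Dict[str, int] = {}
for name, mode in [("default", refine), ("compact", compact), ("tree_summarize", tree_summarize)]:
    llm, prompts = make_counting_llm()
    mode(llm, query, chunks)
    call_counts[name] = len(prompts)
print(call_counts)
```

With these four small chunks, the refine loop makes one call per chunk, packing halves that, and the tree makes one call per internal node.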

Configuring node postprocessors (i.e. filtering and augmentation)

We also support advanced Node filtering and augmentation that can further improve the relevancy of the retrieved Node objects. This can help reduce the time/number of LLM calls/cost or improve response quality.

For example:

  • KeywordNodePostprocessor: filters nodes by required_keywords and exclude_keywords.

  • SimilarityPostprocessor: filters nodes by setting a threshold on the similarity score (thus only supported by embedding-based retrievers).

  • PrevNextNodePostprocessor: augments retrieved Node objects with additional relevant context based on Node relationships.

The full list of node postprocessors is documented in the Node Postprocessor Reference.

To configure the desired node postprocessors:

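from llama_index.indices.postprocessor import KeywordNodePostprocessor

node_postprocessors = [
    KeywordNodePostprocessor(required_keywords=["<keyword>"], exclude_keywords=["<keyword>"])
]
query_engine = RetrieverQueryEngine.from_args(
    retriever, node_postprocessors=node_postprocessors
)
response = query_engine.query("What did the author do growing up?")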
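
The two filtering postprocessors described above boil down to simple predicates over retrieved chunks. The following sketch shows that logic in plain Python; ScoredChunk, filter_by_keywords, and filter_by_similarity are hypothetical names, the scores are made up, and the substring matching is an assumption rather than the library's behavior.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ScoredChunk:
    """Hypothetical retrieved chunk with its similarity score."""
    text: str
    score: float


def filter_by_keywords(
    chunks: Sequence[ScoredChunk],
    required: Sequence[str] = (),
    excluded: Sequence[str] = (),
) -> List[ScoredChunk]:
    """Keep chunks that contain every required keyword and no excluded keyword."""
    kept = []
    for chunk in chunks:
        lowered = chunk.text.lower()
        has_required = all(word.lower() in lowered for word in required)
        has_excluded = any(word.lower() in lowered for word in excluded)
        if has_required and not has_excluded:
            kept.append(chunk)
    return kept


def filter_by_similarity(chunks: Sequence[ScoredChunk], cutoff: float) -> List[ScoredChunk]:
    """Keep chunks whose similarity score meets the cutoff."""
    return [chunk for chunk in chunks if chunk.score >= cutoff]


retrieved = [
    ScoredChunk("The author studied painting in Italy", 0.82),
    ScoredChunk("The author started a company", 0.74),
    ScoredChunk("Unrelated note about the weather", 0.31),
]
by_keyword = filter_by_keywords(retrieved, required=["author"], excluded=["Italy"])
by_score = filter_by_similarity(retrieved, cutoff=0.7)
print([c.text for c in by_keyword])
print([c.text for c in by_score])
```

Either filter shrinks the set of chunks passed to response synthesis, which is where the savings in LLM calls come from.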

6. Parsing the response

The object returned is a Response object. The object contains both the response text as well as the “sources” of the response:

response = query_engine.query("<query_str>")

# get response
# response.response
str(response)

# get sources
response.source_nodes
# formatted sources
response.get_formatted_sources()

An example is shown below.