
Persisting & Loading Data

Persisting Data

By default, LlamaIndex stores data in-memory, and this data can be explicitly persisted if desired:

storage_context.persist(persist_dir="<persist_dir>")

This will persist data to disk, under the specified persist_dir (or ./storage by default).
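
As a point of reference, here is a minimal end-to-end sketch, assuming a local ./data directory of documents (the path is illustrative):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# the index's storage context bundles the docstore, index store,
# and vector store; persist() writes all of them to disk
index.storage_context.persist(persist_dir="./storage")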

Multiple indexes can be persisted to and loaded from the same directory, assuming you keep track of index IDs for loading, as sketched below.
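
For example, a sketch persisting two indexes to a single directory with distinct IDs (the IDs here are arbitrary labels chosen for illustration):

from llama_index.core import StorageContext, SummaryIndex, VectorStoreIndex

storage_context = StorageContext.from_defaults()

vector_index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
vector_index.set_index_id("vector_index")

summary_index = SummaryIndex.from_documents(
    documents, storage_context=storage_context
)
summary_index.set_index_id("summary_index")

# both indexes share one docstore and are written to the same directory
storage_context.persist(persist_dir="./storage")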

You can also configure alternative storage backends (e.g. MongoDB) that persist data by default. In this case, calling storage_context.persist() does nothing.
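
For example, a minimal sketch of a MongoDB-backed storage context, assuming the llama-index-storage-docstore-mongodb and llama-index-storage-index-store-mongodb integrations are installed and <mongodb_uri> is a placeholder:

from llama_index.core import StorageContext
from llama_index.storage.docstore.mongodb import MongoDocumentStore
from llama_index.storage.index_store.mongodb import MongoIndexStore

# documents and index metadata are written to MongoDB as indexes are built,
# so there is no need to call persist() explicitly
storage_context = StorageContext.from_defaults(
    docstore=MongoDocumentStore.from_uri(uri="<mongodb_uri>"),
    index_store=MongoIndexStore.from_uri(uri="<mongodb_uri>"),
)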

Loading Data

To load data, simply re-create the storage context using the same configuration (e.g. pass in the same persist_dir or vector store client).

storage_context = StorageContext.from_defaults(
    docstore=SimpleDocumentStore.from_persist_dir(persist_dir="<persist_dir>"),
    vector_store=SimpleVectorStore.from_persist_dir(
        persist_dir="<persist_dir>"
    ),
    index_store=SimpleIndexStore.from_persist_dir(persist_dir="<persist_dir>"),
)

We can then load specific indices from the StorageContext via the convenience functions below.

from llama_index.core import (
    load_index_from_storage,
    load_indices_from_storage,
    load_graph_from_storage,
)

# load a single index
# need to specify index_id if multiple indexes are persisted to the same directory
index = load_index_from_storage(storage_context, index_id="<index_id>")

# don't need to specify index_id if there's only one index in storage context
index = load_index_from_storage(storage_context)

# load multiple indices
indices = load_indices_from_storage(storage_context)  # loads all indices
indices = load_indices_from_storage(
    storage_context, index_ids=[index_id1, ...]
)  # loads specific indices

# load composable graph
graph = load_graph_from_storage(
    storage_context, root_id="<root_id>"
)  # loads graph with the specified root_id

Using a remote backend

By default, LlamaIndex uses the local filesystem to load and save files. However, you can override this by passing an fsspec.AbstractFileSystem object.

Here's a simple example that builds a vector store index:

import os

import dotenv
import s3fs

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# load credentials from a .env file
dotenv.load_dotenv("../../../.env")

# load documents and build an in-memory vector store index
documents = SimpleDirectoryReader(
    "../../../examples/paul_graham_essay/data/"
).load_data()
print(len(documents))

index = VectorStoreIndex.from_documents(documents)

Up to this point, everything is the same as the local workflow. Now let's instantiate an S3 filesystem and save/load from there.

# set up s3fs
AWS_KEY = os.environ["AWS_ACCESS_KEY_ID"]
AWS_SECRET = os.environ["AWS_SECRET_ACCESS_KEY"]
R2_ACCOUNT_ID = os.environ["R2_ACCOUNT_ID"]

assert AWS_KEY != ""  # os.environ[...] already raises KeyError if unset

s3 = s3fs.S3FileSystem(
    key=AWS_KEY,
    secret=AWS_SECRET,
    endpoint_url=f"https://{R2_ACCOUNT_ID}.r2.cloudflarestorage.com",
    s3_additional_kwargs={"ACL": "public-read"},
)

# If you're persisting 2+ indexes with the same StorageContext,
# set a unique index ID so each one can be loaded by ID later
index.set_index_id("vector_index")

# persist index to s3
s3_bucket_name = "llama-index/storage_demo"  # {bucket_name}/{index_name}
index.storage_context.persist(persist_dir=s3_bucket_name, fs=s3)

# load index from s3
index_from_s3 = load_index_from_storage(
    StorageContext.from_defaults(persist_dir=s3_bucket_name, fs=s3),
    index_id="vector_index",
)

If you do not pass a filesystem, the local filesystem is assumed by default.