Contributing to LlamaIndex
Interested in contributing to LlamaIndex? Here’s how to get started!
The best part of LlamaIndex is our community of users and contributors.
What should I work on?
🆕 Extend core modules
🐛 Fix bugs
🎉 Add usage examples
🧪 Add experimental features
📄 Improve code quality & documentation
Also, join our Discord for ideas and discussions: https://discord.gg/dGcwcsnxhU.
1. 🆕 Extend Core Modules
The most impactful way to contribute to LlamaIndex is by extending our core modules.
We welcome contributions to all of the modules described below. So far, we have implemented a core set of functionalities for each; as a contributor, you can help each module unlock its full potential.
NOTE: We are making rapid improvements to the project, and as a result, some interfaces are still volatile. Specifically, we are actively working on making the following components more modular and extensible: core indexes, document stores, index queries, query runner.
Below, we will describe what each module does, give a high-level idea of the interface, show existing implementations, and give some ideas for contribution.
Data Loaders
A data loader ingests data of any format from anywhere into Document objects, which can then be parsed and indexed.
load_data takes arbitrary arguments as input (e.g. path to data), and outputs a sequence of Document objects.
Contributing a data loader is easy and super impactful for the community. The preferred way to contribute is by making a PR on the LlamaHub GitHub repo.
Want to load something but there’s no LlamaHub data loader for it yet? Make a PR!
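To give a feel for the shape of a loader, here’s a minimal, self-contained sketch. The CsvRowReader name and the simplified Document class below are illustrative stand-ins, not the actual LlamaHub API:

```python
import csv
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Simplified stand-in for llama_index's Document: text plus metadata."""
    text: str
    extra_info: Dict[str, Any] = field(default_factory=dict)


class CsvRowReader:
    """Hypothetical loader: turns each row of a CSV file into a Document."""

    def load_data(self, file_path: str) -> List[Document]:
        docs: List[Document] = []
        with open(file_path, newline="") as f:
            for i, row in enumerate(csv.DictReader(f)):
                # One Document per row; keep provenance in extra_info.
                text = ", ".join(f"{k}: {v}" for k, v in row.items())
                docs.append(
                    Document(text=text, extra_info={"row": i, "source": file_path})
                )
        return docs
```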
Node Parsers
A node parser parses Document objects into Node objects, the atomic unit of data that LlamaIndex operates over (e.g. a chunk of text, an image, or a table).
It is responsible for splitting text (via text splitters) and explicitly modelling the relationship between units of data (e.g. A is the source of B, C is a chunk after D).
get_nodes_from_documents takes a sequence of Document objects as input, and outputs a sequence of Node objects.
See the API reference for full details.
Ideas for contribution: add new Node relationships to model hierarchical documents (e.g. play-act-scene, chapter-section-heading).
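Here’s a rough, self-contained sketch of the interface. The Document/Node classes below are simplified stand-ins for the real ones, and the prev/next linking is just one example of the relationships described above:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    text: str


@dataclass
class Node:
    """Stand-in Node: a chunk of text plus named relationships to other chunks."""
    text: str
    node_id: str
    relationships: Dict[str, str] = field(default_factory=dict)


class LineNodeParser:
    """Hypothetical parser: one Node per non-empty line, chained prev/next."""

    def get_nodes_from_documents(self, documents: List[Document]) -> List[Node]:
        nodes: List[Node] = []
        for d, doc in enumerate(documents):
            lines = [ln for ln in doc.text.splitlines() if ln.strip()]
            for i, line in enumerate(lines):
                node = Node(text=line, node_id=f"doc{d}-node{i}")
                node.relationships["source"] = f"doc{d}"  # A is the source of B
                if i > 0:
                    # C is a chunk after D
                    node.relationships["previous"] = nodes[-1].node_id
                    nodes[-1].relationships["next"] = node.node_id
                nodes.append(node)
        return nodes
```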
Text Splitters
A text splitter splits a long text str into smaller text str chunks with a desired size and splitting “strategy”. This matters because LLMs have a limited context window size, and the quality of the text chunks used as context impacts the quality of query results.
split_text takes a str as input, and outputs a sequence of str chunks.
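As an illustration, here’s a minimal sketch of a splitter with a fixed chunk size and a sliding overlap; real splitters would also respect sentence and word boundaries:

```python
from typing import List


class FixedSizeTextSplitter:
    """Hypothetical splitter: fixed-size chunks with a sliding overlap."""

    def __init__(self, chunk_size: int = 512, overlap: int = 64):
        if overlap >= chunk_size:
            raise ValueError("overlap must be smaller than chunk_size")
        self.chunk_size = chunk_size
        self.overlap = overlap

    def split_text(self, text: str) -> List[str]:
        chunks: List[str] = []
        # Each chunk starts `chunk_size - overlap` characters after the last.
        step = self.chunk_size - self.overlap
        for start in range(0, len(text), step):
            chunk = text[start : start + self.chunk_size]
            if chunk:
                chunks.append(chunk)
        return chunks
```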
Document/Index Stores
Under the hood, LlamaIndex also supports a swappable storage layer that allows you to customize Document Stores (where ingested documents, i.e. Node objects, are stored) and Index Stores (where index metadata is stored).
We have an underlying key-value abstraction backing the document/index stores. Currently we support in-memory and MongoDB storage for these stores. Open to contributions!
See Storage guide for details.
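A sketch of what that key-value abstraction looks like, assuming a simple put/get/delete surface (the method names here are illustrative; check the actual base class before contributing a new backend):

```python
from typing import Any, Dict, Optional


class InMemoryKVStore:
    """Hypothetical key-value backend for the document/index stores.

    A new storage integration (e.g. Redis, DynamoDB) would implement this
    same surface against the external service instead of a dict.
    """

    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, Any]] = {}

    def put(self, key: str, val: Dict[str, Any]) -> None:
        self._data[key] = val

    def get(self, key: str) -> Optional[Dict[str, Any]]:
        return self._data.get(key)

    def delete(self, key: str) -> bool:
        # Return True if the key existed and was removed.
        return self._data.pop(key, None) is not None
```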
Vector Stores
Our vector store classes store embeddings and support lookup via similarity search. These serve as the main data store and retrieval engine for our vector index.
add takes in a sequence of NodeWithEmbeddings and inserts the embeddings (and possibly the node contents & metadata) into the vector store.
delete removes entries given document IDs.
query retrieves the top-k most similar entries given a query embedding.
See a vector database out there that we don’t support yet? Make a PR!
See the API reference for full details.
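Here’s a self-contained sketch of the interface using brute-force cosine similarity; the NodeWithEmbedding dataclass below is a simplified stand-in for the real input type:

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class NodeWithEmbedding:
    """Simplified stand-in for the node + embedding pairs handed to the store."""
    node_id: str
    doc_id: str
    text: str
    embedding: List[float]


class InMemoryVectorStore:
    """Hypothetical vector store: brute-force cosine-similarity lookup."""

    def __init__(self) -> None:
        self._entries: Dict[str, NodeWithEmbedding] = {}

    def add(self, embedding_results: List[NodeWithEmbedding]) -> List[str]:
        # Insert (or overwrite) entries keyed by node ID.
        for entry in embedding_results:
            self._entries[entry.node_id] = entry
        return [e.node_id for e in embedding_results]

    def delete(self, doc_id: str) -> None:
        # Remove every entry that came from the given document.
        self._entries = {
            k: v for k, v in self._entries.items() if v.doc_id != doc_id
        }

    def query(
        self, query_embedding: List[float], top_k: int = 2
    ) -> List[Tuple[str, float]]:
        def cosine(a: List[float], b: List[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        scored = [
            (e.node_id, cosine(query_embedding, e.embedding))
            for e in self._entries.values()
        ]
        return sorted(scored, key=lambda s: s[1], reverse=True)[:top_k]
```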
Retrievers
Our retriever classes are lightweight classes that implement a retrieve method. They may take in an index class as input; by default, each of our indices (list, vector, keyword) has an associated retriever. The output is a set of NodeWithScore objects (a Node object with an extra score field).
You may also choose to implement your own retriever classes on top of your own data if you wish.
retrieve takes in a QueryBundle as input, and outputs a list of NodeWithScore objects.
Besides the “default” retrievers built on top of each index, what about fancier retrievers? E.g. retrievers that take in other retrievers as input? Or other types of data?
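For a feel of the shape, here’s a toy sketch; note that the real retrieve takes a QueryBundle rather than a plain string, and NodeWithScore below is a simplified stand-in:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeWithScore:
    """Simplified stand-in: a node's text plus the retriever's relevance score."""
    text: str
    score: float


class KeywordOverlapRetriever:
    """Hypothetical retriever: scores nodes by query-term overlap."""

    def __init__(self, corpus: List[str]):
        self.corpus = corpus

    def retrieve(self, query_str: str) -> List[NodeWithScore]:
        query_terms = set(query_str.lower().split())
        results: List[NodeWithScore] = []
        for text in self.corpus:
            overlap = len(query_terms & set(text.lower().split()))
            if overlap:
                results.append(NodeWithScore(text=text, score=float(overlap)))
        # Highest-overlap nodes first.
        return sorted(results, key=lambda r: r.score, reverse=True)
```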
Query Engines
Our query engine classes are lightweight classes that implement a query method; the query returns a response type.
For instance, they may take in a retriever class as input; our RetrieverQueryEngine takes in a retriever as input as well as a BaseSynthesizer class for response synthesis, and its query method performs retrieval and synthesis before returning the final result.
They may take in other query engine classes as input too.
query takes in a QueryBundle as input, and outputs a Response object.
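A minimal sketch of the retrieve-then-synthesize flow, with plain functions standing in for the retriever and synthesizer classes:

```python
from typing import Callable, List


class SimpleQueryEngine:
    """Hypothetical query engine: retrieval followed by response synthesis."""

    def __init__(
        self,
        retrieve: Callable[[str], List[str]],
        synthesize: Callable[[str, List[str]], str],
    ):
        self.retrieve = retrieve      # stand-in for a retriever's retrieve()
        self.synthesize = synthesize  # stand-in for a BaseSynthesizer

    def query(self, query_str: str) -> str:
        context = self.retrieve(query_str)
        # In LlamaIndex, synthesis would prompt an LLM with query + context.
        return self.synthesize(query_str, context)


# Trivial usage with stand-in functions:
engine = SimpleQueryEngine(
    retrieve=lambda q: ["llamas are camelids"],
    synthesize=lambda q, ctx: f"Q: {q} | context: {'; '.join(ctx)}",
)
print(engine.query("what are llamas?"))
```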
Query Transforms
A query transform augments a raw query string with associated transformations to improve index querying. This can be interpreted as a pre-processing stage, before the core index query logic is executed.
run takes in a QueryBundle as input, and outputs a transformed QueryBundle.
See guide for more information.
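As a toy example, here’s a hypothetical transform that prepends a fixed hint to the raw query (the QueryBundle dataclass is a simplified stand-in, and HintQueryTransform is not a real class):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QueryBundle:
    """Simplified stand-in for llama_index's QueryBundle."""
    query_str: str
    custom_embedding_strs: Optional[List[str]] = None


class HintQueryTransform:
    """Hypothetical transform: prepends clarifying context to the raw query."""

    def __init__(self, hint: str):
        self.hint = hint

    def run(self, query_bundle: QueryBundle) -> QueryBundle:
        # Rewrite the query while keeping the original around for embedding.
        return QueryBundle(
            query_str=f"{self.hint}\n\n{query_bundle.query_str}",
            custom_embedding_strs=[query_bundle.query_str],
        )
```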
Token Usage Optimizers
A token usage optimizer refines the retrieved
Nodes to reduce token usage during response synthesis.
optimize takes in the QueryBundle and a text chunk str, and outputs a refined text chunk str that yields a more optimized response.
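A toy sketch of that interface, taking a plain query string instead of the real QueryBundle to stay self-contained:

```python
from typing import List


class KeywordTrimOptimizer:
    """Hypothetical optimizer: keeps only sentences sharing terms with the query."""

    def optimize(self, query_str: str, text_chunk: str) -> str:
        query_terms = set(query_str.lower().split())
        kept: List[str] = []
        # Crude sentence split; a real implementation would be more careful.
        for sentence in text_chunk.split(". "):
            if query_terms & set(sentence.lower().split()):
                kept.append(sentence)
        # Fall back to the original chunk if nothing matched.
        return ". ".join(kept) if kept else text_chunk
```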
Node Postprocessors
A node postprocessor refines a list of retrieved nodes given configuration and context.
postprocess_nodes takes a list of Nodes and extra metadata (e.g. similarity and query), and outputs a refined list of Nodes.
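For illustration, a hypothetical postprocessor that applies a similarity cutoff (NodeWithScore is again a simplified stand-in):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeWithScore:
    text: str
    score: float


class SimilarityCutoffPostprocessor:
    """Hypothetical postprocessor: drops nodes below a similarity threshold."""

    def __init__(self, cutoff: float = 0.7):
        self.cutoff = cutoff

    def postprocess_nodes(self, nodes: List[NodeWithScore]) -> List[NodeWithScore]:
        # Keep only nodes whose retrieval score clears the cutoff.
        return [n for n in nodes if n.score >= self.cutoff]
```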
Output Parsers
An output parser enables us to extract structured output from the plain text output generated by the LLM.
format: formats a query str with structured output formatting instructions, and outputs the formatted str.
parse: takes a str (from the LLM response) as input, and gives a parsed structured output (optionally also validated and error-corrected).
See guide for more information.
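A minimal sketch of the two methods, assuming a JSON-based format (the fence-stripping logic is just a common-sense guard, not the library’s actual behavior):

```python
import json
from typing import Any


class JsonOutputParser:
    """Hypothetical output parser: asks the LLM for JSON, then parses the reply."""

    FORMAT_INSTRUCTIONS = "Answer with a single JSON object and nothing else."

    def format(self, query: str) -> str:
        # Append the formatting instructions to the query sent to the LLM.
        return f"{query}\n\n{self.FORMAT_INSTRUCTIONS}"

    def parse(self, output: str) -> Any:
        cleaned = output.strip()
        if cleaned.startswith("```"):
            # Tolerate a ```json ... ``` code fence around the reply.
            cleaned = cleaned.strip("`").strip()
            if cleaned.startswith("json"):
                cleaned = cleaned[len("json"):]
        return json.loads(cleaned)
```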
2. 🐛 Fix Bugs
Most bugs are reported and tracked in the GitHub issues page. We do our best to triage and tag these issues:
Issues tagged as bug are confirmed bugs.
New contributors may want to start with issues tagged with
good first issue.
Please feel free to open an issue and/or assign an issue to yourself.
3. 🎉 Add Usage Examples
If you have applied LlamaIndex to a unique use-case (e.g. an interesting dataset, a customized index structure, a complex query), we would love your contribution in the form of a guide or an example notebook.
4. 🧪 Add Experimental Features
If you have a crazy idea, make a PR for it! Whether it’s the latest research or something you thought of in the shower, we’d love to see creative ways to improve LlamaIndex.
5. 📄 Improve Code Quality & Documentation
We would love your help in making the project cleaner, more robust, and more understandable. If you find something confusing, it most likely is for other people as well. Help us be better!
LlamaIndex is a Python package. We’ve tested primarily with Python versions >= 3.8. Here’s a quick and dirty guide to getting your environment set up.
Then, create a new Python virtual environment. The commands below create an environment in .venv and activate it:
python -m venv .venv
source .venv/bin/activate
If you are on Windows, use the following to activate your virtual environment:
.venv\Scripts\activate
Install the required dependencies (this will also install LlamaIndex through pip install -e . so that you can start developing on it):
pip install -r requirements.txt
Now you should be set!
Validating your Change
Let’s make sure to format/lint our change. For bigger changes, let’s also make sure to test it and perhaps create an example notebook.
You can format and lint your changes with the following commands in the root directory:
make format; make lint
You can also make use of our pre-commit hooks by setting up git hook scripts:
pre-commit install
We run an assortment of linters as part of these checks.
For bigger changes, you’ll want to create a unit test. Our tests are in the tests folder. We use pytest for unit testing. To run all unit tests, run the following in the root dir:
pip install -r data_requirements.txt
pytest tests
Creating an Example Notebook
For changes that involve entirely new features, it may be worth adding an example Jupyter notebook to showcase this feature.
Example notebooks can be found in this folder: https://github.com/jerryjliu/llama_index/tree/main/examples.