Local Embeddings with HuggingFace#

LlamaIndex has support for HuggingFace embedding models, including BGE, Instructor, and more.

Furthermore, we provide utilities to create and use ONNX models using the Optimum library from HuggingFace.


The base HuggingFaceEmbedding class is a generic wrapper around any HuggingFace model for embeddings. You can set either pooling="cls" or pooling="mean"; in most cases you’ll want cls pooling, but the model card for your particular model may recommend otherwise.
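To make the difference between the two pooling modes concrete, here is a minimal pure-Python sketch on toy data. It assumes the usual definitions (cls pooling keeps the first token's vector, mean pooling averages the vectors of non-padding tokens); the helper names `cls_pool` and `mean_pool` are hypothetical and are not part of LlamaIndex.

```python
from typing import List


def cls_pool(token_vectors: List[List[float]]) -> List[float]:
    """CLS pooling: use the first token's vector as the sentence embedding."""
    return list(token_vectors[0])


def mean_pool(
    token_vectors: List[List[float]], attention_mask: List[int]
) -> List[float]:
    """Mean pooling: average the vectors of tokens whose mask value is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy hidden states: a [CLS] token, one word token, and one padding token.
hidden = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]  # the padding token is excluded from the mean

cls_vec = cls_pool(hidden)
mean_vec = mean_pool(hidden, mask)
print(cls_vec)
print(mean_vec)
```

The two modes generally produce different vectors for the same input, which is why the recommendation on the model card matters.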

You can refer to the embeddings leaderboard for more recommendations on embedding models.

This class depends on the transformers package, which you can install with pip install transformers.

NOTE: if you were previously using a HuggingFaceEmbeddings from LangChain, this should give equivalent results.

If you’re opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

%pip install llama-index-embeddings-huggingface
%pip install llama-index-embeddings-instructor
!pip install llama-index
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# loads BAAI/bge-small-en
# embed_model = HuggingFaceEmbedding()

# loads BAAI/bge-small-en-v1.5
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
embeddings = embed_model.get_text_embedding("Hello World!")
Hello World!
[-0.030880315229296684, -0.11021008342504501, 0.3917851448059082, -0.35962796211242676, 0.22797748446464539]
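The embedding comes back as a plain list of floats, so two embeddings can be compared with ordinary arithmetic. The sketch below shows cosine similarity on toy vectors; the `cosine_similarity` helper is hypothetical and written here for illustration, not taken from LlamaIndex.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for two embeddings.
vec_a = [1.0, 0.0]
vec_b = [1.0, 1.0]

sim_same = cosine_similarity(vec_a, vec_a)
sim_diff = cosine_similarity(vec_a, vec_b)
print(sim_same, sim_diff)
```

With real embeddings you would pass the lists returned by two get_text_embedding calls instead of the toy vectors.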


Instructor Embeddings are a class of embeddings specifically trained to augment their embeddings according to an instruction. By default, queries are given query_instruction="Represent the question for retrieving supporting documents: " and text is given text_instruction="Represent the document for retrieval: ".

They rely on the Instructor and SentenceTransformers pip packages, which you can install with pip install InstructorEmbedding and pip install -U sentence-transformers.
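As a rough illustration of how the default instructions described above are applied, the sketch below pairs each input with its instruction before encoding. It assumes the encoder consumes [instruction, text] pairs, as Instructor-style models typically do; the helper `pair_with_instruction` is hypothetical and only mirrors the idea, not the library's internals.

```python
from typing import List

# Default instruction strings quoted from the description above.
DEFAULT_QUERY_INSTRUCTION = (
    "Represent the question for retrieving supporting documents: "
)
DEFAULT_TEXT_INSTRUCTION = "Represent the document for retrieval: "


def pair_with_instruction(text: str, is_query: bool) -> List[str]:
    """Return the [instruction, text] pair for a query or a document."""
    instruction = (
        DEFAULT_QUERY_INSTRUCTION if is_query else DEFAULT_TEXT_INSTRUCTION
    )
    return [instruction, text]


query_input = pair_with_instruction("What is ONNX?", is_query=True)
doc_input = pair_with_instruction("Hello World!", is_query=False)
print(query_input)
print(doc_input)
```

Because queries and documents get different instructions, the same string can yield different embeddings depending on which role it plays.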

from llama_index.embeddings.instructor import InstructorEmbedding

embed_model = InstructorEmbedding(model_name="hkunlp/instructor-base")
load INSTRUCTOR_Transformer
max_seq_length  512
embeddings = embed_model.get_text_embedding("Hello World!")
[ 0.02155361 -0.06098218  0.01796207  0.05490903  0.01526906]


Optimum is a HuggingFace library for exporting and running HuggingFace models in the ONNX format.

You can install the dependencies with pip install transformers optimum[exporters].

First, we need to create the ONNX model. ONNX models provide improved inference speeds and can be used across platforms (e.g., in Transformers.js).

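from llama_index.embeddings.huggingface_optimum import OptimumEmbedding

# export the model to ONNX and save it to ./bge_onnx
OptimumEmbedding.create_and_save_optimum_model(
    "BAAI/bge-small-en-v1.5", "./bge_onnx"
)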
Framework not specified. Using pt to export to ONNX.
Using the export variant default. Available variants are:
	- default: The default ONNX variant.
Using framework PyTorch: 2.0.1+cu117
Overriding 1 configuration item(s)
	- use_cache -> False
============= Diagnostic Run torch.onnx.export version 2.0.1+cu117 =============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================

Saved optimum model to ./bge_onnx. Use it with `embed_model = OptimumEmbedding(folder_name='./bge_onnx')`.
embed_model = OptimumEmbedding(folder_name="./bge_onnx")
embeddings = embed_model.get_text_embedding("Hello World!")
[-0.10364960134029388, -0.20998482406139374, -0.01883639395236969, -0.5241696834564209, 0.0335749015212059]


Let’s compare the two embedding backends on a classic large document: chapter 3 of the IPCC climate report.

!curl --output IPCC_AR6_WGII_Chapter03.pdf
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import Settings

documents = SimpleDirectoryReader(
    input_files=["IPCC_AR6_WGII_Chapter03.pdf"]
).load_data()

Base HuggingFace Embeddings#

import os
import openai

# needed to synthesize responses later
openai.api_key = os.environ["OPENAI_API_KEY"]
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# loads BAAI/bge-small-en-v1.5
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
test_embeds = embed_model.get_text_embedding("Hello World!")

Settings.embed_model = embed_model
%%timeit -r 1 -n 1
index = VectorStoreIndex.from_documents(documents, show_progress=True)
1min 27s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
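The %%timeit magic is only available inside a notebook. Outside one, a single-run timing can be approximated with the standard library, as in this sketch; the `time_once` helper is hypothetical, and the workload here is a stand-in rather than the actual indexing call.

```python
import time
from typing import Any, Callable, Tuple


def time_once(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn once and return its result with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; replace with the call you want to measure.
result, elapsed = time_once(sum, range(1_000_000))
print(f"took {elapsed:.3f} s")
```

For the benchmark here, the same helper could wrap VectorStoreIndex.from_documents with documents and show_progress=True as arguments.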

Optimum Embeddings#

We can use the ONNX embeddings we created earlier.

from llama_index.embeddings.huggingface_optimum import OptimumEmbedding

embed_model = OptimumEmbedding(folder_name="./bge_onnx")
test_embeds = embed_model.get_text_embedding("Hello World!")

Settings.embed_model = embed_model
%%timeit -r 1 -n 1
index = VectorStoreIndex.from_documents(documents, show_progress=True)
1min 9s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
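To put the two timings reported in this section side by side (1min 27s for the base model, 1min 9s for the ONNX model), the arithmetic can be worked through as below. Each figure comes from a single run, so this is only a rough comparison; the helper `to_seconds` is hypothetical.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds timing to total seconds."""
    return minutes * 60 + seconds


base_s = to_seconds(1, 27)  # base HuggingFace embeddings: 87 s
onnx_s = to_seconds(1, 9)   # Optimum (ONNX) embeddings: 69 s

speedup = base_s / onnx_s
reduction_pct = 100 * (base_s - onnx_s) / base_s
print(f"speedup: {speedup:.2f}x, time saved: {reduction_pct:.1f}%")
```

On these numbers the ONNX run is roughly 1.26 times faster, which is about a 20.7% reduction in indexing time.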