Using OpenSearch as a vector index.

Elasticsearch only supports standard Lucene indices; the approximate k-NN vector search used here requires OpenSearch's k-NN plugin, so only OpenSearch is supported as a vector index.

Note on setup: we set up a local OpenSearch instance by following this doc: https://opensearch.org/docs/1.0/

If you run into SSL issues, try the following docker run command instead:

docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" -e "plugins.security.disabled=true" opensearchproject/opensearch:1.0.1

Reference: https://github.com/opensearch-project/OpenSearch/issues/1598
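
Before running the demo, it can help to confirm the cluster is reachable. A minimal sanity check (a sketch, assuming the `requests` package is installed and security is disabled as above):

import requests

# a healthy node answers the root endpoint with version info as JSON
resp = requests.get("http://localhost:9200")
print(resp.json()["version"]["distribution"])  # should print "opensearch"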

from os import getenv
from llama_index import SimpleDirectoryReader
from llama_index.vector_stores import OpensearchVectorStore, OpensearchVectorClient
from llama_index import GPTVectorStoreIndex, StorageContext
# http endpoint for your cluster (opensearch required for vector index usage)
endpoint = getenv("OPENSEARCH_ENDPOINT", "http://localhost:9200")
# index to demonstrate the VectorStore impl
idx = getenv("OPENSEARCH_INDEX", "gpt-index-demo")
# load some sample data
documents = SimpleDirectoryReader('../paul_graham_essay/data').load_data()
# OpensearchVectorClient stores text in this field by default
text_field = "content"
# OpensearchVectorClient stores embeddings in this field by default
embedding_field = "embedding"
# OpensearchVectorClient encapsulates logic for a
# single opensearch index with vector search enabled
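# note: the 1536 dimension below matches OpenAI's text-embedding-ada-002,
# llama_index's default embedding model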
client = OpensearchVectorClient(endpoint, idx, 1536, embedding_field=embedding_field, text_field=text_field)
# initialize vector store
vector_store = OpensearchVectorStore(client)
# initialize an index using our sample data and the vector store we just created
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = GPTVectorStoreIndex.from_documents(documents=documents, storage_context=storage_context)
# run query
query_engine = index.as_query_engine()
res = query_engine.query("What did the author do growing up?")
res.response
INFO:root:> [query] Total LLM token usage: 29628 tokens
INFO:root:> [query] Total embedding token usage: 8 tokens
'\n\nThe author grew up writing short stories, programming on an IBM 1401, and building a computer kit from Heathkit. They also wrote programs for a TRS-80, such as games, a program to predict model rocket flight, and a word processor. After years of nagging, they convinced their father to buy a TRS-80, and they wrote simple games, a program to predict how high their model rockets would fly, and a word processor that their father used to write at least one book. In college, they studied philosophy and AI, and wrote a book about Lisp hacking. They also took art classes and applied to art schools, and experimented with computer graphics and animation, exploring the use of algorithms to create art. Additionally, they experimented with machine learning algorithms, such as using neural networks to generate art, and exploring the use of numerical values to create art. They also took classes in fundamental subjects like drawing, color, and design, and applied to two art schools, RISD in the US, and the Accademia di Belli Arti in Florence. They were accepted to RISD, and while waiting to hear back from the Accademia, they learned Italian and took the entrance exam in Florence. They eventually graduated from RISD'
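
To inspect which chunks back a given answer, rather than just the synthesized response, the same index can hand out a retriever. A minimal sketch, assuming the `as_retriever` API that ships alongside `as_query_engine`:

# fetch the top-k matching chunks directly, without LLM synthesis
retriever = index.as_retriever(similarity_top_k=3)
nodes = retriever.retrieve("What did the author do growing up?")
for node_with_score in nodes:
    print(node_with_score.score, node_with_score.node.get_text()[:80])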

Use the reader to check out what `GPTVectorStoreIndex` just created in our index.

The reader works with Elasticsearch too, as it only uses basic search features.

# create a reader to check out the index used in the previous section
from llama_index.readers import ElasticsearchReader

rdr = ElasticsearchReader(endpoint, idx)
# set embedding_field optionally to read embedding data from the elasticsearch index
docs = rdr.load_data(text_field, embedding_field=embedding_field)
# docs have embeddings in them
print("embedding dimension:", len(docs[0].embedding))
# full document is stored in extra_info
print("all fields in index:", docs[0].extra_info.keys())
embedding dimension: 1536
all fields in index: dict_keys(['content', 'embedding'])
# we can check out how the text was chunked by the `GPTVectorStoreIndex`
print("total number of chunks created:", len(docs))
total number of chunks created: 10
# search index using standard elasticsearch query DSL
docs = rdr.load_data(text_field, {"query": {"match": {text_field: "Lisp"}}})
print("chunks that mention Lisp:", len(docs))
docs = rdr.load_data(text_field, {"query": {"match": {text_field: "Yahoo"}}})
print("chunks that mention Yahoo:", len(docs))
chunks that mention Lisp: 10
chunks that mention Yahoo: 8
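
When you are done experimenting, you can drop the demo index with the underlying `opensearch-py` client, which `OpensearchVectorClient` already depends on. A minimal cleanup sketch:

from opensearchpy import OpenSearch

# connect to the same endpoint and delete the demo index
os_client = OpenSearch(endpoint)
os_client.indices.delete(index=idx)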