DeepLake Vector Store

import os
import textwrap

from llama_index import VectorStoreIndex, SimpleDirectoryReader, Document
from llama_index.vector_stores import DeepLakeVectorStore

os.environ["OPENAI_API_KEY"] = "sk-********************************"
os.environ[
    "ACTIVELOOP_TOKEN"
] = "********************************"
!pip install deeplake

If you don't export the token in your environment, you can alternatively log in to Deep Lake with the deeplake CLI:

# !activeloop login -t <TOKEN> 
# load documents
documents = SimpleDirectoryReader('../paul_graham_essay/data').load_data()
print('Document ID:', documents[0].doc_id, 'Document Hash:', documents[0].doc_hash)
Document ID: 14935662-4884-4c57-ac2e-fa62da019665 Document Hash: 77ae91ab542f3abb308c4d7c77c9bc4c9ad0ccd63144802b7cbe7e1bb3a4094e
# dataset_path = "hub://adilkhan/paul_graham_essay" # if we comment this out and don't pass the path then GPTDeepLakeIndex will create dataset in memory
from llama_index.storage.storage_context import StorageContext


dataset_path = "paul_graham_essay"

# Create an index over the documents
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=True)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
Your Deep Lake dataset has been successfully created!
The dataset is private so make sure you are logged in!
This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/adilkhan/paul_graham_essay
hub://adilkhan/paul_graham_essay loaded successfully.
Evaluating ingest: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:21<00:00
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 17617 tokens
Dataset(path='hub://adilkhan/paul_graham_essay', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype     shape     dtype  compression
  -------   -------   -------   -------  ------- 
 embedding  generic  (6, 1536)   None     None   
    ids      text     (6, 1)      str     None   
 metadata    json     (6, 1)      str     None   
   text      text     (6, 1)      str     None   
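
Because the dataset persists (locally or in Activeloop storage), you can reconnect to it in a later session without re-embedding the documents. A minimal sketch, assuming your llama_index version exposes VectorStoreIndex.from_vector_store; passing overwrite=False attaches to the existing dataset instead of recreating it:

# Reconnect to the existing dataset without re-ingesting the documents
# (assumes VectorStoreIndex.from_vector_store is available in your llama_index version)
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=False)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
query_engine = index.as_query_engine()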

If we don't pass a dataset_path, DeepLakeVectorStore creates a local dataset named llama_index:

# Create an index over the documents
# vector_store = DeepLakeVectorStore(overwrite=True)
# storage_context = StorageContext.from_defaults(vector_store=vector_store)
# index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
llama_index loaded successfully.
Evaluating ingest: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:04<00:00
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 17617 tokens
Dataset(path='llama_index', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype     shape     dtype  compression
  -------   -------   -------   -------  ------- 
 embedding  generic  (6, 1536)   None     None   
    ids      text     (6, 1)      str     None   
 metadata    json     (6, 1)      str     None   
   text      text     (6, 1)      str     None   
query_engine = index.as_query_engine()
response = query_engine.query("What did the author learn?",)
INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 4028 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 6 tokens
print(textwrap.fill(str(response), 100))
  The author learned that working on things that are not prestigious can be a good thing, as it can
lead to discovering something real and avoiding the wrong track. The author also learned that
ignorance can be beneficial, as it can lead to discovering something new and unexpected. The author
also learned the importance of working hard, even at the parts of the job they don't like, in order
to set an example for others. The author also learned the value of unsolicited advice, as it can be
beneficial in unexpected ways, such as when Robert Morris suggested that the author should make sure
Y Combinator wasn't the last cool thing they did.
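
You can also inspect which chunks were retrieved from Deep Lake to ground the answer. A minimal sketch using standard llama_index response fields (not Deep Lake specific); the similarity_top_k tweak in the comment is optional and not something the walkthrough above relies on:

# Inspect the retrieved source nodes and their similarity scores
for node_with_score in response.source_nodes:
    print(node_with_score.score, node_with_score.node.get_text()[:100])

# Retrieval depth can be tuned when building the query engine, e.g.:
# query_engine = index.as_query_engine(similarity_top_k=3)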
response = query_engine.query("What was a hard moment for the author?")
INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 4072 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 9 tokens
print(textwrap.fill(str(response), 100))
 A hard moment for the author was when he was dealing with urgent problems during YC and about 60%
of them had to do with Hacker News, a news aggregator he had created. He was overwhelmed by the
amount of work he had to do to keep Hacker News running, and it was taking away from his ability to
focus on other projects. He was also haunted by the idea that his own work ethic set the upper bound
for how hard everyone else worked, so he felt he had to work very hard. He was also dealing with
disputes between cofounders, figuring out when people were lying to them, and fighting with people
who maltreated the startups. On top of this, he was given unsolicited advice from Robert Morris to
make sure Y Combinator wasn't the last cool thing he did, which made him consider quitting.

Deleting items from the database

import deeplake as dp

# Load the underlying Deep Lake dataset to look up the ids of the stored rows
ds = dp.load("paul_graham_essay")

# Take the id of the first stored row so we can delete it from the index
idx = ds.ids[0].numpy().tolist()
This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/adilkhan/paul_graham_essay
hub://adilkhan/paul_graham_essay loaded successfully.
 
index.delete(idx[0])
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 6/6 [00:00<00:00, 4501.13it/s]
 
Dataset(path='hub://adilkhan/paul_graham_essay', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype     shape     dtype  compression
  -------   -------   -------   -------  ------- 
 embedding  generic  (5, 1536)   None     None   
    ids      text     (5, 1)      str     None   
 metadata    json     (5, 1)      str     None   
   text      text     (5, 1)      str     None
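
As a sanity check (not required), you can reload the dataset with the Deep Lake API and confirm that one row was removed:

# Reload the dataset and verify the row count dropped from 6 to 5
ds = dp.load("paul_graham_essay")
print(len(ds))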