Data Connectors

NOTE: Our data connectors are now offered through LlamaHub 🦙. LlamaHub is an open-source repository containing data loaders that you can easily plug and play into any LlamaIndex application.

The following data connectors are still available in the core repo.

Data Connectors for LlamaIndex.

This module contains the data connectors for LlamaIndex. Each connector inherits from a BaseReader class, connects to a data source, and loads Document objects from that data source.

You may also choose to construct Document objects manually, for instance in our Insert How-To Guide. See below for the API definition of a Document - the bare minimum is a text property.

class llama_index.readers.BagelReader(collection_name: str)

Reader for Bagel collections.

create_documents(results: Any) Any

Create documents from the results.

Parameters

results – Results from the query.

Returns

List of documents.

load_data(query_vector: Optional[Union[Sequence[float], Sequence[int], List[Union[Sequence[float], Sequence[int]]]]] = None, query_texts: Optional[Union[str, List[str]]] = None, limit: int = 10, where: Optional[Dict[Union[str, Literal['$and'], Literal['$or']], Union[str, int, float, Dict[Union[Literal['$gt'], Literal['$gte'], Literal['$lt'], Literal['$lte'], Literal['$ne'], Literal['$eq'], Literal['$and'], Literal['$or']], Union[str, int, float]], List[Dict[Union[str, Literal['$and'], Literal['$or']], Union[str, int, float, Dict[Union[Literal['$gt'], Literal['$gte'], Literal['$lt'], Literal['$lte'], Literal['$ne'], Literal['$eq'], Literal['$and'], Literal['$or']], Union[str, int, float]], List[Where]]]]]]] = None, where_document: Optional[Dict[Union[Literal['$contains'], Literal['$and'], Literal['$or']], Union[str, List[Dict[Union[Literal['$contains'], Literal['$and'], Literal['$or']], Union[str, List[WhereDocument]]]]]]] = None, include: List[Union[Literal['documents'], Literal['embeddings'], Literal['metadatas'], Literal['distances']]] = ['metadatas', 'documents', 'embeddings', 'distances']) Any

Get the top documents (up to limit) for the provided query_vector or query_texts.

Parameters
  • query_vector – The embedding(s) to get the closest neighbors of. Optional.

  • query_texts – The document texts to get the closest neighbors of. Optional.

  • limit – The number of neighbors to return for each query. Optional.

  • where – A Where type dict used to filter results by. Optional.

  • where_document – A WhereDocument type dict used to filter. Optional.

  • include – A list of what to include in the results. Optional.

Returns

LlamaIndex Document(s) with the closest embeddings to the provided query_vector or query_texts.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

pydantic model llama_index.readers.BeautifulSoupWebReader

BeautifulSoup web page reader.

Reads pages from the web. Requires the bs4 and urllib packages.

Parameters

website_extractor (Optional[Dict[str, Callable]]) – A mapping of website hostname (e.g. google.com) to a function that specifies how to extract text from the BeautifulSoup obj. See DEFAULT_WEBSITE_EXTRACTOR.

Show JSON schema
{
   "title": "BeautifulSoupWebReader",
   "description": "BeautifulSoup web page reader.\n\nReads pages from the web.\nRequires the `bs4` and `urllib` packages.\n\nArgs:\n    website_extractor (Optional[Dict[str, Callable]]): A mapping of website\n        hostname (e.g. google.com) to a function that specifies how to\n        extract text from the BeautifulSoup obj. See DEFAULT_WEBSITE_EXTRACTOR.",
   "type": "object",
   "properties": {
      "is_remote": {
         "title": "Is Remote",
         "default": true,
         "type": "boolean"
      }
   }
}

Config
  • arbitrary_types_allowed: bool = True

Fields
  • is_remote (bool)

field is_remote: bool = True
classmethod class_name() str

Get the name identifier of the class.

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = 'allow' was set, since it adds all passed values.

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from new model, as with values this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

classmethod from_dict(data: Dict[str, Any], **kwargs: Any) Self
classmethod from_json(data_str: str, **kwargs: Any) Self
classmethod from_orm(obj: Any) Model
json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

load_data(urls: List[str], custom_hostname: Optional[str] = None) List[Document]

Load data from the urls.

Parameters
  • urls (List[str]) – List of URLs to scrape.

  • custom_hostname (Optional[str]) – Force a certain hostname in case a website is displayed under a custom URL (e.g. Substack blogs)

Returns

List of documents.

Return type

List[Document]

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
to_dict(**kwargs: Any) Dict[str, Any]
to_json(**kwargs: Any) str
classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
class llama_index.readers.ChatGPTRetrievalPluginReader(endpoint_url: str, bearer_token: Optional[str] = None, retries: Optional[Retry] = None, batch_size: int = 100)

ChatGPT Retrieval Plugin reader.

load_data(query: str, top_k: int = 10, separate_documents: bool = True, **kwargs: Any) List[Document]

Load data from ChatGPT Retrieval Plugin.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.ChromaReader(collection_name: str, persist_directory: Optional[str] = None, chroma_api_impl: str = 'rest', chroma_db_impl: Optional[str] = None, host: str = 'localhost', port: int = 8000)

Chroma reader.

Retrieve documents from existing persisted Chroma collections.

Parameters
  • collection_name – Name of the persisted collection.

  • persist_directory – Directory where the collection is persisted.

create_documents(results: Any) List[Document]

Create documents from the results.

Parameters

results – Results from the query.

Returns

List of documents.

load_data(query_embedding: Optional[List[float]] = None, limit: int = 10, where: Optional[dict] = None, where_document: Optional[dict] = None, query: Optional[Union[str, List[str]]] = None) Any

Load data from the collection.

Parameters
  • limit – Number of results to return.

  • where – Filter results by metadata. {"metadata_field": "is_equal_to_this"}

  • where_document – Filter results by document. {"$contains": "search_string"}

Returns

List of documents.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.DeepLakeReader(token: Optional[str] = None)

DeepLake reader.

Retrieve documents from existing DeepLake datasets.

Parameters

dataset_name – Name of the DeepLake dataset.

load_data(query_vector: List[float], dataset_path: str, limit: int = 4, distance_metric: str = 'l2') List[Document]

Load data from DeepLake.

Parameters
  • dataset_path (str) – Path of the DeepLake dataset.

  • query_vector (List[float]) – Query vector.

  • limit (int) – Number of results to return.

Returns

A list of documents.

Return type

List[Document]

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

pydantic model llama_index.readers.DiscordReader

Discord reader.

Reads conversations from channels.

Parameters

discord_token (Optional[str]) – Discord token. If not provided, we assume the environment variable DISCORD_TOKEN is set.

Show JSON schema
{
   "title": "DiscordReader",
   "description": "Discord reader.\n\nReads conversations from channels.\n\nArgs:\n    discord_token (Optional[str]): Discord token. If not provided, we\n        assume the environment variable `DISCORD_TOKEN` is set.",
   "type": "object",
   "properties": {
      "is_remote": {
         "title": "Is Remote",
         "default": true,
         "type": "boolean"
      },
      "discord_token": {
         "title": "Discord Token",
         "type": "string"
      }
   },
   "required": [
      "discord_token"
   ]
}

Config
  • arbitrary_types_allowed: bool = True

Fields
  • discord_token (str)

  • is_remote (bool)

field discord_token: str [Required]
field is_remote: bool = True
classmethod class_name() str

Get the name identifier of the class.

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = 'allow' was set, since it adds all passed values.

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from new model, as with values this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

classmethod from_dict(data: Dict[str, Any], **kwargs: Any) Self
classmethod from_json(data_str: str, **kwargs: Any) Self
classmethod from_orm(obj: Any) Model
json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

load_data(channel_ids: List[int], limit: Optional[int] = None, oldest_first: bool = True) List[Document]

Load data from the given Discord channels.

Parameters
  • channel_ids (List[int]) – List of channel ids to read.

  • limit (Optional[int]) – Maximum number of messages to read.

  • oldest_first (bool) – Whether to read oldest messages first. Defaults to True.

Returns

List of documents.

Return type

List[Document]

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
to_dict(**kwargs: Any) Dict[str, Any]
to_json(**kwargs: Any) str
classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
pydantic model llama_index.readers.Document

Generic interface for a data document.

This document connects to data sources.

Show JSON schema
{
   "title": "Document",
   "description": "Generic interface for a data document.\n\nThis document connects to data sources.",
   "type": "object",
   "properties": {
      "doc_id": {
         "title": "Doc Id",
         "description": "Unique ID of the node.",
         "type": "string"
      },
      "embedding": {
         "title": "Embedding",
         "description": "Embedding of the node.",
         "type": "array",
         "items": {
            "type": "number"
         }
      },
      "extra_info": {
         "title": "Extra Info",
         "description": "A flat dictionary of metadata fields",
         "type": "object"
      },
      "excluded_embed_metadata_keys": {
         "title": "Excluded Embed Metadata Keys",
         "description": "Metadata keys that are excluded from text for the embed model.",
         "type": "array",
         "items": {
            "type": "string"
         }
      },
      "excluded_llm_metadata_keys": {
         "title": "Excluded Llm Metadata Keys",
         "description": "Metadata keys that are excluded from text for the LLM.",
         "type": "array",
         "items": {
            "type": "string"
         }
      },
      "relationships": {
         "title": "Relationships",
         "description": "A mapping of relationships to other node information.",
         "type": "object",
         "additionalProperties": {
            "anyOf": [
               {
                  "$ref": "#/definitions/RelatedNodeInfo"
               },
               {
                  "type": "array",
                  "items": {
                     "$ref": "#/definitions/RelatedNodeInfo"
                  }
               }
            ]
         }
      },
      "hash": {
         "title": "Hash",
         "description": "Hash of the node content.",
         "default": "",
         "type": "string"
      },
      "text": {
         "title": "Text",
         "description": "Text content of the node.",
         "default": "",
         "type": "string"
      },
      "start_char_idx": {
         "title": "Start Char Idx",
         "description": "Start char index of the node.",
         "type": "integer"
      },
      "end_char_idx": {
         "title": "End Char Idx",
         "description": "End char index of the node.",
         "type": "integer"
      },
      "text_template": {
         "title": "Text Template",
         "description": "Template for how text is formatted, with {content} and {metadata_str} placeholders.",
         "default": "{metadata_str}\n\n{content}",
         "type": "string"
      },
      "metadata_template": {
         "title": "Metadata Template",
         "description": "Template for how metadata is formatted, with {key} and {value} placeholders.",
         "default": "{key}: {value}",
         "type": "string"
      },
      "metadata_seperator": {
         "title": "Metadata Seperator",
         "description": "Separator between metadata fields when converting to string.",
         "default": "\n",
         "type": "string"
      }
   },
   "definitions": {
      "ObjectType": {
         "title": "ObjectType",
         "description": "An enumeration.",
         "enum": [
            "1",
            "2",
            "3",
            "4"
         ],
         "type": "string"
      },
      "RelatedNodeInfo": {
         "title": "RelatedNodeInfo",
         "description": "Base component object to capture class names.",
         "type": "object",
         "properties": {
            "node_id": {
               "title": "Node Id",
               "type": "string"
            },
            "node_type": {
               "$ref": "#/definitions/ObjectType"
            },
            "metadata": {
               "title": "Metadata",
               "type": "object"
            },
            "hash": {
               "title": "Hash",
               "type": "string"
            }
         },
         "required": [
            "node_id"
         ]
      }
   }
}

Config
  • allow_population_by_field_name: bool = True

Fields
field embedding: Optional[List[float]] = None

Embedding of the node.

Validated by
  • _check_hash

field end_char_idx: Optional[int] = None

End char index of the node.

Validated by
  • _check_hash

field excluded_embed_metadata_keys: List[str] [Optional]

Metadata keys that are excluded from text for the embed model.

Validated by
  • _check_hash

field excluded_llm_metadata_keys: List[str] [Optional]

Metadata keys that are excluded from text for the LLM.

Validated by
  • _check_hash

field hash: str = ''

Hash of the node content.

Validated by
  • _check_hash

field id_: str [Optional] (alias 'doc_id')

Unique ID of the node.

Validated by
  • _check_hash

field metadata: Dict[str, Any] [Optional] (alias 'extra_info')

A flat dictionary of metadata fields

Validated by
  • _check_hash

field metadata_seperator: str = '\n'

Separator between metadata fields when converting to string.

Validated by
  • _check_hash

field metadata_template: str = '{key}: {value}'

Template for how metadata is formatted, with {key} and {value} placeholders.

Validated by
  • _check_hash

field relationships: Dict[NodeRelationship, Union[RelatedNodeInfo, List[RelatedNodeInfo]]] [Optional]

A mapping of relationships to other node information.

Validated by
  • _check_hash

field start_char_idx: Optional[int] = None

Start char index of the node.

Validated by
  • _check_hash

field text: str = ''

Text content of the node.

Validated by
  • _check_hash

field text_template: str = '{metadata_str}\n\n{content}'

Template for how text is formatted, with {content} and {metadata_str} placeholders.

Validated by
  • _check_hash

as_related_node_info() RelatedNodeInfo

Get node as RelatedNodeInfo.

classmethod class_name() str

Get class name.

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = 'allow' was set, since it adds all passed values.

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from new model, as with values this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

classmethod example() Document
classmethod from_dict(data: Dict[str, Any], **kwargs: Any) Self
classmethod from_json(data_str: str, **kwargs: Any) Self
classmethod from_langchain_format(doc: Document) Document

Convert struct from LangChain document format.

classmethod from_orm(obj: Any) Model
get_content(metadata_mode: MetadataMode = MetadataMode.NONE) str

Get object content.

get_doc_id() str

Deprecated: get document ID (use the doc_id property instead).

get_embedding() List[float]

Get embedding.

Errors if embedding is None.

get_metadata_str(mode: MetadataMode = MetadataMode.ALL) str

metadata info string.

get_node_info() Dict[str, Any]

Get node info.

get_text() str
classmethod get_type() str

Get Document type.

json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
set_content(value: str) None

Set the content of the node.

to_dict(**kwargs: Any) Dict[str, Any]
to_json(**kwargs: Any) str
to_langchain_format() Document

Convert struct to LangChain document format.

classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
property child_nodes: Optional[List[RelatedNodeInfo]]

Child nodes.

property doc_id: str

Get document ID.

property extra_info: Dict[str, Any]

Extra info. (Deprecated: use metadata instead.)

property next_node: Optional[RelatedNodeInfo]

Next node.

property node_id: str
property node_info: Dict[str, Any]

Get node info. (Deprecated.)

property parent_node: Optional[RelatedNodeInfo]

Parent node.

property prev_node: Optional[RelatedNodeInfo]

Prev node.

property ref_doc_id: Optional[str]

Get ref doc id. (Deprecated.)

property source_node: Optional[RelatedNodeInfo]

Source object node.

Extracted from the relationships field.

pydantic model llama_index.readers.ElasticsearchReader

Read documents from an Elasticsearch/Opensearch index.

These documents can then be used in a downstream Llama Index data structure.

Parameters
  • endpoint (str) – URL (http/https) of cluster

  • index (str) – Name of the index (required)

  • httpx_client_args (dict) – Optional additional args to pass to the httpx.Client

Show JSON schema
{
   "title": "ElasticsearchReader",
   "description": "Read documents from an Elasticsearch/Opensearch index.\n\nThese documents can then be used in a downstream Llama Index data structure.\n\nArgs:\n    endpoint (str): URL (http/https) of cluster\n    index (str): Name of the index (required)\n    httpx_client_args (dict): Optional additional args to pass to the `httpx.Client`",
   "type": "object",
   "properties": {
      "is_remote": {
         "title": "Is Remote",
         "default": true,
         "type": "boolean"
      },
      "endpoint": {
         "title": "Endpoint",
         "type": "string"
      },
      "index": {
         "title": "Index",
         "type": "string"
      },
      "httpx_client_args": {
         "title": "Httpx Client Args",
         "type": "object"
      }
   },
   "required": [
      "endpoint",
      "index"
   ]
}

Config
  • arbitrary_types_allowed: bool = True

Fields
  • endpoint (str)

  • httpx_client_args (Optional[dict])

  • index (str)

  • is_remote (bool)

field endpoint: str [Required]
field httpx_client_args: Optional[dict] = None
field index: str [Required]
field is_remote: bool = True
classmethod class_name() str

Get the name identifier of the class.

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = 'allow' was set, since it adds all passed values.

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from new model, as with values this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

classmethod from_dict(data: Dict[str, Any], **kwargs: Any) Self
classmethod from_json(data_str: str, **kwargs: Any) Self
classmethod from_orm(obj: Any) Model
json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

load_data(field: str, query: Optional[dict] = None, embedding_field: Optional[str] = None) List[Document]

Read data from the Elasticsearch index.

Parameters
  • field (str) – Field in the document to retrieve text from

  • query (Optional[dict]) – Elasticsearch JSON query DSL object. For example: {β€œquery”: {β€œmatch”: {β€œmessage”: {β€œquery”: β€œthis is a test”}}}}

  • embedding_field (Optional[str]) – If there are embeddings stored in this index, this field can be used to set the embedding field on the returned Document list.

Returns

A list of documents.

Return type

List[Document]

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
to_dict(**kwargs: Any) Dict[str, Any]
to_json(**kwargs: Any) str
classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
class llama_index.readers.FaissReader(index: Any)

Faiss reader.

Retrieves documents through an existing in-memory Faiss index. These documents can then be used in a downstream LlamaIndex data structure. If you wish to use Faiss itself as an index to organize documents, insert documents, and perform queries on them, please use VectorStoreIndex with FaissVectorStore.

Parameters

faiss_index (faiss.Index) – A Faiss Index object (required)

load_data(query: ndarray, id_to_text_map: Dict[str, str], k: int = 4, separate_documents: bool = True) List[Document]

Load data from Faiss.

Parameters
  • query (np.ndarray) – A 2D numpy array of query vectors.

  • id_to_text_map (Dict[str, str]) – A map from IDs to text.

  • k (int) – Number of nearest neighbors to retrieve. Defaults to 4.

  • separate_documents (Optional[bool]) – Whether to return separate documents. Defaults to True.

Returns

A list of documents.

Return type

List[Document]

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.GithubRepositoryReader(owner: str, repo: str, use_parser: bool = True, verbose: bool = False, github_token: Optional[str] = None, concurrent_requests: int = 5, ignore_file_extensions: Optional[List[str]] = None, ignore_directories: Optional[List[str]] = None)

Github repository reader.

Retrieves the contents of a Github repository and returns a list of documents. The documents are either the contents of the files in the repository or the text extracted from the files using the parser.

Examples

>>> reader = GithubRepositoryReader("owner", "repo")
>>> branch_documents = reader.load_data(branch="branch")
>>> commit_documents = reader.load_data(commit_sha="commit_sha")
load_data(commit_sha: Optional[str] = None, branch: Optional[str] = None) List[Document]

Load data from a commit or a branch.

Loads github repository data from a specific commit sha or a branch.

Parameters
  • commit – commit sha

  • branch – branch name

Returns

list of documents

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

pydantic model llama_index.readers.GoogleDocsReader

Google Docs reader.

Reads a page from Google Docs.

Show JSON schema
{
   "title": "GoogleDocsReader",
   "description": "Google Docs reader.\n\nReads a page from Google Docs",
   "type": "object",
   "properties": {
      "is_remote": {
         "title": "Is Remote",
         "default": true,
         "type": "boolean"
      }
   }
}

Config
  • arbitrary_types_allowed: bool = True

Fields
  • is_remote (bool)

field is_remote: bool = True
classmethod class_name() str

Get the name identifier of the class.

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = β€˜allow’ was set since it adds all passed values

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from new model, as with values this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

classmethod from_dict(data: Dict[str, Any], **kwargs: Any) Self
classmethod from_json(data_str: str, **kwargs: Any) Self
classmethod from_orm(obj: Any) Model
json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

load_data(document_ids: List[str]) List[Document]

Load data from the input directory.

Parameters

document_ids (List[str]) – a list of document ids.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
to_dict(**kwargs: Any) Dict[str, Any]
to_json(**kwargs: Any) str
classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
class llama_index.readers.HTMLTagReader(tag: str = 'section', ignore_no_id: bool = False)

Read HTML files and extract text from a specific tag with BeautifulSoup.

By default, reads the text from the <section> tag.

load_data(file: Path, extra_info: Optional[Dict] = None) List[Document]

Load data from the input directory.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.
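The reader itself relies on BeautifulSoup; a standard-library sketch of the same idea, collecting the text inside each target tag (here <section>), looks like this:

```python
# Stdlib sketch of extracting per-tag text, similar in spirit to
# HTMLTagReader (which uses BeautifulSoup).
from html.parser import HTMLParser

class TagTextExtractor(HTMLParser):
    def __init__(self, tag="section"):
        super().__init__()
        self.tag = tag
        self.depth = 0          # >0 while inside the target tag
        self.sections = []      # collected text, one entry per tag occurrence

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            if self.depth == 0:
                self.sections.append("")
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.sections[-1] += data

parser = TagTextExtractor()
parser.feed("<body><section>Hello <b>world</b></section><p>skip</p></body>")
print(parser.sections)  # ['Hello world']
```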

class llama_index.readers.JSONReader(levels_back: Optional[int] = None, collapse_length: Optional[int] = None, ensure_ascii: bool = False)

JSON reader.

Reads JSON documents with options to help suss out relationships between nodes.

Parameters
  • levels_back (int) – the number of levels to go back in the JSON tree, 0 if you want all levels. If levels_back is None, then we just format the JSON and make each line an embedding

  • collapse_length (int) – the maximum number of characters a JSON fragment may occupy to be collapsed onto a single line in the output (levels_back must not be None). For example, if collapse_length = 10 and the input is {a: [1, 2, 3], b: {β€œhello”: β€œworld”, β€œfoo”: β€œbar”}}, then a would be collapsed onto one line while b would not. We recommend starting around 100 and adjusting from there.

load_data(input_file: str) List[Document]

Load data from the input file.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.
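A rough sketch of the levels_back idea (a hypothetical helper, not the actual JSONReader implementation, whose exact output format may differ): flatten the JSON tree into one line per leaf value, keeping only the last levels_back keys of each path, with 0 keeping all levels.

```python
import json

def flatten(obj, path=(), levels_back=0):
    """Flatten a JSON object into 'path value' lines."""
    lines = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines += flatten(value, path + (key,), levels_back)
    elif isinstance(obj, list):
        for item in obj:
            lines += flatten(item, path, levels_back)
    else:
        # Keep only the trailing `levels_back` keys of the path (0 = all).
        keep = path if levels_back == 0 else path[-levels_back:]
        lines.append(" ".join([*keep, json.dumps(obj)]))
    return lines

data = {"a": [1, 2], "b": {"hello": "world"}}
print(flatten(data, levels_back=1))
# ['a 1', 'a 2', 'hello "world"']
```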

class llama_index.readers.MakeWrapper

Make reader.

load_data(*args: Any, **load_kwargs: Any) List[Document]

Load data from the input directory.

NOTE: This is not implemented.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

pass_response_to_webhook(webhook_url: str, response: Response, query: Optional[str] = None) None

Pass response object to webhook.

Parameters
  • webhook_url (str) – Webhook URL.

  • response (Response) – Response object.

  • query (Optional[str]) – Query. Defaults to None.

class llama_index.readers.MboxReader

Mbox e-mail reader.

Reads a set of e-mails saved in the mbox format.

load_data(input_dir: str, **load_kwargs: Any) List[Document]

Load data from the input directory.

load_kwargs:

max_count (int): Maximum number of messages to read.

message_format (str): Message format, overriding the default.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.
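The mbox format this reader parses can be inspected with the standard-library mailbox module. The following sketch writes a one-message mbox file and reads it back with a max_count-style limit (illustrative only, not the MboxReader implementation):

```python
import mailbox
import os
import tempfile

# Write a minimal single-message mbox file.
path = os.path.join(tempfile.mkdtemp(), "sample.mbox")
with open(path, "w") as f:
    f.write("From alice@example.com Thu Jan  1 00:00:00 2023\n"
            "From: alice@example.com\n"
            "Subject: Hello\n"
            "\n"
            "Body text\n")

# Read it back, stopping after max_count messages.
max_count = 10
messages = []
box = mailbox.mbox(path)
for i, msg in enumerate(box):
    if i >= max_count:
        break
    messages.append((msg["Subject"], msg.get_payload().strip()))
box.close()
print(messages)  # [('Hello', 'Body text')]
```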

class llama_index.readers.MetalReader(api_key: str, client_id: str, index_id: str)

Metal reader.

Parameters
  • api_key (str) – Metal API key.

  • client_id (str) – Metal client ID.

  • index_id (str) – Metal index ID.

load_data(limit: int, query_embedding: Optional[List[float]] = None, filters: Optional[Dict[str, Any]] = None, separate_documents: bool = True, **query_kwargs: Any) List[Document]

Load data from Metal.

Parameters
  • query_embedding (Optional[List[float]]) – Query embedding for search.

  • limit (int) – Number of results to return.

  • filters (Optional[Dict[str, Any]]) – Filters to apply to the search.

  • separate_documents (Optional[bool]) – Whether to return separate documents per retrieved entry. Defaults to True.

  • **query_kwargs – Keyword arguments to pass to the search.

Returns

A list of documents.

Return type

List[Document]

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.MilvusReader(host: str = 'localhost', port: int = 19530, user: str = '', password: str = '', use_secure: bool = False)

Milvus reader.

load_data(query_vector: List[float], collection_name: str, expr: Any = None, search_params: Optional[dict] = None, limit: int = 10) List[Document]

Load data from Milvus.

Parameters
  • collection_name (str) – Name of the Milvus collection.

  • query_vector (List[float]) – Query vector.

  • limit (int) – Number of results to return.

Returns

A list of documents.

Return type

List[Document]

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.MyScaleReader(myscale_host: str, username: str, password: str, myscale_port: Optional[int] = 8443, database: str = 'default', table: str = 'llama_index', index_type: str = 'IVFLAT', metric: str = 'cosine', batch_size: int = 32, index_params: Optional[dict] = None, search_params: Optional[dict] = None, **kwargs: Any)

MyScale reader.

Parameters
  • myscale_host (str) – A URL to connect to the MyScale backend.

  • username (str) – Username to log in with.

  • password (str) – Password to log in with.

  • myscale_port (int) – URL port to connect with HTTP. Defaults to 8443.

  • database (str) – Database name to find the table. Defaults to β€˜default’.

  • table (str) – Table name to operate on. Defaults to β€˜llama_index’.

  • index_type (str) – Index type string. Defaults to β€œIVFLAT”.

  • metric (str) – Metric to compute distance; supported values are β€˜l2’, β€˜cosine’, and β€˜ip’. Defaults to β€˜cosine’.

  • batch_size (int, optional) – The number of documents to insert per batch. Defaults to 32.

  • index_params (dict, optional) – The index parameters for MyScale. Defaults to None.

  • search_params (dict, optional) – The search parameters for a MyScale query. Defaults to None.

load_data(query_vector: List[float], where_str: Optional[str] = None, limit: int = 10) List[Document]

Load data from MyScale.

Parameters
  • query_vector (List[float]) – Query vector.

  • where_str (Optional[str], optional) – where condition string. Defaults to None.

  • limit (int) – Number of results to return.

Returns

A list of documents.

Return type

List[Document]

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

pydantic model llama_index.readers.NotionPageReader

Notion Page reader.

Reads a set of Notion pages.

Parameters

integration_token (str) – Notion integration token.

Show JSON schema
{
   "title": "NotionPageReader",
   "description": "Notion Page reader.\n\nReads a set of Notion pages.\n\nArgs:\n    integration_token (str): Notion integration token.",
   "type": "object",
   "properties": {
      "is_remote": {
         "title": "Is Remote",
         "default": true,
         "type": "boolean"
      },
      "integration_token": {
         "title": "Integration Token",
         "type": "string"
      },
      "headers": {
         "title": "Headers",
         "type": "object",
         "additionalProperties": {
            "type": "string"
         }
      }
   },
   "required": [
      "integration_token",
      "headers"
   ]
}

Config
  • arbitrary_types_allowed: bool = True

Fields
  • headers (Dict[str, str])

  • integration_token (str)

  • is_remote (bool)

field headers: Dict[str, str] [Required]
field integration_token: str [Required]
field is_remote: bool = True
classmethod class_name() str

Get the name identifier of the class.

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = β€˜allow’ was set since it adds all passed values

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from new model, as with values this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

classmethod from_dict(data: Dict[str, Any], **kwargs: Any) Self
classmethod from_json(data_str: str, **kwargs: Any) Self
classmethod from_orm(obj: Any) Model
json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

load_data(page_ids: List[str] = [], database_id: Optional[str] = None) List[Document]

Load data from the input directory.

Parameters

page_ids (List[str]) – List of page ids to load.

Returns

List of documents.

Return type

List[Document]

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
query_database(database_id: str, query_dict: Dict[str, Any] = {}) List[str]

Get all the pages from a Notion database.

read_page(page_id: str) str

Read a page.

classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
search(query: str) List[str]

Search Notion page given a text query.

to_dict(**kwargs: Any) Dict[str, Any]
to_json(**kwargs: Any) str
classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
class llama_index.readers.ObsidianReader(input_dir: str)

Utilities for loading data from an Obsidian Vault.

Parameters

input_dir (str) – Path to the vault.

load_data(*args: Any, **load_kwargs: Any) List[Document]

Load data from the input directory.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.PineconeReader(api_key: str, environment: str)

Pinecone reader.

Parameters
  • api_key (str) – Pinecone API key.

  • environment (str) – Pinecone environment.

load_data(index_name: str, id_to_text_map: Dict[str, str], vector: Optional[List[float]], top_k: int, separate_documents: bool = True, include_values: bool = True, **query_kwargs: Any) List[Document]

Load data from Pinecone.

Parameters
  • index_name (str) – Name of the index.

  • id_to_text_map (Dict[str, str]) – A map from IDs to text.

  • separate_documents (Optional[bool]) – Whether to return separate documents per retrieved entry. Defaults to True.

  • vector (List[float]) – Query vector.

  • top_k (int) – Number of results to return.

  • include_values (bool) – Whether to include the embedding in the response. Defaults to True.

  • **query_kwargs – Keyword arguments to pass to the query. Arguments are the exact same as those found in Pinecone’s reference documentation for the query method.

Returns

A list of documents.

Return type

List[Document]

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.PsychicReader(psychic_key: Optional[str] = None)

Psychic reader.

Psychic is a platform that allows syncing data from many SaaS apps through one universal API.

This reader connects to an instance of Psychic and reads data from it, given a connector ID, account ID, and API key.

Learn more at docs.psychic.dev.

Parameters

psychic_key (str) – Secret key for Psychic. Get one at https://dashboard.psychic.dev/api-keys.

load_data(connector_id: Optional[str] = None, account_id: Optional[str] = None) List[Document]

Load data from a Psychic connection.

Parameters
  • connector_id (str) – The connector ID to connect to

  • account_id (str) – The account ID to connect to

Returns

List of documents.

Return type

List[Document]

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.QdrantReader(location: Optional[str] = None, url: Optional[str] = None, port: Optional[int] = 6333, grpc_port: int = 6334, prefer_grpc: bool = False, https: Optional[bool] = None, api_key: Optional[str] = None, prefix: Optional[str] = None, timeout: Optional[float] = None, host: Optional[str] = None, path: Optional[str] = None)

Qdrant reader.

Retrieve documents from existing Qdrant collections.

Parameters
  • location – If :memory: - use in-memory Qdrant instance. If str - use it as a url parameter. If None - use default values for host and port.

  • url – either host or str of β€œOptional[scheme], host, Optional[port], Optional[prefix]”. Default: None

  • port – Port of the REST API interface. Default: 6333

  • grpc_port – Port of the gRPC interface. Default: 6334

  • prefer_grpc – If true - use gRPC interface whenever possible in custom methods.

  • https – If true - use HTTPS(SSL) protocol. Default: false

  • api_key – API key for authentication in Qdrant Cloud. Default: None

  • prefix – If not None - add prefix to the REST URL path. Example: service/v1 will result in http://localhost:6333/service/v1/{qdrant-endpoint} for REST API. Default: None

  • timeout – Timeout for REST and gRPC API requests. Default: 5.0 seconds for REST and unlimited for gRPC

  • host – Host name of Qdrant service. If url and host are None, set to β€˜localhost’. Default: None

load_data(collection_name: str, query_vector: List[float], should_search_mapping: Optional[Dict[str, str]] = None, must_search_mapping: Optional[Dict[str, str]] = None, must_not_search_mapping: Optional[Dict[str, str]] = None, rang_search_mapping: Optional[Dict[str, Dict[str, float]]] = None, limit: int = 10) List[Document]

Load data from Qdrant.

Parameters
  • collection_name (str) – Name of the Qdrant collection.

  • query_vector (List[float]) – Query vector.

  • should_search_mapping (Optional[Dict[str, str]]) – Mapping from field name to query string.

  • must_search_mapping (Optional[Dict[str, str]]) – Mapping from field name to query string.

  • must_not_search_mapping (Optional[Dict[str, str]]) – Mapping from field name to query string.

  • rang_search_mapping (Optional[Dict[str, Dict[str, float]]]) – Mapping from field name to range query.

  • limit (int) – Number of results to return.

Example

reader = QdrantReader()
reader.load_data(
    collection_name="test_collection",
    query_vector=[0.1, 0.2, 0.3],
    should_search_mapping={"text_field": "text"},
    must_search_mapping={"text_field": "text"},
    must_not_search_mapping={"text_field": "text"},
    # gte, lte, gt, lt supported
    rang_search_mapping={"text_field": {"gte": 0.1, "lte": 0.2}},
    limit=10,
)

Returns

A list of documents.

Return type

List[Document]

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

pydantic model llama_index.readers.RssReader

RSS reader.

Reads content from an RSS feed.

Show JSON schema
{
   "title": "RssReader",
   "description": "RSS reader.\n\nReads content from an RSS feed.",
   "type": "object",
   "properties": {
      "is_remote": {
         "title": "Is Remote",
         "default": true,
         "type": "boolean"
      },
      "html_to_text": {
         "title": "Html To Text",
         "type": "boolean"
      }
   },
   "required": [
      "html_to_text"
   ]
}

Config
  • arbitrary_types_allowed: bool = True

Fields
  • html_to_text (bool)

  • is_remote (bool)

field html_to_text: bool [Required]
field is_remote: bool = True
classmethod class_name() str

Get the name identifier of the class.

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = β€˜allow’ was set since it adds all passed values

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from new model, as with values this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

classmethod from_dict(data: Dict[str, Any], **kwargs: Any) Self
classmethod from_json(data_str: str, **kwargs: Any) Self
classmethod from_orm(obj: Any) Model
json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

load_data(urls: List[str]) List[Document]

Load data from RSS feeds.

Parameters

urls (List[str]) – List of RSS URLs to load.

Returns

List of documents.

Return type

List[Document]

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.
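A minimal standard-library sketch of what an RSS reader extracts (illustrative, not the RssReader implementation, which uses a feed-parsing library): for each <item> in the feed, the title and description become one document's text.

```python
import xml.etree.ElementTree as ET

rss = """<rss version="2.0"><channel>
  <title>Example Feed</title>
  <item><title>Post 1</title><description>First post.</description></item>
  <item><title>Post 2</title><description>Second post.</description></item>
</channel></rss>"""

# One document per <item>: title and description joined into plain text.
docs = []
root = ET.fromstring(rss)
for item in root.iter("item"):
    title = item.findtext("title", default="")
    body = item.findtext("description", default="")
    docs.append(f"{title}\n\n{body}")

print(len(docs), docs[0].splitlines()[0])  # 2 Post 1
```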

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
to_dict(**kwargs: Any) Dict[str, Any]
to_json(**kwargs: Any) str
classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
class llama_index.readers.SimpleDirectoryReader(input_dir: Optional[str] = None, input_files: Optional[List] = None, exclude: Optional[List] = None, exclude_hidden: bool = True, errors: str = 'ignore', recursive: bool = False, encoding: str = 'utf-8', filename_as_id: bool = False, required_exts: Optional[List[str]] = None, file_extractor: Optional[Dict[str, BaseReader]] = None, num_files_limit: Optional[int] = None, file_metadata: Optional[Callable[[str], Dict]] = None)

Simple directory reader.

Load files from file directory. Automatically select the best file reader given file extensions.

Parameters
  • input_dir (str) – Path to the directory.

  • input_files (List) – List of file paths to read (Optional; overrides input_dir, exclude)

  • exclude (List) – glob of python file paths to exclude (Optional)

  • exclude_hidden (bool) – Whether to exclude hidden files (dotfiles).

  • encoding (str) – Encoding of the files. Default is utf-8.

  • errors (str) – how encoding and decoding errors are to be handled, see https://docs.python.org/3/library/functions.html#open

  • recursive (bool) – Whether to recursively search in subdirectories. False by default.

  • filename_as_id (bool) – Whether to use the filename as the document id. False by default.

  • required_exts (Optional[List[str]]) – List of required extensions. Default is None.

  • file_extractor (Optional[Dict[str, BaseReader]]) – A mapping of file extension to a BaseReader class that specifies how to convert that file to text. If not specified, use default from DEFAULT_FILE_READER_CLS.

  • num_files_limit (Optional[int]) – Maximum number of files to read. Default is None.

  • file_metadata (Optional[Callable[str, Dict]]) – A function that takes in a filename and returns a Dict of metadata for the Document. Default is None.

load_data() List[Document]

Load data from the input directory.

Returns

A list of documents.

Return type

List[Document]

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.
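The file-selection behavior described above can be sketched with a hypothetical helper (not the real implementation, which also dispatches to per-extension readers):

```python
import pathlib
import tempfile

def select_files(input_dir, required_exts=None, exclude_hidden=True,
                 num_files_limit=None):
    """Collect files from a directory, applying SimpleDirectoryReader-style filters."""
    files = []
    for p in sorted(pathlib.Path(input_dir).iterdir()):
        if not p.is_file():
            continue
        if exclude_hidden and p.name.startswith("."):
            continue  # skip dotfiles
        if required_exts is not None and p.suffix not in required_exts:
            continue  # wrong extension
        files.append(p)
        if num_files_limit is not None and len(files) >= num_files_limit:
            break
    return files

# Demo directory with a hidden file and two extensions.
d = tempfile.mkdtemp()
for name in (".hidden.txt", "a.txt", "b.md"):
    (pathlib.Path(d) / name).write_text("content")
print([p.name for p in select_files(d, required_exts=[".txt"])])  # ['a.txt']
```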

class llama_index.readers.SimpleMongoReader(host: Optional[str] = None, port: Optional[int] = None, uri: Optional[str] = None, max_docs: int = 1000)

Simple mongo reader.

Concatenates each Mongo doc into a Document used by LlamaIndex.

Parameters
  • host (str) – Mongo host.

  • port (int) – Mongo port.

  • max_docs (int) – Maximum number of documents to load.

load_data(db_name: str, collection_name: str, field_names: List[str] = ['text'], query_dict: Optional[Dict] = None) List[Document]

Load data from the input directory.

Parameters
  • db_name (str) – name of the database.

  • collection_name (str) – name of the collection.

  • field_names (List[str]) – names of the fields to be concatenated. Defaults to [β€œtext”]

  • query_dict (Optional[Dict]) – query to filter documents. Defaults to None

Returns

A list of documents.

Return type

List[Document]

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.
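The field concatenation described above can be sketched as follows (a hypothetical helper, not the real implementation, which also runs the Mongo query):

```python
def docs_to_texts(mongo_docs, field_names=("text",), separator=""):
    """Join the chosen fields of each Mongo doc into one text string."""
    texts = []
    for doc in mongo_docs:
        texts.append(separator.join(str(doc[name]) for name in field_names))
    return texts

# Example documents as they might come back from a Mongo collection.
mongo_docs = [
    {"_id": 1, "title": "Intro", "text": "Hello world."},
    {"_id": 2, "title": "Next", "text": "More text."},
]
print(docs_to_texts(mongo_docs, field_names=("title", "text"), separator=": "))
# ['Intro: Hello world.', 'Next: More text.']
```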

pydantic model llama_index.readers.SimpleWebPageReader

Simple web page reader.

Reads pages from the web.

Parameters
  • html_to_text (bool) – Whether to convert HTML to text. Requires html2text package.

  • metadata_fn (Optional[Callable[[str], Dict]]) – A function that takes in a URL and returns a dictionary of metadata. Default is None.

Show JSON schema
{
   "title": "SimpleWebPageReader",
   "description": "Simple web page reader.\n\nReads pages from the web.\n\nArgs:\n    html_to_text (bool): Whether to convert HTML to text.\n        Requires `html2text` package.\n    metadata_fn (Optional[Callable[[str], Dict]]): A function that takes in\n        a URL and returns a dictionary of metadata.\n        Default is None.",
   "type": "object",
   "properties": {
      "is_remote": {
         "title": "Is Remote",
         "default": true,
         "type": "boolean"
      },
      "html_to_text": {
         "title": "Html To Text",
         "type": "boolean"
      }
   },
   "required": [
      "html_to_text"
   ]
}

Config
  • arbitrary_types_allowed: bool = True

Fields
  • html_to_text (bool)

  • is_remote (bool)

field html_to_text: bool [Required]
field is_remote: bool = True
classmethod class_name() str

Get the name identifier of the class.

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = β€˜allow’ was set since it adds all passed values

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from new model, as with values this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

classmethod from_dict(data: Dict[str, Any], **kwargs: Any) Self
classmethod from_json(data_str: str, **kwargs: Any) Self
classmethod from_orm(obj: Any) Model
json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

load_data(urls: List[str]) List[Document]

Load data from the input directory.

Parameters

urls (List[str]) – List of URLs to scrape.

Returns

List of documents.

Return type

List[Document]

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
to_dict(**kwargs: Any) Dict[str, Any]
to_json(**kwargs: Any) str
classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
pydantic model llama_index.readers.SlackReader

Slack reader.

Reads conversations from channels. If an earliest_date is provided, an optional latest_date can also be provided. If no latest_date is provided, we assume the latest date is the current timestamp.

Parameters
  • slack_token (Optional[str]) – Slack token. If not provided, we assume the environment variable SLACK_BOT_TOKEN is set.

  • ssl (Optional[str]) – Custom SSL context. If not provided, it is assumed there is already an SSL context available.

  • earliest_date (Optional[datetime]) – Earliest date from which to read conversations. If not provided, we read all messages.

  • latest_date (Optional[datetime]) – Latest date up to which to read conversations. If not provided, defaults to the current timestamp (used together with earliest_date).

Show JSON schema
{
   "title": "SlackReader",
   "description": "Slack reader.\n\nReads conversations from channels. If an earliest_date is provided, an\noptional latest_date can also be provided. If no latest_date is provided,\nwe assume the latest date is the current timestamp.\n\nArgs:\n    slack_token (Optional[str]): Slack token. If not provided, we\n        assume the environment variable `SLACK_BOT_TOKEN` is set.\n    ssl (Optional[str]): Custom SSL context. If not provided, it is assumed\n        there is already an SSL context available.\n    earliest_date (Optional[datetime]): Earliest date from which\n        to read conversations. If not provided, we read all messages.\n    latest_date (Optional[datetime]): Latest date from which to\n        read conversations. If not provided, defaults to current timestamp\n        in combination with earliest_date.",
   "type": "object",
   "properties": {
      "is_remote": {
         "title": "Is Remote",
         "default": true,
         "type": "boolean"
      },
      "slack_token": {
         "title": "Slack Token",
         "type": "string"
      },
      "earliest_date_timestamp": {
         "title": "Earliest Date Timestamp",
         "type": "number"
      },
      "latest_date_timestamp": {
         "title": "Latest Date Timestamp",
         "type": "number"
      }
   },
   "required": [
      "slack_token",
      "latest_date_timestamp"
   ]
}

Config
  • arbitrary_types_allowed: bool = True

Fields
  • earliest_date_timestamp (Optional[float])

  • is_remote (bool)

  • latest_date_timestamp (float)

  • slack_token (str)

field earliest_date_timestamp: Optional[float] = None
field is_remote: bool = True
field latest_date_timestamp: float [Required]
field slack_token: str [Required]
classmethod class_name() str

Get the name identifier of the class.

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = 'allow' was set, since it adds all passed values.

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from new model, as with values this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

classmethod from_dict(data: Dict[str, Any], **kwargs: Any) Self
classmethod from_json(data_str: str, **kwargs: Any) Self
classmethod from_orm(obj: Any) Model
json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

load_data(channel_ids: List[str], reverse_chronological: bool = True) List[Document]

Load data from the given Slack channels.

Parameters

channel_ids (List[str]) – List of channel ids to read.

Returns

List of documents.

Return type

List[Document]
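Because latest_date defaults to "now" when only earliest_date is given, the reader's two timestamp fields resolve as sketched below. This is a hypothetical helper illustrating the documented behavior, not the library's actual code:

```python
from datetime import datetime, timezone

def resolve_slack_bounds(earliest=None, latest=None):
    """Sketch of how SlackReader's timestamp fields are likely derived:
    no earliest -> read all messages; no latest -> assume "now"."""
    earliest_ts = earliest.timestamp() if earliest is not None else None
    latest_ts = (latest or datetime.now(timezone.utc)).timestamp()
    return earliest_ts, latest_ts

# Typical usage of the reader itself (requires SLACK_BOT_TOKEN and network,
# channel id is illustrative):
# reader = SlackReader(earliest_date=datetime(2023, 1, 1))
# documents = reader.load_data(channel_ids=["C0123456789"])
```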

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
to_dict(**kwargs: Any) Dict[str, Any]
to_json(**kwargs: Any) str
classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
class llama_index.readers.SteamshipFileReader(api_key: Optional[str] = None)

Reads persistent Steamship Files and converts them to Documents.

Parameters

api_key – Steamship API key. Defaults to STEAMSHIP_API_KEY value if not provided.

Note

Requires install of steamship package and an active Steamship API Key. To get a Steamship API Key, visit: https://steamship.com/account/api. Once you have an API Key, expose it via an environment variable named STEAMSHIP_API_KEY or pass it as an init argument (api_key).

load_data(workspace: str, query: Optional[str] = None, file_handles: Optional[List[str]] = None, collapse_blocks: bool = True, join_str: str = '\n\n') List[Document]

Load data from persistent Steamship Files into Documents.

Parameters
  • workspace – the handle for a Steamship workspace (see: https://docs.steamship.com/workspaces/index.html)

  • query – a Steamship tag query for retrieving files (e.g. 'filetag and value("import-id")="import-001"')

  • file_handles – a list of Steamship File handles (ex: smooth-valley-9kbdr)

  • collapse_blocks – whether to merge individual File Blocks into a single Document, or separate them.

  • join_str – when collapse_blocks is True, this is how the block texts will be concatenated.

Note

The collection of Files from both query and file_handles will be combined. There is no (current) support for deconflicting the collections (meaning that if a file appears both in the result set of the query and as a handle in file_handles, it will be loaded twice).
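The collapse_blocks/join_str interaction can be sketched as follows. This is a hypothetical helper illustrating the documented behavior, not Steamship's or the reader's actual implementation:

```python
def combine_blocks(block_texts, collapse_blocks=True, join_str="\n\n"):
    """Produce one document text when collapsing File Blocks,
    otherwise one text per Block."""
    if collapse_blocks:
        return [join_str.join(block_texts)]
    return list(block_texts)
```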

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

pydantic model llama_index.readers.StringIterableReader

String Iterable Reader.

Gets a list of documents, given an iterable (e.g. list) of strings.

Example

from llama_index import StringIterableReader, TreeIndex

documents = StringIterableReader().load_data(
    texts=["I went to the store", "I bought an apple"])
index = TreeIndex.from_documents(documents)
query_engine = index.as_query_engine()
query_engine.query("what did I buy?")

# response should be something like "You bought an apple."

Show JSON schema
{
   "title": "StringIterableReader",
   "description": "String Iterable Reader.\n\nGets a list of documents, given an iterable (e.g. list) of strings.\n\nExample:\n    .. code-block:: python\n\n        from llama_index import StringIterableReader, TreeIndex\n\n        documents = StringIterableReader().load_data(\n            texts=[\"I went to the store\", \"I bought an apple\"])\n        index = TreeIndex.from_documents(documents)\n        query_engine = index.as_query_engine()\n        query_engine.query(\"what did I buy?\")\n\n        # response should be something like \"You bought an apple.\"",
   "type": "object",
   "properties": {
      "is_remote": {
         "title": "Is Remote",
         "default": false,
         "type": "boolean"
      }
   }
}

Config
  • arbitrary_types_allowed: bool = True

Fields
  • is_remote (bool)

field is_remote: bool = False
classmethod class_name() str

Get the name identifier of the class.

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = 'allow' was set, since it adds all passed values.

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from new model, as with values this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

classmethod from_dict(data: Dict[str, Any], **kwargs: Any) Self
classmethod from_json(data_str: str, **kwargs: Any) Self
classmethod from_orm(obj: Any) Model
json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

load_data(texts: List[str]) List[Document]

Load the data.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
to_dict(**kwargs: Any) Dict[str, Any]
to_json(**kwargs: Any) str
classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
pydantic model llama_index.readers.TrafilaturaWebReader

Trafilatura web page reader.

Reads pages from the web. Requires the trafilatura package.

Show JSON schema
{
   "title": "TrafilaturaWebReader",
   "description": "Trafilatura web page reader.\n\nReads pages from the web.\nRequires the `trafilatura` package.",
   "type": "object",
   "properties": {
      "is_remote": {
         "title": "Is Remote",
         "default": true,
         "type": "boolean"
      },
      "error_on_missing": {
         "title": "Error On Missing",
         "type": "boolean"
      }
   },
   "required": [
      "error_on_missing"
   ]
}

Config
  • arbitrary_types_allowed: bool = True

Fields
  • error_on_missing (bool)

  • is_remote (bool)

field error_on_missing: bool [Required]
field is_remote: bool = True
classmethod class_name() str

Get the name identifier of the class.

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = 'allow' was set, since it adds all passed values.

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from new model, as with values this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

classmethod from_dict(data: Dict[str, Any], **kwargs: Any) Self
classmethod from_json(data_str: str, **kwargs: Any) Self
classmethod from_orm(obj: Any) Model
json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

load_data(urls: List[str]) List[Document]

Load data from the given URLs.

Parameters

urls (List[str]) – List of URLs to scrape.

Returns

List of documents.

Return type

List[Document]
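The required error_on_missing field presumably controls what happens when trafilatura extracts no main text from a page (its extract() returns None in that case). A sketch of that semantics, under that assumption, with illustrative URL names:

```python
def collect_texts(extracted, error_on_missing):
    """extracted: list of (url, text_or_None) pairs. When error_on_missing
    is True, a page with no extractable text raises; otherwise it is
    silently skipped."""
    texts = []
    for url, text in extracted:
        if not text:
            if error_on_missing:
                raise ValueError(f"No text found for {url}")
            continue
        texts.append(text)
    return texts
```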

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
to_dict(**kwargs: Any) Dict[str, Any]
to_json(**kwargs: Any) str
classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
pydantic model llama_index.readers.TwitterTweetReader

Twitter tweets reader.

Reads tweets for the given Twitter handles.

See https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api for how to get access to the Twitter API.

Parameters
  • bearer_token (str) – Bearer token obtained from the Twitter API.

  • num_tweets (Optional[int]) – Number of tweets to read for each handle. Defaults to 100.

Show JSON schema
{
   "title": "TwitterTweetReader",
   "description": "Twitter tweets reader.\n\nRead tweets of user twitter handle.\n\nCheck 'https://developer.twitter.com/en/docs/twitter-api/        getting-started/getting-access-to-the-twitter-api'         on how to get access to twitter API.\n\nArgs:\n    bearer_token (str): bearer_token that you get from twitter API.\n    num_tweets (Optional[int]): Number of tweets for each user twitter handle.            Default is 100 tweets.",
   "type": "object",
   "properties": {
      "is_remote": {
         "title": "Is Remote",
         "default": true,
         "type": "boolean"
      },
      "bearer_token": {
         "title": "Bearer Token",
         "type": "string"
      },
      "num_tweets": {
         "title": "Num Tweets",
         "type": "integer"
      }
   },
   "required": [
      "bearer_token"
   ]
}

Config
  • arbitrary_types_allowed: bool = True

Fields
  • bearer_token (str)

  • is_remote (bool)

  • num_tweets (Optional[int])

field bearer_token: str [Required]
field is_remote: bool = True
field num_tweets: Optional[int] = None
classmethod class_name() str

Get the name identifier of the class.

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = 'allow' was set, since it adds all passed values.

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from new model, as with values this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

classmethod from_dict(data: Dict[str, Any], **kwargs: Any) Self
classmethod from_json(data_str: str, **kwargs: Any) Self
classmethod from_orm(obj: Any) Model
json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

load_data(twitterhandles: List[str], num_tweets: Optional[int] = None, **load_kwargs: Any) List[Document]

Load tweets for the given Twitter handles.

Parameters

twitterhandles (List[str]) – List of Twitter handles whose tweets to read.
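Since num_tweets appears both as a reader field and as a load_data argument, the per-call value presumably takes precedence, with the documented default of 100. A sketch of that resolution (an assumption, not the library's exact code):

```python
def resolve_num_tweets(call_value=None, field_value=None, default=100):
    """Per-call argument wins, then the reader's num_tweets field,
    then the documented default of 100 tweets per handle."""
    if call_value is not None:
        return call_value
    if field_value is not None:
        return field_value
    return default
```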

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
to_dict(**kwargs: Any) Dict[str, Any]
to_json(**kwargs: Any) str
classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
class llama_index.readers.WeaviateReader(host: str, auth_client_secret: Optional[Any] = None)

Weaviate reader.

Retrieves documents from Weaviate through vector lookup. Allows option to concatenate retrieved documents into one Document, or to return separate Document objects per document.

Parameters
  • host (str) – host.

  • auth_client_secret (Optional[weaviate.auth.AuthCredentials]) – auth_client_secret.

load_data(class_name: Optional[str] = None, properties: Optional[List[str]] = None, graphql_query: Optional[str] = None, separate_documents: Optional[bool] = True) List[Document]

Load data from Weaviate.

If graphql_query is not provided, we assume that class_name and properties are provided.

Parameters
  • class_name (Optional[str]) – class_name to retrieve documents from.

  • properties (Optional[List[str]]) – properties to retrieve from documents.

  • graphql_query (Optional[str]) – Raw GraphQL Query. We assume that the query is a Get query.

  • separate_documents (Optional[bool]) – Whether to return separate documents. Defaults to True.

Returns

A list of documents.

Return type

List[Document]
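When class_name and properties are given instead of a raw query, the reader has to assemble a GraphQL Get query itself. A hypothetical sketch of what such a query looks like (the class and property names are illustrative, and this is not the library's exact code):

```python
def build_get_query(class_name, properties):
    """Assemble a minimal Weaviate GraphQL Get query from a class
    name and a list of properties to retrieve."""
    props = " ".join(properties)
    return "{ Get { %s { %s } } }" % (class_name, props)

# Typical usage of the reader (requires a running Weaviate instance):
# reader = WeaviateReader("http://localhost:8080")
# documents = reader.load_data(class_name="Article", properties=["title", "body"])
```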

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

pydantic model llama_index.readers.WikipediaReader

Wikipedia reader.

Reads one or more Wikipedia pages.

Show JSON schema
{
   "title": "WikipediaReader",
   "description": "Wikipedia reader.\n\nReads a page.",
   "type": "object",
   "properties": {
      "is_remote": {
         "title": "Is Remote",
         "default": true,
         "type": "boolean"
      }
   }
}

Config
  • arbitrary_types_allowed: bool = True

Fields
  • is_remote (bool)

field is_remote: bool = True
classmethod class_name() str

Get the name identifier of the class.

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = 'allow' was set, since it adds all passed values.

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from new model, as with values this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

classmethod from_dict(data: Dict[str, Any], **kwargs: Any) Self
classmethod from_json(data_str: str, **kwargs: Any) Self
classmethod from_orm(obj: Any) Model
json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

load_data(pages: List[str], **load_kwargs: Any) List[Document]

Load data for the given Wikipedia pages.

Parameters

pages (List[str]) – List of pages to read.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
to_dict(**kwargs: Any) Dict[str, Any]
to_json(**kwargs: Any) str
classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
pydantic model llama_index.readers.YoutubeTranscriptReader

Youtube Transcript reader.

Show JSON schema
{
   "title": "YoutubeTranscriptReader",
   "description": "Youtube Transcript reader.",
   "type": "object",
   "properties": {
      "is_remote": {
         "title": "Is Remote",
         "default": true,
         "type": "boolean"
      }
   }
}

Config
  • arbitrary_types_allowed: bool = True

Fields
  • is_remote (bool)

field is_remote: bool = True
classmethod class_name() str

Get the name identifier of the class.

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = 'allow' was set, since it adds all passed values.

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from new model, as with values this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

classmethod from_dict(data: Dict[str, Any], **kwargs: Any) Self
classmethod from_json(data_str: str, **kwargs: Any) Self
classmethod from_orm(obj: Any) Model
json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

load_data(ytlinks: List[str], **load_kwargs: Any) List[Document]

Load data from the input links.

Parameters

ytlinks (List[str]) – List of YouTube links from which transcripts are to be read.
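A transcript reader has to recover the video id from each link form it accepts. A small helper sketch showing that parsing (an assumption about what the reader does internally, not its actual code):

```python
from urllib.parse import urlparse, parse_qs

def youtube_video_id(link):
    """Extract the video id from either a short youtu.be link
    or a full watch?v= link."""
    parsed = urlparse(link)
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/")
    return parse_qs(parsed.query).get("v", [None])[0]
```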

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
to_dict(**kwargs: Any) Dict[str, Any]
to_json(**kwargs: Any) str
classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model