Node Parser

Node parsers.

pydantic model llama_index.node_parser.HierarchicalNodeParser

Hierarchical node parser.

Splits a document into a recursive hierarchy of Nodes using a TextSplitter.

NOTE: this will return a hierarchy of nodes in a flat list, where there will be overlap between parent nodes (e.g. with a bigger chunk size), and child nodes per parent (e.g. with a smaller chunk size).

For instance, this may return a list of nodes like:

  • list of top-level nodes with chunk size 2048

  • list of second-level nodes, where each node is a child of a top-level node, with chunk size 512

  • list of third-level nodes, where each node is a child of a second-level node, with chunk size 128

Parameters
  • text_splitter (Optional[TextSplitter]) – text splitter

  • include_metadata (bool) – whether to include metadata in nodes

  • include_prev_next_rel (bool) – whether to include prev/next relationships

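Example usage, a minimal sketch (the document text is a placeholder):

from llama_index import Document
from llama_index.node_parser import HierarchicalNodeParser

documents = [Document(text="...full document text here...")]

# Build a three-level hierarchy: 2048-token roots, 512-token children,
# and 128-token leaves, matching the levels described above.
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])

# All levels come back in one flat list; parent/child links are stored
# on each node's relationships.
nodes = node_parser.get_nodes_from_documents(documents)
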
Show JSON schema
{
   "title": "HierarchicalNodeParser",
   "description": "Hierarchical node parser.\n\nSplits a document into a recursive hierarchy Nodes using a TextSplitter.\n\nNOTE: this will return a hierarchy of nodes in a flat list, where there will be\noverlap between parent nodes (e.g. with a bigger chunk size), and child nodes\nper parent (e.g. with a smaller chunk size).\n\nFor instance, this may return a list of nodes like:\n- list of top-level nodes with chunk size 2048\n- list of second-level nodes, where each node is a child of a top-level node,\n    chunk size 512\n- list of third-level nodes, where each node is a child of a second-level node,\n    chunk size 128\n\nArgs:\n    text_splitter (Optional[TextSplitter]): text splitter\n    include_metadata (bool): whether to include metadata in nodes\n    include_prev_next_rel (bool): whether to include prev/next relationships",
   "type": "object",
   "properties": {
      "chunk_sizes": {
         "title": "Chunk Sizes",
         "description": "The chunk sizes to use when splitting documents, in order of level.",
         "type": "array",
         "items": {
            "type": "integer"
         }
      },
      "text_splitter_ids": {
         "title": "Text Splitter Ids",
         "description": "List of ids for the text splitters to use when splitting documents, in order of level (first id used for first level, etc.).",
         "type": "array",
         "items": {
            "type": "string"
         }
      },
      "text_splitter_map": {
         "title": "Text Splitter Map",
         "description": "Map of text splitter id to text splitter.",
         "type": "object",
         "additionalProperties": {
            "$ref": "#/definitions/TextSplitter"
         }
      },
      "include_metadata": {
         "title": "Include Metadata",
         "description": "Whether or not to consider metadata when splitting.",
         "default": true,
         "type": "boolean"
      },
      "include_prev_next_rel": {
         "title": "Include Prev Next Rel",
         "description": "Include prev/next node relationships.",
         "default": true,
         "type": "boolean"
      },
      "metadata_extractor": {
         "title": "Metadata Extractor",
         "description": "Metadata extraction pipeline to apply to nodes.",
         "allOf": [
            {
               "$ref": "#/definitions/MetadataExtractor"
            }
         ]
      },
      "callback_manager": {
         "title": "Callback Manager"
      }
   },
   "required": [
      "text_splitter_map"
   ],
   "definitions": {
      "TextSplitter": {
         "title": "TextSplitter",
         "description": "Helper class that provides a standard way to create an ABC using\ninheritance.",
         "type": "object",
         "properties": {}
      },
      "MetadataMode": {
         "title": "MetadataMode",
         "description": "An enumeration.",
         "enum": [
            "1",
            "2",
            "3",
            "4"
         ],
         "type": "string"
      },
      "MetadataFeatureExtractor": {
         "title": "MetadataFeatureExtractor",
         "description": "Base interface for feature extractor.",
         "type": "object",
         "properties": {
            "is_text_node_only": {
               "title": "Is Text Node Only",
               "default": true,
               "type": "boolean"
            },
            "show_progress": {
               "title": "Show Progress",
               "default": true,
               "type": "boolean"
            },
            "metadata_mode": {
               "default": "1",
               "allOf": [
                  {
                     "$ref": "#/definitions/MetadataMode"
                  }
               ]
            }
         }
      },
      "MetadataExtractor": {
         "title": "MetadataExtractor",
         "description": "Metadata extractor.",
         "type": "object",
         "properties": {
            "extractors": {
               "title": "Extractors",
               "description": "Metadta feature extractors to apply to each node.",
               "type": "array",
               "items": {
                  "$ref": "#/definitions/MetadataFeatureExtractor"
               }
            },
            "node_text_template": {
               "title": "Node Text Template",
               "description": "Template to represent how node text is mixed with metadata text.",
               "default": "[Excerpt from document]\n{metadata_str}\nExcerpt:\n-----\n{content}\n-----\n",
               "type": "string"
            },
            "disable_template_rewrite": {
               "title": "Disable Template Rewrite",
               "description": "Disable the node template rewrite.",
               "default": false,
               "type": "boolean"
            },
            "in_place": {
               "title": "In Place",
               "description": "Whether to process nodes in place.",
               "default": true,
               "type": "boolean"
            }
         }
      }
   }
}

Config
  • arbitrary_types_allowed: bool = True

Fields
  • callback_manager (llama_index.callbacks.base.CallbackManager)

  • chunk_sizes (Optional[List[int]])

  • include_metadata (bool)

  • include_prev_next_rel (bool)

  • metadata_extractor (Optional[llama_index.node_parser.extractors.metadata_extractors.MetadataExtractor])

  • text_splitter_ids (List[str])

  • text_splitter_map (Dict[str, llama_index.text_splitter.types.TextSplitter])

field callback_manager: CallbackManager [Optional]
field chunk_sizes: Optional[List[int]] = None

The chunk sizes to use when splitting documents, in order of level.

field include_metadata: bool = True

Whether or not to consider metadata when splitting.

field include_prev_next_rel: bool = True

Include prev/next node relationships.

field metadata_extractor: Optional[MetadataExtractor] = None

Metadata extraction pipeline to apply to nodes.

field text_splitter_ids: List[str] [Optional]

List of ids for the text splitters to use when splitting documents, in order of level (first id used for first level, etc.).

field text_splitter_map: Dict[str, TextSplitter] [Required]

Map of text splitter id to text splitter.

classmethod class_name() str

Get class name.

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = 'allow' was set since it adds all passed values.

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from new model, as with values this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

classmethod from_defaults(chunk_sizes: Optional[List[int]] = None, text_splitter_ids: Optional[List[str]] = None, text_splitter_map: Optional[Dict[str, TextSplitter]] = None, include_metadata: bool = True, include_prev_next_rel: bool = True, callback_manager: Optional[CallbackManager] = None, metadata_extractor: Optional[MetadataExtractor] = None) HierarchicalNodeParser
classmethod from_dict(data: Dict[str, Any], **kwargs: Any) Self
classmethod from_json(data_str: str, **kwargs: Any) Self
classmethod from_orm(obj: Any) Model
get_nodes_from_documents(documents: Sequence[Document], show_progress: bool = False) List[BaseNode]

Parse documents into nodes.

Parameters

documents (Sequence[Document]) – documents to parse

json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
to_dict(**kwargs: Any) Dict[str, Any]
to_json(**kwargs: Any) str
classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
pydantic model llama_index.node_parser.NodeParser

Base interface for node parser.

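Subclasses implement class_name and get_nodes_from_documents. A hypothetical minimal subclass, shown only to illustrate the interface (one TextNode per non-empty line):

from typing import List, Sequence

from llama_index.node_parser import NodeParser
from llama_index.schema import BaseNode, Document, TextNode

class LineNodeParser(NodeParser):
    """Hypothetical parser: emits one TextNode per non-empty line."""

    @classmethod
    def class_name(cls) -> str:
        return "LineNodeParser"

    def get_nodes_from_documents(
        self, documents: Sequence[Document], show_progress: bool = False
    ) -> List[BaseNode]:
        nodes: List[BaseNode] = []
        for doc in documents:
            for line in doc.text.splitlines():
                if line.strip():  # skip blank lines
                    nodes.append(TextNode(text=line))
        return nodes
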
Show JSON schema
{
   "title": "NodeParser",
   "description": "Base interface for node parser.",
   "type": "object",
   "properties": {}
}

Config
  • arbitrary_types_allowed: bool = True

abstract classmethod class_name() str

Get class name.

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = 'allow' was set since it adds all passed values.

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from new model, as with values this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

classmethod from_dict(data: Dict[str, Any], **kwargs: Any) Self
classmethod from_json(data_str: str, **kwargs: Any) Self
classmethod from_orm(obj: Any) Model
abstract get_nodes_from_documents(documents: Sequence[Document], show_progress: bool = False) List[BaseNode]

Parse documents into nodes.

Parameters

documents (Sequence[Document]) – documents to parse

json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
to_dict(**kwargs: Any) Dict[str, Any]
to_json(**kwargs: Any) str
classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
pydantic model llama_index.node_parser.SentenceWindowNodeParser

Sentence window node parser.

Splits a document into Nodes, with each node being a sentence. Each node contains a window from the surrounding sentences in the metadata.

Parameters
  • sentence_splitter (Optional[Callable]) – splits text into sentences

  • include_metadata (bool) – whether to include metadata in nodes

  • include_prev_next_rel (bool) – whether to include prev/next relationships

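A short usage sketch; each node holds one sentence, with the surrounding window stored under the configured metadata keys:

from llama_index import Document
from llama_index.node_parser import SentenceWindowNodeParser

node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # three sentences on each side of the target sentence
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

documents = [Document(text="First sentence. Second sentence. Third sentence. Fourth sentence.")]
nodes = node_parser.get_nodes_from_documents(documents)

print(nodes[0].metadata["window"])         # the surrounding sentences
print(nodes[0].metadata["original_text"])  # the sentence itself
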
Show JSON schema
{
   "title": "SentenceWindowNodeParser",
   "description": "Sentence window node parser.\n\nSplits a document into Nodes, with each node being a sentence.\nEach node contains a window from the surrounding sentences in the metadata.\n\nArgs:\n    sentence_splitter (Optional[Callable]): splits text into sentences\n    include_metadata (bool): whether to include metadata in nodes\n    include_prev_next_rel (bool): whether to include prev/next relationships",
   "type": "object",
   "properties": {
      "window_size": {
         "title": "Window Size",
         "description": "The number of sentences on each side of a sentence to capture.",
         "default": 3,
         "type": "integer"
      },
      "window_metadata_key": {
         "title": "Window Metadata Key",
         "description": "The metadata key to store the sentence window under.",
         "default": "window",
         "type": "string"
      },
      "original_text_metadata_key": {
         "title": "Original Text Metadata Key",
         "description": "The metadata key to store the original sentence in.",
         "default": "original_text",
         "type": "string"
      },
      "include_metadata": {
         "title": "Include Metadata",
         "description": "Whether or not to consider metadata when splitting.",
         "default": true,
         "type": "boolean"
      },
      "include_prev_next_rel": {
         "title": "Include Prev Next Rel",
         "description": "Include prev/next node relationships.",
         "default": true,
         "type": "boolean"
      },
      "metadata_extractor": {
         "title": "Metadata Extractor",
         "description": "Metadata extraction pipeline to apply to nodes.",
         "allOf": [
            {
               "$ref": "#/definitions/MetadataExtractor"
            }
         ]
      },
      "callback_manager": {
         "title": "Callback Manager"
      }
   },
   "definitions": {
      "MetadataMode": {
         "title": "MetadataMode",
         "description": "An enumeration.",
         "enum": [
            "1",
            "2",
            "3",
            "4"
         ],
         "type": "string"
      },
      "MetadataFeatureExtractor": {
         "title": "MetadataFeatureExtractor",
         "description": "Base interface for feature extractor.",
         "type": "object",
         "properties": {
            "is_text_node_only": {
               "title": "Is Text Node Only",
               "default": true,
               "type": "boolean"
            },
            "show_progress": {
               "title": "Show Progress",
               "default": true,
               "type": "boolean"
            },
            "metadata_mode": {
               "default": "1",
               "allOf": [
                  {
                     "$ref": "#/definitions/MetadataMode"
                  }
               ]
            }
         }
      },
      "MetadataExtractor": {
         "title": "MetadataExtractor",
         "description": "Metadata extractor.",
         "type": "object",
         "properties": {
            "extractors": {
               "title": "Extractors",
               "description": "Metadta feature extractors to apply to each node.",
               "type": "array",
               "items": {
                  "$ref": "#/definitions/MetadataFeatureExtractor"
               }
            },
            "node_text_template": {
               "title": "Node Text Template",
               "description": "Template to represent how node text is mixed with metadata text.",
               "default": "[Excerpt from document]\n{metadata_str}\nExcerpt:\n-----\n{content}\n-----\n",
               "type": "string"
            },
            "disable_template_rewrite": {
               "title": "Disable Template Rewrite",
               "description": "Disable the node template rewrite.",
               "default": false,
               "type": "boolean"
            },
            "in_place": {
               "title": "In Place",
               "description": "Whether to process nodes in place.",
               "default": true,
               "type": "boolean"
            }
         }
      }
   }
}

Config
  • arbitrary_types_allowed: bool = True

Fields
  • callback_manager (llama_index.callbacks.base.CallbackManager)

  • include_metadata (bool)

  • include_prev_next_rel (bool)

  • metadata_extractor (Optional[llama_index.node_parser.extractors.metadata_extractors.MetadataExtractor])

  • original_text_metadata_key (str)

  • sentence_splitter (Callable[[str], List[str]])

  • window_metadata_key (str)

  • window_size (int)

field callback_manager: CallbackManager [Optional]
field include_metadata: bool = True

Whether or not to consider metadata when splitting.

field include_prev_next_rel: bool = True

Include prev/next node relationships.

field metadata_extractor: Optional[MetadataExtractor] = None

Metadata extraction pipeline to apply to nodes.

field original_text_metadata_key: str = 'original_text'

The metadata key to store the original sentence in.

field sentence_splitter: Callable[[str], List[str]] [Optional]

The text splitter to use when splitting documents.

field window_metadata_key: str = 'window'

The metadata key to store the sentence window under.

field window_size: int = 3

The number of sentences on each side of a sentence to capture.

build_window_nodes_from_documents(documents: Sequence[Document]) List[BaseNode]

Build window nodes from documents.

classmethod class_name() str

Get class name.

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = 'allow' was set since it adds all passed values.

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from new model, as with values this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

classmethod from_defaults(sentence_splitter: Optional[Callable[[str], List[str]]] = None, window_size: int = 3, window_metadata_key: str = 'window', original_text_metadata_key: str = 'original_text', include_metadata: bool = True, include_prev_next_rel: bool = True, callback_manager: Optional[CallbackManager] = None, metadata_extractor: Optional[MetadataExtractor] = None) SentenceWindowNodeParser
classmethod from_dict(data: Dict[str, Any], **kwargs: Any) Self
classmethod from_json(data_str: str, **kwargs: Any) Self
classmethod from_orm(obj: Any) Model
get_nodes_from_documents(documents: Sequence[Document], show_progress: bool = False) List[BaseNode]

Parse documents into nodes.

Parameters

documents (Sequence[Document]) – documents to parse

json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
to_dict(**kwargs: Any) Dict[str, Any]
to_json(**kwargs: Any) str
classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
property text_splitter: Callable[[str], List[str]]

Get text splitter.

pydantic model llama_index.node_parser.SimpleNodeParser

Simple node parser.

Splits a document into Nodes using a TextSplitter.

Parameters
  • text_splitter (Optional[TextSplitter]) – text splitter

  • include_metadata (bool) – whether to include metadata in nodes

  • include_prev_next_rel (bool) – whether to include prev/next relationships

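A minimal usage sketch; when no explicit text_splitter is given, from_defaults builds a default splitter from the chunk size and overlap:

from llama_index import Document
from llama_index.node_parser import SimpleNodeParser

node_parser = SimpleNodeParser.from_defaults(chunk_size=512, chunk_overlap=20)
nodes = node_parser.get_nodes_from_documents(
    [Document(text="...full document text here...")]
)
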
Show JSON schema
{
   "title": "SimpleNodeParser",
   "description": "Simple node parser.\n\nSplits a document into Nodes using a TextSplitter.\n\nArgs:\n    text_splitter (Optional[TextSplitter]): text splitter\n    include_metadata (bool): whether to include metadata in nodes\n    include_prev_next_rel (bool): whether to include prev/next relationships",
   "type": "object",
   "properties": {
      "text_splitter": {
         "title": "Text Splitter"
      },
      "include_metadata": {
         "title": "Include Metadata",
         "description": "Whether or not to consider metadata when splitting.",
         "default": true,
         "type": "boolean"
      },
      "include_prev_next_rel": {
         "title": "Include Prev Next Rel",
         "description": "Include prev/next node relationships.",
         "default": true,
         "type": "boolean"
      },
      "metadata_extractor": {
         "title": "Metadata Extractor",
         "description": "Metadata extraction pipeline to apply to nodes.",
         "allOf": [
            {
               "$ref": "#/definitions/MetadataExtractor"
            }
         ]
      },
      "callback_manager": {
         "title": "Callback Manager"
      }
   },
   "definitions": {
      "MetadataMode": {
         "title": "MetadataMode",
         "description": "An enumeration.",
         "enum": [
            "1",
            "2",
            "3",
            "4"
         ],
         "type": "string"
      },
      "MetadataFeatureExtractor": {
         "title": "MetadataFeatureExtractor",
         "description": "Base interface for feature extractor.",
         "type": "object",
         "properties": {
            "is_text_node_only": {
               "title": "Is Text Node Only",
               "default": true,
               "type": "boolean"
            },
            "show_progress": {
               "title": "Show Progress",
               "default": true,
               "type": "boolean"
            },
            "metadata_mode": {
               "default": "1",
               "allOf": [
                  {
                     "$ref": "#/definitions/MetadataMode"
                  }
               ]
            }
         }
      },
      "MetadataExtractor": {
         "title": "MetadataExtractor",
         "description": "Metadata extractor.",
         "type": "object",
         "properties": {
            "extractors": {
               "title": "Extractors",
               "description": "Metadta feature extractors to apply to each node.",
               "type": "array",
               "items": {
                  "$ref": "#/definitions/MetadataFeatureExtractor"
               }
            },
            "node_text_template": {
               "title": "Node Text Template",
               "description": "Template to represent how node text is mixed with metadata text.",
               "default": "[Excerpt from document]\n{metadata_str}\nExcerpt:\n-----\n{content}\n-----\n",
               "type": "string"
            },
            "disable_template_rewrite": {
               "title": "Disable Template Rewrite",
               "description": "Disable the node template rewrite.",
               "default": false,
               "type": "boolean"
            },
            "in_place": {
               "title": "In Place",
               "description": "Whether to process nodes in place.",
               "default": true,
               "type": "boolean"
            }
         }
      }
   }
}

Config
  • arbitrary_types_allowed: bool = True

Fields
  • callback_manager (llama_index.callbacks.base.CallbackManager)

  • include_metadata (bool)

  • include_prev_next_rel (bool)

  • metadata_extractor (Optional[llama_index.node_parser.extractors.metadata_extractors.MetadataExtractor])

  • text_splitter (Union[llama_index.text_splitter.types.TextSplitter, langchain.text_splitter.TextSplitter])

field callback_manager: CallbackManager [Optional]
field include_metadata: bool = True

Whether or not to consider metadata when splitting.

field include_prev_next_rel: bool = True

Include prev/next node relationships.

field metadata_extractor: Optional[MetadataExtractor] = None

Metadata extraction pipeline to apply to nodes.

field text_splitter: Union[TextSplitter, langchain.text_splitter.TextSplitter] [Required]

The text splitter to use when splitting documents.

classmethod class_name() str

Get class name.

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = 'allow' was set since it adds all passed values.

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from new model, as with values this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

classmethod from_defaults(chunk_size: Optional[int] = None, chunk_overlap: Optional[int] = None, text_splitter: Optional[Union[TextSplitter, langchain.text_splitter.TextSplitter]] = None, include_metadata: bool = True, include_prev_next_rel: bool = True, callback_manager: Optional[CallbackManager] = None, metadata_extractor: Optional[MetadataExtractor] = None) SimpleNodeParser
classmethod from_dict(data: Dict[str, Any], **kwargs: Any) Self
classmethod from_json(data_str: str, **kwargs: Any) Self
classmethod from_orm(obj: Any) Model
get_nodes_from_documents(documents: Sequence[Document], show_progress: bool = False) List[BaseNode]

Parse documents into nodes.

Parameters

documents (Sequence[Document]) – documents to parse

json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
to_dict(**kwargs: Any) Dict[str, Any]
to_json(**kwargs: Any) str
classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
llama_index.node_parser.get_leaf_nodes(nodes: List[BaseNode]) List[BaseNode]

Get leaf nodes.

llama_index.node_parser.get_root_nodes(nodes: List[BaseNode]) List[BaseNode]

Get root nodes.

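These helpers pair with HierarchicalNodeParser, which returns all levels in one flat list. A sketch:

from llama_index import Document
from llama_index.node_parser import (
    HierarchicalNodeParser,
    get_leaf_nodes,
    get_root_nodes,
)

node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])
all_nodes = node_parser.get_nodes_from_documents(
    [Document(text="...full document text here...")]
)

leaf_nodes = get_leaf_nodes(all_nodes)  # smallest chunks, no children
root_nodes = get_root_nodes(all_nodes)  # largest chunks, no parents
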
pydantic model llama_index.node_parser.extractors.metadata_extractors.MetadataExtractor

Metadata extractor.

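A sketch of composing feature extractors into a pipeline and attaching it to a node parser. This assumes the LLM-based extractors fall back to a default LLM predictor when none is passed, per their Optional[BaseLLMPredictor] arguments:

from llama_index.node_parser import SimpleNodeParser
from llama_index.node_parser.extractors.metadata_extractors import (
    KeywordExtractor,
    MetadataExtractor,
    TitleExtractor,
)

metadata_extractor = MetadataExtractor(
    extractors=[
        TitleExtractor(nodes=5),       # adds document_title
        KeywordExtractor(keywords=5),  # adds excerpt_keywords
    ],
)

node_parser = SimpleNodeParser.from_defaults(metadata_extractor=metadata_extractor)
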
Show JSON schema
{
   "title": "MetadataExtractor",
   "description": "Metadata extractor.",
   "type": "object",
   "properties": {
      "extractors": {
         "title": "Extractors",
         "description": "Metadta feature extractors to apply to each node.",
         "type": "array",
         "items": {
            "$ref": "#/definitions/MetadataFeatureExtractor"
         }
      },
      "node_text_template": {
         "title": "Node Text Template",
         "description": "Template to represent how node text is mixed with metadata text.",
         "default": "[Excerpt from document]\n{metadata_str}\nExcerpt:\n-----\n{content}\n-----\n",
         "type": "string"
      },
      "disable_template_rewrite": {
         "title": "Disable Template Rewrite",
         "description": "Disable the node template rewrite.",
         "default": false,
         "type": "boolean"
      },
      "in_place": {
         "title": "In Place",
         "description": "Whether to process nodes in place.",
         "default": true,
         "type": "boolean"
      }
   },
   "definitions": {
      "MetadataMode": {
         "title": "MetadataMode",
         "description": "An enumeration.",
         "enum": [
            "1",
            "2",
            "3",
            "4"
         ],
         "type": "string"
      },
      "MetadataFeatureExtractor": {
         "title": "MetadataFeatureExtractor",
         "description": "Base interface for feature extractor.",
         "type": "object",
         "properties": {
            "is_text_node_only": {
               "title": "Is Text Node Only",
               "default": true,
               "type": "boolean"
            },
            "show_progress": {
               "title": "Show Progress",
               "default": true,
               "type": "boolean"
            },
            "metadata_mode": {
               "default": "1",
               "allOf": [
                  {
                     "$ref": "#/definitions/MetadataMode"
                  }
               ]
            }
         }
      }
   }
}

Config
  • arbitrary_types_allowed: bool = True

Fields
field disable_template_rewrite: bool = False

Disable the node template rewrite.

field extractors: Sequence[MetadataFeatureExtractor] [Optional]

Metadata feature extractors to apply to each node.

field in_place: bool = True

Whether to process nodes in place.

field node_text_template: str = '[Excerpt from document]\n{metadata_str}\nExcerpt:\n-----\n{content}\n-----\n'

Template to represent how node text is mixed with metadata text.

classmethod class_name() str

Get class name.

extract(nodes: Sequence[BaseNode]) List[Dict]

Extract metadata from a sequence of nodes.

Parameters

nodes (Sequence[BaseNode]) – nodes to extract metadata from

process_nodes(nodes: List[BaseNode], excluded_embed_metadata_keys: Optional[List[str]] = None, excluded_llm_metadata_keys: Optional[List[str]] = None) List[BaseNode]

Post process nodes parsed from documents.

Allows extractors to be chained.

Parameters
  • nodes (List[BaseNode]) – nodes to post-process

  • excluded_embed_metadata_keys (Optional[List[str]]) – keys to exclude from embed metadata

  • excluded_llm_metadata_keys (Optional[List[str]]) – keys to exclude from llm metadata

pydantic model llama_index.node_parser.extractors.metadata_extractors.SummaryExtractor

Summary extractor. Node-level extractor with adjacent sharing. Extracts section_summary, prev_section_summary, next_section_summary metadata fields.

Parameters
  • llm_predictor (Optional[BaseLLMPredictor]) – LLM predictor

  • summaries (List[str]) – list of summaries to extract: 'self', 'prev', 'next'

  • prompt_template (str) – template for summary extraction

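A usage sketch, assuming nodes is a sequence of parsed BaseNode objects and that a default LLM predictor is used when none is passed (llm_predictor is documented as optional):

from llama_index.node_parser.extractors.metadata_extractors import SummaryExtractor

# summaries controls which of section_summary, prev_section_summary,
# and next_section_summary are populated.
summary_extractor = SummaryExtractor(summaries=["prev", "self", "next"])
metadata_list = summary_extractor.extract(nodes)  # one metadata dict per node
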
Show JSON schema
{
   "title": "SummaryExtractor",
   "description": "Summary extractor. Node-level extractor with adjacent sharing.\nExtracts `section_summary`, `prev_section_summary`, `next_section_summary`\nmetadata fields\nArgs:\n    llm_predictor (Optional[BaseLLMPredictor]): LLM predictor\n    summaries (List[str]): list of summaries to extract: 'self', 'prev', 'next'\n    prompt_template (str): template for summary extraction",
   "type": "object",
   "properties": {
      "is_text_node_only": {
         "title": "Is Text Node Only",
         "default": true,
         "type": "boolean"
      },
      "show_progress": {
         "title": "Show Progress",
         "default": true,
         "type": "boolean"
      },
      "metadata_mode": {
         "default": "1",
         "allOf": [
            {
               "$ref": "#/definitions/MetadataMode"
            }
         ]
      },
      "llm_predictor": {
         "title": "Llm Predictor",
         "description": "The LLMPredictor to use for generation.",
         "allOf": [
            {
               "$ref": "#/definitions/BaseLLMPredictor"
            }
         ]
      },
      "summaries": {
         "title": "Summaries",
         "description": "List of summaries to extract: 'self', 'prev', 'next'",
         "type": "array",
         "items": {
            "type": "string"
         }
      },
      "prompt_template": {
         "title": "Prompt Template",
         "description": "Template to use when generating summaries.",
         "default": "Here is the content of the section:\n{context_str}\n\nSummarize the key topics and entities of the section. \nSummary: ",
         "type": "string"
      }
   },
   "required": [
      "llm_predictor",
      "summaries"
   ],
   "definitions": {
      "MetadataMode": {
         "title": "MetadataMode",
         "description": "An enumeration.",
         "enum": [
            "1",
            "2",
            "3",
            "4"
         ],
         "type": "string"
      },
      "BaseLLMPredictor": {
         "title": "BaseLLMPredictor",
         "description": "Base LLM Predictor.",
         "type": "object",
         "properties": {}
      }
   }
}

Config
  • arbitrary_types_allowed: bool = True

Fields
field llm_predictor: BaseLLMPredictor [Required]

The LLMPredictor to use for generation.

field prompt_template: str = 'Here is the content of the section:\n{context_str}\n\nSummarize the key topics and entities of the section. \nSummary: '

Template to use when generating summaries.

field summaries: List[str] [Required]

List of summaries to extract: 'self', 'prev', 'next'

classmethod class_name() str

Get class name.

extract(nodes: Sequence[BaseNode]) List[Dict]

Extracts metadata for a sequence of nodes, returning a list of metadata dictionaries corresponding to each node.

Parameters

nodes (Sequence[BaseNode]) – nodes to extract metadata from

pydantic model llama_index.node_parser.extractors.metadata_extractors.QuestionsAnsweredExtractor

Questions answered extractor. Node-level extractor. Extracts questions_this_excerpt_can_answer metadata field.

Parameters
  • llm_predictor (Optional[BaseLLMPredictor]) – LLM predictor

  • questions (int) – number of questions to extract

  • prompt_template (str) – template for question extraction

  • embedding_only (bool) – whether to use embedding only

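A usage sketch under the same assumptions as above (nodes already parsed, default LLM predictor available):

from llama_index.node_parser.extractors.metadata_extractors import (
    QuestionsAnsweredExtractor,
)

qa_extractor = QuestionsAnsweredExtractor(questions=3)
# One dict per node, each carrying questions_this_excerpt_can_answer.
metadata_list = qa_extractor.extract(nodes)
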
Show JSON schema
{
   "title": "QuestionsAnsweredExtractor",
   "description": "Questions answered extractor. Node-level extractor.\nExtracts `questions_this_excerpt_can_answer` metadata field.\nArgs:\n    llm_predictor (Optional[BaseLLMPredictor]): LLM predictor\n    questions (int): number of questions to extract\n    prompt_template (str): template for question extraction,\n    embedding_only (bool): whether to use embedding only",
   "type": "object",
   "properties": {
      "is_text_node_only": {
         "title": "Is Text Node Only",
         "default": true,
         "type": "boolean"
      },
      "show_progress": {
         "title": "Show Progress",
         "default": true,
         "type": "boolean"
      },
      "metadata_mode": {
         "default": "1",
         "allOf": [
            {
               "$ref": "#/definitions/MetadataMode"
            }
         ]
      },
      "llm_predictor": {
         "title": "Llm Predictor",
         "description": "The LLMPredictor to use for generation.",
         "allOf": [
            {
               "$ref": "#/definitions/BaseLLMPredictor"
            }
         ]
      },
      "questions": {
         "title": "Questions",
         "description": "The number of questions to generate.",
         "default": 5,
         "type": "integer"
      },
      "prompt_template": {
         "title": "Prompt Template",
         "description": "Prompt template to use when generating questions.",
         "default": "Here is the context:\n{context_str}\n\nGiven the contextual information, generate {num_questions} questions this context can provide specific answers to which are unlikely to be found elsewhere.\n\nHigher-level summaries of surrounding context may be provided as well. Try using these summaries to generate better questions that this context can answer.\n\n",
         "type": "string"
      },
      "embedding_only": {
         "title": "Embedding Only",
         "description": "Whether to use metadata for emebddings only.",
         "default": true,
         "type": "boolean"
      }
   },
   "required": [
      "llm_predictor"
   ],
   "definitions": {
      "MetadataMode": {
         "title": "MetadataMode",
         "description": "An enumeration.",
         "enum": [
            "1",
            "2",
            "3",
            "4"
         ],
         "type": "string"
      },
      "BaseLLMPredictor": {
         "title": "BaseLLMPredictor",
         "description": "Base LLM Predictor.",
         "type": "object",
         "properties": {}
      }
   }
}

Config
  • arbitrary_types_allowed: bool = True

Fields
field embedding_only: bool = True

Whether to use metadata for embeddings only.

field llm_predictor: BaseLLMPredictor [Required]

The LLMPredictor to use for generation.

field prompt_template: str = 'Here is the context:\n{context_str}\n\nGiven the contextual information, generate {num_questions} questions this context can provide specific answers to which are unlikely to be found elsewhere.\n\nHigher-level summaries of surrounding context may be provided as well. Try using these summaries to generate better questions that this context can answer.\n\n'

Prompt template to use when generating questions.

field questions: int = 5

The number of questions to generate.

classmethod class_name() str

Get class name.

extract(nodes: Sequence[BaseNode]) List[Dict]

Extracts metadata for a sequence of nodes, returning a list of metadata dictionaries corresponding to each node.

Parameters

nodes (Sequence[BaseNode]) – nodes to extract metadata from

pydantic model llama_index.node_parser.extractors.metadata_extractors.TitleExtractor

Title extractor. Useful for long documents. Extracts document_title metadata field.

Parameters
  • llm_predictor (Optional[BaseLLMPredictor]) – LLM predictor

  • nodes (int) – number of nodes from front to use for title extraction

  • node_template (str) – template for node-level title clues extraction

  • combine_template (str) – template for combining node-level clues into a document-level title

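A usage sketch (nodes already parsed, default LLM predictor assumed):

from llama_index.node_parser.extractors.metadata_extractors import TitleExtractor

# Use title clues from the first 5 nodes to derive a document-level title.
title_extractor = TitleExtractor(nodes=5)
metadata_list = title_extractor.extract(nodes)  # adds document_title
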
Show JSON schema
{
   "title": "TitleExtractor",
   "description": "Title extractor. Useful for long documents. Extracts `document_title`\nmetadata field.\nArgs:\n    llm_predictor (Optional[BaseLLMPredictor]): LLM predictor\n    nodes (int): number of nodes from front to use for title extraction\n    node_template (str): template for node-level title clues extraction\n    combine_template (str): template for combining node-level clues into\n        a document-level title",
   "type": "object",
   "properties": {
      "is_text_node_only": {
         "title": "Is Text Node Only",
         "default": false,
         "type": "boolean"
      },
      "show_progress": {
         "title": "Show Progress",
         "default": true,
         "type": "boolean"
      },
      "metadata_mode": {
         "default": "1",
         "allOf": [
            {
               "$ref": "#/definitions/MetadataMode"
            }
         ]
      },
      "llm_predictor": {
         "title": "Llm Predictor",
         "description": "The LLMPredictor to use for generation.",
         "allOf": [
            {
               "$ref": "#/definitions/BaseLLMPredictor"
            }
         ]
      },
      "nodes": {
         "title": "Nodes",
         "description": "The number of nodes to extract titles from.",
         "default": 5,
         "type": "integer"
      },
      "node_template": {
         "title": "Node Template",
         "description": "The prompt template to extract titles with.",
         "default": "Context: {context_str}. Give a title that summarizes all of the unique entities, titles or themes found in the context. Title: ",
         "type": "string"
      },
      "combine_template": {
         "title": "Combine Template",
         "description": "The prompt template to merge titles with.",
         "default": "{context_str}. Based on the above candidate titles and content, what is the comprehensive title for this document? Title: ",
         "type": "string"
      }
   },
   "required": [
      "llm_predictor"
   ],
   "definitions": {
      "MetadataMode": {
         "title": "MetadataMode",
         "description": "An enumeration.",
         "enum": [
            "1",
            "2",
            "3",
            "4"
         ],
         "type": "string"
      },
      "BaseLLMPredictor": {
         "title": "BaseLLMPredictor",
         "description": "Base LLM Predictor.",
         "type": "object",
         "properties": {}
      }
   }
}

Config
  • arbitrary_types_allowed: bool = True

Fields
field combine_template: str = '{context_str}. Based on the above candidate titles and content, what is the comprehensive title for this document? Title: '

The prompt template to merge titles with.

field is_text_node_only: bool = False
field llm_predictor: BaseLLMPredictor [Required]

The LLMPredictor to use for generation.

field node_template: str = 'Context: {context_str}. Give a title that summarizes all of the unique entities, titles or themes found in the context. Title: '

The prompt template to extract titles with.

field nodes: int = 5

The number of nodes to extract titles from.

classmethod class_name() str

Get class name.

extract(nodes: Sequence[BaseNode]) List[Dict]

Extracts metadata for a sequence of nodes, returning a list of metadata dictionaries corresponding to each node.

Parameters

nodes (Sequence[BaseNode]) – nodes to extract metadata from

pydantic model llama_index.node_parser.extractors.metadata_extractors.KeywordExtractor

Keyword extractor. Node-level extractor. Extracts excerpt_keywords metadata field.

Parameters
  • llm_predictor (Optional[BaseLLMPredictor]) – LLM predictor

  • keywords (int) – number of keywords to extract

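A usage sketch (nodes already parsed, default LLM predictor assumed):

from llama_index.node_parser.extractors.metadata_extractors import KeywordExtractor

keyword_extractor = KeywordExtractor(keywords=10)
metadata_list = keyword_extractor.extract(nodes)  # adds excerpt_keywords
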
Show JSON schema
{
   "title": "KeywordExtractor",
   "description": "Keyword extractor. Node-level extractor. Extracts\n`excerpt_keywords` metadata field.\nArgs:\n    llm_predictor (Optional[BaseLLMPredictor]): LLM predictor\n    keywords (int): number of keywords to extract",
   "type": "object",
   "properties": {
      "is_text_node_only": {
         "title": "Is Text Node Only",
         "default": true,
         "type": "boolean"
      },
      "show_progress": {
         "title": "Show Progress",
         "default": true,
         "type": "boolean"
      },
      "metadata_mode": {
         "default": "1",
         "allOf": [
            {
               "$ref": "#/definitions/MetadataMode"
            }
         ]
      },
      "llm_predictor": {
         "title": "Llm Predictor",
         "description": "The LLMPredictor to use for generation.",
         "allOf": [
            {
               "$ref": "#/definitions/BaseLLMPredictor"
            }
         ]
      },
      "keywords": {
         "title": "Keywords",
         "description": "The number of keywords to extract.",
         "default": 5,
         "type": "integer"
      }
   },
   "required": [
      "llm_predictor"
   ],
   "definitions": {
      "MetadataMode": {
         "title": "MetadataMode",
         "description": "An enumeration.",
         "enum": [
            "1",
            "2",
            "3",
            "4"
         ],
         "type": "string"
      },
      "BaseLLMPredictor": {
         "title": "BaseLLMPredictor",
         "description": "Base LLM Predictor.",
         "type": "object",
         "properties": {}
      }
   }
}

Config
  • arbitrary_types_allowed: bool = True

Fields
field keywords: int = 5

The number of keywords to extract.

field llm_predictor: BaseLLMPredictor [Required]

The LLMPredictor to use for generation.

classmethod class_name() str

Get class name.

extract(nodes: Sequence[BaseNode]) List[Dict]

Extracts metadata for a sequence of nodes, returning a list of metadata dictionaries corresponding to each node.

Parameters

nodes (Sequence[BaseNode]) – nodes to extract metadata from

pydantic model llama_index.node_parser.extractors.metadata_extractors.EntityExtractor

Entity extractor. Extracts entities into a metadata field using a default model tomaarsen/span-marker-mbert-base-multinerd and the SpanMarker library.

Install SpanMarker with pip install span-marker.

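A usage sketch; this extractor needs no LLM predictor, but it requires the span-marker package and downloads the model on first use:

from llama_index.node_parser.extractors.metadata_extractors import EntityExtractor

entity_extractor = EntityExtractor(
    prediction_threshold=0.6,  # keep predictions with confidence >= 0.6
    label_entities=True,       # include entity class labels
    device="cpu",              # or "cuda" if a GPU is available
)
metadata_list = entity_extractor.extract(nodes)  # nodes: parsed BaseNode objects
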
Show JSON schema
{
   "title": "EntityExtractor",
   "description": "Entity extractor. Extracts `entities` into a metadata field using a default model\n`tomaarsen/span-marker-mbert-base-multinerd` and the SpanMarker library.\n\nInstall SpanMarker with `pip install span-marker`.",
   "type": "object",
   "properties": {
      "is_text_node_only": {
         "title": "Is Text Node Only",
         "default": true,
         "type": "boolean"
      },
      "show_progress": {
         "title": "Show Progress",
         "default": true,
         "type": "boolean"
      },
      "metadata_mode": {
         "default": "1",
         "allOf": [
            {
               "$ref": "#/definitions/MetadataMode"
            }
         ]
      },
      "model_name": {
         "title": "Model Name",
         "description": "The model name of the SpanMarker model to use.",
         "default": "tomaarsen/span-marker-mbert-base-multinerd",
         "type": "string"
      },
      "prediction_threshold": {
         "title": "Prediction Threshold",
         "description": "The confidence threshold for accepting predictions.",
         "default": 0.5,
         "type": "number"
      },
      "span_joiner": {
         "title": "Span Joiner",
         "description": "The seperator beween entity names.",
         "type": "string"
      },
      "label_entities": {
         "title": "Label Entities",
         "description": "Include entity class labels or not.",
         "default": false,
         "type": "boolean"
      },
      "device": {
         "title": "Device",
         "description": "Device to run model on, i.e. 'cuda', 'cpu'",
         "type": "string"
      },
      "entity_map": {
         "title": "Entity Map",
         "description": "Mapping of entity class names to usable names.",
         "type": "object",
         "additionalProperties": {
            "type": "string"
         }
      }
   },
   "required": [
      "span_joiner"
   ],
   "definitions": {
      "MetadataMode": {
         "title": "MetadataMode",
         "description": "An enumeration.",
         "enum": [
            "1",
            "2",
            "3",
            "4"
         ],
         "type": "string"
      }
   }
}

Config
  • arbitrary_types_allowed: bool = True

Fields
field device: Optional[str] = None

Device to run model on, e.g. 'cuda', 'cpu'.

field entity_map: Dict[str, str] [Optional]

Mapping of entity class names to usable names.

field label_entities: bool = False

Include entity class labels or not.

field model_name: str = 'tomaarsen/span-marker-mbert-base-multinerd'

The model name of the SpanMarker model to use.

field prediction_threshold: float = 0.5

The confidence threshold for accepting predictions.

field span_joiner: str [Required]

The separator between entity names.

classmethod class_name() str

Get class name.

extract(nodes: Sequence[BaseNode]) List[Dict]

Extracts metadata for a sequence of nodes, returning a list of metadata dictionaries corresponding to each node.

Parameters

nodes (Sequence[BaseNode]) – nodes to extract metadata from

pydantic model llama_index.node_parser.extractors.metadata_extractors.MetadataFeatureExtractor

Base interface for feature extractor.

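Custom extractors subclass this interface and implement extract. A hypothetical minimal example that records each node's character count:

from typing import Dict, List, Sequence

from llama_index.node_parser.extractors.metadata_extractors import (
    MetadataFeatureExtractor,
)
from llama_index.schema import BaseNode

class CharCountExtractor(MetadataFeatureExtractor):
    """Hypothetical extractor: records each node's character count."""

    @classmethod
    def class_name(cls) -> str:
        return "CharCountExtractor"

    def extract(self, nodes: Sequence[BaseNode]) -> List[Dict]:
        # One metadata dict per node, in order.
        return [{"char_count": len(node.get_content())} for node in nodes]
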
Show JSON schema
{
   "title": "MetadataFeatureExtractor",
   "description": "Base interface for feature extractor.",
   "type": "object",
   "properties": {
      "is_text_node_only": {
         "title": "Is Text Node Only",
         "default": true,
         "type": "boolean"
      },
      "show_progress": {
         "title": "Show Progress",
         "default": true,
         "type": "boolean"
      },
      "metadata_mode": {
         "default": "1",
         "allOf": [
            {
               "$ref": "#/definitions/MetadataMode"
            }
         ]
      }
   },
   "definitions": {
      "MetadataMode": {
         "title": "MetadataMode",
         "description": "An enumeration.",
         "enum": [
            "1",
            "2",
            "3",
            "4"
         ],
         "type": "string"
      }
   }
}

Config
  • arbitrary_types_allowed: bool = True

Fields
field is_text_node_only: bool = True
field metadata_mode: MetadataMode = MetadataMode.ALL
field show_progress: bool = True
abstract extract(nodes: Sequence[BaseNode]) List[Dict]

Extracts metadata for a sequence of nodes, returning a list of metadata dictionaries corresponding to each node.

Parameters

nodes (Sequence[BaseNode]) – nodes to extract metadata from