Finetuning

Finetuning modules.

class llama_index.finetuning.EmbeddingAdapterFinetuneEngine(dataset: EmbeddingQAFinetuneDataset, embed_model: BaseEmbedding, batch_size: int = 10, epochs: int = 1, adapter_model: Optional[Any] = None, dim: Optional[int] = None, device: Optional[str] = None, model_output_path: str = 'model_output', model_checkpoint_path: Optional[str] = None, checkpoint_save_steps: int = 100, verbose: bool = False, bias: bool = False, **train_kwargs: Any)

Embedding adapter finetune engine.

Parameters
  • dataset (EmbeddingQAFinetuneDataset) – Dataset to finetune on.

  • embed_model (BaseEmbedding) – Embedding model to finetune.

  • batch_size (int) – Batch size. Defaults to 10.

  • epochs (int) – Number of epochs. Defaults to 1.

  • dim (Optional[int]) – Dimension of embedding. Defaults to None.

  • adapter_model (Optional[BaseAdapter]) – Adapter model. Defaults to None, in which case a linear adapter is used.

  • device (Optional[str]) – Device to use. Defaults to None.

  • model_output_path (str) – Path to save model output. Defaults to "model_output".

  • model_checkpoint_path (Optional[str]) – Path to save model checkpoints. Defaults to None (don't save checkpoints).

  • checkpoint_save_steps (int) – Number of steps between checkpoint saves. Defaults to 100.

  • verbose (bool) – Whether to show progress bar. Defaults to False.

  • bias (bool) – Whether to use bias. Defaults to False.
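To make the adapter_model and bias parameters concrete, here is a minimal sketch of what a linear adapter computes, assuming it is a single linear map W x (+ b) applied to an embedding produced by the unchanged base model. The class name LinearAdapterSketch, the identity initialization, and the pure-Python arithmetic are illustrative assumptions, not the library's implementation.

```python
from typing import List, Optional


class LinearAdapterSketch:
    """Hypothetical stand-in for a linear adapter: y = W x (+ b)."""

    def __init__(self, dim: int, bias: bool = False) -> None:
        # Assumption: start from the identity matrix so that the adapter
        # initially leaves embeddings unchanged.
        self.weight: List[List[float]] = [
            [1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)
        ]
        # The bias vector only exists when bias=True.
        self.bias: Optional[List[float]] = [0.0] * dim if bias else None

    def forward(self, embedding: List[float]) -> List[float]:
        # Matrix-vector product, one output component per weight row.
        out = [sum(w * x for w, x in zip(row, embedding)) for row in self.weight]
        if self.bias is not None:
            out = [o + b for o, b in zip(out, self.bias)]
        return out


adapter = LinearAdapterSketch(dim=3, bias=True)
unchanged = adapter.forward([0.1, 0.2, 0.3])

# Training would update weight and bias; here they are set by hand.
adapter.weight[0][1] = 0.5
adapter.bias[2] = 1.0
adapted = adapter.forward([1.0, 2.0, 3.0])
```

With bias=False the adapter reduces to the matrix product alone, so the origin of the embedding space always maps to itself.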

finetune(**train_kwargs: Any) None

Finetune.

classmethod from_model_path(dataset: EmbeddingQAFinetuneDataset, embed_model: BaseEmbedding, model_path: str, model_cls: Optional[Type[Any]] = None, **kwargs: Any) EmbeddingAdapterFinetuneEngine

Load from model path.

Parameters
  • dataset (EmbeddingQAFinetuneDataset) – Dataset to finetune on.

  • embed_model (BaseEmbedding) – Embedding model to finetune.

  • model_path (str) – Path to model.

  • model_cls (Optional[Type[Any]]) – Adapter model class. Defaults to None.

  • **kwargs (Any) – Additional kwargs (see __init__)

get_finetuned_model(**model_kwargs: Any) BaseEmbedding

Get finetuned model.

smart_batching_collate(batch: List) Tuple[Any, Any]

Smart batching collate.
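The methods listed for this class suggest a three-step workflow: construct the engine, call finetune(), then call get_finetuned_model(). The sketch below shows that call order using only the method names documented here. The helper run_finetune and the StubEngine class are hypothetical and exist only so that the example runs without llama_index installed; in real use, the engine class would be llama_index.finetuning.EmbeddingAdapterFinetuneEngine.

```python
from typing import Any, Type


def run_finetune(
    engine_cls: Type[Any], dataset: Any, embed_model: Any, **engine_kwargs: Any
) -> Any:
    """Construct an engine, train it, and return the finetuned model."""
    engine = engine_cls(dataset=dataset, embed_model=embed_model, **engine_kwargs)
    engine.finetune()
    return engine.get_finetuned_model()


class StubEngine:
    """Hypothetical stand-in that mimics the documented method names."""

    def __init__(self, dataset: Any, embed_model: Any, epochs: int = 1) -> None:
        self.embed_model = embed_model
        self.epochs = epochs
        self.trained = False

    def finetune(self) -> None:
        # A real engine would run the training loop here.
        self.trained = True

    def get_finetuned_model(self) -> str:
        state = "finetuned" if self.trained else "untrained"
        return f"{state}:{self.embed_model}:epochs={self.epochs}"


result = run_finetune(StubEngine, dataset={}, embed_model="base-embed", epochs=2)
```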

pydantic model llama_index.finetuning.EmbeddingQAFinetuneDataset

Embedding QA Finetuning Dataset.

Parameters
  • queries (Dict[str, str]) – Dict id -> query.

  • corpus (Dict[str, str]) – Dict id -> string.

  • relevant_docs (Dict[str, List[str]]) – Dict query id -> list of doc ids.

JSON schema
{
   "title": "EmbeddingQAFinetuneDataset",
   "description": "Embedding QA Finetuning Dataset.\n\nArgs:\n    queries (Dict[str, str]): Dict id -> query.\n    corpus (Dict[str, str]): Dict id -> string.\n    relevant_docs (Dict[str, List[str]]): Dict query id -> list of doc ids.",
   "type": "object",
   "properties": {
      "queries": {
         "title": "Queries",
         "type": "object",
         "additionalProperties": {
            "type": "string"
         }
      },
      "corpus": {
         "title": "Corpus",
         "type": "object",
         "additionalProperties": {
            "type": "string"
         }
      },
      "relevant_docs": {
         "title": "Relevant Docs",
         "type": "object",
         "additionalProperties": {
            "type": "array",
            "items": {
               "type": "string"
            }
         }
      }
   },
   "required": [
      "queries",
      "corpus",
      "relevant_docs"
   ]
}
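The schema checks only the types of the three fields. As an illustration of how the fields relate, the sketch below builds a payload of the same shape from plain dicts and checks that every id in relevant_docs refers to an existing query or corpus entry. The helper find_dangling_ids and the sample data are hypothetical; nothing here implies the model performs this cross-reference check itself.

```python
from typing import Dict, List

queries: Dict[str, str] = {
    "q1": "What does a linear adapter change?",
    "q2": "Where are checkpoints saved?",
}
corpus: Dict[str, str] = {
    "d1": "A linear adapter transforms embeddings with a learned matrix.",
    "d2": "Checkpoints are written to model_checkpoint_path when it is set.",
}
relevant_docs: Dict[str, List[str]] = {"q1": ["d1"], "q2": ["d2"]}


def find_dangling_ids(
    queries: Dict[str, str],
    corpus: Dict[str, str],
    relevant_docs: Dict[str, List[str]],
) -> List[str]:
    """Return a message for every id that relevant_docs uses but nothing defines."""
    problems: List[str] = []
    for query_id, doc_ids in relevant_docs.items():
        if query_id not in queries:
            problems.append(f"unknown query id: {query_id}")
        for doc_id in doc_ids:
            if doc_id not in corpus:
                problems.append(f"unknown doc id: {doc_id}")
    return problems


problems = find_dangling_ids(queries, corpus, relevant_docs)
broken = find_dangling_ids(queries, corpus, {"q3": ["d1", "d9"]})
```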

Fields
  • corpus (Dict[str, str])

  • queries (Dict[str, str])

  • relevant_docs (Dict[str, List[str]])

field corpus: Dict[str, str] [Required]

field queries: Dict[str, str] [Required]

field relevant_docs: Dict[str, List[str]] [Required]

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = 'allow' was set, since it adds all passed values.

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from new model, as with values this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

classmethod from_json(path: str) EmbeddingQAFinetuneDataset

Load the dataset from a JSON file.

classmethod from_orm(obj: Any) Model

json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model

classmethod parse_obj(obj: Any) Model

classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model

save_json(path: str) None

Save the dataset to a JSON file.
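As a rough picture of the file that save_json and from_json exchange, the sketch below round-trips a payload through a temporary file using only the standard library. It assumes the file holds the three required fields as one top-level JSON object, which matches the JSON schema but is not confirmed here; the file name and sample values are hypothetical.

```python
import json
import os
import tempfile

payload = {
    "queries": {"q1": "What does a linear adapter change?"},
    "corpus": {"d1": "A linear adapter transforms embeddings."},
    "relevant_docs": {"q1": ["d1"]},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_dataset.json")

    # Write the payload the way a save step might.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=4)

    # Read it back the way a load step might.
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

# Check that every field the schema marks as required is present.
required = ("queries", "corpus", "relevant_docs")
missing = [key for key in required if key not in loaded]
```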

classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny

classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode

classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model

class llama_index.finetuning.OpenAIFinetuneEngine(base_model: str, data_path: str, verbose: bool = False, start_job_id: Optional[str] = None)

OpenAI Finetuning Engine.

finetune() None

Finetune model.

classmethod from_finetuning_handler(finetuning_handler: OpenAIFineTuningHandler, base_model: str, data_path: str, **kwargs: Any) OpenAIFinetuneEngine

Initialize from finetuning handler.

Used to finetune one OpenAI model on data collected from another (e.g. finetuning gpt-3.5-turbo on GPT-4 outputs).

get_current_job() Any

Get current job.

get_finetuned_model(**model_kwargs: Any) LLM

Gets finetuned model.

class llama_index.finetuning.SentenceTransformersFinetuneEngine(dataset: EmbeddingQAFinetuneDataset, model_id: str = 'BAAI/bge-small-en', model_output_path: str = 'exp_finetune', batch_size: int = 10, val_dataset: Optional[EmbeddingQAFinetuneDataset] = None, loss: Optional[Any] = None, epochs: int = 2, show_progress_bar: bool = True, evaluation_steps: int = 50)

Sentence Transformers Finetune Engine.

finetune(**train_kwargs: Any) None

Finetune model.

get_finetuned_model(**model_kwargs: Any) BaseEmbedding

Gets finetuned model.

llama_index.finetuning.generate_qa_embedding_pairs(nodes: List[TextNode], llm: Optional[LLM] = None, qa_generate_prompt_tmpl: str = 'Context information is below.\n\n---------------------\n{context_str}\n---------------------\n\nGiven the context information and not prior knowledge.\ngenerate only questions based on the below query.\n\nYou are a Teacher/ Professor. Your task is to setup {num_questions_per_chunk} questions for an upcoming quiz/examination. The questions should be diverse in nature across the document. Restrict the questions to the context information provided."\n', num_questions_per_chunk: int = 2) EmbeddingQAFinetuneDataset

Generate examples given a set of nodes.
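To illustrate how the default qa_generate_prompt_tmpl and num_questions_per_chunk fit together, the sketch below fills the template for each node, passes it to a stand-in completion function, and collects the returned questions into the three dataset fields. The helpers parse_questions and build_qa_pairs, the fake_llm function, the id scheme, and the numbered-list response format are all assumptions made for illustration; the library's own parsing and id generation may differ.

```python
import re
from typing import Callable, Dict, List

# Default template copied from the signature above.
QA_GENERATE_PROMPT_TMPL = (
    "Context information is below.\n\n"
    "---------------------\n{context_str}\n---------------------\n\n"
    "Given the context information and not prior knowledge.\n"
    "generate only questions based on the below query.\n\n"
    "You are a Teacher/ Professor. Your task is to setup "
    "{num_questions_per_chunk} questions for an upcoming quiz/examination. "
    "The questions should be diverse in nature across the document. "
    "Restrict the questions to the context information provided.\"\n"
)


def parse_questions(response_text: str) -> List[str]:
    """Split a response into lines and strip leading numbering like '1.' or '2)'."""
    lines = response_text.strip().split("\n")
    cleaned = [re.sub(r"^\d+[\).\s]", "", line).strip() for line in lines]
    return [question for question in cleaned if question]


def build_qa_pairs(
    nodes: Dict[str, str],
    complete: Callable[[str], str],
    num_questions_per_chunk: int = 2,
) -> Dict[str, dict]:
    """Build queries / corpus / relevant_docs from node id -> text."""
    queries: Dict[str, str] = {}
    relevant_docs: Dict[str, List[str]] = {}
    for node_id, text in nodes.items():
        prompt = QA_GENERATE_PROMPT_TMPL.format(
            context_str=text, num_questions_per_chunk=num_questions_per_chunk
        )
        for i, question in enumerate(parse_questions(complete(prompt))):
            query_id = f"{node_id}-q{i}"  # hypothetical id scheme
            queries[query_id] = question
            relevant_docs[query_id] = [node_id]
    return {"queries": queries, "corpus": dict(nodes), "relevant_docs": relevant_docs}


def fake_llm(prompt: str) -> str:
    """Stand-in for an LLM call; returns a fixed numbered list."""
    return "1. What is an adapter?\n2) Why freeze the base model?\n"


nodes = {"n1": "Adapters train a small layer on top of a frozen model."}
dataset = build_qa_pairs(nodes, fake_llm)
```

Each generated question is linked only to the node it was generated from, so relevant_docs holds one doc id per query in this sketch.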