BeautifulSoupWebReader

pydantic model llama_index.readers.BeautifulSoupWebReader

BeautifulSoup web page reader.

Reads pages from the web. Requires the bs4 and urllib packages.

Parameters

website_extractor (Optional[Dict[str, Callable]]) – A mapping of website hostname (e.g. google.com) to a function that specifies how to extract text from the BeautifulSoup obj. See DEFAULT_WEBSITE_EXTRACTOR.
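
A custom extractor can be registered per hostname. Below is a minimal sketch, assuming the (text, metadata) return convention used by the built-in extractors in DEFAULT_WEBSITE_EXTRACTOR; the example.com hostname and the tag selectors are illustrative, not part of the library:

from typing import Any, Dict, Tuple

from bs4 import BeautifulSoup

from llama_index.readers import BeautifulSoupWebReader

def extract_article(soup: BeautifulSoup, **kwargs: Any) -> Tuple[str, Dict[str, Any]]:
    # Hypothetical extractor: prefer the <article> tag, fall back to the whole page.
    article = soup.find("article")
    text = article.get_text() if article else soup.get_text()
    # Extra metadata is attached to the resulting Document.
    extra_info = {"title": soup.title.get_text() if soup.title else ""}
    return text, extra_info

reader = BeautifulSoupWebReader(
    website_extractor={"example.com": extract_article}
)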

JSON schema
{
   "title": "BeautifulSoupWebReader",
   "description": "BeautifulSoup web page reader.\n\nReads pages from the web.\nRequires the `bs4` and `urllib` packages.\n\nArgs:\n    website_extractor (Optional[Dict[str, Callable]]): A mapping of website\n        hostname (e.g. google.com) to a function that specifies how to\n        extract text from the BeautifulSoup obj. See DEFAULT_WEBSITE_EXTRACTOR.",
   "type": "object",
   "properties": {
      "is_remote": {
         "title": "Is Remote",
         "default": true,
         "type": "boolean"
      },
      "class_name": {
         "title": "Class Name",
         "type": "string",
         "default": "BeautifulSoupWebReader"
      }
   }
}

Config
  • arbitrary_types_allowed: bool = True

Fields
  • is_remote (bool)

field is_remote: bool = True
classmethod class_name() → str

Get the class name, used as a unique ID in serialization.

This provides a key that makes serialization robust against actual class name changes.
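
The identifier matches the class_name default shown in the JSON schema above:

from llama_index.readers import BeautifulSoupWebReader

assert BeautifulSoupWebReader.class_name() == "BeautifulSoupWebReader"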

load_data(urls: List[str], custom_hostname: Optional[str] = None) → List[Document]

Load data from the given URLs.

Parameters
  • urls (List[str]) – List of URLs to scrape.

  • custom_hostname (Optional[str]) – Force a certain hostname in cases where a website is displayed under a custom URL (e.g., Substack blogs).

Returns

List of documents.

Return type

List[Document]
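
A minimal usage sketch. The URLs are placeholders, and custom_hostname is only needed when a site is served under a custom domain; the substack.com key is assumed to be present in DEFAULT_WEBSITE_EXTRACTOR, as suggested by the Substack example above:

from llama_index.readers import BeautifulSoupWebReader

reader = BeautifulSoupWebReader()

# Each URL yields one Document.
documents = reader.load_data(urls=["https://example.com/post"])

# A Substack blog served under a custom domain: force the substack.com
# extractor by overriding the hostname lookup.
documents = reader.load_data(
    urls=["https://blog.example.com/p/some-post"],
    custom_hostname="substack.com",
)

print(len(documents))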