BeautifulSoupWebReader

pydantic model llama_index.readers.BeautifulSoupWebReader

BeautifulSoup web page reader.

Reads pages from the web. Requires the bs4 and urllib packages.

Parameters

website_extractor (Optional[Dict[str, Callable]]) – A mapping of website hostname (e.g. google.com) to a function that specifies how to extract text from the BeautifulSoup obj. See DEFAULT_WEBSITE_EXTRACTOR.
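
A custom extractor can be registered per hostname. Below is a minimal sketch, assuming the (text, metadata) return convention used by the built-in extractors in DEFAULT_WEBSITE_EXTRACTOR; the example.com hostname and the tag selectors are illustrative, not part of the library:

from typing import Any, Dict, Tuple

from bs4 import BeautifulSoup

from llama_index.readers import BeautifulSoupWebReader

def extract_article(soup: BeautifulSoup, **kwargs: Any) -> Tuple[str, Dict[str, Any]]:
    # Hypothetical extractor: prefer the <article> tag, fall back to the whole page.
    article = soup.find("article")
    text = article.get_text() if article else soup.get_text()
    # Extra metadata is attached to the resulting Document.
    extra_info = {"title": soup.title.get_text() if soup.title else ""}
    return text, extra_info

reader = BeautifulSoupWebReader(
    website_extractor={"example.com": extract_article}
)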

JSON schema
{
   "title": "BeautifulSoupWebReader",
   "description": "BeautifulSoup web page reader.\n\nReads pages from the web.\nRequires the `bs4` and `urllib` packages.\n\nArgs:\n    website_extractor (Optional[Dict[str, Callable]]): A mapping of website\n        hostname (e.g. google.com) to a function that specifies how to\n        extract text from the BeautifulSoup obj. See DEFAULT_WEBSITE_EXTRACTOR.",
   "type": "object",
   "properties": {
      "is_remote": {
         "title": "Is Remote",
         "default": true,
         "type": "boolean"
      },
      "class_name": {
         "title": "Class Name",
         "type": "string",
         "default": "BeautifulSoupWebReader"
      }
   }
}

Config
  • arbitrary_types_allowed: bool = True

Fields
  • is_remote (bool)

field is_remote: bool = True
classmethod class_name() → str

Get the class name, used as a unique ID in serialization.

This provides a key that makes serialization robust against actual class name changes.
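
The identifier matches the class_name default shown in the JSON schema above:

from llama_index.readers import BeautifulSoupWebReader

assert BeautifulSoupWebReader.class_name() == "BeautifulSoupWebReader"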

load_data(urls: List[str], custom_hostname: Optional[str] = None) → List[Document]

Load data from the given URLs.

Parameters
  • urls (List[str]) – List of URLs to scrape.

  • custom_hostname (Optional[str]) – Force a certain hostname in cases where a website is displayed under a custom URL (e.g., Substack blogs).

Returns

List of documents.

Return type

List[Document]
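
A minimal usage sketch. The URLs are placeholders, and custom_hostname is only needed when a site is served under a custom domain; the substack.com key is assumed to be present in DEFAULT_WEBSITE_EXTRACTOR, as suggested by the Substack example above:

from llama_index.readers import BeautifulSoupWebReader

reader = BeautifulSoupWebReader()

# Each URL yields one Document.
documents = reader.load_data(urls=["https://example.com/post"])

# A Substack blog served under a custom domain: force the substack.com
# extractor by overriding the hostname lookup.
documents = reader.load_data(
    urls=["https://blog.example.com/p/some-post"],
    custom_hostname="substack.com",
)

print(len(documents))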