Open In Colab

Simple Directory Reader#

The SimpleDirectoryReader is the most commonly used data connector that just works.
Simply pass in a input directory or a list of files.
It will select the best file reader based on the file extensions.

Get Started#

If you’re opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.

!pip install llama-index

Download Data

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay1.txt'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay2.txt'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay3.txt'
Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/home/loganm/.wget-hsts'. HSTS will be disabled.
--2023-12-21 16:34:53--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay1.txt’

data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.04s   

2023-12-21 16:34:53 (1.81 MB/s) - ‘data/paul_graham/paul_graham_essay1.txt’ saved [75042/75042]

Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/home/loganm/.wget-hsts'. HSTS will be disabled.
--2023-12-21 16:34:53--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay2.txt’

data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.04s   

2023-12-21 16:34:53 (1.78 MB/s) - ‘data/paul_graham/paul_graham_essay2.txt’ saved [75042/75042]

Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/home/loganm/.wget-hsts'. HSTS will be disabled.
--2023-12-21 16:34:54--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay3.txt’

data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.04s   

2023-12-21 16:34:54 (1.80 MB/s) - ‘data/paul_graham/paul_graham_essay3.txt’ saved [75042/75042]
from llama_index import SimpleDirectoryReader

Load specific files

reader = SimpleDirectoryReader(
    input_files=["./data/paul_graham/paul_graham_essay1.txt"]
)
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")
Loaded 1 docs

Load all (top-level) files from directory

reader = SimpleDirectoryReader(input_dir="./data/paul_graham/")
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")
Loaded 3 docs

Load all (recursive) files from directory

!mkdir -p 'data/paul_graham/nested'
!echo "This is a nested file" > 'data/paul_graham/nested/nested_file.md'
# only load markdown files
required_exts = [".md"]

reader = SimpleDirectoryReader(
    input_dir="./data",
    required_exts=required_exts,
    recursive=True,
)
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")
Loaded 1 docs

Create an iterator to load files and process them as they load

reader = SimpleDirectoryReader(
    input_dir="./data",
    recursive=True,
)

all_docs = []
for docs in reader.iter_data():
    for doc in docs:
        # do something with the doc
        doc.text = doc.text.upper()
        all_docs.append(doc)

print(len(all_docs))
4

Full Configuration#

This is the full list of arguments that can be passed to the SimpleDirectoryReader:

class SimpleDirectoryReader(BaseReader):
    """Simple directory reader.

    Load files from file directory. 
    Automatically select the best file reader given file extensions.


    Args:
        input_dir (str): Path to the directory.
        input_files (List): List of file paths to read
            (Optional; overrides input_dir, exclude)
        exclude (List): glob of python file paths to exclude (Optional)
        exclude_hidden (bool): Whether to exclude hidden files (dotfiles).
        encoding (str): Encoding of the files.
            Default is utf-8.
        errors (str): how encoding and decoding errors are to be handled,
                see https://docs.python.org/3/library/functions.html#open
        recursive (bool): Whether to recursively search in subdirectories.
            False by default.
        filename_as_id (bool): Whether to use the filename as the document id.
            False by default.
        required_exts (Optional[List[str]]): List of required extensions.
            Default is None.
        file_extractor (Optional[Dict[str, BaseReader]]): A mapping of file
            extension to a BaseReader class that specifies how to convert that file
            to text. If not specified, use default from DEFAULT_FILE_READER_CLS.
        num_files_limit (Optional[int]): Maximum number of files to read.
            Default is None.
        file_metadata (Optional[Callable[str, Dict]]): A function that takes
            in a filename and returns a Dict of metadata for the Document.
            Default is None.
"""