
TextFileToDocument

Converts text files to documents.

Most common position in a pipeline: Before PreProcessors, or right at the beginning of an indexing pipeline
Mandatory run variables: "sources": A list of paths to the text files you want to convert
Output variables: "documents": A list of documents
API reference: Converters
GitHub link: https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/txt.py

Overview

The TextFileToDocument component converts text files into documents. You can use it in an indexing pipeline to index the contents of text files into a Document Store. It takes a list of file paths or ByteStream objects as input and outputs the converted result as a list of documents. Optionally, you can attach metadata to the documents through the meta input parameter.

When you initialize the component, you can optionally set the default encoding of the text files through the encoding parameter. If you don't provide any value, the component uses "utf-8" by default. Note that if the encoding is specified in the metadata of an input ByteStream, it will override this parameter's setting.

Usage

On its own

from pathlib import Path
from haystack.components.converters import TextFileToDocument

converter = TextFileToDocument()

results = converter.run(sources=[Path("my_file.txt")])
documents = results["documents"]

In a pipeline

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", TextFileToDocument())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=5))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "writer")

file_names = ["my_file.txt"]
pipeline.run({"converter": {"sources": file_names}})

Additional References

📓 Tutorial: Preprocessing Different File Types