TextFileToDocument
Converts text files to documents.
Most common position in a pipeline | Before PreProcessors or right at the beginning of an indexing pipeline |
Mandatory run variables | "sources": A list of paths to text files or ByteStream objects you want to convert |
Output variables | "documents": A list of documents |
API reference | Converters |
GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/txt.py |
Overview
The TextFileToDocument component converts text files into documents. You can use it in an indexing pipeline to index the contents of text files into a Document Store. It takes a list of file paths or ByteStream objects as input and outputs the converted result as a list of documents. Optionally, you can attach metadata to the documents through the meta input parameter.
When you initialize the component, you can optionally set the default encoding of the text files through the encoding parameter. If you don't provide any value, the component uses "utf-8" by default. Note that if the encoding is specified in the metadata of an input ByteStream, it will override this parameter's setting.
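The precedence rule described above can be illustrated with a small standard-library sketch. This is not Haystack's code: FakeByteStream and decode_sources are hypothetical stand-ins that only model the rule "use the encoding from the source's metadata if present, otherwise fall back to the component's default".

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FakeByteStream:
    """Hypothetical stand-in for a ByteStream: raw bytes plus a metadata dict."""

    data: bytes
    meta: Dict[str, Any] = field(default_factory=dict)


def decode_sources(sources: List[FakeByteStream], default_encoding: str = "utf-8") -> List[str]:
    """Decode each source, letting meta["encoding"] override the default."""
    texts = []
    for source in sources:
        # Per-source metadata wins; the default applies only when it is absent.
        encoding = source.meta.get("encoding", default_encoding)
        texts.append(source.data.decode(encoding))
    return texts


sources = [
    # No encoding in the metadata: decoded with the default ("utf-8").
    FakeByteStream(data="plain text".encode("utf-8")),
    # Latin-1 bytes that are not valid UTF-8: the metadata entry makes decoding succeed.
    FakeByteStream(data="café".encode("latin-1"), meta={"encoding": "latin-1"}),
]

texts = decode_sources(sources)
print(texts)
```

With the real component, the equivalent setup would be a ByteStream whose meta dictionary contains an "encoding" key, passed to a converter that was initialized with a different default.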
Usage
On its own
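from pathlib import Path
from haystack.components.converters import TextFileToDocument

converter = TextFileToDocument()
# run() returns a dictionary; the converted documents are under the "documents" key
result = converter.run(sources=[Path("my_file.txt")])
docs = result["documents"]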
In a pipeline
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
document_store = InMemoryDocumentStore()
pipeline = Pipeline()
pipeline.add_component("converter", TextFileToDocument())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=5))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "writer")
# Paths to the text files to index (placeholder file name)
file_names = ["my_file.txt"]
pipeline.run({"converter": {"sources": file_names}})
Additional References
📓 Tutorial: Preprocessing Different File Types