
FileTypeRouter

Use this Router in indexing pipelines to route file paths or byte streams based on their type to different outputs for further processing.

Name: FileTypeRouter
Folder Path: /routers/
Position in a Pipeline: As the first component preprocessing data, followed by File Converters
Input Names and Types: "sources": List of file paths or byte streams to categorize
Output Names and Types: "unclassified": List of uncategorized file paths or byte streams
One output per type listed in "mime_types", for example "text/plain", "text/html", "application/pdf", "text/markdown", "audio/x-wav", "image/jpeg": List of categorized file paths or byte streams

Overview

FileTypeRouter routes file paths or byte streams based on their type, for example, plain text, jpeg image, or audio wave. For file paths, it infers MIME types from their extensions, while for byte streams, it determines MIME types based on the provided metadata.
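To make extension-based inference concrete, here is a minimal sketch that uses Python's standard `mimetypes` module. It only illustrates the idea: the helper name `guess_mime_type` is hypothetical, and the component's actual lookup may differ in detail.

```python
import mimetypes
from pathlib import Path
from typing import Optional


def guess_mime_type(path: str) -> Optional[str]:
    """Return the MIME type implied by the file extension, or None if unknown.

    Hypothetical helper for illustration; not part of the Haystack API.
    """
    # guess_type returns a (type, encoding) tuple; only the type matters here.
    mime_type, _encoding = mimetypes.guess_type(Path(path).name)
    return mime_type


# Only the extension is inspected; the files do not need to exist.
for name in ["notes.txt", "report.pdf", "photo.jpeg", "no_extension"]:
    print(name, "->", guess_mime_type(name))
```

A path without a recognizable extension yields `None` in this sketch, which corresponds to a source that cannot be matched to any listed type.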

When initializing the component, you specify the set of MIME types to route to separate outputs. To do this, set the mime_types parameter to a list of types, for example: ["text/plain", "audio/x-wav", "image/jpeg"]. Types that are not listed are routed to an output named “unclassified”.
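The routing rule described above can be sketched in plain Python. This is an illustration under stated assumptions, not the component's implementation: `route_sources` and `StreamStandIn` are hypothetical names, and the `"content_type"` metadata key is assumed only for the example.

```python
import mimetypes
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class StreamStandIn:
    """Hypothetical stand-in for a byte stream that carries its type in metadata."""
    data: bytes
    meta: Dict[str, str] = field(default_factory=dict)


Source = Union[str, StreamStandIn]


def route_sources(sources: List[Source], mime_types: List[str]) -> Dict[str, List[Source]]:
    """Group sources by MIME type; unlisted or unknown types go to "unclassified"."""
    routed: Dict[str, List[Source]] = defaultdict(list)
    for source in sources:
        if isinstance(source, StreamStandIn):
            # Assumption: the stream's type is stored under a "content_type" key.
            mime_type = source.meta.get("content_type")
        else:
            # For file paths, infer the type from the extension.
            mime_type = mimetypes.guess_type(source)[0]
        key = mime_type if mime_type in mime_types else "unclassified"
        routed[key].append(source)
    return dict(routed)


jpeg_stream = StreamStandIn(b"\xff\xd8", meta={"content_type": "image/jpeg"})
result = route_sources(
    ["a.txt", "b.pdf", jpeg_stream],
    mime_types=["text/plain", "image/jpeg"],
)
print(sorted(result))
```

Because "application/pdf" is not in the list passed as `mime_types`, `b.pdf` lands under "unclassified", while the text file and the JPEG stream each get an output keyed by their MIME type.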

Usage

On its own

Below is an example that uses the FileTypeRouter to route two file paths by type:

from haystack.components.routers import FileTypeRouter

# Only "text/plain" gets its own output; every other type goes to "unclassified".
router = FileTypeRouter(mime_types=["text/plain"])
router.run(sources=["text-file-will-be-added.txt", "pdf-will-not-be-added.pdf"])

In a Pipeline

Below is an example of a pipeline that uses a FileTypeRouter to forward only plain text files to a TextFileToDocument converter, followed by a DocumentSplitter and a DocumentWriter. Only the content of plain text files is written to the InMemoryDocumentStore. Files of any other type go to the "unclassified" output, which is not connected to anything in this pipeline. As an alternative, you could add a PyPDFConverter to the pipeline and use the FileTypeRouter to route PDFs to it so that it converts them to Documents.

from haystack import Pipeline
from haystack.components.routers import FileTypeRouter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.file_converters import TextFileToDocument
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()
p = Pipeline()
p.add_component(instance=FileTypeRouter(mime_types=["text/plain"]), name="file_type_router")
p.add_component(instance=TextFileToDocument(), name="text_file_converter")
p.add_component(instance=DocumentSplitter(), name="splitter")
p.add_component(instance=DocumentWriter(document_store=document_store), name="writer")
# The router exposes one output per configured MIME type, named after the type itself.
p.connect("file_type_router.text/plain", "text_file_converter.sources")
p.connect("text_file_converter.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")
# The PDF path ends up in the unconnected "unclassified" output and is not indexed.
p.run({"file_type_router": {"sources": ["text-file-will-be-added.txt", "pdf-will-not-be-added.pdf"]}})

Related Links

See the parameter details in our API reference: