FileClassifier

A File Classifier distinguishes between text, PDF, Markdown, Docx and HTML files and routes them to the appropriate FileConverter in an indexing pipeline.

Position in a Pipeline: At the very beginning of an indexing pipeline
Input: File name
Output: File name (routed)
Classes: FileTypeClassifier

Usage

By default, the FileTypeClassifier has 6 outgoing edges. It routes each incoming file through one of these to a FileConverter, which then converts it into Documents.

These are the default outgoing edges of the File Classifier:

Outgoing Edge | File Type
1 | Text
2 | PDF
3 | Markdown
4 | Docx
5 | HTML
6 | Media

The supported media types are: mp3, mp4, mpeg, m4a, wav, and webm.
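
The edge table above can be read as a lookup from file extension to output edge. The following is a minimal sketch of that idea using only the standard library, not the FileTypeClassifier implementation. The names EDGE_BY_TYPE, MEDIA_EXTENSIONS, edge_for, and output_name are hypothetical, and the extension spellings for the non-media types (txt, pdf, md, docx, html) are assumptions.

```python
from pathlib import Path

# Hypothetical lookup mirroring the default edge table above.
# The extension spellings are assumptions, not taken from the library.
EDGE_BY_TYPE = {"txt": 1, "pdf": 2, "md": 3, "docx": 4, "html": 5, "media": 6}

# Media extensions as listed in the text above.
MEDIA_EXTENSIONS = {"mp3", "mp4", "mpeg", "m4a", "wav", "webm"}


def edge_for(file_path: str) -> int:
    """Return the 1-based outgoing edge for a file, judged by extension only."""
    extension = Path(file_path).suffix.lower().lstrip(".")
    if extension in MEDIA_EXTENSIONS:
        # All media extensions share a single edge.
        extension = "media"
    if extension not in EDGE_BY_TYPE:
        raise ValueError(f"Unsupported file type: {file_path!r}")
    return EDGE_BY_TYPE[extension]


def output_name(file_path: str) -> str:
    """Build the edge name in the 'output_N' form used when wiring nodes."""
    return f"output_{edge_for(file_path)}"
```

The "output_N" form matches the edge names used in the inputs of add_node(), for example "FileTypeClassifier.output_2" for PDF files.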

Note

The FileTypeClassifier works best when you pass in one file per Pipeline.run() call. If you pass multiple files of different formats into Pipeline.run() or Pipeline.run_batch(), the FileTypeClassifier raises an error.
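
One way to respect this constraint is to group files by format before indexing and issue one call per group. The following is a minimal sketch using only the standard library. The helper group_by_extension and the sample file names are hypothetical, and the commented-out call assumes an indexing pipeline named p that accepts a file_paths argument.

```python
from collections import defaultdict
from pathlib import Path


def group_by_extension(file_paths):
    """Group paths by lowercase extension, keeping input order within each group."""
    groups = defaultdict(list)
    for path in file_paths:
        groups[Path(path).suffix.lower()].append(path)
    return dict(groups)


# Hypothetical file names used only for illustration.
files = ["a.txt", "b.pdf", "c.TXT", "d.md"]
batches = group_by_extension(files)

for extension, batch in batches.items():
    # Each batch holds a single format, so it could go into one call, e.g.:
    # p.run(file_paths=batch)  # assumes `p` is an indexing pipeline
    print(extension, batch)
```

Grouping by the raw extension is deliberately conservative: it never mixes two extensions in one call, even where the classifier might treat them as the same type.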

To use a FileTypeClassifier in an indexing pipeline, run:

from haystack.pipelines import Pipeline
from haystack.nodes import TextConverter, FileTypeClassifier, PDFToTextConverter, MarkdownConverter, DocxToTextConverter, PreProcessor

file_type_classifier = FileTypeClassifier()

text_converter = TextConverter()
pdf_converter = PDFToTextConverter()
md_converter = MarkdownConverter()
docx_converter = DocxToTextConverter()
preprocessor = PreProcessor()

# This is an indexing pipeline
p = Pipeline()

p.add_node(component=file_type_classifier, name="FileTypeClassifier", inputs=["File"])

p.add_node(component=text_converter, name="TextConverter", inputs=["FileTypeClassifier.output_1"])
p.add_node(component=pdf_converter, name="PdfConverter", inputs=["FileTypeClassifier.output_2"])
p.add_node(component=md_converter, name="MarkdownConverter", inputs=["FileTypeClassifier.output_3"])
p.add_node(component=docx_converter, name="DocxConverter", inputs=["FileTypeClassifier.output_4"])

p.add_node(
    component=preprocessor,
    name="Preprocessor",
    inputs=["TextConverter", "PdfConverter", "MarkdownConverter", "DocxConverter"],
)
