FileClassifier
A File Classifier distinguishes between text, PDF, Markdown, Docx and HTML files and routes them to the appropriate FileConverter in an indexing pipeline.
Position in a Pipeline | At the very beginning of an indexing pipeline |
Input | File name |
Output | File name (routed) |
Classes | FileTypeClassifier |
Usage
By default, the FileTypeClassifier has five outgoing edges. It routes each incoming file through one of them to a FileConverter, which then converts the file into Documents.
These are the default outgoing edges of the File Classifier:
Outgoing Edge | File Type |
---|---|
1 | Text |
2 | PDF |
3 | Markdown |
4 | Docx |
5 | HTML |
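The routing in the table above can be pictured as a lookup from a file's extension to an edge name. The sketch below is only an illustration of that idea, not the FileTypeClassifier implementation: the extension strings in `DEFAULT_TYPES` and the helper name `edge_for` are assumptions made for this example.

```python
from pathlib import Path

# Assumed extensions for the five default file types, listed in edge order.
DEFAULT_TYPES = ["txt", "pdf", "md", "docx", "html"]


def edge_for(file_path: str) -> str:
    """Return the outgoing edge name (output_1 to output_5) for a file path."""
    # Take the extension without the leading dot and ignore its case.
    extension = Path(file_path).suffix.lstrip(".").lower()
    if extension not in DEFAULT_TYPES:
        raise ValueError(f"Unsupported file type: {file_path}")
    # Edges are numbered from 1, list positions from 0.
    return f"output_{DEFAULT_TYPES.index(extension) + 1}"
```

With this mapping, a Markdown file would leave through `output_3`, which matches the edge the MarkdownConverter is attached to in the pipeline example on this page.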
Note
The FileTypeClassifier works best when you pass in one file per Pipeline.run() call. If you pass multiple files of different formats into Pipeline.run() or Pipeline.run_batch(), the FileTypeClassifier returns an error.
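One way to stay within this restriction is to group files by extension before indexing, so that each run call only receives files of a single format. The following is a minimal sketch using only the standard library; the helper name `group_by_extension`, the sample file names, and the `indexing_pipeline` variable mentioned in the comment are hypothetical.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List


def group_by_extension(file_paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group file paths by lowercased extension, keeping the input order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for file_path in file_paths:
        groups[Path(file_path).suffix.lower()].append(file_path)
    return dict(groups)


batches = group_by_extension(["a.txt", "b.pdf", "c.txt"])
# Each batch now holds a single format and could be indexed on its own, e.g.:
# for paths in batches.values():
#     indexing_pipeline.run(file_paths=paths)  # indexing_pipeline is hypothetical
```

Grouping first keeps each call homogeneous while still letting you index a mixed folder in one script.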
To use a FileTypeClassifier in an indexing pipeline, run:
from haystack.pipelines import Pipeline
from haystack.nodes import TextConverter, FileTypeClassifier, PDFToTextConverter, MarkdownConverter, DocxToTextConverter, PreProcessor
file_type_classifier = FileTypeClassifier()
text_converter = TextConverter()
pdf_converter = PDFToTextConverter()
md_converter = MarkdownConverter()
docx_converter = DocxToTextConverter()
preprocessor = PreProcessor()
# This is an indexing pipeline
p = Pipeline()
p.add_node(component=file_type_classifier, name="FileTypeClassifier", inputs=["File"])
p.add_node(component=text_converter, name="TextConverter", inputs=["FileTypeClassifier.output_1"])
p.add_node(component=pdf_converter, name="PdfConverter", inputs=["FileTypeClassifier.output_2"])
p.add_node(component=md_converter, name="MarkdownConverter", inputs=["FileTypeClassifier.output_3"])
p.add_node(component=docx_converter, name="DocxConverter", inputs=["FileTypeClassifier.output_4"])
p.add_node(
    component=preprocessor,
    name="Preprocessor",
    inputs=["TextConverter", "PdfConverter", "MarkdownConverter", "DocxConverter"],
)