Haystack

FileConverters

Use FileConverters to extract text from files in different formats and cast it into the unified Document format. Converters are available for PDFs, images, DOCX files, and more.

Position in a Pipeline: At the very beginning of an indexing Pipeline
Input: Filename
Output: Documents
Classes: PDFToTextConverter, DocxToTextConverter, AzureConverter, ImageToTextConverter, MarkdownConverter
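
The contract summarized above (a file name goes in, Documents come out) can be sketched in plain Python. This is only an illustration: `SketchDocument` and `convert_plain_text` are hypothetical stand-ins invented for this example, not Haystack classes, and Haystack's own Document carries more fields than the text and metadata kept here.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List, Optional


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a Document: extracted text plus metadata."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def convert_plain_text(
    file_path: Path, meta: Optional[Dict[str, Any]] = None
) -> List[SketchDocument]:
    """Toy converter: takes a file name, returns a list of documents."""
    text = file_path.read_text(encoding="utf-8")
    # Copy the caller's metadata so the input dict is not mutated,
    # then record which file the text came from.
    doc_meta = dict(meta or {})
    doc_meta["name"] = file_path.name
    return [SketchDocument(content=text, meta=doc_meta)]
```

Whatever the source format, the next node in an indexing pipeline only has to deal with this one shape, which is why converters sit at the very beginning.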

Tutorial

For an example of file converters in a pipeline, see our advanced indexing tutorial.

Usage

The examples below show how to initialize each converter:

from pathlib import Path

from haystack.nodes import PDFToTextConverter

converter = PDFToTextConverter(
    remove_numeric_tables=True,
    valid_languages=["de","en"]
)
docs = converter.convert(file_path=Path("my-file.pdf"), meta=None)

# Alternatively, for PDFs that contain scanned images, use PDFToTextOCRConverter, which runs Tesseract OCR under the hood.

from haystack.nodes import PDFToTextOCRConverter

converter = PDFToTextOCRConverter(
    remove_numeric_tables=False,
    valid_languages=["deu","eng"]
)
docs = converter.convert(file_path=Path("my-file.pdf"), meta=None)

from pathlib import Path

from haystack.nodes import DocxToTextConverter

converter = DocxToTextConverter(
    remove_numeric_tables=True,
    valid_languages=["de","en"]
)
docs = converter.convert(file_path=Path("my-file.docx"), meta=None)

# We recommend the Azure Form Recognizer service for parsing tables from PDFs or other complex document structures.
# (https://azure.microsoft.com/en-us/services/form-recognizer/)
from pathlib import Path

from haystack.nodes import AzureConverter

converter = AzureConverter(
    endpoint="some-url",
    credential_key="my-secret-key"
)

docs = converter.convert(file_path=Path("my-file.pdf"), meta=None)

# Haystack supports extraction of text from images using OCR.
from pathlib import Path

from haystack.nodes import ImageToTextConverter

converter = ImageToTextConverter(
    remove_numeric_tables=True,
    # OCR runs on Tesseract, which expects three-letter codes, as in the OCR PDF example.
    valid_languages=["deu","eng"]
)

docs = converter.convert(file_path=Path("my-file.png"), meta=None)

from pathlib import Path

from haystack.nodes import MarkdownConverter

converter = MarkdownConverter(
    remove_numeric_tables=True,
    valid_languages=["de","en"]
)
docs = converter.convert(file_path=Path("my-file.md"), meta=None)

Haystack also provides a convert_files_to_docs() utility function that converts all txt or pdf files in a given directory.

from haystack.utils.preprocessing import convert_files_to_docs

doc_dir = "path/to/your/files"  # placeholder: directory containing the files to convert
docs = convert_files_to_docs(dir_path=doc_dir)
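
If a directory mixes formats that need different converters, one option is to route each file by its suffix. The following is a minimal sketch of that idea, not how convert_files_to_docs is implemented: `convert_directory` and the `converters` mapping are hypothetical names introduced here, and each converter is assumed to be any callable that takes a path and returns a list of documents.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List

# Assumption: a converter is any callable mapping a file path to a list of documents.
ConvertFn = Callable[[Path], List[Any]]


def convert_directory(dir_path: Path, converters: Dict[str, ConvertFn]) -> List[Any]:
    """Walk dir_path recursively and convert every file whose suffix has a converter."""
    documents: List[Any] = []
    # Sort the paths so the output order is deterministic across runs.
    for path in sorted(dir_path.rglob("*")):
        # Suffixes are lowercased so ".PDF" and ".pdf" use the same converter.
        convert = converters.get(path.suffix.lower())
        if path.is_file() and convert is not None:
            documents.extend(convert(path))
    return documents
```

With Haystack installed, the mapping might pair ".pdf" with a function that calls the convert() method of a PDFToTextConverter and ".docx" with one that calls a DocxToTextConverter; files with any other suffix are skipped.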
