
FileConverter

Use FileConverters to extract text from files in different formats and cast it into the unified Document format. There are a number of converters available for converting PDFs, images, DOCX files, and more.

Position in a Pipeline: Either at the very beginning of an indexing Pipeline or after a FileClassifier
Input: File name
Output: Documents
Classes: PDFToTextConverter, PDFToTextOCRConverter, DocxToTextConverter, AzureConverter, ImageToTextConverter, MarkdownConverter, ParsrConverter, TikaConverter, TextConverter

👍

Tutorial

To see an example of file converters in a pipeline, see the DocumentClassifier at Index Time tutorial.

File Converter Classes

Here's what each file converter type can do:

PDFToTextConverter: Extracts text from a PDF file using the pdftotext library.
PDFToTextOCRConverter: Extracts text from PDF files that contain images using Pytesseract.
DocxToTextConverter: Extracts text from .docx files.
AzureConverter: Extracts text and tables from files in the following formats: PDF, JPEG, PNG, BMP, and TIFF. Uses Microsoft Azure's Form Recognizer. To use this converter, you must have an active Azure account and a Form Recognizer or Cognitive Services resource. For more information, see Form Recognizer.
ImageToTextConverter: Extracts text from image files using Pytesseract.
MarkdownConverter: Converts markdown to plain text.
ParsrConverter: Extracts text and tables from PDF and .docx files using the open-source Parsr by axa-group.
TikaConverter: Converts files into Documents using Apache Tika.
TextConverter: Preprocesses text files and returns Documents.
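
As a rough illustration of how the overview above maps file types to converters, the sketch below picks a converter class name from a file extension. The helper `suggest_converter` and the extension mapping are hypothetical and not part of Haystack; the mapping only reflects formats named on this page, and the choice between the plain and the OCR PDF converter is reduced to a single flag.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to converter class name, based only
# on the formats named on this page. Not part of the Haystack API.
SUFFIX_TO_CONVERTER = {
    ".pdf": "PDFToTextConverter",
    ".docx": "DocxToTextConverter",
    ".md": "MarkdownConverter",
    ".txt": "TextConverter",
    ".png": "ImageToTextConverter",
    ".jpg": "ImageToTextConverter",
    ".jpeg": "ImageToTextConverter",
}


def suggest_converter(file_path: str, scanned: bool = False) -> str:
    """Return the name of a converter class that could handle the file."""
    suffix = Path(file_path).suffix.lower()
    if suffix == ".pdf" and scanned:
        # Image-only PDFs have no text layer, so they need OCR.
        return "PDFToTextOCRConverter"
    # Assumption: unknown formats are handed to TikaConverter as a catch-all.
    return SUFFIX_TO_CONVERTER.get(suffix, "TikaConverter")


print(suggest_converter("report.pdf"))
print(suggest_converter("scan.pdf", scanned=True))
print(suggest_converter("slides.ppt"))
```

A lookup like this is only a starting point; the sections further down discuss trade-offs, such as table extraction and speed, that a file extension alone cannot capture.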

Usage

The following examples show how to initialize some of the converters:

from pathlib import Path

from haystack.nodes import PDFToTextConverter

converter = PDFToTextConverter(
    remove_numeric_tables=True,
    valid_languages=["de","en"]
)
docs = converter.convert(file_path=Path("my-file.pdf"), meta=None)

# Alternatively, for PDFs that contain images, use the PDFToTextOCRConverter.
# It uses the pytesseract library under the hood for optical character recognition.

from haystack.nodes import PDFToTextOCRConverter
converter = PDFToTextOCRConverter(
    remove_numeric_tables=False,
    valid_languages=["deu","eng"]
)
docs = converter.convert(file_path=Path("my-file.pdf"), meta=None)

from pathlib import Path

from haystack.nodes import DocxToTextConverter

converter = DocxToTextConverter(
    remove_numeric_tables=True,
    valid_languages=["de","en"]
)
docs = converter.convert(file_path=Path("my-file.docx"), meta=None)

# We recommend the Azure Form Recognizer service for parsing tables from PDFs or other complex document structures.
# (https://azure.microsoft.com/en-us/services/form-recognizer/)
from pathlib import Path

from haystack.nodes import AzureConverter

converter = AzureConverter(
    endpoint="some-url",
    credential_key="my-secret-key"
)

docs = converter.convert(file_path=Path("my-file.pdf"), meta=None)

# Haystack supports extraction of text from images using OCR.
from pathlib import Path

from haystack.nodes import ImageToTextConverter

converter = ImageToTextConverter(
    remove_numeric_tables=True,
    valid_languages=["de","en"]
)

docs = converter.convert(file_path=Path("my-file.png"), meta=None)

from pathlib import Path

from haystack.nodes import MarkdownConverter

converter = MarkdownConverter(
    remove_numeric_tables=True,
    valid_languages=["de","en"]
)
docs = converter.convert(file_path=Path("my-file.md"), meta=None)

The Documents generated by a FileConverter can be directly written into a DocumentStore using an indexing pipeline. To do so, run:

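import os

from haystack import Pipeline
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import TextConverter

# Assumption for illustration: any DocumentStore can be used here.
document_store = InMemoryDocumentStore()
# Placeholder: directory that contains the text files to index.
doc_dir = "data"

indexing_pipeline = Pipeline()
text_converter = TextConverter()

indexing_pipeline.add_node(component=text_converter, name="TextConverter", inputs=["File"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["TextConverter"])

files_to_index = [os.path.join(doc_dir, f) for f in os.listdir(doc_dir)]
indexing_pipeline.run_batch(file_paths=files_to_index)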
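
The file list above is built with `os.listdir`, which also returns subdirectories. A minimal sketch of an alternative that keeps regular files only, using `pathlib`, could look like this. The helper name `list_files` and the optional suffix filter are assumptions for illustration, not part of Haystack.

```python
from pathlib import Path
from typing import List, Optional


def list_files(doc_dir: str, suffix: Optional[str] = None) -> List[str]:
    """Return sorted paths of regular files directly inside doc_dir."""
    paths = []
    for entry in Path(doc_dir).iterdir():
        if not entry.is_file():
            continue  # skip subdirectories
        if suffix is not None and entry.suffix.lower() != suffix:
            continue  # skip files with a different extension
        paths.append(str(entry))
    return sorted(paths)
```

The result could then be passed to the pipeline in place of `files_to_index`, for example `list_files(doc_dir, suffix=".txt")` when the pipeline starts with a TextConverter.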

Choosing the Right PDF Converter

Haystack provides several options for converting PDFs into Documents. Because PDF files vary widely, picking the right converter can be a challenge, and this guide is meant to make that choice easier. Each tool has different strengths and weaknesses, so choose one based on your use case.

⚠️

Text in PDF isn’t always one continuous stream. Often there are side notes, comment bubbles, tables, and columns. Readers can choose to read these elements in different orders.

Unfortunately, the PDF format doesn’t set strict standards regarding the order of the text stored in PDF files. Therefore, we can’t guarantee that the extracted reading order of your PDFs corresponds to how a typical human would read the document.

This issue is rare and typically affects only PDFs with more complex layouts, such as multi-column PDFs.
If you observe this behavior with one of the FileConverters, we recommend trying another one on your files, as each tool uses a slightly different technique to determine the reading order of a PDF.
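
To make the reading-order problem concrete, here is a toy sketch that is not taken from any Haystack converter. Four text blocks from a two-column page are ordered once by vertical position and once column by column. The `Block` type, the coordinates, and the fixed column boundary are all made up for illustration.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    x: float  # left edge of the text block
    y: float  # top edge of the text block (0 = top of page)
    text: str


# Toy two-column page: column A on the left, column B on the right.
blocks = [
    Block(x=0, y=0, text="A1"),
    Block(x=300, y=0, text="B1"),
    Block(x=0, y=100, text="A2"),
    Block(x=300, y=100, text="B2"),
]


def row_major(blocks: List[Block]) -> List[str]:
    """Order blocks top to bottom, then left to right."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.y, b.x))]


def column_major(blocks: List[Block], column_boundary: float) -> List[str]:
    """Order blocks column by column, top to bottom within each column."""
    return [
        b.text
        for b in sorted(blocks, key=lambda b: (b.x >= column_boundary, b.y))
    ]


print(row_major(blocks))          # ['A1', 'B1', 'A2', 'B2']
print(column_major(blocks, 150))  # ['A1', 'A2', 'B1', 'B2']
```

Both orders are consistent with the same coordinates, which is why different tools can return different text for the same file.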

PDFToTextConverter

The PDFToTextConverter is a fast and lightweight PDF converter that uses the [pdftotext](https://www.xpdfreader.com/pdftotext-man.html) tool to convert PDF files to plain text. It works well with most digitally created or searchable PDFs containing a text layer.

The PDFToTextConverter cannot extract the text of image-only PDFs (for example, scanned documents). It does not extract tables as separate Documents of type table but treats them as plain text. Numerical tables can be discarded by setting remove_numeric_tables to True.
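
To give a rough idea of what discarding numeric tables can mean in practice, the sketch below drops lines in which a large share of the words contain digits. This is an illustrative heuristic only. The function name, the 40% threshold, and the rule that spares lines ending in a period are assumptions and may differ from what the converters actually do.

```python
from typing import List


def drop_numeric_lines(text: str, threshold: float = 0.4) -> str:
    """Remove lines that look like rows of a numeric table."""
    kept: List[str] = []
    for line in text.splitlines():
        words = line.split()
        numeric = [w for w in words if any(ch.isdigit() for ch in w)]
        looks_numeric = bool(words) and len(numeric) / len(words) > threshold
        # Keep lines that end with a period: they are more likely sentences.
        if looks_numeric and not line.strip().endswith("."):
            continue
        kept.append(line)
    return "\n".join(kept)


sample = "Revenue by quarter\n2021 10.5 11.2 12.0\nSales grew by 12 percent in 2021."
print(drop_numeric_lines(sample))
```

In this sample, the row of figures is removed while the heading and the sentence that merely mentions numbers are kept.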

PDFToTextOCRConverter

The PDFToTextOCRConverter uses the [pytesseract](https://github.com/madmaze/pytesseract) tool to perform optical character recognition (OCR) on your PDF files before extracting the text. It therefore works with image-only PDFs as well. The PDFToTextOCRConverter does not extract tables as separate Documents of type table but treats them as plain text. Numerical tables can be discarded by setting remove_numeric_tables to True.

If your files consist mainly of searchable PDFs containing a text layer, we recommend using the PDFToTextConverter to avoid the overhead of applying OCR to your files.
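
One way to act on this recommendation is to try plain text extraction first and fall back to OCR only when little or no text comes back. The sketch below shows that control flow with two caller-supplied functions. `extract_text`, `extract_text_ocr`, and the `min_chars` cutoff are hypothetical stand-ins, not Haystack APIs; in practice they could wrap calls to the two converters.

```python
from typing import Callable


def convert_with_ocr_fallback(
    file_path: str,
    extract_text: Callable[[str], str],
    extract_text_ocr: Callable[[str], str],
    min_chars: int = 20,
) -> str:
    """Use the cheap extractor first; run OCR only if it yields too little text."""
    text = extract_text(file_path)
    if len(text.strip()) >= min_chars:
        return text
    # Likely an image-only PDF without a text layer: pay the OCR cost.
    return extract_text_ocr(file_path)
```

The cutoff is a judgment call: a value that is too high sends short but valid documents through OCR, while a value that is too low may accept stray characters from a scanned page as a successful extraction.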

AzureConverter

The AzureConverter is based on Microsoft Azure's Form Recognizer service. To use this converter, you need an active Azure account and a Form Recognizer or Cognitive Services resource. You can follow this guide to set this up.

The AzureConverter works with both searchable PDFs and image-only PDFs. Furthermore, it supports the following file formats: JPEG, PNG, BMP, and TIFF.

Unlike the PDFToTextConverter and PDFToTextOCRConverter, the AzureConverter doesn’t extract the tables in a file as plain text but generates separate Document objects of type table that maintain the two-dimensional structure of the tables. This is useful in combination with the TableReader, for example.

ParsrConverter

The ParsrConverter uses the open-source Parsr tool by axa-group. To use this converter, you need a running Parsr Docker container. To start one, run:
docker run -p 3001:3001 axarev/parsr

The ParsrConverter works with searchable PDFs containing a text layer as well as DOCX files. Like the AzureConverter, it doesn’t extract the tables in a file as plain text but generates separate Document objects of type table that maintain the two-dimensional structure of the tables. This is useful in combination with the TableReader, for example.

One downside of the ParsrConverter is that it can be extremely slow, especially for PDFs containing many pages.

TikaConverter

The TikaConverter uses the Apache Tika toolkit. To use this converter, you need to run a Tika Docker container. To start a Tika Docker container, run:
docker run -p 9998:9998 apache/tika:1.28.4
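
Because the converter depends on the container being up, it can help to check that the port is reachable before converting files. The following sketch uses only the Python standard library. The helper `is_port_open` is hypothetical, and the host and port are assumptions that match the `docker run` command above.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


if not is_port_open("localhost", 9998):
    print("Tika does not seem to be reachable on localhost:9998")
```

A successful connection only shows that something is listening on the port; it does not prove that the service behind it is healthy.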

The TikaConverter works with a wide range of formats, including searchable PDFs, DOCX files, PPT files, and many more. It can therefore be useful in your indexing Pipelines if you expect a variety of file formats and don’t want to route each format to a dedicated converter using the FileClassifier node.

Like the PDFToTextConverter and PDFToTextOCRConverter, the TikaConverter does not extract tables as separate Documents of type table but treats them as plain text. Numerical tables can be discarded by setting remove_numeric_tables to True.

