
FileConverter

Use FileConverters to extract text from files in different formats and cast it into the unified Document format. There are a number of converters available for converting PDFs, images, DOCX files, and more.

Position in a Pipeline: Either at the very beginning of an indexing Pipeline or after a FileClassifier
Input: File name
Output: Documents
Classes: AzureConverter, CsvTextConverter, DocxToTextConverter, ImageToTextConverter, JsonConverter, MarkdownConverter, PDFToTextConverter, ParsrConverter, PptxConverter, TikaConverter, TextConverter

πŸ‘

Tutorial

To see an example of file converters in a pipeline, see the DocumentClassifier at Index Time tutorial.

File Converter Classes

Here's what each of the file converter types can do:

AzureConverter: Extracts text and tables from files in the following formats: PDF, JPEG, PNG, BMP, and TIFF. Uses Microsoft Azure's Form Recognizer. To use this converter, you must have an active Azure account and a Form Recognizer or Cognitive Services resource. For more information, see Form Recognizer.
CsvTextConverter: Converts CSV files containing question-answer pairs into text Documents. The CSV file must have two columns: question and answer. The first column is interpreted as the question and the second as its answer.
Tip: You can use it to search your FAQs, as it preserves the question-answer pairs in the Documents. See the FAQ-style QA Tutorial and the sketch after this list.
Note: It doesn't handle tabular data.
DocxToTextConverter: Extracts text from DOCX files.
ImageToTextConverter: Extracts text from image files using pytesseract.
JsonConverter: Extracts text from JSON and JSONL files and casts it into Document objects.
MarkdownConverter: Converts Markdown to plain text.
PDFToTextConverter: Extracts text from a PDF file using the PyMuPDF library or xpdf, depending on the version you install.
ParsrConverter: Extracts text and tables from PDF and DOCX files using the open-source Parsr tool by axa-group.
PptxConverter: Extracts text from PPTX files.
TikaConverter: Converts files into Documents using Apache Tika.
TextConverter: Preprocesses text files and returns Documents.
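
For example, here's a minimal sketch of an FAQ file for the CsvTextConverter. The file name and contents are made up for illustration; only the two-column question/answer layout matters:

from pathlib import Path
from haystack.nodes import CsvTextConverter

# Hypothetical FAQ file: the first column holds questions, the second the matching answers.
Path("faq.csv").write_text(
    "question,answer\n"
    "What is a FileConverter?,A node that extracts text from files and casts it into Documents.\n"
    "Where does it go in a pipeline?,At the beginning of an indexing pipeline or after a FileClassifier.\n"
)

converter = CsvTextConverter()
docs = converter.convert(file_path=Path("faq.csv"), meta=None)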

Usage

Here's how to initialize each converter:

PDFToTextConverter:

from pathlib import Path
from haystack.nodes import PDFToTextConverter

converter = PDFToTextConverter(
    remove_numeric_tables=True,
    valid_languages=["de", "en"]
)
docs = converter.convert(file_path=Path("my-file.pdf"), meta=None)

DocxToTextConverter:

from pathlib import Path
from haystack.nodes import DocxToTextConverter

converter = DocxToTextConverter(
    remove_numeric_tables=True,
    valid_languages=["de", "en"]
)
docs = converter.convert(file_path=Path("my-file.docx"), meta=None)

AzureConverter:

# We recommend the Azure Form Recognizer service for parsing tables from PDFs or other complex document structures.
# (https://azure.microsoft.com/en-us/services/form-recognizer/)
from pathlib import Path
from haystack.nodes import AzureConverter

converter = AzureConverter(
    endpoint="some-url",
    credential_key="my-secret-key"
)
docs = converter.convert(file_path=Path("my-file.pdf"), meta=None)

ImageToTextConverter:

# Haystack supports extraction of text from images using OCR.
from pathlib import Path
from haystack.nodes import ImageToTextConverter

converter = ImageToTextConverter(
    remove_numeric_tables=True,
    valid_languages=["de", "en"]
)
docs = converter.convert(file_path=Path("my-file.png"), meta=None)

MarkdownConverter:

from pathlib import Path
from haystack.nodes import MarkdownConverter

converter = MarkdownConverter(
    remove_numeric_tables=True,
    valid_languages=["de", "en"]
)
docs = converter.convert(file_path=Path("my-file.md"), meta=None)

CsvTextConverter:

from pathlib import Path
from haystack.nodes import CsvTextConverter

converter = CsvTextConverter()
docs = converter.convert(file_path=Path("my-file.csv"), meta=None)

JsonConverter:

from pathlib import Path
from haystack.nodes import JsonConverter

converter = JsonConverter()
docs = converter.convert(file_path=Path("data_file.json"), meta=None)

You can write the Documents a FileConverter generates directly into a DocumentStore using an indexing pipeline. To do so, run:

import os

from haystack import Pipeline
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import TextConverter

# Any DocumentStore works here; InMemoryDocumentStore is used for illustration.
document_store = InMemoryDocumentStore()
text_converter = TextConverter()
indexing_pipeline = Pipeline()

indexing_pipeline.add_node(component=text_converter, name="TextConverter", inputs=["File"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["TextConverter"])

doc_dir = "data"  # directory that holds the files to index
files_to_index = [os.path.join(doc_dir, f) for f in os.listdir(doc_dir)]
indexing_pipeline.run_batch(file_paths=files_to_index)

Choosing the Right PDF Converter

Haystack provides several options for converting PDFs into Documents. Because PDF files come in many different forms, use this guide to help you pick the right converter. Each tool has different strengths and weaknesses, so choose one based on your use case.

⚠️

Text in PDF isn’t always one continuous stream. Often there are side notes, comment bubbles, tables, and columns. Readers can choose to read these elements in different orders.

Unfortunately, the PDF format doesn’t set strict standards regarding the order of the text stored in PDF files. Therefore, we can’t assure that the extracted reading order of your PDFs corresponds to how a typical human would read the document.

This issue arises rarely. When it does occur, it typically affects PDFs with more complex layouts, such as multi-column PDFs.
If you observe this behavior with one of the FileConverters, we recommend trying another one on your files, as each tool uses a slightly different technique to determine the reading order of a PDF.

PDFToTextConverter

The PDFToTextConverter is a fast and lightweight PDF converter that converts PDF files to plain text. It works well with most digitally created or searchable PDFs containing a text layer. It can't extract text from image-based PDFs (for example, scanned documents), as it performs no OCR.

The PDFToTextConverter does not extract tables as separate Documents of type table but treats them as plain text. You can discard numeric tables by setting remove_numeric_tables to True.

Haystack offers two versions of PDFToTextConverter: one based on xpdf and one based on PyMuPDF. PyMuPDF is faster and better maintained but it's licensed under AGPL, so it might not be suitable for all users. xpdf has a more liberal license that allows commercial use.

To use the PyMuPDF version, install Haystack using pip install farm-haystack[pdf]. This command installs all the required dependencies.

To use the xpdf version, install Haystack with pip install farm-haystack (no extras) and then install the xpdf binaries if they're not already available on your system. In most cases, you'll need to compile them from source. On Linux, for example, you can install xpdf with the following Bash command:

curl -O https://dl.xpdfreader.com/xpdf-4.04.tar.gz && \
    tar -xvf xpdf-4.04.tar.gz && \
    cd xpdf-4.04 && \
    cmake . && \
    make && \
    cp xpdf/pdftotext /opt && \
    cd .. && \
    rm -rf xpdf-4.04

Make sure that the pdftotext command is available in your terminal after the installation. If xpdf is not installed correctly, you will see errors like:

pdftotext is not installed. It is part of xpdf or poppler-utils software suite.

See the xpdf website for more information about the installation procedure.
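
To quickly check that the binary is reachable from your terminal, you can print its version. This is standard pdftotext behavior, independent of Haystack:

pdftotext -v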

AzureConverter

The AzureConverter is based on Microsoft Azure's Form Recognizer service. Therefore, to be able to use this converter, you need an active Azure account and a Form Recognizer or Cognitive Services resource. You can follow this guide to set this up.

The AzureConverter works with both searchable PDFs and image-only PDFs. Furthermore, it supports the following file formats: JPEG, PNG, BMP, and TIFF.

Unlike the PDFToTextConverter, the AzureConverter doesn't extract the tables in a file as plain text but generates separate Document objects of type table that maintain the two-dimensional structure of the tables. This is useful in combination with the TableReader, for example.
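
For example, you can split the converter's output by Document type before routing tables to a TableReader. A minimal sketch; the endpoint, key, and file name are placeholders:

from pathlib import Path
from haystack.nodes import AzureConverter

converter = AzureConverter(endpoint="some-url", credential_key="my-secret-key")
docs = converter.convert(file_path=Path("my-file.pdf"), meta=None)

# Table Documents keep their two-dimensional structure; text Documents hold plain text.
tables = [doc for doc in docs if doc.content_type == "table"]
texts = [doc for doc in docs if doc.content_type == "text"]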

ParsrConverter

The ParsrConverter uses the open-source Parsr tool by axa-group. To use this converter, you need a running Parsr Docker container. To start one, run:
docker run -p 3001:3001 axarev/parsr
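
Once the container is running, you can initialize the converter against it. A minimal sketch, assuming the container is reachable at the default local address:

from pathlib import Path
from haystack.nodes import ParsrConverter

# Points to the Parsr container started with the command above.
converter = ParsrConverter(parsr_url="http://localhost:3001")
docs = converter.convert(file_path=Path("my-file.pdf"), meta=None)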

The ParsrConverter works with searchable PDFs containing a text layer as well as DOCX files. Like the AzureConverter, it doesn't extract the tables in a file as plain text but generates separate Document objects of type table that maintain the two-dimensional structure of the tables. This is useful in combination with the TableReader, for example.

One downside of the ParsrConverter is that it can be extremely slow, especially for PDFs containing many pages.

TikaConverter

The TikaConverter uses the Apache Tika toolkit. To use this converter, you need a running Tika Docker container. To start one, run:
docker run -p 9998:9998 apache/tika:1.28.4
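
Once the container is running, you can initialize the converter against it. A minimal sketch, assuming the container is reachable at the default local address:

from pathlib import Path
from haystack.nodes import TikaConverter

# Points to the Tika container started with the command above.
converter = TikaConverter(tika_url="http://localhost:9998/tika")
docs = converter.convert(file_path=Path("my-file.pdf"), meta=None)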

The TikaConverter works with a wide range of formats, including searchable PDFs, DOCX files, PPT files, and many more. It might therefore be useful in your indexing pipelines if you expect a variety of file formats and don't want to route each format to a dedicated converter using the FileClassifier node.

Like the PDFToTextConverter, the TikaConverter does not extract tables as separate Documents of type table but treats them as plain text. You can discard numeric tables by setting remove_numeric_tables to True.

JsonConverter

JsonConverter takes JSON or JSONL files, parses them, and casts them into Documents. It accepts two formats:

A JSON file with a list of Document dictionaries
[
    {
        "content": [
            [
                "Language",
                "Imperative",
                "OO"
            ],
            [
                "C",
                "Yes",
                "No"
            ],
            [
                "Haskell",
                "No",
                "No"
            ],
            [
                "Python",
                "Yes",
                "Yes"
            ]
        ],
        "content_type": "table",
        "score": null,
        "meta": {
            "context": "Programming Languages",
            "page": 2
        },
        "id_hash_keys": [
            "content"
        ],
        "embedding": null,
        "id": "2fa0a0ee3494507df3b6404d01bfeea0"
    },
    {
        "content": "Programming languages are used for controlling the behavior of a machine (often a computer).",
        "content_type": "text",
        "score": null,
        "meta": {
            "context": "Programming Languages",
            "page": 1
        },
        "id_hash_keys": [
            "content"
        ],
        "embedding": null,
        "id": "b53a5cc03658cc1636ab38b72ab59cd4"
    },
    {
        "content": [
            [
                "Language",
                "Statements ratio",
                "Line ratio"
            ],
            [
                "C",
                1,
                1.0
            ],
            [
                "Python",
                6,
                6.5
            ]
        ],
        "content_type": "table",
        "score": null,
        "meta": {
            "context": "Expressiveness",
            "page": 3
        },
        "id_hash_keys": [
            "content"
        ],
        "embedding": null,
        "id": "8b9c2de764b5b1ea33b7092d359b44c1"
    }
]

A JSONL file with every line containing either a Document dictionary or a list of dictionaries
{"content": [["Language", "Imperative", "OO"], ["C", "Yes", "No"], ["Haskell", "No", "No"], ["Python", "Yes", "Yes"]], "content_type": "table", "score": null, "meta": {"context": "Programming Languages", "page": 2}, "id_hash_keys": ["content"], "embedding": null, "id": "2fa0a0ee3494507df3b6404d01bfeea0"}
{"content": "Programming languages are used for controlling the behavior of a machine (often a computer).", "content_type": "text", "score": null, "meta": {"context": "Programming Languages", "page": 1}, "id_hash_keys": ["content"], "embedding": null, "id": "b53a5cc03658cc1636ab38b72ab59cd4"}
{"content": [["Language", "Statements ratio", "Line ratio"], ["C", 1, 1.0], ["Python", 6, 6.5]], "content_type": "table", "score": null, "meta": {"context": "Expressiveness", "page": 3}, "id_hash_keys": ["content"], "embedding": null, "id": "8b9c2de764b5b1ea33b7092d359b44c1"}

The number of Document objects it creates corresponds to the number of Document dictionaries in the input file: one Document per dictionary.
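
For example, converting the JSONL file above (saved under a hypothetical name such as docs.jsonl) yields one Document per line, here two tables and one text Document:

from pathlib import Path
from haystack.nodes import JsonConverter

converter = JsonConverter()
docs = converter.convert(file_path=Path("docs.jsonl"), meta=None)

for doc in docs:
    print(doc.content_type, doc.meta["context"])  # e.g. "table Programming Languages"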