FileConverter
Use FileConverters to extract text from files in different formats and cast it into the unified Document format. There are a number of converters available for converting PDFs, images, DOCX files, and more.
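For example, converting a plain-text file returns the same kind of Document objects as converting a PDF. Here's a minimal sketch (the file name is hypothetical):

from pathlib import Path
from haystack.nodes import TextConverter

# Every FileConverter returns a list of unified Document objects.
docs = TextConverter().convert(file_path=Path("my-file.txt"), meta=None)
print(docs[0].content)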
Position in a Pipeline | Either at the very beginning of an indexing Pipeline or after a FileClassifier
Input | File name
Output | Documents
Classes | AzureConverter, CsvTextConverter, DocxToTextConverter, ImageToTextConverter, JsonConverter, MarkdownConverter, PDFToTextConverter, ParsrConverter, TikaConverter, TextConverter
Tutorial
For an example of file converters used in a pipeline, see the DocumentClassifier at Index Time tutorial.
File Converter Classes
Here's what each of the file converter types can do:
AzureConverter | Extracts text and tables from files in the following formats: PDF, JPEG, PNG, BMP, and TIFF. Uses Microsoft Azure's Form Recognizer. To use this converter, you must have an active Azure account and a Form Recognizer or Cognitive Services resource. For more information, see Form Recognizer.
CsvTextConverter | Converts CSV files containing question-answer pairs into text Documents. The CSV file must have two columns: question and answer. The first column is interpreted as the question and the second as the answer to that question. Tip: You can use it to search your FAQs, as it preserves the question-answer pairs in the Documents. See also the FAQ-style QA Tutorial. Note: It doesn't handle tabular data.
DocxToTextConverter | Extracts text from DOCX files.
ImageToTextConverter | Extracts text from image files using pytesseract.
JsonConverter | Extracts text from JSON and JSONL files and casts it into Document objects.
MarkdownConverter | Converts Markdown to plain text.
PDFToTextConverter | Extracts text from a PDF file using the PyMuPDF library or xpdf, depending on the version you install.
ParsrConverter | Extracts text and tables from PDF and DOCX files using the open-source Parsr tool by axa-group.
TikaConverter | Converts files into Documents using Apache Tika.
TextConverter | Preprocesses text files and returns Documents.
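If your indexing pipeline receives files of mixed types, you can route each file to a matching converter with the FileClassifier node (class FileTypeClassifier). Here's a minimal routing sketch, assuming the classifier's default supported types, where output_1 carries .txt files and output_2 carries .pdf files:

from haystack import Pipeline
from haystack.nodes import FileTypeClassifier, PDFToTextConverter, TextConverter

# Route each incoming file to the converter that matches its file type.
pipeline = Pipeline()
pipeline.add_node(component=FileTypeClassifier(), name="FileTypeClassifier", inputs=["File"])
pipeline.add_node(component=TextConverter(), name="TextConverter", inputs=["FileTypeClassifier.output_1"])
pipeline.add_node(component=PDFToTextConverter(), name="PDFConverter", inputs=["FileTypeClassifier.output_2"])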
Usage
Here's how to initialize each converter and use it to convert a file:
from pathlib import Path
from haystack.nodes import PDFToTextConverter

converter = PDFToTextConverter(
    remove_numeric_tables=True,
    valid_languages=["de", "en"]
)
docs = converter.convert(file_path=Path("my-file.pdf"), meta=None)

from pathlib import Path
from haystack.nodes import DocxToTextConverter

converter = DocxToTextConverter(
    remove_numeric_tables=True,
    valid_languages=["de", "en"]
)
docs = converter.convert(file_path=Path("my-file.docx"), meta=None)

# We recommend the Azure Form Recognizer service for parsing tables from PDFs or other complex document structures.
# (https://azure.microsoft.com/en-us/services/form-recognizer/)
from pathlib import Path
from haystack.nodes import AzureConverter

converter = AzureConverter(
    endpoint="some-url",
    credential_key="my-secret-key"
)
docs = converter.convert(file_path=Path("my-file.pdf"), meta=None)

# Haystack supports extraction of text from images using OCR.
from pathlib import Path
from haystack.nodes import ImageToTextConverter

converter = ImageToTextConverter(
    remove_numeric_tables=True,
    valid_languages=["de", "en"]
)
docs = converter.convert(file_path=Path("my-file.png"), meta=None)

from pathlib import Path
from haystack.nodes import MarkdownConverter

converter = MarkdownConverter(
    remove_numeric_tables=True,
    valid_languages=["de", "en"]
)
docs = converter.convert(file_path=Path("my-file.md"), meta=None)

from pathlib import Path
from haystack.nodes import CsvTextConverter

converter = CsvTextConverter()
docs = converter.convert(file_path=Path("my-file.csv"), meta=None)

from pathlib import Path
from haystack.nodes import JsonConverter

converter = JsonConverter()
docs = converter.convert(file_path=Path("data_file.json"), meta=None)
You can write the Documents a FileConverter generates directly into a DocumentStore using an indexing pipeline. To do so, run:
import os
from haystack import Pipeline
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import TextConverter

document_store = InMemoryDocumentStore()  # any initialized DocumentStore works here
indexing_pipeline = Pipeline()
text_converter = TextConverter()
indexing_pipeline.add_node(component=text_converter, name="TextConverter", inputs=["File"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["TextConverter"])

doc_dir = "data"  # directory that contains the files to index
files_to_index = [os.path.join(doc_dir, f) for f in os.listdir(doc_dir)]
indexing_pipeline.run_batch(file_paths=files_to_index)
Choosing the Right PDF Converter
Haystack provides many different options for converting PDFs into Documents. Because there are so many different types of PDF files, use this guide to help you pick the right converter. Each tool comes with different strengths and weaknesses, so choose one based on your use case.
Text in PDF isn’t always one continuous stream. Often there are side notes, comment bubbles, tables, and columns. Readers can choose to read these elements in different orders.
Unfortunately, the PDF format doesn’t set strict standards regarding the order of the text stored in PDF files. Therefore, we can’t guarantee that the extracted reading order of your PDFs corresponds to how a typical human would read the document. This issue arises rarely, and when it does, it’s usually with PDFs that have more complex layouts, such as multi-column PDFs.
If you observe this behavior with one of the FileConverters, we recommend trying another one on your files, as each tool uses a slightly different technique to determine the reading order of a PDF.
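As a minimal sketch of this fallback approach (the file name is hypothetical, and TikaConverter is just one possible alternative that requires a running Tika server):

from pathlib import Path
from haystack.nodes import PDFToTextConverter, TikaConverter

pdf_file = Path("my-file.pdf")  # hypothetical multi-column PDF

# Convert with one tool and inspect the reading order of the output...
docs = PDFToTextConverter().convert(file_path=pdf_file, meta=None)
print(docs[0].content[:500])

# ...and if it looks scrambled, try a converter with a different extraction technique.
docs = TikaConverter().convert(file_path=pdf_file, meta=None)
print(docs[0].content[:500])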
PDFToTextConverter
The PDFToTextConverter is a fast and lightweight PDF converter that converts PDF files to plain text. It works well with most digitally created or searchable PDFs containing a text layer.
The PDFToTextConverter cannot extract the text of image-only PDFs (for example, scanned documents). It does not extract tables as separate Documents of type table but treats them as plain text. You can discard numeric tables by setting remove_numeric_tables to True.
Haystack offers two versions of PDFToTextConverter: one based on xpdf and one based on PyMuPDF. PyMuPDF is faster and better maintained, but it's licensed under AGPL, so it might not be suitable for all users. xpdf has a more liberal license that allows commercial use.
To use the PyMuPDF version, install Haystack using pip install farm-haystack[pdf]. This command installs all the required dependencies.
To use the xpdf version, install Haystack as pip install farm-haystack (no extras) and then install the xpdf binaries if they're not already available on your system. In most cases, you need to compile them from source. On Linux, for example, you can install xpdf with the following Bash command:
curl -O https://dl.xpdfreader.com/xpdf-4.04.tar.gz && \
tar -xvf xpdf-4.04.tar.gz && \
cd xpdf-4.04 && \
cmake . && \
make && \
cp xpdf/pdftotext /opt && \
cd .. && \
rm -rf xpdf-4.04
Make sure that the pdftotext command is available in your terminal after the installation. If xpdf is not installed correctly, you will see errors like:
pdftotext is not installed. It is part of xpdf or poppler-utils software suite.
See the xpdf website for more information about the installation procedure.
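To check for the binary from Python instead, here's a minimal sketch (nothing Haystack-specific):

import shutil

# shutil.which returns the executable's full path, or None if pdftotext is not on the PATH.
if shutil.which("pdftotext") is None:
    print("pdftotext not found - install the xpdf binaries first")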
AzureConverter
The AzureConverter is based on Microsoft Azure's Form Recognizer service. Therefore, to use this converter, you need an active Azure account and a Form Recognizer or Cognitive Services resource. You can follow this guide to set this up.
The AzureConverter works with both searchable PDFs and image-only PDFs. Furthermore, it supports the following file formats: JPEG, PNG, BMP, and TIFF.
Unlike the PDFToTextConverter, the AzureConverter doesn’t extract the tables in a file as plain text but generates separate Document objects of type table that maintain the two-dimensional structure of the tables. This is useful in combination with the TableReader, for example.
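For example, to pass only the extracted tables on to a TableReader, you can filter the output by content_type (a minimal sketch; the endpoint, key, and file name are placeholders):

from pathlib import Path
from haystack.nodes import AzureConverter

converter = AzureConverter(endpoint="some-url", credential_key="my-secret-key")
docs = converter.convert(file_path=Path("my-file.pdf"), meta=None)

# Tables arrive as Documents with content_type "table" and keep their two-dimensional structure.
table_docs = [doc for doc in docs if doc.content_type == "table"]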
ParsrConverter
The ParsrConverter uses the open-source Parsr tool by axa-group. To use this converter, you need to run a Parsr Docker container. To start a Parsr Docker container, run:
docker run -p 3001:3001 axarev/parsr
The ParsrConverter works with searchable PDFs containing a text layer as well as DOCX files. Like the AzureConverter, it doesn’t extract the tables in a file as plain text but generates separate Document objects of type table that maintain the two-dimensional structure of the tables. This is useful in combination with the TableReader, for example.
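Here's a minimal sketch of converting a file with that container (the parsr_url below assumes the default port mapping from the command above):

from pathlib import Path
from haystack.nodes import ParsrConverter

# Points at the Parsr container started above; adjust the URL if you mapped a different port.
converter = ParsrConverter(parsr_url="http://localhost:3001")
docs = converter.convert(file_path=Path("my-file.pdf"), meta=None)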
One downside of the ParsrConverter is that it can be extremely slow, especially for PDFs containing many pages.
TikaConverter
The TikaConverter uses the Apache Tika toolkit. To use this converter, you need to run a Tika Docker container. To start a Tika Docker container, run:
docker run -p 9998:9998 apache/tika:1.28.4
The TikaConverter works with a wide range of different formats, including searchable PDFs, DOCX files, PPT files, and many more. It might therefore be useful in your indexing pipelines if you expect a variety of file formats and don’t want to route the different file formats to a dedicated converter using the FileClassifier node.
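Here's a minimal sketch of converting a file with that container (the tika_url below matches the default port mapping from the command above; the file name is hypothetical):

from pathlib import Path
from haystack.nodes import TikaConverter

# Points at the Tika container started above.
converter = TikaConverter(tika_url="http://localhost:9998/tika")
docs = converter.convert(file_path=Path("my-file.pdf"), meta=None)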
Like the PDFToTextConverter, the TikaConverter does not extract tables as separate Documents of type table but treats them as plain text. You can discard numeric tables by setting remove_numeric_tables to True.
JsonConverter
JsonConverter takes JSON or JSONL files, parses them, and casts them into Documents. It accepts two formats:
A JSON file with a list of Document dictionaries:
[
  {
    "content": [
      ["Language", "Imperative", "OO"],
      ["C", "Yes", "No"],
      ["Haskell", "No", "No"],
      ["Python", "Yes", "Yes"]
    ],
    "content_type": "table",
    "score": null,
    "meta": {"context": "Programming Languages", "page": 2},
    "id_hash_keys": ["content"],
    "embedding": null,
    "id": "2fa0a0ee3494507df3b6404d01bfeea0"
  },
  {
    "content": "Programming languages are used for controlling the behavior of a machine (often a computer).",
    "content_type": "text",
    "score": null,
    "meta": {"context": "Programming Languages", "page": 1},
    "id_hash_keys": ["content"],
    "embedding": null,
    "id": "b53a5cc03658cc1636ab38b72ab59cd4"
  },
  {
    "content": [
      ["Language", "Statements ratio", "Line ratio"],
      ["C", 1, 1.0],
      ["Python", 6, 6.5]
    ],
    "content_type": "table",
    "score": null,
    "meta": {"context": "Expressiveness", "page": 3},
    "id_hash_keys": ["content"],
    "embedding": null,
    "id": "8b9c2de764b5b1ea33b7092d359b44c1"
  }
]
A JSONL file with every line containing either a Document dictionary or a list of dictionaries:
{"content": [["Language", "Imperative", "OO"], ["C", "Yes", "No"], ["Haskell", "No", "No"], ["Python", "Yes", "Yes"]], "content_type": "table", "score": null, "meta": {"context": "Programming Languages", "page": 2}, "id_hash_keys": ["content"], "embedding": null, "id": "2fa0a0ee3494507df3b6404d01bfeea0"}
{"content": "Programming languages are used for controlling the behavior of a machine (often a computer).", "content_type": "text", "score": null, "meta": {"context": "Programming Languages", "page": 1}, "id_hash_keys": ["content"], "embedding": null, "id": "b53a5cc03658cc1636ab38b72ab59cd4"}
{"content": [["Language", "Statements ratio", "Line ratio"], ["C", 1, 1.0], ["Python", 6, 6.5]], "content_type": "table", "score": null, "meta": {"context": "Expressiveness", "page": 3}, "id_hash_keys": ["content"], "embedding": null, "id": "8b9c2de764b5b1ea33b7092d359b44c1"}
The number of Document objects it creates matches the number of Document dictionaries in the input file.
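As a quick sketch of producing such a file yourself (the file name is arbitrary), you can dump Document dictionaries to JSONL and read them back with JsonConverter:

import json
from pathlib import Path
from haystack import Document
from haystack.nodes import JsonConverter

docs = [Document(content="Programming languages are used for controlling the behavior of a machine.")]

# Write one Document dictionary per line (the JSONL format shown above)...
with open("docs.jsonl", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc.to_dict()) + "\n")

# ...and read the Documents back.
reloaded = JsonConverter().convert(file_path=Path("docs.jsonl"), meta=None)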