FileConverter
Use FileConverters to extract text from files in different formats and cast it into the unified Document format. There are a number of converters available for converting PDFs, images, DOCX files, and more.
Position in a Pipeline | Either at the very beginning of an indexing Pipeline or after a FileClassifier |
Input | File name |
Output | Documents |
Classes | AzureConverter, CsvTextConverter, DocxToTextConverter, ImageToTextConverter, JsonConverter, MarkdownConverter, PDFToTextConverter, PDFToTextOCRConverter, ParsrConverter, TikaConverter, TextConverter |
Tutorial
To see an example of file converters in a pipeline, see the DocumentClassifier at Index Time tutorial.
File Converter Classes
Here's what each of the file converter types can do:
AzureConverter | Extracts text and tables from files in the following formats: PDF, JPEG, PNG, BMP, and TIFF. Uses Microsoft Azure's Form Recognizer. To use this converter, you must have an active Azure account and a Form Recognizer or Cognitive Services resource. For more information, see Form Recognizer. |
CsvTextConverter | Converts CSV files containing question-answer pairs into text Documents. The CSV file must have two columns: question and answer. The first column is interpreted as the question and the second as the answer. Tip: You can use it to search your FAQs, as it preserves the question-answer pairs in the Documents. See also the FAQ-style QA Tutorial. Note: It doesn't handle tabular data. |
DocxToTextConverter | Extracts text from DOCX files. |
ImageToTextConverter | Extracts text from image files using Pytesseract. |
JsonConverter | Extracts text from JSON and JSONL files and casts it into Document objects. |
MarkdownConverter | Converts Markdown to plain text. |
PDFToTextConverter | Extracts text from a PDF file using the pdftotext library. |
PDFToTextOCRConverter | Extracts text from PDF files that contain images using Pytesseract. |
ParsrConverter | Extracts text and tables from PDF and .docx files using the open-source Parsr by axa-group. |
TikaConverter | Converts files into Documents using Apache Tika. |
TextConverter | Preprocesses text files and returns documents. |
Usage
The examples below show how to initialize each converter:
from pathlib import Path
from haystack.nodes import PDFToTextConverter
converter = PDFToTextConverter(
remove_numeric_tables=True,
valid_languages=["de","en"]
)
docs = converter.convert(file_path=Path("my-file.pdf"), meta=None)
# Alternatively, if you have a PDF containing images, use the PDFToTextOCRConverter.
# Haystack uses the pytesseract library under the hood for optical character recognition.
from pathlib import Path
from haystack.nodes import PDFToTextOCRConverter
converter = PDFToTextOCRConverter(
remove_numeric_tables=False,
valid_languages=["deu","eng"]
)
docs = converter.convert(file_path=Path("my-file.pdf"), meta=None)
from pathlib import Path
from haystack.nodes import DocxToTextConverter
converter = DocxToTextConverter(
remove_numeric_tables=True,
valid_languages=["de","en"]
)
docs = converter.convert(file_path=Path("my-file.docx"), meta=None)
# We recommend the Azure Form Recognizer service for parsing tables from PDFs or other complex document structures.
# (https://azure.microsoft.com/en-us/services/form-recognizer/)
from pathlib import Path
from haystack.nodes import AzureConverter
converter = AzureConverter(
endpoint="some-url",
credential_key="my-secret-key"
)
docs = converter.convert(file_path=Path("my-file.pdf"), meta=None)
# Haystack supports extraction of text from images using OCR.
from pathlib import Path
from haystack.nodes import ImageToTextConverter
converter = ImageToTextConverter(
remove_numeric_tables=True,
valid_languages=["de","en"]
)
docs = converter.convert(file_path=Path("my-file.png"), meta=None)
from pathlib import Path
from haystack.nodes import MarkdownConverter
converter = MarkdownConverter(
remove_numeric_tables=True,
valid_languages=["de","en"]
)
docs = converter.convert(file_path=Path("my-file.md"), meta=None)
from pathlib import Path
from haystack.nodes import CsvTextConverter
converter = CsvTextConverter()
docs = converter.convert(file_path=Path("my-file.csv"), meta=None)
from pathlib import Path
from haystack.nodes import JsonConverter
converter = JsonConverter()
docs = converter.convert(file_path=Path("data_file.json"), meta=None)
You can write the Documents a FileConverter generates directly into a DocumentStore using an indexing pipeline. To do so, run:
import os
from haystack import Pipeline
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import TextConverter
doc_dir = "data"  # directory that contains the files to index
document_store = InMemoryDocumentStore()  # or any other initialized DocumentStore
indexing_pipeline = Pipeline()
text_converter = TextConverter()
indexing_pipeline.add_node(component=text_converter, name="TextConverter", inputs=["File"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["TextConverter"])
files_to_index = [doc_dir + "/" + f for f in os.listdir(doc_dir)]
indexing_pipeline.run_batch(file_paths=files_to_index)
Choosing the Right PDF Converter
Haystack provides several options for converting PDFs into Documents. Because there are so many different types of PDF files, this guide helps you pick the right converter. Each tool has different strengths and weaknesses, so choose one based on your use case.
Text in PDF isn’t always one continuous stream. Often there are side notes, comment bubbles, tables, and columns. Readers can choose to read these elements in different orders.
Unfortunately, the PDF format doesn't set strict standards regarding the order of the text stored in PDF files. Therefore, we can't guarantee that the extracted reading order of your PDFs corresponds to how a typical human would read the document.
This issue is rare and appears only with PDFs that have more complex layouts, such as multi-column PDFs.
If you observe this behavior with one of the FileConverters, we recommend trying another one on your files, as each tool uses a slightly different technique to determine the reading order of a PDF.
PDFToTextConverter
The PDFToTextConverter is a fast and lightweight PDF converter that uses the pdftotext tool to convert PDF files to plain text. It works well with most digitally created or searchable PDFs containing a text layer.
The PDFToTextConverter cannot extract the text of image-only PDFs (for example, scanned documents). It doesn't extract tables as separate Documents of type table but treats them as plain text. Numerical tables can be discarded by setting remove_numeric_tables to True.
PDFToTextOCRConverter
The PDFToTextOCRConverter performs optical character recognition (OCR) on your PDF files before extracting the text using the pytesseract tool. It therefore works with image-only PDFs as well. Like the PDFToTextConverter, it doesn't extract tables as separate Documents of type table but treats them as plain text. Numerical tables can be discarded by setting remove_numeric_tables to True.
If your files consist mainly of searchable PDFs containing a text layer, we recommend using the PDFToTextConverter to reduce the overhead of applying OCR to your files.
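If you're not sure whether your PDFs contain a text layer, one pragmatic approach is to try the PDFToTextConverter first and fall back to OCR only when no text comes back. Here is a minimal sketch of that idea; the file name is a placeholder:
from pathlib import Path
from haystack.nodes import PDFToTextConverter, PDFToTextOCRConverter

pdf_path = Path("my-file.pdf")  # placeholder path

# Fast text-layer extraction first.
docs = PDFToTextConverter().convert(file_path=pdf_path, meta=None)

# If no text came back, the PDF is probably image-only, so fall back to OCR.
if not any(doc.content.strip() for doc in docs):
    docs = PDFToTextOCRConverter().convert(file_path=pdf_path, meta=None)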
AzureConverter
The AzureConverter is based on Microsoft Azure's Form Recognizer service. To use this converter, you need an active Azure account and a Form Recognizer or Cognitive Services resource. You can follow this guide to set this up.
The AzureConverter works with both searchable and image-only PDFs. It also supports the following file formats: JPEG, PNG, BMP, and TIFF.
Unlike the PDFToTextConverter and PDFToTextOCRConverter, the AzureConverter doesn't extract the tables in a file as plain text but generates separate Document objects of type table that maintain the two-dimensional structure of the tables. This is useful in combination with the TableReader, for example.
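For example, you could pass the table Documents the AzureConverter produces straight to a TableReader. The following is a rough sketch; the endpoint, key, model name, and query are placeholders:
from pathlib import Path
from haystack.nodes import AzureConverter, TableReader

converter = AzureConverter(endpoint="some-url", credential_key="my-secret-key")
docs = converter.convert(file_path=Path("my-file.pdf"), meta=None)

# Keep only the table Documents and query them with a table question answering model.
tables = [doc for doc in docs if doc.content_type == "table"]
reader = TableReader(model_name_or_path="google/tapas-base-finetuned-wtq")
prediction = reader.predict(query="Which language is object-oriented?", documents=tables, top_k=3)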
JsonConverter
The JsonConverter takes JSON or JSONL files, parses them, and casts them into Documents. It accepts two formats:
A JSON file with a list of Document dictionaries:
[
{
"content": [
[
"Language",
"Imperative",
"OO"
],
[
"C",
"Yes",
"No"
],
[
"Haskell",
"No",
"No"
],
[
"Python",
"Yes",
"Yes"
]
],
"content_type": "table",
"score": null,
"meta": {
"context": "Programming Languages",
"page": 2
},
"id_hash_keys": [
"content"
],
"embedding": null,
"id": "2fa0a0ee3494507df3b6404d01bfeea0"
},
{
"content": "Programming languages are used for controlling the behavior of a machine (often a computer).",
"content_type": "text",
"score": null,
"meta": {
"context": "Programming Languages",
"page": 1
},
"id_hash_keys": [
"content"
],
"embedding": null,
"id": "b53a5cc03658cc1636ab38b72ab59cd4"
},
{
"content": [
[
"Language",
"Statements ratio",
"Line ratio"
],
[
"C",
1,
1.0
],
[
"Python",
6,
6.5
]
],
"content_type": "table",
"score": null,
"meta": {
"context": "Expressiveness",
"page": 3
},
"id_hash_keys": [
"content"
],
"embedding": null,
"id": "8b9c2de764b5b1ea33b7092d359b44c1"
}
]
A JSONL file where every line contains either a Document dictionary or a list of dictionaries:
{"content": [["Language", "Imperative", "OO"], ["C", "Yes", "No"], ["Haskell", "No", "No"], ["Python", "Yes", "Yes"]], "content_type": "table", "score": null, "meta": {"context": "Programming Languages", "page": 2}, "id_hash_keys": ["content"], "embedding": null, "id": "2fa0a0ee3494507df3b6404d01bfeea0"}
{"content": "Programming languages are used for controlling the behavior of a machine (often a computer).", "content_type": "text", "score": null, "meta": {"context": "Programming Languages", "page": 1}, "id_hash_keys": ["content"], "embedding": null, "id": "b53a5cc03658cc1636ab38b72ab59cd4"}
{"content": [["Language", "Statements ratio", "Line ratio"], ["C", 1, 1.0], ["Python", 6, 6.5]], "content_type": "table", "score": null, "meta": {"context": "Expressiveness", "page": 3}, "id_hash_keys": ["content"], "embedding": null, "id": "8b9c2de764b5b1ea33b7092d359b44c1"}
The converter creates one Document object for each Document dictionary in the input file.
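As a sketch of how the round trip works, you can serialize existing Document objects to a JSONL file with Document.to_dict() and load them back with the JsonConverter. The file name is a placeholder:
import json
from pathlib import Path
from haystack import Document
from haystack.nodes import JsonConverter

docs = [Document(content="Programming languages are used for controlling the behavior of a machine.")]

# Write one Document dictionary per line.
with open("docs.jsonl", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc.to_dict()) + "\n")

# Load the Documents back.
loaded_docs = JsonConverter().convert(file_path=Path("docs.jsonl"), meta=None)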
ParsrConverter
The ParsrConverter uses the open-source Parsr tool by axa-group. To use this converter, you need a running Parsr Docker container. To start one, run:
docker run -p 3001:3001 axarev/parsr
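Once the container is running, point the converter at it. This is a minimal sketch; the parsr_url shown assumes the container is listening on the default local port:
from pathlib import Path
from haystack.nodes import ParsrConverter

converter = ParsrConverter(parsr_url="http://localhost:3001")
docs = converter.convert(file_path=Path("my-file.pdf"), meta=None)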
The ParsrConverter works with searchable PDFs containing a text layer as well as DOCX files. Like the AzureConverter, it doesn't extract the tables in a file as plain text but generates separate Document objects of type table that maintain the two-dimensional structure of the tables. This is useful in combination with the TableReader, for example.
One downside of the ParsrConverter is that it can be extremely slow, especially for PDFs containing many pages.
TikaConverter
The TikaConverter uses the Apache Tika toolkit. To use this converter, you need a running Tika Docker container. To start one, run:
docker run -p 9998:9998 apache/tika:1.28.4
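Once the container is running, point the converter at it. This is a minimal sketch; the tika_url shown assumes the container is listening on the default local port:
from pathlib import Path
from haystack.nodes import TikaConverter

converter = TikaConverter(tika_url="http://localhost:9998/tika")
docs = converter.convert(file_path=Path("my-file.pdf"), meta=None)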
The TikaConverter works with a wide range of formats, including searchable PDFs, DOCX files, PPT files, and many more. It can therefore be useful in your indexing pipelines if you expect a variety of file formats and don't want to route each file format to a dedicated converter using the FileClassifier node (an example of such routing is sketched below).
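If you do want such routing, a minimal sketch of the classifier-plus-converter part of an indexing pipeline could look like the following. It assumes the FileTypeClassifier's default supported types (txt, pdf, md, docx, html), so output_1 carries .txt files and output_2 carries .pdf files; verify this mapping against your Haystack version:
from haystack import Pipeline
from haystack.nodes import FileTypeClassifier, PDFToTextConverter, TextConverter

pipeline = Pipeline()
pipeline.add_node(component=FileTypeClassifier(), name="FileTypeClassifier", inputs=["File"])
# Route .txt files to the TextConverter and .pdf files to the PDFToTextConverter.
pipeline.add_node(component=TextConverter(), name="TextConverter", inputs=["FileTypeClassifier.output_1"])
pipeline.add_node(component=PDFToTextConverter(), name="PDFConverter", inputs=["FileTypeClassifier.output_2"])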
Like the PDFToTextConverter and PDFToTextOCRConverter, the TikaConverter doesn't extract tables as separate Documents of type table but treats them as plain text. Numerical tables can be discarded by setting remove_numeric_tables to True.