FileConverters
Use FileConverters to extract text from files in different formats and cast it into the unified Document format. A number of converters are available for PDFs, images, DOCX files, and more.
Position in a Pipeline | At the very beginning of an indexing Pipeline |
Input | Filename |
Output | Documents |
Classes | PDFToTextConverter, PDFToTextOCRConverter, DocxToTextConverter, AzureConverter, ImageToTextConverter, MarkdownConverter, ParsrConverter, TikaConverter, TextConverter |
Tutorial
To see an example of file converters in a pipeline, see the advanced indexing tutorial.
File Converter Classes
Here's what each type of file converter can do:
PDFToTextConverter: Extracts text from a PDF file using the pdftotext library.
PDFToTextOCRConverter: Extracts text from PDF files that contain images using the pytesseract library.
DocxToTextConverter: Extracts text from .docx files.
AzureConverter: Extracts text and tables from files in the following formats: PDF, JPEG, PNG, BMP, and TIFF. Uses Microsoft Azure's Form Recognizer. To use this converter, you must have an active Azure account and a Form Recognizer or Cognitive Services resource. For more information, see Form Recognizer.
ImageToTextConverter: Extracts text from image files using the pytesseract library.
MarkdownConverter: Converts markdown to plain text.
ParsrConverter: Extracts text and tables from PDF and .docx files using the open-source Parsr by axa-group.
TikaConverter: Converts files into Documents using Apache Tika.
TextConverter: Preprocesses text files and returns documents.
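As a rough illustration of how the list above maps file types to converters, the sketch below picks a converter class name from a file's extension. The `CONVERTER_BY_SUFFIX` table and the `pick_converter_name` helper are hypothetical names invented for this example and are not part of Haystack. The mapping only restates the descriptions above, and it returns class names as strings so that the snippet runs without Haystack installed.

```python
from pathlib import Path
from typing import Optional

# Hypothetical lookup table (not a Haystack API): file suffix -> converter class name,
# restating the converter descriptions listed above.
CONVERTER_BY_SUFFIX = {
    ".pdf": "PDFToTextConverter",
    ".docx": "DocxToTextConverter",
    ".md": "MarkdownConverter",
    ".txt": "TextConverter",
    ".png": "ImageToTextConverter",
}


def pick_converter_name(file_path: str) -> Optional[str]:
    """Return the name of a converter class suited to the file, or None if unknown."""
    suffix = Path(file_path).suffix.lower()  # case-insensitive match on the extension
    return CONVERTER_BY_SUFFIX.get(suffix)


if __name__ == "__main__":
    for name in ["report.pdf", "notes.DOCX", "README.md", "scan.png", "data.csv"]:
        print(name, "->", pick_converter_name(name))
```

The extension alone cannot settle every case: a PDF made of scanned images calls for PDFToTextOCRConverter, and documents with tables may be better served by AzureConverter or ParsrConverter, so treat a lookup like this as a default rather than a rule.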
Usage
The examples below show how to initialize each converter:
from pathlib import Path
from haystack.nodes import PDFToTextConverter
converter = PDFToTextConverter(
    remove_numeric_tables=True,
    valid_languages=["de", "en"]
)
# convert() takes the path to a single file and returns Documents
docs = converter.convert(file_path=Path("my-file.pdf"), meta=None)
# Alternatively, if you have a PDF containing images, Haystack uses the pytesseract library under the hood for optical character recognition of image PDFs.
from pathlib import Path
from haystack.nodes import PDFToTextOCRConverter
converter = PDFToTextOCRConverter(
    remove_numeric_tables=False,
    valid_languages=["deu", "eng"]
)
docs = converter.convert(file_path=Path("my-file.pdf"), meta=None)
from pathlib import Path
from haystack.nodes import DocxToTextConverter
converter = DocxToTextConverter(
    remove_numeric_tables=True,
    valid_languages=["de", "en"]
)
docs = converter.convert(file_path=Path("my-file.docx"), meta=None)
# We recommend the Azure Form Recognizer service for parsing tables from PDFs or other complex document structures.
# (https://azure.microsoft.com/en-us/services/form-recognizer/)
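from pathlib import Path
from haystack.nodes import AzureConverter
# Replace the placeholder endpoint and key with the values of your own Azure resource
converter = AzureConverter(
    endpoint="some-url",
    credential_key="my-secret-key"
)
docs = converter.convert(file_path=Path("my-file.pdf"), meta=None)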
# Haystack supports extraction of text from images using OCR.
from pathlib import Path
from haystack.nodes import ImageToTextConverter
converter = ImageToTextConverter(
    remove_numeric_tables=True,
    valid_languages=["de", "en"]
)
docs = converter.convert(file_path=Path("my-file.png"), meta=None)
from pathlib import Path
from haystack.nodes import MarkdownConverter
converter = MarkdownConverter(
    remove_numeric_tables=True,
    valid_languages=["de", "en"]
)
docs = converter.convert(file_path=Path("my-file.md"), meta=None)
Haystack also has a convert_files_to_docs() utility function that converts all txt or pdf files in a given directory.
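from haystack.utils.preprocessing import convert_files_to_docs
# Placeholder path: point doc_dir at the directory that holds your files
doc_dir = "path/to/your/files"
docs = convert_files_to_docs(dir_path=doc_dir)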
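To preview which files a directory-level conversion like this would consider, the sketch below lists the txt and pdf files under a directory using only the standard library. `list_convertible_files` is a hypothetical helper written for this example, not Haystack's implementation, and searching subdirectories recursively is an assumption made here rather than documented behavior.

```python
from pathlib import Path
from typing import List

# Suffixes named in the text above; anything else is skipped in this sketch.
ALLOWED_SUFFIXES = {".txt", ".pdf"}


def list_convertible_files(dir_path: str) -> List[Path]:
    """Return the .txt and .pdf files under dir_path, sorted for stable output."""
    root = Path(dir_path)
    return sorted(
        p for p in root.rglob("*")  # assumption: subdirectories are included
        if p.is_file() and p.suffix.lower() in ALLOWED_SUFFIXES
    )


if __name__ == "__main__":
    for path in list_convertible_files("."):
        print(path)
```

Running a check like this first makes it easier to spot files with unexpected extensions before starting the conversion.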