PDFMinerToDocument

A component that converts complex PDF files to documents using the pdfminer.six library and its layout-analysis arguments.


Most common position in a pipeline	Before PreProcessors or right at the beginning of an indexing pipeline
Mandatory run variables	"sources": PDF file paths or `ByteStream` objects
Output variables	"documents": A list of documents
API reference	Converters
GitHub link	https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/pdfminer.py

Overview

The PDFMinerToDocument component converts PDF files into documents using the PDFMiner extraction tool and its layout-analysis arguments.

You can use it in an indexing pipeline to index the contents of a PDF file in a Document Store. It takes a list of file paths or ByteStream objects as input and outputs the converted result as a list of documents. Optionally, you can attach metadata to the documents through the meta input parameter.

When initializing the component, you can adjust several parameters to fit your PDF. See the full parameter list and descriptions in our API reference.
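As a rough sketch of what adjusting these parameters can look like: the keyword names below are assumed to mirror the layout-analysis fields of pdfminer.six as exposed by the component, and the values are purely illustrative. `LAYOUT_SETTINGS` and `make_converter` are hypothetical names used only for this example, so check the API reference before relying on any of them.

```python
from typing import Any, Dict

# Illustrative values only (assumption): tune them for your own PDFs.
LAYOUT_SETTINGS: Dict[str, Any] = {
    "char_margin": 3.0,   # how far apart characters may be and still share a line
    "line_margin": 0.3,   # how far apart lines may be and still share a paragraph
    "word_margin": 0.1,   # gap size at which a space is inserted between words
    "boxes_flow": 0.5,    # weight of horizontal vs. vertical position when ordering text boxes
}


def make_converter(settings: Dict[str, Any] = LAYOUT_SETTINGS):
    """Build a PDFMinerToDocument with custom layout settings (hypothetical helper)."""
    # Imported lazily so the settings can be inspected without haystack installed.
    from haystack.components.converters import PDFMinerToDocument

    return PDFMinerToDocument(**settings)
```

Keeping the settings in a plain dictionary makes it easy to try several combinations on the same file and compare the extracted text.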

Usage

First, install the pdfminer.six package to start using this converter:

pip install pdfminer.six

On its own

from datetime import datetime

from haystack.components.converters import PDFMinerToDocument

converter = PDFMinerToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]

print(documents[0].content)

# 'This is a text from the PDF file.'
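The meta parameter can also carry different metadata for each source. The sketch below assumes that, as with other Haystack converters, meta accepts a list with one dictionary per source in the same order as sources. The helper names `build_meta` and `convert_with_meta` and the metadata keys are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List


def build_meta(paths: List[str]) -> List[Dict[str, Any]]:
    """Return one metadata dict per source, in the same order as `paths`."""
    added = datetime.now(timezone.utc).isoformat()
    # "file_name" and "date_added" are illustrative keys, not required by the component.
    return [{"file_name": Path(p).name, "date_added": added} for p in paths]


def convert_with_meta(paths: List[str]):
    """Convert PDFs, attaching a separate metadata dict to each source."""
    # Imported lazily so build_meta stays usable without haystack installed.
    from haystack.components.converters import PDFMinerToDocument

    converter = PDFMinerToDocument()
    result = converter.run(sources=paths, meta=build_meta(paths))
    return result["documents"]
```

Calling `convert_with_meta(["sample.pdf"])` would then return documents whose metadata includes the per-file entries built above.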

In a pipeline

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters import PDFMinerToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", PDFMinerToDocument())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=5))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "writer")

# Placeholder list of PDF paths; replace with your own files.
file_names = ["sample.pdf"]

pipeline.run({"converter": {"sources": file_names}})
