
PDFMinerToDocument

A component that converts complex PDF files to Documents using pdfminer arguments.

Name: PDFMinerToDocument
Folder path: /converters/
Most common position in a pipeline: Before PreProcessors or right at the beginning of an indexing pipeline
Mandatory input variables: "sources": PDF file paths or ByteStream objects
Output variables: "documents": a list of Documents

Overview

The PDFMinerToDocument component converts PDF files into documents using the PDFMiner extraction tool, and it accepts PDFMiner's arguments so you can tune how text is extracted.

You can use it in an indexing pipeline to index the contents of a PDF file in a Document Store. It takes a list of file paths or ByteStream objects as input and outputs the converted result as a list of documents. Optionally, you can attach metadata to the documents through the meta input parameter.
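
The sketch below illustrates one way metadata can be paired with sources. It assumes the meta input accepts either a single dictionary, applied to every source, or a list with one dictionary per source; check the API reference to confirm this for your version. The helper `pair_sources_with_meta` is hypothetical and written only for illustration. It is not part of Haystack and does not reproduce the library's implementation.

```python
from typing import Any, Dict, List, Optional, Tuple, Union

Meta = Dict[str, Any]


def pair_sources_with_meta(
    sources: List[str],
    meta: Optional[Union[Meta, List[Meta]]] = None,
) -> List[Tuple[str, Meta]]:
    """Hypothetical helper: pair each source with its own metadata dictionary."""
    if meta is None:
        # No metadata supplied: every source gets an empty dictionary.
        metas = [{} for _ in sources]
    elif isinstance(meta, dict):
        # One dictionary: copy it so each source has an independent dictionary.
        metas = [dict(meta) for _ in sources]
    else:
        # A list: it must line up one-to-one with the sources.
        if len(meta) != len(sources):
            raise ValueError("meta list must have one entry per source")
        metas = [dict(m) for m in meta]
    return list(zip(sources, metas))


pairs = pair_sources_with_meta(["a.pdf", "b.pdf"], {"language": "en"})
```

Copying the dictionary per source keeps a later change to one document's metadata from leaking into the others.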

When initializing the component, you can adjust several parameters to fit your PDF. See the full parameter list and descriptions in our API reference.

Usage

First, install the pdfminer.six package to start using this converter:

pip install pdfminer.six

On its own

from datetime import datetime

from haystack.components.converters import PDFMinerToDocument

converter = PDFMinerToDocument()
# Attach the same metadata to the Documents produced from the sources
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]

print(documents[0].content)
# 'This is a text from the PDF file.'

In a Pipeline

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters import PDFMinerToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", PDFMinerToDocument())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=5))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "writer")

# Replace with the paths to your own PDF files
file_names = ["sample.pdf"]
pipeline.run({"converter": {"sources": file_names}})
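
The `file_names` value passed to the converter is a list of PDF paths. As a minimal sketch, assuming your PDFs live in a local folder, the list could be built with the standard library. The helper name `collect_pdf_paths` is hypothetical and not part of Haystack.

```python
from pathlib import Path
from typing import List


def collect_pdf_paths(folder: str) -> List[str]:
    """Hypothetical helper: return sorted paths of all PDF files under a folder."""
    root = Path(folder)
    # Walk the folder recursively and keep regular files whose extension is
    # .pdf, ignoring case so that files such as REPORT.PDF are also found.
    return sorted(
        str(path)
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

For example, `file_names = collect_pdf_paths("my_pdfs")` would gather every PDF below a folder named `my_pdfs` (an assumed location) before calling `pipeline.run`.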

Related Links

See the parameter details in our API reference.