PDFToImageContent
PDFToImageContent reads local PDF files and converts them into ImageContent objects. These are ready for multimodal AI pipelines, including tasks like image captioning, visual QA, or prompt-based generation.
| | |
| --- | --- |
| Most common position in a pipeline | Before a ChatPromptBuilder in a query pipeline |
| Mandatory run variables | "sources": A list of PDF file paths or ByteStreams |
| Output variables | "image_contents": A list of ImageContent objects |
| API reference | Image Converters |
| GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/image/pdf_to_image.py |
Overview
PDFToImageContent processes a list of PDF sources and converts them into ImageContent objects, one for each page of the PDF. These can be used in multimodal pipelines that require base64-encoded image input.
Each source can be:
- A file path (string or Path), or
- A ByteStream object (see the sketch below).
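For example, a minimal sketch mixing both source types (the file names are placeholders):

from pathlib import Path

from haystack.components.converters.image import PDFToImageContent
from haystack.dataclasses import ByteStream

converter = PDFToImageContent()

# One plain file path and one ByteStream built from raw PDF bytes
pdf_bytes = Path("report.pdf").read_bytes()
sources = ["file.pdf", ByteStream(data=pdf_bytes, mime_type="application/pdf")]

image_contents = converter.run(sources=sources)["image_contents"]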
Optionally, you can provide metadata using the meta parameter. This can be a single dictionary (applied to all images) or a list matching the length of sources.
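For instance, a minimal sketch of both forms, reusing the converter from the sketch above (the metadata keys are arbitrary):

# One dictionary applied to every resulting image
converter.run(sources=["file.pdf", "another_file.pdf"], meta={"project": "demo"})

# Or one dictionary per source, in the same order as sources
converter.run(
    sources=["file.pdf", "another_file.pdf"],
    meta=[{"project": "demo"}, {"project": "archive"}],
)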
Use the size parameter to resize images while preserving aspect ratio. This reduces memory usage and transmission size, which is helpful when working with remote models or limited-resource environments.
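A minimal sketch, assuming size is set when the component is created and takes a (width, height) tuple:

# Downscale each page so it fits within 800x800 pixels, keeping the aspect ratio
small_converter = PDFToImageContent(size=(800, 800))
image_contents = small_converter.run(sources=["file.pdf"])["image_contents"]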
This component is often used in query pipelines just before a ChatPromptBuilder.
Usage
On its own
from haystack.components.converters.image import PDFToImageContent
converter = PDFToImageContent()
sources = ["file.pdf", "another_file.pdf"]
image_contents = converter.run(sources=sources)["image_contents"]
print(image_contents)
# [ImageContent(base64_image='...',
# mime_type='application/pdf',
# detail=None,
# meta={'file_path': 'file.pdf', 'page_number': 1}),
# ...]
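The run method also accepts a page_range input to convert only specific pages; the same input is used in the pipeline example below. A minimal sketch reusing the converter from above:

# Convert only page 9 of the PDF
image_contents = converter.run(sources=["file.pdf"], page_range="9")["image_contents"]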
In a pipeline
Use PDFToImageContent to supply image data to a ChatPromptBuilder for multimodal QA or captioning with an LLM.
from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.converters.image import PDFToImageContent
# Query pipeline
pipeline = Pipeline()
pipeline.add_component("image_converter", PDFToImageContent(detail="auto"))
pipeline.add_component(
    "chat_prompt_builder",
    ChatPromptBuilder(
        required_variables=["question"],
        template="""{% message role="system" %}
You are a helpful assistant that answers questions using the provided images.
{% endmessage %}
{% message role="user" %}
Question: {{ question }}
{% for img in image_contents %}
{{ img | templatize_part }}
{% endfor %}
{% endmessage %}
""",
    ),
)
pipeline.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini"))
pipeline.connect("image_converter", "chat_prompt_builder.image_contents")
pipeline.connect("chat_prompt_builder", "llm")
sources = ["flan_paper.pdf"]
result = pipeline.run(
    data={
        "image_converter": {"sources": sources, "page_range": "9"},
        "chat_prompt_builder": {"question": "What is the main takeaway of Figure 6?"},
    }
)

print(result["llm"]["replies"][0].text)
# ('The main takeaway of Figure 6 is that Flan-PaLM demonstrates improved '
# 'performance in zero-shot reasoning tasks when utilizing chain-of-thought '
# '(CoT) reasoning, as indicated by higher accuracy across different model '
# 'sizes compared to PaLM without finetuning. This highlights the importance of '
# 'instruction finetuning combined with CoT for enhancing reasoning '
# 'capabilities in models.')
Additional References
🧑🍳 Cookbook: Introduction to Multimodality