DoclingConverter
DoclingConverter converts PDF, DOCX, HTML, and other document formats to Haystack Documents using Docling, a document parsing library that understands document structure including layout, tables, and headings.
| Most common position in a pipeline | Before PreProcessors, or right at the beginning of an indexing pipeline |
| Mandatory run variables | sources: A list of file paths, URLs, or ByteStream objects |
| Output variables | documents: A list of documents |
| API reference | Docling |
| GitHub link | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/docling |
| Package name | docling-haystack |
Overview
The DoclingConverter takes a list of file paths, URLs, or ByteStream objects and uses Docling to parse them into a rich document representation that captures layout, tables, headings, and other structural elements.
The component supports three export modes, controlled by the export_type parameter:
- ExportType.DOC_CHUNKS (default): Chunks each document using Docling's HybridChunker and returns one Document per chunk. Chunk metadata includes structural context from Docling. Use this mode for indexing pipelines where downstream retrieval benefits from semantically coherent chunks.
- ExportType.MARKDOWN: Exports each input document as a single Markdown string in one Document. Use this mode when you want to preserve the full document content as formatted text.
- ExportType.JSON: Serializes the full Docling document to a JSON string in one Document. Use this mode when you need access to the complete structured representation.
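In ExportType.JSON mode the Document content is a JSON string, so it can be parsed back into a dictionary with the standard library. The sketch below is illustrative only: sample_content is a small made-up stand-in for a real export, the helper headings_from_json is hypothetical, and the key names it reads ("texts", "label", "text") are assumptions about the Docling JSON schema that may differ between Docling versions.

```python
import json

# Made-up stand-in for Document.content in JSON mode (assumption: a real
# export is far larger and its exact keys depend on the Docling version).
sample_content = json.dumps(
    {
        "name": "report",
        "texts": [
            {"label": "section_header", "text": "Introduction"},
            {"label": "text", "text": "Docling parses documents."},
        ],
    }
)


def headings_from_json(content: str) -> list:
    """Hypothetical helper: collect the text of items labeled as section headers."""
    data = json.loads(content)
    return [
        item["text"]
        for item in data.get("texts", [])
        if item.get("label") == "section_header"
    ]


print(headings_from_json(sample_content))  # ['Introduction']
```

With a real converter, the same helper would be called on documents[0].content after running DoclingConverter(export_type=ExportType.JSON), provided the assumed keys match the exported schema.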
You can customize parsing behavior by passing a pre-configured DocumentConverter instance via the converter parameter, and pass additional keyword arguments to Docling's conversion step via convert_kwargs. For ExportType.MARKDOWN, use md_export_kwargs to control Markdown rendering options (for example, image placeholder text). For ExportType.DOC_CHUNKS, provide a custom BaseChunker instance via the chunker parameter.
Document metadata is populated by a MetaExtractor instance. The default MetaExtractor adds Docling-specific metadata (chunk structure or document origin) under the dl_meta key. You can supply a custom BaseMetaExtractor implementation via the meta_extractor parameter. Additional metadata can be attached to all output Documents by passing a dictionary to the meta run parameter, or per source by passing a list of dictionaries.
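The two shapes accepted by the meta run parameter can be pictured as a normalization step that produces one metadata dictionary per source. The following is a minimal sketch of that idea, not the integration's actual code: broadcast_meta is a hypothetical helper, and raising an error when the list length does not match the number of sources is an assumption.

```python
from typing import Any, Dict, List, Optional, Union

Meta = Dict[str, Any]


def broadcast_meta(
    meta: Optional[Union[Meta, List[Meta]]], sources_count: int
) -> List[Meta]:
    """Hypothetical helper: return one metadata dict per source."""
    if meta is None:
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dict is copied so each source gets its own independent mapping.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        # Assumption: a mismatched list is treated as an error.
        raise ValueError("The meta list must have one entry per source.")
    return [dict(m) for m in meta]


shared = broadcast_meta({"project": "research"}, 2)
per_source = broadcast_meta([{"title": "Report A"}, {"title": "Report B"}], 2)
print(shared)      # [{'project': 'research'}, {'project': 'research'}]
print(per_source)  # [{'title': 'Report A'}, {'title': 'Report B'}]
```

In DOC_CHUNKS mode a source usually yields several Documents, so each per-source dictionary would be attached to every chunk produced from that source.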
Usage
Install the Docling integration:
pip install docling-haystack
On its own
from haystack_integrations.components.converters.docling import (
    DoclingConverter,
    ExportType,
)
# Default: chunk-based output
converter = DoclingConverter()
result = converter.run(sources=["report.pdf", "notes.docx"])
documents = result["documents"]
# Full document as Markdown
converter = DoclingConverter(export_type=ExportType.MARKDOWN)
result = converter.run(sources=["report.pdf"])
documents = result["documents"]
print(documents[0].content)
In a pipeline
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.docling import DoclingConverter
document_store = InMemoryDocumentStore()
pipeline = Pipeline()
pipeline.add_component("converter", DoclingConverter())
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "writer")
pipeline.run({"converter": {"sources": ["report.pdf", "manual.docx"]}})
Because DoclingConverter with ExportType.DOC_CHUNKS already chunks the documents, you typically don't need a separate DocumentSplitter in the pipeline.
Additional Features
Custom chunking
Provide a custom Docling chunker to control how documents are split:
from docling.chunking import HybridChunker
from haystack_integrations.components.converters.docling import DoclingConverter
chunker = HybridChunker(tokenizer="BAAI/bge-small-en-v1.5", max_tokens=256)
converter = DoclingConverter(chunker=chunker)
result = converter.run(sources=["report.pdf"])
Attaching metadata
Pass a single dictionary to apply metadata to all output Documents, or a list to set metadata per source:
from haystack_integrations.components.converters.docling import DoclingConverter
converter = DoclingConverter()
# Same metadata for all sources
result = converter.run(
    sources=["a.pdf", "b.pdf"],
    meta={"project": "research"},
)
# Per-source metadata
result = converter.run(
    sources=["a.pdf", "b.pdf"],
    meta=[{"title": "Report A"}, {"title": "Report B"}],
)
Processing in-memory files
Pass ByteStream objects to convert files loaded into memory. Set file_path in the ByteStream metadata so Docling can detect the file format:
from haystack.dataclasses import ByteStream
from haystack_integrations.components.converters.docling import DoclingConverter
# Read the file into memory as raw bytes
with open("report.pdf", "rb") as f:
    data = f.read()
source = ByteStream(data=data, meta={"file_path": "report.pdf"})
converter = DoclingConverter()
result = converter.run(sources=[source])