Version: 2.29-unstable

DoclingServeConverter

DoclingServeConverter converts PDF, DOCX, HTML, and other document formats to Haystack Documents by calling a DoclingServe HTTP server. Unlike the local DoclingConverter, this component has no heavy ML dependencies — all document parsing happens on the remote server.

Most common position in a pipeline: Before PreProcessors, or right at the beginning of an indexing pipeline
Mandatory run variables: sources (a list of file paths, URLs, or ByteStream objects)
Output variables: documents (a list of documents)
API reference: Docling Serve
GitHub link: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/docling_serve
Package name: docling-serve-haystack

Overview

The DoclingServeConverter takes a list of file paths, URLs, or ByteStream objects and sends them to a running DoclingServe instance for parsing. Local files and ByteStream objects are uploaded to the /v1/convert/file endpoint; URL strings are sent to /v1/convert/source.
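
The routing rule described above can be pictured with a small sketch. This is not the component's source code: `pick_endpoint` is a hypothetical helper, raw bytes stand in for a ByteStream, and treating any string that starts with `http://` or `https://` as a URL is an assumption made for illustration.

```python
from pathlib import Path
from typing import Union

FILE_ENDPOINT = "/v1/convert/file"      # uploads: local files and in-memory bytes
SOURCE_ENDPOINT = "/v1/convert/source"  # remote documents referenced by URL


def pick_endpoint(source: Union[str, Path, bytes]) -> str:
    """Return the endpoint a source would be sent to (illustrative only)."""
    # Assumption: a string with an http(s) scheme counts as a URL.
    if isinstance(source, str) and source.lower().startswith(("http://", "https://")):
        return SOURCE_ENDPOINT
    # Everything else (paths, or raw bytes standing in for a ByteStream) is uploaded.
    return FILE_ENDPOINT
```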

The component supports three export modes, controlled by the export_type parameter:

  • ExportType.MARKDOWN (default): Returns the document content as a Markdown string. Use this mode when you want well-structured text output with formatting preserved.
  • ExportType.TEXT: Returns plain text extracted from the document. Use this mode when you need clean, unformatted text.
  • ExportType.JSON: Returns the full Docling document representation as a JSON string. Use this mode when you need access to the complete structured representation.
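
For the JSON mode, here is a minimal sketch of reading the result back into Python. It assumes the JSON string is stored in the Document's `content` field, as the text is for the other modes. `parse_json_export` is a hypothetical helper, and the stand-in object and its payload are placeholders, not real Docling output.

```python
import json
from types import SimpleNamespace
from typing import Any


def parse_json_export(document: Any) -> Any:
    """Decode the content of a Document produced with ExportType.JSON."""
    # Assumption: document.content holds the JSON string returned by the server.
    if document.content is None:
        raise ValueError("Document has no content to parse")
    return json.loads(document.content)


# Stand-in for a converted Document, used only to show the call.
example = SimpleNamespace(content='{"key": "value"}')
parsed = parse_json_export(example)
```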

Each source that converts successfully produces one Document in the output. Sources that fail to convert are skipped, and a warning is logged.
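
The skip-and-warn behaviour means the output list can be shorter than the input list. The sketch below illustrates that pattern in isolation. `convert_all` and `fake_convert` are hypothetical names, and the broad `except` is an assumption for illustration rather than a description of which errors the component actually catches.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger(__name__)


def convert_all(sources: Iterable[str], convert_one: Callable[[str], str]) -> List[str]:
    """Convert each source, skipping (and logging) the ones that fail."""
    converted = []
    for source in sources:
        try:
            converted.append(convert_one(source))
        except Exception as exc:  # illustrative catch-all
            logger.warning("Could not convert %s. Skipping it. Error: %s", source, exc)
    return converted


def fake_convert(source: str) -> str:
    """Stand-in converter that fails for one specific source."""
    if source == "broken.pdf":
        raise ValueError("unreadable file")
    return f"converted:{source}"


results = convert_all(["a.pdf", "broken.pdf", "b.pdf"], fake_convert)
```

Because failed sources leave no placeholder, it is safer not to pair output documents with input sources by list index.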

You can pass additional conversion options to the DoclingServe API via the convert_options parameter (for example, {"do_ocr": True, "ocr_engine": "tesseract"}). If the DoclingServe instance requires authentication, pass the API key via the api_key parameter or set the DOCLING_SERVE_API_KEY environment variable.
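
The two ways of supplying the key can be sketched as a simple lookup. `resolve_api_key` is a hypothetical helper, and the precedence shown (an explicit value wins over the environment variable) is an assumption, not documented behaviour of the component.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "DOCLING_SERVE_API_KEY"


def resolve_api_key(
    explicit: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Pick the API key from an explicit argument or the environment."""
    # Assumption: an explicitly passed key takes precedence.
    if explicit:
        return explicit
    # Fall back to the environment; None means no authentication is configured.
    return env.get(ENV_VAR)
```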

The component supports both synchronous (run) and asynchronous (run_async) execution.
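
One reason to use the asynchronous method is to convert several batches concurrently. The sketch below assumes `run_async` accepts `sources` and returns a dictionary with a `documents` key, mirroring `run`. `convert_concurrently` and `StubConverter` are hypothetical; the stub exists only so the example runs without a server.

```python
import asyncio
from typing import Any, List, Sequence


async def convert_concurrently(converter: Any, batches: Sequence[List[str]]) -> List[Any]:
    """Run several run_async calls at once and flatten their documents."""
    # Assumption: run_async takes `sources` and returns {"documents": [...]}.
    results = await asyncio.gather(
        *(converter.run_async(sources=batch) for batch in batches)
    )
    # gather preserves the order of the awaitables it was given.
    return [doc for result in results for doc in result["documents"]]


class StubConverter:
    """Stand-in with the same call shape as the real component."""

    async def run_async(self, sources: List[str]) -> dict:
        await asyncio.sleep(0)  # yield control, as a network call would
        return {"documents": [f"doc:{s}" for s in sources]}


docs = asyncio.run(
    convert_concurrently(StubConverter(), [["a.pdf"], ["b.pdf", "c.pdf"]])
)
```

With a running server, a `DoclingServeConverter(base_url="http://localhost:5001")` instance would take the place of the stub.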

Usage

Install the Docling Serve integration:

shell
pip install docling-serve-haystack

Start a DoclingServe instance locally (requires Docker):

shell
docker run -p 5001:5001 ghcr.io/docling-project/docling-serve-cpu:latest

On its own

python
from haystack_integrations.components.converters.docling_serve import (
    DoclingServeConverter,
)

# Default: Markdown output
converter = DoclingServeConverter(base_url="http://localhost:5001")
result = converter.run(sources=["report.pdf", "notes.docx"])
documents = result["documents"]
print(documents[0].content[:200])

# Plain text output
from haystack_integrations.components.converters.docling_serve import ExportType

converter = DoclingServeConverter(
    base_url="http://localhost:5001",
    export_type=ExportType.TEXT,
)
result = converter.run(sources=["report.pdf"])
print(result["documents"][0].content)

In a pipeline

python
from haystack import Pipeline
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.docling_serve import (
    DoclingServeConverter,
)

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component(
    "converter",
    DoclingServeConverter(base_url="http://localhost:5001"),
)
pipeline.add_component("splitter", DocumentSplitter())
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "splitter")
pipeline.connect("splitter", "writer")

pipeline.run({"converter": {"sources": ["report.pdf", "manual.docx"]}})

Additional Features

Converting URLs directly

Pass URL strings to convert remote documents without downloading them first:

python
from haystack_integrations.components.converters.docling_serve import (
    DoclingServeConverter,
)

converter = DoclingServeConverter(base_url="http://localhost:5001")
result = converter.run(sources=["https://arxiv.org/pdf/2602.17316"])
print(result["documents"][0].content[:200])

Attaching metadata

Pass a single dictionary to apply metadata to all output Documents, or a list to set metadata per source:

python
from haystack_integrations.components.converters.docling_serve import (
    DoclingServeConverter,
)

converter = DoclingServeConverter(base_url="http://localhost:5001")

# Same metadata for all sources
result = converter.run(
    sources=["a.pdf", "b.pdf"],
    meta={"project": "research"},
)

# Per-source metadata
result = converter.run(
    sources=["a.pdf", "b.pdf"],
    meta=[{"title": "Report A"}, {"title": "Report B"}],
)
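
To make the two `meta` shapes concrete, the sketch below expands either form into one dictionary per source. `expand_meta` is a hypothetical helper, not the integration's code, and raising `ValueError` on a length mismatch is an assumption.

```python
from typing import Any, Dict, List, Optional, Union

Meta = Dict[str, Any]


def expand_meta(meta: Optional[Union[Meta, List[Meta]]], n_sources: int) -> List[Meta]:
    """Return one metadata dictionary per source."""
    if meta is None:
        return [{} for _ in range(n_sources)]
    if isinstance(meta, dict):
        # A single dictionary is copied so each document gets its own instance.
        return [dict(meta) for _ in range(n_sources)]
    if len(meta) != n_sources:
        # Assumption: a list must line up one-to-one with the sources.
        raise ValueError("The meta list must have one entry per source")
    return [dict(m) for m in meta]


shared = expand_meta({"project": "research"}, 2)
per_source = expand_meta([{"title": "Report A"}, {"title": "Report B"}], 2)
```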

Processing in-memory files

Pass ByteStream objects to convert files loaded into memory. Set file_path in the ByteStream metadata so DoclingServe can detect the file format:

python
from haystack.dataclasses import ByteStream
from haystack_integrations.components.converters.docling_serve import (
    DoclingServeConverter,
)

# Read the file into memory as raw bytes
with open("report.pdf", "rb") as f:
    data = f.read()

source = ByteStream(data=data, meta={"file_path": "report.pdf"})
converter = DoclingServeConverter(base_url="http://localhost:5001")
result = converter.run(sources=[source])