Version: 2.29

PaddleOCR

haystack_integrations.components.converters.paddleocr.paddleocr_vl_document_converter

PaddleOCRVLDocumentConverter

Extracts text from documents using PaddleOCR's official document parsing API.

Uses PaddleOCRClient to parse documents via the PaddleOCR serving API. For more information, please refer to: https://www.paddleocr.ai/latest/en/version3.x/algorithm/PaddleOCR-VL/PaddleOCR-VL.html

Usage Example:

python
from haystack_integrations.components.converters.paddleocr import PaddleOCRVLDocumentConverter

converter = PaddleOCRVLDocumentConverter(
    base_url="http://xxxxx.aistudio-app.com",
)
result = converter.run(sources=["sample.pdf"])
documents = result["documents"]
raw_responses = result["raw_paddleocr_responses"]

__init__

python
__init__(
    *,
    base_url: str | None = None,
    access_token: Secret = Secret.from_env_var(
        ["PADDLEOCR_ACCESS_TOKEN", "AISTUDIO_ACCESS_TOKEN"]
    ),
    model: Model | str = Model.PADDLE_OCR_VL_16,
    file_type: FileTypeInput = None,
    use_doc_orientation_classify: bool | None = False,
    use_doc_unwarping: bool | None = False,
    use_layout_detection: bool | None = None,
    use_chart_recognition: bool | None = None,
    use_seal_recognition: bool | None = None,
    use_ocr_for_image_block: bool | None = None,
    layout_threshold: float | dict | None = None,
    layout_nms: bool | None = None,
    layout_unclip_ratio: float | list | dict | None = None,
    layout_merge_bboxes_mode: str | dict | None = None,
    layout_shape_mode: str | None = None,
    prompt_label: str | None = None,
    format_block_content: bool | None = None,
    repetition_penalty: float | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    min_pixels: int | None = None,
    max_pixels: int | None = None,
    max_new_tokens: int | None = None,
    merge_layout_blocks: bool | None = None,
    markdown_ignore_labels: list[str] | None = None,
    vlm_extra_args: dict | None = None,
    prettify_markdown: bool | None = None,
    show_formula_number: bool | None = None,
    restructure_pages: bool | None = None,
    merge_tables: bool | None = None,
    relevel_titles: bool | None = None,
    visualize: bool | None = None,
    additional_params: dict[str, Any] | None = None
) -> None

Create a PaddleOCRVLDocumentConverter component.

Parameters:

  • base_url (str | None) – Base URL for the PaddleOCR API. Falls back to PADDLEOCR_BASE_URL env var, then the SDK default.
  • access_token (Secret) – PaddleOCR access token. By default it is read from the PADDLEOCR_ACCESS_TOKEN env var, then from AISTUDIO_ACCESS_TOKEN.
  • model (Model | str) – Document parsing model. Defaults to Model.PADDLE_OCR_VL_16.
  • file_type (FileTypeInput) – "pdf", "image", or None for auto-detection.
  • use_doc_orientation_classify (bool | None) – Enable document orientation classification. Defaults to False.
  • use_doc_unwarping (bool | None) – Enable text image unwarping. Defaults to False.
  • use_layout_detection (bool | None) – Enable layout detection.
  • use_chart_recognition (bool | None) – Enable chart recognition.
  • use_seal_recognition (bool | None) – Enable seal recognition.
  • use_ocr_for_image_block (bool | None) – Recognize text in image blocks.
  • layout_threshold (float | dict | None) – Layout detection threshold.
  • layout_nms (bool | None) – Perform NMS on layout detection results.
  • layout_unclip_ratio (float | list | dict | None) – Layout unclip ratio.
  • layout_merge_bboxes_mode (str | dict | None) – Layout merge bounding boxes mode.
  • layout_shape_mode (str | None) – Layout shape mode.
  • prompt_label (str | None) – Prompt type for the VLM ("ocr", "formula", "table", "chart", "seal", "spotting").
  • format_block_content (bool | None) – Format block content.
  • repetition_penalty (float | None) – Repetition penalty for VLM sampling.
  • temperature (float | None) – Temperature for VLM sampling.
  • top_p (float | None) – Top-p for VLM sampling.
  • min_pixels (int | None) – Minimum pixels for VLM preprocessing.
  • max_pixels (int | None) – Maximum pixels for VLM preprocessing.
  • max_new_tokens (int | None) – Maximum tokens generated by the VLM.
  • merge_layout_blocks (bool | None) – Merge layout detection boxes for cross-column content.
  • markdown_ignore_labels (list[str] | None) – Layout labels to ignore in Markdown output.
  • vlm_extra_args (dict | None) – Extra configuration for the VLM.
  • prettify_markdown (bool | None) – Prettify output Markdown.
  • show_formula_number (bool | None) – Include formula numbers in Markdown output.
  • restructure_pages (bool | None) – Restructure results across multiple pages.
  • merge_tables (bool | None) – Merge tables across pages.
  • relevel_titles (bool | None) – Relevel titles.
  • visualize (bool | None) – Return visualization results.
  • additional_params (dict[str, Any] | None) – Extra options passed to PaddleOCRVLOptions.extra_options.
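The fallback order described for base_url and access_token can be sketched in plain Python. This only illustrates the lookup order stated above; it is not the integration's actual code. The helpers `first_set_env_var` and `resolve_base_url` are hypothetical names, and the SDK default URL is left to the caller because this reference does not state it.

```python
from __future__ import annotations

import os
from collections.abc import Mapping, Sequence

# Order mirrors the default shown in the __init__ signature.
TOKEN_ENV_VARS = ["PADDLEOCR_ACCESS_TOKEN", "AISTUDIO_ACCESS_TOKEN"]


def first_set_env_var(
    names: Sequence[str], env: Mapping[str, str] = os.environ
) -> str | None:
    """Return the value of the first variable in `names` that is set.

    Hypothetical helper for illustration; not part of Haystack or PaddleOCR.
    """
    for name in names:
        if name in env:
            return env[name]
    return None


def resolve_base_url(
    explicit: str | None,
    sdk_default: str,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Explicit argument first, then PADDLEOCR_BASE_URL, then the SDK default.

    `sdk_default` is supplied by the caller; its real value is an assumption here.
    """
    if explicit is not None:
        return explicit
    from_env = first_set_env_var(["PADDLEOCR_BASE_URL"], env)
    return from_env if from_env is not None else sdk_default


# Only the second token variable is set, so its value is used.
fake_env = {"AISTUDIO_ACCESS_TOKEN": "studio-token"}
print(first_set_env_var(TOKEN_ENV_VARS, fake_env))
```

Passing the environment as a mapping keeps the sketch testable without touching the real process environment.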

to_dict

python
to_dict() -> dict[str, Any]

Serialize the component to a dictionary.

Returns:

  • dict[str, Any] – Dictionary with serialized data.

from_dict

python
from_dict(data: dict[str, Any]) -> PaddleOCRVLDocumentConverter

Deserialize the component from a dictionary.

Parameters:

  • data (dict[str, Any]) – Dictionary to deserialize from.

Returns:

  • PaddleOCRVLDocumentConverter – Deserialized component.
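A to_dict / from_dict round trip can be sketched with a stand-in class, so the example runs without the integration installed. `StandInConverter` is hypothetical, and the "type" plus "init_parameters" layout is an assumption based on the usual Haystack component convention; this reference does not show the exact keys the real converter emits.

```python
from __future__ import annotations

from typing import Any


class StandInConverter:
    """Hypothetical stand-in with two of the documented init parameters."""

    def __init__(
        self,
        *,
        base_url: str | None = None,
        use_layout_detection: bool | None = None,
    ) -> None:
        self.base_url = base_url
        self.use_layout_detection = use_layout_detection

    def to_dict(self) -> dict[str, Any]:
        # Assumed layout: a class path plus the arguments needed to rebuild it.
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {
                "base_url": self.base_url,
                "use_layout_detection": self.use_layout_detection,
            },
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "StandInConverter":
        # Rebuild the instance from the stored init parameters.
        return cls(**data["init_parameters"])


original = StandInConverter(base_url="http://xxxxx.aistudio-app.com")
restored = StandInConverter.from_dict(original.to_dict())
print(restored.to_dict() == original.to_dict())
```

With the real component the calls would be PaddleOCRVLDocumentConverter.from_dict(converter.to_dict()), as documented above.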

run

python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, Any]

Convert image or PDF files to Documents.

Parameters:

  • sources (list[str | Path | ByteStream]) – List of image or PDF file paths or ByteStream objects.
  • meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the Documents. A single dict is applied to all documents; a list must match the number of sources.
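The rule for meta (one dict shared by every source, or a list with one entry per source) can be sketched as follows. `normalize_meta` is a hypothetical helper written for this illustration; raising ValueError on a length mismatch is an assumption, since the reference only says the list must match the number of sources.

```python
from __future__ import annotations

from typing import Any


def normalize_meta(
    meta: dict[str, Any] | list[dict[str, Any]] | None,
    sources_count: int,
) -> list[dict[str, Any]]:
    """Expand `meta` into one metadata dict per source (illustrative only)."""
    if meta is None:
        # No metadata given: every source gets its own empty dict.
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dict is copied once per source.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        raise ValueError(
            f"Got {len(meta)} metadata entries for {sources_count} sources."
        )
    return [dict(entry) for entry in meta]


print(normalize_meta({"project": "invoices"}, 2))
```

Copying the dict per source avoids one document's metadata changes leaking into another.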

Returns:

  • dict[str, Any] – A dictionary with the following keys:
      • documents: List of created Documents.
      • raw_paddleocr_responses: List of raw PaddleOCR API responses.