Version: 2.29

PaddleOCR

haystack_integrations.components.converters.paddleocr.paddleocr_vl_document_converter

PaddleOCRVLDocumentConverter

Extracts text from documents using PaddleOCR's official document parsing API.

Uses PaddleOCRClient to parse documents via the PaddleOCR serving API. For more information, please refer to: https://www.paddleocr.ai/latest/en/version3.x/algorithm/PaddleOCR-VL/PaddleOCR-VL.html

Usage Example:

python

from haystack_integrations.components.converters.paddleocr import PaddleOCRVLDocumentConverter

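# Point the converter at your PaddleOCR serving endpoint (placeholder URL).
converter = PaddleOCRVLDocumentConverter(
    base_url="http://xxxxx.aistudio-app.com",
)
# Convert one PDF; by default the access token is read from the environment.
result = converter.run(sources=["sample.pdf"])
documents = result["documents"]  # Documents created from the sources
raw_responses = result["raw_paddleocr_responses"]  # raw PaddleOCR API responses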

__init__

python

__init__(
    *,
    base_url: str | None = None,
    access_token: Secret = Secret.from_env_var(
        ["PADDLEOCR_ACCESS_TOKEN", "AISTUDIO_ACCESS_TOKEN"]
    ),
    model: Model | str = Model.PADDLE_OCR_VL_16,
    file_type: FileTypeInput = None,
    use_doc_orientation_classify: bool | None = False,
    use_doc_unwarping: bool | None = False,
    use_layout_detection: bool | None = None,
    use_chart_recognition: bool | None = None,
    use_seal_recognition: bool | None = None,
    use_ocr_for_image_block: bool | None = None,
    layout_threshold: float | dict | None = None,
    layout_nms: bool | None = None,
    layout_unclip_ratio: float | list | dict | None = None,
    layout_merge_bboxes_mode: str | dict | None = None,
    layout_shape_mode: str | None = None,
    prompt_label: str | None = None,
    format_block_content: bool | None = None,
    repetition_penalty: float | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    min_pixels: int | None = None,
    max_pixels: int | None = None,
    max_new_tokens: int | None = None,
    merge_layout_blocks: bool | None = None,
    markdown_ignore_labels: list[str] | None = None,
    vlm_extra_args: dict | None = None,
    prettify_markdown: bool | None = None,
    show_formula_number: bool | None = None,
    restructure_pages: bool | None = None,
    merge_tables: bool | None = None,
    relevel_titles: bool | None = None,
    visualize: bool | None = None,
    additional_params: dict[str, Any] | None = None
) -> None

Create a PaddleOCRVLDocumentConverter component.

Parameters:

base_url (str | None) – Base URL for the PaddleOCR API. Falls back to the PADDLEOCR_BASE_URL environment variable, then to the SDK default.
access_token (Secret) – PaddleOCR access token. By default it is read from the PADDLEOCR_ACCESS_TOKEN or AISTUDIO_ACCESS_TOKEN environment variable.
model (Model | str) – Document parsing model. Defaults to Model.PADDLE_OCR_VL_16.
file_type (FileTypeInput) – Type of the input files: "pdf", "image", or None for auto-detection.
use_doc_orientation_classify (bool | None) – Enable document orientation classification. Defaults to False.
use_doc_unwarping (bool | None) – Enable text image unwarping. Defaults to False.
use_layout_detection (bool | None) – Enable layout detection.
use_chart_recognition (bool | None) – Enable chart recognition.
use_seal_recognition (bool | None) – Enable seal recognition.
use_ocr_for_image_block (bool | None) – Recognize text in image blocks.
layout_threshold (float | dict | None) – Layout detection threshold.
layout_nms (bool | None) – Perform NMS on layout detection results.
layout_unclip_ratio (float | list | dict | None) – Layout unclip ratio.
layout_merge_bboxes_mode (str | dict | None) – Layout merge bounding boxes mode.
layout_shape_mode (str | None) – Layout shape mode.
prompt_label (str | None) – Prompt type for the VLM ("ocr", "formula", "table", "chart", "seal", "spotting").
format_block_content (bool | None) – Format block content.
repetition_penalty (float | None) – Repetition penalty for VLM sampling.
temperature (float | None) – Temperature for VLM sampling.
top_p (float | None) – Top-p for VLM sampling.
min_pixels (int | None) – Minimum pixels for VLM preprocessing.
max_pixels (int | None) – Maximum pixels for VLM preprocessing.
max_new_tokens (int | None) – Maximum tokens generated by the VLM.
merge_layout_blocks (bool | None) – Merge layout detection boxes for cross-column content.
markdown_ignore_labels (list[str] | None) – Layout labels to ignore in Markdown output.
vlm_extra_args (dict | None) – Extra configuration for the VLM.
prettify_markdown (bool | None) – Prettify output Markdown.
show_formula_number (bool | None) – Include formula numbers in Markdown output.
restructure_pages (bool | None) – Restructure results across multiple pages.
merge_tables (bool | None) – Merge tables across pages.
relevel_titles (bool | None) – Relevel titles.
visualize (bool | None) – Return visualization results.
additional_params (dict[str, Any] | None) – Extra options passed to PaddleOCRVLOptions.extra_options.
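
The fallback order described for base_url and access_token can be sketched in plain Python. This is a minimal illustration, not the component's code: resolve_base_url, resolve_access_token, and the SDK_DEFAULT_BASE_URL placeholder are hypothetical names, and only the environment variable names come from the signature above. The sketch also assumes that the first token variable that is set wins.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Placeholder only: the real SDK default is not stated in this reference.
SDK_DEFAULT_BASE_URL = "https://sdk-default.invalid"

# Variable names taken from the __init__ signature, in the order listed there.
TOKEN_ENV_VARS = ("PADDLEOCR_ACCESS_TOKEN", "AISTUDIO_ACCESS_TOKEN")


def resolve_base_url(base_url: str | None, env: Mapping[str, str] = os.environ) -> str:
    """Explicit argument first, then PADDLEOCR_BASE_URL, then the default."""
    if base_url is not None:
        return base_url
    return env.get("PADDLEOCR_BASE_URL", SDK_DEFAULT_BASE_URL)


def resolve_access_token(env: Mapping[str, str] = os.environ) -> str:
    """Return the value of the first token variable that is set."""
    for name in TOKEN_ENV_VARS:
        value = env.get(name)
        if value is not None:
            return value
    raise ValueError(f"None of {', '.join(TOKEN_ENV_VARS)} is set")


# A fake environment keeps the example independent of the real one.
fake_env = {
    "PADDLEOCR_BASE_URL": "http://localhost:8080",
    "AISTUDIO_ACCESS_TOKEN": "t0ken",
}
print(resolve_base_url(None, fake_env))                   # http://localhost:8080
print(resolve_base_url("http://example.test", fake_env))  # http://example.test
print(resolve_access_token(fake_env))                     # t0ken
```

Passing the environment as a mapping makes the precedence easy to check without touching os.environ.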

to_dict

python

to_dict() -> dict[str, Any]

Serialize the component to a dictionary.

Returns:

dict[str, Any] – Dictionary with serialized data.

from_dict

python

from_dict(data: dict[str, Any]) -> PaddleOCRVLDocumentConverter

Deserialize the component from a dictionary.

Parameters:

data (dict[str, Any]) – Dictionary to deserialize from.

Returns:

PaddleOCRVLDocumentConverter – Deserialized component.
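
The to_dict / from_dict round trip can be illustrated with a small stand-in class. ToyConverter is hypothetical and is not the real component; the sketch assumes the common Haystack layout of a "type" string plus an "init_parameters" dictionary, which this reference does not confirm for PaddleOCRVLDocumentConverter.

```python
from __future__ import annotations

from typing import Any


class ToyConverter:
    """Hypothetical stand-in, not the real PaddleOCRVLDocumentConverter."""

    def __init__(self, *, base_url: str | None = None, temperature: float | None = None) -> None:
        self.base_url = base_url
        self.temperature = temperature

    def to_dict(self) -> dict[str, Any]:
        # Assumed layout: fully qualified class name plus constructor arguments.
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {
                "base_url": self.base_url,
                "temperature": self.temperature,
            },
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> ToyConverter:
        # Rebuild the object from the stored constructor arguments.
        return cls(**data["init_parameters"])


original = ToyConverter(base_url="http://localhost:8080", temperature=0.2)
restored = ToyConverter.from_dict(original.to_dict())
print(restored.to_dict() == original.to_dict())  # True
```

With the real component, the same pattern would be converter.to_dict() followed by PaddleOCRVLDocumentConverter.from_dict(data).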

run

python

run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, Any]

Convert image or PDF files to Documents.

Parameters:

sources (list[str | Path | ByteStream]) – List of image or PDF file paths or ByteStream objects.
meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the Documents. A single dict is applied to all documents; a list must match the number of sources.

Returns:

dict[str, Any] – A dictionary with:
documents: List of created Documents.
raw_paddleocr_responses: List of raw PaddleOCR API responses.
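
The rule for meta (a single dict is applied to all documents, a list must match the number of sources) can be sketched as follows. normalize_meta is a hypothetical helper written for illustration; it is not claimed to be the function the component uses internally.

```python
from __future__ import annotations

from typing import Any


def normalize_meta(
    meta: dict[str, Any] | list[dict[str, Any]] | None,
    sources_count: int,
) -> list[dict[str, Any]]:
    """Return one metadata dict per source."""
    if meta is None:
        # No metadata given: every source gets an empty dict.
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dict is copied for each source.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        raise ValueError(
            f"Got {len(meta)} metadata entries for {sources_count} sources"
        )
    return [dict(entry) for entry in meta]


sources = ["invoice.pdf", "receipt.png"]

# One dict shared by both sources.
print(normalize_meta({"project": "demo"}, len(sources)))
# [{'project': 'demo'}, {'project': 'demo'}]

# One dict per source, matched by position.
per_source = normalize_meta([{"page_hint": 1}, {"page_hint": 2}], len(sources))
for source, entry in zip(sources, per_source):
    print(source, entry)
```

Copying each dict avoids several documents sharing one mutable metadata object.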
