PaddleOCR
haystack_integrations.components.converters.paddleocr.paddleocr_vl_document_converter
PaddleOCRVLDocumentConverter
Extracts text from documents using PaddleOCR's official document parsing API.
Uses PaddleOCRClient to parse documents via the PaddleOCR serving API.
For more information, please refer to:
https://www.paddleocr.ai/latest/en/version3.x/algorithm/PaddleOCR-VL/PaddleOCR-VL.html
Usage Example:
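```python
from haystack_integrations.components.converters.paddleocr import PaddleOCRVLDocumentConverter

# The access token is read from the environment by default (see the __init__ signature).
converter = PaddleOCRVLDocumentConverter(
    base_url="http://xxxxx.aistudio-app.com",
)

# run() returns a dict containing the converted Documents and the raw API responses.
result = converter.run(sources=["sample.pdf"])
documents = result["documents"]
raw_responses = result["raw_paddleocr_responses"]
```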
__init__
```python
__init__(
    *,
    base_url: str | None = None,
    access_token: Secret = Secret.from_env_var(
        ["PADDLEOCR_ACCESS_TOKEN", "AISTUDIO_ACCESS_TOKEN"]
    ),
    model: Model | str = Model.PADDLE_OCR_VL_16,
    file_type: FileTypeInput = None,
    use_doc_orientation_classify: bool | None = False,
    use_doc_unwarping: bool | None = False,
    use_layout_detection: bool | None = None,
    use_chart_recognition: bool | None = None,
    use_seal_recognition: bool | None = None,
    use_ocr_for_image_block: bool | None = None,
    layout_threshold: float | dict | None = None,
    layout_nms: bool | None = None,
    layout_unclip_ratio: float | list | dict | None = None,
    layout_merge_bboxes_mode: str | dict | None = None,
    layout_shape_mode: str | None = None,
    prompt_label: str | None = None,
    format_block_content: bool | None = None,
    repetition_penalty: float | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    min_pixels: int | None = None,
    max_pixels: int | None = None,
    max_new_tokens: int | None = None,
    merge_layout_blocks: bool | None = None,
    markdown_ignore_labels: list[str] | None = None,
    vlm_extra_args: dict | None = None,
    prettify_markdown: bool | None = None,
    show_formula_number: bool | None = None,
    restructure_pages: bool | None = None,
    merge_tables: bool | None = None,
    relevel_titles: bool | None = None,
    visualize: bool | None = None,
    additional_params: dict[str, Any] | None = None
) -> None
```
Create a PaddleOCRVLDocumentConverter component.
Parameters:
- base_url (str | None) – Base URL for the PaddleOCR API. Falls back to the PADDLEOCR_BASE_URL env var, then to the SDK default.
- access_token (Secret) – PaddleOCR access token. By default it is read from the PADDLEOCR_ACCESS_TOKEN env var, then from AISTUDIO_ACCESS_TOKEN.
- model (Model | str) – Document parsing model. Defaults to Model.PADDLE_OCR_VL_16.
- file_type (FileTypeInput) – "pdf", "image", or None for auto-detection.
- use_doc_orientation_classify (bool | None) – Enable document orientation classification.
- use_doc_unwarping (bool | None) – Enable text image unwarping.
- use_layout_detection (bool | None) – Enable layout detection.
- use_chart_recognition (bool | None) – Enable chart recognition.
- use_seal_recognition (bool | None) – Enable seal recognition.
- use_ocr_for_image_block (bool | None) – Recognize text in image blocks.
- layout_threshold (float | dict | None) – Layout detection threshold.
- layout_nms (bool | None) – Perform NMS on layout detection results.
- layout_unclip_ratio (float | list | dict | None) – Layout unclip ratio.
- layout_merge_bboxes_mode (str | dict | None) – Layout merge bounding boxes mode.
- layout_shape_mode (str | None) – Layout shape mode.
- prompt_label (str | None) – Prompt type for the VLM ("ocr", "formula", "table", "chart", "seal", "spotting").
- format_block_content (bool | None) – Format block content.
- repetition_penalty (float | None) – Repetition penalty for VLM sampling.
- temperature (float | None) – Temperature for VLM sampling.
- top_p (float | None) – Top-p for VLM sampling.
- min_pixels (int | None) – Minimum pixels for VLM preprocessing.
- max_pixels (int | None) – Maximum pixels for VLM preprocessing.
- max_new_tokens (int | None) – Maximum tokens generated by the VLM.
- merge_layout_blocks (bool | None) – Merge layout detection boxes for cross-column content.
- markdown_ignore_labels (list[str] | None) – Layout labels to ignore in Markdown output.
- vlm_extra_args (dict | None) – Extra configuration for the VLM.
- prettify_markdown (bool | None) – Prettify output Markdown.
- show_formula_number (bool | None) – Include formula numbers in Markdown output.
- restructure_pages (bool | None) – Restructure results across multiple pages.
- merge_tables (bool | None) – Merge tables across pages.
- relevel_titles (bool | None) – Relevel titles.
- visualize (bool | None) – Return visualization results.
- additional_params (dict[str, Any] | None) – Extra options passed to PaddleOCRVLOptions.extra_options.
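The fallback order documented for base_url and access_token can be illustrated with the standard library alone. The sketch below is not the component's implementation: `resolve_settings` and its `env` parameter are hypothetical names introduced here, and the way unset or empty values are treated is an assumption.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def resolve_settings(
    base_url: str | None = None,
    env: Mapping[str, str] | None = None,
) -> dict[str, str | None]:
    """Hypothetical helper that mirrors the documented fallback order."""
    env = os.environ if env is None else env

    # base_url: an explicit argument wins, then PADDLEOCR_BASE_URL.
    # A result of None stands for "let the SDK use its own default".
    resolved_url = base_url if base_url is not None else env.get("PADDLEOCR_BASE_URL")

    # access_token: the first variable that is set, in signature order.
    token = None
    for name in ("PADDLEOCR_ACCESS_TOKEN", "AISTUDIO_ACCESS_TOKEN"):
        if env.get(name) is not None:
            token = env[name]
            break

    return {"base_url": resolved_url, "access_token": token}


# Made-up values for illustration only.
fake_env = {
    "PADDLEOCR_BASE_URL": "http://localhost:8080",
    "AISTUDIO_ACCESS_TOKEN": "token-from-aistudio",
}
print(resolve_settings(env=fake_env))
print(resolve_settings(base_url="http://example.invalid", env=fake_env))
```

Passing the environment as a mapping keeps the helper easy to test without touching the real process environment.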
to_dict
Serialize the component to a dictionary.
Returns:
dict[str, Any] – Dictionary with serialized data.
from_dict
Deserialize the component from a dictionary.
Parameters:
- data (dict[str, Any]) – Dictionary to deserialize from.
Returns:
PaddleOCRVLDocumentConverter – Deserialized component.
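A to_dict / from_dict round trip is the usual way to check that a component's configuration survives serialization. The sketch below uses a small stand-in class so that it runs without the integration installed. `FakeConverter` is hypothetical, and the `type` / `init_parameters` layout follows the general Haystack serialization convention; that this component emits exactly this layout is an assumption.

```python
from __future__ import annotations

from typing import Any


class FakeConverter:
    """Hypothetical stand-in for a component with to_dict/from_dict."""

    def __init__(self, *, base_url: str | None = None, temperature: float | None = None) -> None:
        self.base_url = base_url
        self.temperature = temperature

    def to_dict(self) -> dict[str, Any]:
        # Record the class path and the constructor arguments needed to rebuild it.
        return {
            "type": f"{type(self).__module__}.{type(self).__qualname__}",
            "init_parameters": {
                "base_url": self.base_url,
                "temperature": self.temperature,
            },
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> FakeConverter:
        # Rebuild the instance from the stored constructor arguments.
        return cls(**data["init_parameters"])


original = FakeConverter(base_url="http://localhost:8080", temperature=0.2)
restored = FakeConverter.from_dict(original.to_dict())

# The restored instance serializes to the same dictionary as the original.
assert restored.to_dict() == original.to_dict()
```

With the real converter, the same check would call to_dict on a configured instance and pass the result to PaddleOCRVLDocumentConverter.from_dict.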
run
python
run(
sources: list[str | Path | ByteStream],
meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, Any]
Convert image or PDF files to Documents.
Parameters:
- sources (list[str | Path | ByteStream]) – List of image or PDF file paths or ByteStream objects.
- meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the Documents. A single dict is applied to all documents; a list must match the number of sources.
Returns:
dict[str, Any] – A dictionary with:
- documents: List of created Documents.
- raw_paddleocr_responses: List of raw PaddleOCR API responses.
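The rule for meta (one dict shared by all sources, or a list with one entry per source) can be sketched as a small standalone function. `expand_meta` is a hypothetical helper written for illustration, not the component's code, and raising ValueError on a length mismatch is an assumption about how such a mismatch would be reported.

```python
from __future__ import annotations

from typing import Any


def expand_meta(
    meta: dict[str, Any] | list[dict[str, Any]] | None,
    sources_count: int,
) -> list[dict[str, Any]]:
    """Hypothetical helper: return one metadata dict per source."""
    if meta is None:
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dict is copied so that each document gets its own instance.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        raise ValueError(
            f"Got {len(meta)} metadata entries for {sources_count} sources."
        )
    return list(meta)


sources = ["invoice.pdf", "receipt.png"]
print(expand_meta({"project": "demo"}, len(sources)))
print(expand_meta([{"label": "first"}, {"label": "second"}], len(sources)))
```

Copying the shared dict avoids one document's metadata changes leaking into another's.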