Version: 2.19

PaddleOCR

Module haystack_integrations.components.converters.paddleocr.paddleocr_vl_document_converter

PaddleOCRVLDocumentConverter

This component extracts text from documents using PaddleOCR's large model document parsing API.

PaddleOCR-VL is used behind the scenes. For more information, please refer to: https://www.paddleocr.ai/latest/en/version3.x/algorithm/PaddleOCR-VL/PaddleOCR-VL.html

Usage Example:

```python
from haystack.utils import Secret
from haystack_integrations.components.converters.paddleocr import (
    PaddleOCRVLDocumentConverter,
)

converter = PaddleOCRVLDocumentConverter(
    api_url="http://xxxxx.aistudio-app.com/layout-parsing",
    access_token=Secret.from_env_var("AISTUDIO_ACCESS_TOKEN"),
)

result = converter.run(sources=["sample.pdf"])

documents = result["documents"]
raw_responses = result["raw_paddleocr_responses"]
```

PaddleOCRVLDocumentConverter.__init__

```python
def __init__(
    *,
    api_url: str,
    access_token: Secret = Secret.from_env_var("AISTUDIO_ACCESS_TOKEN"),
    file_type: Optional[FileTypeInput] = None,
    use_doc_orientation_classify: Optional[bool] = None,
    use_doc_unwarping: Optional[bool] = None,
    use_layout_detection: Optional[bool] = None,
    use_chart_recognition: Optional[bool] = None,
    layout_threshold: Optional[Union[float, dict]] = None,
    layout_nms: Optional[bool] = None,
    layout_unclip_ratio: Optional[Union[float, tuple[float, float], dict]] = None,
    layout_merge_bboxes_mode: Optional[Union[str, dict]] = None,
    prompt_label: Optional[str] = None,
    format_block_content: Optional[bool] = None,
    repetition_penalty: Optional[float] = None,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    min_pixels: Optional[int] = None,
    max_pixels: Optional[int] = None,
    prettify_markdown: Optional[bool] = None,
    show_formula_number: Optional[bool] = None,
    visualize: Optional[bool] = None,
    additional_params: Optional[dict[str, Any]] = None)
```

Create a PaddleOCRVLDocumentConverter component.

Arguments:

  • api_url: URL of the PaddleOCR-VL document parsing API. To obtain it, visit the PaddleOCR official website, click the API button in the upper-left corner, choose the example code for Large Model document parsing (PaddleOCR-VL), and copy the API_URL value.
  • access_token: AI Studio access token. By default, it is read from the AISTUDIO_ACCESS_TOKEN environment variable. You can obtain a token from your AI Studio account.
  • file_type: File type. Use "pdf" for PDF files or "image" for image files. If None (the default), the file type is inferred from the file extension.
  • use_doc_orientation_classify: Whether to enable the document orientation classification function. Enabling this feature allows the input image to be automatically rotated to the correct orientation.
  • use_doc_unwarping: Whether to enable the text image unwarping function. Enabling this feature allows automatic correction of distorted text images.
  • use_layout_detection: Whether to enable the layout detection function.
  • use_chart_recognition: Whether to enable the chart recognition function.
  • layout_threshold: Layout detection threshold. Can be a float or a dict with page-specific thresholds.
  • layout_nms: Whether to perform NMS (Non-Maximum Suppression) on layout detection results.
  • layout_unclip_ratio: Layout unclip ratio. Can be a float, a tuple of (min, max), or a dict with page-specific values.
  • layout_merge_bboxes_mode: Layout merge bounding boxes mode. Can be a string or a dict.
  • prompt_label: Prompt type for the VLM. Possible values are "ocr", "formula", "table", and "chart".
  • format_block_content: Whether to format block content.
  • repetition_penalty: Repetition penalty parameter used in VLM sampling.
  • temperature: Temperature parameter used in VLM sampling.
  • top_p: Top-p parameter used in VLM sampling.
  • min_pixels: Minimum number of pixels allowed during VLM preprocessing.
  • max_pixels: Maximum number of pixels allowed during VLM preprocessing.
  • prettify_markdown: Whether to prettify the output Markdown text.
  • show_formula_number: Whether to include formula numbers in the output Markdown text.
  • visualize: Whether to return visualization results.
  • additional_params: Additional parameters for calling the PaddleOCR API.
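
The file_type argument above falls back to the file extension when it is left as None. The following is a minimal sketch of what such a fallback could look like. It is not the component's actual implementation: infer_file_type is a hypothetical helper, and the set of image extensions is an assumption made for illustration.

```python
from pathlib import Path
from typing import Optional

# Assumed set of image extensions, for illustration only.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff", ".webp"}


def infer_file_type(path: str, file_type: Optional[str] = None) -> Optional[str]:
    """Return "pdf" or "image"; an explicit file_type wins over the extension."""
    if file_type is not None:
        return file_type
    suffix = Path(path).suffix.lower()  # compare extensions case-insensitively
    if suffix == ".pdf":
        return "pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    return None  # unknown extension: leave the decision to the caller


print(infer_file_type("report.PDF"))
print(infer_file_type("scan.jpeg"))
print(infer_file_type("upload.bin", file_type="pdf"))
```

Passing file_type explicitly is the safer choice for sources without a meaningful extension, such as in-memory byte streams.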

PaddleOCRVLDocumentConverter.to_dict

```python
def to_dict() -> dict[str, Any]
```

Serialize the component to a dictionary.

Returns:

Dictionary with serialized data.

PaddleOCRVLDocumentConverter.from_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "PaddleOCRVLDocumentConverter"
```

Deserialize the component from a dictionary.

Arguments:

  • data: Dictionary to deserialize from.

Returns:

Deserialized component.

PaddleOCRVLDocumentConverter.run

```python
@component.output_types(documents=list[Document],
                        raw_paddleocr_responses=list[dict[str, Any]])
def run(
    sources: list[Union[str, Path, ByteStream]],
    meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None
) -> dict[str, Any]
```

Convert image or PDF files to Documents.

Arguments:

  • sources: List of image or PDF file paths or ByteStream objects.
  • meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.
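
The rule for the meta argument (a single dictionary applies to every source, a list is paired with the sources one by one) can be sketched in plain Python. This is an illustration of the described behavior, not the component's own code: normalize_meta is a hypothetical helper, and the file names and metadata values are made up.

```python
from typing import Any, Optional, Union


def normalize_meta(
    meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]],
    sources_count: int,
) -> list[dict[str, Any]]:
    """Expand meta into one metadata dict per source (hypothetical helper)."""
    if meta is None:
        # No metadata supplied: every source gets an empty dict.
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dict is copied so each Document gets its own instance.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        raise ValueError("The length of the meta list must match the number of sources.")
    return [dict(m) for m in meta]


sources = ["invoice.pdf", "receipt.png"]
shared = normalize_meta({"project": "demo"}, len(sources))
per_source = normalize_meta([{"lang": "en"}, {"lang": "de"}], len(sources))

# Each source is paired with its own metadata dict.
for source, source_meta in zip(sources, per_source):
    print(source, source_meta)
```

The length check matters because zip silently stops at the shorter of its inputs, which would otherwise drop sources or metadata without any error.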

Returns:

A dictionary with the following keys:

  • documents: A list of created Documents.
  • raw_paddleocr_responses: A list of raw PaddleOCR API responses.
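
One way to use the returned documents is to write each one to disk. The sketch below assumes that every Document carries its extracted text in the content attribute and that this text is Markdown. FakeDocument is a stand-in used only so the example runs without calling the API, and save_as_markdown is a hypothetical helper.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Optional


@dataclass
class FakeDocument:
    """Stand-in for a Haystack Document; only content and meta are modeled."""
    content: Optional[str]
    meta: dict[str, Any] = field(default_factory=dict)


def save_as_markdown(documents, out_dir: Path) -> list[Path]:
    """Write each document's text to <out_dir>/doc_<index>.md and return the paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, doc in enumerate(documents):
        if not doc.content:
            continue  # skip documents without extracted text
        target = out_dir / f"doc_{index}.md"
        target.write_text(doc.content, encoding="utf-8")
        written.append(target)
    return written


# In real use: documents = converter.run(sources=[...])["documents"]
documents = [FakeDocument("# Title\n\nBody text."), FakeDocument(None)]
paths = save_as_markdown(documents, Path(tempfile.mkdtemp()) / "ocr_output")
print([p.name for p in paths])
```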