PaddleOCR
Module haystack_integrations.components.converters.paddleocr.paddleocr_vl_document_converter
PaddleOCRVLDocumentConverter
This component extracts text from documents using PaddleOCR's large model document parsing API.
PaddleOCR-VL is used behind the scenes. For more information, please refer to: https://www.paddleocr.ai/latest/en/version3.x/algorithm/PaddleOCR-VL/PaddleOCR-VL.html
Usage Example:
```python
from haystack.utils import Secret
from haystack_integrations.components.converters.paddleocr import (
    PaddleOCRVLDocumentConverter,
)

converter = PaddleOCRVLDocumentConverter(
    api_url="http://xxxxx.aistudio-app.com/layout-parsing",
    access_token=Secret.from_env_var("AISTUDIO_ACCESS_TOKEN"),
)

result = converter.run(sources=["sample.pdf"])
documents = result["documents"]
raw_responses = result["raw_paddleocr_responses"]
```
PaddleOCRVLDocumentConverter.__init__
```python
def __init__(
    *,
    api_url: str,
    access_token: Secret = Secret.from_env_var("AISTUDIO_ACCESS_TOKEN"),
    file_type: Optional[FileTypeInput] = None,
    use_doc_orientation_classify: Optional[bool] = None,
    use_doc_unwarping: Optional[bool] = None,
    use_layout_detection: Optional[bool] = None,
    use_chart_recognition: Optional[bool] = None,
    layout_threshold: Optional[Union[float, dict]] = None,
    layout_nms: Optional[bool] = None,
    layout_unclip_ratio: Optional[Union[float, tuple[float, float], dict]] = None,
    layout_merge_bboxes_mode: Optional[Union[str, dict]] = None,
    prompt_label: Optional[str] = None,
    format_block_content: Optional[bool] = None,
    repetition_penalty: Optional[float] = None,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    min_pixels: Optional[int] = None,
    max_pixels: Optional[int] = None,
    prettify_markdown: Optional[bool] = None,
    show_formula_number: Optional[bool] = None,
    visualize: Optional[bool] = None,
    additional_params: Optional[dict[str, Any]] = None)
```
Create a PaddleOCRVLDocumentConverter component.
Arguments:
- `api_url`: API URL. To obtain the API URL, visit the PaddleOCR official website, click the API button in the upper-left corner, choose the example code for Large Model document parsing (PaddleOCR-VL), and copy the `API_URL`.
- `access_token`: AI Studio access token. You can obtain it from this page.
- `file_type`: File type. Can be "pdf" for PDF files, "image" for image files, or `None` for auto-detection. If not specified, the file type will be inferred from the file extension.
- `use_doc_orientation_classify`: Whether to enable the document orientation classification function. Enabling this feature allows the input image to be automatically rotated to the correct orientation.
- `use_doc_unwarping`: Whether to enable the text image unwarping function. Enabling this feature allows automatic correction of distorted text images.
- `use_layout_detection`: Whether to enable the layout detection function.
- `use_chart_recognition`: Whether to enable the chart recognition function.
- `layout_threshold`: Layout detection threshold. Can be a float or a dict with page-specific thresholds.
- `layout_nms`: Whether to perform NMS (Non-Maximum Suppression) on layout detection results.
- `layout_unclip_ratio`: Layout unclip ratio. Can be a float, a tuple of (min, max), or a dict with page-specific values.
- `layout_merge_bboxes_mode`: Layout merge bounding boxes mode. Can be a string or a dict.
- `prompt_label`: Prompt type for the VLM. Possible values are "ocr", "formula", "table", and "chart".
- `format_block_content`: Whether to format block content.
- `repetition_penalty`: Repetition penalty parameter used in VLM sampling.
- `temperature`: Temperature parameter used in VLM sampling.
- `top_p`: Top-p parameter used in VLM sampling.
- `min_pixels`: Minimum number of pixels allowed during VLM preprocessing.
- `max_pixels`: Maximum number of pixels allowed during VLM preprocessing.
- `prettify_markdown`: Whether to prettify the output Markdown text.
- `show_formula_number`: Whether to include formula numbers in the output Markdown text.
- `visualize`: Whether to return visualization results.
- `additional_params`: Additional parameters for calling the PaddleOCR API.
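The `file_type` argument falls back to the file extension when it is left unset. The following is a minimal sketch of what such a fallback could look like, not the component's actual implementation: the helper name `infer_file_type` and the `IMAGE_EXTENSIONS` set are hypothetical and exist only for illustration.

```python
from pathlib import Path
from typing import Optional, Union

# Hypothetical set of image extensions, chosen for illustration only.
# The real component may recognise a different set.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff", ".webp"}


def infer_file_type(source: Union[str, Path], file_type: Optional[str] = None) -> str:
    """Return "pdf" or "image", preferring an explicit file_type over the extension."""
    if file_type is not None:
        # An explicit value wins, but it must be one of the two documented options.
        if file_type not in ("pdf", "image"):
            raise ValueError(f"Unsupported file_type: {file_type!r}")
        return file_type

    # No explicit value: look at the (case-insensitive) file extension.
    suffix = Path(source).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    raise ValueError(f"Cannot infer file type from extension {suffix!r}")
```

Passing `file_type="pdf"` or `file_type="image"` explicitly is the safer choice when sources have missing or misleading extensions, which is typically the case for in-memory data.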
PaddleOCRVLDocumentConverter.to_dict
Serialize the component to a dictionary.
Returns:
Dictionary with serialized data.
PaddleOCRVLDocumentConverter.from_dict
Deserialize the component from a dictionary.
Arguments:
data: Dictionary to deserialize from.
Returns:
Deserialized component.
PaddleOCRVLDocumentConverter.run
```python
@component.output_types(documents=list[Document],
                        raw_paddleocr_responses=list[dict[str, Any]])
def run(
    sources: list[Union[str, Path, ByteStream]],
    meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None
) -> dict[str, Any]
```
Convert image or PDF files to Documents.
Arguments:
- `sources`: List of image or PDF file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
Returns:
A dictionary with the following keys:
- `documents`: A list of created Documents.
- `raw_paddleocr_responses`: A list of raw PaddleOCR API responses.
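The rules for the `meta` argument of `run` (a single dictionary applied to every source, or a list zipped with the sources) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the component's code: the helper name `normalize_meta` is hypothetical, and treating a missing `meta` as one empty dictionary per source is an assumption.

```python
from typing import Any, Optional, Union


def normalize_meta(
    meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]],
    sources_count: int,
) -> list[dict[str, Any]]:
    """Expand `meta` into exactly one metadata dictionary per source."""
    if meta is None:
        # Assumption: no metadata means an empty dict for every source.
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dict is copied so each source gets its own independent dict.
        return [dict(meta) for _ in range(sources_count)]
    if isinstance(meta, list):
        # A list is zipped with the sources, so the lengths must agree.
        if len(meta) != sources_count:
            raise ValueError(
                "The length of the metadata list must match the number of sources."
            )
        return [dict(item) for item in meta]
    raise TypeError("meta must be None, a dictionary, or a list of dictionaries.")


sources = ["invoice.pdf", "receipt.png"]
per_source_meta = normalize_meta({"project": "demo"}, len(sources))
paired = list(zip(sources, per_source_meta))
```

Copying the single dictionary for each source avoids a subtle bug: if every Document shared the same dictionary object, changing the metadata of one Document would silently change all of them.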