Docling
haystack_integrations.components.converters.docling.converter
Docling Haystack converter module.
ExportType
Bases: str, Enum
Enumeration of available export types.
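Because ExportType derives from both str and Enum, its members behave like plain strings as well as enum members. The sketch below uses a stand-in class to show that behaviour; the member names come from this reference, but the string values are assumptions and may differ from the real ones.

```python
from enum import Enum


class ExportTypeSketch(str, Enum):
    """Stand-in for ExportType; the string values here are assumed."""

    MARKDOWN = "markdown"
    DOC_CHUNKS = "doc_chunks"
    JSON = "json"


# Mixing in `str` lets a member be looked up from its plain string value...
mode = ExportTypeSketch("markdown")
is_same_member = mode is ExportTypeSketch.MARKDOWN

# ...and compared directly against a plain string.
equals_plain_string = ExportTypeSketch.JSON == "json"
```

This is typically why a str-based enum is convenient for settings that may arrive as text, for example from a configuration file.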
BaseMetaExtractor
Bases: ABC
BaseMetaExtractor.
extract_chunk_meta
Extract chunk meta.
extract_dl_doc_meta
Extract Docling document meta.
MetaExtractor
Bases: BaseMetaExtractor
MetaExtractor.
extract_chunk_meta
Extract chunk meta.
extract_dl_doc_meta
Extract Docling document meta.
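A custom extractor would subclass BaseMetaExtractor and implement both methods. The following is a minimal sketch using a stand-in abstract class, so it runs without Docling installed. The two method names come from this reference; the argument names, the dict return type, and the LabelExtractor class are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseMetaExtractorSketch(ABC):
    """Stand-in for BaseMetaExtractor; argument and return types are assumed."""

    @abstractmethod
    def extract_chunk_meta(self, chunk: Any) -> dict[str, Any]:
        """Return metadata for a single chunk."""

    @abstractmethod
    def extract_dl_doc_meta(self, dl_doc: Any) -> dict[str, Any]:
        """Return metadata for a whole Docling document."""


class LabelExtractor(BaseMetaExtractorSketch):
    """Hypothetical extractor that tags every output with a fixed label."""

    def __init__(self, label: str) -> None:
        self.label = label

    def extract_chunk_meta(self, chunk: Any) -> dict[str, Any]:
        return {"label": self.label, "level": "chunk"}

    def extract_dl_doc_meta(self, dl_doc: Any) -> dict[str, Any]:
        return {"label": self.label, "level": "document"}


extractor = LabelExtractor("demo")
chunk_meta = extractor.extract_chunk_meta(chunk=None)
doc_meta = extractor.extract_dl_doc_meta(dl_doc=None)
```

With the real classes, an instance of such a subclass would presumably be passed as the meta_extractor argument of DoclingConverter.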
DoclingConverter
Docling Haystack converter.
__init__
```python
__init__(
    converter: DocumentConverter | None = None,
    convert_kwargs: dict[str, Any] | None = None,
    export_type: ExportType = ExportType.DOC_CHUNKS,
    md_export_kwargs: dict[str, Any] | None = None,
    chunker: BaseChunker | None = None,
    meta_extractor: BaseMetaExtractor | None = None,
) -> None
```
Create a Docling Haystack converter.
Parameters:
- converter (DocumentConverter | None) – The Docling DocumentConverter to use; if not set, a system default is used.
- convert_kwargs (dict[str, Any] | None) – Any parameters to pass to Docling conversion; if not set, a system default is used.
- export_type (ExportType) – The export mode to use:
  - ExportType.MARKDOWN captures each input document as a single markdown Document.
  - ExportType.DOC_CHUNKS (default) first chunks each input document and then returns one Document per chunk.
  - ExportType.JSON serializes the full Docling document to a JSON string.
- md_export_kwargs (dict[str, Any] | None) – Any parameters to pass to Markdown export (applicable in case of ExportType.MARKDOWN).
- chunker (BaseChunker | None) – The Docling chunker instance to use; if not set, a system default is used.
- meta_extractor (BaseMetaExtractor | None) – The extractor instance to use for populating the output document metadata; if not set, a system default is used.
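As an illustration of the constructor parameters, the sketch below builds a converter in Markdown mode and returns the resulting Documents. It assumes the integration is installed and that DoclingConverter and ExportType can be imported from the module path shown at the top of this page; the wrapper function convert_to_markdown is a hypothetical helper, not part of the library.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any


def convert_to_markdown(sources: list[str | Path]) -> list[Any]:
    """Convert each source into a single Markdown Document (hypothetical helper)."""
    # Imported here so this module still loads where the integration is not installed.
    from haystack_integrations.components.converters.docling.converter import (
        DoclingConverter,
        ExportType,
    )

    # All other arguments are left unset, so the documented defaults apply.
    converter = DoclingConverter(export_type=ExportType.MARKDOWN)
    result = converter.run(sources=sources)
    return result["documents"]
```

A call such as convert_to_markdown(["report.pdf"]), where report.pdf is a placeholder file name, should yield one Document per input. Leaving export_type at its default would instead yield one Document per chunk.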
run
```python
run(
    paths: list[str | Path] | None = None,
    sources: list[str | Path | ByteStream] | None = None,
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]
```
Run the DoclingConverter.
Parameters:
- paths (list[str | Path] | None) – Deprecated. Use sources instead.
- sources (list[str | Path | ByteStream] | None) – List of file paths, URLs, or ByteStream objects to convert.
- meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If a source is a ByteStream, its own metadata is also merged into the output.
Returns:
- dict[str, list[Document]] – A dictionary with key "documents" containing the output Haystack Documents.
Raises:
- ValueError – If meta is a list whose length does not match the number of sources.
- RuntimeError – If an unexpected export_type is encountered.
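The meta rules described for run (a single dictionary applied to every source, or a list paired one-to-one with the sources, with a ValueError on a length mismatch) can be sketched in plain Python. The helper normalize_meta below is hypothetical and only mirrors the documented behaviour; it is not the converter's actual implementation, and the file names are placeholders.

```python
from __future__ import annotations

from typing import Any


def normalize_meta(
    meta: dict[str, Any] | list[dict[str, Any]] | None,
    num_sources: int,
) -> list[dict[str, Any]]:
    """Return one metadata dict per source (hypothetical helper)."""
    if meta is None:
        return [{} for _ in range(num_sources)]
    if isinstance(meta, dict):
        # A single dict applies to every source; copy so entries stay independent.
        return [dict(meta) for _ in range(num_sources)]
    if len(meta) != num_sources:
        raise ValueError(
            f"meta has {len(meta)} entries but there are {num_sources} sources"
        )
    return [dict(m) for m in meta]


sources = ["a.pdf", "b.pdf"]  # placeholder source names
shared = normalize_meta({"project": "demo"}, len(sources))
paired = list(zip(sources, normalize_meta([{"n": 1}, {"n": 2}], len(sources))))
```

Checking the lengths before zipping matters because zip silently truncates to the shorter input, which would otherwise drop sources or metadata without any error.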