LibreOffice
haystack_integrations.components.converters.libreoffice.converter
LibreOfficeFileConverter
Component that uses LibreOffice's command-line utility (soffice) to convert files into various formats.
Usage examples
Simple conversion:
from pathlib import Path
from haystack_integrations.components.converters.libreoffice import LibreOfficeFileConverter
# Convert documents
converter = LibreOfficeFileConverter()
results = converter.run(sources=[Path("sample.doc")], output_file_type="docx")
print(results["output"]) # [ByteStream(data=b'...', meta={}, mime_type=None)]
Conversion pipeline:
from pathlib import Path
from haystack import Pipeline
from haystack.components.converters import DOCXToDocument
from haystack_integrations.components.converters.libreoffice import LibreOfficeFileConverter
# Create pipeline with components
pipeline = Pipeline()
pipeline.add_component("libreoffice_converter", LibreOfficeFileConverter())
pipeline.add_component("docx_converter", DOCXToDocument())
pipeline.connect("libreoffice_converter.output", "docx_converter.sources")
# Run pipeline and convert legacy documents into Haystack documents
results = pipeline.run(
    {
        "libreoffice_converter": {
            "sources": [Path("sample_doc.doc")],
            "output_file_type": "docx",
        }
    }
)
print(results["docx_converter"]["documents"])
SUPPORTED_TYPES
SUPPORTED_TYPES: dict[str, frozenset[str]] = {
    "doc": frozenset(["pdf", "docx", "odt", "rtf", "txt", "html", "epub"]),
    "docx": frozenset(["pdf", "doc", "odt", "rtf", "txt", "html", "epub"]),
    "odt": frozenset(["pdf", "docx", "doc", "rtf", "txt", "html", "epub"]),
    "rtf": frozenset(["pdf", "docx", "doc", "odt", "txt", "html"]),
    "txt": frozenset(["pdf", "docx", "doc", "odt", "rtf", "html"]),
    "html": frozenset(["pdf", "docx", "doc", "odt", "rtf", "txt"]),
    "xlsx": frozenset(["pdf", "xls", "ods", "csv", "html"]),
    "xls": frozenset(["pdf", "xlsx", "ods", "csv", "html"]),
    "ods": frozenset(["pdf", "xlsx", "xls", "csv", "html"]),
    "csv": frozenset(["pdf", "xlsx", "xls", "ods"]),
    "pptx": frozenset(["pdf", "ppt", "odp", "html", "png", "jpg"]),
    "ppt": frozenset(["pdf", "pptx", "odp", "html", "png", "jpg"]),
    "odp": frozenset(["pdf", "pptx", "ppt", "html", "png", "jpg"]),
}
A non-exhaustive mapping from each supported input file type to the output file types this component can convert it to. See https://help.libreoffice.org/latest/en-GB/text/shared/guide/convertfilters.html for more information.
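The mapping can be consulted before a conversion to see whether a given input/output pair is listed. The following is a minimal sketch, assuming a local excerpt of the mapping (copied from the table above so that it runs without the integration or LibreOffice installed). `SUPPORTED_TYPES_EXCERPT` and `can_convert` are hypothetical names used for illustration; they are not part of the component's API.

```python
from __future__ import annotations

from pathlib import Path

# Excerpt of SUPPORTED_TYPES copied from the mapping above (assumption:
# a local copy, so the sketch runs without the integration installed).
SUPPORTED_TYPES_EXCERPT: dict[str, frozenset[str]] = {
    "doc": frozenset(["pdf", "docx", "odt", "rtf", "txt", "html", "epub"]),
    "csv": frozenset(["pdf", "xlsx", "xls", "ods"]),
    "pptx": frozenset(["pdf", "ppt", "odp", "html", "png", "jpg"]),
}


def can_convert(source: str | Path, output_file_type: str) -> bool:
    """Hypothetical helper: True if the source's extension lists the target."""
    # Path.suffix includes the leading dot (".doc"); strip it and lowercase.
    input_type = Path(source).suffix.lstrip(".").lower()
    targets = SUPPORTED_TYPES_EXCERPT.get(input_type, frozenset())
    return output_file_type in targets


print(can_convert("report.doc", "pdf"))   # True
print(can_convert("table.csv", "html"))   # False: csv does not list html
```

Because the mapping is described as non-exhaustive, a `False` result here only means the pair is not listed, not necessarily that LibreOffice cannot perform the conversion.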
__init__
Check whether soffice is installed.
Parameters:
- output_file_type (OUTPUT_FILE_TYPE | None) – Target file format to convert to. Must be a valid conversion target for each source's input type; see SUPPORTED_TYPES for the full mapping.
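The documentation states only that initialization checks whether soffice is installed, not how the check is done. A common way to test for an executable is `shutil.which`; the sketch below assumes that approach and may differ from what the component actually does. `soffice_available` is a hypothetical helper name.

```python
import shutil


def soffice_available(executable: str = "soffice") -> bool:
    """Hypothetical helper: True if the executable can be found on PATH."""
    # shutil.which returns the full path to the executable, or None if absent.
    return shutil.which(executable) is not None


if not soffice_available():
    print("soffice not found on PATH; install LibreOffice before converting.")
```

Running a check like this up front can give a clearer message than a failure in the middle of a pipeline run.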
to_dict
Serializes the component to a dictionary.
Returns:
dict[str, Any] – Dictionary with serialized data.
from_dict
Deserializes the component from a dictionary.
Parameters:
- data (dict[str, Any]) – The dictionary to deserialize from.
Returns:
Self – The deserialized component.
run
run(
    sources: Iterable[str | Path | ByteStream],
    output_file_type: OUTPUT_FILE_TYPE | None = None,
) -> LibreOfficeFileConverterOutput
Convert office files to the specified output format using LibreOffice.
Parameters:
- sources (Iterable[str | Path | ByteStream]) – List of sources to convert. Each source can be a file path (str or Path) or a ByteStream. For ByteStream sources, the input file type cannot be inferred from the filename, so only output_file_type is validated (not the source type).
- output_file_type (OUTPUT_FILE_TYPE | None) – Target file format to convert to. Must be a valid conversion target for each source's input type; see SUPPORTED_TYPES for the full mapping. If set, it overrides the output_file_type parameter provided during initialization.
Returns:
LibreOfficeFileConverterOutput – A dictionary with the following key:
- output: List of ByteStream objects containing the converted file data, in the same order as sources.
Raises:
- FileNotFoundError – If a source file path does not exist.
- OSError – If the internal temporary output directory is not writable.
- ValueError – If a source's file type is not in SUPPORTED_TYPES, if output_file_type is not a valid conversion target for it, or if output_file_type was provided neither during initialization nor in this call.
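The precedence between the two places where output_file_type can be set follows from the parameter description and the ValueError case above: the value passed to run wins, the value given at initialization is the fallback, and having neither is an error. A minimal sketch of that rule, assuming nothing about the component's internals; `resolve_output_type` is a hypothetical helper, not a function of the library.

```python
from __future__ import annotations


def resolve_output_type(init_value: str | None, run_value: str | None) -> str:
    """Hypothetical helper: the run-time value overrides the init-time value."""
    output_file_type = run_value if run_value is not None else init_value
    if output_file_type is None:
        # Mirrors the documented ValueError when no target format was given.
        raise ValueError("output_file_type was not provided at init or at run time")
    return output_file_type


print(resolve_output_type("pdf", None))    # pdf  (falls back to the init value)
print(resolve_output_type("pdf", "docx"))  # docx (the run value overrides)

try:
    resolve_output_type(None, None)
except ValueError as err:
    print(f"ValueError: {err}")
```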
run_async
run_async(
    sources: Iterable[str | Path | ByteStream],
    output_file_type: OUTPUT_FILE_TYPE | None = None,
) -> LibreOfficeFileConverterOutput
Asynchronously convert office files to the specified output format using LibreOffice.
This is the asynchronous version of the run method with the same parameters and return values.
Parameters:
- sources (Iterable[str | Path | ByteStream]) – List of sources to convert. Each source can be a file path (str or Path) or a ByteStream. For ByteStream sources, the input file type cannot be inferred from the filename, so only output_file_type is validated (not the source type).
- output_file_type (OUTPUT_FILE_TYPE | None) – Target file format to convert to. Must be a valid conversion target for each source's input type; see SUPPORTED_TYPES for the full mapping. If set, it overrides the output_file_type parameter provided during initialization.
Returns:
LibreOfficeFileConverterOutput – A dictionary with the following key:
- output: List of ByteStream objects containing the converted file data, in the same order as sources.
Raises:
- FileNotFoundError – If a source file path does not exist.
- OSError – If the internal temporary output directory is not writable.
- ValueError – If a source's file type is not in SUPPORTED_TYPES, if output_file_type is not a valid conversion target for it, or if output_file_type was provided neither during initialization nor in this call.
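Since run_async is a coroutine with the same parameters and return shape as run, it can be awaited from an event loop. The sketch below shows the calling pattern only: `StubConverter` is a stand-in written for this example that mimics the documented signature and converts nothing, and `convert_batches` is a hypothetical helper. Whether several conversions can safely run concurrently against a real soffice installation is not stated in the source.

```python
from __future__ import annotations

import asyncio


class StubConverter:
    """Stand-in with the documented run_async signature; it converts nothing."""

    async def run_async(self, sources, output_file_type=None):
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        # Return one placeholder payload per source, in input order.
        return {"output": [f"{source} -> {output_file_type}".encode() for source in sources]}


async def convert_batches(converter, batches, output_file_type):
    """Hypothetical helper: await one run_async call per batch, concurrently."""
    tasks = [
        converter.run_async(sources=batch, output_file_type=output_file_type)
        for batch in batches
    ]
    # asyncio.gather returns results in the same order as the awaitables.
    return await asyncio.gather(*tasks)


results = asyncio.run(
    convert_batches(StubConverter(), [["a.doc"], ["b.doc", "c.doc"]], "pdf")
)
print([len(result["output"]) for result in results])  # [1, 2]
```

With the real component, the stub would be replaced by a LibreOfficeFileConverter instance and the placeholder payloads by ByteStream objects.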