LibreOfficeFileConverter
A component that converts office files between formats using LibreOffice's command line interface (soffice).
| | |
|---|---|
| Most common position in a pipeline | Before a document converter (e.g. DOCXToDocument) when the source files need to be converted to a format that the converter supports |
| Mandatory run variables | sources: File paths or ByteStream objects; output_file_type: The target file format (required at run time only if it was not set at initialization) |
| Output variables | output: A list of ByteStream objects |
| API reference | LibreOffice |
| GitHub link | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/libreoffice |
Overview
LibreOfficeFileConverter converts office files from one format to another using LibreOffice's soffice command line utility. It supports a wide range of document, spreadsheet, and presentation formats and is useful when your pipeline receives files in a format that downstream converters don't support.
Unlike most converters, LibreOfficeFileConverter outputs ByteStream objects rather than Haystack Documents. This means it's typically chained with a document converter (such as DOCXToDocument or PyPDFToDocument) to produce the final Documents.
LibreOffice must be installed and available in PATH as soffice. See the LibreOffice installation guide for details.
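To make the PATH requirement concrete, the sketch below checks whether soffice can be found and builds the kind of headless command a conversion typically boils down to. It is an illustration only: the helper names soffice_available and build_soffice_command are hypothetical, and the exact arguments the component passes to soffice are an assumption, not taken from its source.

```python
import shutil
from pathlib import Path


def soffice_available() -> bool:
    """Return True if an executable named `soffice` is found on PATH."""
    return shutil.which("soffice") is not None


def build_soffice_command(
    source: Path, output_file_type: str, out_dir: Path
) -> list[str]:
    """Build a headless soffice conversion command (hypothetical helper).

    Assumption: the conversion uses soffice's standard --convert-to and
    --outdir options; the real component may pass different arguments.
    """
    return [
        "soffice",
        "--headless",                       # run without a GUI
        "--convert-to", output_file_type,   # target format, e.g. "docx"
        "--outdir", str(out_dir),           # directory for the converted file
        str(source),
    ]
```

Calling soffice_available() before building a pipeline gives an early, readable failure instead of an error in the middle of a run.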
Supported conversions
| Category | Input formats | Possible output formats |
|---|---|---|
| Documents | doc, docx, odt, rtf, txt, html | pdf, docx, doc, odt, rtf, txt, html, epub |
| Spreadsheets | xlsx, xls, ods, csv | pdf, xlsx, xls, ods, csv, html |
| Presentations | pptx, ppt, odp | pdf, pptx, ppt, odp, html, png, jpg |
This is a non-exhaustive list. See the LibreOffice filter documentation for all supported conversions.
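If you want to reject unlikely conversions before calling the component, the table can be transcribed into a small lookup. This is a minimal sketch: LISTED_CONVERSIONS, listed_output_formats, and is_listed_conversion are hypothetical names, and the data mirrors only the non-exhaustive table above, not the component's own validation or LibreOffice's full filter list.

```python
from pathlib import Path

# Transcribed from the table above (non-exhaustive).
# Maps category -> (input extensions, listed output extensions).
LISTED_CONVERSIONS = {
    "documents": (
        {"doc", "docx", "odt", "rtf", "txt", "html"},
        {"pdf", "docx", "doc", "odt", "rtf", "txt", "html", "epub"},
    ),
    "spreadsheets": (
        {"xlsx", "xls", "ods", "csv"},
        {"pdf", "xlsx", "xls", "ods", "csv", "html"},
    ),
    "presentations": (
        {"pptx", "ppt", "odp"},
        {"pdf", "pptx", "ppt", "odp", "html", "png", "jpg"},
    ),
}


def listed_output_formats(source: Path) -> set[str]:
    """Return the output formats the table lists for this file's extension."""
    extension = source.suffix.lower().lstrip(".")
    for inputs, outputs in LISTED_CONVERSIONS.values():
        if extension in inputs:
            return set(outputs)  # copy so callers cannot mutate the table
    return set()


def is_listed_conversion(source: Path, output_file_type: str) -> bool:
    """Return True if the table lists this source-to-target conversion."""
    return output_file_type.lower() in listed_output_formats(source)
```

A False result only means the pair is not in the table; LibreOffice may still support it.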
Usage
Install the LibreOffice integration:
On its own
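```python
from pathlib import Path

from haystack_integrations.components.converters.libreoffice import (
    LibreOfficeFileConverter,
)

converter = LibreOfficeFileConverter()

# Convert a legacy .doc file to .docx; requires soffice on PATH.
result = converter.run(sources=[Path("sample.doc")], output_file_type="docx")

# The converted files are returned as a list of ByteStream objects.
bytestreams = result["output"]
```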
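Because the output is a list of ByteStream objects rather than files, you may want to write the converted content to disk for inspection. The sketch below assumes only that each item exposes its raw bytes as a data attribute, as Haystack's ByteStream does. The helper name save_outputs and the converted_<index> file naming are hypothetical choices for illustration.

```python
from pathlib import Path
from typing import Iterable, Protocol


class HasData(Protocol):
    """Anything carrying raw bytes in a `data` attribute, e.g. a ByteStream."""

    data: bytes


def save_outputs(
    bytestreams: Iterable[HasData], out_dir: Path, extension: str
) -> list[Path]:
    """Write each stream to out_dir as converted_<index>.<extension>."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, stream in enumerate(bytestreams):
        target = out_dir / f"converted_{index}.{extension}"
        target.write_bytes(stream.data)
        written.append(target)
    return written
```

For example, save_outputs(result["output"], Path("converted"), "docx") would write one .docx file per converted source.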
You can also set output_file_type at initialization to avoid passing it on every run() call, for example LibreOfficeFileConverter(output_file_type="docx").
In a pipeline
A common pattern is to chain LibreOfficeFileConverter with a document converter. The example below converts a legacy .doc file to .docx and then extracts its content as a Haystack Document:
```python
from pathlib import Path

from haystack import Pipeline
from haystack.components.converters import DOCXToDocument
from haystack_integrations.components.converters.libreoffice import (
    LibreOfficeFileConverter,
)

pipeline = Pipeline()
# Fix the target format at initialization so run() only needs the sources.
pipeline.add_component(
    "libreoffice_converter",
    LibreOfficeFileConverter(output_file_type="docx"),
)
pipeline.add_component("docx_converter", DOCXToDocument())

# Feed the converted ByteStream objects into the DOCX converter.
pipeline.connect("libreoffice_converter.output", "docx_converter.sources")

result = pipeline.run(
    {"libreoffice_converter": {"sources": [Path("legacy_report.doc")]}},
)
documents = result["docx_converter"]["documents"]
```