Version: 2.26

LibreOfficeFileConverter

A component that converts office files between formats using LibreOffice's command line interface (soffice).

Most common position in a pipeline: Before a document converter (e.g. DOCXToDocument), when the source files need to be converted to a format that the converter supports
Mandatory run variables: sources (file paths or ByteStream objects); output_file_type (the target file format)
Output variables: output (a list of ByteStream objects)
API reference: LibreOffice
GitHub link: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/libreoffice

Overview

LibreOfficeFileConverter converts office files from one format to another using LibreOffice's soffice command line utility. It supports a wide range of document, spreadsheet, and presentation formats and is useful when your pipeline receives files in a format that downstream converters don't support.
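
To give a feel for what a soffice conversion looks like when run by hand, here is a minimal sketch that builds a headless conversion command. The flags `--headless`, `--convert-to`, and `--outdir` are standard soffice options, but the helper names are hypothetical and the component's actual invocation may differ.

```python
import subprocess
from pathlib import Path


def build_soffice_command(source: Path, output_file_type: str, out_dir: Path) -> list[str]:
    """Build a headless soffice conversion command (illustrative helper, not part of the integration)."""
    return [
        "soffice",
        "--headless",                      # run without a GUI
        "--convert-to", output_file_type,  # target format, e.g. "docx" or "pdf"
        "--outdir", str(out_dir),          # directory that receives the converted file
        str(source),
    ]


def convert_with_soffice(source: Path, output_file_type: str, out_dir: Path) -> Path:
    """Run the conversion; raises CalledProcessError if soffice exits with an error."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_soffice_command(source, output_file_type, out_dir), check=True)
    # Assumption: for a plain extension such as "docx", soffice names the
    # output after the source file's stem plus the new extension.
    return out_dir / f"{source.stem}.{output_file_type}"
```

The component wraps this kind of call for you and returns the converted content as ByteStream objects, so you normally do not need to run soffice yourself.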

Unlike most converters, LibreOfficeFileConverter outputs ByteStream objects rather than Haystack Documents. This means it's typically chained with a document converter (such as DOCXToDocument or PyPDFToDocument) to produce the final Documents.

LibreOfficeFileConverter requires LibreOffice to be installed, with its soffice executable available on your PATH. See the LibreOffice installation guide for details.
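
Before building a pipeline, it can help to confirm that soffice is actually discoverable. The following is a small sketch using only the standard library; the function names are hypothetical and not part of the integration.

```python
import shutil
from typing import Optional


def find_executable(name: str = "soffice") -> Optional[str]:
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


def require_soffice() -> str:
    """Return the path to soffice, or raise with a readable message if it is missing."""
    path = find_executable("soffice")
    if path is None:
        raise RuntimeError(
            "soffice was not found on PATH. Install LibreOffice and make sure "
            "its program directory is on PATH."
        )
    return path
```

Calling `require_soffice()` once at application startup surfaces a missing installation early instead of at the first conversion.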

Supported conversions

Category | Input formats | Possible output formats
Documents | doc, docx, odt, rtf, txt, html | pdf, docx, doc, odt, rtf, txt, html, epub
Spreadsheets | xlsx, xls, ods, csv | pdf, xlsx, xls, ods, csv, html
Presentations | pptx, ppt, odp | pdf, pptx, ppt, odp, html, png, jpg

This is a non-exhaustive list. See the LibreOffice filter documentation for all supported conversions.
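
If you want to pre-check inputs before handing them to the converter, the table above can be transcribed into a small lookup. This is only a sketch: the dictionary and function names are hypothetical, and because the table is non-exhaustive, a negative result means "not listed here" rather than "unsupported by LibreOffice".

```python
from pathlib import Path

# Transcribed from the table above: category -> (input formats, output formats).
LISTED_CONVERSIONS = {
    "documents": (
        {"doc", "docx", "odt", "rtf", "txt", "html"},
        {"pdf", "docx", "doc", "odt", "rtf", "txt", "html", "epub"},
    ),
    "spreadsheets": (
        {"xlsx", "xls", "ods", "csv"},
        {"pdf", "xlsx", "xls", "ods", "csv", "html"},
    ),
    "presentations": (
        {"pptx", "ppt", "odp"},
        {"pdf", "pptx", "ppt", "odp", "html", "png", "jpg"},
    ),
}


def is_listed_conversion(source: str, output_file_type: str) -> bool:
    """Check whether the source extension and target format share a row in the table."""
    suffix = Path(source).suffix.lower().lstrip(".")
    target = output_file_type.lower()
    return any(
        suffix in inputs and target in outputs
        for inputs, outputs in LISTED_CONVERSIONS.values()
    )
```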

Usage

Install the LibreOffice integration:

shell
pip install libreoffice-haystack

On its own

python
from pathlib import Path
from haystack_integrations.components.converters.libreoffice import (
    LibreOfficeFileConverter,
)

converter = LibreOfficeFileConverter()
result = converter.run(sources=[Path("sample.doc")], output_file_type="docx")
bytestreams = result["output"]

You can also set output_file_type at initialization to avoid passing it on every run() call:

python
converter = LibreOfficeFileConverter(output_file_type="pdf")
result = converter.run(sources=[Path("report.pptx")])
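
Because the converter returns ByteStream objects rather than files, you may want to write the converted content to disk for inspection. Below is a minimal sketch that relies only on each stream's `data` attribute (the raw bytes); the helper name and the `converted_<index>` file naming scheme are assumptions made for illustration.

```python
from pathlib import Path


def save_bytestreams(bytestreams, out_dir, extension):
    """Write each stream's raw bytes to out_dir and return the created paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, stream in enumerate(bytestreams):
        # Hypothetical naming scheme: converted_0.pdf, converted_1.pdf, ...
        target = out_dir / f"converted_{index}.{extension}"
        target.write_bytes(stream.data)
        paths.append(target)
    return paths
```

For example, `save_bytestreams(result["output"], "converted", "pdf")` would store the PDF produced from report.pptx in a local `converted` directory.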

In a pipeline

A common pattern is to chain LibreOfficeFileConverter with a document converter. The example below converts a legacy .doc file to .docx and then extracts it as a Haystack Document:

python
from pathlib import Path
from haystack import Pipeline
from haystack.components.converters import DOCXToDocument
from haystack_integrations.components.converters.libreoffice import (
    LibreOfficeFileConverter,
)

pipeline = Pipeline()
pipeline.add_component(
    "libreoffice_converter",
    LibreOfficeFileConverter(output_file_type="docx"),
)
pipeline.add_component("docx_converter", DOCXToDocument())

pipeline.connect("libreoffice_converter.output", "docx_converter.sources")

result = pipeline.run(
    {"libreoffice_converter": {"sources": [Path("legacy_report.doc")]}},
)
documents = result["docx_converter"]["documents"]