
MSGToDocument

Converts Microsoft Outlook .msg files to documents.

Most common position in a pipeline: Before PreProcessors, or right at the beginning of an indexing pipeline
Mandatory run variables: "sources": A list of .msg file paths or ByteStream objects
Output variables: "documents": A list of documents; "attachments": A list of ByteStream objects representing file attachments
API reference: Converters
GitHub link: https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/msg.py

Overview

The MSGToDocument component converts Microsoft Outlook .msg files into documents. This component extracts the email metadata (such as sender, recipients, CC, BCC, subject) and body content. Additionally, any file attachments within the .msg file are extracted as ByteStream objects.
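
Because attachments come back as raw ByteStream objects, a common next step is writing them to disk. The following is a minimal sketch, not part of the component itself: save_attachments is a hypothetical helper, FakeByteStream is a stand-in for Haystack's ByteStream so the snippet runs without Haystack installed, and the sketch assumes the attachment's original file name is stored under meta["file_path"].

```python
from dataclasses import dataclass, field
from pathlib import Path
from tempfile import TemporaryDirectory
from typing import Any, Dict, List


@dataclass
class FakeByteStream:
    """Stand-in for Haystack's ByteStream (illustration only): raw bytes plus a meta dict."""

    data: bytes
    meta: Dict[str, Any] = field(default_factory=dict)


def save_attachments(attachments: List[Any], out_dir: Path) -> List[Path]:
    """Write each attachment's bytes into out_dir and return the written paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for index, attachment in enumerate(attachments):
        # Assumption: the original file name lives under meta["file_path"].
        raw_name = attachment.meta.get("file_path") or f"attachment_{index}"
        # Keep only the last path component so a crafted name cannot escape out_dir.
        safe_name = Path(str(raw_name)).name or f"attachment_{index}"
        target = out_dir / safe_name
        target.write_bytes(attachment.data)
        written.append(target)
    return written


if __name__ == "__main__":
    with TemporaryDirectory() as tmp:
        demo = [
            FakeByteStream(b"hello", {"file_path": "notes.txt"}),
            FakeByteStream(b"\x00\x01"),  # no file name in meta, so a fallback name is used
        ]
        for path in save_attachments(demo, Path(tmp)):
            print(path.name, path.stat().st_size)
```

With the real component, the list passed in would be results["attachments"] from converter.run(); the helper only relies on the data and meta attributes.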

Usage

First, install the python-oxmsg package to start using this converter:

pip install python-oxmsg

On its own

from haystack.components.converters.msg import MSGToDocument
from datetime import datetime

converter = MSGToDocument()
results = converter.run(sources=["sample.msg"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
attachments = results["attachments"]

print(documents[0].content)
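
When the emails sit in a folder rather than in a hand-written list, the sources argument can be built with the standard library. This is a sketch under assumptions: collect_msg_files is a hypothetical helper, not a Haystack function, and it returns plain string paths to match the example above.

```python
from pathlib import Path
from typing import List


def collect_msg_files(folder: Path) -> List[str]:
    """Return the paths of all .msg files under folder (recursive, case-insensitive suffix)."""
    matches = (
        path
        for path in folder.rglob("*")
        if path.is_file() and path.suffix.lower() == ".msg"
    )
    # Sort so that repeated runs pass the files in a stable order.
    return sorted(str(path) for path in matches)
```

The returned list can then be passed as sources, for example converter.run(sources=collect_msg_files(Path("mailbox"))), where "mailbox" is a placeholder folder name.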

In a pipeline

The following pipeline routes .msg files to the converter by MIME type and writes the resulting documents to an in-memory document store:

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.routers import FileTypeRouter
from haystack.components.converters import MSGToDocument
from haystack.components.writers import DocumentWriter

router = FileTypeRouter(mime_types=["application/vnd.ms-outlook"])
document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("router", router)
pipeline.add_component("converter", MSGToDocument())
pipeline.add_component("writer", DocumentWriter(document_store=document_store))

pipeline.connect("router.application/vnd.ms-outlook", "converter.sources")
pipeline.connect("converter.documents", "writer.documents")

file_names = ["email1.msg", "email2.msg"]
# Feed the files to the router: converter.sources is already connected to the router's output
pipeline.run({"router": {"sources": file_names}})