Version: 2.30

MSSharePointFetcher

Fetches the full content of Microsoft SharePoint and OneDrive items via the Microsoft Graph API and returns it as ByteStreams.

Most common position in a pipeline: After MSSharePointRetriever, before a Router or File Converters
Mandatory init variables: None
Mandatory run variables:
  • access_token: A delegated Microsoft Graph bearer token, typically wired from an upstream OAuthTokenResolver
  • targets: A list of Documents (from MSSharePointRetriever) or raw SharePoint/OneDrive web_url strings
Output variables:
  • streams: A list of ByteStreams holding the fetched content
API reference: Microsoft SharePoint
GitHub link: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/microsoft_sharepoint
Package name: microsoft-sharepoint-haystack

Overview

MSSharePointFetcher downloads the full content of Microsoft SharePoint and OneDrive items through the Microsoft Graph API and returns ByteStream objects, ready for a downstream converter.

It complements MSSharePointRetriever, which returns only Search snippets and metadata. Wire the retriever's documents (or a list of web_urls) into the fetcher to download the underlying content. The fetcher dispatches on the entity type of each hit:

  • Files (driveItem) are downloaded as their raw bytes (PDF, DOCX, ...).
  • List items (listItem) are returned as a JSON ByteStream of the item's column values (fields).
  • SharePoint pages (sitePage) are returned as an HTML ByteStream built from the page's web parts.

Each ByteStream's meta carries url, file_name, content_type, and a normalized entity_type (driveItem, listItem, or sitePage). Everything is resolved through the Microsoft Graph shares endpoint (plus the Pages API for pages), so only the web_url already exposed by the retriever is needed.
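
The Graph shares endpoint addresses an item by an encoded form of its sharing URL. The sketch below shows what that encoding looks like, following the scheme Microsoft documents for the shares API. Whether the fetcher builds its requests exactly this way is an assumption, and encode_sharing_url and drive_item_url are illustrative names, not part of the integration.

```python
import base64

GRAPH_BASE = "https://graph.microsoft.com/v1.0"


def encode_sharing_url(web_url: str) -> str:
    """Encode a web_url as a Graph share token: 'u!' + unpadded base64url."""
    encoded = base64.urlsafe_b64encode(web_url.encode("utf-8")).decode("ascii")
    # Graph expects the trailing '=' padding to be removed.
    return "u!" + encoded.rstrip("=")


def drive_item_url(web_url: str) -> str:
    """Build the shares URL that resolves a web_url to its driveItem."""
    return f"{GRAPH_BASE}/shares/{encode_sharing_url(web_url)}/driveItem"


if __name__ == "__main__":
    url = "https://contoso.sharepoint.com/sites/contoso-team/contoso-designs.docx"
    print(drive_item_url(url))
```

This is why the web_url alone is enough: no site, drive, or item ID has to be known in advance.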

Because the output is a list of ByteStreams of mixed types, the typical next step is a FileTypeRouter that dispatches each stream to the right converter (PyPDFToDocument, DOCXToDocument, HTMLToDocument, or a JSON converter).
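
If you prefer to branch on the normalized entity_type rather than on MIME type, a minimal sketch could look like the following. It only assumes that each stream exposes data (bytes) and meta (dict) as described above. The helper names are hypothetical, and the assumption that a listItem payload is UTF-8 JSON holding an object of column values is inferred from the description, not confirmed by the integration.

```python
import json
from collections import defaultdict


def group_by_entity_type(streams):
    """Bucket streams by meta['entity_type'] (driveItem, listItem, sitePage)."""
    groups = defaultdict(list)
    for stream in streams:
        groups[stream.meta.get("entity_type", "unknown")].append(stream)
    return dict(groups)


def list_item_fields(stream):
    """Decode a listItem stream into a dict of column values (assumed UTF-8 JSON)."""
    return json.loads(stream.data.decode("utf-8"))
```

For example, group_by_entity_type(result["streams"]).get("listItem", []) yields only the list items, each of which can then be passed to list_item_fields.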

Authentication

The fetcher takes a per-user access_token as a run input. The token must carry delegated Microsoft Graph permissions (for example Files.Read.All for files and Sites.Read.All for list items and pages). Typically you wire it from an upstream OAuthTokenResolver, which emits a plain string. A Secret is also accepted and resolved internally.
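
To illustrate what accepting both forms can mean in practice, here is a rough sketch of normalizing the token before building the Authorization header. It relies only on the resolve_value() method that Haystack Secret objects expose. resolve_token and auth_headers are hypothetical helpers, not the fetcher's actual internals.

```python
def resolve_token(access_token) -> str:
    """Return the bearer token as a plain string.

    Secret-like objects are resolved via resolve_value(); strings pass through.
    """
    resolve = getattr(access_token, "resolve_value", None)
    token = resolve() if callable(resolve) else access_token
    if not isinstance(token, str) or not token:
        raise ValueError("access_token must resolve to a non-empty string")
    return token


def auth_headers(access_token) -> dict:
    """Build the Authorization header for a Microsoft Graph request."""
    return {"Authorization": f"Bearer {resolve_token(access_token)}"}
```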

Error handling and concurrency

  • raise_on_failure (default True): when False, a failed fetch is logged and the item is skipped, so the remaining items are still returned.
  • max_retries (default 3): retries on throttled (HTTP 429) and transient server errors.
  • max_concurrent_requests (default 5): bounds the number of items fetched concurrently by run_async to avoid tripping Microsoft Graph rate limits. It has no effect on the synchronous run, which fetches items one at a time.
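
The interplay of these three settings can be sketched with plain asyncio. This is an illustration of the general pattern (a semaphore for bounded concurrency, retries for transient errors, skip-on-failure), not the fetcher's actual implementation. fetch_one, TransientError, base_delay, and the exponential backoff are assumptions introduced for the example, as is the reading of max_retries as "retries after the first attempt".

```python
import asyncio


class TransientError(Exception):
    """Stand-in for a throttled (HTTP 429) or transient server response."""


async def fetch_with_retry(fetch_one, target, max_retries=3, base_delay=1.0):
    """Call fetch_one(target), retrying transient errors with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return await fetch_one(target)
        except TransientError:
            if attempt == max_retries:
                raise
            await asyncio.sleep(base_delay * 2**attempt)


async def fetch_all(fetch_one, targets, max_concurrent_requests=5,
                    max_retries=3, raise_on_failure=True, base_delay=1.0):
    """Fetch all targets with at most max_concurrent_requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def bounded(target):
        async with semaphore:
            try:
                return await fetch_with_retry(fetch_one, target, max_retries, base_delay)
            except Exception:
                if raise_on_failure:
                    raise
                return None  # failed item is skipped

    results = await asyncio.gather(*(bounded(t) for t in targets))
    return [r for r in results if r is not None]
```

With raise_on_failure=False, one missing item does not discard the rest of the batch, which is usually what you want when the targets come from a search.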

Installation

Install the Microsoft SharePoint integration with:

shell
pip install microsoft-sharepoint-haystack

Usage

On its own

access_token below is a per-user delegated Microsoft Graph bearer token. You can pass either raw web_url strings or the Documents produced by MSSharePointRetriever.

python
from haystack_integrations.components.fetchers.microsoft_sharepoint import (
    MSSharePointFetcher,
)

fetcher = MSSharePointFetcher()

result = fetcher.run(
    access_token="my-delegated-graph-token",
    targets=[
        "https://contoso.sharepoint.com/sites/contoso-team/contoso-designs.docx",
    ],
)

# Each stream carries its file name and content type in meta.
for stream in result["streams"]:
    print(stream.meta["file_name"], stream.meta["content_type"])
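
When debugging, it can help to write the fetched streams to disk and inspect them. A minimal sketch, assuming only the data and meta attributes described above (save_streams is a hypothetical helper, not part of the integration):

```python
from pathlib import Path


def save_streams(streams, out_dir):
    """Write each stream's bytes to out_dir, named after meta['file_name']."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for index, stream in enumerate(streams):
        # Keep only the final path component so a name cannot escape out_dir.
        name = Path(stream.meta.get("file_name") or "").name
        if name in ("", ".."):
            name = f"item-{index}"
        target = out_path / name
        target.write_bytes(stream.data)
        written.append(target)
    return written
```

Calling save_streams(result["streams"], "downloads") would then leave one file per fetched item in a local downloads directory.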

In a pipeline

The following query pipeline ties the whole integration together: an OAuthTokenResolver provides a token, MSSharePointRetriever searches SharePoint, MSSharePointFetcher downloads the matching items, and a FileTypeRouter sends each ByteStream to the right converter. Note that the resolver's single access_token output feeds both the retriever and the fetcher.

python
from haystack import Pipeline
from haystack.utils import Secret
from haystack.components.routers import FileTypeRouter
from haystack.components.converters import PyPDFToDocument, DOCXToDocument

from haystack_integrations.components.connectors.oauth import OAuthTokenResolver
from haystack_integrations.utils.oauth import OAuthRefreshTokenSource
from haystack_integrations.components.retrievers.microsoft_sharepoint import (
    MSSharePointRetriever,
)
from haystack_integrations.components.fetchers.microsoft_sharepoint import (
    MSSharePointFetcher,
)

pipeline = Pipeline()
pipeline.add_component(
    "resolver",
    OAuthTokenResolver(
        token_source=OAuthRefreshTokenSource(
            token_url="https://login.microsoftonline.com/common/oauth2/v2.0/token",
            client_id="aaa-bbb-ccc",
            refresh_token=Secret.from_env_var("MS_REFRESH_TOKEN"),
            scopes=[
                "https://graph.microsoft.com/Files.Read.All",
                "https://graph.microsoft.com/Sites.Read.All",
                "offline_access",
            ],
        ),
    ),
)
pipeline.add_component("retriever", MSSharePointRetriever(top_k=5))
pipeline.add_component("fetcher", MSSharePointFetcher())
pipeline.add_component(
    "router",
    FileTypeRouter(
        mime_types=[
            "application/pdf",
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        ],
    ),
)
pipeline.add_component("pdf_converter", PyPDFToDocument())
pipeline.add_component("docx_converter", DOCXToDocument())

# The same token feeds both the retriever and the fetcher.
pipeline.connect("resolver.access_token", "retriever.access_token")
pipeline.connect("resolver.access_token", "fetcher.access_token")

# The retrieved documents become the fetcher's targets.
pipeline.connect("retriever.documents", "fetcher.targets")

# Route each fetched ByteStream to the matching converter.
pipeline.connect("fetcher.streams", "router.sources")
pipeline.connect("router.application/pdf", "pdf_converter.sources")
pipeline.connect(
    "router.application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "docx_converter.sources",
)

result = pipeline.run({"retriever": {"query": "quarterly roadmap"}})
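
The result of such a pipeline is a dictionary keyed by component name. As a rough sketch of how the converted documents might be gathered from it (collect_documents and unclassified_streams are hypothetical helpers; the "documents" and "unclassified" keys follow the usual Haystack converter and FileTypeRouter output names):

```python
def collect_documents(result):
    """Merge the 'documents' outputs of all converters into one list."""
    documents = []
    for component_name in sorted(result):
        documents.extend(result[component_name].get("documents", []))
    return documents


def unclassified_streams(result, router_name="router"):
    """Return streams whose MIME type matched none of the router's mime_types."""
    return result.get(router_name, {}).get("unclassified", [])
```

In the pipeline above, only PDF and DOCX are routed, so list items (JSON) and pages (HTML) would be expected to end up in the router's unclassified output unless you add matching MIME types and converters.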