MSSharePointFetcher
Fetches the full content of Microsoft SharePoint and OneDrive items via the Microsoft Graph API and returns it as ByteStreams.
| Most common position in a pipeline | After MSSharePointRetriever, before a Router or File Converters |
| Mandatory init variables | None |
| Mandatory run variables | access_token: A delegated Microsoft Graph bearer token, typically wired from an upstream OAuthTokenResolver; targets: A list of Documents (from MSSharePointRetriever) or raw SharePoint/OneDrive web_url strings |
| Output variables | streams: A list of ByteStreams holding the fetched content |
| API reference | Microsoft SharePoint |
| GitHub link | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/microsoft_sharepoint |
| Package name | microsoft-sharepoint-haystack |
Overview
MSSharePointFetcher downloads the full content of Microsoft SharePoint and OneDrive items through the Microsoft Graph API and returns ByteStream objects, ready for a downstream converter.
It complements MSSharePointRetriever, which returns only Search snippets and metadata. Wire the retriever's documents (or a list of web_urls) into the fetcher to download the underlying content. The fetcher dispatches on the entity type of each hit:
- Files (driveItem) are downloaded as their raw bytes (PDF, DOCX, ...).
- List items (listItem) are returned as a JSON ByteStream of the item's column values (fields).
- SharePoint pages (sitePage) are returned as an HTML ByteStream built from the page's web parts.
Each ByteStream's meta carries url, file_name, content_type, and a normalized entity_type (driveItem, listItem, or sitePage). Everything is resolved through the Microsoft Graph shares endpoint (plus the Pages API for pages), so only the web_url already exposed by the retriever is needed.
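To illustrate why a web_url alone is enough: the Microsoft Graph shares endpoint accepts a URL encoded as a share ID, which is the prefix u! followed by the unpadded base64url encoding of the URL. The sketch below shows that encoding with the standard library only. Whether the fetcher builds its requests exactly this way is an assumption, and encode_sharing_url, decode_share_id, and shares_drive_item_endpoint are hypothetical helper names, not part of the integration.

```python
import base64

GRAPH_BASE = "https://graph.microsoft.com/v1.0"


def encode_sharing_url(web_url: str) -> str:
    """Encode a URL as a Graph share ID: 'u!' + unpadded base64url."""
    encoded = base64.urlsafe_b64encode(web_url.encode("utf-8")).decode("ascii")
    return "u!" + encoded.rstrip("=")


def decode_share_id(share_id: str) -> str:
    """Reverse the encoding, which is handy when debugging a request."""
    raw = share_id[2:]  # drop the 'u!' prefix
    padded = raw + "=" * (-len(raw) % 4)  # restore the stripped padding
    return base64.urlsafe_b64decode(padded).decode("utf-8")


def shares_drive_item_endpoint(web_url: str) -> str:
    """Build the shares endpoint URL that resolves a web_url to a driveItem."""
    return f"{GRAPH_BASE}/shares/{encode_sharing_url(web_url)}/driveItem"


url = "https://contoso.sharepoint.com/sites/contoso-team/contoso-designs.docx"
print(shares_drive_item_endpoint(url))
```

Decoding a share ID back to its URL is a quick way to confirm which item a request was actually aimed at.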
Because the output is a list of ByteStreams of mixed types, the typical next step is a FileTypeRouter that dispatches each stream to the right converter (PyPDFToDocument, DOCXToDocument, HTMLToDocument, or a JSON converter).
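The routing idea can be sketched without Haystack installed. The example below is a simplified stand-in for what a router does with the content_type in each stream's meta; it is not the FileTypeRouter implementation. FakeStream and group_by_content_type are hypothetical names, and the content types shown for list items and pages (application/json and text/html) are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class FakeStream:
    """Stand-in for a ByteStream: raw bytes plus a meta dict."""

    data: bytes
    meta: dict = field(default_factory=dict)


def group_by_content_type(streams, known_types):
    """Group streams by meta['content_type']; anything else is 'unclassified'."""
    groups = defaultdict(list)
    for stream in streams:
        content_type = stream.meta.get("content_type")
        key = content_type if content_type in known_types else "unclassified"
        groups[key].append(stream)
    return dict(groups)


# Made-up sample streams, one per entity type.
streams = [
    FakeStream(b"%PDF-", {"file_name": "plan.pdf", "content_type": "application/pdf", "entity_type": "driveItem"}),
    FakeStream(b"{}", {"file_name": "task.json", "content_type": "application/json", "entity_type": "listItem"}),
    FakeStream(b"<html></html>", {"file_name": "home.html", "content_type": "text/html", "entity_type": "sitePage"}),
]

groups = group_by_content_type(streams, known_types={"application/pdf", "text/html"})
for key, items in groups.items():
    print(key, [s.meta["file_name"] for s in items])
```

Any content type without a registered converter ends up in the unclassified group, so it is worth listing every type you expect to handle.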
Authentication
The fetcher takes a per-user access_token as a run input. The token must carry delegated Microsoft Graph permissions (for example Files.Read.All for files and Sites.Read.All for list items and pages). Typically you wire it from an upstream OAuthTokenResolver, which emits a plain string. A Secret is also accepted and resolved internally.
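As a rough sketch of what "a plain string or a Secret" means in practice, the helper below normalizes either form into a bearer Authorization header. It relies only on the fact that Haystack's Secret exposes resolve_value(); FakeSecret, resolve_token, and bearer_headers are hypothetical names used for illustration, and the fetcher's internal handling may differ.

```python
from typing import Any


class FakeSecret:
    """Stand-in exposing resolve_value(), mirroring Haystack's Secret."""

    def __init__(self, value: str) -> None:
        self._value = value

    def resolve_value(self) -> str:
        return self._value


def resolve_token(access_token: Any) -> str:
    """Return the token as a plain string, accepting a str or a Secret-like object."""
    if isinstance(access_token, str):
        token = access_token
    elif hasattr(access_token, "resolve_value"):
        token = access_token.resolve_value()
    else:
        raise TypeError("access_token must be a str or expose resolve_value()")
    if not token:
        raise ValueError("access_token is empty")
    return token


def bearer_headers(access_token: Any) -> dict:
    """Build the Authorization header sent with each Microsoft Graph request."""
    return {"Authorization": f"Bearer {resolve_token(access_token)}"}


print(bearer_headers("my-delegated-graph-token"))
print(bearer_headers(FakeSecret("my-delegated-graph-token")))
```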
Error handling and concurrency
- raise_on_failure (default True): when False, a failed fetch is logged and the item is skipped, so the remaining items are still returned.
- max_retries (default 3): the number of retries for throttled (HTTP 429) responses and transient server errors.
- max_concurrent_requests (default 5): bounds the number of items fetched concurrently by run_async to avoid tripping Microsoft Graph rate limits. It has no effect on the synchronous run, which fetches items one at a time.
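The three settings above can be sketched together with asyncio. This is a minimal illustration of bounded concurrency, retry on transient errors, and skip-on-failure, not the fetcher's actual code. TransientError, fetch_with_retries, fetch_all, and fake_fetch are hypothetical names, and the exponential backoff and the reading of max_retries as "retries after the first attempt" are assumptions.

```python
import asyncio


class TransientError(Exception):
    """Stand-in for an HTTP 429 or transient 5xx response."""


async def fetch_with_retries(fetch_one, target, max_retries=3, base_delay=0.01):
    """Call fetch_one(target), retrying transient failures with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return await fetch_one(target)
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted
            await asyncio.sleep(base_delay * 2**attempt)


async def fetch_all(fetch_one, targets, max_concurrent_requests=5,
                    max_retries=3, raise_on_failure=True):
    """Fetch all targets with at most max_concurrent_requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def guarded(target):
        async with semaphore:
            try:
                return await fetch_with_retries(fetch_one, target, max_retries)
            except Exception:
                if raise_on_failure:
                    raise
                return None  # a real implementation would log the failure here

    results = await asyncio.gather(*(guarded(t) for t in targets))
    return [r for r in results if r is not None]  # skipped items are dropped


# Demo with a fake fetch function that records concurrency and attempts.
state = {"in_flight": 0, "peak": 0}
attempts = {}


async def fake_fetch(target):
    attempts[target] = attempts.get(target, 0) + 1
    if target == "broken":
        raise ValueError("not found")  # permanent failure, never retried
    if target == "flaky" and attempts[target] < 3:
        raise TransientError("429")  # succeeds on the third attempt
    state["in_flight"] += 1
    state["peak"] = max(state["peak"], state["in_flight"])
    await asyncio.sleep(0.01)
    state["in_flight"] -= 1
    return f"content of {target}"


targets = [f"item-{i}" for i in range(8)] + ["flaky", "broken"]
streams = asyncio.run(
    fetch_all(fake_fetch, targets, max_concurrent_requests=3, raise_on_failure=False)
)
print(len(streams), "fetched; peak concurrency:", state["peak"])
```

The semaphore caps in-flight requests regardless of how many targets are passed, which is the property that keeps a large batch from triggering throttling in the first place.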
Installation
Install the Microsoft SharePoint integration with:
pip install microsoft-sharepoint-haystack
Usage
On its own
access_token below is a per-user delegated Microsoft Graph bearer token. You can pass either raw web_url strings or the Documents produced by MSSharePointRetriever.
from haystack_integrations.components.fetchers.microsoft_sharepoint import (
    MSSharePointFetcher,
)
fetcher = MSSharePointFetcher()
# Fetch a single item by its web_url; Documents from MSSharePointRetriever work too.
result = fetcher.run(
    access_token="my-delegated-graph-token",
    targets=[
        "https://contoso.sharepoint.com/sites/contoso-team/contoso-designs.docx",
    ],
)
# Each stream carries its metadata alongside the raw bytes.
for stream in result["streams"]:
    print(stream.meta["file_name"], stream.meta["content_type"])
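If you want to inspect the fetched content outside a pipeline, one option is to write each stream to disk using the file_name in its meta. The sketch below uses a stand-in class so it runs without Haystack or a token; it only assumes that a stream exposes data (bytes) and meta (dict), as ByteStream does. FakeStream, safe_name, and save_streams are hypothetical names.

```python
import re
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class FakeStream:
    """Stand-in for a ByteStream: raw bytes plus a meta dict."""

    data: bytes
    meta: dict = field(default_factory=dict)


def safe_name(file_name, fallback="item"):
    """Keep only the final path component and replace unusual characters."""
    name = Path(file_name or fallback).name
    return re.sub(r"[^A-Za-z0-9._-]", "_", name) or fallback


def save_streams(streams, out_dir):
    """Write each stream's bytes to out_dir and return the written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, stream in enumerate(streams):
        # The index prefix keeps items with identical file names apart.
        path = out_dir / f"{index:03d}_{safe_name(stream.meta.get('file_name'))}"
        path.write_bytes(stream.data)
        written.append(path)
    return written


with tempfile.TemporaryDirectory() as tmp:
    demo = [FakeStream(b"docx bytes", {"file_name": "contoso designs.docx"})]
    for path in save_streams(demo, tmp):
        print(path.name)
```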
In a pipeline
The following query pipeline ties the whole integration together: an OAuthTokenResolver provides a token, MSSharePointRetriever searches SharePoint, MSSharePointFetcher downloads the matching items, and a FileTypeRouter sends each ByteStream to the right converter. Note that the resolver's single access_token output feeds both the retriever and the fetcher.
from haystack import Pipeline
from haystack.utils import Secret
from haystack.components.routers import FileTypeRouter
from haystack.components.converters import PyPDFToDocument, DOCXToDocument
from haystack_integrations.components.connectors.oauth import OAuthTokenResolver
from haystack_integrations.utils.oauth import OAuthRefreshTokenSource
from haystack_integrations.components.retrievers.microsoft_sharepoint import (
    MSSharePointRetriever,
)
from haystack_integrations.components.fetchers.microsoft_sharepoint import (
    MSSharePointFetcher,
)
pipeline = Pipeline()
pipeline.add_component(
    "resolver",
    OAuthTokenResolver(
        token_source=OAuthRefreshTokenSource(
            token_url="https://login.microsoftonline.com/common/oauth2/v2.0/token",
            client_id="aaa-bbb-ccc",
            refresh_token=Secret.from_env_var("MS_REFRESH_TOKEN"),
            scopes=[
                "https://graph.microsoft.com/Files.Read.All",
                "https://graph.microsoft.com/Sites.Read.All",
                "offline_access",
            ],
        ),
    ),
)
pipeline.add_component("retriever", MSSharePointRetriever(top_k=5))
pipeline.add_component("fetcher", MSSharePointFetcher())
pipeline.add_component(
    "router",
    FileTypeRouter(
        mime_types=[
            "application/pdf",
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        ],
    ),
)
pipeline.add_component("pdf_converter", PyPDFToDocument())
pipeline.add_component("docx_converter", DOCXToDocument())
# The same token feeds both the retriever and the fetcher.
pipeline.connect("resolver.access_token", "retriever.access_token")
pipeline.connect("resolver.access_token", "fetcher.access_token")
# The retrieved documents become the fetcher's targets.
pipeline.connect("retriever.documents", "fetcher.targets")
# Route each fetched ByteStream to the matching converter.
pipeline.connect("fetcher.streams", "router.sources")
pipeline.connect("router.application/pdf", "pdf_converter.sources")
pipeline.connect(
    "router.application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "docx_converter.sources",
)
result = pipeline.run({"retriever": {"query": "quarterly roadmap"}})
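Pipeline.run returns a dictionary keyed by component name, so the converted documents sit under each converter's documents output, and streams whose type the router was not configured for appear under the router's unclassified output. With the router above, that would include any JSON or HTML streams from list items and pages. The sketch below shows one way to read such a result; the sample dictionary is made up for illustration, and collect_documents and unclassified_streams are hypothetical helper names.

```python
def collect_documents(result, converter_names=("pdf_converter", "docx_converter")):
    """Merge the 'documents' output of every converter present in the result."""
    documents = []
    for name in converter_names:
        documents.extend(result.get(name, {}).get("documents", []))
    return documents


def unclassified_streams(result, router_name="router"):
    """Return streams the router could not match to a configured MIME type."""
    return result.get(router_name, {}).get("unclassified", [])


# Made-up result shape; strings stand in for Document and ByteStream objects.
sample_result = {
    "pdf_converter": {"documents": ["roadmap.pdf as Document"]},
    "docx_converter": {"documents": ["notes.docx as Document"]},
    "router": {"unclassified": ["task list item as JSON ByteStream"]},
}

print(collect_documents(sample_result))
print(unclassified_streams(sample_result))
```

Using .get with defaults keeps the helpers safe when a branch produced no output, for example when a query returns only PDFs.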