MSSharePointRetriever
Retrieves content from Microsoft SharePoint and OneDrive via the Microsoft Search (Graph) API.
| Most common position in a pipeline | At the start of a query pipeline, after an OAuthTokenResolver that provides the access_token |
| Mandatory init variables | None |
| Mandatory run variables | query: The search query string. access_token: A delegated Microsoft Graph bearer token, typically wired from an upstream OAuthTokenResolver. |
| Output variables | documents: A list of Documents holding the search snippets and resource metadata |
| API reference | Microsoft SharePoint |
| GitHub link | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/microsoft_sharepoint |
| Package name | microsoft-sharepoint-haystack |
Overview
MSSharePointRetriever searches a user's Microsoft SharePoint and OneDrive content through the Microsoft Search (Graph) API. Given a query, it calls POST /search/query and maps each hit to a Haystack Document whose content is the search snippet and whose meta carries the resource metadata: file_name, web_url, entity_type, created_date_time, last_modified_date_time, created_by, last_modified_by, mime_type, and file_extension. It also stores the SharePoint identifiers a downstream fetcher needs to read list items and pages by ID (site_id, list_id, list_item_id, list_item_unique_id).
The retriever does not download or convert the underlying files; it returns only the search snippets and metadata. To get the full content of the hits, compose it with MSSharePointFetcher followed by a converter.
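To make the request and mapping described above more concrete, here is a minimal sketch of a Microsoft Search request body and of turning one hit into a snippet-plus-metadata record. The field names (requests, entityTypes, queryString, queryTemplate, from, size, summary, resource) follow the public Microsoft Graph search API. The helper names build_search_request and hit_to_record are hypothetical, and both the mapping of top_k to size and the choice of Graph fields behind each meta key are assumptions, not a description of the component's internals.

```python
import json

# Public Microsoft Graph endpoint for POST /search/query (v1.0).
GRAPH_SEARCH_URL = "https://graph.microsoft.com/v1.0/search/query"


def build_search_request(query, entity_types=("driveItem", "listItem"),
                         top_k=10, query_template=None):
    """Build a Graph search request body (hypothetical helper)."""
    search_query = {"queryString": query}
    if query_template is not None:
        # queryTemplate is an optional property of the Graph searchQuery object.
        search_query["queryTemplate"] = query_template
    return {
        "requests": [
            {
                "entityTypes": list(entity_types),
                "query": search_query,
                "from": 0,
                "size": top_k,  # assumption: top_k corresponds to the page size
            }
        ]
    }


def hit_to_record(hit):
    """Map one search hit to a snippet plus metadata (assumed field mapping)."""
    resource = hit.get("resource", {})
    return {
        "content": hit.get("summary", ""),
        "meta": {
            "file_name": resource.get("name"),
            "web_url": resource.get("webUrl"),
            "created_date_time": resource.get("createdDateTime"),
            "last_modified_date_time": resource.get("lastModifiedDateTime"),
        },
    }


body = build_search_request("quarterly roadmap", top_k=5)
print(json.dumps(body, indent=2))
```

Sending this body requires a delegated bearer token in the Authorization header; the sketch stops short of the network call so that it runs without credentials.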
Authentication
The retriever takes a per-user access_token as a run input. The token must carry delegated Microsoft Graph permissions (for example Files.Read.All, plus Sites.Read.All for site and list scoping); the Search API supports delegated permissions only. Typically you wire the token from an upstream OAuthTokenResolver, which emits a plain string. A Secret is also accepted and resolved internally.
Scoping and filtering the search
You can narrow what is searched in several ways:
- entity_types: which Microsoft Search entity types to query. Defaults to ["driveItem", "listItem"], which covers files, folders, SharePoint pages and news, and list items. Other valid values are "list" and "site".
- KQL operators embedded directly in the query, for example filetype:docx, author:"Jane Doe", or path:"https://contoso.sharepoint.com/sites/Team". See the Keyword Query Language (KQL) syntax reference.
- query_template: a reusable template such as '{searchTerms} path:"https://contoso.sharepoint.com/sites/Team"', where the literal {searchTerms} placeholder is replaced by the run-time query.
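As a rough illustration of the KQL and template options above, the sketch below composes a query string from optional property restrictions and shows what replacing the {searchTerms} placeholder amounts to. Both helpers (build_kql_query and apply_query_template) are hypothetical and exist only for this example; where the real substitution takes place is not specified here, so treat the second function as a local approximation.

```python
def build_kql_query(terms, filetype=None, author=None, path=None):
    """Append optional KQL property restrictions to free-text terms."""
    parts = [terms]
    if filetype:
        parts.append(f"filetype:{filetype}")
    if author:
        # Values containing spaces are wrapped in double quotes.
        parts.append(f'author:"{author}"')
    if path:
        parts.append(f'path:"{path}"')
    return " ".join(parts)


def apply_query_template(template, query):
    """Replace the literal {searchTerms} placeholder with the run-time query."""
    return template.replace("{searchTerms}", query)


team_site = "https://contoso.sharepoint.com/sites/Team"

# Restrictions embedded directly in the query string.
print(build_kql_query("quarterly roadmap", filetype="docx", author="Jane Doe"))

# The same site scoping expressed once as a reusable template.
template = '{searchTerms} path:"' + team_site + '"'
print(apply_query_template(template, "quarterly roadmap"))
```

A template keeps the scoping rule in one place, so callers only supply the search terms; embedding operators in the query suits one-off filters.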
Installation
Install the Microsoft SharePoint integration with:
pip install microsoft-sharepoint-haystack
Usage
On its own
access_token below is a per-user delegated Microsoft Graph bearer token. In production you would obtain it from an OAuthTokenResolver rather than pasting it in.
from haystack_integrations.components.retrievers.microsoft_sharepoint import (
    MSSharePointRetriever,
)

retriever = MSSharePointRetriever(top_k=5)
result = retriever.run(
    query="quarterly roadmap",
    access_token="my-delegated-graph-token",
)
for doc in result["documents"]:
    print(doc.meta["file_name"], "-", doc.meta["web_url"])
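Because each returned Document carries its resource metadata in meta, simple post-processing needs no further API calls. The sketch below groups hits by file_extension. The group_by_extension helper is hypothetical, and the sample objects are stand-ins with made-up values (including the assumption that extensions appear without a leading dot), used so that the example runs without a token.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_extension(documents):
    """Group file names by the file_extension stored in each document's meta."""
    groups = defaultdict(list)
    for doc in documents:
        # Hits without an extension (for example list items) go under "unknown".
        extension = doc.meta.get("file_extension") or "unknown"
        groups[extension].append(doc.meta.get("file_name"))
    return dict(groups)


# Stand-ins for retrieved Documents; all values are invented for illustration.
sample_documents = [
    SimpleNamespace(content="Roadmap for Q3", meta={"file_name": "roadmap.docx", "file_extension": "docx"}),
    SimpleNamespace(content="Budget overview", meta={"file_name": "budget.xlsx", "file_extension": "xlsx"}),
    SimpleNamespace(content="Roadmap notes", meta={"file_name": "notes.docx", "file_extension": "docx"}),
    SimpleNamespace(content="Team news", meta={"file_name": "News item"}),
]

for extension, names in group_by_extension(sample_documents).items():
    print(extension, names)
```

With real results, the same function can be applied to result["documents"] directly, since it only reads the meta attribute.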
In a pipeline
The following pipeline obtains a token from an OAuthTokenResolver and feeds it into the retriever, so that running the pipeline requires only the query:
from haystack import Pipeline
from haystack.utils import Secret
from haystack_integrations.components.connectors.oauth import OAuthTokenResolver
from haystack_integrations.utils.oauth import OAuthRefreshTokenSource
from haystack_integrations.components.retrievers.microsoft_sharepoint import (
    MSSharePointRetriever,
)

pipeline = Pipeline()
pipeline.add_component(
    "resolver",
    OAuthTokenResolver(
        token_source=OAuthRefreshTokenSource(
            token_url="https://login.microsoftonline.com/common/oauth2/v2.0/token",
            client_id="aaa-bbb-ccc",
            refresh_token=Secret.from_env_var("MS_REFRESH_TOKEN"),
            scopes=[
                "https://graph.microsoft.com/Files.Read.All",
                "https://graph.microsoft.com/Sites.Read.All",
                "offline_access",
            ],
        ),
    ),
)
pipeline.add_component("retriever", MSSharePointRetriever(top_k=5))
pipeline.connect("resolver.access_token", "retriever.access_token")

result = pipeline.run({"retriever": {"query": "quarterly roadmap"}})
documents = result["retriever"]["documents"]
To download and convert the full content of the retrieved hits, connect the retriever's documents output to an MSSharePointFetcher. See the MSSharePointFetcher page for an end-to-end retrieve-fetch-convert example.