Microsoft SharePoint
haystack_integrations.components.retrievers.microsoft_sharepoint.retriever
MSSharePointRetriever
Retrieves content from Microsoft SharePoint and OneDrive via the Microsoft Search (Graph) API.
Given a query, the retriever calls POST /search/query and maps each hit to a Haystack Document
whose content is the search snippet and whose meta carries the resource metadata (file_name,
web_url, entity_type, created_date_time, last_modified_date_time, created_by, last_modified_by,
mime_type, and file_extension). It does not download or convert the underlying files. Compose a
downstream fetcher/converter on the returned web_url when full file content is needed.
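The hit-to-document mapping described above can be sketched as follows. This is an illustration only, not the retriever's actual code: the function name `hit_to_content_and_meta` is hypothetical, the response field names (`summary`, `resource`, `webUrl`, and so on) assume the Microsoft Search response shape for `driveItem` hits, and deriving `file_extension` from the file name is an assumption.

```python
from pathlib import PurePosixPath
from typing import Any, Optional


def hit_to_content_and_meta(hit: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Map one search hit to (content, meta); a hypothetical helper for illustration."""
    resource = hit.get("resource") or {}
    name = resource.get("name")

    def display_name(identity_set: Optional[dict[str, Any]]) -> Optional[str]:
        # Assumed shape: Graph identity sets nest the person under "user".
        return ((identity_set or {}).get("user") or {}).get("displayName")

    meta = {
        "file_name": name,
        "web_url": resource.get("webUrl"),
        # "#microsoft.graph.driveItem" -> "driveItem"
        "entity_type": resource.get("@odata.type", "").rsplit(".", 1)[-1] or None,
        "created_date_time": resource.get("createdDateTime"),
        "last_modified_date_time": resource.get("lastModifiedDateTime"),
        "created_by": display_name(resource.get("createdBy")),
        "last_modified_by": display_name(resource.get("lastModifiedBy")),
        "mime_type": (resource.get("file") or {}).get("mimeType"),
        # Assumption: the extension is taken from the file name.
        "file_extension": (PurePosixPath(name).suffix.lstrip(".") or None) if name else None,
    }
    # The document content is the search snippet, not the file body.
    return hit.get("summary", ""), meta


sample_hit = {
    "summary": "Q3 roadmap draft",
    "resource": {
        "@odata.type": "#microsoft.graph.driveItem",
        "name": "roadmap.docx",
        "webUrl": "https://contoso.sharepoint.com/sites/Team/roadmap.docx",
        "createdBy": {"user": {"displayName": "Jane Doe"}},
    },
}
content, meta = hit_to_content_and_meta(sample_hit)
```

Fields missing from a hit come back as None in this sketch, so downstream components should not assume every meta key is populated.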
The retriever takes a per-user access_token as a run input, typically wired
from an upstream OAuthResolver. The token must carry delegated Microsoft Graph permissions
(for example Files.Read.All and, for site/list scoping, Sites.Read.All). The Search API supports
delegated permissions only.
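To make the role of the token concrete, the sketch below builds (but does not send) the kind of request the retriever is described as issuing. It uses only the standard library. The helper name `build_search_request` is hypothetical, and the request body assumes the Microsoft Search request format; the retriever's own HTTP client and body may differ.

```python
import json
import urllib.request

# Default base URL named in this reference.
GRAPH_URL = "https://graph.microsoft.com/v1.0"


def build_search_request(query: str, access_token: str, size: int = 5) -> urllib.request.Request:
    """Build a POST /search/query request carrying a delegated bearer token."""
    body = {
        "requests": [
            {
                "entityTypes": ["driveItem", "listItem"],
                "query": {"queryString": query},
                "from": 0,
                "size": size,
            }
        ]
    }
    return urllib.request.Request(
        url=f"{GRAPH_URL}/search/query",
        data=json.dumps(body).encode("utf-8"),
        headers={
            # The per-user delegated token goes in the Authorization header.
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Nothing is sent here; the request object is only constructed for inspection.
request = build_search_request("quarterly roadmap", "my-delegated-graph-token")
```

Because the token is tied to a single user, results reflect what that user is permitted to see.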
Usage example
from haystack_integrations.components.retrievers.microsoft_sharepoint import (
    MSSharePointRetriever,
)

retriever = MSSharePointRetriever(top_k=5)

# `access_token` is a per-user delegated Microsoft Graph bearer token.
result = retriever.run(
    query="quarterly roadmap", access_token="my-delegated-graph-token"
)
documents = result["documents"]
In a pipeline, connect an upstream component that emits a per-user access_token to the retriever's
access_token input. See the integration documentation for a full example that obtains the token from
an OAuth provider.
__init__
__init__(
*,
entity_types: list[str] | None = None,
top_k: int = 10,
fields: list[str] | None = None,
query_template: str | None = None,
graph_url: str = _DEFAULT_GRAPH_URL,
timeout: float = 30.0,
max_retries: int = 3
) -> None
Initialize the retriever.
Parameters:
- entity_types (list[str] | None) – The Microsoft Search entity types to query. Defaults to ["driveItem", "listItem"], which covers files, folders, SharePoint pages and news, and list items. Other valid values are "list" and "site". See the supported values and combinations in the Microsoft docs.
- top_k (int) – The maximum number of documents to return. Maps to the Search API size and is paginated when it exceeds a single page.
- fields (list[str] | None) – Optional list of resource properties to request via the Search API fields selection (only honored for listItem and driveItem entity types). See Get selected properties.
- query_template (str | None) – Optional query template used to scope the search, for example '{searchTerms} path:"https://contoso.sharepoint.com/sites/Team"'. The literal {searchTerms} placeholder is replaced by the run-time query. The template uses Keyword Query Language (KQL).
- graph_url (str) – The Microsoft Graph base URL. Defaults to https://graph.microsoft.com/v1.0. Override for sovereign clouds.
- timeout (float) – The HTTP timeout in seconds for each request to Microsoft Graph.
- max_retries (int) – The maximum number of retries for throttled (HTTP 429) or transient server errors.
Raises:
- SharePointConfigError – If entity_types is empty, top_k is not positive, or max_retries is negative.
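As a rough picture of how these parameters could translate into Search API request bodies, including paging when top_k exceeds one page, consider the sketch below. The helper `iter_search_bodies` and the `page_size` value are hypothetical. The sketch also assumes the template is passed through as the Search API `queryTemplate` property rather than expanded client-side; the retriever's internals may differ.

```python
from typing import Any, Iterator, Optional


def iter_search_bodies(
    query: str,
    top_k: int,
    *,
    page_size: int = 25,  # hypothetical page size, chosen only for illustration
    entity_types: Optional[list[str]] = None,
    fields: Optional[list[str]] = None,
    query_template: Optional[str] = None,
) -> Iterator[dict[str, Any]]:
    """Yield one request body per page until top_k results have been requested."""
    if top_k <= 0:
        raise ValueError("top_k must be positive")

    search_query: dict[str, Any] = {"queryString": query}
    if query_template is not None:
        search_query["queryTemplate"] = query_template

    offset = 0
    while offset < top_k:
        request: dict[str, Any] = {
            "entityTypes": entity_types or ["driveItem", "listItem"],
            "query": search_query,
            "from": offset,
            # The last page is trimmed so the total never exceeds top_k.
            "size": min(page_size, top_k - offset),
        }
        if fields:
            request["fields"] = fields
        yield {"requests": [request]}
        offset += page_size


bodies = list(
    iter_search_bodies(
        "quarterly roadmap",
        top_k=60,
        page_size=25,
        query_template='{searchTerms} path:"https://contoso.sharepoint.com/sites/Team"',
    )
)
pages = [(b["requests"][0]["from"], b["requests"][0]["size"]) for b in bodies]
```

A real paging loop would also stop early once the service reports that no more results are available, which this sketch leaves out.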
run
run(
query: str, access_token: str | Secret, top_k: int | None = None
) -> dict[str, list[Document]]
Search SharePoint and OneDrive and return the matching documents.
Parameters:
- query (str) – The search query string. Filter results by embedding Keyword Query Language (KQL) operators directly in the query, for example filetype:docx, author:"Jane Doe", or path:"https://contoso.sharepoint.com/sites/Team". See the KQL syntax reference.
- access_token (str | Secret) – A delegated Microsoft Graph bearer token for the user whose content is searched, typically wired from an upstream OAuthResolver (which emits a plain str). A Secret is also accepted and resolved internally.
- top_k (int | None) – Overrides the top_k configured at initialization for this run.
Returns:
- dict[str, list[Document]] – A dictionary with a documents key holding the list of retrieved Document objects.
Raises:
- SharePointConfigError – If access_token is a Secret that does not resolve to a string.
- SharePointRequestError – If Microsoft Graph returns an error response.
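Since filtering is done by embedding KQL restrictions in the query string, a small helper can keep that composition readable. The function `kql_query` below is hypothetical and not part of the integration; it handles only simple property:value restrictions and does not escape quotation marks inside values.

```python
def kql_query(terms: str, **filters: str) -> str:
    """Append KQL property restrictions to free-text search terms."""
    parts = [terms]
    for prop, value in filters.items():
        # Quote values that contain whitespace or a colon (for example URLs).
        needs_quotes = ":" in value or any(ch.isspace() for ch in value)
        parts.append(f'{prop}:"{value}"' if needs_quotes else f"{prop}:{value}")
    return " ".join(parts)


query = kql_query("quarterly roadmap", filetype="docx", author="Jane Doe")
scoped = kql_query("roadmap", path="https://contoso.sharepoint.com/sites/Team")
```

The resulting string can be passed as the query argument of run; scoping that should apply to every search may fit better in query_template at initialization.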
run_async
run_async(
query: str, access_token: str | Secret, top_k: int | None = None
) -> dict[str, list[Document]]
Asynchronously search SharePoint and OneDrive and return the matching documents.
Parameters:
- query (str) – The search query string. Filter results by embedding Keyword Query Language (KQL) operators directly in the query, for example filetype:docx, author:"Jane Doe", or path:"https://contoso.sharepoint.com/sites/Team". See the KQL syntax reference.
- access_token (str | Secret) – A delegated Microsoft Graph bearer token for the user whose content is searched, typically wired from an upstream OAuthResolver (which emits a plain str). A Secret is also accepted and resolved internally.
- top_k (int | None) – Overrides the top_k configured at initialization for this run.
Returns:
- dict[str, list[Document]] – A dictionary with a documents key holding the list of retrieved Document objects.
Raises:
- SharePointConfigError – If access_token is a Secret that does not resolve to a string.
- SharePointRequestError – If Microsoft Graph returns an error response.
to_dict
Serialize this component to a dictionary.
Returns:
- dict[str, Any] – The serialized component as a dictionary.
from_dict
Deserialize this component from a dictionary.
Parameters:
- data (dict[str, Any]) – The dictionary representation of this component.
Returns:
- MSSharePointRetriever – The deserialized component instance.