Google Drive
haystack_integrations.components.retrievers.google_drive.retriever
GoogleDriveRetriever
Retrieves files from Google Drive via the Drive API v3 search (files.list) endpoint.
Given a query, the retriever runs a full-text search over the user's Drive (and optionally shared
drives) and maps each matching file to a Haystack Document. By default, each Document carries
resource metadata (file_name, file_id, web_url, mime_type, file_extension, author, and
timestamps) and uses the file description or name as content, because the Drive search API does
not return a text snippet. Set include_content=True to additionally export native Google
Docs/Sheets/Slides to text and use that as the Document content. Binary files (PDF, DOCX, ...) are
never downloaded. Compose a downstream fetcher/converter on the returned web_url/file_id when full
file content is needed.
The retriever takes a per-user access_token as a run input, typically wired from an upstream
OAuthResolver. The token must carry a delegated Google OAuth scope that allows search
(for example https://www.googleapis.com/auth/drive.readonly). The metadata-only
drive.metadata.readonly scope cannot search file content or export documents.
Usage example
from haystack import Pipeline
from haystack.utils import Secret
from haystack_integrations.components.connectors.oauth import OAuthResolver
from haystack_integrations.utils.oauth import RefreshTokenSource
from haystack_integrations.components.retrievers.google_drive import GoogleDriveRetriever
pipeline = Pipeline()
pipeline.add_component(
"resolver",
OAuthResolver(
token_source=RefreshTokenSource(
token_url="https://oauth2.googleapis.com/token",
client_id="aaa-bbb-ccc",
refresh_token=Secret.from_env_var("GOOGLE_REFRESH_TOKEN"),
scopes=["https://www.googleapis.com/auth/drive.readonly"],
),
),
)
pipeline.add_component("retriever", GoogleDriveRetriever(top_k=5))
pipeline.connect("resolver.access_token", "retriever.access_token")
result = pipeline.run({"retriever": {"query": "quarterly roadmap"}})
documents = result["retriever"]["documents"]
init
__init__(
*,
include_content: bool = False,
top_k: int = 10,
query_filter: str | None = None,
include_shared_drives: bool = False,
order_by: str | None = None,
fields: list[str] | None = None,
api_base_url: str = _DEFAULT_API_BASE_URL,
timeout: float = 30.0,
max_retries: int = 3
) -> None
Initialize the retriever.
Parameters:
- include_content (
bool) – WhenTrue, native Google Docs/Sheets/Slides are exported to text and the result becomes theDocumentcontent. Binary files are never downloaded. WhenFalse(the default),contentis the filedescriptionornameand no export request is made. - top_k (
int) – The maximum number of documents to return. Maps to the DrivepageSizeand is paginated when it exceeds a single page. - query_filter (
str | None) – Optional Drive query clause AND-ed with the full-text search term, for example"mimeType != 'application/vnd.google-apps.folder'"or"'<folderId>' in parents". - include_shared_drives (
bool) – WhenTrue, the search spans shared drives as well as the user's My Drive (setsincludeItemsFromAllDrives,supportsAllDrives, andcorpora=allDrives). - order_by (
str | None) – Optional DriveorderByexpression, for example"modifiedTime desc". - fields (
list[str] | None) – Optional list of file properties to request via the Drivefieldsselection. Defaults to a standard set covering the returned metadata. - api_base_url (
str) – The Drive API base URL. Defaults tohttps://www.googleapis.com/drive/v3. - timeout (
float) – The HTTP timeout in seconds for each request to the Drive API. - max_retries (
int) – The maximum number of retries on HTTP 429 (rate limit), 500, 502, 503, or 504 responses. Set to 0 to disable retries.
Raises:
GoogleDriveConfigError– Iftop_kis not positive ormax_retriesis negative.
run
run(
query: str, access_token: str | Secret, top_k: int | None = None
) -> dict[str, list[Document]]
Search Google Drive and return the matching documents.
Parameters:
- query (
str) – The search query string, matched against the full text of files viafullText contains. - access_token (
str | Secret) – A delegated Google OAuth bearer token for the user whose Drive is searched, typically wired from an upstreamOAuthResolver(which emits a plainstr). ASecretis also accepted and resolved internally. - top_k (
int | None) – Overrides thetop_kconfigured at initialization for this run.
Returns:
dict[str, list[Document]]– A dictionary with adocumentskey holding the list of retrievedDocumentobjects.
Raises:
GoogleDriveConfigError– Ifaccess_tokenis aSecretthat does not resolve to a string.GoogleDriveRequestError– If the Drive API returns an error response.httpx.HTTPError– If a network-level error occurs (for example a timeout or connection failure).
run_async
run_async(
query: str, access_token: str | Secret, top_k: int | None = None
) -> dict[str, list[Document]]
Asynchronously search Google Drive and return the matching documents.
Parameters:
- query (
str) – The search query string, matched against the full text of files viafullText contains. - access_token (
str | Secret) – A delegated Google OAuth bearer token for the user whose Drive is searched, typically wired from an upstreamOAuthResolver(which emits a plainstr). ASecretis also accepted and resolved internally. - top_k (
int | None) – Overrides thetop_kconfigured at initialization for this run.
Returns:
dict[str, list[Document]]– A dictionary with adocumentskey holding the list of retrievedDocumentobjects.
Raises:
GoogleDriveConfigError– Ifaccess_tokenis aSecretthat does not resolve to a string.GoogleDriveRequestError– If the Drive API returns an error response.httpx.HTTPError– If a network-level error occurs (for example a timeout or connection failure).
to_dict
Serialize this component to a dictionary.
Returns:
dict[str, Any]– The serialized component as a dictionary.
from_dict
Deserialize this component from a dictionary.
Parameters:
- data (
dict[str, Any]) – The dictionary representation of this component.
Returns:
GoogleDriveRetriever– The deserialized component instance.