
Secret Management

This page explains how to manage secrets in Haystack components and introduces the Secret type for structured secret handling. It describes the drawbacks of hard-coding secrets in code and recommends using environment variables instead.

Many Haystack components interact with third-party frameworks and service providers such as Azure, Google Vertex AI, and Hugging Face. Their libraries often require users to authenticate before they can access the underlying product. Authentication usually relies on a secret value that acts as an opaque identifier to the third-party backend. This secret must be handled with care.

Example Use Case

The Problem

Let's consider an example RAG Pipeline that embeds a query, uses a Retriever component to locate Documents relevant to the query, and then leverages an LLM to generate an answer based on the retrieved Documents.

The OpenAIGenerator component used in this Pipeline expects an API key to authenticate with OpenAI's servers and perform the generation. Let's assume that the component accepts a str value for it:

generator = OpenAIGenerator(model="gpt-4", api_key="sk-xxxxxxxxxxxxxxxxxx")
pipeline.add_component("generator", generator)

This works in a pinch, but it is bad practice: we shouldn't hard-code such secrets in the codebase. An alternative is to store the key in an environment variable, read it in Python, and pass it to the component:

import os

api_key = os.environ.get("OPENAI_API_KEY")
generator = OpenAIGenerator(model="gpt-4", api_key=api_key)
pipeline.add_component("generator", generator)

This is better: the Pipeline works as intended, and we aren't hard-coding any secrets in the code.
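One caveat with os.environ.get is that it returns None when the variable is unset, so a missing key only surfaces later, when the backend rejects the request. A minimal sketch of failing fast instead; the helper name read_required_env and the variable name EXAMPLE_API_KEY are hypothetical and not part of Haystack:

```python
import os


def read_required_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# A dummy variable is set here only so that the example runs anywhere.
os.environ["EXAMPLE_API_KEY"] = "dummy-value"
api_key = read_required_env("EXAMPLE_API_KEY")
```

The returned string can then be passed to the component exactly as in the snippet above.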

Remember that Pipelines are serializable. Since the API key is a secret, we should avoid saving it to disk. Let's modify the component's to_dict method to exclude the key:

def to_dict(self) -> Dict[str, Any]:
    # Do not pass the `api_key` init parameter.
    return default_to_dict(self, model=self.model)

But what happens when the Pipeline is loaded from disk? In the best case, the component's backend automatically reads the key from an environment variable whose name is hard-coded in the backend, and that key is the same one that was passed to the component before it was serialized. In the worst case, the backend does not look up any environment variable and fails when it is called inside a pipeline.run() invocation.
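The failure mode can be reproduced without Haystack. In the sketch below, ToyGenerator is a hypothetical stand-in for a component, not a Haystack class: its to_dict drops the key, so the instance restored by from_dict has no credentials unless something else supplies them.

```python
import json
from typing import Any, Dict, Optional


class ToyGenerator:
    """Hypothetical stand-in for a component; not a Haystack class."""

    def __init__(self, model: str, api_key: Optional[str] = None):
        self.model = model
        self.api_key = api_key

    def to_dict(self) -> Dict[str, Any]:
        # The key is deliberately left out so that it never reaches disk.
        return {"init_parameters": {"model": self.model}}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ToyGenerator":
        return cls(**data["init_parameters"])


original = ToyGenerator(model="gpt-4", api_key="dummy-key")
serialized = json.dumps(original.to_dict())
restored = ToyGenerator.from_dict(json.loads(serialized))

# The secret stayed out of the serialized form, but the restored object lost it too.
print("dummy-key" in serialized)  # False
print(restored.api_key)           # None
```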

The Solution

Structured Secret Handling

To avoid such problematic situations, Haystack introduces a simple Secret type. This type is used by components that require authentication, and it provides a consistent API that ensures that secrets do not get unwittingly serialized to disk. It also ensures that the user explicitly chooses how the secret is passed to the component.

The Secret class has two class methods:

  • from_token - Accepts a bare string token that acts as the secret.
  • from_env_var - Accepts the names of one or more environment variables that can contain the secret.

Both of those methods return an instance of the Secret type. The instance has an additional resolve_value method that either returns a str that represents the secret or None if the resolution fails. The return value of the above method can be passed to the backend expecting the secret.
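To make the resolution step concrete, here is a rough stand-in for an environment-based lookup, written with the standard library only. It assumes that the variable names are tried in order and that the first one that is set wins. The function resolve_from_env and the variable names are hypothetical, and this is not Haystack's implementation.

```python
import os
from typing import Optional, Sequence


def resolve_from_env(names: Sequence[str]) -> Optional[str]:
    """Return the value of the first environment variable that is set, else None."""
    for name in names:
        value = os.environ.get(name)
        if value is not None:
            return value
    return None


# Dummy variable names so that the example does not touch real credentials.
os.environ.pop("EXAMPLE_PRIMARY_KEY", None)
os.environ.pop("EXAMPLE_UNSET_KEY", None)
os.environ["EXAMPLE_FALLBACK_KEY"] = "dummy-fallback"

resolved = resolve_from_env(["EXAMPLE_PRIMARY_KEY", "EXAMPLE_FALLBACK_KEY"])
missing = resolve_from_env(["EXAMPLE_UNSET_KEY"])
```

Here resolved holds the fallback value because the primary variable is unset, and missing is None, which a component would need to handle before calling its backend.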

Ensure that components that accept one or more Secrets as init parameters save them in their to_dict method and load them in their from_dict class method. Token-based Secret instances cannot be serialized, and attempting to do so raises an exception. Environment-based Secret instances can be serialized by default.
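The serialization rule can be pictured with two small stand-in classes. This only illustrates the behaviour described above. The class names, the dictionary layout, and the error type are invented for the example and do not mirror Haystack's internals.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class ToyTokenSecret:
    token: str

    def to_dict(self) -> Dict[str, Any]:
        # Writing the token out would leak it, so refuse.
        raise ValueError("Token-based secrets cannot be serialized")


@dataclass(frozen=True)
class ToyEnvSecret:
    env_vars: Tuple[str, ...]

    def to_dict(self) -> Dict[str, Any]:
        # Only the variable names are stored; the secret value never appears.
        return {"type": "env_var", "env_vars": list(self.env_vars)}


env_secret_data = ToyEnvSecret(("OPENAI_API_KEY",)).to_dict()

try:
    ToyTokenSecret("dummy-token").to_dict()
    token_serialized = True
except ValueError:
    token_serialized = False
```

The point of the design is that the serialized form carries a reference to where the secret lives, never the secret itself, so a Pipeline loaded on another machine resolves the key from that machine's environment.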

Example

from typing import Optional

from haystack import component, default_from_dict, default_to_dict
from haystack.utils import Secret, deserialize_secrets_inplace

@component
class MyComponent:
  def __init__(self, api_key: Optional[Secret] = None, **kwargs):
    self.api_key = api_key
    self.backend = None

  def warm_up(self):
    # Call resolve_value to yield a single result. The semantics of the result are policy-dependent.
    # Currently, all supported policies return a single string token.
    # SomeBackend is a placeholder for the third-party client; pass its other arguments as needed.
    self.backend = SomeBackend(api_key=self.api_key.resolve_value() if self.api_key else None)

  def to_dict(self):
    # Serialize the policy like any other (custom) data. If the policy is token-based, it will
    # raise an error. Pass any other init parameters as further keyword arguments.
    return default_to_dict(self, api_key=self.api_key.to_dict() if self.api_key else None)

  @classmethod
  def from_dict(cls, data):
    # Deserialize the policy data before passing it to the generic from_dict function.
    api_key_data = data["init_parameters"]["api_key"]
    api_key = Secret.from_dict(api_key_data) if api_key_data is not None else None
    data["init_parameters"]["api_key"] = api_key
    # Alternatively, use the helper function.
    # deserialize_secrets_inplace(data["init_parameters"], keys=["api_key"])
    return default_from_dict(cls, data)

# No authentication.
component = MyComponent(api_key=None)

# Token based authentication
component = MyComponent(api_key=Secret.from_token("sk-randomAPIkeyasdsa32ekasd32e"))
component.to_dict() # Error! Can't serialize authentication tokens

# Environment variable based authentication
component = MyComponent(api_key=Secret.from_env_var("OPENAI_API_KEY"))
component.to_dict() # This is fine