Serializing Pipelines
Save your pipelines into a custom format and explore the serialization options.
Serialization means converting a pipeline to a format that you can save on your disk and load later.
Haystack supports YAML format for pipeline serialization.
Converting a Pipeline to YAML
Use the dumps() method to convert a Pipeline object to YAML:
```python
from haystack import Pipeline

pipe = Pipeline()
print(pipe.dumps())

## Prints:
##
## components: {}
## connections: []
## max_runs_per_component: 100
## metadata: {}
```
You can also use the dump() method to save the YAML representation of a pipeline to a file.
Converting a Pipeline Back to Python
You can convert a YAML pipeline back into Python. Use the loads() method to convert a string representation of a pipeline (str, bytes, or bytearray), or the load() method to convert a pipeline stored in a file-like object, back into a corresponding Python object.
Both loading methods support callbacks that let you modify components during the deserialization process.
Here is an example script:
```python
from haystack import Pipeline
from haystack.core.serialization import DeserializationCallbacks
from typing import Type, Dict, Any

## This is the YAML you want to convert to Python:
pipeline_yaml = """
components:
  cleaner:
    init_parameters:
      remove_empty_lines: true
      remove_extra_whitespaces: true
      remove_regex: null
      remove_repeated_substrings: false
      remove_substrings: null
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
  converter:
    init_parameters:
      encoding: utf-8
    type: haystack.components.converters.txt.TextFileToDocument
connections:
- receiver: cleaner.documents
  sender: converter.documents
max_runs_per_component: 100
metadata: {}
"""

def component_pre_init_callback(
    component_name: str,
    component_cls: Type,
    init_params: Dict[str, Any],
):
    # This function gets called every time a component is deserialized.
    if component_name == "cleaner":
        assert "DocumentCleaner" in component_cls.__name__

        # Modify the init parameters. The modified parameters are passed to
        # the init method of the component during deserialization.
        init_params["remove_empty_lines"] = False
        print("Modified 'remove_empty_lines' to False in 'cleaner' component")
    else:
        print(f"Not modifying component {component_name} of class {component_cls}")

pipe = Pipeline.loads(
    pipeline_yaml,
    callbacks=DeserializationCallbacks(component_pre_init_callback),
)
```
Default Serialization Behavior
The serialization system uses default_to_dict and default_from_dict to handle many object types automatically. You typically do not need to implement custom to_dict/from_dict for:
- Secrets: serialized and deserialized automatically so that sensitive values aren't stored in plain text.
- ComponentDevice: device configuration is detected and restored automatically.
- Objects with their own `to_dict`/`from_dict`: any init parameter whose type defines `to_dict()` is serialized by calling it; any dict in `init_parameters` with a `type` key pointing to a class with `from_dict()` is deserialized automatically.
To serialize or deserialize a single component, you can use component_to_dict and component_from_dict from haystack.core.serialization. They use the default behavior above as a fallback when the component doesn't define custom to_dict/from_dict:
```python
from haystack import component
from haystack.core.serialization import component_from_dict, component_to_dict

@component
class Greeter:
    def __init__(self, message: str = "Hello"):
        self.message = message

    @component.output_types(greeting=str)
    def run(self, name: str):
        return {"greeting": f"{self.message}, {name}!"}

# Serialize a component instance to a dictionary
greeter = Greeter(message="Hi")
data = component_to_dict(greeter, "my_greeter")

# Deserialize back to a component instance
restored = component_from_dict(Greeter, data, "my_greeter")
assert restored.message == greeter.message
```
:::caution Init parameters must be stored as instance attributes
Default serialization only works when there is a 1:1 mapping between init parameter names and instance attributes. For every argument in `__init__`, the component must assign it to an attribute with the same name. For example, if you have `def __init__(self, prompt: str)`, you must set `self.prompt = prompt` in the class. Otherwise, the serialization logic can't find the value to serialize and either raises an error or falls back to the default value if the parameter has one.
:::
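The 1:1 rule can be illustrated with a small sketch. This is not Haystack's actual implementation, just a toy version of the lookup it performs: for every `__init__` parameter, read an instance attribute of the same name, falling back to the parameter's default.

```python
import inspect

def sketch_to_dict(obj):
    # Illustrative sketch of the 1:1 rule, not Haystack's actual code:
    # for every __init__ parameter, look up an attribute with the same name.
    init_parameters = {}
    signature = inspect.signature(type(obj).__init__)
    for name, param in signature.parameters.items():
        if name == "self":
            continue
        if hasattr(obj, name):
            init_parameters[name] = getattr(obj, name)
        elif param.default is not inspect.Parameter.empty:
            init_parameters[name] = param.default
        else:
            raise ValueError(f"cannot serialize {name!r}: no matching attribute")
    return {"type": type(obj).__name__, "init_parameters": init_parameters}

class Good:
    def __init__(self, prompt: str):
        self.prompt = prompt  # attribute name matches the init parameter

class Bad:
    def __init__(self, prompt: str):
        self.text = prompt  # renamed attribute: the value can't be recovered

print(sketch_to_dict(Good("hi")))
# {'type': 'Good', 'init_parameters': {'prompt': 'hi'}}

try:
    sketch_to_dict(Bad("hi"))
except ValueError as exc:
    print(exc)  # cannot serialize 'prompt': no matching attribute
```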
Performing Custom Serialization
Pipelines and components in Haystack can serialize simple components, including custom ones, out of the box. Code like this just works:
```python
from haystack import component

@component
class RepeatWordComponent:
    def __init__(self, times: int):
        self.times = times

    @component.output_types(result=str)
    def run(self, word: str):
        return {"result": word * self.times}
```
On the other hand, this code doesn't work if the final format is JSON, as the set type is not JSON-serializable:
```python
from haystack import component

@component
class SetIntersector:
    def __init__(self, intersect_with: set):
        self.intersect_with = intersect_with

    @component.output_types(result=set)
    def run(self, data: set):
        return {"result": data.intersection(self.intersect_with)}
```
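The failure is easy to reproduce with the standard library alone: the json encoder refuses to encode a set.

```python
import json

error_message = None
try:
    # A set in the init parameters can't be encoded as JSON
    json.dumps({"intersect_with": {1, 2, 3}})
except TypeError as exc:
    error_message = str(exc)

print(error_message)  # Object of type set is not JSON serializable
```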
In such cases, you can provide your own implementations of to_dict and from_dict in your component:
```python
from haystack import component, default_from_dict, default_to_dict

@component
class SetIntersector:
    def __init__(self, intersect_with: set):
        self.intersect_with = intersect_with

    @component.output_types(result=set)
    def run(self, data: set):
        return {"result": data.intersection(self.intersect_with)}

    def to_dict(self):
        # Convert the set into a list for the dict representation,
        # so it can be converted to JSON
        return default_to_dict(self, intersect_with=list(self.intersect_with))

    @classmethod
    def from_dict(cls, data):
        # Convert the list back into a set during deserialization
        data["init_parameters"]["intersect_with"] = set(data["init_parameters"]["intersect_with"])
        return default_from_dict(cls, data)
```
Saving a Pipeline to a Custom Format
Once a pipeline is available in its dictionary format, the last step of serialization is to convert that dictionary into a format you can store or send over the wire. Haystack supports YAML out of the box, but if you need a different format, you can write a custom Marshaller.
A Marshaller is a Python class responsible for converting text to a dictionary and a dictionary to text according to a certain format. Marshallers must respect the Marshaller protocol, providing the methods marshal and unmarshal.
This is the code for a custom TOML marshaller that relies on the rtoml library:
```python
## This code requires a `pip install rtoml`
from typing import Dict, Any, Union

import rtoml

class TomlMarshaller:
    def marshal(self, dict_: Dict[str, Any]) -> str:
        return rtoml.dumps(dict_)

    def unmarshal(self, data_: Union[str, bytes]) -> Dict[str, Any]:
        return dict(rtoml.loads(data_))
```
You can then pass a Marshaller instance to the methods dump, dumps, load, and loads:
```python
from haystack import Pipeline
from my_custom_marshallers import TomlMarshaller

pipe = Pipeline()
pipe.dumps(TomlMarshaller())

## Returns:
## 'max_runs_per_component = 100\nconnections = []\n\n[metadata]\n\n[components]\n'
```
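The same protocol can be satisfied with nothing but the standard library. For comparison, here is a minimal JSON marshaller; it's a sketch following the Marshaller protocol, not part of Haystack itself:

```python
import json
from typing import Dict, Any, Union

class JsonMarshaller:
    def marshal(self, dict_: Dict[str, Any]) -> str:
        return json.dumps(dict_)

    def unmarshal(self, data_: Union[str, bytes]) -> Dict[str, Any]:
        return dict(json.loads(data_))

marshaller = JsonMarshaller()
payload = marshaller.marshal({"components": {}, "connections": []})
print(payload)  # {"components": {}, "connections": []}

# Round-trip back to the original dictionary
assert marshaller.unmarshal(payload) == {"components": {}, "connections": []}
```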
Additional References
📓 Tutorial: Serializing LLM Pipelines