Version: 3.1-unstable

PythonCodeSplitter

`PythonCodeSplitter` splits Python source code documents into syntax-aware chunks, keeping code units such as imports, functions, classes, and methods together where possible.


Most common position in a pipeline	In indexing pipelines after Converters, before Embedders or `DocumentWriter`
Mandatory run variables	`documents`: A list of Python source code documents
Output variables	`documents`: A list of Python source code documents split into syntax-aware chunks
API reference	PreProcessors
GitHub link	https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/python_code_splitter.py
Package name	`haystack-ai`

Overview

`PythonCodeSplitter` expects each input document's content to be valid Python source code. It parses the source with Python's `ast` module and creates ordered split units for:

Module docstrings
Consecutive import blocks
Top-level functions
Class headers
Methods and nested classes
Remaining top-level statements
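One way to picture the top-level part of this breakdown is a small `ast` walk. The sketch below is not the component's implementation: the function name and the labels `module_docstring`, `class`, and `statement` are illustrative assumptions, and it does not descend into class bodies to separate headers from methods.

```python
import ast


def top_level_unit_kinds(source: str) -> list[str]:
    """Label each top-level statement; a run of imports becomes one unit."""
    tree = ast.parse(source)
    kinds: list[str] = []
    for index, node in enumerate(tree.body):
        if (
            index == 0
            and isinstance(node, ast.Expr)
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, str)
        ):
            kind = "module_docstring"  # hypothetical label
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            kind = "imports"
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        elif isinstance(node, ast.ClassDef):
            kind = "class"  # hypothetical label; header and members not separated
        else:
            kind = "statement"  # hypothetical label
        # Collapse consecutive import statements into a single "imports" unit.
        if kind == "imports" and kinds and kinds[-1] == "imports":
            continue
        kinds.append(kind)
    return kinds


example = '"""Doc."""\nimport os\nfrom math import pi\n\nX = 1\n\ndef f():\n    return X\n\nclass C:\n    pass\n'
kinds = top_level_unit_kinds(example)
```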

The splitter merges these units greedily, in source order, until a chunk approaches `max_effective_lines`. Effective lines are estimated from character length as `ceil(len(source) / expected_chars_per_line)`, so a line longer than `expected_chars_per_line` characters counts as more than one line.
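The estimate and the merge can be sketched in plain Python. `effective_lines` follows the formula above. `greedy_merge` and its rule for closing a chunk are assumptions made for illustration, not the component's actual merge logic.

```python
from math import ceil


def effective_lines(source: str, expected_chars_per_line: int = 45) -> int:
    """Estimate the line count from character length."""
    return ceil(len(source) / expected_chars_per_line)


def greedy_merge(units: list[str], min_lines: int = 20, max_lines: int = 100) -> list[str]:
    """Hypothetical rule: keep adding units until the next one would exceed max_lines."""
    chunks: list[str] = []
    current: list[str] = []
    for unit in units:
        candidate = "\n".join(current + [unit])
        # Close the current chunk only if it has already reached min_lines
        # and adding the next unit would push it past max_lines.
        if (
            current
            and effective_lines("\n".join(current)) >= min_lines
            and effective_lines(candidate) > max_lines
        ):
            chunks.append("\n".join(current))
            current = [unit]
        else:
            current.append(unit)
    if current:
        chunks.append("\n".join(current))
    return chunks


# Three units of 440 characters each, merged with small limits.
merged = greedy_merge(["a" * 440] * 3, min_lines=5, max_lines=20)
```

With these inputs the first two units fit into one chunk and the third starts a new one.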

Functions and methods are kept whole by the primary AST split. If a single syntactic unit is larger than `oversized_factor * max_effective_lines`, the splitter falls back to a line-based secondary split using `DocumentSplitter`. This oversized fallback is the only case where chunks can overlap; the primary AST split does not add overlap.
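As a rough illustration of the fallback, the sketch below checks the oversized threshold and then cuts a unit into overlapping line windows. The helper names are hypothetical, and `line_windows` is a generic sliding window; it is not meant to reproduce the exact output of `DocumentSplitter`.

```python
from math import ceil


def is_oversized(
    unit_source: str,
    max_effective_lines: int = 100,
    oversized_factor: int = 3,
    expected_chars_per_line: int = 45,
) -> bool:
    """True when the unit's estimate exceeds oversized_factor * max_effective_lines."""
    estimate = ceil(len(unit_source) / expected_chars_per_line)
    return estimate > oversized_factor * max_effective_lines


def line_windows(unit_source: str, length: int, overlap: int) -> list[str]:
    """Generic sliding window over lines (illustrative only)."""
    if not 0 <= overlap < length:
        raise ValueError("overlap must be non-negative and smaller than length")
    lines = unit_source.splitlines()
    step = length - overlap
    windows: list[str] = []
    for start in range(0, len(lines), step):
        windows.append("\n".join(lines[start : start + length]))
        # Stop once a window has reached the last line.
        if start + length >= len(lines):
            break
    return windows


ten_lines = "\n".join(f"line {i}" for i in range(10))
windows = line_windows(ten_lines, length=4, overlap=1)
```

Each window here shares its first line with the last line of the previous window, which is the kind of overlap the primary AST split never produces.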

By default, `preserve_class_definition=True`. When a chunk contains class members without the original class header, the splitter prefixes the bare class signature so the chunk still carries the class context.
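To make the idea concrete, here is a minimal sketch that rebuilds a class signature from the AST and prefixes it to a single method. The helper names and the sample class are hypothetical, and the exact signature text the component emits (for example, whether base classes are included) is an assumption.

```python
import ast

SOURCE = '''\
class Circle(Shape, metaclass=Registry):
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return 3.14159 * self.radius ** 2
'''


def bare_class_signature(node: ast.ClassDef) -> str:
    """Rebuild `class Name(bases, keywords):` from the AST node."""
    args = [ast.unparse(base) for base in node.bases]
    for keyword in node.keywords:
        value = ast.unparse(keyword.value)
        # keyword.arg is None for `**kwargs`-style class keywords.
        args.append(f"{keyword.arg}={value}" if keyword.arg else f"**{value}")
    suffix = f"({', '.join(args)})" if args else ""
    return f"class {node.name}{suffix}:"


def method_with_class_context(source: str, class_name: str, method_name: str) -> str:
    """Return one method's source, prefixed with its class signature."""
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.ClassDef) and node.name == class_name):
            continue
        for member in node.body:
            if isinstance(member, (ast.FunctionDef, ast.AsyncFunctionDef)) and member.name == method_name:
                # padded=True keeps the original indentation on the first
                # line of a multi-line method.
                body = ast.get_source_segment(source, member, padded=True)
                return f"{bare_class_signature(node)}\n{body}"
    raise ValueError(f"{class_name}.{method_name} not found")


chunk = method_with_class_context(SOURCE, "Circle", "area")
```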

If `strip_docstrings=True`, function, method, and class docstrings are removed from chunk content and stored in `meta["docstrings"]`. Module docstrings stay in the chunk content because they are their own top-level unit.
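The effect can be approximated with the standard library. This sketch is an assumption about the general technique, not the component's code: it only handles docstrings that sit on their own lines, leaves a docstring in place when it is the only statement in a body, and collects the cleaned text returned by `ast.get_docstring`.

```python
import ast


def strip_docstrings(source: str) -> tuple[str, list[str]]:
    """Remove function, method, and class docstrings; keep the module docstring."""
    tree = ast.parse(source)
    lines = source.splitlines()
    docstrings: list[str] = []
    drop: set[int] = set()  # 0-based indexes of lines to remove
    targets = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    nodes = [node for node in ast.walk(tree) if isinstance(node, targets)]
    # Sort by line number so docstrings are collected in source order.
    for node in sorted(nodes, key=lambda n: n.lineno):
        first = node.body[0]
        is_docstring = (
            isinstance(first, ast.Expr)
            and isinstance(first.value, ast.Constant)
            and isinstance(first.value.value, str)
        )
        # Skip one-line definitions and bodies that contain only a docstring:
        # removing those lines would leave invalid code.
        if is_docstring and len(node.body) > 1 and first.lineno > node.lineno:
            docstrings.append(ast.get_docstring(node))
            drop.update(range(first.lineno - 1, first.end_lineno))
    kept = [line for index, line in enumerate(lines) if index not in drop]
    return "\n".join(kept), docstrings


example = '''"""Module doc."""

class Circle:
    """A circle."""

    def area(self):
        """Return the area."""
        return 1
'''
stripped, collected = strip_docstrings(example)
```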

Per-chunk metadata

Each output document carries the metadata below. All fields from the parent document's `meta` (except `split_id`) are also propagated.

Field	Description
`source_id`	ID of the originating document
`split_id`	Sequential index of this chunk within its source document
`start_line`	First line of the chunk in the original source (1-indexed). Oversized secondary chunks keep the originating unit's range.
`end_line`	Last line of the chunk in the original source (1-indexed). Oversized secondary chunks keep the originating unit's range.
`unit_kinds`	List of syntactic unit kinds included in this chunk, such as `imports`, `function`, `class_header`, or `method`
`include_classes`	(when applicable) Ordered list of class names whose members appear in this chunk
`decorators`	(when applicable) Ordered list of decorator strings found on included functions, methods, or classes
`docstrings`	(when `strip_docstrings=True`) List of stripped docstring strings in source order
`secondary_split`	`True` if this chunk was produced by the oversized fallback splitter
`secondary_split_index`	Index of this piece within the secondary split sequence
`secondary_split_total`	Total number of pieces produced by the secondary split
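These fields can be used downstream, for instance to put the pieces of one oversized unit back in order. The sketch below works on plain dictionaries shaped like the metadata above. The sample values are made up, and grouping by `source_id`, `start_line`, and `end_line` assumes that all pieces of one unit share the originating unit's range.

```python
from collections import defaultdict


def group_secondary_pieces(chunk_metas: list[dict]) -> dict[tuple, list[dict]]:
    """Group fallback pieces by originating unit and order them by piece index."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for meta in chunk_metas:
        if meta.get("secondary_split"):
            key = (meta["source_id"], meta["start_line"], meta["end_line"])
            groups[key].append(meta)
    for pieces in groups.values():
        pieces.sort(key=lambda m: m["secondary_split_index"])
    return dict(groups)


# Illustrative metadata only; real values come from the splitter's output.
metas = [
    {"source_id": "doc-1", "start_line": 10, "end_line": 900, "secondary_split": True,
     "secondary_split_index": 1, "secondary_split_total": 2},
    {"source_id": "doc-1", "start_line": 1, "end_line": 9, "unit_kinds": ["imports"]},
    {"source_id": "doc-1", "start_line": 10, "end_line": 900, "secondary_split": True,
     "secondary_split_index": 0, "secondary_split_total": 2},
]
grouped = group_secondary_pieces(metas)
```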

Documents with `None` content raise `ValueError`, documents with non-string content raise `TypeError`, and invalid Python source raises `SyntaxError`. Empty documents are skipped.
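Because one unparsable file raises, it can help to screen sources before they reach the splitter. The following is a standard-library sketch with a hypothetical helper name. Treating whitespace-only content as empty is an assumption, since the exact rule the component uses for "empty" is not stated here.

```python
import ast


def find_problem_sources(sources: dict[str, object]) -> dict[str, str]:
    """Map each file name to the reason it would be rejected or skipped."""
    problems: dict[str, str] = {}
    for name, content in sources.items():
        if content is None:
            problems[name] = "content is None"
        elif not isinstance(content, str):
            problems[name] = "content is not a string"
        elif not content.strip():
            problems[name] = "empty"  # assumption: whitespace-only counts as empty
        else:
            try:
                ast.parse(content)
            except SyntaxError as error:
                problems[name] = f"syntax error on line {error.lineno}"
    return problems


problems = find_problem_sources(
    {
        "ok.py": "x = 1\n",
        "bad.py": "def broken(:\n",
        "none.py": None,
        "bytes.py": b"x = 1",
        "empty.py": "",
    }
)
```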

Configuration

Parameter	Type	Default	Description
`min_effective_lines`	`int`	`20`	Minimum effective lines per chunk. While a chunk is below this value, the splitter keeps merging in the next unit.
`max_effective_lines`	`int`	`100`	Target effective lines per chunk. Units are merged greedily toward this value.
`expected_chars_per_line`	`int`	`45`	Character count used to estimate effective lines via `ceil(len(source) / expected_chars_per_line)`.
`oversized_factor`	`int`	`3`	Multiplier that triggers secondary line-based splitting for oversized syntactic units.
`strip_docstrings`	`bool`	`False`	Moves function, method, and class docstrings from content into `meta["docstrings"]`.
`preserve_class_definition`	`bool`	`True`	Prefixes class signatures on chunks that contain class members without the class header.
`secondary_split_overlap`	`int`	`5`	Line overlap used only by the oversized secondary split.
`secondary_split_length`	`int | None`	`None`	Line length for the oversized secondary split. Defaults to `max_effective_lines` when `None`.
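Since effective lines are derived from character counts, the line-based parameters translate directly into character budgets. The helper below is a hypothetical convenience for reasoning about settings, assuming the `ceil(len(source) / expected_chars_per_line)` estimate; it is not part of the component.

```python
def character_budgets(
    min_effective_lines: int = 20,
    max_effective_lines: int = 100,
    expected_chars_per_line: int = 45,
    oversized_factor: int = 3,
) -> dict[str, int]:
    """Largest character lengths that stay under each threshold."""
    return {
        # Longest text whose estimate is still below min_effective_lines.
        "largest_below_min": (min_effective_lines - 1) * expected_chars_per_line,
        # Longest text whose estimate does not exceed max_effective_lines.
        "largest_within_max": max_effective_lines * expected_chars_per_line,
        # Longest single unit that does not trigger the secondary split.
        "largest_not_oversized": oversized_factor * max_effective_lines * expected_chars_per_line,
    }


defaults = character_budgets()
```

With the default values this gives 855, 4500, and 13500 characters respectively.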

Usage

On its own

```python
import textwrap

from haystack import Document
from haystack.components.preprocessors import PythonCodeSplitter

source = textwrap.dedent(
    '''
    """Math utilities."""
    from math import pi


    class Circle:
        """A circle."""

        def __init__(self, radius: float) -> None:
            self.radius = radius

        def area(self) -> float:
            return pi * self.radius * self.radius
    '''
).lstrip()

splitter = PythonCodeSplitter(
    min_effective_lines=4,
    max_effective_lines=12,
    strip_docstrings=True,
)

result = splitter.run(
    documents=[Document(content=source, meta={"file_name": "geometry.py"})],
)

for chunk in result["documents"]:
    print(chunk.meta["start_line"], chunk.meta["end_line"], chunk.meta.get("include_classes"))
```

With docstring stripping for RAG

Set `strip_docstrings=True` when docstrings are verbose. The docstring text is moved out of the chunk content into `meta["docstrings"]`, keeping the stored chunk compact. Pass `meta_fields_to_embed=["docstrings"]` to your embedder so the docstring text still influences retrieval even though it is no longer in the chunk content.

```python
from haystack import Document
from haystack.components.preprocessors import PythonCodeSplitter

source = '''
"""Example module."""
from math import pi


class Circle:
    """A circle defined by its radius."""

    def __init__(self, r: float) -> None:
        """Store the radius."""
        self.r = r

    def area(self) -> float:
        """Return the area of the circle."""
        return pi * self.r * self.r
'''

splitter = PythonCodeSplitter(
    min_effective_lines=20,
    max_effective_lines=100,
    strip_docstrings=True,
)
result = splitter.run(documents=[Document(content=source, meta={"file_name": "my_module.py"})])
for chunk in result["documents"]:
    print(chunk.content)
    print(chunk.meta.get("docstrings"))
```

In a pipeline

This pipeline converts Python files to documents, splits them with `PythonCodeSplitter`, and writes the chunks to an in-memory document store.

```python
from pathlib import Path

from haystack import Pipeline
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.preprocessors import PythonCodeSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

p = Pipeline()
p.add_component("converter", TextFileToDocument())
p.add_component("splitter", PythonCodeSplitter(max_effective_lines=80))
p.add_component("writer", DocumentWriter(document_store=document_store))

p.connect("converter.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")

files = list(Path("path/to/your/project").glob("**/*.py"))
p.run({"converter": {"sources": files}})
```
