Version: 2.30-unstable

PythonCodeSplitter

PythonCodeSplitter splits Python source code documents into syntax-aware chunks. It is designed for Python files and keeps code units such as imports, functions, classes, and methods together where possible.

| | |
| --- | --- |
| Most common position in a pipeline | In indexing pipelines after Converters, before Embedders or DocumentWriter |
| Mandatory run variables | documents: A list of Python source code documents |
| Output variables | documents: A list of Python source code documents split into syntax-aware chunks |
| API reference | PreProcessors |
| GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/python_code_splitter.py |
| Package name | haystack-ai |

Overview

PythonCodeSplitter expects each input document's content to be valid Python source code. It parses the source with Python's ast module and creates ordered split units for:

  • Module docstrings
  • Consecutive import blocks
  • Top-level functions
  • Class headers
  • Methods and nested classes
  • Remaining top-level statements
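
The unit kinds listed above can be approximated with the standard library alone. The sketch below is not the component's implementation: `list_units` and `first_line` are hypothetical helpers, and the labels `module_docstring`, `nested_class`, and `statement` are assumed names (only `imports`, `function`, `class_header`, and `method` appear in this page). It also simplifies by treating everything before the first method or nested class as the class header.

```python
import ast

DEFS = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def first_line(node: ast.stmt) -> int:
    """First source line of a statement, including decorators above it."""
    decorators = getattr(node, "decorator_list", [])
    return min([node.lineno, *(d.lineno for d in decorators)])


def list_units(source: str) -> list[tuple[str, int, int]]:
    """Return (kind, start_line, end_line) tuples in source order."""
    units: list[tuple[str, int, int]] = []
    for index, node in enumerate(ast.parse(source).body):
        start, end = first_line(node), node.end_lineno
        is_module_docstring = (
            index == 0
            and isinstance(node, ast.Expr)
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, str)
        )
        if is_module_docstring:
            units.append(("module_docstring", start, end))
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            # Consecutive imports collapse into a single block.
            if units and units[-1][0] == "imports":
                units[-1] = ("imports", units[-1][1], end)
            else:
                units.append(("imports", start, end))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            units.append(("function", start, end))
        elif isinstance(node, ast.ClassDef):
            members = [m for m in node.body if isinstance(m, DEFS)]
            # Assumption: the header runs up to the first method or nested class.
            header_end = first_line(members[0]) - 1 if members else end
            units.append(("class_header", start, header_end))
            for member in members:
                kind = "nested_class" if isinstance(member, ast.ClassDef) else "method"
                units.append((kind, first_line(member), member.end_lineno))
        else:
            units.append(("statement", start, end))
    return units


SAMPLE = '''"""Math utilities."""
import os
from math import pi


def double(x):
    return 2 * x


class Circle:
    """A circle."""

    def area(self):
        return pi


VALUE = double(2)
'''

for unit in list_units(SAMPLE):
    print(unit)
```

Each tuple carries 1-indexed line numbers, which is the same convention the start_line and end_line metadata fields use.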

The splitter merges these units in source order toward max_effective_lines. Effective lines are calculated from character length with ceil(len(source) / expected_chars_per_line), so long lines count as more than one line.
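
To make the effective-line estimate concrete, here is a minimal sketch of the formula exactly as stated above. The helper name `effective_lines` is hypothetical; the default of 45 is the documented default for expected_chars_per_line.

```python
import math


def effective_lines(source: str, expected_chars_per_line: int = 45) -> int:
    """Estimate effective lines from character length."""
    return math.ceil(len(source) / expected_chars_per_line)


print(effective_lines("x = 1"))    # 5 characters   -> 1
print(effective_lines("a" * 45))   # 45 characters  -> 1
print(effective_lines("a" * 46))   # 46 characters  -> 2
print(effective_lines("a" * 100))  # 100 characters -> 3
```

A single 100-character line therefore counts as three effective lines, which is why dense code fills a chunk faster than its physical line count suggests.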

Functions and methods are kept whole by the primary AST split. If one syntactic unit is larger than oversized_factor * max_effective_lines, the splitter falls back to a line-based secondary split using DocumentSplitter. This oversized fallback is the only case where chunks can overlap; the primary AST split does not add overlap.
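
The oversized threshold can be worked through with the documented defaults (max_effective_lines=100, expected_chars_per_line=45, oversized_factor=3). The helper below is a hypothetical sketch of the comparison described above, not the component's code; it assumes "larger than" means a strict comparison.

```python
import math


def is_oversized(
    unit_source: str,
    max_effective_lines: int = 100,
    expected_chars_per_line: int = 45,
    oversized_factor: int = 3,
) -> bool:
    """Return True if a unit exceeds oversized_factor * max_effective_lines."""
    effective = math.ceil(len(unit_source) / expected_chars_per_line)
    return effective > oversized_factor * max_effective_lines


# With the defaults the threshold is 3 * 100 = 300 effective lines,
# which corresponds to 300 * 45 = 13,500 characters.
print(is_oversized("x" * 13_500))  # exactly 300 effective lines -> False
print(is_oversized("x" * 13_501))  # 301 effective lines -> True
```

Under these assumptions, only a single function, method, or other unit longer than roughly 13,500 characters would reach the line-based fallback.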

By default, preserve_class_definition=True. When a chunk contains class members without the original class header, the splitter prefixes the bare class signature so the chunk still carries the class context.

If strip_docstrings=True, function, method, and class docstrings are removed from chunk content and stored in meta["docstrings"]. Module docstrings stay in the chunk content because they are their own top-level unit.
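
As a rough illustration of which docstrings would end up in meta["docstrings"], the sketch below collects function, method, and class docstrings in source order with the standard `ast` module and leaves the module docstring out. `collect_docstrings` is a hypothetical helper, and it is an assumption that the component stores cleaned docstring text the way `ast.get_docstring` returns it.

```python
import ast


def collect_docstrings(source: str) -> list[str]:
    """Collect function, method, and class docstrings in source order."""
    tree = ast.parse(source)
    definitions = [
        node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    # ast.walk is breadth-first, so sort by position to restore source order.
    definitions.sort(key=lambda node: (node.lineno, node.col_offset))
    docstrings = []
    for node in definitions:
        docstring = ast.get_docstring(node)
        if docstring is not None:
            docstrings.append(docstring)
    return docstrings


SAMPLE = '''"""Example module."""


class Circle:
    """A circle defined by its radius."""

    def area(self):
        """Return the area of the circle."""
        return 0.0
'''

print(collect_docstrings(SAMPLE))
```

The module docstring is skipped because `ast.Module` is not among the node types collected, mirroring the behavior described above.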

Per-chunk metadata

Each output document carries the metadata below. All fields from the parent document's meta (except split_id) are also propagated.

| Field | Description |
| --- | --- |
| source_id | ID of the originating document |
| split_id | Sequential index of this chunk within its source document |
| start_line | First line of the chunk in the original source (1-indexed). Oversized secondary chunks keep the originating unit's range. |
| end_line | Last line of the chunk in the original source (1-indexed). Oversized secondary chunks keep the originating unit's range. |
| unit_kinds | List of syntactic unit kinds included in this chunk, such as imports, function, class_header, or method |
| include_classes | (when applicable) Ordered list of class names whose members appear in this chunk |
| decorators | (when applicable) Ordered list of decorator strings found on included functions, methods, or classes |
| docstrings | (when strip_docstrings=True) List of stripped docstring strings in source order |
| secondary_split | True if this chunk was produced by the oversized fallback splitter |
| secondary_split_index | Index of this piece within the secondary split sequence |
| secondary_split_total | Total number of pieces produced by the secondary split |
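
Because chunk content can differ from the original text (a prefixed class signature, stripped docstrings), start_line and end_line are the reliable way to get back to the original source. The sketch below assumes end_line is inclusive; `lines_for_chunk` and the sample meta dictionary are hypothetical.

```python
def lines_for_chunk(original_source: str, meta: dict) -> str:
    """Return the original lines covered by a chunk (1-indexed, inclusive)."""
    lines = original_source.splitlines()
    return "\n".join(lines[meta["start_line"] - 1 : meta["end_line"]])


original = "a = 1\nb = 2\nc = 3\n"
example_meta = {"start_line": 2, "end_line": 3}  # hypothetical chunk metadata

print(lines_for_chunk(original, example_meta))
```

For chunks with secondary_split set, the range covers the whole originating unit rather than the individual piece, so several pieces map to the same lines.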

Documents with None content raise ValueError, documents with non-string content raise TypeError, and invalid Python source raises SyntaxError. Empty documents are skipped.
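
If you would rather skip unparsable files than let the run fail, one option is to filter inputs before they reach the splitter. This is a sketch using only the standard library; `is_valid_python` is a hypothetical helper, not part of Haystack.

```python
import ast


def is_valid_python(source: object) -> bool:
    """Return True if source is a string that parses as Python code."""
    if not isinstance(source, str):
        return False
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as null bytes on older Python versions.
        return False
    return True


candidates = ["x = 1\n", "def broken(:\n", None]
print([is_valid_python(candidate) for candidate in candidates])
```

Applying the same check to each document's content before calling run keeps None, non-string, and syntactically invalid inputs out of the indexing pipeline.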

Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| min_effective_lines | int | 20 | Minimum effective lines per chunk. While a chunk is below this value, the splitter keeps merging in the next unit. |
| max_effective_lines | int | 100 | Target effective lines per chunk. Units are merged greedily toward this value. |
| expected_chars_per_line | int | 45 | Character count used to estimate effective lines via ceil(len(source) / expected_chars_per_line). |
| oversized_factor | int | 3 | Multiplier that triggers secondary line-based splitting for oversized syntactic units. |
| strip_docstrings | bool | False | Moves function, method, and class docstrings from content into meta["docstrings"]. |
| preserve_class_definition | bool | True | Prefixes class signatures on chunks that contain class members without the class header. |
| secondary_split_overlap | int | 5 | Line overlap used only by the oversized secondary split. |
| secondary_split_length | int or None | None | Line length for the oversized secondary split. Defaults to max_effective_lines when None. |

Usage

On its own

python
import textwrap

from haystack import Document
from haystack.components.preprocessors import PythonCodeSplitter

source = textwrap.dedent(
    '''
    """Math utilities."""
    from math import pi


    class Circle:
        """A circle."""

        def __init__(self, radius: float) -> None:
            self.radius = radius

        def area(self) -> float:
            return pi * self.radius * self.radius
    '''
).lstrip()

splitter = PythonCodeSplitter(
    min_effective_lines=4,
    max_effective_lines=12,
    strip_docstrings=True,
)

result = splitter.run(
    documents=[Document(content=source, meta={"file_name": "geometry.py"})],
)

# Print the original line range and class context of each chunk.
for chunk in result["documents"]:
    print(chunk.meta["start_line"], chunk.meta["end_line"], chunk.meta.get("include_classes"))

With docstring stripping for RAG

Set strip_docstrings=True when docstrings are verbose. The docstring text is moved out of the chunk content into meta["docstrings"], keeping the stored chunk compact. Pass meta_fields_to_embed=["docstrings"] to your embedder so the docstring text still influences retrieval even though it is no longer in the chunk content.

python
from haystack import Document
from haystack.components.preprocessors import PythonCodeSplitter

source = '''
"""Example module."""
from math import pi


class Circle:
    """A circle defined by its radius."""

    def __init__(self, r: float) -> None:
        """Store the radius."""
        self.r = r

    def area(self) -> float:
        """Return the area of the circle."""
        return pi * self.r * self.r
'''

splitter = PythonCodeSplitter(
    min_effective_lines=20,
    max_effective_lines=100,
    strip_docstrings=True,
)
result = splitter.run(documents=[Document(content=source, meta={"file_name": "my_module.py"})])

# Chunk content no longer contains the docstrings; they are in the metadata.
for chunk in result["documents"]:
    print(chunk.content)
    print(chunk.meta.get("docstrings"))

In a pipeline

This pipeline converts Python files to documents, splits them with PythonCodeSplitter, and writes the chunks to an in-memory document store.

python
from pathlib import Path

from haystack import Pipeline
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.preprocessors import PythonCodeSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

p = Pipeline()
p.add_component("converter", TextFileToDocument())
p.add_component("splitter", PythonCodeSplitter(max_effective_lines=80))
p.add_component("writer", DocumentWriter(document_store=document_store))

p.connect("converter.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")

files = list(Path("path/to/your/project").glob("**/*.py"))
p.run({"converter": {"sources": files}})