PythonCodeSplitter
PythonCodeSplitter splits Python source code documents into syntax-aware chunks. It is designed for Python files and keeps code units such as imports, functions, classes, and methods together where possible.
| Most common position in a pipeline | In indexing pipelines after Converters, before Embedders or DocumentWriter |
| Mandatory run variables | documents: A list of Python source code documents |
| Output variables | documents: A list of Python source code documents split into syntax-aware chunks |
| API reference | PreProcessors |
| GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/python_code_splitter.py |
| Package name | haystack-ai |
Overview
PythonCodeSplitter expects each input document's content to be valid Python source code. It parses the source with Python's ast module and creates ordered split units for:
- Module docstrings
- Consecutive import blocks
- Top-level functions
- Class headers
- Methods and nested classes
- Remaining top-level statements
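To make the unit kinds above concrete, here is a minimal sketch of how top-level nodes could be classified with the ast module. It illustrates the idea and is not the component's implementation: the helpers list_units and _start are hypothetical, the labels module_docstring, nested_class, and statement are assumed names, and statements in a class body that come after the first member are ignored.

```python
import ast


def _start(node: ast.AST) -> int:
    """First line of a node, including any decorators written above it."""
    decorators = getattr(node, "decorator_list", [])
    return min([node.lineno] + [d.lineno for d in decorators])


def list_units(source: str) -> list[tuple[str, int, int]]:
    """Return (kind, start_line, end_line) tuples in source order (1-indexed)."""
    units: list[tuple[str, int, int]] = []
    for index, node in enumerate(ast.parse(source).body):
        is_docstring = (
            index == 0
            and isinstance(node, ast.Expr)
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, str)
        )
        if is_docstring:
            units.append(("module_docstring", node.lineno, node.end_lineno))
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            if units and units[-1][0] == "imports":
                # Extend the current block of consecutive imports.
                units[-1] = ("imports", units[-1][1], node.end_lineno)
            else:
                units.append(("imports", node.lineno, node.end_lineno))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            units.append(("function", _start(node), node.end_lineno))
        elif isinstance(node, ast.ClassDef):
            members = [
                m for m in node.body
                if isinstance(m, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
            ]
            # The header runs up to the line before the first member.
            header_end = _start(members[0]) - 1 if members else node.end_lineno
            units.append(("class_header", _start(node), header_end))
            for member in members:
                kind = "nested_class" if isinstance(member, ast.ClassDef) else "method"
                units.append((kind, _start(member), member.end_lineno))
        else:
            units.append(("statement", node.lineno, node.end_lineno))
    return units


SAMPLE = '''"""Math utilities."""
import os
from math import pi

def helper():
    return 1

class Circle:
    """A circle."""
    def area(self):
        return pi
X = 1
'''

for unit in list_units(SAMPLE):
    print(unit)
```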
The splitter merges these units greedily, in source order, toward a target of max_effective_lines per chunk. Effective lines are estimated from character length as ceil(len(source) / expected_chars_per_line), so a long line counts as more than one effective line.
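The effective-line formula and the merge step can be sketched in a few lines. The effective_lines function follows the formula given above; merge_units is a hypothetical, simplified greedy merge that ignores min_effective_lines and joins units with a newline, so treat it as an approximation of the behavior rather than the component's actual algorithm.

```python
from math import ceil


def effective_lines(source: str, expected_chars_per_line: int = 45) -> int:
    """Estimate line count from character length, as described above."""
    return ceil(len(source) / expected_chars_per_line)


def merge_units(
    units: list[str],
    max_effective_lines: int = 100,
    expected_chars_per_line: int = 45,
) -> list[str]:
    """Greedily pack units, in order, without exceeding the target (simplified)."""
    chunks: list[str] = []
    current: list[str] = []
    for unit in units:
        candidate = "\n".join(current + [unit])
        too_big = effective_lines(candidate, expected_chars_per_line) > max_effective_lines
        if current and too_big:
            # Adding this unit would overshoot, so close the current chunk first.
            chunks.append("\n".join(current))
            current = [unit]
        else:
            current.append(unit)
    if current:
        chunks.append("\n".join(current))
    return chunks


# A 200-character line counts as 5 effective lines at 45 characters per line.
print(effective_lines("x" * 200))
```

Because the estimate is character-based, a file with a few very long lines can produce more chunks than its physical line count suggests.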
Functions and methods are kept whole by the primary AST split. If one syntactic unit is larger than oversized_factor * max_effective_lines, the splitter falls back to a line-based secondary split using DocumentSplitter. This oversized fallback is the only case where chunks can overlap; the primary AST split does not add overlap.
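As a rough illustration of the fallback, the sketch below checks the oversized threshold and then cuts a unit into overlapping line windows. It uses only the standard library and stands in for DocumentSplitter; the function names is_oversized and split_lines_with_overlap are hypothetical, and the exact windowing used by the component may differ.

```python
from math import ceil


def is_oversized(
    unit_source: str,
    max_effective_lines: int = 100,
    oversized_factor: int = 3,
    expected_chars_per_line: int = 45,
) -> bool:
    """True when a single unit exceeds oversized_factor * max_effective_lines."""
    lines = ceil(len(unit_source) / expected_chars_per_line)
    return lines > oversized_factor * max_effective_lines


def split_lines_with_overlap(
    unit_source: str, split_length: int = 100, split_overlap: int = 5
) -> list[str]:
    """Cut a unit into windows of split_length lines that share split_overlap lines."""
    step = split_length - split_overlap
    if step <= 0:
        raise ValueError("split_overlap must be smaller than split_length")
    lines = unit_source.splitlines(keepends=True)
    pieces: list[str] = []
    for start in range(0, len(lines), step):
        pieces.append("".join(lines[start:start + split_length]))
        if start + split_length >= len(lines):
            break  # the last window already reaches the end of the unit
    return pieces


big_function = "".join(f"x{i} = {i}\n" for i in range(250))
print(len(split_lines_with_overlap(big_function)))
```

With the defaults, a unit has to exceed 300 effective lines (more than 13,500 characters) before this path is taken, so most functions stay whole.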
By default, preserve_class_definition=True. When a chunk contains class members without the original class header, the splitter prefixes the bare class signature so the chunk still carries the class context.
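The effect of prefixing a class signature can be sketched as follows. This is an assumption-laden illustration: class_signature and with_class_context are hypothetical helpers, and the sketch assumes the signature fits on a single line and does not include decorators.

```python
import ast


def class_signature(source: str, class_name: str) -> str:
    """Return the `class ...:` line for class_name (single-line signatures only)."""
    lines = source.splitlines()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            return lines[node.lineno - 1]
    raise KeyError(class_name)


def with_class_context(source: str, class_name: str, member_chunk: str) -> str:
    """Prefix a chunk of class members with the bare class signature."""
    return f"{class_signature(source, class_name)}\n{member_chunk}"


SOURCE = (
    "class Circle(Shape):\n"
    "    def area(self):\n"
    "        return 1\n"
    "\n"
    "    def perimeter(self):\n"
    "        return 2\n"
)
# A chunk that holds a method but not the class header.
orphan_chunk = "    def perimeter(self):\n        return 2\n"
print(with_class_context(SOURCE, "Circle", orphan_chunk))
```

A chunk that starts with the signature reads as a small, syntactically complete class, which tends to help both retrieval and any model that later reads the chunk.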
If strip_docstrings=True, function, method, and class docstrings are removed from chunk content and stored in meta["docstrings"]. Module docstrings stay in the chunk content because they are their own top-level unit.
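A minimal sketch of docstring stripping with the ast module is shown below. The helper strip_docstrings is hypothetical and makes two simplifying assumptions that may not match the component: each docstring occupies its own lines, and a docstring that is the only statement in a body is left in place so the code stays valid.

```python
import ast


def strip_docstrings(source: str) -> tuple[str, list[str]]:
    """Remove function, method, and class docstrings; return (code, docstrings)."""
    tree = ast.parse(source)
    owners = [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    owners.sort(key=lambda node: node.lineno)  # ast.walk is not in source order

    docstrings: list[str] = []
    lines_to_drop: set[int] = set()
    for node in owners:
        text = ast.get_docstring(node)
        if text is None or len(node.body) == 1:
            continue  # nothing to strip, or stripping would leave an empty body
        first = node.body[0]
        docstrings.append(text)
        lines_to_drop.update(range(first.lineno, first.end_lineno + 1))

    kept = [
        line for number, line in enumerate(source.splitlines(), start=1)
        if number not in lines_to_drop
    ]
    return "\n".join(kept), docstrings


EXAMPLE = (
    '"""Module doc."""\n'
    "class Circle:\n"
    '    """A circle."""\n'
    "    def area(self):\n"
    '        """Return the area."""\n'
    "        return 1\n"
)
code, docs = strip_docstrings(EXAMPLE)
print(code)
print(docs)
```

The module docstring is untouched because only functions and classes are visited, which mirrors the behavior described above.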
Per-chunk metadata
Each output document carries the metadata below. All fields from the parent document's meta (except split_id) are also propagated.
| Field | Description |
|---|---|
| source_id | ID of the originating document |
| split_id | Sequential index of this chunk within its source document |
| start_line | First line of the chunk in the original source (1-indexed). Oversized secondary chunks keep the originating unit's range. |
| end_line | Last line of the chunk in the original source (1-indexed). Oversized secondary chunks keep the originating unit's range. |
| unit_kinds | List of syntactic unit kinds included in this chunk, such as imports, function, class_header, or method |
| include_classes | (when applicable) Ordered list of class names whose members appear in this chunk |
| decorators | (when applicable) Ordered list of decorator strings found on included functions, methods, or classes |
| docstrings | (when strip_docstrings=True) List of stripped docstring strings in source order |
| secondary_split | True if this chunk was produced by the oversized fallback splitter |
| secondary_split_index | Index of this piece within the secondary split sequence |
| secondary_split_total | Total number of pieces produced by the secondary split |
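One way to use these fields is to build a short location label for each chunk, for example when showing retrieved code to a user. The sketch below works on a plain meta dictionary; describe_chunk is a hypothetical helper, the example values are made up for illustration, and file_name is assumed to come from the parent document's meta.

```python
def describe_chunk(meta: dict) -> str:
    """Build a label such as 'geometry.py:3-9 (Circle)' from chunk metadata."""
    origin = meta.get("file_name", meta["source_id"])
    label = f"{origin}:{meta['start_line']}-{meta['end_line']}"
    classes = meta.get("include_classes")
    if classes:
        label += f" ({', '.join(classes)})"
    if meta.get("secondary_split"):
        # Pieces of an oversized unit share the unit's line range, so add the piece index.
        label += f" [piece {meta['secondary_split_index']} of {meta['secondary_split_total']}]"
    return label


# Illustrative metadata, not output captured from the component.
example_meta = {
    "source_id": "abc123",
    "file_name": "geometry.py",
    "start_line": 3,
    "end_line": 9,
    "include_classes": ["Circle"],
}
print(describe_chunk(example_meta))
```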
Documents with None content raise ValueError, documents with non-string content raise TypeError, and invalid Python source raises SyntaxError. Empty documents are skipped.
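Because invalid Python source raises SyntaxError, it can be useful to check files before they reach the splitter. The following sketch pre-parses sources with ast.parse and sets aside the ones that fail; partition_parsable is a hypothetical helper, and the file names are illustrative.

```python
import ast


def partition_parsable(sources: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Split {name: source} into parsable sources and {name: error message}."""
    valid: dict[str, str] = {}
    rejected: dict[str, str] = {}
    for name, text in sources.items():
        try:
            ast.parse(text)
        except (SyntaxError, ValueError) as error:
            # Some Python versions raise ValueError for sources containing null bytes.
            rejected[name] = str(error)
        else:
            valid[name] = text
    return valid, rejected


valid, rejected = partition_parsable(
    {"ok.py": "x = 1\n", "bad.py": "def broken(:\n"}
)
print(sorted(valid), sorted(rejected))
```

Only the sources in valid would then be wrapped in Document objects and passed to the splitter, while rejected can be logged for follow-up.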
Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| min_effective_lines | int | 20 | Minimum effective lines per chunk. While a chunk is below this value, the splitter keeps merging in the next unit. |
| max_effective_lines | int | 100 | Target effective lines per chunk. Units are merged greedily toward this value. |
| expected_chars_per_line | int | 45 | Character count used to estimate effective lines via ceil(len(source) / expected_chars_per_line). |
| oversized_factor | int | 3 | Multiplier that triggers secondary line-based splitting for oversized syntactic units. |
| strip_docstrings | bool | False | Moves function, method, and class docstrings from content into meta["docstrings"]. |
| preserve_class_definition | bool | True | Prefixes class signatures on chunks that contain class members without the class header. |
| secondary_split_overlap | int | 5 | Line overlap used only by the oversized secondary split. |
| secondary_split_length | int \| None | None | Line length for the oversized secondary split. Defaults to max_effective_lines when None. |
Usage
On its own
import textwrap

from haystack import Document
from haystack.components.preprocessors import PythonCodeSplitter

# Sample module to split; dedent removes the indentation added for readability.
source = textwrap.dedent(
    '''
    """Math utilities."""

    from math import pi

    class Circle:
        """A circle."""

        def __init__(self, radius: float) -> None:
            self.radius = radius

        def area(self) -> float:
            return pi * self.radius * self.radius
    '''
).lstrip()

splitter = PythonCodeSplitter(
    min_effective_lines=4,
    max_effective_lines=12,
    strip_docstrings=True,
)
result = splitter.run(
    documents=[Document(content=source, meta={"file_name": "geometry.py"})],
)

# Print the source line range and any enclosing classes for each chunk.
for chunk in result["documents"]:
    print(chunk.meta["start_line"], chunk.meta["end_line"], chunk.meta.get("include_classes"))
With docstring stripping for RAG
Set strip_docstrings=True when docstrings are verbose. The docstring text is moved out of the chunk content into meta["docstrings"], keeping the stored chunk compact. Pass meta_fields_to_embed=["docstrings"] to your embedder so the docstring text still influences retrieval even though it is no longer in the chunk content.
from haystack import Document
from haystack.components.preprocessors import PythonCodeSplitter

source = '''
"""Example module."""

from math import pi

class Circle:
    """A circle defined by its radius."""

    def __init__(self, r: float) -> None:
        """Store the radius."""
        self.r = r

    def area(self) -> float:
        """Return the area of the circle."""
        return pi * self.r * self.r
'''

splitter = PythonCodeSplitter(
    min_effective_lines=20,
    max_effective_lines=100,
    strip_docstrings=True,
)
result = splitter.run(documents=[Document(content=source, meta={"file_name": "my_module.py"})])

# The chunk content no longer contains the stripped docstrings; they are in meta.
for chunk in result["documents"]:
    print(chunk.content)
    print(chunk.meta.get("docstrings"))
In a pipeline
This pipeline converts Python files to documents, splits them with PythonCodeSplitter, and writes the chunks to an in-memory document store.
from pathlib import Path
from haystack import Pipeline
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.preprocessors import PythonCodeSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
document_store = InMemoryDocumentStore()
p = Pipeline()
p.add_component("converter", TextFileToDocument())
p.add_component("splitter", PythonCodeSplitter(max_effective_lines=80))
p.add_component("writer", DocumentWriter(document_store=document_store))
p.connect("converter.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")
files = list(Path("path/to/your/project").glob("**/*.py"))
p.run({"converter": {"sources": files}})