# MarkdownHeaderSplitter
Split documents at ATX-style Markdown headers (`#`, `##`, and so on), with optional secondary splitting. Header hierarchy is preserved as metadata on each chunk.
|   |   |
| --- | --- |
| **Most common position in a pipeline** | In indexing pipelines, after Converters and DocumentCleaner |
| **Mandatory run variables** | `documents`: A list of text documents to split |
| **Output variables** | `documents`: A list of documents split at headers (and, optionally, by the secondary split) |
| **API reference** | PreProcessors |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/markdown_header_splitter.py |
## Overview
The MarkdownHeaderSplitter processes text documents by:
- Splitting them into chunks at ATX-style Markdown headers (`#`, `##`, …, `######`), preserving the header hierarchy as metadata.
- Optionally applying a secondary split (by word, passage, period, or line) to each chunk, using Haystack's `DocumentSplitter`.
- Preserving and propagating metadata such as parent headers, page numbers, and split IDs.
Only ATX-style headers are recognized (for example, `# Title`). Setext-style headers (underlined with `===`) aren't supported.
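For example, only the first of these two documents contains a header the splitter would recognize (illustrative strings; behavior as described above):

```python
from haystack import Document

# ATX-style header: recognized as a split point
atx_doc = Document(content="# Installation\nRun the installer.")

# Setext-style header: not recognized; the "====" underline stays in the body text
setext_doc = Document(content="Installation\n============\nRun the installer.")
```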
Parameters you can set when initializing the component:
- `page_break_character`: Character used to identify page breaks. Defaults to the form feed character `\f`.
- `keep_headers`: If `True`, headers remain in the chunk content. If `False`, headers are moved to metadata only. Defaults to `True`.
- `secondary_split`: Optional secondary split applied after header splitting. Options: `None`, `"word"`, `"passage"`, `"period"`, `"line"`. Defaults to `None`.
- `split_length`: Maximum number of units per split when using secondary splitting. Defaults to `200`.
- `split_overlap`: Number of overlapping units between splits when using secondary splitting. Defaults to `0`.
- `split_threshold`: Minimum number of units per split when using secondary splitting. Defaults to `0`.
- `skip_empty_documents`: Whether to skip documents with empty content. Defaults to `True`.
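For example, here's an initialization sketch combining several of these parameters (the values are illustrative, not recommendations):

```python
from haystack.components.preprocessors import MarkdownHeaderSplitter

splitter = MarkdownHeaderSplitter(
    keep_headers=False,         # move headers into metadata only
    secondary_split="passage",  # further split each header section by passage
    split_length=150,
    split_overlap=10,
    split_threshold=5,
    skip_empty_documents=True,
)
splitter.warm_up()  # needed whenever secondary_split is set
```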
Each output document's metadata includes:
- `source_id`: ID of the original document.
- `page_number`: Page number, updated whenever `page_break_character` is found.
- `split_id`: Index of the chunk within its parent.
- `header`: The header text for this chunk.
- `parent_headers`: List of parent header texts in hierarchy order.
The component only works with text documents. Documents with `None` or non-string content raise a `ValueError`.
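As an illustration, here's a minimal sketch that prints these fields for each resulting chunk; the form feed `\f` in the text marks a page break, matching the default `page_break_character`:

```python
from haystack import Document
from haystack.components.preprocessors import MarkdownHeaderSplitter

text = "# Guide\nIntro text.\n## Setup\nFirst page of setup.\fSecond page of setup."
doc = Document(content=text)

splitter = MarkdownHeaderSplitter()
for chunk in splitter.run(documents=[doc])["documents"]:
    print(
        chunk.meta["source_id"],
        chunk.meta["page_number"],
        chunk.meta["split_id"],
        chunk.meta["header"],
        chunk.meta["parent_headers"],
    )
```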
## Usage
### On its own
```python
from haystack import Document
from haystack.components.preprocessors import MarkdownHeaderSplitter

text = (
    "# Introduction\n"
    "This is the intro section.\n"
    "## Getting Started\n"
    "Here is how to start.\n"
    "## Advanced\n"
    "Advanced content here."
)
doc = Document(content=text)

splitter = MarkdownHeaderSplitter(keep_headers=True)
result = splitter.run(documents=[doc])

# result["documents"] contains one document per header section,
# with meta["header"], meta["parent_headers"], meta["source_id"], and so on
```
### With secondary splitting
When sections are long, you can add a secondary split, for example by word, so each chunk stays within a maximum size:
```python
from haystack import Document
from haystack.components.preprocessors import MarkdownHeaderSplitter

text = "# Section\n" + "Some long body text. " * 50
doc = Document(content=text)

splitter = MarkdownHeaderSplitter(
    keep_headers=True,
    secondary_split="word",
    split_length=20,
    split_overlap=2,
)
splitter.warm_up()  # required when using secondary_split
result = splitter.run(documents=[doc])

# result["documents"] now contains several word-limited chunks of the section,
# each keeping meta["header"] and getting its own meta["split_id"]
```
### In a pipeline
This pipeline converts Markdown files to documents, cleans them, splits by headers, and writes to an in-memory document store:
```python
from pathlib import Path
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner, MarkdownHeaderSplitter
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()

p = Pipeline()
p.add_component("text_file_converter", TextFileToDocument())
p.add_component("cleaner", DocumentCleaner())
p.add_component("splitter", MarkdownHeaderSplitter(keep_headers=True))
p.add_component("writer", DocumentWriter(document_store=document_store))
p.connect("text_file_converter.documents", "cleaner.documents")
p.connect("cleaner.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")

path = "path/to/your/files"
files = list(Path(path).glob("*.md"))
p.run({"text_file_converter": {"sources": files}})
```
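Continuing from the pipeline above, you can read the split documents back out of the store to check the header metadata (a quick sketch; calling `filter_documents()` with no filters on an `InMemoryDocumentStore` returns all stored documents):

```python
for doc in document_store.filter_documents():
    print(doc.meta.get("header"), "->", doc.meta.get("parent_headers"))
```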