
MarkdownHeaderSplitter

Split documents at ATX-style Markdown headers (#, ##, and so on), with optional secondary splitting. Header hierarchy is preserved as metadata on each chunk.

Most common position in a pipeline: In indexing pipelines, after Converters and DocumentCleaner
Mandatory run variables: documents (a list of text documents to split)
Output variables: documents (a list of documents split at headers, and optionally by the secondary split)
API reference: PreProcessors
GitHub link: https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/markdown_header_splitter.py

Overview

The MarkdownHeaderSplitter processes text documents by:

  • Splitting them into chunks at ATX-style Markdown headers (#, ##, …, ######), preserving header hierarchy as metadata.
  • Optionally applying a secondary split (by word, passage, period, or line) to each chunk using Haystack's DocumentSplitter.
  • Preserving and propagating metadata such as parent headers, page numbers, and split IDs.

Only ATX-style headers are recognized (for example, # Title). Setext-style headers (underlined with === or ---) aren't supported.
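
To see the difference, you can run both header styles through the splitter and inspect the resulting chunks. A minimal sketch; since setext headers aren't recognized, the setext document isn't split at its title:

python
from haystack import Document
from haystack.components.preprocessors import MarkdownHeaderSplitter

samples = {
    "atx": "# Title\nBody under an ATX header.",
    "setext": "Title\n=====\nBody under a setext title.",
}
splitter = MarkdownHeaderSplitter()
for name, text in samples.items():
    chunks = splitter.run(documents=[Document(content=text)])["documents"]
    print(name, [chunk.meta.get("header") for chunk in chunks])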

Parameters you can set when initializing the component:

  • page_break_character: Character used to identify page breaks. Defaults to form feed \f.
  • keep_headers: If True, headers remain in the chunk content. If False, headers are moved to metadata only. Defaults to True.
  • secondary_split: Optional secondary split after header splitting. Options: None, "word", "passage", "period", "line". Defaults to None.
  • split_length: Maximum number of units per split when using secondary splitting. Defaults to 200.
  • split_overlap: Number of overlapping units between splits when using secondary splitting. Defaults to 0.
  • split_threshold: Minimum number of units per split when using secondary splitting. Defaults to 0.
  • skip_empty_documents: Whether to skip documents with empty content. Defaults to True.
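
As a quick sketch, here is an initialization that combines several of these parameters (the values are illustrative, not recommendations):

python
from haystack.components.preprocessors import MarkdownHeaderSplitter

splitter = MarkdownHeaderSplitter(
    page_break_character="\f",   # default: form feed marks a page break
    keep_headers=False,          # keep headers in metadata only, not in content
    secondary_split="word",      # further split each section by word
    split_length=150,            # at most 150 words per chunk
    split_overlap=10,            # consecutive chunks share 10 words
    skip_empty_documents=True,   # drop documents with empty content
)
splitter.warm_up()  # required because secondary_split is set (see Usage below)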

Each output document's metadata includes:

  • source_id: ID of the original document.
  • page_number: The page number of the chunk, updated whenever page_break_character is encountered.
  • split_id: Index of the chunk within its parent.
  • header: The header text for this chunk.
  • parent_headers: List of parent header texts in hierarchy order.
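
A minimal sketch for inspecting these fields; the exact values depend on your input:

python
from haystack import Document
from haystack.components.preprocessors import MarkdownHeaderSplitter

doc = Document(content="# A\nIntro.\n## B\nNested body.\f## C\nAfter a page break.")
for chunk in MarkdownHeaderSplitter().run(documents=[doc])["documents"]:
    print(chunk.meta.get("header"), chunk.meta.get("parent_headers"), chunk.meta.get("page_number"))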

The component works only with text documents. A document whose content is None or not a string raises a ValueError.
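
If your converters can produce non-text documents, one option is to filter them out before splitting. A hedged sketch (this helper is not part of Haystack):

python
from haystack import Document

def only_text(documents: list[Document]) -> list[Document]:
    # Keep only documents whose content is a string;
    # anything else would make the splitter raise a ValueError.
    return [d for d in documents if isinstance(d.content, str)]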

Usage

On its own

python
from haystack import Document
from haystack.components.preprocessors import MarkdownHeaderSplitter

text = (
    "# Introduction\n"
    "This is the intro section.\n"
    "## Getting Started\n"
    "Here is how to start.\n"
    "## Advanced\n"
    "Advanced content here."
)
doc = Document(content=text)
splitter = MarkdownHeaderSplitter(keep_headers=True)
result = splitter.run(documents=[doc])

# result["documents"] contains one document per header section,
# with meta["header"], meta["parent_headers"], meta["source_id"], and so on
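
Continuing the example, keep_headers=False removes the header lines from chunk content while keeping them in metadata, as described in the Overview:

python
splitter = MarkdownHeaderSplitter(keep_headers=False)
result = splitter.run(documents=[doc])

# Chunk content no longer includes lines like "# Introduction";
# the header text is still available through meta["header"].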

With secondary splitting

When sections are long, you can add a secondary split, for example by word, so each chunk stays within a maximum size:

python
from haystack import Document
from haystack.components.preprocessors import MarkdownHeaderSplitter

text = "# Section\n" + "Some long body text. " * 50
doc = Document(content=text)
splitter = MarkdownHeaderSplitter(
    keep_headers=True,
    secondary_split="word",
    split_length=20,
    split_overlap=2,
)
splitter.warm_up()  # required when using secondary_split
result = splitter.run(documents=[doc])
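
With split_length=20 and split_overlap=2, each section is divided into chunks of at most 20 words, and each chunk after the first repeats the last 2 words of the previous one.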

In a pipeline

This pipeline converts Markdown files to documents, cleans them, splits by headers, and writes to an in-memory document store:

python
from pathlib import Path

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner, MarkdownHeaderSplitter
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()

p = Pipeline()
p.add_component("text_file_converter", TextFileToDocument())
p.add_component("splitter", MarkdownHeaderSplitter(keep_headers=True))
p.add_component("writer", DocumentWriter(document_store=document_store))
p.connect("text_file_converter.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")

path = "path/to/your/files"
files = list(Path(path).glob("*.md"))
p.run({"text_file_converter": {"sources": files}})
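
To check that the chunks were written, you can count the documents in the store (count_documents is part of the document store interface):

python
print(document_store.count_documents())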