HanLP integration for Haystack
Module haystack_integrations.components.preprocessors.hanlp.chinese_document_splitter
ChineseDocumentSplitter
A DocumentSplitter for Chinese text.
The granularity parameter controls Chinese word segmentation: 'coarse' applies coarse-grained segmentation and 'fine' applies fine-grained segmentation. The default is coarse-grained segmentation.
Unlike English where words are usually separated by spaces, Chinese text is written continuously without spaces between words. Chinese words can consist of multiple characters. For example, the English word "America" is translated to "美国" in Chinese, which consists of two characters but is treated as a single word. Similarly, "Portugal" is "葡萄牙" in Chinese, three characters but one word. Therefore, splitting by word means splitting by these multi-character tokens, not simply by single characters or spaces.
Usage example
from haystack import Document
from haystack_integrations.components.preprocessors.hanlp import ChineseDocumentSplitter

doc = Document(content=
    "这是第一句话,这是第二句话,这是第三句话。"
    "这是第四句话,这是第五句话,这是第六句话!"
    "这是第七句话,这是第八句话,这是第九句话?"
)

splitter = ChineseDocumentSplitter(
    split_by="word", split_length=10, split_overlap=3, respect_sentence_boundary=True
)
splitter.warm_up()
result = splitter.run(documents=[doc])
print(result["documents"])
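The same component can be configured for fine-grained segmentation via the granularity parameter. A minimal sketch (the exact tokens produced depend on the HanLP segmentation model):

from haystack import Document
from haystack_integrations.components.preprocessors.hanlp import ChineseDocumentSplitter

# Fine-grained segmentation tends to produce shorter word units than the default coarse mode.
fine_splitter = ChineseDocumentSplitter(split_by="word", split_length=10, granularity="fine")
fine_splitter.warm_up()

result = fine_splitter.run(documents=[Document(content="这是第一句话,这是第二句话,这是第三句话。")])
print([d.content for d in result["documents"]])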
ChineseDocumentSplitter.__init__
def __init__(split_by: Literal["word", "sentence", "passage", "page", "line",
                               "period", "function"] = "word",
             split_length: int = 1000,
             split_overlap: int = 200,
             split_threshold: int = 0,
             respect_sentence_boundary: bool = False,
             splitting_function: Optional[Callable] = None,
             granularity: Literal["coarse", "fine"] = "coarse")
Initialize the ChineseDocumentSplitter component.
Arguments:
split_by
: The unit for splitting your documents. Choose from: "word" for splitting by words (for Chinese text, words are multi-character tokens detected with HanLP word segmentation, not space-separated units), "period" for splitting by periods ("."), "page" for splitting by form feed ("\f"), "passage" for splitting by double line breaks ("\n\n"), "line" for splitting each line ("\n"), "sentence" for splitting by the HanLP sentence tokenizer.
split_length
: The maximum number of units in each split.
split_overlap
: The number of overlapping units for each split.
split_threshold
: The minimum number of units per split. If a split has fewer units than the threshold, it's attached to the previous split.
respect_sentence_boundary
: Choose whether to respect sentence boundaries when splitting by "word". If True, uses HanLP to detect sentence boundaries, ensuring splits occur only between sentences.
splitting_function
: Necessary when split_by is set to "function". This is a function which must accept a single str as input and return a list of str as output, representing the chunks after splitting. See the example below.
granularity
: The granularity of Chinese word segmentation, either 'coarse' or 'fine'.
Raises:
ValueError
: If the granularity is not 'coarse' or 'fine'.
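When split_by is set to "function", the splitting_function receives the full text and returns the chunks. A minimal sketch using a hypothetical helper that splits on commas (the helper name and splitting rule are illustrative, not part of the API):

from haystack import Document
from haystack_integrations.components.preprocessors.hanlp import ChineseDocumentSplitter

def split_on_commas(text: str) -> list[str]:
    # Hypothetical helper: cut the text on commas and drop empty chunks.
    return [chunk for chunk in text.split(",") if chunk]

splitter = ChineseDocumentSplitter(split_by="function", splitting_function=split_on_commas)
splitter.warm_up()
result = splitter.run(documents=[Document(content="这是第一句话,这是第二句话,这是第三句话。")])
print([d.content for d in result["documents"]])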
ChineseDocumentSplitter.run
def run(documents: List[Document]) -> Dict[str, List[Document]]
Split documents into smaller chunks.
Arguments:
documents
: The documents to split.
Raises:
RuntimeError
: If the Chinese word segmentation model is not loaded.
Returns:
A dictionary containing the split documents.
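The returned dictionary exposes the split chunks under the "documents" key, each as a Haystack Document. A minimal sketch of inspecting the output, reusing splitter and doc from the usage example above:

result = splitter.run(documents=[doc])
for chunk in result["documents"]:
    # Each chunk is a Document with its own content and metadata.
    print(chunk.content, chunk.meta)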
ChineseDocumentSplitter.warm_up
def warm_up() -> None
Warm up the component by loading the necessary models.
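When the component runs inside a Haystack Pipeline, the pipeline warms up its components before the first execution, so warm_up() does not need to be called manually. A sketch assuming that standard Pipeline behavior:

from haystack import Document, Pipeline
from haystack_integrations.components.preprocessors.hanlp import ChineseDocumentSplitter

pipeline = Pipeline()
pipeline.add_component("splitter", ChineseDocumentSplitter(split_by="word", split_length=10))

# Pipeline.run() warms up components that define warm_up() before executing them.
output = pipeline.run({"splitter": {"documents": [Document(content="这是第一句话,这是第二句话。")]}})
print(output["splitter"]["documents"])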
ChineseDocumentSplitter.chinese_sentence_split
def chinese_sentence_split(text: str) -> List[Dict[str, Any]]
Split Chinese text into sentences.
Arguments:
text
: The text to split.
Returns:
A list of split sentences.
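A minimal usage sketch; each element of the returned list is a dictionary describing one sentence (its exact keys are not documented here):

splitter = ChineseDocumentSplitter()
splitter.warm_up()

sentences = splitter.chinese_sentence_split("这是第一句话。这是第二句话!这是第三句话?")
for sentence in sentences:
    print(sentence)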
ChineseDocumentSplitter.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
ChineseDocumentSplitter.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "ChineseDocumentSplitter"
Deserializes the component from a dictionary.
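to_dict() and from_dict() follow the standard Haystack component serialization protocol, so a configured splitter can be round-tripped, for example when saving and reloading a pipeline. A minimal sketch:

from haystack_integrations.components.preprocessors.hanlp import ChineseDocumentSplitter

splitter = ChineseDocumentSplitter(split_by="word", split_length=10, granularity="fine")

# Serialize the configuration and rebuild an equivalent component from it.
data = splitter.to_dict()
restored = ChineseDocumentSplitter.from_dict(data)
print(restored.to_dict() == data)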