API Reference

HanLP

HanLP integration for Haystack

Module haystack_integrations.components.preprocessors.hanlp.chinese_document_splitter

ChineseDocumentSplitter

A DocumentSplitter for Chinese text.

The granularity parameter controls HanLP's Chinese word segmentation: 'coarse' applies coarse-grained segmentation, 'fine' applies fine-grained segmentation, and the default is 'coarse'.

Unlike English where words are usually separated by spaces, Chinese text is written continuously without spaces between words. Chinese words can consist of multiple characters. For example, the English word "America" is translated to "įžŽå›Ŋ" in Chinese, which consists of two characters but is treated as a single word. Similarly, "Portugal" is "č‘Ąč„į‰™" in Chinese, three characters but one word. Therefore, splitting by word means splitting by these multi-character tokens, not simply by single characters or spaces.
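For instance, the granularity setting changes which word boundaries HanLP produces, and therefore how much text fits into a chunk of a given split_length. A minimal sketch; the comments about token boundaries are illustrative, since the exact segmentation depends on the HanLP models:

from haystack import Document
from haystack_integrations.components.preprocessors.hanlp.chinese_document_splitter import ChineseDocumentSplitter

doc = Document(content="äēēåˇĨæ™ēčƒŊæ­Ŗåœ¨æ”šå˜æˆ‘äģŦįš„į”Ÿæ´ģ。")

# Coarse segmentation tends to keep compounds such as "äēēåˇĨæ™ēčƒŊ" as a single word.
coarse = ChineseDocumentSplitter(split_by="word", split_length=5, granularity="coarse")
coarse.warm_up()
print(coarse.run(documents=[doc])["documents"])

# Fine segmentation tends to break compounds apart ("äēēåˇĨ", "æ™ēčƒŊ"), so the same
# split_length of 5 words covers less text per chunk.
fine = ChineseDocumentSplitter(split_by="word", split_length=5, granularity="fine")
fine.warm_up()
print(fine.run(documents=[doc])["documents"])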

Usage example

from haystack import Document
from haystack_integrations.components.preprocessors.hanlp.chinese_document_splitter import ChineseDocumentSplitter

doc = Document(content=
    "čŋ™æ˜¯įŦŦ一åĨč¯īŧŒčŋ™æ˜¯įŦŦäēŒåĨč¯īŧŒčŋ™æ˜¯įŦŦ三åĨč¯ã€‚"
    "čŋ™æ˜¯įŦŦ四åĨč¯īŧŒčŋ™æ˜¯įŦŦäē”åĨč¯īŧŒčŋ™æ˜¯įŦŦ六åĨč¯īŧ"
    "čŋ™æ˜¯įŦŦ七åĨč¯īŧŒčŋ™æ˜¯įŦŦå…ĢåĨč¯īŧŒčŋ™æ˜¯įŦŦ䚝åĨč¯īŧŸ"
)

splitter = ChineseDocumentSplitter(
    split_by="word", split_length=10, split_overlap=3, respect_sentence_boundary=True
)
splitter.warm_up()
result = splitter.run(documents=[doc])
print(result["documents"])

ChineseDocumentSplitter.__init__

def __init__(split_by: Literal["word", "sentence", "passage", "page", "line",
                               "period", "function"] = "word",
             split_length: int = 1000,
             split_overlap: int = 200,
             split_threshold: int = 0,
             respect_sentence_boundary: bool = False,
             splitting_function: Optional[Callable] = None,
             granularity: Literal["coarse", "fine"] = "coarse")

Initialize the ChineseDocumentSplitter component.

Arguments:

  • split_by: The unit for splitting your documents. Choose from:
  • word for splitting by spaces (" ")
  • period for splitting by periods (".")
  • page for splitting by form feed ("\f")
  • passage for splitting by double line breaks ("\n\n")
  • line for splitting each line ("\n")
  • sentence for splitting by HanLP sentence tokenizer
  • split_length: The maximum number of units in each split.
  • split_overlap: The number of overlapping units for each split.
  • split_threshold: The minimum number of units per split. If a split has fewer units than the threshold, it's attached to the previous split.
  • respect_sentence_boundary: Choose whether to respect sentence boundaries when splitting by "word". If True, uses HanLP to detect sentence boundaries, ensuring splits occur only between sentences.
  • splitting_function: Necessary when split_by is set to "function". This is a function which must accept a single str as input and return a list of str as output, representing the chunks after splitting.
  • granularity: The granularity of Chinese word segmentation, either 'coarse' or 'fine'.

Raises:

  • ValueError: If the granularity is not 'coarse' or 'fine'.

ChineseDocumentSplitter.run

def run(documents: List[Document]) -> Dict[str, List[Document]]

Split documents into smaller chunks.

Arguments:

  • documents: The documents to split.

Raises:

  • RuntimeError: If the Chinese word segmentation model is not loaded.

Returns:

A dictionary with a "documents" key containing the split documents.
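A small sketch of consuming the return value, continuing from the usage example above:

chunks = splitter.run(documents=[doc])["documents"]
for chunk in chunks:
    # Each chunk is a regular Haystack Document whose content is one split of the original text.
    print(chunk.content)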

ChineseDocumentSplitter.warm_up

def warm_up() -> None

Warm up the component by loading the necessary models.
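When the splitter is used standalone, call warm_up() once before run(), as in the usage example above. A minimal sketch of pipeline usage, assuming the standard Haystack Pipeline behavior of warming up its components before the first run:

from haystack import Document, Pipeline
from haystack_integrations.components.preprocessors.hanlp.chinese_document_splitter import ChineseDocumentSplitter

pipeline = Pipeline()
pipeline.add_component("splitter", ChineseDocumentSplitter(split_by="sentence"))

# Pipeline.run() warms up its components, so no explicit warm_up() call is needed here.
result = pipeline.run({"splitter": {"documents": [Document(content="čŋ™æ˜¯įŦŦ一åĨč¯ã€‚čŋ™æ˜¯įŦŦäēŒåĨč¯ã€‚")]}})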

ChineseDocumentSplitter.chinese_sentence_split

def chinese_sentence_split(text: str) -> List[Dict[str, Any]]

Split Chinese text into sentences.

Arguments:

  • text: The text to split.

Returns:

A list of split sentences, each represented as a dictionary.
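The method can also be used on its own once the models are loaded. Since each sentence is returned as a dictionary, the sketch below prints the raw entries rather than assuming specific keys:

from haystack_integrations.components.preprocessors.hanlp.chinese_document_splitter import ChineseDocumentSplitter

splitter = ChineseDocumentSplitter(split_by="sentence")
splitter.warm_up()

for entry in splitter.chinese_sentence_split("čŋ™æ˜¯įŦŦ一åĨč¯ã€‚čŋ™æ˜¯įŦŦäēŒåĨč¯īŧčŋ™æ˜¯įŦŦ三åĨč¯īŧŸ"):
    print(entry)  # one dictionary per detected sentence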

ChineseDocumentSplitter.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

ChineseDocumentSplitter.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "ChineseDocumentSplitter"

Deserializes the component from a dictionary.
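
These methods follow Haystack's standard component (de)serialization contract, so a round trip looks like this:

from haystack_integrations.components.preprocessors.hanlp.chinese_document_splitter import ChineseDocumentSplitter

splitter = ChineseDocumentSplitter(split_by="word", split_length=500, split_overlap=50, granularity="fine")

data = splitter.to_dict()                          # plain dict, e.g. for pipeline YAML/JSON
restored = ChineseDocumentSplitter.from_dict(data)
restored.warm_up()                                 # models still need loading after deserialization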