# RegexTextExtractor
Extracts text from chat messages or strings using a regular expression pattern.
|   |   |
| --- | --- |
| Most common position in a pipeline | After a Chat Generator, to parse structured output from LLM responses |
| Mandatory init variables | `regex_pattern`: The regular expression pattern used to extract text |
| Mandatory run variables | `text_or_messages`: A string or a list of `ChatMessage` objects to search through |
| Output variables | `captured_text`: The extracted text from the first capture group |
| API reference | Extractors |
| GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/extractors/regex_text_extractor.py |
## Overview
`RegexTextExtractor` parses text input or `ChatMessage` objects using a regular expression pattern and extracts the text captured by capture groups. This is useful for pulling structured information out of LLM outputs that follow a specific format, such as XML-like tags.

The component works with both plain strings and lists of `ChatMessage` objects. When given a list of messages, it processes only the last message.

The regex pattern should include at least one capture group (text within parentheses) to specify what text to extract. If no capture group is provided, the entire match is returned instead.
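The capture-group behavior mirrors plain `re` semantics. A minimal sketch using Python's `re` module directly (not the component itself) to show the difference:

```python
import re

text = "<answer>42</answer>"

# With a capture group: group 1 holds only the captured text
with_group = re.search(r"<answer>(.*?)</answer>", text)
print(with_group.group(1))  # 42

# Without a capture group: the entire match is used
without_group = re.search(r"<answer>.*?</answer>", text)
print(without_group.group(0))  # <answer>42</answer>
```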
### Handling no matches

By default, when the pattern doesn't match, the component returns an empty dictionary `{}`. You can change this behavior with the `return_empty_on_no_match` parameter:
```python
from haystack.components.extractors import RegexTextExtractor

# Default behavior - returns an empty dict when there is no match
extractor_default = RegexTextExtractor(regex_pattern=r'<answer>(.*?)</answer>')
result = extractor_default.run(text_or_messages="No answer tags here")
print(result)  # Output: {}

# Alternative behavior - returns an empty string when there is no match
extractor_explicit = RegexTextExtractor(
    regex_pattern=r'<answer>(.*?)</answer>',
    return_empty_on_no_match=False
)
result = extractor_explicit.run(text_or_messages="No answer tags here")
print(result)  # Output: {'captured_text': ''}
```
The default behavior of returning `{}` when no match is found is deprecated and will change in a future release to return `{'captured_text': ''}` instead. Set `return_empty_on_no_match=False` explicitly if you want the new behavior now.
## Usage

### On its own

This example extracts a URL from an XML-like tag structure:
```python
from haystack.components.extractors import RegexTextExtractor

# Create an extractor with a pattern that captures the URL value
extractor = RegexTextExtractor(regex_pattern=r'<issue url="(.+?)">')

# Extract from a string
result = extractor.run(text_or_messages='<issue url="github.com/example/issue/123">Issue description</issue>')
print(result)
# Output: {'captured_text': 'github.com/example/issue/123'}
```
### With ChatMessages

When working with LLM outputs in chat pipelines, you can extract structured data from `ChatMessage` objects:
````python
from haystack.components.extractors import RegexTextExtractor
from haystack.dataclasses import ChatMessage

extractor = RegexTextExtractor(regex_pattern=r'```json\s*(.*?)\s*```', return_empty_on_no_match=False)

# Simulate an LLM response with JSON in a code block
messages = [
    ChatMessage.from_user("Extract the data"),
    ChatMessage.from_assistant('Here is the data:\n```json\n{"name": "Alice", "age": 30}\n```')
]

result = extractor.run(text_or_messages=messages)
print(result)
# Output: {'captured_text': '{"name": "Alice", "age": 30}'}
````
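The captured text is still a string, so a typical next step is to parse it. A standard-library sketch of that follow-up step, reproducing the same extraction with plain `re` for illustration:

````python
import json
import re

reply = 'Here is the data:\n```json\n{"name": "Alice", "age": 30}\n```'

# Same pattern as the component example, applied with plain re
match = re.search(r'```json\s*(.*?)\s*```', reply, flags=re.DOTALL)
captured_text = match.group(1)

# Parse the captured JSON string into a Python dict
data = json.loads(captured_text)
print(data["name"], data["age"])  # Alice 30
````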
### In a pipeline

This example demonstrates extracting a specific section from a structured LLM response. The pipeline asks an LLM to analyze a topic and format its response with XML-like tags for different sections. The `RegexTextExtractor` then pulls out only the summary, discarding the rest of the response.

The LLM generates a full response with both `<analysis>` and `<summary>` sections, but only the content inside the `<summary>` tags is extracted and returned.
```python
from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.components.extractors import RegexTextExtractor
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

pipe = Pipeline()
pipe.add_component("prompt_builder", ChatPromptBuilder())
pipe.add_component("llm", OpenAIChatGenerator())
pipe.add_component("extractor", RegexTextExtractor(regex_pattern=r'<summary>(.*?)</summary>', return_empty_on_no_match=False))

pipe.connect("prompt_builder.prompt", "llm.messages")
pipe.connect("llm.replies", "extractor.text_or_messages")

# Instruct the LLM to use a specific structured format
messages = [
    ChatMessage.from_system(
        "Respond using this exact format:\n"
        "<analysis>Your detailed analysis here</analysis>\n"
        "<summary>A one-sentence summary</summary>"
    ),
    ChatMessage.from_user("What are the main benefits and drawbacks of remote work?")
]

# Run the pipeline (requires the OPENAI_API_KEY environment variable)
result = pipe.run({"prompt_builder": {"template": messages}})
print(result["extractor"]["captured_text"])
# Example output (actual text will vary):
# 'Remote work offers flexibility and eliminates commuting but can lead to isolation and blurred work-life boundaries.'
```
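One practical caveat with patterns like `<summary>(.*?)</summary>`: in Python regular expressions, `.` does not match newlines by default, so the pattern misses content that spans multiple lines. Because the inline `(?s)` flag is embedded in the pattern string itself, it works anywhere a pattern string is compiled. A quick illustration with plain `re`:

```python
import re

text = "<summary>Line one.\nLine two.</summary>"

# Plain '.' stops at newlines, so this pattern finds nothing
print(re.search(r"<summary>(.*?)</summary>", text))  # None

# An inline (?s) flag makes '.' match newlines too
match = re.search(r"(?s)<summary>(.*?)</summary>", text)
print(repr(match.group(1)))  # 'Line one.\nLine two.'
```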