# RegexTextExtractor
Extracts text from chat messages or strings using a regular expression pattern.
|   |   |
| --- | --- |
| Most common position in a pipeline | After a Chat Generator, to parse structured output from LLM responses |
| Mandatory init variables | `regex_pattern`: The regular expression pattern used to extract text |
| Mandatory run variables | `text_or_messages`: A string or a list of `ChatMessage` objects to search through |
| Output variables | `captured_text`: The extracted text from the first capture group |
| API reference | Extractors |
| GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/extractors/regex_text_extractor.py |
## Overview
`RegexTextExtractor` parses text input or `ChatMessage` objects using a regular expression pattern and extracts the text captured by capture groups. This is useful for pulling structured information out of LLM outputs that follow a specific format, such as XML-like tags.

The component works with both plain strings and lists of `ChatMessage` objects. When given a list of messages, it processes only the last message.

The regex pattern should include at least one capture group (text within parentheses) to specify what text to extract. If no capture group is provided, the entire match is returned instead.
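The capture-group behavior mirrors plain `re` semantics. A minimal sketch using Python's `re` module directly (not the component itself) to show the difference:

```python
import re

text = "<answer>42</answer>"

# With a capture group: group 1 holds only the captured text
with_group = re.search(r"<answer>(.*?)</answer>", text)
print(with_group.group(1))  # 42

# Without a capture group: the entire match is used
without_group = re.search(r"<answer>.*?</answer>", text)
print(without_group.group(0))  # <answer>42</answer>
```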
### Handling no matches

By default, when the pattern doesn't match, the component returns an empty dictionary `{}`. You can change this behavior with the `return_empty_on_no_match` parameter:
```python
from haystack.components.extractors import RegexTextExtractor

# Default behavior - returns an empty dict when there is no match
extractor_default = RegexTextExtractor(regex_pattern=r'<answer>(.*?)</answer>')
result = extractor_default.run(text_or_messages="No answer tags here")
print(result)  # Output: {}

# Alternative behavior - returns an empty string when there is no match
extractor_explicit = RegexTextExtractor(
    regex_pattern=r'<answer>(.*?)</answer>',
    return_empty_on_no_match=False
)
result = extractor_explicit.run(text_or_messages="No answer tags here")
print(result)  # Output: {'captured_text': ''}
```
The default behavior of returning `{}` when no match is found is deprecated and will change in a future release to return `{'captured_text': ''}` instead. Set `return_empty_on_no_match=False` explicitly if you want the new behavior now.
## Usage

### On its own

This example extracts a URL from an XML-like tag structure:
```python
from haystack.components.extractors import RegexTextExtractor

# Create an extractor with a pattern that captures the URL value
extractor = RegexTextExtractor(regex_pattern=r'<issue url="(.+?)">')

# Extract from a string
result = extractor.run(text_or_messages='<issue url="github.com/example/issue/123">Issue description</issue>')
print(result)
# Output: {'captured_text': 'github.com/example/issue/123'}
```
### With ChatMessages

When working with LLM outputs in chat pipelines, you can extract structured data from `ChatMessage` objects:
````python
from haystack.components.extractors import RegexTextExtractor
from haystack.dataclasses import ChatMessage

extractor = RegexTextExtractor(regex_pattern=r'```json\s*(.*?)\s*```', return_empty_on_no_match=False)

# Simulate an LLM response with JSON in a code block
messages = [
    ChatMessage.from_user("Extract the data"),
    ChatMessage.from_assistant('Here is the data:\n```json\n{"name": "Alice", "age": 30}\n```')
]

result = extractor.run(text_or_messages=messages)
print(result)
# Output: {'captured_text': '{"name": "Alice", "age": 30}'}
````
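The captured text is still a string, so a typical next step is to parse it. A standard-library sketch of that follow-up step, reproducing the same extraction with plain `re` for illustration:

````python
import json
import re

reply = 'Here is the data:\n```json\n{"name": "Alice", "age": 30}\n```'

# Same pattern as the component example, applied with plain re
match = re.search(r'```json\s*(.*?)\s*```', reply, flags=re.DOTALL)
captured_text = match.group(1)

# Parse the captured JSON string into a Python dict
data = json.loads(captured_text)
print(data["name"], data["age"])  # Alice 30
````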
### In a pipeline

This example demonstrates extracting a specific section from a structured LLM response. The pipeline asks an LLM to analyze a topic and format its response with XML-like tags for different sections. The `RegexTextExtractor` then pulls out only the summary, discarding the rest of the response.

The LLM generates a full response with both `<analysis>` and `<summary>` sections, but only the content inside the `<summary>` tags is extracted and returned.
```python
from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.components.extractors import RegexTextExtractor
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

pipe = Pipeline()
pipe.add_component("prompt_builder", ChatPromptBuilder())
pipe.add_component("llm", OpenAIChatGenerator())
pipe.add_component("extractor", RegexTextExtractor(regex_pattern=r'<summary>(.*?)</summary>', return_empty_on_no_match=False))

pipe.connect("prompt_builder.prompt", "llm.messages")
pipe.connect("llm.replies", "extractor.text_or_messages")

# Instruct the LLM to use a specific structured format
messages = [
    ChatMessage.from_system(
        "Respond using this exact format:\n"
        "<analysis>Your detailed analysis here</analysis>\n"
        "<summary>A one-sentence summary</summary>"
    ),
    ChatMessage.from_user("What are the main benefits and drawbacks of remote work?")
]

# Run the pipeline (requires the OPENAI_API_KEY environment variable)
result = pipe.run({"prompt_builder": {"template": messages}})
print(result["extractor"]["captured_text"])
# Example output (actual text will vary):
# 'Remote work offers flexibility and eliminates commuting but can lead to isolation and blurred work-life boundaries.'
```
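One practical caveat with patterns like `<summary>(.*?)</summary>`: in Python regular expressions, `.` does not match newlines by default, so the pattern misses content that spans multiple lines. Because the inline `(?s)` flag is embedded in the pattern string itself, it works anywhere a pattern string is compiled. A quick illustration with plain `re`:

```python
import re

text = "<summary>Line one.\nLine two.</summary>"

# Plain '.' stops at newlines, so this pattern finds nothing
print(re.search(r"<summary>(.*?)</summary>", text))  # None

# An inline (?s) flag makes '.' match newlines too
match = re.search(r"(?s)<summary>(.*?)</summary>", text)
print(repr(match.group(1)))  # 'Line one.\nLine two.'
```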