API Reference

Extracting information from documents.

Module haystack_experimental.components.extractors.llm_metadata_extractor

LLMProvider

LLM providers currently supported by LLMMetadataExtractor.

LLMProvider.from_str

@staticmethod
def from_str(string: str) -> "LLMProvider"

Converts a string to an LLMProvider enum member.
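As a rough illustration of what a string-to-enum conversion of this kind could look like, the sketch below defines a stand-in enum, `ProviderSketch`, whose values mirror the provider strings documented for `generator_api`. The class name, its member names, and the error message are assumptions made for illustration; this is not the library's implementation.

```python
from enum import Enum


class ProviderSketch(Enum):
    # Hypothetical stand-in for LLMProvider; values mirror the documented provider strings.
    OPENAI = "openai"
    OPENAI_AZURE = "openai_azure"
    AWS_BEDROCK = "aws_bedrock"
    GOOGLE_VERTEX = "google_vertex"

    @staticmethod
    def from_str(string: str) -> "ProviderSketch":
        # Map each value to its member, then fail loudly on unknown names.
        lookup = {member.value: member for member in ProviderSketch}
        provider = lookup.get(string)
        if provider is None:
            raise ValueError(f"Unknown provider '{string}'. Supported: {sorted(lookup)}")
        return provider


print(ProviderSketch.from_str("openai"))
```

Raising a ValueError that lists the supported values is one reasonable design choice, since it surfaces a misspelled provider name at construction time rather than at the first LLM call.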

LLMMetadataExtractor

Extracts metadata from documents using a Large Language Model (LLM) from one of the supported providers, such as OpenAI.

The metadata is extracted by providing a prompt to an LLM that generates the metadata.

from haystack import Document
from haystack.components.generators import OpenAIGenerator
from haystack_experimental.components.extractors import LLMMetadataExtractor

NER_PROMPT = '''
-Goal-
Given text and a list of entity types, identify all entities of those types from the text.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: One of the following types: [organization, product, service, industry]
Format each entity as {"entity": <entity_name>, "entity_type": <entity_type>}

2. Return output in a single list with all the entities identified in step 1.

-Examples-
######################
Example 1:
entity_types: [organization, person, partnership, financial metric, product, service, industry, investment strategy, market trend]
text:
Another area of strength is our co-brand issuance. Visa is the primary network partner for eight of the top 10 co-brand partnerships in the US today and we are pleased that Visa has finalized a multi-year extension of our successful credit co-branded partnership with Alaska Airlines, a portfolio that benefits from a loyal customer base and high cross-border usage.
We have also had significant co-brand momentum in CEMEA. First, we launched a new co-brand card in partnership with Qatar Airways, British Airways and the National Bank of Kuwait. Second, we expanded our strong global Marriott relationship to launch Qatar's first hospitality co-branded card with Qatar Islamic Bank. Across the United Arab Emirates, we now have exclusive agreements with all the leading airlines marked by a recent agreement with Emirates Skywards.
And we also signed an inaugural Airline co-brand agreement in Morocco with Royal Air Maroc. Now newer digital issuers are equally
------------------------
output:
{"entities": [{"entity": "Visa", "entity_type": "company"}, {"entity": "Alaska Airlines", "entity_type": "company"}, {"entity": "Qatar Airways", "entity_type": "company"}, {"entity": "British Airways", "entity_type": "company"}, {"entity": "National Bank of Kuwait", "entity_type": "company"}, {"entity": "Marriott", "entity_type": "company"}, {"entity": "Qatar Islamic Bank", "entity_type": "company"}, {"entity": "Emirates Skywards", "entity_type": "company"}, {"entity": "Royal Air Maroc", "entity_type": "company"}]}
#############################
-Real Data-
######################
entity_types: [company, organization, person, country, product, service]
text: {{input_text}}
######################
output:
'''

docs = [
    Document(content="deepset was founded in 2018 in Berlin, and is known for its Haystack framework"),
    Document(content="Hugging Face is a company founded in Paris, France and is known for its Transformers library")
]

extractor = LLMMetadataExtractor(prompt=NER_PROMPT, expected_keys=["entities"], generator_api="openai", prompt_variable='input_text')
extractor.run(documents=docs)
>> {'documents': [
    Document(id=.., content: 'deepset was founded in 2018 in Berlin, and is known for its Haystack framework',
    meta: {'entities': [{'entity': 'deepset', 'entity_type': 'company'}, {'entity': 'Berlin', 'entity_type': 'city'},
          {'entity': 'Haystack', 'entity_type': 'product'}]}),
    Document(id=.., content: 'Hugging Face is a company founded in Paris, France and is known for its Transformers library',
    meta: {'entities': [
            {'entity': 'Hugging Face', 'entity_type': 'company'}, {'entity': 'Paris', 'entity_type': 'city'},
            {'entity': 'France', 'entity_type': 'country'}, {'entity': 'Transformers', 'entity_type': 'product'}
            ]})
       ]
   }

LLMMetadataExtractor.__init__

def __init__(prompt: str,
             prompt_variable: str,
             expected_keys: List[str],
             generator_api: Union[str, LLMProvider],
             generator_api_params: Optional[Dict[str, Any]] = None,
             page_range: Optional[List[Union[str, int]]] = None,
             raise_on_failure: bool = False)

Initializes the LLMMetadataExtractor.

Arguments:

  • prompt: The prompt to be used for the LLM.
  • prompt_variable: The variable in the prompt to be processed by the PromptBuilder.
  • expected_keys: The keys expected in the JSON output from the LLM.
  • generator_api: The API provider for the LLM. Currently supported providers are: "openai", "openai_azure", "aws_bedrock", "google_vertex"
  • generator_api_params: The parameters for the LLM generator.
  • page_range: A range of pages to extract metadata from. For example, page_range=['1', '3'] will extract metadata from the first and third pages of each document. It also accepts printable range strings, e.g.: ['1-3', '5', '8', '10-12'] will extract metadata from pages 1, 2, 3, 5, 8, 10, 11, 12. If None, metadata will be extracted from the entire document for each document in the documents list. This parameter is optional and can be overridden in the run method.
  • raise_on_failure: Whether to raise an error on failure to validate JSON output.
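To make the `page_range` format concrete, here is a minimal sketch of how entries such as '1-3' or 5 could be expanded into individual page numbers. The helper name `expand_page_range_sketch` is hypothetical and the logic is an assumption based only on the description above, not the component's actual code.

```python
from typing import List, Union


def expand_page_range_sketch(page_range: List[Union[str, int]]) -> List[int]:
    """Expand entries such as '1-3' or 5 into a sorted list of unique page numbers."""
    pages = set()
    for entry in page_range:
        text = str(entry).strip()
        if "-" in text:
            # A range string like '10-12' covers both endpoints.
            start, end = (int(part) for part in text.split("-", 1))
            pages.update(range(start, end + 1))
        else:
            pages.add(int(text))
    return sorted(pages)


print(expand_page_range_sketch(["1-3", "5", "8", "10-12"]))
# [1, 2, 3, 5, 8, 10, 11, 12]
```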

LLMMetadataExtractor.is_valid_json_and_has_expected_keys

def is_valid_json_and_has_expected_keys(expected: List[str],
                                        received: str) -> bool

Checks that the output is valid JSON containing the expected keys.

Arguments:

  • expected: The keys expected in the JSON output.
  • received: The output string received from the LLM.

Raises:

  • ValueError: If the output is not valid JSON with the expected keys and raise_on_failure is set to True. With raise_on_failure set to False, a warning is issued and False is returned instead.

Returns:

True if the received output is valid JSON with the expected keys, False otherwise.
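The check described above can be sketched with the standard library alone. In this hedged example, `has_expected_keys_sketch` is a hypothetical free function, and `raise_on_failure` is passed as an argument for illustration (in the component it is set in `__init__`); the exact messages and warning mechanism are assumptions.

```python
import json
import warnings
from typing import List


def has_expected_keys_sketch(expected: List[str], received: str, raise_on_failure: bool = False) -> bool:
    """Return True if `received` parses as a JSON object containing every expected key."""
    problem = None
    try:
        parsed = json.loads(received)
        if not isinstance(parsed, dict):
            problem = "output is valid JSON but not an object"
        else:
            missing = [key for key in expected if key not in parsed]
            if missing:
                problem = f"output is missing expected keys: {missing}"
    except json.JSONDecodeError as exc:
        problem = f"output is not valid JSON: {exc}"

    if problem is None:
        return True
    if raise_on_failure:
        # Strict mode: surface the failure to the caller.
        raise ValueError(problem)
    # Lenient mode: warn and report failure through the return value.
    warnings.warn(problem)
    return False


print(has_expected_keys_sketch(["entities"], '{"entities": []}'))  # True
```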

LLMMetadataExtractor.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

LLMMetadataExtractor.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "LLMMetadataExtractor"

Deserializes the component from a dictionary.

Arguments:

  • data: Dictionary with serialized data.

Returns:

An instance of the component.

LLMMetadataExtractor.run

@component.output_types(documents=List[Document], errors=Dict[str, Any])
def run(documents: List[Document],
        page_range: Optional[List[Union[str, int]]] = None)

Extract metadata from documents using a Language Model.

If page_range is provided, the component splits each document into pages and extracts metadata only from the specified pages. If page_range is not provided, metadata is extracted from the entire document.

The original documents will be returned updated with the extracted metadata.
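The page selection step can be pictured roughly as follows. This sketch assumes that pages within a document's content are separated by the form feed character ("\f") and that page numbers are 1-based; both are assumptions for illustration, and `select_pages_sketch` is a hypothetical helper rather than part of the component.

```python
from typing import List


def select_pages_sketch(content: str, pages: List[int]) -> str:
    """Return the text of the requested 1-based pages, joined back together."""
    all_pages = content.split("\f")
    # Silently skip page numbers that fall outside the document.
    selected = [all_pages[n - 1] for n in pages if 1 <= n <= len(all_pages)]
    return "\f".join(selected)


text = "page one\fpage two\fpage three"
print(select_pages_sketch(text, [1, 3]).split("\f"))
# ['page one', 'page three']
```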

Arguments:

  • documents: List of documents to extract metadata from.
  • page_range: A range of pages to extract metadata from. For example, page_range=['1', '3'] will extract metadata from the first and third pages of each document. It also accepts printable range strings, e.g.: ['1-3', '5', '8', '10-12'] will extract metadata from pages 1, 2, 3, 5, 8, 10, 11, 12. If None, metadata will be extracted from the entire document for each document in the documents list.

Returns:

A dictionary with the keys:

  • "documents": The original list of documents updated with the extracted metadata.
  • "errors": A dictionary with document IDs as keys and error messages as values.
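One way to consume this return value is to separate documents that were processed cleanly from those reported under "errors". The sketch below uses `SimpleNamespace` objects as stand-ins for `Document` (only the `id` and `meta` attributes are relied on), and the sample IDs and error message are made-up illustrative data; `split_results_sketch` is a hypothetical helper.

```python
from types import SimpleNamespace
from typing import Any, Dict, List, Tuple


def split_results_sketch(result: Dict[str, Any]) -> Tuple[List[Any], Dict[str, Any]]:
    """Separate documents without a recorded error from the error mapping."""
    errors = result.get("errors", {})
    succeeded = [doc for doc in result["documents"] if doc.id not in errors]
    return succeeded, errors


# Stand-in objects exposing the same 'id' and 'meta' attributes as a Document.
result = {
    "documents": [
        SimpleNamespace(id="doc-1", meta={"entities": []}),
        SimpleNamespace(id="doc-2", meta={}),
    ],
    "errors": {"doc-2": "output was not valid JSON"},
}
ok_docs, errors = split_results_sketch(result)
print([doc.id for doc in ok_docs])  # ['doc-1']
print(errors)  # {'doc-2': 'output was not valid JSON'}
```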