JSONConverter
Converts JSON files to text documents.
Most common position in a pipeline | Before PreProcessors, or right at the beginning of an indexing pipeline |
Mandatory init variables | At least one of: "jq_schema": A jq filter string to extract content; "content_key": A key string to extract document content |
Mandatory run variables | "sources": A list of file paths or ByteStream objects |
Output variables | "documents": A list of documents |
API reference | Converters |
GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/json.py |
Overview
JSONConverter converts one or more JSON files into text documents.
Parameters Overview
To initialize JSONConverter, you must provide the jq_schema parameter, the content_key parameter, or both.
The jq_schema parameter is a jq filter that extracts nested data from JSON files. Refer to the jq documentation for filter syntax. If not set, the entire JSON file is used.
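As a rough illustration of what a filter selects: the jq filter .laureates[] emits each element of the laureates array as a separate value. The sketch below mimics that single filter in plain Python without calling jq. The helper name iterate_laureates is hypothetical and only stands in for the filter; it is not part of Haystack or jq.

```python
import json

raw = json.dumps({"laureates": [{"surname": "Fermi"}, {"surname": "Levi-Montalcini"}]})


def iterate_laureates(text):
    """Plain-Python stand-in for the jq filter `.laureates[]` (hypothetical helper)."""
    data = json.loads(text)
    # `.laureates` selects the value stored under that key;
    # `[]` then emits every element of the array as its own value.
    return list(data["laureates"])


values = iterate_laureates(raw)
print(values)
```

Each emitted value is then a candidate for becoming one document.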
The content_key parameter lets you specify which key in the extracted data will be the document's content.
- If both jq_schema and content_key are set, the content_key is searched in the data extracted by jq_schema. Non-object data will be skipped.
- If only jq_schema is set, the extracted value must be scalar; objects or arrays will be skipped.
- If only content_key is set, the source must be a JSON object, or it will be skipped.
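The three rules above can be sketched as one small decision function. This is a minimal plain-Python sketch of the documented behavior, not the component's actual implementation. The function name pick_content is hypothetical, and two details are assumptions: that a missing key leads to a skip, and that scalar values are converted with str.

```python
def pick_content(value, content_key=None):
    """Return the content for one extracted value, or None if it would be skipped.

    Hypothetical helper: `value` stands for either one value emitted by the
    jq filter or, when no filter is set, the whole parsed JSON source.
    """
    if content_key is not None:
        # content_key is set: only JSON objects (dicts) can be searched.
        if not isinstance(value, dict):
            return None
        # Assumption: an object without the key is skipped as well.
        if content_key not in value:
            return None
        return value[content_key]
    # Only jq_schema is set: the value must be scalar.
    if isinstance(value, (dict, list)):
        return None
    # Assumption: scalars are turned into text with str().
    return str(value)


print(pick_content({"text": "hello"}, content_key="text"))
print(pick_content(["not", "an", "object"], content_key="text"))
print(pick_content("a scalar value"))
print(pick_content({"an": "object"}))
```

In this sketch a return value of None marks a value that produces no document.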
Check out the API reference for the full list of parameters.
Usage
You need to install the jq package to use this Converter:
pip install jq
Example
Here is an example of simple component usage:
import json
from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream
source = ByteStream.from_string(json.dumps({"text": "This is the content of my document"}))
converter = JSONConverter(content_key="text")
results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'This is the content of my document'
In the following more complex example, we provide a jq_schema string to filter the JSON source files and extra_meta_fields to extract from the filtered data:
import json
from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream
data = {
    "laureates": [
        {
            "firstname": "Enrico",
            "surname": "Fermi",
            "motivation": "for his demonstrations of the existence of new radioactive elements produced "
            "by neutron irradiation, and for his related discovery of nuclear reactions brought about by"
            " slow neutrons",
        },
        {
            "firstname": "Rita",
            "surname": "Levi-Montalcini",
            "motivation": "for their discoveries of growth factors",
        },
    ],
}
source = ByteStream.from_string(json.dumps(data))
converter = JSONConverter(
    jq_schema=".laureates[]", content_key="motivation", extra_meta_fields={"firstname", "surname"}
)
results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'for his demonstrations of the existence of new radioactive elements produced by
# neutron irradiation, and for his related discovery of nuclear reactions brought
# about by slow neutrons'
print(documents[0].meta)
# {'firstname': 'Enrico', 'surname': 'Fermi'}
print(documents[1].content)
# 'for their discoveries of growth factors'
print(documents[1].meta)
# {'firstname': 'Rita', 'surname': 'Levi-Montalcini'}