
JSONConverter

Converts JSON files to text documents.

Most common position in a pipeline: Before PreProcessors, or right at the beginning of an indexing pipeline
Mandatory init variables: one of the following, or both:
  • "jq_schema": A jq filter string to extract content
  • "content_key": A key string to extract document content
Mandatory run variables: "sources": A list of file paths or ByteStream objects
Output variables: "documents": A list of documents
API reference: Converters
GitHub link: https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/json.py

Overview

JSONConverter converts one or more JSON files into text documents.

Parameters Overview

To initialize JSONConverter, you must provide jq_schema, content_key, or both.

The jq_schema parameter is a jq filter that extracts nested data from JSON files. Refer to the jq documentation for filter syntax. If not set, the entire JSON file is used.

The content_key parameter lets you specify which key in the extracted data will be the document's content.

  • If both jq_schema and content_key are set, the content_key is searched in the data extracted by jq_schema. Non-object data will be skipped.
  • If only jq_schema is set, the extracted value must be scalar; objects or arrays will be skipped.
  • If only content_key is set, the source must be a JSON object, or it will be skipped.
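
The three rules above can be sketched in plain Python. This is a simplified model for illustration, not Haystack's implementation: select_content is a hypothetical helper, the jq step is assumed to have already produced the extracted value, and treating a missing key as skipped is an assumption not stated above.

```python
from typing import Any, Optional


def select_content(extracted: Any, content_key: Optional[str] = None) -> Optional[Any]:
    """Hypothetical model of the skip rules; returns None when the value is skipped.

    JSON null is not modeled separately here, so None always means "skipped".
    """
    if content_key is not None:
        # With content_key set, the extracted value must be a JSON object (dict).
        if not isinstance(extracted, dict):
            return None
        # Assumption: an object without the key is skipped as well.
        if content_key not in extracted:
            return None
        return extracted[content_key]
    # With only jq_schema set, the extracted value must be a scalar.
    if isinstance(extracted, (dict, list)):
        return None
    return extracted


print(select_content({"text": "hello"}, content_key="text"))  # hello
print(select_content(["a", "b"], content_key="text"))         # None (not an object)
print(select_content("plain string"))                         # plain string
print(select_content({"text": "hello"}))                      # None (object, no content_key)
```

In the real component, a skipped value produces no document for that item. The sketch only shows which inputs pass each rule.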

Check out the API reference for the full list of parameters.

Usage

You need to install the jq package to use this Converter:

pip install jq

Example

Here is an example of simple component usage:

import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

source = ByteStream.from_string(json.dumps({"text": "This is the content of my document"}))

converter = JSONConverter(content_key="text")
results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'This is the content of my document'

In the following, more complex example, we provide a jq_schema string to filter the JSON source files and extra_meta_fields to copy additional fields from the filtered data into each document's metadata:

import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

data = {
  "laureates": [
    {
      "firstname": "Enrico",
      "surname": "Fermi",
      "motivation": "for his demonstrations of the existence of new radioactive elements produced "
      "by neutron irradiation, and for his related discovery of nuclear reactions brought about by"
      " slow neutrons",
    },
    {
      "firstname": "Rita",
      "surname": "Levi-Montalcini",
      "motivation": "for their discoveries of growth factors",
    },
  ],
}
source = ByteStream.from_string(json.dumps(data))
converter = JSONConverter(
  jq_schema=".laureates[]", content_key="motivation", extra_meta_fields={"firstname", "surname"}
)

results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'for his demonstrations of the existence of new radioactive elements produced by
# neutron irradiation, and for his related discovery of nuclear reactions brought
# about by slow neutrons'

print(documents[0].meta)
# {'firstname': 'Enrico', 'surname': 'Fermi'}

print(documents[1].content)
# 'for their discoveries of growth factors'

print(documents[1].meta)
# {'firstname': 'Rita', 'surname': 'Levi-Montalcini'}
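
For readers unfamiliar with jq, the flow of the example above can be approximated in plain Python. This is only a rough sketch and does not use jq or Haystack: sketch_convert is a hypothetical helper, the loop over data["laureates"] stands in for the filter ".laureates[]", and the motivation strings are shortened placeholders.

```python
from typing import Any, Dict, List, Set, Tuple


def sketch_convert(
    data: Dict[str, Any], content_key: str, extra_meta_fields: Set[str]
) -> List[Tuple[Any, Dict[str, Any]]]:
    """Hypothetical stand-in for the converter: returns (content, meta) pairs."""
    results = []
    # The jq filter ".laureates[]" emits each element of the "laureates" array.
    for item in data["laureates"]:
        # content_key picks the field that becomes the document content.
        content = item[content_key]
        # extra_meta_fields names the fields copied into the document metadata.
        meta = {key: item[key] for key in sorted(extra_meta_fields) if key in item}
        results.append((content, meta))
    return results


# Shortened placeholder data with the same shape as the example above.
data = {
    "laureates": [
        {"firstname": "Enrico", "surname": "Fermi", "motivation": "motivation A"},
        {"firstname": "Rita", "surname": "Levi-Montalcini", "motivation": "motivation B"},
    ]
}

rows = sketch_convert(data, content_key="motivation", extra_meta_fields={"firstname", "surname"})
print(rows[0])
# ('motivation A', {'firstname': 'Enrico', 'surname': 'Fermi'})
```

Each array element yields one (content, meta) pair, which mirrors how the example produces one document per laureate.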