Loading JSON
In this section, we assume the JSON data is in NDJSON (Newline delimited JSON) format, known as JSONEachRow
in ClickHouse. This is the preferred format for loading JSON due to its brevity and efficient use of space, but others are supported for both input and output.
Consider the following JSON sample, representing a row from the Python PyPI dataset:
In order to load this JSON object into ClickHouse, a table schema must be defined. A simple schema for this is shown below, where JSON keys are mapped to column names:
We have selected an ordering key here via the ORDER BY
clause. For further details on ordering keys and how to choose them, see here.
ClickHouse can load data JSON in several formats, automatically inferring the type from the extension and contents. We can read JSON files for the above table using the S3 function:
Note how we are not required to specify the file format. Instead, we use a glob pattern to read all *.json.gz
files in the bucket. ClickHouse automatically infers the format is JSONEachRow
(ndjson) from the file extension and contents. A format can be manually specified through parameter functions in case ClickHouse is unable to detect it.
The above files are also compressed. This is automatically detected and handled by ClickHouse.
To load the rows in these files, we can use an INSERT INTO SELECT
:
Rows can also be loaded inline using the FORMAT
clause e.g.
These examples assume the use of the JSONEachRow format. Other common JSON formats are supported, with examples provided of loading these here.
The above provided a very simple example of loading JSON data. For more complex JSON, including nested structures, see the guide Designing JSON schema.