First Impressions on Clickhouse Json Datatype

Jan Barrera
3 min readMay 24, 2022

Starting with version 22.3, Clickhouse supports the Json datatype, albeit an experimental feature that needs to be explicitly set.

set allow_experimental_object_type = 1

This new feature gives Clickhouse greater flexibility in dealing with ingestion, storage, and retrieval of semi-structured data and complex objects.

What is Clickhouse?

Clickhouse is a super-fast columnar database that is well-suited for real-time analytics. It is also open-source (e.i. free to use) and supports a language very similar to standard SQL. Here is benchmark comparison with other analytic databases: https://clickhouse.com/benchmark/dbms (caveat, this was performed by Clickhouse so it might be biased.)

Clickhouse and JSON

Prior to version 22.3, it is difficult to work with semi-structured on Clickhouse. Typically, semi-structured data can be ingested by using a map datatype. A map provides extended columns not initially set on the table. The downside of using maps however is that query operations are very slow. Moreover, the key-value datatypes need to be explicitly defined.

Performance Test

We started with creating a table with just string and integer native datatypes.

CREATE TABLE table_native_columns
(
id_col Int32,
int_col Int32,
string_col String
)
ENGINE = MergeTree
PRIMARY KEY id_col

then we also create a table with a json datatype.

CREATE TABLE table_json_columns
(
id_col Int32,
json Json
)
ENGINE = MergeTree
PRIMARY KEY id_col

We load 500 million rows of test data on both tables. Then we ran simple test queries with aggregation and grouping.

SELECT
string_col,
sum(int_col)
FROM table_native_columns
GROUP BY string_col

Average query time: 1.741 seconds.

Prior to version 22.3, Clickhouse have json functions to process and manipulate json data. Now, json columns can be queried via a convenient dot notation, making queries more intuitive and easily understandable.

SELECT
json.string_col,
sum(json.int_col)
FROM table_json_columns
GROUP BY json.string_col

Average query time: 1.750 seconds.

Surprisingly, there is not much significant difference on the query performance.

We also tested nested json structures. We run a similar query and interestingly, we got a similar performance.

SELECT
json.string_col,
sum(json.int.int_col)
FROM table_json_columns
GROUP BY json.string.string_col

Average query time: 1.754 seconds.

This seems to indicate that there is no difference in query performance when querying native columns and json columns. This is an impressive feature for Clickhouse — being able to store and retrieve complex json objects with the same performance as querying native datatypes.

To complete our test, we also considered the map datatype. We were interested how json columns compared with map columns in query performance.

We create a similar table but with a map column.

CREATE TABLE table_map_columns
(
id_col Int32,
map Map(String,String)
)
ENGINE = MergeTree
PRIMARY KEY id_col

Query.

SELECT
map['string_col'],
sum(toInt32(map['int_col']))
FROM table_map_columns
GROUP BY map['string_col']

Average query time: 9.781 seconds.

This is ~5x slower as compared to the json and native datatype. Note that we also needed to cast the string value to integer. Regardless, we no longer want to use maps in storing semi-structured data.

Limitation on Kafka Table

Lastly, it is interesting to note that Kafka table engine does not support json datatype. We tried it out and we received this message:

DB::Exception: Cannot create table with column of type Object, because storage Kafka doesn't support dynamic subcolumns. (ILLEGAL_COLUMN)

I think it would be a major win if ever Clickhouse supports this feature. Integration with Kafka would be seamless and without the need to process messages on ingestion.

Notable Impressions:

  1. Querying, aggregating and grouping json columns is as fast as native columns. On the contrary, map datatype is at least 5x slower using a similar query .
  2. Json datatype is a convenient way to store json objects. You can store semi-structured and complex objects and retrieve them with ease, and almost the same speed as native columns.
  3. Json datatype is not supported in Kafka table engine but hopeful for this feature in the future.

--

--