mirror of
https://github.com/apache/impala.git
synced 2026-01-17 21:00:36 -05:00
When performing zero-slot scans on a JSON table for operations like count(*), we do not need any specific data from the JSON; we only need the number of top-level JSON objects. However, the current JSON parser based on rapidjson still decodes and copies data from the JSON, even in zero-slot scans. Skipping these steps can significantly improve scan performance.

This patch introduces a JSON skipper to conduct zero-slot scans on JSON data. Essentially, it is a simplified version of the rapidjson parser with the data decoding and copying operations removed, so it counts JSON objects faster. The skipper can still recognize malformed JSON and reports the same specific error codes as the rapidjson parser. However, because it bypasses data parsing, it cannot identify string encoding errors or numeric overflow errors. These data errors do not affect the count of JSON objects, so it is acceptable to ignore them. The TEXT scanner exhibits similar behavior.

Additionally, a new query option, disable_optimized_json_count_star, has been added to disable this optimization and revert to the old behavior.
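The idea of counting top-level objects without decoding values can be sketched as follows. This is a minimal illustration in Python, not Impala's implementation: the names `count_top_level_objects` and `MalformedJson` are hypothetical, and the sketch only tracks string state and bracket nesting, so it is less strict than a parser modeled on rapidjson (for example, it does not check commas, colons, or literals).

```python
class MalformedJson(ValueError):
    """Raised when brackets or strings are not properly closed (hypothetical name)."""


_CLOSER_FOR = {"{": "}", "[": "]"}


def count_top_level_objects(text: str) -> int:
    """Count top-level JSON objects without decoding or copying any values."""
    count = 0
    stack = []          # expected closing brackets for the currently open containers
    in_string = False   # True while inside a string literal
    escaped = False     # True when the previous character in a string was a backslash

    for pos, ch in enumerate(text):
        if in_string:
            # String contents are skipped, never decoded, so encoding
            # errors inside them go unnoticed.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue

        if ch == '"':
            in_string = True
        elif ch in _CLOSER_FOR:
            stack.append(_CLOSER_FOR[ch])
        elif ch in "}]":
            if not stack or stack.pop() != ch:
                raise MalformedJson(f"unexpected {ch!r} at offset {pos}")
            if not stack and ch == "}":
                count += 1  # an object closed at nesting depth zero
        # Numbers and other scalars are ignored entirely, so numeric
        # overflow is never detected.

    if in_string or stack:
        raise MalformedJson("unexpected end of input")
    return count


if __name__ == "__main__":
    sample = '{"id": 1, "name": "A}lice"}\n{"id": 999999999999999999999999}\n'
    print(count_top_level_objects(sample))
```

Braces inside string literals are not counted, and the oversized number in the second object is accepted without complaint, which mirrors the trade-off described above: structural errors are still caught, data errors are not.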
In the performance test of TPC-DS with a format of json/none and a scale of 10GB, the performance optimization is shown in the following tables:

+-----------+---------------------------+--------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
| Workload  | Query                     | File Format        | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval   |
+-----------+---------------------------+--------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
| TPCDS(10) | TPCDS-Q_COUNT_UNOPTIMIZED | json / none / none | 6.78   | 6.88        | -1.46%     | 4.93%     | 3.63%          | 9     | -1.51%         | -0.74   | -0.72  |
| TPCDS(10) | TPCDS-Q_COUNT_ZERO_SLOT   | json / none / none | 2.42   | 6.75        | I -64.20%  | 6.44%     | 4.58%          | 9     | I -177.75%     | -3.36   | -37.55 |
| TPCDS(10) | TPCDS-Q_COUNT_OPTIMIZED   | json / none / none | 2.42   | 7.03        | I -65.63%  | 3.93%     | 4.39%          | 9     | I -194.13%     | -3.36   | -42.82 |
+-----------+---------------------------+--------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+

(I) Improvement: TPCDS(10) TPCDS-Q_COUNT_ZERO_SLOT [json / none / none] (6.75s -> 2.42s [-64.20%])
+--------------+------------+---------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+
| Operator     | % of Query | Avg     | Base Avg | Delta(Avg) | StdDev(%)  | Max      | Base Max | Delta(Max) | #Hosts | #Inst | #Rows  | Est #Rows |
+--------------+------------+---------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+
| 01:AGGREGATE | 2.58%      | 54.85ms | 58.88ms  | -6.85%     | * 14.43% * | 115.82ms | 133.11ms | -12.99%    | 3      | 3     | 3      | 1         |
| 00:SCAN HDFS | 97.41%     | 2.07s   | 6.07s    | -65.84%    | 5.87%      | 2.43s    | 6.95s    | -65.01%    | 3      | 3     | 28.80M | 143.83M   |
+--------------+------------+---------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+

(I) Improvement: TPCDS(10) TPCDS-Q_COUNT_OPTIMIZED [json / none / none] (7.03s -> 2.42s [-65.63%])
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+--------+-----------+
| Operator     | % of Query | Avg   | Base Avg | Delta(Avg) | StdDev(%) | Max   | Base Max | Delta(Max) | #Hosts | #Inst | #Rows  | Est #Rows |
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+--------+-----------+
| 00:SCAN HDFS | 99.35%     | 2.07s | 6.49s    | -68.15%    | 4.83%     | 2.37s | 7.49s    | -68.32%    | 3      | 3     | 28.80M | 143.83M   |
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+--------+-----------+

Testing:
- Added new test cases in TestQueriesJsonTables to verify that query results are consistent before and after optimization.
- Passed existing JSON scanning-related tests.

Change-Id: I97ff097661c3c577aeafeeb1518408ce7a8a255e
Reviewed-on: http://gerrit.cloudera.org:8080/21039
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
21 lines
462 B
Plaintext
====
---- QUERY
# Testing scanning of complex JSON. JsonParser and HdfsJsonScanner do not support complex
# types yet, so they will be set as null for now.
select id, name, spouse, child from complex_json
---- TYPES
int, string, string, string
---- RESULTS
1,'Alice','NULL','NULL'
2,'Bob','NULL','NULL'
5,'Emily','NULL','NULL'
13,'Liam','NULL','NULL'
15,'Nora','NULL','NULL'
====
---- QUERY
select count(*) from complex_json
---- TYPES
bigint
---- RESULTS
5
====