mirror of
https://github.com/apache/impala.git
synced 2026-01-31 00:00:20 -05:00
When performing zero-slot scans on a JSON table for operations like count(*), we do not need specific data from the JSON; we only need the number of top-level JSON objects. However, the current rapidjson-based JSON parser still decodes and copies specific data from the JSON, even in zero-slot scans. Skipping these steps can significantly improve scan performance.

This patch introduces a JSON skipper to conduct zero-slot scans on JSON data. Essentially, it is a simplified version of the rapidjson parser with the data decoding and copying operations removed, which makes counting JSON objects faster. The skipper retains the ability to recognize malformed JSON and returns the same specific error codes as the rapidjson parser. However, because it bypasses data parsing, it cannot identify string encoding errors or numeric overflow errors. These data errors do not affect the count of JSON objects, so it is acceptable to ignore them. The TEXT scanner exhibits similar behavior.

Additionally, a new query option, disable_optimized_json_count_star, has been added to disable this optimization and revert to the old behavior.

In the performance test of TPC-DS with a format of json/none and a scale of 10GB, the performance optimization is shown in the following tables:

+-----------+---------------------------+--------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
| Workload  | Query                     | File Format        | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval   |
+-----------+---------------------------+--------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
| TPCDS(10) | TPCDS-Q_COUNT_UNOPTIMIZED | json / none / none | 6.78   | 6.88        | -1.46%     | 4.93%     | 3.63%          | 9     | -1.51%         | -0.74   | -0.72  |
| TPCDS(10) | TPCDS-Q_COUNT_ZERO_SLOT   | json / none / none | 2.42   | 6.75        | I -64.20%  | 6.44%     | 4.58%          | 9     | I -177.75%     | -3.36   | -37.55 |
| TPCDS(10) | TPCDS-Q_COUNT_OPTIMIZED   | json / none / none | 2.42   | 7.03        | I -65.63%  | 3.93%     | 4.39%          | 9     | I -194.13%     | -3.36   | -42.82 |
+-----------+---------------------------+--------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+

(I) Improvement: TPCDS(10) TPCDS-Q_COUNT_ZERO_SLOT [json / none / none] (6.75s -> 2.42s [-64.20%])
+--------------+------------+---------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+
| Operator     | % of Query | Avg     | Base Avg | Delta(Avg) | StdDev(%)  | Max      | Base Max | Delta(Max) | #Hosts | #Inst | #Rows  | Est #Rows |
+--------------+------------+---------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+
| 01:AGGREGATE | 2.58%      | 54.85ms | 58.88ms  | -6.85%     | * 14.43% * | 115.82ms | 133.11ms | -12.99%    | 3      | 3     | 3      | 1         |
| 00:SCAN HDFS | 97.41%     | 2.07s   | 6.07s    | -65.84%    | 5.87%      | 2.43s    | 6.95s    | -65.01%    | 3      | 3     | 28.80M | 143.83M   |
+--------------+------------+---------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+

(I) Improvement: TPCDS(10) TPCDS-Q_COUNT_OPTIMIZED [json / none / none] (7.03s -> 2.42s [-65.63%])
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+--------+-----------+
| Operator     | % of Query | Avg   | Base Avg | Delta(Avg) | StdDev(%) | Max   | Base Max | Delta(Max) | #Hosts | #Inst | #Rows  | Est #Rows |
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+--------+-----------+
| 00:SCAN HDFS | 99.35%     | 2.07s | 6.49s    | -68.15%    | 4.83%     | 2.37s | 7.49s    | -68.32%    | 3      | 3     | 28.80M | 143.83M   |
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+--------+-----------+

Testing:
- Added new test cases in TestQueriesJsonTables to verify that query results are consistent before and after optimization.
- Passed existing JSON scanning-related tests.

Change-Id: I97ff097661c3c577aeafeeb1518408ce7a8a255e
Reviewed-on: http://gerrit.cloudera.org:8080/21039
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
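
The idea behind the skipper described in the commit message can be illustrated with a minimal Python sketch. This is not Impala's implementation (which is C++ and modeled on the rapidjson parser); the function name `count_top_level_objects` and the sample data are hypothetical. The sketch only tracks nesting depth and string state, so it never decodes or copies a value, and it performs far less validation than the real skipper is described as doing.

```python
def count_top_level_objects(data: bytes) -> int:
    """Count top-level JSON objects without decoding any values.

    Illustrative only: detects truncated input and stray closing
    brackets, but does not check that '{' pairs with '}' or that
    values are well formed.
    """
    depth = 0          # current nesting depth of {...} / [...]
    in_string = False  # True while inside a "..." string literal
    escaped = False    # True right after a backslash inside a string
    count = 0
    for byte in data:
        ch = chr(byte)
        if in_string:
            # Inside a string, only track escapes and the closing quote.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in "{[":
            depth += 1
        elif ch in "}]":
            if depth == 0:
                raise ValueError("unbalanced closing bracket")
            depth -= 1
            # An object that closes at depth 0 is one top-level row.
            if depth == 0 and ch == "}":
                count += 1
    if in_string or depth != 0:
        raise ValueError("truncated JSON input")
    return count


# Hypothetical sample: one normal object, one object spanning two lines
# with a brace inside a string, and two objects on a single line.
sample = (
    b'{"id": 1, "value": "abc"}\n'
    b'{"id": 2,\n "value": {"nested": "}"}}\n'
    b'{"id": 3} {"id": 4}\n'
)
print(count_top_level_objects(sample))  # 4
```

Because string contents are skipped byte by byte, a sketch like this cannot notice invalid string encodings or out-of-range numbers, which mirrors the trade-off the commit message accepts for count(*).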
34 lines
1.4 KiB
Plaintext
====
---- QUERY
# Testing scanning of multi-line JSON. rapidjson can handle cases where line breaks appear
# in JSON (except for those that appear in numbers and strings), so in most cases it can
# scan multi-line JSON. However, line breaks in strings and numbers are treated as invalid
# values, and the scanner returns null. Additionally, it should be noted that if the line
# break in the multi-line JSON is near the beginning of the scan range, it may cause the
# parser to misjudge the starting position of the first complete JSON (because it always
# starts parsing from the position after the first line break). This usually has no
# effect (except reporting an error), but if there happens to be a sub-object immediately
# after the line break, it will cause an extra line of data to be scanned. If the line
# break in the multi-line JSON is also at the beginning of the scan range, it will cause
# the last line of data from the previous scan range to be incomplete.
select id, key, value from multiline_json
---- TYPES
int, string, string
---- RESULTS
1,'normal object','abcdefg'
2,'multiline string','NULL'
3,'multiline number','1234'
4,'multiline object1','abcdefg'
5,'multiline object2','abcdefg'
6,'multiline object3','abcdefg'
7,'multiline object4','abcdefg'
8,'one line multiple objects','obj1'
9,'one line multiple objects','obj2'
====
---- QUERY
select count(*) from multiline_json
---- TYPES
bigint
---- RESULTS
9
====