impala/testdata/workloads/functional-query/queries/QueryTest/multiline_json.test
Eyizoha ec59578106 IMPALA-12786: Optimize count(*) for JSON scans
When performing zero-slot scans on a JSON table for operations like
count(*), we don't require specific data from the JSON; we only need the
number of top-level JSON objects. However, the current JSON parser based
on rapidjson still decodes and copies specific data from the JSON, even
in zero-slot scans. Skipping these steps can significantly improve scan
performance.

This patch introduces a JSON skipper to perform zero-slot scans on JSON
data. Essentially, it is a simplified version of the rapidjson parser
with the data decoding and copying operations removed, so it counts
JSON objects faster. The skipper retains the ability to recognize
malformed JSON and reports the same specific error codes as the
rapidjson parser. However, because it bypasses value parsing, it cannot
identify string encoding errors or numeric overflow errors. These data
errors do not affect the count of JSON objects, so it is acceptable to
ignore them. The TEXT scanner exhibits similar behavior.
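
The idea of skipping instead of parsing can be illustrated with a
minimal Python sketch. This is not the C++ skipper added by this patch;
the function name count_top_level_objects and its error messages are
hypothetical, and it only checks bracket and string structure rather
than the full JSON grammar. It shows why structural errors are still
caught while value-level errors (bad escapes, numeric overflow) pass
unnoticed.

```python
def count_top_level_objects(data: str) -> int:
    """Count top-level JSON objects without decoding any values.

    Hypothetical illustration only: tracks string state and bracket
    nesting, and never converts strings or numbers.
    """
    count = 0
    stack = []            # closing brackets we still expect to see
    in_string = False
    escaped = False
    closers = {"{": "}", "[": "]"}
    for pos, ch in enumerate(data):
        if in_string:
            # Inside a string only the terminating quote matters;
            # escape sequences are skipped, not validated.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in closers:
            stack.append(closers[ch])
        elif ch in "}]":
            if not stack or stack.pop() != ch:
                raise ValueError(f"unexpected {ch!r} at offset {pos}")
            if not stack and ch == "}":
                count += 1  # a top-level object just closed
    if in_string or stack:
        raise ValueError("truncated JSON input")
    return count
```

For example, an input with three objects, one of which contains an
out-of-range number and an invalid escape, still counts as three, while
a mismatched bracket raises an error.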

Additionally, a new query option, disable_optimized_json_count_star, has
been added to disable this optimization and revert to the old behavior.

In a TPC-DS performance test with the json/none format at the 10GB
scale, the improvement is shown in the following tables:
+-----------+---------------------------+--------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
| Workload  | Query                     | File Format        | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval   |
+-----------+---------------------------+--------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
| TPCDS(10) | TPCDS-Q_COUNT_UNOPTIMIZED | json / none / none | 6.78   | 6.88        |   -1.46%   |   4.93%   |   3.63%        | 9     |   -1.51%       | -0.74   | -0.72  |
| TPCDS(10) | TPCDS-Q_COUNT_ZERO_SLOT   | json / none / none | 2.42   | 6.75        | I -64.20%  |   6.44%   |   4.58%        | 9     | I -177.75%     | -3.36   | -37.55 |
| TPCDS(10) | TPCDS-Q_COUNT_OPTIMIZED   | json / none / none | 2.42   | 7.03        | I -65.63%  |   3.93%   |   4.39%        | 9     | I -194.13%     | -3.36   | -42.82 |
+-----------+---------------------------+--------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+

(I) Improvement: TPCDS(10) TPCDS-Q_COUNT_ZERO_SLOT [json / none / none] (6.75s -> 2.42s [-64.20%])
+--------------+------------+---------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+
| Operator     | % of Query | Avg     | Base Avg | Delta(Avg) | StdDev(%)  | Max      | Base Max | Delta(Max) | #Hosts | #Inst | #Rows  | Est #Rows |
+--------------+------------+---------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+
| 01:AGGREGATE | 2.58%      | 54.85ms | 58.88ms  | -6.85%     | * 14.43% * | 115.82ms | 133.11ms | -12.99%    | 3      | 3     | 3      | 1         |
| 00:SCAN HDFS | 97.41%     | 2.07s   | 6.07s    | -65.84%    |   5.87%    | 2.43s    | 6.95s    | -65.01%    | 3      | 3     | 28.80M | 143.83M   |
+--------------+------------+---------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+

(I) Improvement: TPCDS(10) TPCDS-Q_COUNT_OPTIMIZED [json / none / none] (7.03s -> 2.42s [-65.63%])
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+--------+-----------+
| Operator     | % of Query | Avg   | Base Avg | Delta(Avg) | StdDev(%) | Max   | Base Max | Delta(Max) | #Hosts | #Inst | #Rows  | Est #Rows |
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+--------+-----------+
| 00:SCAN HDFS | 99.35%     | 2.07s | 6.49s    | -68.15%    |   4.83%   | 2.37s | 7.49s    | -68.32%    | 3      | 3     | 28.80M | 143.83M   |
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+--------+-----------+

Testing:
- Added new test cases in TestQueriesJsonTables to verify that query
  results are consistent before and after optimization.
- Passed existing JSON scanning-related tests.

Change-Id: I97ff097661c3c577aeafeeb1518408ce7a8a255e
Reviewed-on: http://gerrit.cloudera.org:8080/21039
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-07-10 14:37:19 +00:00


====
---- QUERY
# Testing scanning of multi-line JSON. rapidjson can handle line breaks that appear in
# JSON (except for those inside numbers and strings), so in most cases it can scan
# multi-line JSON. However, line breaks in strings and numbers are treated as invalid
# values, and the scanner returns null. Additionally, note that if the line break in the
# multi-line JSON is near the beginning of the scan range, it may cause the parser to
# misjudge the starting position of the first complete JSON (because it always starts
# parsing from the position after the first line break). This usually has no effect
# (except reporting an error), but if there happens to be a sub-object immediately after
# the line break, it will cause an extra line of data to be scanned. If the line break in
# the multi-line JSON is also at the beginning of the scan range, it will cause the last
# line of data from the previous scan range to be incomplete.
select id, key, value from multiline_json
---- TYPES
int, string, string
---- RESULTS
1,'normal object','abcdefg'
2,'multiline string','NULL'
3,'multiline number','1234'
4,'multiline object1','abcdefg'
5,'multiline object2','abcdefg'
6,'multiline object3','abcdefg'
7,'multiline object4','abcdefg'
8,'one line multiple objects','obj1'
9,'one line multiple objects','obj2'
====
---- QUERY
select count(*) from multiline_json
---- TYPES
bigint
---- RESULTS
9
====