impala

jprdonnelly/impala

Fork 0

mirror of https://github.com/apache/impala.git synced 2026-01-24 06:00:49 -05:00

Commit Graph

Author	SHA1	Message	Date
Eyizoha	ec59578106	IMPALA-12786: Optimize count() for JSON scans When performing zero slots scans on a JSON table for operations like count(), we don't require specific data from the JSON, we only need the number of top-level JSON objects. However, the current JSON parser based on rapidjson still decodes and copies specific data from the JSON, even in zero slots scans. Skipping these steps can significantly improve scan performance. This patch introduces a JSON skipper to conduct zero slots scans on JSON data. Essentially, it is a simplified version of a rapidjson parser, removing specific data decoding and copying operations, resulting in faster parsing of the number of JSON objects. The skipper retains the ability to recognize malformed JSON and provide specific error codes same as the rapidjson parser. Nevertheless, as it bypasses specific data parsing, it cannot identify string encoding errors or numeric overflow errors. Despite this, these data errors do not impact the counting of JSON objects, so it is acceptable to ignore them. The TEXT scanner exhibits similar behavior. Additionally, a new query option, disable_optimized_json_count_star, has been added to disable this optimization and revert to the old behavior. In the performance test of TPC-DS with a format of json/none and a scale of 10GB, the performance optimization is shown in the following tables: +-----------+---------------------------+--------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+ \| Workload \| Query \| File Format \| Avg(s) \| Base Avg(s) \| Delta(Avg) \| StdDev(%) \| Base StdDev(%) \| Iters \| Median Diff(%) \| MW Zval \| Tval \| +-----------+---------------------------+--------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+ \| TPCDS(10) \| TPCDS-Q_COUNT_UNOPTIMIZED \| json / none / none \| 6.78 \| 6.88 \| -1.46% \| 4.93% \| 3.63% \| 9 \| -1.51% \| -0.74 \| -0.72 \| \| TPCDS(10) \| TPCDS-Q_COUNT_ZERO_SLOT \| json / none / none \| 2.42 \| 6.75 \| I -64.20% \| 6.44% \| 4.58% \| 9 \| I -177.75% \| -3.36 \| -37.55 \| \| TPCDS(10) \| TPCDS-Q_COUNT_OPTIMIZED \| json / none / none \| 2.42 \| 7.03 \| I -65.63% \| 3.93% \| 4.39% \| 9 \| I -194.13% \| -3.36 \| -42.82 \| +-----------+---------------------------+--------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+ (I) Improvement: TPCDS(10) TPCDS-Q_COUNT_ZERO_SLOT [json / none / none] (6.75s -> 2.42s [-64.20%]) +--------------+------------+---------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+ \| Operator \| % of Query \| Avg \| Base Avg \| Delta(Avg) \| StdDev(%) \| Max \| Base Max \| Delta(Max) \| #Hosts \| #Inst \| #Rows \| Est #Rows \| +--------------+------------+---------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+ \| 01:AGGREGATE \| 2.58% \| 54.85ms \| 58.88ms \| -6.85% \| * 14.43% * \| 115.82ms \| 133.11ms \| -12.99% \| 3 \| 3 \| 3 \| 1 \| \| 00:SCAN HDFS \| 97.41% \| 2.07s \| 6.07s \| -65.84% \| 5.87% \| 2.43s \| 6.95s \| -65.01% \| 3 \| 3 \| 28.80M \| 143.83M \| +--------------+------------+---------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+ (I) Improvement: TPCDS(10) TPCDS-Q_COUNT_OPTIMIZED [json / none / none] (7.03s -> 2.42s [-65.63%]) +--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+--------+-----------+ \| Operator \| % of Query \| Avg \| Base Avg \| Delta(Avg) \| StdDev(%) \| Max \| Base Max \| Delta(Max) \| #Hosts \| #Inst \| #Rows \| Est #Rows \| +--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+--------+-----------+ \| 00:SCAN HDFS \| 99.35% \| 2.07s \| 6.49s \| -68.15% \| 4.83% \| 2.37s \| 7.49s \| -68.32% \| 3 \| 3 \| 28.80M \| 143.83M \| +--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+--------+-----------+ Testing: - Added new test cases in TestQueriesJsonTables to verify that query results are consistent before and after optimization. - Passed existing JSON scanning-related tests. Change-Id: I97ff097661c3c577aeafeeb1518408ce7a8a255e Reviewed-on: http://gerrit.cloudera.org:8080/21039 Reviewed-by: Quanlong Huang <huangquanlong@gmail.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2024-07-10 14:37:19 +00:00
Eyizoha	2f06a7b052	IMPALA-10798: Initial support for reading JSON files Prototype of HdfsJsonScanner implemented based on rapidjson, which supports scanning data from splitting json files. The scanning of JSON data is mainly completed by two parts working together. The first part is the JsonParser responsible for parsing the JSON object, which is implemented based on the SAX-style API of rapidjson. It reads data from the char stream, parses it, and calls the corresponding callback function when encountering the corresponding JSON element. See the comments of the JsonParser class for more details. The other part is the HdfsJsonScanner, which inherits from HdfsScanner and provides callback functions for the JsonParser. The callback functions are responsible for providing data buffers to the Parser and converting and materializing the Parser's parsing results into RowBatch. It should be noted that the parser returns numeric values as strings to the scanner. The scanner uses the TextConverter class to convert the strings to the desired types, similar to how the HdfsTextScanner works. This is an advantage compared to using number value provided by rapidjson directly, as it eliminates concerns about inconsistencies in converting decimals (e.g. losing precision). Added a startup flag, enable_json_scanner, to be able to disable this feature if we hit critical bugs in production. Limitations - Multiline json objects are not fully supported yet. It is ok when each file has only one scan range. However, when a file has multiple scan ranges, there is a small probability of incomplete scanning of multiline JSON objects that span ScanRange boundaries (in such cases, parsing errors may be reported). For more details, please refer to the comments in the 'multiline_json.test'. - Compressed JSON files are not supported yet. - Complex types are not supported yet. Tests - Most of the existing end-to-end tests can run on JSON format. - Add TestQueriesJsonTables in test_queries.py for testing multiline, malformed, and overflow in JSON. Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Reviewed-on: http://gerrit.cloudera.org:8080/19699 Reviewed-by: Quanlong Huang <huangquanlong@gmail.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2023-09-05 16:55:41 +00:00

Author

SHA1

Message

Date

Eyizoha

ec59578106

IMPALA-12786: Optimize count(*) for JSON scans

When performing zero slots scans on a JSON table for operations like
count(*), we don't require specific data from the JSON, we only need the
number of top-level JSON objects. However, the current JSON parser based
on rapidjson still decodes and copies specific data from the JSON, even
in zero slots scans. Skipping these steps can significantly improve scan
performance.

This patch introduces a JSON skipper to conduct zero slots scans on JSON
data. Essentially, it is a simplified version of a rapidjson parser,
removing specific data decoding and copying operations, resulting in
faster parsing of the number of JSON objects. The skipper retains the
ability to recognize malformed JSON and provide specific error codes
same as the rapidjson parser. Nevertheless, as it bypasses specific
data parsing, it cannot identify string encoding errors or numeric
overflow errors. Despite this, these data errors do not impact the
counting of JSON objects, so it is acceptable to ignore them. The TEXT
scanner exhibits similar behavior.

Additionally, a new query option, disable_optimized_json_count_star, has
been added to disable this optimization and revert to the old behavior.

In the performance test of TPC-DS with a format of json/none and a scale
of 10GB, the performance optimization is shown in the following tables:
+-----------+---------------------------+--------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
| Workload  | Query                     | File Format        | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval   |
+-----------+---------------------------+--------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
| TPCDS(10) | TPCDS-Q_COUNT_UNOPTIMIZED | json / none / none | 6.78   | 6.88        |   -1.46%   |   4.93%   |   3.63%        | 9     |   -1.51%       | -0.74   | -0.72  |
| TPCDS(10) | TPCDS-Q_COUNT_ZERO_SLOT   | json / none / none | 2.42   | 6.75        | I -64.20%  |   6.44%   |   4.58%        | 9     | I -177.75%     | -3.36   | -37.55 |
| TPCDS(10) | TPCDS-Q_COUNT_OPTIMIZED   | json / none / none | 2.42   | 7.03        | I -65.63%  |   3.93%   |   4.39%        | 9     | I -194.13%     | -3.36   | -42.82 |
+-----------+---------------------------+--------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+

(I) Improvement: TPCDS(10) TPCDS-Q_COUNT_ZERO_SLOT [json / none / none] (6.75s -> 2.42s [-64.20%])
+--------------+------------+---------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+
| Operator     | % of Query | Avg     | Base Avg | Delta(Avg) | StdDev(%)  | Max      | Base Max | Delta(Max) | #Hosts | #Inst | #Rows  | Est #Rows |
+--------------+------------+---------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+
| 01:AGGREGATE | 2.58%      | 54.85ms | 58.88ms  | -6.85%     | * 14.43% * | 115.82ms | 133.11ms | -12.99%    | 3      | 3     | 3      | 1         |
| 00:SCAN HDFS | 97.41%     | 2.07s   | 6.07s    | -65.84%    |   5.87%    | 2.43s    | 6.95s    | -65.01%    | 3      | 3     | 28.80M | 143.83M   |
+--------------+------------+---------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+

(I) Improvement: TPCDS(10) TPCDS-Q_COUNT_OPTIMIZED [json / none / none] (7.03s -> 2.42s [-65.63%])
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+--------+-----------+
| Operator     | % of Query | Avg   | Base Avg | Delta(Avg) | StdDev(%) | Max   | Base Max | Delta(Max) | #Hosts | #Inst | #Rows  | Est #Rows |
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+--------+-----------+
| 00:SCAN HDFS | 99.35%     | 2.07s | 6.49s    | -68.15%    |   4.83%   | 2.37s | 7.49s    | -68.32%    | 3      | 3     | 28.80M | 143.83M   |
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+--------+-----------+

Testing:
- Added new test cases in TestQueriesJsonTables to verify that query
  results are consistent before and after optimization.
- Passed existing JSON scanning-related tests.

Change-Id: I97ff097661c3c577aeafeeb1518408ce7a8a255e
Reviewed-on: http://gerrit.cloudera.org:8080/21039
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>

2024-07-10 14:37:19 +00:00

Eyizoha

2f06a7b052

IMPALA-10798: Initial support for reading JSON files

Prototype of HdfsJsonScanner implemented based on rapidjson, which
supports scanning data from splitting json files.

The scanning of JSON data is mainly completed by two parts working
together. The first part is the JsonParser responsible for parsing the
JSON object, which is implemented based on the SAX-style API of
rapidjson. It reads data from the char stream, parses it, and calls the
corresponding callback function when encountering the corresponding JSON
element. See the comments of the JsonParser class for more details.

The other part is the HdfsJsonScanner, which inherits from HdfsScanner
and provides callback functions for the JsonParser. The callback
functions are responsible for providing data buffers to the Parser and
converting and materializing the Parser's parsing results into RowBatch.
It should be noted that the parser returns numeric values as strings to
the scanner. The scanner uses the TextConverter class to convert the
strings to the desired types, similar to how the HdfsTextScanner works.
This is an advantage compared to using number value provided by
rapidjson directly, as it eliminates concerns about inconsistencies in
converting decimals (e.g. losing precision).

Added a startup flag, enable_json_scanner, to be able to disable this
feature if we hit critical bugs in production.

Limitations
 - Multiline json objects are not fully supported yet. It is ok when
   each file has only one scan range. However, when a file has multiple
   scan ranges, there is a small probability of incomplete scanning of
   multiline JSON objects that span ScanRange boundaries (in such cases,
   parsing errors may be reported). For more details, please refer to
   the comments in the 'multiline_json.test'.
 - Compressed JSON files are not supported yet.
 - Complex types are not supported yet.

Tests
 - Most of the existing end-to-end tests can run on JSON format.
 - Add TestQueriesJsonTables in test_queries.py for testing multiline,
   malformed, and overflow in JSON.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Reviewed-on: http://gerrit.cloudera.org:8080/19699
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>

2023-09-05 16:55:41 +00:00

2 Commits