Mirror of https://github.com/apache/impala.git, synced 2025-12-22 19:35:22 -05:00
IMPALA-10798: Initial support for reading JSON files
Prototype of HdfsJsonScanner implemented based on rapidjson, which supports scanning data from splitting json files. The scanning of JSON data is mainly completed by two parts working together. The first part is the JsonParser responsible for parsing the JSON object, which is implemented based on the SAX-style API of rapidjson. It reads data from the char stream, parses it, and calls the corresponding callback function when encountering the corresponding JSON element. See the comments of the JsonParser class for more details. The other part is the HdfsJsonScanner, which inherits from HdfsScanner and provides callback functions for the JsonParser. The callback functions are responsible for providing data buffers to the Parser and converting and materializing the Parser's parsing results into RowBatch. It should be noted that the parser returns numeric values as strings to the scanner. The scanner uses the TextConverter class to convert the strings to the desired types, similar to how the HdfsTextScanner works. This is an advantage compared to using number value provided by rapidjson directly, as it eliminates concerns about inconsistencies in converting decimals (e.g. losing precision). Added a startup flag, enable_json_scanner, to be able to disable this feature if we hit critical bugs in production. Limitations - Multiline json objects are not fully supported yet. It is ok when each file has only one scan range. However, when a file has multiple scan ranges, there is a small probability of incomplete scanning of multiline JSON objects that span ScanRange boundaries (in such cases, parsing errors may be reported). For more details, please refer to the comments in the 'multiline_json.test'. - Compressed JSON files are not supported yet. - Complex types are not supported yet. Tests - Most of the existing end-to-end tests can run on JSON format. - Add TestQueriesJsonTables in test_queries.py for testing multiline, malformed, and overflow in JSON. Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Reviewed-on: http://gerrit.cloudera.org:8080/19699 Reviewed-by: Quanlong Huang <huangquanlong@gmail.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Committed by: Impala Public Jenkins
Parent: b73bc49ea7
Commit: 2f06a7b052
testdata/workloads/functional-query/queries/DataErrorsTest/hdfs-json-scan-node-errors.test (new file, vendored, 180 lines)
@@ -0,0 +1,180 @@
====
---- QUERY
select * from alltypeserror order by id
---- ERRORS
Error converting column: functional_json.alltypeserror.timestamp_col, type: TIMESTAMP, data: '0'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserror.bool_col, type: BOOLEAN, data: 'errtrue'
Error converting column: functional_json.alltypeserror.tinyint_col, type: TINYINT, data: 'err9'
Error converting column: functional_json.alltypeserror.smallint_col, type: SMALLINT, data: 'err9'
Error converting column: functional_json.alltypeserror.int_col, type: INT, data: 'err9'
Error converting column: functional_json.alltypeserror.bigint_col, type: BIGINT, data: 'err90'
Error converting column: functional_json.alltypeserror.float_col, type: FLOAT, data: 'err9.000000'
Error converting column: functional_json.alltypeserror.double_col, type: DOUBLE, data: 'err90.900000'
Error converting column: functional_json.alltypeserror.timestamp_col, type: TIMESTAMP, data: '0000-01-01 00:00:00'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserror.double_col, type: DOUBLE, data: 'err70.700000'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserror.float_col, type: FLOAT, data: 'err6.000000'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserror.bigint_col, type: BIGINT, data: 'err50'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserror.int_col, type: INT, data: 'err4'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserror.smallint_col, type: SMALLINT, data: 'err3'
Error converting column: functional_json.alltypeserror.timestamp_col, type: TIMESTAMP, data: '2002-14-10 00:00:00'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserror.tinyint_col, type: TINYINT, data: 'err2'
Error converting column: functional_json.alltypeserror.timestamp_col, type: TIMESTAMP, data: '1999-10-10 90:10:10'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserror.bool_col, type: BOOLEAN, data: 'errfalse'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserror.float_col, type: FLOAT, data: 'xyz3.000000'
Error converting column: functional_json.alltypeserror.double_col, type: DOUBLE, data: 'xyz30.300000'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserror.tinyint_col, type: TINYINT, data: 'xyz5'
Error converting column: functional_json.alltypeserror.timestamp_col, type: TIMESTAMP, data: '0009-01-01 00:00:00'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserror.timestamp_col, type: TIMESTAMP, data: '0'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserror.double_col, type: DOUBLE, data: 'xyz70.700000'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserror.timestamp_col, type: TIMESTAMP, data: '2020-20-10 10:10:10.123'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserror.bool_col, type: BOOLEAN, data: 't\rue'
Error converting column: functional_json.alltypeserror.tinyint_col, type: TINYINT, data: 'err30'
Error converting column: functional_json.alltypeserror.smallint_col, type: SMALLINT, data: 'err30'
Error converting column: functional_json.alltypeserror.int_col, type: INT, data: 'err30'
Error converting column: functional_json.alltypeserror.bigint_col, type: BIGINT, data: 'err300'
Error converting column: functional_json.alltypeserror.float_col, type: FLOAT, data: 'err30..000000'
Error converting column: functional_json.alltypeserror.double_col, type: DOUBLE, data: 'err300.900000'
Error converting column: functional_json.alltypeserror.timestamp_col, type: TIMESTAMP, data: '0000-01-01 00:00:00'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserror.int_col, type: INT, data: 'abc9'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserror.tinyint_col, type: TINYINT, data: 'abc7'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserror.int_col, type: INT, data: 'abc5'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserror.timestamp_col, type: TIMESTAMP, data: '2020-10-10 10:70:10.123'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserror.smallint_col, type: SMALLINT, data: 'abc3'
Error converting column: functional_json.alltypeserror.timestamp_col, type: TIMESTAMP, data: '2020-10-10 60:10:10.123'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserror.timestamp_col, type: TIMESTAMP, data: '2020-10-40 10:10:10.123'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
---- RESULTS
0,NULL,NULL,0,0,0,0.0,0.0,'01/01/09','0',NULL,2009,1
1,NULL,NULL,1,1,10,1.0,10.1,'01/01/09','1',1999-10-10 00:00:00,2009,1
2,true,NULL,NULL,2,20,2.0,20.2,'01/01/09','2',NULL,2009,1
3,false,3,NULL,NULL,30,3.0,30.3,'01/01/09','3',NULL,2009,1
4,true,4,4,NULL,NULL,4.0,40.4,'01/01/09','4',1970-01-01 00:00:00,2009,1
5,false,5,5,5,NULL,NULL,50.5,'01/01/09','5',1970-01-01 00:00:00,2009,1
6,true,6,6,6,60,NULL,NULL,'01/01/09','6',1970-01-01 00:00:00,2009,1
7,NULL,NULL,7,7,70,7.0,NULL,'01/01/09','7',1970-01-01 00:00:00,2009,1
8,false,NULL,NULL,8,80,8.0,80.8,'01/01/09','8',1970-01-01 00:00:00,2009,1
9,NULL,NULL,NULL,NULL,NULL,NULL,NULL,'01/01/09','9',NULL,2009,1
10,NULL,NULL,NULL,0,0,0.0,0.0,'02/01/09','0',2009-01-01 00:00:00,2009,2
11,false,NULL,NULL,NULL,10,1.0,10.1,'02/01/09','1',2009-01-01 00:00:00,2009,2
12,true,2,NULL,NULL,NULL,2.0,20.2,'02/01/09','2',2009-01-01 00:00:00,2009,2
13,false,3,3,NULL,NULL,NULL,NULL,'02/01/09','3',2009-01-01 00:00:00,2009,2
14,true,4,4,4,40,NULL,NULL,'02/01/09','4',2009-01-01 00:00:00,2009,2
15,false,NULL,5,5,50,5.0,50.5,'02/01/09','5',NULL,2009,2
16,NULL,NULL,NULL,NULL,NULL,NULL,NULL,'02/01/09','6',NULL,2009,2
17,false,7,7,7,70,7.0,NULL,'02/01/09','7',2009-01-01 00:00:00,2009,2
18,true,8,8,8,80,8.0,80.8,'02/01/09','8',2009-01-01 00:00:00,2009,2
19,false,9,9,9,90,9.0,90.9,'02/01/09','9',2009-01-01 00:00:00,2009,2
20,true,0,0,0,0,0.0,0.0,'03/01/09','0',2020-10-10 10:10:10.123000000,2009,3
21,false,1,1,1,10,1.0,10.1,'03/01/09','1',NULL,2009,3
22,true,2,2,2,20,2.0,20.2,'03/01/09','2',NULL,2009,3
23,false,3,NULL,3,30,3.0,30.3,'03/01/09','3',NULL,2009,3
24,true,4,4,4,40,4.0,40.4,'03/01/09','4',NULL,2009,3
25,false,5,5,NULL,50,5.0,50.5,'03/01/09','5',2020-10-10 10:10:10.123000000,2009,3
26,true,6,6,6,60,6.0,60.6,'03/01/09','6',2020-10-10 10:10:10.123000000,2009,3
27,false,NULL,7,7,70,7.0,70.7,'03/01/09','7',2020-10-10 10:10:10.123000000,2009,3
28,true,8,8,8,80,8.0,80.8,'03/01/09','8',2020-10-10 10:10:10.123000000,2009,3
29,false,9,9,NULL,90,9.0,90.9,'03/01/09','9',2020-10-10 10:10:10.123000000,2009,3
30,NULL,NULL,NULL,NULL,NULL,NULL,NULL,'01/01/10','10',NULL,2009,3
---- TYPES
int, boolean, tinyint, smallint, int, bigint, float, double, string, string, timestamp, int, int
====
---- QUERY
select * from alltypeserrornonulls order by id
---- ERRORS
Error converting column: functional_json.alltypeserrornonulls.timestamp_col, type: TIMESTAMP, data: '123456'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserrornonulls.bool_col, type: BOOLEAN, data: 'errfalse'
Error converting column: functional_json.alltypeserrornonulls.timestamp_col, type: TIMESTAMP, data: '1990-00-01 10:10:10'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserrornonulls.tinyint_col, type: TINYINT, data: 'err2'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserrornonulls.smallint_col, type: SMALLINT, data: 'err3'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserrornonulls.int_col, type: INT, data: 'err4'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserrornonulls.bigint_col, type: BIGINT, data: 'err50'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserrornonulls.float_col, type: FLOAT, data: 'err6.000000'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserrornonulls.double_col, type: DOUBLE, data: 'err70.700000'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserrornonulls.bool_col, type: BOOLEAN, data: 'errtrue'
Error converting column: functional_json.alltypeserrornonulls.tinyint_col, type: TINYINT, data: 'err9'
Error converting column: functional_json.alltypeserrornonulls.smallint_col, type: SMALLINT, data: 'err9'
Error converting column: functional_json.alltypeserrornonulls.int_col, type: INT, data: 'err9'
Error converting column: functional_json.alltypeserrornonulls.bigint_col, type: BIGINT, data: 'err90'
Error converting column: functional_json.alltypeserrornonulls.float_col, type: FLOAT, data: 'err9.000000'
Error converting column: functional_json.alltypeserrornonulls.double_col, type: DOUBLE, data: 'err90.900000'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserrornonulls.float_col, type: FLOAT, data: 'xyz3.000000'
Error converting column: functional_json.alltypeserrornonulls.double_col, type: DOUBLE, data: 'xyz30.300000'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserrornonulls.tinyint_col, type: TINYINT, data: 'xyz5'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserrornonulls.double_col, type: DOUBLE, data: 'xyz70.700000'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserrornonulls.smallint_col, type: SMALLINT, data: 'abc3'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserrornonulls.int_col, type: INT, data: 'abc5'
Error converting column: functional_json.alltypeserrornonulls.timestamp_col, type: TIMESTAMP, data: '2012-Mar-22 11:20:01.123'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserrornonulls.tinyint_col, type: TINYINT, data: 'abc7'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserrornonulls.timestamp_col, type: TIMESTAMP, data: '11:20:01.123 2012-03-22 '
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
Error converting column: functional_json.alltypeserrornonulls.int_col, type: INT, data: 'abc9'
row_regex: .*Error parsing row: file: $NAMENODE/.* before offset: \d+
---- RESULTS
0,true,0,0,0,0,0,0,'01/01/09','0',NULL,2009,1
1,NULL,1,1,1,10,1,10.1,'01/01/09','1',NULL,2009,1
2,true,NULL,2,2,20,2,20.2,'01/01/09','2',2012-03-22 11:20:01.123000000,2009,1
3,false,3,NULL,3,30,3,30.3,'01/01/09','3',2012-03-22 11:20:01.123000000,2009,1
4,true,4,4,NULL,40,4,40.4,'01/01/09','4',2012-03-22 11:20:01.123000000,2009,1
5,false,5,5,5,NULL,5,50.5,'01/01/09','5',2012-03-22 11:20:01.123000000,2009,1
6,true,6,6,6,60,NULL,60.6,'01/01/09','6',2012-03-22 11:20:01.123000000,2009,1
7,false,7,7,7,70,7,NULL,'01/01/09','7',2012-03-22 11:20:01.123000000,2009,1
8,false,8,8,8,80,8,80.8,'01/01/09','8',2012-03-22 11:20:01.123000000,2009,1
9,NULL,NULL,NULL,NULL,NULL,NULL,NULL,'01/01/09','9',2012-03-22 11:20:01.123000000,2009,1
10,true,0,0,0,0,0,0,'02/01/09','0',2012-03-22 11:20:01.123000000,2009,2
11,false,1,1,1,10,1,10.1,'02/01/09','1',2012-03-22 11:20:01.123000000,2009,2
12,true,2,2,2,20,2,20.2,'02/01/09','2',2012-03-22 11:20:01.123000000,2009,2
13,false,3,3,3,30,NULL,NULL,'02/01/09','3',2012-03-22 11:20:01.123000000,2009,2
14,true,4,4,4,40,4,40.4,'02/01/09','4',2012-03-22 11:20:01.123000000,2009,2
15,false,NULL,5,5,50,5,50.5,'02/01/09','5',2012-03-22 11:20:01.123000000,2009,2
16,true,6,6,6,60,6,60.6,'02/01/09','6',2012-03-22 11:20:01.123000000,2009,2
17,false,7,7,7,70,7,NULL,'02/01/09','7',2012-03-22 11:20:01.123000000,2009,2
18,true,8,8,8,80,8,80.8,'02/01/09','8',2012-03-22 11:20:01.123000000,2009,2
19,false,9,9,9,90,9,90.90000000000001,'02/01/09','9',2012-03-22 11:20:01.123000000,2009,2
20,true,0,0,0,0,0,0,'03/01/09','0',2012-03-22 11:20:01.123000000,2009,3
21,false,1,1,1,10,1,10.1,'03/01/09','1',2012-03-22 11:20:01.123000000,2009,3
22,true,2,2,2,20,2,20.2,'03/01/09','2',2012-03-22 11:20:01.123000000,2009,3
23,false,3,NULL,3,30,3,30.3,'03/01/09','3',2012-03-22 11:20:01.123000000,2009,3
24,true,4,4,4,40,4,40.4,'03/01/09','4',2012-03-22 11:20:01.123000000,2009,3
25,false,5,5,NULL,50,5,50.5,'03/01/09','5',NULL,2009,3
26,true,6,6,6,60,6,60.6,'03/01/09','6',2012-03-22 11:20:01.123000000,2009,3
27,false,NULL,7,7,70,7,70.7,'03/01/09','7',2012-03-22 11:20:01.123000000,2009,3
28,true,8,8,8,80,8,80.8,'03/01/09','8',NULL,2009,3
29,false,9,9,NULL,90,9,90.90000000000001,'03/01/09','9',2012-03-22 00:00:00,2009,3
---- TYPES
int, boolean, tinyint, smallint, int, bigint, float, double, string, string, timestamp, int, int
====