impala

mirror of https://github.com/apache/impala.git synced 2026-02-02 15:00:38 -05:00

Author	SHA1	Message	Date
Tim Armstrong	153663c22f	IMPALA-4123: Columnar decoding in Parquet The idea is to optimise the common case where there are long runs of NULL or non-NULL values (i.e. the def level is repeated). We can detect this cheaply by keying the decoding loop in the column reader off the state of the def level RLE decoder - if there's a long run of repeated levels, we can skip checking the def level for every value. We still fall back to decoding, caching and reading value-by-value a batch of def levels whenever the next def level is not in a repeated run. We still use the old approach for decoding rep levels. There might be some benefit to using the same approach for rep levels if repeated def and rep level runs line up. These changes should unlock further optimizations because more time is spent in simple kernel functions, e.g. UnpackAndDecode32Values() for dictionary decompression, which is very optimisable using SIMD etc. Snappy decompression now seems to be the main CPU bottleneck for decoding snappy-compressed Parquet. Perf: Running TPC-H scale factor 60 on uncompressed and snappy parquet both showed a ~4% speedup overall. Microbenchmarks on uncompressed parquet show scans only doing dictionary decoding on uncompressed Parquet is ~75% faster: set mt_dop=1; select min(l_returnflag) from lineitem; Testing: We have alltypes agg with a mix of null and non-null. Many tables have long runs of non-null values. Added new test data and coverage: * a test table manynulls with long runs of null values. * a large CHAR test table * missing coverage for materialising pos slot in flattened nested types scan. * Extended dict test to test longer runs. * A larger version of complextypestbl with interesting collection shapes - NULL collections, empty collections, etc, particularly runs of collections with the same shape. * Test interaction of timestamp validation with conversion * Ran code coverage build to confirm all code paths are tested * ASAN and exhaustive runs. 
Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef Reviewed-on: http://gerrit.cloudera.org:8080/8319 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-11-17 01:48:05 +00:00
Tim Armstrong	d05f73f415	IMPALA-7647: Add HS2/Impyla dimension to TestQueries I used some ideas from Alex Leblang's abandoned patch: https://gerrit.cloudera.org/#/c/137/ in order to run .test files through HS2. The advantage of using Impyla is that much of the code will be reusable for any Python client implementing the standard Python dbapi and does not require us implementing yet another thrift client. This gives us better coverage of non-trivial result sets from HS2, including handling of NULLs, error logs and more interesting result sets than the basic HS2 tests. I added HS2 coverage to TestQueries, which has a reasonable variety of queries and covers the data types in alltypes. I also added TestDecimalQueries, TestStringQuery and TestCharFormats to get coverage of DECIMAL, CHAR and VARCHAR that aren't in alltypes. Coverage of result sets with NULLs was limited so I added a couple of queries. Places where results differ from Beeswax: * Impyla is a Python dbapi client so must convert timestamps into python datetime objects, which only have microsecond precision. Therefore result timestamps with nanosecond precision are truncated. * The HS2 interface reports the NULL type as BOOLEAN as a workaround for IMPALA-914. * The Beeswax interface reported VARCHAR as STRING, but HS2 reports VARCHAR. I dealt with different results by adding additional result sections so that the expected differences between the clients/protocols were explicit. Limitations: * Not all of the same methods are implemented as for beeswax, so some tests that have more complicated interactions with the client will not work with HS2 yet. * We don't have a way to get the affected row count for inserts. I also simplified the ImpalaConnection API by removing some unnecessary methods and moved some generic methods to the base class. Testing: * Confirmed that it detected IMPALA-7588 by re-applying the buggy patch. * Ran exhaustive and CentOS6 tests. 
Change-Id: I9908ccc4d3df50365be8043b883cacafca52661e Reviewed-on: http://gerrit.cloudera.org:8080/11546 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-10-09 00:45:10 +00:00
Tim Armstrong	579e33207b	IMPALA-6368: make test_chars parallel Previously it had to be executed serially because it modified tables in the functional database. This change separates out tests that use temporary tables and runs those in a unique_database. Testing: Ran locally in a loop with parallelism of 4 for a while. Change-Id: I2f62ede90f619b8cebbb1276bab903e7555d9744 Reviewed-on: http://gerrit.cloudera.org:8080/9022 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-19 09:55:52 +00:00
Dimitris Tsirogiannis	c88d179413	IMPALA-1636: Generalize index-based partition pruning to allow constant expressions This commit enables fast partition pruning for cases where constant expressions appear in binary or IN predicates. During partition pruning, the constant expressions are evaluated in the BE and are replaced by the computed results as LiteralExprs. Change-Id: Ie8a2accf260391117559dc6c0a565f907c516478 Reviewed-on: http://gerrit.cloudera.org:8080/144 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2015-03-07 09:51:27 +00:00
Alex Behm	c0f2e043b4	Fix exhaustive test runs: Preserve types when substituting root output exprs. A recent change (3ccee71) to fix resetAnalysisState() of NullLiterals exposed another bug during exhaustive test runs. For insert queries into Parquet, the types in the schema of the generated Parquet files are based on the insert exprs, correctly assuming that the FE handles all the necessary casting to make sure the Parquet file schema and the table schema match. Since we apply an smap on the output exprs towards the end of planning, NullLiterals were reset to the NULL_TYPE, causing the Parquet schema to incorrectly have BOOLEAN columns (we cast naked NULL_LITERALS to BOOLEAN in toThrift()), leading to a mismatch of the Parquet schema and the table schema. Subsequent queries on such a table failed, correctly reporting a type mismatch. The fix is to preserve types when doing the substitution on the output exprs. Change-Id: I135f1b826b06a6a200df7b73343d2eb1fb4b7b80 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5453 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5455	2014-11-30 01:08:08 -08:00
Nong Li	e2d7fb6402	Some test case cleanup. Change-Id: Ic29b7c1f5fd714a1e2cc41bf0e55c0d11c782862 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4791 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5090 Reviewed-by: Nong Li <nong@cloudera.com>	2014-11-03 22:33:08 -08:00
Victor Bittorf	7b244d34b6	IMPALA-1344: Fixed analytic aggregations with CHAR The fix is to only register aggregates for string, not for CHAR or VARCHAR. The CHAR and VARCHAR types are implicitly cast to STRING for aggregation. Also, fixed aggregate fn builtins that should not ignore distinct. Change-Id: If4c1a2c6127360c2c8127a5c02949df74fafc85a Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4717 Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com> Tested-by: jenkins	2014-10-06 15:16:50 -07:00
Victor Bittorf	a62500ee28	Changed CHAR & VARCHAR max length to match Hive. Also modified the text of the analysis exception for lengths that are too long or short because John said they were unclear. Change-Id: I9427d5c39298aa8207672e50e10fe527c5076599 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4698 Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com> Tested-by: jenkins	2014-10-06 15:16:45 -07:00
Victor Bittorf	c29ed3761e	IMPALA-1339: NULLs incorrectly hashed in groupby Problem: hash table assumed all raw values were at most 16 bytes. This maximum was increased to support up to 128 bytes for CHARs. Change-Id: I107c58b9a013d5db46ff5586bcdceee3961346e9 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4701 Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com> Tested-by: jenkins	2014-10-06 15:16:36 -07:00
Victor Bittorf	d5fd59e2ed	IMPALA-1337: Aggregation failures for VARCHAR The issue is that the aggregation node needed to use IsVarLen; previously it assumed TYPE_STRING was the only variable length type. Change-Id: I9545e8d405937a47b25c9042f97854851a448c6e Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4690 Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com> Tested-by: jenkins	2014-10-06 15:14:51 -07:00
Victor Bittorf	f4626b03e6	IMPALA-1322: Fix related issue There is an issue related to IMPALA-1322. The expression list was being improperly indexed when laying out memory. Change-Id: I2eef84a812b451d87ecb8afd304e765aff1f5a6b Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4675 Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com> Tested-by: jenkins	2014-10-06 15:14:44 -07:00
Victor Bittorf	794e70b0bd	Fix CHAR/VARCHAR Aggregation This fixes an issue where VARCHAR and CHAR could error in some aggregations. The cause of the problem is that the BE currently does not support CHAR/VARCHAR as arguments to aggregates, they require an implicit cast to string first. The resolution is to have these operators return STRING instead of CHAR() or VARCHAR(). Note that the CHAR() comparisons still ignore spaces for min/max. This takes advantage of the fact that STRING, VARCHAR(), and CHAR() values are all handled as a StringVal for exprs. The STRING aggregates are registered as CHAR() and VARCHAR(*) aggregates and the front end converts the return type to a STRING in all cases. Also includes a fix for a TODO about casting between CHAR and VARCHAR. Change-Id: I1d3a9cc48e426286ce63677324a8c680e67b005a Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4573 Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com> Tested-by: jenkins	2014-10-06 15:13:17 -07:00
Victor Bittorf	fa502f973a	IMPALA-1319: Fixed CHAR padding for numeric casts IMPALA-1322: Crash on VARCHAR/CHAR join Fixed 3 issues: (1) Disabled codegen for CHAR in hash join equality (2) Fixed memory layout for CHAR (3) Fixed a regression where space padding could be dropped for numeric casts. Change-Id: I6475fd527ca0d67c7d4d5ec7e561549e43fbc336 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4640 Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com> Tested-by: jenkins	2014-10-06 15:12:44 -07:00
Victor Bittorf	658f05f63c	IMPALA-1316: crash on VARCHAR join Fixed codegen issue causing some VARCHAR joins to crash. Change-Id: Ib2674199a3b2c3c5a5fd63cfae0b64e3b1ca158b Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4616 Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com> Tested-by: jenkins	2014-10-06 15:11:10 -07:00
Victor Bittorf	afbc2c28a3	Char Partition Fix Fixed a bug with CHAR and VARCHAR partition columns. Also disables CHAR and VARCHAR for UDAs and UDFs. Change-Id: I67ccd746cb4c063f8a7a984df9564fa9122fdf43 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4493 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins	2014-09-26 12:02:54 -07:00
Victor Bittorf	9939c9d009	Bugfix and tests for CHAR(N) and VARCHAR(N) Fixed a bug when setting the length while reading/writing text files for CHAR(N). Also added chars_tiny table for testing CHAR(N) and VARCHAR(N). Change-Id: If5d5db30afa4b00cf03c68c6a845f182970329f4 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4415 Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com> Tested-by: jenkins	2014-09-23 07:30:07 -07:00
Victor Bittorf	6289121261	CHAR(N) Followup Patch This patch addresses: 1. Char doesn't use codegen 2. Not in-lining large CHAR(N) for N > 128 3. Parquet reader/writer for CHAR(N) and VARCHAR(N) Change-Id: I83a29a8bd312841a3e29bfe2243884074570f247 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4280 Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com> Tested-by: jenkins	2014-09-20 16:12:03 -07:00
Victor Bittorf	a1892a17d5	IMPALA-1248: Fixed CHAR(N) in VALUES clause. Queries like: INSERT INTO table VALUES (CAST("..." AS CHAR(N))) used the codegen path and failed; changed to use the interpreted path. Change-Id: Id80274580df268b3f828dec19a2e0b0578061ca8 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4362 Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com> Tested-by: jenkins	2014-09-20 16:07:16 -07:00
Victor Bittorf	8bebf2b196	CHAR: adding support for CHAR(N) Support for CHAR is implemented as a StringVal in the backend. TODO: 1. Parquet Reader/writer 2. Codegen slot ref 3. Codegen text reader 4. Don't inline large chars 5. update impala-hs2-server.cc with CHAR support Change-Id: Ibba2c89cea971cb740001ea7975bf3e929150471 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4075 Reviewed-by: Nong Li <nong@cloudera.com> Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com> Tested-by: jenkins	2014-09-13 00:19:20 -07:00
Victor Bittorf	2dce31f6c2	Adding VARCHAR front & backend. VARCHAR is treated as StringVal in the backend. All UDAs and UDFs which accept STRING will also accept VARCHAR(N). TODO: Reverted Avro codegen to fix Jenkins; needs separate patch. Change-Id: Ifc120b6f0fe1f996b11a48b134d339ad3719331e Reviewed-on: http://gerrit.sjc.cloudera.com:8080/2527 Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com> Tested-by: jenkins (cherry picked from commit 3fcbf4f677b8e26c37eded4d8bb628e6fc53c1e9) Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4058	2014-08-27 13:52:58 -07:00

20 Commits
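
The IMPALA-4123 entry above describes keying the column decoding loop off the state of the def level RLE decoder, so that a long run of repeated levels is handled without checking the def level for every value. The following is a minimal Python sketch of that idea only, not Impala's C++ implementation. The run representation (`("repeat", level, count)` and `("literal", [levels])`) and the name `decode_column` are assumptions made for illustration; the sketch also assumes a flat column where a def level equal to `max_def_level` means the value is present.

```python
from typing import Any, List, Optional, Sequence, Tuple, Union

# Hypothetical run representation (not Impala's or Parquet's actual types):
#   ("repeat", level, count)  -> `count` consecutive rows share one def level
#   ("literal", [levels])     -> def levels that must be checked one by one
Run = Union[Tuple[str, int, int], Tuple[str, List[int]]]


def decode_column(
    runs: Sequence[Run], values: Sequence[Any], max_def_level: int
) -> List[Optional[Any]]:
    """Materialise a nullable column from def level runs and non-NULL values."""
    out: List[Optional[Any]] = []
    next_value = 0  # index of the next unread entry in `values`
    for run in runs:
        if run[0] == "repeat":
            _, level, count = run
            if level == max_def_level:
                # Fast path: a whole run of non-NULL rows, copied in bulk.
                out.extend(values[next_value:next_value + count])
                next_value += count
            else:
                # Fast path: a whole run of NULL rows, no values consumed.
                out.extend([None] * count)
        else:
            # Fallback: levels are not in a repeated run, so check each one.
            for level in run[1]:
                if level == max_def_level:
                    out.append(values[next_value])
                    next_value += 1
                else:
                    out.append(None)
    return out


if __name__ == "__main__":
    example_runs = [("repeat", 1, 3), ("literal", [0, 1, 0]), ("repeat", 0, 2)]
    print(decode_column(example_runs, [10, 20, 30, 40], max_def_level=1))
```

In this sketch the per-value branch only runs for literal runs; repeated runs reduce to a slice copy or a fill, which is the kind of simple kernel the commit message says becomes easier to optimise.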