20 Commits

Author SHA1 Message Date
Tim Armstrong
153663c22f IMPALA-4123: Columnar decoding in Parquet
The idea is to optimise the common case where there are long runs of
NULL or non-NULL values (i.e. the def level is repeated). We can
detect this cheaply by keying the decoding loop in the column reader
off the state of the def level RLE decoder - if there's a long run
of repeated levels, we can skip checking the def level for every
value. Whenever the next def level is not part of a repeated run, we
still fall back to decoding and caching a batch of def levels and
reading them value by value. We still use the old approach for decoding rep
levels. There might be some benefit to using the same approach for rep
levels *if* repeated def and rep level runs line up.
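
The run-keyed loop described above can be sketched roughly as follows. This
is an illustrative Python model only, not the Impala column reader (which is
C++): the run tuples, `decode_column` and its parameters are hypothetical
names, and the def levels are assumed to arrive already split into repeated
and literal runs.

```python
from typing import Any, List, Optional, Sequence, Tuple

# Hypothetical run shapes (assumptions made for this sketch):
#   ("repeated", def_level, count)  - one level repeated `count` times
#   ("literal", [def_level, ...])   - one level per value
Run = Tuple[Any, ...]


def decode_column(runs: Sequence[Run], values: Sequence[Any],
                  max_def_level: int) -> List[Optional[Any]]:
    """Materialise a column from def-level runs and its non-NULL values."""
    out: List[Optional[Any]] = []
    next_value = 0  # index of the next unread non-NULL value
    for run in runs:
        if run[0] == "repeated":
            _, level, count = run
            if level == max_def_level:
                # Long run of non-NULL values: copy a whole slice at once
                # without checking the def level per value.
                out.extend(values[next_value:next_value + count])
                next_value += count
            else:
                # Long run of NULLs: nothing to read from the value stream.
                out.extend([None] * count)
        else:
            # Literal run: fall back to checking every def level.
            for level in run[1]:
                if level == max_def_level:
                    out.append(values[next_value])
                    next_value += 1
                else:
                    out.append(None)
    return out
```

In this model the per-value branch only runs inside literal runs, which is
why long repeated runs leave most of the work in simple bulk operations.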

These changes should unlock further optimizations because more time is
spent in simple kernel functions, e.g. UnpackAndDecode32Values() for
dictionary decompression, which is very optimisable using SIMD etc.

Snappy decompression now seems to be the main CPU bottleneck for
decoding snappy-compressed Parquet.

Perf:
Running TPC-H scale factor 60 on uncompressed and snappy parquet
both showed a ~4% speedup overall.

Microbenchmarks show that scans doing only dictionary decoding on
uncompressed Parquet are ~75% faster:

   set mt_dop=1;
   select min(l_returnflag) from lineitem;

Testing:
We have alltypesagg with a mix of null and non-null values.

Many tables have long runs of non-null values.

Added new test data and coverage:
* a test table manynulls with long runs of null values.
* a large CHAR test table
* missing coverage for materialising pos slot in flattened nested types
  scan.
* Extended dict test to test longer runs.
* A larger version of complextypestbl with interesting collection
  shapes - NULL collections, empty collections, etc, particularly runs
  of collections with the same shape.
* Test interaction of timestamp validation with conversion
* Ran code coverage build to confirm all code paths are tested
* ASAN and exhaustive runs.

Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Reviewed-on: http://gerrit.cloudera.org:8080/8319
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-11-17 01:48:05 +00:00
Tim Armstrong
d05f73f415 IMPALA-7647: Add HS2/Impyla dimension to TestQueries
I used some ideas from Alex Leblang's abandoned patch:
https://gerrit.cloudera.org/#/c/137/ in order to run .test files through
HS2. The advantage of using Impyla is that much of the code will be
reusable for any Python client implementing the standard Python dbapi
and does not require us implementing yet another thrift client.
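
As a rough illustration of why a dbapi-based client is reusable, the sketch
below runs a query using only calls defined by the standard Python DB-API
(PEP 249). `run_query` is a hypothetical helper, not part of the Impala test
framework, and the stdlib `sqlite3` module stands in for Impyla so that the
example is self-contained.

```python
import sqlite3
from typing import Any, List, Tuple


def run_query(conn: Any, sql: str) -> Tuple[List[str], List[tuple]]:
    """Run `sql` on any PEP 249 connection; return (column names, rows)."""
    cursor = conn.cursor()
    try:
        cursor.execute(sql)
        # description is a sequence of 7-item tuples; item 0 is the name.
        names = [col[0] for col in cursor.description]
        rows = [tuple(row) for row in cursor.fetchall()]
        return names, rows
    finally:
        cursor.close()


# sqlite3 is only a stand-in here; any dbapi connection should work the same.
conn = sqlite3.connect(":memory:")
names, rows = run_query(conn, "select 1 as id, NULL as v")
```

Because the helper touches only `cursor()`, `execute()`, `description` and
`fetchall()`, swapping in a different dbapi client should not require
changes to it.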

This gives us better coverage of non-trivial result sets from HS2,
including handling of NULLs, error logs and more interesting result
sets than the basic HS2 tests.

I added HS2 coverage to TestQueries, which has a reasonable variety of
queries and covers the data types in alltypes. I also added
TestDecimalQueries, TestStringQuery and TestCharFormats to get coverage
of DECIMAL, CHAR and VARCHAR that aren't in alltypes. Coverage of
result sets with NULLs was limited, so I added a couple of queries.

Places where results differ from Beeswax:
* Impyla is a Python dbapi client, so it must convert timestamps into Python
  datetime objects, which only have microsecond precision. Therefore result
  timestamps with nanosecond precision are truncated.
* The HS2 interface reports the NULL type as BOOLEAN as a workaround for
  IMPALA-914.
* The Beeswax interface reported VARCHAR as STRING, but HS2 reports
  VARCHAR.
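
The timestamp difference in the first item can be reproduced with the
standard library alone. The helper below is a hypothetical sketch of such a
conversion, not Impyla's actual code, and it assumes timestamps formatted as
`YYYY-MM-DD HH:MM:SS[.fraction]`.

```python
from datetime import datetime


def parse_truncating(ts: str) -> datetime:
    """Parse a timestamp string, dropping digits beyond microseconds."""
    head, _, frac = ts.partition(".")
    base = datetime.strptime(head, "%Y-%m-%d %H:%M:%S")
    # datetime stores at most 6 fractional digits, so keep the first 6
    # and right-pad shorter fractions with zeros.
    micros = int(frac[:6].ljust(6, "0")) if frac else 0
    return base.replace(microsecond=micros)


# The trailing "789" nanosecond digits cannot be represented and are lost.
truncated = parse_truncating("2018-10-09 00:45:10.123456789")
```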

I dealt with different results by adding additional result sections so
that the expected differences between the clients/protocols were
explicit.

Limitations:
* Not all of the same methods are implemented as for beeswax, so some
  tests that have more complicated interactions with the client will not
  work with HS2 yet.
* We don't have a way to get the affected row count for inserts.

I also simplified the ImpalaConnection API by removing some unnecessary
methods and moved some generic methods to the base class.

Testing:
* Confirmed that it detected IMPALA-7588 by re-applying the buggy patch.
* Ran exhaustive and CentOS6 tests.

Change-Id: I9908ccc4d3df50365be8043b883cacafca52661e
Reviewed-on: http://gerrit.cloudera.org:8080/11546
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-10-09 00:45:10 +00:00
Tim Armstrong
579e33207b IMPALA-6368: make test_chars parallel
Previously it had to be executed serially because it modified tables in
the functional database.

This change separates out tests that use temporary tables and runs those
in a unique_database.

Testing:
Ran locally in a loop with parallelism of 4 for a while.

Change-Id: I2f62ede90f619b8cebbb1276bab903e7555d9744
Reviewed-on: http://gerrit.cloudera.org:8080/9022
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-19 09:55:52 +00:00
Dimitris Tsirogiannis
c88d179413 IMPALA-1636: Generalize index-based partition pruning to allow constant expressions

This commit enables fast partition pruning for cases where constant
expressions appear in binary or IN predicates. During partition pruning,
the constant expressions are evaluated in the BE and are replaced by the
computed results as LiteralExprs.
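
A minimal sketch of the idea, assuming a value-to-partitions index and tiny
integer expression trees. All names here (`fold`, `prune_eq`, `prune_in`)
are hypothetical, and the folding is done in-process rather than by calling
the BE as the commit describes.

```python
import operator
from typing import Dict, List, Sequence, Tuple, Union

# Hypothetical expression shape: an int literal or (op, lhs, rhs).
Expr = Union[int, Tuple[str, "Expr", "Expr"]]
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold(expr: Expr) -> int:
    """Evaluate a constant expression tree down to a single literal."""
    if isinstance(expr, int):
        return expr
    op, lhs, rhs = expr
    return OPS[op](fold(lhs), fold(rhs))


def prune_eq(index: Dict[int, List[int]], const_expr: Expr) -> List[int]:
    """Partition ids matching `part_col = const_expr` via index lookup."""
    return list(index.get(fold(const_expr), []))


def prune_in(index: Dict[int, List[int]],
             const_exprs: Sequence[Expr]) -> List[int]:
    """Partition ids matching `part_col IN (const_exprs...)`."""
    ids = set()
    for expr in const_exprs:
        ids.update(index.get(fold(expr), []))
    return sorted(ids)
```

Once each constant expression has been replaced by its literal value, the
predicate can be answered with index lookups instead of evaluating it
against every partition.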

Change-Id: Ie8a2accf260391117559dc6c0a565f907c516478
Reviewed-on: http://gerrit.cloudera.org:8080/144
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2015-03-07 09:51:27 +00:00
Alex Behm
c0f2e043b4 Fix exhaustive test runs: Preserve types when substituting root output exprs.
A recent change (3ccee71) to fix resetAnalysisState() of NullLiterals
exposed another bug during exhaustive test runs.
For insert queries into Parquet, the types in the schema of the generated
Parquet files are based on the insert exprs, correctly assuming that
the FE handles all the necessary casting to make sure the Parquet file
schema and the table schema match.
Since we apply an smap on the output exprs towards the end of planning,
NullLiterals were reset to the NULL_TYPE, causing the Parquet schema
to incorrectly have BOOLEAN columns (we cast naked NULL_LITERALS to
BOOLEAN in toThrift()), leading to a mismatch of the Parquet schema
and the table schema. Subsequent queries on such a table failed,
correctly reporting a type mismatch.

The fix is to preserve types when doing the substitution on the output exprs.

Change-Id: I135f1b826b06a6a200df7b73343d2eb1fb4b7b80
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5453
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5455
2014-11-30 01:08:08 -08:00
Nong Li
e2d7fb6402 Some test case cleanup.
Change-Id: Ic29b7c1f5fd714a1e2cc41bf0e55c0d11c782862
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4791
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5090
Reviewed-by: Nong Li <nong@cloudera.com>
2014-11-03 22:33:08 -08:00
Victor Bittorf
7b244d34b6 IMPALA-1344: Fixed analytic aggregations with CHAR
The fix is to only register aggregates for STRING, not for CHAR or VARCHAR. The CHAR
and VARCHAR types are implicitly cast to STRING for aggregation.

Also, fixed aggregate fn builtins that should not ignore distinct.

Change-Id: If4c1a2c6127360c2c8127a5c02949df74fafc85a
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4717
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
2014-10-06 15:16:50 -07:00
Victor Bittorf
a62500ee28 Changed CHAR & VARCHAR max length to match Hive.
Also modified the text of the analysis exception for lengths that are too long or
short because John said they were unclear.

Change-Id: I9427d5c39298aa8207672e50e10fe527c5076599
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4698
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
2014-10-06 15:16:45 -07:00
Victor Bittorf
c29ed3761e IMPALA-1339: NULLs incorrectly hashed in groupby
Problem: hash table assumed all raw values were at most 16 bytes. This maximum was
increased to support up to 128 bytes for CHARs.

Change-Id: I107c58b9a013d5db46ff5586bcdceee3961346e9
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4701
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
2014-10-06 15:16:36 -07:00
Victor Bittorf
d5fd59e2ed IMPALA-1337: Aggregation failures for VARCHAR
The issue is that the aggregation node needed to use IsVarLen; previously
it assumed TYPE_STRING was the only variable length type.

Change-Id: I9545e8d405937a47b25c9042f97854851a448c6e
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4690
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
2014-10-06 15:14:51 -07:00
Victor Bittorf
f4626b03e6 IMPALA-1322: Fix related issue
There is an issue related to IMPALA-1322. The expression list when laying out memory
was being improperly indexed.

Change-Id: I2eef84a812b451d87ecb8afd304e765aff1f5a6b
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4675
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
2014-10-06 15:14:44 -07:00
Victor Bittorf
794e70b0bd Fix CHAR/VARCHAR Aggregation
This fixes an issue where VARCHAR and CHAR could error in some aggregations.
The cause of the problem is that the BE currently does not support CHAR/VARCHAR as
arguments to aggregates; they require an implicit cast to STRING first.
The resolution is to have these operators return STRING instead of CHAR(*) or VARCHAR(*).
Note that the CHAR(*) comparisons still ignore spaces for min/max.

This takes advantage of the fact that STRING, VARCHAR(*), and CHAR(*) values are all
handled as a StringVal for exprs. The STRING aggregates are registered as CHAR(*) and
VARCHAR(*) aggregates and the front end converts the return type to a STRING in all cases.

Also includes a fix for a TODO about casting between CHAR and VARCHAR.

Change-Id: I1d3a9cc48e426286ce63677324a8c680e67b005a
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4573
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
2014-10-06 15:13:17 -07:00
Victor Bittorf
fa502f973a IMPALA-1319: Fixed CHAR padding for numeric casts
IMPALA-1322: Crash on VARCHAR/CHAR join

Fixed 3 issues:
  (1) Disabled codegen for CHAR in hash join equality
  (2) Fixed memory layout for CHAR
  (3) Fixed a regression where space padding could be dropped for numeric casts.
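
To make item (3) concrete, the sketch below models a numeric cast to
CHAR(N), assuming CHAR(N) values are padded with trailing spaces to exactly
N characters and truncated when longer. `cast_to_char` is a hypothetical
illustration, not the backend's cast function.

```python
def cast_to_char(value: int, n: int) -> str:
    """Model CAST(value AS CHAR(n)): truncate to n, then pad with spaces."""
    text = str(value)
    # Dropping the ljust() call would reproduce the kind of regression
    # described above: the result would be shorter than n characters.
    return text[:n].ljust(n)


padded = cast_to_char(10, 5)  # two digits followed by three spaces
```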

Change-Id: I6475fd527ca0d67c7d4d5ec7e561549e43fbc336
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4640
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
2014-10-06 15:12:44 -07:00
Victor Bittorf
658f05f63c IMPALA-1316: crash on VARCHAR join
Fixed codegen issue causing some VARCHAR joins to crash.

Change-Id: Ib2674199a3b2c3c5a5fd63cfae0b64e3b1ca158b
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4616
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
2014-10-06 15:11:10 -07:00
Victor Bittorf
afbc2c28a3 Char Partition Fix
Fixed a bug with CHAR and VARCHAR partition columns. Also disables CHAR and VARCHAR
for UDAs and UDFs.

Change-Id: I67ccd746cb4c063f8a7a984df9564fa9122fdf43
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4493
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-09-26 12:02:54 -07:00
Victor Bittorf
9939c9d009 Bugfix and tests for CHAR(N) and VARCHAR(N)
Fixed a bug when setting the length in reading/writing text files for CHAR(N).
Also added chars_tiny table for testing CHAR(N) and VARCHAR(N).

Change-Id: If5d5db30afa4b00cf03c68c6a845f182970329f4
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4415
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
2014-09-23 07:30:07 -07:00
Victor Bittorf
6289121261 CHAR(N) Followup Patch
This patch addresses:
  1. Char doesn't use codegen
  2. Not in-lining large CHAR(N) for N > 128
  3. Parquet reader/writer for CHAR(N) and VARCHAR(N)

Change-Id: I83a29a8bd312841a3e29bfe2243884074570f247
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4280
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
2014-09-20 16:12:03 -07:00
Victor Bittorf
a1892a17d5 IMPALA-1248: Fixed CHAR(N) in VALUES clause.
Queries like:
INSERT INTO table VALUES (CAST("..." AS CHAR(N)))
Used codegen path and failed; changed to use interpreted path.

Change-Id: Id80274580df268b3f828dec19a2e0b0578061ca8
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4362
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
2014-09-20 16:07:16 -07:00
Victor Bittorf
8bebf2b196 CHAR: adding support for CHAR(N)
Support for CHAR is implemented as a StringVal in the backend.

TODO:
  1. Parquet Reader/writer
  2. Codegen slot ref
  3. Codegen text reader
  4. Don't inline large chars
  5. update impala-hs2-server.cc with CHAR support

Change-Id: Ibba2c89cea971cb740001ea7975bf3e929150471
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4075
Reviewed-by: Nong Li <nong@cloudera.com>
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
2014-09-13 00:19:20 -07:00
Victor Bittorf
2dce31f6c2 Adding VARCHAR front & backend.
VARCHAR is treated as StringVal in the backend. All UDAs and UDFs which accept STRING
will also accept VARCHAR(N).

TODO: Reverted Avro codegen to fix Jenkins; needs separate patch.

Change-Id: Ifc120b6f0fe1f996b11a48b134d339ad3719331e
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/2527
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 3fcbf4f677b8e26c37eded4d8bb628e6fc53c1e9)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4058
2014-08-27 13:52:58 -07:00