Implement a test that generates random decimal numbers in the pytest
framework, performs a random mathematical operation in Impala and
verifies that the result is correct by performing the same operation using
the Python decimal module. We try to generate not only completely random
decimal numbers, but also numbers that have interesting properties, such
as the number being a power of two.
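A minimal sketch of the verification idea (run_impala_query is a
hypothetical helper, not part of the actual test code):
from decimal import Decimal
import random

def random_decimal(precision=38, scale=10):
    # Completely random decimal; the real test also injects values with
    # interesting properties, e.g. exact powers of two.
    whole = random.randrange(10 ** (precision - scale))
    frac = random.randrange(10 ** scale)
    return Decimal("%d.%0*d" % (whole, scale, frac))

a, b = random_decimal(), random_decimal()
expected = a + b  # reference result computed with Python's decimal module
# actual = run_impala_query("select %s + %s" % (a, b))  # hypothetical
# assert Decimal(actual) == expected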
Change-Id: I4328125de5c583ec8ead1f78d9a08703b18b2d85
Reviewed-on: http://gerrit.cloudera.org:8080/8898
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Zach Amsden <zamsden@cloudera.com>
Tested-by: Impala Public Jenkins
This patch maps a signed integer logical type in parquet to a supported
Impala column type. This change introduces the following mapping -
INT_8 -> TINYINT
INT_16 -> SMALLINT
INT_32 -> INT
INT_64 -> BIGINT
Also, added a parquet file with the following schema for testing -
schema {
optional int32 id;
optional int32 tinyint_col (INT_8);
optional int32 smallint_col (INT_16);
optional int32 int_col;
optional int64 bigint_col;
}
Change-Id: I47a8371858c9597c6a440808cf6f933532468927
Reviewed-on: http://gerrit.cloudera.org:8080/8548
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Tianyi Wang <twang@cloudera.com>
Tested-by: Impala Public Jenkins
When materialising a nested collection, has_template_tuple() should use
the template tuple for the collection, not the top-level tuple.
Testing:
Added tests based on nested-types-basic.test that operate on a simple
partitioned table. The tests reliably crashed Impala before the fix.
Change-Id: Ic808b824ce3b31af0539036d8ca23d17b18deab4
Reviewed-on: http://gerrit.cloudera.org:8080/8947
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
In this patch we implement rounding when casting string to decimal if
DECIMAL_V2 is enabled. The backend method that parses strings and
converts them to decimals is refactored to make it easier to understand.
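A sketch of the intended semantics using Python's decimal module (the
exact rounding mode shown here is an assumption, not taken from the
patch):
from decimal import Decimal, ROUND_HALF_UP

# With DECIMAL_V2 enabled, casting the string "1.96" to DECIMAL(2,1)
# rounds to 2.0 rather than truncating to 1.9.
result = Decimal("1.96").quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
assert result == Decimal("2.0")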
Testing:
- Added some BE tests.
Change-Id: Icd8b92727fb384e6ff2d145e4aab7ae5d27db26d
Reviewed-on: http://gerrit.cloudera.org:8080/8774
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins
Verify that query timestamps in the profile have nanosecond
precision.
This commit follows 16d8dd58.
This patch adds a test case that inspects the thrift profile of a
completed query, and verifies that the "Start Time" and
"End Time" of the query have nanosecond precision. We chose to
work with the thrift profile directly, rather than parse the debug
web page, as it is the thrift profile which is consumed by
management API clients of Impala.
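A minimal sketch of such a check (hypothetical accessor names; assumes
the timestamps are rendered with a nine-digit fractional second part):
def has_nanosecond_precision(ts):
    # e.g. "2017-12-06 11:21:34.123456789000" -> True
    fraction = ts.split('.')[-1]
    return len(fraction) >= 9

# profile = get_thrift_profile(query_id)  # hypothetical
# assert has_nanosecond_precision(profile.info_strings["Start Time"])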
Change-Id: Id3421a34cc029ebca551730084c7cbd402d5c109
Reviewed-on: http://gerrit.cloudera.org:8080/8784
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
Modifies COMPUTE STATS TABLESAMPLE to use the new SAMPLED_NDV()
function.
Testing:
- modified/improved existing functional tests
- core/hdfs run passed
Change-Id: I6ec0831f77698695975e45ec0bc0364c765d819b
Reviewed-on: http://gerrit.cloudera.org:8080/8840
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
The current implementation of the rand/random built-in functions
uses rand_r from the C library. We found its randomness to be poor.
pcg32, from a third-party library, shows better randomness than rand_r.
Testing:
Revised the unit tests in expr-test.
Added an E2E test to random.test.
Change-Id: Idafdd5fe7502ff242c76a91a815c565146108684
Reviewed-on: http://gerrit.cloudera.org:8080/8355
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
Adds a new SAMPLED_NDV() aggregate function that is
intended to be used in COMPUTE STATS TABLESAMPLE.
This patch only adds the function itself. Integration
with COMPUTE STATS will come in a separate patch.
SAMPLED_NDV() estimates the number of distinct values (NDV)
based on a sample of data and the corresponding sampling rate.
The main idea is to collect several x/y data points where x is
the number of rows and y is the corresponding NDV estimate.
These data points are used to fit an objective function to the
data such that the true NDV can be extrapolated.
The aggregate function maintains a fixed number of HyperLogLog
intermediates to compute the x/y points.
Several objective functions are fit and the best-fit one is
used for extrapolation.
Adds the MPFIT C library to perform curve fitting:
https://www.physics.wisc.edu/~craigm/idl/cmpfit.html
The library is a C port from Fortran. Scipy uses the
Fortran version of the library for curve fitting.
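A sketch of the extrapolation idea using SciPy, with made-up data and a
single power-law objective (the actual implementation fits several
objective functions with MPFIT in the backend):
import numpy as np
from scipy.optimize import curve_fit

# x: rows scanned, y: NDV estimates from the HyperLogLog intermediates.
x = np.array([1000.0, 2000.0, 4000.0, 8000.0])  # hypothetical sample points
y = np.array([950.0, 1800.0, 3300.0, 5800.0])   # hypothetical NDV estimates

def objective(x, a, b):
    return a * np.power(x, b)  # one candidate objective function

(a, b), _ = curve_fit(objective, x, y)
sampling_rate = 0.01
# Extrapolate to the estimated total row count of the table.
true_ndv_estimate = objective(x[-1] / sampling_rate, a, b)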
Testing:
- added functional tests
- core/hdfs run passed
Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Reviewed-on: http://gerrit.cloudera.org:8080/8569
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
test_profile_fragment_instances was recently added to verify that the
final runtime profile for a query has the expected fragments and exec
nodes. The test fails on local filesystem builds, though, as it
assumes there will be 3 impalads and therefore 3 fragment instances,
but there is only 1 impalad on local filesystem builds.
The fix is to disable the test on local filesystem builds.
Change-Id: I2c98f160406081626f17709809b8efee9eae1450
Reviewed-on: http://gerrit.cloudera.org:8080/8809
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins
test_basic_filters has been occasionally failing due to a line missing
from a runtime profile for a particular query.
The problem is that the query returns all of its results before all of
its fragment instances are finished executing (due to a limit). Then,
when one fragment instance reports its status, the coordinator returns
to it a 'cancelled' status, causing all remaining instances for that
backend to be cancelled.
Sometimes this cancellation happens quickly enough that the relevant
fragment instances have not yet sent a status report when they are
cancelled. They will still send a report in finalize, but as the
coordinator only updates its runtime profile for 'ok' status reports,
not 'cancelled', the final runtime profile doesn't end up with any
data for those fragment instances, which means the test does not find
the line in the runtime profile it's checking for.
The fix is to have the coordinator update its runtime profile with
every status report it receives, regardless of error status.
Testing:
- Ran existing runtime profile tests, which rely on profile output,
in a loop.
- Manually tested some scenarios with failed queries and checked that
the new profile output is reasonable.
- Added a new e2e test that runs the affected query and checks for the
presence of info for all expected exec nodes in the profile. This
repros the underlying issue consistently.
Change-Id: I4f581c7c8039f02a33712515c5bffab942309bba
Reviewed-on: http://gerrit.cloudera.org:8080/8754
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
This change moves the creation of the runtime profile from DataSink::Prepare()
to the ctor of DataSink derived classes. This makes sure that DataSink::Close()
and other functions can access the profile even if the DataSink fails to initialize.
Testing done: Added a test case which triggers failure in the initialization of output
expressions in a HdfsTableSink. Impalad crashed consistently without the fix.
Change-Id: I2a683000ef180027b929dbebe78bc2a530a4767e
Reviewed-on: http://gerrit.cloudera.org:8080/8770
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
Currently, constant expressions for the LHS of the IN predicate
are not supported. This patch adds this support as a rewrite in
StmtRewriter (where subqueries are rewritten to joins). Since
there is a nested-loop variant of left semijoin, support for IN
is handled by not erring out. NOT IN is handled by a rewrite to
corresponding NOT EXISTS predicate. Support for NOT IN with a
correlated subquery is not included in this change.
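For example (illustrative only):
  SELECT count(*) FROM t WHERE 1 IN (SELECT id FROM s)
is evaluated via a (nested-loop) left semijoin against s, while
  SELECT count(*) FROM t WHERE 1 NOT IN (SELECT id FROM s)
is rewritten to the corresponding NOT EXISTS predicate.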
Re-organized the frontend subquery analysis tests to expand coverage.
Testing:
- added frontend subquery analysis tests
- added e2e tests
Change-Id: I0d69889a3c72e90be9d4ccf47d2816819ae32acb
Reviewed-on: http://gerrit.cloudera.org:8080/8322
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
TestRuntimeFilters.test_basic_filters is flaky on ASAN as sometimes
the runtime filters aren't received within the specified
RUNTIME_FILTER_WAIT_TIME_MS.
This patch increases the timeout for ASAN builds.
Change-Id: I8c20cbb75a9b6da73137f220657aa75dea9dfdce
Reviewed-on: http://gerrit.cloudera.org:8080/8646
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
The e2e unit tests for udfs can interact via the backend
lib_cache, causing test flakes. IMPALA-6215 explains a
race between the lib_cache and UdfExecutor in the frontend
which is likely the root cause.
Two e2e tests use the same jar (test_java_udfs and
test_udf_invalid_symbol). test_udf_invalid_symbol drops a
function from that jar, which causes uses of that jar to
fail in the test_java_udfs test. Since the state of the lib_cache
is per-process, it causes these interactions across
unit tests.
This change avoids the interactions by using separate jars for
the separate tests.
Change-Id: Ica3538788b1d2ab5e361261e2ade62780b838e65
Reviewed-on: http://gerrit.cloudera.org:8080/8593
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
Currently, parquet row-groups can be pruned at run-time using
min/max stats when predicates (IN or binary comparison) are specified
on columns of scalar types. This patch extends pruning to nested types
for the same class of predicates. A nested value is an instance
of a nested type (struct, array, map). A nested value consists of
other nested and scalar values (as declared by its type).
Predicates that can be used for row-group pruning must be applied to
nested scalar values. In addition, the parent of the nested scalar
must also be required, that is, not empty. The latter requirement
is conservative: some filters that could be used for pruning are
not used for correctness reasons.
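As a hypothetical illustration:
  SELECT ... FROM customers c, c.orders o WHERE o.amount > 1000
Here the min/max stats for o.amount can prune row groups because the
(implicit inner) join requires c.orders to be non-empty; with a LEFT
OUTER join the parent collection is not required, so using the filter
for pruning would not be safe.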
Testing:
- extended nested-types-parquet-stats e2e test cases.
Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Reviewed-on: http://gerrit.cloudera.org:8080/8480
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
This patch implements min-max filters for runtime filters. Each
runtime filter generates a bloom filter or a min-max filter,
depending on whether it has HDFS or Kudu targets, respectively.
In RuntimeFilterGenerator in the planner, each hash join node
generates a bloom and min-max filter for each equi-join predicate, but
only those filters that end up being assigned to a target make it into
the final plan.
Min-max filters are only assigned to Kudu scans if the target expr is
a column, as Kudu doesn't support bounds on general exprs, and only if
the join op is '=' and not 'is distinct from', as Kudu doesn't support
returning NULLs if a bound is set.
Min-max filters are populated by the PartitionedHashJoinBuilder.
Codegen is used to eliminate branching on the type of filter. String
min-max filters truncate their bounds at 1024 chars, so that the max
amount of memory used by min-max filters is negligible.
For now, min-max filters are only applied at the KuduScanner, which
passes them into the Kudu client.
Future work will address applying min-max filters at HDFS scan nodes
and applying bloom filters at Kudu scan nodes.
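A minimal sketch of the min-max filter idea (a Python pseudo-model, not
the actual C++ implementation):
class MinMaxFilter(object):
    def __init__(self):
        self.min = None
        self.max = None

    def insert(self, val):
        # Called by the join builder for every build-side join key.
        if self.min is None or val < self.min:
            self.min = val
        if self.max is None or val > self.max:
            self.max = val

# Conceptually, the Kudu scanner then adds the bounds to the scan:
#   col >= filter.min AND col <= filter.max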
Functional Testing:
- Added new planner tests and updated the old ones. (In the old tests,
many runtime filters are renumbered, as we now always generate min-max
filters even if they don't end up getting assigned, and they take up
some of the RF ids.)
- Updated existing runtime filter tests to work with Kudu.
- Added e2e tests for min-max filter specific functionality.
Perf Testing:
- All tests run on Kudu stress cluster (10 nodes) and tpch_100_kudu,
timings are averages of 3 runs.
- Ran a contrived query with a filter that does not eliminate any rows
(full self join of lineitem). The difference in running time was
negligible - 24.46s with filters on, 24.15s with filters off for
a ~1% slowdown.
- Ran a contrived query with a filter that eliminates all rows (self
join on lineitem with a join condition that never matches). The
filters resulted in a significant speedup - 0.26s with filters on,
1.46s with filters off for a ~5.6x speedup. This query is added to
targeted-perf.
Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Reviewed-on: http://gerrit.cloudera.org:8080/7793
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
Switch the decoders to using more batch-oriented interfaces. As an
intermediate step this doesn't make the interfaces of LevelDecoder
or DictDecoder batch-oriented, only the lower-level utility classes.
The next step would be to change those interfaces to be batch-oriented
and make corresponding optimisations in parquet. This could deliver much
larger perf improvements than the current patch.
The high-level changes are:
* BitReader -> BatchedBitReader, which is built to unpack runs of 32
bit-packed values efficiently.
* RleDecoder -> RleBatchDecoder, which exposes the repeated and literal
runs to the caller and uses BatchedBitReader to unpack literal runs
efficiently.
* Dict decoding uses RleBatchDecoder to decode repeated runs efficiently
and uses the BitPacking utilities to unpack and decode in a single
step (a sketch of the batched unpacking idea follows below).
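A simplified sketch of LSB-first batched bit-unpacking (the real code
unpacks runs of 32 values with templated C++; this only shows the idea):
def unpack_batch(data, bit_width, num_values):
    # Unpack 'num_values' bit-packed integers of 'bit_width' bits each
    # from the byte string 'data', least-significant bit first.
    out, buf, bits = [], 0, 0
    mask = (1 << bit_width) - 1
    for byte in bytearray(data):
        buf |= byte << bits
        bits += 8
        while bits >= bit_width and len(out) < num_values:
            out.append(buf & mask)
            buf >>= bit_width
            bits -= bit_width
    return out

assert unpack_batch(b'\x88\xc6\xfa', 3, 8) == [0, 1, 2, 3, 4, 5, 6, 7]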
Also removes an older benchmark that isn't too interesting (since
the batch-oriented approach to encoding and decoding is so much
faster than the value-by-value approach).
Testing:
* Ran core tests.
* Updated unit tests to exercise new code.
* Added test coverage for the deprecated bit-packed level encoding to
verify that it still works (there was no coverage previously).
Perf:
Single-node benchmarks showed a few % performance gain. 16 node cluster
benchmarks only showed a gain for TPC-H nested.
Change-Id: I35de0cf80c86f501c4a39270afc8fb8111552ac6
Reviewed-on: http://gerrit.cloudera.org:8080/8267
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Testing:
Previously I needed ~20 iterations to get the test to fail on my local
machine. After these changes I haven't been able to reproduce the
failure.
Change-Id: I2bea7b0f770dec362a6df075da4e340402bd1d5d
Reviewed-on: http://gerrit.cloudera.org:8080/8562
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
IMPALA-5546 added the ability to create unpartitioned Kudu tables, but
when SHOW CREATE TABLE is run on such a table it still prints
'PARTITION BY', just without a partition clause. This patch removes the
'PARTITION BY' from the output.
Testing:
- Added test that runs SHOW CREATE on an unpartitioned Kudu table.
Change-Id: Icc327266cfb8b5c05efec97348528cea6904bb20
Reviewed-on: http://gerrit.cloudera.org:8080/8506
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Extends the parquet column reader and associated classes to allow more
than one possible physical type for a given logical type. This patch
only adds support for variable-sized byte-array-encoded decimals;
more will be added in upcoming commits.
Also, column-level metadata verification, which was previously being
done per row group, will now only be done once per column per file.
Testing:
Added backend test for verifying newly added decimal types are decoded
correctly.
Added a query test that decodes both plain and dictionary-encoded
decimals using binary encoding.
Performance:
Initial perf testing using tpcds_1000 shows no regression.
Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Reviewed-on: http://gerrit.cloudera.org:8080/7822
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Impala Public Jenkins
test_wait_time has been flaky recently on ASAN due to hitting a
timeout. The fix is to increase the timeout for ASAN builds.
Change-Id: Iee005bee8e0a535ce59d2e23e56be6004f2eb9de
Reviewed-on: http://gerrit.cloudera.org:8080/8427
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
We don't know the root cause yet but try to improve things:
* Eliminate one possible cause of flakiness - unfinished fragments left
from previous queries.
* Print a profile if an assertion fails so we can see why it failed.
Testing:
Ran core tests.
Change-Id: Ic332dddd96931db807abb960db43b99e5fd0f256
Reviewed-on: http://gerrit.cloudera.org:8080/8403
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
'Test case 16' in test_row_filters has been failing occasionally on
ASAN as the runtime filters are not generated within the specified
RUNTIME_FILTER_WAIT_TIME_MS. The fix is to increase
RUNTIME_FILTER_WAIT_TIME_MS.
This patch updates all of the tests in test_row_filters to use the
same timeout, which is set to a higher value for ASAN builds.
Change-Id: Ia098735594b36a72f02bf7edd051171689618051
Reviewed-on: http://gerrit.cloudera.org:8080/8358
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Testing:
Added test case to verify that CopyRows in select node is successfully
codegened.
Improved test coverage for select node with limit.
Performance:
Queries used (num_nodes set to 1):
500 Predicates: select * from (select * from tpch_parquet.lineitem
limit 6001215) t1 where l_partkey > 10 and l_extendedprice > 10000 and
l_linenumber > 1 and l_comment >'foo0' .... and l_comment >'foo500'
order by l_orderkey limit 10;
1 Predicate: select * from (select * from tpch_parquet.lineitem
limit 6001215) t1 where l_partkey > 10 order by l_orderkey limit 10;
+--------------+-----------------------------------------------------+
| | 500 Predicates | 1 Predicate |
| +------------+-------------+------------+-------------+
| | After | Before | After | Before |
+--------------+------------+-------------+------------+-------------+
| Select Node | 12s385ms | 1m1s | 234ms | 797ms |
| Codegen time | 2s619ms | 1s962ms | 200ms | 181ms |
+--------------+------------+-------------+------------+-------------+
Change-Id: Ie0d496d004418468e16b6f564f90f45ebbf87c1e
Reviewed-on: http://gerrit.cloudera.org:8080/8196
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Impala Public Jenkins
Main source for TPCDS query and result definitions: https://github.com/gregrahn/tpcds-kit.
TPC-DS v2.5.0 qualification queries from G. Rahn, Cloudera, Inc.
Data set constructed in mini-cluster using $IMPALA_HOME/buildall.sh -testdata....
This commit continues previous work on IMPALA-5376 in the ASF Impala repo
and the Cloudera Gerrit service.
This commit splits multi-query tests in the TPC-DS suite definition into one
query and result set per test file, as the test framework requires. Names for
such files have -1, -2... inner suffixes.
The portion of the TPC-DS test suite in this commit passes.
It contains no failures, as reflected by runs of
$IMPALA_HOME/tests/run-tests.py query_test/test_tpcds_queries.py ...
IMPALA-6007 addresses the TPC-DS cases that require skipping (because we don't
support them or they flap) or expected-failure (xfail, because we support them
but they fail due to bugs). These require some added tooling for non-Pytest
frameworks like the stress test to avoid attempting them until they work.
Tests that flap are marked to skip, with a bug ID, since they don't reliably pass or xfail.
Expected result sets come from the TPC-DS kit. Some TPC-DS test cases
in this commit have been modified in semantically neutral ways so as to pass
on Impala.
The tests/query_test/test_tpcds_queries.py driver file is authoritative for the
active/skip/xfail status for each case and a brief reason. The following list
describes the current status as:
--- test-name
deviance from TPC-DS spec
changes made
--- tpcds-q22a.test
RESULT MISMATCH in LSD of AVG() values
FIXED, HAND_ROUNDED AVG() VALUES IN RESULT SET
--- tpcds-q26.test
RESULT MISMATCH in LSD of AVG() values
ABSENT, IMPALA-6087
--- tpcds-q28.test
RESULT MISMATCH in LSD of AVG() values
ABSENT, IMPALA-6087
--- tpcds-q30.test
UNRECOGNIZED CHARACTER
ABSENT, IMPALA-5961.
--- tpcds-q31.test
RESULT MISMATCH in LSD of DECIMAL values
ABSENT, IMPALA-5956.
--- tpcds-q35a.test
RESULT MISMATCH
ABSENT, IMPALA-5950.
--- tpcds-q36a.test
RESULT MISMATCH
ABSENT, IMPALA-4741
--- tpcds-q47.test
RESULT MISMATCH in LSD of DECIMAL values
ABSENT, IMPALA-6087
--- tpcds-q48.test
RESULT MISMATCH in scalar value
ABSENT, IMPALA-5950.
--- tpcds-q49.test
RESULT MISMATCH in LSD of DECIMAL values
ABSENT, IMPALA-5945
--- tpcds-q57.test
RESULT MISMATCH, excess scale in DECIMAL values
ABSENT, IMPALA-6087
--- tpcds-q58.test
RESULT MISMATCH in DECIMAL values
ABSENT, IMPALA-5946
--- tpcds-q59.test
RESULT MISMATCH, excess scale in DECIMAL values
ABSENT, IMPALA-6087
--- tpcds-q61.test
RESULT MISMATCH in DECIMAL value
FIXED. CAST RESULT QUOTIENT TO DECIMAL(15, 4), TAKE ACTUAL RESULT AS EXPECTED
--- tpcds-q63.test
RESULT MISMATCH, excess scale in DECIMAL values
ABSENT, IMPALA-6087
--- tpcds-q64.test
RESULT MISMATCH
ADDED ORDER BY COLUMNS.
--- tpcds-q66.test
RESULT MISMATCH
ABSENT, IMPALA-4741
--- tpcds-q77a.test
RESULT MISMATCH
FIXED. TAKE ACTUAL RESULT AS EXPECTED
--- tpcds-q78.test
RESULT MISMATCH
FIXED. TAKE ACTUAL RESULT AS EXPECTED
--- tpcds-q83.test
RESULT MISMATCH
ABSENT, IMPALA-5945.
--- tpcds-q85.test
MISSING TABLE "reason"
ABSENT, IMPALA-5960
--- tpcds-q86a.test
RESULT MISMATCH
FIXED. TAKE ACTUAL RESULT AS EXPECTED
--- tpcds-q89.test
RESULT MISMATCH, DECIMAL values flap
ABSENT, ADDED ROUND(2) TO 8th COLUMN, TAKE ACTUAL RESULTS AS EXPECTED, IMPALA-5956.
--- tpcds-q90.test
RESULT MISMATCH
ABSENT, IMPALA-5945.
--- tpcds-q93.test
MISSING TABLE "reason"
ABSENT, IMPALA-5960
--- tpcds-q98.test
RESULT MISMATCH
FIXED, ADDED ROUND() TO LAST COLUMN
Change-Id: I6e284888600a7a69d1f23fcb7dac21cbb13b7d66
Reviewed-on: http://gerrit.cloudera.org:8080/8102
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
This patch adds an always_false flag in bloom filters. The flag is set
if nothing has been inserted into the bloom filter. HdfsScanner uses
this flag to early terminate the scan at file and split granularities.
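A minimal sketch of how the flag short-circuits lookups (not the actual
C++ code):
class BloomFilter(object):
    def __init__(self, num_bytes):
        self.directory = bytearray(num_bytes)
        self.always_false = True  # nothing inserted yet

    def insert(self, h):
        self.always_false = False
        # Simplified: the real filter sets several bits per insert.
        self.directory[h % len(self.directory)] |= 1

    def find(self, h):
        if self.always_false:
            # Lets HdfsScanner skip whole files and splits early.
            return False
        return bool(self.directory[h % len(self.directory)] & 1)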
Testing: It passes existing tests. Two test cases are added checking
that an always-false runtime filter can filter out files and splits.
In single node perf tests, time spent on primitive_empty_build_join_1
is reduced by 75%.
Change-Id: If680240a3cd4583fc97c3192177d86d9567c4f8d
Reviewed-on: http://gerrit.cloudera.org:8080/8170
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
A recent commit, "IMPALA-5448: fix invalid number of splits reported in
Parquet scan node", neglected to account for the fact that in some
environments Impala runs without Hive. The typical pattern for tests
that use Hive is to skip them if they are executed against such
environments.
Change-Id: I3ad4b72839f8ac3bcb824287d02dd6964eea3e3e
Reviewed-on: http://gerrit.cloudera.org:8080/8259
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
Parquet splits with multiple columns are marked as completed using
HdfsScanNodeBase::RangeComplete(), which counted the file type once per
column codec type. Thus the reported number of parquet splits was the
real count multiplied by the number of materialized columns.
Furthermore, the Parquet definition allows mixed compression codecs on
different columns. This is handled in this patch as well. A parquet file
using both gzip and snappy compression codecs will be reported as:
FileFormats: PARQUET/(GZIP,SNAPPY):1
This patch introduces a set of compression types for the above cases.
Testing:
Added end-to-end tests covering parquet files with all columns compressed
with snappy, and parquet files with multiple compression codecs.
Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Reviewed-on: http://gerrit.cloudera.org:8080/8147
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
test_scanners_fuzz.py currently tests compressed parquet but
does not test uncompressed parquet. This fix adds a new test
case for uncompressed parquet.
Testing
-------
Ran query_test/test_scanners_fuzz.py in a loop (5 times)
and no impalad crash was seen.
Change-Id: I760de7203a51cf82b16016fa8043cadc7c8325bc
Reviewed-on: http://gerrit.cloudera.org:8080/8056
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Impala tries to always store column names in lower case. As part of a
cleanup of issues related to upper case Kudu column names, a check was
added in Analyzer to enforce this.
The check fails when doing star expansion on a struct to select all
fields in the case where a table was created in Hive with upper case
letters in a struct field name. This happens because Hive does not
convert struct field names to all lower case in HMS.
The solution is to force StructField names to lower case.
Testing:
- Added a test in test_nested_types.py
- Fixed FE test that expected struct field to be output in upper case.
Change-Id: Iacd9714ac2301a55ee8b64f0102f6f156fb0370e
Reviewed-on: http://gerrit.cloudera.org:8080/8169
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
No tests were added/dropped or modified. They are consolidated into
fewer .test files.
Change-Id: Idda4b34b5e6e9b5012b177a4c00077aa7fec394c
Reviewed-on: http://gerrit.cloudera.org:8080/8153
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
If a scan range is skipped at runtime, the scan node skips reading
the range and never figures out the underlying compression codec used
to compress the files. In such a scenario we default the compression
codec to NONE, which can be misleading. This change marks these files
as filtered in the scan node profile, e.g.:
File Formats: TEXT/NONE:364 TEXT/NONE(Skipped):1460
Change-Id: I797916505f62e568f4159e07099481b8ff571da2
Reviewed-on: http://gerrit.cloudera.org:8080/7245
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
This patch keeps test_rows_availability from randomly failing. In this test
the time interval between the 'Rows available' timeline event and the
previous event in the runtime profile is measured in order to make sure
that the rows become available after a specific amount of time. This
measurement is not correct since the previous event is that the
coordinator finished sending the query fragments to the backends, which
means the execution on some backends might have already started. This
patch tracks another event "Ready to start" as the beginning of the time
interval instead. The coordinator begins to send the query fragments to
the backends after this event so the time check should always pass.
Change-Id: I96142f1868a26426cbc34aa9a0e0a56979df66c3
Reviewed-on: http://gerrit.cloudera.org:8080/8036
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Having the repetition level set to REPEATED on the root schema
caused a scan to fail with an error when Impala tried to parse the
table.
As a solution, the 'REPEATED' repetition level is ignored when the
root schema is processed. The reasoning behind this is that the Parquet
format description says that the repetition level of the root schema
should not be set to REPEATED anyway, so it's safe to ignore it in
case it is set to this value for some reason.
Change-Id: I7ea84589e1d122ad9d43adde46893ec0ecc5f9c4
Reviewed-on: http://gerrit.cloudera.org:8080/7870
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
The concurrent test driver did not pick them up because the name
prefix did not match the workload dirname. The query test driver
used a hardcoded prefix.
Testing done: Ran tests/stress/concurrent_select.py,
tests/query_test/test_tpch_nested_queries.py locally; latter
passed, former hit IMPALA-5855 after correctly locating all 22
new tpch_nested query files.
Change-Id: Ie067b201ae20b4f4c61a98be7ac1ec5a3f8febd8
Reviewed-on: http://gerrit.cloudera.org:8080/7891
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: David Knupp <dknupp@cloudera.com>
Tested-by: Impala Public Jenkins
Add a targeted test that confirms that setting the query option will
force spilling.
Testing:
Ran test_spilling locally.
Change-Id: Ida6b55b2dee0779b1739af5d75943518ec40d6ce
Reviewed-on: http://gerrit.cloudera.org:8080/7809
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
We check that the number/types of columns in a Kudu DML match the
underlying table during analysis. However, it's possible that the
schema may be modified between analysis and execution, and if it's
modified in incompatible ways it can cause Impala to crash.
Once the KuduTable object has been opened by the KuduTableSink, its
schema will remain the same, so we can check in Open() that the schema
is what we're expecting.
If the schema changes between Open() and when the WriteOp is sent to Kudu,
Kudu will send back an error, which we already handle gracefully.
Testing:
- Added an e2e test that concurrently inserts into a Kudu table while
dropping and then adding a column. It relies on timing, but running
in a loop locally it caused Impala to crash every time without this
change.
Change-Id: I9fd6bf164310df0041144f75f5ee722665e9f587
Reviewed-on: http://gerrit.cloudera.org:8080/7688
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
* Test for disable_unsafe_spills
* Test for buffer size > I/O size (--read_size)
Change-Id: I03de00394bb6bbcf381250f816e22a4b987f1135
Reviewed-on: http://gerrit.cloudera.org:8080/7787
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
IMPALA-4672: Part 2 regressed NAAJ by tightening up the spilling
invariants (e.g. can't unpin with spilling disabled) but we
didn't have tests for spilling NAAJs that could detect the
regression. This patch adds those tests, fixes the regressions,
and improves NAAJ by reliably spilling the probe side and not
trying to bring the whole probe side into memory.
The changes are:
* All null-aware streams start off in memory and are only unpinned if
spilling is enabled.
* The null-aware build partition can be spilled in the same way as hash
partitions.
* Probe streams are unpinned whenever there is memory pressure - if
spilling is enabled and either a build partition is spilled or
appending to the probe stream fails.
* Spilled probe streams are not re-pinned in EvaluateNullProbe().
Instead we just iterate over the rows of the stream.
Testing:
Add query tests where the three different buckets of rows are large
enough to spill: the build and probe of the null-aware partition and the
null probe rows.
Test both spilling and in-memory (with spilling disabled) cases.
Change-Id: Ie2e60eb4dd32bd287a31479a6232400df65964c1
Reviewed-on: http://gerrit.cloudera.org:8080/7367
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This separation will help the user better understand the query
runtime profile.
Testing:
Modified an existing test case.
Change-Id: Ibfc7832963fa0bd278a45c06a5a54e1bf40d8876
Reviewed-on: http://gerrit.cloudera.org:8080/7721
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
Adds support for a "max_row_size" query option that instructs Impala
to reserve enough memory to process rows of the specified size. For
spilling operators, the planner reserves enough memory to process
rows of this size. The advantage of this compared to simply
specifying larger values for min_spillable_buffer_size and
default_spillable_buffer_size is that operators may be able to
handle larger rows without increasing the size of all their
buffers.
The default value is 512KB. I picked that number because it doesn't
increase minimum reservations *too* much even with smaller buffers
like 64kb but should be large enough for almost all reasonable
workloads.
This is implemented in the aggs and joins using the variable page size
support added to BufferedTupleStream in an earlier commit. The synopsis
is that each stream requires reservation for one default-sized page
per read and write iterator, and temporarily requires reservation
for a max-sized page when reading or writing larger pages. The
max-sized write reservation is released immediately after the row
is appended and the max-size read reservation is released after
advancing to the next row.
The sorter and analytic simply use max-sized buffers for all pages
in the stream.
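For example, a session expecting unusually wide rows might run (the
value syntax here is illustrative):
  SET MAX_ROW_SIZE=1mb;
before the query, so the spilling operators reserve buffers large
enough for such rows.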
Testing:
Updated existing planner tests to reflect default max_row_size. Added
new planner tests to test the effect of the query option.
Added "set" test to check validation of query option.
Added end-to-end tests exercising spilling operators with large rows
with and without spilling induced by SET_DENY_RESERVATION_PROBABILITY.
Change-Id: Ic70f6dddbcef124bb4b329ffa2e42a74a1826570
Reviewed-on: http://gerrit.cloudera.org:8080/7629
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Make the call to format() compatible with older versions of Python
(< 2.7), which expect explicit indices in the string being formatted,
e.g. "{0} {1} {2}".format('foo', 'bar', 'baz'). Without the numbers,
format() raises an exception.
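For example:
# Works on Python 2.6 and later:
ok = "{0} {1} {2}".format('foo', 'bar', 'baz')

# Raises "ValueError: zero length field name in format" on Python 2.6:
bad = "{} {} {}".format('foo', 'bar', 'baz')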
Tested by running this test suite using python 2.6.6. Before the
patch, the tests failed. After the patch, they pass.
Change-Id: I5384aaf83a6a1f3c7643ed9f15de2dba1a5913a5
Reviewed-on: http://gerrit.cloudera.org:8080/7761
Reviewed-by: David Knupp <dknupp@cloudera.com>
Tested-by: Impala Public Jenkins
Rejects queries during admission control if (see the sketch after this
list):
* the largest (across all backends) min buffer reservation is
greater than the query mem_limit or buffer_pool_limit
* the sum of the min buffer reservations across the cluster
is larger than the pool max mem resources
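A sketch of the two rejection checks (simplified pseudo-logic, not the
actual code):
def admission_check(min_reservation_per_backend, mem_limit,
                    buffer_pool_limit, pool_max_mem_resources):
    largest = max(min_reservation_per_backend.values())
    total = sum(min_reservation_per_backend.values())
    for limit in (mem_limit, buffer_pool_limit):
        if limit is not None and largest > limit:
            return "REJECTED: min reservation exceeds query memory limit"
    if pool_max_mem_resources is not None and total > pool_max_mem_resources:
        return "REJECTED: sum of min reservations exceeds pool max mem"
    return None  # admitted, subject to the other existing checks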
There are some other interesting cases to consider later:
* every per-backend min buffer reservation is less than the
associated backend's process mem_limit; the current
admission control code doesn't know about other backends'
proc mem_limits.
Also reduces minimum non-reservation memory (IMPALA-5810).
See the JIRA for experimental results that show this
slightly improves min memory requirements for small queries.
One reason to tweak this is to compensate for the fact that
BufferedBlockMgr didn't count small buffers against the
BlockMgr limit, but BufferPool counts all buffers against
it.
Testing:
* Adds new test cases in test_admission_controller.py
* Adds BE tests in reservation-tracker-test for the
reservation-util code.
Change-Id: Iabe87ce8f460356cfe4d1be4d7092c5900f9d79b
Reviewed-on: http://gerrit.cloudera.org:8080/7678
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
In the text scanner, we were writing the partial tuple variable-length
data to the data_buffer_pool_ mempool, which caused strange behavior,
such as incorrect results.
If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.
We fix the problem by writing the varlen partial tuple data to
boundary_pool, which is where the constant length partial tuple data is
written. We also make sure that boundary pool does not hold any tuple
data of returned batches by always deep copying it to output batches.
Testing:
- Ran some tests locally on ASAN build.
- Updated test_scanners_fuzz.py to make slightly more significant
changes to the data files. This change was helpful for finding
issues while developing this patch.
Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Reviewed-on: http://gerrit.cloudera.org:8080/7639
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins
When an in-memory blocking aggregation or join is in the GetNext()
phase, where it is outputting accumulated rows, we expect
memory consumption to monotonically decrease because no more
rows will be accumulated in memory.
This change adds support to release unused reservation and makes
use of it for in-memory aggregations and sorts.
We don't release memory for operators with spilled data, since they
may need the reservation to bring it back into memory. We also
don't release memory in subplans, since it will probably be used
in a later iteration of the subplan.
Testing:
Updated spilling test that now requires less memory.
Ran stress test binary search on tpch_parquet. No changes, except
Q18 now requires 325MB instead of 450MB to execute without spilling.
Ran query with two sorts in the same pipeline and watched /memz to
confirm that the first node in the pipeline was incrementally releasing
memory. Added a regression test based on this experiment.
Added a backend test to directly test reservation decreasing.
Change-Id: I6f4d0ad127d5fcd14b9821a7c127eec11d98692f
Reviewed-on: http://gerrit.cloudera.org:8080/7619
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This reduces the scratch limit to the same value as used in
TestScratchDisk.
Change-Id: If5c42b6ded44d86c3a430a983096f14c0b88a287
Reviewed-on: http://gerrit.cloudera.org:8080/7664
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Impala Public Jenkins
Revert commit 3059024bd8 for
IMPALA-4795: Allow fetching function obj from catalog using
signature
This commit seems to cause TestUdfExecution.test_java_udfs
to fail periodically.
IMPALA-4795 wasn't a critical fix, so let's just revert it
until we know we can fix the flaky test at the same time.
Change-Id: Iae56a75e8ec44af6dae50f18869a486e5f8b608c
Reviewed-on: http://gerrit.cloudera.org:8080/7616
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Impala Public Jenkins
This adds most of the end-to-end tests described in the test plan.
See http://goo.gl/v3Strz.
* End-to-end test for disk spill encryption.
* Admission control test for the case when acquiring initial
reservation fails.
* Initial reservation acquire failure test
* scratch_limit tests for Join, Agg, Sort, Analytic
* Memory usage scaling tests for Join, Agg, Sort, Analytic
Also splits out the slow sort queries in test_spilling and moves them
to exhaustive so the individual tests run faster and have better
parallelism.
Testing:
Ran all the core tests. Will do a full exhaustive run before
committing.
Change-Id: I554aa5ddfef4f8e75295596e720a14eee1afa17f
Reviewed-on: http://gerrit.cloudera.org:8080/7552
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Add debug action to deny reservation increases with some probability.
This allows us to test various scenarios, particularly:
* The case when the node only gets its initial reservation and must
run to completion without increasing its reservation.
* The case when there is some memory pressure and the node sometimes
gets a reservation increase and sometimes doesn't.
E.g. to deny all reservation requests after an ExecNode has opened:
set debug_action=-1:OPEN:SET_DENY_RESERVATION_PROBABILITY@1.0
This was applied to test_spilling. It caught a bug in the PAGG
with spilling string aggregations.
This required some minor extensions to the debug actions.
* Allow debug actions that apply to all ExecNodes if node_id is -1.
* Allow passing parameters to debug actions. The current grammar of the
actions is not well-oriented towards extension, so I resorted to using
@ as a new delimiter.
I also optimised ExecDebugAction() so that it is much faster in the
common case and extended --disable_mem_pools to prevent the buffer pool
from holding onto unused buffers.
Change-Id: Ied39bb091b12156e5dc61b528c6c0cd8de3fe657
Reviewed-on: http://gerrit.cloudera.org:8080/7022
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins