Commit Graph

1878 Commits

Author SHA1 Message Date
poojanilangekar
c6f9b61ec2 IMPALA-6625: Skip computing parquet conjuncts for non-Parquet scans
This change ensures that the planner computes Parquet conjuncts
only for scans containing Parquet files. Additionally, it
handles the PARQUET_DICTIONARY_FILTERING and
PARQUET_READ_STATISTICS query options in the planner.

Testing was carried out independently on parquet and non-parquet
scans:
  1. Parquet scans were tested via the existing parquet-filtering
     planner test. Additionally, a new test
     [parquet-filtering-disabled] was added to ensure that the
     generated explain plan skips Parquet predicates based on the
     query options.
  2. Non-parquet scans were tested manually to ensure that the
     functions to compute parquet conjuncts were not invoked.
     Additional test cases were added to the parquet-filtering
     planner test to scan non-Parquet tables and ensure that the
     plans do not contain conjuncts based on parquet statistics.
  3. A parquet partition was added to the alltypesmixedformat
     table in the functional database. Planner tests were added
     to ensure that Parquet conjuncts are constructed only when
     the Parquet partition is included in the query.

Change-Id: I9d6c26d42db090c8a15c602f6419ad6399c329e7
Reviewed-on: http://gerrit.cloudera.org:8080/10704
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-07-06 02:06:50 +00:00
Tianyi Wang
61e6a47776 IMPALA-7236: Fix the parsing of ALLOW_ERASURE_CODED_FILES
This patch adds a missing "break" statement in a switch statement
changed by IMPALA-7102.
Also fixes a non-deterministic test case.
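
For illustration, a minimal C++ sketch of this bug class (hypothetical
option names, not Impala's actual option-parsing code): without the
"break", control falls through into the next case's handler.

    #include <iostream>
    #include <string>

    enum class Option { ALLOW_ERASURE_CODED_FILES, SOME_OTHER_OPTION };

    void SetOption(Option opt, const std::string& value) {
      switch (opt) {
        case Option::ALLOW_ERASURE_CODED_FILES:
          std::cout << "allow_erasure_coded_files=" << value << "\n";
          break;  // the missing statement; without it, the next case runs too
        case Option::SOME_OTHER_OPTION:
          std::cout << "some_other_option=" << value << "\n";
          break;
      }
    }

    int main() {
      SetOption(Option::ALLOW_ERASURE_CODED_FILES, "true");
      return 0;
    }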

Change-Id: Ife1e791541e3f4fed6bec00945390c7d7681e824
Reviewed-on: http://gerrit.cloudera.org:8080/10857
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-07-03 23:49:44 +00:00
Bikramjeet Vig
30e82c63ec IMPALA-7190: Remove unsupported format writer support
This patch removes write support for unsupported formats like Sequence,
Avro and compressed text. Also, the related query options
ALLOW_UNSUPPORTED_FORMATS and SEQ_COMPRESSION_MODE have been migrated
to the REMOVED query options type.

Testing:
Ran exhaustive build.

Change-Id: I821dc7495a901f1658daa500daf3791b386c7185
Reviewed-on: http://gerrit.cloudera.org:8080/10823
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-07-03 20:34:27 +00:00
Taras Bobrovytsky
8060f4d50e IMPALA-7102 (Part 1): Disable reading of erasure coding by default
In this patch we add a query option, ALLOW_ERASURE_CODED_FILES, that
allows us to enable or disable support for erasure-coded files. Even
though Impala should already be able to handle HDFS erasure-coded files,
this feature hasn't been tested thoroughly yet. Also, Impala lacks
metrics, observability and DDL commands related to erasure coding. This
is a query option instead of a startup flag because we want to make it
possible for advanced users to enable the feature.

We may also need a follow-on patch to disable the write path with
this flag.

Cherry-picks: not for 2.x

Change-Id: Icd3b1754541262467a6e67068b0b447882a40fb3
Reviewed-on: http://gerrit.cloudera.org:8080/10646
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-29 23:26:35 +00:00
poojanilangekar
e988c36bf0 IMPALA-6305: Allow column definitions in ALTER VIEW
This change adds support for changing column definitions in ALTER VIEW
statements. It only required minor changes in the parser
and the AlterViewStmt constructor.

Here's an example syntax:
    alter view foo (a, b comment 'helloworld') as
    select * from bar;

    describe foo;
    +------+--------+------------+
    | name | type   | comment    |
    +------+--------+------------+
    | a    | string |            |
    | b    | string | helloworld |
    +------+--------+------------+

The following tests were modified:
1. ParserTest - To check that the parser handles column definitions
   for ALTER VIEW statements.
2. AnalyzerDDLTest - To ensure that the analyzer supports the
   parsed column definition changes.
3. TestDdlStatements - To verify the end-to-end functioning of
   ALTER VIEW statements with changed column definitions.
4. AuthorizationTest - To ensure that ALTER VIEW statements with
   column definitions check permissions as expected.

Change-Id: I6073444a814a24d97e80df15fcd39be2812f63fc
Reviewed-on: http://gerrit.cloudera.org:8080/10720
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-27 03:41:47 +00:00
Joe McDonnell
c8bfcbd6e8 IMPALA-7200: Fix missing FILESYSTEM_PREFIX hitting local dataload
As part of IMPALA-3307, we copy a time-zone database
into HDFS. This command fails on the local filesystem
due to a missing FILESYSTEM_PREFIX.

This adds FILESYSTEM_PREFIX for this command.

Change-Id: I972192f22943baef6043a4c9db54d5d48089ea9d
Reviewed-on: http://gerrit.cloudera.org:8080/10803
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-23 05:04:11 +00:00
Fredy Wijaya
92292e79f0 IMPALA-7180: Pin Impala CDH dependencies
For IMPALA_MINICLUSTER_PROFILE=3 (Hadoop 3.x components), pin the
CDH dependencies by storing the CDH tarballs and Maven repository
in S3. This solves the issue of build coherency between the CDH
tarballs and Maven dependencies.

For IMPALA_MINICLUSTER_PROFILE=2 (Hadoop 2.x components), pin the
CDH dependencies by storing only the CDH tarballs in S3. The Maven
repository will still use https://repository.cloudera.com, so there
is still a possibility of a build coherency issue.

For each CDH dependency, there is a unique build number in each repository
URL to indicate the build that created those CDH dependencies.
This information can be useful for debugging issues related to CDH
dependencies.

This patch introduces CDH_DOWNLOAD_HOST and CDH_BUILD_NUMBER environment
variables that can be overridden, which can be useful for running an
integration job.

This patch also fixes issues with Hadoop dependencies that transitively
depend on snapshot versions of dependencies that no longer exist, i.e.
- net.minidev:json-smart:2.3-SNAPSHOT (HADOOP-14903)
- org.glassfish:javax.el:3.0.1-b06-SNAPSHOT
The fix is to force the dependencies by using the released versions of
those dependencies.

Testing:
- Ran all core tests on IMPALA_MINICLUSTER_PROFILE=2 and
  IMPALA_MINICLUSTER_PROFILE=3

Cherry-picks: not for 2.x

Change-Id: I66c0dcb8abdd0d187490a761f129cda3b3500990
Reviewed-on: http://gerrit.cloudera.org:8080/10748
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-23 01:46:40 +00:00
Attila Jeges
17749dbcfc IMPALA-3307: Add support for IANA time-zone db
Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.

Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
  Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
  changes.
- Time-zone database is not updated on a regular basis.

Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
  some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
  Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
  performance degradation.

In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.

This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
  time-zone conversions (see the sketch after this list).

- Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to
  specify an HDFS/S3/ADLS path to a zip archive that contains the
  shared compiled IANA time-zone database. If the startup flag is set,
  impalad will use the specified time-zone database. Otherwise,
  impalad will use the default /usr/share/zoneinfo time-zone database.

- Introduces a new startup flag (--hdfs_zone_alias_conf) to impalad to
  specify an HDFS/S3/ADLS path to a shared config file that contains
  definitions for non-standard time-zone aliases.

- impalad reads the entire time-zone database into an in-memory
  map on startup for fast lookups.

- The name of the coordinator node’s local time-zone is saved to the
  query context when preparing query execution. This time-zone is used
  whenever the current time-zone is referred to afterwards in an
  execution node.

- Adds a new ZipUtil class to extract files from a zip archive. The
  implementation is not vulnerable to Zip Slip.
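
As a rough illustration of what a CCTZ-based conversion looks like (a
minimal C++ sketch against CCTZ's public API, not Impala's wrapper
code):

    #include <chrono>
    #include <iostream>
    #include "cctz/time_zone.h"

    int main() {
      // Look up a zone by its IANA name. In impalad the lookup would be
      // served from the in-memory database described above.
      cctz::time_zone tz;
      if (!cctz::load_time_zone("America/Los_Angeles", &tz)) return 1;
      const auto now = std::chrono::system_clock::now();
      std::cout << cctz::format("%Y-%m-%d %H:%M:%S %z", now, tz) << "\n";
      return 0;
    }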

Cherry-picks: not for 2.x.

Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Reviewed-on: http://gerrit.cloudera.org:8080/9986
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Attila Jeges <attilaj@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-22 13:18:58 +00:00
Joe McDonnell
a541ac6039 IMPALA-7193: Fix cluster startup args in create-load-data.sh
The minicluster args for dataload changed to
a bash array in IMPALA-7119, and this requires
a special syntax to dereference and get the whole
array.

This fixes the invocation to use the right
syntax ("${BASH_VAR[@]}" rather than $BASH_VAR).

Change-Id: Ie9a24c0e9fa34e43697b16b48cf219f47f30c0cc
Reviewed-on: http://gerrit.cloudera.org:8080/10782
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-22 06:23:23 +00:00
Tianyi Wang
00519a68d2 IMPALA-7169: Prevent HDFS from checkpointing trash until 3000 AD
HDFS trash checkpointing renames files in the trash folder and breaks
Impala tests. Impala set the trash checkpointing interval to 1440
(minutes) to try to postpone checkpointing by 24 hours. Unfortunately,
that told HDFS to checkpoint whenever the UNIX time in seconds is a
multiple of 1440 * 60, which broke trash-related tests run around
midnight GMT. This patch sets the interval to 541728000 so that HDFS
won't checkpoint until Jan 1st 3000, and will checkpoint every 1030
years after that.
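
As a quick check of the arithmetic (the option is in minutes, and HDFS
checkpoints when the UNIX time in seconds is a multiple of
interval * 60):

    #include <cstdint>
    #include <iostream>

    int main() {
      const int64_t interval_min = 541728000;
      const int64_t interval_sec = interval_min * 60;  // 32503680000
      // 32503680000 is the UNIX timestamp of 3000-01-01 00:00:00 UTC, so
      // the first multiple the checkpointer can hit is Jan 1st 3000.
      std::cout << interval_sec / (365.2425 * 24 * 3600) << " years\n";  // ~1030
      return 0;
    }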

Change-Id: I9452f7e44c7679f86a947cd20115c078757223d8
Reviewed-on: http://gerrit.cloudera.org:8080/10742
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-20 01:32:36 +00:00
Tim Armstrong
894ab8e980 IMPALA-7115: set a default THREAD_RESERVATION_LIMIT value
The value is chosen to allow only queries that have a reasonable chance
of succeeding, albeit with poor performance because of the high
number of threads.

Testing:
Added a test to make sure that the default value rejects a large query.

Change-Id: I31d3fa3f6305c360922649dba53a9026c9563384
Reviewed-on: http://gerrit.cloudera.org:8080/10628
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-19 03:02:49 +00:00
Joe McDonnell
147e962f2d IMPALA-7119: Restart whole minicluster when HDFS replication stalls
After loading data, we wait for HDFS to replicate
all of the blocks appropriately. If this takes too long,
we restart HDFS. However, HBase can fail if HDFS is
restarted and HBase is unable to write its logs.
In general, there is no real reason to keep HBase
and the other minicluster components running while
restarting HDFS.

This changes the HDFS health check to restart the
whole minicluster and Impala rather than just HDFS.

Testing:
 - Tested with a modified version that always does
   the restart in the HDFS health check and verified
   that the tests pass

Change-Id: I58ffe301708c78c26ee61aa754a06f46c224c6e2
Reviewed-on: http://gerrit.cloudera.org:8080/10665
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-18 21:46:11 +00:00
Michael Ho
51ff47d05e IMPALA-5168: Codegen HASH_PARTITIONED KrpcDataStreamSender::Send()
This change codegens the hash partitioning logic of
KrpcDataStreamSender::Send() when the partitioning strategy
is HASH_PARTITIONED. It does so by unrolling the loop which
evaluates each row against the partitioning expressions and
hashes the result. It also replaces the number of channels
of that sender with a constant at runtime.
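
Conceptually (a hedged C++ analogy; the real change emits LLVM IR
rather than C++), fixing the expression count and channel count at
codegen time is what lets the compiler unroll the per-row loop and
strength-reduce the modulo:

    #include <cstdint>
    #include <vector>

    constexpr int kNumChannels = 8;  // becomes a constant after codegen

    // With a fixed set of partitioning exprs, this loop is fully unrollable.
    uint64_t HashPartitionExprs(const std::vector<int64_t>& values) {
      uint64_t h = 0;
      for (int64_t v : values) h = h * 31 + static_cast<uint64_t>(v);
      return h;
    }

    int ChannelForRow(const std::vector<int64_t>& values) {
      // Modulo by a compile-time constant can be strength-reduced.
      return static_cast<int>(HashPartitionExprs(values) % kNumChannels);
    }

    int main() { return ChannelForRow({42, 7}) < kNumChannels ? 0 : 1; }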

With this change, we get reasonable speedup with some benchmarks:

+-------------------------+-----------------------+---------+------------+------------+----------------+
| Workload                | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+-------------------------+-----------------------+---------+------------+------------+----------------+
| TPCH(_300)              | parquet / none / none | 20.03   | -6.44%     | 13.56      | -7.15%         |
| TARGETED-PERF(_300)     | parquet / none / none | 58.59   | -5.56%     | 12.28      | -5.30%         |
| TPCDS-UNMODIFIED(_1000) | parquet / none / none | 15.60   | -3.10%     | 7.16       | -4.33%         |
| TPCH_NESTED(_300)       | parquet / none / none | 30.93   | -3.02%     | 17.46      | -4.71%         |
+-------------------------+-----------------------+---------+------------+------------+----------------+

Change-Id: I1c44cc9312c062cc7a5a4ac9156ceaa31fb887ff
Reviewed-on: http://gerrit.cloudera.org:8080/10421
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-14 23:37:00 +00:00
Tim Armstrong
d8ed07f112 IMPALA-6035: Add query options to limit thread reservation
Adds two options: THREAD_RESERVATION_LIMIT and
THREAD_RESERVATION_AGGREGATE_LIMIT, which are both enforced by admission
control based on planner resource requirements and the schedule. The
mechanism used is the same as the minimum reservation checks.

THREAD_RESERVATION_LIMIT limits the total number of reserved threads in
fragments scheduled on a single backend.
THREAD_RESERVATION_AGGREGATE_LIMIT limits the sum of reserved threads
across all fragments.

This also slightly improves the minimum reservation error message to
include the host name.

Testing:
Added end-to-end tests that exercise the code paths.

Ran core tests.

Change-Id: I5b5bbbdad5cd6b24442eb6c99a4d38c2ad710007
Reviewed-on: http://gerrit.cloudera.org:8080/10365
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-14 03:25:55 +00:00
Thomas Tauber-Marshall
bf2124bf30 IMPALA-6929: Support multi-column range partitions for Kudu
Kudu allows specifying range partitions over multiple columns. Impala
already has support for doing this when the partitions are specified
with '=', but if the partitions are specified with '<' or '<=', the
parser would return an error.

This patch modifies the parser to allow for creating Kudu tables like:
create table kudu_test (a int, b int, primary key(a, b))
  partition by range(a, b) (partition (0, 0) <= values < (1, 1));
and similarly to alter partitions like:
alter table kudu_test add range partition (1, 1) <= values < (2, 2);

Testing:
- Modified functional_kudu.jointbl's schema so that we have a table
  in functional with a multi-column range partition to test things
  against.
- Added FE and E2E tests for CREATE and ALTER.

Change-Id: I0141dd3344a4f22b186f513b7406f286668ef1e7
Reviewed-on: http://gerrit.cloudera.org:8080/10441
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-13 00:10:13 +00:00
Csaba Ringhofer
06fe321050 IMPALA-7417: Remove DCHECKs with unnecessary constraint on dictionary encoding bit width
Reading dictionary encoded Parquet data pages where the bit width is
larger than the encoded type's size (e.g. coding 8-bit TINYINT with
16-bit dictionary indices) led to a DCHECK error in debug builds.
Impala does not create such Parquet files (an N-bit type can have at
most 2^N distinct values, so N-bit dictionary indices are enough for
a dictionary that contains every possible value), but the Parquet
standard does not forbid doing so.

These DCHECKs were probably introduced by a copy-paste error (similar
checks exist in the non-dictionary encoded bit reader functions,
where they are valid).
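
The counting argument in code form (a sketch of the reasoning only, not
of the decoder):

    #include <cstdint>
    #include <iostream>

    int main() {
      const int type_bits = 8;  // e.g. TINYINT
      const uint64_t max_distinct = uint64_t{1} << type_bits;  // 256 values
      const int writer_index_bits = 16;  // wider than needed, but legal
      std::cout << "dictionary holds at most " << max_distinct << " entries, "
                << "so " << type_bits << "-bit indices always suffice, yet a "
                << "writer may use " << writer_index_bits << "-bit indices\n";
      return 0;
    }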

Testing:
- a new test is added to check that these data pages can be decoded
  correctly

Change-Id: I9ff3b00cbcab09dec11b3607d7d9a9c2c0025e1a
Reviewed-on: http://gerrit.cloudera.org:8080/10683
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-11 23:25:46 +00:00
Zoltan Borok-Nagy
e6ca7ca14d IMPALA-7108: IllegalStateException hit during CardinalityCheckNode.<init>
Since IMPALA-6314, we set LIMIT 2 on runtime scalar subqueries
in StmtRewriter.mergeExpr(). We do that because later we add a
CardinalityCheckNode on top of such subqueries, and with
LIMIT 2 we can still check whether they return more than one row.

In the constructor of CardinalityCheckNode there is a
precondition that checks if the child node has LIMIT 2 to
be certain that we've set the limit for all the necessary
cases.

However, some subqueries will get a LIMIT 1 later, breaking the
precondition in CardinalityCheckNode. An example of such a
subquery is a select stmt that selects from an inline view
that returns a single row:

select * from functional.alltypes
where int_col = (select f.id from (
                 select * from functional.alltypes limit 1) f);

Note that we shouldn't add a CardinalityCheckNode to the plan
of this query in the first place. To generate a proper plan I
updated SelectStmt.returnsSingleRow() because this method didn't
handle this case well.

I also changed the precondition from

    Preconditions.checkState(child.getLimit() == 2);

to

    Preconditions.checkState(child.getLimit() <= 2);

in order to be more permissive.

I added tests for the aforementioned query.

Change-Id: I82a7a3fe26db3e12131c030c4ad055a9c4955407
Reviewed-on: http://gerrit.cloudera.org:8080/10605
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-08 20:15:50 +00:00
Tim Armstrong
8f7a3f9c6c IMPALA-7124: delete keystore when formatting cluster
This avoids leftover keystore files from previous instances
of the cluster, e.g. testdata/cluster/cdh6/node-1/data/kms.keystore.
I checked on my local system and that is the only file outside of
the nn and dn subdirectories.

Change-Id: I9e6c1afa0df95b26446ccafa0c942e28ba848dae
Reviewed-on: http://gerrit.cloudera.org:8080/10613
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
2018-06-06 17:16:15 +00:00
Lars Volker
c9e8f2f7e7 IMPALA-7008: Rewrite query to make it not return 100M rows
One query in spilling.test is expected to fail. When it did not fail,
it returned 100M rows, which would then cause the Python test code to
consume memory until it got OOM-killed by the kernel.

To fix this, we rewrite the query. I tested this locally to make sure
that the query still fails as expected on HDFS.

Change-Id: I31956d3092a7e69b979f631df3a6dfda14ebe140
Reviewed-on: http://gerrit.cloudera.org:8080/10597
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-05 01:14:35 +00:00
Thomas Tauber-Marshall
ba7893cb9e IMPALA-6338: Disable more flaky bloom filter tests
Until IMPALA-6338 is fixed, temporarily disable tests that are
affected by it - any test that has a 'limit' and relies on the
contents of the runtime profile. This patch disables the runtime
profile check for all such tests in bloom_filter.test

Change-Id: Ifc9da892efa3b27d63056ad8e3befac82808ffdb
Reviewed-on: http://gerrit.cloudera.org:8080/10530
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-30 08:00:50 +00:00
Joe McDonnell
9a5410570e IMPALA-7061: Rework HBase splitting and assignment
Some frontend PlannerTests rely on HBase tables being
arranged in a deterministic way. Specifically, the
HBase tables need to be split with specific region
boundaries and those regions need to be assigned to
specific HBase region servers.

Currently, the tables are created without splits and
testdata/bin/split-hbase.sh runs Java code in
HBaseTestDataRegionAssignment to split and assign
the tables. This runs during dataload via
testdata/bin/create-load-data.sh and during tests
with bin/run-all-tests.sh. There are problems with
both parts of this process. The table splitting is
flaky. Since significant time can pass between the
assignments and the tests, rebalancing means the
assignments are not always stable.

This changes the process so that the HBase tables are
created with the splits already specified via the
HBase shell. The splits remain stable over time.
PlannerTestBase runs the assignment code in
HBaseTestDataRegionAssignment at the start of
the PlannerTests. This makes the assignments
deterministic. No other test depends on the
exact assignments, so this does not regress anything.

Testing:
 - Local testing
 - Ran gerrit-verify-dryrun-external
 - Verified minicluster profile 2 compiles

Change-Id: I3d639128a856254a6ccb93d6750f531974b5f897
Reviewed-on: http://gerrit.cloudera.org:8080/10447
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-25 00:28:18 +00:00
Taras Bobrovytsky
fd7e7c93c5 IMPALA-7039: Ignore the port in HBase planner tests
Before this patch, we used to check the HBase port in the HBase planner
tests. This caused a failure when HBase was running on a different port
than expected. We fix the problem in this patch by not checking the
HBase port.

Testing: ran the FE tests and they passed.

Change-Id: I8eb7628061b2ebaf84323b37424925e9a64f70a0
Reviewed-on: http://gerrit.cloudera.org:8080/10459
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-22 03:43:40 +00:00
Tim Armstrong
f4f28d310c IMPALA-6941: load more text scanner compression plugins
Add extensions for LZ4 and ZSTD (which are supported by Hadoop).
Even without a plugin this results in better behaviour because
we don't try to treat the files with unknown extensions as
uncompressed text.

Also allow loading tables containing files with unsupported
compression types. There was weird behaviour before when we knew
the file extension but didn't support querying the table -
the catalog would load the table but the impalad would fail
processing the catalog update. The simplest way to fix it
is to just allow loading the tables.

Similarly, make the "LOAD DATA" operation more permissive -
we can copy files into a directory even if we can't
decompress them.

Switch to always checking the plugin version - running a mismatched
plugin is inherently unsafe.

Testing:
The positive case where LZO is loaded is exercised. Added
coverage for the negative case where LZO is disabled.

Fixed test gaps:
* Querying LZO table with LZO plugin not available.
* Interacting with tables with known but unsupported text
  compressions.
* Querying files with unknown compression suffixes (which are
  treated as uncompressed text).

Change-Id: If2a9c4a4a11bed81df706e9e834400bfedfe48e6
Reviewed-on: http://gerrit.cloudera.org:8080/10165
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-18 03:44:46 +00:00
Philip Zeyliger
5b824408af IMPALA-7035: Configure jceks.key.serialFilter for KMS.
Configures a Java property for KMS to account for JDK 8u171's security fixes. I
was seeing impala-py.test tests/metadata/test_hdfs_encryption.py fail with the
following error:

  AssertionError: Error creating encryption zone: RemoteException: Can't recover key for testkey1 from keystore file:/home/impdev/Impala/testdata/cluster/cdh6/node-1/data/kms.keystore

The issue is described in HDFS-13494, and I imagine it'll be fixed in
due time. In the meantime, setting this property seems to do the trick.

Change-Id: I2d21c9cce3b91e8fd8b2b4f1cda75e3958c977d5
Reviewed-on: http://gerrit.cloudera.org:8080/10418
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-17 22:15:32 +00:00
Zoltan Borok-Nagy
ccf19f9f8f IMPALA-5842: Write page index in Parquet files
This commit builds on the previous work of
Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/

The commit implements the write path of PARQUET-922:
"Add column indexes to parquet.thrift". As specified in the
parquet-format, Impala writes the page indexes just before
the footer. This allows much more efficient page filtering
than using the same information from the 'statistics' field
of DataPageHeader.

I updated Pooja's Python tests as well.

Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Reviewed-on: http://gerrit.cloudera.org:8080/9693
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-17 20:22:02 +00:00
Joe McDonnell
2e9f5c90eb IMPALA-7043: HBase split failure should not fail dataload
HBase splitting can fail due to changes in HBase code. It
is useful to still run tests even if HBase splitting fails.
As it is today, buildall.sh will abort if
create-load-data.sh's invocation of split-hbase.sh fails.
No tests run, even though the HBase splitting affects only
a small portion of our tests.

This changes create-load-data.sh to keep going with
dataload if HBase splitting fails. It outputs the same
errors to the log as it would before this change.
It adds a message to explain that it is ignoring
the failure and there may be related test failures.

Change-Id: I7497fe8c9f1655a34b2743462d8b7248eb94554e
Reviewed-on: http://gerrit.cloudera.org:8080/10437
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-17 09:13:01 +00:00
Lars Volker
a64cfc523e IMPALA-7032: Disable codegen for CHAR type null literals
Analogous to IMPALA-6435, we have to disable codegen for CHAR type null
literals. Otherwise we will crash in
impala::NullLiteral::GetCodegendComputeFn().

This change adds a test to make sure that the crash is fixed.

Change-Id: I34033362263cf1292418f69c5ca1a3b84aed39a9
Reviewed-on: http://gerrit.cloudera.org:8080/10409
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Lars Volker <lv@cloudera.com>
2018-05-16 00:00:15 +00:00
Tianyi Wang
13a1acd7e4 IMPALA-7003: Deflake erasure coding data loading
Erasure coding data loading is flaky in two ways:
1. HBase sometimes doesn't work because of HBASE-19369
2. Nested data loading sometimes fails because the HDFS namenode cannot
   find enough good datanodes.

For problem 1, this patch enables erasure coding only on /test-warehouse
directory. For problem 2, this patch sets
dfs.namenode.redundancy.considerLoad to false, preventing namenode from
excluding heavily-loaded datanodes.

Change-Id: I219106cd3ec7ffab7a834700f2a722b165e5f66c
Reviewed-on: http://gerrit.cloudera.org:8080/10362
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-15 23:59:58 +00:00
Zoltan Borok-Nagy
fab65d4479 IMPALA-7022: TestQueries.test_subquery: Subquery must not return more than one row
TestQueries.test_subquery sometimes fails during exhaustive
tests.

In the tests we expect to catch an exception that is
prefixed by the "Query aborted:" string. The prefix is
usually added by impala_beeswax.py::wait_for_completion(),
but in rare cases it isn't added.

From the point of view of the test it is irrelevant whether the
exception is prefixed with "Query aborted:" or not, so I removed it
from the expected exception string.

Change-Id: I3b8655ad273b1dd7a601099f617db609e4a4797b
Reviewed-on: http://gerrit.cloudera.org:8080/10407
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2018-05-15 23:37:06 +00:00
Tim Armstrong
3661100fa3 IMPALA-6645: Enable disk spill encryption by default
Perf:
Targeted benchmarks with a heavily spilling query on a machine
with PCLMULQDQ support show < 5% of CPU time spent in encryption and
decryption. PCLMULQDQ was introduced in AMD Bulldozer (c. 2011)
and Intel Westmere (c. 2010).

Testing:
Ran core tests with the change.

Updated the custom cluster test to exercise the non-default
configuration.

Change-Id: Iee4be2a95d689f66c3663d99e4df0fb3968893a9
Reviewed-on: http://gerrit.cloudera.org:8080/10345
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2018-05-15 22:23:14 +00:00
njanarthanan
bb9237d942 IMPALA-6819: Add new performance test workload - tpcds-unmodified used by Impala Performance Tests
Description:
Impala versions prior to 2.5 didn't have runtime filters, which made
TPC-DS queries run very slowly, so the queries under tpcds have
explicit partition filters to work around the limitation. Post Impala
2.5, this adds tpcds-unmodified, which provides the unmodified version
of the workload for more coverage.

Testing:
Ran the performance tests using the new workload
and all tests passed

Change-Id: I3957621d88b80fffc8fc89fd8104a58137b86e92
Reviewed-on: http://gerrit.cloudera.org:8080/9973
Reviewed-by: David Knupp <dknupp@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-13 09:06:06 +00:00
Tim Armstrong
e12ee485cf IMPALA-6957: calc thread resource requirement in planner
This only factors in fragment execution threads. E.g. this does *not*
try to account for the number of threads on the old Thrift RPC
code path if that is enabled.

This is loosely related to the old VCores estimate, but is different in
that it:
* Directly ties into the notion of required threads in
  ThreadResourceMgr.
* Is a strict upper bound on the number of such threads, rather than
  an estimate.

Does not include "optional" threads. ThreadResourceMgr in the backend
bounds the number of "optional" threads per impalad, so the number of
execution threads on a backend is limited by

  sum(required threads per query) +
      CpuInfo::num_cores() * FLAGS_num_threads_per_core

DCHECKS in the backend enforce that the calculation is correct. They
were actually hit in KuduScanNode because of some races in thread
management leading to multiple "required" threads running. Now the
first thread in the multithreaded scans never exits, which means
that it's always safe for any of the other threads to exit early,
which simplifies the logic a lot.

Testing:
Updated planner tests.

Ran core tests.

Change-Id: I982837ef883457fa4d2adc3bdbdc727353469140
Reviewed-on: http://gerrit.cloudera.org:8080/10256
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-12 01:43:37 +00:00
Tim Armstrong
25c13bfdd6 IMPALA-7010: don't run memory usage tests on non-HDFS
Moved a number of tests with tuned mem_limits. In some cases
this required separating the tests from non-tuned functional
tests.

TestQueryMemLimit used very high and very low limits only, so it
seemed safe to run in all configurations.

Change-Id: I9686195a29dde2d87b19ef8bb0e93e08f8bee662
Reviewed-on: http://gerrit.cloudera.org:8080/10370
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-11 22:41:49 +00:00
Joe McDonnell
b126b2d105 IMPALA-6972: Disable parallel dataload on MINICLUSTER_PROFILE=2
There is a Hive bug in Hive 1.1.0 that can result
in a NullPointerException when doing parallel Hive
operations (see IMPALA-6532). Since dataload goes
parallel on Hive loads starting with IMPALA-6372,
dataload can hit this error on Hive 1.1.0 (i.e.
IMPALA_MINICLUSTER_PROFILE=2). This is impacting
builds on the 2.x branch.

This disables parallel dataload for IMPALA_MINICLUSTER_PROFILE=2.

IMPALA_MINICLUSTER_PROFILE=3 uses a newer version
of Hive that has a fix for this, so this continues
to use parallel dataload for that case.

Parallelism can be reenabled when Hive 1.1.0 gets the
fix from Hive 2.1.1.

Change-Id: I90a0f2b3756d7192fa7db2958031b8c88eb606e6
Reviewed-on: http://gerrit.cloudera.org:8080/10306
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-10 01:30:13 +00:00
njanarthanan
c35ec6c9bd IMPALA-6819: Add new queries to targeted-perf workload
Description:
Adding new queries to the targeted-perf workload that is
used by Impala performance tests run via $IMPALA_HOME/bin/run-workload.py

Testing:
Ran the performance tests for the targeted-perf workload
and all the tests passed

Change-Id: I5c415924d0bb6da1b1f5df6cb16b95a1d2eaa3ab
Reviewed-on: http://gerrit.cloudera.org:8080/9979
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: David Knupp <dknupp@cloudera.com>
2018-05-09 23:08:44 +00:00
Michael Brown
6a86c8e8d4 IMPALA-6781: expand ORDER BY in some TPCH queries
Fix determinism in TPCH Q3, Q10, Q18 by adding another column to the
queries' ORDER BY to guarantee deterministic results. With TPCH 10000
these queries were producing differing results across stress test runs.
They were all valid, but the LIMIT without the more specific ORDER BY
meant that several different result sets were possible. By adding these
columns, we sort by a column that has uniqueness across all rows
returned.

Testing:

Repeated runs of these specific TPCH queries via:

  impala-py.test -k Q18 tests/query_test/test_tpch_queries.py

Stress test on a 140-node cluster with TPCH 10000 loaded. Previously,
when using these queries, the stress test would spuriously report
incorrect results.

Change-Id: If74d127fb57546b1948a34aa6d2e68cdc6880fae
Reviewed-on: http://gerrit.cloudera.org:8080/10351
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-09 19:07:00 +00:00
Sailesh Mukil
f13abdca67 IMPALA-6975: TestRuntimeRowFilters.test_row_filters failing with Memory limit exceeded
This test has started failing relatively frequently. We think that
this may be due to timing differences of when RPCs arrive from the
recent changes with KRPC.

Increasing the memory limit should allow this test to pass
consistently.

Change-Id: Ie39482e2a0aee402ce156b11cce51038cff5e61a
Reviewed-on: http://gerrit.cloudera.org:8080/10315
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-05 03:01:35 +00:00
Taras Bobrovytsky
c05696dd6a IMPALA-6949: Add the option to start the minicluster with EC enabled
In this patch we add the "ERASURE_CODING" environment variable. If we
enable it, a cluster with 5 data nodes will be created during data
loading and HDFS will be started with erasure coding enabled.

Testing:
I ran the core build and verified that erasure coding gets enabled in
HDFS. Many of our EE tests failed, however.

Cherry-picks: not for 2.x

Change-Id: I397aed491354be21b0a8441ca671232dca25146c
Reviewed-on: http://gerrit.cloudera.org:8080/10275
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-05 01:20:59 +00:00
Thomas Tauber-Marshall
ba84ad03cb IMPALA-6954: Fix problems with CTAS into Kudu with an expr rewrite
This patch fixes two problems:
- Previously a CTAS into a Kudu table where an expr rewrite occurred
  would create an unpartitioned table, due to the partition info being
  reset in TableDataLayout and then never reconstructed. Since the
  Kudu partition info is set by the parser and never changes, the
  solution is to not reset it.
- Previously a CTAS into a Kudu table with a range partition where an
  expr rewrite occurred would fail with an analysis exception due to
  a Precondition check in RangePartition.analyze that checked that
  the RangePartition wasn't already analyzed, as the analysis can't
  be done twice. Since the state in RangePartition never changes, it
  doesn't need to be reanalyzed and we can just return instead of
  failing on the check.

Testing:
- Added an e2e test that creates a partitioned Kudu table with a CTAS
  with a rewrite, and checks that the expected partitions are created.

Change-Id: I731743bd84cc695119e99342e1b155096147f0ed
Reviewed-on: http://gerrit.cloudera.org:8080/10251
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-02 02:54:23 +00:00
Lars Volker
52a2d90f98 Warn about Hadoop / Java version incompatibility
Running Hadoop 3 with Java 7 can result in some obscure error messages.
This change adds a warning to impala-config.sh when using Hadoop 3 with
Java 7.

   Your development environment is configured for Hadoop 3 and Java 7.
   Hadoop 3 requires at least Java 8. Your JAVA binary currently points
   to /usr/lib/jvm/java-7-oracle-amd64/bin/java and reports the
   following version:

   java version "1.7.0_75"
   Java(TM) SE Runtime Environment (build 1.7.0_75-b13)
   Java HotSpot(TM) 64-Bit Server VM (build 24.75-b04, mixed mode)

It also catches failure of the minicluster start and prints an
additional warning when running with Hadoop 3 and Java 7.

Cherry-picks: not for 2.x

Change-Id: I4d8b505cf045eeb562d16ce4ce09da0712dc03eb
Reviewed-on: http://gerrit.cloudera.org:8080/10244
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-01 20:40:28 +00:00
Tim Armstrong
418c705787 IMPALA-6679,IMPALA-6678: reduce scan reservation
This has two related changes.

IMPALA-6679: defer scanner reservation increases
------------------------------------------------
When starting each scan range, check to see how big the initial scan
range is (the full thing for row-based formats, the footer for
Parquet) and determine whether more reservation would be useful.

For Parquet, base the ideal reservation on the actual column layout
of each file. This avoids reserving memory that we won't use for
the actual files that we're scanning. This also avoid the need to
estimate ideal reservation in the planner.

We also release scanner thread reservations above the minimum as
soon as threads complete, so that resources can be released slightly
earlier.

IMPALA-6678: estimate Parquet column size for reservation
---------------------------------------------------------
This change also reduces reservation computed by the planner in certain
cases by estimating the on-disk size of column data based on stats. It
also reduces the default per-column reservation to 4MB since it appears
that < 8MB columns are generally common in practice and the method for
estimating column size is biased towards over-estimating. There are two
main cases to consider for the performance implications:
* Memory is available to improve query perf - if we underestimate, we
  can increase the reservation so we can do "efficient" 8MB I/Os for
  large columns.
* The ideal reservation is not available - query performance is affected
  because we can't overlap I/O and compute as much and may do smaller
  (probably 4MB) I/Os. However, we should avoid pathological behaviour
  like tiny I/Os.

When stats are not available, we just default to reserving 4MB per
column, which typically is more memory than required. When stats are
available, the required memory can be reduced below that when some
heuristics tell us with high confidence that the column data for most
or all files is smaller than 4MB.

The stats-based heuristic could reduce scan performance if both the
conservative heuristics significantly underestimate the column size
and memory is constrained such that we can't increase the scan
reservation at runtime (in which case the memory might be used by
a different operator or scanner thread).

Observability:
Added counters to track when threads were not spawned due to reservation
and to track when reservation increases are requested and denied. These
allow determining if performance may have been affected by memory
availability.

Testing:
Updated test_mem_usage_scaling.py memory requirements and added steps
to regenerate the requirements. Looped the test for a while to flush
out flakiness.

Added targeted planner and query tests for reservation calculations and
increases.

Change-Id: Ifc80e05118a9eef72cac8e2308418122e3ee0842
Reviewed-on: http://gerrit.cloudera.org:8080/9757
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-28 23:41:39 +00:00
Tim Armstrong
93d714c645 IMPALA-6560: fix regression test for IMPALA-2376
The test is modified to increase the size of collections allocated.
num_nodes and mt_dop query options are set to make execution as
deterministic as possible.

I looped the test overnight to try to flush out flakiness.

Adds support for row_regex lines in CATCH sections so that we can
match a larger part of the error message.

Change-Id: I024cb6b57647902b1735defb885cd095fd99738c
Reviewed-on: http://gerrit.cloudera.org:8080/9681
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2018-04-28 23:41:39 +00:00
Tim Armstrong
d7bba82192 IMPALA-6587: free buffers before ScanRange::Cancel() returns
ScanRange::Cancel() now waits until an in-flight read finishes so
that the disk I/O buffer being processed by the disk thread is
freed when Cancel() returns.

The fix is to set a 'read_in_flight_' flag on the scan range
while the disk thread is doing the read. Cancel() blocks until
read_in_flight_ == false.

The code is refactored to move more logic into ScanRange and
to avoid holding RequestContext::lock_ for longer than necessary.
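
A minimal C++ sketch of the synchronization pattern described above (an
assumed shape, not the actual ScanRange code):

    #include <condition_variable>
    #include <mutex>

    class ScanRangeSketch {
     public:
      // Called by the disk thread around each read.
      void DoRead() {
        {
          std::lock_guard<std::mutex> l(lock_);
          read_in_flight_ = true;
        }
        // ... perform the read and process the I/O buffer ...
        {
          std::lock_guard<std::mutex> l(lock_);
          read_in_flight_ = false;
        }
        cv_.notify_all();
      }

      // Blocks until any in-flight read finishes, so the buffer being
      // processed by the disk thread is freed by the time this returns.
      void Cancel() {
        std::unique_lock<std::mutex> l(lock_);
        cv_.wait(l, [this] { return !read_in_flight_; });
        // ... mark the range cancelled and clean up remaining buffers ...
      }

     private:
      std::mutex lock_;
      std::condition_variable cv_;
      bool read_in_flight_ = false;
    };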

Testing:
Added query test that reproduces the issue.

Added a unit test and a stress option that reproduces the problem in a
targeted way.

Ran disk-io-mgr-stress test for a few hours. Ran it under TSAN and
inspected output to make sure there were no non-benign data races.

Change-Id: I87182b6bd51b5fb0b923e7e4c8d08a44e7617db2
Reviewed-on: http://gerrit.cloudera.org:8080/9680
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2018-04-28 23:41:39 +00:00
Tim Armstrong
fb5dc9eb48 IMPALA-4835: switch I/O buffers to buffer pool
This consists of the following squashed patches, which were reverted.

I will fix the known issues with some follow-on patches.

======================================================================
IMPALA-4835: Part 1: simplify I/O mgr mem mgmt and cancellation

In preparation for switching the I/O mgr to the buffer pool, this
removes and cleans up a lot of code so that the switchover patch starts
from a cleaner slate.

* Remove the free buffer cache (which will be replaced by buffer pool's
  own caching).
* Make memory limit exceeded error checking synchronous (in anticipation
  of having to propagate buffer pool errors synchronously).
* Simplify error propagation - remove the (ineffectual) code that
  enqueued BufferDescriptors containing error statuses.
* Document locking scheme better in a few places, make it part of the
  function signature when it seemed reasonable.
* Move ReturnBuffer() to ScanRange, because it is intrinsically
  connected with the lifecycle of a scan range.
* Separate external ReturnBuffer() and internal CleanUpBuffer()
  interfaces - previously callers of ReturnBuffer() were fudging
  the num_buffers_in_reader accounting to make the external interface work.
* Eliminate redundant state in ScanRange: 'eosr_returned_' and
  'is_cancelled_'.
* Clarify the logic around calling Close() for the last
  BufferDescriptor.
  -> There appeared to be an implicit assumption that buffers would be
     freed in the order they were returned from the scan range, so that
     the "eos" buffer was returned last. Instead just count the number
     of outstanding buffers to detect the last one.
  -> Touching the is_cancelled_ field without holding a lock was hard to
     reason about - it violated locking rules and it was unclear that it
     was race-free.
* Remove DiskIoMgr::Read() to simplify the interface. It is trivial to
  inline at the callsites.

This will probably regress performance somewhat because of the cache
removal, so my plan is to merge it around the same time as switching
the I/O mgr to allocate from the buffer pool. I'm keeping the patches
separate to make reviewing easier.

Testing:
* Ran exhaustive tests
* Ran the disk-io-mgr-stress-test overnight

======================================================================
IMPALA-4835: Part 2: Allocate scan range buffers upfront

This change is a step towards reserving memory for buffers from the
buffer pool and constraining per-scanner memory requirements. This
change restructures the DiskIoMgr code so that each ScanRange operates
with a fixed set of buffers that are allocated upfront and recycled as
the I/O mgr works through the ScanRange.

One major change is that ScanRanges get blocked when a buffer is not
available and get unblocked when a client returns a buffer via
ReturnBuffer(). I was able to remove the logic to maintain the
blocked_ranges_ list by instead adding a separate set with all ranges
that are active.

There is also some miscellaneous cleanup included - e.g. reducing the
amount of code devoted to maintaining counters and metrics.

One tricky part of the existing code was that it called
IssueInitialRanges() with empty lists of files and depended on
DiskIoMgr::AddScanRanges() to not check for cancellation in that case.
See IMPALA-6564/IMPALA-6588. I changed the logic to not try to issue
ranges for empty lists of files.

I plan to merge this along with the actual buffer pool switch, but
separated it out to allow review of the DiskIoMgr changes separate from
other aspects of the buffer pool switchover.

Testing:
* Ran core and exhaustive tests.

======================================================================
IMPALA-4835: Part 3: switch I/O buffers to buffer pool

This is the final patch to switch the Disk I/O manager to allocate all
buffers from the buffer pool and to reserve the buffers required for
a query upfront.

* The planner reserves enough memory to run a single scanner per
  scan node.
* The multi-threaded scan node must increase reservation before
  spinning up more threads.
* The scanner implementations must be careful to stay within their
  assigned reservation.

The row-oriented scanners were most straightforward, since they only
have a single scan range active at a time. A single I/O buffer is
sufficient to scan the whole file but more I/O buffers can improve I/O
throughput.

Parquet is more complex because it issues a scan range per column and
the sizes of the columns on disk are not known during planning. To
deal with this, the reservation in the frontend is based on a
heuristic involving the file size and # columns. The Parquet scanner
can then divvy up reservation to columns based on the size of column
data on disk.
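
A hedged sketch of the divvying idea (a naive proportional split; the
actual algorithm is non-trivial, as noted under Testing below):

    #include <cstdint>
    #include <vector>

    // Split a scanner's buffer reservation across columns in proportion to
    // their on-disk sizes (illustrative only).
    std::vector<int64_t> DivvyReservation(int64_t total_reservation,
                                          const std::vector<int64_t>& col_bytes) {
      int64_t total_bytes = 0;
      for (int64_t b : col_bytes) total_bytes += b;
      std::vector<int64_t> per_col;
      for (int64_t b : col_bytes) {
        per_col.push_back(total_reservation * b / total_bytes);
      }
      return per_col;
    }

    int main() {
      // A 24MB reservation over a 16MB and an 8MB column -> 16MB and 8MB.
      auto r = DivvyReservation(24 << 20, {16 << 20, 8 << 20});
      return r[0] == (16 << 20) ? 0 : 1;
    }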

I adjusted how the 'mem_limit' is divided between buffer pool and non
buffer pool memory for low mem_limits to account for the increase in
buffer pool memory.

Testing:
* Added more planner tests to cover reservation calcs for scan node.
* Test scanners for all file formats with the reservation denial debug
  action, to test behaviour when the scanners hit reservation limits.
* Updated memory and buffer pool limits for tests.
* Added unit tests for dividing reservation between columns in parquet,
  since the algorithm is non-trivial.

Perf:
I ran TPC-H and targeted perf locally comparing with master. Both
showed small improvements of a few percent and no regressions of
note. Cluster perf tests showed no significant change.

Change-Id: I3ef471dc0746f0ab93b572c34024fc7343161f00
Reviewed-on: http://gerrit.cloudera.org:8080/9679
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2018-04-28 23:41:39 +00:00
Taras Bobrovytsky
d0f838b66a IMPALA-6340,IMPALA-6518: Check that decimal types are compatible in FE
In this patch we implement strict decimal type checking in the FE in
various situations when DECIMAL_V2 is enabled. What is affected:
- Union. If we union two decimals and it is not possible to come up
  with a decimal that will be able to contain all the digits, an error
  is thrown. For example, the union(decimal(20, 10), decimal(20, 20))
  returns decimal(30, 20). However, for union(decimal(38, 0),
  decimal(38, 38)) the ideal return type would be decimal(76,38), but
  this is too large, so an error is thrown.
- Insert. If we are inserting a decimal value into a column where we are
  not guaranteed that all digits will fit, an error is thrown. For
  example, inserting a decimal(38,0) value into a decimal(38,38) column.
- Functions such as coalesce(). If we are unable to determine the output
  type that guarantees that all digits will fit from all the arguments,
  an error is thrown. For example,
  coalesce(decimal(38,38), decimal(38,0)) will throw an error.
- Hash Join. When joining on two decimals, if a type cannot be
  determined that both columns can be cast to, we throw an error.
  For example, join on decimal(38,0) and decimal(38,38) will result
  in an error.

To avoid these errors, you need to use CAST() on some of the decimals.
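
The widening rule behind the union examples above, as a hedged C++
sketch (the real logic lives in the Java frontend): keep the larger
integer-digit count and the larger scale of the two inputs; if the
required precision exceeds DECIMAL's maximum of 38, no lossless result
type exists and analysis fails.

    #include <algorithm>
    #include <iostream>

    struct DecimalType { int precision; int scale; };

    bool WidenForUnion(DecimalType a, DecimalType b, DecimalType* out) {
      const int scale = std::max(a.scale, b.scale);
      const int int_digits =
          std::max(a.precision - a.scale, b.precision - b.scale);
      if (int_digits + scale > 38) return false;  // error under DECIMAL_V2
      *out = {int_digits + scale, scale};
      return true;
    }

    int main() {
      DecimalType t;
      std::cout << WidenForUnion({20, 10}, {20, 20}, &t)  // 1 -> decimal(30,20)
                << WidenForUnion({38, 0}, {38, 38}, &t)   // 0 -> would need (76,38)
                << "\n";
      return 0;
    }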

In this patch we also change the output decimal calculation of decimal
round, truncate and related functions. If these functions are a no-op,
the resulting decimal type is the same as the input type.

Testing:
- Ran a core build which passed.

Change-Id: Id406f4189e01a909152985fabd5cca7a1527a568
Reviewed-on: http://gerrit.cloudera.org:8080/9930
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-28 03:33:02 +00:00
Thomas Tauber-Marshall
87be63e321 IMPALA-6821: Push down limits into Kudu
This patch takes advantage of a recent change in Kudu (KUDU-16) that
exposes the ability to set limits on KuduScanners. Since each
KuduScanner corresponds to a scan token, and there will be multiple
scan tokens per query, this is just a performance optimization: it
helps in cases where the limit is smaller than the number of rows per
token, and Impala still needs to apply the limit on its side in cases
where the limit is greater than the number of rows per token.
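
A hedged sketch of why both sides are needed (illustrative arithmetic,
not Impala's scan node code): each token's scanner stops at the limit
locally, and the scan node still caps the total across tokens.

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    int64_t RowsReturned(const std::vector<int64_t>& rows_per_token,
                         int64_t limit) {
      int64_t total = 0;
      for (int64_t rows : rows_per_token) {
        // Push-down: each KuduScanner returns at most 'limit' rows itself.
        const int64_t from_token = std::min(rows, limit);
        // Impala-side enforcement across all tokens.
        total += std::min(from_token, limit - total);
        if (total == limit) break;
      }
      return total;
    }

    int main() {
      std::cout << RowsReturned({1000, 1000}, 1) << "\n";     // limit effective: 1
      std::cout << RowsReturned({1000, 1000}, 1500) << "\n";  // still capped: 1500
      return 0;
    }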

Testing:
- Added e2e tests for various situations where limits are applied at
  a Kudu scan node.
- For the query 'select * from tpch_kudu.lineitem limit 1', a best
  case perf scenario for this change where the limit is highly
  effective, the time spent in the Kudu scan node was reduced from
  6.107ms to 3.498ms (avg over 3 runs).
- For the query 'select count(*) from (select * from
  tpch_kudu.lineitem limit 1000000) v', a worst case perf scenario for
  this change where the limit is ineffective, the time spent in the
  Kudu scan node was essentially unchanged, 32.815ms previously vs.
  29.532ms (avg over 3 runs).

Change-Id: Ibe35e70065d8706b575e24fe20902cd405b49941
Reviewed-on: http://gerrit.cloudera.org:8080/10119
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-27 21:55:11 +00:00
Zoltan Borok-Nagy
1e79f14798 IMPALA-6314: Add run time scalar subquery check for uncorrelated subqueries
If a scalar subquery is used with a binary predicate,
or used in an arithmetic expression, it must return
only one row/column to be valid. If this cannot be
guaranteed at parse time through a single-row aggregate
or a limit clause, Impala fails the query.

E.g., currently the following query is not allowed:
SELECT bigint_col
FROM alltypesagg
WHERE id = (SELECT id FROM alltypesagg WHERE id = 1)

However, it would be allowed if the query contained
a LIMIT 1 clause, or if it selected max(id) instead of id.

This commit makes the example valid by introducing a
runtime check to test if the subquery returns a single
row. If the subquery returns more than one row, it
aborts the query with an error.

I added a new node type, called CardinalityCheckNode. It
is created during planning on top of the subquery when
needed, then during execution it checks if its child
only returns a single row.

I extended the frontend tests and e2e tests as well.

Change-Id: I0f52b93a60eeacedd242a2f17fa6b99c4fc38e06
Reviewed-on: http://gerrit.cloudera.org:8080/9005
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-27 20:06:56 +00:00
Philip Zeyliger
2e6a63e31e IMPALA-6070: Further improvements to test-with-docker.
This commit tackles a few additions and improvements to
test-with-docker. In general, I'm adding workloads (e.g., exhaustive,
rat-check), tuning memory settings and parallelism, and trying to speed
things up.

Bug fixes:

* Embarrassingly, I was still skipping thrift-server-test in the backend
  tests. This was a mistake in handling feedback from my last review.

* I made the timeline a little bit taller to clip less.

Adding workloads:

* I added the RAT licensing check.

* I added exhaustive runs. This led me to model the suites a little
  bit more in Python, with a class representing a suite with a
  bunch of data about the suite. It's not perfect and still
  coupled with the entrypoint.sh shell script, but it feels
  workable. As part of adding exhaustive tests, I had
  to re-work the timeout handling, since now different
  suites meaningfully have different timeouts.

Speed ups:

* To speed up test runs, I added a mechanism to split py.test suites into
  multiple shards with a py.test argument. This involved a little bit of work in
  conftest.py, and exposing $RUN_CUSTOM_CLUSTER_TESTS_ARGS in run-all-tests.sh.

  Furthermore, I moved a bit more logic about managing the
  list of suites into Python.

* Doing the full build with "-notests" and only building
  the backend tests in the relevant target that needs them. This speeds
  up "docker commit" significantly by removing about 20GB from the
  container.  I had to indicate that expr-codegen-test depends on
  expr-codegen-test-ir, which was missing.

* I sped up copying the Kudu data: previously I did
  both a move and a copy; now I'm doing a move followed by a move. One
  of the moves is cross-filesystem and thus slow, but this halves the
  amount of copying.

Memory usage:

* I tweaked the memlimit_gb settings to have a higher default. I've been
  fighting empirically to have the tests run well on c4.8xlarge and
  m4.10xlarge.

The more memory a minicluster and test suite run uses, the fewer parallel
suites we can run. By observing the peak processes at the tail of a run (with a
new "memory_usage" function that uses a ps/sort/awk trick) and by observing
peak container total_rss, I found that we had several JVMs that
didn't have Xmx settings set. I added Xms/Xmx settings in a few
places:

 * The non-first Impalad does very little JVM work, so having
   an Xmx keeps it small, even in the parallel tests.
 * Datanodes do work, but they essentially were never garbage
   collecting, because JVM defaults let them use up to 1/4th
   the machine memory. (I observed this based on RSS at the
   end of the run; nothing fancier.) Adding Xms/Xmx settings
   helped.
 * Similarly, I piped the settings through to HBase.

A few daemons still run without resource limitations, but they don't
seem to be a problem.

Change-Id: I43fe124f00340afa21ad1eeb6432d6d50151ca7c
Reviewed-on: http://gerrit.cloudera.org:8080/10123
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-26 20:47:29 +00:00
Zoltan Borok-Nagy
25422c74b2 IMPALA-6934: Wrong results with EXISTS subquery containing ORDER BY, LIMIT, and OFFSET
Queries may return wrong results if an EXISTS subquery has
an ORDER BY with a LIMIT and OFFSET clause. The EXISTS
subquery may incorrectly evaluate to TRUE even though it is
FALSE.

The bug was found during the code review of IMPALA-6314
(https://gerrit.cloudera.org/#/c/9005/). It turned out that
QueryStmt.setLimit() wipes the offset. I modified it to
keep the offset expr.

Added tests to 'PlannerTest/subquery-rewrite.test' and
'QueryTest/subquery.test'

Change-Id: I9693623d3d0a8446913261252f8e4a07935645e0
Reviewed-on: http://gerrit.cloudera.org:8080/10218
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-26 20:12:38 +00:00
Tim Armstrong
d879fa9930 IMPALA-6905: support regexes with more verifiers
Support row_regex and other lines for the subset and superset verifiers,
which previously assumed that lines in the actual and expected had to
match exactly.

Use in test_stats_extrapolation to make the test more robust to
irrelevant changes in the explain plan.

Testing:
Manually modified a superset and a subset test to check that tests fail
as expected.

Change-Id: Ia7a28d421c8e7cd84b14d07fcb71b76449156409
Reviewed-on: http://gerrit.cloudera.org:8080/10155
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-26 00:56:36 +00:00