Commit Graph

1133 Commits

poojanilangekar
c6f9b61ec2 IMPALA-6625: Skip computing parquet conjuncts for non-Parquet scans
This change ensures that the planner computes parquet conjuncts
only for scans containing parquet files. Additionally, it
handles the PARQUET_DICTIONARY_FILTERING and
PARQUET_READ_STATISTICS query options in the planner.
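The gating described above can be sketched roughly as follows (illustrative Python with hypothetical names; the real logic lives in Impala's Java planner):

```python
# Hypothetical sketch: compute Parquet conjuncts only when the scan
# actually includes a Parquet partition, honoring the query options.
# Names and the exact option semantics are illustrative, not Impala's API.
def should_compute_parquet_conjuncts(partition_formats, query_options):
    # Skip entirely if no scanned partition is Parquet.
    if not any(fmt == "PARQUET" for fmt in partition_formats):
        return False
    # Compute conjuncts if either Parquet filtering feature is enabled.
    return (query_options.get("PARQUET_DICTIONARY_FILTERING", True)
            or query_options.get("PARQUET_READ_STATISTICS", True))
```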

Testing was carried out independently on parquet and non-parquet
scans:
  1. Parquet scans were tested via the existing parquet-filtering
     planner test. Additionally, a new test
     [parquet-filtering-disabled] was added to ensure that the
     explain plan generated skips parquet predicates based on the
     query options.
  2. Non-parquet scans were tested manually to ensure that the
     functions to compute parquet conjuncts were not invoked.
     Additional test cases were added to the parquet-filtering
     planner test to scan non parquet tables and ensure that the
     plans do not contain conjuncts based on parquet statistics.
  3. A parquet partition was added to the alltypesmixedformat
     table in the functional database. Planner tests were added
     to ensure that Parquet conjuncts are constructed only when
     the Parquet partition is included in the query.

Change-Id: I9d6c26d42db090c8a15c602f6419ad6399c329e7
Reviewed-on: http://gerrit.cloudera.org:8080/10704
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-07-06 02:06:50 +00:00
Tianyi Wang
61e6a47776 IMPALA-7236: Fix the parsing of ALLOW_ERASURE_CODED_FILES
This patch adds a missing "break" statement in a switch statement
changed by IMPALA-7102.
Also fixes a non-deterministic test case.

Change-Id: Ife1e791541e3f4fed6bec00945390c7d7681e824
Reviewed-on: http://gerrit.cloudera.org:8080/10857
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-07-03 23:49:44 +00:00
Bikramjeet Vig
30e82c63ec IMPALA-7190: Remove unsupported format writer support
This patch removes write support for unsupported formats like Sequence,
Avro and compressed text. Also, the related query options
ALLOW_UNSUPPORTED_FORMATS and SEQ_COMPRESSION_MODE have been migrated
to the REMOVED query options type.

Testing:
Ran exhaustive build.

Change-Id: I821dc7495a901f1658daa500daf3791b386c7185
Reviewed-on: http://gerrit.cloudera.org:8080/10823
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-07-03 20:34:27 +00:00
Taras Bobrovytsky
8060f4d50e IMPALA-7102 (Part 1): Disable reading of erasure coding by default
In this patch we add a query option ALLOW_ERASURE_CODED_FILES that
allows us to enable or disable support for erasure-coded files. Even
though Impala should be able to handle HDFS erasure-coded files already,
this feature hasn't been tested thoroughly yet. Also, Impala lacks
metrics, observability and DDL commands related to erasure coding. This
is a query option instead of a startup flag because we want to make it
possible for advanced users to enable the feature.
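The described gating amounts to a simple check per scanned file; a minimal sketch (hypothetical names, not Impala's actual code):

```python
# Illustrative sketch: reject scans over erasure-coded files unless the
# ALLOW_ERASURE_CODED_FILES query option is set. File descriptors here
# are plain dicts; the real check happens in Impala's scan setup.
def check_erasure_coded_files(files, allow_erasure_coded_files=False):
    for f in files:
        if f["is_erasure_coded"] and not allow_erasure_coded_files:
            raise RuntimeError(
                "Scanning erasure-coded file %s is disabled by default; "
                "set ALLOW_ERASURE_CODED_FILES to enable it." % f["path"])
```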

We may also need a follow-on patch to disable the write path with
this flag.

Cherry-picks: not for 2.x

Change-Id: Icd3b1754541262467a6e67068b0b447882a40fb3
Reviewed-on: http://gerrit.cloudera.org:8080/10646
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-29 23:26:35 +00:00
poojanilangekar
e988c36bf0 IMPALA-6305: Allow column definitions in ALTER VIEW
This change adds support to change column definitions in ALTER VIEW
statements. This support only required minor changes in the parser
and the AlterViewStmt constructor.

Here's an example syntax:
    alter view foo (a, b comment 'helloworld') as
    select * from bar;

    describe foo;
    +------+--------+------------+
    | name | type   | comment    |
    +------+--------+------------+
    | a    | string |            |
    | b    | string | helloworld |
    +------+--------+------------+

The following tests were modified:
1. ParserTest - To check that the parser handles column definitions
   for alter view statements.
2. AnalyzerDDLTest - To ensure that the analyzer supports the
   change column definitions parsed.
3. TestDdlStatements - To verify the end-to-end functioning of
   ALTER VIEW statements with change column definitions.
4. AuthorizationTest - To ensure that alter table commands with
   column definitions check permissions as expected.

Change-Id: I6073444a814a24d97e80df15fcd39be2812f63fc
Reviewed-on: http://gerrit.cloudera.org:8080/10720
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-27 03:41:47 +00:00
Attila Jeges
17749dbcfc IMPALA-3307: Add support for IANA time-zone db
Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.

Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
  Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
  changes.
- Time-zone database is not updated on a regular basis.

Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
  some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
  Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
  performance degradation.

In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.

This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
  time-zone conversions.

- Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to
  specify an HDFS/S3/ADLS path to a zip archive that contains the
  shared compiled IANA time-zone database. If the startup flag is set,
  impalad will use the specified time-zone database. Otherwise,
  impalad will use the default /usr/share/zoneinfo time-zone database.

- Introduces a new startup flag (--hdfs_zone_alias_conf) to impalad to
  specify an HDFS/S3/ADLS path to a shared config file that contains
  definitions for non-standard time-zone aliases.

- impalad reads the entire time-zone database into an in-memory
  map on startup for fast lookups.

- The name of the coordinator node’s local time-zone is saved to the
  query context when preparing query execution. This time-zone is used
  whenever the current time-zone is referred to afterwards in an
  execution node.

- Adds a new ZipUtil class to extract files from a zip archive. The
  implementation is not vulnerable to Zip Slip.
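The Zip Slip defence mentioned above boils down to rejecting archive entries whose resolved path escapes the extraction directory. A minimal sketch of that check (hypothetical helper, not Impala's ZipUtil):

```python
import os

# Illustrative Zip Slip guard: resolve the entry path against the
# destination directory and refuse anything that escapes it via "..".
def safe_extract_path(dest_dir, entry_name):
    dest_dir = os.path.abspath(dest_dir)
    target = os.path.abspath(os.path.join(dest_dir, entry_name))
    if not (target == dest_dir or target.startswith(dest_dir + os.sep)):
        raise ValueError("Zip entry escapes extraction dir: " + entry_name)
    return target
```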

Cherry-picks: not for 2.x.

Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Reviewed-on: http://gerrit.cloudera.org:8080/9986
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Attila Jeges <attilaj@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-22 13:18:58 +00:00
Tim Armstrong
894ab8e980 IMPALA-7115: set a default THREAD_RESERVATION_LIMIT value
The value is chosen to allow only queries that have a reasonable chance
of succeeding, albeit with poor performance because of the high
number of threads.

Testing:
Added a test to make sure that the default value rejects a large query.

Change-Id: I31d3fa3f6305c360922649dba53a9026c9563384
Reviewed-on: http://gerrit.cloudera.org:8080/10628
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-19 03:02:49 +00:00
Michael Ho
51ff47d05e IMPALA-5168: Codegen HASH_PARTITIONED KrpcDataStreamSender::Send()
This change codegens the hash partitioning logic of
KrpcDataStreamSender::Send() when the partitioning strategy
is HASH_PARTITIONED. It does so by unrolling the loop which
evaluates each row against the partitioning expressions and
hashes the result. It also replaces the number of channels
of that sender with a constant at runtime.
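As a rough model of what the codegen'd loop computes (illustrative Python; the real implementation is cross-compiled C++ with the channel count baked in as a constant):

```python
# Toy model of hash-partitioned routing: evaluate each partitioning
# expression on the row, combine the hashes, and pick a channel.
def route_row(row, partition_exprs, num_channels):
    h = 0
    for expr in partition_exprs:
        # Combine the running hash with this expression's value.
        h = hash((h, expr(row)))
    return h % num_channels
```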

With this change, we get reasonable speedup with some benchmarks:

+------------+-----------------------+---------+------------+------------+----------------+
| Workload   | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+------------+-----------------------+---------+------------+------------+----------------+
| TPCH(_300) | parquet / none / none | 20.03   | -6.44%     | 13.56      | -7.15%         |
+------------+-----------------------+---------+------------+------------+----------------+

+---------------------+-----------------------+---------+------------+------------+----------------+
| Workload            | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+---------------------+-----------------------+---------+------------+------------+----------------+
| TARGETED-PERF(_300) | parquet / none / none | 58.59   | -5.56%     | 12.28      | -5.30%         |
+---------------------+-----------------------+---------+------------+------------+----------------+

+-------------------------+-----------------------+---------+------------+------------+----------------+
| Workload                | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+-------------------------+-----------------------+---------+------------+------------+----------------+
| TPCDS-UNMODIFIED(_1000) | parquet / none / none | 15.60   | -3.10%     | 7.16       | -4.33%         |
+-------------------------+-----------------------+---------+------------+------------+----------------+

+-------------------+-----------------------+---------+------------+------------+----------------+
| Workload          | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+-------------------+-----------------------+---------+------------+------------+----------------+
| TPCH_NESTED(_300) | parquet / none / none | 30.93   | -3.02%     | 17.46      | -4.71%         |
+-------------------+-----------------------+---------+------------+------------+----------------+

Change-Id: I1c44cc9312c062cc7a5a4ac9156ceaa31fb887ff
Reviewed-on: http://gerrit.cloudera.org:8080/10421
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-14 23:37:00 +00:00
Tim Armstrong
d8ed07f112 IMPALA-6035: Add query options to limit thread reservation
Adds two options: THREAD_RESERVATION_LIMIT and
THREAD_RESERVATION_AGGREGATE_LIMIT, which are both enforced by admission
control based on planner resource requirements and the schedule. The
mechanism used is the same as the minimum reservation checks.

THREAD_RESERVATION_LIMIT limits the total number of reserved threads in
fragments scheduled on a single backend.
THREAD_RESERVATION_AGGREGATE_LIMIT limits the sum of reserved threads
across all fragments.
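The two checks can be sketched as follows (illustrative Python; the real enforcement is in admission control using the planner's resource requirements and the schedule):

```python
# Hedged sketch of the two limits: per-backend reserved threads and the
# aggregate sum across all backends. Returns an error string or None.
def check_thread_reservation(per_backend_threads, limit=None, agg_limit=None):
    """per_backend_threads maps backend host -> reserved thread count."""
    if limit is not None:
        for host, threads in per_backend_threads.items():
            if threads > limit:
                return "host %s exceeds THREAD_RESERVATION_LIMIT" % host
    if agg_limit is not None and sum(per_backend_threads.values()) > agg_limit:
        return "total exceeds THREAD_RESERVATION_AGGREGATE_LIMIT"
    return None
```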

This also slightly improves the minimum reservation error message to
include the host name.

Testing:
Added end-to-end tests that exercise the code paths.

Ran core tests.

Change-Id: I5b5bbbdad5cd6b24442eb6c99a4d38c2ad710007
Reviewed-on: http://gerrit.cloudera.org:8080/10365
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-14 03:25:55 +00:00
Thomas Tauber-Marshall
bf2124bf30 IMPALA-6929: Support multi-column range partitions for Kudu
Kudu allows specifying range partitions over multiple columns. Impala
already has support for doing this when the partitions are specified
with '=', but if the partitions are specified with '<' or '<=', the
parser would return an error.

This patch modifies the parser to allow for creating Kudu tables like:
create table kudu_test (a int, b int, primary key(a, b))
  partition by range(a, b) (partition (0, 0) <= values < (1, 1));
and similarly to alter partitions like:
alter table kudu_test add range partition (1, 1) <= values < (2, 2);

Testing:
- Modified functional_kudu.jointbl's schema so that we have a table
  in functional with a multi-column range partition to test things
  against.
- Added FE and E2E tests for CREATE and ALTER.

Change-Id: I0141dd3344a4f22b186f513b7406f286668ef1e7
Reviewed-on: http://gerrit.cloudera.org:8080/10441
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-13 00:10:13 +00:00
Zoltan Borok-Nagy
e6ca7ca14d IMPALA-7108: IllegalStateException hit during CardinalityCheckNode.<init>
Since IMPALA-6314, we set LIMIT 2 on runtime scalar subqueries
in StmtRewriter.mergeExpr(). We do that because later we add a
CardinalityCheckNode on top of such subqueries, and with
LIMIT 2 we can still check if they return more than one row.

In the constructor of CardinalityCheckNode there is a
precondition that checks if the child node has LIMIT 2 to
be certain that we've set the limit for all the necessary
cases.

However, some subqueries will get a LIMIT 1 later, breaking the
precondition in CardinalityCheckNode. An example of such a
subquery is a select stmt that selects from an inline view
that returns a single row:

select * from functional.alltypes
where int_col = (select f.id from (
                 select * from functional.alltypes limit 1) f);

Note that we shouldn't add a CardinalityCheckNode to the plan
of this query in the first place. To generate a proper plan I
updated SelectStmt.returnsSingleRow() because this method didn't
handle this case well.

I also changed the precondition from
    Preconditions.checkState(child.getLimit() == 2);
to
    Preconditions.checkState(child.getLimit() <= 2);
in order to be more permissive.

I added tests for the aforementioned query.

Change-Id: I82a7a3fe26db3e12131c030c4ad055a9c4955407
Reviewed-on: http://gerrit.cloudera.org:8080/10605
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-08 20:15:50 +00:00
Lars Volker
c9e8f2f7e7 IMPALA-7008: Rewrite query to make it not return 100M rows
One query in spilling.test is expected to fail. When it does not fail,
it returns 100M rows, which then causes the Python test code to
consume memory until it gets OOM-killed by the kernel.

To fix this, we rewrite the query. I tested this locally to make sure
that the query still fails as expected on HDFS.

Change-Id: I31956d3092a7e69b979f631df3a6dfda14ebe140
Reviewed-on: http://gerrit.cloudera.org:8080/10597
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-05 01:14:35 +00:00
Thomas Tauber-Marshall
ba7893cb9e IMPALA-6338: Disable more flaky bloom filter tests
Until IMPALA-6338 is fixed, temporarily disable tests that are
affected by it - any test that has a 'limit' and relies on the
contents of the runtime profile. This patch disables the runtime
profile check for all such tests in bloom_filter.test

Change-Id: Ifc9da892efa3b27d63056ad8e3befac82808ffdb
Reviewed-on: http://gerrit.cloudera.org:8080/10530
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-30 08:00:50 +00:00
Tim Armstrong
f4f28d310c IMPALA-6941: load more text scanner compression plugins
Add extensions for LZ4 and ZSTD (which are supported by Hadoop).
Even without a plugin this results in better behaviour because
we don't try to treat the files with unknown extensions as
uncompressed text.
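The behaviour change amounts to extending the extension lookup table; a rough model (illustrative entries and names, not Impala's actual table):

```python
import os

# Illustrative extension -> (codec, supported-without-plugin) mapping.
# Known-but-unsupported codecs are flagged rather than misread as
# uncompressed text; only truly unknown extensions fall through.
KNOWN_EXTENSIONS = {
    ".gz":     ("GZIP",   True),
    ".bz2":    ("BZIP2",  True),
    ".snappy": ("SNAPPY", True),
    ".lzo":    ("LZO",    False),  # requires the LZO plugin
    ".lz4":    ("LZ4",    False),  # plugin-only, added by this patch
    ".zst":    ("ZSTD",   False),  # plugin-only, added by this patch
}

def classify(filename):
    ext = os.path.splitext(filename)[1]
    if ext in KNOWN_EXTENSIONS:
        return KNOWN_EXTENSIONS[ext]
    return ("NONE", True)  # unknown extension: treat as uncompressed text
```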

Also allow loading tables containing files with unsupported
compression types. There was weird behaviour before when we
knew the file extension but didn't support querying the table:
the catalog would load the table but the impalad would fail
processing the catalog update. The simplest way to fix it
is to just allow loading the tables.

Similarly, make the "LOAD DATA" operation more permissive -
we can copy files into a directory even if we can't
decompress them.

Switch to always checking the plugin version - running a mismatched
plugin is inherently unsafe.

Testing:
Positive case where LZO is loaded is exercised. Added
coverage for negative case where LZO is disabled.

Fixed test gaps:
* Querying LZO table with LZO plugin not available.
* Interacting with tables with known but unsupported text
  compressions.
* Querying files with unknown compression suffixes (which are
  treated as uncompressed text).

Change-Id: If2a9c4a4a11bed81df706e9e834400bfedfe48e6
Reviewed-on: http://gerrit.cloudera.org:8080/10165
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-18 03:44:46 +00:00
Zoltan Borok-Nagy
ccf19f9f8f IMPALA-5842: Write page index in Parquet files
This commit builds on the previous work of
Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/

The commit implements the write path of PARQUET-922:
"Add column indexes to parquet.thrift". As specified in the
parquet-format, Impala writes the page indexes just before
the footer. This allows much more efficient page filtering
than using the same information from the 'statistics' field
of DataPageHeader.

I updated Pooja's python tests as well.

Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Reviewed-on: http://gerrit.cloudera.org:8080/9693
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-17 20:22:02 +00:00
Lars Volker
a64cfc523e IMPALA-7032: Disable codegen for CHAR type null literals
Analogous to IMPALA-6435, we have to disable codegen for CHAR type null
literals. Otherwise we will crash in
impala::NullLiteral::GetCodegendComputeFn().

This change adds a test to make sure that the crash is fixed.

Change-Id: I34033362263cf1292418f69c5ca1a3b84aed39a9
Reviewed-on: http://gerrit.cloudera.org:8080/10409
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Lars Volker <lv@cloudera.com>
2018-05-16 00:00:15 +00:00
Zoltan Borok-Nagy
fab65d4479 IMPALA-7022: TestQueries.test_subquery: Subquery must not return more than one row
TestQueries.test_subquery sometimes fails during exhaustive
tests.

In the tests we expect to catch an exception that is
prefixed by the "Query aborted:" string. The prefix is
usually added by impala_beeswax.py::wait_for_completion(),
but in rare cases it isn't added.

From the point of view of the test, it is irrelevant whether the
exception is prefixed with "Query aborted:" or not, so I removed it
from the expected exception string.

Change-Id: I3b8655ad273b1dd7a601099f617db609e4a4797b
Reviewed-on: http://gerrit.cloudera.org:8080/10407
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2018-05-15 23:37:06 +00:00
Tim Armstrong
3661100fa3 IMPALA-6645: Enable disk spill encryption by default
Perf:
Targeted benchmarks with a heavily spilling query on a machine
with PCLMULQDQ support show < 5% of CPU time spent in encryption and
decryption. PCLMULQDQ was introduced in AMD Bulldozer (c. 2011)
and Intel Westmere (c. 2010).

Testing:
Ran core tests with the change.

Updated the custom cluster test to exercise the non-default
configuration.

Change-Id: Iee4be2a95d689f66c3663d99e4df0fb3968893a9
Reviewed-on: http://gerrit.cloudera.org:8080/10345
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2018-05-15 22:23:14 +00:00
Tim Armstrong
e12ee485cf IMPALA-6957: calc thread resource requirement in planner
This only factors in fragment execution threads. E.g. this does *not*
try to account for the number of threads on the old Thrift RPC
code path if that is enabled.

This is loosely related to the old VCores estimate, but is different in
that it:
* Directly ties into the notion of required threads in
  ThreadResourceMgr.
* Is a strict upper bound on the number of such threads, rather than
  an estimate.

Does not include "optional" threads. ThreadResourceMgr in the backend
bounds the number of "optional" threads per impalad, so the number of
execution threads on a backend is limited by

  sum(required threads per query) +
      CpuInfo::num_cores() * FLAGS_num_threads_per_core
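That bound is a straightforward sum; as a tiny calculation (illustrative parameter names mirroring the formula above):

```python
# Per-backend upper bound on execution threads: required threads are
# always admitted, optional threads are capped at cores * threads/core.
def max_execution_threads(required_per_query, num_cores,
                          num_threads_per_core=1):
    return sum(required_per_query) + num_cores * num_threads_per_core
```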

DCHECKS in the backend enforce that the calculation is correct. They
were actually hit in KuduScanNode because of some races in thread
management leading to multiple "required" threads running. Now the
first thread in the multithreaded scans never exits, which means
that it's always safe for any of the other threads to exit early,
which simplifies the logic a lot.

Testing:
Updated planner tests.

Ran core tests.

Change-Id: I982837ef883457fa4d2adc3bdbdc727353469140
Reviewed-on: http://gerrit.cloudera.org:8080/10256
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-12 01:43:37 +00:00
Tim Armstrong
25c13bfdd6 IMPALA-7010: don't run memory usage tests on non-HDFS
Moved a number of tests with tuned mem_limits. In some cases
this required separating the tests from non-tuned functional
tests.

TestQueryMemLimit used very high and very low limits only, so seemed
safe to run in all configurations.

Change-Id: I9686195a29dde2d87b19ef8bb0e93e08f8bee662
Reviewed-on: http://gerrit.cloudera.org:8080/10370
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-11 22:41:49 +00:00
Sailesh Mukil
f13abdca67 IMPALA-6975: TestRuntimeRowFilters.test_row_filters failing with Memory limit exceeded
This test has started failing relatively frequently. We think that
this may be due to timing differences in when RPCs arrive after the
recent KRPC changes.

Increasing the memory limit should allow this test to pass
consistently.

Change-Id: Ie39482e2a0aee402ce156b11cce51038cff5e61a
Reviewed-on: http://gerrit.cloudera.org:8080/10315
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-05 03:01:35 +00:00
Thomas Tauber-Marshall
ba84ad03cb IMPALA-6954: Fix problems with CTAS into Kudu with an expr rewrite
This patch fixes two problems:
- Previously a CTAS into a Kudu table where an expr rewrite occurred
  would create an unpartitioned table, due to the partition info being
  reset in TableDataLayout and then never reconstructed. Since the
  Kudu partition info is set by the parser and never changes, the
  solution is to not reset it.
- Previously a CTAS into a Kudu table with a range partition where an
  expr rewrite occurred would fail with an analysis exception due to
  a Precondition check in RangePartition.analyze that checked that
  the RangePartition wasn't already analyzed, as the analysis can't
  be done twice. Since the state in RangePartition never changes, it
  doesn't need to be reanalyzed and we can just return instead of
  failing on the check.

Testing:
- Added an e2e test that creates a partitioned Kudu table with a CTAS
  with a rewrite, and checks that the expected partitions are created.

Change-Id: I731743bd84cc695119e99342e1b155096147f0ed
Reviewed-on: http://gerrit.cloudera.org:8080/10251
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-02 02:54:23 +00:00
Tim Armstrong
418c705787 IMPALA-6679,IMPALA-6678: reduce scan reservation
This has two related changes.

IMPALA-6679: defer scanner reservation increases
------------------------------------------------
When starting each scan range, check to see how big the initial scan
range is (the full thing for row-based formats, the footer for
Parquet) and determine whether more reservation would be useful.

For Parquet, base the ideal reservation on the actual column layout
of each file. This avoids reserving memory that we won't use for
the actual files that we're scanning. This also avoid the need to
estimate ideal reservation in the planner.

We also release scanner thread reservations above the minimum as
soon as threads complete, so that resources can be released slightly
earlier.

IMPALA-6678: estimate Parquet column size for reservation
---------------------------------------------------------
This change also reduces reservation computed by the planner in certain
cases by estimating the on-disk size of column data based on stats. It
also reduces the default per-column reservation to 4MB since it appears
that < 8MB columns are generally common in practice and the method for
estimating column size is biased towards over-estimating. There are two
main cases to consider for the performance implications:
* Memory is available to improve query perf - if we underestimate, we
  can increase the reservation so we can do "efficient" 8MB I/Os for
  large columns.
* The ideal reservation is not available - query performance is affected
  because we can't overlap I/O and compute as much and may do smaller
  (probably 4MB I/Os). However, we should avoid pathological behaviour
  like tiny I/Os.

When stats are not available, we just default to reserving 4MB per
column, which typically is more memory than required. When stats are
available, the reservation can be reduced below that when heuristics
tell us with high confidence that the column data for most or all
files is smaller than 4MB.
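A minimal sketch of this estimate (illustrative only; the planner's actual heuristics and confidence checks are more involved):

```python
DEFAULT_COLUMN_RESERVATION = 4 * 1024 * 1024  # 4MB default, per the patch

# Hedged sketch: fall back to the 4MB default without stats; with stats,
# reserve no more than the estimated on-disk size of the column data.
def column_reservation(estimated_on_disk_bytes=None):
    if estimated_on_disk_bytes is None:  # no stats available
        return DEFAULT_COLUMN_RESERVATION
    return min(DEFAULT_COLUMN_RESERVATION, estimated_on_disk_bytes)
```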

The stats-based heuristic could reduce scan performance if both the
conservative heuristics significantly underestimate the column size
and memory is constrained such that we can't increase the scan
reservation at runtime (in which case the memory might be used by
a different operator or scanner thread).

Observability:
Added counters to track when threads were not spawned due to reservation
and to track when reservation increases are requested and denied. These
allow determining if performance may have been affected by memory
availability.

Testing:
Updated test_mem_usage_scaling.py memory requirements and added steps
to regenerate the requirements. Loops test for a while to flush out
flakiness.

Added targeted planner and query tests for reservation calculations and
increases.

Change-Id: Ifc80e05118a9eef72cac8e2308418122e3ee0842
Reviewed-on: http://gerrit.cloudera.org:8080/9757
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-28 23:41:39 +00:00
Tim Armstrong
93d714c645 IMPALA-6560: fix regression test for IMPALA-2376
The test is modified to increase the size of collections allocated.
num_nodes and mt_dop query options are set to make execution as
deterministic as possible.

I looped the test overnight to try to flush out flakiness.

Adds support for row_regex lines in CATCH sections so that we can
match a larger part of the error message.

Change-Id: I024cb6b57647902b1735defb885cd095fd99738c
Reviewed-on: http://gerrit.cloudera.org:8080/9681
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2018-04-28 23:41:39 +00:00
Tim Armstrong
d7bba82192 IMPALA-6587: free buffers before ScanRange::Cancel() returns
ScanRange::Cancel() now waits until an in-flight read finishes so
that the disk I/O buffer being processed by the disk thread is
freed when Cancel() returns.

The fix is to set a 'read_in_flight_' flag on the scan range
while the disk thread is doing the read. Cancel() blocks until
read_in_flight_ == false.
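The synchronisation pattern is a standard condition-variable wait; a toy Python model of the fix (illustrative, not the C++ DiskIoMgr code):

```python
import threading

# Toy model: Cancel() blocks until the disk thread clears the in-flight
# flag, so the buffer being filled is guaranteed freed before it returns.
class ScanRange:
    def __init__(self):
        self._cv = threading.Condition()
        self.read_in_flight = False

    def begin_read(self):       # called by the disk thread
        with self._cv:
            self.read_in_flight = True

    def end_read(self):         # called by the disk thread when done
        with self._cv:
            self.read_in_flight = False
            self._cv.notify_all()

    def cancel(self):
        # Wait out any in-flight read before returning.
        with self._cv:
            while self.read_in_flight:
                self._cv.wait()
```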

The code is refactored to move more logic into ScanRange and
to avoid holding RequestContext::lock_ for longer than necessary.

Testing:
Added query test that reproduces the issue.

Added a unit test and a stress option that reproduces the problem in a
targeted way.

Ran disk-io-mgr-stress test for a few hours. Ran it under TSAN and
inspected output to make sure there were no non-benign data races.

Change-Id: I87182b6bd51b5fb0b923e7e4c8d08a44e7617db2
Reviewed-on: http://gerrit.cloudera.org:8080/9680
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2018-04-28 23:41:39 +00:00
Tim Armstrong
fb5dc9eb48 IMPALA-4835: switch I/O buffers to buffer pool
This is the following squashed patches that were reverted.

I will fix the known issues with some follow-on patches.

======================================================================
IMPALA-4835: Part 1: simplify I/O mgr mem mgmt and cancellation

In preparation for switching the I/O mgr to the buffer pool, this
removes and cleans up a lot of code so that the switchover patch starts
from a cleaner slate.

* Remove the free buffer cache (which will be replaced by buffer pool's
  own caching).
* Make memory limit exceeded error checking synchronous (in anticipation
  of having to propagate buffer pool errors synchronously).
* Simplify error propagation - remove the (ineffectual) code that
  enqueued BufferDescriptors containing error statuses.
* Document locking scheme better in a few places, make it part of the
  function signature when it seemed reasonable.
* Move ReturnBuffer() to ScanRange, because it is intrinsically
  connected with the lifecycle of a scan range.
* Separate external ReturnBuffer() and internal CleanUpBuffer()
  interfaces - previously callers of ReturnBuffer() were fudging
  the num_buffers_in_reader accounting to make the external interface work.
* Eliminate redundant state in ScanRange: 'eosr_returned_' and
  'is_cancelled_'.
* Clarify the logic around calling Close() for the last
  BufferDescriptor.
  -> There appeared to be an implicit assumption that buffers would be
     freed in the order they were returned from the scan range, so that
     the "eos" buffer was returned last. Instead just count the number
     of outstanding buffers to detect the last one.
  -> Touching the is_cancelled_ field without holding a lock was hard to
     reason about - violated locking rules and it was unclear that it
     was race-free.
* Remove DiskIoMgr::Read() to simplify the interface. It is trivial to
  inline at the callsites.

This will probably regress performance somewhat because of the cache
removal, so my plan is to merge it around the same time as switching
the I/O mgr to allocate from the buffer pool. I'm keeping the patches
separate to make reviewing easier.

Testing:
* Ran exhaustive tests
* Ran the disk-io-mgr-stress-test overnight

======================================================================
IMPALA-4835: Part 2: Allocate scan range buffers upfront

This change is a step towards reserving memory for buffers from the
buffer pool and constraining per-scanner memory requirements. This
change restructures the DiskIoMgr code so that each ScanRange operates
with a fixed set of buffers that are allocated upfront and recycled as
the I/O mgr works through the ScanRange.

One major change is that ScanRanges get blocked when a buffer is not
available and get unblocked when a client returns a buffer via
ReturnBuffer(). I was able to remove the logic to maintain the
blocked_ranges_ list by instead adding a separate set with all ranges
that are active.

There is also some miscellaneous cleanup included - e.g. reducing the
amount of code devoted to maintaining counters and metrics.

One tricky part of the existing code was that it called
IssueInitialRanges() with empty lists of files and depended on
DiskIoMgr::AddScanRanges() to not check for cancellation in that case.
See IMPALA-6564/IMPALA-6588. I changed the logic to not try to issue
ranges for empty lists of files.

I plan to merge this along with the actual buffer pool switch, but
separated it out to allow review of the DiskIoMgr changes separate from
other aspects of the buffer pool switchover.

Testing:
* Ran core and exhaustive tests.

======================================================================
IMPALA-4835: Part 3: switch I/O buffers to buffer pool

This is the final patch to switch the Disk I/O manager to allocate all
buffers from the buffer pool and to reserve the buffers required for
a query upfront.

* The planner reserves enough memory to run a single scanner per
  scan node.
* The multi-threaded scan node must increase reservation before
  spinning up more threads.
* The scanner implementations must be careful to stay within their
  assigned reservation.

The row-oriented scanners were most straightforward, since they only
have a single scan range active at a time. A single I/O buffer is
sufficient to scan the whole file but more I/O buffers can improve I/O
throughput.

Parquet is more complex because it issues a scan range per column and
the sizes of the columns on disk are not known during planning. To
deal with this, the reservation in the frontend is based on a
heuristic involving the file size and # columns. The Parquet scanner
can then divvy up reservation to columns based on the size of column
data on disk.
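The divvying step might look roughly like this (MIN_BUFFER and the proportional rule are assumptions for illustration, not the scanner's actual heuristic):

```python
MIN_BUFFER = 8 * 1024  # assumed floor: one minimum-sized I/O buffer per column

def divide_reservation(total, col_sizes):
    """Split 'total' bytes of reservation across columns in proportion to
    their on-disk sizes, giving every column at least MIN_BUFFER."""
    total_size = sum(col_sizes)
    return [max(MIN_BUFFER, total * s // total_size) for s in col_sizes]

# A wide column gets most of the reservation; a tiny one gets the floor.
shares = divide_reservation(4 * 1024 * 1024, [3_000_000, 1_000_000, 1_000])
assert shares[0] > shares[1] > shares[2]
assert shares[2] == MIN_BUFFER
```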

I adjusted how the 'mem_limit' is divided between buffer pool and
non-buffer-pool memory for low mem_limits to account for the increase in
buffer pool memory.

Testing:
* Added more planner tests to cover reservation calcs for scan node.
* Test scanners for all file formats with the reservation denial debug
  action, to test behaviour when the scanners hit reservation limits.
* Updated memory and buffer pool limits for tests.
* Added unit tests for dividing reservation between columns in parquet,
  since the algorithm is non-trivial.

Perf:
I ran TPC-H and targeted perf locally comparing with master. Both
showed small improvements of a few percent and no regressions of
note. Cluster perf tests showed no significant change.

Change-Id: I3ef471dc0746f0ab93b572c34024fc7343161f00
Reviewed-on: http://gerrit.cloudera.org:8080/9679
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2018-04-28 23:41:39 +00:00
Taras Bobrovytsky
d0f838b66a IMPALA-6340,IMPALA-6518: Check that decimal types are compatible in FE
In this patch we implement strict decimal type checking in the FE in
various situations when DECIMAL_V2 is enabled. What is affected:
- Union. If we union two decimals and it is not possible to come up
  with a decimal that will be able to contain all the digits, an error
  is thrown. For example, the union(decimal(20, 10), decimal(20, 20))
  returns decimal(30, 20). However, for union(decimal(38, 0),
  decimal(38, 38)) the ideal return type would be decimal(76,38), but
  this is too large, so an error is thrown.
- Insert. If we are inserting a decimal value into a column where we are
  not guaranteed that all digits will fit, an error is thrown. For
  example, inserting a decimal(38,0) value into a decimal(38,38) column.
- Functions such as coalesce(). If we are unable to determine the output
  type that guarantees that all digits will fit from all the arguments,
  an error is thrown. For example,
  coalesce(decimal(38,38), decimal(38,0)) will throw an error.
- Hash Join. When joining on two decimals, if a type cannot be
  determined that both columns can be cast to, we throw an error.
  For example, join on decimal(38,0) and decimal(38,38) will result
  in an error.

To avoid these errors, you need to use CAST() on some of the decimals.
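The widening rule behind the union examples above can be sketched as follows (MAX_PRECISION = 38 matches Impala's decimal limit; the formula is the standard decimal-union rule and is an assumption here, not a quote of the FE code):

```python
MAX_PRECISION = 38  # Impala's maximum decimal precision

def union_decimal(p1, s1, p2, s2):
    """Smallest decimal(p, s) that can hold all digits of both inputs;
    raises (strict DECIMAL_V2 behavior) if no such type exists."""
    scale = max(s1, s2)                       # keep all fractional digits
    precision = max(p1 - s1, p2 - s2) + scale # keep all integer digits
    if precision > MAX_PRECISION:
        raise ValueError("incompatible decimals; add an explicit CAST()")
    return precision, scale

assert union_decimal(20, 10, 20, 20) == (30, 20)  # example from above
# union_decimal(38, 0, 38, 38) would need decimal(76, 38) -> error
```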

In this patch we also change the output decimal calculation of decimal
round, truncate and related functions. If these functions are a no-op,
the resulting decimal type is the same as the input type.

Testing:
- Ran a core build which passed.

Change-Id: Id406f4189e01a909152985fabd5cca7a1527a568
Reviewed-on: http://gerrit.cloudera.org:8080/9930
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-28 03:33:02 +00:00
Thomas Tauber-Marshall
87be63e321 IMPALA-6821: Push down limits into Kudu
This patch takes advantage of a recent change in Kudu (KUDU-16) that
exposes the ability to set limits on KuduScanners. Since each
KuduScanner corresponds to a scan token, and there will be multiple
scan tokens per query, this is just a performance optimization in
cases where the limit is smaller than the number of rows per token,
and Impala still needs to apply the limit on its side for cases where
the limit is greater than the number of rows per token.
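The two-level limit enforcement described above can be sketched like this (pure illustration, not the Kudu client API):

```python
def scan_with_limit(tokens, limit):
    """Each token's scanner is capped at 'limit' (the pushed-down part),
    but the scan node must still enforce the limit over the merged stream,
    since every token may independently return up to 'limit' rows."""
    rows = []
    for token_rows in tokens:
        capped = token_rows[:limit]     # per-scanner pushed-down limit
        for r in capped:
            rows.append(r)
            if len(rows) == limit:      # node-side enforcement
                return rows
    return rows

assert scan_with_limit([[1, 2, 3], [4, 5]], limit=4) == [1, 2, 3, 4]
```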

Testing:
- Added e2e tests for various situations where limits are applied at
  a Kudu scan node.
- For the query 'select * from tpch_kudu.lineitem limit 1', a best
  case perf scenario for this change where the limit is highly
  effective, the time spent in the Kudu scan node was reduced from
  6.107ms to 3.498ms (avg over 3 runs).
- For the query 'select count(*) from (select * from
  tpch_kudu.lineitem limit 1000000) v', a worst case perf scenario for
  this change where the limit is ineffective, the time spent in the
  Kudu scan node was essentially unchanged, 32.815ms previously vs.
  29.532ms (avg over 3 runs).

Change-Id: Ibe35e70065d8706b575e24fe20902cd405b49941
Reviewed-on: http://gerrit.cloudera.org:8080/10119
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-27 21:55:11 +00:00
Zoltan Borok-Nagy
1e79f14798 IMPALA-6314: Add run time scalar subquery check for uncorrelated subqueries
If a scalar subquery is used with a binary predicate or in an
arithmetic expression, it must return only one row/column to be
valid. If this cannot be guaranteed at parse time through a
single-row aggregate or a LIMIT clause, Impala rejects the query.

E.g., currently the following query is not allowed:
SELECT bigint_col
FROM alltypesagg
WHERE id = (SELECT id FROM alltypesagg WHERE id = 1)

However, it would be allowed if the query contained a LIMIT 1
clause, or used max(id) instead of id.

This commit makes the example valid by introducing a
runtime check to test if the subquery returns a single
row. If the subquery returns more than one row, it
aborts the query with an error.

I added a new node type, called CardinalityCheckNode. It
is created during planning on top of the subquery when
needed, then during execution it checks if its child
only returns a single row.

I extended the frontend tests and e2e tests as well.

Change-Id: I0f52b93a60eeacedd242a2f17fa6b99c4fc38e06
Reviewed-on: http://gerrit.cloudera.org:8080/9005
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-27 20:06:56 +00:00
Zoltan Borok-Nagy
25422c74b2 IMPALA-6934: Wrong results with EXISTS subquery containing ORDER BY, LIMIT, and OFFSET
Queries may return wrong results if an EXISTS subquery has
an ORDER BY with a LIMIT and OFFSET clause. The EXISTS
subquery may incorrectly evaluate to TRUE even though it is
FALSE.

The bug was found during the code review of IMPALA-6314
(https://gerrit.cloudera.org/#/c/9005/). Turned out
QueryStmt.setLimit() wipes the offset. I modified it to
keep the offset expr.

Added tests to 'PlannerTest/subquery-rewrite.test' and
'QueryTest/subquery.test'

Change-Id: I9693623d3d0a8446913261252f8e4a07935645e0
Reviewed-on: http://gerrit.cloudera.org:8080/10218
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-26 20:12:38 +00:00
Tim Armstrong
d879fa9930 IMPALA-6905: support regexes with more verifiers
Support row_regex and other lines for the subset and superset verifiers,
which previously assumed that lines in the actual and expected had to
match exactly.

Use in test_stats_extrapolation to make the test more robust to
irrelevant changes in the explain plan.
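A minimal sketch of a regex-aware subset verifier (the "row_regex:" prefix follows the convention of Impala's test files; the helper names are assumptions):

```python
import re

def line_matches(expected, actual):
    """An expected line matches verbatim, or as a regex if it carries
    the row_regex: prefix."""
    if expected.startswith('row_regex:'):
        pattern = expected[len('row_regex:'):].strip()
        return re.match(pattern, actual) is not None
    return expected == actual

def is_subset(expected_lines, actual_lines):
    """Every expected line must match some actual line (VERIFY_IS_SUBSET
    style); exact equality is no longer assumed."""
    return all(any(line_matches(e, a) for a in actual_lines)
               for e in expected_lines)

assert is_subset(['row_regex: .*SCAN HDFS.*'],
                 ['00:SCAN HDFS [tpch.lineitem]'])
```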

Testing:
Manually modified a superset and a subset test to check that tests fail
as expected.

Change-Id: Ia7a28d421c8e7cd84b14d07fcb71b76449156409
Reviewed-on: http://gerrit.cloudera.org:8080/10155
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-26 00:56:36 +00:00
Tianyi Wang
8e86678d65 IMPALA-5690: Part 1: Rename ostream operators for thrift types
Thrift 0.9.3 implements "ostream& operator<<(ostream&, T)" for thrift
data types, while Impala did the same for enums and special types
including TNetworkAddress and TUniqueId. To prepare for the upgrade to
Thrift 0.9.3, this patch renames these Impala-defined functions. In the
absence of operator<<, assertion macros like DCHECK_EQ can no longer be
used on non-enum thrift defined types.

Change-Id: I9c303997411237e988ef960157f781776f6fcb60
Reviewed-on: http://gerrit.cloudera.org:8080/9168
Reviewed-by: Tianyi Wang <twang@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-20 10:28:12 +00:00
Thomas Tauber-Marshall
b68e06997c IMPALA-6880: disable flaky bloom filter test
This test is made flaky by IMPALA-6338. While that is being worked on,
temporarily disable this test.

Change-Id: I595645b0f2875614294adc7abb4572aec1be8ad5
Reviewed-on: http://gerrit.cloudera.org:8080/10122
Reviewed-by: Vuk Ercegovac <vercegovac@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-19 23:43:19 +00:00
Tim Armstrong
3ebf30a2a4 IMPALA-6847: work around high memory estimates for AC
Adds MAX_MEM_ESTIMATE_FOR_ADMISSION query option, which takes
effect if and only if
* Memory-based admission control is enabled for the pool
* No mem_limit is set (i.e. best practices are not being followed)

In that case min(MAX_MEM_ESTIMATE_FOR_ADMISSION, mem_estimate)
is used for admission control instead of mem_estimate.

This provides a way to override the planner's estimate if
it happens to be incorrect and is preventing the query from
running. Setting MEM_LIMIT is usually a better alternative
but sometimes it is not feasible to set MEM_LIMIT for each
individual query.
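The admission-control rule above reduces to (the function name and signature are assumptions; the logic follows the commit text):

```python
def mem_for_admission(mem_estimate, mem_limit,
                      max_mem_estimate_for_admission):
    """Memory figure used by memory-based admission control."""
    if mem_limit is not None:
        return mem_limit                 # explicit MEM_LIMIT always wins
    if max_mem_estimate_for_admission is not None:
        # Option takes effect only when no mem_limit is set.
        return min(max_mem_estimate_for_admission, mem_estimate)
    return mem_estimate

GB = 1024 ** 3
# Planner over-estimates 40GB; the option caps what admission control sees.
assert mem_for_admission(40 * GB, None, 2 * GB) == 2 * GB
assert mem_for_admission(1 * GB, None, 2 * GB) == 1 * GB
```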

Testing:
Added an admission control test to verify that query option allows
queries with high estimates to run.

Also tested manually on a minicluster started with:

  start-impala-cluster.py --impalad_args='-vmodule admission-controller=3 \
      -default_pool_mem_limit 12884901888'

Change-Id: Ia5fc32a507ad0f00f564dfe4f954a829ac55d14e
Reviewed-on: http://gerrit.cloudera.org:8080/10058
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-18 01:18:20 +00:00
Tianyi Wang
9a751f00b8 IMPALA-6822: Add a query option to control shuffling by distinct exprs
IMPALA-4794 changed the distinct aggregation behavior to shuffling by
both grouping exprs and the distinct expr. It's slower in queries
where the NDVs of grouping exprs are high and data are uniformly
distributed among groups. This patch adds a query option controlling
this behavior, letting users switch to the old plan.

Change-Id: Icb4b4576fb29edd62cf4b4ba0719c0e0a2a5a8dc
Reviewed-on: http://gerrit.cloudera.org:8080/9949
Reviewed-by: Tianyi Wang <twang@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-12 22:01:35 +00:00
stiga-huang
818cd8fa27 IMPALA-5717: Support for reading ORC data files
This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies the input needed by the ORC reader, tracks the reader's memory
consumption, and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version used is release-1.4.3.

A startup option --enable_orc_scanner is added for this feature. It's
set to true by default. Setting it to false will fail queries on ORC
tables.

Currently, we only support reading primitive types. Writing to ORC
tables is not yet supported.
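The core of that columnar-to-row transfer can be illustrated in miniature (orc::ColumnVectorBatch and impala::RowBatch are the real C++ types; this Python sketch only shows the transposition):

```python
def column_batch_to_rows(columns):
    """columns: equally sized value lists, one per selected column, as an
    ORC-style column vector batch delivers them; returns row tuples."""
    num_rows = len(columns[0])
    return [tuple(col[i] for col in columns) for i in range(num_rows)]

# Two columns of two values each become two rows.
assert column_batch_to_rows([[1, 2], ['a', 'b']]) == [(1, 'a'), (2, 'b')]
```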

Tests
 - Most of the end-to-end tests can run on ORC format.
 - Add tpcds, tpch tests for ORC.
 - Add some ORC specific tests.
 - Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
   is not robust for corrupt files (ORC-315).

Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Reviewed-on: http://gerrit.cloudera.org:8080/9134
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-11 05:13:02 +00:00
Zoltan Borok-Nagy
2ee914d5b3 IMPALA-5903: Inconsistent specification of result set and result set metadata
Before this commit it was quite arbitrary which DDL operations
returned a result set and which didn't.

With this commit, every DDL operation returns a summary of
its execution. The operations declare their result set schema in
Frontend.java and provide the summary in CatalogOpExecutor.java.

Updated the tests according to the new behavior.

Change-Id: Ic542fb8e49e850052416ac663ee329ee3974e3b9
Reviewed-on: http://gerrit.cloudera.org:8080/9090
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-11 02:21:48 +00:00
Tim Armstrong
2995be8238 IMPALA-5607: part 1: breaking extract/date_part changes
This is the compatibility-breaking part of Jinchul Kim's change
to add additional units. To support nanoseconds we need to
widen the output type of these functions. We also change
the meaning of "milliseconds" to include the seconds component.
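A worked example of the "milliseconds" change (the concrete values are assumed): for a timestamp at 56.789 seconds past the minute, the unit now folds in the seconds component:

```python
def extract_milliseconds(seconds, millis):
    """New semantics: the seconds component is included."""
    return seconds * 1000 + millis

# 56.789s past the minute: old behavior returned 789, new returns 56789.
assert extract_milliseconds(56, 789) == 56789
```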

Cherry-picks: not for 2.x

Change-Id: I42d83712d9bb3a4900bec38a9c009dcf2a1fe019
Reviewed-on: http://gerrit.cloudera.org:8080/9957
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-10 04:00:37 +00:00
Thomas Tauber-Marshall
d437f956ca IMPALA-6338: Disable flaky bloom filter test
The underlying issue in IMPALA-6338 causes successful queries that are
cancelled internally (because all results have been returned) to, in
rare cases, be missing info from the profile. This has caused flaky
tests but has low impact on users, and unfortunately, with the current
query lifecycle logic in the coordinator, there is no simple solution.

There is ongoing work to improve query lifecycle logic in the
coordinator holistically, see IMPALA-5384. This work will eventually
address the underlying cause of IMPALA-6338. Until then, we disable
the tests that have been flaky.

Change-Id: Ie30b88fb8fb7780fc3a7153c05fdc3606145ce35
Reviewed-on: http://gerrit.cloudera.org:8080/9822
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-09 21:52:57 +00:00
Bikramjeet Vig
75e1bd1bcd IMPALA-6771: Fix in-predicate set up bug
Fixes a bug where default-initialized values were introduced into the
set data structure used for membership checks, which could cause wrong
results.
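The class of bug can be illustrated like this (a deliberately simplified sketch, not Impala's actual set implementation):

```python
def build_set_buggy(values, capacity):
    """Pre-size the backing storage; the unused, default-initialized
    slots (zeros) then look like real members."""
    slots = [0] * capacity
    slots[:len(values)] = values
    return slots  # bug: trailing zeros are treated as set members

def build_set_fixed(values, capacity):
    """Only the values actually inserted count as members."""
    return list(values)

vals = [7, 9]
assert 0 in build_set_buggy(vals, capacity=4)      # wrong: 0 "in" the set
assert 0 not in build_set_fixed(vals, capacity=4)  # correct
```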

Testing:
Added a test case that checks for the same.

Change-Id: I7e776dbcb7ee4a9b64e1295134a27d332f5415b6
Reviewed-on: http://gerrit.cloudera.org:8080/9891
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-04-04 21:51:29 +00:00
Fredy Wijaya
8173e9ab4d IMPALA-6571: NullPointerException in SHOW CREATE TABLE for HBase tables
This patch fixes the NullPointerException in SHOW CREATE TABLE for HBase
tables.

Testing:
- Moved the content of back hbase-show-create-table.test to
  show-create-table.test
- Ran show-create-table end-to-end tests

Change-Id: Ibe018313168fac5dcbd80be9a8f28b71a2c0389b
Reviewed-on: http://gerrit.cloudera.org:8080/9884
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-04-04 00:12:30 +00:00
Fredy Wijaya
08d386f0fc IMPALA-6724: Allow creating/dropping functions with the same name as built-ins
This patch removes restriction on creating a function with the same name
as the built-in function. The reason for lifting the restriction is to
avoid a name clash when introducing new built-in functions. The patch
also fixes some inconsistent behavior when creating or dropping a
function, depending on whether the specified name is fully qualified.

Refer to the below tables for more information.

Create function:
+---------+-------------+-------------------------+-------------------------------+-------------------------------+
| FQ Name | Built-in DB | Function Name           | Existing Behavior             | New Behavior                  |
+---------+-------------+-------------------------+-------------------------------+-------------------------------+
| Yes     | Yes         | Same as built-in        | Same name exception           | Cannot modify system database |
| Yes     | Yes         | Different than built-in | Cannot modify system database | Cannot modify system database |
| Yes     | No          | Same as built-in        | Function created              | Function created              |
| Yes     | No          | Different than built-in | Function created              | Function created              |
| No      | Yes         | Same as built-in        | Same name exception           | Cannot modify system database |
| No      | Yes         | Different than built-in | Cannot modify system database | Cannot modify system database |
| No      | No          | Same as built-in        | Same name exception           | Function created              |
| No      | No          | Different than built-in | Function created              | Function created              |
+---------+-------------+-------------------------+-------------------------------+-------------------------------+

Drop function:
+---------+-------------+-------------------------+-------------------------------+-------------------------------+
| FQ Name | Built-in DB | Function Name           | Existing Behavior             | New Behavior                  |
+---------+-------------+-------------------------+-------------------------------+-------------------------------+
| Yes     | Yes         | Same as built-in        | Cannot modify system database | Cannot modify system database |
| Yes     | Yes         | Different than built-in | Cannot modify system database | Cannot modify system database |
| Yes     | No          | Same as built-in        | Function dropped              | Function dropped              |
| Yes     | No          | Different than built-in | Function dropped              | Function dropped              |
| No      | Yes         | Same as built-in        | Cannot modify system database | Cannot modify system database |
| No      | Yes         | Different than built-in | Cannot modify system database | Cannot modify system database |
| No      | No          | Same as built-in        | Cannot modify system database | Function dropped              |
| No      | No          | Different than built-in | Function dropped              | Function dropped              |
+---------+-------------+-------------------------+-------------------------------+-------------------------------+

Select function (no new behavior):
+---------+-------------+-------------------------+--------------------------------------------------------+
| FQ Name | Built-in DB | Function Name           | Behavior                                               |
+---------+-------------+-------------------------+--------------------------------------------------------+
| Yes     | Yes         | Same as built-in        | Function in the specified database (built-in) executed |
| Yes     | Yes         | Different than built-in | Unknown function exception                             |
| Yes     | No          | Same as built-in        | Function in the specified database executed            |
| Yes     | No          | Different than built-in | Function in the specified database executed            |
| No      | Yes         | Same as built-in        | Built-in function executed                             |
| No      | Yes         | Different than built-in | Unknown function exception                             |
| No      | No          | Same as built-in        | Built-in function executed                             |
| No      | No          | Different than built-in | Function in the current database executed              |
+---------+-------------+-------------------------+--------------------------------------------------------+
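The "Select function" lookup order above can be sketched as follows (a simplified model: '_impala_builtins' is Impala's built-in database name, the rest of the names and the single current-db set are assumptions):

```python
def resolve(name, builtins, current_db_fns, db=None):
    """Return the function to execute, or None for 'unknown function'."""
    if db is not None:                   # fully qualified: db.fn()
        fns = builtins if db == '_impala_builtins' else current_db_fns
        return name if name in fns else None
    if name in builtins:                 # unqualified: built-ins win...
        return name
    # ...then fall back to the current database.
    return name if name in current_db_fns else None

# Built-in executed even when a same-named user function exists.
assert resolve('abs', builtins={'abs'}, current_db_fns={'abs'}) == 'abs'
assert resolve('my_fn', builtins={'abs'}, current_db_fns={'my_fn'}) == 'my_fn'
```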

Testing:
- Ran front-end tests
- Added end-to-end DDL function tests

Cherry-picks: not for 2.x

Change-Id: Ic30df56ac276970116715c14454a5a2477b185fa
Reviewed-on: http://gerrit.cloudera.org:8080/9800
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-04-02 21:12:31 +00:00
Fredy Wijaya
c3ab27681f IMPALA-6739: Exception in ALTER TABLE SET statements
The patch fixes issues with executing ALTER TABLE SET statements when
there are no matching partitions.

The patch also removes an incorrect precondition, i.e.
(partitionSet == null || !partitionSet.isEmpty()) in ALTER TABLE SET
statements because a partitionSet can be null when PARTITION is not
specified in the ALTER TABLE SET statement and partitionSet can be
empty when there is no matching partition. For example:

Matching partitions (partitionSet != null && !partitionSet.isEmpty()):
> alter table functional.alltypesagg partition(year=2009, month=1)
  set fileformat parquet;

No matching partitions (partitionSet != null && partitionSet.isEmpty()):
> alter table functional.alltypesagg partition(year=1990, month=1)
  set fileformat parquet;

No partition specified (partitionSet == null):
> alter table functional.alltypesagg set fileformat parquet;
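The corrected handling of the three cases can be sketched as (function and variable names are assumptions):

```python
def targets_for_alter(table_partitions, partition_set):
    """Which partitions an ALTER TABLE SET statement applies to."""
    if partition_set is None:
        return table_partitions   # no PARTITION clause: whole table
    # May legitimately be empty: no matching partitions is a no-op,
    # not a failed precondition.
    return partition_set

assert targets_for_alter(['p1', 'p2'], None) == ['p1', 'p2']
assert targets_for_alter(['p1', 'p2'], []) == []   # previously threw
```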

Testing:
- Added a new test
- Ran all front-end tests

Change-Id: I793e827d5cf5b7986bd150dd9706df58da3417f3
Reviewed-on: http://gerrit.cloudera.org:8080/9819
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-04-02 21:05:40 +00:00
Thomas Tauber-Marshall
832974383c IMPALA-6445: Test for kudu master address with whitespace
A concern was brought up that Impala might not handle kudu master
addresses containing whitespace correctly. Turns out that the Kudu
client takes care of stripping whitespace, so it works, but it would
be good to have a test to ensure it continues to work.

Change-Id: I1857b8dbcb5af66d69f7620368cd3b9b85ae7576
Reviewed-on: http://gerrit.cloudera.org:8080/9876
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-04-02 20:29:51 +00:00
Bikramjeet Vig
4a39e7c29f IMPALA-5980: Upgrade to LLVM 5.0.1
Highlighting a few changes in LLVM:
- Minor changes to some function signatures
- Minor changes to error handling
- Split Bitcode/ReaderWriter.h - https://reviews.llvm.org/D26502
- Introduced an optional new GVN optimization pass.

Needed to fix a bunch of new clang-tidy warnings.

Testing:
Ran core and ASAN tests successfully.

Performance:
Ran single node TPC-H and targeted perf with scale factor 60. Both
improved on average. Identified regression in
"primitive_filter_in_predicate" which will be addressed by IMPALA-6621.

+-------------------+-----------------------+---------+------------+------------+----------------+
| Workload          | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+-------------------+-----------------------+---------+------------+------------+----------------+
| TARGETED-PERF(60) | parquet / none / none | 22.29   | -0.12%     | 3.90       | +3.16%         |
| TPCH(60)          | parquet / none / none | 15.97   | -3.64%     | 10.14      | -4.92%         |
+-------------------+-----------------------+---------+------------+------------+----------------+

+-------------------+--------------------------------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------------+-------+
| Workload          | Query                                                  | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Num Clients | Iters |
+-------------------+--------------------------------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------------+-------+
| TARGETED-PERF(60) | PERF_LIMIT-Q1                                          | parquet / none / none | 0.01   | 0.00        | R +156.43% | * 25.80% * | * 17.14% *     | 1           | 5     |
| TARGETED-PERF(60) | primitive_filter_in_predicate                          | parquet / none / none | 3.39   | 1.92        | R +76.33%  |   3.23%    |   4.37%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_filter_string_non_selective                  | parquet / none / none | 1.25   | 1.11        |   +12.46%  |   3.41%    |   5.36%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_filter_decimal_selective                     | parquet / none / none | 1.40   | 1.25        |   +12.25%  |   3.57%    |   3.44%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_filter_string_like                           | parquet / none / none | 16.87  | 15.65       |   +7.78%   |   5.05%    |   0.37%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_min_max_runtime_filter                       | parquet / none / none | 1.79   | 1.71        |   +4.77%   |   0.71%    |   1.73%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_broadcast_join_2                             | parquet / none / none | 0.60   | 0.58        |   +3.64%   |   3.19%    |   3.81%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_filter_string_selective                      | parquet / none / none | 0.95   | 0.93        |   +2.91%   |   5.23%    |   5.85%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_broadcast_join_3                             | parquet / none / none | 4.33   | 4.21        |   +2.83%   |   5.46%    |   3.25%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_groupby_bigint_lowndv                        | parquet / none / none | 4.59   | 4.47        |   +2.82%   |   3.73%    |   1.14%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_conjunct_ordering_3                          | parquet / none / none | 0.20   | 0.19        |   +2.65%   |   4.76%    |   2.24%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_AGG-Q1                                            | parquet / none / none | 2.49   | 2.43        |   +2.31%   |   1.06%    |   1.93%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_AGG-Q6                                            | parquet / none / none | 2.04   | 2.00        |   +2.09%   |   3.51%    |   2.80%        | 1           | 5     |
| TPCH(60)          | TPCH-Q3                                                | parquet / none / none | 12.37  | 12.17       |   +1.62%   |   0.80%    |   2.45%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_STRING-Q5                                         | parquet / none / none | 4.52   | 4.45        |   +1.54%   |   1.23%    |   1.08%        | 1           | 5     |
| TPCH(60)          | TPCH-Q6                                                | parquet / none / none | 2.95   | 2.91        |   +1.33%   |   1.92%    |   1.67%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_STRING-Q4                                         | parquet / none / none | 3.71   | 3.66        |   +1.26%   |   0.34%    |   0.53%        | 1           | 5     |
| TPCH(60)          | TPCH-Q1                                                | parquet / none / none | 18.69  | 18.47       |   +1.19%   |   0.75%    |   0.31%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_STRING-Q7                                         | parquet / none / none | 8.15   | 8.07        |   +0.99%   |   3.92%    |   1.58%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_groupby_decimal_highndv                      | parquet / none / none | 31.31  | 31.01       |   +0.97%   |   1.74%    |   1.14%        | 1           | 5     |
| TPCH(60)          | TPCH-Q5                                                | parquet / none / none | 7.59   | 7.53        |   +0.78%   |   0.38%    |   0.99%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_AGG-Q4                                            | parquet / none / none | 21.25  | 21.09       |   +0.76%   |   0.76%    |   0.75%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_conjunct_ordering_4                          | parquet / none / none | 0.24   | 0.24        |   +0.75%   |   3.14%    |   4.76%        | 1           | 5     |
| TPCH(60)          | TPCH-Q19                                               | parquet / none / none | 7.88   | 7.82        |   +0.74%   |   2.39%    |   2.64%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_orderby_bigint                               | parquet / none / none | 5.10   | 5.07        |   +0.61%   |   0.74%    |   0.54%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_STRING-Q3                                         | parquet / none / none | 3.61   | 3.59        |   +0.60%   |   1.45%    |   0.90%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_orderby_all                                  | parquet / none / none | 27.63  | 27.48       |   +0.55%   |   0.85%    |   0.10%        | 1           | 5     |
| TPCH(60)          | TPCH-Q4                                                | parquet / none / none | 5.81   | 5.79        |   +0.45%   |   1.65%    |   2.16%        | 1           | 5     |
| TPCH(60)          | TPCH-Q13                                               | parquet / none / none | 23.49  | 23.43       |   +0.27%   |   0.83%    |   0.63%        | 1           | 5     |
| TPCH(60)          | TPCH-Q21                                               | parquet / none / none | 68.88  | 68.76       |   +0.18%   |   0.22%    |   0.19%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_groupby_decimal_lowndv.test                  | parquet / none / none | 4.38   | 4.37        |   +0.09%   |   2.45%    |   0.45%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_conjunct_ordering_5                          | parquet / none / none | 10.40  | 10.40       |   +0.07%   |   0.77%    |   0.50%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_long_predicate                               | parquet / none / none | 222.37 | 222.23      |   +0.06%   |   0.25%    |   0.25%        | 1           | 5     |
| TPCH(60)          | TPCH-Q8                                                | parquet / none / none | 10.65  | 10.65       |   +0.03%   |   0.55%    |   1.40%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_shuffle_join_one_to_many_string_with_groupby | parquet / none / none | 261.84 | 261.87      |   -0.01%   |   0.91%    |   0.74%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_AGG-Q3                                            | parquet / none / none | 9.44   | 9.45        |   -0.02%   |   0.92%    |   1.33%        | 1           | 5     |
| TPCH(60)          | TPCH-Q16                                               | parquet / none / none | 5.21   | 5.21        |   -0.02%   |   1.46%    |   1.64%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_top-n_all                                    | parquet / none / none | 34.58  | 34.62       |   -0.11%   |   0.22%    |   0.19%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_topn_bigint                                  | parquet / none / none | 4.24   | 4.25        |   -0.13%   |   6.66%    |   2.03%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_STRING-Q2                                         | parquet / none / none | 3.23   | 3.24        |   -0.34%   |   2.03%    |   0.32%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_broadcast_join_1                             | parquet / none / none | 0.18   | 0.18        |   -0.40%   |   6.16%    |   2.45%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_exchange_broadcast                           | parquet / none / none | 46.27  | 46.51       |   -0.52%   |   7.83%    | * 15.60% *     | 1           | 5     |
| TARGETED-PERF(60) | primitive_groupby_bigint_pk                            | parquet / none / none | 114.32 | 114.92      |   -0.52%   |   0.24%    |   0.61%        | 1           | 5     |
| TPCH(60)          | TPCH-Q22                                               | parquet / none / none | 6.66   | 6.70        |   -0.53%   |   1.39%    |   0.84%        | 1           | 5     |
| TPCH(60)          | TPCH-Q20                                               | parquet / none / none | 5.78   | 5.81        |   -0.62%   |   1.25%    |   0.67%        | 1           | 5     |
| TPCH(60)          | TPCH-Q2                                                | parquet / none / none | 2.53   | 2.55        |   -0.64%   |   3.86%    |   3.72%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_AGG-Q5                                            | parquet / none / none | 0.58   | 0.58        |   -0.75%   |   0.99%    |   6.89%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_AGG-Q7                                            | parquet / none / none | 2.05   | 2.07        |   -0.86%   |   2.16%    |   4.73%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_shuffle_join_union_all_with_groupby          | parquet / none / none | 54.86  | 55.34       |   -0.87%   |   0.25%    |   0.66%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_conjunct_ordering_2                          | parquet / none / none | 7.52   | 7.59        |   -0.98%   |   1.53%    |   1.73%        | 1           | 5     |
| TPCH(60)          | TPCH-Q9                                                | parquet / none / none | 36.43  | 36.79       |   -1.00%   |   1.60%    |   7.39%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_STRING-Q1                                         | parquet / none / none | 2.79   | 2.82        |   -1.10%   |   1.15%    |   2.25%        | 1           | 5     |
| TPCH(60)          | TPCH-Q11                                               | parquet / none / none | 1.95   | 1.97        |   -1.18%   |   3.14%    |   2.24%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_AGG-Q2                                            | parquet / none / none | 10.98  | 11.11       |   -1.24%   |   0.77%    |   1.45%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_small_join_1                                 | parquet / none / none | 0.22   | 0.22        |   -1.34%   | * 13.03% * | * 12.31% *     | 1           | 5     |
| TPCH(60)          | TPCH-Q7                                                | parquet / none / none | 42.82  | 43.41       |   -1.37%   |   1.63%    |   1.51%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_empty_build_join_1                           | parquet / none / none | 3.30   | 3.35        |   -1.54%   |   2.15%    |   1.27%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_STRING-Q6                                         | parquet / none / none | 10.34  | 10.54       |   -1.81%   |   0.24%    |   2.02%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_groupby_bigint_highndv                       | parquet / none / none | 32.80  | 33.46       |   -1.98%   |   1.29%    |   0.61%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_filter_decimal_non_selective                 | parquet / none / none | 1.62   | 1.67        |   -3.01%   |   0.79%    |   1.65%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_conjunct_ordering_1                          | parquet / none / none | 0.13   | 0.14        |   -3.36%   |   8.66%    | * 12.66% *     | 1           | 5     |
| TARGETED-PERF(60) | primitive_exchange_shuffle                             | parquet / none / none | 84.92  | 87.96       |   -3.46%   |   1.46%    |   1.50%        | 1           | 5     |
| TPCH(60)          | TPCH-Q12                                               | parquet / none / none | 6.98   | 7.31        |   -4.57%   |   1.03%    |   7.13%        | 1           | 5     |
| TPCH(60)          | TPCH-Q18                                               | parquet / none / none | 47.54  | 50.39       |   -5.64%   |   5.70%    |   5.53%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_filter_bigint_non_selective                  | parquet / none / none | 0.88   | 0.96        |   -7.81%   |   4.27%    |   5.97%        | 1           | 5     |
| TPCH(60)          | TPCH-Q15                                               | parquet / none / none | 8.14   | 9.15        |   -11.09%  |   0.63%    | * 10.44% *     | 1           | 5     |
| TPCH(60)          | TPCH-Q10                                               | parquet / none / none | 12.66  | 14.28       |   -11.34%  |   4.32%    |   1.14%        | 1           | 5     |
| TPCH(60)          | TPCH-Q17                                               | parquet / none / none | 10.31  | 12.59       |   -18.14%  |   0.65%    |   3.72%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_filter_bigint_selective                      | parquet / none / none | 0.14   | 0.19        | I -27.60%  | * 32.55% * | * 39.78% *     | 1           | 5     |
| TPCH(60)          | TPCH-Q14                                               | parquet / none / none | 6.10   | 11.00       | I -44.55%  |   4.06%    |   3.84%        | 1           | 5     |
+-------------------+--------------------------------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------------+-------+

Change-Id: Ib0a15cb53feab89e7b35a56b67b3b30eb3e62c6b
Reviewed-on: http://gerrit.cloudera.org:8080/9584
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-28 04:25:27 +00:00
Taras Bobrovytsky
8fec1911e5 IMPALA-6230, IMPALA-6468: Fix the output type of round() and related fns
Before this patch, the output type of round() ceil() floor() trunc() was
not always the same as the input type. It was also inconsistent in
general. For example, round(double) returned an integer, but
round(double, int) returned a double.

After looking at other database systems, we decided that the guideline
should be that the output type should be the same as the input type. In
this patch, we change the behavior of the previously mentioned functions
so that if a double is given then a double is returned.

We also modify the rounding behavior to always round away from zero.
Before, we were rounding towards positive infinity in some cases.

Testing:
- Updated tests
- Ran an exhaustive build which passed.

Cherry-picks: not for 2.x

Change-Id: I77541678012edab70b182378b11ca8753be53f97
Reviewed-on: http://gerrit.cloudera.org:8080/9346
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-24 04:43:01 +00:00
Philip Zeyliger
783de170c9 IMPALA-4277: Support multiple versions of Hadoop ecosystem
Adds support for building against two sets of Hadoop ecosystem
components. The control variable is IMPALA_MINICLUSTER_PROFILE_OVERRIDE,
which can either be set to 2 (for Hadoop 2, Hive 1, and so on) or 3 (for
Hadoop 3, Hive 2, and so on).

We intend (in a trivial follow-on change soon) to make 3 the new default
and to explicitly deprecate 2, but this change does not switch the
default yet. We support both to facilitate a smoother transition, but
support for profile 2 will be removed soon in the Impala 3.x line.

The switch is done at build time, following the pattern from IMPALA-5184
(build fe against both Hive 1 & 2 APIs). Switching back and forth
requires running 'cmake' again. Doing this at build-time avoids
complicating the Java code with classloader configuration.

There are relatively few incompatible APIs. This implementation
encapsulates that by extracting some Java code into
fe/src/compat-minicluster-profile-{2,3}. (This follows the
pattern established by IMPALA-5184, but, to avoid a proliferation
of directories, I've moved the Hive files into the same tree and
consolidated the Hive changes into the same directory structure.)

For Maven, I introduced Maven "profiles" to handle the two cases where
the dependencies (and exclusions) differ. These are driven by the
$IMPALA_MINICLUSTER_PROFILE environment variable.

For Sentry, exception class names changed. We work around this by adding
"isSentry...(Exception)" methods with two different implementations.
Sentry is also doing some odd shading, whereby some exceptions are
"sentry.org.apache.sentry..."; we handle both. Similarly, the mechanism
to create a SentryAuthProvider is slightly different. The easiest way to
see the differences is to run:

  diff -u fe/src/compat-minicluster-profile-{2,3}/java/org/apache/impala/util/SentryUtil.java
  diff -u fe/src/compat-minicluster-profile-{2,3}/java/org/apache/impala/authorization/SentryAuthProvider.java

The Sentry work is based on a change by Zach Amsden.

In addition, we recently added an explicit "refresh" permission.  In
Sentry 2, this required creating an ImpalaPrivilegeModel to capture
that. It's a slight customization of Hive's equivalent class.

For Parquet, the difference is even more mechanical. The package names
have gone from "parquet" to "org.apache.parquet". The affected code
was extracted into ParquetHelper, but only one copy exists. The second
copy is generated at build-time using sed.

In the rare cases where we need to behave differently at runtime,
MiniclusterProfile.MINICLUSTER_PROFILE is a class which encapsulates
what version we were built against. One of the cases is the results
expected by various frontend tests. I avoided the issue by translating
one error string into another, which handled the diversion in one place,
rather than complicating the several locations which look for "No
FileSystem for scheme..." errors.

The HBase APIs we use for splitting regions at test time changed.
This patch includes a re-write of that code for the new APIs. This
piece was contributed by Zach Amsden.

To work with newer versions of dependencies, I updated the version of
httpcomponents.core we use to 4.4.9.

We (Thomas Tauber-Marshall and I) uploaded new Hadoop/Hive/Sentry/HBase
binaries to s3://native-toolchain, and amended the shell scripts to
launch the right things. There are minor mechanical differences.  Some
of this was based on earlier work by Joe McDonnell and Zach Amsden.
Hive's logging is changed in Hive 2, necessitating creating a
log4j2.properties template and using it appropriately. Furthermore,
Hadoop3's new shell script re-writes do a certain amount of classpath
de-duplication, causing some issues with locating the relevant logging
configurations. Accommodations exist in the code to deal with that.

parquet-filtering.test was updated to turn off stats filtering. Older
Hive didn't write Parquet statistics, but newer Hive does. By turning
off stats filtering, we test what the test had intended to test.

For views-compatibility.test, it seems that Hive 2 has fixed certain
bugs that we were testing for in Hive. I've added a
HIVE=SUCCESS_PROFILE_3_ONLY mechanism to capture that.

For AuthorizationTest, different hive versions show slightly different
things for extended output.

To facilitate easier reviewing, the following files are 100% renames as identified by git; nothing
to see here.

 rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetCatalogsReq.java (100%)
 rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetColumnsReq.java (100%)
 rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetFunctionsReq.java (100%)
 rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetInfoReq.java (100%)
 rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetSchemasReq.java (100%)
 rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetTablesReq.java (100%)
 rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/impala/compat/MetastoreShim.java (100%)
 rename fe/src/{compat-hive-2 => compat-minicluster-profile-3}/java/org/apache/impala/compat/MetastoreShim.java (100%)
 rename testdata/cluster/node_templates/{cdh5 => common}/etc/hadoop/conf/kms-acls.xml.tmpl (100%)
 rename testdata/cluster/node_templates/{cdh5 => common}/etc/hadoop/conf/kms-site.xml.tmpl (100%)
 rename testdata/cluster/node_templates/{cdh5 => common}/etc/hadoop/conf/yarn-site.xml.tmpl (100%)
 rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-common (100%)
 rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-master (100%)
 rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-tserver (100%)
 rename testdata/cluster/node_templates/{cdh5 => common}/etc/kudu/master.conf.tmpl (100%)
 rename testdata/cluster/node_templates/{cdh5 => common}/etc/kudu/tserver.conf.tmpl (100%)

CreateTableLikeFileStmt had a chunk of code moved to ParquetHelper.java. This
was done manually, but without changing anything except what Java required in
terms of accessibility and boilerplate.

 rewrite fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java (80%)
 copy fe/src/{main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java => compat-minicluster-profile-3/java/org/apache/impala/analysis/ParquetHelper.java} (77%)

Testing: Ran core & exhaustive tests with both profiles.
Cherry-picks: not for 2.x.

Change-Id: I7a2ab50331986c7394c2bbfd6c865232bca975f7
Reviewed-on: http://gerrit.cloudera.org:8080/9716
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-23 20:56:00 +00:00
Tim Armstrong
588e1d46e9 IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Impala already supported RLE encoding for levels and dictionary pages, so
the only task was to integrate it into BoolColumnReader.

A new benchmark, rle-benchmark.cc is added to test the speed of RLE
decoding for different bit widths and run lengths.

There might be a small performance impact on PLAIN encoded booleans,
because of the additional branch when the cache of BoolColumnReader is
filled. As the cache size is 128, I considered this to be outside the
"hot loop".

Testing:

As Impala cannot write RLE encoded bool columns at the moment, parquet-mr
was used to create a test file, testdata/data/rle_encoded_bool.parquet

tests/query_test/test_scanners.py#test_rle_encoded_bools creates a table
that uses this file, and tries to query from it.

Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Reviewed-on: http://gerrit.cloudera.org:8080/9403
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-22 02:47:33 +00:00
Tim Armstrong
e148c1a7c3 IMPALA-6589: remove invalid DCHECK in parquet reader
The DCHECK was only valid if the Parquet file metadata is internally
consistent, with the number of values reported by the metadata
matching the number of encoded levels.

The DCHECK was intended to directly detect misuse of the RleBatchDecoder
interface, which would lead to incorrect results. However, our other
test coverage for reading Parquet files is sufficient to test the
correctness of level decoding.

Testing:
Added a minimal corrupt test file that reproduces the issue.

Change-Id: Idd6e09f8c8cca8991be5b5b379f6420adaa97daa
Reviewed-on: http://gerrit.cloudera.org:8080/9556
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-17 02:52:19 +00:00
Fredy Wijaya
41a516f949 IMPALA-6655: Add owner information on database creation
Add owner information on database creation.

> create database foo;
> describe database extended foo;
+---------+----------+---------+
| name    | location | comment |
+---------+----------+---------+
| foo     |          |         |
| Owner:  |          |         |
|         | user1    | USER    |
+---------+----------+---------+

Testing:
- Ran end-to-end query and metadata tests

Change-Id: Id74ec9bd3cb7954999305e9cd9085cbf50921a78
Reviewed-on: http://gerrit.cloudera.org:8080/9637
Reviewed-by: Fredy Wijaya <fwijaya@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-16 19:28:37 +00:00