Commit Graph

1133 Commits

poojanilangekar
c6f9b61ec2 IMPALA-6625: Skip computing parquet conjuncts for non-Parquet scans
This change ensures that the planner computes parquet conjuncts
only for scans containing parquet files. Additionally, it
handles the PARQUET_DICTIONARY_FILTERING and
PARQUET_READ_STATISTICS query options in the planner.
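The gating described above can be sketched roughly as follows (illustrative Python with hypothetical names; the real logic lives in Impala's Java planner):

```python
# Hypothetical sketch: compute Parquet conjuncts only when the scan
# actually includes a Parquet partition, honoring the query options.
# Names and the exact option semantics are illustrative, not Impala's API.
def should_compute_parquet_conjuncts(partition_formats, query_options):
    # Skip entirely if no scanned partition is Parquet.
    if not any(fmt == "PARQUET" for fmt in partition_formats):
        return False
    # Compute conjuncts if either Parquet filtering feature is enabled.
    return (query_options.get("PARQUET_DICTIONARY_FILTERING", True)
            or query_options.get("PARQUET_READ_STATISTICS", True))
```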

Testing was carried out independently on parquet and non-parquet
scans:
  1. Parquet scans were tested via the existing parquet-filtering
     planner test. Additionally, a new test
     [parquet-filtering-disabled] was added to ensure that the
     explain plan generated skips parquet predicates based on the
     query options.
  2. Non-parquet scans were tested manually to ensure that the
     functions to compute parquet conjuncts were not invoked.
     Additional test cases were added to the parquet-filtering
     planner test to scan non parquet tables and ensure that the
     plans do not contain conjuncts based on parquet statistics.
  3. A parquet partition was added to the alltypesmixedformat
     table in the functional database. Planner tests were added
     to ensure that Parquet conjuncts are constructed only when
     the Parquet partition is included in the query.

Change-Id: I9d6c26d42db090c8a15c602f6419ad6399c329e7
Reviewed-on: http://gerrit.cloudera.org:8080/10704
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-07-06 02:06:50 +00:00
Tianyi Wang
61e6a47776 IMPALA-7236: Fix the parsing of ALLOW_ERASURE_CODED_FILES
This patch adds a missing "break" statement in a switch statement
changed by IMPALA-7102.
Also fixes a non-deterministic test case.

Change-Id: Ife1e791541e3f4fed6bec00945390c7d7681e824
Reviewed-on: http://gerrit.cloudera.org:8080/10857
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-07-03 23:49:44 +00:00
Bikramjeet Vig
30e82c63ec IMPALA-7190: Remove unsupported format writer support
This patch removes write support for unsupported formats like Sequence,
Avro and compressed text. Also, the related query options
ALLOW_UNSUPPORTED_FORMATS and SEQ_COMPRESSION_MODE have been migrated
to the REMOVED query options type.

Testing:
Ran exhaustive build.

Change-Id: I821dc7495a901f1658daa500daf3791b386c7185
Reviewed-on: http://gerrit.cloudera.org:8080/10823
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-07-03 20:34:27 +00:00
Taras Bobrovytsky
8060f4d50e IMPALA-7102 (Part 1): Disable reading of erasure coding by default
In this patch we add a query option ALLOW_ERASURE_CODED_FILES that
allows us to enable or disable support for erasure-coded files. Even
though Impala should be able to handle HDFS erasure-coded files already,
this feature hasn't been tested thoroughly yet. Also, Impala lacks
metrics, observability and DDL commands related to erasure coding. This
is a query option instead of a startup flag because we want to make it
possible for advanced users to enable the feature.
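The described gating amounts to a simple check per scanned file; a minimal sketch (hypothetical names, not Impala's actual code):

```python
# Illustrative sketch: reject scans over erasure-coded files unless the
# ALLOW_ERASURE_CODED_FILES query option is set. File descriptors here
# are plain dicts; the real check happens in Impala's scan setup.
def check_erasure_coded_files(files, allow_erasure_coded_files=False):
    for f in files:
        if f["is_erasure_coded"] and not allow_erasure_coded_files:
            raise RuntimeError(
                "Scanning erasure-coded file %s is disabled by default; "
                "set ALLOW_ERASURE_CODED_FILES to enable it." % f["path"])
```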

We may also need a follow-on patch to disable the write path with
this flag.

Cherry-picks: not for 2.x

Change-Id: Icd3b1754541262467a6e67068b0b447882a40fb3
Reviewed-on: http://gerrit.cloudera.org:8080/10646
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-29 23:26:35 +00:00
poojanilangekar
e988c36bf0 IMPALA-6305: Allow column definitions in ALTER VIEW
This change adds support to change column definitions in ALTER VIEW
statements. This support only required minor changes in the parser
and the AlterViewStmt constructor.

Here's an example syntax:
    alter view foo (a, b comment 'helloworld') as
    select * from bar;

    describe foo;
    +------+--------+------------+
    | name | type   | comment    |
    +------+--------+------------+
    | a    | string |            |
    | b    | string | helloworld |
    +------+--------+------------+

The following tests were modified:
1. ParserTest - To check that the parser handles column definitions
   for alter view statements.
2. AnalyzerDDLTest - To ensure that the analyzer supports the
   change column definitions parsed.
3. TestDdlStatements - To verify the end-to-end functioning of
   ALTER VIEW statements with change column definitions.
4. AuthorizationTest - To ensure that alter table commands with
   column definitions check permissions as expected.

Change-Id: I6073444a814a24d97e80df15fcd39be2812f63fc
Reviewed-on: http://gerrit.cloudera.org:8080/10720
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-27 03:41:47 +00:00
Attila Jeges
17749dbcfc IMPALA-3307: Add support for IANA time-zone db
Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.

Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
  Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
  changes.
- Time-zone database is not updated on a regular basis.

Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
  some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
  Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
  performance degradation.

In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.

This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
  time-zone conversions.

- Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to
  specify an HDFS/S3/ADLS path to a zip archive that contains the
  shared compiled IANA time-zone database. If the startup flag is set,
  impalad will use the specified time-zone database. Otherwise,
  impalad will use the default /usr/share/zoneinfo time-zone database.

- Introduces a new startup flag (--hdfs_zone_alias_conf) to impalad to
  specify an HDFS/S3/ADLS path to a shared config file that contains
  definitions for non-standard time-zone aliases.

- impalad reads the entire time-zone database into an in-memory
  map on startup for fast lookups.

- The name of the coordinator node’s local time-zone is saved to the
  query context when preparing query execution. This time-zone is used
  whenever the current time-zone is referred to afterwards in an
  execution node.

- Adds a new ZipUtil class to extract files from a zip archive. The
  implementation is not vulnerable to Zip Slip.
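The Zip Slip defence mentioned above boils down to rejecting archive entries whose resolved path escapes the extraction directory. A minimal sketch of that check (hypothetical helper, not Impala's ZipUtil):

```python
import os

# Illustrative Zip Slip guard: resolve the entry path against the
# destination directory and refuse anything that escapes it via "..".
def safe_extract_path(dest_dir, entry_name):
    dest_dir = os.path.abspath(dest_dir)
    target = os.path.abspath(os.path.join(dest_dir, entry_name))
    if not (target == dest_dir or target.startswith(dest_dir + os.sep)):
        raise ValueError("Zip entry escapes extraction dir: " + entry_name)
    return target
```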

Cherry-picks: not for 2.x.

Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Reviewed-on: http://gerrit.cloudera.org:8080/9986
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Attila Jeges <attilaj@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-22 13:18:58 +00:00
Tim Armstrong
894ab8e980 IMPALA-7115: set a default THREAD_RESERVATION_LIMIT value
The value is chosen to allow only queries that have a reasonable chance
of succeeding, albeit with poor performance because of the high
number of threads.

Testing:
Added a test to make sure that the default value rejects a large query.

Change-Id: I31d3fa3f6305c360922649dba53a9026c9563384
Reviewed-on: http://gerrit.cloudera.org:8080/10628
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-19 03:02:49 +00:00
Michael Ho
51ff47d05e IMPALA-5168: Codegen HASH_PARTITIONED KrpcDataStreamSender::Send()
This change codegens the hash partitioning logic of
KrpcDataStreamSender::Send() when the partitioning strategy
is HASH_PARTITIONED. It does so by unrolling the loop which
evaluates each row against the partitioning expressions and
hashes the result. It also replaces the number of channels
of that sender with a constant at runtime.
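As a rough model of what the codegen'd loop computes (illustrative Python; the real implementation is cross-compiled C++ with the channel count baked in as a constant):

```python
# Toy model of hash-partitioned routing: evaluate each partitioning
# expression on the row, combine the hashes, and pick a channel.
def route_row(row, partition_exprs, num_channels):
    h = 0
    for expr in partition_exprs:
        # Combine the running hash with this expression's value.
        h = hash((h, expr(row)))
    return h % num_channels
```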

With this change, we get reasonable speedup with some benchmarks:

+------------+-----------------------+---------+------------+------------+----------------+
| Workload   | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+------------+-----------------------+---------+------------+------------+----------------+
| TPCH(_300) | parquet / none / none | 20.03   | -6.44%     | 13.56      | -7.15%         |
+------------+-----------------------+---------+------------+------------+----------------+

+---------------------+-----------------------+---------+------------+------------+----------------+
| Workload            | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+---------------------+-----------------------+---------+------------+------------+----------------+
| TARGETED-PERF(_300) | parquet / none / none | 58.59   | -5.56%     | 12.28      | -5.30%         |
+---------------------+-----------------------+---------+------------+------------+----------------+

+-------------------------+-----------------------+---------+------------+------------+----------------+
| Workload                | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+-------------------------+-----------------------+---------+------------+------------+----------------+
| TPCDS-UNMODIFIED(_1000) | parquet / none / none | 15.60   | -3.10%     | 7.16       | -4.33%         |
+-------------------------+-----------------------+---------+------------+------------+----------------+

+-------------------+-----------------------+---------+------------+------------+----------------+
| Workload          | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+-------------------+-----------------------+---------+------------+------------+----------------+
| TPCH_NESTED(_300) | parquet / none / none | 30.93   | -3.02%     | 17.46      | -4.71%         |
+-------------------+-----------------------+---------+------------+------------+----------------+

Change-Id: I1c44cc9312c062cc7a5a4ac9156ceaa31fb887ff
Reviewed-on: http://gerrit.cloudera.org:8080/10421
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-14 23:37:00 +00:00
Tim Armstrong
d8ed07f112 IMPALA-6035: Add query options to limit thread reservation
Adds two options: THREAD_RESERVATION_LIMIT and
THREAD_RESERVATION_AGGREGATE_LIMIT, which are both enforced by admission
control based on planner resource requirements and the schedule. The
mechanism used is the same as the minimum reservation checks.

THREAD_RESERVATION_LIMIT limits the total number of reserved threads in
fragments scheduled on a single backend.
THREAD_RESERVATION_AGGREGATE_LIMIT limits the sum of reserved threads
across all fragments.
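The two checks can be sketched as follows (illustrative Python; the real enforcement is in admission control using the planner's resource requirements and the schedule):

```python
# Hedged sketch of the two limits: per-backend reserved threads and the
# aggregate sum across all backends. Returns an error string or None.
def check_thread_reservation(per_backend_threads, limit=None, agg_limit=None):
    """per_backend_threads maps backend host -> reserved thread count."""
    if limit is not None:
        for host, threads in per_backend_threads.items():
            if threads > limit:
                return "host %s exceeds THREAD_RESERVATION_LIMIT" % host
    if agg_limit is not None and sum(per_backend_threads.values()) > agg_limit:
        return "total exceeds THREAD_RESERVATION_AGGREGATE_LIMIT"
    return None
```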

This also slightly improves the minimum reservation error message to
include the host name.

Testing:
Added end-to-end tests that exercise the code paths.

Ran core tests.

Change-Id: I5b5bbbdad5cd6b24442eb6c99a4d38c2ad710007
Reviewed-on: http://gerrit.cloudera.org:8080/10365
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-14 03:25:55 +00:00
Thomas Tauber-Marshall
bf2124bf30 IMPALA-6929: Support multi-column range partitions for Kudu
Kudu allows specifying range partitions over multiple columns. Impala
already has support for doing this when the partitions are specified
with '=', but if the partitions are specified with '<' or '<=', the
parser would return an error.

This patch modifies the parser to allow for creating Kudu tables like:
create table kudu_test (a int, b int, primary key(a, b))
  partition by range(a, b) (partition (0, 0) <= values < (1, 1));
and similarly to alter partitions like:
alter table kudu_test add range partition (1, 1) <= values < (2, 2);

Testing:
- Modified functional_kudu.jointbl's schema so that we have a table
  in functional with a multi-column range partition to test things
  against.
- Added FE and E2E tests for CREATE and ALTER.

Change-Id: I0141dd3344a4f22b186f513b7406f286668ef1e7
Reviewed-on: http://gerrit.cloudera.org:8080/10441
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-13 00:10:13 +00:00
Zoltan Borok-Nagy
e6ca7ca14d IMPALA-7108: IllegalStateException hit during CardinalityCheckNode.<init>
Since IMPALA-6314, we set LIMIT 2 on runtime scalar subqueries
in StmtRewriter.mergeExpr(). We do that because later we add a
CardinalityCheckNode on top of such subqueries, and with
LIMIT 2 we can still check if they return more than one row.

In the constructor of CardinalityCheckNode there is a
precondition that checks if the child node has LIMIT 2 to
be certain that we've set the limit for all the necessary
cases.

However, some subqueries will get a LIMIT 1 later, breaking the
precondition in CardinalityCheckNode. An example of such a
subquery is a select stmt that selects from an inline view
that returns a single row:

select * from functional.alltypes
where int_col = (select f.id from (
                 select * from functional.alltypes limit 1) f);

Note that we shouldn't add a CardinalityCheckNode to the plan
of this query in the first place. To generate a proper plan I
updated SelectStmt.returnsSingleRow() because this method didn't
handle this case well.

I also changed the precondition from
    Preconditions.checkState(child.getLimit() == 2);
to
    Preconditions.checkState(child.getLimit() <= 2);
in order to be more permissive.

I added tests for the aforementioned query.

Change-Id: I82a7a3fe26db3e12131c030c4ad055a9c4955407
Reviewed-on: http://gerrit.cloudera.org:8080/10605
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-08 20:15:50 +00:00
Lars Volker
c9e8f2f7e7 IMPALA-7008: Rewrite query to make it not return 100M rows
One query in spilling.test is expected to fail. When it does not fail,
it returns 100M rows, which then causes the Python test code to
consume memory until it gets OOM-killed by the kernel.

To fix this, we rewrite the query. I tested this locally to make sure
that the query still fails as expected on HDFS.

Change-Id: I31956d3092a7e69b979f631df3a6dfda14ebe140
Reviewed-on: http://gerrit.cloudera.org:8080/10597
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-05 01:14:35 +00:00
Thomas Tauber-Marshall
ba7893cb9e IMPALA-6338: Disable more flaky bloom filter tests
Until IMPALA-6338 is fixed, temporarily disable tests that are
affected by it - any test that has a 'limit' and relies on the
contents of the runtime profile. This patch disables the runtime
profile check for all such tests in bloom_filter.test

Change-Id: Ifc9da892efa3b27d63056ad8e3befac82808ffdb
Reviewed-on: http://gerrit.cloudera.org:8080/10530
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-30 08:00:50 +00:00
Tim Armstrong
f4f28d310c IMPALA-6941: load more text scanner compression plugins
Add extensions for LZ4 and ZSTD (which are supported by Hadoop).
Even without a plugin this results in better behaviour because
we don't try to treat the files with unknown extensions as
uncompressed text.
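The behaviour change amounts to extending the extension lookup table; a rough model (illustrative entries and names, not Impala's actual table):

```python
import os

# Illustrative extension -> (codec, supported-without-plugin) mapping.
# Known-but-unsupported codecs are flagged rather than misread as
# uncompressed text; only truly unknown extensions fall through.
KNOWN_EXTENSIONS = {
    ".gz":     ("GZIP",   True),
    ".bz2":    ("BZIP2",  True),
    ".snappy": ("SNAPPY", True),
    ".lzo":    ("LZO",    False),  # requires the LZO plugin
    ".lz4":    ("LZ4",    False),  # plugin-only, added by this patch
    ".zst":    ("ZSTD",   False),  # plugin-only, added by this patch
}

def classify(filename):
    ext = os.path.splitext(filename)[1]
    if ext in KNOWN_EXTENSIONS:
        return KNOWN_EXTENSIONS[ext]
    return ("NONE", True)  # unknown extension: treat as uncompressed text
```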

Also allow loading tables containing files with unsupported
compression types. There was weird behaviour before when we
knew the file extension but didn't support querying the table:
the catalog would load the table but the impalad would fail
processing the catalog update. The simplest way to fix it
is to just allow loading the tables.

Similarly, make the "LOAD DATA" operation more permissive -
we can copy files into a directory even if we can't
decompress them.

Switch to always checking the plugin version - running a mismatched
plugin is inherently unsafe.

Testing:
Positive case where LZO is loaded is exercised. Added
coverage for negative case where LZO is disabled.

Fixed test gaps:
* Querying LZO table with LZO plugin not available.
* Interacting with tables with known but unsupported text
  compressions.
* Querying files with unknown compression suffixes (which are
  treated as uncompressed text).

Change-Id: If2a9c4a4a11bed81df706e9e834400bfedfe48e6
Reviewed-on: http://gerrit.cloudera.org:8080/10165
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-18 03:44:46 +00:00
Zoltan Borok-Nagy
ccf19f9f8f IMPALA-5842: Write page index in Parquet files
This commit builds on the previous work of
Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/

The commit implements the write path of PARQUET-922:
"Add column indexes to parquet.thrift". As specified in the
parquet-format, Impala writes the page indexes just before
the footer. This allows much more efficient page filtering
than using the same information from the 'statistics' field
of DataPageHeader.

I updated Pooja's python tests as well.

Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Reviewed-on: http://gerrit.cloudera.org:8080/9693
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-17 20:22:02 +00:00
Lars Volker
a64cfc523e IMPALA-7032: Disable codegen for CHAR type null literals
Analogous to IMPALA-6435, we have to disable codegen for CHAR type null
literals. Otherwise we will crash in
impala::NullLiteral::GetCodegendComputeFn().

This change adds a test to make sure that the crash is fixed.

Change-Id: I34033362263cf1292418f69c5ca1a3b84aed39a9
Reviewed-on: http://gerrit.cloudera.org:8080/10409
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Lars Volker <lv@cloudera.com>
2018-05-16 00:00:15 +00:00
Zoltan Borok-Nagy
fab65d4479 IMPALA-7022: TestQueries.test_subquery: Subquery must not return more than one row
TestQueries.test_subquery sometimes fails during exhaustive
tests.

In the tests we expect to catch an exception that is
prefixed by the "Query aborted:" string. The prefix is
usually added by impala_beeswax.py::wait_for_completion(),
but in rare cases it isn't added.

From the point of view of the test, it is irrelevant whether the
exception is prefixed with "Query aborted:" or not, so I removed it
from the expected exception string.

Change-Id: I3b8655ad273b1dd7a601099f617db609e4a4797b
Reviewed-on: http://gerrit.cloudera.org:8080/10407
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2018-05-15 23:37:06 +00:00
Tim Armstrong
3661100fa3 IMPALA-6645: Enable disk spill encryption by default
Perf:
Targeted benchmarks with a heavily spilling query on a machine
with PCLMULQDQ support show < 5% of CPU time spent in encryption and
decryption. PCLMULQDQ was introduced in AMD Bulldozer (c. 2011)
and Intel Westmere (c. 2010).

Testing:
Ran core tests with the change.

Updated the custom cluster test to exercise the non-default
configuration.

Change-Id: Iee4be2a95d689f66c3663d99e4df0fb3968893a9
Reviewed-on: http://gerrit.cloudera.org:8080/10345
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2018-05-15 22:23:14 +00:00
Tim Armstrong
e12ee485cf IMPALA-6957: calc thread resource requirement in planner
This only factors in fragment execution threads. E.g. this does *not*
try to account for the number of threads on the old Thrift RPC
code path if that is enabled.

This is loosely related to the old VCores estimate, but is different in
that it:
* Directly ties into the notion of required threads in
  ThreadResourceMgr.
* Is a strict upper bound on the number of such threads, rather than
  an estimate.

Does not include "optional" threads. ThreadResourceMgr in the backend
bounds the number of "optional" threads per impalad, so the number of
execution threads on a backend is limited by

  sum(required threads per query) +
      CpuInfo::num_cores() * FLAGS_num_threads_per_core
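That bound is a straightforward sum; as a tiny calculation (illustrative parameter names mirroring the formula above):

```python
# Per-backend upper bound on execution threads: required threads are
# always admitted, optional threads are capped at cores * threads/core.
def max_execution_threads(required_per_query, num_cores,
                          num_threads_per_core=1):
    return sum(required_per_query) + num_cores * num_threads_per_core
```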

DCHECKS in the backend enforce that the calculation is correct. They
were actually hit in KuduScanNode because of some races in thread
management leading to multiple "required" threads running. Now the
first thread in the multithreaded scans never exits, which means
that it's always safe for any of the other threads to exit early,
which simplifies the logic a lot.

Testing:
Updated planner tests.

Ran core tests.

Change-Id: I982837ef883457fa4d2adc3bdbdc727353469140
Reviewed-on: http://gerrit.cloudera.org:8080/10256
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-12 01:43:37 +00:00
Tim Armstrong
25c13bfdd6 IMPALA-7010: don't run memory usage tests on non-HDFS
Moved a number of tests with tuned mem_limits. In some cases
this required separating the tests from non-tuned functional
tests.

TestQueryMemLimit used very high and very low limits only, so seemed
safe to run in all configurations.

Change-Id: I9686195a29dde2d87b19ef8bb0e93e08f8bee662
Reviewed-on: http://gerrit.cloudera.org:8080/10370
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-11 22:41:49 +00:00
Sailesh Mukil
f13abdca67 IMPALA-6975: TestRuntimeRowFilters.test_row_filters failing with Memory limit exceeded
This test has started failing relatively frequently. We think that
this may be due to timing differences in when RPCs arrive after the
recent KRPC changes.

Increasing the memory limit should allow this test to pass
consistently.

Change-Id: Ie39482e2a0aee402ce156b11cce51038cff5e61a
Reviewed-on: http://gerrit.cloudera.org:8080/10315
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-05 03:01:35 +00:00
Thomas Tauber-Marshall
ba84ad03cb IMPALA-6954: Fix problems with CTAS into Kudu with an expr rewrite
This patch fixes two problems:
- Previously a CTAS into a Kudu table where an expr rewrite occurred
  would create an unpartitioned table, due to the partition info being
  reset in TableDataLayout and then never reconstructed. Since the
  Kudu partition info is set by the parser and never changes, the
  solution is to not reset it.
- Previously a CTAS into a Kudu table with a range partition where an
  expr rewrite occurred would fail with an analysis exception due to
  a Precondition check in RangePartition.analyze that checked that
  the RangePartition wasn't already analyzed, as the analysis can't
  be done twice. Since the state in RangePartition never changes, it
  doesn't need to be reanalyzed and we can just return instead of
  failing on the check.

Testing:
- Added an e2e test that creates a partitioned Kudu table with a CTAS
  with a rewrite, and checks that the expected partitions are created.

Change-Id: I731743bd84cc695119e99342e1b155096147f0ed
Reviewed-on: http://gerrit.cloudera.org:8080/10251
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-02 02:54:23 +00:00
Tim Armstrong
418c705787 IMPALA-6679,IMPALA-6678: reduce scan reservation
This has two related changes.

IMPALA-6679: defer scanner reservation increases
------------------------------------------------
When starting each scan range, check to see how big the initial scan
range is (the full thing for row-based formats, the footer for
Parquet) and determine whether more reservation would be useful.

For Parquet, base the ideal reservation on the actual column layout
of each file. This avoids reserving memory that we won't use for
the actual files that we're scanning. This also avoid the need to
estimate ideal reservation in the planner.

We also release scanner thread reservations above the minimum as
soon as threads complete, so that resources can be released slightly
earlier.

IMPALA-6678: estimate Parquet column size for reservation
---------------------------------------------------------
This change also reduces reservation computed by the planner in certain
cases by estimating the on-disk size of column data based on stats. It
also reduces the default per-column reservation to 4MB since it appears
that < 8MB columns are generally common in practice and the method for
estimating column size is biased towards over-estimating. There are two
main cases to consider for the performance implications:
* Memory is available to improve query perf - if we underestimate, we
  can increase the reservation so we can do "efficient" 8MB I/Os for
  large columns.
* The ideal reservation is not available - query performance is affected
  because we can't overlap I/O and compute as much and may do smaller
  (probably 4MB I/Os). However, we should avoid pathological behaviour
  like tiny I/Os.

When stats are not available, we just default to reserving 4MB per
column, which typically is more memory than required. When stats are
available, the reservation can be reduced below that when heuristics
tell us with high confidence that the column data for most or all
files is smaller than 4MB.
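A minimal sketch of this estimate (illustrative only; the planner's actual heuristics and confidence checks are more involved):

```python
DEFAULT_COLUMN_RESERVATION = 4 * 1024 * 1024  # 4MB default, per the patch

# Hedged sketch: fall back to the 4MB default without stats; with stats,
# reserve no more than the estimated on-disk size of the column data.
def column_reservation(estimated_on_disk_bytes=None):
    if estimated_on_disk_bytes is None:  # no stats available
        return DEFAULT_COLUMN_RESERVATION
    return min(DEFAULT_COLUMN_RESERVATION, estimated_on_disk_bytes)
```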

The stats-based heuristic could reduce scan performance if both the
conservative heuristics significantly underestimate the column size
and memory is constrained such that we can't increase the scan
reservation at runtime (in which case the memory might be used by
a different operator or scanner thread).

Observability:
Added counters to track when threads were not spawned due to reservation
and to track when reservation increases are requested and denied. These
allow determining if performance may have been affected by memory
availability.

Testing:
Updated test_mem_usage_scaling.py memory requirements and added steps
to regenerate the requirements. Loops test for a while to flush out
flakiness.

Added targeted planner and query tests for reservation calculations and
increases.

Change-Id: Ifc80e05118a9eef72cac8e2308418122e3ee0842
Reviewed-on: http://gerrit.cloudera.org:8080/9757
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-28 23:41:39 +00:00
Tim Armstrong
93d714c645 IMPALA-6560: fix regression test for IMPALA-2376
The test is modified to increase the size of collections allocated.
num_nodes and mt_dop query options are set to make execution as
deterministic as possible.

I looped the test overnight to try to flush out flakiness.

Adds support for row_regex lines in CATCH sections so that we can
match a larger part of the error message.

Change-Id: I024cb6b57647902b1735defb885cd095fd99738c
Reviewed-on: http://gerrit.cloudera.org:8080/9681
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2018-04-28 23:41:39 +00:00
Tim Armstrong
d7bba82192 IMPALA-6587: free buffers before ScanRange::Cancel() returns
ScanRange::Cancel() now waits until an in-flight read finishes so
that the disk I/O buffer being processed by the disk thread is
freed when Cancel() returns.

The fix is to set a 'read_in_flight_' flag on the scan range
while the disk thread is doing the read. Cancel() blocks until
read_in_flight_ == false.
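The synchronisation pattern is a standard condition-variable wait; a toy Python model of the fix (illustrative, not the C++ DiskIoMgr code):

```python
import threading

# Toy model: Cancel() blocks until the disk thread clears the in-flight
# flag, so the buffer being filled is guaranteed freed before it returns.
class ScanRange:
    def __init__(self):
        self._cv = threading.Condition()
        self.read_in_flight = False

    def begin_read(self):       # called by the disk thread
        with self._cv:
            self.read_in_flight = True

    def end_read(self):         # called by the disk thread when done
        with self._cv:
            self.read_in_flight = False
            self._cv.notify_all()

    def cancel(self):
        # Wait out any in-flight read before returning.
        with self._cv:
            while self.read_in_flight:
                self._cv.wait()
```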

The code is refactored to move more logic into ScanRange and
to avoid holding RequestContext::lock_ for longer than necessary.

Testing:
Added query test that reproduces the issue.

Added a unit test and a stress option that reproduces the problem in a
targeted way.

Ran disk-io-mgr-stress test for a few hours. Ran it under TSAN and
inspected output to make sure there were no non-benign data races.

Change-Id: I87182b6bd51b5fb0b923e7e4c8d08a44e7617db2
Reviewed-on: http://gerrit.cloudera.org:8080/9680
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2018-04-28 23:41:39 +00:00
Tim Armstrong
fb5dc9eb48 IMPALA-4835: switch I/O buffers to buffer pool
This is the following squashed patches that were reverted.

I will fix the known issues with some follow-on patches.

======================================================================
IMPALA-4835: Part 1: simplify I/O mgr mem mgmt and cancellation

In preparation for switching the I/O mgr to the buffer pool, this
removes and cleans up a lot of code so that the switchover patch starts
from a cleaner slate.

* Remove the free buffer cache (which will be replaced by buffer pool's
  own caching).
* Make memory limit exceeded error checking synchronous (in anticipation
  of having to propagate buffer pool errors synchronously).
* Simplify error propagation - remove the (ineffectual) code that
  enqueued BufferDescriptors containing error statuses.
* Document locking scheme better in a few places, make it part of the
  function signature when it seemed reasonable.
* Move ReturnBuffer() to ScanRange, because it is intrinsically
  connected with the lifecycle of a scan range.
* Separate external ReturnBuffer() and internal CleanUpBuffer()
  interfaces - previously callers of ReturnBuffer() were fudging
  the num_buffers_in_reader accounting to make the external interface work.
* Eliminate redundant state in ScanRange: 'eosr_returned_' and
  'is_cancelled_'.
* Clarify the logic around calling Close() for the last
  BufferDescriptor.
  -> There appeared to be an implicit assumption that buffers would be
     freed in the order they were returned from the scan range, so that
     the "eos" buffer was returned last. Instead just count the number
     of outstanding buffers to detect the last one.
  -> Touching the is_cancelled_ field without holding a lock was hard to
     reason about - violated locking rules and it was unclear that it
     was race-free.
* Remove DiskIoMgr::Read() to simplify the interface. It is trivial to
  inline at the callsites.

This will probably regress performance somewhat because of the cache
removal, so my plan is to merge it around the same time as switching
the I/O mgr to allocate from the buffer pool. I'm keeping the patches
separate to make reviewing easier.

Testing:
* Ran exhaustive tests
* Ran the disk-io-mgr-stress-test overnight

======================================================================
IMPALA-4835: Part 2: Allocate scan range buffers upfront

This change is a step towards reserving memory for buffers from the
buffer pool and constraining per-scanner memory requirements. This
change restructures the DiskIoMgr code so that each ScanRange operates
with a fixed set of buffers that are allocated upfront and recycled as
the I/O mgr works through the ScanRange.

One major change is that ScanRanges get blocked when a buffer is not
available and get unblocked when a client returns a buffer via
ReturnBuffer(). I was able to remove the logic to maintain the
blocked_ranges_ list by instead adding a separate set with all ranges
that are active.

There is also some miscellaneous cleanup included - e.g. reducing the
amount of code devoted to maintaining counters and metrics.

One tricky part of the existing code was that it called
IssueInitialRanges() with empty lists of files and depended on
DiskIoMgr::AddScanRanges() to not check for cancellation in that case.
See IMPALA-6564/IMPALA-6588. I changed the logic to not try to issue
ranges for empty lists of files.

I plan to merge this along with the actual buffer pool switch, but
separated it out to allow review of the DiskIoMgr changes separate from
other aspects of the buffer pool switchover.

Testing:
* Ran core and exhaustive tests.

======================================================================
IMPALA-4835: Part 3: switch I/O buffers to buffer pool

This is the final patch to switch the Disk I/O manager to allocate all
buffers from the buffer pool and to reserve the buffers required for
a query upfront.

* The planner reserves enough memory to run a single scanner per
  scan node.
* The multi-threaded scan node must increase reservation before
  spinning up more threads.
* The scanner implementations must be careful to stay within their
  assigned reservation.

The row-oriented scanners were most straightforward, since they only
have a single scan range active at a time. A single I/O buffer is
sufficient to scan the whole file but more I/O buffers can improve I/O
throughput.

Parquet is more complex because it issues a scan range per column and
the sizes of the columns on disk are not known during planning. To
deal with this, the reservation in the frontend is based on a
heuristic involving the file size and # columns. The Parquet scanner
can then divvy up reservation to columns based on the size of column
data on disk.
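The divvying step might look roughly like this (MIN_BUFFER and the proportional rule are assumptions for illustration, not the scanner's actual heuristic):

```python
MIN_BUFFER = 8 * 1024  # assumed floor: one minimum-sized I/O buffer per column

def divide_reservation(total, col_sizes):
    """Split 'total' bytes of reservation across columns in proportion to
    their on-disk sizes, giving every column at least MIN_BUFFER."""
    total_size = sum(col_sizes)
    return [max(MIN_BUFFER, total * s // total_size) for s in col_sizes]

# A wide column gets most of the reservation; a tiny one gets the floor.
shares = divide_reservation(4 * 1024 * 1024, [3_000_000, 1_000_000, 1_000])
assert shares[0] > shares[1] > shares[2]
assert shares[2] == MIN_BUFFER
```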

I adjusted how the 'mem_limit' is divided between buffer pool and
non-buffer-pool memory for low mem_limits to account for the increase in
buffer pool memory.

Testing:
* Added more planner tests to cover reservation calcs for scan node.
* Test scanners for all file formats with the reservation denial debug
  action, to test behaviour when the scanners hit reservation limits.
* Updated memory and buffer pool limits for tests.
* Added unit tests for dividing reservation between columns in parquet,
  since the algorithm is non-trivial.

Perf:
I ran TPC-H and targeted perf locally comparing with master. Both
showed small improvements of a few percent and no regressions of
note. Cluster perf tests showed no significant change.

Change-Id: I3ef471dc0746f0ab93b572c34024fc7343161f00
Reviewed-on: http://gerrit.cloudera.org:8080/9679
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2018-04-28 23:41:39 +00:00
Taras Bobrovytsky
d0f838b66a IMPALA-6340,IMPALA-6518: Check that decimal types are compatible in FE
In this patch we implement strict decimal type checking in the FE in
various situations when DECIMAL_V2 is enabled. What is affected:
- Union. If we union two decimals and it is not possible to come up
  with a decimal that will be able to contain all the digits, an error
  is thrown. For example, the union(decimal(20, 10), decimal(20, 20))
  returns decimal(30, 20). However, for union(decimal(38, 0),
  decimal(38, 38)) the ideal return type would be decimal(76,38), but
  this is too large, so an error is thrown.
- Insert. If we are inserting a decimal value into a column where we are
  not guaranteed that all digits will fit, an error is thrown. For
  example, inserting a decimal(38,0) value into a decimal(38,38) column.
- Functions such as coalesce(). If we are unable to determine the output
  type that guarantees that all digits will fit from all the arguments,
  an error is thrown. For example,
  coalesce(decimal(38,38), decimal(38,0)) will throw an error.
- Hash Join. When joining on two decimals, if a type cannot be
  determined that both columns can be cast to, we throw an error.
  For example, join on decimal(38,0) and decimal(38,38) will result
  in an error.

To avoid these errors, you need to use CAST() on some of the decimals.
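The widening rule behind the union examples above can be sketched as follows (MAX_PRECISION = 38 matches Impala's decimal limit; the formula is the standard decimal-union rule and is an assumption here, not a quote of the FE code):

```python
MAX_PRECISION = 38  # Impala's maximum decimal precision

def union_decimal(p1, s1, p2, s2):
    """Smallest decimal(p, s) that can hold all digits of both inputs;
    raises (strict DECIMAL_V2 behavior) if no such type exists."""
    scale = max(s1, s2)                       # keep all fractional digits
    precision = max(p1 - s1, p2 - s2) + scale # keep all integer digits
    if precision > MAX_PRECISION:
        raise ValueError("incompatible decimals; add an explicit CAST()")
    return precision, scale

assert union_decimal(20, 10, 20, 20) == (30, 20)  # example from above
# union_decimal(38, 0, 38, 38) would need decimal(76, 38) -> error
```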

In this patch we also change the output decimal calculation of decimal
round, truncate and related functions. If these functions are a no-op,
the resulting decimal type is the same as the input type.

Testing:
- Ran a core build which passed.

Change-Id: Id406f4189e01a909152985fabd5cca7a1527a568
Reviewed-on: http://gerrit.cloudera.org:8080/9930
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-28 03:33:02 +00:00
Thomas Tauber-Marshall
87be63e321 IMPALA-6821: Push down limits into Kudu
This patch takes advantage of a recent change in Kudu (KUDU-16) that
exposes the ability to set limits on KuduScanners. Since each
KuduScanner corresponds to a scan token, and there will be multiple
scan tokens per query, this is just a performance optimization in
cases where the limit is smaller than the number of rows per token,
and Impala still needs to apply the limit on its side for cases where
the limit is greater than the number of rows per token.
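The two-level limit enforcement described above can be sketched like this (pure illustration, not the Kudu client API):

```python
def scan_with_limit(tokens, limit):
    """Each token's scanner is capped at 'limit' (the pushed-down part),
    but the scan node must still enforce the limit over the merged stream,
    since every token may independently return up to 'limit' rows."""
    rows = []
    for token_rows in tokens:
        capped = token_rows[:limit]     # per-scanner pushed-down limit
        for r in capped:
            rows.append(r)
            if len(rows) == limit:      # node-side enforcement
                return rows
    return rows

assert scan_with_limit([[1, 2, 3], [4, 5]], limit=4) == [1, 2, 3, 4]
```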

Testing:
- Added e2e tests for various situations where limits are applied at
  a Kudu scan node.
- For the query 'select * from tpch_kudu.lineitem limit 1', a best
  case perf scenario for this change where the limit is highly
  effective, the time spent in the Kudu scan node was reduced from
  6.107ms to 3.498ms (avg over 3 runs).
- For the query 'select count(*) from (select * from
  tpch_kudu.lineitem limit 1000000) v', a worst case perf scenario for
  this change where the limit is ineffective, the time spent in the
  Kudu scan node was essentially unchanged, 32.815ms previously vs.
  29.532ms (avg over 3 runs).

Change-Id: Ibe35e70065d8706b575e24fe20902cd405b49941
Reviewed-on: http://gerrit.cloudera.org:8080/10119
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-27 21:55:11 +00:00
Zoltan Borok-Nagy
1e79f14798 IMPALA-6314: Add run time scalar subquery check for uncorrelated subqueries
If a scalar subquery is used with a binary predicate or in an
arithmetic expression, it must return only one row/column to be
valid. If this cannot be guaranteed at parse time through a
single-row aggregate or a LIMIT clause, Impala rejects the query.

E.g., currently the following query is not allowed:
SELECT bigint_col
FROM alltypesagg
WHERE id = (SELECT id FROM alltypesagg WHERE id = 1)

However, it would be allowed if the query contained a LIMIT 1
clause, or used max(id) instead of id.

This commit makes the example valid by introducing a
runtime check to test if the subquery returns a single
row. If the subquery returns more than one row, it
aborts the query with an error.

I added a new node type, called CardinalityCheckNode. It
is created during planning on top of the subquery when
needed, then during execution it checks if its child
only returns a single row.

I extended the frontend tests and e2e tests as well.

Change-Id: I0f52b93a60eeacedd242a2f17fa6b99c4fc38e06
Reviewed-on: http://gerrit.cloudera.org:8080/9005
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-27 20:06:56 +00:00
Zoltan Borok-Nagy
25422c74b2 IMPALA-6934: Wrong results with EXISTS subquery containing ORDER BY, LIMIT, and OFFSET
Queries may return wrong results if an EXISTS subquery has
an ORDER BY with a LIMIT and OFFSET clause. The EXISTS
subquery may incorrectly evaluate to TRUE even though it is
FALSE.

The bug was found during the code review of IMPALA-6314
(https://gerrit.cloudera.org/#/c/9005/). Turned out
QueryStmt.setLimit() wipes the offset. I modified it to
keep the offset expr.

Added tests to 'PlannerTest/subquery-rewrite.test' and
'QueryTest/subquery.test'

Change-Id: I9693623d3d0a8446913261252f8e4a07935645e0
Reviewed-on: http://gerrit.cloudera.org:8080/10218
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-26 20:12:38 +00:00
Tim Armstrong
d879fa9930 IMPALA-6905: support regexes with more verifiers
Support row_regex and other lines for the subset and superset verifiers,
which previously assumed that lines in the actual and expected had to
match exactly.

Use in test_stats_extrapolation to make the test more robust to
irrelevant changes in the explain plan.
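A minimal sketch of a regex-aware subset verifier (the "row_regex:" prefix follows the convention of Impala's test files; the helper names are assumptions):

```python
import re

def line_matches(expected, actual):
    """An expected line matches verbatim, or as a regex if it carries
    the row_regex: prefix."""
    if expected.startswith('row_regex:'):
        pattern = expected[len('row_regex:'):].strip()
        return re.match(pattern, actual) is not None
    return expected == actual

def is_subset(expected_lines, actual_lines):
    """Every expected line must match some actual line (VERIFY_IS_SUBSET
    style); exact equality is no longer assumed."""
    return all(any(line_matches(e, a) for a in actual_lines)
               for e in expected_lines)

assert is_subset(['row_regex: .*SCAN HDFS.*'],
                 ['00:SCAN HDFS [tpch.lineitem]'])
```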

Testing:
Manually modified a superset and a subset test to check that tests fail
as expected.

Change-Id: Ia7a28d421c8e7cd84b14d07fcb71b76449156409
Reviewed-on: http://gerrit.cloudera.org:8080/10155
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-26 00:56:36 +00:00
Tianyi Wang
8e86678d65 IMPALA-5690: Part 1: Rename ostream operators for thrift types
Thrift 0.9.3 implements "ostream& operator<<(ostream&, T)" for thrift
data types, while Impala did the same for enums and special types
including TNetworkAddress and TUniqueId. To prepare for the upgrade to
Thrift 0.9.3, this patch renames these Impala-defined functions. In the
absence of operator<<, assertion macros like DCHECK_EQ can no longer be
used on non-enum thrift defined types.

Change-Id: I9c303997411237e988ef960157f781776f6fcb60
Reviewed-on: http://gerrit.cloudera.org:8080/9168
Reviewed-by: Tianyi Wang <twang@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-20 10:28:12 +00:00
Thomas Tauber-Marshall
b68e06997c IMPALA-6880: disable flaky bloom filter test
This test is made flaky by IMPALA-6338. While that is being worked on,
temporarily disable this test.

Change-Id: I595645b0f2875614294adc7abb4572aec1be8ad5
Reviewed-on: http://gerrit.cloudera.org:8080/10122
Reviewed-by: Vuk Ercegovac <vercegovac@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-19 23:43:19 +00:00
Tim Armstrong
3ebf30a2a4 IMPALA-6847: work around high memory estimates for AC
Adds MAX_MEM_ESTIMATE_FOR_ADMISSION query option, which takes
effect if and only if
* Memory-based admission control is enabled for the pool
* No mem_limit is set (i.e. best practices are not being followed)

In that case min(MAX_MEM_ESTIMATE_FOR_ADMISSION, mem_estimate)
is used for admission control instead of mem_estimate.

This provides a way to override the planner's estimate if
it happens to be incorrect and is preventing the query from
running. Setting MEM_LIMIT is usually a better alternative
but sometimes it is not feasible to set MEM_LIMIT for each
individual query.
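The admission-control rule above reduces to (the function name and signature are assumptions; the logic follows the commit text):

```python
def mem_for_admission(mem_estimate, mem_limit,
                      max_mem_estimate_for_admission):
    """Memory figure used by memory-based admission control."""
    if mem_limit is not None:
        return mem_limit                 # explicit MEM_LIMIT always wins
    if max_mem_estimate_for_admission is not None:
        # Option takes effect only when no mem_limit is set.
        return min(max_mem_estimate_for_admission, mem_estimate)
    return mem_estimate

GB = 1024 ** 3
# Planner over-estimates 40GB; the option caps what admission control sees.
assert mem_for_admission(40 * GB, None, 2 * GB) == 2 * GB
assert mem_for_admission(1 * GB, None, 2 * GB) == 1 * GB
```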

Testing:
Added an admission control test to verify that query option allows
queries with high estimates to run.

Also tested manually on a minicluster started with:

  start-impala-cluster.py --impalad_args='-vmodule admission-controller=3 \
      -default_pool_mem_limit 12884901888'

Change-Id: Ia5fc32a507ad0f00f564dfe4f954a829ac55d14e
Reviewed-on: http://gerrit.cloudera.org:8080/10058
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-18 01:18:20 +00:00
Tianyi Wang
9a751f00b8 IMPALA-6822: Add a query option to control shuffling by distinct exprs
IMPALA-4794 changed the distinct aggregation behavior to shuffling by
both grouping exprs and the distinct expr. It's slower in queries
where the NDVs of grouping exprs are high and data are uniformly
distributed among groups. This patch adds a query option controlling
this behavior, letting users switch to the old plan.

Change-Id: Icb4b4576fb29edd62cf4b4ba0719c0e0a2a5a8dc
Reviewed-on: http://gerrit.cloudera.org:8080/9949
Reviewed-by: Tianyi Wang <twang@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-12 22:01:35 +00:00
stiga-huang
818cd8fa27 IMPALA-5717: Support for reading ORC data files
This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies the input needed by the ORC reader, tracks the reader's memory
consumption, and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version used is release-1.4.3.

A startup option --enable_orc_scanner is added for this feature. It's
set to true by default. Setting it to false will fail queries on ORC
tables.

Currently, we only support reading primitive types. Writing to ORC
tables is not yet supported.
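The core of that columnar-to-row transfer can be illustrated in miniature (orc::ColumnVectorBatch and impala::RowBatch are the real C++ types; this Python sketch only shows the transposition):

```python
def column_batch_to_rows(columns):
    """columns: equally sized value lists, one per selected column, as an
    ORC-style column vector batch delivers them; returns row tuples."""
    num_rows = len(columns[0])
    return [tuple(col[i] for col in columns) for i in range(num_rows)]

# Two columns of two values each become two rows.
assert column_batch_to_rows([[1, 2], ['a', 'b']]) == [(1, 'a'), (2, 'b')]
```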

Tests
 - Most of the end-to-end tests can run on ORC format.
 - Add tpcds, tpch tests for ORC.
 - Add some ORC specific tests.
 - Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
   is not robust for corrupt files (ORC-315).

Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Reviewed-on: http://gerrit.cloudera.org:8080/9134
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-11 05:13:02 +00:00
Zoltan Borok-Nagy
2ee914d5b3 IMPALA-5903: Inconsistent specification of result set and result set metadata
Before this commit it was quite arbitrary which DDL operations
returned a result set and which didn't.

With this commit, every DDL operation returns a summary of
its execution. The operations declare their result set schema in
Frontend.java and provide the summary in CatalogOpExecutor.java.

Updated the tests according to the new behavior.

Change-Id: Ic542fb8e49e850052416ac663ee329ee3974e3b9
Reviewed-on: http://gerrit.cloudera.org:8080/9090
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-11 02:21:48 +00:00
Tim Armstrong
2995be8238 IMPALA-5607: part 1: breaking extract/date_part changes
This is the compatibility-breaking part of Jinchul Kim's change
to add additional units. To support nanoseconds we need to
widen the output type of these functions. We also change
the meaning of "milliseconds" to include the seconds component.
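A worked example of the "milliseconds" change (the concrete values are assumed): for a timestamp at 56.789 seconds past the minute, the unit now folds in the seconds component:

```python
def extract_milliseconds(seconds, millis):
    """New semantics: the seconds component is included."""
    return seconds * 1000 + millis

# 56.789s past the minute: old behavior returned 789, new returns 56789.
assert extract_milliseconds(56, 789) == 56789
```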

Cherry-picks: not for 2.x

Change-Id: I42d83712d9bb3a4900bec38a9c009dcf2a1fe019
Reviewed-on: http://gerrit.cloudera.org:8080/9957
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-10 04:00:37 +00:00
Thomas Tauber-Marshall
d437f956ca IMPALA-6338: Disable flaky bloom filter test
The underlying issue in IMPALA-6338 causes successful queries that are
cancelled internally (because all results have been returned) to, in
rare cases, be missing info from the profile. This has caused flaky
tests but has low impact on users, and unfortunately, with the current
query lifecycle logic in the coordinator, there is no simple solution.

There is ongoing work to improve query lifecycle logic in the
coordinator holistically, see IMPALA-5384. This work will eventually
address the underlying cause of IMPALA-6338. Until then, we disable
the tests that have been flaky.

Change-Id: Ie30b88fb8fb7780fc3a7153c05fdc3606145ce35
Reviewed-on: http://gerrit.cloudera.org:8080/9822
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-09 21:52:57 +00:00
Bikramjeet Vig
75e1bd1bcd IMPALA-6771: Fix in-predicate set up bug
Fixes a bug where default-initialized values were introduced into the
set data structure used for membership checks, which could cause wrong
results.
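The class of bug can be illustrated like this (a deliberately simplified sketch, not Impala's actual set implementation):

```python
def build_set_buggy(values, capacity):
    """Pre-size the backing storage; the unused, default-initialized
    slots (zeros) then look like real members."""
    slots = [0] * capacity
    slots[:len(values)] = values
    return slots  # bug: trailing zeros are treated as set members

def build_set_fixed(values, capacity):
    """Only the values actually inserted count as members."""
    return list(values)

vals = [7, 9]
assert 0 in build_set_buggy(vals, capacity=4)      # wrong: 0 "in" the set
assert 0 not in build_set_fixed(vals, capacity=4)  # correct
```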

Testing:
Added a test case that checks for the same.

Change-Id: I7e776dbcb7ee4a9b64e1295134a27d332f5415b6
Reviewed-on: http://gerrit.cloudera.org:8080/9891
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-04-04 21:51:29 +00:00
Fredy Wijaya
8173e9ab4d IMPALA-6571: NullPointerException in SHOW CREATE TABLE for HBase tables
This patch fixes the NullPointerException in SHOW CREATE TABLE for HBase
tables.

Testing:
- Moved the content of back hbase-show-create-table.test to
  show-create-table.test
- Ran show-create-table end-to-end tests

Change-Id: Ibe018313168fac5dcbd80be9a8f28b71a2c0389b
Reviewed-on: http://gerrit.cloudera.org:8080/9884
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-04-04 00:12:30 +00:00
Fredy Wijaya
08d386f0fc IMPALA-6724: Allow creating/dropping functions with the same name as built-ins
This patch removes restriction on creating a function with the same name
as the built-in function. The reason for lifting the restriction is to
avoid a name clash when introducing new built-in functions. The patch
also fixes some inconsistent behavior when creating or dropping a
function, depending on whether the specified name is fully qualified.

Refer to the below tables for more information.

Create function:
+---------+-------------+-------------------------+-------------------------------+-------------------------------+
| FQ Name | Built-in DB | Function Name           | Existing Behavior             | New Behavior                  |
+---------+-------------+-------------------------+-------------------------------+-------------------------------+
| Yes     | Yes         | Same as built-in        | Same name exception           | Cannot modify system database |
| Yes     | Yes         | Different than built-in | Cannot modify system database | Cannot modify system database |
| Yes     | No          | Same as built-in        | Function created              | Function created              |
| Yes     | No          | Different than built-in | Function created              | Function created              |
| No      | Yes         | Same as built-in        | Same name exception           | Cannot modify system database |
| No      | Yes         | Different than built-in | Cannot modify system database | Cannot modify system database |
| No      | No          | Same as built-in        | Same name exception           | Function created              |
| No      | No          | Different than built-in | Function created              | Function created              |
+---------+-------------+-------------------------+-------------------------------+-------------------------------+

Drop function:
+---------+-------------+-------------------------+-------------------------------+-------------------------------+
| FQ Name | Built-in DB | Function Name           | Existing Behavior             | New Behavior                  |
+---------+-------------+-------------------------+-------------------------------+-------------------------------+
| Yes     | Yes         | Same as built-in        | Cannot modify system database | Cannot modify system database |
| Yes     | Yes         | Different than built-in | Cannot modify system database | Cannot modify system database |
| Yes     | No          | Same as built-in        | Function dropped              | Function dropped              |
| Yes     | No          | Different than built-in | Function dropped              | Function dropped              |
| No      | Yes         | Same as built-in        | Cannot modify system database | Cannot modify system database |
| No      | Yes         | Different than built-in | Cannot modify system database | Cannot modify system database |
| No      | No          | Same as built-in        | Cannot modify system database | Function dropped              |
| No      | No          | Different than built-in | Function dropped              | Function dropped              |
+---------+-------------+-------------------------+-------------------------------+-------------------------------+

Select function (no new behavior):
+---------+-------------+-------------------------+--------------------------------------------------------+
| FQ Name | Built-in DB | Function Name           | Behavior                                               |
+---------+-------------+-------------------------+--------------------------------------------------------+
| Yes     | Yes         | Same as built-in        | Function in the specified database (built-in) executed |
| Yes     | Yes         | Different than built-in | Unknown function exception                             |
| Yes     | No          | Same as built-in        | Function in the specified database executed            |
| Yes     | No          | Different than built-in | Function in the specified database executed            |
| No      | Yes         | Same as built-in        | Built-in function executed                             |
| No      | Yes         | Different than built-in | Unknown function exception                             |
| No      | No          | Same as built-in        | Built-in function executed                             |
| No      | No          | Different than built-in | Function in the current database executed              |
+---------+-------------+-------------------------+--------------------------------------------------------+
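The "Select function" lookup order above can be sketched as follows (a simplified model: '_impala_builtins' is Impala's built-in database name, the rest of the names and the single current-db set are assumptions):

```python
def resolve(name, builtins, current_db_fns, db=None):
    """Return the function to execute, or None for 'unknown function'."""
    if db is not None:                   # fully qualified: db.fn()
        fns = builtins if db == '_impala_builtins' else current_db_fns
        return name if name in fns else None
    if name in builtins:                 # unqualified: built-ins win...
        return name
    # ...then fall back to the current database.
    return name if name in current_db_fns else None

# Built-in executed even when a same-named user function exists.
assert resolve('abs', builtins={'abs'}, current_db_fns={'abs'}) == 'abs'
assert resolve('my_fn', builtins={'abs'}, current_db_fns={'my_fn'}) == 'my_fn'
```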

Testing:
- Ran front-end tests
- Added end-to-end DDL function tests

Cherry-picks: not for 2.x

Change-Id: Ic30df56ac276970116715c14454a5a2477b185fa
Reviewed-on: http://gerrit.cloudera.org:8080/9800
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-04-02 21:12:31 +00:00
Fredy Wijaya
c3ab27681f IMPALA-6739: Exception in ALTER TABLE SET statements
The patch fixes issues with executing ALTER TABLE SET statements when
there are no matching partitions.

The patch also removes an incorrect precondition, i.e.
(partitionSet == null || !partitionSet.isEmpty()) in ALTER TABLE SET
statements because a partitionSet can be null when PARTITION is not
specified in the ALTER TABLE SET statement and partitionSet can be
empty when there is no matching partition. For example:

Matching partitions (partitionSet != null && !partitionSet.isEmpty()):
> alter table functional.alltypesagg partition(year=2009, month=1)
  set fileformat parquet;

No matching partitions (partitionSet != null && partitionSet.isEmpty()):
> alter table functional.alltypesagg partition(year=1990, month=1)
  set fileformat parquet;

No partition specified (partitionSet == null):
> alter table functional.alltypesagg set fileformat parquet;
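The corrected handling of the three cases can be sketched as (function and variable names are assumptions):

```python
def targets_for_alter(table_partitions, partition_set):
    """Which partitions an ALTER TABLE SET statement applies to."""
    if partition_set is None:
        return table_partitions   # no PARTITION clause: whole table
    # May legitimately be empty: no matching partitions is a no-op,
    # not a failed precondition.
    return partition_set

assert targets_for_alter(['p1', 'p2'], None) == ['p1', 'p2']
assert targets_for_alter(['p1', 'p2'], []) == []   # previously threw
```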

Testing:
- Added a new test
- Ran all front-end tests

Change-Id: I793e827d5cf5b7986bd150dd9706df58da3417f3
Reviewed-on: http://gerrit.cloudera.org:8080/9819
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-04-02 21:05:40 +00:00
Thomas Tauber-Marshall
832974383c IMPALA-6445: Test for kudu master address with whitespace
A concern was brought up that Impala might not handle kudu master
addresses containing whitespace correctly. Turns out that the Kudu
client takes care of stripping whitespace, so it works, but it would
be good to have a test to ensure it continues to work.

Change-Id: I1857b8dbcb5af66d69f7620368cd3b9b85ae7576
Reviewed-on: http://gerrit.cloudera.org:8080/9876
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-04-02 20:29:51 +00:00
Bikramjeet Vig
4a39e7c29f IMPALA-5980: Upgrade to LLVM 5.0.1
Highlighting a few changes in LLVM:
- Minor changes to some function signatures
- Minor changes to error handling
- Split Bitcode/ReaderWriter.h - https://reviews.llvm.org/D26502
- Introduced an optional new GVN optimization pass.

Needed to fix a bunch of new clang-tidy warnings.

Testing:
Ran core and ASAN tests successfully.

Performance:
Ran single node TPC-H and targeted perf with scale factor 60. Both
improved on average. Identified regression in
"primitive_filter_in_predicate" which will be addressed by IMPALA-6621.

+-------------------+-----------------------+---------+------------+------------+----------------+
| Workload          | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+-------------------+-----------------------+---------+------------+------------+----------------+
| TARGETED-PERF(60) | parquet / none / none | 22.29   | -0.12%     | 3.90       | +3.16%         |
| TPCH(60)          | parquet / none / none | 15.97   | -3.64%     | 10.14      | -4.92%         |
+-------------------+-----------------------+---------+------------+------------+----------------+

+-------------------+--------------------------------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------------+-------+
| Workload          | Query                                                  | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Num Clients | Iters |
+-------------------+--------------------------------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------------+-------+
| TARGETED-PERF(60) | PERF_LIMIT-Q1                                          | parquet / none / none | 0.01   | 0.00        | R +156.43% | * 25.80% * | * 17.14% *     | 1           | 5     |
| TARGETED-PERF(60) | primitive_filter_in_predicate                          | parquet / none / none | 3.39   | 1.92        | R +76.33%  |   3.23%    |   4.37%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_filter_string_non_selective                  | parquet / none / none | 1.25   | 1.11        |   +12.46%  |   3.41%    |   5.36%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_filter_decimal_selective                     | parquet / none / none | 1.40   | 1.25        |   +12.25%  |   3.57%    |   3.44%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_filter_string_like                           | parquet / none / none | 16.87  | 15.65       |   +7.78%   |   5.05%    |   0.37%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_min_max_runtime_filter                       | parquet / none / none | 1.79   | 1.71        |   +4.77%   |   0.71%    |   1.73%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_broadcast_join_2                             | parquet / none / none | 0.60   | 0.58        |   +3.64%   |   3.19%    |   3.81%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_filter_string_selective                      | parquet / none / none | 0.95   | 0.93        |   +2.91%   |   5.23%    |   5.85%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_broadcast_join_3                             | parquet / none / none | 4.33   | 4.21        |   +2.83%   |   5.46%    |   3.25%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_groupby_bigint_lowndv                        | parquet / none / none | 4.59   | 4.47        |   +2.82%   |   3.73%    |   1.14%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_conjunct_ordering_3                          | parquet / none / none | 0.20   | 0.19        |   +2.65%   |   4.76%    |   2.24%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_AGG-Q1                                            | parquet / none / none | 2.49   | 2.43        |   +2.31%   |   1.06%    |   1.93%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_AGG-Q6                                            | parquet / none / none | 2.04   | 2.00        |   +2.09%   |   3.51%    |   2.80%        | 1           | 5     |
| TPCH(60)          | TPCH-Q3                                                | parquet / none / none | 12.37  | 12.17       |   +1.62%   |   0.80%    |   2.45%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_STRING-Q5                                         | parquet / none / none | 4.52   | 4.45        |   +1.54%   |   1.23%    |   1.08%        | 1           | 5     |
| TPCH(60)          | TPCH-Q6                                                | parquet / none / none | 2.95   | 2.91        |   +1.33%   |   1.92%    |   1.67%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_STRING-Q4                                         | parquet / none / none | 3.71   | 3.66        |   +1.26%   |   0.34%    |   0.53%        | 1           | 5     |
| TPCH(60)          | TPCH-Q1                                                | parquet / none / none | 18.69  | 18.47       |   +1.19%   |   0.75%    |   0.31%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_STRING-Q7                                         | parquet / none / none | 8.15   | 8.07        |   +0.99%   |   3.92%    |   1.58%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_groupby_decimal_highndv                      | parquet / none / none | 31.31  | 31.01       |   +0.97%   |   1.74%    |   1.14%        | 1           | 5     |
| TPCH(60)          | TPCH-Q5                                                | parquet / none / none | 7.59   | 7.53        |   +0.78%   |   0.38%    |   0.99%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_AGG-Q4                                            | parquet / none / none | 21.25  | 21.09       |   +0.76%   |   0.76%    |   0.75%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_conjunct_ordering_4                          | parquet / none / none | 0.24   | 0.24        |   +0.75%   |   3.14%    |   4.76%        | 1           | 5     |
| TPCH(60)          | TPCH-Q19                                               | parquet / none / none | 7.88   | 7.82        |   +0.74%   |   2.39%    |   2.64%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_orderby_bigint                               | parquet / none / none | 5.10   | 5.07        |   +0.61%   |   0.74%    |   0.54%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_STRING-Q3                                         | parquet / none / none | 3.61   | 3.59        |   +0.60%   |   1.45%    |   0.90%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_orderby_all                                  | parquet / none / none | 27.63  | 27.48       |   +0.55%   |   0.85%    |   0.10%        | 1           | 5     |
| TPCH(60)          | TPCH-Q4                                                | parquet / none / none | 5.81   | 5.79        |   +0.45%   |   1.65%    |   2.16%        | 1           | 5     |
| TPCH(60)          | TPCH-Q13                                               | parquet / none / none | 23.49  | 23.43       |   +0.27%   |   0.83%    |   0.63%        | 1           | 5     |
| TPCH(60)          | TPCH-Q21                                               | parquet / none / none | 68.88  | 68.76       |   +0.18%   |   0.22%    |   0.19%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_groupby_decimal_lowndv.test                  | parquet / none / none | 4.38   | 4.37        |   +0.09%   |   2.45%    |   0.45%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_conjunct_ordering_5                          | parquet / none / none | 10.40  | 10.40       |   +0.07%   |   0.77%    |   0.50%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_long_predicate                               | parquet / none / none | 222.37 | 222.23      |   +0.06%   |   0.25%    |   0.25%        | 1           | 5     |
| TPCH(60)          | TPCH-Q8                                                | parquet / none / none | 10.65  | 10.65       |   +0.03%   |   0.55%    |   1.40%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_shuffle_join_one_to_many_string_with_groupby | parquet / none / none | 261.84 | 261.87      |   -0.01%   |   0.91%    |   0.74%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_AGG-Q3                                            | parquet / none / none | 9.44   | 9.45        |   -0.02%   |   0.92%    |   1.33%        | 1           | 5     |
| TPCH(60)          | TPCH-Q16                                               | parquet / none / none | 5.21   | 5.21        |   -0.02%   |   1.46%    |   1.64%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_top-n_all                                    | parquet / none / none | 34.58  | 34.62       |   -0.11%   |   0.22%    |   0.19%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_topn_bigint                                  | parquet / none / none | 4.24   | 4.25        |   -0.13%   |   6.66%    |   2.03%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_STRING-Q2                                         | parquet / none / none | 3.23   | 3.24        |   -0.34%   |   2.03%    |   0.32%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_broadcast_join_1                             | parquet / none / none | 0.18   | 0.18        |   -0.40%   |   6.16%    |   2.45%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_exchange_broadcast                           | parquet / none / none | 46.27  | 46.51       |   -0.52%   |   7.83%    | * 15.60% *     | 1           | 5     |
| TARGETED-PERF(60) | primitive_groupby_bigint_pk                            | parquet / none / none | 114.32 | 114.92      |   -0.52%   |   0.24%    |   0.61%        | 1           | 5     |
| TPCH(60)          | TPCH-Q22                                               | parquet / none / none | 6.66   | 6.70        |   -0.53%   |   1.39%    |   0.84%        | 1           | 5     |
| TPCH(60)          | TPCH-Q20                                               | parquet / none / none | 5.78   | 5.81        |   -0.62%   |   1.25%    |   0.67%        | 1           | 5     |
| TPCH(60)          | TPCH-Q2                                                | parquet / none / none | 2.53   | 2.55        |   -0.64%   |   3.86%    |   3.72%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_AGG-Q5                                            | parquet / none / none | 0.58   | 0.58        |   -0.75%   |   0.99%    |   6.89%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_AGG-Q7                                            | parquet / none / none | 2.05   | 2.07        |   -0.86%   |   2.16%    |   4.73%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_shuffle_join_union_all_with_groupby          | parquet / none / none | 54.86  | 55.34       |   -0.87%   |   0.25%    |   0.66%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_conjunct_ordering_2                          | parquet / none / none | 7.52   | 7.59        |   -0.98%   |   1.53%    |   1.73%        | 1           | 5     |
| TPCH(60)          | TPCH-Q9                                                | parquet / none / none | 36.43  | 36.79       |   -1.00%   |   1.60%    |   7.39%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_STRING-Q1                                         | parquet / none / none | 2.79   | 2.82        |   -1.10%   |   1.15%    |   2.25%        | 1           | 5     |
| TPCH(60)          | TPCH-Q11                                               | parquet / none / none | 1.95   | 1.97        |   -1.18%   |   3.14%    |   2.24%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_AGG-Q2                                            | parquet / none / none | 10.98  | 11.11       |   -1.24%   |   0.77%    |   1.45%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_small_join_1                                 | parquet / none / none | 0.22   | 0.22        |   -1.34%   | * 13.03% * | * 12.31% *     | 1           | 5     |
| TPCH(60)          | TPCH-Q7                                                | parquet / none / none | 42.82  | 43.41       |   -1.37%   |   1.63%    |   1.51%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_empty_build_join_1                           | parquet / none / none | 3.30   | 3.35        |   -1.54%   |   2.15%    |   1.27%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_STRING-Q6                                         | parquet / none / none | 10.34  | 10.54       |   -1.81%   |   0.24%    |   2.02%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_groupby_bigint_highndv                       | parquet / none / none | 32.80  | 33.46       |   -1.98%   |   1.29%    |   0.61%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_filter_decimal_non_selective                 | parquet / none / none | 1.62   | 1.67        |   -3.01%   |   0.79%    |   1.65%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_conjunct_ordering_1                          | parquet / none / none | 0.13   | 0.14        |   -3.36%   |   8.66%    | * 12.66% *     | 1           | 5     |
| TARGETED-PERF(60) | primitive_exchange_shuffle                             | parquet / none / none | 84.92  | 87.96       |   -3.46%   |   1.46%    |   1.50%        | 1           | 5     |
| TPCH(60)          | TPCH-Q12                                               | parquet / none / none | 6.98   | 7.31        |   -4.57%   |   1.03%    |   7.13%        | 1           | 5     |
| TPCH(60)          | TPCH-Q18                                               | parquet / none / none | 47.54  | 50.39       |   -5.64%   |   5.70%    |   5.53%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_filter_bigint_non_selective                  | parquet / none / none | 0.88   | 0.96        |   -7.81%   |   4.27%    |   5.97%        | 1           | 5     |
| TPCH(60)          | TPCH-Q15                                               | parquet / none / none | 8.14   | 9.15        |   -11.09%  |   0.63%    | * 10.44% *     | 1           | 5     |
| TPCH(60)          | TPCH-Q10                                               | parquet / none / none | 12.66  | 14.28       |   -11.34%  |   4.32%    |   1.14%        | 1           | 5     |
| TPCH(60)          | TPCH-Q17                                               | parquet / none / none | 10.31  | 12.59       |   -18.14%  |   0.65%    |   3.72%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_filter_bigint_selective                      | parquet / none / none | 0.14   | 0.19        | I -27.60%  | * 32.55% * | * 39.78% *     | 1           | 5     |
| TPCH(60)          | TPCH-Q14                                               | parquet / none / none | 6.10   | 11.00       | I -44.55%  |   4.06%    |   3.84%        | 1           | 5     |
+-------------------+--------------------------------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------------+-------+

Change-Id: Ib0a15cb53feab89e7b35a56b67b3b30eb3e62c6b
Reviewed-on: http://gerrit.cloudera.org:8080/9584
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-28 04:25:27 +00:00
Taras Bobrovytsky
8fec1911e5 IMPALA-6230, IMPALA-6468: Fix the output type of round() and related fns
Before this patch, the output type of round() ceil() floor() trunc() was
not always the same as the input type. It was also inconsistent in
general. For example, round(double) returned an integer, but
round(double, int) returned a double.

After looking at other database systems, we decided that the guideline
should be that the output type should be the same as the input type. In
this patch, we change the behavior of the previously mentioned functions
so that if a double is given then a double is returned.

We also modify the rounding behavior to always round away from zero.
Before, we were rounding towards positive infinity in some cases.

Testing:
- Updated tests
- Ran an exhaustive build which passed.

Cherry-picks: not for 2.x

Change-Id: I77541678012edab70b182378b11ca8753be53f97
Reviewed-on: http://gerrit.cloudera.org:8080/9346
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-24 04:43:01 +00:00
Philip Zeyliger
783de170c9 IMPALA-4277: Support multiple versions of Hadoop ecosystem
Adds support for building against two sets of Hadoop ecosystem
components. The control variable is IMPALA_MINICLUSTER_PROFILE_OVERRIDE,
which can either be set to 2 (for Hadoop 2, Hive 1, and so on) or 3 (for
Hadoop 3, Hive 2, and so on).

We intend (in a trivial follow-on change soon) to make 3 the new default
and to explicitly deprecate 2, but this change does not switch the
default yet. We support both to facilitate a smoother transition, but
support for profile 2 will be removed soon in the Impala 3.x line.

The switch is done at build time, following the pattern from IMPALA-5184
(build fe against both Hive 1 & 2 APIs). Switching back and forth
requires running 'cmake' again. Doing this at build-time avoids
complicating the Java code with classloader configuration.

There are relatively few incompatible APIs. This implementation
encapsulates that by extracting some Java code into
fe/src/compat-minicluster-profile-{2,3}. (This follows the
pattern established by IMPALA-5184, but, to avoid a proliferation
of directories, I've moved the Hive files into the same tree and
consolidated the Hive changes into the same directory structure.)

For Maven, I introduced Maven "profiles" to handle the two cases where
the dependencies (and exclusions) differ. These are driven by the
$IMPALA_MINICLUSTER_PROFILE environment variable.

For Sentry, exception class names changed. We work around this by adding
"isSentry...(Exception)" methods with two different implementations.
Sentry is also doing some odd shading, whereby some exceptions are
"sentry.org.apache.sentry..."; we handle both. Similarly, the mechanism
to create a SentryAuthProvider is slightly different. The easiest way to
see the differences is to run:

  diff -u fe/src/compat-minicluster-profile-{2,3}/java/org/apache/impala/util/SentryUtil.java
  diff -u fe/src/compat-minicluster-profile-{2,3}/java/org/apache/impala/authorization/SentryAuthProvider.java

The Sentry work is based on a change by Zach Amsden.

In addition, we recently added an explicit "refresh" permission.  In
Sentry 2, this required creating an ImpalaPrivilegeModel to capture
that. It's a slight customization of Hive's equivalent class.

For Parquet, the difference is even more mechanical. The package names
have gone from "parquet" to "org.apache.parquet". The affected code
was extracted into ParquetHelper, but only one copy exists. The second
copy is generated at build-time using sed.

In the rare cases where we need to behave differently at runtime,
MiniclusterProfile.MINICLUSTER_PROFILE is a class which encapsulates
what version we were built against. One of the cases is the results
expected by various frontend tests. I avoided the issue by translating
one error string into another, which handled the diversion in one place,
rather than complicating the several locations which look for "No
FileSystem for scheme..." errors.

The HBase APIs we use for splitting regions at test time changed.
This patch includes a re-write of that code for the new APIs. This
piece was contributed by Zach Amsden.

To work with newer versions of dependencies, I updated the version of
httpcomponents.core we use to 4.4.9.

We (Thomas Tauber-Marshall and I) uploaded new Hadoop/Hive/Sentry/HBase
binaries to s3://native-toolchain, and amended the shell scripts to
launch the right things. There are minor mechanical differences.  Some
of this was based on earlier work by Joe McDonnell and Zach Amsden.
Hive's logging is changed in Hive 2, necessitating creating a
log4j2.properties template and using it appropriately. Furthermore,
Hadoop3's new shell script re-writes do a certain amount of classpath
de-duplication, causing some issues with locating the relevant logging
configurations. Accommodations exist in the code to deal with that.

parquet-filtering.test was updated to turn off stats filtering. Older
Hive didn't write Parquet statistics, but newer Hive does. By turning
off stats filtering, we test what the test had intended to test.

For views-compatibility.test, it seems that Hive 2 has fixed certain
bugs that we were testing for in Hive. I've added a
HIVE=SUCCESS_PROFILE_3_ONLY mechanism to capture that.

For AuthorizationTest, different hive versions show slightly different
things for extended output.

To facilitate easier reviewing, the following files are 100% renames as identified by git; nothing
to see here.

 rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetCatalogsReq.java (100%)
 rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetColumnsReq.java (100%)
 rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetFunctionsReq.java (100%)
 rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetInfoReq.java (100%)
 rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetSchemasReq.java (100%)
 rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetTablesReq.java (100%)
 rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/impala/compat/MetastoreShim.java (100%)
 rename fe/src/{compat-hive-2 => compat-minicluster-profile-3}/java/org/apache/impala/compat/MetastoreShim.java (100%)
 rename testdata/cluster/node_templates/{cdh5 => common}/etc/hadoop/conf/kms-acls.xml.tmpl (100%)
 rename testdata/cluster/node_templates/{cdh5 => common}/etc/hadoop/conf/kms-site.xml.tmpl (100%)
 rename testdata/cluster/node_templates/{cdh5 => common}/etc/hadoop/conf/yarn-site.xml.tmpl (100%)
 rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-common (100%)
 rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-master (100%)
 rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-tserver (100%)
 rename testdata/cluster/node_templates/{cdh5 => common}/etc/kudu/master.conf.tmpl (100%)
 rename testdata/cluster/node_templates/{cdh5 => common}/etc/kudu/tserver.conf.tmpl (100%)

CreateTableLikeFileStmt had a chunk of code moved to ParquetHelper.java. This
was done manually, but without changing anything except what Java required in
terms of accessibility and boilerplate.

 rewrite fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java (80%)
 copy fe/src/{main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java => compat-minicluster-profile-3/java/org/apache/impala/analysis/ParquetHelper.java} (77%)

Testing: Ran core & exhaustive tests with both profiles.
Cherry-picks: not for 2.x.

Change-Id: I7a2ab50331986c7394c2bbfd6c865232bca975f7
Reviewed-on: http://gerrit.cloudera.org:8080/9716
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-23 20:56:00 +00:00
Tim Armstrong
588e1d46e9 IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Impala already supported RLE encoding for levels and dictionary pages, so
the only task was to integrate it into BoolColumnReader.

A new benchmark, rle-benchmark.cc is added to test the speed of RLE
decoding for different bit widths and run lengths.

There might be a small performance impact on PLAIN encoded booleans,
because of the additional branch when the cache of BoolColumnReader is
filled. As the cache size is 128, I considered this to be outside the
"hot loop".

Testing:

As Impala cannot write RLE encoded bool columns at the moment, parquet-mr
was used to create a test file, testdata/data/rle_encoded_bool.parquet

tests/query_test/test_scanners.py#test_rle_encoded_bools creates a table
that uses this file, and tries to query from it.

Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Reviewed-on: http://gerrit.cloudera.org:8080/9403
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-22 02:47:33 +00:00
Tim Armstrong
e148c1a7c3 IMPALA-6589: remove invalid DCHECK in parquet reader
The DCHECK was only valid if the Parquet file metadata is internally
consistent, with the number of values reported by the metadata
matching the number of encoded levels.

The DCHECK was intended to directly detect misuse of the RleBatchDecoder
interface, which would lead to incorrect results. However, our other
test coverage for reading Parquet files is sufficient to test the
correctness of level decoding.

Testing:
Added a minimal corrupt test file that reproduces the issue.

Change-Id: Idd6e09f8c8cca8991be5b5b379f6420adaa97daa
Reviewed-on: http://gerrit.cloudera.org:8080/9556
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-17 02:52:19 +00:00
Fredy Wijaya
41a516f949 IMPALA-6655: Add owner information on database creation
Add owner information on database creation.

> create database foo;
> describe database extended foo;
+---------+----------+---------+
| name    | location | comment |
+---------+----------+---------+
| foo     |          |         |
| Owner:  |          |         |
|         | user1    | USER    |
+---------+----------+---------+

Testing:
- Ran end-to-end query and metadata tests

Change-Id: Id74ec9bd3cb7954999305e9cd9085cbf50921a78
Reviewed-on: http://gerrit.cloudera.org:8080/9637
Reviewed-by: Fredy Wijaya <fwijaya@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-16 19:28:37 +00:00