impala

mirror of https://github.com/apache/impala.git synced 2026-01-05 21:00:54 -05:00

Author	SHA1	Message	Date
Lars Volker	8ea21d099f	IMPALA-2523: Make HdfsTableSink aware of clustered input IMPALA-2521 introduced clustering for insert statements. This change makes the HdfsTableSink aware of clustered inputs, so that partitions are opened, written, and closed one by one. This change also adds/modifies tests in several ways: - clustered insert tests switch from selecting all rows from alltypessmall to alltypes. Together with varying settings for batch_size, this results in a larger number of row batches being written. - clustered insert tests select from alltypes instead of functional.alltypes to make sure we also select from various input formats. - clustered insert tests have been added to select from alltypestiny to create inserts with 1 and 2 rows per partition respectively. - exhaustive insert tests now use different values for batch_size: 1, 16, 0 (meaning default, 1024). This is limited to uncompressed parquet files, to maintain a reasonable runtime. On my machine execution of test.insert took 1778 seconds, compared to 1002 seconds with just the default row batch size. - There is additional testing in test_insert_behaviour.py to make sure that insertion over several row batches only creates one file per partition. - It renames the test_insert method to make it unique in the file and allow for effective filtering with -k. - It adds tests to the Analyzer test suite. Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Reviewed-on: http://gerrit.cloudera.org:8080/4863 Reviewed-by: Lars Volker <lv@cloudera.com> Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-11-22 02:51:20 +00:00
Sailesh Mukil	178fd59142	IMPALA-4502: test_partition_ddl_predicates breaks on non-HDFS filesystems This is because that test uses 'set cached' and 'set uncached' which are not supported on non-HDFS filesystems. This patch creates a separate test file for non-HDFS filesystems with only supported queries and invokes the right file based on the filesystem. Change-Id: I8606aa427cb6e50be3395cdde246abb53db5172c Reviewed-on: http://gerrit.cloudera.org:8080/5164 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-11-22 00:42:57 +00:00
Dimitris Tsirogiannis	af67b2fef4	IMPALA-4514: Fix broken exhaustive builds caused by non-nullable columns This commit fixes the broken exhaustive Impala builds. The issue was caused by Kudu table that didn't properly specify nullability constraints. Hence, some rows were rejected during data loading causing some tests to fail. Change-Id: Ib6f4b4c88ef18b1731b7c9789aad602880e18035 Reviewed-on: http://gerrit.cloudera.org:8080/5157 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-11-21 22:24:25 +00:00
Dimitris Tsirogiannis	3db5ced4ce	IMPALA-3726: Add support for Kudu-specific column options This commit adds support for Kudu-specific column options in CREATE TABLE statements. The syntax is: CREATE TABLE tbl_name ([col_name type [PRIMARY KEY] [option [...]]] [, ....]) where option is: \| NULL \| NOT NULL \| ENCODING encoding_val \| COMPRESSION compression_algorithm \| DEFAULT expr \| BLOCK_SIZE num The output of the SHOW CREATE TABLE statement was altered to include all the specified column options for Kudu tables. Change-Id: I727b9ae1b7b2387db752b58081398dd3f3449c02 Reviewed-on: http://gerrit.cloudera.org:8080/5026 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-11-18 11:41:01 +00:00
Alex Behm	263f222557	IMPALA-4490: Only generate runtime filters for hash join nodes. Change-Id: I167725e260bd0f91c2bfc164eb044321192d5b95 Reviewed-on: http://gerrit.cloudera.org:8080/5117 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-11-18 00:26:35 +00:00
Henry Robinson	2648bfbd90	Improve message output from run-step.sh run-step prints a message to tell the reader what it's doing. However, that message wasn't flushed so that run-step could print OK or FAILED on the same line. The result was that long-running steps wouldn't print anything to the log until they were done, at least in Jenkins contexts. This patch changes it so that the message is flushed, and then the result is printed on a separate line (including the time it took to run the step). $ run-step "Hello world!" helloworld.out sleep 5 Hello world! (logging to /tmp/helloworld.out)... OK (Took: 0 min 5 sec) Change-Id: Iaced729f0ef6aa93174cd90b1516d3c34fe41a22 Reviewed-on: http://gerrit.cloudera.org:8080/5116 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-11-17 09:35:14 +00:00
Taras Bobrovytsky	eb8120d218	IMPALA-3812: Fix error message for unsupported types Before this patch an unclear error message was returned if DATE or DATETIME appeared in the select list after a star expansion. This was because DATE and DATETIME PrimitiveType was serialized as INVALID_TYPE. This is fixed by serializing correctly. Change-Id: I9019b4bfd219f94e554c795befd3ff5e39706ea9 Reviewed-on: http://gerrit.cloudera.org:8080/4859 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-11-17 05:31:34 +00:00
Thomas Tauber-Marshall	3833707dbd	IMPALA-4466: Improve Kudu CRUD test coverage The results in the test files were verified by hand. This patch also introduces a new test section 'DML_RESULTS', which takes the name of a table as a comment and the contents of the table as its body and then verifies that the body matches the actual contents of the table. This makes it easy to check that a DML operation has the desired effect on the contents of a table, rather than always having to add another test case that runs a select on the table. For now, this section cannot be used in a test along with the RESULTS or ERRORS sections. TODO: Refactor the DML test case handling (IMPALA-4471) Change-Id: Ib9e7afbef60186edb00a9d11fbe5a8c64931add6 Reviewed-on: http://gerrit.cloudera.org:8080/4953 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-11-17 02:54:30 +00:00
Alex Behm	f5e660dd6e	IMPALA-4470: Avoid creating a NumericLiteral from NaN/infinity/-0. Our NumericLiteral is backed by a BigDecimal which cannot represent the special float values NaN, infinity or negative zero. As a result, when evaluating constant expressions from the FE we hit an exception when trying to create a NumericLiteral from a NaN or infinity value. Before, negative zero would silently get converted to zero which is dangerous. The fix is to treat the expr evaluation as a failure and not replace the constant Expr with a LiteralExpr. Change-Id: I8243b2ee9fa9c470d078b385583f2f48b606a230 Reviewed-on: http://gerrit.cloudera.org:8080/5050 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-11-16 23:55:42 +00:00
Jim Apple	4b774880c9	Increase wait times for startup of Hive and its Metastore On Ubuntu 14.04 on AWS EC2 m4.4x instances, these components frequently take more than 30 seconds to start. I have seen the HMS take more than 90 seconds; this patch sets a more conservative timeout default. Change-Id: I43eb8646cca495578c8f9730faa04812957d2917 Reviewed-on: http://gerrit.cloudera.org:8080/5068 Reviewed-by: Jim Apple <jbapple@cloudera.com> Tested-by: Internal Jenkins	2016-11-15 20:35:01 +00:00
Alex Behm	91b5264e52	IMPALA-4479: Use correct isSet() thrift function when evaluating constant bool exprs. Change-Id: Ie3ba195a5241ca630bd0cf71b83d423733b06546 Reviewed-on: http://gerrit.cloudera.org:8080/5088 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-11-15 11:17:43 +00:00
Thomas Tauber-Marshall	e6e2baea33	IMPALA-4372: 'Describe formatted' returns types in upper case A recent change caused 'describe formatted' to display the types in all upper case, but we want 'describe formatted' to match Hive's 'describe' output, which displays the types in lower case. This patch also fixes several problems with test_describe_formatted, which was encountering an error but reporting success. Change-Id: I274b97d4d1247244247fb38a5ca7f4c10bba8d22 Reviewed-on: http://gerrit.cloudera.org:8080/4861 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-11-15 05:38:12 +00:00
Jim Apple	0ea4a666dc	IMPALA-4433: Always generate testdata using the same time zone setting Before this change, testdata was generated using the java.util.TimeZone.getDefault() TimeZone of the machine it was running on. This patch standardizes on "America/Los_Angeles", which matches the existing expected results in the end-to-end tests. Change-Id: Iaf7cc796e44e9ff64880f9ae852f40961592f279 Reviewed-on: http://gerrit.cloudera.org:8080/5058 Reviewed-by: Jim Apple <jbapple@cloudera.com> Tested-by: Internal Jenkins	2016-11-15 04:18:33 +00:00
Amos Bird	628685ae74	IMPALA-1654: General partition exprs in DDL operations. This commit handles partition related DDL in a more general way. We can now use compound predicates to specify a list of partitions in statements like ALTER TABLE DROP PARTITION and COMPUTE INCREMENTAL STATS, etc. It will also make sure some statements only accept one partition at a time, such as PARTITION SET LOCATION and LOAD DATA. ALTER TABLE ADD PARTITION remains using the old PartitionKeyValue's logic. The changed partition related DDLs are as follows, Table: p (i int) partitioned by (j int, k string) Partitions: +-------+---+-------+--------+------+--------------+-------------------+ \| j \| k \| #Rows \| #Files \| Size \| Bytes Cached \| Cache Replication \| +-------+---+-------+--------+------+--------------+-------------------+ \| 1 \| a \| -1 \| 0 \| 0B \| NOT CACHED \| NOT CACHED \| \| 1 \| b \| -1 \| 0 \| 0B \| NOT CACHED \| NOT CACHED \| \| 1 \| c \| -1 \| 0 \| 0B \| NOT CACHED \| NOT CACHED \| \| 2 \| d \| -1 \| 0 \| 0B \| NOT CACHED \| NOT CACHED \| \| 2 \| e \| -1 \| 0 \| 0B \| NOT CACHED \| NOT CACHED \| \| 2 \| f \| -1 \| 0 \| 0B \| NOT CACHED \| NOT CACHED \| \| Total \| \| -1 \| 0 \| 0B \| 0B \| \| +-------+---+-------+--------+------+--------------+-------------------+ 1. show files in p partition (j<2, k='a'); 2. alter table p partition (j<2, k in ("b","c") set cached in 'testPool'; // j can appear more than once, 3.1. alter table p partition (j<2, j>0, k<>"d") set uncached; // it is the same as 3.2. alter table p partition (j<2 and j>0, not k="e") set uncached; // we can also do 'or' 3.3. alter table p partition (j<2 or j>0, k like "%") set uncached; // missing 'k' matches all values of k 4. alter table p partition (j<2) set fileformat textfile; 5. alter table p partition (k rlike ".*") set serdeproperties ("k"="v"); 6. alter table p partition (j is not null) set tblproperties ("k"="v"); 7. alter table p drop partition (j<2); 8. 
compute incremental stats p partition(j<2); The remaining old partition related DDLs are as follows, 1. load data inpath '/path/from' into table p partition (j=2, k="d"); 2. alter table p add partition (j=2, k="g"); 3. alter table p partition (j=2, k="g") set location '/path/to'; 4. insert into p partition (j=2, k="g") values (1), (2), (3); General partition expressions or partially specified partition specs allow partition predicates to return an empty partition set whether or not 'IF EXISTS' is specified. Examples: [localhost.localdomain:21000] > alter table p drop partition (j=2, k="f"); Query: alter table p drop partition (j=2, k="f") +-------------------------+ \| summary \| +-------------------------+ \| Dropped 1 partition(s). \| +-------------------------+ Fetched 1 row(s) in 0.78s [localhost.localdomain:21000] > alter table p drop partition (j=2, k<"f"); Query: alter table p drop partition (j=2, k<"f") +-------------------------+ \| summary \| +-------------------------+ \| Dropped 2 partition(s). \| +-------------------------+ Fetched 1 row(s) in 0.41s [localhost.localdomain:21000] > alter table p drop partition (k="a"); Query: alter table p drop partition (k="a") +-------------------------+ \| summary \| +-------------------------+ \| Dropped 1 partition(s). 
\| +-------------------------+ Fetched 1 row(s) in 0.25s [localhost.localdomain:21000] > show partitions p; Query: show partitions p +-------+---+-------+--------+------+--------------+-------------------+ \| j \| k \| #Rows \| #Files \| Size \| Bytes Cached \| Cache Replication \| +-------+---+-------+--------+------+--------------+-------------------+ \| 1 \| b \| -1 \| 0 \| 0B \| NOT CACHED \| NOT CACHED \| \| 1 \| c \| -1 \| 0 \| 0B \| NOT CACHED \| NOT CACHED \| \| Total \| \| -1 \| 0 \| 0B \| 0B \| \| +-------+---+-------+--------+------+--------------+-------------------+ Fetched 3 row(s) in 0.01s Change-Id: I2c9162fcf9d227b8daf4c2e761d57bab4e26408f Reviewed-on: http://gerrit.cloudera.org:8080/3942 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-11-15 03:27:36 +00:00
David Knupp	b14f319708	IMPALA-4461: Make sure data gets loaded for wide hbase tables. This patch reverts a change that broke the exhaustive suite of Impala tests. The change was introduced here: `ce4c5f6743` The original problem was that data load was failing when run against a remote cluster, due to a 4000 byte max for SERDEPROPERTIES.PARAM_VALUE, a limitation that is well described in HIVE-1364. Locally, when we load data, we work around the issue here: https://github.com/apache/incubator-impala/blob/master/bin/create-test-configuration.sh#L99 When testing on CDH remote cluster however, this "fix" never gets applied. (It also assumes the database will always be postgres.) I made this change without realizing its full effect, or appreciating exactly how exhaustive our exhaustive test suite really is. Another solution will need to be found for the case of remote cluster testing, but this should unblock the local build for now. As far as testing, I ran the full suite of tests in query_test/ test_scanners.py, and they all pass after removing these lines. Change-Id: If2148d6546789c6c53c8e045717081b24ce76689 Reviewed-on: http://gerrit.cloudera.org:8080/5033 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-11-11 00:37:59 +00:00
Alex Behm	0aeb68050b	IMPALA-1286: Extract common conjuncts from disjunctions. Adds a new ExprRewriteRule to extract common conjuncts from disjunctions. Examples: (a AND b AND c) OR (b AND d) ==> b AND ((a AND c) OR (d)) (a AND b) OR (a AND b) ==> a AND b (a AND b AND c) OR (c) ==> c Adds a new query option ENABLE_EXPR_REWRITES to enable/disable non-essential expr rewrites in the FE. Note that some rewrites are required, e.g., BetweenToCompoundRule. Disabling the rewrites is useful for testing, in particular, to make sure that the exprs specified in expr-test.cc are executed as written. Testing: Added a new unit test in ExprRewriteRulesTest. Change-Id: I3cf9b950afaa3fd753d1b09ba5e540b5258940ad Reviewed-on: http://gerrit.cloudera.org:8080/4877 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-11-09 09:44:59 +00:00
Matthew Jacobs	cfac09de10	IMPALA-3710: Kudu DML should ignore conflicts, pt2 Second part of IMPALA-3710, which removed the IGNORE DML option and changed the following errors on Kudu DML operations to be ignored: 1) INSERT where the PK already exists 2) UPDATE/DELETE where the PK doesn't exist This changes other data-related errors to be ignored as well: 3) NULLs in non-nullable columns, i.e. null constraint violations. 4) Rows with PKs that are in an 'uncovered range'. It became clear that we can't differentiate between (3) and (4) because both return a Kudu 'NotFound' error code. The Impala error codes have been simplified as well: we just report a generic KUDU_NOT_FOUND error in these cases. This also adds some metadata to the thrift report sent to the coordinator from sinks so the total number of rows with errors can be added to the profile. Note that this does not include a breakdown of error counts by type/code because we cannot differentiate between all of these cases yet. An upcoming change will add this new info to the beeswax interface and show it in the shell output (IMPALA-3713). Testing: Updated kudu_crud tests to check the number of rows with errors. Change-Id: I4eb1ad91dc355ea51de261c3a14df0f9d28c879c Reviewed-on: http://gerrit.cloudera.org:8080/4985 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-11-09 06:43:41 +00:00
Tim Armstrong	d7246d64c7	IMPALA-1430,IMPALA-4108: codegen all builtin aggregate functions This change enables codegen for all builtin aggregate functions, e.g. timestamp functions and group_concat. There are several parts to the change: * Adding support for generic UDAs. Previously the codegen code did not handle multiple input arguments or NULL return values. * Defaulting to using the UDA interface when there is not a special codegen path (we have implementations of all builtin aggregate functions for the interpreted path). * Remove all the logic to disable codegen for the special cases that now are supported. Also fix the generation of code to get/set NULL bits since I needed to add functionality there anyway. Testing: Add tests that check that codegen was enabled for builtin aggregate functions. Also fix some gaps in the preexisting tests. Also add tests for UDAs that check input/output nulls are handled correctly, in anticipation of enabling codegen for arbitrary UDAs. The tests are run with both codegen enabled and disabled. To avoid flaky tests, we switch the UDF tests to use "unique_database". Perf: Ran local TPC-H and targeted perf. Spent a lot of time on TPC-H Q1, since my original approach regressed it ~5%. In the end the problem was to do with the ordering of loads/stores to the slot and null bit in the generated code: the previous version of the code exploited some properties of the particular aggregate function. I ended up replicating this behaviour to avoid regressing perf. Change-Id: Id9dc21d1d676505d3617e1e4f37557397c4fb260 Reviewed-on: http://gerrit.cloudera.org:8080/4655 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-11-09 03:27:12 +00:00
Jim Apple	6775893894	IMPALA-4447: Rein in overly broad sed that dirties the tree This patch fixes a sed expression to make sure it only alters the code it is meant to alter, not the comment describing the code. Tested with tests/run-tests.py query_test/test_udfs.py Change-Id: I51a0498d24b7fccc05b6183123501766cb36f85e Reviewed-on: http://gerrit.cloudera.org:8080/5008 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-11-09 02:44:36 +00:00
Matthew Jacobs	08d89a5cc3	IMPALA-3710: Kudu DML should ignore conflicts by default Removes the non-standard IGNORE syntax that was allowed for DML into Kudu tables to indicate that certain errors should be ignored, i.e. not fail the query and continue. However, because there is no way to 'roll back' mutations that occurred before an error occurs, tables are left in an inconsistent state and it's difficult to know what rows were successfully modified vs which rows were not. Instead, this change makes it so that we always 'ignore' these conflicts, i.e. a 'best effort'. In the future, when Kudu will provide the mechanisms Impala needs to provide a notion of isolation levels, then Impala will be able to provide options for more traditional semantics. After this change, the following errors are ignored: * INSERT where the PK already exists * UPDATE/DELETE where the PK doesn't exist Another follow-up patch will change other violations to be handled in this way as well, e.g. nulls inserted in non-nullable cols. Reporting: The number of rows inserted is reported to the coordinator, which makes the aggregate available to the shell and via the profile. TODO: Return rows modified for INSERT via HS2 (IMPALA-1789). TODO: Return rows modified for other CRUD (beeswax+hs2) (IMPALA-3713). TODO: Return error counts for specific warnings (IMPALA-4416). Testing: Updated tests. Ran all functional tests. More tests will be needed when other conflicts are handled in the same way. Change-Id: I83b5beaa982d006da4997a2af061ef7c22cad3f1 Reviewed-on: http://gerrit.cloudera.org:8080/4911 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-11-08 20:34:00 +00:00
Martin Grund	ce4c5f6743	IMPALA-4365: Enabling end-to-end tests on a remote cluster This patch lays the groundwork for loading data and running end-to-end tests on a remote CDH cluster. The requirements for the cluster to run the tests are: - Managed by Cloudera Manager (CM) - GPL Extras need to be installed - KMS and KeyTrustee installed and available as a service - SERDEPROPERTIES in the Hive DB modified to accept wide tables - Hive warehouse dir points to /test-warehouse The actual data loading is done via a new script, remote_data_load.py, which takes the CM host as an argument. It can be run from a client machine that is not a node of the cluster, but it needs to have the Impala repo checked out and Impala built. This ensures that all of the necessary data load scripts are available, as well as setting up the environment properly (client binaries like beeline and the hbase shell are available, python libraries like cm_api are installed, necessary environment variables are defined, etc.) It should be noted that running remote_data_load.py will overwrite any local XML config files with the configurations downloaded from the remote cluster. Usage: remote_data_load.py [options] <cm_host address> Options: -h, --help show this help message and exit --snapshot-file=SNAPSHOT_FILE Path to the test-warehouse archive --cm-user=CM_USER Cloudera Manager admin user --cm-pass=CM_PASS Cloudera Manager admin user password --gateway=GATEWAY Gateway host to upload the data from. If not set, uses the CM host as gateway. --ssh-user=SSH_USER System user on the remote machine with passwordless SSH configured. --no-load Do not try to load the snapshot --exploration-strategy=EXPLORATION_STRATEGY --test Run end-to-end tests against cluster Testing: This patch is being submitted with the understanding that there are still clean up issues that need to be addressed in the remote data load script, for which JIRAs have been filed. 
However, since many of the existing build scripts also had to be modified, it is more important to make sure that no regressions were inadvertently introduced into the existing data load process. Loading data to a local mini-cluster was checked repeatedly while this patch was being developed, as well as running it against the Jenkins job that provides the test-warehouse snapshot used by the many other Impala CI builds that run daily. Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9 Reviewed-on: http://gerrit.cloudera.org:8080/4769 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Internal Jenkins	2016-11-08 10:16:55 +00:00
Tim Armstrong	381e719065	IMPALA-4266: Java udf returning string can give incorrect results The memory management of string results was wrong: strings returned from Exprs must live until the next time FreeLocalAllocations() is called. Otherwise the buffer holding the string is freed or reused by the next UDF call. The fix is to copy string values into a buffer with the right lifetime. Testing: Added a regression test based on Bharath's example that reproduced the bug reliably. Change-Id: I705d271814cb1143f67d8a12f4fd87bab7a8e161 Reviewed-on: http://gerrit.cloudera.org:8080/4941 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-11-08 02:47:11 +00:00
Tim Armstrong	10fa472fa6	IMPALA-4302,IMPALA-2379: constant expr arg fixes This patch fixes two issues around handling of constant expr args. The patches are combined because they touch some of the same code and depend on some of the same memory management cleanup. First, it fixes IMPALA-2379, where constant expr args were not visible to UDAFs. The issue is that the input exprs need to be opened before calling the UDAF Init() function. Second, it avoids overhead from repeated evaluation of constant arguments for ScalarFnCall expressions on both the codegen'd and interpreted paths. A common example is an IN predicate with a long list of constant values. The interpreted path was inefficient because it always evaluated all children expressions. Instead in this patch constant args are evaluated once and cached. The memory management of the AnyVal* objects was somewhat nebulous - adjusted it so that they're allocated from ExprContext::mem_pool_, which has the correct lifetime. The codegen'd path was inefficient only with varargs - with fixed arguments the LLVM optimiser is able to infer after inlining that the expressions are constant and remove all evaluation. However, for varargs it stores the vararg values into a heap-allocated buffer. The LLVM optimiser is unable to remove these stores because they have a side-effect that is visible to code outside the function. The codegen'd path is improved by evaluating varargs into an automatic buffer that can be optimised out. We also make a small related change to bake the string constants into the codegen'd code. Testing: Ran exhaustive build. Added regression test for IMPALA-2379 and MemPool test for aligned allocation. Added a test for in predicates with constant strings. Perf: Added a targeted query that demonstrates the improvement. Also manually validated the non-codegend perf. Also ran TPC-H and targeted perf queries locally - didn't see any significant changes. 
+--------------------+-------------------------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+ \| Workload \| Query \| File Format \| Avg(s) \| Base Avg(s) \| Delta(Avg) \| StdDev(%) \| Base StdDev(%) \| Num Clients \| Iters \| +--------------------+-------------------------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+ \| TARGETED-PERF(_20) \| primitive_filter_in_predicate \| parquet / none / none \| 1.19 \| 9.82 \| I -87.85% \| 3.82% \| 0.71% \| 1 \| 10 \| +--------------------+-------------------------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+ (I) Improvement: TARGETED-PERF(_20) primitive_filter_in_predicate [parquet / none / none] (9.82s -> 1.19s [-87.85%]) +--------------+------------+----------+----------+------------+-----------+----------+----------+------------+--------+--------+-----------+ \| Operator \| % of Query \| Avg \| Base Avg \| Delta(Avg) \| StdDev(%) \| Max \| Base Max \| Delta(Max) \| #Hosts \| #Rows \| Est #Rows \| +--------------+------------+----------+----------+------------+-----------+----------+----------+------------+--------+--------+-----------+ \| 01:AGGREGATE \| 14.39% \| 155.88ms \| 214.61ms \| -27.37% \| 2.68% \| 163.38ms \| 227.53ms \| -28.19% \| 1 \| 1 \| 1 \| \| 00:SCAN HDFS \| 85.60% \| 927.46ms \| 9.43s \| -90.16% \| 4.49% \| 1.01s \| 9.50s \| -89.42% \| 1 \| 13.77K \| 14.05K \| +--------------+------------+----------+----------+------------+-----------+----------+----------+------------+--------+--------+-----------+ Change-Id: I45c3ed8c9d7a61e94a9b9d6c316e8a53d9ff6c24 Reviewed-on: http://gerrit.cloudera.org:8080/4838 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-11-08 02:44:51 +00:00
Thomas Tauber-Marshall	5cc133947f	IMPALA-4260: Alter table add column drops all the column stats Hive expects types for column stats to be specified as all lower case. For some reason, it doesn't check this when the stats are first written, but it does check when performing an 'alter table'. This causes it to drop stats that Impala wrote because we specify type names in upper case. This patch converts the types that Impala sends to Hive for the column stats to all lower case and adds a regression test. I also filed HIVE-15061 to track the issue from the Hive end. Change-Id: Ia373ec917efa7ab9f2a59b8a870b7ebc30175dda Reviewed-on: http://gerrit.cloudera.org:8080/4845 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-11-07 22:29:19 +00:00
Matthew Jacobs	50f7753d2b	IMPALA-3771: Expose kudu client timeout and set default The Kudu client timeout was too low for Impala usage. This sets the default timeout to 3 minutes and exposes it as a gflag. New timeout tests were added. Change-Id: Iad95e8e38aad4f76d21bac6879db6c02b3c3e045 Reviewed-on: http://gerrit.cloudera.org:8080/4849 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-11-05 06:43:45 +00:00
Thomas Tauber-Marshall	832fb53763	IMPALA-3725 Support Kudu UPSERT in Impala This patch introduces a new query statement, UPSERT, for Kudu tables which operates like an INSERT and uses all of the analysis, planning, and execution machinery as INSERT, except that if there's a primary key collision instead of returning an error an update is performed. New syntax: [with_clause] UPSERT INTO [TABLE] table_name [(column list)] { query_stmt \| VALUES (value [, value...]) [, (value [, (value...)]) ...] } where column list must contain all of the key columns in table_name, if specified, and table_name must be a Kudu table. This patch also improves the behavior of INSERTing into Kudu tables without specifying all of the key columns - this now results in an analysis exception, rather than attempting the INSERT and receiving an error back from Kudu. Change-Id: I8df5cea36b642e267f85ff6b163f3dd96b8386e9 Reviewed-on: http://gerrit.cloudera.org:8080/4047 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-11-05 04:16:54 +00:00
Dimitris Tsirogiannis	d802f321b2	IMPALA-3724: Support Kudu non-covering range partitions This commit adds support for non-covering range partitions in Kudu tables. The SPLIT ROWS clause is now deprecated and no longer supported. The following new syntax provides more flexibility in creating range partitions and it supports bounded and unbounded ranges as well as single value partitions; multi-column range partitions are supported as well. The new syntax is: DISTRIBUTE BY RANGE (col_list) ( PARTITION lower_1 <[=] VALUES <[=] upper_1, PARTITION lower_2 <[=] VALUES <[=] upper_2, .... PARTITION lower_n <[=] VALUES <[=] upper_n, PARTITION VALUE = val_1, .... PARTITION VALUE = val_n ) Multi-column range partitions are specified as follows: DISTRIBUTE BY RANGE (col1, col2,..., coln) ( PARTITION VALUE = (col1_val, col2_val, ..., coln_val), .... PARTITION VALUE = (col1_val, col2_val, ..., coln_val) ) Change-Id: I6799c01a37003f0f4c068d911a13e3f060110a06 Reviewed-on: http://gerrit.cloudera.org:8080/4856 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-11-04 22:02:22 +00:00
Matthew Jacobs	32294220c4	IMPALA-4379: Fix and test Kudu table type checking, follow up The first fix for IMPALA-4379 went in before all comments were addressed. First commit: `9b507b6`. This addresses some follow-up comments about how to handle ALTER TABLE setting the storage_handler table property, which doesn't really make sense to ever allow. Change-Id: I93d04a04483af598b392c28874363e3b0202e1f3 Reviewed-on: http://gerrit.cloudera.org:8080/4894 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-11-04 06:54:18 +00:00
Alex Behm	c5f49ec9bb	IMPALA-4423: Correct but conservative implementation of Subquery.equals(). The underlying problem was for trivial/constant [NOT] EXISTS subqueries we substituted out Subqueries with bool literals using an ExprSubstitutionMap, but the Subquery.equals() function was not implemented properly, so we ended up matching Subqueries to the wrong entry in the ExprSubstitutionMap. This could ultimately lead to wrong plans and results. Testing: Corrected an existing test and modified an existing test for extra coverage. Change-Id: I5562d98ce36507aa5e253323e184fd42b54f27ed Reviewed-on: http://gerrit.cloudera.org:8080/4923 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-11-04 00:19:33 +00:00
Laszlo Gaal	9af59bfe2b	IMPALA-4153: Fix count(*) on all blank('') columns - test This change adds test coverage for the fixes committed for IMPALA-2399 in commit `9ed3b685a1`. It uses the table nulltable in the workload functional-query to verify the materialization and counting of NULL and empty-valued columns. The test can be run on any supported storage and compression combination. Change-Id: I23923f95f43d67977ee1520a1fc09ce297548b3f Reviewed-on: http://gerrit.cloudera.org:8080/4755 Tested-by: Internal Jenkins Reviewed-by: Jim Apple <jbapple@cloudera.com>	2016-11-03 23:08:56 +00:00
Jim Apple	f397d75600	IMPALA-3853: More RAT cleaning. Apache RAT is a tool to audit code repositories for the ASF copyright rules. Our wrapper script around it found a few more things; this patch fixes those things. Change-Id: I01367ea26feaf6a3e2cf4ac04f1c6a63f6e66195 Reviewed-on: http://gerrit.cloudera.org:8080/4904 Reviewed-by: Jim Apple <jbapple@cloudera.com> Tested-by: Internal Jenkins	2016-11-03 22:53:04 +00:00
Taras Bobrovytsky	40bce41765	Fix TPCH and TPCDS Kudu loading templates The templates (used by the stress test) for loading the TPCH and TPCDS data into Kudu had a missing "stored as kudu" statement. Change-Id: Ibe84e1831cc0722bd0381ec76f385ae2a02a6841 Reviewed-on: http://gerrit.cloudera.org:8080/4939 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>	2016-11-03 21:28:43 +00:00
Alex Behm	1d8cdb02c6	IMPALA-4309: Introduce Expr rewrite phase and supporting classes. Introduces a new phase for rewriting Exprs after analysis and before subquery rewriting. The transformed Exprs replace the original ones in analyzed statements. If Exprs were changed, the whole statement is reset() and re-analyzed, similar to how subqueries are rewritten. If both Exprs and subqueries are rewritten there is only one re-analysis of the changed statement. The following new classes work together to perform transformations: 1. ExprRewriteRule - base class for Expr transformation rules 2. ExprRewriter - drives the transformation of Exprs using a list of ExprRewriteRules Statements that have exprs to be rewritten need to implement a new method rewriteExprs() that accepts an ExprRewriter. As an example, this patch adds a rule for converting BetweenPredicates into their equivalent CompoundPredicates. The BetweenPredicate has been notoriously buggy due to a lack of such a separate rewrite phase and is now cleaned up. Testing: 1. Added a new test for checking that the rewrite framework covers all relevant statements, clauses and can properly handle nested statements and subqueries. 2. Added a new test for ExprRewriteRules and implemented tests for the BetweenPredicate rewrite. 3. There are many existing tests for BetweenPredicates and they all exercise the new rewrite rule/phase. 4. Ran a private core/hdfs run and it passed. Change-Id: I2279dc984bcf7742db4fa3b1aa67283ecbb05e6e Reviewed-on: http://gerrit.cloudera.org:8080/4746 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-11-03 20:08:37 +00:00
Alex Behm	6a16e44a2a	Add functional tests for compute stats with mt_dop > 0. Change-Id: Icd4e7e44f9f23f66e59ad1fb298e13da76ad817a Reviewed-on: http://gerrit.cloudera.org:8080/4879 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-11-03 11:59:07 +00:00
Alex Behm	795c085fa3	IMPALA-4336: Cast exprs after unnesting union operands. The bug was that we cast the result exprs of operands before unnesting them. If we unnested operands, casts were missing on those unnested operands' result exprs. The fix is to first unnest operands and then cast result exprs. Also clarifies the use of resultExprs vs. baseTblResultExprs. Change-Id: I5e3ab7349df7d67d0d9c2baf4a56342d3f04e76d Reviewed-on: http://gerrit.cloudera.org:8080/4815 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-11-03 08:59:45 +00:00
Dimitris Tsirogiannis	2990696e08	IMPALA-4374: Use new syntax for creating TPC-DS/H tables in Kudu stress test This commit modifies the DDL statements for creating TPC-DS/H tables in Kudu. The DDL statements now use the new syntax for creating Kudu tables (see IMPALA-3719). Change-Id: I2d501fb9c3cba00b1fb0f7b5941db49cbbda5a53 Reviewed-on: http://gerrit.cloudera.org:8080/4860 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-11-02 23:34:27 +00:00
Alex Behm	4918b20ac0	IMPALA-4408: Omit null bytes for Kudu scans with no nullable slots. Kudu does not allocate null bytes if all projected columns are non-nullable. Otherwise, Kudu allocates a null bit for all columns, even the non-nullable ones. The bug was that Impala's memory layout did not match the first requirement. Change-Id: I762ad9d5cc4198922ea4b5218c504fde355c49a5 Reviewed-on: http://gerrit.cloudera.org:8080/4892 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-11-01 01:47:30 +00:00
Matthew Jacobs	9b507b6ed6	IMPALA-4379: Fix and test Kudu table type checking Creating Kudu tables shouldn't allow types not supported by Kudu (e.g. VARCHAR/CHAR, DECIMAL, TIMESTAMP, collection types). The behavior is inconsistent: for some types it throws in the catalog, for VARCHAR/CHAR these become strings. This changes behavior so that all fail during analysis. Analysis tests were added. Similarly, external tables cannot contain Kudu types that Impala doesn't support (e.g. UNIXTIME_MICROS, BINARY). Tests were added to validate this behavior. Note that this required upgrading the python Kudu client. This also fixes a small corner case with ALTER TABLE: ALTER TABLE shouldn't allow Kudu tables to change the storage descriptor tblproperty, otherwise the table metadata gets in an inconsistent state. Tests were added for all of the above. Change-Id: I475273cbbf4110db8d0f78ddf9a56abfc6221e3e Reviewed-on: http://gerrit.cloudera.org:8080/4857 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Tim Armstrong <tarmstrong@cloudera.com>	2016-10-31 16:03:54 +00:00
Marcel Kornacker	0d857237a8	IMPALA-4314: Standardize on MT-related data structures This removes the data structures that were "superseded" in IMPALA-3903 and changes all control flow to utilize the new data structures. The new data structures are renamed to remove the "Mt" prefix. Change-Id: I465d0e15e2cf17cafe4c747d34c8f595d3645151 Reviewed-on: http://gerrit.cloudera.org:8080/4853 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: Tim Armstrong <tarmstrong@cloudera.com>	2016-10-31 16:03:32 +00:00
Tim Armstrong	6587c08f70	IMPALA-4387: validate decimal type in Avro file schema This patch prevents an invalid decimal type in an Avro file schema from crashing Impala. Most invalid Avro schemas are caught by the frontend, but file schemas still need to be validated by the backend. After this patch files with bad schemas are skipped. Testing: This was hit very rarely by the scanner fuzzing. Added a regression test that scans a file with a bad schema. Change-Id: I25a326ee2220bc14d3b5f887dc288b4adf859cfc Reviewed-on: http://gerrit.cloudera.org:8080/4876 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-10-30 00:12:58 +00:00
Alex Behm	f7d71950e3	IMPALA-4369: Avoid DCHECK in Parquet scanner with MT_DOP > 0. When HdfsParquetScanner::Open() failed we used to hit a DCHECK when trying to access HdfsParquetScanner::batch() which is only valid to call for non-MT scan nodes. Change-Id: Ifbfdde505dbbd2742e7ab79a2415ff317a9bfa2f Reviewed-on: http://gerrit.cloudera.org:8080/4851 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-10-26 22:21:19 +00:00
Lars Volker	c24e9da914	IMPALA-2521: Add clustered hint to insert statements This change introduces a clustered/noclustered hint for insert statements. Specifying this hint adds an additional sort node to the plan, just before the table sink. This has the effect that data will be clustered by its partition prior to writing partitions, which therefore can be written sequentially. Change-Id: I412153bd8435d792bd61dea268d7a3b884048f14 Reviewed-on: http://gerrit.cloudera.org:8080/4745 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-26 04:56:14 +00:00
Tim Armstrong	0dbfe169b7	IMPALA-4277: remove unneeded LegacyTCLIService Change-Id: I827590b19dc542f6256ae2e0d541eaa32a76520b Reviewed-on: http://gerrit.cloudera.org:8080/4844 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: Internal Jenkins	2016-10-26 02:34:01 +00:00
Michael Ho	13455b5a24	IMPALA-3884: Support TYPE_TIMESTAMP for HashTableCtx::CodegenAssignNullValue() This change implements support for TYPE_TIMESTAMP for HashTableCtx::CodegenAssignNullValue(). TimestampValue itself is 16 bytes in size. To match RawValue::Write() in the interpreted path, CodegenAssignNullValue() emits code to assign HashUtil::FNV_SEED to both the upper and lower 64-bit of the destination value. This change also fixes the handling of 128-bit Decimal16Value in CodegenAssignNullValue() so the emitted code matches the behavior of the interpreted path. Change-Id: I0211d38cbef46331e0006fa5ed0680e6e0867bc8 Reviewed-on: http://gerrit.cloudera.org:8080/4794 Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Michael Ho <kwho@cloudera.com>	2016-10-25 05:52:33 +00:00
Matthew Jacobs	99ed6dc67a	IMPALA-4134,IMPALA-3704: Kudu INSERT improvements 1.) IMPALA-4134: Use Kudu AUTO FLUSH Improves performance of writes to Kudu up to 4.2x in bulk data loading tests (load 200 million rows from lineitem). 2.) IMPALA-3704: Improve errors on PK conflicts The Kudu client reports an error for every PK conflict, and all errors were being returned in the error status. As a result, inserts/updates/deletes could return errors with thousands of errors reported. This changes the error handling to log all reported errors as warnings and return only the first error in the query error status. 3.) Improve the DataSink reporting of the insert stats. The per-partition stats returned by the data sink weren't useful for Kudu sinks. Firstly, the number of appended rows was not being displayed in the profile. Secondly, the 'stats' field isn't populated for Kudu tables and thus was confusing in the profile, so it is no longer printed if it is not set in the thrift struct. Testing: Ran local tests, including new tests to verify the query profile insert stats. Manual cluster testing was conducted of the AUTO FLUSH functionality, and that testing informed the default mutation buffer value of 100MB which was found to provide good results. Change-Id: I5542b9a061b01c543a139e8722560b1365f06595 Reviewed-on: http://gerrit.cloudera.org:8080/4728 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-10-25 02:06:10 +00:00
Henry Robinson	e0a3272129	Minor compute stats script fixes * Change run-step to output full log path * Change text to say "Computing table stats" rather than "Computing HBase stats" when running compute-table-stats.sh Change-Id: I326f4c370fda8d5e388af8e2395623185c06bc07 Reviewed-on: http://gerrit.cloudera.org:8080/4825 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-25 00:13:54 +00:00
Alex Behm	ff6b450ad3	IMPALA-4285/IMPALA-4286: Fixes for Parquet scanner with MT_DOP > 0. IMPALA-4285: The problem was that there was a reference to HdfsScanner::batch_ hidden inside WriteEmptyTuples(). The batch_ reference is NULL when the scanner is run with MT_DOP > 1. IMPALA-4286: When there are no scan ranges HdfsScanNodeBase::Open() exits early without initializing the reader context. This led to a DCHECK in IoMgr::GetNextRange() called from HdfsScanNodeMt. The fix is to remove that unnecessary short-circuit Open(). I combined these two bugfixes because the new basic test covers both cases. Testing: Added a new test_mt_dop.py test. A private core/hdfs run passed. Change-Id: I79c0f6fd2aeb4bc6fa5f87219a485194fef2db1b Reviewed-on: http://gerrit.cloudera.org:8080/4767 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-22 10:24:24 +00:00
Michael Ho	51268c053f	IMPALA-4120: Incorrect results with LEAD() analytic function This change fixes a memory management problem with LEAD()/LAG() analytic functions which led to incorrect results. In particular, the update functions specified for these analytic functions only make a shallow copy of StringVal (i.e. copying only the pointer and the length of the string) without copying the string itself. This may lead to problems if the string is created from some UDFs which do local allocations whose buffer may be freed and reused before the result tuple is copied out. This change fixes the problem above by allocating a buffer at the Init() functions of these analytic functions to track the intermediate value. In addition, when the value is copied out in GetValue(), it will be copied into the MemPool belonging to the AnalyticEvalNode and attached to the outgoing row batches. This change also fixes a missing free of local allocations in QueryMaintenance(). Change-Id: I85bb1745232d8dd383a6047c86019c6378ab571f Reviewed-on: http://gerrit.cloudera.org:8080/4740 Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Internal Jenkins	2016-10-22 07:39:37 +00:00
Dimitris Tsirogiannis	8a49ceaae5	IMPALA-3739: Enable stress tests on Kudu This commit modifies the stress test framework to run TPC-H and TPC-DS workloads against Kudu. The following changes are included in this commit: 1. Created template files with DDL and DML statements for loading TPC-H and TPC-DS data in Kudu 2. Created a script (load-tpc-kudu.py) to load data in Kudu. The script is invoked by the stress test runner to load test data in an existing Impala/Kudu cluster (both local and CM-managed clusters are supported). 3. Created SQL files with TPC-DS queries to be executed in Kudu. SQL files with TPC-H queries for Kudu were added in a previous patch. 4. Modified the stress test runner to take additional parameters specific to Kudu (e.g. kudu master addr) The stress test runner for Kudu was tested on EC2 clusters for both TPC-H and TPC-DS workloads. Missing functionality: * No CRUD operations in the existing TPC-H/TPC-DS workloads for Kudu. * Not all supported TPC-DS queries are included. Currently, only the TPC-DS queries from the testdata/workloads/tpcds/queries directory were modified to run against Kudu. Change-Id: I3c9fc3dae24b761f031ee8e014bd611a49029d34 Reviewed-on: http://gerrit.cloudera.org:8080/4327 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-10-21 11:01:37 +00:00
Dimitris Tsirogiannis	041fa6d946	IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables With this commit we simplify the syntax and handling of CREATE TABLE statements for both managed and external Kudu tables. Syntax example: CREATE TABLE foo(a INT, b STRING, PRIMARY KEY (a, b)) DISTRIBUTE BY HASH (a) INTO 3 BUCKETS, RANGE (b) SPLIT ROWS (('abc', 'def')) STORED AS KUDU Changes: 1) Remove the requirement to specify table properties such as key columns in tblproperties. 2) Read table schema (column definitions, primary keys, and distribution schemes) from Kudu instead of the HMS. 3) For external tables, the Kudu table is now required to exist at the time of creation in Impala. 4) Disallow table properties that could conflict with an existing table. Ex: key_columns cannot be specified. 5) Add KUDU as a file format. 6) Add a startup flag to impalad to specify the default Kudu master addresses. The flag is used as the default value for the table property kudu_master_addresses but it can still be overridden using TBLPROPERTIES. 7) Fix a post merge issue (IMPALA-3178) where DROP DATABASE CASCADE wasn't implemented for Kudu tables and silently ignored. The Kudu tables wouldn't be removed in Kudu. 8) Remove DDL delegates. There was only one functional delegate (for Kudu); the existence of the other delegate and the use of delegates in general has led to confusion. The Kudu delegate only exists to provide functionality missing from Hive. 9) Add PRIMARY KEY at the column and table level. This syntax is fairly standard. When used at the column level, only one column can be marked as a key. When used at the table level, multiple columns can be used as a key. Only Kudu tables are allowed to use PRIMARY KEY. The old "kudu.key_columns" table property is no longer accepted though it is still used internally. "PRIMARY" is now a keyword. The ident style declaration is used for "KEY" because it is also used for nested map types. 
10) For managed tables, infer a Kudu table name if none was given. The table property "kudu.table_name" is optional for managed tables and is required for external tables. If for a managed table a Kudu table name is not provided, a table name will be generated based on the HMS database and table name. 11) Use Kudu master as the source of truth for table metadata instead of HMS when a table is loaded or refreshed. Table/column metadata are cached in the catalog and are stored in HMS in order to be able to use table and column statistics. Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1 Reviewed-on: http://gerrit.cloudera.org:8080/4414 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-21 10:52:25 +00:00


1453 Commits