impala

mirror of https://github.com/apache/impala.git synced 2026-02-03 00:00:40 -05:00

Author	SHA1	Message	Date
Attila Bukor	2576952655	IMPALA-5092 Add support for VARCHAR in Kudu tables KUDU-1938 added VARCHAR column type support to Kudu. This commit adds support for Kudu's VARCHAR type to Impala. The length of a Kudu varchar is applied as a character length as opposed to a byte length like Impala currently uses. When writing data to Kudu, the VARCHAR length is not an issue because Impala only officially supports ASCII characters and those characters are the same size in bytes and characters. Additionally, extra bytes would be truncated by the Kudu client if somehow a value was too long. When reading data from Kudu, it is possible that the value written by some other application is wider in bytes than Impala expects and can handle. This can happen due to multi-byte UTF-8 characters. In that case, we adjust the length in Impala to truncate the extra bytes of the value. This isn’t a great solution, but it is one that other integrations have taken as well, given that Impala doesn’t support UTF-8 values. IMPALA-5675 tracks adding UTF-8 character length support to VARCHAR columns, and the truncation code is marked with a TODO that references that Jira. Testing: * Performed manual testing of standard DDL and DML interaction * Manually reproduced a check failure due to multi-byte characters and tested that length truncation resolves that issue. 
* Added/adjusted the following automated tests: AnalyzeDDLTest: CTAS into Kudu with varchar type AnalyzeKuduDDLTest: CREATE TABLE in Kudu with VARCHAR type kudu_create.test: Create table with VARCHAR column, key, hash partition, and range partition kudu_describe.test: Describe table with VARCHAR column and key kudu_insert.test: Insert with VARCHAR columns including null and non-null defaults kudu_update.test: Updates with VARCHAR column kudu_upsert.test: Upserts with VARCHAR column kudu_delete.test: Deletes with VARCHAR columns kudu-scan-node.test: Tests basic predicates with VARCHAR columns Follow on work: - IMPALA-9580: Add min-max runtime filter support/tests - IMPALA-9581: Pushdown string predicates - IMPALA-9583: Automated multibyte truncation tests Change-Id: I0d4959410fdd882bfa980cb55e8a7837c7823da8 Reviewed-on: http://gerrit.cloudera.org:8080/14197 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>	2020-04-01 15:48:36 +00:00
Volodymyr Verovkin	6fdc644fed	IMPALA-8800: Added support of Kudu DATE type to Impala This patch supports reading and writing DATE values to Kudu tables. It does not add min-max filter runtime support, but there is followup JIRA IMPALA-9294. Corresponding Kudu JIRA is KUDU-2632. Change-Id: I91656749a58ac769b54c2a63bdd4f85c89520b32 Reviewed-on: http://gerrit.cloudera.org:8080/14705 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-03-12 14:26:13 +00:00
Tim Armstrong	2ca7f8e7c0	IMPALA-7995: part 1: fixes for e2e dockerised impala tests This fixes all core e2e tests running on my local dockerised minicluster build. I do not yet have a CI job or script running but I wanted to get feedback on these changes sooner. The second part of the change will include the CI script and any follow-on fixes required for the exhaustive tests. The following fixes were required: * Detect docker_network from TEST_START_CLUSTER_ARGS * get_webserver_port() does not depend on the caller passing in the default webserver port. It failed previously because it relied on start-impala-cluster.py setting -webserver_port for all processes. * Add SkipIf markers for tests that don't make sense or are non-trivial to fix for containerised Impala. * Support loading Impala-lzo plugin from host for tests that depend on it. * Fix some tests that had 'localhost' hardcoded - instead it should be $INTERNAL_LISTEN_HOST, which defaults to localhost. * Fix bug with sorting impala daemons by backend port, which is the same for all dockerised impalads. Testing: I ran tests locally as follows after having set up a docker network and starting other services: ./buildall.sh -noclean -notests -ninja ninja -j $IMPALA_BUILD_THREADS docker_images export TEST_START_CLUSTER_ARGS="--docker_network=impala-cluster" export FE_TEST=false export BE_TEST=false export JDBC_TEST=false export CLUSTER_TEST=false ./bin/run-all-tests.sh Change-Id: Iee86cbd2c4631a014af1e8cef8e1cd523a812755 Reviewed-on: http://gerrit.cloudera.org:8080/12639 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-04-13 02:42:32 +00:00
Todd Lipcon	c5918b44b0	IMPALA-8283. Order of Kudu PRIMARY KEYs can be silently ignored When creating a Kudu table, we would use the 'PRIMARY KEY (...)' clause to determine which columns made up the primary key, but the order of those columns would be ignored. Thus a statement like: CREATE TABLE (x int, y int, PRIMARY KEY (y, x)) STORED AS KUDU; would silently create a table with an (x,y) primary key instead of a (y,x) key. This can have substantial performance implications. This fixes the frontend to correctly throw an error in this case. This might be incompatible if someone was previously relying on the bug, but I think it's worth fixing because it was clearly doing the wrong thing. Change-Id: I0499cee7c532db19cddac3906198d965b27ea604 Reviewed-on: http://gerrit.cloudera.org:8080/12694 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Todd Lipcon <todd@apache.org>	2019-03-18 23:54:00 +00:00
Thomas Tauber-Marshall	ba84ad03cb	IMPALA-6954: Fix problems with CTAS into Kudu with an expr rewrite This patch fixes two problems: - Previously a CTAS into a Kudu table where an expr rewrite occurred would create an unpartitioned table, due to the partition info being reset in TableDataLayout and then never reconstructed. Since the Kudu partition info is set by the parser and never changes, the solution is to not reset it. - Previously a CTAS into a Kudu table with a range partition where an expr rewrite occurred would fail with an analysis exception due to a Precondition check in RangePartition.analyze that checked that the RangePartition wasn't already analyzed, as the analysis can't be done twice. Since the state in RangePartition never changes, it doesn't need to be reanalyzed and we can just return instead of failing on the check. Testing: - Added an e2e test that creates a partitioned Kudu table with a CTAS with a rewrite, and checks that the expected partitions are created. Change-Id: I731743bd84cc695119e99342e1b155096147f0ed Reviewed-on: http://gerrit.cloudera.org:8080/10251 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-05-02 02:54:23 +00:00
Zoltan Borok-Nagy	2ee914d5b3	IMPALA-5903: Inconsistent specification of result set and result set metadata Before this commit it was quite random which DDL operations returned a result set and which didn't. With this commit, every DDL operation returns a summary of its execution. Each operation declares its result set schema in Frontend.java and provides the summary in CatalogOpExecutor.java. Updated the tests according to the new behavior. Change-Id: Ic542fb8e49e850052416ac663ee329ee3974e3b9 Reviewed-on: http://gerrit.cloudera.org:8080/9090 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-04-11 02:21:48 +00:00
Thomas Tauber-Marshall	832974383c	IMPALA-6445: Test for kudu master address with whitespace A concern was brought up that Impala might not handle kudu master addresses containing whitespace correctly. Turns out that the Kudu client takes care of stripping whitespace, so it works, but it would be good to have a test to ensure it continues to work. Change-Id: I1857b8dbcb5af66d69f7620368cd3b9b85ae7576 Reviewed-on: http://gerrit.cloudera.org:8080/9876 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-04-02 20:29:51 +00:00
Grant Henke	0c8eba076c	IMPALA-5752: Add support for DECIMAL on Kudu tables Adds support for the Kudu DECIMAL type introduced in Kudu 1.7.0. Note: Adding support for Kudu decimal min/max filters is tracked in IMPALA-6533. Tests: * Added Kudu create with decimal test to AnalyzeDDLTest.java * Added Kudu table_format to test_decimal_queries.py ** Both decimal.test and decimal-exprs.test workloads * Added decimal queries to the following Kudu workloads: kudu_create.test kudu_delete.test kudu_insert.test kudu_update.test ** kudu_upsert.test Change-Id: I3a9fe5acadc53ec198585d765a8cfb0abe56e199 Reviewed-on: http://gerrit.cloudera.org:8080/9368 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-23 00:03:54 +00:00
Thomas Tauber-Marshall	b881fba763	IMPALA-5546: Allow creating unpartitioned Kudu tables This patch makes it possible to create unpartitioned, managed Kudu tables from Impala, by making the 'PARTITION BY' clause of 'CREATE TABLE... STORED AS KUDU' optional: CREATE TABLE [IF NOT EXISTS] [db_name.]table_name (col_name data_type [kudu_column_attribute ...] [COMMENT 'col_comment'] [, ...] [PRIMARY KEY (col_name[, ...])] ) [PARTITION BY kudu_partition_clause] [COMMENT 'table_comment'] STORED AS KUDU [TBLPROPERTIES ('key1'='value1', 'key2'='value2', ...)] Kudu represents this as a table that is range partitioned on no columns. Because unpartitioned Kudu tables are inefficient for large data sizes, and because the syntax doesn't make it explicit that the table will be unpartitioned, there is a warning issued to encourage users to create partitioned tables. This patch also converts the tpch_kudu.nation and tpch_kudu.region tables to be unpartitioned, as they are very small. Testing: - Updated analysis tests. - Added e2e test that creates unpartitioned table and inserts into it. Change-Id: I281f173dbec1484eb13434d53ea581a0f245358a Reviewed-on: http://gerrit.cloudera.org:8080/7446 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Tested-by: Impala Public Jenkins	2017-08-07 19:53:59 +00:00
Matthew Jacobs	2dcbefc652	IMPALA-5338: Fix Kudu timestamp column default values While support for TIMESTAMP columns in Kudu tables has been committed (IMPALA-5137), it does not support TIMESTAMP column default values. This supports CREATE TABLE syntax to specify the default values, but more importantly this fixes the loading of Kudu tables that may have had default values set on UNIXTIME_MICROS columns, e.g. if the table was created via the python client. This involves fixing KuduColumn to hide the LiteralExpr representing the default value because it will be a BIGINT if the column type is TIMESTAMP. It is only needed to call toSql() and toStringValue(), so helper functions are added to KuduColumn to encapsulate special logic for TIMESTAMP. TODO: Add support and tests for ALTER setting the default value (when IMPALA-4622 is committed). Change-Id: I655910fb4805bb204a999627fa9f68e43ea8aaf2 Reviewed-on: http://gerrit.cloudera.org:8080/6936 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Impala Public Jenkins	2017-06-02 01:47:48 +00:00
Matthew Jacobs	6226e59702	IMPALA-5137: Support TIMESTAMPs in Kudu range predicate DDL Adds support in DDL for timestamps in Kudu range partition syntax. For convenience, strings can be specified with or without explicit casts to TIMESTAMP. E.g. create table ts_ranges (ts timestamp primary key, i int) partition by range ( partition '2009-01-02 00:00:00' <= VALUES < '2009-01-03 00:00:00' ) stored as kudu Range bounds are converted to Kudu UNIXTIME_MICROS during analysis. Testing: Adds FE and EE tests. Change-Id: Iae409b6106c073b038940f0413ed9d5859daaeff Reviewed-on: http://gerrit.cloudera.org:8080/6849 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Impala Public Jenkins	2017-05-19 00:41:46 +00:00
Matthew Jacobs	878fcf5a74	IMPALA-5111: Fix check when creating NOT NULL PK col in Kudu The fix for IMPALA-4616 broke the ability to create a PK key col in a Kudu table as explicitly 'NOT NULL'. While this is the default, it should be possible to specify. The precondition that was failing was fixed, and some tests were added/modified. Change-Id: I557eea7cd994d6a2ed38893d283d08107e78f789 Reviewed-on: http://gerrit.cloudera.org:8080/6465 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Impala Public Jenkins	2017-03-24 21:22:50 +00:00
Dimitris Tsirogiannis	5ea1798661	IMPALA-4619: Allow NULL as default value in Kudu tables This commit fixes an issue where an error is thrown if the default value for a Kudu column is set to NULL. Change-Id: Ida27ce56f1dd7603485a69c680db3bcea6702aff Reviewed-on: http://gerrit.cloudera.org:8080/5405 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-12-08 04:53:38 +00:00
Dan Burkert	f83652c1da	Replace INTO N BUCKETS with PARTITIONS N in CREATE TABLE This commit also removes the `DISTRIBUTE`, `SPLIT`, and `BUCKETS` keywords, which were going to be newly released in Impala 2.6 but are now unused. Additionally, a few remaining uses of the `DISTRIBUTE BY` syntax have been switched to `PARTITION BY`. Change-Id: I32fdd5ef26c532f7a30220db52bdfbf228165922 Reviewed-on: http://gerrit.cloudera.org:8080/5382 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-12-07 07:31:16 +00:00
Matthew Jacobs	5188f879a7	IMPALA-4477: Bump Kudu version to latest master (60aa54e) Bumps the toolchain version to get a newer Kudu build. Also fixes test failures resulting from changes in Kudu. Notably error strings have changed (IMPALA-4590) and the number of replicas must be odd (IMPALA-4589). Note: The toolchain binaries starting with this build are now using the toolchain binutils rather than the system binutils. Testing: private exhaustive build. Change-Id: If1912f058c240fbe82b06f77e31add7755289be1 Reviewed-on: http://gerrit.cloudera.org:8080/5369 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-12-07 05:11:13 +00:00
Dimitris Tsirogiannis	cba93f1ac3	IMPALA-4561: Replace DISTRIBUTE BY with PARTITION BY in CREATE TABLE Change-Id: I0e07c41eabb4c8cb95754cf04293cbd9e03d6ab2 Reviewed-on: http://gerrit.cloudera.org:8080/5317 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-12-06 10:41:53 +00:00
Thomas Tauber-Marshall	3833707dbd	IMPALA-4466: Improve Kudu CRUD test coverage The results in the test files were verified by hand. This patch also introduces a new test section 'DML_RESULTS', which takes the name of a table as a comment and the contents of the table as its body and then verifies that the body matches the actual contents of the table. This makes it easy to check that a DML operation has the desired effect on the contents of a table, rather than always having to add another test case that runs a select on the table. For now, this section cannot be used in a test along with the RESULTS or ERRORS sections. TODO: Refactor the DML test case handling (IMPALA-4471) Change-Id: Ib9e7afbef60186edb00a9d11fbe5a8c64931add6 Reviewed-on: http://gerrit.cloudera.org:8080/4953 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-11-17 02:54:30 +00:00
Dimitris Tsirogiannis	d802f321b2	IMPALA-3724: Support Kudu non-covering range partitions This commit adds support for non-covering range partitions in Kudu tables. The SPLIT ROWS clause is now deprecated and no longer supported. The following new syntax provides more flexibility in creating range partitions and it supports bounded and unbounded ranges as well as single value partitions; multi-column range partitions are supported as well. The new syntax is: DISTRIBUTE BY RANGE (col_list) ( PARTITION lower_1 <[=] VALUES <[=] upper_1, PARTITION lower_2 <[=] VALUES <[=] upper_2, .... PARTITION lower_n <[=] VALUES <[=] upper_n, PARTITION VALUE = val_1, .... PARTITION VALUE = val_n ) Multi-column range partitions are specified as follows: DISTRIBUTE BY RANGE (col1, col2,..., coln) ( PARTITION VALUE = (col1_val, col2_val, ..., coln_val), .... PARTITION VALUE = (col1_val, col2_val, ..., coln_val) ) Change-Id: I6799c01a37003f0f4c068d911a13e3f060110a06 Reviewed-on: http://gerrit.cloudera.org:8080/4856 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-11-04 22:02:22 +00:00
Matthew Jacobs	9b507b6ed6	IMPALA-4379: Fix and test Kudu table type checking Creating Kudu tables shouldn't allow types not supported by Kudu (e.g. VARCHAR/CHAR, DECIMAL, TIMESTAMP, collection types). The behavior is inconsistent: for some types it throws in the catalog, for VARCHAR/CHAR these become strings. This changes behavior so that all fail during analysis. Analysis tests were added. Similarly, external tables cannot contain Kudu types that Impala doesn't support (e.g. UNIXTIME_MICROS, BINARY). Tests were added to validate this behavior. Note that this required upgrading the python Kudu client. This also fixes a small corner case with ALTER TABLE: ALTER TABLE shouldn't allow Kudu tables to change the storage descriptor tblproperty, otherwise the table metadata gets in an inconsistent state. Tests were added for all of the above. Change-Id: I475273cbbf4110db8d0f78ddf9a56abfc6221e3e Reviewed-on: http://gerrit.cloudera.org:8080/4857 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Tim Armstrong <tarmstrong@cloudera.com>	2016-10-31 16:03:54 +00:00
Dimitris Tsirogiannis	041fa6d946	IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables With this commit we simplify the syntax and handling of CREATE TABLE statements for both managed and external Kudu tables. Syntax example: CREATE TABLE foo(a INT, b STRING, PRIMARY KEY (a, b)) DISTRIBUTE BY HASH (a) INTO 3 BUCKETS, RANGE (b) SPLIT ROWS (('abc', 'def')) STORED AS KUDU Changes: 1) Remove the requirement to specify table properties such as key columns in tblproperties. 2) Read table schema (column definitions, primary keys, and distribution schemes) from Kudu instead of the HMS. 3) For external tables, the Kudu table is now required to exist at the time of creation in Impala. 4) Disallow table properties that could conflict with an existing table. Ex: key_columns cannot be specified. 5) Add KUDU as a file format. 6) Add a startup flag to impalad to specify the default Kudu master addresses. The flag is used as the default value for the table property kudu_master_addresses but it can still be overridden using TBLPROPERTIES. 7) Fix a post merge issue (IMPALA-3178) where DROP DATABASE CASCADE wasn't implemented for Kudu tables and was silently ignored. The Kudu tables wouldn't be removed in Kudu. 8) Remove DDL delegates. There was only one functional delegate (for Kudu); the existence of the other delegate and the use of delegates in general has led to confusion. The Kudu delegate only exists to provide functionality missing from Hive. 9) Add PRIMARY KEY at the column and table level. This syntax is fairly standard. When used at the column level, only one column can be marked as a key. When used at the table level, multiple columns can be used as a key. Only Kudu tables are allowed to use PRIMARY KEY. The old "kudu.key_columns" table property is no longer accepted though it is still used internally. "PRIMARY" is now a keyword. The ident style declaration is used for "KEY" because it is also used for nested map types. 
10) For managed tables, infer a Kudu table name if none was given. The table property "kudu.table_name" is optional for managed tables and is required for external tables. If for a managed table a Kudu table name is not provided, a table name will be generated based on the HMS database and table name. 11) Use Kudu master as the source of truth for table metadata instead of HMS when a table is loaded or refreshed. Table/column metadata are cached in the catalog and are stored in HMS in order to be able to use table and column statistics. Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1 Reviewed-on: http://gerrit.cloudera.org:8080/4414 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-21 10:52:25 +00:00
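
The VARCHAR length mismatch described in the IMPALA-5092 commit above (Kudu counts characters, Impala counts bytes) can be illustrated with a minimal Python sketch. The helper names `char_length_ok` and `truncate_to_bytes` are hypothetical and are not Impala or Kudu APIs. The plain byte cut is an assumption about what "truncate the extra bytes" means, not a description of the actual implementation.

```python
# Hypothetical helpers for illustration only; not Impala or Kudu APIs.

def char_length_ok(text: str, limit: int) -> bool:
    """Character-length check, the way a VARCHAR(limit) measured in characters counts."""
    return len(text) <= limit


def truncate_to_bytes(text: str, limit: int) -> bytes:
    """Byte-length limit: keep at most `limit` bytes of the UTF-8 encoding.

    Assumption: a plain byte cut, which may split a multi-byte character.
    """
    return text.encode("utf-8")[:limit]


value = "h\u00e9llo"                   # 5 characters, 6 bytes in UTF-8
print(char_length_ok(value, 5))        # whether it fits by character count
print(len(value.encode("utf-8")))      # byte width of the same value
print(truncate_to_bytes(value, 5))     # bytes kept under a 5-byte limit
```

A value that is valid for a character-length VARCHAR(5) can therefore still be wider than 5 bytes, which is the case the commit handles by truncating on read.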

20 Commits
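
The IMPALA-8283 commit above says the frontend now throws an error when the order in a `PRIMARY KEY (...)` clause would otherwise be silently ignored. The commit message does not spell out the exact rule, so the following Python sketch assumes one plausible check: the key columns must be listed in the same order in which they are declared. The function name `check_primary_key_order` is hypothetical.

```python
# Hypothetical validation sketch; not the actual Impala frontend code.

def check_primary_key_order(columns, primary_key):
    """Raise ValueError unless primary_key lists its columns in declaration order."""
    unknown = [c for c in primary_key if c not in columns]
    if unknown:
        raise ValueError(f"unknown primary key columns: {unknown}")
    key_set = set(primary_key)
    # Key columns in the order they appear in the column definitions.
    declared_order = [c for c in columns if c in key_set]
    if declared_order != list(primary_key):
        raise ValueError(
            f"primary key order {list(primary_key)} does not match "
            f"declaration order {declared_order}"
        )


check_primary_key_order(["x", "y"], ["x", "y"])      # accepted
try:
    check_primary_key_order(["x", "y"], ["y", "x"])  # the case from the commit
except ValueError as err:
    print("rejected:", err)
```

Failing loudly here matches the commit's reasoning: an explicit error is preferable to creating a table whose key order differs from what the statement asked for.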