19 Commits

Author SHA1 Message Date
Riza Suminto
4236c307b9 IMPALA-10465: Use IGNORE variant of Kudu write operations
KUDU-1563 added support for INSERT_IGNORE, UPDATE_IGNORE, and
DELETE_IGNORE to handle cases where users want to ignore primary key
errors efficiently. Impala already does this today for its INSERT
behavior. However, it does so by ignoring the per-row errors from Kudu
client side. This requires a large error buffer (which may need to be
expanded in rare cases) to log all of the warning messages which users
often do not care about and causes significant RPC overhead.

This patch changes the Kudu write operations issued by Impala to use
INSERT_IGNORE, UPDATE_IGNORE, and DELETE_IGNORE if the Kudu cluster
supports them and the backend flag "kudu_ignore_conflicts" is true.
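
A minimal sketch of the selection described above. The helper name
choose_write_op, the cluster_supports_ignore parameter, and the string
operation names are hypothetical and only illustrate the rule; this is
not Impala's actual code.

```python
# Hypothetical mapping from a plain write operation to its IGNORE variant.
IGNORE_VARIANTS = {
    "INSERT": "INSERT_IGNORE",
    "UPDATE": "UPDATE_IGNORE",
    "DELETE": "DELETE_IGNORE",
}


def choose_write_op(op, cluster_supports_ignore, kudu_ignore_conflicts):
    """Return the IGNORE variant only when both conditions hold."""
    if cluster_supports_ignore and kudu_ignore_conflicts:
        # Operations without an IGNORE variant pass through unchanged.
        return IGNORE_VARIANTS.get(op, op)
    return op
```

With either condition false, the sketch falls back to the plain
operation, which matches the "if ... and ..." wording above.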

We benchmarked the change by running insert and update queries on a
modified tpch.lineitem table in which conflicts were introduced for
around half of the total rows being modified. The table below shows the
performance difference after the patch:

+----------------------+--------+-------------+------------+------------+----------------+
| Query                | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) |
+----------------------+--------+-------------+------------+------------+----------------+
| KUDU-IGNORE-3-UPDATE | 30.06  | 30.52       |   -1.53%   |   0.18%    |   0.58%        |
| KUDU-IGNORE-2-INSERT | 48.91  | 71.09       | I -31.20%  |   0.60%    |   0.72%        |
+----------------------+--------+-------------+------------+------------+----------------+
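
As a worked check of the Delta(Avg) column, the sketch below assumes
the column is the relative change of Avg(s) against Base Avg(s); that
formula is an assumption, since the report does not state it.

```python
def delta_avg_percent(avg, base_avg):
    """Relative change of avg against base_avg, in percent."""
    return (avg - base_avg) / base_avg * 100.0


# Values copied from the table above.
insert_delta = delta_avg_percent(48.91, 71.09)
update_delta = delta_avg_percent(30.06, 30.52)
print(f"KUDU-IGNORE-2-INSERT: {insert_delta:.2f}%")
print(f"KUDU-IGNORE-3-UPDATE: {update_delta:.2f}%")
```

Under that assumption the INSERT row reproduces the reported -31.20%.
The UPDATE row comes out near -1.51% rather than -1.53%, which would be
consistent with the report having used unrounded averages.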

Testing:
- Pass core tests.

Change-Id: I8da7c41d61b0888378b390b8b643238433eb3b52
Reviewed-on: http://gerrit.cloudera.org:8080/18536
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-05-26 08:47:24 +00:00
wzhou-code
fcaea30b15 IMPALA-10557: Support Kudu's multi-row transaction
Kudu added multi-row transactions so that Impala can run a query that
inserts multiple rows into a Kudu table in the context of a single
transaction. Kudu provides new Java/C++ client APIs to open/commit/roll
back a transaction, create a session with a transaction, and
serialize/deserialize the metadata of a transaction object. The Kudu
transaction object has a built-in heartbeater.

This patch adds Impala support for Kudu's multi-row transactions.
 - Added a new query option to enable Kudu transactions.
 - When the query option is set, the frontend of Impala's coordinator
   starts a new Kudu transaction for "insert", "CTAS" and
   "UPDATE/UPSERT/DELETE" statements.
 - The Kudu transaction objects are kept in KuduTransactionManager until
   the transactions are aborted or committed.
 - The frontend serializes the transaction metadata into a transaction
   token and passes it to the executors.
 - Executors deserialize the transaction token and ingest via that
   transaction handle. For a Kudu session in the context of a
   transaction, the first error is returned if there are any pending
   errors for the Kudu session, so that the Kudu transaction will be
   aborted.
   Since Kudu does not currently support transactions for
   "UPDATE/UPSERT/DELETE" statements, Kudu returns an error, which
   causes the transaction to be aborted.
 - The coordinator commits the transaction if everything goes well.
   Otherwise, it aborts the transaction.
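
The flow in the list above can be sketched with in-memory stand-ins.
Every class and function below (FakeKuduTransaction,
FakeTransactionManager, executor_write, run_query) is hypothetical and
does not use the Kudu client API; the token format is invented for the
illustration.

```python
class FakeKuduTransaction:
    """In-memory stand-in for a Kudu transaction handle (hypothetical)."""

    def __init__(self, txn_id):
        self.txn_id = txn_id
        self.state = "OPEN"
        self.rows = []

    def serialize(self):
        # The real token is opaque; a string is enough for this sketch.
        return f"txn-token:{self.txn_id}"


class FakeTransactionManager:
    """Keeps open transactions until they are committed or aborted."""

    def __init__(self):
        self._open = {}

    def begin(self, txn_id):
        txn = FakeKuduTransaction(txn_id)
        self._open[txn.serialize()] = txn
        return txn

    def lookup(self, token):
        # Stands in for an executor deserializing the token.
        return self._open[token]

    def finish(self, token, ok):
        txn = self._open.pop(token)
        txn.state = "COMMITTED" if ok else "ABORTED"
        return txn


def executor_write(manager, token, rows):
    """Write rows through the transaction; return the first error, if any."""
    txn = manager.lookup(token)
    for row in rows:
        if row is None:  # simulate a per-row error
            return "error: invalid row"
        txn.rows.append(row)
    return None


def run_query(manager, txn_id, rows_per_executor):
    """Coordinator side: begin, fan out the token, then commit or abort."""
    token = manager.begin(txn_id).serialize()
    errors = [executor_write(manager, token, rows) for rows in rows_per_executor]
    first_error = next((e for e in errors if e is not None), None)
    return manager.finish(token, ok=first_error is None)
```

In this sketch a single failing executor is enough to abort the whole
transaction, mirroring the "return the first error" rule.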

Also changes the code to store KuduClient as a shared pointer, since
KuduClient has to be passed as a shared pointer when
KuduTransaction::Deserialize() is called.

Testing:
 - Added new end-to-end tests for Kudu transactions.
 - Passed core tests.

Change-Id: I876ada48991afdff5d61b5d6a0417571aba7cb34
Reviewed-on: http://gerrit.cloudera.org:8080/17553
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-06-24 20:34:45 +00:00
Fang-Yu Rao
34668fab87 IMPALA-10092: Do not skip test vectors of Kudu tests in a custom cluster
We found that the following 4 tests do not run even if we remove all
the decorators that skip them, such as "@SkipIfKudu.no_hybrid_clock" or
"@SkipIfHive3.kudu_hms_notifications_not_supported".
This is because the 3 classes containing these tests inherit from
CustomClusterTestSuite, which adds a constraint that only allows test
vectors with 'file_format' and 'compression_codec' being "text" and
"none", respectively, to be run.

1. TestKuduOperations::test_local_tz_conversion_ops
2. TestKuduClientTimeout::test_impalad_timeout
3. TestKuduHMSIntegration::test_create_managed_kudu_tables
4. TestKuduHMSIntegration::test_kudu_alter_table

To address this issue, this patch creates a parent class for the 3
classes above and overrides add_custom_cluster_constraints() in the
newly created parent class so that test vectors with 'file_format' and
'compression_codec' being "kudu" and "none", respectively, are not
skipped.
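
A simplified sketch of the override. The two classes below are
stand-ins, and the method signature (taking and returning a list of
vectors) is an assumption made to keep the example self-contained; the
real test framework differs.

```python
class CustomClusterSuiteSketch:
    """Stand-in base suite that only keeps text/none test vectors."""

    @classmethod
    def add_custom_cluster_constraints(cls, vectors):
        return [v for v in vectors
                if v["file_format"] == "text"
                and v["compression_codec"] == "none"]


class KuduCustomClusterSuiteSketch(CustomClusterSuiteSketch):
    """Stand-in parent class for the Kudu suites: keeps kudu/none vectors."""

    @classmethod
    def add_custom_cluster_constraints(cls, vectors):
        return [v for v in vectors
                if v["file_format"] == "kudu"
                and v["compression_codec"] == "none"]


vectors = [
    {"file_format": "text", "compression_codec": "none"},
    {"file_format": "kudu", "compression_codec": "none"},
]
```

Calling the base-class method on these vectors drops the Kudu vector,
which is the behavior that kept the 4 tests from running.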

This patch also removes a redundant call to
super(CustomClusterTestSuite, cls).add_test_dimensions() in
CustomClusterTestSuite.add_custom_cluster_constraints(), since
super(CustomClusterTestSuite, cls).add_test_dimensions() had
already been called immediately before the call to
add_custom_cluster_constraints() in
CustomClusterTestSuite.add_test_dimensions().

Testing:
 - Manually verified that after removing the decorators to skip those
   tests, those tests could be run.

Change-Id: I60a4bd4ac5a9026629fb840ab9cc7b5f9948290c
Reviewed-on: http://gerrit.cloudera.org:8080/16348
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-08-28 01:37:16 +00:00
Hao Hao
d5d3ace6c2 IMPALA-8856: Deflake TestKuduHMSIntegration
Tests for Kudu's integration with the Hive Metastore can be flaky.
Since Kudu depends on the HMS, create/drop table requests can time out
when creation/deletion in the HMS takes more than 10s, and the following
retry will fail because the table has already been created/deleted by
the first request.

This patch increases the timeout of individual Kudu client RPCs to
avoid flakiness caused by such cases.
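
The failure mode can be reproduced with a small simulation. FakeCatalog
and create_with_retry are hypothetical, and the 15s latency and 60s
timeout are illustrative values, not the ones used by the patch.

```python
class FakeCatalog:
    """Stand-in service whose create call is slow but still succeeds."""

    def __init__(self, create_latency_s):
        self.create_latency_s = create_latency_s
        self.tables = set()

    def create_table(self, name, timeout_s):
        if name in self.tables:
            return "ALREADY_PRESENT"
        # The server finishes the work even if the client stops waiting.
        self.tables.add(name)
        if self.create_latency_s > timeout_s:
            return "TIMED_OUT"
        return "OK"


def create_with_retry(catalog, name, timeout_s):
    """Issue the request and retry once after a client-side timeout."""
    status = catalog.create_table(name, timeout_s)
    if status == "TIMED_OUT":
        status = catalog.create_table(name, timeout_s)
    return status
```

With a 10s timeout and a 15s create, the retry sees a table that the
first request already created; with a longer timeout the first request
simply succeeds.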

Change-Id: Ib98f34bb831b9255e35b5ef234abe6ceaf261bfd
Reviewed-on: http://gerrit.cloudera.org:8080/14067
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-08-15 00:43:23 +00:00
Csaba Ringhofer
a0c00e508f Bump CDP_BUILD_NUMBER to 1318335
The main reason for bumping is to include HIVE-21838.
Also skips / fixes some tests.

Change-Id: I432e8c02dbd349a3507bfabfef2727914537652c
Reviewed-on: http://gerrit.cloudera.org:8080/14005
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-08-08 15:29:13 +00:00
Csaba Ringhofer
245f2375e2 IMPALA-8751: Skip failing Kudu - HMS integration tests with Hive 3
These tests broke with the newer versions of CDP Hive, probably
because it started to send gzipped notifications, which led to
errors like this in Kudu:
Not implemented: unknown message format: gzip(json-2.0)

The Kudu side fix does not seem trivial, so I am skipping these
tests for now. Generally Kudu is not tested well with Hive 3 yet,
so we cannot assume HMS notifications to work.

Change-Id: Ic19b8d93eaa6e8ec886c5704578563fb0871f941
Reviewed-on: http://gerrit.cloudera.org:8080/13854
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-07-18 21:20:03 +00:00
Grant Henke
ff4e0a3a45 IMPALA-8629: (part 2) Adjust new KuduStorageHandler package
This patch changes the new KuduStorageHandler
package from "org.apache.kudu.hive" to
"org.apache.hadoop.hive.kudu".

This is done to ensure the stand-in storage handler
can be a real storage handler when a Hive integration
is added in the future. The "org.apache.hadoop.hive"
package is the standard package all Hive storage
handlers live under.

Additionally this patch updates the stand-in InputFormat,
OutputFormat, and SerDe entries for Kudu. This allows a
future Hive integration to read HMS tables/entries created
by Impala.

This patch also includes PlannerTest fixes to handle the
ScanToken stringification changes in the newer version
of Kudu.

Change-Id: I4d0c505643247498383472704a37d27c9e1ce473
Reviewed-on: http://gerrit.cloudera.org:8080/13541
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-06-13 01:53:37 +00:00
Grant Henke
d753600f9c IMPALA-8629: (part 1) Add temp KuduStorageHandler
This patch adds a temporary KuduStorageHandler
so that the Kudu project can change its handler without
breaking the integration. It also disables any tests that
depend on the specific handler.

A follow-up patch will remove the
TEMP_KUDU_STORAGE_HANDLER, adjust the
KUDU_STORAGE_HANDLER value to be the final value,
and re-enable the tests.

Change-Id: Ic9982466699818390fa28efc5ea1aae75b11c12a
Reviewed-on: http://gerrit.cloudera.org:8080/13561
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Thomas Marshall <tmarshall@cloudera.com>
2019-06-10 16:12:41 +00:00
Hao Hao
b7077ffd52 IMPALA-8506: Support RENAME TABLE statement with Kudu/HMS integration
This commit is intended to support the actual handling of ALTER/RENAME
TABLE DDL for managed Kudu tables with Kudu's integration with the Hive
Metastore. However, Kudu is currently considered the source of truth
for the table schema, so for ALTER TABLE (ADD/DROP
COLUMN/RANGE_PARTITION), Impala always alters/loads the Kudu tables
directly. Thus, this commit only updates the RENAME TABLE DDL: after
the table is renamed in Kudu, Impala relies on Kudu to rename the table
in the HMS.

Change-Id: If7155e0b385b8ad81eda0a84277bc85171a88269
Reviewed-on: http://gerrit.cloudera.org:8080/13409
Reviewed-by: Grant Henke <granthenke@apache.org>
Reviewed-by: Alexey Serbin <aserbin@cloudera.com>
Reviewed-by: Thomas Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-06-06 22:16:27 +00:00
Hao Hao
0e1d2e1484 IMPALA-8507: Support DROP TABLE statement with Kudu/HMS integration
This commit supports the actual handling of DROP TABLE DDL for managed
Kudu tables with Kudu's integration with the Hive Metastore. When
Kudu/HMS integration is enabled, after the table is dropped in Kudu,
Impala relies on Kudu to drop the table in the HMS.

Change-Id: I6d3b93957cc66009ad7a67fc513be2068f156abc
Reviewed-on: http://gerrit.cloudera.org:8080/13400
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Grant Henke <granthenke@apache.org>
Reviewed-by: Thomas Marshall <tmarshall@cloudera.com>
2019-06-05 17:17:52 +00:00
Hao Hao
6bb404dc35 IMPALA-8504 (part 2): Support CREATE TABLE statement with Kudu/HMS integration
This commit supports the actual handling of CREATE TABLE DDL for managed
Kudu tables when integration with the Hive Metastore is enabled. When
Kudu/HMS integration is enabled, for a CREATE TABLE statement, Impala
can rely on Kudu to create the table in the HMS.

Change-Id: Icffe412395f47f5e07d97bad457020770cfa7502
Reviewed-on: http://gerrit.cloudera.org:8080/13375
Reviewed-by: Thomas Marshall <tmarshall@cloudera.com>
Reviewed-by: Grant Henke <granthenke@apache.org>
Tested-by: Thomas Marshall <tmarshall@cloudera.com>
2019-06-04 17:36:59 +00:00
Thomas Tauber-Marshall
a1407adf61 IMPALA-7790: Skip some Kudu tests if use_hybrid_clock=false
Since IMPALA-6812, we've run many of our tests against Kudu at the
READ_AT_SNAPSHOT scan level, which ensures consistent results. This
scan level is only supported if Kudu is run with the flag
--use_hybrid_clock=true (which is the default).

This patch uses the Kudu master webui to detect when use_hybrid_clock
is false and skips these tests.
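
A minimal sketch of the detection step, assuming the web UI exposes a
page that lists flags as "--name=value" lines. The page format and both
function names are assumptions; fetching the page over HTTP is omitted.

```python
def parse_flag(flags_text, flag_name):
    """Return the value of --flag_name=value from a flags dump, or None."""
    prefix = f"--{flag_name}="
    for line in flags_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return line[len(prefix):]
    return None


def should_skip_snapshot_tests(flags_text):
    """Skip only when the flag is explicitly reported as false."""
    value = parse_flag(flags_text, "use_hybrid_clock")
    return value is not None and value.lower() == "false"
```

A skip decorator could then call should_skip_snapshot_tests() once at
import time and pass the result to the test framework's skip marker.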

Follow up work will address allowing these tests to run regardless of
the value of the flag.

Testing:
- Ran a full exhaustive build with use_hybrid_clock=false set in the
  minicluster.

Change-Id: I4c9ed4a4ea0720760d65c98acfc394247ab2f1a2
Reviewed-on: http://gerrit.cloudera.org:8080/11851
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-11-28 02:48:09 +00:00
Thomas Tauber-Marshall
13b82624b5 IMPALA-6812: Fix flaky Kudu scan tests
Many of our Kudu related tests have been flaky with the symptom that
scans appear to not return rows that were just inserted. This occurs
because our default Kudu scan level of READ_LATEST doesn't make any
consistency guarantees.

This patch adds a query option 'kudu_read_mode', which overrides the
startup flag of the same name, and then sets that option to
READ_AT_SNAPSHOT for all tests with Kudu inserts and scans, which
should give us more consistent test results.
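
The override order can be sketched as follows. The function name and
the dict-based query options are hypothetical; only the option name and
the two read modes come from the text above.

```python
def effective_kudu_read_mode(query_options, startup_flag_value):
    """Prefer the per-query option; fall back to the startup flag."""
    value = query_options.get("kudu_read_mode")
    return value if value else startup_flag_value
```

A test that sets kudu_read_mode to READ_AT_SNAPSHOT therefore scans at
snapshot level even if the daemon was started with READ_LATEST.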

Testing:
- Passed a full exhaustive run. Does not appear to increase time to
  run by any significant amount.

Change-Id: I70df84f2cbc663107f2ad029565d3c15bdfbd47c
Reviewed-on: http://gerrit.cloudera.org:8080/10503
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-18 20:19:53 +00:00
Thomas Tauber-Marshall
abd9b0e70a IMPALA-4591: Bound Kudu client error mem usage
Previously, Kudu client errors could grow in size unbounded,
potentially causing the process to be killed. This patch sets a
bound on the mem that can be used for these error messages, with
the size determined by the flag 'kudu_error_buffer_size'.

If the errors for a Kudu client exceed this size, the query will fail,
as some errors will be dropped and we won't be able to tell if all of
the errors can be safely ignored.
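
A sketch of the bounding logic, using hypothetical names
(BoundedErrorBuffer, query_status) rather than the Kudu client's own
error collector. It shows why an overflow has to fail the query: once
an error is dropped, nobody can check whether it was ignorable.

```python
class BoundedErrorBuffer:
    """Collects error messages up to max_bytes and records any overflow."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self.errors = []
        self.overflowed = False

    def add(self, message):
        size = len(message.encode("utf-8"))
        if self.used_bytes + size > self.max_bytes:
            # The message is dropped; remember that something was lost.
            self.overflowed = True
            return
        self.errors.append(message)
        self.used_bytes += size


def query_status(buffer, ignorable):
    """Fail on overflow, or when any retained error is not ignorable."""
    if buffer.overflowed:
        return "FAILED"
    if all(ignorable(e) for e in buffer.errors):
        return "OK"
    return "FAILED"
```

In this sketch max_bytes plays the role of the
'kudu_error_buffer_size' flag.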

Testing:
- Added a custom cluster test that verifies that a query that exceeds
  the limit fails.

Change-Id: I186ddb3f3b5865e08f17dba57cf6640591d06b14
Reviewed-on: http://gerrit.cloudera.org:8080/8464
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-27 22:28:37 +00:00
Thomas Tauber-Marshall
5d92264c48 IMPALA-5951: Remove flaky test_catalogd_timeout
test_catalogd_timeout sets a Kudu operation timeout of 1ms and then
performs various Kudu operations which it expects to fail due to a
timeout.

Since the test was written, things have sped up - for example, Impala
used to create a new Kudu client for each operation, but that was
changed in IMPALA-5167, such that the operations now occasionally
complete quickly enough that they don't time out.

There's not really any way to rewrite this test to ensure that it
won't be flaky, so the patch removes it.

Change-Id: I29fd67d0acc0ee15943c416f2179ad716d2cac05
Reviewed-on: http://gerrit.cloudera.org:8080/8154
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Impala Public Jenkins
2017-09-30 00:50:43 +00:00
Matthew Jacobs
7a1ff1e5e9 IMPALA-5539: Fix Kudu timestamp with -use_local_tz_for_unix_ts
The -use_local_tz_for_unix_timestamp_conversion flag exists
to specify if TIMESTAMPs should be interpreted as localtime
or UTC when converting to/from Unix time via builtins:
  from_unixtime(bigint unixtime)
  unix_timestamp(string datetime[, ...])
  unix_timestamp(timestamp datetime)

However, the KuduScanner was calling into code that, when
the gflag above was set, interpreted Unix times as local
time.  Unfortunately the write path (KuduTableSink) and some
FE TIMESTAMP code (see KuduUtil.java) did not have this
behavior, i.e. we were handling the gflag inconsistently.
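
The inconsistency can be illustrated with a write path that treats
timestamps as UTC and a read path that converts to local time. The
function names and the fixed UTC-8 zone are hypothetical; this is not
the KuduScanner or KuduTableSink code.

```python
from datetime import datetime, timedelta, timezone

EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)
LOCAL_TZ = timezone(timedelta(hours=-8))  # hypothetical fixed local zone


def write_as_utc(wall_clock):
    """Write path: treat the naive timestamp as UTC, store Unix micros."""
    delta = wall_clock.replace(tzinfo=timezone.utc) - EPOCH_UTC
    return delta // timedelta(microseconds=1)


def read_as_utc(micros):
    """Read path consistent with the write path."""
    return (EPOCH_UTC + timedelta(microseconds=micros)).replace(tzinfo=None)


def read_as_local(micros):
    """Read path with the flag set: convert the instant to local time."""
    instant = EPOCH_UTC + timedelta(microseconds=micros)
    return instant.astimezone(LOCAL_TZ).replace(tzinfo=None)
```

A value written through write_as_utc() round-trips through
read_as_utc(), but comes back shifted by the zone offset through
read_as_local().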

Tests:
* Adds a custom cluster test to run Kudu test cases with
  -use_local_tz_for_unix_timestamp_conversion.
* Adds tests for the new builtin
  unix_micros_to_utc_timestamp() which run in a custom
  cluster test (added test_local_tz_conversion.py) as well
  as in the regular tests (added to test_exprs.py).

Change-Id: I423a810427353be76aa64442044133a9a22cdc9b
Reviewed-on: http://gerrit.cloudera.org:8080/7311
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-07-19 22:17:13 +00:00
David Knupp
894bb77855 IMPALA-4839: Remove implicit 'localhost' for KUDU_MASTER_HOSTS
The Kudu query tests were failing on a remote cluster because the Kudu
master was always set to '127.0.0.1', with no way to override it.

This patch corrects the issue with a number of changes:

- Add a pytest command line option to specify an arbitrary Kudu master

- Consolidate the place where the default Kudu master is derived. It
  had been stored both in the env and in tests/common/__init__.py,
  with different files looking to different places. For now, just look
  to the env, and remove the value from __init__.py.

- The kudu_client test fixture in conftest.py was using the connect()
  method from impala.dbapi (part of the Impyla library), without
  specifying the host param. In the absence of that, the default value
  is 'localhost', so add the host param to the connect() call.

- Define the various defaults for pytest config as constants at the top
  of conftest.py.
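
A sketch of the lookup order described above: an explicit command line
value wins, otherwise the environment is consulted. The function name
is hypothetical, and raising an error when nothing is configured is an
assumption of this sketch, not something the patch states.

```python
import os


def resolve_kudu_master_hosts(cli_value=None, env=None):
    """Return the Kudu master from the CLI option or KUDU_MASTER_HOSTS."""
    env = os.environ if env is None else env
    if cli_value:
        return cli_value
    hosts = env.get("KUDU_MASTER_HOSTS")
    if not hosts:
        # Assumption: fail loudly instead of silently using localhost.
        raise ValueError("no Kudu master configured")
    return hosts
```

Passing env explicitly keeps the function easy to test without
touching the real process environment.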

Change-Id: I9df71480a165f4ce21ae3edab6ce7227fbf76f77
Reviewed-on: http://gerrit.cloudera.org:8080/5877
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
2017-02-14 21:51:39 +00:00
Matthew Jacobs
50f7753d2b IMPALA-3771: Expose kudu client timeout and set default
The Kudu client timeout was too low for Impala usage. This
sets the default timeout to 3 minutes and exposes it as a
gflag.

New timeout tests were added.

Change-Id: Iad95e8e38aad4f76d21bac6879db6c02b3c3e045
Reviewed-on: http://gerrit.cloudera.org:8080/4849
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-11-05 06:43:45 +00:00
Dimitris Tsirogiannis
041fa6d946 IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables
With this commit we simplify the syntax and handling of CREATE TABLE
statements for both managed and external Kudu tables.

Syntax example:
CREATE TABLE foo(a INT, b STRING, PRIMARY KEY (a, b))
DISTRIBUTE BY HASH (a) INTO 3 BUCKETS,
RANGE (b) SPLIT ROWS (('abc', 'def'))
STORED AS KUDU

Changes:
1) Remove the requirement to specify table properties such as key
   columns in tblproperties.
2) Read table schema (column definitions, primary keys, and distribution
   schemes) from Kudu instead of the HMS.
3) For external tables, the Kudu table is now required to exist at the
   time of creation in Impala.
4) Disallow table properties that could conflict with an existing
   table. Ex: key_columns cannot be specified.
5) Add KUDU as a file format.
6) Add a startup flag to impalad to specify the default Kudu master
   addresses. The flag is used as the default value for the table
   property kudu_master_addresses but it can still be overridden
   using TBLPROPERTIES.
7) Fix a post merge issue (IMPALA-3178) where DROP DATABASE CASCADE
   wasn't implemented for Kudu tables and silently ignored. The Kudu
   tables wouldn't be removed in Kudu.
8) Remove DDL delegates. There was only one functional delegate (for
   Kudu); the existence of the other delegate and the use of delegates
   in general have led to confusion. The Kudu delegate only exists to
   provide functionality missing from Hive.
9) Add PRIMARY KEY at the column and table level. This syntax is fairly
   standard. When used at the column level, only one column can be
   marked as a key. When used at the table level, multiple columns can
   be used as a key. Only Kudu tables are allowed to use PRIMARY KEY.
   The old "kudu.key_columns" table property is no longer accepted
   though it is still used internally. "PRIMARY" is now a keyword.
   The ident style declaration is used for "KEY" because it is also used
   for nested map types.
10) For managed tables, infer a Kudu table name if none was given.
   The table property "kudu.table_name" is optional for managed tables
   and is required for external tables. If for a managed table a Kudu
   table name is not provided, a table name will be generated based
   on the HMS database and table name.
11) Use Kudu master as the source of truth for table metadata instead
   of HMS when a table is loaded or refreshed. Table/column metadata
   are cached in the catalog and are stored in HMS in order to be
   able to use table and column statistics.
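
Item 10 can be sketched as below. The function name and the
"impala::db.table" format of the generated name are assumptions made
for the illustration; the text above only says the name is based on the
HMS database and table name.

```python
def kudu_table_name(db, table, explicit_name=None, external=False):
    """Pick the Kudu table name for a managed or external table."""
    if explicit_name:
        return explicit_name
    if external:
        # External tables must name an existing Kudu table.
        raise ValueError("kudu.table_name is required for external tables")
    # Assumed format for the generated name of a managed table.
    return f"impala::{db}.{table}"
```

An explicit "kudu.table_name" always wins, and only managed tables get
a generated name.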

Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1
Reviewed-on: http://gerrit.cloudera.org:8080/4414
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-10-21 10:52:25 +00:00