This change adds get_workload() to ImpalaTestSuite and removes it
from all test suites that already returned 'functional-query'.
get_workload() is also removed from CustomClusterTestSuite, which
used to return 'tpch'.
All other changes besides impala_test_suite.py and
custom_cluster_test_suite.py are just mass removals of
get_workload() functions.
The behavior only changes in custom cluster tests that didn't override
get_workload(). By returning 'functional-query' instead of 'tpch',
exploration_strategy() will no longer return 'core' in 'exhaustive' test
runs. See IMPALA-3947 for why the workload affects
exploration_strategy(). An example of an affected test is
TestCatalogHMSFailures, which was skipped in both core and exhaustive
runs before this change.
get_workload() functions that return a workload other than
'functional-query' are left unchanged - it is possible that some of
these also don't handle exploration_strategy() as expected, but checking
those tests individually is out of scope for this patch.
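For reference, a minimal sketch of the resulting shape (import path per
the Impala test env; method bodies are illustrative, not the exact
code):

  from tests.common.base_test_suite import BaseTestSuite

  class ImpalaTestSuite(BaseTestSuite):
    @classmethod
    def get_workload(cls):
      # The single default; suites that genuinely use another workload
      # still override this.
      return 'functional-query'

  class CustomClusterTestSuite(ImpalaTestSuite):
    # No get_workload() override anymore: the old 'tpch' default made
    # exploration_strategy() fall back to 'core' even in exhaustive
    # runs (see IMPALA-3947).
    ...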
Change-Id: I9ec6c41ffb3a30e1ea2de773626d1485c69fe115
Reviewed-on: http://gerrit.cloudera.org:8080/22726
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Many custom cluster tests need to create temporary directories. A
temporary directory typically lives within the scope of a test method
and is cleaned up afterwards. However, some tests create temporary
directories directly and forget to clean them up, leaving junk dirs
under /tmp/ or $LOG_DIR.
This patch unifies temporary directory management inside
CustomClusterTestSuite. It introduces a new 'tmp_dir_placeholders' arg
in CustomClusterTestSuite.with_args() that lists the tmp dirs to create.
'impalad_args', 'catalogd_args', and 'impala_log_dir' now accept
formatting patterns that are replaced with temporary dir paths defined
through 'tmp_dir_placeholders'.
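A hypothetical usage sketch (the startup flag and placeholder name are
made up for illustration):

  from tests.common.custom_cluster_test_suite import CustomClusterTestSuite

  class TestMyFeature(CustomClusterTestSuite):
    @CustomClusterTestSuite.with_args(
        impalad_args="--some_dir_flag={my_dir}",  # hypothetical flag
        tmp_dir_placeholders=['my_dir'])
    def test_something(self, vector):
      # 'my_dir' is created before the cluster starts and cleaned up
      # after the test finishes.
      ...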
There are a few occurrences where mkdtemp is called and cannot be
replaced by this work, such as in tests/comparison/cluster.py. In those
cases, this patch changes the call to supply a prefix arg so that
developers know the directory comes from an Impala test script.
This patch also addresses several flake8 errors in the modified files.
Testing:
- Pass custom cluster tests in exhaustive mode.
- Manually ran a few modified tests and observed that the temporary
  dirs are created and removed under logs/custom_cluster_tests/ as the
  tests run.
Change-Id: I8dd665e8028b3f03e5e33d572c5e188f85c3bdf5
Reviewed-on: http://gerrit.cloudera.org:8080/21836
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
There are writer engines (Flink, NiFi) that use Iceberg's
identifier-field-ids from the Iceberg schema to identify the columns to
be written into equality delete files. So far Impala wasn't able to
populate these identifier-field-ids. This patch introduces support for
NOT ENFORCED primary keys for Iceberg tables, where the primary key is
used to set identifier-field-ids during Iceberg schema creation.
Example syntax:
CREATE TABLE ice_tbl (
  i int NOT NULL,
  j int,
  s string NOT NULL,
  primary key(i, s) not enforced)
PARTITIONED BY SPEC (truncate(10, s))
STORED AS ICEBERG;
There are some constraints on primary keys (PK), following the
behavior of Flink:
- Only NOT NULL columns can be in the PK.
- PK is not allowed at the column definition level, as in
'i int NOT NULL PRIMARY KEY'.
- If the table is partitioned, the partition columns have to be part of
the PK.
- Float and double columns are not allowed in the PK.
- Dropping a column that is part of the PK is not allowed.
Testing:
- New E2E tests added for different table creation scenarios.
- Manual test using NiFi to write into a table with a PK.
Change-Id: I7bea787acdabd8cb04661f4ddb5c3309af0364a6
Reviewed-on: http://gerrit.cloudera.org:8080/21149
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
For CTAS statements that create Kudu/Iceberg tables, the lineage log
was incomplete: it was missing the table creation time of the newly
created table. This information was missing because
createKuduTable() / createIcebergTable() in CatalogOpExecutor did not
set it in the TDdlExecResponse object. This patch adds the missing
information.
Testing:
* e2e test
Change-Id: I6938938b1834809d5197a748c171e9a09e13906a
Reviewed-on: http://gerrit.cloudera.org:8080/19868
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
1. Python 3 requires absolute imports within packages. This
can be emulated via "from __future__ import absolute_import"
2. Python 3 changed division to "true" division that doesn't
round to an integer. This can be emulated via
"from __future__ import division"
This changes all Python files to add imports for absolute_import and
division. For completeness, print_function is also included in the
import.
I scrutinized each old-division location and converted those that needed
an integer result (e.g. indices, counts of records, etc.) to use the
integer division '//' operator. Some code was also using relative
imports and needed to be adjusted to handle absolute_import.
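A minimal illustration of the resulting pattern (variable names are
made up):

  from __future__ import absolute_import, division, print_function

  # With 'division' in effect, / is true division on Python 2 as well:
  ratio = bytes_read / total_bytes  # float result, even for int inputs
  # Locations that need an integer result now use floor division:
  mid_index = (low + high) // 2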
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
Testing:
- Ran core tests
Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Combines all SkipIf* classes for different filesystems into a single
SkipIfFS class. Many cases are simplified to 'not IS_HDFS', with the
rest as filesystem-specific special cases. The 'jira' option is removed
in favor of specific flags for each issue.
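A hedged sketch of the consolidated shape (import path per the Impala
test env; member names and reason strings are illustrative, not the
exact class contents):

  import pytest
  from tests.util.filesystem_utils import IS_COS, IS_GCS, IS_HDFS

  class SkipIfFS:
    # Most cases collapse to "any filesystem that is not HDFS".
    hdfs_caching = pytest.mark.skipif(not IS_HDFS,
        reason="HDFS caching is HDFS-only")
    # Filesystem-specific special cases keep flags of their own, named
    # for the issue rather than a generic 'jira' option.
    stress_insert_timeouts = pytest.mark.skipif(IS_COS or IS_GCS,
        reason="IMPALA-10562/IMPALA-10773: slow file listing")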
Change-Id: Ib928a6274baaaec45614887b9e762346a25812a1
Reviewed-on: http://gerrit.cloudera.org:8080/18781
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Adds Ozone as an alternative to HDFS in the minicluster. Select by
setting `export TARGET_FILESYSTEM=ozone`. With that flag,
run-mini-dfs.sh will start Ozone instead of HDFS. Requires a snapshot
because Ozone does not support HBase (HDDS-3589); snapshot loading
doesn't work yet primarily due to HDDS-5502.
Uses the o3fs interface because Ozone puts specific restrictions on
bucket names (no underscores, for instance), and it was a lot easier to
use an interface where everything is written to a single bucket than to
update all of Impala's uses of HDFS-style paths to make `test-warehouse`
a bucket inside a volume.
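As an illustration of the resulting addressing (volume and bucket names
are placeholders, not the actual configuration):

  hdfs: hdfs://<namenode>/test-warehouse/tpch.lineitem/...
  o3fs: o3fs://<bucket>.<volume>/test-warehouse/tpch.lineitem/...

The existing warehouse layout stays intact under one fixed bucket.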
Specifies reduced Ozone client retries during shutdown, when Ozone may
not be available.
Passes tests with FE_TEST=false BE_TEST=false.
Change-Id: Ibf8b0f7b2d685d8b011df1926e12bf5434b5a2be
Reviewed-on: http://gerrit.cloudera.org:8080/18738
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
This patch adds support for COS (Cloud Object Storage). Using
hadoop-cos, the implementation is similar to that of other remote
FileSystems.
New flags for COS:
- num_cos_io_threads: Number of COS I/O threads. Defaults to 16.
Follow-up:
- Support for caching COS file handles will be addressed in
IMPALA-10772.
- test_concurrent_inserts and test_failing_inserts in
test_acid_stress.py are skipped due to slow file listing on
COS (IMPALA-10773).
Tests:
- Upload hdfs test data to a COS bucket. Modify all locations in HMS
DB to point to the COS bucket. Remove some hdfs caching params.
Run CORE tests.
Change-Id: Idce135a7591d1b4c74425e365525be3086a39821
Reviewed-on: http://gerrit.cloudera.org:8080/17503
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for GCS (Google Cloud Storage). Using the
gcs-connector, the implementation is similar to that of other remote
FileSystems.
New flags for GCS:
- num_gcs_io_threads: Number of GCS I/O threads. Defaults to 16.
Follow-up:
- Support for spilling to GCS will be addressed in IMPALA-10561.
- Support for caching GCS file handles will be addressed in
IMPALA-10568.
- test_concurrent_inserts and test_failing_inserts in
test_acid_stress.py are skipped due to slow file listing on
GCS (IMPALA-10562).
- Some tests are skipped due to issues introduced by /etc/hosts setting
on GCE instances (IMPALA-10563).
Tests:
- Compile and create hdfs test data on a GCE instance. Upload test data
to a GCS bucket. Modify all locations in HMS DB to point to the GCS
bucket. Remove some hdfs caching params. Run CORE tests.
- Compile and load snapshot data to a GCS bucket. Run CORE tests.
Change-Id: Ia91ec956de3b620cccf6a1244b56b7da7a45b32b
Reviewed-on: http://gerrit.cloudera.org:8080/17121
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
For DDLs, LogLineageRecord() adds certain fields in the backend before
flushing the lineage. It uses ddl_exec_response() to get these fields.
However, EXPLAIN is a special kind of DDL that does not have an
associated catalog_op_executor_, which caused explain statements to hit
a null pointer when ddl_exec_response() was called.
Currently, tools like Atlas do not track lineages for explain
statements. This change skips lineage logging for explain statements
and, in general, adds a nullptr check for catalog_op_executor_.
Testing:
Added a test to verify lineage is not created for explain statements.
Change-Id: Iccc20fd5a80841c820ebeb4edffccebea30df76e
Reviewed-on: http://gerrit.cloudera.org:8080/14646
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch makes the following changes / fixes when running Impala tests
on ABFS:
* Skips some tests in test_lineage.py that don't work on ABFS / ADLS
(they were already skipped for S3)
* Skips some tests in test_mt_dop.py; the test creates a directory that
ends with a period, and ABFS does not support writing files or
directories that end with a period
* Removes the ABFS skip flag SkipIfABFS.trash ("IMPALA-7726: Drop with
purge tests fail against ABFS due to trash misbehavior"); I removed
these flags and looped the tests overnight with no failures, so it is
likely that whatever bug was causing this has since been fixed
* Now that HADOOP-15860 has been resolved, and the agreed-upon behavior
for ABFS is to fail if a client tries to write a file / directory that
ends with a period, I added a new entry to the SkipIfABFS class called
file_or_folder_name_ends_with_period and applied it where necessary, as
sketched below
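A hedged sketch of how the new entry is meant to be applied (the skip
condition and test name are illustrative, not the exact code):

  import pytest
  from tests.util.filesystem_utils import IS_ABFS

  class SkipIfABFS:
    file_or_folder_name_ends_with_period = pytest.mark.skipif(IS_ABFS,
        reason="ABFS rejects file/dir names that end with a period")

  @SkipIfABFS.file_or_folder_name_ends_with_period
  def test_dir_name_ends_with_period():
    ...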
Testing:
* Ran core tests on ABFS
Change-Id: I18ae5b0f7de6aa7628a1efd780ff30a0cc3c5285
Reviewed-on: http://gerrit.cloudera.org:8080/14636
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Atlas needs the table location to establish lineage between a newly
created external table and its table location.
The table location information is not available until the createTable
catalog op succeeds. After this change, the location information is sent
to the backend in the TDdlExecResponse message, which adds it to the
lineage graph. This information is sent only for CREATE EXTERNAL TABLE
queries.
Testing:
Added a test to verify the tableLocation field is populated for a
create external table query lineage. Also, modified the
lineage.test file to include location information for all lineages.
Change-Id: If02b0cc16d52c1956298171628f5737cab62ce9f
Reviewed-on: http://gerrit.cloudera.org:8080/14515
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
DDLs like 'create table' should generate minimal lineage graphs so
that consumers like Atlas can use information like 'queryText' to
establish lineages.
This change adds a call to the computeLineageGraph() method during the
analysis phase of createTable, which populates the graph with basic
information like queryText. If it is a CTAS, this graph is enhanced in
the "insert" phase with dependencies.
Testing:
Added an EE test to verify lineage information and to check that it is
flushed to disk properly.
Change-Id: Ia6c7ed9fe3265fd777fe93590cf4eb2d9ba0dd1e
Reviewed-on: http://gerrit.cloudera.org:8080/14458
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently, the query events (audits and lineages) are logged as part of
query unregistration. This delays the event logging in cases where
Unregister() is delayed by the client for some reason (e.g. Hue does not
call Unregister until the browser tab is closed) or the client goes away
without calling Unregister and the query timeout kicks in.
This patch moves the event logging to an earlier stage in the query
lifecycle. The event logging related code is moved into
ClientRequestState for easier code refactoring.
The conditions under which the events are logged are slightly modified
by this patch. Without the patch, events are logged for unsuccessful
queries only if at least a single fetch is performed. This patch relaxes
this guarantee to log events for any query that reaches the FINISHED
state (rows are available to fetch by the client), without waiting for a
fetch to be performed. This simplifies the coordinator state machine by
avoiding unnecessary synchronization.
Added some test coverage for the coordinator-side code paths for writing
lineages. The FE-specific lineage tests only verified the correctness of
the lineage created but did not test whether it was flushed correctly to
disk.
Change-Id: I639b9c1acb9806b29292cd85be2863688453ca2e
Reviewed-on: http://gerrit.cloudera.org:8080/14143
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch fixes a bug in TestLineage::test_create_table_timestamp: it
used the same lineage log directory created by
TestLineage::test_start_end_timestamp, which could fail the query ID
assertion. The fix is to use a different lineage log directory from the
one used by TestLineage::test_start_end_timestamp.
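A hedged sketch of the fix pattern (the impalad flag name reflects my
understanding of lineage logging configuration; the directory value is
illustrative):

  from tests.common.custom_cluster_test_suite import CustomClusterTestSuite

  class TestLineage(CustomClusterTestSuite):
    @CustomClusterTestSuite.with_args(
        impalad_args="--lineage_event_log_dir=/tmp/create_table_timestamp_lineage")
    def test_create_table_timestamp(self, vector):
      # This directory is no longer shared with test_start_end_timestamp,
      # so query IDs from that test cannot show up in assertions here.
      ...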
Testing:
- Ran test_lineage.py multiple times and it still passed.
Change-Id: I5813e4a570c181ba196b9ddf0210c8a0d92e21e8
Reviewed-on: http://gerrit.cloudera.org:8080/13560
Reviewed-by: Fredy Wijaya <fwijaya@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Some tools use lineage graph logging to collect query metrics. Currently
only the query hash is present in this log. Adding the query id makes
such accounting easier.
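A hedged sketch of the kind of check this enables (the JSON field name
and one-graph-per-line layout are assumptions, not the exact test code):

  import json

  def lineage_contains_query_id(lineage_log_path, profile_query_id):
    # Assume each line of the lineage log holds one JSON lineage graph.
    with open(lineage_log_path) as f:
      for line in f:
        graph = json.loads(line)
        if graph.get("queryId") == profile_query_id:
          return True
    return False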
Testing: The equality of query id in the query profile and lineage log
is checked in test_lineage.py. A test for TUniqueIdUtil is added to the
FE tests.
Change-Id: I4adbd02df37a234dbb79f58b7c46ca11a914229f
Reviewed-on: http://gerrit.cloudera.org:8080/8589
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Using TimestampValue (or an equivalent string representation) for
timestamps that require a point in time doesn't work because the same
timestamp can represent multiple points in time. For example, the
timestamp '2016-11-06 01:01 AM' occurred twice last weekend.
Instead, we should use unix time directly rather than trying to derive
unix time from a (timezone-less) timestamp.
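The ambiguity is easy to demonstrate; a short illustration (Python 3.9+
zoneinfo and the America/Los_Angeles fall-back transition are my choices
for the example):

  from datetime import datetime
  from zoneinfo import ZoneInfo

  tz = ZoneInfo("America/Los_Angeles")
  # 01:30 local time occurred twice on 2016-11-06: in PDT, then in PST.
  first = datetime(2016, 11, 6, 1, 30, fold=0, tzinfo=tz).timestamp()
  second = datetime(2016, 11, 6, 1, 30, fold=1, tzinfo=tz).timestamp()
  assert second - first == 3600  # one wall-clock string, two unix times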
Note that there are other questionable uses of TimestampValue for
internal Impala service stuff, but I want to fix them separately as they
are not as important and fixing does add some risk.
While I'm here, remove a template TimestampValue constructor that was
unused and is confusing.
We don't have any end-to-end tests that exercise column lineage, so add
a simple custom cluster test that enables lineage and verifies that the
start and end unix times are within appropriate bounds. The other column
lineage graph fields are at least tested via planner tests.
Automated regression testing for the specific daylight saving issue is
difficult, as we'd have to cross the daylight saving boundary at just
the right time during query execution in order to reproduce it reliably.
But open to ideas.
Testing:
- Looped the new test overnight without any failures.
- Exhaustive run.
Change-Id: I34e435fc3511e65bc62906205cb558f2c116a8a9
Reviewed-on: http://gerrit.cloudera.org:8080/5129
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins