This change adds get_workload() to ImpalaTestSuite and removes it
from all test suites that already returned 'functional-query'.
get_workload() is also removed from CustomClusterTestSuite, which
used to return 'tpch'.
All other changes besides impala_test_suite.py and
custom_cluster_test_suite.py are just mass removals of
get_workload() functions.
The behavior only changes in custom cluster tests that didn't override
get_workload(). By returning 'functional-query' instead of 'tpch',
exploration_strategy() will no longer return 'core' in 'exhaustive' test
runs. See IMPALA-3947 for why the workload affects
exploration_strategy(). An example of an affected test is
TestCatalogHMSFailures, which was skipped in both core and exhaustive
runs before this change.
get_workload() functions that return a workload other than
'functional-query' are left unchanged - it is possible that some of
these also don't handle exploration_strategy() as expected, but checking
those tests individually is out of scope for this patch.
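For reference, a minimal sketch of the resulting shape (import path per
the Impala test env; method bodies are illustrative, not the exact
code):

  from tests.common.base_test_suite import BaseTestSuite

  class ImpalaTestSuite(BaseTestSuite):
    @classmethod
    def get_workload(cls):
      # The single default; suites that genuinely use another workload
      # still override this.
      return 'functional-query'

  class CustomClusterTestSuite(ImpalaTestSuite):
    # No get_workload() override anymore: the old 'tpch' default made
    # exploration_strategy() fall back to 'core' even in exhaustive
    # runs (see IMPALA-3947).
    ...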
Change-Id: I9ec6c41ffb3a30e1ea2de773626d1485c69fe115
Reviewed-on: http://gerrit.cloudera.org:8080/22726
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Many custom cluster tests need to create temporary directories. A
temporary directory typically lives within the scope of a test method
and is cleaned up afterwards. However, some tests create temporary
directories directly and forget to clean them up, leaving junk dirs
under /tmp/ or $LOG_DIR.
This patch unifies temporary directory management inside
CustomClusterTestSuite. It introduces a new 'tmp_dir_placeholders' arg
in CustomClusterTestSuite.with_args() that lists the tmp dirs to create.
'impalad_args', 'catalogd_args', and 'impala_log_dir' now accept
formatting patterns that are replaced with temporary dir paths defined
through 'tmp_dir_placeholders'.
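A hypothetical usage sketch (the startup flag and placeholder name are
made up for illustration):

  from tests.common.custom_cluster_test_suite import CustomClusterTestSuite

  class TestMyFeature(CustomClusterTestSuite):
    @CustomClusterTestSuite.with_args(
        impalad_args="--some_dir_flag={my_dir}",  # hypothetical flag
        tmp_dir_placeholders=['my_dir'])
    def test_something(self, vector):
      # 'my_dir' is created before the cluster starts and cleaned up
      # after the test finishes.
      ...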
There are a few occurrences where mkdtemp is called and cannot be
replaced by this work, such as in tests/comparison/cluster.py. In those
cases, this patch changes the call to supply a prefix arg so that
developers know the directory comes from an Impala test script.
This patch also addresses several flake8 errors in the modified files.
Testing:
- Pass custom cluster tests in exhaustive mode.
- Manually ran a few modified tests and observed that the temporary
  dirs are created and removed under logs/custom_cluster_tests/ as the
  tests run.
Change-Id: I8dd665e8028b3f03e5e33d572c5e188f85c3bdf5
Reviewed-on: http://gerrit.cloudera.org:8080/21836
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
There are writer engines (Flink, NiFi) that use Iceberg's
identifier-field-ids from the Iceberg schema to identify the columns to
be written into equality delete files. So far Impala wasn't able to
populate these identifier-field-ids. This patch introduces support for
NOT ENFORCED primary keys for Iceberg tables, where the primary key is
used to set identifier-field-ids during Iceberg schema creation.
Example syntax:
CREATE TABLE ice_tbl (
  i int NOT NULL,
  j int,
  s string NOT NULL,
  primary key(i, s) not enforced)
PARTITIONED BY SPEC (truncate(10, s))
STORED AS ICEBERG;
There are some constraints on primary keys (PK), following the
behavior of Flink:
- Only NOT NULL columns can be in the PK.
- PK is not allowed at the column definition level, as in
'i int NOT NULL PRIMARY KEY'.
- If the table is partitioned, the partition columns have to be part of
the PK.
- Float and double columns are not allowed in the PK.
- Dropping a column that is part of the PK is not allowed.
Testing:
- New E2E tests added for different table creation scenarios.
- Manual test using NiFi to write into a table with a PK.
Change-Id: I7bea787acdabd8cb04661f4ddb5c3309af0364a6
Reviewed-on: http://gerrit.cloudera.org:8080/21149
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
For CTAS statements that create Kudu/Iceberg tables, the lineage log
was incomplete: it was missing the table creation time of the newly
created table. This information was missing because
createKuduTable() / createIcebergTable() in CatalogOpExecutor did not
set it in the TDdlExecResponse object. This patch adds the missing
information.
Testing:
* e2e test
Change-Id: I6938938b1834809d5197a748c171e9a09e13906a
Reviewed-on: http://gerrit.cloudera.org:8080/19868
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
1. Python 3 requires absolute imports within packages. This
can be emulated via "from __future__ import absolute_import"
2. Python 3 changed division to "true" division that doesn't
round to an integer. This can be emulated via
"from __future__ import division"
This changes all Python files to add imports for absolute_import and
division. For completeness, print_function is also included in the
import.
I scrutinized each old-division location and converted those that needed
an integer result (e.g. indices, counts of records, etc.) to use the
integer division '//' operator. Some code was also using relative
imports and needed to be adjusted to handle absolute_import.
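A minimal illustration of the resulting pattern (variable names are
made up):

  from __future__ import absolute_import, division, print_function

  # With 'division' in effect, / is true division on Python 2 as well:
  ratio = bytes_read / total_bytes  # float result, even for int inputs
  # Locations that need an integer result now use floor division:
  mid_index = (low + high) // 2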
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
Testing:
- Ran core tests
Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Combines all SkipIf* classes for different filesystems into a single
SkipIfFS class. Many cases are simplified to 'not IS_HDFS', with the
rest as filesystem-specific special cases. The 'jira' option is removed
in favor of specific flags for each issue.
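A hedged sketch of the consolidated shape (import path per the Impala
test env; member names and reason strings are illustrative, not the
exact class contents):

  import pytest
  from tests.util.filesystem_utils import IS_COS, IS_GCS, IS_HDFS

  class SkipIfFS:
    # Most cases collapse to "any filesystem that is not HDFS".
    hdfs_caching = pytest.mark.skipif(not IS_HDFS,
        reason="HDFS caching is HDFS-only")
    # Filesystem-specific special cases keep flags of their own, named
    # for the issue rather than a generic 'jira' option.
    stress_insert_timeouts = pytest.mark.skipif(IS_COS or IS_GCS,
        reason="IMPALA-10562/IMPALA-10773: slow file listing")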
Change-Id: Ib928a6274baaaec45614887b9e762346a25812a1
Reviewed-on: http://gerrit.cloudera.org:8080/18781
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Adds Ozone as an alternative to HDFS in the minicluster. Select by
setting `export TARGET_FILESYSTEM=ozone`. With that flag,
run-mini-dfs.sh will start Ozone instead of HDFS. Requires a snapshot
because Ozone does not support HBase (HDDS-3589); snapshot loading
doesn't work yet primarily due to HDDS-5502.
Uses the o3fs interface because Ozone puts specific restrictions on
bucket names (no underscores, for instance), and it was a lot easier to
use an interface where everything is written to a single bucket than to
update all of Impala's uses of HDFS-style paths to make `test-warehouse`
a bucket inside a volume.
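As an illustration of the resulting addressing (volume and bucket names
are placeholders, not the actual configuration):

  hdfs: hdfs://<namenode>/test-warehouse/tpch.lineitem/...
  o3fs: o3fs://<bucket>.<volume>/test-warehouse/tpch.lineitem/...

The existing warehouse layout stays intact under one fixed bucket.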
Specifies reduced Ozone client retries during shutdown, when Ozone may
not be available.
Passes tests with FE_TEST=false BE_TEST=false.
Change-Id: Ibf8b0f7b2d685d8b011df1926e12bf5434b5a2be
Reviewed-on: http://gerrit.cloudera.org:8080/18738
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
This patch adds support for COS (Cloud Object Storage). Using
hadoop-cos, the implementation is similar to that of other remote
FileSystems.
New flags for COS:
- num_cos_io_threads: Number of COS I/O threads. Defaults to 16.
Follow-up:
- Support for caching COS file handles will be addressed in
IMPALA-10772.
- test_concurrent_inserts and test_failing_inserts in
test_acid_stress.py are skipped due to slow file listing on
COS (IMPALA-10773).
Tests:
- Upload hdfs test data to a COS bucket. Modify all locations in HMS
DB to point to the COS bucket. Remove some hdfs caching params.
Run CORE tests.
Change-Id: Idce135a7591d1b4c74425e365525be3086a39821
Reviewed-on: http://gerrit.cloudera.org:8080/17503
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for GCS (Google Cloud Storage). Using the
gcs-connector, the implementation is similar to that of other remote
FileSystems.
New flags for GCS:
- num_gcs_io_threads: Number of GCS I/O threads. Defaults to 16.
Follow-up:
- Support for spilling to GCS will be addressed in IMPALA-10561.
- Support for caching GCS file handles will be addressed in
IMPALA-10568.
- test_concurrent_inserts and test_failing_inserts in
test_acid_stress.py are skipped due to slow file listing on
GCS (IMPALA-10562).
- Some tests are skipped due to issues introduced by /etc/hosts setting
on GCE instances (IMPALA-10563).
Tests:
- Compile and create hdfs test data on a GCE instance. Upload test data
to a GCS bucket. Modify all locations in HMS DB to point to the GCS
bucket. Remove some hdfs caching params. Run CORE tests.
- Compile and load snapshot data to a GCS bucket. Run CORE tests.
Change-Id: Ia91ec956de3b620cccf6a1244b56b7da7a45b32b
Reviewed-on: http://gerrit.cloudera.org:8080/17121
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
For DDLs, LogLineageRecord() adds certain fields in the backend before
flushing the lineage. It uses ddl_exec_response() to get these fields.
However, EXPLAIN is a special kind of DDL that does not have an
associated catalog_op_executor_, which caused explain statements to hit
a null pointer when ddl_exec_response() was called.
Currently, tools like Atlas do not track lineages for explain
statements. This change skips lineage logging for explain statements
and, in general, adds a nullptr check for catalog_op_executor_.
Testing:
Added a test to verify lineage is not created for explain statements.
Change-Id: Iccc20fd5a80841c820ebeb4edffccebea30df76e
Reviewed-on: http://gerrit.cloudera.org:8080/14646
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch makes the following changes / fixes when running Impala tests
on ABFS:
* Skips some tests in test_lineage.py that don't work on ABFS / ADLS
(they were already skipped for S3)
* Skips some tests in test_mt_dop.py; the test creates a directory that
ends with a period, and ABFS does not support writing files or
directories that end with a period
* Removes the ABFS skip flag SkipIfABFS.trash ("IMPALA-7726: Drop with
purge tests fail against ABFS due to trash misbehavior"); I removed
these flags and looped the tests overnight with no failures, so it is
likely that whatever bug was causing this has since been fixed
* Now that HADOOP-15860 has been resolved, and the agreed-upon behavior
for ABFS is to fail if a client tries to write a file / directory that
ends with a period, I added a new entry to the SkipIfABFS class called
file_or_folder_name_ends_with_period and applied it where necessary, as
sketched below
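A hedged sketch of how the new entry is meant to be applied (the skip
condition and test name are illustrative, not the exact code):

  import pytest
  from tests.util.filesystem_utils import IS_ABFS

  class SkipIfABFS:
    file_or_folder_name_ends_with_period = pytest.mark.skipif(IS_ABFS,
        reason="ABFS rejects file/dir names that end with a period")

  @SkipIfABFS.file_or_folder_name_ends_with_period
  def test_dir_name_ends_with_period():
    ...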
Testing:
* Ran core tests on ABFS
Change-Id: I18ae5b0f7de6aa7628a1efd780ff30a0cc3c5285
Reviewed-on: http://gerrit.cloudera.org:8080/14636
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Atlas needs the table location to establish lineage between a newly
created external table and its table location.
The table location information is not available until the createTable
catalog op succeeds. After this change, the location information is sent
to the backend in the TDdlExecResponse message, which adds it to the
lineage graph. This information is sent only for CREATE EXTERNAL TABLE
queries.
Testing:
Added a test to verify the tableLocation field is populated for a
create external table query lineage. Also, modified the
lineage.test file to include location information for all lineages.
Change-Id: If02b0cc16d52c1956298171628f5737cab62ce9f
Reviewed-on: http://gerrit.cloudera.org:8080/14515
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
DDLs like 'create table' should generate minimal lineage graphs so
that consumers like Atlas can use information like 'queryText' to
establish lineages.
This change adds a call to the computeLineageGraph() method during the
analysis phase of createTable, which populates the graph with basic
information like queryText. If it is a CTAS, this graph is enhanced in
the "insert" phase with dependencies.
Testing:
Added an EE test to verify lineage information and to check that it is
flushed to disk properly.
Change-Id: Ia6c7ed9fe3265fd777fe93590cf4eb2d9ba0dd1e
Reviewed-on: http://gerrit.cloudera.org:8080/14458
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently, the query events (audits and lineages) are logged as part of
query unregistration. This delays the event logging in cases where
Unregister() is delayed by the client for some reason (e.g. Hue does not
call Unregister until the browser tab is closed) or the client goes away
without calling Unregister and the query timeout kicks in.
This patch moves the event logging to an earlier stage in the query
lifecycle. The event logging related code is moved into
ClientRequestState for easier code refactoring.
The conditions under which the events are logged are slightly modified
by this patch. Without the patch, events are logged for unsuccessful
queries only if at least a single fetch is performed. This patch relaxes
this guarantee to log events for any query that reaches the FINISHED
state (rows are available to fetch by the client), without waiting for a
fetch to be performed. This simplifies the coordinator state machine by
avoiding unnecessary synchronization.
Added some test coverage for the coordinator-side code paths for writing
lineages. The FE-specific lineage tests only verified the correctness of
the lineage created but did not test whether it was flushed correctly to
disk.
Change-Id: I639b9c1acb9806b29292cd85be2863688453ca2e
Reviewed-on: http://gerrit.cloudera.org:8080/14143
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch fixes a bug in TestLineage::test_create_table_timestamp: it
used the same lineage log directory created by
TestLineage::test_start_end_timestamp, which could fail the query ID
assertion. The fix is to use a different lineage log directory from the
one used by TestLineage::test_start_end_timestamp.
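A hedged sketch of the fix pattern (the impalad flag name reflects my
understanding of lineage logging configuration; the directory value is
illustrative):

  from tests.common.custom_cluster_test_suite import CustomClusterTestSuite

  class TestLineage(CustomClusterTestSuite):
    @CustomClusterTestSuite.with_args(
        impalad_args="--lineage_event_log_dir=/tmp/create_table_timestamp_lineage")
    def test_create_table_timestamp(self, vector):
      # This directory is no longer shared with test_start_end_timestamp,
      # so query IDs from that test cannot show up in assertions here.
      ...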
Testing:
- Ran test_lineage.py multiple times and it still passed.
Change-Id: I5813e4a570c181ba196b9ddf0210c8a0d92e21e8
Reviewed-on: http://gerrit.cloudera.org:8080/13560
Reviewed-by: Fredy Wijaya <fwijaya@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Some tools use lineage graph logging to collect query metrics. Currently
only the query hash is present in this log. Adding the query id makes
such accounting easier.
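A hedged sketch of the kind of check this enables (the JSON field name
and one-graph-per-line layout are assumptions, not the exact test code):

  import json

  def lineage_contains_query_id(lineage_log_path, profile_query_id):
    # Assume each line of the lineage log holds one JSON lineage graph.
    with open(lineage_log_path) as f:
      for line in f:
        graph = json.loads(line)
        if graph.get("queryId") == profile_query_id:
          return True
    return False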
Testing: The equality of query id in the query profile and lineage log
is checked in test_lineage.py. A test for TUniqueIdUtil is added to the
FE tests.
Change-Id: I4adbd02df37a234dbb79f58b7c46ca11a914229f
Reviewed-on: http://gerrit.cloudera.org:8080/8589
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Using TimestampValue (or an equivalent string representation) for
timestamps that require a point in time doesn't work because the same
timestamp can represent multiple points in time. For example, the
timestamp '2016-11-06 01:01 AM' occurred twice last weekend.
Instead, we should use unix time directly rather than trying to derive
unix time from a (timezone-less) timestamp.
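The ambiguity is easy to demonstrate; a short illustration (Python 3.9+
zoneinfo and the America/Los_Angeles fall-back transition are my choices
for the example):

  from datetime import datetime
  from zoneinfo import ZoneInfo

  tz = ZoneInfo("America/Los_Angeles")
  # 01:30 local time occurred twice on 2016-11-06: in PDT, then in PST.
  first = datetime(2016, 11, 6, 1, 30, fold=0, tzinfo=tz).timestamp()
  second = datetime(2016, 11, 6, 1, 30, fold=1, tzinfo=tz).timestamp()
  assert second - first == 3600  # one wall-clock string, two unix times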
Note that there are other questionable uses of TimestampValue for
internal Impala service stuff, but I want to fix them separately as they
are not as important and fixing does add some risk.
While I'm here, remove a template TimestampValue constructor that was
unused and is confusing.
We don't have any end-to-end tests that exercise column lineage, so add
a simple custom cluster test that enables lineage and verifies that the
start and end unix times are within appropriate bounds. The other column
lineage graph fields are at least tested via planner tests.
Automated regression testing for the specific daylight saving issue is
difficult, as we'd have to cross the daylight saving boundary at just
the right time during query execution in order to reproduce it reliably.
But open to ideas.
Testing:
- Looped the new test overnight without any failures.
- Exhaustive run.
Change-Id: I34e435fc3511e65bc62906205cb558f2c116a8a9
Reviewed-on: http://gerrit.cloudera.org:8080/5129
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins