impala

mirror of https://github.com/apache/impala.git synced 2025-12-19 18:12:08 -05:00

Author	SHA1	Message	Date
Riza Suminto	b46d541501	IMPALA-13961: Remove usage of ImpalaBeeswaxResult.schema An equivalent of ImpalaBeeswaxResult.schema is not implemented at ImpylaHS2ResultSet. However, column_labels and column_types fields are implemented for both. This patch removes usage of ImpalaBeeswaxResult.schema and replaces it with either column_labels or column_types field. Tests that used to access ImpalaBeeswaxResult.schema are migrated to test using hs2 protocol by default. Also fix flake8 issues in modified test files. Testing: Run and pass modified test files in exhaustive exploration. Change-Id: I060fe2d3cded1470fd09b86675cb22442c19fbee Reviewed-on: http://gerrit.cloudera.org:8080/22776 Reviewed-by: Riza Suminto <riza.suminto@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-04-16 06:28:11 +00:00
Csaba Ringhofer	f98b697c7b	IMPALA-13929: Make 'functional-query' the default workload in tests This change adds get_workload() to ImpalaTestSuite and removes it from all test suites that already returned 'functional-query'. get_workload() is also removed from CustomClusterTestSuite which used to return 'tpch'. All other changes besides impala_test_suite.py and custom_cluster_test_suite.py are just mass removals of get_workload() functions. The behavior is only changed in custom cluster tests that didn't override get_workload(). By returning 'functional-query' instead of 'tpch', exploration_strategy() will no longer return 'core' in 'exhaustive' test runs. See IMPALA-3947 on why workload affected exploration_strategy. An example of an affected test is TestCatalogHMSFailures, which was skipped both in core and exhaustive runs before this change. get_workload() functions that return a different workload than 'functional-query' are not changed - it is possible that some of these also don't handle exploration_strategy() as expected, but individually checking these tests is out of scope in this patch. Change-Id: I9ec6c41ffb3a30e1ea2de773626d1485c69fe115 Reviewed-on: http://gerrit.cloudera.org:8080/22726 Reviewed-by: Riza Suminto <riza.suminto@cloudera.com> Reviewed-by: Daniel Becker <daniel.becker@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-04-08 07:12:55 +00:00
Joe McDonnell	82bd087fb1	IMPALA-11973: Add absolute_import, division to all eligible Python files This takes steps to make Python 2 behave like Python 3 as a way to flush out issues with running on Python 3. Specifically, it handles two main differences: 1. Python 3 requires absolute imports within packages. This can be emulated via "from __future__ import absolute_import" 2. Python 3 changed division to "true" division that doesn't round to an integer. This can be emulated via "from __future__ import division" This changes all Python files to add imports for absolute_import and division. For completeness, this also includes print_function in the import. I scrutinized each old-division location and converted some locations to use the integer division '//' operator if it needed an integer result (e.g. for indices, counts of records, etc). Some code was also using relative imports and needed to be adjusted to handle absolute_import. This fixes all Pylint warnings about no-absolute-import and old-division, and these warnings are now banned. Testing: - Ran core tests Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b Reviewed-on: http://gerrit.cloudera.org:8080/19588 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>	2023-03-09 17:17:57 +00:00
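The division change described in IMPALA-11973 can be illustrated with a small sketch. It is written for Python 3, where true division is already the default; the function name `split_evenly` and its arguments are hypothetical and do not come from the Impala code base.

```python
# On Python 2, a file like this would start with:
#   from __future__ import absolute_import, division, print_function
# so that '/' behaves as it does on Python 3.


def split_evenly(total_rows, num_buckets):
    """Return (rows per bucket as an int, true average as a float)."""
    # '//' keeps an integer result, which is what indices and record
    # counts need.
    rows_per_bucket = total_rows // num_buckets
    # '/' is true division and does not round to an integer.
    average = total_rows / num_buckets
    return rows_per_bucket, average


print(split_evenly(7, 2))
```

Locations that need an integer result (indices, counts) are the ones the commit message says were converted to `//`.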
Abhishek Rawat	fa525dfdf7	IMPALA-7876: COMPUTE STATS TABLESAMPLE is not updating number of estimated rows 'COMPUTE STATS TABLESAMPLE' uses a child query with following function 'ROUND(COUNT(*) / <effective_sample_perc>)' for computing the row count. The 'ROUND()' fn returns the row count as a DECIMAL type. The 'CatalogOpExecutor' (CatalogOpExecutor::SetTableStats) expects the row count as a BIGINT type. Due to this data type mismatch the table stats (Extrap #Rows) doesn't get set. Adding an explicit CAST to BIGINT for the ROUND function results in the table stats (Extrap #Rows) getting set properly. Fixed both 'custom_cluster/test_stats_extrapolation.py' and 'metadata/test_stats_extrapolation.py' so that they can catch issues like this, where table stats are not set when using 'COMPUTE STATS TABLESAMPLE'. Testing: - Ran core tests. Change-Id: I88a0a777c2be9cc18b3ff293cf1c06fb499ca052 Reviewed-on: http://gerrit.cloudera.org:8080/16712 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-11-13 09:07:29 +00:00
Paul Rogers	360f88e207	IMPALA-8181: Abbreviate row counts in EXPLAIN A recent fix added node cardinality to the standard EXPLAIN output, displaying a large number like 123456780 as 123.46M. This patch applies the same fix to the remaining row count numbers: metadata, extrapolated rows, etc. Tests: * Rebased PlannerTest .test files as needed for the new row count format. * Reran all tests to check for dependencies on the old format. Change-Id: I08faaa9ad7b5ed42dcd7a15a333e8734bb45f10c Reviewed-on: http://gerrit.cloudera.org:8080/12438 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-02-26 23:29:56 +00:00
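The abbreviated format mentioned in IMPALA-8181 (123456780 shown as 123.46M) can be approximated as follows. This is a minimal sketch, not Impala's actual formatting code: the function name `abbreviate_count`, the set of suffixes, and the choice to print values below 1000 unchanged are all assumptions.

```python
def abbreviate_count(n):
    """Format a non-negative row count with a K/M/B suffix and two decimals.

    Hypothetical helper; the thresholds and suffixes are assumptions.
    """
    for threshold, suffix in ((10**9, "B"), (10**6, "M"), (10**3, "K")):
        if n >= threshold:
            return "{:.2f}{}".format(n / threshold, suffix)
    # Small counts are printed as-is (assumed behavior).
    return str(n)


print(abbreviate_count(123456780))  # 123.46M
```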
Alex Behm	1a1927b07d	IMPALA-6228: Control stats extrapolation via tbl prop. Introduces a new TBLPROPERTY for controlling stats extrapolation on a per-table basis: impala.enable.stats.extrapolation=true/false The property key was chosen to be consistent with the impalad startup flag --enable_stats_extrapolation and to indicate that the property was set and is used by Impala. Behavior: - If the property is not set, then the extrapolation behavior is determined by the impalad startup flag. - If the property is set, it overrides the impalad startup flag, i.e., extrapolation can be explicitly enabled or disabled regardless of the startup flag. Testing: - added new unit tests - code/hdfs run passed Change-Id: Ie49597bf1b93b7572106abc620d91f199cba0cfd Reviewed-on: http://gerrit.cloudera.org:8080/9139 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-03 22:56:13 +00:00
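The precedence rule stated in IMPALA-6228 (table property overrides the startup flag when set) can be sketched as below. Only the property key comes from the commit message; the function name, the dict-based table properties, and the case-insensitive parsing of the value are assumptions made for illustration.

```python
# Property key as given in the commit message.
TBL_PROP_KEY = "impala.enable.stats.extrapolation"


def extrapolation_enabled(tbl_properties, startup_flag):
    """Decide whether stats extrapolation applies to one table.

    tbl_properties: dict of table property names to string values.
    startup_flag: bool standing in for --enable_stats_extrapolation.
    """
    value = tbl_properties.get(TBL_PROP_KEY)
    if value is None:
        # Property not set: fall back to the impalad startup flag.
        return startup_flag
    # Property set: it overrides the startup flag in either direction.
    return value.strip().lower() == "true"
```

For example, `extrapolation_enabled({TBL_PROP_KEY: "false"}, True)` returns `False`, while an empty property dict returns whatever the flag says.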
Tianyi Wang	f0b3d9d122	IMPALA-3916: Reserve SQL:2016 reserved words This patch reserves SQL:2016 reserved words, excluding: 1. Impala builtin function names. 2. Time unit words (year, month, etc.). 3. An exception list based on a discussion. Some test cases are modified to avoid these words. An impalad and catalogd startup option reserved_words_version is added. The words are reserved if the option is set to "3.0.0". Change-Id: If1b295e6a77e840cf1b794c2eb73e1b9d2b8ddd6 Reviewed-on: http://gerrit.cloudera.org:8080/9096 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Reviewed-by: Philip Zeyliger <philip@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-02 01:13:08 +00:00
Vuk Ercegovac	08ca346f2e	IMPALA-3562: support column restriction for compute stats The 'compute stats' statement currently computes column-level statistics for all columns of a table. This adds potentially unneeded work for columns whose stats are not needed by queries. It can be especially costly for very wide tables and unneeded large string fields. This change modifies the 'compute stats' (non-incremental only) to support a user-specified list of columns for which stats should be computed. An example with the extension is as follows: compute stats my_db.my_table(column_a, column_b); While the phrase "for columns ..." is commonly used, since 'compute stats' seems fairly unique (vs. 'analyze table ...'), this change favors brevity with the parenthesized column list. Whereas currently 'compute stats' is applied to the columns that can be analyzed, the 'compute stats' in this change results in an error when a column is specified that cannot be analyzed (e.g., column does not exist, column is of an unsupported type, column is a partitioning column). Moreover, an empty column list can be supplied which means that no columns will be analyzed. Testing: - analyzing a subset of columns is already supported (e.g., not all columns can be analyzed), so the focus with testing is to check that the user-specified columns are handled as expected. - tests include: parser tests, ddl analysis, end-to-end tests. Change-Id: If8b25dd248e578dc7ddd35468125cca12d1b9f27 Reviewed-on: http://gerrit.cloudera.org:8080/9133 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-01 20:27:14 +00:00
Alex Behm	22d9ac0893	IMPALA-6024: Min sample bytes for COMPUTE STATS TABLESAMPLE Adds a new query option COMPUTE_STATS_MIN_SAMPLE_SIZE which is the minimum number of bytes that will be scanned in COMPUTE STATS TABLESAMPLE, regardless of the user-supplied sampling percent. The motivation is to prevent sampling for very small tables where accurate stats can be obtained cheaply without sampling. This patch changes COMPUTE STATS TABLESAMPLE to run the regular COMPUTE STATS if the effective sampling percent is 0% or 100%. For a 100% sampling rate, the sampling-based stats queries are more expensive and produce less accurate stats than the regular COMPUTE STATS. Default: COMPUTE_STATS_MIN_SAMPLE_SIZE=1GB Testing: - added new unit tests and ran them locally Change-Id: I2cb91a40bec50b599875109c2f7c5bf6f41c2400 Reviewed-on: http://gerrit.cloudera.org:8080/9113 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-31 03:21:09 +00:00
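One way to read the IMPALA-6024 description is that the minimum sample size raises the user-supplied percentage until at least that many bytes would be scanned. The sketch below encodes that reading. The formula, the function names, and the parameters are assumptions; the commit message only states the observable behavior (a byte minimum, and a fallback to regular COMPUTE STATS at 0% or 100%).

```python
def effective_sample_fraction(table_bytes, sample_percent, min_sample_bytes):
    """Return the fraction of the table to scan, in [0.0, 1.0].

    Assumed formula: take the larger of the requested fraction and the
    fraction needed to reach min_sample_bytes, capped at the whole table.
    """
    if table_bytes <= 0:
        return 0.0
    requested = sample_percent / 100.0
    needed_for_minimum = min_sample_bytes / table_bytes
    return min(1.0, max(requested, needed_for_minimum))


def use_regular_compute_stats(fraction):
    """Sampling is skipped when the effective fraction is 0% or 100%."""
    return fraction <= 0.0 or fraction >= 1.0


GB = 1024 ** 3
# A 100 MB table with the 1 GB default minimum ends up at 100%.
small = effective_sample_fraction(100 * 1024 ** 2, 10, 1 * GB)
print(small, use_regular_compute_stats(small))
```

Under this reading, small tables fall back to the regular COMPUTE STATS path, which matches the motivation given in the commit message.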
Alex Behm	1f7b3b00e9	IMPALA-5310: Part 3: Use SAMPLED_NDV() in COMPUTE STATS. Modifies COMPUTE STATS TABLESAMPLE to use the new SAMPLED_NDV() function. Testing: - modified/improved existing functional tests - core/hdfs run passed Change-Id: I6ec0831f77698695975e45ec0bc0364c765d819b Reviewed-on: http://gerrit.cloudera.org:8080/8840 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-16 04:58:59 +00:00
Alex Behm	b3d8a507cb	IMPALA-5310: Add COMPUTE STATS TABLESAMPLE. Adds the TABLESAMPLE clause for COMPUTE STATS. Syntax: COMPUTE STATS <table> TABLESAMPLE SYSTEM(<number>) [REPEATABLE(<number>)] Computes and replaces the table-level row count and total file size, as well as all table-level column statistics. Existing partition-level row counts are not modified. The TABLESAMPLE clause can be used to limit the scanned data volume to a desired percentage. When sampling, the unmodified results of the COMPUTE STATS queries are sent to the CatalogServer. There, the stats are extrapolated before storing them into the HMS so as not to confuse other engines like Hive/SparkSQL which may rely on the shared HMS fields being accurate. Limitations - Only works for HDFS tables - TABLESAMPLE is not supported for COMPUTE INCREMENTAL STATS - TABLESAMPLE requires --enable_stats_extrapolation=true Changes to EXPLAIN The stored statistics from the HMS are more clearly displayed under a 'stored statistics' section. Example: 00:SCAN HDFS [functional.alltypes, RANDOM] partitions=24/24 files=24 size=478.45KB stored statistics: table: rows=7300 size=478.45KB partitions: 24/24 rows=7300 columns: all Testing: - added new functional tests - core/hdfs run passed Change-Id: I7f3e72471ac563adada4a4156033a85852b7c8b7 Reviewed-on: http://gerrit.cloudera.org:8080/8136 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-11-29 22:37:01 +00:00
Alex Behm	e89d7057a6	IMPALA-2373: Extrapolate row counts for HDFS tables. The main idea of this patch is to use table stats to extrapolate the row counts for new/modified partitions. Existing behavior: - Partitions that lack the row count stat are ignored when estimating the cardinality of HDFS scans. Such partitions effectively have an estimated row count of zero. - We always use the row count stats for partitions that have one. The row count may be inaccurate if data in such partitions has changed significantly. Summary of changes: - Enhance COMPUTE STATS to also store the total number of file bytes in the table. - Use the table-level row count and file bytes stats to estimate the number of rows in a scan. - A new impalad startup flag is added to enable/disable the extrapolation behavior. The feature is disabled by default. Note that even with the feature disabled, COMPUTE STATS stores the file bytes so you can enable the feature without having to run COMPUTE STATS again. Testing: - Added new FE unit test - Added new EE test Change-Id: I972c8a03ed70211734631a7dc9085cb33622ebc4 Reviewed-on: http://gerrit.cloudera.org:8080/6840 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-05-26 21:06:17 +00:00
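The extrapolation idea in IMPALA-2373 (estimate scan rows from the table-level row count and file bytes) can be sketched as a simple proportion. The function name, the `-1` sentinel for "unknown", and the rounding are assumptions; the planner's actual handling of edge cases is not described in the commit message.

```python
def extrapolate_row_count(table_rows, table_file_bytes, scan_file_bytes):
    """Estimate rows in a scan from table-level stats.

    Assumes rows are spread evenly over file bytes, so the estimate is
    scan_file_bytes * (table_rows / table_file_bytes).
    Returns -1 when the table-level stats are missing (assumed sentinel).
    """
    if table_rows < 0 or table_file_bytes <= 0:
        return -1
    # Multiply before dividing to limit floating-point error.
    return int(round(scan_file_bytes * table_rows / table_file_bytes))


# A table with 7300 rows in 1000 bytes; scanning 500 bytes of files.
print(extrapolate_row_count(7300, 1000, 500))  # 3650
```

Because the estimate depends only on bytes scanned, partitions added after the last COMPUTE STATS still contribute a non-zero row count.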

12 Commits