Commit Graph

63 Commits

Joe McDonnell
1913ab46ed IMPALA-14501: Migrate most scripts from impala-python to impala-python3
To remove the dependency on Python 2, existing scripts need to use
python3 rather than python. These commands find those
locations (for impala-python and regular python):
git grep impala-python | grep -v impala-python3 | grep -v impala-python-common | grep -v init-impala-python
git grep bin/python | grep -v python3

This removes or switches most of these locations by various means:
1. If a python file has a #!/bin/env impala-python (or python) but
   doesn't have a main function, it removes the hash-bang and makes
   sure that the file is not executable (see the sketch after this list).
2. Most scripts can simply switch from impala-python to impala-python3
   (or python to python3) with minimal changes.
3. The cm-api pypi package (which doesn't support Python 3) has been
   replaced by the cm-client pypi package and interfaces have changed.
   Rather than migrating the code (which hasn't been used in years), this
   deletes the old code and stops installing cm-api into the virtualenv.
   The code can be restored and revamped if there is any interest in
   interacting with CM clusters.
4. This switches tests/comparison over to impala-python3, but this code has
   bit-rotted. Some pieces can be run manually, but it can't be fully
   verified with Python 3. It shouldn't hold back the migration on its own.
5. This also replaces locations of impala-python in comments / documentation /
   READMEs.
6. kazoo (used for interacting with HBase) needed to be upgraded to a
   version that supports Python 3. The newest version of kazoo requires
   upgrades of other component versions, so this uses kazoo 2.8.0 to avoid
   needing other upgrades.
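
A minimal sketch of the check in item 1, assuming a scan over .py files
(the function and names here are illustrative, not the actual migration
script):

  import os
  import stat

  def strip_shebang_if_no_main(path):
      with open(path) as f:
          lines = f.readlines()
      has_shebang = lines and lines[0].startswith("#!")
      has_main = any("__main__" in line for line in lines)
      if has_shebang and not has_main:
          with open(path, "w") as f:
              f.writelines(lines[1:])  # drop the hash-bang line
          # clear the executable bits
          mode = os.stat(path).st_mode
          os.chmod(path, mode & ~(stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))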

The two remaining uses of impala-python are:
 - bin/cmake_aux/create_virtualenv.sh
 - bin/impala-env-versioned-python
These will be removed separately when we drop Python 2 support
completely. In particular, these are useful for testing impala-shell
with Python 2 until we stop supporting Python 2 for impala-shell.

The docker-based tests still use /usr/bin/python, but this can
be switched over independently (and doesn't impact impala-python).

Testing:
 - Ran core job
 - Ran build + dataload on CentOS 7 and Red Hat 8
 - Manual testing of individual scripts (except some bitrotted areas like the
   random query generator)

Change-Id: If209b761290bc7e7c716c312ea757da3e3bca6dc
Reviewed-on: http://gerrit.cloudera.org:8080/23468
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-10-22 16:30:17 +00:00
Riza Suminto
9fc941b611 IMPALA-14327: Update load-data.py and run-workload.py to use HS2
load-data.py is used for dataloading, while run-workload.py is used for
running the perf-AB-test. This patch changes the scripts from using the
Beeswax protocol to the HS2 protocol.
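
For context, a minimal sketch of opening an HS2 connection with impyla
(host and port are assumptions for a local minicluster, not taken from
the patch):

  from impala.dbapi import connect

  conn = connect(host="localhost", port=21050)  # 21050 is Impala's HS2 port
  cursor = conn.cursor()
  cursor.execute("SELECT 1")
  print(cursor.fetchall())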

Testing:
Ran data loading and perf-AB-test-ub2004 with this patch.

Change-Id: I1c3727871b8b2e75c3f10ceabfbe9cb96e36ead3
Reviewed-on: http://gerrit.cloudera.org:8080/23309
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-08-20 07:20:29 +00:00
Michael Smith
131f0c74a3 IMPALA-12939: Bound IMPALA_BUILD_THREADS for cgroups and memory
Updates IMPALA_BUILD_THREADS to bound it based on a guideline of 2 GB of
memory per core during builds. Computes cores and memory from cgroup
limits if applicable; memory is used as a bound on physical memory, as
sometimes cgroups will report a larger limit than available physical
memory.

Uses IMPALA_BUILD_THREADS for load-data.

Adds a default in case USER is unset during bootstrap, which can occur
in devcontainer.
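
A rough sketch of the 2 GB-per-core bound, assuming a Linux host with
cgroup v2 (paths and fallbacks are assumptions; the real logic lives in
the shell environment scripts):

  import multiprocessing
  import os

  cores = multiprocessing.cpu_count()
  phys_mem = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
  try:
      with open("/sys/fs/cgroup/memory.max") as f:
          cgroup_mem = int(f.read())  # raises ValueError on "max"
  except (OSError, ValueError):
      cgroup_mem = phys_mem
  # cgroups can report more than physical memory, so take the minimum
  mem_bytes = min(phys_mem, cgroup_mem)
  build_threads = min(cores, max(1, mem_bytes // (2 * 1024 ** 3)))
  print(build_threads)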

Change-Id: I87994d0464073fe2d91bc2f7c2592c012e42de71
Reviewed-on: http://gerrit.cloudera.org:8080/21200
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2024-09-26 17:00:05 +00:00
Abhishek Rawat
f620e5d5c0 IMPALA-13015: Dataload fails due to concurrency issue with test.jceks
Move the 'hadoop credential' command used for creating test.jceks to
testdata/bin/create-load-data.sh. Earlier it was in bin/load-data.py,
which is invoked in parallel and was causing failures due to race
conditions.

Testing:
- Ran JniFrontendTest#testGetSecretFromKeyStore after data loading; the
test ran clean.

Change-Id: I7fbeffc19f2b78c19fee9acf7f96466c8f4f9bcd
Reviewed-on: http://gerrit.cloudera.org:8080/21346
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-23 11:09:47 +00:00
Yida Wu
9837637d93 IMPALA-12920: Support ai_generate_text built-in function for OpenAI's chat completion API
Added support for following built-in functions:
- ai_generate_text_default(prompt)
- ai_generate_text(ai_endpoint, prompt, ai_model,
  ai_api_key_jceks_secret, additional_params)

'ai_endpoint', 'ai_model' and 'ai_api_key_jceks_secret' are flagfile
options. The 'ai_generate_text_default(prompt)' syntax expects all of
these to be set to proper values. The other syntax will try to use the
provided input parameter values, but falls back to instance-level values
if the inputs are NULL or empty.

Only public OpenAI (api.openai.com) and Azure OpenAI (openai.azure.com)
API endpoints are currently supported.

Exposed these functions in FunctionContext so that they can also be
called from UDFs:
- ai_generate_text_default(context, model)
- ai_generate_text(context, ai_endpoint, prompt, ai_model,
  ai_api_key_jceks_secret, additional_params)

Testing:
- Added unit tests for AiGenerateTextInternal function
- Added fe test for JniFrontend::getSecretFromKeyStore
- Ran manual tests to make sure Impala can talk to OpenAI LLMs using the
'ai_generate_text' built-in function. Example SQL:
select ai_generate_text("https://api.openai.com/v1/chat/completions",
"hello", "gpt-3.5-turbo", "open-ai-key",
'{"temperature": 0.9, "model": "gpt-4"}')
- Tested using standalone UDF SDK and made sure that the UDFs can invoke
  BuiltInFunctions (ai_generate_text and ai_generate_text_default)

Change-Id: Id4446957f6030bab1f985fdd69185c3da07d7c4b
Reviewed-on: http://gerrit.cloudera.org:8080/21168
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-11 07:25:50 +00:00
Riza Suminto
8661f922d3 IMPALA-12601: Add a fully partitioned TPC-DS database
The current tpcds dataset only has the store_sales table fully
partitioned and leaves the other fact tables unpartitioned. This is
intended for faster data loading during tests. However, it is not an
accurate reflection of the larger-scale TPC-DS dataset, where all fact
tables are partitioned. The Impala planner may change the details of the
query plan if a partition column exists.

This patch adds a new dataset, tpcds_partitioned, loading a fully
partitioned TPC-DS db in parquet format named
tpcds_partitioned_parquet_snap. This dataset cannot be loaded
independently and requires the base 'tpcds' db from the tpcds dataset to
be preloaded first. An example of how to load this dataset can be seen
in the load-tpcds-data function in testdata/bin/create-load-data.sh.

This patch also changes PlannerTest#testProcessingCost from targeting
tpcds_parquet to tpcds_partitioned_parquet_snap. Other planner tests
that currently target tpcds_parquet will be gradually changed to test
against tpcds_partitioned_parquet_snap in follow-up patches.

This addition adds a couple of seconds to the "Computing table stats"
step, but the loading itself is negligible since it is parallelized with
TPC-H and functional-query. The total loading time for the three
datasets remains similar after this patch.

This patch also adds several improvements in the following files:

bin/load-data.py:
- Log elapsed time on serial steps.

testdata/bin/create-load-data.sh:
- Rename MSG to LOAD_MSG to avoid collision with the same variable name
  in ./testdata/bin/run-step.sh

testdata/bin/generate-schema-statements.py:
- Remove redundant FILE_FORMAT_MAP.
- Add build_partitioned_load to simplify expressing partitioned insert
  queries in the SQL template (see the sketch after this list).

testdata/datasets/tpcds/tpcds_schema_template.sql:
- Reorder schema template to load all dimension tables before fact tables.
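
A hedged sketch of what a partitioned-insert helper might emit (names
and SQL shape are assumptions, not the actual template code):

  def build_partitioned_load(table, part_col, source):
      return ("INSERT OVERWRITE TABLE {t} PARTITION ({p})\n"
              "SELECT * FROM {s};").format(t=table, p=part_col, s=source)

  print(build_partitioned_load("tpcds_partitioned_parquet_snap.store_sales",
                               "ss_sold_date_sk", "tpcds.store_sales"))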

Testing:
- Pass core tests.

Change-Id: I3a2e66c405639554f325ae78c66628d464f6c453
Reviewed-on: http://gerrit.cloudera.org:8080/20756
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-12-16 02:31:13 +00:00
Riza Suminto
378169be1f Revert "Revert "IMPALA-9923: Load ORC serially to hack around ...""
This reverts commit b03e8ef95c.

IMPALA-12630 reports that several tests were broken due to loading ORC
in parallel with other non-text table formats. ORC tables return to
loading serially after this commit.

Change-Id: I5d3f2ee1c15f9aff6aa632a78d86ba32c640e53d
Reviewed-on: http://gerrit.cloudera.org:8080/20795
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-12-15 08:50:24 +00:00
Riza Suminto
b03e8ef95c Revert "IMPALA-9923: Load ORC serially to hack around flakiness"
This reverts commit dc2fdabbd1.

A newer Hive version and other fixes now allow ORC loading to happen in
parallel.

Change-Id: I67f4051dd07273f2b51843cb5c1ec2cf185c5924
Reviewed-on: http://gerrit.cloudera.org:8080/20755
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-12-13 06:11:52 +00:00
Joe McDonnell
c233634d74 IMPALA-11975: Fix Dictionary methods to work with Python 3
Python 3 made the main dictionary methods lazy (items(),
keys(), values()). This means that code that uses those
methods may need to wrap the call in list() to get a
list immediately. Python 3 also removed the old iter*
lazy variants.

This changes all locations to use Python 3 dictionary
methods and wraps calls with list() appropriately.
This also changes all iteritems(), itervalues(), iterkeys()
locations to items(), values(), keys(), etc. Python 2
will not use the lazy implementation of these, so there
is a theoretical performance impact. Our python code is
mostly for tests and the performance impact is minimal.
Python 2 will be deprecated when Python 3 is functional.
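
As an illustration of the pattern (not code from the patch):

  d = {"a": 1, "b": 2}
  # Python 3: d.items() is a lazy view; wrap it in list() when a concrete
  # list is needed, e.g. to delete entries while iterating.
  for k, v in list(d.items()):
      if v > 1:
          del d[k]
  # Old spellings like d.iteritems() become plain d.items().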

This addresses these pylint warnings:
dict-iter-method
dict-keys-not-iterating
dict-values-not-iterating

Testing:
 - Ran core tests

Change-Id: Ie873ece54a633a8a95ed4600b1df4be7542348da
Reviewed-on: http://gerrit.cloudera.org:8080/19590
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
eb66d00f9f IMPALA-11974: Fix lazy list operators for Python 3 compatibility
Python 3 changes list operators such as range, map, and filter
to be lazy. Some code that expects the list operators to happen
immediately will fail, e.g.:

Python 2:
range(0,5) == [0,1,2,3,4]
True

Python 3:
range(0,5) == [0,1,2,3,4]
False

The fix is to wrap locations with list(), i.e.:

Python 3:
list(range(0,5)) == [0,1,2,3,4]
True

Since the base operators are now lazy, Python 3 also removes the
old lazy versions (e.g. xrange, ifilter, izip, etc). This uses
future's builtins package to convert the code to the Python 3
behavior (i.e. xrange -> future's builtins.range).
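
A small sketch of the futurized pattern (requires the 'future' package
on Python 2; on Python 3 the stdlib builtins module makes these imports
no-ops):

  from builtins import filter, range  # Python-3-style lazy builtins

  evens = list(filter(lambda x: x % 2 == 0, range(10)))
  assert evens == [0, 2, 4, 6, 8]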

Most of the changes were done via these futurize fixes:
 - libfuturize.fixes.fix_xrange_with_import
 - lib2to3.fixes.fix_map
 - lib2to3.fixes.fix_filter

This eliminates the pylint warnings:
 - xrange-builtin
 - range-builtin-not-iterating
 - map-builtin-not-iterating
 - zip-builtin-not-iterating
 - filter-builtin-not-iterating
 - reduce-builtin
 - deprecated-itertools-function

Testing:
 - Ran core job

Change-Id: Ic7c082711f8eff451a1b5c085e97461c327edb5f
Reviewed-on: http://gerrit.cloudera.org:8080/19589
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
82bd087fb1 IMPALA-11973: Add absolute_import, division to all eligible Python files
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
 1. Python 3 requires absolute imports within packages. This
    can be emulated via "from __future__ import absolute_import"
 2. Python 3 changed division to "true" division that doesn't
    round to an integer. This can be emulated via
    "from __future__ import division"

This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.

I scrutinized each old-division location and converted some locations
to use the integer division '//' operator if it needed an integer
result (e.g. for indices, counts of records, etc). Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
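
The header added to each file, plus a division example of the kind that
was converted (the variable is illustrative):

  from __future__ import absolute_import, division, print_function

  records = [1, 2, 3, 4, 5]
  print(len(records) / 2)   # true division: 2.5 on both Python 2 and 3
  print(len(records) // 2)  # floor division where an integer is needed: 2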

Testing:
 - Ran core tests

Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
2b550634d2 IMPALA-11952 (part 2): Fix print function syntax
Python 3 now treats print as a function and requires
parentheses in the invocation.

print "Hello World!"
is now:
print("Hello World!")

This fixes all locations to use the function
invocation. This is more complicated when the output
is being redirected to a file or when avoiding the
usual newline.

print >> sys.stderr , "Hello World!"
is now:
print("Hello World!", file=sys.stderr)

To support this properly and guarantee equivalent behavior
between python 2 and python 3, all files that use print
now add this import:
from __future__ import print_function
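
The newline-avoidance case mentioned above looks roughly like this after
conversion (a sketch, not a specific changed file):

  from __future__ import print_function
  import sys

  # Python 2's trailing comma (print "x",) becomes an explicit end:
  print("no trailing newline", end="")
  print(" and to stderr", file=sys.stderr)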

This also fixes random flake8 issues that intersect with
the changes.

Testing:
 - check-python-syntax.sh shows no errors related to print

Change-Id: Ib634958369ad777a41e72d80c8053b74384ac351
Reviewed-on: http://gerrit.cloudera.org:8080/19552
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2023-02-28 17:11:50 +00:00
stiga-huang
e71ea69bb8 IMPALA-10459: Remove workarounds for MAPREDUCE-6441
MAPREDUCE-6441 is resolved and is in our toolchain. This patch removes
workarounds for it.

Tests:
 - Ran exhaustive test.

Change-Id: I5c4d482a6d15cdc08e9cf8878e130399665a8ee0
Reviewed-on: http://gerrit.cloudera.org:8080/17011
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-02-05 06:35:26 +00:00
Joe McDonnell
dc2fdabbd1 IMPALA-9923: Load ORC serially to hack around flakiness
ORC dataload has been intermittently failing with
"Fail to get checksum, since file .../_orc_acid_version is under construction."
This is due to some Hive/HDFS interaction that seems to get
worse with parallelism.

This has been hitting a lot of developer tests. As a temporary
workaround, this changes dataload to load ORC serially. This is
slightly slower, but it should be more reliable.

Testing:
 - Ran precommit tests, manually verified dataload logs

Change-Id: I15eff1ec6cab32c1216ed7400e4c4b57bb81e4cd
Reviewed-on: http://gerrit.cloudera.org:8080/16292
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-08-06 00:06:53 +00:00
Tim Armstrong
0a9ea803d2 IMPALA-7290: part 1: clean up shell tests
This sets up the tests to be extensible to test shell
in both beeswax and HS2 modes.

Testing:
* Add test dimension containing only beeswax in preparation
  for HS2 dimension.
* Factor out hardcoded ports.
* Add tests for formatting of all types and NULL values.
* Merge date shell test into general type tests.
* Added testing for floating point output formatting, which does
  change as a result of switching to server-side vs client-side
  formatting.
* Use unique_database for tests that create tables.

Change-Id: Ibe5ab7f4817e690b7d3be08d71f8f14364b84412
Reviewed-on: http://gerrit.cloudera.org:8080/13083
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-04-30 11:30:45 +00:00
stiga-huang
357c0a959d IMPALA-7490: fix uninitialized variables in load-data.py
Fixes use of an uninitialized variable in bin/load-data.py

I found the following error message in a failed build, which is quite
misleading:

Traceback (most recent call last):
  File "bin/load-data.py", line 495, in <module>
    if __name__ == "__main__": main()
  File "bin/load-data.py", line 459, in main
    impala_exec_query_files_parallel(thread_pool, impala_create_files)
  File "bin/load-data.py", line 297, in impala_exec_query_files_parallel
    exec_query_files_parallel(thread_pool, query_files, 'impala')
  File "bin/load-data.py", line 291, in exec_query_files_parallel
    for result in thread_pool.imap_unordered(execution_function, query_files):
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 659, in next
    raise value
UnboundLocalError: local variable 'query' referenced before assignment

The error is thrown from the 'execution_function', which is
'exec_impala_query_from_file' in my case. We should not use 'query' if
it's undefined.
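
A self-contained sketch of the failure mode and one possible guard (the
names are hypothetical, not the actual load-data.py code):

  def run_queries(queries):
      query = None  # bind up front so handlers can never see it unbound
      try:
          for query in queries:
              if "bad" in query:
                  raise ValueError(query)
      except ValueError:
          print("Failed while executing: %r" % query)

  run_queries(["select 1", "bad query"])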

Change-Id: If0dd56a9b78a60b3a9f04d9f61e93b4b5d066b76
Reviewed-on: http://gerrit.cloudera.org:8080/11330
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-09-01 12:00:43 +00:00
Joe McDonnell
573550ca2f IMPALA-7088: Fix uninitialized variable in cluster dataload
bin/load-data.py uses a unique directory for local Hive
execution to avoid a race condition when executing multiple
Hive commands at once. This unique directory is not needed
when loading on a real cluster. However, the code to remove
the unique directory at the end does not handle this
correctly.

This skips the code to remove the unique directory when
it is uninitialized.
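
A minimal sketch of the guard (the variable name is hypothetical):

  import os
  import shutil

  hive_local_dir = None  # only set when loading against a local minicluster
  # ... data loading happens here ...
  if hive_local_dir and os.path.exists(hive_local_dir):
      shutil.rmtree(hive_local_dir)  # skip cleanup when never initialized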

Change-Id: I5581a45460dc341842d77eaa09647e50f35be6c7
Reviewed-on: http://gerrit.cloudera.org:8080/10526
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-29 21:34:02 +00:00
Joe McDonnell
b126b2d105 IMPALA-6972: Disable parallel dataload on MINICLUSTER_PROFILE=2
There is a bug in Hive 1.1.0 that can result
in a NullPointerException when doing parallel Hive
operations (see IMPALA-6532). Since dataload goes
parallel on Hive loads starting with IMPALA-6372,
dataload can hit this error on Hive 1.1.0 (i.e.
IMPALA_MINICLUSTER_PROFILE=2). This is impacting
builds on the 2.x branch.

This disables parallel dataload for IMPALA_MINICLUSTER_PROFILE=2.

IMPALA_MINICLUSTER_PROFILE=3 uses a newer version
of Hive that has a fix for this, so this continues
to use parallel dataload for that case.

Parallelism can be reenabled when Hive 1.1.0 gets the
fix from Hive 2.1.1.

Change-Id: I90a0f2b3756d7192fa7db2958031b8c88eb606e6
Reviewed-on: http://gerrit.cloudera.org:8080/10306
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-10 01:30:13 +00:00
Joe McDonnell
d481cd4842 IMPALA-6372: Go parallel for Hive dataload
This changes generate-schema-statements.py to produce
separate SQL files for different file formats for Hive.
This changes load-data.py to go parallel on these
separate Hive SQL files. For correctness, the text
version of all tables must be loaded before any
of the other file formats.

load-data.py runs DDLs to create the tables in Impala
and goes parallel. Currently, there are some minor
dependencies so that text tables must be created
prior to creating the other table formats. This
changes the definitions of some tables in
testdata/datasets/functional/functional_schema_template.sql
to remove these dependencies. Now, the DDLs for the
text tables can run in parallel to the other file formats.

To unify the parallelism for Impala and Hive, load-data.py
now uses a single fixed-size pool of processes to run all
SQL files rather than spawning a thread per SQL file.
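
A rough sketch of the fixed-size pool pattern (file names and pool size
are illustrative):

  from multiprocessing.pool import ThreadPool

  def exec_sql_file(path):
      return path  # stand-in for running the file via Impala or Hive

  pool = ThreadPool(processes=4)
  for finished in pool.imap_unordered(exec_sql_file,
                                      ["a.sql", "b.sql", "c.sql"]):
      print("done: %s" % finished)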

This also modifies the locations that do invalidate to
use refresh where possible and eliminate global
invalidates.

For debuggability, different SQL executions output to
different log files rather than to standard out. If an
error occurs, this will point out the relevant log
file.

This saves about 10-15 minutes on dataload (including
for GVO).

Change-Id: I34b71e6df3c8f23a5a31451280e35f4dc015a2fd
Reviewed-on: http://gerrit.cloudera.org:8080/8894
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-14 00:16:26 +00:00
Tim Armstrong
f6576adc2e IMPALA-6455: unique tmpdirs for test_partition_metadata_compatibility
Concurrent hive statements running in local mode can race to modify
the contents of temporary directories - see IMPALA-6108. This applies
the workaround for IMPALA-6108 to the run_stmt_in_hive() utility
function, which is used by test_partition_metadata_compatibility.

Testing:
I wasn't able to reproduce the race locally, but I ran the test and
confirmed that it still passed. I also confirmed that the temporary
directories /tmp/impala-tests-* were created using "ls" while the
tests were running.

Change-Id: Ibabff859d19ddbb2a3048ecc02897a611d8ddb20
Reviewed-on: http://gerrit.cloudera.org:8080/9165
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-31 22:08:48 +00:00
Joe McDonnell
d9b6fd0730 IMPALA-6386: Invalidate metadata at table level for dataload
Dataload currently executes bin/load-data.py for TPC-H,
TPC-DS, and functional-query concurrently. One of the final
steps for bin/load-data.py is to run a global "invalidate
metadata". Global "invalidate metadata" commands are known
to cause problems on concurrent systems. See IMPALA-5087.
For dataload, if TPC-H executes "invalidate metadata" while
TPC-DS is still creating tables and adding partitions,
the TPC-DS executor might erroneously believe that a table
does not exist.

This changes dataload to invalidate metadata at an
individual table level rather than globally. This
prevents the concurrency issue.
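
Illustratively (the table list is hypothetical), dataload now issues
statements like:

  tables = ["tpcds.store_sales", "tpch.lineitem"]
  stmts = ["INVALIDATE METADATA %s;" % t for t in tables]
  print("\n".join(stmts))  # rather than one global "INVALIDATE METADATA;"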

This also changes the names of some of the intermediate
SQL files generated by generate-schema-statements.py
and consumed by load-data.py to make them less confusing.

Change-Id: Ibc3a6d8a674a0bf6b02069bfe8a5e12034335b1f
Reviewed-on: http://gerrit.cloudera.org:8080/9009
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-17 22:52:58 +00:00
Philip Zeyliger
76111ce168 IMPALA-6108, IMPALA-6070: Parallel data load (re-instated).
This is a revert of a revert, re-enabling parallel data load. It avoids
the race condition by explicitly configuring the temporary directory in
question in load-data.py.
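
A sketch of the idea, assuming the unique directory is passed through to
Hive's local job runner (the exact property name is an assumption):

  import tempfile

  tmp_dir = tempfile.mkdtemp(prefix="impala-data-load-")
  hive_args = ["--hiveconf", "hadoop.tmp.dir=%s" % tmp_dir]
  print(hive_args)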

When the parallel data load change went in, we discovered
a race with a signature of:

  java.io.FileNotFoundException: File
  /tmp/hadoop-jenkins/mapred/local/1508958341829_tmp does not exist

The number in this path is milliseconds since the epoch, and the race
occurs when two queries submitted to HiveServer2, running with the local
runner, hit the same millisecond time stamp.  The upstream bug is
https://issues.apache.org/jira/browse/MAPREDUCE-6441, and I described the
symptoms in https://issues.apache.org/jira/browse/MAPREDUCE-6992 (which
is now marked as a dupe).

I've tested this by running data load 5 times on the same machines
where it failed before. I also ran data load manually and inspected
the system to make sure that the temporary directories are getting
created as expected in /tmp/impala-data-load-*.

Change-Id: I60d65794da08de4bb3eb439a2414c095f5be0c10
Reviewed-on: http://gerrit.cloudera.org:8080/8405
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-02 00:40:19 +00:00
Jim Apple
07a7138817 Add a script to test performance on a developer machine
This is a migration from an old and broken script from another
repository. Example use:

    bin/single_node_perf_run.py --ninja --workloads targeted-perf \
      --load --scale 4 --iterations 20 --num_impalads 3 \
      --start_minicluster --query_names PERF_AGG-Q3 \
      $(git rev-parse HEAD~1) $(git rev-parse HEAD)

The script can load data, run benchmarks, and compare the statistics
of those runs for significant differences in performance. It glues
together buildall.sh, bin/load-data.py, bin/run-workload.py, and
tests/benchmark/report_benchmark_results.py.

Change-Id: I70ba7f3c28f612a370915615600bf8dcebcedbc9
Reviewed-on: http://gerrit.cloudera.org:8080/6818
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
2017-05-31 08:10:48 +00:00
Martin Grund
ce4c5f6743 IMPALA-4365: Enabling end-to-end tests on a remote cluster
This patch lays the groundwork for loading data and running end-to-end
tests on a remote CDH cluster. The requirements for the cluster to run
the tests are:

  - Managed by Cloudera Manager (CM)
  - GPL Extras need to be installed
  - KMS and KeyTrustee installed and available as a service
  - SERDEPROPERTIES in the Hive DB modified to accept wide tables
  - Hive warehouse dir points to /test-warehouse

The actual data loading is done via a new script, remote_data_load.py,
which takes the CM host as an argument. It can be run from a client
machine that is not a node of the cluster, but it needs to have the
Impala repo checked out and Impala built. This ensures that all of the
necessary data load scripts are available, as well as setting up the
environment properly (client binaries like beeline and the hbase shell
are available, python libraries like cm_api are installed, necessary
environment variables are defined, etc.)

It should be noted that running remote_data_load.py will overwrite
any local XML config files with the configurations downloaded from
the remote cluster.

Usage: remote_data_load.py [options] <cm_host address>

Options:
  -h, --help            show this help message and exit
  --snapshot-file=SNAPSHOT_FILE
                        Path to the test-warehouse archive
  --cm-user=CM_USER     Cloudera Manager admin user
  --cm-pass=CM_PASS     Cloudera Manager admin user password
  --gateway=GATEWAY     Gateway host to upload the data from. If not
                        set, uses the CM host as gateway.
  --ssh-user=SSH_USER   System user on the remote machine with
                        passwordless SSH configured.
  --no-load             Do not try to load the snapshot
  --exploration-strategy=EXPLORATION_STRATEGY
  --test                Run end-to-end tests against cluster

Testing:

This patch is being submitted with the understanding that there are
still cleanup issues that need to be addressed in the remote data
load script, for which JIRAs have been filed.

However, since many of the existing build scripts also had to be
modified, it is more important to make sure that no regressions were
inadvertently introduced into the existing data load process. Loading
data to a local mini-cluster was checked repeatedly while this patch
was being developed, as well as running it against the Jenkins job
that provides the test-warehouse snapshot used by the many other
Impala CI builds that run daily.

Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Reviewed-on: http://gerrit.cloudera.org:8080/4769
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2016-11-08 10:16:55 +00:00
Lars Volker
ef4c9958d0 IMPALA-4047: Remove occurrences of 'CDH'/'cdh' from repo
This change removes some of the occurrences of the strings 'CDH'/'cdh'
from the Impala repository. References to Cloudera-internal Jiras have
been replaced with upstream Jira issues on issues.cloudera.org.

For several categories of occurrences (e.g. pom.xml files,
DOWNLOAD_CDH_COMPONENTS) I also created a list of follow-up Jiras to
remove the occurrences left after this change.

Change-Id: Icb37e2ef0cd9fa0e581d359c5dd3db7812b7b2c8
Reviewed-on: http://gerrit.cloudera.org:8080/4187
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-10-13 00:40:41 +00:00
Dan Hecht
ffa7829b70 IMPALA-3918: Remove Cloudera copyrights and add ASF license header
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:

http://www.apache.org/legal/src-headers.html#headers

Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
   http://www.apache.org/legal/src-headers.html#notice
   to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
   on the website.

Much of this change was automatically generated via:

git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]

Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.

[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
    modification to ORIG_LICENSE to match Impala's license text.

Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-08-09 08:19:41 +00:00
Tim Armstrong
c1d70f814e IMPALA-3227: generate test TPC data sets during data load
The generated data is identical to the pregenerated tpch.tar.gz
and tpcds.tar.gz data that were used previously and were not
publicly accessible.

This adds a "preload" hook to bin/load-data.py that can execute custom
logic for each data set. This is used to call the TPC-H and TPC-DS data
generation utilities that are already available in the Impala toolchain.
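
A hedged sketch of the hook shape (names are illustrative, not the
actual load-data.py code):

  def generate_tpc_data(dataset):
      print("generating %s via the toolchain data generator" % dataset)

  PRELOAD_HOOKS = {
      "tpch": lambda: generate_tpc_data("tpch"),
      "tpcds": lambda: generate_tpc_data("tpcds"),
  }

  def load_dataset(name):
      hook = PRELOAD_HOOKS.get(name)
      if hook:
          hook()  # custom per-dataset logic runs before the normal load

  load_dataset("tpch")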

Testing:
Ran private test job with loading from snapshot disabled and without
the tpch/tpcds tarballs available.

Change-Id: Ieccfbd7d8d4a91bffddbe35abb7f5572e71a71cf
Reviewed-on: http://gerrit.cloudera.org:8080/3761
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-07-28 04:56:57 +00:00
Tim Armstrong
bc8c55afcd IMPALA-3729: batch_size=1 coverage for avro scanner
Also fix a stale comment in the avro scanner header.

The main work here is to fix the handling of empty result sets in the
test result verifier. This is a problem because we wanted to verify
that the results in the test file were a superset of the rows
returned, and this was thrown off by superfluous '' rows in the expected
and actual result sets.

The basic problem is that the way test file sections
were parsed conflated an empty result section with a non-empty result
section that had a single empty string. I.e.:
---- RESULTS
====

vs
---- RESULTS

====

both got resolved to [''].
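
In other words (a toy illustration, not the verifier's actual data
structures):

  empty_section = []    # ---- RESULTS immediately followed by ====
  one_blank_row = [""]  # ---- RESULTS, a blank line, then ====
  # The old parsing resolved both cases to [''], conflating "no rows"
  # with "one row that is the empty string".
  assert empty_section != one_blank_row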

Change-Id: Ia007e558d92c7e4ce30be90446fdbb1f50a0ebc4
Reviewed-on: http://gerrit.cloudera.org:8080/3413
Tested-by: Internal Jenkins
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2016-07-19 23:30:02 -07:00
Alex Behm
7e76e92bef Consolidate test and cluster logs under a single directory.
All logs, test results and SQL files generated during data
loading and testing are now consolidated under a single new
directory $IMPALA_HOME/logs. The goal is to simplify archiving
in Jenkins runs and debugging.

The new structure is as follows:

$IMPALA_HOME/logs/cluster
- logs of Hadoop components and Impala

$IMPALA_HOME/logs/data_loading
- logs and SQL files produced in data loading

$IMPALA_HOME/logs/fe_tests
- logs and test output of Frontend unit tests

$IMPALA_HOME/logs/be_tests
- logs and test output of Backend unit tests

$IMPALA_HOME/logs/ee_tests
- logs and test output of end-to-end tests

$IMPALA_HOME/logs/custom_cluster_tests
- logs and test output of custom cluster tests

I tested this change with a full data load which
was successful.

Change-Id: Ief1f58f3320ec39d31b3c6bc6ef87f58ff7dfdfa
Reviewed-on: http://gerrit.cloudera.org:8080/2456
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-03-28 19:23:22 +00:00
Casey Ching
d202d6a967 Use "impala-python" (virtualenv) instead of system python
Python tests and infra scripts will now use "python" from the virtualenv
via $IMPALA_HOME/bin/impala-python. Some scripts could be simplified now
that Python 2.6 and a dependable set of third-party libraries are
available, but that is not done as part of this commit.

Change-Id: If1cf96898d6350e78ea107b9026b12ba63a4162f
Reviewed-on: http://gerrit.cloudera.org:8080/603
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2015-08-06 02:09:09 +00:00
Martin Grund
f58159d431 [CDH5] IMPALA-1141: HBase Planner Performance
This patch improves the performance of the planning phase of a query
querying HBase tables. It removes an unnecessary second call to compute
stats and adds a new version for estimating the row count in a table.

This patch adds an incremental version to estimate the number of rows
for a set of regions. This incremental version will start querying up to
five regions to calculate the average row size and use this value to
estimate the row count based on the size of the regions on disk. Only if
the standard deviation from the average is larger than 15% query an
additional region, it will query additional regions to calculate an
average with more confidence.

If the data is balanced it will not be necessary to retrieve data from
all regions but only from a subset. In the worst case, all regions are
queried.
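
A back-of-the-envelope sketch of the estimate (thresholds are from this
message; the code is illustrative, the real implementation is in the
frontend planner):

  import statistics

  def estimate_rows(region_sizes_bytes, sampled_avg_row_sizes):
      avg_row = statistics.mean(sampled_avg_row_sizes)
      if statistics.pstdev(sampled_avg_row_sizes) > 0.15 * avg_row:
          pass  # low confidence: the planner would sample more regions
      return int(sum(region_sizes_bytes) / avg_row)

  print(estimate_rows([64 << 20, 80 << 20], [120.0, 110.0, 130.0]))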

Change-Id: Idcb3bea81b11cb08da6d9329ba66c86aca23e170
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5258
Tested-by: jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
2014-11-14 13:47:02 -08:00
Mike Yoder
75a97d3d7e [CDH5] Kerberize mini-cluster and Impala daemons
This is the first iteration of a kerberized development environment.
All the daemons start and use kerberos, with the sole exception of the
hive metastore.  This is sufficient to test impala authentication.

When buildall.sh is run using '-kerberize', it will stop before
loading data or attempting to run tests.

Loading data into the cluster is known to not work at this time, the
root causes being that Beeline -> HiveServer2 -> MapReduce throws
errors, and Beeline -> HiveServer2 -> HBase has problems.  These are
left for later work.

However, the impala daemons will happily authenticate using kerberos
both from clients (like the impala shell) and amongst each other.
This means that if you can get data into the mini-cluster, you could
query it.

Usage:
* Supply a '-kerberize' option to buildall.sh, or
* Supply a '-kerberize' option to create-test-configuration.sh, then
  'run-all.sh -format', re-source impala-config.sh, and then start
  impala daemons as usual.  You must reformat the cluster because
  kerberizing it will change all the ownership of all files in HDFS.

Notable changes:
* Added clean start/stop script for the llama-minikdc
* Creation of Kerberized HDFS - namenode and datanodes
* Kerberized HBase (and Zookeeper)
* Kerberized Hive (minus the MetaStore)
* Kerberized Impala
* Loading of data very nearly working

Still to go:
* Kerberize the MetaStore
* Get data loading working
* Run all tests
* The unknown unknowns
* Extensive testing

Change-Id: Iee3f56f6cc28303821fc6a3bf3ca7f5933632160
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4019
Reviewed-by: Michael Yoder <myoder@cloudera.com>
Tested-by: jenkins
2014-09-05 12:36:21 -07:00
Alex Behm
3d764619f7 Run Hive data loading through beeline instead of the Hive shell.
Fixes our log configuration to put the Hive logs in cluster_logs/hive.

Change-Id: I5d98581e35325f2173e4b3170e36bec42d33f8f3
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1497
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1615
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
2014-02-20 15:43:31 -08:00
ishaan
01ef3ef4c1 load-data.py should exit if a bash command returns a non-zero error code.
Change-Id: I2f732a276a42d2697fa55bce0f18ac89e9a6f0a1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1397
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1408
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-30 15:47:13 -08:00
ishaan
4e9913b52f Fix race in data loading by creating text tables first.
While loading parquet, there are a few table creation queries that use the 'like'
keyword; this ends up opening a small race window when all the table formats are created
concurrently. With this change, we create the text tables first before attempting to
parallelize the rest of the data loading.

Change-Id: Ib84cf0e5120b3588d3f0503d7119ca055e08e53f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1241
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-01-10 15:01:59 -08:00
Nong Li
056c7d94d6 Remove compute stats option from bin/load-data.py
This option is not implemented in this script, and it is not obvious
that it does nothing.

Change-Id: I1a1eff38460fd181c486cfca2840108a58e21603
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1059
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-10 14:01:35 -08:00
ishaan
0ed1781323 Invalidate metadata before loading parquet data through Impala.
During a full data load, we load all the data (except parquet) via Hive, and then load the
parquet data via Impala. The catalog service does not update the metadata of tables
changed outside Impala, so we need to explicitly invalidate the metadata before loading
parquet data.

Change-Id: Iec39db9ea46e4a11b17589881732629a56444120
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1207
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:39 -08:00
Lenni Kuff
baf79f8185 Call 'invalidate metadata' after loading test data instead of before
Instead of calling 'invalidate metadata' before loading each workload
we should call it once, after loading all test data. This will allow
us to pick up data inserted by Hive. The only reason this worked before
is because we restart Impala before running the tests. This will also
be a bit faster when loading multiple workloads.

Change-Id: I28d42bbf5d7a24b5fde687d67a4b41472ec4b897
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1153
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:37 -08:00
ishaan
287953e87c Better error logging while loading data.
Change-Id: I67cbd9fd1d915ea043a731b7951f29fec25fc446
Reviewed-on: http://gerrit.ent.cloudera.com:8080/982
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:13 -08:00
ishaan
bf5359be8d Cleanup Impala connections after data is loaded.
Change-Id: I152b09808740d5344462bcbaf4df4b71d88504cc
Reviewed-on: http://gerrit.ent.cloudera.com:8080/953
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:02 -08:00
Lenni Kuff
f579ee8b25 Fix logging in load-data to print the query being executed
Change-Id: I4332e8d3a340f11e1bbb1f6c5126b0b9b4a2ad8e
Reviewed-on: http://gerrit.ent.cloudera.com:8080/949
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:53:58 -08:00
ishaan
fcdcf1a9d8 Parallelize data loaded through Impala to speed up data loading.
Currently, we execute all the queries involved in data loading serially. This change
creates a separate .sql file for each file format, compression codec and compression
scheme combination, and executes all the files in parallel. Additionally, we now store all the
.sql files (independent of workload) in $IMPALA_HOME/data_load_files/<dataset_name>. Note
that only data loaded through Impala is parallelized; data loaded through Hive and HBase
remains serial.

On our build machines, the time taken to load all the data from snapshot was on the order
of 15 minutes.

Change-Id: If8a862c43f0e75b506ca05d83eacdc05621cbbf8
Reviewed-on: http://gerrit.ent.cloudera.com:8080/804
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-01-08 10:53:53 -08:00
ishaan
565d15579c Add the ability to use a workload as the unit of execution in the Impala benchmark runner.
At the moment, a query is the default unit of execution and parallelism in the Impala
performance suite. With this change, we now have the ability to treat a workload as the
unit of execution. A workload is defined as a unique combination of the dataset, scale
factor, a subset (or all) of the queries in the dataset, and a table format (file format,
compression codec and compression scheme).

It introduces two new command line options in bin/run-workload.py:
  * --execution_scope
    The default scope is 'query', and it maintains previous semantics. The
    new scope is 'workload', which toggles the unit of execution to a workload.
  * --shuffle_query_exec_order.
    Shuffles the order in which queries are executed (only applicable when the
    execution_scope is 'workload'); defaults to False.

Change-Id: I790d75f0896210cda8eb999015b0be04246e4c45
Reviewed-on: http://gerrit.ent.cloudera.com:8080/503
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:53:07 -08:00
Lenni Kuff
a1f2f72f49 Add Impala DDL support for creation of AVRO tables + support for CREATE/ALTER SERDEPROPERTIES
This change adds Impala DDL support for creation of AVRO tables.
Additionally, it adds Impala support for CREATE and ALTER SERDEPROPERTIES
which are used when creating Avro backed tables. This syntax is not
exactly the same as the Hive support since it introduces a new
fileformat (AVROFILE) that implies the needed Serialization library,
input format, and output format.

Change-Id: I5047e419198a89599e9d014fdedfee1a20437a7d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/464
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:52:48 -08:00
ishaan
53cd9eadab Treat HBase as a file format for functional tests
Change-Id: Ia01181a1e10eb108419122d347e9d869a69e8922
Reviewed-on: http://gerrit.ent.cloudera.com:8080/102
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:52:36 -08:00
Alan Choi
ecee109e68 IMPALA-387 Add refresh/invalidate SQL 2014-01-08 10:51:25 -08:00
Alan Choi
b71357fc28 IMPALA-387 Reuse Hdfs and Hive metastore metadata to perform a fast incremental refresh 2014-01-08 10:51:17 -08:00
Lenni Kuff
2f7198292a Add support for auxiliary workloads, tests, and datasets
This change adds support for auxiliary workloads, tests, and datasets. This is useful
to augment the regular test runs with some additional tests that do not belong in the
main Impala repo.
2014-01-08 10:50:32 -08:00
Nong Li
1f6481382e Fix parquet test setup. 2014-01-08 10:49:41 -08:00
Lenni Kuff
cba9cd00dd Fix full data load build break due to constructing incorrect HDFS paths 2014-01-08 10:49:34 -08:00