Commit Graph

63 Commits

Joe McDonnell
1913ab46ed IMPALA-14501: Migrate most scripts from impala-python to impala-python3
To remove the dependency on Python 2, existing scripts need to use
python3 rather than python. These commands find those
locations (for impala-python and regular python):
git grep impala-python | grep -v impala-python3 | grep -v impala-python-common | grep -v init-impala-python
git grep bin/python | grep -v python3

This removes or switches most of these locations by various means:
1. If a python file has a #!/bin/env impala-python (or python) but
   doesn't have a main function, it removes the hash-bang and makes
   sure that the file is not executable (see the sketch after this list).
2. Most scripts can simply switch from impala-python to impala-python3
   (or python to python3) with minimal changes.
3. The cm-api pypi package (which doesn't support Python 3) has been
   replaced by the cm-client pypi package and interfaces have changed.
   Rather than migrating the code (which hasn't been used in years), this
   deletes the old code and stops installing cm-api into the virtualenv.
   The code can be restored and revamped if there is any interest in
   interacting with CM clusters.
4. This switches tests/comparison over to impala-python3, but this code has
   bit-rotted. Some pieces can be run manually, but it can't be fully
   verified with Python 3. It shouldn't hold back the migration on its own.
5. This also replaces locations of impala-python in comments / documentation /
   READMEs.
6. kazoo (used for interacting with HBase) needed to be upgraded to a
   version that supports Python 3. The newest version of kazoo requires
   upgrades of other component versions, so this uses kazoo 2.8.0 to avoid
   needing other upgrades.
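
A minimal sketch of the check in item 1, assuming a scan over .py files
(the function and names here are illustrative, not the actual migration
script):

  import os
  import stat

  def strip_shebang_if_no_main(path):
      with open(path) as f:
          lines = f.readlines()
      has_shebang = lines and lines[0].startswith("#!")
      has_main = any("__main__" in line for line in lines)
      if has_shebang and not has_main:
          with open(path, "w") as f:
              f.writelines(lines[1:])  # drop the hash-bang line
          # clear the executable bits
          mode = os.stat(path).st_mode
          os.chmod(path, mode & ~(stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))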

The two remaining uses of impala-python are:
 - bin/cmake_aux/create_virtualenv.sh
 - bin/impala-env-versioned-python
These will be removed separately when we drop Python 2 support
completely. In particular, these are useful for testing impala-shell
with Python 2 until we stop supporting Python 2 for impala-shell.

The docker-based tests still use /usr/bin/python, but this can
be switched over independently (and doesn't impact impala-python).

Testing:
 - Ran core job
 - Ran build + dataload on CentOS 7 and Red Hat 8
 - Manual testing of individual scripts (except some bitrotted areas like the
   random query generator)

Change-Id: If209b761290bc7e7c716c312ea757da3e3bca6dc
Reviewed-on: http://gerrit.cloudera.org:8080/23468
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-10-22 16:30:17 +00:00
Riza Suminto
9fc941b611 IMPALA-14327: Update load-data.py and run-workload.py to use HS2
load-data.py is used for dataloading, while run-workload.py is used for
running the perf-AB-test. This patch changes the scripts from using the
Beeswax protocol to the HS2 protocol.
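
For context, a minimal sketch of opening an HS2 connection with impyla
(host and port are assumptions for a local minicluster, not taken from
the patch):

  from impala.dbapi import connect

  conn = connect(host="localhost", port=21050)  # 21050 is Impala's HS2 port
  cursor = conn.cursor()
  cursor.execute("SELECT 1")
  print(cursor.fetchall())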

Testing:
Ran data loading and perf-AB-test-ub2004 with this patch.

Change-Id: I1c3727871b8b2e75c3f10ceabfbe9cb96e36ead3
Reviewed-on: http://gerrit.cloudera.org:8080/23309
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-08-20 07:20:29 +00:00
Michael Smith
131f0c74a3 IMPALA-12939: Bound IMPALA_BUILD_THREADS for cgroups and memory
Updates IMPALA_BUILD_THREADS to bound it based on a guideline of 2 GB of
memory per core during builds. Computes cores and memory from cgroup
limits if applicable; memory is used as a bound on physical memory, as
sometimes cgroups will report a larger limit than available physical
memory.

Uses IMPALA_BUILD_THREADS for load-data.

Adds a default in case USER is unset during bootstrap, which can occur
in devcontainer.
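
A rough sketch of the 2 GB-per-core bound, assuming a Linux host with
cgroup v2 (paths and fallbacks are assumptions; the real logic lives in
the shell environment scripts):

  import multiprocessing
  import os

  cores = multiprocessing.cpu_count()
  phys_mem = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
  try:
      with open("/sys/fs/cgroup/memory.max") as f:
          cgroup_mem = int(f.read())  # raises ValueError on "max"
  except (OSError, ValueError):
      cgroup_mem = phys_mem
  # cgroups can report more than physical memory, so take the minimum
  mem_bytes = min(phys_mem, cgroup_mem)
  build_threads = min(cores, max(1, mem_bytes // (2 * 1024 ** 3)))
  print(build_threads)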

Change-Id: I87994d0464073fe2d91bc2f7c2592c012e42de71
Reviewed-on: http://gerrit.cloudera.org:8080/21200
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2024-09-26 17:00:05 +00:00
Abhishek Rawat
f620e5d5c0 IMPALA-13015: Dataload fails due to concurrency issue with test.jceks
Move the 'hadoop credential' command used for creating test.jceks to
testdata/bin/create-load-data.sh. Earlier it was in bin/load-data.py,
which is invoked in parallel and was causing failures due to race
conditions.

Testing:
- Ran JniFrontendTest#testGetSecretFromKeyStore after data loading; the
test ran clean.

Change-Id: I7fbeffc19f2b78c19fee9acf7f96466c8f4f9bcd
Reviewed-on: http://gerrit.cloudera.org:8080/21346
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-23 11:09:47 +00:00
Yida Wu
9837637d93 IMPALA-12920: Support ai_generate_text built-in function for OpenAI's chat completion API
Added support for following built-in functions:
- ai_generate_text_default(prompt)
- ai_generate_text(ai_endpoint, prompt, ai_model,
  ai_api_key_jceks_secret, additional_params)

'ai_endpoint', 'ai_model' and 'ai_api_key_jceks_secret' are flagfile
options. The 'ai_generate_text_default(prompt)' syntax expects all of
these to be set to proper values. The other syntax will try to use the
provided input parameter values, but falls back to instance-level values
if the inputs are NULL or empty.

Only public OpenAI (api.openai.com) and Azure OpenAI (openai.azure.com)
API endpoints are currently supported.

Exposed these functions in FunctionContext so that they can also be
called from UDFs:
- ai_generate_text_default(context, model)
- ai_generate_text(context, ai_endpoint, prompt, ai_model,
  ai_api_key_jceks_secret, additional_params)

Testing:
- Added unit tests for AiGenerateTextInternal function
- Added fe test for JniFrontend::getSecretFromKeyStore
- Ran manual tests to make sure Impala can talk to OpenAI LLMs using the
'ai_generate_text' built-in function. Example SQL:
select ai_generate_text("https://api.openai.com/v1/chat/completions",
"hello", "gpt-3.5-turbo", "open-ai-key",
'{"temperature": 0.9, "model": "gpt-4"}')
- Tested using standalone UDF SDK and made sure that the UDFs can invoke
  BuiltInFunctions (ai_generate_text and ai_generate_text_default)

Change-Id: Id4446957f6030bab1f985fdd69185c3da07d7c4b
Reviewed-on: http://gerrit.cloudera.org:8080/21168
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-11 07:25:50 +00:00
Riza Suminto
8661f922d3 IMPALA-12601: Add a fully partitioned TPC-DS database
The current tpcds dataset only has the store_sales table fully
partitioned and leaves the other fact tables unpartitioned. This is
intended for faster data loading during tests. However, it is not an
accurate reflection of the larger-scale TPC-DS dataset, where all fact
tables are partitioned. The Impala planner may change the details of the
query plan if a partition column exists.

This patch adds a new dataset, tpcds_partitioned, loading a fully
partitioned TPC-DS db in parquet format named
tpcds_partitioned_parquet_snap. This dataset cannot be loaded
independently and requires the base 'tpcds' db from the tpcds dataset to
be preloaded first. An example of how to load this dataset can be seen
in the load-tpcds-data function in testdata/bin/create-load-data.sh.

This patch also changes PlannerTest#testProcessingCost from targeting
tpcds_parquet to tpcds_partitioned_parquet_snap. Other planner tests
that currently target tpcds_parquet will be gradually changed to test
against tpcds_partitioned_parquet_snap in follow-up patches.

This addition adds a couple of seconds to the "Computing table stats"
step, but the loading itself is negligible since it is parallelized with
TPC-H and functional-query. The total loading time for the three
datasets remains similar after this patch.

This patch also adds several improvements in the following files:

bin/load-data.py:
- Log elapsed time on serial steps.

testdata/bin/create-load-data.sh:
- Rename MSG to LOAD_MSG to avoid collision with the same variable name
  in ./testdata/bin/run-step.sh

testdata/bin/generate-schema-statements.py:
- Remove redundant FILE_FORMAT_MAP.
- Add build_partitioned_load to simplify expressing partitioned insert
  queries in the SQL template (see the sketch after this list).

testdata/datasets/tpcds/tpcds_schema_template.sql:
- Reorder schema template to load all dimension tables before fact tables.
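
A hedged sketch of what a partitioned-insert helper might emit (names
and SQL shape are assumptions, not the actual template code):

  def build_partitioned_load(table, part_col, source):
      return ("INSERT OVERWRITE TABLE {t} PARTITION ({p})\n"
              "SELECT * FROM {s};").format(t=table, p=part_col, s=source)

  print(build_partitioned_load("tpcds_partitioned_parquet_snap.store_sales",
                               "ss_sold_date_sk", "tpcds.store_sales"))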

Testing:
- Pass core tests.

Change-Id: I3a2e66c405639554f325ae78c66628d464f6c453
Reviewed-on: http://gerrit.cloudera.org:8080/20756
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-12-16 02:31:13 +00:00
Riza Suminto
378169be1f Revert "Revert "IMPALA-9923: Load ORC serially to hack around ...""
This reverts commit b03e8ef95c.

IMPALA-12630 reports that several tests were broken due to loading ORC
in parallel with other non-text table formats. ORC tables return to
loading serially after this commit.

Change-Id: I5d3f2ee1c15f9aff6aa632a78d86ba32c640e53d
Reviewed-on: http://gerrit.cloudera.org:8080/20795
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-12-15 08:50:24 +00:00
Riza Suminto
b03e8ef95c Revert "IMPALA-9923: Load ORC serially to hack around flakiness"
This reverts commit dc2fdabbd1.

A newer Hive version and other fixes now allow ORC loading to happen in
parallel.

Change-Id: I67f4051dd07273f2b51843cb5c1ec2cf185c5924
Reviewed-on: http://gerrit.cloudera.org:8080/20755
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-12-13 06:11:52 +00:00
Joe McDonnell
c233634d74 IMPALA-11975: Fix Dictionary methods to work with Python 3
Python 3 made the main dictionary methods lazy (items(),
keys(), values()). This means that code that uses those
methods may need to wrap the call in list() to get a
list immediately. Python 3 also removed the old iter*
lazy variants.

This changes all locations to use Python 3 dictionary
methods and wraps calls with list() appropriately.
This also changes all iteritems(), itervalues(), iterkeys()
locations to items(), values(), keys(), etc. Python 2
will not use the lazy implementation of these, so there
is a theoretical performance impact. Our python code is
mostly for tests and the performance impact is minimal.
Python 2 will be deprecated when Python 3 is functional.
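
As an illustration of the pattern (not code from the patch):

  d = {"a": 1, "b": 2}
  # Python 3: d.items() is a lazy view; wrap it in list() when a concrete
  # list is needed, e.g. to delete entries while iterating.
  for k, v in list(d.items()):
      if v > 1:
          del d[k]
  # Old spellings like d.iteritems() become plain d.items().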

This addresses these pylint warnings:
dict-iter-method
dict-keys-not-iterating
dict-values-not-iterating

Testing:
 - Ran core tests

Change-Id: Ie873ece54a633a8a95ed4600b1df4be7542348da
Reviewed-on: http://gerrit.cloudera.org:8080/19590
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
eb66d00f9f IMPALA-11974: Fix lazy list operators for Python 3 compatibility
Python 3 changes list operators such as range, map, and filter
to be lazy. Some code that expects the list operators to happen
immediately will fail, e.g.:

Python 2:
range(0,5) == [0,1,2,3,4]
True

Python 3:
range(0,5) == [0,1,2,3,4]
False

The fix is to wrap locations with list(), i.e.:

Python 3:
list(range(0,5)) == [0,1,2,3,4]
True

Since the base operators are now lazy, Python 3 also removes the
old lazy versions (e.g. xrange, ifilter, izip, etc). This uses
future's builtins package to convert the code to the Python 3
behavior (i.e. xrange -> future's builtins.range).
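
A small sketch of the futurized pattern (requires the 'future' package
on Python 2; on Python 3 the stdlib builtins module makes these imports
no-ops):

  from builtins import filter, range  # Python-3-style lazy builtins

  evens = list(filter(lambda x: x % 2 == 0, range(10)))
  assert evens == [0, 2, 4, 6, 8]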

Most of the changes were done via these futurize fixes:
 - libfuturize.fixes.fix_xrange_with_import
 - lib2to3.fixes.fix_map
 - lib2to3.fixes.fix_filter

This eliminates the pylint warnings:
 - xrange-builtin
 - range-builtin-not-iterating
 - map-builtin-not-iterating
 - zip-builtin-not-iterating
 - filter-builtin-not-iterating
 - reduce-builtin
 - deprecated-itertools-function

Testing:
 - Ran core job

Change-Id: Ic7c082711f8eff451a1b5c085e97461c327edb5f
Reviewed-on: http://gerrit.cloudera.org:8080/19589
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
82bd087fb1 IMPALA-11973: Add absolute_import, division to all eligible Python files
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
 1. Python 3 requires absolute imports within packages. This
    can be emulated via "from __future__ import absolute_import"
 2. Python 3 changed division to "true" division that doesn't
    round to an integer. This can be emulated via
    "from __future__ import division"

This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.

I scrutinized each old-division location and converted some locations
to use the integer division '//' operator if it needed an integer
result (e.g. for indices, counts of records, etc). Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
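
The header added to each file, plus a division example of the kind that
was converted (the variable is illustrative):

  from __future__ import absolute_import, division, print_function

  records = [1, 2, 3, 4, 5]
  print(len(records) / 2)   # true division: 2.5 on both Python 2 and 3
  print(len(records) // 2)  # floor division where an integer is needed: 2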

Testing:
 - Ran core tests

Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
2b550634d2 IMPALA-11952 (part 2): Fix print function syntax
Python 3 now treats print as a function and requires
parentheses in the invocation.

print "Hello World!"
is now:
print("Hello World!")

This fixes all locations to use the function
invocation. This is more complicated when the output
is being redirected to a file or when avoiding the
usual newline.

print >> sys.stderr , "Hello World!"
is now:
print("Hello World!", file=sys.stderr)

To support this properly and guarantee equivalent behavior
between python 2 and python 3, all files that use print
now add this import:
from __future__ import print_function
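
The newline-avoidance case mentioned above looks roughly like this after
conversion (a sketch, not a specific changed file):

  from __future__ import print_function
  import sys

  # Python 2's trailing comma (print "x",) becomes an explicit end:
  print("no trailing newline", end="")
  print(" and to stderr", file=sys.stderr)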

This also fixes random flake8 issues that intersect with
the changes.

Testing:
 - check-python-syntax.sh shows no errors related to print

Change-Id: Ib634958369ad777a41e72d80c8053b74384ac351
Reviewed-on: http://gerrit.cloudera.org:8080/19552
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2023-02-28 17:11:50 +00:00
stiga-huang
e71ea69bb8 IMPALA-10459: Remove workarounds for MAPREDUCE-6441
MAPREDUCE-6441 is resolved and is in our toolchain. This patch removes
workarounds for it.

Tests:
 - Ran exhaustive test.

Change-Id: I5c4d482a6d15cdc08e9cf8878e130399665a8ee0
Reviewed-on: http://gerrit.cloudera.org:8080/17011
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-02-05 06:35:26 +00:00
Joe McDonnell
dc2fdabbd1 IMPALA-9923: Load ORC serially to hack around flakiness
ORC dataload has been intermittently failing with
"Fail to get checksum, since file .../_orc_acid_version is under construction."
This is due to some Hive/HDFS interaction that seems to get
worse with parallelism.

This has been hitting a lot of developer tests. As a temporary
workaround, this changes dataload to load ORC serially. This is
slightly slower, but it should be more reliable.

Testing:
 - Ran precommit tests, manually verified dataload logs

Change-Id: I15eff1ec6cab32c1216ed7400e4c4b57bb81e4cd
Reviewed-on: http://gerrit.cloudera.org:8080/16292
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-08-06 00:06:53 +00:00
Tim Armstrong
0a9ea803d2 IMPALA-7290: part 1: clean up shell tests
This sets up the tests to be extensible to test shell
in both beeswax and HS2 modes.

Testing:
* Add test dimension containing only beeswax in preparation
  for HS2 dimension.
* Factor out hardcoded ports.
* Add tests for formatting of all types and NULL values.
* Merge date shell test into general type tests.
* Added testing for floating point output formatting, which does
  change as a result of switching to server-side vs client-side
  formatting.
* Use unique_database for tests that create tables.

Change-Id: Ibe5ab7f4817e690b7d3be08d71f8f14364b84412
Reviewed-on: http://gerrit.cloudera.org:8080/13083
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-04-30 11:30:45 +00:00
stiga-huang
357c0a959d IMPALA-7490: fix uninitialized variables in load-data.py
Fixes use of an uninitialized variable in bin/load-data.py

I found the following error message in a failed build, which is quite
misleading:

Traceback (most recent call last):
  File "bin/load-data.py", line 495, in <module>
    if __name__ == "__main__": main()
  File "bin/load-data.py", line 459, in main
    impala_exec_query_files_parallel(thread_pool, impala_create_files)
  File "bin/load-data.py", line 297, in impala_exec_query_files_parallel
    exec_query_files_parallel(thread_pool, query_files, 'impala')
  File "bin/load-data.py", line 291, in exec_query_files_parallel
    for result in thread_pool.imap_unordered(execution_function, query_files):
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 659, in next
    raise value
UnboundLocalError: local variable 'query' referenced before assignment

The error is thrown from the 'execution_function', which is
'exec_impala_query_from_file' in my case. We should not use 'query' if
it's undefined.
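
A self-contained sketch of the failure mode and one possible guard (the
names are hypothetical, not the actual load-data.py code):

  def run_queries(queries):
      query = None  # bind up front so handlers can never see it unbound
      try:
          for query in queries:
              if "bad" in query:
                  raise ValueError(query)
      except ValueError:
          print("Failed while executing: %r" % query)

  run_queries(["select 1", "bad query"])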

Change-Id: If0dd56a9b78a60b3a9f04d9f61e93b4b5d066b76
Reviewed-on: http://gerrit.cloudera.org:8080/11330
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-09-01 12:00:43 +00:00
Joe McDonnell
573550ca2f IMPALA-7088: Fix uninitialized variable in cluster dataload
bin/load-data.py uses a unique directory for local Hive
execution to avoid a race condition when executing multiple
Hive commands at once. This unique directory is not needed
when loading on a real cluster. However, the code to remove
the unique directory at the end does not handle this
correctly.

This skips the code to remove the unique directory when
it is uninitialized.
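
A minimal sketch of the guard (the variable name is hypothetical):

  import os
  import shutil

  hive_local_dir = None  # only set when loading against a local minicluster
  # ... data loading happens here ...
  if hive_local_dir and os.path.exists(hive_local_dir):
      shutil.rmtree(hive_local_dir)  # skip cleanup when never initialized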

Change-Id: I5581a45460dc341842d77eaa09647e50f35be6c7
Reviewed-on: http://gerrit.cloudera.org:8080/10526
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-29 21:34:02 +00:00
Joe McDonnell
b126b2d105 IMPALA-6972: Disable parallel dataload on MINICLUSTER_PROFILE=2
There is a bug in Hive 1.1.0 that can result
in a NullPointerException when doing parallel Hive
operations (see IMPALA-6532). Since dataload goes
parallel on Hive loads starting with IMPALA-6372,
dataload can hit this error on Hive 1.1.0 (i.e.
IMPALA_MINICLUSTER_PROFILE=2). This is impacting
builds on the 2.x branch.

This disables parallel dataload for IMPALA_MINICLUSTER_PROFILE=2.

IMPALA_MINICLUSTER_PROFILE=3 uses a newer version
of Hive that has a fix for this, so this continues
to use parallel dataload for that case.

Parallelism can be reenabled when Hive 1.1.0 gets the
fix from Hive 2.1.1.

Change-Id: I90a0f2b3756d7192fa7db2958031b8c88eb606e6
Reviewed-on: http://gerrit.cloudera.org:8080/10306
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-10 01:30:13 +00:00
Joe McDonnell
d481cd4842 IMPALA-6372: Go parallel for Hive dataload
This changes generate-schema-statements.py to produce
separate SQL files for different file formats for Hive.
This changes load-data.py to go parallel on these
separate Hive SQL files. For correctness, the text
version of all tables must be loaded before any
of the other file formats.

load-data.py runs DDLs to create the tables in Impala
and goes parallel. Currently, there are some minor
dependencies so that text tables must be created
prior to creating the other table formats. This
changes the definitions of some tables in
testdata/datasets/functional/functional_schema_template.sql
to remove these dependencies. Now, the DDLs for the
text tables can run in parallel to the other file formats.

To unify the parallelism for Impala and Hive, load-data.py
now uses a single fixed-size pool of processes to run all
SQL files rather than spawning a thread per SQL file.
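
A rough sketch of the fixed-size pool pattern (file names and pool size
are illustrative):

  from multiprocessing.pool import ThreadPool

  def exec_sql_file(path):
      return path  # stand-in for running the file via Impala or Hive

  pool = ThreadPool(processes=4)
  for finished in pool.imap_unordered(exec_sql_file,
                                      ["a.sql", "b.sql", "c.sql"]):
      print("done: %s" % finished)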

This also modifies the locations that do invalidate to
use refresh where possible and eliminate global
invalidates.

For debuggability, different SQL executions output to
different log files rather than to standard out. If an
error occurs, this will point out the relevant log
file.

This saves about 10-15 minutes on dataload (including
for GVO).

Change-Id: I34b71e6df3c8f23a5a31451280e35f4dc015a2fd
Reviewed-on: http://gerrit.cloudera.org:8080/8894
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-14 00:16:26 +00:00
Tim Armstrong
f6576adc2e IMPALA-6455: unique tmpdirs for test_partition_metadata_compatibility
Concurrent hive statements running in local mode can race to modify
the contents of temporary directories - see IMPALA-6108. This applies
the workaround for IMPALA-6108 to the run_stmt_in_hive() utility
function, which is used by test_partition_metadata_compatibility.

Testing:
I wasn't able to reproduce the race locally, but I ran the test and
confirmed that it still passed. I also confirmed that the temporary
directories /tmp/impala-tests-* were created using "ls" while the
tests were running.

Change-Id: Ibabff859d19ddbb2a3048ecc02897a611d8ddb20
Reviewed-on: http://gerrit.cloudera.org:8080/9165
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-31 22:08:48 +00:00
Joe McDonnell
d9b6fd0730 IMPALA-6386: Invalidate metadata at table level for dataload
Dataload currently executes bin/load-data.py for TPC-H,
TPC-DS, and functional-query concurrently. One of the final
steps for bin/load-data.py is to run a global "invalidate
metadata". Global "invalidate metadata" commands are known
to cause problems on concurrent systems. See IMPALA-5087.
For dataload, if TPC-H executes "invalidate metadata" while
TPC-DS is still creating tables and adding partitions,
the TPC-DS executor might erroneously believe that a table
does not exist.

This changes dataload to invalidate metadata at an
individual table level rather than globally. This
prevents the concurrency issue.
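
Illustratively (the table list is hypothetical), dataload now issues
statements like:

  tables = ["tpcds.store_sales", "tpch.lineitem"]
  stmts = ["INVALIDATE METADATA %s;" % t for t in tables]
  print("\n".join(stmts))  # rather than one global "INVALIDATE METADATA;"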

This also changes the names of some of the intermediate
SQL files generated by generate-schema-statements.py
and consumed by load-data.py to make them less confusing.

Change-Id: Ibc3a6d8a674a0bf6b02069bfe8a5e12034335b1f
Reviewed-on: http://gerrit.cloudera.org:8080/9009
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-17 22:52:58 +00:00
Philip Zeyliger
76111ce168 IMPALA-6108, IMPALA-6070: Parallel data load (re-instated).
This is a revert of a revert, re-enabling parallel data load. It avoids
the race condition by explicitly configuring the temporary directory in
question in load-data.py.
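
A sketch of the idea, assuming the unique directory is passed through to
Hive's local job runner (the exact property name is an assumption):

  import tempfile

  tmp_dir = tempfile.mkdtemp(prefix="impala-data-load-")
  hive_args = ["--hiveconf", "hadoop.tmp.dir=%s" % tmp_dir]
  print(hive_args)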

When the parallel data load change went in, we discovered
a race with a signature of:

  java.io.FileNotFoundException: File
  /tmp/hadoop-jenkins/mapred/local/1508958341829_tmp does not exist

The number in this path is milliseconds since the epoch, and the race
occurs when two queries submitted to HiveServer2, running with the local
runner, hit the same millisecond time stamp.  The upstream bug is
https://issues.apache.org/jira/browse/MAPREDUCE-6441, and I described the
symptoms in https://issues.apache.org/jira/browse/MAPREDUCE-6992 (which
is now marked as a dupe).

I've tested this by running data load 5 times on the same machines
where it failed before. I also ran data load manually and inspected
the system to make sure that the temporary directories are getting
created as expected in /tmp/impala-data-load-*.

Change-Id: I60d65794da08de4bb3eb439a2414c095f5be0c10
Reviewed-on: http://gerrit.cloudera.org:8080/8405
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-02 00:40:19 +00:00
Jim Apple
07a7138817 Add a script to test performance on a developer machine
This is a migration from an old and broken script from another
repository. Example use:

    bin/single_node_perf_run.py --ninja --workloads targeted-perf \
      --load --scale 4 --iterations 20 --num_impalads 3 \
      --start_minicluster --query_names PERF_AGG-Q3 \
      $(git rev-parse HEAD~1) $(git rev-parse HEAD)

The script can load data, run benchmarks, and compare the statistics
of those runs for significant differences in performance. It glues
together buildall.sh, bin/load-data.py, bin/run-workload.py, and
tests/benchmark/report_benchmark_results.py.

Change-Id: I70ba7f3c28f612a370915615600bf8dcebcedbc9
Reviewed-on: http://gerrit.cloudera.org:8080/6818
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
2017-05-31 08:10:48 +00:00
Martin Grund
ce4c5f6743 IMPALA-4365: Enabling end-to-end tests on a remote cluster
This patch lays the groundwork for loading data and running end-to-end
tests on a remote CDH cluster. The requirements for the cluster to run
the tests are:

  - Managed by Cloudera Manager (CM)
  - GPL Extras need to be installed
  - KMS and KeyTrustee installed and available as a service
  - SERDEPROPERTIES in the Hive DB modified to accept wide tables
  - Hive warehouse dir points to /test-warehouse

The actual data loading is done via a new script, remote_data_load.py,
which takes the CM host as an argument. It can be run from a client
machine that is not a node of the cluster, but it needs to have the
Impala repo checked out and Impala built. This ensures that all of the
necessary data load scripts are available, as well as setting up the
environment properly (client binaries like beeline and the hbase shell
are available, python libraries like cm_api are installed, necessary
environment variables are defined, etc.)

It should be noted that running remote_data_load.py will overwrite
any local XML config files with the configurations downloaded from
the remote cluster.

Usage: remote_data_load.py [options] <cm_host address>

Options:
  -h, --help            show this help message and exit
  --snapshot-file=SNAPSHOT_FILE
                        Path to the test-warehouse archive
  --cm-user=CM_USER     Cloudera Manager admin user
  --cm-pass=CM_PASS     Cloudera Manager admin user password
  --gateway=GATEWAY     Gateway host to upload the data from. If not
                        set, uses the CM host as gateway.
  --ssh-user=SSH_USER   System user on the remote machine with
                        passwordless SSH configured.
  --no-load             Do not try to load the snapshot
  --exploration-strategy=EXPLORATION_STRATEGY
  --test                Run end-to-end tests against cluster

Testing:

This patch is being submitted with the understanding that there are
still cleanup issues that need to be addressed in the remote data
load script, for which JIRAs have been filed.

However, since many of the existing build scripts also had to be
modified, it is more important to make sure that no regressions were
inadvertently introduced into the existing data load process. Loading
data to a local mini-cluster was checked repeatedly while this patch
was being developed, as well as running it against the Jenkins job
that provides the test-warehouse snapshot used by the many other
Impala CI builds that run daily.

Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Reviewed-on: http://gerrit.cloudera.org:8080/4769
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2016-11-08 10:16:55 +00:00
Lars Volker
ef4c9958d0 IMPALA-4047: Remove occurrences of 'CDH'/'cdh' from repo
This change removes some of the occurrences of the strings 'CDH'/'cdh'
from the Impala repository. References to Cloudera-internal Jiras have
been replaced with upstream Jira issues on issues.cloudera.org.

For several categories of occurrences (e.g. pom.xml files,
DOWNLOAD_CDH_COMPONENTS) I also created a list of follow-up Jiras to
remove the occurrences left after this change.

Change-Id: Icb37e2ef0cd9fa0e581d359c5dd3db7812b7b2c8
Reviewed-on: http://gerrit.cloudera.org:8080/4187
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-10-13 00:40:41 +00:00
Dan Hecht
ffa7829b70 IMPALA-3918: Remove Cloudera copyrights and add ASF license header
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:

http://www.apache.org/legal/src-headers.html#headers

Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
   http://www.apache.org/legal/src-headers.html#notice
   to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
   on the website.

Much of this change was automatically generated via:

git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]

Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.

[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
    modification to ORIG_LICENSE to match Impala's license text.

Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-08-09 08:19:41 +00:00
Tim Armstrong
c1d70f814e IMPALA-3227: generate test TPC data sets during data load
The generated data is identical to the pregenerated tpch.tar.gz
and tpcds.tar.gz data that were used previously and were not
publicly accessible.

This adds a "preload" hook to bin/load-data.py that can execute custom
logic for each data set. This is used to call the TPC-H and TPC-DS data
generation utilities that are already available in the Impala toolchain.
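
A hedged sketch of the hook shape (names are illustrative, not the
actual load-data.py code):

  def generate_tpc_data(dataset):
      print("generating %s via the toolchain data generator" % dataset)

  PRELOAD_HOOKS = {
      "tpch": lambda: generate_tpc_data("tpch"),
      "tpcds": lambda: generate_tpc_data("tpcds"),
  }

  def load_dataset(name):
      hook = PRELOAD_HOOKS.get(name)
      if hook:
          hook()  # custom per-dataset logic runs before the normal load

  load_dataset("tpch")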

Testing:
Ran private test job with loading from snapshot disabled and without
the tpch/tpcds tarballs available.

Change-Id: Ieccfbd7d8d4a91bffddbe35abb7f5572e71a71cf
Reviewed-on: http://gerrit.cloudera.org:8080/3761
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-07-28 04:56:57 +00:00
Tim Armstrong
bc8c55afcd IMPALA-3729: batch_size=1 coverage for avro scanner
Also fix a stale comment in the avro scanner header.

The main work here is to fix the handling of empty result sets in the
test result verifier. This is a problem because we wanted to verify
that the results in the test file were a superset of the rows
returned, and this was thrown off by superfluous '' rows in the expected
and actual result sets.

The basic problem is that the way test file sections
were parsed conflated an empty result section with a non-empty result
section that had a single empty string. I.e.:
---- RESULTS
====

vs
---- RESULTS

====

both got resolved to [''].
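
In other words (a toy illustration, not the verifier's actual data
structures):

  empty_section = []    # ---- RESULTS immediately followed by ====
  one_blank_row = [""]  # ---- RESULTS, a blank line, then ====
  # The old parsing resolved both cases to [''], conflating "no rows"
  # with "one row that is the empty string".
  assert empty_section != one_blank_row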

Change-Id: Ia007e558d92c7e4ce30be90446fdbb1f50a0ebc4
Reviewed-on: http://gerrit.cloudera.org:8080/3413
Tested-by: Internal Jenkins
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2016-07-19 23:30:02 -07:00
Alex Behm
7e76e92bef Consolidate test and cluster logs under a single directory.
All logs, test results and SQL files generated during data
loading and testing are now consolidated under a single new
directory $IMPALA_HOME/logs. The goal is to simplify archiving
in Jenkins runs and debugging.

The new structure is as follows:

$IMPALA_HOME/logs/cluster
- logs of Hadoop components and Impala

$IMPALA_HOME/logs/data_loading
- logs and SQL files produced in data loading

$IMPALA_HOME/logs/fe_tests
- logs and test output of Frontend unit tests

$IMPALA_HOME/logs/be_tests
- logs and test output of Backend unit tests

$IMPALA_HOME/logs/ee_tests
- logs and test output of end-to-end tests

$IMPALA_HOME/logs/custom_cluster_tests
- logs and test output of custom cluster tests

I tested this change with a full data load which
was successful.

Change-Id: Ief1f58f3320ec39d31b3c6bc6ef87f58ff7dfdfa
Reviewed-on: http://gerrit.cloudera.org:8080/2456
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-03-28 19:23:22 +00:00
Casey Ching
d202d6a967 Use "impala-python" (virtualenv) instead of system python
Python tests and infra scripts will now use "python" from the virtualenv
via $IMPALA_HOME/bin/impala-python. Some scripts could be simplified now
that Python 2.6 and a dependable set of third-party libraries are
available, but that is not done as part of this commit.

Change-Id: If1cf96898d6350e78ea107b9026b12ba63a4162f
Reviewed-on: http://gerrit.cloudera.org:8080/603
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2015-08-06 02:09:09 +00:00
Martin Grund
f58159d431 [CDH5] IMPALA-1141: HBase Planner Performance
This patch improves the performance of the planning phase of a query
querying HBase tables. It removes an unnecessary second call to compute
stats and adds a new version for estimating the row count in a table.

This patch adds an incremental version to estimate the number of rows
for a set of regions. This incremental version will start querying up to
five regions to calculate the average row size and use this value to
estimate the row count based on the size of the regions on disk. Only if
the standard deviation from the average is larger than 15% query an
additional region, it will query additional regions to calculate an
average with more confidence.

If the data is balanced it will not be necessary to retrieve data from
all regions but only from a subset. In the worst case, all regions are
queried.
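
A back-of-the-envelope sketch of the estimate (thresholds are from this
message; the code is illustrative, the real implementation is in the
frontend planner):

  import statistics

  def estimate_rows(region_sizes_bytes, sampled_avg_row_sizes):
      avg_row = statistics.mean(sampled_avg_row_sizes)
      if statistics.pstdev(sampled_avg_row_sizes) > 0.15 * avg_row:
          pass  # low confidence: the planner would sample more regions
      return int(sum(region_sizes_bytes) / avg_row)

  print(estimate_rows([64 << 20, 80 << 20], [120.0, 110.0, 130.0]))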

Change-Id: Idcb3bea81b11cb08da6d9329ba66c86aca23e170
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5258
Tested-by: jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
2014-11-14 13:47:02 -08:00
Mike Yoder
75a97d3d7e [CDH5] Kerberize mini-cluster and Impala daemons
This is the first iteration of a kerberized development environment.
All the daemons start and use kerberos, with the sole exception of the
hive metastore.  This is sufficient to test impala authentication.

When buildall.sh is run using '-kerberize', it will stop before
loading data or attempting to run tests.

Loading data into the cluster is known to not work at this time, the
root causes being that Beeline -> HiveServer2 -> MapReduce throws
errors, and Beeline -> HiveServer2 -> HBase has problems.  These are
left for later work.

However, the impala daemons will happily authenticate using kerberos
both from clients (like the impala shell) and amongst each other.
This means that if you can get data into the mini-cluster, you could
query it.

Usage:
* Supply a '-kerberize' option to buildall.sh, or
* Supply a '-kerberize' option to create-test-configuration.sh, then
  'run-all.sh -format', re-source impala-config.sh, and then start
  impala daemons as usual.  You must reformat the cluster because
  kerberizing it will change all the ownership of all files in HDFS.

Notable changes:
* Added clean start/stop script for the llama-minikdc
* Creation of Kerberized HDFS - namenode and datanodes
* Kerberized HBase (and Zookeeper)
* Kerberized Hive (minus the MetaStore)
* Kerberized Impala
* Loading of data very nearly working

Still to go:
* Kerberize the MetaStore
* Get data loading working
* Run all tests
* The unknown unknowns
* Extensive testing

Change-Id: Iee3f56f6cc28303821fc6a3bf3ca7f5933632160
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4019
Reviewed-by: Michael Yoder <myoder@cloudera.com>
Tested-by: jenkins
2014-09-05 12:36:21 -07:00
Alex Behm
3d764619f7 Run Hive data loading through beeline instead of the Hive shell.
Fixes our log configuration to put the Hive logs in cluster_logs/hive.

Change-Id: I5d98581e35325f2173e4b3170e36bec42d33f8f3
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1497
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1615
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
2014-02-20 15:43:31 -08:00
ishaan
01ef3ef4c1 load-data.py should exit if a bash command returns a non-zero error code.
Change-Id: I2f732a276a42d2697fa55bce0f18ac89e9a6f0a1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1397
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1408
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-30 15:47:13 -08:00
ishaan
4e9913b52f Fix race in data loading by creating text tables first.
While loading parquet, there are a few table creation queries that use the 'like'
keyword; this ends up opening a small race window when all the table formats are created
concurrently. With this change, we create the text tables first before attempting to
parallelize the rest of the data loading.

Change-Id: Ib84cf0e5120b3588d3f0503d7119ca055e08e53f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1241
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-01-10 15:01:59 -08:00
Nong Li
056c7d94d6 Remove compute stats option from bin/load-data.py
This option is not implemented in this script, and it is not obvious
that it does nothing.

Change-Id: I1a1eff38460fd181c486cfca2840108a58e21603
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1059
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-10 14:01:35 -08:00
ishaan
0ed1781323 Invalidate metadata before loading parquet data through Impala.
During a full data load, we load all the data (except parquet) via Hive, and then load the
parquet data via Impala. The catalog service does not update the metadata of tables
changed outside Impala, so we need to explicitly invalidate the metadata before loading
parquet data.

Change-Id: Iec39db9ea46e4a11b17589881732629a56444120
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1207
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:39 -08:00
Lenni Kuff
baf79f8185 Call 'invalidate metadata' after loading test data instead of before
Instead of calling 'invalidate metadata' before loading each workload
we should call it once, after loading all test data. This will allow
us to pick up data inserted by Hive. The only reason this worked before
is because we restart Impala before running the tests. This will also
be a bit faster when loading multiple workloads.

Change-Id: I28d42bbf5d7a24b5fde687d67a4b41472ec4b897
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1153
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:37 -08:00
ishaan
287953e87c Better error logging while loading data.
Change-Id: I67cbd9fd1d915ea043a731b7951f29fec25fc446
Reviewed-on: http://gerrit.ent.cloudera.com:8080/982
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:13 -08:00
ishaan
bf5359be8d Cleanup Impala connections after data is loaded.
Change-Id: I152b09808740d5344462bcbaf4df4b71d88504cc
Reviewed-on: http://gerrit.ent.cloudera.com:8080/953
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:02 -08:00
Lenni Kuff
f579ee8b25 Fix logging in load-data to print the query being executed
Change-Id: I4332e8d3a340f11e1bbb1f6c5126b0b9b4a2ad8e
Reviewed-on: http://gerrit.ent.cloudera.com:8080/949
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:53:58 -08:00
ishaan
fcdcf1a9d8 Parallelize data loaded through Impala to speed up data loading.
Currently, we execute all the queries involved in data loading serially. This change
creates a separate .sql file for each file format, compression codec and compression
scheme combination, and executes all the files in parallel. Additionally, we now store all the
.sql files (independent of workload) in $IMPALA_HOME/data_load_files/<dataset_name>. Note
that only data loaded through Impala is parallelized; data loaded through Hive and HBase
remains serial.

On our build machines, the time taken to load all the data from snapshot was on the order
of 15 minutes.

Change-Id: If8a862c43f0e75b506ca05d83eacdc05621cbbf8
Reviewed-on: http://gerrit.ent.cloudera.com:8080/804
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-01-08 10:53:53 -08:00
ishaan
565d15579c Add the ability to use a workload as the unit of execution in the Impala benchmark runner.
At the moment, a query is the default unit of execution and parallelism in the Impala
performance suite. With this change, we now have the ability to treat a workload as the
unit of execution. A workload is defined as a unique combination of the dataset, scale
factor, a subset (or all) of the queries in the dataset, and a table format (file format,
compression codec and compression scheme).

It introduces two new command line options in bin/run-workload.py:
  * --execution_scope
    The default scope is 'query', and it maintains previous semantics. The
    new scope is 'workload', which toggles the unit of execution to a workload.
  * --shuffle_query_exec_order.
    Shuffles the order in which queries are executed (only applicable when the
    execution_scope is 'workload'); defaults to False.

Change-Id: I790d75f0896210cda8eb999015b0be04246e4c45
Reviewed-on: http://gerrit.ent.cloudera.com:8080/503
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:53:07 -08:00
Lenni Kuff
a1f2f72f49 Add Impala DDL support for creation of AVRO tables + support for CREATE/ALTER SERDEPROPERTIES
This change adds Impala DDL support for creation of AVRO tables.
Additionally, it adds Impala support for CREATE and ALTER SERDEPROPERTIES
which are used when creating Avro backed tables. This syntax is not
exactly the same as the Hive support since it introduces a new
fileformat (AVROFILE) that implies the needed Serialization library,
input format, and output format.

Change-Id: I5047e419198a89599e9d014fdedfee1a20437a7d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/464
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:52:48 -08:00
ishaan
53cd9eadab Treat HBase as a file format for functional tests
Change-Id: Ia01181a1e10eb108419122d347e9d869a69e8922
Reviewed-on: http://gerrit.ent.cloudera.com:8080/102
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:52:36 -08:00
Alan Choi
ecee109e68 IMPALA-387 Add refresh/invalidate SQL 2014-01-08 10:51:25 -08:00
Alan Choi
b71357fc28 IMPALA-387 Reuse Hdfs and Hive metastore metadata to perform a fast incremental refresh 2014-01-08 10:51:17 -08:00
Lenni Kuff
2f7198292a Add support for auxiliary workloads, tests, and datasets
This change adds support for auxiliary workloads, tests, and datasets. This is useful
to augment the regular test runs with some additional tests that do not belong in the
main Impala repo.
2014-01-08 10:50:32 -08:00
Nong Li
1f6481382e Fix parquet test setup. 2014-01-08 10:49:41 -08:00
Lenni Kuff
cba9cd00dd Fix full data load build break due to constructing incorrect HDFS paths 2014-01-08 10:49:34 -08:00