This commit tackles a few additions and improvements to
test-with-docker. In general, I'm adding workloads (e.g., exhaustive,
rat-check), tuning memory settings and parallelism, and trying to speed
things up.
Bug fixes:
* Embarrassingly, I was still skipping thrift-server-test in the backend
tests. This was a mistake in handling feedback from my last review.
* I made the timeline a little bit taller so that it clips less.
Adding workloads:
* I added the RAT licensing check.
* I added exhaustive runs. This led me to model the suites a little
bit more in Python, with a class representing a suite and the
data associated with it. It's not perfect and still
coupled with the entrypoint.sh shell script, but it feels
workable. As part of adding exhaustive tests, I had
to re-work the timeout handling, since different
suites now meaningfully have different timeouts.
Speed ups:
* To speed up test runs, I added a mechanism to split py.test suites into
multiple shards with a py.test argument (a minimal sketch of such a hook
appears after this list). This involved a little bit of work in
conftest.py, and exposing $RUN_CUSTOM_CLUSTER_TESTS_ARGS in run-all-tests.sh.
Furthermore, I moved a bit more logic about managing the
list of suites into Python.
* Doing the full build with "-notests" and only building
the backend tests in the relevant target that needs them. This speeds
up "docker commit" significantly by removing about 20GB from the
container. I had to indicate that expr-codegen-test depends on
expr-codegen-test-ir, which was missing.
* I sped up copying the Kudu data: previously I did
both a move and a copy; now I'm doing a move followed by a move. One
of the moves is cross-filesystem and therefore slow, but this halves the
amount of copying.
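Roughly, the sharding hook in conftest.py might look like the sketch below;
the option names and the hashing scheme here are illustrative, not
necessarily what the change uses:

import zlib

def pytest_addoption(parser):
    parser.addoption("--shard", type=int, default=0,
                     help="Index of the shard to run (0-based).")
    parser.addoption("--num-shards", type=int, default=1,
                     help="Total number of shards to split the suite into.")

def pytest_collection_modifyitems(config, items):
    # Keep only the tests whose node id hashes into this shard.
    num_shards = config.getoption("--num-shards")
    if num_shards <= 1:
        return
    shard = config.getoption("--shard")
    items[:] = [item for item in items
                if zlib.crc32(item.nodeid.encode()) % num_shards == shard]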
Memory usage:
* I tweaked the memlimit_gb settings to have a higher default. I've been
tuning these empirically to get the tests running well on c4.8xlarge and
m4.10xlarge.
The more memory a minicluster and test suite run uses, the fewer parallel
suites we can run. By observing the peak processes at the tail of a run (with a
new "memory_usage" function that uses a ps/sort/awk trick; a rough sketch of it
appears after this list) and by observing peak container total_rss, I found
that several JVMs were running without Xmx settings. I added Xms/Xmx settings
in a few places:
* The non-first Impalad does very little JVM work, so having
an Xmx keeps it small, even in the parallel tests.
* Datanodes do work, but they essentially were never garbage
collecting, because JVM defaults let them use up to 1/4th
the machine memory. (I observed this based on RSS at the
end of the run; nothing fancier.) Adding Xms/Xmx settings
helped.
* Similarly, I piped the settings through to HBase.
A few daemons still run without resource limitations, but they don't
seem to be a problem.
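The "memory_usage" helper mentioned above is a shell function built on
ps/sort/awk; a rough Python equivalent, for illustration only, is:

import subprocess
from collections import defaultdict

def memory_usage(top_n=15):
    # Sum resident set size (in KB) per command name and print the largest.
    out = subprocess.check_output(["ps", "-eo", "rss=,comm="],
                                  universal_newlines=True)
    rss_by_command = defaultdict(int)
    for line in out.splitlines():
        rss_kb, _, command = line.strip().partition(" ")
        if rss_kb.isdigit():
            rss_by_command[command.strip()] += int(rss_kb)
    for command, rss_kb in sorted(rss_by_command.items(),
                                  key=lambda kv: kv[1], reverse=True)[:top_n]:
        print("%6d MB  %s" % (rss_kb // 1024, command))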
Change-Id: I43fe124f00340afa21ad1eeb6432d6d50151ca7c
Reviewed-on: http://gerrit.cloudera.org:8080/10123
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
HDFS commandline calls can be expensive due to JVM
startup and other costs. Since most HDFS commandline
calls can take multiple paths, one way to reduce
execution time is to consolidate multiple HDFS
commands into a single HDFS call. Since HDFS put
commands will follow symbolic links and can copy
recursively, this can allow for further consolidation
by creating the full directory structure and
copying it in a single HDFS call.
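In Python terms, the idea is roughly the following sketch (the paths and the
helper are made up; the real dataload code is shell):

import os
import subprocess
import tempfile

def upload_in_one_call(local_paths, hdfs_dest_dir):
    # Stage everything into one local tree via symlinks, then issue a single
    # put. The destination directory is assumed to already exist.
    staging = tempfile.mkdtemp(prefix="hdfs-staging-")
    for path in local_paths:
        os.symlink(os.path.abspath(path),
                   os.path.join(staging, os.path.basename(path)))
    sources = [os.path.join(staging, name) for name in os.listdir(staging)]
    # One JVM startup instead of one per path; put follows the symlinks and
    # copies directories recursively.
    subprocess.check_call(["hdfs", "dfs", "-put"] + sources + [hdfs_dest_dir])

# e.g. upload_in_one_call(["/tmp/hive-udfs.jar", "/tmp/avro_schemas"],
#                         "/test-warehouse/staging")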
This change applies several of these optimizations throughout
the dataload codepath. It saves a few seconds here
and there:
Loading Hive Builtins: 1:10 -> 0:30
Loading custom schemas: 0:35 -> 0:20
Loading Hive UDFs: 0:45 -> 0:25
Change-Id: I0934353329dc7312394fc4457ab8db2a272c6282
Reviewed-on: http://gerrit.cloudera.org:8080/10120
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
testdata/bin/create-load-data.sh runs bin/load-data.py for
functional/exhaustive, tpch/core, and tpcds/core in a
first phase, then it loads functional and tpch for Kudu
in a second phase. For a full dataload, this second phase
is not necessary. functional/exhaustive and tpch/core
already include Kudu.
This avoids the second phase when doing a full dataload.
The second phase is still necessary when loading from
a snapshot, and this does not change that behavior.
This saves a couple of minutes off a full dataload.
Change-Id: Ic023d230f99126ed37795106c38faae5f0cb608e
Reviewed-on: http://gerrit.cloudera.org:8080/10128
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The patch writes the output of Sentry to
$IMPALA_CLUSTER_LOGS_DIR/sentry/sentry.out, following the
same convention as other service output logs.
Testing:
- Injected a failure into the run-sentry-service.sh script to verify that the
error message was captured
Change-Id: I76627bb5b986a548ec6e4f12b555bd6fc8c4dab8
Reviewed-on: http://gerrit.cloudera.org:8080/10064
Reviewed-by: Vuk Ercegovac <vercegovac@cloudera.com>
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This changes generate-schema-statements.py to produce
separate SQL files for different file formats for Hive.
This changes load-data.py to run these separate Hive SQL
files in parallel. For correctness, the text
version of all tables must be loaded before any
of the other file formats.
load-data.py runs the DDLs to create the tables in Impala,
also in parallel. Previously, there were some minor
dependencies requiring the text tables to be created
prior to creating the other table formats. This
changes the definitions of some tables in
testdata/datasets/functional/functional_schema_template.sql
to remove those dependencies. Now, the DDLs for the
text tables can run in parallel with the other file formats.
To unify the parallelism for Impala and Hive, load-data.py
now uses a single fixed-size pool of processes to run all
SQL files rather than spawning a thread per SQL file.
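A minimal sketch of the fixed-size pool (the beeline invocation and port are
assumptions for illustration; the real load-data.py logic is more involved):

import multiprocessing
import subprocess

def run_hive_sql_file(sql_file):
    # Each execution logs to its own file; see below.
    with open(sql_file + ".log", "w") as log:
        subprocess.check_call(["beeline", "-u", "jdbc:hive2://localhost:11050",
                               "-f", sql_file], stdout=log, stderr=log)
    return sql_file

def load_in_parallel(text_sql_files, other_format_sql_files, num_processes=4):
    pool = multiprocessing.Pool(processes=num_processes)
    # The text tables must finish loading before any other file format starts,
    # since the other formats are populated from the text versions.
    pool.map(run_hive_sql_file, text_sql_files)
    pool.map(run_hive_sql_file, other_format_sql_files)
    pool.close()
    pool.join()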
This also modifies the locations that do invalidate to
use refresh where possible and eliminate global
invalidates.
For debuggability, different SQL executions output to
different log files rather than to standard out. If an
error occurs, this will point out the relevant log
file.
This saves about 10-15 minutes on dataload (including
for GVO).
Change-Id: I34b71e6df3c8f23a5a31451280e35f4dc015a2fd
Reviewed-on: http://gerrit.cloudera.org:8080/8894
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies the input needed by the orc reader, tracks memory consumption of
the reader, and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.
A startup option --enable_orc_scanner is added for this feature. It's
set to true by default. Setting it to false will fail queries on ORC
tables.
Currently, we only support reading primitive types. Writing to ORC
tables is not yet supported.
Tests
- Most of the end-to-end tests can run on ORC format.
- Add tpcds, tpch tests for ORC.
- Add some ORC specific tests.
- Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
is not robust against corrupt files (ORC-315).
Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Reviewed-on: http://gerrit.cloudera.org:8080/9134
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Allows running the tests that make up the "core" suite in about 2 hours.
By comparison, https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/buildTimeTrend
tends to run in about 3.5 hours.
This commit:
* Adds "echo" statements in a few places, to facilitate timing.
* Adds --skip-parallel/--skip-serial flags to run-tests.py,
and exposes them in run-all-tests.sh.
* Marks TestRuntimeFilters as a serial test. This test runs
queries that need > 1GB of memory, and, combined with
other tests running in parallel, can kill the parallel test
suite.
* Adds "test-with-docker.py", which runs a full build, data load,
and executes tests inside of Docker containers, generating
a timeline at the end. In short, one container is used
to do the build and data load, and then this container is
re-used to run various tests in parallel. All logs are
left on the host system.
Besides the obvious win of getting test results more quickly, this
commit serves as an example of how to get various bits of Impala
development working inside of Docker containers. For example, Kudu
relies on atomic rename of directories, which isn't available in most
Docker filesystems, and entrypoint.sh works around it.
In addition, the timeline generated by the build suggests where further
optimizations can be made. Most obviously, dataload eats up a precious
~30-50 minutes, on a largely idle machine.
This work is significantly CPU and memory hungry. It was developed on a
32-core, 120GB RAM Google Compute Engine machine. I've worked out
parallelism configurations such that it runs nicely on 60GB of RAM
(c4.8xlarge) and over 100GB (e.g., m4.10xlarge, which has 160GB). There is
some simple logic to guess values for the knobs, and the knobs can also be
set explicitly. By and
large, EC2 and GCE price machines linearly, so, if CPU usage can be kept
up, it's not wasteful to run on bigger machines.
Change-Id: I82052ef31979564968effef13a3c6af0d5c62767
Reviewed-on: http://gerrit.cloudera.org:8080/9085
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit slightly loosens the coupling between IMPALA_HIVE_VERSION
and "hive.version" in the Maven sense.
Cherry-picks: not for 2.x
Change-Id: Ifbe6f5208b4ad0ffc9cbfe4e93d712ce698beb23
Reviewed-on: http://gerrit.cloudera.org:8080/9925
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
When an impalad is in executor-only mode, it receives no
catalog updates. As a result, lib-cache entries are never
refreshed. A consequence is that UDF queries can return
incorrect results or may not run due to resolution issues.
Both cases are caused by the executor using a stale copy
of the lib file. For incorrect results, an old version of
the method may be used. Resolution issues can come up if
a method is added to a lib file.
The solution in this change is to capture the coordinator's
view of the lib file's last modified time when planning.
This last modified time is then shipped with the plan to
executors. Executors must then use both the lib file path
and the last modified time as a key for the lib-cache.
If the coordinator's last modified time is more recent than
the executor's lib-cache entry, then the entry is refreshed.
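Conceptually, the keying and refresh rule are as in the sketch below (the real
lib-cache is C++ inside impalad; this is only an illustration):

class LibCache(object):
    def __init__(self):
        self._entries = {}  # lib file path -> (last modified time, handle)

    def get(self, path, coord_last_modified_time, load_fn):
        entry = self._entries.get(path)
        if entry is None or entry[0] < coord_last_modified_time:
            # The coordinator saw a newer copy at plan time: refresh the entry.
            entry = (coord_last_modified_time, load_fn(path))
            self._entries[path] = entry
        return entry[1]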
Brief discussion of alternatives:
- lib-cache always checks last modified time
+ easy/local change to lib-cache
- adds an fs lookup always. rejected for this reason
- keep the last modified time in the catalog
- bound on staleness is too loose. consider the case where
fn's f1, f2, f3 are created with last modified times of
t1, t2, t3. treat the fn's last modified time as a low-watermark;
if the cache entry has a more recent time, use it. Such a scheme
would allow the version at t2 to persist. An old fn may keep the
state from converging to the latest. This could end up with strange
cases where different versions of the lib are used across executors
for a single query.
In contrast, the change in this patch relies on the statestore to
push versions forward at all coordinators, and so pushes all
versions at all caches forward as well.
Testing:
- added an e2e custom cluster test
Change-Id: Icf740ea8c6a47e671427d30b4d139cb8507b7ff6
Reviewed-on: http://gerrit.cloudera.org:8080/9697
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Adds support for building against two sets of Hadoop ecosystem
components. The control variable is IMPALA_MINICLUSTER_PROFILE_OVERRIDE,
which can either be set to 2 (for Hadoop 2, Hive 1, and so on) or 3 (for
Hadoop 3, Hive 2, and so on).
We intend (in a trivial follow-on change soon) to make 3 the new default
and to explicitly deprecate 2, but this change does not switch the
default yet. We support both to facilitate a smoother transition, but
support for profile 2 will be removed soon in the Impala 3.x line.
The switch is done at build time, following the pattern from IMPALA-5184
(build fe against both Hive 1 & 2 APIs). Switching back and forth
requires running 'cmake' again. Doing this at build-time avoids
complicating the Java code with classloader configuration.
There are relatively few incompatible APIs. This implementation
encapsulates that by extracting some Java code into
fe/src/compat-minicluster-profile-{2,3}. This follows the
pattern established by IMPALA-5184 (build fe against both Hive 1 & 2
APIs), but, to avoid a proliferation of directories, I've
consolidated the Hive changes into the same directory structure.
For Maven, I introduced Maven "profiles" to handle the two cases where
the dependencies (and exclusions) differ. These are driven by the
$IMPALA_MINICLUSTER_PROFILE environment variable.
For Sentry, exception class names changed. We work around this by adding
"isSentry...(Exception)" methods with two different implementations.
Sentry is also doing some odd shading, whereby some exceptions are
"sentry.org.apache.sentry..."; we handle both. Similarly, the mechanism
to create a SentryAuthProvider is slightly different. The easiest way to
see the differences is to run:
diff -u fe/src/compat-minicluster-profile-{2,3}/java/org/apache/impala/util/SentryUtil.java
diff -u fe/src/compat-minicluster-profile-{2,3}/java/org/apache/impala/authorization/SentryAuthProvider.java
The Sentry work is based on a change by Zach Amsden.
In addition, we recently added an explicit "refresh" permission. In
Sentry 2, this required creating an ImpalaPrivilegeModel to capture
that. It's a slight customization of Hive's equivalent class.
For Parquet, the difference is even more mechanical. The package names
gone from "parquet" to "org.apache.parquet". The affected code
was extracted into ParquetHelper, but only one copy exists. The second
copy is generated at build-time using sed.
In the rare cases where we need to behave differently at runtime,
MiniclusterProfile.MINICLUSTER_PROFILE is a class which encapsulates
what version we were built against. One of the cases is the results
expected by various frontend tests. I avoided the issue by translating
one error string into another, which handled the divergence in one place,
rather than complicating the several locations which look for "No
FileSystem for scheme..." errors.
The HBase APIs we use for splitting regions at test time changed.
This patch includes a re-write of that code for the new APIs. This
piece was contributed by Zach Amsden.
To work with newer versions of dependencies, I updated the version of
httpcomponents.core we use to 4.4.9.
We (Thomas Tauber-Marshall and I) uploaded new Hadoop/Hive/Sentry/HBase
binaries to s3://native-toolchain, and amended the shell scripts to
launch the right things. There are minor mechanical differences. Some
of this was based on earlier work by Joe McDonnell and Zach Amsden.
Hive's logging changed in Hive 2, necessitating creating a
log4j2.properties template and using it appropriately. Furthermore,
Hadoop 3's new shell script re-writes do a certain amount of classpath
de-duplication, causing some issues with locating the relevant logging
configurations. Accommodations exist in the code to deal with that.
parquet-filtering.test was updated to turn off stats filtering. Older
Hive didn't write Parquet statistics, but newer Hive does. By turning
off stats filtering, we test what the test had intended to test.
For views-compatibility.test, it seems that Hive 2 has fixed certain
bugs that we were testing for in Hive. I've added a
HIVE=SUCCESS_PROFILE_3_ONLY mechanism to capture that.
For AuthorizationTest, different Hive versions show slightly different
things for extended output.
To facilitate easier reviewing, the following files are 100% renames as identified by git; nothing
to see here.
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetCatalogsReq.java (100%)
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetColumnsReq.java (100%)
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetFunctionsReq.java (100%)
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetInfoReq.java (100%)
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetSchemasReq.java (100%)
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetTablesReq.java (100%)
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/impala/compat/MetastoreShim.java (100%)
rename fe/src/{compat-hive-2 => compat-minicluster-profile-3}/java/org/apache/impala/compat/MetastoreShim.java (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/hadoop/conf/kms-acls.xml.tmpl (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/hadoop/conf/kms-site.xml.tmpl (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/hadoop/conf/yarn-site.xml.tmpl (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-common (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-master (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-tserver (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/kudu/master.conf.tmpl (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/kudu/tserver.conf.tmpl (100%)
CreateTableLikeFileStmt had a chunk of code moved to ParquetHelper.java. This
was done manually, but without changing anything except what Java required in
terms of accessibility and boilerplate.
rewrite fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java (80%)
copy fe/src/{main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java => compat-minicluster-profile-3/java/org/apache/impala/analysis/ParquetHelper.java} (77%)
Testing: Ran core & exhaustive tests with both profiles.
Cherry-picks: not for 2.x.
Change-Id: I7a2ab50331986c7394c2bbfd6c865232bca975f7
Reviewed-on: http://gerrit.cloudera.org:8080/9716
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
In wait-hdfs-replication, the frequent, eager restarts might slow the
HDFS replication down. HDFS should be restarted only if no progress is
made in a certain amount of time, and we should wait longer before
failing the data loading.
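The restart-on-stall logic is roughly the following (a sketch; the real script
is shell, and the helpers and thresholds here are illustrative):

import time

def wait_for_replication(count_under_replicated_blocks, restart_hdfs,
                         no_progress_timeout_s=300, overall_timeout_s=1800):
    deadline = time.time() + overall_timeout_s
    best = count_under_replicated_blocks()  # e.g. parsed from "hdfs fsck /"
    last_progress = time.time()
    while best > 0:
        if time.time() > deadline:
            raise RuntimeError("HDFS replication did not finish in time")
        if time.time() - last_progress > no_progress_timeout_s:
            restart_hdfs()  # Only restart when replication has stalled.
            last_progress = time.time()
        time.sleep(10)
        current = count_under_replicated_blocks()
        if current < best:
            best = current
            last_progress = time.time()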
Testing: It's tested with a fake HDFS fsck script.
Change-Id: Ib059480254643dc032731b4b3c55204a93b61e77
Reviewed-on: http://gerrit.cloudera.org:8080/9698
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
The retries in split-hbase.sh don't work in the common case,
because $MINIKDC_PRINC_HIVE is not set in non-kerberized (common)
environments. The regular data load scripts (create-load-data.sh)
have code to manage that, but split-hbase.sh blindly forges ahead,
leading to errors like:
/home/impdev/Impala/testdata/bin/split-hbase.sh: line 49: MINIKDC_PRINC_HIVE: unbound variable
Error in /home/impdev/Impala/testdata/bin/create-load-data.sh at line 48: LOAD_DATA_ARGS=""
Since this hasn't been working, I opted to remove it entirely, as a failure on
the line where HBase splitting actually failed would be significantly more
useful than the error here. A search of mailing lists suggested that I was at
least the second person to have run into this. (In my case, I did break HBase
splitting, but it took me a second to identify the error, since the log was
spammed with unrelated information relating to the cluster restart.)
Testing: core tests.
Change-Id: I715891c9e744f21002330c3ae3ebc14095d94ffd
Reviewed-on: http://gerrit.cloudera.org:8080/9588
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins
HDFS sometimes fails to fully replicate all the blocks in 30 seconds
and no progress is made. This patch tries to restart HDFS several times
before aborting the data loading.
Change-Id: Iefd4c2fc6c287f054e385de52bdc42b0bdbd7915
Reviewed-on: http://gerrit.cloudera.org:8080/9469
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
When loading from an up-to-date snapshot, dataload will
load all of the metadata and load data into HDFS. Then,
it will skip load-data.py for functional/exhaustive,
tpch/core, and tpcds/core. It will invoke a special
round of load-data.py calls to populate Kudu tables,
and it always runs these with a force reload.
However, when loading from an old snapshot, dataload will
still load all of the metadata and load the data into
HDFS, but then it will still invoke load-data.py for
functional/exhaustive, tpch/core, and tpcds/core.
These invocations mostly do DDLs with very few load
statements. However, these invocations are a problem
for Kudu. The metadata of Impala tables referencing
Kudu entities has been imported along with all the other
metadata, but the Kudu entities have not been created, as
they are separate from HDFS. This means that Kudu tables
are not really valid in this circumstance.
Since Kudu has been added to the list of data formats
for tpch/core (see IMPALA-6475), load-data.py with
tpch/core will attempt to insert into these invalid
Kudu tables.
To avoid this, always force reload any Kudu tables.
generate-schema-statements.py will always generate a
drop table statement before any create of a Kudu table.
This guarantees that the create will also create the
corresponding Kudu entity.
Change-Id: I2d07f3513c543e2590f2f62b96b37472316868ee
Reviewed-on: http://gerrit.cloudera.org:8080/9445
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
Dataload currently executes bin/load-data.py for TPC-H,
TPC-DS, and functional-query concurrently. One of the final
steps for bin/load-data.py is to run a global "invalidate
metadata". Global "invalidate metadata" commands are known
to cause problems on concurrent systems. See IMPALA-5087.
For dataload, if TPC-H executes "invalidate metadata" while
TPC-DS is still creating tables and adding partitions,
the TPC-DS executor might erroneously believe that a table
does not exist.
This changes dataload to invalidate metadata at an
individual table level rather than globally. This
prevents the concurrency issue.
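The per-table invalidation amounts to something like the sketch below
(assuming an Impyla connection; the host, port, and helper are illustrative):

from impala.dbapi import connect

def invalidate_loaded_tables(db_name, table_names, host="localhost", port=21050):
    conn = connect(host=host, port=port)
    try:
        cursor = conn.cursor()
        for table in table_names:
            # Scoped to one table, so a concurrently loading workload is not
            # affected the way a global "invalidate metadata" would affect it.
            cursor.execute("invalidate metadata %s.%s" % (db_name, table))
        cursor.close()
    finally:
        conn.close()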
This also changes the names of some of the intermediate
SQL files generated by generate-schema-statements.py
and consumed by load-data.py to make them less confusing.
Change-Id: Ibc3a6d8a674a0bf6b02069bfe8a5e12034335b1f
Reviewed-on: http://gerrit.cloudera.org:8080/9009
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
When the data loading finishes, it is possible for some HDFS blocks to
be under-replicated. If Impala gets the metadata before the replication
is done, some tests may fail. This patch adds a replication waiting step
in the data loading script.
Resubmitted with a filesystem type check.
Change-Id: I64d9a8ea1d0a32b40047321b50a7139a8f48eac8
Reviewed-on: http://gerrit.cloudera.org:8080/8916
Reviewed-by: Vuk Ercegovac <vercegovac@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Using fsck breaks non-HDFS builds: local, S3, and Isilon.
This reverts commit 5a7c10ec3d.
Change-Id: I0b12a42049543ca0b267b5146a0bbcdd2316abfc
Reviewed-on: http://gerrit.cloudera.org:8080/8880
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
When the data loading finishes, it is possible for some HDFS blocks to
be under-replicated. If Impala gets the metadata before the replication
is done, some tests may fail. This patch adds a replication waiting step
in the data loading script.
Change-Id: I88dfb7165b7515b3e96111436be490f2068ec322
Reviewed-on: http://gerrit.cloudera.org:8080/8846
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
The two Kudu loads and the Hive UDF load can all run in parallel. This
should shave about 4 minutes off the data load. (Current
timings are 3.5, 4, and 0.6 minutes; see below.)
I've run dataload with this change many times.
Loading Kudu functional (logging to /home/ubuntu/Impala/logs/data_loading/load-kudu.log)...
Loading workload 'functional-query' using exploration strategy 'core' in table formats 'kudu/none/none' OK (Took: 3 min 29 sec)
Loading Kudu TPCH (logging to /home/ubuntu/Impala/logs/data_loading/load-kudu-tpch.log)...
Loading workload 'tpch' using exploration strategy 'core' in table formats 'kudu/none/none' OK (Took: 4 min 0 sec)
Loading Hive UDFs (logging to /home/ubuntu/Impala/logs/data_loading/build-and-copy-hive-udfs.log)...
Loading Hive UDFs OK (Took: 0 min 41 sec)
Change-Id: I7e93ee5a77ec9271b980b88bef7ad512ecbe0407
Reviewed-on: http://gerrit.cloudera.org:8080/8822
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Impala Public Jenkins
I re-created the original patch for IMPALA-6068, but only
performed what I believe to be the limited legal transformation
of data load: DEPENDENT_LOAD -> DEPENDENT_LOAD_HIVE.
Any place that directly uploads via hadoop or hdfs commands
was left alone, since changing it can't be proven to be correct.
Change-Id: I6c242cca209a7138b10ad517076707709b5cd204
Testing: Doing a full data load. I mistakenly changed a variable
name causing the first two dry-runs to fail.
Reviewed-on: http://gerrit.cloudera.org:8080/8690
Reviewed-by: Zach Amsden <zamsden@cloudera.com>
Tested-by: Zach Amsden <zamsden@cloudera.com>
This reverts commit e4f585240a.
Among other things, that commit replaced hdfs command line calls
with "LOAD DATA LOCAL INPATH" using Hive. However, doing so
presumes that the minicluster is the only test environment.
Sometimes, though, the data load script runs against a remote cluster,
and in those cases the data load process is now broken.
Change-Id: I6dc419934d2953eb950b14d090d7895ec57aa9f2
Reviewed-on: http://gerrit.cloudera.org:8080/8653
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Reviewed-by: Zach Amsden <zamsden@cloudera.com>
Tested-by: Impala Public Jenkins
With this commit, $IMPALA_MAVEN_OPTIONS is used by bin/mvn-quiet.sh
to configure Maven slightly. The default is no extra options.
This is handy for giving Maven a settings file with the "-s" flag, to
control, for example, repositories and their mirrors. In fact, I
considered exposing IMPALA_MAVEN_SETTINGS_FILE explicitly, but decided
that the generic option would be as good.
It's useful to customize how Maven works, especially
to provide a settings file with repository mirrors.
Change-Id: I2c62185476fd2388c7cda8884276b79a77370127
Reviewed-on: http://gerrit.cloudera.org:8080/8496
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
This is a revert of a revert, re-enabling parallel data load. It avoids
the race condition by explicitly configuring the temporary directory in
question in load-data.py.
When the parallel data load change went in, we discovered
a race with a signature of:
java.io.FileNotFoundException: File
/tmp/hadoop-jenkins/mapred/local/1508958341829_tmp does not exist
The number in this path is milliseconds since the epoch, and the race
occurs when two queries submitted to HiveServer2, running with the local
runner, hit the same millisecond time stamp. The upstream bug is
https://issues.apache.org/jira/browse/MAPREDUCE-6441, and I described the
symptoms in https://issues.apache.org/jira/browse/MAPREDUCE-6992 (which
is now marked as a dupe).
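The fix amounts to giving each Hive invocation its own scratch directory,
along these lines (the Hadoop property name below is an assumption for
illustration, not necessarily the one the patch sets):

import tempfile

def hive_args_with_private_local_dir(sql_file):
    local_dir = tempfile.mkdtemp(prefix="impala-data-load-")
    # A per-invocation directory means two parallel jobs can no longer collide
    # on the same millisecond-timestamp path.
    return ["beeline",
            "--hiveconf", "mapreduce.cluster.local.dir=%s" % local_dir,
            "-f", sql_file]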
I've tested this by running data load 5 times on the same machines
where it failed before. I also ran data load manually and inspected
the system to make sure that the temporary directories are getting
created as expected in /tmp/impala-data-load-*.
Change-Id: I60d65794da08de4bb3eb439a2414c095f5be0c10
Reviewed-on: http://gerrit.cloudera.org:8080/8405
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
We may be seeing a race with errors like "java.io.FileNotFoundException:
File /tmp/hadoop-jenkins/mapred/local/1508958341829_tmp does not exist".
This reverts commit e020c37106.
Change-Id: I46da93f4315a5a4bdaa96fa464cb51922bd6c419
Reviewed-on: http://gerrit.cloudera.org:8080/8386
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Dataload typically follows a pattern of loading data into
a text version of a table, and then using an insert
overwrite from the text table to populate the table for
other file formats. This insert is always done in Impala
for Parquet and Kudu. Otherwise it runs in Hive.
Since Impala doesn't support writing nested data, the
population of complextypes_fileformat tries to hack
the insert to run in Hive by including it in the ALTER
part of the table definition. ALTER runs immediately
after CREATE and always runs in Hive. The problem is
that ALTER also runs before the base table
(functional.complextypes_fileformat) is populated.
The insert succeeds, but it is inserting zero rows.
This code change introduces a way to force the Parquet
load to run using Hive. This lets complextypes_fileformat
specify that the insert should happen in Hive and fixes
the ordering so that the table is populated correctly.
This is also useful for loading custom Parquet files
into Parquet tables. Hive supports the LOAD DATA LOCAL
syntax, which can read a file from the local filesystem.
This means that several locations that currently use
the hdfs commandline can be modified to use this SQL.
This change speeds up dataload by a few minutes, as it
avoids the overhead of the hdfs commandline.
Any other location that could use LOAD DATA LOCAL is
also switched over to use it. This includes the
testescape* tables, which now print the appropriate
LOAD DATA commands as a result of text_delims_table.py.
Any location that already uses LOAD DATA LOCAL is also
switched to indicate that it must run in Hive. Any
location that was doing an HDFS command in the LOAD
section is moved to the DEPENDENT_LOAD_HIVE section.
Testing: Ran dataload and core tests. Also verified that
functional_parquet.complextypes_fileformat has rows.
Change-Id: I7152306b2907198204a6d8d282a0bad561129b82
Reviewed-on: http://gerrit.cloudera.org:8080/8350
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
This commit loads functional-query, TPC-H data, and TPC-DS data in
parallel. In parallel, these take about 37 minutes, dominated by
functional-query. Serially, these take about 30 minutes more, namely the
13 minutes of tpch and 16 minutes of tpcds. This works out nicely
because CPU usage during data load is very low in aggregate. (We don't
sustain more than 1 CPU of load, whereas build machines are likely to
have many CPUs.)
To do this, I added support to run-step.sh to have a notion of a
backgroundable task, and support waiting for all tasks.
I also increased the heapsize of our HiveServer2. When datasets
were being loaded in parallel, we ran out of memory at 256MB of heap.
The resulting log output is currently like so (but without the
timestamps):
15:58:04 Started Loading functional-query data in background; pid 8105.
15:58:04 Started Loading TPC-H data in background; pid 8106.
15:58:04 Loading functional-query data (logging to /home/impdev/Impala/logs/data_loading/load-functional-query.log)...
15:58:04 Started Loading TPC-DS data in background; pid 8107.
15:58:04 Loading TPC-H data (logging to /home/impdev/Impala/logs/data_loading/load-tpch.log)...
15:58:04 Loading TPC-DS data (logging to /home/impdev/Impala/logs/data_loading/load-tpcds.log)...
16:11:31 Loading workload 'tpch' using exploration strategy 'core' OK (Took: 13 min 27 sec)
16:14:33 Loading workload 'tpcds' using exploration strategy 'core' OK (Took: 16 min 29 sec)
16:35:08 Loading workload 'functional-query' using exploration strategy 'exhaustive' OK (Took: 37 min 4 sec)
I tested dataloading with the following command on an 8-core, 32GB
machine. I saw 19GB of available memory during my run:
./buildall.sh -testdata -build_shared_libs -start_minicluster -start_impala_cluster -format
Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
Reviewed-on: http://gerrit.cloudera.org:8080/8320
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Uses a thread pool to issue many compute stats commands in parallel to
Impala, rather than doing it serially. Where it was obvious, I combined
multiple stats commands into fewer, to reduce the number
of "show databses" and serialized "show tables" commands.
This speeds up the compute stats step in data loading significantly. My
measurements for testdata/bin/compute-table-stats.sh running before and
after this change, with the Impala daemons restarted (cold) or not
restarted (warm) on an 8-core, 32GB RAM machine were:
old, cold: 7m44s
new, cold: 1m42s
old, warm: 1m23s
new, warm: 48s
The data load in the full test build behaves in a cold fashion. It's
typical for https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/ to
run this compute stats step for 9 or 10 minutes. With this change, this
will come down to about 2 minutes.
Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
Reviewed-on: http://gerrit.cloudera.org:8080/8354
Reviewed-by: David Knupp <dknupp@cloudera.com>
Tested-by: Impala Public Jenkins
Some clients require a trailing semicolon for statements to
execute. The workaround is quite simple.
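Roughly (a hypothetical helper, not the exact code in the patch):

def ensure_trailing_semicolon(statement):
    statement = statement.rstrip()
    return statement if statement.endswith(";") else statement + ";"

assert ensure_trailing_semicolon("SELECT 1") == "SELECT 1;"
assert ensure_trailing_semicolon("SELECT 1; ") == "SELECT 1;"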
Change-Id: Id8b9f3dde4445513f1f389785a002c6cc6b3dada
Reviewed-on: http://gerrit.cloudera.org:8080/8132
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
Partition pruning has two mechanisms:
1) Simple predicates (e.g. binary predicates of the form
<SlotRef> <op> <LiteralExpr>) can be used to derive lists
of matching partition ids directly from the
partition key values. This is handled directly in the FE
and is very efficient for supported simple predicates.
2) General expr evaluation of predicates using the BE (via
FeSupport). This works for all predicates, so is the
mechanism used for predicates not supported by (1).
The issue was that (1) was being used when a binary
predicate contained an implicit cast on the SlotRef. While
this is OK when being evaluated by the BE, the simple
mechanism in (1) would not be able to match the partition
key values with the predicate literal because the partition
key values cannot be cast in the FE.
The fix is to force binary predicates involving a cast to be
evaluated in the BE.
Testing: A planner test was added to demonstrate the
expected partition pruning occurs.
Some modifications were made to the functional schema table
stringpartitionkey, so it will be necessary to reload those
tables:
load-data.py -w functional-query --table_names=stringpartitionkey
Change-Id: I94f597a6589f5e34d2b74abcd29be77c4161cd99
Reviewed-on: http://gerrit.cloudera.org:8080/7521
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
The TableFlattener takes a nested dataset and creates an equivalent
unnested dataset. The unnested dataset is saved as Parquet.
When an array or map is encountered in the original table, the flattener
creates a new table and adds an id column to it which references the row
in the parent table. Joining on the id column should produce the
original dataset.
The flattened dataset should be loaded into Postgres in order to run the
query generator (in nested types mode) on it. There is a script that
automates generating, flattening, and loading random data into Postgres
and Impala:
testdata/bin/generate-load-nested.sh -f
Testing:
- ran ./testdata/bin/generate-load-nested.sh -f and random nested data
was generated and flattened as expected.
Change-Id: I7e7a8e53ada9274759a3e2128b97bec292c129c6
Reviewed-on: http://gerrit.cloudera.org:8080/5787
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Occasionally we'd see HBase fail to start up properly on CentOS 7
clusters. The symptom was that HBase would not open the required nodes
in ZooKeeper that signal its readiness.
As a workaround, this change folds waiting for the ZooKeeper nodes
into the retry logic.
Change-Id: Id8dbdff4ad02cac1322e7d580e0a6971daf6ea28
Reviewed-on: http://gerrit.cloudera.org:8080/7159
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: anujphadke <aphadke@cloudera.com>
Reviewed-by: David Knupp <dknupp@cloudera.com>
Tested-by: Lars Volker <lv@cloudera.com>
This is a migration from an old and broken script from another
repository. Example use:
bin/single_node_perf_run.py --ninja --workloads targeted-perf \
--load --scale 4 --iterations 20 --num_impalads 3 \
--start_minicluster --query_names PERF_AGG-Q3 \
$(git rev-parse HEAD~1) $(git rev-parse HEAD)
The script can load data, run benchmarks, and compare the statistics
of those runs for significant differences in performance. It glues
together buildall.sh, bin/load-data.py, bin/run-workload.py, and
tests/benchmark/report_benchmark_results.py.
Change-Id: I70ba7f3c28f612a370915615600bf8dcebcedbc9
Reviewed-on: http://gerrit.cloudera.org:8080/6818
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
This change loads the missing tables in TPC-DS. In addition,
it fixes up the loading of the partitioned table store_sales
so all partitions will be loaded. The existing TPC-DS queries are
also updated to use the parameters for qualification runs as noted
in the TPC-DS specification. Some hard-coded partition filters were
also removed. They were there due to the lack of dynamic partitioning
in the past. Some missing TPC-DS queries are also added to this change,
including query28 which discovered the infamous IMPALA-5251.
Having all tables in TPC-DS available paves the way for us to include
all supported TPCDS queries in our functional testing. Due to the change
in the data, planner tests and the E2E tests have different results than
before. The results of E2E tests were compared against the run done with
Netezza and Vertica. The divergences were all due to the truncation behavior
of decimal types in DECIMAL_V1.
Change-Id: Ic5277245fd20827c9c09ce5c1a7a37266ca476b9
Reviewed-on: http://gerrit.cloudera.org:8080/6877
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
This change fixes IMPALA-4873 by adding the capability to supply a dict
'test_file_vars' to run_test_case(). Keys in this dict will be replaced
with their values inside test queries before they are executed.
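A minimal sketch of the substitution (the real run_test_case() has more
context around it):

def substitute_test_file_vars(query_text, test_file_vars=None):
    if test_file_vars:
        for key, value in test_file_vars.items():
            query_text = query_text.replace(key, value)
    return query_text

# e.g. substitute_test_file_vars("select * from $UNIQUE_DB.alltypes",
#                                {"$UNIQUE_DB": "test_db_1234"})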
Change-Id: Ie3f3c29a42501cfb2751f7ad0af166eb88f63b70
Reviewed-on: http://gerrit.cloudera.org:8080/6817
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
This patch adds support for running the stress test
(concurrent_select.py) and loading nested data (load_nested.py) into a
Kerberized, SSL-enabled Impala cluster. It assumes the calling user
already has a valid Kerberos ticket. One way to do that is:
1. Get access to a keytab and krb5.config
2. Set KRB5_CONFIG and KRB5CCNAME appropriately
3. Run kinit(1)
4. Run load_nested.py and/or concurrent_select.py within this
environment.
Because our Python clients already support Kerberos and SSL, we simply
need to make sure to use the correct options when calling the entry
points and initializing the clients:
Impala: Impyla
Hive: Impyla
HDFS: hdfs.ext.kerberos.KerberosClient
With this patch, I was able to manually do a short concurrent_select.py
run against a secure cluster without connection or auth errors, and I
was able to do the same with load_nested.py for a cluster that already
had TPC-H loaded.
Follow-ons for future cleanup work:
IMPALA-5263: support CA bundles when running stress test against SSL'd
Impala
IMPALA-5264: fix InsecurePlatformWarning under stress test with SSL
Change-Id: I0daad57bb8ceeb5071b75125f11c1997ed7e0179
Reviewed-on: http://gerrit.cloudera.org:8080/6763
Reviewed-by: Matthew Mulder <mmulder@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
The Kudu query tests were failing on a remote cluster because the Kudu
master was always set to '127.0.0.1', with no way to override it.
This patch corrects the issue with a number of changes:
- Add a pytest command line option to specify an arbitrary Kudu master
(a minimal sketch of this option follows the list below)
- Consolidate the place where the default Kudu master is derived. It
had been stored both in the env and in tests/common/__init__.py,
with different files looking to different places. For now, just look
to the env, and remove the value from __init__.py.
- The kudu_client test fixture in conftest.py was using the connect()
method from impala.dbapi (part of the Impyla library), without
specifying the host param. In the absence of that, the default value
is 'localhost', so add the host param to the connect() call.
- Define the various defaults for pytest config as constants at the top
of conftest.py.
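A minimal sketch of the conftest.py option (the option and environment
variable names here are illustrative):

import os

DEFAULT_KUDU_MASTER_HOST = os.environ.get("KUDU_MASTER", "localhost")

def pytest_addoption(parser):
    parser.addoption("--kudu_master", default=DEFAULT_KUDU_MASTER_HOST,
                     help="Host of the Kudu master to run Kudu tests against.")

# Fixtures that previously relied on connect()'s 'localhost' default then pass
# the host explicitly, e.g. connect(host=request.config.option.kudu_master).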
Change-Id: I9df71480a165f4ce21ae3edab6ce7227fbf76f77
Reviewed-on: http://gerrit.cloudera.org:8080/5877
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
This is the second patch to address IMPALA-4684. The first patch exposed
a transient Zookeeper connection error on RHEL7. This patch introduces a
retry (up to 3 times), and somewhat better logging.
Tested by running tests against an RHEL7 instance and confirming that
all HBase nodes start up.
Change-Id: I44b4eec342addcfe489f94c332bbe14225c9968c
Reviewed-on: http://gerrit.cloudera.org:8080/5554
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
If an exception (other than NoNodeError) was raised while checking for
HBase nodes, we weren't cleanly stopping the ZooKeeper client, which
in turn created a second exception when the connection was closed.
The second exception masked the original error condition.
Tested by forcibly raising unexpected errors while checking for HBase
nodes.
Change-Id: I46a74d018f9169385a9f10a85718044c31a24dbc
Reviewed-on: http://gerrit.cloudera.org:8080/5547
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
Testing:
Tested that buildall.sh works as expected. Built locally with
IMPALA_MAKE_FLAGS unset to confirm I didn't break anything.
Built locally with
IMPALA_MAKE_FLAGS=--load-average=$IMPALA_BUILD_THREADS and looked
at "ps auxf" output to confirm it's passed through.
Change-Id: I17b13cbaf395f962762d5cff3d650ffb077934a4
Reviewed-on: http://gerrit.cloudera.org:8080/5480
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This commit also removes the `DISTRIBUTE`, `SPLIT`, and
`BUCKETS` keywords that were going to be newly released in Impala 2.6,
but are now unused. Additionally, a few remaining uses of the
`DISTRIBUTE BY` syntax have been switched to `PARTITION BY`.
Change-Id: I32fdd5ef26c532f7a30220db52bdfbf228165922
Reviewed-on: http://gerrit.cloudera.org:8080/5382
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
run-step prints a message to tell the reader what it's doing. However,
that message wasn't flushed, because run-step held the line open so it could
print OK or FAILED on the same line. The result was that long-running steps
wouldn't print anything to the log until they were done, at least in Jenkins
contexts.
This patch changes it so that the message is flushed, and then the
result is printed on a separate line (including the time it took to run
the step).
$ run-step "Hello world!" helloworld.out sleep 5
Hello world! (logging to /tmp/helloworld.out)...
OK (Took: 0 min 5 sec)
Change-Id: Iaced729f0ef6aa93174cd90b1516d3c34fe41a22
Reviewed-on: http://gerrit.cloudera.org:8080/5116
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
On Ubuntu 14.04 on AWS EC2 m4.4x instances, these components
frequently take more than 30 seconds to start. I have seen the HMS
take more than 90 seconds; this patch sets a more conservative timeout
default.
Change-Id: I43eb8646cca495578c8f9730faa04812957d2917
Reviewed-on: http://gerrit.cloudera.org:8080/5068
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
This patch fixes a sed expression to make sure it only alters the code
it is meant to alter, not the comment describing the code.
Tested with tests/run-tests.py query_test/test_udfs.py
Change-Id: I51a0498d24b7fccc05b6183123501766cb36f85e
Reviewed-on: http://gerrit.cloudera.org:8080/5008
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
This patch lays the groundwork for loading data and running end-to-end
tests on a remote CDH cluster. The requirements for the cluster to run
the tests are:
- Managed by Cloudera Manager (CM)
- GPL Extras need to be installed
- KMS and KeyTrustee installed and available as a service
- SERDEPROPERTIES in the Hive DB modified to accept wide tables
- Hive warehouse dir points to /test-warehouse
The actual data loading is done via a new script, remote_data_load.py,
which takes the CM host as an argument. It can be run from a client
machine that is not a node of the cluster, but it needs to have the
Impala repo checked out and Impala built. This ensures that all of the
necessary data load scripts are available, as well as setting up the
environment properly (client binaries like beeline and the hbase shell
are available, python libraries like cm_api are installed, necessary
environment variables are defined, etc.)
It should be noted that running remote_data_load.py will overwrite
any local XML config files with the configurations downloaded from
the remote cluster.
Usage: remote_data_load.py [options] <cm_host address>
Options:
-h, --help show this help message and exit
--snapshot-file=SNAPSHOT_FILE
Path to the test-warehouse archive
--cm-user=CM_USER Cloudera Manager admin user
--cm-pass=CM_PASS Cloudera Manager admin user password
--gateway=GATEWAY Gateway host to upload the data from. If not
set, uses the CM host as gateway.
--ssh-user=SSH_USER System user on the remote machine with
passwordless SSH configured.
--no-load Do not try to load the snapshot
--exploration-strategy=EXPLORATION_STRATEGY
--test Run end-to-end tests against cluster
Testing:
This patch is being submitted with the understanding that there are
still cleanup issues that need to be addressed in the remote data
load script, for which JIRAs have been filed.
However, since many of the existing build scripts also had to be
modified, it is more important to make sure that no regressions were
inadvertently introduced into the existing data load process. Loading
data to a local mini-cluster was checked repeatedly while this patch
was being developed, as well as running it against the Jenkins job
that provides the test-warehouse snapshot used by the many other
Impala CI builds that run daily.
Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Reviewed-on: http://gerrit.cloudera.org:8080/4769
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
* Change run-step to output full log path
* Change text to say "Computing table stats" rather than "Computing
HBase stats" when running compute-table-stats.sh
Change-Id: I326f4c370fda8d5e388af8e2395623185c06bc07
Reviewed-on: http://gerrit.cloudera.org:8080/4825
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins