This patch does the following:
- Removes code that deals with executing queries through Hive.
- Gives the user the option to specify only the hostname for the Impalads.
- Moves the execution functions to their own .py file.
- Removes some duplicate code (exec_shell_cmd -> exec_process)
Change-Id: If49951c7bb5423ef9343d4d211f6da13d397325a
Reviewed-on: http://gerrit.cloudera.org:8080/862
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
Allowing a remote debugger to connect to the Impala daemon is very helpful
for debugging the frontend.
You can also remote debug Impala running on a real cluster by
setting JAVA_TOOL_OPTIONS="-agentlib:jdwp=transport=dt_socket,
server=y,address=30000,suspend=n,quiet=y"
in Impala Environment Advanced Configuration via CM.
Change-Id: I761c5b2229d107ca4559c220488838b85fc14d53
Reviewed-on: http://gerrit.cloudera.org:8080/671
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
This patch provides the last fixes to finally enable the toolchain:
- Remove static OpenSSL dependency
- Fix inline assembly problems in ASAN
- Work around issues with non-relocatable LLVM 3.3 by adding manual system
includes to fix hardcoded header paths in clang.
When the toolchain is enabled and we build for ASAN we use a specific
toolchain file to build with LLVM-trunk as the main compiler. Even
though this uses LLVM-trunk for compiling the Impala code, this will use
LLVM 3.3 for codegen. In addition, this enables us to follow up with
TSAN and LEAKSAN.
Change-Id: I0abb914ca3f192cb7edd83ead134bc9e2d02071f
Reviewed-on: http://gerrit.cloudera.org:8080/556
Tested-by: Internal Jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
I had missed a few places in my original commit:
2b43828fe22dbb17b4f6df875fc59af6772f3984
Change-Id: I55c0d0a79f6c3416f6ba64cfbf4c1dbb4293bd36
Reviewed-on: http://gerrit.cloudera.org:8080/616
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
The symptom of this bug was that we were seeing "ValueError: bad marshal data"
when trying to import from tests.hs2.test_hs2 during custom cluster tests.
The problem was that we were not running the custom cluster tests through the
new Impala Python virtualenv.
Some tests (properly running with the virtualenv) that run before the custom
cluster tests had caused the generation of pyc files for tests.hs2.test_hs2.
Those pyc files then appeared corrupted when executing the custom cluster
tests because the default python env runs a different Python version than the
virtualenv that generated those pyc files in earlier tests.
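To illustrate the failure mode: compiled pyc files are tied to the interpreter
version that wrote them, and each one starts with a version-specific magic
number. The following is a minimal sketch of checking that number before
trusting a pyc file. It is written for Python 3 (not the Python 2.6 discussed
here), and the helper name pyc_matches_interpreter is hypothetical, not part of
the test infrastructure.

```python
import importlib.util
import os
import py_compile
import tempfile


def pyc_matches_interpreter(pyc_path):
    """Return True if the pyc file starts with this interpreter's magic number.

    Hypothetical helper for illustration only.
    """
    with open(pyc_path, "rb") as f:
        return f.read(4) == importlib.util.MAGIC_NUMBER


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "example.py")
        with open(src, "w") as f:
            f.write("x = 1\n")
        # Compile with the running interpreter, then verify the header.
        pyc = py_compile.compile(src, cfile=os.path.join(tmp, "example.pyc"))
        print(pyc_matches_interpreter(pyc))
```

A pyc file written by a different interpreter version fails this check, which
is the situation the custom cluster tests ran into.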
Change-Id: Ie9d8f90c65921247dd885804165f9b7271ea807b
Reviewed-on: http://gerrit.cloudera.org:8080/618
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
Python tests and infra scripts will now use "python" from the virtualenv
via $IMPALA_HOME/bin/impala-python. Some scripts could be simplified now
that python 2.6 and a dependable set of third-party libraries are
available but that is not done as part of this commit.
Change-Id: If1cf96898d6350e78ea107b9026b12ba63a4162f
Reviewed-on: http://gerrit.cloudera.org:8080/603
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
This adds a bootstrap script and an "impala-python" command to
$IMPALA_HOME/bin that automatically runs the bootstrap and redirects to
the virtualenv python. Existing python scripts will later be updated to
use this new "impala-python" command.
The bootstrap script will build a virtualenv to ensure a minimum python
version (2.6) and a well known set of dependencies. The bootstrap script
can be run with python 2.4 but 2.6 must already be installed on the
system. The resulting virtualenv will use 2.6 at a minimum.
Only dependencies explicitly listed in requirements.txt will be
installed and available (no system packages will ever be used). No
packages will ever be downloaded when setting up the virtualenv. In the
future new dependencies can be added by editing the requirements.txt
file. Installation through requirements.txt is a standard pip feature.
When requirements.txt is updated, the next run of "impala-python" will
rebuild the virtualenv.
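One way such a rebuild trigger could work is sketched below. This is an
assumption for illustration, not a description of the actual bootstrap script:
it stores a digest of requirements.txt in a stamp file and reports a rebuild
whenever the digest changes. The function names and the stamp file are
hypothetical.

```python
import hashlib
import os


def requirements_digest(requirements_path):
    """Return a SHA-256 hex digest of the requirements file contents."""
    with open(requirements_path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def needs_rebuild(requirements_path, stamp_path):
    """True when no stamp exists or the stamp differs from the current digest."""
    if not os.path.exists(stamp_path):
        return True
    with open(stamp_path) as f:
        return f.read().strip() != requirements_digest(requirements_path)


def record_build(requirements_path, stamp_path):
    """Write the current digest after a successful virtualenv build."""
    with open(stamp_path, "w") as f:
        f.write(requirements_digest(requirements_path))
```

A wrapper command would call needs_rebuild() on every invocation, rebuild the
virtualenv if it returns True, and then call record_build().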
Change-Id: I150595d7e09a45d5f2e3c30a845bc8d6a761eeed
Reviewed-on: http://gerrit.cloudera.org:8080/424
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
The SASL install directory was hardcoded to a path under /tmp. This solved
a very specific problem with ~'s in paths but could not be overridden. This
can lead to lost time when /tmp is cleared out periodically or on reboots.
IMPALA_CYRUS_SASL_INSTALL_DIR now defaults to a path under thirdparty unless
there is a tilde in the path. It can also be overridden by setting the
IMPALA_CYRUS_SASL_INSTALL_DIR environment variable.
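The selection logic described above can be sketched as follows. The real logic
lives in shell configuration; this Python version is only an illustration. The
subdirectory name under thirdparty and the /tmp fallback path are assumptions,
as is falling back to /tmp at all when the path contains a tilde.

```python
import os


def sasl_install_dir(impala_home, env=None):
    """Pick the SASL install directory (illustrative sketch only)."""
    env = os.environ if env is None else env
    # An explicit override always wins.
    override = env.get("IMPALA_CYRUS_SASL_INSTALL_DIR")
    if override:
        return override
    # Hypothetical default location under thirdparty.
    default = os.path.join(impala_home, "thirdparty", "cyrus-sasl-install")
    if "~" in default:
        # Assumed fallback for paths containing a tilde.
        return os.path.join("/tmp", "impala-cyrus-sasl-install")
    return default
```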
Change-Id: I1aea2b51d265e3d1f04be0c915dcbee57c863be6
Reviewed-on: http://gerrit.cloudera.org:8080/536
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
The tests were doing unnecessary things. One such thing that stopped
working with the virtualenv patch was searching for the shell process to
get the pid. The search was never needed since the process was spawned
with Popen which provides the pid directly.
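As a small illustration of the point above, subprocess.Popen exposes the pid of
the spawned process as an attribute, so no process search is needed. The helper
name spawn is hypothetical and the command is a stand-in for the shell.

```python
import subprocess
import sys


def spawn(cmd):
    """Start cmd and return (process, pid); the pid comes straight from Popen."""
    proc = subprocess.Popen(cmd)
    return proc, proc.pid


if __name__ == "__main__":
    # Stand-in command; the tests would launch the shell process instead.
    proc, pid = spawn([sys.executable, "-c", "pass"])
    print("spawned pid %d" % pid)
    proc.wait()
```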
Change-Id: I2455e58de4fdba8fd2770f0489fac8cddf6b90a0
Reviewed-on: http://gerrit.cloudera.org:8080/555
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
A previous change imported psutil only if it was present on the system and always
returned True for check_process_exists. This broke the functionality of --kill, wherein we
throw an error if the process has not been killed.
This patch adds a method to check for the existence of psutil and only checks whether a
process has been killed if psutil is available.
Change-Id: I679ce12dc7e2732a8a95d5825c31d8a1bec354ec
Reviewed-on: http://gerrit.cloudera.org:8080/541
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
Capitalize default for IMPALA_AUX_TEST_HOME in
bin/impala-config.sh so that it correctly matches the directory name.
The jenkins scripts will not be affected because they explicitly set
the variable and already use the correct capitalization.
Change-Id: I674ddfd38bc1a13721674e433e03cc66baad2cfc
Reviewed-on: http://gerrit.cloudera.org:8080/543
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
buildall invokes start-impala-cluster.py --kill --force in order to ensure that there are
no Impala daemons running on the system. The recent introduction of psutil in the start
script breaks the full CDH build as it's not a standard python module.
This patch imports psutil only inside the method where it is used and disables its usage
with a warning in the case that it's not found.
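A minimal sketch of this pattern follows. The function names are hypothetical
and do not mirror the start script; psutil.pid_exists is a real psutil call.
The sketch returns None when the check is skipped, which is a design choice
made here so that callers can tell "unknown" apart from "exists".

```python
import importlib
import logging
import os


def load_psutil():
    """Import psutil lazily; return None (with a warning) if it is missing."""
    try:
        return importlib.import_module("psutil")
    except ImportError:
        logging.warning("psutil not found; process checks are disabled")
        return None


def process_exists(pid):
    """Return True/False when psutil is available, or None if the check is skipped."""
    psutil = load_psutil()
    if psutil is None:
        return None
    return psutil.pid_exists(pid)


if __name__ == "__main__":
    print(process_exists(os.getpid()))
```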
Change-Id: Ic2fce81b6d7af2722e0e23c2a580c30b86144aa1
Reviewed-on: http://gerrit.cloudera.org:8080/540
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
This patch checks that the processes launched by start-impala-cluster.py
are actually started or killed. During startup, it checks if the
appropriate processes are really killed. This is useful in cases where a
similar process was started by another user and thus prevents this mini
cluster from starting. For the statestore and the catalog this patch adds an
additional check after the processes were launched to verify their
existence.
Change-Id: Idfd6a11fd72278ddf180dc537459582b4392a109
Reviewed-on: http://gerrit.cloudera.org:8080/521
Tested-by: Internal Jenkins
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
start-impala-cluster.py takes arguments to pass through to
impalads, catalogd, and statestored, but all arguments must
be passed as a single string today. This changes the arg
parsing to also accept multiple arguments passed as
different parameters.
E.g. the following previously (and still) works:
> start-impala-cluster.py --impalad_args="-v=2 -memlimit=1G"
Now args can optionally be passed separately:
> start-impala-cluster.py --impalad_args=-v=2 --impalad_args=-memlimit=1G
This is helpful in general, but is needed to enable passing
through some arguments from environment variables in jenkins
jobs.
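The accepted forms can be sketched with argparse as shown below. This is an
illustration under assumptions, not the parsing code in
start-impala-cluster.py: it uses action="append" to collect repeated options
and joins them into one string. The function name is hypothetical.

```python
import argparse


def parse_impalad_args(argv):
    """Collect every --impalad_args occurrence and join them into one string."""
    parser = argparse.ArgumentParser()
    # action="append" gathers each occurrence of the option into a list.
    parser.add_argument("--impalad_args", action="append", default=[])
    opts = parser.parse_args(argv)
    return " ".join(opts.impalad_args)


if __name__ == "__main__":
    print(parse_impalad_args(["--impalad_args=-v=2 -memlimit=1G"]))
    print(parse_impalad_args(["--impalad_args=-v=2", "--impalad_args=-memlimit=1G"]))
```

Both invocations yield the same combined argument string, so the single-string
form keeps working alongside the repeated form.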
Change-Id: I32f7a75ec4ce8f5ce878b3e7f76880a731842c14
Reviewed-on: http://gerrit.cloudera.org:8080/510
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
Additionally, the pom.xml is changed to pull in the shaded sentry-provider-db jar to
account for the difference in thrift versions.
Change-Id: I24d64d2b21712e76d9ad51551ee87fd37a738641
This patch moves the files that deal with evaluating performance to their own folder.
Additionally, it changes the code to adhere to python conventions by using single
underscores instead of double underscores.
Change-Id: I9c96f51f33dfbc60d3121fa1ff68bfac6480e2c2
Reviewed-on: http://gerrit.cloudera.org:8080/471
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
This patch makes it possible to optionally enable the new Impala binary
toolchain. For now there are no major version differences between the
toolchain dependencies and what is currently kept in thirdparty.
To enable the toolchain, export the variable IMPALA_TOOLCHAIN to the
folder where the binaries are available.
In addition this patch moves gutil from the thirdparty directory into
the source tree of be/src to allow easy propagation of compiler and
linker flags. Furthermore, the thrift-cpp target was added as a
dependency to all targets that require the generated thrift sources to
be available before the build is started.
What is the new toolchain: The goal of the toolchain is to homogenize
the build environment and to make sure that Impala is built nearly
identically on every platform. To achieve this, we limit the flexibility
of using the system's host libraries and instead rely on a set of
custom-produced binaries, including the necessary compiler.
Change-Id: If2dac920520e4a18be2a9a75b3184a5bd97a065b
Reviewed-on: http://gerrit.cloudera.org:8080/427
Reviewed-by: Adar Dembo <adar@cloudera.com>
Tested-by: Internal Jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
This patch enables running Impala tests against Isilon as the default file system. The
intention is to run tests against a realistic deployment, i.e., Isilon replacing HDFS as
the underlying filesystem.
Specifically, it does the following:
- Adds a new environment variable DEFAULT_FS, which points to HDFS by default.
- Makes the fs.defaultFS property in core-site.xml use the DEFAULT_FS environment
variable, such that all clients talk to Isilon implicitly.
- Unsets FILESYSTEM_PREFIX when the TARGET_FILESYSTEM is Isilon, since path prefixes
are no longer needed.
- Only starts the Hive Metastore and the Impala service stack when running
tests against Isilon.
We don't start KMS/HBase because they're not relevant to Isilon. We also don't
start YARN, Hive and Llama because Hive queries are disabled with Isilon.
The scripts that start/stop Hive, YARN and Llama should be modified to point to a
filesystem other than HDFS in the future.
Change-Id: Id66bfb160fe57f66a64a089b465b536c6c514b63
Reviewed-on: http://gerrit.cloudera.org:8080/449
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
This commit will be backported to 5.4.x to improve plans when using
Isilon and S3.
The planner currently estimates the number of backends that an hdfs scan
node will execute on as the number of datanodes holding block replicas
for the corresponding table. This can be a bad estimate for various reasons:
1) It's completely wrong when the scan is remote (e.g. S3 or Isilon).
2) It doesn't account for partition pruning.
3) The size of the set of hosts holding block replicas may be larger than
the number of scan ranges.
Improve the estimate by examining the scan ranges and taking locality into
account. While this new estimate will eventually be used in all cases,
this change uses the new estimate only when there is a remote scan range,
so as not to change plans produced for local ranges (since this commit will
be backported to 5.4.x). So, this commit purposely addresses only case
1. A follow-on commit will enable the new logic for all cases.
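The weaknesses of the old estimate can be shown with toy data. The sketch
below is NOT the planner's formula; capped_estimate is a hypothetical
alternative that only demonstrates why looking at scan ranges helps. Each scan
range is modeled as the list of hosts holding a replica, with an empty list
standing for a remote range.

```python
def old_estimate(replica_hosts_by_range):
    """Count distinct hosts holding any replica (the prior approach, simplified)."""
    hosts = set()
    for replica_hosts in replica_hosts_by_range:
        hosts.update(replica_hosts)
    return len(hosts)


def capped_estimate(replica_hosts_by_range, cluster_size):
    """Illustrative alternative, not the actual planner logic.

    Counts hosts with local data plus remote ranges, capped by the number of
    scan ranges and by the cluster size.
    """
    local_hosts = set()
    remote_ranges = 0
    for replica_hosts in replica_hosts_by_range:
        if replica_hosts:
            local_hosts.update(replica_hosts)
        else:
            remote_ranges += 1
    estimate = len(local_hosts) + remote_ranges
    return min(estimate, len(replica_hosts_by_range), cluster_size)


if __name__ == "__main__":
    remote_scan = [[] for _ in range(8)]           # case 1: no local replicas
    few_ranges = [["h1", "h2", "h3"], ["h4", "h5", "h6"]]  # case 3
    print(old_estimate(remote_scan), capped_estimate(remote_scan, 3))
    print(old_estimate(few_ranges), capped_estimate(few_ranges, 10))
```

With these inputs the old estimate is zero for the remote scan and exceeds the
number of scan ranges in the second case, while the capped version stays
within both bounds.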
Also set up the S3PlannerTest so that we can enable it in the nightly
jenkins S3 run. It was inadvertently never enabled there.
Change-Id: I3fd3f7c5431a535fb044c98c326338c21b8a1898
Reviewed-on: http://gerrit.cloudera.org:8080/425
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
impala-config.sh had three bugs that are fixed with this patch:
1) Not finding the postgres driver will return from the script, not
exit the shell.
2) HADOOP_CLASSPATH uses a correct wildcard path specifier and does not
rely on shell expansion anymore.
3) IMPALA_HOME is set correctly even if BASH_SOURCE is not available
for ZSH shells.
Change-Id: Ifbcf62c643cade43a9007f9bb780fc650760df0e
Reviewed-on: http://gerrit.cloudera.org:8080/407
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
By doing so, we avoid unnecessarily calling the copy constructor for
Status OK objects and loading the value from memory (due to the old
Status::OK being a global). The impact of this patch was validated by
inspecting both optimized assembly code and generated IR code.
Applying this patch has some effect on the amount of generated code. The
new tool `get_code_size` will list the text, data, and bss sizes for all
archives that we produce in a release build. This patch reduces the code
size by ~20 kB.
       Text      Data    BSS
Old    10578622  576864  40825
New    10559367  576864  40809
The majority of the changes in this patch have been mechanically applied
using:
  find be/src -name "*.cc" -or -name "*.h" | \
    xargs sed -i 's/Status::OK;/Status::OK\(\);/'
A new micro-benchmark was added to determine the overhead of using
Status in hot code sections.
Machine Info: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
status: Function                          Rate (iters/ms)  Comparison
----------------------------------------------------------------------
        Call Status::OK()                 9.555e+08        1X
        Call static Status::Error         4.515e+07        0.04725X
        Call Status(Code, 'string')       9.873e+06        0.01033X
        Call w/ Assignment                5.422e+08        0.5674X
        Call Cond Branch OK               5.941e+06        0.006218X
        Call Cond Branch ERROR            7.047e+06        0.007375X
        Call Cond Branch Bool (false)     1.914e+10        20.03X
        Call Cond Branch Bool (true)      1.491e+11        156X
        Call Cond Boost Optional (true)   3.935e+09        4.118X
        Call Cond Boost Optional (false)  2.147e+10        22.47X
Change-Id: I1be6f4c52e2db8cba35b3938a236913faa321e9e
Reviewed-on: http://gerrit.cloudera.org:8080/351
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
This patch makes the following changes in our pom to reduce
the build time and significantly reduce console spew.
1. Remove jar-with-dependencies from package goal.
We have no need for creating an uber jar that contains the FE as well
as all its dependencies. Locally, we carefully construct our class path
manually (relying on copy-dependencies), and in Impala deployments
the FE jar is put together with the other dependencies, so the FE jar
does not need to be self-contained.
2. Silence copy-dependencies.
Changes the configuration of the maven-dependency-plugin to not
log every copied file to the console.
Change-Id: If351e4e800fd1ca1108f9a0f4d88f52a53fc211c
Reviewed-on: http://gerrit.cloudera.org:8080/378
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Adds a flag to start-impala-cluster.py (--enable_rm) to set up the
mini Impala cluster using Yarn and Llama. This hides a number of
flags that must be set on the impalads:
-enable_rm
-llama_addressess: set to the local llama service
-fair_scheduler_allocation_path: set to the path of the fair-scheduler.xml
in each node's hadoop conf directory
-cgroup_hierarchy_path: set to a path in the CPU cgroup hierarchy which
has the correct permissions for Impala to manage a child cgroup. The
path comes from cgroups.py.
The new module cgroups.py was added to contain cgroups-related
utilities. Right now it provides paths to the CPU controller
hierarchy root and a path within the hierarchy that can be used
for impalads (i.e. have the proper permissions, one for each
cluster node).
Change-Id: Ic2181ec5613c180f240958c84f885c6b136a64d4
Reviewed-on: http://gerrit.cloudera.org:8080/369
Tested-by: Internal Jenkins
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
This will enable jenkins jobs to specify custom arguments when
starting the mini cluster (via start-impala-cluster.py). This
will be used to create a jenkins job that runs tests with RM
enabled.
Change-Id: I96a2e8d90db448581bbf448f3df514381f79fb27
Reviewed-on: http://gerrit.cloudera.org:8080/380
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
This patch enables the Impala test suite to run the end to end tests
against an Isilon namenode. There are a few caveats:
- The fe test will currently not work.
- Only loading data from both the test-warehouse snapshot and the metadata snapshot is
supported.
- The test suite cannot be run by multiple people (unless we have access to multiple
Isilon namenodes)
Change-Id: I786b4e4f51b99e79ad42abc676f537ebfc189237
Reviewed-on: http://gerrit.cloudera.org:8080/356
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
This patch changes the workflow of test execution:
- All FE/BE/JDBC tests are executed in spite of failures.
- End to end pytests also continue to execute until a threshold is reached. This
threshold is set to 10, but is overridable by the environment variable
MAX_PYTEST_FAILURES
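The threshold handling can be sketched as below. The default of 10 and the
MAX_PYTEST_FAILURES variable come from this message; the function names are
hypothetical, and passing the value through pytest's --maxfail option is an
assumption about how the limit might be applied, not a description of the test
runner.

```python
import os


def max_pytest_failures(env=None, default=10):
    """Read the failure threshold, falling back to the default of 10."""
    env = os.environ if env is None else env
    return int(env.get("MAX_PYTEST_FAILURES", default))


def pytest_failure_args(env=None):
    """Build the pytest option that stops the run after N failures."""
    return ["--maxfail=%d" % max_pytest_failures(env)]


if __name__ == "__main__":
    print(pytest_failure_args())
```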
It also adds extra debugging information for end to end tests. Failures are reported
immediately after a test fails. The entire execution report is still displayed at the
end.
Change-Id: I3a4f446e74dbc6feb5799226e109fc1eebe48733
Reviewed-on: http://gerrit.cloudera.org:8080/326
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
This commit contains changes that allow Impala to be compiled, but not
yet run, on CentOS 7. This commit adds some missing flags that specify
library locations.
Change-Id: I73aff2f75b6d0f7a3c13349665ac98a0286e7ebd
Reviewed-on: http://gerrit.cloudera.org:8080/313
Reviewed-by: Alex Leblang <alex.leblang@cloudera.com>
Tested-by: Internal Jenkins
Previously, some frontend tests depended upon state left behind by the end to end tests.
This is no longer the case, so we should run the faster tests first.
Change-Id: I0fa50a6916d76a4d0431e7fb2cc83e6e437b108b
Reviewed-on: http://gerrit.cloudera.org:8080/321
Reviewed-by: Alex Leblang <alex.leblang@cloudera.com>
Tested-by: Internal Jenkins
When executing custom commands / custom targets in parallel, the
dependency resolution for these targets will happen in parallel as
well. If two custom targets are started sufficiently close together in
a clean build, each of the targets will execute all dependencies. In the
worst case, both targets will try to overwrite files the other target
has already written, leading to either corrupt archives or missing code.
Some background can be found here:
http://www.cmake.org/pipermail/cmake/2011-July/045256.html
This patch separates the two IR targets and moves them to the end of the
compile chain.
Change-Id: I5b26ebd1c3421788fd22e6a09ef96dd6b944e89e
Reviewed-on: http://gerrit.cloudera.org:8080/318
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
Various build and test machines have multiple versions of java
installed and relying on the default "java" command being compatible
isn't practical (a machine may also build an older version of Impala
that might require a different java version). Since JAVA_HOME is already
required, it can and should be used to determine which java binary to use.
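A minimal sketch of resolving the binary from JAVA_HOME rather than from the
PATH follows. The helper name is hypothetical and the error handling is an
assumption made for illustration.

```python
import os


def java_binary(env=None):
    """Return the java executable under JAVA_HOME instead of relying on PATH."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        # JAVA_HOME is already required, so treat its absence as an error.
        raise RuntimeError("JAVA_HOME must be set")
    return os.path.join(java_home, "bin", "java")
```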
This also includes a minor change to replace a block of code that was
using 4-space indent. Instead of using 2-space indent, that block was
replaced with one line.
Change-Id: I4b8698b2aa5411b5fa6c5bc06291625999478955
Reviewed-on: http://gerrit.cloudera.org:8080/310
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
Currently, it's not possible to run jdbc tests without running fe tests. This patch
makes that possible.
Change-Id: I77ec336cc31b231b43008a99f7c3a9d48a0f3fda
Reviewed-on: http://gerrit.cloudera.org:8080/197
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
This patch enables the user to set an environment variable specifying the number of
iterations to run all the tests. The configuration can be overridden via the command line.
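The precedence described here (command line over environment) can be sketched
with argparse by using the environment value as the option default. The
variable name NUM_TEST_ITERATIONS, the --iterations flag and the fallback of 1
are all hypothetical, since this message does not name them.

```python
import argparse
import os


def resolve_iterations(argv, env=None):
    """Command-line value wins; otherwise use the environment; otherwise 1."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--iterations",
        type=int,
        # Hypothetical variable name, used only as the default.
        default=int(env.get("NUM_TEST_ITERATIONS", 1)),
    )
    return parser.parse_args(argv).iterations
```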
Change-Id: I1da545201f1b59697344e62ae8edf5f9fb3cd92d
Reviewed-on: http://gerrit.cloudera.org:8080/188
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
This patch introduces the concept of error codes for errors that are
recorded in Impala and are going to be presented to the client. These
error codes are used to aggregate and group incoming error / warning
messages to reduce the spill on the shell and increase the usefulness of
the messages. By splitting the message string from the implementation,
it becomes possible to edit the string independently of the code and
pave the way for internationalization.
Error messages are defined as a combination of an enum value and a
string. Both are defined in the Error.thrift file that is automatically
generated using the script in common/thrift/generate_error_codes.py. The
goal of the script is to have a central understandable repository of
error messages. Adding new messages to this file will require rebuilding
the thrift part. The proxy class ErrorMessage is responsible for
representing an error and capturing the parameters that are used to format
the error message string.
When error messages are recorded they are recorded based on the
following algorithm:
- If an error message is of type GENERAL, do not aggregate this message
and simply add it to the total number of messages
- If an error message is of a specific type, record the first error
message as a sample and for all other occurrences increment the count.
- The coordinator will merge all error messages except the ones of type
GENERAL and display a count.
For example, in the case of the parquet file spanning multiple blocks
the output will look like:
Parquet files should not be split into multiple hdfs-blocks.
file=hdfs://localhost:20500/fid.parq (1 of 321 similar)
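The recording and merging rules above can be sketched in a few lines. This is
an illustration in Python, not the backend implementation: the class name
ErrorLog and the code name PARQUET_MULTIPLE_BLOCKS are hypothetical, GENERAL
messages are simply kept one by one, and the "(1 of N similar)" suffix is only
added when a code occurred more than once.

```python
from collections import OrderedDict

GENERAL = "GENERAL"


class ErrorLog(object):
    """Keeps GENERAL messages as-is and aggregates all other codes."""

    def __init__(self):
        self.general = []             # every GENERAL message, unaggregated
        self.by_code = OrderedDict()  # code -> [first sample message, count]

    def record(self, code, message):
        if code == GENERAL:
            self.general.append(message)
        elif code in self.by_code:
            self.by_code[code][1] += 1
        else:
            self.by_code[code] = [message, 1]

    def merge(self, other):
        """Fold another backend's log into this one (coordinator side)."""
        self.general.extend(other.general)
        for code, (sample, count) in other.by_code.items():
            if code in self.by_code:
                self.by_code[code][1] += count
            else:
                self.by_code[code] = [sample, count]

    def render(self):
        lines = list(self.general)
        for sample, count in self.by_code.values():
            if count > 1:
                lines.append("%s (1 of %d similar)" % (sample, count))
            else:
                lines.append(sample)
        return lines
```

A coordinator would create one ErrorLog, merge the log of each backend into it,
and print the lines returned by render().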
All messages are always logged to VLOG. In the coordinator error
messages are merged across all backends to retain readability in the
case of large clusters.
The current version of this patch adds these new error codes to some of
the most important error messages as a reference implementation.
Change-Id: I1f1811631836d2dd6048035ad33f7194fb71d6b8
Reviewed-on: http://gerrit.cloudera.org:8080/39
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
Specifically:
- Hive needs some jars from hadoop/tools/lib
- Hive has a dependency on apache.snapshots (added in fe/pom.xml)
- Beeline has to be explicitly told not to use jline.
Change-Id: Id38956b748f8f667a39505c92355f0298f308718
Conflicts:
testdata/bin/load-hive-builtins.sh
This patch enables loading data to s3 instead of hdfs. It is preliminary in nature,
as such, there are a few caveats:
- The fe tests do not work.
- Only loading from a test-warehouse snapshot and metastore snapshot is enabled.
- Until hive works with s3, only a subset of all the tests will work.
Change-Id: Ia66a5f836b4245e3b022a49de805eec337a51324
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5851
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins