Python tests and infra scripts will now use "python" from the virtualenv
via $IMPALA_HOME/bin/impala-python. Some scripts could be simplified now
that python 2.6 and a dependable set of third-party libraries are
available but that is not done as part of this commit.
Change-Id: If1cf96898d6350e78ea107b9026b12ba63a4162f
Reviewed-on: http://gerrit.cloudera.org:8080/603
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
Many python files had a hashbang and the executable bit set though
they were not intended to be run as a standalone script. That makes
determining which python files are actually scripts very difficult.
A future patch will update the hashbang in real python scripts so they
use $IMPALA_HOME/bin/impala-python.
Change-Id: I04eafdc73201feefe65b85817a00474e182ec2ba
Reviewed-on: http://gerrit.cloudera.org:8080/599
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
This is general cleanup in preparation for use with the stress test.
Changes:
1) Failed commands and failure to connect now raise exceptions.
Previously run_cmd() was not guaranteed to do anything at all in
remote mode.
2) Fix scope of 'hosts' which was at the class level but was modified
by instance level functions which makes no sense since different
instances could clash with each other.
3) Replace uses of opaque *args and **kwargs with named args. The
generic forms should be avoided since they impair readability.
4) Stop trying to get the cluster hosts from an environment variable
unconditionally upon construction.
5) Remove the 'local' member variable. It's not needed, and allowing
'local' to be set to False when no 'hosts' are set makes no sense.
6) Simplify and remove unneeded methods and arguments.
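
Changes 1 and 2 can be sketched as follows. This is only an illustration: the
names ClusterController and CommandError are hypothetical, and only local
execution is shown.

```python
import subprocess


class CommandError(Exception):
    """Raised when a command exits with a non-zero status (hypothetical name)."""


class ClusterController(object):
    # Hypothetical class name; only local execution is sketched here.

    def __init__(self, hosts=None):
        # 'hosts' is instance-level state, so two controllers cannot
        # clash with each other the way a shared class attribute would.
        self.hosts = list(hosts or [])

    def add_host(self, host):
        self.hosts.append(host)

    def run_cmd(self, cmd):
        # A failed command raises instead of being silently ignored.
        proc = subprocess.Popen(
            cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, _ = proc.communicate()
        if proc.returncode != 0:
            raise CommandError(
                "%r failed with status %d" % (cmd, proc.returncode))
        return out.decode()
```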
Change-Id: Id90bd3b640f2681bb7e82a5e6d5e49ed8c5a7b98
Reviewed-on: http://gerrit.cloudera.org:8080/514
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
Previously the test_file_parser would set up the logging
configuration as part of importing the module. The
test_file_parser is not executable and not a logging utility so it
should not have any effect on logging. If some other file relies on this
it should be fixed separately.
Change-Id: Ib7293d152d0c0cd3c8f31533c95e50b2678e927b
Reviewed-on: http://gerrit.cloudera.org:8080/473
Tested-by: Internal Jenkins
Reviewed-by: Casey Ching <casey@cloudera.com>
test_load was using /tmp as the staging directory, which was not cleaned up on Isilon,
leading to a build failure. This patch does the following:
- use /test-warehouse as the staging directory.
- replace calls to the hdfs commandline with calls to the in-house hdfs client.
- clean up the test file and remove duplicates.
Additionally, a new method is introduced in the hdfs client to simulate hdfs dfs -cp, i.e.,
it does a get and a put to mimic the hdfs command line's semantics.
Change-Id: I0cc27ab00df5f5ec3138b995144ab45ad622605d
Reviewed-on: http://gerrit.cloudera.org:8080/431
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
This patch introduces changes to run tests against Isilon, combined with minor cleanup of
the test and client code.
For Isilon, it:
- Populates the SkipIfIsilon class with appropriate pytest markers.
- Introduces a new default for the hdfs client in order to connect to Isilon.
- Cleans up a few test files to take the underlying filesystem into account.
- Cleans up the interface for metadata/test_insert_behaviour, query_test/test_ddl
On the client side, we introduce a wrapper around a few of pywebhdfs's methods, specifically:
- delete_file_dir does not throw an error if the file does not exist.
- get_file_dir_status automatically strips the leading '/'
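
The two wrapper behaviours can be sketched as below. To keep the example
self-contained, BaseHdfsClient and FileNotFound are in-memory stand-ins for the
pywebhdfs client and its error type, and HdfsClient is a hypothetical name; the
real wrapper would delegate to pywebhdfs instead.

```python
class FileNotFound(Exception):
    """Stand-in for the error the underlying client raises."""


class BaseHdfsClient(object):
    """In-memory stand-in for the wrapped client; paths have no leading '/'."""

    def __init__(self):
        self.files = {}

    def delete_file_dir(self, path):
        if path not in self.files:
            raise FileNotFound(path)
        del self.files[path]
        return True

    def get_file_dir_status(self, path):
        if path not in self.files:
            raise FileNotFound(path)
        return {"length": len(self.files[path])}


class HdfsClient(BaseHdfsClient):
    """Hypothetical wrapper showing the two behaviours listed above."""

    def delete_file_dir(self, path):
        # Deleting a path that does not exist is not treated as an error.
        try:
            return super(HdfsClient, self).delete_file_dir(path)
        except FileNotFound:
            return False

    def get_file_dir_status(self, path):
        # Callers may pass absolute paths; the leading '/' is stripped.
        return super(HdfsClient, self).get_file_dir_status(path.lstrip("/"))
```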
Change-Id: Ic630886e253e43b2daaf5adc8dedc0a271b0391f
Reviewed-on: http://gerrit.cloudera.org:8080/370
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
This patch encapsulates pytest's skipif markers in classes. It leads to the following
benefits:
- Provide context and grouping for tests being skipped.
- As we improve test reporting, annotations will give us a better idea of coverage.
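
A minimal sketch of the pattern, assuming the target filesystem is chosen by an
environment variable. The marker names, the reasons, and the variable name are
illustrative assumptions, not the actual contents of the patch.

```python
import os

import pytest

# Assumption: the filesystem under test is selected by this variable.
IS_ISILON = os.environ.get("TARGET_FILESYSTEM") == "isilon"


class SkipIfIsilon(object):
    # Each attribute is a reusable marker that carries its own reason,
    # which gives skipped tests context and a natural grouping.
    hive = pytest.mark.skipif(
        IS_ISILON, reason="Hive is not available (illustrative reason)")
    caching = pytest.mark.skipif(
        IS_ISILON, reason="Caching is not supported (illustrative reason)")


@SkipIfIsilon.caching
def test_cached_table():
    assert True
```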
Change-Id: Ib0557fb78c873047c214bb62bb6b045ceabaf0c9
Reviewed-on: http://gerrit.cloudera.org:8080/297
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
Reviewed-on: http://gerrit.cloudera.org:8080/343
This patch also adds a small utility file for supporting different filesystems.
Change-Id: I28b1217b0cb901360e28e8d0ba269c9144117d2e
Reviewed-on: http://gerrit.cloudera.org:8080/124
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
Summary of changes:
1) (from Taras) Exercise CTAS and views by creating one from a random
query, then SELECT * FROM table/view.
2) Use bulk loading to generate random data. The old method was to use
INSERTs which is very slow. Now local data files are generated and
uploaded.
3) Misc schema parsing changes needed to support the simplified type
system in the earlier review (part 1).
Change-Id: I7986b97aa12051dc043faafef34a9540117e889f
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5646
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
This patch contains the following changes:
- Add a metastore_snapshot_file parameter to build.sh
- Enable skipping loading the metadata.
- create-load-data.sh is refactored into functions.
- A lot of scripts source impala-config, which creates a lot of log spew. This has now
been muted.
- Unnecessary log spew from compute-table-stats has been muted.
- build_thirdparty.sh determines its parallelism from the system; it was previously
hard-coded to 4.
- Only force load data of the particular dataset if a schema change is detected.
Change-Id: I909336451e5c1ca57d21f040eb94c0e831546837
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5540
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
Our .test file parser used to not abort tests when there
is a malformed test/section. This patch changes that behavior
to report an error and treat the test as failed.
Quite a few tests were not well-formed, and were not executed
as a result. This patch fixes those tests.
Arguably, the test file parser should be more flexible in which places
to accept comments, but this patch does not address that problem.
Change-Id: If53358eb0cb958b68e51940b071e64c1d6c3ec6f
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5468
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
- Added % change to performance regressions/improvements table
- Automatic extraction of Impala version from runtime profiles
- Execution summary row will not be printed if max time is < 100ms or < 2% of the overall runtime
- Failed queries are ignored
- First result is discarded for each query
- Geometric mean was added to summary
- Improved handling of multiple workloads in a single JSON file
- Improved handling of the case when queries are different in results and reference results
- Works well for single client runs. Additional work is needed to handle multiple client runs well.
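
Three of the rules above can be sketched as follows. The function names are
hypothetical, and how the report actually combines per-query numbers is an
assumption here.

```python
import math


def average_without_warmup(runtimes_ms):
    # The first result of each query is discarded before averaging.
    kept = runtimes_ms[1:]
    if not kept:
        raise ValueError("need at least two runs per query")
    return sum(kept) / float(len(kept))


def geometric_mean(values):
    # exp(mean(log(v))): less sensitive to one very slow query than
    # the arithmetic mean. All values must be positive.
    return math.exp(sum(math.log(v) for v in values) / len(values))


def should_print_summary_row(max_time_ms, total_runtime_ms):
    # A row is skipped if its max time is < 100ms or < 2% of the
    # overall runtime.
    return not (max_time_ms < 100 or max_time_ms < 0.02 * total_runtime_ms)
```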
Change-Id: Ice7b9cc4fd7502a448d35ace10fbcef183df1769
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4210
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
(cherry picked from commit c722f6b0a104df54b550978cd222a9af4d39b929)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5250
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
This patch forces LOAD and INSERT to check ACLs during analysis. We
mimic the behaviour of HDFS's ACL checking by adding code to
FsPermissionChecker.
Change-Id: I42660db1da13ceaef63f582cff2c2078e08f90a1
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4428
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: jenkins
This patch adds the necessary changes required to authorize SHOW ROLES statements.
This is not as easy as it could be because the Sentry Service doesn't currently
expose the metadata for who is/isn't authorized to execute these statements. To authorize
the statements, we need to first make an RPC to the Sentry Service (via the
Catalog Server) and then only proceed with the SHOW statement if the check succeeds.
We should consider revisiting this approach in the future when more metadata is available
from Sentry.
Additionally, this patch adds support for SHOW CURRENT ROLES which shows all roles
that are currently granted to the current user.
Change-Id: Ia01c20d58ab081f49a85566075836d8c6e25dbd4
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4367
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
This is the first iteration of a kerberized development environment.
All the daemons start and use kerberos, with the sole exception of the
hive metastore. This is sufficient to test impala authentication.
When buildall.sh is run using '-kerberize', it will stop before
loading data or attempting to run tests.
Loading data into the cluster is known to not work at this time, the
root causes being that Beeline -> HiveServer2 -> MapReduce throws
errors, and Beeline -> HiveServer2 -> HBase has problems. These are
left for later work.
However, the impala daemons will happily authenticate using kerberos
both from clients (like the impala shell) and amongst each other.
This means that if you can get data into the mini-cluster, you could
query it.
Usage:
* Supply a '-kerberize' option to buildall.sh, or
* Supply a '-kerberize' option to create-test-configuration.sh, then
'run-all.sh -format', re-source impala-config.sh, and then start
impala daemons as usual. You must reformat the cluster because
kerberizing it will change all the ownership of all files in HDFS.
Notable changes:
* Added clean start/stop script for the llama-minikdc
* Creation of Kerberized HDFS - namenode and datanodes
* Kerberized HBase (and Zookeeper)
* Kerberized Hive (minus the MetaStore)
* Kerberized Impala
* Loading of data very nearly working
Still to go:
* Kerberize the MetaStore
* Get data loading working
* Run all tests
* The unknown unknowns
* Extensive testing
Change-Id: Iee3f56f6cc28303821fc6a3bf3ca7f5933632160
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4019
Reviewed-by: Michael Yoder <myoder@cloudera.com>
Tested-by: jenkins
- Added execution summary to the beeswax client and QueryResult
- Modified report-benchmark-results to handle JSON and perform
execution summary comparison between runs
- Added comments to the new workload runner
Change-Id: I9c3c5f2fdc5d8d1e70022c4077334bc44e3a2d1d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3598
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: jenkins
(cherry picked from commit fd0b1406be2511c202e02fa63af94fbbe5e18eee)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3618
This patch introduces new abstractions and changes the way queries are run via the
workload runner. A new class 'Workload' is introduced, which represents the notion of a
workload in the performance framework (i.e., a set of query names mapped to query
strings).
The new workflow is:
- run-workload acts as a driver. It accepts user parameters for which queries to
run and their execution strategy. It generates workload objects and passes them to the
workload-runner.
- The workload runner takes a workload, its execution parameters and generates a set of
test vectors over which the workload is run iteratively.
- A workload is executed by initializing a QueryExecutor for each query being run in a
test vector. The workload executor is then responsible for execution and gathering
results.
- The execution details of every query being executed are stored and returned to the
driver (run-workload).
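
The workflow above can be sketched roughly as follows. Only the name Workload
comes from this patch; the function names, the test vector dimensions, and the
'execute' callback (standing in for a QueryExecutor) are assumptions made for
illustration.

```python
import itertools


class Workload(object):
    """A set of query names mapped to query strings."""

    def __init__(self, name, queries):
        self.name = name
        self.queries = dict(queries)


def generate_test_vectors(file_formats, compression_codecs):
    # Assumed dimensions; one vector per combination.
    return [{"file_format": fmt, "compression_codec": codec}
            for fmt, codec in itertools.product(file_formats,
                                                compression_codecs)]


def run_workload(workload, test_vectors, execute):
    # 'execute' stands in for a QueryExecutor: it runs one query string
    # under one test vector and returns its execution details.
    results = []
    for vector in test_vectors:
        for query_name, query_string in sorted(workload.queries.items()):
            results.append({
                "workload": workload.name,
                "query": query_name,
                "vector": vector,
                "details": execute(query_string, vector),
            })
    # The collected details go back to the driver.
    return results
```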
Change-Id: Ia16360140d65e6733e534e823bc5d5614622ab5f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3616
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: jenkins
This patch also adds a mechanism to return analysis warnings to
client, which is used to log skipped decimal columns.
Change-Id: I30c246044a68ec8861cd5bed072bd54e65a079e6
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2822
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
(cherry picked from commit fc77422acef7e6f93fdeb5448309414b905f0725)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2984
This should allow individual service components, such as a single nodemanager,
to be shutdown for failure testing. The mini-cluster bundled with hadoop is a
single process that does not expose the ability to control individual roles.
Now each role can be controlled and configured independently of the others.
Change-Id: Ic1d42e024226c6867e79916464d184fce886d783
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1432
Tested-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2297
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
their parent's permissions
This patch adds --insert_inherit_permissions. If true, all
new partition directories created by INSERT will inherit their
permissions from their parent. When false, the directories are created
with the default permissions.
Change-Id: Ib2b4c251e51ea5048387169678e8dde34ecfe5f6
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1917
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
This fixes how we validate delimiters to be in line with Hive. A delimiter must
fit in a single byte and can be specified in the following formats, as far as I can
tell (there isn't documentation):
- A single ASCII or unicode character (ex. '|')
- An escape character in octal format (ex. \001. Stored in the metastore as a
unicode character: \u0001).
- A signed decimal integer in the range [-128:127]. Used to support delimiters
for ASCII character values between 128-255 (-2 maps to ASCII 254).
Previously, we were not handling the "signed integer" case so there was no way
to specify a delimiter in the "extended" ASCII range of 128-255.
To support result validation, the test infrastructure had to be updated to support
reading/writing different character encodings.
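
The accepted formats can be sketched as below. This is a sketch under
assumptions: the function name is hypothetical, an octal escape such as \001 is
assumed to have already been resolved to a single character, and the integer
form is tried first, so the spec '9' means byte 9 rather than the character '9'.

```python
import re


def parse_delimiter(spec):
    """Return the delimiter as a byte value in 0..255, or raise ValueError."""
    # Signed decimal integer in [-128, 127]; negative values map onto
    # the "extended" range 128-255 (for example, -2 -> 254).
    if re.match(r"[+-]?[0-9]+\Z", spec):
        value = int(spec)
        if -128 <= value <= 127:
            return value % 256
    # A single character that fits in one byte.
    if len(spec) == 1 and ord(spec) <= 255:
        return ord(spec)
    raise ValueError("delimiter must fit in a single byte: %r" % (spec,))
```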
Change-Id: Ie3c4d444dc9c6e60192093ed0c0f6f151eab16bc
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1848
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1888
This change encloses fabric's task method with its 'hide' context manager. Status output
for the running commands is muted (i.e., hosts connected to, which command is running,
etc.). Error messages are NOT muted, and will still be displayed (connection error,
command failure).
Change-Id: Ibfbbb995ab6fe057faec9af8be90449654b21f8c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1155
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
The plugin runner uses fabric as the underlying mechanism for running remote commands on
cluster hosts. fabric in turn uses paramiko, which generates a lot of log spew. This
change sets paramiko's logging level to ERROR, eliminating excess logging. Additionally,
it also constrains fabric's logging.
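
The paramiko part of this amounts to something like the sketch below. It relies
only on the standard logging module and on paramiko logging under the
"paramiko" logger namespace; the fabric side is not shown.

```python
import logging

# Child loggers such as "paramiko.transport" inherit this level, so
# only ERROR and above get through.
logging.getLogger("paramiko").setLevel(logging.ERROR)
```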
Change-Id: I6229d64f95f9c1512cc01842c4a661e96e421086
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1064
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
Updates our compute stats script to execute using Impala. This allows us
to easily compute stats on all tables in a database or all tables in the
metastore.
The updated stats caused one of the TPCH plans to change so this also
updates the TPCH planner test results.
Change-Id: I17e5dcd1036a35e40eb4eb2c8e4a20702db9049c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1024
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
We need to pass a flag to the metastore for the cleanup to happen. Previously we
were passing 'false' when we need to pass 'true' to get the same behavior as Hive
when dropping databases. Added a test case to validate the cleanup when dropping
databases and tables.
Change-Id: I500a3d3ac52c1b2031fae842403a670cfe43fa98
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1035
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
This patch fixes a slightly pathological state that occurs when the
statestore is under heavy load. The result of the bug is that
subscribers cannot successfully re-register because the statestore never
marks them as failed.
The exact sequence of events is as follows:
1. Subscriber registers with state-store.
2. Statestore does not send heartbeats in timely fashion to
subscriber. Subscriber times-out.
3. Subscriber is restarted quickly. Statestore does not detect
restart.
4. Subscriber's RegisterSubscriber() call fails, because statestore
detects duplicate registration.
5. Subscriber restarts again. Since state-store is slow to send
heartbeats, the state-store has not detected the restart and the
subscriber receives a heartbeat message from the statestore and
does not reject it.
6. Statestore continues to believe subscriber is alive, since the
heartbeats are not being rejected.
To fix this, we add a registration ID to each successfully registered
subscriber that is known to both subscriber and statestore. If the
subscriber should restart and re-register, it receives a new
registration ID. Whenever a heartbeat arrives, it compares its
registration ID to that sent by the statestore with the heartbeat, and
rejects the heartbeat if they do not match.
We also allow re-registration of existing subscribers (getting rid of
the dreaded "Duplicate subscription" message). A new registration
overwrites an old one.
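
The registration ID scheme can be sketched as follows. The class and method
names are hypothetical and the RPC layer is left out; only the ID handling is
shown.

```python
import uuid


class Statestore(object):
    def __init__(self):
        self.registrations = {}  # subscriber name -> registration ID

    def register(self, subscriber_name):
        # Re-registration is allowed: a new registration overwrites
        # the old one and gets a fresh ID.
        registration_id = uuid.uuid4().hex
        self.registrations[subscriber_name] = registration_id
        return registration_id


class Subscriber(object):
    def __init__(self, name, statestore):
        self.name = name
        self.registration_id = statestore.register(name)

    def accept_heartbeat(self, registration_id):
        # A heartbeat meant for an earlier incarnation carries a stale
        # ID and is rejected.
        return registration_id == self.registration_id
```

A restarted subscriber is modelled here as a second Subscriber object with the
same name: it accepts heartbeats that carry the new ID and rejects those that
carry the ID of the previous incarnation.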
Change-Id: Ie32df3a586ccb375375ebfbcbec1aaeb930b6bfe
Reviewed-on: http://gerrit.ent.cloudera.com:8080/778
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
Fixed the following stats-related bugs:
- Per-partition row count was not distributed properly via CatalogService
- HBase column stats were not loaded and distributed properly
Enhancements to test framework:
- Allow regex specification of expected row or column values
- Fixed expected results of some tests because the test framework
did not catch that they were incorrect
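
A regex-aware comparison could look like the sketch below. The "regex:" prefix
and the function name are assumptions made for illustration; this message does
not spell out the syntax the framework uses.

```python
import re

REGEX_PREFIX = "regex:"  # assumed marker for a regex expectation


def value_matches(expected, actual):
    # An expected value with the prefix is treated as a pattern that
    # must match the whole actual value; anything else is compared
    # literally.
    if expected.startswith(REGEX_PREFIX):
        pattern = expected[len(REGEX_PREFIX):].strip()
        return re.match(pattern + r"\Z", actual) is not None
    return expected == actual
```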
Change-Id: I1fa8e710bbcf0ddb62b961fdd26ecd9ce7b75d51
Reviewed-on: http://gerrit.ent.cloudera.com:8080/813
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
At the moment, a query is the default unit of execution and parallelism in the Impala
performance suite. With this change, we now have the ability to treat a workload as the
unit of execution. A workload is defined as a unique combination of the dataset, scale
factor, a subset (or all) of the queries in the dataset, and a table format (file format,
compression codec and compression scheme).
It introduces two new command line options in bin/run-workload.py:
* --execution_scope
The default scope is 'query', and it maintains previous semantics. The
new scope is 'workload', which toggles the unit of execution to a workload.
* --shuffle_query_exec_order
Shuffles the order in which queries are executed (only applicable when the
execution_scope is workload), defaults to False.
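
How the two options interact can be sketched as below. The flag names and
defaults come from this message; the use of argparse and the helper
order_queries are assumptions, not the script's actual code.

```python
import argparse
import random


def parse_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--execution_scope", choices=["query", "workload"],
                        default="query")
    parser.add_argument("--shuffle_query_exec_order", action="store_true",
                        default=False)
    return parser.parse_args(argv)


def order_queries(queries, options, rng=random):
    ordered = list(queries)
    # Shuffling only applies when a workload is the unit of execution.
    if (options.execution_scope == "workload"
            and options.shuffle_query_exec_order):
        rng.shuffle(ordered)
    return ordered
```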
Change-Id: I790d75f0896210cda8eb999015b0be04246e4c45
Reviewed-on: http://gerrit.ent.cloudera.com:8080/503
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
OVERWRITE
INSERT OVERWRITE into an unpartitioned table is supposed to remove all
data files from the root. This should not include hidden files or
directories. This patch excludes hidden files from deletion, and adds a
test case.
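
The exclusion can be sketched on a local directory as follows. This is only an
illustration: which names count as hidden (a leading '.' or '_' here) is an
assumption, the function names are hypothetical, and the real change operates
on HDFS rather than the local filesystem.

```python
import os


def is_hidden(name):
    # Assumption: a leading '.' or '_' marks a hidden file.
    return name.startswith((".", "_"))


def clear_table_root(root):
    """Remove non-hidden regular files directly under root."""
    removed = []
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if is_hidden(name) or not os.path.isfile(path):
            continue
        os.remove(path)
        removed.append(name)
    return removed
```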
Partition directories are still removed in their entirety: the cost of
statting a large number of files and directories rather than issuing a
single "rm -rf" outweighs the benefits of preserving hidden files for
now.
Hive does not preserve hidden files in either configuration.
Change-Id: Ia73e55e011c26c88f14745075210cf359764e3c1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/418
Tested-by: jenkins
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
This adds support for CREATE TABLE AS SELECT to Impala. It supports all functionality a
regular CREATE TABLE statement includes, except it does not allow for specifying
partition columns. Hive also has this limitation and it wouldn't be too hard to support
in the future.
Change-Id: I4ca3c3b8f1576441b8bb5ed9dc521d7dfa96ab74
Reviewed-on: http://gerrit.ent.cloudera.com:8080/157
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
This patch contains changes to the general test and plugin framework that were needed to make the VTune plugin run. These changes create a context dictionary that is passed to the plugin.
Change-Id: I12ee2076fb0d777813c56bbb338e6d20426afaff
Reviewed-on: http://gerrit.ent.cloudera.com:8080/111
Reviewed-by: Alex Leblang <alex.leblang@cloudera.com>
Tested-by: Alex Leblang <alex.leblang@cloudera.com>
This works around a problem with computing table stats via the Hive Meta Store client
API. When executing these statements via the MetaStoreClient, all tables were getting a
num_rows=0 value returned from the ANALYZE TABLE query.