This patch introduces new abstractions and changes the way queries are run via the
workload runner. A new class 'Workload' is introduced, which represents the notion of a
workload in the performance framework (i.e., a set of query names mapped to query
strings).
The new workflow is:
- run-workload acts as a driver. It accepts user parameters for which queries to
run and their execution strategy. It generates workload objects and passes them to the
workload-runner.
- The workload runner takes a workload, its execution parameters and generates a set of
test vectors over which the workload is run iteratively.
- A workload is executed by initializing a QueryExecutor for each query being run in a
test vector. The workload executor is then responsible for execution and gathering
results.
- The execution details of every query being executed are stored and returned to the
driver (run-workload).
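
The workflow above can be sketched roughly as follows. This is a minimal illustration, not the code in this patch: the field names, `generate_test_vectors`, `run_workload`, and the `exec_fn` callback are assumptions made for the example; only the `Workload` and `QueryExecutor` names come from the description.

```python
import itertools
import time
from dataclasses import dataclass


@dataclass
class Workload:
    """A named set of queries: query name -> query string."""
    name: str
    queries: dict


@dataclass
class QueryResult:
    """Execution details for one run of one query (hypothetical fields)."""
    query_name: str
    test_vector: dict
    iteration: int
    time_taken_s: float
    success: bool


class QueryExecutor:
    """Runs a single query under a single test vector.

    exec_fn is a stand-in for a real client call; it is an assumption.
    """

    def __init__(self, query_name, query_string, test_vector, exec_fn):
        self.query_name = query_name
        self.query_string = query_string
        self.test_vector = test_vector
        self.exec_fn = exec_fn

    def run(self, iteration):
        start = time.monotonic()
        try:
            self.exec_fn(self.query_string, self.test_vector)
            success = True
        except Exception:
            success = False
        elapsed = time.monotonic() - start
        return QueryResult(self.query_name, self.test_vector, iteration, elapsed, success)


def generate_test_vectors(exec_params):
    """Build the cross product of every execution parameter's allowed values."""
    keys = sorted(exec_params)
    combos = itertools.product(*(exec_params[k] for k in keys))
    return [dict(zip(keys, combo)) for combo in combos]


def run_workload(workload, exec_params, iterations, exec_fn):
    """Run every query of the workload over every test vector, iteratively."""
    results = []
    for vector in generate_test_vectors(exec_params):
        for name, sql in workload.queries.items():
            executor = QueryExecutor(name, sql, vector, exec_fn)
            for i in range(iterations):
                results.append(executor.run(i))
    return results  # handed back to the driver
```

A driver would build one `Workload` per requested workload, call `run_workload` for each, and aggregate the returned results.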
Change-Id: Ia16360140d65e6733e534e823bc5d5614622ab5f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3616
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: jenkins
Some operating systems don't ship with nproc, which causes impala-config.sh to fail. This
change alleviates the problem by checking if nproc exists, and setting a reasonable
default if it does not.
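
The check-then-default logic can be illustrated as below. The actual change lives in a shell script; this is only a Python rendering of the same idea, and the fallback value of 4 and the `core_count` name are assumptions, not what impala-config.sh uses.

```python
import shutil
import subprocess

DEFAULT_CORES = 4  # assumed fallback; not taken from impala-config.sh


def core_count(default=DEFAULT_CORES, command="nproc"):
    """Return the core count reported by `command`, or `default` if unavailable."""
    # Check that the tool exists on PATH before trying to run it.
    if shutil.which(command) is None:
        return default
    try:
        out = subprocess.run(
            [command], capture_output=True, text=True, check=True
        ).stdout
        return int(out.strip())
    except (subprocess.CalledProcessError, ValueError, OSError):
        # The tool exists but failed or printed something unexpected.
        return default
```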
Change-Id: Ic6e4d0fbce57eedc82163cfa17f71bdccbc38b51
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3208
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
Currently, we launch #nproc processes to run tests locally. This patch changes the default
to #nproc/2, to avoid overloading the system.
Change-Id: I8bca23eb7462a0c497df93f82a60d85835bedbe9
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2972
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
Enable order-by without limit
Added BufferedBlockMgr to allocate buffers and spill to disk.
Added Sorter for the external sort implementation
Added new SortNode execution node that completely sorts its input
Changes to enable writing in IoMgr went in a separate patch.
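
For readers unfamiliar with external sorting, the general technique (sort bounded runs in memory, spill them to disk, then merge) looks roughly like the sketch below. It is a generic illustration over integers, not the Sorter or BufferedBlockMgr implementation from this patch; `external_sort` and `max_in_memory` are names invented for the example.

```python
import heapq
import tempfile


def _spill_run(sorted_values):
    """Write one sorted run to a temporary file, one integer per line."""
    run = tempfile.TemporaryFile(mode="w+")
    run.writelines(f"{v}\n" for v in sorted_values)
    run.seek(0)
    return run


def _read_run(run):
    """Stream the integers of a spilled run back in order."""
    for line in run:
        yield int(line)


def external_sort(values, max_in_memory=1000):
    """Yield all values in ascending order, buffering at most max_in_memory items."""
    runs, buffer = [], []
    for v in values:
        buffer.append(v)
        if len(buffer) >= max_in_memory:
            # Buffer is full: sort it and spill it to disk as one run.
            runs.append(_spill_run(sorted(buffer)))
            buffer = []
    buffer.sort()  # the last partial run stays in memory
    try:
        # K-way merge of the in-memory run and all spilled runs.
        yield from heapq.merge(buffer, *(_read_run(r) for r in runs))
    finally:
        for r in runs:
            r.close()
```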
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1539
Reviewed-by: Srinath Shankar <sshankar@cloudera.com>
Tested-by: jenkins
Conflicts:
testdata/workloads/functional-planner/queries/PlannerTest/tpcds-all.test
Change-Id: I3ece32affe5b006f53bbdfcc03ded01471e818ac
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2900
Reviewed-by: Srinath Shankar <sshankar@cloudera.com>
Tested-by: jenkins
This change adds support for authorizing based on policy metadata read from the Sentry
Service. Authorization is role based and roles are granted to user groups. Each role
can have zero or more privileges associated with it, granting fine-grained access to
specific catalog objects at server, URI, database, or table scope. This patch only
adds support to authorize against metadata read from the Sentry Policy Service; it does
not add support for GRANT/REVOKE statements in Impala.
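
The role model described above (groups hold roles, roles hold privileges at a given scope) can be pictured with the following sketch. It is not the Sentry or Impala API: the class names, the action strings, and the matching rules (a database-scope privilege covering its tables, prefix matching for URIs) are simplifying assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Privilege:
    action: str           # e.g. "SELECT", "INSERT", or "ALL" (assumed names)
    scope: str            # "server", "uri", "database", or "table"
    database: str = None
    table: str = None
    uri: str = None

    def allows(self, action, database=None, table=None, uri=None):
        """Return True if this privilege covers the requested access."""
        if self.action not in ("ALL", action):
            return False
        if self.scope == "server":
            return True
        if self.scope == "uri":
            return uri is not None and uri.startswith(self.uri)
        if self.scope == "database":
            return database is not None and database == self.database
        if self.scope == "table":
            return database == self.database and table == self.table
        return False


class AuthorizationPolicy:
    """In-memory policy: group -> roles, role -> privileges."""

    def __init__(self):
        self.role_privileges = {}  # role name -> set of Privilege
        self.group_roles = {}      # group name -> set of role names

    def grant_privilege(self, role, privilege):
        self.role_privileges.setdefault(role, set()).add(privilege)

    def grant_role(self, group, role):
        self.role_privileges.setdefault(role, set())  # a role may have zero privileges
        self.group_roles.setdefault(group, set()).add(role)

    def is_authorized(self, user_groups, action, **target):
        for group in user_groups:
            for role in self.group_roles.get(group, ()):
                if any(p.allows(action, **target) for p in self.role_privileges[role]):
                    return True
        return False
```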
The authorization metadata is read by the catalog server from the Sentry Service and
propagated to all nodes in the cluster in the "catalog-update" statestore topic. To
enable the Catalog Server to read policy metadata, the --sentry_config flag must be
set to a valid sentry-site.xml config file.
On the impalad side, we continue to support authorization based on a file-based provider.
To enable file-based authorization, set --authorization_policy_file to a
non-empty value. If --authorization_policy_file is not set, authorization will be done
based on cached policy metadata received from the Catalog Server (via the statestore).
TODO: There are still some issues with the Sentry Service that require disabling some of
the authorization tests and adding some workarounds. I have added comments in the code
where these workarounds are needed.
Change-Id: I3765748d2cdbe00f59eefa3c971558efede38eb1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2552
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
This change adds DDL support for HDFS caching. The DDL allows the user to indicate a
table or partition should be cached and which pool to cache the data into:
* Create a cached table: CREATE TABLE ... CACHED IN 'poolName'
* Cache a table/partition: ALTER TABLE ... [partitionSpec] SET CACHED IN 'poolName'
* Uncache a table/partition: ALTER TABLE ... [partitionSpec] SET UNCACHED
When a table/partition is marked as cached, a new HDFS caching request is submitted
to cache the location (HDFS path) of the table/partition and the ID of that request
is stored within the table metadata (in the table properties). This is stored as:
'cache_directive_id'='<requestId>'. The cache requests and IDs are managed by HDFS
and persisted across HDFS restarts.
When a cached table or partition is dropped it is important to uncache the cached data
(drop the associated cache request). For partitioned tables, this means dropping all
cache requests from all cached partitions in the table.
Likewise, if a partitioned table is created as cached, new partitions should be marked
as cached by default.
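
The bookkeeping described above can be sketched as follows. `FakeCacheService` is a hypothetical in-memory stand-in for HDFS cache request management, and the helper names are invented for the example; only the 'cache_directive_id' property key comes from the description.

```python
import itertools

CACHE_DIRECTIVE_ID_PROP = "cache_directive_id"  # property key named above


class FakeCacheService:
    """Hypothetical in-memory stand-in for HDFS cache request management."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self.directives = {}  # request id -> (path, pool)

    def submit(self, path, pool):
        directive_id = next(self._next_id)
        self.directives[directive_id] = (path, pool)
        return directive_id

    def remove(self, directive_id):
        self.directives.pop(directive_id, None)


def mark_cached(props, path, pool, cache):
    """Submit a cache request and record its ID in the table/partition properties."""
    props[CACHE_DIRECTIVE_ID_PROP] = str(cache.submit(path, pool))


def uncache(props, cache):
    """Drop the cache request recorded in the properties, if there is one."""
    directive_id = props.pop(CACHE_DIRECTIVE_ID_PROP, None)
    if directive_id is not None:
        cache.remove(int(directive_id))


def drop_table(table_props, partition_props_list, cache):
    """Dropping a table must drop the requests of the table and all its partitions."""
    uncache(table_props, cache)
    for partition_props in partition_props_list:
        uncache(partition_props, cache)
```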
It is desirable to know which cache pools exist early on (in analysis) so the query
will fail without hitting HDFS/CatalogServer if a non-existent pool is specified. To
support this, a new cache pool catalog object type was introduced. The catalog server
caches the known pools (periodically refreshing the cache) and sends the known pools out
in catalog updates. This allows impalads to perform analysis checks on cache pool
existence without going to HDFS. It would be easy to use this to add basic cache pool management
in the future (ADD/DROP/SHOW CACHE POOL).
Waiting for the table/partition to become cached may take a long time. Instead of
blocking the user from accessing the table during this period, we will wait for the cache
requests to complete in the background, and once they have finished the table metadata
will be automatically refreshed.
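
The non-blocking wait can be pictured with a background thread that polls until caching completes and then triggers a refresh. This is only a sketch of the pattern; `refresh_when_cached`, its callbacks, and the polling interval are assumptions, not the mechanism used in the catalog server.

```python
import threading
import time


def refresh_when_cached(is_cached, refresh_metadata, poll_interval_s=0.01, timeout_s=5.0):
    """Poll is_cached() in the background; call refresh_metadata() once it is True.

    Returns the started thread so callers can join it if they want to.
    """

    def _wait():
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if is_cached():
                refresh_metadata()
                return
            time.sleep(poll_interval_s)
        # Timed out: give up quietly in this sketch.

    worker = threading.Thread(target=_wait, daemon=True)
    worker.start()
    return worker
```

The caller returns to the user immediately after starting the thread, so the table stays accessible while caching proceeds.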
Change-Id: I1de9c6e25b2a3bdc09edebda5510206eda3dd89b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2310
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
This updates Impala to use Sentry v1.3 instead of Sentry v1.2. No major functionality
changed between Sentry versions, but some Sentry classes were moved and APIs changed.
Change-Id: I3765748d2cdbe00f59eefa3c971558efede38ebd
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2319
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
This just updates the versions; it doesn't touch anything in /thirdparty.
Change parquet version to append SNAPSHOT
Added hadoop-hbase-compat jar in AUX_CLASSPATH and mapreduce/*.jar to HDFS
Change-Id: I4471ef4476997371cf49a9d54cfa63f2fda126e4
This is the initial commit and is a work in progress. See the README for a
list of possible improvements.
As an overview of how the files are related:
model.py: This is the base upon which the other files are built. It
contains something like a grammar for queries.
query_generator.py: Generates random permutations of the model.
model_translator.py: Produces SQL based on the model
discrepancy_searcher.py: Uses the above to generate, run, and compare
query results.
Change-Id: Iaca6277766f5a86568eaa3f05b99c832942ab38b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1648
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Casey Ching <casey@cloudera.com>
Temporarily increase the cap on max requested queries while running tests to unblock
builds. Currently, the exhaustive runs always fail, and there are some intermittent
failures in the core runs.
Change-Id: I26b9ce343d72bab7687e49f7dbd7bf3bf655a294
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2323
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2395
For wide Avro tables, ReadZLong() would get inlined many times into a
single function body, causing LLVM to crash. Not inlining doesn't seem
to have a performance impact on narrow tables, and helps with wide
tables.
This change also adds tests over wide (i.e. many-column) tables. The
test tables are produced by specifying shell commands to generate test
tables in functional_schema_template.sql, which are executed in
generate-schema-statements.py. In the SQL templates, sections starting
with a ` are treated as shell commands. The output of the shell
command is then used as the section text. This is only a starting
point; it isn't currently implemented for all sections, and may have
to be tweaked if we use this mechanism for all tables.
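
The section-expansion rule can be illustrated as below. This is a sketch of the described behavior, not the code in generate-schema-statements.py; the `resolve_section` name is hypothetical.

```python
import subprocess


def resolve_section(section_text):
    """Return the section text, running it as a shell command if it starts with a backtick."""
    if not section_text.startswith("`"):
        return section_text
    command = section_text[1:]
    # The command's standard output becomes the section text.
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=True
    )
    return result.stdout
```

A wide-table template could, for example, put a backtick followed by a column-generating command in its column section instead of listing hundreds of columns by hand.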
Change-Id: Ife0d857d19b21534167a34c8bc06bc70bef34910
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2206
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
(cherry picked from commit 1c5951e3cce25a048208ab9bb3a3aed95e41cf67)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2353
Tested-by: jenkins
This should allow individual service components, such as a single nodemanager,
to be shut down for failure testing. The mini-cluster bundled with hadoop is a
single process that does not expose the ability to control individual roles.
Now each role can be controlled and configured independently of the others.
Change-Id: Ic1d42e024226c6867e79916464d184fce886d783
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1432
Tested-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2297
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
* One last NotifyThreadUsageChange() mismatched pair
* Don't set resource in plan fragment params if there isn't a resource
available. This fixes the problem where if no fragment with resources
was assigned to the same node as the coordinator, the coordinator
would have a dummy resource allocation which didn't work with
expansion.
* Substitute #ID in all impalad arguments to start-impala-cluster.py
with the 0-indexed ID of the impalad being started. This is required
to have different Impala processes use different cgroups.
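
The #ID substitution amounts to a per-process string replacement, roughly as sketched here. The function name and the example flag are hypothetical and are not taken from start-impala-cluster.py.

```python
def substitute_id(impalad_args, cluster_size):
    """Return one argument string per impalad, with #ID replaced by its 0-indexed ID."""
    return [impalad_args.replace("#ID", str(i)) for i in range(cluster_size)]


# "--example_cgroup" is a made-up flag used only to show the substitution.
per_process_args = substitute_id("--example_cgroup=impala#ID", 3)
```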
Change-Id: If8c8fd8bef0809bdaf16115a45a9695fc2bf3e1b
(cherry picked from commit c71ce45e97570b8c09900eb5ae2e26984d3306a4)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2060
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
The problem was that we were deleting the version.info file because the default
of gen_build_version.py recently changed from --noclean to --clean.
Also fixed a bug in the shell version generation and made debugging a bit easier
by dumping the contents of version.info whenever it is generated.
Change-Id: I764d01c9e46eed1bd39de79bf076c15afa599486
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1901
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
(cherry picked from commit fa673b4d3342fc825ee7fa942bd254234d222906)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1910
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
The HS2 metadata operations do not go through analysis() so the prioritized
loading will not happen for them. Most of the HS2 metadata ops work purely
on table/db names, but GetColumns() requires loading the table metadata. This
patch updates MetadataOp to collect a set of missing tables and request these
tables be loaded from the catalog server. The operation will wait until the tables
are loaded in the local catalog before proceeding.
Change-Id: I070f2a0d9194d3317f09431971be9a8dffbc7386
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1542
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1557
Before this patch the -noclean option had almost no effect on the BE build time because
some source files were re-generated with .py scripts regardless.
This change allows ./buildall -skiptests -noclean to do a true incremental rebuild.
Change-Id: Ib3af85db05bdc96a2279a22c1d49d735f2cabd4e
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1394
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1415
Changes include:
- version changes in impala-config
- version changes in various loading scripts
- hbase jars are no longer in hive/lib
- mini-llama script changes
- updates due to sentry api changes
- JDBC tests disabled
- unsupported types tests disabled.
Change-Id: If8cf1b7ad8e22aa4d23094b9a4b1047f7e9d93ee
Fixed codepath with rm disabled. Set enable_rm to false by default.
Change-Id: I3bf2d0525d91243ec3c0ea048b0c03680befcda2
Conflicts:
be/src/runtime/runtime-state.cc
Impala reserves resources from YARN via Llama and handles resource
preemptions by cancelling affected queries. Adds the Impala Resource
Broker for interacting with Llama. Refactors scheduler and coordinator
to move fragment-to-host assignment logic into scheduler. Local test
setup uses MiniLLama.
Change-Id: Ic7b0fe43de52d30f4207b4e65cce7e6a294e54e1
While loading parquet, there are a few table creation queries that use the 'like'
keyword; this ends up opening a small race window when all the table formats are created
concurrently. With this change, we create the text tables first before attempting to
parallelize the rest of the data loading.
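
The ordering fix can be sketched as "create the base format serially, then fan out". The function below is an illustration under assumed names (`load_tables`, `create_fn`), not the data-loading script itself.

```python
from concurrent.futures import ThreadPoolExecutor


def load_tables(table_formats, create_fn, base_format="text", max_workers=4):
    """Create the base-format tables first, then create the remaining formats in parallel."""
    if base_format in table_formats:
        # Tables created with 'like' depend on these, so they must exist first.
        create_fn(base_format)
    remaining = [fmt for fmt in table_formats if fmt != base_format]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces completion and re-raises any exception from a worker.
        list(pool.map(create_fn, remaining))
```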
Change-Id: Ib84cf0e5120b3588d3f0503d7119ca055e08e53f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1241
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
* upload_codereview.py is no longer used since Rietveld is long gone
* runplanservice is deprecated as there is no longer a separate
PlanService
* README only mentions a single internal wiki page.
Change-Id: Iba61a3d62381deb882c4168f142574f2492e0969
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1249
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Reviewed-by: Nong Li <nong@cloudera.com>
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Enables JVM debugging by default for the catalogd and impalads
created via bin/start-impala-cluster.py.
Adds a -jvm_args command line option for passing additional JVM args to
the catalogd and impalads.
Change-Id: I68e901661bd1fd7eefa05ba84dbacf29dd124685
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1213
Tested-by: jenkins
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
During a full data load, we load all the data (except parquet) via hive, and then load the
parquet data via Impala. The catalog service does not update the metadata of tables
changed outside Impala, so we need to explicitly invalidate the metadata before loading
parquet data.
Change-Id: Iec39db9ea46e4a11b17589881732629a56444120
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1207
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
Instead of calling 'invalidate metadata' before loading each workload
we should call it once, after loading all test data. This will allow
us to pick up data inserted by Hive. The only reason this worked before
is because we restart Impala before running the tests. This will also
be a bit faster if loading multiple workloads.
Change-Id: I28d42bbf5d7a24b5fde687d67a4b41472ec4b897
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1153
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
Goodnight, sweet non-blocking prince. We didn't support, or test, this
configuration, and it doesn't work with security or sessions and brings
in some annoying dependencies that are a pain to build.
We have other RPC-stack options to investigate; we may wind up re-adding
the non-blocking server but only in a way that supports all required
features more regularly.
Change-Id: Ifbcabc5014441f6d31c342c4e288dd7fc6201443
This patch makes the workload runner's logging concise and more informative. Specifically,
it
- logs the time taken for each iteration of a query.
- changes the default log level to INFO.
- makes the output less verbose.
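
Per-iteration timing at INFO level could look like the sketch below. The logger name, message format, and `run_iterations` helper are assumptions for illustration and do not reproduce the workload runner's actual output.

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
LOG = logging.getLogger("workload_runner")  # hypothetical logger name


def run_iterations(query_name, run_query, iterations):
    """Run a query several times, logging the time taken by each iteration."""
    timings = []
    for i in range(1, iterations + 1):
        start = time.monotonic()
        run_query()
        elapsed = time.monotonic() - start
        timings.append(elapsed)
        LOG.info("%s iteration %d/%d took %.3fs", query_name, i, iterations, elapsed)
    return timings
```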
Change-Id: I5f964cf76269fd64ce127b9e4c51fe1deafd1d1b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1076
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>