This should allow individual service components, such as a single nodemanager,
to be shut down for failure testing. The mini-cluster bundled with Hadoop is a
single process that does not expose the ability to control individual roles.
Now each role can be controlled and configured independently of the others.
Change-Id: Ic1d42e024226c6867e79916464d184fce886d783
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1432
Tested-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2297
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
When updating partition metadata as part of COMPUTE STATS we previously
attempted to update all partitions at once. This could lead to HMS socket timeouts
and could also run into issues if there were more than 32K partitions.
In this change we now update the partitions in batches, with a max size of 500
partitions per batch. We also compare whether the row count has changed and only
update partitions that have been modified.
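
The batching and change detection described above can be sketched roughly as
follows. This is a minimal illustration, not the actual catalog code: the
function name, the `alter_partitions` callback, and the dictionary layout are
hypothetical. Only the batch size of 500 and the row-count comparison come from
this change.

```python
from typing import Callable, Dict, List

MAX_BATCH_SIZE = 500  # max partitions per batch, as stated in this change


def update_partitions_in_batches(
    current_row_counts: Dict[str, int],
    new_row_counts: Dict[str, int],
    alter_partitions: Callable[[List[str]], None],
    batch_size: int = MAX_BATCH_SIZE,
) -> int:
    """Send only partitions whose row count changed, in bounded batches.

    alter_partitions is a hypothetical stand-in for the metastore call.
    Returns the number of batches that were sent.
    """
    # Keep only partitions whose row count differs from the stored value.
    modified = [
        name
        for name, count in new_row_counts.items()
        if current_row_counts.get(name) != count
    ]
    batches = 0
    for start in range(0, len(modified), batch_size):
        alter_partitions(modified[start:start + batch_size])
        batches += 1
    return batches
```

With 1100 modified partitions this would issue three calls (500, 500, 100)
instead of one large request.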
Change-Id: If7bfcc30f86fc2fdd79855b981067ac29a47b5e1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1913
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1918
We run wait-for-hbase-master.py after starting hbase to account for a race between
the master and region server. This script has not been working for some time. It caused
no ill effects since the race was absent. However, the race has manifested itself
again, so the script needs to be fixed. Setting the correct classpath does so.
Change-Id: I783a7473cfd24a9cb66711f5428f7052ceb96282
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1756
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
With a recent upstream change, a core-site.xml was introduced in a YARN test jar pulled in
by thirdparty. This causes MiniLlama to ignore options set in
fe/src/test/resources/core-site.xml. The problem manifests itself with the MiniDfsCluster
starting on an arbitrary port, but it would have also caused a lot of tests to fail as none
of the compression codecs are pulled in. This change prepends the classpath used by
MiniLlama with the path to the internal core-site.
Change-Id: Iee267fe12e02301baec059a1f7469288c038d6fa
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1739
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
This updates how Impala fetches partition metadata from the Hive Metastore to fetch
partitions in batches, rather than all at once. This helps reduce the load on the
HMS and also lets Impala scale to above 32K partitions. The downside is that it
may require additional RPCs to get all the partitions.
This is done by first querying the metastore to get all the partition names that
exist, then splitting the list of names into separate batches to get the actual
partition metadata.
Impala uses a default size of 1000 partitions per batch, but it can be configured
by setting the 'hive.metastore.batch.retrieve.table.partition.max' parameter
in the hive-site.xml config file.
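
The two-step fetch described above might look roughly like this. It is a sketch
under stated assumptions: `list_partition_names` and `get_partitions_by_names`
are hypothetical callbacks standing in for the metastore RPCs, and only the
default batch size of 1000 is taken from this change.

```python
from typing import Callable, Dict, Iterator, List

DEFAULT_BATCH_SIZE = 1000  # default partitions per batch, as stated above


def fetch_partitions(
    list_partition_names: Callable[[], List[str]],
    get_partitions_by_names: Callable[[List[str]], List[Dict]],
    batch_size: int = DEFAULT_BATCH_SIZE,
) -> Iterator[Dict]:
    """Yield partition metadata, fetching it one batch of names at a time."""
    # Step 1: one call to learn which partitions exist.
    names = list_partition_names()
    # Step 2: one call per batch of names to get the actual metadata.
    for start in range(0, len(names), batch_size):
        yield from get_partitions_by_names(names[start:start + batch_size])
```

For a table with 2500 partitions this results in one name-listing call plus
three metadata calls, which is the extra RPC cost mentioned above.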
Change-Id: Ide0ec30ef8a9e00f79c26551aa8e5e7814c73034
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1662
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1698
The purpose of this patch is to avoid CDH-17414 which causes data files loaded
with Hive to incorrectly have a replication factor of 1. When using beeline
this problem only appears to occur immediately after creating the first HBase table
since starting HiveServer2, i.e., subsequent loads seem to function correctly.
This patch adds a new script that creates an external HBase table in Hive to
'warm up' HiveServer2 immediately after it is started.
Subsequent loads should assign a correct replication factor.
Change-Id: Ic54c9401b67b748a8848d19f82b8e7df9535e845
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1640
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
This change adds support for lazy loading of table metadata to the
CatalogService/Impalad. The way this works is that the CatalogService initially
sends out an update with only the databases and table names (wrapped as
IncompleteTables). When an Impalad encounters one of these tables, it will contact
the catalog service to get the metadata, possibly triggering a metadata load if the
catalog server has not yet loaded this table.
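
As a rough illustration of the lazy-loading flow, consider the sketch below.
It is not the CatalogService implementation: `LazyCatalog`, `get_table`, and
the `load_metadata` callback are hypothetical names, and the real exchange
between Impalad and the catalog server is collapsed into a single class.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


class IncompleteTable:
    """Placeholder that carries only the database and table name."""

    def __init__(self, db: str, name: str) -> None:
        self.db = db
        self.name = name


class LazyCatalog:
    """Starts with names only and loads full metadata on first access."""

    def __init__(
        self,
        table_names: Iterable[Tuple[str, str]],
        load_metadata: Callable[[str, str], Any],
    ) -> None:
        # The initial update contains only placeholders, so startup is cheap.
        self._tables: Dict[Tuple[str, str], Any] = {
            (db, name): IncompleteTable(db, name) for db, name in table_names
        }
        self._load_metadata = load_metadata

    def get_table(self, db: str, name: str) -> Any:
        entry = self._tables[(db, name)]
        if isinstance(entry, IncompleteTable):
            # First access triggers the (possibly expensive) metadata load.
            entry = self._load_metadata(db, name)
            self._tables[(db, name)] = entry
        return entry
```

Repeated lookups of the same table reuse the loaded entry, so the load cost is
paid at most once per table in this sketch.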
With these changes the catalog server starts up in just seconds, even for large
metastores since it only needs to call into the metastore to get the list of tables
and databases. The performance of "invalidate metadata" also improves for the same reason.
I also picked up the catalog cleanup patch I had to make the APIs a bit more consistent and
remove the need for using a LoadingCache for databases.
This also fixes up the FE tests to run in a more realistic fashion. The FE tests now run
against catalog objects received from the catalog server. This actually turned up some bugs
in our previous test configuration where we were not running with the correct column stats
(we were always running with avgSerializedSize = slotSize). This changed some plans so the
planner tests needed to be updated.
Still TODO:
This does not include the changes to perform background metadata loading. I will send
that out as a separate patch on top of this.
Change-Id: Ied16f8a7f3a3393e89d6bfea78f0ba708d0ddd0e
Saving changes
Change-Id: I48c34408826b7396004177f5fc61a9523e664acc
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1328
Tested-by: jenkins
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1338
Tested-by: Lenni Kuff <lskuff@cloudera.com>
Changes include:
- version changes in impala-config
- version changes in various loading scripts
- hbase jars are no longer in hive/lib
- mini-llama script changes
- updates due to sentry api changes
- JDBC tests disabled
- unsupported types tests disabled
Change-Id: If8cf1b7ad8e22aa4d23094b9a4b1047f7e9d93ee
Fixed codepath with rm disabled. Set enable_rm to false by default.
Change-Id: I3bf2d0525d91243ec3c0ea048b0c03680befcda2
Conflicts:
be/src/runtime/runtime-state.cc
Impala reserves resources from YARN via Llama and handles resource
preemptions by cancelling affected queries. Adds the Impala Resource
Broker for interacting with Llama. Refactors scheduler and coordinator
to move fragment-to-host assignment logic into scheduler. Local test
setup uses MiniLlama.
Change-Id: Ic7b0fe43de52d30f4207b4e65cce7e6a294e54e1
We weren't attaching resources to the row batch when starting a new
row group, so it was possible for string data to be overwritten. This
patch removes CloseStreams() and merges its functionality with
AttachCompletedResources() so it's not possible to destroy streams
without transferring the resources first. It also merges and removes
ScannerContext::Close().
Also adds test cases for IMPALA-720.
Change-Id: Ia8f40c7d39d8702716f1d337fe797e2696bd0fcb
Also adds a bit more logging on which individual services are starting.
Change-Id: I53f12e1825fbf738e2fb8325874c3126e55f3f44
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1147
Tested-by: jenkins
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
parquet-mr had a bug where it didn't include the dictionary page's
header in the total column size. We now compensate for this by
detecting these files and padding the scan range length. This required
changing how the scanner detects when it's finished: it now counts the
number of rows rather than checking eosr (since the scan range may be
longer than the column).
Change-Id: Id9933808b965003c0c3b3aa78c32fe29a0c4bcbe
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1097
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
Adds the TPC-DS queries as planner tests and fixes a few small issues
with the planner test file parser. The queries use SQL-92 style joins
with a hand-optimized (although not perfect) join order.
Change-Id: I2d81e66af740b2d826b8ebd0c5ba8553b5faf0a2
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1019
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
The FE was creating class loaders with the HDFS locations of Hive UDF
libs, rather than the local locations created by the BE. Our tests
still passed since we only used UDFs already on the classpath
(e.g. Hive builtins).
Change-Id: Idbe9c98ad6adb84b70cb44efbf9ad0afc53366ca
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1081
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
Updates our compute stats script to execute using Impala. This allows us
to easily compute stats on all tables in a database or all tables in the
metastore.
The updated stats caused one of the TPCH plans to change so this also
updates the TPCH planner test results.
Change-Id: I17e5dcd1036a35e40eb4eb2c8e4a20702db9049c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1024
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
This change updates our DDL syntax support to allow for using 'STORED AS PARQUET'
as well as 'STORED AS PARQUETFILE'. Moving forward we should prefer the new syntax,
but continue to support the old. I made the same change for 'AVROFILE', but since
we have not yet documented the 'AVROFILE' syntax I left out support for the old syntax.
Change-Id: I10c73a71a94ee488c9ae205485777b58ab8957c9
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1053
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: jenkins
Currently, we execute all the queries involved in data loading serially. This change
creates a separate .sql file for each file format, compression codec and compression
scheme combination, and executes all the files in parallel. Additionally, we now store all the
.sql files (independent of workload) in $IMPALA_HOME/data_load_files/<dataset_name>. Note
that only data loaded through Impala is parallelized; data loaded through Hive and HBase
remains serial.
On our build machines, the time taken to load all the data from snapshot was on the order
of 15 minutes.
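
A minimal sketch of the parallel execution described above, assuming a
hypothetical `run_sql_file` callback that executes one file through Impala.
The file naming scheme and the worker count are illustrative assumptions, not
what the loading scripts actually use.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def sql_file_names(
    formats: Sequence[str], codecs: Sequence[str], schemes: Sequence[str]
) -> List[str]:
    """One .sql file per (file format, codec, scheme) combination."""
    return [
        f"load-{fmt}-{codec}-{scheme}.sql"
        for fmt, codec, scheme in product(formats, codecs, schemes)
    ]


def run_in_parallel(
    files: Sequence[str],
    run_sql_file: Callable[[str], T],
    max_workers: int = 4,  # illustrative default
) -> Dict[str, T]:
    """Run every file concurrently and map each file name to its result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs files with results.
        return dict(zip(files, pool.map(run_sql_file, files)))
```

Threads are enough here because each worker would mostly wait on the query
engine rather than do local computation.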
Change-Id: If8a862c43f0e75b506ca05d83eacdc05621cbbf8
Reviewed-on: http://gerrit.ent.cloudera.com:8080/804
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
With this change we now detect if a table is read-only and disable INSERT/LOAD operations
on these tables. A table is read-only if Impala does not have write permission on the HDFS
base directory of the table or any one of the partition directories (if
the table is partitioned).
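
The read-only rule above can be expressed compactly. In this sketch the
permission check is passed in as `can_write`, a hypothetical callback, because
the real check is made against HDFS; the function name is also an assumption.

```python
from typing import Callable, Iterable


def is_read_only(
    base_dir: str,
    partition_dirs: Iterable[str],
    can_write: Callable[[str], bool],
) -> bool:
    """A table is read-only if any of its directories is not writable."""
    # For an unpartitioned table, partition_dirs is simply empty.
    return any(not can_write(path) for path in [base_dir, *partition_dirs])
```

An INSERT or LOAD would then be rejected during analysis whenever this returns
True for the target table.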
Change-Id: I25515b2d0ffb7fe297359437fd937a3d6e0406a0
Reviewed-on: http://gerrit.ent.cloudera.com:8080/713
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
Before this, we had to specify the entire mangled symbol. This can be quite
long and quite tedious (take a look at some of the create UDA test cases that
specify all the symbols).
This patch adds some code to convert from the user function signature to the
mangled name. This means the user can specify the unmangled name and we can
do the symbol lookup. The mangling rules are pretty convoluted but if it is
messed up, the user can always specify the full symbol.
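
To give a feel for the mangling rules, here is a deliberately simplified
sketch of Itanium C++ ABI mangling for functions whose arguments are all
builtin types. It is not the conversion code added by this patch: real UDF
signatures take pointers and references to class types, which also require
substitution handling that this sketch omits. The `mangle` helper is a
hypothetical name.

```python
from typing import Sequence

# Itanium C++ ABI codes for a few builtin types.
BUILTIN_CODES = {
    "void": "v",
    "bool": "b",
    "char": "c",
    "short": "s",
    "int": "i",
    "long": "l",
    "float": "f",
    "double": "d",
}


def mangle(qualified_name: str, arg_types: Sequence[str]) -> str:
    """Mangle a function name with builtin-typed arguments only."""
    parts = qualified_name.split("::")
    # Each identifier is emitted as <length><identifier>.
    names = "".join(f"{len(part)}{part}" for part in parts)
    if len(parts) > 1:
        # Namespace-qualified names are wrapped in N ... E.
        names = f"N{names}E"
    # An empty parameter list is encoded as a single 'v'.
    args = "".join(BUILTIN_CODES[t] for t in arg_types) or "v"
    return "_Z" + names + args


print(mangle("foo", ["int", "int"]))  # _Z3fooii
```

Even this small subset shows why typing the full symbol by hand is tedious and
error-prone.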
Some other minor cleanup in:
- JNI from FE to BE
- UDFs/UDAs that are loaded as test data
Change-Id: I733dbf3a72cb7b06221c27e622d161bcca0d74a8
Reviewed-on: http://gerrit.ent.cloudera.com:8080/624
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
Using an external Hive Metastore Service for local test runs has a number of benefits:
it separates the metastore logs from the Impala logs, and it is more representative of
real cluster environments.
It also may help with some of the concurrency issues that we have been seeing when
running directly against the backend database since we no longer spin up an in-process
metastore server for each client connection.
The metastore is started by running "run-hive-server.sh" which is invoked as part of
"run-all.sh".
Change-Id: If60fa97aa38e4ad5cf578b9b409eeea1e0e29375
Reviewed-on: http://gerrit.ent.cloudera.com:8080/628
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
This patch also adds a number of improvements to NativeUdfExpr. Highlights include:
* Correctly handling the lowering of AnyVal struct types (required for ABI compatibility)
* A rudimentary library cache for reusing handles produced by dlopen
* More complicated test cases
Change-Id: Iab9acdd7d7c4308e5d7ee3210f21b033fda5a195
Reviewed-on: http://gerrit.ent.cloudera.com:8080/540
Tested-by: jenkins
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
OVERWRITE
INSERT OVERWRITE into an unpartitioned table is supposed to remove all
data files from the root. This should not include hidden files or
directories. This patch excludes hidden files from deletion, and adds a
test case.
Partition directories are still removed in their entirety: the cost of
statting a large number of files and directories rather than issuing a
single "rm -rf" outweighs the benefits of preserving hidden files for
now.
Hive does not preserve hidden files in either configuration.
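
The unpartitioned case can be illustrated with a local-filesystem sketch. The
real code operates on HDFS; here `os` calls stand in for it, the function name
is hypothetical, and treating names that start with "." or "_" as hidden is an
assumption rather than something stated in this patch.

```python
import os
from typing import List

HIDDEN_PREFIXES = (".", "_")  # assumption: what counts as a hidden name


def clear_data_files(root: str) -> List[str]:
    """Delete visible files directly under root and return their names.

    Hidden files are kept. Directories are left untouched in this sketch.
    """
    removed = []
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        if entry.startswith(HIDDEN_PREFIXES) or os.path.isdir(path):
            continue
        os.remove(path)
        removed.append(entry)
    return removed
```

Each entry has to be inspected individually, which is the per-file cost that
makes the same approach unattractive for partition directories.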
Change-Id: Ia73e55e011c26c88f14745075210cf359764e3c1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/418
Tested-by: jenkins
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
This change adds Impala DDL support for creation of AVRO tables.
Additionally, it adds Impala support for CREATE and ALTER SERDEPROPERTIES
which are used when creating Avro backed tables. This syntax is not
exactly the same as the Hive support since it introduces a new
fileformat (AVROFILE) that implies the needed Serialization library,
input format, and output format.
Change-Id: I5047e419198a89599e9d014fdedfee1a20437a7d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/464
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>