256 Commits

Author SHA1 Message Date
ishaan
db4bd623ed IMPALA-2131: The metastore database name should be a global constant.
Previously, we tried to dynamically name the metastore db. With the introduction of
metastore snapshots, this is no longer necessary and may cause naming ambiguity if the
Impala repository has a non-standard directory structure.

This patch uses a constant name - impala_hive - defined as an environment variable in
impala-config.

Change-Id: Iadc59db8c538113171c9c2b8cea3ef3f6b3bd4fc
Reviewed-on: http://gerrit.cloudera.org:8080/517
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-02-03 00:58:50 +00:00
Alex Behm
194377f7d8 Add the Postgres JDBC JAR to the HADOOP_CLASSPATH for Sentry.
This patch is required for updating thirdparty.

Sentry does not ship with the Postgres JDBC driver anymore,
so we need to point it to ours in thirdparty. Sentry picks
up JARs from the HADOOP_CLASSPATH and not the CLASSPATH,
so this patch adds the JDBC driver there in run-sentry-service.sh.

Change-Id: Iee950dfcd2839b4ca0fc827a45da2a9386c4404d
Reviewed-on: http://gerrit.cloudera.org:8080/1991
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-02-03 00:24:59 +00:00
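The classpath change described in commit 194377f7d8 can be sketched roughly as follows. This is an illustration only: the JAR path is a hypothetical placeholder (the log does not give the real location under thirdparty), and add_to_hadoop_classpath is a helper name invented for this sketch, not a function from run-sentry-service.sh.

```shell
#!/usr/bin/env bash
# Hypothetical JAR location; the real path under thirdparty is not given in the commit message.
POSTGRES_JDBC_JAR="${POSTGRES_JDBC_JAR:-/path/to/thirdparty/postgresql-jdbc.jar}"

# Append one entry to HADOOP_CLASSPATH, creating the variable if it is unset or empty.
# The ${VAR:+...} form avoids a leading ':' when the variable has no previous value.
add_to_hadoop_classpath() {
  HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+${HADOOP_CLASSPATH}:}$1"
  export HADOOP_CLASSPATH
}

add_to_hadoop_classpath "$POSTGRES_JDBC_JAR"
```

Appending rather than overwriting keeps whatever entries an earlier script already placed on HADOOP_CLASSPATH.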
Tim Armstrong
0c6628af18 Use psql -q consistently
Use psql -q to suppress verbose output during metastore creation.
Also use -q instead of redirection everywhere for consistency.

Change-Id: I539da86a50d18546474b2cfdc848f992745a7875
Reviewed-on: http://gerrit.cloudera.org:8080/1884
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-01-26 21:15:04 +00:00
Casey Ching
72d1889c08 IMPALA-2873: Fix nested TPC-H data loading
In commit 960808 I forgot to update the data-loading script for the
conversion of a shell script to a python script. It turns out there were
a couple of other little problems too. I checked manually that the data
was loaded after these changes.

Change-Id: Id81fc423348515ab446835868025cb839c77f52c
Reviewed-on: http://gerrit.cloudera.org:8080/1851
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2016-01-21 05:42:17 +00:00
Casey Ching
f288867833 Stress test: Various changes
The major changes are:

1) Collect backtrace and fatal log on crash.
2) Poll memory usage. The data is only displayed at this time.
3) Support kerberos.
4) Add random queries.
5) Generate random and TPC-H nested data on a remote cluster. The
   random data generator was converted to use MR for scaling.
6) Add a cluster abstraction to run data loading for #5 on a
   remote or local cluster. This also moves and consolidates some
   Cloudera Manager utilities that were in the stress test.
7) Cleanup the wrappers around impyla. That stuff was getting
   messy.

Change-Id: I4e4b72dbee1c867626a0b22291dd6462819e35d7
Reviewed-on: http://gerrit.cloudera.org:8080/1298
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2016-01-20 23:00:25 +00:00
Tim Armstrong
43de306d17 Log data loading and cluster setup to file
Log output of data loading steps to files and only print to stdout
if there is an actual failure. The output of some steps is very noisy,
and some steps even have output that looks like errors.

This is implemented with a run-step helper function in bash that handles
redirection and logging. Any bash command can be prefixed with run-step
<step description> <log file name> to redirect the output to a log file.

Sample output is:

Starting Impala cluster (logging to start-impala-cluster.log)... OK
Setting up HDFS environment (logging to setup-hdfs-env.log)... OK
Skipped loading the metadata.
Loading HBase data only (logging to load-hbase-only.log)... OK
Loading Hive UDFs (logging to build-and-copy-hive-udfs.log)... OK
Running custom post-load steps (logging to custom-post-load-steps.log)... OK
Caching test tables (logging to cache-test-tables.log)... OK
Loading external data sources (logging to load-ext-data-source.log)... OK
Splitting HBase (logging to create-hbase.log)... OK

Change-Id: I6396540858c408b084039a87efc81e1004626f39
Reviewed-on: http://gerrit.cloudera.org:8080/1760
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
2016-01-20 04:38:19 +00:00
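A helper like the run-step function described in commit 43de306d17 might look roughly like the sketch below. It is not the code from the patch: the function is spelled run_step here, LOG_DIR is a hypothetical variable, and the "FAILED" wording is an assumption. Only the "<description> (logging to <file>)... OK" line mirrors the sample output in the commit message.

```shell
#!/usr/bin/env bash
# Directory for step logs; LOG_DIR is a hypothetical name chosen for this sketch.
LOG_DIR="${LOG_DIR:-$(mktemp -d)}"

# Usage: run_step <step description> <log file name> <command> [args...]
# Runs the command with stdout and stderr redirected to the log file. Prints "OK" on
# success; on failure prints the captured log and returns the command's exit status.
run_step() {
  local msg="$1" log="${LOG_DIR}/$2"
  shift 2
  printf '%s (logging to %s)... ' "$msg" "$(basename "$log")"
  if "$@" > "$log" 2>&1; then
    echo "OK"
  else
    local status=$?
    echo "FAILED"
    cat "$log"
    return "$status"
  fi
}

run_step "Printing a greeting" greeting.log echo "hello"
```

Because the log is only dumped on failure, noisy steps stay quiet in the common case while the full output remains available on disk.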
Martin Grund
4648a39823 Introduce 'latest' build symlink
This adds a new 'latest' symlink in be/build that links to the latest
build configuration. This makes our script behave better as we don't
need to hard-code specific build types but can rather depend on sensible
defaults.

This patch addresses this issue in the cluster startup and a script that
is executed in the context of data loading. There might be more places
but so far my search did not yield any additional places where we rely
on a hardcoded path.

Change-Id: Ic814a1bef1d3088b2f8c1c34f25e2112b74315f8
Reviewed-on: http://gerrit.cloudera.org:8080/1797
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-01-17 09:04:53 +00:00
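One way to maintain a 'latest' link such as the one commit 4648a39823 introduces in be/build is sketched below. The helper name update_latest_link and the build type names ("debug", "release") are assumptions for illustration; the commit message does not show how the link is actually created.

```shell
#!/usr/bin/env bash
# Point <build_root>/latest at the given build type directory.
# -s creates a symlink, -f replaces an existing link, and -n stops ln from
# following an old 'latest' link into the directory it points to.
update_latest_link() {
  local build_root="$1" build_type="$2"
  mkdir -p "${build_root}/${build_type}"
  ln -sfn "${build_type}" "${build_root}/latest"
}
```

Scripts can then refer to be/build/latest instead of a hard-coded build type, for example after calling `update_latest_link be/build release`.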
Tim Armstrong
1832bb285e Misc changes to reduce build noise
Use mvn-quiet.sh in a couple of places it was missed.

Fix mvn warnings.

Provide -q flag to git clean to prevent it reporting all of the files it
deletes.

Change-Id: I77ec2265bf35f64ab1ac76b0a253e67c5f97eccd
Reviewed-on: http://gerrit.cloudera.org:8080/1804
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-01-17 08:50:07 +00:00
Tim Armstrong
f13dfcbddc Suppress maven info logging
Maven's INFO log level is very verbose and includes a lot of progress
information that is minimally useful.

Maven doesn't have an option to output only ERROR and WARNING log
messages. As a workaround, use grep to filter out the majority of the
output (only warnings, errors, tests, and success/failure).

Also add a header with relevant info about the maven command:
targets and working directory.

Change-Id: I828b870edc2fc80a6460e6ed594d507c46e69c82
Reviewed-on: http://gerrit.cloudera.org:8080/1752
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-01-15 19:38:46 +00:00
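The grep-based workaround from commit f13dfcbddc could be approximated as in the sketch below. The pattern list and the header layout are assumptions, not the contents of the real wrapper; filter_maven_output and mvn_quiet are names chosen for this illustration.

```shell
#!/usr/bin/env bash
# Keep only warnings, errors, test summaries, and the final build status.
# "|| true" stops grep from reporting failure when nothing matches.
filter_maven_output() {
  grep -E '^\[(WARNING|ERROR)\]|Tests run:|BUILD (SUCCESS|FAILURE)' || true
}

# Print a header with the targets and working directory, then run Maven through
# the filter and return the exit status of mvn itself rather than of the filter.
mvn_quiet() {
  echo "========================================================================"
  echo "Running mvn $*"
  echo "Directory: $(pwd)"
  echo "========================================================================"
  mvn "$@" 2>&1 | filter_maven_output
  return "${PIPESTATUS[0]}"
}
```

Returning the first element of PIPESTATUS matters here: without it, a failed build would be masked by the always-successful filter at the end of the pipeline.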
Martin Grund
d51f20fa1f Passing cluster startup flags
This patch allows passing additional cluster startup flags.
This is needed when building with optimizations in release
mode as the default cluster startup would only pick up a
debug build.

Change-Id: Ib98d6814558f2d82bdeac0e3cce1fb7db048c459
Reviewed-on: http://gerrit.cloudera.org:8080/1775
Tested-by: Internal Jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
2016-01-14 16:48:43 +00:00
Casey Ching
cfb1ab5c2c IMPALA-2781: Fix shell error reporting after chdir
The original error reporting relied on $0 being accessible from the
current working dir, which failed if a script changed the working dir
and $0 was relative. This updates the error reporting command to cd back
to the original dir before accessing $0.

Change-Id: I2185af66e35e29b41dbe1bb08de24200bacea8a1
Reviewed-on: http://gerrit.cloudera.org:8080/1666
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2016-01-14 07:10:54 +00:00
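The idea behind the IMPALA-2781 fix in commit cfb1ab5c2c, returning to the starting directory before touching a relative $0, can be illustrated with the sketch below. ORIGINAL_DIR, resolve_script_path, and report_error are hypothetical names, and the message format is an assumption.

```shell
#!/usr/bin/env bash
# Remember the directory the script was started from, before anything calls cd.
ORIGINAL_DIR="$(pwd)"

# Resolve a possibly relative script path against the original directory.
# The work happens in a subshell, so the caller's current directory is unchanged.
resolve_script_path() {
  (cd "$ORIGINAL_DIR" && cd "$(dirname "$1")" && echo "$(pwd)/$(basename "$1")")
}

# Report an error location using the resolved path of the running script.
report_error() {
  echo "Error in $(resolve_script_path "$0") at line $1" >&2
}
```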
Juan Yu
4f61edee1d IMPALA-2798: Bring in AVRO-1617 fix and add test case for it
Impala could crash or return wrong results if it uses the codegen'd
Avro decoding function to scan an Avro file that has a different
schema than the table schema. With the AVRO-1617 fix, we make sure
Impala doesn't use codegen if the table schema has fewer columns
than the file schema.

Change-Id: I268419e421404ad6b084482dee417634f17ecf60
Reviewed-on: http://gerrit.cloudera.org:8080/1696
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
2016-01-14 06:04:48 +00:00
Tim Armstrong
4b5ad8cbfd Reduce log output for postgres db operations
Various test scripts operating on postgres databases output
unhelpful log messages, including "ERROR" messages that
aren't actual errors when trying to drop a database that doesn't exist.

Send useless output to /dev/null and consistently use || true to
ignore errors from dropdb.

Change-Id: I95f123a8e8cc083bf4eb81fe1199be74a64180f5
Reviewed-on: http://gerrit.cloudera.org:8080/1753
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-01-13 03:58:50 +00:00
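The pattern from commit 4b5ad8cbfd, discarding output and tolerating a failed dropdb, might be wrapped as in the sketch below. The DROPDB override is an addition made only so the helper can be exercised without a Postgres installation; it is not part of the patch.

```shell
#!/usr/bin/env bash
# Drop a database, sending all output to /dev/null and ignoring failure
# (for example when the database does not exist).
# DROPDB lets a caller substitute the command; that hook is specific to this sketch.
drop_db_quietly() {
  "${DROPDB:-dropdb}" "$@" > /dev/null 2>&1 || true
}
```

The trailing `|| true` is what keeps a script running under `set -e` from aborting when the database is already gone.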
Casey Ching
e2bfb6ae2f Misc improvements to shell scripts about error reporting
Changes:
  1) Consistently use "set -euo pipefail".
  2) When an error happens, print the file and line.
  3) Consolidated some of the kill scripts.
  4) Added better error messages to the load data script.
  5) Changed use of #!/bin/sh to bash.

Change-Id: I14fef66c46c1b4461859382ba3fd0dee0fbcdce1
Reviewed-on: http://gerrit.cloudera.org:8080/1620
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2015-12-17 18:25:27 +00:00
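Items 1 and 2 of commit e2bfb6ae2f can be combined in a short preamble like the sketch below. The function name report_failure and the message wording are assumptions; the commit message only states that the file and line are printed.

```shell
#!/usr/bin/env bash
# Exit on errors (-e), on use of unset variables (-u), and when any command
# in a pipeline fails (pipefail).
set -euo pipefail

# Print the file and line of a failing command to stderr.
report_failure() {
  echo "Error in ${1} at line ${2}: command exited with status ${3}" >&2
}

# BASH_SOURCE and LINENO are bash variables, which is one reason to use
# bash rather than /bin/sh for scripts that rely on this trap.
trap 'report_failure "${BASH_SOURCE[0]:-$0}" "$LINENO" "$?"' ERR
```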
casey
c56ba5149c Infra scripts: Only attempt to kill processes owned by the current user
This is for compatibility with docker containers. Before this patch,
when the scripts were run on the docker host, the scripts would try
to kill the mini-cluster in the docker containers and fail because they
didn't have permissions (the user is different). Now the scripts will
only try to kill mini-cluster processes that were started by the current
user.

Also some psutil availability checks were removed because psutil is now
provided by the python virtualenv.

Change-Id: Ida371797bbaffd0a3bd84ab353cb9f466ca510fd
Reviewed-on: http://gerrit.cloudera.org:8080/1541
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2015-12-17 12:08:33 +00:00
Vlad Berindei
b6c20b2a40 Allow Impala to run against local filesystem.
Allow Impala to start only with a running HMS (and no additional services like HDFS,
HBase, Hive, YARN) and use the local file system.

Skip all tests that need these services, use HDFS caching or assume that multiple impalads
are running.

To run Impala with the local filesystem, set TARGET_FILESYSTEM to 'local' and
WAREHOUSE_LOCATION_PREFIX to a location on the local filesystem where the current user has
permissions since this is the location where the test data will be extracted.

Test coverage (with core strategy) in comparison with HDFS and S3:
HDFS             1348 tests passed
S3               1157 tests passed
Local Filesystem 1161 tests passed

Change-Id: Ic9718c7e0307273382b1cc6baf203ff2fb2acd03
Reviewed-on: http://gerrit.cloudera.org:8080/1352
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Readability: Alex Behm <alex.behm@cloudera.com>
2015-12-05 06:48:32 +00:00
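The local filesystem setup described in commit b6c20b2a40 amounts to two environment variables. In the sketch below the warehouse path is a hypothetical example, and the mkdir call is an addition for convenience rather than something the commit message mentions.

```shell
#!/usr/bin/env bash
# Run against the local filesystem instead of HDFS.
export TARGET_FILESYSTEM=local
# Hypothetical path; any local directory writable by the current user should do.
export WAREHOUSE_LOCATION_PREFIX="/tmp/impala-local-warehouse"
mkdir -p "$WAREHOUSE_LOCATION_PREFIX"
```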
Taras Bobrovytsky
22df1fe1ca Random nested schema and data generation
Change-Id: Ie89f140ed389cd877a84ffe2df892853ac9897f2
Reviewed-on: http://gerrit.cloudera.org:8080/1167
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2015-11-14 05:19:32 +00:00
Sailesh Mukil
277a92a14a IMPALA-2479: Failure in TestParquet.test_verify_runtime_profile
The test_verify_runtime_profile test failed during C5.5 builds and
GVMs because this test relies on the table lineitem_multiblock to
have 3 blocks. However, due to the rules to load the data not being
followed in the functional_schema_template.sql file, the table ended
up being stored with only one block.

This change moves the data load to the end of create-load-data.sh
file which would load the data even for snapshots.

Change-Id: I78030dd390d2453230c4b7b581ae33004dbf71be
Reviewed-on: http://gerrit.cloudera.org:8080/1153
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
2015-10-08 15:16:35 -07:00
ishaan
1beb8cc36d Increase Hive's heap size while writing nested tpch.
Recently, the full data load started failing because Hive ran out of heap space while
writing the nested tpch tables. This patch simply bumps up the heap space, and the query
is now successful.

Change-Id: I92d0029659c41417d76a15f703df1d42e5187d5e
Reviewed-on: http://gerrit.cloudera.org:8080/776
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-09-09 10:32:06 +00:00
Dan Hecht
4457eb2df3 IMPALA-2278: fix generate-schema-statements.py --force-reload for text/lzo
The combination of --force and text/lzo was broken if the partition
directories already contained data.  For reasons explained in the
comments, the ALTER TABLE ADD PARTITION step is skipped in this case,
which causes Hive to not do a full overwrite with INSERT OVERWRITE.

Fix it by manually removing the directories.

Testing: Verified the following combinations of load-data.py for
	 text/lzo now work:
 {--force, ""} x {no partition dirs, partition dirs with files}

Change-Id: I3ee34c4d85c58644345eadd8fc0976665c1bbaf5
Reviewed-on: http://gerrit.cloudera.org:8080/752
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-09-05 19:20:22 +00:00
ishaan
4b007666eb IMPALA-2302: Use a permitted value for parquet.block.size while loading nested tpch.
Due to a possible change in behaviour in Hive/MR, it is no longer possible to use
arbitrarily large values for parquet.block.size. This breaks the loading of nested tpch
data on newer Hive. This patch addresses the problem by using a permissible value.

Change-Id: Ib5b14651fb579cec6aa8d45bd2253cecb4346eb9
Reviewed-on: http://gerrit.cloudera.org:8080/755
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
2015-09-05 02:05:11 +00:00
Alex Behm
9d46853fbc Nested Types: Check un/supported file formats for complex types.
Before this patch, we used to accept any query referencing complex
types, regardless of the table/partition's file format being scanned.
We would ultimately hit a DCHECK in the BE when attempting to scan
complex types of a table/partition with an unsupported format.

This patch makes queries fail gracefully during planning if a scan
would access a table/partition in a format for which we do not
support complex types.

For mixed-format partitioned Hdfs tables we perform this check
at the partition granularity, so such a table can be scanned as
long as only partitions with supported formats are accessed.

HBase tables with complex-typed columns can be scanned as long as
no complex-typed columns are accessed in the query.

Change-Id: I2fd2e386c9755faf2cfe326541698a7094fa0ffc
Reviewed-on: http://gerrit.cloudera.org:8080/705
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-09-01 03:26:53 +00:00
Taras Bobrovytsky
b8b7930377 Add nested types support to Create Table Like File
Add support for creating a table based on a parquet file which contains arrays,
structs and/or maps.

Change-Id: I56259d53a3d9b82f318228e864c783b48a03f9ae
Reviewed-on: http://gerrit.cloudera.org:8080/582
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-08-22 01:46:26 +00:00
Taras Bobrovytsky
3c9ceb1a2b Add Parquet nested schemas to testdata
A script is added that generates two parquet files with nested data.
One file has modern nested types encoding and the other one has
legacy encoding. This data will be used for testing nested types
support for "create table like file" statement.

Change-Id: I8a4f64c9f7b3228583f3cb0af5507a9dd4d152ef
Reviewed-on: http://gerrit.cloudera.org:8080/610
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2015-08-13 10:25:39 +00:00
Casey Ching
d202d6a967 Use "impala-python" (virtualenv) instead of system python
Python tests and infra scripts will now use "python" from the virtualenv
via $IMPALA_HOME/bin/impala-python. Some scripts could be simplified now
that python 2.6 and a dependable set of third-party libraries are
available but that is not done as part of this commit.

Change-Id: If1cf96898d6350e78ea107b9026b12ba63a4162f
Reviewed-on: http://gerrit.cloudera.org:8080/603
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2015-08-06 02:09:09 +00:00
Alex Behm
c908ba1b7e IMPALA-1136: Support loading Avro tables without an explicit Avro schema
Hive allows creating Avro tables without an explicit Avro schema since 0.14.0.
For such tables, the Avro schema is inferred from the column definitions,
and not stored in the metadata at all (no Avro schema literal or Avro schema file).

This patch adds support for loading the metadata of such tables, although Impala
currently cannot create such tables (expect a follow-on patch).

Change-Id: I9e66921ffbeff7ce6db9619bcfb30278b571cd95
Reviewed-on: http://gerrit.cloudera.org:8080/538
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-07-31 12:13:37 +00:00
Alex Behm
1b6f14ab16 Nested Types: Compute stats for the nested TPCH database.
Change-Id: I7b2b77de1a9c25c2a5d9849b62437a58a18bdaae
Reviewed-on: http://gerrit.cloudera.org:8080/506
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-07-21 18:48:17 +00:00
Taras Bobrovytsky
704e3fa6bf Add loading by partitions option to the loaded_nested script
When loading a large nested table using the GROUP_CONCAT function,
Impala runs out of memory. We prevent this from happening by adding
an option to partition the table and load one partition at a time.

Change-Id: I8d517f94ef97e98d36eb8ebc8180865023655114
Reviewed-on: http://gerrit.cloudera.org:8080/448
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2015-07-02 03:34:53 +00:00
ishaan
377214c469 Use Isilon as the default file system when running Isilon tests.
This patch enables running Impala tests against Isilon as the default file system. The
intention is to run tests against a realistic deployment, i.e, Isilon replacing HDFS as
the underlying filesystem.

Specifically, it does the following:
  - Adds a new environment variable DEFAULT_FS, which points to HDFS by default.
  - Makes the fs.defaultFs property in core-site.xml use the DEFAULT_FS environment
    variable, such that all clients talk to Isilon implicitly.
  - Unset FILESYSTEM_PREFIX when the TARGET_FILESYSTEM is Isilon, since path prefixes
    are no longer needed.
  - Only starts the Hive Metastore and the Impala service stack when running
    tests against Isilon.

We don't start KMS/HBase because they're not relevant to Isilon. We also don't
start YARN, Hive and Llama because Hive queries are disabled with Isilon.

The scripts that start/stop Hive, YARN and Llama should be modified to point to a
filesystem other than HDFS in the future.

Change-Id: Id66bfb160fe57f66a64a089b465b536c6c514b63
Reviewed-on: http://gerrit.cloudera.org:8080/449
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
2015-06-11 01:23:11 +00:00
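The environment handling listed in commit 377214c469 might be sketched as below. The HDFS URI is a placeholder assumption, the lowercase value "isilon" for TARGET_FILESYSTEM is assumed, and configure_default_fs is a name invented for this illustration.

```shell
#!/usr/bin/env bash
configure_default_fs() {
  # Keep a caller-supplied DEFAULT_FS; otherwise fall back to a placeholder HDFS URI.
  export DEFAULT_FS="${DEFAULT_FS:-hdfs://localhost:8020}"
  # Path prefixes are no longer needed once Isilon is the default filesystem.
  if [ "${TARGET_FILESYSTEM:-hdfs}" = "isilon" ]; then
    unset FILESYSTEM_PREFIX
  fi
}

configure_default_fs
```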
Casey Ching
060f08ef69 Add tpch_nested_parquet database
The database will be used for testing in the future.

Change-Id: I60b54b36db9493a5bea308151b4027cd47d73047
Reviewed-on: http://gerrit.cloudera.org:8080/400
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
2015-06-04 21:18:36 +00:00
ishaan
dbc78aaa2c Enable isilon end to end tests for Impala.
This patch introduces changes to run tests against Isilon, combined with minor cleanup of
the test and client code.
For Isilon, it:
  - Populates the SkipIfIsilon class with appropriate pytest markers.
  - Introduces a new default for the hdfs client in order to connect to Isilon.
  - Cleans up a few test files to take the underlying filesystem into account.
  - Cleans up the interface for metadata/test_insert_behaviour, query_test/test_ddl

On the client side, we introduce a wrapper around a few of pywebhdfs's methods, specifically:
  - delete_file_dir does not throw an error if the file does not exist.
  - get_file_dir_status automatically strips the leading '/'

Change-Id: Ic630886e253e43b2daaf5adc8dedc0a271b0391f
Reviewed-on: http://gerrit.cloudera.org:8080/370
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
2015-05-27 22:25:12 +00:00
Alex Behm
1bd3eca22f Quietly resolve dependencies in Jenkins runs to avoid log spew.
Change-Id: If38a683785f3c6c9d92f762a2dfd86f009ce9d84
Reviewed-on: http://gerrit.cloudera.org:8080/392
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-05-19 09:12:43 +00:00
Alex Behm
013d6f968f Clean up FE pom.xml to eliminate console spew.
This patch makes the following changes in our pom to reduce
the build time and significantly reduce console spew.

1. Remove jar-with-dependencies from package goal.
We have no need for creating an uber jar that contains the FE as well
as all its dependencies. Locally, we carefully construct our class path
manually (relying on copy-dependencies), and in Impala deployments
the FE jar is put together with the other dependencies, so the FE jar
does not need to be self-contained.

2. Silence copy-dependencies.
Changes the configuration of the maven-dependency-plugin to not
log every copied file to the console.

Change-Id: If351e4e800fd1ca1108f9a0f4d88f52a53fc211c
Reviewed-on: http://gerrit.cloudera.org:8080/378
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-05-18 07:20:07 +00:00
ishaan
058978dccb Enable using isilon as the underlying filesystem.
This patch enables the Impala test suite to run the end to end tests
against an isilon namenode. There are a few caveats:
  - The fe test will currently not work.
  - Only loading data from both the test-warehouse snapshot and the metadata snapshot is
    supported.
  - The test suite cannot be run by multiple people (unless we have access to multiple
    isilon namenodes)

Change-Id: I786b4e4f51b99e79ad42abc676f537ebfc189237
Reviewed-on: http://gerrit.cloudera.org:8080/356
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
2015-05-12 01:28:19 +00:00
Casey Ching
6f1ce232f4 Use java from JAVA_HOME
Various build and test machines have multiple versions of java
installed and relying on the default "java" command being compatible
isn't practical (a machine may also build an older version of Impala
that might require a different java version). Since JAVA_HOME is already
required that can/should be used to determine which java binary to use.

This also includes a minor change to replace a block of code that was
using 4-space indent. Instead of using 2-space indent, that block was
replaced with one line.

Change-Id: I4b8698b2aa5411b5fa6c5bc06291625999478955
Reviewed-on: http://gerrit.cloudera.org:8080/310
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2015-04-03 00:13:22 +00:00
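Selecting the java binary from JAVA_HOME, as commit 6f1ce232f4 describes, could look like the sketch below. The function name java_from_java_home and the error messages are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Print the java binary under JAVA_HOME, failing loudly if JAVA_HOME is unset
# or does not contain an executable bin/java.
java_from_java_home() {
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME must be set" >&2
    return 1
  fi
  if [ ! -x "${JAVA_HOME}/bin/java" ]; then
    echo "No executable java found at ${JAVA_HOME}/bin/java" >&2
    return 1
  fi
  echo "${JAVA_HOME}/bin/java"
}
```

A caller would then run something like `"$(java_from_java_home)" -version` instead of relying on whichever java happens to be first on the PATH.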
ishaan
73d7ab11e1 Compute stats for tpch parquet tables while loading the data.
This patch removes the logic from the python test file; it should really live in the
code that sets up the test-warehouse.

Change-Id: Id04dc90c7ab813af2f347ec79e9e43d76de794a2
Reviewed-on: http://gerrit.cloudera.org:8080/224
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
2015-03-12 17:49:55 -07:00
Dan Hecht
2916132283 S3: enable more tests for S3
As needed, fix up file paths and other misc things to get
more test cases running against S3.

Change-Id: If4eaf9200f2abd17074080a37cd0225d977200ad
Reviewed-on: http://gerrit.cloudera.org:8080/167
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-03-11 16:39:39 -07:00
ishaan
4a9adfd685 Fix the full data load build by not hardcoding the lzo index file's name.
After the hive/hdfs rebase, the indexed lzo file names changed. This patch uses a
wildcard rather than a specific file name to protect against such changes. It's safe
because the test simply expects a partition that does not have index files.

Change-Id: I6d32609b62df83fe2a8ef935d7ca6506ecff5e0d
Reviewed-on: http://gerrit.cloudera.org:8080/150
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
2015-03-05 09:52:34 +00:00
ishaan
21d24f5295 Infrastructure changes to enable the hive version change from 0.13.1 to 1.1.0
Specifically:
  - Hive needs some jars from hadoop/tools/lib
  - Hive has a dependency on apache.snapshots (added in fe/pom.xml)
  - Beeline has to be explicitly told not to use jline.

Change-Id: Id38956b748f8f667a39505c92355f0298f308718

Conflicts:

	testdata/bin/load-hive-builtins.sh
2015-02-23 20:27:13 -08:00
Matthew Jacobs
835d6dbef4 IMPALA-1209: Add KMS service to testdata cluster (pt1)
First change for IMPALA-1209 to address Impala limitations when
using HDFS encryption. This adds a KMS process to the testdata
cluster. This was tested manually by creating a key and an
encryption zone.

Change-Id: I499154506386f04e71c5371b128c10868b1e1318
Reviewed-on: http://gerrit.cloudera.org:8080/41
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2015-02-13 20:46:14 +00:00
ishaan
b01252267a Fix the hive environment breakage caused by a malformed environment variable.
We build some of the jars that hive needs in fe/target/. A recent change resulted in these
jars not being loaded, causing a bad hive environment. This patch restores proper
behaviour.

Change-Id: Icb27ab04f7f77cb4ddab51326eedfd11a6cdf960
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5930
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
2015-02-04 02:51:00 -08:00
ishaan
2386fb84a8 Enable the data loading infrastructure to switch the underlying file system.
This patch enables loading data to s3 instead of hdfs. It is preliminary in nature;
as such, there are a few caveats:
 - The fe tests do not work.
 - Only loading from a test-warehouse snapshot and metastore snapshot is enabled.
 - Until hive works with s3, only a subset of all the tests will work.

Change-Id: Ia66a5f836b4245e3b022a49de805eec337a51324
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5851
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2015-02-03 01:02:42 -08:00
ishaan
5ac46af786 Fix the full data load path by explicitly creating the test-warehouse directory in hdfs.
Previously, when we started all the services, we created an HBase table from hive to avoid
a replication bug. This had the side-effect of creating a test-warehouse directory in
hdfs. After that check was removed, we no longer create the test-warehouse, causing the
full-data-load build to fail.

Change-Id: I75479562d33e08c79ad155c615cecb5b91c0eab6
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5904
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2015-02-03 00:51:49 -08:00
Alex Behm
762cae3fb9 Remove Hive-HBase warmup script because the original CDH-17414 issue has been resolved.
At the time when CDH-17414 was filed, the issue could be reproduced very reliably.
The issue seems to have been fixed, so our crufty workaround is no longer needed.

Change-Id: Ib31ac8f862ab2d06ebfc8656ce49b1b43fe301e8
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5892
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
2015-01-30 19:55:54 -08:00
ishaan
07efc0cb17 Add the ability to only reload the metastore snapshot in buildall and misc. changes.
This commit adds the ability to only load the metastore snapshot, with the assumption that
the hdfs data is already loaded. It also adds the ability to specify some
buildall parameters via the environment.

Change-Id: I4a07d4cf3a63479c377d4be79c4a2140c2a52fb8
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5665
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2015-01-09 12:40:06 -08:00
ishaan
dee6911b20 Enable loading metadata from the hive metastore snapshot and cleanup build scripts.
This patch contains the following changes:
  - Add a metastore_snapshot_file parameter to build.sh
  - Enable skipping loading the metadata.
  - create-load-data.sh is refactored into functions.
  - A lot of scripts source impala-config, which creates a lot of log spew. This has now
    been muted.
  - Unnecessary log spew from compute-table-stats has been muted.
  - build_thirdparty.sh determines its parallelism from the system; it was previously
    hard coded to 4.
  - Only force load data of the particular dataset if a schema change is detected.

Change-Id: I909336451e5c1ca57d21f040eb94c0e831546837
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5540
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-12-19 13:41:00 -08:00
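Deriving build parallelism from the system, as commit dee6911b20 says build_thirdparty.sh now does, might be done as in the sketch below. The use of getconf and the helper name detect_parallelism are assumptions; the commit message does not say which mechanism the script uses.

```shell
#!/usr/bin/env bash
# Print the number of online processors, falling back to 4 (the previously
# hard-coded value) if the count cannot be determined.
detect_parallelism() {
  local n
  n="$(getconf _NPROCESSORS_ONLN 2>/dev/null || true)"
  case "$n" in
    ''|*[!0-9]*) n=4 ;;
  esac
  echo "$n"
}
```

A build step could then pass the result on, for example `make -j"$(detect_parallelism)"`.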
ishaan
0ff8c99068 Fix generate-schema-statements to account for changed formatting in the hdfs client.
Change-Id: I4af5863bc0dd6660aef65e0e9b498002fc45edb8
2014-12-16 11:28:08 -08:00
ishaan
09b97f3881 Add the ability to load a metastore snapshot file.
This patch includes the following changes:
  - Modifies buildall to accept a hive metastore snapshot file as an argument.
  - Adds a script to load the hive metastore snapshot.

Change-Id: I7b9fc5b0643afe62fd4739a81eaa3bf9af1630da
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5510
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-12-08 18:16:45 -08:00
casey
4915ea4ac9 IMPALA-1134: Use copyBytes() to get value from o.a.h.io.Text
This affects java UDFs. Previously it was possible that the length of
the string returned from a java udf didn't match the actual data. Per the
Text.getBytes() documentation "... only data up to getLength() is
valid.". Impala just needs to use copyBytes() which is a convenience
function for this situation. The same should be done for BytesWritable.

Before:

Query: select length(echo('12345678901234567890'))
+-------------------------------------------+
| length(java.echo('12345678901234567890')) |
+-------------------------------------------+
| 22                                        |
+-------------------------------------------+

After:

Query: select length(echo('12345678901234567890'))
+-------------------------------------------------+
| length(functional.echo('12345678901234567890')) |
+-------------------------------------------------+
| 20                                              |
+-------------------------------------------------+

Change-Id: If9671278df8abf7529d3bc470c5f9d037ac3da1b
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4897
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: jenkins
2014-11-17 15:02:24 -08:00
Martin Grund
f58159d431 [CDH5] IMPALA-1141: HBase Planner Performance
This patch improves the performance of the planning phase of a query
querying HBase tables. It removes an unnecessary second call to compute
stats and adds a new version for estimating the row count in a table.

This patch adds an incremental version to estimate the number of rows
for a set of regions. This incremental version will start querying up to
five regions to calculate the average row size and use this value to
estimate the row count based on the size of the regions on disk. Only if
the standard deviation from the average is larger than 15% will it query
additional regions to calculate an average with more confidence.

If the data is balanced it will not be necessary to retrieve data from
all regions but only from a subset. In the worst case, all regions are
queried.

Change-Id: Idcb3bea81b11cb08da6d9329ba66c86aca23e170
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5258
Tested-by: jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
2014-11-14 13:47:02 -08:00
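The incremental estimate described in commit f58159d431 can be illustrated with the sketch below. It is a simplified model, not the planner's code: region sizes and sampled average row sizes are read from standard input instead of being fetched from HBase, "standard deviation larger than 15%" is interpreted as a standard deviation above 15% of the mean row size, and the result is truncated to an integer. The name estimate_row_count and the input format are assumptions.

```shell
#!/usr/bin/env bash
# Input: one line per region, "<size on disk in bytes> <sampled average row size in bytes>".
# Output: "<estimated row count> <number of regions sampled>".
estimate_row_count() {
  awk -v min_sample=5 -v max_rel_stddev=0.15 '
    { row_size[NR] = $2; total_bytes += $1 }
    END {
      n = 0; sum = 0; sum_sq = 0
      while (n < NR) {
        n++
        sum += row_size[n]
        sum_sq += row_size[n] * row_size[n]
        # Always look at the first min_sample regions (or all of them if there are fewer).
        if (n < min_sample && n < NR) continue
        mean = sum / n
        variance = sum_sq / n - mean * mean
        if (variance < 0) variance = 0
        # Stop once the sampled row sizes agree closely enough.
        if (sqrt(variance) <= max_rel_stddev * mean) break
      }
      if (n == 0 || sum <= 0) { print "0 " n; exit }
      # Row count is the total on-disk size divided by the average row size.
      printf "%d %d\n", total_bytes / (sum / n), n
    }'
}
```

With balanced data the loop stops after five regions; with skewed row sizes it keeps adding regions one at a time and, in the worst case, uses all of them.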