impala

mirror of https://github.com/apache/impala.git synced 2026-01-03 15:00:52 -05:00

Author	SHA1	Message	Date
Harrison Sheinblatt	1058163f70	IMPALA-2276: Isilon and s3 builds must fail with stale snapshot If a stale snapshot is detected, the full data load proceeds even if the option to skip data load was set. A check is added to fail immediately if this happens for isilon or s3 because the full data load will not work on these filesystems currently. Change-Id: I98faaa4a66e5715bd86289a56d199599b9011f52 Reviewed-on: http://gerrit.cloudera.org:8080/2811 Reviewed-by: Harrison Sheinblatt <hs7@hotmail.com> Tested-by: Internal Jenkins	2016-05-12 14:17:37 -07:00
Sailesh Mukil	49a73cd598	IMPALA-3249: Failed to mkdirs on core-local-filesystem build. This failure happens on filesystems other than HDFS because as a part of IMPALA-2466, the $FILESYSTEM_PREFIX was not added to the new directories that the patch tries to create in create-load-data. Change-Id: I8de74db93893c5273ccc9c687f608959628f5004 Reviewed-on: http://gerrit.cloudera.org:8080/2644 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-03-30 00:03:45 +00:00
Alex Behm	b2ccb17c21	Print last 50 lines of log if data loading fails. The 20 lines we dump currently are often not enough to diagnose a failure quickly. Increasing to 50 lines. Printing 50 lines is also consistent with our run-step script which also prints 50 lines. Change-Id: I353a2030be6fad1cd63879b4717e237344f85c73 Reviewed-on: http://gerrit.cloudera.org:8080/2632 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-03-28 20:22:18 +00:00
Alex Behm	7e76e92bef	Consolidate test and cluster logs under a single directory. All logs, test results and SQL files generated during data loading and testing are now consolidated under a single new directory $IMPALA_HOME/logs. The goal is to simplify archiving in Jenkins runs and debugging. The new structure is as follows: $IMPALA_HOME/logs/cluster - logs of Hadoop components and Impala $IMPALA_HOME/logs/data_loading - logs and SQL files produced in data loading $IMPALA_HOME/logs/fe_tests - logs and test output of Frontend unit tests $IMPALA_HOME/logs/be_tests - logs and test output of Backend unit tests $IMPALA_HOME/logs/ee_tests - logs and test output of end-to-end tests $IMPALA_HOME/logs/custom_cluster_tests - logs and test output of custom cluster tests I tested this change with a full data load which was successful. Change-Id: Ief1f58f3320ec39d31b3c6bc6ef87f58ff7dfdfa Reviewed-on: http://gerrit.cloudera.org:8080/2456 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-03-28 19:23:22 +00:00
Sailesh Mukil	76b674850f	IMPALA-2466: Add more tests for the HDFS parquet scanner. These tests functionally test whether the following type of files are able to be scanned properly: 1) Add a parquet file with multiple blocks such that each node has to scan multiple blocks. 2) Add a parquet file with multiple blocks but only one row group that spans the entire file. Only one scan range should do any work in this case. Change-Id: I4faccd9ce3fad42402652c8f17d4e7aa3d593368 Reviewed-on: http://gerrit.cloudera.org:8080/1500 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-03-25 13:10:15 +00:00
Casey Ching	432a76e4dd	Temporarily disable Kudu support Change-Id: I9aeb808a9898972788cb1d5d071619d8c64b514c Reviewed-on: http://gerrit.cloudera.org:8080/2551 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-03-16 00:15:34 +00:00
David Alves	82222abaf5	Merge branch 'feature/kudu' into cdh5-trunk This merges the 'feature/kudu' branch with cdh5-trunk as of commit: 055500cc753f87f6d1c70627321fcc825044e183 This patch is not a pure merge patch in the sense that goes beyond conflict resolution to also address reviews to the 'feature/kudu' branch as a whole. The review items and their resolution can be inspected at: http://gerrit.cloudera.org:8080/#/c/1403/ Change-Id: I6dd4270cd17a4f5c02811c343726db3504275a92	2016-03-11 11:37:58 -08:00
Casey Ching	72d1889c08	IMPALA-2873: Fix nested TPC-H data loading In commit 960808 I forgot to update the data-loading script for the conversion of a shell script to a python script. It turns out there were a couple of other little problems too. I checked manually that the data was loaded after these changes. Change-Id: Id81fc423348515ab446835868025cb839c77f52c Reviewed-on: http://gerrit.cloudera.org:8080/1851 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Internal Jenkins	2016-01-21 05:42:17 +00:00
Tim Armstrong	43de306d17	Log data loading and cluster setup to file Log output of data loading steps to files and only print to stdout if there is an actual failure. The output of some steps is very noisy, and some steps even have output that looks like errors. This is implemented with a run-step helper function in bash that handles redirection and logging. Any bash command can be prefixed with run-step <step description> <log file name> to redirect the output to a log file. Sample output is: Starting Impala cluster (logging to start-impala-cluster.log)... OK Setting up HDFS environment (logging to setup-hdfs-env.log)... OK Skipped loading the metadata. Loading HBase data only (logging to load-hbase-only.log)... OK Loading Hive UDFs (logging to build-and-copy-hive-udfs.log)... OK Running custom post-load steps (logging to custom-post-load-steps.log)... OK Caching test tables (logging to cache-test-tables.log)... OK Loading external data sources (logging to load-ext-data-source.log)... OK Splitting HBase (logging to create-hbase.log)... OK Change-Id: I6396540858c408b084039a87efc81e1004626f39 Reviewed-on: http://gerrit.cloudera.org:8080/1760 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: Internal Jenkins	2016-01-20 04:38:19 +00:00
Tim Armstrong	f13dfcbddc	Suppress maven info logging Maven's INFO log level is very verbose and includes a lot of progress information that is minimally useful. Maven doesn't have an option to output only ERROR and WARNING log messages. As a workaround, use grep to filter out the majority of the output (only warnings, errors, tests, and success/failure). Also add a header with relevant info about the maven command: targets and working directory. Change-Id: I828b870edc2fc80a6460e6ed594d507c46e69c82 Reviewed-on: http://gerrit.cloudera.org:8080/1752 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-01-15 19:38:46 +00:00
Martin Grund	d51f20fa1f	Passing cluster startup flags This patch allows passing additional cluster startup flags. This is needed when building with optimizations in release mode as the default cluster startup would only pick up a debug build. Change-Id: Ib98d6814558f2d82bdeac0e3cce1fb7db048c459 Reviewed-on: http://gerrit.cloudera.org:8080/1775 Tested-by: Internal Jenkins Reviewed-by: Martin Grund <mgrund@cloudera.com>	2016-01-14 16:48:43 +00:00
Casey Ching	cfb1ab5c2c	IMPALA-2781: Fix shell error reporting after chdir The original error reporting relied on $0 being accessible from the current working dir, which failed if a script changed the working dir and $0 was relative. This updates the error reporting command to cd back to the original dir before accessing $0. Change-Id: I2185af66e35e29b41dbe1bb08de24200bacea8a1 Reviewed-on: http://gerrit.cloudera.org:8080/1666 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-01-14 07:10:54 +00:00
Casey Ching	e2bfb6ae2f	Misc improvements to shell scripts about error reporting Changes: 1) Consistently use "set -euo pipefail". 2) When an error happens, print the file and line. 3) Consolidated some of the kill scripts. 4) Added better error messages to the load data script. 5) Changed use of #!/bin/sh to bash. Change-Id: I14fef66c46c1b4461859382ba3fd0dee0fbcdce1 Reviewed-on: http://gerrit.cloudera.org:8080/1620 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2015-12-17 18:25:27 +00:00
Vlad Berindei	b6c20b2a40	Allow Impala to run against local filesystem. Allow Impala to start only with a running HMS (and no additional services like HDFS, HBase, Hive, YARN) and use the local file system. Skip all tests that need these services, use HDFS caching or assume that multiple impalads are running. To run Impala with the local filesystem, set TARGET_FILESYSTEM to 'local' and WAREHOUSE_LOCATION_PREFIX to a location on the local filesystem where the current user has permissions since this is the location where the test data will be extracted. Test coverage (with core strategy) in comparison with HDFS and S3: HDFS 1348 tests passed S3 1157 tests passed Local Filesystem 1161 tests passed Change-Id: Ic9718c7e0307273382b1cc6baf203ff2fb2acd03 Reviewed-on: http://gerrit.cloudera.org:8080/1352 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins Readability: Alex Behm <alex.behm@cloudera.com>	2015-12-05 06:48:32 +00:00
Sailesh Mukil	277a92a14a	IMPALA-2479: Failure in TestParquet.test_verify_runtime_profile The test_verify_runtime_profile test failed during C5.5 builds and GVMs because this test relies on the table lineitem_multiblock to have 3 blocks. However, due to the rules to load the data not being followed in the functional_schema_template.sql file, the table ended up being stored with only one block. This change moves the data load to the end of create-load-data.sh file which would load the data even for snapshots. Change-Id: I78030dd390d2453230c4b7b581ae33004dbf71be Reviewed-on: http://gerrit.cloudera.org:8080/1153 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2015-10-08 15:16:35 -07:00
Taras Bobrovytsky	b8b7930377	Add nested types support to Create Table Like File Add support for creating a table based on a parquet file which contains arrays, structs and/or maps. Change-Id: I56259d53a3d9b82f318228e864c783b48a03f9ae Reviewed-on: http://gerrit.cloudera.org:8080/582 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-08-22 01:46:26 +00:00
Taras Bobrovytsky	3c9ceb1a2b	Add Parquet nested schemas to testdata A script is added that generates two parquet files with nested data. One file has modern nested types encoding and the other one has legacy encoding. This data will be used for testing nested types support for "create table like file" statement. Change-Id: I8a4f64c9f7b3228583f3cb0af5507a9dd4d152ef Reviewed-on: http://gerrit.cloudera.org:8080/610 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Internal Jenkins	2015-08-13 10:25:39 +00:00
Casey Ching	d202d6a967	Use "impala-python" (virtualenv) instead of system python Python tests and infra scripts will now use "python" from the virtualenv via $IMPALA_HOME/bin/impala-python. Some scripts could be simplified now that python 2.6 and a dependable set of third-party libraries are available but that is not done as part of this commit. Change-Id: If1cf96898d6350e78ea107b9026b12ba63a4162f Reviewed-on: http://gerrit.cloudera.org:8080/603 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Internal Jenkins	2015-08-06 02:09:09 +00:00
David Alves	b911c7e97e	Always create the kudu tables even if we skipped metadata load Patch c0c9fbdf57df667f63632437f612a63baf1534dd: "Load Kudu as part of the normal data loading workflow" passed the build when it was first introduced as it had introduced changes to the datasets directory which cause the metadata loading not to be skipped. However it failed all subsequent times as there were no further changes to the metadata directory. This patch makes data loading for Kudu run independently of whether metadata load is skipped or not since a new Kudu cluster is now created on each build. This patch also removes one last reference to 'functional_kudu.liketbl' in AnalyzeDDLTest since we don't create/load data for that table anymore. Change-Id: Ibe9acc7da17062ac317dff06a8c57dd87cf566d6 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/7110 Reviewed-by: Martin Grund <mgrund@cloudera.com> Tested-by: David Alves <david.alves@cloudera.com>	2015-07-12 23:02:08 -07:00
ishaan	377214c469	Use Isilon as the default file system when running Isilon tests. This patch enables running Impala tests against Isilon as the default file system. The intention is to run tests against a realistic deployment, i.e., Isilon replacing HDFS as the underlying filesystem. Specifically, it does the following: - Adds a new environment variable DEFAULT_FS, which points to HDFS by default. - Makes the fs.defaultFS property in core-site.xml use the DEFAULT_FS environment variable, such that all clients talk to Isilon implicitly. - Unsets FILESYSTEM_PREFIX when the TARGET_FILESYSTEM is Isilon, since path prefixes are no longer needed. - Only starts the Hive Metastore and the Impala service stack when running tests against Isilon. We don't start KMS/HBase because they're not relevant to Isilon. We also don't start YARN, Hive and Llama because Hive queries are disabled with Isilon. The scripts that start/stop Hive, YARN and Llama should be modified to point to a filesystem other than HDFS in the future. Change-Id: Id66bfb160fe57f66a64a089b465b536c6c514b63 Reviewed-on: http://gerrit.cloudera.org:8080/449 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: Internal Jenkins	2015-06-11 01:23:11 +00:00
Casey Ching	060f08ef69	Add tpch_nested_parquet database The database will be used for testing in the future. Change-Id: I60b54b36db9493a5bea308151b4027cd47d73047 Reviewed-on: http://gerrit.cloudera.org:8080/400 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: Internal Jenkins	2015-06-04 21:18:36 +00:00
Martin Grund	55f6457a28	Kudu Table and Kudu Scan Node This patch adds a basic implementation for a Kudu table, scan node and supports simple DDL operations. Similar to "normal" HDFS tables, the DDL statements executed in the Hive metastore can be propagated to Kudu. Otherwise, the Kudu table behaves similarly to an HBase table. The syntax to create a table stored in Kudu is: create table kudu (id int, name string, age int) tblproperties ( 'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler', 'kudu.table_name' = 'kudu', 'kudu.master_addresses' = '0.0.0.0:7051', 'kudu.key_columns' = 'id,name'); The 'storage_handler' attribute is fixed and used to identify a table backed by Kudu. The storage handler attribute is required, to make sure that Hive will not create a directory on HDFS for this table. The 'table_name' and 'master_addresses' properties define the properties of the physical persistence in Kudu. The 'key_columns' defines a list of columns that should be used as a (composite) key. A Kudu table can be created either as a managed or un-managed (external) table. Creating an external table behaves similarly to an external HDFS table: if the table does not exist in Kudu, create it; if it exists, use it and check schema compatibility. If an external table is deleted, only delete the Hive table. If a managed table is created, the Kudu table must not exist and if it is deleted the Kudu table is deleted as well. TODO: Allow creation of external table without specifying columns. Change-Id: I794abf6abe30ace4426c53f77676ae1dcb4341ec Reviewed-on: http://gerrit.sjc.cloudera.com:8080/6358 Tested-by: jenkins Reviewed-by: Martin Grund <mgrund@cloudera.com>	2015-06-01 16:51:53 -07:00
Martin Grund	e95349676b	Re-enabling data loading from snapshot. Change-Id: Ia8929a492cb97ed5a21bc4bd3d793e91a2076be3 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/6752 Reviewed-by: Martin Grund <mgrund@cloudera.com> Tested-by: jenkins	2015-06-01 16:01:10 -07:00
Martin Grund	0edbe5004a	FIXME: Ignore schema changes in build for now Change-Id: Ib57272c8ec49323bf0506af92b9c6e8f4e3ba5a4 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/6362 Tested-by: jenkins Reviewed-by: Martin Grund <mgrund@cloudera.com>	2015-06-01 15:41:52 -07:00
Alex Behm	1bd3eca22f	Quietly resolve dependencies in Jenkins runs to avoid log spew. Change-Id: If38a683785f3c6c9d92f762a2dfd86f009ce9d84 Reviewed-on: http://gerrit.cloudera.org:8080/392 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-05-19 09:12:43 +00:00
Dan Hecht	2916132283	S3: enable more tests for S3 As needed, fix up file paths and other misc things to get more test cases running against S3. Change-Id: If4eaf9200f2abd17074080a37cd0225d977200ad Reviewed-on: http://gerrit.cloudera.org:8080/167 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2015-03-11 16:39:39 -07:00
ishaan	4a9adfd685	Fix the full data load build by not hardcoding the lzo index file's name. After the hive/hdfs rebase, the indexed lzo file names changed. This patch uses a wildcard rather than a specific file name to protect against such changes. It's safe because the test simply expects a partition that does not have index files. Change-Id: I6d32609b62df83fe2a8ef935d7ca6506ecff5e0d Reviewed-on: http://gerrit.cloudera.org:8080/150 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: Internal Jenkins	2015-03-05 09:52:34 +00:00
Matthew Jacobs	835d6dbef4	IMPALA-1209: Add KMS service to testdata cluster (pt1) First change for IMPALA-1209 to address Impala limitations when using HDFS encryption. This adds a KMS process to the testdata cluster. This was tested manually by creating a key and an encryption zone. Change-Id: I499154506386f04e71c5371b128c10868b1e1318 Reviewed-on: http://gerrit.cloudera.org:8080/41 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2015-02-13 20:46:14 +00:00
ishaan	2386fb84a8	Enable the data loading infrastructure to switch the underlying file system. This patch enables loading data to s3 instead of hdfs. It is preliminary in nature, as such, there are a few caveats: - The fe tests do not work. - Only loading from a test-warehouse snapshot and metastore snapshot is enabled. - Until hive works with s3, only a subset of all the tests will work. Change-Id: Ia66a5f836b4245e3b022a49de805eec337a51324 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5851 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: jenkins	2015-02-03 01:02:42 -08:00
ishaan	5ac46af786	Fix the full data load path by explicitly creating the test-warehouse directory in hdfs. Previously, when we started all the services, we created an HBase table from hive to avoid a replication bug. This had the side-effect of creating a test-warehouse directory in hdfs. After that check was removed, we no longer create the test-warehouse, causing the full-data-load build to fail. Change-Id: I75479562d33e08c79ad155c615cecb5b91c0eab6 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5904 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: jenkins	2015-02-03 00:51:49 -08:00
ishaan	07efc0cb17	Add the ability to only reload the metastore snapshot in buildall and misc. changes. This commit adds the ability to only load the metastore snapshot, with the assumption that the hdfs data is already loaded. It also additionally adds the ability to specify some buildall parameters via the environment. Change-Id: I4a07d4cf3a63479c377d4be79c4a2140c2a52fb8 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5665 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: jenkins	2015-01-09 12:40:06 -08:00
ishaan	dee6911b20	Enable loading metadata from the hive metastore snapshot and cleanup build scripts. This patch contains the following changes: - Add a metastore_snapshot_file parameter to build.sh - Enable skipping loading the metadata. - create-load-data.sh is refactored into functions. - A lot of scripts source impala-config, which creates a lot of log spew. This has now been muted. - Unnecessary log spew from compute-table-stats has been muted. - build_thirdparty.sh determines its parallelism from the system; it was previously hard coded to 4 - Only force load data of the particular dataset if a schema change is detected. Change-Id: I909336451e5c1ca57d21f040eb94c0e831546837 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5540 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: jenkins	2014-12-19 13:41:00 -08:00
Skye Wanderman-Milne	4a722980e5	IMPALA-1401: raise MAX_PAGE_HEADER_SIZE and use scanner context to stitch together header buffer Change-Id: I4f33b90e845e9bef1ac929bf4ebb8e98eaff985c Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4961 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: jenkins (cherry picked from commit c3a90183b2f03434a9604f3aa2ef6dd08c9ba97c) Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4981 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>	2014-10-27 16:30:56 -07:00
Lenni Kuff	758ba08bbb	Silence most of data loading spew by redirecting it to log files Change-Id: I256a3970ce52bbcac816178029f703095fec388f Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4610 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: Lenni Kuff <lskuff@cloudera.com>	2014-10-06 15:09:42 -07:00
Victor Bittorf	a3767c9f2b	Fix data loading to unblock gvm Change-Id: I5e145f1e8497d340cb72a8112c247e63b1c79362 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4537 Reviewed-by: Nong Li <nong@cloudera.com> Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com> Tested-by: Victor Bittorf <victor.bittorf@cloudera.com>	2014-09-26 12:26:37 -07:00
Victor Bittorf	af4b2086dc	Char PARQUET, AVRO, and TEXT tests Adds fixes and tests for Hive CHAR & VARCHAR compatibility. Also fixes a bug in tuple materialization for VARCHAR and non in-lined CHAR. Change-Id: I400b089cb8ddba2e264ef9f2e37956b2ceaaf9fb Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4054 Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com> Tested-by: jenkins	2014-09-26 12:24:07 -07:00
Mike Yoder	75a97d3d7e	[CDH5] Kerberize mini-cluster and Impala daemons This is the first iteration of a kerberized development environment. All the daemons start and use kerberos, with the sole exception of the hive metastore. This is sufficient to test impala authentication. When buildall.sh is run using '-kerberize', it will stop before loading data or attempting to run tests. Loading data into the cluster is known to not work at this time, the root causes being that Beeline -> HiveServer2 -> MapReduce throws errors, and Beeline -> HiveServer2 -> HBase has problems. These are left for later work. However, the impala daemons will happily authenticate using kerberos both from clients (like the impala shell) and amongst each other. This means that if you can get data into the mini-cluster, you could query it. Usage: * Supply a '-kerberize' option to buildall.sh, or * Supply a '-kerberize' option to create-test-configuration.sh, then 'run-all.sh -format', re-source impala-config.sh, and then start impala daemons as usual. You must reformat the cluster because kerberizing it will change all the ownership of all files in HDFS. Notable changes: * Added clean start/stop script for the llama-minikdc * Creation of Kerberized HDFS - namenode and datanodes * Kerberized HBase (and Zookeeper) * Kerberized Hive (minus the MetaStore) * Kerberized Impala * Loading of data very nearly working Still to go: * Kerberize the MetaStore * Get data loading working * Run all tests * The unknown unknowns * Extensive testing Change-Id: Iee3f56f6cc28303821fc6a3bf3ca7f5933632160 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4019 Reviewed-by: Michael Yoder <myoder@cloudera.com> Tested-by: jenkins	2014-09-05 12:36:21 -07:00
Lenni Kuff	286e312460	[CDH5] Minor code changes for Hive .13 support Changes include: * Fix compile errors due to new column stats API and other stats related fixes. * Temporarily disable JDBC tests due to new serialization format in Hive .13 * Disable view compatibility tests until we can get them to work in Hive .13 * Test fixes due to Hive's type checking for partition column values Change-Id: I05cc6a95976e0e037be79d91bc330a06d2fdc46c	2014-08-11 09:53:02 -07:00
Victor Bittorf	2d7f2e19b2	IMPALA 938: Infer schema from Parquet file Syntax is "CREATE TABLE name LIKE fileformat '/path/to/file'". Supports all options that CREATE TABLE does. Currently only PARQUET is supported. Run testdata/bin/create-load-data.sh after pulling this patch. Change-Id: Ibb9fbb89dbde6acceb850b914c48d12f22b33f55 Reviewed-on: http://gerrit.ent.cloudera.com:8080/2720 Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/3158	2014-06-20 17:38:01 -07:00
ishaan	99602fb8c2	Force load data if the current HEAD has a schema change. This patch checks the test-warehouse's stored githash (if it exists) to determine if the current patch has changed the schema of a table. If a change is detected, we force load all the data. Change-Id: I314f9f3364d3e6b2d66de38a9e6d9f57c4e279a7 Reviewed-on: http://gerrit.ent.cloudera.com:8080/3049 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: jenkins	2014-06-19 02:25:50 -07:00
Nong Li	5d80942d42	[CDH5] IMPALA-1019: Fix cancellation path in io mgr for cached reads. Change-Id: I11efd65d1efa900f79afe88b781262a44ac5006a Reviewed-on: http://gerrit.ent.cloudera.com:8080/2703 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins	2014-05-30 19:14:39 -07:00
Lenni Kuff	c45e9a70d9	[CDH5] Add DDL support for HDFS caching This change adds DDL support for HDFS caching. The DDL allows the user to indicate a table or partition should be cached and which pool to cache the data into: * Create a cached table: CREATE TABLE ... CACHED IN 'poolName' * Cache a table/partition: ALTER TABLE ... [partitionSpec] SET CACHED IN 'poolName' * Uncache a table/partition: ALTER TABLE ... [partitionSpec] SET UNCACHED When a table/partition is marked as cached, a new HDFS caching request is submitted to cache the location (HDFS path) of the table/partition and the ID of that request is stored within the table metadata (in the table properties). This is stored as: 'cache_directive_id'='<requestId>'. The cache requests and IDs are managed by HDFS and persisted across HDFS restarts. When a cached table or partition is dropped it is important to uncache the cached data (drop the associated cache request). For partitioned tables, this means dropping all cache requests from all cached partitions in the table. Likewise, if a partitioned table is created as cached, new partitions should be marked as cached by default. It is desirable to know which cache pools exist early on (in analysis) so the query will fail without hitting HDFS/CatalogServer if a non-existent pool is specified. To support this, a new cache pool catalog object type was introduced. The catalog server caches the known pools (periodically refreshing the cache) and sends the known pools out in catalog updates. This allows impalads to perform analysis checks on cache pool existence without going to HDFS. It would be easy to use this to add basic cache pool management in the future (ADD/DROP/SHOW CACHE POOL). Waiting for the table/partition to become cached may take a long time. Instead of blocking the user from accessing the table during this period we will wait for the cache requests to complete in the background and once they have finished the table metadata will be automatically refreshed. Change-Id: I1de9c6e25b2a3bdc09edebda5510206eda3dd89b Reviewed-on: http://gerrit.ent.cloudera.com:8080/2310 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: jenkins	2014-05-27 16:47:15 -07:00
Matthew Jacobs	ebc6c5894e	External Data Source: Frontend and catalog changes Initial frontend and catalog changes for external data sources. Change-Id: Ia0e61ef97cfd7a4e138ef555c17f2e45bbf08c18 Reviewed-on: http://gerrit.ent.cloudera.com:8080/2224 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: jenkins (cherry picked from commit dfa14c828957f751db9c89bae0bdc040ce6f648c) Reviewed-on: http://gerrit.ent.cloudera.com:8080/2485	2014-05-08 14:56:19 -07:00
Nong Li	03e5665e56	Decimal: Read/Write to parquet. This adds support for the FIXED_LENGTH_BYTE_ARRAY parquet type and encoding for decimals. Change-Id: I9d5780feb4530989b568ec8d168cbdc32b7039bd Reviewed-on: http://gerrit.ent.cloudera.com:8080/1727 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/2432	2014-05-02 16:38:35 -07:00
Nong Li	87295a4e06	Decimal implementation. This patch implements decimal support for text based formats. Change-Id: I8e2c9e512ed149fe965216a72cb21fffd4f18e75 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1669 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: Nong Li <nong@cloudera.com> Reviewed-on: http://gerrit.ent.cloudera.com:8080/2238 Tested-by: jenkins	2014-04-14 21:07:32 -07:00
Lenni Kuff	aa0b7a35f5	IMPALA-880: COMPUTE STATS should update partitions in batches When updating partition metadata as part of COMPUTE STATS we would previously attempt to update all partitions at once. This could lead to HMS socket timeouts and also could run into issues if there were > 32K partitions. In this change we now update the partitions in batches, with a max size of 500 partitions per batch. We also compare whether the row count has changed and only update partitions that have been modified. Change-Id: If7bfcc30f86fc2fdd79855b981067ac29a47b5e1 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1913 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1918	2014-03-14 19:20:12 -07:00
Lenni Kuff	bf16b5cd0d	IMPALA-749: Fetch partitions in batches, rather than all at once. This updates how Impala fetches partition metadata from the Hive Metastore to fetch partitions in batches, rather than all at once. This helps reduce the load on the HMS and also lets Impala scale to above 32K partitions. The downside is that it may require additional RPCs to get all the partitions. This is done by first querying the metastore to get all the partition names that exist, then splitting the list of names into separate batches to get the actual partition metadata. Impala uses a default size of 1000 partitions per batch, but it can be configured by setting the 'hive.metastore.batch.retrieve.table.partition.max' parameter in the hive-site.xml config file. Change-Id: Ide0ec30ef8a9e00f79c26551aa8e5e7814c73034 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1662 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1698	2014-02-28 22:30:45 -08:00
Nong Li	04b501d3a1	[CDH5] Collect metadata for cached blocks. Change-Id: I81026de2f9a08553dc15e07090b8297120aa7462 (cherry picked from commit 69414f67b20016e49b739a46d6e2b4b57e1d1a3c) Reviewed-on: http://gerrit.ent.cloudera.com:8080/1252 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: Nong Li <nong@cloudera.com>	2014-01-15 15:12:20 -08:00
Alex Behm	dc7b398bd3	Impala reserves resources from YARN via LLama. Impala reserves resources from YARN via Llama and handles resources preemptions by cancelling affected queries. Adds the Impala Resource Broker for interacting with Llama. Refactors scheduler and coordinator to move fragment-to-host assignment logic into scheduler. Local test setup uses MiniLLama. Change-Id: Ic7b0fe43de52d30f4207b4e65cce7e6a294e54e1	2014-01-15 15:12:04 -08:00
Skye Wanderman-Milne	561da008c7	IMPALA-729: fix resource management in Parquet scanner for multiple row groups We weren't attaching resources to the row batch when starting a new row group, so it was possible for string data to be overwritten. This patch removes CloseStreams() and merges its functionality with AttachCompletedResources() so it's not possible to destroy streams without transferring the resources first. It also merges and removes ScannerContext::Close(). Also adds test cases for IMPALA-720. Change-Id: Ia8f40c7d39d8702716f1d337fe797e2696bd0fcb	2014-01-08 10:56:26 -08:00
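
Commit 43de306d17 describes a run-step helper that redirects a step's output to a log file and only prints it when the step fails, and commit b2ccb17c21 raises the number of dumped log lines to 50. The following is a minimal sketch of that idea, not the code from the repository: the function name `run_step`, the `LOG_DIR` variable and its temporary-directory default are assumptions made for illustration.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Assumed log directory for this sketch; defaults to a fresh temp directory.
LOG_DIR="${LOG_DIR:-$(mktemp -d)}"

# Usage: run_step <step description> <log file name> <command> [args...]
# Runs the command with stdout and stderr redirected to the log file.
# Prints "OK" on success; on failure prints the last 50 log lines and
# returns the command's exit status.
run_step() {
  local msg="$1" log="${LOG_DIR}/$2"
  shift 2
  printf '%s (logging to %s)... ' "$msg" "$(basename -- "$log")"
  if "$@" >"$log" 2>&1; then
    echo "OK"
  else
    local rc=$?
    echo "FAILED"
    echo "'$*' failed. Tail of log:"
    tail -n 50 "$log"
    return "$rc"
  fi
}
```

A call such as `run_step "Setting up HDFS environment" setup-hdfs-env.log some_command` would print one status line in the style of the sample output quoted in that commit, where `some_command` stands in for whatever step is being wrapped.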

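Commits e2bfb6ae2f and cfb1ab5c2c describe using "set -euo pipefail" consistently, printing the file and line when an error happens, and changing back to the original directory before resolving `$0`. A rough sketch of how those pieces could fit together is shown here; the names `ORIG_DIR` and `report_error` are hypothetical and the actual scripts may be structured differently.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Remember the starting directory so that a relative $0 can still be
# resolved after the script has changed its working directory.
ORIG_DIR="$(pwd)"

# Print the script location and the failing line number to stderr.
report_error() {
  local line="$1" script_dir
  # Resolve the script's directory relative to where the script started,
  # not relative to the current working directory.
  script_dir="$(cd "$ORIG_DIR" && cd "$(dirname -- "$0")" && pwd)"
  echo "Error in ${script_dir}/$(basename -- "$0") at line ${line}" >&2
}

# Invoke the reporter whenever a command fails under errexit.
trap 'report_error "$LINENO"' ERR
```

Resolving the path from `ORIG_DIR` is what keeps the message usable when the failing command runs after a `cd`, which is the situation IMPALA-2781 describes.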
99 Commits