The help output for buildall.sh notes running `buildall.sh -testdata` as
an option to incrementally load test data without formatting the
mini-cluster. However, trying to do that with data already loaded
results in an error when running `hadoop fs -mkdir /test-warehouse`. Add
`-p` so this step is idempotent, allowing the example to work as
documented.
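For reference, the now-idempotent form of the failing step (the path is
the one named above):

  hadoop fs -mkdir -p /test-warehouse  # -p: succeed even if the directory exists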
Change-Id: Icc4ec4bb746abf53f6787fce4db493919806aaa9
Reviewed-on: http://gerrit.cloudera.org:8080/18522
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for COS (Cloud Object Storage). Using
hadoop-cos, the implementation is similar to that of other remote
FileSystems.
New flags for COS:
- num_cos_io_threads: Number of COS I/O threads. Defaults to 16.
Follow-up:
- Support for caching COS file handles will be addressed in
IMPALA-10772.
- test_concurrent_inserts and test_failing_inserts in
test_acid_stress.py are skipped due to slow file listing on
COS (IMPALA-10773).
Tests:
- Upload hdfs test data to a COS bucket. Modify all locations in HMS
DB to point to the COS bucket. Remove some hdfs caching params.
Run CORE tests.
Change-Id: Idce135a7591d1b4c74425e365525be3086a39821
Reviewed-on: http://gerrit.cloudera.org:8080/17503
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch bumps up the GBN to 15549253. It includes Fang-Yu's fix to
use the correct policy id when updating the "all - database" policy,
needed due to a change on the Ranger side.
Testing:
* Ran create-load-data.sh
Change-Id: Ie7776e62dad0b9bec6c03fb9ee8f1b8728ff0e69
Reviewed-on: http://gerrit.cloudera.org:8080/17746
Reviewed-by: Vihang Karajgaonkar <vihang@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
We found that setup-ranger would continue execution after an error
occurred when i) wget was run to initialize the environment variables
GROUP_ID_OWNER and GROUP_ID_NON_OWNER, and ii) curl was run to upload a
revised Ranger policy, even though the -e option was set when
create-load-data.sh was executed. This patch improves the error
handling by making setup-ranger exit as soon as an error occurs, so
that no tests are run at all when there is an error.
To exit if an error occurs during wget, we separate the assignment of
each environment variable from its export: when the two are combined,
the export itself runs and succeeds regardless of whether the command
substitution before it failed. That is, combining them hides the error.
On the other hand, to exit if an error occurs during curl, we add the
-f option so that an HTTP error is no longer silently ignored.
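A minimal sketch of both fixes (the variable name and policy file are
from the text above; the Ranger URL, endpoint, and credentials are
illustrative):

  RANGER_URL="http://localhost:6080"
  # Combined: 'export' itself exits 0 even if the wget substitution
  # fails, so 'set -e' never fires.
  export GROUP_ID_OWNER=$(wget -q -O - "${RANGER_URL}/group_id_owner")
  # Separated: the assignment's exit status is wget's, so 'set -e' fires.
  GROUP_ID_OWNER=$(wget -q -O - "${RANGER_URL}/group_id_owner")
  export GROUP_ID_OWNER
  # Without -f, curl exits 0 on an HTTP error; -f makes it exit non-zero.
  curl -f -u admin:admin -X PUT -H "Content-Type: application/json" \
    -d @policy_4_revised.json "${RANGER_URL}/service/public/v2/api/policy/4"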
Testing:
- Verified that setup-ranger could be successfully executed after this
patch.
- Verified that setup-ranger would exit if a URL in setup-ranger is not
correctly set up or if the 'id' field in policy_4_revised.json does
not match the URL of the policy to be updated.
Change-Id: I45605d1a7441b734cf80249626638cde3adce28b
Reviewed-on: http://gerrit.cloudera.org:8080/17386
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Streamline triaging a bit. When this fails, it does so in a specific
location, and until now you had to scan the build log to find the
problem. This JUnitXML symptom should make this failure mode obvious.
Tested by running an S3 build on private infrastructure with a
deliberately mismatched data snapshot.
Change-Id: I2fa193740a2764fdda799d6a9cc64f89cab64aba
Reviewed-on: http://gerrit.cloudera.org:8080/17242
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for GCS (Google Cloud Storage). Using the
gcs-connector, the implementation is similar to that of other remote
FileSystems.
New flags for GCS:
- num_gcs_io_threads: Number of GCS I/O threads. Defaults to 16.
Follow-up:
- Support for spilling to GCS will be addressed in IMPALA-10561.
- Support for caching GCS file handles will be addressed in
IMPALA-10568.
- test_concurrent_inserts and test_failing_inserts in
test_acid_stress.py are skipped due to slow file listing on
GCS (IMPALA-10562).
- Some tests are skipped due to issues introduced by /etc/hosts setting
on GCE instances (IMPALA-10563).
Tests:
- Compile and create hdfs test data on a GCE instance. Upload test data
to a GCS bucket. Modify all locations in HMS DB to point to the GCS
bucket. Remove some hdfs caching params. Run CORE tests.
- Compile and load snapshot data to a GCS bucket. Run CORE tests.
Change-Id: Ia91ec956de3b620cccf6a1244b56b7da7a45b32b
Reviewed-on: http://gerrit.cloudera.org:8080/17121
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This changes all existing Java code to be submodules under
a single root pom. The root pom is impala-parent/pom.xml
with minor changes to add submodules.
This avoids most of the weird CMake/maven interactions,
because there is now a single maven invocation for all
the Java code.
This moves all the Java projects other than fe into a top-level java
directory. fe is left where it is to avoid disruption (but is still
compiled via the java directory's root pom). Various pieces of code
that reference the old locations are updated.
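A sketch of the effect (the mvn flags are illustrative):

  # One Maven invocation now builds every Java module, fe included:
  cd "${IMPALA_HOME}/java" && mvn -DskipTests install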
Based on research, there are two options for dealing
with the shaded dependencies. The first is to have an
entirely separate Maven project with a separate Maven
invocation. In this case, the consumers of the shaded
jars will see the reduced set of transitive dependencies.
The second is to have the shaded dependencies as modules
with a single Maven invocation. The consumer would see
all of the original transitive dependencies and need to
exclude them all. See MSHADE-206/MNG-5899. This chooses
the second.
This only moves code around and does not focus on version
numbers or making "mvn versions:set" work.
Testing:
- Ran a core job
- Verified existing maven commands from fe/ directory still work
- Compared the *-classpath.txt files from fe and executor-deps
and verified they are the same except for paths
Change-Id: I08773f4f9d7cb269b0491080078d6e6f490d8d7a
Reviewed-on: http://gerrit.cloudera.org:8080/16500
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Some tests (e.g. AnalyzeDDLTest.TestCreateTableLikeFileOrc) depend on
hard-coded file paths of managed tables, assuming that there is always a
file named 'base_0000001/bucket_00000_0' under the table dir. However,
the file name is of the form bucket_${bucket-id}_${attempt-id}, and the
last part is not guaranteed to be 0: if the first attempt fails and the
second attempt succeeds, the file name will be bucket_00000_1.
This patch replaces these hard-coded file paths with paths to
corresponding files that are uploaded to HDFS by commands. For tests
that do need the file paths of managed table files, we do a listing on
the table dir to get the file names instead of hard-coding them.
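A sketch of the listing approach in shell form (the table path and
variable names are illustrative):

  # Discover the actual bucket file instead of assuming attempt id 0:
  TABLE_DIR=/test-warehouse/managed/some_acid_table
  BUCKET_FILE=$(hadoop fs -ls "${TABLE_DIR}/base_0000001/" \
    | grep -o 'bucket_00000_[0-9]*' | head -n 1)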
Also updated chars-formats.orc to contain meaningful column names so it
can be used in more tests; the original file only had names like col0,
col1, col2.
Tests:
- Run CORE tests
Change-Id: Ie3136ee90e2444c4a12f0f2e1470fca1d5deaba0
Reviewed-on: http://gerrit.cloudera.org:8080/16441
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This removes Impala-lzo from the Impala development environment.
Impala-lzo is not built as part of the Impala build. The LZO plugin
is no longer loaded. LZO tables are not loaded during dataload,
and LZO is no longer tested.
This removes some obsolete scan APIs that were only used by Impala-lzo.
With this commit, Impala-lzo would require code changes to build
against Impala.
The plugin infrastructure is not removed, and this leaves some
LZO support code in place. If someone were to decide to revive
Impala-lzo, they would still be able to load it as a plugin
and get the same functionality as before. This plugin support
may be removed later.
Testing:
- Dryrun of GVO
- Modified TestPartitionMetadataUncompressedTextOnly's
test_unsupported_text_compression() to add LZO case
Change-Id: I3a4f12247d8872b7e14c9feb4b2c58cfd60d4c0e
Reviewed-on: http://gerrit.cloudera.org:8080/15814
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Impala 4 moved to using CDP versions for components, which involves
adopting Hive 3. This removes the old code supporting CDH components
and Hive 2. Specifically, it does the following:
1. Remove USE_CDP_HIVE and default to the values from USE_CDP_HIVE=true.
USE_CDP_HIVE now has no effect on the Impala environment. This also
means that bin/jenkins/build-all-flag-combinations.sh no longer
includes USE_CDP_HIVE=false as a configuration.
2. Remove USE_CDH_KUDU and default to getting Kudu from the
native toolchain.
3. Ban IMPALA_HIVE_MAJOR_VERSION<3 and remove related code, including
the IMPALA_HIVE_MAJOR_VERSION=2 maven profile in fe/pom.xml.
There is a fair amount of code that still references the Hive major
version. Upstream Hive is now working on Hive 4, so there is a high
likelihood that we'll need some code to deal with that transition.
This leaves some code (such as maven profiles) and test logic in
place.
Change-Id: Id85e849beaf4e19dda4092874185462abd2ec608
Reviewed-on: http://gerrit.cloudera.org:8080/15869
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The schema file allows specifying a commandline command in
several of the sections (LOAD, DEPENDENT_LOAD, etc). These
are executed by testdata/bin/generate-schema-statements.py
when it is creating the SQL files that are later executed
for dataload. A fair number of tables use this flexibility
to execute hdfs mkdir and copy commands via the command line.
Unfortunately, this is very inefficient. HDFS command line
commands require spinning up a JVM and can take over one
second per command. These commands are executed during a
serial part of dataload, and they can be executed multiple
times. In short, these commands are a significant slowdown
for loading the functional tables.
This converts the hdfs command line statements to equivalent
Hive LOAD DATA LOCAL statements. These do the copy from an
already-running JVM, so they do not need a JVM startup.
They also run in the parallel part of dataload, speeding up
the SQL generation part.
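As an illustration (table, paths, and file names are hypothetical), a
schema-file command such as

  hadoop fs -mkdir -p /test-warehouse/tbl &&
  hadoop fs -put -f ${IMPALA_HOME}/testdata/tbl/data.txt /test-warehouse/tbl/

becomes a statement that Hive executes in-process:

  LOAD DATA LOCAL INPATH '${IMPALA_HOME}/testdata/tbl/data.txt'
  OVERWRITE INTO TABLE tbl;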
This speeds up generate-schema-statements.py significantly.
On the functional dataset, it saves 7 minutes.
Before:
time testdata/bin/generate-schema-statements.py -w functional-query -e exhaustive -f
real 8m8.068s
user 10m11.218s
sys 0m44.932s
After:
time testdata/bin/generate-schema-statements.py -w functional-query -e exhaustive -f
real 0m35.800s
user 0m42.536s
sys 0m5.210s
This is currently a long-pole in dataload, so it translates directly to
an overall speedup of about 7 minutes.
Testing:
- Ran debug tests
Change-Id: Icf17b85ff85618933716a80f1ccd6701b07f464c
Reviewed-on: http://gerrit.cloudera.org:8080/15228
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
As Joe pointed out in IMPALA-9351, it would help in debugging issues
with missing files if we logged the files created while loading the
data. With this commit, running create-load-data.sh logs the created
files into created-files.log.
Change-Id: I4f413810c6202a07c19ad1893088feedd9f7278f
Reviewed-on: http://gerrit.cloudera.org:8080/15234
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In IMPALA-9047, we disabled some Ranger-related FE and BE tests due to
changes in Ranger's behavior after upgrading Ranger from 1.2 to 2.0.
This patch aims to re-enable those disabled FE tests in
AuthorizationStmtTest.java and RangerAuditLogTest.java to increase
Impala's test coverage of authorization via Ranger.
There are at least two major changes in Ranger's behavior in the newer
versions.
1. The first is that the owner of the requested resource no longer has
to be explicitly granted privileges in order to access the resource.
2. The second is that a user not explicitly granted the privilege of
creating a database is able to do so.
Due to these changes, some of the Ranger authorization requests that
were expected to be rejected are now granted after the upgrade.
To re-enable the tests affected by the first change described above, we
modify AuthorizationTestBase.java to allow our FE Ranger authorization
tests to specify the requesting user in an authorization test. Those
tests failed after the upgrade because the default requesting user in
Impala's AuthorizationTestBase.java happens to be the owner of the
resources involved in our FE authorization tests. After this patch, a
requesting user can be either a non-owner user or an owner user in a
Ranger authorization test, and the requesting user defaults to a
non-owner user if it is not explicitly specified. Note that in a Sentry
authorization test, we do not use the non-owner user as the requesting
user by default as we do in the Ranger authorization tests. Instead, we
set the name of the requesting user to a name that is the same as the
owner user in Ranger authorization tests to avoid the need for providing
a customized group mapping service when instantiating a Sentry
ResourceAuthorizationProvider as we do in AuthorizationTest.java, our
FE tests specifically for testing authorization via Sentry.
On the other hand, to re-enable the tests affected by the second change,
we remove from the Ranger policy for all databases the allowed
condition that grants any user the privilege of creating a database,
which is not granted by default in Sentry. After the removal of the
allowed condition, those tests in AuthorizationStmtTest.java and
RangerAuditLogTest.java affected by the second change now result in the
same authorization errors as before the upgrade of Ranger.
Testing:
- Passed AuthorizationStmtTest.java in a local dev environment
- Passed RangerAuditLogTest.java in a local dev environment
Change-Id: I228533aae34b9ac03bdbbcd51a380770ff17c7f2
Reviewed-on: http://gerrit.cloudera.org:8080/14798
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This converts the existing bin/run-all-tests-timeout-check.sh
to a more generic bin/script-timeout-check.sh. It uses this
new script for both bin/run-all-tests.sh and
testdata/bin/create-load-data.sh. The new script takes two
arguments:
-timeout : timeout in minutes
-script_name : name of the calling script
The script_name is used in debugging output / output filenames
to make it clear what timed out.
The run-all-tests.sh timeout remains the same.
testdata/bin/create-load-data.sh uses a 2.5 hour timeout.
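A hypothetical invocation, derived only from the arguments described
above (how the two callers actually wire it in may differ):

  # Watchdog: after 150 minutes (the 2.5 hour dataload timeout), collect
  # diagnostics named after -script_name and kill the caller.
  bin/script-timeout-check.sh -timeout 150 -script_name create-load-data &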
This should help debug the issue in IMPALA-9165, because at
least the logs would be preserved on the Jenkins job.
Testing:
- Tested the timeout script by hand with a caller script that
sleeps longer than the timeout
- Ran a gerrit-verify-dryrun-external
Change-Id: I19d76bd8850c7d4b5affff4d21f32d8715a382c6
Reviewed-on: http://gerrit.cloudera.org:8080/14741
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
testdata/bin/kill-hbase.sh currently uses the generic
kill-java-service.sh script to kill the region servers,
then the master, and then the zookeeper. Recent versions
of HBase become unusable after performing this type of
shutdown. The master seems to get stuck trying to recover,
even after restarting the minicluster.
The root cause in HBase is unclear, but HBase provides the
stop-hbase.sh script, which does a more graceful shutdown.
This switches testdata/bin/kill-hbase.sh to use this script,
which avoids the recovery problems.
Testing:
- Ran the test-with-docker.py tests (which does a minicluster
restart). Before the change, the HBase tests timed out due
to HBase getting stuck recovering. After the change, tests
ran normally.
- Added a minicluster restart after dataload so that this
is tested.
Change-Id: I67283f9098c73c849023af8bfa7af62308bf3ed3
Reviewed-on: http://gerrit.cloudera.org:8080/14697
Reviewed-by: Vihang Karajgaonkar <vihang@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The flakiness may be related to starting Hive queries in parallel,
which triggers initializing Tez resources in parallel (only needed at
the first statement that uses Tez). Running a single statement first,
before going parallel, may solve the issue.
Also includes a fix for a recent issue in 'build-and-copy-hive-udfs'
introduced by the version bump
in https://gerrit.cloudera.org/#/c/14043/
Change-Id: Id21d57483fe7a4f72f450fb71f8f53b3c1ef6327
Reviewed-on: http://gerrit.cloudera.org:8080/14081
Reviewed-by: Vihang Karajgaonkar <vihang@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
This fixes three issues for functional dataset loading:
- works around HIVE-21675, a bug in which 'CREATE VIEW IF NOT EXISTS'
does not function correctly in our current Hive build. This has been
fixed already, but the workaround is pretty simple, and actually the
'drop and recreate' pattern is used more widely for data-loading than
the 'create if not exists' one.
- Moves the creation of the 'hive_index' table from
load-dependent-tables.sql to a new load-dependent-tables-hive2.sql
file which is only executed on Hive 2.
- Moving from MR to Tez execution changed the behavior of data loading
by disabling the auto-merging of small files. With Hive-on-MR, this
behavior defaulted to true, but with Hive-on-Tez it defaults to false.
The change is likely motivated by the fact that Tez automatically
groups small splits on the _input_ side and thus is less likely to
produce lots of small files. However, that grouping functionality
doesn't work properly in localhost clusters (TEZ-3310), so we aren't
seeing the benefit. So, this patch enables the post-process merging of
small files (sketched below).
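A minimal sketch of the merge setting, assuming the standard Hive knob
(hive.merge.tezfiles is the Hive-on-Tez counterpart of
hive.merge.mapfiles; JDBC_URL stands for the minicluster's HiveServer2):

  beeline -u "${JDBC_URL}" -e "
    SET hive.merge.tezfiles=true;
    INSERT OVERWRITE TABLE functional.alltypesaggmultifilesnopart
    SELECT * FROM functional.alltypesaggmultifilesnopart;"

With the setting enabled, Hive adds a post-process stage that
concatenates the many small Tez output files into a few larger ones.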
Prior to this change, the 'alltypesaggmultifilesnopart' test table was
getting 40+ files inside it, which broke various planner tests. With
the change, it gets the expected 4 files.
Change-Id: Ic34930dc064da3136dde4e01a011d14db6a74ecd
Reviewed-on: http://gerrit.cloudera.org:8080/13251
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The statestore update frequency is the limiting factor in most DDL
statements. This change improved the speed of an incremental data load
of the functional dataset by roughly 5-10x on my machine in the case
where data had previously been loaded.
Change-Id: I8931a88aa04e0b4e8ef26a92bfe50a539a3c2505
Reviewed-on: http://gerrit.cloudera.org:8080/13260
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
This sets up the tests to be extensible to test shell
in both beeswax and HS2 modes.
Testing:
* Add test dimension containing only beeswax in preparation
for HS2 dimension.
* Factor out hardcoded ports.
* Add tests for formatting of all types and NULL values.
* Merge date shell test into general type tests.
* Add testing for floating point output formatting, which does change
as a result of switching from client-side to server-side formatting.
* Use unique_database for tests that create tables.
Change-Id: Ibe5ab7f4817e690b7d3be08d71f8f14364b84412
Reviewed-on: http://gerrit.cloudera.org:8080/13083
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The same test data setup scripts get called when loading data for
mini-cluster testing and testing against a real deployed cluster.
Unfortunately, we're seeing more and more that not all setup steps
apply equally in both situations.
This patch avoids one such example. It skips the creation of TPCDS
testcase files that are used by the FE java tests. These tests don't
run against deployed clusters.
Change-Id: Ibe11d7cb50d9e2657152c94f8defcbc69ca7e1ba
Reviewed-on: http://gerrit.cloudera.org:8080/12958
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Previously, the setup-ranger step in create-load-data.sh was
hard-coded with localhost as the host for Ranger. This patch makes it
possible to skip the Ranger setup by using the -skip_ranger flag. The
script was also updated to set the SKIP_RANGER variable when the
REMOTE_LOAD environment variable is set.
Testing:
- Testing was performed by calling the script with and without the
-skip_ranger flag set, as well as with and without the REMOTE_LOAD
environment variable set.
Change-Id: Ie81dda992cf29792468580b182e570132d5ce0a1
Reviewed-on: http://gerrit.cloudera.org:8080/12957
Reviewed-by: Fredy Wijaya <fwijaya@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for GRANT privilege statements to GROUP and
REVOKE privilege statements from GROUP. The grammar has been updated to
support FROM GROUP and TO GROUP for GRANT/REVOKE statements, i.e.:
GRANT <privilege> ON <resource> TO GROUP <group>
REVOKE <privilege> ON <resource> FROM GROUP <group>
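For instance (database and group names are illustrative):

  impala-shell -q "GRANT SELECT ON DATABASE functional TO GROUP analysts"
  impala-shell -q "REVOKE SELECT ON DATABASE functional FROM GROUP analysts"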
Currently, only Ranger's authorization implementation supports GROUP
based privileges. Sentry will throw an UnsupportedOperationException if
it is the enabled authorization provider and this new grammar is used.
Testing:
- AuthorizationStmtTest was updated to also test for GROUP
authorization.
- ToSqlTest was updated to test for GROUP changes to the grammar.
- A GROUP based E2E test was added to test_ranger.py
- ParserTest was updated to test combinations for GrantRevokePrivilege
- AnalyzeAuthStmtsTest was updated to test for USER and GROUP identities
- Ran all FE tests
- Ran authorization E2E tests
Change-Id: I28b7b3e4c776ad1bb5bdc184c7d733d0b5ef5e96
Reviewed-on: http://gerrit.cloudera.org:8080/12914
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch removes support for the authorization_policy_file. When the
flag is passed, the backend will issue a warning message that the flag
is being ignored.
Tests relying on the authorization_policy_file flag have been updated
to rely on the Sentry server instead.
Testing:
- Ran all FE tests
- Ran all E2E tests
Change-Id: Ic2a52c2d5d35f58fbff8c088fb0bf30169625ebd
Reviewed-on: http://gerrit.cloudera.org:8080/12637
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds initial support for Ranger, which can be enabled via
the following flags in both impalad and catalogd to do enforcement:
- ranger_service_type=hive
- ranger_app_id=some_app_id
- authorization_factory_class=\
org.apache.impala.authorization.ranger.RangerAuthorizationFactory
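Putting the flags together, a daemon start-up could look like this
(flag values as listed above; the rest of the command line is omitted
for brevity):

  impalad \
    --ranger_service_type=hive \
    --ranger_app_id=some_app_id \
    --authorization_factory_class=org.apache.impala.authorization.ranger.RangerAuthorizationFactory

The same three flags are passed to catalogd.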
The Ranger plugin for Impala uses the Hive service definition to allow
sharing Ranger policies between Hive and Impala. For now, the REFRESH
privilege uses the "read" access type; this will be updated in a later
patch once Ranger supports a "refresh" access type.
There's a change in the DESCRIBE <table> privilege requirement: it now
uses the ANY privilege instead of the VIEW_METADATA privilege as the
first-level check, to play nicely with Ranger. This is not a security
risk, since the column-level filtering logic after the first-level
check still uses the VIEW_METADATA privilege to filter out unauthorized
column access. In other words, DESCRIBE <table> may return an empty
result instead of an authorization error as long as the user holds any
privilege on the given table.
This patch updates AuthorizationStmtTest with a parameterized test that
runs the tests against Sentry and Ranger.
Testing:
- Updated AuthorizationStmtTest with Ranger
- Ran all FE tests
- Ran all E2E authorization tests
Change-Id: I8cad9e609d20aae1ff645c84fd58a02afee70276
Reviewed-on: http://gerrit.cloudera.org:8080/12632
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
We've already supported reading primitive types from ORC files
(IMPALA-5717). In this patch we add support for complex types
(struct/array/map).
In IMPALA-5717, we leveraged the ORC lib to parse ORC binaries (data in
io buffers read from DiskIoMgr). The ORC lib materializes ORC column
binaries into its own representation (orc::ColumnVectorBatch), and we
then transform the values in orc::ColumnVectorBatch into impala::Tuples
in hdfs-orc-scanner. We don't need to do anything about
decoding/decompression since those are handled by the ORC lib.
Fortunately, the ORC lib already supports complex types, so we can
leverage it for complex types as well.
What we need to add in IMPALA-6503 are two things:
1. Specify which nested columns we need in the form required by the ORC
lib (Get list of ORC type ids from tuple descriptors)
2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into
Impala's representation (Slots/Tuples/RowBatches)
To perform this materialization, we implement several ORC column
readers in hdfs-orc-scanner. Each kind of reader handles one column
type and transforms the ORC lib's output into tuple/slot values.
Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.
* Run exhaustive tests in DEBUG and RELEASE builds.
Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Reviewed-on: http://gerrit.cloudera.org:8080/12168
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Implements a new testcase builder for simulating query plans
from one cluster on a different cluster/minicluster with a
different number of nodes. The testcase is collected from one
cluster and can be replayed on any other cluster. It includes
all the information that is needed to replay the query plan
exactly as in the source cluster.
Also adds a stand-alone tool (PlannerTestCaseLoader) that can
replay the testcase without having to start an actual cluster
or a dev minicluster. This is done to make testcase debugging
simpler.
Motivation:
----------
- Make query planner issues easily reproducible
- Improve user experience while collecting query diagnostics
- Make it easy to test new planner features by exercising them on
customer use cases collected from much larger clusters.
Commands:
--------
-- Collect testcase for a query stmt (outputs the testcase file path).
impala-shell> COPY TESTCASE TO <hdfs dirpath> <query stmt>
-- Load the testcase metadata in a target cluster (dumps the query stmt)
impala-shell> COPY TESTCASE FROM <hdfs testcase file path>
-- Replay the query plan
impala-shell> SET PLANNER_DEBUG_MODE=true
impala-shell> EXPLAIN <query stmt>
How it works
------------
- During export on the source cluster, the command dumps all the thrift
states of referenced objects in the query into a gzipped binary file.
- During replay on a target cluster, it adds these objects to the catalog
cache by faking them as DDLs.
- The planner also fakes the number of hosts by using the scan range
information from the target cluster.
Caveats:
------
- Tested to work with HDFS tables. Tables based on other filesystems like
HBase/Kudu may not work as desired.
- The tool does not collect actual data files for the tables. Only the
metadata state is dumped.
- Currently only imports databases/tables/views. We can extend it to
work for UDFs etc.
- It only works for QueryStmts (select/union queries).
- On a Sentry-enabled cluster, the role running the query requires the
VIEW_METADATA privilege on every db/table/view referenced in the query
statement.
- Once the metadata dump is loaded on a target cluster, the state is
volatile; it does not survive a cluster restart or an INVALIDATE
METADATA.
- Loading a testcase requires setting the query option (SET
PLANNER_DEBUG_MODE=true) so that the planner knows to fake the number
of hosts. Otherwise it takes into account the local cluster topology.
- Cross version compatibility of testcases needs some thought. For
example, creating a testcase from Impala version 3.2 and trying to
replay it on Impala version 3.5. This could be problematic if we don't
keep the underlying thrift structures backward compatible.
Change-Id: Iec83eeb2dc5136768b70ed581fb8d3ed0335cb52
Reviewed-on: http://gerrit.cloudera.org:8080/12221
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch does the work to load data and run some end-to-end
query tests on a dockerised cluster. Changes were required
in start-impala-cluster.py/ImpalaCluster and in some configuration
files.
ImpalaCluster is used for various things, including discovering
service ports and testing for cluster readiness. This patch adds
basic support and uses it from start-impala-cluster.py to check
for cluster readiness. Some logic is moved from
start-impala-cluster.py to ImpalaCluster.
Limitations:
* We're fairly inconsistent about whether services listen only on
a single interface (e.g. loopback, traditionally) or whether it
listens on all interfaces. This doesn't fix all of those issues.
E.g. HDFS datanodes listen on all interfaces to work around
some issues.
* Many tests don't pass yet, particularly those using
ImpalaCluster(), which isn't initialised with the appropriate
docker arguments.
Testing:
Did a full data load locally using a dockerised Impala cluster:
START_CLUSTER_ARGS="--docker_network=impala-cluster" \
TEST_START_CLUSTER_ARGS="--docker_network=impala-cluster" \
./buildall.sh -format -testdata -ninja -notests -skiptests -noclean
Ran a selection of end-to-end tests touching HDFS, Kudu and HBase
tables after I loaded data locally.
Ran exhaustive tests with non-dockerised impala cluster.
Change-Id: I98fb9c4f5a3a3bb15c7809eab28ec8e5f63ff517
Reviewed-on: http://gerrit.cloudera.org:8080/12189
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Dataload has a step of "Loading Hive builtins" that
loads a bunch of jars into HDFS/S3/etc. Despite
its name, nothing seems to be using these.
Dataload and core tests succeed without this step.
This removes the Hive builtins step and associated
scripts.
Change-Id: Iaca5ffdaca4b5506e9401b17a7806d37fd7b1844
Reviewed-on: http://gerrit.cloudera.org:8080/11944
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change ensures that the planner computes Parquet conjuncts only
for scans containing Parquet files. Additionally, it also handles the
PARQUET_DICTIONARY_FILTERING and PARQUET_READ_STATISTICS query options
in the planner.
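Both options can be exercised per-query; an illustrative check of the
resulting plan (the table is a placeholder from the functional test
schema):

  impala-shell -q "SET PARQUET_DICTIONARY_FILTERING=false;
    SET PARQUET_READ_STATISTICS=false;
    EXPLAIN SELECT count(*) FROM functional_parquet.alltypes;"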
Testing was carried out independently on parquet and non-parquet
scans:
1. Parquet scans were tested via the existing parquet-filtering
planner test. Additionally, a new test
[parquet-filtering-disabled] was added to ensure that the
explain plan generated skips parquet predicates based on the
query options.
2. Non-parquet scans were tested manually to ensure that the
functions to compute parquet conjuncts were not invoked.
Additional test cases were added to the parquet-filtering
planner test to scan non parquet tables and ensure that the
plans do not contain conjuncts based on parquet statistics.
3. A parquet partition was added to the alltypesmixedformat
table in the functional database. Planner tests were added
to ensure that Parquet conjuncts are constructed only when
the Parquet partition is included in the query.
Change-Id: I9d6c26d42db090c8a15c602f6419ad6399c329e7
Reviewed-on: http://gerrit.cloudera.org:8080/10704
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In this patch we add a query option, ALLOW_ERASURE_CODED_FILES, that
allows us to enable or disable support for erasure-coded files. Even
though Impala should already be able to handle HDFS erasure-coded
files, this feature hasn't been tested thoroughly yet. Also, Impala
lacks metrics, observability and DDL commands related to erasure
coding. This is a query option instead of a startup flag because we
want to make it possible for advanced users to enable the feature.
We may also need a follow-on patch to disable the write path with this
flag as well.
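An illustrative per-query toggle (the table name is a placeholder):

  impala-shell -q "SET ALLOW_ERASURE_CODED_FILES=true;
    SELECT count(*) FROM functional.alltypes;"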
Cherry-picks: not for 2.x
Change-Id: Icd3b1754541262467a6e67068b0b447882a40fb3
Reviewed-on: http://gerrit.cloudera.org:8080/10646
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
As part of IMPALA-3307, we copy a time-zone database into HDFS. This
command fails on the local filesystem due to a missing
FILESYSTEM_PREFIX.
This adds FILESYSTEM_PREFIX to the command.
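A sketch of the fix's shape (archive name and target path are
assumptions):

  # Unprefixed: works only when the default filesystem is HDFS.
  hadoop fs -put tzdb.zip /test-warehouse/tzdb/
  # Prefixed: also works on local filesystem builds.
  hadoop fs -put tzdb.zip "${FILESYSTEM_PREFIX}/test-warehouse/tzdb/"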
Change-Id: I972192f22943baef6043a4c9db54d5d48089ea9d
Reviewed-on: http://gerrit.cloudera.org:8080/10803
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.
Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
changes.
- Time-zone database is not updated on a regular basis.
Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
performance degradation.
In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.
This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
time-zone conversions.
- Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to
specify an HDFS/S3/ADLS path to a zip archive that contains the
shared compiled IANA time-zone database. If the startup flag is set,
impalad will use the specified time-zone database. Otherwise,
impalad will use the default /usr/share/zoneinfo time-zone database.
- Introduces a new startup flag (--hdfs_zone_alias_conf) to impalad to
specify an HDFS/S3/ADLS path to a shared config file that contains
definitions for non-standard time-zone aliases. (An illustrative
invocation combining both flags follows this list.)
- impalad reads the entire time-zone database into an in-memory
map on startup for fast lookups.
- The name of the coordinator node’s local time-zone is saved to the
query context when preparing query execution. This time-zone is used
whenever the current time-zone is referred afterwards in an
execution node.
- Adds a new ZipUtil class to extract files from a zip archive. The
implementation is not vulnerable to Zip Slip.
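Putting the new flags together (flag names as introduced above; the
paths are illustrative):

  impalad --hdfs_zone_info_zip=hdfs:///test-warehouse/tzdb/tzdata.zip \
    --hdfs_zone_alias_conf=hdfs:///test-warehouse/tzdb/aliases.conf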
Cherry-picks: not for 2.x.
Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Reviewed-on: http://gerrit.cloudera.org:8080/9986
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Attila Jeges <attilaj@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The minicluster args for dataload changed to a bash array in
IMPALA-7119, and this requires a special syntax to dereference and get
the whole array.
This fixes the invocation to use the right syntax ("${BASH_VAR[@]}"
rather than $BASH_VAR).
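The difference as a minimal sketch (the variable name is illustrative):

  MINICLUSTER_ARGS=(-format_cluster -s 3)
  echo $MINICLUSTER_ARGS            # first element only: -format_cluster
  echo "${MINICLUSTER_ARGS[@]}"     # all elements: -format_cluster -s 3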
Change-Id: Ie9a24c0e9fa34e43697b16b48cf219f47f30c0cc
Reviewed-on: http://gerrit.cloudera.org:8080/10782
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
After loading data, we wait for HDFS to replicate
all of the blocks appropriately. If this takes too long,
we restart HDFS. However, HBase can fail if HDFS is
restarted and HBase is unable to write its logs.
In general, there is no real reason to keep HBase
and the other minicluster components running while
restarting HDFS.
This changes the HDFS health check to restart the
whole minicluster and Impala rather than just HDFS.
Testing:
- Tested with a modified version that always does
the restart in the HDFS health check and verified
that the tests pass
Change-Id: I58ffe301708c78c26ee61aa754a06f46c224c6e2
Reviewed-on: http://gerrit.cloudera.org:8080/10665
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Some frontend PlannerTests rely on HBase tables being
arranged in a deterministic way. Specifically, the
HBase tables need to be split with specific region
boundaries and those regions need to be assigned to
specific HBase region servers.
Currently, the tables are created without splits and
testdata/bin/split-hbase.sh runs Java code in
HBaseTestDataRegionAssignment to split and assign
the tables. This runs during dataload via
testdata/bin/create-load-data.sh and during tests
with bin/run-all-tests.sh. There are problems with
both parts of this process. The table splitting is
flaky. Since significant time can pass between the
assignments and the tests, rebalancing means the
assignments are not always stable.
This changes the process so that the HBase tables are
created with the splits already specified via the
HBase shell. The splits remain stable over time.
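A sketch of a pre-split create from the HBase shell (table name, column
family, and split keys are illustrative):

  echo "create 'functional_hbase.alltypesagg', 'd', SPLITS => ['3', '7']" \
    | hbase shell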
PlannerTestBase runs the assignment code in
HBaseTestDataRegionAssignment at the start of
the PlannerTests. This makes the assignments
deterministic. No other test depends on the
exact assignments, so this does not regress anything.
Testing:
- Local testing
- Ran gerrit-verify-dryrun-external
- Verified minicluster profile 2 compiles
Change-Id: I3d639128a856254a6ccb93d6750f531974b5f897
Reviewed-on: http://gerrit.cloudera.org:8080/10447
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
HBase splitting can fail due to changes in HBase code. It
is useful to still do tests even if HBase splitting failed.
As it is today, buildall.sh will abort if
create-load-data.sh's invocation of split-hbase.sh fails.
No tests run, even though the HBase splitting affects only
a small portion of our tests.
This changes create-load-data.sh to keep going with
dataload if HBase splitting fails. It outputs the same
errors to the log as it would before this change.
It adds a message to explain that it is ignoring
the failure and there may be related test failures.
Change-Id: I7497fe8c9f1655a34b2743462d8b7248eb94554e
Reviewed-on: http://gerrit.cloudera.org:8080/10437
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Erasure coding data loading is flaky in two ways:
1. HBase sometimes doesn't work because of HBASE-19369.
2. Nested data loading sometimes fails because the HDFS namenode cannot
find enough good datanodes.
For problem 1, this patch enables erasure coding only on the
/test-warehouse directory. For problem 2, this patch sets
dfs.namenode.redundancy.considerLoad to false, preventing the namenode
from excluding heavily-loaded datanodes.
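A sketch of scoping EC to one directory with the standard HDFS CLI (the
policy name is an assumption; RS-3-2-1024k is a built-in HDFS policy):

  hdfs ec -enablePolicy -policy RS-3-2-1024k
  hdfs ec -setPolicy -path /test-warehouse -policy RS-3-2-1024k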
Change-Id: I219106cd3ec7ffab7a834700f2a722b165e5f66c
Reviewed-on: http://gerrit.cloudera.org:8080/10362
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
There is a Hive bug in Hive 1.1.0 that can result
in a NullPointerException when doing parallel Hive
operations (see IMPALA-6532). Since dataload goes
parallel on Hive loads starting with IMPALA-6372,
dataload can hit this error on Hive 1.1.0 (i.e.
IMPALA_MINICLUSTER_PROFILE=2). This is impacting
builds on the 2.x branch.
This disables parallel dataload for IMPALA_MINICLUSTER_PROFILE=2.
IMPALA_MINICLUSTER_PROFILE=3 uses a newer version
of Hive that has a fix for this, so this continues
to use parallel dataload for that case.
Parallelism can be reenabled when Hive 1.1.0 gets the
fix from Hive 2.1.1.
Change-Id: I90a0f2b3756d7192fa7db2958031b8c88eb606e6
Reviewed-on: http://gerrit.cloudera.org:8080/10306
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In this patch we add the "ERASURE_CODING" environment variable. If we
enable it, a cluster with 5 data nodes will be created during data
loading and HDFS will be started with erasure coding enabled.
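A hypothetical way to exercise it (the buildall flags follow the
pattern used elsewhere in these commits):

  ERASURE_CODING=true ./buildall.sh -format -testdata -notests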
Testing:
I ran the core build, and verified that erasure coding gets enabled in
HDFS. Many of our EE tests failed however.
Cherry-picks: not for 2.x
Change-Id: I397aed491354be21b0a8441ca671232dca25146c
Reviewed-on: http://gerrit.cloudera.org:8080/10275
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
HDFS commandline calls can be expensive due to JVM
startup and other costs. Since most HDFS commandline
calls can take multiple paths, one way to reduce
execution time is to consolidate multiple HDFS
commands into a single HDFS call. Since HDFS put
commands follow symbolic links and can copy
recursively, further consolidation is possible by
creating the full directory structure locally and
copying it in a single HDFS call.
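A minimal sketch of the consolidation (paths and file names are
illustrative):

  # Before: one JVM startup per command.
  hadoop fs -mkdir -p /test-warehouse/schemas/a
  hadoop fs -put a.avsc /test-warehouse/schemas/a/
  hadoop fs -mkdir -p /test-warehouse/schemas/b
  hadoop fs -put b.avsc /test-warehouse/schemas/b/
  # After: build the tree locally (symlinks are followed, per the text
  # above) and copy everything with one recursive put.
  mkdir -p staging/a staging/b
  ln -s "${PWD}/a.avsc" staging/a/
  ln -s "${PWD}/b.avsc" staging/b/
  hadoop fs -put staging/* /test-warehouse/schemas/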
This does several of these optimizations throughout
the dataload codepath. It saves a few seconds here
and there:
Loading Hive Builtins: 1:10 -> 0:30
Loading custom schemas: 0:35 -> 0:20
Loading Hive UDFs: 0:45 -> 0:25
Change-Id: I0934353329dc7312394fc4457ab8db2a272c6282
Reviewed-on: http://gerrit.cloudera.org:8080/10120
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
testdata/bin/create-load-data.sh runs bin/load-data.py for
functional/exhaustive, tpch/core, and tpcds/core in a first phase, then
it loads functional and tpch for Kudu in a second phase. For a full
dataload, this second phase is not necessary: functional/exhaustive and
tpch/core already include Kudu.
This avoids the second phase when doing a full dataload.
The second phase is still necessary when loading from
a snapshot, and this does not change that behavior.
This saves a couple minutes off of full dataload.
Change-Id: Ic023d230f99126ed37795106c38faae5f0cb608e
Reviewed-on: http://gerrit.cloudera.org:8080/10128
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies the input needed by the orc-reader, tracks the reader's memory
consumption, and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version used is release-1.4.3.
A startup option --enable_orc_scanner is added for this feature. It's
set to true by default. Setting it to false will fail queries on ORC
tables.
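For instance, disabling it at startup (the invocation shape is
illustrative):

  impalad --enable_orc_scanner=false   # queries on ORC tables will now fail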
Currently, we only support reading primitive types. Writing to ORC
tables is not supported yet either.
Tests:
- Most of the end-to-end tests can run on ORC format.
- Add tpcds, tpch tests for ORC.
- Add some ORC specific tests.
- Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
is not robust against corrupt files (ORC-315).
Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Reviewed-on: http://gerrit.cloudera.org:8080/9134
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In wait-hdfs-replication, the frequent and eager restart might slow the
HDFS replication down. HDFS should be restarted only if no progress is
made in a certain amount of time, and we should wait longer before
failing the data loading.
Testing: It's tested with a fake HDFS fsck script.
Change-Id: Ib059480254643dc032731b4b3c55204a93b61e77
Reviewed-on: http://gerrit.cloudera.org:8080/9698
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
HDFS sometimes fails to fully replicate all the blocks in 30 seconds
and no progress is made. This patch tries to restart HDFS several times
before aborting the data loading.
Change-Id: Iefd4c2fc6c287f054e385de52bdc42b0bdbd7915
Reviewed-on: http://gerrit.cloudera.org:8080/9469
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
When the data loading finishes, it is possible for some HDFS blocks to
be under-replicated. If Impala gets the metadata before the replication
is done, some tests may fail. This patch adds a replication waiting step
in the data loading script.
Resubmitted with filesystem type check.
Change-Id: I64d9a8ea1d0a32b40047321b50a7139a8f48eac8
Reviewed-on: http://gerrit.cloudera.org:8080/8916
Reviewed-by: Vuk Ercegovac <vercegovac@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Using fsck breaks non-HDFS builds: local, S3, and Isilon.
This reverts commit 5a7c10ec3d.
Change-Id: I0b12a42049543ca0b267b5146a0bbcdd2316abfc
Reviewed-on: http://gerrit.cloudera.org:8080/8880
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
When the data loading finishes, it is possible for some HDFS blocks to
be under-replicated. If Impala gets the metadata before the replication
is done, some tests may fail. This patch adds a replication waiting step
in the data loading script.
Change-Id: I88dfb7165b7515b3e96111436be490f2068ec322
Reviewed-on: http://gerrit.cloudera.org:8080/8846
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins