The help output for buildall.sh notes running `buildall.sh -testdata` as
an option to incrementally load test data without formatting the
mini-cluster. However, trying to do that with data already loaded
results in an error from `hadoop fs -mkdir /test-warehouse`. Add
`-p` so this step is idempotent, allowing the example to work as
documented.
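As a sketch, the idempotent form of the failing step:
  # -p makes the mkdir a no-op if /test-warehouse already exists
  hadoop fs -mkdir -p /test-warehouse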
Change-Id: Icc4ec4bb746abf53f6787fce4db493919806aaa9
Reviewed-on: http://gerrit.cloudera.org:8080/18522
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-10723 added support for treating materialized views as tables.
In certain test configurations, the rebuild of the materialized views
(which is done via Hive) was not populating the data in the MV. In
this patch, I have changed the source tables of materialized views
to be full-acid instead of insert-only transactional tables. This
enables the tests to succeed. Insert-only source tables are also
supposed to work for the MV rebuild; the failure there is a Hive
issue that will be investigated separately.
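Roughly, the change in the source-table DDL looks like the following
(table name and connection string are placeholders):
  # previously: insert-only transactional source table
  beeline -u "$JDBC_URL" -e "CREATE TABLE mv_src (i INT) STORED AS ORC
    TBLPROPERTIES ('transactional'='true',
                   'transactional_properties'='insert_only');"
  # now: full-acid transactional source table
  beeline -u "$JDBC_URL" -e "CREATE TABLE mv_src (i INT) STORED AS ORC
    TBLPROPERTIES ('transactional'='true');"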
Change-Id: I349faa0ad36ec8ca6f574f7f92d9a32fb7d0d344
Reviewed-on: http://gerrit.cloudera.org:8080/18421
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Aman Sinha <amsinha@cloudera.com>
The existing behavior is that materialized views are treated
as views and therefore expanded like a view when one
queries the MV directly (SELECT * FROM materialized_view).
This is incorrect since an MV is a regular table with physical
properties such as partitioning, clustering etc. and should be
treated as such even though it has a view definition associated
with it.
This patch focuses on the use case where MVs are created as
HDFS tables and makes MVs a derived class of HdfsTable,
therefore making them Table objects. It adds support for
collecting and displaying statistics on materialized views
and these statistics could be leveraged by an external frontend
that supports MV based query rewrites (note that such a rewrite
is not supported by Impala with or without this patch). Note
that we are not introducing new syntax for MVs since DDL and DML
operations on MVs are only supported through Hive.
Directly querying an MV is permitted, but inserting into MVs is
not, since MVs are supposed to be modified only through an
external refresh when the source tables change.
If the source tables associated with a materialized view have
column masking or row-filtering Ranger policies, querying the
MV will throw an error. This behavior is consistent with that
of Hive.
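As a sketch, the newly supported statements look like their
plain-table counterparts (MV name hypothetical):
  impala-shell -q "COMPUTE STATS test_db.test_mv"
  impala-shell -q "SHOW TABLE STATS test_db.test_mv"
  impala-shell -q "DROP STATS test_db.test_mv"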
Testing:
- Added transactional tables for alltypes, jointbl and used
them as source tables to create materialized views.
- Added tests for compute stats, drop stats, show stats and
simple select query on a materialized view.
- Added test for select on a materialized view when the
source table has a column mask.
- Modified analyzer tests related to alter, insert, drop of
materialized view.
Change-Id: If3108996124c6544a97fb0c34b6aff5e324a6cff
Reviewed-on: http://gerrit.cloudera.org:8080/17595
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
This patch changes the line that added all the jar files in the folder
${RANGER_HOME}/ews/webapp/WEB-INF/lib to HADOOP_CLASSPATH so that it
only includes the jar files whose names start with "ranger-". Almost
none of the other jar files appear to be necessary to run the E2E test
test_hive_with_ranger_setup.
This also avoids adding too many paths to HADOOP_CLASSPATH, which could
otherwise prevent Hadoop from reporting its version to the script that
starts HMS, failing with "Argument list too long".
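A minimal sketch of the narrowed classpath logic (the exact variable
wiring in the script may differ):
  # append only the jars whose names start with "ranger-"
  for jar in "${RANGER_HOME}"/ews/webapp/WEB-INF/lib/ranger-*.jar; do
    HADOOP_CLASSPATH="${HADOOP_CLASSPATH}:${jar}"
  done
  export HADOOP_CLASSPATH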
Testing:
- Verified after this patch, test_hive_with_ranger_setup still
succeeds.
- Verified in a local development environment that the length of
Hadoop's environment variable 'CLASSPATH' logged in
hive-metastore.out decreases from 100,876 characters to 62,634
characters when executing run-hive-server.sh with the flag
'-with_ranger' if $HADOOP_SHELL_SCRIPT_DEBUG is "true" and
$IMPALA_HOME is "/home/fangyurao/Impala_for_FE".
Change-Id: Ifd66fd99a346835b9f81f95b5f046273fcce2590
Reviewed-on: http://gerrit.cloudera.org:8080/18398
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
We use 'mapred.output.compression.codec' to set the compression codec in
generating test files by Hive. However, it doesn't affect ORC files.
Instead, we need to set 'orc.compress' in tblproperties for each ORC
table. The default value of 'orc.compress' is ZLIB, which corresponds
to our 'def' codec, so we only need to set it for non-def codecs.
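A hedged sketch of the per-table property (table name hypothetical;
'orc.compress' also accepts NONE, ZLIB, and LZ4):
  beeline -u "$JDBC_URL" -e "CREATE TABLE t_orc_snap (i INT)
    STORED AS ORC TBLPROPERTIES ('orc.compress'='SNAPPY');"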
This patch also fixes a bug in build_compression_codec_statement() that
would raise KeyError when loading lz4 non-avro tables.
Tests
- Loaded tpch data in orc/none/none, orc/def/block, orc/snap/block,
orc/lz4/block and verified their compression codecs.
Change-Id: I02bd5d9400864145133ff019a3d076a6cab36fcc
Reviewed-on: http://gerrit.cloudera.org:8080/18228
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Like IMPALA-8369, this patch adds a compatibility shim in fe so that
Impala can interoperate with Hive 3.1.2. It adds a new MetastoreShim
class under the compat-apache-hive-3 directory. These shim classes,
used by frontend code, implement the methods that differ between
cdp-hive-3 and apache-hive-3. At build time, based on the environment
variable IMPALA_HIVE_DIST_TYPE, one of the two shims is added as a
source directory by the fe/pom.xml build plugin.
Some code that directly uses Hive 4 APIs needs to be excluded from
compilation, e.g. fe/src/main/java/org/apache/impala/catalog/metastore/.
A Maven profile excludes this code and is activated automatically based
on IMPALA_HIVE_DIST_TYPE.
Testing:
1. Code compiles and runs against both HMS-3 and ASF-HMS-3
2. Ran full-suite of tests against HMS-3
3. Running full-tests against ASF-HMS-3 will need more work
supporting Tez in the mini-cluster (for dataloading) and HMS
transaction support. This will be on-going effort and test failures
on ASF-Hive-3 will be fixed in additional sub-tasks.
Notes:
1. Patch uses a custom build of Apache Hive to be deployed in
mini-cluster. This build has the fixes for HIVE-21569, HIVE-20038.
This hack will be added to the build script in additional sub-tasks.
Change-Id: I9f08db5f6da735ac431819063060941f0941f606
Reviewed-on: http://gerrit.cloudera.org:8080/17774
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
load-functional-query-exhaustive-hbase-generated.create failed with
the newer HBase shell shipped in version 2.4.6, which throws the
following error at the end of script execution:
ERROR NoMethodError: private method `exit' called for nil:NilClass
Since we run the script in a non-interactive way, it is safe to
remove the trailing "exit" command from the script.
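For context, the generated script is fed to the shell
non-interactively, so end-of-input already terminates it and the
trailing 'exit' is redundant:
  hbase shell load-functional-query-exhaustive-hbase-generated.create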
Testing:
- Complete data loading without error.
- Pass core tests.
Change-Id: I9185704b02c51c7a9cb9aa7fd2a7d1103c8b7cbb
Reviewed-on: http://gerrit.cloudera.org:8080/18079
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for COS (Cloud Object Storage). Using the
hadoop-cos connector, the implementation is similar to that of other
remote FileSystems.
New flags for COS:
- num_cos_io_threads: Number of COS I/O threads. Defaults to 16.
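A hedged example of overriding the default when starting a test
cluster:
  # raise the COS I/O thread count from the default of 16
  bin/start-impala-cluster.py --impalad_args=--num_cos_io_threads=32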
Follow-up:
- Support for caching COS file handles will be addressed in
IMPALA-10772.
- test_concurrent_inserts and test_failing_inserts in
test_acid_stress.py are skipped due to slow file listing on
COS (IMPALA-10773).
Tests:
- Upload hdfs test data to a COS bucket. Modify all locations in HMS
DB to point to the COS bucket. Remove some hdfs caching params.
Run CORE tests.
Change-Id: Idce135a7591d1b4c74425e365525be3086a39821
Reviewed-on: http://gerrit.cloudera.org:8080/17503
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Recent Hive releases seem to be enforcing that data for a managed table
is stored under the hive.metastore.warehouse.dir path property in a
folder path similar to databasename.db/tablename - see
https://cwiki.apache.org/confluence/display/Hive/Managed+vs.+External+Tables
This patch uses the form /test-warehouse/managed/databasename.db in
generate-schema-statements.py when creating transactional tables.
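The resulting layout for a transactional table, sketched with a
hypothetical name:
  /test-warehouse/managed/functional_orc_def.db/alltypes/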
Testing:
- A few small changes to tests that verify filesystem changes for acid
tables.
- Exhaustive tests pass.
Change-Id: Ib870ca802c9fa180e6be7a6f65bef35b227772db
Reviewed-on: http://gerrit.cloudera.org:8080/18046
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch bumps the GBN to 15549253. It includes Fang-Yu's fix to use
the correct policy id when updating the "all - database" policy, which
is needed due to a change on the Ranger side.
Testing:
* ran the create-load-data.sh
Change-Id: Ie7776e62dad0b9bec6c03fb9ee8f1b8728ff0e69
Reviewed-on: http://gerrit.cloudera.org:8080/17746
Reviewed-by: Vihang Karajgaonkar <vihang@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
We found that setup-ranger continues execution even when an error
occurs while i) running wget to initialize the GROUP_ID_OWNER and
GROUP_ID_NON_OWNER environment variables, or ii) running curl to
upload a revised Ranger policy, despite create-load-data.sh being
executed with the -e option set. This patch improves the error
handling by making setup-ranger exit as soon as an error occurs, so
that no tests are run at all when setup fails.
To exit if an error occurs during wget, we separate the assignment
from the export of each environment variable: when the two are
combined, the export runs and succeeds regardless of an error in the
assignment. That is, combining them hides the error.
On the other hand, to exit if an error occurs during curl, we add an
additional -f option so that an error will no longer be silently
ignored.
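A minimal sketch of both fixes (URLs abbreviated as placeholders):
  # the combined form (export VAR=$(wget ...)) would succeed even if
  # wget fails; separating the two lets 'set -e' abort on the failure
  GROUP_ID_OWNER=$(wget -q -O - "$RANGER_URL_OWNER")
  export GROUP_ID_OWNER
  # -f makes curl exit non-zero on an HTTP error instead of silently
  # printing the error body
  curl -f -u admin:admin -X PUT -H "Content-Type: application/json" \
    -d @policy_4_revised.json "$RANGER_POLICY_URL"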
Testing:
- Verified that setup-ranger could be successfully executed after this
patch.
- Verified that setup-ranger would exit if a URL in setup-ranger is not
correctly set up or if the 'id' field in policy_4_revised.json does
not match the URL of the policy to be updated.
Change-Id: I45605d1a7441b734cf80249626638cde3adce28b
Reviewed-on: http://gerrit.cloudera.org:8080/17386
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this patch, when we started the Hive Metastore or the HiveServer2
via run-hive-server.sh, hive-metastore.out and hive-server2.out would be
generated, respectively. The output of executing the hive command was
redirected to the corresponding files mentioned above. However, since
">" was used to redirect the output, the previously generated files
would be overwritten every time during the restart and thus some
important error message may be lost.
To collect more information regarding the restart, this patch appends
the newly produced output to hive-metastore.out and hive-server2.out
during the restart, which makes it easier for developers to troubleshoot
issues related to the restart.
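The change amounts to switching the redirection operator (command and
paths abbreviated):
  # before: > truncated the log on every restart
  # after: >> appends, preserving output from earlier runs
  hive --service metastore ... >> "$LOG_DIR/hive-metastore.out" 2>&1 &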
Change-Id: I2efdbcf886e2d32ccf8c7eef038360884e44f216
Reviewed-on: http://gerrit.cloudera.org:8080/17642
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently, the dataload scripts don't respect non-standard
compression codecs when loading Parquet data. They always
load Snappy, even when something else is specified, such as
--table_format=parquet/zstd.
This fixes the dataload scripts so that they specify the
compression_codec query option correctly and thus use the
right codec when loading Parquet.
For backwards compatibility, this preserves the behavior
that parquet/none corresponds to the default compression
codec (which is Snappy).
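A sketch of what the scripts now do for a non-default codec (names
hypothetical; for parquet/none the option is simply left unset):
  impala-shell -q "SET compression_codec=zstd;
    INSERT OVERWRITE TABLE tpch_parquet_zstd.lineitem
    SELECT * FROM tpch.lineitem;"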
This should make it easier to do performance testing on
various Parquet codecs (like ZSTD).
Testing:
- Ran bin/load-data.py -w tpch --table_format=parquet/zstd
and checked the codec in the file with the parquet-reader
utility
Change-Id: I1a346de3e5c4e38328e5a8ce8162697b7dd6553a
Reviewed-on: http://gerrit.cloudera.org:8080/17259
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Streamline triaging a bit. When this fails, it does so in a specific
location, and until now you had to scan the build log to find the
problem. This JUnitXML symptom should make this failure mode obvious.
Tested by running an S3 build on private infrastructure with a
deliberately mismatched data snapshot.
Change-Id: I2fa193740a2764fdda799d6a9cc64f89cab64aba
Reviewed-on: http://gerrit.cloudera.org:8080/17242
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for GCS (Google Cloud Storage). Using the
gcs-connector, the implementation is similar to that of other remote
FileSystems.
New flags for GCS:
- num_gcs_io_threads: Number of GCS I/O threads. Defaults to 16.
Follow-up:
- Support for spilling to GCS will be addressed in IMPALA-10561.
- Support for caching GCS file handles will be addressed in
IMPALA-10568.
- test_concurrent_inserts and test_failing_inserts in
test_acid_stress.py are skipped due to slow file listing on
GCS (IMPALA-10562).
- Some tests are skipped due to issues introduced by /etc/hosts setting
on GCE instances (IMPALA-10563).
Tests:
- Compile and create hdfs test data on a GCE instance. Upload test data
to a GCS bucket. Modify all locations in HMS DB to point to the GCS
bucket. Remove some hdfs caching params. Run CORE tests.
- Compile and load snapshot data to a GCS bucket. Run CORE tests.
Change-Id: Ia91ec956de3b620cccf6a1244b56b7da7a45b32b
Reviewed-on: http://gerrit.cloudera.org:8080/17121
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch enables the Impala frontend jar and dependent library
libfesupport.so to be used by an external Java frontend.
Calling FeSupport.setExternalFE() will cause external frontend
initialization mode to be used during FeSupport.loadLibrary(). This
mode builds upon logic that is used to initialize the frontend jar for
unit tests.
Initialization in external frontend mode differs as follows:
- Skip instantiating Frontend object and its dependents
- Skip loading libhdfs
- Skip starting JVM Pause monitor
- Disable Minidumper
- Initialize TimezoneDatabase for external frontends
- Disable redirect of stderr/stdout to libfesupport.so glog
- Log messages from libfesupport.so to stderr
- Use libfesupport.so for JNI symbol look up
Null checks were added in places where objects were previously
assumed to be instantiated but are now skipped during initialization.
Additional change:
1) Add libfesupport.lib path to JAVA_LIBRARY_PATH in test driver
Testing:
- Initialized frontend jar from external frontend
- Verified that frontend Java objects can be used externally without
issues
- Verified that exceptions thrown from Impala Java or libfesupport
can be caught or propagated correctly by the external frontend
- Manual verification of minicluster logs
- Ran queries with external frontend
Co-authored-by: John Sherman <jfs@cloudera.com>
Co-authored-by: Aman Sinha <amsinha@cloudera.com>
Change-Id: I4e3a84721ba196ec00773ce2923b19610b90edd9
Reviewed-on: http://gerrit.cloudera.org:8080/17115
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
As a follow-up to IMPALA-10314, it is sometimes useful to consider
a simple limit as a way to sample from a table if a relevant hint
has been provided. Doing a sample instead of pure limit serves
dual purposes: (a) it still helps with reducing the planning time
since the scan ranges need be computed only for the sample files,
(b) it allows sufficient number of files/rows to be read from
the table such that after applying filter conditions or joins with
another table, the query may still produce the N rows needed for
limit.
This functionality is especially useful if the query is against a
view. Note that a TABLESAMPLE clause cannot be applied to a view, and
embedding a TABLESAMPLE explicitly on a table within a view will
not work because we don't want to sample if there's no limit.
In this patch, a new table level hint, 'convert_limit_to_sample(n)'
is added. If this hint is attached to a table either in the main
query block or within a view/subquery and simple limit optimization
conditions are satisfied (according to IMPALA-10314), the limit
is converted to a table sample. The parameter 'n' in parenthesis is
required and specifies the sample percentage. It must be an integer
between 1 and 100. For example:
set optimize_simple_limit = true;
CREATE VIEW v1 as SELECT * FROM T [convert_limit_to_sample(5)]
WHERE [always_true] <predicate>;
SELECT * FROM v1 LIMIT 10;
In this case, the limit 10 is applied on top of a 5 percent sample
of T which is applied after partition pruning.
Testing:
- Added an alltypes_date_partition_2 table where the date and
timestamp values match (this helps with setting the
'always_true' hint).
- Added views with 'convert_limit_to_sample' and 'always_true'
hints and added new tests against the views. Modified a few
existing tests to reference the new table variant.
- Added an end-to-end test.
Change-Id: Ife05a5343c913006f7659949b327b63d3f10c04b
Reviewed-on: http://gerrit.cloudera.org:8080/16792
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This changes all existing Java code to be submodules under
a single root pom. The root pom is impala-parent/pom.xml
with minor changes to add submodules.
This avoids most of the weird CMake/maven interactions,
because there is now a single maven invocation for all
the Java code.
This moves all the Java projects other than fe into
a top level java directory. fe is left where it is
to avoid disruption (but still is compiled via the
java directory's root pom). Various pieces of code
that reference the old locations are updated.
Based on research, there are two options for dealing
with the shaded dependencies. The first is to have an
entirely separate Maven project with a separate Maven
invocation. In this case, the consumers of the shaded
jars will see the reduced set of transitive dependencies.
The second is to have the shaded dependencies as modules
with a single Maven invocation. The consumer would see
all of the original transitive dependencies and need to
exclude them all. See MSHADE-206/MNG-5899. This chooses
the second.
This only moves code around and does not focus on version
numbers or making "mvn versions:set" work.
Testing:
- Ran a core job
- Verified existing maven commands from fe/ directory still work
- Compared the *-classpath.txt files from fe and executor-deps
and verified they are the same except for paths
Change-Id: I08773f4f9d7cb269b0491080078d6e6f490d8d7a
Reviewed-on: http://gerrit.cloudera.org:8080/16500
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
HIVE-24145 still causes data load failures quite frequently. The failure
usually occurs during TPC-DS loading. I modified
generate-schema-statements.py to only load ORC tables as full ACID in
the 'functional-query' workload. Since this workload contains the
ACID-specific tests, we should still have enough coverage for ORC/ACID
testing.
Testing
* Ran exhaustive tests successfully
Change-Id: I0c81aedd3be314819dc4bc5bebec17bad3d03b10
Reviewed-on: http://gerrit.cloudera.org:8080/16511
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Some tests (e.g. AnalyzeDDLTest.TestCreateTableLikeFileOrc) depend on
hard-coded file paths of managed tables, assuming that there is always a
file named 'base_0000001/bucket_00000_0' under the table dir. However,
the file name is in the form of bucket_${bucket-id}_${attempt-id}. The
last part of the file name is not guaranteed to be 0. If the first
attempt fails and the second attempt succeeds, the file name will be
bucket_00000_1.
This patch replaces these hard-coded file paths with the paths of
corresponding files that are uploaded to HDFS by commands. For tests
that do need the file paths of managed table files, we do a listing on
the table dir to get the file names, instead of hard-coding the paths.
Updated chars-formats.orc to contain real column names so it can be
used in more tests. The original file only had names like col0, col1,
col2.
Tests:
- Run CORE tests
Change-Id: Ie3136ee90e2444c4a12f0f2e1470fca1d5deaba0
Reviewed-on: http://gerrit.cloudera.org:8080/16441
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for constant propagation of range predicates
involving date and timestamp constants. Previously, only equality
predicates were considered for propagation. The new type of propagation
is shown by the following example:
Before constant propagation:
WHERE date_col = CAST(timestamp_col as DATE)
AND timestamp_col BETWEEN '2019-01-01' AND '2020-01-01'
After constant propagation:
WHERE date_col >= '2019-01-01' AND date_col <= '2020-01-01'
AND timestamp_col >= '2019-01-01' AND timestamp_col <= '2020-01-01'
AND date_col = CAST(timestamp_col as DATE)
As a consequence, since Impala supports table partitioning by date
columns but not timestamp columns, the above propagation enables
partition pruning based on timestamp ranges.
Existing code for equality based constant propagation was refactored
and consolidated into a new class which handles both equality and
range based constant propagation. Range based propagation is only
applied to date and timestamp columns.
Testing:
- Added new range constant propagation tests to PlannerTest.
- Added e2e test for range constant propagation based on a newly
added date partitioned table.
- Ran precommit tests.
Change-Id: I811a1f8d605c27c7704d7fc759a91510c6db3c2b
Reviewed-on: http://gerrit.cloudera.org:8080/16346
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This removes Impala-lzo from the Impala development environment.
Impala-lzo is not built as part of the Impala build. The LZO plugin
is no longer loaded. LZO tables are not loaded during dataload,
and LZO is no longer tested.
This removes some obsolete scan APIs that were only used by Impala-lzo.
With this commit, Impala-lzo would require code changes to build
against Impala.
The plugin infrastructure is not removed, and this leaves some
LZO support code in place. If someone were to decide to revive
Impala-lzo, they would still be able to load it as a plugin
and get the same functionality as before. This plugin support
may be removed later.
Testing:
- Dryrun of GVO
- Modified TestPartitionMetadataUncompressedTextOnly's
test_unsupported_text_compression() to add LZO case
Change-Id: I3a4f12247d8872b7e14c9feb4b2c58cfd60d4c0e
Reviewed-on: http://gerrit.cloudera.org:8080/15814
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Minor compactions can compact several delta directories into a single
delta directory. The current directory filtering algorithm had to be
modified to handle minor compacted directories and prefer those over
plain delta directories. This happens in the Frontend, mostly in
AcidUtils.java.
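For example (write ids illustrative), a minor compaction can replace
  full_acid/delta_0000001_0000001_0000/
  full_acid/delta_0000002_0000002_0000/
with a single directory spanning the same write id range:
  full_acid/delta_0000001_0000002_v0001234/
and the filtering must now pick the compacted directory over the
originals when both are still present.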
Hive Streaming Ingestion writes similar delta directories, but they
might contain rows Impala cannot see based on its valid write id list.
E.g. we can have the following delta directory:
full_acid/delta_0000001_0000010/0000 # minWriteId: 1
# maxWriteId: 10
This delta dir contains rows with write ids between 1 and 10. But maybe
we are only allowed to see write ids less than 5. Therefore we need to
check the ACID write id column (named originalTransaction) to determine
which rows are valid.
Delta directories written by Hive Streaming don't have a visibility txn
id, so we can recognize them based on the directory name. If there's
a visibilityTxnId and it is committed => every row is valid:
full_acid/delta_0000001_0000010_v01234 # has visibilityTxnId
# every row is valid
If there's no visibilityTxnId then it was created via Hive Streaming,
therefore we need to validate rows. Fortunately Hive Streaming writes
rows with different write ids into different ORC stripes, therefore we
don't need to validate the write id per row. If we had statistics,
we could validate per stripe, but since Hive Streaming doesn't write
statistics we validate the write id per ORC row batch (an alternative
could be to do a 2-pass read, first we'd read a single value from each
stripe's 'currentTransaction' field, then we'd read the stripe if the
write id is valid).
Testing
* the frontend logic is tested in AcidUtilsTest
* the backend row validation is tested in test_acid_row_validation
Change-Id: I5ed74585a2d73ebbcee763b0545be4412926299d
Reviewed-on: http://gerrit.cloudera.org:8080/15818
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala 4 decided to drop Sentry support in favor of Ranger. This
removes Sentry support and related tests. It retires startup
flags related to Sentry and does the first round of removing
obsolete code. This does not adjust documentation to remove
references to Sentry, and other dead code will be removed
separately.
Some issues came up when implementing this. Here is a summary
of how this patch resolves them:
1. authorization_provider currently defaults to "sentry", but
"ranger" requires extra parameters to be set. This changes the
default value of authorization_provider to "", which translates
internally to the noop policy that does no authorization.
2. These flags are Sentry specific and are now retired:
- authorization_policy_provider_class
- sentry_catalog_polling_frequency_s
- sentry_config
3. The authorization_factory_class may be obsolete now that
there is only one authorization policy, but this leaves it
in place.
4. Sentry is the last component using CDH_COMPONENTS_HOME, so
that is removed. There are still Maven dependencies coming
from the CDH_BUILD_NUMBER repository, so that is not removed.
5. To make the transition easier, testdata/bin/kill-sentry-service.sh
is not removed and it is still called from testdata/bin/kill-all.sh.
Testing:
- Core job passes
Change-Id: I8e99c15936d6d250cf258e3a1dcba11d3eb4661e
Reviewed-on: http://gerrit.cloudera.org:8080/15833
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala 4 moved to using CDP versions for components, which involves
adopting Hive 3. This removes the old code supporting CDH components
and Hive 2. Specifically, it does the following:
1. Remove USE_CDP_HIVE and default to the values from USE_CDP_HIVE=true.
USE_CDP_HIVE now has no effect on the Impala environment. This also
means that bin/jenkins/build-all-flag-combinations.sh no longer
include USE_CDP_HIVE=false as a configuration.
2. Remove USE_CDH_KUDU and default to getting Kudu from the
native toolchain.
3. Ban IMPALA_HIVE_MAJOR_VERSION<3 and remove related code, including
the IMPALA_HIVE_MAJOR_VERSION=2 maven profile in fe/pom.xml.
There is a fair amount of code that still references the Hive major
version. Upstream Hive is now working on Hive 4, so there is a high
likelihood that we'll need some code to deal with that transition.
This leaves some code (such as maven profiles) and test logic in
place.
Change-Id: Id85e849beaf4e19dda4092874185462abd2ec608
Reviewed-on: http://gerrit.cloudera.org:8080/15869
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
HIVE-22794 disallows ACID tables outside of the 'managed' warehouse
directory. This change updates data loading to make it conform to
the new rules.
The following tests had to be modified to use the new paths:
* AnalyzeDDLTest.TestCreateTableLikeFileOrc()
* create-table-like-file-orc.test
Change-Id: Id3b65f56bf7f225b1d29aa397f987fdd7eb7176c
Reviewed-on: http://gerrit.cloudera.org:8080/15708
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Automatically assume IMPALA_HOME is the source directory
in a couple of places.
Delete the cache_tables.py script and MINI_DFS_BASE_DATA_DIR
config var which had both bit-rotted and were unused.
Allow setting IMPALA_CLUSTER_NODES_DIR to put the minicluster
nodes, most importantly the data, in a different location, e.g.
on a different filesystem.
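For example (path hypothetical):
  # keep minicluster node dirs, including HDFS data, on a larger disk
  export IMPALA_CLUSTER_NODES_DIR=/data1/impala/cluster_nodes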
Testing:
I set up a dev environment using this code and was able to
load data and run some tests.
Change-Id: Ibd8b42a6d045d73e3ea29015aa6ccbbde278eec7
Reviewed-on: http://gerrit.cloudera.org:8080/15687
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Full ACID row format looks like this:
{
  "operation": 0,
  "originalTransaction": 1,
  "bucket": 536870912,
  "rowId": 0,
  "currentTransaction": 1,
  "row": {"i": 1}
}
User columns are nested under "row". In the frontend we need to create
slot descriptors that correspond to the file schema. In the catalog we
could mimic the file schema but that would introduce several
complexities and corner cases in column resolution. Also in query
results the heading of the above user column would be "row.i". Star
expansion should also be modified, etc.
Because of that in the Catalog I create the exact opposite of the above
schema:
{
  "row__id":
  {
    "operation": 0,
    "originalTransaction": 1,
    "bucket": 536870912,
    "rowId": 0,
    "currentTransaction": 1
  },
  "i": 1
}
This way very little modification is needed in the frontend, and the
hidden columns can easily be retrieved via 'SELECT row__id.*' when we
need them for debugging/testing.
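For example (table name hypothetical):
  impala-shell -q "SELECT row__id.*, i FROM full_acid_tbl"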
We only need to change Path.getAbsolutePath() to return a schema path
that corresponds to the file schema. Also in the backend we need some
extra juggling in OrcSchemaResolver::ResolveColumn() to retrieve the
table schema path from the file schema path.
Testing:
I changed data loading to load ORC files in full ACID format by default.
With this change we should be able to scan full ACID tables that are
not minor-compacted, don't have deleted rows, and don't have original
files.
Newly added Tests:
* specific queries about hidden columns (full-acid-rowid.test)
* SHOW CREATE TABLE (show-create-table-full-acid.test)
* DESCRIBE [FORMATTED] TABLE (describe-path.test)
* INSERT should be forbidden (acid-negative.test)
* added tests for column masking (
ranger_column_masking_complex_types.test)
Change-Id: Ic2e2afec00c9a5cf87f1d61b5fe52b0085844bcb
Reviewed-on: http://gerrit.cloudera.org:8080/15395
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Due to the incompatibility between different versions of the Guava
libraries, after bumping up Guava in IMPALA-8870, our build script is
not supposed to start up the Sentry service when starting the
minicluster because Sentry has not had its Guava bumped up yet. However,
the patch for IMPALA-8870 did not take this into consideration when
$TARGET_FILESYSTEM is s3 and thus run-all.sh still attempts to start
up Sentry in this case. This patch fixes the bug.
Testing:
- Verified that this patch passes the core tests in the DEBUG build when
$TARGET_FILESYSTEM is s3.
Change-Id: If81846f4251fb2aa752ba8c33615cae0ab513a62
Reviewed-on: http://gerrit.cloudera.org:8080/15590
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds an environment variable DISABLE_SENTRY to allow Impala
to run tests without Sentry. Specifically, we start up Sentry only when
$DISABLE_SENTRY does not evaluate to true. The corresponding Sentry FE
and E2E tests will also be skipped if $DISABLE_SENTRY is true.
Moreover, in this patch we set DISABLE_SENTRY to true if
$USE_CDP_HIVE evaluates to true, so that only Impala's authorization
with Ranger is tested once support for Sentry is dropped after we
switch to CDP Hive.
Note that this patch also changes how we generate hive-site.xml when
$DISABLE_SENTRY is true: we no longer add the Sentry server as a
metastore event listener. Recall that both CDH Hive and CDP Hive make
an RPC to the registered listeners every time create_database_core()
in HiveMetaStore.java is called, which happens when Hive instead of
Impala is used to create a database, e.g., when some databases in the
TPC-DS data set are created during the execution of
create-load-data.sh. Removing Sentry as an event listener when
$DISABLE_SENTRY is true is therefore necessary: it prevents the
HiveMetaStore from repeatedly trying to connect to a Sentry server
that is not online, which could make create-load-data.sh time out.
Testing:
Except for two currently known issues, IMPALA-9513 and IMPALA-9451,
verified this patch passes the exhaustive tests in the DEBUG build
- when $USE_CDP_HIVE is false, and
- when $USE_CDP_HIVE is true.
Change-Id: Ifa3f1840a77a7b32310a5c8b78a2c26300ccb41e
Reviewed-on: http://gerrit.cloudera.org:8080/15505
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
String values from external systems (HDFS, Hive, Kudu, etc.) are
already unescaped, as are string values in Thrift objects deserialized
in coordinators. We should mark needsUnescaping_ as false when creating
StringLiterals for these values (in LiteralExpr#create()).
When comparing StringLiterals in partition pruning, we should also use
the unescaped values if needsUnescaping_ is true.
Tests:
- Add tests for partition pruning on unescaped strings.
- Add test coverage for all existing code paths using
LiteralExpr#create().
- Run core tests
Change-Id: Iea8070f16a74f9aeade294504f2834abb8b3b38f
Reviewed-on: http://gerrit.cloudera.org:8080/15278
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The schema file allows specifying a command line in several of the
sections (LOAD, DEPENDENT_LOAD, etc). These are executed by
testdata/bin/generate-schema-statements.py when it is creating the
SQL files that are later executed for dataload. A fair number of
tables use this flexibility to execute hdfs mkdir and copy commands
via the command line.
Unfortunately, this is very inefficient. HDFS command line
commands require spinning up a JVM and can take over one
second per command. These commands are executed during a
serial part of dataload, and they can be executed multiple
times. In short, these commands are a significant slowdown
for loading the functional tables.
This converts the hdfs command line statements to equivalent
Hive LOAD DATA LOCAL statements. These are doing the copy
from an already running JVM, so they do not need JVM startup.
They also run in the parallel part of dataload, speeding up
the SQL generation part.
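Roughly, the conversion replaces per-table shell commands like
(names and paths hypothetical):
  hadoop fs -mkdir -p /test-warehouse/some_tbl      # one JVM start
  hadoop fs -put data.txt /test-warehouse/some_tbl  # per command
with a statement Hive runs from its already-warm JVM:
  beeline -u "$JDBC_URL" -e "LOAD DATA LOCAL INPATH 'data.txt'
    OVERWRITE INTO TABLE some_db.some_tbl;"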
This speeds up generate-schema-statements.py significantly.
On the functional dataset, it saves 7 minutes.
Before:
time testdata/bin/generate-schema-statements.py -w functional-query -e exhaustive -f
real 8m8.068s
user 10m11.218s
sys 0m44.932s
After:
time testdata/bin/generate-schema-statements.py -w functional-query -e exhaustive -f
real 0m35.800s
user 0m42.536s
sys 0m5.210s
This is currently a long-pole in dataload, so it translates directly to
an overall speedup of about 7 minutes.
Testing:
- Ran debug tests
Change-Id: Icf17b85ff85618933716a80f1ccd6701b07f464c
Reviewed-on: http://gerrit.cloudera.org:8080/15228
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
As Joe pointed out in IMPALA-9351, it would help debugging issues with
missing files if we had logged the created files when loading the data.
With this commit, running create-load-data.sh now logs the created
files into created-files.log.
Change-Id: I4f413810c6202a07c19ad1893088feedd9f7278f
Reviewed-on: http://gerrit.cloudera.org:8080/15234
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Add a new flag -with_ranger in testdata/bin/run-hive-server.sh to start
Hive with Ranger integration. The relevant configuration files are
generated in bin/create-test-configuration.sh using a new variant
ranger_auth in hive-site.xml.py. Only Hive3 is supported.
Current limitation:
Can't use a different username in Beeline via the -n option: "select
current_user()" keeps returning my username, while "select
logged_in_user()" returns the username given by the -n option, but
that name is not used in authorization.
Tests:
- Ran bin/create-test-configuration.sh and verified the generated
hive-site_ranger_auth.xml contains Ranger configurations.
- Ran testdata/bin/run-hive-server.sh -with_ranger. Verified column
masking and row filtering policies took effect in Beeline.
- Added test in test_ranger.py for this mode.
Change-Id: I01e3a195b00a98388244a922a1a79e65146cec42
Reviewed-on: http://gerrit.cloudera.org:8080/15189
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A Hudi Read Optimized table contains multiple versions of parquet
files. In order to load the table correctly, Impala needs to recognize
a Hudi Read Optimized table as an HdfsTable and load the latest version
of each file using HoodieROTablePathFilter.
Tests
- Unit test for Hudi in FileMetadataLoader
- Create table tests in functional_schema_template.sql
- Query tests in hudi-parquet.test
Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Reviewed-on: http://gerrit.cloudera.org:8080/14711
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The kerberized minicluster is enabled by setting
IMPALA_KERBERIZE=true in impala-config-*.sh.
After setting it you must run ./bin/create-test-configuration.sh
and then restart the minicluster.
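In command form (assuming the usual minicluster scripts):
  export IMPALA_KERBERIZE=true        # normally set in impala-config-*.sh
  ./bin/create-test-configuration.sh
  testdata/bin/run-all.sh             # restart the minicluster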
This adds a script to partially automate setup of a local KDC,
in lieu of the unmaintained minikdc support (which has been ripped
out).
Testing:
I was able to run some queries against pre-created HDFS tables
with kerberos enabled.
Change-Id: Ib34101d132e9c9d59da14537edf7d096f25e9bee
Reviewed-on: http://gerrit.cloudera.org:8080/15159
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Hive 3 changed the typical storage model for tables to split them
between two directories:
- hive.metastore.warehouse.dir stores managed tables (which is now
defined to be only transactional tables)
- hive.metastore.warehouse.external.dir stores external tables
(everything that is not a transactional table)
In more recent commits of Hive, there is now validation that
external tables cannot be stored in the managed directory. In order
to adopt these newer versions of Hive, we need to use separate
directories for external vs managed warehouses.
Most of our test tables are not transactional, so they would reside
in the external directory. To keep the test changes small, this uses
/test-warehouse for the external directory and /test-warehouse/managed
for the managed directory. Having the managed directory be a subdirectory
of /test-warehouse means that the data snapshot code should not need to
change.
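The resulting layout, sketched with a hypothetical database name:
  /test-warehouse/foo.db/             # external (non-transactional) tables
  /test-warehouse/managed/foo.db/     # managed (transactional) tables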
The Hive 2 configuration doesn't change as it does not have this concept.
Since this changes the dataload layout, this also sets CDH_MAJOR_VERSION
to 7 for USE_CDP_HIVE=true. This means that dataload uses a separate
location for data as compared to USE_CDP_HIVE=false, which should reduce
conflicts between the two configurations.
Testing:
- Ran exhaustive tests with USE_CDP_HIVE=false
- Ran exhaustive tests with USE_CDP_HIVE=true (with current Hive version)
- Verified that dataload succeeds and tests are able to run with a newer
Hive version.
Change-Id: I3db69f1b8ca07ae98670429954f5f7a1a359eaec
Reviewed-on: http://gerrit.cloudera.org:8080/15026
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In IMPALA-9047, we disabled some Ranger-related FE and BE tests due to
changes in Ranger's behavior after upgrading Ranger from 1.2 to 2.0.
This patch aims to re-enable those disabled FE tests in
AuthorizationStmtTest.java and RangerAuditLogTest.java to increase
Impala's test coverage of authorization via Ranger.
There are at least two major changes in Ranger's behavior in the newer
versions.
1. The first is that the owner of the requested resource no longer has
to be explicitly granted privileges in order to access the resource.
2. The second is that a user not explicitly granted the privilege of
creating a database is able to do so.
Due to these changes, some of previous Ranger authorization requests
that were expected to be rejected are now granted after the upgrade.
To re-enable the tests affected by the first change described above, we
modify AuthorizationTestBase.java to allow our FE Ranger authorization
tests to specify the requesting user in an authorization test. Those
tests failed after the upgrade because the default requesting user in
Impala's AuthorizationTestBase.java happens to be the owner of the
resources involved in our FE authorization tests. After this patch, a
requesting user can be either a non-owner user or an owner user in a
Ranger authorization test and the requesting user would correspond to a
non-owner user if it is not explicitly specified. Note that in a Sentry
authorization test, we do not use the non-owner user as the requesting
user by default as we do in the Ranger authorization tests. Instead, we
set the name of the requesting user to a name that is the same as the
owner user in Ranger authorization tests to avoid the need for providing
a customized group mapping service when instantiating a Sentry
ResourceAuthorizationProvider as we do in AuthorizationTest.java, our
FE tests specifically for testing authorization via Sentry.
On the other hand, to re-enable the tests affected by the second change,
we remove from the Ranger policy for all databases the allowed
condition that grants any user the privilege of creating a database,
which is not granted by default in Sentry. After the removal of the
allowed condition, those tests in AuthorizationStmtTest.java and
RangerAuditLogTest.java affected by the second change now result in the
same authorization errors as before the upgrade of Ranger.
Testing:
- Passed AuthorizationStmtTest.java in a local dev environment
- Passed RangerAuditLogTest.java in a local dev environment
Change-Id: I228533aae34b9ac03bdbbcd51a380770ff17c7f2
Reviewed-on: http://gerrit.cloudera.org:8080/14798
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The fix for IMPALA-9150 changed kill-hbase.sh to use HBase's
stop-hbase.sh script. Around this time, the GVO timeout issues
started. GVO can reuse machines, so we don't know what state
they may be in. If something failed to kill HBase processes,
the next job would need to be able to kill them even without
access to the last run's files / logs.
This restores the original kill logic to kill-hbase.sh, after
trying a graceful shutdown using HBase's stop-hbase.sh script.
The original kill logic doesn't rely on anything from the
filesystem to know about the existence of processes, so it
would handle machine reuse.
This also changes our Jenkins test scripts to shut down the
minicluster at the end.
Testing:
- Started with a running minicluster, ran bin/clean.sh,
then ran testdata/bin/kill-all.sh and verified that the
java processes were gone
Change-Id: Ie2f0b342bcd1d8abea8ef923adbb54a14518a7a6
Reviewed-on: http://gerrit.cloudera.org:8080/14789
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Some jobs have been hanging in the testdata/bin/create-hbase.sh
script. Logs during the hang show the HBase Master is stuck
uninitialized:
Master startup cannot progress, in holding-pattern until region onlined.
...
ERROR master.HMaster: Master failed to complete initialization after 900000ms.
Anecdotally, the HBase Master doesn't have this problem if we remove
the /hbase/MasterProcWALs directory in kill-hbase.sh. This patch
does exactly that. It is a hack, and we should update this code
once we know what is going on.
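The cleanup amounts to something like the following in kill-hbase.sh
(exact flags and path are a sketch):
  # hack: drop the master procedure WALs so the master starts fresh
  hdfs dfs -rm -r -f -skipTrash /hbase/MasterProcWALs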
Testing:
- test-with-docker.py fails without this patch and passes with it
- Hand testing on my minicluster shows that this allows HBase to
restart and be consistently usable
Change-Id: Icef3d30e6b539a175e03f63fcdbfb2d4608c08fa
Reviewed-on: http://gerrit.cloudera.org:8080/14757
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This converts the existing bin/run-all-tests-timeout-check.sh
to a more generic bin/script-timeout-check.sh. It uses this
new script for both bin/run-all-tests.sh and
testdata/bin/create-load-data.sh. The new script takes two
arguments:
-timeout : timeout in minutes
-script_name : name of the calling script
The script_name is used in debugging output / output filenames
to make it clear what timed out.
The run-all-tests.sh timeout remains the same.
testdata/bin/create-load-data.sh uses a 2.5 hour timeout.
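A hedged usage sketch, with the flags described above:
  # watchdog: preserve logs and fail the run if we exceed 150 minutes
  ./bin/script-timeout-check.sh -timeout 150 -script_name create-load-data &
  TIMEOUT_PID=$!
  # ... do the actual work ...
  kill $TIMEOUT_PID   # cancel the watchdog on success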
This should help debug the issue in IMPALA-9165, because at
least the logs would be preserved on the Jenkins job.
Testing:
- Tested the timeout script by hand with a caller script that
sleeps longer than the timeout
- Ran a gerrit-verify-dryrun-external
Change-Id: I19d76bd8850c7d4b5affff4d21f32d8715a382c6
Reviewed-on: http://gerrit.cloudera.org:8080/14741
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
testdata/bin/kill-hbase.sh currently uses the generic
kill-java-service.sh script to kill the region servers,
then the master, and then ZooKeeper. Recent versions
of HBase become unusable after performing this type of
shutdown. The master seems to get stuck trying to recover,
even after restarting the minicluster.
The root cause in HBase is unclear, but HBase provides the
stop-hbase.sh script, which does a more graceful shutdown.
This switches testdata/bin/kill-hbase.sh to use this script,
which avoids the recovery problems.
Testing:
- Ran the test-with-docker.py tests (which does a minicluster
restart). Before the change, the HBase tests timed out due
to HBase getting stuck recovering. After the change, tests
ran normally.
- Added a minicluster restart after dataload so that this
is tested.
Change-Id: I67283f9098c73c849023af8bfa7af62308bf3ed3
Reviewed-on: http://gerrit.cloudera.org:8080/14697
Reviewed-by: Vihang Karajgaonkar <vihang@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Having non-existent or incompatible jars on the classpath can cause
Ranger startup to fail. Update run-ranger-server.sh to clean the
classpath so that it works after sourcing set-classpath.sh.
Also remove a couple of legacy jars from 2013. Those jars
no longer exist in Hive.
Testing:
In my development environment.
$ . bin/set-classpath.sh
$ ./testdata/bin/run-ranger-server.sh
Change-Id: Ie7036f9a07e5c9b8d46bb7f459d0b9d1e7e9d0a7
Reviewed-on: http://gerrit.cloudera.org:8080/14152
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In HdfsCachingUtil we set the expiration of cache directives to never.
This works well until the cache pool has a max TTL set; once it is,
Impala gets an exception when it tries to add caching for tables or
partitions.
I changed HdfsCachingUtil to not set the expiration. This way the cache
directive inherits the expiration from the cache pool.
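For context, the directive comes from statements like (pool name
hypothetical):
  impala-shell -q "ALTER TABLE t SET CACHED IN 'ttl_pool'"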
Testing
Added an e2e test that creates a table in a cache pool that has a max TTL.
Change-Id: I475b92704b19e337b2e62f766e5b978585bf6583
Reviewed-on: http://gerrit.cloudera.org:8080/14485
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change is a follow-up to IMPALA-7368 and adds support for DATE
type to the avro scanner.
Similarly to parquet, avro uses DATE logical type for dates. DATE
logical type annotates an INT32 that stores the number of days since
the unix epoch, 1 January 1970.
This representation introduces an avro interoperability issue between
Impala and older versions of Hive:
- Before version 3.1, Hive used Julian calendar to represent dates
up to 1582-10-05 and Gregorian calendar for dates starting with
1582-10-15. Dates between 1582-10-05 and 1582-10-15 were lost.
- Impala uses proleptic Gregorian calendar, extending the Gregorian
calendar backward to dates preceding its official introduction in
1582-10-15.
This means that pre-1582-10-15 dates written to an avro table by Hive
will be read back incorrectly by Impala.
Note that Hive 3.1 switched to proleptic Gregorian calendar too, so
for Hive 3.1+ this is no longer an issue.
Dependency changes:
- BE uses avro 1.7.4-p5 from native-toolchain.
Change-Id: I7a9d5b93a22cf3a00244037e187f8c145cacc959
Reviewed-on: http://gerrit.cloudera.org:8080/13944
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
* Don't add PYTHONPATH to environment in impala-config.sh,
it is done automatically by the impala-python script anyway.
I think this is legacy from when we ran some things with
the system python.
* Remove unnecessary set-pythonpath.sh invocations where all
calls go via impala-python anyway.
* Remove impala-shell eggs from python path. All these packages
are installed into the virtualenv.
* testdata path entry was not needed - it's imported via the root
Testing:
Ran core tests
Change-Id: Iff98eb261ab48c592e8d323aa409c6a65317b95a
Reviewed-on: http://gerrit.cloudera.org:8080/14238
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
The flakiness may be related to starting Hive queries in parallel,
which triggers initializing Tez resources in parallel (only needed at
the first statement that uses Tez). Running a non-parallel statement
first may solve the issue.
Also includes a fix for a recent issue in 'build-and-copy-hive-udfs'
introduced by the version bump
in https://gerrit.cloudera.org/#/c/14043/
Change-Id: Id21d57483fe7a4f72f450fb71f8f53b3c1ef6327
Reviewed-on: http://gerrit.cloudera.org:8080/14081
Reviewed-by: Vihang Karajgaonkar <vihang@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
The bug is that the serialized size wasn't populated
for VARCHAR in a case when it should have been.
It appears a condition was simply not updated when
VARCHAR was added.
Other code assumed that the serialized size was
populated when the other size field was populated,
which is a reasonable invariant. I documented the
invariant in the class and added validation that
the invariant held.
Defining and checking invariants led to discovering
various other minor issues where the sizes were
set incorrectly for fixed-length types or not set for
variable-length types:
* CHAR was not consistently treated as a fixed-length type.
* avgSerializedSize_ was not always updated with avgSize_
Testing:
Added a regression test for this specific case. Adding
the assertions also surfaced related bugs in other cases.
Change-Id: Ie45e386cb09e31f4b7cdc82b7734dbecb4464534
Reviewed-on: http://gerrit.cloudera.org:8080/14062
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>