This patch adds support for COS (Cloud Object Storage). Using the
hadoop-cos connector, the implementation is similar to other remote
FileSystems.
New flags for COS:
- num_cos_io_threads: Number of COS I/O threads. Defaults to 16 (see
  the sketch below).
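As an illustration of using the new flag, a development minicluster
could be started with a larger COS I/O thread pool roughly as follows
(a sketch: the flag name comes from this patch, the launcher and its
--impalad_args option already exist, and the value 32 is arbitrary):

  import subprocess

  # Illustrative only: forward the new flag to each impalad via the
  # existing development-cluster launcher.
  subprocess.check_call([
      "bin/start-impala-cluster.py",
      "--impalad_args=--num_cos_io_threads=32",
  ])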
Follow-up:
- Support for caching COS file handles will be addressed in
IMPALA-10772.
- test_concurrent_inserts and test_failing_inserts in
test_acid_stress.py are skipped due to slow file listing on
COS (IMPALA-10773).
Tests:
- Upload hdfs test data to a COS bucket. Modify all locations in HMS
DB to point to the COS bucket. Remove some hdfs caching params.
Run CORE tests.
Change-Id: Idce135a7591d1b4c74425e365525be3086a39821
Reviewed-on: http://gerrit.cloudera.org:8080/17503
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit turns on events processing by default. The default
polling interval is set to 1 second and can be overridden by
setting hms_event_polling_interval_s to a non-default value.
With event polling turned on by default, this patch also moves
test_event_processing.py to tests/metadata instead of keeping it
as a custom cluster test. Some tests within test_event_processing.py
which needed non-default configurations were moved to
tests/custom_cluster/test_events_custom_configs.py.
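For illustration, such a test can override the catalogd flag roughly
as follows (a sketch assuming the existing
CustomClusterTestSuite.with_args decorator; the class and test names
are made up):

  from tests.common.custom_cluster_test_suite import CustomClusterTestSuite

  class TestEventsCustomConfigs(CustomClusterTestSuite):

    # Hypothetical test: restart the cluster with a slower polling
    # interval instead of relying on the new 1 second default.
    @CustomClusterTestSuite.with_args(
        catalogd_args="--hms_event_polling_interval_s=5")
    def test_slow_polling(self, unique_database):
      pass  # ... exercise event processing against unique_database ...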
Additionally, some other tests were modified to take into account
Impala's ability to automatically detect tables newly added from
Hive.
Testing done:
1. Ran exhaustive tests by turning on the events processing multiple
times.
2. Ran exhaustive tests by disabling events processing.
3. Ran dockerized tests.
Change-Id: I9a8b1871a98b913d0ad8bb26a104a296b6a06122
Reviewed-on: http://gerrit.cloudera.org:8080/17612
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
This patch adds support for GCS (Google Cloud Storage). Using the
gcs-connector, the implementation is similar to other remote
FileSystems.
New flags for GCS:
- num_gcs_io_threads: Number of GCS I/O threads. Defaults to 16.
Follow-up:
- Support for spilling to GCS will be addressed in IMPALA-10561.
- Support for caching GCS file handles will be addressed in
IMPALA-10568.
- test_concurrent_inserts and test_failing_inserts in
test_acid_stress.py are skipped due to slow file listing on
GCS (IMPALA-10562).
- Some tests are skipped due to issues introduced by /etc/hosts setting
on GCE instances (IMPALA-10563).
Tests:
- Compile and create hdfs test data on a GCE instance. Upload test data
to a GCS bucket. Modify all locations in HMS DB to point to the GCS
bucket. Remove some hdfs caching params. Run CORE tests.
- Compile and load snapshot data to a GCS bucket. Run CORE tests.
Change-Id: Ia91ec956de3b620cccf6a1244b56b7da7a45b32b
Reviewed-on: http://gerrit.cloudera.org:8080/17121
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Hive 3 changed the typical storage model for tables to split them
between two directories:
- hive.metastore.warehouse.dir stores managed tables (which is now
defined to be only transactional tables)
- hive.metastore.warehouse.external.dir stores external tables
(everything that is not a transactional table)
In more recent commits of Hive, there is now validation that the
external tables cannot be stored in the managed directory. In order
to adopt these newer versions of Hive, we need to use separate
directories for external vs managed warehouses.
Most of our test tables are not transactional, so they would reside
in the external directory. To keep the test changes small, this uses
/test-warehouse for the external directory and /test-warehouse/managed
for the managed directory. Having the managed directory be a subdirectory
of /test-warehouse means that the data snapshot code should not need to
change.
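For reference, the resulting layout can be summarized as follows (an
illustrative sketch; the actual values are wired into the generated
Hive configuration rather than into Python code):

  # Summary of the warehouse split described above.
  WAREHOUSE_DIRS = {
      # External tables (everything non-transactional, i.e. most
      # test tables).
      "hive.metastore.warehouse.external.dir": "/test-warehouse",
      # Managed (transactional) tables, kept under /test-warehouse so
      # the data snapshot code does not need to change.
      "hive.metastore.warehouse.dir": "/test-warehouse/managed",
  }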
The Hive 2 configuration doesn't change as it does not have this concept.
Since this changes the dataload layout, this also sets the CDH_MAJOR_VERSION
to 7 for USE_CDP_HIVE=true. This means that dataload will use a separate
location for data as compared to USE_CDP_HIVE=false. That should reduce
conflicts between the two configurations.
Testing:
- Ran exhaustive tests with USE_CDP_HIVE=false
- Ran exhaustive tests with USE_CDP_HIVE=true (with current Hive version)
- Verified that dataload succeeds and tests are able to run with a newer
Hive version.
Change-Id: I3db69f1b8ca07ae98670429954f5f7a1a359eaec
Reviewed-on: http://gerrit.cloudera.org:8080/15026
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This includes some optimisations and a bulk move of tests
to exhaustive.
Move a bunch of custom cluster tests to exhaustive. I selected
these partially based on runtime (i.e. I looked most carefully
at the tests that ran for over a minute) and the likelihood
of them catching a precommit bug. Regression tests for specific
edge cases and tests for parts of the code that are very stable
were prime candidates.
Remove an unnecessary cluster restart in test_breakpad.
Merge test_scheduler_error into test_failpoints to avoid an unnecessary
cluster restart.
Speed up cluster starts by ensuring that the default statestore args are
applied even when _start_impala_cluster() is called directly. This
shaves a couple of seconds off each restart. We made the default args
use a faster update frequency - see IMPALA-7185 - but they did not
take effect in all tests.
Change-Id: Ib2e3e7ebc9695baec4d69183387259958df10f62
Reviewed-on: http://gerrit.cloudera.org:8080/13967
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The test case fails (hanging until the timeout is hit) when trying
to do a Hive insert on S3. Support for Hive inserts is flaky on S3,
so the test case should be disabled for S3.
Change-Id: Ie8eee8ba44e6acf4a069467f749756f9040ef641
Reviewed-on: http://gerrit.cloudera.org:8080/13781
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The newly added Hive<->Impala interop test fails due to wrong
results when reading TimeStamp column values written by Hive.
The short-term measure is to remove the TimeStamp column from the
interop tests. The original issue will be fixed by IMPALA-8721.
Testing: Ran the test case N times on both the upstream and
downstream code bases.
Change-Id: I148c79a31f9aada1b75614390434462d1e483f28
Reviewed-on: http://gerrit.cloudera.org:8080/13755
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Apparently IMPALA_REMOTE_URL is not generally used for remote cluster
tests: only --testing_remote_cluster is reliably set. Fix the
is_remote_cluster() implementation to take into account
REMOTE_DATA_LOAD and --testing_remote_cluster in addition to
IMPALA_REMOTE_URL. Consistently use is_remote_cluster() in
other tests instead of checking the pytest flag directly.
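A minimal sketch of the detection order described above (illustrative
only; the real method lives on ImpalaTestClusterProperties, and
reading REMOTE_DATA_LOAD from the environment is an assumption here):

  import os

  def is_remote_cluster(pytest_config):
    # True if any of the three signals indicates a remote cluster.
    return (bool(os.environ.get("REMOTE_DATA_LOAD"))
            or pytest_config.getoption("--testing_remote_cluster",
                                       default=False)
            or bool(os.environ.get("IMPALA_REMOTE_URL")))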
There were a few lifecycle headaches with how
ImpalaTestClusterProperties is used:
* common.environ is imported from conftest, which means that
the top-level code in the file runs *before* pytest
command-line arguments have been registered and parsed.
* ImpalaTestClusterProperties is used by various code,
like build_flavor_timeout(), which runs before pytest
command-line arguments have been parsed.
* ImpalaTestClusterProperties is called from non-pytest
scripts like start-impala-cluster.py, so the command-line
arguments are not available.
I dealt with the above challenges by making a few changes
to do the detection later:
* Lazily initializing a singleton ImpalaTestClusterProperties.
This was not strictly necessary but makes the whole problem
less sensitive to import order and module dependencies.
* Adding cluster_properties fixture to make ImpalaTestClusterProperties
available in tests without additional boilerplate.
* Removing the caching of the local/remote build calculation.
ImpalaTestClusterProperties is instantiated outside of python
tests, but is_remote_cluster() is only called from python tests,
so if we check flags in is_remote_cluster() we'll get the
right results reliably.
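The lazy singleton and the fixture look roughly like the following
sketch (names follow this commit; the import path and other details
are illustrative):

  import pytest
  # Existing class; the import path is an assumption for this sketch.
  from tests.common.environ import ImpalaTestClusterProperties

  _cluster_properties = None

  def impala_test_cluster_properties():
    # Create the instance on first use, so detection runs after the
    # pytest command-line options have been parsed.
    global _cluster_properties
    if _cluster_properties is None:
      _cluster_properties = ImpalaTestClusterProperties()
    return _cluster_properties

  @pytest.fixture
  def cluster_properties():
    # Hands the lazily created singleton to tests without extra
    # boilerplate.
    return impala_test_cluster_properties()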
As a workaround to unblock remote tests, also assume catalog_v1 if
accessing the web UI fails.
Testing:
Ran core tests against a regular minicluster.
Ran tests against a remote cluster
Change-Id: Ifa6b2a1391f53121d3d7c00c5cf0a57590899ce4
Reviewed-on: http://gerrit.cloudera.org:8080/13386
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A new enum value LZ4_BLOCKED was added to the THdfsCompression enum, to
distinguish it from the existing LZ4 codec. LZ4_BLOCKED codec represents
the block compression scheme used by Hadoop. It's similar to
SNAPPY_BLOCKED as far as the block format is concerned, with the only
difference being the codec used for compression and decompression.
Added Lz4BlockCompressor and Lz4BlockDecompressor classes for
compressing and decompressing parquet data using Hadoop's
lz4 block compression scheme.
The Lz4BlockCompressor treats the input as a single block and
generates a compressed block with the following layout:
<4 byte big endian uncompressed size>
<4 byte big endian compressed size>
<lz4 compressed block>
The hdfs parquet table writer should call the Lz4BlockCompressor
using the ideal input size (unit of compression in parquet is a page),
and so the Lz4BlockCompressor does not further break down the input
into smaller blocks.
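A Python sketch of that framing (illustrative only; the real code is
the C++ Lz4BlockCompressor, and the third-party lz4 package stands in
for Impala's LZ4 codepath):

  import struct
  import lz4.block  # third-party 'lz4' package, for illustration only

  def lz4_blocked_compress(page):
    # One parquet page in, one Hadoop-style block out:
    # <4 byte BE uncompressed size><4 byte BE compressed size><lz4 data>
    compressed = lz4.block.compress(page, store_size=False)
    return struct.pack(">II", len(page), len(compressed)) + compressed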
The Lz4BlockDecompressor on the other hand should be compatible with
blocks written by Impala and other engines in the Hadoop ecosystem.
It can decompress data in the following format:
<4 byte big endian uncompressed size>
<4 byte big endian compressed size>
<lz4 compressed block>
...
<4 byte big endian compressed size>
<lz4 compressed block>
...
<repeated until the uncompressed size from the outer block is consumed>
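Again purely as an illustration, walking that layout for a single
outer block could look like this (the real code is the C++
Lz4BlockDecompressor; LZ4 decoding of the inner blocks is left out):

  import struct

  def split_lz4_blocked(buf):
    # Returns the outer uncompressed size and the raw inner LZ4
    # blocks; decoding the inner blocks is left to an LZ4 codec.
    (total_uncompressed,) = struct.unpack_from(">I", buf, 0)
    offset, blocks = 4, []
    while offset < len(buf):
      (compressed_size,) = struct.unpack_from(">I", buf, offset)
      offset += 4
      blocks.append(buf[offset:offset + compressed_size])
      offset += compressed_size
    return total_uncompressed, blocks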
Externally users can now set the lz4 codec for parquet using:
set COMPRESSION_CODEC=lz4
This gets translated into LZ4_BLOCKED codec for the
HdfsParquetTableWriter. Similarly, when reading lz4 compressed parquet
data, the LZ4_BLOCKED codec is used.
Testing:
- Added unit tests for LZ4_BLOCKED in decompress-test.cc
- Added unit tests for Hadoop compatibility in decompress-test.cc,
basically being able to decompress an outer block with multiple inner
blocks (the Lz4BlockDecompressor description above)
- Added interoperability tests for Hive and Impala for all parquet
codecs. New test added to
tests/custom_cluster/test_hive_parquet_codec_interop.py
Change-Id: Ia6850a39ef3f1e0e7ba48e08eef1d4f7cbb74d0c
Reviewed-on: http://gerrit.cloudera.org:8080/13582
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>