This patch adds support for COS (Cloud Object Storage). Using the
hadoop-cos connector, the implementation is similar to other remote
FileSystems.
New flags for COS:
- num_cos_io_threads: Number of COS I/O threads. Defaults to 16 (see
  the sketch below).
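As an illustration of using the new flag, a development minicluster
could be started with a larger COS I/O thread pool roughly as follows
(a sketch: the flag name comes from this patch, the launcher and its
--impalad_args option already exist, and the value 32 is arbitrary):

  import subprocess

  # Illustrative only: forward the new flag to each impalad via the
  # existing development-cluster launcher.
  subprocess.check_call([
      "bin/start-impala-cluster.py",
      "--impalad_args=--num_cos_io_threads=32",
  ])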
Follow-up:
- Support for caching COS file handles will be addressed in
IMPALA-10772.
- test_concurrent_inserts and test_failing_inserts in
test_acid_stress.py are skipped due to slow file listing on
COS (IMPALA-10773).
Tests:
- Upload hdfs test data to a COS bucket. Modify all locations in HMS
DB to point to the COS bucket. Remove some hdfs caching params.
Run CORE tests.
Change-Id: Idce135a7591d1b4c74425e365525be3086a39821
Reviewed-on: http://gerrit.cloudera.org:8080/17503
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit turns on events processing by default. The default
polling interval is set to 1 second and can be overridden by
setting hms_event_polling_interval_s to a non-default value.
With event polling turned on by default, this patch also moves
test_event_processing.py to tests/metadata instead of keeping it
as a custom cluster test. Some tests within test_event_processing.py
which needed non-default configurations were moved to
tests/custom_cluster/test_events_custom_configs.py.
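For illustration, such a test can override the catalogd flag roughly
as follows (a sketch assuming the existing
CustomClusterTestSuite.with_args decorator; the class and test names
are made up):

  from tests.common.custom_cluster_test_suite import CustomClusterTestSuite

  class TestEventsCustomConfigs(CustomClusterTestSuite):

    # Hypothetical test: restart the cluster with a slower polling
    # interval instead of relying on the new 1 second default.
    @CustomClusterTestSuite.with_args(
        catalogd_args="--hms_event_polling_interval_s=5")
    def test_slow_polling(self, unique_database):
      pass  # ... exercise event processing against unique_database ...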
Additionally, some other tests were modified to take into account
Impala's ability to automatically detect tables newly added from
Hive.
Testing done:
1. Ran exhaustive tests by turning on the events processing multiple
times.
2. Ran exhaustive tests by disabling events processing.
3. Ran dockerized tests.
Change-Id: I9a8b1871a98b913d0ad8bb26a104a296b6a06122
Reviewed-on: http://gerrit.cloudera.org:8080/17612
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
This patch adds support for GCS (Google Cloud Storage). Using the
gcs-connector, the implementation is similar to other remote
FileSystems.
New flags for GCS:
- num_gcs_io_threads: Number of GCS I/O threads. Defaults to 16.
Follow-up:
- Support for spilling to GCS will be addressed in IMPALA-10561.
- Support for caching GCS file handles will be addressed in
IMPALA-10568.
- test_concurrent_inserts and test_failing_inserts in
test_acid_stress.py are skipped due to slow file listing on
GCS (IMPALA-10562).
- Some tests are skipped due to issues introduced by /etc/hosts setting
on GCE instances (IMPALA-10563).
Tests:
- Compile and create hdfs test data on a GCE instance. Upload test data
to a GCS bucket. Modify all locations in HMS DB to point to the GCS
bucket. Remove some hdfs caching params. Run CORE tests.
- Compile and load snapshot data to a GCS bucket. Run CORE tests.
Change-Id: Ia91ec956de3b620cccf6a1244b56b7da7a45b32b
Reviewed-on: http://gerrit.cloudera.org:8080/17121
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Hive 3 changed the typical storage model for tables to split them
between two directories:
- hive.metastore.warehouse.dir stores managed tables (which is now
defined to be only transactional tables)
- hive.metastore.warehouse.external.dir stores external tables
(everything that is not a transactional table)
In more recent commits of Hive, there is now validation that the
external tables cannot be stored in the managed directory. In order
to adopt these newer versions of Hive, we need to use separate
directories for external vs managed warehouses.
Most of our test tables are not transactional, so they would reside
in the external directory. To keep the test changes small, this uses
/test-warehouse for the external directory and /test-warehouse/managed
for the managed directory. Having the managed directory be a subdirectory
of /test-warehouse means that the data snapshot code should not need to
change.
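For reference, the resulting layout can be summarized as follows (an
illustrative sketch; the actual values are wired into the generated
Hive configuration rather than into Python code):

  # Summary of the warehouse split described above.
  WAREHOUSE_DIRS = {
      # External tables (everything non-transactional, i.e. most
      # test tables).
      "hive.metastore.warehouse.external.dir": "/test-warehouse",
      # Managed (transactional) tables, kept under /test-warehouse so
      # the data snapshot code does not need to change.
      "hive.metastore.warehouse.dir": "/test-warehouse/managed",
  }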
The Hive 2 configuration doesn't change as it does not have this concept.
Since this changes the dataload layout, this also sets the CDH_MAJOR_VERSION
to 7 for USE_CDP_HIVE=true. This means that dataload will use a separate
location for data as compared to USE_CDP_HIVE=false. That should reduce
conflicts between the two configurations.
Testing:
- Ran exhaustive tests with USE_CDP_HIVE=false
- Ran exhaustive tests with USE_CDP_HIVE=true (with current Hive version)
- Verified that dataload succeeds and tests are able to run with a newer
Hive version.
Change-Id: I3db69f1b8ca07ae98670429954f5f7a1a359eaec
Reviewed-on: http://gerrit.cloudera.org:8080/15026
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This includes some optimisations and a bulk move of tests
to exhaustive.
Move a bunch of custom cluster tests to exhaustive. I selected
these partially based on runtime (i.e. I looked most carefully
at the tests that ran for over a minute) and the likelihood
of them catching a precommit bug. Regression tests for specific
edge cases and tests for parts of the code that are very stable
were prime candidates.
Remove an unnecessary cluster restart in test_breakpad.
Merge test_scheduler_error into test_failpoints to avoid an unnecessary
cluster restart.
Speed up cluster starts by ensuring that the default statestore args are
applied even when _start_impala_cluster() is called directly. This
shaves a couple of seconds off each restart. We made the default args
use a faster update frequency - see IMPALA-7185 - but they did not
take effect in all tests.
Change-Id: Ib2e3e7ebc9695baec4d69183387259958df10f62
Reviewed-on: http://gerrit.cloudera.org:8080/13967
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The test case fails (hanging until the timeout is hit) when trying
to do a Hive insert on S3. Support for Hive inserts is flaky on S3,
so the test case should be disabled for S3.
Change-Id: Ie8eee8ba44e6acf4a069467f749756f9040ef641
Reviewed-on: http://gerrit.cloudera.org:8080/13781
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The newly added Hive<->Impala interop test fails due to wrong
results when reading TimeStamp column values written by Hive.
The short-term measure is to remove the TimeStamp column from the
interop tests. The original issue will be fixed by IMPALA-8721.
Testing: Ran the test case N times on both the upstream and
downstream code bases.
Change-Id: I148c79a31f9aada1b75614390434462d1e483f28
Reviewed-on: http://gerrit.cloudera.org:8080/13755
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Apparently IMPALA_REMOTE_URL is not generally used for remote cluster
tests: only --testing_remote_cluster is reliably set. Fix the
is_remote_cluster() implementation to take into account
REMOTE_DATA_LOAD and --testing_remote_cluster in addition to
IMPALA_REMOTE_URL. Consistently use is_remote_cluster() in
other tests instead of checking the pytest flag directly.
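A minimal sketch of the detection order described above (illustrative
only; the real method lives on ImpalaTestClusterProperties, and
reading REMOTE_DATA_LOAD from the environment is an assumption here):

  import os

  def is_remote_cluster(pytest_config):
    # True if any of the three signals indicates a remote cluster.
    return (bool(os.environ.get("REMOTE_DATA_LOAD"))
            or pytest_config.getoption("--testing_remote_cluster",
                                       default=False)
            or bool(os.environ.get("IMPALA_REMOTE_URL")))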
There were a few lifecycle headaches with how
ImpalaTestClusterProperties is used:
* common.environ is imported from conftest, which means that
the top-level code in the file runs *before* pytest
command-line arguments have been registered and parsed.
* ImpalaTestClusterProperties is used by various code,
like build_flavor_timeout(), which runs before pytest
command-line arguments have been parsed.
* ImpalaTestClusterProperties is called from non-pytest
scripts like start-impala-cluster.py, so the command-line
arguments are not available.
I dealt with the above challenges by making a few changes
to do the detection later:
* Lazily initializing a singleton ImpalaTestClusterProperties.
This was not strictly necessary but makes the whole problem
less sensitive to import order and module dependencies.
* Adding cluster_properties fixture to make ImpalaTestClusterProperties
available in tests without additional boilerplate.
* Removing the caching of the local/remote build calculation.
ImpalaTestClusterProperties is instantiated outside of python
tests, but is_remote_cluster() is only called from python tests,
so if we check flags in is_remote_cluster() we'll get the
right results reliably.
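The lazy singleton and the fixture look roughly like the following
sketch (names follow this commit; the import path and other details
are illustrative):

  import pytest
  # Existing class; the import path is an assumption for this sketch.
  from tests.common.environ import ImpalaTestClusterProperties

  _cluster_properties = None

  def impala_test_cluster_properties():
    # Create the instance on first use, so detection runs after the
    # pytest command-line options have been parsed.
    global _cluster_properties
    if _cluster_properties is None:
      _cluster_properties = ImpalaTestClusterProperties()
    return _cluster_properties

  @pytest.fixture
  def cluster_properties():
    # Hands the lazily created singleton to tests without extra
    # boilerplate.
    return impala_test_cluster_properties()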
As a workaround to unblock remote tests, also assume catalog_v1 if
accessing the web UI fails.
Testing:
Ran core tests against a regular minicluster.
Ran tests against a remote cluster
Change-Id: Ifa6b2a1391f53121d3d7c00c5cf0a57590899ce4
Reviewed-on: http://gerrit.cloudera.org:8080/13386
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A new enum value LZ4_BLOCKED was added to the THdfsCompression enum, to
distinguish it from the existing LZ4 codec. LZ4_BLOCKED codec represents
the block compression scheme used by Hadoop. It's similar to
SNAPPY_BLOCKED as far as the block format is concerned, with the only
difference being the codec used for compression and decompression.
Added Lz4BlockCompressor and Lz4BlockDecompressor classes for
compressing and decompressing parquet data using Hadoop's
lz4 block compression scheme.
The Lz4BlockCompressor treats the input as a single block and
generates a compressed block with the following layout:
<4 byte big endian uncompressed size>
<4 byte big endian compressed size>
<lz4 compressed block>
The hdfs parquet table writer should call the Lz4BlockCompressor
using the ideal input size (unit of compression in parquet is a page),
and so the Lz4BlockCompressor does not further break down the input
into smaller blocks.
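A Python sketch of that framing (illustrative only; the real code is
the C++ Lz4BlockCompressor, and the third-party lz4 package stands in
for Impala's LZ4 codepath):

  import struct
  import lz4.block  # third-party 'lz4' package, for illustration only

  def lz4_blocked_compress(page):
    # One parquet page in, one Hadoop-style block out:
    # <4 byte BE uncompressed size><4 byte BE compressed size><lz4 data>
    compressed = lz4.block.compress(page, store_size=False)
    return struct.pack(">II", len(page), len(compressed)) + compressed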
The Lz4BlockDecompressor on the other hand should be compatible with
blocks written by Impala and other engines in the Hadoop ecosystem.
It can decompress data in the following format:
<4 byte big endian uncompressed size>
<4 byte big endian compressed size>
<lz4 compressed block>
...
<4 byte big endian compressed size>
<lz4 compressed block>
...
<repeated until the uncompressed size from the outer block is consumed>
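Again purely as an illustration, walking that layout for a single
outer block could look like this (the real code is the C++
Lz4BlockDecompressor; LZ4 decoding of the inner blocks is left out):

  import struct

  def split_lz4_blocked(buf):
    # Returns the outer uncompressed size and the raw inner LZ4
    # blocks; decoding the inner blocks is left to an LZ4 codec.
    (total_uncompressed,) = struct.unpack_from(">I", buf, 0)
    offset, blocks = 4, []
    while offset < len(buf):
      (compressed_size,) = struct.unpack_from(">I", buf, offset)
      offset += 4
      blocks.append(buf[offset:offset + compressed_size])
      offset += compressed_size
    return total_uncompressed, blocks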
Externally users can now set the lz4 codec for parquet using:
set COMPRESSION_CODEC=lz4
This gets translated into LZ4_BLOCKED codec for the
HdfsParquetTableWriter. Similarly, when reading lz4 compressed parquet
data, the LZ4_BLOCKED codec is used.
Testing:
- Added unit tests for LZ4_BLOCKED in decompress-test.cc
- Added unit tests for Hadoop compatibility in decompress-test.cc,
basically being able to decompress an outer block with multiple inner
blocks (the Lz4BlockDecompressor description above)
- Added interoperability tests for Hive and Impala for all parquet
codecs. New test added to
tests/custom_cluster/test_hive_parquet_codec_interop.py
Change-Id: Ia6850a39ef3f1e0e7ba48e08eef1d4f7cbb74d0c
Reviewed-on: http://gerrit.cloudera.org:8080/13582
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>