94 Commits

Author SHA1 Message Date
Joe McDonnell
48b38810e8 IMPALA-14465: Unset HEAPCHECK when custom cluster tests restart Kudu
Custom cluster tests like TestKuduHMSIntegration restart the Kudu
service with custom startup flags. On Redhat8 ARM64, these tests
have been failing due to Kudu being unresponsive after this
restart. Debugging showed that Kudu was stuck early in startup.
This only reproduced via the custom cluster tests and never via
regular minicluster startup.

When custom cluster tests restart Kudu, the script to restart
Kudu inherits environment variables from the test runner. It
turns out that the HEAPCHECK environment variable (even when
empty) causes Kudu to get stuck during startup on Redhat8
ARM64 after the recent toolchain update.

As a short-term fix, this unsets HEAPCHECK when restarting the
Kudu service for these tests. There will need to be further
investigation / cleanup beyond this.
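
A minimal sketch of the workaround (the restart command below is a placeholder,
not the actual script invocation used by the tests):

```
# Hypothetical sketch of unsetting HEAPCHECK before restarting Kudu; the real
# restart script and its arguments in the Impala test framework may differ.
import os
import subprocess


def restart_kudu_without_heapcheck(restart_cmd):
  """Run the Kudu restart command with HEAPCHECK removed from the environment."""
  env = dict(os.environ)
  # Remove HEAPCHECK even if it is set to an empty string; an empty value is
  # enough to hang Kudu startup on Redhat8 ARM64.
  env.pop("HEAPCHECK", None)
  subprocess.check_call(restart_cmd, env=env)


# Example usage with a placeholder command:
# restart_kudu_without_heapcheck(["./restart-kudu.sh", "--with-custom-flags"])
```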

Testing:
 - Ran the Kudu custom cluster tests on Redhat8 ARM64 and
   on Ubuntu 20 x86_64

Change-Id: I51513e194d9e605df199672231b412fae40343af
Reviewed-on: http://gerrit.cloudera.org:8080/23467
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-09-26 04:45:22 +00:00
jasonmfehr
0fe8de0f3f IMPALA-14401: Deflake/Improve OpenTelemetry Tracing Tests
Contains the following improvements to the custom cluster tests that
exercise Impala queries as OpenTelemetry traces:

1. Supporting code for asserting traces was moved to
   'tests/util/otel_trace.py'. The moved code was modified to remove
   all references to 'self'. Since this code used
   'self.assert_impalad_log_contains', it had to be modified so the
   caller provides the correct log file path to search. The
   '__find_span_log' function was updated to call a new generic file
   grep function to run the necessary log file search regex. All
   other code was moved unmodified.

2. Classes 'TestOtelTraceSelectsDMLs' and 'TestOtelTraceDDLs'
   contained a total of 11 individual tests that used the
   'unique_database' fixture. When this fixture is used in a test, it
   results in two DDLs being run before the test to drop/create the
   database and one DDL being run after the test to drop the database.
   These classes now create a test database once during 'setup_class'
   and drop it once during 'teardown_class' because creating a new
   database for each test was unnecessary. This change dropped test
   execution time from about 97 seconds to about 77 seconds.

3. Each test now has comments describing what the test is asserting.

4. The unnecessary sleep in 'test_query_exec_fail' was removed, saving
   five seconds of test execution time.

5. New test 'test_dml_insert_fail' added. Previously, the situation
   where an insert DML failed was not tested. The test passed without
   any changes to backend code.

6. Test 'test_ddl_createtable_fail' is greatly simplified by using a
   debug action to fail the query instead of multiple parallel
   queries where one dropped the database that the other was inserting
   into. The simplified setup eliminated test flakiness caused by
   timing differences and sped up test execution by about 5 seconds.

7. Fixed test flakiness caused by timing issues. Depending on when
   the close process was initiated, span events sometimes appear in
   the QueryExecution span and sometimes in the Close span, and the
   test assertions cannot handle both cases. All span event
   assertions for the Close span were removed. IMPALA-14334 will fix
   these assertions.

8. The function 'query_id_from_ui', which retrieves the query profile
   using the Impala debug UI, now makes multiple attempts to retrieve
   the query. In slower test environments, such as ASAN builds, the
   query may not yet be available when the function is first called,
   which used to cause tests to fail. Adding retries eliminates this
   flakiness (a sketch of the retry pattern follows this list).
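
A minimal sketch of the retry pattern described in item 8 (the helper name,
arguments, and timings below are illustrative, not the actual test code):

```
# Illustrative retry loop; the real query_id_from_ui helper may differ.
import time


def get_query_profile_with_retries(fetch_profile, query_id, attempts=5, delay_s=1):
  """Call fetch_profile(query_id) until it returns a profile or attempts run out."""
  for _ in range(attempts):
    profile = fetch_profile(query_id)
    if profile is not None:
      return profile
    # The query may not be visible yet on slow (e.g. ASAN) builds; wait and retry.
    time.sleep(delay_s)
  raise AssertionError("query %s not found after %d attempts" % (query_id, attempts))
```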

Testing accomplished by running tests in test_otel_trace.py both
locally and in a full Jenkins build.

Generated-by: Github Copilot (Claude Sonnet 3.7)
Change-Id: I0c3e0075df688c7ae601c6f2e5743f56d6db100e
Reviewed-on: http://gerrit.cloudera.org:8080/23385
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-09-15 23:21:29 +00:00
Riza Suminto
28cff4022d IMPALA-14333: Run impala-py.test using Python3
Running exhaustive tests with env var IMPALA_USE_PYTHON3_TESTS=true
reveals some tests that require adjustment. This patch makes such
adjustments, which mostly revolve around encoding differences and the
string vs bytes types in Python3. This patch also switches the default to
run pytest with Python3 by setting IMPALA_USE_PYTHON3_TESTS=true. The
following are the details:

Change the hash() function in conftest.py to crc32() to produce a
deterministic hash. Hash randomization is enabled by default since
Python 3.3 (see
https://docs.python.org/3/reference/datamodel.html#object.__hash__).
This causes test sharding (like --shard_tests=1/2) to produce an
inconsistent set of tests per shard. Always restart the minicluster during
custom cluster tests if the --shard_tests argument is set, because the
test order may change and affect test correctness, depending on whether
the test runs on a fresh minicluster or not.
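
A minimal sketch of deterministic sharding with crc32 instead of the built-in
hash() (illustrative only, not the exact conftest.py code):

```
# Illustrative deterministic sharding; the exact conftest.py logic may differ.
import zlib


def shard_for(test_name, num_shards):
  """Map a test name to a shard deterministically across pytest processes."""
  # hash() is randomized per process since Python 3.3, so use crc32 instead.
  return zlib.crc32(test_name.encode("utf-8")) % num_shards


# With --shard_tests=1/2, a test would belong to one of the two shards:
assert shard_for("test_kudu_alter_table", 2) in (0, 1)
```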

Moved one test case from delimited-latin-text.test to
test_delimited_text.py for easier binary comparison.

Add bytes_to_str() as a utility function to decode bytes in Python3.
This is often needed when inspecting the return value of
subprocess.check_output() as a string.
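
An illustrative version of such a helper (the actual utility in the test
framework may differ in name or behavior):

```
# Illustrative bytes_to_str() helper for Python 3.
import subprocess


def bytes_to_str(value, encoding="utf-8"):
  """Decode bytes to str under Python 3; pass str through unchanged."""
  if isinstance(value, bytes):
    return value.decode(encoding)
  return value


# subprocess.check_output() returns bytes in Python 3, so decode before string ops:
output = bytes_to_str(subprocess.check_output(["echo", "hello"]))
assert "hello" in output
```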

Implement DataTypeMetaclass.__lt__ to substitute
DataTypeMetaclass.__cmp__ that is ignored in Python3 (see
https://peps.python.org/pep-0207/).
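
A minimal sketch of a metaclass providing __lt__ so that classes can be sorted
under Python 3 (illustrative; the real DataTypeMetaclass likely orders its
classes differently):

```
# Python 3 ignores __cmp__ (PEP 207), so ordering must come from rich comparisons
# such as __lt__. Names below are made up for the example.
class OrderedByNameMeta(type):
  def __lt__(cls, other):
    return cls.__name__ < other.__name__


class Int(metaclass=OrderedByNameMeta): pass
class String(metaclass=OrderedByNameMeta): pass

assert sorted([String, Int]) == [Int, String]  # sorted() only needs __lt__
```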

Fix WEB_CERT_ERR difference in test_ipv6.py.

Fix trivial integer parsing in test_restart_services.py.

Fix various encoding issues in test_saml2_sso.py,
test_shell_commandline.py, and test_shell_interactive.py.

Change timeout in Impala.for_each_impalad() from sys.maxsize to 2^31-1.

Switch to binary comparison in test_iceberg.py where needed.

Specify text mode when calling tempfile.NamedTemporaryFile().

Simplify create_impala_shell_executable_dimension to skip testing the dev
and python2 impala-shell when IMPALA_USE_PYTHON3_TESTS=true. The reason
is that several UTF-8 related tests in test_shell_commandline.py break
in the Python3 pytest + Python2 impala-shell combo. This skipping already
happens automatically on build OSes without system Python2 available, like
RHEL9 (the IMPALA_SYSTEM_PYTHON2 env var is empty).

Removed unused vector argument and fixed some trivial flake8 issues.

Several tests require logic modifications due to intermittent issues in
Python3 pytest. These include:

Add _run_query_with_client() in test_ranger.py to allow reusing a single
Impala client for running several queries. Ensure clients are closed
when the test is done. Mark several tests in test_ranger.py with
SkipIfFS.hive because they run queries through beeline + HiveServer2,
but the Ozone and S3 build environments do not start HiveServer2 by
default.

Increase the sleep period from 0.1 to 0.5 seconds per iteration in
test_statestore.py and mark TestStatestore to execute serially. This is
because TServer appears to shut down more slowly when run concurrently
with other tests. Handle the deprecation of Thread.setDaemon() as well.

Always set force_restart=True for each test method in TestLoggingCore,
TestShellInteractiveReconnect, and TestQueryRetries to prevent them from
reusing the minicluster from a previous test method. Some of these tests
destroy the minicluster (kill impalad) and will produce minidumps if the
metrics verifier for the next test fails to detect a healthy minicluster
state.

Testing:
Pass exhaustive tests with IMPALA_USE_PYTHON3_TESTS=true.

Change-Id: I401a93b6cc7bcd17f41d24e7a310e0c882a550d4
Reviewed-on: http://gerrit.cloudera.org:8080/23319
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-09-03 10:01:29 +00:00
jasonmfehr
789991c6cc IMPALA-13237: [Patch 8] - OpenTelemetry Traces for DML/DDL Queries and Handle Leading Comments
Trace DML/DDL Queries
* Adds tracing for alter, compute, create, delete, drop, insert,
  invalidate metadata, and with queries.
* Stops tracing beeswax queries since that protocol is deprecated.
* Adds Coordinator attribute to Init and Root spans for identifying
  where the query is running.

Comment Handling
* Corrects handling of leading comments, both inline and full line.
  Previously, queries with comments before the first keyword were
  always ignored.
* Adds be ctest tests for determining whether or not a query should
  be traced.

General Improvements
* Handles the case where the first query keyword is followed by a
  newline character or an inline comment (with or without spaces
  between).
* Corrects traces for errored/cancelled queries. These cases
  short-circuit the normal query processing code path and have to be
  handled accordingly.
* Ends the root span when the query ends instead of waiting for the
  ClientRequestState to go out of scope. This change removes
  use-after-free issues caused by reading from ClientRequestState
  when the SpanManager went out of scope during that object's dtor.
* Simplifies minimum TLS version handling because the validators
  on the ssl_minimum_version flag eliminate invalid values that
  previously had to be accounted for.
* Removes the unnecessary otel_trace_enabled() function.
* Fixes IMPALA-14314 by waiting for the full trace to be written to
  the output file before asserting that trace.

Testing
* Full test suite passed.
* ASAN/TSAN builds passed.
* Adds new ctest test.
* Adds custom cluster tests to assert traces for the new supported
  query types.
* Adds custom cluster tests to assert traces for errored and
  cancelled queries.

Generated-by: Github Copilot (Claude Sonnet 3.7)
Change-Id: Ie9e83d7f761f3d629f067e0a0602224e42cd7184
Reviewed-on: http://gerrit.cloudera.org:8080/23279
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2025-09-03 04:38:36 +00:00
jasonmfehr
2ad6f818a5 IMPALA-13237: [Patch 5] - Implement OpenTelemetry Traces for Select Queries Tracking
Adds representation of Impala select queries using OpenTelemetry
traces.

Each Impala query is represented as its own individual OpenTelemetry
trace. The one exception is retried queries which will have an
individual trace for each attempt. These traces consist of a root span
and several child spans. Each child span has the root as its parent.
No child span has another child span as its parent. Each child span
represents one high-level query lifecycle stage. Each child span also
has span attributes that further describe the state of the query.

Child spans:
  1. Init
  2. Submitted
  3. Planning
  4. Admission Control
  5. Query Execution
  6. Close

Each child span contains a mix of universal attributes (available on
all spans) and query phase specific attributes. For example, the
"ErrorMsg" attribute, present on all child spans, is the error
message (if any) at the end of that particular query phase. One
example of a child span specific attribute is "QueryType" on the
Planning span. Since query type is first determined during query
planning, the "QueryType" attribute is present on the Planning span
and has a value of "QUERY" (since only selects are supported).

Since queries can run for lengthy periods of time, the Init span
communicates the beginning of a query along with global query
attributes. For example, span attributes include query id, session
id, sql, user, etc.

Once the query has closed, the root span is closed.

Testing accomplished with new custom cluster tests.

Generated-by: Github Copilot (GPT-4.1, Claude Sonnet 3.7)
Change-Id: Ie40b5cd33274df13f3005bf7a704299ebfff8a5b
Reviewed-on: http://gerrit.cloudera.org:8080/22924
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-08-12 04:11:06 +00:00
Zoltan Borok-Nagy
438461db9e IMPALA-14138: Manually disable block location loading via Hadoop config
For storage systems that support block location information (HDFS,
Ozone) we always retrieve it with the assumption that we can use it for
scheduling, to do local reads. But it's also typical that Impala is not
co-located with the storage system, not even in on-prem deployments.
E.g. when Impala runs in containers, and even if they are co-located,
we don't try to figure out which container runs on which machine.

In such cases we should not reach out to the storage system to collect
file information because it can be very expensive for large tables and
we won't benefit from it at all. Since there is currently no easy way
to tell if Impala is co-located with the storage system, this patch
adds configuration options to disable block location retrieval during
table loading.

It can be disabled globally via Hadoop Configuration:

'impala.preload-block-locations-for-scheduling': 'false'

We can restrict it to filesystem schemes, e.g.:

'impala.preload-block-locations-for-scheduling.scheme.hdfs': 'false'

When multiple storage systems are configured with the same scheme, we
can still control block location loading based on authority, e.g.:

'impala.preload-block-locations-for-scheduling.authority.mycluster': 'false'

The latter only disables block location loading for URIs like
'hdfs://mycluster/warehouse/tablespace/...'

If block location loading is disabled by any of the switches, it cannot
be re-enabled by another, i.e. the most restrictive setting prevails.
E.g:
  disable scheme 'hdfs', enable authority 'mycluster'
     ==> hdfs://mycluster/ is still disabled

  disable globally, enable scheme 'hdfs', enable authority 'mycluster'
     ==> hdfs://mycluster/ is still disabled, as everything else is.
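
A minimal Python sketch of the "most restrictive setting prevails" rule
(illustrative only; the real check lives in the Java catalog code and reads
these keys from the Hadoop Configuration):

```
# Illustrative decision logic for the examples above.
PREFIX = "impala.preload-block-locations-for-scheduling"


def preload_block_locations(conf, scheme, authority):
  """Return False if any applicable switch disables block location loading."""
  keys = [PREFIX,
          "%s.scheme.%s" % (PREFIX, scheme),
          "%s.authority.%s" % (PREFIX, authority)]
  # The most restrictive setting wins: a single 'false' disables loading.
  return all(conf.get(key, "true").lower() != "false" for key in keys)


conf = {"%s.scheme.hdfs" % PREFIX: "false",
        "%s.authority.mycluster" % PREFIX: "true"}
assert preload_block_locations(conf, "hdfs", "mycluster") is False
```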

Testing:
 * added unit tests for FileSystemUtil
 * added unit tests for the file metadata loaders
 * custom cluster tests with custom Hadoop configuration

Change-Id: I1c7a6a91f657c99792db885991b7677d2c240867
Reviewed-on: http://gerrit.cloudera.org:8080/23175
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-07-17 13:08:15 +00:00
Riza Suminto
0b1a32fad8 IMPALA-13850 (part 4): Implement in-place reset for CatalogD
This patch improves the availability of CatalogD under huge INVALIDATE
METADATA operations. Previously, CatalogServiceCatalog.reset() held
versionLock_.writeLock() for the whole reset duration. When the number
of databases, tables, or functions is large, this write lock can be held
for a long time, preventing any other catalog operation from proceeding.

This patch improves the situation by:
1. Making CatalogServiceCatalog.reset() rebuild dbCache_ in place and
   occasionally release the write lock between rebuild stages.
2. Fetching database, table, and function metadata from the MetaStore in
   the background using an ExecutorService. A new catalog_reset_max_threads
   flag controls the number of threads used for the parallel fetch.

In order to do so, lexicographic order must be enforced during reset(),
and all Db invalidation within a single stage must complete before the
write lock is released. Stages should run in approximately the same
amount of time. A catalog operation over a database must ensure that no
reset operation is currently running, or that the database name is
lexicographically less than the current database-under-invalidation.

This patch adds CatalogResetManager to do the background metadata
fetching and to provide helper methods that facilitate waiting for reset
progress. CatalogServiceCatalog must hold the versionLock_.writeLock()
before calling most CatalogResetManager methods.

These are the methods in the CatalogServiceCatalog class that must wait
for CatalogResetManager.waitOngoingMetadataFetch():

addDb()
addFunction()
addIncompleteTable()
addTable()
invalidateTableIfExists()
removeDb()
removeFunction()
removeTable()
renameTable()
replaceTableIfUnchanged()
tryLock()
updateDb()
InvalidateAwareDbSnapshotIterator.hasNext()

A concurrent global IM (INVALIDATE METADATA) must wait until the
currently running global IM completes. The waiting happens by calling
waitFullMetadataFetch().

CatalogServiceCatalog.getAllDbs() gets a snapshot of dbCache_ values at a
point in time. With this patch, it is now possible that some Db in this
snapshot may be removed from dbCache_ by a concurrent reset(). Callers
that care about snapshot integrity, like
CatalogServiceCatalog.getCatalogDelta(), should be careful when iterating
the snapshot. They must iterate in lexicographic order, similar to
reset(), and make sure not to go beyond the current
database-under-invalidation. They must also skip the Db currently being
inspected if Db.isRemoved() is true. The helper class
InvalidateAwareDbSnapshot was added for this kind of iteration.
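
A minimal Python sketch of the snapshot iteration rule described above
(illustrative; the real logic is the Java InvalidateAwareDbSnapshot helper and
the names here are made up):

```
# Illustrative sketch of iterating a dbCache_ snapshot in lexicographic order
# while a concurrent reset() is running.
def iterate_db_snapshot(snapshot_dbs, db_under_invalidation):
  """Yield dbs that are safe to read while a reset() is still in progress."""
  for db in sorted(snapshot_dbs, key=lambda d: d["name"]):
    # Do not go beyond the database the reset is currently invalidating.
    if db_under_invalidation is not None and db["name"] >= db_under_invalidation:
      break
    if db["is_removed"]:  # skip dbs removed from dbCache_ by the concurrent reset()
      continue
    yield db
```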

Override CatalogServiceCatalog.getDb() and
CatalogServiceCatalog.getDbs() to wait until the first metadata reset
completes or the looked-up Db is found in the cache.

Expand test_restart_catalogd_twice to test_restart_legacy_catalogd_twice
and test_restart_local_catalogd_twice. Update
CustomClusterTestSuite.wait_for_wm_init_complete() to correctly pass
timeout values to the helper methods that it calls. Reduce cluster_size
from 10 to 3 in a few tests of test_workload_mgmt_init.py to avoid
flakiness.

Fixed HMS connection leak between tests in AuthorizationStmtTest (see
IMPALA-8073).

Testing:
- Pass exhaustive tests.

Change-Id: Ib4ae2154612746b34484391c5950e74b61f85c9d
Reviewed-on: http://gerrit.cloudera.org:8080/22640
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
2025-07-09 14:05:04 +00:00
Csaba Ringhofer
5cca1aa9e5 IMPALA-13820: add ipv6 support for webui/hs2/hs2-http/beeswax
Main changes:
- added flag external_interface to override the hostname for the
  beeswax/hs2/hs2-http ports to allow testing ipv6 on these
  interfaces without forcing ipv6 on internal communication
- compile Squeasel with USE_IPV6 to allow ipv6 on the webui (the webui
  interface can be configured with the existing flag webserver_interface)
- fixed the handling of [<ipv6addr>]:<port> style addresses in
  impala-shell (e.g. [::1]:21050) and the test framework (see the
  parsing sketch after this list)
- improved handling of custom clusters in the test framework to
  allow the webui/ImpalaTestSuite clients to work with non-standard
  settings (also fixes these clients with SSL)
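
A minimal sketch of parsing a bracketed IPv6 host:port address such as
"[::1]:21050" (illustrative; the actual impala-shell and test framework code
may differ):

```
# Illustrative host:port splitting that handles bracketed IPv6 addresses.
def split_host_port(address):
  """Split 'host:port' or '[ipv6]:port' into (host, port)."""
  if address.startswith("["):
    host, _, rest = address[1:].partition("]")
    port = rest.lstrip(":")
  else:
    host, _, port = address.rpartition(":")
  return host, int(port)


assert split_host_port("[::1]:21050") == ("::1", 21050)
assert split_host_port("localhost:21050") == ("localhost", 21050)
```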

Whether ipv4, ipv6, or dual stack is used can be configured by setting
the interface to bind to with the webserver_interface and
external_interface flags. The Thrift server behind hs2/hs2-http/beeswax
only accepts a single host name and uses the first address
returned by getaddrinfo() that it can successfully bind to. This
means that unless an ipv6 address is used (like ::1) the behavior
will depend on the order of addresses returned by getaddrinfo():
63b7a263fc/lib/cpp/src/thrift/transport/TServerSocket.cpp (L481)
For dual stack the only current option is to bind to "::",
as the Thrift server can only listen on a single socket.

Testing:
- added custom cluster tests for ipv6 only/dual interface
  with and without SSL
- manually tested in dual stack environment with client on a
  different host
- among clients impala-shell and impyla are tested, but not
  JDBC/ODBC
- no tests yet on truly ipv6 only environment, as internal
  communication (e.g. krpc) is not ready for ipv6

To test manually the dev cluster can be started with ipv6 support:
dual mode:
bin/start-impala-cluster.py --impalad_args="--external_interface=:: --webserver_interface=::" --catalogd_args="--webserver_interface=::" --state_store_args="--webserver_interface=::"

ipv6 only:
bin/start-impala-cluster.py --impalad_args="--external_interface=::1 --webserver_interface=::1" --catalogd_args="--webserver_interface=::1" --state_store_args="--webserver_interface=::1"

Change-Id: I51ac66c568cc9bb06f4a3915db07a53c100109b6
Reviewed-on: http://gerrit.cloudera.org:8080/22527
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-06-21 14:00:31 +00:00
Mihaly Szjatinya
e0cb533c25 IMPALA-13912: Use SHARED_CLUSTER_ARGS in more custom cluster tests
In addition to IMPALA-13503, which allowed having a single cluster
running for the entire test class, this attempts to minimize restarts
between the existing tests without modifying any of their code.

This changeset saves the command line with which
'start-impala-cluster.py' has been run and skips restarting if the
command line for the next test is the same.
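
A minimal sketch of the skip-if-same-command-line idea (names are illustrative;
the actual bookkeeping lives in the custom cluster test suite):

```
# Illustrative sketch of skipping a cluster restart when the start command line
# is unchanged.
class ClusterRestarter(object):
  def __init__(self, start_cluster_fn):
    self._start_cluster = start_cluster_fn
    self._last_cmdline = None

  def ensure_cluster(self, cmdline, force_restart=False):
    """Restart only when the command line changed or a restart is forced."""
    if force_restart or cmdline != self._last_cmdline:
      self._start_cluster(cmdline)
      self._last_cmdline = cmdline
```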

Some tests, however, do require a restart due to the specific metrics
being tested. Such tests set the 'force_restart' flag within the
'with_args' decorator. NOTE: running the tests in a different order may
reveal more tests like this, resulting in test failures.

Experimentally, this results in ~150 fewer restarts, mostly coming from
restarts between tests. As for restarts between different variants of
the same test, most of the cluster tests are restricted to a single
variant, although multi-variant tests occur occasionally.

Change-Id: I7c9115d4d47b9fe0bfd9dbda218aac2fb02dbd09
Reviewed-on: http://gerrit.cloudera.org:8080/22901
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-06-19 17:48:25 +00:00
Csaba Ringhofer
f98b697c7b IMPALA-13929: Make 'functional-query' the default workload in tests
This change adds get_workload() to ImpalaTestSuite and removes it
from all test suites that already returned 'functional-query'.
get_workload() is also removed from CustomClusterTestSuite which
used to return 'tpch'.

All other changes besides impala_test_suite.py and
custom_cluster_test_suite.py are just mass removals of
get_workload() functions.

The behavior is only changed in custom cluster tests that didn't
override get_workload(). By returning 'functional-query' instead
of 'tpch', exploration_strategy() will no longer return 'core' in
'exhaustive' test runs. See IMPALA-3947 on why workload affected
exploration_strategy. An example for affected test is
TestCatalogHMSFailures which was skipped both in core and exhaustive
runs before this change.

get_workload() functions that return a different workload than
'functional-query' are not changed - it is possible that some of
these also don't handle exploration_strategy() as expected, but
individually checking these tests is out of scope in this patch.

Change-Id: I9ec6c41ffb3a30e1ea2de773626d1485c69fe115
Reviewed-on: http://gerrit.cloudera.org:8080/22726
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-04-08 07:12:55 +00:00
Zoltan Borok-Nagy
bd3486c051 IMPALA-13586: Initial support for Iceberg REST Catalogs
This patch adds initial support for Iceberg REST Catalogs. This means
now it's possible to run an Impala cluster without the Hive Metastore,
and without the Impala CatalogD. Impala Coordinators can directly
connect to an Iceberg REST server and fetch metadata for databases and
tables from there. The support is read-only, i.e. DDL and DML statements
are not supported yet.

This was initially developed in the context of a company Hackathon
program, i.e. it was a team effort that I squashed into a single commit
and polished the code a bit.

The Hackathon team members were:
* Daniel Becker
* Gabor Kaszab
* Kurt Deschler
* Peter Rozsa
* Zoltan Borok-Nagy

The Iceberg REST Catalog support can be configured via a Java properties
file; its location can be specified via:
 --catalog_config_dir: Directory of configuration files

Currently only one configuration file can be in the directory, as we only
support a single Catalog at a time. The following properties are mandatory
in the config file:
* connector.name=iceberg
* iceberg.catalog.type=rest
* iceberg.rest-catalog.uri

The first two properties can only be 'iceberg' and 'rest' for now; they
are needed for future extensibility.

Moreover, Impala Daemons need to specify the following flags to connect
to an Iceberg REST Catalog:
 --use_local_catalog=true
 --catalogd_deployed=false

Testing
* e2e tests added to test basic functionality against a custom-built
  Iceberg REST server that delegates to HadoopCatalog under the hood
* Further testing, e.g. Ranger tests, is expected in subsequent
  commits

TODO:
* manual testing against Polaris / Lakekeeper, we could add automated
  tests in a later patch

Change-Id: I1722b898b568d2f5689002f2b9bef59320cb088c
Reviewed-on: http://gerrit.cloudera.org:8080/22353
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-04-02 20:04:12 +00:00
Riza Suminto
ea8f74a6ac IMPALA-13861: Standardize workload management tests
This patch standardizes tests against workload management tables
(sys.impala_query_log and sys.impala_query_live) to use a common
superclass named WorkloadManagementTestSuite. The setup_method of this
superclass waits for workload management init completion
(wait_for_wm_init_complete()), while the teardown_method waits until
impala-server.completed-queries.queued metric reaches
0 (wait_for_wm_idle()).

test_query_log.py and test_workload_mgmt_sql_details.py are refactored
to extend from WorkloadManagementTestSuite. Tests to assert the query
log table flush behavior are grouped together in TestQueryLogTableFlush.
test_workload_mgmt_sql_details.py::TestWorkloadManagementSQLDetails now
uses 1 minicluster instance for all tests.

test_workload_mgmt_init.py does not extend from
WorkloadManagementTestSuite because it tests cluster start and
restart scenarios. This patch only adds wait_for_wm_idle() in
teardown_method where it makes sense to do so.

test_query_live.py does not extend from WorkloadManagementTestSuite
because most of its test methods require a long
--query_log_write_interval_s so that DML queries from the workload
management worker do not disturb sys.impala_query_live.

The workload_mgmt parameter in CustomClusterTestSuite.with_args() is
standardized to set up the appropriate default flags in cluster_setup()
rather than passing it down to _start_impala_cluster():
IMPALAD_ARGS
  --enable_workload_mgmt=true --query_log_write_interval_s=1 \
  --shutdown_grace_period_s=0 --shutdown_deadline_s=60
and CATALOGD_ARGS
  --enable_workload_mgmt=true

Note that the IMPALAD_ARGS and CATALOGD_ARGS flags added by the
workload_mgmt and impalad_graceful_shutdown parameters can still be
overridden with different values by explicitly adding them to the
impalad_args and catalogd_args parameters. Setting workload_mgmt=True now
automatically enables graceful shutdown for the test, so
impalad_graceful_shutdown=True has been removed.
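
A hedged usage sketch of the standardized parameter (the module path, class
name, and override value below are assumptions made for the example; the
defaults applied by workload_mgmt=True are the flags listed above):

```
# Illustrative test using the standardized workload_mgmt parameter.
from tests.common.custom_cluster_test_suite import CustomClusterTestSuite  # path assumed


class TestMyWorkloadMgmtFeature(CustomClusterTestSuite):

  @CustomClusterTestSuite.with_args(
      workload_mgmt=True,  # applies the default IMPALAD_ARGS / CATALOGD_ARGS above
      impalad_args="--query_log_write_interval_s=5")  # explicit override still wins
  def test_example(self):
    pass
```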

With the beeswax protocol deprecated, this patch also changes the
protocol under test from beeswax to hs2. TestQueryLogTableBeeswax is now
renamed to TestQueryLogTableBasic.

Additionally, print total wait time in wait_for_metric_value().

Testing:
- Run modified tests and pass.

Change-Id: Iecf6452fa963304e263805ebeb017c843d17dd16
Reviewed-on: http://gerrit.cloudera.org:8080/22617
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-03-21 22:31:11 +00:00
Riza Suminto
7ae19b069e IMPALA-13823: Clear existing entry of TMP_DIRS at cluster_setup
An assertion was hit in CustomClusterTestSuite.make_tmp_dir() when
attempting to create a tmp dir that already has a mapping in
CustomClusterTestSuite.TMP_DIRS. This can happen if different custom
cluster test runs declare the same tmp_dir_placeholders and one of the
earlier runs fails to tear down properly, resulting in clear_tmp_dirs()
not being called.

This patch fixes the issue by clearing the existing TMP_DIRS entry
without removing the underlying filesystem path. A WARN log is also
printed about the dirty entry in TMP_DIRS.

Testing:
- Manually commented out the clear_tmp_dirs() call in cluster_teardown()
  and ran TestQueryLogTableBufferPool.test_select. Confirmed that all 4 of
  its test runs complete, the warning logs are printed to
  logs/custom_cluster_tests/results/TEST-impala-custom-cluster.xml,
  and the tmp dir stays in logs/custom_cluster_tests/.

Change-Id: I3f528bb155eb3cf4dfa58a6a23feb438809556bc
Reviewed-on: http://gerrit.cloudera.org:8080/22572
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-03-07 00:56:29 +00:00
jasonmfehr
ea989dfb28 IMPALA-13815: Fix Flaky Workload Management Tests
The CustomClusterTestSuite.wait_for_wm_init() function checks for two
specific log lines to be logged by the catalog. The first line is
logged when workload management initialization is complete. The
second line is logged when a catalog topic update has been assembled.

However, if workload management initialization is slow, then there
may not be a catalog topic update assembled after the initialization
completes. When this happens, an assertion fails despite the workload
management tables having been properly initialized and loaded by the
catalog.

This patch simplifies the CustomClusterTestSuite.wait_for_wm_init()
function so it waits until the catalogd logs that it has completed
workload management initialization and then checks each coordinator's
local catalog cache for the workload management tables.

The following test suites passed locally and in an ASAN build. These
tests all call the 'wait_for_wm_init' function of
CustomClusterTestSuite.
  * tests/custom_cluster/test_query_live.py
  * tests/custom_cluster/test_query_log.py
  * tests/custom_cluster/test_workload_mgmt_init.py
  * tests/custom_cluster/test_workload_mgmt_sql_details.py

Change-Id: Ieb4c86fa79bb1df000b6241bdd31c7641d807c4f
Reviewed-on: http://gerrit.cloudera.org:8080/22570
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2025-03-05 14:33:32 +00:00
jasonmfehr
aac67a077e IMPALA-13201: System Table Queries Execute When Admission Queues are Full
Queries that run only against in-memory system tables are currently
subject to the same admission control process as all other queries.
Since these queries do not use any resources on executors, admission
control does not need to consider the state of executors when
deciding to admit these queries.

This change adds a boolean configuration option 'onlyCoordinators'
to the fair-scheduler.xml file for specifying that a request pool
applies only to the coordinators. When a query is submitted to a
coordinator-only request pool, no executors are required to be
running. Instead, all fragment instances are executed exclusively on
the coordinators.
A new member was added to the ClusterMembershipMgr::Snapshot struct
to hold the ExecutorGroup of all coordinators. This object is kept up
to date by processing statestore messages and is used when executing
queries that either require the coordinators (such as queries against
sys.impala_query_live) or that use an only coordinators request pool.

Testing was accomplished by:
1. Adding cluster membership manager ctests to assert cluster
   membership manager correctly builds the list of non-quiescing
   coordinators.
2. RequestPoolService JUnit tests to assert the new optional
   <onlyCoords> config in the fair scheduler xml file is correctly
   parsed.
3. ExecutorGroup ctests modified to assert the new function.
4. Custom cluster admission controller tests to assert that queries with
   a coordinator-only request pool run only on the active coordinators.

Change-Id: I5e0e64db92bdbf80f8b5bd85d001ffe4c8c9ffda
Reviewed-on: http://gerrit.cloudera.org:8080/22249
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-02-14 04:27:11 +00:00
jasonmfehr
d3b6cbcc20 IMPALA-13201: Remove Unused Parameter from Test retry Function
The custom cluster tests utilize the retry() function defined in
retry.py. This function takes as input another function to do the
assertions. This assertion function used to have a single boolean
parameter indicating if the retry was on its last attempt. In
actuality, this boolean was not used and thus caused flake8 failures.

This change removes this unused parameter from the assertion function
passed in to the retry function.

Change-Id: I1bce9417b603faea7233c70bde3816beed45539e
Reviewed-on: http://gerrit.cloudera.org:8080/22452
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-02-07 05:46:06 +00:00
Xuebin Su
d7ee509e93 IMPALA-12648: Add KILL QUERY statement
To support killing queries programmatically, this patch adds a new type
of SQL statement, the KILL QUERY statement, to cancel and unregister a
query on any coordinator in the cluster.

A KILL QUERY statement looks like
```
KILL QUERY '123:456';
```
where `123:456` is the query id of the query we want to kill. We follow
syntax from HIVE-17483. For backward compatibility, 'KILL' and 'QUERY'
are added as "unreserved keywords", like 'DEFAULT'. This allows the
three keywords to be used as identifiers.

A user is authorized to kill a query only if the user is an admin or is
the owner of the query. KILL QUERY statements are not affected by
admission control.

Implementation:

Since we don't know in advance which impalad is the coordinator of the
query we want to kill, we need to broadcast the kill request to all the
coordinators in the cluster. Upon receiving a kill request, each
coordinator checks whether it is the coordinator of the query:
- If yes, it cancels and unregisters the query,
- If no, it reports "Invalid or unknown query handle".

Currently, a KILL QUERY statement is not interruptible. IMPALA-13663 is
created for this.

For authorization, this patch adds a custom handler of
AuthorizationException for each statement to allow the exception to be
handled by the backend. This is because we don't know whether the user
is the owner of the query until we reach its coordinator.

To support cancelling child queries, this patch changes
ChildQuery::Cancel() to bypass the HS2 layer so that the session of the
child query will not be added to the connection used to execute the
KILL QUERY statement.

Testing:
- A new ParserTest case is added to test using "unreserved keywords" as
  identifiers.
- New E2E test cases are added for the KILL QUERY statement.
- Added a new dimension in TestCancellation to use the KILL QUERY
  statement.
- Added file tests/common/cluster_config.py and made
  CustomClusterTestSuite.with_args() composable so that common cluster
  configs can be reused in custom cluster tests.

Change-Id: If12d6e47b256b034ec444f17c7890aa3b40481c0
Reviewed-on: http://gerrit.cloudera.org:8080/21930
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
2025-01-22 22:22:54 +00:00
jasonmfehr
cb35dc8769 IMPALA-13536: Workload Management Tests Failing on Init Check
Most of the workload management tests verify that the workload
management process has successfully completed. Part of this
verification ensures a catalog update has propagated the workload
management changes to the coordinators by determining the catalog
version, from the catalogd logs, that contains the workload
management table changes and ensuring that version is in the
coordinator logs.

The test flakiness occurs when multiple catalogd versions are
combined into a later version. Specifically, tests were failing
because the coordinator logs were checked for catalog version X but
the actual version in the coordinator logs was X+1.

The fix for the test flakiness is to allow for the expected catalog
version or any later version.

Change-Id: I9f20a149ab1f45ee3506f098f8594965a24a89d3
Reviewed-on: http://gerrit.cloudera.org:8080/22200
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-12-13 05:44:58 +00:00
jasonmfehr
490f90c65e IMPALA-13536: Fix Workload Management Init Tests Issues
Several problems with the workload management code and
test_workload_mgmt_init.py tests have been uncovered by the Ozone
tests.

* test_create_on_version_1_0_0 - Test comment said it ran on 10
      nodes, test configuration specified 1 node. Fix was to modify
      the test configuration.
* test_create_on_version_1_1_0 - Test comment said it ran on 10
      nodes, test configuration specified 1 node. Fix was to modify
      the test configuration.
* test_invalid_* - All four of these tests run the same internal
      function to execute the test. This internal function was not
      waiting long enough for the expected failure to appear. The
      fixed internal function waits longer for the expected failure.

Additionally, the @CustomClusterTestSuite annotation has a new option
named 'log_symlinks' which, if set to True, resolves all daemon log
symlinks and outputs their actual paths to the log. Failed tests can
then be easily traced to the exact log files for that test.
The existing workload management tests in testdata have been expanded
to also assert the expected table properties are present.

Modified tests passed on Ozone builds both with and without erasure
coding enabled.

Change-Id: Ie3f34088d1d925f30abb63471387e6fdb62b95a7
Reviewed-on: http://gerrit.cloudera.org:8080/22119
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-12-11 01:01:46 +00:00
Michael Smith
2085edbe1c IMPALA-13503: CustomClusterTestSuite for whole class
Allow using CustomClusterTestSuite with a single cluster for the whole
class. This speeds up tests by letting us group together multiple test
cases on the same cluster configuration and only starting the cluster
once.

Updates tuple cache tests as an example of how this can be used. Reduces
test_tuple_cache execution time from 100s to 60s.

Change-Id: I7a08694edcf8cc340d89a0fb33beb8229163b356
Reviewed-on: http://gerrit.cloudera.org:8080/22006
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-11-18 23:57:39 +00:00
Riza Suminto
95f353ac4a IMPALA-13507: Allow disabling glog buffering via with_args fixture
We have plenty of custom_cluster tests that assert against the content of
Impala daemon log files while the process is still running, using
assert_log_contains() and its wrappers. The method's documentation
specifically mentions disabling glog buffering ('-logbuflevel=-1'), but
not all custom_cluster tests do that. This often results in flaky tests
that are hard to triage and often neglected if they do not frequently run
in core exploration.

This patch adds a boolean param 'disable_log_buffering' to
CustomClusterTestSuite.with_args for tests to declare the intention to
inspect log files in a live minicluster. If it is True, the minicluster
is started with '-logbuflevel=-1' for all daemons. If it is False, a
WARNING is logged on any call to assert_log_contains().
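
A minimal sketch of the behavior described above (illustrative only; the real
logic lives inside CustomClusterTestSuite and may differ in detail):

```
# Illustrative sketch of how disable_log_buffering could influence startup args
# and log assertions.
import logging

LOG = logging.getLogger(__name__)


def daemon_args_for_test(base_args, disable_log_buffering):
  """Append -logbuflevel=-1 so glog flushes lines immediately when requested."""
  args = list(base_args)
  if disable_log_buffering:
    args.append("-logbuflevel=-1")
  return args


def check_log_assertion_allowed(disable_log_buffering):
  """Warn when a log assertion runs against a cluster with buffered glog output."""
  if not disable_log_buffering:
    LOG.warning("assert_log_contains() called without disable_log_buffering=True; "
                "the expected log line may not be flushed yet")
```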

There are several complex custom_cluster tests that are left unchanged
and print out such WARNING logs, such as:
- TestQueryLive
- TestQueryLogTableBeeswax
- TestQueryLogOtherTable
- TestQueryLogTableHS2
- TestQueryLogTableAll
- TestQueryLogTableBufferPool
- TestStatestoreRpcErrors
- TestWorkloadManagementInitWait
- TestWorkloadManagementSQLDetails

This patch also fixed some small flake8 issues on modified tests.

There is a sign of flakiness in test_query_live.py where a test query is
submitted to the coordinator and fails because the sys.impala_query_live
table does not exist yet from the coordinator's perspective. This patch
modifies test_query_live.py to wait for a few seconds until
sys.impala_query_live is queryable.

Testing:
- Pass custom_cluster tests in exhaustive exploration.

Change-Id: I56fb1746b8f3cea9f3db3514a86a526dffb44a61
Reviewed-on: http://gerrit.cloudera.org:8080/22015
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-11-05 04:49:05 +00:00
jasonmfehr
7b6ccc644b IMPALA-12737: Query columns in workload management tables.
Adds "Select Columns", "Where Columns", "Join Columns", "Aggregate
Columns", and "OrderBy Columns" to the query profile and the workload
management active/completed queries tables. These fields are
presented as comma separate strings containing the fully qualified
column name in the format database.table_name.column_name. Aggregate
columns include all columns in the order by and having clauses.

Since new columns are being added, the workload management init
process is also being modified to allow for one-way upgrades of the
table schemas if necessary.  Additionally, workload management can be
set up to run under a schema version that is not the latest. This
ability will be useful during troubleshooting. To enable these
upgrades, the workload management initialization that manages the
structure of the tables has been moved to the catalogd.

The changes in this patch must be backwards compatible so that Impala
clusters running previous workload management code can co-exist with
Impala clusters running this workload management code. To enable that
backwards compatibility, a new table property named
'wm_schema_version' is now used to track the schema version of the
workload management tables. Thus, the old property 'schema_version'
will always be set to '1.0.0' since modifying that property value
causes Impala running previous workload management code to error at
startup.

Testing accomplished by
* Adding/updating workload and custom cluster tests to assert the new
  columns and the workload management upgrade process.
* JUnit tests added to verify the new workload management columns are
  being correctly parsed.
* GTests added to ensure the workload management columns are
  correctly defined and in the correct order.

Change-Id: I78f3670b067c0c192ee8a212fba95466fbcb51d7
Reviewed-on: http://gerrit.cloudera.org:8080/21142
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2024-10-31 17:06:43 +00:00
Riza Suminto
9c87cf41bf IMPALA-13396: Unify tmp dir management in CustomClusterTestSuite
There are many custom cluster tests that require creating a temporary
directory. The temporary directory typically lives within the scope of a
test method and is cleaned up afterwards. However, some tests create
temporary directories directly and forget to clean them up afterwards,
leaving junk dirs under /tmp/ or $LOG_DIR.

This patch unifies the temporary directory management inside
CustomClusterTestSuite. It introduces a new 'tmp_dir_placeholders' arg in
CustomClusterTestSuite.with_args() that lists the tmp dirs to create.
'impalad_args', 'catalogd_args', and 'impala_log_dir' now accept a
formatting pattern that is replaced by a temporary dir path, defined
through 'tmp_dir_placeholders'.
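
A hedged usage sketch (the placeholder substitution syntax "{scratch_dir}", the
flag value, and the module path below are assumptions made for the example):

```
# Illustrative use of tmp_dir_placeholders with a formatting pattern.
from tests.common.custom_cluster_test_suite import CustomClusterTestSuite  # path assumed


class TestScratchDirCleanup(CustomClusterTestSuite):

  @CustomClusterTestSuite.with_args(
      tmp_dir_placeholders=["scratch_dir"],
      impalad_args="--scratch_dirs={scratch_dir}")
  def test_spill(self):
    pass
```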

There are a few occurrences where mkdtemp is called and is not
replaceable by this work, such as tests/comparison/cluster.py. In that
case, this patch changes them to supply a prefix arg so that developers
know the directory comes from an Impala test script.

This patch also addressed several flake8 errors in modified files.

Testing:
- Pass custom cluster tests in exhaustive mode.
- Manually run few modified tests and observe that the temporary dirs
  are created and removed under logs/custom_cluster_tests/ as the tests
  go.

Change-Id: I8dd665e8028b3f03e5e33d572c5e188f85c3bdf5
Reviewed-on: http://gerrit.cloudera.org:8080/21836
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-10-02 01:25:39 +00:00
stiga-huang
fcee022e60 IMPALA-13208: Add cluster id to the membership and request-queue topic names
To share catalogd and statestore across Impala clusters, this adds the
cluster id to the membership and request-queue topic names. Impalads are
then only visible to each other inside the same cluster, i.e. when using
the same cluster id. Note that impalads still subscribe to the same
catalog-update topic so they can share the same catalog service.
If the cluster id is empty, the original topic names are used.

This also adds the non-empty cluster id as the prefix of the statestore
subscriber id for impalad and admissiond.

Tests:
 - Add custom cluster test
 - Ran exhaustive tests

Change-Id: I2ff41539f568ef03c0ee2284762b4116b313d90f
Reviewed-on: http://gerrit.cloudera.org:8080/21573
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-07-18 03:38:27 +00:00
Michael Smith
3b35ddc8ca IMPALA-13051: Speed up, refactor query log tests
Sets faster default shutdown_grace_period_s and shutdown_deadline_s when
impalad_graceful_shutdown=True in tests. Impala waits until the grace
period has passed and all queries are stopped (or the deadline is
exceeded) before flushing the query log, so a grace period of 0 is
sufficient. Adds them in setup_method to reduce duplication in test
declarations.

Re-uses TQueryTableColumn Thrift definitions for testing.

Moves waiting for query log table to exist to setup_method rather than
as a side-effect of get_client.

Refactors workload management code to reduce if-clause nesting.

Adds functional query workload tests for both the sys.impala_query_log
and the sys.impala_query_live tables to assert the names and order of
the individual columns within each table.

Renames the python tests for the sys.impala_query_log table removing the
unnecessary "_query_log_table_" string from the name of each test.

Change-Id: I1127ef041a3e024bf2b262767d56ec5f29bf3855
Reviewed-on: http://gerrit.cloudera.org:8080/21358
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2024-05-13 22:46:42 +00:00
jasonmfehr
711a9f2bad IMPALA-12426: Query History Table
Adds the ability for users to specify that Impala will create and
maintain an internal Iceberg table that contains data about all
completed queries. This table is automatically created at startup by
each coordinator if it does not exist. Then, most completed queries are
queued in memory and flushed to the query history table at a set
interval (either minutes or number of records). Set, use, and show
queries are not written to this table. This commit leverages the
InternalServer class to maintain the query history table.

Ctest unit tests have been added to assert the various pieces of code.
New custom cluster tests have been added to assert the query history
table is properly populated with completed queries.

Negative testing consists of attempting sql injection attacks and
syntactically incorrect queries.

Impala built-in string functions benchmarks have been updated to include
the new built-in functions.

Change-Id: I2d2da9d450fba4e789400cfa62927fc25d34f844
Reviewed-on: http://gerrit.cloudera.org:8080/20770
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-03-19 22:17:16 +00:00
Riza Suminto
3381fbf761 IMPALA-12595: Allow automatic removal of old logs from previous PID
IMPALA-11184 added code to target a specific PID for log rotation. This
aligns with glog behavior and provides safety; that is, it strictly
limits log rotation to only consider log files made by the currently
running impalad and excludes logs made by a previous PID or other living
co-located impalads. The downside of this limit is that logs can start
to accumulate on a node when impalad is frequently restarted, and this is
only resolvable by an admin manually removing logs.

To help avoid this manual removal, this patch adds a backend flag
'log_rotation_match_pid'; when it is False, the limit is relaxed by
dropping the PID from the glob pattern. The default value for this new
flag is False. However, for testing purposes, start-impala-cluster.py
will override it to True since the test minicluster logs to a common log
directory. Setting 'log_rotation_match_pid' to True prevents one impalad
from interfering with the log rotation of another impalad in the
minicluster.

As a minimum exercise for this new log rotation behavior,
test_breakpad.py::TestLogging is modified to invoke
start-impala-cluster.py with 'log_rotation_match_pid' set to False.

Testing:
- Add test_excessive_cerr_ignore_pid and test_excessive_cerr_match_pid.
- Split TestLogging into two. One run test_excessive_cerr_ignore_pid in
  core exploration, while the other run the rest of logging tests in
  exhaustive exploration.
- Pass exhaustive tests.

Change-Id: I599799e73f27f941a1d7f3dec0f40b4f05ea5ceb
Reviewed-on: http://gerrit.cloudera.org:8080/20754
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-12-09 03:34:57 +00:00
wzhou-code
819db8fa46 IMPALA-12155: Support High Availability for CatalogD
To support catalog HA, we allow two catalogd instances in an Active-
Passive HA pair to be added to an Impala cluster.
We add a preemptive behavior for catalogd. When enabled, the
preemptive behavior allows the catalogd with the higher priority to
become active while the paired catalogd becomes standby. The active
catalogd acts as the source of metadata and provides the catalog
service for the Impala cluster.

To enable catalog HA for a cluster, the two catalogds in the HA pair and
the statestore must be started with the startup flag "enable_catalogd_ha".

The catalogd in an Active-Passive HA pair can be assigned an instance
priority value to indicate a preference for which catalogd should assume
the active role. The registration ID, which is assigned by the
statestore, can be used as the instance priority value. A lower numerical
value in the registration ID corresponds to a higher priority. The
catalogd with the higher priority is designated as active; the other
catalogd is designated as standby. Only the active catalogd propagates
the IMPALA_CATALOG_TOPIC to the cluster. This guarantees only one writer
for the IMPALA_CATALOG_TOPIC in an Impala cluster.

The statestore, which is the registration center of an Impala cluster,
assigns the roles for the catalogds in the HA pair after both catalogds
register with the statestore. When the statestore detects that the active
catalogd is not healthy, it fails over the catalog service to the standby
catalogd. When a failover occurs, the statestore sends notifications with
the address of the active catalogd to all coordinators and catalogds in
the cluster. The events are logged in the statestore and catalogd logs.
When the catalogd with the higher priority recovers from a failure, the
statestore does not resume it as active, to avoid flip-flopping between
the two catalogds.

To make a specific catalogd in the HA pair the active instance, the
catalogd must be started with the startup flag "force_catalogd_active"
so that it will be assigned the active role when it registers with the
statestore. This allows administrators to manually perform catalog
service failover.

Added option "--enable_catalogd_ha" in bin/start-impala-cluster.py.
If the option is specified when running the script, the script will
create an Impala cluster with two catalogd instances in an HA pair.

Testing:
 - Passed the core tests.
 - Added unit-test for auto failover and manual failover.

Change-Id: I68ce7e57014e2a01133aede7853a212d90688ddd
Reviewed-on: http://gerrit.cloudera.org:8080/19914
Reviewed-by: Xiang Yang <yx91490@126.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tamas Mate <tmater@apache.org>
2023-06-21 14:02:55 +00:00
Joe McDonnell
0c7c6a335e IMPALA-11977: Fix Python 3 broken imports and object model differences
Python 3 changed some object model methods:
 - __nonzero__ was removed in favor of __bool__
 - func_dict / func_name were removed in favor of __dict__ / __name__
 - The next() function was deprecated in favor of __next__
   (Code locations should use next(iter) rather than iter.next())
 - metaclasses are specified a different way
 - Locations that specify __eq__ should also specify __hash__ (see the
   compatibility sketch after this list)
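
A minimal sketch of the Python 2/3 compatible object model patterns named above
(class and attribute names are made up for the example):

```
# Illustrative Python 2/3 compatible class: __bool__/__nonzero__ aliasing and
# keeping __hash__ alongside __eq__.
class QueryHandle(object):
  def __init__(self, query_id):
    self.query_id = query_id

  def __bool__(self):          # Python 3 truthiness hook
    return bool(self.query_id)
  __nonzero__ = __bool__       # Python 2 spelling of the same hook

  def __eq__(self, other):
    return isinstance(other, QueryHandle) and self.query_id == other.query_id

  def __hash__(self):          # __eq__ without __hash__ breaks hashing in Python 3
    return hash(self.query_id)
```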

Python 3 also moved some packages around (urllib2, Queue, httplib,
etc), and this adapts the code to use the new locations (usually
handled on Python 2 via future). This also fixes the code to
avoid referencing exception variables outside the exception block
and variables outside of a comprehension. Several of these seem
like false positives, but it is better to avoid the warning.

This fixes these pylint warnings:
bad-python3-import
eq-without-hash
metaclass-assignment
next-method-called
nonzero-method
exception-escape
comprehension-escape

Testing:
 - Ran core tests
 - Ran release exhaustive tests

Change-Id: I988ae6c139142678b0d40f1f4170b892eabf25ee
Reviewed-on: http://gerrit.cloudera.org:8080/19592
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
82bd087fb1 IMPALA-11973: Add absolute_import, division to all eligible Python files
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
 1. Python 3 requires absolute imports within packages. This
    can be emulated via "from __future__ import absolute_import"
 2. Python 3 changed division to "true" division that doesn't
    round to an integer. This can be emulated via
    "from __future__ import division"

This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.

I scrutinized each old-division location and converted some locations
to use the integer division '//' operator if it needed an integer
result (e.g. for indices, counts of records, etc). Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
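
A minimal example of the future imports and of using the integer division
operator where an integer result is needed:

```
# Example of the __future__ imports added to every file and of using '//' where
# an integer result is required (e.g. indices or record counts).
from __future__ import absolute_import, division, print_function


def middle_index(num_rows):
  # '/' is true division under Python 3 (and under Python 2 with the import
  # above), so use '//' when an integer index is needed.
  return num_rows // 2


print(middle_index(7))  # 3
```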

Testing:
 - Ran core tests

Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Michael Smith
88d49b6919 IMPALA-11693: Enable allow_erasure_coded_files by default
Enables allow_erasure_coded_files by default as we've now completed all
planned work to support it.

Testing
- Ran HDFS+EC test suite
- Ran Ozone+EC test suite

Change-Id: I0cfef087f2a7ae0889f47e85c5fab61a795d8fd4
Reviewed-on: http://gerrit.cloudera.org:8080/19362
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-01-31 16:53:46 +00:00
stiga-huang
77d80aeda6 IMPALA-11812: Deduplicate column schema in hmsPartitions
A list of HMS Partitions will be created in many workloads in catalogd,
e.g. table loading, bulk altering of partitions by ComputeStats or
AlterTableRecoverPartitions, etc. Currently, each hmsPartition holds its
own list of column schemas, i.e. a List<FieldSchema>. This results in
lots of FieldSchema instances if the table is wide and lots of
partitions need to be loaded/operated on. Though the strings of column
names and comments are interned, the FieldSchema objects can still
occupy the majority of the heap. See the histogram in the JIRA
description.

In reality, the hmsPartition instances of a table can share the
table-level column schema since Impala doesn't respect the
partition-level schema.

This patch replaces the column list in the StorageDescriptor of
hmsPartitions with the table-level column list to remove the duplication.
It also adds some progress logs to batch HMS operations and avoids
misleading logs when the event-processor is disabled.

Tests:
- Ran exhaustive tests
- Add tests on wide table operations that hit OOM errors without this
  fix.

Change-Id: I511ecca0ace8bea4c24a19a54fb0a75390e50c4d
Reviewed-on: http://gerrit.cloudera.org:8080/19391
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-01-01 04:38:36 +00:00
Fang-Yu Rao
db1cac2a49 IMPALA-10399, IMPALA-11060, IMPALA-11788: Reset Ranger policy repository in an E2E test
test_show_grant_hive_privilege() uses Ranger's REST API to get all the
existing policies from the Ranger server after creating a policy that
grants the LOCK and SELECT privileges on all the tables and columns in
the unique database, in order to verify that the granted privileges
indeed exist in Ranger's policy repository.

The way we download all the policies from the Ranger server in
test_show_grant_hive_privilege(), however, did not always work.
Specifically, when there were already a lot of existing policies in
Ranger, the policy that granted the LOCK and SELECT privileges would not
be included in the result returned by a single GET request. We found
that to reproduce the issue it sufficed to add 300 Ranger policies
before adding the policy granting those 2 privileges.

Moreover, we found that even when we set the argument 'stream' of
requests.get() to True and used iter_content() to read the response in
chunks, we still could not retrieve the policy added in
test_show_grant_hive_privilege().
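
For context, the streamed download attempt looked roughly like the sketch
below; the URL, endpoint, and credentials are placeholders, not the test's
actual values:

  import requests

  resp = requests.get("http://localhost:6080/service/public/v2/api/policy",
                      auth=("admin", "admin"), stream=True)
  chunks = []
  for chunk in resp.iter_content(chunk_size=8192):
    chunks.append(chunk)
  policies = b"".join(chunks)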

As a workaround, instead of changing how we download all the policies
from the Ranger server, this patch resets Ranger's policy repository for
Impala before we create the policy granting those 2 privileges so that
this test will be more resilient to the number of existing policies in
the repository.

Change-Id: Iff56ec03ceeb2912039241ea302f4bb8948d61f8
Reviewed-on: http://gerrit.cloudera.org:8080/19373
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
2022-12-28 01:48:26 +00:00
Michael Smith
35dc24fbc8 IMPALA-10148: Cleanup cores in TestHooksStartupFail
Generalizes the coredump cleanup and expected-startup-failure handling
from test_provider.py and uses it in test_query_event_hooks.py's
TestHooksStartupFail to ensure core dumps are cleaned up.
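
A minimal sketch of what such a cleanup helper can look like (names and
location are assumptions, not the actual code in test_provider.py):

  import glob
  import os

  def cleanup_coredumps(core_dir):
    # Remove core files left behind by an intentionally failed daemon start
    # so that later tests see a clean directory.
    for path in glob.glob(os.path.join(core_dir, "core*")):
      os.remove(path)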

Testing: ran the changed tests and observed core files being created and
cleaned up while they ran. Observed that other core files already present
were not cleaned up, as expected.

Change-Id: Iec32e0acbadd65aa78264594c85ffcd574cf3458
Reviewed-on: http://gerrit.cloudera.org:8080/19103
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-10-13 16:20:34 +00:00
Bikramjeet Vig
06c9016a37 IMPALA-8762: Track host level admission stats across all coordinators
This patch adds the ability to share the per-host stats for locally
admitted queries across all coordinators. This helps to get a more
consolidated view of the cluster for stats like slots_in_use and
mem_admitted when making local admission decisions.

Testing:
Added e2e py test

Change-Id: I2946832e0a89b077d0f3bec755e4672be2088243
Reviewed-on: http://gerrit.cloudera.org:8080/17683
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-07-28 05:33:16 +00:00
Thomas Tauber-Marshall
91adb33b22 IMPALA-9975 (part 2): Introduce new admission control daemon
A recent patch (IMPALA-9930) introduced a new admission control RPC
service, which can be configured to perform admission control for
coordinators. In that patch, the admission service runs in an impalad.

This patch separates the service out to run in a new daemon, called
the admissiond. It also integrates this new daemon with the build
infrastructure around Docker.

Some notable changes:
- Adds a new class, AdmissiondEnv, which performs the same function
  for the admissiond as ExecEnv does for impalads.
- The '/admission' HTTP endpoint is exposed on the admissiond's webui
  if the admission control service is in use; otherwise it is exposed
  on the coordinator impalads' webuis.
- start-impala-cluster.py takes a new flag --enable_admission_service,
  which configures the minicluster to have an admissiond with all
  coordinators using it for admission control (a usage sketch follows
  this list).
- Coordinators are now configured to use the admission service by
  specifying the startup flag --admission_service_host. This is
  intended to mirror the configuration of the statestored/catalogd
  location.
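
A usage sketch (the Python wrapper below is illustrative; only the flag
itself comes from this patch):

  import subprocess

  # Start a 3-node minicluster whose coordinators delegate admission
  # control to a separate admissiond via the new flag.
  subprocess.check_call([
      "bin/start-impala-cluster.py",
      "--cluster_size=3",
      "--enable_admission_service",
  ])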

Testing:
- Existing tests for the admission control service are modified to run
  with an admissiond.
- Manually ran start-impala-cluster.py with --enable_admission_service
  and --docker_network to verify Docker integration.

Change-Id: Id677814b31e9193035e8cf0d08aba0ce388a0ad9
Reviewed-on: http://gerrit.cloudera.org:8080/16891
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-13 06:03:37 +00:00
Fang-Yu Rao
34668fab87 IMPALA-10092: Do not skip test vectors of Kudu tests in a custom cluster
We found that the following 4 tests do not run even when we remove all
the decorators like "@SkipIfKudu.no_hybrid_clock" or
"@SkipIfHive3.kudu_hms_notifications_not_supported" that skip the tests.
This is because the 3 classes containing them inherit from
CustomClusterTestSuite, which adds a constraint that only allows test
vectors with 'file_format' and 'compression_codec' being "text" and
"none", respectively, to be run.

1. TestKuduOperations::test_local_tz_conversion_ops
2. TestKuduClientTimeout::test_impalad_timeout
3. TestKuduHMSIntegration::test_create_managed_kudu_tables
4. TestKuduHMSIntegration::test_kudu_alter_table

To address this issue, this patch creates a parent class for those
3 classes above and overrides add_custom_cluster_constraints() in this
newly created parent class so that we do not skip test vectors with
'file_format' and 'compression_codec' being "kudu" and "none",
respectively.
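
A rough sketch of such a parent class (the class name and constraint code
below are approximations, not copied from the patch):

  from tests.common.custom_cluster_test_suite import CustomClusterTestSuite

  class CustomKuduTest(CustomClusterTestSuite):
    @classmethod
    def add_custom_cluster_constraints(cls):
      # Unlike the base class, keep test vectors whose file_format is
      # 'kudu' (with 'none' compression) instead of restricting the
      # matrix to 'text'/'none'.
      cls.ImpalaTestMatrix.add_constraint(lambda v:
          v.get_value('table_format').file_format == 'kudu' and
          v.get_value('table_format').compression_codec == 'none')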

In addition, this patch removes a redundant method call to
super(CustomClusterTestSuite, cls).add_test_dimensions() in
CustomClusterTestSuite.add_custom_cluster_constraints() since
super(CustomClusterTestSuite, cls).add_test_dimensions() had
already been called immediately before the call to
add_custom_cluster_constraints() in
CustomClusterTestSuite.add_test_dimensions().

Testing:
 - Manually verified that after removing the decorators to skip those
   tests, those tests could be run.

Change-Id: I60a4bd4ac5a9026629fb840ab9cc7b5f9948290c
Reviewed-on: http://gerrit.cloudera.org:8080/16348
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-08-28 01:37:16 +00:00
Joe McDonnell
3e76da9f51 IMPALA-9708: Remove Sentry support
Impala 4 decided to drop Sentry support in favor of Ranger. This
removes Sentry support and related tests. It retires startup
flags related to Sentry and does the first round of removing
obsolete code. This does not adjust documentation to remove
references to Sentry, and other dead code will be removed
separately.

Some issues came up when implementing this. Here is a summary
of how this patch resolves them:
1. authorization_provider currently defaults to "sentry", but
   "ranger" requires extra parameters to be set. This changes the
   default value of authorization_provider to "", which translates
   internally to the noop policy that does no authorization.
2. These flags are Sentry specific and are now retired:
 - authorization_policy_provider_class
 - sentry_catalog_polling_frequency_s
 - sentry_config
3. The authorization_factory_class may be obsolete now that
   there is only one authorization policy, but this leaves it
   in place.
4. Sentry is the last component using CDH_COMPONENTS_HOME, so
   that is removed. There are still Maven dependencies coming
   from the CDH_BUILD_NUMBER repository, so that is not removed.
5. To make the transition easier, testdata/bin/kill-sentry-service.sh
   is not removed and it is still called from testdata/bin/kill-all.sh.

Testing:
 - Core job passes

Change-Id: I8e99c15936d6d250cf258e3a1dcba11d3eb4661e
Reviewed-on: http://gerrit.cloudera.org:8080/15833
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-20 17:43:40 +00:00
stiga-huang
b6b31e4cc4 IMPALA-9071: Handle translated external HDFS table in CTAS
After upgrading Hive-3 to a version containing HIVE-22158, managed
tables are no longer allowed to be non-transactional. Creating non-ACID
tables will instead result in creating an external table with the table
property 'external.table.purge' set to true.

In Hive-3, external HDFS tables are located under
'metastore.warehouse.external.dir' by default if it's set. This property
was added by HIVE-19837 in Hive 2.7, but hasn't been added to Hive in
cdh6 yet.

In a CTAS statement, we create a temporary HMS table for the analysis
of the Insert part. The table path is created assuming it's a managed
table, and the Insert part will use this path for insertion. However, in
Hive-3, the created table is translated to an external table, so it's
not the same as the one we passed to the HMS API. The created table is
located in 'metastore.warehouse.external.dir', while the table path we
assumed is in 'metastore.warehouse.dir'. This introduces bugs when these
two properties are different: the CTAS statement will create the table
in one place and insert data in another.

This patch adds a new method in MetastoreShim to wrap the difference
between Hive-2 and Hive-3 in getting the default table path for
non-transactional tables.

Changes in the infra:
 - To support customizing hive configuration, add an env var,
   CUSTOM_CLASSPATH in bin/set-classpath.sh to be put in front of
   existing CLASSPATH. The customized hive-site.xml should be put inside
   CUSTOM_CLASSPATH.
 - Change hive-site.xml.py to generate a hive-site.xml with non default
   'metastore.warehouse.external.dir'
 - Add an option, --env_vars, in bin/start-impala-cluster.py to pass
   down CUSTOM_CLASSPATH.

Tests:
 - Add a custom cluster test to start Hive with
   metastore.warehouse.external.dir set to a non-default value. Run
   it locally using CDP components with HIVE-22158. xfail the test until
   we bump CDP_BUILD_NUMBER to 1507246.
 - Run CORE tests using CDH components

Change-Id: I460a57dc877ef68ad7dd0864a33b1599b1e9a8d9
Reviewed-on: http://gerrit.cloudera.org:8080/14527
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2019-10-24 22:10:03 +00:00
Tim Armstrong
4fb8e8e324 IMPALA-8816: reduce custom cluster test runtime in core
This includes some optimisations and a bulk move of tests
to exhaustive.

Move a bunch of custom cluster tests to exhaustive. I selected
these partially based on runtime (i.e. I looked most carefully
at the tests that ran for over a minute) and the likelihood
of them catching a precommit bug.  Regression tests for specific
edge cases and tests for parts of the code that are very stable
were prime candidates.

Remove an unnecessary cluster restart in test_breakpad.

Merge test_scheduler_error into test_failpoints to avoid an unnecessary
cluster restart.

Speed up cluster starts by ensuring that the default statestore args are
applied even when _start_impala_cluster() is called directly. This
shaves a couple of seconds off each restart. We made the default args
use a faster update frequency - see IMPALA-7185 - but they did not
take effect in all tests.

Change-Id: Ib2e3e7ebc9695baec4d69183387259958df10f62
Reviewed-on: http://gerrit.cloudera.org:8080/13967
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-08-06 21:34:26 +00:00
Lars Volker
2397ae5590 IMPALA-8484: Run queries on disjoint executor groups
This change adds support for running queries inside a single admission
control pool on one of several disjoint sets of executors called
"executor groups".

Executors can be configured with an executor group through the newly
added '--executor_groups' flag. Note that in anticipation of future
changes, the flag already uses the plural form, but only a single
executor group may be specified for now. Each executor group
specification can optionally contain a minimum size, separated by a
':', e.g. --executor_groups default-pool-1:3. Only when the cluster
membership contains at least that number of executors for the group
will it be considered for admission.

Executor groups are mapped to resource pools by their name: an executor
group can service queries from a resource pool if the pool name is a
prefix of the group name separated by a '-'. For example, queries in
pool poolA can be serviced by executor groups named poolA-1 and poolA-2,
but not by groups named foo or poolB-1.
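
The matching rule can be summarized with a small illustrative sketch (not
the backend implementation):

  def group_serves_pool(group_name, pool_name):
    # A group named "<pool>-<suffix>" services queries from pool "<pool>".
    return group_name.startswith(pool_name + "-")

  assert group_serves_pool("poolA-1", "poolA")
  assert not group_serves_pool("poolB-1", "poolA")
  assert not group_serves_pool("foo", "poolA")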

During scheduling, executor groups are considered in alphabetical order.
This means that one group is filled up entirely before a subsequent
group is considered for admission. Groups also need to pass a health
check before being considered. In particular, they must contain at least
the minimum number of executors specified.

If no group is specified during startup, executors are added to the
default executor group. If - during admission - no executor group for a
pool can be found and the default group is non-empty, then the default
group is considered. The default group does not have a minimum size.

This change inverts the order of scheduling and admission. Prior to this
change, queries were scheduled before submitting them to the admission
controller. Now the admission controller computes schedules for all
candidate executor groups before each admission attempt. If the cluster
membership has not changed, then the schedules of the previous attempt
will be reused. This means that queries will no longer fail if the
cluster membership changes while they are queued in the admission
controller.

This change also alters the default behavior when using a dedicated
coordinator and no executors have registered yet. Prior to this change,
a query would fail immediately with an error ("No executors registered
in group"). Now a query will get queued and wait until executors show
up, or it times out after the pool's queue timeout period.

Testing:

This change adds a new custom cluster test for executor groups. It
makes use of new capabilities added to start-impala-cluster.py to bring
up additional executors into an already running cluster.

Additionally, this change adds an instructional implementation of
executor group based autoscaling, which can be used during development.
It also adds a helper to run queries concurrently. Both are used in a
new test to exercise the executor group logic and to prevent regressions
to these tools.

In addition to these tests, the existing tests for the admission
controller (both BE and EE tests) thoroughly exercise the changed code.
Some of them required changes themselves to reflect the new behavior.

I looped the new tests (test_executor_groups and test_auto_scaling) for
a night (110 iterations each) without any issues.

I also started an autoscaling cluster with a single group and ran
TPC-DS, TPC-H, and test_queries on it successfully.

Known limitations:

When using executor groups, only a single coordinator and a single AC
pool (i.e. the default pool) are supported. Executors do not include the
number of currently running queries in their statestore updates and so
admission controllers are not aware of the number of queries admitted by
other controllers per host.

Change-Id: I8a1d0900f2a82bd2fc0a906cc094e442cffa189b
Reviewed-on: http://gerrit.cloudera.org:8080/13550
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-07-21 04:54:03 +00:00
Fredy Wijaya
1b531be9be IMPALA-8589: Re-enable flaky test_query_event_hooks.py
This patch fixes the flaky test_query_event_hooks.py. The patch also
cuts down the waiting time for impalad timeout from the default 60
seconds to 5 seconds, especially for those tests that expect Impala
startup to fail.

Testing:
- Ran test_query_event_hooks.py in a loop.

Change-Id: Ia64550e986b5eba59a1d77657943932bb977d470
Reviewed-on: http://gerrit.cloudera.org:8080/13713
Reviewed-by: Fredy Wijaya <fwijaya@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-06-26 00:13:44 +00:00
Todd Lipcon
800f635855 IMPALA-8667. Remove --pull_incremental_stats flag
This flag was added as a "chicken bit" -- so we could disable the new
feature if we had some problems with it. It's been out in the wild for a
number of months and we haven't seen any such problems, so at this point
let's stop maintaining the old code path.

Change-Id: I8878fcd8a2462963c7db3183a003bb9816dda8f9
Reviewed-on: http://gerrit.cloudera.org:8080/13671
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-06-19 01:07:00 +00:00
Hao Hao
6bb404dc35 IMPALA-8504 (part 2): Support CREATE TABLE statement with Kudu/HMS integration
This commit supports the actual handling of CREATE TABLE DDL for managed
Kudu tables when integration with the Hive Metastore is enabled. With
Kudu/HMS integration enabled, Impala can rely on Kudu to create the
table in the HMS for a CREATE TABLE statement.

Change-Id: Icffe412395f47f5e07d97bad457020770cfa7502
Reviewed-on: http://gerrit.cloudera.org:8080/13375
Reviewed-by: Thomas Marshall <tmarshall@cloudera.com>
Reviewed-by: Grant Henke <granthenke@apache.org>
Tested-by: Thomas Marshall <tmarshall@cloudera.com>
2019-06-04 17:36:59 +00:00
Michael Ho
2ece4c9b2e IMPALA-8341: Data cache for remote reads
This is a patch based on PhilZ's prototype: https://gerrit.cloudera.org/#/c/12683/

This change implements an IO data cache which is backed by
local storage. It implicitly relies on the OS page cache
management to shuffle data between memory and the storage
device. This is useful for caching data read from remote
filesystems (e.g. remote HDFS data node, S3, ABFS, ADLS).

A data cache is divided into one or more partitions based on the
configuration string, which is a list of directories, separated by
commas, followed by the storage capacity per directory.
An example configuration string looks like the following:
  --data_cache_config=/data/0,/data/1:150GB

In the configuration above, the cache may use up to 300GB of
storage space, with 150GB max for /data/0 and /data/1 respectively.
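
An illustrative sketch of how such a configuration string breaks down (not
the backend parsing code):

  def parse_data_cache_config(config):
    # "/data/0,/data/1:150GB" -> (["/data/0", "/data/1"], "150GB"); each
    # listed directory may hold up to the given capacity.
    dirs, capacity = config.rsplit(":", 1)
    return dirs.split(","), capacity

  assert parse_data_cache_config("/data/0,/data/1:150GB") == \
      (["/data/0", "/data/1"], "150GB")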

Each partition has a meta-data cache which tracks the mappings
of cache keys to the locations of the cached data. A cache key
is a tuple of (file's name, file's modification time, file offset)
and a cache entry is a tuple of (backing file, offset in the backing
file, length of the cached data, optional checksum). Note that the
cache currently doesn't support overlapping ranges. In other words,
if the cache contains an entry of a file for range [m, m+4MB), a lookup
for [m+4K, m+8K) will miss in the cache. In practice, we haven't seen
this as a problem but this may require further evaluation in the future.

Each partition stores its set of cached data in backing files created
on local storage. When inserting new data into the cache, the data is
appended to the current backing file in use. The storage consumption
of each cache entry counts towards the quota of that partition. When a
partition reaches its capacity, the least recently used (LRU) data in
that partition is evicted. Evicted data is removed from the underlying
storage by punching holes in the backing file it's stored in. As a
backing file reaches a certain size (by default 4TB), new data will
stop being appended to it and a new file will be created instead. Note
that due to hole punching, the backing file is actually sparse. When
the number of backing files per partition exceeds,
--data_cache_max_files_per_partition, files are deleted in the order
in which they are created. Stale cache entries referencing deleted
files are erased lazily or evicted due to inactivity.

Optionally, checksumming can be enabled to verify that a read from the
cache is consistent with what was inserted and that multiple attempted
insertions with the same cache key have the same cache content.
Checksumming is enabled by default for debug builds.

To probe for cached data in the cache, the interface Lookup() is used;
to insert data into the cache, the interface Store() is used. Please note
that eviction currently happens inline during Store().

This patch also adds two startup flags for start-impala-cluster.py:
'--data_cache_dir' specifies the base directory in which each Impalad
creates its caching directory, and '--data_cache_size' specifies the
capacity string for each cache directory.

Testing done:
- added a new BE and EE test
- exhaustive (debug, release) builds with cache enabled
- core ASAN build with cache enabled

Perf:
- 16-stream TPCDS at 3TB on a 20-node S3 cluster shows about 30% improvement
over runs without the cache, with a cache size of 150GB per node.
The performance is at parity with a configuration of an HDFS cluster using
EBS as the storage.

Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Reviewed-on: http://gerrit.cloudera.org:8080/12987
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-05-03 19:39:42 +00:00
Tim Armstrong
d820952d86 IMPALA-8469: admit_mem_limit for dedicated coordinator
Refactored to avoid the code duplication that resulted in this bug:
* admit_mem_limit is calculated once in ExecEnv
* The local backend descriptor is always constructed with
  a static helper: Scheduler::BuildLocalBackendDescriptor()

I chose to factor it in this way, in part, to avoid invasive
changes to scheduler-test, which currently doesn't depend on
ExecEnv or ImpalaServer.

Testing:
Added basic test that reproduces the bug.

Change-Id: Iaceb21b753b9b021bedc4187c0d44aaa6a626521
Reviewed-on: http://gerrit.cloudera.org:8080/13180
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-05-01 00:37:04 +00:00
Tim Armstrong
2ca7f8e7c0 IMPALA-7995: part 1: fixes for e2e dockerised impala tests
This fixes all core e2e tests running on my local dockerised
minicluster build. I do not yet have a CI job or script running
but I wanted to get feedback on these changes sooner. The second
part of the change will include the CI script and any follow-on
fixes required for the exhaustive tests.

The following fixes were required:
* Detect docker_network from TEST_START_CLUSTER_ARGS
* get_webserver_port() does not depend on the caller passing in
  the default webserver port. It failed previously because it
  relied on start-impala-cluster.py setting -webserver_port
  for *all* processes.
* Add SkipIf markers for tests that don't make sense or are
  non-trivial to fix for containerised Impala.
* Support loading Impala-lzo plugin from host for tests that depend on
  it.
* Fix some tests that had 'localhost' hardcoded - instead it should
  be $INTERNAL_LISTEN_HOST, which defaults to localhost.
* Fix bug with sorting impala daemons by backend port, which is
  the same for all dockerised impalads.

Testing:
I ran tests locally as follows after having set up a docker network and
starting other services:

  ./buildall.sh -noclean -notests -ninja
  ninja -j $IMPALA_BUILD_THREADS docker_images
  export TEST_START_CLUSTER_ARGS="--docker_network=impala-cluster"
  export FE_TEST=false
  export BE_TEST=false
  export JDBC_TEST=false
  export CLUSTER_TEST=false
  ./bin/run-all-tests.sh

Change-Id: Iee86cbd2c4631a014af1e8cef8e1cd523a812755
Reviewed-on: http://gerrit.cloudera.org:8080/12639
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-04-13 02:42:32 +00:00
Radford Nguyen
f998d64767 IMPALA-8363: Fix E2E start with impala_log_dir
This commit fixes the `CustomClusterTestSuite` to wait for the
correct number of executors when `impala_log_dir` is specified
in the test decorator.  Previously, the default value of 3
was always used, regardless of `cluster_size`.

Testing:

- Manual verification using tests/authorization/test_ranger.py
with custom `impala_log_dir` and `cluster_size` arguments.
Failed before changes, passed after changes

- Ran all original E2E tests

Change-Id: I4f46f40474b4b380abe88647a37e8e4d2231d745
Reviewed-on: http://gerrit.cloudera.org:8080/12935
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-04-11 21:19:31 +00:00
Tim Armstrong
f9ced753ba IMPALA-7999: clean up start-*d.sh scripts
Delete these wrapper scripts and replace them with a generic
start-daemon.sh script that sets environment variables
without the other logic.

Move the logic for setting JAVA_TOOL_OPTIONS into
start-impala-cluster.py.

Remove some options like -jvm_suspend, -gdb, -perf that
may not be used. These can be reintroduced if needed.

Port across the kerberized minicluster logic (which has
probably bitrotted) in case it needs to be revived.

Remove --verbose option that didn't appear to be useful
(it claims to print daemon output to the console,
but output is still redirected regardless).

Removed a level of quoting in custom cluster test argument
handling - this was made unnecessary by properly escaping
arguments with pipes.escape() in run_daemon().

Testing:
* Ran exhaustive tests.
* Ran on CentOS 6 to confirm we didn't reintroduce Popen issue
  worked around by kwho.

Change-Id: Ib67444fd4def8da119db5d3a0832ef1de15b068b
Reviewed-on: http://gerrit.cloudera.org:8080/12271
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-02-05 13:10:08 +00:00
Philip Zeyliger
13dfdc64db IMPALA-6664: Tag log statements with fragment or query ids.
This implements much of the desire in IMPALA-6664 to tag all log
statements with their query ids. It re-uses the existing ThreadDebugInfo
infrastructure as well as the existing
InstallLogMessageListenerFunction() patch to glog (currently used for
log redaction) to prefix log messages with fragment ids or query ids,
when available. The fragment id is the query id with the last bits
incremented, so it's possible to correlate a given query's log messages.
For example:

  $ grep 85420d575b9ff4b9:402b8868 logs/cluster/impalad.INFO
  I0108 10:39:16.453958 14752 impala-server.cc:1052] 85420d575b9ff4b9:402b886800000000] Registered query query_id=85420d575b9ff4b9:402b886800000000 session_id=aa45e480434f0516:101ae5ac12679d94
  I0108 10:39:16.454738 14752 Frontend.java:1242] 85420d575b9ff4b9:402b886800000000] Analyzing query: select count(*) from tpcds.web_sales
  I0108 10:39:16.456627 14752 Frontend.java:1282] 85420d575b9ff4b9:402b886800000000] Analysis finished.
  I0108 10:39:16.463538 14818 admission-controller.cc:598] 85420d575b9ff4b9:402b886800000000] Schedule for id=85420d575b9ff4b9:402b886800000000 in pool_name=default-pool per_host_mem_estimate=180.02 MB PoolConfig: max_requests=-1 max_queued=200 max_mem=-1.00 B
  I0108 10:39:16.463603 14818 admission-controller.cc:603] 85420d575b9ff4b9:402b886800000000] Stats: agg_num_running=0, agg_num_queued=0, agg_mem_reserved=0,  local_host(local_mem_admitted=0, num_admitted_running=0, num_queued=0, backend_mem_reserved=0)
  I0108 10:39:16.463780 14818 admission-controller.cc:635] 85420d575b9ff4b9:402b886800000000] Admitted query id=85420d575b9ff4b9:402b886800000000
  I0108 10:39:16.463896 14818 coordinator.cc:93] 85420d575b9ff4b9:402b886800000000] Exec() query_id=85420d575b9ff4b9:402b886800000000 stmt=select count(*) from tpcds.web_sales
  I0108 10:39:16.464795 14818 coordinator.cc:356] 85420d575b9ff4b9:402b886800000000] starting execution on 2 backends for query_id=85420d575b9ff4b9:402b886800000000
  I0108 10:39:16.466384 24891 impala-internal-service.cc:49] ExecQueryFInstances(): query_id=85420d575b9ff4b9:402b886800000000 coord=pannier.sf.cloudera.com:22000 #instances=2
  I0108 10:39:16.467339 14818 coordinator.cc:370] 85420d575b9ff4b9:402b886800000000] started execution on 2 backends for query_id=85420d575b9ff4b9:402b886800000000
  I0108 10:39:16.467536 14823 query-state.cc:579] 85420d575b9ff4b9:402b886800000000] Executing instance. instance_id=85420d575b9ff4b9:402b886800000000 fragment_idx=0 per_fragment_instance_idx=0 coord_state_idx=0 #in-flight=1
  I0108 10:39:16.467627 14824 query-state.cc:579] 85420d575b9ff4b9:402b886800000001] Executing instance. instance_id=85420d575b9ff4b9:402b886800000001 fragment_idx=1 per_fragment_instance_idx=0 coord_state_idx=0 #in-flight=2
  I0108 10:39:16.820933 14824 query-state.cc:587] 85420d575b9ff4b9:402b886800000001] Instance completed. instance_id=85420d575b9ff4b9:402b886800000001 #in-flight=1 status=OK
  I0108 10:39:17.122299 14823 krpc-data-stream-mgr.cc:294] 85420d575b9ff4b9:402b886800000000] DeregisterRecvr(): fragment_instance_id=85420d575b9ff4b9:402b886800000000, node=2
  I0108 10:39:17.123500 24038 coordinator.cc:709] Backend completed: host=pannier.sf.cloudera.com:22001 remaining=2 query_id=85420d575b9ff4b9:402b886800000000
  I0108 10:39:17.123509 24038 coordinator-backend-state.cc:265] query_id=85420d575b9ff4b9:402b886800000000: first in-progress backend: pannier.sf.cloudera.com:22000
  I0108 10:39:17.167752 14752 impala-beeswax-server.cc:197] 85420d575b9ff4b9:402b886800000000] get_results_metadata(): query_id=85420d575b9ff4b9:402b886800000000
  I0108 10:39:17.168762 14752 coordinator.cc:483] 85420d575b9ff4b9:402b886800000000] ExecState: query id=85420d575b9ff4b9:402b886800000000 execution completed
  I0108 10:39:17.168808 14752 coordinator.cc:608] 85420d575b9ff4b9:402b886800000000] Coordinator waiting for backends to finish, 1 remaining. query_id=85420d575b9ff4b9:402b886800000000
  I0108 10:39:17.168880 14823 query-state.cc:587] 85420d575b9ff4b9:402b886800000000] Instance completed. instance_id=85420d575b9ff4b9:402b886800000000 #in-flight=0 status=OK
  I0108 10:39:17.168977 14821 query-state.cc:252] UpdateBackendExecState(): last report for 85420d575b9ff4b9:402b886800000000
  I0108 10:39:17.174401 24038 coordinator.cc:709] Backend completed: host=pannier.sf.cloudera.com:22000 remaining=1 query_id=85420d575b9ff4b9:402b886800000000
  I0108 10:39:17.174513 14752 coordinator.cc:814] 85420d575b9ff4b9:402b886800000000] Release admission control resources for query_id=85420d575b9ff4b9:402b886800000000
  I0108 10:39:17.174815 14821 query-state.cc:604] Cancel: query_id=85420d575b9ff4b9:402b886800000000
  I0108 10:39:17.174837 14821 krpc-data-stream-mgr.cc:325] cancelling all streams for fragment_instance_id=85420d575b9ff4b9:402b886800000001
  I0108 10:39:17.174856 14821 krpc-data-stream-mgr.cc:325] cancelling all streams for fragment_instance_id=85420d575b9ff4b9:402b886800000000
  I0108 10:39:17.179621 14752 impala-beeswax-server.cc:239] 85420d575b9ff4b9:402b886800000000] close(): query_id=85420d575b9ff4b9:402b886800000000
  I0108 10:39:17.179651 14752 impala-server.cc:1131] 85420d575b9ff4b9:402b886800000000] UnregisterQuery(): query_id=85420d575b9ff4b9:402b886800000000
  I0108 10:39:17.179666 14752 impala-server.cc:1238] 85420d575b9ff4b9:402b886800000000] Cancel(): query_id=85420d575b9ff4b9:402b886800000000
  I0108 10:39:17.179814 14752 coordinator.cc:684] 85420d575b9ff4b9:402b886800000000] CancelBackends() query_id=85420d575b9ff4b9:402b886800000000, tried to cancel 0 backends
  I0108 10:39:17.203898 14752 query-exec-mgr.cc:184] 85420d575b9ff4b9:402b886800000000] ReleaseQueryState(): deleted query_id=85420d575b9ff4b9:402b886800000000
  I0108 10:39:18.108947 14752 impala-server.cc:1993] 85420d575b9ff4b9:402b886800000000] Connection from client ::ffff:172.16.35.186:52096 closed, closing 1 associated session(s)
  I0108 10:39:18.108996 14752 impala-server.cc:1249] 85420d575b9ff4b9:402b886800000000] Closing session: aa45e480434f0516:101ae5ac12679d94
  I0108 10:39:18.109035 14752 impala-server.cc:1291] 85420d575b9ff4b9:402b886800000000] Closed session: aa45e480434f0516:101ae5ac12679d94

There are a few caveats here: the thread state isn't "scoped", so the "Closing
session" log statement is technically not part of the query. When that thread
is re-used for another query, it corrects itself. Some threads, like 14821,
aren't using the thread locals. In some cases, we should go back and
add GetThreadDebugInfo()->SetQueryId(...) statements.

I've used this to debug some crashes (of my own doing) while running
parallel tests, and it's been quite helpful.

An alternative would be to use Kudu's be/src/kudu/util/async_logger.h,
and add the "Listener" functionality to it directly. Another alternative
would be to re-write all the *LOG macros, but this is quite painful (and
presumably was rejected when log redaction was introduced).

I changed thread-debug-info to capture TUniqueId (a thrift struct with
two int64s) rather than the string representation. This made it easier
to compare with the "0:0" id, which we treat as "unset".  If a developer
needs to analyze it from a debugger, gdb can print out hex just fine.

I added some context to request-context to be able to pipe ids through
to disk IO threads as well.

To test this, I moved "assert_log_contains" up to impala_test_suite, and
had it handle the default log location case. The test needs a sleep for
log buffering, but it seems like a test with a sleep running in
parallel is better than a custom cluster test, which reboots the cluster
(and loads metadata).

Change-Id: I6634ef9d1a7346339f24f2d40a7a3aa36a535da8
Reviewed-on: http://gerrit.cloudera.org:8080/12129
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-01-25 00:47:09 +00:00