1515 Commits

Author SHA1 Message Date
Joe McDonnell
c5a0ec8bdf IMPALA-11980 (part 1): Put all thrift-generated python code into the impala_thrift_gen package
This puts all of the thrift-generated python code into the
impala_thrift_gen package. This is similar to what Impyla
does for its thrift-generated python code, except that it
uses the impala_thrift_gen package rather than impala._thrift_gen.
This is a preparatory patch for fixing the absolute import
issues.

This patches all of the thrift files to add the python namespace.
It also includes code to apply the same patching to the thirdparty
thrift files (hive_metastore.thrift, fb303.thrift).

Putting all the generated python into a package makes it easier
to understand where imports come from. When the
subsequent change rearranges the shell code, the thrift-generated
code can stay in a separate directory.

This uses isort to sort the imports for the affected Python files
with the provided .isort.cfg file. This also adds an impala-isort
shell script to make it easy to run.

Testing:
 - Ran a core job

Change-Id: Ie2927f22c7257aa38a78084efe5bd76d566493c0
Reviewed-on: http://gerrit.cloudera.org:8080/20169
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2025-04-15 17:03:02 +00:00
Steve Carlin
706e1f026c IMPALA-13657: Connect Calcite planner to Impala Frontend framework
This commit adds the plumbing created by IMPALA-13653. The Calcite
planner is now called from Impala's Frontend code via 4 hooks which
are:

- CalciteCompilerFactory: the factory class that creates
    the implementations of the parser, analysis, and single node
    planner hooks.
- CalciteParsedStatement: the class which holds the Calcite SqlNode
    AST.
- CalciteAnalysisDriver: the class that does the validation of the
    SqlNode AST.
- CalciteSingleNodePlanner: the class that converts the AST to a
    logical plan, optimizes it, and converts it into an Impala
    PlanNode physical plan.

To run on Calcite, one needs to do two things:

1) set the USE_CALCITE_PLANNER env variable to true before starting
the cluster. This adds the jar file to the classpath in
bin/setclasspath.sh; it is not included there by default at the time
of this commit.
2) set the use_calcite_planner query option to true.
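
A minimal sketch of these two steps:

  export USE_CALCITE_PLANNER=true   # before starting the cluster
  SET use_calcite_planner=true;     # query option, set in the session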

This commit makes the CalciteJniFrontend class obsolete. Once the
test cases are moved out of there, that class and others can be
removed.

Change-Id: I3b30571beb797ede827ef4d794b8daefb130ccb1
Reviewed-on: http://gerrit.cloudera.org:8080/22319
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2025-04-09 23:55:15 +00:00
Zoltan Borok-Nagy
bd3486c051 IMPALA-13586: Initial support for Iceberg REST Catalogs
This patch adds initial support for Iceberg REST Catalogs. This means
now it's possible to run an Impala cluster without the Hive Metastore,
and without the Impala CatalogD. Impala Coordinators can directly
connect to an Iceberg REST server and fetch metadata for databases and
tables from there. The support is read-only, i.e. DDL and DML statements
are not supported yet.

This was initially developed in the context of a company Hackathon
program, i.e. it was a team effort that I squashed into a single commit
and polished the code a bit.

The Hackathon team members were:
* Daniel Becker
* Gabor Kaszab
* Kurt Deschler
* Peter Rozsa
* Zoltan Borok-Nagy

The Iceberg REST Catalog support can be configured via a Java properties
file, whose location can be specified via:
 --catalog_config_dir: Directory of configuration files

Currently only one configuration file can be in the directory, as we only
support a single Catalog at a time. The following properties are mandatory
in the config file:
* connector.name=iceberg
* iceberg.catalog.type=rest
* iceberg.rest-catalog.uri

The first two properties can only be 'iceberg' and 'rest' for now; they
are needed for extensibility in the future.
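
A sketch of such a config file (the file name and URI value are
illustrative; only the three properties listed above are mandatory):

  # <catalog_config_dir>/iceberg-rest.properties
  connector.name=iceberg
  iceberg.catalog.type=rest
  iceberg.rest-catalog.uri=http://iceberg-rest-host:8181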

Moreover, Impala Daemons need to specify the following flags to connect
to an Iceberg REST Catalog:
 --use_local_catalog=true
 --catalogd_deployed=false

Testing
* e2e test added to test basic functionality against a custom-built
  Iceberg REST server that delegates to HadoopCatalog under the hood
* Further testing, e.g. Ranger tests are expected in subsequent
  commits

TODO:
* manual testing against Polaris / Lakekeeper; we could add automated
  tests in a later patch

Change-Id: I1722b898b568d2f5689002f2b9bef59320cb088c
Reviewed-on: http://gerrit.cloudera.org:8080/22353
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-04-02 20:04:12 +00:00
Daniel Becker
d714798904 IMPALA-13609: Store Iceberg snapshot id for COMPUTE STATS
Currently, when COMPUTE STATS is run from Impala, we set the
'impala.lastComputeStatsTime' table property. Iceberg Puffin stats, on
the other hand, store the snapshot id for which the stats were
calculated. Although it is possible to retrieve the timestamp of a
snapshot, comparing these two values is error-prone, e.g. in the
following situation:

 - COMPUTE STATS calculation is running on snapshot N
 - snapshot N+1 is committed at time T
 - COMPUTE STATS finishes and sets 'impala.lastComputeStatsTime' at time
   T + Delta
 - some engine writes Puffin statistics for snapshot N+1

After this, HMS stats will appear to be more recent even though they
were calculated on snapshot N, while we have Puffin stats for snapshot
N+1.

To make comparisons easier, after this change, COMPUTE STATS sets a new
table property, 'impala.computeStatsSnapshotIds'. This property stores
the snapshot id for which stats have been computed, for each column. It
is a comma-separated list of values of the form
"fieldIdRangeStart[-fieldIdRangeEndIncl]:snapshotId". The fieldId part
may be a single value or a contiguous, inclusive range.
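
For example (field ids and snapshot ids are illustrative), a value of

  1-4:6183921407218991873,6:7290184511839022145

means that stats for the columns with field ids 1 through 4 were computed
on the first snapshot, and stats for the column with field id 6 on the
second.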

Storing the snapshot ids on a per-column basis is needed because COMPUTE
STATS can be set to calculate stats for only a subset of the columns,
and then a different subset in a subsequent run. The recency of the
stats will then be different for each column.

Storing the Iceberg field ids instead of column names makes the format
easier to handle as we do not need to take care of escaping special
characters.

The 'impala.computeStatsSnapshotIds' table property is deleted after
DROP STATS.

Note that this change does not yet modify how Impala chooses between
Puffin and HMS stats: that will be done in a separate change.

Testing:
 - Added tests in iceberg-compute-stats.test checking that
   'impala.computeStatsSnapshotIds' is set correctly and is deleted
   after DROP STATS
 - added unit tests in IcebergUtilTest.java that check the parsing and
   serialisation of the table property

Change-Id: Id9998b84c4fd20d1cf5e97a34f3553832ec70ae7
Reviewed-on: http://gerrit.cloudera.org:8080/22339
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-04-01 13:53:43 +00:00
Riza Suminto
7a5adc15ca IMPALA-13333: Limit memory estimation if PlanNode can spill
SortNode, AggregationNode, and HashJoinNode (the build side) can spill
to disk. However, their memory estimation does not consider this
capability and assumes they will hold all rows in memory. This causes
memory overestimation if cardinality is also overestimated. In reality,
query execution on a single host is often subject to a much lower memory
upper bound that it is not allowed to exceed.

This upper-bound is dictated by, but not limited to:
- MEM_LIMIT
- MEM_LIMIT_COORDINATORS
- MEM_LIMIT_EXECUTORS
- MAX_MEM_ESTIMATE_FOR_ADMISSION
- impala.admission-control.max-query-mem-limit.<pool_name>
  from admission control.

This patch adds a SpillableOperator interface that defines an alternative
to either PlanNode.computeNodeResourceProfile() or
DataSink.computeResourceProfile() when a lower memory upper bound can be
derived from the configs mentioned above. This interface is applied
to SortNode, AggregationNode, HashJoinNode, and JoinBuildSink.

The in-memory vs spill-to-disk bias is controlled through the
MEM_ESTIMATE_SCALE_FOR_SPILLING_OPERATOR option, a scale in
[0.0,1.0] that controls the estimated peak memory of query operators that
have spill-to-disk capabilities. Setting the value closer to 1.0 makes the
Planner bias towards keeping as many rows as possible in memory, while
setting the value closer to 0.0 makes the Planner bias towards spilling
rows to disk under memory pressure. Note that lowering
MEM_ESTIMATE_SCALE_FOR_SPILLING_OPERATOR can make a query that was
previously rejected by the Admission Controller become admittable, but it
may also spill to disk more and has a higher risk of exhausting scratch
space than before.
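
For example, to bias the planner halfway between the two extremes (the
value is illustrative):

  SET MEM_ESTIMATE_SCALE_FOR_SPILLING_OPERATOR=0.5;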

There are some caveats to this memory bounding patch:
- It checks if spill-to-disk is enabled in the coordinator, but
  individual backend executors might not have it configured. A mismatch of
  spill-to-disk configs between the coordinator and a backend executor,
  however, is rare and can be considered a misconfiguration.
- It does not check the actual total scratch space available to
  spill-to-disk. However, query execution will be forced to spill anyway
  if memory usage exceeds the three memory configs above. Raising
  MEM_LIMIT / MEM_LIMIT_EXECUTORS option can help increase memory
  estimation and increase the likelihood for the query to get assigned
  to a larger executor group set, which usually has a bigger total
  scratch space.
- The memory bound is divided evenly among all instances of a fragment
  kind on a single host. But in theory, they should be able to share
  and grow their memory usage independently beyond the memory estimate as
  long as the max memory reservation is not set.
- This does not consider other memory-related configs such as
  clamp_query_mem_limit_backend_mem_limit or disable_pool_mem_limits
  flag. But the admission controller will still enforce them if set.

Testing:
- Pass FE and custom cluster tests with core exploration.

Change-Id: I290c4e889d4ab9e921e356f0f55a9c8b11d0854e
Reviewed-on: http://gerrit.cloudera.org:8080/21762
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2025-03-13 18:38:27 +00:00
Peter Rozsa
167ced7844 IMPALA-13674: Enable MERGE statement for Iceberg tables with equality deletes
This change fixes the delete expression calculation for
IcebergMergeImpl. When an Iceberg table contains equality deletes, the
merge implementation now includes the data sequence number in the result
expressions, as the underlying tuple descriptor also includes it
implicitly. Without including this field, the row evaluation fails
because of the mismatched number of evaluators and slot descriptors.

Tests:
 - manually validated on an Iceberg table that contains equality deletes
 - e2e test added

Change-Id: I60e48e2731a59520373dbb75104d75aae39a94c1
Reviewed-on: http://gerrit.cloudera.org:8080/22423
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-03-06 19:20:52 +00:00
Zoltan Borok-Nagy
d928815d1a IMPALA-13654: Tolerate missing data files of Iceberg tables
Before this patch we got a TableLoadingException for missing data files.
This left the IcebergTable in an incomplete state in Impala's
memory, so we were not able to do any operation on it.

We should continue table loading in such cases, and only throw an
exception for queries that are about to read the missing data files.

This way ROLLBACK / DROP PARTITION, and some SELECT statements should
still work.

If Impala is running in strict mode via the CatalogD flag
--iceberg_allow_datafiles_in_table_location_only, and an Iceberg table
has data files outside of the table location, we still raise an exception
and leave the table in an unloaded state. To retain this behavior, the
IOException we threw is replaced with a TableLoadingException, which fits
logic errors better anyway.

Testing
 * added e2e tests

Change-Id: If753619d8ee1b30f018e90157ff7bdbe5d7f1525
Reviewed-on: http://gerrit.cloudera.org:8080/22367
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-03-03 17:50:04 +00:00
Noemi Pap-Takacs
79f51b018f IMPALA-12588: Don't UPDATE rows that already have the desired value
When UPDATEing an Iceberg or Kudu table, we should change as few rows
as possible. In the case of Iceberg tables this means writing as few new
data records and delete records as possible.
Therefore, if rows already have the new values we should just ignore
them. One way to achieve this is to add extra predicates, e.g.:

  UPDATE tbl SET k = 3 WHERE i > 4;
    ==>
  UPDATE tbl SET k = 3 WHERE i > 4 AND k IS DISTINCT FROM 3;

So we won't write new data/delete records for the rows that already have
the desired value.

Explanation on how to create extra predicates to filter out these rows:

If there are multiple assignments in the SET list, we can only skip
updating a row if all the mentioned values are already equal.
If any of the values needs to be updated, the entire row does.
Therefore we can think of the SET list as predicates connected with AND,
and all of them need to be taken into consideration.
To negate this SET list, we have to negate the individual SET
assignments and connect them with OR.
Then this new compound predicate is added to the original WHERE
predicates with an AND (if there were none, a WHERE predicate is created
from it).

                AND
              /     \
      original        OR
 WHERE predicate    /    \
                  !a       OR
                         /    \
                       !b     !c

This simple graph illustrates how the where predicate is rewritten.
(Considering an UPDATE statement that sets 3 columns.)
'!a', '!b' and '!c' are the negations of the individual assignments in
the SET list. So the extended WHERE predicate is:
(original WHERE predicate) AND (!a OR !b OR !c)
To handle NULL values correctly, we use IS DISTINCT FROM instead of
simply negating the assignment with operator '!='.

If the assignments contain UDFs, the result might be inconsistent
because of possible non-deterministic values or state in the UDFs, so
in that case we do not rewrite the WHERE predicate at all.

Evaluating expressions can be expensive, so this optimization
can be limited or switched off entirely using the query option
SKIP_UNNEEDED_UPDATES_COL_LIMIT. By default, there is no filtering
if more than 10 assignments are in the SET list.
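
For example, to keep the rewrite active for wider SET lists (the value is
illustrative; per the default above, SET lists with more than 10
assignments are not filtered):

  SET SKIP_UNNEEDED_UPDATES_COL_LIMIT=20;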

-------------------------------------------------------------------
Some performance measurements on a tpch lineitem table:

- predicates in HASH join, all updates can be skipped
Q1/[Q2] (Similar, but Q2 adds 4 extra items to the SET list):
update t set t.l_suppkey = s.l_suppkey,
[ t.l_partkey=s.l_partkey,
  t.l_quantity=s.l_quantity,
  t.l_returnflag=s.l_returnflag,
  t.l_shipmode=s.l_shipmode ]
from ice_lineitem t join ice_lineitem s
on t.l_orderkey=s.l_orderkey and t.l_linenumber=s.l_linenumber;

- predicates in HASH join, all rows need to be updated
Q3: update t set
 t.l_suppkey = s.l_suppkey,
 t.l_partkey=s.l_partkey,
 t.l_quantity=s.l_quantity,
 t.l_returnflag=s.l_returnflag,
 t.l_shipmode=concat(s.l_shipmode,' ')
 from ice_lineitem t join ice_lineitem s
 on t.l_orderkey=s.l_orderkey and t.l_linenumber=s.l_linenumber;

- predicates pushed down to the scanner, all rows updated
Q4/[Q5] (Similar, but Q5 adds 8 extra items to the SET list):
update ice_lineitem set
[ l_suppkey = l_suppkey + 0,
  l_partkey=l_partkey + 0,
  l_quantity=l_quantity,
  l_returnflag=l_returnflag,
  l_tax = l_tax,
  l_discount= l_discount,
  l_comment = l_comment,
  l_receiptdate = l_receiptdate, ]
l_shipmode=concat(l_shipmode,' ');

+=======+============+==========+======+
| Query | unfiltered | filtered | diff |
+=======+============+==========+======+
| Q1    |       4.1s |     1.9s | -54% |
+-------+------------+----------+------+
| Q2    |       4.2s |     2.1s | -50% |
+-------+------------+----------+------+
| Q3    |       4.3s |     4.7s | +10% |
+-------+------------+----------+------+
| Q4    |       3.0s |     3.0s | +0%  |
+-------+------------+----------+------+
| Q5    |       3.1s |     3.1s | +0%  |
+-------+------------+----------+------+

The results show that in the best case (we can skip all rows)
this change can bring a significant perf improvement of ~50%, since
0 rows were written. See Q1 and Q2.
If the predicates are evaluated in the join operator, but there were
no matches (worst case scenario), we can lose about 10% (Q3).
If all the predicates can be pushed down to the scanners, the change
does not seem to cause a significant difference (~0% in Q4 and Q5)
even if all rows have to be updated.

Testing:
 - Analysis
 - Planner
 - E2E
 - Kudu
 - Iceberg
 - testing the new query option: SKIP_UNNEEDED_UPDATES_COL_LIMIT
Change-Id: I926c80e8110de5a4615a3624a81a330f54317c8b
Reviewed-on: http://gerrit.cloudera.org:8080/22407
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-02-28 05:52:21 +00:00
stiga-huang
b410a90b47 IMPALA-12152: Add query options to wait for HMS events sync up
It's a common scenario to run Impala queries after the dependent
external changes are done. E.g. running COMPUTE STATS on a table after
Hive/Spark jobs ingest some new data to it. Currently, it's unsafe to
run the Impala queries immediately after the Hive/Spark jobs finish
since EventProcessor might have a long lag in applying the HMS events.
Note that running REFRESH/INVALIDATE on the table can also solve the
problem. But one of the motivations of EventProcessor is to get rid of
such Impala-specific commands.

This patch adds a mechanism to let query planning wait until the
metadata is synced up. Two new query options are added:
 - SYNC_HMS_EVENTS_WAIT_TIME_S configures the timeout in seconds for
   waiting. It's 0 by default, which disables the waiting mechanism.
 - SYNC_HMS_EVENTS_STRICT_MODE controls the behavior if we can't wait
   for metadata to be synced up, e.g. when the waiting times out or
   EventProcessor is in ERROR state. It defaults to false (non-strict
   mode). In the strict mode, the coordinator will fail the query. In the
   non-strict mode, the coordinator will start planning with a warning
   message in the profile (and in client outputs if the client consumes
   the get_log results, e.g. in impala-shell).

Example usage - query the table after inserting into dynamic partitions
in Hive. We don't know what partitions are modified so running REFRESH
in Impala is inefficient since it reloads all partitions.
  hive> insert into tbl partition(p) select * from tbl2;
  impala> set sync_hms_events_wait_time_s=300;
  impala> select * from tbl;
With this new feature, catalogd reloads the updated partitions based
on HMS events, which is more efficient than REFRESH. The wait time can
be set to the largest lag of event processing that has been observed in
the cluster. Note the lag of event processing is shown as the "Lag time"
in the /events page of the catalogd WebUI and as
"events-processor.lag-time" in the /metrics page. Users can monitor it to
get a sense of the lag.

Some timeline items are added in query profile for this waiting
mechanism, e.g.
A succeeded wait:
    Query Compilation: 937.279ms
       - Synced events from Metastore: 909.162ms (909.162ms)
       - Metadata of all 1 tables cached: 911.005ms (1.843ms)
       - Analysis finished: 919.600ms (8.595ms)

A failed wait:
    Query Compilation: 1s321ms
       - Continuing without syncing Metastore events: 40.883ms (40.883ms)
       - Metadata load started: 41.618ms (735.633us)

Added a histogram metric, impala-server.wait-for-hms-event-durations-ms,
to track the duration of this waiting.

--------
Implementation

A new catalogd RPC, WaitForHmsEvent, is added to the CatalogService API so
that the coordinator can wait until catalogd processes the latest event
when this RPC is triggered. Query planning starts or fails after this RPC
returns. The RPC request contains the potential dbs/tables that are
required by the query. Catalogd records the latest event id when it
receives this RPC. When the last synced event id reaches this id, catalogd
returns the catalog updates to the coordinator in the RPC response.
Before that, the RPC thread is in a waiting loop that sleeps in a
configurable interval. It's configured by a hidden flag,
hms_event_sync_sleep_interval_ms (defaults to 100).

Entry-point functions
 - Frontend#waitForHmsEvents()
 - CatalogServiceCatalog#waitForHmsEvent()

Some statements don't need to wait for HMS events, e.g. CREATE/DROP ROLE
statements. This patch adds an overridden method, requiresHmsMetadata(),
in each Statement to mark whether it can skip the HMS event sync.

Test side changes:
 - Some test code uses EventProcessorUtils.wait_for_event_processing()
   to wait for HMS events to be synced up before running a query. It
   is now updated to just set these new query options in the query.
 - Note that we still need wait_for_event_processing() in test code
   that verifies metrics after HMS events are synced up.

--------
Limitation

Currently, UPDATE_TBL_COL_STAT_EVENT, UPDATE_PART_COL_STAT_EVENT,
OPEN_TXN events are ignored by the event processor. If the latest event
happens to be of one of these types and there are no further events, the
last synced event id can never reach the latest event id. We need to fix
the last synced event id to also consider ignored events (IMPALA-13623).

The current implementation waits for the event id when the
WaitForHmsEvent RPC is received at catalogd side. We can improve it
by leveraging HIVE-27499 to efficiently detect whether the given
dbs/tables have unsynced events and just wait for the *largest* id
of them. Dbs/tables without unsynced events don't need to block query
planning. However, this only works for non-transactional tables.
Transactional tables might be modified by COMMIT_TXN or ABORT_TXN events
which don't have the table names. So even with HIVE-27499, we can't
determine whether a transactional table has pending events. IMPALA-13684
will target improving this for non-transactional tables.

Tests
 - Add test to verify planning waits until catalogd is synced with HMS
   changes.
 - Add test on the error handling when HMS event processing is disabled

Change-Id: I36ac941bb2c2217b09fcfa2eb567b011b38efa2a
Reviewed-on: http://gerrit.cloudera.org:8080/20131
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-02-22 18:14:37 +00:00
Michael Smith
1b6395b8db IMPALA-13627: Handle legacy Hive timezone conversion
After HIVE-12191, Hive has 2 different methods of calculating timestamp
conversion from UTC to local timezone. When Impala has
convert_legacy_hive_parquet_utc_timestamps=true, it assumes times
written by Hive are in UTC and converts them to local time using tzdata,
which matches the newer method introduced by HIVE-12191.

Some dates convert differently between the two methods, such as
Asia/Kuala_Lumpur or Singapore prior to 1982 (also seen in HIVE-24074).
After HIVE-25104, Hive writes 'writer.zone.conversion.legacy' to
distinguish which method is being used. As a result there are three
different cases we have to handle:
1. Hive prior to 3.1 used what’s now called “legacy conversion” using
   SimpleDateFormat.
2. Hive 3.1.2 (with HIVE-21290) used a new Java API that’s based on
   tzdata and added metadata to identify the timezone.
3. Hive 4 supports both, and added new file metadata to identify which
   is used.

Adds handling for Hive files (identified by created_by=parquet-mr) where
we can infer the correct handling from Parquet file metadata:
1. if writer.zone.conversion.legacy is present (Hive 4), use it to
   determine whether to use a legacy conversion method compatible with
   Hive's legacy behavior, or convert using tzdata.
2. if writer.zone.conversion.legacy is not present but writer.time.zone
   is, we can infer it was written by Hive 3.1.2+ using new APIs.
3. otherwise it was likely written by an earlier Hive version.

Adds a new CLI and query option - use_legacy_hive_timestamp_conversion -
to select what conversion method to use in the 3rd case above, when
Impala determines that the file was written by Hive older than 3.1.2.
Defaults to false to minimize changes in Impala's behavior and because
going through JNI is ~50x slower even when the results would not differ;
Hive defaults to true for its equivalent setting:
hive.parquet.timestamp.legacy.conversion.enabled.
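
For example, to match Hive's legacy behavior in that third case:

  SET use_legacy_hive_timestamp_conversion=true;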

Hive legacy-compatible conversion uses a Java method that would be
complicated to mimic in C++, doing roughly

  // timezone_string and date_time_string are the inputs
  DateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
  formatter.setTimeZone(TimeZone.getTimeZone(timezone_string));
  java.util.Date date = formatter.parse(date_time_string);
  formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
  return formatter.format(date);

IMPALA-9385 added a check against a Timezone pointer in
FromUnixTimestamp. That dominates the time in FromUnixTimeNanos,
overriding any benchmark gains from IMPALA-7417. Moves FromUnixTime to
allow inlining, and switches to using UTCPTR in the benchmark - as
IMPALA-9385 did in most other code - to restore benchmark results.

Testing:
- Adds JVM conversion method to convert-timestamp-benchmark.
- Adds tests for several cases from Hive conversion tests.

Change-Id: I1271ed1da0b74366ab8315e7ec2d4ee47111e067
Reviewed-on: http://gerrit.cloudera.org:8080/22293
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
2025-02-18 16:33:39 +00:00
jasonmfehr
aac67a077e IMPALA-13201: System Table Queries Execute When Admission Queues are Full
Queries that run only against in-memory system tables are currently
subject to the same admission control process as all other queries.
Since these queries do not use any resources on executors, admission
control does not need to consider the state of executors when
deciding to admit these queries.

This change adds a boolean configuration option 'onlyCoordinators'
to the fair-scheduler.xml file for specifying that a request pool
applies only to the coordinators. When a query is submitted to a
coordinator-only request pool, no executors are required to be
running. Instead, all fragment instances are executed exclusively on
the coordinators.

A new member was added to the ClusterMembershipMgr::Snapshot struct
to hold the ExecutorGroup of all coordinators. This object is kept up
to date by processing statestore messages and is used when executing
queries that either require the coordinators (such as queries against
sys.impala_query_live) or that use a coordinator-only request pool.

Testing was accomplished by:
1. Adding cluster membership manager ctests to assert cluster
   membership manager correctly builds the list of non-quiescing
   coordinators.
2. RequestPoolService JUnit tests to assert the new optional
   <onlyCoords> config in the fair scheduler xml file is correctly
   parsed.
3. ExecutorGroup ctests modified to assert the new function.
4. Custom cluster admission controller tests to assert queries with a
   coordinator only request pool only run on the active coordinators.

Change-Id: I5e0e64db92bdbf80f8b5bd85d001ffe4c8c9ffda
Reviewed-on: http://gerrit.cloudera.org:8080/22249
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-02-14 04:27:11 +00:00
Yida Wu
80a45014ea IMPALA-13703: Cancel running queries before shutdown deadline
Currently, when the graceful shutdown deadline is reached, Impala
daemon exits immediately, leaving any running queries unfinished.
This approach is not quite graceful, as it may result in unreleased
resources, such as scratch files in remote storage.

This patch adds a new state to the graceful shutdown process.
Before reaching the shutdown deadline, the Impala daemon will try to
cancel any remaining running queries within a configurable time limit
set by the flag shutdown_query_cancel_period_s. If this time limit
exceeds 20% of the total shutdown deadline, it will be automatically
capped at that value. The idea is to cancel queries only near the
end of the graceful shutdown deadline; the 20% threshold allows us to
take a more aggressive approach there to ensure a graceful
shutdown.

If all queries are successfully canceled within this period, the
server shuts down immediately. Otherwise, it shuts down once the
deadline is reached, with queries still running.
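
A rough usage sketch (the flag values are illustrative; :shutdown() is
the existing impala-shell command that initiates a graceful shutdown):

  # 10-minute deadline; cancel remaining queries in the last 2 minutes
  impalad ... --shutdown_deadline_s=600 --shutdown_query_cancel_period_s=120
  # later, from impala-shell, trigger the graceful shutdown
  :shutdown();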

Tests:
Passed core tests.
Added testcases test_shutdown_coordinator_cancel_query and
test_shutdown_executor_with_query_cancel_period and
test_shutdown_coordinator_and_executor_cancel_query.
Manually tested shutting down a coordinator or an executor with
long-running queries, and they were canceled.

Change-Id: I1cac2e100d329644e21fdceb0b23901b08079130
Reviewed-on: http://gerrit.cloudera.org:8080/22422
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-02-07 10:12:54 +00:00
Andrew Sherman
35c6a0b76d IMPALA-13726 Add admission control slots to /queries page in webui
When tracking resource utilization it is useful to see how many
admission control slots are being used by queries. Add the slots used
by coordinator and executors to the webui queries tables. For
implementation reasons this entails also adding these fields to the
query history and live query tables.

The executor admission control slots are calculated by looking at a
single executor backend. In theory this single number could be
misleading but in practice queries are expected to have symmetrical
slots across executors.

Bump the schema number for the query history schema, and add some new
tests.

Change-Id: I057493b7767902a417dfeb75cdaeffd452d66789
Reviewed-on: http://gerrit.cloudera.org:8080/22443
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-02-05 03:53:18 +00:00
pranavyl
a61b90f860 IMPALA-13039: AES Encryption/ Decryption Support in Impala
AES (Advanced Encryption Standard) crypto functions are a
widely recognized and respected family of encryption primitives used to
protect sensitive data. They operate by transforming plaintext data into
ciphertext using a symmetric key, ensuring confidentiality and
integrity. The standard specifies the Rijndael algorithm, a symmetric
block cipher that can process data blocks of 128 bits, using cipher
keys with lengths of 128 and 256 bits. The patch makes use of the
EVP_*() algorithms from the OpenSSL library.

The patch includes:
1. AES-GCM, AES-CTR, and AES-CFB encryption functionalities and
AES-GCM, AES-ECB, AES-CTR, and AES-CFB decryption functionalities.
2. Support for both 128-bit and 256-bit key sizes for GCM and ECB modes.
3. Enhancements to EncryptionKey class to accommodate various AES modes.

The aes_encrypt() and aes_decrypt() functions serve as entry
points for encryption and decryption operations, handling
encryption and decryption based on user-provided keys, AES modes,
and initialization vectors (IVs). The implementation includes key
length validation and IV vector size checks to ensure data
integrity and confidentiality.

Multiple AES modes: GCM, CFB, CTR for encryption, and GCM, CFB, CTR
and ECB for decryption are supported to provide flexibility and
compatibility with various use cases and OpenSSL features. AES-GCM
is set as the default mode due to its strong security properties.
AES-CTR and AES-CFB are provided as fallbacks for environments where
AES-GCM may not be supported. Note that AES-GCM is not available in
OpenSSL versions prior to 1.0.1, so having multiple methods ensures
broader compatibility.

Testing: The patch is thoroughly tested and the tests are included in
exprs.test.

Change-Id: I3902f2b1d95da4d06995cbd687e79c48e16190c9
Reviewed-on: http://gerrit.cloudera.org:8080/20447
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-27 22:16:37 +00:00
stiga-huang
2e59bbae37 IMPALA-12785: Add commands to control event-processor status
This patch extends the existing AdminFnStmt to support operations on
EventProcessor. E.g. to pause the EventProcessor:
  impala-shell> :event_processor('pause');
to resume the EventProcessor:
  impala-shell> :event_processor('start');
Or to resume the EventProcessor on a given event id (1000):
  impala-shell> :event_processor('start', 1000);
Admin can also resume the EventProcessor at the latest event id by using
-1:
  impala-shell> :event_processor('start', -1);

Supported command actions in this patch: pause, start, status.

The command output of all actions will show the latest status of
EventProcessor, including
 - EventProcessor status:
   PAUSED / ACTIVE / ERROR / NEEDS_INVALIDATE / STOPPED / DISABLED.
 - LastSyncedEventId: The last HMS event id which we have synced to.
 - LatestEventId: The event id of the latest event in HMS.

Example output:
[localhost:21050] default> :event_processor('pause');
+--------------------------------------------------------------------------------+
| summary                                                                        |
+--------------------------------------------------------------------------------+
| EventProcessor status: PAUSED. LastSyncedEventId: 34489. LatestEventId: 34489. |
+--------------------------------------------------------------------------------+
Fetched 1 row(s) in 0.01s

If authorization is enabled, only admin users that have ALL privilege on
SERVER can run this command.

Note that there is a restriction in MetastoreEventsProcessor#start(long)
that resuming EventProcessor back to a previous event id is only allowed
when it's not in the ACTIVE state. This patch aims to expose the control
of EventProcessor to the users so MetastoreEventsProcessor is not
changed. We can investigate the restriction and see if we want to relax
it.

Note that resuming EventProcessor at a newer event id can be done on any
states. Admins can use this to manually resolve the lag of HMS event
processing, after they have made sure all (or important) tables are
manually invalidated/refreshed.

A new catalogd RPC, SetEventProcessorStatus, is added for coordinators
to control the status of EventProcessor.

Tests
 - Added e2e tests

Change-Id: I5a19f67264cfe06a1819a22c0c4f0cf174c9b958
Reviewed-on: http://gerrit.cloudera.org:8080/22250
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-24 04:10:14 +00:00
Daniel Becker
c5b474d3f5 IMPALA-13594: Read Puffin stats also from older snapshots
Before this change, Puffin stats were only read from the current
snapshot. Now we also consider older snapshots, and for each column we
choose the most recent available stats. Note that this means that the
stats for different columns may come from different snapshots.

In case there are both HMS and Puffin stats for a column, the more
recent one will be used - for HMS stats we use the
'impala.lastComputeStatsTime' table property, and for Puffin stats we
use the snapshot timestamp to determine which is more recent.

This commit also renames the startup flag 'disable_reading_puffin_stats'
to 'enable_reading_puffin_stats' and the table property
'impala.iceberg_disable_reading_puffin_stats' to
'impala.iceberg_read_puffin_stats' to make them more intuitive. The
default values are flipped to keep the same behaviour as before.

The documentation of Puffin reading is updated in
docs/topics/impala_iceberg.xml

Testing:
 - updated existing test cases and added new ones in
   test_iceberg_with_puffin.py
 - reorganised the tests in TestIcebergTableWithPuffinStats in
   test_iceberg_with_puffin.py: tests that modify table properties and
   other state that other tests rely on are now run separately to
   provide a clean environment for all tests.

Change-Id: Ia37abe8c9eab6d91946c8f6d3df5fb0889704a39
Reviewed-on: http://gerrit.cloudera.org:8080/22177
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-23 15:25:59 +00:00
Xuebin Su
d7ee509e93 IMPALA-12648: Add KILL QUERY statement
To support killing queries programmatically, this patch adds a new
type of SQL statement, the KILL QUERY statement, to cancel and
unregister a query on any coordinator in the cluster.

A KILL QUERY statement looks like
```
KILL QUERY '123:456';
```
where `123:456` is the query id of the query we want to kill. We follow
syntax from HIVE-17483. For backward compatibility, 'KILL' and 'QUERY'
are added as "unreserved keywords", like 'DEFAULT'. This allows the
three keywords to be used as identifiers.

A user is authorized to kill a query only if the user is an admin or is
the owner of the query. KILL QUERY statements are not affected by
admission control.

Implementation:

Since we don't know in advance which impalad is the coordinator of the
query we want to kill, we need to broadcast the kill request to all the
coordinators in the cluster. Upon receiving a kill request, each
coordinator checks whether it is the coordinator of the query:
- If yes, it cancels and unregisters the query,
- If no, it reports "Invalid or unknown query handle".

Currently, a KILL QUERY statement is not interruptible. IMPALA-13663 is
created for this.

For authorization, this patch adds a custom handler of
AuthorizationException for each statement to allow the exception to be
handled by the backend. This is because we don't know whether the user
is the owner of the query until we reach its coordinator.

To support cancelling child queries, this patch changes
ChildQuery::Cancel() to bypass the HS2 layer so that the session of the
child query will not be added to the connection used to execute the
KILL QUERY statement.

Testing:
- A new ParserTest case is added to test using "unreserved keywords" as
  identifiers.
- New E2E test cases are added for the KILL QUERY statement.
- Added a new dimension in TestCancellation to use the KILL QUERY
  statement.
- Added file tests/common/cluster_config.py and made
  CustomClusterTestSuite.with_args() composable so that common cluster
  configs can be reused in custom cluster tests.

Change-Id: If12d6e47b256b034ec444f17c7890aa3b40481c0
Reviewed-on: http://gerrit.cloudera.org:8080/21930
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
2025-01-22 22:22:54 +00:00
Riza Suminto
3118e41c26 IMPALA-2945: Account for duplicate keys on multiple nodes preAgg
AggregationNode.computeStats() estimates cardinality under a single-node
assumption. This can be an underestimation in the preaggregation node
case because the same grouping key may exist on multiple nodes during
preaggregation.

This patch adjusts the cardinality estimate using the following model for
the number of distinct values in a random sample of k rows, previously
used to calculate the ProcessingCost model by IMPALA-12657 and
IMPALA-13644.

Assumes we are picking k rows from an infinite sample with ndv distinct
values, with the value uniformly distributed. The probability of a given
value not appearing in a sample, in that case is

((NDV - 1) / NDV) ^ k

This is because we are making k choices, and each of them has
(ndv - 1) / ndv chance of not being our value. Therefore the
probability of a given value appearing in the sample is:

1 - ((NDV - 1) / NDV) ^ k

And the number of distinct values in the sample is:

(1 - ((NDV - 1) / NDV) ^ k) * NDV
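
For example, with NDV = 1000 and k = 1000 sampled rows,
((NDV - 1) / NDV) ^ k = (999/1000)^1000 is roughly 0.368, so the expected
number of distinct values in the sample is about (1 - 0.368) * 1000,
roughly 632 rather than the full 1000.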

Query option ESTIMATE_DUPLICATE_IN_PREAGG is added to control whether to
use the new estimation logic or not.

Testing:
- Pass core tests.

Change-Id: I04c563e59421928875b340cb91654b9d4bc80b55
Reviewed-on: http://gerrit.cloudera.org:8080/22047
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-17 20:22:03 +00:00
Yida Wu
702131b677 IMPALA-13565: Add general AI platform support to ai_generate_text
Currently only OpenAI sites are allowed for ai_generate_text();
this patch adds support for general AI platforms to
the ai_generate_text function. It introduces a new flag,
ai_additional_platforms, allowing Impala to access additional
AI platforms. For these general AI platforms, only the openai
standard is supported, and the default api credential serves as
the api token for general platforms.

The ai_api_key_jceks_secret parameter has been renamed to
auth_credential to support passing both plain text and jceks
encrypted secrets.

A new impala_options parameter is added to ai_generate_text() to
enable future extensions. Adds the api_standard option to
impala_options, with "openai" as the only supported standard.
Adds the credential_type option to impala_options to allow
plain text as the token; by default it is set to jceks.
Adds the payload option to impala_options for customized
payload input. If set, the request will use the provided
customized payload directly, and the response will follow the
openai standard for parsing. The customized payload size must not
exceed 5MB.

Adding the impala_options parameter to ai_generate_text() should
be fine for backward compatibility, as this is a relatively new
feature.

Example:
1. Add the site to ai_additional_platforms, like:
ai_additional_platforms='new_ai.site,new_ai.com'
2. Example sql:
select ai_generate_text("https://new_ai.com/v1/chat/completions",
"hello", "model-name", "ai-api-token", "platform params",
'{"api_standard":"openai", "credential_type":"plain",
"payload":"payload content"}')

Tests:
Added a new test AiFunctionsTestAdditionalSites.
Manually tested the example with the Cloudera AI platform.
Passed core and asan tests.

Change-Id: I4ea2e1946089f262dda7ace73d5f7e37a5c98b14
Reviewed-on: http://gerrit.cloudera.org:8080/22130
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
2025-01-16 19:12:07 +00:00
gaurav1086
c3cbd79b56 IMPALA-13288: OAuth AuthN Support for Impala
This patch added OAuth support with the following functionality:
 * Load and parse the OAuth JWKS from a configured JSON file or url.
 * Read the OAuth access token from the HTTP header, which uses
   the same format as the JWT Authorization Bearer token.
 * Verify the OAuth token's signature with the public key in the JWKS.
 * Get the username out of the payload of the OAuth access token.
 * If kerberos or ldap is enabled, then both jwt and oauth are
   supported together. Otherwise only one of jwt or oauth is supported.
   This has been a pre-existing flow for jwt, so OAuth will follow
   the same policy.
 * Impala Shell side changes: OAuth options -a and --oauth_cmd

Testing:
 - Added 3 custom cluster be test in test_shell_jwt_auth.py:
   - test_oauth_auth_valid: authenticate with valid token.
   - test_oauth_auth_expired: authentication failure with
     expired token.
   - test_oauth_auth_invalid_jwk: authentication failure with
     valid signature but expired.
 - Added 1 custom cluster fe test in JwtWebserverTest.java
   - testWebserverOAuthAuth: Basic tests for OAuth
 - Added 1 custom cluster fe test in LdapHS2Test.java
   - testHiveserver2JwtAndOAuthAuth: tests all combinations of
     jwt and oauth token verification with separate jwks keys.
 - Manually tested with a valid, invalid and expired oauth
   access token.
 - Passed core run.

Change-Id: I65dc8db917476b0f0d29b659b9fa51ebaf45b7a6
Reviewed-on: http://gerrit.cloudera.org:8080/21728
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-15 03:32:57 +00:00
Riza Suminto
23edbde7c7 IMPALA-13637: Add ENABLE_TUPLE_ANALYSIS_IN_AGGREGATE option
IMPALA-13405 adds a new tuple-analysis algorithm in AggregationNode to
lower cardinality estimation when planning multi-column grouping. This
patch adds a query option ENABLE_TUPLE_ANALYSIS_IN_AGGREGATE that allows
users to enable/disable the algorithm if necessary. Default is True.
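
For example, to fall back to the previous estimation behavior:

  SET ENABLE_TUPLE_ANALYSIS_IN_AGGREGATE=false;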

Testing:
- Add testAggregationNoTupleAnalysis.
  This test is based on TpcdsPlannerTest#testQ19 but with
  ENABLE_TUPLE_ANALYSIS_IN_AGGREGATE set to false.

Change-Id: Iabd8daa3d9414fc33d232643014042dc20530514
Reviewed-on: http://gerrit.cloudera.org:8080/22294
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-06 21:32:45 +00:00
Peter Rozsa
19110b490d IMPALA-13362: Implement WHEN NOT MATCHED BY SOURCE syntax for MERGE statement
This change adds support for a new MERGE clause that covers the
condition when the source statement's rows do not match the target
table's rows. Example: MERGE INTO target t USING source s ON t.id = s.id
WHEN NOT MATCHED BY SOURCE THEN UPDATE SET t.column = "a";

This change also adds support for using WHEN NOT MATCHED BY TARGET
explicitly; this is equivalent to WHEN NOT MATCHED.

Tests:
 - Parser tests for the new language elements.
 - Analyzer and planner test for WHEN NOT MATCHED BY SOURCE/TARGET
   clauses.
 - E2E tests for WHEN NOT MATCHED BY SOURCE clause.

Change-Id: Ia0e0607682a616ef6ad9eccf499dc0c5c9278c5f
Reviewed-on: http://gerrit.cloudera.org:8080/21988
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-02 15:02:20 +00:00
Andrew Sherman
de6b902581 IMPALA-12345: Add user quotas to Admission Control
Allow administrators to configure per user limits on queries that can
run in the Impala system.

In order to do this, there are two parts. Firstly we must track the
total counts of queries in the system on a per-user basis. Secondly
there must be a user model that allows rules that control per-user
limits on the number of queries that can be run.

In a Kerberos environment the user names that are used for both the user
model and at runtime are short user names, e.g. testuser when the
Kerberos principal is testuser/scm@EXAMPLE.COM

TPoolStats (the data that is shared between Admission Control instances)
is extended to include a map from user name to a count of queries
running. This (along with some derived data structures) is updated when
queries are queued and when they are released from Admission Control.
This lifecycle is slightly different from other TPoolStats data which
usually tracks data about queries that are running. Queries can be
rejected because of user quotas at submission time. This is done for
two reasons: (1) queries can only be admitted from the front of the
queue and we do not want to block other queries due to quotas, and
(2) it is easy for users to understand what is going on when queries
are rejected at submission time.

Note that when running in configurations without an Admission Daemon
then Admission Control does not have perfect information about the
system and over-admission is possible for User-Level Admission Quotas
in the same way that it is for other Admission Control controls.

The User Model is implemented by extending the format of the
fair-scheduler.xml file. The rules controlling the per-user limits are
specified in terms of user or group names.

Two new elements ‘userQueryLimit’ and ‘groupQueryLimit’ can be added to
the fair-scheduler.xml file. These elements can be placed on the root
configuration, which applies to all pools, or the pool configuration.
The ‘userQueryLimit’ element has 2 child elements: "user"
and "totalCount". The 'user' element contains the short names of users,
and can be repeated, or have the value "*" for a wildcard name which
matches all users. The ‘groupQueryLimit’ element has 2 child
elements: "group" and "totalCount". The 'group' element contains group
names.
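
A rough sketch of what such a configuration might look like (the
surrounding <allocations>/<queue> nesting, user, group, and pool names
are illustrative; the element names come from the description above):

  <allocations>
    <!-- Root level: applies to all pools. 'alice' may run 10 concurrent
         queries; every other user may run 5. -->
    <userQueryLimit>
      <user>alice</user>
      <totalCount>10</totalCount>
    </userQueryLimit>
    <userQueryLimit>
      <user>*</user>
      <totalCount>5</totalCount>
    </userQueryLimit>
    <queue name="root.poolA">
      <!-- Pool level: members of group 'analysts' may run at most 3
           concurrent queries in this pool. -->
      <groupQueryLimit>
        <group>analysts</group>
        <totalCount>3</totalCount>
      </groupQueryLimit>
    </queue>
  </allocations>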

The root level rules and pool level rules must both be passed for a new
query to be queued. The rules dictate a maximum number of queries that
can be run by a user. When evaluating rules at either the root level or
the pool level, once a rule matches a user no further evaluation is
done.

To support reading the ‘userQueryLimit’ and ‘groupQueryLimit’ fields the
RequestPoolService is enhanced.

If user quotas are enabled for a pool then a list of the users with
running or queued queries in that pool is visible on the coordinator
webui admission control page.

More comprehensive documentation of the user model will be provided in
IMPALA-12943

TESTING

New end-to-end tests are added to test_admission_controller.py, and
admission-controller-test is extended to provide unit tests for the
user model.

Change-Id: I4c33f3f2427db57fb9b6c593a4b22d5029549b41
Reviewed-on: http://gerrit.cloudera.org:8080/21616
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-11-16 06:38:38 +00:00
Noemi Pap-Takacs
88e0e4e8ba IMPALA-13334: Fix test_sort.py DCHECK hit when max_sort_run_size>0
test_sort.py declared the 'max_sort_run_size' query option, but
silently did not exercise it. Fixing the query option declaration
in IMPALA-13349 using the helper function add_exec_option_dimension()
revealed a DCHECK failure in sorter.cc. In some cases the length
of an in-memory run could exceed 'max_sort_run_size' by 1 page.

This patch fixed the DCHECK failure by strictly enforcing the
'max_sort_run_size' limit.
Memory limits were also adjusted in test_sort.py according to
the memory usage of the different sort run sizes.

Additionally, the 'MAX_SORT_RUN_SIZE' query option's valid range was
relaxed. Instead of throwing an error, negative values now also disable
the run size limitation, just like the default value '0'.

Testing:
- E2E tests in sort.py
- set test

Change-Id: I943d8edcc87df168448a174d6c9c6b46fe960eae
Reviewed-on: http://gerrit.cloudera.org:8080/21777
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-11-05 15:06:52 +00:00
jasonmfehr
7b6ccc644b IMPALA-12737: Query columns in workload management tables.
Adds "Select Columns", "Where Columns", "Join Columns", "Aggregate
Columns", and "OrderBy Columns" to the query profile and the workload
management active/completed queries tables. These fields are
presented as comma-separated strings containing the fully qualified
column names in the format database.table_name.column_name. Aggregate
columns include all columns in the order by and having clauses.

Since new columns are being added, the workload management init
process is also being modified to allow for one-way upgrades of the
table schemas if necessary.  Additionally, workload management can be
set up to run under a schema version that is not the latest. This
ability will be useful during troubleshooting. To enable these
upgrades, the workload management initialization that manages the
structure of the tables has been moved to the catalogd.

The changes in this patch must be backwards compatible so that Impala
clusters running previous workload management code can co-exist with
Impala clusters running this workload management code. To enable that
backwards compatibility, a new table property named
'wm_schema_version' is now used to track the schema version of the
workload management tables. Thus, the old property 'schema_version'
will always be set to '1.0.0' since modifying that property value
causes Impala running previous workload management code to error at
startup.

Testing accomplished by
* Adding/updating workload and custom cluster tests to assert the new
  columns and the workload management upgrade process.
* JUnit tests added to verify the new workload management columns are
  being correctly parsed.
* GTests added to ensure the workload management columns are
  correctly defined and in the correct order.

Change-Id: I78f3670b067c0c192ee8a212fba95466fbcb51d7
Reviewed-on: http://gerrit.cloudera.org:8080/21142
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2024-10-31 17:06:43 +00:00
Daniel Becker
b05b408f17 IMPALA-13247: Support Reading Puffin files for the current snapshot
This change adds support for reading NDV statistics from Puffin files
when they are available for the current snapshot. Puffin files or blobs
that were written for other snapshots than the current one are ignored.
Because this behaviour is different from what we have for HMS stats and
may therefore be unintuitive for users, reading Puffin stats is disabled
by default; set the "--disable_reading_puffin_stats" startup flag to
false to enable it.

When Puffin stats reading is enabled, the NDV values read from Puffin
files take precedence over NDV values stored in the HMS. This is because
we only read Puffin stats for the current snapshot, so these values are
always up-to-date, while the values in the HMS may be stale.

Note that it is currently not possible to drop Puffin stats from Impala.
For this reason, this patch also introduces two ways of disabling the
reading of Puffin stats:
  - globally, with the aforementioned "--disable_reading_puffin_stats"
    startup flag: when it is set to true, Impala will never read Puffin
    stats
  - for specific tables, by setting the
    "impala.iceberg_disable_reading_puffin_stats" table property to
    true.
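
For example, reading can be turned off for a single table with standard
ALTER TABLE syntax (the table name is illustrative):

  ALTER TABLE ice_tbl SET TBLPROPERTIES
    ('impala.iceberg_disable_reading_puffin_stats'='true');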

Note that this change is only about reading Puffin files, Impala does
not yet support writing them.

Testing:
 - created the PuffinDataGenerator tool which can generate Puffin files
   and metadata.json files for different scenarios (e.g. all stats are
   in the same Puffin file; stats for different columns are in different
   Puffin files; some Puffin files are corrupt etc.). The generated
   files are under the "testdata/ice_puffin/generated" directory.
 - The new custom cluster test class
   'test_iceberg_with_puffin.py::TestIcebergTableWithPuffinStats' uses
   the generated data to test various scenarios.
 - Added custom cluster tests that test the
   'disable_reading_puffin_stats' startup flag.

Change-Id: I50c1228988960a686d08a9b2942e01e366678866
Reviewed-on: http://gerrit.cloudera.org:8080/21605
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-10-19 22:14:59 +00:00
Peter Rozsa
df3a38096e IMPALA-12861: Fix mixed file format listing for Iceberg tables
This change fixes file format information collection for Iceberg
tables. Previously, all file descriptors' file formats were collected
from getSampledOrRawPartitions() in HdfsScanNode for Iceberg tables; now
the collection part is extracted as a method and overridden in
IcebergScanNode. Now, only the to-be-scanned file descriptors' file
formats are recorded, showing the correct file formats for each SCAN node
in the plan.

Tests:
 - Planner tests added for mixed file format table with partitioning.

Change-Id: Ifae900914a0d255f5a4d9b8539361247dfeaad7b
Reviewed-on: http://gerrit.cloudera.org:8080/21871
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-10-18 13:24:42 +00:00
Joe McDonnell
7a1fe97ae1 IMPALA-13412: Use the BYTES type for tuple cache entries-in-use-bytes
The impala.tuple-cache.entries-in-use-bytes metric currently
uses the UNIT type. This switches it to the BYTES type so that
the entry in the WebUI displays it in a clearer way (e.g. GBs).
This switches impala.codegen-cache.entries-in-use-bytes the same
way.

Testing:
 - Ran a core job

Change-Id: Ib02bbc1c7e796c8d83c58fa12d5c18012753e300
Reviewed-on: http://gerrit.cloudera.org:8080/21939
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-10-16 23:21:49 +00:00
Yida Wu
0767ae065a IMPALA-12908: (Addendum) use RUNTIME_FILTER_WAIT_TIME_MS for tuple cache TPC testing
When runtime filters arrive after tuple caching has occurred, they
can't filter the cached results. This can lead to larger tuple caching
result sets than expected, causing correctness check failures in TPC
tests.

While other solutions may exist, extending RUNTIME_FILTER_WAIT_TIME_MS
is a simple fix that ensures runtime filters are applied before tuple
caching.

Also sets the query option enable_tuple_cache_verification to false
by default, as the filter arrival time may affect the correctness
check. To avoid flaky tests, this change uses a more conservative
approach and only enables the correctness check when it is explicitly
specified by the test case.
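
As an illustrative sketch (option values and tables are made up), a
TPC-style test would now run along these lines:

  SET RUNTIME_FILTER_WAIT_TIME_MS=10000;     -- wait for filters before scanning
  SET ENABLE_TUPLE_CACHE_VERIFICATION=true;  -- opt in to the correctness check
  SELECT count(*) FROM store_sales ss JOIN date_dim d
    ON ss.ss_sold_date_sk = d.d_date_sk WHERE d.d_year = 2000;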

Tests:
Verified TPC tests pass correctness checks with increased runtime
filter wait time.

Change-Id: Ie70a87344c436ce8e2073575df5c5bf762ef562d
Reviewed-on: http://gerrit.cloudera.org:8080/21898
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-10-15 23:38:03 +00:00
Michael Smith
b6b953b48e IMPALA-13186: Tag query option scope for tuple cache
Constructs a hash of non-default query options that are relevant to
query results; by default query options are included in the hash.
Passes this hash to the frontend for inclusion in the tuple cache key
on plan leaf nodes (which will be included in parent key calculation).
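
As a rough sketch of the intended effect (the query and option are
chosen purely for illustration; whether a given option is hashed
depends on how it is tagged), two otherwise identical queries that
differ in a non-default, result-relevant option should land on
different cache entries:

  -- first run with default options populates a cache entry
  SELECT l_returnflag, count(*) FROM tpch_parquet.lineitem GROUP BY l_returnflag;
  -- a non-default, non-exempt option changes the query option hash,
  -- so the same query text maps to a different tuple cache key
  SET TIMEZONE='UTC';
  SELECT l_returnflag, count(*) FROM tpch_parquet.lineitem GROUP BY l_returnflag;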

Modifies MurmurHash3 to be re-entrant so the backend can construct a
hash incrementally. This is slightly slower but more memory efficient
than accumulating all hash inputs in a contiguous array first.

Adds TUPLE_CACHE_EXEMPT_QUERY_OPT_FN to mark query options that can be
ignored when calculating a tuple cache hash.

Adds startup flag 'tuple_cache_exempt_query_options' as a safety valve
for query options that might be important to exempt that we missed.

Removes duplicate printing logic for query options from child-query.cc
in favor of re-using TQueryOptionsToMap, which does the same thing.

Cleans up query-options.cc helpers so they're static and reduces
duplicate printing logic.

Adds test that different values for a relevant query option use
different cache entries. Adds startup flag
'tuple_cache_ignore_query_options' to omit query options for testing
certain tuple cache failure modes, where we need to use debug actions.

Change-Id: I1f4802ad9548749cd43df8848b6f46dca3739ae7
Reviewed-on: http://gerrit.cloudera.org:8080/21698
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-10-09 17:21:10 +00:00
Yida Wu
f11172a4a2 IMPALA-12908: Add correctness check for tuple cache
The patch adds an automated correctness check for the tuple cache.
The feature verifies the correctness of the tuple cache by comparing
cache entries with the same key across different queries.

The feature consists of two main components: cache dumping and
runtime correctness validation.

During the cache dumping phase, if a tuple cache is detected,
we retrieve the cache from the global cache and dump it to a
subdirectory as a reference file within the specified debug
dumping directory. The subdirectory is named after the cache key.
Additionally, data from the child is also read and
dumped to a separate file in the same directory. We expect
these two files to be identical, assuming the results are
deterministic. For non-deterministic cases like TOP-N or others,
we may detect them and exclude them from dumping later.
Furthermore, the cache data will be transformed into a
human-readable text format on a row-by-row basis before dumping.
This approach allows for easier investigation and later analysis.

The verification process starts by comparing the entire content of
the files sharing the same key. If the content matches, the
verification is considered successful. However, if the content
doesn't match, we enter a slower mode where we compare all the
rows individually. In the slow mode, we will create a hash map
from the reference cache file, then iterate the current cache
file row by row and search if every row exists in the hash map.
Additionally, a counter is integrated into the hash map to
handle scenarios involving duplicated rows. Once verification is
complete, if no discrepancies are found, both files will be removed.
If discrepancies are detected, the files will be kept and appended
with a '.bad' postfix.

New startup flag and query option:
Added a startup flag tuple_cache_debug_dump_dir for specifying
the directory for dumping the result caches. If
tuple_cache_debug_dump_dir is empty, the feature is disabled.

Added a query option enable_tuple_cache_verification to enable
or disable the tuple cache verification. Default is true. Only
valid when tuple_cache_debug_dump_dir is specified.
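
A hypothetical way to exercise this (directory and query are made up
for illustration):

  -- assume impalad was started with --tuple_cache_debug_dump_dir=/tmp/tuple_cache_dump
  SET ENABLE_TUPLE_CACHE_VERIFICATION=true;
  SELECT c_custkey, c_name FROM tpch_parquet.customer ORDER BY c_custkey LIMIT 10;
  -- matching dumps under /tmp/tuple_cache_dump/<cache key>/ are removed after
  -- verification; mismatches are kept with a '.bad' postfix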

Tests:
Ran the testcase test_tuple_cache_tpc_queries and caught known
inconsistencies.

Change-Id: Ied074e274ebf99fb57e3ee41a13148725775b77c
Reviewed-on: http://gerrit.cloudera.org:8080/21754
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2024-09-30 16:25:19 +00:00
Riza Suminto
93c64e7e9a IMPALA-13376: Add docs for AGG_MEM_CORRELATION_FACTOR etc
This patch adds documentation for AGG_MEM_CORRELATION_FACTOR and
LARGE_AGG_MEM_THRESHOLD option introduced in Apache Impala 4.4.0.

IMPALA-12548 fixed the behavior of AGG_MEM_CORRELATION_FACTOR. A
higher value lowers the memory estimation, while a lower value results
in a higher memory estimation. The documentation in
ImpalaService.thrift, however, says the opposite. This patch fixes the
documentation in the thrift file as well.
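
For example (values picked only to show the direction of the effect),
after this fix the documented behavior matches what the planner does:

  -- a higher correlation factor yields a lower memory estimate for the agg node
  SET AGG_MEM_CORRELATION_FACTOR=1.0;
  EXPLAIN SELECT ss_item_sk, count(*) FROM store_sales GROUP BY ss_item_sk;
  -- a lower correlation factor yields a higher memory estimate
  SET AGG_MEM_CORRELATION_FACTOR=0.0;
  EXPLAIN SELECT ss_item_sk, count(*) FROM store_sales GROUP BY ss_item_sk;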

Testing:
- Run "make plain-html" in docs/ dir and confirm the output.
- Manually check with comments in
  PlannerTest.testAggNodeMaxMemEstimate()

Change-Id: I00956a50fb7616ca3c3ea2fd75fd11239a6bcd90
Reviewed-on: http://gerrit.cloudera.org:8080/21793
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
2024-09-24 17:10:34 +00:00
wzhou-code
d2cd9b51a0 IMPALA-13388: fix unit-tests of Statestore HA for UBSAN builds
Sometimes in UBSAN builds, unit tests of Statestore HA failed due to
Thrift RPC receive timeouts. The standby statestored failed to send
heartbeats to its subscribers, so failover was not triggered.
The Thrift RPC failures still happened after increasing the TCP
timeout for Thrift RPCs between statestored and its subscribers.

This patch adds a metric for the number of subscribers that received
heartbeats from statestored in a monitoring period. Unit tests of
Statestore HA in UBSAN builds are skipped if statestored failed to
send heartbeats to more than half of the subscribers.
For other builds, an exception with an error message complaining about
Thrift RPC failure is thrown if statestored failed to send heartbeats
to more than half of the subscribers.
Also fixed a bug that calls SecondsSinceHeartbeat() but compares the
returned value with a time value in milliseconds.

Filed follow-up JIRA IMPALA-13399 to track the real root cause.

Testing:
 - Looped test_statestored_ha.py 100 times in a UBSAN build without
   any failures, though 4 of the 100 iterations had skipped test cases.
 - Verified that the issue does not happen in ASAN builds by running
   test_statestored_ha.py 100 times.
 - Passed core test.

Change-Id: Ie59d1e93c635411723f7044da52e4ab19c7d2fac
Reviewed-on: http://gerrit.cloudera.org:8080/21820
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-09-24 05:39:14 +00:00
stiga-huang
58fd45f20c IMPALA-12876: Add catalogVersion and loaded timestamp in query profiles
When debugging stale metadata, it'd be helpful to know which catalog
versions of the tables are used and when catalogd loaded those
versions. This patch exposes this info in the query profile for each
referenced table. E.g.

Original Table Versions: tpch.customer, 2249, 1726052668932, Wed Sep 11 19:04:28 CST 2024
tpch.nation, 2255, 1726052790140, Wed Sep 11 19:06:30 CST 2024
tpch.orders, 2257, 1726052803258, Wed Sep 11 19:06:43 CST 2024
tpch.lineitem, 2254, 1726052785384, Wed Sep 11 19:06:25 CST 2024
tpch.supplier, 2256, 1726052794235, Wed Sep 11 19:06:34 CST 2024

Each line consists of the table name, catalog version, loaded timestamp
and the timestamp string.

Implementation:

The loaded timestamp is updated whenever a CatalogObject updates its
catalog version in catalogd. It's passed to impalads with the
TCatalogObject broadcasted by statestore, or in DDL/DML responses.
Currently, the loaded timestamp is added for table, view, function, data
source, and hdfs cache pool in catalogd. However, only those of tables
and views are used in impalad. The loaded timestamps of the other
types can be checked in the /catalog WebUI of catalogd.

Tests:
 - Adds e2e test

Change-Id: I94b2fd59ed5aca664d6db4448c61ad21a88a4f98
Reviewed-on: http://gerrit.cloudera.org:8080/21782
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-09-16 21:51:56 +00:00
Csaba Ringhofer
c7ce233679 IMPALA-12594: Add flag to tune KrpcDataStreamSender mem estimate
The way the planner estimates mem usage for KrpcDataStreamSender
is very different than how the backend actually uses memory -
the planner assumes that batch_size number of rows are sent at
a time while the BE tries to limit it to
data_stream_sender_buffer_size_ (but doesn't consider var len data).
The Jira has more detail about differences and issues.

This change adds flag data_stream_sender_buffer_size_used_by_planner.
If this is set to 16K (data_stream_sender_buffer_size_ default)
then the estimation will work similarly to BE.

Tested that this can improve both under and overestimations:
peak mem / mem estimate of the first sender:

select distinct * from tpch_parquet.lineitem limit 100000
default:
284.04 KB        2.75 MB
--data_stream_sender_buffer_size_used_by_planner=16384:
282.04 KB      283.39 KB

select distinct l_comment from tpch_parquet.lineitem limit 100000;
default:
747.71 KB      509.94 KB
--data_stream_sender_buffer_size_used_by_planner=16384:
740.71 KB      627.46 KB

The default is not changed to avoid side effects. I would like
to change it once the BE's handling of var len data is fixed, which
is a prerequisite for using mem reservation in KrpcDataStreamSender.

Change-Id: I1e4b1db030be934cece565e3f2634ee7cbdb7c4f
Reviewed-on: http://gerrit.cloudera.org:8080/21797
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-09-14 20:24:41 +00:00
Mihaly Szjatinya
22723d0f27 IMPALA-7086: Cache timezone in *_utc_timestamp()
Added Prepare - Close routine around from/to_utc standard functions.
This gives a consistent time improvement for constant timezones.

Given sample table with 600M timestamp rows, on all-default
environment the query below gives a stable 2-3 seconds improvement.
SELECT count(*) FROM a_table
where from_utc_timestamp(ts, "a_timezone") > "a_date";

Averaged results for Release, SET MT_DOP=1, SET DISABLE_CODEGEN=TRUE:
from_utc: 16.53s -> 12.53s
to_utc: 14.02s -> 11.53s

Change-Id: Icdf5ff82c5d0554333aef1bc3bba034a4cf48230
Reviewed-on: http://gerrit.cloudera.org:8080/21735
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-09-06 18:08:26 +00:00
Peter Rozsa
a0aaf338ae IMPALA-12732: Add support for MERGE statements for Iceberg tables
MERGE statement is a DML command that allows users to perform
conditional insert, update, or delete operations on a target table based
on the results of a join with a source table. This change adds MERGE
statement parsing and an Iceberg-specific semantic analysis, planning,
and execution. The parsing grammar follows the SQL standard, it accepts
the same syntax as Hive, Spark, and Trino by supporting arbitrary number
of WHEN clauses, with conditions or without and accepting inline views
as source.

Example:
'MERGE INTO target t USING source s ON t.id = s.id
WHEN MATCHED AND t.id < 100 THEN UPDATE SET column1 = s.column1
WHEN MATCHED AND t.id > 100 THEN DELETE
WHEN MATCHED THEN UPDATE SET column1 = "value"
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.column1);'

The Iceberg-specific analysis, planning, and execution are based on a
concept that was previously used for UPDATE: The analyzer creates a
SELECT statement with all target and source columns (including
Iceberg's virtual columns) and a 'row_present' column that defines
whether the source, the target, or both rows are present in the result
set after joining the two table references by the ON clause. The join
condition should be an equi-join, as it is a FULL OUTER JOIN, and Impala
currently supports only equi-joins in this case. The join order is
forced by a query hint, which guarantees that the target table is
always on the left side.

A new node, IcebergMergeNode, is added at the planning phase; this node
does the row-level filtering for the MATCHED / NOT MATCHED cases. The
'row_present' column decides which case group is evaluated: if both
sides are available, the matched cases are evaluated; if only the
source side matches, the not matched cases and their filter
expressions are evaluated over the row. If one of the cases matches,
the execution evaluates the result expressions into the output row
batch, and an auxiliary tuple stores the merge action. The merge
action is a flag for the newly added IcebergMergeSink; this sink
routes each incoming row from IcebergMergeNode to its respective
destination. Each row could go to the delete sink, the insert sink, or
both.

Target-side duplicate records are filtered during IcebergMergeNode's
execution; if a target-side duplicate is detected, the whole
statement's execution is stopped and the error is reported back to the
user.

Added tests:
 - Parser tests
 - Analyzer tests
 - Unit test for WHEN NOT MATCHED INSERT column collation
 - Planner tests for partitioned/sorted cases
 - Authorization tests
 - E2E tests

Change-Id: I3416a79740eddc446c87f72bf1a85ed3f71af268
Reviewed-on: http://gerrit.cloudera.org:8080/21423
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-09-05 01:01:05 +00:00
Joe McDonnell
831585c2f5 IMPALA-12906: Incorporate scan range information into the tuple cache key
This change is accomplishing two things:
1. It incorporates scan range information into the tuple
   cache key.
2. It reintroduces deterministic scheduling as an option
   for mt_dop and uses it for HdfsScanNodes that feed
   into a TupleCacheNode.

The combination of these two things solves several problems:
1. When a table is modified, the list of scan ranges will
   change, and this will naturally change the cache keys.
2. When accessing a partitioned table, two queries may have
   different predicates on the partition columns. Since the
   predicates can be satisfied via partition pruning, they are
   not included at runtime. This means that two queries
   may have identical compile-time keys with only the scan
   ranges being different due to different partition pruning.
3. Each fragment instance processes different scan ranges, so
   each will have a unique cache key.

To incorporate scan range information, this introduces a new
per-fragment-instance cache key. At compile time, the planner
now keeps track of which HdfsScanNodes feed into a TupleCacheNode.
This is passed over to the runtime as a list of plan node ids
that contain scan ranges. At runtime, the fragment instance
walks through the list of plan nodes ids and hashes any scan ranges
associated with them. This hash is the per-fragment-instance
cache key. The combination of the compile-time cache key produced
by the planner and the per-fragment-instance cache key is a unique
identifier of the result.
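
As an illustration of problem 2 above (TPC-DS-style names, purely
hypothetical): the partition predicates below are fully absorbed by
pruning, so the compile-time keys can be identical and only the hashed
scan ranges keep the entries apart:

  SELECT count(*) FROM store_sales WHERE ss_sold_date_sk = 2451545;
  SELECT count(*) FROM store_sales WHERE ss_sold_date_sk = 2451546;
  -- same compile-time plan shape, different surviving partitions and scan
  -- ranges, therefore different per-fragment-instance cache keys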

Deterministic scheduling for mt_dop was removed via IMPALA-9655
with the introduction of the shared queue. This revives some of
the pre-IMPALA-9655 scheduling logic as an option. Since the
TupleCacheNode knows which HdfsScanNodes feed into it, the
TupleCacheNode turns on deterministic scheduling for all of those
HdfsScanNodes. Since this only applies to HdfsScanNodes that feed
into a TupleCacheNode, it means that any HdfsScanNode that doesn't
feed into a TupleCacheNode continues using the current algorithm.
The pre-IMPALA-9655 code is modified to make it more deterministic
about how it assigns scan ranges to instances.

Testing:
 - Added custom cluster tests for the scan range information
   including modifying a table, selecting from a partitioned
   table, and verifying that fragment instances have unique
   keys
 - Added basic frontend test to verify that deterministic scheduling
   gets set on the HdfsScanNode that feed into the TupleCacheNode.
 - Restored the pre-IMPALA-9655 backend test to cover the LPT code

Change-Id: Ibe298fff0f644ce931a2aa934ebb98f69aab9d34
Reviewed-on: http://gerrit.cloudera.org:8080/21541
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Yida Wu <wydbaggio000@gmail.com>
2024-09-04 16:31:31 +00:00
Noemi Pap-Takacs
9e0649b9ce IMPALA-12867: Filter files to OPTIMIZE based on file size
The OPTIMIZE TABLE statement is currently used to rewrite the entire
Iceberg table. With the 'FILE_SIZE_THRESHOLD_MB' option, the user can
specify a file size limit to rewrite only small files.

Syntax: OPTIMIZE TABLE <table_name> [(FILE_SIZE_THRESHOLD_MB=<value>)];
The value of the threshold is the file size in MBs. It must be a
non-negative integer. Data files larger than the given limit will only
be rewritten if they are referenced from delete files.
If only 1 file is selected in a partition, it will not be rewritten.
If the threshold is 0, only the delete files and the referenced data
files will be rewritten.
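
For instance (table name hypothetical), the first statement compacts
only data files smaller than 128 MB plus any data files referenced by
delete files, while the plain form rewrites everything:

  OPTIMIZE TABLE ice_db.events (FILE_SIZE_THRESHOLD_MB=128);
  OPTIMIZE TABLE ice_db.events;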

IMPALA-12839: 'Optimizing empty table should be no-op' is also
resolved in this patch.

With the file selection option, the OPTIMIZE operation can operate
in 3 different modes:
- REWRITE_ALL: The entire table is rewritten. Either because the
  compaction was triggered by a simple 'OPTIMIZE TABLE' command
  without a specified 'FILE_SIZE_THRESHOLD_MB' parameter, or
  because all files of the table are deletes/referenced by deletes
  or are smaller than the limit.
- PARTIAL: If the value of the 'FILE_SIZE_THRESHOLD_MB' parameter is
  specified, then only the small data files without deletes are selected
  and the delete files are merged. Large data files without deletes
  are kept to avoid unnecessary resource-consuming writes.
- NOOP: When no files qualify for the selection criteria, there is
  no need to rewrite any files. This is a no-operation.

Testing:
 - Parser test
 - FE unit tests
 - E2E tests

Change-Id: Icfbb589513aacdb68a86c1aec4a0d39b12091820
Reviewed-on: http://gerrit.cloudera.org:8080/21388
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-08-28 14:19:41 +00:00
stiga-huang
db0f0dadf1 IMPALA-13240: Add gerrit comments for Thrift/FlatBuffers changes
Adds gerrit comments for changes in Thrift/FlatBuffers files that could
break the communication between impalad and catalogd/statestore during
upgrade.

Basically, only new optional fields can be added in Thrift protocol. For
Flatbuffers schemas, we should only add new fields at the end of a table
definition.

Adds a new option (--revision) for critique-gerrit-review.py to specify
the revision (HEAD or a commit, branch, etc). Also adds an option
(--base-revision) to specify the base revision for comparison.

To test the script locally, prepare a virtual env with the virtualenv
package installed:
  virtualenv venv
  source venv/bin/activate
  pip install virtualenv
Then run the script with --dryrun:
  python bin/jenkins/critique-gerrit-review.py --dryrun --revision effc9df93

Limitations
 - False positive in cases that add new Thrift structs with required
   fields and only use those new structs in new optional fields, e.g.
   effc9df93 and 72732da9d.
 - Might have false positive results on reformat changes due to simple
   string checks, e.g. 91d8a8f62.
 - Can't check incompatible changes in FlatBuffers files. Just add
   general file level comments.

We can integrate DUPCheck in the future to parse the Thrift/FlatBuffers
files to AST and compare the AST instead.
https://github.com/jwjwyoung/DUPChecker

Tests:
 - Verified incompatible commits like 012996a06 and 65094a74f.
 - Verified posting Gerrit comments from local env using my username.

Change-Id: Ib35fafa50bfd38631312d22464df14d426f55346
Reviewed-on: http://gerrit.cloudera.org:8080/21646
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Quanlong Huang <huangquanlong@gmail.com>
2024-08-15 22:04:14 +00:00
Joe McDonnell
d8a8412c2b IMPALA-13294: Add support for long polling to avoid client side wait
Currently, Impala does an execute call, then the client polls
waiting for the operation to finish (or error out). The client
sleeps between polls, and this sleep time can be a substantial
percentage of a short query's execution time.

To reduce this client side sleep, this implements long polling to
provide an option to wait for query completion on the server side.
This is controlled by the long_polling_time_ms query option. If
set to greater than zero, status RPCs will wait for query
completion for up to that amount of time. This defaults to off (0ms).
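
As a usage sketch (the value is arbitrary), a client that wants the
server to hold status RPCs open for up to one second would set:

  SET LONG_POLLING_TIME_MS=1000;
  SELECT count(*) FROM functional.alltypes;
  -- subsequent get_state/GetOperationStatus calls now block server-side for up
  -- to 1s instead of returning immediately and forcing the client to sleep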

Both Beeswax and HS2 add a wait for query completion in their
get status calls (get_state for Beeswax, GetOperationStatus for HS2).
This doesn't wait in the execute RPC calls (e.g. query for Beeswax,
ExecuteStatement for HS2), because neither includes the query status
in the response. The client will always need to do a separate status
RPC.

This modifies impala-shell and the beeswax client to avoid doing a
sleep if the get_state/GetOperationStatus calls take longer than
they would have slept. In other words, if they would have slept 50ms,
then they skip that sleep if the RPC to the server took longer than
50ms. This allows the client to maintain its sleep behavior with
older Impalas that don't use long polling while adapting properly
to systems that do have long polling. This has the added benefit
that it also adjusts for high latency to the server as well. This
does not change any of the sleep times.

Testing:
 - This adds a test case in test_hs2.py to verify the long
   polling behavior

Change-Id: I72ca595c5dd8a33b936f078f7f7faa5b3f0f337d
Reviewed-on: http://gerrit.cloudera.org:8080/19205
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-08-13 03:49:45 +00:00
Fang-Yu Rao
589dbd6f1a IMPALA-13276: Revise the documentation of 'RUNTIME_FILTER_WAIT_TIME_MS'
This patch revises the documentation of the query option
'RUNTIME_FILTER_WAIT_TIME_MS' as well as the code comment for the same
query option to make its meaning clearer.

Change-Id: Ic98e23a902a65e4fa41a628d4a3edb1894660fb4
Reviewed-on: http://gerrit.cloudera.org:8080/21644
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
2024-08-09 17:49:03 +00:00
Noemi Pap-Takacs
05585c19bf IMPALA-12857: Add flag to enable merge-on-read even if tables are configured with copy-on-write
Impala can only modify an Iceberg table via 'merge-on-read'. The
'iceberg_always_allow_merge_on_read_operations' backend flag makes it
possible to execute 'merge-on-read' operations (DELETE, UPDATE, MERGE)
even if the table property is 'copy-on-write'.
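
A minimal sketch of the intended usage, assuming impalad was started
with the flag and the table (hypothetical name) is configured for
copy-on-write:

  -- table has e.g. 'write.delete.mode'='copy-on-write' in its properties;
  -- with --iceberg_always_allow_merge_on_read_operations=true Impala still
  -- executes the row-level DML via merge-on-read instead of rejecting it
  DELETE FROM ice_db.events WHERE event_date < '2023-01-01';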

Testing:
 - custom cluster test
 - negative E2E test

Change-Id: I3800043e135beeedfb655a238c0644aaa0ef11f4
Reviewed-on: http://gerrit.cloudera.org:8080/21578
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-07-22 14:13:02 +00:00
Eyizoha
ec59578106 IMPALA-12786: Optimize count(*) for JSON scans
When performing zero slots scans on a JSON table for operations like
count(*), we don't require specific data from the JSON; we only need the
number of top-level JSON objects. However, the current JSON parser based
on rapidjson still decodes and copies specific data from the JSON, even
in zero slots scans. Skipping these steps can significantly improve scan
performance.

This patch introduces a JSON skipper to conduct zero slots scans on JSON
data. Essentially, it is a simplified version of a rapidjson parser,
removing specific data decoding and copying operations, resulting in
faster parsing of the number of JSON objects. The skipper retains the
ability to recognize malformed JSON and provide specific error codes
same as the rapidjson parser. Nevertheless, as it bypasses specific
data parsing, it cannot identify string encoding errors or numeric
overflow errors. Despite this, these data errors do not impact the
counting of JSON objects, so it is acceptable to ignore them. The TEXT
scanner exhibits similar behavior.

Additionally, a new query option, disable_optimized_json_count_star, has
been added to disable this optimization and revert to the old behavior.
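
A usage sketch (table name hypothetical):

  -- zero slots scan: the new JSON skipper only counts top-level objects
  SELECT count(*) FROM json_db.events_json;
  -- revert to the full rapidjson parse path for comparison or debugging
  SET DISABLE_OPTIMIZED_JSON_COUNT_STAR=true;
  SELECT count(*) FROM json_db.events_json;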

In the performance test of TPC-DS with a format of json/none and a scale
of 10GB, the performance optimization is shown in the following tables:
+-----------+---------------------------+--------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
| Workload  | Query                     | File Format        | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval   |
+-----------+---------------------------+--------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
| TPCDS(10) | TPCDS-Q_COUNT_UNOPTIMIZED | json / none / none | 6.78   | 6.88        |   -1.46%   |   4.93%   |   3.63%        | 9     |   -1.51%       | -0.74   | -0.72  |
| TPCDS(10) | TPCDS-Q_COUNT_ZERO_SLOT   | json / none / none | 2.42   | 6.75        | I -64.20%  |   6.44%   |   4.58%        | 9     | I -177.75%     | -3.36   | -37.55 |
| TPCDS(10) | TPCDS-Q_COUNT_OPTIMIZED   | json / none / none | 2.42   | 7.03        | I -65.63%  |   3.93%   |   4.39%        | 9     | I -194.13%     | -3.36   | -42.82 |
+-----------+---------------------------+--------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+

(I) Improvement: TPCDS(10) TPCDS-Q_COUNT_ZERO_SLOT [json / none / none] (6.75s -> 2.42s [-64.20%])
+--------------+------------+---------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+
| Operator     | % of Query | Avg     | Base Avg | Delta(Avg) | StdDev(%)  | Max      | Base Max | Delta(Max) | #Hosts | #Inst | #Rows  | Est #Rows |
+--------------+------------+---------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+
| 01:AGGREGATE | 2.58%      | 54.85ms | 58.88ms  | -6.85%     | * 14.43% * | 115.82ms | 133.11ms | -12.99%    | 3      | 3     | 3      | 1         |
| 00:SCAN HDFS | 97.41%     | 2.07s   | 6.07s    | -65.84%    |   5.87%    | 2.43s    | 6.95s    | -65.01%    | 3      | 3     | 28.80M | 143.83M   |
+--------------+------------+---------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+

(I) Improvement: TPCDS(10) TPCDS-Q_COUNT_OPTIMIZED [json / none / none] (7.03s -> 2.42s [-65.63%])
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+--------+-----------+
| Operator     | % of Query | Avg   | Base Avg | Delta(Avg) | StdDev(%) | Max   | Base Max | Delta(Max) | #Hosts | #Inst | #Rows  | Est #Rows |
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+--------+-----------+
| 00:SCAN HDFS | 99.35%     | 2.07s | 6.49s    | -68.15%    |   4.83%   | 2.37s | 7.49s    | -68.32%    | 3      | 3     | 28.80M | 143.83M   |
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+--------+-----------+

Testing:
- Added new test cases in TestQueriesJsonTables to verify that query
  results are consistent before and after optimization.
- Passed existing JSON scanning-related tests.

Change-Id: I97ff097661c3c577aeafeeb1518408ce7a8a255e
Reviewed-on: http://gerrit.cloudera.org:8080/21039
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-07-10 14:37:19 +00:00
stiga-huang
00d0b0dda1 IMPALA-9441,IMPALA-13170: Ops listing dbs/tables should handle db not exists
We have some operations listing the dbs/tables in the following steps:
  1. Get the db list
  2. Do something on the db which could fail if the db no longer exists
For instance, when authorization is enabled, SHOW DATABASES would need a
step-2 to get the owner of each db. This is fine in the legacy catalog
mode since the whole Db object is cached on the coordinator side.
However, in the local catalog mode, the msDb could be missing from the
local cache. The coordinator then triggers a getPartialCatalogObject
RPC to load it from catalogd. If the db no longer exists in catalogd,
that step will fail.

The same happens in GetTables HS2 requests when listing all tables in
all dbs. In step-2 we list the table names for a db. Though the db
exists when we get the db list, it could be dropped before we start
listing the table names in it.

This patch adds code to handle the exceptions caused by a db that no
longer exists. It also improves GetSchemas to not list the table
names, to get rid of the same issue.

Tests:
 - Add e2e tests

Change-Id: I2bd40d33859feca2bbd2e5f1158f3894a91c2929
Reviewed-on: http://gerrit.cloudera.org:8080/21546
Reviewed-by: Yida Wu <wydbaggio000@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-07-01 06:56:40 +00:00
Csaba Ringhofer
a99de990b0 IMPALA-12370: Allow converting timestamps to UTC when writing to Kudu
Before this commit, only read support was implemented
(convert_kudu_utc_timestamps=true). This change adds write support:
if write_kudu_utc_timestamps=true, then timestamps are converted
from local time to UTC during DMLs to Kudu. In case of
ambiguous conversions (DST changes) the earlier possible UTC
timestamp is written.

All DMLs supported with Kudu tables are affected:
INSERT, UPSERT, UPDATE, DELETE

To be able to correctly read back Kudu tables written by Impala,
convert_kudu_utc_timestamps and write_kudu_utc_timestamps need to
have the same value. Having the same value in the two query options
is also critical for UPDATE/DELETE if the primary key contains a
timestamp column - these operations do a scan first (affected by
convert_kudu_utc_timestamps) and then use the keys from the scan to
select updated/deleted rows (affected by write_kudu_utc_timestamps).
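
A sketch of the intended usage (table and column names hypothetical);
the two options are kept in sync so scans and key lookups agree:

  SET CONVERT_KUDU_UTC_TIMESTAMPS=true;   -- UTC -> local time on read
  SET WRITE_KUDU_UTC_TIMESTAMPS=true;     -- local time -> UTC on write
  INSERT INTO kudu_db.events VALUES (1, now());
  -- for ambiguous local times (DST changes) the earlier possible UTC
  -- timestamp is written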

The conversion is implemented by adding to_utc_timestamp() to inserted
timestamp expressions during planning. This allows doing the same
conversion during the pre-insert sorting and partitioning.
Read support is implemented differently - in that case the plan is not
changed and the scanner does the conversion.

Other changes:
- Before this change, verification of tests with TIMESTAMP results
  was skipped when the file format is Kudu. This shouldn't be
  necessary, so the skipping was removed.

Change-Id: Ibb4995a64e042e7bb261fcc6e6bf7ffce61e9bd1
Reviewed-on: http://gerrit.cloudera.org:8080/21492
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Peter Rozsa <prozsa@cloudera.com>
2024-06-19 10:51:56 +00:00
Riza Suminto
5d1bd80623 IMPALA-13152: Avoid NaN, infinite, and negative ProcessingCost
TOP-N cost will turn into NaN if inputCardinality is equal to 0 due to
Math.log(inputCardinality). This patch fixes the issue by avoiding
Math.log(0) and replacing it with 0 instead.

After this patch, Instantiating BaseProcessingCost with NaN, infinite,
or negative totalCost will throw IllegalArgumentException. In
BaseProcessingCost.getDetails(), "total-cost" is renamed to "raw-cost"
to avoid confusion with "cost-total" in ProcessingCost.getDetails().

Testing:
- Add a test case that runs a TOP-N query over an empty table.
- Compute ProcessingCost in most FE and EE tests even when the
  COMPUTE_PROCESSING_COST option is not enabled, by checking whether
  RuntimeEnv.INSTANCE.isTestEnv() is true or the TEST_REPLAN option is
  enabled.
- Pass core test.

Change-Id: Ib49c7ae397dadcb2cb69fde1850d442d33cdf177
Reviewed-on: http://gerrit.cloudera.org:8080/21504
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-06-15 23:36:46 +00:00
Michael Smith
4681666e93 IMPALA-12800: Add cache for isTrueWithNullSlots() evaluation
isTrueWithNullSlots() can be expensive when it has to query the backend.
Many of the expressions will look similar, especially in large
auto-generated expressions. Adds a cache based on the nullified
expression to avoid querying the backend for expressions with identical
structure.

With DEBUG logging enabled for the Analyzer, computes and logs stats
about the null slots cache.

Adds 'use_null_slots_cache' query option to disable caching. Documents
the new option.

Change-Id: Ib63f5553284f21f775d2097b6c5d6bbb63699acd
Reviewed-on: http://gerrit.cloudera.org:8080/21484
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-06-12 12:27:05 +00:00
wzhou-code
70b7b6a78d IMPALA-13134: DDL hang with SYNC_DDL enabled when Catalogd is changed to standby status
Catalogd waits for SYNC_DDL version when it processes a DDL with
SYNC_DDL enabled. If the status of Catalogd is changed from active to
standby when CatalogServiceCatalog.waitForSyncDdlVersion() is called,
the standby catalogd does not receive catalog topic updates from
statestore, hence the catalogd thread waits indefinitely.

This patch fixes the issue by re-generating the service id when
Catalogd is changed to standby status and throwing an exception if its
service id has changed while waiting for the SYNC_DDL version.

Testing:
 - Added unit-test code for CatalogD HA that runs a DDL with SYNC_DDL
   enabled and an injected delay when waiting for the SYNC_DDL version,
   then verifies that the DDL query fails due to catalog failover.
 - Passed test_catalogd_ha.py.

Change-Id: I2dcd628cff3c10d2e7566ba2d9de0b5886a18fc1
Reviewed-on: http://gerrit.cloudera.org:8080/21480
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-06-06 01:39:36 +00:00
ttttttz
f67f5f1815 IMPALA-12705: Add /catalog_ha_info page on Statestore to show catalog HA information
This patch adds a /catalog_ha_info page on the Statestore to show
catalog HA information. The page contains the following: Active Node,
Standby Node, and a Notified Subscribers table. The Notified
Subscribers table includes the following items:
  -- Id,
  -- Address,
  -- Registration ID,
  -- Subscriber Type,
  -- Catalogd Version,
  -- Catalogd Address,
  -- Last Update Catalogd Time

Change-Id: If85f6a827ae8180d13caac588b92af0511ac35e3
Reviewed-on: http://gerrit.cloudera.org:8080/21418
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-06-05 22:19:11 +00:00