Files
impala/docs/topics/impala_metadata.xml
Venu Reddy 5db760662f IMPALA-12709: Add support for hierarchical metastore event processing
At present, metastore event processor is single threaded. Notification
events are processed sequentially with a maximum limit of 1000 events
fetched and processed in a single batch. Multiple locks are used to
address the concurrency issues that may arise when catalog DDL
operation processing and metastore event processing tries to
access/update the catalog objects concurrently. Waiting for a lock or
file metadata loading of a table can slow the event processing and can
affect the processing of other events following it. Those events may
not be dependent on the previous event. Altogether it takes a very
long time to synchronize all the HMS events.

Existing metastore event processing is turned into multi-level
event processing with enable_hierarchical_event_processing flag. It
is not enabled by default. Idea is to segregate the events based on
their dependency, maintain the order of events as they occur within
the dependency and process them independently as much as possible.
Following 3 main classes represents the three level threaded event
processing.
1. EventExecutorService
   It provides the necessary methods to initialize, start, clear,
   stop and process the metastore events processing in hierarchical
   mode. It is instantiated from MetastoreEventsProcessor and its
   methods are invoked from MetastoreEventsProcessor. Upon receiving
   the event to process, EventExecutorService queues the event to
   appropriate DbEventExecutor for processing.
2. DbEventExecutor
   An instance of this class has an execution thread, manage events
   of multiple databases with DbProcessors. An instance of DbProcessor
   is maintained to store the context of each database within the
   DbEventExecutor. On each scheduled execution, input events on
   DbProcessor are segregated to appropriate TableProcessors for the
   event processing and also process the database events that are
   eligible for processing.
   Once a DbEventExecutor is assigned to a database, a DbProcessor
   is created. And the subsequent events belonging to the database
   are queued to same DbEventExecutor thread for further processing.
   Hence, linearizability is ensured in dealing with events within
   the database. Each instance of DbEventExecutor has a fixed list
   of TableEventExecutors.
3. TableEventExecutor
   An instance of this class has an execution thread, processes
   events of multiple tables with TableProcessors. An instance of
   TableProcessor is maintained to store context of each table within
   a TableEventExecutor. On each scheduled execution, events from
   TableProcessors are processed.
   Once a TableEventExecutor is assigned to table, a TableProcessor
   is created. And the subsequent table events are processed by same
   TableEventExecutor thread. Hence, linearizability is guaranteed
   in processing events of a particular table.
   - All the events of a table are processed in the same order they
     have occurred.
   - Events of different tables are processed in parallel when those
     tables are assigned to different TableEventExecutors.

Following new events are added:
1. DbBarrierEvent
   This event wraps a database event. It is used to synchronize all
   the TableProcessors belonging to database before processing the
   database event. It acts as a barrier to restrict the processing
   of table events that occurred after the database event until the
   database event is processed on DbProcessor.
2. RenameTableBarrierEvent
   This event wraps an alter table event for rename. It is used to
   synchronize the source and target TableProcessors to
   process the rename table event. It ensures the source
   TableProcessor removes the table first and then allows the target
   TableProcessor to create the renamed table.
3. PseudoCommitTxnEvent and PseudoAbortTxnEvent
   CommitTxnEvent and AbortTxnEvent can involve multiple tables in
   a transaction and processing these events modifies multiple table
   objects. Pseudo events are introduced such that a pseudo event is
   created for each table involved in the transaction and these
   pseudo events are processed independently at respective
   TableProcessors.

Following new flags are introduced:
1. enable_hierarchical_event_processing
   To enable the hierarchical event processing on catalogd.
2. num_db_event_executors
   To set the number of database level event executors.
3. num_table_event_executors_per_db_event_executor
   To set the number of table level event executors within a
   database event executor.
4. min_event_processor_idle_ms
   To set the minimum time to retain idle db processors and table
   processors on the database event executors and table event
   executors respectively, when they do not have events to process.
5. max_outstanding_events_on_executors
   To set the limit of maximum outstanding events to process on
   event executors.

Changed hms_event_polling_interval_s type from int to double to support
millisecond precision interval

TODOs:
1. We need to redefine the lag in the hierarchical processing mode.
2. Need to have a mechanism to capture the actual event processing time
   in hierarchical processing mode. Currently, with
   enable_hierarchical_event_processing as true, lastSyncedEventId_ and
   lastSyncedEventTimeSecs_ are updated upon event dispatch to
   EventExecutorService for processing on respective DbEventExecutor
   and/or TableEventExecutor. So lastSyncedEventId_ and
   lastSyncedEventTimeSecs_ doesn't actually mean events are processed.
3. Hierarchical processing mode currently have a mechanism to show the
   total number of outstanding events on all the db and table executors
   at the moment. Need to enhance observability further with this mode.
Filed a jira[IMPALA-13801] to fix them.

Testing:
 - Executed existing end to end tests.
 - Added fe and end-to-end tests with enable_hierarchical_event_processing.
 - Added event processing performance tests.
 - Have executed the existing tests with hierarchical processing
   mode enabled. lastSyncedEventId_ is now used in the new feature of
   sync_hms_events_wait_time_s (IMPALA-12152) as well. Some tests fail when
   hierarchical processing mode is enabled because lastSyncedEventId_ do
   not actually mean event is processed in this mode. This need to be
   fixed/verified with above jira[IMPALA-13801].

Change-Id: I76d8a739f9db6d40f01028bfd786a85d83f9e5d6
Reviewed-on: http://gerrit.cloudera.org:8080/21031
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-04-30 11:51:03 +00:00

697 lines
24 KiB
XML
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="impala_metadata">
<title>Metadata Management</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Configuring"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
</metadata>
</prolog>
<conbody>
<p>
This topic describes various knobs you can use to control how Impala manages its metadata
in order to improve performance and scalability.
</p>
<p outputclass="toc inpage"/>
</conbody>
<concept id="on_demand_metadata">
<title>On-demand Metadata</title>
<conbody>
<p>
In previous versions of Impala, every coordinator kept a replica of all the cache in
<codeph>catalogd</codeph>, consuming large memory on each coordinator with no option to
evict. Metadata always propagated through the <codeph>statestored</codeph> and suffers
from head-of-line blocking, for example, one user loading a big table blocking another
user loading a small table.
</p>
<p>
With this new feature, the coordinators pull metadata as needed from
<codeph>catalogd</codeph> and cache it locally. The cached metadata gets evicted
automatically under memory pressure.
</p>
<p>
The granularity of on-demand metadata fetches is now at the partition level between the
coordinator and <codeph>catalogd</codeph>. Common use cases like add/drop partitions do
not trigger unnecessary serialization/deserialization of large metadata.
</p>
<p>
This feature is disabled by default.
</p>
<p>
The feature can be used in either of the following modes.
<dl>
<dlentry>
<dt>
Metadata on-demand mode
</dt>
<dd>
In this mode, all coordinators use the metadata on-demand.
</dd>
<dd>
Set the following on <codeph>catalogd</codeph>:
<codeblock>--catalog_topic_mode=minimal</codeblock>
</dd>
<dd>
Set the following on all <codeph>impalad</codeph> coordinators:
<codeblock>--use_local_catalog=true</codeblock>
</dd>
</dlentry>
<dlentry>
<dt>
Mixed mode
</dt>
<dd>
In this mode, only some coordinators are enabled to use the metadata on-demand.
</dd>
<dd>
We recommend that you use the mixed mode only for testing local catalogs impact
on heap usage.
</dd>
<dd>
Set the following on <codeph>catalogd</codeph>:
<codeblock>--catalog_topic_mode=mixed</codeblock>
</dd>
<dd>
Set the following on <codeph>impalad</codeph> coordinators with metdadata
on-demand:
<codeblock>--use_local_catalog=true </codeblock>
</dd>
</dlentry>
<dlentry>
<dt>Flags related to <codeph>use_local_catalog</codeph></dt>
<dd>When <codeph>use_local_catalog</codeph> is enabled or set to <codeph>True</codeph> on the impalad
coordinators the following list of flags configure various parameters as described below. It is not
recommended to change the default values on these flags.
</dd>
<dd>
<ul>
<li>The flag <codeph>local_catalog_cache_mb</codeph> (defaults to -1) configures the
size of the catalog cache within each coordinator. With the default set to -1, the
cache is auto-configured to 60% of the configured Java heap size. Note that the
Java heap size is distinct from and typically smaller than the overall Impala
memory limit.</li>
<li>The flag <codeph>local_catalog_cache_expiration_s</codeph> (defaults to 3600) configures the
expiration time of the catalog cache within each impalad. Even if the configured
cache capacity has not been reached, items are removed from the cache if they have not
been accessed in the defined amount of time.</li>
<li>The flag <codeph>local_catalog_max_fetch_retries</codeph> (defaults to 40) configures
the maximum number of retries needed for queries to fetch a metadata object from the impalad
coordinator's local catalog cache.</li>
</ul>
</dd>
</dlentry>
</dl>
</p>
</conbody>
</concept>
<concept id="auto_invalidate_metadata">
<title>Automatic Invalidation of Metadata Cache</title>
<conbody>
<p>
To keep the size of metadata bounded, <codeph>catalogd</codeph> periodically scans all
the tables and invalidates those not recently used. There are two types of
configurations for <codeph>catalogd</codeph> and <codeph>impalad</codeph>.
</p>
<dl>
<dlentry>
<dt>
Time-based cache invalidation
</dt>
<dd>
<codeph>Catalogd</codeph> invalidates tables that are not recently used in the
specified time period (in seconds).
</dd>
<dd>
The <codeph>&#8209;&#8209;invalidate_tables_timeout_s</codeph> flag needs to be
applied to both <codeph>impalad</codeph> and <codeph>catalogd</codeph>.
</dd>
</dlentry>
<dlentry>
<dt>
Memory-based cache invalidation
</dt>
<dd>
When the memory pressure reaches 60% of JVM heap size after a Java garbage
collection in <codeph>catalogd</codeph>, Impala invalidates 10% of the least
recently used tables.
</dd>
<dd>
The <codeph>&#8209;&#8209;invalidate_tables_on_memory_pressure</codeph> flag needs
to be applied to both <codeph>impalad</codeph> and <codeph>catalogd</codeph>.
</dd>
</dlentry>
</dl>
<p>
Automatic invalidation of metadata provides more stability with lower chances of running
out of memory, but the feature could potentially cause performance issues and may
require tuning.
</p>
</conbody>
</concept>
<concept id="auto_poll_hms_notification">
<title>Automatic Invalidation/Refresh of Metadata</title>
<conbody>
<p>
When tools such as Hive and Spark are used to process the raw data ingested into Hive
tables, new HMS metadata (database, tables, partitions) and filesystem metadata (new
files in existing partitions/tables) is generated. In previous versions of Impala, in
order to pick up this new information, Impala users needed to manually issue an
<codeph>INVALIDATE</codeph> or <codeph>REFRESH</codeph> commands.
</p>
<p>
When automatic invalidate/refresh of metadata is enabled, <codeph>catalogd</codeph>
polls Hive Metastore (HMS) notification events at a configurable interval and processes
the following changes:
</p>
<note>
This is a preview feature in <keyword keyref="impala33_full"/> and <keyword keyref="impala40"/>
It is generally available and enabled by default from <keyword keyref="impala41"/> onwards.
</note>
<ul>
<li>
Invalidates the tables when it receives the <codeph>ALTER TABLE</codeph> event.
</li>
<li>
Refreshes the partition when it receives the <codeph>ALTER</codeph>,
<codeph>ADD</codeph>, or <codeph>DROP</codeph> partitions.
</li>
<li>
Adds the tables or databases when it receives the <codeph>CREATE TABLE</codeph> or
<codeph>CREATE DATABASE</codeph> events.
</li>
<li>
Removes the tables from <codeph>catalogd</codeph> when it receives the <codeph>DROP
TABLE</codeph> or <codeph>DROP DATABASE</codeph> events.
</li>
<li>
Refreshes the table and partitions when it receives the <codeph>INSERT</codeph>
events.
<p>
If the table is not loaded at the time of processing the <codeph>INSERT</codeph>
event, the event processor does not need to refresh the table and skips it.
</p>
</li>
<li>
Changes the database and updates <codeph>catalogd</codeph> when it receives the
<codeph>ALTER DATABASE</codeph> events. The following changes are supported. This
event does not invalidate the tables in the database.
<ul>
<li>
Change the database properties
</li>
<li>
Change the comment on the database
</li>
<li>
Change the owner of the database
</li>
<li>
Change the default location of the database
<p>
Changing the default location of the database does not move the tables of that
database to the new location. Only the new tables which are created subsequently
use the default location of the database in case it is not provided in the
create table statement.
</p>
</li>
</ul>
</li>
</ul>
<p>
This feature is controlled by the
<codeph>&#8209;&#8209;hms_event_polling_interval_s</codeph> flag. Start the
<codeph>catalogd</codeph> with the <codeph>hms_event_polling_interval_s</codeph>
flag set to a positive double value to enable the feature and set the polling frequency in
seconds. We recommend the value between 1.0 to 5.0 seconds.
</p>
<p>
The following use cases are not supported:
</p>
<ul>
<li>
When you bypass HMS and add or remove data into table by adding files directly on the
filesystem, HMS does not generate the <codeph>INSERT</codeph> event, and the event
processor will not invalidate the corresponding table or refresh the corresponding
partition.
<p>
It is recommended that you use the <codeph>LOAD DATA</codeph> command to do the data
load in such cases, so that event processor can act on the events generated by the
<codeph>LOAD</codeph> command.
</p>
</li>
<li>
The Spark API that saves data to a specified location does not generate events in HMS,
thus is not supported. For example:
<codeblock>Seq((1, 2)).toDF("i", "j").write.save("/user/hive/warehouse/spark_etl.db/customers/date=01012019")</codeblock>
</li>
</ul>
<p>
This feature is turned off by default with the
<codeph>&#8209;&#8209;hms_event_polling_interval_s</codeph> flag set to
<codeph>0</codeph>.
</p>
</conbody>
<concept id="configure_event_based_metadata_sync">
<title>Configure HMS for Event Based Automatic Metadata Sync</title>
<conbody>
<p>
To use the HMS event based metadata sync:
</p>
<ol>
<li>
Add the following entries to the <codeph>hive-site.xml</codeph> of the Hive
Metastore service.
<codeblock> &lt;property>
&lt;name>hive.metastore.transactional.event.listeners&lt;/name>
&lt;value>org.apache.hive.hcatalog.listener.DbNotificationListener&lt;/value>
&lt;/property>
&lt;property>
&lt;name>hive.metastore.dml.events&lt;/name>
&lt;value>true&lt;/true>
&lt;/property></codeblock>
</li>
<li>
Save <codeph>hive-site.xml</codeph>.
</li>
<li>
Set the <codeph>hive.metastore.dml.events</codeph> configuration key to
<codeph>true</codeph> in HiveServer2 service's <codeph>hive-site.xml</codeph>. This
configuration key needs to be set to <codeph>true</codeph> in both Hive services,
HiveServer2 and Hive Metastore.
</li>
<li>
If applicable, set the <codeph>hive.metastore.dml.events</codeph> configuration key
to <codeph>true</codeph> in <codeph>hive-site.xml</codeph> used by the Spark
applications (typically, <codeph>/etc/hive/conf/hive-site.xml</codeph>) so that the
<codeph>INSERT</codeph> events are generated when the Spark application inserts data
into existing tables and partitions.
</li>
<li>
Restart the HiveServer2, Hive Metastore, and Spark (if applicable) services.
</li>
</ol>
</conbody>
</concept>
<concept id="disable_event_based_metadata_sync">
<title>Disable Event Based Automatic Metadata Sync</title>
<conbody>
<p>
When the <codeph>hms_event_polling_interval_s</codeph> flag is set to a non-zero
value for your <codeph>catalogd</codeph>, the event-based automatic invalidation is
enabled for all databases and tables. If you wish to have the fine-grained control on
which tables or databases need to be synced using events, you can use the
<codeph>impala.disableHmsSync</codeph> property to disable the event processing at the
table or database level.
</p>
<p>
When you add the <codeph>DBPROPERTIES</codeph> or <codeph>TBLPROPERTIES</codeph> with
the <codeph>impala.disableHmsSync</codeph> key, the HMS event based sync is turned on
or off. The value of the <codeph>impala.disableHmsSync</codeph> property determines if
the event processing needs to be disabled for a particular table or database.
</p>
<ul>
<li>
If <codeph>'impala.disableHmsSync'='true'</codeph>, the events for that table or
database are ignored and not synced with HMS.
</li>
<li>
If <codeph>'impala.disableHmsSync'='false'</codeph> or if
<codeph>impala.disableHmsSync</codeph> is not set, the automatic sync with HMS is
enabled if the <codeph>hms_event_polling_interval_s</codeph> global flag is
set to non-zero.
</li>
</ul>
<ul>
<li>
To disable the event based HMS sync for a new database, set the
<codeph>impala.disableHmsSync</codeph> database properties in Hive as currently,
Impala does not support setting database properties:
<codeblock>CREATE DATABASE &lt;name> WITH DBPROPERTIES ('impala.disableHmsSync'='true');</codeblock>
</li>
<li>
To enable or disable the event based HMS sync for a table:
<codeblock>CREATE TABLE &lt;name> WITH TBLPROPERTIES ('impala.disableHmsSync'='true' | 'false');</codeblock>
</li>
<li>
To change the event based HMS sync at the table level:
<codeblock>ALTER TABLE &lt;name> WITH TBLPROPERTIES ('impala.disableHmsSync'='true' | 'false');</codeblock>
</li>
</ul>
<p>
When both table and database level properties are set, the table level property takes
precedence. If the table level property is not set, then the database level property
is used to evaluate if the event needs to be processed or not.
</p>
<p>
If the property is changed from <codeph>true</codeph> (meaning events are skipped) to
<codeph>false</codeph> (meaning events are not skipped), you need to issue a manual
<codeph>INVALIDATE METADATA</codeph> command to reset event processor because it
doesn't know how many events have been skipped in the past and cannot know if the
object in the event is the latest. In such a case, the status of the event processor
changes to <codeph>NEEDS_INVALIDATE</codeph>.
</p>
</conbody>
</concept>
<concept id="event_processor_metrics">
<title>Metrics for Event Based Automatic Metadata Sync</title>
<conbody>
<p>
You can use the web UI of the <codeph>catalogd</codeph> to check the state of the
automatic invalidate event processor.
</p>
<p>
Under the web UI, there are two pages that presents the metrics for HMS event
processor that is responsible for the event based automatic metadata sync.
<ul>
<li>
<b>/metrics#events</b>
</li>
<li>
<b>/events</b>
<p>
This provides a detailed view of the metrics of the event processor, including
min, max, mean, median, of the durations and rate metrics for all the counters
listed on the <b>/metrics#events</b> page.
</p>
</li>
</ul>
</p>
</conbody>
<concept id="concept_gch_xzm_1hb">
<title>/metrics#events Page</title>
<conbody>
<p>
The <b>/metrics#events</b> page provides the following metrics about the HMS event
processor.
</p>
<table id="events-tbl">
<tgroup cols="2">
<colspec colnum="1" colname="col1" colwidth="1*"/>
<colspec colnum="2" colname="col3" colwidth="2.58*"/>
<thead>
<row>
<entry>
Name
</entry>
<entry>
Description
</entry>
</row>
</thead>
<tbody>
<row>
<entry>
events-processor.avg-events-fetch-duration
</entry>
<entry>
Average duration to fetch a batch of events and process it.
</entry>
</row>
<row>
<entry>
events-processor.avg-events-process-duration
</entry>
<entry>
Average time taken to process a batch of events received from the Metastore.
</entry>
</row>
<row>
<entry>
events-processor.events-received
</entry>
<entry>
Total number of the Metastore events received.
</entry>
</row>
<row>
<entry>
events-processor.events-received-15min-rate
</entry>
<entry>
Exponentially weighted moving average (EWMA) of number of events received in
last 15 min.
<p>
This rate of events can be used to determine if there are spikes in event
processor activity during certain hours of the day.
</p>
</entry>
</row>
<row>
<entry>
events-processor.events-received-1min-rate
</entry>
<entry>
Exponentially weighted moving average (EWMA) of number of events received in
last 1 min.
<p>
This rate of events can be used to determine if there are spikes in event
processor activity during certain hours of the day.
</p>
</entry>
</row>
<row>
<entry>
events-processor.events-received-5min-rate
</entry>
<entry>
Exponentially weighted moving average (EWMA) of number of events received in
last 5 min.
<p>
This rate of events can be used to determine if there are spikes in event
processor activity during certain hours of the day.
</p>
</entry>
</row>
<row>
<entry>
events-processor.events-skipped
</entry>
<entry>
Total number of the Metastore events skipped.
<p>
Events can be skipped based on certain flags are table and database level.
You can use this metric to make decisions, such as:
<ul>
<li>
If most of the events are being skipped, see if you might just turn
off the event processing.
</li>
<li>
If most of the events are not skipped, see if you need to add flags on
certain databases.
</li>
</ul>
</p>
</entry>
</row>
<row>
<entry>
events-processor.outstanding-event-count
</entry>
<entry>
Total number of outstanding events to be processed on db event executors
and table event executors when
<codeph>--enable_hierarchical_event_processing</codeph> flag is
<codeph>true</codeph>.
</entry>
</row>
<row>
<entry>
events-processor.status
</entry>
<entry>
Metastore event processor status to see if there are events being received
or not. Possible states are:
<ul>
<li>
<codeph>PAUSED</codeph>
<p>
The event processor is paused because catalog is being reset
concurrently.
</p>
</li>
<li>
<codeph>ACTIVE</codeph>
<p>
The event processor is scheduled at a given frequency.
</p>
</li>
<li>
<codeph>ERROR</codeph>
</li>
<li>
The event processor is in error state and event processing has stopped.
Needs a manual <codeph>INVALIDATE</codeph> command to reset the state.
</li>
<li>
<codeph>NEEDS_INVALIDATE</codeph>
<p>
The event processor could not resolve certain events and needs a
manual <codeph>INVALIDATE</codeph> command to reset the state.
</p>
</li>
<li>
<codeph>STOPPED</codeph>
<p>
The event processing has been shutdown. No events will be processed.
</p>
</li>
<li>
<codeph>DISABLED</codeph>
<p>
The event processor is not configured to run.
</p>
</li>
</ul>
</entry>
</row>
</tbody>
</tgroup>
</table>
</conbody>
</concept>
</concept>
</concept>
</concept>