Venu Reddy 5db760662f IMPALA-12709: Add support for hierarchical metastore event processing
At present, the metastore event processor is single-threaded.
Notification events are processed sequentially, with at most 1000
events fetched and processed in a single batch. Multiple locks are used
to address the concurrency issues that may arise when catalog DDL
operation processing and metastore event processing try to access or
update catalog objects concurrently. Waiting for a lock, or for the
file metadata of a table to load, can slow event processing and delay
the events that follow, even when those events do not depend on the
earlier one. Altogether, it takes a very long time to synchronize all
the HMS events.

Existing metastore event processing is turned into multi-level
event processing with the enable_hierarchical_event_processing flag,
which is not enabled by default. The idea is to segregate events based
on their dependencies, maintain the order in which events occur within
each dependency, and process them independently as much as possible.
The following three main classes represent the three levels of
threaded event processing.
1. EventExecutorService
   It provides the methods needed to initialize, start, clear, and
   stop metastore event processing in hierarchical mode, and to
   process events. It is instantiated from MetastoreEventsProcessor,
   which also invokes its methods. Upon receiving an event to
   process, EventExecutorService queues the event to the appropriate
   DbEventExecutor for processing.
2. DbEventExecutor
   An instance of this class has an execution thread and manages
   events of multiple databases with DbProcessors. An instance of
   DbProcessor is maintained to store the context of each database
   within the DbEventExecutor. On each scheduled execution, input
   events on a DbProcessor are segregated to the appropriate
   TableProcessors for processing, and the database events that are
   eligible for processing are processed.
   Once a DbEventExecutor is assigned to a database, a DbProcessor
   is created, and subsequent events belonging to the database are
   queued to the same DbEventExecutor thread for further processing.
   Hence, linearizability is ensured when dealing with events within
   the database. Each instance of DbEventExecutor has a fixed list
   of TableEventExecutors.
3. TableEventExecutor
   An instance of this class has an execution thread and processes
   events of multiple tables with TableProcessors. An instance of
   TableProcessor is maintained to store the context of each table
   within a TableEventExecutor. On each scheduled execution, events
   from the TableProcessors are processed.
   Once a TableEventExecutor is assigned to a table, a TableProcessor
   is created, and subsequent events for that table are processed by
   the same TableEventExecutor thread. Hence, linearizability is
   guaranteed when processing events of a particular table.
   - All the events of a table are processed in the same order they
     have occurred.
   - Events of different tables are processed in parallel when those
     tables are assigned to different TableEventExecutors.

The following new events are added:
1. DbBarrierEvent
   This event wraps a database event. It is used to synchronize all
   the TableProcessors belonging to the database before processing
   the database event. It acts as a barrier that holds back table
   events that occurred after the database event until the database
   event is processed on the DbProcessor.
2. RenameTableBarrierEvent
   This event wraps an alter table event for rename. It is used to
   synchronize the source and target TableProcessors to
   process the rename table event. It ensures the source
   TableProcessor removes the table first and then allows the target
   TableProcessor to create the renamed table.
3. PseudoCommitTxnEvent and PseudoAbortTxnEvent
   CommitTxnEvent and AbortTxnEvent can involve multiple tables in
   a transaction and processing these events modifies multiple table
   objects. Pseudo events are introduced such that a pseudo event is
   created for each table involved in the transaction and these
   pseudo events are processed independently at respective
   TableProcessors.

The following new flags are introduced:
1. enable_hierarchical_event_processing
   To enable the hierarchical event processing on catalogd.
2. num_db_event_executors
   To set the number of database level event executors.
3. num_table_event_executors_per_db_event_executor
   To set the number of table level event executors within a
   database event executor.
4. min_event_processor_idle_ms
   To set the minimum time to retain idle db processors and table
   processors on the database event executors and table event
   executors respectively, when they do not have events to process.
5. max_outstanding_events_on_executors
   To set the limit of maximum outstanding events to process on
   event executors.

Changed the type of hms_event_polling_interval_s from int to double to
support millisecond-precision intervals.
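
As a rough illustration of how the flags above might be combined, the
following sketch prints one startup argument per line. The flag names
come from this commit message; every value is an arbitrary assumption
chosen for illustration, not a default or a recommendation, and the
helper name hierarchical_event_flags is hypothetical.

```shell
# Print one gflags-style argument per line.
# Flag names are taken from the commit message; all values are
# illustrative assumptions only.
hierarchical_event_flags() {
  cat <<'EOF'
--enable_hierarchical_event_processing=true
--num_db_event_executors=4
--num_table_event_executors_per_db_event_executor=2
--min_event_processor_idle_ms=30000
--max_outstanding_events_on_executors=1000
--hms_event_polling_interval_s=0.5
EOF
}

hierarchical_event_flags
```

The printed arguments would be appended to however catalogd startup
flags are supplied in a given deployment. The fractional polling
interval relies on the int-to-double change described above.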

TODOs:
1. We need to redefine the lag in the hierarchical processing mode.
2. We need a mechanism to capture the actual event processing time
   in hierarchical processing mode. Currently, with
   enable_hierarchical_event_processing set to true, lastSyncedEventId_
   and lastSyncedEventTimeSecs_ are updated when an event is dispatched
   to EventExecutorService for processing on the respective
   DbEventExecutor and/or TableEventExecutor. So lastSyncedEventId_ and
   lastSyncedEventTimeSecs_ do not actually mean that the events have
   been processed.
3. Hierarchical processing mode currently has a mechanism to show the
   total number of outstanding events on all the db and table executors
   at the moment. Observability in this mode needs further enhancement.
Filed a Jira [IMPALA-13801] to fix them.

Testing:
 - Executed existing end to end tests.
 - Added fe and end-to-end tests with enable_hierarchical_event_processing.
 - Added event processing performance tests.
 - Executed the existing tests with hierarchical processing mode
   enabled. lastSyncedEventId_ is now also used by the new
   sync_hms_events_wait_time_s feature (IMPALA-12152). Some tests fail
   when hierarchical processing mode is enabled because
   lastSyncedEventId_ does not actually mean the event has been
   processed in this mode. This needs to be fixed and verified under
   the Jira above [IMPALA-13801].

Change-Id: I76d8a739f9db6d40f01028bfd786a85d83f9e5d6
Reviewed-on: http://gerrit.cloudera.org:8080/21031
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-04-30 11:51:03 +00:00

Generating HTML or a PDF of Apache Impala Documentation

Prerequisites

Make sure that you have a recent version of the Java Development Kit (JDK) installed and that your JAVA_HOME environment variable is set. This procedure has been tested with JDK 1.8.0. See Setting JAVA_HOME at the end of these instructions.

Download Docs Source

  • There are two ways to obtain docs sources.
    • Clone the whole repository. Open a terminal window and run the following commands to get the whole Impala repository from Git and go to the docs folder:

        git clone https://gitbox.apache.org/repos/asf/impala.git
        cd <local_directory>
        git checkout master
        cd docs/
      

      Where master is the branch where Impala documentation source files are uploaded.

    • Clone only the docs directory. Open a terminal window and run the following commands to get only the Impala documentation source files from Git:

        git init impala_docs
        cd impala_docs
        git remote add origin https://gitbox.apache.org/repos/asf/impala.git
        git sparse-checkout set docs/
        git pull origin master
        cd docs/
      

      Only the docs/ subdirectory is checked out into your working tree.

Download DITA Open Toolkit

  • Download the DITA Open Toolkit version 2.3.3 from the DITA Open Toolkit web site:

    https://github.com/dita-ot/dita-ot/releases/download/2.3.3/dita-ot-2.3.3.zip

    Note: A DITA-OT 2.3.3 User Guide is included in the toolkit. Look for userguide.pdf in the doc directory of the toolkit after you extract it. For example, if you extract the toolkit package to the /Users/<username>/DITA-OT directory on Mac OS, you will find the userguide.pdf at the following location:

    /Users/<username>/DITA-OT/doc/userguide.pdf
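
    The download step can be scripted. The following is a minimal sketch, assuming curl and unzip are installed; the function names, the /tmp download location, and the DITA_OT_PARENT variable are hypothetical and not part of the original instructions. The archive's top-level directory name may differ from the DITA-OT directory used in the examples here, so check it after extraction.

```shell
# Sketch only: install location and helper names are assumptions.
DITA_OT_VERSION=2.3.3
DITA_OT_PARENT="${DITA_OT_PARENT:-$HOME}"   # hypothetical extraction target

# Build the release URL for a given DITA-OT version.
dita_ot_url() {
  printf 'https://github.com/dita-ot/dita-ot/releases/download/%s/dita-ot-%s.zip\n' "$1" "$1"
}

# Download the archive and unpack it under DITA_OT_PARENT.
fetch_dita_ot() {
  curl -fL -o "/tmp/dita-ot-$1.zip" "$(dita_ot_url "$1")" &&
    unzip -q "/tmp/dita-ot-$1.zip" -d "$DITA_OT_PARENT"
}

# Usage: fetch_dita_ot "$DITA_OT_VERSION"
```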
    

Add dita Executable to Your PATH

  1. Identify the directory into which you extracted DITA-OT. For this exercise, we'll assume it's /Users/<username>/DITA-OT
  2. Find your .bash_profile. On Mac OS X, it is probably /Users/<username>/.bash_profile.
  3. Edit your <path_to_bash_profile>/.bash_profile file and add the following lines to the end of the file.
    # Add dita to path
    export PATH="/Users/<username>/DITA-OT/bin:$PATH"
    
    Save the file.
  4. Open a new terminal, or run source <path_to_bash_profile>/.bash_profile.
  5. Verify dita is in your PATH. A command like which dita should print the location of the dita executable, like:
    $ which dita
    /Users/<username>/DITA-OT/bin/dita
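
The PATH check in step 5 can also be wrapped in a small helper that fails with a hint when the command is missing. This is a sketch; the function name require_on_path is hypothetical.

```shell
# Print the resolved location of a command, or fail with a hint.
require_on_path() {
  if command -v "$1" >/dev/null 2>&1; then
    command -v "$1"
  else
    echo "$1 not found on PATH; check the export line in your .bash_profile" >&2
    return 1
  fi
}

# Usage: require_on_path dita
```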
    

Verify dita Executable Can Run

In a terminal, try dita --help. You should get brief usage, like:

Usage: dita -i <file> -f <name> [options]
   or: dita -install [<file>]
   or: dita -uninstall <id>
   or: dita -help
   or: dita -version
Arguments:
  -i, -input <file>      input file
  -f, -format <name>     output format (transformation type)
  -install [<file>]      install plug-in from a ZIP file or reload plugins
  -uninstall <id>        uninstall plug-in with the ID
  -h, -help              print this message
  -version               print version information and exit
Options:
  -o, -output <dir>      output directory
  -filter <file>         filter and flagging file
  -t, -temp <dir>        temporary directory
  -v, -verbose           verbose logging
  -d, -debug             print debugging information
  -l, -logfile <file>    use given file for log
  -D<property>=<value>   use value for given property
  -propertyfile <name>   load all properties from file with -D
                         properties taking precedence

If you don't get this, or you get an error, see Setting JAVA_HOME and Troubleshooting at the end of these instructions.

Oneshot Docs Build

The easiest way to build the docs is to run make from the docs/ directory of your git clone. It takes about 1 minute. This works because make uses the provided Makefile to invoke dita with the correct arguments.

Docs will end up in docs/build (both HTML and PDF).
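
To confirm what a build produced, a helper along these lines can list the generated files. It is a sketch that assumes the outputs are recognizable by their .html and .pdf extensions; the exact layout under docs/build is not specified here, and the function name list_doc_outputs is hypothetical.

```shell
# List HTML and PDF files under a build directory (default: build).
list_doc_outputs() {
  find "${1:-build}" -type f \( -name '*.html' -o -name '*.pdf' \) | sort
}

# Usage, from the docs/ directory:
#   make && list_doc_outputs build
```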

Details, Advanced Usage

  1. In the directory where you cloned the Impala documentation files, you will find the following important configuration files in the docs subdirectory. These files are used to convert the XML source you downloaded from the Apache site to PDF and HTML:

    • impala.ditamap: Tells the DITA Open Toolkit what topics to include in the Impala User/Administration Guide. This guide also includes the Impala SQL Reference.
    • impala_html.ditaval: Further defines what topics to include in the Impala HTML output.
    • impala_pdf.ditaval: Further defines what topics to include in the Impala PDF output.
  2. Run one of the following commands, depending on what you want to generate:

    • To generate HTML output of the Impala User and Administration Guide, which includes the Impala SQL Reference, run the following command:

      dita -input <path_to_impala.ditamap> -format html5 \
        -output <path_to_build_output_directory> \
        -filter <path_to_impala_html.ditaval>
      
    • To generate PDF output of the Impala User and Administration Guide, which includes the Impala SQL Reference, run the following command:

      dita -input <path_to_impala.ditamap> -format pdf \
        -output <path_to_build_output_directory> \
        -filter <path_to_impala_pdf.ditaval>
      

    Note: For a description of all command-line options, see the DITA Open Toolkit User Guide in the doc directory of your downloaded DITA Open Toolkit.
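
The two commands above differ only in the format and the filter file, so they can be folded into one helper. This is a sketch under stated assumptions: the function name build_impala_docs is hypothetical, it assumes impala.ditamap and both .ditaval files sit in the same directory, and the DITA variable is an invented override for the executable name.

```shell
# Run one DITA-OT transformation for the Impala docs.
#   $1: output format, html5 or pdf
#   $2: directory containing impala.ditamap and the .ditaval files
#   $3: output directory
# DITA may be set to override the executable (assumption, useful for dry runs).
build_impala_docs() {
  case "$1" in
    html5) filter=impala_html.ditaval ;;
    pdf)   filter=impala_pdf.ditaval ;;
    *) echo "unsupported format: $1" >&2; return 2 ;;
  esac
  "${DITA:-dita}" -input "$2/impala.ditamap" -format "$1" \
    -output "$3" -filter "$2/$filter"
}

# Usage, from the docs/ directory:
#   build_impala_docs html5 . build/html
#   build_impala_docs pdf . build/pdf
```

Setting DITA=echo prints the arguments instead of running the toolkit, which is a quick way to check the paths before a real build.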

Setting JAVA_HOME

Set your JAVA_HOME environment variable to tell your computer where to find the Java executable file. For example, to set your JAVA_HOME environment on Mac OS X when you have the 1.8.0_101 version of the Java Development Kit (JDK) installed and you are using the Bash version 3.2 shell, perform the following steps:

  1. Find your .bash_profile. On Mac OS X, it is probably /Users/<username>/.bash_profile. Edit your <path_to_bash_profile>/.bash_profile file and add the following lines to the end of the file.

    # Set JAVA_HOME
    JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home
    export JAVA_HOME
    

    Where jdk1.8.0_101.jdk is the version of JDK that you have installed. For example, if you have installed jdk1.8.0_102.jdk, you would use that value instead.

  2. Open a new terminal, or run source <path_to_bash_profile>/.bash_profile.

  3. Test to make sure you have set your JAVA_HOME correctly:

    • Open a terminal window and type: $JAVA_HOME/bin/java -version

    • Press return. If you see something like the following:

      java version "1.8.0_101"
      Java(TM) 2 Runtime Environment, Standard Edition (build 1.8.0_101-b06-284)
      Java HotSpot (TM) Client VM (build 1.8.0_101-133, mixed mode, sharing)
      

      Then you've successfully set your JAVA_HOME environment variable to the binary stored in /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home.

      Note: The exact version and build number on your system may differ. The point is you want a message like the above.
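
The same check can be scripted so that it fails early when JAVA_HOME is missing or wrong. A minimal sketch follows; the function name check_java_home is hypothetical, and it only verifies that an executable exists at $JAVA_HOME/bin/java, not which version it is.

```shell
# Verify that JAVA_HOME is set and points at a directory with bin/java.
check_java_home() {
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$JAVA_HOME/bin/java" ]; then
    echo "no executable java under $JAVA_HOME/bin" >&2
    return 1
  fi
  echo "JAVA_HOME looks usable: $JAVA_HOME"
}

# Usage: check_java_home && "$JAVA_HOME/bin/java" -version
```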

Troubleshooting

Ant

If you're trying to use DITA-OT to build docs and you get an exception like this:

java.lang.NoSuchMethodError: org.apache.tools.ant.Main: method <init>()V not found
    at org.dita.dost.invoker.Main.<init>(Main.java:418)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at java.lang.Class.newInstance(Class.java:379)
    at org.apache.tools.ant.launch.Launcher.run(Launcher.java:279)
    at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109)

... your CLASSPATH may be interfering with DITA-OT's ability to find the proper Ant. While you're free to fix the CLASSPATH yourself, it may be easier just to run

unset CLASSPATH

and try again. This will use the libraries and Ant provided by the DITA-OT package.
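
If you would rather not change CLASSPATH for the whole terminal session, the unset can be confined to a single command by running it in a subshell. This is a sketch; the function name run_without_classpath is hypothetical.

```shell
# Run a command with CLASSPATH unset, leaving the caller's value untouched.
run_without_classpath() {
  ( unset CLASSPATH; "$@" )
}

# Usage: run_without_classpath dita -version
```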