impala/testdata/datasets
stiga-huang d74cc7319f IMPALA-9670: Fix unloaded views are shown as tables for GET_TABLES requests
At startup, catalogd pulls the table names from HMS and tracks each
table using an IncompleteTable which only contains the table name. The
table types (TABLE/VIEW) and comments are unknown until the table/view
is loaded in catalogd. GET_TABLES is a request of the HS2 protocol. It
fetches all the tables with their types and comments. For unloaded
tables/views, Impala always returns them with TABLE type (the default)
and empty comments.
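The pre-patch fallback can be sketched in a few lines (a minimal Python sketch; `IncompleteTable` mirrors the name used in the description, while `get_tables` here is a stand-in for the actual GET_TABLES handling, not Impala's code):

```python
# Sketch of the pre-patch behavior: an unloaded table entry carries only its
# name, so a GET_TABLES answer falls back to type TABLE and an empty comment.
from dataclasses import dataclass
from typing import Optional

@dataclass
class IncompleteTable:
    name: str
    table_type: Optional[str] = None   # unknown until the table is loaded
    comment: Optional[str] = None

def get_tables(catalog):
    """Answer a GET_TABLES request: unknown fields get the defaults."""
    return [(t.name, t.table_type or "TABLE", t.comment or "")
            for t in catalog]

catalog = [IncompleteTable("foo"), IncompleteTable("v1")]
# Both come back as TABLE with empty comments, even if v1 is really a view.
print(get_tables(catalog))
```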

This patch enables catalogd to always load the table types and comments
along with the table names. This behavior is controlled by a
catalogd-only flag, --pull_table_types_and_comments, which is false by
default. When this flag is enabled, catalogd will load table types and
comments at startup and when executing INVALIDATE METADATA commands. In
other words, an unloaded table (IncompleteTable) now contains not just
the table name, but also the correct table type and comment.

This is implemented by using the getTableMetas HMS API when invalidating
a table. The original behavior uses getAllTables to load all table names
and uses tableExists to verify whether a table still exists. When the
flag is set, we'll use getTableMetas instead to also load the table
types and comments.
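The contrast between the two strategies can be sketched against a stand-in metastore (FakeHms and its method names are invented for illustration; the real calls are the HMS client APIs getAllTables, tableExists and getTableMetas):

```python
# Hedged sketch contrasting the two load strategies against a fake HMS.
class FakeHms:
    def __init__(self, tables):
        # tables: name -> (type, comment)
        self._tables = tables

    def get_all_tables(self, db):
        # Original path: returns table names only.
        return list(self._tables)

    def table_exists(self, db, name):
        return name in self._tables

    def get_table_metas(self, db):
        # New path: one call yields names plus types and comments.
        return [(n, t, c) for n, (t, c) in self._tables.items()]

hms = FakeHms({"t1": ("TABLE", ""), "v1": ("VIEW", "a view")})

# Original behavior: names only; type and comment stay unknown.
names = hms.get_all_tables("db")
# With --pull_table_types_and_comments: full tuples in a single call.
metas = hms.get_table_metas("db")
print(names, metas)
```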

Implementation:
Add a new table type, UNLOADED_TABLE, in TTableType to identify tables
that are known not to be views but whose specific kind (e.g. Kudu vs.
HDFS table) is unknown because their full metadata is not yet loaded.
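The distinction can be sketched with a plain Python Enum (illustrative only; TTableType is a Thrift enum in Impala, and the member set shown here is an assumption, not a copy of it):

```python
# Sketch of the type distinction introduced by UNLOADED_TABLE.
from enum import Enum, auto

class TTableType(Enum):
    TABLE = auto()           # fully known to be a regular table
    VIEW = auto()
    UNLOADED_TABLE = auto()  # known not to be a view; Kudu vs. HDFS unknown

def is_view(t: TTableType) -> bool:
    # An UNLOADED_TABLE can safely be reported as not-a-view.
    return t is TTableType.VIEW

print(is_view(TTableType.UNLOADED_TABLE), is_view(TTableType.VIEW))
```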

When propagating catalog objects from catalogd to coordinators, views
are sent using a catalog key explicitly prefixed by VIEW, so
coordinators can create IncompleteTables/LocalIncompleteTables with the
correct types.
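The idea of a type-carrying key can be sketched as follows (the actual key format catalogd uses is not shown in this message; the "PREFIX:name" encoding below is hypothetical):

```python
# Hypothetical encoding of a catalog-object key that carries the type,
# so the receiving coordinator can reconstruct it correctly.
def encode_key(name, is_view):
    return f"VIEW:{name}" if is_view else f"TABLE:{name}"

def decode_key(key):
    prefix, name = key.split(":", 1)
    return name, prefix == "VIEW"

key = encode_key("db1.v1", True)
print(key, decode_key(key))
```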

In most cases when creating an IncompleteTable, the table type and
comment are available in the context. For instance, when adding an
IncompleteTable for a CreateTable/CreateView request, we know exactly
whether it's a table or a view, so we can create IncompleteTables with
the correct types.

Test infra changes:
 - Adds a get_tables() method for the hs2_client
 - Extends ImpalaTestSuite.create_client_for_nth_impalad() to support
   the hs2 and hs2-http protocols, so we can create HS2 clients on all
   impalads.

Tests:
 - Add custom cluster tests on all catalog modes (with/without
   local-catalog or event processor). Verify the table types and
   comments are always correct when pull_table_types_and_comments is
   true.

Change-Id: I528bb20272ebdd66a0118c30efc2b0566f2b0e2f
Reviewed-on: http://gerrit.cloudera.org:8080/18626
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-06-24 04:27:49 +00:00

This directory contains Impala test data sets. The directory layout is structured as follows:

datasets/
   <data set>/<data set>_schema_template.sql
   <data set>/<data files SF1>/data files
   <data set>/<data files SF2>/data files

Where SF is the scale factor controlling data size. This allows for scaling the same schema to
different sizes based on the target test environment.
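The layout above maps to paths like the following (a small sketch; the dataset and scale-factor names are illustrative, not taken from the repository):

```python
# Sketch of how the datasets/ layout maps to concrete paths.
import os

def schema_template(root, dataset):
    # <data set>/<data set>_schema_template.sql
    return os.path.join(root, dataset, f"{dataset}_schema_template.sql")

def data_dir(root, dataset, scale_factor):
    # <data set>/<data files SF>/...
    return os.path.join(root, dataset, scale_factor)

print(schema_template("datasets", "tpch"))
print(data_dir("datasets", "tpch", "tpch_sf1"))
```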

The schema template SQL files have the following format:

  The goal is to provide a single place to define a table + data files
  and have the schema and data load statements generated for each combination of file
  format, compression, etc. The way this works is by specifying how to create a
  'base table'. The base table can be used to generate tables in other file formats
  by performing the defined INSERT / SELECT INTO statement. Each new table using the
  file format/compression combination needs to have a unique name, so all the
  statements are parameterized on table name.
  The template file is read in by the 'generate-schema-statements.py' script
  to generate all the schemas for the Impala benchmark tests.

  Each table is defined as a new section in the file with the following format:

  ====
  ---- SECTION NAME
  section contents
  ...
  ---- ANOTHER SECTION
  ... section contents
  ---- ... more sections...

  Note that tables are delimited by '====' and that even the first table in the
  file must include this header line.
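A minimal parser for the format sketched above could look like this (an illustrative sketch, assuming sections are introduced by '---- NAME' lines and tables are delimited by '===='; the real logic lives in generate-schema-statements.py):

```python
# Parse a schema template into a list of {section name: contents} dicts.
def parse_template(text):
    tables = []
    for chunk in text.split("====")[1:]:   # first table also starts with ====
        sections, name = {}, None
        for line in chunk.splitlines():
            if line.startswith("---- "):
                name = line[5:].strip()
                sections[name] = []
            elif name is not None:
                sections[name].append(line)
        tables.append({k: "\n".join(v).strip() for k, v in sections.items()})
    return tables

template = """====
---- DATASET
functional
---- BASE_TABLE_NAME
alltypes
"""
print(parse_template(template))
```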

  The supported section names are:

  DATASET
      Data set name - Used to group sets of tables together
  BASE_TABLE_NAME
      The name of the table within the database
  CREATE
      Explicit CREATE statement used to create the table (executed by Impala)
  CREATE_HIVE
      Same as the above, but will be executed by Hive instead. If specified,
      'CREATE' must not be specified.
  CREATE_KUDU
      Customized CREATE TABLE statement used to create the table for Kudu-specific
      syntax.

  COLUMNS
  PARTITION_COLUMNS
  ROW_FORMAT
  HBASE_COLUMN_FAMILIES
  TABLE_PROPERTIES
  HBASE_REGION_SPLITS
      If no explicit CREATE statement is provided, a CREATE statement is generated
      from these sections (see 'build_table_template' function in
      'generate-schema-statements.py' for details)

  ALTER
      A set of ALTER statements to be executed after the table is created
      (typically to add partitions, but may also be used for other settings that
      cannot be specified directly in the CREATE TABLE statement).

      These statements are ignored for HBase and Kudu tables.

  LOAD
      The statement used to load the base (text) form of the table. This is
      typically a LOAD DATA statement.

  DEPENDENT_LOAD
  DEPENDENT_LOAD_KUDU
  DEPENDENT_LOAD_HIVE
  DEPENDENT_LOAD_ACID
      Statements to be executed during the "dependent load" phase. These statements
      are run after the initial (base table) load is complete.

  HIVE_MAJOR_VERSION
      The required major version of Hive for this table. If the major version
      of Hive at runtime does not exactly match the version specified in this
      section, the table will be skipped.

      NOTE: this is not a _minimum_ version -- if HIVE_MAJOR_VERSION specifies
      '2', the table will _not_ be loaded/created on Hive 3.
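The exact-match rule for HIVE_MAJOR_VERSION can be sketched as follows (function and parameter names are illustrative; the real check lives in the generation script):

```python
# Exact-match check: a table is loaded only if the declared major version
# matches the runtime Hive major version exactly (it is not a minimum).
def should_load(section_value, runtime_hive_major):
    if section_value is None:        # no constraint declared
        return True
    return int(section_value) == runtime_hive_major

# '2' is not a minimum: the table is skipped on Hive 3.
print(should_load("2", 3), should_load("2", 2), should_load(None, 3))
```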