122 Commits

Author SHA1 Message Date
jichen0919
826c8cf9b0 IMPALA-14081: Support create/drop paimon table for impala
This patch mainly implements the creation and dropping of Paimon tables
through Impala.

Supported Impala data types:
- BOOLEAN
- TINYINT
- SMALLINT
- INTEGER
- BIGINT
- FLOAT
- DOUBLE
- STRING
- DECIMAL(P,S)
- TIMESTAMP
- CHAR(N)
- VARCHAR(N)
- BINARY
- DATE

Syntax for creating paimon table:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
(
[col_name data_type ,...]
[PRIMARY KEY (col1,col2)]
)
[PARTITIONED BY (col_name data_type [COMMENT 'col_comment'], ...)]
STORED AS PAIMON
[LOCATION 'hdfs_path']
[TBLPROPERTIES (
'primary-key'='col1,col2',
'file.format' = 'orc/parquet',
'bucket' = '2',
'bucket-key' = 'col3'
)];

Two types of Paimon catalogs are supported.

(1) Create table with hive catalog:

CREATE TABLE paimon_hive_cat(userid INT,movieId INT)
STORED AS PAIMON;

(2) Create table with hadoop catalog:

CREATE [EXTERNAL] TABLE paimon_hadoop_cat
STORED AS PAIMON
TBLPROPERTIES('paimon.catalog'='hadoop',
'paimon.catalog_location'='/path/to/paimon_hadoop_catalog',
'paimon.table_identifier'='paimondb.paimontable');

SHOW TABLE STATS, SHOW COLUMN STATS, SHOW PARTITIONS, and SHOW FILES
statements are also supported.
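
For example, assuming a Paimon table named paimon_tbl (a placeholder
name; SHOW PARTITIONS applies only if the table is partitioned), these
might be invoked as:

SHOW TABLE STATS paimon_tbl;
SHOW COLUMN STATS paimon_tbl;
SHOW PARTITIONS paimon_tbl;
SHOW FILES IN paimon_tbl;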

TODO:
    - Patches pending submission:
        - Query support for paimon data files.
        - Partition pruning and predicate push down.
        - Query support with time travel.
        - Query support for paimon meta tables.
    - WIP:
        - Complex type query support.
        - Virtual column query support for querying
          Paimon data tables.
        - Native Paimon table scanner, instead of the
          JNI-based one.
Testing:
    - Add unit test for paimon impala type conversion.
    - Add unit test for ToSqlTest.java.
    - Add unit test for AnalyzeDDLTest.java.
    - Update default_file_format TestEnumCase in
      be/src/service/query-options-test.cc.
    - Update test case in
      testdata/workloads/functional-query/queries/QueryTest/set.test.
    - Add test cases in metadata/test_show_create_table.py.
    - Add custom test test_paimon.py.

Change-Id: I57e77f28151e4a91353ef77050f9f0cd7d9d05ef
Reviewed-on: http://gerrit.cloudera.org:8080/22914
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2025-09-10 21:24:49 +00:00
stiga-huang
5bdd9c7f39 IMPALA-14227: (Addendum) Add more tests for catalogd HA warm failover
This adds more tests in test_catalogd_ha.py for warm failover.
Refactored _test_metadata_after_failover to run in the following way:
 - Run DDL/DML in the active catalogd.
 - Kill the active catalogd and wait until the failover finishes.
 - Verify the DDL/DML results in the new active catalogd.
 - Restart the killed catalogd.
It accepts two methods as parameters: one to perform the DDL/DML and one
to verify the results. In the last step, the killed catalogd is restarted
so we keep two catalogds running and can merge these tests into a single
test by invoking _test_metadata_after_failover with different method
pairs. This saves some test time.

The following DDL/DML statements are tested:
 - CreateTable
 - AddPartition
 - REFRESH
 - DropPartition
 - INSERT
 - DropTable
After each failover, the table is verified to be warmed up (i.e. loaded).

Also validates flags at startup to make sure enable_insert_events and
enable_reload_events are both set to true when warm failover is enabled,
i.e. --catalogd_ha_reset_metadata_on_failover=false.

Change-Id: I6b20adeb0bd175592b425e521138c41196347600
Reviewed-on: http://gerrit.cloudera.org:8080/23206
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
2025-07-30 00:29:07 +00:00
jasonmfehr
fdad954ce4 IMPALA-13237: [Patch 4 of 5] - Helpers to Visualize OpenTelemetry Traces
Adds helper scripts and configurations to run an OpenTelemetry OTLP
collector and a Jaeger instance. The collector is configured to
receive telemetry data on port 55888 via OTLP-over-http and to
forward traces to a Jaeger-all-in-one container receiving data on
port 4317.

Testing was accomplished by running this setup locally and verifying traces appeared in
the Jaeger UI.

Generated-by: Github Copilot (GPT-4.1)
Change-Id: I198c00ddc99a87c630a6f654042bffece2c9d0fd
Reviewed-on: http://gerrit.cloudera.org:8080/23100
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-07-18 01:33:57 +00:00
stiga-huang
da190f1d86 IMPALA-14074: Warmup metadata cache in catalogd for critical tables
*Background*

Catalogd starts with a cold metadata cache - only the db/table names and
functions are loaded. A table's metadata remains unloaded until queries
are submitted on the table, so the first query suffers from the delay of
loading metadata. There is a flag, --load_catalog_in_background, to let
catalogd eagerly load metadata of all tables even if no queries arrive.
With it, catalogd may load metadata for tables that are possibly never
used, potentially increasing catalog size and consequently memory usage.
So this flag is turned off by default and is not recommended for
production use.

Users do need the metadata of some critical tables to be loaded. Until
that happens, the service is considered not ready, since important
queries might fail with timeouts. When Catalogd HA is enabled, it’s also
required that the standby catalogd has an up-to-date metadata cache to
smoothly take over from the active one when a failover happens.

*New Flags*

This patch adds a startup flag for catalogd to specify a config file
containing the tables whose metadata users want loaded. Catalogd adds
them to the table loading queue in the background when a catalog reset
happens, i.e. at catalogd startup or when a global INVALIDATE METADATA
runs.

The flag is --warmup_tables_config_file. The value can be a path in the
local FS or in remote storage (e.g. HDFS). E.g.
  --warmup_tables_config_file=file:///opt/impala/warmup_table_list.txt
  --warmup_tables_config_file=hdfs:///tmp/warmup_table_list.txt

Each line in the config file can be a fully qualified table name or a
wildcard under a db, e.g. "tpch.*". Catalogd loads the table names at
startup and schedules loading of them after a reset of the catalog. The
scheduling order follows the order in the config file, so important
tables can be put first. Lines starting with "#" or "//" are treated as
comments and ignored.
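
A minimal sketch of what such a config file might look like (the table
names are illustrative):

# Critical tables first
tpcds.store_sales
tpcds.item
// A wildcard loads every table under the db
tpch.*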

Another flag, --keeps_warmup_tables_loaded (defaults to false), is added
to control whether to reload a table after it’s been invalidated, either
explicitly by an INVALIDATE METADATA <table> command or implicitly by
CatalogdTableInvalidator or HMS RELOAD events.

When CatalogdTableInvalidator is enabled with
--invalidate_tables_on_memory_pressure=true, users shouldn’t set
keeps_warmup_tables_loaded to true if the catalogd heap size is not
enough to cache metadata of all these tables. Otherwise, these tables
will keep being loaded and invalidated.

*Catalogd HA Changes*
When Catalogd HA is enabled, the standby catalogd will also reset its
catalog and start loading metadata of these tables after the HA state
(active/standby) is determined. The standby catalogd keeps its metadata
cache up to date by applying HMS notification events. To support a
warmed-up switchover, --catalogd_ha_reset_metadata_on_failover should be
set to false.

*Limitation*
The standby catalogd could still have a stale cache if there are
operations in the active catalogd that don’t trigger HMS notification
events, or if an HMS notification event is not applied correctly. E.g.
adding a new native function generates an ALTER_DATABASE event, but when
applying the event, the db's native function list is not refreshed
(IMPALA-14210). These will be resolved in separate JIRAs.

*Test*
 - Added FE unit tests.
 - Added e2e test for local/hdfs config files.
 - Added e2e test to verify the standby catalogd has a warmed up cache
   when failover happens.

Change-Id: I2d09eae1f12a8acd2de945984d956d11eeee1ab6
Reviewed-on: http://gerrit.cloudera.org:8080/23155
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-07-12 18:50:56 +00:00
Zoltan Borok-Nagy
9c12ef66cc IMPALA-14018: Adding utility scripts to run Lakekeeper in Impala dev environment
This patch adds utility scripts to run Lakekeeper (an open source
Iceberg REST Catalog) in Impala's dev environment. Lakekeeper's HDFS
support is in preview phase, so we are using a preview docker image
for now.

IcebergRESTCatalog's config setup is also refactored: now we only set
"credentials" in the SessionContext if they are provided.

Usage

To start Lakekeeper:
testdata/bin/run-lakekeeper.sh

To stop Lakekeeper:
testdata/bin/stop-lakekeeper.sh

Now you can create schemas and tables via Trino (you need to rebuild the
Trino image for this; TODO: use docker compose for this):

docker stop impala-minicluster-trino
docker rm impala-minicluster-trino
./testdata/bin/build-trino-docker-image.sh
./testdata/bin/run-trino.sh

Then via Trino CLI:
testdata/bin/trino-cli.sh

show catalogs;
create schema iceberg_lakekeeper.trino_db;
use iceberg_lakekeeper.trino_db;
create table trino_t (i int);
insert into trino_t values (35);

After this, you should be able to query the table via Impala:

mkdir /tmp/iceberg_lakekeeper
cp testdata/bin/minicluster_trino/iceberg_lakekeeper.properties /tmp/iceberg_lakekeeper

bin/start-impala-cluster.py --no_catalogd \
    --impalad_args="--catalogd_deployed=false --use_local_catalog=true \
    --catalog_config_dir=/tmp/iceberg_lakekeeper/"

bin/impala-shell.sh

Change-Id: I610f5859f92b2ff82e310f46356e3f118e986b2c
Reviewed-on: http://gerrit.cloudera.org:8080/23141
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-07-11 07:50:16 +00:00
gaurav1086
3781132ef6 IMPALA-13675: OAuth AuthN Support for Impala Shell
This patch adds support for fetching access tokens
from the OAuth server using the OAuth client_id and
client_secret if an access token is not provided.
It covers the client_credentials flow.
The client_secret can either be passed as a file or
entered at a prompt.

Added a test parameter for the Impala shell, oauth_mock_response_cmd,
to mock the OAuth server response; it is only to be used for testing.
Also suppressed the existing hs2_x_forward option from the
shell's --help output.

Testing (Okta OAuth server):
- Added custom_cluster tests in test_shell_jwt_auth.py:
    test_oauth_auth_with_clientid_and_secret_success
    test_oauth_auth_with_clientid_and_secret_failure
- Tested manually by providing --user <user> and
  --oauth_client_secret_cmd="cat password_file.txt"
- Tested manually by providing --user <user> and no
  --oauth_client_secret_cmd, thereby prompting the user
  to enter the client_secret.

Example command: impala-shell.sh -a
--auth_creds_ok_in_clear --protocol="hs2-http"
--oauth_client_id="client_id"
--oauth_client_secret_cmd="cat client_secret.txt"
--oauth_server="dev.us.auth01.com"
--oauth_endpoint="/oauth/token"

Change-Id: I84e26d54f6a53696660728efb239ffd43de4c55d
Reviewed-on: http://gerrit.cloudera.org:8080/22424
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-06-05 21:15:47 +00:00
Mihaly Szjatinya
4837cedc79 IMPALA-10319: Support arbitrary encodings on Text files
As proposed in the Jira, this implements decoding and encoding of text
buffers for Impala/Hive text tables. Given a table with the
'serialization.encoding' property set, similarly to Hive, Impala should
be able to encode inserted data into the specified charset before
saving it into a text file. The opposite decoding operation is
performed when reading data buffers from text files. Both operations
employ the boost::locale::conv library.

Since Hive doesn't encode line delimiters, charsets that would have
delimiters stored differently from ASCII are not allowed.

One difference from Hive is that Impala implements
'serialization.encoding' only as a per-partition serde property, to
avoid the confusion of allowing both serde and table properties (see
the related IMPALA-13748).
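
As a rough sketch (the table, partition and charset here are
hypothetical, and the exact statement depends on the per-partition rule
described above):

ALTER TABLE text_tbl PARTITION (p=1)
SET SERDEPROPERTIES ('serialization.encoding'='ISO-8859-1');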

Note: Due to the pre-created non-UTF-8 files present in the patch,
'gerrit-code-review-checks' was performed locally (see IMPALA-14100).

Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
Reviewed-on: http://gerrit.cloudera.org:8080/22049
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-06-01 21:31:00 +00:00
Surya Hebbar
e419964250 IMPALA-13615: Support row grouping of instances based on fragment names
In the "Fragment Instances" page of a query, even though it is possible
to sort the rows based on the fragment's name, it is difficult to
distinguish between fragments and their instances.

With row grouping based on the fragment's name, it becomes easier to
distinguish one fragment's instances from another's.

The lexicographical sorting of instances can still be done based on
different columns, which splits the fragment's group and orders the rows
lexicographically based only on the column's values.

Row grouping has been implemented using the "RowGroup" extension
for datatables - https://datatables.net/extensions/rowgroup/.

The DataTables library and its extensions have been added under the
"www/datatables" directory.

The DataTables library's license has been updated to match version
1.13.2; it was previously out of date.

The related row grouping extension's license has also been included.

Change-Id: If2b7ed6e2a6d605553242a7db4dbeaa7fcae4606
Reviewed-on: http://gerrit.cloudera.org:8080/22226
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-05-22 05:42:18 +00:00
Joe McDonnell
ea0969a772 IMPALA-11980 (part 2): Fix absolute import issues for impala_shell
Python 3 changed the behavior of imports with PEP328. Existing
imports become absolute unless they use the new relative import
syntax. This adapts the impala-shell code to use absolute
imports, fixing issues where it is imported from our test code.

There are several parts to this:
1. It moves impala shell code into shell/impala_shell.
   This matches the directory structure of the PyPi package.
2. It changes the imports in the shell code to be
   absolute paths (i.e. impala_shell.foo rather than foo).
   This fixes issues with Python 3 absolute imports.
   It also eliminates the need for ugly hacks in the PyPi
   package's __init__.py.
3. This changes Thrift generation to put it directly in
   $IMPALA_HOME/shell rather than $IMPALA_HOME/shell/gen-py.
   This means that the generated Thrift code is rooted in
   the same directory as the shell code.
4. This changes the PYTHONPATH to include $IMPALA_HOME/shell
   and not $IMPALA_HOME/shell/gen-py. This means that the
   test code is using the same import paths as the pypi
   package.

With all of these changes, the source code is very close
to the directory structure of the PyPi package. As long as
CMake has generated the thrift files and the Python version
file, only a few differences remain. This removes those
differences by moving the setup.py / MANIFEST.in and other
files from the packaging directory to the top-level
shell/ directory. This means that one can pip install
directly from the source code. i.e. pip install $IMPALA_HOME/shell

This also moves the shell tarball generation script to the
packaging directory and changes bin/impala-shell.sh to use
Python 3.

This sorts the imports using isort for the affected Python files.

Testing:
 - Ran a regular core job with Python 2
 - Ran a core job with Python 3 and verified that the absolute
   import issues are gone.

Change-Id: Ica75a24fa6bcb78999b9b6f4f4356951b81c3124
Reviewed-on: http://gerrit.cloudera.org:8080/22330
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2025-05-21 15:14:11 +00:00
Zoltan Borok-Nagy
49aaaa2cd5 IMPALA-13927: Fix crash on invalid BINARY data in TEXT tables
BINARY data in text files is expected to be Base64 encoded.
TextConverter::WriteSlot has a bug when decoding Base64:
it does not set the NULL-indicator bit for the slots of
the invalid BINARY values. Therefore Tuple::CopyStrings can
later try to copy invalid StringValue objects.

This patch fixes TextConverter::WriteSlot to set the NULL-indicator
bit in case of Base64 parse errors.

Testing
 * e2e test added

Change-Id: I79b712e2abe8ce6ecfbce508fd9e4e93fd63c964
Reviewed-on: http://gerrit.cloudera.org:8080/22721
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-04-03 13:57:24 +00:00
Surya Hebbar
a68717cac9 IMPALA-13304: Include aggregate instance-level metrics in JSON profile
The JSON representation of the aggregated profile or
`AggregatedRuntimeProfile` excludes some instance-level metrics
(e.g. timeseries counters, event sequences, etc.) in order to keep
the profile size from growing rapidly.

In contrast, the traditional profile contains the associated metrics
from the root profiles (`RuntimeProfile`s) of all instances as well as
the aggregated profile.

The experimental profile only contains the aggregated profile.
Hence, the instance-level metrics are not present.

This patch introduces some of the aggregated instance-level metrics
to the experimental profile's JSON representation without allowing
the profile size to grow rapidly.

It also provides insights into instances with unreported or missing events.

The following attributes have been exposed to the JSON form after grouping
or aggregation.

- Aggregated Event Sequences
- Aggregated Info Strings (i.e. Table Name)

The timestamps across instances for a particular event are grouped
when the number of instances is small and aggregated otherwise,
in order to maintain the profile size and facilitate analysis.

This behavior is controlled by the json_profile_event_timestamp_limit,
which defaults to 5.

For events where the number of timestamps exceeds this limit,
they are grouped into buckets of the same size.

These buckets are spans divided evenly between the minimum and
maximum timestamps for the event. With the default limit of 5,
this results in spans of 20%.

The following aggregates are then calculated for each of these buckets,
to provide a clear and efficient summary of the data.
* Maximum timestamp
* Minimum timestamp
* Average timestamp
* Total no. of instances

The aggregate metrics are calculated with minimal overhead through
assignment to a particular division without the need for sorting,
resulting in a time complexity of O(n) with only two passes through
the entire list of timestamps.

To further optimize performance, the aggregates are computed without
storing each division's timestamps, using only the memory required for
a single value per metric instead of the entire range of values, while
reusing previously allocated vectors.

To efficiently copy the calculated values without internally
reallocating on each insertion, memory is preallocated for each array
of metrics using the RapidJSON library.

In the case of missing events, the timestamps are ordered and aligned
through the analysis of 'label_idxs'. If at least one instance contains
a complete set of events, all instances with missing timestamps are
ordered and aligned efficiently by referencing the reordering of labels.
Otherwise, the initial ordering and alignment are retained.

If any fragment instances report only a subset of events due to failure
or error, such instances are reported and only the unavailable timestamps
are skipped during the aggregate metrics calculation, while leveraging
the available timestamps.

The instances containing missing events are further recorded into the
"unreported_event_instance_idxs" field within the event sequence. These
indexes for instances are based on 'exec_params_' set during execution.
Please refer to IMPALA-13555 for further details.

All of the above logic has been encapsulated into the newly added
`ToJson` method within the `AggEventSequence` struct, prioritizing better
reuse and maintainability.

Structure of the `AggEventSequence` in JSON form -
{
  profile_name : <PLAN_NODE_NAME>,
  num_children : <NUM_CHILDREN>
  node_metadata : <NODE_METADATA_OBJECT>
  event_sequences :
  [{
    events : // An example event
    [{
      label : "Open Started""
      ts_list : [ 2257887941, <other instances' timestamps> ]
       // OR
      ts_stat :
      {
        min : [ 2257887941, ...<other divisions' minimum timestamps> ],
        max : [ 3257887941, ...<other divisions' maximum timestamps> ],
        avg : [ 2757887941, ...<other divisions' average timestamps> ]
        count : [ 2, ... <other counts of divisions' no. of instances> ]
      }
    }, <...other plan node's events>
    ],
    // This field is only included, if there are unreported events
    unreported_event_instance_idxs : [ 3, 5, 0 ]
  }],
  counters : <COUNTERS_OBJECT_ARRAY>,
  child_profiles : <CHILD_PROFILES>
}

Structure of `AggInfoStrings` in JSON form.
{
  profile_name : <PLAN_NODE_NAME>,
  num_children : <NUM_CHILDREN>
  node_metadata : <NODE_METADATA_OBJECT>
  "info_strings" :
  [{
    "key": "<info string's key>",
    "values": [<distinct info string values>]
  }]
  counters : <COUNTERS_OBJECT_ARRAY>,
  child_profiles : <CHILD_PROFILES>
}

Note: In the above structures, unlike a plan node's profile,
a fragment's profile does not contain the 'node_metadata' field.

Added unit tests for the serialization of aggregated metrics -
- Added tests for handling info strings in aggregated JSON profiles
- Introduced AggregatedEventSequenceToJsonTest fixture to validate
  event sequence serialization
- Added random profile generation for varied test conditions
- Covered scenarios for complete and missing events in both aggregated
  and grouped cases
- Ensured correct JSON structure for info strings and event sequences
- Ensured proper timestamp ordering and aggregation logic in serialized
  JSON profiles

Generated the latest expected JSON profile outputs from the
'impala-profile-tool' using the stored impala profile logs.

Added additional tests in tests/observability for profile v2's
JSON output, after inclusion of the new expected JSON profile
formats.

Change-Id: I49e18a7a7e1288e3e674e15b6fc86aad60a08214
Reviewed-on: http://gerrit.cloudera.org:8080/21683
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-02-07 11:54:35 +00:00
Joe McDonnell
aefd1b0920 IMPALA-13551: Produce the shell tarball by pip installing impala-shell
Currently, the shell tarball maintains its own packaging code
and directory layout. This is very complicated and has several Python
packages checked directly into our repository.

To simplify it, this changes the shell tarball to be based on
pip installing the pypi package. Specifically, the new directory
structure for an unpacked shell tarball is:
impala-shell-4.5.0-SNAPSHOT/
  impala-shell
  install_py${PYTHON_VERSION}/
  install_py${ANOTHER_PYTHON_VERSION}/
For example, install_py2.7 is the Python 2.7 pip install of impala-shell.
install_py3.8 is a Python 3.8 pip install of impala-shell. This means
that the impala-shell script simply picks the install for the
specified version of python and uses that pip install directory.
To make this more consistent across different Linux distributions, this
upgrades pip in the virtualenv to the latest.

With this, ext-py and pkg_resources.py can be removed.

This requires rearranging the shell build code. Specifically, this splits
out the code that generates impala_build_version.py so that it can run
before generating the pypi package. The shell tarball now has a dependency
on the pypi package and must run after it.

This builds on Michael Smith's work from IMPALA-11399.

Testing:
 - Ran shell tests locally
 - Built on Centos 7, Redhat 8 & 9, Ubuntu 20 & 22, SLES 15

Change-Id: Ifbb66ab2c5bc7180221f98d9bf5e38d62f4ac036
Reviewed-on: http://gerrit.cloudera.org:8080/20171
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-12-17 22:52:01 +00:00
Zoltan Borok-Nagy
f1133acc2a IMPALA-13088, IMPALA-13109: Use RoaringBitmap instead of sorted vector of int64s
This patch replaces the sorted 64-bit integer vectors that we
use in IcebergDeleteNode with 64-bit roaring bitmaps. We use the
CRoaring library (version 4.0.0). CRoaring also offers C++ classes,
but this patch adds its own thin C++ wrapper class around the C
functions to get the best performance.

Toolchain Clang 5.0.1 was not able to compile CRoaring due to a
bug which is tracked by IMPALA-13190; this patch also fixes that
with a new toolchain.

Performance
I used an extended version of the "One Trillion Row" challenge. This
means that after inserting 1 trillion records into a table I also
deleted/updated lots of records (see the statements at the end). So at
the end I had 1 trillion data records and ~68.5 billion delete records
in the table.

For the measurements I used clusters with 10 and 40 executors, and
executed the following query:

 SELECT station, min(measure), max(measure), avg(measure)
 FROM measurements_extra_1trc_partitioned
 GROUP BY 1
 ORDER BY 1;

JOIN BUILD times:
+----------------+--------------+--------------+
| Implementation | 10 executors | 40 executors |
+----------------+--------------+--------------+
| Sorted vectors | CRASH        | 4m15s        |
| Roaring bitmap | 6m35s        | 1m51s        |
+----------------+--------------+--------------+

The 10-executor cluster with sorted vectors failed to run the query
because the executors crashed due to running out of memory.

Memory usage (VmRSS) for 10 executors:
+----------------+------------------------+
| Implementation |      10 executors      |
+----------------+------------------------+
| Sorted vectors | 54.4 GB (before CRASH) |
| Roaring bitmap | 7.4 GB                 |
+----------------+------------------------+

The resource estimations were wrong when MT_DOP was greater than 1. This
has also been fixed.

Testing:
 * added tests for RoaringBitmap64
 * added tests for resource estimations

Statements I used to delete / update the records for the One Trillion
Row challenge:

create table measurements_extra_1trc_partitioned(
    station string, ts timestamp, sensor_type int, measure decimal(5,2))
partitioned by spec (bucket(11, station), day(ts),
    truncate(10, sensor_type))
stored as iceberg;

The original challenge didn't have any row-level modifications; columns
'ts' and 'sensor_type' are new:
 'ts': timestamps that span a year
 'sensor_type': integers between 0 and 100

Both 'ts' and 'sensor_type' have uniform distributions.
Both 'ts' and 'sensor_type' has uniform distribution.

Ingested data with the help of the original One Trillion Row challenge
table, then issued the following DML statements:

-- DELETE ~10 Billion
delete from measurements_extra_1trc_partitioned
where sensor_type = 13;

-- UPDATE ~220 Million
update measurements_extra_1trc_partitioned
set measure = cast(measure - 2 as decimal(5,2))
  where station in ('Budapest', 'Paris', 'Zurich', 'Kuala Lumpur')
  and sensor_type in (7, 17, 77);

-- DELETE ~7.1 Billion
delete from measurements_extra_1trc_partitioned
where ts between '2024-01-15 11:30:00' and '2024-09-10 11:30:00'
  and sensor_type between 45 and 51
  and station regexp '[ATZ].*';

-- UPDATE ~334 Million
update measurements_extra_1trc_partitioned
set measure = cast(measure + 5 as decimal(5,2))
where station in ('Accra', 'Addis Ababa', 'Entebbe', 'Helsinki',
    'Hong Kong', 'Nairobi', 'Ottawa', 'Tauranga', 'Yaounde', 'Zagreb',
    'Zurich')
  and ts > '2024-11-05 22:30:00'
  and sensor_type > 90;

-- DELETE 50.6 Billion
delete from measurements_extra_1trc_partitioned
where
  sensor_type between 65 and 77
  and ts > '2024-08-11 12:00:00'
;

-- UPDATE ~200 Million
update measurements_extra_1trc_partitioned
set measure = cast(measure + 3.5 as decimal(5,2))
where
  sensor_type in (56, 66, 76, 86, 96)
  and ts < '2024-03-17 01:00:00'
  and (station like 'Z%' or station like 'Y%');

Change-Id: Ib769965d094149e99c43e0044914d9ecccc76107
Reviewed-on: http://gerrit.cloudera.org:8080/21557
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-07-16 17:35:13 +00:00
Surya Hebbar
42e5ea7ea3 IMPALA-13106: Support larger imported query profile sizes through compression
Imported query profiles are currently being stored in IndexedDB.
Although IndexedDB does not have storage limitations like other
browser storage APIs, there is a storage limit for a single
attribute / field.

To support larger query profiles, version 2.1.0 of the 'pako'
compression library has been added along with its associated license.

Before a query profile's JSON is added to IndexedDB, it is compressed
using this library.

As compressing and parsing a profile is a long-running process
that can block the main thread, this work has been delegated to
a worker script running in the background. The worker script
returns the parsed query attributes and the compressed profile text
sent to it.
The process of compression consumes time; hence, an alert message is
displayed on the queries page warning user to refrain from closing or
reloading the page. On completion, the raw total size, compressed
total size, and total processing time are logged to the browser console.

When multiple profiles are chosen, after each query profile insertion,
the subsequent one is not triggered until compression and insertion
are finished.

The inserted query profile field is decompressed before parsing on
the query plan, query profile, query statement, and query timeline page.

Added tests for the compression library methods utilized by
the worker script.

Manual testing has been done on Firefox 126.0.1 and Chrome 126.0.6478.

Change-Id: I8c4f31beb9cac89051460bf764b6d50c3933bd03
Reviewed-on: http://gerrit.cloudera.org:8080/21463
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-06-25 22:06:24 +00:00
Fang-Yu Rao
3a2f5f28c9 IMPALA-12921, IMPALA-12985: Support running Impala with locally built Ranger
The goals and non-goals of this patch could be summarized as follows.
Goals:
 - Add changes to the minicluster configuration that allow a non-default
   version of Ranger (possibly built locally) to run in the context of
   the minicluster, and to be used as the authorization server by
   Impala.
 - Switch to the new constructor when instantiating
   RangerAccessRequestImpl. This resolves IMPALA-12985 and also makes
   Impala compatible with Apache Ranger if RangerAccessRequestImpl from
   Apache Ranger is consumed.
 - Prepare Ranger and Impala patches as supplemental material to verify
   what authorization-related tests could be passed if Apache Ranger is
   the authorization provider. Merging IMPALA-12921_addendum.diff to
   the Impala repository is not in the scope of this patch in that the
   diff file changes the behavior of Impala and thus more discussion is
   required if we'd like to merge it in the future.

Non-goals:
 - Set up any automation for building Ranger from source.
 - Pass all Impala authorization-related tests with a non-default
   version of Ranger.

Instructions on running Impala with locally built Ranger:

Suppose the Ranger project is under the folder $RANGER_SRC_DIR. We could
execute the following to build Apache Ranger for easy reference. By
default, the compressed tarball is produced under
$RANGER_SRC_DIR/target.

mvn clean compile -B -nsu -DskipCheck=true -Dcheckstyle.skip=true \
package install -DskipITs -DskipTests -Dmaven.javadoc.skip=true

After building Ranger, we need to build Impala's Java code so that
Impala's Java code could consume the locally produced Ranger classes. We
will need to export the following environment variables before building
Impala. This prevents bootstrap_toolchain.py from trying to download the
compressed Ranger tarball.

1. export RANGER_VERSION_OVERRIDE=\
   $(mvn -f $RANGER_SRC_DIR/pom.xml -q help:evaluate \
   -Dexpression=project.version -DforceStdout)

2. export RANGER_HOME_OVERRIDE=$RANGER_SRC_DIR/target/\
   ranger-${RANGER_VERSION_OVERRIDE}-admin

It then suffices to execute the following to point
Impala to the locally built Ranger server before starting Impala.

1. source $IMPALA_HOME/bin/impala-config.sh

2. tar zxv -f $RANGER_SRC_DIR/target/\
   ranger-${IMPALA_RANGER_VERSION}-admin.tar.gz \
   -C $RANGER_SRC_DIR/target/

3. $IMPALA_HOME/bin/create-test-configuration.sh

4. $IMPALA_HOME/bin/create-test-configuration.sh \
   -create_ranger_policy_db

5. $IMPALA_HOME/testdata/bin/run-ranger.sh
   (run-all.sh has to be executed instead if other underlying services
   have not been started)

6. $IMPALA_HOME/testdata/bin/setup-ranger.sh

Testing:
 - Manually verified that we could point Impala to a locally built
   Apache Ranger on the master branch (with tip being
   https://github.com/apache/ranger/commit/4abb993).
 - Manually verified that with RANGER-4771.diff and
   IMPALA-12921_addendum.diff, only 3 authorization-related tests
   failed. They failed because the resource type of 'storage-type' is
   not supported in Apache Ranger yet and thus the test cases added in
   IMPALA-10436 could fail.
 - Manually verified that the log files of Apache and CDP Ranger's Admin
   server could be created under ${RANGER_LOG_DIR} after we start the
   Ranger service.
 - Verified that this patch passed the core tests when CDP Ranger is
   used.

Change-Id: I268d6d4d6e371da7497aac8d12f78178d57c6f27
Reviewed-on: http://gerrit.cloudera.org:8080/21160
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-06-15 10:25:13 +00:00
Riza Suminto
b1320bd1d6 IMPALA-13075: Cap memory usage for ExprValuesCache at 256KB
ExprValuesCache uses BATCH_SIZE as a deciding factor to set its
capacity. It bounds the capacity such that expr_values_array_ memory
usage stays below 256KB. This patch tightens that limit to include all
memory usage from ExprValuesCache::MemUsage() instead of
expr_values_array_ only. Therefore, setting a very high BATCH_SIZE will
not push the total memory usage of ExprValuesCache beyond 256KB.

Simplify table dimension creation methods and fix a few flake8 warnings
in test_dimensions.py.

Testing:
- Add test_join_queries.py::TestExprValueCache.
- Pass core tests.

Change-Id: Iee27cbbe8d3100301d05a6516b62c45975a8d0e0
Reviewed-on: http://gerrit.cloudera.org:8080/21455
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-06-15 00:28:38 +00:00
halim.kim
f7e629935b IMPALA-11871: Skip permissions loading and check on HDFS if Ranger is enabled
Before this patch, Impala checked whether the Impala service user had
WRITE access to the target HDFS table/partition(s) during the
analysis of the INSERT and LOAD DATA statements in the legacy catalog
mode. The access levels of the corresponding HDFS table and partitions
were computed by the catalog server solely based on the HDFS permissions
and ACLs when the table and partitions were instantiated.

After this patch, we skip loading HDFS permissions and assume the
Impala service user has the READ_WRITE permission on all the HDFS paths
associated with the target table during query analysis when Ranger is
enabled. The assumption could be removed after Impala's implementation
of FsPermissionChecker could additionally take Ranger's policies of HDFS
into consideration when performing the check.

Testing:
 - Added end-to-end tests to verify Impala's behavior with respect to
   the INSERT and LOAD DATA statements when Ranger is enabled in the
   legacy catalog mode.

Change-Id: Id33c400fbe0c918b6b65d713b09009512835a4c9
Reviewed-on: http://gerrit.cloudera.org:8080/20221
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-06-11 06:28:53 +00:00
Yida Wu
9837637d93 IMPALA-12920: Support ai_generate_text built-in function for OpenAI's chat completion API
Added support for following built-in functions:
- ai_generate_text_default(prompt)
- ai_generate_text(ai_endpoint, prompt, ai_model,
  ai_api_key_jceks_secret, additional_params)

'ai_endpoint', 'ai_model' and 'ai_api_key_jceks_secret' are flagfile
options. The 'ai_generate_text_default(prompt)' syntax expects all of
these to be set to proper values. The other syntax will try to use the
provided input parameter values, but falls back to the instance-level
values if the inputs are NULL or empty.

Only public OpenAI (api.openai.com) and Azure OpenAI (openai.azure.com)
API endpoints are currently supported.

Exposed these functions in FunctionContext so that they can also be
called from UDFs:
- ai_generate_text_default(context, model)
- ai_generate_text(context, ai_endpoint, prompt, ai_model,
  ai_api_key_jceks_secret, additional_params)

Testing:
- Added unit tests for AiGenerateTextInternal function
- Added fe test for JniFrontend::getSecretFromKeyStore
- Ran manual tests to make sure Impala can talk with OpenAI LLMs using
'ai_generate_text' built-in function. Example sql:
select ai_generate_text("https://api.openai.com/v1/chat/completions",
"hello", "gpt-3.5-turbo", "open-ai-key",
'{"temperature": 0.9, "model": "gpt-4"}')
- Tested using standalone UDF SDK and made sure that the UDFs can invoke
  BuiltInFunctions (ai_generate_text and ai_generate_text_default)

Change-Id: Id4446957f6030bab1f985fdd69185c3da07d7c4b
Reviewed-on: http://gerrit.cloudera.org:8080/21168
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-11 07:25:50 +00:00
Xiang Yang
e74bb9d81b IMPALA-12362: (part-2/4) Optimize default configurations for packaging module.
To avoid absolute paths and keep things simple, optimize the default
configurations for the packaging module by removing or changing some
entries.

At the same time, add license headers to 'package/conf/*-site.xml' and
rename them to '*-site.xml.template' to force administrators to make
configurations appropriate for their cluster.

Testing:
 - Manually deployed packages on Ubuntu 22.04 and verified them.

Change-Id: Ifda229b779a3d6fca647bb81fe23dd61ad7e5d66
Reviewed-on: http://gerrit.cloudera.org:8080/20928
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-08 04:29:26 +00:00
wzhou-code
fc74ca672a IMPALA-12378: Auto Ship JDBC Data Source
This patch moves the source files of the jdbc package to fe.
The data source location is now optional: a data source can be created
without specifying an HDFS location, assuming the data source class is
in the classpath and an instance of it can be created with the current
class loader. Impala still tries to load the jar file of the data source
at runtime if a data source location is set.

Testing:
 - Passed core test
 - Passed dockerised-tests

Change-Id: I0daff8db6231f161ec27b45b51d78e21733d9b1f
Reviewed-on: http://gerrit.cloudera.org:8080/20971
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
2024-02-07 16:29:11 +00:00
Eyizoha
3af1930229 IMPALA-12322: Support converting UTC timestamps read from Kudu to local time
This patch adds a query option 'convert_kudu_utc_timestamps' similar to
'convert_legacy_hive_parquet_utc_timestamps'. When enabled, it converts
UTC timestamps read from Kudu to local timestamps.

The corresponding modifications also cover predicate pushdown and
runtime filters. Due to the ambiguity of timestamps caused by daylight
saving time changes, which is difficult to resolve in the bloom filter,
this patch additionally introduces a query option
'disable_kudu_local_timestamp_bloom_filter' that disables the Kudu
timestamp bloom filter by default once time zone conversion is enabled,
in order to avoid erroneously filtering out data. However, for regions
that do not observe daylight saving time, it can be set to false to
re-enable the Kudu local timestamp bloom filter.
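
A sketch of how these options might be used from the shell (the table
and column names are illustrative):

SET convert_kudu_utc_timestamps=true;
-- Only if the local time zone has no DST, per the note above:
SET disable_kudu_local_timestamp_bloom_filter=false;
SELECT * FROM kudu_events WHERE event_ts > '2023-06-01 00:00:00';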

Testing:
- Add TestKuduTimestampConvert in query_test/test_kudu.py, which
  performs end-to-end testing in a custom cluster, including basic Kudu
  UTC timestamp conversion testing, as well as checking that the related
  predicate pushdown and runtime filters work correctly (even with
  timestamps involving daylight saving time conversions).

Change-Id: I9a1e7a13e617cc18deef14289cf9b958588397d3
Reviewed-on: http://gerrit.cloudera.org:8080/20681
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>
2023-12-14 13:19:35 +00:00
Michael Smith
465cc7acf7 IMPALA-11542: Import LLVM SectionMemoryManager for fixes
Imports SectionMemoryManager so we can add some fixes that require
modifying private code. This could be done as a patch on LLVM, but the
memory manager seems like something we should own and may need to
customize in other ways.

Changes namespace and adds using statements to make it compile. Uses
LLVM 5.0.1 with commit 787614d08bda40daf3e168bd46a8c2a86319ec63 added as
a small security fix.

Change-Id: I8917005094903ed0ece25e40eb445abb159b569b
Reviewed-on: http://gerrit.cloudera.org:8080/20696
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2023-12-01 18:25:51 +00:00
Surya Hebbar
fed578580b IMPALA-12364: Display memory, disk and network metrics in webUI's query timeline
This patch adds fragment-level metrics to the WebUI query timeline
display along with additional disk and network metrics.

The fragment's plan nodes are enlarged with an animated transition on
hovering over the fragment's row in the query timeline's fragment
diagram.

On clicking the plan nodes, total thread and memory usage of the parent
fragment are displayed, after accumulating memory and thread usage of
all child nodes. Thread usage is being shown on the additional Y-axis.

In this way, memory and thread usage of multiple fragments can be
compared alongside. A fragment's usage can be hidden by clicking
on any of the child plan nodes again.

These counters are available within the profile with following names.

- MemoryUsage
- ThreadUsage

Once a fragment's metrics are displayed, they are updated as they
are collected from the profile during a running query.

A grid-line is displayed along with a tooltip on hovering over the
fragment diagram, containing the instantaneous time at that position.
This grid-line also triggers tooltips and gridlines in other charts.

A warning is displayed on clicking a fragment that has a low number of
samples available.

The RESOURCE_TRACE_RATIO query option must be set to provide periodic
metrics within the profile (see the example after the list below). This
allows the following time series counters to be displayed on the query
timeline.

- HostDiskWriteThroughput
- HostDiskReadThroughput
- HostNetworkRx
- HostNetworkTx
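
For example, the option might be enabled for the session before running
the query (the value is illustrative):

SET RESOURCE_TRACE_RATIO=1.0;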

The additional Y-axis within the utilization chart is used to represent
the average of these metrics.

The memory units in tooltips and ticks on co-ordinate axes are displayed
in human readable form such as KB, MB, GB and PB for convenience.

Both of the charts contain controls to close the chart. These charts
can also be resized, up to maximum and minimum limits, by dragging the
resize bar's handle.

Along with mouse wheel events, the diagrams can be horizontally
stretched with the help of buttons with horizontal zoom icons at the
top of the page. The zoom-out button is disabled when further zoom out
is not possible.

Timeticks are autoscaled during the fragment diagram's horizontal zoom.

In addition to the scrollbar, hovering on edges of the window allows
horizontal scrolling.

Test cases have been added for the additional disk, network and
fragment-level memory metrics parsing functions.

Change-Id: Ifd25e6f0bc9fbd664ec98936daff3f27182dfc7f
Reviewed-on: http://gerrit.cloudera.org:8080/20355
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-11-09 02:32:27 +00:00
Fucun Chu
c2bd30a1b3 IMPALA-5741: Initial support for reading tiny RDBMS tables
This patch uses the "external data source" mechanism in Impala to
implement data source for querying JDBC.
It has some limitations due to the restrictions of "external data
source":
  - It is not distributed, e.g. the fragment is unpartitioned. The
    queries are executed on the coordinator.
  - Queries which read following data types from external JDBC tables
    are not supported:
    BINARY, CHAR, DATETIME, and COMPLEX.
  - Only support binary predicates with operators =, !=, <=, >=,
    <, > to be pushed to RDBMS.
  - Following data types are not supported for predicates:
    DECIMAL, TIMESTAMP, DATE, and BINARY.
  - External tables with complex types of columns are not supported.
  - Support is limited to the following databases:
    MySQL, Postgres, Oracle, MSSQL, H2, DB2, and JETHRO_DATA.
  - Catalog V2 is not supported (IMPALA-7131).
  - DataSource objects are not persistent (IMPALA-12375).

Additional fixes are planned on top of this patch.

Source files under jdbc/conf, jdbc/dao and jdbc/exception are
replicated from Hive JDBC Storage Handler.

In order to query the RDBMS tables, the following steps should be
followed (note that existing data source tables will be rebuilt):
1. Make sure the Impala cluster has been started.

2. Copy the jar files of JDBC drivers and the data source library into
HDFS.
${IMPALA_HOME}/testdata/bin/copy-ext-data-sources.sh

3. Create an `alltypes` table in the Postgres database.
${IMPALA_HOME}/testdata/bin/load-ext-data-sources.sh

4. Create data source tables (alltypes_jdbc_datasource and
alltypes_jdbc_datasource_2).
${IMPALA_HOME}/bin/impala-shell.sh -f\
  ${IMPALA_HOME}/testdata/bin/create-ext-data-source-table.sql

5. You are now ready to run queries against the data source tables
created in the last step. There is no need to restart the Impala
cluster.
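
As a hedged sketch of what step 4 roughly does (the data source name,
jar path, class name and init string below are assumptions for
illustration, not the exact contents of the script):

CREATE DATA SOURCE jdbc_ds
LOCATION 'hdfs:///path/to/jdbc-data-source.jar'
CLASS 'org.apache.impala.extdatasource.jdbc.JdbcDataSource'
API_VERSION 'V1';

CREATE TABLE alltypes_jdbc_datasource (id INT, bool_col BOOLEAN)
PRODUCED BY DATA SOURCE jdbc_ds(
  '{"database.type":"POSTGRES","jdbc.url":"jdbc:postgresql://localhost:5432/functional","table":"alltypes"}');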

Testing:
 - Added unit-test for Postgres and ran unit-test with JDBC driver
   postgresql-42.5.1.jar.
 - Ran manual unit-test for MySql with JDBC driver
   mysql-connector-j-8.1.0.jar.
 - Ran core tests successfully.

Change-Id: I8244e978c7717c6f1452f66f1630b6441392e7d2
Reviewed-on: http://gerrit.cloudera.org:8080/17842
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-10-10 02:13:59 +00:00
Sebastian Pop
333902afcc [arm64] remove dependence on sse2neon
This patch removes the dependence on sse2neon by rewriting SSE2 and AVX code
with native NEON instructions. Part of the patch has been submitted to
Kudu: https://gerrit.cloudera.org/#/c/20374/

Change-Id: If3c78c877ef530fa9f35d36da523ad67ab34e5e7
Reviewed-on: http://gerrit.cloudera.org:8080/19954
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-09-26 02:22:27 +00:00
Joe McDonnell
1a1a84ee23 IMPALA-12434: Isolate pkg_resources.py to its own directory
In some build environments, the impala-shell Python 3
virtualenv install fails due to interactions with
shell/pkg_resources.py. This doesn't reproduce in the standard
development environment, but it is consistent. It seems to
be related to invoking a command in ${IMPALA_HOME}/shell
and the pkg_resources.py being in that directory.

To avoid any interactions, this moves shell/pkg_resources.py
to shell/legacy/pkg_resources.py. This keeps it off of the
path for the failing command, and it also keeps it off of
our PYTHONPATH (which includes ${IMPALA_HOME}/shell).

Testing:
 - Ran a build in the affected build environment
 - Ran a core job

Change-Id: Id8f2d8a8472c7bb405bf88673ed9779e23cde1d6
Reviewed-on: http://gerrit.cloudera.org:8080/20468
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-09-19 04:30:09 +00:00
Eyizoha
2f06a7b052 IMPALA-10798: Initial support for reading JSON files
A prototype of HdfsJsonScanner implemented based on rapidjson, which
supports scanning data from split JSON files.

The scanning of JSON data is mainly completed by two parts working
together. The first part is the JsonParser, responsible for parsing the
JSON objects, which is implemented based on the SAX-style API of
rapidjson. It reads data from the char stream, parses it, and calls the
corresponding callback function when it encounters the corresponding
JSON element. See the comments of the JsonParser class for more details.

The other part is the HdfsJsonScanner, which inherits from HdfsScanner
and provides callback functions for the JsonParser. The callback
functions are responsible for providing data buffers to the Parser and
converting and materializing the Parser's parsing results into RowBatch.
It should be noted that the parser returns numeric values as strings to
the scanner. The scanner uses the TextConverter class to convert the
strings to the desired types, similar to how the HdfsTextScanner works.
This is an advantage compared to using the numeric values provided by
rapidjson directly, as it eliminates concerns about inconsistencies in
converting decimals (e.g. losing precision).

Added a startup flag, enable_json_scanner, to be able to disable this
feature if we hit critical bugs in production.

Limitations
 - Multiline json objects are not fully supported yet. It is ok when
   each file has only one scan range. However, when a file has multiple
   scan ranges, there is a small probability of incomplete scanning of
   multiline JSON objects that span ScanRange boundaries (in such cases,
   parsing errors may be reported). For more details, please refer to
   the comments in the 'multiline_json.test'.
 - Compressed JSON files are not supported yet.
 - Complex types are not supported yet.

Tests
 - Most of the existing end-to-end tests can run on JSON format.
 - Add TestQueriesJsonTables in test_queries.py for testing multiline,
   malformed, and overflow in JSON.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Reviewed-on: http://gerrit.cloudera.org:8080/19699
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-09-05 16:55:41 +00:00
Fredy Wijaya
4b62812995 [tools] Add Dev Container support for Impala development.
Currently only VS Code is supported, since IntelliJ/CLion support for
Dev Containers is still in beta at the time of this writing.

To use it, simply open the Impala source code:

$ git clone https://github.com/apache/impala.git
$ cd impala
$ code .

The bootstrap_development.sh script will be automatically executed after
the Docker container is created, and all necessary extensions for an
IDE-like experience will be automatically installed. For C++, it uses
clangd with a compilation database instead of the Microsoft C++
extension, since clangd works better with Clang-related tools.

Change-Id: I50508a09710641ec2a299b001fef3e7fefb0b7d5
Reviewed-on: http://gerrit.cloudera.org:8080/20380
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Quanlong Huang <huangquanlong@gmail.com>
2023-08-23 12:30:14 +00:00
Laszlo Gaal
ee069687fc IMPALA-12212: Bump Maven to 3.9.2, pull dependencies in parallel
Maven 3.9.x offers a new dependency resolver, HttpClient, which allows
downloading project dependencies in parallel.

This patch bumps the Maven version installed by bootstrap_system.sh to
v3.9.2, and adds the flags enabling the new resolver to download
dependencies (including POM files) in parallel. Parallelism is set to
10 threads.

The flags are added to a project-specific Maven setting file in the
newly created java/.mvn directory. The settings file is added to the
RAT exclusion list in bin/rat_exclude_files.txt.

The --show-version flag is added for debugging purposes.

The same flags are added to the JAMM subproject as well.

The new resolver in Maven 3.9 has also changed the warning message
emitted for missing component checksums, so the new warning string
is added to the filter in bin/mvn-quiet.sh
Unfortunately Maven 3.9 has also changed the way it responds to missing
checksum files: the resolver now emits a stack trace when checksums
cannot be determined, and missing checksums are not explicitly ignored.

Detailed documentation for the new Maven resolver in Maven 3.9.0+ is
located at:
https://maven.apache.org/guides/mini/guide-resolver-transport.html
resolver configuration reference:
https://maven.apache.org/resolver/configuration.html

Tests:
- verified in a core-mode test run with Maven 3.9.2 installed
- verified in a local build using an earlier version of Maven
  to verify that the new default setting does not cause regressions
  with the old dependency resolver.

Change-Id: I75d05215effc724f5bd471646fb352f37443e185
Reviewed-on: http://gerrit.cloudera.org:8080/20142
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
2023-07-24 18:50:34 +00:00
stiga-huang
8d0ab2b684 IMPALA-10262: RPM/DEB Packaging Support
This patch is based on a previous patch contributed by Shant Hovsepian:
https://gerrit.cloudera.org/c/16612/

It adds a new option, -package, to buildall.sh for building a package
for the current OS type (e.g. CentOS/Ubuntu). You can also use
"make/ninja package" to build the package. Scripts for launching the
services and the required configuration files are also added.

Tests:
 - Built on Ubuntu 18.04/20.04 and CentOS 7 using
   ./buildall.sh -noclean -skiptests -release -package
 - Deployed the RPM package on a CDP cluster. Verified the scripts.
 - Deployed the DEB package on a docker container. Verified the scripts.

Change-Id: I64419fd400fe8d233dac016b6306157fe9461d82
Reviewed-on: http://gerrit.cloudera.org:8080/18939
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-07-16 11:13:23 +00:00
Surya Hebbar
36a63d0a33 IMPALA-12182: Add CPU utilization chart for RuntimeProfile's sampled metrics
This change adds support for a stacked area chart for CPU utilization
in the query timeline display, while also providing the ability to
scale timetick values and precision, and to horizontally scale the
fragment timing diagram along with the utilization chart.

Rendering of the different components within the diagram has been
decoupled to isolate scaling of timeticks, and to improve overall
efficiency by making the rendering functions asynchronous for better
performance during resize events. Additionally, re-rendering of the
fragment diagram is only triggered on new fragment events.

The following are the associated key bindings to scale the timeline
with mouse wheel events.
- shift + wheel events on #fragment_diagram
- shift + wheel events on #timeticks_footer
- alt + shift + wheel events on #timeticks_footer for precision control

Note:
Ctrl + mouse wheel events and ctrl + '+'/'-' events can be used to
resize the timeline through the browser.

Mouse wheel events have been associated with respective components
for better efficiency and maintainability.

Constraints have been added to above attributes to limit scaling/zooming
for appropriate display and rendering across all DOM elements.

The RESOURCE_TRACE_RATIO query option provides the utilization values to
be traced within the RuntimeProfile. The profile then contains samples
of CPU utilization metrics for user, sys and iowait. These time series
counters are available within the profile under the following names.

Per Node Profiles -
  - HostCpuIoWaitPercentage
  - HostCpuSysPercentage
  - HostCpuUserPercentage

The samples are updated based on 'periodic_counter_update_period_ms'
providing the 'period' within profile's 'Per Node Profiles'.

These are retrieved from the ChunkedTimeSeriesCounter in the
RuntimeProfile. Currently, JSON profiles and webUI summary pages
contain the downsampled values.

Utilization samples are aligned with the fragment diagram by
associating the number of samples and the period.

Aggregate CPU usage for each node is calculated by accumulating
the basis point values for user, sys and iowait. These are displayed
as a stacked line chart after grouping the associated counters for
each node.
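
As a rough illustration (not part of this patch), combining the per-node
basis-point samples into stacked percentages aligned on the sampling
period could look like the following Python sketch; the counter values
and the 'period' are the ones described above, everything else is made
up:

  def stack_cpu_samples(user_bp, sys_bp, iowait_bp, period_ms):
      """Each argument is a list of basis-point samples (1/100th of a percent)."""
      series = []
      for i, (u, s, w) in enumerate(zip(user_bp, sys_bp, iowait_bp)):
          series.append({
              "time_ms": i * period_ms,      # sample position given the period
              "user": u / 100.0,             # basis points -> percent
              "sys": s / 100.0,
              "iowait": w / 100.0,
              "total": (u + s + w) / 100.0,  # aggregate CPU usage for the node
          })
      return series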

The c3.js charting library, which is based on d3 v5, is used to plot
the utilization.

The license associated with d3 v5 during the related time frame
has been included along with the charting library's.

Support for experimental profile V2 is currently not included.

Scaling a large number of values to support profile V2 would be
possible with appropriate down-sampling in the back-end.

Testing: Manual testing with TPC-DS and TPC-H queries

Change-Id: Idea2a6db217dbfaa7a0695aeabb6d9c1ecf62158
Reviewed-on: http://gerrit.cloudera.org:8080/20008
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-07-11 00:06:24 +00:00
Michael Smith
3b0705ba63 IMPALA-11941: Support Java 17 in Impala
Enables building for Java 17 - and particularly using Java 17 in
containers - but won't run a minicluster fully with Java 17 as some
projects (Hadoop) don't yet support it.

Starting with Java 15, ehcache.sizeof encounters
"UnsupportedOperationException: can't get field offset on a hidden class"
for class members pointing to capturing lambda functions. Java 17 also
introduces new modules that need to be added to add-opens. Both of these
pose problems for continued use of ehcache.

Adds https://github.com/jbellis/jamm as a new cache weigher for Java
15+. We build from HEAD as an external project until Java 17 support is
released (https://github.com/jbellis/jamm/issues/44). Adds the
'java_weigher' option to select 'sizeof' or 'jamm'; defaults to 'auto',
which uses jamm for Java 15+ and sizeof for everything else. Also adds
metrics for viewing cache weight results.

Adds JAVA_HOME/lib/server to LD_LIBRARY_PATH in run-jvm-binary to
simplify switching between JDK versions for testing. You can now
- export IMPALA_JDK_VERSION=11
- source bin/impala-config.sh
- start-impala-cluster.py
and have Impala running a different JDK (11) version.

Retains add-opens calls that are still necessary due to dependencies'
use of lambdas for jamm, and all others for ehcache. Add-opens are still
required as a fallback, as noted in
https://github.com/jbellis/jamm#object-graph-crawling. We catch the
exceptions jamm and ehcache throw - CannotAccessFieldException,
UnsupportedOperationException - to avoid crashing Impala, and add the
resulting message to the list of banned log messages (so that we add the
missing add-opens when we find them).

Testing:
- container test run with Java 11 and 17 (excludes custom cluster)
- manual custom_cluster/test_local_catalog.py +
  test_banned_log_messages.py run with Java 11 and 17 (Java 8 build)
- full Java 11 build (passed except IMPALA-12184)
- add test catalog cache entry size metrics fit reasonable bounds
- add unit test for utility to find jamm jar file in classpath

Change-Id: Ic378896f572e030a3a019646a96a32a07866a737
Reviewed-on: http://gerrit.cloudera.org:8080/19863
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-06-24 10:11:54 +00:00
jasonmfehr
63d13a35f3 IMPALA-11880: Adds support for authenticating to Impala using JWTs.
This support was modeled after the LDAP authentication.

If JWT authentication is used, the Impala shell enforces the use of the
hs2-http protocol since the JWT is sent via the "Authentication"
HTTP header.

The following flags have been added to the Impala shell:
* -j, --jwt: indicates that JWT authentication will be used
* --jwt_cmd: shell command to run to retrieve the JWT to use for
  authentication
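
As a rough illustration of the --jwt_cmd semantics described above (the
command and path below are made up, not taken from this patch), the shell
runs the given command and uses its stdout as the token; in Python terms:

  import subprocess

  # Hypothetical stand-in for what --jwt_cmd does: run a user-supplied
  # command and treat its trimmed stdout as the JWT to send over hs2-http.
  jwt_cmd = "cat /tmp/test.jwt"  # any command that prints a JWT would do
  token = subprocess.check_output(jwt_cmd, shell=True).decode().strip()
  print("token prefix:", token[:20])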

Testing:
New Python tests have been added:
* The shell tests ensure that the various command line arguments are
  handled properly. Situations such as allowing only a single
  authentication method and refusing to send JWTs in clear text
  without the proper arguments are asserted.
* The Python custom cluster tests leverage a test JWKS and test JWTs.
  Then, a custom Impala cluster is started with the test JWKS. The
  Impala shell attempts to authenticate using a valid JWT, an expired
  (invalid) JWT, and a valid JWT signed by a different, untrusted JWKS.
  These tests also exercise the Impala JWT authentication mechanism and
  assert the prometheus JWT auth success and failure metrics are
  reported accurately.

Change-Id: I52247f9262c548946269fe5358b549a3e8c86d4c
Reviewed-on: http://gerrit.cloudera.org:8080/19837
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-05-11 23:22:05 +00:00
Gergely Farkas
490dd7b115 IMPALA-11726: Allow LDAP user and group filter when Kerberos is enabled
This change does two things for the Kerberos authentication support
for impala clients:

1) Introduces the allow_custom_ldap_filters_with_kerberos_auth flag,
which removes the restriction that prevents the use of LDAP group/user
search filters when Kerberos authentication is enabled. When the flag
is set, both Kerberos and LDAP can work with impala clients
(impala-shell, jdbc, odbc, impyla) even if the group/user filters are
defined. The flag default value is false, which ensures backwards
compatibility.

2) Introduces enable_group_filter_check_for_authenticated_kerberos_user
flag, which allows group filters to be applied for non-proxy users
that belong to the authenticated Kerberos principals.
The verified username comes from the Kerberos principal: The username
is the first member of the authenticated Kerberos principal, where the
principal can be username/host@realm or username@realm.
Regardless of whether the flag is enabled or not, LDAP filters are not
applied for authorized proxy users (neither when using LDAP nor when
using Kerberos authentication). In case of delegation, filters are
applied for delegated users.
This flag makes sense if both Kerberos and LDAP authentication are enabled
and the users in the KDC and LDAP are synchronized (e.g. Active
Directory provides both LDAP and Kerberos authentication).
The flag default value is false, which ensures backwards compatibility.
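
The rules above for when the LDAP filters are applied can be summarized
in a small sketch (simplified Python pseudologic, not the actual
implementation; the last argument abbreviates the
enable_group_filter_check_for_authenticated_kerberos_user flag):

  def ldap_filters_apply(is_proxy_user, is_delegated_user, is_kerberos_user,
                         group_filter_check_for_kerberos_user):
      # Filters are never applied for authorized proxy users, regardless of
      # whether LDAP or Kerberos authentication was used.
      if is_proxy_user:
          return False
      # In case of delegation, filters are applied for the delegated user.
      if is_delegated_user:
          return True
      # For non-proxy Kerberos-authenticated users, filters are applied
      # only when the new flag is enabled.
      if is_kerberos_user:
          return group_filter_check_for_kerberos_user
      # LDAP-authenticated users always go through the configured filters.
      return True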

Notes:

If the allow_custom_ldap_filters_with_kerberos_auth flag is disabled,
it is still possible to use LDAP and Kerberos authentication together,
but in a limited way: Only LDAP search bind authentication mode can be
used, with the default user and group search filters (which are
defined for the Active Directory LDAP schema). One major limitation here
- apart from the AD directory schema assumed in the default filters -
is that the only way to control user access is to select the
appropriate user and group search base dn (e.g. granting LDAP access
to users/groups defined in a given subtree).
Even in this edge case, it is still allowed to enable the
enable_group_filter_check_for_authenticated_kerberos_user flag. If this
happens, then the default filters in LDAP search bind will be applied
for Kerberos authenticated non-proxy users.

Another edge case is when LDAP authentication is enabled,
user access is controlled by custom LDAP filters (LDAP auth only),
and external Kerberos authentication is also enabled, but the users
in the KDC and LDAP are not in sync:
In this case the allow_custom_ldap_filters_with_kerberos_auth flag must
be set, but the enable_group_filter_check_for_authenticated_kerberos_user
flag should be disabled, otherwise an unauthorized response may be
received during Kerberos authentication (depending on whether the
authenticated Kerberos user passes the custom LDAP filters or not).
In such cases, access for Kerberos users must be controlled by other
means (e.g. within a FreeIPA KDC with host-based access control rules).

Tests:
- New unit test created to check the behavior of AuthManager with
  and without allow_custom_ldap_filters_with_kerberos_auth flag.
- New custom cluster tests created:
  - impala-shell tests that validate existing LDAP search bind
    and simple bind functionality with Kerberos authentication
    enabled (LdapSearchBindImpalaShellTest and
    LdapSimpleBindImpalaShellTest suites are now parameterized),
  - impala-shell tests that validate backwards compatibility
    when allow_custom_ldap_filters_with_kerberos_auth flag and
    enable_group_filter_check_for_authenticated_kerberos_user
    flags are disabled
    (LdapSearchBindDefaultFiltersKerberosImpalaShellTest)
  - various impala-shell tests that validate Kerberos
    authentication in an environment where LDAP authentication
    is also enabled (LdapKerberosImpalaShellTest)
- Manual tests with a snapshot build in CDP PVC DS with LDAP and
  Kerberos authentication enabled, user and group filters provided.

Change-Id: If3ca9c4ff8a17167e5233afabdd14c948edb46de
Reviewed-on: http://gerrit.cloudera.org:8080/19561
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-03-10 19:49:30 +00:00
Joe McDonnell
566df80891 IMPALA-11959: Add Python 3 virtualenv
This adds a Python 3 equivalent to the impala-python
virtualenv based on the toolchain Python 3.7.16.
This modifies bootstrap_virtualenv.py to support
the two different modes. This adds py2-requirements.txt
and py3-requirements.txt to allow some differences
between the Python 2 and Python 3 virtualenvs.

Here are some specific package changes:
 - allpairs is replaced with allpairspy, as allpairs did
   not support Python 3 (a usage sketch follows this list).
 - requests is upgraded slightly, because otherwise it has issues
   with idna==2.8.
 - pylint is limited to Python 3, because we are adding it
   and don't need it on both.
 - flake8 is limited to Python 2, because it will take
   some work to switch to a version that works on Python 3.
 - cm_api is limited to Python 2, because it doesn't support
   Python 3.
 - pytest-random does not support Python 3 and it is unused,
   so it is removed.
 - Bump the version of setuptools-scm to support Python 3.
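
A minimal usage sketch for the allpairspy replacement mentioned above
(the parameter values below are made up; only the AllPairs iteration API
is real):

  from allpairspy import AllPairs

  # Generate pairwise combinations of test dimensions, as the old
  # "allpairs" package used to do for parameterized tests.
  parameters = [
      ["parquet", "text", "avro"],   # hypothetical file formats
      ["none", "snappy"],            # hypothetical compression codecs
      [True, False],                 # hypothetical boolean dimension
  ]
  for i, pairs in enumerate(AllPairs(parameters)):
      print(i, pairs)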

This adds impala-pylint, which can be used to do further
Python 3 checks via --py3k. This also adds a bin/check-pylint-py3k.sh
script to enforce specific py3k checks. The banned py3k warnings
are specified in bin/banned_py3k_warnings.txt. The file is currently
empty, but it allows ratcheting up the py3k strictness over time
to avoid regressions.

This pulls in a new toolchain with the fix for IMPALA-11956
to get Python 3.7.16.

Testing:
 - Hand tested that the allpairs libraries produce the
   same results
 - The python3 virtualenv has no influence on regular
   tests yet

Change-Id: Ica4853f440c9a46a79bd5fb8e0a66730b0b4efc0
Reviewed-on: http://gerrit.cloudera.org:8080/19567
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
a9cfc7b33f IMPALA-11624: Bump Impyla dependency to 0.18.0
IMPALA_THRIFT_PY_VERSION is also bumped to 0.16.0p3.
Since the 0.16.0p3 Thrift build does not contain Python-related
patches and Impyla 0.18.0 depends on Thrift 0.16.0, we are now
consistently using Thrift 0.16.0 in all Python code. This also
bumps the Thrift in the shell's ext-py directory to 0.16.0 (based
on the Thrift 0.16.0 pypi tarball with the egg directory removed).

Testing:
 - Ran a GVO job

Change-Id: I7265558b0e07959c606cba73cd251c3edfcb3ed5
Reviewed-on: http://gerrit.cloudera.org:8080/18456
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-02-27 20:39:26 +00:00
Michael Smith
feb4a76ed4 IMPALA-11913: Upgrade datatables to 1.13.2
Upgrades datatables from datatables.net to the latest available version
to address XSS and prototype pollution issues with 1.10.18.

Testing:
- clicked around to all the UI pages

Change-Id: I323fd06da003789485d340eaa25d4ab79a7f3ece
Reviewed-on: http://gerrit.cloudera.org:8080/19489
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2023-02-14 22:35:56 +00:00
Michael Smith
16190b4f77 IMPALA-11737: Update sasl to 0.3.1 for Python 3.10
sasl 0.2.1 fails to build with Python 3.10. Updates to sasl 0.3.1 for
Python 3.10 compatibility.

Testing:
- built under Python 3.8
- automated tests will test with built bundle and pip install using
current Python version
- pip3 installed shell/build/dist on Ubuntu 22.04 with Python 3.10

Change-Id: I6b522f2b8cb5546150cd3274c7670a6ca9b8ff63
Reviewed-on: http://gerrit.cloudera.org:8080/19265
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
2022-11-28 17:16:42 +00:00
Tamas Mate
e1e92da796 IMPALA-11676: Prettify asf-site docs
This commit refactors the docs build script/Makefile and adds a new
build option. The available options are:
 - plain-html: the plain html docs, without css and navigation bar; this
was "the" html build before this change.
 - asf-site-html: html docs, with css and navigation bar.
 - pdf

The css comes from the DITA project's documentation.

Testing:
 - Built the docs and tested the pages manually.

Change-Id: Ic9621cb0abaa7fd9bf445da08440c0f6a9788180
Reviewed-on: http://gerrit.cloudera.org:8080/19242
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-11-16 20:26:31 +00:00
Riza Suminto
ea6173440c IMPALA-11669: Set TConfiguration during Thrift connection setup
THRIFT-5237 implemented MAX_MESSAGE_SIZE and consolidated limits into a
TConfiguration class. MAX_MESSAGE_SIZE defaults to 100MB. This patch
adds the backend flag 'thrift_rpc_max_message_size' to override the
default MAX_MESSAGE_SIZE by calling the function
AssignDefaultTConfiguration. We set a higher default, 1GB, to minimize
interruption of existing Impala workloads. We should consider lowering
this default once we ensure that all thrift rpc responses can be made in
batches, as explained in IMPALA-11402.

Thrift tuning for communication with HMS is fixed through HIVE-26633.
Appropriate Hive version bump is required.

Testing:
- Add EXPECT_NO_THROW to verify that 'checkReadBytesAvailable' does
  not throw an exception given the default FLAGS_thrift_rpc_max_message_size.
- Add MaxMessageSizeFit and MaxMessageSizeExceeded tests in
  thrift-server-test.
- Run and pass thrift-server-test

Change-Id: I137683d43c72a34105fd7b32fea3a93532601ae3
Reviewed-on: http://gerrit.cloudera.org:8080/19162
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-10-28 02:29:21 +00:00
Csaba Ringhofer
7ca11dfc7f IMPALA-9482: Support for BINARY columns
This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows adding special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can only be used by casting to STRING first.
Adding support for more of these functions is very easy, as the BINARY
type simply has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY they always work as if utf8_mode=0 (for
example length() is mapped to bytes(), as length() counts utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in the case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitly to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as closely to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example of the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.
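
A small illustration of that casting rule through impyla (the table name,
host and port are made up; only the explicit CAST requirement and the
length() support come from this change):

  from impala.dbapi import connect

  conn = connect(host="localhost", port=21050)
  cur = conn.cursor()
  cur.execute("CREATE TABLE IF NOT EXISTS binary_demo (b BINARY)")
  # A bare string literal is not accepted here; the explicit cast is
  # required in Impala, unlike in Hive.
  cur.execute("INSERT INTO binary_demo VALUES (CAST('abc' AS BINARY))")
  cur.execute("SELECT length(b) FROM binary_demo")  # length() supports BINARY
  print(cur.fetchall())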

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added for the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Reviewed-on: http://gerrit.cloudera.org:8080/16066
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-19 13:55:42 +00:00
Michael Smith
64b324ac40 IMPALA-11389: Include Python 3 eggs in tarball
Build Python 3 eggs for the shell tarball so it works with both Python 2
and Python 3. The impala-shell script selects eggs based on the
available Python version.

Inlines thrift for impala-shell so we can easily build Python 2 and
Python 3 versions, consistent with other libraries. The impala-shell
version should always be at least as new as IMPALA_THRIFT_PY_VERSION.

Thrift 0.13.0+ wraps all exceptions during TSocket read/write operations
in TTransportException. Specifically, socket.error exceptions that we
previously got raw are now wrapped. We unwrap them before raising to
preserve prior behavior.

A specific Python version can be selected with IMPALA_PYTHON_EXECUTABLE;
otherwise it will use 'python', and if unavailable try 'python3'.

Adds tests for impala-shell tarball with Python 3.

Change-Id: I94f86de9e2a6303151c2f0e6454b5f629cbc9444
Reviewed-on: http://gerrit.cloudera.org:8080/18653
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-07-14 23:52:04 +00:00
wzhou-code
b867f4c4f1 IMPALA-10745 (part 2): Support Kerberos over HTTP for impala-shell
This patch adds the kerberos-1.3.1 Python module to shell/ext-py so that
the egg file of the Kerberos module is built and added to the impala-shell
tarball when running the shell/make_shell_tarball.sh script.
The Kerberos Python module is distributed under Apache License Version 2.
Its source distribution is available at:
https://pypi.org/project/kerberos/

Testing:
 - Passed core run.
 - Installed impala-shell from impala-shell tarball on dev box as
   standalone package. Verified that impala-shell could be run without
   additional configurations.
 - Installed impala-shell from impala-shell tarball on a real cluster
   with a full Kerberos setup. Verified that impala-shell could
   connect to impala server with options "-k --protocol=hs2-http".

Change-Id: Id34074cbe725ba2cf1407fcf59e00475cd417a6d
Reviewed-on: http://gerrit.cloudera.org:8080/18523
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-05-15 21:46:06 +00:00
Fucun Chu
4186727fe6 IMPALA-10871: Add MetastoreShim to support Apache Hive 3.1.2
Like IMPALA-8369, this patch adds a compatibility shim in fe so that
Impala can interoperate with Hive 3.1.2. It adds a new
MetastoreShim class under the compat-apache-hive-3 directory. These shim
classes implement methods which differ between cdp-hive-3 and
apache-hive-3 and are used by frontend code. At build time, based on
the environment variable IMPALA_HIVE_DIST_TYPE, one of the two shims
is added as a source using the fe/pom.xml build plugin.

Some code that directly uses Hive 4 APIs needs to be excluded from
compilation, e.g. fe/src/main/java/org/apache/impala/catalog/metastore/.
A Maven profile is used to exclude this code; the profile is
automatically activated based on IMPALA_HIVE_DIST_TYPE.

Testing:
1. Code compiles and runs against both HMS-3 and ASF-HMS-3
2. Ran full-suite of tests against HMS-3
3. Running full-tests against ASF-HMS-3 will need more work
supporting Tez in the mini-cluster (for dataloading) and HMS
transaction support. This will be an on-going effort, and test failures
on ASF-Hive-3 will be fixed in additional sub-tasks.

Notes:
1. Patch uses a custom build of Apache Hive to be deployed in
mini-cluster. This build has the fixes for HIVE-21569, HIVE-20038.
This hack will be added to the build script in additional sub-tasks.

Change-Id: I9f08db5f6da735ac431819063060941f0941f606
Reviewed-on: http://gerrit.cloudera.org:8080/17774
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-02-27 06:36:19 +00:00
Andrew Sherman
b96439f680 IMPALA-11078 Add simple CSP header to webui.
Content Security Policy (CSP) is a computer security standard designed
to prevent cross-site scripting, clickjacking and other code injection
attacks. CSP provides a method for websites to declare approved origins
of content that browsers should be allowed to load on that website.
A good resource is https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP
If a page breaks the rules then the included script or css will
typically not be run by the browser.

In the Impala webui we use a CSP header to declare that all web content
comes from the impalad, with some 'unsafe' inline code.

A new server flag "--disable_content_security_policy_header=true" can be
set to disable the emission of this header in case of any compatibility
issues.

A few small changes were needed to make this CSP header work. Chart.js
was previously included via http; this was changed to bundling it
like the other javascript and css we use. Some dodgy array code that
handles connection metrics was also fixed.
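
A quick way to manually confirm the header is emitted (not part of this
patch; 25000 is the default impalad debug webserver port):

  import requests

  # Fetch the root debug page and print the CSP header, if present.
  resp = requests.get("http://localhost:25000/")
  print(resp.headers.get("Content-Security-Policy"))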

TESTING:
  The main webui tests all now validate the CSP header is present.
  A test for the new flag is also added.

Change-Id: Idc335d65b117661da0b420ddb7c9ccd80d8d76ab
Reviewed-on: http://gerrit.cloudera.org:8080/18168
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-01-25 22:52:50 +00:00
Fang-Yu Rao
351e037472 IMPALA-10934 (Part 2): Enable table definition over a single file
This patch adds an end-to-end test to validate and characterize HMS'
behavior with respect to external table creation after HIVE-25569 via
which a user is allowed to create an external table associated with a
single file.

Change-Id: Ia4f57f07a9f543c660b102ebf307a6cf590a6784
Reviewed-on: http://gerrit.cloudera.org:8080/18033
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
2022-01-05 03:32:11 +00:00
wzhou-code
9e76a8f7c3 IMPALA-10784 (part 3): Prepare to publish impala-shell on PyPi
We are going to publish impala-shell release 4.1.0a1 on PyPi.
This patch upgrades the following three Python libraries, which are used
for generating egg files when building the impala-shell tarball:
  upgrade bitarray from 1.2.1 to 2.3.0
  upgrade prettytable from 0.7.1 to 0.7.2
  upgrade thrift_sasl from 0.4.2 to 0.4.3
It also updates shell/packaging/requirements.txt with the versions of the
dependent Python libraries.

Testing:
 - Ran core tests.
 - Built impala-shell package impala_shell-4.1.0a1.tar.gz, installed
   impala-shell package from local impala_shell-4.1.0a1.tar.gz, verified
   impala-shell was installed in ~/.local/lib/python2.7/site-packages.
   Verified the version of installed impala-shell and dependent Python
   libraries as expected.
 - Set IMPALA_SHELL_HOME as ~/.local/lib/python2.7/site-packages/
   impala_shell, copied over egg files under installed impala-shell
   python package so we can run the end-to-end unit tests against
   the impala-shell installed with the package downloaded from PyPi.
   Passed end-to-end impala-shell unit tests.
 - Verified the impala-shell tarball generated by
   shell/make_shell_tarball.sh.

Change-Id: I378404e2407396d4de3bb0eea4d49a9c5bb4e46a
Reviewed-on: http://gerrit.cloudera.org:8080/17826
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-09-28 04:55:57 +00:00
wzhou-code
03a7a59f5d IMPALA-10876: Support to download JWKS from given URL
This patch added functionality to download JWKS from a given URL and
support key rotation by periodically checking the JWKS URL for updates.
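
For reference, the JWKS fetched from such a URL is plain JSON; a rough
Python sketch of inspecting one (the URL is a placeholder; the daemon
itself does this in C++ as described below):

  import requests

  # Download a JWKS and list its keys; the daemon repeats this
  # periodically to pick up rotated keys.
  jwks = requests.get("https://idp.example.com/jwks.json").json()
  for key in jwks.get("keys", []):
      print(key.get("kid"), key.get("kty"), key.get("alg"))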

We use Kudu's EasyCurl wrapper to download the file from the given URL.
curl was added to native-toolchain. This patch modified makefiles
and bootstrap_toolchain.py to integrate libcurl and libkudu_curl_util.

Added end-to-end JWT authentication test cases with the JWKS specified
as an HTTP/HTTPS URL.

Testing:
 - Passed core run, including new test cases.

Change-Id: Ic6ac8cf0010c13db30219776d1d275709bf211df
Reviewed-on: http://gerrit.cloudera.org:8080/17802
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-09-28 04:45:23 +00:00
wzhou-code
025500ccb5 IMPALA-10489: Implement JWT support
This patch added JWT support with the following functionality:
 * Load and parse JWKS from pre-installed JSON file.
 * Read the JWT token from the HTTP Header.
 * Verify the JWT's signature with public key in JWKS.
 * Get the username out of the payload of the JWT token.
 * Support following JSON Web Algorithms (JWA):
   HS256, HS384, HS512, RS256, RS384, RS512.

We use the third party library jwt-cpp to verify the JWT token. jwt-cpp
is a header-only C++ library. It was added to native-toolchain.
This patch modified bootstrap_toolchain.py to download jwt-cpp from the
toolchain s3 bucket, and modified makefiles to add jwt-cpp/include
to the include path.

Added BE unit-tests for loading JWKS file and verifying JWT token.
Also added FE custom cluster test for JWT authentication.

Testing:
 - Passed core run.

Change-Id: I6b71fa854c9ddc8ca882878853395e1eb866143c
Reviewed-on: http://gerrit.cloudera.org:8080/17435
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-07-08 23:10:32 +00:00
Daniel Becker
817ca5920d IMPALA-10640: Support reading Parquet Bloom filters - most common types
This change adds read support for Parquet Bloom filters for types that
can reasonably be supported in Impala. Other types, such as CHAR(N),
would be very difficult to support because the length may be different
in Parquet and in Impala, which results in truncation or padding, and
that changes the hash, which makes using the Bloom filter impossible.
Write support will be added in a later change.
The supported Parquet type - Impala type pairs are the following:

 ---------------------------------------
|Parquet type |  Impala type            |
|---------------------------------------|
|INT32        |  TINYINT, SMALLINT, INT |
|INT64        |  BIGINT                 |
|FLOAT        |  FLOAT                  |
|DOUBLE       |  DOUBLE                 |
|BYTE_ARRAY   |  STRING                 |
 ---------------------------------------

The following types are not supported for the given reasons:

 ----------------------------------------------------------------
|Impala type |  Problem                                          |
|----------------------------------------------------------------|
|VARCHAR(N)  | truncation can change hash                        |
|CHAR(N)     | padding / truncation can change hash              |
|DECIMAL     | multiple encodings supported                      |
|TIMESTAMP   | multiple encodings supported, timezone conversion |
|DATE        | not considered yet                                |
 ----------------------------------------------------------------

Support may be added for these types later, see IMPALA-10641.

If a Bloom filter is available for a column that is fully dictionary
encoded, the Bloom filter is not used as the dictionary can give exact
results in filtering.
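
The resulting row group pruning decision can be summarized in a small
sketch (simplified Python pseudologic, not Impala's implementation; the
ColumnChunkInfo type and its fields are made up for illustration):

  from dataclasses import dataclass
  from typing import Callable, Optional, Set

  @dataclass
  class ColumnChunkInfo:
      fully_dictionary_encoded: bool
      dictionary_values: Set[object]
      bloom_might_contain: Optional[Callable[[object], bool]] = None

  def can_skip_row_group(value, col: ColumnChunkInfo) -> bool:
      # A fully dictionary-encoded column gives exact answers, so the
      # Bloom filter is not consulted at all.
      if col.fully_dictionary_encoded:
          return value not in col.dictionary_values
      # A negative Bloom filter lookup proves the value is absent.
      if col.bloom_might_contain is not None:
          return not col.bloom_might_contain(value)
      return False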

Testing:
  - Added tests/query_test/test_parquet_bloom_filter.py that tests
    whether Parquet Bloom filtering works for the supported types and
    that we do not incorrectly discard row groups for the unsupported
    type VARCHAR. The Parquet file used in the test was generated with
    an external tool.
  - Added unit tests for ParquetBloomFilter in file
    be/src/util/parquet-bloom-filter-test.cc
  - A minor, unrelated change was done in
    be/src/util/bloom-filter-test.cc: the MakeRandom() function had
    return type uint64_t and its documentation claimed it returned a 64 bit
    random number, but the actual number of random bits is 32, which is
    what is intended in the tests. The return type and documentation
    have been corrected to use 32 bits.

Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Reviewed-on: http://gerrit.cloudera.org:8080/17026
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>
2021-06-03 06:32:45 +00:00