122 Commits

Author SHA1 Message Date
jichen0919
826c8cf9b0 IMPALA-14081: Support create/drop paimon table for impala
This patch mainly implements the creation and dropping of Paimon tables
through Impala.

Supported Impala data types:
- BOOLEAN
- TINYINT
- SMALLINT
- INTEGER
- BIGINT
- FLOAT
- DOUBLE
- STRING
- DECIMAL(P,S)
- TIMESTAMP
- CHAR(N)
- VARCHAR(N)
- BINARY
- DATE

Syntax for creating paimon table:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
(
[col_name data_type ,...]
[PRIMARY KEY (col1,col2)]
)
[PARTITIONED BY (col_name data_type [COMMENT 'col_comment'], ...)]
STORED AS PAIMON
[LOCATION 'hdfs_path']
[TBLPROPERTIES (
'primary-key'='col1,col2',
'file.format' = 'orc/parquet',
'bucket' = '2',
'bucket-key' = 'col3'
)];

Two types of Paimon catalogs are supported.

(1) Create table with hive catalog:

CREATE TABLE paimon_hive_cat(userid INT,movieId INT)
STORED AS PAIMON;

(2) Create table with hadoop catalog:

CREATE [EXTERNAL] TABLE paimon_hadoop_cat
STORED AS PAIMON
TBLPROPERTIES('paimon.catalog'='hadoop',
'paimon.catalog_location'='/path/to/paimon_hadoop_catalog',
'paimon.table_identifier'='paimondb.paimontable');

SHOW TABLE STATS, SHOW COLUMN STATS, SHOW PARTITIONS, and SHOW FILES
statements are also supported.
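
For example, assuming a Paimon table named paimon_tbl (a placeholder
name; SHOW PARTITIONS applies only if the table is partitioned), these
might be invoked as:

SHOW TABLE STATS paimon_tbl;
SHOW COLUMN STATS paimon_tbl;
SHOW PARTITIONS paimon_tbl;
SHOW FILES IN paimon_tbl;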

TODO:
    - Patches pending submission:
        - Query support for paimon data files.
        - Partition pruning and predicate push down.
        - Query support with time travel.
        - Query support for paimon meta tables.
    - WIP:
        - Complex type query support.
        - Virtual column query support for querying
          Paimon data tables.
        - Native Paimon table scanner, instead of the
          JNI-based one.
Testing:
    - Add unit test for paimon impala type conversion.
    - Add unit test for ToSqlTest.java.
    - Add unit test for AnalyzeDDLTest.java.
    - Update default_file_format TestEnumCase in
      be/src/service/query-options-test.cc.
    - Update test case in
      testdata/workloads/functional-query/queries/QueryTest/set.test.
    - Add test cases in metadata/test_show_create_table.py.
    - Add custom test test_paimon.py.

Change-Id: I57e77f28151e4a91353ef77050f9f0cd7d9d05ef
Reviewed-on: http://gerrit.cloudera.org:8080/22914
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2025-09-10 21:24:49 +00:00
stiga-huang
5bdd9c7f39 IMPALA-14227: (Addendum) Add more tests for catalogd HA warm failover
This adds more tests in test_catalogd_ha.py for warm failover.
Refactored _test_metadata_after_failover to run in the following way:
 - Run DDL/DML in the active catalogd.
 - Kill the active catalogd and wait until the failover finishes.
 - Verify the DDL/DML results in the new active catalogd.
 - Restart the killed catalogd.
It accepts two methods as parameters: one to perform the DDL/DML and one
to verify the results. In the last step, the killed catalogd is restarted
so we keep two catalogds running and can merge these tests into a single
test by invoking _test_metadata_after_failover with different method
pairs. This saves some test time.

The following DDL/DML statements are tested:
 - CreateTable
 - AddPartition
 - REFRESH
 - DropPartition
 - INSERT
 - DropTable
After each failover, the table is verified to be warmed up (i.e. loaded).

Also validates flags at startup to make sure enable_insert_events and
enable_reload_events are both set to true when warm failover is enabled,
i.e. --catalogd_ha_reset_metadata_on_failover=false.

Change-Id: I6b20adeb0bd175592b425e521138c41196347600
Reviewed-on: http://gerrit.cloudera.org:8080/23206
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
2025-07-30 00:29:07 +00:00
jasonmfehr
fdad954ce4 IMPALA-13237: [Patch 4 of 5] - Helpers to Visualize OpenTelemetry Traces
Adds helper scripts and configurations to run an OpenTelemetry OTLP
collector and a Jaeger instance. The collector is configured to
receive telemetry data on port 55888 via OTLP-over-http and to
forward traces to a Jaeger-all-in-one container receiving data on
port 4317.

Testing was accomplished by running this setup locally and verifying traces appeared in
the Jaeger UI.

Generated-by: Github Copilot (GPT-4.1)
Change-Id: I198c00ddc99a87c630a6f654042bffece2c9d0fd
Reviewed-on: http://gerrit.cloudera.org:8080/23100
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-07-18 01:33:57 +00:00
stiga-huang
da190f1d86 IMPALA-14074: Warmup metadata cache in catalogd for critical tables
*Background*

Catalogd starts with a cold metadata cache - only the db/table names and
functions are loaded. A table's metadata remains unloaded until queries
are submitted on the table, so the first query suffers from the delay of
loading metadata. There is a flag, --load_catalog_in_background, to let
catalogd eagerly load metadata of all tables even if no queries arrive.
With it, catalogd may load metadata for tables that are possibly never
used, potentially increasing catalog size and consequently memory usage.
So this flag is turned off by default and is not recommended for
production use.

Users do need the metadata of some critical tables to be loaded. Until
that happens, the service is considered not ready, since important
queries might fail with timeouts. When Catalogd HA is enabled, it’s also
required that the standby catalogd has an up-to-date metadata cache to
smoothly take over from the active one when a failover happens.

*New Flags*

This patch adds a startup flag for catalogd to specify a config file
containing the tables whose metadata users want loaded. Catalogd adds
them to the table loading queue in the background when a catalog reset
happens, i.e. at catalogd startup or when a global INVALIDATE METADATA
runs.

The flag is --warmup_tables_config_file. The value can be a path in the
local FS or in remote storage (e.g. HDFS). E.g.
  --warmup_tables_config_file=file:///opt/impala/warmup_table_list.txt
  --warmup_tables_config_file=hdfs:///tmp/warmup_table_list.txt

Each line in the config file can be a fully qualified table name or a
wildcard under a db, e.g. "tpch.*". Catalogd loads the table names at
startup and schedules loading of them after a reset of the catalog. The
scheduling order follows the order in the config file, so important
tables can be put first. Lines starting with "#" or "//" are treated as
comments and ignored.
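
A minimal sketch of what such a config file might look like (the table
names are illustrative):

# Critical tables first
tpcds.store_sales
tpcds.item
// A wildcard loads every table under the db
tpch.*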

Another flag, --keeps_warmup_tables_loaded (defaults to false), is added
to control whether to reload a table after it’s been invalidated, either
explicitly by an INVALIDATE METADATA <table> command or implicitly by
CatalogdTableInvalidator or HMS RELOAD events.

When CatalogdTableInvalidator is enabled with
--invalidate_tables_on_memory_pressure=true, users shouldn’t set
keeps_warmup_tables_loaded to true if the catalogd heap size is not
enough to cache metadata of all these tables. Otherwise, these tables
will keep being loaded and invalidated.

*Catalogd HA Changes*
When Catalogd HA is enabled, the standby catalogd will also reset its
catalog and start loading metadata of these tables after the HA state
(active/standby) is determined. The standby catalogd keeps its metadata
cache up to date by applying HMS notification events. To support a
warmed-up switchover, --catalogd_ha_reset_metadata_on_failover should be
set to false.

*Limitation*
The standby catalogd could still have a stale cache if there are
operations in the active catalogd that don’t trigger HMS notification
events, or if an HMS notification event is not applied correctly. E.g.
adding a new native function generates an ALTER_DATABASE event, but when
applying the event, the db's native function list is not refreshed
(IMPALA-14210). These will be resolved in separate JIRAs.

*Test*
 - Added FE unit tests.
 - Added e2e test for local/hdfs config files.
 - Added e2e test to verify the standby catalogd has a warmed up cache
   when failover happens.

Change-Id: I2d09eae1f12a8acd2de945984d956d11eeee1ab6
Reviewed-on: http://gerrit.cloudera.org:8080/23155
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-07-12 18:50:56 +00:00
Zoltan Borok-Nagy
9c12ef66cc IMPALA-14018: Adding utility scripts to run Lakekeeper in Impala dev environment
This patch adds utility scripts to run Lakekeeper (an open source
Iceberg REST Catalog) in Impala's dev environment. Lakekeeper's HDFS
support is in preview phase, so we are using a preview docker image
for now.

IcebergRESTCatalog's config setup is also refactored: now we only set
"credentials" in the SessionContext if they are provided.

Usage

To start Lakekeeper:
testdata/bin/run-lakekeeper.sh

To stop Lakekeeper:
testdata/bin/stop-lakekeeper.sh

Now you can create schemas and tables via Trino (you need to rebuild the
Trino image for this; TODO: use docker compose for this):

docker stop impala-minicluster-trino
docker rm impala-minicluster-trino
./testdata/bin/build-trino-docker-image.sh
./testdata/bin/run-trino.sh

Then via Trino CLI:
testdata/bin/trino-cli.sh

show catalogs;
create schema iceberg_lakekeeper.trino_db;
use iceberg_lakekeeper.trino_db;
create table trino_t (i int);
insert into trino_t values (35);

After this, you should be able to query the table via Impala:

mkdir /tmp/iceberg_lakekeeper
cp testdata/bin/minicluster_trino/iceberg_lakekeeper.properties /tmp/iceberg_lakekeeper

bin/start-impala-cluster.py --no_catalogd \
    --impalad_args="--catalogd_deployed=false --use_local_catalog=true \
    --catalog_config_dir=/tmp/iceberg_lakekeeper/"

bin/impala-shell.sh

Change-Id: I610f5859f92b2ff82e310f46356e3f118e986b2c
Reviewed-on: http://gerrit.cloudera.org:8080/23141
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-07-11 07:50:16 +00:00
gaurav1086
3781132ef6 IMPALA-13675: OAuth AuthN Support for Impala Shell
This patch adds support for fetching access tokens
from the OAuth server using the OAuth client_id and
client_secret if an access token is not provided.
It covers the client_credentials flow.
The client_secret can either be passed as a file or
entered at a prompt.

Added a test parameter for the Impala shell, oauth_mock_response_cmd,
to mock the OAuth server response; it is only to be used for testing.
Also suppressed the existing hs2_x_forward option from the
shell's --help output.

Testing (Okta OAuth server):
- Added custom_cluster tests in test_shell_jwt_auth.py:
    test_oauth_auth_with_clientid_and_secret_success
    test_oauth_auth_with_clientid_and_secret_failure
- Tested manually by providing --user <user> and
  --oauth_client_secret_cmd="cat password_file.txt"
- Tested manually by providing --user <user> and no
  --oauth_client_secret_cmd, thereby prompting the user
  to enter the client_secret.

Example command: impala-shell.sh -a
--auth_creds_ok_in_clear --protocol="hs2-http"
--oauth_client_id="client_id"
--oauth_client_secret_cmd="cat client_secret.txt"
--oauth_server="dev.us.auth01.com"
--oauth_endpoint="/oauth/token"

Change-Id: I84e26d54f6a53696660728efb239ffd43de4c55d
Reviewed-on: http://gerrit.cloudera.org:8080/22424
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-06-05 21:15:47 +00:00
Mihaly Szjatinya
4837cedc79 IMPALA-10319: Support arbitrary encodings on Text files
As proposed in the Jira, this implements decoding and encoding of text
buffers for Impala/Hive text tables. Given a table with the
'serialization.encoding' property set, similarly to Hive, Impala should
be able to encode inserted data into the specified charset before
saving it into a text file. The opposite decoding operation is
performed when reading data buffers from text files. Both operations
employ the boost::locale::conv library.

Since Hive doesn't encode line delimiters, charsets that would have
delimiters stored differently from ASCII are not allowed.

One difference from Hive is that Impala implements
'serialization.encoding' only as a per-partition serde property, to
avoid the confusion of allowing both serde and table properties (see
the related IMPALA-13748).
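
As a rough sketch (the table, partition and charset here are
hypothetical, and the exact statement depends on the per-partition rule
described above):

ALTER TABLE text_tbl PARTITION (p=1)
SET SERDEPROPERTIES ('serialization.encoding'='ISO-8859-1');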

Note: Due to the pre-created non-UTF-8 files present in the patch,
'gerrit-code-review-checks' was performed locally (see IMPALA-14100).

Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
Reviewed-on: http://gerrit.cloudera.org:8080/22049
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-06-01 21:31:00 +00:00
Surya Hebbar
e419964250 IMPALA-13615: Support row grouping of instances based on fragment names
In the "Fragment Instances" page of a query, even though it is possible
to sort the rows based on the fragment's name, it is difficult to
distinguish between fragments and their instances.

With row grouping based on the fragment's name, it becomes easier to
distinguish one fragment's instances from another's.

The lexicographical sorting of instances can still be done based on
different columns, which splits the fragment's group and orders the rows
lexicographically based only on the column's values.

Row grouping has been implemented using the "RowGroup" extension
for datatables - https://datatables.net/extensions/rowgroup/.

The DataTables library and its extensions have been added under the
"www/datatables" directory.

The DataTables library's license has been updated to match version
1.13.2; it was previously out of date.

The related row grouping extension's license has also been included.

Change-Id: If2b7ed6e2a6d605553242a7db4dbeaa7fcae4606
Reviewed-on: http://gerrit.cloudera.org:8080/22226
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-05-22 05:42:18 +00:00
Joe McDonnell
ea0969a772 IMPALA-11980 (part 2): Fix absolute import issues for impala_shell
Python 3 changed the behavior of imports with PEP328. Existing
imports become absolute unless they use the new relative import
syntax. This adapts the impala-shell code to use absolute
imports, fixing issues where it is imported from our test code.

There are several parts to this:
1. It moves impala shell code into shell/impala_shell.
   This matches the directory structure of the PyPi package.
2. It changes the imports in the shell code to be
   absolute paths (i.e. impala_shell.foo rather than foo).
   This fixes issues with Python 3 absolute imports.
   It also eliminates the need for ugly hacks in the PyPi
   package's __init__.py.
3. This changes Thrift generation to put it directly in
   $IMPALA_HOME/shell rather than $IMPALA_HOME/shell/gen-py.
   This means that the generated Thrift code is rooted in
   the same directory as the shell code.
4. This changes the PYTHONPATH to include $IMPALA_HOME/shell
   and not $IMPALA_HOME/shell/gen-py. This means that the
   test code is using the same import paths as the pypi
   package.

With all of these changes, the source code is very close
to the directory structure of the PyPi package. As long as
CMake has generated the thrift files and the Python version
file, only a few differences remain. This removes those
differences by moving the setup.py / MANIFEST.in and other
files from the packaging directory to the top-level
shell/ directory. This means that one can pip install
directly from the source code. i.e. pip install $IMPALA_HOME/shell

This also moves the shell tarball generation script to the
packaging directory and changes bin/impala-shell.sh to use
Python 3.

This sorts the imports using isort for the affected Python files.

Testing:
 - Ran a regular core job with Python 2
 - Ran a core job with Python 3 and verified that the absolute
   import issues are gone.

Change-Id: Ica75a24fa6bcb78999b9b6f4f4356951b81c3124
Reviewed-on: http://gerrit.cloudera.org:8080/22330
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2025-05-21 15:14:11 +00:00
Zoltan Borok-Nagy
49aaaa2cd5 IMPALA-13927: Fix crash on invalid BINARY data in TEXT tables
BINARY data in text files is expected to be Base64 encoded.
TextConverter::WriteSlot has a bug when decoding Base64:
it does not set the NULL-indicator bit for the slots of
the invalid BINARY values. Therefore Tuple::CopyStrings can
later try to copy invalid StringValue objects.

This patch fixes TextConverter::WriteSlot to set the NULL-indicator
bit in case of Base64 parse errors.

Testing
 * e2e test added

Change-Id: I79b712e2abe8ce6ecfbce508fd9e4e93fd63c964
Reviewed-on: http://gerrit.cloudera.org:8080/22721
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-04-03 13:57:24 +00:00
Surya Hebbar
a68717cac9 IMPALA-13304: Include aggregate instance-level metrics in JSON profile
The JSON representation of the aggregated profile or
`AggregatedRuntimeProfile` excludes some instance-level metrics
(e.g. timeseries counters, event sequences, etc.) in order to keep
the profile size from growing rapidly.

In contrast, the traditional profile contains the associated metrics
from the root profiles (`RuntimeProfile`s) of all instances as well as
the aggregated profile.

The experimental profile only contains the aggregated profile.
Hence, the instance-level metrics are not present.

This patch introduces some of the aggregated instance-level metrics
to the experimental profile's JSON representation without allowing
the profile size to grow rapidly.

It also provides insights into instances with unreported or missing events.

The following attributes have been exposed to the JSON form after grouping
or aggregation.

- Aggregated Event Sequences
- Aggregated Info Strings (i.e. Table Name)

The timestamps across instances for a particular event are grouped
when the number of instances is small and aggregated otherwise,
in order to maintain the profile size and facilitate analysis.

This behavior is controlled by the json_profile_event_timestamp_limit,
which defaults to 5.

For events where the number of timestamps exceeds this limit,
they are grouped into buckets of the same size.

These buckets are spans divided evenly between the minimum and
maximum timestamps for the event. With the default limit of 5,
this results in spans of 20%.

The following aggregates are then calculated for each of these buckets,
to provide a clear and efficient summary of the data.
* Maximum timestamp
* Minimum timestamp
* Average timestamp
* Total no. of instances

The aggregate metrics are calculated with minimal overhead through
assignment to a particular division without the need for sorting,
resulting in a time complexity of O(n) with only two passes through
the entire list of timestamps.

To further optimize performance, the aggregates are computed without
storing each division's timestamps, using only the memory required for
a single value per metric instead of the entire range of values, while
reusing previously allocated vectors.

To efficiently copy the calculated values without internally
reallocating on each insertion, memory is preallocated for each array
of metrics using the RapidJSON library.

In the case of missing events, the timestamps are ordered and aligned
through the analysis of 'label_idxs'. If at least one instance contains
a complete set of events, all instances with missing timestamps are
ordered and aligned efficiently by referencing the reordering of labels.
Otherwise, the initial ordering and alignment are retained.

If any fragment instances report only a subset of events due to failure
or error, such instances are reported and only the unavailable timestamps
are skipped during the aggregate metrics calculation, while leveraging
the available timestamps.

The instances containing missing events are further recorded into the
"unreported_event_instance_idxs" field within the event sequence. These
indexes for instances are based on 'exec_params_' set during execution.
Please refer to IMPALA-13555 for further details.

All of the above logic has been encapsulated into the newly added
`ToJson` method within the `AggEventSequence` struct, prioritizing better
reuse and maintainability.

Structure of the `AggEventSequence` in JSON form -
{
  profile_name : <PLAN_NODE_NAME>,
  num_children : <NUM_CHILDREN>
  node_metadata : <NODE_METADATA_OBJECT>
  event_sequences :
  [{
    events : // An example event
    [{
      label : "Open Started""
      ts_list : [ 2257887941, <other instances' timestamps> ]
       // OR
      ts_stat :
      {
        min : [ 2257887941, ...<other divisions' minimum timestamps> ],
        max : [ 3257887941, ...<other divisions' maximum timestamps> ],
        avg : [ 2757887941, ...<other divisions' average timestamps> ]
        count : [ 2, ... <other counts of divisions' no. of instances> ]
      }
    }, <...other plan node's events>
    ],
    // This field is only included, if there are unreported events
    unreported_event_instance_idxs : [ 3, 5, 0 ]
  }],
  counters : <COUNTERS_OBJECT_ARRAY>,
  child_profiles : <CHILD_PROFILES>
}

Structure of `AggInfoStrings` in JSON form.
{
  profile_name : <PLAN_NODE_NAME>,
  num_children : <NUM_CHILDREN>
  node_metadata : <NODE_METADATA_OBJECT>
  "info_strings" :
  [{
    "key": "<info string's key>",
    "values": [<distinct info string values>]
  }]
  counters : <COUNTERS_OBJECT_ARRAY>,
  child_profiles : <CHILD_PROFILES>
}

Note: In the above structures, unlike a plan node's profile,
a fragment's profile does not contain the 'node_metadata' field.

Added unit tests for the serialization of aggregated metrics -
- Added tests for handling info strings in aggregated JSON profiles
- Introduced AggregatedEventSequenceToJsonTest fixture to validate
  event sequence serialization
- Added random profile generation for varied test conditions
- Covered scenarios for complete and missing events in both aggregated
  and grouped cases
- Ensured correct JSON structure for info strings and event sequences
- Ensured proper timestamp ordering and aggregation logic in serialized
  JSON profiles

Generated the latest expected JSON profile outputs from the
'impala-profile-tool' using the stored impala profile logs.

Added additional tests in tests/observability for profile v2's
JSON output, after inclusion of the new expected JSON profile
formats.

Change-Id: I49e18a7a7e1288e3e674e15b6fc86aad60a08214
Reviewed-on: http://gerrit.cloudera.org:8080/21683
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-02-07 11:54:35 +00:00
Joe McDonnell
aefd1b0920 IMPALA-13551: Produce the shell tarball by pip installing impala-shell
Currently, the shell tarball maintains its own packaging code
and directory layout. This is very complicated and has several Python
packages checked directly into our repository.

To simplify it, this changes the shell tarball to be based on
pip installing the pypi package. Specifically, the new directory
structure for an unpacked shell tarball is:
impala-shell-4.5.0-SNAPSHOT/
  impala-shell
  install_py${PYTHON_VERSION}/
  install_py${ANOTHER_PYTHON_VERSION}/
For example, install_py2.7 is the Python 2.7 pip install of impala-shell.
install_py3.8 is a Python 3.8 pip install of impala-shell. This means
that the impala-shell script simply picks the install for the
specified version of python and uses that pip install directory.
To make this more consistent across different Linux distributions, this
upgrades pip in the virtualenv to the latest.

With this, ext-py and pkg_resources.py can be removed.

This requires rearranging the shell build code. Specifically, this splits
out the code that generates impala_build_version.py so that it can run
before generating the pypi package. The shell tarball now has a dependency
on the pypi package and must run after it.

This builds on Michael Smith's work from IMPALA-11399.

Testing:
 - Ran shell tests locally
 - Built on Centos 7, Redhat 8 & 9, Ubuntu 20 & 22, SLES 15

Change-Id: Ifbb66ab2c5bc7180221f98d9bf5e38d62f4ac036
Reviewed-on: http://gerrit.cloudera.org:8080/20171
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-12-17 22:52:01 +00:00
Zoltan Borok-Nagy
f1133acc2a IMPALA-13088, IMPALA-13109: Use RoaringBitmap instead of sorted vector of int64s
This patch replaces the sorted 64-bit integer vectors that we
use in IcebergDeleteNode with 64-bit roaring bitmaps. We use the
CRoaring library (version 4.0.0). CRoaring also offers C++ classes,
but this patch adds its own thin C++ wrapper class around the C
functions to get the best performance.

Toolchain Clang 5.0.1 was not able to compile CRoaring due to a
bug which is tracked by IMPALA-13190; this patch also fixes that
with a new toolchain.

Performance
I used an extended version of the "One Trillion Row" challenge. This
means that after inserting 1 trillion records into a table I also
deleted/updated lots of records (see the statements at the end). So at
the end I had 1 trillion data records and ~68.5 billion delete records
in the table.

For the measurements I used clusters with 10 and 40 executors, and
executed the following query:

 SELECT station, min(measure), max(measure), avg(measure)
 FROM measurements_extra_1trc_partitioned
 GROUP BY 1
 ORDER BY 1;

JOIN BUILD times:
+----------------+--------------+--------------+
| Implementation | 10 executors | 40 executors |
+----------------+--------------+--------------+
| Sorted vectors | CRASH        | 4m15s        |
| Roaring bitmap | 6m35s        | 1m51s        |
+----------------+--------------+--------------+

The 10-executor cluster with sorted vectors failed to run the query
because the executors crashed due to running out of memory.

Memory usage (VmRSS) for 10 executors:
+----------------+------------------------+
| Implementation |      10 executors      |
+----------------+------------------------+
| Sorted vectors | 54.4 GB (before CRASH) |
| Roaring bitmap | 7.4 GB                 |
+----------------+------------------------+

The resource estimations were wrong when MT_DOP was greater than 1. This
has also been fixed.

Testing:
 * added tests for RoaringBitmap64
 * added tests for resource estimations

Statements I used to delete / update the records for the One Trillion
Row challenge:

create table measurements_extra_1trc_partitioned(
    station string, ts timestamp, sensor_type int, measure decimal(5,2))
partitioned by spec (bucket(11, station), day(ts),
    truncate(10, sensor_type))
stored as iceberg;

The original challenge didn't have any row-level modifications; columns
'ts' and 'sensor_type' are new:
 'ts': timestamps that span a year
 'sensor_type': integers between 0 and 100

Both 'ts' and 'sensor_type' have uniform distributions.
Both 'ts' and 'sensor_type' has uniform distribution.

Ingested data with the help of the original One Trillion Row challenge
table, then issued the following DML statements:

-- DELETE ~10 Billion
delete from measurements_extra_1trc_partitioned
where sensor_type = 13;

-- UPDATE ~220 Million
update measurements_extra_1trc_partitioned
set measure = cast(measure - 2 as decimal(5,2))
  where station in ('Budapest', 'Paris', 'Zurich', 'Kuala Lumpur')
  and sensor_type in (7, 17, 77);

-- DELETE ~7.1 Billion
delete from measurements_extra_1trc_partitioned
where ts between '2024-01-15 11:30:00' and '2024-09-10 11:30:00'
  and sensor_type between 45 and 51
  and station regexp '[ATZ].*';

-- UPDATE ~334 Million
update measurements_extra_1trc_partitioned
set measure = cast(measure + 5 as decimal(5,2))
where station in ('Accra', 'Addis Ababa', 'Entebbe', 'Helsinki',
    'Hong Kong', 'Nairobi', 'Ottawa', 'Tauranga', 'Yaounde', 'Zagreb',
    'Zurich')
  and ts > '2024-11-05 22:30:00'
  and sensor_type > 90;

-- DELETE 50.6 Billion
delete from measurements_extra_1trc_partitioned
where
  sensor_type between 65 and 77
  and ts > '2024-08-11 12:00:00'
;

-- UPDATE ~200 Million
update measurements_extra_1trc_partitioned
set measure = cast(measure + 3.5 as decimal(5,2))
where
  sensor_type in (56, 66, 76, 86, 96)
  and ts < '2024-03-17 01:00:00'
  and (station like 'Z%' or station like 'Y%');

Change-Id: Ib769965d094149e99c43e0044914d9ecccc76107
Reviewed-on: http://gerrit.cloudera.org:8080/21557
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-07-16 17:35:13 +00:00
Surya Hebbar
42e5ea7ea3 IMPALA-13106: Support larger imported query profile sizes through compression
Imported query profiles are currently being stored in IndexedDB.
Although IndexedDB does not have storage limitations like other
browser storage APIs, there is a storage limit for a single
attribute / field.

To support larger query profiles, version 2.1.0 of the 'pako'
compression library has been added along with its associated license.

Before a query profile's JSON is added to IndexedDB, it is compressed
using this library.

As compressing and parsing a profile is a long-running process
that can block the main thread, this work has been delegated to
a worker script running in the background. The worker script
returns the parsed query attributes and the compressed profile text
sent to it.
The process of compression consumes time; hence, an alert message is
displayed on the queries page warning user to refrain from closing or
reloading the page. On completion, the raw total size, compressed
total size, and total processing time are logged to the browser console.

When multiple profiles are chosen, after each query profile insertion,
the subsequent one is not triggered until compression and insertion
are finished.

The inserted query profile field is decompressed before parsing on
the query plan, query profile, query statement, and query timeline page.

Added tests for the compression library methods utilized by
the worker script.

Manual testing has been done on Firefox 126.0.1 and Chrome 126.0.6478.

Change-Id: I8c4f31beb9cac89051460bf764b6d50c3933bd03
Reviewed-on: http://gerrit.cloudera.org:8080/21463
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-06-25 22:06:24 +00:00
Fang-Yu Rao
3a2f5f28c9 IMPALA-12921, IMPALA-12985: Support running Impala with locally built Ranger
The goals and non-goals of this patch could be summarized as follows.
Goals:
 - Add changes to the minicluster configuration that allow a non-default
   version of Ranger (possibly built locally) to run in the context of
   the minicluster, and to be used as the authorization server by
   Impala.
 - Switch to the new constructor when instantiating
   RangerAccessRequestImpl. This resolves IMPALA-12985 and also makes
   Impala compatible with Apache Ranger if RangerAccessRequestImpl from
   Apache Ranger is consumed.
 - Prepare Ranger and Impala patches as supplemental material to verify
   what authorization-related tests could be passed if Apache Ranger is
   the authorization provider. Merging IMPALA-12921_addendum.diff to
   the Impala repository is not in the scope of this patch in that the
   diff file changes the behavior of Impala and thus more discussion is
   required if we'd like to merge it in the future.

Non-goals:
 - Set up any automation for building Ranger from source.
 - Pass all Impala authorization-related tests with a non-default
   version of Ranger.

Instructions on running Impala with locally built Ranger:

Suppose the Ranger project is under the folder $RANGER_SRC_DIR. We could
execute the following to build Apache Ranger for easy reference. By
default, the compressed tarball is produced under
$RANGER_SRC_DIR/target.

mvn clean compile -B -nsu -DskipCheck=true -Dcheckstyle.skip=true \
package install -DskipITs -DskipTests -Dmaven.javadoc.skip=true

After building Ranger, we need to build Impala's Java code so that
Impala's Java code could consume the locally produced Ranger classes. We
will need to export the following environment variables before building
Impala. This prevents bootstrap_toolchain.py from trying to download the
compressed Ranger tarball.

1. export RANGER_VERSION_OVERRIDE=\
   $(mvn -f $RANGER_SRC_DIR/pom.xml -q help:evaluate \
   -Dexpression=project.version -DforceStdout)

2. export RANGER_HOME_OVERRIDE=$RANGER_SRC_DIR/target/\
   ranger-${RANGER_VERSION_OVERRIDE}-admin

It then suffices to execute the following to point
Impala to the locally built Ranger server before starting Impala.

1. source $IMPALA_HOME/bin/impala-config.sh

2. tar zxv -f $RANGER_SRC_DIR/target/\
   ranger-${IMPALA_RANGER_VERSION}-admin.tar.gz \
   -C $RANGER_SRC_DIR/target/

3. $IMPALA_HOME/bin/create-test-configuration.sh

4. $IMPALA_HOME/bin/create-test-configuration.sh \
   -create_ranger_policy_db

5. $IMPALA_HOME/testdata/bin/run-ranger.sh
   (run-all.sh has to be executed instead if other underlying services
   have not been started)

6. $IMPALA_HOME/testdata/bin/setup-ranger.sh

Testing:
 - Manually verified that we could point Impala to a locally built
   Apache Ranger on the master branch (with tip being
   https://github.com/apache/ranger/commit/4abb993).
 - Manually verified that with RANGER-4771.diff and
   IMPALA-12921_addendum.diff, only 3 authorization-related tests
   failed. They failed because the resource type of 'storage-type' is
   not supported in Apache Ranger yet and thus the test cases added in
   IMPALA-10436 could fail.
 - Manually verified that the log files of Apache and CDP Ranger's Admin
   server could be created under ${RANGER_LOG_DIR} after we start the
   Ranger service.
 - Verified that this patch passed the core tests when CDP Ranger is
   used.

Change-Id: I268d6d4d6e371da7497aac8d12f78178d57c6f27
Reviewed-on: http://gerrit.cloudera.org:8080/21160
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-06-15 10:25:13 +00:00
Riza Suminto
b1320bd1d6 IMPALA-13075: Cap memory usage for ExprValuesCache at 256KB
ExprValuesCache uses BATCH_SIZE as a deciding factor to set its
capacity. It bounds the capacity such that expr_values_array_ memory
usage stays below 256KB. This patch tightens that limit to include all
memory usage from ExprValuesCache::MemUsage() instead of
expr_values_array_ only. Therefore, setting a very high BATCH_SIZE will
not push the total memory usage of ExprValuesCache beyond 256KB.

Simplify table dimension creation methods and fix a few flake8 warnings
in test_dimensions.py.

Testing:
- Add test_join_queries.py::TestExprValueCache.
- Pass core tests.

Change-Id: Iee27cbbe8d3100301d05a6516b62c45975a8d0e0
Reviewed-on: http://gerrit.cloudera.org:8080/21455
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-06-15 00:28:38 +00:00
halim.kim
f7e629935b IMPALA-11871: Skip permissions loading and check on HDFS if Ranger is enabled
Before this patch, Impala checked whether the Impala service user had
WRITE access to the target HDFS table/partition(s) during the
analysis of the INSERT and LOAD DATA statements in the legacy catalog
mode. The access levels of the corresponding HDFS table and partitions
were computed by the catalog server solely based on the HDFS permissions
and ACLs when the table and partitions were instantiated.

After this patch, we skip loading HDFS permissions and assume the
Impala service user has the READ_WRITE permission on all the HDFS paths
associated with the target table during query analysis when Ranger is
enabled. The assumption could be removed after Impala's implementation
of FsPermissionChecker could additionally take Ranger's policies of HDFS
into consideration when performing the check.

Testing:
 - Added end-to-end tests to verify Impala's behavior with respect to
   the INSERT and LOAD DATA statements when Ranger is enabled in the
   legacy catalog mode.

Change-Id: Id33c400fbe0c918b6b65d713b09009512835a4c9
Reviewed-on: http://gerrit.cloudera.org:8080/20221
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-06-11 06:28:53 +00:00
Yida Wu
9837637d93 IMPALA-12920: Support ai_generate_text built-in function for OpenAI's chat completion API
Added support for following built-in functions:
- ai_generate_text_default(prompt)
- ai_generate_text(ai_endpoint, prompt, ai_model,
  ai_api_key_jceks_secret, additional_params)

'ai_endpoint', 'ai_model' and 'ai_api_key_jceks_secret' are flagfile
options. The 'ai_generate_text_default(prompt)' syntax expects all of
these to be set to proper values. The other syntax will try to use the
provided input parameter values, but falls back to the instance-level
values if the inputs are NULL or empty.

Only public OpenAI (api.openai.com) and Azure OpenAI (openai.azure.com)
API endpoints are currently supported.

Exposed these functions in FunctionContext so that they can also be
called from UDFs:
- ai_generate_text_default(context, model)
- ai_generate_text(context, ai_endpoint, prompt, ai_model,
  ai_api_key_jceks_secret, additional_params)

Testing:
- Added unit tests for AiGenerateTextInternal function
- Added fe test for JniFrontend::getSecretFromKeyStore
- Ran manual tests to make sure Impala can talk with OpenAI LLMs using
'ai_generate_text' built-in function. Example sql:
select ai_generate_text("https://api.openai.com/v1/chat/completions",
"hello", "gpt-3.5-turbo", "open-ai-key",
'{"temperature": 0.9, "model": "gpt-4"}')
- Tested using standalone UDF SDK and made sure that the UDFs can invoke
  BuiltInFunctions (ai_generate_text and ai_generate_text_default)

Change-Id: Id4446957f6030bab1f985fdd69185c3da07d7c4b
Reviewed-on: http://gerrit.cloudera.org:8080/21168
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-11 07:25:50 +00:00
Xiang Yang
e74bb9d81b IMPALA-12362: (part-2/4) Optimize default configurations for packaging module.
To avoid absolute paths and keep things simple, optimize the default
configurations for the packaging module by removing or changing some
entries.

At the same time, add license headers to 'package/conf/*-site.xml' and
rename them to '*-site.xml.template' to force administrators to make
configurations appropriate for their cluster.

Testing:
 - Manually deployed packages on Ubuntu 22.04 and verified them.

Change-Id: Ifda229b779a3d6fca647bb81fe23dd61ad7e5d66
Reviewed-on: http://gerrit.cloudera.org:8080/20928
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-08 04:29:26 +00:00
wzhou-code
fc74ca672a IMPALA-12378: Auto Ship JDBC Data Source
This patch moves the source files of the jdbc package to fe.
The data source location is now optional: a data source can be created
without specifying an HDFS location, assuming the data source class is
in the classpath and an instance of it can be created with the current
class loader. Impala still tries to load the jar file of the data source
at runtime if a data source location is set.

Testing:
 - Passed core test
 - Passed dockerised-tests

Change-Id: I0daff8db6231f161ec27b45b51d78e21733d9b1f
Reviewed-on: http://gerrit.cloudera.org:8080/20971
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
2024-02-07 16:29:11 +00:00
Eyizoha
3af1930229 IMPALA-12322: Support converting UTC timestamps read from Kudu to local time
This patch adds a query option 'convert_kudu_utc_timestamps' similar to
'convert_legacy_hive_parquet_utc_timestamps'. When enabled, it converts
UTC timestamps read from Kudu to local timestamps.

The corresponding modifications also cover predicate pushdown and
runtime filters. Due to the ambiguity of timestamps caused by daylight
saving time changes, which is difficult to resolve in the bloom filter,
this patch additionally introduces a query option
'disable_kudu_local_timestamp_bloom_filter' that disables the Kudu
timestamp bloom filter by default once time zone conversion is enabled,
in order to avoid erroneously filtering out data. However, for regions
that do not observe daylight saving time, it can be set to false to
re-enable the Kudu local timestamp bloom filter.
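
A sketch of how these options might be used from the shell (the table
and column names are illustrative):

SET convert_kudu_utc_timestamps=true;
-- Only if the local time zone has no DST, per the note above:
SET disable_kudu_local_timestamp_bloom_filter=false;
SELECT * FROM kudu_events WHERE event_ts > '2023-06-01 00:00:00';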

Testing:
- Add TestKuduTimestampConvert in query_test/test_kudu.py, which
  performs end-to-end testing in a custom cluster, including basic Kudu
  UTC timestamp conversion testing, as well as checking that the related
  predicate pushdown and runtime filters work correctly (even with
  timestamps involving daylight saving time conversions).

Change-Id: I9a1e7a13e617cc18deef14289cf9b958588397d3
Reviewed-on: http://gerrit.cloudera.org:8080/20681
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>
2023-12-14 13:19:35 +00:00
Michael Smith
465cc7acf7 IMPALA-11542: Import LLVM SectionMemoryManager for fixes
Imports SectionMemoryManager so we can add some fixes that require
modifying private code. This could be done as a patch on LLVM, but the
memory manager seems like something we should own and may need to
customize in other ways.

Changes namespace and adds using statements to make it compile. Uses
LLVM 5.0.1 with commit 787614d08bda40daf3e168bd46a8c2a86319ec63 added as
a small security fix.

Change-Id: I8917005094903ed0ece25e40eb445abb159b569b
Reviewed-on: http://gerrit.cloudera.org:8080/20696
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2023-12-01 18:25:51 +00:00
Surya Hebbar
fed578580b IMPALA-12364: Display memory, disk and network metrics in webUI's query timeline
This patch adds fragment-level metrics to the WebUI query timeline
display along with additional disk and network metrics.

The fragment's plan nodes are enlarged with an animated transition on
hovering over the fragment's row in the query timeline's fragment
diagram.

On clicking the plan nodes, total thread and memory usage of the parent
fragment are displayed, after accumulating memory and thread usage of
all child nodes. Thread usage is being shown on the additional Y-axis.

In this way, memory and thread usage of multiple fragments can be
compared alongside. A fragment's usage can be hidden by clicking
on any of the child plan nodes again.

These counters are available within the profile with following names.

- MemoryUsage
- ThreadUsage

Once a fragment's metrics are displayed, they are updated as they
are collected from the profile during a running query.

A grid-line is displayed along with a tooltip on hovering over the
fragment diagram, containing the instantaneous time at that position.
This grid-line also triggers tooltips and gridlines in other charts.

A warning is displayed on clicking a fragment that has a low number of
samples available.

The RESOURCE_TRACE_RATIO query option must be set to provide periodic
metrics within the profile (see the example after the list below). This
allows the following time series counters to be displayed on the query
timeline.

- HostDiskWriteThroughput
- HostDiskReadThroughput
- HostNetworkRx
- HostNetworkTx
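
For example, the option might be enabled for the session before running
the query (the value is illustrative):

SET RESOURCE_TRACE_RATIO=1.0;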

The additional Y-axis within the utilization chart is used to represent
the average of these metrics.

The memory units in tooltips and ticks on co-ordinate axes are displayed
in human readable form such as KB, MB, GB and PB for convenience.

Both of the charts contain controls to close the chart. These charts
can also be resized, up to maximum and minimum limits, by dragging the
resize bar's handle.

Along with mouse wheel events, the diagrams can be horizontally
stretched with the help of buttons with horizontal zoom icons at the
top of the page. The zoom-out button is disabled when further zoom out
is not possible.

Timeticks are autoscaled during the fragment diagram's horizontal zoom.

In addition to the scrollbar, hovering on edges of the window allows
horizontal scrolling.

Test cases have been added for the additional disk, network and
fragment-level memory metrics parsing functions.

Change-Id: Ifd25e6f0bc9fbd664ec98936daff3f27182dfc7f
Reviewed-on: http://gerrit.cloudera.org:8080/20355
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-11-09 02:32:27 +00:00
Fucun Chu
c2bd30a1b3 IMPALA-5741: Initial support for reading tiny RDBMS tables
This patch uses the "external data source" mechanism in Impala to
implement data source for querying JDBC.
It has some limitations due to the restrictions of "external data
source":
  - It is not distributed, e.g. the fragment is unpartitioned. The
    queries are executed on the coordinator.
  - Queries which read following data types from external JDBC tables
    are not supported:
    BINARY, CHAR, DATETIME, and COMPLEX.
  - Only support binary predicates with operators =, !=, <=, >=,
    <, > to be pushed to RDBMS.
  - Following data types are not supported for predicates:
    DECIMAL, TIMESTAMP, DATE, and BINARY.
  - External tables with complex types of columns are not supported.
  - Support is limited to the following databases:
    MySQL, Postgres, Oracle, MSSQL, H2, DB2, and JETHRO_DATA.
  - Catalog V2 is not supported (IMPALA-7131).
  - DataSource objects are not persistent (IMPALA-12375).

Additional fixes are planned on top of this patch.

Source files under jdbc/conf, jdbc/dao and jdbc/exception are
replicated from Hive JDBC Storage Handler.

In order to query the RDBMS tables, the following steps should be
followed (note that existing data source tables will be rebuilt):
1. Make sure the Impala cluster has been started.

2. Copy the jar files of JDBC drivers and the data source library into
HDFS.
${IMPALA_HOME}/testdata/bin/copy-ext-data-sources.sh

3. Create an `alltypes` table in the Postgres database.
${IMPALA_HOME}/testdata/bin/load-ext-data-sources.sh

4. Create data source tables (alltypes_jdbc_datasource and
alltypes_jdbc_datasource_2).
${IMPALA_HOME}/bin/impala-shell.sh -f\
  ${IMPALA_HOME}/testdata/bin/create-ext-data-source-table.sql

5. You are now ready to run queries against the data source tables
created in the last step. There is no need to restart the Impala
cluster.
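
As a hedged sketch of what step 4 roughly does (the data source name,
jar path, class name and init string below are assumptions for
illustration, not the exact contents of the script):

CREATE DATA SOURCE jdbc_ds
LOCATION 'hdfs:///path/to/jdbc-data-source.jar'
CLASS 'org.apache.impala.extdatasource.jdbc.JdbcDataSource'
API_VERSION 'V1';

CREATE TABLE alltypes_jdbc_datasource (id INT, bool_col BOOLEAN)
PRODUCED BY DATA SOURCE jdbc_ds(
  '{"database.type":"POSTGRES","jdbc.url":"jdbc:postgresql://localhost:5432/functional","table":"alltypes"}');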

Testing:
 - Added unit-test for Postgres and ran unit-test with JDBC driver
   postgresql-42.5.1.jar.
 - Ran manual unit-test for MySql with JDBC driver
   mysql-connector-j-8.1.0.jar.
 - Ran core tests successfully.

Change-Id: I8244e978c7717c6f1452f66f1630b6441392e7d2
Reviewed-on: http://gerrit.cloudera.org:8080/17842
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-10-10 02:13:59 +00:00
Sebastian Pop
333902afcc [arm64] remove dependence on sse2neon
This patch removes the dependence on sse2neon by rewriting SSE2 and AVX code
with native NEON instructions. Part of the patch has been submitted to
Kudu: https://gerrit.cloudera.org/#/c/20374/

Change-Id: If3c78c877ef530fa9f35d36da523ad67ab34e5e7
Reviewed-on: http://gerrit.cloudera.org:8080/19954
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-09-26 02:22:27 +00:00
Joe McDonnell
1a1a84ee23 IMPALA-12434: Isolate pkg_resources.py to its own directory
In some build environments, the impala-shell Python 3
virtualenv install fails due to interactions with
shell/pkg_resources.py. This doesn't reproduce in the standard
development environment, but it is consistent. It seems to
be related to invoking a command in ${IMPALA_HOME}/shell
and the pkg_resources.py being in that directory.

To avoid any interactions, this moves shell/pkg_resources.py
to shell/legacy/pkg_resources.py. This keeps it off of the
path for the failing command, and it also keeps it off of
our PYTHONPATH (which includes ${IMPALA_HOME}/shell).

Testing:
 - Ran a build in the affected build environment
 - Ran a core job

Change-Id: Id8f2d8a8472c7bb405bf88673ed9779e23cde1d6
Reviewed-on: http://gerrit.cloudera.org:8080/20468
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-09-19 04:30:09 +00:00
Eyizoha
2f06a7b052 IMPALA-10798: Initial support for reading JSON files
A prototype of HdfsJsonScanner implemented based on rapidjson, which
supports scanning data from split JSON files.

The scanning of JSON data is mainly completed by two parts working
together. The first part is the JsonParser, responsible for parsing the
JSON objects, which is implemented based on the SAX-style API of
rapidjson. It reads data from the char stream, parses it, and calls the
corresponding callback function when it encounters the corresponding
JSON element. See the comments of the JsonParser class for more details.

The other part is the HdfsJsonScanner, which inherits from HdfsScanner
and provides callback functions for the JsonParser. The callback
functions are responsible for providing data buffers to the Parser and
converting and materializing the Parser's parsing results into RowBatch.
It should be noted that the parser returns numeric values as strings to
the scanner. The scanner uses the TextConverter class to convert the
strings to the desired types, similar to how the HdfsTextScanner works.
This is an advantage compared to using the numeric values provided by
rapidjson directly, as it eliminates concerns about inconsistencies in
converting decimals (e.g. losing precision).

Added a startup flag, enable_json_scanner, to be able to disable this
feature if we hit critical bugs in production.

Limitations
 - Multiline json objects are not fully supported yet. It is ok when
   each file has only one scan range. However, when a file has multiple
   scan ranges, there is a small probability of incomplete scanning of
   multiline JSON objects that span ScanRange boundaries (in such cases,
   parsing errors may be reported). For more details, please refer to
   the comments in the 'multiline_json.test'.
 - Compressed JSON files are not supported yet.
 - Complex types are not supported yet.

Tests
 - Most of the existing end-to-end tests can run on JSON format.
 - Add TestQueriesJsonTables in test_queries.py for testing multiline,
   malformed, and overflow in JSON.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Reviewed-on: http://gerrit.cloudera.org:8080/19699
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-09-05 16:55:41 +00:00
Fredy Wijaya
4b62812995 [tools] Add Dev Container support for Impala development.
Currently only VS Code is supported, since IntelliJ/CLion support for
Dev Containers is still in beta at the time of this writing.

To use it, simply open the Impala source code:

$ git clone https://github.com/apache/impala.git
$ cd impala
$ code .

The bootstrap_development.sh script will be automatically executed after
the Docker container is created, and all necessary extensions for an
IDE-like experience will be automatically installed. For C++, it uses
clangd with a compilation database instead of the Microsoft C++
extension, since clangd works better with Clang-related tools.

Change-Id: I50508a09710641ec2a299b001fef3e7fefb0b7d5
Reviewed-on: http://gerrit.cloudera.org:8080/20380
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Quanlong Huang <huangquanlong@gmail.com>
2023-08-23 12:30:14 +00:00
Laszlo Gaal
ee069687fc IMPALA-12212: Bump Maven to 3.9.2, pull dependencies in parallel
Maven 3.9.x offers a new dependency resolver, HttpClient, which allows
downloading project dependencies in parallel.

This patch bumps the Maven version installed by bootstrap_system.sh to
v3.9.2, and adds the flags enabling the new resolver to download
dependencies (including POM files) in parallel. Parallelism is set to
10 threads.

The flags are added to a project-specific Maven setting file in the
newly created java/.mvn directory. The settings file is added to the
RAT exclusion list in bin/rat_exclude_files.txt.

The --show-version flag is added for debugging purposes.

The same flags are added to the JAMM subproject as well.

The new resolver in Maven 3.9 has also changed the warning message
emitted for missing component checksums, so the new warning string
is added to the filter in bin/mvn-quiet.sh
Unfortunately Maven 3.9 has also changed the way it responds to missing
checksum files: the resolver now emits a stack trace when checksums
cannot be determined, and missing checksums are not explicitly ignored.

Detailed documentation for the new Maven resolver in Maven 3.9.0+ is
located at:
https://maven.apache.org/guides/mini/guide-resolver-transport.html
resolver configuration reference:
https://maven.apache.org/resolver/configuration.html

Tests:
- verified in a core-mode test run with Maven 3.9.2 installed
- verified in a local build using an earlier version of Maven
  to verify that the new default setting does not cause regressions
  with the old dependency resolver.

Change-Id: I75d05215effc724f5bd471646fb352f37443e185
Reviewed-on: http://gerrit.cloudera.org:8080/20142
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
2023-07-24 18:50:34 +00:00
stiga-huang
8d0ab2b684 IMPALA-10262: RPM/DEB Packaging Support
This patch is based on a previous patch contributed by Shant Hovsepian:
https://gerrit.cloudera.org/c/16612/

It adds a new option, -package, to buildall.sh for building a package
for the current OS type (e.g. CentOS/Ubuntu). You can also use
"make/ninja package" to build the package. Scripts for launching the
services and the required configuration files are also added.

Tests:
 - Built on Ubuntu 18.04/20.04 and CentOS 7 using
   ./buildall.sh -noclean -skiptests -release -package
 - Deployed the RPM package on a CDP cluster. Verified the scripts.
 - Deployed the DEB package on a docker container. Verified the scripts.

Change-Id: I64419fd400fe8d233dac016b6306157fe9461d82
Reviewed-on: http://gerrit.cloudera.org:8080/18939
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-07-16 11:13:23 +00:00
Surya Hebbar
36a63d0a33 IMPALA-12182: Add CPU utilization chart for RuntimeProfile's sampled metrics
This change adds support for a stacked area chart for CPU utilization
in the query timeline display, while also providing the ability to
scale timetick values and precision, and to horizontally scale the
fragment timing diagram along with the utilization chart.

Rendering of the different components within the diagram has been
decoupled to isolate scaling of timeticks, and to improve overall
efficiency by making the rendering functions asynchronous for better
performance during resize events. Additionally, re-rendering of the
fragment diagram is only triggered on new fragment events.

The following are the associated key bindings to scale the timeline
with mouse wheel events.
- shift + wheel events on #fragment_diagram
- shift + wheel events on #timeticks_footer
- alt + shift + wheel events on #timeticks_footer for precision control

Note:
Ctrl + mouse wheel events and ctrl + '+'/'-' events can be used to
resize the timeline through the browser.

Mouse wheel events have been associated with respective components
for better efficiency and maintainability.

Constraints have been added to above attributes to limit scaling/zooming
for appropriate display and rendering across all DOM elements.

The RESOURCE_TRACE_RATIO query option provides the utilization values to
be traced within the RuntimeProfile. The profile then contains samples
of CPU utilization metrics for user, sys and iowait. These time series
counters are available within the profile under the following names.

Per Node Profiles -
  - HostCpuIoWaitPercentage
  - HostCpuSysPercentage
  - HostCpuUserPercentage

The samples are updated based on 'periodic_counter_update_period_ms'
providing the 'period' within profile's 'Per Node Profiles'.

These are retrieved from the ChunkedTimeSeriesCounter in the
RuntimeProfile. Currently, JSON profiles and webUI summary pages
contain the downsampled values.

Utilization samples are aligned with the fragment diagram by
associating the number of samples and the period.

Aggregate CPU usage for each node is calculated by accumulating
the basis point values for user, sys and iowait. These are displayed
as a stacked line chart after grouping the associated counters for
each node.
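
As a rough illustration (not part of this patch), combining the per-node
basis-point samples into stacked percentages aligned on the sampling
period could look like the following Python sketch; the counter values
and the 'period' are the ones described above, everything else is made
up:

  def stack_cpu_samples(user_bp, sys_bp, iowait_bp, period_ms):
      """Each argument is a list of basis-point samples (1/100th of a percent)."""
      series = []
      for i, (u, s, w) in enumerate(zip(user_bp, sys_bp, iowait_bp)):
          series.append({
              "time_ms": i * period_ms,      # sample position given the period
              "user": u / 100.0,             # basis points -> percent
              "sys": s / 100.0,
              "iowait": w / 100.0,
              "total": (u + s + w) / 100.0,  # aggregate CPU usage for the node
          })
      return series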

The c3.js charting library, which is based on d3 v5, is used to plot
the utilization.

The license associated with d3 v5 during the related time frame
has been included along with the charting library's.

Support for experimental profile V2 is currently not included.

Scaling a large number of values to support profile V2 would be
possible with appropriate down-sampling in the back-end.

Testing: Manual testing with TPC-DS and TPC-H queries

Change-Id: Idea2a6db217dbfaa7a0695aeabb6d9c1ecf62158
Reviewed-on: http://gerrit.cloudera.org:8080/20008
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-07-11 00:06:24 +00:00
Michael Smith
3b0705ba63 IMPALA-11941: Support Java 17 in Impala
Enables building for Java 17 - and particularly using Java 17 in
containers - but won't run a minicluster fully with Java 17 as some
projects (Hadoop) don't yet support it.

Starting with Java 15, ehcache.sizeof encounters
"UnsupportedOperationException: can't get field offset on a hidden class"
for class members pointing to capturing lambda functions. Java 17 also
introduces new modules that need to be added to add-opens. Both of these
pose problems for continued use of ehcache.

Adds https://github.com/jbellis/jamm as a new cache weigher for Java
15+. We build from HEAD as an external project until Java 17 support is
released (https://github.com/jbellis/jamm/issues/44). Adds the
'java_weigher' option to select 'sizeof' or 'jamm'; defaults to 'auto',
which uses jamm for Java 15+ and sizeof for everything else. Also adds
metrics for viewing cache weight results.

Adds JAVA_HOME/lib/server to LD_LIBRARY_PATH in run-jvm-binary to
simplify switching between JDK versions for testing. You can now
- export IMPALA_JDK_VERSION=11
- source bin/impala-config.sh
- start-impala-cluster.py
and have Impala running a different JDK (11) version.

Retains add-opens calls that are still necessary due to dependencies'
use of lambdas for jamm, and all others for ehcache. Add-opens are still
required as a fallback, as noted in
https://github.com/jbellis/jamm#object-graph-crawling. We catch the
exceptions jamm and ehcache throw - CannotAccessFieldException,
UnsupportedOperationException - to avoid crashing Impala, and add the
resulting message to the list of banned log messages (so that we add the
missing add-opens when we find them).

Testing:
- container test run with Java 11 and 17 (excludes custom cluster)
- manual custom_cluster/test_local_catalog.py +
  test_banned_log_messages.py run with Java 11 and 17 (Java 8 build)
- full Java 11 build (passed except IMPALA-12184)
- add test catalog cache entry size metrics fit reasonable bounds
- add unit test for utility to find jamm jar file in classpath

Change-Id: Ic378896f572e030a3a019646a96a32a07866a737
Reviewed-on: http://gerrit.cloudera.org:8080/19863
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-06-24 10:11:54 +00:00
jasonmfehr
63d13a35f3 IMPALA-11880: Adds support for authenticating to Impala using JWTs.
This support was modeled after the LDAP authentication.

If JWT authentication is used, the Impala shell enforces the use of the
hs2-http protocol since the JWT is sent via the "Authentication"
HTTP header.

The following flags have been added to the Impala shell:
* -j, --jwt: indicates that JWT authentication will be used
* --jwt_cmd: shell command to run to retrieve the JWT to use for
  authentication
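
As a rough illustration of the --jwt_cmd semantics described above (the
command and path below are made up, not taken from this patch), the shell
runs the given command and uses its stdout as the token; in Python terms:

  import subprocess

  # Hypothetical stand-in for what --jwt_cmd does: run a user-supplied
  # command and treat its trimmed stdout as the JWT to send over hs2-http.
  jwt_cmd = "cat /tmp/test.jwt"  # any command that prints a JWT would do
  token = subprocess.check_output(jwt_cmd, shell=True).decode().strip()
  print("token prefix:", token[:20])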

Testing:
New Python tests have been added:
* The shell tests ensure that the various command line arguments are
  handled properly. Situations such as allowing only a single
  authentication method and refusing to send JWTs in clear text
  without the proper arguments are asserted.
* The Python custom cluster tests leverage a test JWKS and test JWTs.
  Then, a custom Impala cluster is started with the test JWKS. The
  Impala shell attempts to authenticate using a valid JWT, an expired
  (invalid) JWT, and a valid JWT signed by a different, untrusted JWKS.
  These tests also exercise the Impala JWT authentication mechanism and
  assert the prometheus JWT auth success and failure metrics are
  reported accurately.

Change-Id: I52247f9262c548946269fe5358b549a3e8c86d4c
Reviewed-on: http://gerrit.cloudera.org:8080/19837
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-05-11 23:22:05 +00:00
Gergely Farkas
490dd7b115 IMPALA-11726: Allow LDAP user and group filter when Kerberos is enabled
This change does two things for the Kerberos authentication support
for impala clients:

1) Introduces the allow_custom_ldap_filters_with_kerberos_auth flag,
which removes the restriction that prevents the use of LDAP group/user
search filters when Kerberos authentication is enabled. When the flag
is set, both Kerberos and LDAP can work with impala clients
(impala-shell, jdbc, odbc, impyla) even if the group/user filters are
defined. The flag default value is false, which ensures backwards
compatibility.

2) Introduces enable_group_filter_check_for_authenticated_kerberos_user
flag, which allows group filters to be applied for non-proxy users
that belong to the authenticated Kerberos principals.
The verified username comes from the Kerberos principal: The username
is the first member of the authenticated Kerberos principal, where the
principal can be username/host@realm or username@realm.
Regardless of whether the flag is enabled or not, LDAP filters are not
applied for authorized proxy users (neither when using LDAP nor when
using Kerberos authentication). In case of delegation, filters are
applied for delegated users.
This flag makes sense if both Kerberos and LDAP authentication are enabled
and the users in the KDC and LDAP are synchronized (e.g. Active
Directory provides both LDAP and Kerberos authentication).
The flag default value is false, which ensures backwards compatibility.
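
The rules above for when the LDAP filters are applied can be summarized
in a small sketch (simplified Python pseudologic, not the actual
implementation; the last argument abbreviates the
enable_group_filter_check_for_authenticated_kerberos_user flag):

  def ldap_filters_apply(is_proxy_user, is_delegated_user, is_kerberos_user,
                         group_filter_check_for_kerberos_user):
      # Filters are never applied for authorized proxy users, regardless of
      # whether LDAP or Kerberos authentication was used.
      if is_proxy_user:
          return False
      # In case of delegation, filters are applied for the delegated user.
      if is_delegated_user:
          return True
      # For non-proxy Kerberos-authenticated users, filters are applied
      # only when the new flag is enabled.
      if is_kerberos_user:
          return group_filter_check_for_kerberos_user
      # LDAP-authenticated users always go through the configured filters.
      return True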

Notes:

If the allow_custom_ldap_filters_with_kerberos_auth flag is disabled,
it is still possible to use LDAP and Kerberos authentication together,
but in a limited way: Only LDAP search bind authentication mode can be
used, with the default user and group search filters (which are
defined for the Active Directory LDAP schema). One major limitation here
- apart from the AD directory schema assumed in the default filters -
is that the only way to control user access is to select the
appropriate user and group search base dn (e.g. granting LDAP access
to users/groups defined in a given subtree).
Even in this edge case, it is still allowed to enable the
enable_group_filter_check_for_authenticated_kerberos_user flag. If this
happens, then the default filters in LDAP search bind will be applied
for Kerberos authenticated non-proxy users.

Another edge case is when LDAP authentication is enabled,
user access is controlled by custom LDAP filters (LDAP auth only),
and external Kerberos authentication is also enabled, but the users
in the KDC and LDAP are not in sync:
In this case the allow_custom_ldap_filters_with_kerberos_auth flag must
be set, but the enable_group_filter_check_for_authenticated_kerberos_user
flag should be disabled, otherwise an unauthorized response may be
received during Kerberos authentication (depending on whether the
authenticated Kerberos user passes the custom LDAP filters or not).
In such cases, access for Kerberos users must be controlled by other
means (e.g. within a FreeIPA KDC with host-based access control rules).

Tests:
- New unit test created to check the behavior of AuthManager with
  and without allow_custom_ldap_filters_with_kerberos_auth flag.
- New custom cluster tests created:
  - impala-shell tests that validate existing LDAP search bind
    and simple bind functionality with Kerberos authentication
    enabled (LdapSearchBindImpalaShellTest and
    LdapSimpleBindImpalaShellTest suites are now parameterized),
  - impala-shell tests that validate backwards compatibility
    when allow_custom_ldap_filters_with_kerberos_auth flag and
    enable_group_filter_check_for_authenticated_kerberos_user
    flags are disabled
    (LdapSearchBindDefaultFiltersKerberosImpalaShellTest)
  - various impala-shell tests that validate Kerberos
    authentication in an environment where LDAP authentication
    is also enabled (LdapKerberosImpalaShellTest)
- Manual tests with a snapshot build in CDP PVC DS with LDAP and
  Kerberos authentication enabled, user and group filters provided.

Change-Id: If3ca9c4ff8a17167e5233afabdd14c948edb46de
Reviewed-on: http://gerrit.cloudera.org:8080/19561
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-03-10 19:49:30 +00:00
Joe McDonnell
566df80891 IMPALA-11959: Add Python 3 virtualenv
This adds a Python 3 equivalent to the impala-python
virtualenv based on the toolchain Python 3.7.16.
This modifies bootstrap_virtualenv.py to support
the two different modes. This adds py2-requirements.txt
and py3-requirements.txt to allow some differences
between the Python 2 and Python 3 virtualenvs.

Here are some specific package changes:
 - allpairs is replaced with allpairspy, as allpairs did
   not support Python 3 (a usage sketch follows this list).
 - requests is upgraded slightly, because otherwise it has issues
   with idna==2.8.
 - pylint is limited to Python 3, because we are adding it
   and don't need it on both.
 - flake8 is limited to Python 2, because it will take
   some work to switch to a version that works on Python 3.
 - cm_api is limited to Python 2, because it doesn't support
   Python 3.
 - pytest-random does not support Python 3 and it is unused,
   so it is removed.
 - Bump the version of setuptools-scm to support Python 3.
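
A minimal usage sketch for the allpairspy replacement mentioned above
(the parameter values below are made up; only the AllPairs iteration API
is real):

  from allpairspy import AllPairs

  # Generate pairwise combinations of test dimensions, as the old
  # "allpairs" package used to do for parameterized tests.
  parameters = [
      ["parquet", "text", "avro"],   # hypothetical file formats
      ["none", "snappy"],            # hypothetical compression codecs
      [True, False],                 # hypothetical boolean dimension
  ]
  for i, pairs in enumerate(AllPairs(parameters)):
      print(i, pairs)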

This adds impala-pylint, which can be used to do further
Python 3 checks via --py3k. This also adds a bin/check-pylint-py3k.sh
script to enforce specific py3k checks. The banned py3k warnings
are specified in bin/banned_py3k_warnings.txt. The file is currently
empty, but it allows ratcheting up the py3k strictness over time
to avoid regressions.

This pulls in a new toolchain with the fix for IMPALA-11956
to get Python 3.7.16.

Testing:
 - Hand tested that the allpairs libraries produce the
   same results
 - The python3 virtualenv has no influence on regular
   tests yet

Change-Id: Ica4853f440c9a46a79bd5fb8e0a66730b0b4efc0
Reviewed-on: http://gerrit.cloudera.org:8080/19567
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
a9cfc7b33f IMPALA-11624: Bump Impyla dependency to 0.18.0
IMPALA_THRIFT_PY_VERSION is also bumped to 0.16.0p3.
Since the 0.16.0p3 Thrift build does not contain Python-related
patches and Impyla 0.18.0 depends on Thrift 0.16.0, we are now
consistently using Thrift 0.16.0 in all Python code. This also
bumps the Thrift in the shell's ext-py directory to 0.16.0 (based
on the Thrift 0.16.0 pypi tarball with the egg directory removed).

Testing:
 - Ran a GVO job

Change-Id: I7265558b0e07959c606cba73cd251c3edfcb3ed5
Reviewed-on: http://gerrit.cloudera.org:8080/18456
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-02-27 20:39:26 +00:00
Michael Smith
feb4a76ed4 IMPALA-11913: Upgrade datatables to 1.13.2
Upgrades datatables from datatables.net to the latest available version
to address XSS and prototype pollution issues with 1.10.18.

Testing:
- clicked around to all the UI pages

Change-Id: I323fd06da003789485d340eaa25d4ab79a7f3ece
Reviewed-on: http://gerrit.cloudera.org:8080/19489
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2023-02-14 22:35:56 +00:00
Michael Smith
16190b4f77 IMPALA-11737: Update sasl to 0.3.1 for Python 3.10
sasl 0.2.1 fails to build with Python 3.10. Updates to sasl 0.3.1 for
Python 3.10 compatibility.

Testing:
- built under Python 3.8
- automated tests will test with built bundle and pip install using
current Python version
- pip3 installed shell/build/dist on Ubuntu 22.04 with Python 3.10

Change-Id: I6b522f2b8cb5546150cd3274c7670a6ca9b8ff63
Reviewed-on: http://gerrit.cloudera.org:8080/19265
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
2022-11-28 17:16:42 +00:00
Tamas Mate
e1e92da796 IMPALA-11676: Prettify asf-site docs
This commit refactors the docs build script/Makefile and adds a new
build option. The available options are:
 - plain-html: the plain html docs, without css and navigation bar; this
was "the" html build before this change.
 - asf-site-html: html docs, with css and navigation bar.
 - pdf

The css comes from the DITA project's documentation.

Testing:
 - Built the docs and tested the pages manually.

Change-Id: Ic9621cb0abaa7fd9bf445da08440c0f6a9788180
Reviewed-on: http://gerrit.cloudera.org:8080/19242
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-11-16 20:26:31 +00:00
Riza Suminto
ea6173440c IMPALA-11669: Set TConfiguration during Thrift connection setup
THRIFT-5237 implemented MAX_MESSAGE_SIZE and consolidated limits into a
TConfiguration class. MAX_MESSAGE_SIZE defaults to 100MB. This patch
adds the backend flag 'thrift_rpc_max_message_size' to override the
default MAX_MESSAGE_SIZE by calling the function
AssignDefaultTConfiguration. We set a higher default, 1GB, to minimize
interruption of existing Impala workloads. We should consider lowering
this default once we ensure that all thrift rpc responses can be made in
batches, as explained in IMPALA-11402.

Thrift tuning for communication with HMS is fixed through HIVE-26633.
Appropriate Hive version bump is required.

Testing:
- Add EXPECT_NO_THROW to verify that 'checkReadBytesAvailable' does
  not throw an exception given the default FLAGS_thrift_rpc_max_message_size.
- Add MaxMessageSizeFit and MaxMessageSizeExceeded tests in
  thrift-server-test.
- Run and pass thrift-server-test

Change-Id: I137683d43c72a34105fd7b32fea3a93532601ae3
Reviewed-on: http://gerrit.cloudera.org:8080/19162
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-10-28 02:29:21 +00:00
Csaba Ringhofer
7ca11dfc7f IMPALA-9482: Support for BINARY columns
This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows adding special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can only be used by casting to STRING first.
Adding support for more of these functions is very easy, as the BINARY
type simply has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY they always work as if utf8_mode=0 (for
example length() is mapped to bytes(), as length() counts utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in the case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitly to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as closely to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example of the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.
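
A small illustration of that casting rule through impyla (the table name,
host and port are made up; only the explicit CAST requirement and the
length() support come from this change):

  from impala.dbapi import connect

  conn = connect(host="localhost", port=21050)
  cur = conn.cursor()
  cur.execute("CREATE TABLE IF NOT EXISTS binary_demo (b BINARY)")
  # A bare string literal is not accepted here; the explicit cast is
  # required in Impala, unlike in Hive.
  cur.execute("INSERT INTO binary_demo VALUES (CAST('abc' AS BINARY))")
  cur.execute("SELECT length(b) FROM binary_demo")  # length() supports BINARY
  print(cur.fetchall())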

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added for the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Reviewed-on: http://gerrit.cloudera.org:8080/16066
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-19 13:55:42 +00:00
Michael Smith
64b324ac40 IMPALA-11389: Include Python 3 eggs in tarball
Build Python 3 eggs for the shell tarball so it works with both Python 2
and Python 3. The impala-shell script selects eggs based on the
available Python version.

Inlines thrift for impala-shell so we can easily build Python 2 and
Python 3 versions, consistent with other libraries. The impala-shell
version should always be at least as new as IMPALA_THRIFT_PY_VERSION.

Thrift 0.13.0+ wraps all exceptions during TSocket read/write operations
in TTransportException. Specifically, socket.error exceptions that we
previously got raw are now wrapped. We unwrap them before raising to
preserve prior behavior.

A specific Python version can be selected with IMPALA_PYTHON_EXECUTABLE;
otherwise it will use 'python', and if unavailable try 'python3'.

Adds tests for impala-shell tarball with Python 3.

Change-Id: I94f86de9e2a6303151c2f0e6454b5f629cbc9444
Reviewed-on: http://gerrit.cloudera.org:8080/18653
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-07-14 23:52:04 +00:00
wzhou-code
b867f4c4f1 IMPALA-10745 (part 2): Support Kerberos over HTTP for impala-shell
This patch adds the kerberos-1.3.1 Python module to shell/ext-py so that
the egg file of the Kerberos module is built and added to the impala-shell
tarball when running the shell/make_shell_tarball.sh script.
The Kerberos Python module is distributed under Apache License Version 2.
Its source distribution is available at:
https://pypi.org/project/kerberos/

Testing:
 - Passed core run.
 - Installed impala-shell from impala-shell tarball on dev box as
   standalone package. Verified that impala-shell could be run without
   additional configurations.
 - Installed impala-shell from impala-shell tarball on a real cluster
   with a full Kerberos setup. Verified that impala-shell could
   connect to impala server with options "-k --protocol=hs2-http".

Change-Id: Id34074cbe725ba2cf1407fcf59e00475cd417a6d
Reviewed-on: http://gerrit.cloudera.org:8080/18523
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-05-15 21:46:06 +00:00
Fucun Chu
4186727fe6 IMPALA-10871: Add MetastoreShim to support Apache Hive 3.1.2
Like IMPALA-8369, this patch adds a compatibility shim in fe so that
Impala can interoperate with Hive 3.1.2. It adds a new
MetastoreShim class under the compat-apache-hive-3 directory. These shim
classes implement methods which differ between cdp-hive-3 and
apache-hive-3 and are used by frontend code. At build time, based on
the environment variable IMPALA_HIVE_DIST_TYPE, one of the two shims
is added as a source using the fe/pom.xml build plugin.

Some code that directly uses Hive 4 APIs needs to be excluded from
compilation, e.g. fe/src/main/java/org/apache/impala/catalog/metastore/.
A Maven profile is used to exclude this code; the profile is
automatically activated based on IMPALA_HIVE_DIST_TYPE.

Testing:
1. Code compiles and runs against both HMS-3 and ASF-HMS-3
2. Ran full-suite of tests against HMS-3
3. Running full-tests against ASF-HMS-3 will need more work
supporting Tez in the mini-cluster (for dataloading) and HMS
transaction support. This will be an on-going effort, and test failures
on ASF-Hive-3 will be fixed in additional sub-tasks.

Notes:
1. Patch uses a custom build of Apache Hive to be deployed in
mini-cluster. This build has the fixes for HIVE-21569, HIVE-20038.
This hack will be added to the build script in additional sub-tasks.

Change-Id: I9f08db5f6da735ac431819063060941f0941f606
Reviewed-on: http://gerrit.cloudera.org:8080/17774
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-02-27 06:36:19 +00:00
Andrew Sherman
b96439f680 IMPALA-11078 Add simple CSP header to webui.
Content Security Policy (CSP) is a computer security standard designed
to prevent cross-site scripting, clickjacking and other code injection
attacks. CSP provides a method for websites to declare approved origins
of content that browsers should be allowed to load on that website.
A good resource is https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP
If a page breaks the rules then the included script or css will
typically not be run by the browser.

In the Impala webui we use a CSP header to declare that all web content
comes from the impalad, with some 'unsafe' inline code.

A new server flag "--disable_content_security_policy_header=true" can be
set to disable the emission of this header in case of any compatibility
issues.

A few small changes were needed to make this CSP header work. Chart.js
was previously included via http; this was changed to bundling it
like the other javascript and css we use. Some dodgy array code that
handles connection metrics was also fixed.
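
A quick way to manually confirm the header is emitted (not part of this
patch; 25000 is the default impalad debug webserver port):

  import requests

  # Fetch the root debug page and print the CSP header, if present.
  resp = requests.get("http://localhost:25000/")
  print(resp.headers.get("Content-Security-Policy"))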

TESTING:
  The main webui tests all now validate the CSP header is present.
  A test for the new flag is also added.

Change-Id: Idc335d65b117661da0b420ddb7c9ccd80d8d76ab
Reviewed-on: http://gerrit.cloudera.org:8080/18168
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-01-25 22:52:50 +00:00
Fang-Yu Rao
351e037472 IMPALA-10934 (Part 2): Enable table definition over a single file
This patch adds an end-to-end test to validate and characterize HMS'
behavior with respect to external table creation after HIVE-25569 via
which a user is allowed to create an external table associated with a
single file.

Change-Id: Ia4f57f07a9f543c660b102ebf307a6cf590a6784
Reviewed-on: http://gerrit.cloudera.org:8080/18033
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
2022-01-05 03:32:11 +00:00
wzhou-code
9e76a8f7c3 IMPALA-10784 (part 3): Prepare to publish impala-shell on PyPi
We are going to publish impala-shell release 4.1.0a1 on PyPi.
This patch upgrades the following three Python libraries, which are used
for generating egg files when building the impala-shell tarball:
  upgrade bitarray from 1.2.1 to 2.3.0
  upgrade prettytable from 0.7.1 to 0.7.2
  upgrade thrift_sasl from 0.4.2 to 0.4.3
It also updates shell/packaging/requirements.txt with the versions of the
dependent Python libraries.

Testing:
 - Ran core tests.
 - Built impala-shell package impala_shell-4.1.0a1.tar.gz, installed
   impala-shell package from local impala_shell-4.1.0a1.tar.gz, verified
   impala-shell was installed in ~/.local/lib/python2.7/site-packages.
   Verified the version of installed impala-shell and dependent Python
   libraries as expected.
 - Set IMPALA_SHELL_HOME as ~/.local/lib/python2.7/site-packages/
   impala_shell, copied over egg files under installed impala-shell
   python package so we can run the end-to-end unit tests against
   the impala-shell installed with the package downloaded from PyPi.
   Passed end-to-end impala-shell unit tests.
 - Verified the impala-shell tarball generated by
   shell/make_shell_tarball.sh.

Change-Id: I378404e2407396d4de3bb0eea4d49a9c5bb4e46a
Reviewed-on: http://gerrit.cloudera.org:8080/17826
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-09-28 04:55:57 +00:00
wzhou-code
03a7a59f5d IMPALA-10876: Support to download JWKS from given URL
This patch added functionality to download JWKS from a given URL and
support key rotation by periodically checking the JWKS URL for updates.
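
For reference, the JWKS fetched from such a URL is plain JSON; a rough
Python sketch of inspecting one (the URL is a placeholder; the daemon
itself does this in C++ as described below):

  import requests

  # Download a JWKS and list its keys; the daemon repeats this
  # periodically to pick up rotated keys.
  jwks = requests.get("https://idp.example.com/jwks.json").json()
  for key in jwks.get("keys", []):
      print(key.get("kid"), key.get("kty"), key.get("alg"))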

We use Kudu's EasyCurl wrapper to download the file from the given URL.
curl was added to native-toolchain. This patch modified makefiles
and bootstrap_toolchain.py to integrate libcurl and libkudu_curl_util.

Added end-to-end JWT authentication test cases with the JWKS specified
as an HTTP/HTTPS URL.

Testing:
 - Passed core run, including new test cases.

Change-Id: Ic6ac8cf0010c13db30219776d1d275709bf211df
Reviewed-on: http://gerrit.cloudera.org:8080/17802
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-09-28 04:45:23 +00:00
wzhou-code
025500ccb5 IMPALA-10489: Implement JWT support
This patch added JWT support with the following functionality:
 * Load and parse JWKS from pre-installed JSON file.
 * Read the JWT token from the HTTP Header.
 * Verify the JWT's signature with public key in JWKS.
 * Get the username out of the payload of the JWT token.
 * Support following JSON Web Algorithms (JWA):
   HS256, HS384, HS512, RS256, RS384, RS512.

We use the third party library jwt-cpp to verify the JWT token. jwt-cpp
is a header-only C++ library. It was added to native-toolchain.
This patch modified bootstrap_toolchain.py to download jwt-cpp from the
toolchain s3 bucket, and modified makefiles to add jwt-cpp/include
to the include path.

Added BE unit-tests for loading JWKS file and verifying JWT token.
Also added FE custom cluster test for JWT authentication.

Testing:
 - Passed core run.

Change-Id: I6b71fa854c9ddc8ca882878853395e1eb866143c
Reviewed-on: http://gerrit.cloudera.org:8080/17435
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-07-08 23:10:32 +00:00
Daniel Becker
817ca5920d IMPALA-10640: Support reading Parquet Bloom filters - most common types
This change adds read support for Parquet Bloom filters for types that
can reasonably be supported in Impala. Other types, such as CHAR(N),
would be very difficult to support because the length may be different
in Parquet and in Impala, which results in truncation or padding, and
that changes the hash, which makes using the Bloom filter impossible.
Write support will be added in a later change.
The supported Parquet type - Impala type pairs are the following:

 ---------------------------------------
|Parquet type |  Impala type            |
|---------------------------------------|
|INT32        |  TINYINT, SMALLINT, INT |
|INT64        |  BIGINT                 |
|FLOAT        |  FLOAT                  |
|DOUBLE       |  DOUBLE                 |
|BYTE_ARRAY   |  STRING                 |
 ---------------------------------------

The following types are not supported for the given reasons:

 ----------------------------------------------------------------
|Impala type |  Problem                                          |
|----------------------------------------------------------------|
|VARCHAR(N)  | truncation can change hash                        |
|CHAR(N)     | padding / truncation can change hash              |
|DECIMAL     | multiple encodings supported                      |
|TIMESTAMP   | multiple encodings supported, timezone conversion |
|DATE        | not considered yet                                |
 ----------------------------------------------------------------

Support may be added for these types later, see IMPALA-10641.

If a Bloom filter is available for a column that is fully dictionary
encoded, the Bloom filter is not used as the dictionary can give exact
results in filtering.
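
The resulting row group pruning decision can be summarized in a small
sketch (simplified Python pseudologic, not Impala's implementation; the
ColumnChunkInfo type and its fields are made up for illustration):

  from dataclasses import dataclass
  from typing import Callable, Optional, Set

  @dataclass
  class ColumnChunkInfo:
      fully_dictionary_encoded: bool
      dictionary_values: Set[object]
      bloom_might_contain: Optional[Callable[[object], bool]] = None

  def can_skip_row_group(value, col: ColumnChunkInfo) -> bool:
      # A fully dictionary-encoded column gives exact answers, so the
      # Bloom filter is not consulted at all.
      if col.fully_dictionary_encoded:
          return value not in col.dictionary_values
      # A negative Bloom filter lookup proves the value is absent.
      if col.bloom_might_contain is not None:
          return not col.bloom_might_contain(value)
      return False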

Testing:
  - Added tests/query_test/test_parquet_bloom_filter.py that tests
    whether Parquet Bloom filtering works for the supported types and
    that we do not incorrectly discard row groups for the unsupported
    type VARCHAR. The Parquet file used in the test was generated with
    an external tool.
  - Added unit tests for ParquetBloomFilter in file
    be/src/util/parquet-bloom-filter-test.cc
  - A minor, unrelated change was done in
    be/src/util/bloom-filter-test.cc: the MakeRandom() function had
    return type uint64_t and its documentation claimed it returned a 64 bit
    random number, but the actual number of random bits is 32, which is
    what is intended in the tests. The return type and documentation
    have been corrected to use 32 bits.

Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Reviewed-on: http://gerrit.cloudera.org:8080/17026
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>
2021-06-03 06:32:45 +00:00