Commit Graph

185 Commits

stiga-huang
da190f1d86 IMPALA-14074: Warmup metadata cache in catalogd for critical tables
*Background*

Catalogd starts with a cold metadata cache - only the db/table names
and functions are loaded. A table's metadata is not loaded until a
query is submitted on the table, so the first query suffers the delay
of loading the metadata. There is a flag,
--load_catalog_in_background, to let catalogd eagerly load metadata of
all tables even if no queries come, but catalogd may then load
metadata for tables that are never used, potentially increasing the
catalog size and consequently the memory usage. So this flag is turned
off by default and is not recommended for production use.

Users do need the metadata of some critical tables to be loaded. Until
that happens, the service is considered not ready, since important
queries might fail with timeouts. When Catalogd HA is enabled, it's
also required that the standby catalogd has an up-to-date metadata
cache so it can smoothly take over from the active one when a failover
happens.

*New Flags*

This patch adds a startup flag for catalogd to specify a config file
listing the tables whose metadata users want loaded. Catalogd adds
them to the table loading queue in the background whenever a catalog
reset happens, i.e. at catalogd startup or when a global INVALIDATE
METADATA runs.

The flag is --warmup_tables_config_file. The value can be a path in the
local FS or in remote storage (e.g. HDFS). E.g.
  --warmup_tables_config_file=file:///opt/impala/warmup_table_list.txt
  --warmup_tables_config_file=hdfs:///tmp/warmup_table_list.txt

Each line in the config file can be a fully qualified table name or a
wildcard under a db, e.g. "tpch.*". Catalogd loads the table names at
startup and schedules loading of them after a reset of the catalog.
The scheduling order follows the order in the config file, so
important tables can be put first. Lines starting with "#" or "//" are
treated as comments and ignored.
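
For illustration, a warmup list in this format might look like the
following (table names are hypothetical):

  cat > /opt/impala/warmup_table_list.txt <<'EOF'
  # most critical tables first: they are scheduled for loading first
  sales.orders
  sales.customers
  // warm up every table under the tpch db
  tpch.*
  EOF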

Another flag, --keeps_warmup_tables_loaded (defaults to false), is
added to control whether to reload a table after it's been
invalidated, either explicitly by an INVALIDATE METADATA <table>
command or implicitly by CatalogdTableInvalidator or HMS RELOAD
events.

When CatalogdTableInvalidator is enabled with
--invalidate_tables_on_memory_pressure=true, users shouldn't set
keeps_warmup_tables_loaded to true unless the catalogd heap is large
enough to cache the metadata of all these tables. Otherwise, these
tables will be loaded and invalidated repeatedly.

*Catalogd HA Changes*
When Catalogd HA is enabled, the standby catalogd also resets its
catalog and starts loading the metadata of these tables after the HA
state (active/standby) is determined. The standby catalogd keeps its
metadata cache up-to-date by applying HMS notification events. To
support a warmed-up switchover,
--catalogd_ha_reset_metadata_on_failover should be set to false.

*Limitation*
The standby catalogd could still have a stale cache if there are
operations in the active catalogd that don't trigger HMS notification
events, or if an HMS notification event is not applied correctly. E.g.
adding a new native function generates an ALTER_DATABASE event, but
when applying the event, the db's native function list is not
refreshed (IMPALA-14210). These will be resolved in separate JIRAs.

*Test*
 - Added FE unit tests.
 - Added e2e test for local/hdfs config files.
 - Added e2e test to verify the standby catalogd has a warmed up cache
   when failover happens.

Change-Id: I2d09eae1f12a8acd2de945984d956d11eeee1ab6
Reviewed-on: http://gerrit.cloudera.org:8080/23155
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-07-12 18:50:56 +00:00
Riza Suminto
48c4d31344 IMPALA-14130: Remove wait_num_tables arg in start-impala-cluster.py
IMPALA-13850 changed the behavior of bin/start-impala-cluster.py to wait
for the number of tables to be at least one. This is needed to detect
that the catalog has seen at least one update. There is special logic in
dataload to start Impala without tables in that circumstance.

This broke the perf-AB-test job, which starts Impala before loading
data. There are other times when we want to start Impala without tables,
and it is inconvenient to need to specify --wait_num_tables each time.

It is actually not necessary to wait for the Coordinator's catalog
metrics to reach a certain value. The Frontend (Coordinator) will not
open its service port until it has heard the first catalog topic
update from CatalogD. IMPALA-13850 (part 2) also ensures that a
CatalogD with --catalog_topic_mode=minimal blocks serving Coordinator
requests until it begins its first reset() operation. Therefore,
waiting for the Coordinator's catalog version is no longer needed and
the --wait_num_tables parameter can be removed.
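
As an illustration of that readiness contract, a client-side wait can
be as simple as polling the service port (a hedged sketch, not part of
this patch):

  # the port only opens after the first catalog topic update, so a
  # successful trivial query implies the coordinator is ready
  until impala-shell.sh -q 'select 1' >/dev/null 2>&1; do sleep 1; done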

This patch also slightly changes the "progress log" of
start-impala-cluster.py to print the Coordinator's catalog version
instead of the number of cached DBs and tables. The sleep interval now
includes the time spent checking the Coordinator's metrics.

Testing:
- Pass dataload with updated script.
- Manually run start-impala-cluster.py in both legacy and local catalog
  mode and confirm it works.
- Pass custom cluster test_concurrent_ddls.py and test_catalogd_ha.py

Change-Id: I4a3956417ec83de4fb3fc2ef1e72eb3641099f02
Reviewed-on: http://gerrit.cloudera.org:8080/22994
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2025-06-11 13:55:12 +00:00
Riza Suminto
55feffb41b IMPALA-13850 (part 1): Wait until CatalogD active before resetting
In HA mode, CatalogD initialization can fail to complete within a
reasonable time. Log messages showed that CatalogD was blocked trying
to acquire "CatalogServer.catalog_lock_" when calling
CatalogServer::UpdateActiveCatalogd() during statestore subscriber
registration. catalog_lock_ was held by GatherCatalogUpdatesThread,
which was calling GetCatalogDelta(), which in turn waited for the Java
lock versionLock_ held by the thread doing
CatalogServiceCatalog.reset().

This patch removes the catalog reset from the JniCatalog constructor.
In turn, catalogd-server.cc is now responsible for triggering the
metadata reset (Invalidate Metadata), and does so only if:

1. It is the active CatalogD, and
2. The gathering thread has collected the first topic update, or
   CatalogD is set with a catalog_topic_mode other than "minimal".

The latter prerequisite ensures that coordinators are not blocked
waiting for a full topic update in on-demand metadata mode. This is
all managed by a new thread method, TriggerResetMetadata, that
monitors for and triggers the initial metadata reset.

Note that this is a behavior change in on-demand catalog
mode (catalog_topic_mode=minimal). Previously, on-demand catalog mode
would send the full database list in its first catalog topic update.
This behavior change is OK since coordinators can request metadata on
demand.

After this patch, catalog-server.active-status and the /healthz page
can turn true and OK respectively even while the very first metadata
reset is still ongoing. Observers that care about having fully
populated metadata should check other metrics such as catalog.num-db,
catalog.num-tables, or the /catalog page content.
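
For example, a readiness probe along these lines distinguishes the two
states (default catalogd webserver port assumed; a hedged sketch):

  # active and healthy, but possibly still resetting metadata
  curl -s http://localhost:25020/healthz
  # fully populated metadata shows up in the catalog metrics
  curl -s 'http://localhost:25020/metrics?json' | grep catalog.num-tables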

Updated the start-impala-cluster.py readiness check to wait for at
least one table to be seen by coordinators, except during
create-load-data.sh execution (there are no tables yet) and when
use_local_catalog=true (the local catalog cache does not start with
any tables). Modified startup flag checking to read the daemon's
'/varz?json' page instead of the actual command line args. Cleaned up
impala_service.py to fix some flake8 issues.

Slightly updated TestLocalCatalogCompactUpdates::test_restart_catalogd
so that unique_database cleanup succeeds.

Testing:
- Refactor test_catalogd_ha.py to reduce repeated code, use
  unique_database fixture, and additionally validate /healthz page of
  both active and standby catalogd. Changed it to test using hs2
  protocol by default.
- Run and pass test_catalogd_ha.py and test_concurrent_ddls.py.
- Pass core tests.

Change-Id: I58cc66dcccedb306ff11893f2916ee5ee6a3efc1
Reviewed-on: http://gerrit.cloudera.org:8080/22634
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2025-04-17 01:59:54 +00:00
stiga-huang
d2e495e83a IMPALA-13284: Loading test data on Apache Hive3
There are some failures in loading test data on Apache Hive 3.1.3:
 - STORED AS JSONFILE is not supported
 - STORED BY ICEBERG is not supported. Similarly, STORED BY ICEBERG
   STORED AS AVRO is not supported.
 - Missing the jar of iceberg-hive-runtime in CLASSPATH of HMS and Tez
   jobs.
 - Tables created in Impala are not translated to EXTERNAL tables in
   HMS
 - Hive INSERT on insert-only tables fails when generating
   InsertEvents (HIVE-20067).

This patch fixes the syntax issues by using the old syntax of Apache
Hive 3.1.3:
 - Convert STORED AS JSONFILE to ROW FORMAT SERDE
   'org.apache.hadoop.hive.serde2.JsonSerDe'
 - Convert STORED BY ICEBERG to STORED BY
   'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
 - Convert STORED BY ICEBERG STORED AS AVRO to the above one with
   tblproperties('write.format.default'='avro')
Most of the conversions are done in generate-schema-statements.py. One
exception is testdata/bin/load-dependent-tables.sql, where we need to
generate a new file with the conversion applied when using it.
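
As an illustration of the rewrite (table name and columns are
hypothetical), the converted Iceberg+Avro DDL submitted to Hive would
look roughly like:

  beeline -u "$JDBC_URL" -e "
    CREATE EXTERNAL TABLE ice_avro (id INT)
    STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
    TBLPROPERTIES ('write.format.default'='avro');"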

The missing jar of iceberg-hive-runtime is added into HIVE_AUX_JARS_PATH
in bin/impala-config.sh. Note that this is only needed by Apache Hive3
since CDP Hive3 has the jar of hive-iceberg-handler in its lib folder.

To fix the failure of InsertEvents, we add the patch of HIVE-20067 and
modify testdata/bin/patch_hive.sh to also recompile the submodule
standalone-metastore.

Modified some statements in
testdata/datasets/functional/functional_schema_template.sql to be more
reliable on retries.

Tests
 - Verified the testdata can be loaded in ubuntu-20.04-from-scratch

Change-Id: I8f52c91602da8822b0f46f19dc4111c7187ce400
Reviewed-on: http://gerrit.cloudera.org:8080/21657
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-08-20 07:01:21 +00:00
wzhou-code
08f8a30025 IMPALA-12910: Support running TPCH/TPCDS queries for JDBC tables
This patch adds a script to create external JDBC tables for the TPCH
and TPCDS datasets, and adds unit tests to run TPCH and TPCDS queries
against external JDBC tables with Impala-Impala federation. Note that
JDBC tables are mapping tables; they don't take additional disk space.

It fixes a race condition in the caching of SQL DataSource objects by
using a new DataSourceObjectCache class, which checks the reference
count before closing a SQL DataSource.

Adds a new query option 'clean_dbcp_ds_cache' with a default value of
true. When it is set to false, a SQL DataSource object is not closed
when its reference count reaches 0, but is kept in the cache until it
has been idle for more than 5 minutes. The flag variable
'dbcp_data_source_idle_timeout_s' is added to make the duration
configurable.

java.sql.Connection.close() sometimes fails to remove a closed
connection from the connection pool, which causes JDBC worker threads
to wait a long time for available connections from the pool. The
workaround is to call the BasicDataSource.invalidateConnection() API
to close a connection.

Two flag variables are added for the DBCP configuration properties
'maxTotal' and 'maxWaitMillis'. Note that the 'maxActive' and
'maxWait' properties were renamed to 'maxTotal' and 'maxWaitMillis'
respectively in apache.commons.dbcp v2.

Fixes a bug in database type comparison: the type strings specified by
users can be lower case or mixed case, but the code compared them
against upper-case strings.

Fixes an issue where the SQL DataSource object was not closed in
JdbcDataSource.open() and JdbcDataSource.getNext() when errors were
returned from DBCP APIs or JDBC drivers.
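
For instance, keeping the SQL DataSource cached across queries could
look like this (a hedged sketch; the db name comes from the setup
below):

  impala-shell.sh -q "set clean_dbcp_ds_cache=false; \
    select count(*) from tpcds_jdbc.store_sales;"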

testdata/bin/create-tpc-jdbc-tables.py supports creating JDBC tables
for Impala-Impala, Postgres, and MySQL. The following sample commands
create TPCDS JDBC tables for Impala-Impala federation with a remote
coordinator running at 10.19.10.86, and for a Postgres server running
at 10.19.10.86:
  ${IMPALA_HOME}/testdata/bin/create-tpc-jdbc-tables.py \
    --jdbc_db_name=tpcds_jdbc --workload=tpcds \
    --database_type=IMPALA --database_host=10.19.10.86 --clean

  ${IMPALA_HOME}/testdata/bin/create-tpc-jdbc-tables.py \
    --jdbc_db_name=tpcds_jdbc --workload=tpcds \
    --database_type=POSTGRES --database_host=10.19.10.86 \
    --database_name=tpcds --clean

TPCDS tests for JDBC tables run only in release/exhaustive builds.
TPCH tests for JDBC tables run in core and exhaustive builds, except
in Dockerized builds.

Remaining Issues:
 - tpcds-decimal_v2-q80a failed because the returned rows did not
   match the expected results for some decimal values. This will be
   fixed in IMPALA-13018.

Testing:
 - Passed core tests.
 - Passed query_test/test_tpcds_queries.py in release/exhaustive build.
 - Manually verified that only one SQL DataSource object was created
   for test_tpcds_queries.py::TestTpcdsQueryForJdbcTables since the
   query option 'clean_dbcp_ds_cache' was set to false, and that the
   SQL DataSource object was closed by the cleanup thread.

Change-Id: I44e8c1bb020e90559c7f22483a7ab7a151b8f48a
Reviewed-on: http://gerrit.cloudera.org:8080/21304
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-05-02 02:14:20 +00:00
Abhishek Rawat
f620e5d5c0 IMPALA-13015: Dataload fails due to concurrency issue with test.jceks
Move the 'hadoop credential' command used for creating test.jceks to
testdata/bin/create-load-data.sh. Earlier it was in bin/load-data.py,
which is called in parallel, and was causing failures due to race
conditions.
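
The command in question creates a keystore entry; a hedged sketch of
its shape (alias, value, and provider path are hypothetical):

  # running this once, up front, avoids the parallel race
  hadoop credential create test.key -value secret \
    -provider jceks://hdfs/test-warehouse/test.jceks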

Testing:
- Ran JniFrontendTest#testGetSecretFromKeyStore after data loading and
test ran clean.

Change-Id: I7fbeffc19f2b78c19fee9acf7f96466c8f4f9bcd
Reviewed-on: http://gerrit.cloudera.org:8080/21346
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-23 11:09:47 +00:00
Peter Rozsa
7e5bb385e1 Addendum: IMPALA-12584: Enable strict data file access by default
This change sets the default value of
'iceberg_restrict_data_file_location' to 'true' and renames the flag
to 'iceberg_allow_datafiles_in_table_location_only'. Tests related to
multiple storage locations in Iceberg tables are moved out to custom
cluster tests. During test data loading, the flag is set to 'false' to
make the creation of the 'iceberg_multiple_storage_locations' table
possible.
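
A hedged sketch of how a custom cluster test or data-load step might
relax the new default (whether the flag belongs on impalad or catalogd
is an assumption here):

  bin/start-impala-cluster.py \
    --impalad_args='--iceberg_allow_datafiles_in_table_location_only=false'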

Change-Id: Ifec84c86132a8a44d7e161006dcf51be2e7c7e57
Reviewed-on: http://gerrit.cloudera.org:8080/20874
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-01-19 19:53:08 +00:00
Riza Suminto
8661f922d3 IMPALA-12601: Add a fully partitioned TPC-DS database
The current tpcds dataset has only the store_sales table fully
partitioned and leaves the other fact tables unpartitioned. This is
intended to speed up data loading during tests. However, it is not an
accurate reflection of larger-scale TPC-DS datasets, where all fact
tables are partitioned, and the Impala planner may change the details
of a query plan when a partition column exists.

This patch adds a new dataset, tpcds_partitioned, loading a fully
partitioned TPC-DS db in parquet format named
tpcds_partitioned_parquet_snap. This dataset cannot be loaded
independently: it requires the base 'tpcds' db from the tpcds dataset
to be preloaded first. An example of how to load this dataset can be
seen in the load-tpcds-data function in
testdata/bin/create-load-data.sh.
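
Conceptually, each fact table in the new db is populated from the
preloaded base db with a dynamic-partition insert; a hedged sketch for
one table (column handling simplified):

  impala-shell.sh -q "
    insert overwrite tpcds_partitioned_parquet_snap.store_sales
    partition (ss_sold_date_sk)
    select * from tpcds.store_sales;"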

This patch also changes PlannerTest#testProcessingCost from targeting
tpcds_parquet to tpcds_partitioned_parquet_snap. Other planner tests
that currently target tpcds_parquet will gradually be changed to test
against tpcds_partitioned_parquet_snap in follow-up patches.

This addition adds a couple of seconds to the "Computing table stats"
step, but the loading itself is negligible since it is parallelized
with TPC-H and functional-query. The total loading time for the three
datasets remains similar after this patch.

This patch also adds several improvements in the following files:

bin/load-data.py:
- Log elapsed time on serial steps.

testdata/bin/create-load-data.sh:
- Rename MSG to LOAD_MSG to avoid collision with the same variable name
  in ./testdata/bin/run-step.sh

testdata/bin/generate-schema-statements.py:
- Remove redundant FILE_FORMAT_MAP.
- Add build_partitioned_load to simplify expressing partitioned insert
  query in SQL template.

testdata/datasets/tpcds/tpcds_schema_template.sql:
- Reorder schema template to load all dimension tables before fact tables.

Testing:
- Pass core tests.

Change-Id: I3a2e66c405639554f325ae78c66628d464f6c453
Reviewed-on: http://gerrit.cloudera.org:8080/20756
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-12-16 02:31:13 +00:00
gaurav1086
39adf42a30 IMPALA-12470: Support different schemes for jdbc driver url when
creating external jdbc table

This patch builds on top of IMPALA-5741 to copy the JDBC driver jar
from remote filesystems: Ozone and S3. Previously, only HDFS was
supported.

Testing:
Commented out "@skipif.not_hdfs" qualifier in files:
  - tests/query_test/test_ext_data_sources.py
  - tests/custom_cluster/test_ext_data_sources.py
1) Tested locally by running the tests:
  - impala-py.test tests/query_test/test_ext_data_sources.py
  - impala-py.test tests/custom_cluster/test_ext_data_sources.py
2) Tested using Jenkins jobs for Ozone and S3

Change-Id: I804fa3d239a4bedcd31569f2b46edb7316d7f004
Reviewed-on: http://gerrit.cloudera.org:8080/20639
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Wenzhe Zhou <wzhou@cloudera.com>
2023-11-01 23:32:10 +00:00
Fucun Chu
c2bd30a1b3 IMPALA-5741: Initial support for reading tiny RDBMS tables
This patch uses the "external data source" mechanism in Impala to
implement a data source for querying JDBC.
It has some limitations due to the restrictions of "external data
source":
  - It is not distributed, e.g. the fragment is unpartitioned. Queries
    are executed on the coordinator.
  - Queries which read the following data types from external JDBC
    tables are not supported: BINARY, CHAR, DATETIME, and COMPLEX.
  - Only binary predicates with the operators =, !=, <=, >=, <, > can
    be pushed down to the RDBMS.
  - The following data types are not supported in predicates: DECIMAL,
    TIMESTAMP, DATE, and BINARY.
  - External tables with complex-typed columns are not supported.
  - Support is limited to the following databases: MySQL, Postgres,
    Oracle, MSSQL, H2, DB2, and JETHRO_DATA.
  - Catalog V2 is not supported (IMPALA-7131).
  - DataSource objects are not persistent (IMPALA-12375).

Additional fixes are planned on top of this patch.

Source files under jdbc/conf, jdbc/dao and jdbc/exception are
replicated from Hive JDBC Storage Handler.

In order to query the RDBMS tables, the following steps should be
followed (note that existing data source tables will be rebuilt):
1. Make sure the Impala cluster has been started.

2. Copy the jar files of JDBC drivers and the data source library into
HDFS.
${IMPALA_HOME}/testdata/bin/copy-ext-data-sources.sh

3. Create an `alltypes` table in the Postgres database.
${IMPALA_HOME}/testdata/bin/load-ext-data-sources.sh

4. Create data source tables (alltypes_jdbc_datasource and
alltypes_jdbc_datasource_2).
${IMPALA_HOME}/bin/impala-shell.sh -f\
  ${IMPALA_HOME}/testdata/bin/create-ext-data-source-table.sql

5. Now queries can access the data source tables created in the last
step. There is no need to restart the Impala cluster.
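
For example (a hedged sketch using a table created in step 4):

  ${IMPALA_HOME}/bin/impala-shell.sh -q \
    "select count(*) from alltypes_jdbc_datasource;"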

Testing:
 - Added a unit test for Postgres and ran it with JDBC driver
   postgresql-42.5.1.jar.
 - Ran a manual unit test for MySQL with JDBC driver
   mysql-connector-j-8.1.0.jar.
 - Ran core tests successfully.

Change-Id: I8244e978c7717c6f1452f66f1630b6441392e7d2
Reviewed-on: http://gerrit.cloudera.org:8080/17842
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-10-10 02:13:59 +00:00
Eyizoha
2f06a7b052 IMPALA-10798: Initial support for reading JSON files
A prototype of HdfsJsonScanner implemented on top of rapidjson, which
supports scanning data from split JSON files.

The scanning of JSON data is mainly completed by two parts working
together. The first part is the JsonParser, responsible for parsing
JSON objects, implemented on top of the SAX-style API of rapidjson. It
reads data from the char stream, parses it, and calls the
corresponding callback function when it encounters the matching JSON
element. See the comments on the JsonParser class for more details.

The other part is the HdfsJsonScanner, which inherits from HdfsScanner
and provides the callback functions for the JsonParser. The callback
functions are responsible for providing data buffers to the parser and
for converting and materializing the parser's results into RowBatches.
It should be noted that the parser returns numeric values as strings
to the scanner. The scanner uses the TextConverter class to convert
the strings to the desired types, similar to how HdfsTextScanner
works. This is an advantage over directly using the numeric values
provided by rapidjson, as it eliminates concerns about inconsistencies
in converting decimals (e.g. losing precision).

Added a startup flag, enable_json_scanner, to be able to disable this
feature if we hit critical bugs in production.
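
For instance, the feature can be switched off on a dev cluster like
this (a hedged sketch):

  bin/start-impala-cluster.py \
    --impalad_args='--enable_json_scanner=false'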

Limitations
 - Multiline JSON objects are not fully supported yet. They are fine
   when each file has only one scan range. However, when a file has
   multiple scan ranges, there is a small probability that multiline
   JSON objects spanning scan range boundaries are scanned
   incompletely (in such cases, parsing errors may be reported). For
   more details, please refer to the comments in
   'multiline_json.test'.
 - Compressed JSON files are not supported yet.
 - Complex types are not supported yet.

Tests
 - Most of the existing end-to-end tests can run on JSON format.
 - Add TestQueriesJsonTables in test_queries.py for testing multiline,
   malformed, and overflow in JSON.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Reviewed-on: http://gerrit.cloudera.org:8080/19699
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-09-05 16:55:41 +00:00
Michael Smith
39fea06f2b IMPALA-11990: Make actual failures clearer
A hack to clean up after HBase fails with a large error message when
services haven't been started yet (which happens at least once in
every CI run). That error isn't useful and can be misleading for
people reviewing test logs. Suppress it.

Guards data load for Ozone, as a usable snapshot is required. Also
fixes a typo in the fixed issues.

Change-Id: Idc37d03780fca35427b977524b2b97a6892c88f7
Reviewed-on: http://gerrit.cloudera.org:8080/19459
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2023-03-10 22:23:18 +00:00
yx91490
f4d306cbca IMPALA-11629: Support for huawei OBS FileSystem
This patch adds support for the Huawei OBS (Object Storage Service)
FileSystem. The implementation is similar to other remote FileSystems.

New flags for OBS:
- num_obs_io_threads: Number of OBS I/O threads. Defaults to 16.

Testing:
 - Upload hdfs test data to an OBS bucket. Modify all locations in HMS
   DB to point to the OBS bucket. Remove some hdfs caching params.
   Run CORE tests.

Change-Id: I84a54dbebcc5b71e9bcdd141dae9e95104d98cb1
Reviewed-on: http://gerrit.cloudera.org:8080/19110
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-02-09 08:10:19 +00:00
Michael Smith
88d49b6919 IMPALA-11693: Enable allow_erasure_coded_files by default
Enables allow_erasure_coded_files by default as we've now completed all
planned work to support it.

Testing
- Ran HDFS+EC test suite
- Ran Ozone+EC test suite

Change-Id: I0cfef087f2a7ae0889f47e85c5fab61a795d8fd4
Reviewed-on: http://gerrit.cloudera.org:8080/19362
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-01-31 16:53:46 +00:00
Fang-Yu Rao
db1cac2a49 IMPALA-10399, IMPALA-11060, IMPALA-11788: Reset Ranger policy repository in an E2E test
test_show_grant_hive_privilege() uses Ranger's REST API to get all the
existing policies from the Ranger server after creating a policy that
grants the LOCK and SELECT privileges on all the tables and columns in
the unique database, in order to verify that the granted privileges
indeed exist in Ranger's policy repository.

However, the way we download all the policies from the Ranger server
in test_show_grant_hive_privilege() did not always work. Specifically,
when there were already a lot of existing policies in Ranger, the
policy that granted the LOCK and SELECT privileges would not be
included in the result returned by a single GET request. We found that
adding 300 Ranger policies before adding the policy granting those 2
privileges suffices to reproduce the issue.

Moreover, we found that even when we set the argument 'stream' of
requests.get() to True and used iter_content() to read the response in
chunks, we still could not retrieve the policy added in
test_show_grant_hive_privilege().

As a workaround, instead of changing how we download all the policies
from the Ranger server, this patch resets Ranger's policy repository
for Impala before we create the policy granting those 2 privileges, so
that the test is more resilient to the number of existing policies in
the repository.
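
For reference, the kind of REST call involved looks like this (the
endpoint shape, service name, and credentials are assumptions, not
taken from the patch):

  # list all policies of a Ranger service in a single GET request
  curl -s -u admin:admin \
    'http://localhost:6080/service/public/v2/api/service/test_impala/policy'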

Change-Id: Iff56ec03ceeb2912039241ea302f4bb8948d61f8
Reviewed-on: http://gerrit.cloudera.org:8080/19373
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
2022-12-28 01:48:26 +00:00
yacai
c953426692 IMPALA-11683: Support Aliyun OSS File System
This patch adds support for OSS (Aliyun Object Storage Service). Using
hadoop-aliyun, the implementation is similar to other remote
FileSystems.

Tests:
- Prepare:
  Initialize OSS-related environment variables:
  OSS_ACCESS_KEY_ID, OSS_SECRET_ACCESS_KEY, OSS_ACCESS_ENDPOINT.
  Compile and create hdfs test data on an ECS instance. Upload test
  data to an OSS bucket.
- Modify all locations in HMS DB to point to the OSS bucket.
  Remove some hdfs caching params. Run CORE tests.

Change-Id: I267e6531da58e3ac97029fea4c5e075724587910
Reviewed-on: http://gerrit.cloudera.org:8080/19165
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-11-16 10:14:49 +00:00
Zoltan Borok-Nagy
522ee1fcc0 IMPALA-11350: Add virtual column FILE__POSITION for Parquet tables
Virtual column FILE__POSITION returns the ordinal position of the row
in the data file. It will be useful for adding support for Iceberg's
position-based delete files.

This patch only adds FILE__POSITION for Parquet tables. It works
similarly to the handling of collection position slots, i.e. we add
the responsibility of dealing with the file position slot to an
existing column reader. Because of page filtering and late
materialization, we already track the file position in the member
'current_row_' during scanning.

Querying FILE__POSITION in other file formats raises an error.
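
For example (a hedged sketch against a standard dev-environment test
table):

  impala-shell.sh -q \
    "select file__position, id from functional_parquet.alltypes limit 3;"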

Testing:
 * added e2e tests

Change-Id: I4ef72c683d0d5ae2898bca36fa87e74b663671f7
Reviewed-on: http://gerrit.cloudera.org:8080/18704
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-12 19:21:55 +00:00
Zoltan Borok-Nagy
35dada4656 IMPALA-11419: Incremental build is broken
IMPALA-11384 broke incremental builds because it:
* added a custom target that always rewrites a few generated
  header files
** These files are included directly/indirectly by most files, so a
   large part of the project always has to be recompiled
* didn't remove the ${THRIFT_FILE_WE}_constants.cpp/h dependency from
  common/thrift/CMakeLists.txt
** These files are not generated anymore, so the build system always
   reconstructs all the generated files (because *_constants.cpp/h is
   always missing), and then rebuilds every target that depends on
   them.

IMPALA-11415 fixed a sporadic error during data loading, but it only
hid the root cause, i.e. the unnecessary recreation of thrift files.

This patch reverts IMPALA-11415:
* to make data load more parallel
* to not hide similar issues in the future

Testing
* Tested locally that the thrift files are not getting regenerated

Change-Id: Ieb0e2007f3fa0cc721bd7b272956ce206ac65b0e
Reviewed-on: http://gerrit.cloudera.org:8080/18708
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-07-11 18:40:51 +00:00
Riza Suminto
42863ee34a IMPALA-11415: Add run-step-wait-all after Kudu data loading
IMPALA-11384 revealed an issue in testdata/bin/create-load-data.sh. If
$SKIP_METADATA_LOAD is true, all three of "Loading Kudu functional",
"Loading Kudu TPCH", and "Loading Hive UDFs" run in parallel in the
background. The last background step seemingly overrode the
thrift-generated Python code under shell/gen-py/hive_metastore/ and
shell/gen-py/beeswaxd/. This in turn caused sporadic Python errors
upon invocation of bin/load-data.py in the two former Kudu background
steps. Adding run-step-wait-all after the Kudu data loading seems to
fix the issue.

Testing:
- Successfully ran create-load-data.sh with SKIP_METADATA_LOAD set to
  true.

Change-Id: I998cd1a1895f7c1bcaceb87e0592c6c0a0f6b4ea
Reviewed-on: http://gerrit.cloudera.org:8080/18701
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-07-06 14:02:35 +00:00
Michael Smith
d4cb3afe69 [tools] fix buildall.sh -testdata with prior data
The help output for buildall.sh notes running `buildall.sh -testdata`
as an option to incrementally load test data without formatting the
mini-cluster. However, trying to do that with existing data loaded
results in an error when running `hadoop fs -mkdir /test-warehouse`.
Add `-p` so this step is idempotent, allowing the example to work as
documented.
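
The fixed step is simply:

  # -p makes mkdir a no-op when the directory already exists
  hadoop fs -mkdir -p /test-warehouse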

Change-Id: Icc4ec4bb746abf53f6787fce4db493919806aaa9
Reviewed-on: http://gerrit.cloudera.org:8080/18522
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-05-16 08:19:33 +00:00
Fucun Chu
157086cb80 IMPALA-10771: Add Tencent COS support
This patch adds support for COS (Cloud Object Storage). Using
hadoop-cos, the implementation is similar to other remote FileSystems.

New flags for COS:
- num_cos_io_threads: Number of COS I/O threads. Defaults to 16.

Follow-up:
- Support for caching COS file handles will be addressed in
   IMPALA-10772.
- test_concurrent_inserts and test_failing_inserts in
   test_acid_stress.py are skipped due to slow file listing on
   COS (IMPALA-10773).

Tests:
 - Upload hdfs test data to a COS bucket. Modify all locations in HMS
   DB to point to the COS bucket. Remove some hdfs caching params.
   Run CORE tests.

Change-Id: Idce135a7591d1b4c74425e365525be3086a39821
Reviewed-on: http://gerrit.cloudera.org:8080/17503
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-12-08 16:32:02 +00:00
Yu-Wen Lai
33467a7b2e Bump up the GBN to 15549253
This patch bumps up the GBN to 15549253. It includes Fang-Yu's fix to
use the correct policy id when updating the "all - database" policy,
due to a change on the Ranger side.

Testing:
* ran the create-load-data.sh

Change-Id: Ie7776e62dad0b9bec6c03fb9ee8f1b8728ff0e69
Reviewed-on: http://gerrit.cloudera.org:8080/17746
Reviewed-by: Vihang Karajgaonkar <vihang@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-08-03 17:25:40 +00:00
Fang-Yu Rao
5aa47bf901 IMPALA-10694: Improve the error handling of setup-ranger
We found that setup-ranger continues execution if an error occurs when
i) wget is executed to initialize the environment variables
GROUP_ID_OWNER and GROUP_ID_NON_OWNER, or ii) curl is executed to
upload a revised Ranger policy, even though the -e option was set when
create-load-data.sh was executed. This patch improves the error
handling by making setup-ranger exit as soon as an error occurs, so
that no tests are run at all if there is an error.

To exit if an error occurs during wget, we separate the assignment and
the export of the environment variables: an export runs and succeeds
even when the command substitution before it fails, so combining the
assignment and the export hides the error.

On the other hand, to exit if an error occurs during curl, we add the
-f option so that an HTTP error is no longer silently ignored.
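
A minimal sketch of the two patterns (variable names and URLs are
hypothetical):

  # combined: 'export' succeeds even if wget fails, so 'set -e' never fires
  export GROUP_ID_OWNER="$(wget -qO- "$RANGER_URL")"
  # separated: the assignment carries wget's exit status, so 'set -e' aborts
  GROUP_ID_OWNER="$(wget -qO- "$RANGER_URL")"
  export GROUP_ID_OWNER
  # curl: -f turns HTTP errors into a non-zero exit status
  curl -f -X PUT -H 'Content-Type: application/json' \
    -d @policy_4_revised.json "$RANGER_POLICY_URL"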

Testing:
 - Verified that setup-ranger could be successfully executed after this
   patch.
 - Verified that setup-ranger would exit if a URL in setup-ranger is not
   correctly set up or if the 'id' field in policy_4_revised.json does
   not match the URL of the policy to be updated.

Change-Id: I45605d1a7441b734cf80249626638cde3adce28b
Reviewed-on: http://gerrit.cloudera.org:8080/17386
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-07-12 12:25:49 +00:00
Laszlo Gaal
559f6044be IMPALA-9331: Add symptom for dataload failing on schema mismatch
Streamline triaging a bit. When this fails, it does so in a specific
location, and until now you had to scan the build log to find the
problem. This JUnitXML symptom should make this failure mode obvious.

Tested by running an S3 build on private infrastructure with a knowingly
mismatched data snapshot.

Change-Id: I2fa193740a2764fdda799d6a9cc64f89cab64aba
Reviewed-on: http://gerrit.cloudera.org:8080/17242
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-04-01 15:09:10 +00:00
stiga-huang
2dfc68d852 IMPALA-7712: Support Google Cloud Storage
This patch adds support for GCS (Google Cloud Storage). Using the
gcs-connector, the implementation is similar to other remote
FileSystems.

New flags for GCS:
 - num_gcs_io_threads: Number of GCS I/O threads. Defaults to 16.

Follow-up:
 - Support for spilling to GCS will be addressed in IMPALA-10561.
 - Support for caching GCS file handles will be addressed in
   IMPALA-10568.
 - test_concurrent_inserts and test_failing_inserts in
   test_acid_stress.py are skipped due to slow file listing on
   GCS (IMPALA-10562).
 - Some tests are skipped due to issues introduced by /etc/hosts setting
   on GCE instances (IMPALA-10563).

Tests:
 - Compile and create hdfs test data on a GCE instance. Upload test data
   to a GCS bucket. Modify all locations in HMS DB to point to the GCS
   bucket. Remove some hdfs caching params. Run CORE tests.
 - Compile and load snapshot data to a GCS bucket. Run CORE tests.

Change-Id: Ia91ec956de3b620cccf6a1244b56b7da7a45b32b
Reviewed-on: http://gerrit.cloudera.org:8080/17121
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-13 11:20:08 +00:00
Joe McDonnell
97856478ec IMPALA-10198 (part 1): Unify Java in a single java/ directory
This changes all existing Java code to be submodules under
a single root pom. The root pom is impala-parent/pom.xml
with minor changes to add submodules.

This avoids most of the weird CMake/maven interactions,
because there is now a single maven invocation for all
the Java code.

This moves all the Java projects other than fe into
a top level java directory. fe is left where it is
to avoid disruption (but still is compiled via the
java directory's root pom). Various pieces of code
that reference the old locations are updated.

Based on research, there are two options for dealing
with the shaded dependencies. The first is to have an
entirely separate Maven project with a separate Maven
invocation. In this case, the consumers of the shaded
jars will see the reduced set of transitive dependencies.
The second is to have the shaded dependencies as modules
with a single Maven invocation. The consumer would see
all of the original transitive dependencies and need to
exclude them all. See MSHADE-206/MNG-5899. This chooses
the second.

This only moves code around and does not focus on version
numbers or making "mvn versions:set" work.

Testing:
 - Ran a core job
 - Verified existing maven commands from fe/ directory still work
 - Compared the *-classpath.txt files from fe and executor-deps
   and verified they are the same except for paths

Change-Id: I08773f4f9d7cb269b0491080078d6e6f490d8d7a
Reviewed-on: http://gerrit.cloudera.org:8080/16500
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2020-10-15 19:30:13 +00:00
stiga-huang
7b44b35132 IMPALA-9351: Fix tests depending on hard-coded file paths of managed tables
Some tests (e.g. AnalyzeDDLTest.TestCreateTableLikeFileOrc) depend on
hard-coded file paths of managed tables, assuming that there is always
a file named 'base_0000001/bucket_00000_0' under the table dir.
However, the file name has the form bucket_${bucket-id}_${attempt-id},
and the last part is not guaranteed to be 0: if the first attempt
fails and the second attempt succeeds, the file name will be
bucket_00000_1.
This patch replaces these hard-coded file paths to corresponding files
that are uploaded to HDFS by commands. For tests that do need to use the
file paths of managed table files, we do a listing on the table dir to
get the file names, instead of hard-coding the file paths.

Updated chars-formats.orc to contain proper column names so it can be
used in more tests. The original file only had names like col0, col1,
col2.
Tests:
 - Run CORE tests

Change-Id: Ie3136ee90e2444c4a12f0f2e1470fca1d5deaba0
Reviewed-on: http://gerrit.cloudera.org:8080/16441
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-09-12 05:30:45 +00:00
Tim Armstrong
6ec6aaae8e IMPALA-3695: Remove KUDU_IS_SUPPORTED
Testing:
Ran exhaustive tests.

Change-Id: I059d7a42798c38b570f25283663c284f2fcee517
Reviewed-on: http://gerrit.cloudera.org:8080/16085
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-06-18 01:11:18 +00:00
Joe McDonnell
f15a311065 IMPALA-9709: Remove Impala-lzo from the development environment
This removes Impala-lzo from the Impala development environment.
Impala-lzo is not built as part of the Impala build. The LZO plugin
is no longer loaded. LZO tables are not loaded during dataload,
and LZO is no longer tested.

This removes some obsolete scan APIs that were only used by Impala-lzo.
With this commit, Impala-lzo would require code changes to build
against Impala.

The plugin infrastructure is not removed, and this leaves some
LZO support code in place. If someone were to decide to revive
Impala-lzo, they would still be able to load it as a plugin
and get the same functionality as before. This plugin support
may be removed later.

Testing:
 - Dryrun of GVO
 - Modified TestPartitionMetadataUncompressedTextOnly's
   test_unsupported_text_compression() to add LZO case

Change-Id: I3a4f12247d8872b7e14c9feb4b2c58cfd60d4c0e
Reviewed-on: http://gerrit.cloudera.org:8080/15814
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2020-06-15 23:42:12 +00:00
Joe McDonnell
f241fd08ac IMPALA-9731: Remove USE_CDP_HIVE=false and Hive 2 support
Impala 4 moved to using CDP versions for components, which involves
adopting Hive 3. This removes the old code supporting CDH components
and Hive 2. Specifically, it does the following:
1. Remove USE_CDP_HIVE and default to the values from USE_CDP_HIVE=true.
   USE_CDP_HIVE now has no effect on the Impala environment. This also
   means that bin/jenkins/build-all-flag-combinations.sh no longer
   include USE_CDP_HIVE=false as a configuration.
2. Remove USE_CDH_KUDU and default to getting Kudu from the
   native toolchain.
3. Ban IMPALA_HIVE_MAJOR_VERSION<3 and remove related code, including
   the IMPALA_HIVE_MAJOR_VERSION=2 maven profile in fe/pom.xml.

There is a fair amount of code that still references the Hive major
version. Upstream Hive is now working on Hive 4, so there is a high
likelihood that we'll need some code to deal with that transition.
This leaves some code (such as maven profiles) and test logic in
place.

Change-Id: Id85e849beaf4e19dda4092874185462abd2ec608
Reviewed-on: http://gerrit.cloudera.org:8080/15869
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-07 22:14:39 +00:00
Joe McDonnell
90ab610d34 Convert dataload hdfs copy commands to LOAD DATA statements
The schema file allows specifying a commandline command in
several of the sections (LOAD, DEPENDENT_LOAD, etc). These are
executed by testdata/bin/generate-schema-statements.py when it is
creating the SQL files that are later executed for dataload. A fair
number of tables use this flexibility to execute hdfs mkdir and copy
commands via the command line.

Unfortunately, this is very inefficient. HDFS command line
commands require spinning up a JVM and can take over one
second per command. These commands are executed during a
serial part of dataload, and they can be executed multiple
times. In short, these commands are a significant slowdown
for loading the functional tables.

This converts the hdfs command line statements to equivalent Hive LOAD
DATA LOCAL statements. These do the copy from an already-running JVM,
so they avoid the JVM startup cost. They also run in the parallel part
of dataload, speeding up the SQL generation part.
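
A hedged sketch of the shape of the rewrite (paths and table names are
hypothetical):

  # before: shelling out per table, each paying ~1s of JVM startup
  hadoop fs -mkdir -p /test-warehouse/tbl
  hadoop fs -put data.csv /test-warehouse/tbl/
  # after: an equivalent statement run by the already-running Hive
  beeline -u "$JDBC_URL" \
    -e "LOAD DATA LOCAL INPATH 'data.csv' INTO TABLE functional.tbl;"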

This speeds up generate-schema-statements.py significantly.
On the functional dataset, it saves 7 minutes.
Before:
time testdata/bin/generate-schema-statements.py -w functional-query -e exhaustive -f
real    8m8.068s
user    10m11.218s
sys     0m44.932s

After:
time testdata/bin/generate-schema-statements.py -w functional-query -e exhaustive -f
real    0m35.800s
user    0m42.536s
sys     0m5.210s

This is currently a long pole in dataload, so it translates directly
to an overall speedup of about 7 minutes.

Testing:
 - Ran debug tests

Change-Id: Icf17b85ff85618933716a80f1ccd6701b07f464c
Reviewed-on: http://gerrit.cloudera.org:8080/15228
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-02-24 21:22:18 +00:00
norbert.luksa
ba903deba7 Add log of created files for data load
As Joe pointed out in IMPALA-9351, it would help debug issues with
missing files if we logged the created files when loading the data.

With this commit, running create-load-data.sh logs the created files
into created-files.log.

Change-Id: I4f413810c6202a07c19ad1893088feedd9f7278f
Reviewed-on: http://gerrit.cloudera.org:8080/15234
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-02-17 19:18:08 +00:00
Fang-Yu Rao
1b4ca58a98 IMPALA-9149: part 1: Re-enable Ranger-related FE tests
In IMPALA-9047, we disabled some Ranger-related FE and BE tests due to
changes in Ranger's behavior after upgrading Ranger from 1.2 to 2.0.
This patch aims to re-enable those disabled FE tests in
AuthorizationStmtTest.java and RangerAuditLogTest.java to increase
Impala's test coverage of authorization via Ranger.

There are at least two major changes in Ranger's behavior in the newer
versions.

1. The first is that the owner of the requested resource no longer has
to be explicitly granted privileges in order to access the resource.

2. The second is that a user not explicitly granted the privilege of
creating a database is able to do so.

Due to these changes, some previous Ranger authorization requests that
were expected to be rejected are now granted after the upgrade.
To re-enable the tests affected by the first change described above,
we modify AuthorizationTestBase.java to allow our FE Ranger
authorization tests to specify the requesting user. Those tests failed
after the upgrade because the default requesting user in Impala's
AuthorizationTestBase.java happens to be the owner of the resources
involved in our FE authorization tests. After this patch, a requesting
user can be either a non-owner user or an owner user in a Ranger
authorization test, and defaults to a non-owner user if not explicitly
specified. Note that in a Sentry authorization test, we do not default
to the non-owner user as we do in the Ranger authorization tests.
Instead, we set the name of the requesting user to the same name as
the owner user in the Ranger authorization tests, to avoid having to
provide a customized group mapping service when instantiating a Sentry
ResourceAuthorizationProvider as we do in AuthorizationTest.java, our
FE tests specifically for testing authorization via Sentry.

On the other hand, to re-enable the tests affected by the second
change, we remove from the Ranger policy for all databases the allowed
condition that grants any user the privilege of creating a database,
which is not granted by default by Sentry. After the removal of the
allowed condition, the tests in AuthorizationStmtTest.java and
RangerAuditLogTest.java affected by the second change result in the
same authorization errors as before the Ranger upgrade.

Testing:
- Passed AuthorizationStmtTest.java in a local dev environment
- Passed RangerAuditLogTest.java in a local dev environment

Change-Id: I228533aae34b9ac03bdbbcd51a380770ff17c7f2
Reviewed-on: http://gerrit.cloudera.org:8080/14798
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-12-20 11:08:23 +00:00
Joe McDonnell
fc4a91cf8c IMPALA-9165: Add timeout for create-load-data.sh
This converts the existing bin/run-all-tests-timeout-check.sh into a
more generic bin/script-timeout-check.sh and uses the new script for
both bin/run-all-tests.sh and testdata/bin/create-load-data.sh. The
new script takes two arguments:
 -timeout : timeout in minutes
 -script_name : name of the calling script
The script_name is used in debugging output and output filenames to
make it clear what timed out.

The run-all-tests.sh timeout remains the same.
testdata/bin/create-load-data.sh uses a 2.5-hour timeout. This should
help debug the issue in IMPALA-9165, because at least the logs would
be preserved on the Jenkins job.
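
A hedged sketch of how a caller might launch the watchdog (the exact
invocation in the callers may differ):

  # 150 minutes = the 2.5 hour budget given to create-load-data.sh
  bin/script-timeout-check.sh -timeout 150 -script_name create-load-data &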

Testing:
 - Tested the timeout script by hand with a caller script that
   sleeps longer than the timeout
 - Ran a gerrit-verify-dryrun-external

Change-Id: I19d76bd8850c7d4b5affff4d21f32d8715a382c6
Reviewed-on: http://gerrit.cloudera.org:8080/14741
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2019-11-20 21:59:26 +00:00
Joe McDonnell
f8c8fa5b45 IMPALA-9150: Use HBase's stop-hbase.sh script for minicluster
testdata/bin/kill-hbase.sh currently uses the generic
kill-java-service.sh script to kill the region servers, then the
master, and then the zookeeper. Recent versions of HBase become
unusable after this type of shutdown. The master seems to get stuck
trying to recover, even after restarting the minicluster.

The root cause in HBase is unclear, but HBase provides the
stop-hbase.sh script, which does a more graceful shutdown. This
switches testdata/bin/kill-hbase.sh to use that script, which avoids
the recovery problems.

Testing:
 - Ran the test-with-docker.py tests (which does a minicluster
   restart). Before the change, the HBase tests timed out due
   to HBase getting stuck recovering. After the change, tests
   ran normally.
 - Added a minicluster restart after dataload so that this
   is tested.

Change-Id: I67283f9098c73c849023af8bfa7af62308bf3ed3
Reviewed-on: http://gerrit.cloudera.org:8080/14697
Reviewed-by: Vihang Karajgaonkar <vihang@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-11-13 19:23:33 +00:00
Csaba Ringhofer
df2c6f200f IMPALA-8841: Try to fix Tez related dataload flakiness
The flakiness may be related to starting Hive queries in parallel,
which triggers initializing Tez resources in parallel (only needed at
the first statement that uses Tez). Running a non-parallel statement
first may solve the issue.

Also includes a fix for a recent issue in 'build-and-copy-hive-udfs'
introduced by the version bump
in https://gerrit.cloudera.org/#/c/14043/

Change-Id: Id21d57483fe7a4f72f450fb71f8f53b3c1ef6327
Reviewed-on: http://gerrit.cloudera.org:8080/14081
Reviewed-by: Vihang Karajgaonkar <vihang@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2019-08-16 23:00:01 +00:00
Todd Lipcon
3567a2b5d4 IMPALA-8369 (part 4): Hive 3: fixes for functional dataset loading
This fixes three issues for functional dataset loading:

- works around HIVE-21675, a bug in which 'CREATE VIEW IF NOT EXISTS'
  does not function correctly in our current Hive build. This has
  already been fixed, but the workaround is pretty simple, and the
  'drop and recreate' pattern is actually used more widely for data
  loading than the 'create if not exists' one.

- Moves the creation of the 'hive_index' table from
  load-dependent-tables.sql to a new load-dependent-tables-hive2.sql
  file which is only executed on Hive 2.

- Moving from MR to Tez execution changed the behavior of data loading
  by disabling the auto-merging of small files. With Hive-on-MR, this
  behavior defaulted to true, but with Hive-on-Tez it defaults to
  false. The change is likely motivated by the fact that Tez
  automatically groups small splits on the _input_ side and is thus
  less likely to produce lots of small files. However, that grouping
  functionality doesn't work properly in localhost clusters (TEZ-3310),
  so we aren't seeing the benefit. So, this patch enables the
  post-process merging of small files.

  Prior to this change, the 'alltypesaggmultifilesnopart' test table
  was getting 40+ files, which broke various planner tests. With the
  change, it gets the expected 4 files.

Change-Id: Ic34930dc064da3136dde4e01a011d14db6a74ecd
Reviewed-on: http://gerrit.cloudera.org:8080/13251
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-05-15 11:00:45 +00:00
Todd Lipcon
9075099c27 Drop statestore update frequency during data loading
The statestore update frequency is the limiting factor for most DDL
statements. This improved the speed of an incremental data load of the
functional dataset by 5-10x or so on my machine, in the case where
data had previously been loaded.
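
A hedged sketch of the kind of change (the flag name is assumed from
the statestore's standard options, not quoted from this patch):

  bin/start-impala-cluster.py \
    --state_store_args='--statestore_update_frequency_ms=50'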

Change-Id: I8931a88aa04e0b4e8ef26a92bfe50a539a3c2505
Reviewed-on: http://gerrit.cloudera.org:8080/13260
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2019-05-10 06:27:41 +00:00
Tim Armstrong
0a9ea803d2 IMPALA-7290: part 1: clean up shell tests
This sets up the tests to be extensible so that the shell can be
tested in both beeswax and HS2 modes.

Testing:
* Add test dimension containing only beeswax in preparation
  for HS2 dimension.
* Factor out hardcoded ports.
* Add tests for formatting of all types and NULL values.
* Merge date shell test into general type tests.
* Added testing for floating point output formatting, which does
  change as a result of switching to server-side vs client-side
  formatting.
* Use unique_database for tests that create tables.

Change-Id: Ibe5ab7f4817e690b7d3be08d71f8f14364b84412
Reviewed-on: http://gerrit.cloudera.org:8080/13083
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-04-30 11:30:45 +00:00
David Knupp
df2d9f1333 IMPALA-8346: Don't create FE testcase files unless testing locally
The same test data setup scripts get called when loading data for
mini-cluster testing and when testing against a real deployed cluster.
Unfortunately, we're seeing more and more that not all setup steps
apply equally in both situations.

This patch avoids one such example: it skips the creation of the TPCDS
testcase files that are used by the FE Java tests. These tests don't
run against deployed clusters.
Change-Id: Ibe11d7cb50d9e2657152c94f8defcbc69ca7e1ba
Reviewed-on: http://gerrit.cloudera.org:8080/12958
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-04-13 03:48:09 +00:00
Austin Nobis
8bb366573c IMPALA-8393: Skip ranger setup for unsupported environments
Previously, the setup-ranger step in create-load-data.sh was
hard-coded with localhost as the host for Ranger. This patch makes it
possible to skip the Ranger setup by using the -skip_ranger flag. The
script was also updated to set the SKIP_RANGER variable when the
REMOTE_LOAD environment variable is set.
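
For example (a hedged sketch):

  # skip Ranger setup on environments where Ranger isn't running
  testdata/bin/create-load-data.sh -skip_ranger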

Testing:
- Testing was performed by calling the script with and without the
  -skip_ranger flag set, as well as with and without the REMOTE_LOAD
  environment variable set.

Change-Id: Ie81dda992cf29792468580b182e570132d5ce0a1
Reviewed-on: http://gerrit.cloudera.org:8080/12957
Reviewed-by: Fredy Wijaya <fwijaya@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-04-09 01:17:51 +00:00
Austin Nobis
17ed50b5ce IMPALA-8226: Add grant/revoke to/from group for Ranger
This patch adds support for GRANT privilege statements to GROUP and
REVOKE privilege statements from GROUP. The grammar has been updated
to support FROM GROUP and TO GROUP for GRANT/REVOKE statements, i.e.:

GRANT <privilege> ON <resource> TO GROUP <group>
REVOKE <privilege> ON <resource> FROM GROUP <group>
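
A concrete instance of the new grammar (object names hypothetical):

  impala-shell.sh -q \
    "GRANT SELECT ON DATABASE functional TO GROUP analysts;"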

Currently, only Ranger's authorization implementation supports GROUP
based privileges. Sentry will throw an UnsupportedOperationException if
it is the enabled authorization provider and this new grammar is used.

Testing:
- AuthorizationStmtTest was updated to also test for GROUP
  authorization.
- ToSqlTest was updated to test for GROUP changes to the grammar.
- A GROUP based E2E test was added to test_ranger.py
- ParserTest was updated to test combinations for GrantRevokePrivilege
- AnalyzeAuthStmtsTest was updated to test for USER and GROUP identities
- Ran all FE tests
- Ran authorization E2E tests

Change-Id: I28b7b3e4c776ad1bb5bdc184c7d733d0b5ef5e96
Reviewed-on: http://gerrit.cloudera.org:8080/12914
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-04-05 00:04:37 +00:00
Austin Nobis
515ded0035 IMPALA-7918: Remove support for authorization policy file
This patch removes support for the authorization_policy_file flag.
When the flag is passed, the backend issues a warning message that the
flag is being ignored.

Tests relying on the authorization_policy_file flag have been updated
to rely on the Sentry server instead.

Testing:
- Ran all FE tests
- Ran all E2E tests

Change-Id: Ic2a52c2d5d35f58fbff8c088fb0bf30169625ebd
Reviewed-on: http://gerrit.cloudera.org:8080/12637
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-03-25 20:23:33 +00:00
Fredy Wijaya
656a2e8af0 IMPALA-8100: Add initial support for Ranger
This patch adds initial support for Ranger, which can be enabled via
the following flags in both impalad and catalogd to do enforcement.
- ranger_service_type=hive
- ranger_app_id=some_app_id
- authorization_factory_class=\
    org.apache.impala.authorization.ranger.RangerAuthorizationFactory
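
A minimal sketch of how these flags might be passed at startup (the
app id value is illustrative):

  impalad --ranger_service_type=hive --ranger_app_id=impala \
    --authorization_factory_class=org.apache.impala.authorization.ranger.RangerAuthorizationFactory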

The Ranger plugin for Impala uses the Hive service definition to allow
sharing Ranger policies between Hive and Impala. For now, the REFRESH
privilege uses the "read" access type; it will be updated in a later
patch once Ranger supports a "refresh" access type.

There's a change in the DESCRIBE <table> privilege requirement: ANY
privilege is now used instead of the VIEW_METADATA privilege as the
first-level check, to play nicely with Ranger. This is not a security
risk, since the column-level filtering logic after the first-level
check still uses the VIEW_METADATA privilege to filter out unauthorized
column access. In other words, as long as the user holds any privilege
on the given table, DESCRIBE <table> may return an empty result instead
of an authorization error. For example (illustrative names): a user
holding a privilege on only one column of db.t can still run DESCRIBE
db.t and sees the output filtered down to that column.

This patch updates AuthorizationStmtTest with a parameterized test that
runs the tests against Sentry and Ranger.

Testing:
- Updated AuthorizationStmtTest with Ranger
- Ran all FE tests
- Ran all E2E authorization tests

Change-Id: I8cad9e609d20aae1ff645c84fd58a02afee70276
Reviewed-on: http://gerrit.cloudera.org:8080/12632
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-03-21 19:08:12 +00:00
stiga-huang
9686545bfd IMPALA-6503: Support reading complex types from ORC
We've already supported reading primitive types from ORC files
(IMPALA-5717). In this patch we add support for complex types
(struct/array/map).

In IMPALA-5717, we leverage the ORC lib to parse ORC binaries (data in
I/O buffers read from DiskIoMgr). The ORC lib can materialize ORC
column binaries into its own representation (orc::ColumnVectorBatch).
Then we transform values in orc::ColumnVectorBatch into impala::Tuples
in hdfs-orc-scanner. We don't need to do anything about
decoding/decompression since those are handled by the ORC lib.
Fortunately, the ORC lib already supports complex types, so we can
leverage it for complex types as well.

What we need to add in IMPALA-6503 are two things:
1. Specify which nested columns we need in the form required by the ORC
  lib (Get list of ORC type ids from tuple descriptors)
2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into
  Impala's representation (Slots/Tuples/RowBatches)

To perform this materialization, we implement several ORC column
readers in hdfs-orc-scanner. Each kind of reader handles one column
type and transforms outputs of the ORC lib into tuple/slot values.
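
Once this lands, Impala's existing complex-type query syntax applies to
ORC tables as well, e.g. (table and column names are illustrative):

  impala-shell> SELECT t.id, a.item FROM orc_nested_tbl t, t.int_arr a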

Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.
* Run exhaustive tests in DEBUG and RELEASE builds.

Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Reviewed-on: http://gerrit.cloudera.org:8080/12168
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-03-08 04:39:08 +00:00
Joe McDonnell
6938831ae9 Turn off shell debug tracing for create-load-data.sh
This removes a "set -x" from testdata/bin/create-load-data.sh.

Change-Id: I524ec48d0264f6180a13d6d068832809bcc86596
Reviewed-on: http://gerrit.cloudera.org:8080/12398
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-02-12 23:25:28 +00:00
Bharath Vissapragada
f7df8adfae IMPALA-5872: Testcase builder for query planner
Implements a new testcase builder for simulating query plans
from one cluster on a different cluster/minicluster with
different number of nodes. The testcase is collected from one
cluster and can be replayed on any other cluster. It includes
all the information that is needed to replay the query plan
exactly as in the source cluster.

Also adds a stand-alone tool (PlannerTestCaseLoader) that can
replay the testcase without having to start an actual cluster
or a dev minicluster. This is done to make testcase debugging
simpler.

Motivation:
----------
- Make query planner issues easily reproducible
- Improve user experience while collecting query diagnostics
- Make it easy to test new planner features by exercising them on
  customer use cases collected from much larger clusters.

Commands:
--------
-- Collect testcase for a query stmt (outputs the testcase file path).
impala-shell> COPY TESTCASE TO <hdfs dirpath> <query stmt>

-- Load the testcase metadata in a target cluster (dumps the query stmt)
impala-shell> COPY TESTCASE FROM <hdfs testcase file path>
-- Replay the query plan
impala-shell> SET PLANNER_DEBUG_MODE=true
impala-shell> EXPLAIN <query stmt>

How it works:
------------
- During export on the source cluster, the command dumps all the thrift
  states of referenced objects in the query into a gzipped binary file.
- During replay on a target cluster, it adds these objects to the catalog
  cache by faking them as DDLs.
- The planner also fakes the number of hosts by using the scan range
  information from the source cluster.

Caveats:
------
- Tested to work with HDFS tables. Tables based on other filesystems like
  HBase/Kudu may not work as desired.
- The tool does not collect actual data files for the tables. Only the
  metadata state is dumped.
- Currently only imports databases/tables/views. We can extend it to
  work for UDFs etc.
- It only works for QueryStmts (select/union queries)
- On a sentry enabled cluster, the role running the query requires
  VIEW_METADATA privilege on every db/table/view referenced in the query
  statement.
- Once the metadata dump is loaded on a target cluster, the state is
  volatile. Hence it does not survive a cluster restart or INVALIDATE
  METADATA.
- Loading a testcase requires setting the query option (SET
  PLANNER_DEBUG_MODE=true) so that the planner knows to fake the number
  of hosts. Otherwise it takes into account the local cluster topology.
- Cross version compatibility of testcases needs some thought. For
  example, creating a testcase from Impala version 3.2 and trying to
  replay it on Impala version 3.5. This could be problematic if we don't
  keep the underlying thrift structures backward compatible.

Change-Id: Iec83eeb2dc5136768b70ed581fb8d3ed0335cb52
Reviewed-on: http://gerrit.cloudera.org:8080/12221
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-02-09 03:59:10 +00:00
Tim Armstrong
236b9194d3 IMPALA-7988: support loading data with dockerized Impalas
This patch does the work to load data and run some end-to-end
query tests on a dockerised cluster. Changes were required
in start-impala-cluster.py/ImpalaCluster and in some configuration
files.

ImpalaCluster is used for various things, including discovering
service ports and testing for cluster readiness. This patch adds
basic docker support to ImpalaCluster and uses it from
start-impala-cluster.py to check for cluster readiness. Some logic
is moved from start-impala-cluster.py to ImpalaCluster.

Limitations:
* We're fairly inconsistent about whether services listen only on
  a single interface (e.g. loopback, traditionally) or on all
  interfaces. This doesn't fix all of those issues. E.g. HDFS
  datanodes listen on all interfaces to work around some of them.
* Many tests don't pass yet, particularly those using
  ImpalaCluster(), which isn't initialised with the appropriate
  docker arguments.

Testing:
Did a full data load locally using a dockerised Impala cluster:

  START_CLUSTER_ARGS="--docker_network=impala-cluster" \
  TEST_START_CLUSTER_ARGS="--docker_network=impala-cluster" \
  ./buildall.sh -format -testdata -ninja -notests -skiptests -noclean

Ran a selection of end-to-end tests touching HDFS, Kudu and HBase
tables after I loaded data locally.

Ran exhaustive tests with non-dockerised impala cluster.

Change-Id: I98fb9c4f5a3a3bb15c7809eab28ec8e5f63ff517
Reviewed-on: http://gerrit.cloudera.org:8080/12189
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-01-18 21:33:16 +00:00
Joe McDonnell
70fbd1df44 IMPALA-7871: Don't load Hive builtins
Dataload has a "Loading Hive builtins" step that
loads a bunch of jars into HDFS/S3/etc. Despite
its name, nothing seems to be using these jars.
Dataload and core tests succeed without this step.

This removes the Hive builtins step and associated
scripts.

Change-Id: Iaca5ffdaca4b5506e9401b17a7806d37fd7b1844
Reviewed-on: http://gerrit.cloudera.org:8080/11944
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-11-19 23:33:20 +00:00
David Knupp
6e5ec22b12 IMPALA-7399: Emit a junit xml report when trapping errors
This patch will cause a junitxml file to be emitted in the case of
errors in build scripts. Instead of simply echoing a message to the
console, we set up a trap function that also writes out to a
junit xml report that can be consumed by jenkins.impala.io.
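
A minimal sketch of the trap pattern (the function name and the
generate_junitxml.py flags are illustrative, not the exact contents of
report_build_error.sh):

  function on_build_error() {
    # Emit a junitxml failure report in addition to the console message.
    "${IMPALA_HOME}"/lib/python/impala_py_lib/jenkins/generate_junitxml.py \
        --step "$0" --error "failed at line $1"
    echo "ERROR in $0 at line $1" >&2
  }
  trap 'on_build_error $LINENO' ERR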

Main things to pay attention to:

- New file that gets sourced by all bash scripts to set up trapping
  within bash scripts:

  https://gerrit.cloudera.org/c/11257/1/bin/report_build_error.sh

- Installation of the python lib into impala-python venv for use
  from within python files:

  https://gerrit.cloudera.org/c/11257/1/bin/impala-python-common.sh

- Change to the generate_junitxml.py file itself, for ease of use:

  https://gerrit.cloudera.org/c/11257/1/lib/python/impala_py_lib/jenkins/generate_junitxml.py

Most of the other changes are to source the new report_build_error.sh
script to set up the trap function.

Change-Id: Idd62045bb43357abc2b89a78afff499149d3c3fc
Reviewed-on: http://gerrit.cloudera.org:8080/11257
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-08-23 18:33:58 +00:00