239 Commits

Author SHA1 Message Date
Csaba Ringhofer
780e6683a2 IMPALA-14573: port critical geospatial functions to c++ (part 1)
This commit contains the simpler parts from
https://gerrit.cloudera.org/#/c/20602

This mainly means accessors for the header of the binary
format and bounding box check (st_envIntersects).
New tests for not yet covered functions / overloads are also added.

For details of the binary format see be/src/exprs/geo/shape-format.h
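
As a rough illustration of the bounding box check, the sketch below tests
whether two envelopes overlap. The `Envelope` type and the `env_intersects`
name are hypothetical, and the sketch works on plain floats rather than on
the binary header described in shape-format.h. Treating boxes that only
touch as intersecting is an assumption, not something this commit states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """Axis-aligned bounding box (hypothetical helper type)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def env_intersects(a: Envelope, b: Envelope) -> bool:
    """Return True if the two boxes share at least one point.

    Assumption: boxes that only touch on an edge or corner count as
    intersecting.
    """
    return (a.xmin <= b.xmax and b.xmin <= a.xmax and
            a.ymin <= b.ymax and b.ymin <= a.ymax)
```

The check only needs the four envelope values per geometry, which is why it
can run without (de)serializing the full geometry.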

Differences from the PR above:

Only a subset of functions is added. The criteria were:
1. the native function must be fully compatible with the Java version*
2. must not rely on (de)serializing the full geometry
3. the function must be tested

1 implies 2 because (de)serialization is not implemented yet in
the original patch for >2d geometries, which would break compatibility
with the Java version for XYZ/XYM/XYZM geometries.

*: there are 2 known differences:
 1. NULL handling: the Java functions return an error instead of NULL
    when getting a NULL parameter
 2. st_envIntersects() doesn't check if the SRID matches - the Java
    library looks inconsistent about this

Because the native functions are fairly safe replacements for the Java
ones, they are always used when geospatial_library=HIVE_ESRI.

Change-Id: I0ff950a25320549290a83a3b1c31ce828dd68e3c
Reviewed-on: http://gerrit.cloudera.org:8080/23700
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-12-06 07:50:23 +00:00
jichen0919
7e29ac23da IMPALA-14092 Part2: Support querying of paimon data table via JNI
This patch mainly implements querying of paimon data tables
through a JNI-based scanner.

Features implemented:
- support column pruning.
The partition pruning and predicate push down will be submitted
as the third part of the patch.

We implemented this by treating the paimon table as a normal
unpartitioned table. When querying a paimon table:
- PaimonScanNode decides which paimon splits need to be scanned,
  and then transfers the splits to BE to do the jni-based scan operation.

- We also collect the required columns that need to be scanned,
  and pass the columns to the scanner for column pruning. This is
  implemented by passing the field ids of the columns to BE,
  instead of column positions, to support schema evolution.

- In the original implementation, PaimonJniScanner passed the
  paimon row object directly to BE, which called the corresponding
  paimon row field accessors (java methods) to convert row fields to
  impala row batch tuples. We found this slow due to the overhead of
  JVM method calls.
  To minimize the overhead, we reworked the implementation:
  PaimonJniScanner now converts the paimon row batches to an
  arrow record batch, which stores data in the offheap region of the
  impala JVM, and passes the arrow offheap record batch memory
  pointer to BE.
  BE PaimonJniScanNode reads data directly from the JVM offheap
  region and converts the arrow record batch to an impala row batch.

  The benchmark shows the latter implementation is 2.x better
  than the original implementation.

  The lifecycle of the arrow record batch is as follows:
  the arrow record batch is generated in FE and passed to BE.
  After the record batch is imported to BE successfully,
  BE is in charge of freeing it.
  There are two free paths: the normal path and the
  exception path. On the normal path, when the arrow batch
  is totally consumed by BE, BE calls jni to fetch the next arrow
  batch. In this case, the arrow batch is freed automatically.
  The exception path is taken when the query is cancelled or a memory
  allocation fails. In these corner cases, the arrow batch is freed in
  the close method if it is not totally consumed by BE.
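
The two free paths can be sketched as follows. This is a simplified model,
not the scanner code from this patch: `FakeBatch`, `BatchConsumer`, and
`fetch_next` are hypothetical stand-ins for the imported arrow record batch,
the BE-side consumer, and the JNI call. The patch describes the normal-path
free as automatic; the sketch makes that step explicit.

```python
from typing import Callable, Optional


class FakeBatch:
    """Stand-in for an imported arrow record batch (hypothetical)."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.released = False

    def release(self):
        self.released = True


class BatchConsumer:
    """Holds at most one batch at a time and is responsible for freeing it."""

    def __init__(self, fetch_next: Callable[[], Optional[FakeBatch]]):
        self._fetch_next = fetch_next  # stands in for the JNI call
        self._current: Optional[FakeBatch] = None

    def next_batch(self) -> Optional[FakeBatch]:
        # Normal path: the previous batch is fully consumed, so free it
        # before asking for the next one.
        self._release_current()
        self._current = self._fetch_next()
        return self._current

    def close(self):
        # Exception path (cancellation, failed allocation): free whatever
        # batch is still held.
        self._release_current()

    def _release_current(self):
        if self._current is not None and not self._current.released:
            self._current.release()
        self._current = None
```

Keeping a single owner for the current batch means both paths go through the
same release helper, so a batch cannot be freed twice.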

Impala data types currently supported for queries:
- BOOLEAN
- TINYINT
- SMALLINT
- INTEGER
- BIGINT
- FLOAT
- DOUBLE
- STRING
- DECIMAL(P,S)
- TIMESTAMP
- CHAR(N)
- VARCHAR(N)
- BINARY
- DATE

TODO:
    - Patches pending submission:
        - Support tpcds/tpch data-loading
          for paimon data table.
        - Virtual Column query support for querying
          paimon data table.
        - Query support with time travel.
        - Query support for paimon meta tables.
    - WIP:
        - Snapshot incremental read.
        - Complex type query support.
        - Native paimon table scanner, instead of
          jni based.

Testing:
    - Create test tables in functional_schema_template.sql
    - Add TestPaimonScannerWithLimit in test_scanners.py
    - Add test_paimon_query in test_paimon.py.
    - The tpcds/tpch tests already pass for paimon tables. The
      test table data is currently generated by spark, which is
      not supported by impala now; we have to do this since hive
      doesn't support generating paimon tables for dynamic-partitioned
      tables. We plan to submit a separate patch for tpcds/tpch data
      loading and the associated tpcds/tpch query tests.
    - JVM offheap memory leak tests: ran looped tpch tests for
      1 day; no obvious offheap memory increase was observed, and
      offheap memory usage stayed within 10M.

Change-Id: Ie679a89a8cc21d52b583422336b9f747bdf37384
Reviewed-on: http://gerrit.cloudera.org:8080/23613
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2025-12-05 18:19:57 +00:00
jasonmfehr
789991c6cc IMPALA-13237: [Patch 8] - OpenTelemetry Traces for DML/DDL Queries and Handle Leading Comments
Trace DML/DDL Queries
* Adds tracing for alter, compute, create, delete, drop, insert,
  invalidate metadata, and with queries.
* Stops tracing beeswax queries since that protocol is deprecated.
* Adds Coordinator attribute to Init and Root spans for identifying
  where the query is running.

Comment Handling
* Corrects handling of leading comments, both inline and full line.
  Previously, queries with comments before the first keyword were
  always ignored.
* Adds be ctest tests for determining whether or not a query should
  be traced.
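
A minimal sketch of the idea, assuming the trace decision is made from the
first keyword after any leading comments. The `first_keyword` and
`should_trace` helpers and the `TRACED_KEYWORDS` set are hypothetical; the
actual backend logic may differ, for example in how it treats
"invalidate metadata".

```python
from typing import Optional

# Hypothetical keyword set built from the statement types named in this
# commit message, plus select.
TRACED_KEYWORDS = {"alter", "compute", "create", "delete", "drop",
                   "insert", "invalidate", "select", "with"}


def first_keyword(sql: str) -> Optional[str]:
    """Return the first keyword, lowercased, skipping leading whitespace,
    `-- line` comments, and `/* block */` comments.

    Returns None if there is no keyword or a block comment is unterminated.
    """
    i, n = 0, len(sql)
    while i < n:
        if sql[i].isspace():
            i += 1
        elif sql.startswith("--", i):
            end = sql.find("\n", i)
            if end == -1:
                return None
            i = end + 1
        elif sql.startswith("/*", i):
            end = sql.find("*/", i + 2)
            if end == -1:
                return None
            i = end + 2
        else:
            break
    # The keyword ends at the first character that is not a letter, digit,
    # or underscore, so a newline or an inline comment right after it is fine.
    j = i
    while j < n and (sql[j].isalnum() or sql[j] == "_"):
        j += 1
    return sql[i:j].lower() or None


def should_trace(sql: str) -> bool:
    return first_keyword(sql) in TRACED_KEYWORDS
```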

General Improvements
* Handles the case where the first query keyword is followed by a
  newline character or an inline comment (without or with spaces
  between).
* Corrects traces for errored/cancelled queries. These cases
  short-circuit the normal query processing code path and have to be
  handled accordingly.
* Ends the root span when the query ends instead of waiting for the
  ClientRequestState to go out of scope. This change removes
  use-after-free issues caused by reading from ClientRequestState
  when the SpanManager went out of scope during that object's dtor.
* Simplifies minimum tls version handling because the validators
  on the ssl_minimum_version eliminate invalid values that previously
  had to be accounted for.
* Removes the unnecessary otel_trace_enabled() function.
* Fixes IMPALA-14314 by waiting for the full trace to be written to
  the output file before asserting that trace.

Testing
* Full test suite passed.
* ASAN/TSAN builds passed.
* Adds new ctest test.
* Adds custom cluster tests to assert traces for the new supported
  query types.
* Adds custom cluster tests to assert traces for errored and
  cancelled queries.

Generated-by: Github Copilot (Claude Sonnet 3.7)
Change-Id: Ie9e83d7f761f3d629f067e0a0602224e42cd7184
Reviewed-on: http://gerrit.cloudera.org:8080/23279
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2025-09-03 04:38:36 +00:00
jasonmfehr
2ad6f818a5 IMPALA-13237: [Patch 5] - Implement OpenTelemetry Traces for Select Queries Tracking
Adds representation of Impala select queries using OpenTelemetry
traces.

Each Impala query is represented as its own individual OpenTelemetry
trace. The one exception is retried queries which will have an
individual trace for each attempt. These traces consist of a root span
and several child spans. Each child span has the root as its parent.
No child span has another child span as its parent. Each child span
represents one high-level query lifecycle stage. Each child span also
has span attributes that further describe the state of the query.

Child spans:
  1. Init
  2. Submitted
  3. Planning
  4. Admission Control
  5. Query Execution
  6. Close
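
The shape of one trace can be modelled roughly as below. This is a plain
Python sketch that does not use the OpenTelemetry SDK; the `Span` and
`QueryTrace` classes and the "Root" span name are hypothetical and exist only
to show that every child span has the root as its parent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

CHILD_SPAN_NAMES = ["Init", "Submitted", "Planning", "Admission Control",
                    "Query Execution", "Close"]


@dataclass
class Span:
    name: str
    parent: Optional["Span"] = None
    attributes: Dict[str, str] = field(default_factory=dict)
    ended: bool = False


class QueryTrace:
    """One trace per query attempt: a root span plus flat child spans."""

    def __init__(self, query_id: str):
        self.query_id = query_id
        self.root = Span(name="Root")  # hypothetical name
        self.children: List[Span] = []

    def start_child(self, name: str, **attrs: str) -> Span:
        if name not in CHILD_SPAN_NAMES:
            raise ValueError(f"unknown query stage: {name}")
        # "ErrorMsg" is present on every child span; it is empty when the
        # phase ended without an error.
        attributes = {"ErrorMsg": "", **attrs}
        span = Span(name=name, parent=self.root, attributes=attributes)
        self.children.append(span)
        return span

    def close(self) -> None:
        # The root span is closed once the query has closed.
        for span in self.children:
            span.ended = True
        self.root.ended = True
```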

Each child span contains a mix of universal attributes (available on
all spans) and query phase specific attributes. For example, the
"ErrorMsg" attribute, present on all child spans, is the error
message (if any) at the end of that particular query phase. One
example of a child span specific attribute is "QueryType" on the
Planning span. Since query type is first determined during query
planning, the "QueryType" attribute is present on the Planning span
and has a value of "QUERY" (since only selects are supported).

Since queries can run for lengthy periods of time, the Init span
communicates the beginning of a query along with global query
attributes. For example, span attributes include query id, session
id, sql, user, etc.

Once the query has closed, the root span is closed.

Testing accomplished with new custom cluster tests.

Generated-by: Github Copilot (GPT-4.1, Claude Sonnet 3.7)
Change-Id: Ie40b5cd33274df13f3005bf7a704299ebfff8a5b
Reviewed-on: http://gerrit.cloudera.org:8080/22924
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-08-12 04:11:06 +00:00
jasonmfehr
fe1a78d16e IMPALA-13235: [Patch 3 of 5] - Consume OpenTelemetry C++ SDK
Adds the OpenTelemetry C++ SDK version 1.20.0 from the toolchain into
the cmake files for consumption during builds.

Testing was accomplished by building locally and in Jenkins.

Generated-by: Github Copilot (GPT-4.1)
Change-Id: Ib30123f79270e3f11233e28a2a34725e7d455f5e
Reviewed-on: http://gerrit.cloudera.org:8080/23101
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-07-11 23:39:23 +00:00
Surya Hebbar
31e48c3cdf IMPALA-13843: Support usage of strings directly in rapidjson
When given a char*, the RapidJSON library recalculates the length of a
StringRef each time by scanning for the null character.

Within the codebase, many times we convert from string to char* for
RapidJSON.

Instead with RAPIDJSON_HAS_STDSTRING, we can use strings directly,
which inherently contain the size, instead of recalculating each time.

This is a major convenience and a bit more efficient during both
JSON generation and parsing.

This simplifies the use of many methods (e.g. AddMember, HasMember, etc.)
and the creation of Document::Value.

This does not remove compatibility for existing char* and still remains
as a stable feature within RapidJSON.

Change-Id: I84c6f17690f7a2e958a6dfbb3637832f5ab6ee53
Reviewed-on: http://gerrit.cloudera.org:8080/22613
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-03-19 22:18:53 +00:00
jasonmfehr
7b6ccc644b IMPALA-12737: Query columns in workload management tables.
Adds "Select Columns", "Where Columns", "Join Columns", "Aggregate
Columns", and "OrderBy Columns" to the query profile and the workload
management active/completed queries tables. These fields are
presented as comma-separated strings containing the fully qualified
column name in the format database.table_name.column_name. Aggregate
columns include all columns in the order by and having clauses.

Since new columns are being added, the workload management init
process is also being modified to allow for one-way upgrades of the
table schemas if necessary.  Additionally, workload management can be
set up to run under a schema version that is not the latest. This
ability will be useful during troubleshooting. To enable these
upgrades, the workload management initialization that manages the
structure of the tables has been moved to the catalogd.

The changes in this patch must be backwards compatible so that Impala
clusters running previous workload management code can co-exist with
Impala clusters running this workload management code. To enable that
backwards compatibility, a new table property named
'wm_schema_version' is now used to track the schema version of the
workload management tables. Thus, the old property 'schema_version'
will always be set to '1.0.0' since modifying that property value
causes Impala running previous workload management code to error at
startup.

Testing accomplished by
* Adding/updating workload and custom cluster tests to assert the new
  columns and the workload management upgrade process.
* JUnit tests added to verify the new workload management columns are
  being correctly parsed.
* GTests added to ensure the workload management columns are
  correctly defined and in the correct order.

Change-Id: I78f3670b067c0c192ee8a212fba95466fbcb51d7
Reviewed-on: http://gerrit.cloudera.org:8080/21142
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2024-10-31 17:06:43 +00:00
Joe McDonnell
89698426f6 IMPALA-12165: Add option for split debug information (-gsplit-dwarf)
This adds the IMPALA_SPLIT_DEBUG_INFO environment variable,
which controls whether the build uses -gsplit-dwarf. This option
puts debug information in a separate .dwo file for each C++
file. Executables contain metadata pointing to those .dwo files
and don't need to include the debug information themselves. This
reduces link time and disk space usage. The default for
IMPALA_SPLIT_DEBUG_INFO is off as this is intended to be an
opt-in option for developers.

For a debug build with compressed debug information,
it cuts disk space usage roughly in half:

Without backend tests (measuring "du -sh be"):
Regular: 5.6GB
Split debuginfo: 2.7GB
With backend tests:
Regular: 22GB
Split debuginfo: 12GB

This only works for the debug information from Impala itself.
The debug information from toolchain dependencies is still
included in each executable.

Split debug information has been around for a long time,
so tools like GDB work. Resolving minidumps works properly.

Testing:
 - Ran builds locally (with GCC and Clang)
 - Attached to Impala with GDB and verified that symbols worked
 - Resolved a minidump and checked the output

Change-Id: I3bbe700279a5dc3fde338fdfa0ea355d3570c9d0
Reviewed-on: http://gerrit.cloudera.org:8080/21720
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2024-09-04 18:58:13 +00:00
Joe McDonnell
88dcdfd466 IMPALA-12807: Add support for mold linker
This adds support for using the mold linker. It changes
the existing USE_GOLD_LINKER environment variable to
IMPALA_LINKER, which accepts ld, gold, or mold as
values. It defaults to 'gold' to match current behavior.
Developers can override it in bin/impala-config-local.sh.
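
The selection logic amounts to a small validated lookup. The sketch below is
illustrative only: the variable name, accepted values, and default come from
this commit message, while the `linker_flag` helper and the mapping to
`-fuse-ld=` arguments are assumptions and may not match how the build
scripts wire the choice through.

```python
import os
from typing import Mapping, Optional

# Assumed mapping from IMPALA_LINKER values to a -fuse-ld= argument.
_FUSE_LD = {"ld": "bfd", "gold": "gold", "mold": "mold"}


def linker_flag(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the linker selection flag for the given environment."""
    env = os.environ if env is None else env
    choice = env.get("IMPALA_LINKER", "gold")  # 'gold' matches old behavior
    if choice not in _FUSE_LD:
        raise ValueError(
            f"IMPALA_LINKER must be one of {sorted(_FUSE_LD)}, got {choice!r}")
    return "-fuse-ld=" + _FUSE_LD[choice]
```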

Clang does not implement -gz properly until version 12.
It does not enable compressed debuginfo in the final
binary. IMPALA_LINKER=mold doesn't work with
IMPALA_COMPRESSED_DEBUG_INFO=true on Clang due to this.
This detects Clang <12 and skips -gz as it is ineffective.

Mold behaves similarly to LLD and requires
--exclude-libs to use the full library name (i.e.
liblz4.a rather than liblz4). Gold will happily
accept the full library name, so this changes to use
the full library name.

Mold is much faster for incremental builds on my system:
(e.g. touch be/src/scheduling/scheduler.cc && make -j8 impalad)
gold: 15.8s
mold: 2.6s

Testing:
 - Ran builds with IMPALA_LINKER=mold on Centos 7, Redhat 8,
   and Ubuntu 20.

Change-Id: Ia9e9accd06b6ecd182d200d81afaae09a885c241
Reviewed-on: http://gerrit.cloudera.org:8080/21121
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Andrew Sherman <asherman@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-03-25 22:53:34 +00:00
Joe McDonnell
befa93ffa6 IMPALA-12730: Don't use -Weverything for clang-tidy
For current clang-tidy builds, we enable -Weverything and then
explicitly disable a couple of verbose warnings. Clang-tidy uses
these warnings to produce its clang-diagnostic-* issues.
This gives clang-tidy the maximum flexibility to enforce any
warning that Clang can produce, but in practice, we disable
them via the .clang-tidy file's Checks: section.

The output from a -Weverything build is enormous and hard to
navigate. When there is a compilation issue, digging through
the logs is arduous. Clang discourages the use of -Weverything,
as it contains many noisy warnings.

This changes the clang-tidy build to use -Wextra instead and
shrinks the list of disabled clang-diagnostic-* issues
accordingly.

Testing:
 - Verified that bin/run_clang_tidy.sh passes

Change-Id: Ib4f4cc9a6d040b90495496b2396a26f3c5189231
Reviewed-on: http://gerrit.cloudera.org:8080/21112
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-03-07 22:22:28 +00:00
Tamas Mate
dce68e6a3b IMPALA-11996: Scanner change for Iceberg metadata querying
This commit adds a scan node for querying Iceberg metadata tables. The
scan node creates a Java scanner object that creates and scans the
metadata table. The scanner uses the Iceberg API to scan the table;
after that, the scan node fetches the rows one by one and materialises
them into RowBatches. The Iceberg row reader on the backend does the
translation between Iceberg and Impala types.
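
The fetch-and-materialise loop can be pictured with the sketch below. It is
a generic batching helper, not the scan node's code: `materialize_batches`
and its `convert` argument are hypothetical stand-ins for the row fetch loop
and the Iceberg-to-Impala type translation.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

Src = TypeVar("Src")
Dst = TypeVar("Dst")


def materialize_batches(rows: Iterable[Src],
                        convert: Callable[[Src], Dst],
                        batch_size: int) -> Iterator[List[Dst]]:
    """Pull rows one at a time, convert each, and yield fixed-size batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[Dst] = []
    for row in rows:  # rows are fetched one by one
        batch.append(convert(row))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly partial batch
        yield batch
```

Because every row passes through `convert`, the per-row conversion cost scales
with the row count, which is consistent with MaterializeTupleTime dominating
the measurements reported in this commit.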

There is only one fragment created to query the Iceberg metadata table
which is supposed to be executed on the coordinator node that already
has the Iceberg table loaded. This way there is no need for further
table loading on the executor side.

This change does not cover nested column types; these slots are set to
NULL. Support will be added in IMPALA-12205.

Testing:
 - Added e2e tests for querying metadata tables
 - Updated planner tests

Performance testing:
Created a table and inserted ~5500 rows one by one, this generated
~270000 ALL_MANIFESTS metadata table records. This table is quite wide
and has a String column as well.

I only mention the count(*) test on ALL_MANIFESTS because every row is
currently materialized in every scenario:
  - Cold cache: 15.76s
    - IcebergApiScanTime: 124.407ms
    - MaterializeTupleTime: 8s368ms
  - Warm cache: 7.56s
    - IcebergApiScanTime: 3.646ms
    - MaterializeTupleTime: 7s477ms

Change-Id: I0e943cecd77f5ef7af7cd07e2b596f2c5b4331e7
Reviewed-on: http://gerrit.cloudera.org:8080/20010
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-10-26 12:40:22 +00:00
Eyizoha
2f06a7b052 IMPALA-10798: Initial support for reading JSON files
Prototype of HdfsJsonScanner implemented based on rapidjson, which
supports scanning data from split JSON files.

The scanning of JSON data is mainly completed by two parts working
together. The first part is the JsonParser responsible for parsing the
JSON object, which is implemented based on the SAX-style API of
rapidjson. It reads data from the char stream, parses it, and calls the
corresponding callback function when encountering the corresponding JSON
element. See the comments of the JsonParser class for more details.

The other part is the HdfsJsonScanner, which inherits from HdfsScanner
and provides callback functions for the JsonParser. The callback
functions are responsible for providing data buffers to the Parser and
converting and materializing the Parser's parsing results into RowBatch.
It should be noted that the parser returns numeric values as strings to
the scanner. The scanner uses the TextConverter class to convert the
strings to the desired types, similar to how the HdfsTextScanner works.
This is an advantage compared to using the number values provided by
rapidjson directly, as it eliminates concerns about inconsistencies in
converting decimals (e.g. losing precision).
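
The benefit of keeping numbers as text can be shown with Python's standard
json module, used here only as an analogy for the rapidjson-based parser.
The `parse_row` and `to_decimal` helpers are hypothetical; `to_decimal`
plays the role that the TextConverter plays in the scanner.

```python
import json
from decimal import Decimal


def parse_row(line: str) -> dict:
    """Parse one JSON object, keeping numeric tokens as their source text."""
    return json.loads(line, parse_int=str, parse_float=str)


def to_decimal(text: str) -> Decimal:
    """Convert the untouched numeric text to the desired type."""
    return Decimal(text)


line = '{"d": 1234567890.123456789012345678, "n": 7}'

# Text path: every digit of the source survives.
exact = to_decimal(parse_row(line)["d"])

# Native number path: the value is rounded to a double first.
lossy = Decimal(str(json.loads(line)["d"]))
```

Here `exact` keeps all 18 fractional digits, while `lossy` has already been
rounded to the nearest double before the decimal conversion happens.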

Added a startup flag, enable_json_scanner, to be able to disable this
feature if we hit critical bugs in production.

Limitations
 - Multiline json objects are not fully supported yet. It is ok when
   each file has only one scan range. However, when a file has multiple
   scan ranges, there is a small probability of incomplete scanning of
   multiline JSON objects that span ScanRange boundaries (in such cases,
   parsing errors may be reported). For more details, please refer to
   the comments in the 'multiline_json.test'.
 - Compressed JSON files are not supported yet.
 - Complex types are not supported yet.

Tests
 - Most of the existing end-to-end tests can run on JSON format.
 - Add TestQueriesJsonTables in test_queries.py for testing multiline,
   malformed, and overflow in JSON.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Reviewed-on: http://gerrit.cloudera.org:8080/19699
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-09-05 16:55:41 +00:00
Joe McDonnell
5d0a2f01a5 IMPALA-12372: Only use -Wno-deprecated-declaration for OpenSSL3
Redhat 9 and Ubuntu 22.04 both use OpenSSL3, which deprecated
several APIs that we use. To support those platforms, we added
the -Wno-deprecated-declaration flag to the build. Historically, the
Impala build has also specified -Wno-deprecated due to
use of deprecated headers in gutils. These flags limit our
ability to notice use of deprecated code in other parts of the
code.

The code in gutils no longer requires -Wno-deprecated, so
this removes it completely. Additionally, this limits the
-Wno-deprecated-declaration flag to machines using
OpenSSL 3.

Reenabling deprecation warnings also reenables Clang Tidy's
clang-diagnostic-deprecated enforcement. This is currently
broken, so this turns off clang-diagnostic-deprecated
until it can be addressed properly.

Testing:
 - Ran build-all-options on Ubuntu 22 and Ubuntu 16
 - Ran a Rocky 9.2 build

Change-Id: I1b36450d084f342eeab5dac2272580ab6b0c988b
Reviewed-on: http://gerrit.cloudera.org:8080/20369
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-08-17 20:47:03 +00:00
zhangyifan27
75bf213418 IMPALA-12288: Add BUILD_WITH_NO_TESTS option to remove test targets
This patch adds a new option 'BUILD_WITH_NO_TESTS' to tell CMake not
to generate test targets. To stay consistent with the previous
test workflow, this option is only set to ON when building impala using
the 'buildall.sh' script with the '-notests' and '-package' flags. This
is useful for a packaging build, which does not need to build all test
binaries.

Testing:
  - Ran 'buildall.sh -release -package' with and without the '-notests'
    flag and verified the generated executables.

Change-Id: I575ce76176c9f6a05fd2db0f420ebe6926d0272a
Reviewed-on: http://gerrit.cloudera.org:8080/20294
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-08-11 14:07:53 +00:00
Michael Smith
8329e6f2e3 IMPALA-12314: Pre-compile LLVM bytecode with Os
Functions used in codegen fragments are compiled into the binary and
also compiled into LLVM bytecode that's embedded in the binary. The LLVM
bytecode is first optimized with clang at O1; however the evaluation of
which optimization level to use was performed with LLVM 3.3, and we're
now on LLVM 5. Re-testing with our current performance suite shows Os
performs the best of available optimization options (O1, O2, O3, Os).

With codegen cache disabled we see across-the-board improvement:
+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(42) | parquet / none / none | 3.28    | -2.33%     | 2.44       | -3.42%         |
+----------+-----------------------+---------+------------+------------+----------------+

+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
| Workload | Query    | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval   |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
| TPCH(42) | TPCH-Q4  | parquet / none / none | 1.78   | 1.80        |   -1.05%   |   2.05%   |   1.62%        | 50    |   -0.32%       | -3.98   | -2.86  |
| TPCH(42) | TPCH-Q3  | parquet / none / none | 6.51   | 6.57        |   -1.00%   |   0.72%   |   0.54%        | 50    |   -0.83%       | -5.87   | -7.88  |
| TPCH(42) | TPCH-Q21 | parquet / none / none | 13.43  | 13.57       |   -1.07%   |   0.39%   |   0.47%        | 50    |   -1.12%       | -7.90   | -12.48 |
| TPCH(42) | TPCH-Q6  | parquet / none / none | 0.79   | 0.81        |   -2.34%   |   3.13%   |   0.93%        | 50    |   -0.31%       | -2.32   | -5.19  |
| TPCH(42) | TPCH-Q9  | parquet / none / none | 8.65   | 8.78        |   -1.43%   |   0.50%   |   0.55%        | 50    |   -1.45%       | -7.88   | -13.67 |
| TPCH(42) | TPCH-Q15 | parquet / none / none | 2.57   | 2.60        |   -1.30%   |   1.13%   |   1.02%        | 50    |   -1.75%       | -4.50   | -6.06  |
| TPCH(42) | TPCH-Q5  | parquet / none / none | 2.25   | 2.30        |   -2.27%   |   1.26%   |   1.21%        | 50    |   -2.28%       | -7.24   | -9.32  |
| TPCH(42) | TPCH-Q12 | parquet / none / none | 1.62   | 1.65        |   -1.94%   |   1.67%   |   1.45%        | 50    |   -2.69%       | -5.71   | -6.27  |
| TPCH(42) | TPCH-Q18 | parquet / none / none | 4.90   | 5.02        |   -2.39%   |   1.54%   |   1.02%        | 50    |   -2.41%       | -7.30   | -9.31  |
| TPCH(42) | TPCH-Q13 | parquet / none / none | 5.71   | 5.85        |   -2.33%   |   1.87%   |   1.70%        | 50    |   -2.59%       | -5.68   | -6.61  |
| TPCH(42) | TPCH-Q14 | parquet / none / none | 1.66   | 1.70        |   -2.17%   |   2.01%   |   1.75%        | 50    |   -2.87%       | -4.76   | -5.84  |
| TPCH(42) | TPCH-Q7  | parquet / none / none | 2.69   | 2.76        |   -2.62%   |   1.43%   |   1.24%        | 50    |   -2.77%       | -7.07   | -9.95  |
| TPCH(42) | TPCH-Q19 | parquet / none / none | 1.98   | 2.04        |   -3.21%   |   1.31%   |   1.73%        | 50    |   -2.62%       | -7.19   | -10.58 |
| TPCH(42) | TPCH-Q17 | parquet / none / none | 1.86   | 1.92        |   -3.15%   |   1.62%   |   1.75%        | 50    |   -2.92%       | -7.14   | -9.47  |
| TPCH(42) | TPCH-Q8  | parquet / none / none | 3.61   | 3.73        |   -3.20%   |   0.98%   |   1.05%        | 50    |   -3.09%       | -8.26   | -15.96 |
| TPCH(42) | TPCH-Q1  | parquet / none / none | 2.98   | 3.08        |   -3.16%   |   1.23%   |   1.30%        | 50    |   -3.33%       | -7.64   | -12.64 |
| TPCH(42) | TPCH-Q22 | parquet / none / none | 1.55   | 1.60        |   -3.45%   |   2.25%   |   1.89%        | 50    |   -3.28%       | -5.87   | -8.50  |
| TPCH(42) | TPCH-Q10 | parquet / none / none | 2.81   | 2.90        |   -3.31%   |   1.20%   |   2.18%        | 50    |   -3.47%       | -7.67   | -9.48  |
| TPCH(42) | TPCH-Q20 | parquet / none / none | 1.88   | 1.97        |   -4.42%   |   1.20%   |   1.42%        | 50    |   -5.06%       | -8.52   | -17.12 |
| TPCH(42) | TPCH-Q16 | parquet / none / none | 1.04   | 1.10        | I -5.43%   |   1.28%   |   1.86%        | 50    | I -4.91%       | -8.28   | -17.29 |
| TPCH(42) | TPCH-Q2  | parquet / none / none | 1.05   | 1.13        | I -6.88%   |   2.26%   |   2.12%        | 50    | I -8.96%       | -7.93   | -16.30 |
| TPCH(42) | TPCH-Q11 | parquet / none / none | 0.74   | 0.88        | I -15.95%  |   3.36%   |   3.73%        | 50    | I -20.48%      | -8.50   | -24.09 |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
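
For reading the Delta columns, the sketch below assumes that Delta(Avg) is
the relative change of Avg against Base Avg and that GeoMean is the usual
geometric mean of per-query times; neither definition is stated in this
commit message. The `delta_pct` and `geomean` helpers are hypothetical.
Because the table prints rounded averages, recomputing a delta from the
displayed values gives a slightly different number than the one in the table.

```python
from math import exp, log
from typing import Sequence


def delta_pct(new: float, base: float) -> float:
    """Relative change of `new` against `base`, in percent."""
    return (new - base) / base * 100.0


def geomean(values: Sequence[float]) -> float:
    """Geometric mean of positive values."""
    return exp(sum(log(v) for v in values) / len(values))


# TPCH-Q11 as displayed: Avg 0.74s, Base Avg 0.88s.
q11_delta = delta_pct(0.74, 0.88)
```

For TPCH-Q11 this gives roughly -15.9%, close to the -15.95% in the table,
with the gap explained by the rounding of the displayed averages.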

(I) Improvement: TPCH(42) TPCH-Q16 [parquet / none / none] (1.10s -> 1.04s [-5.43%])
+---------------------+------------+----------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+
| Operator            | % of Query | Avg      | Base Avg | Delta(Avg) | StdDev(%)  | Max      | Base Max | Delta(Max) | #Hosts | #Inst | #Rows  | Est #Rows |
+---------------------+------------+----------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+
| 11:AGGREGATE        | 9.62%      | 116.74ms | 114.24ms | +2.19%     |   4.85%    | 236.00ms | 228.00ms | +3.51%     | 3      | 15    | 4.98M  | 1.28M     |
| F00:EXCHANGE SENDER | 12.71%     | 154.30ms | 150.21ms | +2.72%     |   5.93%    | 232.00ms | 216.00ms | +7.41%     | 3      | 15    | -1     | -1        |
| 05:AGGREGATE        | 6.62%      | 80.35ms  | 79.17ms  | +1.50%     |   5.18%    | 164.00ms | 152.00ms | +7.89%     | 3      | 15    | 4.99M  | 1.28M     |
| 02:SCAN HDFS        | 20.99%     | 254.69ms | 253.55ms | +0.45%     |   9.62%    | 316.00ms | 308.00ms | +2.60%     | 1      | 1     | 207    | 42.00K    |
| 03:HASH JOIN        | 6.40%      | 77.68ms  | 83.29ms  | -6.74%     |   5.80%    | 156.00ms | 164.00ms | -4.88%     | 3      | 15    | 4.99M  | 1.28M     |
| F07:JOIN BUILD      | 14.37%     | 174.42ms | 183.92ms | -5.16%     |   6.35%    | 228.00ms | 240.00ms | -5.00%     | 3      | 3     | -1     | -1        |
| F01:EXCHANGE SENDER | 4.90%      | 59.46ms  | 60.27ms  | -1.34%     | * 12.04% * | 124.00ms | 136.00ms | -8.82%     | 3      | 8     | -1     | -1        |
| 01:SCAN HDFS        | 7.37%      | 89.43ms  | 89.69ms  | -0.30%     |   6.22%    | 140.00ms | 160.00ms | -12.50%    | 3      | 8     | 1.25M  | 331.46K   |
| 00:SCAN HDFS        | 5.67%      | 68.84ms  | 68.41ms  | +0.64%     |   6.78%    | 132.00ms | 128.00ms | +3.12%     | 3      | 15    | 33.60M | 33.60M    |
+---------------------+------------+----------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+

(I) Improvement: TPCH(42) TPCH-Q2 [parquet / none / none] (1.13s -> 1.05s [-6.88%])
+--------------+------------+----------+----------+------------+------------+----------+----------+------------+--------+-------+---------+-----------+
| Operator     | % of Query | Avg      | Base Avg | Delta(Avg) | StdDev(%)  | Max      | Base Max | Delta(Max) | #Hosts | #Inst | #Rows   | Est #Rows |
+--------------+------------+----------+----------+------------+------------+----------+----------+------------+--------+-------+---------+-----------+
| 01:SCAN HDFS | 4.61%      | 76.65ms  | 73.71ms  | +3.99%     | * 18.60% * | 120.00ms | 108.00ms | +11.11%    | 1      | 1     | 84.23K  | 420.00K   |
| 00:SCAN HDFS | 2.51%      | 41.82ms  | 44.19ms  | -5.38%     | * 15.65% * | 108.00ms | 112.00ms | -3.57%     | 3      | 8     | 33.42K  | 53.13K    |
| 02:SCAN HDFS | 19.49%     | 324.36ms | 314.51ms | +3.13%     |   8.44%    | 484.00ms | 496.00ms | -2.42%     | 3      | 15    | 2.36M   | 33.60M    |
| 07:SCAN HDFS | 16.68%     | 277.63ms | 295.43ms | -6.02%     | * 12.45% * | 328.00ms | 348.00ms | -5.75%     | 1      | 1     | 5       | 25        |
| 06:SCAN HDFS | 18.46%     | 307.18ms | 326.86ms | -6.02%     | * 11.80% * | 360.00ms | 400.00ms | -10.00%    | 1      | 1     | 84.23K  | 420.00K   |
| 05:SCAN HDFS | 28.94%     | 481.61ms | 470.92ms | +2.27%     |   5.51%    | 640.00ms | 616.00ms | +3.90%     | 3      | 15    | 493.60K | 33.60M    |
+--------------+------------+----------+----------+------------+------------+----------+----------+------------+--------+-------+---------+-----------+

(I) Improvement: TPCH(42) TPCH-Q11 [parquet / none / none] (0.88s -> 0.74s [-15.95%])
+--------------+------------+----------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+
| Operator     | % of Query | Avg      | Base Avg | Delta(Avg) | StdDev(%)  | Max      | Base Max | Delta(Max) | #Hosts | #Inst | #Rows  | Est #Rows |
+--------------+------------+----------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+
| 07:SCAN HDFS | 9.77%      | 61.88ms  | 57.88ms  | +6.91%     | * 35.58% * | 104.00ms | 104.00ms | -0.00%     | 1      | 1     | 16.88K | 420.00K   |
| 06:SCAN HDFS | 32.20%     | 203.97ms | 197.21ms | +3.43%     |   9.44%    | 344.00ms | 352.00ms | -2.27%     | 3      | 15    | 1.35M  | 33.60M    |
| 01:SCAN HDFS | 8.65%      | 54.78ms  | 47.35ms  | +15.69%    | * 57.37% * | 104.00ms | 116.00ms | -10.35%    | 1      | 1     | 16.88K | 420.00K   |
| 00:SCAN HDFS | 34.70%     | 219.78ms | 221.34ms | -0.71%     |   8.53%    | 356.00ms | 364.00ms | -2.20%     | 3      | 15    | 1.35M  | 33.60M    |
+--------------+------------+----------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+

With codegen cache enabled - highlighting generated code performance -
we still see improvement, although not as definitive:
+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(42) | parquet / none / none | 3.17    | -1.77%     | 2.25       | -1.41%         |
+----------+-----------------------+---------+------------+------------+----------------+

+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
| Workload | Query    | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval   |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
| TPCH(42) | TPCH-Q1  | parquet / none / none | 2.80   | 2.74        |   +1.84%   |   1.65%   |   2.25%        | 50    |   +1.93%       | 4.76    | 4.64   |
| TPCH(42) | TPCH-Q15 | parquet / none / none | 2.47   | 2.45        |   +0.81%   |   1.56%   |   1.47%        | 50    |   +0.34%       | 2.99    | 2.67   |
| TPCH(42) | TPCH-Q14 | parquet / none / none | 1.66   | 1.64        |   +0.75%   |   1.94%   |   2.66%        | 50    |   +0.23%       | 3.58    | 1.60   |
| TPCH(42) | TPCH-Q12 | parquet / none / none | 1.60   | 1.60        |   +0.12%   |   2.86%   |   2.55%        | 50    |   +0.19%       | 1.44    | 0.21   |
| TPCH(42) | TPCH-Q13 | parquet / none / none | 6.63   | 6.62        |   +0.15%   |   7.42%   |   8.26%        | 50    |   +0.08%       | 0.26    | 0.10   |
| TPCH(42) | TPCH-Q10 | parquet / none / none | 2.79   | 2.79        |   -0.31%   |   4.14%   |   2.50%        | 50    |   -0.11%       | -0.55   | -0.45  |
| TPCH(42) | TPCH-Q16 | parquet / none / none | 0.89   | 0.89        |   -0.52%   |   2.28%   |   2.54%        | 50    |   +0.06%       | 0.76    | -1.08  |
| TPCH(42) | TPCH-Q17 | parquet / none / none | 1.81   | 1.82        |   -0.65%   |   2.17%   |   1.88%        | 50    |   -0.08%       | -1.10   | -1.60  |
| TPCH(42) | TPCH-Q11 | parquet / none / none | 0.52   | 0.52        |   -0.97%   |   2.13%   |   2.25%        | 50    |   +0.12%       | 1.19    | -2.23  |
| TPCH(42) | TPCH-Q20 | parquet / none / none | 1.73   | 1.75        |   -1.22%   |   1.67%   |   1.76%        | 50    |   -0.27%       | -3.08   | -3.59  |
| TPCH(42) | TPCH-Q6  | parquet / none / none | 0.79   | 0.80        |   -1.63%   |   3.14%   |   2.58%        | 50    |   -0.13%       | -2.88   | -2.86  |
| TPCH(42) | TPCH-Q19 | parquet / none / none | 1.27   | 1.29        |   -1.51%   |   2.02%   |   1.95%        | 50    |   -0.36%       | -3.91   | -3.82  |
| TPCH(42) | TPCH-Q21 | parquet / none / none | 13.20  | 13.35       |   -1.10%   |   0.48%   |   0.53%        | 50    |   -1.13%       | -7.27   | -10.86 |
| TPCH(42) | TPCH-Q3  | parquet / none / none | 6.45   | 6.57        |   -1.75%   |   1.21%   |   1.48%        | 50    |   -1.59%       | -5.23   | -6.53  |
| TPCH(42) | TPCH-Q7  | parquet / none / none | 2.48   | 2.53        |   -2.06%   |   2.16%   |   2.50%        | 50    |   -2.10%       | -3.99   | -4.44  |
| TPCH(42) | TPCH-Q18 | parquet / none / none | 4.70   | 4.81        |   -2.24%   |   2.39%   |   2.32%        | 50    |   -2.18%       | -4.86   | -4.82  |
| TPCH(42) | TPCH-Q5  | parquet / none / none | 2.16   | 2.21        |   -2.12%   |   2.30%   |   2.14%        | 50    |   -2.38%       | -4.65   | -4.84  |
| TPCH(42) | TPCH-Q4  | parquet / none / none | 1.72   | 1.75        |   -2.09%   |   2.34%   |   2.18%        | 50    |   -2.88%       | -4.52   | -4.66  |
| TPCH(42) | TPCH-Q8  | parquet / none / none | 3.50   | 3.59        |   -2.63%   |   2.34%   |   2.79%        | 50    |   -2.86%       | -4.52   | -5.16  |
| TPCH(42) | TPCH-Q22 | parquet / none / none | 1.50   | 1.54        |   -2.53%   |   4.70%   |   3.77%        | 50    |   -3.34%       | -2.99   | -3.02  |
| TPCH(42) | TPCH-Q2  | parquet / none / none | 0.75   | 0.79        |   -4.37%   |   2.93%   |   1.77%        | 50    |   -6.02%       | -5.93   | -9.32  |
| TPCH(42) | TPCH-Q9  | parquet / none / none | 8.28   | 8.87        | I -6.66%   |   0.89%   |   1.25%        | 50    | I -7.27%       | -8.53   | -31.40 |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+

(I) Improvement: TPCH(42) TPCH-Q9 [parquet / none / none] (8.87s -> 8.28s [-6.66%])
+---------------------+------------+----------+----------+------------+-----------+----------+----------+------------+--------+-------+---------+-----------+
| Operator            | % of Query | Avg      | Base Avg | Delta(Avg) | StdDev(%) | Max      | Base Max | Delta(Max) | #Hosts | #Inst | #Rows   | Est #Rows |
+---------------------+------------+----------+----------+------------+-----------+----------+----------+------------+--------+-------+---------+-----------+
| F11:JOIN BUILD      | 7.52%      | 1.38s    | 1.34s    | +2.94%     |   3.85%   | 2.15s    | 1.83s    | +17.25%    | 3      | 15    | -1      | -1        |
| F05:EXCHANGE SENDER | 7.78%      | 1.43s    | 1.32s    | +8.21%     |   3.54%   | 2.17s    | 2.10s    | +3.63%     | 3      | 15    | -1      | -1        |
| F04:EXCHANGE SENDER | 5.82%      | 1.07s    | 1.13s    | -5.57%     |   3.76%   | 1.54s    | 1.68s    | -8.57%     | 3      | 15    | -1      | -1        |
| F12:JOIN BUILD      | 5.33%      | 978.25ms | 886.69ms | +10.33%    |   5.86%   | 1.48s    | 1.32s    | +12.43%    | 3      | 15    | -1      | -1        |
| F03:EXCHANGE SENDER | 3.40%      | 623.44ms | 606.68ms | +2.76%     |   4.22%   | 1.05s    | 991.98ms | +5.65%     | 3      | 15    | -1      | -1        |
| F00:EXCHANGE SENDER | 5.67%      | 1.04s    | 1.10s    | -5.39%     |   2.28%   | 1.51s    | 1.60s    | -5.51%     | 3      | 15    | -1      | -1        |
| 01:SCAN HDFS        | 10.25%     | 1.88s    | 2.02s    | -7.03%     |   4.28%   | 2.05s    | 2.16s    | -5.17%     | 1      | 1     | 420.00K | 420.00K   |
| 06:HASH JOIN        | 2.58%      | 473.29ms | 517.49ms | -8.54%     |   3.72%   | 876.00ms | 811.98ms | +7.88%     | 3      | 15    | 13.72M  | 24.34M    |
| 00:SCAN HDFS        | 12.79%     | 2.35s    | 2.43s    | -3.48%     |   2.35%   | 2.62s    | 2.72s    | -3.82%     | 3      | 8     | 457.14K | 840.00K   |
| 02:SCAN HDFS        | 28.98%     | 5.32s    | 5.65s    | -5.78%     |   1.22%   | 6.21s    | 6.52s    | -4.66%     | 3      | 15    | 68.38M  | 252.01M   |
+---------------------+------------+----------+----------+------------+-----------+----------+----------+------------+--------+-------+---------+-----------+

It would be useful to get more evaluation of this option. Adds a new
startup option 'llvm_ir_opt' to select from several pre-optimized
versions of LLVM bytecode that provide library functions for our
generated code. Available options are O1, O2, and Os, defaulting to Os.
Release binary size increases by ~6.5MB.

Leaves impala_legacy_avx_llvm_ir untouched.

Change-Id: I6dd1a07ce63dbc2c27b00f450e11eceaa7bb0822
Reviewed-on: http://gerrit.cloudera.org:8080/20265
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-08-07 21:51:34 +00:00
Joe McDonnell
234d641d7b IMPALA-11961/IMPALA-12207: Add Redhat 9 / Ubuntu 22 support
This adds support for Redhat 9 / Ubuntu 22. It updates
to a newer toolchain that has those builds, and it adds
supporting code in bootstrap_system.sh.

Redhat 9 and Ubuntu 22 use python = python3, which requires
various changes to build scripts and tests. Ubuntu 22 uses
Python 3.10, which deprecates ssl.PROTOCOL_TLS, so
this adapts test_client_ssl.py to that change until it
can be fully addressed in IMPALA-12219.

Various OpenSSL methods have been deprecated. As a workaround
until these can be addressed properly, this specifies
-Wno-deprecated-declarations. This can be removed once the
code is adapted to the non-deprecated APIs in IMPALA-12226.

Impala crashes with tcmalloc errors unless we update to a newer
gperftools, so this moves to gperftools 2.10. gperftools changed
the default for tcmalloc.aggressive_memory_decommit to off, so
this adapts our code to set it for backend tests. The gperftools
upgrade does not show any performance regression:

+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(42) | parquet / none / none | 3.08    | -0.64%     | 2.20       | -0.37%         |
+----------+-----------------------+---------+------------+------------+----------------+

With newer Python versions, the impala-virtualenv command
fails to create a Python 3 virtualenv. This switches to
using Python 3's builtin venv command for Python >=3.6.
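
A minimal sketch of the venv switch described above, using the standard-library venv module (the API behind the "python3 -m venv" command). The function name and the with_pip=False choice are assumptions made to keep the example self-contained; they are not taken from the patch.

```python
import os
import sys
import tempfile
import venv


def create_python_env(env_dir):
    """Create a Python 3 environment with the builtin venv module (hypothetical helper)."""
    if sys.version_info >= (3, 6):
        # with_pip=False is an assumption: it avoids needing ensurepip here.
        venv.create(env_dir, with_pip=False)
        return "venv"
    raise RuntimeError("older interpreters would need the external virtualenv tool")


env_dir = os.path.join(tempfile.mkdtemp(), "env")
tool = create_python_env(env_dir)
print(tool, os.path.isfile(os.path.join(env_dir, "pyvenv.cfg")))
```

A real build environment would also need pip inside the environment, which this sketch leaves out.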

Kudu needed a newer version and LLVM required a couple of patches.

Testing:
 - Ran a core job on Ubuntu 22 and Redhat 9. The tests run
   to completion without crashing. There are test failures
   that will be addressed in follow-up JIRAs.
 - Ran dockerised tests on Ubuntu 22.
 - Ran dockerised tests on Ubuntu 20 and Rocky 8.5.

Change-Id: If1fcdb2f8c635ecd6dc7a8a1db81f5f389c78b86
Reviewed-on: http://gerrit.cloudera.org:8080/20073
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-06-21 05:21:01 +00:00
Joe McDonnell
f54b3c3757 IMPALA-11713: Switch to C++17
This switches from C++14 to C++17, which allows Impala
code to interact with libraries written in C++17.
This fixes a few minor compilation differences, such
as the need for comparators to be const. Since C++17
includes -faligned-new, this removes the CMake code
that specified it.

Testing:
 - Ran core job

Change-Id: Iadac41817fe5eaaa469a5f0e9f94056a409c14b9
Reviewed-on: http://gerrit.cloudera.org:8080/19183
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-02-23 03:22:48 +00:00
Joe McDonnell
8271bdd587 IMPALA-11695: Reduce clang tidy warning output size
The Clang Tidy build enables all warnings via -Wall
and -Weverything. This produces enormous output.
Looking at a recent failed Clang Tidy build, there
are ~4.5 million warnings generated. Of these,
about 4 million are from C++98 compatibility warnings.
A further 250 thousand are from padding warnings.
Since these are not particularly interesting, this
disables both of those to reduce the output size.

Disabling these warnings allowed Clang Tidy to find
some issues in DataSketches that it was previously
missing. Perhaps there is some limit on the number
or size of warnings that it was processing. This
modifies the DataSketches code to fix those (which
are all minor issues with const correctness).

Testing:
 - Built with clang tidy locally

Change-Id: I28c6ed1e7a4f525d81a9c48e90d051b374d44941
Reviewed-on: http://gerrit.cloudera.org:8080/19182
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-11-01 04:06:13 +00:00
Daniel Becker
c1a9a23737 IMPALA-11623: Put *-ir.cc files into their own libraries to avoid extra
recompilation

Currently, modifying most files incurs a rebuild of the LLVM IR, which
is a slow serial step.

This change splits off the *-ir.cc files into their own libraries that
have no dependency relation with the original libraries from which they
are separated. The build of the LLVM IR depends only on the new IR
libraries, so if a non-IR C++ source file is modified, the LLVM IR does
not need to be rebuilt.

Note that the LLVM IR build does not actually need or use the new IR
libraries - it is a completely separate procedure in which the *-ir.cc
files are included in impala-ir.cc and compiled with Clang into LLVM
bitcode. We add the dependency on the IR libraries to trigger the
rebuilding of the LLVM IR whenever an *-ir.cc file is modified.

However, the new IR libraries still need to be compiled into the main
Impala executable.

Also, dependence on the target 'gen_ir_descriptions' was not consistent.
As it is a fast target and needed by most other targets, 'gen-deps' now
depends on it to keep things safer and simpler. Direct dependencies on
'gen_ir_descriptions' were removed from the targets that also depend on
'gen-deps'.

Testing:
 - manually verified for each modified target library that the LLVM IR
   is not recompiled when a non-IR .cc file is modified, but is
   recompiled if the corresponding header or *-ir.cc file is modified.

Change-Id: I63b9285bac6494a19f614d0ebc694a91bdf7a8a0
Reviewed-on: http://gerrit.cloudera.org:8080/19083
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-10-17 22:29:54 +00:00
Joe McDonnell
97480c3e32 IMPALA-11640/IMPALA-11641: Workaround errors in shared library build on Ubuntu 18+
Building with -so on Ubuntu 18 or higher fails due to
an issue finding dlopen in unwind_safeness.cc:

unwind_safeness.cc:76] Check failed: !error failed to find symbol dlopen

unwind_safeness.cc is using dlsym to load the dlopen symbol
so that it can wrap it with its own dlopen code. The Impala
build has issues with the ordering of libraries, and this
code does not find dlopen. This has previously happened with
the dl_iterate_phdr symbol in Kudu.

This is a problem starting with Ubuntu 18.04, because Ubuntu 16.04
uses a version of glibc that has a bug in reporting this
error. Ubuntu 18.04 uses a newer glibc with a fix for the bug. See
https://sourceware.org/bugzilla/show_bug.cgi?id=19509 .
As a workaround for this issue, this tolerates not finding
dlopen/dlclose when building with shared libraries. Impala shared
libraries are not used in production, so this bypasses the issue.
This also adds extra validation to make sure the symbols are non-null.
Specifically, this adds another CHECK in dlsym_or_die to verify
that the symbol is non-null. This also adds a DCHECK to verify that
the symbol is non-null at dereference.

This also fixes an issue where Boost was always using static
libraries, even for shared library builds. This makes Boost use
shared libraries for shared library builds.

Testing:
 - The shared library build passes on Ubuntu 18 and Ubuntu 20
 - Impala can boot and run queries with shared libraries

Change-Id: Iaab196b3d669ccc12854a98d0dbfbae2b9b91244
Reviewed-on: http://gerrit.cloudera.org:8080/19104
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-10-12 12:00:56 +00:00
Joe McDonnell
10c19b1a57 IMPALA-11511: Add build options for reducing binary sizes
Impala's build produces dozens of C++ binaries
that link in all Impala libraries. Each binary is
hundreds of megabytes, leading to 10s of gigabytes
of disk space. A large proportion of this (~80%) is debug
information. The debug information increases in newer
versions of GCC such as GCC 10.

This introduces two options for reducing the size
of debug information:
 - IMPALA_MINIMAL_DEBUG_INFO=true builds Impala with
   minimal debug information (-g1). This contains line tables
   and can resolve backtraces, but it does not contain
   variable information and restricts further debugging.
 - IMPALA_COMPRESSED_DEBUG_INFO=true builds Impala with
   compressed debug information (-gz). This does not change
   the debug information included, but the compression saves
   significant disk space. gdb is known to work with
   compressed debug information, but other tools may not
   support it. The dump_breakpad_symbols.py script has been
   adjusted to handle these binaries.
These are disabled by default.

Release impalad binary sizes:
Configuration                  | Size (bytes) | % reduction over base
Base                           | 707834808    | N/A
Stripped                       |  83351664    | 88%
Minimal debuginfo              | 215924096    | 69%
Compressed debuginfo           | 301619286    | 57%
Minimal + compressed debuginfo | 120886705    | 83%
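
The reduction column can be recomputed from the byte counts in the table above. The sketch below assumes the percentages were rounded to the nearest integer; the helper name is hypothetical.

```python
# Byte counts copied from the table above.
SIZES = {
    "Base": 707834808,
    "Stripped": 83351664,
    "Minimal debuginfo": 215924096,
    "Compressed debuginfo": 301619286,
    "Minimal + compressed debuginfo": 120886705,
}


def reduction_percent(size, base):
    """Size reduction relative to base, rounded to a whole percent."""
    return round(100.0 * (base - size) / base)


base = SIZES["Base"]
for name, size in SIZES.items():
    if name != "Base":
        print("%-31s %3d%%" % (name, reduction_percent(size, base)))
```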

Testing:
 - Generated minidumps and resolved them
 - Verified this is disabled by default

Change-Id: I04a20258a86053d8f3972b9c7c81cd5bec1bbb66
Reviewed-on: http://gerrit.cloudera.org:8080/18962
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-26 17:47:55 +00:00
Joe McDonnell
cff286e751 IMPALA-9999: Switch to GCC 10.4
This upgrades GCC and libstdc++ to version 10.4. This
required patching or upgrading several dependencies
so they could compile with GCC 10. The toolchain
companion change has details on what items needed
to be upgraded and why.

The toolchain companion change switches GCC to build
with toolchain binutils rather than host binutils. This
means that the python virtualenv initialization needs
to include binutils on the path.

This disables two warnings introduced in the new GCC
versions (Wclass-memaccess and Winit-list-lifetime).
These two warnings occur in our code and also in
dependencies like LLVM and rapidjson. These are not
critical warnings, so they can be addressed
independently and reenabled later.

Binary sizes (in bytes) mostly increase, particularly when including
debug symbols; only the stripped DEBUG binary shrinks:
                         | GCC 7.5     | GCC 10.4
impalad RELEASE stripped |  83204768   |  88702824
impalad RELEASE          | 707278904   | 971711456
impalad DEBUG stripped   | 106677672   |  97391944
impalad DEBUG            | 725864760   | 867647512

Testing:
 - Multiple test jobs (core, release exhaustive, ASAN)
 - Performance testing for TPC-H and TPC-DS shows
   a modest improvement (2-4%).
 - Code compiles without warnings on debug and release

Change-Id: Ibe6857b822925226d39fd4d6413457ef6bbaabec
Reviewed-on: http://gerrit.cloudera.org:8080/18134
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
2022-09-20 15:50:18 +00:00
Peter Rozsa
d74d6994cc IMPALA-11525: Rename Exec libraries to avoid conflicts with external libraries
This patch renames the Exec libraries to distinguish them from
external libraries like avro, kudu_client and orc.
Due to IMPALA-11257, the cmake_minimum_required calls are
also removed from the new libraries.

Change-Id: I042478e371049a7aed74f43c2bbfcb7900a53dd0
Reviewed-on: http://gerrit.cloudera.org:8080/18910
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-02 02:22:58 +00:00
zhangyifan27
85c6366d99 IMPALA-11467: Force Boost to use /dev/random for UUID generation
To avoid a crash on operating systems with a kernel older than Linux 3.17,
this patch picks Kudu's solution 35b5664f908cd1250c9f01e5dff77b653cfd12b7
into Impala.

Testing:
 - Ran a pre-built impalad on CentOS 7 (kernel 3.10) with no crash.

Change-Id: Ic48bd59b0a846bcb91a6faf77156c0a49cd08ae8
Reviewed-on: http://gerrit.cloudera.org:8080/18805
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Alexey Serbin <alexey@apache.org>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2022-08-31 20:03:12 +00:00
Peter Rozsa
9518ee2907 IMPALA-10800: Tidy up the be/src/exec directory
This change separates the scanner implementations
to different folders to make the exec directory clearer.

Change-Id: Ie936c400ea8b112073bba892497ab8a1498c418d
Reviewed-on: http://gerrit.cloudera.org:8080/18815
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-18 18:31:43 +00:00
Joe McDonnell
ba4cb95b62 IMPALA-11257: Fix CMake warnings for module names and cmake_minimum_required
This fixes a few different CMake warnings:
1. This removes cmake_minimum_required invocations except for the
   top-most CMakeLists.txt. This eliminates the warnings like this:
     Compatibility with CMake < 2.8.12 will be removed from a future version of
     CMake.

     Update the VERSION argument <min> value or use a ...<max> suffix to tell
     CMake that the project does not need compatibility with older versions.
   Moving to a later version also required setting CMAKE_ENABLE_EXPORTS
   to continue exporting symbols.
2. This modifies the module names so that they match the corresponding
   module names from Find*.cmake. This is mostly dealing with case
   differences. This addresses warnings like:
     The package name passed to `find_package_handle_standard_args` (PROTOBUF)
     does not match the name of the calling package (Protobuf).  This can lead
     to problems in calling code that expects `find_package` result variables
     (e.g., `_FOUND`) to follow a certain pattern.
   This fixed the detection logic for KerberosPrograms, and so it required
   adding more Kerberos packages to bin/bootstrap_build.sh.
3. This adds a missing .cc suffix. This addresses the following warning:
     CMake Warning (dev) at be/src/util/CMakeLists.txt:141 (add_library):
     Policy CMP0115 is not set: Source file extensions must be explicit.  Run
     "cmake --help-policy CMP0115" for policy details.  Use the cmake_policy
     command to set the policy and suppress this warning.

These fixes mostly match how these warnings were handled in
Apache Kudu.

Testing:
 - Ran GVO

Change-Id: I2a97dd07cdd0831e90882a2035415ac71d670147
Reviewed-on: http://gerrit.cloudera.org:8080/18444
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-11 05:48:36 +00:00
Riza Suminto
06b1db4675 IMPALA-11369: Separate thrift compiler for different component
Impala used to have one thrift compiler version to compile C++, Java,
and Python code.

Thrift serialization/deserialization is mostly compatible between minor
versions, so it is possible to use different thrift compiler versions
for different target languages. This is beneficial because it allows
Impala to upgrade separate components independently.

This patch implements the infrastructure change required to do so. It
replaces most of the 'THRIFT_*' environment variables and CMake variables
with 'THRIFT_CPP_*', 'THRIFT_JAVA_*', and 'THRIFT_PY_*' to compile C++,
Java, and Python code respectively. All three still refer to the same
thrift version (thrift-0.11.0-p5).

Testing:
- Build Impala and pass core tests.

Change-Id: I56479dc69b79024d1a4d09211bbe88a61fa0c6a4
Reviewed-on: http://gerrit.cloudera.org:8080/18636
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-06-21 02:40:59 +00:00
Gergely Fürnstáhl
33724d623f IMPALA-11311: Fixed debug_noopt build directory
The debug_noopt build directory used "release" by default; this changes it to "debug".

Change-Id: I202065ca25ba622954ac11526e1c55db0f0e8a1c
Reviewed-on: http://gerrit.cloudera.org:8080/18555
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-05-23 17:14:46 +00:00
Joe McDonnell
a5a9b1e3f9 IMPALA-11110: Switch debug builds to use -Og
GCC's -Og applies optimizations that are compatible with
being debuggable. It is similar to -O1 and results
in a modest speed-up. This modifies the default debug
build to use -Og, so it is now more akin to a fastdebug
mode.

Even though -Og is intended to preserve debuggability,
optimization always impacts debuggability and -Og is
no exception. To enable the old behavior, this adds
a DEBUG_NOOPT build mode that retains the old
non-optimized behavior. Using the -debug_noopt flag
with buildall.sh enables this behavior.

Change-Id: Ie06c149c8181c90572b8668bd01dfd26c0a5971e
Reviewed-on: http://gerrit.cloudera.org:8080/18200
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Laszlo Gaal (Cloudera) <laszlo.gaal@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
2022-03-21 16:22:31 +00:00
Joe McDonnell
6c845eb24b IMPALA-10046: Switch backend to compile with DWARF 4 debug info
Currently, Impala compiles with -gdwarf-2 to use DWARF 2 debug
information. DWARF 4 provides better C++ language support and
better compression (reducing binary sizes), but it requires
GDB 7.0 or above. All of the Linux distributions that Impala
supports now have GDB 7.0 or above. This switches to using DWARF 4.

Binary sizes are reduced somewhat:
                | DWARF 2 size    | DWARF 4 size
impalad DEBUG   | 573321352       | 519329200
impalad RELEASE | 713582048       | 693286800

This is also true for all unit tests. This has no impact on the
stripped binary sizes.

Testing:
 - Ran core job
 - Created a minidump and resolved it via the usual method
 - Attached to impalad with gdb, set breakpoints, took backtraces, etc.

Change-Id: I7b6e75845ab137d0a7674289e4b331f682eee5b2
Reviewed-on: http://gerrit.cloudera.org:8080/18194
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-02-05 02:01:01 +00:00
Joe McDonnell
7b490eed5b IMPALA-10951 (preparation): Update Kudu to a more recent version
As part of moving to a newer protobuf, this updates the Kudu version
to get the fix for KUDU-3334. With this newer Kudu version, Clang
builds hit an error while linking:
lib/libLLVMCodeGen.a(TargetPassConfig.cpp.o):TargetPassConfig.cpp:
  function llvm::TargetPassConfig::createRegAllocPass(bool):
    error: relocation refers to global symbol "std::call_once<void (&)()>(std::once_flag&, void (&)())::{lambda()#2}::_FUN()",
    which is defined in a discarded section
  section group signature: "_ZZSt9call_onceIRFvvEJEEvRSt9once_flagOT_DpOT0_ENKUlvE0_clEv"
  prevailing definition is from ../../build/debug/security/libsecurity.a(openssl_util.cc.o)
(This is from a newer binutils that will be pursued separately.)

As a hack to get around this error, this adds the calloncehack
shared library. The shared library publicly defines the symbol that
was coming from kudu_client. By linking it ahead of kudu_client, the
linker uses that rather than the one from kudu_client. This fixes
the Clang builds.

The new Kudu also requires a minor change to the flags for tserver
startup.

Testing:
 - Ran debug tests and verified calloncehack is not used
 - Ran ASAN tests

Change-Id: Ieccbe284f11445e1de792352ebc7c9e1fa2ca0c3
Reviewed-on: http://gerrit.cloudera.org:8080/18129
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-01-07 01:44:58 +00:00
wzhou-code
bf728516fe IMPALA-11005 (part 2): Upgrade Boost library to 1.74.0 for Impala
Some header files are deprecated in the new version of the Boost
library. This defines BOOST_ALLOW_DEPRECATED_HEADERS in
CMakeLists.txt to avoid compile errors. It also defines
BOOST_BIND_GLOBAL_PLACEHOLDERS to keep the current behaviour of
boost::bind and avoid big code changes.

This defines an exception handler for a new boost::throw_exception() API
since BOOST_NO_EXCEPTIONS is defined in be/CMakeLists.txt and we have to
provide handlers which will be called by codegen'd code.

This reverts the code changes made by IMPALA-2846 and IMPALA-9571 since
the bug was fixed in Boost 1.74.0.

Testing:
 - Passed core DEBUG build and exhaustive release build.
 - Passed core ASAN build.

Change-Id: I78f32ae3c274274325e7af9e9bc9643814ae346a
Reviewed-on: http://gerrit.cloudera.org:8080/17996
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
2021-11-12 15:28:31 +00:00
wzhou-code
03a7a59f5d IMPALA-10876: Support to download JWKS from given URL
This patch added functionality to download JWKS from a given URL and
support key rotation by periodically checking the JWKS URL for updates.
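
The periodic update check can be pictured roughly as follows. This is only an illustration in Python, not the C++ code from the patch: the class name, the injected fetch callable, and the digest comparison are assumptions, and keys are assumed to carry a "kid" field as in typical JWKS documents.

```python
import hashlib
import json


class JwksCache:
    """Holds the current key set and swaps it only when the document changes."""

    def __init__(self, fetch):
        self._fetch = fetch   # hypothetical callable returning the JWKS as bytes
        self._digest = None   # digest of the last document that was loaded
        self.keys = {}

    def refresh(self):
        """Intended to be called on a timer; returns True if the keys were replaced."""
        raw = self._fetch()
        digest = hashlib.sha256(raw).hexdigest()
        if digest == self._digest:
            return False
        document = json.loads(raw)
        self.keys = {key["kid"]: key for key in document["keys"]}
        self._digest = digest
        return True


# Simulated downloads: the second document rotates the key.
documents = [
    b'{"keys": [{"kid": "key-1", "kty": "RSA"}]}',
    b'{"keys": [{"kid": "key-1", "kty": "RSA"}]}',
    b'{"keys": [{"kid": "key-2", "kty": "RSA"}]}',
]
cache = JwksCache(lambda: documents.pop(0))
results = [cache.refresh() for _ in range(3)]
print(results, sorted(cache.keys))
```

Parsing before swapping means a malformed download raises and leaves the previous keys in place.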

We use Kudu's EasyCurl wrapper to download the file from the given URL.
curl was added to native-toolchain. This patch modified makefiles
and bootstrap_toolchain.py to integrate libcurl and libkudu_curl_util.

Added end-to-end JWT authentication test cases with JWKS specified as
HTTP/HTTPS URL.

Testing:
 - Passed core run, including new test cases.

Change-Id: Ic6ac8cf0010c13db30219776d1d275709bf211df
Reviewed-on: http://gerrit.cloudera.org:8080/17802
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-09-28 04:45:23 +00:00
wzhou-code
025500ccb5 IMPALA-10489: Implement JWT support
This patch added JWT support with the following functionality:
 * Load and parse JWKS from a pre-installed JSON file.
 * Read the JWT token from the HTTP Header.
 * Verify the JWT's signature with the public key in JWKS.
 * Get the username out of the payload of the JWT token.
 * Support the following JSON Web Algorithms (JWA):
   HS256, HS384, HS512, RS256, RS384, RS512.
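
The verification steps listed above can be sketched for the HS256 case with the Python standard library. This is an illustration only; the patch itself uses jwt-cpp in C++. Function names and the "username" claim name are hypothetical. Note that the HS* algorithms use a shared secret (HMAC), while the RS* algorithms verify with an RSA public key.

```python
import base64
import hashlib
import hmac
import json


def b64url_decode(segment):
    """Decode a base64url segment that has had its '=' padding stripped."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def b64url_encode(raw):
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def sign_hs256(header, payload, secret):
    """Build a token for the demo below (not part of verification)."""
    parts = [b64url_encode(json.dumps(p).encode("utf-8")) for p in (header, payload)]
    signing_input = ".".join(parts)
    mac = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url_encode(mac)


def verify_hs256(token, secret, username_claim="username"):
    """Check the signature, then return the username claim from the payload."""
    header_b64, payload_b64, signature_b64 = token.split(".")
    if json.loads(b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    signing_input = (header_b64 + "." + payload_b64).encode("ascii")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, b64url_decode(signature_b64)):
        raise ValueError("signature mismatch")
    return json.loads(b64url_decode(payload_b64))[username_claim]


token = sign_hs256({"alg": "HS256", "typ": "JWT"}, {"username": "alice"}, b"secret")
user = verify_hs256(token, b"secret")
print(user)
```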

We use the third-party library jwt-cpp to verify JWT tokens. jwt-cpp is a
header-only C++ library. It was added to native-toolchain.
This patch modified bootstrap_toolchain.py to download jwt-cpp from the
toolchain s3 bucket, and modified makefiles to add jwt-cpp/include
to the include path.

Added BE unit-tests for loading a JWKS file and verifying JWT tokens.
Also added an FE custom cluster test for JWT authentication.

Testing:
 - Passed core run.

Change-Id: I6b71fa854c9ddc8ca882878853395e1eb866143c
Reviewed-on: http://gerrit.cloudera.org:8080/17435
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-07-08 23:10:32 +00:00
Joe McDonnell
1f3160b4c0 IMPALA-8304: Generate JUnitXML if a command run by CMake fails
This wraps each command executed by CMake with a wrapper that
generates a JUnitXML file if the command fails. If the command
succeeds, the wrapper does nothing. The wrapper applies to C++
compilation, linking, and custom shell commands (such as
building the frontend via maven). It does not apply to failures
coming from CMake itself. It can be disabled by setting
DISABLE_CMAKE_JUNITXML.
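
The wrapper behavior described above could look roughly like the
following sketch. It is an illustration only and not the script added
by this patch: the function name, the report layout and the
"testsuite" name are assumptions.

```python
import subprocess
import sys
import xml.etree.ElementTree as ET


def run_and_report(cmd, report_path):
    """Run cmd; on failure write a minimal JUnitXML report.

    Returns the command's exit code so the build still fails.
    """
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    # The output may contain non-ASCII text (e.g. g++ smart quotes).
    output = proc.stdout.decode("utf-8", errors="replace")
    sys.stderr.write(output)  # keep the output visible in the build log
    if proc.returncode == 0:
        return 0  # success: no report is written
    suite = ET.Element("testsuite", name="build", tests="1", failures="1")
    case = ET.SubElement(suite, "testcase", name=" ".join(cmd))
    failure = ET.SubElement(
        case, "failure", message="exit code %d" % proc.returncode)
    failure.text = output
    ET.ElementTree(suite).write(
        report_path, encoding="utf-8", xml_declaration=True)
    return proc.returncode
```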

The command output can include Unicode (e.g. smart quotes for
g++), so this also updates generate_junitxml.py to handle
Unicode.

The wrapper interacts poorly with add_custom_command/add_custom_target
CMake commands that use 'cd directory && do_something', so this
switches those locations (in /docker) to use CMake's WORKING_DIRECTORY.

Testing:
 - Verified it does not impact a successful build (including with
   ccache and/or distcc).
 - Verified it generates JUnitXML for C++ and Java compilation
   failures.
 - Verified it doesn't use the wrapper when DISABLE_CMAKE_JUNITXML
   is set.

Change-Id: If71f2faf3ab5052b56b38f1b291fee53c390ce23
Reviewed-on: http://gerrit.cloudera.org:8080/12668
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-10-09 15:52:05 +00:00
Daniel Becker
5d9daf4051 IMPALA-10078: Proper codegen for KuduPartitionExpr
Implementing codegen for KuduPartitionExpr.

Testing:
 - Added a sanity test that checks that numbers are evenly distributed
   when inserted into a Kudu table.

Change-Id: Ifcae34f71b407837e2c5f1b97aa230e490a268df
Reviewed-on: http://gerrit.cloudera.org:8080/16419
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-09-25 14:02:13 +00:00
zhaorenhai
6390e7e1da IMPALA-9544 Replace Intel's SSE instructions with ARM's NEON instructions
Replace Intel's SSE instructions with ARM's NEON instructions
Replace Intel's crc32 instructions with ARM's instructions
Replace Intel's popcntq instruction with ARM's mechanism
Replace Intel's pcmpestri and pcmpestrm instructions
with ARM mechanism

Change-Id: Id7dfe17125b2910ece54e7dd18b4e4b25d7de8b9
Reviewed-on: http://gerrit.cloudera.org:8080/15531
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-08-24 16:49:15 +00:00
zhaorenhai
d45aac59d9 IMPALA-9676 Add aarch64 compile options for clang
Add signed-char and armv8a and crc compile options to clang

Change-Id: I69a5ff64bbd4427dd87ec6e884251e76d6a73122
Reviewed-on: http://gerrit.cloudera.org:8080/15755
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-08-13 02:22:31 +00:00
wzhou-code
c62a6808fc IMPALA-3741 [part 2]: Push runtime bloom filter to Kudu
Defined the BloomFilter class as a wrapper around kudu::BlockBloomFilter.
impala::BloomFilter builds runtime bloom filters through the
kudu::BlockBloomFilter APIs with FastHash as the default hash algorithm.
Removed the duplicated functions from the impala::BloomFilter class.
Pushed down bloom filters to Kudu through the Kudu client API.

Added a new query option ENABLED_RUNTIME_FILTER_TYPES to set the enabled
runtime filter types, which only affects Kudu scan nodes for now. By
default, bloom filters are not enabled and only min-max filters are
enabled for Kudu. With this option, users can enable bloom filters,
min-max filters, or both.

Added new test cases in PlannerTest and the end-to-end runtime_filters
test for pushing down bloom filters to Kudu.
Added test cases to compare the number of rows returned from Kudu
scans when applying different types of runtime filters to the same
queries.
Updated bloom-filter-benchmark due to the bloom-filter implementation
change.

Bump Kudu version to d652cab17.

Testing:
 - Passed all exhaustive tests.

Performance benchmark:
 - Ran single_node_perf_run.py on TPC-H at scale factor 30 for Parquet
   and Kudu. Verified that the new hash function and bloom-filter
   implementation don't cause regressions for HDFS bloom filters.
   For Kudu, there is one regression for query TPCH-Q9 and there
   are improvements for about 8 queries when applying both bloom and
   min-max filters. The bloom filter reduces the number of rows
   returned from the Kudu scan, and hence reduces the cost of
   aggregation and hash join. But bloom filter evaluation adds extra
   cost to the Kudu scan, which offsets the gain on aggregation and
   join. The Kudu scan needs to be optimized for bloom filters in
   follow-up tasks.
 - Ran bloom-filter microbenchmarks and verified that there is no
   regression for Insert/Find/Union functions with or without AVX2
   due to bloom-filter implementation changes. There is a small
   performance degradation for the Init function, but this function
   is not in the hot path.

Change-Id: I9100076f68ea299ddb6ec8bc027cac7a47f5d754
Reviewed-on: http://gerrit.cloudera.org:8080/15683
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-06-05 17:43:32 +00:00
Joe McDonnell
56ee90c598 IMPALA-9760: Add IMPALA_TOOLCHAIN_PACKAGES_HOME to prepare for GCC7
The locations for native-toolchain packages in IMPALA_TOOLCHAIN
currently do not include the compiler version. This means that
the toolchain can't distinguish between native-toolchain packages
built with gcc 4.9.2 versus gcc 7.5.0. The collisions can cause
issues when switching back and forth between branches.

This introduces the IMPALA_TOOLCHAIN_PACKAGES_HOME environment
variable, which is a location inside IMPALA_TOOLCHAIN that would
hold native-toolchain packages. Currently, it is set to the same
as IMPALA_TOOLCHAIN, so there is no difference in behavior.
This lays the groundwork to add the compiler version to this
path when switching to GCC7.

Testing:
 - The only impediment to building with
   IMPALA_TOOLCHAIN_PACKAGES_HOME=$IMPALA_TOOLCHAIN/test is
   Impala-lzo. With a custom Impala-lzo, compilation succeeds.
   Either Impala-lzo will be fixed or it will be removed.
 - Core tests

Change-Id: I1ff641e503b2161baf415355452f86b6c8bfb15b
Reviewed-on: http://gerrit.cloudera.org:8080/15991
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-30 16:25:37 +00:00
Tim Armstrong
c43c03c5ee IMPALA-3926: part 2: avoid setting LD_LIBRARY_PATH
This removes LD_LIBRARY_PATH and LD_PRELOAD from the
developer's shell and cleans it up. With the preceding
change, toolchain utilities like clang can be run without
a special LD_LIBRARY_PATH.

This fixes a bug where libjvm.so was registered as a
static instead of a shared library, which adds it to the
RUNPATH variable in the binary, which provides a default
search location that can be overridden by LD_LIBRARY_PATH.

Impala binaries don't have the rpath baked in for some
libraries, including Impala-lzo, libgcc and libstdc++,
so we still need to set LD_LIBRARY_PATH when running
those. That is solved with wrapper scripts that set
the environment variables only when invoking those
binaries, e.g. starting a daemon or running a backend
test. I added three scripts because there were 3 sets
of environment variables. The scripts are:
* run-binary.sh: just sets LD_LIBRARY_PATH
* run-jvm-binary.sh: sets LD_LIBRARY_PATH and CLASSPATH
* start-daemon.sh: sets LD_LIBRARY_PATH and CLASSPATH and
  kerberos-related environment variables.
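
The idea behind these wrappers, setting the variable for one child
process only and leaving the developer's shell untouched, can be
sketched as follows. The actual wrappers are shell scripts; this Python
version, its function name and the example directory are assumptions
used for illustration.

```python
import os
import subprocess
import sys


def run_with_library_path(cmd, library_dirs, base_env=None, **run_kwargs):
    """Run cmd with LD_LIBRARY_PATH extended for the child process only."""
    # Copy the environment so the caller's own environment is not modified.
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH")
    parts = list(library_dirs) + ([existing] if existing else [])
    env["LD_LIBRARY_PATH"] = os.pathsep.join(parts)
    return subprocess.run(cmd, env=env, **run_kwargs)


if __name__ == "__main__":
    # The child sees the extra directory; the parent environment does not.
    child = [sys.executable, "-c",
             "import os; print(os.environ['LD_LIBRARY_PATH'])"]
    run_with_library_path(child, ["/opt/example/lib"])
```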

The binaries, in almost all cases, work fine without
those tweaks, because libstdc++ and libgcc are picked
up along with libkuduclient.so from the toolchain (they
are in the same directory). I decided to leave good enough
alone here. run-binary.sh and friends can be used in
any remaining edge cases to run binaries.

An alternative to the 3 scripts would be to have an
uber-script that set all the variables, but I felt
that it was better to be specific about what
each binary needed. Cleaning the LD_LIBRARY_PATH
mess up has given me a distaste for scattershot
setting of environment variables. I am open to
revisiting this.

Testing:
* Ran tests on centos 7
* Manually tested that my dev env with
 LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu continued
 to work (for now). All ubuntu 16.04 and 18.04 dev
 envs that were set up with bootstrap_development.sh
 will be in this state.

Change-Id: I61c83e6cca6debb87a12135e58ee501244bc9603
Reviewed-on: http://gerrit.cloudera.org:8080/14494
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-07 08:50:44 +00:00
zhaorenhai
dbd22365fd IMPALA-9545 Decide cacheline size of aarch64
The L3 cacheline size on ARM64 differs between CPU vendors'
architectures. If the user defines CACHELINESIZE_AARCH64 in
impala-config-local.sh, that value is used. Otherwise the value
is read from the OS, and if that fails, the default value of 64
is used.
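
The three-step fallback above can be sketched like this. How the patch
actually queries the OS is not stated here, so the sysconf name used in
the probe is an assumption, as are the function names.

```python
import os

DEFAULT_CACHELINE_SIZE = 64


def probe_os_cacheline_size():
    """Ask the OS for the L3 cache line size; None if it is unavailable."""
    try:
        # Assumption: a glibc-style sysconf name; not taken from the patch.
        size = os.sysconf("SC_LEVEL3_CACHE_LINESIZE")
    except (ValueError, OSError, AttributeError):
        return None
    return size if size > 0 else None


def cacheline_size(env=os.environ, probe=probe_os_cacheline_size):
    # 1. An explicit user setting wins.
    configured = env.get("CACHELINESIZE_AARCH64")
    if configured:
        return int(configured)
    # 2. Otherwise ask the OS.
    probed = probe()
    if probed:
        return probed
    # 3. Fall back to the default.
    return DEFAULT_CACHELINE_SIZE
```

Passing the environment and the probe as parameters keeps each branch
of the fallback easy to exercise in isolation.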

Change-Id: Id56bfa63e4b6cd957c4997f10de78a5f4111f61f
Reviewed-on: http://gerrit.cloudera.org:8080/15555
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-07 15:58:27 +00:00
Zoltan Borok-Nagy
d9a38c0bac Limit short versions of KUDU_* macros to Kudu source.
So far we allowed short versions of Kudu macros in Impala by
defining KUDU_HEADERS_USE_SHORT_STATUS_MACROS in all compilation units.
I.e. DCHECK_OK meant KUDU_DCHECK_OK. This is a problem when we want to
define our own DCHECK_OK for an Impala Status object.

This commit limits KUDU_HEADERS_USE_SHORT_STATUS_MACROS to the Kudu
source files.

Unfortunately KUDU_HEADERS_USE_SHORT_STATUS_MACROS also controls some
Kudu log macros, e.g. KUDU_LOG. With the macro KUDU_LOG is defined to
LOG (glog). Without the macro KUDU_LOG is not defined, the Kudu library
expects that the clients will include 'kudu/client/stubs.h' which will
define KUDU_LOG to kudu::internal_logging::CerrLog. So far Impala didn't
use stubs.h (we define KUDU_HEADERS_NO_STUBS), and I think we still want
to use glog.
So in 'common/logging.h' I defined the KUDU logging macros to glog
macros (like if KUDU_HEADERS_USE_SHORT_STATUS_MACROS was still defined).

Change-Id: I06a65353d2a9eecf956e4ceb8d21eda2eebc69d5
Reviewed-on: http://gerrit.cloudera.org:8080/15596
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-31 13:10:04 +00:00
zhaorenhai
ded55d696a IMPALA-9427 Add -fsigned-char compile option
On the aarch64 platform, plain char is unsigned by default.
Set it to signed char with -fsigned-char to be compatible with x86-64.
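
To see why the signedness of char matters, the sketch below reads the
same byte both ways using Python's struct module as a stand-in for the
C types. The helper name is an assumption made for the example.

```python
import struct


def c_char_value(byte, char_is_signed):
    """Interpret one byte the way C would for a signed or unsigned char."""
    fmt = "b" if char_is_signed else "B"
    return struct.unpack(fmt, bytes([byte]))[0]


if __name__ == "__main__":
    # A check such as `c < 0` behaves differently on the two platforms
    # unless the signedness is pinned by a compiler option.
    print(c_char_value(0xFF, char_is_signed=True))
    print(c_char_value(0xFF, char_is_signed=False))
```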

Change-Id: I4f0f8159b2e1167413fa3b577ad1f8db1da983a5
Reviewed-on: http://gerrit.cloudera.org:8080/15299
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-02-26 21:36:22 +00:00
Joe McDonnell
091faebe61 IMPALA-8690 (prep 2): Switch DataCache to Impala's cache implementation
In a previous change, the Kudu caching code from be/src/kudu/util was
copied to be/src/util/cache. This gets that code workable for Impala
and uses it for the DataCache.

This involves the following changes:
1. Fix up includes to point to Impala's cache implementation
2. Remove NVM Cache support and code (not used by Impala)
3. Remove CacheMetrics code (not used by Impala)
4. Move the cache code to the impala namespace
5. Modify cache-bench and cache-test to use Impala's backend
   test infrastructure
6. Switch DataCache to use be/src/util/cache

These are only boilerplate changes. The cache implementation has not
meaningfully changed.

Testing:
 - Ran core tests
 - The modified version of Kudu's cache-test passes
 - The modified version of Kudu's cache-bench runs successfully

Change-Id: I917f8352c9276373dd2761af986bf3487855271c
Reviewed-on: http://gerrit.cloudera.org:8080/15170
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-02-20 23:53:34 +00:00
Sahil Takiar
ca6c8d43d7 IMPALA-5904: Add full_tsan option and fix several TSAN bugs
This patch adds an additional build flag -full_tsan in addition to the
existing -tsan build flag. -full_tsan is equivalent to the current -tsan
behavior, and -tsan is changed to set the ignore_noninstrumented_modules
flag to true. ignore_noninstrumented_modules causes TSAN to ignore any
modules that are not TSAN-instrumented. This is necessary to get TSAN to
play nicely with Java, since Java is not TSAN-instrumented (see
https://wiki.openjdk.java.net/display/tsan/Main and JDK-8208520). While
this might decrease the number of issues surfaced by TSAN, it drastically
decreases the amount of noise produced by TSAN because the JVM is not
running TSAN-instrumented code. Without this flag set to true, almost
every single backend test fails with the error:

WARNING: ThreadSanitizer: data race (pid=12939)
  Write of size 1 at 0x7fcbe379c4c6 by thread T31:
    #0 strncpy /mnt/source/llvm/llvm-5.0.1.src-p2/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:650 (unifiedbetests+0x1b2a4ad)
    #1 <null> <null> (libjvm.so+0x90e706)

This patch fixes various TSAN bugs (e.g. data races) reported while
running backend tests and E2E against a TSAN build (it does not make
Impala completely TSAN-clean). This patch makes the following changes:
* Fixes several bugs involving issues with updating shared variables
  between threads
* Fixes a few race conditions in test classes
* Where possible, existing locks are used to fix any data races; in cases
  where the locking logic is non-trivial, atomics are used
* There are a few places where variables are marked as 'volatile'
  presumably for synchronization purposes; TSAN flags these 'volatile'
  variables as unsafe, and according to
  https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#Rconc-volatile
  using 'volatile' for synchronization is dangerous; in these cases, the
  'volatile' variables are changed to 'atomic' variables
* This patch adds a suppression file (bin/tsan-suppresions.txt) similar to
  the UBSAN suppression file (bin/ubsan-suppresions.txt)

Testing:
* Ran exhaustive tests
* Ran core tests w/ ASAN build
* Manually re-ran backend tests against a TSAN build and made sure the
  reported errors are gone

Change-Id: I3d7ef5c228afd5882e145e6f53885b355d6c25a0
Reviewed-on: http://gerrit.cloudera.org:8080/15116
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-02-10 20:49:15 +00:00
Tim Armstrong
3ae5f98420 IMPALA-9034: fix distcc+ccache
According to the ccache manpage, the correct way to compose it with
distcc is to use CCACHE_PREFIX. I think this explains why local
ccache wasn't working for me.

I updated the distcc wrapper scripts to do this instead and
confirmed that it works - after doing a clean and rebuilding
the same branch, it is much faster and "ccache -s" shows a
bunch of cache hits.

ccache on the distcc server side still works.

Change-Id: Ie0b080709bd765056b9296d3deb805038fc01e5d
Reviewed-on: http://gerrit.cloudera.org:8080/14408
Reviewed-by: Andrew Sherman <asherman@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2019-10-15 23:34:16 +00:00
Joe McDonnell
4cc3ff9c67 IMPALA-8176: Convert simple backend tests to the unified executable
This converts tests with trivial main() functions to the unified
executable. This means that the code change is strictly removing
main() functions and updating the CMakeLists.txt files. Any test
that requires a change larger than that will be addressed
separately. The only exceptions are:
 - exec/incr-stats-util-test.cc requires naming changes to avoid
   conflicts with util/rle-test.cc
 - runtime/decimal-test.cc simplified the naming to make the
   CMakeLists.txt arguments easier.

The new test libraries are marked STATIC, because they are linked
into a single binary (unifiedbetests) and googletest has problems
with tests in shared libraries.

Converting this set of tests saves about 18GB of disk
space for a debug build and saves a minute or two of link time.

For any CMakeLists.txt that has unified tests, this adds a comment
for each test that is not unified.

Testing:
 - Ran backend tests in DEBUG and ASAN modes on Centos7
 - Ran backend tests in DEBUG mode on Centos6

Change-Id: I840d0f9b70edb3a7195a2a33b21fd2874d4c52bd
Reviewed-on: http://gerrit.cloudera.org:8080/13515
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-07-19 22:01:00 +00:00
Abhishek
51e8175c62 IMPALA-8450: Add support for zstd in parquet
Makefile was updated to include zstd in the ${IMPALA_HOME}/toolchain
directory. Other changes were made to make zstd headers and libs
accessible.

Classes ZstandardCompressor/ZstandardDecompressor were added to provide
interfaces for calling ZSTD_compress/ZSTD_decompress functions. Zstd
supports different compression levels (clevel) from 1 to
ZSTD_maxCLevel(). Zstd also supports negative clevels, but since the
negative values represent uncompressed data they are not supported. The
default clevel is ZSTD_CLEVEL_DEFAULT.

HdfsParquetTableWriter was updated to support ZSTD codec. The
new codecs can be set using existing query option as follows:
  set COMPRESSION_CODEC=ZSTD:<clevel>;
  set COMPRESSION_CODEC=ZSTD; // uses ZSTD_CLEVEL_DEFAULT
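
A minimal sketch of how a ZSTD:<clevel> value could be validated,
following the rules described above. This is not Impala's option
parser: the function name is hypothetical, and the numeric constants
are assumptions about the zstd library (its default level and the value
returned by ZSTD_maxCLevel()).

```python
ZSTD_CLEVEL_DEFAULT = 3  # assumption: zstd's default compression level
ZSTD_MAX_CLEVEL = 22     # assumption: value returned by ZSTD_maxCLevel()


def parse_zstd_codec(value):
    """Parse 'ZSTD' or 'ZSTD:<clevel>' into (codec, clevel)."""
    codec, sep, level = value.strip().partition(":")
    codec = codec.upper()
    if codec != "ZSTD":
        raise ValueError("this sketch only handles the ZSTD codec")
    if not sep:
        # No explicit level: fall back to the default.
        return codec, ZSTD_CLEVEL_DEFAULT
    clevel = int(level)  # raises ValueError for non-numeric input
    # Negative and zero levels are rejected, matching the text above.
    if not 1 <= clevel <= ZSTD_MAX_CLEVEL:
        raise ValueError("clevel must be in [1, %d]" % ZSTD_MAX_CLEVEL)
    return codec, clevel
```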

Testing:
  - Added unit test in DecompressorTest class with ZSTD_CLEVEL_DEFAULT
    clevel and a random clevel. The unit test decompresses compressed
    input data and validates the result. It also tests for
    expected behavior when passing an over/under sized buffer for
    decompressing.
  - Added unit tests for valid/invalid values for COMPRESSION_CODEC.
  - Added e2e test in test_insert_parquet.py which tests writing and
    reading (null/non-null) data into/from a table (with columns of
    different data types) using multiple codecs. Other existing e2e
    tests were updated to also use parquet/zstd table format.
  - Manual interoperability tests were run between Impala and Hive.

Change-Id: Id2c0e26e6f7fb2dc4024309d733983ba5197beb7
Reviewed-on: http://gerrit.cloudera.org:8080/13507
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-06-05 11:15:04 +00:00
Abhishek Rawat
a80ebb7a72 IMPALA-8375: Add metrics for spill disk usage
Added two new metrics tmp-file-mgr.scratch-space-bytes-used-high-water-mark
& tmp-file-mgr.scratch-space-bytes-used for tracking HWM and current
value for spilled bytes, respectively.

A new class AtomicHighWaterMarkGauge was added to keep track of the HWM
value. The new class also encapsulates a metric object which keeps track
of the current value for the spilled bytes.

The current value is incremented every time a new range is allocated from
a temporary file. The current value for spilled bytes is decremented when
a temporary file is closed. The new metrics are not updated when ranges
are recycled from a file. We can add a new metric in future for keeping
track of actual spilled bytes. The HWM value is updated whenever the
current value is greater than the HWM value.
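
The relationship between the current value and the HWM can be sketched
as below. The class in this patch is C++ and atomic-based; this
illustration swaps in a plain lock, and the class and method names are
assumptions.

```python
import threading


class HighWaterMarkGauge:
    """Tracks a current value and the highest value it has reached."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = 0
        self._hwm = 0

    def increment(self, delta):
        # delta is positive when a range is allocated and negative when
        # a temporary file is closed.
        with self._lock:
            self._current += delta
            if self._current > self._hwm:
                self._hwm = self._current

    @property
    def current(self):
        with self._lock:
            return self._current

    @property
    def high_water_mark(self):
        with self._lock:
            return self._hwm
```

Decrements never lower the HWM, so it keeps reporting the peak scratch
usage even after files are closed.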

Testing:
- Added new unit tests to the metrics-test test case.
- E2E testing for both the metrics by running concurrent spilling queries
  and ensuring that both the current value metric and the HWM metric were
  behaving as expected. Ran concurrent queries and monitored the metrics
  on the impala daemon's metric page.

Change-Id: Ia1b3dd604c7234a8d8af34d70ca731544a46d298
Reviewed-on: http://gerrit.cloudera.org:8080/12956
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-04-17 03:19:37 +00:00