16 Commits

Author SHA1 Message Date
Riza Suminto
ff5cbadec4 IMPALA-12864: Deflake test_query_log_size_in_bytes.
test_query_log_size_in_bytes is flaky in exhaustive builds. The expected
QueryStateRecord is off by around 100KB per query, indicating that the
test query might be too complex and lead to variability in final query
profile that is being archived.

This patch simplifies the test query and relaxes the assertion. This
patch also adds QUERY_LOG_EST_TOTAL_BYTES metric (key name
"impala-server.query-log-est-total-bytes") that is validated as well in
test_query_log_size_in_bytes. query-state-record-test.cc is modified a
bit to resolve segmentation fault issue when building
unified-be-test-validated-executable target run_clang_tidy.sh.

Testing:
- Pass test_query_log_size_in_bytes in exhaustive build.

Change-Id: Ic9b543ffa931a01c11d58e45774ea46af78b3a25
Reviewed-on: http://gerrit.cloudera.org:8080/21100
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-03-05 02:51:30 +00:00
jasonmfehr
97cd30c607 IMPALA-12426: Workload Management Supporting Changes
Contains several disparate pieces of functionality that support the
overall workload management work.

1. TNetworkAddressComparator
  An unused existing comparator for the Thrift class TNetworkAddress
  already existed in the thrift-util.h file. This comparator has been
  moved to the network-util.h file where it now resides in the same
  place as other utility functions that operator on TNetworkAddress
  instances.

  The existing comparator did not consider the uds address. It only
  considered hostname and port. The new comparator considers all three.

  Testing is accomplished by porting the existing ctests and adding
  additional ctests.

2. StringStreamPop
  This new class extends a std::basic_stringstream<char> to add a
  function that enables removing a character from the end.

  Testing is accomplished using new ctests.

3. Ticker
  This new header-only class notifies a condition variable at periodic
  intervals. It is a lightweight that sleeps until the configured
  duration has passed at which point it wakes up and notifies the
  condition variable. It also enables consumers to offload spurious
  wakeup guards to this class.

  Ctests have been added to test the functionality of this new class.

4. TUniqueId Empty Utility Function
  A new function UUIDEmpty returns true if a provided TUniqueID does
  not contain a UUID or false otherwise.

  Ctests have been added to test this new function.

5. run_clang_tidy.sh
  Additional informational outputs have been added to this script to
  enable tracking the status of the script and to easily identify
  errors found by clang tidy.

6. Summary Util Test
  A ctest was developed for testing the text table generation code that
  generates the exec summary portion of the query profile. This ctest
  was developed as part of an idea that did not ultimately pan out.
  Rather than throwing away that test code, it has been added as a new
  ctest.

Change-Id: Iee23334ec56a18b192a75d052468bf59159b6424
Reviewed-on: http://gerrit.cloudera.org:8080/21048
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-02-27 17:06:28 +00:00
Joe McDonnell
dd8ddf77c3 IMPALA-12668: Enable clang-tidy checks for implicit fallthrough
In switch/case statements, one case can fallthrough to the
next case. Sometimes this is intentional, but it is also a
common source of bugs (i.e. a missing break/return statement).
Clang-Tidy's clang-diagnostic-implicit-fallthrough flags
locations where a case falls through to the next case without
an explicit fallthrough declaration.

This change enables clang-diagnostic-implicit-fallthrough and
fixes failing locations. Since Impala uses C++17, this uses
C++17's [[fallthrough]] to indicate an explicit fallthrough.
This also adjusts clang-tidy's output to suggest [[fallthrough]]
as the preferred way to indicate fallthrough.

Testing:
 - Ran core job
 - Ran clang tidy

Change-Id: I6d65c92b442fa0317c3af228997571e124a54092
Reviewed-on: http://gerrit.cloudera.org:8080/20847
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zihao Ye <eyizoha@163.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
2024-01-04 19:36:20 +00:00
gaurav1086
3614a6a776 IMPALA-12390 (part 2): Enable some clang-tidy performance related checks
This enables the clang tidy performance check:
performance-inefficient-string-concatenation
"warning: string concatenation results in allocation of unnecessary
temporary strings"
Fix: Use StrCat() to concatenate multiple strings

Testing:
 - Ran bin/run_clang_tidy.sh with the new checks
 - Ran GVO

Change-Id: Ibad8bd0f12aab92ad874f5a6b9ec922dce7f3190
Reviewed-on: http://gerrit.cloudera.org:8080/20445
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-09-19 08:55:04 +00:00
Daniel Becker
817ca5920d IMPALA-10640: Support reading Parquet Bloom filters - most common types
This change adds read support for Parquet Bloom filters for types that
can reasonably be supported in Impala. Other types, such as CHAR(N),
would be very difficult to support because the length may be different
in Parquet and in Impala which results in truncation or padding, and
that changes the hash which makes using the Bloom filter impossible.
Write support will be added in a later change.
The supported Parquet type - Impala type pairs are the following:

 ---------------------------------------
|Parquet type |  Impala type            |
|---------------------------------------|
|INT32        |  TINYINT, SMALLINT, INT |
|INT64        |  BIGINT                 |
|FLOAT        |  FLOAT                  |
|DOUBLE       |  DOUBLE                 |
|BYTE_ARRAY   |  STRING                 |
 ---------------------------------------

The following types are not supported for the given reasons:

 ----------------------------------------------------------------
|Impala type |  Problem                                          |
|----------------------------------------------------------------|
|VARCHAR(N)  | truncation can change hash                        |
|CHAR(N)     | padding / truncation can change hash              |
|DECIMAL     | multiple encodings supported                      |
|TIMESTAMP   | multiple encodings supported, timezone conversion |
|DATE        | not considered yet                                |
 ----------------------------------------------------------------

Support may be added for these types later, see IMPALA-10641.

If a Bloom filter is available for a column that is fully dictionary
encoded, the Bloom filter is not used as the dictionary can give exact
results in filtering.

Testing:
  - Added tests/query_test/test_parquet_bloom_filter.py that tests
    whether Parquet Bloom filtering works for the supported types and
    that we do not incorrectly discard row groups for the unsupported
    type VARCHAR. The Parquet file used in the test was generated with
    an external tool.
  - Added unit tests for ParquetBloomFilter in file
    be/src/util/parquet-bloom-filter-test.cc
  - A minor, unrelated change was done in
    be/src/util/bloom-filter-test.cc: the MakeRandom() function had
    return type uint64_t, the documentation claimed it returned a 64 bit
    random number, but the actual number of random bits is 32, which is
    what is intended in the tests. The return type and documentation
    have been corrected to use 32 bits.

Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Reviewed-on: http://gerrit.cloudera.org:8080/17026
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>
2021-06-03 06:32:45 +00:00
Gabor Kaszab
6cb7cecacf IMPALA-10237: Support Bucket and Truncate partition transforms as built-in functions
This patch implements Truncate and Bucket partition transforms in
Impala BE as built-in functions. The expectation is that these
functions give the same result as Iceberg's implementation of the same
functions. These built-in functions are invisible so users won't be
able to invoke them e.g. from impala-shell.

Truncate:
  - Supported types are IntVal, BigIntVal, StringVal, DecimalVal.
  - Receives an input from the above types and a width.
  - Returns the same type as the input.
  - Expected behaviour is explained here:
    https://iceberg.apache.org/spec/#truncate-transform-details
Bucket:
  - Supported types are IntVal, BigIntVal, StringVal, DecimalVal,
    DateVal, TimestampVal.
  - Receives an input from the above types and the number of buckets as
    IntVal.
  - Returns IntVal.
  - Expected behaviour is explained here:
    https://iceberg.apache.org/spec/#bucket-transform-details

Change-Id: I485680cf79d96d578dd8cfbfd554bec468fe84bd
Reviewed-on: http://gerrit.cloudera.org:8080/16741
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-12-19 03:09:06 +00:00
Joe McDonnell
56ee90c598 IMPALA-9760: Add IMPALA_TOOLCHAIN_PACKAGES_HOME to prepare for GCC7
The locations for native-toolchain packages in IMPALA_TOOLCHAIN
currently do not include the compiler version. This means that
the toolchain can't distinguish between native-toolchain packages
built with gcc 4.9.2 versus gcc 7.5.0. The collisions can cause
issues when switching back and forth between branches.

This introduces the IMPALA_TOOLCHAIN_PACKAGES_HOME environment
variable, which is a location inside IMPALA_TOOLCHAIN that would
hold native-toolchain packages. Currently, it is set to the same
as IMPALA_TOOLCHAIN, so there is no difference in behavior.
This lays the groundwork to add the compiler version to this
path when switching to GCC7.

Testing:
 - The only impediment to building with
   IMPALA_TOOLCHAIN_PACKAGES_HOME=$IMPALA_TOOLCHAIN/test is
   Impala-lzo. With a custom Impala-lzo, compilation succeeds.
   Either Impala-lzo will be fixed or it will be removed.
 - Core tests

Change-Id: I1ff641e503b2161baf415355452f86b6c8bfb15b
Reviewed-on: http://gerrit.cloudera.org:8080/15991
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-30 16:25:37 +00:00
Gabor Kaszab
9a3108709b IMPALA-9631: Import HLL functionality from DataSketches
This patch imports the functionality needed for HLL approximate
algorithm from Apache DataSketches. I decided to copy the necessary
files into be/src/thirdparty/datasketches. Note, that the original
structure of files was changed during this process as originally hll/
and common/ libraries were both affected but I copied these into the
same directory so that Impala can compile them without rewriting the
include paths in the files themselves. Also note, that not the whole
common/ directory was copied just the files needed for HLL.

The git hash of the snapshot I used as a source for the files:
a6265b307a03085abe26c20413fdbf7d7a5eaf29

Browse the source files here:
https://github.com/apache/incubator-datasketches-cpp

Change-Id: I8ca8e77dcbb6b6c3b1e3bca7ab57cb7d3c018bbf
Reviewed-on: http://gerrit.cloudera.org:8080/15746
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-24 16:26:21 +00:00
Tim Armstrong
aa4c115e58 IMPALA-9312: Better error handling for run_clang_tidy.sh
This turns a buildall.sh failure into a job failure and
dumps full output of the script.

Change-Id: I4a90e5b48f0d75b0a1c52cfe629c22bd37c70b26
Reviewed-on: http://gerrit.cloudera.org:8080/15090
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
2020-01-22 17:18:06 +00:00
Joe McDonnell
27c788f826 IMPALA-7132: Filter out useless output from run_clang_tidy.sh
Clang's run-clang-tidy.py script produces a lot of
output even when there are no warnings or errors.
None of this output is useful.

This patch has two parts:
1. Bump LLVM to 5.0.1-p1, which has patched run-clang-tidy.py
   to make it reduce its own output when passed -quiet
   (along with other enhancements).
2. Pass -quiet to run-clang-tidy.py and pipe the stderr output
   to a temporary file. Display this output only if
   run-clang-tidy.py hits an error, as this output is not
   useful otherwise.

Testing with a known clang tidy issue shows that warnings
and errors are still in the output, and the output is
clean on a clean Impala checkout.

Change-Id: I63c46a7d57295eba38fac8ab49c7a15d2802df1d
Reviewed-on: http://gerrit.cloudera.org:8080/10615
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-07-17 04:58:28 +00:00
Jinchul
061287d912 IMPALA-5754: Rollback the exclusion of clang-tidy check for pcg-cpp
In the commit 4feb4f3a, the third party library pcg-cpp was excluded
from the clang-tidy check. It could make unexpected side effect, so
fixing some warnings from clang-tidy is better than avoidance of the
check.

Change-Id: I591d30373cb13f0eb89afbe16d81b1d3fb783365
Reviewed-on: http://gerrit.cloudera.org:8080/8829
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
2017-12-18 18:37:45 +00:00
Jinchul
4feb4f3a54 IMPALA-5754: Improve randomness of rand()/random()
Currently implementation of rand/random built-in functions
use rand_r of C library. We recognized its randomness was poor.
pcg32 of third party library shows better randomness than rand_r.

Testing:
Revise unit test in expr-test
Add E2E test to random.test

Change-Id: Idafdd5fe7502ff242c76a91a815c565146108684
Reviewed-on: http://gerrit.cloudera.org:8080/8355
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
2017-12-13 10:04:40 +00:00
Alex Behm
0936e32966 IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
Adds a new SAMPLED_NDV() aggregate function that is
intended to be used in COMPUTE STATS TABLESAMPLE.
This patch only adds the function itself. Integration
with COMPUTE STATS will come in a separate patch.

SAMPLED_NDV() estimates the number of distinct values (NDV)
based on a sample of data and the corresponding sampling rate.
The main idea is to collect several x/y data points where x is
the number of rows and y is the corresponding NDV estimate.
These data points are used to fit an objective function to the
data such that the true NDV can be extrapolated.
The aggregate function maintains a fixed number of HyperLogLog
intermediates to compute the x/y points.
Several objective functions are fit and the best-fit one is
used for extrapolation.

Adds the MPFIT C library to perform curve fitting:
https://www.physics.wisc.edu/~craigm/idl/cmpfit.html

The library is a C port from Fortran. Scipy uses the
Fortran version of the library for curve fitting.

Testing:
- added functional tests
- core/hdfs run passed

Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Reviewed-on: http://gerrit.cloudera.org:8080/8569
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-12 22:20:18 +00:00
Henry Robinson
1cbfa423b4 IMPALA-4669: [KUTIL] Add kudu_util library to the build.
A few miscellaneous changes to allow kudu_util to compile with Impala.

Add kudu_version.cc to substitute for the version.cc file that is
automatically built during the full Kudu build.

Set LZ4_DISABLE_DEPRECATE_WARNINGS to allow Kudu's compressor utility to
use deprecated names for LZ4 methods.

Add NO_NVM_SUPPORT flag to Kudu build (plan to upstream this later) to
disable building with nvm support, removing a library dependency.

Also remove imported FindOpenSSL.cmake in favour of the standard one provided
by cmake itself.

Finally, a few changes to allow compilation on RHEL5:

* Only use sched_getcpu() if supported
* Only include magic.h if available
* Workaround for kernels that don't have SOCK_NONBLOCK
* Workaround for kernels that don't have O_CLOEXEC (ignore the flag)
* Provide non-working implementation of fallocate()
* Disable inclusion of linux/fiemap.h - although this exists on RHEL5,
  it does not compile due to other #includes in env_posix.cc. We disable
  the path this is used for, since Impala does not call that code.
* Use Kudu's implementation of pipe(2), preadv(2) and pwritev(2) where
  it doesn't exist.

In most cases these changes simply force kutil to revert to a different
implementation that was already written for OSX support - this patch
generalises the logic to provide the implementation whenever the
required function doesn't exist.

This patch compiles on RHEL5.5 and 6.0, SLES11 and 12, Ubuntu 12.04 and
14.04 and Debian 7.0 and 8.0.

Change-Id: I8da798ee55506d6f969416b17c191eb03cb215f5
Reviewed-on: http://gerrit.cloudera.org:8080/5715
Tested-by: Impala Public Jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
2017-07-28 21:47:07 +00:00
Jim Apple
9434e38abb clang-tidy should tidy tests; fix alignas error in clang builds.
run_clang_tidy.sh was mistakenly using -notests, which doesn't even
compile the tests, rather than -skiptests, which compiles (but does
not run) the backend tests.

When I discovered this, I also found that all clang builds (including
tidy and asan) had been broken by my previous alignas commit
(10a4c5a2e4). This patch fixes that as
well.

Change-Id: Ib7066039f78d7ee039db619b96e25665b4d63503
Reviewed-on: http://gerrit.cloudera.org:8080/5094
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
2016-11-15 23:33:45 +00:00
Jim Apple
14891fe004 IMPALA-3676: Use clang as a static analysis tool
This patch adds a script to run clang-tidy over the whole code
base. It is a first step towards running clang-tidy over patches as a
tool to help users spot bugs before code review.

Because of the number of clang-tidy checks, this patch only addresses
some of them. In particular, only checks starting with 'clang' are
considered. Many of them which are flaky or not part of our style are
excluded from the analysis. This patch also exlcudes some checks which
are part of our current style but which would be too laborious to fix
over the entire codebase, like using nullptr rather than NULL.

This patch also fixes a number of small bugs found by clang-tidy.

Finally, this patch adds the class AlignedNew, the purpose of which is
to provide correct alignment on heap-allocated data. The global new
operator only guarantees 16-byte alignment. A class that includes a
member variable that must be aligned on a k-byte boundary for k>16 can
inherit from AlignedNew<k> to ensure correct alignment on the heap,
quieting clang's -Wover-aligned warning. (Static and stack allocation
are required by the standard to respect the alignment of the type and
its member variables, so no extra code is needed for allocation in
those places.)

Change-Id: I4ed168488cb30ddeccd0087f3840541d858f9c06
Reviewed-on: http://gerrit.cloudera.org:8080/4758
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
2016-11-04 00:13:12 +00:00