test_query_log_size_in_bytes is flaky in exhaustive builds. The expected
QueryStateRecord is off by around 100KB per query, indicating that the
test query might be too complex and lead to variability in final query
profile that is being archived.
This patch simplifies the test query and relaxes the assertion. This
patch also adds QUERY_LOG_EST_TOTAL_BYTES metric (key name
"impala-server.query-log-est-total-bytes") that is validated as well in
test_query_log_size_in_bytes. query-state-record-test.cc is modified a
bit to resolve segmentation fault issue when building
unified-be-test-validated-executable target run_clang_tidy.sh.
Testing:
- Pass test_query_log_size_in_bytes in exhaustive build.
Change-Id: Ic9b543ffa931a01c11d58e45774ea46af78b3a25
Reviewed-on: http://gerrit.cloudera.org:8080/21100
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Contains several disparate pieces of functionality that support the
overall workload management work.
1. TNetworkAddressComparator
An unused existing comparator for the Thrift class TNetworkAddress
already existed in the thrift-util.h file. This comparator has been
moved to the network-util.h file where it now resides in the same
place as other utility functions that operator on TNetworkAddress
instances.
The existing comparator did not consider the uds address. It only
considered hostname and port. The new comparator considers all three.
Testing is accomplished by porting the existing ctests and adding
additional ctests.
2. StringStreamPop
This new class extends a std::basic_stringstream<char> to add a
function that enables removing a character from the end.
Testing is accomplished using new ctests.
3. Ticker
This new header-only class notifies a condition variable at periodic
intervals. It is a lightweight that sleeps until the configured
duration has passed at which point it wakes up and notifies the
condition variable. It also enables consumers to offload spurious
wakeup guards to this class.
Ctests have been added to test the functionality of this new class.
4. TUniqueId Empty Utility Function
A new function UUIDEmpty returns true if a provided TUniqueID does
not contain a UUID or false otherwise.
Ctests have been added to test this new function.
5. run_clang_tidy.sh
Additional informational outputs have been added to this script to
enable tracking the status of the script and to easily identify
errors found by clang tidy.
6. Summary Util Test
A ctest was developed for testing the text table generation code that
generates the exec summary portion of the query profile. This ctest
was developed as part of an idea that did not ultimately pan out.
Rather than throwing away that test code, it has been added as a new
ctest.
Change-Id: Iee23334ec56a18b192a75d052468bf59159b6424
Reviewed-on: http://gerrit.cloudera.org:8080/21048
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In switch/case statements, one case can fallthrough to the
next case. Sometimes this is intentional, but it is also a
common source of bugs (i.e. a missing break/return statement).
Clang-Tidy's clang-diagnostic-implicit-fallthrough flags
locations where a case falls through to the next case without
an explicit fallthrough declaration.
This change enables clang-diagnostic-implicit-fallthrough and
fixes failing locations. Since Impala uses C++17, this uses
C++17's [[fallthrough]] to indicate an explicit fallthrough.
This also adjusts clang-tidy's output to suggest [[fallthrough]]
as the preferred way to indicate fallthrough.
Testing:
- Ran core job
- Ran clang tidy
Change-Id: I6d65c92b442fa0317c3af228997571e124a54092
Reviewed-on: http://gerrit.cloudera.org:8080/20847
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zihao Ye <eyizoha@163.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
This enables the clang tidy performance check:
performance-inefficient-string-concatenation
"warning: string concatenation results in allocation of unnecessary
temporary strings"
Fix: Use StrCat() to concatenate multiple strings
Testing:
- Ran bin/run_clang_tidy.sh with the new checks
- Ran GVO
Change-Id: Ibad8bd0f12aab92ad874f5a6b9ec922dce7f3190
Reviewed-on: http://gerrit.cloudera.org:8080/20445
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change adds read support for Parquet Bloom filters for types that
can reasonably be supported in Impala. Other types, such as CHAR(N),
would be very difficult to support because the length may be different
in Parquet and in Impala which results in truncation or padding, and
that changes the hash which makes using the Bloom filter impossible.
Write support will be added in a later change.
The supported Parquet type - Impala type pairs are the following:
---------------------------------------
|Parquet type | Impala type |
|---------------------------------------|
|INT32 | TINYINT, SMALLINT, INT |
|INT64 | BIGINT |
|FLOAT | FLOAT |
|DOUBLE | DOUBLE |
|BYTE_ARRAY | STRING |
---------------------------------------
The following types are not supported for the given reasons:
----------------------------------------------------------------
|Impala type | Problem |
|----------------------------------------------------------------|
|VARCHAR(N) | truncation can change hash |
|CHAR(N) | padding / truncation can change hash |
|DECIMAL | multiple encodings supported |
|TIMESTAMP | multiple encodings supported, timezone conversion |
|DATE | not considered yet |
----------------------------------------------------------------
Support may be added for these types later, see IMPALA-10641.
If a Bloom filter is available for a column that is fully dictionary
encoded, the Bloom filter is not used as the dictionary can give exact
results in filtering.
Testing:
- Added tests/query_test/test_parquet_bloom_filter.py that tests
whether Parquet Bloom filtering works for the supported types and
that we do not incorrectly discard row groups for the unsupported
type VARCHAR. The Parquet file used in the test was generated with
an external tool.
- Added unit tests for ParquetBloomFilter in file
be/src/util/parquet-bloom-filter-test.cc
- A minor, unrelated change was done in
be/src/util/bloom-filter-test.cc: the MakeRandom() function had
return type uint64_t, the documentation claimed it returned a 64 bit
random number, but the actual number of random bits is 32, which is
what is intended in the tests. The return type and documentation
have been corrected to use 32 bits.
Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Reviewed-on: http://gerrit.cloudera.org:8080/17026
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>
This patch implements Truncate and Bucket partition transforms in
Impala BE as built-in functions. The expectation is that these
functions give the same result as Iceberg's implementation of the same
functions. These built-in functions are invisible so users won't be
able to invoke them e.g. from impala-shell.
Truncate:
- Supported types are IntVal, BigIntVal, StringVal, DecimalVal.
- Receives an input from the above types and a width.
- Returns the same type as the input.
- Expected behaviour is explained here:
https://iceberg.apache.org/spec/#truncate-transform-details
Bucket:
- Supported types are IntVal, BigIntVal, StringVal, DecimalVal,
DateVal, TimestampVal.
- Receives an input from the above types and the number of buckets as
IntVal.
- Returns IntVal.
- Expected behaviour is explained here:
https://iceberg.apache.org/spec/#bucket-transform-details
Change-Id: I485680cf79d96d578dd8cfbfd554bec468fe84bd
Reviewed-on: http://gerrit.cloudera.org:8080/16741
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The locations for native-toolchain packages in IMPALA_TOOLCHAIN
currently do not include the compiler version. This means that
the toolchain can't distinguish between native-toolchain packages
built with gcc 4.9.2 versus gcc 7.5.0. The collisions can cause
issues when switching back and forth between branches.
This introduces the IMPALA_TOOLCHAIN_PACKAGES_HOME environment
variable, which is a location inside IMPALA_TOOLCHAIN that would
hold native-toolchain packages. Currently, it is set to the same
as IMPALA_TOOLCHAIN, so there is no difference in behavior.
This lays the groundwork to add the compiler version to this
path when switching to GCC7.
Testing:
- The only impediment to building with
IMPALA_TOOLCHAIN_PACKAGES_HOME=$IMPALA_TOOLCHAIN/test is
Impala-lzo. With a custom Impala-lzo, compilation succeeds.
Either Impala-lzo will be fixed or it will be removed.
- Core tests
Change-Id: I1ff641e503b2161baf415355452f86b6c8bfb15b
Reviewed-on: http://gerrit.cloudera.org:8080/15991
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch imports the functionality needed for HLL approximate
algorithm from Apache DataSketches. I decided to copy the necessary
files into be/src/thirdparty/datasketches. Note, that the original
structure of files was changed during this process as originally hll/
and common/ libraries were both affected but I copied these into the
same directory so that Impala can compile them without rewriting the
include paths in the files themselves. Also note, that not the whole
common/ directory was copied just the files needed for HLL.
The git hash of the snapshot I used as a source for the files:
a6265b307a03085abe26c20413fdbf7d7a5eaf29
Browse the source files here:
https://github.com/apache/incubator-datasketches-cpp
Change-Id: I8ca8e77dcbb6b6c3b1e3bca7ab57cb7d3c018bbf
Reviewed-on: http://gerrit.cloudera.org:8080/15746
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Clang's run-clang-tidy.py script produces a lot of
output even when there are no warnings or errors.
None of this output is useful.
This patch has two parts:
1. Bump LLVM to 5.0.1-p1, which has patched run-clang-tidy.py
to make it reduce its own output when passed -quiet
(along with other enhancements).
2. Pass -quiet to run-clang-tidy.py and pipe the stderr output
to a temporary file. Display this output only if
run-clang-tidy.py hits an error, as this output is not
useful otherwise.
Testing with a known clang tidy issue shows that warnings
and errors are still in the output, and the output is
clean on a clean Impala checkout.
Change-Id: I63c46a7d57295eba38fac8ab49c7a15d2802df1d
Reviewed-on: http://gerrit.cloudera.org:8080/10615
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In the commit 4feb4f3a, the third party library pcg-cpp was excluded
from the clang-tidy check. It could make unexpected side effect, so
fixing some warnings from clang-tidy is better than avoidance of the
check.
Change-Id: I591d30373cb13f0eb89afbe16d81b1d3fb783365
Reviewed-on: http://gerrit.cloudera.org:8080/8829
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
Currently implementation of rand/random built-in functions
use rand_r of C library. We recognized its randomness was poor.
pcg32 of third party library shows better randomness than rand_r.
Testing:
Revise unit test in expr-test
Add E2E test to random.test
Change-Id: Idafdd5fe7502ff242c76a91a815c565146108684
Reviewed-on: http://gerrit.cloudera.org:8080/8355
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
Adds a new SAMPLED_NDV() aggregate function that is
intended to be used in COMPUTE STATS TABLESAMPLE.
This patch only adds the function itself. Integration
with COMPUTE STATS will come in a separate patch.
SAMPLED_NDV() estimates the number of distinct values (NDV)
based on a sample of data and the corresponding sampling rate.
The main idea is to collect several x/y data points where x is
the number of rows and y is the corresponding NDV estimate.
These data points are used to fit an objective function to the
data such that the true NDV can be extrapolated.
The aggregate function maintains a fixed number of HyperLogLog
intermediates to compute the x/y points.
Several objective functions are fit and the best-fit one is
used for extrapolation.
Adds the MPFIT C library to perform curve fitting:
https://www.physics.wisc.edu/~craigm/idl/cmpfit.html
The library is a C port from Fortran. Scipy uses the
Fortran version of the library for curve fitting.
Testing:
- added functional tests
- core/hdfs run passed
Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Reviewed-on: http://gerrit.cloudera.org:8080/8569
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
A few miscellaneous changes to allow kudu_util to compile with Impala.
Add kudu_version.cc to substitute for the version.cc file that is
automatically built during the full Kudu build.
Set LZ4_DISABLE_DEPRECATE_WARNINGS to allow Kudu's compressor utility to
use deprecated names for LZ4 methods.
Add NO_NVM_SUPPORT flag to Kudu build (plan to upstream this later) to
disable building with nvm support, removing a library dependency.
Also remove imported FindOpenSSL.cmake in favour of the standard one provided
by cmake itself.
Finally, a few changes to allow compilation on RHEL5:
* Only use sched_getcpu() if supported
* Only include magic.h if available
* Workaround for kernels that don't have SOCK_NONBLOCK
* Workaround for kernels that don't have O_CLOEXEC (ignore the flag)
* Provide non-working implementation of fallocate()
* Disable inclusion of linux/fiemap.h - although this exists on RHEL5,
it does not compile due to other #includes in env_posix.cc. We disable
the path this is used for, since Impala does not call that code.
* Use Kudu's implementation of pipe(2), preadv(2) and pwritev(2) where
it doesn't exist.
In most cases these changes simply force kutil to revert to a different
implementation that was already written for OSX support - this patch
generalises the logic to provide the implementation whenever the
required function doesn't exist.
This patch compiles on RHEL5.5 and 6.0, SLES11 and 12, Ubuntu 12.04 and
14.04 and Debian 7.0 and 8.0.
Change-Id: I8da798ee55506d6f969416b17c191eb03cb215f5
Reviewed-on: http://gerrit.cloudera.org:8080/5715
Tested-by: Impala Public Jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
run_clang_tidy.sh was mistakenly using -notests, which doesn't even
compile the tests, rather than -skiptests, which compiles (but does
not run) the backend tests.
When I discovered this, I also found that all clang builds (including
tidy and asan) had been broken by my previous alignas commit
(10a4c5a2e4). This patch fixes that as
well.
Change-Id: Ib7066039f78d7ee039db619b96e25665b4d63503
Reviewed-on: http://gerrit.cloudera.org:8080/5094
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
This patch adds a script to run clang-tidy over the whole code
base. It is a first step towards running clang-tidy over patches as a
tool to help users spot bugs before code review.
Because of the number of clang-tidy checks, this patch only addresses
some of them. In particular, only checks starting with 'clang' are
considered. Many of them which are flaky or not part of our style are
excluded from the analysis. This patch also exlcudes some checks which
are part of our current style but which would be too laborious to fix
over the entire codebase, like using nullptr rather than NULL.
This patch also fixes a number of small bugs found by clang-tidy.
Finally, this patch adds the class AlignedNew, the purpose of which is
to provide correct alignment on heap-allocated data. The global new
operator only guarantees 16-byte alignment. A class that includes a
member variable that must be aligned on a k-byte boundary for k>16 can
inherit from AlignedNew<k> to ensure correct alignment on the heap,
quieting clang's -Wover-aligned warning. (Static and stack allocation
are required by the standard to respect the alignment of the type and
its member variables, so no extra code is needed for allocation in
those places.)
Change-Id: I4ed168488cb30ddeccd0087f3840541d858f9c06
Reviewed-on: http://gerrit.cloudera.org:8080/4758
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins