The udf_samples makefile doesn't use ${CLANG_INCLUDE_FLAGS}, so it uses
the default boost installation. If the dev environment has a very old boost
installed, you can get the following compile error:
../udf/udf.h:143:3: error: unknown type name 'uint8_t'
uint8_t* Allocate(int byte_size);
^
Change-Id: I3878b9d73d6022855b0cfbbdbee17eaf4c2557e1
Reviewed-on: http://gerrit.cloudera.org:8080/692
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
This patch provides the last fixes needed to finally enable the toolchain:
- Remove the static OpenSSL dependency
- Fix inline assembly problems in ASAN
- Work around the non-relocatable LLVM 3.3 by adding manual system
includes, which fixes issues with hardcoded header paths in clang.
When the toolchain is enabled and we build for ASAN we use a specific
toolchain file to build with LLVM-trunk as the main compiler. Although
the Impala code is then compiled with LLVM-trunk, codegen still uses
LLVM 3.3. In addition, this enables us to follow up with
TSAN and LEAKSAN.
Change-Id: I0abb914ca3f192cb7edd83ead134bc9e2d02071f
Reviewed-on: http://gerrit.cloudera.org:8080/556
Tested-by: Internal Jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
If a static version of zlib and bzip2 was picked up, we assumed that it
had been compiled with -fPIC. However, this is not always the case. Thus,
in the non-toolchain case we now explicitly link zlib and bzip2
dynamically for the dynamic targets.
In addition, this patch removes static linking of libgcc in the
toolchain case as LLVM is not able to find the exception handling
symbols even if they are present in the binary. Static linking of libgcc
is postponed.
Next, if Impala was built with -notests, the external data source thrift
files were not generated. This patch makes sure the dependencies are
expressed correctly.
Finally, if a user had Google perftools installed on the system, we
would accidentally pick up the system libraries together with the
thirdparty headers, which ends in linker errors. This patch fixes the
path issues.
Change-Id: Ic000101c33da26d75a0cd733f7ef02f1bd694937
Reviewed-on: http://gerrit.cloudera.org:8080/460
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
This patch makes it possible to optionally enable the new Impala binary
toolchain. For now there are no major version differences between the
toolchain dependencies and what is currently kept in thirdparty.
To enable the toolchain, export the variable IMPALA_TOOLCHAIN, set to the
folder where the binaries are available.
In addition this patch moves gutil from the thirdparty directory into
the source tree of be/src to allow easy propagation of compiler and
linker flags. Furthermore, the thrift-cpp target was added as a
dependency to all targets that require the generated thrift sources to
be available before the build is started.
What is the new toolchain? Its goal is to homogenize
the build environment and to make sure that Impala is built nearly
identically on every platform. To achieve this, we limit the flexibility
of using the host system's libraries and instead rely on a set of
custom-built binaries, including the necessary compiler.
Change-Id: If2dac920520e4a18be2a9a75b3184a5bd97a065b
Reviewed-on: http://gerrit.cloudera.org:8080/427
Reviewed-by: Adar Dembo <adar@cloudera.com>
Tested-by: Internal Jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
The command line used:
git ls-files *.h | xargs sed -i '14,$s/^\( *\/\/\) /\1\/ /g'
...then some manual fix-up to remove false positives on inlined
functions that contain comments.
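For readers less familiar with sed, the effect of the expression on one
file can be sketched in C++ as below. This is only an illustration of what
the command does, not code from the patch; PromoteComments and its
first_line parameter are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper mirroring the sed expression: from `first_line`
// (1-based, 14 in the command above) onwards, a line that starts with
// optional spaces, "//" and one space gets a third slash, so that
// "  // foo" becomes "  /// foo". Trailing comments are left alone.
std::vector<std::string> PromoteComments(std::vector<std::string> lines,
                                         std::size_t first_line = 14) {
  assert(first_line >= 1);
  static const std::regex kPattern("^( *//) ");
  for (std::size_t i = first_line - 1; i < lines.size(); ++i) {
    lines[i] = std::regex_replace(lines[i], kPattern, "$1/ ");
  }
  return lines;
}
```

Because the pattern is anchored at the start of the line, a comment that
follows code on the same line is not touched. Full-line comments inside
inlined function bodies do match, which is why the manual fix-up was needed.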
Change-Id: Ia835ae21f189d5a8dc5627fb3983081a0bd1f1e2
Reviewed-on: http://gerrit.cloudera.org:8080/305
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
This patch removes all occurrences of "using namespace std" and "using
namespace boost(.*)" from the codebase. However, there are still cases
where namespace directives are used (e.g. for rapidjson, thrift,
gutil). These have to be tackled in subsequent patches.
To reduce the patch size, this patch introduces a new header file called
"names.h" that adds using declarations for many of our most frequently
used symbols iff the corresponding header was already included. For
example, vector is only pulled in iff <vector> was already included, and
likewise for map, string, etc. This requires "common/names.h" to be the
last include. After the include of `names.h`, a new block contains a sorted
list of using declarations (this patch does not fix using directives for
namespaces other than std / boost).
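A minimal sketch of how such a conditional header can work is shown below.
It assumes that the standard library's include guards (libstdc++ names,
with libc++ names as a fallback) are used to detect earlier includes; the
CountWords function is a hypothetical caller and not part of the patch.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// --- sketch of what a names.h-style header could contain ---
// Assumption: the include guards of the standard headers tell us which
// headers the including file has already pulled in.
#if defined(_GLIBCXX_VECTOR) || defined(_LIBCPP_VECTOR)
using std::vector;
#endif
#if defined(_GLIBCXX_STRING) || defined(_LIBCPP_STRING)
using std::string;
#endif
#if defined(_GLIBCXX_MAP) || defined(_LIBCPP_MAP)
using std::map;
#endif
// --- end of sketch ---

// Code after the include can then use the unqualified names.
int CountWords(const vector<string>& words, map<string, int>* counts) {
  assert(counts != nullptr);
  for (const string& w : words) ++(*counts)[w];
  return static_cast<int>(counts->size());
}
```

This also shows why the header has to be the last include: a standard
header included after it would not get its using declaration.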
Change-Id: Iebe4c054670d655bc355347e381dae90999cfddf
Reviewed-on: http://gerrit.cloudera.org:8080/338
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
This patch adds the ability to compute and drop column and table
statistics at partition granularity.
The following commands are added. Details of the implementation
follow.
COMPUTE INCREMENTAL STATS <tbl_name> [PARTITION <partition_spec>]
This variant of COMPUTE STATS will, ultimately, do the same thing as the
traditional COMPUTE STATS statement, but does so by caching the
intermediate state of the computation for each partition in the Hive
MetaStore. If the PARTITION clause is added, the computation is
performed for only that partition. If the PARTITION clause is omitted,
incremental stats are updated only for those partitions with missing
incremental stats (e.g. one column does not have stats, or incremental
stats were never computed for this partition). In this patch, incremental
stats are only invalidated when a DROP STATS variant is executed. Future
patches can automatically invalidate the statistics after REFRESH or
INSERT queries, etc.
DROP INCREMENTAL STATS <tbl_name> PARTITION <part_spec>
This variant of DROP STATS removes the incremental statistics for the
given partition. It does *not* recalculate the statistics for the whole
table, so this should be used only to invalidate the intermediate state
for a partition which will shortly be subject to COMPUTE INCREMENTAL
STATS. The point of this variant is to allow users to notify Impala when
they believe a partition has changed significantly enough to warrant
recomputation of its statistics. It is not necessary for new partitions;
Impala will detect that they do not have any valid statistics.
--------
This is achieved by adapting the existing HLL UDA: its finalize method
is swapped for a new one that returns the intermediate HLL
buckets, rather than aggregating and then disposing of them. This
intermediate state is then returned to Impala's catalog-op-executor.cc,
which then passes the intermediate state back to the frontend to be
ultimately stored in the HMS.
This intermediate state is computed on a per-partition basis by grouping
the input to the UDA by partition. Thus, the incremental computation
produces one row for each partition selected (the set of which might be
quite small, if there are few partitions without valid incremental
stats: this is the point of the new commands).
At the same time, the query coordinator aggregates the output of the UDA
to produce table-level statistics. This computation incorporates any
existing (and not re-computed) intermediate partition state which is
passed to the coordinator by the frontend. The resulting statistics are
saved to the table as normal.
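The table-level aggregation can be pictured with the sketch below. It
assumes that per-partition states are combined with the usual HyperLogLog
union, i.e. the element-wise maximum of the buckets; HllRegisters and
MergePartitions are hypothetical names, not identifiers from the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical type: the HLL buckets of one column in one partition.
typedef std::vector<uint8_t> HllRegisters;

// Merges per-partition buckets into table-level buckets.
// Assumption: the merge is the standard HyperLogLog union, which takes the
// element-wise maximum. All inputs must have the same length.
HllRegisters MergePartitions(const std::vector<HllRegisters>& partitions) {
  HllRegisters merged;
  for (const HllRegisters& p : partitions) {
    if (merged.empty()) merged.assign(p.size(), 0);
    assert(p.size() == merged.size());
    for (std::size_t i = 0; i < p.size(); ++i) {
      merged[i] = std::max(merged[i], p[i]);
    }
  }
  return merged;
}
```

Because the union is order-independent, freshly computed partitions and
stored intermediate state for untouched partitions can be merged in any
order.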
Intermediate statistics are serialised to the HMS by writing a Thrift
structure's serialised form to the partition's 'parameters' map. There
is a schema-imposed limit of 4000 characters to the serialised string,
which is exacerbated by the fact that the Thrift representation must
first be base-64 encoded to avoid type errors in the HMS. The current
patch breaks the encoded structure into 4k chunks, and then recombines
them on read. The alltypes table (11 columns) takes about three of these
chunks. This may mean that incremental stats are not suitable for
particularly wide tables; zipping these structures before encoding
could save some space. In the meantime, the NDV estimates are
run-length encoded (since they are generally sparse); this can result in
substantial space savings.
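The chunking and the run-length encoding can be sketched as follows. This
is a simplified illustration under stated assumptions: the chunk size is
taken to be the 4000-character limit, the input is assumed to be base-64
encoded already, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Splits an already base-64 encoded string into pieces of at most
// `max_len` characters (4000 is the limit quoted above).
std::vector<std::string> SplitIntoChunks(const std::string& encoded,
                                         std::size_t max_len = 4000) {
  assert(max_len > 0);
  std::vector<std::string> chunks;
  for (std::size_t pos = 0; pos < encoded.size(); pos += max_len) {
    chunks.push_back(encoded.substr(pos, max_len));
  }
  return chunks;
}

// Recombines the chunks in order; the inverse of SplitIntoChunks().
std::string JoinChunks(const std::vector<std::string>& chunks) {
  std::string encoded;
  for (const std::string& c : chunks) encoded += c;
  return encoded;
}

// Run-length encodes a bucket array as (run length, value) pairs. Sparse
// arrays, which are mostly zero, collapse into a few long runs.
std::vector<std::pair<uint32_t, uint8_t> > RunLengthEncode(
    const std::vector<uint8_t>& buckets) {
  std::vector<std::pair<uint32_t, uint8_t> > runs;
  for (uint8_t b : buckets) {
    if (!runs.empty() && runs.back().second == b) {
      ++runs.back().first;
    } else {
      runs.push_back(std::make_pair(1u, b));
    }
  }
  return runs;
}
```

With these helpers a 9000-character string is stored as three chunks, and
a bucket array such as {0, 0, 0, 5, 0, 0} shrinks to three runs.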
Change-Id: If82cf4753d19eb532265acb556f798b95fbb0f34
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4475
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5408
* AggFnEvaluator now uses the UDF mem pool (I'm planning to change
this to per-exec node pools in the expr refactoring)
* FunctionContext::TrackAllocation()/Free() actually use the UDF's mem tracker
* Added FunctionContextImpl::Close() which sets warnings for leaked allocations
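As a rough illustration of the leak check, the sketch below tracks
allocations and records a warning on close. TrackingContext is a
hypothetical stand-in and does not reproduce the FunctionContext API; in
particular, freeing the leaked memory in Close() is an assumption of this
sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a function context that tracks its allocations.
class TrackingContext {
 public:
  // Allocates `size` bytes and records the allocation.
  void* Allocate(std::size_t size) {
    void* p = std::malloc(size);
    if (p != nullptr) live_[p] = size;
    return p;
  }

  // Frees a pointer previously returned by Allocate().
  void Free(void* p) {
    std::map<void*, std::size_t>::iterator it = live_.find(p);
    assert(it != live_.end());
    live_.erase(it);
    std::free(p);
  }

  // Releases anything still outstanding and records one warning if any
  // bytes leaked. Returns the number of leaked bytes.
  std::size_t Close() {
    std::size_t leaked = 0;
    for (const auto& entry : live_) {
      leaked += entry.second;
      std::free(entry.first);
    }
    live_.clear();
    if (leaked > 0) {
      warnings_.push_back("leaked " + std::to_string(leaked) + " bytes");
    }
    return leaked;
  }

  const std::vector<std::string>& warnings() const { return warnings_; }

 private:
  std::map<void*, std::size_t> live_;  // outstanding pointer -> size
  std::vector<std::string> warnings_;
};
```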
Change-Id: I792ffd49102a92b57e34df18d8ff5f5d0fd27370
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1792
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
(cherry picked from commit 41a5f7cfa718789fa3b2de3a31f085411fb5000c)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1954
Tested-by: jenkins
Before this, we had to specify the entire mangled symbol. This can be quite
long and quite tedious (take a look at some of the create UDA test cases that
specify all the symbols).
This patch adds some code to convert from the user function signature to the
mangled name. This means the user can specify the unmangled name and we can
do the symbol lookup. The mangling rules are pretty convoluted, but if the
conversion gets it wrong, the user can always specify the full symbol.
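To give an idea of what the conversion involves, here is a minimal sketch
of Itanium C++ ABI mangling for free functions with builtin parameter
types. MangleSimple is a hypothetical helper, not the code added by this
patch; real UDF signatures take pointers and references to class types,
which additionally need the ABI's substitution rules, and the special
encoding of namespace std is not handled either.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: mangles a free function following the Itanium C++
// ABI. Only a few builtin parameter types are supported (an assumption made
// to keep the sketch short).
std::string MangleSimple(const std::vector<std::string>& namespaces,
                         const std::string& name,
                         const std::vector<std::string>& arg_types) {
  static const std::map<std::string, std::string> kCodes = {
      {"bool", "b"}, {"char", "c"},  {"int", "i"},
      {"long", "l"}, {"float", "f"}, {"double", "d"}};

  std::string out = "_Z";
  if (namespaces.empty()) {
    out += std::to_string(name.size()) + name;
  } else {
    out += "N";  // nested name: N <len><ns>... <len><name> E
    for (const std::string& ns : namespaces) {
      out += std::to_string(ns.size()) + ns;
    }
    out += std::to_string(name.size()) + name + "E";
  }
  if (arg_types.empty()) return out + "v";  // empty parameter list
  for (const std::string& t : arg_types) {
    std::map<std::string, std::string>::const_iterator it = kCodes.find(t);
    assert(it != kCodes.end());  // type not supported by this sketch
    out += it->second;
  }
  return out;
}
```

For example, Foo(int, double) maps to _Z3Fooid and ns::Bar() maps to
_ZN2ns3BarEv.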
Some other minor cleanup in:
- JNI from FE to BE
- UDFs/UDAs that are loaded as test data
Change-Id: I733dbf3a72cb7b06221c27e622d161bcca0d74a8
Reviewed-on: http://gerrit.ent.cloudera.com:8080/624
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>