Commit Graph

13 Commits

Author SHA1 Message Date
Juan Yu
99db6812a9 Fix udf_samples building issue if machine has a very old boost installed
udf_samples makefile doesn't use ${CLANG_INCLUDE_FLAGS} so it will use
the default boost installation. If dev env has a very old boost installed,
you could get the following comiling error.
  ../udf/udf.h:143:3: error: unknown type name 'uint8_t'
    uint8_t* Allocate(int byte_size);
    ^

Change-Id: I3878b9d73d6022855b0cfbbdbee17eaf4c2557e1
Reviewed-on: http://gerrit.cloudera.org:8080/692
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
2015-08-26 02:59:27 +00:00
Martin Grund
5afd5bc8f6 Toolchain Cleanup and ASAN Improvements
This patch provides the last fixes to finally enable the toolchain:

     - Remove static OpenSSL dependency
     - Fixing inline assembly problems in ASAN
     - Issues with non-relocatable LLVM 3.3 - adds manual system
       includes to fix issues with hardcoded header paths in clang.

When the toolchain is enabled and we build for ASAN we use a specific
toolchain file to build with LLVM-trunk as the main compiler. Even
though this uses LLVM-trunk for compiling the Impala code, this will use
LLVM 3.3 for codegen.  In addition, this enables us to follow up with
TSAN and LEAKSAN.

Change-Id: I0abb914ca3f192cb7edd83ead134bc9e2d02071f
Reviewed-on: http://gerrit.cloudera.org:8080/556
Tested-by: Internal Jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
2015-08-21 20:14:31 +00:00
Martin Grund
384ae3ab08 Fixes for Toolchain Issues
If a static version of zlib and bzip2 is picked up we assumed that it
would be compiled with -fPIC. However, this is not always the case. Thus
in the non-toolchain case we specifically dynamic link with zlib and
bzip2 for the dynamic targets.

In addition, this patch removes static linking of libgcc in the
toolchain case as LLVM is not able to find the exception handling
symbols even if they are present in the binary. Static linking of libgcc
is postponed.

Next, if Impala is build with -notests the external data source thrift
files would not be generated. This patch make sure the dependencies are
expressed correctly.

Finally, if a user would have google perftools installed on the system
we would accidentally pick up the system libraries and the thirdparty
headers which will end in linker errors. This patch fixes the path
issues.

Change-Id: Ic000101c33da26d75a0cd733f7ef02f1bd694937
Reviewed-on: http://gerrit.cloudera.org:8080/460
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-06-15 23:14:32 +00:00
Martin Grund
81f247b171 Optional Impala Toolchain
This patch allows to optionally enable the new Impala binary
toolchain. For now there are now major version differences in the
toolchain dependencies and what is currently kept in thirdparty.

To enable the toolchain, export the variable IMPALA_TOOLCHAIN to the
folder where the binaries are available.

In addition this patch moves gutil from the thirdparty directory into
the source tree of be/src to allow easy propagation of compiler and
linker flags. Furthermore, the thrift-cpp target was added as a
dependency to all targets that require the generated thrift sources to
be available before the build is started.

What is the new toolchain: The goal of the toolchain is to homogenize
the build environment and to make sure that Impala is build nearly
identical on every platform. To achieve this, we limit the flexibility
of using the systems host libraries and rather rely on a set of custom
produced binaries including the necessary compiler.

Change-Id: If2dac920520e4a18be2a9a75b3184a5bd97a065b
Reviewed-on: http://gerrit.cloudera.org:8080/427
Reviewed-by: Adar Dembo <adar@cloudera.com>
Tested-by: Internal Jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
2015-06-13 03:11:44 +00:00
Henry Robinson
75b16d5b8e Rewrite header comments to use Doxygen-compatible ///
The command line used:
  git ls-files *.h | xargs sed -i '14,$s/^\( *\/\/\) /\1\/ /g'

...then some manual fix-up to remove false positives on inlined
functions that contain comments.

Change-Id: Ia835ae21f189d5a8dc5627fb3983081a0bd1f1e2
Reviewed-on: http://gerrit.cloudera.org:8080/305
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
2015-05-07 23:07:57 +00:00
Martin Grund
2eb12e9593 Deprecating namespace directive declarations (std, boost)
This patch removes all occurrences of "using namespace std" and "using
namespace boost(.*)" from the codebase. However, there are still cases
where namespace directives are used (e.g. for rapidjson, thrift,
gutil). These have to be tackled in subsequent patches.

To reduce the patch size, this patch introduces a new header file called
"names.h" that will include many of our most frequently used symbols iff
the corresponding include was already added. This means, that this
header file will pull in for example map / string / vector etc, only iff
vector was already included. This requires "common/names.h" to be the
last include. After including `names.h` a new block contains a sorted list
of using definitions (this patch does not fix namespace directive
declarations for other than std / boost namespaces.)

Change-Id: Iebe4c054670d655bc355347e381dae90999cfddf
Reviewed-on: http://gerrit.cloudera.org:8080/338
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
2015-04-18 01:26:47 +00:00
Henry Robinson
44f57e5fb6 IMPALA-1122: Compute stats with partition granularity
This patch adds the ability to compute and drop column and table
statistics at partition granularity.

The following commands are added. Detail about the implementation
follows.

COMPUTE INCREMENTAL STATS <tbl_name> [PARTITION <partition_spec>]

This variant of COMPUTE STATS will, ultimately, do the same thing as the
traditional COMPUTE STATS statement, but does so by caching the
intermediate state of the computation for each partition in the Hive
MetaStore. If the PARTITION clause is added, the computation is
performed for only that partition. If the PARTITION clause is omitted,
incremental stats are updated only for those partitions with missing
incremental stats (e.g. one column does not have stats, or incremental
stats was never computed for this partition). In this patch, incremental
stats are only invalidated when a DROP STATS variant is executed. Future
patches can automatically invalidate the statistics after REFRESH or
INSERT queries, etc.

DROP INCREMENTAL STATS <tbl_name> PARTITION <part_spec>

This variant of DROP stats removes the incremental statistics for the
given table. It does *not* recalculate the statistics for the whole
table, so this should be used only to invalidate the intermediate state
for a partition which will shortly be subject to COMPUTE INCREMENTAL
STATS. The point of this variant is to allow users to notify Impala when
they believe a partition has changed significantly enough to warrant
recomputation of its statistics. It is not necessary for new partitions;
Impala will detect that they do not have any valid statistics.

--------

This is achieved by adapting the existing HLL UDA via swapping its
finalize method for a new one which returns the intermediate HLL
buckets, rather than aggregating and then disposing of them. This
intermediate state is then returned to Impala's catalog-op-executor.cc,
which then passes the intermediate state back to the frontend to be
ultimately stored in the HMS.

This intermediate state is computed on a per-partition basis by grouping
the input to the UDA by partition. Thus, the incremental computation
produces one row for each partition selected (the set of which might be
quite small, if there are few partitions without valid incremental
stats: this is the point of the new commands).

At the same time, the query coordinator aggregates the output of the UDA
to produce table-level statistics. This computation incorporates any
existing (and not re-computed) intermediate partition state which is
passed to the coordinator by the frontend. The resulting statistics are
saved to the table as normal.

Intermediate statistics are serialised to the HMS by writing a Thrift
structure's serialised form to the partition's 'parameters' map. There
is a schema-imposed limit of 4000 characters to the serialised string,
which is exacerbated by the fact that the Thrift representation must
first be base-64 encoded to avoid type errors in the HMS. The current
patch breaks the encoded structure into 4k chunks, and then recombines
them on read. The alltypes table (11 columns) takes about three of these
chunks. This may mean that incremental stats are not suitable for
particularly wide tables: these structures could be zipped before
encoding for some space savings. In the meantime, the NDV estimates are
run-length encoded (since they are generally sparse); this can result in
substantial space savings.

Change-Id: If82cf4753d19eb532265acb556f798b95fbb0f34
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4475
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5408
2014-11-25 09:13:37 -08:00
Skye Wanderman-Milne
c8b2017093 Add decimal UDF/UDA support.
Change-Id: Ie48c1cb8e978c7282593b7f602dd68added6d3fd
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2625
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 5048f04b332c13b1bff32fb257272b0fea4b8584)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2739
2014-05-29 20:49:53 -07:00
Skye Wanderman-Milne
44125729dc UDF/UDA memory management improvements
* AggFnEvaluator now uses the UDF mem pool (I'm planning to change
  this to per-exec node pools in the expr refactoring)
* FunctionContext::TrackAllocation()/Free() actually use the UDF's mem tracker
* Added FunctionContextImpl::Close() which sets warnings for leaked allocations

Change-Id: I792ffd49102a92b57e34df18d8ff5f5d0fd27370
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1792
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
(cherry picked from commit 41a5f7cfa718789fa3b2de3a31f085411fb5000c)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1954
Tested-by: jenkins
2014-03-17 20:38:25 -07:00
Nong Li
6b9a7de02e Add symbol resolution during analysis for create function stmts.
Before this, we had to specify the entire mangled symbol. This can be quite
long and quite tedious (take a look at some of the create UDA test cases that
specify all the symbols).

This patch adds some code to convert from the user function signature to the
mangled name. This means the user can specify the unmangled name and we can
do the symbol lookup. The mangling rules are pretty convoluted but if it is
messed up, the user can always specify the full symbol.

Some other minor cleanup in:
  - JNI from FE to BE
  - UDFs/UDAs that are loaded as test data

Change-Id: I733dbf3a72cb7b06221c27e622d161bcca0d74a8
Reviewed-on: http://gerrit.ent.cloudera.com:8080/624
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:53:20 -08:00
Nong Li
b93b15f10f Integrate function context with mempool.
Change-Id: I55edb6cb89b67eb2c8031ac3a4f119df92a0896f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/565
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
2014-01-08 10:53:05 -08:00
Nong Li
93bece32ae Rename UdfContext to FunctionContext.
Change-Id: I45da3f51a66c3e2cc4580c26733269f30ab9be83
Reviewed-on: http://gerrit.ent.cloudera.com:8080/546
Tested-by: jenkins
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
2014-01-08 10:53:01 -08:00
Nong Li
052ab52c19 UDF/UDA interface, some examples and developer libs.
Change-Id: I9dd97d11ebebd5ba996f8317063eb7eff3bcc6f6
Reviewed-on: http://gerrit.ent.cloudera.com:8080/216
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: jenkins
2014-01-08 10:52:45 -08:00