The previous patch upgraded the Boost library. This patch changes the
64-bit random number generator from ranlux64_3 to mt19937_64, since
mt19937_64 performs better according to the Boost benchmark at
https://www.boost.org/doc/libs/1_74_0/doc/html/boost_random/performance.html.
Also fixes a unit test that is affected by the change of random number
generator.
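As an illustration of the generator named above, here is a minimal
sketch that uses the C++ standard library's std::mt19937_64, which
implements the same Mersenne Twister algorithm as the Boost engine of
the same name. The helper names NextRandom64 and SameSequence are
assumptions made for this example and are not part of Impala.

```cpp
#include <cstdint>
#include <random>

// Hypothetical helper: draws the next 64-bit value from the engine.
inline uint64_t NextRandom64(std::mt19937_64& gen) {
  return gen();
}

// Hypothetical helper: checks that two engines built from the same seed
// produce the same first 'count' values (the engine is deterministic).
inline bool SameSequence(uint64_t seed, int count) {
  std::mt19937_64 a(seed);
  std::mt19937_64 b(seed);
  for (int i = 0; i < count; ++i) {
    if (NextRandom64(a) != NextRandom64(b)) return false;
  }
  return true;
}
```

A different engine yields a different sequence for the same seed, so a
test that depends on exact generated values has to be updated when the
engine changes.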
Testing:
- Passed exhaustive tests.
Change-Id: Iade226fc17442f4d7b9b14e4a9e80a30a3856226
Reviewed-on: http://gerrit.cloudera.org:8080/18022
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch implements the same function as Hive UDF get_json_object.
We reuse RapidJson to parse the json string. In order to track the
memory used in RapidJson, we wrap FunctionContext into an allocator.
get_json_object accepts two parameters: a json string and a selector
(json path). We parse the json string into a Document tree and then
perform BFS according to the selector. For example, to process
get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[*].a'),
we first perform '$[*]' to extract all the items in the root array.
Then we get a queue consisting of {a:1},{a:2},{a:3} and perform the
'.a' selector on all values in the queue. The final results in the
queue are 1,2,3. As there are multiple results, they are encapsulated
into an array. The output is the string '[1,2,3]'.
More examples can be found in expr-test.cc.
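The queue-based evaluation described above can be sketched as follows.
This is not the RapidJson-based implementation: the Json and Step types
are hypothetical stand-ins, the selector is assumed to be already split
into steps, and an empty string stands in for a NULL result.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical minimal JSON value: numbers, arrays and objects only.
struct Json {
  enum Kind { NUMBER, ARRAY, OBJECT };
  Kind kind = NUMBER;
  int number = 0;
  std::vector<std::string> keys;  // object keys, parallel to 'items'
  std::vector<Json> items;        // array elements or object values
};

// One selector step: either '[*]' (wildcard) or '.key'.
struct Step {
  bool wildcard;
  std::string key;
};

std::string ToString(const Json& v) {
  if (v.kind == Json::NUMBER) return std::to_string(v.number);
  std::string out = (v.kind == Json::ARRAY) ? "[" : "{";
  for (size_t i = 0; i < v.items.size(); ++i) {
    if (i > 0) out += ",";
    if (v.kind == Json::OBJECT) out += "\"" + v.keys[i] + "\":";
    out += ToString(v.items[i]);
  }
  return out + ((v.kind == Json::ARRAY) ? "]" : "}");
}

// Applies each step to every value in the queue, level by level.
std::string Select(const Json& root, const std::vector<Step>& steps) {
  std::vector<const Json*> queue;
  queue.push_back(&root);
  for (const Step& step : steps) {
    std::vector<const Json*> next;
    for (const Json* v : queue) {
      if (step.wildcard) {
        if (v->kind != Json::ARRAY) continue;
        for (const Json& item : v->items) next.push_back(&item);
      } else if (v->kind == Json::OBJECT) {
        for (size_t i = 0; i < v->keys.size(); ++i) {
          if (v->keys[i] == step.key) next.push_back(&v->items[i]);
        }
      }
    }
    queue.swap(next);
  }
  if (queue.empty()) return "";  // placeholder for a NULL result
  if (queue.size() == 1) return ToString(*queue[0]);
  std::string out = "[";  // multiple results are wrapped into an array
  for (size_t i = 0; i < queue.size(); ++i) {
    if (i > 0) out += ",";
    out += ToString(*queue[i]);
  }
  return out + "]";
}
```

With a root array of three objects {a:1},{a:2},{a:3} and the steps
'[*]' then '.a', Select() returns the string [1,2,3].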
Test:
* Add unit tests in expr-test
* Add e2e tests in exprs.test
* Add tests in test_alloc_fail.py to check handling of out of memory
Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Reviewed-on: http://gerrit.cloudera.org:8080/10950
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
If result.ptr allocation fails for some reason inside the StringVal
constructor, we still overwrite result.len and continue.
This change checks that the StringVal pointer is not NULL before
dereferencing it, and returns NULL if it is.
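A minimal sketch of the check described above, under stated
assumptions: StringValSketch is a hypothetical stand-in for StringVal,
MakeStringVal mimics an allocating constructor whose failure can be
forced (loosely modelled on the fault injector), and ToDateSketch
simply copies the first 10 characters of a timestamp string that is
assumed to be at least that long. The sketch uses malloc, so the caller
frees the buffer; that ownership model is also an assumption.

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for a string value with a NULL flag.
struct StringValSketch {
  bool is_null;
  int len;
  uint8_t* ptr;
};

// Hypothetical allocating constructor. When the allocation fails (or a
// failure is injected), the value is marked NULL and ptr stays nullptr.
StringValSketch MakeStringVal(int len, bool inject_failure) {
  uint8_t* buf =
      inject_failure ? nullptr : static_cast<uint8_t*>(std::malloc(len));
  if (buf == nullptr) return {true, 0, nullptr};
  return {false, len, buf};
}

// Copies the "YYYY-MM-DD" prefix of 'timestamp' into a new string.
StringValSketch ToDateSketch(const char* timestamp, bool inject_failure) {
  StringValSketch result = MakeStringVal(10, inject_failure);
  // Check before touching result.ptr or result.len: return NULL instead
  // of writing through a null pointer.
  if (result.is_null) return result;
  std::memcpy(result.ptr, timestamp, 10);
  return result;
}
```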
Testing: Added a test case of the to_date() function to
alloc-fail-init.test to leverage the fault injector
--stress_fn_ctx_alloc.
Change-Id: I14cfb29a592885bb2f39958c8644f93db5220a68
Reviewed-on: http://gerrit.cloudera.org:8080/11286
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Currently, the implementation of the rand/random built-in functions
uses rand_r from the C library, whose randomness is poor. pcg32, from
a third-party library, shows better randomness than rand_r, so this
patch uses it instead.
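For readers unfamiliar with pcg32, the following is a hand-written
sketch of the publicly documented PCG32 (XSH-RR) step. It is not the
third-party library itself, and the struct name, the seeding
parameters and the mapping to [0, 1) in NextDouble() are assumptions
made for illustration.

```cpp
#include <cstdint>

// Illustrative PCG32 generator: 64-bit LCG state, 32-bit permuted output.
struct Pcg32Sketch {
  uint64_t state;
  uint64_t inc;  // stream selector, always odd

  Pcg32Sketch(uint64_t seed, uint64_t stream)
      : state(0), inc((stream << 1u) | 1u) {
    Next();
    state += seed;
    Next();
  }

  // Advances the LCG and returns a xorshifted, rotated 32-bit output.
  uint32_t Next() {
    uint64_t old = state;
    state = old * 6364136223846793005ULL + inc;
    uint32_t xorshifted = static_cast<uint32_t>(((old >> 18u) ^ old) >> 27u);
    uint32_t rot = static_cast<uint32_t>(old >> 59u);
    return (xorshifted >> rot) | (xorshifted << ((32u - rot) & 31u));
  }

  // Hypothetical mapping of the 32-bit output to a double in [0, 1).
  double NextDouble() { return Next() / 4294967296.0; }
};
```

Two generators constructed with the same seed and stream produce the
same sequence, which is what makes seeded tests reproducible.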
Testing:
Revise unit test in expr-test
Add E2E test to random.test
Change-Id: Idafdd5fe7502ff242c76a91a815c565146108684
Reviewed-on: http://gerrit.cloudera.org:8080/8355
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
This is also a step towards IMPALA-2399 (remove QueryMaintenance()).
"local" allocations containing expression results (either intermediate
or final results) have the following properties:
* They are usually small allocations
* They can be made frequently (e.g. every function call)
* They are owned and managed by the Impala runtime
* They are freed in bulk at various points in query execution.
A MemPool (i.e. bump allocator) is the right mechanism to manage
allocations with the above properties. Before this patch,
FunctionContexts used a FreePool + vector of allocations to emulate the
above behaviour. This patch switches to using a MemPool to bring these
allocations in line with the rest of the codebase.
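To make the "bump allocator" idea concrete, here is a deliberately
simplified sketch. BumpPoolSketch is a hypothetical single-chunk pool
with no alignment handling or memory tracking; it is not the MemPool
class, and only illustrates why small, frequent allocations that are
freed in bulk suit this mechanism.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical single-chunk bump allocator.
class BumpPoolSketch {
 public:
  explicit BumpPoolSketch(size_t capacity) : buffer_(capacity), used_(0) {}

  // Hands out the next 'size' bytes by advancing an offset. Returns
  // nullptr if the chunk cannot satisfy the request.
  uint8_t* Allocate(size_t size) {
    if (size > buffer_.size() - used_) return nullptr;
    uint8_t* result = buffer_.data() + used_;
    used_ += size;
    return result;
  }

  // Frees every allocation at once by resetting the offset.
  void Clear() { used_ = 0; }

  size_t used() const { return used_; }

 private:
  std::vector<uint8_t> buffer_;
  size_t used_;
};
```

Allocation is a bounds check plus an addition, and there is no
per-allocation free: everything is released together by Clear().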
The steps required for this conversion:
* Use a MemPool for FunctionContext local allocations.
* Identify appropriate MemPools for all of the local allocations from
function contexts so that the memory lifetime is correct.
* Various cleanup and documentation of existing MemPools.
* Replace calls to FreeLocalAllocations() with calls to
MemPool::Clear()
More involved surgery was required in a few places:
* Made the Sorter own its comparator, exprs and MemPool.
* Remove FunctionContextImpl::ReallocateLocal() and just have
StringFunctions::Replace() do the doubling itself to avoid
the need for a special interface. Worst-case this doubles
the memory requirements for Replace() since n / 2 + n / 4
+ n / 8 + .... bytes of memory could be wasted instead of recycled
for an n-byte output string.
* Provide a way to redirect agg fn Serialize()/Finalize() allocations
to come directly from the output RowBatch's MemPool. This is
also potentially applicable to other places where we currently
copy out strings from local allocations, e.g.
AnalyticEvalNode::AddResultTuple() and Tuple::MaterializeExprs().
* --stress_free_pool_alloc was changed to instead intercept at the
FunctionContext layer so that it retains the old behaviour even
though allocations do not all come from FreePools.
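The worst-case memory estimate for Replace() above can be checked with
a small simulation. This is a sketch under assumptions: the function
and struct names are hypothetical, and it only counts bytes, assuming
that each outgrown buffer stays allocated in the pool until the pool is
cleared.

```cpp
#include <cstddef>

struct GrowthStats {
  size_t final_capacity;  // capacity of the buffer that holds the output
  size_t wasted_bytes;    // total size of outgrown, unrecycled buffers
};

// Simulates growing a buffer by doubling until it can hold 'output_len'
// bytes, when old buffers cannot be returned to the allocator.
GrowthStats SimulateDoubling(size_t initial_capacity, size_t output_len) {
  size_t capacity = initial_capacity == 0 ? 1 : initial_capacity;
  size_t wasted = 0;
  while (capacity < output_len) {
    wasted += capacity;  // the old buffer is abandoned, not recycled
    capacity *= 2;
  }
  return {capacity, wasted};
}
```

For a 1024-byte output grown from a 1-byte buffer, the abandoned
buffers total 1 + 2 + ... + 512 = 1023 bytes, which is just under the
final buffer size, so total memory use is at most roughly doubled.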
The "local" allocation concept was not exposed directly in udf.h so this
patch also renames them to better reflect that they're used for expr
results.
Testing:
* ran exhaustive and ASAN
Change-Id: I4ba5a7542ed90a49a4b5586c040b5985a7d45b61
Reviewed-on: http://gerrit.cloudera.org:8080/8025
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Impala has a hardcoded limit of 1GB in size for StringVal.
If the length of the string exceeds 1GB, Impala will simply
mark the StringVal as NULL (i.e. is_null = true). It's important
that string functions and built-in UDFs check this field before
accessing the pointer; otherwise Impala may dereference a null
pointer and crash.
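The pattern described above can be sketched as follows. All names are
hypothetical: StringValSketch stands in for StringVal, kMaxLengthSketch
for the 1GB limit, and RepeatSketch for a string function. The buffer
comes from malloc and is freed by the caller, which is an assumption of
this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

struct StringValSketch {
  bool is_null;
  int len;
  uint8_t* ptr;
};

// Assumed limit for this sketch: 1GB.
const int64_t kMaxLengthSketch = 1LL << 30;

// Hypothetical allocating constructor: an over-limit length (or a failed
// allocation) yields a NULL value with ptr == nullptr.
StringValSketch MakeStringVal(int64_t len) {
  if (len < 0 || len > kMaxLengthSketch) return {true, 0, nullptr};
  size_t bytes = static_cast<size_t>(len > 0 ? len : 1);
  uint8_t* buf = static_cast<uint8_t*>(std::malloc(bytes));
  if (buf == nullptr) return {true, 0, nullptr};
  return {false, static_cast<int>(len), buf};
}

// Builds a string of 'count' copies of 'c'.
StringValSketch RepeatSketch(char c, int64_t count) {
  StringValSketch result = MakeStringVal(count);
  // Check is_null before touching the pointer.
  if (result.is_null) return result;
  std::memset(result.ptr, c, static_cast<size_t>(result.len));
  return result;
}
```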
Change-Id: I55777487fff15a521818e39b4f93a8a242770ec2
Reviewed-on: http://gerrit.cloudera.org:8080/2786
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
FunctionContext::Allocate(), FunctionContextImpl::AllocateLocal()
and FunctionContext::Reallocate() allocate memory without taking
memory limits into account. The problem is that these functions
invoke FreePool::Allocate(), which may call MemPool::Allocate(),
which doesn't check against the memory limits. This patch fixes
the problem by making these FunctionContext functions check for
memory limits and set an error in the FunctionContext object if
memory limits are exceeded.
An alternative would be for these functions to call
MemPool::TryAllocate() instead and return NULL if memory limits
are exceeded. However, this may break some existing external
UDAs which don't check for allocation failures, leading to
unexpected crashes of Impala. Therefore, we stick with this
ad hoc approach until the UDF/UDA interfaces are updated in
future releases.
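The difference between the two approaches discussed above can be
sketched as follows. FnCtxSketch is a hypothetical class, not the
FunctionContext API: its Allocate() still hands out memory but records
an error once the limit is exceeded, while its TryAllocate() refuses
the request and returns NULL. Allocation sizes are assumed to be
greater than zero.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical context that tracks allocated bytes against a limit.
class FnCtxSketch {
 public:
  explicit FnCtxSketch(size_t limit) : limit_(limit), used_(0) {}
  FnCtxSketch(const FnCtxSketch&) = delete;
  FnCtxSketch& operator=(const FnCtxSketch&) = delete;
  ~FnCtxSketch() {
    for (void* p : allocations_) std::free(p);
  }

  // Ad hoc approach: the caller still gets usable memory, so code that
  // never checks for NULL keeps working, but an error is recorded so
  // the query can be failed when its status is next checked.
  void* Allocate(size_t size) {
    void* p = std::malloc(size);
    if (p == nullptr) {
      SetError("allocation failed");
      return nullptr;
    }
    allocations_.push_back(p);
    used_ += size;
    if (used_ > limit_) SetError("memory limit exceeded");
    return p;
  }

  // Alternative approach: refuse the request when it would exceed the
  // limit. Callers that do not check for NULL would crash.
  void* TryAllocate(size_t size) {
    if (used_ + size > limit_) return nullptr;
    return Allocate(size);
  }

  bool has_error() const { return !error_.empty(); }

 private:
  void SetError(const std::string& msg) {
    if (error_.empty()) error_ = msg;  // keep the first error only
  }

  size_t limit_;
  size_t used_;
  std::string error_;
  std::vector<void*> allocations_;
};
```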
Callers of these FunctionContext functions are also updated to
handle potential failed allocations instead of operating on
NULL pointers. The query status is polled at various locations,
and the query is terminated if an error has been set.
This patch also fixes MemPool to handle the case in which malloc
may return NULL. It propagates the failure to the callers instead
of continuing to run with NULL pointers. In addition, errors during
aggregate functions' initialization are now properly propagated.
Change-Id: Icefda795cd685e5d0d8a518cbadd37f02ea5e733
Reviewed-on: http://gerrit.cloudera.org:8080/1445
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins