The bug was a simple oversight: we copied the array data but forgot
to update the pointer of the corresponding ArrayValue.
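A minimal sketch of the fixed behavior, not the actual patch. ArrayValueSketch,
DeepCopyArray, and the flat buffer/offset pair are hypothetical stand-ins for
the real ArrayValue and memory pool; the caller is assumed to have reserved
enough space in the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a collection slot; not Impala's ArrayValue.
struct ArrayValueSketch {
  uint8_t* ptr;       // start of the item data
  int32_t num_bytes;  // size of the item data in bytes
};

// Copies src's payload into 'buffer' at '*offset' and points 'dst' at the copy.
// Assumes 'buffer' has at least src.num_bytes free bytes past '*offset'.
inline void DeepCopyArray(const ArrayValueSketch& src, ArrayValueSketch* dst,
                          uint8_t* buffer, size_t* offset) {
  assert(dst != nullptr && buffer != nullptr && offset != nullptr);
  uint8_t* copy = buffer + *offset;
  if (src.num_bytes > 0) std::memcpy(copy, src.ptr, src.num_bytes);
  *offset += src.num_bytes;
  dst->num_bytes = src.num_bytes;
  // The step the oversight skipped: without it, dst would still point at
  // src's memory even though the bytes had been copied.
  dst->ptr = copy;
}
```

After the call, dst->ptr refers to the copied bytes rather than to the source
array, which is the invariant the fix restores.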
Change-Id: Ib6ec0380f66194efc7ea3eb989535652eb8b526f
Reviewed-on: http://gerrit.cloudera.org:8080/855
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
The sorter does not currently support sorting tuples with collection
slots because the necessary deep copy logic is not implemented.
Fortunately, projection should ensure that all array values that reach
the sorter have been set to null. This patch adds DCHECKs to ensure
that this is the case. Variables are also renamed to reflect that, with
nested types, string values are a subset of variable-length values.
Change-Id: If617abe678903c69d12d1c65062c8063ae137296
Reviewed-on: http://gerrit.cloudera.org:8080/844
Tested-by: Internal Jenkins
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Rename some functions and consistently skip the variable length code
path when there are no variable length slots.
Change-Id: I2f3405fcc5f545b207fa48e17f37fe968208d94c
Reviewed-on: http://gerrit.cloudera.org:8080/773
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
This was unimplemented and is used on some code paths. Arrays were not
properly copied into the BufferedTupleStream, potentially leaving stray
pointers to invalid or reused memory. Arrays are now correctly deep
copied. Includes a unit test that copies rows containing arrays in and
out of a BufferedTupleStream.
Also implement matching optimisation for deep copy in RowBatch.
Change-Id: I75d91a6b439450c5b47646b73bc528cfb8f7109b
Reviewed-on: http://gerrit.cloudera.org:8080/751
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
Implement Tuple-to-Tuple DeepCopy for collections. Add query test
that uses the TOP-N node, which deep copies tuples in this way.
Confirmed that the query test failed before this fix.
Change-Id: I3fea860d8251038d7b5eb85c77973939abe9dbf8
Reviewed-on: http://gerrit.cloudera.org:8080/757
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
Before this patch, the row-batch serialization procedure 'blindly'
copied the data of all tuples to the serialization output buffer,
even if the same tuple appeared multiple times in the row batch
(e.g., as a result of a join we can have repeated tuples that are
backed by the same tuple data).
This patch addresses the most common case of the problem by checking
tuples in adjacent rows to see if they are duplicates, and if so
refers back to the previously serialized tuple.
Deduping adjacent tuples has minimal performance overhead, and offers
significant performance improvements when duplicates are present.
Tests are included to validate the correctness of deduplication, and a
benchmark is included to show that deduplication does not regress
performance of serialization or deserialization.
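The adjacent-tuple check can be sketched roughly as follows. This is an
illustration under assumptions, not the code in RowBatch::Serialize(): the
names SerializedBatchSketch and SerializeDedupAdjacent are hypothetical, each
row is reduced to a single fixed-size tuple pointer, and a null tuple is
recorded as offset -1.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical output layout: tuple bytes plus one offset per row.
struct SerializedBatchSketch {
  std::vector<uint8_t> data;     // serialized tuple bytes
  std::vector<int32_t> offsets;  // per-row offset into 'data', -1 for null
};

// Serializes one tuple per row. If a row's tuple pointer equals the previous
// row's pointer, the row refers back to the already serialized copy.
inline SerializedBatchSketch SerializeDedupAdjacent(
    const std::vector<const uint8_t*>& rows, int32_t tuple_size) {
  assert(tuple_size > 0);
  SerializedBatchSketch out;
  const uint8_t* prev = nullptr;  // tuple of the previous row
  for (const uint8_t* tuple : rows) {
    if (tuple == nullptr) {
      out.offsets.push_back(-1);
    } else if (tuple == prev) {
      // Adjacent duplicate: reuse the previous row's offset, copy nothing.
      out.offsets.push_back(out.offsets.back());
    } else {
      out.offsets.push_back(static_cast<int32_t>(out.data.size()));
      out.data.insert(out.data.end(), tuple, tuple + tuple_size);
    }
    prev = tuple;
  }
  return out;
}
```

Only one pointer comparison is added per row, which is in line with the
minimal overhead claimed above. A duplicate that is not adjacent to its
earlier occurrence is still copied again in this sketch.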
Change-Id: I0e4153c7f73685a116dd3e70072a0895b4daa561
Reviewed-on: http://gerrit.cloudera.org:8080/659
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
This patch adds support for ArrayValues in RowBatch::Serialize() and
the RowBatch deserialization constructor (which takes a
TRowBatch). This requires deep copying the arrays, which is currently
done naively, i.e., all array and string data is copied. This is
extremely inefficient in many cases. For example, subplans can produce
tuples that contain string data from arrays, meaning there are at
least two StringValues pointing to the same data (one in the string
slot populated by the unnest node, and the original string slot in the
array's item tuple). When we serialize this row batch, we will make
two copies of the string data. At some point we'll need to improve
this behavior, e.g. by de-duping pointers in a single row batch.
This patch also adds a row batch serialization test that covers the
new behavior (but not all the old behavior, e.g. it doesn't test every
slot type). This test can be used in the future to validate fancier,
more efficient deep-copying implementations.
This patch also changes the ArrayValue slot size to 16 (which is
sizeof(ArrayValue)).
Change-Id: I92ed999065e78faf7bfc96c70567af21f2e23eaa
Reviewed-on: http://gerrit.cloudera.org:8080/513
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
This patch removes all occurrences of "using namespace std" and "using
namespace boost(.*)" from the codebase. However, there are still cases
where namespace directives are used (e.g. for rapidjson, thrift,
gutil). These have to be tackled in subsequent patches.
To reduce the patch size, this patch introduces a new header file called
"names.h" that brings many of our most frequently used symbols into scope
iff the corresponding include was already added. For example, the header
pulls in the name vector only if <vector> was already included, and
likewise for map, string, etc. This requires "common/names.h" to be the
last include. After the include of "names.h", a new block contains a sorted
list of using declarations. (This patch does not fix namespace directives
for namespaces other than std / boost.)
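One way such a conditional header could work is sketched below. It assumes
the libstdc++ include-guard macros (_GLIBCXX_VECTOR, _GLIBCXX_STRING,
_GLIBCXX_MAP), so it is specific to that standard library; the real
"names.h" may be organized differently, and JoinNames is a hypothetical
helper used only to show the unqualified names in use.

```cpp
#include <string>
#include <vector>

// Sketch of what a names.h-style header might contain. Each using
// declaration is active only if the matching standard header was already
// included, detected here via libstdc++'s include-guard macros (assumption).
#ifdef _GLIBCXX_VECTOR
using std::vector;
#endif
#ifdef _GLIBCXX_STRING
using std::string;
#endif
#ifdef _GLIBCXX_MAP  // <map> is not included above, so this stays inactive.
using std::map;
#endif

// Code following the header can use the unqualified names.
inline string JoinNames(const vector<string>& names) {
  string out;
  for (const string& n : names) {
    if (!out.empty()) out += ", ";
    out += n;
  }
  return out;
}
```

Because the checks depend on which headers were seen earlier, a header of
this kind has to come after all other includes, which matches the
last-include requirement stated above.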
Change-Id: Iebe4c054670d655bc355347e381dae90999cfddf
Reviewed-on: http://gerrit.cloudera.org:8080/338
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
Adds fixes and tests for Hive CHAR & VARCHAR compatibility.
Also fixes a bug in tuple materialization for VARCHAR and non in-lined CHAR.
Change-Id: I400b089cb8ddba2e264ef9f2e37956b2ceaaf9fb
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4054
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
This patch addresses:
1. Char doesn't use codegen
2. Not in-lining large CHAR(N) for N > 128
3. Parquet reader/writer for CHAR(N) and VARCHAR(N)
Change-Id: I83a29a8bd312841a3e29bfe2243884074570f247
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4280
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
VARCHAR is treated as StringVal in the backend. All UDAs and UDFs which accept STRING
will also accept VARCHAR(N).
TODO: Reverted Avro codegen to fix Jenkins; needs separate patch.
Change-Id: Ifc120b6f0fe1f996b11a48b134d339ad3719331e
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/2527
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 3fcbf4f677b8e26c37eded4d8bb628e6fc53c1e9)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4058
This patch changes the interface for evaluating expressions, in order
to allow for thread-safe expression evaluations and easier
codegen. Thread safety is achieved via the ExprContext class, a
light-weight container for expression tree evaluation state. Codegen
is easier because more expressions can be cross-compiled to IR.
See expr.h and expr-context.h for an overview of the API
changes. See sort-exec-exprs.cc for a simple example of the new
interface and hdfs-scanner.cc for a more complicated example.
This patch has not been completely code reviewed and may need further
cleanup/stylistic work, as well as additional perf work.
Change-Id: I3e3baf14ebffd2687533d0cc01a6fb8ac4def849
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3459
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
Once the expr refactoring goes in, the BE will not be able to evaluate
any TYPE_NULL exprs. This patch ensures that the FE casts all null
literals and slot refs before they reach the BE.
There are a bunch of places where we know the appropriate type and
just weren't using it before. This patch also introduces a few notable
hacks:
* Serializing null SlotRefs and NullLiterals as boolean NullLiterals
in case they weren't cast earlier.
* Converting null SlotRefs to NullLiterals in uncheckedCastTo() since
we don't need to read from the slot at all.
This works, but we should consider adding a final pass that cleans up
the plan tree and takes care of this.
Change-Id: Ic2ee181139059553d7f2d0e17e9dacaee241df17
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3294
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
(cherry picked from commit a8a67ebcad12956a8260b4ea4189afb7ffab4b68)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3361
Enable order-by without limit
Added BufferedBlockMgr to allocate buffers and spill to disk.
Added Sorter for the external sort implementation.
Added new SortNode execution node that completely sorts its input.
Changes to enable writing in IoMgr went in a separate patch.
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1539
Reviewed-by: Srinath Shankar <sshankar@cloudera.com>
Tested-by: jenkins
Conflicts:
testdata/workloads/functional-planner/queries/PlannerTest/tpcds-all.test
Change-Id: I3ece32affe5b006f53bbdfcc03ded01471e818ac
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2900
Reviewed-by: Srinath Shankar <sshankar@cloudera.com>
Tested-by: jenkins
Adding MemPool::GetOffset()/GetDataPtr().
Fixed planner bug (wouldn't generate TScanParams for more than one scan).
Fixed bug in java test harness (which made it ignore the fact that the join tests have been broken for a while).
cases in executor code.
Adding MemPool::Release(), which allows passing data between pools.
Changing the semantics of GetNext() not to overwrite tuple data even beyond the next call;
the previous semantics (data only good until the next call) would have required joins
to create copies.
Adding mem-pool-test.