impala

mirror of https://github.com/apache/impala.git synced 2026-01-19 09:01:38 -05:00

Author	SHA1	Message	Date
Alex Behm	057b0b7dba	IMPALA-2322: Set new pointer for ArrayValue in Tuple::DeepCopyVarlenData(). The bug was a simple oversight where we copied the array data, but forgot to update the pointer of the corresponding ArrayValue. Change-Id: Ib6ec0380f66194efc7ea3eb989535652eb8b526f Reviewed-on: http://gerrit.cloudera.org:8080/855 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-09-22 10:58:32 -07:00
Tim Armstrong	c5b9b7a97d	Added DCHECKS for non-null array values in sorter The sorter does not currently support sorting tuples with collection slots because the necessary deep copy logic is not implemented. Fortunately, projection should ensure that all array values that reach the sorter have been set to null. This patch adds DCHECKs to ensure that this is the case. Variables are also renamed to reflect that with nested types string values are a subset of variable-length values. Change-Id: If617abe678903c69d12d1c65062c8063ae137296 Reviewed-on: http://gerrit.cloudera.org:8080/844 Tested-by: Internal Jenkins Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>	2015-09-22 10:58:32 -07:00
Tim Armstrong	302df6be5d	Cleanup of DeepCopy Rename some functions and consistently skip the variable length code path when there are no variable length slots. Change-Id: I2f3405fcc5f545b207fa48e17f37fe968208d94c Reviewed-on: http://gerrit.cloudera.org:8080/773 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2015-09-10 21:27:52 +00:00
Tim Armstrong	235a8d08da	IMPALA-2295: deep copy arrays in BufferedTupleStream This was unimplemented and is used on some code paths. Arrays were not properly copied into the BufferedTupleStream, potentially leaving stray pointers to invalid or reused memory. Arrays are now correctly deep copied. Includes a unit test that copies rows containing arrays in and out of a BufferedTupleStream. Also implement matching optimisation for deep copy in RowBatch. Change-Id: I75d91a6b439450c5b47646b73bc528cfb8f7109b Reviewed-on: http://gerrit.cloudera.org:8080/751 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2015-09-09 02:39:14 +00:00
Tim Armstrong	5ac55f24cc	IMPALA-2296: missing DeepCopy array support Implement Tuple-to-Tuple DeepCopy for collections. Add query test that uses the TOP-N node, which deep copies tuples in this way. Confirmed that the query test failed before this fix. Change-Id: I3fea860d8251038d7b5eb85c77973939abe9dbf8 Reviewed-on: http://gerrit.cloudera.org:8080/757 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2015-09-08 23:40:53 +00:00
Tim Armstrong	6627604655	Nested Types: Dedup adjacent tuples in row batch Before this patch, the row-batch serialization procedure 'blindly' copied the data of all tuples to the serialization output buffer, even if the same tuple appeared multiple times in the row batch (e.g., as a result of a join we can have repeated tuples that are backed by the same tuple data). This patch addresses the most common case of the problem by checking tuples in adjacent rows to see if they are duplicates, and if so refers back to the previously serialized tuple. Deduping adjacent tuples has minimal performance overhead, and offers significant performance improvements when duplicates are present. Tests are included to validate the correctness of deduplication and a benchmark is included to show that deduplication does not regress performance of serialization or deserialization. Change-Id: I0e4153c7f73685a116dd3e70072a0895b4daa561 Reviewed-on: http://gerrit.cloudera.org:8080/659 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2015-08-25 17:55:10 +00:00
Skye Wanderman-Milne	e067cea55b	Nested types: serialization of row batches containing collections This patch adds support for ArrayValues in RowBatch::Serialize() and the RowBatch deserialization constructor (which takes a TRowBatch). This requires deep copying the arrays, which is currently done naively, i.e., all array and string data is copied. This is extremely inefficient in many cases. For example, subplans can produce tuples that contain string data from arrays, meaning there are at least two StringValues pointing to the same data (one in the string slot populated by the unnest node, and the original string slot in the array's item tuple). When we serialize this row batch, we will make two copies of the string data. At some point we'll need to improve this behavior, e.g. by de-duping pointers in a single row batch. This patch also adds a row batch serialization test that covers the new behavior (but not all the old behavior, e.g. it doesn't test every slot type). This test can be used in the future to validate fancier, more efficient deep-copying implementations. This patch also changes the ArrayValue slot size to 16 (which is sizeof(ArrayValue)). Change-Id: I92ed999065e78faf7bfc96c70567af21f2e23eaa Reviewed-on: http://gerrit.cloudera.org:8080/513 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: Internal Jenkins	2015-08-13 20:59:33 +00:00
Martin Grund	2eb12e9593	Deprecating namespace directive declarations (std, boost) This patch removes all occurrences of "using namespace std" and "using namespace boost(.*)" from the codebase. However, there are still cases where namespace directives are used (e.g. for rapidjson, thrift, gutil). These have to be tackled in subsequent patches. To reduce the patch size, this patch introduces a new header file called "names.h" that will include many of our most frequently used symbols iff the corresponding include was already added. This means, that this header file will pull in for example map / string / vector etc, only iff vector was already included. This requires "common/names.h" to be the last include. After including `names.h` a new block contains a sorted list of using definitions (this patch does not fix namespace directive declarations for other than std / boost namespaces.) Change-Id: Iebe4c054670d655bc355347e381dae90999cfddf Reviewed-on: http://gerrit.cloudera.org:8080/338 Reviewed-by: Martin Grund <mgrund@cloudera.com> Tested-by: Internal Jenkins	2015-04-18 01:26:47 +00:00
Victor Bittorf	af4b2086dc	Char PARQUET, AVRO, and TEXT tests Adds fixes and tests for Hive CHAR & VARCHAR compatibility. Also fixes a bug in tuple materialization for VARCHAR and non in-lined CHAR. Change-Id: I400b089cb8ddba2e264ef9f2e37956b2ceaaf9fb Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4054 Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com> Tested-by: jenkins	2014-09-26 12:24:07 -07:00
Victor Bittorf	6289121261	CHAR(N) Followup Patch This patch addresses: 1. Char doesn't use codegen 2. Not in-lining large CHAR(N) for N > 128 3. Parquet reader/writer for CHAR(N) and VARCHAR(N) Change-Id: I83a29a8bd312841a3e29bfe2243884074570f247 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4280 Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com> Tested-by: jenkins	2014-09-20 16:12:03 -07:00
Victor Bittorf	2dce31f6c2	Adding VARCHAR front & backend. VARCHAR is treated as StringVal in the backend. All UDAs and UDFs which accept STRING will also accept VARCHAR(N). TODO: Reverted Avro codegen to fix Jenkins; needs separate patch. Change-Id: Ifc120b6f0fe1f996b11a48b134d339ad3719331e Reviewed-on: http://gerrit.sjc.cloudera.com:8080/2527 Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com> Tested-by: jenkins (cherry picked from commit 3fcbf4f677b8e26c37eded4d8bb628e6fc53c1e9) Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4058	2014-08-27 13:52:58 -07:00
Skye Wanderman-Milne	559b83d3d0	Expr refactoring This patch changes the interface for evaluating expressions, in order to allow for thread-safe expression evaluations and easier codegen. Thread safety is achieved via the ExprContext class, a light-weight container for expression tree evaluation state. Codegen is easier because more expressions can be cross-compiled to IR. See expr.h and expr-context.h for an overview of the API changes. See sort-exec-exprs.cc for a simple example of the new interface and hdfs-scanner.cc for a more complicated example. This patch has not been completely code reviewed and may need further cleanup/stylistic work, as well as additional perf work. Change-Id: I3e3baf14ebffd2687533d0cc01a6fb8ac4def849 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3459 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: Skye Wanderman-Milne <skye@cloudera.com>	2014-08-17 12:44:44 -07:00
Skye Wanderman-Milne	f0fb28158b	FE changes to avoid shipping null-type expressions to the BE. Once the expr refactoring goes in, the BE will not be able to evaluate any TYPE_NULL exprs. This patch ensures that the FE casts all null literals and slot refs before they reach the BE. There are a bunch of places where we know the appropriate type and just weren't using it before. This patch also introduces a few notable hacks: * Serializing null SlotRefs and NullLiterals as boolean NullLiterals in case they weren't cast earlier. * Converting null SlotRefs to NullLiterals in uncheckedCastTo() since we don't need to read from the slot at all. This works, but we should consider adding a final pass that cleans up the plan tree and takes care of this. Change-Id: Ic2ee181139059553d7f2d0e17e9dacaee241df17 Reviewed-on: http://gerrit.ent.cloudera.com:8080/3294 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: jenkins (cherry picked from commit a8a67ebcad12956a8260b4ea4189afb7ffab4b68) Reviewed-on: http://gerrit.ent.cloudera.com:8080/3361	2014-07-01 15:48:08 -07:00
Srinath Shankar	5755b0bdee	Order by without limit for Impala Enable order-by without limit Added BufferedBlockMgr to allocate buffers and spill to disk. Added Sorter for the external sort implementation Added new SortNode execution node that completely sorts its input Changes to enable writing in IoMgr went in a separate patch. Reviewed-on: http://gerrit.ent.cloudera.com:8080/1539 Reviewed-by: Srinath Shankar <sshankar@cloudera.com> Tested-by: jenkins Conflicts: testdata/workloads/functional-planner/queries/PlannerTest/tpcds-all.test Change-Id: I3ece32affe5b006f53bbdfcc03ded01471e818ac Reviewed-on: http://gerrit.ent.cloudera.com:8080/2900 Reviewed-by: Srinath Shankar <sshankar@cloudera.com> Tested-by: jenkins	2014-06-09 16:58:08 -07:00
Skye Wanderman-Milne	c5afb11558	Compress serialized RowBatchs	2014-01-08 10:49:13 -08:00
Henry Robinson	2f339f2ed8	Add ASL license to all public files	2014-01-08 10:46:32 -08:00
ishaan	05c65789bb	Change Copyrights from 2011 to 2012	2014-01-08 10:46:29 -08:00
Nong Li	6c2beaf37e	HdfsTextScanNode codegen.	2012-06-01 17:32:25 -07:00
Michael Ubell	a0d3b59aa6	Change char* to uint8_t* Conflicts: be/src/exec/hdfs-sequence-scanner.h Conflicts: be/src/exec/hdfs-sequence-scanner.cc	2012-05-02 07:31:53 -07:00
Marcel Kornacker	a8acd52281	Defining data serialization format (Data.thrift). Adding MemPool::GetOffset()/GetDataPtr(). Fixed planner bug (wouldn't generate TScanParams for more than one scan). Fixed bug in java test harness (which made it ignore the fact that the join tests have been broken for a while).	2011-11-22 15:42:32 -08:00
Nong Li	6eda6d19c6	Implemented TopN.	2011-11-06 17:03:33 -08:00
Marcel Kornacker	33166839c7	Changing QueryTests to run queries with different batch sizes, in order to hit more corner cases in executor code. Adding MemPool::Release(), which allows passing data between pools. Changing the semantics of GetNext() not to overwrite tuple data even beyond the next call; the previous semantics (data only good until the next call) would have required joins to create copies. Adding mem-pool-test.	2011-10-17 05:10:21 -07:00

22 Commits