impala

mirror of https://github.com/apache/impala.git synced 2026-01-23 21:00:25 -05:00

Author	SHA1	Message	Date
Alex Behm	1c528492d3	Nested Types: Fix projection of collection-typed slots. There was a bug with projecting collection-typed slots in the UnnestNode by setting them to NULL. The problem was that the same tuple/slot could be referenced by multiple input rows. As a result, all unnests after the first that operate on the same collection value would incorrectly return an empty row batch because the slot had been set to NULL by the first unnesting. The fix is to ignore the null bit when retrieving a collection-typed slot's value in the UnnestNode. We still set the null bit after retrieving the value for projection. This solution purposely ignores the conventional NULL semantics of slots. It is a temporary hack which must be removed eventually. We rely on the producer of collection-typed slot values (scan node) to write an empty array value into such slots when they are NULL in addition to setting the null bit. Change-Id: Ie6dc671b3d031f1dfe4d95090b1b6987c2c974da Reviewed-on: http://gerrit.cloudera.org:8080/859 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-09-22 10:58:33 -07:00
Tim Armstrong	c5b9b7a97d	Added DCHECKS for non-null array values in sorter The sorter does not currently support sorting tuples with collection slots because the necessary deep copy logic is not implemented. Fortunately, projection should ensure that all array values that reach the sorter have been set to null. This patch adds DCHECKs to ensure that this is the case. Variables are also renamed to reflect that with nested types string values are a subset of variable-length values. Change-Id: If617abe678903c69d12d1c65062c8063ae137296 Reviewed-on: http://gerrit.cloudera.org:8080/844 Tested-by: Internal Jenkins Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>	2015-09-22 10:58:32 -07:00
Tim Armstrong	302df6be5d	Cleanup of DeepCopy Rename some functions and consistently skip the variable length code path when there are no variable length slots. Change-Id: I2f3405fcc5f545b207fa48e17f37fe968208d94c Reviewed-on: http://gerrit.cloudera.org:8080/773 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2015-09-10 21:27:52 +00:00
Tim Armstrong	5ac55f24cc	IMPALA-2296: missing DeepCopy array support Implement Tuple-to-Tuple DeepCopy for collections. Add query test that uses the TOP-N node, which deep copies tuples in this way. Confirmed that the query test failed before this fix. Change-Id: I3fea860d8251038d7b5eb85c77973939abe9dbf8 Reviewed-on: http://gerrit.cloudera.org:8080/757 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2015-09-08 23:40:53 +00:00
Tim Armstrong	6627604655	Nested Types: Dedup adjacent tuples in row batch Before this patch, the row-batch serialization procedure 'blindly' copied the data of all tuples to the serialization output buffer, even if the same tuple appeared multiple times in the row batch (e.g., as a result of a join we can have repeated tuples that are backed by the same tuple data). This patch addresses the most common case of the problem by checking tuples in adjacent rows to see if they are duplicates, and if so refers back to the previously serialized tuple. Deduping adjacent tuples has minimal performance overhead, and offers significant performance improvements when duplicates are present. Tests are included to validate the correctness of deduplication and a benchmark is included to show that deduplication does not regress performance of serialization or deserialization. Change-Id: I0e4153c7f73685a116dd3e70072a0895b4daa561 Reviewed-on: http://gerrit.cloudera.org:8080/659 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2015-08-25 17:55:10 +00:00
Skye Wanderman-Milne	e067cea55b	Nested types: serialization of row batches containing collections This patch adds support for ArrayValues in RowBatch::Serialize() and the RowBatch deserialization constructor (which takes a TRowBatch). This requires deep copying the arrays, which is currently done naively, i.e., all array and string data is copied. This is extremely inefficient in many cases. For example, subplans can produce tuples that contain string data from arrays, meaning there are at least two StringValues pointing to the same data (one in the string slot populated by the unnest node, and the original string slot in the array's item tuple). When we serialize this row batch, we will make two copies of the string data. At some point we'll need to improve this behavior, e.g. by de-duping pointers in a single row batch. This patch also adds a row batch serialization test that covers the new behavior (but not all the old behavior, e.g. it doesn't test every slot type). This test can be used in the future to validate fancier, more efficient deep-copying implementations. This patch also changes the ArrayValue slot size to 16 (which is sizeof(ArrayValue)). Change-Id: I92ed999065e78faf7bfc96c70567af21f2e23eaa Reviewed-on: http://gerrit.cloudera.org:8080/513 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: Internal Jenkins	2015-08-13 20:59:33 +00:00
Henry Robinson	75b16d5b8e	Rewrite header comments to use Doxygen-compatible /// The command line used: git ls-files .h | xargs sed -i '14,$s/^$ \/\/$ /\1\/ /g' ...then some manual fix-up to remove false positives on inlined functions that contain comments. Change-Id: Ia835ae21f189d5a8dc5627fb3983081a0bd1f1e2 Reviewed-on: http://gerrit.cloudera.org:8080/305 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: Internal Jenkins	2015-05-07 23:07:57 +00:00
Skye Wanderman-Milne	559b83d3d0	Expr refactoring This patch changes the interface for evaluating expressions, in order to allow for thread-safe expression evaluations and easier codegen. Thread safety is achieved via the ExprContext class, a light-weight container for expression tree evaluation state. Codegen is easier because more expressions can be cross-compiled to IR. See expr.h and expr-context.h for an overview of the API changes. See sort-exec-exprs.cc for a simple example of the new interface and hdfs-scanner.cc for a more complicated example. This patch has not been completely code reviewed and may need further cleanup/stylistic work, as well as additional perf work. Change-Id: I3e3baf14ebffd2687533d0cc01a6fb8ac4def849 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3459 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: Skye Wanderman-Milne <skye@cloudera.com>	2014-08-17 12:44:44 -07:00
Srinath Shankar	5755b0bdee	Order by without limit for Impala Enable order-by without limit Added BufferedBlockMgr to allocate buffers and spill to disk. Added Sorter for the external sort implementation Added new SortNode execution node that completely sorts its input Changes to enable writing in IoMgr went in a separate patch. Reviewed-on: http://gerrit.ent.cloudera.com:8080/1539 Reviewed-by: Srinath Shankar <sshankar@cloudera.com> Tested-by: jenkins Conflicts: testdata/workloads/functional-planner/queries/PlannerTest/tpcds-all.test Change-Id: I3ece32affe5b006f53bbdfcc03ded01471e818ac Reviewed-on: http://gerrit.ent.cloudera.com:8080/2900 Reviewed-by: Srinath Shankar <sshankar@cloudera.com> Tested-by: jenkins	2014-06-09 16:58:08 -07:00
Henry Robinson	16af29ea5f	IMPALA-770: Fix crash in aggregation node with zero-width tuple The select exprs of an inline view may not always be materialised, yet the output tuple itself may be. This patch fixes a crash in this situation in the backend aggregation node which assumed its output tuple would always have at least one materialised slot. The cause was a couple of too-conservative DCHECKs that failed if the tuple was NULL. In fact, the code was robust to this possibility without the checks, so this bug didn't affect release builds of Impala. Change-Id: If0b90809d30fcd196f55197953392452d1ac9c4f Reviewed-on: http://gerrit.ent.cloudera.com:8080/1431 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: jenkins (cherry picked from commit 8c1c21b66c43e900760ace54d090305f32a85a1f) Reviewed-on: http://gerrit.ent.cloudera.com:8080/1471 Tested-by: Henry Robinson <henry@cloudera.com>	2014-02-05 22:01:35 -08:00
Nong Li	15db34e356	AggregationNode refactoring This patch redoes how the aggregation node is implemented. The functionality is now split between aggregation-node, agg-expr and aggregate-functions. This is a work in progress (there's still a lot of debug stuff I added that needs to be cleaned up) but it does pass the tests. Aggregation-node is now very simple and now only deals with the grouping part. Aggregate-expr serves as the glue between the agg node and the aggregate functions. The aggregation functions are implemented with the UDA interface. I've reimplemented our existing aggregate functions with this setup. For true UDAs, the binaries would be loaded in aggregate-expr. This also includes some preliminary changes in the FE. We now need to annotate each AggNode as executing the update vs. merge phase (root aggs execute update, others execute merge) and if it needs a finalize step (only the root does). This is more general than our builtins which are too simple to need this structure. There is a big TODO here to allow the intermediate types between agg nodes to change. For example, in distinct estimate, the input type is the column type and the output type is a bigint. We'd like the intermediate type to be CHAR(256). This is different since currently, the intermediate type and output type have always been the same. We've hacked around this by having both the intermediate and output type be TYPE_STRING. I've left this for another patch (changing the BE to support this is trivial). For aggregates that result in strings, we used to store some additional stuff past the end of the tuple. The layout was: <tuple> <length of 1st string buffer>,<length of 2nd string buffer>, etc The rationale for this is that we want to reuse the buffer for min/max and grow the buffer more quickly for group_concat. This breaks down the abstraction between agg-expr and agg-node and is not something UDAs can use in general. Rather than try to hack around this, I think the proper solution is for the intermediate type not to be StringValue and to contain the buffer length itself. This patch also resurrects the distinct estimate code. The distinct estimate functions exercise all of the code paths. Change-Id: Ic152a2cd03bc1713967673681e1e6204dcd80346 Reviewed-on: http://gerrit.ent.cloudera.com:8080/564 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: Nong Li <nong@cloudera.com>	2014-01-08 10:53:13 -08:00
Nong Li	d1c5ec3293	Fix clang compiler warnings.	2014-01-08 10:50:54 -08:00
Skye Wanderman-Milne	c5afb11558	Compress serialized RowBatchs	2014-01-08 10:49:13 -08:00
Henry Robinson	2f339f2ed8	Add ASL license to all public files	2014-01-08 10:46:32 -08:00
ishaan	05c65789bb	Change Copyrights from 2011 to 2012	2014-01-08 10:46:29 -08:00
Nong Li	6c2beaf37e	HdfsTextScanNode codegen.	2012-06-01 17:32:25 -07:00
Nong Li	344c171c6a	Aggregation Node Codegen.	2012-05-21 14:47:57 -07:00
Nong Li	563d5e3f71	Added non-nullable slots for the FE/BE.	2012-02-22 13:22:32 -08:00
Marcel Kornacker	38b6d6286e	Added support for single-process distributed query execution: * new class ExchangeNode: ExecNode for incoming data stream * new class Coordinator: coordinates execution of all plan fragments * reorganized classes PlanExecutor and QueryExecutor * renamed PlanExecutorAdaptor to JniCoordinator * backend-service: creates thrift server that exports ImpalaBackendService * added --num_backends flag for runquery	2012-02-01 12:06:55 -08:00
Marcel Kornacker	a8acd52281	Defining data serialization format (Data.thrift). Adding MemPool::GetOffset()/GetDataPtr(). Fixed planner bug (wouldn't generate TScanParams for more than one scan). Fixed bug in java test harness (which made it ignore the fact that the join tests have been broken for a while).	2011-11-22 15:42:32 -08:00
Nong Li	6eda6d19c6	Implemented TopN.	2011-11-06 17:03:33 -08:00
Marcel Kornacker	33166839c7	Changing QueryTests to run queries with different batch sizes, in order to hit more corner cases in executor code. Adding MemPool::Release(), which allows passing data between pools. Changing the semantics of GetNext() not to overwrite tuple data even beyond the next call; the previous semantics (data only good until the next call) would have required joins to create copies. Adding mem-pool-test.	2011-10-17 05:10:21 -07:00
Marcel Kornacker	018c3975b4	Adding RowDescriptor. Assigning ids to catalog tables and referencing those in the thrift descriptors. Avoiding materialization of slots whose only references are pushed into the scan keys/filters. Added work-around for hbase bug where a filter isn't applied if that col family isn't explicitly requested.	2011-09-09 16:31:26 -07:00
Marcel Kornacker	1087bf061c	some bug fixes in agg-node and is-null-predicate; more tests	2011-08-11 17:34:36 -07:00
marcel	5ca75fae47	Adding agg-node.{cc,h}. Some bug fixes (removing dups from agg exprs).	2011-08-08 12:25:44 -07:00
Marcel Kornacker	c23616a30c	deserializing plan request in c++ Coordinator.main(): util function to execute single query against test schema removed dead code from TestSchemaUtils	2011-07-13 13:48:54 -07:00
marcel	f2ceb748ab	adding some expr eval functions plus script to generate them separating StringValue from tuple.h	2011-07-12 11:04:35 -07:00
marcel	3286190599	Initial version of backend.	2011-07-07 15:49:46 -07:00

28 Commits
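
The adjacent-tuple deduplication described in commit 6627604655 can be illustrated with a minimal sketch. This is not Impala's actual serialization code: the `Tuple` and `SerializedBatch` types and the `SerializeDedupAdjacent` function are hypothetical stand-ins, and the real row-batch format is assumed to be more involved. The sketch only shows the idea of comparing each row's tuple pointer with the previous row's and reusing the earlier offset instead of copying the data again.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a tuple: an opaque blob of bytes.
struct Tuple {
  std::string data;
};

// Hypothetical serialized form: one offset per row into a shared byte buffer.
struct SerializedBatch {
  std::vector<std::size_t> offsets;
  std::string buffer;
};

// Copies tuple data into the output buffer, but when a row points at the same
// tuple as the row immediately before it, records the previous offset instead
// of copying the bytes a second time. Non-adjacent repeats are still copied.
SerializedBatch SerializeDedupAdjacent(const std::vector<const Tuple*>& rows) {
  SerializedBatch out;
  const Tuple* prev = nullptr;
  for (const Tuple* t : rows) {
    assert(t != nullptr);  // this sketch does not model NULL tuples
    if (t == prev) {
      // Same backing tuple as the previous row: refer back to its offset.
      out.offsets.push_back(out.offsets.back());
      continue;
    }
    out.offsets.push_back(out.buffer.size());
    out.buffer.append(t->data);
    prev = t;
  }
  return out;
}
```

Checking only the adjacent row keeps the per-row cost to a single pointer comparison, which matches the commit's claim of minimal overhead. A tuple that repeats later in the batch, with other tuples in between, is still copied again.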