Commit Graph

606 Commits

Author SHA1 Message Date
Lenni Kuff
a2cbd2820e Add Catalog Service and support for automatic metadata refresh
The Impala CatalogService manages the caching and dissemination of cluster-wide metadata.
The CatalogService combines the metadata from the Hive Metastore, the NameNode,
and potentially additional sources in the future. The CatalogService uses the
StateStore to broadcast metadata updates across the cluster.
The CatalogService also directly handles executing metadata updates request from
impalad servers (DDL requests). It exposes a Thrift interface to allow impalads to
directly connect execute their DDL operations.
The CatalogService has two main components - a C++ server that implements StateStore
integration, Thrift service implementiation, and exporting of the debug webpage/metrics.
The other main component is the Java Catalog that manages caching and updating of of all
the metadata. For each StateStore heartbeat, a delta of all metadata updates is broadcast
to the rest of the cluster.

Some Notes On the Changes
---
* The metadata is all sent as thrift structs. To do this all catalog objects (Tables/Views,
Databases, UDFs) have thrift struct to represent them. These are sent with each statestore
delta update.
* The existing Catalog class has been seperated into two seperate sub-classes. An
ImpladCatalog and a CatalogServiceCatalog. See the comments on those classes for more
details.

What is working:
* New CatalogService created
* Working with statestore delta updates and latest UDF changes
* DDL performed on Node 1 is now visible on all other nodes without a "refresh".
* Each DDL operation against the Catalog Service will return the catalog version that
  contains the change. An impalad will wait for the statestore heartbeat that contains this
  version before returning from the DDL comment.
* All table types (Hbase, Hdfs, Views) getting their metadata propagated properly
* Block location information included in CS updates and used by Impalads
* Column and table stats included in CS updates and used by Impalads
* Query tests are all passing

Still TODO:
* Directly return catalog object metadata from DDL requests
* Poll the Hive Metastore to detect new/dropped/modified tables
* Reorganize the FE code for the Catalog Service. I don't think we want everything in the
  same JAR.

Change-Id: I8c61296dac28fb98bcfdc17361f4f141d3977eda
Reviewed-on: http://gerrit.ent.cloudera.com:8080/601
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:53:11 -08:00
Nong Li
1eb2b7a964 Add execution for vararg UDFs.
Change-Id: I46e5670c09ac0b8e62f39dfc832fe880dd1dc995
Reviewed-on: http://gerrit.ent.cloudera.com:8080/572
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
2014-01-08 10:53:09 -08:00
Alex Behm
9065648d77 Improvements to cost estimation and explain output.
Fixed cost estimation of union queries and exchange nodes.
Fixed propagation of stats through cloning of exprs and plan nodes.
Fixed propagation of expr stats to slots they are materialized into (e.g., grouping columns in multi-level aggs).
Improved explain output for constant selects.

Change-Id: I96d1652c00d48e4093b85ae7fc8bad28d74b8b81
Reviewed-on: http://gerrit.ent.cloudera.com:8080/547
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:53:08 -08:00
Nong Li
4bb1e8c854 Add varargs to UDF/UDA parser/analyzer.
Change-Id: I4c3f2e74f6c29cee4b0b787c058b0455b16a11fd
Reviewed-on: http://gerrit.ent.cloudera.com:8080/548
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: jenkins
2014-01-08 10:53:05 -08:00
Skye Wanderman-Milne
b7f83bcd73 Add support for LLVM IR UDFs.
This patch also adds a number of improvements to NativeUdfExpr. Highlights include:

* Correctly handling the lowering of AnyVal struct types (required for ABI compatibility)
* A rudimentary library cache for reusing handles produced by dlopen
* More complicated test cases

Change-Id: Iab9acdd7d7c4308e5d7ee3210f21b033fda5a195
Reviewed-on: http://gerrit.ent.cloudera.com:8080/540
Tested-by: jenkins
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
2014-01-08 10:53:03 -08:00
Nong Li
e5ed8e4105 Move minicluster_xml_conf to HADOOP_CONF_DIR.
The current location gets deleted if you rebuild, making you have to restart mini dfs.

Change-Id: If71b144534255fa8df2bfa187c0814ffdf28463e
Reviewed-on: http://gerrit.ent.cloudera.com:8080/550
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:53:03 -08:00
Nong Li
8963d79f51 Fix build break from UdfContext rename.
Change-Id: Ia3df23fcba7d3812ae90565daab89916cbb50861
Reviewed-on: http://gerrit.ent.cloudera.com:8080/549
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
2014-01-08 10:53:01 -08:00
Nong Li
e39de94316 Add parser/analysis to support UDAs.
I looked around some and I think having create/drop/show [aggregate] function
seems reasonable and extends nicely for UDTs.

The create aggregate function can accept a lot of arguments. The non-essential one, I
went with resolving them by name rather than position (i.e. argName="value"). I think
this is better for the user than specifying it by position.

The grammar is:
CREATE AGGREGATE <name>(<arg_types>) RETURNS <type> [INTERMEDIATE <type>]
LOCATION '/path' UpdateFn='Fn' [comment='comment']
[SerializeFn='symbol'] [MergeFn='symbol'] [InitFn='symbol'] [FinalizeFn='symbol']

The optional args at the end can be in any order. If the other symbols are not
specified, we derive them from the UpdateFn symbol that's required. The analyzer
would try to figure it out and fail if we can't find the derived symbol in the binary.

The simplest example would be:
CREATE AGGREGATE FUNCTION count(float) RETURNS BIGINT LOCATION '/path'
UpdateFn='CountUpdateFn';

In which case we assume the intermediate type is the return type and the other functions
are called 'CountInitFn', 'CountSerializeFn', 'CountMergeFn' 'CountFinalizeFn'.

Change-Id: Iefc5741293050f5b295df28e9d1a7d039ead8675
Reviewed-on: http://gerrit.ent.cloudera.com:8080/513
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:52:59 -08:00
Alex Behm
39f9a067fa IMPALA-444: Fixed accuracy of string to double conversion. Falling back to strod for scientific notation.
Change-Id: I9a5d948620907d34601ef041e58b1c9bb2172f71
Reviewed-on: http://gerrit.ent.cloudera.com:8080/507
Tested-by: jenkins
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:52:56 -08:00
Lenni Kuff
79cdeac3d6 Consolidate test cluster under IMPALA_HOME/cluster_logs + store logs during data loading
Change-Id: I8f6239e4ccb0515c85bf80193a475788fb18dedb
Reviewed-on: http://gerrit.ent.cloudera.com:8080/518
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
2014-01-08 10:52:56 -08:00
Alex Behm
6253b21834 IMPALA-505: Fixed conjunct evaluation against partition columns in hdfs scan node when there are no matarialized slots.
Change-Id: Ia003347bd7ee4986f5411c7175057192635a4c6c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/509
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:52:54 -08:00
Skye Wanderman-Milne
fd99db0300 First pass at UdfExpr.
Change-Id: I517bf56541749b5c2459554821c7bf838239fdf0
Reviewed-on: http://gerrit.ent.cloudera.com:8080/439
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
2014-01-08 10:52:50 -08:00
Henry Robinson
a46276325c IMPALA-415: Don't delete hidden files in the root directory for INSERT
OVERWRITE

INSERT OVERWRITE into an unpartitioned table is supposed to remove all
data files from the root. This should not include hidden files or
directories. This patch excludes hidden files from deletion, and adds a
test case.

Partition directories are still removed in their entirety: the cost of
statting a large number of files and directories rather than issuing a
single "rm -rf" outweighs the benefits of preserving hidden files for
now.

Hive does not preserve hidden files in either configuration.

Change-Id: Ia73e55e011c26c88f14745075210cf359764e3c1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/418
Tested-by: jenkins
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:52:50 -08:00
Nong Li
2b9105cd11 IMPALA-487: don't compact data from rhs of join if it is going through an exchange node.
Change-Id: I442445e7370218352cd6d3137f2a454c9afb73ba
Reviewed-on: http://gerrit.ent.cloudera.com:8080/476
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
2014-01-08 10:52:50 -08:00
Lenni Kuff
a1f2f72f49 Add Impala DDL support for creation of AVRO tables + support for CREATE/ALTER SERDEPROPERTIES
This change adds Impala DDL support for creation of AVRO tables.
Additionally, it add Impala support for CREATE and ALTER SERDEPROPERTIES
which are used when creating Avro backed tables. This syntax is not
exactly the same as the Hive support since it introduces a new
fileformat (AVROFILE) that implies the needed Serialization library,
input format, and output format.

Change-Id: I5047e419198a89599e9d014fdedfee1a20437a7d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/464
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:52:48 -08:00
Alex Behm
77c0e54bb9 Set HDFS block size to 128MB because HDFS versions since 2.1.0-beta use 128MB as a default (HDFS-4053).
Change-Id: If112d2eab242b44f05f64ee071ebea5b253c7927
Reviewed-on: http://gerrit.ent.cloudera.com:8080/470
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:52:48 -08:00
Nong Li
a0bf45a0b4 Add udf type.
Change-Id: Ic5f52c127750cc9c847a3e34d3fdcfc78bee5a8a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/454
Tested-by: jenkins
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:52:48 -08:00
Lenni Kuff
f5c9e4d075 Fix comment location in schema generation script
Our data generation doesn't appear to fully support comments in the
base table name section. This fixes the data generation, but we should
follow on by improving our comment support in the framework.

Change-Id: I68274bd98d5b0d54868d6c80b7137a59e7329229
Reviewed-on: http://gerrit.ent.cloudera.com:8080/465
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:52:47 -08:00
Lenni Kuff
8e2a313673 IMPALA-590: Impala should more gracefully fail when loading HBase tables with complex types
Change-Id: Ifc3338ee1339ff0544ed14066824f1aa2d9d7c25
Reviewed-on: http://gerrit.ent.cloudera.com:8080/457
Tested-by: jenkins
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
2014-01-08 10:52:47 -08:00
Alex Behm
33000b8c15 Fixed codegen of floating-point modulo.
Change-Id: Idd28c6a71a659471aa632a6e26d970557daeb3bf
Reviewed-on: http://gerrit.ent.cloudera.com:8080/385
Tested-by: jenkins
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
2014-01-08 10:52:46 -08:00
Greg Rahn
8492db3b7d fix typo for tpcds-q3
Change-Id: Ia678957dcda6ddf261422b6c43a718f5779d3553
Reviewed-on: http://gerrit.ent.cloudera.com:8080/453
Reviewed-by: Greg Rahn <grahn@cloudera.com>
Tested-by: Greg Rahn <grahn@cloudera.com>
2014-01-08 10:52:44 -08:00
Nong Li
308650f208 Fix create function ddl test setup issue.
Change-Id: I30c9a4342efbdb17bd53fb14bdcee172506cdadb
Reviewed-on: http://gerrit.ent.cloudera.com:8080/447
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:52:44 -08:00
Nong Li
8eb727b585 UDF ddl cleanup
Change-Id: I381fed277b5809727d2d8bf430258c01d2d0ae1f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/436
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
2014-01-08 10:52:43 -08:00
ishaan
d3a94aa4fe Don't load tpcds.store_sales_unpartitioned into any file format except text.
Change-Id: I398d26ca8e36a45cb3d0a076cdd604ff6eba793d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/444
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:52:43 -08:00
Lenni Kuff
d6d1557fe7 Capture cluster logs with each test run / don't use mvn for starting cluster services
Change-Id: I708b547e49d035c5f029ea86119cc844ccbc5643
Reviewed-on: http://gerrit.ent.cloudera.com:8080/404
Tested-by: jenkins
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:52:40 -08:00
Lenni Kuff
9f54242941 Add retry loop around split-hbase to fix build breaks
Change-Id: I539407ce05d705b6b4e88d0791fc4ec236c79c80
Reviewed-on: http://gerrit.ent.cloudera.com:8080/399
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:52:39 -08:00
Nong Li
b22d1f41a7 Change all "Status Close()" to "void Close()"
Doing it this way makes sure we don't bail early on the Close path
which is rarely the right thing to do. This found a few places where
we were not doing proper cleanup because of this.

Change-Id: Ie663c68398c14589b5cbc1bd980644b0b10fd865
Reviewed-on: http://gerrit.ent.cloudera.com:8080/373
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:52:38 -08:00
ishaan
6735e3983f Fix build failure because of hbase data loading.
Change-Id: I796656332c3733a1ffdc338d206009efa6c451ac
Reviewed-on: http://gerrit.ent.cloudera.com:8080/360
Tested-by: jenkins
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:52:37 -08:00
ishaan
53cd9eadab Treat HBase as a file format for functional tests
Change-Id: Ia01181a1e10eb108419122d347e9d869a69e8922
Reviewed-on: http://gerrit.ent.cloudera.com:8080/102
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:52:36 -08:00
Nong Li
af90c8a133 Fix memory usage tracking.
Changes MemLimit to MemTracker:
- the limit is optional
- it also records a label and an optional parent
- Consume() and Release() also update the ancestors and there's also a new
  AnyLimitExceeded(), which also checks the ancestors
- the consumption counter is a HighwaterMarkCounter and can optionally be created
  as part of a profile

Each fragment instance now has a MemTracker that is part of a 3-level
hierarchy: process, query, fragment instance.

Change-Id: I5f580f4956fdf07d70bd9a6531032439aaf0fd07
Reviewed-on: http://gerrit.ent.cloudera.com:8080/339
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:52:36 -08:00
Nong Li
2394ae2e66 UDF parsing and analysis.
Change-Id: If8058c1cb66bf5e9c7049d4b78f5882b46c03fc1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/318
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:52:32 -08:00
Aaron Davidson
cafb7b72f8 External sorting
This is an experimental implementation of external sorting. This patch includes the following additions:
(1) creation and implementation of the Sorter interface, which can sort Impala Tuples.
(2) normalization of Tuples to allow memcmp-able sorting.
(3) a testing framework for the Sorter,
(4) a benchmark to compare the current state of the Sorter with other sorts,
(5) an implementation of a Vector which can store data whose size is only known at runtime,
(6) a sorting algorithm (basically a dumbed down STL sort) which can operate over such a vector,
(7) implementation of a simple in-memory Merger, and
(8) logic to stream blocks of memory in and out of memory for the actual external merging.

I have a local branch for experimental optimizations and benchmarking -- this should be considered
a "basic", working sort.

The following optimizations have been implemented:
(i)   Optionally extracting keys instead of writing them in place.
(ii)  Optionally opportunistically parallelize run building (sorting & prepare for output).
(iii) Maximize disk IO and minimize buffer recycling by writing buffers out, but also keeping
      them in memory until right when they're needed.
(iv)  Prepare auxililary data backwards so the buffers can be released as we go, and still
      go out in an order which preserves the first buffers of the run.
(v)   Always merge maximum number of runs at a time, taking from the next merge level if
      available.

Change-Id: I1d7304d54d73152da929b1efffc1e851e5fb8fd4
Reviewed-on: http://gerrit.ent.cloudera.com:8080/126
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Aaron Davidson <aaron.davidson@cloudera.com>
2014-01-08 10:52:27 -08:00
ishaan
13343fb5ec Annotate tpcds count queries.
Annotation helps in easily identifying queries and searching for them in the performance
database.

Change-Id: I89dcfe4c2885f1d5b3d5158c026aac922ff6559d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/299
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:52:26 -08:00
Aaron Davidson
00275ce3a9 (IMPALA-422) Add string concatenation function
Implements a group_concat() function which concatenates all the values in a group together.

The format is group_concat(str_col, [separator]). The default separator is ', '. NULLs
are ignored.

Change-Id: If152df6f528401117dba81d66ef691bfb548cc7d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/117
Reviewed-by: Aaron Davidson <aaron.davidson@cloudera.com>
Tested-by: Aaron Davidson <aaron.davidson@cloudera.com>
2014-01-08 10:52:21 -08:00
Skye Wanderman-Milne
efac6f82fd Print errors to shell in BaseSequenceScanner.
Change-Id: I0d1b041695c0d61b8c4994833f0a703e3bfa9c6a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/278
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
2014-01-08 10:52:20 -08:00
Lenni Kuff
d66d3bfce3 IMPALA-161: Add Impala support for CREATE TABLE AS SELECT
This adds support for CREATE TABLE AS SELECT to Impala. It supports all functionality a
regular CREATE TABLE statement includes, except it does not allow for for specifying
partition columns. Hive also has this limitation and it wouldn't be too hard to support
in the future.

Change-Id: I4ca3c3b8f1576441b8bb5ed9dc521d7dfa96ab74
Reviewed-on: http://gerrit.ent.cloudera.com:8080/157
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:52:17 -08:00
Lenni Kuff
f264db1647 Automatically force load partitioned tables to ensure valid partition metadata
Change-Id: Ief91102f30d4669503d473299256a74a50d8fe3c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/261
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:52:17 -08:00
Nong Li
707a566b5d Add test to tpcds queries to validate table row counts.
I tried to investigate the jenkins issue where we weren't returning any rows.
I setup the cluster on that box manually and noticed there weren't any results
because the store_sales table was empty. Refresh did not fix. This looks like
a data loading issue. Adding this test would make discovering this like this
much easier.

Change-Id: I8ccddd43892b279d506371b9de717629815c6a08
Reviewed-on: http://gerrit.ent.cloudera.com:8080/260
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:52:17 -08:00
Lenni Kuff
a3016cc4d4 Add partitioned tpcds insert workload and tests
Change-Id: Iff45853153bf0830be3e423c994392998385a64f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/256
Tested-by: jenkins <kitchen-build@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:52:16 -08:00
ishaan
e9e23bff5d Fix build because of a change in parquetfile.
This changes QueryTest/create.test to unblock the builds.

Change-Id: If91ac43e349c2f81034ba7504c27890781f33260
Reviewed-on: http://gerrit.ent.cloudera.com:8080/255
Tested-by: jenkins <kitchen-build@cloudera.com>
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:52:16 -08:00
Nong Li
a3bc1ce133 Some parquet encoder/decoder refactoring. Added dictionary to other types.
Split out the encoder/type for parquet reader/writer. I think this puts us
in a better place to support future encodings.

On the tpch lineitem table, the results are:
Before:
  BytesWritten: 236.45 MB
  Per Column Sizes:
    l_comment: 75.71 MB
    l_commitdate: 8.64 MB
    l_discount: 11.19 MB
    l_extendedprice: 33.02 MB
    l_linenumber: 4.56 MB
    l_linestatus: 869.98 KB
    l_orderkey: 8.99 MB
    l_partkey: 27.02 MB
    l_quantity: 11.58 MB
    l_receiptdate: 8.65 MB
    l_returnflag: 1.40 MB
    l_shipdate: 8.65 MB
    l_shipinstruct: 1.45 MB
    l_shipmode: 2.17 MB
    l_suppkey: 21.91 MB
    l_tax: 10.68 MB
After:
 BytesWritten: 198.63 MB            (84%)
  Per Column Sizes:
    l_comment: 75.71 MB             (100%)
    l_commitdate: 8.64 MB           (100%)
    l_discount: 2.89 MB             (25.8%)
    l_extendedprice: 33.13 MB       (100.33%)
    l_linenumber: 1.50 MB           (32.89%)
    l_linestatus: 870.26 KB         (100.032%)
    l_orderkey: 9.18 MB             (102.11%)
    l_partkey: 27.10 MB             (100.29%)
    l_quantity: 4.32 MB             (37.31%)
    l_receiptdate: 8.65 MB          (100%)
    l_returnflag: 1.40 MB           (100%)
    l_shipdate: 8.65 MB             (100%)
    l_shipinstruct: 1.45 MB         (100%)
    l_shipmode: 2.17 MB             (100%)
    l_suppkey: 10.11 MB             (46.14%)
    l_tax: 2.89 MB                  (27.06%)

The table is overall 84% as big (i.e. 16% smaller). A few columns got marginally
bigger. If the file filled  the 1 GB, I'd expect the overhead to decrease even
more.

The restructuring to use a virtual call doesn't seem to change things much and
will go away when we codegen the scanner.

Here's what they look like with this patch (note this is on the before data files,
so only string cols are dictionary encoded).

Before query times:
  Insert Time: 8.5 sec
  select *: 2.3 sec
  select avg(l_orderkey): .33 sec

After query times:
  Insert Time: 9.5 sec                  <-- Longer due to doing dictionary encoding
  select *: 2.4 sec                     <-- kind of noisy, possibly a slight slow down
  select avg(l_orderkey): .33 sec

Change-Id: I213fdca1bb972cc200dc0cd9fb14b77a8d36d9e6
Reviewed-on: http://gerrit.ent.cloudera.com:8080/238
Tested-by: jenkins <kitchen-build@cloudera.com>
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
2014-01-08 10:52:16 -08:00
Lenni Kuff
be1d42c05a IMPALA-538: Look for Avro schema in SERDEPROPERTIES as well as TBLPROPERTIES
Change-Id: If5c0b36d5a3963176b07a0cb1ea680e3e36b2f96
Reviewed-on: http://gerrit.ent.cloudera.com:8080/248
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:52:15 -08:00
Skye Wanderman-Milne
b9ea32e9b7 Fix IMPALA-129, IMPALA-534, and other scanner bugs.
Change-Id: Idbd29af3fcc35b9e1173d08ac55b5780751c5938
Reviewed-on: http://gerrit.ent.cloudera.com:8080/196
Tested-by: jenkins <kitchen-build@cloudera.com>
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
2014-01-08 10:52:14 -08:00
Lenni Kuff
17ed6ea177 Partition TPC-DS dataset and add additional TPC-DS workload queries
Change-Id: I5410e68fdfd818a8287e0974332c3e36c344c300
Reviewed-on: http://gerrit.ent.cloudera.com:8080/99
Tested-by: jenkins <kitchen-build@cloudera.com>
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
2014-01-08 10:52:13 -08:00
Alex Behm
e52ed0800b IMPALA-524: Fix computation of stats for ExchangeNode and merge AggregationNodes. The issue caused unnecessary repartitioning for static partition insert queries having grouped aggregation in the feeding query stmt.
Change-Id: I5f4017e2c4d5a1bf88f51c4e0ff7ab28911e14f1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/202
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:52:11 -08:00
Alex Behm
9a201645cd IMPALA-496: Fix escaping of field delimiter and escape character in inserts
Change-Id: I49c36ae9823b35dcb9e92d1a13bef270657e36f2
Reviewed-on: http://gerrit.ent.cloudera.com:8080/163
Tested-by: jenkins <kitchen-build@cloudera.com>
Reviewed-by: Nong Li <nong@cloudera.com>
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:52:09 -08:00
Alex Behm
48ee7ce891 IMPALA-508: Fix join-cardinality estimation and choice of join strategy when a join involves a table lacking table stats.
Change-Id: I871273e1d9f048377ce638c201118fc21086db9a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/152
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:52:05 -08:00
Alex Behm
f0e2d539fc IMPALA-495: Views Sometimes Not Utilizing Partition Pruning.
Change-Id: I65daebbe8c4b72b956a409fe28edd3773fda7cb7
Reviewed-on: http://gerrit.ent.cloudera.com:8080/128
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:52:04 -08:00
Alex Behm
c9965e5a5c Fix build break due to views defined by a constant select.
Change-Id: I5deeeb03469494f5ba6ed7a911354bbdd6c98195
Reviewed-on: http://gerrit.ent.cloudera.com:8080/149
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Henry Robinson <henry@cloudera.com>
2014-01-08 10:52:04 -08:00
Alex Behm
2b427208e5 IMPALA-507: Creating a VIEW that does not reference a table fails with IllegalStateException.
Change-Id: I11470ba919bbfced76730adae2a46647c4ef110b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/146
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:52:04 -08:00