Commit Graph

18 Commits

Author SHA1 Message Date
Zoltan Borok-Nagy
eb8b118db5 IMPALA-10384: Make partition names consistent between BE and FE
In the BE we build partition names with the trailing char '/'. In the FE
we build partition names without a trailing char. We should make this
consistent because this causes some annoying string adjustments in
the FE and can cause hidden bugs.

This patch creates partition names without the trailing '/' both in
the BE and the FE. This follows Hive's behavior that also prints
partition names without the trailing '/'.

Testing:
 * Ran exhaustive tests

Change-Id: I7e40111e2d1148aeb01ebc985bbb15db7d6a6012
Reviewed-on: http://gerrit.cloudera.org:8080/16850
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-12-11 19:51:28 +00:00
Adam Tamas
7295edcc26 IMPALA-9680: Fixed compressed inserts failing
Modified the insert testfiles to get which database they need to
use for 'CREATE TABLE LIKE' dynamically.

Tests:
Did targeted exhaustive testruns in test_insert.py and
test_mt_dop.py and did a full exhaustive testrun.

Change-Id: Ib3c7ba02190f57a7ed40311c95a3dd9eca9b474d
Reviewed-on: http://gerrit.cloudera.org:8080/15816
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
2020-05-11 19:32:08 +00:00
Adam Tamas
c32849a391 IMPALA-8980: Remove functional*.alltypesinsert from EE tests
-Modified the ‘test_insert.py’ so the tests can run parallel.
  -Every test will create its own temporary tables for insert testing.
-Swapped out the  SETUP tags to Truncate table QUERY statement.
  -Becouse the SETUP tag is not used anymore, the correspondig
  code was removed.
-A test query in ‘insert.test’. The test was incorrect so modified
to test for the right behavior.

Testing:
-tests/run-tests.py query_test/test_insert.py
-impala-py.test tests/query_test/test_insert.py
-the same for test_insert_permutation.py and test_load.py

Change-Id: I257e936868917a2fcc6c030f6c855b247e8a0eea
Reviewed-on: http://gerrit.cloudera.org:8080/15529
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-14 12:18:21 +00:00
Taras Bobrovytsky
b8b7930377 Add nested types support to Create Table Like File
Add support for creating a table based on a parquet file which contains arrays,
structs and/or maps.

Change-Id: I56259d53a3d9b82f318228e864c783b48a03f9ae
Reviewed-on: http://gerrit.cloudera.org:8080/582
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-08-22 01:46:26 +00:00
Alex Behm
1497002013 Added SHOW TABLE/COLUMN STATS command.
Fixed the following stats-related bugs:
- Per-partition row count was not distributed properly via CatalogService
- HBase column stats were not loaded and distributed properly

Enhancements to test framework:
- Allow regex specification of expected row or column values
- Fixed expected results of some tests because the test framework
  did not catch that they were incorrect

Change-Id: I1fa8e710bbcf0ddb62b961fdd26ecd9ce7b75d51
Reviewed-on: http://gerrit.ent.cloudera.com:8080/813
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
2014-01-08 10:53:51 -08:00
Lenni Kuff
a2cbd2820e Add Catalog Service and support for automatic metadata refresh
The Impala CatalogService manages the caching and dissemination of cluster-wide metadata.
The CatalogService combines the metadata from the Hive Metastore, the NameNode,
and potentially additional sources in the future. The CatalogService uses the
StateStore to broadcast metadata updates across the cluster.
The CatalogService also directly handles executing metadata updates request from
impalad servers (DDL requests). It exposes a Thrift interface to allow impalads to
directly connect execute their DDL operations.
The CatalogService has two main components - a C++ server that implements StateStore
integration, Thrift service implementiation, and exporting of the debug webpage/metrics.
The other main component is the Java Catalog that manages caching and updating of of all
the metadata. For each StateStore heartbeat, a delta of all metadata updates is broadcast
to the rest of the cluster.

Some Notes On the Changes
---
* The metadata is all sent as thrift structs. To do this all catalog objects (Tables/Views,
Databases, UDFs) have thrift struct to represent them. These are sent with each statestore
delta update.
* The existing Catalog class has been seperated into two seperate sub-classes. An
ImpladCatalog and a CatalogServiceCatalog. See the comments on those classes for more
details.

What is working:
* New CatalogService created
* Working with statestore delta updates and latest UDF changes
* DDL performed on Node 1 is now visible on all other nodes without a "refresh".
* Each DDL operation against the Catalog Service will return the catalog version that
  contains the change. An impalad will wait for the statestore heartbeat that contains this
  version before returning from the DDL comment.
* All table types (Hbase, Hdfs, Views) getting their metadata propagated properly
* Block location information included in CS updates and used by Impalads
* Column and table stats included in CS updates and used by Impalads
* Query tests are all passing

Still TODO:
* Directly return catalog object metadata from DDL requests
* Poll the Hive Metastore to detect new/dropped/modified tables
* Reorganize the FE code for the Catalog Service. I don't think we want everything in the
  same JAR.

Change-Id: I8c61296dac28fb98bcfdc17361f4f141d3977eda
Reviewed-on: http://gerrit.ent.cloudera.com:8080/601
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:53:11 -08:00
Nong Li
a3bc1ce133 Some parquet encoder/decoder refactoring. Added dictionary to other types.
Split out the encoder/type for parquet reader/writer. I think this puts us
in a better place to support future encodings.

On the tpch lineitem table, the results are:
Before:
  BytesWritten: 236.45 MB
  Per Column Sizes:
    l_comment: 75.71 MB
    l_commitdate: 8.64 MB
    l_discount: 11.19 MB
    l_extendedprice: 33.02 MB
    l_linenumber: 4.56 MB
    l_linestatus: 869.98 KB
    l_orderkey: 8.99 MB
    l_partkey: 27.02 MB
    l_quantity: 11.58 MB
    l_receiptdate: 8.65 MB
    l_returnflag: 1.40 MB
    l_shipdate: 8.65 MB
    l_shipinstruct: 1.45 MB
    l_shipmode: 2.17 MB
    l_suppkey: 21.91 MB
    l_tax: 10.68 MB
After:
 BytesWritten: 198.63 MB            (84%)
  Per Column Sizes:
    l_comment: 75.71 MB             (100%)
    l_commitdate: 8.64 MB           (100%)
    l_discount: 2.89 MB             (25.8%)
    l_extendedprice: 33.13 MB       (100.33%)
    l_linenumber: 1.50 MB           (32.89%)
    l_linestatus: 870.26 KB         (100.032%)
    l_orderkey: 9.18 MB             (102.11%)
    l_partkey: 27.10 MB             (100.29%)
    l_quantity: 4.32 MB             (37.31%)
    l_receiptdate: 8.65 MB          (100%)
    l_returnflag: 1.40 MB           (100%)
    l_shipdate: 8.65 MB             (100%)
    l_shipinstruct: 1.45 MB         (100%)
    l_shipmode: 2.17 MB             (100%)
    l_suppkey: 10.11 MB             (46.14%)
    l_tax: 2.89 MB                  (27.06%)

The table is overall 84% as big (i.e. 16% smaller). A few columns got marginally
bigger. If the file filled  the 1 GB, I'd expect the overhead to decrease even
more.

The restructuring to use a virtual call doesn't seem to change things much and
will go away when we codegen the scanner.

Here's what they look like with this patch (note this is on the before data files,
so only string cols are dictionary encoded).

Before query times:
  Insert Time: 8.5 sec
  select *: 2.3 sec
  select avg(l_orderkey): .33 sec

After query times:
  Insert Time: 9.5 sec                  <-- Longer due to doing dictionary encoding
  select *: 2.4 sec                     <-- kind of noisy, possibly a slight slow down
  select avg(l_orderkey): .33 sec

Change-Id: I213fdca1bb972cc200dc0cd9fb14b77a8d36d9e6
Reviewed-on: http://gerrit.ent.cloudera.com:8080/238
Tested-by: jenkins <kitchen-build@cloudera.com>
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
2014-01-08 10:52:16 -08:00
Alex Behm
9a201645cd IMPALA-496: Fix escaping of field delimiter and escape character in inserts
Change-Id: I49c36ae9823b35dcb9e92d1a13bef270657e36f2
Reviewed-on: http://gerrit.ent.cloudera.com:8080/163
Tested-by: jenkins <kitchen-build@cloudera.com>
Reviewed-by: Nong Li <nong@cloudera.com>
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:52:09 -08:00
Alex Behm
9ff09cd3f4 IMPALA-70: Respect tbl properties to allow empty strings to be treated as NULL 2014-01-08 10:50:28 -08:00
Henry Robinson
ead69d377f IMPALA-249, IMPALA-252: Fixes for static partition keys. 2014-01-08 10:50:14 -08:00
Alex Behm
1b2e8280d4 Fix NULL issues. 2014-01-08 10:49:32 -08:00
Alex Behm
673d7b97cf IMPALA-190: Insert with NULL partition keys results in SIGSEGV. 2014-01-08 10:49:22 -08:00
ishaan
5138a720bb IMP-768: Enable the python test framework to check for insert results. 2014-01-08 10:48:22 -08:00
ishaan
09d6d931f4 Change the way data is loaded 2014-01-08 10:48:09 -08:00
Lenni Kuff
ef48f65e76 Add test framework for running Impala query tests via Python
This is the first set of changes required to start getting our functional test
infrastructure moved from JUnit to Python. After investigating a number of
option, I decided to go with a python test executor named py.test
(http://pytest.org/). It is very flexible, open source (MIT licensed), and will
enable us to do some cool things like parallel test execution.

As part of this change, we now use our "test vectors" for query test execution.
This will be very nice because it means if load the "core" dataset you know you
will be able to run the "core" query tests (specified by --exploration_strategy
when running the tests).

You will see that now each combination of table format + query exec options is
treated like an individual test case. this will make it much easier to debug
exactly where something failed.

These new tests can be run using the script at tests/run-tests.sh
2014-01-08 10:46:50 -08:00
Michael Ubell
0750384b41 IMP-497 Insert with limit, remove extra files from test. 2014-01-08 10:46:33 -08:00
Michael Ubell
116241f1d1 IMP-497 Insert with limit. 2014-01-08 10:46:33 -08:00
Michael Ubell
7536510b69 IMP-258 Test writing nulls. 2014-01-08 10:46:31 -08:00