We need to pass a flag to the metastore for the cleanup to happen. Previously we
were passing 'false' when we need to pass 'true' to get the same behavior as Hive
when dropping databases. Added a test case to validate the cleanup when dropping
databases and tables.
Change-Id: I500a3d3ac52c1b2031fae842403a670cfe43fa98
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1035
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
A compute stats command computes the table and column stats for a given
table and persists them in the metastore.
The table stats consist of the per-partition and per-table row count.
The column stats are computed on a per-table basis and consist of the
number of distinct values and the number of NULLs per column.
This patch introduces a new 'child query' concept that
compute stats utilizes. Child queries are cancelled
if the parent query is cancelled. A compute stats stmt is
executed by the following query hierarchy:
parent: compute stats query (DDL)
- child: compute table stats query (QUERY)
- child: compute column stats query (QUERY)
The new child query concept is necessary to decouple child query fetches
from parent query fetches, i.e., we could not execute a child query as
part of the original compute stats query, because then a client could
fetch the results we need for updating the Metastore statistics. The
reason why our existing CTAS works without this decoupling
is that its insert 'child query' is not fetchable.
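The parent/child relationship described above can be sketched roughly as follows. This is an illustration only: the `Query` class, its fields, and its method names are hypothetical and do not mirror Impala's actual classes. It shows only that cancelling the parent cancels every child, while cancelling a child leaves the parent alone.

```python
class Query:
    """Hypothetical stand-in for a running query; not Impala's real class."""

    def __init__(self, name, stmt_type):
        self.name = name
        self.stmt_type = stmt_type  # e.g. "DDL" or "QUERY"
        self.cancelled = False
        self.children = []

    def add_child(self, child):
        # Child results are fetched by the parent, not by the client.
        self.children.append(child)
        return child

    def cancel(self):
        # Cancelling a parent propagates to all of its child queries.
        self.cancelled = True
        for child in self.children:
            child.cancel()


parent = Query("compute stats", "DDL")
table_stats = parent.add_child(Query("compute table stats", "QUERY"))
column_stats = parent.add_child(Query("compute column stats", "QUERY"))
parent.cancel()
```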
Change-Id: I560533e3cb09bcbbdb3eea7fcf0b460bc6b36dcd
Reviewed-on: http://gerrit.ent.cloudera.com:8080/873
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
We were previously wasting memory by always reading into 8MB IO
buffers, even when the data read was much less than 8MB. With this
patch, the IO manager picks a buffer size closer to the actual amount
being read (we don't use the exact size so we can continue to recycle
buffers). The minimum IO buffer size is determined via the
--min_buffer_size flag, and the max IO buffer size via the --read_size
flag.
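One way to pick a buffer size that is close to the read size but still recyclable is to round up to a small set of size classes. The sketch below assumes power-of-two multiples of the minimum size; that choice, and the function name `pick_buffer_size`, are assumptions for illustration and not necessarily what the IO manager does.

```python
def pick_buffer_size(bytes_to_read, min_buffer_size, read_size):
    """Double from min_buffer_size until the buffer fits bytes_to_read,
    never returning more than read_size (the maximum IO buffer size)."""
    if bytes_to_read <= 0 or min_buffer_size <= 0:
        raise ValueError("sizes must be positive")
    size = min_buffer_size
    while size < bytes_to_read and size < read_size:
        size *= 2
    # Cap at the maximum in case doubling overshot read_size.
    return min(size, read_size)
```

Because only a handful of distinct sizes are ever handed out, freed buffers can be kept on per-size free lists and reused.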
This technique also helps with IMPALA-652, since short columns will
not use as much memory as before (we will not use considerably more
memory than the size of the table).
This patch also changes StringBuffer to use a doubling strategy so it
doesn't end up allocating many large unused buffers, and has the
scanner context use the requested length as the sync read size if it's
larger than the size produced by read_past_size_cb(). These changes
help prevent the boundary buffer in the scanner context from
allocating excess memory.
Change-Id: I0efb3b023ddfddb08bca22d5cb5f9511fb4d6c50
Reviewed-on: http://gerrit.ent.cloudera.com:8080/938
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
When dropping functions, we need to remove the function from the list
of Functions with that name AND remove the list from the Function map if
the list is empty. The second part wasn't happening.
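A minimal sketch of the two-step removal, assuming a map from function name to a list of overloads. The `FunctionMap` class and its method names are hypothetical and only illustrate the shape of the fix.

```python
class FunctionMap:
    """Hypothetical name -> list-of-overloads map."""

    def __init__(self):
        self._functions = {}

    def add(self, name, signature):
        self._functions.setdefault(name, []).append(signature)

    def drop(self, name, signature):
        overloads = self._functions.get(name)
        if overloads is None or signature not in overloads:
            return False
        # Step 1: remove the function from the list for that name.
        overloads.remove(signature)
        # Step 2 (the part that was missing): remove the empty list.
        if not overloads:
            del self._functions[name]
        return True

    def contains_name(self, name):
        return name in self._functions
```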
Also fixes the test_ddl to properly create all test databases.
Change-Id: Id85af7d5db74a31161f48bea3816bdf734063133
Reviewed-on: http://gerrit.ent.cloudera.com:8080/952
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
This change adds support for cluster-synchronized catalog operations. This provides the
guarantee that after a catalog op completes, all other subscribers to the catalog topic have
also processed that update. This is useful when load balancing, because a common workflow
is to target a different impalad for each statement executed.
For example, if each of the following were executed sequentially, but targeting
a different node:
1) CREATE TABLE Foo
2) INSERT INTO Foo
3) SELECT * FROM Foo
4) INSERT INTO Foo ....
Since both the INSERT and the CREATE update the catalog, it would not work as expected
without this patch. The user might either get a "table not found" error or would be
missing partition information from the INSERT.
The downside is that this approach to DDL takes a bit longer because we need to wait
until all subscribers have processed an update. If all nodes are healthy, this overhead
should not be significantly longer than the current DDL time. However, a single bad node
might slow down or completely block the completion of all DDL operations. By default
this feature is disabled, but it can be enabled using a new query option: SYNCED_DDL=1
To test this, the base test suite was updated to support selecting a random impalad
to execute each query section in a query test file. This is currently only enabled
for the insert and DDL tests, but could be leveraged by more tests in the future.
TODO: Add additional failure tests around this functionality.
TODO: Add an explicit "sync" statement so users do not need to run all their DDL
in this mode (since it is slower).
Change-Id: I45e757a931bf2a4740cc0cdd1e76ce49a1e22b83
Reviewed-on: http://gerrit.ent.cloudera.com:8080/899
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
This change adds support for faster DDL via the CatalogServer by directly
returning the TCatalogObject from each catalog operation and using this result
to update the local impalad's catalog cache directly, rather than waiting
for a state store heartbeat that contains the change.
Because the Impalad's catalog can now be updated in two ways, it means that
we need to be careful when applying updates to ensure no work gets "undone".
For example, consider the following sequence of events:
t1: [Direct Update] - Add item A - (Catalog Version 9)
t2: [Direct Update] - Drop item A - (Catalog Version 10)
t3: [StateStore Update] - (From Catalog Version 9)
In this case, we need to ensure that the state store update in t3 does not undo the
drop in t2, even though that update will contain the change to "add item A".
To support this, we now check the catalog versions before adding any item to ensure
that an older update does not overwrite an item with a newer catalog version.
To handle the case of removals, a new CatalogUpdateLog is introduced. This log tracks
the catalog version at which each item was removed from the catalog. When adding a new
catalog object, we check whether the object was removed in a catalog version greater
than the version of the object being added. If so, the update is ignored.
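The version checks can be sketched as below. This is a simplified model, not the actual implementation: the `CatalogCache` class is hypothetical, it stores only a version per object name, and a plain dict stands in for the CatalogUpdateLog.

```python
class CatalogCache:
    """Hypothetical impalad-side cache keyed by object name."""

    def __init__(self):
        self._versions = {}    # name -> catalog version of the cached object
        self._removed_at = {}  # name -> version at which it was removed (the log)

    def add(self, name, version):
        existing = self._versions.get(name)
        if existing is not None and existing >= version:
            return False  # cached copy is at least as new
        if self._removed_at.get(name, -1) > version:
            return False  # object was dropped in a later catalog version
        self._versions[name] = version
        return True

    def remove(self, name, version):
        existing = self._versions.get(name)
        if existing is not None and existing > version:
            return False  # cached copy is newer than this removal
        self._versions.pop(name, None)
        self._removed_at[name] = max(version, self._removed_at.get(name, -1))
        return True

    def __contains__(self, name):
        return name in self._versions


cache = CatalogCache()
cache.add("A", 9)              # t1: direct update, add item A
cache.remove("A", 10)          # t2: direct update, drop item A
applied = cache.add("A", 9)    # t3: statestore update built from version 9
```

Replaying the t1 to t3 sequence from the example leaves item A dropped, because the stale add at t3 is ignored.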
This covers most updates, but there is still one concurrency issue that is not covered
with this change. If someone issues an "invalidate metadata" concurrently with a
direct catalog operation, it may briefly set the catalog back in time. This seems like
okay behavior to me (the command is invalidating the catalog metadata). If we want
to address this the CatalogUpdateLog could be extended to track additions to the catalog
and we could replay the log after invalidating the metadata (as one possible solution).
Change-Id: Icc9bdecc3c32436708bf9e9e7974f91d40e514f2
Reviewed-on: http://gerrit.ent.cloudera.com:8080/864
Tested-by: jenkins
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
This patch fixes an issue where Impala would crash if two partitions had
the same HDFS location. This is now fixed in hdfs-scan-node. It also includes some
cleanup and bug fixes to the FE partition related classes and adds tests.
There is still a problem where partition location metadata is not sent
to the BE for INSERT statements, but that will be resolved in a separate
patch.
Change-Id: I0f1c3113d654f7d2b410f00e793ff6b0cae1ae18
Reviewed-on: http://gerrit.ent.cloudera.com:8080/876
Reviewed-by: Alan Choi <alan@cloudera.com>
Tested-by: jenkins
This patch fixes a slightly pathological state that occurs when the
statestore is under heavy load. The result of the bug is that
subscribers cannot successfully re-register because the statestore never
marks them as failed.
The exact sequence of events is as follows:
1. Subscriber registers with state-store.
2. Statestore does not send heartbeats in a timely fashion to
subscriber. Subscriber times out.
3. Subscriber is restarted quickly. Statestore does not detect
restart.
4. Subscriber's RegisterSubscriber() call fails, because statestore
detects duplicate registration.
5. Subscriber restarts again. Since state-store is slow to send
heartbeats, the state-store has not detected the restart and the
subscriber receives a heartbeat message from the statestore and
does not reject it.
6. Statestore continues to believe subscriber is alive, since the
heartbeats are not being rejected.
To fix this, we add a registration ID to each successfully registered
subscriber that is known to both subscriber and statestore. If the
subscriber should restart and re-register, it receives a new
registration ID. Whenever a heartbeat arrives, it compares its
registration ID to that sent by the statestore with the heartbeat, and
rejects the heartbeat if they do not match.
We also allow re-registration of existing subscribers (getting rid of
the dreaded "Duplicate subscription" message). A new registration
overwrites an old one.
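The registration ID handshake might look roughly like this. The `Statestore` and `Subscriber` classes here are toy stand-ins with hypothetical method names, and using a random UUID as the registration ID is an assumption made for the sketch.

```python
import uuid


class Statestore:
    def __init__(self):
        self._registrations = {}  # subscriber id -> current registration id

    def register(self, subscriber_id):
        # A new registration simply overwrites any existing one.
        registration_id = uuid.uuid4().hex
        self._registrations[subscriber_id] = registration_id
        return registration_id

    def registration_for(self, subscriber_id):
        return self._registrations[subscriber_id]


class Subscriber:
    def __init__(self, subscriber_id):
        self.subscriber_id = subscriber_id
        self.registration_id = None  # unknown until registration succeeds

    def register_with(self, statestore):
        self.registration_id = statestore.register(self.subscriber_id)

    def accept_heartbeat(self, registration_id):
        # Reject heartbeats addressed to a previous incarnation.
        return registration_id == self.registration_id
```

A restarted subscriber starts with no registration ID, so it rejects heartbeats carrying the old ID, which lets the statestore notice the failure.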
Change-Id: Ie32df3a586ccb375375ebfbcbec1aaeb930b6bfe
Reviewed-on: http://gerrit.ent.cloudera.com:8080/778
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
Adds support for "show create table", a DDL statement that outputs a DDL statement that
creates the specified table.
In general, the output DDL works in Impala, so a user can copy the output and execute it
to create the same table. However, there are a few special cases that output Hive DDL
because we do not support creating some tables in Impala: HBase tables and tables with
LZO compressed text. When we do support creating these tables in Impala, users should
be able to execute the DDL in Impala as well.
Change-Id: I8c130297a657810dea5b994bf99d72b0e61b847b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/842
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Matthew Jacobs <mj@cloudera.com>
Fixed the following stats-related bugs:
- Per-partition row count was not distributed properly via CatalogService
- HBase column stats were not loaded and distributed properly
Enhancements to test framework:
- Allow regex specification of expected row or column values
- Fixed expected results of some tests because the test framework
did not catch that they were incorrect
Change-Id: I1fa8e710bbcf0ddb62b961fdd26ecd9ce7b75d51
Reviewed-on: http://gerrit.ent.cloudera.com:8080/813
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
This updates the tests to run more test cases in parallel and also removes some
unneeded "invalidate metadata" calls. This cut down the 'serial' execution time
for me by 10+ minutes.
Change-Id: I04b4d6db508a26a1a2e4b972bcf74f4d8b9dde5a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/757
Tested-by: jenkins
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
This patch goes some way to improving recovery after an INSERT
fails. Inserts now write intermediate results to
<table_dir>/.impala_insert_staging. After execution completes, either
successfully or not, the query-specific directory under that directory
is deleted.
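A rough sketch of the staging pattern, using the local filesystem as a stand-in for HDFS. The `run_insert` function, the `write_files` callback, and naming the query-specific directory after a query id are assumptions for illustration.

```python
import shutil
import tempfile
from pathlib import Path

STAGING_DIR_NAME = ".impala_insert_staging"


def run_insert(table_dir, query_id, write_files):
    """Write into a query-specific staging dir, move results into the
    table dir on success, and always delete the staging dir afterwards."""
    staging = Path(table_dir) / STAGING_DIR_NAME / query_id
    staging.mkdir(parents=True)
    try:
        write_files(staging)
        # Snapshot the listing before moving files out of the directory.
        for f in list(staging.iterdir()):
            shutil.move(str(f), str(Path(table_dir) / f.name))
    finally:
        # Runs whether the insert succeeded or failed.
        shutil.rmtree(staging, ignore_errors=True)
```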
This doesn't complete the job for better cleanup (although this goes as
far as IMPALA-449 suggests). Two things to do in the future:
* Have each backend delete its own staging files on error. The
difficulty getting there now is that backends don't know if they are
cancelled in error or because a LIMIT was reached.
* If the operation to move files to their final destinations should
fail during FinalizeQuery(), the coordinator should perform
compensation actions and delete the files that made it.
Note: We also considered a query-wide and impalad-wide option to change
the staging dir. There are advantages to this (all intermediate results
go to a known location which is easy to clean up on failure), but also
security and other operational concerns. Worth revisiting in the future.
Change-Id: Ia54cf36db6a382e359877f87d7d40aad7fdb77be
Reviewed-on: http://gerrit.ent.cloudera.com:8080/670
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
With this change we now detect if a table is read-only and disable INSERT/LOAD operations
on these tables. A table is read-only if Impala does not have write permission on the HDFS
base directory of the table or any one of the partition directories (if
the table is partitioned).
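The read-only rule can be expressed compactly. In this sketch the permission check is injected as a callable (`has_write_access`), since how Impala actually queries HDFS permissions is not described here; the function name and parameters are hypothetical.

```python
def is_read_only(base_dir, partition_dirs, has_write_access):
    """A table is read-only if the base directory or any partition
    directory is not writable."""
    directories = [base_dir, *partition_dirs]
    return not all(has_write_access(d) for d in directories)
```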
Change-Id: I25515b2d0ffb7fe297359437fd937a3d6e0406a0
Reviewed-on: http://gerrit.ent.cloudera.com:8080/713
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
ALTER TABLE ADD PARTITION had performance problems as it scaled to a
large number of partitions. The reason Impala hits this is that after executing the
ALTER DDL, Impala "refreshes" the table metadata. As part of the
refresh() Impala tries to reuse any metadata it can that already is
cached. In this case the lastDdlTime has changed so all partitions are
reloaded from the metastore using the listPartitions() RPC. This call
does not scale well as the number of partitions grows: for a table with
2000 partitions, ADD PARTITION can take over 20 seconds.
This change significantly improves the performance of ALTER TABLE
ADD PARTITION by slightly changing how incremental refreshes work. We
now check the list of partition names in the metastore and load only the
delta of what is new (and remove any partitions that have been dropped)
by checking a new "isDirty()" flag on HdfsPartition. Additionally,
the lastDdlTime is now updated on the local (cached) copy after each
ALTER operation so we can detect external modifications to the
table.
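The incremental refresh boils down to a set difference on partition names. The sketch below is an approximation under stated assumptions: `plan_partition_refresh` is a hypothetical helper, and treating dirty partitions as "reload if still present" is one reading of how the isDirty() flag is used.

```python
def plan_partition_refresh(cached_names, metastore_names, dirty_names):
    """Return (names to load, names to drop) for an incremental refresh."""
    cached = set(cached_names)
    current = set(metastore_names)
    dirty = set(dirty_names)
    # Load partitions that are new, plus cached ones flagged dirty.
    to_load = (current - cached) | (dirty & current)
    # Drop cached partitions that no longer exist in the metastore.
    to_drop = cached - current
    return sorted(to_load), sorted(to_drop)
```

Only the names in `to_load` need a metastore fetch, so the cost of adding one partition no longer grows with the total number of partitions.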
With these changes we are able to add partitions at a pretty much
constant time (~1s / partition), even for tables that have a large
number of partitions.
Change-Id: Idc48618d4061ea3c56d9b6dae2c431a7ac49d5d9
Reviewed-on: http://gerrit.ent.cloudera.com:8080/495
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
Before this, we had to specify the entire mangled symbol. This can be quite
long and quite tedious (take a look at some of the create UDA test cases that
specify all the symbols).
This patch adds some code to convert from the user function signature to the
mangled name. This means the user can specify the unmangled name and we can
do the symbol lookup. The mangling rules are pretty convoluted but if it is
messed up, the user can always specify the full symbol.
Some other minor cleanup in:
- JNI from FE to BE
- UDFs/UDAs that are loaded as test data
Change-Id: I733dbf3a72cb7b06221c27e622d161bcca0d74a8
Reviewed-on: http://gerrit.ent.cloudera.com:8080/624
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
This patch redoes how the aggregation node is implemented. The functionality is
now split between aggregation-node, agg-expr and aggregate-functions. This is a work in
progress (there's still a lot of debug stuff I added that needs to be cleaned up) but
it does pass the tests.
Aggregation-node is now very simple and now only deals with the grouping part.
Aggregate-expr serves as the glue between the agg node and the aggregate functions.
The aggregation functions are implemented with the UDA interface. I've reimplemented
our existing aggregate functions with this setup. For true UDAs, the binaries would be
loaded in aggregate-expr.
This also includes some preliminary changes in the FE. We now need to annotate each
AggNode as executing the update vs. merge phase (root aggs execute update, others
execute merge) and if it needs a finalize step (only the root does). This is more
general than our builtins which are too simple to need this structure.
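To illustrate the update, merge, and finalize phases, here is a toy average aggregate written as four plain functions. This is a sketch only and not the UDA interface itself; the function names are hypothetical, and the (sum, count) intermediate state is chosen to show why the intermediate type can differ from the output type.

```python
def avg_init():
    # Intermediate state: (running sum, row count).
    return (0, 0)


def avg_update(state, value):
    # Update phase: fold one input value into the state; NULLs are skipped.
    if value is None:
        return state
    total, count = state
    return (total + value, count + 1)


def avg_merge(a, b):
    # Merge phase: combine two intermediate states.
    return (a[0] + b[0], a[1] + b[1])


def avg_finalize(state):
    # Finalize phase: turn the intermediate state into the output value.
    total, count = state
    return None if count == 0 else total / count


left = avg_init()
for v in [1, 2, 3]:
    left = avg_update(left, v)
right = avg_init()
for v in [4, None, 5]:
    right = avg_update(right, v)
result = avg_finalize(avg_merge(left, right))
```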
There is a big TODO here to allow the intermediate types between agg nodes to change.
For example, in distinct estimate, the input type is the column type and the output type
is a bigint. We'd like the intermediate type to be CHAR(256). This is different since
currently, the intermediate type and output type have always been the same. We've hacked
around this by having both the intermediate and output type be TYPE_STRING. I've left
this for another patch (changing the BE to support this is trivial).
For aggregates that result in strings, we used to store some additional stuff past the
end of the tuple. The layout was:
<tuple> <length of 1st string buffer>,<length of 2nd string buffer>, etc
The rationale for this is that we want to reuse the buffer for min/max and grow the buffer
more quickly for group_concat. This breaks down the abstraction between agg-expr and
agg-node and is not something UDAs can use in general. Rather than try to hack around
this, I think the proper solution is for the intermediate type to be something other than
StringValue that contains the buffer length itself.
This patch also resurrects the distinct estimate code. The distinct estimate functions
exercise all of the code paths.
Change-Id: Ic152a2cd03bc1713967673681e1e6204dcd80346
Reviewed-on: http://gerrit.ent.cloudera.com:8080/564
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
The Impala CatalogService manages the caching and dissemination of cluster-wide metadata.
The CatalogService combines the metadata from the Hive Metastore, the NameNode,
and potentially additional sources in the future. The CatalogService uses the
StateStore to broadcast metadata updates across the cluster.
The CatalogService also directly handles executing metadata update requests from
impalad servers (DDL requests). It exposes a Thrift interface to allow impalads to
connect directly and execute their DDL operations.
The CatalogService has two main components - a C++ server that implements StateStore
integration, Thrift service implementation, and exporting of the debug webpage/metrics.
The other main component is the Java Catalog that manages caching and updating of all
the metadata. For each StateStore heartbeat, a delta of all metadata updates is broadcast
to the rest of the cluster.
Some Notes On the Changes
---
* The metadata is all sent as thrift structs. To do this all catalog objects (Tables/Views,
Databases, UDFs) have a thrift struct to represent them. These are sent with each statestore
delta update.
* The existing Catalog class has been separated into two separate sub-classes: an
ImpaladCatalog and a CatalogServiceCatalog. See the comments on those classes for more
details.
What is working:
* New CatalogService created
* Working with statestore delta updates and latest UDF changes
* DDL performed on Node 1 is now visible on all other nodes without a "refresh".
* Each DDL operation against the Catalog Service will return the catalog version that
contains the change. An impalad will wait for the statestore heartbeat that contains this
version before returning from the DDL command.
* All table types (Hbase, Hdfs, Views) getting their metadata propagated properly
* Block location information included in CS updates and used by Impalads
* Column and table stats included in CS updates and used by Impalads
* Query tests are all passing
Still TODO:
* Directly return catalog object metadata from DDL requests
* Poll the Hive Metastore to detect new/dropped/modified tables
* Reorganize the FE code for the Catalog Service. I don't think we want everything in the
same JAR.
Change-Id: I8c61296dac28fb98bcfdc17361f4f141d3977eda
Reviewed-on: http://gerrit.ent.cloudera.com:8080/601
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
execute_using_jdbc used to expect a query string. Its interface was recently changed to
accept a query object. Additionally, change the interface of the Query() class to enable
it to accept raw (qualified) query strings.
Change-Id: I44693cd2cccf1041cab32a9821fb76b12d148375
Reviewed-on: http://gerrit.ent.cloudera.com:8080/577
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
This patch also adds a number of improvements to NativeUdfExpr. Highlights include:
* Correctly handling the lowering of AnyVal struct types (required for ABI compatibility)
* A rudimentary library cache for reusing handles produced by dlopen
* More complicated test cases
Change-Id: Iab9acdd7d7c4308e5d7ee3210f21b033fda5a195
Reviewed-on: http://gerrit.ent.cloudera.com:8080/540
Tested-by: jenkins
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
Instead of failing when Hive ColumnStatsData isn't compatible with the
column type, Impala will now reset that Column's stats to "unknown".
This allows the table metadata to be loaded, for what could be a common
scenario - someone computes stats on a column and then changes the
column type using an ALTER TABLE command.
Change-Id: I7f681b8e6b6d35268a84794014912122d1fefab6
Reviewed-on: http://gerrit.ent.cloudera.com:8080/515
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
OVERWRITE
INSERT OVERWRITE into an unpartitioned table is supposed to remove all
data files from the root. This should not include hidden files or
directories. This patch excludes hidden files from deletion, and adds a
test case.
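A minimal sketch of the unpartitioned case, using the local filesystem in place of HDFS. The `clear_table_root` helper is hypothetical, and treating names that start with "." or "_" as hidden is an assumption borrowed from common Hadoop convention rather than something stated in this patch.

```python
from pathlib import Path
import tempfile

HIDDEN_PREFIXES = (".", "_")  # assumption: Hadoop-style hidden names


def clear_table_root(root):
    """Delete data files directly under root; keep hidden entries and
    directories. Returns the names that were removed."""
    removed = []
    for entry in sorted(Path(root).iterdir()):
        if entry.name.startswith(HIDDEN_PREFIXES):
            continue  # hidden file or directory: leave it alone
        if entry.is_file():
            entry.unlink()
            removed.append(entry.name)
    return removed
```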
Partition directories are still removed in their entirety: the cost of
statting a large number of files and directories rather than issuing a
single "rm -rf" outweighs the benefits of preserving hidden files for
now.
Hive does not preserve hidden files in either configuration.
Change-Id: Ia73e55e011c26c88f14745075210cf359764e3c1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/418
Tested-by: jenkins
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
This change adds Impala DDL support for creation of AVRO tables.
Additionally, it adds Impala support for CREATE and ALTER SERDEPROPERTIES
which are used when creating Avro backed tables. This syntax is not
exactly the same as the Hive support since it introduces a new
fileformat (AVROFILE) that implies the needed Serialization library,
input format, and output format.
Change-Id: I5047e419198a89599e9d014fdedfee1a20437a7d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/464
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
This change adds support for updating TBLPROPERTIES via CREATE and ALTER DDL
statements. The TBLPROPERTIES are additional custom key-value metadata that is persisted
along with the table definition in the Hive Metastore.
Change-Id: Idbde4326353fffa3723375ff5e8469712220e1b3
Reviewed-on: http://gerrit.ent.cloudera.com:8080/425
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
Implements a group_concat() function which concatenates all the values in a group together.
The format is group_concat(str_col, [separator]). The default separator is ', '. NULLs
are ignored.
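The semantics can be modelled in a few lines. This is an illustrative model and not the backend implementation; returning None (NULL) when every input is NULL is an assumption that is not stated above.

```python
def group_concat(values, separator=", "):
    """Concatenate the non-NULL strings in a group with a separator."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None  # assumption: an all-NULL group yields NULL
    return separator.join(non_null)
```

For example, `group_concat(["a", None, "b"])` yields `"a, b"`, and passing `"|"` as the separator yields `"a|b"`.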
Change-Id: If152df6f528401117dba81d66ef691bfb548cc7d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/117
Reviewed-by: Aaron Davidson <aaron.davidson@cloudera.com>
Tested-by: Aaron Davidson <aaron.davidson@cloudera.com>
I tried to investigate the jenkins issue where we weren't returning any rows.
I set up the cluster on that box manually and noticed there weren't any results
because the store_sales table was empty. Refresh did not fix it. This looks like
a data loading issue. Adding this test would make discovering issues like this
much easier.
Change-Id: I8ccddd43892b279d506371b9de717629815c6a08
Reviewed-on: http://gerrit.ent.cloudera.com:8080/260
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
Split out the encoder/type for parquet reader/writer. I think this puts us
in a better place to support future encodings.
On the tpch lineitem table, the results are:
Before:
BytesWritten: 236.45 MB
Per Column Sizes:
l_comment: 75.71 MB
l_commitdate: 8.64 MB
l_discount: 11.19 MB
l_extendedprice: 33.02 MB
l_linenumber: 4.56 MB
l_linestatus: 869.98 KB
l_orderkey: 8.99 MB
l_partkey: 27.02 MB
l_quantity: 11.58 MB
l_receiptdate: 8.65 MB
l_returnflag: 1.40 MB
l_shipdate: 8.65 MB
l_shipinstruct: 1.45 MB
l_shipmode: 2.17 MB
l_suppkey: 21.91 MB
l_tax: 10.68 MB
After:
BytesWritten: 198.63 MB (84%)
Per Column Sizes:
l_comment: 75.71 MB (100%)
l_commitdate: 8.64 MB (100%)
l_discount: 2.89 MB (25.8%)
l_extendedprice: 33.13 MB (100.33%)
l_linenumber: 1.50 MB (32.89%)
l_linestatus: 870.26 KB (100.032%)
l_orderkey: 9.18 MB (102.11%)
l_partkey: 27.10 MB (100.29%)
l_quantity: 4.32 MB (37.31%)
l_receiptdate: 8.65 MB (100%)
l_returnflag: 1.40 MB (100%)
l_shipdate: 8.65 MB (100%)
l_shipinstruct: 1.45 MB (100%)
l_shipmode: 2.17 MB (100%)
l_suppkey: 10.11 MB (46.14%)
l_tax: 2.89 MB (27.06%)
The table is overall 84% as big (i.e. 16% smaller). A few columns got marginally
bigger. If the file filled the 1 GB, I'd expect the overhead to decrease even
more.
The restructuring to use a virtual call doesn't seem to change things much and
will go away when we codegen the scanner.
Here's what they look like with this patch (note this is on the before data files,
so only string cols are dictionary encoded).
Before query times:
Insert Time: 8.5 sec
select *: 2.3 sec
select avg(l_orderkey): .33 sec
After query times:
Insert Time: 9.5 sec <-- Longer due to doing dictionary encoding
select *: 2.4 sec <-- kind of noisy, possibly a slight slow down
select avg(l_orderkey): .33 sec
Change-Id: I213fdca1bb972cc200dc0cd9fb14b77a8d36d9e6
Reviewed-on: http://gerrit.ent.cloudera.com:8080/238
Tested-by: jenkins <kitchen-build@cloudera.com>
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>