Fixes a crash that occurs in some cases when io buffers are still used and
child nodes are closed early. We close child nodes early when all rows have
been consumed and resources are transfered, but in some cases io buffers are
still in use when a scan node is closed. We avoid this problem by only
closing reader contexts when the entire fragment is closed.
Change-Id: Ie62cdecdcd530bdc61dd4e83cd9ecfc7d2c93ef6
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1806
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 66f14a47b953b7b7153c73f4e018d03461dcd5ef)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1859
This is because in HdfsTable we call call "expr.castTo(colType)", but BooleanLiteral
(incorrectly) didn't implement "uncheckedCastTo()". This meant that instead of a
BooleanLiteral being returned we got back a CastExpr, which cannot be cast to LiteralExpr.
As part of this change it turns out Boolean partition columns are also broken in Hive. I
filed HIVE-6590 for these issues and we decided to disable INSERT into a boolean partition
column for Impala due to this bug.
Change-Id: I3e295bb96aadc08d64faf551f6393a7128a7ef27
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1755
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
The previous implementation did not properly handle replacing the is_null
return argument from expr calls.
Change-Id: I96cd0dfca8876b4f914b0cbc4eb459ea3dcdf230
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1795
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
The bug: Multi-slot predicates bound to a single
outer-joined tuple were not marked as materialized.
In addition, such predicates were not picked up by nodes
under the join via getBoundPredicates() even if it
would be correct to do so.
The fix: Always mark slots of predicates that must be
evaluated by a join in SelectStmt.materializeRequiredSlots(),
regardless of whether the predicate can also be safely
evaluated below the join.
This patch also generalizes getBoundPredicates() to handle
multi-slot predicates and fixes some issues with redundant
predicate assignment. Still, the new approach has several
limitations which are documented in the predicate propatation
planner test to ease future improvements.
Change-Id: If5da0354a83c00a9766fc63b7780ed4d5a9c46e5
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1717
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1819
The bug was that the number of materialized agg-tuple slots did not correspond to the number
of materialized agg functions, due to binding predicates against an AggNode causing slot
materialization after SelectStmt.materializeRequiredSlots().
This patch fixes the issue by taking binding predicates (bound to a slot in an agg tuple)
into consideration in SelectStmt.materializeRequiredSlots().
I added a new sanity check in AggregationNode.toThrift() surfaced another issue with slot
materialization that is also fixed in this patch. The ordering exprs must be marked before
the agg exprs in SelectStmt.materializeRequiredSlots() because the odering exprs may contain
agg exprs that are only referenced inside the ORDER BY clause.
Change-Id: I1bdc0466f583907bed625ce6608938e59faee83f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1639
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1818
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Having predicates need to be transferred to the 2nd phase merge agg
for distinct + non-distinct aggregates without group by.
For distinct + non-distinct aggregates with group by, it is correct
to evaluate the predicates at the 2nd phase (non-merge) agg.
Change-Id: I71d73c4ef92becbb81e142bc0cb5f54e790b1fb5
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1743
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1817
This patch introduces the ability to specify a prepare and close
function for a UDF, as well as FunctionContext methods for maintaining
state across UDF invocations within a query. Many of the changes are
related to adding an Expr::Open() function which calls the UDF's
prepare function, if specified (it has to be called in Open() since
the LLVM module must be compiled first).
Change-Id: I581d90d03dff71f7ff5d4a6bef839ba6bc46b443
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1693
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 8e2ed7fb9051d98f89327715fdebd6f5ed22d6ee)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1757
Our new build machines (e.g., beefy) have more cores than our other machines,
so scan nodes may have a different memory estimate causing the explain tests
to fail. This patch fixes the num_scanner_threads to 1 for explain tests
to ensure consisteny estimates.
Change-Id: Ie6194f3c3b17d04aa141d04fcddb7ac948e92fcf
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1735
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1753
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
We run wat-for-hbase-master.py after starting hbase to account for a race between
the master and region server. This script has not been working for some time. It caused
no ill effects sinc the said race was absent. However, the race has manifested itself
again, so the script needs to be fixed. Setting the correct classpath does so.
Change-Id: I783a7473cfd24a9cb66711f5428f7052ceb96282
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1756
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
With a recent upstream change, a core-site.xml was introduced in a YARN test jar pulled in
by thirdparty. This causes MiniLlama to ignore options set in
fe/src/test/resources/core-site.xml. The problem manifests itself with the MiniDfsCluster
starting on an arbitary port, but it would have also caused a lot of tests to fail as none
of the compression codecs are pulled in. This change prepends the classpath used by
minillama with the path to the internal core-site.
Change-Id: Iee267fe12e02301baec059a1f7469288c038d6fa
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1739
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
This updates how Impala fetches partition metadata from the Hive Metastore to fetch
partitions in batches, rather than all at once. This helps reduce the load on the
HMS and also lets Impala scale to above 32K partitions. The downside is that it
may require additional RPCs to get all the partitions.
This is done by first querying the metastore to get all the partition names that
exist, then splitting the list of names into seperate batches to get the actual
partition metadata.
Impala uses a default size of 1000 partitions per batch, but it can be configured
by setting the 'hive.metastore.batch.retrieve.table.partition.max' parameter
in the hive-site.xml config file.
Change-Id: Ide0ec30ef8a9e00f79c26551aa8e5e7814c73034
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1662
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1698
The purpose of this patch is to avoid CDH-17414 which causes data files loaded
with Hive to incorrectly have a replication factor of 1. When using beeline
this problem only appears to occur immediately after creating the first HBase table
since starting HiveServer2, i.e., subsequent loads seem to function correctly.
This patch add a new script that creates an external HBase table in Hive to
'warm up' HiveServer2 immediately after it is started.
Subsequent loads should assign a correct replication factor.
Change-Id: Ic54c9401b67b748a8848d19f82b8e7df9535e845
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1640
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
This patch includes several changes to predicate assignment and propagation.
First, we now only register as outer joined those tuples of TableRefs
directly participating in an outer join. In particular,
materialized tuples referenced inside an outer-joined InlineView are not
registered as outer joined - only the InlineView's tuple is registered.
The other major change is that we detect when it is correct to propagate
predicates to scan nodes participating (directly or indirectly) in an outer
join by testing whether a predicate can become true if a tuple
is NULL. If that is the case, then it is generally not safe to propagate
a predicate because it would change the final result of the outer join.
Change-Id: Ia135ab15ec8c6ef756a908f797f96812d28c84c1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1567
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1606
Our code for eliminating redundant join predicates based on equivalence
classes is not quite right. I've commented out the relevant code
to ensure we don't incorrectly remove predicates. Left a TODO
to fix and re-enable this feature.
Change-Id: Ie76b365903dff6df271a378cbb4fd327ffa0631f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1569
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1572
This patch cleans up analysis and execution of scalar and aggregate functions
so that there is no difference between how builtins and user functions are
handled. The only difference is that the catalog is populated with the builtins
all the time.
The BE always gets a TFunction object and just executes it (builtins will have
an empty hdfs file location).
This removes the opcode registry and all of the functionality is subsumed by
the catalog, most of which was already duplicated there anyway.
This also introduces the concept of a system database; databases that the
user cannot modify and is populated automatically on startup.
Change-Id: Iaa3f84dad0a1a57691f5c7d8df7305faf01d70ed
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1386
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1577
The overall goal of this change allow for table metadata to be loaded in the background
but also to allow prioritization of loading on an as-needed basis. As part of analysis,
any tables that are not loaded are tracked and if analysis fails the Impalad will make
an RPC to the CatalogServer to requiest the metadata loading of these tables be
prioritized and analysis will be restarted.
To support this, the CatalogServer now has a deque of the tables to load. For
background loading, tables to load are added to the tail of the deque. However, a new
CatalogServer RPC was added that can prioritize the loading of one or more tables in
which case they will get added to the head of the deque. The next table to load is
always taken from the head. This helps prioritize loading but is admittedly not the most
fair approach.
The support the prioritized loading, some changes had to made on the Impalad side during
analysis:
- During analysis, any tables that are missing metadata are tracked.
- Analysis now runs in a loop. If it fails due to an AnalysisException AND at least 1
table/view was missing metadata, these tables missing metadata are requested to be
loaded by calling the CatalogServer.
- The impalad will wait until the required tables are received (by getting notified each
time there is a call to updateCatalog()), and waiting to run analysis until all tables
are available. Once the tables are available, analysis will restart.
This change also introduces two new flags:
--load_catalog_in_background (bool). When this is true (the default) the catalog server
will run a period background thread to queue all unloaded tables for loading. This is
generally the desired behavior, but there may be some cases (very large metastores) where
this may need to be disabled.
--num_metadata_loading_threads (int32). The number of threads to use when loading catalog
metadata (degree of parallelism). The default is 16, but it can be increased to improve
performance at the cost of stressing the Hive metastore/HDFS.
Change-Id: Ib94dbbf66ffcffea8c490f50f5c04d19fb2078ad
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1476
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1538
We were previously only clearing the cache in the catalog service
update loop so the impalad the drop was issued to was not doing the
right thing.
Change-Id: I6bee228e8c0d565cea4ea61cbf64240d83a45a7d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1511
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
The select exprs of an inline view may not always be materialised, yet
the output tuple itself may be. This patch fixes a crash in this
situation in the backend aggregation node which assumed its output tuple
would always have at least one materialised slot.
The cause was a couple of too-conservative DCHECKs that failed if the
tuple was NULL. In fact, the code was robust to this possibility without
the checks, so this bug didn't affect release builds of Impala.
Change-Id: If0b90809d30fcd196f55197953392452d1ac9c4f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1431
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 8c1c21b66c43e900760ace54d090305f32a85a1f)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1471
Tested-by: Henry Robinson <henry@cloudera.com>
We weren't initializing the udf mem pool causing UDFs to return strings to crash if used as part
of a constant expression.
Change-Id: Ic3a0e556aec8ce03a9e59f3ccf6980c682046b50
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1447
Reviewed-by: Nong Li <nong@cloudera.com>
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: jenkins
With Hive .12, the default RC file format can be configured to be ColumnarSerDe or
LazyBinaryColumnarSerDe. Impala does not yet support the LazyBinaryColumnarSerDe. This change
verifies it is properly disabled.
Change-Id: Ia84495868237ce2c89a9706ad75e0f7eb8499057
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1416
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1423
There was race when the catalog was invalidated at the same time a table
was being loaded. This is because an uninitialized Table was being returned
unexpectedly to the impalad due to the concurrent invalidate.
This fixes the problem by updating the CatalogObjectCache to load when
a catalog object is uninitialized, rather than load when null. New items can
now be added in a initialized or uninitialized state; uninitialized objects
are loaded on access.
Also adds a stress test for invalidate metadata/invalidate metadata <table>/refresh
In addition, it cleans up the locking in the Catalog to make it more
straight forward. The top-level catalogLock_ is now only in CatalogServiceCatalog
and this lock is used to protect the catalogVersion_. Operations that need to
perform an atomic bulk catalog operation can use this lock (such as when the
CatalogServer needs to take a snapshot of the catalog to calculate what delta to send
to the statestore). Otherwise, the lock is not needed and objects are protected by the
synchronization at each level in the object heirarchy (Db->[Function/Table]). That is,
Dbs are synchronized by the Db cache, each Db has a Table Cache which is synchronized
independently.
Change-Id: I9e542cd39cdbef26ddf05499470c0d96bb888765
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1355
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1418
This fixes the flaky ALTER/CREATE tests by removing a verification step that
didn't add value and was non-deterministic. The verficiation step that was
removed verified that CREATE/ALTER set the appropriate file format by
changing the format to something that didn't match the underlying data files,
then attempting to read the data. This is already covered by the positive
test case where the file format is changed to match the underlying data.
Change-Id: I66f485405234f472f3b83f3e776bf7f2c10de874
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1379
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1382
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
This change adds support for lazy loading of table metadata to the
CatalogService/Impalad. The way this works is that the CatalogService initially
sends out an update with only the databases and table names (wrapped as
IncompleteTables). When an Impalad encounters one of these tables, it will contact
the catalog service to get the metadata, possibly triggering a metadata load if the
catalog server has not yet loaded this table.
With these changes the catalog server starts up in just seconds, even for large
metastores since it only needs to call into the metastore to get the list of tables
and databases. The performance of "invalidate metadata" also improves for the same reason.
I also picked up the catalog cleanup patch I had to make the APIs a bit more consistent and
remove the need for using a LoadingCache for databases.
This also fixes up the FE tests to run in a more realistic fashion. The FE tests now run
against catalog object recieved from the catalog server. This actually turned up some bugs
in our previous test configuration where we were not running with the correct column stats
(we were always running with avgSerializedSize = slotSize). This changed some plans so the
planner tests needed to be updated.
Still TODO:
This does not include the changes to perform background metadata loading. I will send
that out as a separate patch on top of this.
Change-Id: Ied16f8a7f3a3393e89d6bfea78f0ba708d0ddd0e
Saving changes
Change-Id: I48c34408826b7396004177f5fc61a9523e664acc
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1328
Tested-by: jenkins
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1338
Tested-by: Lenni Kuff <lskuff@cloudera.com>
Changes include:
- version changes in impala-config
- version changes in various loading scripts
- hbase jars are no longer in hive/lib
- mini-llama script changes
- updates due to sentry api changes
- JDBC tests disabled
- unsupported types tests disabled.
Change-Id: If8cf1b7ad8e22aa4d23094b9a4b1047f7e9d93ee
Fixed codepath with rm disabled. Set enable_rm to false by default.
Change-Id: I3bf2d0525d91243ec3c0ea048b0c03680befcda2
Conflicts:
be/src/runtime/runtime-state.cc
Impala reserves resources from YARN via Llama and handles resources
preemptions by cancelling affected queries. Adds the Impala Resource
Broker for interacting with Llama. Refactors scheduler and coordinator
to move fragment-to-host assignment logic into scheduler. Local test
setup uses MiniLLama.
Change-Id: Ic7b0fe43de52d30f4207b4e65cce7e6a294e54e1