Calcite treats the intersect set operator with higher precedence
when compared with the except and union set operators. Impala treats
all the precedences equally (favoring left operators over right).
The following query was failing:
select 100 union select 101 intersect select 101
Calcite was returning 2 rows here, performing the intersect before
the union. Impala does the union first and returns one row.
To fix this, new custom operators were created for the set operators
where all set operators have equal precedence.
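The difference between the two precedence models can be sketched with
plain Python sets. This is only an illustration of the semantics, not
the operator code added by this commit; the function names are
hypothetical.

```python
def eval_left_to_right(first, rest):
    """Evaluate set operators with equal precedence, left to right."""
    result = set(first)
    for op, operand in rest:
        if op == "union":
            result |= set(operand)
        elif op == "intersect":
            result &= set(operand)
        elif op == "except":
            result -= set(operand)
        else:
            raise ValueError("unknown set operator: " + op)
    return result


def eval_intersect_first(first, rest):
    """Evaluate with intersect binding tighter than union and except."""
    terms = [set(first)]
    ops = []
    for op, operand in rest:
        if op == "intersect":
            # Fold the intersect into the operand immediately to its left.
            terms[-1] &= set(operand)
        else:
            ops.append(op)
            terms.append(set(operand))
    return eval_left_to_right(terms[0], list(zip(ops, terms[1:])))


# select 100 union select 101 intersect select 101
query = ({100}, [("union", {101}), ("intersect", {101})])
impala_rows = eval_left_to_right(*query)     # union first: one row
calcite_rows = eval_intersect_first(*query)  # intersect first: two rows
```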
Change-Id: Ic52661a30cc90534ea1a20868799edf9ceed13b6
Reviewed-on: http://gerrit.cloudera.org:8080/22052
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit allows support for milliseconds, microseconds, and
nanoseconds for Calcite.
The ImpalaSqlIntervalQualifier class extends the Calcite
SqlIntervalQualifier class which handles most datetime parts. Some
code was copied from the base class since some of the methods were private,
but in general, the code handling the parts is exactly the same.
Change-Id: I392c3900c70e7754a35ef25fc720ba4a2f2e5dd6
Reviewed-on: http://gerrit.cloudera.org:8080/22029
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The decode operator is similar to the case operator. The tricky part
for supporting the decode is that some of the operands are search
parameters and some are return parameters. The search parameters all
need to be compatible with each other. And the return parameters need
to be compatible with each other as well. Much of the code deals
with casting these parameters to compatible types.
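As a rough sketch of the operand split, assuming the operand layout
decode(expr, search1, result1, ..., [default]), the two groups that
need a common type could be separated as follows. The helper name is
hypothetical and the actual casting logic is not shown.

```python
def split_decode_operands(operands):
    """Split decode operands into search and return parameter groups."""
    if len(operands) < 3:
        raise ValueError("decode needs an expression plus one search/result pair")
    pairs = list(operands[1:])
    # An odd number of remaining operands means the last one is a default
    # value, which belongs with the return parameters.
    default = [pairs.pop()] if len(pairs) % 2 == 1 else []
    search_params = pairs[0::2]
    return_params = pairs[1::2] + default
    return search_params, return_params


# decode(col, 1, 'a', 2, 'b', 'z')
searches, returns = split_decode_operands(["col", "1", "'a'", "2", "'b'", "'z'"])
```

Each group would then be cast to a type compatible with all of its
members.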
Change-Id: Ia3b68fda7cfa14799a41428e35d5bbc5984a801a
Reviewed-on: http://gerrit.cloudera.org:8080/22031
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Allow administrators to configure per user limits on queries that can
run in the Impala system.
In order to do this, there are two parts. Firstly we must track the
total counts of queries in the system on a per-user basis. Secondly
there must be a user model that allows rules that control per-user
limits on the number of queries that can be run.
In a Kerberos environment, the user names that are used for both the user
model and at runtime are short user names, e.g. testuser when the
Kerberos principal is testuser/scm@EXAMPLE.COM.
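A minimal sketch of that mapping, assuming the short name is simply the
primary component of the principal (real deployments may apply
configurable mapping rules instead); the function name is hypothetical:

```python
def short_user_name(principal):
    """Return the primary component of a Kerberos principal."""
    # Drop the realm first, then the instance component.
    without_realm = principal.split("@", 1)[0]
    return without_realm.split("/", 1)[0]


name = short_user_name("testuser/scm@EXAMPLE.COM")
```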
TPoolStats (the data that is shared between Admission Control instances)
is extended to include a map from user name to a count of queries
running. This (along with some derived data structures) is updated when
queries are queued and when they are released from Admission Control.
This lifecycle is slightly different from other TPoolStats data which
usually tracks data about queries that are running. Queries can be
rejected because of user quotas at submission time. This is done for
two reasons: (1) queries can only be admitted from the front of the
queue and we do not want to block other queries due to quotas, and
(2) it is easy for users to understand what is going on when queries
are rejected at submission time.
Note that in configurations without an Admission Daemon, Admission
Control does not have perfect information about the system, so
over-admission is possible for User-Level Admission Quotas in the same
way that it is for other Admission Control mechanisms.
The User Model is implemented by extending the format of the
fair-scheduler.xml file. The rules controlling the per-user limits are
specified in terms of user or group names.
Two new elements ‘userQueryLimit’ and ‘groupQueryLimit’ can be added to
the fair-scheduler.xml file. These elements can be placed on the root
configuration, which applies to all pools, or the pool configuration.
The ‘userQueryLimit’ element has 2 child elements: "user"
and "totalCount". The 'user' element contains the short names of users,
and can be repeated, or have the value "*" for a wildcard name which
matches all users. The ‘groupQueryLimit’ element has 2 child
elements: "group" and "totalCount". The 'group' element contains group
names.
The root level rules and pool level rules must both be passed for a new
query to be queued. The rules dictate the maximum number of queries that
can be run by a user. When rules are evaluated at either the root level
or the pool level, evaluation stops at the first rule that matches the
user.
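The evaluation described above can be sketched as follows. The class
and function names are hypothetical, and the ordering of rules within a
level and the use of ">=" for the comparison are assumptions rather
than details taken from the implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryLimitRule:
    """Hypothetical stand-in for one userQueryLimit/groupQueryLimit element."""
    names: frozenset      # user names or group names; "*" matches every user
    total_count: int      # maximum number of queries for a matching user
    is_group: bool = False


def matching_limit(rules, user, groups):
    """Return the limit of the first rule matching the user, or None."""
    for rule in rules:
        if rule.is_group:
            matched = bool(rule.names & groups)
        else:
            matched = user in rule.names or "*" in rule.names
        if matched:
            return rule.total_count  # first match wins; stop evaluating
    return None


def may_queue(user, groups, levels):
    """levels is [(root_rules, count_at_root), (pool_rules, count_in_pool)]."""
    for rules, current_count in levels:
        limit = matching_limit(rules, user, groups)
        if limit is not None and current_count >= limit:
            return False  # both levels must pass
    return True


root_rules = [QueryLimitRule(frozenset({"alice"}), 2),
              QueryLimitRule(frozenset({"*"}), 10)]
pool_rules = [QueryLimitRule(frozenset({"analysts"}), 1, is_group=True)]
```

With these rules, "alice" is limited to 2 queries at the root level
because her specific rule matches before the wildcard rule is reached.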
To support reading the ‘userQueryLimit’ and ‘groupQueryLimit’ fields the
RequestPoolService is enhanced.
If user quotas are enabled for a pool then a list of the users with
running or queued queries in that pool is visible on the coordinator
webui admission control page.
More comprehensive documentation of the user model will be provided in
IMPALA-12943.
TESTING
New end-to-end tests are added to test_admission_controller.py, and
admission-controller-test is extended to provide unit tests for the
user model.
Change-Id: I4c33f3f2427db57fb9b6c593a4b22d5029549b41
Reviewed-on: http://gerrit.cloudera.org:8080/21616
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This fixes a regression caused by IMPALA-13516.
The validator finds the explicit_cast function when the cast to
timestamp function is present in the query. However, the validator
then tries to validate the explicit_cast function which was not
present in the operator table. Placing the operator in the table
fixes the issue.
Change-Id: Ib8577a06178435f5048d0a9721c16069ebe05743
Reviewed-on: http://gerrit.cloudera.org:8080/22057
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
Some cast functions were failing, for several reasons. One reason
is that Calcite classifies all integers as an "int" even if they
fit in a smaller type (e.g. tinyint). Normally this is handled by
the "CoerceNodes" portion, but CoerceNodes cannot tell whether the
query had the phrase "select cast(1 as integer)" or "select 1",
since both show up to CoerceNodes as "select 1:INT".
To handle this, an "explicit_cast" operator now exists and is used
when the cast function is parsed. The explicit_cast operator has to
be different from the Calcite "cast" operator in order to avoid
being optimized out in various portions of the compilation.
Change-Id: I1edabc942de1c4030331bc29612c41b392cd8a05
Reviewed-on: http://gerrit.cloudera.org:8080/22034
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
The following SQL query in analytics.test ...
select lag(coalesce(505, 1 + NULL), 1) over (order by int_col desc)
from alltypestiny
... had several issues:
1) The coalesce function needed a special operator. This function
derives its return type from a common type that works for all
parameters.
2) The function was not being saved when being reset. This is
needed for when resetAnalysisState is called.
3) createNullLiteral needed to be overridden for similar reasons.
The null literal type needs to be saved for when
resetAnalysisState is called.
Change-Id: Ic54d955a73cec4b5f421099a74df4172a1b7dd8b
Reviewed-on: http://gerrit.cloudera.org:8080/22024
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This makes several changes to the Calcite planner to improve the
generated exceptions when there are errors:
1. When the Calcite parser produces SqlParseException, this is converted
to Impala's regular ParseException.
2. When the Calcite validation fails, it produces a CalciteContextException,
which is a wrapper around the real cause. This converts these validation
errors into AnalysisExceptions.
3. This produces UnsupportedFeatureException for non-HDFS table types like
Kudu, HBase, Iceberg, and views. It also produces
UnsupportedFeatureException for HDFS tables with complex types (which
otherwise would hit ClassCastException).
4. This changes exception handling in CalciteJniFrontend.java so it does
not convert exceptions to InternalException. The JNI code will print
the stacktrace for exceptions, so this drops the existing call to
print the exception stack trace.
Testing:
- Ran some end-to-end tests with a mode that continues past failures
and examined the output.
Change-Id: I6702ceac1d1d67c3d82ec357d938f12a6cf1c828
Reviewed-on: http://gerrit.cloudera.org:8080/21989
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
This adds support for explain statements for the Calcite planner.
This also fixes up the Parser.jj file so that statements not
processed by the Calcite planner, like "describe" and other
non-select statements, will fail in the parser. The parser will now
only handle "select" and "explain" as the first keyword.
If the parser fails, we need to do an additional check within Impala.
We run the statement through the Impala parser and check the statement
type. If the statement type is anything other than SelectStmt,
we run the query within the original Impala planner. If it is a SelectStmt,
we fail the query because we want all select statements to go through
the Calcite parser.
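The fallback decision can be summarized in a small sketch. The function
name, its string arguments, and the exception type are hypothetical;
the frontend code makes this decision on parsed statement objects.

```python
def choose_planner(calcite_parse_ok, impala_stmt_type):
    """Decide which planner handles a statement."""
    if calcite_parse_ok:
        return "calcite"
    # The Calcite parser failed, so consult the Impala parser's statement type.
    if impala_stmt_type == "SelectStmt":
        # All select statements must go through the Calcite parser.
        raise ValueError("select statement failed in the Calcite parser")
    return "original"


planner_for_describe = choose_planner(False, "DescribeTableStmt")
```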
Change-Id: Iea6afaa1f1698a300ad047c8820691cf7e8eb44b
Reviewed-on: http://gerrit.cloudera.org:8080/21923
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Various small issues fixed including:
- There is a special operator dedicated to scalar functions not handled in
Calcite. An equivalent special operator for aggregate functions was
created (ImpalaAggOperator).
- The grouping_id function needs to be handled in a special way, by calling
AggregateFunction.createRewrittenFunction.
- A custom Avg operator was created to handle avg(TIMESTAMP) which isn't allowed
in Calcite.
- A custom Min/Max operator was created to handle min(NULL) and min(char types).
- The corr, covar_pop, and covar_samp functions use the default Impala function
resolver rather than the Calcite resolver.
Change-Id: I038127d6a2f228ae8d263e983b1906e99ae05f77
Reviewed-on: http://gerrit.cloudera.org:8080/21961
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Among the rules that were added:
The "Minus" RelNode is not handled directly by the physical node
translator and is changed into other nodes that are handled. This
was added in the ImpalaMinusToDistinctRule.
The ExtractLiteralAgg rule compensates for the fact that a literal
value cannot be used directly in an agg.
The CalciteRelNodeConverter handles breaking down a SubQuery RelNode
into simpler RelNodes that can be optimized.
The pom.xml file was also changed. There is a Java bug in Java 8
that causes incremental compiles to fail, so we now do a full compile
for the Calcite planner.
Change-Id: I03a38aaa5c413b9b4d2f4c179de07935b672a031
Reviewed-on: http://gerrit.cloudera.org:8080/21941
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Added support for datetime interval operators. A dummy IntervalExpr
was added to support logical to physical representation. The
IntervalExpr is removed once the actual "+" operation for datetime
is created, but the translation framework is simplified by working
on the Calcite interval type in a separate temporary Expr.
Some logic was inserted into CoerceOperandShuttle. This object is
used to add casts when there is a limitation in Calcite for number types.
When we have an interval, we always want to represent it as a special
type, so we do not want to introduce a cast.
Also, in ImpalaOperatorTable, the ability to use Impala functions over
Calcite functions is needed because Calcite translates the year function
into "extract", which causes some issues. Just using the Impala signature
works for us here.
Change-Id: I2b4afc3ab1d17ba1f168904a6ded052e1d62b3fe
Reviewed-on: http://gerrit.cloudera.org:8080/21946
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change adds support for reading NDV statistics from Puffin files
when they are available for the current snapshot. Puffin files or blobs
that were written for other snapshots than the current one are ignored.
Because this behaviour is different from what we have for HMS stats and
may therefore be unintuitive for users, reading Puffin stats is disabled
by default; set the "--disable_reading_puffin_stats" startup flag to
false to enable it.
When Puffin stats reading is enabled, the NDV values read from Puffin
files take precedence over NDV values stored in the HMS. This is because
we only read Puffin stats for the current snapshot, so these values are
always up-to-date, while the values in the HMS may be stale.
Note that it is currently not possible to drop Puffin stats from Impala.
For this reason, this patch also introduces two ways of disabling the
reading of Puffin stats:
- globally, with the aforementioned "--disable_reading_puffin_stats"
startup flag: when it is set to true, Impala will never read Puffin
stats
- for specific tables, by setting the
"impala.iceberg_disable_reading_puffin_stats" table property to
true.
Note that this change is only about reading Puffin files; Impala does
not yet support writing them.
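The precedence rules above can be sketched as follows. The function
names and the dictionary-based inputs are hypothetical; only the flag
and table property names come from this change.

```python
TABLE_PROP = "impala.iceberg_disable_reading_puffin_stats"


def puffin_reading_enabled(disable_flag, table_props):
    """Puffin stats are read only if neither switch disables them."""
    if disable_flag:  # --disable_reading_puffin_stats=true
        return False
    return table_props.get(TABLE_PROP, "false").lower() != "true"


def effective_ndv(hms_ndv, puffin_ndv_by_snapshot, current_snapshot_id,
                  disable_flag, table_props):
    """Prefer the Puffin NDV for the current snapshot over the HMS value."""
    if puffin_reading_enabled(disable_flag, table_props):
        # Stats written for other snapshots are ignored.
        puffin_ndv = puffin_ndv_by_snapshot.get(current_snapshot_id)
        if puffin_ndv is not None:
            return puffin_ndv
    return hms_ndv


ndv = effective_ndv(100, {7: 250, 8: 300}, 8, False, {})
```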
Testing:
- created the PuffinDataGenerator tool which can generate Puffin files
and metadata.json files for different scenarios (e.g. all stats are
in the same Puffin file; stats for different columns are in different
Puffin files; some Puffin files are corrupt etc.). The generated
files are under the "testdata/ice_puffin/generated" directory.
- The new custom cluster test class
'test_iceberg_with_puffin.py::TestIcebergTableWithPuffinStats' uses
the generated data to test various scenarios.
- Added custom cluster tests that test the
'disable_reading_puffin_stats' startup flag.
Change-Id: I50c1228988960a686d08a9b2942e01e366678866
Reviewed-on: http://gerrit.cloudera.org:8080/21605
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The tpcds queries contain some functions that require some
modifications that the general function resolver cannot handle.
These include:
- Some functions don't have the same name within Calcite. An example
of this is "is_not_null" which is "is_not_null_pred" in Impala.
- The grouping function returns a tinyint in Impala which is different
from Calcite.
- The params for functions that adjust the scale (e.g. ROUND) need to
handle casting of parameters in the Impala way which is different
from Calcite.
Also handled in this commit is turning on the identifier expansion in
the Calcite validator. This is needed to fix some of the tpcds queries
as well.
Change-Id: Id451357f2fb92d35e09b100751f0f4a49760a51c
Reviewed-on: http://gerrit.cloudera.org:8080/21947
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Changed the parser to use quotes that are in line with how Impala
treats quotes, including allowing single quotes, double quotes,
and back ticks for aliases, and also allowing the backslash
to be used as an escape character. This is in line with what
BigQuery uses in Calcite.
A couple of unit tests were added, but these will be tested more
extensively by the ParserTest frontend unit test when that gets
committed.
Also, added VALUE as a nonreserved keyword, which is used in
the tpcds queries (along with the double quotes).
Change-Id: I67ebb19912714c240b99a42d9f2f02f78c189350
Reviewed-on: http://gerrit.cloudera.org:8080/21942
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
When there are 2 references to the same table in a query, there needs
to be a unique alias name used within the TableRef object. Code has
been added to generate an alias.
IMPALA-13460 has been filed because we should use the user provided
alias name rather than a generated alias name. This is a little more
difficult to implement because Calcite has a limitation in that their
table object at validation time is equivalent to a FeTable in that there
is only one object for the multiple tables.
In order to fix IMPALA-13460, there is a Calcite bug that has to be
fixed. We'd have to generate our own TableScan object underneath their
LogicalTableScan that would hold an alias. This TableScan can be
generated through their RelBuilder Factory object. But the current
code creates the LogicalTableScan directly rather than go through
a factory, so that would need to be fixed first.
There are no unit tests attached to this Jira, but there are some
tpcds queries that will start working when this gets committed.
Change-Id: Ib9997bc642c320c2e26294d7d02a05bccbba6a0d
Reviewed-on: http://gerrit.cloudera.org:8080/21945
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
The optimizer now runs a rule to put expressions in conjunctive
normal form.
This commit will allow tpcds and tpch tests to run through without
hanging, specifically queries 13 and 48.
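For reference, conversion to conjunctive normal form distributes OR
over AND until the expression is an AND of OR-clauses. The sketch below
shows the textbook transformation on nested tuples; it is not the
optimizer rule itself, and it assumes binary and/or nodes with string
atoms and no negation.

```python
def to_cnf(expr):
    """Convert a binary and/or expression tree to conjunctive normal form."""
    if isinstance(expr, str):
        return expr  # atom
    op, left, right = expr[0], to_cnf(expr[1]), to_cnf(expr[2])
    if op == "and":
        return ("and", left, right)
    # op is "or": distribute it over any AND directly below.
    if isinstance(left, tuple) and left[0] == "and":
        return ("and",
                to_cnf(("or", left[1], right)),
                to_cnf(("or", left[2], right)))
    if isinstance(right, tuple) and right[0] == "and":
        return ("and",
                to_cnf(("or", left, right[1])),
                to_cnf(("or", left, right[2])))
    return ("or", left, right)


# (a AND b) OR c  becomes  (a OR c) AND (b OR c)
cnf = to_cnf(("or", ("and", "a", "b"), "c"))
```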
Change-Id: Iceca22f3b2d2b59ab21591f21c07650bbd8efb3c
Reviewed-on: http://gerrit.cloudera.org:8080/21938
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
Setting the "withCreateValuesRel" config parameter to false causes a
"value" node to be created for every literal in an "in" clause. This
slows down the compilation time and runtime massively. By removing this
parameter (using the 'true' default), all literal values are placed
within one Values RelNode.
Changing this parameter exposed a bug in tpcds q8. The CoerceNodes
module explicitly creates a Project node above a Values node when
the values node contains a string literal. Unfortunately, a Calcite
limitation prevents the string literal from being of type "string";
instead, it is of type "char(x)".
This Project hack was created because of that limitation. When
converting Calcite RelNodes to Impala RelNodes, we "notify" the
Values RelNode that it should ignore the row datatypes of the current
Values RelNode and instead use the parent row datatypes.
Change-Id: Ifc3d84c70af9cd4db44359c4ab7f0c9eb70738f5
Reviewed-on: http://gerrit.cloudera.org:8080/21911
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
This updates the GBN to include the RANGER-4585 functionality
to support multi-column policy creation. IMPALA-12554 will use
this to create a single multi-column policy for a GRANT statement
rather than many single-column policies.
This fixes a few issues encountered during the upgrade:
1. This includes the fix for IMPALA-13433 to make test_sfs.py
resilient to HMS versions that do not properly create the
database directories.
2. This modifies test_metadata_query_statements.py to use
unique directories for the databases to avoid HMS bugs.
3. The version of Avro changed, which changed the version of
Jackson and the package name of the JsonParseException.
This adds code to tolerate both the old and new package
name in the error message.
4. This includes the fix for IMPALA-13391 to exclude log4j-slf4j-impl
from hadoop-cloud-storage.
5. This excludes an unnecessary org.cloudera.logredactor
dependency.
Testing:
- Ran a core job
Change-Id: I32727020a69a66c3af4f4096fe15bc81600e2215
Reviewed-on: http://gerrit.cloudera.org:8080/21921
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Fang-Yu Rao <fangyu.rao@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Pinning javax.el was done when Impala still used Sentry. Sentry was
removed in IMPALA-9708, and HBase now explicitly depends on a
specific version, so this pin is no longer relevant.
Change-Id: I5be3eeeacf2f6fb04bc5106902e1d11b3886d844
Reviewed-on: http://gerrit.cloudera.org:8080/21827
Tested-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
The Calcite planner was crashing when there was an outer join
and there was a conjunct that compared two columns within the
same table. This conjunct needs to be placed in the "other" conjuncts
rather than the "equi" conjuncts.
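A minimal sketch of that classification, using hypothetical names and
representing each side of the comparison by the table it references:

```python
def classify_conjunct(op, lhs_table, rhs_table, left_tables, right_tables):
    """Return "equi" only for equalities that span both join inputs."""
    if op != "=":
        return "other"
    spans_both_inputs = (
        (lhs_table in left_tables and rhs_table in right_tables)
        or (lhs_table in right_tables and rhs_table in left_tables)
    )
    # An equality between two columns of the same input is not a join key.
    return "equi" if spans_both_inputs else "other"


same_table = classify_conjunct("=", "t1", "t1", {"t1"}, {"t2"})
```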
Change-Id: I4ae2d257fa58f3a58079b6aa551c32ffda7d28cf
Reviewed-on: http://gerrit.cloudera.org:8080/21908
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
This commit contains fixes on top of the analytic expressions work that
fix some of the tests in analytic-fns.test.
The fixes include:
- The AnalyzedAnalyticExpr object now calls "standardize" on AnalyticExpr
which mutates AnalyticExpr into its final compiled form.
- Added handling for sum_init_zero which is produced by Calcite. Note: this
is only supported in Impala for BIGINT. An implementation is needed for
Decimal and double (IMPALA-13435).
- CastExpr needs to be analyzed. There is a quirk in the current Impala
implementation that the parameters for CastExpr are not re-analyzed.
So an explicit analyze is done when a CastExpr is encountered.
- AnalyticExprs allow "count" with zero parameters.
- Certain analytic expressions use default window functions. The Calcite window
operations will be ignored for these functions.
Change-Id: I56529b13c545cdc9f96dd1c3bea9ef676e8c2755
Reviewed-on: http://gerrit.cloudera.org:8080/21897
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
An analytic expression is represented with a RexOver type RexNode
within Calcite. RexOvers exist within the Project RelNode. If
there are any RexOvers within the Project, then the
ImpalaAnalyticRel RelNode gets created instead of the ImpalaProjectRel.
Only bare bones test cases are included. There are quite a number
of analytic expressions that will not work. The logic is included in
the AnalyticExpr.standardize() method. Another commit will be needed
to support all general analytic expressions and the tests within
Impala will be used for testing purposes.
Change-Id: Iba5060546a7568ba0cd315f546daa78d89b1c3c5
Reviewed-on: http://gerrit.cloudera.org:8080/21565
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
Ideally, at validation time, we would like Calcite to coerce function
parameters and generate the proper return types that match Impala
functions. Currently, Calcite comes close to hitting these needs, and
any function that passes Calcite validation will be a valid function.
However, while some of the types are close, we want the output of the
Calcite optimizer to contain methods where all parameter types and
return types exactly match what is expected by Impala at runtime.
The bulk of this commit is to address this issue. A couple of other
parts of this commit also rely on proper types; these are described
further down in the commit message.
The 'fixing' of the parameters and return types is done after
optimization and before the Calcite RelNodes are translated into Impala
RelNodes. The code responsible for this is under the coercenodes directory.
The entry point for fixing these nodes is the method CoerceNodes.coerceNodes()
which takes a RelNode input and produces a RelNode output. This method
traverses the RelNodes bottom-up because a RelNode must be created
with its inputs, so it makes sense to fix the inputs first.
One problem within Calcite is how it generates RexLiterals. For the literal
number 2, Calcite will generate an INTEGER type. However, the Impala output
for this is a TINYINT. Another Calcite issue is for a string literal
such as 'hello'. Calcite will generate a CHAR(5) type whereas Impala
generates a string.
These inconsistencies cause Impala to crash in certain situations. For instance,
the query "select 'hello' union select 'goodbye'" generates a union with
2 input nodes. If we were to use the Calcite definitions, one would have
type of char(5) and the other would have type of char(7), which would cause
a crash.
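The literal typing mismatch can be illustrated with a small sketch. The
function names are hypothetical, and the "smallest type that fits" rule
is an assumption generalized from the TINYINT example above.

```python
def smallest_int_type(value):
    """Pick the narrowest signed integer type that can hold the value."""
    for name, bits in (("TINYINT", 8), ("SMALLINT", 16),
                       ("INT", 32), ("BIGINT", 64)):
        if -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
            return name
    raise OverflowError("value does not fit in BIGINT")


def calcite_string_literal_type(text):
    """Type Calcite assigns to a string literal, per the description above."""
    return "CHAR(%d)" % len(text)


# select 'hello' union select 'goodbye': the two inputs disagree on type.
union_input_types = [calcite_string_literal_type(s) for s in ("hello", "goodbye")]
```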
Also, eventually when CTAS statements are implemented, we need the types to
match Impala's native type.
Most of the Calcite RelNode types need some sort of correction due to these
issues. Join, Filter and Project nodes may have expressions that need fixing.
The CoerceOperandShuttle class helps navigate through the RexNode function
structure for these. The Aggregate node may require an underlying Project
node; the explanation is detailed in the processAggNode method. The
Union node will also generate underlying Project nodes for casting. The
Values node creates a Project above itself since Calcite will not allow
generation of a RexLiteral of string type directly, so it needs to be cast.
In addition to these changes, other changes that depended on the casting
fixes have been added. The or/and/case operators are now supported. 'Or' and
'and' worked before, but not if there were more than 2 or/and conditions
grouped together. Case requires all return types to match and required some
special logic. These functions required a little extra support since
they have variable arguments.
Change-Id: I13a349673f185463276ad7ddb83f0cfc2d73218c
Reviewed-on: http://gerrit.cloudera.org:8080/21335
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
This commit adds the ability to handle joins in the Calcite planner.
Some items worth noting:
There is extra handling in the ImpalaJoinRel class to deal with outer
joins. The AnalyzedTupleIsNullExpr object, which derives from
TupleIsNullExpr, is needed for processing. Normally, expressions are created
in the CreateExprVisitor, but the join requires that the TupleIsNullExpr
object is wrapped around the expressions retrieved from the inputs.
The execution engine requires separation of the equijoin conditions and
the non-equijoin conditions. Furthermore, the equijoin conditions are
BinaryCompPredicates instead of normal FunctionCallExprs, so the
AnalyzedBinaryCompExpr class had to be created.
Special processing needed to be coded for runtime filter generators.
The conditions needed to be added to the value transfer graph in order
to enable the Impala planner logic to create and push these generators.
The join also required some rules to be added to the optimizer. If a join
is done through the "ON" clause, Calcite is able to place the join
condition directly in the Join RelNode. However, if it is in the "WHERE"
clause, Calcite creates a Filter RelNode and creates a cross join RelNode
object. Therefore, in order to handle "WHERE" joins, we need to implement
the rules in the optimizer.
Change-Id: I5db097577907d79877f52feff2922000af074ecd
Reviewed-on: http://gerrit.cloudera.org:8080/21239
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Basic aggregation functionality is now added to the Calcite planner.
The implementation of aggregation was a little tricky on the
conversion from the Aggregate RelNode to the Impala Agg PlanNode. The
compilation in Impala requires some AggregateInfo structures which
may set up multiple internal PlanNodes. Some parts of the Analyzer
are used by AggregateInfo.
This usage of Analyzer puts two design goals in conflict with each
other, which are:
1) Remove dependency on the Analyzer since Calcite does all the parsing
and validation
2) Avoid refactoring in the first major iteration of the Calcite planner.
To resolve this, a SimplifiedAnalyzer class has been created which is
injected into the AggregateInfo. Some methods of the Analyzer class are
overridden to avoid the non-Calcite planner analysis.
The SimplifiedAnalyzer overrides two aspects of the Analyzer:
1) "Having" filter conjuncts are going to be "unassigned conjuncts".
After Calcite validates and optimizes the plan, the only filter
conjuncts above the aggregation will be the "having" clause, so all
these conjuncts will be used in the aggregate (sidenote: optimization
rules have not been pushed yet to move filters underneath the aggregate,
but that will come in a push in the near future). Once the aggregate
has been changed to a PlanNode, we can clear out the unassigned conjuncts.
2) Because the Aggregate PlanNodes can have multiple layers, it may
be responsible for creating some TupleDescriptors and SlotDescriptors
for these PlanNodes. The SlotDescriptors need to be "materialized".
In the non-Calcite planner, this is done through its planning process.
In the Calcite planner, the materialization can happen immediately when
the PlanNode is created. So the "addSlotDescriptor" is overridden to
call the parent, but then to immediately materialize the SlotDescriptor.
The rest of the ImpalaAggRel is hopefully self-explanatory. The groups,
aggregates, and grouping sets are extracted from the RelNodes and used
in the PlanNodes. The logic to set up multiple PlanNodes and the creation
of MultiAggregateInfo and AggregateInfo objects are similar to what is
used in the non-Calcite planner.
Change-Id: Iacf0de8ba11f0d31d73d624f0c9a91db9997cfd5
Reviewed-on: http://gerrit.cloudera.org:8080/21238
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Sometimes there is a jackson-databind patch release without
a corresponding release of other jackson libraries. For example,
there is a jackson-databind 2.12.7.1, but jackson-core does not
have an artifact with that version. To handle these scenarios,
it is useful to have a separate version for jackson-databind
vs other jackson libraries.
This introduces IMPALA_JACKSON_VERSION (which currently matches
IMPALA_JACKSON_DATABIND_VERSION) and uses this for non-databind
jackson libraries.
Testing:
- Ran a local build
Change-Id: I3055cb47986581793d947eaedb6a24b4dd92e3a6
Reviewed-on: http://gerrit.cloudera.org:8080/21719
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
The Sort RelNode is now supported. This includes the limit and offset
features as well.
A minor bit of code was copied from the original planner, but it just
involved decision making on which Sort PlanNode to call, so this probably
doesn't need to be refactored.
Change-Id: I747e107ed996862ef348f829deee47f0c0fc78d5
Reviewed-on: http://gerrit.cloudera.org:8080/21237
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Upgrades io.airlift.aircompressor to 0.27 to address CVE-2024-36114.
Aircompressor is a dependency of Orc; however, we tend to upgrade Orc
more deliberately and synchronize C++ and Java upgrades. Aircompressor
upgrades in Orc did not require any code changes, so manage this
dependency directly to address the CVE.
Change-Id: I6c56daa61d5ecbcb3a5f7fbd0665043bb49b469f
Reviewed-on: http://gerrit.cloudera.org:8080/21677
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit handles the Union and Values RelNode operators.
The Union RelNode is created within Calcite when there is a "union"
clause.
The Values RelNode is created when the lowest level does not come from
a table, but instead comes from constant values. For example, the query
"select 3" would create a Values RelNode with one literal value of 3.
The PlanNode creation simulates what already exists within the Impala planner.
There is no corresponding Values PlanNode. Instead, a Union is created with
the values expressions serving as input expressions (hence the reason for
combining these 2 RelNodes in the same commit).
Other plan nodes used are the "SelectNode" and the "EmptySetNode". The
EmptySetNode is used where there are no rows coming from the value node.
While this cannot be simulated at this point, this will be needed when
we start introducing optimization rules, and will be tested when we turn
on the Impala test framework queries.
The SelectNode is used for functions that are applied on top of the UnionNode.
There is a major issue with this iteration of Union and Value nodes due
to a Calcite issue. Calcite currently treats all string literals as "CHAR"
type. This causes problems in the union operator if one tries to implement
the following query: "select 'a' union select 'ab'", since the 2 types in
the value clauses are CHAR(1) and CHAR(2) which do not match. This would
cause an exception on the server. A future commit will fix this issue.
Also of concern is that Calcite treats all non-bigint integer constants as INT.
That is, 3, 257, 65539 are all considered of type INT. This will also be
fixed in a later commit.
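The integer literal concern can be illustrated with a minimal Python sketch.
It assumes the desired behavior is to pick the narrowest signed integer type
that can hold the literal, instead of always using INT. The helper name and
the type table are hypothetical and are not Impala or Calcite code.

```python
# Hypothetical illustration: (type name, width in bits), narrowest first.
INT_TYPES = [
    ("TINYINT", 8),
    ("SMALLINT", 16),
    ("INT", 32),
    ("BIGINT", 64),
]


def narrowest_int_type(value):
    """Return the name of the narrowest signed type that can hold value."""
    for name, bits in INT_TYPES:
        low = -(1 << (bits - 1))
        high = (1 << (bits - 1)) - 1
        if low <= value <= high:
            return name
    raise ValueError("literal does not fit in BIGINT: %d" % value)


# The three literals from the text map to three different types here,
# whereas Calcite is described as typing all of them as INT.
for literal in (3, 257, 65539):
    print(literal, narrowest_int_type(literal))
```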
Change-Id: Ibd989dbb5cf0df0fcc88f72dd579ce4fd713f547
Reviewed-on: http://gerrit.cloudera.org:8080/21211
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The Filter RelNode is now handled in the Calcite planner.
The parsing and analysis is done by Calcite so there were no
changes added to that portion. The ImpalaFilterRel class was
created to handle the conversion of the Calcite LogicalFilter
into a filter condition within the Impala plan nodes.
There is no explicit filter plan node in Impala. Instead, the
filter condition attaches itself to an existing plan node. The
filter condition gets passed into the children plan nodes through
the ParentPlanRelContext.
The ExprConjunctsConverter class is responsible for creating the
filter Expr list that is used. The list contains separate AND
conditions that are on the top level.
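The idea of splitting a filter into its top-level AND conditions can be
sketched in Python as follows. This is only an illustration of the technique;
the And, Or and Pred classes are hypothetical stand-ins and do not reflect the
actual ExprConjunctsConverter code.

```python
from dataclasses import dataclass


@dataclass
class Pred:
    text: str        # a leaf predicate such as "a > 1"


@dataclass
class And:
    operands: list   # child expressions joined by AND


@dataclass
class Or:
    operands: list   # child expressions joined by OR


def top_level_conjuncts(expr):
    """Flatten nested ANDs into a list; anything else is one conjunct."""
    if isinstance(expr, And):
        conjuncts = []
        for operand in expr.operands:
            conjuncts.extend(top_level_conjuncts(operand))
        return conjuncts
    # An OR (or a leaf) is not split, even if it contains ANDs inside.
    return [expr]


# a > 1 AND (b < 2 AND (c = 3 OR d = 4)) yields three conjuncts.
condition = And([
    Pred("a > 1"),
    And([Pred("b < 2"), Or([Pred("c = 3"), Pred("d = 4")])]),
])
conjuncts = top_level_conjuncts(condition)
```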
Change-Id: If104bf1cd801d5ee92dd7e43d398a21a18be5d97
Reviewed-on: http://gerrit.cloudera.org:8080/21498
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
This commit handles the first pass on getting functions to work
through the Calcite planner. Only basic functions will work with
this commit. Implicit conversions for parameters are not yet supported.
Custom UDFs are also not supported yet.
The ImpalaOperatorTable is used at validation time to check for
existence of the function name for Impala. At first, it will check
Calcite operators for the existence of the function name (A TODO,
IMPALA-13096, is that we need to remove non-supported names from the
parser file). It is preferable to use the Calcite Operator since
Calcite does some optimizations based on the Calcite Operator class.
If the name is not found within the Calcite Operators, a check is done
within the BuiltinsDb (TODO: IMPALA-13095 handle UDFs) for the function.
If found, a SqlOperator class is generated on the fly to handle this
function.
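The lookup order described above can be sketched in Python. The two
dictionaries and the function name are hypothetical stand-ins for the Calcite
operator set and the BuiltinsDb; the sketch only shows the order of the checks.

```python
# Hypothetical stand-ins; the real tables are far larger.
CALCITE_OPERATORS = {"+": "calcite:+", "upper": "calcite:UPPER"}
IMPALA_BUILTINS = {"upper", "murmur_hash"}


def lookup_operator(name):
    """Prefer a Calcite operator; otherwise generate one for a builtin."""
    key = name.lower()
    if key in CALCITE_OPERATORS:
        # Calcite operators are preferred because Calcite applies
        # optimizations based on its own operator classes.
        return CALCITE_OPERATORS[key]
    if key in IMPALA_BUILTINS:
        # Stand-in for creating an operator object on the fly.
        return "generated:" + key
    return None  # unknown function: validation would fail


print(lookup_operator("UPPER"))
print(lookup_operator("murmur_hash"))
print(lookup_operator("no_such_function"))
```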
The validation process for Calcite includes a call into the operator
method "inferReturnType". This method will validate that there exists
a function that will handle the operands, and if so, return the "return
type" of the function. In this commit, we will assume that the Calcite
operators will match Impala functionality. In later commits, there
will be overrides where we will use Impala validation for operators
where Calcite's validation isn't good enough.
After validation is complete, the functions will be in a Calcite format.
After the rest of compilation (relnode conversion, optimization) is
complete, the function needs to be converted back into Impala form (the
Expr object) to eventually get it into its thrift request.
In this commit, all functions are converted into Expr starting in the
ImpalaProjectRel, since this is the RelNode where functions do their
thing. The RexCallConverter and RexLiteralConverter get called via the
CreateExprVisitor for this conversion.
Since Calcite is providing the analysis portion of the planning, there
is no need to go through Impala's Analyzer object. However, the Impala
planner requires Expr objects to be analyzed. To get around this, the
AnalyzedFunctionCallExpr and AnalyzedNullLiteral objects exist which
analyze the expression in the constructor. While this could potentially
be combined with the existing FunctionCallExpr and NullLiteral objects,
this fits in with the general plan to avoid changing "fe" Impala code
as much as we can until much later in the commit cycle. Also, there
will be other Analyzed*Expr classes created in the future, but this
commit is intended for basic function call expressions only.
One minor change to the parser is added with this commit. The Calcite
parser does not recognize the "string" datatype, so it has been
added here in Parser.jj and config.fmpp.
Change-Id: I2dd4e402d69ee10547abeeafe893164ffd789b88
Reviewed-on: http://gerrit.cloudera.org:8080/21357
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
from the metastore
Bump the GBN to 49623641 to leverage HIVE-27499, so that Impala can
directly fetch the latest events specific to the db/table from the
metastore, instead of fetching the events from metastore and then
filtering in the cache matching the DbName/TableName.
Implementation Details:
Currently when a DDL/DML is performed in Impala, we fetch all the
events from metastore based on current eventId and then filter them in
Impala, which can be a bottleneck if the event count is huge. This can
be optimized by including the db name and/or table name in the notification
event request object and then filtering by event type in Impala. This can
provide a performance boost on tables that generate a lot of events.
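The difference between the two approaches can be sketched in Python. The
FakeMetastore class and its method are hypothetical and do not mirror the
Hive Metastore API; the sketch only shows that pushing the db/table filter
into the request reduces the number of events returned to the caller.

```python
class FakeMetastore:
    """Hypothetical in-memory event source, for illustration only."""

    def __init__(self, events):
        self._events = events  # dicts with "id", "db", "table", "type"

    def get_next_events(self, last_event_id, db=None, table=None):
        """Return events after last_event_id, optionally filtered."""
        result = []
        for event in self._events:
            if event["id"] <= last_event_id:
                continue
            if db is not None and event["db"] != db:
                continue
            if table is not None and event["table"] != table:
                continue
            result.append(event)
        return result


metastore = FakeMetastore([
    {"id": 1, "db": "d1", "table": "t1", "type": "ALTER_TABLE"},
    {"id": 2, "db": "d1", "table": "t2", "type": "INSERT"},
    {"id": 3, "db": "d2", "table": "t1", "type": "INSERT"},
    {"id": 4, "db": "d1", "table": "t1", "type": "INSERT"},
])

# Old approach: fetch everything, then filter on the caller's side.
fetched_all = metastore.get_next_events(0)
filtered_locally = [e for e in fetched_all
                    if e["db"] == "d1" and e["table"] == "t1"]

# New approach: the request carries the db/table, so fewer events return.
fetched_filtered = metastore.get_next_events(0, db="d1", table="t1")

# The event type is still filtered on the caller's side.
inserts = [e for e in fetched_filtered if e["type"] == "INSERT"]
```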
Note:
Also included ShowUtils class in hive-minimal-exec jar as it is
required in the current build version
Testing:
1) Did some tests in local cluster
2) Added a test case in MetaStoreEventsProcessorTest
Change-Id: I6aecd5108b31c24e6e2c6f9fba6d4d44a3b00729
Reviewed-on: http://gerrit.cloudera.org:8080/20979
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Adding the framework to create our own parsing syntax for Impala using
the base Calcite Parser.jj file.
The Parser.jj file here was grabbed from Calcite 1.36. So with this commit,
we are using the same parsing analysis as Calcite 1.36. Any changes made
on top of the Parser.jj file or the config.fmpp file in the future are Impala
specific changes, so a diff can be done from this commit to see all the Impala
parsing changes.
The config.fmpp file was grabbed from Calcite 1.36 default_config.fmpp. The
Calcite intention of the config.fmpp file is to allow markup of variables in
the Parser.jj file. So it is always preferable to modify the
default_config.fmpp file when possible. Our version is grabbed from
https://github.com/apache/calcite/blob/main/core/src/main/codegen/config.fmpp
and slightly modified with the class name to make it compile for Impala.
There's no unit test needed since there is no functional change. The Calcite
planner will eventually make changes in the ".jj" file to support the differences
between the Impala parser and the Calcite parser.
Change-Id: If756b5ea8beb85661a30fb5d029e74ebb6719767
Reviewed-on: http://gerrit.cloudera.org:8080/21194
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
This patch upgrades bouncycastle to 1.78. As of bouncycastle:1.71, the
*-jdk15on artifact is no longer available, the artifact is changed to
*-jdk18on.
Tests:
- core tests ran
Change-Id: I8372916ab79b863e7a07d22e8333abd54492fa29
Reviewed-on: http://gerrit.cloudera.org:8080/21371
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently, there is no dependency management for the log4j2
version. Impala itself doesn't use log4j2. However, recently
we encountered a case where one dependency brought in
log4j-core 2.18.0 and another brought in log4j-api 2.17.1.
log4j-core 2.18.0 relies on the existence of the ServiceLoaderUtil
class from log4j-api 2.18.0. log4j-api 2.17.1 doesn't have this
class, which causes class not found exceptions.
This uses dependency management to set the log4j2 version to 2.18.0
for log4j-core and log4j-api to avoid any mismatch.
Testing:
- Ran a local build and verified that both log4j-core and log4j-api
are using 2.18.0.
Change-Id: Ib4f8485adadb90f66f354a5dedca29992c6d4e6f
Reviewed-on: http://gerrit.cloudera.org:8080/21379
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This is the first commit to use the Calcite library to parse,
analyze, and optimize queries.
The hook for the planner is through an override of the JniFrontend. The
CalciteJniFrontend class is the driver that walks through each of the
Calcite steps which are as follows:
CalciteQueryParser: Takes the string query and outputs an AST in the
form of Calcite's SqlNode object.
CalciteMetadataHandler: Iterates through the SqlNode from the previous step
and makes sure all essential table metadata is retrieved from catalogd.
CalciteValidator: Validate the SqlNode tree, akin to the Impala Analyzer.
CalciteRelNodeConverter: Change the AST into a logical plan. In this first
commit, the only logical nodes used are LogicalTableScan and LogicalProject.
The LogicalTableScan will serve as the node that reads from an Hdfs Table and
the LogicalProject will only project out the used columns in the query. In
later versions, the LogicalProject will also handle function changes.
CalciteOptimizer: This step is to optimize the query. In this cut, it will be
a nop, but in later versions, it will perform logical optimizations via
Calcite's rule mechanism.
CalcitePhysPlanCreator: Converts the Calcite RelNode logical tree into
Impala's PlanNode physical tree
ExecRequestCreator: Implement the existing Impala steps that turn a Single
Node Plan into a Distributed Plan. It will also create the TExecRequest object
needed by the runtime server.
Only some very basic queries will work with this commit. These include:
select * from tbl <-- only needs the LogicalTableScan
select c1 from tbl <-- Also uses the LogicalProject
In the CalciteJniFrontend, there are some basic checks to make sure only
select statements will get processed. Any non-query statement will revert
back to the current Impala planner.
In this iteration, any queries besides the minimal ones listed above will
result in a caught exception which will then be run through the current
Impala planner. The tests that do work can be found in calcite.test and
run through the custom cluster test test_experimental_planner.py
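The fallback behavior described above can be sketched in Python. All names
here are hypothetical, and the regular expression is only a toy stand-in for
"queries the Calcite path supports in this commit"; it is not how the
CalciteJniFrontend decides.

```python
import re

# Toy stand-in for the minimal supported queries:
# "select * from tbl" and "select c1 from tbl".
SIMPLE_SELECT = re.compile(r"^select\s+(\*|\w+)\s+from\s+\w+$", re.IGNORECASE)


class UnsupportedByCalcitePath(Exception):
    """Raised by the toy Calcite path for anything it cannot handle."""


def calcite_plan(sql):
    if not SIMPLE_SELECT.match(sql.strip()):
        raise UnsupportedByCalcitePath(sql)
    return "calcite plan for: " + sql.strip()


def original_plan(sql):
    return "original plan for: " + sql.strip()


def plan(sql):
    """Try the Calcite path for selects; otherwise use the original planner."""
    if not sql.lstrip().lower().startswith("select"):
        # Non-query statements never enter the Calcite path.
        return ("original", original_plan(sql))
    try:
        return ("calcite", calcite_plan(sql))
    except UnsupportedByCalcitePath:
        # Any failure in the Calcite path falls back to the original planner.
        return ("original", original_plan(sql))


for query in ("select * from tbl",
              "select c1 from tbl where c1 > 1",
              "create table t (c int)"):
    print(plan(query)[0], "<-", query)
```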
This iteration should support all types with the exception of complex
types. Calcite does not have a STRING type, so the string type is
represented as VARCHAR(MAXINT) similar to how Hive represents their
STRING type.
The ImpalaTypeConverter file is used to convert the Impala Type object
to corresponding Calcite objects.
Authorization is not yet working with this current commit. A Jira has been
filed (IMPALA-13011) to deal with this.
Change-Id: I453fd75b7b705f4d7de1ed73c3e24cafad0b8c98
Reviewed-on: http://gerrit.cloudera.org:8080/21109
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
This patch moves the source files of jdbc package to fe.
The data source location is now optional: a data source can be created
without specifying an HDFS location. In that case, Impala assumes the data
source class is on the classpath and that an instance of it can be created
with the current class loader. Impala still tries to load the jar file of
the data source at runtime if one is set in the data source location.
Testing:
- Passed core test
- Passed dockerised-tests
Change-Id: I0daff8db6231f161ec27b45b51d78e21733d9b1f
Reviewed-on: http://gerrit.cloudera.org:8080/20971
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
for external data source table
This patch adds support for predicates on the DATE data type
for external data sources.
Testing:
- Added tests for date predicates with operators:
'=', '>', '<', '>=', '<=', '!=', 'BETWEEN'.
Change-Id: Ibf13cbefaad812a0f78755c5791d82b24a3395e4
Reviewed-on: http://gerrit.cloudera.org:8080/20915
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Also sets dependencyManagement to force using the same version
for jackson-databind, jackson-core and jackson-annotations. This is
needed because datagenerator depends on kitesdk, which would pull in a
very old jackson-core version (2.3.1) and lead to build failures
with the newer jackson.databind.
Change-Id: I8440426da1395045cf149aca0044286015861e5f
Reviewed-on: http://gerrit.cloudera.org:8080/20914
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch uses the JDBC connection string to apply query options to the
Impala server by setting the properties in "jdbc.properties" when
creating a JDBC external DataSource table.
jdbc.properties is specified as a comma-delimited key=value string, like
"MEM_LIMIT=1000000000, ENABLED_RUNTIME_FILTER_TYPES=\"BLOOM,MIN_MAX\"".
Fixed Impala to allow the value of ENABLED_RUNTIME_FILTER_TYPES to have
double quotes at the beginning and end of the string.
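A comma-delimited key=value string whose values may contain quoted commas can
be split as in the following Python sketch. The function name is hypothetical
and this is not the parsing code used by the patch; it assumes that commas
inside double quotes belong to the value and that the quotes are kept in the
value passed along.

```python
def parse_properties(text):
    """Split 'k1=v1, k2="a,b"' into a dict, honoring double quotes."""
    parts = []
    current = []
    in_quotes = False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
            current.append(ch)      # keep the quote as part of the value
        elif ch == "," and not in_quotes:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))

    props = {}
    for part in parts:
        part = part.strip()
        if not part:
            continue
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError("expected key=value, got %r" % part)
        props[key.strip()] = value.strip()
    return props


props = parse_properties(
    'MEM_LIMIT=1000000000, ENABLED_RUNTIME_FILTER_TYPES="BLOOM,MIN_MAX"')
```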
jdbc.properties can be used for other databases like Postgres and MySQL
to set additional properties. The test cases will be added in separate
patch.
Testing:
- Added end-to-end tests for setting query options on Impala JDBC
tables.
- Passed core tests.
Change-Id: I47687b7a93e90cea8ebd5f3fc280c9135bd97992
Reviewed-on: http://gerrit.cloudera.org:8080/20837
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
external data source
In the current implementation of external JDBC data source,
the user has to provide both the username and password in
plain text which is not a good practice.
This patch extends the functionality of existing implementation
to either provide:
a) username and password
b) username or key and keystore
If the user provides the password, then that password is used.
However, if no password is provided and the user provides only the
key/keystore, then it fetches the password from the secure jceks
keystore.
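The resolution order described above can be sketched in Python. The property
names ("password", "key", "keystore") and the keystore_lookup callback are
hypothetical placeholders; the real patch reads the secret from a jceks
keystore, which is not modeled here.

```python
def resolve_password(config, keystore_lookup):
    """Return the explicit password if set, else fetch it from the keystore."""
    password = config.get("password")
    if password:
        # An explicitly provided password always wins.
        return password
    key = config.get("key")
    keystore = config.get("keystore")
    if key and keystore:
        return keystore_lookup(keystore, key)
    raise ValueError("either a password or a key/keystore pair is required")


# Stand-in for a keystore: maps (keystore path, key alias) to a secret.
fake_store = {("jceks://example/test.jceks", "db.password"): "s3cret"}


def fake_lookup(keystore, key):
    return fake_store[(keystore, key)]


explicit = resolve_password({"password": "plain"}, fake_lookup)
from_store = resolve_password(
    {"key": "db.password", "keystore": "jceks://example/test.jceks"},
    fake_lookup)
```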
Testing:
- Added unit test TestExtDataSourcesWithKeyStore
Change-Id: Iec83a9b6e00456f0a1bbee747bd752b2cf9bf238
Reviewed-on: http://gerrit.cloudera.org:8080/20809
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support to read Impala tables in the Impala cluster
through JDBC external data source. It also adds a new counter
NumExternalDataSourceGetNext in profile for the total number of calls
to ExternalDataSource::GetNext().
Setting query options for Impala will be supported in a following patch.
Testing:
- Added an end-to-end unit test to read Impala tables from Impala
cluster through JDBC external data source.
Manually ran the unit-test with Impala tables in Impala cluster on a
remote host by setting $INTERNAL_LISTEN_HOST in jdbc.url as the ip
address of the remote host on which an Impala cluster is running.
- Added LDAP test for reading table through JDBC external data source
with LDAP authentication.
Manually ran the unit-test with Impala tables in a remote Impala
cluster.
- Passed core tests.
Change-Id: I79ad3273932b658cb85c9c17cc834fa1b5fbd64f
Reviewed-on: http://gerrit.cloudera.org:8080/20731
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Wenzhe Zhou <wzhou@cloudera.com>