The help output for buildall.sh notes running `buildall.sh -testdata` as
an option to incrementally load test data without formatting the
mini-cluster. However, trying to do that with data already loaded
results in an error from `hadoop fs -mkdir /test-warehouse`. Add
`-p` so this step is idempotent, allowing the example to work as
documented.
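As a sketch, the idempotent form of the failing step:
  # -p makes the mkdir a no-op if /test-warehouse already exists
  hadoop fs -mkdir -p /test-warehouse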
Change-Id: Icc4ec4bb746abf53f6787fce4db493919806aaa9
Reviewed-on: http://gerrit.cloudera.org:8080/18522
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-10723 added support for treating materialized views as tables.
In certain test configurations, the rebuild of the materialized views
(which is done via Hive) was not populating the data in the MV. In
this patch, I have changed the source tables of materialized views
to be full-acid instead of insert-only transactional tables. This
enables the tests to succeed. Insert-only source tables are also
supposed to work for the MV rebuild; the failure there is a Hive
issue that will be investigated separately.
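Roughly, the change in the source-table DDL looks like the following
(table name and connection string are placeholders):
  # previously: insert-only transactional source table
  beeline -u "$JDBC_URL" -e "CREATE TABLE mv_src (i INT) STORED AS ORC
    TBLPROPERTIES ('transactional'='true',
                   'transactional_properties'='insert_only');"
  # now: full-acid transactional source table
  beeline -u "$JDBC_URL" -e "CREATE TABLE mv_src (i INT) STORED AS ORC
    TBLPROPERTIES ('transactional'='true');"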
Change-Id: I349faa0ad36ec8ca6f574f7f92d9a32fb7d0d344
Reviewed-on: http://gerrit.cloudera.org:8080/18421
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Aman Sinha <amsinha@cloudera.com>
The existing behavior is that materialized views are treated
as views and therefore expanded like a view when one
queries the MV directly (SELECT * FROM materialized_view).
This is incorrect since an MV is a regular table with physical
properties such as partitioning, clustering etc. and should be
treated as such even though it has a view definition associated
with it.
This patch focuses on the use case where MVs are created as
HDFS tables and makes MVs a derived class of HdfsTable,
therefore making them Table objects. It adds support for
collecting and displaying statistics on materialized views
and these statistics could be leveraged by an external frontend
that supports MV based query rewrites (note that such a rewrite
is not supported by Impala with or without this patch). Note
that we are not introducing new syntax for MVs since DDL and DML
operations on MVs are only supported through Hive.
Directly querying an MV is permitted, but inserting into MVs is
not, since MVs are supposed to be modified only through an
external refresh when the source tables change.
If the source tables associated with a materialized view have
column masking or row-filtering Ranger policies, querying the
MV will throw an error. This behavior is consistent with that
of Hive.
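As a sketch, the newly supported statements look like their
plain-table counterparts (MV name hypothetical):
  impala-shell -q "COMPUTE STATS test_db.test_mv"
  impala-shell -q "SHOW TABLE STATS test_db.test_mv"
  impala-shell -q "DROP STATS test_db.test_mv"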
Testing:
- Added transactional tables for alltypes, jointbl and used
them as source tables to create materialized views.
- Added tests for compute stats, drop stats, show stats and
simple select query on a materialized view.
- Added test for select on a materialized view when the
source table has a column mask.
- Modified analyzer tests related to alter, insert, drop of
materialized view.
Change-Id: If3108996124c6544a97fb0c34b6aff5e324a6cff
Reviewed-on: http://gerrit.cloudera.org:8080/17595
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
This patch changes the line that added all the jar files in the folder
${RANGER_HOME}/ews/webapp/WEB-INF/lib to HADOOP_CLASSPATH so that it
only includes the jar files whose names start with "ranger-". Almost
none of the other jar files appear to be necessary to run the E2E test
test_hive_with_ranger_setup.
This also avoids adding too many paths to HADOOP_CLASSPATH, which could
otherwise prevent Hadoop from reporting its version to the script that
starts HMS, failing with "Argument list too long".
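A minimal sketch of the narrowed classpath logic (the exact variable
wiring in the script may differ):
  # append only the jars whose names start with "ranger-"
  for jar in "${RANGER_HOME}"/ews/webapp/WEB-INF/lib/ranger-*.jar; do
    HADOOP_CLASSPATH="${HADOOP_CLASSPATH}:${jar}"
  done
  export HADOOP_CLASSPATH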
Testing:
- Verified after this patch, test_hive_with_ranger_setup still
succeeds.
- Verified in a local development environment that the length of
Hadoop's environment variable 'CLASSPATH' logged in
hive-metastore.out decreases from 100,876 characters to 62,634
characters when executing run-hive-server.sh with the flag
'-with_ranger' if $HADOOP_SHELL_SCRIPT_DEBUG is "true" and
$IMPALA_HOME is "/home/fangyurao/Impala_for_FE".
Change-Id: Ifd66fd99a346835b9f81f95b5f046273fcce2590
Reviewed-on: http://gerrit.cloudera.org:8080/18398
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
We use 'mapred.output.compression.codec' to set the compression codec in
generating test files by Hive. However, it doesn't affect ORC files.
Instead, we need to set 'orc.compress' in tblproperties for each ORC
table. The default value of 'orc.compress' is ZLIB, which corresponds
to our 'def' codec, so we only need to set it for non-def codecs.
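A hedged sketch of the per-table property (table name hypothetical;
'orc.compress' also accepts NONE, ZLIB, and LZ4):
  beeline -u "$JDBC_URL" -e "CREATE TABLE t_orc_snap (i INT)
    STORED AS ORC TBLPROPERTIES ('orc.compress'='SNAPPY');"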
This patch also fixes a bug in build_compression_codec_statement() that
would raise KeyError when loading lz4 non-avro tables.
Tests
- Loaded tpch data in orc/none/none, orc/def/block, orc/snap/block,
orc/lz4/block and verified their compression codecs.
Change-Id: I02bd5d9400864145133ff019a3d076a6cab36fcc
Reviewed-on: http://gerrit.cloudera.org:8080/18228
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Like IMPALA-8369, this patch adds a compatibility shim in fe so that
Impala can interoperate with Hive 3.1.2. It adds a new MetastoreShim
class under the compat-apache-hive-3 directory. These shim classes,
used by frontend code, implement the methods that differ between
cdp-hive-3 and apache-hive-3. At build time, based on the environment
variable IMPALA_HIVE_DIST_TYPE, one of the two shims is added as a
source directory by the fe/pom.xml build plugin.
Some code that directly uses Hive 4 APIs needs to be excluded from
compilation, e.g. fe/src/main/java/org/apache/impala/catalog/metastore/.
A Maven profile excludes this code and is activated automatically based
on IMPALA_HIVE_DIST_TYPE.
Testing:
1. Code compiles and runs against both HMS-3 and ASF-HMS-3
2. Ran full-suite of tests against HMS-3
3. Running full-tests against ASF-HMS-3 will need more work
supporting Tez in the mini-cluster (for dataloading) and HMS
transaction support. This will be on-going effort and test failures
on ASF-Hive-3 will be fixed in additional sub-tasks.
Notes:
1. Patch uses a custom build of Apache Hive to be deployed in
mini-cluster. This build has the fixes for HIVE-21569, HIVE-20038.
This hack will be added to the build script in additional sub-tasks.
Change-Id: I9f08db5f6da735ac431819063060941f0941f606
Reviewed-on: http://gerrit.cloudera.org:8080/17774
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
load-functional-query-exhaustive-hbase-generated.create failed with
the newer HBase shell shipped in version 2.4.6, which throws the
following error at the end of script execution:
ERROR NoMethodError: private method `exit' called for nil:NilClass
Since we run the script in a non-interactive way, it is safe to
remove the trailing "exit" command from the script.
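For context, the generated script is fed to the shell
non-interactively, so end-of-input already terminates it and the
trailing 'exit' is redundant:
  hbase shell load-functional-query-exhaustive-hbase-generated.create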
Testing:
- Complete data loading without error.
- Pass core tests.
Change-Id: I9185704b02c51c7a9cb9aa7fd2a7d1103c8b7cbb
Reviewed-on: http://gerrit.cloudera.org:8080/18079
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for COS (Cloud Object Storage). Using the
hadoop-cos connector, the implementation is similar to that of other
remote FileSystems.
New flags for COS:
- num_cos_io_threads: Number of COS I/O threads. Defaults to 16.
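A hedged example of overriding the default when starting a test
cluster:
  # raise the COS I/O thread count from the default of 16
  bin/start-impala-cluster.py --impalad_args=--num_cos_io_threads=32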
Follow-up:
- Support for caching COS file handles will be addressed in
IMPALA-10772.
- test_concurrent_inserts and test_failing_inserts in
test_acid_stress.py are skipped due to slow file listing on
COS (IMPALA-10773).
Tests:
- Upload hdfs test data to a COS bucket. Modify all locations in HMS
DB to point to the COS bucket. Remove some hdfs caching params.
Run CORE tests.
Change-Id: Idce135a7591d1b4c74425e365525be3086a39821
Reviewed-on: http://gerrit.cloudera.org:8080/17503
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Recent Hive releases seem to be enforcing that data for a managed table
is stored under the hive.metastore.warehouse.dir path property in a
folder path similar to databasename.db/tablename - see
https://cwiki.apache.org/confluence/display/Hive/Managed+vs.+External+Tables
This patch uses the form /test-warehouse/managed/databasename.db in
generate-schema-statements.py when creating transactional tables.
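The resulting layout for a transactional table, sketched with a
hypothetical name:
  /test-warehouse/managed/functional_orc_def.db/alltypes/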
Testing:
- A few small changes to tests that verify filesystem changes for acid
tables.
- Exhaustive tests pass.
Change-Id: Ib870ca802c9fa180e6be7a6f65bef35b227772db
Reviewed-on: http://gerrit.cloudera.org:8080/18046
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch bumps the GBN to 15549253. It includes Fang-Yu's fix to use
the correct policy id when updating the "all - database" policy, which
is needed due to a change on the Ranger side.
Testing:
* ran the create-load-data.sh
Change-Id: Ie7776e62dad0b9bec6c03fb9ee8f1b8728ff0e69
Reviewed-on: http://gerrit.cloudera.org:8080/17746
Reviewed-by: Vihang Karajgaonkar <vihang@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
We found that setup-ranger continues execution even when an error
occurs while i) running wget to initialize the GROUP_ID_OWNER and
GROUP_ID_NON_OWNER environment variables, or ii) running curl to
upload a revised Ranger policy, despite create-load-data.sh being
executed with the -e option set. This patch improves the error
handling by making setup-ranger exit as soon as an error occurs, so
that no tests are run at all when setup fails.
To exit if an error occurs during wget, we separate the assignment
from the export of each environment variable: when the two are
combined, the export runs and succeeds regardless of an error in the
assignment. That is, combining them hides the error.
On the other hand, to exit if an error occurs during curl, we add an
additional -f option so that an error will no longer be silently
ignored.
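A minimal sketch of both fixes (URLs abbreviated as placeholders):
  # the combined form (export VAR=$(wget ...)) would succeed even if
  # wget fails; separating the two lets 'set -e' abort on the failure
  GROUP_ID_OWNER=$(wget -q -O - "$RANGER_URL_OWNER")
  export GROUP_ID_OWNER
  # -f makes curl exit non-zero on an HTTP error instead of silently
  # printing the error body
  curl -f -u admin:admin -X PUT -H "Content-Type: application/json" \
    -d @policy_4_revised.json "$RANGER_POLICY_URL"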
Testing:
- Verified that setup-ranger could be successfully executed after this
patch.
- Verified that setup-ranger would exit if a URL in setup-ranger is not
correctly set up or if the 'id' field in policy_4_revised.json does
not match the URL of the policy to be updated.
Change-Id: I45605d1a7441b734cf80249626638cde3adce28b
Reviewed-on: http://gerrit.cloudera.org:8080/17386
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this patch, when we started the Hive Metastore or the HiveServer2
via run-hive-server.sh, hive-metastore.out and hive-server2.out would be
generated, respectively. The output of executing the hive command was
redirected to the corresponding files mentioned above. However, since
">" was used to redirect the output, the previously generated files
would be overwritten every time during the restart and thus some
important error message may be lost.
To collect more information regarding the restart, this patch appends
the newly produced output to hive-metastore.out and hive-server2.out
during the restart, which makes it easier for developers to troubleshoot
issues related to the restart.
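The change amounts to switching the redirection operator (command and
paths abbreviated):
  # before: > truncated the log on every restart
  # after: >> appends, preserving output from earlier runs
  hive --service metastore ... >> "$LOG_DIR/hive-metastore.out" 2>&1 &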
Change-Id: I2efdbcf886e2d32ccf8c7eef038360884e44f216
Reviewed-on: http://gerrit.cloudera.org:8080/17642
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently, the dataload scripts don't respect non-standard
compression codecs when loading Parquet data. They always
load Snappy, even when something else is specified, such as
--table_format=parquet/zstd.
This fixes the dataload scripts so that they specify the
compression_codec query option correctly and thus use the
right codec when loading Parquet.
For backwards compatibility, this preserves the behavior
that parquet/none corresponds to the default compression
codec (which is Snappy).
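A sketch of what the scripts now do for a non-default codec (names
hypothetical; for parquet/none the option is simply left unset):
  impala-shell -q "SET compression_codec=zstd;
    INSERT OVERWRITE TABLE tpch_parquet_zstd.lineitem
    SELECT * FROM tpch.lineitem;"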
This should make it easier to do performance testing on
various Parquet codecs (like ZSTD).
Testing:
- Ran bin/load-data.py -w tpch --table_format=parquet/zstd
and checked the codec in the file with the parquet-reader
utility
Change-Id: I1a346de3e5c4e38328e5a8ce8162697b7dd6553a
Reviewed-on: http://gerrit.cloudera.org:8080/17259
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Streamline triaging a bit. When this fails, it does so in a specific
location, and until now you had to scan the build log to find the
problem. This JUnitXML symptom should make this failure mode obvious.
Tested by running an S3 build on private infrastructure with a
deliberately mismatched data snapshot.
Change-Id: I2fa193740a2764fdda799d6a9cc64f89cab64aba
Reviewed-on: http://gerrit.cloudera.org:8080/17242
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for GCS (Google Cloud Storage). Using the
gcs-connector, the implementation is similar to that of other remote
FileSystems.
New flags for GCS:
- num_gcs_io_threads: Number of GCS I/O threads. Defaults to 16.
Follow-up:
- Support for spilling to GCS will be addressed in IMPALA-10561.
- Support for caching GCS file handles will be addressed in
IMPALA-10568.
- test_concurrent_inserts and test_failing_inserts in
test_acid_stress.py are skipped due to slow file listing on
GCS (IMPALA-10562).
- Some tests are skipped due to issues introduced by /etc/hosts setting
on GCE instances (IMPALA-10563).
Tests:
- Compile and create hdfs test data on a GCE instance. Upload test data
to a GCS bucket. Modify all locations in HMS DB to point to the GCS
bucket. Remove some hdfs caching params. Run CORE tests.
- Compile and load snapshot data to a GCS bucket. Run CORE tests.
Change-Id: Ia91ec956de3b620cccf6a1244b56b7da7a45b32b
Reviewed-on: http://gerrit.cloudera.org:8080/17121
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch enables the Impala frontend jar and dependent library
libfesupport.so to be used by an external Java frontend.
Calling FeSupport.setExternalFE() will cause external frontend
initialization mode to be used during FeSupport.loadLibrary(). This
mode builds upon logic that is used to initialize the frontend jar for
unit tests.
Initialization in external frontend mode differs as follows:
- Skip instantiating Frontend object and its dependents
- Skip loading libhdfs
- Skip starting JVM Pause monitor
- Disable Minidumper
- Initialize TimezoneDatabase for external frontends
- Disable redirect of stderr/stdout to libfesupport.so glog
- Log messages from libfesupport.so to stderr
- Use libfesupport.so for JNI symbol look up
Null checks were added in places where objects were previously
assumed to be instantiated but are now skipped during initialization.
Additional change:
1) Add libfesupport.lib path to JAVA_LIBRARY_PATH in test driver
Testing:
- Initialized frontend jar from external frontend
- Verified that frontend Java objects can be used externally without
issues
- Verified that exceptions thrown from Impala Java or libfesupport
can be caught or propagated correctly by the external frontend
- Manual verification of minicluster logs
- Ran queries with external frontend
Co-authored-by: John Sherman <jfs@cloudera.com>
Co-authored-by: Aman Sinha <amsinha@cloudera.com>
Change-Id: I4e3a84721ba196ec00773ce2923b19610b90edd9
Reviewed-on: http://gerrit.cloudera.org:8080/17115
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
As a follow-up to IMPALA-10314, it is sometimes useful to consider
a simple limit as a way to sample from a table if a relevant hint
has been provided. Doing a sample instead of pure limit serves
dual purposes: (a) it still helps with reducing the planning time
since the scan ranges need be computed only for the sample files,
(b) it allows sufficient number of files/rows to be read from
the table such that after applying filter conditions or joins with
another table, the query may still produce the N rows needed for
limit.
This functionality is especially useful if the query is against a
view. Note that a TABLESAMPLE clause cannot be applied to a view, and
embedding a TABLESAMPLE explicitly on a table within a view will
not work because we don't want to sample if there's no limit.
In this patch, a new table level hint, 'convert_limit_to_sample(n)'
is added. If this hint is attached to a table either in the main
query block or within a view/subquery and simple limit optimization
conditions are satisfied (according to IMPALA-10314), the limit
is converted to a table sample. The parameter 'n' in parenthesis is
required and specifies the sample percentage. It must be an integer
between 1 and 100. For example:
set optimize_simple_limit = true;
CREATE VIEW v1 as SELECT * FROM T [convert_limit_to_sample(5)]
WHERE [always_true] <predicate>;
SELECT * FROM v1 LIMIT 10;
In this case, the limit 10 is applied on top of a 5 percent sample
of T which is applied after partition pruning.
Testing:
- Added an alltypes_date_partition_2 table where the date and
timestamp values match (this helps with setting the
'always_true' hint).
- Added views with 'convert_limit_to_sample' and 'always_true'
hints and added new tests against the views. Modified a few
existing tests to reference the new table variant.
- Added an end-to-end test.
Change-Id: Ife05a5343c913006f7659949b327b63d3f10c04b
Reviewed-on: http://gerrit.cloudera.org:8080/16792
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This changes all existing Java code to be submodules under
a single root pom. The root pom is impala-parent/pom.xml
with minor changes to add submodules.
This avoids most of the weird CMake/maven interactions,
because there is now a single maven invocation for all
the Java code.
This moves all the Java projects other than fe into
a top level java directory. fe is left where it is
to avoid disruption (but still is compiled via the
java directory's root pom). Various pieces of code
that reference the old locations are updated.
Based on research, there are two options for dealing
with the shaded dependencies. The first is to have an
entirely separate Maven project with a separate Maven
invocation. In this case, the consumers of the shaded
jars will see the reduced set of transitive dependencies.
The second is to have the shaded dependencies as modules
with a single Maven invocation. The consumer would see
all of the original transitive dependencies and need to
exclude them all. See MSHADE-206/MNG-5899. This chooses
the second.
This only moves code around and does not focus on version
numbers or making "mvn versions:set" work.
Testing:
- Ran a core job
- Verified existing maven commands from fe/ directory still work
- Compared the *-classpath.txt files from fe and executor-deps
and verified they are the same except for paths
Change-Id: I08773f4f9d7cb269b0491080078d6e6f490d8d7a
Reviewed-on: http://gerrit.cloudera.org:8080/16500
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
HIVE-24145 still causes data load failures quite frequently. The failure
usually occurs during TPC-DS loading. I modified
generate-schema-statements.py to only load ORC tables as full ACID in
the 'functional-query' workload. Since this workload contains the
ACID-specific tests, we should still have enough coverage for ORC/ACID
testing.
Testing
* Ran exhaustive tests successfully
Change-Id: I0c81aedd3be314819dc4bc5bebec17bad3d03b10
Reviewed-on: http://gerrit.cloudera.org:8080/16511
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Some tests (e.g. AnalyzeDDLTest.TestCreateTableLikeFileOrc) depend on
hard-coded file paths of managed tables, assuming that there is always a
file named 'base_0000001/bucket_00000_0' under the table dir. However,
the file name is in the form of bucket_${bucket-id}_${attempt-id}. The
last part of the file name is not guaranteed to be 0. If the first
attempt fails and the second attempt succeeds, the file name will be
bucket_00000_1.
This patch replaces these hard-coded file paths with the paths of
corresponding files that are uploaded to HDFS by commands. For tests
that do need the file paths of managed table files, we do a listing on
the table dir to get the file names, instead of hard-coding the paths.
Updated chars-formats.orc to contain real column names so it can be
used in more tests. The original file only had names like col0, col1,
col2.
Tests:
- Run CORE tests
Change-Id: Ie3136ee90e2444c4a12f0f2e1470fca1d5deaba0
Reviewed-on: http://gerrit.cloudera.org:8080/16441
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for constant propagation of range predicates
involving date and timestamp constants. Previously, only equality
predicates were considered for propagation. The new type of propagation
is shown by the following example:
Before constant propagation:
WHERE date_col = CAST(timestamp_col as DATE)
AND timestamp_col BETWEEN '2019-01-01' AND '2020-01-01'
After constant propagation:
WHERE date_col >= '2019-01-01' AND date_col <= '2020-01-01'
AND timestamp_col >= '2019-01-01' AND timestamp_col <= '2020-01-01'
AND date_col = CAST(timestamp_col as DATE)
As a consequence, since Impala supports table partitioning by date
columns but not timestamp columns, the above propagation enables
partition pruning based on timestamp ranges.
Existing code for equality based constant propagation was refactored
and consolidated into a new class which handles both equality and
range based constant propagation. Range based propagation is only
applied to date and timestamp columns.
Testing:
- Added new range constant propagation tests to PlannerTest.
- Added e2e test for range constant propagation based on a newly
added date partitioned table.
- Ran precommit tests.
Change-Id: I811a1f8d605c27c7704d7fc759a91510c6db3c2b
Reviewed-on: http://gerrit.cloudera.org:8080/16346
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This removes Impala-lzo from the Impala development environment.
Impala-lzo is not built as part of the Impala build. The LZO plugin
is no longer loaded. LZO tables are not loaded during dataload,
and LZO is no longer tested.
This removes some obsolete scan APIs that were only used by Impala-lzo.
With this commit, Impala-lzo would require code changes to build
against Impala.
The plugin infrastructure is not removed, and this leaves some
LZO support code in place. If someone were to decide to revive
Impala-lzo, they would still be able to load it as a plugin
and get the same functionality as before. This plugin support
may be removed later.
Testing:
- Dryrun of GVO
- Modified TestPartitionMetadataUncompressedTextOnly's
test_unsupported_text_compression() to add LZO case
Change-Id: I3a4f12247d8872b7e14c9feb4b2c58cfd60d4c0e
Reviewed-on: http://gerrit.cloudera.org:8080/15814
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Minor compactions can compact several delta directories into a single
delta directory. The current directory filtering algorithm had to be
modified to handle minor compacted directories and prefer those over
plain delta directories. This happens in the Frontend, mostly in
AcidUtils.java.
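For example (write ids illustrative), a minor compaction can replace
  full_acid/delta_0000001_0000001_0000/
  full_acid/delta_0000002_0000002_0000/
with a single directory spanning the same write id range:
  full_acid/delta_0000001_0000002_v0001234/
and the filtering must now pick the compacted directory over the
originals when both are still present.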
Hive Streaming Ingestion writes similar delta directories, but they
might contain rows Impala cannot see based on its valid write id list.
E.g. we can have the following delta directory:
full_acid/delta_0000001_0000010/0000 # minWriteId: 1
# maxWriteId: 10
This delta dir contains rows with write ids between 1 and 10. But maybe
we are only allowed to see write ids less than 5. Therefore we need to
check the ACID write id column (named originalTransaction) to determine
which rows are valid.
Delta directories written by Hive Streaming don't have a visibility txn
id, so we can recognize them based on the directory name. If there's
a visibilityTxnId and it is committed => every row is valid:
full_acid/delta_0000001_0000010_v01234 # has visibilityTxnId
# every row is valid
If there's no visibilityTxnId then it was created via Hive Streaming,
therefore we need to validate rows. Fortunately Hive Streaming writes
rows with different write ids into different ORC stripes, therefore we
don't need to validate the write id per row. If we had statistics,
we could validate per stripe, but since Hive Streaming doesn't write
statistics we validate the write id per ORC row batch (an alternative
could be to do a 2-pass read, first we'd read a single value from each
stripe's 'currentTransaction' field, then we'd read the stripe if the
write id is valid).
Testing
* the frontend logic is tested in AcidUtilsTest
* the backend row validation is tested in test_acid_row_validation
Change-Id: I5ed74585a2d73ebbcee763b0545be4412926299d
Reviewed-on: http://gerrit.cloudera.org:8080/15818
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala 4 decided to drop Sentry support in favor of Ranger. This
removes Sentry support and related tests. It retires startup
flags related to Sentry and does the first round of removing
obsolete code. This does not adjust documentation to remove
references to Sentry, and other dead code will be removed
separately.
Some issues came up when implementing this. Here is a summary
of how this patch resolves them:
1. authorization_provider currently defaults to "sentry", but
"ranger" requires extra parameters to be set. This changes the
default value of authorization_provider to "", which translates
internally to the noop policy that does no authorization.
2. These flags are Sentry specific and are now retired:
- authorization_policy_provider_class
- sentry_catalog_polling_frequency_s
- sentry_config
3. The authorization_factory_class may be obsolete now that
there is only one authorization policy, but this leaves it
in place.
4. Sentry is the last component using CDH_COMPONENTS_HOME, so
that is removed. There are still Maven dependencies coming
from the CDH_BUILD_NUMBER repository, so that is not removed.
5. To make the transition easier, testdata/bin/kill-sentry-service.sh
is not removed and it is still called from testdata/bin/kill-all.sh.
Testing:
- Core job passes
Change-Id: I8e99c15936d6d250cf258e3a1dcba11d3eb4661e
Reviewed-on: http://gerrit.cloudera.org:8080/15833
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala 4 moved to using CDP versions for components, which involves
adopting Hive 3. This removes the old code supporting CDH components
and Hive 2. Specifically, it does the following:
1. Remove USE_CDP_HIVE and default to the values from USE_CDP_HIVE=true.
USE_CDP_HIVE now has no effect on the Impala environment. This also
means that bin/jenkins/build-all-flag-combinations.sh no longer
include USE_CDP_HIVE=false as a configuration.
2. Remove USE_CDH_KUDU and default to getting Kudu from the
native toolchain.
3. Ban IMPALA_HIVE_MAJOR_VERSION<3 and remove related code, including
the IMPALA_HIVE_MAJOR_VERSION=2 maven profile in fe/pom.xml.
There is a fair amount of code that still references the Hive major
version. Upstream Hive is now working on Hive 4, so there is a high
likelihood that we'll need some code to deal with that transition.
This leaves some code (such as maven profiles) and test logic in
place.
Change-Id: Id85e849beaf4e19dda4092874185462abd2ec608
Reviewed-on: http://gerrit.cloudera.org:8080/15869
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
HIVE-22794 disallows ACID tables outside of the 'managed' warehouse
directory. This change updates data loading to make it conform to
the new rules.
The following tests had to be modified to use the new paths:
* AnalyzeDDLTest.TestCreateTableLikeFileOrc()
* create-table-like-file-orc.test
Change-Id: Id3b65f56bf7f225b1d29aa397f987fdd7eb7176c
Reviewed-on: http://gerrit.cloudera.org:8080/15708
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Automatically assume IMPALA_HOME is the source directory
in a couple of places.
Delete the cache_tables.py script and MINI_DFS_BASE_DATA_DIR
config var which had both bit-rotted and were unused.
Allow setting IMPALA_CLUSTER_NODES_DIR to put the minicluster
nodes, most importantly the data, in a different location, e.g.
on a different filesystem.
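For example (path hypothetical):
  # keep minicluster node dirs, including HDFS data, on a larger disk
  export IMPALA_CLUSTER_NODES_DIR=/data1/impala/cluster_nodes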
Testing:
I set up a dev environment using this code and was able to
load data and run some tests.
Change-Id: Ibd8b42a6d045d73e3ea29015aa6ccbbde278eec7
Reviewed-on: http://gerrit.cloudera.org:8080/15687
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Full ACID row format looks like this:
{
  "operation": 0,
  "originalTransaction": 1,
  "bucket": 536870912,
  "rowId": 0,
  "currentTransaction": 1,
  "row": {"i": 1}
}
User columns are nested under "row". In the frontend we need to create
slot descriptors that correspond to the file schema. In the catalog we
could mimic the file schema but that would introduce several
complexities and corner cases in column resolution. Also in query
results the heading of the above user column would be "row.i". Star
expansion should also be modified, etc.
Because of that in the Catalog I create the exact opposite of the above
schema:
{
  "row__id":
  {
    "operation": 0,
    "originalTransaction": 1,
    "bucket": 536870912,
    "rowId": 0,
    "currentTransaction": 1
  },
  "i": 1
}
This way very little modification is needed in the frontend, and the
hidden columns can easily be retrieved via 'SELECT row__id.*' when we
need them for debugging/testing.
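For example (table name hypothetical):
  impala-shell -q "SELECT row__id.*, i FROM full_acid_tbl"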
We only need to change Path.getAbsolutePath() to return a schema path
that corresponds to the file schema. Also in the backend we need some
extra juggling in OrcSchemaResolver::ResolveColumn() to retrieve the
table schema path from the file schema path.
Testing:
I changed data loading to load ORC files in full ACID format by default.
With this change we should be able to scan full ACID tables that are
not minor-compacted, don't have deleted rows, and don't have original
files.
Newly added Tests:
* specific queries about hidden columns (full-acid-rowid.test)
* SHOW CREATE TABLE (show-create-table-full-acid.test)
* DESCRIBE [FORMATTED] TABLE (describe-path.test)
* INSERT should be forbidden (acid-negative.test)
* added tests for column masking (
ranger_column_masking_complex_types.test)
Change-Id: Ic2e2afec00c9a5cf87f1d61b5fe52b0085844bcb
Reviewed-on: http://gerrit.cloudera.org:8080/15395
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Due to the incompatibility between different versions of the Guava
libraries, after bumping up Guava in IMPALA-8870, our build script is
not supposed to start up the Sentry service when starting the
minicluster because Sentry has not had its Guava bumped up yet. However,
the patch for IMPALA-8870 did not take this into consideration when
$TARGET_FILESYSTEM is s3 and thus run-all.sh still attempts to start
up Sentry in this case. This patch fixes the bug.
Testing:
- Verified that this patch passes the core tests in the DEBUG build when
$TARGET_FILESYSTEM is s3.
Change-Id: If81846f4251fb2aa752ba8c33615cae0ab513a62
Reviewed-on: http://gerrit.cloudera.org:8080/15590
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds an environment variable DISABLE_SENTRY to allow Impala
to run tests without Sentry. Specifically, we start up Sentry only when
$DISABLE_SENTRY does not evaluate to true. The corresponding Sentry FE
and E2E tests will also be skipped if $DISABLE_SENTRY is true.
Moreover, in this patch we set DISABLE_SENTRY to true if
$USE_CDP_HIVE evaluates to true, so that only Impala's authorization
with Ranger is tested once support for Sentry is dropped after we
switch to CDP Hive.
Note that this patch also changes how we generate hive-site.xml when
$DISABLE_SENTRY is true: we no longer add the Sentry server as a
metastore event listener. Recall that both CDH Hive and CDP Hive make
an RPC to the registered listeners every time create_database_core()
in HiveMetaStore.java is called, which happens when Hive instead of
Impala is used to create a database, e.g., when some databases in the
TPC-DS data set are created during the execution of
create-load-data.sh. Removing Sentry as an event listener when
$DISABLE_SENTRY is true is therefore necessary: it prevents the
HiveMetaStore from repeatedly trying to connect to a Sentry server
that is not online, which could make create-load-data.sh time out.
Testing:
Except for two currently known issues, IMPALA-9513 and IMPALA-9451,
verified this patch passes the exhaustive tests in the DEBUG build
- when $USE_CDP_HIVE is false, and
- when $USE_CDP_HIVE is true.
Change-Id: Ifa3f1840a77a7b32310a5c8b78a2c26300ccb41e
Reviewed-on: http://gerrit.cloudera.org:8080/15505
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
String values from external systems (HDFS, Hive, Kudu, etc.) are
already unescaped, as are string values in Thrift objects deserialized
in coordinators. We should mark needsUnescaping_ as false when creating
StringLiterals for these values (in LiteralExpr#create()).
When comparing StringLiterals in partition pruning, we should also use
the unescaped values if needsUnescaping_ is true.
Tests:
- Add tests for partition pruning on unescaped strings.
- Add test coverage for all existing code paths using
LiteralExpr#create().
- Run core tests
Change-Id: Iea8070f16a74f9aeade294504f2834abb8b3b38f
Reviewed-on: http://gerrit.cloudera.org:8080/15278
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The schema file allows specifying a command line in several of the
sections (LOAD, DEPENDENT_LOAD, etc). These are executed by
testdata/bin/generate-schema-statements.py when it is creating the
SQL files that are later executed for dataload. A fair number of
tables use this flexibility to execute hdfs mkdir and copy commands
via the command line.
Unfortunately, this is very inefficient. HDFS command line
commands require spinning up a JVM and can take over one
second per command. These commands are executed during a
serial part of dataload, and they can be executed multiple
times. In short, these commands are a significant slowdown
for loading the functional tables.
This converts the hdfs command line statements to equivalent
Hive LOAD DATA LOCAL statements. These are doing the copy
from an already running JVM, so they do not need JVM startup.
They also run in the parallel part of dataload, speeding up
the SQL generation part.
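Roughly, the conversion replaces per-table shell commands like
(names and paths hypothetical):
  hadoop fs -mkdir -p /test-warehouse/some_tbl      # one JVM start
  hadoop fs -put data.txt /test-warehouse/some_tbl  # per command
with a statement Hive runs from its already-warm JVM:
  beeline -u "$JDBC_URL" -e "LOAD DATA LOCAL INPATH 'data.txt'
    OVERWRITE INTO TABLE some_db.some_tbl;"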
This speeds up generate-schema-statements.py significantly.
On the functional dataset, it saves 7 minutes.
Before:
time testdata/bin/generate-schema-statements.py -w functional-query -e exhaustive -f
real 8m8.068s
user 10m11.218s
sys 0m44.932s
After:
time testdata/bin/generate-schema-statements.py -w functional-query -e exhaustive -f
real 0m35.800s
user 0m42.536s
sys 0m5.210s
This is currently a long-pole in dataload, so it translates directly to
an overall speedup of about 7 minutes.
Testing:
- Ran debug tests
Change-Id: Icf17b85ff85618933716a80f1ccd6701b07f464c
Reviewed-on: http://gerrit.cloudera.org:8080/15228
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
As Joe pointed out in IMPALA-9351, it would help debugging issues with
missing files if we had logged the created files when loading the data.
With this commit, running create-load-data.sh now logs the created
files into created-files.log.
Change-Id: I4f413810c6202a07c19ad1893088feedd9f7278f
Reviewed-on: http://gerrit.cloudera.org:8080/15234
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Add a new flag -with_ranger in testdata/bin/run-hive-server.sh to start
Hive with Ranger integration. The relevant configuration files are
generated in bin/create-test-configuration.sh using a new variant
ranger_auth in hive-site.xml.py. Only Hive3 is supported.
Current limitation:
Can't use a different username in Beeline via the -n option: "select
current_user()" keeps returning my username, while "select
logged_in_user()" returns the username given by the -n option, but
that name is not used in authorization.
Tests:
- Ran bin/create-test-configuration.sh and verified the generated
hive-site_ranger_auth.xml contains Ranger configurations.
- Ran testdata/bin/run-hive-server.sh -with_ranger. Verified column
masking and row filtering policies took effect in Beeline.
- Added test in test_ranger.py for this mode.
Change-Id: I01e3a195b00a98388244a922a1a79e65146cec42
Reviewed-on: http://gerrit.cloudera.org:8080/15189
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A Hudi Read Optimized table contains multiple versions of parquet
files. In order to load the table correctly, Impala needs to recognize
a Hudi Read Optimized table as an HdfsTable and load the latest version
of each file using HoodieROTablePathFilter.
Tests
- Unit test for Hudi in FileMetadataLoader
- Create table tests in functional_schema_template.sql
- Query tests in hudi-parquet.test
Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Reviewed-on: http://gerrit.cloudera.org:8080/14711
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The kerberized minicluster is enabled by setting
IMPALA_KERBERIZE=true in impala-config-*.sh.
After setting it you must run ./bin/create-test-configuration.sh
and then restart the minicluster.
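In command form (assuming the usual minicluster scripts):
  export IMPALA_KERBERIZE=true        # normally set in impala-config-*.sh
  ./bin/create-test-configuration.sh
  testdata/bin/run-all.sh             # restart the minicluster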
This adds a script to partially automate setup of a local KDC,
in lieu of the unmaintained minikdc support (which has been ripped
out).
Testing:
I was able to run some queries against pre-created HDFS tables
with kerberos enabled.
Change-Id: Ib34101d132e9c9d59da14537edf7d096f25e9bee
Reviewed-on: http://gerrit.cloudera.org:8080/15159
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Hive 3 changed the typical storage model for tables to split them
between two directories:
- hive.metastore.warehouse.dir stores managed tables (which is now
defined to be only transactional tables)
- hive.metastore.warehouse.external.dir stores external tables
(everything that is not a transactional table)
In more recent commits of Hive, there is now validation that
external tables cannot be stored in the managed directory. In order
to adopt these newer versions of Hive, we need to use separate
directories for external vs managed warehouses.
Most of our test tables are not transactional, so they would reside
in the external directory. To keep the test changes small, this uses
/test-warehouse for the external directory and /test-warehouse/managed
for the managed directory. Having the managed directory be a subdirectory
of /test-warehouse means that the data snapshot code should not need to
change.
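The resulting layout, sketched with a hypothetical database name:
  /test-warehouse/foo.db/             # external (non-transactional) tables
  /test-warehouse/managed/foo.db/     # managed (transactional) tables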
The Hive 2 configuration doesn't change as it does not have this concept.
Since this changes the dataload layout, this also sets CDH_MAJOR_VERSION
to 7 for USE_CDP_HIVE=true. This means that dataload uses a separate
location for data as compared to USE_CDP_HIVE=false, which should reduce
conflicts between the two configurations.
Testing:
- Ran exhaustive tests with USE_CDP_HIVE=false
- Ran exhaustive tests with USE_CDP_HIVE=true (with current Hive version)
- Verified that dataload succeeds and tests are able to run with a newer
Hive version.
Change-Id: I3db69f1b8ca07ae98670429954f5f7a1a359eaec
Reviewed-on: http://gerrit.cloudera.org:8080/15026
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In IMPALA-9047, we disabled some Ranger-related FE and BE tests due to
changes in Ranger's behavior after upgrading Ranger from 1.2 to 2.0.
This patch aims to re-enable those disabled FE tests in
AuthorizationStmtTest.java and RangerAuditLogTest.java to increase
Impala's test coverage of authorization via Ranger.
There are at least two major changes in Ranger's behavior in the newer
versions.
1. The first is that the owner of the requested resource no longer has
to be explicitly granted privileges in order to access the resource.
2. The second is that a user not explicitly granted the privilege of
creating a database is able to do so.
Due to these changes, some of previous Ranger authorization requests
that were expected to be rejected are now granted after the upgrade.
To re-enable the tests affected by the first change described above, we
modify AuthorizationTestBase.java to allow our FE Ranger authorization
tests to specify the requesting user in an authorization test. Those
tests failed after the upgrade because the default requesting user in
Impala's AuthorizationTestBase.java happens to be the owner of the
resources involved in our FE authorization tests. After this patch, a
requesting user can be either a non-owner user or an owner user in a
Ranger authorization test and the requesting user would correspond to a
non-owner user if it is not explicitly specified. Note that in a Sentry
authorization test, we do not use the non-owner user as the requesting
user by default as we do in the Ranger authorization tests. Instead, we
set the name of the requesting user to a name that is the same as the
owner user in Ranger authorization tests to avoid the need for providing
a customized group mapping service when instantiating a Sentry
ResourceAuthorizationProvider as we do in AuthorizationTest.java, our
FE tests specifically for testing authorization via Sentry.
On the other hand, to re-enable the tests affected by the second change,
we remove from the Ranger policy for all databases the allowed
condition that grants any user the privilege of creating a database,
which is not granted by default in Sentry. After the removal of the
allowed condition, those tests in AuthorizationStmtTest.java and
RangerAuditLogTest.java affected by the second change now result in the
same authorization errors as before the upgrade of Ranger.
Testing:
- Passed AuthorizationStmtTest.java in a local dev environment
- Passed RangerAuditLogTest.java in a local dev environment
Change-Id: I228533aae34b9ac03bdbbcd51a380770ff17c7f2
Reviewed-on: http://gerrit.cloudera.org:8080/14798
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The fix for IMPALA-9150 changed kill-hbase.sh to use HBase's
stop-hbase.sh script. Around this time, the GVO timeout issues
started. GVO can reuse machines, so we don't know what state
they may be in. If something failed to kill HBase processes,
the next job would need to be able to kill them even without
access to the last run's files / logs.
This restores the original kill logic to kill-hbase.sh, after
trying a graceful shutdown using HBase's stop-hbase.sh script.
The original kill logic doesn't rely on anything from the
filesystem to know about the existence of processes, so it
would handle machine reuse.
This also changes our Jenkins test scripts to shut down the
minicluster at the end.
Testing:
- Started with a running minicluster, ran bin/clean.sh,
then ran testdata/bin/kill-all.sh and verified that the
java processes were gone
Change-Id: Ie2f0b342bcd1d8abea8ef923adbb54a14518a7a6
Reviewed-on: http://gerrit.cloudera.org:8080/14789
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Some jobs have been hanging in the testdata/bin/create-hbase.sh
script. Logs during the hang show the HBase Master is stuck
uninitialized:
Master startup cannot progress, in holding-pattern until region onlined.
...
ERROR master.HMaster: Master failed to complete initialization after 900000ms.
Anecdotally, the HBase Master doesn't have this problem if we remove
the /hbase/MasterProcWALs directory in kill-hbase.sh. This patch
does exactly that. It is a hack, and we should update this code
once we know what is going on.
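The cleanup amounts to something like the following in kill-hbase.sh
(exact flags and path are a sketch):
  # hack: drop the master procedure WALs so the master starts fresh
  hdfs dfs -rm -r -f -skipTrash /hbase/MasterProcWALs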
Testing:
- test-with-docker.py fails without this patch and passes with it
- Hand testing on my minicluster shows that this allows HBase to
restart and be consistently usable
Change-Id: Icef3d30e6b539a175e03f63fcdbfb2d4608c08fa
Reviewed-on: http://gerrit.cloudera.org:8080/14757
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This converts the existing bin/run-all-tests-timeout-check.sh
to a more generic bin/script-timeout-check.sh. It uses this
new script for both bin/run-all-tests.sh and
testdata/bin/create-load-data.sh. The new script takes two
arguments:
-timeout : timeout in minutes
-script_name : name of the calling script
The script_name is used in debugging output / output filenames
to make it clear what timed out.
The run-all-tests.sh timeout remains the same.
testdata/bin/create-load-data.sh uses a 2.5 hour timeout.
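A hedged usage sketch, with the flags described above:
  # watchdog: preserve logs and fail the run if we exceed 150 minutes
  ./bin/script-timeout-check.sh -timeout 150 -script_name create-load-data &
  TIMEOUT_PID=$!
  # ... do the actual work ...
  kill $TIMEOUT_PID   # cancel the watchdog on success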
This should help debug the issue in IMPALA-9165, because at
least the logs would be preserved on the Jenkins job.
Testing:
- Tested the timeout script by hand with a caller script that
sleeps longer than the timeout
- Ran a gerrit-verify-dryrun-external
Change-Id: I19d76bd8850c7d4b5affff4d21f32d8715a382c6
Reviewed-on: http://gerrit.cloudera.org:8080/14741
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
testdata/bin/kill-hbase.sh currently uses the generic
kill-java-service.sh script to kill the region servers,
then the master, and then ZooKeeper. Recent versions
of HBase become unusable after performing this type of
shutdown. The master seems to get stuck trying to recover,
even after restarting the minicluster.
The root cause in HBase is unclear, but HBase provides the
stop-hbase.sh script, which does a more graceful shutdown.
This switches testdata/bin/kill-hbase.sh to use this script,
which avoids the recovery problems.
Testing:
- Ran the test-with-docker.py tests (which does a minicluster
restart). Before the change, the HBase tests timed out due
to HBase getting stuck recovering. After the change, tests
ran normally.
- Added a minicluster restart after dataload so that this
is tested.
Change-Id: I67283f9098c73c849023af8bfa7af62308bf3ed3
Reviewed-on: http://gerrit.cloudera.org:8080/14697
Reviewed-by: Vihang Karajgaonkar <vihang@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Having non-existent or incompatible jars on the classpath can cause
Ranger startup to fail. Update run-ranger-server.sh to clean the
classpath so that it works after sourcing set-classpath.sh.
Also remove a couple of legacy jars from 2013. Those jars
no longer exist in Hive.
Testing:
In my development environment.
$ . bin/set-classpath.sh
$ ./testdata/bin/run-ranger-server.sh
Change-Id: Ie7036f9a07e5c9b8d46bb7f459d0b9d1e7e9d0a7
Reviewed-on: http://gerrit.cloudera.org:8080/14152
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In HdfsCachingUtil we set the expiration of cache directives to never.
This works well until the cache pool has a max TTL set; once it is,
Impala gets an exception when it tries to add caching for tables or
partitions.
I changed HdfsCachingUtil to not set the expiration. This way the cache
directive inherits the expiration from the cache pool.
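For context, the directive comes from statements like (pool name
hypothetical):
  impala-shell -q "ALTER TABLE t SET CACHED IN 'ttl_pool'"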
Testing
Added an e2e test that creates a table in a cache pool that has a max TTL.
Change-Id: I475b92704b19e337b2e62f766e5b978585bf6583
Reviewed-on: http://gerrit.cloudera.org:8080/14485
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change is a follow-up to IMPALA-7368 and adds support for DATE
type to the avro scanner.
Similarly to parquet, avro uses DATE logical type for dates. DATE
logical type annotates an INT32 that stores the number of days since
the unix epoch, 1 January 1970.
This representation introduces an avro interoperability issue between
Impala and older versions of Hive:
- Before version 3.1, Hive used Julian calendar to represent dates
up to 1582-10-05 and Gregorian calendar for dates starting with
1582-10-15. Dates between 1582-10-05 and 1582-10-15 were lost.
- Impala uses proleptic Gregorian calendar, extending the Gregorian
calendar backward to dates preceding its official introduction in
1582-10-15.
This means that pre-1582-10-15 dates written to an avro table by Hive
will be read back incorrectly by Impala.
Note that Hive 3.1 switched to proleptic Gregorian calendar too, so
for Hive 3.1+ this is no longer an issue.
Dependency changes:
- BE uses avro 1.7.4-p5 from native-toolchain.
Change-Id: I7a9d5b93a22cf3a00244037e187f8c145cacc959
Reviewed-on: http://gerrit.cloudera.org:8080/13944
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
* Don't add PYTHONPATH to environment in impala-config.sh,
it is done automatically by the impala-python script anyway.
I think this is legacy from when we ran some things with
the system python.
* Remove unnecessary set-pythonpath.sh invocations where all
calls go via impala-python anyway.
* Remove impala-shell eggs from python path. All these packages
are installed into the virtualenv.
* testdata path entry was not needed - it's imported via the root
Testing:
Ran core tests
Change-Id: Iff98eb261ab48c592e8d323aa409c6a65317b95a
Reviewed-on: http://gerrit.cloudera.org:8080/14238
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
The flakiness may be related to starting Hive queries in parallel,
which triggers initializing Tez resources in parallel (only needed at
the first statement that uses Tez). Running a non-parallel statement
first may solve the issue.
Also includes a fix for a recent issue in 'build-and-copy-hive-udfs'
introduced by the version bump
in https://gerrit.cloudera.org/#/c/14043/
Change-Id: Id21d57483fe7a4f72f450fb71f8f53b3c1ef6327
Reviewed-on: http://gerrit.cloudera.org:8080/14081
Reviewed-by: Vihang Karajgaonkar <vihang@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
The bug is that the serialized size wasn't populated
for VARCHAR in a case when it should have been.
It appears a condition was simply not updated when
VARCHAR was added.
Other code assumed that the serialized size was
populated when the other size field was populated,
which is a reasonable invariant. I documented the
invariant in the class and added validation that
the invariant held.
Defining and checking invariants led to discovering
various other minor issues where the sizes were
set incorrectly for fixed-length types or not set for
variable-length types:
* CHAR was not consistently treated as a fixed-length type.
* avgSerializedSize_ was not always updated with avgSize_
Testing:
Added a regression test for this specific case. Adding
the assertions also surfaced related bugs in other cases.
Change-Id: Ie45e386cb09e31f4b7cdc82b7734dbecb4464534
Reviewed-on: http://gerrit.cloudera.org:8080/14062
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>