Commit Graph

17 Commits

Author SHA1 Message Date
Riza Suminto
182aa5066e IMPALA-13958: Revisit hs2_parquet_constraint and hs2_text_constraint
hs2_parquet_constraint and hs2_text_constraint is meant to extend test
vector dimension to also test non-default test protocol (other than
beeswax), but limit it to only run against 'parquet/none' or 'text/none'
format accordingly.

This patch modifies these constraints to
default_protocol_or_parquet_constraint and
default_protocol_or_text_constraint respectively such that the full file
format coverage happen for default_test_protocol configuration and
limited for the other protocols. Drop hs2_parquet_constraint entirely
from test_utf8_strings.py because that test is already constrained to
single 'parquet/none' file format.

Num modified rows validation in date-fileformat-support.test and
date-partitioning.test are changed to check the NumModifiedRows counter
from profile.

Fix TestQueriesJsonTables to always run with beeswax protocol because
its assertions relies on beeswax-specific return values.

Run impala-isort and fix few flake8 issues and in modified test files.

Testing:
Run and pass the affected test files using exhaustive exploration and
env var DEFAULT_TEST_PROTOCOL=hs2. Confirmed that full file format
coverage happen for hs2 protocol. Note that
DEFAULT_TEST_PROTOCOL=beeswax is still the default.

Change-Id: I8be0a628842e29a8fcc036180654cd159f6a23c8
Reviewed-on: http://gerrit.cloudera.org:8080/22775
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-04-17 22:50:58 +00:00
Riza Suminto
047cf9ff4d IMPALA-13954: Validate num inserted rows via NumModifiedRows counter
This patch changes the way test validate num inserted rows from checking
the beeswax-specific result to checking NumModifiedRows counter from
query profile.

Remove skiping over hs2 protocol in test_chars.py and refactor
test_date_queries.py a bit to reduce test skiping. Added HS2_TYPES in
tests that requires it and fix some flake8 issues.

Testing:
Run and pass all affected tests.

Change-Id: I96eae9967298f75b2c9e4d0662fcd4a62bf5fffc
Reviewed-on: http://gerrit.cloudera.org:8080/22770
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2025-04-11 20:31:50 +00:00
Csaba Ringhofer
f98b697c7b IMPALA-13929: Make 'functional-query' the default workload in tests
This change adds get_workload() to ImpalaTestSuite and removes it
from all test suites that already returned 'functional-query'.
get_workload() is also removed from CustomClusterTestSuite which
used to return 'tpch'.

All other changes besides impala_test_suite.py and
custom_cluster_test_suite.py are just mass removals of
get_workload() functions.

The behavior is only changed in custom cluster tests that didn't
override get_workload(). By returning 'functional-query' instead
of 'tpch', exploration_strategy() will no longer return 'core' in
'exhaustive' test runs. See IMPALA-3947 on why workload affected
exploration_strategy. An example for affected test is
TestCatalogHMSFailures which was skipped both in core and exhaustive
runs before this change.

get_workload() functions that return a different workload than
'functional-query' are not changed - it is possible that some of
these also don't handle exploration_strategy() as expected, but
individually checking these tests is out of scope in this patch.

Change-Id: I9ec6c41ffb3a30e1ea2de773626d1485c69fe115
Reviewed-on: http://gerrit.cloudera.org:8080/22726
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-04-08 07:12:55 +00:00
Eyizoha
2f06a7b052 IMPALA-10798: Initial support for reading JSON files
Prototype of HdfsJsonScanner implemented based on rapidjson, which
supports scanning data from splitting json files.

The scanning of JSON data is mainly completed by two parts working
together. The first part is the JsonParser responsible for parsing the
JSON object, which is implemented based on the SAX-style API of
rapidjson. It reads data from the char stream, parses it, and calls the
corresponding callback function when encountering the corresponding JSON
element. See the comments of the JsonParser class for more details.

The other part is the HdfsJsonScanner, which inherits from HdfsScanner
and provides callback functions for the JsonParser. The callback
functions are responsible for providing data buffers to the Parser and
converting and materializing the Parser's parsing results into RowBatch.
It should be noted that the parser returns numeric values as strings to
the scanner. The scanner uses the TextConverter class to convert the
strings to the desired types, similar to how the HdfsTextScanner works.
This is an advantage compared to using number value provided by
rapidjson directly, as it eliminates concerns about inconsistencies in
converting decimals (e.g. losing precision).

Added a startup flag, enable_json_scanner, to be able to disable this
feature if we hit critical bugs in production.

Limitations
 - Multiline json objects are not fully supported yet. It is ok when
   each file has only one scan range. However, when a file has multiple
   scan ranges, there is a small probability of incomplete scanning of
   multiline JSON objects that span ScanRange boundaries (in such cases,
   parsing errors may be reported). For more details, please refer to
   the comments in the 'multiline_json.test'.
 - Compressed JSON files are not supported yet.
 - Complex types are not supported yet.

Tests
 - Most of the existing end-to-end tests can run on JSON format.
 - Add TestQueriesJsonTables in test_queries.py for testing multiline,
   malformed, and overflow in JSON.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Reviewed-on: http://gerrit.cloudera.org:8080/19699
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-09-05 16:55:41 +00:00
Joe McDonnell
82bd087fb1 IMPALA-11973: Add absolute_import, division to all eligible Python files
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
 1. Python 3 requires absolute imports within packages. This
    can be emulated via "from __future__ import absolute_import"
 2. Python 3 changed division to "true" division that doesn't
    round to an integer. This can be emulated via
    "from __future__ import division"

This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.

I scrutinized each old-division location and converted some locations
to use the integer division '//' operator if it needed an integer
result (e.g. for indices, counts of records, etc). Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.

Testing:
 - Ran core tests

Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Michael Smith
1eb0510eaa IMPALA-11456: Collapse filesystem Skip logic
Combines all SkipIf* classes for different filesystems into a single
SkipIfFS class. Many cases are simplified to 'not IS_HDFS', with the
rest as filesystem-specific special cases. The 'jira' option is removed
in favor of specific flags for each issue.

Change-Id: Ib928a6274baaaec45614887b9e762346a25812a1
Reviewed-on: http://gerrit.cloudera.org:8080/18781
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-10 22:37:08 +00:00
Michael Smith
830625b104 IMPALA-9442: Add Ozone to minicluster
Adds Ozone as an alternative to hdfs in the minicluster. Select by
setting `export TARGET_FILESYSTEM=ozone`. With that flag,
run-mini-dfs.sh will start Ozone instead of HDFS. Requires a snapshot
because Ozone does not support HBase (HDDS-3589); snapshot loading
doesn't work yet primarily due to HDDS-5502.

Uses the o3fs interface because Ozone puts specific restrictions on
bucket names (no underscores, for instance), and it was a lot easier to
use an interface where everything is written to a single bucket than to
update all Impala's use of HDFS-style paths to make `test-warehouse` a
bucket inside a volume.

Specifies reduced Ozone client retries during shutdown where Ozone may
not be available.

Passes tests with FE_TEST=false BE_TEST=false.

Change-Id: Ibf8b0f7b2d685d8b011df1926e12bf5434b5a2be
Reviewed-on: http://gerrit.cloudera.org:8080/18738
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2022-08-03 16:58:20 +00:00
Michael Smith
5263d13112 IMPALA-11314: Test PyPI package with system python
Sets up a virtualenv with system python to install the impala-shell PyPI
package into. Using system python provides better coverage for Python
versions likely to be used by customers. Runs impala-shell tests using
the PyPI package to provide better coverage for the artifact customers
will use.

Includes a PyPI install in notests_independent_targets because these
seem to be used for Python testing despite -notests.

Change-Id: I384ea6a7dab51945828cca629860400a23fa0c05
Reviewed-on: http://gerrit.cloudera.org:8080/18586
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2022-06-13 17:13:42 +00:00
stiga-huang
331ff4647d IMPALA-11137: Enable proleptic Gregorian Calendar for Hive
Since HIVE-22589, Hive still uses Julian Calendar for writing dates
before 1582-10-15, whereas Impala uses proleptic Gregorian Calendar.
This affects the results Impala gets when querying tables written by
Hive. Currently, the Avro and ORC formats of date_tbl are suffering this
issue.

This patch enables proleptic Gregorian Calendar for Hive by default.
It also reverts the two commits of IMPALA-9555 which modifies the tests
to satisfy the inconsistent results.

Tests:
 - Ran CORE tests

Change-Id: I6be9c9720dd352d6821cdaa6c64d35ba20473bc0
Reviewed-on: http://gerrit.cloudera.org:8080/18262
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-02-23 20:09:10 +00:00
Fucun Chu
157086cb80 IMPALA-10771: Add Tencent COS support
This patch adds support for COS(Cloud Object Storage). Using the
hadoop-cos, the implementation is similar to other remote FileSystems.

New flags for COS:
- num_cos_io_threads: Number of COS I/O threads. Defaults to be 16.

Follow-up:
- Support for caching COS file handles will be addressed in
   IMPALA-10772.
- test_concurrent_inserts and test_failing_inserts in
   test_acid_stress.py are skipped due to slow file listing on
   COS (IMPALA-10773).

Tests:
 - Upload hdfs test data to a COS bucket. Modify all locations in HMS
   DB to point to the COS bucket. Remove some hdfs caching params.
   Run CORE tests.

Change-Id: Idce135a7591d1b4c74425e365525be3086a39821
Reviewed-on: http://gerrit.cloudera.org:8080/17503
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-12-08 16:32:02 +00:00
stiga-huang
2dfc68d852 IMPALA-7712: Support Google Cloud Storage
This patch adds support for GCS(Google Cloud Storage). Using the
gcs-connector, the implementation is similar to other remote
FileSystems.

New flags for GCS:
 - num_gcs_io_threads: Number of GCS I/O threads. Defaults to be 16.

Follow-up:
 - Support for spilling to GCS will be addressed in IMPALA-10561.
 - Support for caching GCS file handles will be addressed in
   IMPALA-10568.
 - test_concurrent_inserts and test_failing_inserts in
   test_acid_stress.py are skipped due to slow file listing on
   GCS (IMPALA-10562).
 - Some tests are skipped due to issues introduced by /etc/hosts setting
   on GCE instances (IMPALA-10563).

Tests:
 - Compile and create hdfs test data on a GCE instance. Upload test data
   to a GCS bucket. Modify all locations in HMS DB to point to the GCS
   bucket. Remove some hdfs caching params. Run CORE tests.
 - Compile and load snapshot data to a GCS bucket. Run CORE tests.

Change-Id: Ia91ec956de3b620cccf6a1244b56b7da7a45b32b
Reviewed-on: http://gerrit.cloudera.org:8080/17121
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-13 11:20:08 +00:00
Attila Jeges
2cd7a2b77a IMPALA-9555: [Hive3] Fix test failure introduced by HIVE-22589
With HIVE-22589 Hive3 switched back to using Julian Calendar for
historical dates by default which caused an Impala test failure
around Avro DATE values.

Change-Id: I51dd933867ea7877235e7f6e1f2b56711dca107e
Reviewed-on: http://gerrit.cloudera.org:8080/15564
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-27 17:16:00 +00:00
Lars Volker
74c7b7e55f IMPALA-8863: Add support to run tests over HTTP/HS2
This change adds support to run backend tests over HTTP using a new
version of Impyla (0.16.1). It also adds a test that exercises
authentication over HTTP.

Change-Id: I7156558071781378fcb9c8941c0f4dd82eb0d018
Reviewed-on: http://gerrit.cloudera.org:8080/14059
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-11-26 22:46:40 +00:00
Attila Jeges
27fa27e808 IMPALA-8198: DATE: Read from avro.
This change is a follow-up to IMPALA-7368 and adds support for DATE
type to the avro scanner.

Similarly to parquet, avro uses DATE logical type for dates. DATE
logical type annotates an INT32 that stores the number of days since
the unix epoch, 1 January 1970.

This representation introduces an avro interoperability issue between
Impala and older versions of Hive:
- Before version 3.1, Hive used Julian calendar to represent dates
  up to 1582-10-05 and Gregorian calendar for dates starting with
  1582-10-15. Dates between 1582-10-05 and 1582-10-15 were lost.
- Impala uses proleptic Gregorian calendar, extending the Gregorian
  calendar backward to dates preceding its official introduction in
  1582-10-15.
This means that pre-1582-10-15 dates written to an avro table by Hive
will be read back incorrectly by Impala.

Note that Hive 3.1 switched to proleptic Gregorian calendar too, so
for Hive 3.1+ this is no longer an issue.

Dependency changes:
- BE uses avro 1.7.4-p5 from native-toolchain.

Change-Id: I7a9d5b93a22cf3a00244037e187f8c145cacc959
Reviewed-on: http://gerrit.cloudera.org:8080/13944
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-09-27 17:18:35 +00:00
Attila Jeges
2bce974990 IMPALA-7370: DATE: Read/Write to parquet.
This change is a follow-up to IMPALA-7368 and adds support for DATE
type to the parquet scanner/writer. CREATE TABLE LIKE PARQUET
statements associated with data files that contain dates are also
supported.

Parquet uses DATE logical type for dates. DATE logical type annotates
an INT32 that stores the number of days from the Unix epoch, 1 January
1970.

This representation introduces a parquet interoperability issue
between Impala and older versions of Hive:
- Before version 3.1, Hive used Julian calendar to represent dates
  up to 1582-10-05 and Gregorian calendar for dates starting with
  1582-10-15. Dates between 1582-10-05 and 1582-10-15 were lost.
- Impala uses proleptic Gregorian calendar, extending the Gregorian
  calendar backward to dates preceding its official introduction in
  1582-10-15.
This means that pre-1582-10-15 dates written to a parquet table by
Hive will be read back incorrectly by Impala and vice versa.

Note that Hive 3.1 switched to proleptic Gregorian calendar too, so
for Hive 3.1+ this is no longer an issue.

Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Reviewed-on: http://gerrit.cloudera.org:8080/13189
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-05-07 00:36:56 +00:00
Tim Armstrong
0a9ea803d2 IMPALA-7290: part 1: clean up shell tests
This sets up the tests to be extensible to test shell
in both beeswax and HS2 modes.

Testing:
* Add test dimension containing only beeswax in preparation
  for HS2 dimension.
* Factor out hardcoded ports.
* Add tests for formatting of all types and NULL values.
* Merge date shell test into general type tests.
* Added testing for floating point output formatting, which does
  change as a result of switching to server-side vs client-side
  formatting.
* Use unique_database for tests that create tables.

Change-Id: Ibe5ab7f4817e690b7d3be08d71f8f14364b84412
Reviewed-on: http://gerrit.cloudera.org:8080/13083
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-04-30 11:30:45 +00:00
Attila Jeges
b5805de3e6 IMPALA-7368: Add initial support for DATE type
DATE values describe a particular year/month/day in the form
yyyy-MM-dd. For example: DATE '2019-02-15'. DATE values do not have a
time of day component. The range of values supported for the DATE type
is 0000-01-01 to 9999-12-31.

This initial DATE type support covers TEXT and HBASE fileformats only.
'DateValue' is used as the internal type to represent DATE values.

The changes are as follows:
- Support for DATE literal syntax.

- Explicit casting between DATE and other types (note that invalid
  casts will fail with an error just like invalid DECIMAL_V2 casts,
  while failed casts to other types do no lead to warning or error):
    - from STRING to DATE. The string value must be formatted as
      yyyy-MM-dd HH:mm:ss.SSSSSSSSS. The date component is mandatory,
      the time component is optional. If the time component is
      present, it will be truncated silently.
    - from DATE to STRING. The resulting string value is formatted as
      yyyy-MM-dd.
    - from TIMESTAMP to DATE. The source timestamp's time of day
      component is ignored.
    - from DATE to TIMESTAMP. The target timestamp's time of day
      component is set to 00:00:00.

- Implicit casting between DATE and other types:
    - from STRING to DATE if the source string value is used in a
      context where a DATE value is expected.
    - from DATE to TIMESTAMP if the source date value is used in a
      context where a TIMESTAMP value is expected.

- Since STRING -> DATE, STRING -> TIMESTAMP and DATE -> TIMESTAMP
  implicit conversions are now all possible, the existing function
  overload resolution logic is not adequate anymore.
  For example, it resolves the
  if(false, '2011-01-01', DATE '1499-02-02') function call to the
  if(BOOLEAN, TIMESTAMP, TIMESTAMP) version of the overloaded
  function, instead of the if(BOOLEAN, DATE, DATE) version.

  This is clearly wrong, so the function overload resolution logic had
  to be changed to resolve function calls to the best-fit overloaded
  function definition if there are multiple applicable candidates.

  An overloaded function definition is an applicable candidate for a
  function call if each actual parameter in the function call either
  matches the corresponding formal parameter's type (without casting)
  or is implicitly castable to that type.

  When looking for the best-fit applicable candidate, a parameter
  match score (i.e. the number of actual parameters in the function
  call that match their corresponding formal parameter's type without
  casting) is calculated and the applicable candidate with the highest
  parameter match score is chosen.

  There's one more issue that the new resolution logic has to address:
  if two applicable candidates have the same parameter match score and
  the only difference between the two is that the first one requires a
  STRING -> TIMESTAMP implicit cast for some of its parameters while
  the second one requires a STRING -> DATE implicit cast for the same
  parameters then the first candidate has to be chosen not to break
  backward compatibility.
  E.g: year('2019-02-15') function call must resolve to
  year(TIMESTAMP) instead of year(DATE). Note, that year(DATE) is not
  implemented yet, so this is not an issue at the moment but it will
  be in the future.
  When the resolution algorithm considers overloaded function
  definitions, first it orders them lexicographically by the types in
  their parameter lists. To ensure the backward compatible behavior
  Primitivetype.DATE enum value has to come after
  PrimitiveType.TIMESTAMP.

- Codegen infrastructure changes for expression evaluation.
- 'IS [NOT] NULL' and '[NOT] IN' predicates.
- Common comparison operators (including the 'BETWEEN' operator).
- Infrastructure changes for built-in functions.
- Some built-in functions: conditional, aggregate, analytical and
  math functions.
- C++ UDF/UDA support.
- Support partitioning and grouping by DATE.
- Beeswax, HiveServer2 support.

These items are tightly coupled and it makes sense to implement them
in one change-set.

Testing:
- A new partitioned TEXT table 'functional.date_tbl' (and the
  corresponding HBASE table 'functional_hbase.date_tbl') was
  introduced for DATE-related tests.
- BE and FE tests were extended to cover DATE type.
- E2E tests:
    - since DATE type is supported for TEXT and HBASE fileformats
      only, most DATE tests were implemented separately in
      tests/query_test/test_date_queries.py.

Note, that this change-set is not a complete DATE type implementation,
but it lays the foundation for future work:
- Add date support to the random query generator.
- Implement a complete set of built-in functions.
- Add Parquet support.
- Add Kudu support.
- Optionally support Avro and ORC.
For further details, see IMPALA-6169.

Change-Id: Iea8155ef09557e0afa2f8b2d0b2dc9d0896dc30f
Reviewed-on: http://gerrit.cloudera.org:8080/12481
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-04-23 13:33:57 +00:00