29 Commits

Author SHA1 Message Date
Csaba Ringhofer
843de44788 IMPALA-13125: Fix pairwise test vector generation
Replaced allpairspy with a homemade pair finder that
seems to find a somewhat less optimal (larger) covering
vector set but works reliably with filters. For details
see tests/common/test_vector.py

Also fixes a few test issues uncovered. Some fixes are
copied from https://gerrit.cloudera.org/#/c/23319/

Added the possibility of shuffling vectors to get a
different test set (env var IMPALA_TEST_VECTOR_SEED).
By default the algorithm is deterministic so the test
set won't change between runs (similarly to allpairspy).
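
A rough sketch of a greedy pair finder that honors filters and an optional
seed. This is not the code in tests/common/test_vector.py; the names
pairwise_vectors, dimensions and is_valid are hypothetical, and only the
IMPALA_TEST_VECTOR_SEED variable comes from this patch. It enumerates the full
cross product, so it only suits small examples.

```python
import itertools
import os
import random


def pairwise_vectors(dimensions, is_valid=lambda v: True, seed=None):
    """Greedily pick vectors until every coverable value pair is covered.

    dimensions: dict mapping dimension name -> list of hashable values.
    is_valid: filter applied to each full candidate vector (a dict).
    seed: if not None, shuffle candidates to get a different covering set.
    """
    names = sorted(dimensions)
    candidates = [dict(zip(names, combo))
                  for combo in itertools.product(*(dimensions[n] for n in names))]
    # Apply the filter up front so only pairs reachable by valid vectors count.
    candidates = [v for v in candidates if is_valid(v)]
    if seed is not None:
        random.Random(seed).shuffle(candidates)

    def pairs(vector):
        return {((a, vector[a]), (b, vector[b]))
                for a, b in itertools.combinations(names, 2)}

    uncovered = set().union(*(pairs(v) for v in candidates))
    chosen = []
    while uncovered:
        # max() returns the first best candidate, so the result is
        # deterministic for a fixed candidate order.
        best = max(candidates, key=lambda v: len(pairs(v) & uncovered))
        chosen.append(best)
        uncovered -= pairs(best)
    return chosen


dims = {"format": ["text", "parquet", "orc"], "batch_size": [0, 1], "mt_dop": [0, 4]}
vectors = pairwise_vectors(
    dims,
    is_valid=lambda v: not (v["format"] == "text" and v["mt_dop"] == 4),
    seed=os.environ.get("IMPALA_TEST_VECTOR_SEED"))
```

Filtering before pair collection means a pair that no valid vector can
produce is never waited for, which is the situation where a generic pairwise
library can struggle.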

Added a new constraint to test only a single compression
per file format in some tests to reduce the number of
new vectors.

EE + custom_cluster test count in exhaustive runs:
before patch:                   ~11000
after patch:                    ~16000
without compression constraint: ~17000

Change-Id: I419c24659a08d8d6592fadbbd5b764ff73cbba3e
Reviewed-on: http://gerrit.cloudera.org:8080/23342
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-08-28 15:27:02 +00:00
Riza Suminto
9cb9bae84e IMPALA-13758: Use context manager in ImpalaTestSuite.change_database
ImpalaTestSuite.change_database is responsible for pointing the impala
client to the database under test. However, it left the client pointing to
that database after the test without reverting it back to the default
database. This patch does the reversal by changing
ImpalaTestSuite.change_database to use a context manager.
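
A minimal sketch of the context manager pattern, assuming a client that
switches database through a "use" statement. FakeClient and this standalone
change_database() are hypothetical stand-ins, not the ImpalaTestSuite code.

```python
from contextlib import contextmanager


class FakeClient(object):
    """Hypothetical stand-in for an Impala client; records the current database."""

    def __init__(self):
        self.database = "default"

    def execute(self, query):
        if query.lower().startswith("use "):
            self.database = query.split()[1]


@contextmanager
def change_database(client, db_name):
    """Point the client at db_name, then revert to 'default' on exit."""
    client.execute("use " + db_name)
    try:
        yield client
    finally:
        # Runs even if the test body raises, so the client never stays
        # pointed at the database under test.
        client.execute("use default")


client = FakeClient()
with change_database(client, "tpch_nested_parquet"):
    database_inside = client.database
database_after = client.database
```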

This patch changes the behavior of execute_query_using_client() and
execute_query_async_using_client(). They used to change database
according to the given vector parameter, but no longer do so after this
patch. In practice, this behavior change does not affect many tests
because most queries going through these functions already use fully
qualified table names. Going forward, queries issued through functions
other than run_test_case() should use fully qualified table names as
much as possible.

Retain the behavior of ImpalaTestSuite._get_table_location() (changing
database when called) since a considerable number of tests rely on it.

Removed unused test fixtures and fixed several flake8 issues in modified
test files.

Testing:
- Moved nested-types-subplan-single-node.test. This allows the test
  framework to point to the right tpch_nested* database.
- Pass exhaustive tests except IMPALA-13752 and IMPALA-13761. They will
  be fixed in a separate patch.

Change-Id: I75bec7403cc302728a630efe3f95e852a84594e2
Reviewed-on: http://gerrit.cloudera.org:8080/22487
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-02-19 23:50:34 +00:00
Riza Suminto
c08aff420d IMPALA-13672: Migrate query_test/test_kudu.py to use hs2 protocol
This patch migrates query_test/test_kudu.py to use the hs2 client protocol.
Here are the steps taken:

- Override default_test_protocol() to return 'hs2'.
  See documentation in ImpalaTestSuite about what this method does.
- Remove usage of the deprecated cursor and unique_cursor fixtures.
- Replace all direct ImpalaTestSuite.client usage with helper
  function calls such as execute_query() or execute_query_using_vector().
- Remove all "SET" query invocations and replace them with passing an
  exec_option dictionary to the helper method.
- Replace verifying kudu modified / inserted rows by reading query
  output with reading runtime profile counters.
- Add an HS2_TYPES section to test cases where only TYPES exists.
- Remove all drop_impala_table_after_context() calls and replace them with
  proper use of the unique_database fixture.

KuduTestSuite is fixed with hs2 protocol dimension. Meanwhile,
CustomKuduTest is fixed to use beeswax protocol dimension until proper
migration can be done.

Added the following convenience methods:
- ImpalaTestSuite.default_test_protocol() to allow an individual test
  class to override its default test protocol.
- ImpylaHS2ResultSet.tuples() to access the raw HS2 result set that is
  a list of tuples.

This patch also adds several literal constants around test vector
dimensions to help with traceability.

Fixed a bug where "SHOW PARTITIONS" via hs2 over a kudu table shows
NULL for #Replicas because TResultRowBuilder does not have an
overload for int type values. Adjusted the numFiles variable inside
HdfsTable.getTableStats() from int to long to match Type.BIGINT of
column '#Files'.

Fixed py.test classes that do not inherit BaseTestSuite. Fixed flake8
issues in test_statestore.py.

Testing:
- Run and pass all tests extended from KuduTestSuite in exhaustive mode.

Change-Id: I5f38baf5a0bbde1a1ad0bb4666c300f4f3cabd33
Reviewed-on: http://gerrit.cloudera.org:8080/22358
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-02-07 11:57:59 +00:00
Riza Suminto
6fbde72969 IMPALA-13694: Add ImpalaTestSuite.__reset_impala_clients method
This patch adds the __reset_impala_clients() method in ImpalaTestSuite.
__reset_impala_clients() simply clears the client configuration. It is
called on each setup_method() to ensure that each EE test uses a clean
test client. All subclasses of ImpalaTestSuite that declare a setup()
method are refactored to declare setup_method() instead, to match the
newer py.test convention. Also implement teardown_method() to complement
setup_method(). See "Method and function level setup/teardown" at
https://docs.pytest.org/en/stable/how-to/xunit_setup.html.
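
For illustration, a sketch of the xunit-style hooks resetting a shared client
before every test method. FakeConnection and its clear_configuration() are
assumptions made for this example; only the setup_class(), setup_method() and
teardown_method() hook names are pytest's own.

```python
class FakeConnection(object):
    """Hypothetical stand-in for a test client that holds query options."""

    def __init__(self):
        self.options = {}

    def set_configuration_option(self, name, value):
        self.options[name.lower()] = str(value)

    def clear_configuration(self):
        self.options.clear()


class TestExample(object):
    @classmethod
    def setup_class(cls):
        # One client per class, shared by all test methods.
        cls.client = FakeConnection()

    def setup_method(self, method):
        # Called by pytest before each test method: start from a clean client.
        self.client.clear_configuration()

    def teardown_method(self, method):
        # Called by pytest after each test method.
        self.client.clear_configuration()

    def test_sets_option(self):
        self.client.set_configuration_option("MT_DOP", 4)
        assert self.client.options == {"mt_dop": "4"}

    def test_starts_clean(self):
        # Passes regardless of test order because setup_method() resets state.
        assert self.client.options == {}
```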

CustomClusterTestSuite fully overrides setup_method() and
teardown_method() because its subclasses can be destructive. Custom
cluster test methods often restart the whole Impala cluster, rendering
the default impala clients initialized at setup_class() unusable. Each
subclass of CustomClusterTestSuite is responsible for ensuring that the
impala client it uses is in a good state.

This patch improves BeeswaxConnection and ImpylaHS2Connection to only
consider non-REMOVED options as their default options. Each looks up
valid (not REMOVED) query options in its own appropriate way and
memorizes the option names as lowercase strings and the values as strings.
List values are wrapped in double quotes. The log in
ImpalaConnection.set_configuration_option() is differentiated from how
a SET query looks.

Note that ImpalaTestSuite.run_test_case() modifies and restores query
options written in the .test file by issuing SET queries, not by calling
ImpalaConnection.set_configuration_option(). This remains unchanged.

Consistently lowercase query options everywhere in the Impala test code
infrastructure. Fixed several tests that had been unknowingly overriding
the 'exec_option' vector dimension due to a case-sensitive mismatch. Also
fixed some flake8 issues.

Added convenience methods execute_query_using_vector() and
create_impala_client_from_vector() in ImpalaTestSuite.

Testing:
- Pass core tests.

Change-Id: Ieb47fec9f384cb58b19fdbd10ff7aa0850ad6277
Reviewed-on: http://gerrit.cloudera.org:8080/22404
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-02-06 04:03:33 +00:00
Riza Suminto
134de01a59 IMPALA-13642: Fix unused test vector in test_scanners.py
Several test vectors were ignored in test_scanners.py. This causes
repetition of the same test without actually varying the test
exec_option or debug_action.

This patch fixes it by:
- Using execute_query() instead of client.execute()
- Passing vector.get_value('exec_option') when executing the test query.

Repurpose ImpalaTestMatrix.embed_independent_exec_options to deepcopy
the 'exec_option' dimension during vector generation. Therefore, each test
execution has its own unique copy of 'exec_option'.
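
To illustrate why the deepcopy matters, here is a small sketch with
hypothetical names (make_vectors, BASE_EXEC_OPTION, placeholder action strings
"A" and "B"). It is not the ImpalaTestMatrix code. Without a copy, every
vector aliases the same dictionary and sees the last value written.

```python
import copy

BASE_EXEC_OPTION = {"batch_size": 0, "debug_action": None}
DEBUG_ACTIONS = ["A", "B"]  # placeholder values, not real debug actions


def make_vectors(share_dict):
    """Build one vector per debug action, sharing or copying exec_option."""
    base = copy.deepcopy(BASE_EXEC_OPTION)
    vectors = []
    for action in DEBUG_ACTIONS:
        exec_option = base if share_dict else copy.deepcopy(base)
        exec_option["debug_action"] = action
        vectors.append({"exec_option": exec_option})
    return vectors


shared = [v["exec_option"]["debug_action"] for v in make_vectors(True)]
copied = [v["exec_option"]["debug_action"] for v in make_vectors(False)]
```

With sharing, both vectors report the last action written; with a deepcopy
each vector keeps its own value.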

This patch also adds flake8-unused-arguments plugin into
critique-gerrit-review.py and py3-requirements.txt so we can catch this
issue during code review. impala-flake8 is also updated to use
impala-python3-common.sh. Adds flake8==3.9.2 in py3-requirements.txt,
which is the highest version that has compatible dependencies with
pylint==2.10.2.

Drop unused 'dryrun' parameter in get_catalog_compatibility_comments
method of critique-gerrit-review.py.

Testing:
- Run impala-flake8 against test_scanners.py and confirm there are no
  more unused variables.
- Run and pass test_scanners.py in core exploration.

Change-Id: I3b78736327c71323d10bcd432e162400b7ed1d9d
Reviewed-on: http://gerrit.cloudera.org:8080/22301
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-09 06:17:51 +00:00
Riza Suminto
b9b4a6d122 IMPALA-13330: Fix orc_schema_resolution in test_nested_types.py
test_nested_types.py declares the 'orc_schema_resolution' dimension, but
does not actually exercise it. None of the tests actively insert the
'orc_schema_resolution' dimension value into
vector.get_value('exec_option').

This patch fixes that issue by declaring the 'orc_schema_resolution' option
using the helper function add_exec_option_dimension(), which automatically
inserts it into the 'exec_option' dimension. Test classes are also
reorganized to reduce test skipping and deepcopy-ing.

Following are notable changes:
- Use 'unique_database' in test_struct_in_select_list to avoid collision
  during view creation.
- Drop unused 'unique_database' fixture in
  TestNestedCollectionsInSelectList.
- test_map_null_keys does not have the 'mt_dop' dimension anymore since it
  only tests how NULL map keys are displayed.
- Created common base class TestParquetArrayEncodingsBase for
  TestParquetArrayEncodings and TestParquetArrayEncodingsAmbiguous. The
  latter does not run with 'parquet_array_resolution' anymore since that
  query option is set directly within parquet-ambiguous-list-modern.test
  and parquet-ambiguous-list-legacy.test files.
- Make ImpalaTestMatrix.add_dimensions() call
  ImpalaTestMatrix.clear_dimension() if given dimension.name is
  'exec_option' and independent_exec_option_names is not empty.

The reduction in test count is as follows:
Before patch:
168 core tests, 571 exhaustive tests
After patch:
161 core tests, 529 exhaustive tests

Testing:
- Ran and passed test_nested_types.py in exhaustive exploration.
- Verified that no WARNING log is printed by
  ImpalaTestSuite.validate_exec_option_dimension()

Change-Id: Ib958cd34a56c949190b4f22e5da5dad2c0de25ff
Reviewed-on: http://gerrit.cloudera.org:8080/21726
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-08-31 06:11:10 +00:00
Riza Suminto
be6f896d10 IMPALA-13319: Avoid duplicate exec option declaration in py.test
Before this patch, add_mandatory_exec_option() replaces existing query
option values in the 'exec_option' dimension and may cause unintended test
vector duplication. For example, the following declaration creates
two duplicate test vectors, both with "disable_codegen=False":

    cls.ImpalaTestMatrix.add_dimension(create_exec_option_dimension(
      disable_codegen_options=[False, True]))
    add_mandatory_exec_option(cls, "disable_codegen", False)

add_exec_option_dimension() creates a new test dimension for a 'key',
but does not insert it into the 'exec_option' dimension until vector
generation later. It also does not validate whether 'key' already exists
in the 'exec_option' dimension. This can confuse test writers when they
need to write a constraint, because they might look for the value at
vector.get_value('exec_option')['key'] instead of
vector.get_value('key'), and vice versa.

This patch adds an assertion to check that no duplicate query option name
is declared through any helper function. It also asserts that all query
option names are declared in lowercase.
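
The two checks can be sketched as follows. declare_exec_option() and the
'declared' registry are hypothetical names used for illustration only; the
real assertions live in the test helper functions.

```python
def declare_exec_option(declared, name, values):
    """Register a query option, rejecting duplicates and non-lowercase names."""
    assert name == name.lower(), "query option must be lowercase: %s" % name
    assert name not in declared, "query option declared twice: %s" % name
    declared[name] = list(values)


declared = {}
declare_exec_option(declared, "disable_codegen", [False, True])

try:
    # A second declaration of the same option is rejected instead of
    # silently producing duplicate test vectors.
    declare_exec_option(declared, "disable_codegen", [False])
    duplicate_rejected = False
except AssertionError:
    duplicate_rejected = True
```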

Testing:
- Manually verify test vector generation in test files containing the
  helper functions by running:
  impala-py.test --exploration=exhaustive --collect-only <test_file>
- Adjust query option declaration that breaks after this change.

Change-Id: I8143e47f19090e20707cfb0a05c779f4d289f33c
Reviewed-on: http://gerrit.cloudera.org:8080/21707
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-08-23 02:05:36 +00:00
Riza Suminto
172925bcb7 IMPALA-3825: Delegate runtime filter aggregation to some executors
IMPALA-4400 improved runtime filters by aggregating them locally
before sending filter updates to the coordinator and by sharing a
single RuntimeFilterBank among all fragment instances in a query. However,
local filter aggregation is still insufficient if the number of nodes in
an impala cluster is large. For example, in a cluster of around 700
impalad backends, aggregation of 1 MB bloom filter updates in the
coordinator can take more than 1 second.

This patch aims to reduce coordinator load and speed up runtime filter
aggregation by doing intermediate aggregation in a few designated impala
backends before doing final aggregation and publishing in the
coordinator. Query option MAX_NUM_FILTERS_AGGREGATED_PER_HOST is added
to control this feature. Given N as the number of backend executors
excluding the coordinator, the selected number of intermediate
aggregators M = ceil(N / MAX_NUM_FILTERS_AGGREGATED_PER_HOST). Setting
MAX_NUM_FILTERS_AGGREGATED_PER_HOST <= 1 will disable the intermediate
aggregator feature. In the backend scheduler, M impalads are selected
randomly as the intermediate aggregators for that runtime filter.
Information about these M selected impalads is then passed from the
scheduler to the coordinator as a RuntimeFilterAggregatorInfoPB. The
coordinator then converts the RuntimeFilterAggregatorInfoPB into filter
routing information, TRuntimeFilterAggDesc, that is piggy-backed in
TRuntimeFilterSource.
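
As a worked example of the formula above, the following sketch computes M and
picks aggregators at random. The function names are hypothetical and the real
selection happens in the C++ scheduler; this only mirrors the arithmetic
described here.

```python
import math
import random


def num_intermediate_aggregators(num_executors, max_filters_per_host):
    """M = ceil(N / MAX_NUM_FILTERS_AGGREGATED_PER_HOST); 0 means disabled."""
    if max_filters_per_host <= 1:
        return 0
    return int(math.ceil(num_executors / float(max_filters_per_host)))


def pick_aggregators(executors, max_filters_per_host, rng=random):
    """Randomly choose M executors (coordinator excluded) as aggregators."""
    m = num_intermediate_aggregators(len(executors), max_filters_per_host)
    return rng.sample(list(executors), m)


executors = ["exec-%d" % i for i in range(20)]  # hypothetical host names
m_for_20 = num_intermediate_aggregators(20, 5)
m_for_700 = num_intermediate_aggregators(700, 64)
aggregators = pick_aggregators(executors, 5, rng=random.Random(0))
```

With N = 20 and a limit of 5 this gives M = 4; with N = 700 and a limit of 64
it gives ceil(10.94) = 11.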

A new RPC endpoint named UpdateFilterFromRemote is added in
data_stream_service.proto to handle filter updates from fellow impalad
executor to the designated aggregator impalad. This RPC will merge
filter updates into 'pending_remote_filter'. The intermediate aggregator
will then combine 'pending_remote_filter' with
'pending_merge_filter' (from local aggregation) into 'result_filter'
which is then sent to the coordinator. RuntimeFilterBank of the
intermediate aggregator will wait for all remote filter updates for at
least RUNTIME_FILTER_WAIT_TIME_MS. If RuntimeFilterBank is closing and
RUNTIME_FILTER_WAIT_TIME_MS has passed, any incomplete filter will be
marked as ALWAYS_TRUE and sent to the coordinator.

This patch currently targets only the bloom filter produced by a
partitioned join build. Other kinds of runtime filters are still efficient
to aggregate in the coordinator alone, while the bloom filter from a
broadcast join requires only 1 valid filter update for publishing.

test_runtime_filters.py is modified to clarify the exec_options
dimension, test matrix constraints, and reduce pytest.skip() calls on
each test. runtime_filters.test is also changed to use counter
aggregation and assert on ExecSummary table so that they stay valid
irrespective of the number of fragment instances.

We benchmarked the aggregation of 1 MB runtime filters on a cluster of
20 executor nodes with MT_DOP=36 that is instrumented to disable
local aggregation, simulating 720 runtime filter updates. The aggregation
time is approximated as the duration between the earliest time a filter
update is made and the time that the coordinator publishes the complete
filter. The results are as follows:

+---------------------+------------------------+
| num aggregator node | Aggregation time (ms)  |
+---------------------+------------------------+
|                   0 |                   1296 |
|                   1 |                   1229 |
|                   2 |                    608 |
|                   4 |                    329 |
|                   8 |                    205 |
+---------------------+------------------------+

Testing:
- Exercise MAX_NUM_FILTERS_AGGREGATED_PER_HOST in
  test_runtime_filters.py and query-options-test.cc
- Add TestRuntimeFiltersLateRemoteUpdate.
- Add custom_cluster/test_runtime_filter_aggregation.py.
- Pass exhaustive tests.

Change-Id: I11d38ed0f223d6e5b32a19ebe725af7738ee4ab0
Reviewed-on: http://gerrit.cloudera.org:8080/20612
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-12-20 12:29:55 +00:00
Riza Suminto
feefcc6395 IMPALA-12518: Combine all exec_option dimension in test_vector.py
Before this patch, writing a pytest that exercises custom query option
values required declaring a new test dimension, deepcopying the original
vector, and inserting the selected dimension value into the 'exec_option'
dictionary of the generated vector.

This patch simplifies these steps by tracking dimensions that are intended
to be part of 'exec_option' and automatically combining them during
vector generation in test_vector.py. Such dimensions should be registered
via the new ImpalaTestMatrix.add_exec_option_dimension() function.
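
The combining step can be pictured with the sketch below. generate_vectors()
and its arguments are hypothetical; the actual logic in test_vector.py may
differ. Each registered dimension contributes one key to a fresh copy of
'exec_option' for every generated vector.

```python
import itertools


def generate_vectors(base_exec_option, exec_option_dimensions):
    """Cross the registered dimensions and fold each combo into exec_option."""
    names = sorted(exec_option_dimensions)
    vectors = []
    for combo in itertools.product(*(exec_option_dimensions[n] for n in names)):
        exec_option = dict(base_exec_option)  # fresh copy per vector
        exec_option.update(zip(names, combo))
        vectors.append({"exec_option": exec_option})
    return vectors


vectors = generate_vectors(
    {"batch_size": 0},
    {"disable_optimized_iceberg_v2_read": [False, True]})
```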

The function add_exec_option_dimension() in test_dimensions.py is renamed to
add_mandatory_exec_option() to make it consistent with the same
functionality in ImpalaTestMatrix and avoid confusion with the new
ImpalaTestMatrix.add_exec_option_dimension() function. Function name
add_exec_option_dimension() in test_dimensions.py is then repurposed as
a shorthand for ImpalaTestMatrix.add_exec_option_dimension().

The remaining changes for other pytest files will be done gradually.

Testing:
- Fix a bug in TestIcebergV2Table and confirm that both True and False
  values for the 'disable_optimized_iceberg_v2_read' option are exercised.
- Run and pass all modified tests in this patch.

Change-Id: I3adba260990fccf4d2f2e7c8c4e4fadc6fd43fe1
Reviewed-on: http://gerrit.cloudera.org:8080/20625
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
2023-10-27 03:22:53 +00:00
Michael Smith
8b2598cd70 IMPALA-12485: Remove Python 2 has_key
Replace calls to dict#has_key (Python 2-only) with 'key in dict' syntax.
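
A short illustration of the replacement, using a made-up options dictionary:

```python
options = {"batch_size": 0}

# Python 2 only (AttributeError on Python 3):
#   options.has_key("batch_size")
# Works on both Python 2 and Python 3:
has_batch_size = "batch_size" in options
has_mt_dop = "mt_dop" in options
```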

Change-Id: I08e9f6667011d70ceddbf919a61d1be7d6e07ee4
Reviewed-on: http://gerrit.cloudera.org:8080/20541
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-10-10 00:44:23 +00:00
Joe McDonnell
82bd087fb1 IMPALA-11973: Add absolute_import, division to all eligible Python files
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
 1. Python 3 requires absolute imports within packages. This
    can be emulated via "from __future__ import absolute_import"
 2. Python 3 changed division to "true" division that doesn't
    round to an integer. This can be emulated via
    "from __future__ import division"

This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.

I scrutinized each old-division location and converted some locations
to use the integer division '//' operator if it needed an integer
result (e.g. for indices, counts of records, etc). Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
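
A small example of the division difference, with made-up numbers. The block
shows Python 3 semantics; on Python 2 the same results require
"from __future__ import division" at the top of the module.

```python
num_rows = 7
num_batches = 2

# True division keeps the fractional part.
average_rows = num_rows / num_batches

# Floor division yields an integer, suitable for indices and counts.
rows_per_batch = num_rows // num_batches
middle_index = len(range(num_rows)) // 2
```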

Testing:
 - Ran core tests

Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
566df80891 IMPALA-11959: Add Python 3 virtualenv
This adds a Python 3 equivalent to the impala-python
virtualenv based on the toolchain Python 3.7.16.
This modifies bootstrap_virtualenv.py to support
the two different modes. This adds py2-requirements.txt
and py3-requirements.txt to allow some differences
between the Python 2 and Python 3 virtualenvs.

Here are some specific package changes:
 - allpairs is replaced with allpairspy, as allpairs did
   not support Python 3.
 - requests is upgraded slightly, because otherwise it has issues
   with idna==2.8.
 - pylint is limited to Python 3, because we are adding it
   and don't need it on both
 - flake8 is limited to Python 2, because it will take
   some work to switch to a version that works on Python 3
 - cm_api is limited to Python 2, because it doesn't support
   Python 3
 - pytest-random does not support Python 3 and it is unused,
   so it is removed
 - Bump the version of setuptools-scm to support Python 3

This adds impala-pylint, which can be used to do further
Python 3 checks via --py3k. This also adds a bin/check-pylint-py3k.sh
script to enforce specific py3k checks. The banned py3k warnings
are specified in the bin/banned_py3k_warnings.txt. This is currently
empty, but this can ratchet up the py3k strictness over time
to avoid regressions.

This pulls in a new toolchain with the fix for IMPALA-11956
to get Python 3.7.16.

Testing:
 - Hand tested that the allpairs libraries produce the
   same results
 - The python3 virtualenv has no influence on regular
   tests yet

Change-Id: Ica4853f440c9a46a79bd5fb8e0a66730b0b4efc0
Reviewed-on: http://gerrit.cloudera.org:8080/19567
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
c1794023bc IMPALA-11952 (part 3): Fix raise syntax
Python 3 does not support this old raise syntax:

raise Exception, "message"

Instead, it should be:

raise Exception("message")

This fixes all locations with the old raise syntax.

Testing:
 - check-python-syntax.sh shows no errors from raise syntax

Change-Id: I2722dcc2727fb65c7aedede12d73ca5b088326d7
Reviewed-on: http://gerrit.cloudera.org:8080/19553
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2023-02-28 17:11:50 +00:00
Steve Carlin
bb9fb663ce IMPALA-10778: Allow impala-shell to connect directly to HS2
Impala-shell already uses HS2 protocol to connect to Impalad.
This commit allows impala-shell to connect to any server (for
example, Hive) using the hs2 protocol. This will be done via
the "--strict_hs2_protocol" option.

When the "--strict_hs2_protocol" option is turned on, only features
supported by hs2 will work. For instance, "runtime-profile" is an
impalad specific feature and will be disabled.

The "--strict_hs2_protocol" will only work on servers that abide
by the strict definition of what is supported by HS2. So one will
be able to connect to Hive in this mode, but connections to Impala
will not work. Any feature supported by Hive (e.g. kerberos
authentication) should work as well.

Note: While authentication should work, the test framework is not
set up to create an HS2 server that does authentication at this point
so this feature should be used with caution.

Change-Id: I674a45640a4a7b3c9a577830dbc7b16a89865a9e
Reviewed-on: http://gerrit.cloudera.org:8080/17660
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-08-27 09:45:59 +00:00
Thomas Tauber-Marshall
a4ad8f35f7 IMPALA-8390: clean up test vectors in test_cancellation.py
Due to changes to TestCancellation made in IMPALA-7205 that were not
reflected in TestCancellationSerial and TestCancellationFullSort,
test_cancel_insert has not been running at all and test_cancel_sort
has been running with unintended parameters.

This patch re-enables test_cancel_insert, while including a number of
constraints on its parameters to keep test execution time reasonable.
It also fixes an incorrect constraint on test_cancel_sort.

The patch also makes some related improvements:
- Removes an xfail on test_cancel_insert related to a bug that is
  fixed now.
- When ImpalaTestVector.get_value() is called with a value name that
  does not actually exist in the vector, the result is a StopIteration
  exception. Due to python's questionable habit of using exceptions
  for flow control, StopIteration is frequently treated not as an
  error but as the normal end of iteration, which can result in
  unexpected behavior, eg. when pytest_generate_tests raises a
  StopIteration pytest just silently ignores it and drops the test
  case. This patch modifies get_value() to instead raise a ValueError
  in this situation.
- When a test has no vectors generated for it, the name of the test is
  now included in the logged warning.
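
The StopIteration hazard can be reproduced with the sketch below, which
assumes Python 3 and represents a vector as a plain list of (name, value)
pairs. get_value_old() and get_value() are simplified stand-ins for
ImpalaTestVector.get_value(), not the actual method.

```python
def get_value_old(vector, name):
    # next() on an exhausted generator raises StopIteration.
    return next(value for n, value in vector if n == name)


def get_value(vector, name):
    for n, value in vector:
        if n == name:
            return value
    raise ValueError("Test vector does not contain value '%s'" % name)


vectors = [
    [("table_format", "parquet")],
    [("batch_size", 0)],  # has no 'table_format' value
    [("table_format", "orc")],
]

# list() treats the StopIteration escaping from map() as the normal end of
# iteration, so the missing value and everything after it vanish silently.
truncated = list(map(lambda v: get_value_old(v, "table_format"), vectors))

try:
    list(map(lambda v: get_value(v, "table_format"), vectors))
    error = None
except ValueError as e:
    error = str(e)
```

With the ValueError variant the missing value surfaces as a real failure
instead of a shortened result list.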

Testing:
- Ran full core and exhaustive runs and verified that the expected
  test cases are run for test_cancellation.py now

Change-Id: I9673fe82bda5314aff6a51d1961759ff286fbf6f
Reviewed-on: http://gerrit.cloudera.org:8080/12960
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-04-13 00:27:19 +00:00
stiga-huang
818cd8fa27 IMPALA-5717: Support for reading ORC data files
This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies input needed from the orc-reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.

A startup option --enable_orc_scanner is added for this feature. It's
set to true by default. Setting it to false will fail queries on ORC
tables.

Currently, we only support reading primitive types. Writing into ORC
tables is not supported either.

Tests
 - Most of the end-to-end tests can run on ORC format.
 - Add tpcds, tpch tests for ORC.
 - Add some ORC specific tests.
 - Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
   is not robust against corrupt files (ORC-315).

Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Reviewed-on: http://gerrit.cloudera.org:8080/9134
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-11 05:13:02 +00:00
Taras Bobrovytsky
35a3e186d6 IMPALA-5478: Run TPCDS queries with decimal_v2 enabled
We add new TPCDS .test files that are expected to be run with decimal_v2
enabled. The new expected results were generated using Impala and I
inspected them manually.

Change-Id: Ib867c51a521ec4a087bc127d99aee4b95ba97733
Reviewed-on: http://gerrit.cloudera.org:8080/8985
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-18 03:28:51 +00:00
David Knupp
f590bc0da6 IMPALA-4750: Rename test infra classes so they don't mimic test classes.
This patch addresses warning messages from pytest re: the imported
TestMatrix, TestVector, and TestDimension classes, which were being
collected as potential test classes. The fix was simply to prefix
the class names with "Impala":

git grep -l 'TestDimension' | xargs \
    sed -i 's/TestDimension/ImpalaTestDimension/g'

git grep -l 'TestMatrix' | xargs \
    sed -i 's/TestMatrix/ImpalaTestMatrix/g'

git grep -l 'TestVector' | xargs \
    sed -i 's/TestVector/ImpalaTestVector/g'

The tests all passed in an exhaustive run on the upstream jenkins
server:

http://jenkins.impala.io:8080/view/Utility/job/pre-review-test/8/

Change-Id: I06b7bc6fd99fbb637a47ba376bf9830705c1fce1
Reviewed-on: http://gerrit.cloudera.org:8080/5794
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
2017-01-26 23:40:22 +00:00
Dan Hecht
ffa7829b70 IMPALA-3918: Remove Cloudera copyrights and add ASF license header
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:

http://www.apache.org/legal/src-headers.html#headers

Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
   http://www.apache.org/legal/src-headers.html#notice
   to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
   on the website.

Much of this change was automatically generated via:

git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]

Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.

[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
    modification to ORIG_LICENSE to match Impala's license text.

Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-08-09 08:19:41 +00:00
Taras Bobrovytsky
609b80410e Clean up Python test import statements
Many of our test scripts have import statements that look like
"from xxx import *". It is a good practice to explicitly name what
needs to be imported. This commit implements this practice. Also,
unused import statements are removed.

Change-Id: I6a33bb66552ae657d1725f765842f648faeb26a8
Reviewed-on: http://gerrit.cloudera.org:8080/3444
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Internal Jenkins
2016-07-15 23:26:18 +00:00
Casey Ching
074e5b4349 Remove hashbang from non-script python files
Many python files had a hashbang and the executable bit set though
they were not intended to be run as a standalone script. That makes
determining which python files are actually scripts very difficult.
A future patch will update the hashbang in real python scripts so they
use $IMPALA_HOME/bin/impala-python.

Change-Id: I04eafdc73201feefe65b85817a00474e182ec2ba
Reviewed-on: http://gerrit.cloudera.org:8080/599
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2015-08-04 05:26:07 +00:00
Alex Behm
37ca6b81ae IMPALA-1567: Ignore 'hidden' files with special suffixes.
Currently, we only consider files hidden if they have the special
prefixes "." or "_". However, some tools use special suffixes
to indicate a file is being operated on, and should be considered
invisible.

This patch adds the following hidden suffixes:
'.tmp' - Flume's default for temp files
'.copying' - hdfs put may produce these
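
In Python terms the resulting check might look like the sketch below.
is_hidden() is a hypothetical helper, not part of this patch, and it assumes
a case-sensitive match on the file name only.

```python
HIDDEN_PREFIXES = (".", "_")
HIDDEN_SUFFIXES = (".tmp", ".copying")


def is_hidden(file_name):
    """Return True if the file should be treated as invisible."""
    # str.startswith/endswith accept a tuple of alternatives.
    return (file_name.startswith(HIDDEN_PREFIXES)
            or file_name.endswith(HIDDEN_SUFFIXES))


visible = [f for f in ["part-0000", "_SUCCESS", "events.tmp", "data.copying"]
           if not is_hidden(f)]
```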

Change-Id: I151eafd0286fa91e062407e12dd71cfddd442430
Reviewed-on: http://gerrit.cloudera.org:8080/80
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-02-24 10:55:22 +00:00
Lenni Kuff
ebd750acc6 Minor cleanup of test_spilling custom cluster test suite
Change-Id: If853893db082eae79a6ec22180e9ad5572c58f05
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4455
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
2014-09-21 19:43:50 -07:00
Ippokratis Pandis
fe0646f76b IMPALA-1022: Handle cases where in Parquet the expected number of rows in metadata is wrong
There are Parquet files whose metadata indicates the wrong number of rows.
Until now, the parquet-scanner did not report any problem in this case.
Instead, it kept reading as long as there were values for the read columns.
But with IMPALA-1016 we now read at most as many rows as the metadata specifies.
With this patch, the parquet-scanner, right before it finishes scanning, checks whether
it read the expected number of rows (taken from metadata). In cases where the actual
number of rows read is less than or greater than the expected number, it either aborts
or logs an error.

Change-Id: Ie6a66a38e8912730bf04762e6526ec1cadb2bcdc
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2755
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2944
2014-06-10 17:27:54 -07:00
Lenni Kuff
8d1674f638 Run only subset of tests with small batch_sizes + a few small fixes 2014-01-08 10:48:58 -08:00
Lenni Kuff
45c1cbe1fd Use Python 2.6 style dictionary comprehension for building test dimensions 2014-01-08 10:47:05 -08:00
Lenni Kuff
ef9a5c2d0e Add test suite for DEFAULT_ORDER_BY_LIMIT query option 2014-01-08 10:47:05 -08:00
Nong Li
b575b08357 Fix planner to reject compressed text formats. 2014-01-08 10:47:01 -08:00
Lenni Kuff
ef48f65e76 Add test framework for running Impala query tests via Python
This is the first set of changes required to start getting our functional test
infrastructure moved from JUnit to Python. After investigating a number of
options, I decided to go with a python test executor named py.test
(http://pytest.org/). It is very flexible, open source (MIT licensed), and will
enable us to do some cool things like parallel test execution.

As part of this change, we now use our "test vectors" for query test execution.
This will be very nice because it means that if you load the "core" dataset you
know you will be able to run the "core" query tests (specified by
--exploration_strategy when running the tests).

You will see that each combination of table format + query exec options is now
treated like an individual test case. This will make it much easier to debug
exactly where something failed.

These new tests can be run using the script at tests/run-tests.sh
2014-01-08 10:46:50 -08:00