Commit Graph

1104 Commits

Author SHA1 Message Date
Michael Brown
fe2be25245 IMPALA-4775: minor adjustments to python test infra logging
- Set up log handler to append, not truncate. This was the cause of
  IMPALA-4775.

Other improvements:
- Log a thread name, not thread ID. Thread names are more useful.
- Use ISO 8601-like timestamps
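
A minimal sketch of the logging setup described above, using only the standard logging module. The helper name, logger name, and format string are assumptions for illustration and are not taken from the patch:

```python
import logging
import os
import tempfile


def make_logger(name, log_path):
    """Return (logger, handler); the handler appends to log_path.

    Hypothetical helper, not the function used in the Impala test infra.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # mode="a" appends to an existing file; mode="w" would truncate it.
    handler = logging.FileHandler(log_path, mode="a")
    # %(threadName)s logs the thread name rather than the numeric thread ID;
    # datefmt produces an ISO 8601-like timestamp.
    handler.setFormatter(logging.Formatter(
        fmt="%(asctime)s %(threadName)s %(levelname)s: %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%S"))
    logger.addHandler(handler)
    return logger, handler


log_path = os.path.join(tempfile.mkdtemp(), "example.log")
for run in ("first run", "second run"):
    logger, handler = make_logger("example", log_path)
    logger.info(run)
    # Detach and close so each iteration behaves like a fresh process.
    logger.removeHandler(handler)
    handler.close()

with open(log_path) as f:
    lines = f.read().splitlines()
```

With append mode, the second run adds a line instead of replacing the first.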

I tested that running discrepancy_searcher.py doesn't overwrite its logs
anymore. One such command that could reproduce it is:

tests/comparison/discrepancy_searcher.py \
  --use-postgresql \
  --query-count 1 \
  --db-name tpch_kudu

I also ensured the stress test (concurrent_select.py) still logged to
its file.

Change-Id: I2b7af5b2be20f3c6f38d25612f6888433c62d693
Reviewed-on: http://gerrit.cloudera.org:8080/5746
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-01-20 02:24:47 +00:00
Tim Armstrong
69859bddfb IMPALA-4549: consistently treat 9999 as upper bound for timestamp year
Previously Impala was inconsistent about whether the year 10000 was
supported, as a result of inconsistency in boost, which reported the
maximum year as 9999 but sometimes allowed 10000. This meant that
Impala sometimes accepted the year 10000 and sometimes not.

Use the patched boost version and update tests accordingly.

Testing:
Ran an exhaustive build.

Change-Id: Iaf23b40833017789d879e5da7bb10384129e2d10
Reviewed-on: http://gerrit.cloudera.org:8080/5665
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-01-19 00:04:27 +00:00
Michael Brown
db7facdee0 IMPALA-4351,IMPALA-4353: [qgen] randomly generate INSERT statements
- Generate INSERT statements that are either INSERT ... VALUES or INSERT
  ... SELECT

- On both types of INSERTs, we either insert into all columns, or into
  some column list. If the column list exists, all primary keys will be
  present, and 0 or more additional columns will also be in the list.
  The ordering of the column list is random.

- For INSERT ... SELECT, occasionally generate a WITH clause

- For INSERT ... VALUES, generate non-null constants for the primary
  keys, but for the non-primary keys, randomly generate a value
  expression.

The type system in the random statement/query generator isn't
sophisticated enough to know the implicit type of a SELECT item or a value
expression. It knows it will be some INT-based type, but not if it's
going to be a SMALLINT or a BIGINT. To get around this, the easiest
thing seems to be to explicitly cast the SELECT items or value
expressions to the columns' so-called exact_type attribute.

Much of the testing here involved running discrepancy_searcher.py
--explain-only on both tpch_kudu and a random HDFS table, using both the
default profile and DML-only profile. This was done to quickly find bugs
in the statement generation, as they tend to bubble up as analysis
errors. I expect to make other changes as follow on patches and more
random statements find small test issues.

For actual use against Kudu data, you need to migrate data from Kudu
into PostgreSQL 5 (instructions in tests/comparison/POSTGRES.txt) and run
something like:

tests/comparison/discrepancy_searcher.py \
  --use-postgresql \
  --postgresql-port 5433 \
  --profile dmlonly \
  --timeout 300 \
  --db-name tpch_kudu \
  --query-count 10

Change-Id: I842b41f0eed07ab30ec76d8fc3cdd5affb525af6
Reviewed-on: http://gerrit.cloudera.org:8080/5486
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
2017-01-13 01:31:47 +00:00
Michael Brown
54665120cb IMPALA-4355: random query generator: modify statement execution flow to support DML
- Rework the discrepancy searcher to run DML statements. We do this by
  using the query profile to choose a table, copy that table, and
  generate a statement that will INSERT into that copy. We chose a slow
  copy over other methods because INSERTing into a copy is a more
  reliable test that prevents table sizes from getting out of hand or
  time-consuming replay to reproduce a particular statement.

- Introduce a statement generator stub. The real generator work is
  tracked in IMPALA-4351 and IMPALA-4353. Here we simply generate a
  basic INSERT INTO ... VALUES statement to make sure our general query
  execution flow is working.

- Add query profile stub for DML statements (INSERT-only at this time).
  Since we'll want INSERT INTO ... SELECT very soon, this inherits from
  DefaultProfile. Also add building blocks for choosing random
  statements in the DefaultProfile.

- Improve the concept of an "execution mode" and add new modes. Before,
  we had "RAW", "CREATE_TABLE_AS", and "CREATE_VIEW_AS". The idea here
  is that some random SELECT queries could be generated as "CREATE
  TABLE|VIEW AS" at execution time, based on weights in the query
  profile. First, we remove the use of raw string literals for this,
  since raw string literals can be error-prone, and introduce a
  StatementExecutionMode class to contain a namespace for the enumerated
  statement execution modes. Second, we introduce a couple new execution
  modes. The first is DML_SETUP: this is a DML statement that needs to
  be run in both the test and reference databases concurrently. For our
  purposes, it's the INSERT ... SELECT that copies data from the chosen
  random table into the table copy. The second is DML_TEST: this is a
  randomly-generated DML statement.

- Switch to using absolute imports in many places. There was a mix of
  absolute and relative imports happening here, and they were causing
  problems, especially when comparing data types. In Python,
  <class 'db_types.Int'> != <class 'tests.comparison.db_types.Int'>.
  Using
    from __future__ import absolute_import
  didn't seem to catch the relative import usage anyway, so I haven't
  employed that.

- Rename some, but not nearly all, names from "query" to "statement".
  Doing this is a rather large undertaking leading to much larger diffs
  and testing (IMPALA-4602).

- Fix a handful of flake8 warnings. There are a bunch that went unfixed
  for over- and under-indentation.

- Testing
  o ./discrepancy_searcher.py runs with and without --explain-only, and
  with --profile default and --profile dmlonly. For tpch_kudu data, it
  seems sufficient to use a --timeout of about 300.
  o Leopard run to make sure standard SELECT-only generation still works
  o Generated random stress queries locally
  o Generated random data locally
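
The class identity problem mentioned in the absolute imports item can be sketched as follows. This is an illustration only: it simulates the same source being loaded under two module names, which is an assumption about how the mixed imports behaved, and load_as is a hypothetical helper:

```python
import types

# Stand-in for the source of a module such as db_types.py.
SOURCE = "class Int(object):\n    pass\n"


def load_as(module_name):
    """Execute SOURCE as a fresh module with the given name."""
    module = types.ModuleType(module_name)
    exec(SOURCE, module.__dict__)
    return module


# The same file reached through a relative-style and an absolute-style name.
relative = load_as("db_types")
absolute = load_as("tests.comparison.db_types")

# Two distinct class objects result, so type comparisons fail.
same_class = relative.Int is absolute.Int
cross_isinstance = isinstance(relative.Int(), absolute.Int)
```

Both same_class and cross_isinstance are False here, which is why importing consistently under one name matters when comparing data types.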

Change-Id: Ia4c63a2223185d0e056cc5713796772e5d1b8414
Reviewed-on: http://gerrit.cloudera.org:8080/5387
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
2017-01-12 21:40:39 +00:00
Lars Volker
8b7f876649 IMPALA-4722: Disable log caching in test_scratch_disk
test_scratch_disk fails sporadically when trying to assert the presence
of log messages. This is probably caused by log caching, since after
such failures the log files do contain the lines in question.

I manually tested this by running the tests repeatedly for 2 days (10k
runs).

To make future diagnosis of similar problems easier, this change also
adds more output to assert_impalad_log_contains().

Change-Id: I9f21284338ee7b4374aca249b6556282b0148389
Reviewed-on: http://gerrit.cloudera.org:8080/5669
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-01-12 18:58:48 +00:00
Tim Armstrong
75027c913b IMPALA-4745: fix TestScratchLimit failure on S3
The commit "IMPALA-3202,IMPALA-2079: rework scratch file I/O"
improved efficiency of scratch file use in some scenarios.
TestScratchLimit::test_with_low_scratch_limit started failing
on S3, because it expects to use more than 50MB of scratch space.

Testing:
Ran the test in a loop locally for 50+ iterations - didn't see
any failures.

Change-Id: I607b4c6ad10eba0e6c7bc8d6e640d42da26ee6c8
Reviewed-on: http://gerrit.cloudera.org:8080/5654
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-01-11 03:47:29 +00:00
Lars Volker
ac59489df9 IMPALA-4751: Remove blank line from raw_text template
The additional blank line can break tooling which uses the
/query_profile_encoded endpoint and has been erroneously introduced in
the fix for IMPALA-3918.

Change-Id: I9b688aa9e2423b0271c8891a983e5b22707d8dbc
Reviewed-on: http://gerrit.cloudera.org:8080/5664
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-01-10 21:16:57 +00:00
Jim Apple
0ee6d19d59 IMPALA-4742: Change "{}".format() to "{0}".format() for Py 2.6
From the Python docs:

"Changed in version 2.7: The positional argument specifiers can be
omitted, so '{} {}' is equivalent to '{0} {1}'."

http://gerrit.cloudera.org:8080/5401 used the newer form,
"{}".format(). This change uses the older backwards-compatible
form.
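
As a small illustration of the two forms (the function and argument names are made up for this example):

```python
def describe(name, value):
    # Explicit positional indices work on Python 2.6 and on later versions.
    return "{0}={1}".format(name, value)


explicit = describe("timeout", 300)
# Auto-numbered fields are only accepted from Python 2.7 onwards.
implicit = "{}={}".format("timeout", 300)
```

On interpreters that accept both forms, the two strings are identical.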

Change-Id: If78b9b4061ca191932ac5b0b14e0ee8951a9d4e8
Reviewed-on: http://gerrit.cloudera.org:8080/5641
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-01-08 23:52:01 +00:00
Jim Apple
9fa2ff7138 IMPALA-2605: Omit the sort and mini stress tests
These stress tests were sometimes causing the end-to-end tests to hang
indefinitely, including in the pre-merge testing (sometimes called
"GVO" or "GVM").

This patch also prints to stdout some connection metrics that may
prove useful for debugging stress test hangs in the future. The
metrics are printed before and after stress tests are run when
run-tests.py is used.

Change-Id: Ibd30abf8215415e0f2830b725e43b005daa2bb2d
Reviewed-on: http://gerrit.cloudera.org:8080/5401
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
2017-01-06 21:32:15 +00:00
Tim Armstrong
95ed4434f2 IMPALA-3202,IMPALA-2079: rework scratch file I/O
Refactor BufferedBlockMgr/TmpFileMgr to push more I/O logic into
TmpFileMgr, in anticipation of it being shared with BufferPool.
TmpFileMgr now handles:
* Scratch space allocation and recycling
* Read and write I/O

The interface is also greatly changed so that it is built around Write()
and Read() calls, abstracting away the details of temporary file
allocation from clients. This means the TmpFileMgr::File class can
be hidden from clients.

Write error recovery:
Also implement write error recovery in TmpFileMgr.

If an error occurs while writing to scratch and we have multiple
scratch directories, we will try one of the other directories
before cancelling the query. File-level blacklisting is used to
prevent excessive repeated attempts to resize a scratch file during
a single query. Device-level blacklisting is not implemented because
it is problematic to permanently take a scratch directory out of use.

To reduce the number of error paths, all I/O errors are now handled
asynchronously. Previously errors creating or extending the file were
returned synchronously from WriteUnpinnedBlock(). This required
modifying DiskIoMgr to create the file if not present when opened.

Also set the default max_errors value in the thrift definition file,
so that it is in effect for backend tests.

Future Work:
* Support for recycling variable-length scratch file ranges. I omitted
  this to avoid making the patch even larger.

Testing:
Updated BufferedBlockMgr unit test to reflect changes in behaviour:
* Scratch space is no longer permanently associated with a block, and
  is remapped every time a new block is written to disk.
* Files are now blacklisted - updated existing tests and enable the
  disable blacklisting test.

Added some basic testing of recycling of scratch file ranges in
the TmpFileMgr unit test.

I also manually tested the code in two ways. First by removing permissions
for /tmp/impala-scratch and ensuring that a spilling query fails cleanly.
Second, by creating a tiny ramdisk (16M) and running with two scratch
directories: one on /tmp and one on the tiny ramdisk. When spilling, an
out of space error is encountered for the tiny ramdisk and impala spills
the remaining data (72M) to /tmp.

Change-Id: I8c9c587df006d2f09d72dd636adafbd295fcdc17
Reviewed-on: http://gerrit.cloudera.org:8080/5141
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-01-05 02:26:24 +00:00
Lars Volker
25ebf586e0 IMPALA-4689: Fix computation of last active time
The last active time in impala-server.cc#L1806 is in milliseconds, but
the TimestampValue c'tor expects seconds. This change also renames some
variables to make their meaning more explicit, aiming to prevent similar
bugs in the future.
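
The unit mismatch can be sketched in Python as below. The actual fix is in C++; the names here are hypothetical and only show the idea of carrying the unit in the variable name and converting before constructing a timestamp:

```python
import datetime

MILLIS_PER_SECOND = 1000


def last_active_timestamp(last_active_time_ms):
    """Convert a millisecond epoch value to a UTC datetime."""
    # The constructor below expects seconds, so convert explicitly.
    last_active_time_s = last_active_time_ms // MILLIS_PER_SECOND
    return datetime.datetime.fromtimestamp(
        last_active_time_s, tz=datetime.timezone.utc)


stamp = last_active_timestamp(1483531924000)
```

Passing the millisecond value directly where seconds are expected would yield a date tens of thousands of years in the future.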

This change also fixes a bug that occurred when the operating system
PIDs wrapped around during startup of the local minicluster. In that
case the first impalad would not be the one with the smallest PID and
ImpalaCluster.get_first_impalad() would return the wrong one.

I ran git-clang-format on the change.

Change-Id: I283564c8d8e145d44d9493f4201555d3a1087edf
Reviewed-on: http://gerrit.cloudera.org:8080/5546
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2017-01-04 12:12:04 +00:00
Taras Bobrovytsky
2159beee89 IMPALA-4467: Add support for DML statements in stress test
- Add support for insert, upsert, update and delete statements.
- Add support for compute stats with mt_dop query options.
- Update impyla version in order to be able to have access to query
  error text for DML queries.
- Made flake8 fixes. flake8 on this file is clean.

For every Kudu table in the databases, we make a copy and add a
'_original' suffix to the table name. The DML queries will only make
modifications to the non-original table; the original table will never
be modified. The original tables can be used to restore the non-original
table to its initial state. Two flags were added for doing this:
--reset-databases-before-binary-search and
--reset-databases-after-binary-search.

The DML queries are generated based on the mod values passed in with the
following flag: --dml-mod-values 11 13 17. For each mod value 4 DML
queries are generated. The DML operations will touch table rows where
primary_key % mod_value = 0. So, the larger the mod value, the fewer rows
are affected. The DML queries are generated in such a way that the
data for the insert, upsert, and update queries is taken from the table
with the _original suffix. The stress test generates DML queries only
for Kudu databases. For example, --tpch-kudu-db=tpch_100_kudu
--tpch-db=tpch_100 --generate-dml-queries would only generate queries
for the tpch_100_kudu database.
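
A rough sketch of how a mod value selects rows, assuming integer primary keys 1 through 1000 (the key range and helper name are made up for illustration):

```python
def affected_keys(primary_keys, mod_value):
    """Return the keys a DML statement with this mod value would touch."""
    return [k for k in primary_keys if k % mod_value == 0]


keys = range(1, 1001)
# Number of touched rows for each mod value from the example flag.
counts = dict((m, len(affected_keys(keys, m))) for m in (11, 13, 17))
```

For this key range the counts are 90, 76, and 58 rows for mod values 11, 13, and 17 respectively.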

Here's an example of a full call with the new options that runs the
stress test on the local mini cluster:
./concurrent_select.py \
    --tpch-kudu-db=tpch_kudu \
    --generate-dml-queries \
    --dml-mod-values 11 13 17 \
    --generate-compute-stats-queries \
    --select-probability=0.5 \
    --mem-limit-padding-pct=25 \
    --mem-limit-padding-abs=50 \
    --reset-databases-before-binary-search \
    --reset-databases-after-binary-search

Change-Id: Ia2aafdc6851cc0e1677a3c668d3350e47c4bfe40
Reviewed-on: http://gerrit.cloudera.org:8080/5093
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins
2016-12-20 01:33:01 +00:00
David Knupp
6c5f8e3f5e IMPALA-4639: Add pytest option and xfail markers for tests that only run locally.
As we're beginning to run Impala end-to-end tests on remote clusters, we're
finding some tests that do not pass for infrastructure-related reasons (as
opposed to product issues.) It would be useful to be able to xfail any tests
that we know to be problematic within a given module, yet still run the
others. This way, we can get passing test runs as we're ironing out those
infrastructure issues.

Change-Id: Id4d6e46dc1e64ad20c727ccb19af7a9f3daf917f
Reviewed-on: http://gerrit.cloudera.org:8080/5446
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-12-15 02:45:50 +00:00
Tim Armstrong
246acba0b3 IMPALA-4659: fuzz test fixes
* Apply a 512m mem_limit to all fuzz tests. This limits aggregate memory
  consumption to ~5GB per daemon (assuming 10 concurrent tests).
* Refactor the exec option handling to use the exec_option dimension.
  This avoids executing the test multiple times redundantly.
* Remove the xfails to reduce noise, since there is no immediate plan to
  fix the product bugs. Instead just pass the tests.

Testing:
Ran in a loop for ~1h to flush out flakiness.

Change-Id: Ie1942ceef252ec3e6171a0a54722b66a7d9abbd7
Reviewed-on: http://gerrit.cloudera.org:8080/5502
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2016-12-15 01:31:28 +00:00
Matthew Jacobs
652e7d56d9 IMPALA-4654: KuduScanner must return when ReachedLimit()
Fixes a bug in the KuduScanner where the scan node's limit
was not respected and thus the scanner thread would
continue executing until the scan range was fully consumed.
This could result in completed queries leaving fragments
running and those threads could be using significant CPU and
memory.

For example, the query 'select * from tpch_kudu.lineitem
limit 90' when running in the minicluster and lineitem is
partitioned into 3 hash partitions would end up leaving a
scanner thread running for ~60 seconds. In real world
scenarios this can cause unexpected resource consumption.
This could build up over time leading to query failures if
these queries are submitted frequently.

The fix is to ensure KuduScanner::GetNext() returns with
eos=true when it finds ReachedLimit=true. An unnecessary and
somewhat confusing flag 'batch_done' was being returned by a
helper function DecodeRowsIntoRowBatch, which isn't
necessary and was removed in order to make it more clear how
the code in GetNext() should behave.

Change-Id: Iaddd51111a1b2647995d68e6d37d0500b3a322de
Reviewed-on: http://gerrit.cloudera.org:8080/5493
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-12-14 23:24:47 +00:00
Matthew Jacobs
73e41cea19 IMPALA-4642: Fix TestFragmentLifecycle failures; kudu test must wait
Fixes test failures in TestFragmentLifecycle when it runs
after TestKuduMemLimits which takes some time for all
fragments to finish closing, even though the query is
finished. TestFragmentLifecycle checks that there are no
fragments in flight. For now, this fixes the tests by
forcing TestKuduMemLimits to wait for all 'in flight'
fragments to complete before continuing. We still need to
understand why the KuduScanNode/KuduScanner is taking so
long to Close() (see IMPALA-4654).

Change-Id: Ia655a37ff06e92cc55ba05f01d5e94fe39447c65
Reviewed-on: http://gerrit.cloudera.org:8080/5481
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
2016-12-13 03:12:42 +00:00
Lars Volker
1e683d4ee6 IMPALA-4403: Implement SHOW RANGE PARTITIONS for Kudu tables
Change-Id: Idf5b2fdd02938a42fa59ec98884e4ac915dd1f65
Reviewed-on: http://gerrit.cloudera.org:8080/5390
Reviewed-by: Lars Volker <lv@cloudera.com>
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-12-10 00:05:50 +00:00
Lars Volker
e1a6db7609 Bump Kudu server version to latest master (a70c905006)
This also re-enabled kudu_alter.test, which was disabled in IMPALA-4628.

Change-Id: Ie5acdeffea7ed9a68ce0f48d1f68c6c922044704
Reviewed-on: http://gerrit.cloudera.org:8080/5427
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-12-09 19:24:50 +00:00
Lars Volker
02b5cce846 IMPALA-4628: Disable broken kudu test to unblock GVOs
Change-Id: I30d45acb26eb3e709a1994a89e8444ca9530d8cc
Reviewed-on: http://gerrit.cloudera.org:8080/5428
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2016-12-08 21:58:46 +00:00
Tim Armstrong
88448d1d4a IMPALA-4586: don't constant fold in backend
This patch ensures that setting the query option
enable_expr_rewrites=false will disable both constant folding in the
frontend (which it did already) and constant caching in the backend
(which is enabled in this patch). This gives a way for users to revert
to the old behaviour of non-deterministic UDFs before these
optimisations were added in Impala 2.8.

Before this patch, the backend would cache values based on IsConstant().
This meant that there was no way to override caching of values of
non-deterministic UDFs, e.g. with enable_expr_rewrites.

After this patch, we only cache literal values in the backend. This
offers the same performance as before in the common case where the
frontend will constant fold the expressions anyway.

Also rename some functions to more cleanly separate the backend concepts
of "constant" expressions and expressions that can be evaluated without
a TupleRow. In a future change (IMPALA-4617) we should remove the
IsConstant() analysis logic from the backend entirely and pass the
information from the frontend. We should also fix isConstant() in the
frontend so that it only returns true when it is safe to constant-fold
the expression (IMPALA-4606). Once that is done, we could revert back
to using IsConstant() instead of IsLiteral().

Testing:
Added targeted test to test constant folding of UDFs: we expect
different results depending on whether constant folding is enabled.

Also run TestUdfs with expr rewrites enabled and disabled, since this
can exercise different code paths. Refactored test_udfs somewhat to
avoid running uninteresting combinations of query options for
targeted tests and removed some 'drop * if not exists' statements
that aren't necessary when using unique_database.

This change revealed flakiness in test_mem_limit, which seems
to have only worked by coincidence. Updated TrackAllocation() to
actually set the query status when a memory limit is exceeded.
Looped this test for a while to make sure it isn't flaky any
more.

Also fix other test bugs where the vector argument is modified
in-place, which can leak out to other tests.
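
The in-place modification problem can be sketched as follows. A plain dict stands in for the test vector object here, which is an assumption, and with_exec_option is a hypothetical helper:

```python
import copy


def with_exec_option(vector, key, value):
    """Return a copy of vector with one exec option overridden."""
    # Copy first so the shared vector seen by other tests is untouched.
    new_vector = copy.deepcopy(vector)
    new_vector["exec_option"][key] = value
    return new_vector


shared_vector = {"exec_option": {"disable_codegen": False}}
modified = with_exec_option(shared_vector, "enable_expr_rewrites", False)
```

Only the copy carries the new option, so it cannot leak into other tests that receive the same vector.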

Change-Id: I0c76e3c8a8d92749256c312080ecd7aac5d99ce7
Reviewed-on: http://gerrit.cloudera.org:8080/5391
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2016-12-08 04:53:53 +00:00
Taras Bobrovytsky
1083639ff2 IMPALA-4585: Allow the $DATABASE template in the CATCH section
In a recent change (IMPALA-4363) we introduced a change where all file
paths in .test files should be replaced with '__HDFS_FILENAME__'. This
caused problems for tests on non-HDFS file systems and we also lost some
test coverage. This patch fixes the problem by allowing the $DATABASE
template in the catch section of the .test file.

Change-Id: If0f6ae8dea7ac4cdaf0c61ebd8f0c589c353a96e
Reviewed-on: http://gerrit.cloudera.org:8080/5372
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2016-12-08 02:20:50 +00:00
Michael Ho
9337518137 IMPALA-4595: Ignore discarded functions after linking
For LLVM IR UDF, Impalad will link an external LLVM module
in which the IR UDF is defined with the main module. If it
happens that a symbol is defined in both modules, LLVM may
choose to discard the one defined in the external module.
The discarded function and its callee will not be present
in the linked module.

In IMPALA-4595, udf-sample.cc was compiled without any
optimization. A duplicated definition such as StringVal::null()
may have a different inlining level in the external module
than in the main module. When the duplicated definition in
the external module is discarded, some of its callee
functions (which are not inlined) may not be defined in the
main module so they can no longer be located in the linked
module. This trips up some code in the LlvmCodegen::LinkModule().
In particular, when parsing for functions in external module
which are materialized during linking, certain functions may
not be present due to the reason above. Impalad will hit
a DCHECK in debug build or crash due to null pointer access
in release build.

This change fixes the problem above by taking into account
that certain functions may not be defined anymore after linking.
This change also fixes two incorrect status propagations in
fe-support.cc.

Change-Id: Iaa056a0c888bfcc95b412e1bc1063bb607b58ab7
Reviewed-on: http://gerrit.cloudera.org:8080/5384
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
2016-12-07 22:52:35 +00:00
Dan Burkert
f83652c1da Replace INTO N BUCKETS with PARTITIONS N in CREATE TABLE
This commit also removes the `DISTRIBUTE`, `SPLIT`, and
`BUCKETS` keywords that were going to be newly released in Impala 2.6
but are now unused. Additionally, a few remaining uses of the
`DISTRIBUTE BY` syntax have been switched to `PARTITION BY`.

Change-Id: I32fdd5ef26c532f7a30220db52bdfbf228165922
Reviewed-on: http://gerrit.cloudera.org:8080/5382
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-12-07 07:31:16 +00:00
Michael Ho
9b80224f9f IMPALA-2925: Mark test_alloc_update as xfail.
test_alloc_update.py is flaky and the expected failure sometimes
doesn't occur. Mark this test as xfail for now to unblock the build.

Change-Id: If4e86e7b9c064bc78b672814cd3569453ecc268d
Reviewed-on: http://gerrit.cloudera.org:8080/5366
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-12-06 12:37:17 +00:00
Dimitris Tsirogiannis
cba93f1ac3 IMPALA-4561: Replace DISTRIBUTE BY with PARTITION BY in CREATE TABLE
Change-Id: I0e07c41eabb4c8cb95754cf04293cbd9e03d6ab2
Reviewed-on: http://gerrit.cloudera.org:8080/5317
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-12-06 10:41:53 +00:00
Alex Behm
7efa08316e IMPALA-4572: Run COMPUTE STATS on Parquet tables with MT_DOP=4.
COMPUTE STATS on Parquet tables is run with MT_DOP=4 by default.
COMPUTE STATS on non-Parquet tables will run without MT_DOP.

Users can always override the behavior by setting MT_DOP manually.
Setting MT_DOP to 0 means a statement will be run in the
conventional execution mode (without intra-node parallelism based
on multiple fragment instances). Users can set a higher MT_DOP
even for Parquet tables.
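
The rule above can be sketched as a small function. This assumes that running "without MT_DOP" is equivalent to MT_DOP=0; the function and constant names are hypothetical and not taken from the patch:

```python
DEFAULT_PARQUET_COMPUTE_STATS_MT_DOP = 4


def effective_mt_dop(is_parquet, user_mt_dop=None):
    """Return the MT_DOP that COMPUTE STATS would run with."""
    # A manually set MT_DOP always overrides the default behavior.
    if user_mt_dop is not None:
        return user_mt_dop
    # Assumption: 0 stands for the conventional execution mode.
    return DEFAULT_PARQUET_COMPUTE_STATS_MT_DOP if is_parquet else 0
```

For example, effective_mt_dop(True) gives 4, while effective_mt_dop(True, 0) gives 0 because the user setting wins.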

Testing: Added a new test that checks the effective MT_DOP.
Locally ran test_mt_dop.py, test_scanners.py, test_nested_types.py,
test_compute_stats.py, and test_cancellation.py.

Change-Id: I2be3c7c9f3004e9a759224a2e5756eb6e4efa359
Reviewed-on: http://gerrit.cloudera.org:8080/5315
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-12-03 22:28:53 +00:00
Henry Robinson
b9034ea0d5 IMPALA-4580: Fix crash with FETCH_FIRST when #rows < result cache size
The following sequence can lead to a crash:

  1. Client sets result cache size to N
  2. Client issues query with #results < N
  3. Client fetches all results, triggering eos and tearing down
     Coordinator::root_sink_.
  4. Client restarts query with FETCH_FIRST.
  5. Client reads all results again. After cache is exhausted,
     Coordinator::GetNext() is called to detect eos condition again.
  6. GetNext() hits DCHECK(root_sink_ != nullptr).

This patch makes GetNext() a no-op if called after it sets *eos,
avoiding the crash.
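
A simplified Python model of the guard, for illustration only. The real change is in the C++ Coordinator; the class below is a hypothetical stand-in that returns all rows in one call:

```python
class Coordinator(object):
    """Toy model of a coordinator whose sink is torn down at eos."""

    def __init__(self, rows):
        self._root_sink = list(rows)
        self._returned_eos = False

    def get_next(self):
        """Return (rows, eos)."""
        if self._returned_eos:
            # No-op after eos: never touch the torn-down sink again.
            return [], True
        assert self._root_sink is not None
        rows = self._root_sink
        self._root_sink = None  # Sink is torn down once eos is reached.
        self._returned_eos = True
        return rows, True
```

Without the early return, a second get_next() call would hit the assertion, mirroring step 6 of the sequence above.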

Testing:
  Regression test that triggered the bug before this fix.

Change-Id: I454cd8a6cf438bdd0c49fd27c2725d8f6c43bb1d
Reviewed-on: http://gerrit.cloudera.org:8080/5335
Reviewed-by: Henry Robinson <henry@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-12-03 11:07:04 +00:00
Taras Bobrovytsky
858f5c2197 IMPALA-4363: Add Parquet timestamp validation
Before this patch, we would simply read the INT96 Parquet timestamp
representation and assume that it's valid. However, not all bit
permutations represent a valid timestamp. One of the boost functions
raised an exception (that we didn't catch) when passed an invalid
boost date object, which resulted in a crash. This patch fixes
the problem by validating that the date falls into the 1400..9999 year
range as we are scanning Parquet.
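
A sketch of this kind of validation in Python, assuming the common INT96 layout of 8 little-endian bytes of nanoseconds within the day followed by a 4-byte Julian day number. The function name and the extra nanoseconds check are assumptions, not the code in the patch:

```python
import datetime
import struct

MIN_YEAR, MAX_YEAR = 1400, 9999
NANOS_PER_DAY = 24 * 60 * 60 * 10**9
# Julian day 2440588 is 1970-01-01, whose proleptic Gregorian ordinal
# in Python is 719163; the difference is the offset below.
JULIAN_TO_ORDINAL_OFFSET = 1721425


def decode_int96_date(raw):
    """Return the date encoded in 12 raw bytes, or None if it is invalid."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    # Assumed extra check: the time of day must fit within one day.
    if not 0 <= nanos_of_day < NANOS_PER_DAY:
        return None
    try:
        date = datetime.date.fromordinal(julian_day - JULIAN_TO_ORDINAL_OFFSET)
    except (ValueError, OverflowError):
        return None
    # Reject dates outside the supported 1400..9999 year range.
    if not MIN_YEAR <= date.year <= MAX_YEAR:
        return None
    return date
```

Returning None instead of raising lets the caller report an invalid value rather than crash.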

Change-Id: Ieaab5d33e6f0df831d0e67e1d318e5416ffb90ac
Reviewed-on: http://gerrit.cloudera.org:8080/5343
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2016-12-03 06:41:07 +00:00
Thomas Tauber-Marshall
7bcb51b152 IMPALA-4357: Fix DROP TABLE to pass analysis if the table fails to load
If a table fails to load, e.g. because it was deleted externally from
Kudu, we should still allow 'DROP TABLE' to pass analysis. Otherwise,
you may be unable to drop tables that are in a bad state.

Testing:
- Updates existing Kudu tests to reflect the new behavior, and fixes
a couple of problems with those tests that were causing them to pass
spuriously (as well as fixing the same problem with another test in
the file while I'm here).

Change-Id: I6b41fc3c0e95508ab67f1d420b033b02ec75a5da
Reviewed-on: http://gerrit.cloudera.org:8080/5144
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-12-02 21:58:03 +00:00
Michael Brown
8d4f8d8d93 IMPALA-4343,IMPALA-4354: qgen: model INSERTs; write INSERTs from query model
This patch adds support to the random query generator infrastructure to
model and write SQL INSERTs. It does not actually randomly generate
INSERTs at this time (tracked in IMPALA-4353 and umbrella task
IMPALA-3740) but does provide necessary building blocks to do so.

First, it's necessary to model the INSERTs as part of our data model.
This was done by taking the current notion of a Query and making it a
SelectQuery. We also then create an abstract Query containing some of
the more common methods and attributes. We then model an INSERT query,
INSERT clause, and VALUES clause (IMPALA-4343).

Second, it's necessary to test the basics of this data model. It made
sense to go ahead and implement the necessary SqlWriter methods to write
the SQL for these clauses (IMPALA-4354).

I could then use this writer with some existing and new tests that take
a query written into our data model and write the SQL, verifying they're
correct.

For INSERT into Kudu tables, the equivalent PostgreSQL queries need to
use "ON CONFLICT DO NOTHING", so all existing and new query tests verify
they can be written as PostgreSQL as well.

Testing:
- all the query generator tests pass
- I can run Leopard front_end.py and load older query generator reports,
  browse them, and re-run failed queries
- I can run Leopard controller.py to actually do a query generator
  run
- discrepancy_searcher.py --explain-only ran for hundreds of queries.
  There were no problems writing the SELECT queries

Change-Id: I38e24da78c49e908449b35f0a6276ebe4236ddba
Reviewed-on: http://gerrit.cloudera.org:8080/5162
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
2016-12-02 20:49:43 +00:00
Matthew Jacobs
48983b3893 IMPALA-4567: Fix test_kudu_alter_table exhaustive failures
The issue is that we set the Kudu table name explicitly via
tblproperty so it doesn't have the unique db name in the
underlying Kudu name. Meanwhile, the tests are run
concurrently in exhaustive so this test may end up running
multiple times (w/ different parameters, e.g.
disable_codegen) concurrently. This test needs to be run
serially.

Change-Id: Ibca64d5567c24240606e454b052d130fcd0c3968
Reviewed-on: http://gerrit.cloudera.org:8080/5312
Reviewed-by: David Knupp <dknupp@cloudera.com>
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-12-02 04:01:19 +00:00
Tim Armstrong
b374061206 IMPALA-4564,IMPALA-4565: mt_dop fixes for old aggs and joins
Fix a test bug where we need to skip nested types tests for the old aggs
and joins.

Fix a product bug where *eos is not initialised by the MT scan node.
This causes incorrect results when the calling ExecNode does not
initialise the eos variable, e.g. the sort node and the old agg and join
nodes.

Testing:
Added a test that reproduces the incorrect results with the sort node
when run under ASAN

Tested the mt_dop tests locally with old aggs and joins to ensure they
pass.

Change-Id: I48c50c8aa0c23710eb099fba252bc3c0cb74b313
Reviewed-on: http://gerrit.cloudera.org:8080/5302
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2016-12-02 01:46:55 +00:00
Michael Ho
a41918d443 Fix E2E test infrastructure to handle missing exceptions correctly
This change fixes a bug in the E2E infrastructure that handles
the case when an expected exception wasn't thrown. The code was
expecting test_section['CATCH'] to be a string but in
reality it's a list of strings. It also clarifies the error
message about the missing exception. This change also enforces
that the CATCH subsection in tests cannot be empty.
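
As a rough illustration of the behavior described above, here is a
minimal sketch. The helper name and message text are hypothetical and
are not taken from the actual test infrastructure; the only facts
assumed from this change are that test_section['CATCH'] is a list of
strings and that an empty CATCH subsection is rejected.

```python
def missing_exception_message(test_section):
    """Build the error text for a query that should have failed but did not.

    Hypothetical helper: test_section['CATCH'] is a list of strings.
    """
    expected = test_section.get('CATCH', [])
    if not expected:
        # An empty CATCH subsection is not allowed.
        raise ValueError("CATCH subsection must not be empty")
    # Join the list explicitly; treating it as one string was the bug.
    return "Expected exception was not thrown: " + "; ".join(expected)
```

Joining the list explicitly keeps the message readable when a test
lists more than one expected error string.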

Change-Id: I7d83c5db59e8a239e4e70694a1e625af6f21419c
Reviewed-on: http://gerrit.cloudera.org:8080/5260
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-12-01 23:43:03 +00:00
Jim Apple
5a158dbcd1 IMPALA-4543: Properly escape ignored tests subdirectories.
In the shell, double-quoted strings are not very close to "raw"
strings; double quotes end the string, but parameter expansion is also
performed for strings like "${FOO}". To pass strings from Python to the
shell, I have replaced double quotes with single quotes and escaped
the single quote characters in the strings.
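
The quoting approach described above can be sketched as follows. The
function name is hypothetical and this is not the code from the patch;
it only shows the general technique of single-quoting a string and
escaping any embedded single quotes.

```python
import shlex


def single_quote_for_shell(s):
    """Wrap s in single quotes so the shell performs no expansion on it."""
    # Inside single quotes only the single quote itself needs care:
    # close the quoted string, emit an escaped quote, then reopen it.
    return "'" + s.replace("'", "'\\''") + "'"


# Round trip through a POSIX-style parser to check the quoting.
quoted = single_quote_for_shell("a 'b' ${C}")
parsed = shlex.split(quoted)
```

The standard library's shlex.quote() does the same job and is
usually the safer choice where it is available.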

While I am here, add better logging in TestExecutor.run_tests to make
errors like this easier to diagnose.

Change-Id: I006eb559ec5f5b5b0379997fab945116dfc7e8d7
Reviewed-on: http://gerrit.cloudera.org:8080/5242
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
2016-11-30 22:30:39 +00:00
Michael Brown
585ed5aaae IMPALA-4450: qgen: use string concatenation operator for postgres queries
The random query generator writes a logical query Python object into
Impala or PostgreSQL dialects. When the CONCAT() function is chosen,
Impala's and PostgreSQL's CONCAT() implementations behave differently.
However, PostgreSQL has a || operator that functions like Impala's
CONCAT().

The method added here overrides the default behavior for the
PostgresqlSqlWriter. It prevents CONCAT(arg1, arg2, ..., argN) from
being written and instead causes the SQL to be written as
'arg1 || arg2 || ... || argN'.
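
A minimal sketch of the rewrite, assuming the arguments have already
been written out as SQL strings. The function name is hypothetical and
stands in for the PostgresqlSqlWriter override, whose real signature
is not shown here.

```python
def write_concat_as_operator(written_args):
    """Write CONCAT(arg1, ..., argN) as 'arg1 || ... || argN'.

    Hypothetical stand-in: written_args is a list of SQL fragments.
    """
    if not written_args:
        raise ValueError("CONCAT needs at least one argument")
    return " || ".join(written_args)
```

For example, write_concat_as_operator(["a", "b", "c"]) produces
"a || b || c".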

Testing:

I made sure that we generate syntactically valid queries still on the
PostgreSQL side. This includes queries that made use of string
concatenation. I also re-ran some failed queries that previously
produced different results. They now produce the same results. This is a
very straightforward change, so unit or functional tests for this seem
overkill.

The full effects of using || instead of CONCAT() are hard to test. It's
not clear whether my manual testing of || vs. CONCAT() missed some
edge behavior, especially in complicated queries, nested expressions,
GROUP BY, and so on.

Change-Id: I149b695889addfd7df4ca5f40dc991456da51687
Reviewed-on: http://gerrit.cloudera.org:8080/5034
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Sailesh Mukil <sailesh@cloudera.com>
2016-11-30 16:22:56 +00:00
Dimitris Tsirogiannis
9f497ba02f IMPALA-2890: Support ALTER TABLE statements for Kudu tables
With this commit, we add support for additional ALTER TABLE statements
against Kudu tables. The new supported ALTER TABLE operations for Kudu are:
- ADD/DROP range partitions. Syntax:
    ALTER TABLE <tbl_name> ADD [IF NOT EXISTS] RANGE <kudu_partition_spec>
    ALTER TABLE <tbl_name> DROP [IF EXISTS] RANGE <kudu_partition_spec>
- ADD/DROP/RENAME column. Syntax:
    ALTER TABLE <tbl_name> ADD COLUMNS (col_spec, [col_spec, ...])
    ALTER TABLE <tbl_name> DROP COLUMN <col_name>
    ALTER TABLE <tbl_name> CHANGE COLUMN <old> <new_name> <type>
- Rename Kudu table using the 'kudu.table_name' table property. Example:
  ALTER TABLE <tbl_name> SET TBLPROPERTIES ('kudu.table_name'='<new_name>'),
  will change the underlying Kudu table name to <new_name>.
- Renaming the HMS/Catalog table entry of a Kudu table is supported using the
  existing ALTER TABLE <tbl_name> RENAME TO <new_tbl_name> syntax.

Not supported:
- ALTER TABLE <tbl_name> REPLACE COLUMNS

Change-Id: I04bc87e04e05da5cc03edec79d13cedfd2012896
Reviewed-on: http://gerrit.cloudera.org:8080/5136
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-11-30 04:55:03 +00:00
Tim Armstrong
e2cde13a2b IMPALA-4519: increase timeout in TestFragmentLifecycle
Increase the timeout to over 120s to match datastream_sender_timeout_ms.
This should avoid spurious test failures if we are unlucky and a sender
gets stuck waiting for a receiver fragment that will never start.

Testing:
Ran the test in a loop for a while to flush out any flakiness.

Change-Id: I9fe6e6c74538d0747e3eeb578cf0518494ff10c8
Reviewed-on: http://gerrit.cloudera.org:8080/5244
Tested-by: Impala Public Jenkins
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2016-11-29 01:46:05 +00:00
Tim Armstrong
fe8d994f0f IMPALA-4541: fix test dimensions for test_codegen_mem_limit
The test should only be run with codegen enabled.

Change-Id: Iac460d2a1b69de638c557d7c8aa318a73ad0507b
Reviewed-on: http://gerrit.cloudera.org:8080/5221
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-11-25 23:32:24 +00:00
Tim Armstrong
16552f6eda IMPALA-4525: fix crash when codegen mem limit exceeded
The error path in OptimizeLlvmModule() has not worked correctly for a
long time because various places in the code assume that codegen'd
function pointers will be filled in (e.g. ScalarFnCall). Since the
recent change "IMPALA-4397,IMPALA-3259: reduce codegen time and memory"
it is more likely to go down this path.

The cases when errors occur on this path: memory limit exceeded, internal
codegen bugs, and corrupt IR UDFs, are all cases when it is not correct
or safe to continue executing the query, so we should just fail the
query.

Testing:
Add a test where codegen reliably fails with memory limit exceeded.

Change-Id: Ib38d0a44b54c47617cad1b971244f477d344d505
Reviewed-on: http://gerrit.cloudera.org:8080/5211
Reviewed-by: Michael Ho <kwho@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-11-24 08:03:39 +00:00
Henry Robinson
e4fc5bd5c5 IMPALA-4488: HS2 GetOperationStatus() should keep session alive
GetOperationStatus() is used by *DBC drivers to check if a query is
ready for its results to be fetched. However, it did not keep the
associated session alive, so queries would time out if they took longer
than the timeout to materialize their first rows to be fetched.

* Add withSession() to GetOperationStatus()
* Add a test that failed before this patch, and succeeds after.

Change-Id: Ibb3f66188209563b4b74b2ca96480f16ace0f190
Reviewed-on: http://gerrit.cloudera.org:8080/5213
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-11-24 05:39:45 +00:00
Alex Behm
bbf5255d0e IMPALA-1788: Fold constant expressions.
Adds a new ExprRewriteRule for replacing constant expressions
with their literal equivalent via BE evaluation. Applies the
new rule together with the existing ones on the parse tree,
after analysis.

Limitations
- Constant folding is applied on the unresolved expressions.
  As a result, it only works for expressions that are constant
  within a single query block, as opposed to expressions that
  may become constant after fully substituting inline-view exprs.
- Exprs are not normalized, so some opportunities for constant
  folding are missed for certain expr-tree shapes.

This patch includes the following interesting changes:
- Introduces a timestamp literal that can only be produced
  by constant folding (not expressible directly via SQL).
- To make sure that rewrites have no user-visible effect,
  the original result types and column labels of the top-level
  statement are restored after the rewrites are performed.
- Does not fold exprs if their evaluation resulted in a
  warning or error, or if the resulting value is not
  representable by the corresponding FE LiteralExpr.
- Fixes an existing issue with converting strings between
  the FE/BE. Strings produced in the BE that have characters
  with a value > 127 are not correctly deserialized into a
  Java String via thrift. We detect this case during constant
  folding and abandon folding of such exprs.
- Fixes several issues with detecting/reporting errors in
  NativeEvalConstExprs().
- Cleans up ExprContext::GetValue() into
  ExprContext::GetConstantValue() which clarifies its only use
  of evaluating exprs from the FE.

Testing:
- Modifies expr-test.cc to run all tests through the constant
  folding path.
- Adds basic planner and rewrite rule tests.
- Exhaustive test run passed

Change-Id: If672b703db1ba0bfc26e5b9130161798b40a69e9
Reviewed-on: http://gerrit.cloudera.org:8080/5109
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-11-23 21:11:30 +00:00
Michael Ho
1e306211d0 IMPALA-3838, IMPALA-4495: Codegen EvalRuntimeFilters() and fixes filter stats updates
This change codegens HdfsParquetScanner::EvalRuntimeFilters()
by unrolling its loop, codegen'ing the expression evaluation
of the runtime filter and replacing some type information with
constants in the hashing function of runtime filter to avoid
branching at runtime.

This change also fixes IMPALA-4495 by not counting a row as
'considered' in the filter stats before the filter arrives.
This avoids unnecessarily marking a runtime filter as
ineffective before it's even used.

With this change, TPCDS-Q88 improves by 13-14%.
primitive_broadcast_join_1 improves by 24%.

Change-Id: I27114869840e268d17e91d6e587ef811628e3837
Reviewed-on: http://gerrit.cloudera.org:8080/4833
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-11-23 12:48:47 +00:00
Tim Armstrong
4db330e69a IMPALA-4397,IMPALA-3259: reduce codegen time and memory
A handful of fixes to codegen memory usage:
* Delete the IR module when we're done with it (it can be fairly large)
* Track the compiled code size (typically not that large, but it can add
  up if there are many fragments).
* Estimate optimisation memory requirements and track it in the memory
  tracker. This is very crude but much better than not tracking it.

A handful of fixes to improve codegen time/cost, particularly targeted
at compute stats workloads:
* Avoid over-inlining when there are many aggregate functions,
  conjuncts, etc by adding "NoInline" attributes.
* Don't codegen non-grouping merge aggregations. They will only process
  one row per Impala daemon, so codegen is not worth it.
* Make the Hll algorithm more efficient by specialising the hash function
  based on decimal width.

Limitations:
* This doesn't tackle over-inlining of large expr trees, but a similar
  approach will be used there in a follow-on patch.

Perf:
Compute stats on functional_parquet.widetable_1000_cols goes from 1min+
of codegen to ~ 5s codegen on my machine. Local perf runs of tpc-h
and targeted perf showed no regressions and some moderate improvements
(1-2%).

Also did an experiment to understand the perf consequences of disabling
inlining. I manually set CODEGEN_INLINE_EXPRS_THRESHOLD to 0, and ran:

  drop stats tpch_20_parquet.lineitem
  compute stats tpch_20_parquet.lineitem;

There was no difference in time spent in the agg node: 30.7s with
inlining, 30.5s without.

Change-Id: Id10015b49da182cb181a653ac8464b4a18b71091
Reviewed-on: http://gerrit.cloudera.org:8080/4956
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
2016-11-23 08:18:17 +00:00
David Knupp
696fb68e58 IMPALA-4510: Selectively filter args for metric verification tests
run-tests.py is a wrapper around impala-py.test. It abstracts away
the need to invoke separate runs for serial tests, parallel tests,
and metric verification tests.

Because it's possible for a user to specify certain test suites,
or even specific tests, on the command line when calling
run-tests.py, it had been necessary to override the command line
args when it came time to run the metric verification tests --
otherwise those other tests/suites would be rerun. Before this
patch, we had simply been stripping away all command line args.

However, that blanket approach causes problems when running tests
against a remote cluster, because we need to retain those command
line args that pertain to the remote cluster.

This patch selectively prunes unwanted command line args for the
last metric verification test stage, keeping the ones that we
need, and also adds extensive documentation for explaining why we
have to go through this fairly odd and elaborate step.
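
The selective pruning can be pictured with the sketch below. It is an
illustration only: the function name and the option names used in the
example are assumptions, not the actual logic or allow-list in
run-tests.py.

```python
def filter_args(args, keep_options):
    """Keep only the options named in keep_options, plus their values.

    Hypothetical helper. Handles '--opt=value' and '--opt value' forms.
    """
    kept = []
    i = 0
    while i < len(args):
        arg = args[i]
        name = arg.split('=', 1)[0]
        if name in keep_options:
            kept.append(arg)
            # '--opt value' form: the value is the next token.
            has_next = i + 1 < len(args)
            if '=' not in arg and has_next and not args[i + 1].startswith('-'):
                kept.append(args[i + 1])
                i += 1
        i += 1
    return kept
```

With keep_options set to {'--impalad', '--namenode'}, a command line
that also names a test file and a -k filter is reduced to just the
two cluster options and their values.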

This patch was tested by running a sample test suite locally,
and against a remote cluster. Previously, the metric verification
stage had been failing for remote cluster tests (since they were
defaulting to localhost for services that were only available
remotely.) With the patch, the remote verification tests were
passing.

Also, while I'm here, add a small change that exits immediately
if the user calls for --help. Before this, we actually still ran
the tests.

Change-Id: I069172f44c1307d55f85779cdb01fecc0ba1799e
Reviewed-on: http://gerrit.cloudera.org:8080/5135
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
2016-11-23 05:46:18 +00:00
Alex Behm
8f2bb2f72f IMPALA-3809: Show Kudu-specific column metadata in DESCRIBE.
TODO:
- Corresponding changes to DESCRIBE EXTENDED/FORMATTED.

Testing:
A private core/hdfs run passed.

Change-Id: I83c91b540bc6d27cb4f21535fe12f3f8658c233e
Reviewed-on: http://gerrit.cloudera.org:8080/5125
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-11-22 23:06:05 +00:00
Michael Ho
b7eeb8bf85 IMPALA-4432: Handle internal codegen disabling properly
There are some conditions in which codegen is disabled internally
even if it's enabled in the query option. For instance, the single
node optimization or the expression evaluation requests sent from
the FE to the BE. This internal disabling of codegen is advisory
as its purpose is to reduce the latency for tables with no or
very few rows. The internal disabling of codegen doesn't interact
well with UDFs which cannot be interpreted (e.g. IR UDF) as it
conflates the 'disable_codegen' query option set by the user.
As a result, it's hard to differentiate between when codegen is
disabled explicitly by users and when it is disabled internally.

This change fixes the problem above by adding an explicit flag in
TQueryCtx to indicate that codegen is disabled internally. This flag
is only advisory. For cases in which codegen is needed to function,
this internal flag is ignored and if codegen is disabled via query
option, an error is thrown. For this new flag to work with ScalarFnCall,
codegen needs to happen after ScalarFnCall::Prepare() because it's
hard to tell if a fragment contains any UDF that cannot be interpreted
until after ScalarFnCall::Prepare() is called. However, Prepare() needs
the codegen object to codegen so it needs to be created before Prepare().
We can either always create the codegen module or defer codegen to a point
after ScalarFnCall::Prepare(). The former has the downside of introducing
unnecessary latency for say single-node optimization so the latter is
implemented. It is needed as part of IMPALA-4192 anyway.

After this change, ScalarFnCall expressions which need to be codegen'd
are inserted into a vector in RuntimeState in ScalarFnCall::Prepare().
Later in the codegen phase, these expressions' GetCodegendComputeFn()
will be called after codegen for operators is done. If any of these
expressions are already codegen'd indirectly by the operators,
GetCodegendComputeFn() will be a no-op. This preserves the behavior
that ScalarFnCall will always be codegen'd even if the fragment
doesn't contain any codegen enabled operators.

Change-Id: I0b6a9ed723c64ba21b861608583cc9b6607d3397
Reviewed-on: http://gerrit.cloudera.org:8080/5105
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-11-22 14:56:03 +00:00
Lars Volker
8ea21d099f IMPALA-2523: Make HdfsTableSink aware of clustered input
IMPALA-2521 introduced clustering for insert statements. This change
makes the HdfsTableSink aware of clustered inputs, so that partitions
are opened, written, and closed one by one.
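
The idea can be sketched in a few lines. This is a simplified model
with hypothetical names, not the HdfsTableSink implementation: it
assumes the rows arrive already clustered by partition key, so each
partition is opened, written, and closed exactly once.

```python
from itertools import groupby


def write_clustered(rows, key):
    """Return the open/write/close events for rows clustered by key."""
    events = []
    # groupby only merges adjacent rows, which is exactly what clustered
    # input guarantees for rows of the same partition.
    for partition, group in groupby(rows, key=key):
        events.append(("open", partition))
        for row in group:
            events.append(("write", row))
        events.append(("close", partition))
    return events
```

Because only one partition is open at a time, the sink never has to
hold writers for many partitions at once.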

This change also adds/modifies tests in several ways:

- clustered insert tests switch from selecting all rows from
  alltypessmall to alltypes. Together with varying settings for
  batch_size, this results in a larger number of row batches being
  written.
- clustered insert tests select from alltypes instead of
  functional.alltypes to make sure we also select from various input
  formats.
- clustered insert tests have been added to select from alltypestiny to
  create inserts with 1 and 2 rows per partition respectively.
- exhaustive insert tests now use different values for batch_size: 1,
  16, 0 (meaning default, 1024). This is limited to uncompressed parquet
  files, to maintain a reasonable runtime. On my machine execution of
  test.insert took 1778 seconds, compared to 1002 seconds with just the
  default row batch size.
- There is additional testing in test_insert_behaviour.py to make sure
  that insertion over several row batches only creates one file per
  partition.
- It renames the test_insert method to make it unique in the file and
  allow for effective filtering with -k.
- It adds tests to the Analyzer test suite.

Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Reviewed-on: http://gerrit.cloudera.org:8080/4863
Reviewed-by: Lars Volker <lv@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-11-22 02:51:20 +00:00
Sailesh Mukil
178fd59142 IMPALA-4502: test_partition_ddl_predicates breaks on non-HDFS filesystems
This is because that test uses 'set cached' and 'set uncached' which
are not supported on non-HDFS filesystems. This patch creates a
separate test file for non-HDFS filesystems with only supported
queries and invokes the right file based on the filesystem.

Change-Id: I8606aa427cb6e50be3395cdde246abb53db5172c
Reviewed-on: http://gerrit.cloudera.org:8080/5164
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
2016-11-22 00:42:57 +00:00
Dan Hecht
035b775a6d IMPALA-4440: lineage timestamps can go backwards across daylight savings transitions
Using TimestampValue (or equivalent string representation) for
timestamps that require a point in time doesn't work because the same
time can represent multiple points in time. For example, the timestamp:
'2016-11-13 01:01 AM' occurred twice last weekend.

Instead, we should use unix time directly rather than trying to derive
unix time from a (timezone-less) timestamp.
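
The ambiguity is easy to reproduce. The sketch below is an
illustration, not code from this patch: as an assumption it uses the
US Pacific fall-back transition on 2016-11-06 and models the two UTC
offsets with fixed-offset timezones.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets before and after the fall-back transition (assumed zone).
PDT = timezone(timedelta(hours=-7))
PST = timezone(timedelta(hours=-8))

# The same wall-clock reading, 01:01 AM, occurs once in each offset.
first = datetime(2016, 11, 6, 1, 1, tzinfo=PDT)
second = datetime(2016, 11, 6, 1, 1, tzinfo=PST)

first_unix = int(first.timestamp())
second_unix = int(second.timestamp())

# Identical timezone-less timestamps, but unix times one hour apart.
same_wall_clock = first.replace(tzinfo=None) == second.replace(tzinfo=None)
gap_seconds = second_unix - first_unix
```

Recording unix time directly sidesteps the problem, because it never
has to be derived from the ambiguous wall-clock value.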

Note that there are other questionable uses of TimestampValue for
internal Impala service stuff, but I want to fix them separately as they
are not as important and fixing does add some risk.

While I'm here, remove a template TimestampValue constructor that was
unused and is confusing.

We don't have any end-to-end tests that exercise column lineage, so add
a simple custom cluster test that enables lineage and verifies the start
and end unix times are within appropriate bounds.  The other column
lineage graph fields are at least tested via planner tests.

Automated regression testing for the specific daylight savings issue is
difficult as we'd have to cross the daylight savings boundary at just
the right time during query execution in order to reproduce
reliably. But open to ideas.

Testing:
- loop the new test overnight without any failures.
- exhaustive run.

Change-Id: I34e435fc3511e65bc62906205cb558f2c116a8a9
Reviewed-on: http://gerrit.cloudera.org:8080/5129
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-11-21 22:18:37 +00:00
Dimitris Tsirogiannis
3db5ced4ce IMPALA-3726: Add support for Kudu-specific column options
This commit adds support for Kudu-specific column options in CREATE
TABLE statements. The syntax is:
CREATE TABLE tbl_name ([col_name type [PRIMARY KEY] [option [...]]] [, ....])
where option is:
| NULL
| NOT NULL
| ENCODING encoding_val
| COMPRESSION compression_algorithm
| DEFAULT expr
| BLOCK_SIZE num

The output of the SHOW CREATE TABLE statement was altered to include all the specified
column options for Kudu tables.

Change-Id: I727b9ae1b7b2387db752b58081398dd3f3449c02
Reviewed-on: http://gerrit.cloudera.org:8080/5026
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-11-18 11:41:01 +00:00