Commit Graph

692 Commits

Skye Wanderman-Milne
68fef6a5bf IMPALA-2213: make Parquet scanner fail query if the file size metadata is stale
This patch changes the Parquet scanner to check whether it can read the
full footer scan range; a short read indicates that the file has been
overwritten by a shorter file without refreshing the table metadata.
Previously this case hit a DCHECK. This patch adds a test for this case,
as well as for the case where the new file is longer than the metadata
states (which already fails with an existing error).
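A minimal sketch of the kind of check described above (illustrative only; the
function and variable names are hypothetical, not the actual HdfsParquetScanner
code, which returns a Status rather than throwing):

  #include <cstdint>
  #include <stdexcept>
  #include <string>

  // Sketch: if we could not read the full footer scan range, the file on disk
  // is shorter than the catalog metadata claims, so fail the query with a
  // clear error instead of hitting a DCHECK.
  void ValidateFooterRead(int64_t footer_bytes_read,
                          int64_t footer_bytes_expected,
                          const std::string& filename) {
    if (footer_bytes_read < footer_bytes_expected) {
      throw std::runtime_error(
          "File '" + filename + "' is shorter than expected from the table "
          "metadata; the file may have been overwritten. Run REFRESH and "
          "retry the query.");
    }
  }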

Change-Id: Ie2031ac2dc90e4f2573bd3ca8a3709db60424f07
Reviewed-on: http://gerrit.cloudera.org:8080/1084
Tested-by: Internal Jenkins
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2015-10-01 13:58:39 -07:00
Juan Yu
6bac14a283 IMPALA-2005: Cleanup the newly created table if CTAS fails.
If a CTAS query fails during the DML part, Impala should drop the newly
created table.

Change-Id: I39e04a6923a36afa48f3252addd50ddda83d1706
(cherry picked from commit e03ce43585f68590a95038341e74db458f34bf32)
Reviewed-on: http://gerrit.cloudera.org:8080/870
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
2015-10-01 13:58:38 -07:00
Skye Wanderman-Milne
0c5e6a804f IMPALA-2443: add support for more Parquet array encodings
This patch adds full support for the various Parquet array encodings,
as well as tests that use files from
https://github.com/apache/hive/tree/master/data/files. This should
allow us to read any existing array data.

Change-Id: I3d22ae237b1dc82ee75a83c1d4890d76316fadee
Reviewed-on: http://gerrit.cloudera.org:8080/826
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
2015-10-01 13:58:37 -07:00
Dimitris Tsirogiannis
6c9b93973a IMPALA-2441: CREATE DATABASE IF NOT EXISTS may cause NPE
This commit fixes an issue where a CREATE DATABASE IF NOT EXISTS
statement would cause an NPE if the database already exists in the
Hive MetaStore but not in the Impala catalog. With this fix, no error is
thrown if the database exists in the HMS, and the database is added to the
catalog.

Change-Id: If1d15bb50869ce8084e0443f119a596b365004c7
Reviewed-on: http://gerrit.cloudera.org:8080/1091
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2015-10-01 13:58:36 -07:00
Tim Armstrong
96d93083fe Disable nested TPCH tests for old aggs and joins
We do not support nested types combined with the old aggs and joins.

Change-Id: I81401dd4d482d46e678091989ac9d178ac771d01
Reviewed-on: http://gerrit.cloudera.org:8080/1078
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2015-09-30 17:17:48 -07:00
Matthew Jacobs
851056489d IMPALA-2440: Fix old HJ full outer join with no rows
When performing a full outer join with the old (non-partitioned)
HashJoinNode, any join fragment with 0 build rows and 0 probe rows
would produce an extra null row.

Change-Id: I75373edc4f6b3b0c23afba3c1fa363c613f23507
Reviewed-on: http://gerrit.cloudera.org:8080/1068
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-09-30 17:17:47 -07:00
Juan Yu
7c498627f6 IMPALA-2249: Avoid allocating StringBuffer > 1GB in ScannerContext::Stream::GetBytesInternal()
Due to IMPALA-1619, allocating a StringBuffer larger than 1GB could
cause Impala to crash. Check the requested buffer size in advance and
fail the request if it is larger than 1GB. Once IMPALA-1619 is
fixed, we should revert this change.
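A sketch of the guard being described (hypothetical names; the real check lives
in ScannerContext::Stream::GetBytesInternal() and returns an error Status):

  #include <cstdint>
  #include <stdexcept>
  #include <string>

  // Sketch: refuse any string-buffer allocation request above 1GB until
  // IMPALA-1619 is fixed.
  constexpr int64_t kMaxStringBufferSize = 1LL << 30;  // 1GB

  void CheckRequestedBufferSize(int64_t requested_bytes) {
    if (requested_bytes > kMaxStringBufferSize) {
      throw std::runtime_error("Requested buffer size " +
                               std::to_string(requested_bytes) +
                               " bytes exceeds the 1GB limit.");
    }
  }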

Change-Id: Iffd1e701614b520ce58922ada2400386661eedb1
(cherry picked from commit 74ba16770eeade36ab77c86ed99d9248c60b0131)
Reviewed-on: http://gerrit.cloudera.org:8080/869
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
2015-09-30 17:17:46 -07:00
Tim Armstrong
fbec3f65a0 Skip nested types tests with old aggs and joins
We don't support nested types combined with the old aggs and joins. This
patch disables the nested type query tests when the old aggs or joins
are enabled with the TEST_START_CLUSTER_ARGS environment variable.

Change-Id: I6579a0a245359d4d2ff955c399d1296580c9676e
Reviewed-on: http://gerrit.cloudera.org:8080/1046
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2015-09-30 17:17:39 -07:00
Matthew Jacobs
2478f05cb3 IMPALA-2375: Unblock old hj/agg test runs
Move a very expensive semi-join test case to run only on
exhaustive so that it is not run as part of the old hj/agg
jenkins runs where it fails.

Change-Id: I4a0f915e894ceac91d86b366876e47e9cc87255a
Reviewed-on: http://gerrit.cloudera.org:8080/930
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2015-09-27 15:13:32 -07:00
Martin Grund
579be1c542 IMPALA-2284: Disallow long (1<<30) strings in group_concat()
This is the first step to fix issues with large memory allocations. In
this patch, the built-in `group_concat` is no longer allowed to allocate
arbitrarily large strings and crash Impala, but is limited to the upper
bound of possible allocations in Impala.

This patch does not perform any functional change, but rather avoids
unnecessary crashes. However, it changes the parameter type of
FindChunk() in MemPool to a signed 64-bit integer. This change allows
the MemPool to internally allocate more than 1GB of memory, but the
public interface of Allocate() is not changed, so the general limitation
remains. The reason for this change is as follows:

  1) In a UDF, FunctionContext::Reallocate() would allocate slightly more
  than 512MB from the FreePool.
  2) The free pool tries to double this size to allocate 1GB from the
  MemPool.
  3) The MemPool doubles the size again and overflows the signed 32-bit
  integer in the FindChunk() method. This will then only allocate 1GB
  instead of the expected 2GB.

The result is that a caller expected a larger allocation than actually
happened, which in turn leads to memory corruption as soon as the memory
is accessed.
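The overflow can be seen in a small standalone example (an illustration of the
32-bit wraparound only, not the actual MemPool code):

  #include <cstdint>
  #include <iostream>

  int main() {
    // A UDF reallocation asks for slightly more than 512MB.
    int64_t udf_request = (1LL << 29) + 16;
    // The FreePool doubles the request when growing: ~1GB.
    int64_t free_pool_request = udf_request * 2;
    // The MemPool doubles again: ~2GB, which no longer fits in a signed
    // 32-bit parameter (INT32_MAX is 2^31 - 1). The narrowed value wraps,
    // so the pool hands back far less memory than the caller expects and
    // the memory is later accessed out of bounds.
    int64_t chunk_size_64 = free_pool_request * 2;
    int32_t chunk_size_32 = static_cast<int32_t>(chunk_size_64);
    std::cout << "64-bit chunk size: " << chunk_size_64 << "\n"
              << "32-bit chunk size: " << chunk_size_32 << "\n";
    return 0;
  }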

Change-Id: I068835dfa0ac8f7538253d9fa5cfc3fb9d352f6a
Reviewed-on: http://gerrit.cloudera.org:8080/858
Tested-by: Internal Jenkins
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
2015-09-23 15:15:55 -07:00
Ippokratis Pandis
48699de6e3 IMPALA-1621,2241,2271,2330,2352: Lazy switch to IO buffers to reduce min mem needed for PAGG/PHJ
PAGG and PHJ were using an all-or-nothing approach to spilling. In
particular, they were trying to switch to IO-sized buffers for both
streams (aggregated and unaggregated in PAGG; build and probe in PHJ)
of every partition (currently 16 partitions for a total of 32
streams), even if some of the streams had very few rows, were empty,
or simply would not spill, so there was no need to allocate
IO buffers for them. That increased the min mem needed by those
operators in many queries.

This patch decouples the decision to switch to IO buffers for each
stream of each partition. Streams will switch to IO-sized buffers
whenever the rows they contain do not fit in the first two small
buffers (64KB and 512KB respectively). When we decide to spill a
partition, we switch both of its streams to IO buffers.

With this change, many streams of PAGG and PHJ nodes do not need to
use IO-sized buffers, reducing the min mem requirement. For example,
below is the min mem needed (in MBs) for some of the TPC-H queries.
Some need half or less of the memory they needed before:

  TPC-H Q3: 645 -> 240
  TPC-H Q5: 375 -> 245
  TPC-H Q7: 685 -> 265
  TPC-H Q8: 740 -> 250
  TPC-H Q9: 650 -> 400
  TPC-H Q18: 1100 -> 425
  TPC-H Q20: 420 -> 250
  TPC-H Q21: 975 -> 620

To make this small buffer optimization work, we had to fix
IMPALA-2352. That is, the AllocateRow() call of
PAGG::ConstructIntermediateTuple() could return unsuccessfully just
because the small buffers of the stream were exhausted. Previously we
would treat that as an indication that there is no memory left, start
spilling a partition and switch all streams to IO buffers. Now we make
a best effort: we first try SwitchToIoBuffers() and, if that is
successful, we re-attempt the AllocateRow() call. See IMPALA-2352 for
more details.
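In sketch form, the best-effort path looks roughly like this (simplified
placeholder types and methods, not the actual PartitionedAggregationNode code):

  // Sketch: if AllocateRow() fails while the stream is still on small
  // buffers, try upgrading the stream to IO-sized buffers and retry once
  // before treating the failure as real memory pressure (i.e. spilling).
  struct StreamSketch {
    bool using_small_buffers = true;
    bool SwitchToIoBuffers() { using_small_buffers = false; return true; }
    void* AllocateRow(int size) { return nullptr; }  // placeholder
  };

  void* AllocateIntermediateTuple(StreamSketch* stream, int size,
                                  bool* needs_spill) {
    void* row = stream->AllocateRow(size);
    if (row == nullptr && stream->using_small_buffers) {
      if (stream->SwitchToIoBuffers()) row = stream->AllocateRow(size);
    }
    // Only if the retry also fails do we fall back to spilling a partition.
    *needs_spill = (row == nullptr);
    return row;
  }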

Another change is that now SwitchToIoBuffers() will reset the flag
using_small_buffers_ back to false, in case we are in a very low
memory situation and it fails to get a buffer. That allows us to
retry calling SwitchToIoBuffers() once we free up some space. See
IMPALA-2330 for more details.

With the above fixes we should also have fixed IMPALA-2241 and
IMPALA-2271, which are essentially stream::using_small_buffers_-related
DCHECKs.

This patch adds all 22 TPC-H queries to the test_mem_usage_scaling test
and updates the per-query min mem limits in it. Additionally, it adds
a new aggregation test that uses the TPC-H dataset for larger
aggregations (TestTPCHAggregationQueries). It also removes some
dead test code.

Change-Id: Ia8ccd0b76f6d37562be21fd4539aedbc2a864d38
Reviewed-on: http://gerrit.cloudera.org:8080/818
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins

Conflicts:

	tests/query_test/test_aggregation.py
2015-09-23 11:07:42 -07:00
Ippokratis Pandis
4d5ee2b3a2 IMPALA-2364: Wrong DCHECK in PHJ::ProcessProbeBatch
There was a DCHECK in PHJ::ProcessProbeBatch() that expected the
state of the PHJ to be PROCESSING_PROBE. It turns out we can hit the
same DCHECK when we are in the REPARTITIONING phase.
This patch fixes the DCHECK. It also adds TPC-DS Q53 to the
test_mem_usage_scaling test (along with the needed refactoring in this
test) because TPC-DS Q53 hit this DCHECK in an endurance test.

Change-Id: I37f06e1bfe07c45e4a6eac543934b4d83a205d28
Reviewed-on: http://gerrit.cloudera.org:8080/893
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
2015-09-23 10:38:58 -07:00
aacalfa
57dd4d1502 IMPALA-1309: Add support for distinct in group_concat function.
Change-Id: I2790f1d2a7bfd0ecc7ef66cc5d91dafe3414e111
Reviewed-on: http://gerrit.cloudera.org:8080/892
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
2015-09-23 09:42:17 +00:00
Szehon Ho
0c574394d6 Data_generator: add a new mode "populate_existing" to populate existing tables.
Example use case: generate data for any given DDL to reproduce issues.

The general strategy is to read the table's metadata, create a temp table with
that schema, insert text data, and then insert into the target table, similar
to the Parquet code path.

Change-Id: I4d512a80bc0accf4c243587f6246d9f63fda9149
Reviewed-on: http://gerrit.cloudera.org:8080/877
Reviewed-by: Szehon Ho <szehon@cloudera.com>
Tested-by: Internal Jenkins
2015-09-22 10:58:33 -07:00
Tim Armstrong
db7519df24 IMPALA-2207: memory corruption on build side of NLJ
The NLJ node did not follow the expected protocol when need_to_return
is set on a row batch, which means that memory referenced by a row batch
can be freed or reused the next time GetNext() is called on the child.

This patch changes the NLJ node to follow the protocol by deep copying
all build side row batches when the need_to_return_ flag is set on the
row batches. This prevents the row batches from referencing memory that
may be freed or reused.
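A rough sketch of the protocol fix (simplified stand-in types; the real code
uses Impala's RowBatch and its deep-copy machinery):

  #include <memory>
  #include <utility>
  #include <vector>

  // Sketch: while accumulating build-side batches, honor the need_to_return
  // protocol by deep copying any batch whose memory the child may free or
  // reuse on its next GetNext() call.
  struct RowBatchSketch {
    bool need_to_return = false;
    std::unique_ptr<RowBatchSketch> DeepCopy() const {
      // Copy tuple data into memory owned by the join node so later
      // GetNext() calls on the child cannot invalidate it.
      return std::make_unique<RowBatchSketch>(*this);
    }
  };

  void AddBuildBatch(std::vector<std::unique_ptr<RowBatchSketch>>* build_batches,
                     std::unique_ptr<RowBatchSketch> batch) {
    if (batch->need_to_return) {
      build_batches->push_back(batch->DeepCopy());
    } else {
      build_batches->push_back(std::move(batch));
    }
  }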

Re-enable the test that was disabled because of IMPALA-2332, since this
was the root cause.

Change-Id: Idcbb8df12c292b9e2b243e1cef5bdfc1366898d1
Reviewed-on: http://gerrit.cloudera.org:8080/810
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2015-09-22 10:58:32 -07:00
ishaan
e408560c56 Perf Framework: Move exec functions to a separate file and deprecate Hive execution.
This patch does the following:
  - Removes code that deals with executing queries through Hive.
  - Gives the user the option to specify only the hostname for the Impalads.
  - Moves the execution functions to their own .py file.
  - Removes some duplicate code (exec_shell_cmd -> exec_process)

Change-Id: If49951c7bb5423ef9343d4d211f6da13d397325a
Reviewed-on: http://gerrit.cloudera.org:8080/862
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
2015-09-22 10:58:32 -07:00
Alex Behm
41ef3a216d Nested Types: Add functional tests.
This patch adds basic end-to-end functional tests for nested types:
1. For exercising the Reset() of exec nodes when inside a subplan.
2. For asserting correct behavior when row batches with collection-typed
   slots flow through exec nodes.

Most cases are covered, but there are a few known issues that prevent
full coverage. The remaining tests will be added as part of the fixes
for those existing JIRAs.

Change-Id: I0140c1a32cb5edd189f283c68a24de8484b3f434
Reviewed-on: http://gerrit.cloudera.org:8080/823
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-09-14 13:43:01 -07:00
Alex Behm
361da01152 Fail queries that require a SubplanNode when using legacy joins and aggs.
We will not provide full nested types support if any of these options
are set:

--enable_partitioned_aggregation=false
--enable_partitioned_hash_join=false

Change-Id: I0f8607914faf9691d5f7b1a4327609fefba22e56
Reviewed-on: http://gerrit.cloudera.org:8080/792
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
2015-09-10 04:50:31 +00:00
Tim Armstrong
235a8d08da IMPALA-2295: deep copy arrays in BufferedTupleStream
This was unimplemented and is used on some code paths. Arrays were not
properly copied into the BufferedTupleStream, potentially leaving stray
pointers to invalid or reused memory. Arrays are now correctly deep
copied. Includes a unit test that copies rows containing arrays in and
out of a BufferedTupleStream.

Also implements a matching optimization for deep copy in RowBatch.

Change-Id: I75d91a6b439450c5b47646b73bc528cfb8f7109b
Reviewed-on: http://gerrit.cloudera.org:8080/751
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2015-09-09 02:39:14 +00:00
Alex Behm
dbb40c7922 Nested Types: Add end-to-end tests running nested TPCH on Parquet.
Change-Id: I2a3c46ea50e53479f2f91c175c45e2da3c1c7025
Reviewed-on: http://gerrit.cloudera.org:8080/740
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-09-06 20:49:42 +00:00
Henry Robinson
270f12b09a Fix race condition in test_statestore changes
KillableThreadedServer.port is required to be set for
KillableThreadedServer.wait_until_up() to execute. However, it was being
set during serve() which was run concurrently with wait_until_up() (see
StatestoreSubscriber.__init_server()), so sometimes it was not set. The
fix is to set KillableThreadedServer.port during construction (it is set
by the underlying socket as soon as it is constructed itself).

Change-Id: Ib9ca9e237bca96635f5ee5b5bbfb7fd678929ce4
Reviewed-on: http://gerrit.cloudera.org:8080/759
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-09-05 09:44:55 +00:00
Henry Robinson
956c9d74fa Ensure that test_statestore always uses free ports
test_statestore.py needs a lot of ports to run subscriber servers
on. Before this patch, we'd find a free port by binding to port 0,
finding what port the OS actually used, and closing the original socket
before passing that port to the actual server to start up. However, this
was obviously racy if some other process was also looking for a free
port at the same time.

This patch moves the bind logic into the Thrift server socket itself, so
that there's no close->open race window between binding to the port and
actually wanting to use it.

I wasn't able to reproduce the issue on my local machine, but this
diagnosis fits the problems we've seen.

Change-Id: Idfbbe71f596ff5a7c3f4ff33b5edd565648d8e59
Reviewed-on: http://gerrit.cloudera.org:8080/754
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-09-04 22:44:20 +00:00
Alex Behm
d48ec4b8b3 IMPALA-2289: Properly handle AtCapacity() in SubplanNode.
After this patch we get correct results for nested TPCH Q13.

The bug: Since we were not properly handling AtCapacity() of the output
batch in SubplanNode, we sometimes passed a row batch that was already
at capacity into GetNext() on the second child of the SubplanNode.
In this particular case, that batch was passed into the NestedLoopJoinNode
which may return incomplete results if the output batch is already
at capacity (e.g., ProcessUnmatchedBuildRows() was not called).

The fix is to return from SubplanNode::GetNext() if the output batch
is at capacity due to resources being transferred to it from the input
batch used to fetch from the first child.
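Sketched in simplified form (placeholder types and method names, not the
actual SubplanNode implementation):

  // Sketch: bail out of GetNext() early when the output batch is already at
  // capacity, instead of asking the second child (e.g. a NestedLoopJoinNode)
  // to append rows to a full batch and possibly return incomplete results.
  struct RowBatchSketch {
    bool at_capacity = false;
    bool AtCapacity() const { return at_capacity; }
  };

  struct SubplanNodeSketch {
    bool FetchFromFirstChild(RowBatchSketch* batch) { return true; }
    bool EvaluateSecondChild(RowBatchSketch* batch) { return true; }

    bool GetNext(RowBatchSketch* output_batch) {
      if (!FetchFromFirstChild(output_batch)) return false;
      // Resources from the input batch may have been transferred to
      // output_batch, filling it; return now and continue on the next call.
      if (output_batch->AtCapacity()) return true;
      return EvaluateSecondChild(output_batch);
    }
  };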

Change-Id: Ib97821e8457867dc0d00fd37149a3f0a75872297
Reviewed-on: http://gerrit.cloudera.org:8080/742
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-09-04 20:26:52 +00:00
Tim Armstrong
4ac7e5d15d Disable nested types tests affected by IMPALA-2295
IMPALA-2295 causes tests combining paggs/phjs with collection types to
intermittently fail because of memory corruption. This affects
non-scanner nested types tests.

Change-Id: I63893fbde87189485455cf95a7f63eb7e8aa95f3
Reviewed-on: http://gerrit.cloudera.org:8080/747
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2015-09-04 05:38:17 +00:00
Szehon Ho
5056431f29 Standardize all comparison test HS2 connections to use Impyla.
With Impyla PR#108, Impyla now supports Hive's default PLAIN auth mode.
This change gets rid of pyHS2 and standardizes all the connections to use Impyla.

Change-Id: Ifd3bd212595753ed5e0591105802ec094a41d8af
Reviewed-on: http://gerrit.cloudera.org:8080/739
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2015-09-03 23:35:26 +00:00
Casey Ching
8586ec5280 Python: Switch query generator back to psycopg2 driver
The pg8000 driver currently in use doesn't work well with autocommit
enabled. When enabled, the driver complains that only 100 rows can be
fetched because of a buffer/cache limitation. With autocommit disabled,
a bunch of code changes would be needed. For now the previous psycopg2
driver will be used again. (The psycopg2 driver was originally replaced
because dev postgres libraries are required to build it, so building the
virtualenv would fail. This patch doesn't try to build psycopg2 and
instead assumes it was already installed.)

Change-Id: I6901cb1fa109d6da907b1415601116d833d66656
Reviewed-on: http://gerrit.cloudera.org:8080/737
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Casey Ching <casey@cloudera.com>
2015-09-03 19:23:39 +00:00
Skye Wanderman-Milne
bcc73a36da Nested types: read and materialize nested types in Parquet scanner
This patch modifies the Parquet scanner to resolve nested schemas, and
read and materialize collection types. The high-level modification is
to create a CollectionColumnReader that recursively materializes map-
and array-type slots.

This patch also adds many tests, most of which query a new table
called complextypestbl. This table contains hand-generated data that
is meant to expose edge cases in the scanner. The tests mostly test
the scanner, with a few tests of other functionality (e.g. array
serialization).

I ran a local benchmark comparing this scanner code to the original
scanner code on an expanded version of tpch_parquet.lineitem with
48009720 rows. My benchmark involved selecting different numbers of
columns with a single scanner thread, and I looked at the HDFS scan
node time in the query profiles. This code introduces a 10%-20%
regression in single-threaded scan time.

Change-Id: Id27fb728934e8346444f61752c9278d8010e5f3a
Reviewed-on: http://gerrit.cloudera.org:8080/576
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-09-02 19:23:54 +00:00
Martin Grund
f927a285c6 IMPALA-1136, IMPALA-2161: Skip \u0000 characters when dealing with Avro schemas
The Avro JSON library's limitation of not handling \u0000 characters
exists to avoid problems with built-in functions like strlen() that would
report the wrong length when encountering such a character. In the case
of Impala, we currently don't support any Unicode characters, which
allows us to simply skip the \u0000 character instead of interpreting it.

It is important to say that even the most recent versions of Avro do
not support parsing \u0000 characters.
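Conceptually, the skipping amounts to something like this (a standalone sketch,
not the actual Avro or Impala code):

  #include <string>

  // Sketch: drop embedded NUL (\u0000) characters from an Avro JSON schema
  // string before handing it to strlen()-style helpers that cannot handle
  // them, rather than trying to interpret them.
  std::string StripNulChars(const std::string& schema_json) {
    std::string out;
    out.reserve(schema_json.size());
    for (char c : schema_json) {
      if (c != '\0') out.push_back(c);
    }
    return out;
  }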

Change-Id: I56dfa7f0f12979fe9705c51c751513aebce4beca
Reviewed-on: http://gerrit.cloudera.org:8080/712
Tested-by: Internal Jenkins
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
2015-09-02 00:37:28 +00:00
Martin Grund
fa5eca09c8 Disable HDFS file handle caching by default
This patch modifies the Impala command line flag:

     --max_cached_file_handles=VAL

to disable caching of HDFS file handles if VAL is 0.

In addition, it moves the existing functional tests to a custom cluster
test and keeps a sanity check for no caching in the original
place. Furthermore, it will check that no file handles are leaked.

Change-Id: Ic36168bba52346674f57639e1ac216fd531b0fad
Reviewed-on: http://gerrit.cloudera.org:8080/691
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-08-27 23:34:30 +00:00
Martin Grund
60c5140ea7 IMPALA-1983: Warn if table stats are potentially corrupt.
When the `numRows` parameter stored in the table properties is
erroneously set to 0 and a number of non-empty files are present,
the table statistics are considered to be corrupt.

To hint that there might be a problem, the explain statement will emit
an additional warning if it detects potentially corrupt table stats, as
in the following example:

  Estimated Per-Host Requirements: Memory=42.00MB VCores=1
  WARNING: The following tables have potentially corrupt table and/or
  column statistics.
  compute_stats_db.corrupted

  03:AGGREGATE [FINALIZE]
  |  output: count:merge(*)
  |
  02:EXCHANGE [UNPARTITIONED]
  |
  01:AGGREGATE
  |  output: count(*)
  |
  00:SCAN HDFS [compute_stats_db.corrupted]
     partitions=1/2 files=1 size=24B

In addition, the small query optimization is disabled for such queries.

Change-Id: I0fa911f5132aa62195b854248663a94dcd8b14de
Reviewed-on: http://gerrit.cloudera.org:8080/689
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
2015-08-26 22:19:33 +00:00
Szehon Ho
dcad11c3af Support Parquet file format for the data generator when running against Hive
Also add DDL logging to facilitate debugging.

Change-Id: If7600677ed8c491b68468cae4ddf5394499576ca
Reviewed-on: http://gerrit.cloudera.org:8080/688
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2015-08-26 10:10:32 +00:00
Taras Bobrovytsky
b8b7930377 Add nested types support to Create Table Like File
Add support for creating a table based on a parquet file which contains arrays,
structs and/or maps.

Change-Id: I56259d53a3d9b82f318228e864c783b48a03f9ae
Reviewed-on: http://gerrit.cloudera.org:8080/582
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-08-22 01:46:26 +00:00
Szehon Ho
5787dc2cf3 First commit to run the random query generator on Hive.
With this change, random query generator can run continuously on Hive
and approximately half of its generated queries are able to run.

1. Connect timeout from Impyla to HS2 was too small,
increasing it to match Impala's.
2. Query timeout to wait for Hive queries was too short,
making it configurable so we can play with different values.
3. Hive does not support 'with' clause in subquery,
but interestingly supports it at the top-level.
Added a profile flag "use_nested_with" to disable nested with's.
4. Hive does not support 'having' without 'group by'.
Added a profile flag "use_having_without_groupby" to always
generate a group by with having.
5. Hive does not support "interval" keyword for timestamp.
Added a profile 'restrict' list to restrict certain functions,
and added 'dateAdd' to this list for Hive.
6. Hive's 'greatest' and 'least' UDFs do not do implicit type casting
like other databases do.  Modified the query generator to only choose args of
the same type for these, and HiveSqlWriter to add a cast, as there
were still some lingering issues like UDFs on int returning bigint.
7. Hive always orders NULLs first in ORDER BY ASC,
opposite to other databases,
and does not have any 'NULLS FIRST' or 'NULLS LAST' option.
Thus the only workaround is to add a "nulls_order_asc" flag
to the profile, and pass it in to the ref database's SqlWriter
to generate the 'NULLS FIRST' or 'NULLS LAST' statement on that end.
8. Hive strangely does not support multiple sort keys in a window
without frame specification.  The workaround is for HiveSqlWriter
to add 'rows unbounded preceding' to specify the default frame if
there are no existing frames.

Change-Id: I2a5b07e37378f695de1b50af49845283468b4f0f
Reviewed-on: http://gerrit.cloudera.org:8080/619
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2015-08-21 08:19:04 +00:00
Vlad Berindei
e4c42fa8bf IMPALA-595: Add CASCADE to DROP DATABASE and use it in cleanup_db
Change-Id: Idfa5b6943bc797e10d542487c31b8f1b527d8c97
Reviewed-on: http://gerrit.cloudera.org:8080/635
Reviewed-by: Vlad Berindei <vlad.berindei@cloudera.com>
Tested-by: Internal Jenkins
2015-08-20 03:34:31 +00:00
Skye Wanderman-Milne
7906ed44ac IMPALA-2015: Add support for nested loop join
Implement nested-loop join in Impala with support for multiple join
modes, including inner, outer, semi and anti joins. Null-aware left
anti-join is not currently supported.

Summary of changes:
Introduced the NestedLoopJoinNode class in the FE that represents the nested
loop join. Common functionality between NestedLoopJoinNode and HashJoinNode
(e.g. cardinality estimation) was moved to the JoinNode class.
In the BE, introduced the NestedLoopJoinNode class that implements the nested-loop
join execution strategy.
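The core execution strategy is the classic nested loop (a generic sketch of the
inner-join case, not the Impala backend implementation):

  #include <functional>
  #include <utility>
  #include <vector>

  // Generic nested-loop join: for every probe row, scan all buffered build
  // rows and emit each matching pair. The outer, semi and anti modes differ
  // only in how matched/unmatched rows are emitted.
  template <typename Row>
  std::vector<std::pair<Row, Row>> NestedLoopInnerJoin(
      const std::vector<Row>& probe_rows, const std::vector<Row>& build_rows,
      const std::function<bool(const Row&, const Row&)>& matches) {
    std::vector<std::pair<Row, Row>> result;
    for (const Row& probe : probe_rows) {
      for (const Row& build : build_rows) {
        if (matches(probe, build)) result.emplace_back(probe, build);
      }
    }
    return result;
  }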

Change-Id: I238ec7dc0080f661847e5e1b84e30d61c3b0bb5c
Reviewed-on: http://gerrit.cloudera.org:8080/652
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2015-08-19 08:40:14 +00:00
Tim Armstrong
5350d49f8c IMPALA-1829: UDAs with different intermediate type
Previously the frontend rejected UDAs whose intermediate type differs
from their result type. The backend supports these, so this change
enables support in the frontend and adds tests.

This patch adds a test UDA function with different intermediate type and
a simple end-to-end test that exercises it. It modifies an existing
unused test UDA that used a currently unsupported intermediate type -
BufferVal.
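For illustration, here is a UDA in the style of the Impala UDF/UDA samples in
which the intermediate type (a StringVal wrapping a fixed-size struct) differs
from the result type (DoubleVal); the names and details are illustrative and
may differ from the test UDA added by this patch:

  #include <impala_udf/udf.h>

  using namespace impala_udf;

  // The intermediate value is a StringVal holding this struct; the final
  // result returned to the query is a DoubleVal.
  struct AvgState {
    double sum;
    int64_t count;
  };

  void AvgInit(FunctionContext* ctx, StringVal* dst) {
    dst->is_null = false;
    dst->len = sizeof(AvgState);
    dst->ptr = ctx->Allocate(dst->len);
    *reinterpret_cast<AvgState*>(dst->ptr) = {0.0, 0};
  }

  void AvgUpdate(FunctionContext* ctx, const DoubleVal& input, StringVal* dst) {
    if (input.is_null) return;
    AvgState* state = reinterpret_cast<AvgState*>(dst->ptr);
    state->sum += input.val;
    ++state->count;
  }

  void AvgMerge(FunctionContext* ctx, const StringVal& src, StringVal* dst) {
    const AvgState* src_state = reinterpret_cast<const AvgState*>(src.ptr);
    AvgState* dst_state = reinterpret_cast<AvgState*>(dst->ptr);
    dst_state->sum += src_state->sum;
    dst_state->count += src_state->count;
  }

  DoubleVal AvgFinalize(FunctionContext* ctx, const StringVal& src) {
    const AvgState* state = reinterpret_cast<const AvgState*>(src.ptr);
    double result = state->count == 0 ? 0.0 : state->sum / state->count;
    ctx->Free(src.ptr);
    return DoubleVal(result);
  }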

Change-Id: I5675ec7f275ea698c24ea8e92de7f469a950df83
Reviewed-on: http://gerrit.cloudera.org:8080/655
Tested-by: Internal Jenkins
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2015-08-19 04:37:39 +00:00
Henry Robinson
4eb2754924 IMPALA-2007: Fix race in test_statestore.test_topic_persistence
test_topic_persistence simulates a failing subscriber by sending updates
for one persistent and one transient topic to the statestore, and then
having the subscriber kill itself by closing its connections to the
statestore. Another subscriber then registers and checks that the
persistent topic entries are there, but the transient ones are not.

This patch fixes a race in that test where the first subscriber may
forcibly close its connections before it has sent the topic updates,
leading to a failure when the second subscriber checks the topic
contents. This happens because the 'kill' thread is notified at the
point that the RPC thread is leaving the RPC implementation, but before
any network response has been sent, so the kill thread can race to close
the TCP connection before the response actually makes it to the
statestore.

The easy fix is to force the subscriber to wait for 2 updates from the
statestore rather than 1 before terminating - this ensures that the
original response completes before the connections are closed.

Before this patch, the test would fail within ten minutes. After, it has
yet to fail in an hour of continuous testing.

Change-Id: I5d464d5781ed0e27220f3e826609493893a052aa
Reviewed-on: http://gerrit.cloudera.org:8080/649
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
2015-08-18 01:29:45 +00:00
Casey Ching
669290c513 Python: Switch postgres driver
The old driver (psycopg2) requires some development packages to be
installed. The new driver (pg8000) is pure Python, so it's much
easier to set up. The new driver is potentially slower, but we don't
need much performance from the postgres driver.

Change-Id: Iea743b53b20e9bdf405be595ab1cac35763f120b
Reviewed-on: http://gerrit.cloudera.org:8080/653
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Casey Ching <casey@cloudera.com>
2015-08-17 20:45:35 +00:00
Casey Ching
facedb2aa5 Add stress test for TPC queries running against a cluster
This will run concurrent TPC-DS/H queries against a CM managed cluster.

Stress test outline (and notes):
 1) Get a set of queries. TPCH and/or TPCDS queries will be used.
    TODO: Add randomly generated queries.
 2) For each query, run it individually to find:
     a) Minimum mem limit to avoid spilling
     b) Minimum mem limit to successfully run the query (spilling allowed)
     c) Runtime when no mem was spilled
     d) Runtime when mem was spilled
     e) A row order independent hash of the result set.
    This is a slow process so the results will be written to disk for reuse.
 3) Find the memory available to Impalad. This will be done by finding the
    minimum memory available across all impalads (-mem_limit startup option).
    Ideally, for maximum stress, all impalads will have the same memory
    configuration but this is not required.
 4) Optionally, set an amount of memory that can be overcommitted.
 5) Start submitting queries. There are two modes for throttling the number
    of concurrent queries:
     a) Submit queries until all available memory (as determined by items 3
        and 4) is used. Before running the query a query mem limit is set
        between 2a and 2b. (There is a runtime option to increase the
        likelihood that a query will be given the full 2a limit to avoid
        spilling.)
     b) TODO: Use admission control.
 6) Randomly cancel queries to test cancellation. There is a runtime option
    to control the likelihood that a query will be randomly canceled.
 7) Cancel long running queries. Queries that run longer than some expected
    time, determined by the number of queries currently running, will be
    canceled.
    TODO: Collect stacks of timed out queries and add reporting.
 8) If a query errored, verify that memory was overcommitted during execution
    and the error is a mem limit exceeded error. There is no other reason a
    query should error and any such error will cause the stress test to stop.
    TODO: Handle crashes -- collect core dumps and restart Impala
    TODO: Handle client connectivity timeouts -- retry a few times
 9) Verify the result set hash of successful queries.

Change-Id: I4bd7f8a7cc65d5ae910a33afba59135040a99061
Reviewed-on: http://gerrit.cloudera.org:8080/474
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Casey Ching <casey@cloudera.com>
2015-08-15 23:10:25 +00:00
Casey Ching
a4fe24c1b2 Python: Add more logging and CM options to common CLI parser
Example output of --help:

Options:
  --debug-log-file=DEBUG_LOG_FILE
                        Path to debug log file. [default:
                        /tmp/concurrent_select.py.log]
  --cm-host=host name   The host name of the CM server.
  --cm-port=port number
                        The port of the CM server. [default: 7180]
  --cm-user=user name   The name of the CM user. [default: admin]
  --cm-password=password
                        The password for the CM user. [default: admin]
  --cm-cluster-name=name
                        If CM manages multiple clusters, use this to
                        specify which cluster to use.

Change-Id: I614383f4a65e700348572204e3d8fd5670f5bcf7
Reviewed-on: http://gerrit.cloudera.org:8080/472
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Casey Ching <casey@cloudera.com>
2015-08-15 23:10:10 +00:00
Alex Behm
dd88b3b465 IMPALA-2201: Unconditionally update the partition stats and row count.
Before this patch, we used to only send alterPartition() requests to
the Hive Metastore for partitions whose stats have changed during
COMPUTE [INCREMENTAL] STATS. However, there is other state associated
with the stats like the STATS_GENERATED_VIA_STATS_TASK that was not
properly handled. Not updating this additional partition metadata
was the root cause of IMPALA-2201.

This patch changes COMPUTE [INCREMENTAL] STATS to unconditionally update the
partition stats and row counts in the Hive Metastore, even if the partition
already has identical stats. This behavior results in possibly redundant work,
but it is predictable and easy to reason about because it does not depend on
the existing state of the metadata.

Note that in versions starting from CDH 5.4 it is not possible to reproduce
IMPALA-2201 because of a behavioral change in the Hive Metastore in the
alterPartition() code path.

Change-Id: I10105d8d6306d9ad9988b03abc23752d7bc98252
Reviewed-on: http://gerrit.cloudera.org:8080/640
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-08-14 23:33:20 +00:00
Tim Armstrong
1d2afcfec2 IMPALA-2079: Part 1: report non-writable scratch dirs at startup
Previously Impala could erroneously decide to use non-writable scratch
directories, e.g. if /tmp/impala-scratch already exists and is not
writable by the current user.

With this change, if we cannot remove and recreate a fresh scratch directory,
it is not used.  If we have no valid scratch directories, we log an
error and continue startup.

Add unit test for CreateDirectory to test behavior for success and
failure cases.

Add system tests to check logging and query execution in various
scenarios where we do not have scratch available.

Modify FilesystemUtil to use non-exception-throwing Boost functions to
avoid unhandled exceptions escaping into the rest of the Impala
codebase, which does not expect the use of exceptions.

Change-Id: Icaa8429051942424e1d811c54bde10102ac7f7b3
Reviewed-on: http://gerrit.cloudera.org:8080/565
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2015-08-14 00:38:22 +00:00
Dimitris Tsirogiannis
47c5ae405a Revert "IMPALA-2015: Add support for nested loop join"
This reverts commit 6837cdec7f6a7e1c7e8157e323f3ab68277689aa.

Change-Id: I2fd6424c553a701fcbfd425b4486af7280820b23
Reviewed-on: http://gerrit.cloudera.org:8080/636
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-08-13 02:20:07 +00:00
Skye Wanderman-Milne
f8134ff133 IMPALA-2187: Run py.test through impala python env.
The symptom of this bug was that we were seeing "ValueError: bad marshal data"
when trying to import from tests.hs2.test_hs2 during custom cluster tests.

The problem was that we were not running the custom cluster tests through the
new Impala Python virtualenv.

Some tests (properly running with the virtualenv) that run before the custom
cluster tests had caused the generation of pyc files for tests.hs2.test_hs2.
Those pyc files then appeared corrupted when executing the custom cluster
tests because the default python env runs a different version than the
virtualenv those pyc files were generated from in earlier tests.

Change-Id: Ie9d8f90c65921247dd885804165f9b7271ea807b
Reviewed-on: http://gerrit.cloudera.org:8080/618
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2015-08-09 06:17:48 +00:00
Skye Wanderman-Milne
f000758ca8 IMPALA-2015: Add support for nested loop join
Implement nested-loop join in Impala with support for multiple join
modes, including inner, outer, semi and anti joins. Null-aware left
anti-join is not currently supported.

Summary of changes:
Introduced the NestedLoopJoinNode class in the FE that represents the nested
loop join. Common functionality between NestedLoopJoinNode and HashJoinNode
(e.g. cardinality estimation) was moved to the JoinNode class.
In the BE, introduced the NestedLoopJoinNode class that implements the nested-loop
join execution strategy.

Change-Id: Id65a1aae84335bba53f06339bdfa64a1b0be079e
Reviewed-on: http://gerrit.cloudera.org:8080/457
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2015-08-07 02:47:32 +00:00
Casey Ching
d202d6a967 Use "impala-python" (virtualenv) instead of system python
Python tests and infra scripts will now use "python" from the virtualenv
via $IMPALA_HOME/bin/impala-python. Some scripts could be simplified now
that python 2.6 and a dependable set of third-party libraries are
available but that is not done as part of this commit.

Change-Id: If1cf96898d6350e78ea107b9026b12ba63a4162f
Reviewed-on: http://gerrit.cloudera.org:8080/603
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2015-08-06 02:09:09 +00:00
Ippokratis Pandis
adac8b79bc IMPALA-1933: Fixing an error check in parquet scanner
The HdfsParquetScanner would exit with the wrong error, stating that it read
fewer rows than stated in the metadata of the file, when the
ReadValue() call actually failed with a memory limit exceeded error.
One effect of this wrong error reporting was that tests
like test_mem_usage_scaling would sometimes fail, especially under
ASAN.

With this patch the Parquet scanner checks whether the memory limit was
exceeded before checking the difference between the number of rows
read and the number of expected rows according to the metadata.
This patch also adds another value to the test_mem_usage_scaling test;
that value (20MB) would normally trigger this false error.
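In sketch form, the reordered check looks roughly like this (hypothetical names
and simplified error handling, not the actual scanner code):

  #include <cstdint>
  #include <stdexcept>
  #include <string>

  // Sketch: surface the real parse error (e.g. memory limit exceeded from
  // ReadValue()) before comparing the rows actually read against the file
  // metadata, so we don't report a misleading "fewer rows than expected"
  // error.
  void CheckScanCompletion(bool parse_status_ok, const std::string& parse_error,
                           int64_t rows_read, int64_t metadata_num_rows) {
    if (!parse_status_ok) throw std::runtime_error(parse_error);
    if (rows_read != metadata_num_rows) {
      throw std::runtime_error("Read " + std::to_string(rows_read) +
                               " rows but file metadata states " +
                               std::to_string(metadata_num_rows) + ".");
    }
  }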

Change-Id: Iad008d7af1993b88ac4dc055f595cfdbc62a6b79
Reviewed-on: http://gerrit.cloudera.org:8080/557
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
2015-08-05 12:33:52 +00:00
Casey Ching
074e5b4349 Remove hashbang from non-script python files
Many Python files had a hashbang and the executable bit set even though
they were not intended to be run as standalone scripts. That makes
determining which Python files are actually scripts very difficult.
A future patch will update the hashbang in real Python scripts so they
use $IMPALA_HOME/bin/impala-python.

Change-Id: I04eafdc73201feefe65b85817a00474e182ec2ba
Reviewed-on: http://gerrit.cloudera.org:8080/599
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2015-08-04 05:26:07 +00:00
Matthew Jacobs
891cdf1830 Close pytest beeswax queries with exceptions
The beeswax interface in the test infrastructure was not
closing queries that encountered exceptions. This was
problematic because failed queries would remain open, and
due to IMPALA-2060, resources wouldn't be released. If
admission control or RM is enabled, the test run may
eventually fail if resources continue to be held.
Regardless, failed queries should be closed.

Change-Id: I5077023b1d95d1ce45a92009666448fdc6e83542
Reviewed-on: http://gerrit.cloudera.org:8080/530
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2015-08-03 20:45:05 +00:00
Henry Robinson
621205ebbc IMPALA-2143: Avoid sending auth credentials over insecure connections
This patch changes the behaviour of the Impala shell to refuse to
attempt an LDAP-authenticated connection to Impala unless SSL/TLS is
configured.

A new flag --auth_creds_in_clear_ok is added to suppress this
behaviour. This is similar to Impala's --ldap_passwords_in_clear_ok
flag. The shell will also now print a warning if an insecure
configuration is used.

Change-Id: Ide25d8dd881a61b9f08900112466c430da64a038
Reviewed-on: http://gerrit.cloudera.org:8080/546
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2015-07-30 07:15:29 +00:00