Commit Graph

194 Commits

Author SHA1 Message Date
Matthew Jacobs
65c1a6f21e Remove SOURCE keyword by parsing as an identifier and checking the value
Reverts "IMPALA-1033: Remove SOURCE keyword; very common identifier"

Change-Id: I3fcf6d02786e00287b564cff0a823d0c19504e7a
2014-06-30 16:47:47 -07:00
Dimitris Tsirogiannis
6a795915d6 Fix loading data from snapshopt for alltypesagg table.
The alltypesagg table was not loaded correctly from a snapshot file due
to a missing ALTER TABLE statement, thereby causing some tests to fail.

Change-Id: I74066a99529f24fc268bb5779d3fb64fbd4f66b9
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3248
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3270
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
2014-06-25 21:52:11 -07:00
Victor Bittorf
2d7f2e19b2 IMPALA 938: Infer schema from Parquet file
Syntax is "CREATE TABLE name LIKE fileformat '/path/to/file'".
Supports all options that CREATE TABLE does. Currently only PARQUET is supported.
Run testdata/bin/create-load-data.sh after pulling this patch.

Change-Id: Ibb9fbb89dbde6acceb850b914c48d12f22b33f55
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2720
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3158
2014-06-20 17:38:01 -07:00
ishaan
99602fb8c2 Force load data if the current HEAD has a schema change.
This patch checks the test-warehouse's stored githash (if it exists) to determine if the
current patch has changed the schema if a table. If a change is detected, we force load
all the data.

Change-Id: I314f9f3364d3e6b2d66de38a9e6d9f57c4e279a7
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3049
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-06-19 02:25:50 -07:00
Matthew Jacobs
f5da019555 IMPALA-1025: Use converse of data source predicate operators if expr has val before slot
Change-Id: I31790c037e2fa9af7b80c01014f7507ba5053e63
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2925
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
2014-06-09 23:54:09 -07:00
Matthew Jacobs
89ec6b3d7a IMPALA-1033: Remove SOURCE keyword; very common identifier
The SOURCE keyword was introduced for DATA SOURCE ddl commands, but
it is also a very common identifier. This removes the SOURCE and
SOURCES keywords and instead uses DATASOURCE and DATASOURCES.

Change-Id: Ic6c2897d1e23efa169aa8787752fe4aa2bb125d5
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2895
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 267c13f9b46d249bfd1b8711fd3fadf6853dc1ef)
2014-06-09 17:17:14 -07:00
Nong Li
5d80942d42 [CDH5] IMPALA-1019: Fix cancellation path in io mgr for cached reads.
Change-Id: I11efd65d1efa900f79afe88b781262a44ac5006a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2703
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-05-30 19:14:39 -07:00
Lenni Kuff
c45e9a70d9 [CDH5] Add DDL support for HDFS caching
This change adds DDL support for HDFS caching. The DDL allows the user to indicate a
table or partition should be cached and which pool to cache the data into:
* Create a cached table: CREATE TABLE ... CACHED IN 'poolName'
* Cache a table/partition: ALTER TABLE ... [partitionSpec] SET CACHED IN 'poolName'
* Uncache a table/partition: ALTER TABLE ... [partitionSpec] SET UNCACHED

When a table/partition is marked as cached, a new HDFS caching request is submitted
to cache the location (HDFS path) of the table/partition and the ID of that request
is stored with in the table metadata (in the table properties). This is stored as:
'cache_directive_id'='<requestId>'. The cache requests and IDs are managed by HDFS
and persisted across HDFS restarts.

When a cached table or partition is dropped it is important to uncache the cached data
(drop the associated cache request). For partitioned tables, this means dropping all
cache requests from all cached partitions in the table.
Likewise, if a partitioned table is created as cached, new partitions should be marked
as cached by default.

It is desirable to know which cache pools exists early on (in analysis) so the query
will fail without hitting HDFS/CatalogServer if a non-existent pool is specified. To
support this, a new cache pool catalog object type was introduced. The catalog server
caches the known pools (periodically refreshing the cache) and sends the known pools out
in catalog updates. This allows impalads to perform analysis checks on cache pool
existence going to HDFS. It would be easy to use this to add basic cache pool management
in the future (ADD/DROP/SHOW CACHE POOL).

Waiting for the table/partition to become cached may take a long time. Instead of
blocking the user from access the time during this period we will wait for the cache
requests to complete in the background and once they have finished the table metadata
will be automatically refreshed.

Change-Id: I1de9c6e25b2a3bdc09edebda5510206eda3dd89b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2310
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
2014-05-27 16:47:15 -07:00
Skye Wanderman-Milne
1dff1686aa Add option to build UDF test libs in copy-udfs-udas.sh
The option is off by default, but useful for running this script
without building the world.

Change-Id: I82d8251cf9bb2763ce69094da1995a4d6ceff167
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2647
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
(cherry picked from commit a7f77643820dcbfbab231a9260c94450564bd2df)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2659
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
2014-05-22 18:01:55 -07:00
Matthew Jacobs
6ccd56bc1f Enforce slot equivalences at data source scan nodes
Change-Id: I2ed606ba398990ab05afa3301b6356c6a636e2bb
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2521
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 55061f6953956f45d433fe227ded539a648e3f9c)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2536
2014-05-19 14:37:44 -07:00
Dimitris Tsirogiannis
a7a9cde86f CDH-18969: Incorrect query result in Impala
This commit fixes issue CDH-18969 where Impala returns wrong results
when querying an HBase table. This issue is triggered when a column family
sorts lexicographically before ":key", which is the column family of the
row key, thereby causing the wrong column to be used as a row key by the
backend.

The following changes are included:
1. Modified the load function in HBaseTable.java to make sure the
catalog object of an HBase table always stores the row key column first.

Change-Id: Icd7ebc973d81672c04d5c7c8bbabd813338d5eac
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2513
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2602
2014-05-18 16:29:11 -07:00
Skye Wanderman-Milne
edbbe6035e Decimal: read from Avro
Allows reading decimal columns with or without codegen. Includes tests
based on a data file posted on HIVE-5823.

Change-Id: Ie541c6b98bd24543691850cb45a434af60b5a5a6
(cherry picked from commit 6983dcefdf70cce14724e17d03bc061ffb8f671c)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2596
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
2014-05-16 22:26:11 -07:00
Lenni Kuff
61cbdd4f49 [CDH5] Add Sentry Service to local test environment
Adds the ability to start/stop the Sentry Service to our local test environment and
load the sentry-site.xml configs. Since the existing Sentry startup scripts don't work
I wrote a simple wrapper to handle service startup.

Change-Id: I1b77a2e50e51e6e6eae58cfed4d5d7c403dbc0b4
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2540
Tested-by: jenkins
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
2014-05-14 12:02:02 -07:00
Matthew Jacobs
ebc6c5894e External Data Source: Frontend and catalog changes
Initial frontend and catalog changes for external data sources.

Change-Id: Ia0e61ef97cfd7a4e138ef555c17f2e45bbf08c18
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2224
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit dfa14c828957f751db9c89bae0bdc040ce6f648c)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2485
2014-05-08 14:56:19 -07:00
ishaan
50caed17d7 [CDH5] Fix the format option in run-all
Previously, the -format option was a no-op. Moreoever, run-all would not work without the
option. This patch fixes both problems.

Change-Id: I4726c03452409322fd0cd864cdb6dd395c4e651a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2449
Tested-by: jenkins
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
2014-05-07 22:56:38 -07:00
Lenni Kuff
13c794db91 [CDH5] Update dependency versions to CDH5.1.0
This just updates the versions, it doesn't touch anything in /thirdparty.
Change parquet version to append SNAPSHOT
Added hadoop-hbase-compat jar in AUX_CLASSPATH and mapreduce/*.jar to HDFS

Change-Id: I4471ef4476997371cf49a9d54cfa63f2fda126e4
2014-05-07 15:10:40 -07:00
Nong Li
03e5665e56 Decimal: Read/Write to parquet.
This adds support for the FIXED_LENGTH_BYTE_ARRAY parquet type and
encoding for decimals.

Change-Id: I9d5780feb4530989b568ec8d168cbdc32b7039bd
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1727
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2432
2014-05-02 16:38:35 -07:00
Skye Wanderman-Milne
60db4d4d82 CDH-18416: Don't inline ReadWriteUtil::ReadZLong()
For wide Avro tables, ReadZLong() would get inlined many times into a
single function body, causing LLVM to crash. Not inlining doesn't seem
to have a performance impact on narrow tables, and helps with wide
tables.

This change also adds tests over wide (i.e. many-column) tables. The
test tables are produced by specifying shell commands to generate test
tables in functional_schema_template.sql, which are executed in
generate-schema-statements.py. In the SQL templates, sections starting
with a ` are treated as shell commands. The output of the shell
command is then used as the section text. This is only a starting
point; it isn't currently implemented for all sections, and may have
to be tweaked if we use this mechanism for all tables.

Change-Id: Ife0d857d19b21534167a34c8bc06bc70bef34910
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2206
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
(cherry picked from commit 1c5951e3cce25a048208ab9bb3a3aed95e41cf67)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2353
Tested-by: jenkins
2014-04-28 15:58:15 -07:00
casey
2351266d0e Replace single process mini-dfs with multiple processes
This should allow individual service components, such as a single nodemanager,
to be shutdown for failure testing. The mini-cluster bundled with hadoop is a
single process that does not expose the ability to control individual roles.
Now each role can be controlled and configured independently of the others.

Change-Id: Ic1d42e024226c6867e79916464d184fce886d783
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1432
Tested-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2297
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-04-23 18:24:05 -07:00
Nong Li
87295a4e06 Decimal implementation.
This patch implements decimal support for text based formats.

Change-Id: I8e2c9e512ed149fe965216a72cb21fffd4f18e75
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1669
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2238
Tested-by: jenkins
2014-04-14 21:07:32 -07:00
Lenni Kuff
aa0b7a35f5 IMPALA-880: COMPUTE STATS should update partitions in batches
When updating partition metadata as part of COMPUTE STATS we would previously
attempt to update all partitions at once. This could lead to HMS socket timeouts
and also could run into issues if there were > 32K partitions.

In this change we now update the partitions in batches, with a max size of 500
partitions per batch. We also compare whether the row count has changed and only
update partitions that have been modified.

Change-Id: If7bfcc30f86fc2fdd79855b981067ac29a47b5e1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1913
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1918
2014-03-14 19:20:12 -07:00
ishaan
9e043e862c Fix run-hbase.sh to correctly pick up the classpath.
We run wat-for-hbase-master.py after starting hbase to account for a race between
the master and region server. This script has not been working for some time. It caused
no ill effects sinc the said race was absent. However, the race has manifested itself
again, so the script needs to be fixed. Setting the correct classpath does so.

Change-Id: I783a7473cfd24a9cb66711f5428f7052ceb96282
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1756
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-03-05 01:04:56 -08:00
ishaan
00724a47da Prefix the path to the local core-site to the classpath used by minillama
With a recent upstream change, a core-site.xml was introduced in a YARN test jar pulled in
by thirdparty. This causes MiniLlama to ignore options set in
fe/src/test/resources/core-site.xml. The problem manifests itself with the MiniDfsCluster
starting on an arbitary port, but it would have also caused a lot of tests to fail as none
of the compression codecs are pulled in. This change prepends the classpath used by
minillama with the path to the internal core-site.

Change-Id: Iee267fe12e02301baec059a1f7469288c038d6fa
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1739
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-03-04 09:59:50 -08:00
Lenni Kuff
bf16b5cd0d IMPALA-749: Fetch partitions in batches, rather than all at once.
This updates how Impala fetches partition metadata from the Hive Metastore to fetch
partitions in batches, rather than all at once. This helps reduce the load on the
HMS and also lets Impala scale to above 32K partitions. The downside is that it
may require additional RPCs to get all the partitions.

This is done by first querying the metastore to get all the partition names that
exist, then splitting the list of names into seperate batches to get the actual
partition metadata.

Impala uses a default size of 1000 partitions per batch, but it can be configured
by setting the 'hive.metastore.batch.retrieve.table.partition.max' parameter
in the hive-site.xml config file.

Change-Id: Ide0ec30ef8a9e00f79c26551aa8e5e7814c73034
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1662
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1698
2014-02-28 22:30:45 -08:00
Alex Behm
9cabee4a71 Wait for the Metastore to come up before starting HiveServer2.
Change-Id: Ic8e29efe63f6745e1ff44248657cbd7882bb16d9
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1626
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1670
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
2014-02-25 21:05:33 -08:00
Alex Behm
8223e1e44b Avoid Hive replication bug (CDH-17414) by 'warming up' HiveServer2 after it starts.
The purpose of this patch is to avoid CDH-17414 which causes data files loaded
with Hive to incorrectly have a replication factor of 1. When using beeline
this problem only appears to occur immediately after creating the first HBase table
since starting HiveServer2, i.e., subsequent loads seem to function correctly.
This patch add a new script that creates an external HBase table in Hive to
'warm up' HiveServer2 immediately after it is started.
Subsequent loads should assign a correct replication factor.

Change-Id: Ic54c9401b67b748a8848d19f82b8e7df9535e845
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1640
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
2014-02-25 17:33:53 -08:00
Lenni Kuff
b4f5c1edcf Enable lazy loading of table metadata for the CatalogService/Impalad
This change adds support for lazy loading of table metadata to the
CatalogService/Impalad. The way this works is that the CatalogService initially
sends out an update with only the databases and table names (wrapped as
IncompleteTables). When an Impalad encounters one of these tables, it will contact
the catalog service to get the metadata, possibly triggering a metadata load if the
catalog server has not yet loaded this table.

With these changes the catalog server starts up in just seconds, even for large
metastores since it only needs to call into the metastore to get the list of tables
and databases. The performance of "invalidate metadata" also improves for the same reason.

I also picked up the catalog cleanup patch I had to make the APIs a bit more consistent and
remove the need for using a LoadingCache for databases.

This also fixes up the FE tests to run in a more realistic fashion. The FE tests now run
against catalog object recieved from the catalog server. This actually turned up some bugs
in our previous test configuration where we were not running with the correct column stats
(we were always running with avgSerializedSize = slotSize).  This changed some plans so the
planner tests needed to be updated.

Still TODO:
This does not include the changes to perform background metadata loading. I will send
that out as a separate patch on top of this.

Change-Id: Ied16f8a7f3a3393e89d6bfea78f0ba708d0ddd0e

Saving changes

Change-Id: I48c34408826b7396004177f5fc61a9523e664acc
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1328
Tested-by: jenkins
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1338
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-21 21:43:29 -08:00
Nong Li
04b501d3a1 [CDH5] Collect metadata for cached blocks.
Change-Id: I81026de2f9a08553dc15e07090b8297120aa7462
(cherry picked from commit 69414f67b20016e49b739a46d6e2b4b57e1d1a3c)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1252
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-15 15:12:20 -08:00
Nong Li
53d7bbb97a [CDH5] Impala changes for updated thirdparty components.
Changes include:
  - version changes in impala-config
  - version changes in various loading scripts
  - hbase jars are no longer in hive/lib
  - mini-llama script changes
  - updates due to sentry api changes
  - JDBC tests disabled
  - unsupported types tests disabled.

Change-Id: If8cf1b7ad8e22aa4d23094b9a4b1047f7e9d93ee
2014-01-15 15:12:13 -08:00
Alex Behm
c70905628b Using MiniLlama's --write-hdfs-conf to dump the MiniDfs conf for our test setup.
Change-Id: I238f375bda4ef95fa3d5ae9a29bd1dfc2aa3e401
2014-01-15 15:12:06 -08:00
Alex Behm
760750af27 Enforcing reserved memory resources via mem limits.
Fixed codepath with rm disabled. Set enable_rm to false by default.

Change-Id: I3bf2d0525d91243ec3c0ea048b0c03680befcda2

Conflicts:
	be/src/runtime/runtime-state.cc
2014-01-15 15:12:05 -08:00
Alex Behm
dc7b398bd3 Impala reserves resources from YARN via LLama.
Impala reserves resources from YARN via Llama and handles resources
preemptions by cancelling affected queries. Adds the Impala Resource
Broker for interacting with Llama. Refactors scheduler and coordinator
to move fragment-to-host assignment logic into scheduler. Local test
setup uses MiniLLama.

Change-Id: Ic7b0fe43de52d30f4207b4e65cce7e6a294e54e1
2014-01-15 15:12:04 -08:00
Alex Behm
fc6ecd39e5 [CDH5] Fixed issue with data loading using JDK7 and Hive (HIVE-5068). Fixed missing dependency in testdata for HBase region splitting.
Change-Id: Iab002f652bc1b1c2f8ce60b7505f592eedcb9cc0
2014-01-15 15:11:32 -08:00
Alex Behm
60003ad211 [CDH5] Changes to make Impala work on CDH5. Mostly fixing up dependency versions. Minor code changes to address HBase API changes.
Change-Id: Icbbeb13eefa29e38286328d45600117a383cd106
2014-01-15 15:11:23 -08:00
Skye Wanderman-Milne
561da008c7 IMPALA-729: fix resource management in Parquet scanner for multiple row groups
We weren't attaching resources to the row batch when starting a new
row group, so it was possible for string data to be overwritten. This
patch removes CloseStreams() and merges its functionality with
AttachCompletedResources() so it's not possible to destroy streams
without transferring the resources first. It also merges and removes
ScannerContext::Close().

Also adds test cases for IMPALA-720.

Change-Id: Ia8f40c7d39d8702716f1d337fe797e2696bd0fcb
2014-01-08 10:56:26 -08:00
Lenni Kuff
fbe79fc47b Use separate log files for each of our mini-cluster services
Also adds a bit more logging on which individual services are starting.

Change-Id: I53f12e1825fbf738e2fb8325874c3126e55f3f44
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1147
Tested-by: jenkins
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:54:37 -08:00
Alex Behm
c6397ca1e3 Revert "Revert to FROM-clause order if any table is lacking stats."
This reverts commit 7e84cbe3bab9bf30a57ac58d9ef525ebc10a7b7a.

Change-Id: I89d55ca2bcb8eb6eddc244d3e7b005074d04c26a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1104
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:29 -08:00
Alex Behm
df0b28d163 Revert to FROM-clause order if any table is lacking stats.
Change-Id: I7d09c0f393e2bfeefa386845fc6bbba4ab6c8812
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1095
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:28 -08:00
Skye Wanderman-Milne
9e17042185 Allow zero bit width dict/RLE decoders.
This allows us to read single-value dictionary-encoded columns
generated by parquet-mr.

Change-Id: I80903d910d0cc3a3e4ebf02e34212d868e94feb4
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1098
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:27 -08:00
Skye Wanderman-Milne
de531e15bd IMPALA-694: Allow Impala to read files produced by parquet-mr version <= 1.2.8
parquet-mr had a bug where it didn't include the dictionary page's
header in the total column size. We now compensate for this by
detecting these files and padding the scan range length. This required
changing how the scanner detects when it's finished: it now counts the
number of rows rather than checking eosr (since the scan range may be
longer than the column).

Change-Id: Id9933808b965003c0c3b3aa78c32fe29a0c4bcbe
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1097
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:27 -08:00
Lenni Kuff
e63cc59a94 Add partitioned tpcds planner tests (SQL-92 style joins)
Adds the TPCDS queries as planner tests and fixes a few small issues
with the Planner test file parser. This adds the TPC-DS queries using
SQL-92 style joins that have a hand optimized (although
not perfect) join order.

Change-Id: I2d81e66af740b2d826b8ebd0c5ba8553b5faf0a2
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1019
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:26 -08:00
Skye Wanderman-Milne
acdc792355 IMPALA-695: Use the local path of Hive UDF jars in the FE.
The FE was creating class loaders with the HDFS locations of Hive UDF
libs, rather than the local locations created by the BE. Our tests
still passed since we only used UDFs already on the classpath
(e.g. Hive builtins).

Change-Id: Idbe9c98ad6adb84b70cb44efbf9ad0afc53366ca
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1081
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:25 -08:00
Skye Wanderman-Milne
b54d16dabd IMPALA-679: Append hash of HDFS path to filename in CopyHdfsFile() to avoid collisions.
Change-Id: Ia84fa81fe043a9604248d66ed963ef3f91b0601e
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1018
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:22 -08:00
Lenni Kuff
0bae3978c9 Update compute-stats.py to execute using Impala
Updates our compute stats script to execute using Impala. This allows us
to easily compute stats on all tables in a database or all tables in the
metastore.
The updated stats caused one of the TPCH plans to change so this also
updates the TPCH planner test results.

Change-Id: I17e5dcd1036a35e40eb4eb2c8e4a20702db9049c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1024
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:18 -08:00
Lenni Kuff
76fa3b2ded Update DDL to support 'STORED AS PARQUET' and 'STORED AS AVRO' syntax
This change updates our DDL syntax support to allow for using 'STORED AS PARQUET'
as well as 'STORED AS PARQUETFILE'. Moving forward we should prefer the new syntax,
but continue to support the old.  I made the same change for 'AVROFILE', but since
we have not yet documented the 'AVROFILE' syntax I left out support for the old syntax.

Change-Id: I10c73a71a94ee488c9ae205485777b58ab8957c9
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1053
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:18 -08:00
Nong Li
c1a64d6863 Add kill-mini-llama to CDH4 branch.
This makes it easier to switch between our branches and a no-op if
for those of us staying on CDH4.

Change-Id: Ic07eb8a7ba7e48db118c06c221aabe5e124f3bfb
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1033
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:54:17 -08:00
ishaan
fcdcf1a9d8 Parallelize data loaded through Impala to speed up data loading.
Currently, we execute all the queries involved in data loading serially. This change
creates a separate .sql file for each file format, compression codec and compression
scheme combination, and executes all the files in parallel. Additionally, we now store all the
.sql files (independent of workload) in $IMPALA_HOME/data_load_files/<dataset_name>. Note
that only data loaded through Impala is parallelized, data loaded through hive and hbase
remains serial.

On our build machines, the time taken to load all the data from snapshot was on the order
of 15 minutes.

Change-Id: If8a862c43f0e75b506ca05d83eacdc05621cbbf8
Reviewed-on: http://gerrit.ent.cloudera.com:8080/804
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-01-08 10:53:53 -08:00
Lenni Kuff
498c2529d4 Test CR: Change spacing in run-all.sh
Change-Id: I2362799213a7faca3892e38fb874bfbbd0c1718f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/803
Tested-by: jenkins
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:53:50 -08:00
Lenni Kuff
8b2acf5c22 IMPALA-425: Detect read-only tables and disable INSERT/LOAD operations on these tables
With this change we now detect if a table is read-only and disable INSERT/LOAD operations
on these tables. A table is read-only if Impala does not have write permission on the HDFS
base directory of the table or any one of the partition directories (if
the table is partitioned).

Change-Id: I25515b2d0ffb7fe297359437fd937a3d6e0406a0
Reviewed-on: http://gerrit.ent.cloudera.com:8080/713
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:53:37 -08:00
Alex Behm
51e914e911 Use hive-exec instead of hive-builtin because hive-builtin does not exist in CDH5 Hive.
Change-Id: I11993c7eebc9f5f07f112810d7e81d07ce157193
Reviewed-on: http://gerrit.ent.cloudera.com:8080/715
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:53:33 -08:00