Commit Graph

52 Commits

Author SHA1 Message Date
Nong Li
87295a4e06 Decimal implementation.
This patch implements decimal support for text based formats.

Change-Id: I8e2c9e512ed149fe965216a72cb21fffd4f18e75
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1669
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2238
Tested-by: jenkins
2014-04-14 21:07:32 -07:00
Skye Wanderman-Milne
e60bf29a96 IMPALA-13: Use SSE string functions that take an explicit length
This patch modifies DelimitedTextParser and StringValue to work with
data containing null characters by using SSE instructions that take a
length, rather than expecting null-terminated strings. It also adds
some other minor changes to correctly handle data with nulls and to
faciliate testing. I checked the execution time of a count(*) and a
select(*) limit 1 query locally, and saw no difference for either text
or sequence files.

Change-Id: Ia920b35bea7048aa286f39ec83e313c2a39251d1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2110
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2181
2014-04-11 11:16:24 -07:00
Nong Li
c27bd34075 Revert "Disable decimal in analysis."
This reverts commit 695017410adf6d4f8426c4117798c93f823a4b4b.

Change-Id: I919d965e8e711d588e6c56dcdbd3c8e0d9ec7a05
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2104
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-03-27 12:45:55 -07:00
Lenni Kuff
cc1c0c61fd IMP-1291: Support "extended" ASCII characters as delimiters in text files
This fixes how we validate delimiters to be in line with Hive. A delimiter must
fit in a single byte and can be specified in the following formats, as far as I can
tell (there isn't documentation):
- A single ASCII or unicode character (ex. '|')
- An escape character in octal format (ex. \001. Stored in the metastore as a
  unicode character: \u0001).
- A signed decimal integer in the range [-128:127]. Used to support delimiters
  for ASCII character values between 128-255 (-2 maps to ASCII 254).

Previously, we were not handling the "signed integer" case so there was no way
to specify a delimiter in the "extended" ASCII range of 128-255.

To support result validation, the test infrastructure had to be updated to support
reading/writing different character encodings.

Change-Id: Ie3c4d444dc9c6e60192093ed0c0f6f151eab16bc
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1848
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1888
2014-03-13 13:00:15 -07:00
Nong Li
b1aeea3f0b Disable decimal in analysis.
Change-Id: I4cff1fc74ef0afeba15bee5b9eb6851abbfddbdf
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1874
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
2014-03-12 18:55:06 -07:00
Nong Li
f0a67153d3 Decimal analysis changes.
Change-Id: Ib7d6a6a7650cc9058ff1486fc7546ab66c698d46
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1734
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-03-03 21:15:00 -08:00
Alex Behm
3d764619f7 Run Hive data loading through beeline instead of the Hive shell.
Fixes our log configuration to put the Hive logs in cluster_logs/hive.

Change-Id: I5d98581e35325f2173e4b3170e36bec42d33f8f3
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1497
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1615
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
2014-02-20 15:43:31 -08:00
Lenni Kuff
c4879340e8 IMPALA-781: Block Impala from reading RC file tables created using the LazyBinaryColumnarSerDe
With Hive .12, the default RC file format can be configured to be ColumnarSerDe or
LazyBinaryColumnarSerDe. Impala does not yet support the LazyBinaryColumnarSerDe. This change
verifies it is properly disabled.

Change-Id: Ia84495868237ce2c89a9706ad75e0f7eb8499057
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1416
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1423
2014-02-01 10:05:19 -08:00
Skye Wanderman-Milne
de531e15bd IMPALA-694: Allow Impala to read files produced by parquet-mr version <= 1.2.8
parquet-mr had a bug where it didn't include the dictionary page's
header in the total column size. We now compensate for this by
detecting these files and padding the scan range length. This required
changing how the scanner detects when it's finished: it now counts the
number of rows rather than checking eosr (since the scan range may be
longer than the column).

Change-Id: Id9933808b965003c0c3b3aa78c32fe29a0c4bcbe
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1097
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:27 -08:00
Skye Wanderman-Milne
9147cd7518 IMPALA-525: Adjust IO buffer size based on read length and other memory fixes
We were previously wasting memory by always reading into 8MB IO
buffers, even when the data read was much less than 8MB. With this
patch, the IO manager picks a buffer size closer to the actual amount
being read (we don't use the exact size so we can continue to recycle
buffers). The minimum IO buffer size is determined via the
--min_buffer_size flag, and the max IO buffer size via the --read_size
flag.

This technique also helps with IMPALA-652, since short columns will
not use as much memory as before (we will not use considerably more
memory than the size of the table).

This patch also changes StringBuffer to use a doubling strategy so it
doesn't end up allocating many large unused buffers, and has the
scanner context use the requested length as the sync read size if it's
larger than the size produced by read_past_size_cb(). These changes
help prevent the boundary buffer in the scanner context from
allocating excess memory.

Change-Id: I0efb3b023ddfddb08bca22d5cb5f9511fb4d6c50
Reviewed-on: http://gerrit.ent.cloudera.com:8080/938
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:01 -08:00
Lenni Kuff
92829b8400 IMPALA-587: Support implicit hbase column mapping keys
The Hive HBase spec specifies that the key column mapping can either be
defined explicitly (using the :key syntax) or left out completely in
which case a mapping to the first table column is implied. This change
updates Impala to support implicit key mappings and also adds some
checks in our ALTER TABLE DDL to unsure we cannot get into this state by
dropping a column from an Hbase table (a similar restriction that Hive
puts in place)

Change-Id: I920d642261659ee3e881da2553ffe83300923af8
Reviewed-on: http://gerrit.ent.cloudera.com:8080/554
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:53:14 -08:00
Lenni Kuff
f5c9e4d075 Fix comment location in schema generation script
Our data generation doesn't appear to fully support comments in the
base table name section. This fixes the data generation, but we should
follow on by improving our comment support in the framework.

Change-Id: I68274bd98d5b0d54868d6c80b7137a59e7329229
Reviewed-on: http://gerrit.ent.cloudera.com:8080/465
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:52:47 -08:00
Lenni Kuff
8e2a313673 IMPALA-590: Impala should more gracefully fail when loading HBase tables with complex types
Change-Id: Ifc3338ee1339ff0544ed14066824f1aa2d9d7c25
Reviewed-on: http://gerrit.ent.cloudera.com:8080/457
Tested-by: jenkins
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
2014-01-08 10:52:47 -08:00
ishaan
53cd9eadab Treat HBase as a file format for functional tests
Change-Id: Ia01181a1e10eb108419122d347e9d869a69e8922
Reviewed-on: http://gerrit.ent.cloudera.com:8080/102
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:52:36 -08:00
Alex Behm
9a201645cd IMPALA-496: Fix escaping of field delimiter and escape character in inserts
Change-Id: I49c36ae9823b35dcb9e92d1a13bef270657e36f2
Reviewed-on: http://gerrit.ent.cloudera.com:8080/163
Tested-by: jenkins <kitchen-build@cloudera.com>
Reviewed-by: Nong Li <nong@cloudera.com>
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:52:09 -08:00
Alex Behm
f0e2d539fc IMPALA-495: Views Sometimes Not Utilizing Partition Pruning.
Change-Id: I65daebbe8c4b72b956a409fe28edd3773fda7cb7
Reviewed-on: http://gerrit.ent.cloudera.com:8080/128
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:52:04 -08:00
Alex Behm
8ad15fabcf IMPALA-372: Added CREATE/DROP/ALTER VIEW. 2014-01-08 10:51:35 -08:00
Alan Choi
254ee6ef89 IMPALA-434 Support binary hbase encoding 2014-01-08 10:51:18 -08:00
Alex Behm
9ff09cd3f4 IMPALA-70: Respect tbl properties to allow empty strings to be treated as NULL 2014-01-08 10:50:28 -08:00
Alex Behm
558590140c IMPALA-238: Problems inserting into tables with TIMESTAMP partition columns ... 2014-01-08 10:50:05 -08:00
Henry Robinson
018028f8e3 IMPALA-269: Throw exception if serde unrecognised 2014-01-08 10:50:01 -08:00
Lenni Kuff
c74b7e41dd Enable insert tests to run against parquet 2014-01-08 10:49:47 -08:00
Lenni Kuff
cba9cd00dd Fix full data load build break due to constructing incorrect HDFS paths 2014-01-08 10:49:34 -08:00
Lenni Kuff
558d5ce755 Data loading: Exec DDL statements via Impala and don't recreate metadata if it exists 2014-01-08 10:49:28 -08:00
Elliott Clark
0e0c02b6bd Add the ability to Select into HBase table.
* Changed frontend analysis for HBase tables
* Changed Thrift messages to allow HBase as a sink type.
* JNI Wrapper around htable
* Create hbase-table-sink
* Create hbase-table-writer
* Static init lots of JNI related code for HBase.
* Cleaned up some cpplint issues.
* Changed junit analysis tests
* Create a new HBase test table.
* Added functional tests for HBase inserts.
2014-01-08 10:49:06 -08:00
Lenni Kuff
831ee529be Fixed data loading bugs, moved most tables out of load-dependent-tables 2014-01-08 10:48:56 -08:00
Elliott Clark
ade4453a17 Allow impala hbase serdproperties to have newlines 2014-01-08 10:48:54 -08:00
Alex Behm
be03e6c21c IMPALA-138: Error messages for unknown column types are particularly bad. 2014-01-08 10:48:53 -08:00
Skye Wanderman-Milne
461a48df2b Refactor testing framework to generate Avro tables. 2014-01-08 10:48:45 -08:00
Lenni Kuff
d57440e87d Allow column comments for CREATE TABLE and DESCRIBE <table> statements 2014-01-08 10:48:37 -08:00
ishaan
f1d37646ed Fix creation of the insert_string_partitioned_table 2014-01-08 10:48:20 -08:00
Henry Robinson
222d15c6ca IMPALA-72: String partition keys should be URL encoded 2014-01-08 10:48:20 -08:00
ishaan
09d6d931f4 Change the way data is loaded 2014-01-08 10:48:09 -08:00
Lenni Kuff
bed633c1ae Extract config/metastore creation from buildall + script for loading warehouse snapshot 2014-01-08 10:46:53 -08:00
Lenni Kuff
1e25c98fb4 Test data loading framework improvements
This change includes a number of improvements for the test data loading framework:
* Named sections for schema template definitions
* Removal of uneeded sections from schema template definitions (ex. ANALYZE TABLE)
* More granular data loading via table name filters
* Improved robustness in detecting failed data loads
* Table level constraints for specific file formats
* Re-written compute stats script
2014-01-08 10:46:49 -08:00
Nong Li
adf36b81f9 Fix data errors test. 2014-01-08 10:46:45 -08:00
Nong Li
34879a4ddc Fix IMP-297 2014-01-08 10:46:44 -08:00
Michael Ubell
7536510b69 IMP-258 Test writing nulls. 2014-01-08 10:46:31 -08:00
Michael Ubell
37aaf06f79 IMP-390 Get rid of test dependencies on InProcessQE and Runquery 2014-01-08 10:46:18 -08:00
Michael Ubell
0c4f025a5e Fix loading of nulltable data, remove loading functional-planner data 2014-01-08 10:45:58 -08:00
Michael Ubell
bf57ae27a5 IMP-291 Read sequence file to next sync mark when; ragged columns 2014-01-08 10:45:57 -08:00
Michael Ubell
5f951ffc4a Handle missing columns at the end of a row 2014-01-08 10:45:11 -08:00
Henry Robinson
e7348a209b IMP-232: Parallel INSERT OVERWRITE 2014-01-08 10:45:04 -08:00
Nong Li
7c411da86c Fixed schema template. 2014-01-08 10:44:41 -08:00
Nong Li
5b2621a401 Fix null table creation to workaround hive issue. 2014-01-08 10:44:41 -08:00
Nong Li
4c9c82910a Text parser fix for columns off end. 2014-01-08 10:44:40 -08:00
Nong Li
4d0319d32b Fix null string parsing. 2014-01-08 10:44:40 -08:00
Lenni Kuff
6e07e0b8d8 Added support for generating ANALYZE TABLE ... COMPUTE STATISTICS statements during data loading
Add support for generating ANALYZE TABLE ... COMPUTE STATISTICS statements to the data loading
workflow. This allows for capturing simple table stats such as number of rows, number of
partitions, and table size in bytes. These are stored into a new mysql database with the same
name as the metastore except with a '_Stats' suffix. If using Derby a new database results are
stored in a new derby database.
2014-01-08 10:44:34 -08:00
Alan Choi
cbadb4eac4 When a scan range begins at the starting point fo the tuple, we'll missed that tuple. This patch fixes
this problem.

review: 162
2014-01-08 10:44:24 -08:00
Michael Ubell
02d63d8dc3 Trevni file support 2014-01-08 10:44:19 -08:00