133 Commits

Author SHA1 Message Date
Nong Li
1f6481382e Fix parquet test setup. 2014-01-08 10:49:41 -08:00
Henry Robinson
14d29aa579 Add plan and number of fragment instances to profile 2014-01-08 10:49:38 -08:00
Alex Behm
1b2e8280d4 Fix NULL issues. 2014-01-08 10:49:32 -08:00
Lenni Kuff
e218721386 IMPALA-198: Support setting file format, table comment in CREATE TABLE LIKE statements 2014-01-08 10:49:31 -08:00
Alan Choi
5f9e26b4a8 Average Scanner Thread Concurrency is a new metric in the profile that reports
the average number of active scanner threads (i.e. those that are not blocked by
IO).

In the hdfs-scan-node, whenever a thread is started, it will increment the
active_scanner_thread_counter_. When a scanner thread enters the
scan-range-context's GetRawBytes or GetBytes, the counter will be decremented.

A new sampling thread is created to sample the value of
active_scanner_thread_counter_ and compute the average.

A bucket counting of HdfsReadThreadConcurrent is also added.

The output of the hdfs-scan-node profile is also updated. Here's the new output
for hdfs-scan-node after running count(*) from tpch.lineitem.

      HDFS_SCAN_NODE (id=0):(10s254ms 99.75%)
        File Formats: TEXT/NONE:12
        Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:6/351.21M (351208888) 1:6/402.65M (402653184)
         - AverageHdfsReadThreadConcurrency: 1.95
           - HdfsReadThreadConcurrencyCountPercentage=0: 0.00
           - HdfsReadThreadConcurrencyCountPercentage=1: 5.00
           - HdfsReadThreadConcurrencyCountPercentage=2: 95.00
           - HdfsReadThreadConcurrencyCountPercentage=3: 0.00
         - AverageScannerThreadConcurrency: 0.15
         - BytesRead: 718.94 MB
         - MemoryUsed: 0.00
         - NumDisksAccessed: 2
         - PerReadThreadRawHdfsThroughput: 36.75 MB/sec
         - RowsReturned: 6.00M (6001215)
         - RowsReturnedRate: 585.25 K/sec
         - ScanRangesComplete: 12
         - ScannerThreadsInvoluntaryContextSwitches: 168
         - ScannerThreadsTotalWallClockTime: 1m40s
           - DelimiterParseTime: 2s128ms
           - MaterializeTupleTime: 723.0us
           - ScannerThreadsSysTime: 10.0ms
           - ScannerThreadsUserTime: 2s090ms
         - ScannerThreadsVoluntaryContextSwitches: 99
         - TotalRawHdfsReadTime: 19s561ms
         - TotalReadThroughput: 68.69 MB/sec
2014-01-08 10:49:30 -08:00
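
The sampling described in commit 5f9e26b4a8 can be illustrated with a minimal sketch. This is not Impala's C++ implementation; `summarize_concurrency` and the sample values are hypothetical, chosen only to show how periodic samples of a counter yield an average and a per-value percentage breakdown.

```python
from collections import Counter


def summarize_concurrency(samples):
    """Return (average, {value: percentage}) for periodic counter samples.

    Hypothetical helper: `samples` stands in for the values a sampling
    thread would read from a counter such as active_scanner_thread_counter_.
    """
    if not samples:
        return 0.0, {}
    average = sum(samples) / len(samples)
    counts = Counter(samples)
    # One bucket per observed concurrency level, from 0 up to the maximum seen.
    percentages = {
        value: 100.0 * counts.get(value, 0) / len(samples)
        for value in range(max(samples) + 1)
    }
    return average, percentages


# Made-up samples: 1 of 20 samples saw one active thread, 19 saw two.
samples = [1] + [2] * 19
average, buckets = summarize_concurrency(samples)
print(f"average={average:.2f}", buckets)
```

A real sampler would read the counter on a timer from a separate thread; the sketch only covers the arithmetic applied to the collected samples.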
Marcel Kornacker
d7e22f44bb Partitioned hash joins
- added PlanNode.numNodes, PlanNode.avgRowSize and PlanNode.computeStats()
- fixing up some cardinality estimates
- Planner now tries to make a cost-based decision between broadcast join and join with full repartitioning (both inputs)
- ExchangeNode now distinguishes between its input and output row descriptor: the output potentially contains more tuples
- fixed problem related to cancellation and concurrent hash table builds.

Not included:
- partitioned joins that take advantage of existing partitions of the inputs; those will have to wait for a follow-on change
2014-01-08 10:49:29 -08:00
Henry Robinson
2ae20cbbb7 Statestore-2.0: New state-store implementation
* API simplified to deal only with 'topics', not services and objects
* Scalability improved: heartbeat loop is now multi-threaded
* State-store can store arbitrary objects
* State-store may send either deltas or complete topic state (delta computation to come)
2014-01-08 10:49:23 -08:00
Lenni Kuff
15f0313283 Add analysis checks for length of RowFormat strings, fix escaping of row format values 2014-01-08 10:49:21 -08:00
Alan Choi
afc6f83ba6 IMP-819 Pass file length to backend 2014-01-08 10:49:18 -08:00
Lenni Kuff
5a0b1270c4 Add support for ALTER ... PARTITION (partitionSpec) SET FILEFORMAT/LOCATION
Adds support for:
* ALTER TABLE <table> PARTITION (partitionSpec) SET FILEFORMAT
* ALTER TABLE <table> PARTITION (partitionSpec) SET LOCATION

This enables setting the location and fileformat of specific partitions.
2014-01-08 10:49:17 -08:00
Lenni Kuff
1fb72fbc73 IMPALA-156: Support core 'ALTER TABLE' DDL command
This patch adds support for
- ALTER TABLE ADD|REPLACE COLUMNS
- ALTER TABLE DROP COLUMN
- ALTER TABLE ADD/DROP PARTITION
- ALTER TABLE SET FILEFORMAT
- ALTER TABLE SET LOCATION
- ALTER TABLE RENAME
2014-01-08 10:49:14 -08:00
Skye Wanderman-Milne
c5afb11558 Compress serialized RowBatchs 2014-01-08 10:49:13 -08:00
Alan Choi
051c56073a IMPALA-158: query options should be optional 2014-01-08 10:49:06 -08:00
Elliott Clark
0e0c02b6bd Add the ability to Select into HBase table.
* Changed frontend analysis for HBase tables
* Changed Thrift messages to allow HBase as a sink type.
* JNI Wrapper around htable
* Create hbase-table-sink
* Create hbase-table-writer
* Static init lots of JNI related code for HBase.
* Cleaned up some cpplint issues.
* Changed junit analysis tests
* Create a new HBase test table.
* Added functional tests for HBase inserts.
2014-01-08 10:49:06 -08:00
Alan Choi
991db9001b IMPALA-113 Raise error when default order by limit is exceeded 2014-01-08 10:49:03 -08:00
Marcel Kornacker
0c36c7f327 Partitioned merge aggregation. 2014-01-08 10:48:59 -08:00
Lenni Kuff
ca0d23a844 IMPALA-157: Support CREATE TABLE LIKE DDL 2014-01-08 10:48:55 -08:00
Alex Behm
be03e6c21c IMPALA-138: Error messages for unknown column types are particularly bad. 2014-01-08 10:48:53 -08:00
Nong Li
6e293090e6 Parquet writer.
Change-Id: I7117b545e3d3a7803a219234ad992040a6c7c4ec
2014-01-08 10:48:44 -08:00
Lenni Kuff
0bcb54fcf8 Add GetRuntimeProfile RPC and enable printing runtime profile from impala-shell 2014-01-08 10:48:44 -08:00
Marcel Kornacker
d7bfe6c68d IMPALA-144: partition pruning for arbitrary predicates that are fully bound by partition columns
This makes partition pruning more effective by extending it to predicates that are fully bound by the partition column,
e.g., '<col> IN (1, 2, 3)' will also be used to prune partitions, in addition to equality and binary comparisons.
2014-01-08 10:48:41 -08:00
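
The idea in commit d7bfe6c68d can be sketched as follows. This is an illustration only, not the planner's code: `prune_partitions`, the predicate representation (a tuple of referenced columns plus a callable), and the sample table are all hypothetical.

```python
def prune_partitions(partitions, partition_cols, predicates):
    """Keep only partitions that satisfy every predicate bound by partition columns.

    `predicates` is a list of (referenced_columns, fn) pairs, a hypothetical
    representation. A predicate is usable for pruning only if every column it
    references is a partition column; other predicates are ignored here and
    would be evaluated later against the actual rows.
    """
    partition_cols = set(partition_cols)
    bound = [fn for cols, fn in predicates if set(cols) <= partition_cols]
    return [p for p in partitions if all(fn(p) for fn in bound)]


# Made-up table partitioned by (year, month).
partitions = [{"year": 2012, "month": m} for m in range(1, 7)]
predicates = [
    # Fully bound by a partition column: usable for pruning.
    (("month",), lambda p: p["month"] in (1, 2, 3)),
    # References a non-partition column: cannot prune, so it is skipped.
    (("month", "price"), lambda p: False),
]
kept = prune_partitions(partitions, ("year", "month"), predicates)
print([p["month"] for p in kept])
```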
Lenni Kuff
d57440e87d Allow column comments for CREATE TABLE and DESCRIBE <table> statements 2014-01-08 10:48:37 -08:00
Skye Wanderman-Milne
57c3072188 Add support for reading Avro files compressed using the deflate codec. 2014-01-08 10:48:36 -08:00
Lenni Kuff
9f71374875 IMPALA-102: Add support for CREATE TABLE ... PARTITIONED BY (col1, col2) 2014-01-08 10:48:35 -08:00
Henry Robinson
71e6d81d1b IMP-261: Clean up network address handling 2014-01-08 10:48:33 -08:00
Marcel Kornacker
77f4fc8cf9 Adding memory limits
- new class MemLimit
- new query flag MEM_LIMIT
- implementation of impalad flag mem_limit

Still missing:
- parsing a mem limit spec that contains "M/G", as in: 1.25G
2014-01-08 10:48:33 -08:00
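
The still-missing piece noted in commit 77f4fc8cf9, parsing a spec such as 1.25G, could look roughly like the sketch below. `parse_mem_limit` is a hypothetical helper, and treating M and G as binary multiples (1024-based) is an assumption; the commit message does not say which convention Impala uses.

```python
def parse_mem_limit(spec):
    """Parse a spec such as '1.25G' or '512M' into a byte count.

    Assumptions (not from the commit): a bare number means bytes, and
    M/G are binary multiples (1024**2 and 1024**3).
    """
    units = {"M": 1024 ** 2, "G": 1024 ** 3}
    text = spec.strip().upper()
    if not text:
        raise ValueError("empty mem limit spec")
    multiplier = units.get(text[-1])
    number = text[:-1] if multiplier else text
    value = float(number)  # raises ValueError on malformed input
    if value < 0:
        raise ValueError("mem limit must be non-negative")
    return int(value * (multiplier or 1))


print(parse_mem_limit("1.25G"))
```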
Alan Choi
4b6ce8ecb3 This patch changes the clock to CLOCK_MONOTONIC.
Rdtsc is not accurate, due to changes in cpu frequency. Very often, the time
reported in the profile is even longer than the time reported by the shell.

This patch replaces Rdtsc with CLOCK_MONOTONIC. It is as fast as Rdtsc and
accurate. It is not affected by cpu frequency changes and it is not affected by
the user setting the system clock.

Note that the new profile report will always report time, rather than clock
cycles.  Here's the new profile:

  Averaged Fragment 1:(68.241ms 0.00%)
    completion times: min:69ms  max:69ms  mean: 69ms  stddev:0
    execution rates: min:91.60 KB/sec  max:91.60 KB/sec  mean:91.60 KB/sec  stddev:0.00 /sec
    split sizes:  min: 6.32 KB, max: 6.32 KB, avg: 6.32 KB, stddev: 0.00
     - RowsProduced: 1
    CodeGen:
       - CodegenTime: 566.104us    <--* reporting in microsec instead of clock cycles
       - CompileTime: 33.202ms
       - LoadTime: 2.671ms
       - ModuleFileSize: 44.61 KB
    DataStreamSender:
       - BytesSent: 16.00 B
       - DataSinkTime: 50.719us
       - SerializeBatchTime: 18.365us
       - ThriftTransmitTime: 145.945us
    AGGREGATION_NODE (id=1):(68.384ms 15.50%)
       - BuildBuckets: 1.02K
       - BuildTime: 13.734us
       - GetResultsTime: 6.650us
       - MemoryUsed: 32.01 KB
       - RowsReturned: 1
       - RowsReturnedRate: 14.00 /sec
    HDFS_SCAN_NODE (id=0):(57.808ms 84.71%)
       - BytesRead: 6.32 KB
       - DelimiterParseTime: 62.370us
       - MaterializeTupleTime: 767ns
       - MemoryUsed: 0.00
       - PerDiskReadThroughput: 9.32 MB/sec
       - RowsReturned: 100
       - RowsReturnedRate: 1.73 K/sec
       - ScanRangesComplete: 4
       - ScannerThreadsInvoluntaryContextSwitches: 0
       - ScannerThreadsReadTime: 662.431us
       - ScannerThreadsSysTime: 0
       - ScannerThreadsTotalWallClockTime: 25ms
       - ScannerThreadsUserTime: 0
       - ScannerThreadsVoluntaryContextSwitches: 4
       - TotalReadThroughput: 0.00 /sec
2014-01-08 10:48:32 -08:00
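
The reasoning in commit 4b6ce8ecb3 (measure elapsed time with a monotonic clock and report it as time rather than cycles) can be sketched as below. This is not the Impala timer code: `timed` and `pretty_time` are hypothetical helpers, and the output format only loosely imitates the profile above. Python's `time.monotonic_ns()` is used as a stand-in; on Linux it is typically backed by `clock_gettime(CLOCK_MONOTONIC)`.

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed nanoseconds).

    Uses a monotonic clock, so the measurement is unaffected by changes
    to the system clock.
    """
    start = time.monotonic_ns()
    result = fn(*args)
    return result, time.monotonic_ns() - start


def pretty_time(ns):
    """Format nanoseconds as ns/us/ms/s (hypothetical format)."""
    if ns < 1_000:
        return f"{ns}ns"
    if ns < 1_000_000:
        return f"{ns / 1_000:.3f}us"
    if ns < 1_000_000_000:
        return f"{ns / 1_000_000:.3f}ms"
    return f"{ns / 1_000_000_000:.3f}s"


total, elapsed_ns = timed(sum, range(100_000))
print("SumTime:", pretty_time(elapsed_ns))
```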
Lenni Kuff
87d8f79efe Add support for CREATE TABLE ... STORED AS PARQUETFILE 2014-01-08 10:48:32 -08:00
Lenni Kuff
1cd847c856 IMPALA-81: Add support for CREATE/DROP DATABASE/TABLE
This adds Impala support for CREATE/DROP DATABASE/TABLE. With this change, Impala
supports creating tables in the metastore stored as text, sequence, and rc file format.
It currently only supports creating unpartitioned tables and tables stored in HDFS.
2014-01-08 10:48:30 -08:00
Marcel Kornacker
c02d25baa8 IMPALA-20: Limit clause in inline view not handled correctly by planner
- this adds a SelectNode that evaluates conjuncts and enforces the limit
- all limits are now distributed: enforced both by the child plan fragment and
  by the merging ExchangeNode
- all limits w/ Order By are now distributed: enforced both by the child plan fragment and
  by the merging TopN node
2014-01-08 10:48:29 -08:00
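
The "distributed limit" idea in commit c02d25baa8 can be illustrated with a small sketch: each child fragment enforces the limit locally, and the merging step enforces it again over the combined output. The function names and data are hypothetical, and ascending integer order stands in for an arbitrary ORDER BY.

```python
import heapq
from itertools import islice


def fragment_limit(rows, limit):
    """Child fragment: return at most `limit` rows."""
    return list(islice(rows, limit))


def exchange_limit(fragment_outputs, limit):
    """Merging exchange: concatenate fragment outputs, enforce the limit again."""
    merged = (row for output in fragment_outputs for row in output)
    return list(islice(merged, limit))


def fragment_top_n(rows, limit):
    """Child fragment with ORDER BY: keep only the `limit` smallest rows, sorted."""
    return heapq.nsmallest(limit, rows)


def merge_top_n(fragment_outputs, limit):
    """Merging TopN: merge the sorted fragment outputs and keep the first `limit`."""
    return heapq.nsmallest(limit, heapq.merge(*fragment_outputs))


# Made-up data spread over three fragments.
fragments = [[5, 1, 9], [4, 8, 2], [7, 3, 6]]
plain = exchange_limit([fragment_limit(f, 2) for f in fragments], 2)
ordered = merge_top_n([fragment_top_n(f, 2) for f in fragments], 2)
print(plain, ordered)
```

Enforcing the limit in the child keeps each fragment from sending more than `limit` rows, while the second check at the merge point is what makes the overall result correct.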
Alan Choi
9c11c0ce2d HiveServer2 clean up
This patch adds

1. use boost uuid
2. add unit test for HiveServer2 metadata operation
3. add JDBC metadata unit test
4. implement all remaining HiveServer2 operations: GetFunctions and GetTableTypes
5. remove in-process impala server from fe-support
2014-01-08 10:48:06 -08:00
Skye Wanderman-Milne
8b87099998 IMPALA-2: Support for Avro data files
Adds HdfsAvroScanner, as well as modifies the sequence scanners to be more general.
2014-01-08 10:48:05 -08:00
Nong Li
868a99135a Add network benchmark 2014-01-08 10:47:56 -08:00
Alan Choi
073de3e02e Generate HiveServer2 files from thrift 2014-01-08 10:47:52 -08:00
Marcel Kornacker
63e3cd0279 Adding query option DEBUG_ACTION 2014-01-08 10:47:37 -08:00
Alan Choi
be98df19c8 HiveServer2
This patch implements the HiveServer2 API.

We have tested it with Lenni's patch against the tpch workload. It has also
been tested manually against Hive's beeline with queries and metadata operations.

All of the HiveServer2 code is implemented in impala-hs2-server.cc. Beeswax
code is refactored to impala-beeswax-server.cc.

HiveServer2 has a few more metadata operations. These operations go through
impala-hs2-server to ddl-executor and then to FE. The logic is implemented in
fe/src/main/java/com/cloudera/impala/service/MetadataOp.java.

Because of the Thrift union issue, I had to modify the generated c++ file.
Therefore, all the HiveServer2 thrift generated c++ code is checked into
be/src/service/hiveserver2/. Once the thrift issue is resolved, I'll remove
these files.

Change-Id: I9a8fe5a09bf250ddc43584249bdc87b6da5a5881
2014-01-08 10:47:24 -08:00
Henry Robinson
7ba437a52e Code changes to build against thrift 0.9.0 in thirdparty/ 2014-01-08 10:47:22 -08:00
Alan Choi
ff704ce586 IMP-690: impala-shell calls PingImpalaService thrift API to verify
the connected server is an impalad.
2014-01-08 10:47:13 -08:00
Henry Robinson
b7f937577d Add missing thrift file 2014-01-08 10:47:12 -08:00
Henry Robinson
986f3cddf6 Move sparrow/ to statestore/ and remove sparrow namespace 2014-01-08 10:47:12 -08:00
Skye Wanderman-Milne
982747c856 IMP-653: add CURRENT_TIMESTAMP() function as synonym for now() 2014-01-08 10:47:09 -08:00
Marcel Kornacker
bf56c21c1b IMP-618
Adding DEFAULT_ORDER_BY_LIMIT query option.
Also removing deprecated PARTITION_AGG query option.
2014-01-08 10:47:04 -08:00
Michael Ubell
8a5297a526 Add HdfsLzoTextScanner 2014-01-08 10:46:35 -08:00
Henry Robinson
2f339f2ed8 Add ASL license to all public files 2014-01-08 10:46:32 -08:00
ishaan
ccb020c4a0 Adding copyrights to remaining files. 2014-01-08 10:46:30 -08:00
ishaan
05c65789bb Change Copyrights from 2011 to 2012 2014-01-08 10:46:29 -08:00
Henry Robinson
dd0e9f1180 IMP-265: State-store subscriber recovery mode 2014-01-08 10:46:25 -08:00
Michael Ubell
c1852e2dcf Add from_unixtime and unix_timestamp(string, string) 2014-01-08 10:46:22 -08:00
Marcel Kornacker
ea050a43ad Switching over backend runtime structures to new planner.
Added container-util.h
2014-01-08 10:46:20 -08:00
Nong Li
08968c1d07 Performance improvements for aggregation and hash join nodes with codegen. 2014-01-08 10:46:19 -08:00