Commit Graph

43 Commits

Author SHA1 Message Date
Nong Li
5d903efca3 ExecSummary
The runtime profile as we present it is not very useful and I think the structure of
it makes it hard to consume. This patch adds a new client facing schemed set of
counters that are collected from the runtime profiles. For example, with this structure
it would be easy to have the shell get the stats of a running query and print a useful
progress report or to check the most relevant metrics for diagnosing issues.

Here's an example of the output for one of the tpch queries:
Operator              #Hosts   Avg Time   Max Time    #Rows  Est. #Rows  Peak Mem  Est. Peak Mem  Detail
------------------------------------------------------------------------------------------------------------------------
09:MERGING-EXCHANGE        1   79.738us   79.738us        5           5         0        -1.00 B  UNPARTITIONED
05:TOP-N                   3   84.693us   88.810us        5           5  12.00 KB       120.00 B
04:AGGREGATE               3    5.263ms    6.432ms        5           5  44.00 KB       10.00 MB  MERGE FINALIZE
08:AGGREGATE               3   16.659ms   27.444ms   52.52K     600.12K   3.20 MB       15.11 MB  MERGE
07:EXCHANGE                3    2.644ms      5.1ms   52.52K     600.12K         0              0  HASH(o_orderpriority)
03:AGGREGATE               3  342.913ms  966.291ms   52.52K     600.12K  10.80 MB       15.11 MB
02:HASH JOIN               3    2s165ms    2s171ms  144.87K     600.12K  13.63 MB      941.01 KB  INNER JOIN, BROADCAST
|--06:EXCHANGE             3    8.296ms    8.692ms   57.22K      15.00K         0              0  BROADCAST
|  01:SCAN HDFS            2    1s412ms    1s978ms   57.22K      15.00K  24.21 MB      176.00 MB  tpch.orders o
00:SCAN HDFS               3    8s032ms    8s558ms    3.79M     600.12K  32.29 MB      264.00 MB  tpch.lineitem l

Change-Id: Iaad4b9dd577c375006313f19442bee6d3e27246a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2964
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-06-11 03:10:11 -07:00
Matthew Jacobs
25c0ebf58c External Data Source: Public API
Adds the thrift structures for the public external data source API
and a new maven project containing the Java ExternalDataSource
interface and the generated Java thrift classes.

The ExternalDataSource.thrift structures can evolve in a backward
compatible way. The ExternalDataSource Java interface will always
contain a version number in the namespace (e.g.
com.cloudera.impala.extdatasource.v1 for V1) so we can potentially
make breaking changes to the interface in the future but still
support older versions.

A trivial implementation of the ExternalDataSource API is also
added for testing purposes.
TODO: Make the sample data source implementation realistic.

Change-Id: I827d6420a87ed7a2bce34e050362ca98ddc5dbcc
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2241
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit f29814e9ede9d4c889f2648606fcf511feeb47ae)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2313
2014-04-22 18:34:48 -07:00
Nong Li
0d2919fe7f Refactor scalar and aggregate function analysis and execution.
This patch cleans up analysis and execution of scalar and aggregate functions
so that there is no difference between how builtins and user functions are
handled. The only difference is that the catalog is populated with the builtins
all the time.

The BE always gets a TFunction object and just executes it (builtins will have
an empty hdfs file location).

This removes the opcode registry and all of the functionality is subsumed by
the catalog, most of which was already duplicated there anyway.

This also introduces the concept of a system database; databases that the
user cannot modify and is populated automatically on startup.

Change-Id: Iaa3f84dad0a1a57691f5c7d8df7305faf01d70ed
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1386
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1577
2014-02-18 18:40:08 -08:00
Alex Behm
dc7b398bd3 Impala reserves resources from YARN via LLama.
Impala reserves resources from YARN via Llama and handles resources
preemptions by cancelling affected queries. Adds the Impala Resource
Broker for interacting with Llama. Refactors scheduler and coordinator
to move fragment-to-host assignment logic into scheduler. Local test
setup uses MiniLLama.

Change-Id: Ic7b0fe43de52d30f4207b4e65cce7e6a294e54e1
2014-01-15 15:12:04 -08:00
Henry Robinson
51e58e1f3c Statestore aesthetic cleanup
* Statestore is now one word, without camelcase, eveywhere. Previous
    names included StateStore, state-store and state_store,
    variously. The only exception is a couple of flags that have
    'state_store', and can't be changed for compatibility reasons.
  * File names are also changed to reflect the standard naming.
  * Most comments are now 90 chars wide (from 80 before)

Change-Id: I83b666c87991537f9b1b80c2f0ea70c2e0c07dcf
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1225
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: jenkins
2014-01-09 09:56:04 -08:00
Lenni Kuff
9d5b94baa5 CatalogServer follow-on code review changes
Changes to address follow-on code review comments. This change consists mainly of:
* Comment cleanup / clarification
* Thrift struct consolidation
* Minor naming changes
* Small code fixes/changes, etc

Change-Id: Idd03cc8adeb9c0d99744688a02f81a08135966de
Reviewed-on: http://gerrit.ent.cloudera.com:8080/667
Tested-by: jenkins
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:53:42 -08:00
Lenni Kuff
bf139d1eba Update catalogd to forward log4j log messages to glog
Change-Id: I4620b77ba731e134a3e48883e8ae7ee3820ed584
Reviewed-on: http://gerrit.ent.cloudera.com:8080/612
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: jenkins
2014-01-08 10:53:12 -08:00
Lenni Kuff
a2cbd2820e Add Catalog Service and support for automatic metadata refresh
The Impala CatalogService manages the caching and dissemination of cluster-wide metadata.
The CatalogService combines the metadata from the Hive Metastore, the NameNode,
and potentially additional sources in the future. The CatalogService uses the
StateStore to broadcast metadata updates across the cluster.
The CatalogService also directly handles executing metadata updates request from
impalad servers (DDL requests). It exposes a Thrift interface to allow impalads to
directly connect execute their DDL operations.
The CatalogService has two main components - a C++ server that implements StateStore
integration, Thrift service implementiation, and exporting of the debug webpage/metrics.
The other main component is the Java Catalog that manages caching and updating of of all
the metadata. For each StateStore heartbeat, a delta of all metadata updates is broadcast
to the rest of the cluster.

Some Notes On the Changes
---
* The metadata is all sent as thrift structs. To do this all catalog objects (Tables/Views,
Databases, UDFs) have thrift struct to represent them. These are sent with each statestore
delta update.
* The existing Catalog class has been seperated into two seperate sub-classes. An
ImpladCatalog and a CatalogServiceCatalog. See the comments on those classes for more
details.

What is working:
* New CatalogService created
* Working with statestore delta updates and latest UDF changes
* DDL performed on Node 1 is now visible on all other nodes without a "refresh".
* Each DDL operation against the Catalog Service will return the catalog version that
  contains the change. An impalad will wait for the statestore heartbeat that contains this
  version before returning from the DDL comment.
* All table types (Hbase, Hdfs, Views) getting their metadata propagated properly
* Block location information included in CS updates and used by Impalads
* Column and table stats included in CS updates and used by Impalads
* Query tests are all passing

Still TODO:
* Directly return catalog object metadata from DDL requests
* Poll the Hive Metastore to detect new/dropped/modified tables
* Reorganize the FE code for the Catalog Service. I don't think we want everything in the
  same JAR.

Change-Id: I8c61296dac28fb98bcfdc17361f4f141d3977eda
Reviewed-on: http://gerrit.ent.cloudera.com:8080/601
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:53:11 -08:00
Nong Li
af90c8a133 Fix memory usage tracking.
Changes MemLimit to MemTracker:
- the limit is optional
- it also records a label and an optional parent
- Consume() and Release() also update the ancestors and there's also a new
  AnyLimitExceeded(), which also checks the ancestors
- the consumption counter is a HighwaterMarkCounter and can optionally be created
  as part of a profile

Each fragment instance now has a MemTracker that is part of a 3-level
hierarchy: process, query, fragment instance.

Change-Id: I5f580f4956fdf07d70bd9a6531032439aaf0fd07
Reviewed-on: http://gerrit.ent.cloudera.com:8080/339
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:52:36 -08:00
Alex Behm
816141b9f5 Allow BIGINT type in functions that currently only accept INT. 2014-01-08 10:50:46 -08:00
Henry Robinson
90ed9f0ab8 Remove planservice 2014-01-08 10:50:20 -08:00
Henry Robinson
2ae20cbbb7 Statestore-2.0: New state-store implementation
* API simplified to deal only with 'topics', not services and objects
* Scalability improved: heartbeat loop is now multi-threaded
* State-store can store arbitrary objects
* State-store may send either deltas or complete topic state (delta computation to come)
2014-01-08 10:49:23 -08:00
Nong Li
6e293090e6 Parquet writer.
Change-Id: I7117b545e3d3a7803a219234ad992040a6c7c4ec
2014-01-08 10:48:44 -08:00
Nong Li
868a99135a Add network benchmark 2014-01-08 10:47:56 -08:00
Alan Choi
be98df19c8 HiveServer2
This patch implements the HiveServer2  API.

We have tested it with Lenni's patch against the tpch workload. It has also
been tested manually against Hive's beeline with queries and metadata operations.

All of the HiveServer2 code is implemented in impala-hs2-server.cc. Beeswax
code is refactored to impala-beeswax-server.cc.

HiveServer2 has a few more metadata operations. These operations go through
impala-hs2-server to ddl-executor and then to FE. The logics are implemented in
fe/src/main/java/com/cloudera/impala/service/MetadataOp.java.

Because of the Thrift union issue, I have to modify the generated c++ file.
Therefore, all the HiveServer2 thrift generated c++ code are checked into
be/src/service/hiveserver2/. Once the thrift issue is resolved, I'll remove
these files.

Change-Id: I9a8fe5a09bf250ddc43584249bdc87b6da5a5881
2014-01-08 10:47:24 -08:00
Henry Robinson
7ba437a52e Code changes to build against thrift 0.9.0 in thirdparty/ 2014-01-08 10:47:22 -08:00
Henry Robinson
986f3cddf6 Move sparrow/ to statestore/ and remove sparrow namespace 2014-01-08 10:47:12 -08:00
Nong Li
2289906a5a Fix linker dependencies. 2014-01-08 10:46:56 -08:00
Henry Robinson
2f339f2ed8 Add ASL license to all public files 2014-01-08 10:46:32 -08:00
ishaan
05c65789bb Change Copyrights from 2011 ti 2012 2014-01-08 10:46:29 -08:00
Michael Ubell
ad46b98366 Add Kerberos authentication. 2014-01-08 10:45:10 -08:00
Marcel Kornacker
c004cdaa1c Thrift structures for the new planner interface. 2014-01-08 10:44:47 -08:00
Marcel Kornacker
fb32d40b03 Switching to an asynchronous plan fragment exec interface; this entails:
- making the coordinator asynchronous
- renamed ImpalaBackendService to ImpalaInternalService;
- new class ImpalaServer implements ImpalaService and ImpalaInternalService
- renaming ImpalaInternalService fields to conform to c++ style
- merged impala-service.{cc,h} and backend-service.{cc,h} into impala-server.{cc,h}
- added TStatusCode field to Status.ErrorDetail
- removed ImpalaInternalService.CloseChannel

Also removed JdbcDriverTest.java
2014-01-08 10:44:15 -08:00
Kay Ousterhout
073e38d6c2 Added the StateStore, a centralized repository for soft state.
The commit also adds the StateStoreSubscriber, a component that
runs alongside each impalad and handles communication with
the state store.
2012-07-13 09:26:16 -07:00
Alan Choi
f52286f72c This completes the Beeswax implementation for ODBC. All the ODBC tests
(CDH/hive-odbc-test) passes (except those with "create table" and "show table".

We should have nightly regression of the odbc test to run against impalad.

There're still a few issues:
1. running with num_node > 0 crashes the coordinator;
2. work around for a few ODBC jiras
3. no test for bool/timestamp because ODBC doesn't support them.

review: issue 110
2012-06-18 14:46:46 -07:00
Alan Choi
ef10afa439 This changes the Thrift from 0.6.1 to 0.7.0. Please uninstall the old thrift and download/install Thrift 0.7.0.
Beeswax service now depends on Hive metastore;
fix buildall.sh to clean generated-source in FE;
fix .gitignore to clean generated-source in BE;
2012-06-14 18:21:08 -07:00
Alan Choi
7af87c7dea Beeswax Service for Impala (partiial implementation)
review id: 82
2012-06-06 10:08:06 -07:00
Henry Robinson
3ff3559805 Add support for per-partition file formats to front end and backend.
At the same time, this patch removes the partitionKeyRegex in favour
of explicitly sending a list of literal expressions for each file path
from the front end.
2012-06-05 12:00:09 -07:00
Marcel Kornacker
4a4a07fde7 A number of changes for the Jenkins build:
- added option to run with derby metastore, based on whether env var METASTORE_IS_DERBY is set
- emoved hardwired file locations from planner tests
- switching to linking statically against libthrift.a

Also added script rebuild.sh, which contains the build steps of buildall.sh (against impala sources).
2012-03-08 16:19:47 -08:00
Nong Li
b410b62716 Add distributed profile counter for the BE. 2012-03-01 13:59:17 -08:00
Nong Li
88237350f0 Change the build to allow debug and release builds to coexist. 2012-02-17 18:14:04 -08:00
Nong Li
94db70c9fd Fix build. Dependencies don't propagate right on first build. 2011-12-30 21:28:18 -08:00
Nong Li
c84fec38d3 - Move thrift out of FE src and into impala/common
- Thrift files now build using cmake instead of mvn
- Added cmake build to impala/ which drives the build process
2011-12-30 19:35:20 -08:00
Marcel Kornacker
c056445612 Added m:n data streams:
- DataStreamSender: sender side (1:n) for a single stream
- DataStreamMgr: receiver side; singleton class for all incoming streams active at a node

Changed ExecNode::GetNext() to return eos indicator explicitly; this allows us to pass incoming TRowBatches (which may not be full) up w/o copying the data.

Added data-stream-test.
2012-01-10 18:00:20 -08:00
Alexander Behm
c7f7382c31 Added planner changes and data sinks for INSERT statements. 2011-12-12 15:14:49 -08:00
Nong Li
ea9d4b94f4 Adding opcode cmakelists.txt 2011-11-20 14:27:27 -08:00
Nong Li
b1833d4de8 Implmented opcode registry. Added substr() and pi() functions. Added backend testing to buildall.sh 2011-11-20 13:44:41 -08:00
Marcel Kornacker
0914fedea9 Defining Impala backend service, which is exported by backend processes to service plan
fragment execution requests.
Changing thrift plan-related structs to pull out runtime parameters in preparation for parallel execution.
2011-11-02 14:47:32 -07:00
Marcel Kornacker
c534062c20 fixing build failure introduced by 7d9b7e2 2011-08-03 15:37:58 -07:00
Marcel Kornacker
cc141953de Adding plan service for be test driver
Adding mock implementation of libhdfs (only what's needed for text-scan-node)
in order to avoid having to make any jni calls.
Some bug fixes.
2011-07-22 12:09:55 -07:00
marcel
08e8a5db4c fixed jni and linker problems
some bug fixes and missing functions

some cleanup
2011-07-15 13:17:20 -07:00
Marcel Kornacker
c23616a30c deserializing plan request in c++
Coordinator.main(): util function to execute single query against test schema
removed dead code from TestSchemaUtils
2011-07-13 13:48:54 -07:00
marcel
3286190599 Initial version of backend. 2011-07-07 15:49:46 -07:00