This change stops including some boost library header files
that pull in other, unnecessary boost library headers.
This reduces the amount of cross-compiled code which needs
to be materialized during codegen.
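For illustration, the include-narrowing pattern looks like the following
(the specific headers touched by the patch may differ):

  // Prefer the minimal header that defines only the needed types...
  #include <boost/date_time/posix_time/posix_time_types.hpp>
  // ...over the umbrella header, which also pulls in formatters, IO, etc:
  // #include <boost/date_time/posix_time/posix_time.hpp>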
This change also removes some UDFs' Prepare() and Close()
functions, and the UDF functions fromUtc(), toUtc() and uuid(),
from cross-compilation, as they won't benefit from it.
With this change, the bitcode module shrinks from 2.12 MB to 1.86 MB.
Change-Id: I543809c69da0b4085a0e299b91cd550b274c46af
Reviewed-on: http://gerrit.cloudera.org:8080/3793
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:
http://www.apache.org/legal/src-headers.html#headers
Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
http://www.apache.org/legal/src-headers.html#notice
to follow that format and add a line for Cloudera.
3) Replace the existing license text with the ASF license text given
on the website, or add it if the file has none.
Much of this change was automatically generated via:
git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]
Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.
[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
modification to ORIG_LICENSE to match Impala's license text.
Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
This patch refactors hdfs-parquet-scanner.cc into several files.
The new responsibilities of each file/component are roughly as follows:
hdfs-parquet-scanner.h/cc
- Creates column readers and uses them to materialize row batches.
- Evaluates runtime filters and conjuncts, populates row batch queue.
parquet-metadata-utils.h/cc
- Contains utilities for validating Parquet file metadata.
- Parses the schema of a Parquet file into our internal schema
representation.
- Resolves SchemaPaths (e.g. from a table descriptor) against
the internal representation of the Parquet file schema.
parquet-column-readers.h/cc
- Contains the per-column data reading, parsing and value
materialization logic.
Testing: A private core/hdfs run passed.
Change-Id: I4c5fd46f9c1a0ff2a4c30ea5a712fbae17c68f92
Reviewed-on: http://gerrit.cloudera.org:8080/3596
Tested-by: Internal Jenkins
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
gen_build_version.sh previously had a --noclean option which did not
overwrite the version information if it was already populated. Since
--noclean was the default option, the version information was
effectively never updated.
This patch modifies gen_build_version.py to generate a
common/version.cc instead of a common/version.h. Now,
common/version.h will be a part of the git repo and will not need to
be modified on every build. It declares the functions that will return
the build information. These functions will be defined in
common/version.cc and the build information will change on every new
build.
Since only the .cc file changes on every build, we will not incur a
highly noticeable change in build times.
Also changed the function names from GetImpalaBuild...() to
GetImpaladBuild...() so as to avoid naming confusion between the
Impala-lzo and the Impala functions.
There is an accompanying change in the Impala-lzo library too.
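A rough sketch of the split, using an illustrative accessor name (the
actual names and signatures may differ):

  // common/version.h: checked into git, unchanged across builds; only
  // declares the accessors.
  const char* GetImpaladBuildVersion();

  // common/version.cc: regenerated by gen_build_version.py on every
  // build; defines the accessors with the current build information.
  const char* GetImpaladBuildVersion() { return "<version string>"; }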
Change-Id: Ie461110b6f8ca545f04ea33b7b502aea550b8551
Reviewed-on: http://gerrit.cloudera.org:8080/2651
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Sailesh Mukil <sailesh@cloudera.com>
This patch adds two query options for runtime filters:
RUNTIME_FILTER_MAX_SIZE
RUNTIME_FILTER_MIN_SIZE
These options bound the minimum and maximum size of a filter,
regardless of the estimates produced by the planner. Filter sizes
are rounded up to the nearest power of two.
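A minimal sketch of the resulting sizing logic, with hypothetical names
(the actual code may differ):

  #include <algorithm>
  #include <cstdint>

  // Round the planner's estimate up to a power of two, then clamp it to
  // the [min, max] range given by the query options.
  int64_t ClampFilterSize(int64_t planner_estimate, int64_t min_size,
                          int64_t max_size) {
    int64_t size = 1;
    while (size < planner_estimate) size <<= 1;
    return std::max(min_size, std::min(size, max_size));
  }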
Change-Id: I5c13c200a0f1855f38a5da50ca34a737e741868b
Reviewed-on: http://gerrit.cloudera.org:8080/2966
Tested-by: Internal Jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
This patch doesn't use 'auto' for the loop index type, as it's not clear
yet where the savings in typing outweigh the cost of eliding the type.
Change-Id: Iae1ca36313e3562311b6418478bf54b6d9b0bf7d
Reviewed-on: http://gerrit.cloudera.org:8080/2890
Tested-by: Internal Jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
This patch helps reduce compile times when modifying function
implementations in decimal-value.h, e.g. when tuning the implementations
of decimal operators. decimal-value.h was included in many places that
only need to know about the layout of DecimalValue, not the
implementation of decimal operations. It was included indirectly in many
files, e.g. via runtime-state.h.
The patch moves those functions to decimal-value.inline.h and is able to
avoid including decimal-value.inline.h in most headers. We also need to
do the same thing for raw-value.h and runtime-filter.h, because some
of the inline functions in raw-value.h referenced inline functions in
decimal-value.h, and functions in runtime-filter.h referenced inline
functions in raw-value.h.
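A minimal sketch of the header split (illustrative, not the actual
class definition):

  // decimal-value.h: only the layout, cheap to include everywhere.
  template <typename T>
  class DecimalValue {
   public:
    T value() const;  // declared here, defined in the .inline.h
   private:
    T value_;
  };

  // decimal-value.inline.h: definitions, included only by code that
  // actually calls these functions.
  template <typename T>
  inline T DecimalValue<T>::value() const { return value_; }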
It also moves timestamp parsing logic from .h to .cc file. This slightly
reduces the size of the llvm bitcode module and will slightly reduce
compile times. The functions are too large to benefit from inlining in
generated code.
Change-Id: Ic7a2f388cd14a4427c43af2724340a2ffe8fae3d
Reviewed-on: http://gerrit.cloudera.org:8080/2485
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
This patch has the FE include only materialized slots in the tuple
descriptors shipped to the BE. This simplifies BE code which had to
skip over unmaterialized slots (which aren't used anywhere outside the
FE).
Change-Id: I2f69078a391e38d30fa129fba12185208375b7c9
Reviewed-on: http://gerrit.cloudera.org:8080/1764
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
Also renames ArrayVal to CollectionVal and related variable
names. CollectionValues represent both arrays and maps, so ArrayValue
was a misleading name.
Change-Id: I5b482e4dafcffda7c6e8f3e71f7b9fa34125f5c4
Reviewed-on: http://gerrit.cloudera.org:8080/1266
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
This patch adds ArrayValue, the in-memory representation of arrays and
maps (which are treated as an array of key/value structs). It also
adds ArrayValueBuilder, which is a helper class for creating
ArrayValues.
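A rough sketch of the representation (the field names are assumptions,
not necessarily the actual members):

  #include <cstdint>

  // An array value is a buffer of consecutive tuples plus a count; a map
  // is the same thing where each tuple is a key/value struct.
  struct ArrayValue {
    uint8_t* ptr = nullptr;  // buffer holding 'num_tuples' tuples
    int num_tuples = 0;
  };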
Change-Id: Iba0348d1a25876bbed452c93d2c4ed90a701e9d3
Reviewed-on: http://gerrit.cloudera.org:8080/487
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
This patch removes all occurrences of "using namespace std" and "using
namespace boost(.*)" from the codebase. However, there are still cases
where namespace directives are used (e.g. for rapidjson, thrift,
gutil). These have to be tackled in subsequent patches.
To reduce the patch size, this patch introduces a new header file called
"names.h" that provides using-declarations for many of our most
frequently used symbols, each taking effect only if the corresponding
header was already included. For example, the declarations for map /
string / vector apply only if the matching standard header was already
included. This requires "common/names.h" to be the last include. After
including `names.h`, a new block contains a sorted list of
using-declarations. (This patch does not fix namespace directives for
namespaces other than std / boost.)
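A sketch of the pattern (the guard macros shown are libstdc++'s; the
actual header may key off different macros):

  // common/names.h: each using-declaration takes effect only if the
  // corresponding standard header was already included, detected via
  // its include guard.
  #ifdef _GLIBCXX_VECTOR
  using std::vector;
  #endif
  #ifdef _GLIBCXX_STRING
  using std::string;
  #endif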
Change-Id: Iebe4c054670d655bc355347e381dae90999cfddf
Reviewed-on: http://gerrit.cloudera.org:8080/338
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
These are the backend changes necessary for reading structs in Parquet
files. I wrote this against Alex's preliminary frontend work, and
ad-hoc tables containing structs work. We won't be able to add
automated tests until the FE changes are in as well, but I'd like to
get these changes in so we can at least get coverage of our existing
workloads.
The bulk of the changes are in the Parquet scanner. The rest is around
changing the column index of a slot descriptor to a column path, in
order to support nested columns.
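For example, where a slot previously carried a single column index, it
now carries a path of indexes into the nested schema tree (a sketch;
the actual typedef may differ):

  #include <vector>

  // {2, 0} would mean: third top-level column, then its first child.
  typedef std::vector<int> SchemaPath;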
Change-Id: Ifbd865b52c2b4679d81643184b1f36bf539ffcfd
Reviewed-on: http://gerrit.cloudera.org:8080/62
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
This patch reworks a lot of the metrics subsystem, laying much of the
groundwork for unifying runtime profiles and metrics in the future, as
well as enabling better rendering of metric data in our webpages, and
richer integration with thirdparty monitoring tools like CM.
There are lots of changes. The most significant are below.
TODO (incomplete list):
* Add descriptions for all metrics
* Settle on a standard hierarchy for process-wide metric groups
* Add path-based resolution for searching for metrics (i.e. resolve
"group1.group2.metric_name")
* Add a histogram metric type
Improvements for all metrics:
** New 'description' field, which allows a human-readable description to
be provided for each metric.
** Metrics must serialise themselves to JSON via the RapidJson
library (all by-hand JSON serialisation has been removed).
** Metrics are contained in MetricGroups (replacing the old 'Metrics'
class), which are hierarchically arranged to make grouping metrics
into smaller subsystems more natural (see the sketch below).
** Metrics are rendered via the new webserver templating engine,
replacing the old /metrics endpoint. The old /jsonmetrics endpoint is
retained for backwards compatibility.
Improvements for 'simple' metrics:
** SimpleMetric replaces the old PrimitiveMetric class (reusing much of
the same code); these are metrics whose value does not itself have
relevant structure (as opposed to sets, lists, etc).
** SimpleMetrics have 'kinds' (counter, gauge, property etc)
** ... and units (from TCounterType), to make pretty-printing easier.
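A self-contained sketch of the hierarchical grouping idea (not Impala's
actual API):

  #include <map>
  #include <memory>
  #include <string>

  // Each group owns its child groups, forming the tree that replaces
  // the old flat 'Metrics' class.
  class MetricGroup {
   public:
    explicit MetricGroup(std::string name) : name_(std::move(name)) {}
    MetricGroup* GetOrCreateChild(const std::string& name) {
      auto& child = children_[name];
      if (child == nullptr) child.reset(new MetricGroup(name));
      return child.get();
    }
   private:
    std::string name_;
    std::map<std::string, std::unique_ptr<MetricGroup>> children_;
  };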
Change-Id: Ida1d125172d8572dfe9541b4271604eff95cfea6
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5722
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
Currently we do not support per record compression for SEQUENCEFILE; we do support no
compression and block compression. Per record compression is typically very slow
(since the compressor is invoked per record in the table) and not widely used.
We chose to add support for per record compression as part of our effort to use Impala
for all of our testdata loading infrastructure. We have per record compressed tables
in testdata, so even though there is no customer demand for per record compression,
we need it to migrate our data loading off of Hive.
Change-Id: I6ea98ae0d31cceff8236b4b006c3a9fc00f64131
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5302
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
(cherry picked from commit f62a76f8d00b8dbc2846deb36ee5f65031ad846e)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5322
Introduces support for writing tables stored as Avro files. This supports writing all
data types except TIMESTAMP. Supports the following COMPRESSION_CODECs: NONE, DEFLATE,
SNAPPY.
Change-Id: Ica62063a4f172533c30dd1e8b0a11856da452467
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3863
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 15c6066d05d5077bee0d5123d26777b0715eb9c6)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4056
The runtime profile as we present it is not very useful, and I think
its structure makes it hard to consume. This patch adds a new
client-facing, schematized set of counters that are collected from the
runtime profiles. For example, with this structure it would be easy to
have the shell get the stats of a running query and print a useful
progress report, or to check the most relevant metrics for diagnosing
issues.
Here's an example of the output for one of the tpch queries:
Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail
------------------------------------------------------------------------------------------------------------------------
09:MERGING-EXCHANGE 1 79.738us 79.738us 5 5 0 -1.00 B UNPARTITIONED
05:TOP-N 3 84.693us 88.810us 5 5 12.00 KB 120.00 B
04:AGGREGATE 3 5.263ms 6.432ms 5 5 44.00 KB 10.00 MB MERGE FINALIZE
08:AGGREGATE 3 16.659ms 27.444ms 52.52K 600.12K 3.20 MB 15.11 MB MERGE
07:EXCHANGE 3 2.644ms 5.1ms 52.52K 600.12K 0 0 HASH(o_orderpriority)
03:AGGREGATE 3 342.913ms 966.291ms 52.52K 600.12K 10.80 MB 15.11 MB
02:HASH JOIN 3 2s165ms 2s171ms 144.87K 600.12K 13.63 MB 941.01 KB INNER JOIN, BROADCAST
|--06:EXCHANGE 3 8.296ms 8.692ms 57.22K 15.00K 0 0 BROADCAST
| 01:SCAN HDFS 2 1s412ms 1s978ms 57.22K 15.00K 24.21 MB 176.00 MB tpch.orders o
00:SCAN HDFS 3 8s032ms 8s558ms 3.79M 600.12K 32.29 MB 264.00 MB tpch.lineitem l
Change-Id: Iaad4b9dd577c375006313f19442bee6d3e27246a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2964
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
- A few places didn't have a total timer at the beginning.
- The async build thread for blocking join nodes really messed things up
(the sum of the children's times was more than the time in the join
node).
Change-Id: I9176ce37cf22f2bcebea21b117e45cce066dbc1d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2276
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
This patch cleans up analysis and execution of scalar and aggregate functions
so that there is no difference between how builtins and user functions are
handled. The only difference is that the catalog is populated with the builtins
all the time.
The BE always gets a TFunction object and just executes it (builtins will have
an empty hdfs file location).
This removes the opcode registry; all of its functionality is subsumed
by the catalog, where most of it was already duplicated anyway.
This also introduces the concept of a system database: a database that
the user cannot modify and that is populated automatically on startup.
Change-Id: Iaa3f84dad0a1a57691f5c7d8df7305faf01d70ed
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1386
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1577
This patch incorporates the gutil library in thirdparty/gutil into the
build, and uses strings::Substitute in one place as a proof-of-concept.
Some other cleanups for $IMPALA_HOME/CMakeLists.txt are included.
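For reference, the proof-of-concept API is used like this (a usage
example, not the actual call site in the patch):

  #include <string>
  #include "gutil/strings/substitute.h"

  // Positional $N placeholders are replaced by the corresponding
  // argument.
  std::string msg =
      strings::Substitute("Failed to read $0 of $1 bytes", 512, 1024);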
Change-Id: I851bf6f130c2f4039f3df3c6d60f842a5661e5da
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1026
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: jenkins
This patch fixes a deficiency in a previous attempt to keep legacy
compatibility with CM4 when it comes to query IDs sent to the debug page
for cancellation. Those query IDs are sent as <decimal-int>
<decimal-int>, whereas going forward we want to accept <hex-int>:<hex-int>.
Change-Id: I4a3611d1e0c613198861b2c8052aa48ef7bc8714
Reviewed-on: http://gerrit.ent.cloudera.com:8080/950
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
This patch goes some way to improving recovery after an INSERT
fails. Inserts now write intermediate results to
<table_dir>/.impala_insert_staging. After execution completes, either
successfully or not, the query-specific directory under that directory
is deleted.
This doesn't complete the job for better cleanup (although this goes as
far as IMPALA-449 suggests). Two things to do in the future:
* Have each backend delete its own staging files on error. The
difficulty getting there now is that backends don't know if they are
cancelled in error or because a LIMIT was reached.
* If the operation to move files to their final destinations should
fail during FinalizeQuery(), the coordinator should perform
compensating actions and delete the files that did get moved.
Note: We also considered a query-wide and impalad-wide option to change
the staging dir. There are advantages to this (all intermediate results
go to a known location which is easy to clean up on failure), but also
security and other operational concerns. Worth revisiting in the future.
Change-Id: Ia54cf36db6a382e359877f87d7d40aad7fdb77be
Reviewed-on: http://gerrit.ent.cloudera.com:8080/670
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
The Impala CatalogService manages the caching and dissemination of cluster-wide metadata.
The CatalogService combines the metadata from the Hive Metastore, the NameNode,
and potentially additional sources in the future. The CatalogService uses the
StateStore to broadcast metadata updates across the cluster.
The CatalogService also directly handles metadata update requests from
impalad servers (DDL requests). It exposes a Thrift interface that allows
impalads to connect directly and execute their DDL operations.
The CatalogService has two main components: a C++ server that implements
StateStore integration, the Thrift service implementation, and the export
of the debug webpage/metrics; and the Java Catalog, which manages the
caching and updating of all the metadata. For each StateStore heartbeat,
a delta of all metadata updates is broadcast to the rest of the cluster.
Some Notes On the Changes
---
* The metadata is all sent as thrift structs. To do this, all catalog
objects (Tables/Views, Databases, UDFs) have a thrift struct to represent
them. These are sent with each statestore delta update.
* The existing Catalog class has been separated into two separate
subclasses, ImpaladCatalog and CatalogServiceCatalog. See the comments on
those classes for more details.
What is working:
* New CatalogService created
* Working with statestore delta updates and latest UDF changes
* DDL performed on Node 1 is now visible on all other nodes without a "refresh".
* Each DDL operation against the Catalog Service will return the catalog
version that contains the change. An impalad will wait for the statestore
heartbeat that contains this version before returning from the DDL
command (see the sketch after this list).
* All table types (Hbase, Hdfs, Views) getting their metadata propagated properly
* Block location information included in CS updates and used by Impalads
* Column and table stats included in CS updates and used by Impalads
* Query tests are all passing
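A hypothetical sketch of that version wait (not the actual
implementation):

  #include <condition_variable>
  #include <cstdint>
  #include <mutex>

  std::mutex catalog_lock;
  std::condition_variable catalog_cv;  // signalled on statestore updates
  int64_t current_catalog_version = 0;

  // Block the DDL response until the heartbeat carrying the version
  // returned by the Catalog Service has been applied locally.
  void WaitForCatalogVersion(int64_t target_version) {
    std::unique_lock<std::mutex> l(catalog_lock);
    catalog_cv.wait(l,
        [&] { return current_catalog_version >= target_version; });
  }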
Still TODO:
* Directly return catalog object metadata from DDL requests
* Poll the Hive Metastore to detect new/dropped/modified tables
* Reorganize the FE code for the Catalog Service. I don't think we want everything in the
same JAR.
Change-Id: I8c61296dac28fb98bcfdc17361f4f141d3977eda
Reviewed-on: http://gerrit.ent.cloudera.com:8080/601
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
This patch also adds a number of improvements to NativeUdfExpr. Highlights include:
* Correctly handling the lowering of AnyVal struct types (required for ABI compatibility)
* A rudimentary library cache for reusing handles produced by dlopen
(sketched below)
* More complicated test cases
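A hypothetical sketch of the library cache idea (not the actual
implementation):

  #include <dlfcn.h>
  #include <map>
  #include <string>

  // Reuse one dlopen handle per library path instead of reopening the
  // library for every expression that references it.
  std::map<std::string, void*> lib_cache;

  void* GetLibHandle(const std::string& path) {
    auto it = lib_cache.find(path);
    if (it != lib_cache.end()) return it->second;
    void* handle = dlopen(path.c_str(), RTLD_NOW);
    if (handle != nullptr) lib_cache[path] = handle;
    return handle;
  }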
Change-Id: Iab9acdd7d7c4308e5d7ee3210f21b033fda5a195
Reviewed-on: http://gerrit.ent.cloudera.com:8080/540
Tested-by: jenkins
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
This change adds support for audit event logging in Impala. This feature
is disabled by default and is enabled by setting the -audit_event_log_dir
flag. When auditing is enabled, details on each query that Impala executes
will be saved to the audit log along with the current session state. This
includes information such as the statement type, catalog objects accessed
by the query, and whether the operation passed authorization.
Change-Id: I39b78664c971124ec79c5fcee998065dd53fd32e
Reviewed-on: http://gerrit.ent.cloudera.com:8080/142
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
This patch adds a counter for the average number of active scanner
threads (i.e. those that are not blocked by IO).
In the hdfs-scan-node, whenever a thread is started, it will increment the
active_scanner_thread_counter_. When a scanner thread enters the
scan-range-context's GetRawBytes or GetBytes, the counter will be
decremented. A new sampling thread is created to sample the value of
active_scanner_thread_counter_ and compute the average.
A bucket counting of HdfsReadThreadConcurrency is also added.
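A simplified sketch of the sampling logic (hypothetical names; the real
code hooks into the profile counters):

  #include <atomic>
  #include <chrono>
  #include <thread>

  std::atomic<int> active_scanner_thread_counter{0};

  // Periodically sample the counter and keep a running average, reported
  // as AverageScannerThreadConcurrency.
  void SampleScannerConcurrency(std::atomic<bool>& done,
                                double& average_out) {
    int64_t num_samples = 0;
    double sum = 0;
    while (!done.load()) {
      sum += active_scanner_thread_counter.load();
      average_out = sum / ++num_samples;
      std::this_thread::sleep_for(std::chrono::milliseconds(500));
    }
  }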
The output of the hdfs-scan-node profile is also updated. Here's the new output
for hdfs-scan-node after running count(*) from tpch.lineitem.
HDFS_SCAN_NODE (id=0):(10s254ms 99.75%)
File Formats: TEXT/NONE:12
Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:6/351.21M
(351208888) 1:6/402.65M (402653184)
- AverageHdfsReadThreadConcurrency: 1.95
- HdfsReadThreadConcurrencyCountPercentage=0: 0.00
- HdfsReadThreadConcurrencyCountPercentage=1: 5.00
- HdfsReadThreadConcurrencyCountPercentage=2: 95.00
- HdfsReadThreadConcurrencyCountPercentage=3: 0.00
- AverageScannerThreadConcurrency: 0.15
- BytesRead: 718.94 MB
- MemoryUsed: 0.00
- NumDisksAccessed: 2
- PerReadThreadRawHdfsThroughput: 36.75 MB/sec
- RowsReturned: 6.00M (6001215)
- RowsReturnedRate: 585.25 K/sec
- ScanRangesComplete: 12
- ScannerThreadsInvoluntaryContextSwitches: 168
- ScannerThreadsTotalWallClockTime: 1m40s
- DelimiterParseTime: 2s128ms
- MaterializeTupleTime: 723.0us
- ScannerThreadsSysTime: 10.0ms
- ScannerThreadsUserTime: 2s090ms
- ScannerThreadsVoluntaryContextSwitches: 99
- TotalRawHdfsReadTime: 19s561ms
- TotalReadThroughput: 68.69 MB/sec
Also includes
- Cleanup of HdfsScanNode/IoMgr interaction
- Rename of ScanRangeContext to ScannerContext
- Removed files that were no longer being used
Rdtsc is not accurate, due to changes in cpu frequency. Very often, the
time reported in the profile is even longer than the time reported by the
shell. This patch replaces Rdtsc with CLOCK_MONOTONIC. It is as fast as
Rdtsc and accurate: it is not affected by cpu frequency changes, nor by
the user setting the system clock.
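For example, a monotonic timer primitive of the kind this patch switches
to (a sketch, not the exact Impala code):

  #include <cstdint>
  #include <ctime>

  // CLOCK_MONOTONIC is unaffected by cpu frequency scaling and by the
  // user setting the system clock.
  int64_t MonotonicNanos() {
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
  }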
Note that the new profile report will always report time, rather than
clock cycles. Here's the new profile:
Averaged Fragment 1:(68.241ms 0.00%)
completion times: min:69ms max:69ms mean: 69ms stddev:0
execution rates: min:91.60 KB/sec max:91.60 KB/sec mean:91.60 KB/sec
stddev:0.00 /sec
split sizes: min: 6.32 KB, max: 6.32 KB, avg: 6.32 KB, stddev: 0.00
- RowsProduced: 1
CodeGen:
- CodegenTime: 566.104us <--* reporting in microsec instead of
clock cycle
- CompileTime: 33.202ms
- LoadTime: 2.671ms
- ModuleFileSize: 44.61 KB
DataStreamSender:
- BytesSent: 16.00 B
- DataSinkTime: 50.719us
- SerializeBatchTime: 18.365us
- ThriftTransmitTime: 145.945us
AGGREGATION_NODE (id=1):(68.384ms 15.50%)
- BuildBuckets: 1.02K
- BuildTime: 13.734us
- GetResultsTime: 6.650us
- MemoryUsed: 32.01 KB
- RowsReturned: 1
- RowsReturnedRate: 14.00 /sec
HDFS_SCAN_NODE (id=0):(57.808ms 84.71%)
- BytesRead: 6.32 KB
- DelimiterParseTime: 62.370us
- MaterializeTupleTime: 767ns
- MemoryUsed: 0.00
- PerDiskReadThroughput: 9.32 MB/sec
- RowsReturned: 100
- RowsReturnedRate: 1.73 K/sec
- ScanRangesComplete: 4
- ScannerThreadsInvoluntaryContextSwitches: 0
- ScannerThreadsReadTime: 662.431us
- ScannerThreadsSysTime: 0
- ScannerThreadsTotalWallClockTime: 25ms
- ScannerThreadsUserTime: 0
- ScannerThreadsVoluntaryContextSwitches: 4
- TotalReadThroughput: 0.00 /sec