impala

mirror of https://github.com/apache/impala.git synced 2026-01-07 09:02:19 -05:00

Author	SHA1	Message	Date
Michael Smith	73a9ef9c4c	IMPALA-13005: Create Query Live table in HMS Creates the 'sys.impala_query_live' table in HMS using a similar 'CREATE TABLE' command to 'sys.impala_query_log'. Updates frontend to identify a System Table based on the '__IMPALA_SYSTEM_TABLE' property. Tables improperly marked with '__IMPALA_SYSTEM_TABLE' will error when attempting to scan them because no relevant scanner will be available. Creating the table in HMS simplifies supporting 'SHOW CREATE TABLE' and 'DESCRIBE EXTENDED', so allows them for parity with Query Log. Explicitly disables 'COMPUTE STATS' on system tables as it doesn't work correctly. Makes System Tables work with local catalog mode, fixing LocalCatalogException: Unknown table type for table sys.impala_query_live Updates workload management implementation to rely more on SystemTables.thrift definition, and adds DCHECKs to verify completeness and ordering. Testing: - adds additional test cases for changes to introspection commands - passes existing test_query_live and test_query_log suites Change-Id: Idf302ee54a819fdee2db0ae582a5eeddffe4a5b4 Reviewed-on: http://gerrit.cloudera.org:8080/21302 Reviewed-by: Riza Suminto <riza.suminto@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2024-04-26 23:21:56 +00:00
wzhou-code	c9c5fb89b5	IMPALA-12156: Support High Availability for Statestore To support statestore HA, we allow two statestored instances in an Active-Passive HA pair to be added to an Impala cluster. We add the preemptive behavior for statestored. When HA is enabled, the preemptive behavior allows the statestored with the higher priority to become active and the paired statestored becomes standby. The active statestored acts as the owner of Impala cluster and provides statestore service for the cluster members. To enable statestore HA for a cluster, two statestoreds in the HA pair and all subscribers must be started with starting flag "enable_statestored_ha" as true. This patch makes the following changes: - Defined new service for Statestore HA. - Statestored negotiates the role for HA with its peer statestore instance on startup. - Create HA monitor thread: Active statestored sends heartbeat to standby statestored. Standby statestored monitors peer's connection states with their subscribers. - Standby statestored sends heartbeat to subscribers with request for connection state between active statestore and subscribers. Standby statestored saves the connection state as failure detector. - When standby statestored loses connection with active statestore, it checks the connection states for active statestore, and takes over active role if a majority of subscribers have lost connections with active statestore. - New active statestored sends RPC notification to all subscribers for new active statestored and active catalogd elected by the new active statestored. - New active statestored starts sending heartbeat to its peer when it receives handshake from its peer. - Active statestored enters recovery mode if it loses connections with its peer statestored and all subscribers. It keeps sending HA handshake to its peer until receiving response. - All subscribers (impalad/catalogd/admissiond) register to two statestoreds. 
- Subscribers report connection state for the requests from standby statestore. - Subscribers switch to new active statestore when receiving RPC notifications from new active statestored. - Only active statestored sends topic update messages to subscribers. - Add a new option "enable_statestored_ha" in script bin/start-impala-cluster.py for starting Impala mini-cluster with statestored HA enabled. - Add a new Thrift API in statestore service to disable network for statestored. It is only used by unit tests to simulate network failure. For safety, it only works when the debug action is set in the startup flags. Testing: - Added end-to-end unit tests for statestored HA. - Passed core tests Change-Id: Ibd2c814bbad5c04c1d50c2edaa5b910c82a6fd87 Reviewed-on: http://gerrit.cloudera.org:8080/20372 Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2023-10-24 22:05:36 +00:00
Riza Suminto	1bcb35ec79	IMPALA-11384: Upgrade CPP thrift components to thrift-0.16.0 This patch upgrades IMPALA_THRIFT_CPP_VERSION=0.16.0-p3 to mitigate CVE-2020-13949 and hang issue with newer JDBC client. IMPALA_TOOLCHAIN_BUILD_ID is upgraded to 179-9977806f06, which contains the required thrift-0.16.0 compiler. The refactoring itself includes: - Removing non-generated (empty) *_constants.cpp and adjustment for THRIFT-4730. - Adjusts error handling in client-request-state.cc due to changing thrift error message after the upgrade. - Adding custom build target thrift-cpp-manual-edit to fix clang-diagnostic-inconsistent-missing-override warning in generated ImpalaHiveServer2Service.h and ImpalaService.h Testing: - Build and pass core tests. Change-Id: Ic278ac5c973ff5c3e829a6139b9c16e9d2c62a59 Reviewed-on: http://gerrit.cloudera.org:8080/18661 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>	2022-06-24 23:01:22 +00:00
Kurt Deschler	311938b4f5	IMPALA-10535: Add interface to ImpalaServer for execution of externally compiled statements The ExecutePlannedStatement interface allows an externally supplied TExecRequest to be executed by impalad. The TExecRequest must be fully populated and will be sent directly to the backend for execution. The following fields in the TExecRequest are updated by the coordinator: - Hostname - KRPC address - Local Timezone In order to add the interface to ImpalaInternalService.thrift, several of the thrift classes were moved to Query.thrift to avoid a circular dependency with Frontend.thrift. Added functionality to format and dump TExecRequest structures to path specified in debug flag dump_exec_request_path. A start timestamp field has been added to TExecRequest to represent the interval in the query profile between when the request was sent by the external frontend and handled by the backend. A local timestamp field has been added to the Ping result struct to return the current backend timestamp. This is used by the external frontend to populate the start timestamp. Also included is a change to avoid generating silent AnalysisExceptions during table resolution. Tested with TExecRequest structures populated by external frontend. Local timezone change tested with the INT64 TIMESTAMP datatype. Reviewed-by: John Sherman <jfs@cloudera.com> Change-Id: Iace716dd67290f08441857dc02d2428b0e335eaa Reviewed-on: http://gerrit.cloudera.org:8080/17104 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Tested-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>	2021-03-12 17:49:08 +00:00
wzhou-code	6bb3b88d05	IMPALA-9180 (part 1): Remove legacy ImpalaInternalService The legacy Thrift based Impala internal service has been deprecated and can be removed now. This patch removes ImpalaInternalService. All infrastructures around it are cleaned up, except one place for flag be_port. StatestoreSubscriber::subscriber_id includes be_port, but we cannot change the format of subscriber_id now. This remaining be_port issue will be fixed in a succeeding patch (part 4). TQueryCtx.coord_address is changed to TQueryCtx.coord_hostname since the port in TQueryCtx.coord_address is set as be_port and is unused now. Also renames TQueryCtx.coord_krpc_address to TQueryCtx.coord_ip_address. Testing: - Passed the exhaustive test. - Passed Quasar-L0 test. Change-Id: I5fa83c8009590124dded4783f77ef70fa30119e6 Reviewed-on: http://gerrit.cloudera.org:8080/16291 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-09-30 22:41:00 +00:00
Anurag Mantripragada	d75562a7e3	IMPALA-9256: Refactor constraint information into a separate class. This change refactors the primary keys and foreign keys into a SqlConstraints class since they are almost always used together. This work also helps extend the constraints class to include other constraints we may support in the future. (Ex: Unique constraints.) This patch also: - Fixes a bug in the MetadataOp.getPrimaryKeys() and getForeignKeys() which returned incorrect results. The tests did not catch this before because we did not have tests to verify individual resultset rows. The patch modifies these tests. - Fixes a bug in foreign key constraint name generation that was causing foreign keys corresponding to a composite primary key to get different foreign key constraint names instead of the same name. - Introduces a canonical representation for foreign keys to prevent bugs like IMPALA-9372 which can occur due to HMS returning results in inconsistent ways. Testing: - Fixed the tests to work with the new behavior. - Ran all the PK/FK tests. Change-Id: I3f1c441c24df84d2d0791ffe94dff60d039a3341 Reviewed-on: http://gerrit.cloudera.org:8080/15213 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-02-15 12:50:43 +00:00
Sahil Takiar	d037ac8304	IMPALA-8818: Replace deque with spillable queue in BufferedPRS Replaces DequeRowBatchQueue with SpillableRowBatchQueue in BufferedPlanRootSink. A few changes to BufferedPlanRootSink were necessary for it to work with the spillable queue, however, all the synchronization logic is the same. SpillableRowBatchQueue is a wrapper around a BufferedTupleStream and a ReservationManager. It takes in a TBackendResourceProfile that specifies the max / min memory reservation the BufferedTupleStream can use to buffer rows. The 'max_unpinned_bytes' parameter limits the max number of bytes that can be unpinned in the BufferedTupleStream. The limit is a 'soft' limit because calls to AddBatch may push the amount of unpinned memory over the limit. The queue is non-blocking and not thread safe. It provides AddBatch and GetBatch methods. Calls to AddBatch spill if the BufferedTupleStream does not have enough reservation to fit the entire RowBatch. Adds two new query options: 'MAX_PINNED_RESULT_SPOOLING_MEMORY' and 'MAX_UNPINNED_RESULT_SPOOLING_MEMORY', which bound the amount of pinned and unpinned memory that a query can use for spooling, respectively. MAX_PINNED_RESULT_SPOOLING_MEMORY must be <= MAX_UNPINNED_RESULT_SPOOLING_MEMORY in order to allow all the pinned data in the BufferedTupleStream to be unpinned. This is enforced in a new method in QueryOptions called 'ValidateQueryOptions'. Planner Changes: PlanRootSink.java now computes a full ResourceProfile if result spooling is enabled. The min mem reservation is bounded by the size of the read and write pages used by the BufferedTupleStream. The max mem reservation is bounded by 'MAX_PINNED_RESULT_SPOOLING_MEMORY'. The mem estimate is computed by estimating the size of the result set using stats. BufferedTupleStream Re-Factoring: For the most part, using a BufferedTupleStream outside an ExecNode works properly. 
However, some changes were necessary: * The message for the MAX_ROW_SIZE error is ExecNode specific. In order to fix this, this patch introduces the concept of an ExecNode 'label' which is a more generic version of an ExecNode 'id'. * The definition of TBackendResourceProfile lived in PlanNodes.thrift, it was moved to its own file so it can be used by DataSinks.thrift. * Modified BufferedTupleStream so it internally tracks how many bytes are unpinned (necessary for 'MAX_UNPINNED_RESULT_SPOOLING_MEMORY'). Metrics: * Added a few of the metrics mentioned in IMPALA-8825 to BufferedPlanRootSink. Specifically, added timers to track how much time is spent waiting in the BufferedPlanRootSink 'Send' and 'GetNext' methods. * The BufferedTupleStream in the SpillableRowBatchQueue exposes several BufferPool metrics such as number of reserved and unpinned bytes. Bug Fixes: * Fixed a bug in BufferedPlanRootSink where the MemPool used by the expression evaluators was not being cleared incrementally. * Fixed a bug where the inactive timer was not being properly updated in BufferedPlanRootSink. * Fixed a bug where RowBatch memory was not freed if BufferedPlanRootSink::GetNext terminated early because it could not handle requests where num_results < BATCH_SIZE. Testing: * Added new tests to test_result_spooling.py. * Updated errors thrown in spilling-large-rows.test. * Ran exhaustive tests. Change-Id: I10f9e72374cdf9501c0e5e2c5b39c13688ae65a9 Reviewed-on: http://gerrit.cloudera.org:8080/14039 Reviewed-by: Sahil Takiar <stakiar@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-08-24 00:38:50 +00:00
Attila Jeges	17749dbcfc	IMPALA-3307: Add support for IANA time-zone db Impala currently uses two different libraries for timestamp manipulations: boost and glibc. Issues with boost: - Time-zone database is currently hard coded in timezone_db.cc. Impala admins cannot update it without upgrading Impala. - Time-zone database is flat, therefore can’t track year-to-year changes. - Time-zone database is not updated on a regular basis. Issues with glibc: - Uses /usr/share/zoneinfo/ database which could be out of sync on some of the nodes in the Impala cluster. - Uses the host system’s local time-zone. Different nodes in the Impala cluster might use a different local time-zone. - Conversion functions take a global lock, which causes severe performance degradation. In addition to the issues above, the fact that /usr/share/zoneinfo/ and the hard-coded boost time-zone database are both in use is a source of inconsistency in itself. This patch makes the following changes: - Instead of boost and glibc, impalad uses Google's CCTZ to implement time-zone conversions. - Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to specify an HDFS/S3/ADLS path to a zip archive that contains the shared compiled IANA time-zone database. If the startup flag is set, impalad will use the specified time-zone database. Otherwise, impalad will use the default /usr/share/zoneinfo time-zone database. - Introduces a new startup flag (--hdfs_zone_alias_conf) to impalad to specify an HDFS/S3/ADLS path to a shared config file that contains definitions for non-standard time-zone aliases. - impalad reads the entire time-zone database into an in-memory map on startup for fast lookups. - The name of the coordinator node’s local time-zone is saved to the query context when preparing query execution. This time-zone is used whenever the current time-zone is referred afterwards in an execution node. - Adds a new ZipUtil class to extract files from a zip archive. 
The implementation is not vulnerable to Zip Slip. Cherry-picks: not for 2.x. Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Reviewed-on: http://gerrit.cloudera.org:8080/9986 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Attila Jeges <attilaj@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-06-22 13:18:58 +00:00
Bharath Vissapragada	3f2f008ac4	IMPALA-3552: Make incremental stats max serialized size configurable The fix "IMPALA-2648/IMPALA-2664" introduced a conservative limitation on the maximum serialized size of incremental stats. As a side effect, some users with very large tables are experiencing regressions especially when they upgrade impala and the serialized size goes beyond 200MB. To mitigate the issue, the change introduces a new gflag, 'inc_stats_size_limit_bytes' to make the max serialized size configurable, which allows impala users to specify their own maximum serialized size. Default value for inc_stats_size_limit_bytes is 200MB. The change introduces a TBackendGflags class to pass the gflags from backend to the Frontend and the Catalog via thrift. This also revamps existing query options to use the TBackendConfig. Change-Id: I33684725a61eabc67237503e61178305d37d3cb5 Reviewed-on: http://gerrit.cloudera.org:8080/4867 Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com> Tested-by: Internal Jenkins	2016-11-15 03:22:11 +00:00
Henry Robinson	19de09ab7d	IMPALA-4160: Remove Llama support. Alas, poor Llama! I knew him, Impala: a system of infinite jest, of most excellent fancy: we hath borne him on our back a thousand times; and now, how abhorred in my imagination it is! Done: * Removed QueryResourceMgr, ResourceBroker, CGroupsMgr * Removed untested 'offline' mode and NM failure detection from ImpalaServer * Removed all Llama-related Thrift files * Removed RM-related arguments to MemTracker constructors * Deprecated all RM-related flags, printing a warning if enable_rm is set * Removed expansion logic from MemTracker * Removed VCore logic from QuerySchedule * Removed all reservation-related logic from Scheduler * Removed RM metric descriptions * Various misc. small class changes Not done: * Remove RM flags (--enable_rm etc.) * Remove RM query options * Changes to RequestPoolService (see IMPALA-4159) * Remove estimates of VCores / memory from plan Change-Id: Icfb14209e31f6608bb7b8a33789e00411a6447ef Reviewed-on: http://gerrit.cloudera.org:8080/4445 Tested-by: Internal Jenkins Reviewed-by: Henry Robinson <henry@cloudera.com>	2016-09-20 23:50:43 +00:00
Dan Hecht	ffa7829b70	IMPALA-3918: Remove Cloudera copyrights and add ASF license header For files that have a Cloudera copyright (and no other copyright notice), make changes to follow the ASF source file header policy here: http://www.apache.org/legal/src-headers.html#headers Specifically: 1) Remove the Cloudera copyright. 2) Modify NOTICE.txt according to http://www.apache.org/legal/src-headers.html#notice to follow that format and add a line for Cloudera. 3) Replace or add the existing ASF license text with the one given on the website. Much of this change was automatically generated via: git grep -li 'Copyright.Cloudera' > modified_files.txt cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.Cloudera#i;' cat modified_files.txt | xargs fix_apache_license.py [1] Some manual fixups were performed following those steps, especially when license text was completely missing from the file. [1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor modification to ORIG_LICENSE to match Impala's license text. Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86 Reviewed-on: http://gerrit.cloudera.org:8080/3779 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-08-09 08:19:41 +00:00
Bharath Vissapragada	084b9b1692	IMPALA-2432: Add query endtime to impalad's lineage This commit adds query endtime to impalad's lineage log entries consumed by navigator. The lineage graph is constructed in the frontend and is then passed to the backend as a serialized thrift object. When the query terminates (includes cancellations and aborts), the backend appends the query endtime ("endTime") to the lineage graph and generates the lineage log entry in JSON format. Change-Id: I2236e98895ae9a159ad6e78b0e18e3622fdc3306 Reviewed-on: http://gerrit.cloudera.org:8080/934 Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com> Tested-by: Internal Jenkins	2015-11-04 08:39:12 +00:00
Martin Grund	384ae3ab08	Fixes for Toolchain Issues If a static version of zlib and bzip2 is picked up we assumed that it would be compiled with -fPIC. However, this is not always the case. Thus in the non-toolchain case we specifically dynamic link with zlib and bzip2 for the dynamic targets. In addition, this patch removes static linking of libgcc in the toolchain case as LLVM is not able to find the exception handling symbols even if they are present in the binary. Static linking of libgcc is postponed. Next, if Impala is built with -notests the external data source thrift files would not be generated. This patch makes sure the dependencies are expressed correctly. Finally, if a user has google perftools installed on the system we would accidentally pick up the system libraries and the thirdparty headers, which would end in linker errors. This patch fixes the path issues. Change-Id: Ic000101c33da26d75a0cd733f7ef02f1bd694937 Reviewed-on: http://gerrit.cloudera.org:8080/460 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-06-15 23:14:32 +00:00
Matthew Jacobs	fe87bb1563	Add MetricDefs, static definitions of metric metadata generated from json Adds a static definition of the metric metadata used by Impala. The metric names, descriptions, and other properties are defined in common/thrift/metrics.json file, and the generate_metrics.py script creates a thrift representation. The metric definitions are then available in a constant map which is used at runtime to instantiate metrics, looking them up in the map by the metric key. New metrics should be defined by adding an entry to the list of metrics in metrics.json with the following properties: key: The unique string identifying the metric. If the metric can be templated, e.g. rpc call duration, it may be a format string (in the format used by strings::Substitute()). description: A text description of the metric. May also be a format string. label: A brief title for the metric, not currently used by Impala but provided for external tools. units: The unit of the metric. Must be a valid value of TUnit. kind: The kind of metric, e.g. GAUGE or COUNTER. Must be a valid value of TMetricKind. contexts: The context in which this metric may be instantiated. Usually "IMPALAD", "STATESTORED", "CATALOGD", but may be a different kind of 'entity'. Not currently used by Impala but provided for modeling purposes for external tools. For example, adding the counter for the total number of queries run over the lifetime of the impalad process might look like: { "key": "impala-server.num-queries", "description": "The total number of queries processed.", "label": "Queries", "units": "UNIT", "kind": "COUNTER", "contexts": [ "IMPALAD" ] } TODO: Incorporate 'label' into the metrics debug page. TODO: Verify the context at runtime, e.g. verify 'contexts' contains, e.g. a DCHECK. After the metric definition is added, the generate_metrics.py script will generate the TMetricDefs.thrift that contains a TMetricDef for the metric definition. 
At runtime, the metric can be instantiated using the key defined in metrics.json. Gauges, Counters, and Properties are instantiated using static methods on MetricGroup. Other metric types are instantiated using static CreateAndRegister methods on their associated classes. TODO: Generate a thrift enum used to lookup metric defs. TODO: Consolidate the instantiation of metrics that are created outside of metrics.h (i.e. collection metrics, memory metrics). TODO: Need a better way to verify if metric definitions are missing. Change-Id: Iba7f94144d0c34f273c502ce6b9a2130ea8fedaa Reviewed-on: http://gerrit.cloudera.org:8080/330 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2015-05-14 21:27:28 +00:00
Martin Grund	b582cdc22b	IMPALA-1598: Adding Error Codes to Log Messages This patch introduces the concept of error codes for errors that are recorded in Impala and are going to be presented to the client. These error codes are used to aggregate and group incoming error / warning messages to reduce the spill on the shell and increase the usefulness of the messages. By splitting the message string from the implementation, it becomes possible to edit the string independently of the code and pave the way for internationalization. Error messages are defined as a combination of an enum value and a string. Both are defined in the Error.thrift file that is automatically generated using the script in common/thrift/generate_error_codes.py. The goal of the script is to have a central understandable repository of error messages. Adding new messages to this file will require rebuilding the thrift part. The proxy class ErrorMessage is responsible for representing an error and capturing the parameters that are used to format the error message string. When error messages are recorded they are recorded based on the following algorithm: - If an error message is of type GENERAL, do not aggregate this message and simply add it to the total number of messages - If an error message is of a specific type, record the first error message as a sample and for all other occurrences increment the count. - The coordinator will merge all error messages except the ones of type GENERAL and display a count. For example, in the case of the parquet file spanning multiple blocks the output will look like: Parquet files should not be split into multiple hdfs-blocks. file=hdfs://localhost:20500/fid.parq (1 of 321 similar) All messages are always logged to VLOG. In the coordinator error messages are merged across all backends to retain readability in the case of large clusters. 
The current version of this patch adds these new error codes to some of the most important error messages as a reference implementation. Change-Id: I1f1811631836d2dd6048035ad33f7194fb71d6b8 Reviewed-on: http://gerrit.cloudera.org:8080/39 Reviewed-by: Martin Grund <mgrund@cloudera.com> Tested-by: Internal Jenkins	2015-03-01 03:37:32 +00:00
Henry Robinson	6bc411c890	Add support for HS2 protocol V6 This patch adds support for V6 of the HS2 protocol, which notably includes columnar organisation of result sets. Clients that set their protocol version to < V6 will receive result sets in the traditional row orientation. The performance of fetches over HS2 goes up significantly as a result, since the V1 protocol had some pathologies in its deserialisation performance. Beeswax Row materialisation: 455ms, client processing time: 523ms HS2 V6: Row materialisation: 444ms, client processing time: 1.8s HS2 V1: Row materialisation: 585ms, client processing time: 15.9s (!) TODO: Add support for the CHAR datatype The following patch is also included: Fix wait-for-hiveserver2.py when Impala moves to HS2 V6 Due to HIVE-6050, older versions of Hive are not compatible with newer clients (even those that try to use old protocol versions). wait-for-hiveserver2.py uses HS2 to talk to the HiveServer2 service, but picks up the newer version from V6, and fails. This patch temporarily re-adds cli_service.thrift (renaming the Thrift service as LegacyTCLIService) only for wait-for-hiveserver2.py to use. As soon as Impala's thirdparty Hive moves to HS2 V6, we can get rid of this change. Change-Id: I2cbe884345ae7e772620b80a29b6574bd6532940 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4402 Tested-by: jenkins Reviewed-by: Henry Robinson <henry@cloudera.com>	2014-09-18 20:17:18 -07:00
Henry Robinson	8a33b1861b	Optionally dynamically link Impala executables This patch adds two new flags to make_impala.sh: -build_shared_libs: Impala libraries (excluding thirdparty ones) will be built as shared objects (.so), and linked dynamically. -build_static_libs: Impala libraries will be built as archive files (.a), and linked statically. This was the behaviour before this patch, and is still the default for make_impala.sh. The speedup from dynamic linking for a clean build is significant: make_impala.sh -clean -build_static_libs: 11m48.676s make_impala.sh -clean -build_shared_libs: 5m46.943s make_debug.sh now passes -build_shared_libs by default. make_[asan|release].sh still builds with statically linked libraries. All automated builds will be statically linked for now; we can move them to dynamic linking on a case-by-case basis. Change-Id: Icfd8101bf8e85cadd61d8995ae8864f8730297ea Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3828 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: jenkins	2014-08-17 12:44:05 -07:00
Matthew Jacobs	64f55f32fe	Refactor thrift for ext-data-source to generate only necessary structs ext-data-source only needs a small subset of the thrift structures, so this separates the dependencies between files so that just the necessary structs are generated for ext-data-source. Afterwards, we can remove extra maven dependencies which were using environment variables to get versions. While the environment variables work when building the pom, they are not propagated to dependencies so building fe/pom.xml ended up producing lots of warnings which are now gone. Change-Id: I267fe7bc7a54c3c21aad8c1ffce07cf1a1e07c5e Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3748 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: jenkins (cherry picked from commit 1f738962ccb7a34834decfe6cb27307ed4548870) Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3767	2014-08-05 11:33:46 -07:00
Nong Li	5d903efca3	ExecSummary The runtime profile as we present it is not very useful and I think the structure of it makes it hard to consume. This patch adds a new client facing schemed set of counters that are collected from the runtime profiles. For example, with this structure it would be easy to have the shell get the stats of a running query and print a useful progress report or to check the most relevant metrics for diagnosing issues. Here's an example of the output for one of the tpch queries:
Operator             #Hosts  Avg Time   Max Time   #Rows    Est. #Rows  Peak Mem  Est. Peak Mem  Detail
------------------------------------------------------------------------------------------------------------------------
09:MERGING-EXCHANGE  1       79.738us   79.738us   5        5           0         -1.00 B        UNPARTITIONED
05:TOP-N             3       84.693us   88.810us   5        5           12.00 KB  120.00 B
04:AGGREGATE         3       5.263ms    6.432ms    5        5           44.00 KB  10.00 MB       MERGE FINALIZE
08:AGGREGATE         3       16.659ms   27.444ms   52.52K   600.12K     3.20 MB   15.11 MB       MERGE
07:EXCHANGE          3       2.644ms    5.1ms      52.52K   600.12K     0         0              HASH(o_orderpriority)
03:AGGREGATE         3       342.913ms  966.291ms  52.52K   600.12K     10.80 MB  15.11 MB
02:HASH JOIN         3       2s165ms    2s171ms    144.87K  600.12K     13.63 MB  941.01 KB      INNER JOIN, BROADCAST
|--06:EXCHANGE       3       8.296ms    8.692ms    57.22K   15.00K      0         0              BROADCAST
|  01:SCAN HDFS      2       1s412ms    1s978ms    57.22K   15.00K      24.21 MB  176.00 MB      tpch.orders o
00:SCAN HDFS         3       8s032ms    8s558ms    3.79M    600.12K     32.29 MB  264.00 MB      tpch.lineitem l
Change-Id: Iaad4b9dd577c375006313f19442bee6d3e27246a Reviewed-on: http://gerrit.ent.cloudera.com:8080/2964 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins	2014-06-11 03:10:11 -07:00
Matthew Jacobs	25c0ebf58c	External Data Source: Public API Adds the thrift structures for the public external data source API and a new maven project containing the Java ExternalDataSource interface and the generated Java thrift classes. The ExternalDataSource.thrift structures can evolve in a backward compatible way. The ExternalDataSource Java interface will always contain a version number in the namespace (e.g. com.cloudera.impala.extdatasource.v1 for V1) so we can potentially make breaking changes to the interface in the future but still support older versions. A trivial implementation of the ExternalDataSource API is also added for testing purposes. TODO: Make the sample data source implementation realistic. Change-Id: I827d6420a87ed7a2bce34e050362ca98ddc5dbcc Reviewed-on: http://gerrit.ent.cloudera.com:8080/2241 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: jenkins (cherry picked from commit f29814e9ede9d4c889f2648606fcf511feeb47ae) Reviewed-on: http://gerrit.ent.cloudera.com:8080/2313	2014-04-22 18:34:48 -07:00
Nong Li	0d2919fe7f	Refactor scalar and aggregate function analysis and execution. This patch cleans up analysis and execution of scalar and aggregate functions so that there is no difference between how builtins and user functions are handled. The only difference is that the catalog is populated with the builtins all the time. The BE always gets a TFunction object and just executes it (builtins will have an empty hdfs file location). This removes the opcode registry and all of the functionality is subsumed by the catalog, most of which was already duplicated there anyway. This also introduces the concept of a system database: a database that the user cannot modify and that is populated automatically on startup. Change-Id: Iaa3f84dad0a1a57691f5c7d8df7305faf01d70ed Reviewed-on: http://gerrit.ent.cloudera.com:8080/1386 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1577	2014-02-18 18:40:08 -08:00
Alex Behm	dc7b398bd3	Impala reserves resources from YARN via Llama. Impala reserves resources from YARN via Llama and handles resource preemptions by cancelling affected queries. Adds the Impala Resource Broker for interacting with Llama. Refactors scheduler and coordinator to move fragment-to-host assignment logic into scheduler. Local test setup uses MiniLLama. Change-Id: Ic7b0fe43de52d30f4207b4e65cce7e6a294e54e1	2014-01-15 15:12:04 -08:00
Henry Robinson	51e58e1f3c	Statestore aesthetic cleanup * Statestore is now one word, without camelcase, everywhere. Previous names included StateStore, state-store and state_store, variously. The only exception is a couple of flags that have 'state_store', and can't be changed for compatibility reasons. * File names are also changed to reflect the standard naming. * Most comments are now 90 chars wide (from 80 before) Change-Id: I83b666c87991537f9b1b80c2f0ea70c2e0c07dcf Reviewed-on: http://gerrit.ent.cloudera.com:8080/1225 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: jenkins	2014-01-09 09:56:04 -08:00
Lenni Kuff	9d5b94baa5	CatalogServer follow-on code review changes Changes to address follow-on code review comments. This change consists mainly of: * Comment cleanup / clarification * Thrift struct consolidation * Minor naming changes * Small code fixes/changes, etc Change-Id: Idd03cc8adeb9c0d99744688a02f81a08135966de Reviewed-on: http://gerrit.ent.cloudera.com:8080/667 Tested-by: jenkins Reviewed-by: Lenni Kuff <lskuff@cloudera.com>	2014-01-08 10:53:42 -08:00
Lenni Kuff	bf139d1eba	Update catalogd to forward log4j log messages to glog Change-Id: I4620b77ba731e134a3e48883e8ae7ee3820ed584 Reviewed-on: http://gerrit.ent.cloudera.com:8080/612 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: jenkins	2014-01-08 10:53:12 -08:00
Lenni Kuff	a2cbd2820e	Add Catalog Service and support for automatic metadata refresh The Impala CatalogService manages the caching and dissemination of cluster-wide metadata. The CatalogService combines the metadata from the Hive Metastore, the NameNode, and potentially additional sources in the future. The CatalogService uses the StateStore to broadcast metadata updates across the cluster. The CatalogService also directly handles executing metadata update requests from impalad servers (DDL requests). It exposes a Thrift interface to allow impalads to connect directly and execute their DDL operations. The CatalogService has two main components - a C++ server that implements StateStore integration, Thrift service implementation, and exporting of the debug webpage/metrics. The other main component is the Java Catalog that manages caching and updating of all the metadata. For each StateStore heartbeat, a delta of all metadata updates is broadcast to the rest of the cluster. Some Notes On the Changes --- * The metadata is all sent as thrift structs. To do this all catalog objects (Tables/Views, Databases, UDFs) have thrift struct to represent them. These are sent with each statestore delta update. * The existing Catalog class has been separated into two separate sub-classes. An ImpaladCatalog and a CatalogServiceCatalog. See the comments on those classes for more details. What is working: * New CatalogService created * Working with statestore delta updates and latest UDF changes * DDL performed on Node 1 is now visible on all other nodes without a "refresh". * Each DDL operation against the Catalog Service will return the catalog version that contains the change. An impalad will wait for the statestore heartbeat that contains this version before returning from the DDL command. 
* All table types (Hbase, Hdfs, Views) getting their metadata propagated properly * Block location information included in CS updates and used by Impalads * Column and table stats included in CS updates and used by Impalads * Query tests are all passing Still TODO: * Directly return catalog object metadata from DDL requests * Poll the Hive Metastore to detect new/dropped/modified tables * Reorganize the FE code for the Catalog Service. I don't think we want everything in the same JAR. Change-Id: I8c61296dac28fb98bcfdc17361f4f141d3977eda Reviewed-on: http://gerrit.ent.cloudera.com:8080/601 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: Lenni Kuff <lskuff@cloudera.com>	2014-01-08 10:53:11 -08:00
Nong Li	af90c8a133	Fix memory usage tracking. Changes MemLimit to MemTracker: - the limit is optional - it also records a label and an optional parent - Consume() and Release() also update the ancestors and there's also a new AnyLimitExceeded(), which also checks the ancestors - the consumption counter is a HighwaterMarkCounter and can optionally be created as part of a profile Each fragment instance now has a MemTracker that is part of a 3-level hierarchy: process, query, fragment instance. Change-Id: I5f580f4956fdf07d70bd9a6531032439aaf0fd07 Reviewed-on: http://gerrit.ent.cloudera.com:8080/339 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: Nong Li <nong@cloudera.com>	2014-01-08 10:52:36 -08:00
Henry Robinson	90ed9f0ab8	Remove planservice	2014-01-08 10:50:20 -08:00
Henry Robinson	2ae20cbbb7	Statestore-2.0: New state-store implementation * API simplified to deal only with 'topics', not services and objects * Scalability improved: heartbeat loop is now multi-threaded * State-store can store arbitrary objects * State-store may send either deltas or complete topic state (delta computation to come)	2014-01-08 10:49:23 -08:00
Nong Li	6e293090e6	Parquet writer. Change-Id: I7117b545e3d3a7803a219234ad992040a6c7c4ec	2014-01-08 10:48:44 -08:00
Nong Li	868a99135a	Add network benchmark	2014-01-08 10:47:56 -08:00
Alan Choi	be98df19c8	HiveServer2 This patch implements the HiveServer2 API. We have tested it with Lenni's patch against the tpch workload. It has also been tested manually against Hive's beeline with queries and metadata operations. All of the HiveServer2 code is implemented in impala-hs2-server.cc. Beeswax code is refactored to impala-beeswax-server.cc. HiveServer2 has a few more metadata operations. These operations go through impala-hs2-server to ddl-executor and then to FE. The logics are implemented in fe/src/main/java/com/cloudera/impala/service/MetadataOp.java. Because of the Thrift union issue, I have to modify the generated c++ file. Therefore, all the HiveServer2 thrift generated c++ code are checked into be/src/service/hiveserver2/. Once the thrift issue is resolved, I'll remove these files. Change-Id: I9a8fe5a09bf250ddc43584249bdc87b6da5a5881	2014-01-08 10:47:24 -08:00
Henry Robinson	7ba437a52e	Code changes to build against thrift 0.9.0 in thirdparty/	2014-01-08 10:47:22 -08:00
Henry Robinson	986f3cddf6	Move sparrow/ to statestore/ and remove sparrow namespace	2014-01-08 10:47:12 -08:00
Nong Li	2289906a5a	Fix linker dependencies.	2014-01-08 10:46:56 -08:00
Henry Robinson	2f339f2ed8	Add ASL license to all public files	2014-01-08 10:46:32 -08:00
ishaan	05c65789bb	Change Copyrights from 2011 to 2012	2014-01-08 10:46:29 -08:00
Michael Ubell	ad46b98366	Add Kerberos authentication.	2014-01-08 10:45:10 -08:00
Marcel Kornacker	c004cdaa1c	Thrift structures for the new planner interface.	2014-01-08 10:44:47 -08:00
Marcel Kornacker	fb32d40b03	Switching to an asynchronous plan fragment exec interface; this entails: - making the coordinator asynchronous - renamed ImpalaBackendService to ImpalaInternalService; - new class ImpalaServer implements ImpalaService and ImpalaInternalService - renaming ImpalaInternalService fields to conform to c++ style - merged impala-service.{cc,h} and backend-service.{cc,h} into impala-server.{cc,h} - added TStatusCode field to Status.ErrorDetail - removed ImpalaInternalService.CloseChannel Also removed JdbcDriverTest.java	2014-01-08 10:44:15 -08:00
Kay Ousterhout	073e38d6c2	Added the StateStore, a centralized repository for soft state. The commit also adds the StateStoreSubscriber, a component that runs alongside each impalad and handles communication with the state store.	2012-07-13 09:26:16 -07:00
Alan Choi	f52286f72c	This completes the Beeswax implementation for ODBC. All the ODBC tests (CDH/hive-odbc-test) pass (except those with "create table" and "show table"). We should have nightly regression of the odbc test to run against impalad. There are still a few issues: 1. running with num_node > 0 crashes the coordinator; 2. work around for a few ODBC jiras 3. no test for bool/timestamp because ODBC doesn't support them. review: issue 110	2012-06-18 14:46:46 -07:00
Alan Choi	ef10afa439	This changes the Thrift from 0.6.1 to 0.7.0. Please uninstall the old thrift and download/install Thrift 0.7.0. Beeswax service now depends on Hive metastore; fix buildall.sh to clean generated-source in FE; fix .gitignore to clean generated-source in BE;	2012-06-14 18:21:08 -07:00
Alan Choi	7af87c7dea	Beeswax Service for Impala (partial implementation) review id: 82	2012-06-06 10:08:06 -07:00
Henry Robinson	3ff3559805	Add support for per-partition file formats to front end and backend. At the same time, this patch removes the partitionKeyRegex in favour of explicitly sending a list of literal expressions for each file path from the front end.	2012-06-05 12:00:09 -07:00
Marcel Kornacker	4a4a07fde7	A number of changes for the Jenkins build: - added option to run with derby metastore, based on whether env var METASTORE_IS_DERBY is set - removed hardwired file locations from planner tests - switching to linking statically against libthrift.a Also added script rebuild.sh, which contains the build steps of buildall.sh (against impala sources).	2012-03-08 16:19:47 -08:00
Nong Li	b410b62716	Add distributed profile counter for the BE.	2012-03-01 13:59:17 -08:00
Nong Li	88237350f0	Change the build to allow debug and release builds to coexist.	2012-02-17 18:14:04 -08:00
Nong Li	94db70c9fd	Fix build. Dependencies don't propagate right on first build.	2011-12-30 21:28:18 -08:00
Nong Li	c84fec38d3	- Move thrift out of FE src and into impala/common - Thrift files now build using cmake instead of mvn - Added cmake build to impala/ which drives the build process	2011-12-30 19:35:20 -08:00

59 Commits