Systems compared include:

Impala - a Hive-compatible* SQL engine with its own MPP-like execution engine.
Stinger/Tez - Tez is a next generation Hadoop execution engine currently in development (v0.2.0).

This remains a work in progress and will evolve to include additional frameworks and new capabilities.

This benchmark is not intended to provide a comprehensive overview of the tested platforms. We are aware that by choosing default configurations we have excluded many optimizations. The choice of a simple storage format, compressed SequenceFile, omits optimizations included in columnar formats such as ORCFile and Parquet. For now, we've targeted a simple comparison between these systems with the goal that the results are understandable and reproducible.

This benchmark measures response time on a handful of relational queries: scans, aggregations, joins, and UDFs, across different data sizes. Keep in mind that these systems have very different sets of capabilities. MapReduce-like systems (Shark/Hive) target flexible and large-scale computation, supporting complex User Defined Functions (UDFs), tolerating failures, and scaling to thousands of nodes. Traditional MPP databases are strictly SQL compliant and heavily optimized for relational queries. The workload here is simply one set of queries that most of these systems can complete.

We changed the Hive configuration from Hive 0.10 on CDH4 to Hive 0.12 on HDP 2.0.6. As a result, direct comparisons between the current and previous Hive results should not be made. It is difficult to account for changes resulting from modifications to Hive as opposed to changes in the underlying Hadoop distribution. We have added Tez as a supported platform. It is important to note that Tez is currently in a preview state. Hive has improved its query optimization, which is also inherited by Shark. This set of queries does not test the improved optimizer. We have changed the underlying filesystem from Ext3 to Ext4 for Hive, Tez, Impala, and Shark benchmarking.

This work builds on the benchmark developed by Pavlo et al. In particular, it uses the schema and queries from that benchmark. However, results obtained with this software are not directly comparable with results in the Pavlo et al. paper, because we use different data sets, a different data generator, and have modified one of the queries (query 4).

Our dataset and queries are inspired by the benchmark contained in a comparison of approaches to large scale analytics. The input data set consists of a set of unstructured HTML documents and two SQL tables which contain summary information. It was generated using Intel's Hadoop benchmark tools and data sampled from the Common Crawl document corpus. There are three datasets, each with its own schema. For Impala, Hive, Tez, and Shark, this benchmark uses the m2.4xlarge EC2 instance type.
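To make the measurement idea concrete (response time for scans, aggregations, joins, and UDFs across different data sizes), here is a minimal sketch of a timing harness. It is not the benchmark's own tooling: it runs against an in-memory SQLite database rather than any of the tested systems, the `pages` and `visits` tables and their columns are made up for illustration and are not the benchmark's schemas, and reporting the median of several trials is an assumption about how one might summarize the timings.

```python
import sqlite3
import statistics
import time


def build_toy_database(num_pages=1000, visits_per_page=5):
    """Create an in-memory SQLite database with two small, made-up tables."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema, chosen only for this sketch.
    conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, score INTEGER)")
    conn.execute("CREATE TABLE visits (url TEXT, revenue REAL)")
    conn.executemany(
        "INSERT INTO pages VALUES (?, ?)",
        ((f"page{i}", i % 100) for i in range(num_pages)),
    )
    conn.executemany(
        "INSERT INTO visits VALUES (?, ?)",
        (
            (f"page{i}", float(j + 1))
            for i in range(num_pages)
            for j in range(visits_per_page)
        ),
    )
    # A trivial user-defined function, standing in for the UDF query class.
    conn.create_function("url_length", 1, lambda url: len(url))
    conn.commit()
    return conn


# One illustrative query per class named in the text.
QUERIES = {
    "scan": "SELECT url, score FROM pages WHERE score > 50",
    "aggregation": "SELECT url, SUM(revenue) FROM visits GROUP BY url",
    "join": (
        "SELECT p.url, p.score, SUM(v.revenue) "
        "FROM pages p JOIN visits v ON p.url = v.url "
        "GROUP BY p.url, p.score"
    ),
    "udf": (
        "SELECT url_length(url), COUNT(*) FROM pages "
        "GROUP BY url_length(url)"
    ),
}


def time_queries(conn, queries, trials=5):
    """Return the median wall-clock response time in seconds per query."""
    medians = {}
    for name, sql in queries.items():
        samples = []
        for _ in range(trials):
            start = time.perf_counter()
            conn.execute(sql).fetchall()  # fetch all rows so the query fully runs
            samples.append(time.perf_counter() - start)
        medians[name] = statistics.median(samples)
    return medians


if __name__ == "__main__":
    for size in (1000, 10000):  # two data sizes, chosen arbitrarily
        conn = build_toy_database(num_pages=size)
        for name, seconds in time_queries(conn, QUERIES).items():
            print(f"pages={size:>6} {name:<12} {seconds * 1000:.2f} ms")
        conn.close()
```

Absolute numbers from a sketch like this say nothing about the systems discussed here. It only illustrates the shape of the measurement: a fixed query set, several data sizes, and repeated trials per query.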