04-05-2011, 10:40 AM
Abstract—
In recent years, ad-hoc parallel data processing has emerged as one of the killer applications for Infrastructure-as-a-Service (IaaS) clouds. Major cloud computing companies have started to integrate frameworks for parallel data processing into their product portfolios, making it easy for customers to access these services and to deploy their programs. However, the processing frameworks currently in use were designed for static, homogeneous cluster setups and disregard the particular nature of a cloud. Consequently, the allocated compute resources may be inadequate for big parts of the submitted job and unnecessarily increase processing time and cost. In this paper we discuss the opportunities and challenges for efficient parallel data processing in clouds and present our research project Nephele. Nephele is the first data processing framework to explicitly exploit the dynamic resource allocation offered by today's IaaS clouds for both task scheduling and execution. Particular tasks of a processing job can be assigned to different types of virtual machines, which are automatically instantiated and terminated during job execution. Based on this new framework, we perform extended evaluations of MapReduce-inspired processing jobs on an IaaS cloud system and compare the results to the popular data processing framework Hadoop.
1 INTRODUCTION
Today a growing number of companies have to process huge amounts of data in a cost-efficient manner. Classic representatives for these companies are operators of Internet search engines, like Google, Yahoo, or Microsoft. The vast amount of data they have to deal with every day has made traditional database solutions prohibitively expensive [5]. Instead, these companies have popularized an architectural paradigm based on a large number of commodity servers. Problems like processing crawled documents or regenerating a web index are split into several independent subtasks, distributed among the available nodes, and computed in parallel.

In order to simplify the development of distributed applications on top of such architectures, many of these companies have also built customized data processing frameworks. Examples are Google's MapReduce [9], Microsoft's Dryad [14], or Yahoo!'s Map-Reduce-Merge [6]. They can be classified by terms like high throughput computing (HTC) or many-task computing (MTC), depending on the amount of data and the number of tasks involved in the computation [20]. Although these systems differ in design, their programming models share similar objectives, namely hiding the hassle of parallel programming, fault tolerance, and execution optimizations from the developer. Developers can typically continue to write sequential programs. The processing framework then takes care of distributing the program among the available nodes and executes each instance of the program on the appropriate fragment of data.

For companies that only have to process large amounts of data occasionally, running their own data center is obviously not an option. Instead, cloud computing has emerged as a promising approach to rent a large IT infrastructure on a short-term, pay-per-usage basis.
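The division of labor described above can be illustrated with a minimal word-count sketch in the MapReduce style: the developer supplies two sequential functions, and a runtime (here simulated by a plain loop, not any framework's actual API) handles grouping and aggregation.

```python
from collections import defaultdict

# The developer writes only these two sequential functions; a real
# framework handles distribution, scheduling, and fault tolerance.

def map_fn(document):
    # Emit a (word, 1) pair for every word in the input fragment.
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Aggregate all partial counts for one key.
    return (word, sum(counts))

def run_job(documents):
    # Sequential stand-in for the distributed runtime: shuffle the
    # intermediate pairs by key, then reduce each group.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(run_job(["to be or not to be"]))
# → {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In a distributed deployment, the calls to `map_fn` run in parallel on the nodes holding the input fragments, and the shuffle step moves intermediate pairs across the network.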
Operators of so-called Infrastructure-as-a-Service (IaaS) clouds, like Amazon EC2 [1], let their customers allocate, access, and control a set of virtual machines (VMs) which run inside their data centers, and only charge them for the period of time the machines are allocated. The VMs are typically offered in different types, each type with its own characteristics (number of CPU cores, amount of main memory, etc.) and cost.

Since the VM abstraction of IaaS clouds fits the architectural paradigm assumed by the data processing frameworks described above, projects like Hadoop [25], a popular open source implementation of Google's MapReduce framework, have already begun to promote using their frameworks in the cloud [29]. Only recently, Amazon has integrated Hadoop as one of its core infrastructure services [2]. However, instead of embracing its dynamic resource allocation, current data processing frameworks rather expect the cloud to imitate the static nature of the cluster environments they were originally designed for. E.g., at the moment the types and number of VMs allocated at the beginning of a compute job cannot be changed in the course of processing, although the tasks the job consists of might have completely different demands on the environment. As a result, rented resources may be inadequate for big parts of the processing job, which may lower the overall processing performance and increase the cost.
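The alternative the paper argues for can be sketched as a job description in which each task declares its own VM requirements. This is a hypothetical illustration only (the class, field, and instance-type names are invented for this sketch, not Nephele's real API): a scheduler with such a plan could instantiate each set of VMs on demand and terminate them as soon as the corresponding task completes, instead of holding one fixed pool for the whole job.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    vm_type: str    # e.g. a cheap small instance vs. a compute-heavy one
    instances: int  # degree of parallelism for this task

# A three-stage job whose stages have different resource demands.
job = [
    Task("extract",   vm_type="m1.small",  instances=8),
    Task("transform", vm_type="c1.xlarge", instances=4),
    Task("load",      vm_type="m1.small",  instances=1),
]

def allocation_plan(job):
    # Map each task to the exact set of VMs it needs; a dynamic
    # scheduler would allocate these lazily and free them when the
    # task ends, so no VM type is held longer than necessary.
    return {t.name: [f"{t.vm_type}-{i}" for i in range(t.instances)]
            for t in job}

print(allocation_plan(job))
```

Under the static model criticized above, all three stages would instead run on whatever single pool of identical VMs was allocated at job start.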
Download full report
http://dsl.cs.uchicago.edu/TPDS_MTC/pape...1-0012.pdf