
Cloud Computing: Concepts, Technologies and Business Implications
B. Ramamurthy & K. Madurai
[email protected] & [email protected]
Wipro, Chennai, 2011
This talk is partially supported by National Science Foundation grants DUE #0920335 and OCI #1041280.

Outline of the talk
• Introduction to the cloud context
  o Technology context: multi-core, virtualization, 64-bit processors, parallel computing models, big-data storage…
  o Cloud models: IaaS (Amazon AWS), PaaS (Microsoft Azure), SaaS (Google App Engine)
• Demonstration of cloud capabilities
  o Cloud models
  o Data and computing models: MapReduce
  o Graph processing using Amazon Elastic MapReduce
• A case study of a real business application of the cloud
• Questions and answers

Speakers' Background in Cloud Computing
• Bina:
  o Has two current NSF (National Science Foundation, USA) awards related to cloud computing:
  o 2009-2012: Data-Intensive Computing Education, CCLI Phase 2: $250K
  o 2010-2012: Cloud-enabled Evolutionary Genetics Testbed, OCI-CI-TEAM: $250K
  o Faculty at the CSE department, University at Buffalo.
• Kumar:
  o Principal Consultant at CTG
  o Currently heading a large semantic technology business initiative that leverages cloud computing
  o Adjunct Professor at the School of Management, University at Buffalo.

Introduction: A Golden Era in Computing
[Diagram: converging enablers of the cloud: powerful multi-core processors, general-purpose graphics processors, superior software methodologies, virtualization leveraging the powerful hardware, wider bandwidth for communication, proliferation of devices, and an explosion of domain applications.]

Cloud Concepts, Enabling Technologies, and Models: The Cloud Context

Evolution of Internet Computing
[Diagram: a timeline of web capability vs. scale: publish, inform, interact, integrate, transact, discover (semantic discovery), automate (intelligence and discovery), moving through social media and networking, data-intensive HPC and the cloud, the deep web, and data marketplaces and analytics.]

Top Ten Largest Databases
[Bar chart: top ten largest databases (2007), in terabytes (0-7000): LOC, CIA, Amazon, YouTube, ChoicePoint, Sprint, Google, AT&T, NERSC, Climate. Ref: http://www.focus.com/fyi/operations/10-largest-databases-in-the-world/]

Challenges
• Alignment with the needs of the business / user / non-computer specialists / community and society
• Need to address the scalability issue: large-scale data, high-performance computing, automation, response time, rapid prototyping, and rapid time to production
• Need to effectively address (i) the ever-shortening cycle of obsolescence, (ii) heterogeneity, and (iii) rapid changes in requirements
• Transform data from diverse sources into intelligence and deliver that intelligence to the right people/users/systems
• What about providing all this in a cost-effective manner?

Enter the cloud
• Cloud computing is Internet-based computing, whereby shared resources, software and information are provided to computers and other devices on demand, like the electricity grid.
• Cloud computing is the culmination of numerous attempts at large-scale computing with seamless access to virtually limitless resources:
  o on-demand computing, utility computing, ubiquitous computing, autonomic computing, platform computing, edge computing, elastic computing, grid computing, …
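The electricity-grid analogy can be made concrete with a back-of-the-envelope billing sketch in Python. The rates below are hypothetical placeholders for illustration only, not any provider's actual prices:

    # Hypothetical utility-style rates (placeholders, not real provider prices)
    DOLLARS_PER_CPU_HOUR = 0.10
    DOLLARS_PER_GB_MONTH = 0.025   # storage held per month
    DOLLARS_PER_GB_EGRESS = 0.09   # data transferred out

    def monthly_bill(instances, hours_per_month, storage_gb, egress_gb):
        """Pay-as-you-go: the bill tracks metered usage, like a utility."""
        compute = instances * hours_per_month * DOLLARS_PER_CPU_HOUR
        storage = storage_gb * DOLLARS_PER_GB_MONTH
        network = egress_gb * DOLLARS_PER_GB_EGRESS
        return compute + storage + network

    # 4 instances running all month (~730 h), 500 GB stored, 200 GB served out:
    print(f"${monthly_bill(4, 730, 500, 200):,.2f}")   # $322.50

The point of the model is that cost scales with consumption rather than with owned capacity, which is what distinguishes renting cloud resources from buying servers.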
Grid Technology: a slide from my presentation to industry (2005)
• Emerging enabling technology.
• Natural evolution of distributed systems and the Internet.
• Middleware supporting networks of systems to facilitate sharing, standardization and openness.
• Infrastructure and application model dealing with the sharing of compute cycles, data, storage and other resources.
• Publicized by prominent industries as on-demand computing, utility computing, etc.
• A move towards delivering "computing" to the masses, similar to other utilities (electricity and voice communication).
• Now, hmmm… that sounds like the definition of cloud computing!

It is a changed world now…
• Explosive growth in applications: biomedical informatics, space exploration, business analytics, Web 2.0 social networking (YouTube, Facebook)
• Extreme-scale content generation: the e-science and e-business data deluge
• Extraordinary rate of digital content consumption (digital gluttony): Apple iPhone, iPad, Amazon Kindle
• Exponential growth in compute capabilities: multi-core, storage, bandwidth, virtual machines (virtualization)
• Very short cycle of obsolescence in technologies: Windows Vista to Windows 7; Java versions; C#; Python
• Newer architectures: web services, persistence models, distributed file systems/repositories (Google, Hadoop), multi-core, wireless and mobile
• Diverse knowledge and skill levels of the workforce
• You simply cannot manage this complex situation with your traditional IT infrastructure.

Answer: Cloud Computing?
• Typical requirements and models:
  o platform (PaaS),
  o software (SaaS),
  o infrastructure (IaaS),
  o services-based application programming interface (API)
• A cloud computing environment can provide one or more of these requirements for a cost
• Pay-as-you-go business model
• When using a public cloud, the model is more like renting a property than owning one.
• An organization could also maintain a private cloud and/or use both.

Enabling Technologies
[Diagram: the cloud stack: cloud applications (data-intensive, compute-intensive, storage-intensive) over a web-services interface (web services, SOA, WS standards), virtual machines (VM0, VM1, …, VMn), storage models (S3, BigTable, BlobStore, …), virtualization (bare metal, hypervisor, …), multi-core architectures, and 64-bit processors, connected by bandwidth.]

Common Features of Cloud Providers
• Development environment: IDE, SDK, plugins
• Production environment
• Simple storage, table store, drives
• Accessible through web services
• Management console, monitoring tools and multi-level security

Windows Azure
• Enterprise-level on-demand capacity builder
• A fabric of cycles and storage available on request for a cost
• You have to use the Azure API to work with the infrastructure offered by Microsoft
• Significant features: web role, worker role, blob storage, table and drive storage

Amazon EC2
• Amazon EC2 is one large, complex web service.
• EC2 provides an API for instantiating computing instances with any of the supported operating systems.
• It can facilitate computations through Amazon Machine Images (AMIs) for various other models.
• Signature features: S3, Cloud Management Console, MapReduce Cloud, Amazon Machine Image (AMI)
• Excellent distribution, load balancing and cloud monitoring tools
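The talk predates today's SDKs, but as a hedged sketch of what "an API for instantiating computing instances" looks like in practice, the following uses the modern boto3 Python SDK. The AMI ID is a hypothetical placeholder, and AWS credentials are assumed to be configured in the environment:

    import boto3  # AWS SDK for Python (assumed installed and configured)

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Launch a single small instance from a (hypothetical) machine image.
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # hypothetical AMI ID
        InstanceType="t2.micro",
        MinCount=1,
        MaxCount=1,
    )
    instance_id = response["Instances"][0]["InstanceId"]
    print("Launched instance:", instance_id)

    # Pay-as-you-go: release the capacity (and stop paying) when done.
    ec2.terminate_instances(InstanceIds=[instance_id])

The launch/terminate pair is the essence of elastic capacity: compute is acquired and released programmatically rather than provisioned as hardware.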
Google App Engine
• This is more of a web interface for a development environment: a one-stop facility for the design, development and deployment of applications in Java, Go and Python.
• Google offers reliability, availability and scalability on par with Google's own applications.
• The interface is software-programming based.
• A comprehensive programming platform for applications of any size (small or large)
• Signature features: templates and appspot, excellent monitoring and management console

Demos
• Amazon AWS: EC2 & S3 (among the many infrastructure services)
  o Linux machine
  o Windows machine
  o A three-tier enterprise application
• Google App Engine
  o Eclipse plug-in for GAE
  o Development and deployment of an application
• Windows Azure
  o Storage: blob store/container
  o MS Visual Studio Azure development and production environment

Cloud Programming Models

The Context: Big Data
• Data mining the huge amounts of data collected in a wide range of domains, from astronomy to healthcare, has become essential for planning and performance.
• We are in a knowledge economy.
  o Data is an important asset to any organization.
  o Discovery of knowledge; enabling discovery; annotation of data
  o Complex computational models
  o No single environment is good enough: we need elastic, on-demand capacity.
• We are looking at newer
  o programming models, and
  o supporting algorithms and data structures.

Google File System
• The Internet introduced a new challenge in the form of web logs and web crawler data: large-scale, "peta-scale" data.
• But observe that this type of data has a uniquely different characteristic from your transactional or "customer order" data: it is "write once, read many" (WORM). Examples:
  o privacy-protected healthcare and patient information;
  o historical financial data;
  o other historical data.
• Google exploited this characteristic in its Google File System (GFS).

What is Hadoop?
• At Google, MapReduce operations are run on a special file system called the Google File System (GFS) that is highly optimized for this purpose.
• GFS is not open source.
• Doug Cutting and others at Yahoo! implemented an open-source counterpart of GFS and called it the Hadoop Distributed File System (HDFS).
• The software framework that supports HDFS, MapReduce and other related entities is called the Hadoop project, or simply Hadoop.
• It is open source and distributed by Apache.

Fault Tolerance
• Failure is the norm rather than the exception.
• An HDFS instance may consist of thousands of server machines, each storing part of the file system's data.
• Since there are a huge number of components, and each component has a non-trivial probability of failure, some component is always non-functional.
• Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

HDFS Architecture
[Diagram: a Namenode holds the metadata (file name, replica count, e.g. /home/foo/data, 6, …) and serves clients' metadata ops; Datanodes spread across racks (Rack 1, Rack 2) hold the blocks, serve clients' block-level reads and writes, and replicate blocks among themselves.]
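To make the block and replica mechanics above concrete, here is a toy Python model (not Hadoop code) of how a file is split into 128 MB blocks and each block is given three replicas, roughly following HDFS's default policy of placing one replica on a "local" rack and two on distinct nodes of a remote rack. The cluster layout is invented for illustration:

    import math
    import random

    BLOCK_SIZE = 128 * 1024 * 1024  # HDFS's default block size: 128 MB

    # Toy cluster: two racks with three datanodes each (hypothetical names).
    RACKS = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5", "dn6"]}

    def place_block():
        """Pick 3 replica locations, roughly like HDFS's default policy:
        one node on a 'local' rack, two distinct nodes on a remote rack."""
        local_rack, remote_rack = random.sample(list(RACKS), 2)
        first = random.choice(RACKS[local_rack])
        second, third = random.sample(RACKS[remote_rack], 2)
        return [f"{local_rack}/{first}",
                f"{remote_rack}/{second}",
                f"{remote_rack}/{third}"]

    def split_and_place(file_size_bytes):
        """Split a file into fixed-size blocks and place each block's replicas."""
        n_blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
        return {f"blk_{i}": place_block() for i in range(n_blocks)}

    # A 1 GB file becomes 8 blocks, each with 3 replicas spread over both racks.
    for block, nodes in split_and_place(1024 ** 3).items():
        print(block, "->", nodes)

Spreading replicas across racks is what lets HDFS survive the failure of a whole rack, which is why "failure is the norm" can be an architectural assumption rather than a catastrophe.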
Hadoop Distributed File System
[Diagram: an HDFS client application talks to the HDFS server (master node with name nodes); the local file system uses small blocks (e.g. 2 KB), while HDFS uses large 128 MB blocks, replicated across nodes.]

What is MapReduce?
• MapReduce is a programming model Google has used successfully in processing its "big data" sets (~20 petabytes per day).
  o A map function extracts some intelligence from raw data.
  o A reduce function aggregates, according to some guide, the data output by the map.
  o Users specify the computation in terms of a map and a reduce function.
  o The underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, and
  o also handles machine failures, efficient communication, and performance issues.
• Reference: Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113.

Classes of Problems that are "MapReduce-able"
• Benchmark for comparison: Jim Gray's challenge on data-intensive computing. Ex: "Sort"
• Google uses it for word count, AdWords, PageRank and indexing data.
• Simple algorithms such as grep, text indexing and reverse indexing
• Bayesian classification: the data mining domain
• Facebook uses it for various operations: demographics
• Financial services use it for analytics
• Astronomy: Gaussian analysis for locating extraterrestrial objects
• Expected to play a critical role in the semantic web and in Web 3.0

MapReduce Data Flow
[Diagram: large-scale data is split; map tasks emit <key, value> pairs that are parsed/hashed and routed to reducers (say, Count), which write partitioned outputs: P-0000 (count1), P-0001 (count2), P-0002 (count3).]

MapReduce Engine
• MapReduce requires a distributed file system and an engine that can distribute, coordinate, monitor and gather the results.
• Hadoop provides that engine through the file system we discussed earlier and the JobTracker + TaskTracker system.
• The JobTracker is simply a scheduler.
• A TaskTracker is assigned a Map or Reduce task (or other operations); the Map or Reduce task runs on a node, and so does the TaskTracker; each task is run in its own JVM on a node.
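As a concrete illustration of the model (and of the word count demo on the next slide), here is a minimal sketch of word count written as two Hadoop Streaming-style Python 3 scripts. The file names mapper.py and reducer.py and the local test pipeline are illustrative, not part of the original talk:

    # mapper.py: emit a <word, 1> pair for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.strip().lower().split():
            print(f"{word}\t1")

    # reducer.py: sum the counts for each word
    # (the framework delivers the map output sorted by key)
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

The pair can be tested on one machine with a shell pipeline such as cat speeches.txt | python mapper.py | sort | python reducer.py, which mimics the map, shuffle/sort and reduce phases; on a cluster, Hadoop runs many copies of each script in parallel and handles the sorting, routing and failure recovery itself.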
Demos
• Word count application: a simple foundation for text mining, with a small text corpus of inaugural speeches by US presidents
• Graph analytics, the core of analytics involving linked structures (about 110 nodes): shortest path

A Case Study in Business: Cloud Strategies

Predictive Quality Project Overview
Problem / Motivation:
• Identify special causes that relate to bad outcomes for the quality-related parameters of the products and visually inspected defects
• Complex upstream process conditions and dependencies make the problem difficult to solve using traditional statistical/analytical methods
• Determine the optimal process settings that can increase yield and reduce defects through predictive quality assurance
• Potential savings are huge, as the costs of rework and rejects are very high
Solution:
• Use an ontology to model the complex manufacturing processes, and utilize semantic technologies to provide key insights into how outcomes and causes are related
• Develop a rich Internet application that allows the user to evaluate process outcomes and conditions at a high level and drill down to specific areas of interest to address performance issues

Why Cloud Computing for this Project
• Well suited for the incubation of new technologies
  o Semantic technologies are still evolving
  o Use of prototyping and extreme programming
  o Server and storage requirements not completely known
• Technologies used (TopBraid, Tomcat) are not part of the emerging or core technologies supported by corporate IT
• Scalability on demand
• Development and implementation on a private cloud

Public Cloud vs. Private Cloud
Rationale for a private cloud:
• Security and privacy of business data was a big concern
• Potential for vendor lock-in
• SLAs required for real-time performance and reliability
• Cost savings of the shared model achieved because of the multiple projects involving semantic technologies that the company is actively developing

Cloud Computing for the Enterprise: What Should IT Do?
• Revise the cost model to utility-based computing: CPU/hour, GB/day, etc.
• Include hidden costs for management and training
• Evaluate different cloud models for different applications
• Use the cloud for prototyping applications, and learn
• Link it to current strategic plans for Service-Oriented Architecture, Disaster Recovery, etc.

References & Useful Links
• Amazon AWS: http://aws.amazon.com/free/
• AWS Cost Calculator: http://calculator.s3.amazonaws.com/calc5.html
• Windows Azure: http://www.azurepilot.com/
• Google App Engine (GAE): http://code.google.com/appengine/docs/whatisgoogleappengine.html
• Graph Analytics: http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/content/Lin_Schatz_MLG2010.pdf
• For miscellaneous information: http://www.cse.buffalo.edu/~bina

Summary
• We illustrated cloud concepts and demonstrated cloud capabilities through simple applications.
• We discussed the features of the Hadoop Distributed File System and MapReduce for handling big data sets.
• We also explored some real business issues in the adoption of the cloud.
• The cloud is indeed an impactful technology that is sure to transform computing in business.