IT World has published an interview with Google's Vice President of Research Alfred Spector that provides details about Google IT infrastructure and strategy:
Google uses what is now termed "cloud computing" We have numerous clusters, each containing large numbers of computers. The clusters run a distributed computing infrastructure that uses Linux on each computer. All the computers are then tied together with high-performance networking and distributed computing software. For example, we have built and deployed a global file system called the Google File System that provides scalable, fault-tolerant storage; a record-oriented data storage system for tabular data called BigTable; and a computational programming model called MapReduce that allows our batch jobs to use the inherent parallelism in our clusters.
[As for] the exact number of machines, locations and clusterswe have, suffice it to say that we have so many individual elements in our fabric that an enormous amount of attention is paid to fault tolerance, because with so many elements operating, there are exceedingly frequent component failures. Could other companies emulate that kind of architecture? First, there really are economies of scale in running systems that can support many services on a common fabric. Second, relating to the services model we espouse, there are great simplifications to releasing software as a Web-based service, because services don't have to be tested and deployed on a large number of different customer environments. Instead, software can be released to a small number of machines in a more controlled cloud and then accessed by browsers.
A third benefit is that since a software service is a logically centralized notion, the history of interactions of very many users can be aggregated and thus be the basis for various types of self-learning systems. Google uses this concept to learn to correct spelling mistakes, but businesses can use similar notions to better meet the needs of employees or customers by learning, for example, of common errors, unfulfilled product searches, etc.
Jurriaan Persyn has provided an excellent summary of the presentation he gave at FOSDEM 2009 in Brussels, Belgium. FOSDEM is one of Europe's largest annual conference on open source software with about 5,000 attendees. He also provides the actual presentation slides used during his presentation.
Even when a traditional database engine is involved, there can be database-like code sitting in the application to extend the capabilities of the underlying database engine. Database sharding is a good example of this. In this approach, data is federated over a collection of cheap servers to increase scalability and performance. Typically the applications that use sharding have the code that distributes the data over the shards and combines the results from the shards within their application code. I've used similar techniques myself before most of the commercial database engines starting supporting partitioning and clustering natively. (Something that MySQL - which most of the sharding practitioners seem to use - has only just started to support.)
One interesting trend that I've noticed in many of the organizations that I've been into is that increasingly databases are being built to serve single applications. The early visions of databases shared amongst multiple applications is no longer the first choice. To a certain extent this has always been the case for certain operational systems, but now the reach of single application databases has grown. You'll even find data replicated across multiple multi-terabyte data warehouses to support different business intelligence solutions.
db geek attributes this trend to factors such as commodity hardware and the benefits of reducing the complexities created by multi-application databases.
Dan Pritchett has written an excellent blog detailing his experience with database sharding while working as an architect at eBay, Inc. and Rearden Commerce:
Lesson 1: Right Size Your Shards
Lesson 2: Use Math on Shard Counts
Lesson 3: Carefully Consider the Spread
Lesson 4: Plan for Exceeding Your Shards
Lesson 5: Shard Early and Often
You can read the details of each lesson in Dan's blog.
Presentations from the second Google Seattle Conference on Scalability have started to appear on YouTube.
For example, the technical lead on Google Maps for mobile dicussues the adapt- ing the Google Maps to work in a low bandwidth, high latency environment with a wide variety of networks and devices:
The project was very international and involved collaboration with teams in London, New York, Seattle, Tokyo, Beijing, and Cupertino.
Peter Wayner, an InfoWorld Test Center contributing editor, has conducted an interesting hands-on comparison of some of the leaders in cloud computing.
One interesting point that Wayner makes almost immediately in his report is that is that cloud computing, at least in the form of immediately availability of unlimited server capacity, does not deliver scalable applications:
After a few hours, the fog of hype starts to lift and it becomes apparent that the clouds are pretty much shared servers just as the Greek gods are filled with the same flaws as earthbound humans. Yes, these services let you pull more CPU cycles from thin air whenever demand appears, but they can't solve the deepest problems that make it hard for applications to scale gracefully. Many of the real challenges lie at the architectural level, and simply pouring more server cycles on the fire won't solve fundamental mistakes in design.
There are many benefits of cloud computing, especially the fact that hardware is now a commodity that is available on demand. However, there has been a general perception that using cloud computing providers such as Google or Amazon somehow magically delivers the same scalability as the applications provided by the these cloud computing providers. It should be remembered that providers such as Google and Amazon are also famous for their scalability because they shard their applications.
IBM ran a television campaign a few years ago that showed a development team monitoring a newly launched Web-based application that were initially happy with the growing number of users until they realized their application would not scale. It seems that UK cellular company O2’s staff missed the IBM campaign. 3G iPhone pre-orders have crashed O2’s Web site. The scalability problem could have been anticipated. O2 had taken 200,000 email addresses of people interested in purchasing the 3G iPhone, so they should have anticipated the potential load on the order processing application when they notified so many people that they were ready to take orders. O2’s official response was that: "Though O2 had invested several million pounds to increase the order capacity of the site, at times the site still couldn't process the sheer weight of demand." . The result is frustrated customers and negative press coverage in many UK newspapers. Would you trust your business communications to a company that can not scale a Web application? Can O2 be trusted to scale mail servers?
It would be interesting to find out what really happened.....
Poornima, a database engineer at mint.com with experience in performance and scalability monitoring and testing, wrote a very interesting blog yesterday about addressing Web application scalability before it is too late. Poornima's starts by making a point that is often overlooked by engineers - the starting point for scalability is 'a business perspective' - because scalability failures are business failures and because scalability requirements are business requirements.
Poornima provides a simple explanation of why Database Sharding works so well for database-driven Web applications:
I’m sure we’ve all learned from our intro computer architecture class that CPU bound processes are the fastest and can be parallelized, whereas I/O processes are the bottleneck. In the case of a website, accessing the DB is the slowest I/O process. However, you can speed up access to data by sharding the database. Sharding breaks up a large database into smaller pieces that contains redundant information or a parent db can map data to separate dbs.
Poornima concludes that scalability problems are often good - because they indicate a growing business - but you should plan for scalability from the start.