IT World has published an interview with Google's Vice President of Research Alfred Spector that provides details about Google IT infrastructure and strategy:
Google uses what is now termed "cloud computing" We have numerous clusters, each containing large numbers of computers. The clusters run a distributed computing infrastructure that uses Linux on each computer. All the computers are then tied together with high-performance networking and distributed computing software. For example, we have built and deployed a global file system called the Google File System that provides scalable, fault-tolerant storage; a record-oriented data storage system for tabular data called BigTable; and a computational programming model called MapReduce that allows our batch jobs to use the inherent parallelism in our clusters.
[As for] the exact number of machines, locations and clusterswe have, suffice it to say that we have so many individual elements in our fabric that an enormous amount of attention is paid to fault tolerance, because with so many elements operating, there are exceedingly frequent component failures. Could other companies emulate that kind of architecture? First, there really are economies of scale in running systems that can support many services on a common fabric. Second, relating to the services model we espouse, there are great simplifications to releasing software as a Web-based service, because services don't have to be tested and deployed on a large number of different customer environments. Instead, software can be released to a small number of machines in a more controlled cloud and then accessed by browsers.
A third benefit is that since a software service is a logically centralized notion, the history of interactions of very many users can be aggregated and thus be the basis for various types of self-learning systems. Google uses this concept to learn to correct spelling mistakes, but businesses can use similar notions to better meet the needs of employees or customers by learning, for example, of common errors, unfulfilled product searches, etc.
Jurriaan Persyn has provided an excellent summary of the presentation he gave at FOSDEM 2009 in Brussels, Belgium. FOSDEM is one of Europe's largest annual conference on open source software with about 5,000 attendees. He also provides the actual presentation slides used during his presentation.
Even when a traditional database engine is involved, there can be database-like code sitting in the application to extend the capabilities of the underlying database engine. Database sharding is a good example of this. In this approach, data is federated over a collection of cheap servers to increase scalability and performance. Typically the applications that use sharding have the code that distributes the data over the shards and combines the results from the shards within their application code. I've used similar techniques myself before most of the commercial database engines starting supporting partitioning and clustering natively. (Something that MySQL - which most of the sharding practitioners seem to use - has only just started to support.)
One interesting trend that I've noticed in many of the organizations that I've been into is that increasingly databases are being built to serve single applications. The early visions of databases shared amongst multiple applications is no longer the first choice. To a certain extent this has always been the case for certain operational systems, but now the reach of single application databases has grown. You'll even find data replicated across multiple multi-terabyte data warehouses to support different business intelligence solutions.
db geek attributes this trend to factors such as commodity hardware and the benefits of reducing the complexities created by multi-application databases.
Dan Pritchett has written an excellent blog detailing his experience with database sharding while working as an architect at eBay, Inc. and Rearden Commerce:
Lesson 1: Right Size Your Shards
Lesson 2: Use Math on Shard Counts
Lesson 3: Carefully Consider the Spread
Lesson 4: Plan for Exceeding Your Shards
Lesson 5: Shard Early and Often
You can read the details of each lesson in Dan's blog.
Sharon Perl of Google and Margo Seltzer of Harvard University and Oracle Corporation presented a paper one year ago today on Data Management for Internet-Scale Single-Sign-On at the 3rd USENIX Workshop on Real, Large Distributed Systems in Seattle. The paper explains in detail the Database Sharding strategy used for the single-sign-on service:
The SSO service maps usernames to user account data and services to service-specific data. These mappings are stored in the SSO database, which is partitioned into hundreds of pieces (called shards) for load balancing and data localization. Each shard is a replicated Berkeley DB database composed of between 5 and 15 replicas, depending on the shard's purpose. The SSO data in each replica is stored in a single Berkeley DB Btree database 2.
Smaller shards have five full replicas, any of which is capable of becoming a master. All updates must go to the master. Consistent reads must also go to the master. We sometimes allow ``stale reads'', which may be slightly out-of-date by an amount of time that we control, and which can be performed at non-master replicas. The larger replication groups typically have five replicas capable of becoming masters (``electable replicas'') plus additional read-only replicas. Read-only replicas receive updates from the master, but do not participate in elections or contribute to commit quorums for updates, so the number of read-only replicas and their distance from other replicas does not affect the latency or availability of operations. When the system is running well (the normal case) the state of read-only replicas will be fairly closely synchronized with the master. A shard can have a master as long as more than half its electable replicas are up and communicating.
We spread replicas across multiple, geographically distributed data centers for availability in the face of failures of machines, networks, or data centers. At the same time, we try to keep replicas within a shard fairly close to one another because the communication latency between replicas affects how long it takes to commit a write operation to a shard or to elect a new master. The set of shards is geographically dispersed for data locality. We try to assign new users to shards based on where their data is likely to be accessed. This becomes tricky when the user data is shared by a variety of services that also may be spread over geographically dispersed data centers. We could do more optimization of data placement than we currently do, however it has not turned out to be a high priority for system performance.
As illustrated in Figure 1, there are logically two different kinds of shards. The vast majority of shards are independent databases that map a set of userids to account data and service ids to user-independent service data. The remaining shards implement the ID-map, which maps usernames to userids and userids to shards.
The ID-map is used for login, e-mail delivery, and at other times when we need to find a user's account data given a username. The ID-map shards are chained together in a doubly-linked list to store an extensible map, for scalability. Each shard in the chain handles a sub-range of the key space. Adjacent shards store adjacent ranges. Client library code keeps hints for the names of the component shards of the ID-map and their corresponding key ranges, so that we do not have to traverse the list for each key access. If the keys get rebalanced among the shards (which they can using offline tools), clients of the storage system will notice the changes and adjust their cached locations.