IT World has published an interview with Google's Vice President of Research Alfred Spector that provides details about Google's IT infrastructure and strategy:
Google uses what is now termed "cloud computing." We have numerous clusters, each containing large numbers of computers. The clusters run a distributed computing infrastructure that uses Linux on each computer. All the computers are then tied together with high-performance networking and distributed computing software. For example, we have built and deployed a global file system called the Google File System that provides scalable, fault-tolerant storage; a record-oriented data storage system for tabular data called BigTable; and a computational programming model called MapReduce that allows our batch jobs to use the inherent parallelism in our clusters.
[As for] the exact number of machines, locations and clusters we have, suffice it to say that we have so many individual elements in our fabric that an enormous amount of attention is paid to fault tolerance, because with so many elements operating, there are exceedingly frequent component failures. Could other companies emulate that kind of architecture? First, there really are economies of scale in running systems that can support many services on a common fabric. Second, relating to the services model we espouse, there are great simplifications to releasing software as a Web-based service, because services don't have to be tested and deployed on a large number of different customer environments. Instead, software can be released to a small number of machines in a more controlled cloud and then accessed by browsers.
A third benefit is that since a software service is a logically centralized notion, the history of interactions of very many users can be aggregated and thus be the basis for various types of self-learning systems. Google uses this concept to learn to correct spelling mistakes, but businesses can use similar notions to better meet the needs of employees or customers by learning, for example, of common errors, unfulfilled product searches, etc.
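For readers unfamiliar with the MapReduce model mentioned in the interview, here is a minimal word-count sketch in Python. This is not Google's implementation; it only mimics the map and reduce phases with the standard multiprocessing module to show how batch work can exploit the parallelism of many workers.

```python
# A toy illustration of the MapReduce programming model, not Google's code.
from collections import Counter
from multiprocessing import Pool

def map_phase(document):
    """Map: emit a partial word count for one document."""
    return Counter(document.lower().split())

def reduce_phase(partial_counts):
    """Reduce: merge the per-document counts into a single result."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

if __name__ == "__main__":
    documents = [
        "the quick brown fox",
        "the lazy dog",
        "the quick dog",
    ]
    with Pool() as pool:                      # map tasks run in parallel
        partials = pool.map(map_phase, documents)
    print(reduce_phase(partials))             # Counter({'the': 3, ...})
```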
Jurriaan Persyn has provided an excellent summary of the presentation he gave at FOSDEM 2009 in Brussels, Belgium. FOSDEM is one of Europe's largest annual conferences on open source software, with about 5,000 attendees. He also provides the slides used during the presentation.
Even when a traditional database engine is involved, there can be database-like code sitting in the application to extend the capabilities of the underlying database engine. Database sharding is a good example of this. In this approach, data is federated over a collection of cheap servers to increase scalability and performance. Typically the applications that use sharding have the code that distributes the data over the shards and combines the results from the shards within their application code. I've used similar techniques myself before most of the commercial database engines started supporting partitioning and clustering natively. (Something that MySQL - which most of the sharding practitioners seem to use - has only just started to support.)
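To make the idea concrete, here is a minimal sketch in Python of the kind of application-side code described above: the application routes each key to a shard and combines results from all shards itself. In-memory dictionaries stand in for real database shards, and all names are invented for illustration.

```python
# A sketch of application-side sharding: the application picks the shard
# for each key and merges ("combines") results across shards itself.
SHARDS = [
    {"users": {}},   # shard 0
    {"users": {}},   # shard 1
    {"users": {}},   # shard 2
]

def shard_for(user_id):
    """Route a user to a shard -- here a simple modulo on the key."""
    return SHARDS[user_id % len(SHARDS)]

def save_user(user_id, record):
    shard_for(user_id)["users"][user_id] = record

def count_all_users():
    """Scatter the query to every shard and combine the results."""
    return sum(len(shard["users"]) for shard in SHARDS)

save_user(1, {"name": "alice"})
save_user(2, {"name": "bob"})
print(count_all_users())  # 2
```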
One interesting trend that I've noticed in many of the organizations that I've been into is that databases are increasingly being built to serve single applications. The early vision of databases shared amongst multiple applications is no longer the first choice. To a certain extent this has always been the case for certain operational systems, but now the reach of single-application databases has grown. You'll even find data replicated across multiple multi-terabyte data warehouses to support different business intelligence solutions.
db geek attributes this trend to factors such as commodity hardware and the benefits of reducing the complexities created by multi-application databases.
One interesting point that Wayner makes almost immediately in his report is that cloud computing, at least in the form of immediate availability of unlimited server capacity, does not deliver scalable applications:
After a few hours, the fog of hype starts to lift and it becomes apparent that the clouds are pretty much shared servers just as the Greek gods are filled with the same flaws as earthbound humans. Yes, these services let you pull more CPU cycles from thin air whenever demand appears, but they can't solve the deepest problems that make it hard for applications to scale gracefully. Many of the real challenges lie at the architectural level, and simply pouring more server cycles on the fire won't solve fundamental mistakes in design.
There are many benefits to cloud computing, especially the fact that hardware is now a commodity that is available on demand. However, there has been a general perception that using cloud computing providers such as Google or Amazon somehow magically delivers the same scalability as the applications provided by these cloud computing providers. It should be remembered that providers such as Google and Amazon are famous for their scalability because they shard their applications.
IBM ran a television campaign a few years ago that showed a development team monitoring a newly launched Web-based application, initially happy with the growing number of users until they realized their application would not scale. It seems that UK cellular company O2’s staff missed the IBM campaign: 3G iPhone pre-orders have crashed O2’s Web site. The scalability problem could have been anticipated. O2 had collected 200,000 email addresses of people interested in purchasing the 3G iPhone, so it should have anticipated the potential load on the order processing application when it notified so many people that it was ready to take orders. O2’s official response was that: "Though O2 had invested several million pounds to increase the order capacity of the site, at times the site still couldn't process the sheer weight of demand." The result is frustrated customers and negative press coverage in many UK newspapers. Would you trust your business communications to a company that cannot scale a Web application? Can O2 be trusted to scale mail servers?
It would be interesting to find out what really happened…
Poornima, a database engineer at mint.com with experience in performance and scalability monitoring and testing, wrote a very interesting blog post yesterday about addressing Web application scalability before it is too late. Poornima starts by making a point that is often overlooked by engineers - the starting point for scalability is 'a business perspective' - because scalability failures are business failures and scalability requirements are business requirements.
Poornima provides a simple explanation of why Database Sharding works so well for database-driven Web applications:
I’m sure we’ve all learned from our intro computer architecture class that CPU bound processes are the fastest and can be parallelized, whereas I/O processes are the bottleneck. In the case of a website, accessing the DB is the slowest I/O process. However, you can speed up access to data by sharding the database. Sharding breaks up a large database into smaller pieces that contains redundant information or a parent db can map data to separate dbs.
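As a hypothetical sketch of the "parent db can map data to separate dbs" approach Poornima mentions: a small directory (the parent) records which shard holds each user, and the application consults it on every read and write. All names below are invented for illustration; in a real system the directory and shards would be separate database servers.

```python
# Directory-based sharding sketch: the "parent db" holds only the mapping
# from user to shard; the shards hold the actual rows.
directory = {}                      # parent db: user_id -> shard name
shards = {"shard_a": {}, "shard_b": {}}

def assign_user(user_id):
    """Place new users on the least-loaded shard and record the mapping."""
    name = min(shards, key=lambda s: len(shards[s]))
    directory[user_id] = name
    return name

def write_profile(user_id, profile):
    name = directory.get(user_id) or assign_user(user_id)
    shards[name][user_id] = profile

def read_profile(user_id):
    return shards[directory[user_id]].get(user_id)

write_profile(42, {"email": "a@example.com"})
print(read_profile(42))
```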
Poornima concludes that scalability problems are often good - because they indicate a growing business - but you should plan for scalability from the start.
Sequin unkindly speculated that the major database vendors have ignored Database Sharding for commercial reasons.
There are a lot of expensive ways to scale your database – all of which are highly touted by the big three database vendors because, well, they want to sell you all types of really expensive stuff. Despite what an “engagement consultant” might tell you though, most of the high-traffic websites on the web (google, digg, facebook) rely on far cheaper and better strategies: the core of which is called sharding.
What’s really astounding is that sharding is database agnostic – yet only the MySQL crowd seem to really be leveraging it. The sales staff at Microsoft, IBM and Oracle are doing a good job selling us expensive solutions.
Domas Mituzas presented Wikipedia's scalability strategy at Velocity 2008 this week (the presentation is available here). Mituzas is a Wikipedia performance engineer and database administrator and a member of the Board of Trustees of the Wikimedia Foundation. Mituzas is also a MySQL (now Sun) employee and was not shy about reminding people that the entire site is driven from a MySQL database.
There was a big emphasis in the presentation on achieving results with minimal resources because the Wikimedia Foundation is a non-profit organization with a comparatively small budget.
The Wikipedia scalability statistics are impressive - 80,000 SQL queries per second, 18 million page objects in the English language version of the site, 220 million revisions, and 1.5 terabytes of compressed data.
Wikipedia uses Database Sharding to set up master-slave relationships between databases, which are logically based on use cases and languages. Mituzas points out that the Wikipedia team only found out that their database architecture was an example of Database Sharding after they had implemented it. Mituzas said MySQL instances range from 200 to 300 gigabytes.
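As a rough illustration only (MediaWiki is written in PHP and the real topology is more involved), routing of the kind Mituzas describes might look something like this: pick a database cluster by wiki language, send writes to that cluster's master, and spread reads over its slaves. The cluster layout below is invented.

```python
import random

# Hypothetical language-based routing with a master-slave split per cluster.
CLUSTERS = {
    "en": {"master": "db-en-master", "slaves": ["db-en-s1", "db-en-s2"]},
    "de": {"master": "db-de-master", "slaves": ["db-de-s1"]},
}

def pick_server(language, for_write=False):
    cluster = CLUSTERS[language]
    if for_write:
        return cluster["master"]               # all writes go to the master
    return random.choice(cluster["slaves"])    # reads are load-balanced

print(pick_server("en"))                       # e.g. db-en-s2
print(pick_server("en", for_write=True))       # db-en-master
```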
Bogdan Nicolau has published the third article in his "Database Sharding Unraveled" series. He makes an interesting point about planning for database scalability from the start:
Before really diving into high scalability principles, I want to take a moment to talk about why database sharding has an important role even in small startups or medium sized web-sites (5 - 30k unique visitors/day).
It is equally important and benefic for a smaller web business to prepare itself from the beginning to tackle large amounts of users cheap. If it’s not obvious enough, think about what happens to a web-page that gets some plain old Digg attention. The server quickly collapses and the user experience immediately turns from positive to mega negative.
As I’ve explained before, the whole purpose of sharding is to be able to use an unlimited number of cheap machines topped by an open-source database. As experience taught me, the web server will rarely die. Instead, the DB server will choke easily when having to deal with many simultaneous connections.
The database doesn’t even have to be very big.
Bogdan's focus is building scalable database-driven Web sites - but his comments apply to general applications as well.
There is no single monolithic database at eBay. Instead there is a set of database hosts for user data, a set for item data, a set for purchase data, etc. - 1000 logical databases in all, on 400 physical hosts. Again, this approach allows us to scale the database infrastructure for each type of data independently of the others.
Best Practice #2: Split Horizontally
The more challenging problem arises at the database tier, since data is stateful by definition. Here we split (or "shard") the data horizontally along its primary access path. User data, for example, is currently divided over 20 hosts, with each host containing 1/20 of the users. As our numbers of users grow, and as the data we store for each user grows, we add more hosts, and subdivide the users further. Again, we use the same approach for items, for purchases, for accounts, etc.
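A simple sketch of splitting data horizontally along its primary access path. This is not eBay's code; it only shows how a lookup by user id can map to one of N database hosts, and hints at why growing the number of hosts is the harder part of the scheme.

```python
# Horizontal split sketch: each host holds roughly 1/N of the users,
# keyed by the primary access path (the user id). Host names are invented.
NUM_HOSTS = 20

def host_for_user(user_id, num_hosts=NUM_HOSTS):
    """Every lookup by user id maps deterministically to one host."""
    return f"userdb-{user_id % num_hosts:02d}"

print(host_for_user(1030))       # userdb-10
# Growing from 20 to 40 hosts subdivides the users further, but a plain
# modulo scheme remaps many keys -- real systems typically use a directory
# or consistent hashing to keep that migration manageable.
print(host_for_user(1030, 40))   # userdb-30
```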
Best Practice #3: Avoid Distributed Transactions
It turns out that you can't have everything. In particular, guaranteeing immediate consistency across multiple systems or partitions is typically neither required nor possible. The CAP theorem, postulated almost 10 years ago by Inktomi's Eric Brewer, states that of three highly desirable properties of distributed systems - consistency (C), availability (A), and partition-tolerance (P) - you can only choose two at any one time. For a high-traffic web site, we have to choose partition-tolerance, since it is fundamental to scaling. For a 24x7 web site, we typically choose availability. So immediate consistency has to give way.
Best Practice #4: Decouple Functions Asynchronously
The next key element to scaling is the aggressive use of asynchrony. If component A calls component B synchronously, A and B are tightly coupled, and that coupled system has a single scalability characteristic -- to scale A, you must also scale B. Equally problematic is its effect on availability. Going back to Logic 101, if A implies B, then not-B implies not-A. In other words, if B is down then A is down. By contrast, if A and B integrate asynchronously, whether through a queue, multicast messaging, a batch process, or some other means, each can be scaled independently of the other. Moreover, A and B now have independent availability characteristics - A can continue to move forward even if B is down or distressed.
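To illustrate the point (this is not eBay's code), here is a tiny producer/consumer sketch: component A enqueues an event and returns immediately, while component B drains the queue at its own pace, so each side can be scaled, and can fail, independently.

```python
import queue
import threading

# Asynchronous decoupling sketch: if B is slow or down, the queue grows
# and A still makes forward progress.
work_queue = queue.Queue()

def component_a(order_id):
    """A publishes an event instead of calling B synchronously."""
    work_queue.put({"order_id": order_id})
    return "accepted"                # A answers immediately

def component_b():
    """B consumes events independently and can be scaled separately."""
    while True:
        event = work_queue.get()
        if event is None:            # sentinel to stop the worker
            break
        print("processing", event["order_id"])

worker = threading.Thread(target=component_b)
worker.start()
for order in range(3):
    component_a(order)
work_queue.put(None)
worker.join()
```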
Best Practice #5: Move Processing To Asynchronous Flows
Moving expensive processing to asynchronous flows, though, allows you to scale your infrastructure for the average load instead of the peak. Instead of needing to process all requests immediately, the queue spreads the processing over time, and thereby dampens the peaks. The more spiky or variable the load on your system, the greater this advantage becomes.
Best Practice #6: Virtualize At All Levels
At eBay, for example, we virtualize the database. Applications interact with a logical representation of a database, which is then mapped onto a particular physical machine and instance through configuration. Applications are similarly abstracted from the split routing logic, which assigns a particular record (say, that of user XYZ) to a particular partition.
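A minimal sketch of the idea, with an invented configuration: the application asks for a logical database, and a resolver maps that logical name plus a record key onto a physical host, so re-partitioning becomes a configuration change rather than a code change.

```python
# Database virtualization sketch: the application never hard-codes hosts.
# The mapping below is invented for illustration.
CONFIG = {
    "users": {"partitions": 4, "hosts": ["udb01", "udb02", "udb03", "udb04"]},
    "items": {"partitions": 2, "hosts": ["idb01", "idb02"]},
}

def resolve(logical_db, record_key):
    """Map a logical database plus record key onto a physical host."""
    entry = CONFIG[logical_db]
    partition = hash(record_key) % entry["partitions"]
    return entry["hosts"][partition]

print(resolve("users", "user:XYZ"))   # e.g. udb03 -- decided by configuration
```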
Best Practice #7: Cache Appropriately
The most obvious opportunities for caching come with slow-changing, read-mostly data - metadata, configuration, and static data, for example. At eBay, we cache this type of data aggressively, and use a combination of pull and push approaches to keep the system reasonably in sync in the face of updates. Reducing repeated requests for the same data can and does make a substantial impact. More challenging is rapidly-changing, read-write data. For the most part, we intentionally sidestep these challenges at eBay. We have traditionally not done any caching of transient session data between requests. We similarly do not cache shared business objects, like item or user data, in the application layer. We are explicitly trading off the potential benefits of caching this data against availability and correctness.
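As a simple illustration of caching slow-changing, read-mostly data (roughly the "pull" side of what the article describes), here is a sketch of a TTL cache; the function and key names are invented.

```python
import time

# TTL cache sketch: repeated requests within the TTL window never hit the
# database, which is where most of the benefit for read-mostly data comes from.
CACHE_TTL_SECONDS = 300
_cache = {}                          # key -> (value, expiry_time)

def load_metadata_from_db(key):
    print("expensive database read for", key)
    return {"key": key, "value": "..."}

def get_metadata(key):
    value, expires = _cache.get(key, (None, 0.0))
    if time.time() < expires:
        return value                 # served from cache
    value = load_metadata_from_db(key)
    _cache[key] = (value, time.time() + CACHE_TTL_SECONDS)
    return value

get_metadata("category-tree")        # misses, reads the database
get_metadata("category-tree")        # hit: no second database read
```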
It is a great public service for eBay to allow a senior engineer to make its scalability strategy public.
It's a common misconception that Google’s key intellectual property is its seemingly unique ability to deliver accurate search results. In fact, the patent for the famous PageRank algorithm is owned by Stanford University. Google listed the scalability of its vast infrastructure as its key asset during its Initial Public Offering.
Tim O’Reilly has explained that "it's no accident that Google’s system administration, networking, and load balancing techniques are perhaps even more closely guarded secrets than their search algorithms."
The business value of this infrastructure is discussed in this month's Harvard Business Review (HBR) in a detailed article about Google innovation called “Reverse Engineering Google’s Innovation Machine”.
Google has spent billions of dollars creating its Internet-based operating platform and developing proprietary technology that allows the company to rapidly develop and roll out new services of its own or its partners’ devising.
However, Google not only has infrastructure that allows it to add new on-line services more quickly and easily than its competitors, it also has a management strategy that encourages staff to innovate by including innovation in their job descriptions:
Much of what the company does is rooted in its legendary IT infrastructure, but technology and strategy at Google are inseparable and mutually permeable – making it hard to say whether technology is the DNA of its strategy or the other way around.
What would it mean for your organization if there were sufficient infrastructure resources to allow everyone to experiment with new ideas?
Web-based applications are different than normal enterprise applications – there can be unexpected usage spikes. There are many examples of on-line service scalability failures:
Government failures in information technology are famous. This is somewhat unfair, because government failures are probably no more common than in commercial enterprises; they are just higher profile. For example, when the UK's Public Records Office published the 1901 census on-line, the database was not able to handle the workload, which amounted to more queries in an hour than had been expected in a day. The resulting problem took 10 months to fix because they had not implemented a database architecture that allowed them to increase the capacity for read volumes (this would be simple with database sharding, for example, because the data could simply be sharded into smaller databases on separate servers).
Even companies that are famous for the scalability of their infrastructure, such as Amazon, have had scalability failures. For example, in 2003, a sudden traffic increase at Amazon.co.uk due to incorrect pricing resulted in the entire site being taken down.
The online betting site sportingindex.com was not designed to cope with an increase in customer numbers, and the service was offline for an entire day just before one of the biggest global sports betting events – the 2002 England versus Brazil World Cup game. The result was not just loss of revenue, but also loss of customers to other betting sites.
"we had millions of Friendster members begging us to get the site working faster so they could log in and spend hours social networking with their friends. I remember coming in to the office for months reading thousands of customer service emails telling us that if we didn’t get our site working better soon, they’d be 'forced to join' a new social networking site that had just launched called MySpace…the rest is history."
So what can engineers do to avoid on-line scalability disasters?
The most common mistake is not designing on-line applications for scalability and performance – because problems are rarely anticipated. It helps a lot if there is a clear understanding between the technical and business sides of a project about the capacity requirements. For example, the business managers must understand the technical risks of a big marketing launch of a new on-line service. In all cases, project managers must allow sufficient time for testing.
Sharon Perl of Google and Margo Seltzer of Harvard University and Oracle Corporation presented a paper one year ago today on Data Management for Internet-Scale Single-Sign-On at the 3rd USENIX Workshop on Real, Large Distributed Systems in Seattle. The paper explains in detail the Database Sharding strategy used for the single-sign-on service:
The SSO service maps usernames to user account data and services to service-specific data. These mappings are stored in the SSO database, which is partitioned into hundreds of pieces (called shards) for load balancing and data localization. Each shard is a replicated Berkeley DB database composed of between 5 and 15 replicas, depending on the shard's purpose. The SSO data in each replica is stored in a single Berkeley DB Btree database.
Smaller shards have five full replicas, any of which is capable of becoming a master. All updates must go to the master. Consistent reads must also go to the master. We sometimes allow "stale reads", which may be slightly out-of-date by an amount of time that we control, and which can be performed at non-master replicas. The larger replication groups typically have five replicas capable of becoming masters ("electable replicas") plus additional read-only replicas. Read-only replicas receive updates from the master, but do not participate in elections or contribute to commit quorums for updates, so the number of read-only replicas and their distance from other replicas does not affect the latency or availability of operations. When the system is running well (the normal case) the state of read-only replicas will be fairly closely synchronized with the master. A shard can have a master as long as more than half its electable replicas are up and communicating.
We spread replicas across multiple, geographically distributed data centers for availability in the face of failures of machines, networks, or data centers. At the same time, we try to keep replicas within a shard fairly close to one another because the communication latency between replicas affects how long it takes to commit a write operation to a shard or to elect a new master. The set of shards is geographically dispersed for data locality. We try to assign new users to shards based on where their data is likely to be accessed. This becomes tricky when the user data is shared by a variety of services that also may be spread over geographically dispersed data centers. We could do more optimization of data placement than we currently do, however it has not turned out to be a high priority for system performance.
As illustrated in Figure 1, there are logically two different kinds of shards. The vast majority of shards are independent databases that map a set of userids to account data and service ids to user-independent service data. The remaining shards implement the ID-map, which maps usernames to userids and userids to shards.
The ID-map is used for login, e-mail delivery, and at other times when we need to find a user's account data given a username. The ID-map shards are chained together in a doubly-linked list to store an extensible map, for scalability. Each shard in the chain handles a sub-range of the key space. Adjacent shards store adjacent ranges. Client library code keeps hints for the names of the component shards of the ID-map and their corresponding key ranges, so that we do not have to traverse the list for each key access. If the keys get rebalanced among the shards (which they can using offline tools), clients of the storage system will notice the changes and adjust their cached locations.
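A rough sketch of the ID-map lookup the paper describes, assuming invented shard names and range boundaries: each shard covers a sub-range of the username key space, and a client that has cached the boundaries can find the right shard with a single binary search instead of traversing the chain.

```python
import bisect

# ID-map lookup sketch: the shard names and range boundaries are invented.
RANGE_STARTS = ["", "h", "p"]        # lower bound of each shard's key range
SHARDS = ["idmap-shard-0", "idmap-shard-1", "idmap-shard-2"]

def shard_for_username(username):
    """Find the shard whose key range contains this username."""
    index = bisect.bisect_right(RANGE_STARTS, username) - 1
    return SHARDS[index]

print(shard_for_username("alice"))   # idmap-shard-0
print(shard_for_username("otto"))    # idmap-shard-1
print(shard_for_username("zoe"))     # idmap-shard-2
```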
Jeff Dean, an engineer and Google Fellow, presented Building a Computing System for the World's Information at the Seattle Conference on Scalability last June. After a general introduction to Google’s infrastructure, the presentation focuses on the famous elements of Google’s systems infrastructure: the Google File System (GFS), MapReduce, and BigTable.
Computerworld has published a very popular article about how Digg has used a combination of caching and database sharding to achieve scalability.
The other atypical feature of Digg’s setup is its use of what Tim Ellis, another Digg engineer, calls “sharding”.
A term apparently coined by Google engineers, sharding involves breaking a database into smaller parts in order to isolate heavy loads for better performance.
“If 90% of your data is within a certain range, and you can get that part working really fast, then you can help customers,” Ellis said. “Then it’s OK if the remaining 10% is slower.”
A database can be sharded by table, date or range. It is similar to partitioning, says Ellis, but with several key differences. Sharding usually involves divvying up data onto different physical machines. Partitioning, in contrast, typically occurs on the same piece of hardware. And while MySQL does not natively allow sharding, it does support partitioned tables, federated tables and clusters.
Digg only recently began sharding. While sharding is helping Digg.com achieve much faster performance overall, breaking a database into several smaller ones increases complexity, Ellis said.
dbShards economically scales large, high transaction volume databases using database sharding, dramatically improving the response times and scalability of OLTP databases, Software as a Service applications, and any database application with many concurrent users.