IT World has published an interview with Google's Vice President of Research Alfred Spector that provides details about Google's IT infrastructure and strategy:
Google uses what is now termed "cloud computing". We have numerous clusters, each containing large numbers of computers. The clusters run a distributed computing infrastructure that uses Linux on each computer. All the computers are then tied together with high-performance networking and distributed computing software. For example, we have built and deployed a global file system called the Google File System that provides scalable, fault-tolerant storage; a record-oriented data storage system for tabular data called BigTable; and a computational programming model called MapReduce that allows our batch jobs to use the inherent parallelism in our clusters.
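The MapReduce model Spector mentions can be illustrated with a minimal, single-process sketch. This is purely illustrative; Google's actual implementation distributes the map and reduce phases across thousands of machines, and none of the names below come from any real API:

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """Toy in-process MapReduce: map each record to (key, value) pairs,
    group values by key (the implicit 'shuffle'), then reduce each group."""
    intermediate = defaultdict(list)
    for record in inputs:
        for key, value in mapper(record):
            intermediate[key].append(value)
    return {key: reducer(key, values) for key, values in intermediate.items()}

# The classic word-count example.
def mapper(line):
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    return sum(counts)

docs = ["the quick brown fox", "the lazy dog"]
counts = map_reduce(docs, mapper, reducer)
print(counts["the"])  # -> 2
```

The appeal of the model is that the framework owns all the hard distributed-systems work (partitioning, scheduling, fault tolerance), while the programmer supplies only the two pure functions.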
[As for] the exact number of machines, locations and clusters we have, suffice it to say that we have so many individual elements in our fabric that an enormous amount of attention is paid to fault tolerance, because with so many elements operating, there are exceedingly frequent component failures. Could other companies emulate that kind of architecture? First, there really are economies of scale in running systems that can support many services on a common fabric. Second, relating to the services model we espouse, there are great simplifications to releasing software as a Web-based service, because services don't have to be tested and deployed on a large number of different customer environments. Instead, software can be released to a small number of machines in a more controlled cloud and then accessed by browsers.
A third benefit is that since a software service is a logically centralized notion, the history of interactions of very many users can be aggregated and thus be the basis for various types of self-learning systems. Google uses this concept to learn to correct spelling mistakes, but businesses can use similar notions to better meet the needs of employees or customers by learning, for example, of common errors, unfulfilled product searches, etc.
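The self-learning idea Spector describes can be sketched in miniature: if the corrected form of a misspelled query dominates the aggregated logs, it can be suggested automatically. This is a toy sketch under invented data; Google's real spelling-correction system is far more sophisticated and is not public:

```python
from collections import Counter

# Hypothetical aggregated query log: most users eventually type the
# correct spelling, so the correct form dominates the counts.
query_log = ["receive", "receive", "receive", "recieve",
             "definitely", "definately", "definitely", "definitely"]
frequency = Counter(query_log)

def suggest(query, candidates):
    """Among plausible variants, suggest the one users typed most often;
    keep the original query if no variant is more popular."""
    best = max(candidates, key=lambda c: frequency[c])
    return best if frequency[best] > frequency[query] else query

print(suggest("recieve", ["recieve", "receive"]))  # -> receive
```

The point is that no dictionary or language model is needed: the aggregate behaviour of many users is itself the training data, which is only possible because the service is logically centralized.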
Jurriaan Persyn has provided an excellent summary of the presentation he gave at FOSDEM 2009 in Brussels, Belgium. FOSDEM is one of Europe's largest annual conferences on open source software, with about 5,000 attendees. He has also made the actual slides from the presentation available.
Even when a traditional database engine is involved, there can be database-like code sitting in the application to extend the capabilities of the underlying database engine. Database sharding is a good example of this. In this approach, data is federated over a collection of cheap servers to increase scalability and performance. Typically, the applications that use sharding contain the code that distributes the data over the shards and combines the results from the shards within their application code. I've used similar techniques myself before most of the commercial database engines started supporting partitioning and clustering natively. (Something that MySQL - which most of the sharding practitioners seem to use - has only just started to support.)
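The application-side logic described above can be sketched as follows. All names here (SHARDS, shard_for, scatter_gather) are illustrative assumptions, not taken from any specific product: the application hashes the shard key to route single-row operations, and for cross-shard queries it fans out to every shard and merges the partial results itself:

```python
import hashlib

# Hypothetical shard map: four cheap servers holding disjoint slices of the data.
SHARDS = ["db-server-0", "db-server-1", "db-server-2", "db-server-3"]

def shard_for(user_id):
    """Route a key to its shard. Hashing the key (rather than using
    user_id % n directly) spreads sequential IDs evenly across shards."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

def scatter_gather(query_one_shard):
    """Cross-shard query: send the same query to every shard and
    combine the partial results in the application."""
    results = []
    for shard in SHARDS:
        results.extend(query_one_shard(shard))
    return results

# A given key always routes to the same shard.
print(shard_for(42) == shard_for(42))  # -> True
```

This is exactly the "database-like code in the application" the paragraph describes: the routing and result-merging live above the database engine, which sees only ordinary single-server queries.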
One interesting trend that I've noticed in many of the organizations that I've visited is that, increasingly, databases are being built to serve single applications. The early vision of databases shared amongst multiple applications is no longer the first choice. To a certain extent this has always been the case for certain operational systems, but now the reach of single-application databases has grown. You'll even find data replicated across multiple multi-terabyte data warehouses to support different business intelligence solutions.
db geek attributes this trend to factors such as commodity hardware and the benefits of reducing the complexities created by multi-application databases.
It's a common misconception that Google's key intellectual property is its seemingly unique ability to deliver accurate search results. In fact, the patent for the famous PageRank algorithm is owned by Stanford University. Google listed the scalability of its vast infrastructure as its key asset during its Initial Public Offering.
Tim O’Reilly has explained that "it's no accident that Google’s system administration, networking, and load balancing techniques are perhaps even more closely guarded secrets than their search algorithms."
The business value of this infrastructure is discussed in this month's Harvard Business Review (HBR) in a detailed article about Google innovation called “Reverse Engineering Google’s Innovation Machine”.
Google has spent billions of dollars creating its Internet-based operating platform and developing proprietary technology that allows the company to rapidly develop and roll out new services of its own or its partners’ devising.
However, Google not only has infrastructure that allows it to add new on-line services more quickly and easily than its competitors, it also has a management strategy that encourages staff to innovate by including innovation in job descriptions:
Much of what the company does is rooted in its legendary IT infrastructure, but technology and strategy at Google are inseparable and mutually permeable – making it hard to say whether technology is the DNA of its strategy or the other way around.
What would it mean for your organization if there were sufficient infrastructure resources to allow everyone to experiment with new ideas?
Web-based applications are different from normal enterprise applications – they can face unexpected usage spikes. There are many examples of on-line service scalability failures:
Government failures in information technology are famous. This is somewhat unfair, because government failures are probably no more common than those in commercial enterprises; they are just higher profile. For example, when the UK's Public Records Office published the 1901 census on-line, the database was not able to handle the workload, which amounted to more queries in an hour than had been expected in a day. The resulting problem took 10 months to fix because they had not implemented a database architecture that allowed them to increase the capacity for read volumes (this would be simple with database sharding, for example, because the data could simply be sharded into smaller databases on separate servers).
Even companies that are famous for the scalability of their infrastructure, such as Amazon, have had scalability failures. For example, in 2003, a sudden traffic increase at Amazon.co.uk due to incorrect pricing resulted in the entire site being taken down.
The online betting site sportingindex.com was not designed to cope with an increase in customer numbers and service was offline for an entire day just before one of the biggest global betting sports events – the 2002 England versus Brazil World Cup game. The result was not just loss of revenue, but also loss of customers to other betting sites.
"we had millions of Friendster members begging us to get the site working faster so they could log in and spend hours social networking with their friends. I remember coming in to the office for months reading thousands of customer service emails telling us that if we didn’t get our site working better soon, they’d be 'forced to join' a new social networking site that had just launched called MySpace…the rest is history."
So what can engineers do to avoid on-line scalability disasters?
The most common mistake is not designing on-line applications for scalability and performance – because problems are rarely anticipated. It helps a lot if there is a clear understanding between the technical and business sides of a project about the capacity requirements. For example, the business managers must understand the technical risks of a big marketing launch for a new on-line service. In all cases, project managers must allow sufficient time for testing.
Jeff Dean, an engineer and Google Fellow, presented Building a Computing System for the World's Information at the Seattle Conference on Scalability last June. After a general introduction to Google’s infrastructure, the presentation focuses on the famous elements of Google’s systems infrastructure: the Google File System (GFS), MapReduce, and BigTable.
dbShards economically scales large, high-transaction-volume databases using database sharding, dramatically improving the response times and scalability of OLTP databases, Software as a Service applications, and any database application with many concurrent users.