Labels: Application Scalability, Database Scalability, Database Sharding, On-Line Applications, Scalability Architecture, Sharding Case Studies
Database Sharding Blog
Tuesday, November 4, 2008
Sharding Architectures
Friday, October 31, 2008
More on Single Application Databases
It turns out the db geek has experience with database sharding:
Even when a traditional database engine is involved, there can be database-like code sitting in the application to extend the capabilities of the underlying database engine. Database sharding is a good example of this. In this approach, data is federated over a collection of cheap servers to increase scalability and performance. Typically the applications that use sharding have the code that distributes the data over the shards and combines the results from the shards within their application code. I've used similar techniques myself before most of the commercial database engines starting supporting partitioning and clustering natively. (Something that MySQL - which most of the sharding practitioners seem to use - has only just started to support.)
Labels: Application Scalability, Database Scalability, Database Sharding
Tuesday, September 23, 2008
Trend: Single Application Databases
One interesting trend that I've noticed in many of the organizations that I've been into is that increasingly databases are being built to serve single applications. The early visions of databases shared amongst multiple applications is no longer the first choice. To a certain extent this has always been the case for certain operational systems, but now the reach of single application databases has grown. You'll even find data replicated across multiple multi-terabyte data warehouses to support different business intelligence solutions.
db geek attributes this trend to factors such as commodity hardware and the benefits of reducing the complexities created by multi-application databases.
Labels: Application Scalability, Database Scalability, Database Sharding, Scalability Architecture
Monday, September 15, 2008
Database Sharding Lessons
Lesson 1: Right Size Your Shards
Lesson 2: Use Math on Shard Counts
Lesson 3: Carefully Consider the Spread
Lesson 4: Plan for Exceeding Your Shards
Lesson 5: Shard Early and Often
You can read the details of each lesson in Dan's blog.
Labels: Database Scalability, Database Sharding, Sharding Case Studies
Wednesday, August 13, 2008
Second Google Seattle Conference on Scalability
For example, the technical lead on Google Maps for mobile dicussues the adapt-
ing the Google Maps to work in a low bandwidth, high latency environment with a wide variety of networks and devices:
The project was very international and involved collaboration with teams in London, New York, Seattle, Tokyo, Beijing, and Cupertino.
Labels: Google, Scalability Architecture
Monday, August 11, 2008
Software Pipelines Book on Rough Cuts
Labels: Database Sharding, Software Pipelines
Thursday, July 24, 2008
The Cloud Does Not Magically Deliver Scalability
One interesting point that Wayner makes almost immediately in his report is that is that cloud computing, at least in the form of immediately availability of unlimited server capacity, does not deliver scalable applications:
After a few hours, the fog of hype starts to lift and it becomes apparent that the clouds are pretty much shared servers just as the Greek gods are filled with the same flaws as earthbound humans. Yes, these services let you pull more CPU cycles from thin air whenever demand appears, but they can't solve the deepest problems that make it hard for applications to scale gracefully. Many of the real challenges lie at the architectural level, and simply pouring more server cycles on the fire won't solve fundamental mistakes in design.
There are many benefits of cloud computing, especially the fact that hardware is now a commodity that is available on demand. However, there has been a general perception that using cloud computing providers such as Google or Amazon somehow magically delivers the same scalability as the applications provided by the these cloud computing providers. It should be remembered that providers such as Google and Amazon are also famous for their scalability because they shard their applications.
Labels: Database Sharding, Scalability Architecture, Scalability Economics
Friday, July 11, 2008
Scalability Disaster: Cellular Company’s Online Application Crashes
It would be interesting to find out what really happened.....
Wednesday, July 9, 2008
Addressing Scalability Before It Is Too Late
Poornima provides a simple explanation of why Database Sharding works so well for database-driven Web applications:
I’m sure we’ve all learned from our intro computer architecture class that CPU bound processes are the fastest and can be parallelized, whereas I/O processes are the bottleneck. In the case of a website, accessing the DB is the slowest I/O process. However, you can speed up access to data by sharding the database. Sharding breaks up a large database into smaller pieces that contains redundant information or a parent db can map data to separate dbs.
Poornima concludes that scalability problems are often good - because they indicate a growing business - but you should plan for scalability from the start.
Monday, June 30, 2008
The Economics of Database Sharding
Sequin unkindly speculated that the major database vendors have ignored Database Sharding for commercial reasons.
There are a lot of expensive ways to scale your database – all of which are highly touted by the big three database vendors because, well, they want to sell you all types of really expensive stuff. Despite what an “engagement consultant” might tell you though, most of the high-traffic websites on the web (google, digg, facebook) rely on far cheaper and better strategies: the core of which is called sharding.
What’s really astounding is that sharding is database agnostic – yet only the MySQL crowd seem to really be leveraging it. The sales staff at Microsoft, IBM and Oracle are doing a good job selling us expensive solutions.
Labels: Database Sharding, Scalability Economics
Thursday, June 26, 2008
Wikipedia's Scalability Architecture
There was a big emphasis in the presentation on achiving results with minimal resources because the Wikimedia Foundation is a non-profit organization with a comparitively small budget.
The Wikipedia scalability statistics are impressive - 80,000 SQL queries per second, 18 million page objects in the English language version of the site, 220 million revisions, and 1.5 terabytes of compressed data.
Wikipedia uses Database Sharding to set up master-slave relationships between databases, which are logically based on use cases and languages. Mituzas points out that the Wikipedia team only found out that they database architecture was an example of Database Sharding after they implemented it. Mituzas said MySQL instances range from 200 to 300 gigabytes.
Monday, June 16, 2008
Third Installment of Database Sharding Unraveled
Before really diving into high scalability principles, I want to take a moment to talk about why database sharding has an important role even in small startups or medium sized web-sites (5 - 30k unique visitors/day).
It is equally important and benefic for a smaller web business to prepare itself from the beginning to tackle large amounts of users cheap. If it’s not obvious enough, think about what happens to a web-page that gets some plain old Digg attention. The server quickly collapses and the user experience immediately turns from positive to mega negative.
As I’ve explained before, the whole purpose of sharding is to be able to use an unlimited number of cheap machines topped by an open-source database. As experience taught me, the web server will rarely die. Instead, the DB server will choke easily when having to deal with many simultaneous connections.
The database doesn’t even have to be very big.
Bogdan's focus is building scalable database-driven Web sites - but his comments apply to general applications as well.
Labels: Database Sharding
Friday, June 13, 2008
A Database Sharding Plan for Twitter
Labels: Database Sharding
Wednesday, May 28, 2008
Database Sharding at eBay
Randy Shoup is a Distinguished Architect at eBay.
Randy Shoup's six best practices are:
Best Practice #1: Partition by Function
There is no single monolithic database at eBay. Instead there is a set of database hosts for user data, a set for item data, a set for purchase data, etc. - 1000 logical databases in all, on 400 physical hosts. Again, this approach allows us to scale the database infrastructure for each type of data independently of the others.
Best Practice #2: Split Horizontally
The more challenging problem arises at the database tier, since data is stateful by definition. Here we split (or "shard") the data horizontally along its primary access path. User data, for example, is currently divided over 20 hosts, with each host containing 1/20 of the users. As our numbers of users grow, and as the data we store for each user grows, we add more hosts, and subdivide the users further. Again, we use the same approach for items, for purchases, for accounts, etc.
Best Practice #3: Avoid Distributed Transactions
It turns out that you can't have everything. In particular, guaranteeing immediate consistency across multiple systems or partitions is typically neither required nor possible. The CAP theorem, postulated almost 10 years ago by Inktomi's Eric Brewer, states that of three highly desirable properties of distributed systems - consistency (C), availability (A), and partition-tolerance (P) - you can only choose two at any one time. For a high-traffic web site, we have to choose partition-tolerance, since it is fundamental to scaling. For a 24x7 web site, we typically choose availability. So immediate consistency has to give way.
Best Practice #4: Decouple Functions Asynchronously
The next key element to scaling is the aggressive use of asynchrony. If component A calls component B synchronously, A and B are tightly coupled, and that coupled system has a single scalability characteristic -- to scale A, you must also scale B. Equally problematic is its effect on availability. Going back to Logic 101, if A implies B, then not-B implies not-A. In other words, if B is down then A is down. By contrast, if A and B integrate asynchronously, whether through a queue, multicast messaging, a batch process, or some other means, each can be scaled independently of the other. Moreover, A and B now have independent availability characteristics - A can continue to move forward even if B is down or distressed.
Best Practice #5: Move Processing To Asynchronous Flows
Moving expensive processing to asynchronous flows, though, allows you to scale your infrastructure for the average load instead of the peak. Instead of needing to process all requests immediately, the queue spreads the processing over time, and thereby dampens the peaks. The more spiky or variable the load on your system, the greater this advantage becomes.
Best Practice #6: Virtualize At All Levels
At eBay, for example, we virtualize the database. Applications interact with a logical representation of a database, which is then mapped onto a particular physical machine and instance through configuration. Applications are similarly abstracted from the split routing logic, which assigns a particular record (say, that of user XYZ) to a particular partition.
Best Practice #7: Cache Appropriately
The most obvious opportunities for caching come with slow-changing, read-mostly data - metadata, configuration, and static data, for example. At eBay, we cache this type of data aggressively, and use a combination of pull and push approaches to keep the system reasonably in sync in the face of updates. Reducing repeated requests for the same data can and does make a substantial impact. More challenging is rapidly-changing, read-write data. For the most part, we intentionally sidestep these challenges at eBay. We have traditionally not done any caching of transient session data between requests. We similarly do not cache shared business objects, like item or user data, in the application layer. We are explicitly trading off the potential benefits of caching this data against availability and correctness.
It is a great public service for eBay to allow a senior engineer make its scalability strategy public.
Labels: Database Sharding
Monday, April 14, 2008
Scalability and Innovation
Tim O’Reilly has explained that "it's not accident that Google’s system administration, networking, and load balancing techniques are perhaps even more closely guarded secrets than their search algorithms."
The business value of this infrastructure is discussed in this month's Harvard Business Review (HBR) in a detailed article about Google innovation called “Reverse Engineering Google’s Innovation Machine”.
HBR explains:
Google has spent billions of dollars creating its Internet-based operating platform and developing proprietary technology that allows the company to rapidly develop and roll out new services of its own or its partners’ devising.
However, Google not only has infrastructure that allows it to add new on-line services quicker and easier than its competitors, it also has a management strategy that encourages staff to innovate by including it in job descriptions:
Much of what the company does is rooted in its legendary IT infrastructure, but technology and strategy at Google are inseparable and mutually permeable – making it hard to say whether technology is the DNS of its strategy or the other way around.
What would it mean for your organization if there were sufficient infrastructure resources to allow everyone to experiment with new ideas?
Labels: Application Scalability, Database Sharding, Google, Innovation, On-Line Applications, Scalability Architecture

