Database Sharding

The Rise of Database Sharding

The concept of Database Sharding has been gaining popularity over the past several years, due to the enormous growth in transaction volume and size of business application databases. This is particularly true for many successful online service providers, Software as a Service (SaaS) companies, and social networking Web sites.

Database Sharding can be simply defined as a "shared-nothing" partitioning scheme for large databases across a number of servers, enabling new levels of database performance and scalability achievable. If you think of broken glass, you can get the concept of sharding - breaking your database down into smaller chunks called “shards” and spreading those across a number of distributed servers.

The term "sharding" was coined by Google engineers, and popularized through their publication of the Big Table architecture. However, the concept of "shared-nothing" database partitioning has been around for a decade or more and there have been many implementations over this period, especially high profile in-house built solutions by Internet leaders such as eBay, Amazon, Digg, Flickr, Skype, YouTube, Facebook, Friendster, and Wikipedia.

The focus of this paper is on the need for Database Sharding, the options available for database partitioning, and the key considerations for a successful sharding implementation.

What Drives the Need for Database Sharding?

Database Sharding is a highly scalable approach for improving the throughput and overall performance of high-transaction, large database-centric business applications. Since the inception of the relational database, application engineers and architects have required ever-increasing performance and capacity, based on the simple observation that business databases generally grow in size over time. Adding to this general trend is the extreme expansion of business data due to the evolution of the Internet economy, the Information Age, and the prevalence of high-volume electronic commerce.

As any experienced database administrator or application developer knows all too well, it is axiomatic that as the size and transaction volume of the database tier incurs linear growth, response times tend to grow logarithmically. This is shown in the following diagram:

Database Sharding Graph

Figure 1. The growth in database transactions and volumes has a large impact on response times.

The reasons for the performance and scalability challenges are inherent to the fundamental design of the database management systems themselves. Databases rely heavily on the primary three components of any computer:

Through benchmark tests that we have performed, we know that each of these elements on a single server can only scale to a given point, and then other measures must be taken. While it is clear that disk I/O is the primary bottleneck, as database management systems have improved they also continue to take greater advantage of CPU and memory. In fact, we have observed that it is the matching of these three factors that determines maximum performance. In other words, you cannot add an unlimited number of CPUs (or processing cores) and see a commensurate increase in performance without also improving the memory capacity and performance of the disk drive subsystem. It is also common to see a diminishing return as resources are added to a single database server. These factors are especially true in mixed-use business transaction systems; systems that perform a high volume of read and write transactions, as well as supporting generalized business reporting tasks.

Therefore, as business applications gain sophistication and continue to grow in demand, architects, developers and database administrators have been presented with a constant challenge of maintaining database performance for mission critical systems. This landscape drives the need for Database Sharding.

Database Partitioning Options

It has long been known that database partitioning is the answer to improving the performance and scalability of relational databases. Many techniques have been evolved, including:

The common drawback with each of these approaches is the reliance on shared facilities and resources. Whether relying on shared memory, centralized disk, or processor capacity they each suffer with scalability limitations, not to mention many other drawbacks, including complex administration, lack of support for critical business requirements, and high availability limitations.

Database Sharding, The "Shared-Nothing" Approach

Database Sharding provides a method for scalability across independent servers, each with their own CPU, memory and disk. Contrasted with other traditional methods of achieving greater database performance, it does not suffer from many of the typical limitations posed by these other approaches. The concept of a "shared-nothing" database implementation has been under research or discussion for 15+ years, but it appears that the business application market is just now finding the more general need for such capability due to the exponential increase in data volumes over the past several years.

The basic concept of Database Sharding is very straightforward: take a large database, and break it into a number of smaller databases across servers. The concept is illustrated in the following diagram:

Database Sharding

Figure 2. Database Sharding takes large databases and breaks them down into smaller databases.

The obvious advantage of the shared-nothing Database Sharding approach is improved scalability, growing in a near-linear fashion as more servers are added to the network. However, there are several other advantages of smaller databases, which should not be overlooked when considering a sharding solution:

There is no doubt that Database Sharding is a viable solution for many organizations, supported by the number of large online vendors and SaaS organizations that have implemented the technology (giants such as Amazon, eBay, and of course Google).

Practicalities of Database Sharding

If Database Sharding is highly scalable, less costly, and improves performance, why hasn't adoption of the technology been more widespread? Is it feasible for your organization?

The reality is that Database Sharding is a very useful technology, but like other approaches, there are many factors to consider that ensure a successful implementation. Further, there are some limitations and Database Sharding will not work well for every type of business application. This chapter discusses these critical considerations and how they can be addressed.

Database Sharding Challenges

Due to the distributed nature of individual databases, a number of key elements must be taken into account:

As you can see, there are many things to consider and many capabilities required in order to ensure that a Database Sharding implementation is successful and effective, delivering on its objectives of providing new levels of scalability and performance in a cost-effective manner.

When Database Sharding is Appropriate

Database Sharding is an excellent fit for many types of business applications, those with general purpose database requirements. It can also be used effectively for Data Warehousing applications, and as there are many available products and technologies to accomplish this, we will not focus on this element here.

The general purpose database requirements that are a fit for sharding include:

To determine if Database Sharding is applicable to your specific application or environment, the most important thing to evaluate is how well your database schema lends itself to sharding. In essence, Database Sharding is a method of “horizontal” portioning, meaning that database rows (as opposed to columns) for a single schema table are distributed across multiple shards. To understand the characteristics of how well sharding fits a given situation, here are the important things to determine:

With this information, you can rapidly gain an assessment of the value and applicability of sharding to your application. As an example, here is a simple Bookstore schema showing how the data can be sharded:

Database Sharding

Figure 3. Example Bookstore schema showing how data is sharded.

In the Bookstore example, the Primary Shard Table is the 'customer' entity. This is the table that is used to shard the data. The 'customer' table is the parent of the shard hierarchy, with the 'customer_order' and 'order_item' entities as child tables. The data is sharded by the 'customer.id' attribute, and all related rows in the child tables associated with a given 'customer.id' are sharded as well. The Global Tables are the common lookup tables, which have relatively low activity, and these tables are replicated to all shards to avoid cross-shard joins.

While this example is very basic, it does provide the basic considerations for determining how to shard a given database application. By using this evaluation approach, you can determine if sharding is applicable to your particular environment, and what benefits can be realized by implementing Database Sharding.

Read More about Database Sharding: Request Free White Paper

Find out more about Database Sharding by requested the white paper.


Company   :
First Name   :
Last Name   :
Email   :
Phone   :
dbShards White Paper   :
Database Analysis   :
Request Contact   :