In Depth look at Data-Driven Cluster (On Clustering Part VI of VII)

In Depth look at Data-Driven Cluster
Let’s focus on the scalability of data-driven applications. The demands for scaling of this kind of applications are increasing, but solutions remains expensive.

To understand why scaling out this kind applications are challenging, let's start with a nominal view of data operation, and categorizes them into two: read and update (including create and remove). A system performance (P) can be represented as the sum of the rate of Read (V) and Update (U):
P = V + U

Ideally, we would like the performance (P[n]) of a system to be linear as the number(n) of machines increases:
nP = n(V + U) -- ideally

However, it is not possible. To ensure data integrity, each update must be propagated to all machines, such that the next relevance read operations will obtain the newest values.

Now, consider a two machines cluster, the performance is theoretically limited to
P[2] = 2P - 2U

It is because, for every data update to one machine, the second machine needs to be updated as well. It is the penalty we need to pay for scaling.

The Equation
Similar, for n machines, the (simplified) performance can be defined as
P[n] = nP - n(n-1)U

For a small U (close to zero), scaling can be very linear. With load-balancer, round-robin DNS, co-locations data replication, we indeed achieve very linear scalability for read-only data in the real world.

However, for a larger U, the performance peak off quickly. The penalty runs at the order of bigO(n^2).

(Note that the equation above is simplified because as more performance is spent on update, we actual have less to do Read as well. The proper equation rebalance the V:U ratio and the actual penalty is slightly lower for larger n, such that the performance will not become negative.)

IP Multicast
Some may tempt to think that using IP Multicast will eliminate big0(n^2) performance hit. It is not true. Even if multicast is used, each machine receiving the update packet from another machine need to update its own version of the data. Consider a cluster of 100 machines, and assume each machine makes 1 update per second. So, in every seconds, each of the 99 machine now send out 1 multicast about its update, and receive and process 99 multicasts (99x99 updates). The big0(n^2) term doesn't go away.

Reduce U
It does, however, reduces U, and in some case significantly. Similarly, there are other fancy techniques to reduce U, but not eliminate the big0(n^2). These techniques include data invalidation (instead of full update on each node), lock tables, voting and centralize updates. All of these techniques are important, but also come with its own trade offs. For example, centralized updates basically force us to rely on a single massive machine (that we want to replace using commodity hardware).

Not Just Data Replication
The equation might appear to apply to data replication setup only. It is not true. Invalidating data often mean we need to read all data from a centralized machine. In this case, we are just pushing the updates and scaling problem into a single machine. It is opposite to the goal of scaling out.

Ad-hoc Cluster Cache
Caching might appear to relieve the single machine problem for non-data replication setup. However, the same can not be said for clustering. Applying invalidate technique to non-cluster aware cache (some might call it clustered-cache) works for smaller number of machines and frequences of updates. When either or both value gets large, the hit rate of the cache quickly approaches zero, because there are much more machine trying to invalidate the cache, and it renders the cache empty most of the time. (Of course, a true clustered cache design aware of this problem and try to do better)

Reduce N
If bigO square on n cannot be reduced, the next best is to reduce n. In fact, it is an important consideration in real-life tuning. To reduce the n, we want to spun out any read that is not relevant to the data application. To ensure serializable level of data integrity, we need to keep track of relevant read to avoid read->write dependencies Chapter 7.6 on Gray book, or use exclusive lock on tables. It makes the performance penalty to be very high to be spun out relevance data. Only data that has no dependencies on other can be spun out.

Two the Parallel
If a parallel set of data can be isolated, we can have run two cluster systems instead of one. For example, if the two set is about as intensive, we will get
2 x P[n/2] = 2nP x 2(n/2)(n/2 -1)U

It approaches does not apply to all data. It takes symmetry out of the system, which increase design and administration complexity and cost.

Similarly, we might able do data partitioning to reduce n as well. Partitioning can be divided with data-range, hash code, lookup table or other algorithm. This approach also depends on the data schema, and increases administration complexity and cost. For instance, it might require periodical administrative task to load-balance between the partitions. (or, requuire other software)

The ideas of division and partition are the same: to exploit parallelism. The concept of exploiting parallelism can go even further: much further. In some case, they can be automated with some restrictions that is ok with most application. I will share some of them in a later post.

They are no silver bullets, either. But, putting them together helps. I tend to think that good enough solutions are already been discovered. The challenging problem of scaling data driven application is awaiting cost effective implementations. The execution matters!

Tag: , , ,