Economy of Scaling

Echoes everywhere.

Friendster as a counterexample.

Google's financial statement: naming scaling as its core strength.

Amazon: Werner's interview on scaling economically.

Microsoft's initiative: putting data centers everywhere. The missing $2 billion.

Yahoo Mail: 2GB of storage, 10 times the users.


Questions about Context Switching in VMs

Blogging is often about opinions, solutions, and feedback. But what if I have questions?

While watching the MySQL video, it surprised me when Stewart Smith said the storage node daemon runs on a single thread. It has its own context switching, which is more efficient than using threads from the OS.
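
Out of curiosity, here is a toy sketch of what user-level context switching looks like (just the general idea, not MySQL Cluster's actual implementation): one OS thread runs many tasks, and a "switch" is merely resuming the next generator, with no trip into the kernel scheduler.

```python
from collections import deque

def scheduler(tasks):
    # Toy cooperative scheduler: one OS thread, many tasks. A "context
    # switch" here is just resuming the next generator, with no kernel
    # involvement and no OS scheduler state to save or restore.
    ready = deque(tasks)
    order = []
    while ready:
        task = ready.popleft()
        try:
            order.append(next(task))  # run the task until it yields
            ready.append(task)        # it yielded: requeue it
        except StopIteration:
            pass                      # the task finished
    return order

def worker(name, steps):
    for i in range(steps):
        yield f"{name}:{i}"           # yield = voluntary context switch

print(scheduler([worker("a", 2), worker("b", 2)]))
# -> ['a:0', 'b:0', 'a:1', 'b:1']
```

The tasks interleave fairly without any OS thread per task, which is the efficiency being claimed.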

Speaking of situations where context switching doesn't work well, I think of another one: when an OS runs inside a VM like VMware. Even when the host OS is mostly idle, the guest VM is still not very responsive.

Could it be that too many context switches are happening in the host OS, making context switches in the guest OS happen at bad times?

What do VM systems do to help this situation? Will we get a configuration flag for Linux (or other OSes) to make the OS context-switch differently when it runs as a guest? (Of course, a guest machine is not supposed to know it is a guest, unless you configure it as such.)

Tag: ,

Differences between the Relational and jCache Models

Much of Cameron's presentation also echoed my experience (the Castor JDO cache, the cluster work at the BPM company, the recent work I have been doing with distributed caches, and even my reading of the Transaction Processing book that I have mentioned a few times).

But he reminded me of an old problem I dealt with in Castor JDO. The cache [and lock] was local, but we were trying to respect isolation, so that if another machine made an incompatible change, data integrity would not be compromised; instead, the transaction would roll back. After all these years, I now understand the problem better, and I know that I didn't achieve it. To be specific, I didn't achieve the Serializable isolation level (phantom reads were possible).

Let's use class registration as an example. A student is allowed to take at most 18 credits in a quarter. So we do it with one query, and insert only if that query returns a result that meets our rule. First,
<quote>SELECT sum(credits) FROM student_course_table WHERE student=? AND quarter=this</quote>
Now, if the sum returned by the query plus the credits of the new course is no more than 18, we allow the new course to be added.
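
For concreteness, here is a minimal sketch of that check-then-insert pattern (SQLite standing in for the real database; table and column names taken from the query above). This is exactly the code that breaks without the lock set discussed next: two concurrent transactions can both pass the check and together exceed the limit.

```python
import sqlite3

MAX_CREDITS = 18  # the 18-credit rule from the example

def enroll(conn, student, quarter, course, credits):
    # Check-then-insert: correct only if the read set is protected.
    # Under concurrency, two transactions can both read the same sum,
    # both pass the check, and both insert: a phantom-read bug.
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(credits), 0) FROM student_course_table "
        "WHERE student = ? AND quarter = ?",
        (student, quarter),
    ).fetchone()
    if total + credits > MAX_CREDITS:
        return False  # would exceed the 18-credit rule
    conn.execute(
        "INSERT INTO student_course_table (student, quarter, course, credits) "
        "VALUES (?, ?, ?, ?)",
        (student, quarter, course, credits),
    )
    conn.commit()
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student_course_table "
             "(student TEXT, quarter TEXT, course TEXT, credits INTEGER)")
assert enroll(conn, "alice", "Q1", "CS101", 10)
assert enroll(conn, "alice", "Q1", "CS102", 8)      # exactly 18: allowed
assert not enroll(conn, "alice", "Q1", "CS103", 1)  # would exceed 18
```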

In this case, we either disallow other threads from inserting another course, or we want to cause this transaction to fail.
The solution is pretty hard to implement efficiently (while still allowing parallelism). Because we read a range of values to get the result, we need to lock more than just the new row to be inserted, to ensure the result stays correct. So we need a lock set.
  • 1/ A simple solution is for every read to also hold a shared lock on the table and on the item. If an insert is issued, the table lock is upgraded to an exclusive lock.

  • 2/ A more efficient implementation is for readers to hold IS (intent shared) or IX (intent exclusive) locks on the table.

  • 3/ More efficient yet is to use IS or IX predicate locks (locks on a range).
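
For reference, option 2/ builds on the classic lock-mode compatibility matrix from the database literature (nothing Castor-specific here): intent locks let many readers and row-level writers coexist on the table, while a full table exclusive lock still waits for everyone.

```python
# S = shared, X = exclusive; IS/IX = intent locks taken on the table
# before locking individual rows. The classic compatibility matrix:
COMPATIBLE = {
    ("IS", "IS"): True,  ("IS", "IX"): True,  ("IS", "S"): True,  ("IS", "X"): False,
    ("IX", "IS"): True,  ("IX", "IX"): True,  ("IX", "S"): False, ("IX", "X"): False,
    ("S",  "IS"): True,  ("S",  "IX"): False, ("S",  "S"): True,  ("S",  "X"): False,
    ("X",  "IS"): False, ("X",  "IX"): False, ("X",  "S"): False, ("X",  "X"): False,
}

def can_grant(held, requested):
    # A new table lock is granted only if it is compatible with
    # every lock currently held by other transactions.
    return all(COMPATIBLE[(h, requested)] for h in held)

assert can_grant(["IS", "IS"], "IX")   # readers don't block a row writer
assert not can_grant(["S"], "X")       # a table scan blocks a table write
```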
Cameron didn't mention lock sets with Coherence, and I had thought the only way to get isolation right was to use a lock set. So I had a discussion with him.

It turned out the problem spaces are different. Because jCache uses get() and put(), it is freed from caring about the inter-dependencies among data. So we don't need a lock set; the specification is different.

So, does that mean the jCache model is easier? Not necessarily. They are difficult in different ways. Cameron explained to me why even locks cannot be guaranteed to be enough (because of out-of-order messages and the absolute-time problem). The database, on the other hand, has a log (journal) that essentially defines absolute time (or the order of events).

However, ORM product designers should be aware of the differences between the relational model and the jCache model when they utilize jCache to scale out, especially if the Serializable isolation level is desired. One way is to pick (or let the user pick) the right level of granularity. In the course registration example, choosing the student as the lock and the relationships as dependent objects will work (assuming courses are stable). But in some cases these are difficult problems that require analysis of the trade-offs.
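
A minimal sketch of that granularity choice (plain Python locks standing in for the cache's concurrency control; all names are made up): every enrollment change for a student is serialized behind that student's lock, so the credit-sum read and the course insert are atomic with respect to other threads touching the same student.

```python
import threading
from collections import defaultdict

MAX_CREDITS = 18
# One lock per student: the student is the lock granule. (defaultdict
# entry creation is safe enough under CPython's GIL for a sketch.)
student_locks = defaultdict(threading.Lock)
courses = defaultdict(list)   # student -> [(course, credits)]

def enroll(student, course, credits):
    with student_locks[student]:
        total = sum(c for _, c in courses[student])
        if total + credits > MAX_CREDITS:
            return False      # would break the 18-credit rule
        courses[student].append((course, credits))
        return True
```

Courses are assumed stable, as in the text; if two students' data could interact, a single per-student lock would no longer be enough.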

    Tag: , , , ,

    Distributed Caching: Essential Lessons

A deadline was looming. My most productive (day)time in a week is often Wednesday and Thursday late afternoon. Fremont's Peet's Coffee was giving out free coffee to celebrate the store's one-year anniversary, and I was a bit over-caffeinated. :-P Tempted to get more work done, I had almost forgone the ISTA meeting. It was a talk about distributed caching by Cameron Purdy of Tangosol.

Glad that I was there! Aside from scaring me with a poor joke at the beginning, the talk was great. (I can no longer remember the joke.)

I remember his presentation as having four parts.
  • 1/ An introduction of himself, the company, the problem space, and the product name (i.e., what "coherence cache" means technically).

  • 2/ The evolution (what, how, why) of the distributed cache (from replicated cache, partitioning, failover cache, local cache, and standalone cache server, to write-behind cache).

  • 3/ Highlights of the technical challenges in the product implementation (a finite state machine to model the communication; edge cases in the partitioned local cache; the gap between loading the data and propagating it to the cache; no absolute time in a distributed system; network constraints: 12 ms latency, out-of-order delivery; cannot be proven correct, but no incorrectness found; leaving the cluster, etc.).

  • 4/ Cluster system design guidelines (13 of them: [Java] serialization/externalization, identity, defining equals, idempotency, etc.).

With the great wealth of experience he has with real-life systems, and his full knowledge of the product since the beginning, the talk was vastly interesting. (And no one fell off their chair even though the talk ran pretty long. :-)

    Tag: , , , ,

In-Depth Look at Data-Driven Clusters (On Clustering Part VI of VII)

Let's focus on the scalability of data-driven applications. The demand for scaling this kind of application is increasing, but solutions remain expensive.

To understand why scaling out this kind of application is challenging, let's start with a nominal view of data operations and categorize them into two kinds: reads, and updates (including creates and removes). A system's performance (P) can be represented as the sum of the rates of reads (V) and updates (U):
    P = V + U

Ideally, we would like the performance (P[n]) of a system to grow linearly as the number (n) of machines increases:
P[n] = n(V + U) = nP -- ideally

However, that is not possible. To ensure data integrity, each update must be propagated to all machines, so that subsequent relevant reads obtain the newest values.

Now, consider a two-machine cluster; the performance is theoretically limited to
    P[2] = 2P - 2U

This is because, for every data update on one machine, the second machine needs to be updated as well. It is the penalty we pay for scaling.

    The Equation
Similarly, for n machines, the (simplified) performance can be defined as
    P[n] = nP - n(n-1)U

For a small U (close to zero), scaling can be nearly linear. With load balancers, round-robin DNS, and co-located data replication, we indeed achieve nearly linear scalability for read-only data in the real world.

However, for a larger U, the performance peaks and falls off quickly. The penalty grows on the order of O(n^2).

(Note that the equation above is simplified: as more capacity is spent on updates, we actually have less left for reads. The proper equation rebalances the V:U ratio, and the actual penalty is slightly lower for large n, so the performance never becomes negative.)
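
To see the shape of the curve, a quick sketch of the simplified equation (P = 100 ops/s per node and U = 1 update/s per node are made-up illustrative values):

```python
def cluster_perf(n, P=100.0, U=1.0):
    # P[n] = nP - n(n-1)U: n nodes each contribute P, but every
    # node's updates must also be applied on its n - 1 peers.
    return n * P - n * (n - 1) * U

# Nearly linear while n is small; then the n^2 penalty dominates.
best_n = max(range(1, 201), key=cluster_perf)
print(best_n, cluster_perf(best_n))   # -> 50 2550.0
```

Past the peak, adding machines makes the cluster slower, which is the whole point of the O(n^2) penalty.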

    IP Multicast
Some may be tempted to think that using IP multicast eliminates the O(n^2) performance hit. It does not. Even when multicast is used, each machine receiving an update packet from another machine still needs to apply the update to its own copy of the data. Consider a cluster of 100 machines, and assume each machine makes 1 update per second. Every second, each of the 100 machines sends out 1 multicast about its update, and receives and processes 99 multicasts from the others (100 x 99 updates applied cluster-wide). The O(n^2) term doesn't go away.
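
The counting above can be sketched directly (message counts only; same 100-machine, 1-update-per-second scenario):

```python
def per_second(n, updates_per_node=1, multicast=True):
    # Multicast collapses how many packets each updater must send,
    # but every node still applies the other (n - 1) nodes' updates
    # locally, so the O(n^2) apply work remains.
    sends = (n if multicast else n * (n - 1)) * updates_per_node
    applies = n * (n - 1) * updates_per_node
    return sends, applies

print(per_second(100))                   # -> (100, 9900)
print(per_second(100, multicast=False))  # -> (9900, 9900)
```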

    Reduce U
It does, however, reduce U, and in some cases significantly. There are other fancy techniques to reduce U as well, though none eliminate the O(n^2) term. These techniques include data invalidation (instead of a full update on each node), lock tables, voting, and centralized updates. All of them are important, but each comes with its own trade-offs. For example, centralized updates basically force us to rely on a single massive machine (the very thing we wanted to replace with commodity hardware).

    Not Just Data Replication
The equation might appear to apply only to data-replication setups. That is not true. Invalidating data often means we need to re-read the data from a centralized machine. In that case, we are just pushing the updates, and the scaling problem, onto a single machine. That is the opposite of the goal of scaling out.

    Ad-hoc Cluster Cache
Caching might appear to relieve the single-machine problem for non-replication setups. However, the same cannot be said for clustering. Applying the invalidation technique to a non-cluster-aware cache (some might call it a clustered cache) works for small numbers of machines and low update frequencies. When either or both get large, the hit rate of the cache quickly approaches zero, because many more machines are trying to invalidate the cache, which leaves it empty most of the time. (Of course, a true clustered-cache design is aware of this problem and tries to do better.)

    Reduce N
If the O(n^2) penalty cannot be reduced, the next best thing is to reduce n. In fact, it is an important consideration in real-life tuning. To reduce n, we want to spin out any read that is not relevant to the data application. To ensure a serializable level of data integrity, we need to keep track of relevant reads to avoid read->write dependencies (Chapter 7.6 of the Gray book), or use exclusive locks on tables. That makes the performance penalty of spinning out relevant data very high. Only data that has no dependencies on the rest can be spun out.

Two in Parallel
If a parallel set of data can be isolated, we can run two cluster systems instead of one. For example, if the two sets are about equally intensive, we get
2 x P[n/2] = nP - 2(n/2)(n/2 - 1)U

This approach does not apply to all data. It takes symmetry out of the system, which increases design and administration complexity and cost.
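
Plugging made-up numbers (P = 100, U = 1, n = 100) into the formula shows how much the split buys:

```python
def perf(n, P=100.0, U=1.0):
    # The post's simplified model: P[n] = nP - n(n-1)U
    return n * P - n * (n - 1) * U

whole = perf(100)           # one 100-node cluster pays a 100*99 penalty
split = 2 * perf(100 // 2)  # two 50-node clusters pay only 2 * (50*49)
print(whole, split)         # -> 100.0 5100.0
```

With these numbers, the single big cluster is almost entirely eaten by the update penalty, while the two isolated halves keep most of their capacity.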

Similarly, we might be able to use data partitioning to reduce n as well. Partitioning can be done by data range, hash code, lookup table, or other algorithms. This approach also depends on the data schema, and it increases administration complexity and cost. For instance, it might require a periodic administrative task to load-balance between the partitions (or require other software).
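
A sketch of hash partitioning (using zlib.crc32 rather than Python's built-in hash(), which is salted per process), including a peek at the re-balancing chore mentioned above: with plain modulo, changing the partition count moves a large share of the keys.

```python
import zlib

def partition_for(key, n_partitions):
    # Stable hash, so the key -> partition mapping survives restarts.
    return zlib.crc32(key.encode()) % n_partitions

def keys_moved(keys, old_n, new_n):
    # How many keys change owner when re-partitioning by plain modulo:
    # this is the periodic load-balancing (or extra software) cost.
    return sum(partition_for(k, old_n) != partition_for(k, new_n)
               for k in keys)

keys = [f"user{i}" for i in range(1000)]
# Going from 4 to 5 partitions relocates most of the 1000 keys.
print(keys_moved(keys, 4, 5))
```

Consistent hashing is the usual way to shrink that relocation cost, at the price of a more complex routing layer.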

The ideas behind division and partitioning are the same: to exploit parallelism. The concept of exploiting parallelism can go even further, much further. In some cases, it can be automated, with restrictions that are acceptable to most applications. I will share some of these in a later post.

They are no silver bullets, either. But putting them together helps. I tend to think that good-enough solutions have already been discovered; the challenging problem of scaling data-driven applications is awaiting cost-effective implementations. The execution matters!

    Tag: , , ,

    Java / Tomcat / Virtualization

I was talking about a virtualized Linux/BSD distribution with Java and Tomcat.

And I am glad to discover that it is out there.

I had noticed it before I wrote the previous blog posts in December. However, only recently did they reach a very interesting price point: $20 a month.

For the price, I get my own virtual server, set up to run as many domain names as I want. To my surprise, it meets almost all my criteria. The HD footprint was about 63MB with core Linux, the JDK and JRE, iptables, Tomcat, mail, FTP, MySQL, SSH, and various other software. Additional software can be installed by simply checking a checkbox. To my surprise, they also provide the XFree86 X11 libraries, and I verified that the JRE is able to utilize them: I was able to run a Swing app on the virtual server and display it on my home desktop.

The performance is rather unacceptable for interactive UI applications, but I don't expect any kind of UI performance from a hosted server anyway. It also feels slow for shell or FTP operations. However, it seems to work reasonably well for serving webpages.

The management software of eApps is provided by SWsoft. The control panel, HSPComplete, is very intuitive. The virtual server infrastructure is, obviously, also licensed from SWsoft.

eApps rocks! In terms of features, it certainly beat my expectations. Highly recommended!

[Update June 24, 06] I upgraded to the $30 plan, and the performance of eApps is getting pretty good. Not sure if it is because of the plan upgrade, or because of their ongoing performance improvements in general.

    Tag: , ,