Cluster (On Clustering Part II)

Cluster
-------
When the volume cannot be satisfied by waiting and division, and it is more economical to spend engineers' time than to buy a massive computer, multiple machines are used to form a cluster, with software and configuration modified to accommodate them.

As data scenarios get more complex, more scaling strategies are employed. Let's try to categorize high-volume applications by their nature.

Static data
------------
Scaling static data to high volume is the easiest case: put up multiple machines, replicate the data onto all of them, and have any of them serve each client at random.
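
As a sketch, the routing can be as simple as picking a replica at random for each request. The host names below are made up for illustration:

```python
import random

# Hypothetical replica list: every machine holds a full copy of the
# static data, so any one of them can answer any client.
REPLICAS = [
    "static1.example.com",
    "static2.example.com",
    "static3.example.com",
]

def pick_replica():
    """Route each incoming request to a randomly chosen replica."""
    return random.choice(REPLICAS)
```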

Non real-time updates
---------------------
Data in most useful applications changes; however, it doesn't necessarily change frequently. And even if the data changes all the time, users may not need the most up-to-date version of it. Such applications can be divided across two sets of computers: a low-hit set for updates and a high-hit set for queries. Updates are applied as frequently as needed to the update set. All queries are routed to the second set, which keeps its own snapshot of past data and scales the same way as static data. Once in a while, changes are aggregated from the first set and applied to the second set as a batch operation, run periodically or during low-traffic periods. The majority of high-volume applications fall into this category.
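
A minimal sketch of such a batch operation, assuming both sets expose an `items` table keyed by `id` and using SQLite as a stand-in for the real databases:

```python
import sqlite3
import time

def nightly_sync(update_db, query_db, last_sync):
    """Copy everything that changed on the update set since the last
    sync into the query set as one bulk operation."""
    src = sqlite3.connect(update_db)   # the low-hit update set
    dst = sqlite3.connect(query_db)    # the high-hit query set
    rows = src.execute(
        "SELECT id, status, updated_at FROM items WHERE updated_at > ?",
        (last_sync,),
    ).fetchall()
    dst.executemany(
        "INSERT OR REPLACE INTO items (id, status, updated_at)"
        " VALUES (?, ?, ?)",
        rows,
    )
    dst.commit()
    return time.time()  # watermark for the next nightly run
```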

For example, USPS tracking is used by thousands of users every day. Updates are brought from the low-hit set to the high-hit set every night. During the day, no matter how many times you refresh the page, your browser shows the previous night's data. UPS updates the data much more frequently, but it is still not real time: the "delivered" status showed up on the site only an hour after I had signed for my package.

Many e-tailers show their stock status as "in stock", "low", or "out-of-stock", but the status is updated only a few times a day rather than instantly, even though real-time data would be highly desirable.

Likewise, search engines are divided into an update set and a query set. In this case, each set requires sophisticated clustering strategies of its own. Readers can refer to The Anatomy of a Search Engine for details.

But there is another reason this approach is so popular: the low-hit set (which may well be a single computer) often already exists before the business goes online. For example, a computer store might already have its stock management system. Adding a new set of computers to serve web customers' stock queries makes a lot of sense and puts minimal modification and burden on the existing system.

Real-time updates
-----------------
When both real-time data and high volume are needed, customized software is required. Web mail, for example, falls into this category. Scaling mail is a common enough problem that ready-made software customized for mass mail is relatively easy to find.

However, early companies like Yahoo highly customized their mail servers to achieve the volume. Yahoo employed a number of strategies to scale, including partitioning, division, and memory-only data. First, Yahoo partitioned its servers by geographic location. User ids are unique globally, except in a few countries; Japan is one such exception. For most sites, like Y! Mail, logging in through another country's page routes you back to your own country's servers. Even where the Y! id is unique, the email addresses are not interchangeable: sending email to helloworld@yahoo.ca will bounce if the account was opened with yahoo.com.
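
A sketch of what the geographic partitioning amounts to; the cluster names and the routing-table approach are my assumptions for illustration, not Yahoo's actual setup:

```python
# Hypothetical routing table: each country's accounts live on that
# country's cluster, and a login through another country's page is
# simply redirected back to the home cluster.
COUNTRY_CLUSTERS = {
    "us": "mail-us.example.com",
    "ca": "mail-ca.example.com",
    "jp": "mail-jp.example.com",  # Japan runs its own id space
}

def home_cluster(country_code):
    """Return the cluster that owns accounts from this country."""
    return COUNTRY_CLUSTERS[country_code]
```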

The second strategy is to use an in-memory database. Y! had 300 million users worldwide, the last time I heard. Even at that huge number, it is still possible to store the entire id list in the memory of a single server. Only the most primitive data is stored alongside the id, such as the country code. Of course, the data itself is persisted back to storage and fully backed up. The list changes relatively slowly. Multiple servers hold copies of it, the other servers query them, and the data replicates among them. It takes a few minutes for a newly created account to propagate globally, which is acceptable in this case.
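
The arithmetic is plausible: 300 million entries at, say, a few dozen bytes of raw data each is on the order of 10 GB, within reach of a single server's RAM. A minimal sketch of such a table; the helper names in the comments are hypothetical:

```python
# Minimal sketch of the in-memory id table: only the id and the most
# primitive attributes (here just the country code) live in RAM; the
# authoritative copy is persisted and backed up elsewhere.
id_table = {}  # user_id -> country_code

def create_account(user_id, country_code):
    id_table[user_id] = country_code
    # persist(user_id, country_code)    # hypothetical durable write
    # replicate(user_id, country_code)  # reaches peers within minutes

def lookup(user_id):
    return id_table.get(user_id)  # None if not yet propagated here
```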

The in-memory database alone doesn't provide enough information for a useful app, and some data changes much more rapidly: when the user last logged in, when the session expires, whether the user has logged out. So another layer of division is used: the login server. The number of users logged in at any moment is far smaller than the total number of users, even though Yahoo allows a user to log in several times simultaneously from multiple browsers or computers. A cluster of computers is dedicated just to keeping track of logged-in sessions. Web apps like Y! Mail check the page cookies against a logged-in session to determine whether an operation is valid. The login server cluster is high-volume in itself.
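
A toy version of such a session store; the two-hour TTL and the token scheme are assumptions for illustration, not Yahoo's actual design:

```python
import time
import uuid

SESSION_TTL = 2 * 60 * 60  # assumed two-hour expiry
sessions = {}  # token -> (user_id, expires_at); one user may hold several

def login(user_id):
    token = uuid.uuid4().hex
    sessions[token] = (user_id, time.time() + SESSION_TTL)
    return token  # handed to the browser as a cookie

def validate(token):
    """Called by the app servers with the page cookie before acting."""
    entry = sessions.get(token)
    if entry is None or entry[1] < time.time():
        sessions.pop(token, None)  # unknown or expired session
        return None
    return entry[0]
```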

The application itself is handled by yet another cluster of servers. Those servers are designed to be stateless, so every operation can be routed randomly to any of them. This poses some limitations and challenges for the application's developers, but it works.
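
A sketch of what statelessness means at the request level; `session_store` and `storage` stand for hypothetical clients of the other clusters:

```python
def handle_request(cookie, action, session_store, storage):
    """A stateless handler: nothing survives between calls, so any
    server in the cluster can take any request."""
    user_id = session_store.validate(cookie)  # ask the login cluster
    if user_id is None:
        return "401: not logged in"
    return storage.perform(user_id, action)   # all state lives in storage
```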

Scaling: The Shared-Nothing Architecture
----------------------------------------
The Y! Mail architecture captures the core design of the "shared-nothing architecture". The increasingly popular LAMP (Linux, Apache, MySQL, PHP/Python) stack uses this as its scaling model. The same can be said of Ruby on Rails.

Another layer that requires clustering of its own is storage. Because the application servers are stateless, all data comes directly from the storage layer, which therefore takes extremely high hits. Caching and RAID techniques are certainly used. Y! uses NetApp solutions to scale its storage layer.
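
As a sketch of the caching side, here is a simple read-through cache in front of the storage layer; the block size and cache capacity are arbitrary choices:

```python
import functools

@functools.lru_cache(maxsize=100_000)
def read_block(path, offset):
    """Read-through cache: repeated reads of the same block are served
    from memory instead of hitting the disks again."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(4096)  # assumed 4 KB block size
```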
