The Big Grid

Cache (On Clustering Part V of VII)

Caching
-------
Cache to an application is like a palette to a painter. A set of data is temporarily put on the cache as the application runs, the same way a painter picks a few color to the palette as he works on the painting. Data might be combined and modified in the cache, the same the color is mixed on the palette. The cache keep some frequently used data in memory, and save the application from accessing disk drive or database all the time, which save time. The palette keeps the mostly frequently used color, and reduces painter trips to the color tub. Of course, no analog goes all the way. In this case, data does change and need to store back. But, painter doesn’t put the color back into a tub when he discovers a new color that he likes.

Memory is multiple magnitudes faster than drives and database. By saving access to drives, and keep some data in the memory, the application runs a few times faster. Of course, if requests are extremely random that doesn’t tend to repeat, and/or the data set is extremely large compare with the memory size, cache might not be well utilized and incurs unnecessary overhead. But, it should be looked at exception.

If multiple machines are used, and data need to be stored back, keeping the data in the cache of each machine can become a challenge. When we have one machine, we always know if the data is updated or not. If we modify it, then it is the new data, and we need to store it back. If we didn’t modify it, then it is up-to-date. With multiple, we didn’t modify it, some other machine might. Synchronization mechanism is needed and it must handle machines that try to modify the same data at the same time. Distributed cache is designed for just that. In the Java world, multiple J-Cache implementations are available. Tangosol Coherence appears to be the leader of the space and claims deployed customers in multiple industries.

Turning the cache off can be a painful answer. It means now we need a few times more machines just to achieve the performance we had with one. One strategy is using cache in the data store level, which helps. It is like having a palette, but instead of having carrying it, it is fixed on the table. It is often what the “Share-Nothing Architecture” does.

Distributed cache is relatively new and requires additional integration. I envision distributed cache will be integrated part of application server in the future and will be part of J2EE and .Net offering. I also saw a LAMP stack company ActiveGrid job post for engineer to implement distributed cache.

In my opinion, the use of distributed cache is preferred over share nothing architecture and it will be the model of future. I am actually developing one myself as my hobby. We will see how the industry unfolds on this.

Tag: clustering, cluster cache, distributed cache, database, grid, virtualization

Stateful Session (On Clustering Part IV of VII)

Of course, high-volume computing didn’t stop at real-time updates. Let’s continue!

Stateful Session
----------------
Applications like Y! Mail can use stateless session approach for to achieve scaling. However, in some application, it is highly desirable, in programming perceptive, to have stateful session. Information of the current logged in user is good candidate to be store in a session. Some site might enable GUI hint to allow editing of some attribute. In other case, stateful session also enable easy design of web application that involves multiple steps. A questionnaire might involve multiple pages. With stateful session, previous page’s answer can be retained in the session. Some new high interactive site that provides full desktop-like experiences use stateful session heavily to store current active view, opened document, table sorting order etc. Session does not tend to be survives forever. It is often discarded after timeout or user log off. Scaling such application requires different strategies.

(side note: many developers site use shopping cart as an example for stateful session. However, it is probably not the right approach. User might log out due to various reasons. But, it is often desirable to have the cart still when user logged back in.)

Application server or web server framework is often used for session management. Scalability is often built into the application server of framework itself. For web application, HTTP load-balancer (dedicated hardware) are often use to aid the task. A HTTP load-balancer understands HTTP session ID, or rewritten URL and always reroute the same client to the same server, (unless the server failed) such that the session can be lived in the physical server. Failover of session has been a major selling point of an application server.

It is important to note that the functionality of stateful session can be simulated by moving all the trasient state to browser (probably hidden fields in HTML form), always store them back to the data layer, or both. In a way, stateful session can be viewed as "division" strategy. By factoring out data that is transient and short-lived, and keep them in memory, it reduces the hit to the data layer and increases the performance. The division also enable the use of HTTP load-balancer.

Tag: clustering, database, grid, virtualization

Database Driven and Entity Tier (On Clustering III of VII)

Database Driven
---------------
Database provides query, indexing, transaction, storage management, data security and other support. Database driven application is very common. Database tier is being inserted in between the application and storage. Query optimization and indexing enhanced performance and help the application to achieve higher volume. However, its power and convenience is usually turned to extra feature to the application and rarely leave user impression of performance. Share-nothing strategy can be used. Bandwidth from the application tiers and database tiers are very critical to the overall performance. Before gigabyte ethernet becomes common, high-volume application utilize multiple 100Mbps ethernet card per machine to achieve the desired bandwidth.

The scaling and tuning of database its an art of itself. Different optimization models and expertise are widely available, some for signficiant cost.

When performance is absolutely critical, a subset of the table can be divided into memory-only database, like TimesTen (now a Oracle product) It provides advanced functionality like other DBMS. However, its keeps all its data in-memory, and requires no disk access for query.

Entity Tier
-----------
In the J2EE world, EJB was famous on its EntityBean concept. EntityBean is Component model representation of a single entity (relational db) in a Database. (loosely speaking, component is an object which lifecycle managed by a Container.) With this model, an application into two tiers: the presentation tier and the business logic tier. The EJB model represents the later. Direct access (thru RMI or RPC) by client pretty much went out of fashion. The presentation tier most likely is a web server that serve HTML or expose the Business logic functionalty as Web Service. The scaling and load-balancing of Entities are enabled by the remote Stub and Skeleton model. Clustering is provided by that application server with EJB container. The Stub on the client side, when instantiate, was assigned to one of the server.

The database access (persistent) of the entity is sometimes provided by the container. However, no serious application developer didn't complains about the performance of this container managed persistent (or CMP). The first EJB design was simply too heavy-weighted. EJB 3.0 moves to a much ligher-weight model. It appears to be good enough for long-term sucess. However, a very wide selection of programming model and framework exists.

Tag: clustering, database, grid, virtualization

Ethernet is pretty much dead

I still remember that an instructor of my college favored token ring and thought it was a better technology. He envisioned that if token ring had better marketing at the time, it would have dominated the market. The cost differences, he argued will disappear because of the huge economy of scale once a technology like that had become mainstream. But, it was too late. Ethernet had already dominated the market. I thought he was right.

It was eight years ago. After all, both of protocols vanished. People still refer to Ethernet all the time. However, is Ethernet really Ethernet anymore?

Since the market moved to 100Mbps, the concept of Hub phrased out. Because a hub does not wait until the whole frame (layer-two packet) to be transmitted before forwarding the frame, a hub has advantage on lower latency. However, it becomes less significant as the speed increased 10 folds. When the frame size remains the same, the latency is cut by 90%.

The beauty or uniqueness of Ethernet is actually happening in the layer 2. It sends frame after waiting a random period, send the frame, do collision detect, and resend if necessary. Now, Ethernet is connected port to port from a computer to the switch. All Ethernet are full duplex, Cat-5 twised pairs cable use different wire for transmission and reception. It essentially guarantees that no collision is possible. Say bye to aloha protocol that powered corporate network for more than 10 years.

Using switch actually bring significant performance advantage. In a busy network, the theoretical network thru put is about 57% (if I remembered it correctly). With full duplex and use of switch, we are getting pretty close to 200Mbps.

Most of the switch (at least entry level) use 5-port switch chip. The interconnection is likely to be Star Network. The Mac layer packet is forwarded exactly once to reach the destination.

8-Port Switch can be constructed by combining two chips. One port of each chip is connected on the circuit board level. Packet is forward 1 to 2 times. 16-Port can be constructed using 5 of such chips, with one port of each of the 4 chip connecting to a port on the 5 chip. Packet is forward 1 or 3 times.

Tree network won the final battle.

Cluster (On Clustering Part II)

Cluster
-------
When the volume cannot be satisfied by wait and division, and it is more economical to spend engineers’ time comparing with buying massive computer, multiple machines are used to form a cluster, software and configuration can be modified to accommodate.

As the data scenario getting more complex, more scaling strategies are employed. Let’s try to categorize the nature of high-volume applications.

Static data
------------
Scaling static data to high-volume is easiest. Putting up the multiple machines, replicate the data, and make them serve client randomly.

Non real-time updates
---------------------
Data in most useful applications changes, however, they don’t necessarily change frequently. Even if the data changes all the time, a user may not be given the most updated version of the data. Those applications can be divided into two set of computers: low-hits updates, and high-hits queries. Updates are done as frequently as needed to the updates set. All queries hit routes to the second set which has its own set of data that is a snapshot of the past. It scales the same way as for static data. Once in a while, changes are aggregated from the first set and updated to the second set as a batch operation. The batch operation can be done periodically, or done during lower-traffic period. Majority of high-volume applications fall into this categories.

For example, USPS tracking are used by thousands of user everyday. The updates are brought from the low-hit set to high-hit set every night. During the day, no matter how many times you refresh the page, it was the data of yesterday night to show in your browser. UPS update the data much more frequently. But, it is still not real time. The “delivered” status showed on the site only an hour after I had sign for my package.

Many e-tailers show their stock status as “in stock”, or “low”, or “out-of-stock”, but the status is not updated instantly, but only a few times a day, even though real-time data will be highly desired.

Likewise, search engine are divided into update set and query set. In this case, both set require sophisticated clustering strategies itself. Reader can refer to The Anatomy of a Search Engine for details.

But there is another reason making this approach is so popular. The small-hit set (might well be one computer) already exists before the business goes online. For example, a computer stores might has its stock management system. Adding a new set of computer to serve web customer queries about stock makes a lot of sense and bring minimal modification and burden to an existing system.

Real-time updates
-----------------
When both real-time data and high-volume are needed, customized software are needed. For example, Web mail fall into this category. Scaling of mail is a common-enough problems. Finished customized software for mass mails is relatively easy to find.

However, early companies like Yahoo highly customize their mail servers to achieve the volume. Yahoo employed a number of strategies to scale, including Partition, Division, memory only data. First, yahoo partitioned its servers with geography locations. User id is unique globally, except for a few countries. Japan is such as exception country. For most sites, like Y! Mail, logging thru other countries page will route you back to the server of your country. Even the Y! id is unique, the email address are not interchangeable. For example, sending email to helloworld@yahoo.ca will bounce if the account is opened with yahoo.com.

The second strategy is to use in-memory database. Y! has 300 millions user worldwide, the last time I heard of it. Even with it huge number, it is still possible to store the entire id list in the memory of a single server. Only the most primitive data are stored along with the id, such as the country code. Of course, the data itself is persisted back to storage and is fully backed up. The list changes relatively slowly. Multiple servers are used along with the server that other server will do the query. The data replicate among them. It takes a few minutes for a newly created account to propagate globally. In this case, it is acceptable.

To make a useful app, the in-memory database doesn’t provide enough information. Some data also might change more rapidly. For example, when did user logged in last time, when the session is expired, has user logged out. Another layer of division is used: login server. The number of user who currently logged in is way smaller than the number of total user, even though yahoo allows user to log in a few times simultaneously with multiple browsers or computers. A cluster of computers are used just to keep track of logged in session. Web apps like Y! Mail check against a logged session with the page cookies to determines if the operation is valid. The logged server is high-volume itself.

The application itself is handled by another cluster of server. Those servers are designed to be stateless. So, all operation can be randomly route to any of the server and serve user request. It posed some limitation and challenge to the developers of the application, but it works.

Scaling: The Share-Nothing Architecture
---------------------------------------
The Y! Mail architecture capture the core design of "Share-Nothing Architecture". The LAMP (Linux Apache MySQL PHP/Python) stack which is increasingly popluar use this as the scaling model. The same can be said for Ruby on Rails.

Another layer that requires clustering itself is the storage. Because the application itself is stateless, all data is coming from the storage layers directly. The storage receive extremely high hit. Cache and RAID techniques are certainly used. Y! uses NetApp solution for scaling of its storage layer.

Tag: clustering, database, grid, virtualization

High-volume computing (On Clustering Part I)

A long way coming, I started my programming career as a Technical lead of an Object-Relational mapping software, Castor JDO. Concurrent programming, transaction, real-time data, data integrity, caching, distributed, clustering, scalability, load-balancing, high availability, and high-volume have always interested me since my first job. All of these are old terminologies that exist since the early days of computing. I called them the essences of Enterprise Computing. But, after years, such software is still a huge challenge and great endeavor.

High-volume
-----------
High-volume applications aren’t new at all, but the web makes many applications becomes high-volume. It the early days, high-volume is often handled by massive computer and highly customized software. Just like water always finds its way downward, the market finds a cheaper way to do what needs to be done. As water gather in a meadow and the water level rises, it finds (or breakout) new and faster ways to go down. As high-volume applications become common, new and cheaper way are designed to handle them.

One machine
-----------
The first method to scale is to fully utilize one machine, adding more memory, faster disk, and more CPU if the system supports it. Multi-cores, or multiple CPU system becomes common. Sun’s provides some really good resources in this area. You can being with this video Sun Nettalk (Scale)

Wait
----
The second cheapest solution to scale to higher volume sometimes is to wait. Of course, it is often unacceptable in most case. But, computer is doubling its speed every 1.5 years. The market constantly introduces new technologies announced that helps increase system preformance. Who said “waiting never solve the problem?” Same apps will be running twice as fast, if you spend the same hardware cost again 18 months later. If your volume is doubling less frequently than every 1.5 years, your problem is solving itself eventually.

Cost of making software or is very expensive. The cost of spending a man month of engineer time cost the same as a decent mid-range server. One man month doesn’t really get that much done. It is especially true if the software develop cannot be resued. Good engineers are always overworked, and should be considered as limited resource. However, salary is often considered as a fixed cost to business. Sometimes management favors spending engineer time over buying hardware. Of course, acquired hardware isn’t maintenance free and incurs cost.

Division
--------
Division is arguable the most important aspect of high-volume computing. As you will see from the following paragraphs and later blog, high-performance often comes from the right division. The first steps are to divide different service into different machine. For example, move HTTP service out of the machine that serving mail. The second is to partition the service across nature boundary. For example, mail belongs to a heavy site to be single out onto another machine. Going deeper, you might move HTTP server that serve images server from server that serve HTML pages.

Application might also be divided according to its scaling behavior. Application that serves static information is easiest to scale. The static part can be factoring out into a different machine or a set of machine, and leave as much as CPU to the part that change frequently. Move storage into a separate machine, or dedicated hardware. Adopt multi-tiers architecture and splits each tiers into machines with high-speed network connection.

Tag: clustering, database, grid, virtualization

On Maintainability

My first computer comes with MS DOS 3.1. Books were very expensive for a 5th grade kid, there was no internet, and “/?” parameter wasn’t incorporated in the dos command. Only knows “cd”, “c:”, “echo” commands, the menu are for display only. I made a batch of single-letter name batch files to achieve menu like behaviour. Not for long, I encounter of software maintainability shortly after my first menu system was done: I got a new game from a friend. The new game became one of my favorite, and I would like to put it ahead of other games. I rename each batch files and edit each of the menu’s “echo” line. Not for long, I gave up the idea of having a menu. It was my first encounter of software maintainability. Even for simple menu like this, the cost of ownership was beyond the time spent of creating the system in the first place.

Ok, enough stories of the good old day. :-)

Tag: maintainability

PC-compatible XT

I still remember the day that my mother bought us our first PC to share with my brother and sister. I was grade 5 if I remember correctly.

My mother was business woman and she saw that computer was more widely and wanted us to learn it early on. At the time, the original “red-and-white” Nintendo was fiercely popular. A lot of our friends had it. She rejected the idea of Nintendo and bought us the computer. The monitor alone cost about a month of most people salary. As recommended by the sales, we got higher model of Thompson CGA monitor. The speed was in middle of the pack, 5MHz PC-compatible XT machine with a “Turbo” switch that could bump the speed to 10MHz. Other configuration was pretty advanced. It has 2x 1.2MB floppy drive (dual side), a 30MB self-compressed hard drive from Seagate, and 640KB of memory.

Some software and games were included. The day to pick up the computer had come, when we pick up the computer, the sales showed us how to launch a few of the software: putting the floppy in, type “a:” type “dir”, then type something ended with “exe” and change disk when it asks you to. Looked easy!

When we got home, I was confused about “dir” and “dos”. Some other software was actually self-booting. So, for a few days, I keep booting to the simple games, until I went back and ask about the “dos” command. And, I started by typing whatever words showed in the dir list, and spend days on the stack of disks.

After awhile, I learned about batch file, and trying to create a menu such that we could jump directly to the games with a letter and enter key. I called it my first “programming” experience. It was long way coming.

Complexity Removed

First post! This blog will be a placeholder for my thoughts of computer/software engineering/development and other technologies issues.

The title of this blog was inspired by a quote of Burt Rutan, the architect of SpaceShipOne: “Complexity we remove can never fail.”

Burt is definitely not the first one who stress on simplicity. Einstein’s left this famous quote half a century ago: “Things should be made as simple as possible, not simpler.”

Another famous quote from an anonymous is “KISS – Keep It Simple, Stupid!” While I totally believe simplicity is the way to go, I do think this quote is as accurate as the quote from Brut or Albert, at least in the context of engineering.

In my experience, simplicity is not likely to occur naturally. Usually, the first draft of my software is more complicated than necessary. Keeping it is not enough. We need to take additional steps to pursue it.

The first version is often like the path we draw when the first time we try to solve a large maze. The first is often not the shortest. It takes additional steps to remove the unnecessary steps.

Simplicity is not mere of keeping. It is a matter of “remove”; it is a matter of “make”.