A good load test to expose the vulnerabilities of the cache

In response to a user question on load testing, I think it is worth a blog post by itself. Those are excellent questions. :-)

Coherence is a pretty mature product. I would think it should work pretty well for read-only cases.

I can think of four areas where a cached system can choke:
a/ network-load for cache synchronization,
b/ cpu load for cache management and synchronization,
c/ cpu load for doing deserialization of your data,
d/ and database access

Depending on the way the application accesses the data, the system might still choke on the last two before the cache-management overhead becomes a problem.

Database might still be the bottleneck
--------------------------------------
For example, if I have an application that needs to scale to a large number of users who do not share much data among themselves (say, an HR application where each user is mostly concerned with his/her own data), I would watch the CPU, file I/O, and network utilization of the database as I add more machines to the cache cluster, especially if it is a single database (or a database cluster) that all machines connect to. It is good to do a little projection of how many cache machines the single database can support.
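The projection can be a simple back-of-envelope division. A minimal sketch, where the class name and all of the numbers (database queries/sec, per-node cache-miss rate) are hypothetical placeholders you would replace with figures measured in your own load test:

```java
public class DbCapacityProjection {
    // Rough upper bound on cache nodes one database can feed:
    // divide the queries/sec the database sustained under load by the
    // cache-miss queries/sec a single cache node generates.
    static long maxCacheNodes(long dbSustainedQps, long missQpsPerNode) {
        return dbSustainedQps / missQpsPerNode;
    }

    public static void main(String[] args) {
        long dbSustainedQps = 5_000; // assumed: measured database ceiling
        long missQpsPerNode = 200;   // assumed: measured miss traffic per node
        System.out.println("Projected max cache nodes: "
                + maxCacheNodes(dbSustainedQps, missQpsPerNode));
    }
}
```

With these assumed numbers the single database tops out around 25 cache nodes; the point of the exercise is to find that ceiling before production does.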

Deserialization
---------------
If I have an application where most machines share the same set of data, then I would watch the time each machine spends on deserialization. If every machine requests the same cache entry, the data will be sent over the wire and deserialized on each machine, and the time spent on deserialization might be significant. I am not sure whether Coherence's near cache holds entries as objects or in serialized form; it would be good to check. Even if the near cache is kept as objects, with moderate changes to the data you might still see quite a bit of deserialization, because the cache will need to be re-fetched.
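One way to get a feel for this cost is to time round-trips through plain JDK serialization (Coherence has its own wire format, so treat this only as a ballpark; the class name and the sample payload below are made up for illustration):

```java
import java.io.*;
import java.util.HashMap;

public class DeserializationCost {
    // Serialize with standard JDK object serialization.
    static byte[] serialize(Serializable obj) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
                out.writeObject(obj);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Object deserialize(byte[] bytes) {
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a cached value; real entries would be your domain objects.
        HashMap<String, int[]> entry = new HashMap<>();
        for (int i = 0; i < 1_000; i++) entry.put("key" + i, new int[64]);
        byte[] wire = serialize(entry);

        // Average the cost over many rounds, as a load test would.
        int rounds = 100;
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) deserialize(wire);
        long avgMicros = (System.nanoTime() - start) / rounds / 1_000;
        System.out.println("Avg deserialization: " + avgMicros
                + " us for a " + wire.length + "-byte entry");
    }
}
```

Multiply that per-entry cost by your miss rate and cluster size and you can see whether deserialization alone will eat a CPU core.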

Really large cluster
--------------------
If you have a really large cluster (say 64 machines or more), then you might need to profile the first two as well. The overhead is believed to be small, but the total time spent on communication grows on the order of O(n^2), where n is the number of machines. Overhead that is unnoticeable with 4 machines might show up as significant when you have 64, for example.
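To see why the growth is quadratic: in the worst case every machine talks to every other machine, which is n*(n-1)/2 point-to-point pairs. A tiny sketch of the arithmetic (actual Coherence traffic patterns may scale better than this worst case):

```java
public class ClusterChatter {
    // Pairwise links in a fully connected cluster of n machines: n*(n-1)/2.
    static long links(int n) {
        return (long) n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        System.out.println("4 machines:  " + links(4) + " links");
        System.out.println("64 machines: " + links(64) + " links");
    }
}
```

Going from 4 machines to 64 is a 16x increase in machines but a 336x increase in pairwise links (6 vs. 2016), which is why per-link overhead that was invisible in a small cluster can dominate in a large one.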
