Quest to a more efficient LockSet (volatile field, semaphore, reentrant lock)

While I am staying up late, I recall an article about anti-sleeping pill that I read a couple of weeks ago. Somehow, I tend to have a much clearer mind late at night (I mean after I awake for many hours :-P). It explains why many of my blog was written late at night.

The feeling that I can stay up as late as I want in a Sunday is very good. A long weekend means that I can have two really long late nights for my own. Labor Day wasn't only an extra weekend day in a week, but also doubling my productive time to my project.

And, I spent it on coding a small part of the cluster cache project: the quest to a more efficient LockSet. It actually started while I was driving 100 miles north Friday. I was driving alone and turned off the stereo in my car to think about the problem. After about an hour running thru my 4-steps synchronized block LockSet in my head, I realized the only way to get an more efficient (in term of how much synchronization I need to do) lock set requires the ability to enter a semaphore before leaving the other.

I read the pseudocode of lock from Gray/Reuter book. The sequence looks something like that:
1/ semaphore get on the data structure for the lock set (ie, spinlock S),
2/ find the node represent the lock, or create one if not exists
3/ semaphore get on the node (ie, spinlock N)
4/ semaphore give the lock set (ie, unlock S)
5/ acquire the lock (may block the thread)

It sounds simple, but it cannot be done “efficiently” with Java’s synchronized blocks.

Spinlock (compare and store) is a primitive pillar in concurrent world that it cannot be reduced. Of course, you can simulate a spinlock using Java synchronized block. But, synchronized block is itself built by Spinlock.

Java 1.5 provided a set of new concurrent utilities. I know that it has a lock interface that would allow me to do what the Gray/Reuter lock implementation did. So, I dig deep into it.

After digging deep into the code, I found ReentrantLock is itself pretty expensive. Semaphore is a bit lighter, because it doesn’t maintain a linked list for thread, but it still maintaining more state than I need. But, in the process, I found what I want, the spinlock.

It is exposed to the API thru (AbstractQueuedSynchronizer/AtomicXYZ.compareAndSetState(int, int)). The “spin-unlock” is achieved by AbstractQueuedSynchronizer.setState(). The javadoc didn’t mention the memory model constraints. However, the way Semaphore implemented using those methods implies that compareAndSetState() and setState() acts as a read-barrier and a write-barrier respectively.

I would expect setState() to call a method in the other class that declare native. To my surprise, it simply sets the volatile field, that compareAndSetState() set using a native method.

Why it reqiures setting a volatile field only? I remember volatile as read/write-ordering on that specific field. But, a write-barrier is a different guarantee.

It was because I rarely find a useful case for volatile field. Because the guarantee was on the field only, you cannot use it to guard another data. While I aware JMM changes in Java 5, and followed the mailing list for a few good months, I didn’t pay much attention to volatile field changes.

Brian Goetz explained it very well with his article in developerWorks.

The semantic of volatile field in Java 5 is updated to have memory boundary guarantee.

Now, after this quest to a more efficient lock set, I gained the understanding of not only how to implement it efficiently, but also what was missing in older version of Java, why JMM change is needed, way better understanding of the JMM itself.

It is a good feeling that what I need is already created by someone else before I need it.

Now, I think I know JMM very well, come challenge me with tough questions! :-P

Tag:

Good load test to expose the vunerabilities of the cache

In response to a user question on load test, I think it worth a blog by itself. Those are excellence questions. :-)

Coherence is a pretty mature product. I would think it should work pretty for the read-only cases.

I can think of 4 areas that the cached system can be choked with:
a/ network-load for cache synchronization,
b/ cpu load for cache management and synchronization,
c/ cpu load for doing deserialization of your data,
d/ and database access

Depends on the way which the application access the data, the system might still choke on the last two before the cache management overhead become a problem.

Database might still be the bottleneck
--------------------------------------
For example, if I have an application need to scale to high number of users who didn’t sharing too much data among them (HR application that most user concerns mostly about his/her own data), I would want to watch the CPU, file I/O, and network utilization of the database as I am adding more machines to cache cluster, especially if it is a single database (or a cluster of database) that all machines connect to. It is good to do a little projection on how many cached machines that the single database can support.

Deserialization
---------------
If I have an application that most machine shares the same set of data, then I would watch for the time that each machine spent on deserialization. If each machines request the same cache, the data will be sent over the wire and being deserialize on each machine. The time spend on deserialize might be significant. I am not sure Coherence’s near-cache is cache as object or serialized form. It would be good to check. Even if the near-cache is kept as object, with moderate changes to data, you might still see quite a bit of deserialization, because the cache will need to re-fetched.

Really large cluster
--------------------
If you have a really large cluster (say 64 machines or up), then you might need to profile the first two as well. The overhead is believed to be small, but the total times spend on the communication is at least in the magnitude of bigO(n^2), where n is the number of machine. Even the overhead is unnoticeable with 4 machines might show up as significant for when you have 64, for example.

Tag: , , ,

Questions about Context Switch in VM

Blogging is often about opinions, solutions, and feedback. But, what if I have questions?

While watching the MySQL video, it surprised me when Stewart Smith said the storage note daemon runs on a single thread. They have its own context switching that is more efficient than using the thread from the OS.

Talking about context switching doesn't work best for some situation, I think of another situation: when the OS runs inside VM like VMWare. Even when the primary OS is mainly idle, the guest VM is still not very responsive.

Would it be because we have too many context switches happening in the primary OS, and it makes context switch in the guest OS happens in bad time?

What VM system is doing to help this situation? Will we have a configuration flag for Linux (or other OS) to let the OS context switch differently when it is a guess? (Of course, the guest machine is not supposed to know it is guest, unless you config it as such.)

Tag: ,