Saturday, January 15, 2005

Need for speed

There was a very bad bug in Jasmine which caused the performance of the system to degrade pretty rapidly whenever you were connected. This problem has been around for some time. I was worried that it might be something having to do with the basic architecture of the system, which could be a very bad thing. On Thursday, Andreas and I decided to figure out the problem and fix it.

We could literally watch the frames per second drop over time, especially when both users were wandering around. Interestingly, a full garbage collect would clean up the problem for a short time, then it would begin to get slower again. We performed a number of message tallies on the system and we found that the system was spending more and more time inside of the garbage collection. It would get so bad that we could see it spending as much as 60-70% of its time performing a GC.

Andreas said that there were only a few things that could cause this kind of behavior - in particular having a large tenured object that referenced new objects that were in the youg object space. We did a number of additional profiles of the GC to get a better handle on what was going on and then we started turning off parts of the system to see where it was happening. This helped somewhat, but we did not return to our previous, unconnected performance. Finally, I had narrowed down the location of the problem sufficiently that I decided to just look through the code to see if anything might pop up that could be the culprit. Sure enough, there was an object called "selectorHistogram" that was used to keep track of the number of messages recieved from the remote user that would be sent to particular TObject targets. Croquet was running, but I had commented out this code. I removed the comments and allowed it to execute through and what I found was that instead of keeping track of the number of TObject/message pairs, which typically would not grow very fast, it was adding a new item for every single message sent. This meant that this array was growing in an unbounded way and was exactly the kind of object that would cause the problems with the garbage collector that we were seeing.

The reason this started happening with Jasmine, and why we never saw it with Solar, was that Solar would convert a TeaName into a TObject as soon as a message was received. Due to problems in ordering (and other things) we determined that it would be best to convert only when we were about to execute the TeaMessage. What this meant is that instead of adding a TObject and a corresponding message to the histogram, where there was only one TeaObject, we were adding TeaNames, which were constantly being generated for each message send. This meant that the histogram was constantly adding new objects with none of them matching any of the previous ones. Hence, unbounded growth of the IdentityDictionary. We were able to just look at this array as the system was running and watch it grow.

We decided to remove it completely from the system (we had never really used it much anyway) and we began looking for other similar large objects that could cause problems. It turned out that we had another similar histogram used for the message/object pairs that were SENT. This meant that this array was growing on both ends - the message send and recieve. Once we removed this one as well, we found the connected performance of Croquet to be quite similar to the unconnected speed of the system.

This was not an exercise in tuning - I really needed a working version of Jasmine for demonstration purposes. Now it seems to be working pretty well.

No comments: