When it comes to finding (and solving) computer bugs, the worst kind are the ones that seem to happen at random. Take for example the bug I was solving late Sunday night (or was it early Monday morning?):
I got a report via email that a person kept getting kicked out of the site. They could get logged in, but it wouldn’t stay that way, annoying them over and over again with a new login window each time. Strange, I thought, so I went to check it out for myself… I didn’t seem to have the problem. At the same time I happened to be in a remote desktop session with someone else, so I had them try. Sure enough, it happened to them as well.
My next guess was that it was browser related. I figured that they must be running IE, and I must be running Firefox, so I was protected from whatever IE bug they were hitting. I decided that I should look at my own IE, so I closed down Firefox and tested in IE. Sure enough, the problem happened. I played with the bug for a bit, trying to isolate what in IE was breaking, but after about 3 clicks of playing around with the site, everything started working. Strange, I thought.
I then went back to Firefox, and sure enough, now the problem started happening there. And just like before, after loading up a few more pages, the site would suddenly start working fine.
Immediately my mind jumped to sessions. Something must be wrong, and it must have been something that I changed recently. I mean, could Sunfox really have had a bug this big for over a year without us running into it? I tested some of the stuff I had changed recently on this particular site, and nothing had any sort of effect. I then went and tested one of our older sites, and sure enough it showed signs of the same bug. Why no one had ever reported it, I’m not really sure, but it was there, plain as day.
From there, it took a while to actually isolate what was happening. It turns out we had a race condition on the webpage. A race condition in computer lingo is when you have a computer doing 2 things at once, and depending on which one finishes first, different things happen. In some cases this can be a real problem, for example when two programs open and save the same file.
See, what was happening is the user would load the page (file A) at the same time as loading a dynamic image (file B). Both files make use of Sunfox’s database-driven sessions. Both file A and file B would load the session data into memory. Depending on which one ran faster (which varied based on internet connection speeds and how busy the server was), file A would sometimes finish first, and other times file B would.
If file A finished first (aka, if the page load finished first) it would write the user login data into the session database. The image would then come along and overwrite that data with nothing (because the image didn’t know how to handle user logins). This would nullify the user’s login, immediately kicking them out.
If file B finished first (aka, if the image load finished first) it would write nothing to the session database. The page would then come along and overwrite the nothingness with the user data that had been processed by the page. After this, everything would work just fine, because from this point on, the image would load the valid session data and write it back unchanged.
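To make the two orderings concrete, here is a minimal sketch in Python. It is not Sunfox’s actual code; the `SessionStore` class, the `simulate` function, and the session id are made up for illustration, and the "database" is just a dictionary. It replays the two cases described above: both requests load the same empty session, the page adds the login, and whichever request saves last wins.

```python
import copy


class SessionStore:
    """Hypothetical stand-in for a database-backed session table."""

    def __init__(self):
        self.rows = {}

    def load(self, session_id):
        # Each request gets its own in-memory copy of the session.
        return copy.deepcopy(self.rows.get(session_id, {}))

    def save(self, session_id, data):
        # Saving replaces the whole stored session with the caller's copy.
        self.rows[session_id] = copy.deepcopy(data)


def simulate(page_finishes_first):
    store = SessionStore()
    sid = "abc123"  # made-up session id

    # Both requests start at about the same time, so both load
    # the same (still empty) session into memory.
    page_session = store.load(sid)
    image_session = store.load(sid)

    # Only the page knows how to handle the login.
    page_session["user"] = "alice"

    if page_finishes_first:
        store.save(sid, page_session)   # login written...
        store.save(sid, image_session)  # ...then overwritten with nothing
    else:
        store.save(sid, image_session)  # nothing written...
        store.save(sid, page_session)   # ...then the login lands last

    return store.load(sid)


print(simulate(page_finishes_first=True))   # {}
print(simulate(page_finishes_first=False))  # {'user': 'alice'}
```

The first call ends with an empty session (the user gets kicked out), the second ends with the login intact, and the only difference between them is which request happened to finish last.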
We had never run into it in the past because a) we don’t often have multiple files loading sessions at the same time, b) we don’t have that many sites that make heavy use of sessions (at least, not for public things anyway), and c) for whatever reason, people never complained about getting kicked out of a few sites from time to time.
In any case, I got the Sunfox bug fixed, and spent a good hour or two on Monday patching all the websites running Sunfox. Tomorrow, perhaps I’ll post and talk about the code that powers Sunfox’s sessions. It adds some security to the system, and perhaps even some speed. Hope you find it as useful as I have.