Weird bugs: Django, timezones, and importing from eggs

Every so often you hit a bug that makes you question your sanity. The past several days have been spent chasing one of the more confusing ones I’ve seen in a long time.

Review Board 1.7 added the ability to set the server-wide timezone. During development, we found problems using SSH with a non-default timezone. This only happened when updating os.environ[‘TZ’] to something other than our default of UTC. We’d see the SSH process (rbssh, our wrapper for SSH communication) break due to an EOF on stdin and stdout, and then we’d see the development server reload itself.

Odd.

Since this originated with a Subversion repository, I first suspected libsvn. I spent some time going through their code to see if a timezone update would break something. Perhaps timeout logic. That didn’t turn up anything interesting, but I couldn’t rule it out.

Other candidates for suspicion were rbssh itself, paramiko (the SSH library), Django, and the trickster god Loki. We just had too many moving pieces to know for sure.

So I wrote a little script to get in-between a calling process and another process and log all communication between them. I tested this with rbssh and with plain ol’ ssh. rbssh was the only one that broke. Strange, since it wasn’t doing anything obviously wrong, and it worked with the default timezone. Unless it was Paramiko somehow…

For the heck of it, I tried copying some of rbssh’s imports into this new script. Ah-ha! It dropped its streams when importing Paramiko, same as rbssh. Interesting. Time to dig into that code.

The base paramiko module imports a couple dozen other modules, so I started by narrowing it down and reducing imports until I found the common one that breaks things. Well that turned out to be a module that imported Crypto.Random. Replacing the paramiko import in my wrapper with Crypto.Random verified that that was the culprit.

Getting closer…

I rinsed and repeated with Crypto.Random, digging through the code and seeing what could have broken. Hmm, that code’s pretty straight-forward, but there are some native libraries in there. Well, all this is in a .egg file (not an extracted .egg directory), making it hard to look through, so I extracted it and replaced it with a .egg directory.

Woah! The problem went away!

I glance at the clock. 3AM. I’m not sure I can trust what I’m seeing anymore. Crypto.Random breaks rbssh, but only when installed as a .egg file and not a .egg directory. That made no sense, but I figured I’d deal with it in the morning.

My dreams that night were filled with people wearing “stdin” and “stdout” labels on their foreheads, not at all getting along.

Today, I considered just ripping out timezone support. I didn’t know what else to do. Though, since I’m apparently a bit of a masochist, I decided to look into this just a little bit more. And finally struck gold.

With my Django development server running, I opened up a separate, plain Python shell. In it, I typed “import Crypto.Random”. And suddenly saw my development server reload.

How could that happen, I wondered. I tried it again. Same result. And then… lightbulb!

Django reloads the dev server when modules change. Crypto is a self-contained .egg file with native files that must be extracted and added to the module path. Causing Django to reload. Causing it to drop the spawned rbssh process. Causing the streams to disconnect. Ah-ha. This had to be it.

One last piece of the puzzle. The timezone change.

I quickly located their autoreload code and pulled it up. Yep, it’s comparing modified timestamps. We have two processes with two different ideas of what the current timezone is (one UTC, one US/Pacific, in my case), meaning when rbssh launched and imported Crypto, we’d get a bunch of files extracted with US/Pacific-based timestamps and not UTC, triggering the autoreload.

Now that the world makes sense again, I can finally fix the problem!

All told, that was about 4 or 5 days of debugging. Certainly not the longest debugging session I’ve had, but easily one of the more confusing ones in a while. Yet in the end, it’s almost obvious.

Sentience discovered in the Linux kernel

Ladies and gentlemen, after much experimentation, I have made a remarkable discovery. Perhaps the very first case of a sentient AI has been discovered, sitting right under our noses, in the Linux Kernel. With such a complicated codebase that has evolved greatly over the years, there are certainly more surprising places for it to spring up, but it’s still quite unexpected.

And where, specifically, has this sentience manifested itself? The suspend/resume code.

See now, like many of you, I’ve dealt with the instabilities of suspend/resume. I’ve considered it to just be buggy, unreliable, and possibly incompatible with my hardware. That is, until I realized that there’s a pattern. One that began to make a sort of sense.

A couple months back, I gave suspend/resume another shot, and to my surprise it worked. I figured that Ubuntu 10.04 finally fixed it, but it still wasn’t perfect. I still noticed problems.

The first thing I noticed was that when I unsuspended at work, I couldn’t use my volume keys. Everything else was fine, but my laptop’s volume keys didn’t register as a key press on anything. If I suspended again and brought it back home, the keys would work fine. If I suspended at home and resumed at home, I wouldn’t have the volume key problem. Weird, but just buggy, right?

It was a couple weekends ago when I suspended my laptop to take it somewhere. It wouldn’t suspend at all. Just hard-locked. This continued until the week, when it worked again. Last weekend? Same problem, couldn’t suspend. Monday, it worked fine.

It was then that I realized suspend/resume was breaking deliberately! See, my laptop feels more comfortable at home, less so at work but it tolerates it (with some complaining), but absolutely doesn’t want to leave during the weekend. It’s like a cat that just wants to be in a familiar environment, selfishly vying for your attention through mischievous acts. Look at it hard enough and the pattern emerges. It’s undeniable.

That got me thinking. What other possible instances of AI have we been misconstruing as bugs or random glitches? All those inter-connected street lights that occasionally shut off as you walk underneath them? Maybe they’re just shy, or they hate you. Maybe NES cartridges just found being blown stimulating.

So remember guys. Windows suspend/resume may work just fine. Mac too. But Linux’s suspend/resume isn’t a buggy pile of crap. It’s an intelligent buggy pile of crap, that just wants to be loved.

Best Bugs

I was having a chat with my brother earlier about software bugs, and I started trying to remember about the best bugs I’ve encountered in software I’ve had a hand in. Below are my list of favorite bugs that I found entertaining. I’m kind of curious as to what other people would choose for their favorites.

My favorite bug would have to be Gaim’s flying buddies. When the rewrite of the Gaim buddy list was commited, in 0.60, we had a fun little bug where the drag-and-drops weren’t completed. This triggered a Gtk bug (I think it was Gtk’s bug?) where the nodes in the GtkTreeView would fly around the screen a bit from point A (where the node originally was) to point B (where the node is now). When the buddy list was off-screen, this made it particularly fun. As these were flying around, we quickly named them Flying Buddies.

Between classes one semester, I wrote a Snake game for my TI-89 calculator. It was rather easy to do, but I had an off-by-one error that generated what I called “snake droppings.” When the snake ate one of the blocks on the screen, it would of course extend. As soon as the tail started moving again, it would leave a pixel or two behind, hence the name.

My third favorite bug was during the development of my BilliardZ game for the Sharp Zaurus SL-5×00 PDA. Occasionally when hitting a ball, a big black hole would open up on the board, and the ball would disappear in it. Playing pool with black holes littering the table is a little inconvenient. I’m still not sure what caused it, but I ended up fixing it.

Speaking of bugs, something messed up on gnome-blog. I’ve spent over an hour trying to get it working again. Works now though.

(Update: I somehow lost the top paragraph. I don’t know how, but it’s back. That should provide more context.)