As the freshly brewed coffee enters my mouth, I experience my first glimpse of consciousness for the day. “Where am I?” I mutter, in broken English. The gray walls around me slowly come into focus, lit by the flickering of a long-in-the-tooth fluorescent bulb. The top half of a man’s face appears over the top of my cubicle wall.
“How’s the wonderful world of iNetConjoinApp?”
The caffeine must have made it past my blood-brain barrier, as I recognize at once that I’m at EnergyModCo, where I am one of a handful of employees. The half-head belongs to Freyr, EnergyModCo’s COO, lead customer service rep, and deployment technician.
“Umm, it’s, well, I just started working on the –”
“Great, that sounds good. You remember BigBoxoCo?”
“You mean, as in our biggest cust–”
“There’s a problem at one of their warehouses. Something to do with our lighting controller.”
“Oh?”
“I just got off the phone with the warehouse manager. All the lights went out for a few minutes, but they’re back on now.”
“Uh, that’s bad. Thank goodness they have skylights.”
“Nope. This is their first two-story warehouse. The only light the first-floor customers had was from the emergency floodlights.”
My throat tightens. “Well, I’m on it. We can’t let that happen again.”
The weight of the situation slams into me like an over-packed palette of giant mayonnaise jars. After being awake for only 25 seconds, I’m not ready to douse this kind of blaze. I don’t have a choice though, so I lean back in my chair and gaze at the craquelure on the ceiling tiles. How could this have happened? I recall that at one time we did have problems with the smart-breakers that switched the lights. They would sometimes mysteriously ignore the commands sent to them by EnergyModCo’s software. But I fixed that by adding a watchdog that would retry the switch commands if they did not take effect. After a brief palpitation subsides, I admit to myself that the lighting control watchdog must contain a nasty bug.
After some brief email digging, face palming, and silent cursing, I manage to get a VPN connection set up so that I can SSH into EnergyModCo’s on-site lighting controller. I bump up the logging verbosity, which requires that I restart the system, and start looking for clues. After a few minutes, I see the periscope that is Freyr’s forehead rise above my cubicle wall.
“The lights are off again! I’ve got the BigBoxoCo manager on hold, and he’s about to lose it!”
My lip quivers as I struggle to suppress my fight-or-flight instinct. It can’t be a coincidence that the lights went off right when I restarted the software. What have I done? In a panic, I force our software to turn the lights back on, and thankfully it works. At this point, I am paralyzed with fear. I want to disable our software entirely, but what if stopping it is what made the lights go off just now? I pull my hands away from the keyboard, fearing that anything I do might cause the BigBoxoCo manager to enter sudden cardiac arrest, or worse.
Without being able to touch the on-site software, I dive into the source code, hoping to track down the bug analytically. I pore through the entire stack, following the data flow and logic for the relatively simple lighting control subsystem. The scheduling code makes sense. So does the the timer code. The trickiest code, for the watchdog system, looks entirely correct. I rack my brain; what am I missing? I am startled by the crack of thunder, but there doesn’t seem to be a storm outside. As my nerves resonate with the imagined sound, Freyr’s forehead crests my cubicle wall.
“The BigBoxoCo manager is flipping out. He says, and I quote, that ‘There’s a God damned disco party going on’ in his warehouse. They are going to have to stop accepting customers.”
Content with having surpassed even my worst expectations, Freyr jogs back to his office. I follow him briskly.
“Freyr, can’t the manager hit the manual lighting override? I think it might take me a while to figure out the problem.”
“What are you doing away from your desk? No! Only BigBoxoCo’s maintenance engineer has a key to the enclosure, and he’s AWOL. Go!”
I save three seconds by running back to my desk. At this point, I’m bouncing ideas off our other programmer, Nate. No good; he has never worked on this system, and can only offer limited advice. I go back to staring at the code. I may have been unconscious an hour ago, but now the fire of my mind is burning with the focused intensity of a TIG welder. The coffee is gone. I begin questioning all of my assumptions. Compiler bug? Memory corruption? Cosmic rays? Nate complains about the sound of my forehead slamming against the desk. In my heightened state of awareness, I perceive the ghostly sound of footsteps come to a stop outside my cube. Moments pass before the shrunken form of Freyr emerges from the hallway. I am calmed by his lack of speed as well as the fact that he is not using his periscope.
“I’m sorry,” he says.
“Uh, hey Freyr, what’s up…?”
“I fixed the lights. It was my fault.”
Until this point, it had not crossed my mind that the problems may have been caused by the lighting controller being configured incorrectly. “Wha — what the hell happened?”
“I had configured the lighting controller at a different site with the IP address of the smart-breaker at the disco warehouse. The other site was in a different timezone, and its schedule said that the lights should be off.”
The problem was too simple. Why didn’t I think of this? One controller thought the lights should be on, and the other thought they should be off. Thus, the lighting watchdogs at each site were fighting over control of the lights. Neither controller knew about the other one; they just thought that the smart-breakers were disobeying them and retried their commands. Over and over. I subdue my first instinct to tackle Freyr on the spot, and murmur, “Okay. Thanks for letting me know.”
I sit still for a few minutes, allowing the turbulence of my rage to subside. My initial response is to be angry at Freyr for wasting my time and terrifying me. However, as I calm down and regain clarity, I realize that he did nothing wrong. Who hasn’t mistyped an IP address before? I know I certainly have, many times. Freyr made a simple and understandable mistake. The problem was that the BigBoxoCo network was set up in such a way as to allow a simple mistake to wreak utter chaos.
As it turned out, BigBoxoCo had all of their hundreds of warehouses on the same virtual network. Not only could BigBoxoCo’s corporate headquarters reach machines at every single warehouse, but so could any individual warehouse. A PC at a BigBoxoCo in New York could ping a PC at a BigBoxoCo in Oregon with no problem. Even ignoring the security repercussions of such a setup, there are good reasons to avoid it. If the network was set up with a star topology, with only the corporate headquarters having access to every single warehouse, the disco party fiasco could have been easily avoided. In other words, segmentation is good.
Comments are disabled for this post