Jaruzel.com

Don't Screw Up - A VAXCluster Tale

Black and white image of a VAX Cluster

My first serious job was working as a Junior Sysadmin on a large VAXCluster. I literally had NO idea what I was doing. On my first day my boss said to me, "Here's your account details, it's got the same rights as SYSTEM, if you screw up, we'll fire you. Good Luck." The senior admin showed me the ropes, especially the HELP command, and I basically spent the next week typing HELP followed by every command I encountered.

Fast forward a few months, and I'm fairly comfortable with VMS, writing scripts and improving some of the daily housekeeping tasks to make them faster and more efficient. I'm working on the startup script that boots the cluster, stepping through the existing one to understand what it does, when I encounter a command I've never seen before. So I duly flip terminal screens (VT420 dumb terminal with dual inputs) and type HELP followed by the command.

Except I didn't actually type HELP. I just typed the command.

As I watched the terminal fill up with output, I realised what it did. Basically it flushed all core running processes out of memory. Including the process that allowed people to log on.

Normally this would be a fairly embarrassing screw-up requiring a simple reboot, except that this VAXCluster belonged to the MoD, and it was running military combat scenarios (wargames), and had been running a large war scenario for several weeks, non-stop. Somewhere on the site where I was working, there were two large rooms, each filled with soldiers. Each day they'd been planning out their campaign tactics against each other, and these decisions were being plumbed into the running scenario in real time by an operator.

I'd basically broken everything, and potentially destroyed the whole scenario. The scenario was still running in memory, but no-one could talk to it anymore. Rebooting the cluster was not an option.

Immediately, white-faced, I admitted to what I'd done. One of the 'real' VMS coders on my team also went ashen. After lots of head scratching, he set to work. I don't know in detail what he did, but from what I understood at the time, he effectively manually restarted the hundreds of processes that the Cluster needed to operate - as if it was rebooting, but without actually flipping the switch.

Several hours later, the Cluster was fixed and running normally. I'd also not been fired.

This event left me with a massive appreciation of multi-user systems and how one small error can affect hundreds of people. A year or so later, I transitioned into PC support as Desktop PCs were slowly replacing the dumb terminals across the site, and later on I started working for an international bank where, again, networked PCs were new. I found that having cut my teeth, and been burned, on a large multi-user system gave me a sense of impact that most people working in the fledgling PC support industry didn't yet have.

Because of this, VAX/VMS systems will always be special to me.

Reposted from my original Hacker News comment.