It’s 2:59 AM on a Sunday, and Barry is about to execute a change to the company’s 800 production IIS web servers. Barry’s not worried–he has a custom script to do the job quickly. He figures it shouldn’t cause any problems–the IT team’s “scripting guy” whipped it up for this change just before leaving for vacation last week. And the script’s pretty simple–it just adds one new line of data to the middle of the core config file on each IIS web server.
The clock flips to 3:00 AM, the start of the approved maintenance window, and Barry starts the script. It begins connecting to web servers, one by one, changing files, moving along at a brisk pace. Barry decides to do a spot check, so he remotely connects to one of the changed web servers, logs in, opens the config file–and stares at it in horror. There’s a typo in the line the script just added–and all websites on that server will be down until it’s corrected.
Barry quickly hits CTRL-C to stop the script and stares at his screen in a cold sweat. He has a serious problem. The script got about halfway through the server list before he killed it. Hundreds of websites across hundreds of servers are now offline, and his 3:00 AM – 3:30 AM change window–which took weeks of meetings and back-and-forth emails to get approved–is rapidly running out. Barry’s not even sure exactly how many web servers were changed–or weren’t changed.
Eight hundred servers are too many to fix manually, but Barry has no confidence he can quickly write a new script to fix the problem without causing even more damage. Still, he figures, he has to try. Frantically, with trembling hands, he begins composing a new script on the fly to run against the entire IIS web server environment…