Computer Outage

I am sure that you heard about the outage today causing lots of flights to be delayed.

Clark has said before that Southwest is one of the few airlines that does not use the standard flight booking and management software that most others use. That seems to have allowed them to avoid all the issues that other carriers have had today.

The problem appears to have been a software update gone wrong. Normally you test these things offline and then roll them out gradually, but it appears this did not happen.
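For anyone curious what "roll them out gradually" can look like, here is a minimal Python sketch. It is only an illustration under my own assumptions, not how any vendor actually does it: the function name `staged_rollout`, the callbacks `apply_update` and `is_healthy`, and the stage sizes (1%, 10%, 50%, 100%) are all made up for the example.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def staged_rollout(
    hosts: Iterable[str],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.10, 0.50, 1.0),  # assumed wave sizes
) -> Tuple[List[str], bool]:
    """Update hosts in growing waves and stop at the first unhealthy wave.

    Returns the list of hosts that were updated and whether the rollout
    finished without a failed health check.
    """
    all_hosts = list(hosts)
    updated: List[str] = []
    for fraction in stage_fractions:
        # Cumulative number of hosts that should be updated after this wave.
        target = max(1, int(len(all_hosts) * fraction))
        wave = all_hosts[len(updated):target]
        for host in wave:
            apply_update(host)
            updated.append(host)
        # Check the wave before touching anything else.
        if not all(is_healthy(host) for host in wave):
            return updated, False  # halt: the remaining hosts stay untouched
    return updated, True


if __name__ == "__main__":
    fleet = [f"host{i}" for i in range(100)]
    # Pretend the update breaks one particular machine.
    done, ok = staged_rollout(fleet, lambda h: None, lambda h: h != "host5")
    print(len(done), ok)
```

The point is that a bad update is caught after a small slice of machines has it, rather than after every machine has it.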

I personally have been involved with this type of update gone wrong when I worked for a large computer company. The tech was supposed to wait until 7pm (so backups on all the mainframes across the region were complete), but he started on one system early. A very strange issue caused all of the mainframes connected to the network to fail instantly. They went off like a lightbulb. We actually had many, many systems fail simultaneously. Luckily the tech helped figure out the issue and we were able to halt all updates. Actually, if any one of the systems had been updated, all of them would have failed. It was weird, but once we figured out what was causing it, we fixed it and the new patch eliminated the possibility of it happening again.

As you can see, similar software on similar hardware, connected to the Internet where it can be automatically updated, can be a major problem. This issue with the Windows systems will now become another target for future attacks by hackers, I’m sure.

As a piece of info, Microsoft used to have “Patch Tuesday” where they would issue patches on a Tuesday. It would not be unusual to get calls on Wednesday mornings from people who had computer issues. That is why I try to avoid any kind of automatic update.

A good friend of mine also swore by the rule “Never load software versions ending in ‘0’”. So if version 12.0 was being released, he would wait a few days until 12.01 came along. That is because the people who loaded 12.0 had issues, and almost immediately a patch had to be issued.

Anyway, that is my take on this issue of the bad software update. The major issue is that even though the company issued a fix, someone is going to have to manually update each computer. THAT will not be a quick fix, as many systems may be in inaccessible remote locations.

I loved being in I.T. but now that I am retired, I don’t really miss the craziness.


I saw this comment on a YouTube news report about this. Having retired after 30 years in IT, I had to laugh.

“We don’t always test our code, but when we do, it’s in Production”

Crowdstrike


For each business that was affected by this outage, I blame the business for not having a contingency plan. It seems that businesses just don’t have a “Plan B” to keep their businesses running smoothly. All businesses, no matter the size, should plan for the eventual outage. Think about something as simple as going to the grocery store. If the computers are down, they could not price the groceries and do checkout. They could not process credit cards for payments. Even if they did take cash, no one seems to know how to do that math anymore to make change. Terrible. :grimacing:

I’m sure they all have contingency plans for a failure like this. Those plans are designed to mitigate the failure’s fallout until the system is restored, not to continue on “business as usual.”

A fully redundant system capable of plugging into the existing global communications network would, using today’s technology, be impossible or prohibitively expensive to build, implement and maintain.

It’s kinda like your local credit union or bank will be when the Internet goes down.

Sorry for the long posting but my morning coffee just kicked in.

I once saw a keno slot machine in a casino displaying a blue screen of death. It was running NT, not a specialized gaming OS.

A couple of decades ago, the computer company I worked for moved to a new building. We spent a year planning how to move hundreds of mainframes over Presidents Day weekend. The process involved backing up each system, sending backups on one truck, and disassembling and reassembling the systems after verifying the backups at the new location. We even created a movie to explain the process to non-technical employees who were to help us disassemble and reassemble the systems. We had people in Accounting work on their own mainframes, thus ensuring they would be overly careful!

Annually, we conducted a disaster recovery test, simulating scenarios like a tornado. Our priority was getting the financial system, especially Accounts Payable, up and running first to ensure employees got paid. If necessary, we instructed the bank to rerun the last payroll, ensuring most employees were paid and could focus on recovery without financial worries.

Our company, being large, prioritized Accounts Receivable later in the process. We had alternate sites and a factory ready to supply new computers immediately.

Airlines and stores also need solid backup plans. Most grocery stores rely heavily on barcodes, with prices stored only in the computer system and on the shelf edge. They should have hardcopy price lists for manual transactions, but they often don’t. After a 6.9 earthquake near our Home Improvement store, cashiers let people take items with minimal questioning to help with immediate repairs. During Hurricane Iniki, vendors on Kauai gave away food, and people helped each other. Spielberg even used real hurricane damage in “Jurassic Park.”

Always be prepared for disaster. https://www.youtube.com/watch?v=aFSLeXYnAhA



Funny story here about when I flew for TWA. In the late 90’s they came up with a plan to “keep operating” after a computer breakdown. Back then TWA used Sabre, a computer system for airlines. They sent us to class to learn this backup plan. Basically, you call the dispatcher and talk to him, discuss weather and fuel load, and still fly planes.

So a couple months later Sabre goes down. Many of us pilots call dispatch. They said: “Oh, we’re not going to do that. Just keep the planes parked”. All us pilots just shook our heads! So much for the big backup plan!

I knew the outage on Friday was bad when I read that some casinos had to shut down.

In the early days of Atlantic City casinos, we were there for a convention. The fire alarm went off in the room and the hall. I called the front desk and they put me on hold for a while. Meanwhile I got dressed and grabbed my flatbed plotter and a few things to head downstairs. The firetrucks appeared and firefighters went into the hotel, but below, the casino was business as normal. It was surreal.

In any disaster, unless it is life or death, both a casino and Waffle House will remain open.

??? :thinking:

A flatbed plotter is a flat surface with a bar that can move the pen along the X and Y axes. The pen moves up and down: down to draw, up to allow the arm to move it elsewhere.
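If it helps to picture it, here is a tiny Python sketch of that pen-up/pen-down idea. It is just a toy model I made up for illustration (the `FlatbedPlotter` class and its method names are hypothetical, not any real plotter's command set): moves made with the pen down leave a line, moves made with the pen up do not.

```python
from typing import List, Tuple

Point = Tuple[float, float]


class FlatbedPlotter:
    """Toy model: the arm moves in X and Y, the pen is either up or down."""

    def __init__(self) -> None:
        self.position: Point = (0.0, 0.0)
        self.pen_is_down = False
        self.segments: List[Tuple[Point, Point]] = []  # lines drawn so far

    def pen_up(self) -> None:
        self.pen_is_down = False

    def pen_down(self) -> None:
        self.pen_is_down = True

    def move_to(self, x: float, y: float) -> None:
        # A move only leaves ink on the paper while the pen is down.
        if self.pen_is_down:
            self.segments.append((self.position, (x, y)))
        self.position = (x, y)


if __name__ == "__main__":
    plotter = FlatbedPlotter()
    plotter.move_to(1, 1)  # pen is up: travel to the start, nothing drawn
    plotter.pen_down()
    for corner in [(3, 1), (3, 3), (1, 3), (1, 1)]:
        plotter.move_to(*corner)  # pen is down: each move draws one side
    plotter.pen_up()
    print(len(plotter.segments))  # number of line segments in the square
```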

The White House uses a similar device called an autopen to sign the President’s name to documents.

The flatbed plotter

An autopen


Oh, I see, you must have been at the convention as a vendor. I missed that part when I posted my ???. I was wondering what you were doing hauling a plotter around with you and having it in your hotel room. Why wouldn’t you just leave it set up in the exhibit area?

I’ve owned a small one and sold several large format plotters back in the 80’s when I was an IBM salesman. I got interested in geoprocessing and fiddled around with ARC/INFO for a while.

Frustrating! Did you reach out to your ISP for more information or check their website for outage updates?

Rest assured the “Craziness” hasn’t stopped. Retired here too, but still connected. I once saw a jr engineer install a patch over the weekend at a major Internet backbone company and instantly distribute it globally to all their servers, which took the network down. Of course they tried to blame it on the vendor, but hey, I was there! (Jr Engineer was the wife of one of the customer execs)

I used to install ‘foreign’ memory on those ‘Blue’ mainframes. We also had the competitive tape/disk. The difficult part in moving all that equipment for me was the raised floor tiles and power cables. No one talks about that part, but it was the most physical. I started with a customer in a large midwestern insurance company. I think they rearranged their IT floors weekly! :sweat_smile:

I see Delta Airlines is still having issues and still cancelling flights. I thought they were supposed to be one of the good airlines.

I haven’t heard the term “raised floor” for a long time. In my career as a sales rep for IBM (1970-1990), selling turn-key systems, including construction of data centers with raised floors, lotsa disk storage, water chillers, etc., made for some of my biggest sales.

The raised floors and environmental control equipment added a lot of expense.

Fun times for sure. I sat on a Credit Union board for 20 years, and helped them with IT suggestions etc. They started with raised floor, a very large facility for such a small CU. When I retired from the board, the IT room was down to a small closet with one rack of servers and NO raised floor. Cost for cooling, electricity, and equipment moves dropped drastically.

Years ago we had a raised floor in the data center. One day someone went to brew a pot of coffee in the break room behind the data room. All the systems came down. When we went to investigate, we found a “chain” of four power strips under the floor, with all of their outlets in use, plugged into a single outlet. Nobody fessed up to doing that one.

I cannot find the specs but was told that our data center was the size of a football field. It had electric cables under the raised floor. Access was controlled by a credit card-sized keycard, similar to modern credit cards with chips, which you presented at a reader to unlock the glass doors to enter or exit.

The room contained hundreds of mainframe computers and two massive telephone exchanges. Above the computer room was a cafeteria. One day, a water leak caused water to flow into the computer room and under the floor. Maintenance stopped the leak before it shorted any power cables, and an outside firm was called to vacuum up the water. A guard was posted at the exit door since the contractor didn’t have an access card.

The guard left briefly, and the contractor, unable to open the door, pushed a nearby big red button, thinking it would unlock the door.


It was not designed as a ‘push to exit’ button. Instead, it cut off all power and air conditioning and sounded the emergency alarm. This button was for emergency power shutdowns, such as in a fire.

Our System Managers spent the afternoon restoring services. Maintenance covered the button with a plastic box that had a beeper that sounded if the lid was raised. A few days later, Computer Operators received written instructions on the system. One operator, while familiarizing himself with the instructions, lifted the lid, triggering the beeper, and froze while figuring out how to silence it. A guard said he had an idea and pushed the red button, which caused another shutdown.

The guard was reassigned, and we restored services again. This time, a big sign was installed next to the big red button.

Not all computer outages are hardware or software issues.