Flaws in testing may be real source of Wells Fargo's tech failure
While Wells Fargo has yet to identify exactly how a problem at a server facility in Minnesota took down its operations nationwide, observers pointed to possible gaps in the bank’s backup and contingency plans that prolonged an outage that lasted some 24 hours.
Experts agreed that Wells undoubtedly had a blueprint in place for a systemwide failure, but they said it is impossible to test for every type of situation.
“I would equate this situation to one where banks may be limited by their imaginations or actual testing constraints,” said Shirley Inscoe, a senior analyst at Aite Group. “It is very difficult to test backup systems for every potential scenario, and testing may literally be impossible for some scenarios. Bank executives also have to make some assumptions during that disaster recovery testing and their assumptions may be at fault.”
Wells’ troubles started early Thursday morning as it was forced to shut down a facility when “smoke was detected following routine maintenance,” according to a statement from the bank.
A local news report from Shoreview, Minn., indicated that a fire suppression system was accidentally triggered at the Wells Fargo Shoreview Operations Center on the morning of Feb. 7. The report said the system was triggered at 5 a.m. local time and that the bank called the fire department to investigate four hours later. The alarm, according to the report, was “due to construction dust.”
What followed was a nationwide outage that affected Wells’ online and mobile banking capabilities, its ATM network and card processing. The outage also extended to the bank’s call center where an automated message told customers that bankers were unable to access account information.
Customers peppered the bank on Twitter with questions about how the outage was disrupting routine tasks such as paying for gas and bills or transferring money via the Zelle person-to-person network.
As of Friday morning, Wells said its operations were back to normal.
“Team members and customers can use their accounts with confidence,” the bank said in a statement. “We are experiencing higher than normal volumes, so there still may be delays in online banking and contact center response times.”
Still, customers on Friday wanted to know how a disruption at one facility could cause a nationwide shutdown.
Wells has yet to answer any questions about why the outage lasted as long as it did, but emphasized in Twitter updates that it has not fallen victim to a cybersecurity event.
We sincerely apologize for any inconvenience. We know this has created difficulty for our customers, and we are sorry to have let you down. Please reach out if you have concerns or need help today, and ask for your understanding as our phone wait times may be longer than usual.
— Wells Fargo (@WellsFargo) February 8, 2019
Tim Sloane, the director of the emerging technologies advisory service at Mercator Advisory Group, said that in the right scenario, Wells’ backup system in a different location would have recognized the problem at the Minnesota facility and stepped in to handle operations.
However, “the dilemma is, a lot of times systems don’t go down gracefully, and the ability to detect that one system going down isn’t so obvious.”
“It starts to create a couple of errors,” Sloane continued. “The backup system recognizes that the first system corrects itself, and then the [backup] system thinks everything is OK, things get out of whack and the original system dies. Now, the backup system doesn’t know the status of the primary system and has a bunch of transactions it doesn’t know what to do with and can’t take over.”
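The failure mode Sloane describes, in which a primary system briefly corrects itself before dying, is commonly guarded against by requiring several consecutive health checks before a backup changes its mind. The Python sketch below illustrates that general idea only. It does not describe Wells Fargo's systems, and the class name, the thresholds and the heartbeat signal are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FailoverMonitor:
    """Hypothetical monitor run at a backup site (illustrative names only)."""

    fail_after: int = 3      # assumed: consecutive missed heartbeats before taking over
    recover_after: int = 5   # assumed: consecutive good heartbeats before handing back
    backup_active: bool = False
    _misses: int = 0
    _hits: int = 0

    def observe(self, heartbeat_ok: bool) -> bool:
        """Record one heartbeat check; return True while the backup should serve."""
        if heartbeat_ok:
            self._hits += 1
            self._misses = 0
            # A single good heartbeat is not treated as a recovery.
            if self.backup_active and self._hits >= self.recover_after:
                self.backup_active = False
        else:
            self._misses += 1
            self._hits = 0
            if not self.backup_active and self._misses >= self.fail_after:
                self.backup_active = True
        return self.backup_active


def simulate(checks: Iterable[bool]) -> List[bool]:
    """Feed a sequence of heartbeat results through a fresh monitor."""
    monitor = FailoverMonitor()
    return [monitor.observe(ok) for ok in checks]
```

In this sketch, a primary that answers one heartbeat after three misses does not cause the backup to stand down; the backup keeps serving until the primary has looked healthy for five checks in a row. Real deployments would also need to reconcile in-flight transactions, which this example leaves out.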
Sloane added that it’s unusual for a bank to have all its channels connected the way Wells did.
“I can have my core system connected to a backup. The online banking connects to the core, but it should still be capable of doing other transactions that don’t depend on the core,” Sloane said. “The ATM should be able to have a network stand-in that would keep the ATMs operating.”
Sloane said even the smallest institutions have an arrangement with the card networks to process transactions should the bank’s systems falter.
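The stand-in arrangement Sloane refers to can be pictured as a fallback rule: if the bank's core cannot answer, a transaction is approved or declined under preset limits and logged for later reconciliation. The following is a simplified Python sketch of that pattern, not a description of any card network's actual rules. The `core_authorize` callable, the `CoreUnavailable` exception and the per-transaction limit are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


class CoreUnavailable(Exception):
    """Raised by the hypothetical core-system call when it cannot respond."""


@dataclass
class StandInAuthorizer:
    # Hypothetical call into the bank's core: (account, amount in cents) -> approved?
    core_authorize: Callable[[str, int], bool]
    # Assumed per-transaction cap applied only while the core is unreachable.
    stand_in_limit_cents: int = 20_000
    # Stand-in approvals waiting to be posted once the core is back.
    pending: List[Tuple[str, int]] = field(default_factory=list)

    def authorize(self, account: str, amount_cents: int) -> bool:
        try:
            # Normal path: the core system makes the decision.
            return self.core_authorize(account, amount_cents)
        except CoreUnavailable:
            # Fallback path: decide under the stand-in limit and queue the result.
            approved = amount_cents <= self.stand_in_limit_cents
            if approved:
                self.pending.append((account, amount_cents))
            return approved
```

The design choice here is that the fallback path never consults account balances, since those live in the unavailable core; it trades some risk on small transactions for keeping ATMs and cards working.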
As for whether Wells could have prevented an extended outage, it might have come down to the testing procedures the bank has in place.
“The question enterprises should be asking themselves is that how can I guarantee that when disaster strikes, I can actually recover everything in an acceptable time frame,” said Doron Pinhas, the chief technology officer of Continuity Software, an Israel-based IT company that works with financial institutions worldwide.
“I think what most likely happened is that Wells did not pay enough attention and did not have enough controls and prior testing in place to get operations back up and running in an acceptable time frame,” he said.
Pinhas noted that a cloud-based system would not have mattered in this situation because such technology is still susceptible to human error.
“There would have been absolutely no difference” if Wells were completely on a cloud-based system, he said. “Most of the reasons why you have outages is because of the human factor.
“When you’re working on large systems, you have a lot of people involved with planning and sometimes communication between those people is not what it should be,” he added.
Wells was already under scrutiny for a number of scandals. The $1.9 trillion-asset bank recently revealed a change in its business practices at a time when it is facing pressure from Congress about its operations.
Pinhas said he expects regulators to press the bank on why it took so long to recover from this outage.
“If there is a silver lining, it is that such incidents can help these institutions be better prepared for future business continuity challenges,” Inscoe said.