1&1 Network Outage Blamed on Middleware
Published on September 27, 2011 at 3:05 pm
1&1 experienced a server outage on Monday 26th September 2011. Two outages of around three hours affected dedicated servers, VPS and cloud hosting, with only shared hosting and mail left unaffected. Customers were unable to reach the control panel, meaning many could not change DNS settings to switch providers. Customer service phone lines were down and the company provided only limited information, with its Facebook and Twitter pages the only channels carrying messages about the issue.
The company later explained the problem thus:
“Our network configuration files are stored in a central Sybase database. For dedicated servers in particular, network data frequently changes (e.g. when customers change their firewall settings). These changes need to be broadcasted to all network components (routers, switches). About one month ago, we installed middleware which regularly fetches the settings from the database and centrally distributes it to the network components.
“Until yesterday, this system worked seamlessly. Yesterday, however, our database server was under extraordinarily high load and hence had longer reply times than usual. Due to an error in the middleware, a Sybase::DBI Connector, the software interpreted the long waiting time as an “empty” answer and as a consequence overwrote the configuration files of routers and switches in our server segments with empty data.
“When the source of the issue had been identified, we first had to manually restore the correct configuration files. At the same time we temporarily disabled the automated database queries by the middleware. During this time, our customers could not make any changes to their firewall settings or install new IP addresses.
“In order to avoid similar outages in the future, we have implemented additional check routines in our network which prevent the error from occurring again. Changes to network settings through the 1&1 Control Panel have been enabled again, but will take approximately 15 minutes to become effective due to the check mechanisms.
“Additionally, we closely collaborate with the middleware manufacturer to eliminate the software errors as soon as possible.”
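The failure described in the statement, a slow database reply being treated as an empty answer, can be illustrated with a minimal Python sketch. This is not 1&1's middleware or its check routines, whose details were not published; the names `FetchTimeout`, `plan_update` and the `fetch` callable are hypothetical and exist only for illustration. The sketch shows one way to keep a timeout distinct from an empty result and to refuse to overwrite a working configuration with empty data.

```python
class FetchTimeout(Exception):
    """Hypothetical error: the database did not answer in time."""


def plan_update(current_config, fetch):
    """Decide which configuration should be deployed.

    `fetch` is a hypothetical callable that returns the new configuration
    text, or raises FetchTimeout when the database is too slow to answer.
    Returns a (config, status) tuple.
    """
    try:
        new_config = fetch()
    except FetchTimeout:
        # A slow database is not an empty answer: keep what is deployed.
        return current_config, "kept: database timed out"

    if new_config is None or not new_config.strip():
        # Refuse to overwrite a working configuration with empty data.
        return current_config, "kept: empty result rejected"

    return new_config, "updated"


def overloaded_database():
    # Simulates the database being too busy to reply in time.
    raise FetchTimeout()


config, status = plan_update("allow tcp 22", overloaded_database)
print(status)
```

In this sketch the deployed configuration is only replaced when the fetch succeeds and returns non-blank content; in every other case the existing configuration is left in place and the reason is reported.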
Customers criticised the use of Facebook to distribute information on the outage, saying that the company website was a more suitable place for such updates or that customers should have been emailed. In addition, critical and angry comments were deleted from the Facebook page. The effect on enterprise-level customers who have clients of their own will leave many pondering the alternatives.