Update on ITS response to power outage

On Wednesday, October 23, there was an interruption of core IT services. Given the extraordinary circumstances and duration of the problem, we would like to follow up with a brief description of the event and our response.

What happened?
At approximately 3 PM, there was a Con Edison power surge that affected most of the Upper East Side, including 1300 York Avenue. Because power fluctuations can be devastating to sensitive electronic equipment, the data center is equipped with a switch that can sense a power surge and automatically change to emergency (generator-powered) electricity. Though the switch appropriately detected the power fluctuation, it malfunctioned and ended up in a state that cut off both regular power and emergency power. Weill Cornell's emergency power transfer switches are tested monthly to confirm proper function.

Though the data center is equipped with battery "backups" for individual servers, these can generally sustain equipment for only a few minutes. With the loss of all sources of power and a limited duration of battery backup, the majority of servers in the data center were abruptly shut down. This affected dozens of critical systems, including network, email, Epic, web servers, and a host of department-specific applications.

How long was the service interruption?
With the help of Weill Cornell facilities staff, the faulty power switch was reset to deliver normal electricity to the data center approximately one hour after the power fluctuation. Restoring power did not immediately restore services, however. The loss of power affected dozens of servers that have intricate interdependencies, so the restart of equipment and services needed to be carefully planned and appropriately sequenced. Because of the risk of data corruption and physical equipment damage, each service also needed to be carefully evaluated and tested.

Many services, including Internet and email, were restored within two hours of the event. Systems with many hardware and software dependencies took significantly longer to restart. At least one critical piece of shared hardware was damaged during the event, and a replacement part needed to be delivered, installed, and tested before many mission-critical services could be restored. Though a read-only "shadow" Epic system was available relatively quickly after power restoration, the full Epic production system was not available for approximately 9 hours.

What can we do to be better prepared?
ITS and POIS carefully architect our IT infrastructure to be as resilient as possible. As is typical for these kinds of extraordinary events, multiple failures had to occur to produce a catastrophic service failure. In this case, the well-designed system meant to compensate for a power interruption itself failed.

Weill Cornell continues to invest in more robust IT infrastructure to ensure business continuity. The majority of our mission-critical systems will be moved to the new data center within the Belfer Research Building. This modern facility has been carefully designed to increase our system resiliency. In addition, we are investing in a "co-location" strategy whereby many of our most important systems will be replicated within a data center at Cornell University in Ithaca.

Departmental leadership can also take this opportunity to revisit local business continuity plans. There should be clear policies and procedures in place to sustain operations in the event of prolonged system interruptions.

The IT service providers at Weill Cornell take very seriously the obligation to deliver highly reliable and available services. We are aware of the very significant workflow and operational problems that are caused by these types of prolonged service interruptions. We apologize for the inconvenience caused by this outage and remain motivated to continue to improve our service and support.
