Friday, December 18, 2009

My Windows Azure Table Test Harness App Was Down for 2 Hours and 30 to 40 Minutes Yesterday

Updated 12/18/2009: See end of post.

I monitor my Azure test harness applications, which run in Microsoft’s South Central US (San Antonio, TX) data center, with the free Pingdom and Monitis (mon.itor.us) services. Both services send mail when a specified application goes down and again when it returns to service; a minimal sketch of this kind of check appears below.
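The sketch that follows is not either service’s actual implementation; it is a rough Python illustration of the polling-and-alerting pattern described above, with the check interval and timeout chosen arbitrarily:

```python
# Rough sketch of an HTTP availability monitor: poll the URL on a schedule
# and raise an alert whenever the observed state flips between UP and DOWN.
import time
import urllib.error
import urllib.request

URL = "http://oakleaf.cloudapp.net"   # the table test harness
CHECK_INTERVAL_SECONDS = 60           # assumed polling interval
TIMEOUT_SECONDS = 30                  # assumed request timeout

def is_up(url: str) -> bool:
    """Treat responses below 500 as UP; 5xx or connection failures as DOWN."""
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as response:
            return response.status < 500
    except urllib.error.HTTPError as err:
        return err.code < 500
    except Exception:
        return False

last_state = True
while True:
    state = is_up(URL)
    if state != last_state:
        # A real monitor would send e-mail here; printing stands in for the alert.
        print("UP again" if state else "DOWN", time.strftime("%Y-%m-%d %H:%M:%S"))
        last_state = state
    time.sleep(CHECK_INTERVAL_SECONDS)
```

Here are the messages both services sent at the start of the outage: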

mon.itor.us:

PROBLEM Alert

Service: HTTP
URL: oakleaf.cloudapp.net
State: CRITICAL
Date/Time: 2009-12-18 02:02:10 GMT

Pingdom:

PingdomAlert DOWN:

Azure Tables (oakleaf.cloudapp.net) is down since 12/17/2009 05:43:21PM.

These messages arrived after the instance restarted:

mon.itor.us:

RECOVERY Alert

Service: HTTP
URL: oakleaf.cloudapp.net
State: OK
Date/Time: 2009-12-18 04:29:38 GMT

Pingdom:

PingdomAlert UP:

Azure Tables (oakleaf.cloudapp.net) is UP again at 12/17/2009 08:23:24PM, after 2h 40m of downtime.

The start and end times the two services report appear to disagree, most likely because mon.itor.us stamps its alerts in GMT while Pingdom doesn’t say what time zone it uses; the measured durations, however, are close: 02:27:28 for mon.itor.us and 02:40:03 for Pingdom.
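As a quick check of the reported spans, the arithmetic below (plain Python, using only the timestamps quoted above) reproduces both durations:

```python
# Recompute each service's reported outage duration from its own timestamps.
from datetime import datetime

fmt = "%m/%d/%Y %I:%M:%S %p"

# mon.itor.us timestamps (GMT)
monitorus = datetime(2009, 12, 18, 4, 29, 38) - datetime(2009, 12, 18, 2, 2, 10)

# Pingdom timestamps (time zone unstated)
pingdom = (datetime.strptime("12/17/2009 08:23:24 PM", fmt)
           - datetime.strptime("12/17/2009 05:43:21 PM", fmt))

print(monitorus)  # 2:27:28
print(pingdom)    # 2:40:03
```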

My test harness at http://oakleaf.cloudapp.net runs two instances. mon.itor.us also monitors my blob test harness at http://oakleaf2.cloudapp.net and my queue test harness at http://oakleaf5.cloudapp.net; it didn’t report an outage for either the blob or the queue test harness.

If you encountered an outage of Windows Azure instances running on the South Central US data center on 12/17/2009, please leave a comment. Thanks.

Update 12/18/2009: ToddySM left a comment with a link to Steve Marx’s RESOLVED: Recent errors in storage and portal thread of 12/17/2009 in the Windows Azure forum, which explains the outage:

This evening some CTP participants with storage accounts in the "South Central US" region received errors from the storage service.  Because the Windows Azure portal relies on the storage service, some operations in the portal resulted in errors as well.  This issue has already been resolved, and no data was lost.

The root cause was a bug in queue storage, which had a cascading effect on blobs and tables for some customers.  We applied a manual workaround to restore service to full functionality, and we're working on a code fix for the underlying bug.

The probable reason mon.itor.us didn’t report problems with the blob and queue test harnesses is that those pages require user interaction before they exercise their storage services, so they continued to load normally during the storage errors. The table test harness reads the first 12 entities of the table into the UI on activation, so the table storage errors surfaced to the monitors immediately.
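For illustration only, here is a rough Python sketch of the kind of “first 12 entities” query the table harness runs on page load. The real harness is an ASP.NET application using the .NET storage client; the connection string and table name below are placeholders, and the azure-data-tables package is simply a convenient stand-in:

```python
# Sketch of a page-load query that reads the first 12 entities of a table,
# so any table storage failure surfaces immediately as a page error.
from itertools import islice

from azure.data.tables import TableClient  # stand-in client for illustration

client = TableClient.from_connection_string(
    conn_str="<storage-account-connection-string>",  # placeholder
    table_name="SampleTable",                        # placeholder table name
)

# Take only the first dozen entities, mirroring the harness's initial page load.
first_dozen = list(islice(client.list_entities(), 12))
for entity in first_dozen:
    print(dict(entity))
```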

Prior to this outage, both mon.itor.us and Pingdom had reported a week with 0% downtime for the table test harness. More detail will follow in my monthly SLA report for December 2009.
