AWS SNS Outage: Effects On The Unreliable Town Clock

It took a while, but the Unreliable Town Clock finally lived up to its name. Surprisingly, the fault was not mine, but Amazon’s.

According to the AWS status page, a number of AWS services in us-east-1, including SNS, experienced elevated error rates for several hours tonight.

Successful, timely chimes were broadcast through the Unreliable Town Clock public SNS topic up to and including:

2015-07-31 05:00 UTC

and successful chimes resumed again at:

2015-07-31 08:00 UTC

Most chimes in between were never published, though SNS appears to have delivered a few of them up to several hours late and out of order.

I had set up Unreliable Town Clock monitoring and alerting through Cronitor.io. This worked perfectly, and I was notified within one minute of the first missed chime. As it turned out, there was nothing I could do but wait for AWS to correct the underlying issue with SNS.

Since we now know SNS has the potential to fail in a region, I have launched an Unreliable Town Clock public SNS topic in a second region: us-west-2. The infrastructure in each region is entirely independent.

The public SNS topic ARNs for both regions are listed at the top of this page:

https://alestic.com/2015/05/aws-lambda-recurring-schedule/

You are welcome to subscribe to the public SNS topics in both regions to improve the reliability of invoking your scheduled functionality.

The SNS message content will indicate which region is generating the chime.
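If you subscribe to both regions, your function will normally be invoked twice per chime, and during an outage like this one it may also receive chimes that arrive hours late. Here is a minimal sketch of one way to handle that in Python. It relies only on the standard `Timestamp` and `TopicArn` fields that SNS includes in each Lambda event record, not on the chime message body. The 15-minute slot length, the 5-minute lateness limit, the in-memory set, and the function names are all my assumptions for illustration. A real Lambda function would need durable storage for the de-duplication state, since memory is not shared between invocations.

```python
from datetime import datetime, timedelta, timezone

# Assumption: chimes older than this are not worth acting on.
MAX_LATENESS = timedelta(minutes=5)

# Assumption: in-memory state for illustration only; it is lost whenever
# the process restarts and is not shared between Lambda containers.
_seen_slots = set()


def chime_slot(sns_timestamp, slot_seconds=900):
    """Round an SNS publish timestamp to the nearest slot boundary.

    slot_seconds=900 assumes one chime every 15 minutes.
    """
    published = datetime.strptime(
        sns_timestamp, "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)
    nearest = round(published.timestamp() / slot_seconds) * slot_seconds
    return datetime.fromtimestamp(nearest, tz=timezone.utc)


def region_of(record):
    """Extract the region from the topic ARN (arn:aws:sns:REGION:ACCOUNT:NAME)."""
    return record["Sns"]["TopicArn"].split(":")[3]


def should_run(record, now):
    """Return True only for the first timely copy of each chime."""
    slot = chime_slot(record["Sns"]["Timestamp"])
    if now - slot > MAX_LATENESS:
        return False  # delivered too late to be useful
    if slot in _seen_slots:
        return False  # the other region already delivered this chime
    _seen_slots.add(slot)
    return True
```

In a Lambda handler you would loop over `event["Records"]`, call `should_run(record, datetime.now(timezone.utc))` for each one, and do the scheduled work only when it returns True.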