However, speed and consistency are easier to measure than availability. With a service like SimpleDB, it is hard to know from the outside what will happen during various failure scenarios. There may be internal failures that occur that we do not even know about. Those hidden failures, handled swiftly or automatically, are a credit to AWS, but can make it difficult for users to understand the difference between a system that hasn't had a failure yet and one that handles failures gracefully.
Consider the July 2009 Google AppEngine outage that lasted for six hours. Google AppEngine provides a data store that has some similarities to SimpleDB in that it is entirely hosted and managed by Google in its data centers, as SimpleDB is hosted and managed by AWS in Amazon data centers. One week after the outage, Chris Beckmann (Google AppEngine PM) provided a very detailed explanation for the outage, including a timeline of events throughout the failure. One sentence highlights the difference between the AppEngine data store and SimpleDB, as follows:
Since needed application data was completely unreachable for a longer than expected time period, we could not follow the usual procedure of serving of App Engine applications from an alternate data center, because doing so would have resulted in inconsistent or unavailable data for applications.
This is a key insight into the different levels of availability. The root cause of the problem was a software bug, but that's not the whole story. The interesting part is that there was a choice to be made. Google engineers had a 30-minute-old copy of all the data for the thousands of AppEngine apps. They could have just switched it on at a different data center. However, the choice was essentially made for them because they had a consistency guarantee to uphold. Some data would be missing, some would be stale, and in the end, they would have no way to reconcile the divergent updates to stale data.
With SimpleDB, that same choice has already been made, but the difference is that availability is chosen over consistency. It has already been decided, and the mechanism is already in place to synchronize up the stale or temporarily missing data.
The important thing to realize is that outages inevitably occur, even at Google and Amazon, and this leaves a choice for the developers who use these services. If an outage causes the temporary loss of 30 minutes' worth of data, is it better for your app to continue running without that data and have it synced up later, or is it better for the app to wait out a six-hour outage, unavailable to all users, even those unaffected by the data issue? SimpleDB is a solution for those applications that would benefit from continuing to offer users access to the system even if the data is stale for a while.
If all the data in the application belongs to the user (for example, in an email application), the user is probably going to argue for the ability to keep working even if the data entered between 6:00 am and 6:30 am is absent for a while. When the application provides quantifiable business value, the decision can be far easier to make. When you weigh the cost of your entire online shopping cart application going down for all users for six hours in the middle of the day against the cost of serving 30-minute-old stale shopping cart data to only those users who were active between 6:00 am and 6:30 am, there might be a dollar value difference you can compare.
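The shopping cart comparison can be made concrete with a little arithmetic. The following is a minimal sketch only: the function names and every figure in it (hourly revenue, number of affected users, cost per affected user) are hypothetical placeholders chosen for illustration, not measurements from any real outage or application.

```python
# All numbers below are hypothetical, for illustration only.

def outage_cost(revenue_per_hour, outage_hours):
    """Revenue lost when the whole application is down for every user."""
    return revenue_per_hour * outage_hours


def stale_data_cost(affected_users, cost_per_affected_user):
    """Cost of serving stale data only to users active during the gap."""
    return affected_users * cost_per_affected_user


# Assumed: the store earns $2,000 per hour and is down for six hours.
full_outage = outage_cost(revenue_per_hour=2000.0, outage_hours=6)

# Assumed: 150 users were active in the 30-minute window, and each
# stale cart costs about $5 in lost sales or support effort.
stale_window = stale_data_cost(affected_users=150, cost_per_affected_user=5.0)

print(f"Six-hour outage for all users: ${full_outage:,.2f}")
print(f"Stale 30-minute window:        ${stale_window:,.2f}")
```

With your own estimates substituted for the placeholders, the two totals give a rough dollar comparison between favoring consistency and favoring availability.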
Boundaries of Eventual Consistency
A big part of the value of an eventually consistent system like SimpleDB lies in the ability to maintain availability during an outage. It also lies in being able to reason about the qualities that you want your application to have, and in the ability to choose a database that supports those qualities.
