It has been said that necessity is the mother of invention, and this seems to be the case with some of the pragmatic solutions developed over the last few years. Data 2.0 is one example, driven by the massive growth of business, government, social, and consumer data, along with the need to gather intelligence from it.
During the time we have been working with our customers in this area, Daitan has leveraged some interesting technologies that are worth the attention of anybody who needs to develop scalable, resilient solutions for processing massive amounts of data.
Amazon: the onetime online bookseller and now the largest e-commerce site had invested in thousands of servers to handle peak demand, but most of the time these servers sat idle, generating no value. In the mid-2000s it devised a model in which it would rent out this idle capacity for a few cents per hour, sparing customers the large upfront investment of buying servers and the ongoing IT management costs. It even managed these rented servers using the same in-house software Amazon uses to manage its own infrastructure. This model became known as the cloud, and today it is an important consideration in any scalable solution strategy.
Facebook: a few years ago, faced with a data deluge from millions of new users subscribing to the social network, it created its own data persistence solution with the following features: it was distributed across the network, provided fault tolerance for the data, and used a more flexible data model than relational database solutions. This solution, which they named Cassandra, was a success, and Facebook, instead of keeping it closed, open-sourced it to the community, donating it to the Apache project. This sped up the product's evolution considerably, as it was adopted and modified by other companies with similar problems.
Netflix was one such company, and it is now one of the largest users of Cassandra, running it on top of Amazon cloud services to store large quantities of data about customer movie preferences. The scale of Netflix's data and processing resources introduced a new set of challenges, though: how do we automatically back up and restore the data on each Cassandra machine without human intervention? How do we control and efficiently distribute Cassandra's configuration to all machines in an Amazon cluster? In the same spirit as Facebook, Netflix developed an in-house backup and restore solution for Cassandra which, as it matured, was contributed to the community as free software. This project, named Priam, is a paradigm of a solution that extends other successful solutions to reach its aim: it uses the Amazon S3 cloud storage service, with its 99.999999999% durability and 99.99% availability, to store the backup data; it uses Amazon's SimpleDB to distribute Cassandra's configuration across the cluster; and it uses Cassandra's snapshot mechanism to obtain the data to be backed up.
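To make the snapshot-then-upload flow that Priam automates more concrete, here is a minimal Python sketch. It is an illustration, not Priam's actual code: the function names and the S3 key layout are our own assumptions, and a real implementation would stream each snapshot file to S3 via the AWS API and handle retries and incremental backups.

```python
import subprocess
from datetime import datetime, timezone

def snapshot_tag(prefix="priam_backup"):
    # Timestamped tag so each backup run is uniquely named,
    # e.g. "priam_backup_20240101T000000".
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    return f"{prefix}_{ts}"

def take_snapshot(tag):
    # Cassandra ships the nodetool CLI; "nodetool snapshot" hard-links
    # the node's SSTables under
    # <data_dir>/<keyspace>/<table>/snapshots/<tag>, so it is cheap.
    subprocess.run(["nodetool", "snapshot", "-t", tag], check=True)

def s3_key(cluster, token, tag, keyspace, filename):
    # Hypothetical key layout: group objects by cluster, node token,
    # and snapshot tag so a single node's backup is easy to restore.
    return f"backups/{cluster}/{token}/{tag}/{keyspace}/{filename}"
```

A backup run would call `take_snapshot()` with a fresh tag, walk the snapshot directory, and upload each file under its `s3_key()`; restore is the reverse walk of the same prefix.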
With easy access to seemingly unlimited compute resources and these continuing innovations in the area of Data 2.0, the scale of innovation seems limited only by our imagination (and budget).