As with most social online applications, the data is better thought of as a graph than as simple objects that stand alone. This kind of application and data is my favourite sort of system to architect, and here is why. I warn you, this might get geeky..
One of the fundamental concepts behind Picle is the ability to share your moments and stories with other people, in the same beautiful way that the iPhone app allows you to play them.
We could have achieved this Instagram-style, with just a single page per story that's viewable around the web. That, however, seemed like a missed opportunity. Picle is all about sharing, but perhaps even more important is the community behind it. So we decided to create a fully functioning website, where you can follow your friends, like their posts and share them around the web. This means you get your own picleapp.com login to view a dashboard, which is effectively a stream of moments and stories created by the friends you are following.
Part 1: The Data Layer
It's all about relations
As mentioned above, in a social application the data behaves more like a graph than a set of simple objects that stand alone. For example, consider the image below, showing users following each other.
If you were to model this in a traditional relational database (MySQL, PostgreSQL) using joins or similar, it wouldn't be too hard, but you would soon run into difficulties with the kinds of calculations required and, more importantly, the scalability you want to achieve.
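To make the relational approach concrete, here's a minimal sketch using SQLite in place of MySQL/PostgreSQL; the table names, column names and sample users are all invented for illustration, not Picle's actual schema. A single join answers "who does this user follow?" easily enough; it's the deeper walks and the scaling that hurt.

```python
import sqlite3

# In-memory sketch of the relational approach. Schema and data
# are illustrative only.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users   (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE follows (follower_id INTEGER, followed_id INTEGER);
""")
db.executemany("INSERT INTO users VALUES (?, ?)",
               [(1, "alice"), (2, "bob"), (3, "carol")])
# alice follows bob and carol; bob follows carol
db.executemany("INSERT INTO follows VALUES (?, ?)",
               [(1, 2), (1, 3), (2, 3)])

# Who does alice follow? One join -- easy. Walking two or three
# hops out (friends of friends of friends) means stacking joins.
rows = db.execute("""
    SELECT u.name FROM follows f
    JOIN users u ON u.id = f.followed_id
    WHERE f.follower_id = 1
    ORDER BY u.name
""").fetchall()
print([r[0] for r in rows])  # ['bob', 'carol']
```

Each extra hop through the graph is another self-join on `follows`, which is exactly where the calculation and scalability pain starts.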
In fact, as I have found before, that sort of database has almost none of the features I require for architecting Picle's data structure, which include but are not limited to..
Scale and redundancy over multiple nodes without losing availability (i.e. being able to take a fifth of your database stack down and still support sufficient reads).
Distribution of data by adding more nodes, moving and replicating the data around the ring of available nodes.
Advanced queries that can walk many links, perform intersections, etc., and reduce the final results to a calculated set which can then be stored (i.e. MapReduce).
No global write lock, meaning the system's performance is not affected by the number of writes or reads happening concurrently.
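To make the MapReduce point concrete, here's a toy single-process sketch in Python (the user names and follow sets are invented): the "map" phase emits each relevant user's follow set, and the "reduce" phase intersects them down to one calculated set, which a store like Riak would then let you persist for later reads.

```python
from functools import reduce

# Invented sample data: the set of users each user follows.
following = {
    "alice": {"bob", "carol", "dave"},
    "bob":   {"carol", "dave"},
    "eve":   {"alice"},
}

# "Map" phase: emit the follow set for each user we care about.
mapped = [following[u] for u in ("alice", "bob")]

# "Reduce" phase: intersect the emitted sets down to one
# calculated result -- here, users followed by both alice and bob.
mutual = reduce(set.intersection, mapped)
print(sorted(mutual))  # ['carol', 'dave']
```

In a real distributed store the map phase runs on the nodes holding the data and only the reduced result travels over the network, which is what makes link-walking queries cheap at scale.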
If you look hard at the previous graph you can perhaps now see a theme occurring, with "Nodes" and "Distribution": our data can behave and be modeled just like our databases. Which even leads me to think that, for this kind of application, traditional web-stack databases would be completely the wrong choice.
So, how do we store these relations and the values they relate to, and what do we store them in?
We essentially need to store two types of data: the main "values" and the relationships between them. In Picle the main values are, in model terms, a "User", a "Moment" and a "Story". A "User" can have many "Stories" and many "Moments". A "User" can have many followers ("Users") and be following many "Users", which then creates a stream of their "Moments" and "Stories" for the user to view.
Before we look at the database, let's view this in a pure computer-science way and just build it from data structures you already understand: arrays and objects.
In the following diagram, grey items are Objects and orange items are Ordered Arrays.
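The same structure can be sketched in a few lines of Python, with dicts standing in for the objects and lists for the ordered arrays; every id, field name and timestamp below is invented for illustration. Building a user's stream is then just merging the ordered arrays of everyone they follow.

```python
# Objects (dicts) hold the values; ordered arrays (lists) hold the
# relations. All ids and fields here are illustrative.
users = {
    "u1": {"name": "alice", "following": ["u2", "u3"], "followers": []},
    "u2": {"name": "bob",   "following": [], "followers": ["u1"]},
    "u3": {"name": "carol", "following": [], "followers": ["u1"]},
}

moments = {
    "m1": {"user": "u2", "caption": "sunrise", "ts": 100},
    "m2": {"user": "u3", "caption": "coffee",  "ts": 150},
    "m3": {"user": "u2", "caption": "commute", "ts": 200},
}

# Each user keeps an ordered array of their moment ids, newest first.
timelines = {"u2": ["m3", "m1"], "u3": ["m2"]}

def stream(user_id):
    """Merge the timelines of everyone `user_id` follows, newest first."""
    ids = [m for u in users[user_id]["following"]
             for m in timelines.get(u, [])]
    return sorted(ids, key=lambda m: moments[m]["ts"], reverse=True)

print(stream("u1"))  # ['m3', 'm2', 'm1']
```

Notice that nothing here is a join: every step is a key lookup or a walk along an already-ordered array, which is exactly the shape of operation we want the database to be good at.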
What we're looking for then is something that allows us to store the following..
Ordered arrays.
Key/Values (perhaps in a hash format, instead of JSON)
Ideally it would also be (mostly) persistent, available and tolerant of node failures. This trade-off is described by the CAP theorem, standing for "Consistency, Availability and Partition tolerance", first conceived by University of California, Berkeley computer scientist Eric Brewer. The theorem was famously first put into practice by Amazon in developing the Dynamo database, which powered much of their site and infrastructure and is most notably the system behind S3. It states that not all three of these properties are possible at once and that a trade-off must be made. So, onto choices..
Amazon DynamoDB (recently released, very promising, somewhat expensive)
Riak (based on the Dynamo paper, open source and supports multiple storage backends, automatic handling of nodes/partitions and replication of data/hashing)
Redis (semi-persistent, extremely fast, cluster not yet here/on the way)
Cassandra (built by Facebook, table-like, built-in distribution, uses a confusing protocol)
Unfortunately, none of these currently meets all of my requirements (which will come in part two), so I have used a mix of two. One of them might surprise you, and it's one I also plan to change..
With MySQL used in a way that would be easy to replace with any key-value store, I'm in a good position. MySQL performs very fast under these conditions (around 50,000 ops), as there isn't a single join and every lookup key I use is on a secondary index. But I will be in trouble when I come to move to two servers; Riak will be in use by then..
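Here's a rough sketch of that "relational database as a key/value store" idea, again using SQLite to stand in for MySQL; the table, key scheme and JSON blobs are all made up for illustration. Every read is either a primary-key get or a lookup on a secondary index, and there are no joins to unpick when the time comes to swap in Riak.

```python
import sqlite3

# Illustrative key/value-style table: key -> opaque value, plus an
# owner column with a secondary index for range-style lookups.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE kv (k TEXT PRIMARY KEY, owner TEXT, v TEXT);
    CREATE INDEX kv_owner ON kv (owner);  -- the secondary index
""")
db.executemany("INSERT INTO kv VALUES (?, ?, ?)", [
    ("moment:m1", "alice", '{"caption": "sunrise"}'),
    ("moment:m2", "alice", '{"caption": "coffee"}'),
    ("story:s1",  "bob",   '{"title": "weekend"}'),
])

# Primary-key get, exactly like a key/value store's GET:
v = db.execute("SELECT v FROM kv WHERE k = ?", ("story:s1",)).fetchone()[0]

# Lookup on the secondary index -- still join-free:
alices = db.execute(
    "SELECT k FROM kv WHERE owner = ? ORDER BY k", ("alice",)).fetchall()
print(v, [r[0] for r in alices])
```

Because the application only ever does gets and indexed lookups, migrating means reimplementing two access patterns rather than untangling a web of joins.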
In part two, I'll move on to how you would go about writing an app with this architecture and include sample code. I'd also be extremely interested in what other people think about this type of social scaling and different tools for distributing the data layer.
In part three, I will go into creating APIs for this kind of data and how to optimize your API to give levels of information that can help build everything an app would need in only one or two API calls. I'll also go into some detail on paginating this kind of data.