This article describes how to use data denormalization in a service-oriented context and its possible use within the metalcon redesign.
Knock, knock! Who’s there? Metalcon!
Earlier this week I had a discussion with Rene about the architectural concepts behind the metalcon redesign. One aspect he mentioned was that different services should use their own data storage and gather context-specific information from this data store. With this in mind, it should be fairly easy to move the large number of queries away from a single MySQL instance, which would definitely become the bottleneck, and into their designated services. This approach would let us use the MySQL instance as a bare data store without any indexes, for just two simple purposes: persistent storage and synchronization/replication for our decoupled services such as the news stream, search, or the API.
In the following I will "evaluate" some approaches for metalcon, because we will have to deal with a huge amount of volatile data. This emphasizes the need for an alternative (combined) approach.
Scaling in general
Reacting dynamically to increased application load is an omnipresent requirement, and several strategies have evolved to handle it:
- Running multiple servers with the same application code (also known as load balancing)
- Using numerous database nodes in a master-slave topology
- Introducing various caching layers or techniques (e.g. memcached)
This list is by no means exhaustive, but it gives a good overview.
Where common concepts fall short
Most of the commonly used concepts have a major downside: data dependency.
This means that different services depend on the same dataset. Modern distributed database systems address this issue by sharding data across different documents or nodes (e.g. servers). This introduces the concept of data denormalization, the counterpart of the normalized database schemas that have been in use for more than 20 years. Denormalization does introduce some redundant data, but storage costs are very low, and this approach will increase the availability of your application.
I recommend reading some articles on the web about the CAP theorem.
Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services by Gilbert and Lynch is a good primer on different aspects of distributed systems, especially in the context of consistency, availability, and partition tolerance.
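To make the idea concrete, here is a minimal sketch of the difference between a normalized dataset and a denormalized, service-specific copy. All names (`users`, `bands`, `posts`, `news_stream_store`) and the sample data are made up for illustration and are not taken from the metalcon code base.

```python
# Normalized: one shared dataset; rendering a post needs several lookups.
users = {1: {"name": "alice"}}
bands = {10: {"name": "Example Band"}}
posts = [{"id": 100, "user_id": 1, "band_id": 10, "text": "New album out"}]


def render_post_normalized(post):
    """Resolve the references against the shared store on every read."""
    user = users[post["user_id"]]
    band = bands[post["band_id"]]
    return f'{user["name"]} on {band["name"]}: {post["text"]}'


def denormalize_for_news_stream(post):
    """Copy the fields the (hypothetical) news stream needs into one record."""
    return {
        "post_id": post["id"],
        "user_name": users[post["user_id"]]["name"],
        "band_name": bands[post["band_id"]]["name"],
        "text": post["text"],
    }


# Denormalized: the service keeps redundant, self-contained records.
news_stream_store = [denormalize_for_news_stream(p) for p in posts]


def render_post_denormalized(record):
    """No lookups in the shared store are needed anymore."""
    return f'{record["user_name"]} on {record["band_name"]}: {record["text"]}'
```

The price is the redundancy: if a user or band is renamed, every denormalized copy has to be updated as well.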
Proof of concept on data denormalization
Five researchers from Tsinghua University and Vrije Universiteit published a paper in 2008 with benchmarks showing how data denormalization can improve the overall performance of a distributed system.
The main problem in the modern architecture of these systems is that all UDI (Update, Delete, Insert) operations are crucial and have to be propagated to all database nodes. In this setup, a database master has to update all of its slaves, so the overall performance is limited by the capacity of the master instance. No loss in consistency or transactional properties is to be expected.
This is where the authors come in: by denormalizing the data in a service-oriented context, data dependencies are reduced and the overall performance is increased. They present their results for three benchmarks: TPC-W, RUBiS, and RUBBoS.
In conclusion, their benchmark results show a performance increase of an order of magnitude. According to the authors, all of this comes with only a small impact on the development cycle.
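The bottleneck argument can be illustrated with a rough back-of-the-envelope model. This is my own simplification, not the model used in the paper: it assumes every node can handle a fixed number of operations per second, that reads are spread evenly, and that writes split cleanly across services after denormalization (in practice, redundant data has to be updated in several places). All numbers are hypothetical.

```python
def max_reads_full_replication(nodes, capacity, writes):
    """Every node applies every write; only the remainder serves reads."""
    return max(0, nodes * (capacity - writes))


def max_reads_per_service(capacity, writes_per_service):
    """Each service owns one node and applies only its own writes."""
    return sum(max(0, capacity - w) for w in writes_per_service)


# Hypothetical numbers: 3 nodes, 1000 ops/s each, 600 writes/s in total.
full = max_reads_full_replication(nodes=3, capacity=1000, writes=600)
split = max_reads_per_service(capacity=1000, writes_per_service=[300, 200, 100])

print(full)   # 3 * (1000 - 600) = 1200
print(split)  # 700 + 800 + 900 = 2400
```

In the fully replicated setup, adding nodes never helps once the write load approaches the capacity of a single node, which is the limit described above.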
Benefits for metalcon
The question is: what can we use for our architecture design on metalcon? The following picture depicts the main concepts based on the discussion with Rene and reflects the findings of the paper mentioned above.
Each service has its own data store, where time-critical and context-aware information is stored. The communication and synchronization between the service and the storage backend are minimized. In the best case, the service data store will read from the backend storage only once (during initialization). Otherwise the backend will only be used to persist the data.
For even more speed, the different services will be cached separately.
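A minimal sketch of this read-once, write-through idea could look as follows. `BackendStorage` is an in-memory stand-in for the central MySQL instance, and `NewsStreamService` is a hypothetical service; neither the class names nor the methods come from the actual metalcon design.

```python
class BackendStorage:
    """In-memory stand-in for the central backend (hypothetical interface)."""

    def __init__(self, rows=None):
        self.rows = dict(rows or {})
        self.reads = 0   # number of bulk reads served
        self.writes = 0  # number of persisted writes

    def load_all(self):
        self.reads += 1
        return dict(self.rows)

    def persist(self, key, value):
        self.writes += 1
        self.rows[key] = value


class NewsStreamService:
    """Service with its own data store, filled once from the backend."""

    def __init__(self, backend):
        self.backend = backend
        self.store = backend.load_all()  # the only read from the backend

    def get(self, key):
        # Reads are answered from the service's own store.
        return self.store.get(key)

    def put(self, key, value):
        # Update the local store, then persist to the backend.
        self.store[key] = value
        self.backend.persist(key, value)


backend = BackendStorage({"post:1": "hello"})
service = NewsStreamService(backend)
first = service.get("post:1")
service.put("post:2", "world")
second = service.get("post:2")
```

After these calls the backend has served exactly one read and one write, no matter how many times `get` is called on the service.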
Evernote: Architectural Overview (including their sharded setup)