When you start programming, you discover new things on a daily basis, which is both wonderful and a bit frightening at the same time. Every new thing you learn gives you more power and a better understanding of things that once seemed like black magic. Slowly, you gain more confidence in yourself, but at the same time you realize that behind everything you have learned, there are several new things still to be learned.
If you are a back-end or full-stack developer, one of the first things to learn is how to work with a database. Once you get familiar with SQL and data access frameworks, it still takes years to master them to the point where you can easily design data structures and relations that behave nicely under heavy load.
At this point you already know how to get maximum speed from SQL queries by modeling database tables carefully and using indexes effectively. But no matter how fast your queries are, once your website becomes successful and is used by thousands of users, you will hit a limit on the number of queries you can run against your database. This is when you reach for optimization, and one of the most effective optimizations for offloading your database is caching.
It’s very common for data to be read much more frequently than it is updated. In those cases caching can do wonders for you. Instead of querying the database every time you need some data, you first check whether you have it cached in memory; if you do, you can return it immediately. If not, you get it from the database and store it in the in-memory cache so it is ready for subsequent requests that need it. When that data is updated, you simply remove it from the in-memory cache, so fresh data will be fetched from the database on the next request.
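Here is a minimal sketch of that cache-aside pattern in Java; the User record and UserRepository interface are just placeholders for your real entity and data-access layer:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical entity and repository, standing in for your real data-access layer.
record User(long id, String name) {}

interface UserRepository {
    User findById(long id);
    void save(User user);
}

// Cache-aside: check memory first, fall back to the database, invalidate on update.
class CachedUserService {
    private final Map<Long, User> cache = new ConcurrentHashMap<>();
    private final UserRepository repository;

    CachedUserService(UserRepository repository) {
        this.repository = repository;
    }

    User getUser(long id) {
        // Serve from memory if present; otherwise load from the database
        // and keep the result for subsequent requests.
        return cache.computeIfAbsent(id, repository::findById);
    }

    void updateUser(User user) {
        repository.save(user);
        // Drop the stale entry so the next read fetches fresh data.
        cache.remove(user.id());
    }
}
```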
With caching in your arsenal, you can handle thousands of requests per second on a single machine. YOU KNOW KUNG FU!
But if you are lucky and your website/service becomes very popular, even with caching in place, the number of requests can be too big to handle on a single machine.
This is really good news for your business, but in a technical sense you are in trouble. It is time for clustering: running multiple instances of your application acting as one service.
Basically, running a clustered application means running multiple instances and balancing the workload among them.
Let’s consider a web application. In single-instance mode you would run your application on a single machine/container, and everyone who can reach that machine via its IP can see your application. If this is a public IP, the whole world can use your service, and you will probably map a domain to this IP so your service address is easier to remember.
But what if we run two instances of our application on two different machines? Which IP should be used to access your service? How do you map an internet domain to multiple IPs? Well, you don’t. You use an “entry point” machine with a public IP, and this machine acts as the load balancer for your cluster. Your application is not installed on this machine; instead it runs software such as nginx that has proxying and load-balancing features. It does little more than forward requests to your application instances using round-robin (or some more advanced technique) and route the responses back to the client. Since all the hard work is done by the application instances, the load balancer has very low resource consumption (CPU and RAM), so it can easily handle thousands of requests per second. As the number of requests grows, you simply add more application instances to the cluster and the load balancer will forward some requests to them, offloading the existing instances.
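As a rough illustration, an nginx configuration for such an entry point could look something like this (the instance addresses and port are, of course, made up):

```nginx
# Hypothetical addresses of two application instances.
upstream app_cluster {
    server 10.0.0.11:8080;
    server 10.0.0.12:8080;
}

server {
    listen 80;

    location / {
        # Forward every request to one of the instances (round-robin by default)
        # and relay the response back to the client.
        proxy_pass http://app_cluster;
    }
}
```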
Your users don’t even know how many instances you are running; they only connect to the load balancer using the public IP or the internet domain.
Sounds wonderful, right? But there are a few problems…
Cache was our magic wand for dealing with a high number of requests while we ran a single-instance application, but in a cluster things get a bit more complicated.
If data is updated via a request that ends up on the first application instance, that instance updates the data in the database and updates its own cache, but the other instances are not aware of that and still have the old data in their caches.
So now we have to somehow inform the other instances that their cache needs to be updated. There are several solutions for this; one of them is cache replication.
Many caching libraries support cache replication between application instances; all you have to do is configure it properly.
However, if you hold a large amount of data in the cache, replication can become a bottleneck. Every instance stores all of the cached data, which is quite bad, especially in clusters with many application instances. This problem can also be solved with off-the-shelf solutions called in-memory data grids (such as Hazelcast). These solutions distribute cached data across your app instances, not replicating all the data to all nodes, but only enough to provide the necessary data redundancy. They use a highly optimized network protocol to perform replication and know how to get your data regardless of which node it is stored on.
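As an illustration, using a data grid such as Hazelcast from Java code looks almost the same as using a local map; the map name and entries below are made up:

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import java.util.Map;

public class DataGridExample {
    public static void main(String[] args) {
        // Each application instance starts (or joins) a Hazelcast cluster member.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // "users" is a distributed map: entries are partitioned across the
        // cluster with backups, not fully replicated to every node.
        Map<String, String> users = hz.getMap("users");

        users.put("42", "Alice");
        System.out.println(users.get("42")); // readable from any node in the cluster
    }
}
```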
If you don’t like the idea of data replication among your application nodes, there is another type of solution: external caching. Instead of keeping pieces of the cache in your application instances’ memory, you can use cache servers. A popular solution of this type is Memcached. In the Java world there is the Ehcache server, and lately key-value databases can be a very good choice for this purpose. So instead of caching locally, you do it remotely, and the benefit is that application instances need much less memory to operate. Also, as soon as one instance updates the cache, the change is immediately visible to all other instances (just as with replication). You might ask: what if the cache server or key-value database goes down? Well, they also run as clustered, highly available solutions, so no worries there.
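As one possible sketch, assuming a key-value store like Redis accessed through the Jedis client, external caching could look like this (the host, key, and TTL are placeholders):

```java
import redis.clients.jedis.Jedis;

public class ExternalCacheExample {
    public static void main(String[] args) {
        // Connect to a cache server shared by all application instances.
        try (Jedis jedis = new Jedis("cache.internal", 6379)) {
            String key = "user:42";

            String cached = jedis.get(key);
            if (cached == null) {
                String value = loadUserFromDatabase(42); // hypothetical DB call
                jedis.setex(key, 300, value);            // cache for 5 minutes
                cached = value;
            }
            System.out.println(cached);
        }
    }

    private static String loadUserFromDatabase(long id) {
        return "Alice"; // stand-in for a real query
    }
}
```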
When we design our applications for a cluster, our goal should be to make application instances as stateless as possible. That means we should make requests independent of each other. One example of dependent requests is using HTTP sessions to store authentication data. Every request from the same user, coming after authentication, depends on data stored in the session during the authentication request. This forces our load balancer to use so-called “sticky sessions” in order to send all requests from the same user to the same application instance. A simple solution to this problem is to send authentication data (some kind of token, such as a JWT or OAuth token) in every request; that way we break the dependency on the HTTP session.
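A minimal sketch of that idea, assuming the Servlet 4+ API: a filter that checks a token on every request instead of relying on the HTTP session (the token validation itself is only stubbed out here):

```java
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;

// Stateless authentication: every request carries its own token,
// so any instance behind the load balancer can handle it.
public class TokenAuthFilter implements Filter {

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        String header = request.getHeader("Authorization"); // e.g. "Bearer <token>"

        if (header != null && isValid(header)) {
            chain.doFilter(req, res); // no HTTP session involved
        } else {
            ((HttpServletResponse) res).sendError(HttpServletResponse.SC_UNAUTHORIZED);
        }
    }

    private boolean isValid(String header) {
        // Placeholder: verify the JWT/OAuth token signature and expiry here.
        return header.startsWith("Bearer ");
    }
}
```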
Generally, caching offloads the CPU, but it is heavy on RAM, and since a cache makes our application more stateful, it also makes it harder to scale. Stateless applications are easier to scale, but they require more CPU (and network) usage since we have to fetch data from external systems more frequently.
We have talked about handling direct requests coming to our application cluster from outside, but often there is work that has to be done “behind the scenes”. For example, generating daily or weekly reports, or asynchronously processing data gathered from user (or IoT device) requests.
Let’s start with the first case: doing periodic jobs in the background. We could be generating reports, calculating the average user rating for some content, etc. Doing this in a single-instance application was easy: we used some scheduling mechanism which would fetch data from the database, process it, store the results somewhere, and mark the database records as processed so they are not processed again on the next run.
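For reference, the single-instance version is trivial; here is a sketch using a plain ScheduledExecutorService, with the data-access calls left as placeholders:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Single-instance periodic job: fetch unprocessed records, process them,
// then mark them as processed so the next run skips them.
public class ReportJob {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(ReportJob::run, 0, 1, TimeUnit.HOURS);
    }

    private static void run() {
        // Placeholders for real data access:
        // List<Record> batch = repository.findUnprocessed();
        // batch.forEach(ReportJob::generateReport);
        // repository.markProcessed(batch);
        System.out.println("Running periodic report job");
    }
}
```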
Now, imagine that we simply run two instances of this application in parallel. What would happen? Both of them would probably fetch and process the same data, so we would have duplicate results. Even if the results just override each other with the same values, we have unnecessary CPU/RAM usage, but more often we end up with corrupted data due to race conditions or similar problems.
So what is the solution? There are sophisticated scheduling solutions that track jobs and do not allow parallel execution of the same job by multiple application instances. This solves the problem of race conditions and duplicated work, but it doesn’t scale well: if only one instance is allowed to do the job, it must do all of it, while the other instances might sit idle.
A more advanced technique is to somehow partition the data and let each node process a part of it. An example: we want to send weekly emails to our users. We could split this work among our nodes by partitioning on user ID with the formula worker_node = user_id % number_of_nodes. Now each worker sends emails only to the users whose IDs map to its node number in the formula above. This way the work is balanced among all nodes and no emails are duplicated.
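A tiny sketch of that partitioning formula, assuming each node knows its own index and the cluster size:

```java
// Assign each user to one worker node: worker_node = user_id % number_of_nodes.
public class WorkPartitioning {

    static int workerFor(long userId, int numberOfNodes) {
        return (int) (userId % numberOfNodes);
    }

    public static void main(String[] args) {
        int numberOfNodes = 3;   // assumed cluster size
        int thisNode = 1;        // index of the node running this code

        for (long userId : new long[]{10, 11, 12, 13}) {
            if (workerFor(userId, numberOfNodes) == thisNode) {
                System.out.println("Node " + thisNode + " sends email to user " + userId);
            }
        }
    }
}
```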
There is one problem with this approach. If worker_node is calculated (and stored) while inserting data into the database, and one node goes down before the data is processed, part of the data will remain unprocessed. The situation is slightly better if we calculate worker_node while fetching the data (as part of the WHERE clause), but even then it is tricky to track the number of live nodes to use in our formula. We have to implement some kind of heartbeat mechanism, define when we consider a node live or not, and then decide which of the nodes will process the heartbeats and how it will share the resolved number of nodes.
There is another technique which handles workload balancing much better than the traditional “periodic job fetching from the database”: message queuing. We store data in some kind of messaging system, usually an external one, but it could also be implemented as part of our application using some library/framework. Data might also be stored in the database for redundancy, but for processing it is fetched from the queues, not from the database. Message queuing frameworks support two models: queuing and publish-subscribe. In the first model, workers get data in a competing manner: data from the same queue is distributed among multiple workers. In the publish-subscribe model, all data from the same queue is delivered to every worker that subscribed to it. As you can guess, for our case the first model is more interesting.
In the Java world there is a standard for message queuing called JMS, and there are several implementations of it (mostly commercial).
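As a rough sketch of the competing-consumer (queuing) model using the classic JMS API, here with ActiveMQ assumed as the broker and a made-up queue name and URL:

```java
import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

// Competing consumer: run this on every instance and the broker will hand
// each message to exactly one of them.
public class QueueWorker {
    public static void main(String[] args) throws JMSException {
        ConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://broker.internal:61616");
        Connection connection = factory.createConnection();
        connection.start();

        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue queue = session.createQueue("email.jobs");
        MessageConsumer consumer = session.createConsumer(queue);

        consumer.setMessageListener(message -> {
            try {
                String body = ((TextMessage) message).getText();
                System.out.println("Processing job: " + body);
            } catch (JMSException e) {
                e.printStackTrace();
            }
        });
    }
}
```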
And finally, recently we have seen the rise of so-called streaming platforms, with Apache Kafka leading this wave. These solutions have many similarities with traditional messaging systems, but they are more advanced in terms of scalability, performance, and data durability. Using this kind of solution it is possible to handle huge amounts of data even with a relatively small number of nodes. However, mastering these systems takes a lot of time, since in the distributed systems world everything is an order of magnitude more complex than running everything on a single machine.
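For comparison, here is a rough sketch of a Kafka consumer; consumers that share a group id split the topic’s partitions among themselves, which gives the same competing-consumer behavior (the server address, group id, and topic are placeholders):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class StreamWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka.internal:9092");
        props.put("group.id", "report-workers");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println("Processing event: " + record.value());
                }
            }
        }
    }
}
```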
We could also talk about the challenges of running large clusters, monitoring, auto-scaling, etc., but let’s stop here. I hope you enjoyed reading this post, and I would very much appreciate your feedback.