If you read my article about semantic diffusion in microservices, you might recognize the title. This article is a continuation of sorts, and its aim is to emphasize that having microservices and benefiting from them is possible only when we put enough effort into handling both the organizational and the distributed computing issues we are about to face. In the following paragraphs you will find what we get from real microservices, and what they take from us as well. You won’t find any concrete solutions here, but rather a high-level overview of how many different and complex problems we need to solve before we go for microservices. Read on!
One of the main advantages of microservice architecture is that each microservice is an autonomous unit. What does autonomy mean? Well, we can analyze it from numerous perspectives.
First of all, an autonomous service can be deployed at will, independently of other services. The basic way to provide autonomy is to clearly separate business capabilities, so that our bounded contexts don’t leak into other services and create tight coupling. An autonomous microservice is characterized by high cohesion. Cohesion is a term adapted from physics, and in this context high cohesion means that all parts of the application (expressed in source code) are very strongly connected, and it takes a lot of energy to separate any one of them from the rest.
Autonomy is also about technology. Of course, while developing your applications you should choose technologies wisely, so that they are aligned with the company’s competences and strategy. However, you should be able to create applications with a technology stack that differs from that of other services. It means that you cannot bind your interfaces to a particular technology: your API should be technology agnostic.
Another thing worth mentioning is that microservices are not only about the code; they are about databases as well. Firstly, each microservice needs to manage the data it uses on its own. Secondly, even if you identify bounded contexts perfectly, if some of your services use the same database (schema), your applications will still be coupled, you won’t be able to deploy them independently, and in case of a database failure all of them will be unavailable.
There are two main reasons to scale our application. The first one is that we want to serve a bigger load. The second one is the resilience that we want to provide in case of failure. In a monolithic architecture we need to scale the whole system with all its components. It is often a pain that the only way to scale a monolith is vertical scaling, by adding more CPU or memory. With microservices, by contrast, we have autonomy, which gives us the ability to scale applications horizontally, by adding new nodes independently and even automatically.
I already mentioned resilience in the previous paragraph, but I need to add that apart from providing resilience in the scope of a single service, autonomy gives us the possibility to isolate problems and errors within particular services, so that other ones are not affected, and thus, the system can still work.
Last but not least, autonomy is also about organizational culture. Each microservice should be managed and developed by a single agile team that takes full responsibility for its whole lifecycle. In order to deliver the best software possible, the team needs to have the power to decide how people within the team will work.
You don’t get it for free
Tradeoffs: they are everywhere. Microservices can give us a lot of benefits. Both business and tech people can drink from the chalice as long as they make the sacrifice. Below you will find the things you need to provide in order to call your architecture a microservice one.
In a traditional environment, where applications run on physical hardware and their locations are relatively static, services may communicate with each other using simple, file-based configuration with predefined URLs. However, in a cloud-based microservices infrastructure, where a service can have multiple instances with dynamically assigned resources (especially hosts), relying on a static configuration won’t be sufficient. In such situations, services should discover each other dynamically using their identifiers (codes/names). And that is what service discovery is all about. In this area you have two options: server-side service discovery and client-side service discovery. In both cases each service instance registers itself in a service registry. In the former case calls are directed through a load balancer that queries the registry and forwards requests to the target instances that are currently available. Functionality like this is provided by platforms like AWS or Kubernetes. In the latter case, though, each application queries the registry itself, and it is up to this application to choose the proper instance to call (client-side load balancing). A good example of this is Eureka. If you don’t use any of the mentioned options, you have neither scalability nor autonomy, as adding a new instance of a dependent service or deploying it on some other host implies configuration changes in every application that communicates with it.
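To make the client-side variant a bit more tangible, here is a minimal sketch in Python. It is not how Eureka or any other real registry is implemented: the `ServiceRegistry` and `DiscoveryClient` classes, the heartbeat TTL and the addresses are all made up for illustration, and the registry lives in memory instead of behind a network API.

```python
import random
import time


class ServiceRegistry:
    """Hypothetical in-memory registry; a real one runs as a separate service."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        # service name -> {instance address: time of last heartbeat}
        self._instances = {}

    def register(self, name, address):
        """Called by an instance on startup and again on every heartbeat."""
        self._instances.setdefault(name, {})[address] = self._clock()

    def deregister(self, name, address):
        """Called by an instance that shuts down gracefully."""
        self._instances.get(name, {}).pop(address, None)

    def lookup(self, name):
        """Return the addresses whose last heartbeat is within the TTL."""
        now = self._clock()
        return sorted(
            address
            for address, seen in self._instances.get(name, {}).items()
            if now - seen <= self._ttl
        )


class DiscoveryClient:
    """Client-side discovery: the caller asks the registry and picks an instance."""

    def __init__(self, registry, rng=None):
        self._registry = registry
        self._rng = rng or random.Random()

    def resolve(self, name):
        candidates = self._registry.lookup(name)
        if not candidates:
            raise LookupError(f"no live instances of {name!r}")
        # Random choice is the simplest client-side load balancing strategy.
        return self._rng.choice(candidates)


registry = ServiceRegistry(ttl_seconds=30.0)
registry.register("orders", "10.0.0.5:8080")
registry.register("orders", "10.0.0.6:8080")
client = DiscoveryClient(registry)
print(client.resolve("orders"))  # one of the two registered addresses
```

The caller only knows the name "orders". Instances can come and go, and an instance that stops sending heartbeats silently drops out of the lookup result, so no configuration file has to change.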
Scalability and load balancing
Horizontal scaling is not only about increasing the number of service instances. It must also be transparent to service consumers (you shouldn’t have to do anything when the instance count changes), so you need to use either client-side or server-side load balancing, so that the traffic can get to new instances without any extra manual work. When we use platforms like AWS or Kubernetes, we can scale our applications automatically according to configured policies.
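What might such a policy look like? A common idea is target tracking: keep a metric (say, average CPU utilization per instance) close to a target value. The sketch below uses the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current replicas × current metric / target metric), and leaves out details such as tolerance and stabilization windows. The function name and the default limits are my own assumptions.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Proportional scaling rule, clamped to the configured limits."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 4 instances at 90% average CPU with a 60% target: scale out.
print(desired_replicas(4, current_metric=90, target_metric=60))  # 6
# 4 instances at 20% average CPU with a 60% target: scale in.
print(desired_replicas(4, current_metric=20, target_metric=60))  # 2
```

The point is that nobody edits an instance count by hand: the platform recomputes it from the metric, and the load balancer picks up the new instances.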
Design for failure
We cannot avoid failures. It is not always about our code: it might be a network or hardware failure as well, or too many requests saturating the CPU or memory of our components. A failure in a monolithic application usually means total unavailability. In the enterprise systems that I worked with, resilience was improved by horizontal scaling, but if some component was erroneous, the error eventually occurred in all instances and couldn’t be easily isolated. In general, prevention is not as vital as coping with failure when it occurs, and this is what changes our mindset, especially in the context of microservice architecture, where the number of components that may fail is much higher.
The topics that we have already covered should give you some idea of how to handle failures, but let’s systematize it. Regardless of the reason, if the application that we are trying to communicate with is not responsive, we can always scale it. Then we will be able both to serve bigger traffic and to stay resilient in case of a failure. Sometimes, though, we have limited resources and it is impossible to scale our application. What then? In order not to slow down our system as a whole, we should set up timeouts. Retrying with a random backoff is also a good idea, so that we won’t overload the target application with retries. We should also think about isolating failures through mechanisms like a circuit breaker, so that we prevent the failure from cascading up to clients. My intention is not to describe all possibilities but rather to emphasize how difficult it is to deal with failures, especially when they are unavoidable.
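To give a feeling for what these mechanisms involve, here is a rough sketch of a circuit breaker combined with retries that use exponential backoff with random jitter. All names, thresholds and delays are assumptions chosen for illustration; in a real system you would rather reach for a proven resilience library. The timeout itself is assumed to be applied inside the wrapped call, for example through the timeout setting of your HTTP client.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the target."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("circuit is open, failing fast")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random):
    """Random delay between 0 and min(cap, base * 2**attempt) seconds."""
    return rng.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(func, attempts=3, sleep=time.sleep, **backoff):
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # do not hammer a target the breaker has given up on
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(backoff_delay(attempt, **backoff))
```

A caller would wrap a remote call roughly like `call_with_retries(lambda: breaker.call(fetch_orders))`, where `fetch_orders` is a hypothetical function that performs the request with a timeout. Even this toy version has three states, two tunable thresholds and a retry budget, which says a lot about how much there is to get right.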
Almost every article about microservices, especially in terms of databases, contains a few words about the CAP theorem. It couldn’t be any different here. As we all know, software development is about tradeoffs. In our everyday work we make decisions that sacrifice one thing to give us another, the one that is more valuable to us in a given context. We face the same choice with Consistency, Availability and Partition tolerance. If, for example, due to some network failure we cannot replicate data among database nodes, we need to decide what to sacrifice: availability or consistency. If we want to handle requests (keep availability) despite the failure, we must know that we are working in an inconsistent (eventually consistent) environment until the problem is solved. If we cannot accept inconsistency at any moment, we need to reject requests, that is, give up availability. The vital thing about the CAP theorem is that some parts of our system might be AP and some CP: we can make the decision on a component level.
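The choice can be illustrated with a deliberately naive toy model: two replicas that normally replicate every write synchronously, and a flag that simulates a network partition. The `Replica` class and its CP/AP modes are invented for this sketch, and conflict resolution is ignored entirely (pending writes are simply replayed on the peer), which a real datastore cannot afford to do.

```python
class Unavailable(Exception):
    """Raised by a CP replica that refuses a write during a partition."""


class Replica:
    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.data = {}
        self.peer = None
        self.partitioned = False  # True while the peer cannot be reached
        self.pending = []         # writes accepted locally, not yet replicated

    def write(self, key, value):
        if self.partitioned:
            if self.mode == "CP":
                # Consistency first: refuse rather than diverge from the peer.
                raise Unavailable("cannot replicate, rejecting write")
            # Availability first: accept now, replicate when the partition heals.
            self.pending.append((key, value))
        else:
            self.peer.data[key] = value  # synchronous replication
        self.data[key] = value

    def heal(self):
        """The partition is over: ship the pending writes to the peer."""
        self.partitioned = False
        for key, value in self.pending:
            self.peer.data[key] = value
        self.pending.clear()


def connected_pair(mode):
    left, right = Replica(mode), Replica(mode)
    left.peer, right.peer = right, left
    return left, right


a, b = connected_pair("AP")
a.partitioned = True
a.write("stock", 7)
print(a.data, b.data)  # {'stock': 7} {}
a.heal()
print(b.data)          # {'stock': 7}
```

Between the write and the call to `heal()` the two AP replicas disagree, which is exactly the eventually consistent window described above. The same write against a CP pair raises `Unavailable` instead.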
When your system consists of a single application residing on one or just a few nodes, you can pretty easily locate log files and look into them in case of any problems. If there are no errors, but your users experience very long response times, then you can probably profile your application and look for bottlenecks in one place. In a distributed environment things get complicated. First of all, you have dozens of applications, each running in several instances on different nodes, which are very often assigned dynamically, so finding the place where something went wrong is a very tough task. Secondly, if it is not about an error that you can find in the logs but about responsiveness, then finding the culprit is even harder. So with microservice architecture we must accept and handle the complexity in terms of system monitoring.
Now, the first thing you should provide is a central system that aggregates logs from all application instances and enables you to analyze them in one place. A good example would be the ELK stack (Elasticsearch, Logstash, Kibana). Secondly, you should also provide a tracing solution, so that you can find out exactly which request caused a problem. In this field an example of a fantastic solution is Spring Cloud Sleuth, which can be easily enhanced with Zipkin, which helps you analyze and visualize latencies and the dependencies among services in your infrastructure. With this set of tools you can easily find out which part of your system creates the bottleneck. When we talk about application logs, we think about finding the source of an error that has already occurred. In microservice architecture, real-time monitoring of hosts, CPUs, memory, network latencies, response times, etc. is priceless as well. Using tools like Grafana + Graphite and properly configuring your applications, you can easily visualize all those metrics. By setting proper threshold values you can trigger an alarm and react to it before something really bad happens.
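The core idea behind tracing is simple: every incoming request carries an identifier, each service writes it into every log line, and passes it on with every outgoing call. The sketch below shows that idea with Python's standard `logging` and `contextvars` modules. It is not how Sleuth or Zipkin work internally, and the `X-Trace-Id` header name is made up; real tracers define their own propagation headers.

```python
import contextvars
import io
import logging
import uuid

# Holds the trace ID of the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copies the current trace ID onto every log record."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True


def build_logger(name, stream):
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(name)s trace=%(trace_id)s %(message)s"))
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, incoming_headers):
    # Reuse the caller's trace ID if present, otherwise start a new trace.
    trace_id = incoming_headers.get("X-Trace-Id") or uuid.uuid4().hex
    token = trace_id_var.set(trace_id)
    try:
        logger.info("request received")
        # Headers to attach to any outgoing call, so the next service logs the same ID.
        return {"X-Trace-Id": trace_id}
    finally:
        trace_id_var.reset(token)


stream = io.StringIO()  # stands in for the central log store
orders = build_logger("orders", stream)
payments = build_logger("payments", stream)
outgoing = handle_request(orders, {})  # request enters the system
handle_request(payments, outgoing)     # downstream call carries the same ID
print(stream.getvalue())
```

Both log lines end up with the same `trace=` value, so once the logs are aggregated in one place, a single search by trace ID shows the whole path of a request across services.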
All of this might seem optional in a microservice environment. One may say “I can search for logs in each instance. It takes some time but I can deal with it”. If you have 3 applications then it might work, but if you have dozens of them, the amount of time that you spend looking for problems will eventually outweigh all the other benefits you gain from microservice architecture, as the system would be completely unmaintainable. We need to agree that microservices bring a lot of complexity in the context of system monitoring, but this is something we literally need to provide.
Yet another characteristic of microservice architecture is that when you have small, independent applications, you can introduce changes much quicker, and they have a much smaller impact on the whole system compared to the monolithic approach. That means that you should be ready to deploy features as quickly as you develop them. The first step to achieve this is to use so-called Continuous Integration, so that every change you introduce into your codebase is automatically verified for integrity with the already existing version: does your code compile? Do your tests pass? Is the static code analysis result positive? These and maybe more conditions are checked within your build pipeline. This is the basis of continuous delivery. By delivery I mean producing some artifact that is a potential release candidate and can be safely deployed to production. It might be a .jar file or an even more platform-specific creature, like a Docker image for example. The reason for this is that in microservice architecture we need to respond to changes quickly and be ready to deploy them right after pushing the code to the repository.
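Stripped of all tooling, a build pipeline is a sequence of gates in front of the artifact. The sketch below models just that; the step names, the lambda checks and the artifact name are placeholders, and in a real pipeline each check would invoke your build tool, test runner or static analyzer.

```python
def run_pipeline(steps, build_artifact):
    """Run the checks in order; build the artifact only if every check passes."""
    for name, check in steps:
        if not check():
            # Stop at the first failing gate: no release candidate is produced.
            return {"status": "failed", "failed_step": name, "artifact": None}
    return {"status": "passed", "failed_step": None, "artifact": build_artifact()}


result = run_pipeline(
    steps=[
        ("compile", lambda: True),
        ("unit tests", lambda: True),
        ("static analysis", lambda: True),
    ],
    build_artifact=lambda: "orders-service-1.4.2.jar",  # hypothetical artifact name
)
print(result["status"], result["artifact"])  # passed orders-service-1.4.2.jar
```

The important property is that the artifact exists only when every gate is green, so anything that comes out of the pipeline is a release candidate by construction.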
Of course, sometimes it is not so easy to deploy our changes to production. There might be some regulations and processes involving manual work, like user acceptance testing, or just someone in charge clicking the “deploy” button, but this is not how it should look, as development teams should be the part of the company that is responsible for the services throughout their whole lifecycle. Drawing a line between developers, testers, and people in charge is not healthy, but in terms of delivery, we developers should be ready when the call comes.
Microservices have been very popular for a couple of years, and there are strong reasons for that. The autonomy that concerns organizational culture, technology, deployment, data management, scalability and resilience brings a lot of value for both technical and business people, but at the same time it requires a lot of effort to reach it. Service discovery, load balancing, design for failure, monitoring and continuous delivery are the very base we need to have, and it is not that cheap after all. Before we go for microservices we need to be aware of all these things. I hope that after reading this article you will be able to say with confidence whether your infrastructure gives you full microservice autonomy, or whether you have just another distributed system. And please, care more for the words, and be pragmatic in your day-to-day work.