

The art of Failure - Policy of Good Neighbours

Building software is easy, but building robust software is a different story due to all the variables we need to consider. What makes it hard is that some of these variables can be complete unknowns capable of triggering application crashes. In modern software development, we take these unknowns to a new level, even when we are building a simple application. That simple app is probably part of a complex system built on a microservice architecture. This type of architecture lets us break big challenges into smaller ones, but in doing so we also rely more on other services and do much more input/output work. These downstream services may use different technologies, be built by different people and behave differently under high load.
Regardless of the technology, the people or any other variable, one thing is certain: the software will fail. Failure is inevitable and, if someone does not believe it, they can just relax and wait. Failure is part of who we are and what we build. Failure is what makes us better, and software engineering is no different. The best approach is to accept failure and start talking about risk management: identify risks and do the necessary due diligence to recover from failure as fast as possible. On top of this, have proper metrics to guide further risk mitigation. Combine uptime (availability) with other reliability metrics such as MTTR (Mean Time To Recover) and/or MTTF (Mean Time To Failure).

Can you rely on your software?

Let's use a simple service as an example: it requires little computation but depends heavily on other services. Although simple, it is part of a complex system. We want our service to be as robust as possible. Robust is a nice word because I always look at it like this:

Operational, Up, Stable, Trustworthy, Bullet proof, Resilient

All of these words are good criteria to test and measure our service against, where the end goal is to have better software. Software that is more ROBUST.

We want our simple service to always be available, or at least be highly available. 100% availability is nearly impossible in some cases, or simply too expensive. This is where Operational and Up are mostly synonyms, because both represent availability. Availability can be described as the percentage of time a service is able to respond to requests.
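To make these availability percentages concrete, here is a quick back-of-the-envelope calculation (an illustrative Python sketch; the function name is mine, not from any library) showing how much downtime per year each extra "nine" allows:

```python
# Availability = uptime / (uptime + downtime).
# How many minutes of downtime per year does each availability target allow?
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def allowed_downtime_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted by a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability)

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} availability allows {allowed_downtime_minutes(target):.0f} min/year of downtime")
```

Roughly: 99% allows about 5,256 minutes (over three days) per year, 99.9% about 526 minutes, and 99.99% only about 53 minutes, which is why each extra nine gets dramatically more expensive.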

During that time we want it to be Stable and Trustworthy. These two point towards reliability. Reliability is closely related to availability, but a system that is available is not necessarily working correctly, and that is exactly what reliability is about.
Bullet proof and Resilient come in as another pair of synonyms. We want our service to adapt to different failure scenarios and, above all, not compromise the complex system it is part of. In this post I'm going to lean more towards resilience because it is deeply connected to withstanding failure.

We can see resilience as a way to improve both Availability and Reliability. As a disclaimer, we are leaving infrastructure resilience out of the discussion. Infrastructure can do a lot of heavy lifting in making everything more fault tolerant but it can't make the code run smarter. And that's what I intend to focus on in this post, changing the insides of our software to make it more resilient.

Understanding context 

Our simple service depends on other services to do its job. These dependencies represent network communication or, as mentioned before, input/output work for the service to handle. Let's focus on the upstream-to-downstream direction, where our service makes HTTP requests to its dependencies. Our service needs to get data from and/or submit data to other services in order to perform its main business transactions.

This scenario is pretty mainstream and a classic example in a microservice architecture. The simple service is part of a system that needs to be scalable and capable of dealing with unexpected bursts of requests, due to its exposure to several user-facing applications, which can also be crawled by robots over which we have no control.

We first need to understand that our service is part of a system; it's another gear in a machine that we want to be fully Operational. In order to do that, it relies on others being operational as well. But what if they aren't? Should the simple service blindly rely on chance? That's a question every software engineer should consider, sooner rather than later. Normally, though, this is not tackled early on and is sometimes classified as an optimization: first comes the feature, and only afterwards is it coded in the "right", bullet-proof way. We call this technical debt, which we all know is sometimes never paid.

If we don't manage our dependencies in some way and they start to fail, will the consumer or, in this case, the simple service start to fail as well? When we talk about failure here, we mean the type of failure that can make the dependency unresponsive: loss of network connectivity, TCP packet loss, rejected requests (rate limiting), deadlocks, lack of resources (memory, CPU, etc.). There can be multiple reasons, but for our simple service they all mean the same thing: the dependency is not accessible or may be down. At this point, what happens if incoming requests start to increase exponentially and queue up? Always think about this use case in extreme terms: if this is an e-commerce platform, consider how it would work during Black Friday. What would be the extent of the blast radius? If a dependency is down, would that represent a Single Point Of Failure (SPOF)?

Single point of failure

A SPOF is a "potential risk posed by a flaw in the design, implementation or configuration of a circuit or system in which one fault or malfunction causes an entire system to stop operating". So this means that if a part of a system fails, the whole system fails. This scenario is highly undesirable for a system that needs high availability and reliability. 

So, if a dependency is down, we don't want the whole system to go down. If our system is a ship and our services are tanks on that ship, if one tank is flooded we don't want the whole ship to sink. Don't be another Titanic!

In a microservice architecture, we have more guarantees that failures will be localized; however, their blast radius can still be severe. A failure may not bring the whole system down but can still take down a critical business flow, which can be seen as a SPOF in its own right. It's important to understand what hinders our business.

Mitigating SPOFs should be taken very seriously. Again, think about extremes such as a peak season or day. One hour of downtime can make a huge difference, and the damage to reputation can be far worse.

There's a concept I like to think of, which I call the policy of good neighbours, for dealing with failure in complex systems. This policy is all about maintaining a good relationship with our neighbours. We all want good neighbours, but we rarely choose which neighbours we get. I believe this is a great metaphor for a microservice architecture, which promotes autonomy and ownership. Each service is a house or property owned by a team that needs to communicate with other services (its neighbours). If there's a fire in one neighbour's house, no one wants it to spread to the others, taking down the block or even the entire neighbourhood (cascading failures leading to a SPOF). That's why a good policy in the right place can make the difference, even when we have misbehaving neighbours.
Next, I will cover a few resilience approaches that embody this policy of good neighbours, helping mitigate SPOFs and making services more robust.


Policy of good neighbours: "I’m not going to wait forever”

Timeouts are probably the number one step towards improving resilience. Using .NET Core as an example, at a very basic level we could use HttpClientFactory and configure the timeout.

Every downstream dependency in a service should have a configured timeout and should not rely on default timeouts, which are permissive by nature. Take HttpClient from .NET applications as an example: if no timeout is configured, the default of 100 seconds applies. No one wants to wait 100 seconds for a response, especially in a high-demand system. So, if you start to see response times of 100 seconds in your traces, you are probably hitting a default timeout.

It's the software engineer's job to configure a timeout. However, this is easier said than done and something everyone struggles with; there's no perfect or universally right timeout. Collaboration should exist in microservice architectures, so a valid approach is to meet with the team that owns the dependency service. It's their job to monitor it, so they are in the best position to make a recommendation. Naturally, a seasonal analysis of the dependency's response times (average, median and/or percentiles) can produce an answer based on evidence. Timeouts can still be readjusted along the way to address changes, issues or load in the overall system. If you are looking for a place to start, start with timeouts.
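The idea of never waiting forever can be sketched in a few lines. This is an illustrative Python stand-in for configuring `HttpClient.Timeout` in .NET (the function names and the simulated dependency are mine, not from any library): the caller sets an explicit deadline and gets a timeout error instead of hanging.

```python
import concurrent.futures
import time

def call_with_timeout(fn, timeout_seconds, *args, **kwargs):
    """Run fn, but give up after timeout_seconds instead of waiting forever."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args, **kwargs)
        try:
            return future.result(timeout=timeout_seconds)
        except concurrent.futures.TimeoutError:
            # Fail fast with a clear error rather than blocking the caller.
            raise TimeoutError(f"dependency did not answer within {timeout_seconds}s")

def slow_dependency():
    time.sleep(0.5)  # simulates a dependency crawling towards its default timeout
    return "response"
```

For example, `call_with_timeout(slow_dependency, 0.1)` raises `TimeoutError` instead of waiting for the dependency, while a fast call returns its result normally.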

Retry pattern

Policy of good neighbours: "I’ll call you back later”

The retry pattern is particularly useful when transient failures occur (momentary loss of network connectivity, temporary unavailability of a service, or timeouts that occur when a service is busy). Typically, repeating the action/request that triggered the failure, after a while, will most likely succeed. A good effort when applying this pattern is being able to recognize a failure as transient. This normally means checking the exception type that is returned, but it depends deeply on the richness of the metadata available, if any, when errors are returned. If it's too generic, it will be difficult to distinguish a transient error from any other error.

The main questions to consider in this pattern are:

1. How many retries should be attempted?
For a web application, an aggressive retry policy may be undesirable, so a small number of retries is preferable. It's better to fail as fast as possible and bubble up an exception so the user is aware. Batch operations are more suitable for a higher number of retries.
It's also important to zoom out a bit and understand whether different retry policies are chained together in one macro flow, meaning that service X may retry on the simple service, which will retry on services A, B and C. This chain can introduce undesired overhead and will increase response times.

2. How long should the invoker wait to attempt another retry?
The invoker could retry right away, or it can use a delay and increase it as more attempts fail. This is a concept known as "back-off", as the time between requests grows with each attempt (normally exponentially). The back-off approach is usually the better alternative, as it gives the service being called a little room to breathe in case it's busy.
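Both questions above, how many attempts and how long to wait, can be captured in a small helper. This is an illustrative Python sketch (the function and parameter names are mine, not Polly's API): it only retries the exception types it considers transient, and doubles the delay between attempts.

```python
import time

def retry_with_backoff(operation, max_attempts=3, base_delay=0.1,
                       transient_errors=(ConnectionError, TimeoutError),
                       sleep=time.sleep):
    """Retry `operation` on transient errors, doubling the delay each attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except transient_errors:
            if attempt == max_attempts:
                raise  # out of attempts: bubble the failure up to the caller
            # Exponential back-off: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

A non-transient exception escapes immediately, which is the "recognize a failure as transient" part: retrying a bad request or a validation error would just waste everyone's time.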

Circuit Breaker

Policy of good neighbours: "There’s definitely something wrong. I’m gonna stop calling you for a while and check back later”

The circuit breaker pattern is probably one of the most useful patterns to have in a microservice architecture and a prime example of how a service can cleverly adapt to failure scenarios and, at the same time, preserve/alleviate the whole system.

This pattern is helpful when failures look more severe and/or take longer to fix (beyond transient failures), and where it becomes evident that more requests to a failing service are pointless. This is a very important realization and a signal that something should change. Signal is the right word here because it points to electric circuits, where this pattern originates. We look at this request/response scheme as a circuit.

In a healthy flow, when there are no significant signs of failure, we say the circuit is in the "Closed" state. Each failure increments a counter. When a predefined failure threshold is reached, this signals a change in the circuit, and it switches to the "Open" state. In this state, calls to the failing service are stopped and an exception is returned immediately. This situation is kept for a certain amount of time, during which several aspects really shine:
  • We fail faster, meaning we take less time to send a response and have less impact on performance.
  • We preserve our own resources. Making requests holds critical resources like memory, threads and so on. By removing this work, we reduce the blast radius of the failing service.
  • We offload the failing service (fewer requests to process), which means we are collaborating and helping it get back on track. This obviously depends on the type of failure it is facing, but if the service is struggling and exhausting its resources, this offload can be a life saver.
All of the above aspects combined show how services can collaborate and help each other in preventing cascading failures. Connecting once more with the policy of good neighbours: if there's a fire, this basically means we are helping our neighbour put it out. Definitely a good neighbourhood.

Services can also exit the "Open" state on their own. As mentioned, this state is kept for a certain amount of time: a timer starts on the transition to "Open", and once it goes off, the circuit transitions to the "Half Open" state. Here we go into a sampling phase again, but this time we are trying to collect the opposite: successful calls. The purpose of this state is to understand whether the failing service has recovered so normal behaviour can resume. For that, we need to try calling the service again. If it fails, we revert to the "Open" state. If the call succeeds, a success counter is incremented; once the success threshold is reached, the circuit goes back to the "Closed" state (normal flow).
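The state machine described above fits in a short class. This is a minimal illustrative Python sketch, not Polly (the class, thresholds and state names mirror the description above, but every identifier is mine); a pluggable clock makes the open-timer testable.

```python
import time

class CircuitBreaker:
    """Closed -> Open after `failure_threshold` consecutive failures;
    Open -> Half-Open once `open_seconds` have elapsed; Half-Open -> Closed
    after `success_threshold` consecutive successes, or back to Open on failure."""

    def __init__(self, failure_threshold=3, success_threshold=2,
                 open_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.open_seconds = open_seconds
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = None

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.open_seconds:
                self.state = "half-open"  # timer expired: probe the dependency
                self.successes = 0
            else:
                raise RuntimeError("circuit open: failing fast")  # no call made
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"       # stop calling the failing dependency
                self.opened_at = self.clock()
                self.failures = 0
            raise
        if self.state == "half-open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"     # dependency recovered: resume normal flow
        else:
            self.failures = 0             # a success resets the failure streak
        return result
```

While open, `call` raises immediately without touching the dependency, which is exactly the fail-fast and resource-preserving behaviour listed above.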

For .NET applications, Polly is a great candidate for implementing this pattern, as well as the previous ones.

Ambassador pattern

Policy of good neighbours: "I’m going to delegate all my calls to an expert”

Most of the time, patterns such as the previous ones are implemented in the same codebase as the service itself, sitting alongside the main business logic. The Ambassador pattern is an interesting way to abstract these patterns and move that logic elsewhere. The ambassador acts as a proxy between the service and other remote services, handling client connectivity work so that the application code can concentrate on important business logic and shine where it matters most. With this approach, retries and circuit breaking become abstracted, simplifying how the code deals with data.

The Ambassador proxy can be deployed independently to an external process running in the same hosting environment.
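As a rough sketch of the separation of concerns (illustrative Python, in-process rather than a real sidecar; the class, transport and path are all hypothetical), the business code only sees a plain call while the ambassador owns the resilience work:

```python
class Ambassador:
    """In-process stand-in for an ambassador/sidecar proxy.
    Business code calls `get`; connectivity concerns live here, out of its way."""

    def __init__(self, transport, max_attempts=3):
        self.transport = transport      # the raw call to the remote service
        self.max_attempts = max_attempts

    def get(self, path):
        last_error = None
        for _ in range(self.max_attempts):
            try:
                # Timeouts and circuit breaking would also be applied here.
                return self.transport(path)
            except ConnectionError as err:
                last_error = err        # transient failure: try again
        raise last_error

# Business code depends only on the ambassador, not on resilience details.
def product_name(ambassador):
    return ambassador.get("/products/42")["name"]
```

`product_name` contains zero retry or timeout logic; swapping the in-process ambassador for an external sidecar process would not change it at all.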

Ops toggles

Policy of good neighbours: "I’m going to stop calling you until further notice”

Ops toggles are flags normally used to introduce new features in production when there's some uncertainty about whether they will behave properly. These toggles are usually short-lived while confidence in the feature grows. Ops flags can also be used as true "kill switches" that support the business in the long term. These kill switches are a fast and convenient way to shut down non-vital functions whenever the system is struggling, such as under high load.

In a system, it's important to identify what is vital and what's not, so we are able to preserve what is vital under extreme circumstances. To do that, we need to be able to switch off what is not essential before it takes down or severely disrupts a critical business flow. It helps to think about our software in layers, and to be able to switch off a layer whenever necessary.

A service that handles requests executes its main purpose, the critical/vital flow (the red layer in the diagram above), and is then capable of augmenting its response with logic that belongs to other layers. If these layers are programmed flexibly, that augmentation becomes optional and needs to pass a sort of authorization in order to be executed. That authorization control can be exported outside the service itself, to a control room that manages all sorts of ops toggles, with a toggle for each layer. We could switch off a layer for several reasons:
  • The layer is processing wrong data and needs to be stopped immediately. Responses will not be augmented, and the wrong data will not be surfaced.
  • The layer communicates with another service that is down, with no expectation of when it's coming back online.
Switching off a layer means the flow continues to the other layers, where the same principle applies.

Of course, it also depends on how the code is structured and how tangled the business logic is. This layered approach can be implemented in multiple ways: for instance, dependency injection together with a Strategy pattern, where a dummy strategy is injected when the layer is switched off. It could also be as simple as a pipeline pattern where we skip an operation (the work of a non-vital layer) and progress to the next one.
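The Strategy-based variant can be sketched briefly. This is an illustrative Python example (the layer names, toggle key and response fields are all hypothetical): when a toggle is off, a no-op strategy is injected, so the vital flow still runs and the response is simply not augmented.

```python
class PersonalizationLayer:
    """Non-vital layer: augments the response when its toggle is on."""
    def apply(self, response):
        # In a real system this would call the personalization service.
        response["recommendations"] = ["item-1", "item-2"]
        return response

class NoOpLayer:
    """Dummy strategy injected when the layer is switched off."""
    def apply(self, response):
        return response

def build_layers(toggles):
    """Pick a real or no-op strategy per layer based on ops toggles."""
    on = toggles.get("personalization", True)
    return [PersonalizationLayer() if on else NoOpLayer()]

def handle_request(toggles):
    response = {"product": "widget"}      # the vital flow always runs
    for layer in build_layers(toggles):   # non-vital layers are optional
        response = layer.apply(response)
    return response
```

Flipping the `personalization` toggle to `False` in the control room removes the augmentation without touching the critical flow or redeploying the service.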
We can take this concept even further.

Naturally, in a microservice architecture, these flags can benefit multiple services, which can be bound to the same toggle. In this example, if the personalization service is down, we can switch off all the layers that consume it, spread across multiple services. This is an orchestrated, distributed and coordinated way to immediately remove disturbance from the system and let operations continue without unnecessary pressure or waste of precious resources.

Let's Continue!
