【Design Pattern】Microservices - Circuit Breaker Pattern

Posted by 西维蜀黍 on 2021-06-25, Last Modified on 2022-12-10

Background

In the electrical or electronics domain, a circuit breaker is an automatically operated electrical switch designed to protect an electrical circuit. Its basic function is to interrupt current flow after a fault is detected. It can then be reset to resume normal operation once the fault is resolved.

Similarly, to protect our microservices from an excess of requests, it is better to interrupt communication between the front-end and the back-end as soon as a recurring fault has been detected in the back-end.

In his excellent book Release It!, Michael Nygard popularized the Circuit Breaker pattern to prevent this kind of catastrophic cascade.

Context

Services sometimes collaborate when handling requests.

When one service synchronously invokes another there is always the possibility that the other service is unavailable or is exhibiting such high latency it is essentially unusable. Precious resources such as threads might be consumed in the caller while waiting for the other service to respond. This might lead to resource exhaustion, which would make the calling service unable to handle other requests. The failure of one service can potentially cascade to other services throughout the application.

Example

  • We have four different services, namely Service A, B, C & D.
  • To serve a client request, Service A calls Service B, Service B calls Service C, and Service C calls Service D.
  • Now suppose Service D fails for some reason.
  • Calls made to Service D by the other services will block or time out, and the unresponsive Service D will remain in that state.
  • Therefore, Service C will continuously retry the connection to Service D, and so will Service B and Service A upstream.
  • This causes a cascading failure, and the entire application, or a part of it, becomes unavailable.

Problem

How to prevent a network or service failure from cascading to other services?

Solution

Circuit breakers allow your system to handle these failures gracefully. The circuit breaker concept is straightforward: it wraps a function with a monitor that tracks failures. The circuit breaker has 3 distinct states, Closed, Open, and Half-Open (a minimal code sketch follows the list below):

  • Closed – When everything is normal, the circuit breaker remains in the closed state and all calls pass through to the services. When the number of failures exceeds a predetermined threshold, the breaker trips and goes into the Open state (Closed -> Open).
  • Open – When the number of consecutive failures crosses a threshold, the circuit breaker trips, and for the duration of a timeout period all attempts to invoke the remote service fail immediately without actually calling the dependency services.
    • In this Open state, the circuit breaker returns an error, a default response, or a cached response instead.
  • Half-Open – After a timeout period, the circuit switches to a half-open state to test if the underlying problem still exists.
    • If a single call fails in this half-open state, the breaker is once again tripped (Half-Open -> Open).
    • If it succeeds, the circuit breaker resets back to the normal closed state (Half-Open -> Closed).
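
A minimal hand-rolled sketch of this state machine in Go (the type and field names such as Breaker, maxFailures, and openTimeout are illustrative, not taken from any particular library):

```go
package breaker

import (
	"errors"
	"sync"
	"time"
)

type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

var ErrOpen = errors.New("circuit breaker is open")

// Breaker wraps a call with failure tracking and the Closed/Open/Half-Open states.
type Breaker struct {
	mu          sync.Mutex
	state       State
	failures    int
	maxFailures int           // Closed -> Open once exceeded
	openTimeout time.Duration // how long to stay Open before probing
	openedAt    time.Time
}

func New(maxFailures int, openTimeout time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, openTimeout: openTimeout}
}

// Call executes fn under the breaker's supervision.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	// Open -> Half-Open once the timeout expires; otherwise fail fast.
	if b.state == Open {
		if time.Since(b.openedAt) < b.openTimeout {
			b.mu.Unlock()
			return ErrOpen
		}
		b.state = HalfOpen
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		// A failure in Half-Open, or too many failures in Closed, trips the breaker.
		if b.state == HalfOpen || b.failures > b.maxFailures {
			b.state = Open
			b.openedAt = time.Now()
		}
		return err
	}
	// Success: a successful probe in Half-Open (or any success in Closed) resets the breaker.
	b.state = Closed
	b.failures = 0
	return nil
}
```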

Circuit Breakers in CLOSED State

When the circuit breaker is in the CLOSED state, all calls go through to the Supplier Microservice, which responds without any latency.

Circuit Breakers in OPEN State

Once the number of timeouts or errors reaches a predetermined threshold, the circuit breaker trips to the OPEN state. In the OPEN state, the circuit breaker returns an error for all calls to the service without actually calling the Supplier Microservice. This behavior allows the Supplier Microservice to recover by reducing its load.

Circuit Breakers in HALF-OPEN State

The circuit breaker uses a monitoring and feedback mechanism called the HALF-OPEN state to know if and when the Supplier Microservice has recovered.

It uses this mechanism to make a trial call to the Supplier Microservice periodically to check whether it has recovered.

  • If the call to the Supplier Microservice still times out or returns an error, the circuit breaker remains in the OPEN state.
  • If the call returns success, the circuit switches back to the CLOSED state. Note that during the HALF-OPEN state, all other external calls to the service are still rejected with an error; only the trial call(s) are allowed through.

Note that

  • after the CB transitions to the Open state, it can either make one or more trial calls immediately to try to transition to the Half-Open state, or first wait for a while (controlled by a waiting duration) and then make the trial call(s).
  • say it makes the trial call(s) immediately; it can then either
    • make just one trial call, and once this call returns normally (no error or timeout), transition back to the Closed state
    • or make multiple trial calls, and only when the error/timeout count (or rate) of these trial calls is lower than the set threshold does the CB transition back to the Closed state (see the sketch after this list)
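
As a sketch of the second strategy, building on the State type from the sketch above (the function name and parameters are illustrative):

```go
// decideAfterTrials sketches the multi-trial-call strategy: after running
// trialCalls probes in Half-Open, close the breaker only if the observed
// failure rate is below maxFailureRate; otherwise trip back to Open.
func decideAfterTrials(trialFailures, trialCalls int, maxFailureRate float64) State {
	if float64(trialFailures)/float64(trialCalls) < maxFailureRate {
		return Closed // Half-Open -> Closed
	}
	return Open // Half-Open -> Open
}
```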

Why Circuit Breaker

Implementing the circuit breaker pattern adds stability and resiliency to a system, offering stability while the system recovers from a failure and minimizing the impact of this failure on performance. It can help to maintain the response time of the system by quickly rejecting a request for an operation that is likely to fail, rather than waiting for the operation to time out (or never return). If the circuit breaker raises an event each time it changes state, this information can be used to monitor the health of the part of the system protected by the circuit breaker, or to alert an administrator when a circuit breaker trips to the Open state.
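
As a sketch of that last idea (the hook name and signature are illustrative, not from a specific library), the breaker could invoke a callback on every state change so the transition can be fed into metrics or alerting:

```go
import "log"

// onStateChange is an illustrative hook invoked whenever the breaker changes
// state, e.g. wired into the Call method of the sketch above.
func onStateChange(name string, from, to State) {
	log.Printf("circuit breaker %q: state %v -> %v", name, from, to)
	if to == Open {
		// e.g. increment an "open breakers" gauge or page an operator here.
	}
}
```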

Issues and Considerations

Customization

When to Trigger Closed -> Open

  • The error count reaches a threshold within a time range (e.g., 1 second)
  • Or the error rate reaches a threshold within a time range

When to Trigger Half-Open -> Closed

  • (after waiting for a specified duration,) if the number of successful requests reaches a threshold within a time range, the breaker transitions Half-Open -> Closed (see the sketch below)
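
As a concrete sketch of how these thresholds are typically expressed, here is how they might be configured with the sony/gobreaker library, a common Go implementation (the service name, URL, and threshold values are illustrative):

```go
package main

import (
	"fmt"
	"net/http"
	"time"

	"github.com/sony/gobreaker"
)

func main() {
	cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
		Name:        "user-service",
		MaxRequests: 5,                // trial calls allowed through while Half-Open
		Interval:    10 * time.Second, // window over which Closed-state counts are cleared
		Timeout:     30 * time.Second, // how long to stay Open before probing (Half-Open)
		// Closed -> Open once the failure rate reaches 50% over at least 20 requests.
		ReadyToTrip: func(counts gobreaker.Counts) bool {
			failureRate := float64(counts.TotalFailures) / float64(counts.Requests)
			return counts.Requests >= 20 && failureRate >= 0.5
		},
	})

	// Wrap the downstream call; while the breaker is open, Execute fails fast
	// with gobreaker.ErrOpenState instead of calling the dependency.
	status, err := cb.Execute(func() (interface{}, error) {
		resp, err := http.Get("http://user-service.local/users/42") // illustrative URL
		if err != nil {
			return nil, err
		}
		defer resp.Body.Close()
		return resp.Status, nil
	})
	fmt.Println(status, err)
}
```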

Adaptive Concurrency Limits

Hystrix’s focus has shifted towards more adaptive implementations that react to an application’s real-time performance rather than pre-configured settings (for example, through adaptive concurrency limits).

https://netflixtechblog.medium.com/performance-under-load-3e6fa9a60581

https://github.com/Netflix/concurrency-limits
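
The linked post and library are Java-focused; as a toy Go illustration of the underlying idea (not the Netflix implementation), an additive-increase/multiplicative-decrease (AIMD) limiter raises the allowed concurrency slowly on success and cuts it sharply when latency or errors suggest overload:

```go
// aimdLimit is a toy additive-increase/multiplicative-decrease concurrency
// limit, meant only to illustrate the idea behind adaptive concurrency limits.
type aimdLimit struct {
	limit    float64
	minLimit float64
	maxLimit float64
}

// onResult adjusts the limit based on the outcome of one request.
func (l *aimdLimit) onResult(overloaded bool) {
	if overloaded {
		l.limit *= 0.9 // back off sharply on timeouts, errors, or high latency
	} else {
		l.limit += 1 // probe upwards slowly on success
	}
	if l.limit < l.minLimit {
		l.limit = l.minLimit
	}
	if l.limit > l.maxLimit {
		l.limit = l.maxLimit
	}
}
```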

Distributed Situation

In a multi-node (clustered) deployment, the state of the upstream service needs to be reflected across all the nodes in the cluster. Therefore, implementations may need to use a persistent storage layer, e.g. a network cache such as Memcached or Redis, or a local cache (disk- or memory-based), to record the availability of what is, to the application, an external service.
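
A minimal sketch of sharing the breaker state across nodes, assuming the go-redis client (the key name, TTL, and package name are illustrative):

```go
package cbstore

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// Sharing breaker state via Redis so every node in the cluster sees the same
// view of the upstream service's availability.
const stateKey = "cb:user-service:state"

func publishState(ctx context.Context, rdb *redis.Client, state string) error {
	// The TTL acts as a safety net: stale state expires if the writer dies.
	return rdb.Set(ctx, stateKey, state, 60*time.Second).Err()
}

func readState(ctx context.Context, rdb *redis.Client) (string, error) {
	state, err := rdb.Get(ctx, stateKey).Result()
	if err == redis.Nil {
		return "closed", nil // no record yet: assume the dependency is healthy
	}
	return state, err
}
```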

Example

Hystrix
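
Hystrix itself is a Java library; below is a sketch using hystrix-go, a Go port with a similar configuration surface (the command name and threshold values are illustrative):

```go
package main

import (
	"errors"
	"fmt"

	"github.com/afex/hystrix-go/hystrix"
)

func main() {
	hystrix.ConfigureCommand("get_user", hystrix.CommandConfig{
		Timeout:                1000, // per-call timeout in milliseconds
		MaxConcurrentRequests:  100,
		RequestVolumeThreshold: 20,   // minimum requests before the breaker can trip
		SleepWindow:            5000, // ms to wait in Open before a trial call
		ErrorPercentThreshold:  50,   // error rate that trips the breaker
	})

	err := hystrix.Do("get_user", func() error {
		// Call the downstream dependency here; returning an error counts as a failure.
		return errors.New("user service unavailable")
	}, func(err error) error {
		// Fallback executed when the call fails or the circuit is open.
		fmt.Println("falling back:", err)
		return nil
	})
	fmt.Println(err)
}
```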

Reference