【Engineering】Service Downgrade

Posted by 西维蜀黍 on 2021-10-19, Last Modified on 2024-01-15

Background

When an issue comes from our own service (for example, a bug in our code), we usually need to fix it ourselves, and quickly. When the issue comes from an external party, we usually cannot fix it directly and have to wait for another team to fix it. What we can do in the meantime is downgrade, to reduce the impact of the issue. For example, if the voucher recommendation service has an issue, we can downgrade the API call to it, so that we no longer get errors from that service. Without the downgrade, an error from the voucher recommendation service makes us return an error to the buyer; with the downgrade, we skip the voucher recommendation altogether. As a result, the buyer might not get the best voucher, but that is better than not being able to check out at all.
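As a rough illustration of the voucher example, here is a minimal sketch of a downgrade switch. The names `DOWNGRADE_SWITCHES`, `recommend_voucher`, and `checkout` are hypothetical, and a real system would probably read the switch from a configuration service rather than a module-level dict.

```python
from typing import Callable, Optional

# Hypothetical in-memory switch store (assumption: real switches live in a config service).
DOWNGRADE_SWITCHES = {"voucher_recommendation": False}


def recommend_voucher(fetch: Callable[[], str]) -> Optional[str]:
    """Return the recommended voucher, or None when the dependency is downgraded."""
    if DOWNGRADE_SWITCHES["voucher_recommendation"]:
        # Downgraded: skip the call entirely, so its errors never reach the buyer.
        return None
    return fetch()


def checkout(fetch: Callable[[], str]) -> dict:
    """Checkout succeeds with or without a voucher; errors from fetch propagate."""
    return {"status": "ok", "voucher": recommend_voucher(fetch)}


def broken_voucher_service() -> str:
    """Stand-in for a failing voucher recommendation service."""
    raise RuntimeError("voucher recommendation service is down")
```

With the switch off, `checkout(broken_voucher_service)` raises; after setting `DOWNGRADE_SWITCHES["voucher_recommendation"] = True`, the same call returns a successful checkout without a voucher.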

Downgrade (Service Degradation)

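  • Concept: Service downgrade generally means that, when server load surges, some services and pages are deliberately not processed, or are processed in a simpler way, based on actual business usage and traffic. This frees server resources so that core business keeps running normally and efficiently.
  • Reason: Server resources are limited, while requests are effectively unbounded. During peak usage (high concurrency), overall service performance suffers; in severe cases servers go down and important services become unavailable. To keep core features available during peaks, some services have to be downgraded. Think of it as sacrificing the minor to protect the major.
  • Application scenarios: Mostly used in microservice architectures, typically when the overall load of the system exceeds a preset upper threshold (which depends on server configuration and performance), or when upcoming traffic is expected to exceed that threshold (for example Double 11 and 6.18 promotions, or flash sales).
  • Service downgrade is considered from the load of the whole system. When load is expected to be high, to prevent certain functions (business scenarios) from being overloaded or slow, the system temporarily gives up requests to some non-core interfaces and data and directly returns a prepared fallback (error-handling) response. The service provided is degraded, but the stability and availability of the whole system are preserved.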

Categories

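  • Switch downgrade

  • Circuit-breaker downgrade

    • Based on failure count: mainly for unstable APIs. When the number of failed calls reaches a threshold, the API is downgraded automatically; an asynchronous mechanism is also needed to probe whether it has recovered.
    • Based on error rate
    • Based on timeout: configure the timeout, the number of retries on timeout, and the retry mechanism, and use an asynchronous mechanism to probe whether the API has recovered.
  • Rate-limit downgrade (Rate Limits Downgrade): during flash sales or rushes on purchase-limited items, the system may crash under the sheer volume of traffic, so rate limiting is used to cap it. Once the rate-limit threshold is reached, subsequent requests are downgraded. Possible handling after the downgrade: a queue page (send users to a waiting page and let them retry shortly), out of stock (tell users directly that the item is sold out), or an error page (for example, "This event is too popular, please try again later").

  • Failure downgrade: for example, when the remote service being called is down (network failure, DNS failure, an HTTP service returning an error status code, an RPC service throwing an exception), the call can be downgraded directly.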
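The failure-count variant of the circuit-breaker downgrade can be sketched roughly as below. This is only an illustration under stated assumptions: the class name `CircuitBreaker` and its parameters are hypothetical, and instead of the asynchronous recovery probe mentioned above, the sketch lets one real request through as a probe once `probe_interval` seconds have passed, which is simpler but not the same mechanism.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Downgrade an unstable call after too many consecutive failures (hypothetical sketch)."""

    def __init__(self, failure_threshold: int = 3, probe_interval: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.probe_interval = probe_interval
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means "not downgraded"

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.probe_interval:
                # Downgraded: do not call the dependency at all.
                return fallback()
            # Probe interval elapsed: let this one request through as a recovery probe.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Threshold reached (or the probe failed): (re)start the downgrade window.
                self.opened_at = self.clock()
            return fallback()
        # Success: the dependency is healthy again, so reset all state.
        self.failures = 0
        self.opened_at = None
        return result
```

An error-rate variant would track failures over a sliding window of recent calls instead of a consecutive count.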

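Handling options after a downgrade include: default values (for example, if the inventory service is down, return "in stock" by default), fallback data (for example, if the ad service is down, return static pages prepared in advance), and cache (previously cached data).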

Dependency API Downgrade

API Call Latency Downgrade

An API timeout is the amount of time we wait for an API response before we decide that the request has failed. When we call an API, there is a delay before we get the response: the request has to travel through the network to the downstream API process, that process has to handle the call, and the response has to travel back through the network to us. In networking there is no guarantee that a message will reach its recipient. Messages can be lost in the network, and the recipient's process can crash. Because of that, we cannot wait forever and expect the response to eventually arrive, which is why we need a timeout. Suppose we have a 100ms timeout: we wait up to 100ms for the response, and if 100ms passes without one, we assume the API call has failed.
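To make the timeout concrete, here is a minimal sketch using Python's standard `concurrent.futures`. The helper `call_with_timeout` and the stand-in `slow_api` are hypothetical, and returning a fallback on timeout is an assumption about how the latency downgrade would be wired up. Note that the abandoned call keeps running in its worker thread; the sketch only stops waiting for it.

```python
import concurrent.futures
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Shared worker pool; a pool per call would block on shutdown until the slow call ends.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func: Callable[[], T], timeout_s: float, fallback: T) -> T:
    """Wait at most timeout_s seconds for func; on timeout, treat the call as failed."""
    future = _executor.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # No response within the deadline: assume failure and use the fallback.
        return fallback


def slow_api() -> str:
    """Stand-in for a downstream API that answers after 300ms."""
    time.sleep(0.3)
    return "late response"
```

With a 100ms timeout, `call_with_timeout(slow_api, 0.1, "fallback")` returns the fallback, while a call that answers in time returns its real response.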

API Soft/Hard Downgrade

Mode            | Is the API called? | When error        | When OK
normal          | yes                | return error      | return the response
hard downgrade  | no                 | return fallback every time
soft downgrade  | yes                | return fallback   | return the response

Some APIs have soft downgrades. Downgrading an API (a hard downgrade) means we skip the API call; soft downgrading an API does not. With a soft downgrade we still call the API, and only when it returns an error do we fall back to some alternative logic.

For example, let's say we call the shipping discount shared service to calculate the shipping discount at checkout. Normally, when the shipping discount shared service returns an error, the checkout request returns an error as well. If we downgrade the shipping discount shared service, we do not call it at all, and the shipping discount is automatically set to 0. If we soft downgrade it instead, we still call the service: if it returns a non-error response, we use that response, and if it returns an error, we fall back to the no-discount logic.
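The three behaviours can be sketched as a single function. This is an illustrative sketch only: `Mode`, `shipping_discount`, and `NO_DISCOUNT` are hypothetical names, and `call_service` stands in for the real call to the shipping discount shared service.

```python
from enum import Enum
from typing import Callable


class Mode(Enum):
    NORMAL = "normal"
    SOFT = "soft downgrade"
    HARD = "hard downgrade"


NO_DISCOUNT = 0  # the fallback value: no shipping discount


def shipping_discount(mode: Mode, call_service: Callable[[], int]) -> int:
    """Resolve the shipping discount according to the current downgrade mode."""
    if mode is Mode.HARD:
        # Hard downgrade: never call the service, always use the fallback.
        return NO_DISCOUNT
    try:
        return call_service()
    except Exception:
        if mode is Mode.SOFT:
            # Soft downgrade: the call was made, but its error is replaced by the fallback.
            return NO_DISCOUNT
        raise  # Normal: the error propagates and checkout fails.
```

A soft downgrade keeps the real discount whenever the service is healthy, at the cost of still sending traffic to a dependency that may be struggling; a hard downgrade removes that traffic entirely.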

Middleware Downgrade

  • Redis
  • MySQL

Reference