【Distributed System】High Availability

Posted by 西维蜀黍 on 2020-10-14, Last Modified on 2023-09-29

High availability

High availability (HA) is a characteristic of a system which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period.

Modernization has resulted in an increased reliance on these systems. For example, hospitals and data centers require high availability of their systems to perform routine daily activities. Availability refers to the ability of the user community to obtain a service or good, access the system, whether to submit new work, update or alter existing work, or collect the results of previous work. If a user cannot access the system, it is – from the user’s point of view – unavailable.

Generally, the term downtime is used to refer to periods when a system is unavailable.

Principles

There are three principles of systems design in reliability engineering which can help achieve high availability.

  1. Elimination of single points of failure. This means adding or building redundancy into the system so that failure of a component does not mean failure of the entire system.
  2. Reliable crossover. In redundant systems, the crossover point itself tends to become a single point of failure. Reliable systems must provide for reliable crossover. Examples of crossover points include DNS resolution and load balancers.
  3. Detection of failures as they occur. If the two principles above are observed, then a user may never see a failure – but the maintenance activity must.
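The third principle, detecting failures as they occur, is often implemented with heartbeats. The following is a minimal sketch under stated assumptions, not a production design: the class name `HeartbeatMonitor`, its methods, and the timeout value are all hypothetical, and a real detector would also have to deal with network delay and false positives.

```python
import time


class HeartbeatMonitor:
    """Hypothetical detector: a node counts as failed when no heartbeat
    has arrived within `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock      # injectable so the sketch can be exercised without sleeping
        self.last_seen = {}     # node name -> time of its most recent heartbeat

    def heartbeat(self, node):
        """Record that `node` has just reported in."""
        self.last_seen[node] = self.clock()

    def failed_nodes(self):
        """Return the nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

In use, each redundant node would call `heartbeat` periodically, and the maintenance side would poll `failed_nodes` to decide when to trigger a crossover.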

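This sounds simple, but it is not: the devil is in the details. The hardest problem with redundant nodes is replicating data for stateful nodes and guaranteeing its consistency (redundancy for stateless nodes is comparatively simple). The consistency problems that redundant data brings are the worst devil of all:

  • If the system mirrors data to the redundant nodes asynchronously, the data on the nodes can differ at failover time.

  • If the system mirrors data to the redundant nodes synchronously, performance degrades as the number of redundant nodes grows.

So many highly available systems are exercises in trade-offs, and those trade-offs have to be weighed against the characteristics of the business. For example, a bank account balance is stateful data, so its redundancy must be strongly consistent. Order records, by contrast, are append-only data, so at failover new records can simply be appended on the standby machine, which is much simpler (Alibaba's current so-called geo-distributed active-active deployment cannot actually achieve active-active for stateful data).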
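The difference between the two mirroring modes can be illustrated with a small in-memory simulation. This is only a sketch: `Primary`, `Replica`, `flush`, and `lost_on_failover` are hypothetical names, there is no real network or storage involved, and the replica's log is assumed to always be a prefix of the primary's log.

```python
class Replica:
    """A redundant node that stores a copy of the primary's log."""

    def __init__(self):
        self.log = []


class Primary:
    def __init__(self, replicas, synchronous):
        self.log = []
        self.replicas = replicas
        self.synchronous = synchronous
        self.pending = []   # writes not yet shipped (asynchronous mode only)

    def write(self, record):
        self.log.append(record)
        if self.synchronous:
            # Synchronous mirroring: the write completes only after every
            # replica has it, so the cost grows with the number of replicas.
            for replica in self.replicas:
                replica.log.append(record)
        else:
            # Asynchronous mirroring: the write completes immediately and
            # the record is shipped later.
            self.pending.append(record)

    def flush(self):
        """Ship queued writes to the replicas (a background task in a real system)."""
        for record in self.pending:
            for replica in self.replicas:
                replica.log.append(record)
        self.pending.clear()


def lost_on_failover(primary, replica):
    """Records the primary accepted that the promoted replica never received."""
    return primary.log[len(replica.log):]
```

If an asynchronous primary accepts a write and fails before the next `flush`, `lost_on_failover` returns that write. With `synchronous=True` it returns an empty list, at the price of touching every replica on every write.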

Measuring Availability - Percentage Calculation

Availability is usually expressed as a percentage of uptime in a given year.

The following table shows the downtime allowed for a given availability percentage, presuming that the system is required to operate continuously. Service level agreements often refer to monthly downtime or availability so that service credits can be calculated to match monthly billing cycles.

| Availability % | Downtime per year | Downtime per quarter | Downtime per month | Downtime per week | Downtime per day (24 hours) |
|---|---|---|---|---|---|
| 90% (“one nine”) | 36.53 days | 9.13 days | 73.05 hours | 16.80 hours | 2.40 hours |
| 95% (“one and a half nines”) | 18.26 days | 4.56 days | 36.53 hours | 8.40 hours | 1.20 hours |
| 97% | 10.96 days | 2.74 days | 21.92 hours | 5.04 hours | 43.20 minutes |
| 98% | 7.31 days | 43.86 hours | 14.61 hours | 3.36 hours | 28.80 minutes |
| 99% (“two nines”) | 3.65 days | 21.9 hours | 7.31 hours | 1.68 hours | 14.40 minutes |
| 99.5% (“two and a half nines”) | 1.83 days | 10.98 hours | 3.65 hours | 50.40 minutes | 7.20 minutes |
| 99.8% | 17.53 hours | 4.38 hours | 87.66 minutes | 20.16 minutes | 2.88 minutes |
| 99.9% (“three nines”) | 8.77 hours | 2.19 hours | 43.83 minutes | 10.08 minutes | 1.44 minutes |
| 99.95% (“three and a half nines”) | 4.38 hours | 65.7 minutes | 21.92 minutes | 5.04 minutes | 43.20 seconds |
| 99.99% (“four nines”) | 52.60 minutes | 13.15 minutes | 4.38 minutes | 1.01 minutes | 8.64 seconds |
| 99.995% (“four and a half nines”) | 26.30 minutes | 6.57 minutes | 2.19 minutes | 30.24 seconds | 4.32 seconds |
| 99.999% (“five nines”) | 5.26 minutes | 1.31 minutes | 26.30 seconds | 6.05 seconds | 864.00 milliseconds |
| 99.9999% (“six nines”) | 31.56 seconds | 7.89 seconds | 2.63 seconds | 604.80 milliseconds | 86.40 milliseconds |
| 99.99999% (“seven nines”) | 3.16 seconds | 0.79 seconds | 262.98 milliseconds | 60.48 milliseconds | 8.64 milliseconds |
| 99.999999% (“eight nines”) | 315.58 milliseconds | 78.89 milliseconds | 26.30 milliseconds | 6.05 milliseconds | 864.00 microseconds |
| 99.9999999% (“nine nines”) | 31.56 milliseconds | 7.89 milliseconds | 2.63 milliseconds | 604.80 microseconds | 86.40 microseconds |
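The figures above follow from one formula: allowed downtime = (1 − availability) × period length. The sketch below reproduces them. The function name `allowed_downtime` and the `PERIODS` mapping are hypothetical, and a year is assumed to be 365.25 days, which agrees with the yearly, quarterly, and monthly columns to rounding.

```python
from datetime import timedelta

# Assumption: one year = 365.25 days; quarter and month are 1/4 and 1/12 of that.
PERIODS = {
    "year": timedelta(days=365.25),
    "quarter": timedelta(days=365.25 / 4),
    "month": timedelta(days=365.25 / 12),
    "week": timedelta(days=7),
    "day": timedelta(days=1),
}


def allowed_downtime(availability_percent, period="year"):
    """Return the downtime permitted per `period` at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return PERIODS[period] * unavailable_fraction


if __name__ == "__main__":
    for percent in (99, 99.9, 99.99, 99.999):
        hours = allowed_downtime(percent, "year").total_seconds() / 3600
        print(f"{percent}% -> {hours:.2f} hours of downtime per year")
```

For example, `allowed_downtime(99.9, "year")` is about 8.77 hours, in line with the “three nines” row.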

Disaster Planning

Real Application Clusters are primarily a single site, high availability solution. This means the nodes in the cluster generally exist within the same building, if not the same room. Thus, disaster planning can be critical. Disaster planning covers planning for fires, floods, hurricanes, earthquakes, terrorism, and so on. Depending on how mission critical your system is, and the propensity of your system’s location for such disasters, disaster planning could be an important high availability component.
