High availability
High availability (HA) is a characteristic of a system which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period.
Modernization has resulted in an increased reliance on these systems. For example, hospitals and data centers require high availability of their systems to perform routine daily activities. Availability refers to the ability of the user community to obtain a service or good, or to access the system, whether to submit new work, update or alter existing work, or collect the results of previous work. If a user cannot access the system, it is, from the user’s point of view, unavailable.
Generally, the term downtime is used to refer to periods when a system is unavailable.
Principles
There are three principles of systems design in reliability engineering which can help achieve high availability.
- Elimination of single points of failure. This means adding or building redundancy into the system so that failure of a component does not mean failure of the entire system.
- Reliable crossover. In redundant systems, the crossover point itself tends to become a single point of failure. Reliable systems must provide for reliable crossover. Typical crossover points include DNS resolution and load balancers.
- Detection of failures as they occur. If the two principles above are observed, then a user may never see a failure – but the maintenance activity must.
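To make the three principles concrete, here is a minimal Python sketch, not a production design. The function `call_with_failover`, the two handler functions, and the replica names are all hypothetical. The replica list stands for redundancy, the function itself plays the role of the crossover point (and would be a single point of failure in its own right unless it is also made redundant), and the `on_failure` callback stands for failure detection: the caller never sees the primary's failure, but the operator does.

```python
from typing import Callable, List, Sequence, Tuple

Handler = Callable[[str], str]


def call_with_failover(
    replicas: Sequence[Tuple[str, Handler]],
    request: str,
    on_failure: Callable[[str, Exception], None],
) -> Tuple[str, str]:
    """Try each replica in order and return (replica name, response)."""
    for name, handler in replicas:
        try:
            return name, handler(request)
        except Exception as exc:
            # Detection: every failure is reported, even though the
            # caller may still get a successful answer from a standby.
            on_failure(name, exc)
    raise RuntimeError("all replicas failed")


def broken_primary(request: str) -> str:
    # Hypothetical primary that is currently down.
    raise ConnectionError("primary is down")


def healthy_standby(request: str) -> str:
    # Hypothetical standby that answers normally.
    return f"ok: {request}"


failures: List[Tuple[str, str]] = []
served_by, response = call_with_failover(
    [("primary", broken_primary), ("standby", healthy_standby)],
    "ping",
    on_failure=lambda name, exc: failures.append((name, str(exc))),
)
print(served_by, response)
print(failures)
```

The user-facing result comes from the standby, while `failures` keeps the record that maintenance needs in order to repair the primary.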
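These principles sound simple, but they are not: the devil is in the details. The hardest problem with redundant nodes is replicating data between stateful nodes and keeping that data consistent (redundancy for stateless nodes is comparatively easy). Of all the devils, the consistency problems introduced by redundant data are the worst:
- If data is mirrored to the redundant nodes asynchronously, the replicas can differ from the primary at failover time.
- If data is mirrored to the redundant nodes synchronously, performance degrades as more redundant nodes are added.
Many highly available systems are therefore built on trade-offs, which have to be weighed against the characteristics of the business. For example, a bank account balance is stateful data, so its redundancy must be strongly consistent. An order record, by contrast, is append-only data, so at failover new records can simply be appended on the standby, which is much simpler. (Alibaba's current so-called multi-site active-active deployment cannot actually achieve active-active operation for stateful data.)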
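The trade-off between the two mirroring modes can be illustrated with a small deterministic simulation. This is a sketch under simplifying assumptions: the `Primary` and `Replica` classes are hypothetical, there is no real network, and the cost of synchronous mirroring is modeled only as a count of replica round trips that must finish before a write is acknowledged.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Replica:
    log: List[str] = field(default_factory=list)


@dataclass
class Primary:
    replicas: List[Replica]
    synchronous: bool
    log: List[str] = field(default_factory=list)
    pending: List[str] = field(default_factory=list)  # not yet shipped (async only)
    round_trips_before_ack: int = 0

    def write(self, record: str) -> None:
        self.log.append(record)
        if self.synchronous:
            # Every replica must apply the record before the write is
            # acknowledged, so the cost grows with the number of replicas.
            for replica in self.replicas:
                replica.log.append(record)
                self.round_trips_before_ack += 1
        else:
            # Acknowledge immediately and ship the record later.
            self.pending.append(record)

    def flush(self) -> None:
        """Ship queued records to all replicas (async mode only)."""
        for record in self.pending:
            for replica in self.replicas:
                replica.log.append(record)
        self.pending.clear()


sync_primary = Primary(replicas=[Replica(), Replica()], synchronous=True)
async_primary = Primary(replicas=[Replica(), Replica()], synchronous=False)
for primary in (sync_primary, async_primary):
    primary.write("balance=100")
    primary.write("balance=80")

# Simulate a crash of each primary before the async one calls flush():
# count the records the first replica is missing at failover time.
sync_missing = len(sync_primary.log) - len(sync_primary.replicas[0].log)
async_missing = len(async_primary.log) - len(async_primary.replicas[0].log)
print(sync_missing, async_missing)
print(sync_primary.round_trips_before_ack, async_primary.round_trips_before_ack)
```

In this model the synchronous primary loses nothing at failover but pays one round trip per replica on every write, while the asynchronous primary acknowledges writes at no replica cost and leaves the standby behind if it crashes before flushing.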
Measuring Availability - Percentage Calculation
Availability is usually expressed as a percentage of uptime in a given year.
The following table shows the downtime allowed for a given availability percentage, presuming that the system is required to operate continuously. Service level agreements often refer to monthly downtime or availability so that service credits can be calculated to match monthly billing cycles, so the table also gives the allowed downtime per quarter, month, week, and day. The yearly figures are based on a 365.25-day year.
Availability % | Downtime per year | Downtime per quarter | Downtime per month | Downtime per week | Downtime per day (24 hours) |
---|---|---|---|---|---|
90% (“one nine”) | 36.53 days | 9.13 days | 73.05 hours | 16.80 hours | 2.40 hours |
95% (“one and a half nines”) | 18.26 days | 4.56 days | 36.53 hours | 8.40 hours | 1.20 hours |
97% | 10.96 days | 2.74 days | 21.92 hours | 5.04 hours | 43.20 minutes |
98% | 7.31 days | 43.86 hours | 14.61 hours | 3.36 hours | 28.80 minutes |
99% (“two nines”) | 3.65 days | 21.9 hours | 7.31 hours | 1.68 hours | 14.40 minutes |
99.5% (“two and a half nines”) | 1.83 days | 10.98 hours | 3.65 hours | 50.40 minutes | 7.20 minutes |
99.8% | 17.53 hours | 4.38 hours | 87.66 minutes | 20.16 minutes | 2.88 minutes |
99.9% (“three nines”) | 8.77 hours | 2.19 hours | 43.83 minutes | 10.08 minutes | 1.44 minutes |
99.95% (“three and a half nines”) | 4.38 hours | 65.7 minutes | 21.92 minutes | 5.04 minutes | 43.20 seconds |
99.99% (“four nines”) | 52.60 minutes | 13.15 minutes | 4.38 minutes | 1.01 minutes | 8.64 seconds |
99.995% (“four and a half nines”) | 26.30 minutes | 6.57 minutes | 2.19 minutes | 30.24 seconds | 4.32 seconds |
99.999% (“five nines”) | 5.26 minutes | 1.31 minutes | 26.30 seconds | 6.05 seconds | 864.00 milliseconds |
99.9999% (“six nines”) | 31.56 seconds | 7.89 seconds | 2.63 seconds | 604.80 milliseconds | 86.40 milliseconds |
99.99999% (“seven nines”) | 3.16 seconds | 0.79 seconds | 262.98 milliseconds | 60.48 milliseconds | 8.64 milliseconds |
99.999999% (“eight nines”) | 315.58 milliseconds | 78.89 milliseconds | 26.30 milliseconds | 6.05 milliseconds | 864.00 microseconds |
99.9999999% (“nine nines”) | 31.56 milliseconds | 7.89 milliseconds | 2.63 milliseconds | 604.80 microseconds | 86.40 microseconds |
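The figures in the table follow from one multiplication: the unavailable fraction times the length of the period. A minimal Python sketch, assuming a 365.25-day year (which matches the yearly column) with quarters and months taken as one quarter and one twelfth of that; the function name and the `PERIOD_DAYS` mapping are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Period lengths in days; a year is assumed to be 365.25 days.
PERIOD_DAYS = {
    "year": 365.25,
    "quarter": 365.25 / 4,
    "month": 365.25 / 12,
    "week": 7.0,
    "day": 1.0,
}


def allowed_downtime_seconds(availability_percent: float, period: str = "year") -> float:
    """Return the downtime, in seconds, permitted within one period."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * PERIOD_DAYS[period] * SECONDS_PER_DAY


three_nines_year_hours = allowed_downtime_seconds(99.9, "year") / 3600
three_nines_month_minutes = allowed_downtime_seconds(99.9, "month") / 60
five_nines_year_minutes = allowed_downtime_seconds(99.999, "year") / 60
print(round(three_nines_year_hours, 2))
print(round(three_nines_month_minutes, 2))
print(round(five_nines_year_minutes, 2))
```

For 99.9% availability this gives roughly 8.77 hours per year and 43.83 minutes per month, in line with the "three nines" row.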
Disaster Planning
Oracle Real Application Clusters (RAC) are primarily a single-site high availability solution. This means the nodes in the cluster generally sit within the same building, if not the same room, so a single local disaster can take out every node at once. Disaster planning can therefore be critical. It covers planning for fires, floods, hurricanes, earthquakes, terrorism, and so on. Depending on how mission-critical your system is, and how prone its location is to such disasters, disaster planning could be an important high availability component.
Reference
- https://en.wikipedia.org/wiki/High_availability
- On High-Availability Systems (关于高可用的系统) - https://coolshell.cn/articles/17459.html
- https://docs.oracle.com/cd/A91202_01/901_doc/rac.901/a89867/pshavdtl.htm