Background
Types of stress tests
Traffic Record
Record live traffic and replay it directly in the live region. This only works for “GET” endpoints, i.e., endpoints that do not write to the DB.
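A minimal sketch of the filtering step implied above, assuming recorded requests are available as simple records. The `RecordedRequest` type, its fields, and the `replayable` helper are hypothetical names used only for illustration; the actual recording format is not described here.

```python
from dataclasses import dataclass

# Only "GET" is treated as safe to replay, since it should not write to the DB.
REPLAYABLE_METHODS = {"GET"}


@dataclass(frozen=True)
class RecordedRequest:
    method: str
    path: str
    offset_ms: int  # time elapsed since the recording started (assumed field)


def replayable(requests):
    """Keep only read-only requests, ordered by their original timing."""
    kept = [r for r in requests if r.method.upper() in REPLAYABLE_METHODS]
    return sorted(kept, key=lambda r: r.offset_ms)
```

Keeping the original offsets allows a replayer to preserve the recorded request ordering and pacing.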
Pros
- Reproduces real user behaviour exactly as seen in the live environment on normal days
- Covers the full chain of services triggered by the stressed endpoints
Cons
- Cannot cover POST endpoints
- Cannot cover services that the stressed endpoints do not pass through
- Cannot simulate real user behaviour in the live environment during a campaign
Note that user behaviour during a campaign may differ from behaviour on normal days, e.g.,
- during a campaign, heavily promoted items/shops attract a larger share of views, while on normal days views are more evenly distributed across items/shops
- during a campaign, more vouchers are dispatched
- during a campaign, QPS usually spikes because of user behaviour, e.g., when Flash Sales start, or when TV shows start or are ongoing
Stress Test by Service
Pros
- Simple
Cons
- May not simulate real user behaviour in the live environment, depending on how the requests (and their parameters) are generated
- Cannot cover POST endpoints
- Cannot cover services that the stressed endpoints do not pass through
- Cannot simulate real user behaviour in the live environment during a campaign
Why Full Chain Stress Test (FCST)
The objective of FCST is to enable stress testing in the live environment, in real regions, so that we can simulate a Big Campaign as closely as possible, e.g., its traffic patterns.
The end goal is to increase stability.
How
Generally, we attach a specific flag (the shadow flag) to the traffic generated by our stress test engine. Everything should behave the same as for live traffic, except that the data is stored in a different table/topic.
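A minimal sketch of the shadow-flag idea, assuming the flag travels as a request header. The header name `X-Shadow-Flag`, the `_shadow` suffix, and all function names below are assumptions made for illustration, not the actual convention.

```python
from contextvars import ContextVar

# Hypothetical header name; how the flag is actually carried is not specified.
SHADOW_HEADER = "X-Shadow-Flag"

# Per-request state, so concurrent live and shadow requests do not interfere.
_is_shadow = ContextVar("is_shadow", default=False)


def mark_request(headers):
    """Record whether the current request carries the shadow flag."""
    _is_shadow.set(headers.get(SHADOW_HEADER) == "1")


def is_shadow():
    return _is_shadow.get()


def outgoing_headers():
    """Headers to attach to downstream calls so the flag follows the full chain."""
    return {SHADOW_HEADER: "1"} if is_shadow() else {}


def resolve_name(name):
    """Map a table or topic name to its shadow counterpart for shadow traffic."""
    return f"{name}_shadow" if is_shadow() else name
```

For example, after `mark_request({"X-Shadow-Flag": "1"})`, `resolve_name("orders")` returns `"orders_shadow"`, while live requests keep using `"orders"`.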
Components involved
- Service
  - Identify shadow traffic
- Cache
  - Remote Cache, e.g., Redis
  - Local Cache, i.e., in-process cache
- DB
- Message Queue, e.g., Kafka
- Cronjob
- Library
  - ORM - add logic so that shadow traffic reads from and writes to the shadow DB
  - Cache Library - add logic so that shadow traffic reads from and writes to the shadow cache
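The library-level changes listed above could look roughly like the sketch below. It uses an in-memory SQLite database and a plain dict as stand-ins for the real DB and cache. The `shadow` context variable, the `shadow:` key prefix, the `orders_shadow` table, and both class names are hypothetical choices for illustration only.

```python
import sqlite3
from contextvars import ContextVar

# Assumed per-request flag; how it gets set is outside this sketch.
shadow = ContextVar("shadow", default=False)


class ShadowAwareCache:
    """In-process cache that keeps shadow entries apart from live entries."""

    def __init__(self):
        self._store = {}

    def _key(self, key):
        # Hypothetical convention: prefix shadow keys so they never collide with live keys.
        return f"shadow:{key}" if shadow.get() else key

    def get(self, key, default=None):
        return self._store.get(self._key(key), default)

    def set(self, key, value):
        self._store[self._key(key)] = value


class OrderRepo:
    """Tiny stand-in for an ORM model that picks its table per request."""

    def __init__(self, conn):
        self._conn = conn

    def _table(self):
        # Table names are fixed constants, never user input, so formatting them into SQL is safe here.
        return "orders_shadow" if shadow.get() else "orders"

    def insert(self, item):
        self._conn.execute(f"INSERT INTO {self._table()} (item) VALUES (?)", (item,))

    def count(self):
        return self._conn.execute(f"SELECT COUNT(*) FROM {self._table()}").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (item TEXT)")
conn.execute("CREATE TABLE orders_shadow (item TEXT)")
```

Putting the routing inside the ORM and the cache library means service code stays unchanged: callers keep using `OrderRepo` and `ShadowAwareCache` as usual, and the flag decides where the data lands.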