【Design】Error Handling

Posted by 西维蜀黍 on 2019-10-28, Last Modified on 2023-02-21

Returned Errors

If a service cannot handle a request because of an internal or upstream error, it should return proper error information to the client.

The service should write a log entry for every error it returns.

The error response structure should be consistent for all APIs in one service.

  • The error code should be either a short string code or an integer, defined by the service. It helps the client understand the situation and choose how to handle the error.
  • The detailed error message is optional. It helps the client debug the issue.
  • Error codes need to be categorized into two types, so that the client can easily identify the type of an error.
    • Occasional errors, such as an upstream service being down or an internal error. For this kind of error, the client can safely retry the request.
    • Logic errors, such as an invalid argument. The client can cache the result and should not retry the same request directly.
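As a rough illustration of the points above, here is a minimal Python sketch of a consistent error structure whose category tells the client whether a retry is safe. The names `ErrorCategory`, `ERROR_CATEGORIES`, `make_error`, and `should_retry`, as well as the specific codes, are hypothetical and only serve the example.

```python
from dataclasses import dataclass, asdict
from enum import Enum
from typing import Optional


class ErrorCategory(Enum):
    OCCASIONAL = "occasional"  # transient failure: the client may retry
    LOGIC = "logic"            # deterministic failure: retrying as-is will not help


# Hypothetical service-defined codes, each mapped to exactly one category.
ERROR_CATEGORIES = {
    "UPSTREAM_DOWN": ErrorCategory.OCCASIONAL,
    "INTERNAL_ERROR": ErrorCategory.OCCASIONAL,
    "INVALID_ARGUMENT": ErrorCategory.LOGIC,
}


@dataclass
class ErrorResponse:
    code: str                      # short string code defined by the service
    category: str                  # lets the client identify the error type
    message: Optional[str] = None  # optional detail for debugging


def make_error(code: str, message: Optional[str] = None) -> dict:
    """Build the error body that every API of the service would return."""
    category = ERROR_CATEGORIES[code]
    return asdict(ErrorResponse(code=code, category=category.value, message=message))


def should_retry(error: dict) -> bool:
    """Client-side check: only occasional errors are safe to retry."""
    return error["category"] == ErrorCategory.OCCASIONAL.value
```

A client could call `should_retry(response_body)` before scheduling a retry, and cache the response when it returns `False`.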

Error Codes

Google APIs must use the canonical error codes defined by google.rpc.Code. Individual APIs should avoid defining additional error codes, since developers are very unlikely to write logic to handle a large number of error codes. For reference, handling an average of 3 error codes per API call would mean most application logic would just be for error handling, which would not be a good developer experience.
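To make the "few codes" argument concrete, here is a small Python sketch in which a client gives dedicated handling to only a handful of canonical codes and treats everything else generically. The enum lists just a subset of google.rpc.Code with its numeric values; the `handle` function and the action names it returns are hypothetical.

```python
from enum import IntEnum


class Code(IntEnum):
    """A subset of google.rpc.Code (numeric values as defined in code.proto)."""
    OK = 0
    INVALID_ARGUMENT = 3
    NOT_FOUND = 5
    INTERNAL = 13
    UNAVAILABLE = 14


def handle(code: Code) -> str:
    """Hypothetical client policy: only a few codes get their own branch."""
    if code == Code.OK:
        return "proceed"
    if code == Code.UNAVAILABLE:
        return "retry"
    if code == Code.NOT_FOUND:
        return "create"
    # Every other code falls through to one generic path.
    return "report"
```

Each extra code that needs its own branch adds another path to write and test, which is why the guideline discourages API-specific codes.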

Error Messages

The error message should help users understand and resolve the API error easily and quickly. In general, consider the following guidelines when writing error messages:

  • Do not assume the user is an expert user of your API. Users could be client developers, operations people, IT staff, or end-users of apps.
  • Do not assume the user knows anything about your service implementation or is familiar with the context of the errors (such as log analysis).
  • When possible, error messages should be constructed such that a technical user (but not necessarily a developer of your API) can respond to the error and correct it.
  • Keep the error message brief. If needed, provide a link where a confused reader can ask questions, give feedback, or get more information that doesn’t cleanly fit in an error message. Otherwise, use the details field to expand.

Error Details

Google APIs define a set of standard error payloads for error details, which you can find in google/rpc/error_details.proto. These cover the most common needs for API errors, such as quota failure and invalid parameters. Like error codes, error details should use these standard payloads whenever possible.

Additional error detail types should only be introduced if they can assist application code to handle the errors. If the error information can only be handled by humans, rely on the error message content and let developers handle it manually rather than introducing new error detail types.
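The following Python sketch shows what a standard detail payload might look like on the wire and how application code could consume it. It assumes the JSON shape commonly seen in Google API HTTP responses (an `error` object with `code`, `status`, `message`, and `details`) and uses the `google.rpc.BadRequest` payload from error_details.proto; the helper names `invalid_argument` and `field_violations` are hypothetical.

```python
BAD_REQUEST_TYPE = "type.googleapis.com/google.rpc.BadRequest"


def invalid_argument(field: str, description: str) -> dict:
    """Build an error body carrying a BadRequest detail (assumed JSON shape)."""
    return {
        "error": {
            "code": 400,
            "status": "INVALID_ARGUMENT",
            "message": f"Invalid value for '{field}'.",
            "details": [
                {
                    "@type": BAD_REQUEST_TYPE,
                    "fieldViolations": [
                        {"field": field, "description": description}
                    ],
                }
            ],
        }
    }


def field_violations(response: dict) -> list:
    """Collect field violations so that code, not a human, can act on them."""
    violations = []
    for detail in response["error"].get("details", []):
        if detail.get("@type") == BAD_REQUEST_TYPE:
            violations.extend(detail.get("fieldViolations", []))
    return violations
```

A form-handling client could use `field_violations(body)` to highlight the offending inputs, which is the kind of programmatic handling that justifies a structured detail type.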

Logging

Log Format

We suggest the following log format, which is easy for programs to parse.

key_words|key1=value1,key2=value2,key3=value3...

Here are some examples. The fields before the keyword (timestamp, level, and so on) were generated automatically by the logger.

C++:

2013-04-08 15:30:42.621|FATAL|0x7f865e4b9720|GdpProcessor.cpp(35)|CGdpPrcessor::Init|init_db_client_fail|id=95,db=gpp_db,type=1

Python:

2014-07-30 17:47:27.358|WARNING|544:3108|beepay_processor.py:122|beepay_processor.on_packet|process_request_no_session|client=0.0.0.0:0(0),cmd=2175
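To show why this format is easy to parse, here is a minimal Python sketch that writes and reads the `key_words|key=value,...` part. The function names `format_event` and `parse_line` are hypothetical, and the sketch assumes that values never contain `|` or `,`.

```python
def format_event(key_words: str, **fields) -> str:
    """Render the application part of a log line: key_words|k1=v1,k2=v2."""
    pairs = ",".join(f"{key}={value}" for key, value in fields.items())
    return f"{key_words}|{pairs}"


def parse_line(line: str) -> dict:
    """Split a full log line into logger prefix, key words, and fields.

    Assumes the last two '|'-separated parts are the key words and the
    key=value list; everything before them is the logger-generated prefix.
    """
    *prefix, key_words, pairs = line.split("|")
    # Split each pair on the first '=' only, so values may contain '='.
    fields = dict(pair.split("=", 1) for pair in pairs.split(",") if pair)
    return {"prefix": prefix, "key_words": key_words, "fields": fields}
```

Applied to the C++ example above, `parse_line` yields the key words `init_db_client_fail` and the fields `id`, `db`, and `type` as strings. A real implementation would also need an escaping rule for values that contain the separators.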

Log Level

  • Trace - Only when I would be “tracing” the code and trying to find one part of a function specifically.
  • Debug - Information that is diagnostically helpful to people more than just developers (IT, sysadmins, etc.).
  • Info - Generally useful information to log (service start/stop, configuration assumptions, etc). Info I want to always have available but usually don’t care about under normal circumstances. This is my out-of-the-box config level.
  • Warn - Anything that can potentially cause application oddities, but for which I am automatically recovering. (Such as switching from a primary to backup server, retrying an operation, missing secondary data, etc.)
  • Error - Any error which is fatal to the operation, but not the service or application (can’t open a required file, missing data, etc.). These errors will force user (administrator, or direct user) intervention. These are usually reserved (in my apps) for incorrect connection strings, missing services, etc.
  • Fatal - Any error that is forcing a shutdown of the service or application to prevent data loss (or further data loss). I reserve these only for the most heinous errors and situations where there is guaranteed to have been data corruption or loss.
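As one possible mapping of these levels onto Python's standard `logging` module, consider the sketch below. Python has no built-in TRACE level, so the value 5 used here is an assumption, and `logging.FATAL` is simply an alias of `logging.CRITICAL`. The situation names and the `level_for` helper are hypothetical.

```python
import logging

# Assumption: a custom TRACE level placed below DEBUG (which is 10).
TRACE = 5
logging.addLevelName(TRACE, "TRACE")

# Hypothetical situations, one per level described in the list above.
LEVEL_BY_SITUATION = {
    "stepping_through_function": TRACE,
    "diagnostic_detail": logging.DEBUG,
    "service_started": logging.INFO,
    "switched_to_backup_server": logging.WARNING,
    "required_file_missing": logging.ERROR,
    "shutdown_to_prevent_data_loss": logging.FATAL,
}


def level_for(situation: str) -> int:
    """Return the numeric logging level chosen for a known situation."""
    return LEVEL_BY_SITUATION[situation]
```

With this table, `logging.getLogger(__name__).log(level_for("service_started"), "service_start|port=8080")` would emit an INFO record, provided the logger's threshold allows it.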

Reference