【Engineering】Logging

Posted by 西维蜀黍 on 2022-01-23, Last Modified on 2022-01-23

Log Levels

The log level is used to denote the severity of each event in the system. In most logging frameworks, the following levels are available.

  • FATAL: denotes very severe error events that will presumably lead the application to abort. Usually, these can end up in catastrophic failures.
  • ERROR: denotes error events that might still allow the application to continue running, with reduced capabilities in the affected paths.
  • WARN: denotes less-harmful events than errors. Usually, they do not lead to any degradation of capabilities or complete failure of the application. However, they are still red flags and must be investigated.
  • INFO: denotes important event banners and informational messages about the application's behaviour.
  • DEBUG: denotes specific and detailed information, mainly used for debugging purposes. These logs help us step through the code.
  • TRACE: denotes the most low-level information, like stack traces, to provide the fullest picture of a certain event/context. These logs help us inspect variable values and full error stacks.
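As a minimal sketch of these levels in practice, here is how they map onto Python's standard logging module (note that Python treats FATAL as an alias for CRITICAL and has no built-in TRACE level; the logger name and messages are illustrative):

```python
import logging

# Configure the root logger to emit everything from DEBUG upward.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)
log = logging.getLogger("payments")

log.critical("Database connection pool exhausted; shutting down")   # FATAL/CRITICAL
log.error("Failed to charge card; payment left in PENDING state")   # ERROR
log.warning("Retrying request after timeout (attempt 2/3)")         # WARN
log.info("Payment service started on port 8080")                    # INFO
log.debug("Card validation passed for request 87d4s-a98d7-9a8742jsd")  # DEBUG
```

Raising the configured level (e.g. to `logging.INFO` in production) silently drops everything below it, which is how the same code base can log verbosely in development and sparsely in production.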

Best Practice

Semantic logging

Putting more semantic meaning into the logged events allows us to get more value out of our log data. We log audit trails, what users are doing, transactions, timing information, data flows, and so on. We log anything that can add value when aggregated, charted, or further analyzed. In other words, we log anything that is interesting!

Think about the five W’s:

  1. When did it happen (timestamps, etc.)
  2. What happened (actions, state or status changes, error codes, impact level, etc.)
  3. Where did it happen (hostnames, gateways, etc.)
  4. Who was involved (user names, client IDs, process names, session IDs, etc.)
  5. Where it came from (source IPs, telephony extensions, etc.)
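The five W's above can be captured as structured fields on each event. A minimal sketch (the field names and the `audit` logger are illustrative, not a standard):

```python
import json
import logging
import socket
from datetime import datetime, timezone

def log_event(action, status, user, source_ip):
    """Build and log one event that answers all five W's."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # 1. when
        "action": action,                                     # 2. what
        "status": status,                                     #    (state change)
        "host": socket.gethostname(),                         # 3. where
        "user": user,                                         # 4. who
        "source_ip": source_ip,                               # 5. where it came from
    }
    logging.getLogger("audit").info(json.dumps(event))
    return event

logging.basicConfig(level=logging.INFO)
evt = log_event("login", "success", "alice", "10.244.2.168")
```

Because every event carries the same fields, the resulting log stream can be aggregated and charted later without guessing at message text.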

Attach a traceable event ID/token to messages

Usually, when an event/message enters into the system, a unique ID is assigned to that event/message. This ID will be carried forward throughout the next stages of processing, and it helps us trace the flow of an event throughout the system, especially for debugging, concurrency handling, and data-related operations. Ideally, in a single system, there should be only one unique ID for an event throughout all modules and services.

[87d4s-a98d7-9a8742jsd] Request Body: {
  "request_id": "87d4s-a98d7-9a8742jsd",
  "app_id": "TX001",
  "option_val": "IBM",
  "req_type_id": "0013",
  "data": {...........}
}
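One way to carry such an ID through every module without passing it to each function explicitly is a context variable plus a logging filter. A sketch, assuming Python's standard `logging` and `contextvars` (the logger names and `handle_request`/`process` functions are illustrative):

```python
import logging
import uuid
from contextvars import ContextVar

# Holds the current request's trace ID for this execution context.
request_id: ContextVar[str] = ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    def filter(self, record):
        # Stamp every record with the current request ID.
        record.request_id = request_id.get()
        return True

handler = logging.StreamHandler()
handler.addFilter(RequestIdFilter())
handler.setFormatter(logging.Formatter("[%(request_id)s] %(levelname)s %(message)s"))
logging.basicConfig(level=logging.INFO, handlers=[handler])

def handle_request(body):
    # Assign the ID once, at the system boundary; reuse the caller's if present.
    rid = body.get("request_id") or str(uuid.uuid4())
    request_id.set(rid)
    logging.getLogger("gateway").info("Request accepted")
    process(body)

def process(body):
    # Deeper modules log with the same ID automatically.
    logging.getLogger("worker").info("Processing app_id=%s", body.get("app_id"))

handle_request({"request_id": "87d4s-a98d7-9a8742jsd", "app_id": "TX001"})
```

Every line the request touches is now prefixed with the same token, so grepping for it reconstructs the full flow of that one event.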

Encrypt/mask sensitive data

In most cases, it is a legal requirement for systems to mask and/or encrypt the sensitive information of the users and the internal systems. Based on the industry you operate in and the geographical region you belong to, the compliance requirements may change. Therefore, do proper research and implement the correct procedures in your application. In some cases, before getting the application up and running, you may need to present the logging strategy to security authorities and get their approvals/certifications, too.

[1000508] ***** Incoming request from 10.244.2.168:3000 *****
[1000508] Origin Id: 87d4s-a98d7-9a8742jsd -> System ID: 1000508
[1000508] Request Body: {
  "user_id": "XXXXXXXXXX",
  "personal_details": {
    "firstName": "XXXXXXXXXX",
    "lastName": "XXXXXXXXXX",
    "DOB": "XXXXXXXXXX",
    "gender": "Male",
    "profession": "Software Architect",
    "industry": "IT",
    "SSN": "XXXXXXXXXX"
  },
  "address_history": [
    {"streetAddress": "Street No 1", "zipcode": "XXXXX", "state": "CA"},
    {"streetAddress": "Street No 2", "zipcode": "XXXXX", "state": "NY"},
    {"streetAddress": "Street No 2", "zipcode": "XXXXX", "state": "AL"}
  ],
  "card_info": [
    {"type": "amex", "card_number": "XXXXXXXXX", "credit_limit": "XXXXX"},
    {"type": "visa", "card_number": "XXXXXXXXX", "credit_limit": "XXXXX"}
  ]
}
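A masked payload like the one above can be produced by a small recursive helper before the body is handed to the logger. A sketch, assuming a hand-picked key list (your actual compliance requirements dictate the real one):

```python
MASK = "XXXXXXXXXX"

# Illustrative list only; derive the real one from your compliance rules.
SENSITIVE_KEYS = {"user_id", "firstName", "lastName", "DOB", "SSN",
                  "zipcode", "card_number", "credit_limit"}

def mask(value):
    """Recursively replace sensitive values before logging."""
    if isinstance(value, dict):
        return {k: MASK if k in SENSITIVE_KEYS else mask(v)
                for k, v in value.items()}
    if isinstance(value, list):
        return [mask(item) for item in value]
    return value

body = {
    "user_id": "u-1001",
    "personal_details": {"firstName": "Ada", "SSN": "078-05-1120",
                         "gender": "Female"},
    "card_info": [{"type": "visa", "card_number": "4111111111111111"}],
}
masked = mask(body)
```

Masking at a single choke point like this (rather than at every log call site) makes it much harder for a sensitive field to slip through unredacted.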

Write Meaningful Log Messages

This is probably the most important best practice. There's nothing worse than a cryptic log entry that assumes a deep understanding of the program's internals.

When writing your log messages, always anticipate the emergency situations in which the only thing you have is the log file, and you have to understand what happened from it alone. Doing it right can be the subtle difference between getting fired and getting promoted.

When a developer writes a log message, it is written in the context of the code in which the log directive is inserted. Under those conditions, we tend to write messages that rely on that context. Unfortunately, when reading the log itself, the context is absent, and those messages might be incomprehensible.

One way to overcome this (particularly important at the warn and error levels) is to add remediation information to the log message, or, if that's not possible, to state the purpose of the operation and its outcome.

Also, don't write a log message that depends on a previous message's content. Those previous messages might not appear if they are logged in a different category or at a different level. Or worse, they can appear in a different place (or much earlier) in a multi-threaded or asynchronous context.
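To make the contrast concrete, here is a sketch of a cryptic message versus a self-contained one that carries purpose, outcome, and remediation (the invoice and customer IDs are invented for illustration):

```python
import logging

log = logging.getLogger("billing")

# Cryptic: only understandable with the surrounding code open.
bad = "Transaction failed"

# Self-contained: purpose, outcome, and remediation in one line.
good = ("Failed to settle invoice INV-1042 for customer C-77: "
        "gateway returned HTTP 503; payment left in PENDING state. "
        "Retry via the admin console or wait for the hourly reconciliation job.")

logging.basicConfig(level=logging.INFO)
log.error(good)
```

On-call at 3 a.m., the second message tells the reader what was being attempted, what state the system is now in, and what to do next, without opening the source.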

Use Developer-Friendly Log Formats

Using log formats like JSON is beneficial because they are readable by both humans and machines. This lets us aggregate and create views and graphs of the log data to answer business questions, such as "how many users were logged in when the incident happened?", "how many calls were in progress when we lost call guidance?", or "how many extensions on customer sites were impacted by this error?" With JSON-formatted log messages it is easy to parse the fields and aggregate or count events. Sumo Logic, for example, provides a rich set of operators that can be combined into complex queries to answer the questions we get from Customer Support team members.
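Emitting one JSON object per log line is straightforward with a custom formatter. A minimal sketch (production setups typically reach for a library such as python-json-logger instead; the `fields` attribute convention here is an assumption of this example):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object per line."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Merge in any structured fields attached via `extra=`.
            **getattr(record, "fields", {}),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("annotation")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("user logged in", extra={"fields": {"user": "alice", "session": "s-42"}})
```

Each line is then independently parseable, which is exactly what downstream query engines rely on.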

For example, the following Sumo query creates a report on annotation durations per user per date, sorted by user and time, by accessing and parsing JSON fields in the log messages:

_sourceCategory="/cogito/cloud/prod/annotation" 
| replace(_raw,"'", "\"") as message
| parse field=message "* Saving labeldata by user:* for call:* labels:*" as date,  user, call, dict
| json auto field=dict maxdepth 1 nodrop
| fields  timesubmitted, timeclipaccessed,  user 
| parseDate(timesubmitted,"yyyy-MM-dd'T'HH:mm:ss") as TS
| parseDate(timeclipaccessed,"yyyy-MM-dd'T'HH:mm:ss") as CA
| TS - CA as duration
| duration/(3600*1000) as hours
| timeslice 1d
| sum(hours) as mytotal by user, _timeslice
| sort by user, _timeslice asc

Reference