Logging forms the diagnostic foundation of production applications. When systems misbehave, well-crafted logs become your first investigative tool. They reveal hidden patterns and anomalies without requiring direct code access. I’ve seen teams spend hours reproducing bugs that logged clues could have solved in minutes.
Structured logging transforms chaotic text into searchable data. Consider this Python implementation using JSON formatting:
import logging
from pythonjsonlogger import jsonlogger
log_handler = logging.StreamHandler()
formatter = jsonlogger.JsonFormatter()
log_handler.setFormatter(formatter)
app_log = logging.getLogger('payment_service')
app_log.addHandler(log_handler)
app_log.setLevel(logging.INFO)
# Contextual logging example
app_log.info('Inventory updated', extra={
    'sku': 'PROD-8876',
    'previous_stock': 42,
    'new_stock': 38,
    'warehouse': 'CHI-3'
})
This outputs machine-parseable JSON:
{"message": "Inventory updated", "sku": "PROD-8876", ...}
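Once each line is JSON, searching stops being a regex exercise. Here is a minimal sketch of that idea using only the standard library; the `filter_logs` helper and the sample lines are hypothetical and simply mirror the fields logged above:

```python
import json

def filter_logs(lines, **criteria):
    """Yield parsed log records whose fields equal every key=value in criteria."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue  # skip JSON values that are not objects
        if all(record.get(key) == value for key, value in criteria.items()):
            yield record

# Hypothetical sample lines in the shape of the output shown above
sample_lines = [
    '{"message": "Inventory updated", "sku": "PROD-8876", "warehouse": "CHI-3", "new_stock": 38}',
    '{"message": "Inventory updated", "sku": "PROD-1001", "warehouse": "NYC-1", "new_stock": 5}',
    'plain text line that a JSON parser cannot read',
]

chicago_updates = list(filter_logs(sample_lines, warehouse="CHI-3"))
```

In practice a log platform does this filtering for you, but the same field-level matching is what makes those queries possible.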
Log levels establish severity hierarchies. During a payment gateway outage, I raised the log verbosity for the payments package to DEBUG at runtime, without redeploying:
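// Java dynamic log level adjustment (Logback behind SLF4J)
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.LoggerContext;
import org.slf4j.LoggerFactory;
LoggerContext ctx = (LoggerContext) LoggerFactory.getILoggerFactory();
ctx.getLogger("com.payments").setLevel(Level.DEBUG);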
- DEBUG: Detailed flow tracing (off by default in production, enabled temporarily when investigating)
- INFO: Service milestones (“Order 42 shipped”)
- WARN: Recoverable issues (“Cache miss: product_88”)
- ERROR: Critical failures (“DB connection timeout”)
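The same hierarchy can be seen in Python's standard `logging` module. This is only a sketch: the `payments_demo` logger name and the `ListHandler` class are illustrative, and the handler collects records in a list so the effect of each level change is easy to inspect:

```python
import logging

records = []

class ListHandler(logging.Handler):
    """Collects (level, message) pairs instead of writing them anywhere."""
    def emit(self, record):
        records.append((record.levelname, record.getMessage()))

log = logging.getLogger("payments_demo")  # hypothetical logger name
log.addHandler(ListHandler())
log.propagate = False         # keep the demo output out of the root logger
log.setLevel(logging.INFO)    # typical production threshold

log.debug("Gateway request payload built")  # below INFO, so suppressed
log.info("Order 42 shipped")                # emitted

log.setLevel(logging.DEBUG)   # raise verbosity at runtime, no restart needed
log.debug("Gateway request payload built")  # now emitted
```

Each level includes everything more severe than itself, so dropping the threshold to DEBUG adds detail without hiding the INFO, WARN, and ERROR entries.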
Distributed tracing connects cross-service workflows. This Node.js snippet propagates trace IDs:
const { createNamespace } = require('cls-hooked');
const { v4: uuidv4 } = require('uuid');
// Assumes an Express-style `app` and a structured `logger` are defined elsewhere
const traceNamespace = createNamespace('transaction');
// Middleware to propagate context
app.use((req, res, next) => {
  traceNamespace.run(() => {
    // Reuse the caller's trace ID when present, otherwise start a new trace
    const traceId = req.headers['x-trace-id'] || uuidv4();
    traceNamespace.set('traceId', traceId);
    next();
  });
});
// Service function using context
function chargeCard(payment) {
  const traceId = traceNamespace.get('traceId');
  logger.error('Card declined', {
    traceId,
    code: payment.error_code
  });
}
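For comparison, the same propagation pattern can be sketched in Python with `contextvars` from the standard library. The names `handle_request`, `TraceIdFilter`, and `charge_card` are hypothetical, and the handler writes to an in-memory stream purely for illustration:

```python
import contextvars
import io
import logging
import uuid

# Holds the trace ID for the request currently being handled
trace_id_var = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Copies the current trace ID onto every record so formatters can use it."""
    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True

def handle_request(headers, handler):
    """Reuse the incoming x-trace-id header or start a new trace, then run handler."""
    trace_id = headers.get("x-trace-id") or str(uuid.uuid4())
    token = trace_id_var.set(trace_id)
    try:
        return handler()
    finally:
        trace_id_var.reset(token)  # do not leak the ID into the next request

stream = io.StringIO()
stream_handler = logging.StreamHandler(stream)
stream_handler.setFormatter(
    logging.Formatter("%(levelname)s trace=%(trace_id)s %(message)s")
)
stream_handler.addFilter(TraceIdFilter())

logger = logging.getLogger("tracing_demo")  # hypothetical logger name
logger.addHandler(stream_handler)
logger.setLevel(logging.INFO)
logger.propagate = False

def charge_card():
    # No trace ID argument needed: the filter reads it from the context
    logger.error("Card declined")

handle_request({"x-trace-id": "abc-123"}, charge_card)
```

The service function never touches the trace ID directly, which is the point: context travels with the request rather than through every function signature.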
Performance requires deliberate design. I use asynchronous logging to prevent thread blocking:
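// C# async logging with Serilog (needs the Serilog.Sinks.Async and Serilog.Sinks.File packages)
using Serilog;
Log.Logger = new LoggerConfiguration()
    .WriteTo.Async(a => a.File("logs/app.log"))
    .CreateLogger();
// Non-blocking call: the event is queued and written by a background worker
Log.Information("Async log written");
// Flush queued events before the process exits
Log.CloseAndFlush();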
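Python's standard library offers a comparable pattern through `QueueHandler` and `QueueListener`. The sketch below is an illustration rather than a tuned setup: the `async_demo` logger name and the list-backed handler are stand-ins for a real logger and a slow sink such as a file or network handler:

```python
import logging
import logging.handlers
import queue

log_queue = queue.Queue(-1)  # unbounded queue between the app and the writer
captured = []

class ListHandler(logging.Handler):
    """Stand-in for a slow sink such as a file or network handler."""
    def emit(self, record):
        captured.append(record.getMessage())

# The listener drains the queue on its own thread and calls the real handler
listener = logging.handlers.QueueListener(log_queue, ListHandler())
listener.start()

logger = logging.getLogger("async_demo")  # hypothetical logger name
logger.addHandler(logging.handlers.QueueHandler(log_queue))
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info("Async log written")  # returns as soon as the record is enqueued

listener.stop()  # processes the remaining records, then joins the worker thread
```

As with any queued sink, records still in the queue are lost if the process dies before `stop()` runs, so call it during shutdown.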
For high-traffic systems, sampling prevents log floods:
# Python probabilistic sampling
import logging
import random
logger = logging.getLogger(__name__)
def should_log():
    return random.random() < 0.1  # 10% sampling
if should_log():
    logger.debug("Backend call latency: 42ms")
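Wrapping every call in an `if` gets tedious, so one option is to move the sampling decision into a `logging.Filter`. This is a sketch under my own assumptions: the `SamplingFilter` class is hypothetical, and letting WARNING and above bypass sampling is a design choice, not something the standard library does for you:

```python
import logging
import random

class SamplingFilter(logging.Filter):
    """Passes a fraction of low-severity records and every WARNING or above."""

    def __init__(self, rate, rng=random.random):
        super().__init__()
        self.rate = rate  # fraction of records to keep, e.g. 0.1 for 10%
        self.rng = rng    # injectable so the behaviour can be tested

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True   # never drop warnings or errors
        return self.rng() < self.rate

# Attach to a handler so sampling applies to everything it would write
handler = logging.StreamHandler()
handler.addFilter(SamplingFilter(rate=0.1))
```

Call sites then log unconditionally, and the sampling rate becomes a single setting instead of logic scattered through the code.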
Sensitive data demands rigorous masking. This Java method uses regexes to mask two common kinds of PII, Visa-format card numbers (13 or 16 digits starting with 4) and email addresses:
public String sanitizeLog(String rawLog) {
    return rawLog
        .replaceAll("\\b(?:4[0-9]{12}(?:[0-9]{3})?)\\b", "CREDIT_CARD_MASKED")
        .replaceAll("(?i)\\b[a-z0-9._%+-]+@[a-z0-9.-]+\\.[a-z]{2,}\\b", "EMAIL_MASKED");
}
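Masking is easier to enforce when it runs inside the logging pipeline instead of relying on each caller. The Python sketch below reuses the same two patterns in a `logging.Filter`; the `RedactingFilter` and `sanitize` names are hypothetical, and it only covers the rendered message, not fields passed through `extra`:

```python
import logging
import re

# Same patterns as the Java example: Visa-format numbers and email addresses
CARD_RE = re.compile(r"\b4[0-9]{12}(?:[0-9]{3})?\b")
EMAIL_RE = re.compile(r"\b[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}\b", re.IGNORECASE)

def sanitize(text):
    """Replace card numbers and email addresses with fixed placeholders."""
    text = CARD_RE.sub("CREDIT_CARD_MASKED", text)
    return EMAIL_RE.sub("EMAIL_MASKED", text)

class RedactingFilter(logging.Filter):
    """Masks the fully rendered message before any handler formats it."""
    def filter(self, record):
        record.msg = sanitize(record.getMessage())
        record.args = None  # arguments are already merged into msg
        return True

masked = sanitize("Card 4111111111111111 charged for jane.doe@example.com")
```

Regex masking is a safety net rather than a guarantee, so it works best alongside the habit of not logging raw card or contact data in the first place.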
Log retention balances access needs with costs. My current policy:
- 7 days in hot storage (immediate querying)
- 30 days in warm storage (S3 with Athena)
- 1 year in cold archival (Glacier)
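A policy like this can be written down as data so that lifecycle jobs and dashboards agree on it. The sketch below assumes each threshold is measured from the log's creation time and that anything older than a year is eligible for deletion; the tier names and the `storage_tier` function are hypothetical:

```python
from datetime import datetime, timedelta, timezone

# Upper age bound for each tier, measured from the log's creation time (assumption)
RETENTION_TIERS = [
    (timedelta(days=7), "hot"),
    (timedelta(days=30), "warm"),
    (timedelta(days=365), "cold"),
]

def storage_tier(created_at, now):
    """Return the tier a log of this age belongs in, or 'expired' past one year."""
    age = now - created_at
    for limit, tier in RETENTION_TIERS:
        if age <= limit:
            return tier
    return "expired"

now = datetime(2024, 1, 31, tzinfo=timezone.utc)
tier_for_recent = storage_tier(now - timedelta(days=3), now)
tier_for_old = storage_tier(now - timedelta(days=200), now)
```

In a real deployment the storage service's own lifecycle rules usually do the moving; a table like this is mainly useful as the single documented source of the thresholds.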
Monitoring integration turns logs into actionable signals. I correlate Python logs with Prometheus metrics:
# Log-triggered metric increment
# process_payment, InvalidCardException, and logger are defined elsewhere in the service
from prometheus_client import Counter
log_errors = Counter('app_errors', 'Errors by type', ['error_code'])
try:
    process_payment()
except InvalidCardException as e:
    # Count the failure and log it with the same error context
    log_errors.labels(error_code="CARD_INVALID").inc()
    logger.warning("Invalid card", extra={'error': str(e)})
Maintain schema consistency like API contracts. When adding a response_size field, I ensure historical parsers ignore it gracefully. Log changes deserve the same rigor as code changes: version them and communicate breaking modifications.
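One way to get that graceful behaviour is a reader that keeps only the fields it knows and treats newer ones as optional. This is a sketch, not a prescribed schema: the `RequestLog` record and its `status` field are hypothetical, with `response_size` standing in for the newly added field:

```python
import json
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class RequestLog:
    message: str
    status: int
    response_size: Optional[int] = None  # newer field, absent from older logs

_KNOWN_FIELDS = {f.name for f in fields(RequestLog)}

def parse_request_log(line):
    """Parse one JSON log line, ignoring any fields this reader does not know."""
    raw = json.loads(line)
    known = {key: value for key, value in raw.items() if key in _KNOWN_FIELDS}
    return RequestLog(**known)

old_entry = parse_request_log('{"message": "GET /orders", "status": 200}')
new_entry = parse_request_log(
    '{"message": "GET /orders", "status": 200, "response_size": 5120, "region": "us-east"}'
)
```

Old lines parse with `response_size` left as `None`, and unknown fields such as `region` are dropped instead of raising. A missing required field still fails loudly, which is the behaviour you want for a genuinely breaking change.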
Effective logging resembles a skilled conversation. It provides necessary context without unnecessary chatter. Through trial and error, I’ve learned that the most valuable logs answer three questions: “What happened?”, “Where did it occur?”, and “Why does it matter?”