In my work with distributed systems, I’ve learned that understanding what’s happening inside a microservice matters just as much as knowing what it’s supposed to do. You can write perfect logic, but if you can’t see it running, you’re operating blind. Over time, I’ve settled on a set of reliable patterns that make Go services transparent and easier to manage. Let me walk you through these approaches.
The foundation is often called the three pillars: metrics, traces, and logs. Think of them as different lenses for examining your system. Metrics tell you how many and how fast. Traces show you the journey. Logs give you the narrative details. You need all three to get the full picture. Many teams start with just logs and then wonder why finding the root cause of a slowdown is so difficult.
Starting with metrics: they are the numerical pulse of your service. You track things like how many requests you receive per second, how many of them fail, and how long they take. In Go, you can use a library like OpenTelemetry to define these measurements. The key is consistency. If every service names its request counter the same way (the code below uses OpenTelemetry’s dotted convention, http.requests.total, which Prometheus-style backends typically expose as http_requests_total), you can easily compare them all on a single dashboard.
// appMetrics holds the instruments your handlers record against.
type appMetrics struct {
    requests metric.Int64Counter
    duration metric.Float64Histogram
}

func setupMetrics(meter metric.Meter) (*appMetrics, error) {
    // A simple counter for requests
    requests, err := meter.Int64Counter("http.requests.total",
        metric.WithDescription("Total number of HTTP requests"),
        metric.WithUnit("{request}"))
    if err != nil {
        return nil, err
    }
    // A histogram to track request duration
    duration, err := meter.Float64Histogram("http.request.duration.seconds",
        metric.WithDescription("The duration of HTTP requests"),
        metric.WithUnit("s"))
    if err != nil {
        return nil, err
    }
    // Return the instruments so they can be attached to your app state
    return &appMetrics{requests: requests, duration: duration}, nil
}
But raw numbers aren’t enough. You need to know which requests are slow or failing. This is where distributed tracing shines. When a user request comes in, you generate a unique trace ID. This ID gets passed along to every other service that request touches, like a guest pass at a multi-stop tour. Each service adds its own “span” to the trace, marking when it started work and when it finished. Later, you can see the entire path and pinpoint exactly where the delay happened.
Making this work requires propagating that trace context. In Go, context.Context carries it within a process. When your service calls another, you inject the trace information into the HTTP headers or gRPC metadata.
func callDownstreamService(ctx context.Context, client *http.Client, url string) ([]byte, error) {
    // Create a new span for this specific operation
    ctx, span := tracer.Start(ctx, "call-downstream-api")
    defer span.End() // Marks the end of the span when the function exits
    // Prepare the request
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        span.RecordError(err) // Attach the error to the span
        return nil, err
    }
    // The OpenTelemetry propagator copies the current trace context into the request headers
    propagator := otel.GetTextMapPropagator()
    propagator.Inject(ctx, propagation.HeaderCarrier(req.Header))
    // Execute the request
    resp, err := client.Do(req)
    if err != nil {
        span.RecordError(err)
        return nil, err
    }
    defer resp.Body.Close()
    return io.ReadAll(resp.Body)
}
Now, let’s talk about logs. The old way of printing free-form lines of text is not helpful in a system with hundreds of instances. Instead, you use structured logging: every log entry becomes a structured event with key-value pairs, which lets you instantly find every log line for a specific user or a failing transaction.
func HandleLogin(ctx context.Context, logger *zap.Logger, username string) {
    // Create a logger with fields relevant to this request
    requestLogger := logger.With(
        zap.String("handler", "login"),
        zap.String("username", username),
    )
    requestLogger.Info("login_attempt_started")
    // ... authentication logic ...
    if err := authenticateUser(username); err != nil {
        // Log the failure with the error
        requestLogger.Error("login_failed", zap.Error(err))
        return
    }
    requestLogger.Info("login_succeeded")
}
A powerful pattern is to connect your logs and traces. Every span created in the tracing code carries a trace ID, and you can include that same ID in every log line for the request. Suddenly, you can find a trace in your tracing tool, grab its ID, and search your logs for everything that happened during that request. It turns fragmented data into a coherent story.
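Here is a minimal sketch of that linkage, assuming OpenTelemetry for tracing and zap for logging; the loggerWithTrace helper is my own name, not a standard API. It pulls the trace and span IDs from the active span in the context and attaches them as logger fields.
import (
    "context"

    "go.opentelemetry.io/otel/trace"
    "go.uber.org/zap"
)

// loggerWithTrace returns a logger that tags every entry with the
// current trace and span IDs, if an active trace is present.
func loggerWithTrace(ctx context.Context, logger *zap.Logger) *zap.Logger {
    sc := trace.SpanFromContext(ctx).SpanContext()
    if !sc.HasTraceID() {
        return logger // no active trace; log without correlation fields
    }
    return logger.With(
        zap.String("trace_id", sc.TraceID().String()),
        zap.String("span_id", sc.SpanID().String()),
    )
}
Call it at the top of a handler, before deriving request-scoped loggers like the one in HandleLogin, and every subsequent log line carries the correlation fields for free.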
Health checks are your service’s way of saying “I’m okay” or “I’m not.” Kubernetes and other orchestrators constantly call these endpoints. A liveness probe answers, “Is the process running?” A readiness probe answers, “Can I handle new work?” The latter might check database connections or the status of a cache.
func (s *Service) healthHandler(w http.ResponseWriter, r *http.Request) {
    // Simple liveness check
    if r.URL.Path == "/live" {
        w.WriteHeader(http.StatusOK)
        fmt.Fprint(w, "alive")
        return
    }
    // Readiness check with dependencies
    if r.URL.Path == "/ready" {
        ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
        defer cancel()
        // Check database
        if err := s.db.PingContext(ctx); err != nil {
            s.logger.Error("db_not_ready", zap.Error(err))
            http.Error(w, "database not ready", http.StatusServiceUnavailable)
            return
        }
        // Check cache
        if !s.cache.IsConnected() {
            http.Error(w, "cache not ready", http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
        fmt.Fprint(w, "ready")
        return
    }
    // Anything else is not a health endpoint
    http.NotFound(w, r)
}
When things go wrong, good error handling is part of observability. Don’t just return a generic “internal server error.” Capture the error with as much context as possible and report it to a dedicated service. This includes the stack trace, the variables involved, and the trace ID. I’ve seen teams spend days trying to reproduce a bug that their error tracker could have explained in minutes.
func riskyOperation(ctx context.Context) (result string, err error) {
    // Defer a function to recover from any panic and report it as an error
    defer func() {
        if r := recover(); r != nil {
            err = fmt.Errorf("panic recovered: %v", r)
            // Capture the stack trace alongside the trace ID
            logger.Error("operation_panicked",
                zap.Any("panic_value", r),
                zap.String("stack", string(debug.Stack())),
                zap.String("trace_id", getTraceIDFromContext(ctx)),
            )
        }
    }()
    // Your main logic that might panic
    result = doSomethingDangerous()
    return result, nil
}
Defining what “normal” looks like is a critical step. You measure your key metrics—latency, error rate, throughput—during a period of known good performance. These become your baselines. Your monitoring system can then watch for deviations. If latency for the payment service suddenly jumps from a baseline of 50ms to 200ms, you get an alert before users start complaining. This shifts you from reactive to proactive.
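In practice this deviation check usually lives in your monitoring system’s alert rules rather than in application code, but the idea is simple enough to express directly. A rough sketch, with an illustrative BaselineAlert type and threshold values I chose only for the example:
import "time"

// BaselineAlert compares an observed latency against a recorded baseline.
type BaselineAlert struct {
    Baseline   time.Duration // measured during a known-good period, e.g. 50ms
    Multiplier float64       // how far above baseline counts as abnormal
}

// Exceeded reports whether the observed latency is far enough above
// the baseline to warrant an alert.
func (b BaselineAlert) Exceeded(observed time.Duration) bool {
    threshold := time.Duration(float64(b.Baseline) * b.Multiplier)
    return observed > threshold
}
With Baseline set to 50ms and Multiplier set to 3, a 200ms observation trips the alert, matching the payment-service example above.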
In high-volume systems, you can’t record every single trace or log at the highest detail level. You’d drown in data. This is where sampling comes in. You might record full detail for only 1 out of every 100 requests. But you should always record data for requests that result in errors. This gives you representative data without the cost of storing everything.
func shouldSample(traceID string, isError bool) bool {
    // Always sample errors
    if isError {
        return true
    }
    // For successful requests, sample ~10%.
    // A deterministic hash of the trace ID keeps the decision
    // consistent for the same trace across services.
    hash := fnv.New32a()
    hash.Write([]byte(traceID))
    return hash.Sum32()%100 < 10
}
The final pattern I rely on is managing cardinality, a fancy term for the number of unique label combinations a metric can produce. Imagine you add a user_id label to your request metric. If you have a million users, you now have a million different time series, and your monitoring system will grind to a halt. Instead, use labels that group things usefully: route, method, status_code. These have limited, known sets of values. You can still put the user_id in the trace and the logs for specific investigation.
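A short sketch of that split, assuming OpenTelemetry; the recordRequest helper and the attribute names are illustrative, not a prescribed API.
import (
    "context"

    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/metric"
    "go.opentelemetry.io/otel/trace"
)

// recordRequest keeps metric labels low-cardinality while pushing
// high-cardinality detail onto the current span.
func recordRequest(ctx context.Context, requests metric.Int64Counter, route, method string, status int, userID string) {
    // Bounded label set: route, method, and status code have a small,
    // known number of values, so the time-series count stays manageable.
    requests.Add(ctx, 1, metric.WithAttributes(
        attribute.String("http.route", route),
        attribute.String("http.method", method),
        attribute.Int("http.status_code", status),
    ))
    // The unbounded detail rides on the span, where one trace can
    // carry it without multiplying time series.
    trace.SpanFromContext(ctx).SetAttributes(attribute.String("user.id", userID))
}
The metric stays cheap because its label values come from a bounded set, while the span absorbs the unbounded user.id for the cases you need to investigate individually.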
Putting it all together feels like building a diagnostic cockpit for your service. You have dials for the current speed and load (metrics), a map of the recent routes taken (traces), and a detailed flight recorder (logs). When an alert goes off, you aren’t staring at a blank screen. You have the tools to find the problem, understand its impact, and fix it. This isn’t just about debugging; it’s about building confidence that you can run your software reliably, at any scale. You move from hoping it works to knowing how it works.