In today’s distributed systems landscape, designing robust APIs isn’t just about implementing business logic correctly—it’s about preparing for the inevitable failures. Distributed systems fail in unique ways, and our applications need to gracefully handle these failures to maintain availability.
I’ve learned through years of building microservices architectures that fault tolerance isn’t an afterthought; it’s a fundamental design principle that separates production-ready services from fragile prototypes. Let me share practical approaches to building resilient APIs using circuit breakers and retry patterns.
Understanding the Need for Resilience
When services communicate over networks, failures happen. Servers crash, networks become congested, and deployments introduce bugs. Without proper fault tolerance, these failures cascade through dependent services, potentially bringing down entire systems.
Consider a typical e-commerce application. If the payment service becomes slow, it can tie up all available connections from the checkout service, which in turn affects the shopping cart, eventually making the entire application unresponsive. This phenomenon—where failure propagates from one service to others—is known as a cascading failure.
The Circuit Breaker Pattern
The circuit breaker pattern, inspired by electrical circuit breakers, provides a solution by “breaking the circuit” when a service is failing repeatedly, preventing resource exhaustion and allowing the failing service time to recover.
Circuit breakers have three states:
- Closed: Requests flow normally
- Open: Failures have occurred; requests are immediately rejected
- Half-Open: Testing if the service has recovered
Let’s look at a complete implementation using Node.js:
class CircuitBreaker {
  constructor(requestFn, options = {}) {
    this.requestFn = requestFn;
    this.state = 'CLOSED';
    this.failureThreshold = options.failureThreshold || 5;
    this.failureCount = 0;
    this.successThreshold = options.successThreshold || 2;
    this.successCount = 0;
    this.timeout = options.timeout || 10000;
    this.nextAttempt = Date.now();
    this.resetTimeout = options.resetTimeout || 30000;
    this.listeners = {};
  }
  async fire(...args) {
    if (this.state === 'OPEN') {
      if (Date.now() > this.nextAttempt) {
        this.state = 'HALF-OPEN';
        this.emit('half-open');
      } else {
        this.emit('rejected');
        return Promise.reject(new Error('Circuit breaker is OPEN'));
      }
    }
    try {
      const response = await this.requestFn(...args);
      return this.success(response);
    } catch (err) {
      return this.failure(err);
    }
  }
  success(response) {
    if (this.state === 'HALF-OPEN') {
      this.successCount++;
      if (this.successCount >= this.successThreshold) {
        this.reset();
      }
    } else {
      // A success in the CLOSED state clears the failure streak
      this.failureCount = 0;
    }
    return response;
  }
  failure(err) {
    this.failureCount++;
    if (this.state === 'HALF-OPEN' || this.failureCount >= this.failureThreshold) {
      this.open();
    }
    return Promise.reject(err);
  }
  open() {
    this.state = 'OPEN';
    this.successCount = 0; // a later half-open probe starts with a clean slate
    this.nextAttempt = Date.now() + this.resetTimeout;
    this.emit('open');
  }
  reset() {
    this.failureCount = 0;
    this.successCount = 0;
    this.state = 'CLOSED';
    this.emit('close');
  }
  on(event, callback) {
    this.listeners[event] = this.listeners[event] || [];
    this.listeners[event].push(callback);
  }
  emit(event) {
    if (this.listeners[event]) {
      this.listeners[event].forEach(cb => cb());
    }
  }
}
This implementation provides a robust circuit breaker with event notifications and configurable thresholds.
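To make the usage concrete, here's a brief sketch of wrapping an outbound call with this class. The endpoint, thresholds, and the fetchUser helper are illustrative, not part of the class above:

// Illustrative usage of the CircuitBreaker class above; fetchUser and the URL are made up for this example
const fetchUser = async (id) => {
  const response = await fetch(`https://api.example.com/users/${id}`);
  if (!response.ok) throw new Error(`HTTP error: ${response.status}`);
  return response.json();
};

const breaker = new CircuitBreaker(fetchUser, {
  failureThreshold: 3,   // open after 3 failures
  successThreshold: 2,   // close again after 2 successful probes
  resetTimeout: 15000,   // wait 15s before probing in half-open
});

breaker.on('open', () => console.warn('User service circuit opened'));
breaker.on('close', () => console.info('User service circuit closed'));

// Callers fire the breaker instead of calling fetchUser directly
const user = await breaker.fire(42);

Callers see either the normal response, the underlying error, or an immediate 'Circuit breaker is OPEN' rejection while the breaker is open.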
Implementing Retry Patterns
While circuit breakers help prevent cascading failures, retry patterns help handle transient failures. The key is to retry intelligently, with appropriate backoff strategies.
Here’s a practical implementation of a retry mechanism with exponential backoff:
async function retryWithExponentialBackoff(fn, options = {}) {
  const maxRetries = options.maxRetries || 3;
  const initialDelay = options.initialDelay || 100;
  const factor = options.factor || 2;
  const jitter = options.jitter || 0.1;
  let retries = 0;
  while (true) {
    try {
      return await fn();
    } catch (error) {
      retries += 1;
      if (retries >= maxRetries) {
        throw error;
      }
      // Calculate delay with exponential backoff and jitter
      const delay = initialDelay * Math.pow(factor, retries - 1);
      const randomFactor = 1 + Math.random() * jitter * 2 - jitter;
      const actualDelay = Math.floor(delay * randomFactor);
      console.log(`Retry ${retries} after ${actualDelay}ms`);
      await new Promise(resolve => setTimeout(resolve, actualDelay));
    }
  }
}

// Usage
async function fetchData(url) {
  return await retryWithExponentialBackoff(
    async () => {
      const response = await fetch(url);
      if (!response.ok) throw new Error(`HTTP error: ${response.status}`);
      return response.json();
    },
    { maxRetries: 5, initialDelay: 200 }
  );
}
This implementation includes jitter—slight randomness in the delay—which helps prevent “retry storms” when many clients retry simultaneously.
Combining Circuit Breakers and Retries
For maximum resilience, we can combine both patterns. The circuit breaker prevents unnecessary load during sustained failures, while retries handle transient issues.
Here’s a combined approach using both patterns:
class ResilientClient {
  constructor(options = {}) {
    this.circuitBreaker = new CircuitBreaker(
      this._makeRequest.bind(this),
      {
        failureThreshold: options.failureThreshold || 5,
        resetTimeout: options.resetTimeout || 30000,
      }
    );
    this.retryOptions = {
      maxRetries: options.maxRetries || 3,
      initialDelay: options.initialDelay || 200,
      factor: options.factor || 2,
      jitter: options.jitter || 0.1,
    };
  }
  async request(url, options = {}) {
    try {
      return await this.circuitBreaker.fire(url, options);
    } catch (error) {
      if (error.message === 'Circuit breaker is OPEN') {
        // Fallback behavior when circuit is open
        return this._handleFallback(url, options);
      }
      throw error;
    }
  }
  async _makeRequest(url, options = {}) {
    return await retryWithExponentialBackoff(
      async () => {
        const response = await fetch(url, options);
        if (!response.ok) throw new Error(`HTTP error: ${response.status}`);
        return response.json();
      },
      this.retryOptions
    );
  }
  _handleFallback(url, options) {
    // Return cached data, default values, or gracefully degrade
    console.log('Using fallback for:', url);
    return { fallback: true };
  }
}

// Usage
const client = new ResilientClient();
const data = await client.request('https://api.example.com/data');
This client handles both transient failures with retries and sustained failures with the circuit breaker pattern.
Real-World Implementation with Resilience4j
While the examples above demonstrate the concepts, production systems often use battle-tested libraries. Resilience4j is one of the best options for Java applications:
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import java.time.Duration;
import java.util.function.Supplier;

public class ResilientService {
    private final CircuitBreaker circuitBreaker;
    private final Retry retry;

    public ResilientService() {
        // Configure Circuit Breaker
        CircuitBreakerConfig circuitBreakerConfig = CircuitBreakerConfig.custom()
            .failureRateThreshold(50)
            .waitDurationInOpenState(Duration.ofMillis(1000))
            .permittedNumberOfCallsInHalfOpenState(2)
            .slidingWindowSize(10)
            .build();
        this.circuitBreaker = CircuitBreaker.of("myCircuitBreaker", circuitBreakerConfig);

        // Configure Retry
        RetryConfig retryConfig = RetryConfig.custom()
            .maxAttempts(3)
            .waitDuration(Duration.ofMillis(500))
            .retryExceptions(RuntimeException.class)
            .build();
        this.retry = Retry.of("myRetry", retryConfig);
    }

    public <T> T executeWithResilience(Supplier<T> supplier) {
        // Combine retry with circuit breaker
        Supplier<T> retryableSupplier = Retry.decorateSupplier(retry, supplier);
        return CircuitBreaker.decorateSupplier(circuitBreaker, retryableSupplier).get();
    }

    // Usage example
    public String fetchData() {
        return executeWithResilience(() -> {
            // This code is protected by both retry and circuit breaker
            return callExternalService();
        });
    }

    private String callExternalService() {
        // Actual service call
        return "response data";
    }
}
For .NET applications, Polly is an excellent choice:
using Polly;
using Polly.CircuitBreaker;
using System;
using System.Net.Http;
using System.Threading.Tasks;

public class ResilientHttpClient
{
    private readonly HttpClient _httpClient;
    private readonly AsyncCircuitBreakerPolicy<HttpResponseMessage> _circuitBreakerPolicy;
    private readonly IAsyncPolicy<HttpResponseMessage> _retryPolicy;
    private readonly IAsyncPolicy<HttpResponseMessage> _combinedPolicy;

    public ResilientHttpClient(HttpClient httpClient)
    {
        _httpClient = httpClient;

        // Circuit Breaker Policy
        _circuitBreakerPolicy = Policy
            .HandleResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
            .CircuitBreakerAsync(
                handledEventsAllowedBeforeBreaking: 5,
                durationOfBreak: TimeSpan.FromSeconds(30),
                onBreak: (result, timespan) => Console.WriteLine($"Circuit broken for {timespan.TotalSeconds}s!"),
                onReset: () => Console.WriteLine("Circuit reset!"),
                onHalfOpen: () => Console.WriteLine("Circuit half-open!")
            );

        // Retry Policy with Exponential Backoff
        _retryPolicy = Policy
            .HandleResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
            .WaitAndRetryAsync(
                retryCount: 3,
                sleepDurationProvider: retryAttempt => TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)),
                onRetry: (result, timespan, retryCount, context) =>
                    Console.WriteLine($"Retry {retryCount} after {timespan.TotalSeconds}s")
            );

        // Combine policies: the circuit breaker wraps the retry policy, so it records the final outcome after retries
        _combinedPolicy = Policy.WrapAsync(_circuitBreakerPolicy, _retryPolicy);
    }

    public async Task<HttpResponseMessage> GetAsync(string url)
    {
        return await _combinedPolicy.ExecuteAsync(() => _httpClient.GetAsync(url));
    }
}
Timeout Strategies
Timeouts are another crucial aspect of resilient systems. Without timeouts, a service can hang indefinitely, exhausting resources.
Here’s a simple implementation of a timeout pattern in JavaScript:
async function withTimeout(promiseFn, timeoutMs) {
  let timer;
  // Create a timeout promise that rejects after timeoutMs
  const timeoutPromise = new Promise((_, timeoutReject) => {
    timer = setTimeout(() => {
      timeoutReject(new Error(`Operation timed out after ${timeoutMs}ms`));
    }, timeoutMs);
  });
  // Race the operation against the timeout
  try {
    return await Promise.race([promiseFn(), timeoutPromise]);
  } finally {
    clearTimeout(timer); // don't leave the timer running once the race settles
  }
}
// Usage
async function fetchDataWithTimeout(url, timeout = 3000) {
  return withTimeout(async () => {
    const response = await fetch(url);
    return response.json();
  }, timeout);
}
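One caveat with the Promise.race approach: the losing request keeps running in the background even after the timeout fires. In environments where fetch accepts an AbortSignal (Node 18+ and modern browsers), the underlying request can be cancelled as well. A minimal sketch under that assumption:

// Sketch: cancel the underlying request on timeout via AbortController (assumes Node 18+ or a browser)
async function fetchWithAbortTimeout(url, timeoutMs = 3000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetch(url, { signal: controller.signal });
    if (!response.ok) throw new Error(`HTTP error: ${response.status}`);
    return await response.json();
  } finally {
    clearTimeout(timer); // always clear the timer, whether the request won or lost the race
  }
}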
Bulkhead Pattern for Resource Isolation
Another useful resilience pattern is the bulkhead pattern, which isolates different parts of the system to prevent failures from affecting unrelated components.
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.bulkhead.BulkheadConfig;
import java.time.Duration;
import java.util.function.Supplier;

public class BulkheadExample {
    public static void main(String[] args) {
        // Configure bulkhead
        BulkheadConfig config = BulkheadConfig.custom()
            .maxConcurrentCalls(10)
            .maxWaitDuration(Duration.ofMillis(500))
            .build();
        Bulkhead paymentBulkhead = Bulkhead.of("paymentService", config);
        Bulkhead inventoryBulkhead = Bulkhead.of("inventoryService", config);

        // Protect different operations with different bulkheads
        // (Payment, Inventory, PaymentService and InventoryService are placeholder domain types)
        Supplier<Payment> decoratedPaymentSupplier = Bulkhead.decorateSupplier(
            paymentBulkhead, PaymentService::processPayment);
        Supplier<Inventory> decoratedInventorySupplier = Bulkhead.decorateSupplier(
            inventoryBulkhead, InventoryService::checkInventory);

        // Even if payment service is overwhelmed, inventory checks can still proceed
    }
}
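The same isolation idea carries over to Node.js services. Here's a small, hand-rolled sketch of a semaphore-style bulkhead (not a library API) that caps in-flight calls per dependency so one slow downstream service cannot consume every available request slot:

// Illustrative bulkhead: cap concurrent calls per dependency and reject the overflow
class Bulkhead {
  constructor(maxConcurrent = 10) {
    this.maxConcurrent = maxConcurrent;
    this.active = 0;
  }
  async execute(fn) {
    if (this.active >= this.maxConcurrent) {
      throw new Error('Bulkhead is full');
    }
    this.active++;
    try {
      return await fn();
    } finally {
      this.active--;
    }
  }
}

// Separate bulkheads per dependency keep a saturated payment service from starving inventory checks
const paymentBulkhead = new Bulkhead(10);
const inventoryBulkhead = new Bulkhead(20);

const paymentResult = await paymentBulkhead.execute(() => fetch('https://api.example.com/payments'));

A production version would usually queue excess calls up to a limit rather than rejecting immediately, which mirrors Resilience4j's maxWaitDuration setting.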
Monitoring and Observability
Resilience patterns are most effective when combined with proper monitoring. We need to track circuit breaker states, retry counts, and response times to fine-tune our resilience strategies.
Using a library like Prometheus with Node.js:
const prometheus = require('prom-client');

const circuitBreakerStates = new prometheus.Gauge({
  name: 'circuit_breaker_state',
  help: 'State of the circuit breaker (0=closed, 1=half-open, 2=open)',
  labelNames: ['service']
});

const retryCounter = new prometheus.Counter({
  name: 'retry_count',
  help: 'Number of retries',
  labelNames: ['service', 'operation']
});

// Modified circuit breaker with metrics
class MonitoredCircuitBreaker extends CircuitBreaker {
  constructor(requestFn, options = {}) {
    super(requestFn, options);
    this.serviceName = options.serviceName || 'unknown';
    this.on('open', () => {
      circuitBreakerStates.set({ service: this.serviceName }, 2);
    });
    this.on('half-open', () => {
      circuitBreakerStates.set({ service: this.serviceName }, 1);
    });
    this.on('close', () => {
      circuitBreakerStates.set({ service: this.serviceName }, 0);
    });
  }
}
// Modified retry function with metrics
async function monitoredRetry(fn, options = {}) {
  const service = options.service || 'unknown';
  const operation = options.operation || 'unknown';
  try {
    return await retryWithExponentialBackoff(fn, options);
  } catch (error) {
    // Count operations that exhausted their retries
    retryCounter.inc({ service, operation });
    throw error;
  }
}
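These metrics only help if something scrapes them. With prom-client, the default registry can be exposed over HTTP; here's a sketch using Express (the port and route are just conventions):

const express = require('express');
const app = express();

// Expose every metric registered with prom-client for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', prometheus.register.contentType);
  res.end(await prometheus.register.metrics()); // metrics() returns a promise in recent prom-client versions
});

app.listen(9090);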
Best Practices for API Resilience
From my experience implementing these patterns across numerous services, here are key principles to follow:
- Default timeouts should be conservative. I typically start with 1-2 second timeouts and adjust based on observed performance.
- Circuit breaker thresholds depend on traffic volume. For high-volume services, use higher failure thresholds (10-20); for low-volume services, lower thresholds (3-5) make more sense.
- Retry immediately for idempotent operations, but use increasing delays for non-idempotent operations.
- Always add jitter to retry delays to prevent retry storms.
- Provide fallbacks whenever possible. A degraded response is better than no response.
- Design your API to be resilient to partial failures. If one data source is down, return what you can.
- Be careful with retries for non-idempotent operations like payment processing.
- Make circuit breaker and retry configurations adjustable at runtime to respond to changing conditions (see the sketch after this list).
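As a sketch of that last point: because the hand-rolled CircuitBreaker above stores its thresholds as plain instance fields, they can be adjusted while the process is running, for example from a configuration-refresh hook. The updateBreakerConfig helper below is hypothetical, not part of any library:

// Hypothetical helper: apply new thresholds to a live breaker instance when configuration changes
function updateBreakerConfig(breaker, newOptions = {}) {
  if (newOptions.failureThreshold != null) breaker.failureThreshold = newOptions.failureThreshold;
  if (newOptions.successThreshold != null) breaker.successThreshold = newOptions.successThreshold;
  if (newOptions.resetTimeout != null) breaker.resetTimeout = newOptions.resetTimeout;
}

// e.g. invoked when a config service pushes new values
updateBreakerConfig(breaker, { failureThreshold: 10, resetTimeout: 60000 });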
Implementation in Popular Frameworks
In Spring Boot applications, resilience can be elegantly implemented using the Resilience4j starter:
@Service
public class ProductService {

    private final RestTemplate restTemplate;

    public ProductService(RestTemplate restTemplate) {
        this.restTemplate = restTemplate;
    }

    // @TimeLimiter is omitted here because it requires an async (CompletableFuture) return type
    @CircuitBreaker(name = "productService", fallbackMethod = "getProductFallback")
    @Retry(name = "productService")
    @Bulkhead(name = "productService")
    public Product getProduct(Long id) {
        return restTemplate.getForObject("/products/" + id, Product.class);
    }

    public Product getProductFallback(Long id, Exception e) {
        return new Product(id, "Fallback Product", "This is a fallback product");
    }
}
In Express.js, middleware can provide similar capabilities:
function circuitBreakerMiddleware(options = {}) {
  // The breaker is only used to track outcomes here, so its request function is a no-op
  const breaker = new CircuitBreaker(() => Promise.resolve(), options);
  return function (req, res, next) {
    if (breaker.state === 'OPEN') {
      if (Date.now() <= breaker.nextAttempt) {
        return res.status(503).json({ error: 'Service temporarily unavailable' });
      }
      breaker.state = 'HALF-OPEN'; // allow a probe request through
    }
    // Track the original end method so we can observe the final status code
    const originalEnd = res.end;
    res.end = function (...args) {
      if (res.statusCode >= 500) {
        breaker.failure(new Error(`HTTP ${res.statusCode}`)).catch(() => {});
      } else {
        breaker.success(null);
      }
      return originalEnd.apply(this, args);
    };
    next();
  };
}
// Usage in Express app
app.use('/api/payments', circuitBreakerMiddleware({ failureThreshold: 5 }));
Conclusion
Building resilient APIs requires a thoughtful combination of patterns including circuit breakers, retries, timeouts, and bulkheads. These techniques don’t just protect your services—they create a better user experience by handling failures gracefully.
I’ve implemented these patterns across various architectures, and while the tools may change, the principles remain the same. Start with reasonable defaults, monitor your services, and continuously refine your resilience strategies based on real-world observations.
Remember that resilience is a journey, not a destination. As your system evolves, so too should your fault tolerance mechanisms. With the right patterns in place, your APIs will weather the inevitable storms of distributed systems, maintaining availability even when components fail.