Let’s talk about making your Go programs faster when many things are happening at once. Often, when multiple parts of your code try to use the same piece of data, they have to wait in line. This waiting is called contention, and it can slow everything down. Today, I’ll show you practical ways to reduce that waiting, not by better waiting in line, but by redesigning the line so it barely exists. We’ll use techniques that avoid traditional locks, letting your goroutines work more independently.
I remember hitting a performance wall in a server I was building. The CPU graphs would plateau, not from doing work, but from goroutines blocking each other. That’s when I started looking beyond sync.Mutex. It’s a fantastic tool, but sometimes you need a different approach.
Instead of a lock that grants exclusive access, we can use operations that the CPU guarantees will happen in one, uninterrupted step. In Go, the sync/atomic package is our gateway to this. Think of it like handing off a baton in a relay race as a single, smooth action, rather than stopping to negotiate who gets it.
One of the simplest and most useful tools is the atomic counter. Imagine you’re counting HTTP requests from thousands of connections. Using a mutex to protect a single counter would create a massive traffic jam. An atomic counter avoids this entirely.
type ServerMetrics struct {
	totalRequests uint64
	failedLogins  uint64
	cacheMisses   uint64
}

// With Go 1.19+, the atomic.Uint64 type wraps these operations in a safer API;
// the function style below works on all versions.
func (m *ServerMetrics) RequestHandled() {
	atomic.AddUint64(&m.totalRequests, 1)
}

func (m *ServerMetrics) GetRequestCount() uint64 {
	return atomic.LoadUint64(&m.totalRequests)
}
When you call AddUint64, the CPU handles the read, add, and write as one atomic operation. Other goroutines will see either the old value or the new one, never a corrupted half-updated value. It’s perfect for these high-volume, low-coordination tasks.
But what if you need to update something more complex, like the head of a linked list? You can’t do it in a single arithmetic operation. This is where Compare-And-Swap, or CAS, becomes your best friend. CAS is the fundamental building block for most lock-free structures. The idea is simple: “I think the current value is X. If it still is X, swap it for Y. If someone changed it in the meantime, tell me and I’ll try again.”
Let’s build a trivial lock-free stack to see it in action. This is a learning example; a real one would need careful memory management.
package main

import (
	"sync/atomic"
	"unsafe"
)

type Node struct {
	value int
	next  unsafe.Pointer // *Node, stored as unsafe.Pointer for atomic ops.
}

type LockFreeStack struct {
	top unsafe.Pointer // *Node
}

// Push adds a value to the top of the stack.
func (s *LockFreeStack) Push(v int) {
	newNode := &Node{value: v}
	for { // The retry loop, crucial for CAS operations.
		oldTop := atomic.LoadPointer(&s.top) // Safely read the current top.
		newNode.next = oldTop                // Point the new node at the old top.
		// Try to swing the stack's top pointer to our new node.
		if atomic.CompareAndSwapPointer(&s.top, oldTop, unsafe.Pointer(newNode)) {
			return // Success! We updated the stack.
		}
		// If we get here, the CAS failed: oldTop was no longer the current top.
		// The loop continues, we reload oldTop, and try again.
	}
}

// Pop removes and returns the top value, or false if the stack is empty.
func (s *LockFreeStack) Pop() (int, bool) {
	for {
		oldTop := atomic.LoadPointer(&s.top)
		if oldTop == nil {
			return 0, false // Stack is empty.
		}
		oldNode := (*Node)(oldTop)
		// Load next atomically too: another goroutine may claim oldNode
		// concurrently, and a plain read here would be flagged as a race.
		newTop := atomic.LoadPointer(&oldNode.next)
		// Try to swing the top pointer from oldNode to the next node down.
		if atomic.CompareAndSwapPointer(&s.top, oldTop, newTop) {
			return oldNode.value, true // We successfully claimed oldNode.
		}
		// CAS failed: someone else pushed or popped before us. Retry.
	}
}
This pattern—read, prepare, CAS, retry—is at the heart of lock-free programming. The for loop is not busy-waiting in the traditional sense; it’s an optimistic retry. In practice, under high contention, you’d want to add a backoff, which we’ll discuss later.
One of the most powerful mental models is the Single-Writer Principle. If you can design your data flow so that a specific piece of memory is only ever written by one goroutine, you’ve eliminated write contention for that data. Multiple goroutines can read it atomically all day long without conflict. I use this for things like a periodic config loader. One goroutine loads the config and atomically updates a shared pointer. Hundreds of worker goroutines atomically load that pointer whenever they need to read the config.
This leads us beautifully to a pattern called Read-Copy-Update, or RCU. It’s excellent for read-heavy, write-rarely data like configuration, routing tables, or feature flags. The process is elegant: 1) Create a complete, updated copy of the data structure. 2) Perform all modifications on this copy. 3) When ready, atomically switch the public pointer to the new version.
type Config struct {
	Timeout   time.Duration
	Endpoints []string
	// ... other fields
}

// Stores *Config. Must be seeded with an initial *Config at startup,
// before any reader calls Load, or the type assertion below will panic.
var configPtr atomic.Value

// Writer goroutine: updates the config. This assumes a single writer;
// with multiple writers, the Load-then-Store sequence could lose updates.
func updateConfig(newTimeout time.Duration) {
	oldConfig := configPtr.Load().(*Config)
	// Create a full copy. This is crucial: never mutate the published value.
	newConfig := &Config{
		Timeout:   newTimeout,
		Endpoints: append([]string(nil), oldConfig.Endpoints...), // Copy the slice.
	}
	// Atomically publish the new version.
	configPtr.Store(newConfig)
	// At this point, oldConfig is still in memory. Goroutines that loaded it
	// earlier continue to see a consistent old version until they finish.
}

// Many reader goroutines:
func handleRequest() {
	config := configPtr.Load().(*Config) // Always gets a consistent snapshot.
	use(config.Timeout, config.Endpoints)
}
The magic here is that readers are completely wait-free. They never block, not even for a nanosecond. The cost is on the writer, which must copy the entire structure. For small or infrequently updated data, this is a fantastic trade-off.
Now, let’s tackle a tougher problem: memory. In the lock-free stack example, when a node is popped, who deletes it? If you delete it immediately, another goroutine that just read the pointer might still try to access its next field. This is the problem of safe memory reclamation.
One practical strategy is called Epoch-Based Reclamation. The idea is that you defer actual cleanup until no goroutine could possibly be holding a reference to the old data. Goroutines register themselves in an “epoch” when they start a read operation. Objects deleted in an old epoch can be safely reclaimed once no active goroutines are still in that epoch.
var (
	globalEpoch int64
	// Each slot holds pointers retired during that epoch (indexed epoch % 3).
	retiredLists [3][]unsafe.Pointer
	mu           sync.Mutex // Protects retiredLists; readers never take it.
)

func enterCriticalSection() int64 {
	// In a real implementation, a goroutine-local variable would record its epoch.
	return atomic.LoadInt64(&globalEpoch)
}

func retireObject(ptr unsafe.Pointer, readerEpoch int64) {
	mu.Lock()
	defer mu.Unlock()
	// Add the pointer to the list for the retiring goroutine's epoch.
	retiredLists[readerEpoch%3] = append(retiredLists[readerEpoch%3], ptr)
	// Periodically, a goroutine can advance the global epoch and free the list
	// from two epochs ago, once no active readers can still be that far behind.
}
Implementing this fully is complex, but the pattern is important to know. It highlights that lock-free programming often shifts complexity from lock management to lifecycle management. It's worth noting that in Go specifically, the garbage collector already prevents use-after-free: a popped node stays alive as long as any goroutine holds a pointer to it. Epoch-based schemes matter most when you recycle memory yourself (which reintroduces the ABA problem) or port these structures to a language without a GC.
Sometimes, the best way to avoid a fight is not to have one at all. Sharding is exactly that. Instead of having one counter that everyone fights over, have many counters. Each goroutine writes to its own designated counter. When you need the total, you sum them all. Writes become uncontended.
type ShardedCounter struct {
	// 64 shards is a common choice.
	shards [64]struct {
		c   uint64
		pad [56]byte // Padding to prevent "false sharing".
	}
}

func (s *ShardedCounter) Inc() {
	// Picking a shard is the tricky part. getGoroutineID is a placeholder here;
	// Go deliberately hides goroutine IDs. For production, hash a per-worker ID,
	// or pin to a P via runtime_procPin()/runtime_procUnpin() (linkname tricks).
	shard := getGoroutineID() % 64
	atomic.AddUint64(&s.shards[shard].c, 1)
}

func (s *ShardedCounter) Value() uint64 {
	total := uint64(0)
	for i := 0; i < 64; i++ {
		total += atomic.LoadUint64(&s.shards[i].c)
	}
	return total
}
That pad [56]byte is not just wasted space. It’s fighting a hidden performance killer called false sharing. Modern CPUs cache memory in lines, typically 64 bytes. If two frequently-written variables (like two counters from adjacent shards) fall on the same cache line, a write by Core A to its variable will invalidate the entire cache line for Core B, even though Core B’s variable wasn’t changed. This causes constant cache misses and slows everything down. Padding ensures each counter lives on its own cache line.
When you do have contention on a CAS loop, a little politeness goes a long way. Instead of hammering the CAS in a tight loop, you can back off. A simple exponential backoff gives other goroutines a chance to finish.
func tryWithBackoff(doWork func() bool) {
	const maxRetries = 10
	yields := 1
	for i := 0; i < maxRetries; i++ {
		if doWork() {
			return // Success!
		}
		// Back off before retrying: yield to the scheduler a few times.
		for j := 0; j < yields; j++ {
			runtime.Gosched()
		}
		yields *= 2 // Double the backoff for next time.
	}
	// Fall back to a different strategy (e.g., take a mutex) or try once more.
	doWork()
}

// Used like:
tryWithBackoff(func() bool {
	return atomic.CompareAndSwapUint64(&someValue, oldVal, newVal)
})
This pattern reduces the “busy” part of the wait, lowering system load and often leading to faster overall progress.
For producer-consumer scenarios, a lock-free queue is a workhorse. A simple but effective design is the Michael-Scott queue, which uses two CAS operations (one for the tail, one for the head) to allow fully concurrent enqueues and dequeues.
type LFQueue struct {
	head unsafe.Pointer // *node
	tail unsafe.Pointer // *node
}

type node struct {
	value interface{}
	next  unsafe.Pointer
}

func NewLFQueue() *LFQueue {
	dummy := &node{}
	ptr := unsafe.Pointer(dummy)
	return &LFQueue{head: ptr, tail: ptr}
}

func (q *LFQueue) Enqueue(v interface{}) {
	newNode := &node{value: v}
	for {
		tail := atomic.LoadPointer(&q.tail)
		tailNode := (*node)(tail)
		next := atomic.LoadPointer(&tailNode.next)
		if tail == atomic.LoadPointer(&q.tail) { // Are tail and next consistent?
			if next == nil { // Was tail pointing to the last node?
				// Try to link our new node at the end.
				if atomic.CompareAndSwapPointer(&tailNode.next, nil, unsafe.Pointer(newNode)) {
					// Enqueue done. Try to swing tail to the new node.
					atomic.CompareAndSwapPointer(&q.tail, tail, unsafe.Pointer(newNode))
					return
				}
			} else { // Tail was not pointing to the last node. Help advance it.
				atomic.CompareAndSwapPointer(&q.tail, tail, next)
			}
		}
	}
}
This code shows another useful trick: helping. If a goroutine sees the tail is out of date, it tries to advance it before proceeding with its own work. This cooperation keeps the queue structure healthy.
Finally, for complex read operations, a version number can provide a lightweight consistency check. You increment the version on every write. A reader reads the version, reads the data, then reads the version again. If the version changed, the read might have seen an inconsistent state, so it retries.
type ProtectedData struct {
	mu      sync.Mutex // Serializes writers.
	version uint64     // Odd while a write is in progress, even otherwise.
	data    []string
}

// Note: the unsynchronized read of p.data below is technically a data race
// under Go's memory model and will be flagged by -race. This seqlock-style
// pattern trades strict compliance for reader speed; use it with care.
func (p *ProtectedData) GetSnapshot() []string {
	for {
		v1 := atomic.LoadUint64(&p.version)
		if v1%2 != 0 {
			continue // A write is in progress right now; try again.
		}
		// Create a copy of the slice for the caller.
		snapshot := append([]string(nil), p.data...)
		v2 := atomic.LoadUint64(&p.version)
		// If the version didn't change, no write overlapped our read.
		if v1 == v2 {
			return snapshot
		}
		// A write happened mid-read. Retry for a consistent view.
	}
}

func (p *ProtectedData) Write(newData []string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	atomic.AddUint64(&p.version, 1) // Now odd: a write is in progress.
	p.data = newData
	atomic.AddUint64(&p.version, 1) // Now even: publish the change.
}
This pattern gives you a form of snapshot isolation without readers needing to take a lock.
These patterns are tools, not defaults. My advice is to always start simple. Use channels, use mutexes, keep your critical sections small. Profile your application. Only when you see contention as a measurable bottleneck should you reach for these lock-free techniques. They are more complex to write, test, and debug. But when you need that last bit of scalable performance, understanding how to let goroutines work together without getting in each other’s way is an incredibly powerful skill. It changes how you see concurrency, from managing conflict to designing for cooperation.