Go profiling has become an essential skill in my development workflow. After years of optimizing Go applications, I’ve refined six techniques that consistently help identify performance bottlenecks and optimize code effectively.
CPU Profiling for Processing Bottlenecks
CPU profiling remains my first choice when applications show high processing times. I start by integrating the profiling endpoint into my application’s startup routine.
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof"
    "time"
)

func main() {
    // Start profiling server
    go func() {
        log.Println("Starting profiler on :6060")
        if err := http.ListenAndServe("localhost:6060", nil); err != nil {
            log.Fatal("Failed to start profiler:", err)
        }
    }()

    // Your main application logic
    runApplication()
}

func runApplication() {
    for i := 0; i < 5; i++ {
        processLargeDataset()
        time.Sleep(100 * time.Millisecond)
    }
}

func processLargeDataset() {
    data := make([]int, 1000000)
    for i := range data {
        // Cap the input so the naive recursion stays tractable
        data[i] = expensiveCalculation(i % 25)
    }
}

// expensiveCalculation is a deliberately inefficient recursive Fibonacci,
// used here only to burn CPU so the profile has something to show.
func expensiveCalculation(n int) int {
    if n <= 1 {
        return n
    }
    return expensiveCalculation(n-1) + expensiveCalculation(n-2)
}
I collect CPU profiles during peak load periods using the command line tool. The 30-second sampling window provides sufficient data for analysis.
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
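When I can't keep an interactive session open against the target host, I sometimes save the profile to a file first and analyze it offline. A minimal sketch, assuming curl is available; the filename is arbitrary:

curl -o cpu.prof "http://localhost:6060/debug/pprof/profile?seconds=30"
go tool pprof cpu.prof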
The interactive pprof interface allows me to examine function call graphs and identify expensive operations. I frequently use the top command to see which functions consume the most CPU time.
(pprof) top 10
(pprof) list expensiveCalculation
(pprof) web
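On recent Go releases I often skip the interactive prompt entirely and open pprof's built-in web UI, which bundles the graph, flame graph, and source views in one place; the port here is an arbitrary choice:

go tool pprof -http=:8080 "http://localhost:6060/debug/pprof/profile?seconds=30"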
Memory Profiling for Allocation Analysis
Memory profiling helps me identify allocation hotspots and potential memory leaks. I examine both heap usage and allocation patterns to optimize memory consumption.
package main

import (
    "fmt"
    "log"
    "net/http"
    _ "net/http/pprof"
    "runtime"
    "time"
)

func main() {
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // Force GC to get an accurate baseline
    runtime.GC()

    simulateMemoryUsage()
}

func simulateMemoryUsage() {
    cache := make(map[string][]byte)

    for i := 0; i < 1000; i++ {
        key := fmt.Sprintf("key_%d", i)
        // Allocate large byte slices (~1GB total, enough to stand out in the heap profile)
        cache[key] = make([]byte, 1024*1024) // 1MB per entry
        if i%100 == 0 {
            printMemStats()
        }
    }

    // Keep the cache reachable while we sleep so heap profiles show the retained memory
    time.Sleep(30 * time.Second)
    runtime.KeepAlive(cache)
}

func printMemStats() {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    fmt.Printf("Alloc = %d KB", bToKb(m.Alloc))
    fmt.Printf(", TotalAlloc = %d KB", bToKb(m.TotalAlloc))
    fmt.Printf(", Sys = %d KB", bToKb(m.Sys))
    fmt.Printf(", NumGC = %v\n", m.NumGC)
}

func bToKb(b uint64) uint64 {
    return b / 1024
}
I access heap profiles through the profiling endpoint to see current memory usage patterns.
go tool pprof http://localhost:6060/debug/pprof/heap
For allocation analysis, I examine the allocs profile to understand total allocation patterns regardless of garbage collection.
go tool pprof http://localhost:6060/debug/pprof/allocs
The flame graph visualization helps me quickly identify memory allocation hotspots.
(pprof) web
(pprof) top 10 -cum
(pprof) list simulateMemoryUsage
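One detail worth remembering: the heap and allocs endpoints carry several sample types, and pprof's -sample_index flag switches between them, so the same profile can answer "what is live right now" and "what allocated the most overall". The values below are the standard heap sample types:

go tool pprof -sample_index=inuse_space http://localhost:6060/debug/pprof/heap
go tool pprof -sample_index=alloc_objects http://localhost:6060/debug/pprof/allocs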
Goroutine Profiling for Concurrency Issues
Goroutine profiling reveals concurrency bottlenecks and goroutine leaks. I monitor goroutine counts and examine their stack traces to identify blocking operations.
package main

import (
    "context"
    "fmt"
    "log"
    "net/http"
    _ "net/http/pprof"
    "runtime"
    "sync"
    "time"
)

func main() {
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    demonstrateGoroutinePatterns()
}

func demonstrateGoroutinePatterns() {
    ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
    defer cancel()

    var wg sync.WaitGroup

    // Start multiple worker goroutines
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            worker(ctx, id)
        }(i)
    }

    // Monitor goroutine count
    go func() {
        ticker := time.NewTicker(5 * time.Second)
        defer ticker.Stop()
        for {
            select {
            case <-ticker.C:
                fmt.Printf("Active goroutines: %d\n", runtime.NumGoroutine())
            case <-ctx.Done():
                return
            }
        }
    }()

    wg.Wait()
}

func worker(ctx context.Context, id int) {
    for {
        select {
        case <-ctx.Done():
            return
        default:
            // Simulate work with potential blocking
            simulateWork(id)
            time.Sleep(100 * time.Millisecond)
        }
    }
}

func simulateWork(id int) {
    // Simulate different types of work that might block
    if id%10 == 0 {
        // Simulate network call
        time.Sleep(50 * time.Millisecond)
    } else {
        // Simulate CPU work
        for i := 0; i < 10000; i++ {
            _ = i * i
        }
    }
}
I examine goroutine profiles to identify blocking patterns and potential leaks.
go tool pprof http://localhost:6060/debug/pprof/goroutine
The goroutine analysis shows stack traces for all active goroutines, helping identify where they’re blocked.
(pprof) top
(pprof) traces
(pprof) web
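For a quick look without pprof at all, the goroutine endpoint also serves human-readable dumps: debug=1 aggregates identical stacks, while debug=2 prints every goroutine with its current state, which is often enough to spot a leak by eye.

curl "http://localhost:6060/debug/pprof/goroutine?debug=1"
curl "http://localhost:6060/debug/pprof/goroutine?debug=2"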
Block Profiling for Synchronization Analysis
Block profiling measures time spent waiting on synchronization primitives. I enable it to identify mutex contention and channel blocking issues.
package main

import (
    "fmt"
    "log"
    "net/http"
    _ "net/http/pprof"
    "runtime"
    "sync"
    "time"
)

func main() {
    // Enable block profiling
    runtime.SetBlockProfileRate(1)

    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    demonstrateBlockingScenarios()
}

func demonstrateBlockingScenarios() {
    var mu sync.Mutex
    var wg sync.WaitGroup
    sharedResource := 0

    // Create contention on mutex
    for i := 0; i < 10; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            for j := 0; j < 1000; j++ {
                mu.Lock()
                // Simulate work while holding lock
                time.Sleep(time.Microsecond * 10)
                sharedResource++
                mu.Unlock()
                // Small delay between lock acquisitions
                time.Sleep(time.Microsecond * 5)
            }
        }(i)
    }

    // Demonstrate channel blocking
    ch := make(chan int, 1) // Small buffer
    wg.Add(2)

    // Slow consumer
    go func() {
        defer wg.Done()
        for i := 0; i < 100; i++ {
            <-ch
            time.Sleep(10 * time.Millisecond) // Slow processing
        }
    }()

    // Fast producer
    go func() {
        defer wg.Done()
        for i := 0; i < 100; i++ {
            ch <- i // Will block when buffer is full
        }
        close(ch)
    }()

    wg.Wait()
    fmt.Printf("Final shared resource value: %d\n", sharedResource)
}
I analyze block profiles to identify synchronization bottlenecks.
go tool pprof http://localhost:6060/debug/pprof/block
The block profile shows where goroutines spend time waiting, helping optimize synchronization patterns.
(pprof) top
(pprof) list demonstrateBlockingScenarios
(pprof) web
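A rate of 1 records every blocking event, which is fine for a demo but heavier than I would leave switched on in a long-running service. The sketch below shows the kind of tuning I usually reach for; the exact threshold is my own judgment call, not a value from the runtime documentation:

package main

import "runtime"

func main() {
    // The argument is interpreted as nanoseconds of blocked time per
    // sampled event: 1 records every blocking event, larger values
    // sample less and cost less, and 0 disables block profiling.
    runtime.SetBlockProfileRate(1_000_000) // roughly one sample per ms of blocking

    // ... run the workload to be profiled ...

    runtime.SetBlockProfileRate(0) // switch it back off afterwards
}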
Mutex Profiling for Lock Contention
Mutex profiling specifically tracks lock contention events. I enable it to identify which mutexes cause the most blocking in concurrent applications.
package main

import (
    "fmt"
    "log"
    "net/http"
    _ "net/http/pprof"
    "runtime"
    "sync"
    "time"
)

func main() {
    // Enable mutex profiling: report roughly 1 in 1000 contention events
    runtime.SetMutexProfileFraction(1000)

    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    demonstrateMutexContention()
}

type ContentionDemo struct {
    mu   sync.Mutex
    data map[string]int

    rwMu   sync.RWMutex
    rwData []int
}

func NewContentionDemo() *ContentionDemo {
    return &ContentionDemo{
        data:   make(map[string]int),
        rwData: make([]int, 0),
    }
}

func (cd *ContentionDemo) writeHeavyOperation(id int) {
    for i := 0; i < 100; i++ {
        cd.mu.Lock()
        cd.data[fmt.Sprintf("key_%d_%d", id, i)] = i
        // Simulate expensive operation while holding lock
        time.Sleep(time.Microsecond * 100)
        cd.mu.Unlock()
    }
}

func (cd *ContentionDemo) readHeavyOperation(id int) {
    for i := 0; i < 100; i++ {
        cd.mu.Lock()
        _ = cd.data[fmt.Sprintf("key_%d_%d", id, i)]
        time.Sleep(time.Microsecond * 50)
        cd.mu.Unlock()
    }
}

func (cd *ContentionDemo) rwMutexDemo(id int, write bool) {
    if write {
        for i := 0; i < 50; i++ {
            cd.rwMu.Lock()
            cd.rwData = append(cd.rwData, id*1000+i)
            time.Sleep(time.Microsecond * 200)
            cd.rwMu.Unlock()
        }
    } else {
        for i := 0; i < 200; i++ {
            cd.rwMu.RLock()
            if len(cd.rwData) > 0 {
                _ = cd.rwData[len(cd.rwData)-1]
            }
            time.Sleep(time.Microsecond * 25)
            cd.rwMu.RUnlock()
        }
    }
}

func demonstrateMutexContention() {
    demo := NewContentionDemo()
    var wg sync.WaitGroup

    // Create high contention scenario
    for i := 0; i < 5; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            demo.writeHeavyOperation(id)
        }(i)

        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            demo.readHeavyOperation(id)
        }(i)
    }

    // Test RWMutex patterns
    for i := 0; i < 2; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            demo.rwMutexDemo(id, true) // Writers
        }(i)
    }

    for i := 0; i < 8; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            demo.rwMutexDemo(id, false) // Readers
        }(i)
    }

    wg.Wait()
}
I examine mutex profiles to understand lock contention patterns.
go tool pprof http://localhost:6060/debug/pprof/mutex
The mutex profile reveals which locks cause the most contention and waiting time.
(pprof) top
(pprof) list writeHeavyOperation
(pprof) web
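When the profile points at a read-mostly structure guarded by a plain Mutex, my usual first fix is to let readers share the lock. A minimal sketch under that assumption; the type and field names here are illustrative and not part of the demo above:

package main

import (
    "fmt"
    "sync"
)

type readMostlyCache struct {
    mu   sync.RWMutex
    data map[string]int
}

func (c *readMostlyCache) get(key string) (int, bool) {
    c.mu.RLock() // readers no longer serialize behind each other
    defer c.mu.RUnlock()
    v, ok := c.data[key]
    return v, ok
}

func (c *readMostlyCache) set(key string, value int) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.data[key] = value
}

func main() {
    c := &readMostlyCache{data: make(map[string]int)}
    c.set("hits", 1)

    var wg sync.WaitGroup
    for i := 0; i < 8; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for j := 0; j < 1000; j++ {
                c.get("hits")
            }
        }()
    }
    wg.Wait()
    fmt.Println("done")
}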
Execution Tracing for Timeline Analysis
Execution tracing provides comprehensive timeline visualization of program execution. I use traces to understand goroutine scheduling, garbage collection impact, and system call patterns.
package main

import (
    "context"
    "fmt"
    "os"
    "runtime"
    "runtime/trace"
    "sync"
    "time"
)

func main() {
    // Create trace file
    f, err := os.Create("trace.out")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // Start tracing
    if err := trace.Start(f); err != nil {
        panic(err)
    }
    defer trace.Stop()

    demonstrateTraceableWorkload()
}

func demonstrateTraceableWorkload() {
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    var wg sync.WaitGroup

    // CPU-intensive workers
    for i := 0; i < 4; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            cpuIntensiveWork(ctx, id)
        }(i)
    }

    // IO-simulating workers
    for i := 0; i < 2; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            ioSimulatingWork(ctx, id)
        }(i)
    }

    // Memory allocation workers
    for i := 0; i < 2; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            memoryIntensiveWork(ctx, id)
        }(i)
    }

    // Background GC trigger
    wg.Add(1)
    go func() {
        defer wg.Done()
        ticker := time.NewTicker(2 * time.Second)
        defer ticker.Stop()
        for {
            select {
            case <-ticker.C:
                runtime.GC()
                fmt.Println("Triggered GC")
            case <-ctx.Done():
                return
            }
        }
    }()

    wg.Wait()
}

func cpuIntensiveWork(ctx context.Context, id int) {
    trace.WithRegion(ctx, "cpu-work", func() {
        for {
            select {
            case <-ctx.Done():
                return
            default:
                // CPU-bound calculation
                result := 0
                for i := 0; i < 100000; i++ {
                    result += i * i
                }
                _ = result
                time.Sleep(time.Millisecond)
            }
        }
    })
}

func ioSimulatingWork(ctx context.Context, id int) {
    trace.WithRegion(ctx, "io-work", func() {
        for {
            select {
            case <-ctx.Done():
                return
            default:
                // Simulate IO wait
                time.Sleep(50 * time.Millisecond)
            }
        }
    })
}

func memoryIntensiveWork(ctx context.Context, id int) {
    trace.WithRegion(ctx, "memory-work", func() {
        for {
            select {
            case <-ctx.Done():
                return
            default:
                // Allocate and release memory
                data := make([]byte, 1024*1024) // 1MB
                for i := range data {
                    data[i] = byte(i % 256)
                }
                time.Sleep(100 * time.Millisecond)
                runtime.KeepAlive(data)
            }
        }
    })
}
After running the traced application, I analyze the execution timeline.
go tool trace trace.out
The web interface provides multiple views including goroutine analysis, network blocking profile, and synchronization blocking profile. I examine the timeline view to understand how goroutines are scheduled and where blocking occurs.
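On a remote or headless machine, where launching a local browser is not useful, I bind the viewer to an explicit address instead; the port is an arbitrary choice:

go tool trace -http=localhost:8081 trace.out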
The trace analysis helps me identify patterns like:
- Goroutine scheduling inefficiencies
- Garbage collection frequency and duration
- System call blocking patterns
- Network and disk IO waiting times
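To make the timeline easier to read, I sometimes annotate application code with tasks and regions from runtime/trace so related work is grouped in the viewer. A small sketch with illustrative names and a trivial workload:

package main

import (
    "context"
    "os"
    "runtime/trace"
)

func main() {
    f, err := os.Create("annotated.trace")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    if err := trace.Start(f); err != nil {
        panic(err)
    }
    defer trace.Stop()

    // Group everything belonging to one logical operation under a task
    ctx, task := trace.NewTask(context.Background(), "handleRequest")
    defer task.End()

    trace.WithRegion(ctx, "decode", func() { /* parse input */ })
    trace.Log(ctx, "stage", "decoded")
    trace.WithRegion(ctx, "compute", func() {
        sum := 0
        for i := 0; i < 1_000_000; i++ {
            sum += i
        }
        _ = sum
    })
}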
Practical Integration Strategies
I integrate these profiling techniques into my development workflow through automated profiling in testing environments. This continuous profiling approach catches performance regressions early.
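In CI I usually lean on the test binary's built-in profiling flags rather than the HTTP endpoint, since they write profiles straight to disk for later comparison; ./mypkg below is a placeholder for the package under test:

go test -bench=. -run=^$ -cpuprofile=cpu.out -memprofile=mem.out ./mypkg
go tool pprof cpu.out

For a long-running service, I wire the same switches into the binary itself: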
package main

import (
    "context"
    "log"
    "net/http"
    _ "net/http/pprof"
    "os"
    "runtime"
    "runtime/trace"
    "time"
)

type ProfileConfig struct {
    EnableCPU    bool
    EnableMemory bool
    EnableBlock  bool
    EnableMutex  bool
    EnableTrace  bool
    Duration     time.Duration
}

func StartProfiling(config ProfileConfig) {
    // CPU and heap profiles are collected on demand through the pprof
    // HTTP endpoint below; block, mutex, and trace need explicit setup.
    if config.EnableBlock {
        runtime.SetBlockProfileRate(1)
    }

    if config.EnableMutex {
        runtime.SetMutexProfileFraction(1000)
    }

    if config.EnableTrace {
        f, err := os.Create("execution.trace")
        if err == nil {
            if err := trace.Start(f); err == nil {
                go func() {
                    time.Sleep(config.Duration)
                    trace.Stop()
                    f.Close()
                }()
            } else {
                f.Close()
            }
        }
    }

    // Start profiling server
    go func() {
        log.Println("Profiling server started on :6060")
        log.Fatal(http.ListenAndServe("localhost:6060", nil))
    }()
}

func main() {
    config := ProfileConfig{
        EnableCPU:    true,
        EnableMemory: true,
        EnableBlock:  true,
        EnableMutex:  true,
        EnableTrace:  true,
        Duration:     30 * time.Second,
    }

    StartProfiling(config)

    // Run your application
    runApplicationWorkload()
}

func runApplicationWorkload() {
    // Simulate realistic application workload
    ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
    defer cancel()

    // Your actual application logic here
    simulateRealWorldScenario(ctx)
}

func simulateRealWorldScenario(ctx context.Context) {
    // Implementation of a realistic workload
    for {
        select {
        case <-ctx.Done():
            return
        default:
            processRequest()
            time.Sleep(10 * time.Millisecond)
        }
    }
}

func processRequest() {
    // Simulate request processing
    data := make([]int, 1000)
    for i := range data {
        data[i] = i * i
    }
}
These six profiling techniques form a comprehensive performance analysis toolkit. CPU profiling identifies processing bottlenecks, memory profiling reveals allocation patterns, goroutine profiling exposes concurrency issues, block profiling shows synchronization delays, mutex profiling tracks lock contention, and execution tracing provides timeline visualization.
Regular profiling during development and production monitoring helps maintain optimal performance. I recommend establishing baseline profiles for applications and comparing them regularly to detect performance regressions before they impact users.
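For the baseline comparison itself, pprof can subtract one profile from another with the -base flag, which makes regressions stand out immediately; the file names here are illustrative:

go tool pprof -base=baseline_cpu.prof current_cpu.prof
go tool pprof -http=:8080 -base=baseline_heap.prof current_heap.prof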
The combination of these techniques provides complete visibility into Go application performance characteristics, enabling data-driven optimization decisions and proactive performance management.