Profile-Guided Optimization in Go (PGO): a practical guide

Profile-Guided Optimization (PGO) is the Go compiler feature most teams haven’t enabled, even though it has been generally available since Go 1.21 (August 2023), after shipping as a preview in Go 1.20 in early 2023. The feature feeds runtime profile data into the compiler and uses it to make smarter optimization decisions. Typical gains run 2-7% on real workloads, with some applications seeing 10% or more. That’s not the kind of speedup that justifies a major refactor, but it’s free performance once you’ve wired up the workflow, and the workflow itself is genuinely simple.

I’ve enabled PGO on a couple of Go services in production over the past year. One was a JSON-heavy API where the gains were a clean 5% reduction in p99 latency. The other was a worker process where gains were closer to 2-3% in CPU time. Neither setup took more than an afternoon. The harder question – the one this post mostly answers – is whether PGO is worth the engineering attention at all, and when the gains justify maintaining the profile collection pipeline.

Quick answer: what is PGO in Go?

PGO (Profile-Guided Optimization) is a Go compiler feature that uses runtime CPU profiles to make smarter optimization decisions. You collect a CPU profile from a representative workload, save it as default.pgo in your main package directory, and rebuild with go build – the compiler automatically picks it up. Typical gains are 2-7% on real workloads, primarily from more aggressive inlining of hot functions and devirtualization of interface calls. Stable since Go 1.21. Used in production by Google, Cloudflare, and others.


What PGO actually does at the compiler level

Without PGO, the Go compiler makes optimization decisions based on static analysis of your code. It can’t see which functions are called millions of times per second versus which ones run once at startup. So it applies general heuristics – inline functions under a certain size, leave interface calls as virtual dispatches, allocate registers based on local analysis.

PGO gives the compiler real runtime data to override those heuristics. When you build with a CPU profile, the compiler reads it to identify hot functions (the ones consuming the most CPU during your representative workload) and applies optimizations that would be too expensive to apply everywhere.

Two specific optimizations matter most. Inlining becomes more aggressive for hot functions. Without PGO, inlining functions above the size threshold everywhere would bloat the binary and probably hurt instruction cache performance, with nothing gained on cold paths. With PGO, the compiler knows which call sites are hot enough to justify the inline cost. Devirtualization rewrites an interface call into a direct call when the profile shows one implementation dominates, guarded by a type check that falls back to the normal interface call for any other type. Interface dispatch is fast in Go but not free; eliminating it on the hot path produces measurable wins, partly because the direct call can then be inlined.
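To make devirtualization concrete, here is a hand-written sketch of the transformation. The Hasher interface and both implementations are hypothetical names invented for illustration; the compiler performs the rewrite internally, and sumDevirtualized only approximates what it would generate if the profile showed xorHasher dominating the call site.

```go
package main

import "fmt"

// Hasher is a hypothetical interface with two implementations.
type Hasher interface{ Hash(x uint32) uint32 }

type xorHasher struct{ seed uint32 }

func (h xorHasher) Hash(x uint32) uint32 { return x ^ h.seed }

type mulHasher struct{ k uint32 }

func (h mulHasher) Hash(x uint32) uint32 { return x * h.k }

// sumIndirect is the code as written: every call goes through the interface.
func sumIndirect(h Hasher, n uint32) uint32 {
	var total uint32
	for i := uint32(0); i < n; i++ {
		total += h.Hash(i)
	}
	return total
}

// sumDevirtualized approximates the PGO rewrite: a type check guards a
// direct call to the dominant implementation, with the interface call
// kept as the fallback for every other type.
func sumDevirtualized(h Hasher, n uint32) uint32 {
	var total uint32
	for i := uint32(0); i < n; i++ {
		if x, ok := h.(xorHasher); ok {
			total += x.Hash(i) // direct call, now a candidate for inlining
		} else {
			total += h.Hash(i) // fallback: ordinary interface dispatch
		}
	}
	return total
}

func main() {
	// Both forms compute the same result for the hot type and the cold one.
	fmt.Println(sumIndirect(xorHasher{seed: 7}, 1000) == sumDevirtualized(xorHasher{seed: 7}, 1000))
	fmt.Println(sumIndirect(mulHasher{k: 3}, 1000) == sumDevirtualized(mulHasher{k: 3}, 1000))
}
```

The behavior is identical either way; the only difference is that the hot type takes a path the compiler can see through.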

Other optimizations may factor in depending on the Go version (register allocation and basic block layout are the usual candidates), but inlining and devirtualization account for most of the practical gains.


How to enable PGO in Go

The PGO workflow in Go is simpler than the equivalent in C++ or Rust. Three steps.

Step 1: Build your application normally and deploy it. PGO requires a real workload to profile, so you need a deployed binary first. No special build flags at this stage.

Step 2: Collect a CPU profile from production-like traffic. The standard library’s net/http/pprof package exposes a profile endpoint when imported:

import _ "net/http/pprof"

With pprof imported and an HTTP server serving http.DefaultServeMux (on port 6060 in this example), hit the profile endpoint to collect a 30-second sample:

curl -o default.pgo "http://localhost:6060/debug/pprof/profile?seconds=30"

The profile is captured during 30 seconds of representative traffic, which is what produces meaningful optimization signal. Longer profiles (60-120 seconds) often produce slightly better results.

Step 3: Rebuild with the profile. Place default.pgo in your main package directory (the same directory as main.go) and run a normal build:

go build

Go’s toolchain automatically detects default.pgo and applies PGO. If you want to use a profile from a different location, the explicit flag is:

go build -pgo=path/to/profile.pprof

That’s the entire workflow. go build doesn’t print anything to say PGO is active, so to confirm it, run go version -m on the resulting binary and look for a -pgo entry in the build settings. The first build produces a baseline; subsequent profiles capture how the optimized binary behaves under load, which you can use to iterate.


What performance gains to expect

Performance gains from PGO vary by workload shape, but the patterns across reported benchmarks are consistent enough to set realistic expectations.

The Go team’s benchmarks for Go 1.21 show improvements of around 2-7% across a representative set of programs. CPU-bound applications with significant time in hot paths see the higher end. I/O-bound applications where most time is spent waiting see the lower end because there’s less CPU work for the optimizer to improve.

Heavy JSON marshaling, HTTP servers handling many small requests, and code with well-defined hot paths all see gains toward the upper end of the typical range. Compute-heavy code with simple hot loops sometimes sees less benefit because those loops were already well-optimized by static analysis.

The honest framing on gains: PGO is worth enabling for any production Go service, but it’s not a substitute for actual optimization work. A 5% improvement matters in aggregate across a fleet of services and won’t fix an O(n²) algorithm. Treat it as compounding free performance rather than a tool for fixing specific bottlenecks.
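To see where your own code lands in that range, one option is to build the same benchmark twice, once with -pgo=off and once with your profile, and compare the numbers. The sketch below uses testing.Benchmark so it runs as an ordinary program; the order type and its payload are hypothetical stand-ins for a real hot path.

```go
package main

import (
	"encoding/json"
	"fmt"
	"testing"
)

// order is a hypothetical payload standing in for a real response type.
type order struct {
	ID    int      `json:"id"`
	Items []string `json:"items"`
	Total float64  `json:"total"`
}

// benchMarshal measures JSON encoding of one small payload.
func benchMarshal(b *testing.B) {
	o := order{ID: 42, Items: []string{"a", "b", "c"}, Total: 19.99}
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		if _, err := json.Marshal(o); err != nil {
			b.Fatal(err)
		}
	}
}

func main() {
	// testing.Benchmark picks the iteration count and reports ns/op.
	res := testing.Benchmark(benchMarshal)
	fmt.Println(res.String(), res.MemString())
}
```

A micro-benchmark like this only shows the effect on one code path. Production latency and CPU metrics before and after the rollout remain the measurement that matters.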


How profile collection works in practice

Profile quality shapes optimization quality. A profile from synthetic benchmarks is meaningfully worse than one from real production traffic, because the optimizer bases its decisions on which functions the profile shows as hot, and synthetic benchmarks rarely match real usage patterns.

The pattern that works best in production: collect profiles continuously from a small sample of production instances (1-5% sample rate is plenty for CPU profiling), aggregate periodically, and use the merged profile for builds. The pprof tool can merge multiple profiles into one (the file names here are placeholders):

go tool pprof -proto a.pprof b.pprof > merged.pprof

Most teams shipping PGO in production wire it into CI/CD. The build job pulls the latest aggregated profile from object storage, places it as default.pgo in the main package directory, and runs go build. The whole workflow can be added to an existing build pipeline in maybe a day of engineering work.

Profile drift is the most common practical problem. If your code changes significantly between profile collection and build, some samples point at functions that have since been moved, renamed, or refactored, so the compiler can no longer match them and those functions miss out on optimization. The compiler handles this gracefully but gains decrease. Refresh profiles weekly for actively-developed services, monthly for stable ones.


When PGO is worth the effort

The decision rule on adopting PGO in production comes down to traffic volume and engineering capacity.

For services receiving meaningful production traffic (more than a few requests per second sustained), PGO is worth the engineering time. The setup is small, the maintenance overhead is small, and the 2-7% gains compound across the fleet. For services running at significant scale, even a 3% improvement translates to real cost savings on compute.
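As a back-of-the-envelope illustration of that last point, the arithmetic can be sketched as follows. Every number here is hypothetical, and the calculation assumes compute cost scales linearly with CPU use, which only holds if the fleet can actually be scaled down by the saved amount.

```go
package main

import "fmt"

// monthlySavings estimates savings under the assumption that compute cost
// scales linearly with CPU consumption.
func monthlySavings(instances int, costPerInstance, cpuReduction float64) float64 {
	return float64(instances) * costPerInstance * cpuReduction
}

func main() {
	// Hypothetical fleet: 200 instances at $150/month, 3% less CPU after PGO.
	fmt.Printf("estimated savings: $%.2f/month\n", monthlySavings(200, 150, 0.03))
}
```

With these made-up inputs the estimate comes to $900.00 per month, which is the scale at which an afternoon of setup pays for itself quickly.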

For low-traffic services, internal tools, or batch jobs, the absolute gains are small enough that the engineering attention is better spent elsewhere. PGO works on these workloads, but the improvement is unlikely to be noticeable.

For greenfield projects, enabling PGO from the start is essentially free engineering capital. Wire up the profile collection in CI/CD before the service ships, and you get the gains without the after-the-fact integration work.

The question that compresses the decision: would a 5% performance improvement meaningfully reduce your compute bill or improve your latency SLOs? If yes, PGO pays back the engineering time. If no, prioritize other work and revisit when scale changes the answer.


Common PGO gotchas

A few production-relevant pitfalls show up consistently for teams enabling PGO.

Profile location matters. The Go toolchain looks for default.pgo specifically in the main package directory. If your repo structure has main nested deeply, the profile goes there, not at the repo root. The failure mode is silent – the build succeeds without PGO and prints nothing either way, so the only reliable signal is a missing -pgo entry in the output of go version -m for the binary.

Sample duration matters for profile quality. Profiles shorter than 30 seconds often produce noisier optimization signal, and profiles from very low-traffic systems don’t represent steady-state behavior. Collect during periods of real traffic and aggregate across instances if per-instance traffic is low.

PGO doesn’t replace pprof-based profiling. The CPU profile you collect for PGO is the same kind of profile you’d use for performance analysis, but the workflows are separate. Continue using pprof to find bottlenecks; use PGO for compounding free performance once the bottlenecks are addressed.


If you’ve enabled PGO on a production Go service and have measured numbers on what changed – latency, CPU usage, cost – that writeup is the gap worth filling. Real production reports across diverse workloads are scarce, and specific numbers from teams running PGO at real scale would help the next wave of adopters calibrate expectations.
