· engineering · 18 min read
Scaling PostgreSQL for high-traffic web applications
PostgreSQL scales further than most teams realise. The trouble is rarely the database itself, it's the order in which teams reach for fixes. Here is the layered hierarchy we use with clients before reaching for distributed PostgreSQL.

PostgreSQL scales further than most teams realise. The trouble is rarely the database itself, it’s the order in which teams reach for fixes. We’ve watched well-meaning engineering teams jump to read replicas, sharding, or a managed distributed offering before they’ve spent an afternoon looking at query plans, and the result is always the same: more infrastructure, more cost, more complexity, and the original bottleneck still sitting there waiting.
When OpenAI shared how it runs PostgreSQL behind ChatGPT, the headline detail wasn’t that they’d built something exotic. They run a single primary on Azure Database for PostgreSQL Flexible Server with nearly 50 read replicas. That’s it. No sharding. No bespoke distributed system. The engineering work went into operational discipline at every layer of the stack, not into a clever architecture diagram.
That’s the right model for thinking about PostgreSQL scaling. There’s a hierarchy of interventions, ordered roughly by simplicity and cost. The right move is almost always to exhaust the cheaper layer before reaching for the next one. Below is how we walk clients through that hierarchy, with the patterns we run first, the configuration changes we make second, and the architectural shifts we leave for last.
Layer 1: start with the query, not the architecture
The biggest single gain any team can find before touching infrastructure is in the queries already running on the database. A poorly written query cannot be rescued by faster disks or more replicas. Adding the right index can turn a 10-second sequential scan into a sub-10ms lookup, and that’s the kind of improvement no amount of hardware will deliver.
The tool for this is EXPLAIN ANALYZE. It shows exactly how the query planner is executing a statement and where the time is going. The patterns we look for first are sequential scans on large tables (which usually means a missing index), high-cost nested loop joins that should be hash joins, and large row count estimates that suggest the statistics are stale and need an ANALYZE.
Index choice matters more than people realise. B-tree indexes are the default and they’re right for most equality and range queries on ordered data. Partial indexes (where you index only rows matching a WHERE clause, like WHERE status = 'active') can be a fraction of the size of a full index and much faster when your queries are selective. Composite indexes are where teams trip up most often, the column order matters, and the most selective or equality-matched column should come first. For full-text search, JSONB, or array columns, GIN and GiST indexes are the right tools.
There’s a Mattermost engineering write-up that’s become a small classic in this space. They had a paginated query against a posts table that was getting slower and slower, eventually timing out at 30 seconds. The query plan showed an Index Scan with a Filter that was processing 40 million rows just to return 10,000 results. The fix was reshaping the keyset pagination clause so the planner could use an Index Cond instead of an Index Filter. End result, a 1,000x improvement on a query that hadn’t changed in years.
That kind of speedup isn’t unusual once you understand what the planner is doing. We worked with a SaaS team last year whose activity feed was getting slow as their tables grew, and they were considering moving the workload to a separate database. The actual problem was straightforward: they were using OFFSET-based pagination. On a feed table with tens of millions of rows, OFFSET 50000 means PostgreSQL has to read and discard 50,000 rows before returning anything. The performance degrades linearly as users page deeper. Switching to keyset pagination (where the WHERE clause references the last seen ID, like WHERE id > $last_seen_id ORDER BY id LIMIT 100) gave them constant-time performance regardless of how deep the user paged. No new infrastructure. No replicas. Just a better query.
Layer 2: the conservative defaults problem
PostgreSQL ships with defaults designed to run on the smallest possible machine. That makes sense for a project that has to install cleanly on a Raspberry Pi or a developer’s laptop, but it means a production server running out of the box is using a tiny fraction of its available resources. Four parameters matter most.
| Parameter | Default | Production setting | Notes |
|---|---|---|---|
shared_buffers | 128 MB | 25 to 40% of total RAM | The primary in-memory page cache, around 4 GB on a 16 GB server |
work_mem | 4 MB | 16 to 64 MB, workload-dependent | Per-operation sort and hash memory, multiply by concurrent queries |
effective_cache_size | None | 75% of total RAM | Tells the planner how much RAM is available for caching |
maintenance_work_mem | 64 MB | 256 MB to 1 GB | Used for VACUUM and CREATE INDEX, safe to set high |
wal_buffers | Auto, around 16 MB | 64 MB for write-heavy workloads | WAL in-memory staging buffer |
The work_mem setting is the one teams get wrong most often. It applies per operation, not per query. A single complex query can spawn many sort and hash operations simultaneously, each of which can use up to work_mem of memory. We’ve seen teams set work_mem to something aggressive like 256 MB globally, then watch the server run out of memory under concurrent load. The right pattern is a conservative global value, with per-session overrides for heavy analytical queries using SET LOCAL work_mem = '256MB'.
WAL tuning is the other lever at this layer. PostgreSQL’s Write-Ahead Log ensures durability by writing changes sequentially to disk before applying them to data pages. Sequential writes are inherently faster than random I/O, but the configuration controls the trade-off between durability and throughput. We typically set wal_level = replica for streaming replication support, synchronous_commit = local so commits return after a local WAL flush without waiting for replicas, and checkpoint_completion_target = 0.9 to spread checkpoint I/O over 90% of the interval rather than producing periodic spikes.
These settings won’t make a bad query fast, but they’ll make a well-written query use the hardware properly. We’ve seen 2 to 5x performance improvements from configuration tuning alone, on systems where nothing else changed.
Layer 3: connection pooling is not optional
PostgreSQL uses a process-per-connection architecture, which means a new operating system process is spawned for every client connection. At low concurrency this is fine. At a few hundred concurrent connections, the memory pressure and context-switching overhead start to dominate, and the database spends more time managing connections than answering queries.
A connection pooler solves this by sitting between the application and the database, holding a small number of long-lived database connections and multiplexing many client connections across them. PgBouncer is the de-facto standard, and it operates in three modes.
| Mode | Description | When to use |
|---|---|---|
| Session pooling | One pooled connection per client session | Not recommended for high concurrency, it doesn’t solve the root problem |
| Statement pooling | New pooled connection per SQL statement | Too restrictive, breaks multi-statement transactions |
| Transaction pooling | Pooled connection held only during a transaction | The right answer for almost all production systems |
Transaction pooling is where teams should default. OpenAI’s PgBouncer deployment, running in transaction or statement mode, cut their average connection setup time from 50ms to 5ms, a 10x improvement just from pooling. Each of their read replicas runs its own Kubernetes deployment with multiple PgBouncer pods load-balanced behind a service. That’s a sophisticated implementation, but the underlying pattern is the same as the one we deploy for clients running an order of magnitude smaller workloads.
There is a catch with transaction pooling that catches teams out. Because connections can be reassigned between transactions, session-level features stop working as expected. Prepared statements, advisory locks, SET LOCAL, temporary tables, and LISTEN/NOTIFY all need careful handling. Most modern ORMs and frameworks have a transaction-pooling-compatible mode, but it’s worth auditing before flipping the switch in production.
For very high connection counts, a two-tier PgBouncer architecture (application to PgBouncer layer 1 to PgBouncer layer 2 to PostgreSQL) reduces the number of actual server connections further. We’ve only seen this needed for clients with genuine connection storms, the kind that come from running thousands of serverless function instances. For most production deployments, a single PgBouncer layer in transaction mode is the right answer.
One operational note: PgBouncer’s admin interface supports PAUSE commands, which enqueue new transactions while active operations finish. This is what enables zero-error schema migrations and database restarts. If your deployment pipeline doesn’t use PgBouncer’s pause and resume during deploys, you’re leaving uptime on the table.
Layer 4: read replicas, where horizontal scaling actually starts
For read-heavy workloads (which is most web applications), streaming read replicas are the most practical way to scale horizontally. PostgreSQL’s native streaming replication propagates changes from the primary’s WAL to one or more hot standby replicas, which serve read-only queries.
The mechanics are straightforward. Changes are streamed asynchronously, so replicas lag the primary by some amount, typically sub-second under normal load. Under write spikes that lag can grow, which means replicas are unsuitable for reads that require absolute consistency without explicit routing logic. If a user submits a comment and immediately reloads the page expecting to see it, you can’t serve that read from a replica without risking the comment not being there yet. The pattern we usually deploy is sticky routing for recently-written data (read your own writes from the primary, within a short window) and replica routing for everything else.
There are a few useful patterns at this layer. A single replica is enough to offload reporting, analytics, and batch jobs from the primary, and that alone often resolves the most common scaling complaint we see, which is that nightly reports are degrading daytime transactional performance. We had a client running a B2B platform where their finance team’s month-end queries were causing customer-facing API latency to spike by 10x for a few hours each cycle. Moving those queries to a dedicated reporting replica fixed it without touching application code beyond the database connection string.
For more demanding workloads, priority-tiered replicas are the pattern to reach for. OpenAI runs nearly 50 replicas across multiple regions and categorises traffic by priority, with high-priority requests routed to dedicated replica pools so a slow analytical query can’t degrade latency for critical-path traffic. That’s the same Quality of Service pattern that’s been standard in network engineering for decades, applied to the database layer.
Failover at this layer needs a manager. Patroni is the leading open-source HA controller, automating leader election and failover when the primary goes down. The key configuration decisions are around synchronous_commit (set to remote_apply if you need zero data loss at the cost of write latency, local if you can tolerate a small data loss window for better write throughput), maximum_lag_on_failover to bound how stale a promoted replica can be, and replication slots to ensure the primary retains WAL until a replica has received it.
Layer 5: table partitioning before you reach for sharding
Once individual tables exceed hundreds of millions of rows, declarative table partitioning starts to pay off. PostgreSQL’s partition pruning lets the query planner skip irrelevant partitions entirely, and maintenance operations like vacuuming, indexing, and bulk deletes become much cheaper because they operate on smaller individual partitions.
Range partitioning is the most common pattern, partition by created_at month or day for time-series data. The killer feature is that dropping old data becomes a near-instant DDL operation (DROP PARTITION) instead of a slow DELETE statement. A 200 million row delete generates 200 million WAL entries that have to be replicated, vacuumed, and indexed. A DROP PARTITION generates a handful. Hash partitioning distributes rows evenly across N partitions by hashing a key, useful for high-write tables without natural time ordering. List partitioning segments data by discrete values like region or status, best when query patterns align tightly with the partition keys.
| Dimension | Partitioning | Sharding |
|---|---|---|
| Scope | Single PostgreSQL instance | Multiple PostgreSQL nodes |
| Transparency | Fully transparent to queries | Requires routing logic in the application or middleware |
| Write scalability | Minimal improvement | Significant improvement |
| Complexity | Low to medium | High |
| Use case | Large tables, data lifecycle management | Write throughput beyond a single node |
The two get conflated in conversations. Partitioning is a single-instance optimisation, sharding is a multi-instance one. Partitioning is usually a prerequisite for sharding rather than an alternative, sharded deployments typically shard at the partition granularity. We almost always implement partitioning first and revisit sharding only if the write workload actually warrants it.
Layer 6: the vacuum problem you only notice when it’s too late
PostgreSQL’s Multi-Version Concurrency Control (MVCC) is what allows readers and writers to proceed without blocking each other. The cost is that every UPDATE and DELETE creates dead tuples, old row versions that accumulate until autovacuum cleans them up. Without aggressive vacuuming, tables and indexes bloat, query performance degrades, and eventually you’re forced to take downtime to fix it.
The default autovacuum settings throttle aggressively, which makes sense for a small database where vacuuming during business hours could degrade performance. For high-write tables in a production system, those defaults are too conservative. The most important fix is per-table autovacuum overrides on high-churn tables:
ALTER TABLE orders SET (
autovacuum_vacuum_scale_factor = 0.01, -- vacuum when 1% of rows are dead (vs 20% default)
autovacuum_vacuum_cost_delay = 2 -- reduce throttling from default 20ms
);We’ve had clients where a single high-write table, often a job queue or audit log, was responsible for the majority of their performance complaints. Autovacuum couldn’t keep up with the dead tuple accumulation, the table bloated, indexes bloated, and queries that should have been milliseconds were taking seconds. Per-table autovacuum tuning fixed it in every case, without touching application code or hardware.
There’s a more existential problem to monitor for as well. PostgreSQL’s transaction ID counter is 32-bit and wraps around at roughly 2 billion transactions. As tables approach this threshold, PostgreSQL triggers a forced VACUUM FREEZE that can halt writes entirely. We monitor age(datfrozenxid) across all databases on every production system we manage and alert if it climbs above 1.5 billion.
When a table has already bloated badly, VACUUM FULL reclaims the space but takes an exclusive table lock, blocking all reads and writes for the duration. That’s almost never acceptable in production. The right tool is pg_repack, which rebuilds tables and indexes online with only brief lock windows. We use it routinely on tables that have gotten away from us before we were brought in.
This is the layer most teams underestimate, and Bohan Zhang’s POSETTE talk about OpenAI’s PostgreSQL deployment said the same thing in different words: MVCC’s write amplification is the primary constraint at scale. You can’t escape it by adding replicas, you have to manage it directly.
Layer 7: caching, Redis or PostgreSQL?
The conventional answer to read-heavy load is Redis. It’s fast, it’s well-understood, and it integrates cleanly with most application stacks. The cache-aside pattern (check Redis first, fall back to PostgreSQL on a miss, populate the cache on the way back) can absorb 80 to 90% of read requests without touching the database, which is the kind of headroom that changes scaling math.
But it isn’t always the right answer. We’ve started seeing teams discover that a well-tuned PostgreSQL instance can serve a surprising amount of cache-like workload directly out of its shared buffer pool. There’s a recent write-up from a team running a 400,000 events per second pipeline that hit a 99.7% cache hit ratio directly from PostgreSQL’s shared buffer cache, by sizing shared_buffers and tuning indexes so that recent data pages stayed permanently in RAM. The system served 203,000 blocks per second from memory.
That’s not an argument against Redis, it’s an argument for understanding what PostgreSQL is already doing for you. PostgreSQL has two layers of in-memory caching (its shared buffers plus the operating system page cache), and for workloads with a bounded hot dataset, those layers can serve everything you need. The cost of adding Redis is real: a second piece of infrastructure to operate, cache invalidation logic to maintain, and the dual-write consistency problem to reason about.
The pattern we use with clients is to size PostgreSQL properly first, measure the actual cache hit ratio (buffers_hit / (buffers_hit + buffers_read) in pg_stat_bgwriter), and only add Redis when the workload requires sub-millisecond latency that PostgreSQL can’t provide. For sessions and ephemeral state, Redis is usually the right call. For your application’s main read workload, well-tuned PostgreSQL is often enough.
Layer 8: distributed PostgreSQL, the last lever
When write throughput exceeds what a single primary can handle, even after all the optimisations above, distributed PostgreSQL becomes necessary. We want to be clear about how rare this actually is. Most workloads we see, including reasonably high-traffic SaaS platforms doing hundreds of millions of requests per month, have never come close to needing this layer.
Citus, now part of Microsoft and available as Azure Cosmos DB for PostgreSQL, turns PostgreSQL into a distributed coordinator-worker cluster. Distributed tables are sharded across worker nodes by a distribution column like tenant_id. Reference tables are replicated to all nodes so that joins with distributed tables still work. Colocation places related shards on the same worker so that tenant-scoped queries don’t pay a network hop. For multi-tenant SaaS workloads sharding by tenant, this works well. For workloads that need large cross-shard joins or complex ETL operations, it doesn’t.
For multi-region write availability, the options have matured significantly. AWS open-sourced pgactive in June 2025, an asynchronous active-active replication extension built on PostgreSQL’s logical replication infrastructure. pgEdge offers open-source multi-master replication using the Spock extension. EDB Postgres Distributed provides enterprise-grade multi-primary with built-in conflict resolution. All of them solve the same fundamental problem (writes from multiple regions or instances) and all of them introduce the same fundamental challenge, which is conflict resolution when two nodes update the same row simultaneously.
The managed cloud options each take a different architectural approach.
| Platform | Architecture | Key advantage | Trade-off |
|---|---|---|---|
| AWS Aurora for PostgreSQL | Shared distributed storage, single writer | Faster failover, up to 15 read replicas | Higher cost, some latency for write-then-read |
| Google AlloyDB | Shared storage with columnar cache | Strong analytical performance, claims around 2x transactional and 100x analytical vs standard PostgreSQL | GCP lock-in, not a full PostgreSQL drop-in |
| YugabyteDB | True distributed SQL, shared-nothing | Multi-region active-active writes | Higher operational complexity |
| Citus / Azure Cosmos DB for PostgreSQL | Coordinator and worker sharding | Open source, full SQL support | Cross-shard queries need careful schema design |
We talk clients out of jumping to this layer more often than we move them into it. The pattern we see most often is a team that’s hit a performance ceiling, hasn’t run EXPLAIN ANALYZE on their slowest queries, hasn’t tuned shared_buffers, doesn’t have PgBouncer, and is reading from the primary because they haven’t set up replicas yet. Distributed PostgreSQL is the right answer when you’ve exhausted everything else, not when you’ve skipped to the end of the book.
Observability is how you know which layer to touch
Scaling without observability is flying blind. Two extensions and a handful of system views tell you almost everything you need to know.
pg_stat_statements is enabled via shared_preload_libraries = 'pg_stat_statements' and tracks query execution frequency, total time, and I/O statistics. This is the most useful diagnostic tool in PostgreSQL.
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;This query tells you where the database is actually spending its time. The top of that list is almost always where the highest-leverage optimisations live. pg_stat_monitor from Percona is an advanced replacement that aggregates statistics into configurable time-based buckets, providing query origin metadata and more accurate timing. It’s particularly useful for identifying short-lived query spikes that pg_stat_statements averages away.
Beyond the extensions, a handful of system views cover the rest:
| View | What it reveals |
|---|---|
pg_stat_activity | Active connections, query text, wait events |
pg_stat_replication | Replica lag, WAL send/write/flush positions |
pg_stat_user_tables | Live and dead tuple counts per table, a bloat signal |
pg_stat_bgwriter | Checkpoint frequency, buffer cache hit ratio |
pg_locks | Lock contention between transactions |
A high n_dead_tup relative to n_live_tup in pg_stat_user_tables is the autovacuum-falling-behind signal. A low buffer cache hit ratio in pg_stat_bgwriter means shared_buffers is too small. Replication lag in pg_stat_replication means either the primary is overwhelmed or the replicas can’t keep up. Each metric points at a specific layer of the hierarchy.
How we approach this with clients
When we’re brought in to a system that’s struggling, our diagnostic process maps to the layers above. We start with pg_stat_statements to find the queries that matter, run EXPLAIN ANALYZE on the top ten, and look for the obvious wins (missing indexes, sequential scans, pagination anti-patterns). That alone resolves the bulk of performance complaints we see, often within a day or two of engagement.
If queries are clean and performance is still degraded, we look at server configuration next, because the defaults are wrong on essentially every production database we encounter. From there it’s connection pooling (PgBouncer in transaction mode if it isn’t already there), then read replicas if the workload is read-heavy, then per-table autovacuum tuning on high-churn tables. By the time we’ve worked through those layers, the vast majority of clients are sitting on plenty of headroom for the foreseeable future.
We’ve moved a small number of clients to partitioning, fewer to Citus, and so far none to active-active replication. The tools work fine. The cost of operating them is real, and most workloads don’t justify it. The honest answer for most teams is that PostgreSQL on a properly sized instance with PgBouncer and a read replica or two will get them further than they expect.
The pattern OpenAI demonstrated at hyperscale is the same pattern that works at our clients’ scale. Centralise on a single primary, lean on read replicas aggressively, treat connection pooling and query optimisation as non-negotiable, and reach for distributed systems only when profiling proves you need them. That order is the most reliable path we know to getting more out of PostgreSQL than most teams realise is sitting there.


