Blog · 2026-04-18 · Engineering

Cutting PostgreSQL Audit-Report Query Time 44% with GROUPING SETS and Materialized CTEs

One endpoint was burning 70 seconds of database CPU every five minutes. Two PostgreSQL features most people already have cut it to 39. Here is the anti-pattern, the rewrite, the measurements, and the caveat nobody writes about.

The numbers up front

Same endpoint, same stress harness (20 concurrent readers, 5 minutes, 2,260 mandates on Aurora PostgreSQL 17.7 Serverless v2), two consecutive releases.

Metric                    Before      After       Delta
Queries per request       6           2           −4
Total DB time             69,892 ms   38,846 ms   −44.4%
Mean DB time per call     15.0 ms     8.0 ms      −47%
Suite success rate        92.9%       100%        +7.1 pp

Source: pg_stat_statements, 5-minute stress window, agledger-testbed perf-snapshots v0.19.8 vs v0.19.9.

The situation

AGLedger runs a compliance report endpoint that surfaces a dozen metrics for a given enterprise over a date range — totals, fulfillment rates, verification pass rates, contract type breakdowns, status distributions. The shape of the response hadn’t changed in months, and at the scale of our earliest customers it was fast and boring.

Under the audit stress suite it stopped being boring. Twenty concurrent readers against 2,260 mandates over five minutes pushed Aurora Serverless v2 to its 4 ACU cap, held it at 100% CPU for four straight minutes, and turned that one endpoint into the top consumer of database time in the whole system. Other endpoints started timing out. The suite failed at 92.9%.

When we looked in pg_stat_statements, the cause was obvious: one request, six queries, each of them re-scanning the same enterprise-scoped mandate set.

The anti-pattern

The shape is common in compliance and dashboard code. Each metric is its own clean little function. Each function returns a number. Locally, each query is fine. Reviewers like it:

-- Metric 1: how many mandates?
SELECT COUNT(*) FROM mandates
WHERE enterprise_id = $1 AND created_at BETWEEN $2 AND $3;

-- Metric 2: how many are FULFILLED?
SELECT COUNT(*) FROM mandates
WHERE enterprise_id = $1 AND created_at BETWEEN $2 AND $3
  AND status = 'FULFILLED';

-- Metric 3: breakdown by contract type
SELECT contract_type, COUNT(*) FROM mandates
WHERE enterprise_id = $1 AND created_at BETWEEN $2 AND $3
GROUP BY contract_type;

-- ...and four more siblings, each re-scanning the same filtered set.
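The cost model is easy to see without a database. A toy Python sketch of the same shape (all names here are illustrative, not AGLedger code): each metric helper triggers its own pass over the scoped set, while materializing the filtered set once yields the same answers from a single pass.

```python
# Toy model of the anti-pattern: each metric helper re-runs the same
# enterprise-scoped scan, so N metrics cost N passes over the table.
from dataclasses import dataclass

@dataclass
class Mandate:
    status: str
    contract_type: str

rows = [
    Mandate("FULFILLED", "SERVICE"),
    Mandate("ACTIVE", "SALES"),
    Mandate("FULFILLED", "SALES"),
]
scans = 0

def scoped():
    """Stand-in for the enterprise/date-range filtered query."""
    global scans
    scans += 1          # count one full pass over the mandate set
    return rows

# Anti-pattern: one pass per metric.
total = len(scoped())
fulfilled = sum(1 for m in scoped() if m.status == "FULFILLED")
by_type = {}
for m in scoped():
    by_type[m.contract_type] = by_type.get(m.contract_type, 0) + 1
scans_sibling_queries = scans    # 3 metrics -> 3 passes (the endpoint did 6)

# Rewrite: materialize once, aggregate many times from the same pass.
scans = 0
m_scoped = scoped()
total2 = len(m_scoped)
fulfilled2 = sum(1 for m in m_scoped if m.status == "FULFILLED")
by_type2 = {}
for m in m_scoped:
    by_type2[m.contract_type] = by_type2.get(m.contract_type, 0) + 1
scans_materialized = scans       # 1 pass, same answers

print(scans_sibling_queries, scans_materialized)
```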

At 10 mandates and one concurrent reader this is microseconds. At 2,000 mandates and 20 concurrent readers on Aurora Serverless v2, every call stacks six passes over the same set. Here is what pg_stat_statements actually reported for the five-minute window, each row representing 4,652 calls:

Aggregation                   Total ms   Mean ms   Max ms
Scoped mandate rollup         29,991.8   6.45      199.5
Verification pass-rate        13,157.0   2.83      142.7
Fulfillment-rate              9,949.5    2.14      88.6
Contract-type counts          6,542.9    1.41      48.2
Contract-type + shadow CTE    6,138.8    1.32      50.3
Status counts                 4,112.3    0.88      76.8

Six queries, every request, each re-scanning the same enterprise-scoped set.
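A quick sanity check connecting the table to the headline figure: the six per-query totals sum to the endpoint's total DB time.

```python
# The six per-query totals (ms) from the v0.19.8 window above.
totals_ms = [29_991.8, 13_157.0, 9_949.5, 6_542.9, 6_138.8, 4_112.3]
endpoint_total_ms = sum(totals_ms)
print(round(endpoint_total_ms))  # -> 69892, the "before" total
```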

The rewrite

Two PostgreSQL features, both available on any modern version (9.5+ for GROUPING SETS, 12+ for the MATERIALIZED keyword), do the whole job.

1. WITH … AS MATERIALIZED — compute the filtered set once

Since PostgreSQL 12, CTEs are inlined by default — the optimizer rewrites them as subqueries. That is usually what you want. But when six downstream queries all filter on the same conditions, you want the planner to compute the subset once and reuse it. The MATERIALIZED keyword is the explicit opt-in:

WITH m_scoped AS MATERIALIZED (
  SELECT id, agent_id, status, contract_type, created_at, fulfilled_at
  FROM mandates
  WHERE enterprise_id = $1
    AND ($2::timestamptz IS NULL OR created_at >= $2)
    AND ($3::timestamptz IS NULL OR created_at <= $3)
)
...

One scan, filter once, hand the subset to every aggregate that follows.

2. GROUPING SETS + COUNT(*) FILTER — many rollups, one pass

GROUPING SETS is a SQL standard feature that computes several groupings in a single query. Pair it with COUNT(*) FILTER (WHERE …) for conditional aggregation and you replace a stack of siblings with one scan:

SELECT
  grouping(status)::int        AS gs,
  grouping(contract_type)::int AS gc,
  status,
  contract_type,
  COUNT(*)::int                                       AS total,
  COUNT(*) FILTER (WHERE status = 'FULFILLED')::int   AS fulfilled,
  COUNT(*) FILTER (WHERE status = 'ACTIVE')::int      AS active,
  ROUND(AVG(EXTRACT(EPOCH FROM fulfilled_at - created_at)))::int AS avg_latency_s
FROM m_scoped
GROUP BY GROUPING SETS (
  (),                         -- grand total
  (status),                   -- by status
  (contract_type),            -- by contract type
  (status, contract_type)     -- by both
);

The grouping() scalar tells you which rollup each row came from: 0 means the column was part of the grouping set, 1 means it was rolled up. The application joins rows to metrics by the (gs, gc) pattern — (1, 1) is the grand total row, (0, 1) is by status, and so on.
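On the application side, that dispatch is a small switch on the (gs, gc) pair. A minimal sketch — the row tuples and metric names are illustrative, not AGLedger's actual response model:

```python
# Rows as the GROUPING SETS query would return them:
# (gs, gc, status, contract_type, total, fulfilled, active).
# gs = grouping(status), gc = grouping(contract_type):
# 0 means the column was grouped, 1 means it was rolled up.
rows = [
    (1, 1, None, None, 3, 2, 1),           # () -> grand total
    (0, 1, "FULFILLED", None, 2, 2, 0),    # (status)
    (0, 1, "ACTIVE", None, 1, 0, 1),
    (1, 0, None, "SALES", 2, 1, 1),        # (contract_type)
    (1, 0, None, "SERVICE", 1, 1, 0),
]

report = {"by_status": {}, "by_contract_type": {}}
for gs, gc, status, ctype, total, fulfilled, active in rows:
    if (gs, gc) == (1, 1):                 # grand-total row
        report["total"] = total
        report["fulfilled"] = fulfilled
    elif (gs, gc) == (0, 1):               # one row per status
        report["by_status"][status] = total
    elif (gs, gc) == (1, 0):               # one row per contract type
        report["by_contract_type"][ctype] = total
    # (0, 0) rows -- the (status, contract_type) set -- would slot in here.

print(report)
```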

Three under-appreciated facts worth keeping:

  • FILTER (WHERE …) is equivalent to COUNT(CASE WHEN … THEN 1 END), but the planner handles FILTER natively — it is cleaner and occasionally faster, because the optimizer can push predicates without unpacking the CASE.
  • GROUPING SETS of the form ((a), (b), (a, b)) is equivalent to CUBE(a, b) minus the grand-total row. ROLLUP(a, b) gives (a, b), (a), and (). Learn all three — they cover almost every real reporting query.
  • GROUPING SETS and FILTER work on PostgreSQL 9.5+ (the explicit MATERIALIZED keyword needs 12+), and all of it is fully supported on Aurora. You probably already have it.
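The equivalences in the second bullet are easy to hold in your head as set algebra. A small pure-Python sketch (no database needed) that enumerates what each construct expands to:

```python
from itertools import combinations

def cube(*cols):
    # CUBE(a, b, ...) expands to every subset of the columns, including ().
    return [tuple(c) for r in range(len(cols) + 1)
            for c in combinations(cols, r)]

def rollup(*cols):
    # ROLLUP(a, b, ...) expands to the prefixes, longest first, down to ().
    return [tuple(cols[:n]) for n in range(len(cols), -1, -1)]

print(cube("status", "contract_type"))
# CUBE minus the grand-total row () is exactly the three-set form
# ((status), (contract_type), (status, contract_type)).
print(rollup("status", "contract_type"))
```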

After the rewrite

Two queries per request. Same answers. Here is what pg_stat_statements reported for the v0.19.9 window (4,830 calls each):

Query                        Total ms   Mean ms   Max ms
m_scoped MATERIALIZED CTE    21,863.6   4.53      112.5
GROUPING SETS rollup         16,981.9   3.52      112.9

Total DB time for the endpoint: 38,846 ms. Down from 69,892 ms. The audit stress suite went from 92.9% to 100% success — the downstream timeouts that had been failing other endpoints were a symptom of Aurora pegged at its ACU cap, not bugs in those endpoints.
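The same arithmetic on the after table checks out against the headline numbers: the two rows sum to the new total, and the before/after deltas match the table at the top of the post.

```python
# Before/after totals from the two pg_stat_statements tables.
before_ms = 29_991.8 + 13_157.0 + 9_949.5 + 6_542.9 + 6_138.8 + 4_112.3
after_ms = 21_863.6 + 16_981.9
reduction_pct = (before_ms - after_ms) / before_ms * 100
print(round(after_ms, 1))        # -> 38845.5, reported as 38,846 ms
print(round(reduction_pct, 1))   # -> 44.4, the headline delta
# Mean DB time per call: total over the window / number of calls.
print(round(before_ms / 4_652, 1), round(after_ms / 4_830, 1))  # -> 15.0 8.0
```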

The honest caveat

Per-request p50 latency went from ~111 ms to ~124 ms. On paper, it got slightly worse. Three reasons the wall-clock didn’t move the way you might expect:

Aurora Serverless v2 autoscales to meet demand. In v0.19.8 the database was already at the ACU cap and CPU-bound, so additional queries just stacked. In v0.19.9 the CPU ceiling was never hit. The savings show up as lower ACU consumption — lower dollars per report — and headroom for other endpoints, not as faster per-request response at the same load.

An adjacent v0.19.9 fix changed what “free” slots counted for. A separate endpoint had been returning 400s for 100% of calls in under 5 ms each — effectively free request slots that biased every other endpoint’s p50 down. Fixing that endpoint meant those slots now did real ~100 ms work, so everything else’s observed p50 nudged up 10–20 ms. That is a pricing correction, not a regression.

The right metric for a query-shape improvement is DB CPU-time per call. That dropped 47%. Wall-clock is a noisy proxy when the database is elastic — the platform turns saved CPU into less scaling, not faster responses. Read the right signal.

Measurement recipe

The reason we caught this at all was pg_stat_statements. Top 20 queries by total exec time is the one-liner every PostgreSQL operator should know:

SELECT calls,
       round(total_exec_time::numeric, 1)  AS total_ms,
       round(mean_exec_time::numeric, 2)   AS mean_ms,
       round(max_exec_time::numeric, 1)    AS max_ms,
       left(regexp_replace(query, '\s+', ' ', 'g'), 200) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

On Aurora, pg_stat_statements ships in shared_preload_libraries by default but must be CREATE EXTENSION'd in each new database. For ephemeral test databases, creating the extension fresh is sufficient — the reset function needs elevated permissions anyway.

Takeaways

  • If a reporting endpoint fires N queries per call and they all filter on the same scope, hoist the scope into a WITH … AS MATERIALIZED CTE and fold the aggregations into one pass.
  • Reach for GROUPING SETS before you invent an aggregation framework in application code. The database has already solved this.
  • Measure with pg_stat_statements by total exec time, not just call count. The six-query anti-pattern shows up as a cluster of sibling queries with near-identical call counts.
  • Benchmark under concurrent load on elastic database infrastructure. A sequential benchmark will hide the win — the gain shows up as CPU-time saved, which Aurora turns into less scaling.
