Blog · 2026-04-18 · Engineering

Cutting PostgreSQL Audit-Report Query Time 44% with GROUPING SETS and Materialized CTEs

One endpoint was burning 70 seconds of database CPU every five minutes. Two PostgreSQL features most people already have cut it to 39. Here is the anti-pattern, the rewrite, the measurements, and the caveat nobody writes about.

The numbers up front

Same endpoint, same stress harness (20 concurrent readers, 5 minutes, 2,260 mandates on Aurora PostgreSQL 17.7 Serverless v2), two consecutive releases.

Metric                    Before      After       Delta
Queries per request       6           2           −4
Total DB time             69,892 ms   38,846 ms   −44.4%
Mean DB time per call     15.0 ms     8.0 ms      −47%
Suite success rate        92.9%       100%        +7.1 pp

Source: pg_stat_statements, 5-minute stress window, agledger-testbed perf-snapshots v0.19.8 vs v0.19.9.

The situation

AGLedger runs a compliance report endpoint that surfaces a dozen metrics for a given enterprise over a date range — totals, fulfillment rates, verification pass rates, contract type breakdowns, status distributions. The shape of the response hadn’t changed in months, and at the scale of our earliest customers it was fast and boring.

Under the audit stress suite it stopped being boring. Twenty concurrent readers against 2,260 mandates over five minutes pushed Aurora Serverless v2 to its 4 ACU cap, held it at 100% CPU for four straight minutes, and turned that one endpoint into the top consumer of database time in the whole system. Other endpoints started timing out. The suite failed at 92.9%.

When we looked in pg_stat_statements, the cause was obvious: one request, six queries, each of them re-scanning the same enterprise-scoped mandate set.

The anti-pattern

The shape is common in compliance and dashboard code. Each metric is its own clean little function. Each function returns a number. Locally, each query is fine. Reviewers like it:

-- Metric 1: how many mandates?
SELECT COUNT(*) FROM mandates
WHERE enterprise_id = $1 AND created_at BETWEEN $2 AND $3;

-- Metric 2: how many are FULFILLED?
SELECT COUNT(*) FROM mandates
WHERE enterprise_id = $1 AND created_at BETWEEN $2 AND $3
  AND status = 'FULFILLED';

-- Metric 3: breakdown by contract type
SELECT contract_type, COUNT(*) FROM mandates
WHERE enterprise_id = $1 AND created_at BETWEEN $2 AND $3
GROUP BY contract_type;

-- ...and four more siblings, each re-scanning the same filtered set.
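The cost model is easy to see without a database. A toy Python sketch of the same shape (all names here are illustrative, not AGLedger code): each metric helper triggers its own pass over the scoped set, while materializing the filtered set once yields the same answers from a single pass.

```python
# Toy model of the anti-pattern: each metric helper re-runs the same
# enterprise-scoped scan, so N metrics cost N passes over the table.
from dataclasses import dataclass

@dataclass
class Mandate:
    status: str
    contract_type: str

rows = [
    Mandate("FULFILLED", "SERVICE"),
    Mandate("ACTIVE", "SALES"),
    Mandate("FULFILLED", "SALES"),
]
scans = 0

def scoped():
    """Stand-in for the enterprise/date-range filtered query."""
    global scans
    scans += 1          # count one full pass over the mandate set
    return rows

# Anti-pattern: one pass per metric.
total = len(scoped())
fulfilled = sum(1 for m in scoped() if m.status == "FULFILLED")
by_type = {}
for m in scoped():
    by_type[m.contract_type] = by_type.get(m.contract_type, 0) + 1
scans_sibling_queries = scans    # 3 metrics -> 3 passes (the endpoint did 6)

# Rewrite: materialize once, aggregate many times from the same pass.
scans = 0
m_scoped = scoped()
total2 = len(m_scoped)
fulfilled2 = sum(1 for m in m_scoped if m.status == "FULFILLED")
by_type2 = {}
for m in m_scoped:
    by_type2[m.contract_type] = by_type2.get(m.contract_type, 0) + 1
scans_materialized = scans       # 1 pass, same answers

print(scans_sibling_queries, scans_materialized)
```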

At 10 mandates and one concurrent reader this is microseconds. At 2,000 mandates and 20 concurrent readers on Aurora Serverless v2, every call stacks six passes over the same set. Here is what pg_stat_statements actually reported for the five-minute window, each row representing 4,652 calls:

Aggregation                   Total ms   Mean ms   Max ms
Scoped mandate rollup         29,991.8   6.45      199.5
Verification pass-rate        13,157.0   2.83      142.7
Fulfillment-rate              9,949.5    2.14      88.6
Contract-type counts          6,542.9    1.41      48.2
Contract-type + shadow CTE    6,138.8    1.32      50.3
Status counts                 4,112.3    0.88      76.8

Six queries, every request, each re-scanning the same enterprise-scoped set.
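A quick sanity check connecting the table to the headline figure: the six per-query totals sum to the endpoint's total DB time.

```python
# The six per-query totals (ms) from the v0.19.8 window above.
totals_ms = [29_991.8, 13_157.0, 9_949.5, 6_542.9, 6_138.8, 4_112.3]
endpoint_total_ms = sum(totals_ms)
print(round(endpoint_total_ms))  # -> 69892, the "before" total
```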

The rewrite

Two PostgreSQL features, both available on any modern version (9.5+ for GROUPING SETS, 12+ for the MATERIALIZED keyword), do the whole job.

1. WITH … AS MATERIALIZED — compute the filtered set once

Since PostgreSQL 12, CTEs are inlined by default — the optimizer rewrites them as subqueries. That is usually what you want. But when six downstream queries all filter on the same conditions, you want the planner to compute the subset once and reuse it. The MATERIALIZED keyword is the explicit opt-in:

WITH m_scoped AS MATERIALIZED (
  SELECT id, agent_id, status, contract_type, created_at, fulfilled_at
  FROM mandates
  WHERE enterprise_id = $1
    AND ($2::timestamptz IS NULL OR created_at >= $2)
    AND ($3::timestamptz IS NULL OR created_at <= $3)
)
...

One scan, filter once, hand the subset to every aggregate that follows.

2. GROUPING SETS + COUNT(*) FILTER — many rollups, one pass

GROUPING SETS is a SQL standard feature that computes several groupings in a single query. Pair it with COUNT(*) FILTER (WHERE …) for conditional aggregation and you replace a stack of siblings with one scan:

SELECT
  grouping(status)::int        AS gs,
  grouping(contract_type)::int AS gc,
  status,
  contract_type,
  COUNT(*)::int                                       AS total,
  COUNT(*) FILTER (WHERE status = 'FULFILLED')::int   AS fulfilled,
  COUNT(*) FILTER (WHERE status = 'ACTIVE')::int      AS active,
  ROUND(AVG(EXTRACT(EPOCH FROM fulfilled_at - created_at)))::int AS avg_latency_s
FROM m_scoped
GROUP BY GROUPING SETS (
  (),                         -- grand total
  (status),                   -- by status
  (contract_type),            -- by contract type
  (status, contract_type)     -- by both
);

The grouping() scalar tells you which rollup each row came from: 0 means the column was part of the grouping set, 1 means it was rolled up. The application joins rows to metrics by the (gs, gc) pattern — (1, 1) is the grand total row, (0, 1) is by status, and so on.
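On the application side, that dispatch is a small switch on the (gs, gc) pair. A minimal sketch — the row tuples and metric names are illustrative, not AGLedger's actual response model:

```python
# Rows as the GROUPING SETS query would return them:
# (gs, gc, status, contract_type, total, fulfilled, active).
# gs = grouping(status), gc = grouping(contract_type):
# 0 means the column was grouped, 1 means it was rolled up.
rows = [
    (1, 1, None, None, 3, 2, 1),           # () -> grand total
    (0, 1, "FULFILLED", None, 2, 2, 0),    # (status)
    (0, 1, "ACTIVE", None, 1, 0, 1),
    (1, 0, None, "SALES", 2, 1, 1),        # (contract_type)
    (1, 0, None, "SERVICE", 1, 1, 0),
]

report = {"by_status": {}, "by_contract_type": {}}
for gs, gc, status, ctype, total, fulfilled, active in rows:
    if (gs, gc) == (1, 1):                 # grand-total row
        report["total"] = total
        report["fulfilled"] = fulfilled
    elif (gs, gc) == (0, 1):               # one row per status
        report["by_status"][status] = total
    elif (gs, gc) == (1, 0):               # one row per contract type
        report["by_contract_type"][ctype] = total
    # (0, 0) rows -- the (status, contract_type) set -- would slot in here.

print(report)
```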

Three under-appreciated facts worth keeping:

  • FILTER (WHERE …) is equivalent to COUNT(CASE WHEN … THEN 1 END), but the planner handles FILTER natively — it is cleaner and occasionally faster, because the optimizer can push predicates without unpacking the CASE.
  • GROUPING SETS of the form ((a), (b), (a, b)) is equivalent to CUBE(a, b) minus the grand-total row. ROLLUP(a, b) gives (a, b), (a), and (). Learn all three — they cover almost every real reporting query.
  • GROUPING SETS and FILTER work on PostgreSQL 9.5+ (the explicit MATERIALIZED keyword needs 12+), and all of it is fully supported on Aurora. You probably already have it.
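The equivalences in the second bullet are easy to hold in your head as set algebra. A small pure-Python sketch (no database needed) that enumerates what each construct expands to:

```python
from itertools import combinations

def cube(*cols):
    # CUBE(a, b, ...) expands to every subset of the columns, including ().
    return [tuple(c) for r in range(len(cols) + 1)
            for c in combinations(cols, r)]

def rollup(*cols):
    # ROLLUP(a, b, ...) expands to the prefixes, longest first, down to ().
    return [tuple(cols[:n]) for n in range(len(cols), -1, -1)]

print(cube("status", "contract_type"))
# CUBE minus the grand-total row () is exactly the three-set form
# ((status), (contract_type), (status, contract_type)).
print(rollup("status", "contract_type"))
```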

After the rewrite

Two queries per request. Same answers. Here is what pg_stat_statements reported for the v0.19.9 window (4,830 calls each):

Query                        Total ms   Mean ms   Max ms
m_scoped MATERIALIZED CTE    21,863.6   4.53      112.5
GROUPING SETS rollup         16,981.9   3.52      112.9

Total DB time for the endpoint: 38,846 ms. Down from 69,892 ms. The audit stress suite went from 92.9% to 100% success — the downstream timeouts that had been failing other endpoints were a symptom of Aurora pegged at its ACU cap, not bugs in those endpoints.
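The same arithmetic on the after table checks out against the headline numbers: the two rows sum to the new total, and the before/after deltas match the table at the top of the post.

```python
# Before/after totals from the two pg_stat_statements tables.
before_ms = 29_991.8 + 13_157.0 + 9_949.5 + 6_542.9 + 6_138.8 + 4_112.3
after_ms = 21_863.6 + 16_981.9
reduction_pct = (before_ms - after_ms) / before_ms * 100
print(round(after_ms, 1))        # -> 38845.5, reported as 38,846 ms
print(round(reduction_pct, 1))   # -> 44.4, the headline delta
# Mean DB time per call: total over the window / number of calls.
print(round(before_ms / 4_652, 1), round(after_ms / 4_830, 1))  # -> 15.0 8.0
```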

The honest caveat

Per-request p50 latency went from ~111 ms to ~124 ms. On paper, it got slightly worse. Three reasons the wall-clock didn’t move the way you might expect:

Aurora Serverless v2 autoscales to meet demand. In v0.19.8 the database was already at the ACU cap and CPU-bound, so additional queries just stacked. In v0.19.9 the CPU ceiling was never hit. The savings show up as lower ACU consumption — lower dollars per report — and headroom for other endpoints, not as faster per-request response at the same load.

An adjacent v0.19.9 fix changed what “free” slots counted for. A separate endpoint had been returning 400s for 100% of calls in under 5 ms each — effectively free request slots that biased every other endpoint’s p50 down. Fixing that endpoint meant those slots now did real ~100 ms work, so everything else’s observed p50 nudged up 10–20 ms. That is a pricing correction, not a regression.

The right metric for a query-shape improvement is DB CPU-time per call. That dropped 47%. Wall-clock is a noisy proxy when the database is elastic — the platform turns saved CPU into less scaling, not faster responses. Read the right signal.

Measurement recipe

The reason we caught this at all was pg_stat_statements. Top 20 queries by total exec time is the one-liner every PostgreSQL operator should know:

SELECT calls,
       round(total_exec_time::numeric, 1)  AS total_ms,
       round(mean_exec_time::numeric, 2)   AS mean_ms,
       round(max_exec_time::numeric, 1)    AS max_ms,
       left(regexp_replace(query, '\s+', ' ', 'g'), 200) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

On Aurora, pg_stat_statements ships in shared_preload_libraries by default but must be CREATE EXTENSION'd in each new database. For ephemeral test databases, creating the extension fresh is sufficient — the reset function needs elevated permissions anyway.

Takeaways

  • If a reporting endpoint fires N queries per call and they all filter on the same scope, hoist the scope into a WITH … AS MATERIALIZED CTE and fold the aggregations into one pass.
  • Reach for GROUPING SETS before you invent an aggregation framework in application code. The database has already solved this.
  • Measure with pg_stat_statements by total exec time, not just call count. The six-query anti-pattern shows up as a cluster of sibling queries with near-identical call counts.
  • Benchmark under concurrent load on elastic database infrastructure. A sequential benchmark will hide the win — the gain shows up as CPU-time saved, which Aurora turns into less scaling.
