Engineering · 2026-05-06

The advisory-lock self-deadlock Postgres can't see

By Michael Cooper · Founder

A two-connection wait cycle that forms through a JavaScript await and stays invisible to PostgreSQL's deadlock detector. The symptom is an idle in transaction backend that hangs until the session timeout fires. Reproduction, real diagnostic output, and the structural fix.

The short version

PostgreSQL's deadlock detector walks the wait-for graph of backends blocked on locks. Advisory locks held across await boundaries can form cycles that route through application code, and those edges the detector cannot see. The symptom is a backend stuck in idle in transaction for the duration of idle_in_transaction_session_timeout.

Tested 2026-05-06 against PostgreSQL 17.9, Node.js 24.14.1, pg (node-postgres) 8.20.0. We hit this in production at AGLedger and shipped a structural fix in agledger-api v0.22.12.

The scenario

An inbound message handler takes a per-tenant advisory lock to serialize work for that tenant. On certain error paths it writes a forensic audit row that must survive the outer transaction's rollback, so the audit helper opens its own pool connection. If that helper also takes the same advisory lock for its own ordering guarantees, the two connections deadlock through application code.

That is the entire setup. Domain detail does not matter; the bug depends only on the lock interaction.

A self-contained reproduction

Save this as repro.mjs and run it against any PostgreSQL 17 instance with pg 8.x installed. It hangs for ten seconds, then exits non-zero. The bounded hang is intentional; there is no graceful path out of the deadlock, and that is itself the lesson.

import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const LOCK_KEY = 12345;

async function auditOnFreshConnection() {
  // A "forensic audit must survive outer rollback" helper.
  // Opens its own connection and takes the same advisory lock
  // for its own coherent ordering.
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    await client.query('SELECT pg_advisory_xact_lock($1)', [LOCK_KEY]);
    // ... write audit row ...
    await client.query('COMMIT');
  } finally {
    client.release();
  }
}

async function handleMessage() {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    await client.query('SELECT pg_advisory_xact_lock($1)', [LOCK_KEY]);
    try {
      throw new Error('validation failed');
    } catch (err) {
      // Try to audit on a fresh connection. Hangs forever:
      // PostgreSQL does not detect this as a deadlock.
      await auditOnFreshConnection();
    }
  } finally {
    await client.query('ROLLBACK').catch(() => {});
    client.release();
  }
}

const timeout = setTimeout(() => {
  console.error('Hung as expected. Inspect pg_locks / pg_stat_activity in another session.');
  process.exit(1);
}, 10_000);

handleMessage().finally(() => clearTimeout(timeout));

The pool now has two live connections: one held by the outer transaction (with the lock granted), one held by the audit helper (waiting on the same lock). Neither will release. The outer handler is suspended at the helper's await, which will never resolve, so the outer transaction never reaches its finally block to roll back.

Diagnostics during the hang

While the script is hung, in a separate psql session against the same database, run two queries.

pg_locks: granted vs. waiting advisory locks

SELECT pid, locktype, classid, objid, granted, mode
FROM pg_locks
WHERE locktype = 'advisory'
ORDER BY granted DESC, pid;

Captured output during the hang:

 pid | locktype | classid | objid | granted |     mode
-----+----------+---------+-------+---------+---------------
  73 | advisory |       0 | 12345 | t       | ExclusiveLock
  74 | advisory |       0 | 12345 | f       | ExclusiveLock

pg_stat_activity: what state is each backend in?

SELECT pid, state, wait_event_type, wait_event, state_change, query
FROM pg_stat_activity
WHERE datname = current_database()
  AND state IS NOT NULL
ORDER BY state_change;

Captured output (trimmed to the relevant rows):

 pid |        state        | wait_event_type | wait_event |   query
-----+---------------------+-----------------+------------+--------------------------------
  73 | idle in transaction | Client          | ClientRead | SELECT pg_advisory_xact_lock($1)
  74 | active              | Lock            | advisory   | SELECT pg_advisory_xact_lock($1)

The asymmetry is the diagnosis. Backend 74 is the helper; it is active with wait_event = advisory. It wants the lock and PostgreSQL knows it. Backend 73 is the outer transaction; it holds the lock, but its state is idle in transaction with wait_event = ClientRead. From PostgreSQL's perspective, backend 73 is not waiting on any lock at all. It is waiting for the next message from its client.

The client is Node, and the handler is suspended on an await that depends on backend 74 finishing. The wait graph has only one edge that PostgreSQL can see (helper → outer). The other edge lives in the JavaScript event loop. The detector finds no cycle and lets the system hang.
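
The same one-sided picture shows up in pg_blocking_pids(). Run it in the same psql session during the hang; the pids here are the ones from the captured output above, so substitute your own:

-- pg_blocking_pids() lists the backends blocking a given pid on a lock.
-- Expect the waiting helper to report the holder's pid and the holder to
-- report an empty array: PostgreSQL sees only one direction of the cycle.
SELECT pid, pg_blocking_pids(pid) AS blocked_by
FROM pg_stat_activity
WHERE pid IN (73, 74);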

Why PostgreSQL can't detect this

From the advisory locks documentation: "Advisory locks can be acquired in either shared or exclusive modes... advisory locks are stored in a shared memory pool whose size is defined by the max_locks_per_transaction and max_connections configuration variables." Advisory locks are ordinary heavyweight locks, so a cycle built entirely out of advisory-lock waits would be detected: the deadlock detector walks the wait-for graph of backends blocked on lock acquisitions, looking for cycles. But both edges of the cycle have to be lock waits, visible in pg_locks with granted = f.

In this bug, one edge of the cycle is a JavaScript await, which is invisible to the detector. From the database's point of view, backend 73 is simply not asking for anything. The detector concludes there is no cycle, and the outer backend stays idle in transaction until the idle_in_transaction_session_timeout fires.

This is not a PostgreSQL bug. The detector is doing exactly what its contract specifies. The pattern is an application-level deadlock, and the application is responsible for not building it.

The structural fix

No timeout adjustment closes the bug. The bug is that the audit ran while the outer transaction held the lock. Once the audit runs after rollback, the cycle cannot form.

The shape of the fix: instead of trying to audit inline, the held-transaction block throws a private error class that carries the audit context. The transaction wrapper rolls back and releases the advisory lock. The outer caller catches the private error, runs the audit on a fresh connection, and re-raises the original cause.

// pseudocode — illustrative shape, not runnable
class AuditableRejection extends Error {
  constructor(
    public readonly auditContext: Record<string, unknown>,
    public readonly cause: Error,
  ) {
    super(cause.message);
  }
}

async function handleMessage(pool: Pool, input: Input) {
  try {
    await runInTransaction(pool, async (client) => {
      await client.query('SELECT pg_advisory_xact_lock($1)', [lockKey]);
      // ... work ...
      if (!validate(input)) {
        throw new AuditableRejection(
          { reason: 'invalid' },
          new ValidationError('input failed schema'),
        );
      }
    });
  } catch (err) {
    if (err instanceof AuditableRejection) {
      // Outer tx has rolled back. Lock is released. Safe to audit.
      await auditOnFreshConnection(pool, err.auditContext);
      throw err.cause;
    }
    throw err;
  }
}

The transaction wrapper's contract is the load-bearing piece. When the inner callback throws, the wrapper guarantees rollback and connection release before the throw propagates. Whatever runs after the catch sees a clean pool with no advisory locks held by the original caller.
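
For concreteness, a minimal sketch of a wrapper with that contract, matching the runInTransaction shape used above (assuming node-postgres with @types/pg; not the exact helper we ship):

import type { Pool, PoolClient } from 'pg';

// Transaction wrapper with the contract described above: if the callback
// throws, roll back and release the connection before the error propagates,
// so nothing downstream runs while the advisory lock is still held.
async function runInTransaction<T>(
  pool: Pool,
  fn: (client: PoolClient) => Promise<T>,
): Promise<T> {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    const result = await fn(client);
    await client.query('COMMIT');
    return result;
  } catch (err) {
    // ROLLBACK releases pg_advisory_xact_lock along with every other lock.
    await client.query('ROLLBACK').catch(() => {});
    throw err;
  } finally {
    client.release();
  }
}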

A second mitigation, with its tradeoff

You can pass the held client through to the audit helper so it does not open a fresh connection. That avoids the second lock acquisition entirely. The tradeoff is that the audit row now participates in the outer transaction's rollback. If the whole point of the audit is "must survive when the outer transaction rolls back," this mitigation defeats the audit requirement. Use it only when the audit is part of the same logical unit of work.
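
The shape of that variant, as an illustrative sketch (the audit_log table and its context column are hypothetical):

import type { PoolClient } from 'pg';

// Variant: reuse the caller's client instead of opening a fresh connection.
// No second lock acquisition, so the cycle cannot form. The tradeoff: the
// row is written inside the outer transaction and vanishes if it rolls back.
async function auditOnHeldClient(
  client: PoolClient,
  auditContext: Record<string, unknown>,
): Promise<void> {
  await client.query(
    'INSERT INTO audit_log (context) VALUES ($1::jsonb)', // hypothetical table
    [JSON.stringify(auditContext)],
  );
}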

Defense-in-depth: bounding the blast radius

We pin idle_in_transaction_session_timeout to thirty seconds on application connections. It bounds the blast radius if the bug ever recurs in a path we have not audited. Once the structural pattern is in place, the timeout is a backstop, not a fix. Treat it that way; tuning the GUC without addressing the lock pattern is treating a symptom and calling it cured.
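
One way to pin it from the application side, sketched with node-postgres (it can equally be set per role or in postgresql.conf; thirty seconds is the value we use, not a recommendation):

import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// The pool emits 'connect' for every new physical connection; node-postgres
// documents this as the place to run per-connection setup commands.
pool.on('connect', (client) => {
  client
    .query("SET idle_in_transaction_session_timeout = '30s'")
    .catch((err) => console.error('failed to set session timeout', err));
});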

When this matters

The bug only fires when three conditions overlap: an advisory lock held inside a transaction, an await on application code while the lock is held, and that awaited code taking the same lock on a fresh pool connection. Any one of these alone is fine. The combination is what creates the invisible edge.

If you do not use advisory locks, you cannot hit this. If your advisory-lock callsites are leaf operations with no nested await on database work, you cannot hit this. We hit it because two independent invariants (per-tenant serialization and audit-survives-rollback) collided through a pool connection neither author had reason to think about.

Related

Earlier in this Postgres-at-scale series: pg-boss in production: footguns we hit and how to avoid them — four operational gotchas in a Postgres-backed job queue.

Sources and further reading

PostgreSQL documentation: Explicit Locking (Advisory Locks), the passage quoted above; and the idle_in_transaction_session_timeout configuration parameter.

AGLedger is a self-hosted cryptographic notary for automated work; every record is hash-chained and Ed25519-signed. We hit this bug operating Postgres at scale; the fix shipped in v0.22.12.