
Delayed Start Compute Operations – Triggering Event

This post is part of a series discussing the Neon outages on 2025-05-16 and 2025-05-19 in the AWS us-east-1 region. In this post, we cover the triggering cause of the outage: a change in a Postgres execution plan that ultimately prevented idle Computes from being suspended.

For further details, read the top-level Post-Mortem.

Summary 

The Neon Control Plane service is backed by a Postgres database. A scheduled job in the Control Plane, the Activity Monitor, is responsible for identifying Computes that are ready to be suspended. A Postgres query executed by this job against the region’s control plane database changed its execution plan.

A different execution plan can change query performance dramatically – in this case, the planner chose a suboptimal index – making the query run slower and consume far more resources than before. This led to CPU saturation on the database, leaving the Activity Monitor unable to suspend Compute VMs and ultimately resulting in a significant increase in concurrently running Computes: above the cluster’s planned capacity of 6,000, but below the tested ceiling of 10,000 VMs.

This resulted in IP allocation failures, the specifics of which are described in the top-level Post-Mortem.

Architecture

Behind the Neon Console is the Neon Control Plane service. It is a regional Golang service that orchestrates many supporting functions, but primarily it starts and suspends Neon Postgres VMs, which we call “Computes”. Neon’s Serverless architecture requires the Control Plane to start Computes when customers execute SQL statements, and to suspend them if no statements are executed for a (configurable) period of time. We call this functionality scale-to-zero.

The mechanism that suspends Computes is called the “Activity Monitor”, a scheduled job running inside the Neon Control Plane that evaluates running Computes and identifies those that can be suspended.

Chain of events

The Activity Monitor relies on a Postgres query that contains several joins, including one against the computes table. The computes table contains a row for every Compute that has been started or made idle.
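
For illustration only, a check of this kind might look roughly like the sketch below. The actual query and schema are internal and more involved; every identifier here other than the computes table is an assumption.

```sql
-- Illustrative sketch only: identifiers other than the computes table
-- (endpoints, status, last_active_at, ...) are assumptions, not Neon's schema.
SELECT c.id, e.id AS endpoint_id
FROM computes AS c
JOIN endpoints AS e ON e.compute_id = c.id
WHERE c.status = 'running'                              -- only Computes that are currently running
  AND c.last_active_at < now() - interval '5 minutes';  -- idle beyond the suspend timeout
```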

The triggering event for the 2025-05-16 outage was a change in the Postgres execution plan for the queries made by the Activity Monitor. This led to a broad scan of the computes table, causing CPU saturation on the control plane’s backing Postgres database.

Due to the high CPU usage, all fetchEndpointWithCompute queries’ execution times increased from a few hundred milliseconds to over 100 seconds.  In turn, this resulted in the control plane becoming unable to suspend Compute VMs, which led to an increase in the number of concurrently running VMs, ultimately resulting in IP exhaustion inside the cluster.

The computes table has tens of millions of rows, as it holds a recent history of all start and suspend events of Neon Computes. However, the number of active Computes (rows) at any given moment is much lower.

When the execution plan uses the correct indexes, this table can be queried in milliseconds. If the wrong index is selected, however, performance falls off a cliff: the query scans millions of pages to return a couple of thousand rows.
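
One way to see which plan the planner chose, and how much work it actually performed, is EXPLAIN (ANALYZE, BUFFERS). Applied to the illustrative query above (again with assumed column names), the question is whether the plan filters through a selective index or walks a much larger one and discards most of what it reads.

```sql
-- Inspect the chosen plan and the number of buffers it actually reads.
-- A healthy plan touches thousands of pages; a regressed one can touch millions.
EXPLAIN (ANALYZE, BUFFERS)
SELECT c.id
FROM computes AS c
WHERE c.status = 'running'
  AND c.last_active_at < now() - interval '5 minutes';
```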

This change in execution plan did not happen uniformly across all regions: the Control Plane and its backing databases are region-isolated for resilience, so each database’s planner independently decides the most efficient execution plan for each statement. The us-east-1 region was the first to be affected by this plan change. In parallel with mitigating the outage in that region, we have taken steps to prevent the same failure pattern from occurring in other regions.

During the incident, and in other regions afterwards:

  • We ran ANALYZE computes, which refreshed the statistics and brought the execution plan back to normal.
  • We refactored fetchEndpointWithCompute to wrap the potentially expensive sub-query in a materialized CTE (a minimal sketch of the pattern is shown below). The CTE acts as an optimization fence, causing the planner to evaluate it once, use an appropriate index, and return a small set of relevant rows. This prevents the planner from occasionally choosing a suboptimal index and causing poor performance.
By using a materialized CTE, we are seeing more stable results from the Postgres query planner.
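
In Postgres, a CTE declared AS MATERIALIZED is planned and evaluated once, and only its result feeds the rest of the query, which is what makes it act as an optimization fence. Below is a minimal sketch of both mitigations, using the same assumed identifiers as earlier rather than the real fetchEndpointWithCompute query.

```sql
-- Refresh planner statistics for the large table (the immediate mitigation).
ANALYZE computes;

-- Optimization fence: the materialized CTE is planned and evaluated once,
-- using a selective index over running Computes, and only its small result
-- set participates in the outer joins. Identifiers other than "computes"
-- are assumptions for this sketch.
WITH running_computes AS MATERIALIZED (
    SELECT c.id
    FROM computes AS c
    WHERE c.status = 'running'          -- small, selective subset of a very large table
)
SELECT e.*, rc.id AS compute_id
FROM running_computes AS rc
JOIN endpoints AS e ON e.compute_id = rc.id;
```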

What other improvements have we considered?

  • We’re implementing a more active Garbage Collection strategy to make future regressions of this query plan less expensive (see the sketch after this list). We expect this change to reduce the number of rows in this table by an order of magnitude (from tens of millions to millions of rows), shrinking total index size and keeping dead tuples in check by allowing AUTOVACUUM to run more often.
  • In the medium term, we intend to move all historical data out of the ‘hot’ control plane OLTP backing database into an OLAP system.
  • We have considered moving the Activity Monitor to read from a Postgres read replica. Unfortunately, this is not a viable option, because it needs to read consistent (not stale) results and make transactional writes.
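
As a rough sketch of the Garbage Collection approach from the first bullet, a scheduled job could delete historical rows beyond a retention window in bounded batches; the retention window, batch size, and column name below are assumptions, not final values.

```sql
-- Illustrative batch delete of historical start/suspend rows; the retention
-- window, batch size, and column name are assumptions for this sketch.
DELETE FROM computes
WHERE id IN (
    SELECT id
    FROM computes
    WHERE suspended_at < now() - interval '30 days'   -- keep only a recent history
    ORDER BY suspended_at
    LIMIT 10000                                        -- bounded batches run by a scheduled job
);
```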

Next steps

One item we’re currently working on is understanding the exact conditions that lead to a change in execution plan.  We observed different behaviours by region, depending on the shape of the computes table in each backing database.

We will publish further findings once we have concluded this investigation.