I worked for a few years on a product whose page load required a long round-trip before rendering could begin. This caused high tail latency on page loads and a lot of non-actionable on-call pages (1-4 pages a month, 30% of the pages in my rotation).

Loading a page ended up looking like this:

```mermaid
sequenceDiagram
    participant Client
    participant api_resolver
    participant CapabilitiesDB as capabilities (Postgres Table)
    participant new_database_service
    participant old_database_service

    Client->>api_resolver: Request capabilities/service info
    api_resolver->>CapabilitiesDB: Query enabled capabilities
    CapabilitiesDB-->>api_resolver: Capabilities data
    api_resolver->>api_resolver: Is new_database_service available in this environment? 
    api_resolver->>new_database_service: Does customer's org exist in DB?
    new_database_service-->>api_resolver: Yes/No
    api_resolver-->>Client: Return capabilities and org's service
    alt Customer org exists in new_database_service
        Client->>new_database_service: (Second request as instructed)
        new_database_service-->>Client: Return content
    else Customer org does not exist
        Client->>old_database_service: (Second request as instructed)
        old_database_service-->>Client: Return content
    end
```

This architecture formed gradually due to a combination of migrations and product pivots. A chronology:

1. A long-running migration was started from `old_database_service` to `new_database_service`. It stalled because the platform team rolling out `new_database_service` could not perform the migration in some environments. Originally projected to take two quarters, it dragged on for years. Both services were [CRUD](https://en.wikipedia.org/wiki/Create,_read,_update_and_delete) layers over databases (`old_database_service` over Mongo, `new_database_service` over a custom database). Calls to `new_database_service` also produced many side effects beyond persistent storage.
2. The `capabilities` database was originally created to manage a suite of new, optional features called capabilities.
3. But, after the first capability was released, the product manager changed and the other capabilities under development were deprioritized.
4. The `api_resolver` was created to handle both live and asynchronous migrations, allowing users to switch from one system to another during a request.
5. The product manager eventually decided that new capabilities would only be added for users within environments with `new_database_service`. This product change meant that instead of being independent, the contents of `capabilities` was now tightly coupled to the migration and the state in `new_database_service`.
6. The UI was redesigned around the new capabilities in `new_database_service`. Now `api_resolver` also determined whether to use the old or new UI, which were different pages in the React frontend.

This architecture introduced operational problems, each of which degraded user experience and led to pages. This endpoint took on the order of 1,000 to 10,000 non-[Synthetics](https://docs.datadoghq.com/synthetics/) requests per day and paged 1-4 times per month.

There were multiple root causes with compounding error rates:
1. An `api_resolver` round-trip is necessary before the frontend knows what layout to render or how to query for the contents. Slow page loads degraded user experience.
2. The `capabilities` table had noisy neighbors, leading to 5xx timeouts.
    - Fortunately, retries yielded responses and the retries were well-configured.
    - But, retrying introduced additional pageload delay.
    - Postgres offers some timeout tuning that could have reduced the paging rate marginally.
    - Upscaling the database to handle performance during bursts helped.
    - Our options were organizationally limited short of a live migration to a new database: the other teams needed their traffic bursts and had committed to using the same database as us.
3. `new_database_service` had high tail latency.
    - This caused `api_resolver` to have higher tail latency, which delayed UI loading for all customers.
    - Occasional timeouts triggered pages (a 15-second timeout, hit once per >10 million requests).
    - The owning team had more pressing operational priorities than tail latency, because many more clients than expected were sending requests to `new_database_service`.
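The retry behavior in (2) can be sketched as follows. This is a minimal illustration of how well-configured retries still compound page-load delay; the function and parameter names are hypothetical, not the real client code:

```typescript
// Hypothetical sketch: retry a fetch with exponential backoff. Each failed
// attempt (e.g. a 5xx timeout from a noisy-neighbor database) adds its own
// latency plus a backoff delay to the total page-load time.
async function fetchWithRetry<T>(
  doFetch: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await doFetch();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Exponential backoff: 100ms, 200ms, 400ms, ... on top of each
        // failed attempt's own latency.
        await new Promise<void>((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
      }
    }
  }
  throw lastError;
}
```

This is why retries "yielded responses" but still degraded page loads: the user pays for every failed attempt plus the backoff before the successful one.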

Critically, outside of database incidents, there was no operational response for these pages. We needed to re-architect the system to fix slow loading and non-actionable pages, without disrupting the ability to resume the active migration once `new_database_service` was deployed in new environments, and before I was scheduled to begin my next large project.

I implemented a solution in three weeks that reduced pages on this path to zero.

When I started having ideas about how to re-architect to fix the problem (and pages), my backend team showed strong interest. Frontend was focused on other work but positive. The product manager was already thinking about ideas to simplify UX. My org as a whole values ops and it was easy to sell them on investing one eng for a few weeks of full-stack development. The RFC was approved within a day.

The project had three phases:

1. Product/UX change: first, the product manager had no objections to us refactoring the sole remaining capability. Instead of a toggle tightly coupled to the state in `new_database_service`, we implemented a stateless button and changed how the contents of `new_database_service` were displayed.
2. FE source-of-truth change (most critical): next, a feature flag skipped the call to `api_resolver` entirely. Since the migration was stalled on infrastructure limitations, we short-circuited resolution of the appropriate UI/service on the client. `api_resolver` only needed to be called in environments actively being migrated into `new_database_service`, which reduced requests from 1,000-10,000 per day to zero. This also modestly decreased traffic to `new_database_service`, which its ops team appreciated as they worked through its scaling issues.
3. Postgres table deprecation: with a stateless UI, `capabilities` could be dropped, allowing us to avoid our noisy neighbors.
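The short-circuit in phase 2 can be sketched like this. The flag name, types, and helper are illustrative assumptions, not the real identifiers:

```typescript
// Hypothetical sketch of the flag-based short-circuit. When the flag is on
// (migration stalled), the client resolves the target service locally and
// never pays for the api_resolver round-trip.
type Service = "new_database_service" | "old_database_service";

interface FlagClient {
  isEnabled(flag: string): boolean;
}

function resolveService(
  flags: FlagClient,
  envUsesNewService: boolean, // static per-environment config, not a round-trip
  callApiResolver: () => Service, // only invoked mid-migration
): Service {
  if (flags.isEnabled("skip-api-resolver")) {
    // Normal operation: the environment's state is known ahead of time.
    return envUsesNewService ? "new_database_service" : "old_database_service";
  }
  // Active migration: fall back to the server-side resolution round-trip.
  return callApiResolver();
}
```

The key property is that the expensive path still exists behind the flag, so resuming the migration in a new environment is a flag toggle rather than a code change.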

The updated architecture looked like this:
```mermaid
sequenceDiagram
    participant Client as UI (Client)
    participant api_resolver
    participant new_database_service
    participant old_database_service

    alt Normal operation (feature flag active)
        Client->>Client: Feature flag determines target service
        alt Use new_database_service
            Client->>new_database_service: GET page data
            new_database_service-->>Client: Return content
        else Use old_database_service
            Client->>old_database_service: GET page data
            old_database_service-->>Client: Return content
        end
    else Active migration in env (feature flag off)
        Client->>api_resolver: Request service resolution
        api_resolver->>api_resolver: Is new_database_service available in this environment? 
        api_resolver->>new_database_service: Does customer's org exist in DB?
        new_database_service-->>api_resolver: Yes/No
        api_resolver-->>Client: Instruct which service to use
        alt Use new_database_service
            Client->>new_database_service: Request after instruction
            new_database_service-->>Client: Return content
        else Use old_database_service
            Client->>old_database_service: Request after instruction
            old_database_service-->>Client: Return content
        end
    end
```

Arguably, we could have simplified further. On the frontend, the two UIs were different enough that they could have been routed to different URLs, avoiding the need for a flag entirely. I didn't do this because URL changes can be heavy: they touched other user journeys and seemed out of scope. I had already found a few display bugs caused by bad combinations of feature flags, so I focused on bug fixes instead.

On the backend, I kept `api_resolver` deployed since it was used during live database migrations. It was easy to document the flag toggles needed to extend the existing migration strategy. But at that point the migration had not advanced for 18 months. Code is cheap enough, and enough had changed over those 18 months, that deleting the endpoint entirely might have been the better call. Since it was no longer on the hot path, it didn't impact users or ops, so the question wasn't as important to me.

The outcome was good:

```mermaid
xychart-beta
    title "Requests to api_resolver (excluding Synthetics)"
    x-axis ["Mar 31", "Apr 07", "Apr 14", "Apr 21", "Apr 28", "May 05", "May 12", "May 19", "May 26", "Jun 02", "Jun 09", "Jun 16", "Jun 23", "Jun 30", "Jul 07", "Jul 14", "Jul 21", "Jul 28", "Aug 04", "Aug 11", "Aug 18"]
    y-axis "Requests"
    bar [15074, 12738, 16254, 14531, 20515, 13779, 18406, 13740, 18024, 11951, 11221, 1816, 46, 8, 13, 6, 7, 6, 4, 2, 7]
```

```mermaid
xychart-beta
    title "api_resolver Latency — p50, p95, p99"
    x-axis ["Mar 31", "Apr 07", "Apr 14", "Apr 21", "Apr 28", "May 05", "May 12", "May 19", "May 26", "Jun 02", "Jun 09", "Jun 16", "Jun 23", "Jun 30", "Jul 07", "Jul 14", "Jul 21", "Jul 28", "Aug 04", "Aug 11", "Aug 18"]
    y-axis "Latency (ms)"
    line [56, 62, 57, 53, 50, 52, 55, 57, 57, 58, 42, 29, 28, 28, 28, 28, 29, 29, 29, 29, 28]
    line [240, 318, 256, 245, 221, 218, 230, 224, 216, 203, 105, 43, 43, 43, 44, 45, 53, 53, 48, 47, 44]
    line [397, 533, 428, 1723, 360, 347, 376, 495, 721, 459, 169, 66, 67, 68, 75, 76, 87, 87, 81, 79, 74]
```

The pages stopped at the same time traffic to the endpoint did. There was a clear path to resume the migration when `new_database_service` was deployed in additional environments, only activating `api_resolver` for a small portion of customers at a time. With the context of the full stack and full-team alignment, it was easy to perform some additional opportunistic refactors which left the code in a better state.

Architectural drift or debt like this can be hard to anticipate. Sometimes we build towards one goal and need to shift to another. What I liked during this project was that I could work across the front and back to solve an operational problem with a series of targeted changes across the full stack.

With coding agents, I wonder how the cost of architectural debt will be weighed differently. I see different responses in different parts of my org.

I think the modal answer will be that engineers will throw away more code. Implementing re-architected solutions is so much cheaper now, and I think as engineers we need to improve both our process and tooling for confidently deploying large, multi-system refactors. CI/CD!