Architectural Debt

I worked for a few years on a product which needed to perform a long round-trip before the page could load. This caused high tail latency on page loads and a lot of non-actionable on-call pages (1-4 a month, 30% of the pages in my rotation).

Loading a page ended up looking like this:

[Sequence diagram: Client, api_resolver, capabilities (Postgres table), new_database_service, old_database_service]

  1. The Client requests capabilities/service info from api_resolver.
  2. api_resolver queries the enabled capabilities and receives the capabilities data.
  3. api_resolver checks: is new_database_service available in this environment? Does the customer's org exist in the DB? (Yes/No)
  4. api_resolver returns the capabilities and the org's service to the Client.
  5. The Client makes a second request as instructed: to new_database_service if the customer org exists there, otherwise to old_database_service. That service returns the content.
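As a rough illustration of that flow, here is a minimal TypeScript sketch of the resolution step. The function and interface names (`resolveService`, `ResolverDeps`, and its three methods) are hypothetical stand-ins, not the real code, and the sketch assumes the org lookup is skipped when new_database_service is not available in the environment.

```typescript
type Service = "old_database_service" | "new_database_service";

interface Resolution {
  capabilities: string[];
  service: Service;
}

// Hypothetical dependencies standing in for the Postgres table and
// the checks against new_database_service.
interface ResolverDeps {
  queryEnabledCapabilities(orgId: string): string[];
  newServiceAvailableInEnv(): boolean;
  orgExistsInNewService(orgId: string): boolean;
}

// First round-trip: the client cannot render or fetch content until
// this returns, which is why its latency lands on every page load.
function resolveService(orgId: string, deps: ResolverDeps): Resolution {
  const capabilities = deps.queryEnabledCapabilities(orgId);
  const useNew =
    deps.newServiceAvailableInEnv() && deps.orgExistsInNewService(orgId);
  return {
    capabilities,
    service: useNew ? "new_database_service" : "old_database_service",
  };
}
```

The client then sends its second request to whichever service comes back, so every page load pays for the resolver, the capabilities query, and the content fetch in sequence.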

This architecture formed gradually due to a combination of migrations and product pivots. A chronology:

  1. A long-running migration was started from old_database_service to new_database_service. It stalled because the platform team rolling out new_database_service was unable to perform the migration in some environments. Originally projected to take two quarters, it continued for years. Both were CRUD services over databases (old_database_service over Mongo and new_database_service over a custom database). Calls to new_database_service produced a lot of different effects in addition to persistent storage.
  2. The capabilities database was originally created to manage a suite of new, optional features called capabilities.
  3. But, after the first capability was released, the product manager changed and the other capabilities under development were deprioritized.
  4. The api_resolver was created to handle both live and asynchronous migrations, allowing users to switch from one system to another during a request.
  5. The product manager eventually decided that new capabilities would only be added for users within environments with new_database_service. This product change meant that instead of being independent, the contents of capabilities were now tightly coupled to the migration and the state in new_database_service.
  6. The UI was redesigned based on the new capabilities in new_database_service. Now, api_resolver also determines whether to use the old or new UI, which are different pages in the React frontend.

This architecture introduced operational problems, each of which degraded user experience and led to pages. This endpoint took on the order of 1,000 to 10,000 non-Synthetics requests per day and paged 1-4 times per month.

There were multiple root causes, causing compounding error rates:

  1. An api_resolver round-trip is necessary before the frontend knows what layout to render or how to query for the contents. Slow page loads degraded user experience.
  2. The capabilities table had noisy neighbors, leading to 5xx timeouts.
    • Fortunately, retries yielded responses and the retries were well-configured.
    • But, retrying introduced additional pageload delay.
    • Postgres provides some tuning capacity around timeouts which could reduce paging rate marginally.
    • Upscaling the database to handle performance during bursts helped.
    • We had limited organizational ability to fix this short of a live migration to a new database: the other team needed to send traffic bursts and had committed to using the same database as us.
  3. new_database_service had high tail latency.
    • This causes api_resolver to have higher tail latency, which delays UI loading for all customers.
    • Occasional paging timeouts (15 second timeout, hit once per >10 million requests).
    • The team that owned it had more severe operational priorities than tail latency, because many more clients than expected were sending requests to new_database_service.
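To make the compounding concrete, here is a small sketch of the arithmetic. It assumes the dependencies fail independently, which is a simplification, and the failure rates, attempt count, and backoff are made-up illustrative values. Only the 15 second timeout comes from the list above.

```typescript
// Probability that at least one of several sequential dependencies
// fails, ASSUMING their failures are independent.
function compoundFailureRate(rates: number[]): number {
  const allSucceed = rates.reduce((ok, p) => ok * (1 - p), 1);
  return 1 - allSucceed;
}

// Worst-case time spent on one dependency when every attempt times
// out: each attempt costs a full timeout, plus a fixed backoff
// between consecutive attempts.
function worstCaseDelayMs(
  timeoutMs: number,
  maxAttempts: number,
  backoffMs: number
): number {
  return timeoutMs * maxAttempts + backoffMs * (maxAttempts - 1);
}

// Hypothetical rates for the capabilities query and new_database_service.
const pageLoadFailureRate = compoundFailureRate([0.01, 0.02]);

// 15 s timeout, with a hypothetical 3 attempts and 500 ms backoff.
const worstCase = worstCaseDelayMs(15_000, 3, 500); // 46000 ms
```

Retries lower the error rate that users see, but the second function shows the price: every retry on the resolver path is added directly to the page load.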

Critically, outside of database incidents, there was no operational response for these pages. We needed to re-architect the system to fix slow loading and non-actionable pages, without disrupting the ability to resume the active migration when new_database_service was deployed in new environments, and before I was scheduled to begin my next large project.

I implemented a solution in three weeks that reduced pages on this path to zero.

When I started having ideas about how to re-architect to fix the problem (and pages), my backend team showed strong interest. Frontend was focused on other work but positive. The product manager was already thinking about ideas to simplify UX. My org as a whole values ops and it was easy to sell them on investing one eng for a few weeks of full-stack development. The RFC was approved within a day.

The project had three phases:

  1. Product/UX change: First, the product manager had no objection to us refactoring the sole remaining capability. Instead of requiring a toggle that was tightly coupled to the state in new_database_service, we implemented a stateless button and changed how the contents of new_database_service were displayed.
  2. (most critically) FE source of truth change: next, a feature flagging system skipped the call to api_resolver entirely. As the migration was stalled due to infrastructure limitations, we short-circuited resolution of the appropriate UI/service. api_resolver only needed to be called in an environment actively being migrated into new_database_service, reducing the number of requests per day from between 1,000 and 10,000 to zero. This also provided a modest decrease in traffic to new_database_service, which their ops team appreciated as they worked through its scaling issues.
  3. Postgres table deprecation: with a stateless UI, capabilities could be dropped, allowing us to avoid our noisy neighbors.

The updated architecture looked like this:

[Sequence diagram: UI (Client), api_resolver, new_database_service, old_database_service]

    • Normal operation (feature flag active): the feature flag determines the target service. The UI sends GET page data directly to new_database_service or old_database_service, which returns the content.
    • Active migration in env (feature flag off): the UI requests service resolution from api_resolver, which checks whether new_database_service is available in this environment and whether the customer's org exists in the DB (Yes/No), then instructs which service to use. The UI sends its request after instruction to that service, which returns the content.
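The short-circuit in the frontend can be sketched roughly like this. The flag shape (`FlagState`, `targetService`) and the function name `chooseService` are hypothetical, not the real flag names, and the sketch assumes an inactive flag is represented as `null`.

```typescript
type Service = "old_database_service" | "new_database_service";

// Hypothetical flag shape: when the flag is active it names the target
// service directly; null means the flag is off because this
// environment is actively being migrated.
interface FlagState {
  targetService: Service | null;
}

// Decide which service the UI should query. api_resolver is only
// consulted when the flag is off, so in normal operation the extra
// round-trip never happens.
function chooseService(
  flag: FlagState,
  resolveViaApiResolver: () => Service
): Service {
  if (flag.targetService !== null) {
    return flag.targetService;
  }
  return resolveViaApiResolver();
}
```

Passing the resolver in as a callback keeps the migration path intact: turning the flag off for an environment restores the old behavior without a code change.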

Arguably, we could have simplified further. In the front, the two UIs were different enough that they could have been routed differently, avoiding the need for a flag entirely. I didn't do this as URL changes can be heavy; it touched on other user journeys and seemed out of scope. I had already identified a few display bugs due to bad combinations of feature flags so I focused on bug fixes instead.

In the back, I kept api_resolver deployed since it was used during live database migrations. It was easy to document the flag toggles necessary to extend the existing migration strategy. But, at that point the migration had not advanced for 18 months. Code is cheap enough and enough had changed over 18 months that deleting the endpoint entirely might have been the better call. Since it was no longer on the hot path, it didn't impact users or ops, so the question wasn't as important to me.

The outcome was good:

[Figure: Requests to api_resolver (excluding Synthetics), weekly from Mar 31 to Aug 18. Y-axis: requests, 2,000 to 20,000.]

[Figure: api_resolver latency (p50, p95, p99), weekly from Mar 31 to Aug 18. Y-axis: latency (ms), 200 to 1,600.]

The pages stopped at the same time traffic to the endpoint did. There was a clear path to resume the migration when new_database_service was deployed in additional environments, only activating api_resolver for a small portion of customers at a time. With the context of the full stack and full-team alignment, it was easy to perform some additional opportunistic refactors which left the code in a better state.
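One way to activate api_resolver for only a small portion of customers at a time is deterministic bucketing by org ID. This is a sketch of the general technique, not a description of how our flagging system selects customers; the names `bucketOf` and `inMigrationCohort` and the simple polynomial hash are assumptions for illustration.

```typescript
// Map an org ID to a stable bucket in [0, 100) using a simple
// polynomial string hash. The same org always lands in the same
// bucket, so raising the percentage only ever adds orgs.
function bucketOf(orgId: string): number {
  let h = 0;
  for (let i = 0; i < orgId.length; i++) {
    // >>> 0 keeps the running hash an unsigned 32-bit integer.
    h = (h * 31 + orgId.charCodeAt(i)) >>> 0;
  }
  return h % 100;
}

// True if this org should go through api_resolver during a staged
// migration at the given rollout percentage (0 to 100).
function inMigrationCohort(orgId: string, percent: number): boolean {
  return bucketOf(orgId) < percent;
}
```

With this shape, `inMigrationCohort(orgId, 0)` sends nobody through api_resolver and `inMigrationCohort(orgId, 100)` sends everyone, with stable membership in between.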

Architectural drift or debt like this can be hard to anticipate. Sometimes we build towards one goal and need to shift to another. What I liked during this project was that I could work across the front and back to solve an operational problem with a series of targeted changes across the full stack.

With coding agents, I wonder how the cost of architectural debt will be weighed differently. I see different responses in different parts of my org.

I think the modal answer will be that engineers will throw away more code. Implementing re-architected solutions is so much cheaper now, and I think as engineers we need to improve both our process and tooling for confidently deploying large, multi-system refactors. CI/CD!

☄︎