Architectural Debt
I worked for a few years on a product that needed to perform a long round-trip before the page could load. This caused high tail latency on page loads and a lot of non-actionable on-call pages (1-4 pages a month, 30% of the pages in my rotation).
Loading a page ended up looking like this:
This architecture formed gradually due to a combination of migrations and product pivots. A chronology:
- A long-running migration was started from old_database_service to new_database_service. It was stalled because the platform team rolling out new_database_service was unable to perform the migration in some environments. Originally projected to take two quarters, it continued for years. Both services were CRUD layers over databases (old_database_service was for Mongo and new_database_service was for a custom database). Calls to new_database_service produced a lot of different effects in addition to persistent storage.
- The capabilities database was originally created to manage a suite of new, optional features called capabilities.
- But, after the first capability was released, the product manager changed and the other capabilities under development were deprioritized.
- The api_resolver was created to handle both live and asynchronous migrations, allowing users to switch from one system to another during a request.
- The product manager eventually decided that new capabilities would only be added for users within environments with new_database_service. This product change meant that instead of being independent, the contents of capabilities were now tightly coupled to the migration and the state in new_database_service.
- The UI was redesigned based on the new capabilities in new_database_service. Now, api_resolver also determines whether to use the old or new UI, which are different pages in the React FE.
This architecture introduced operational problems, each of which degraded user experience and led to pages. This endpoint took on the order of 1,000 to 10,000 non-Synthetics requests per day and paged 1-4 times per month.
There were multiple root causes, which compounded the error rate:
- An api_resolver round-trip is necessary before the frontend knows what layout to render or how to query for the contents. Slow page loads degraded user experience.
- The capabilities table had noisy neighbors, leading to 5xx timeouts.
  - Fortunately, retries yielded responses and the retries were well-configured.
  - But, retrying introduced additional pageload delay.
  - Postgres provides some tuning capacity around timeouts which could reduce paging rate marginally.
  - Upscaling the database to handle performance during bursts helped.
  - We had limited organizational ability to change this without doing a live migration to a new database: another team's workload needed traffic bursts, and they had committed to using the same database as us.
- new_database_service had high tail latency.
  - This causes api_resolver to have higher tail latency, which delays UI loading for all customers.
  - Occasional paging timeouts (15 second timeout, hit once per >10 million requests).
  - The team had more severe operational priorities than tail latency, due to many more clients than expected sending requests to new_database_service.
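To make the compounding concrete, here is a minimal sketch of the arithmetic, not a model of the real system. It assumes the dependencies on the page-load path fail independently and that a retry only fires after a full timeout. The function names, the retry count, and the backoff value are hypothetical; only the 15 second timeout comes from the description above.

```python
from math import prod


def compound_failure_rate(rates):
    """Probability that at least one dependency on the path fails.

    Assumes failures are independent, which is a simplification.
    """
    return 1 - prod(1 - r for r in rates)


def worst_case_latency(timeout_s, max_retries, backoff_s):
    """Upper bound on time spent when every attempt times out.

    There are max_retries + 1 attempts in total, and a backoff pause
    precedes each retry.
    """
    return timeout_s * (max_retries + 1) + backoff_s * max_retries


# Hypothetical per-request failure rates for api_resolver, the
# capabilities table, and new_database_service.
path_failure = compound_failure_rate([0.001, 0.002, 0.0005])

# 15 second timeout (from the post) with a hypothetical 2 retries
# and 1 second of backoff between attempts.
slowest_load = worst_case_latency(timeout_s=15, max_retries=2, backoff_s=1)
```

The point of the sketch is that every extra hop on the hot path raises the failure rate, and every retry that rescues a request does so by spending latency.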
Critically, outside of database incidents, there was no operational response for these pages. We needed to re-architect the system to fix slow loading and non-actionable pages without disrupting the ability to resume the active migration when new_database_service was deployed in new environments. It also had to happen before I was scheduled to begin my next large project.
I implemented a solution in three weeks that reduced pages on this path to zero.
When I started having ideas about how to re-architect to fix the problem (and pages), my backend team showed strong interest. Frontend was focused on other work but positive. The product manager was already thinking about ideas to simplify UX. My org as a whole values ops and it was easy to sell them on investing one eng for a few weeks of full-stack development. The RFC was approved within a day.
The project had three phases:
- Product/UX change: First, the product manager had no objections to us refactoring the sole remaining capability. Instead of requiring a toggle that was tightly coupled to the state in new_database_service, we implemented a stateless button and changed how the contents of new_database_service were displayed.
- (most critically) FE source of truth change: next, a feature flagging system skipped the call to api_resolver entirely. As the migration was stalled due to infrastructure limitations, we short-circuited resolution of the appropriate UI/service. api_resolver only needed to be called in an environment actively being migrated into new_database_service, reducing the number of requests per day from between 1,000 and 10,000 to zero. This also provided a modest decrease in traffic to new_database_service, which their ops team appreciated as they worked through its scaling issues.
- Postgres table deprecation: with a stateless UI, capabilities could be dropped, allowing us to avoid our noisy neighbors.
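The short-circuit in the second phase can be sketched roughly as follows. This is an illustration in Python for brevity, not the actual code, which lived in the React frontend and the feature flagging system. The flag names (migration_active, uses_new_database_service), the Resolution type, and the resolve function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class Resolution:
    ui: str       # "old" or "new"
    service: str  # "old_database_service" or "new_database_service"


def resolve(
    env_flags: Mapping[str, bool],
    call_api_resolver: Callable[[], Resolution],
) -> Resolution:
    """Pick the UI and backing service for an environment.

    The api_resolver round-trip only happens when the environment is
    actively being migrated; otherwise static flags are the source of truth.
    """
    if env_flags.get("migration_active", False):
        # Mid-migration: state can change between requests, so ask the resolver.
        return call_api_resolver()
    if env_flags.get("uses_new_database_service", False):
        return Resolution(ui="new", service="new_database_service")
    return Resolution(ui="old", service="old_database_service")
```

With no environment mid-migration, the resolver callback is never invoked, which is how traffic to the endpoint can drop to zero while the endpoint itself stays deployed.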
The updated architecture looked like this:
Arguably, we could have simplified further. In the front, the two UIs were different enough that they could have been routed differently, avoiding the need for a flag entirely. I didn't do this as URL changes can be heavy; it touched on other user journeys and seemed out of scope. I had already identified a few display bugs due to bad combinations of feature flags so I focused on bug fixes instead.
In the back, I kept api_resolver deployed since it was used during live database migrations. It was easy to document the flag toggles necessary to extend the existing migration strategy. But, at that point the migration had not advanced for 18 months. Code is cheap enough and enough had changed over 18 months that deleting the endpoint entirely might have been the better call. Since it was no longer on the hot path, it didn't impact users or ops, so the question wasn't as important to me.
The outcome was good:
The pages stopped at the same time traffic to the endpoint did. There was a clear path to resume the migration when new_database_service was deployed in additional environments, only activating api_resolver for a small portion of customers at a time. With the context of the full stack and full-team alignment, it was easy to perform some additional opportunistic refactors which left the code in a better state.
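One common way to activate a code path for a small portion of customers at a time is deterministic hash bucketing. The sketch below assumes that approach; the post does not say how the rollout was actually keyed, and the function name, the salt, and the customer IDs are hypothetical.

```python
import hashlib


def in_rollout(customer_id: str, percent: int, salt: str = "api_resolver_migration") -> bool:
    """Return True if this customer falls inside the rollout percentage.

    The bucket is derived from a hash of the customer ID, so the same
    customer always gets the same answer for a given salt, and raising
    the percentage only ever adds customers.
    """
    digest = hashlib.sha256(f"{salt}:{customer_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # 0..99
    return bucket < percent


# Hypothetical usage: route 5% of customers through api_resolver.
use_resolver = in_rollout("customer-123", percent=5)
```

Because bucketing is stable per customer, a customer does not flip between the old and new path from one request to the next while the percentage is held constant.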
Architectural drift or debt like this can be hard to anticipate. Sometimes we build towards one goal and need to shift to another. What I liked during this project was that I could work across the front and back to solve an operational problem with a series of targeted changes across the full stack.
With coding agents, I wonder how the cost of architectural debt will be weighed differently. I see different responses in different parts of my org.
I think the modal answer will be that engineers will throw away more code. Implementing re-architected solutions is so much cheaper now, and I think as engineers we need to improve both our process and tooling for confidently deploying large, multi-system refactors. CI/CD!