<img src="/assets/locked_computer.jpg" title="a computer locked within a birdcage" alt="a computer locked within a birdcage" style="width: 80%; display: block; margin-left: auto; margin-right: auto; border-radius: 8px; height: auto;" />

*"Isolated process," made with DALL·E*

A friend sent me an [article that involved setting type constraints on code generated from neural networks](https://arxiv.org/abs/2402.15332). The prospect of verifying the correctness of generated code made me think more about [side effects](https://en.wikipedia.org/wiki/Side_effect_(computer_science)) and the implications of algebraic type checks in generated code on program analysis. Where are we with side effect analysis, and where are we going?

For most practical programs, side effects are the only things that matter. Programs are often [specified](https://en.wikipedia.org/wiki/Specification_(technical_standard)) wholly or in part by the side effects they must produce. Understanding precisely what a program does is essential to building reliable, secure, and correct software, especially when some side effects may be [hallucinatory](https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)).

Despite the singular importance of side effects, they remain difficult to model in practice. Analyses are generally divided into static (compile-time) and dynamic (runtime) approaches.

Determining side effects statically is complex and resource-intensive. Most approaches rely on source code analysis (like [SonarQube](https://www.sonarsource.com/products/sonarqube/) for identifying effects or [Safe Haskell](https://downloads.haskell.org/ghc/latest/docs/users_guide/exts/safe_haskell.html) for limiting effects, mostly IO), and are therefore:

1. limited to one language in the pluralistic environment of many real-world applications,
2. limited in scope (generally to application logic), and
3. limited in terms of practical value for non-[reproducible](https://en.wikipedia.org/wiki/Reproducible_builds) systems.

Such analyses can be adequate to detect whether, for example, generated code hallucinates accesses to global variables that are never actually set. They will not protect against supply-chain attacks or linker issues. Even where they can detect an effect like a file being opened, they generally cannot model dynamic resolution of file names or track, for example, whether the set of files opened by a program has changed.
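As a minimal Haskell sketch of that last limitation (the environment variable and paths here are hypothetical), the file a program opens can depend on runtime state, so no static scan of the source can enumerate the files it will actually touch:

```haskell
import System.Environment (lookupEnv)
import Data.Maybe (fromMaybe)

-- Build the log path from runtime state. The names are hypothetical; the
-- point is that the resolved path is unknowable at compile time.
resolveLogPath :: Maybe String -> FilePath
resolveLogPath override = fromMaybe "/var/log/myapp" override ++ "/app.log"

main :: IO ()
main = do
  override <- lookupEnv "MYAPP_LOG_DIR"  -- runtime input
  -- A static analyzer can see *that* a file is opened, but not *which* one.
  putStrLn ("would open: " ++ resolveLogPath override)
```

A source scanner can flag the open itself, but the concrete file set is a function of the deployment environment.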

Lower-level static analysis models runtime behavior with increasing precision, asymptotically approaching virtualized execution of the program in order to resolve dynamic references and adequately track pointers. That asymptote is not to say static analysis lacks value: just as the article above describes promising mechanisms for establishing type constraints in generated code, symbolic execution and other formal methods can attempt to glean the same level of information and provide excellent static analysis of programs despite their intrinsic limitations.

Instead, [runtime verification](https://en.wikipedia.org/wiki/Runtime_verification) approaches are used to understand side effects empirically, through observing or altering side effects at different levels. At the OS level, [eBPF](https://en.wikipedia.org/wiki/EBPF) can run arbitrary code on system calls with effects, such as [`socket()`](https://en.wikipedia.org/wiki/Berkeley_sockets#socket). At any layer of virtualization, runtimes similarly impose constraints around side effects (isolating side effects is one of the fundamental objectives of virtualization).

As examples of runtime verification for containers, the Docker runtime includes additional tools to limit side effects, such as the kernel tools [AppArmor](https://docs.docker.com/engine/security/apparmor/) (mandatory access control across file IO, network IO, and signals, among other capabilities) and [seccomp](https://docs.docker.com/engine/security/seccomp/) (process isolation by blocking most syscalls).
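As a sketch of what such a runtime constraint looks like, a minimal Docker seccomp profile (trimmed to a hypothetical allowlist; real profiles list hundreds of syscalls) denies everything by default and permits only the calls the container is expected to make:

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": ["read", "write", "exit", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```

Applied with `docker run --security-opt seccomp=profile.json`, this turns "which side effects may this process have?" into an explicit, auditable artifact rather than an emergent property of the code.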

These empirical approaches, like the static ones, have great value. But like static ones, they are not a panacea for program analysis. Runtime approaches:

1. can use significant **runtime** resources, introducing performance issues or resource exhaustion that break the systems they are supposed to observe/guard,
2. can, at virtualization levels, be defeated through isolation failure (including privilege escalation outside of the virtual environment as in container breakouts), and
3. can fail to accurately detect some kinds of side effects due to the complexity and interconnectedness of real-world systems. As an example in Linux: `mmap()` has side effects because it can alter potentially shared memory. But if `mmap()` is interacting with memory shared with a network card, can we identify only the first side effect (shared memory), or also the second (network I/O)?

With both approaches, even when side effects can be detected, it is difficult to describe them in practical terms. Most people interested in side effects, both software engineers and computer scientists, focus on specific domains in which they can describe side effects clearly.

One such endeavor, which has particular promise for code synthesis, is in the world of programming languages. If analysis of existing grammars is limited, new grammars can be written that facilitate static analysis.

[Haskell](https://en.wikipedia.org/wiki/Haskell) and similar languages use [category theory](https://en.wikipedia.org/wiki/Category_theory) to describe I/O monadically. While this can mitigate some difficulties with unexpected side effects **within the application code**, the portions of runtime behavior that exist out of this walled garden can still cause difficult surprises: `unsafePerformIO`, memory-related side effects, foreign functions, other processes in the runtime environment, and lower-level issues (drivers, kernel behavior, etc.).
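A minimal sketch of that escape hatch (the function name is hypothetical): the type below claims a pure `Int -> Int`, so nothing in the type system records that evaluating it also writes to stderr.

```haskell
import System.IO (hPutStrLn, stderr)
import System.IO.Unsafe (unsafePerformIO)

-- Looks pure from its type, but each evaluation also performs I/O that
-- is invisible to the type checker.
impureIncrement :: Int -> Int
impureIncrement x = unsafePerformIO $ do
  hPutStrLn stderr ("side effect while computing " ++ show x)
  pure (x + 1)
{-# NOINLINE impureIncrement #-}

main :: IO ()
main = print (impureIncrement 41)
```

The walled garden holds only as long as nothing in the dependency tree, the FFI, or the runtime below it punches a hole like this.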

The ubiquity of such surprises causes some software engineers to question whether modeling side effects with languages like Haskell is worth the trouble. Any attempt to solve a portion of this problem can seem quixotic or myopic. The goal of understanding what a program does in practice apparently unravels into a quest to understand the totality of the universe.

There are currently strong arguments for the defeatist position when it comes to language-level solutions for improving side effect modeling. The most convincing one to me is the populist one: a small percentage of software engineers use Haskell despite its longstanding availability, and a small portion of those users use tools like Safe Haskell to statically limit side effects. I'm not familiar with any great Haskell tools for setting an effect allowlist or tracking changes to the effects used, e.g., across versions of a program. Yet software continues to be written and deployed, and it often does generally as intended. Not being able to prove something works does not mean that it does not work.
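For reference, opting into Safe Haskell is just a pragma. With it, GHC rejects imports of modules it does not consider safe (adding `import System.IO.Unsafe` to the sketch below would be a compile error), though it says nothing about *which* IO actions the remaining trusted code performs:

```haskell
{-# LANGUAGE Safe #-}

-- Under the Safe pragma, only modules GHC trusts as safe may be imported;
-- escape hatches like System.IO.Unsafe are rejected at compile time.
greet :: String -> String
greet name = "hello, " ++ name

main :: IO ()
main = putStrLn (greet "world")
```

That gap, between "no unsafe escape hatches" and "this exact set of effects", is where an allowlist or effect-diffing tool would live.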

Similarly, technical approaches like those discussed in the AI algebra article will diminish the frequency of errors in generated code and allow a fair degree of confidence, whether through additional validation within the model itself or through compiling the generated code and feeding errors into a second model trained to eventually produce something that compiles from a given source and error. Eventually, I think the correctness of generated code will pass a threshold similar to that of human-written code, probably without the models writing Haskell.

The defeatist position will not remain reasonable as analytical tools improve. As significant work goes into applying formal methods, the effort required to leverage these tools decreases and the benefits increase.

Specifically around generated code, my belief is that static verification approaches will be increasingly relied on for the iterative component of models (like the compiler iterations baked into Copilot training data) and will capture more and more classes of side effects. Runtime verification will play a greater role in verifying the correctness of an application, counteracting hallucinations or other unexpected behavior, especially as some generative approaches may not produce legible code.

I'm not sure what the future of language-based approaches is, though. Generally, I think modern ecosystems can benefit from DSLs that encode domain-specific invariants into the grammar or type system. But side effects are everywhere at every level of execution, and attempts to model them at the language level, as in Haskell, have not had significant direct penetration in industry or the sciences.

Perhaps static analysis will obviate these needs at the language level. Or perhaps we will enter a declarative era in which people make new grammars that help communicate program invariants between humans and code-generating models. Likely both of these things will happen, along with many other outcomes, for different programmers as different approaches to these problems are applied in different niches. I look forward to seeing what happens in my niches.