
Dominic is a software engineer at Datadog. Interests: backend, security, and data engineering.


Projects

Marigold 🏵️

(site | source code)

Marigold is a domain-specific language for working with asynchronous streams of data. It compiles to Rust, and can be integrated into Rust programs using a macro.

Marigold is designed for rapidly building parallel data pipelines and analyses. The grammar includes pure functions, immutable, fixed-size types with built-in serialization and compression, and an opinionated, always-async IO/network API. Integrated into a Rust program, Marigold seamlessly accepts Rust structs and functions that are in scope.

Keywords: Domain Specific Language, Programming Language Design, PL, Async Rust, Streaming Data, Data Pipelining, Data Analysis, Procedural Macros, LALRPOP.

Turbolift 🚡

(site | source code)

Turbolift is a toy Rust distribution interface. Turbolift lets you automatically turn functions into microservices using a custom macro. These microservices are then distributed and managed, making it easier to create and maintain cluster-agnostic, distributed Rust scripts.

Currently, Turbolift development is only targeting Kubernetes, but Turbolift was built to be extensible to different platforms without significant API changes. Swarm, AWS Lambda, and SLURM are all viable targets for future development.
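
One way to picture that extensibility is to hide the platform behind a trait, so that the calling code never changes when a new target is added. The sketch below is illustrative only and is not Turbolift's actual API: the `Platform` trait, `LocalPlatform`, `run_job`, and the `square` service are all hypothetical names, and the "platform" here simply runs functions in-process instead of deploying them.

```rust
use std::collections::HashMap;

/// Hypothetical platform interface (an assumption, not Turbolift's real trait).
/// A Kubernetes or SLURM backend would implement the same two methods.
trait Platform {
    fn register(&mut self, name: &str, service: fn(u64) -> u64);
    fn call(&self, name: &str, input: u64) -> Option<u64>;
}

/// Stand-in backend that runs every registered function in-process.
#[derive(Default)]
struct LocalPlatform {
    services: HashMap<String, fn(u64) -> u64>,
}

impl Platform for LocalPlatform {
    fn register(&mut self, name: &str, service: fn(u64) -> u64) {
        self.services.insert(name.to_string(), service);
    }

    fn call(&self, name: &str, input: u64) -> Option<u64> {
        // Returns None when no service with that name was registered.
        self.services.get(name).map(|service| service(input))
    }
}

/// Example function that would become a service.
fn square(x: u64) -> u64 {
    x * x
}

/// The call site depends only on the trait, so swapping the backend
/// does not change this code.
fn run_job(platform: &dyn Platform, inputs: &[u64]) -> Vec<Option<u64>> {
    inputs.iter().map(|&x| platform.call("square", x)).collect()
}

fn main() {
    let mut platform = LocalPlatform::default();
    platform.register("square", square);
    println!("{:?}", run_job(&platform, &[1, 2, 3]));
}
```

In this sketch, supporting another target would mean adding one more `impl Platform` block while `run_job` stays untouched.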

Keywords: Rust, Async Rust, Metaprogramming, Orchestration, Distributed Computing, API Design, DevOps, Infrastructure as Code, Kubernetes, K8s, Docker, Open Source, Flagrant Macro Abuse.

Wikipedia Server 🗃

(source code)

This program downloads, compactly stores, and efficiently serves every edit to Wikipedia, available by time period or by revision ID. By optimizing for compact storage, this project reduces the size of the full revisions dataset from over 60 TB when stored in PostgreSQL to less than 6 TB.

This means that instead of being run on an AWS server that would cost over 1,000 USD per month, the server can run on much smaller devices, such as a Raspberry Pi 4 with 4 GB RAM and an external hard drive. The program also has nice scaling characteristics. On a single device, with more memory available for the OS file cache, fewer calls hit storage; with more CPU cycles available, stream compression is faster. Since each node is independent, the server can be set up as a service behind a single load balancer, or geographically distributed to decrease latency while serving multiple datacenters. Read more.
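
The two access paths mentioned above (by time period and by revision ID) can be sketched with a pair of in-memory maps. This is a minimal illustration under assumptions, not the project's actual storage layout: `RevisionIndex`, `Location`, and the idea of storing a byte offset and length per compressed revision are all hypothetical.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical position of one compressed revision inside a data file.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Location {
    offset: u64,
    length: u32,
}

/// Hypothetical index supporting both lookups described in the text.
#[derive(Default)]
struct RevisionIndex {
    /// Revision ID -> where the compressed bytes live.
    by_id: HashMap<u64, Location>,
    /// Unix timestamp (seconds) -> revision IDs saved at that second.
    by_time: BTreeMap<u64, Vec<u64>>,
}

impl RevisionIndex {
    fn insert(&mut self, id: u64, timestamp: u64, location: Location) {
        self.by_id.insert(id, location);
        self.by_time.entry(timestamp).or_default().push(id);
    }

    /// Lookup by revision ID.
    fn get(&self, id: u64) -> Option<Location> {
        self.by_id.get(&id).copied()
    }

    /// Revision IDs with start <= timestamp < end, in time order.
    /// The caller must pass start <= end; BTreeMap::range panics otherwise.
    fn in_period(&self, start: u64, end: u64) -> Vec<u64> {
        self.by_time
            .range(start..end)
            .flat_map(|(_, ids)| ids.iter().copied())
            .collect()
    }
}

fn main() {
    let mut index = RevisionIndex::default();
    index.insert(1, 1_000, Location { offset: 0, length: 120 });
    index.insert(2, 2_000, Location { offset: 120, length: 80 });
    index.insert(3, 3_000, Location { offset: 200, length: 95 });

    println!("{:?}", index.get(2));
    println!("{:?}", index.in_period(1_500, 3_500));
}
```

The ordered map keeps range queries over a time period cheap, while the hash map gives direct access by revision ID.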

Keywords: Data Pipeline, Data Engineering, Docker, Rust, Actix, Python, PyPy, Wikipedia, Open Data, Open Source, Optimization, Ode to the File System.

Birdie 🐦

The New York City Council oversees the city's budget ($77 billion in 2017). As stewards of the most populous city in the United States, the 51 New York City Council members have significant legislative authority.

Birdie is a command-line tool that generates static webpage reports on proposed council legislation, using open data to find similar prior bills, predict likely supporters, and estimate the likelihood that a bill will pass.

Keywords: Data Pipeline, Data Engineering, Machine Learning, Contagion Modeling, Sequence Prediction, CLI, Command Line Interface, Docker, Python, Open Data, Civics.