I’m currently writing a compiler for a programming language. During early development, I kept restarting from scratch over and over, which was unproductive and a waste of time. So I decided to investigate why I was so inclined to scrap everything and rebuild, salvaging reusable components from the old code to save a bit of effort.
Part of the problem was my tendency to explore new things (new libraries, data structures, intermediate representations, etc.), which caused huge breaking changes and forced me to rewrite code to incorporate my experiments. However, that kind of problem can be solved with a version control system like Git. The real issue was that I hadn’t structured my compiler codebase very well, so that’s what I’ll be working on today.
Technical note: I’m writing the compiler in Rust, so I’ll be discussing a couple of Rust-specific things. Notably, Rust uses the word “crate” instead of “package”. That said, I’ll describe structure in terms of a general module system rather than Rust’s specific one, independent of how the files are actually laid out on the filesystem.
Part of this post was inspired by this article, which also discusses the structure of large codebases. However, I’ll focus on more compiler-specific details.
Considerations for structure
Obviously, how a codebase should be structured depends on multiple factors, most significantly its size. Structure matters a lot more in large codebases: an improperly structured codebase can suffer from poor compilation times (especially in Rust), confusion, inconvenience, and tradeoffs with no perfect solution.
Compilers generally have pretty large codebases; they do a lot of work in the process of translating source code to binaries. However, every compiler has to start somewhere. The tricky part is finding a structure that works at a small scale (which isn’t hard in itself) and then keeping it organized as the project scales up. That doesn’t even sound difficult, but so far I’ve gone through enough project-destroying “organizational refactors” that I’ve decided working on project structure is worth my time.
The traditional compilation model is a pipeline: the compiler applies a series of transformations to the code, in a specified order, until it has produced its output. A classic example is feeding source code through a lexer, feeding the output of that into a parser, performing name resolution on the result, and so on until the analyses and optimizations are done, leaving us with an object file, LLVM intermediate representation, or whatever else the compiler outputs. Along the way, the code is converted between multiple representations, called intermediate representations (IRs), in a process known as lowering. Modern compilers, like rustc, tend to use many kinds of IRs.
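To make the pipeline shape concrete, here’s a minimal sketch in Rust. Every type and function name here is a hypothetical stand-in, not taken from my actual compiler, and real stages do far more work:

```rust
// A minimal sketch of the pipeline model. All names are hypothetical
// stand-ins; each stage lowers the program into the next representation.
struct Token;      // lexer output
struct Ast;        // parser output
struct Hir;        // name-resolved IR
struct ObjectCode; // final output

fn lex(_source: &str) -> Vec<Token> { Vec::new() }
fn parse(_tokens: Vec<Token>) -> Ast { Ast }
fn resolve_names(_ast: Ast) -> Hir { Hir }
fn codegen(_hir: Hir) -> ObjectCode { ObjectCode }

fn compile(source: &str) -> ObjectCode {
    let tokens = lex(source);
    let ast = parse(tokens);
    let hir = resolve_names(ast);
    codegen(hir)
}

fn main() {
    let _object = compile("fn main() {}");
}
```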
The normal way to structure a compiler is to break it into the individual components that perform analyses and optimizations, for example the parser, type checker, optimizer, and code generator. What this structure doesn’t tackle, however, is the data representations in the compilation process. Breaking a compiler down into components organizes the logic well, but not the data. Sometimes an IR is shared between different components (especially prevalent in my compiler, where I reused IRs to keep things simple), which makes it awkward to cram into one component’s module. I also don’t consider lowering an IR inside a component’s logic to be good design: a component should do the analysis or conversion it was designed for, and leave lowering to separate logic.
These reasons were the main causes of the structural problems in my compiler.
Structure ideas
Below I will discuss some ideas that I’ve had for structuring the compiler:
Separating the data and the analysis logic
This was my approach the first time I built a prototype. The idea was to put the multitude of IRs into one module, encapsulating the implementation details and the boilerplate for formatting, construction, and other mundane tasks. This separated the data from the logic, letting me group type definitions and analysis code separately.
This actually went pretty well, as I could keep all of the IR definitions in one place, which was helpful for iterating on the IR (which I did a lot). If I had grouped each IR with its accompanying logic, I would have had to go into the parser code to change the AST, into the lexer to change the tokens, and so on. In fact, since my prototype was not very complicated, IRs were reused across phases, so it didn’t make sense to keep the AST in the parser alone when it was also used for name resolution and type checking, for example. Furthermore, lowering between IRs became a lot cleaner, as I could put it with the IR instead of with the analysis logic.
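As a sketch of this layout (inline modules stand in for separate files, and all names are hypothetical, not from my prototype):

```rust
// Sketch of the "data separated from logic" layout. All IR definitions
// live under `ir`; analysis passes live elsewhere and only borrow them.
mod ir {
    pub mod ast {
        pub enum Expr { Num(i64), Neg(Box<Expr>) }
    }
    pub mod hir {
        pub enum Expr { Num(i64), Neg(Box<Expr>) }

        // Lowering lives alongside the IR definitions, not inside a pass.
        pub fn lower(ast: &super::ast::Expr) -> Expr {
            match ast {
                super::ast::Expr::Num(n) => Expr::Num(*n),
                super::ast::Expr::Neg(e) => Expr::Neg(Box::new(lower(e))),
            }
        }
    }
}

mod analysis {
    // Passes contain only logic and import their types from `ir`.
    pub fn typecheck(_expr: &crate::ir::hir::Expr) { /* ... */ }
}

fn main() {
    let ast = ir::ast::Expr::Neg(Box::new(ir::ast::Expr::Num(1)));
    let hir = ir::hir::lower(&ast);
    analysis::typecheck(&hir);
}
```

The key property of the sketch is the direction of dependency: analysis depends on the IR module, never the other way around.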
The main problem with this layout was jumping back and forth too much. If I made a breaking change to the type definitions in the IR module, I had to propagate it throughout the codebase, which often involved not only the relevant analysis logic, but also the utilities (pretty-printing, error diagnostics) and the IR lowering. Jumping between three different modules (especially in the terminal) was not fun.
Flat structure
Like Cargo workspaces, I then thought a completely (well, mostly) flat structure would be a good idea. This just meant putting most modules in the top-level src directory: IR, analysis logic, and, in the future, optimization and code generation (which I hadn’t gotten to yet).
The main advantage of this was less nesting. Having nearly everything in the top-level directory made navigation a breeze. However, it also meant losing the information that the filesystem hierarchy encodes, as src became a hot mess of different compiler bits and pieces glued together in main.rs. The separation of concerns within the codebase suffered too.
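For reference, under this layout main.rs looked something like the following sketch (module names are illustrative; in the real tree each module was its own file under src):

```rust
// main.rs under the (mostly) flat layout: every compiler piece is a
// top-level module. Empty inline modules stand in for real files
// like `src/lexer.rs`; the names are illustrative.
mod token {}
mod lexer {}
mod ast {}
mod parser {}
mod resolve {}
mod typeck {}
mod diagnostics {}

fn main() {
    // The driver code gluing all of the pieces together lived here.
}
```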
I still think this works best for very simple compilers (like the “math compilers” often used to show off expression parsers), but I ditched this structure as my project grew. Notably, though, rustc seems to use a flat structure at the crate level.
Organization by compiler phases
What I ended up going with was organization by compiler phase. At first glance this seems no different from breaking the compiler down into its components, but instead of modularizing individual components, I modularized the codebase by the compilation phase the code is used in. For example, syntax analysis logic and IR would go in one module, semantic analysis in another, and code generation in a third. There would also be a few other modules for the driver code, compiler configuration, a compiler API, shared utilities like string interning and error reporting, etc.
The structure within these big modules was mostly flat, with IR and logic simply sitting side by side. In fact, migrating from the flat structure to this one is pretty trivial, akin to sorting files into folders by category. This kept the structure relatively unnested compared to the original approach, without being as confusing as a completely flat one. In the future, if these modules continue to grow, I plan to use a Cargo workspace instead, promoting the phase modules to entire crates and improving compilation speed and encapsulation.
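A sketch of this layout, again with inline modules standing in for files and directories (all names hypothetical): each phase module owns both its IRs and its logic, kept mostly flat inside.

```rust
// Phase-based layout: each top-level module corresponds to a
// compilation phase and contains both its IR and its logic. Empty
// inline modules stand in for real files; the names are illustrative.
mod syntax {
    pub mod token {}  // lexer output
    pub mod lexer {}
    pub mod ast {}    // parser output
    pub mod parser {}
}
mod semantics {
    pub mod hir {}     // name-resolved IR
    pub mod resolve {}
    pub mod typeck {}
}
mod codegen {}
mod config {}          // compiler configuration
mod util {
    pub mod intern {}      // string interning
    pub mod diagnostics {} // error reporting
}

fn main() {
    // Driver: read the configuration, then run the phases in order.
}
```

If the phase modules keep growing, each top-level module here maps naturally onto its own crate in a Cargo workspace.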
In my opinion, this structure captures most of the advantages of the previous approaches, such as proper separation of concerns and simple navigation, while incorporating fewer of the disadvantages.
Conclusion
In the time I’ve spent working on my programming language, I’ve learned to appreciate the importance of details like modularity and structure when programming at a larger scale. I hope the findings in this article will be useful to anyone else trying to write a compiler!