At their core, compilers transform source code into an executable file. But programming languages and developer tools have evolved considerably over time, and compilers must keep up with them.
Developer tools & LSP
Nowadays, most developers use an IDE or code editor with language intelligence, for example Visual Studio or Xcode. To help tooling developers build this language support (and to prevent the combinatorial explosion of each language needing a separate plugin for each editor), Microsoft created the Language Server Protocol (LSP), which defines a standard interface between editors and language servers. Now, language authors can implement the LSP specification once and support all LSP-capable IDEs and code editors at the same time!
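To make this concrete, here is roughly what one LSP exchange looks like on the wire. The sketch below (in Rust, using the serde_json crate; the file URI and cursor position are made up for illustration) builds a textDocument/hover request, which an editor sends to ask the language server what is under the cursor:

```rust
use serde_json::json;

fn main() {
    // A `textDocument/hover` request from the LSP specification: the editor
    // asks the language server for information (types, documentation) about
    // whatever sits at a given position in a file.
    let request = json!({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "textDocument/hover",
        "params": {
            "textDocument": { "uri": "file:///home/user/project/main.rs" },
            "position": { "line": 4, "character": 12 }
        }
    });

    // LSP frames each JSON-RPC message with a Content-Length header.
    let body = request.to_string();
    println!("Content-Length: {}\r\n\r\n{}", body.len(), body);
}
```

The editor never needs to know how the server computes the answer; any server that speaks this protocol works with any editor that speaks it.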
However, creating a completely separate language server for a programming language is still significant extra effort. IDEs and compilers innately perform similar work, but they are traditionally designed in very different ways, which makes it hard to reuse one for the other. For example, IDEs need to parse and re-parse code extremely quickly, and must still provide helpful information in the face of syntax errors (when the programmer is still typing).
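As a sketch of what “helpful in the face of syntax errors” means: rather than aborting at the first bad token, an IDE-grade parser records an error node and keeps going, so the rest of the file can still be analysed. The token and node types below are invented for illustration, not any particular compiler’s implementation:

```rust
#[derive(Debug)]
enum Expr {
    Number(i64),
    // The parser couldn't make sense of a token; instead of giving up, it
    // records an error node and carries on with the rest of the input.
    Error(String),
}

/// Parse a list of numbers, recovering from any malformed token.
fn parse_list(tokens: &[&str]) -> Vec<Expr> {
    tokens
        .iter()
        .map(|tok| match tok.parse::<i64>() {
            Ok(n) => Expr::Number(n),
            // Error recovery: emit a placeholder so later stages (type
            // checking, completions) still see the surrounding code.
            Err(_) => Expr::Error(format!("expected number, found `{tok}`")),
        })
        .collect()
}

fn main() {
    // The programmer is mid-edit: the second element is not finished yet.
    let ast = parse_list(&["1", "oops", "3"]);
    println!("{ast:?}"); // [Number(1), Error(...), Number(3)]
}
```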
Other tools, like code formatters, may need information about the syntax tree that a compiler would normally discard, such as whitespace and comments. Or a documentation generator like Doxygen or rustdoc might need the special “documentation comments” that the compiler would normally throw away.
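One common answer, used by syntax libraries such as Roslyn and rust-analyzer’s rowan, is a lossless (or “full-fidelity”) syntax tree, in which every token also carries the trivia around it. A minimal sketch of the idea, with names invented for illustration:

```rust
/// A token in a lossless syntax tree: alongside the text the compiler cares
/// about, it keeps the surrounding "trivia", so the original source can be
/// reconstructed byte-for-byte.
struct Token {
    /// Whitespace and comments before the token, e.g. a doc comment.
    leading_trivia: String,
    /// The token itself, e.g. "fn" or "my_variable".
    text: String,
    /// Trivia up to the start of the next token.
    trailing_trivia: String,
}

impl Token {
    /// Because nothing was discarded, a formatter or documentation
    /// generator can recover the exact original source.
    fn to_source(&self) -> String {
        format!("{}{}{}", self.leading_trivia, self.text, self.trailing_trivia)
    }
}

fn main() {
    let tok = Token {
        leading_trivia: "/// Adds two numbers.\n".to_string(),
        text: "fn".to_string(),
        trailing_trivia: " ".to_string(),
    };
    print!("{}", tok.to_source());
}
```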
If we want to write language toolchains that not only compile code, but also provide IDE diagnostics and power sophisticated developer tooling, we need to update our compilation model to serve all of these purposes.
Our outdated compilation model
Traditionally, compilers were implemented with a pipeline architecture: this typically began with lexing and parsing (syntax analysis), then name resolution, type checking, and so on, eventually ending with code generation. Each stage transforms some representation of the source code (an intermediate representation, or IR) and hands the result to the next: the parser converts the source text into an AST, which is fed into name resolution, which builds the symbol table used for type checking, and so on.
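In code, this pipeline is essentially a chain of functions, each consuming the previous stage’s output wholesale. The stages below are empty stubs purely for illustration; real compilers have many more:

```rust
// A caricature of the classic batch pipeline: every stage consumes the
// previous stage's entire IR and produces the next one.
struct Ast;
struct SymbolTable;
struct TypedAst;
struct MachineCode;

fn parse(_source: &str) -> Ast { Ast }                      // syntax analysis
fn resolve_names(_ast: &Ast) -> SymbolTable { SymbolTable } // symbol table
fn type_check(_ast: Ast, _symbols: &SymbolTable) -> TypedAst { TypedAst }
fn codegen(_typed: TypedAst) -> MachineCode { MachineCode }

fn compile(source: &str) -> MachineCode {
    let ast = parse(source);
    let symbols = resolve_names(&ast);
    let typed = type_check(ast, &symbols);
    codegen(typed)
}

fn main() {
    let _binary = compile("fn main() {}");
}
```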
This approach, known as batch compilation, works great for just compiling code, but is subpar when it comes to IDE integration or developer tooling.
One problem is that small changes to the source code lead to a lot of duplicated work, because the compiler throws away everything it knows and starts over from scratch. This obviously isn’t ideal when the IDE requests fresh error diagnostics and type signatures every time the programmer presses a key.
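A toy simulation of that interaction, with a stand-in compile function invented for illustration:

```rust
/// Stand-in for the whole batch pipeline; a real compiler would lex, parse,
/// resolve names, and type check the entire program here.
fn compile(source: &str) -> Vec<String> {
    source
        .lines()
        .filter(|line| line.contains("oops"))
        .map(|line| format!("error: unexpected `oops` in `{line}`"))
        .collect()
}

fn main() {
    // Simulate a programmer typing: after *every* keystroke the IDE asks for
    // fresh diagnostics, and the batch model recompiles the whole buffer.
    let mut buffer = String::new();
    for ch in "let x = oops;".chars() {
        buffer.push(ch);
        let diagnostics = compile(&buffer); // full recompile, every time
        println!("{} diagnostic(s) after typing {ch:?}", diagnostics.len());
    }
}
```

Almost all of the work between two keystrokes is identical, yet none of it is reused.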
Another issue, as described earlier, is that syntax analysis in compilers isn’t flexible enough to support a wide range of tooling. Compilers normally throw away any information from the source code that they don’t need, to avoid tediously skipping over irrelevant details at every later stage. However, it is precisely these intricate syntax details that are most useful to many kinds of developer tools.
This has led to the common problem of maintaining many parallel implementations of the same compiler components. For example, a language might end up with multiple parsers: one in the compiler that discards trivia and produces the AST, another that carefully tracks the span of every syntax node for an IDE, another optimized for fast syntax highlighting, and so on. All this duplication increases engineering effort and complexity, and hurts maintainability.
Modern compiler design
As languages, compilers, and development tools advance, the issues above cause significantly more problems for language developers. What we have learnt from the literature is no longer enough to meet the expectations of the modern programmer; we need to find a way to keep up with this evolution, while keeping engineering effort manageable.
In subsequent posts, I want to explore the different aspects of compiler design in a more modern age.