How and why I wrote a transpiler

I didn't want to. It's one of those questionable decisions that invites attack, like the horrified reactions when the makers of FogBugz revealed they had created their own Basic dialect, Wasabi. I would keep this as my dirty little secret, but @maxdesiatov was interested, so here we go!

So how did it happen?

I had an idea for a new type of problem-solving tool. A bit like a spreadsheet, but based on a knowledge graph where facts would not just be asserted but would be the outcome of interconnected decisions and reasoning. The idea was that as you adjust your thinking, the consequences would ripple through the graph, making it easy to handle mind-melting complexity.

To prototype the idea, I reached for my go-to language, C#:

  1. Multi-paradigm - OO, functional, dynamic

  2. No sharp corners e.g. generics, good basic types and libraries

  3. Exceptional tooling e.g. debuggers, intellisense, refactoring, edit-continue, profilers, coverage, unit testing etc.

  4. Cross-platform and open source

I cranked out a prototype of the knowledge graph and calculation engine in 15k LOC. It was a throwaway exploration that I wanted to hook up to a web front-end. TypeScript had recently been released and, thanks to sharing the same designer, was so similar to C# that a conversion would be little more than mangling the format of method signatures and types, rather than a rewrite. It could almost be done with a few regexes. No programmer can face doing 15k braindead mechanical tasks converting A into B... so I decided to write a quick disposable code generator to perform a one-time TypeScript conversion.

// A. C#
public class Atom : IAtom {
    public bool AreEqual(IAtom atom) { return this == atom; }
}

// B. TypeScript
export class Atom implements IAtom {
    public AreEqual(atom : IAtom) : boolean { return this === atom; }
}

To my surprise, the conversion was so good that I no longer felt I had to throw away the C# and could continue to enjoy all the C# tooling benefits I had. At the time, TypeScript was very new and, although some editors claimed TypeScript syntax support, they were, shall we say, “generous” about its quality.

How?

In the past I had looked at “run C# in the browser” tools and found them quite terrifying. Tools like Saltarelle are impressive feats - they convert the underlying .NET IL (the portable assembly code) into JavaScript. However, I couldn’t get what they generated to run, and it was not debuggable. Worse, they required major changes to the C#, such as adding markup to control the generation and replacing base class libraries with their special alternatives. This upset the tools I was using. A little code generator couldn’t be too hard, could it?

The code generator

I’ve created many code generators over the years. The source data has included XML, spreadsheets, APIs, wiring diagrams and reflection over object libraries. I’ve generated code using many tools, including templating languages like T4 and XSLT. Templates are great if you have 90% boilerplate and are only injecting minor changes. However, if you want to generate mostly new content then templates are a hindrance, and building up strings in a standard programming language is much less painful.

A third path is some form of AST library like CodeDom, where you build up an OO representation of method signatures and such. The last one I tried was so heavyweight, verbose and ugly that you might need a dozen lines for a simple method signature. Such approaches feel “correct” in some respects, e.g. you can’t generate invalid code and don’t have to worry about whitespace, but the practical reality is so painful compared to squirting out a few strings that I can’t recommend it. If you generate a string, you can see your output.

The parser

When you settle on a “home” language for general purpose development, you should definitely ensure it has a high quality parser library. It gives you tremendous power over your code to build your own refactorings, extract info and will also ensure that your language ecosystem includes good tooling.

I started with NRefactory, used by the Mono project, and had no complaints, but later switched to Roslyn when it became available for .NET Core. They were pretty similar. If anything, Roslyn was a little uglier to use, but thanks to a few C# extension methods I could smooth out those sharp corners.

The conversion

Walking the code

The basic converter needs to walk through the existing code and spit out the equivalent version in the new language. Parsers often support some type of visitor access pattern. This is elegant but it also minimises the level of control you have. Like a lot of design ideas, “elegant with low control” is much worse than “dumb with high control”. Visitors are great to start with but inevitably paint you into a corner where you need to escape the confines of the elegance. So for transpiling, I find it much more sensible to drill into the AST from the outside - this lets you convert chunks and use context without any constraint of a visitor.

Writing a transpiler this way is fairly obvious e.g.

string Convert(string csharp);
void ConvertClass(StringBuilder sb, ClassDeclarationSyntax typeDecl);
void ConvertMethod(StringBuilder sb, MethodDeclarationSyntax methodDecl);
void ConvertType(StringBuilder sb, TypeSyntax type);
void ConvertEnum(StringBuilder sb, EnumDeclarationSyntax typeDecl);
void ConvertStatement(string indent, StringBuilder sb, StatementSyntax statement);
void ConvertExpression(string indent, StringBuilder sb, ExpressionSyntax expression);

As you would expect, ConvertClass calls ConvertMethod for each method, which in turn calls ConvertStatement / ConvertExpression to process the actual code. You want to use a class designed for efficiently joining huge strings, such as StringBuilder, because boy, there is going to be an awful lot of appending happening!

The statement and expression conversion use the sophisticated design pattern called “big fucking if statement”. I implemented the minimum number of syntax elements needed and dropped in new conversions as required. It currently stands at about 20 types of statement and 30 types of expression. In the majority of cases, conversion is a slight reordering or some such, e.g. C# base class members are referenced using “base.thing” whereas TypeScript uses “super.thing” - this means that it all looks like this:

static void Convert(string indent, StringBuilder sb, BaseExpressionSyntax baseExpr)
{
    sb.Append("super");
}

The only notable thing for programmers new to the topic is that expression conversion is highly recursive because expressions are trees of other expressions. For a binary operator, it’s just a case of recursing into ConvertExpression for both the left and right operands. No matter how complicated a passage of code is, each expression converter is only something like:

static void Convert(string indent, StringBuilder sb, MemberAccessExpressionSyntax memberExpr)
{
    ConvertExpression(indent, sb, memberExpr.Expression);
    sb.Append(".");
    ConvertSimpleName(indent, sb, memberExpr.Name);
}
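To make the shape of that recursion concrete, here is a minimal sketch written in TypeScript rather than lifted from the real C# converter. The Expr node shapes, operatorMap and convertExpression are hypothetical names invented for this example; Roslyn's real syntax classes are far richer, and this covers only four kinds of expression.

```typescript
// Hypothetical mini-AST: a stand-in for Roslyn's ExpressionSyntax hierarchy.
type Expr =
  | { kind: "base" }                                         // C# `base`
  | { kind: "name"; text: string }                           // identifier
  | { kind: "member"; target: Expr; name: string }           // target.name
  | { kind: "binary"; left: Expr; op: string; right: Expr }; // left op right

// C# operators whose TypeScript spelling differs (assumed subset).
const operatorMap = new Map<string, string>([["==", "==="], ["!=", "!=="]]);

// The "big if statement": one branch per node kind, recursing into children.
function convertExpression(out: string[], expr: Expr): void {
  if (expr.kind === "base") {
    out.push("super");
  } else if (expr.kind === "name") {
    out.push(expr.text);
  } else if (expr.kind === "member") {
    convertExpression(out, expr.target);
    out.push(".", expr.name);
  } else if (expr.kind === "binary") {
    convertExpression(out, expr.left);
    const mapped = operatorMap.get(expr.op);
    out.push(" ", mapped !== undefined ? mapped : expr.op, " ");
    convertExpression(out, expr.right);
  } else {
    // Exhaustiveness check: a new node kind fails to compile until handled.
    const unreachable: never = expr;
    throw new Error(`Unsupported expression: ${JSON.stringify(unreachable)}`);
  }
}

function convert(expr: Expr): string {
  const out: string[] = [];
  convertExpression(out, expr);
  return out.join("");
}

// C# input: base.Count == other.Count
const sample: Expr = {
  kind: "binary",
  left: { kind: "member", target: { kind: "base" }, name: "Count" },
  op: "==",
  right: { kind: "member", target: { kind: "name", text: "other" }, name: "Count" },
};
const converted = convert(sample); // "super.Count === other.Count"
```

Collecting fragments in an array and joining once plays the same role as StringBuilder in the C# version.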

Syntax

Conversion is hard if you are converting between language paradigms, which requires you to generate run-time infrastructure to support high-level language features, but C# and TypeScript share most high-level features like OO, automatic memory management, generics, async/await and lambdas. If you have to generate memory management support then I think you are not in transpiler land anymore, Dorothy; you are now writing a compiler and run-time.

Types

Generics make type conversions a bit like expressions. You have to recurse down the hierarchy of generic parameters to generate the appropriate replacement.

Typing is where some run-time differences between C# and TypeScript are inevitable, e.g. C# has a broad range of numeric types like unsigned short and 32-bit floats, but TypeScript only has the 64-bit number type. Fortunately, TypeScript is gradually typed, which means there is little fear of getting stuck in an awkward typing dead-end because you can just drop back and use any.

So I maintained a list of types, e.g. all numeric types map to “number”:

C#              TypeScript
double          number
int             number
char            string
bool            boolean
Action          () => void
object          any
dynamic         any
ExpandoObject   any
Delegate        any
object[]        any[]
Task            Promise
Guid            string
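Combining the type list with the recursion over generic parameters might look something like the sketch below. TypeRef, simpleTypes and convertType are hypothetical names for this example, and only a handful of entries from the list are included; unknown names are assumed to pass through unchanged.

```typescript
// Hypothetical type node: a name plus optional generic arguments.
interface TypeRef {
  name: string;
  args?: TypeRef[];
}

// A few entries from the list above; anything not listed keeps its name.
const simpleTypes = new Map<string, string>([
  ["double", "number"],
  ["int", "number"],
  ["char", "string"],
  ["bool", "boolean"],
  ["object", "any"],
  ["dynamic", "any"],
  ["Guid", "string"],
  ["Task", "Promise"],
]);

// Map the outer name, then recurse into each generic argument.
function convertType(type: TypeRef): string {
  const mapped = simpleTypes.get(type.name);
  const name = mapped !== undefined ? mapped : type.name;
  if (!type.args || type.args.length === 0) {
    return name;
  }
  const args = type.args.map((arg) => convertType(arg));
  return `${name}<${args.join(", ")}>`;
}

// C#: Task<Dictionary<Guid, List<int>>>
const nested: TypeRef = {
  name: "Task",
  args: [{
    name: "Dictionary",
    args: [{ name: "Guid" }, { name: "List", args: [{ name: "int" }] }],
  }],
};
const convertedType = convertType(nested);
// "Promise<Dictionary<string, List<number>>>"
```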

Base Class Library

I started out assuming the C# code should look like the desired TypeScript, so it called system functions that resembled the JavaScript standard library. As I found more uses for the C# on the server and in native tools, this approach reversed and I instead created TypeScript equivalents for basic parts of the C# base class library. Replicating the whole base class library would be an insanely big task, but it's barely noticeable when you just need to add a couple of methods each week. The differences are rarely more than naming, e.g. you have string.toUpperCase in TypeScript and string.ToUpper in C# - you have to choose which way your code looks. I could have added method name transformations to the transpiler, but that wouldn’t cover the cases where there is more work to do, so I left that to support libraries.

The main classes I have covered so far are: String, DateTime, TimeSpan, List, WebSocket, Http, Linq, Timers.

Interfaces from the web world that I use on the C# side include: React, jQuery, History.
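As an illustration of what one of those TypeScript equivalents can look like, here is a small sketch of a TimeSpan stand-in. It is not the real support library: it assumes a millisecond-based representation and covers only a few members whose names match C#'s System.TimeSpan (FromSeconds, FromMinutes, TotalSeconds, TotalMinutes, Add).

```typescript
// Hypothetical TypeScript stand-in for a small slice of C#'s System.TimeSpan.
class TimeSpan {
  // Assumption: store the duration as milliseconds in a plain number.
  private constructor(private readonly milliseconds: number) {}

  static FromSeconds(seconds: number): TimeSpan {
    return new TimeSpan(seconds * 1000);
  }

  static FromMinutes(minutes: number): TimeSpan {
    return new TimeSpan(minutes * 60000);
  }

  get TotalSeconds(): number {
    return this.milliseconds / 1000;
  }

  get TotalMinutes(): number {
    return this.milliseconds / 60000;
  }

  // Immutable, like the C# original: returns a new instance.
  Add(other: TimeSpan): TimeSpan {
    return new TimeSpan(this.milliseconds + other.milliseconds);
  }
}

const total = TimeSpan.FromMinutes(1).Add(TimeSpan.FromSeconds(30));
// total.TotalSeconds is 90, total.TotalMinutes is 1.5
```

Keeping the C# member names means the transpiled call sites need no renaming at all.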

The compromises

So where is the GitHub project for this masterpiece? As much as it meets my needs, using it means accepting many compromises and gotchas that I can’t in good conscience inflict on anyone else:

No value types (stack allocated objects)

Some fundamental types like DateTime are value types in C#. This gives them better performance but, more importantly, they have semantic differences, e.g. they are immutable, passed by copy, and two DateTimes created with new() for the same value compare as equal under the == operator. If you implement DateTime as a class, reference equality is no longer the same as value equality.
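A tiny sketch of the difference, using a hypothetical class-based DateTime with a made-up ticks field and an Equals method:

```typescript
// Hypothetical class-based DateTime, as a value type might look once it
// becomes an ordinary TypeScript class.
class DateTime {
  constructor(public readonly ticks: number) {}

  // Value equality has to be an explicit method call.
  Equals(other: DateTime): boolean {
    return this.ticks === other.ticks;
  }
}

const first = new DateTime(1000);
const second = new DateTime(1000);

const sameReference = first === second;  // false: two distinct objects
const sameValue = first.Equals(second);  // true: same underlying value
```

Any C# code that compares such values with == therefore needs to call Equals instead if it is to behave the same on both sides.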

No overloaded methods

TypeScript doesn’t support them in the C# sense: a method can declare several signatures but has only one implementation. It would be possible for the converter to rename overloads and change their names everywhere they are called, but that level of code change was beyond the scope of what I wanted to do, so I just suck it up and give each method a unique name. As a funny consequence, it has helped me notice how often overloads stem from indecision and inconsistency in the use of datatypes, and that they can be avoided with simplification.
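For comparison, this sketch shows both options in TypeScript: unique method names, and TypeScript's own overload signatures, which all share a single implementation that has to branch at run time. The Logger class and its methods are hypothetical.

```typescript
class Logger {
  readonly lines: string[] = [];

  // Option 1: unique names in place of C# overloads
  // Write(string) and Write(string, int).
  WriteText(text: string): void {
    this.lines.push(text);
  }

  WriteRepeated(text: string, count: number): void {
    for (let i = 0; i < count; i++) {
      this.lines.push(text);
    }
  }

  // Option 2: TypeScript overload signatures. There is still only one
  // body, so it must inspect its arguments to decide what to do.
  Write(text: string): void;
  Write(text: string, count: number): void;
  Write(text: string, count?: number): void {
    if (count === undefined) {
      this.WriteText(text);
    } else {
      this.WriteRepeated(text, count);
    }
  }
}

const logger = new Logger();
logger.WriteText("a");
logger.WriteRepeated("b", 2);
logger.Write("c");
logger.Write("d", 2);
// logger.lines is ["a", "b", "b", "c", "d", "d"]
```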

Constructor initialisation

In TypeScript, the base class constructor runs before the derived class's field initialisers. In C#, it's the opposite way round: derived field initialisers run before the base constructor. Cue much confusion! My solution - don’t put initialisation code in constructors that is so sensitive that ordering could affect behaviour. It's not such a bad policy, but it was not a good day when I first ran into this.
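A small sketch of the trap, with hypothetical class names: a base constructor calls an overridable method that reads a field initialised in the derived class. In TypeScript that field has not been assigned yet when the base constructor runs; as far as I know, the equivalent C# would see "derived" because C# field initialisers run before the base constructor body.

```typescript
class Base {
  public seenDuringBaseCtor: string;

  constructor() {
    // Calls the derived override before Derived's fields are initialised.
    this.seenDuringBaseCtor = this.describe();
  }

  describe(): string {
    return "base";
  }
}

class Derived extends Base {
  // In TypeScript this initialiser runs only after Base's constructor returns.
  private label: string = "derived";

  describe(): string {
    return `label=${this.label}`;
  }
}

const instance = new Derived();
const duringConstruction = instance.seenDuringBaseCtor; // "label=undefined"
const afterConstruction = instance.describe();          // "label=derived"
```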

This pointer and method references

A gotcha in the JavaScript world is that if you pass an instance method reference as a callback, it doesn’t bind the this pointer, e.g. call(myInstance.MyMethod) will fail if MyMethod refers to this. Instead you have to declare a new closure, e.g. call(() => myInstance.MyMethod()). It's annoying because the C# style is cleaner, but a proper fix has been beyond my goals.
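The gotcha is easy to reproduce in a few lines. Counter and call are hypothetical names; call stands in for any API that accepts a callback.

```typescript
class Counter {
  count = 0;

  Increment(): number {
    this.count += 1;
    return this.count;
  }
}

// Stand-in for any API that accepts a callback.
function call(fn: () => number): number {
  return fn();
}

const counter = new Counter();

// Wrapping the call in a closure keeps `this` bound to counter.
const viaClosure = call(() => counter.Increment());

// bind() is the other common workaround.
const viaBind = call(counter.Increment.bind(counter));

// Passing the bare method reference loses `this`: depending on strict mode
// it either throws or updates the wrong object, but it never increments
// counter itself.
let unboundWorked: boolean;
try {
  unboundWorked = call(counter.Increment) === 3;
} catch {
  unboundWorked = false;
}
```

Note that the TypeScript compiler accepts call(counter.Increment) without complaint, which is what makes this one easy to miss.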

No yield keyword

Or rather, generator support in browsers and JavaScript was not yet ready. In C#, yield makes it very quick and easy to write iterators for implementing LINQ-like functions. Without it, writing such functions requires writing enumerator classes, which is pretty tiresome compared to the elegance of yield.
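To show the difference in effort, here is the same lazy filter written twice in TypeScript: once with a generator (which assumes a compile target and runtime that support them, not the situation described above) and once as a hand-written iterator class. Both where and WhereIterator are hypothetical names.

```typescript
// With generators: a lazy LINQ-style Where in a handful of lines.
function* where<T>(
  source: Iterable<T>,
  predicate: (item: T) => boolean
): IterableIterator<T> {
  for (const item of source) {
    if (predicate(item)) {
      yield item;
    }
  }
}

// Without generators: the same filter as a hand-written iterator class
// that has to track its own position between calls.
class WhereIterator<T> implements Iterator<T> {
  private index = 0;

  constructor(
    private readonly items: T[],
    private readonly predicate: (item: T) => boolean
  ) {}

  next(): IteratorResult<T> {
    while (this.index < this.items.length) {
      const item = this.items[this.index++];
      if (this.predicate(item)) {
        return { done: false, value: item };
      }
    }
    return { done: true, value: undefined };
  }
}

const isEven = (n: number): boolean => n % 2 === 0;

const viaGenerator = Array.from(where([1, 2, 3, 4], isEven));

const viaClass: number[] = [];
const iterator = new WhereIterator([1, 2, 3, 4], isEven);
for (let step = iterator.next(); !step.done; step = iterator.next()) {
  viaClass.push(step.value);
}
// Both arrays contain 2 and 4.
```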

No extension methods

I wouldn’t have thought this would be much of a loss but it has surprised me how useful they are. For the benefit of non-C# developers, they allow you to attach new methods to existing classes without redefining their interfaces/class definitions. You can import them by namespace so you don’t pollute definitions with extensions everywhere, only when you want them. They are also not run-time editing of class definitions, just sugar over static methods. It’s great for helper methods to make a specific task more elegant without requiring the underlying classes to have really fat interfaces. There is no equivalent in TypeScript aside from altering the prototype at run-time.

The killer use that no C# dev can live without is linq and fluent style syntax e.g.

return list
    .Where(x => x.IsVisible)
    .Select(x => x.Thing)
    .OrderBy(x => x.Id);

Without extension methods to add Where/Select/OrderBy to enumerable containers, I had to bake the basic LINQ methods into the containers themselves. It's a hack, but it causes little trouble for typical code.
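A rough idea of what baking the methods into a container means, as a sketch rather than the real library. This List class is hypothetical, is eager rather than lazy, and assumes OrderBy only needs numeric keys.

```typescript
// Hypothetical container with basic LINQ-style methods as instance methods.
class List<T> {
  constructor(private readonly items: T[] = []) {}

  Where(predicate: (item: T) => boolean): List<T> {
    return new List(this.items.filter((item) => predicate(item)));
  }

  Select<R>(selector: (item: T) => R): List<R> {
    return new List(this.items.map((item) => selector(item)));
  }

  // Assumption: numeric keys only; sorts a copy so the source is untouched.
  OrderBy(key: (item: T) => number): List<T> {
    return new List(this.items.slice().sort((a, b) => key(a) - key(b)));
  }

  ToArray(): T[] {
    return this.items.slice();
  }
}

interface Thing { Id: number; }
interface Row { IsVisible: boolean; Thing: Thing; }

const rows = new List<Row>([
  { IsVisible: true, Thing: { Id: 3 } },
  { IsVisible: false, Thing: { Id: 1 } },
  { IsVisible: true, Thing: { Id: 2 } },
]);

// Same fluent shape as the C# example.
const ids = rows
  .Where(x => x.IsVisible)
  .Select(x => x.Thing)
  .OrderBy(x => x.Id)
  .ToArray()
  .map(thing => thing.Id);
// ids is [2, 3]
```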

No strict null mode

C# typing does have many strengths over TypeScript, e.g. TypeScript enums behave very strangely, but TypeScript has advantages too that I miss by writing C#. I really like strict null mode in TypeScript, but in C# it is currently only a proposal.

Variations

String interpolation

It's nice that both TypeScript and C# support it with only slightly different syntax. The compromise is that C# has all sorts of support for formatting the data being injected into the strings. I tend not to use it so haven’t tried to replicate it.

// C#
$"this is data {data}"
// TypeScript
`this is data ${data}`
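For the simple cases, a C# format specifier can usually be replaced by an ordinary method call inside the template literal. This sketch assumes the only format needed is fixed decimal places (C#'s F2), and the variable names are made up.

```typescript
const price = 3.14159;
const caption = "total";

// C#: $"{caption}: {price:F2}"
// TypeScript has no format specifiers, so call a formatting method instead.
const formatted = `${caption}: ${price.toFixed(2)}`;
// formatted is "total: 3.14"
```

One difference to keep in mind is that C# formatting follows the current culture, whereas toFixed always uses a decimal point.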

Math

Every number in JavaScript is floating point, so if you use integer math in the C# world it’s not going to behave the same in TypeScript. It requires some care, but I have unit tests around math code which run in both TypeScript and C#, so I would spot any issues if they occurred in practice. I still find integers useful to assert the restricted domain of the numbers even if I can’t get the size and performance benefits they should have.
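Two of the usual suspects, division and overflow, can be demonstrated in a few lines. The helper names intDivide and toInt32 are hypothetical.

```typescript
// C#: 7 / 2 == 3 when both operands are ints.
// JavaScript numbers are doubles, so the same expression gives 3.5.
const floatingQuotient = 7 / 2;

// Math.trunc rounds toward zero, matching C# integer division.
// (Unlike C#, dividing by zero does not throw here.)
function intDivide(a: number, b: number): number {
  return Math.trunc(a / b);
}

// `| 0` wraps a number to a signed 32-bit integer, similar to unchecked
// C# int arithmetic overflowing.
function toInt32(value: number): number {
  return value | 0;
}

const positiveQuotient = intDivide(7, 2);   // 3
const negativeQuotient = intDivide(-7, 2);  // -3
const wrapped = toInt32(2147483647 + 1);    // -2147483648
```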

String sorting

It's quite interesting to discover that C# and TypeScript have different default sorting rules for strings. It’s quite useful to know these things when using a mix of languages on the front and backend. I bet there are bugs out there where people haven’t realised this.
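The difference comes down to the default comparison. JavaScript's Array.prototype.sort with no comparer orders strings by UTF-16 code unit, so every upper-case ASCII letter sorts before every lower-case one, while C#'s default string comparison is culture-sensitive. A sketch with made-up data:

```typescript
const words = ["apple", "Zebra", "banana"];

// Default sort compares UTF-16 code units: "Z" (90) comes before "a" (97).
const defaultOrder = words.slice().sort();
// ["Zebra", "apple", "banana"]

// localeCompare gives a linguistic ordering, closer to what C# does by
// default; the exact result depends on the runtime's locale data.
const localeOrder = words.slice().sort((a, b) => a.localeCompare(b));
```

If the front end and back end must agree, it is safest to pick one explicit comparison (for example ordinal) and use it on both sides.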

Conclusion

The only disadvantage that would motivate me to move on is that the transpiler generates a single TypeScript file. This is because TypeScript requires manual imports of every class you use. This seems an utterly unnecessary overhead for a modern compiler to foist on the developer, but such is the slow-moving car crash that is JavaScript modules. In C# you only need to import namespaces, i.e. a sensible level of control to manage scopes.

Anyway, the result of this large file is that it can’t be incrementally compiled or support hot swapping for minor changes. This is hardly a big deal in the C# world - my current 80k LOC builds and transpiles in seconds, but the WebPack compilation… good grief. Most of the time is spent by the TypeScript compiler, but Babel and minification take their toll too, so it takes minutes grinding away to produce the JavaScript. A multi-minute build cycle is a brutal development experience. Thankfully the vast majority of my development is test-driven, so I am running unit tests in an IDE and not invoking WebPack, but when there is some idiotic browser issue that requires many experiments, I’m itching for a different solution.

Notes:

Is it a transpiler, compiler, converter, generator?

Some developers feel very strongly about the word transpiler. One argument is that we should stick to the known term “compiler” rather than introduce new language. Having written one, I believe transpiler is a useful word for a fairly simple transliteration of one high-level language into another. The task is often just choosing the equivalent syntax, with only limited code generation. You have to tackle few of the traditional compiler problems, and I would find it misleading to say e.g. “I know how to write a compiler”.