Jacob Lee

Jacob Lee Blog	Version	v0.0.4
Updated	11.27.2024
Author	Jacob Lee

Writing a Compiler Day 1

There’s this website https://compilers.iecc.com/crenshaw/. Ideally I can use this as a resource to write my own compiler for my own language… Maybe not a production ready language but to learn these concepts will be worth the challenge!

The author, Crenshaw, writes all of the code in Turbo Pascal which I have no interest in doing.

I will be attempting to complete this project in C on Windows, using as much of the native Win32 API as I possibly can.

The reason I am choosing to follow this tutorial series is this quote right here:

“I intend to completely ignore the more theoretical aspects of the subject. What you WILL know is all the practical aspects that one needs to know to build a working system.”

I have started reading through three or four different books on compilation, and as soon as I see any thing start to wade in theory I start to lose interest. Not that theory is not important (in fact I have a ton of respect for the theory programming languages), but wading through tons of theory just to get a working program is not very much fun. And programming should be fun!

The author starts with the following to get us started:

{--------------------------------------------------------------}
program Cradle;

{--------------------------------------------------------------}
{ Constant Declarations }

const TAB = ^I;

{--------------------------------------------------------------}
{ Variable Declarations }

var Look: char;              { Lookahead Character }
                              
{--------------------------------------------------------------}
{ Read New Character From Input Stream }

procedure GetChar;
begin
   Read(Look);
end;

{--------------------------------------------------------------}
{ Report an Error }

procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;


{--------------------------------------------------------------}
{ Report Error and Halt }

procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;


{--------------------------------------------------------------}
{ Report What Was Expected }

procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;

{--------------------------------------------------------------}
{ Match a Specific Input Character }

procedure Match(x: char);
begin
   if Look = x then GetChar
   else Expected('''' + x + '''');
end;


{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;
begin
   IsAlpha := upcase(c) in ['A'..'Z'];
end;
                              

{--------------------------------------------------------------}

{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;


{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: char;
begin
   if not IsAlpha(Look) then Expected('Name');
   GetName := UpCase(Look);
   GetChar;
end;


{--------------------------------------------------------------}
{ Get a Number }

function GetNum: char;
begin
   if not IsDigit(Look) then Expected('Integer');
   GetNum := Look;
   GetChar;
end;


{--------------------------------------------------------------}
{ Output a String with Tab }

procedure Emit(s: string);
begin
   Write(TAB, s);
end;




{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;

{--------------------------------------------------------------}
{ Initialize }

procedure Init;
begin
   GetChar;
end;


{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
end.
{--------------------------------------------------------------}

As someone who has never written a single line of Pascal in their life, my original thought is… “What in the hell am I looking at?” But after looking at it for a couple seconds, I can see that this boiler plate is quite terse, yet descriptive.

Below will be my translation to C:

We’ll start with cradle.h

/** Date :: November 22 2024 10:01 AM **/

#ifndef _CRADLE_H
#define _CRADLE_H


/**
* Structs [CRADLE]
*
*/

/**
* Macro|Enum|Variable Declarations [CRADLE]
*
*/
// Macro
#define UPPERCASE(C) (~(1<<5) & (C)) // From lotabout -- macro to convert character to upper case
#define MAX_BUFF_LEN 64

// Variable
global char Look; // Look ahead character

/**
* Prototypes [CRADLE]
*
*/
// Error reporting
function void error(char *str); // Report an error
function void lake_abort(char *str); // report error and halt
function void expected(char *str); // What was expected

// Character and string stuff
function void get_char(void); // Read new character from input stream
function void match(char c); // Match a specific character input
function bool is_alpha(char c); // Check if character is alphabetical
function bool is_digit(char c); // Check if character is numerical
function char get_name(void); // Get an identifier
function char get_num(void); // Get a number
function void emit(char *str); // Output a string with tab
function void emit_ln(char *str); // Output a string with tab and CRLF

// Initialize
function void init(void); // Intialize

#endif // _CRADLE_H

Then we have our cradle.c

/** Date :: November 22 2024 10:01 AM **/

/**
* Implementations [CRADLE]
*
*/
// Error reporting
function void 
error(char *str)
{
   fprintf (stderr, "\nEncountered error: %s", str);
}

function void 
lake_abort(char *str)
{
   error (str);
   exit (1);
}

function void 
expected(char *str)
{
   char buff[MAX_BUFF_LEN];
   if (sprintf_s(buff, sizeof(buff), "%s Expected", str) > 0)
   {
      lake_abort (buff);
   } else
   {
      lake_abort ("Failed to format 'expected()' message");
   }
}


// Character and string stuff
function void 
get_char(void)
{
   Look = (char)getchar();
}

function void 
match(char c)
{
   if (Look == c)
   {
      get_char();
   } else
   {
      char buff[MAX_BUFF_LEN];
      if (sprintf_s(buff, sizeof(buff), "' %c '", c) > 0)
      {
         expected(buff);
      } else
      {
         expected("Failed to format 'match()' message");
      }
   }
}

function bool 
is_alpha(char c)
{
   return (UPPERCASE(c) >= 'A') && (UPPERCASE(c) <= 'Z');
}

function bool 
is_digit(char c)
{
   return (c >= '0') && (c <= '9');
}

function char 
get_name(void)
{
   char c = Look;

   if (!is_alpha(Look))
   {
      expected("Name");
   }

   get_char();

   return UPPERCASE(c);
}

function char 
get_num(void)
{
   char c = Look;

   if (!is_digit(Look))
   {
      expected("Integer");
   }

   get_char();

   return c;
}

function void
emit(char *str)
{
   printf("\t%s", str);
}

function void
emit_ln(char *str)
{
   emit(str);
   printf("\r\n");
}


// Initialize
function void 
init(void)
{
   get_char();
}

And then finally our main.c for our main program.

// Base (Cotains all the includes and typedefs that we might need)
#include "base.h"

// Dependencies [h]
#include "cradle.h"

// Core [h/c]

// Dependencies [c]
#include "cradle.c"

// Disable unused variable warnings for now
#pragma warning( push )
#pragma warning( disable : 4100 )
int
main(int argc, char *argv[])
#pragma warning( pop )
{
   init();
}

The above structure looks a little different compared to what some people might be used to. But this is how I keep my compilation speeds fast for large projects.

The #include preprocessor directive is basically just copy/paste, so if we keep the order correct we can compile the entire projects into a single translation unit! This allows us to keep things neat (or as neat as they can be) without compromising on compilation time.

My general workflow is edit -> build -> run, and when compilation times exceed anything longer than 10 seconds, I start to wonder about the general architecture of the program I am working on.

And that’s pretty much it for now until the next chapter!

Writing a Compiler Day 2, 3, and 4

So I have been chipping away at math expression parsing and code generation for those expressions.

This was a fun little challenge to get the correct MASM code generated!

(Crazy that I’m actually learning Pascal while doing this lol)

We started with the simple add and subtract functions (or Procedures in Pascal)

{---------------------------------------------------------------}
{ This is the version that handles any number of terms! }
{ Parse and Translate an Expression }

procedure Expression;
begin
   Term;
   while Look in ['+', '-'] do begin
      EmitLn('MOVE D0,D1');
      case Look of
       '+': Add;
       '-': Subtract;
      else Expected('Addop');
      end;
   end;
end;
{--------------------------------------------------------------}


Next, just above Expression enter these two procedures:


{--------------------------------------------------------------}
{ Recognize and Translate an Add }

procedure Add;
begin
   Match('+');
   Term;
   EmitLn('ADD D1,D0');
end;


{-------------------------------------------------------------}
{ This is the 'correct' version }
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB D1,D0');
   EmitLn('NEG D0');
end;
{-------------------------------------------------------------}

And here is my version in C:

function void 
expression(void)
{
   if (is_add_op(Look))
   {
      emit_ln("xor rax, rax");
   } else
   {
      term();
   }

   while (is_add_op(Look))
   {
      emit_ln("push rax");
      switch (Look)
      {
         case ('+'):
         {
            add();
         } break;
         case ('-'):
         {
            subtract();
         } break;
         default:
         {
            expected("Addop");
         }
      }
   }
}

// Binary/Math Operations
function void
add(void)
{
   match('+');
   term();
   emit_ln("pop rbx");
   emit_ln("add rax, rbx");
}

function void
subtract(void)
{
   match('-');
   term();
   emit_ln("pop rbx");
   emit_ln("sub rax, rbx");
   emit_ln("neg rax");
}

Now we have mulitplication and division or ‘mulops’:

{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Factor;
begin
   EmitLn('MOVE #' + GetNum + ',D0')
end;


{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }

procedure Multiply;
begin
   Match('*');
   Factor;
   EmitLn('MULS (SP)+,D0');
end;


{-------------------------------------------------------------}
{ Recognize and Translate a Divide }

procedure Divide;
begin
   Match('/');
   Factor;
   EmitLn('MOVE (SP)+,D1');
   EmitLn('DIVS D1,D0');
end;


{---------------------------------------------------------------}
{ Parse and Translate a Math Term }

procedure Term;
begin
   Factor;
   while Look in ['*', '/'] do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '*': Multiply;
       '/': Divide;
      else Expected('Mulop');
      end;
   end;
end;

And now my version in C:

function void
multiply(void)
{
   match('*');
   factor();
   emit_ln("pop rbx");
   emit_ln("mul rbx");
}

function void
divide(void)
{
   match('/');
   factor();
   emit_ln("push rax");
   emit_ln("pop rbx");
   emit_ln("pop rax");
   emit_ln("xor rdx, rdx");
   emit_ln("div rbx");
}

function void 
term(void)
{
   factor();
   while (is_mul_op(Look))
   {
      emit_ln("push rax");
      switch (Look)
      {
         case ('*'):
         {
            multiply();
         } break;
         case ('/'):
         {
            divide();
         } break;
         default:
         {
            expected("Mulop");
         }
      }
   }
}

Next we need to handle one teensy little thing when it comes to parsing math expressions: Parentheses!

This will allow us to handle any arbitrarily complex math expression!

First the Pascal version:


{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else
      EmitLn('MOVE #' + GetNum + ',D0');
end;
{--------------------------------------------------------------}

And now for my C version:

function void 
factor(void)
{
   if (Look == '(')
   {
      match('(');
      expression();
      match(')');
   } else
   {
      char c = get_num();
      if (sprintf_s(buffer, sizeof(buffer), "mov rax, %c", c) > 0)
      {
         emit_ln(buffer);
      }
      MemoryZero(buffer);
   }
}

Lastly we need to bne able handle the unary minus (negating the sign of a number).

We have Crenshaw’s version in Pascal:

{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   if IsAddop(Look) then
      EmitLn('CLR D0')
   else
      Term;
   while IsAddop(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '+': Add;
       '-': Subtract;
      else Expected('Addop');
      end;
   end;
end;
{--------------------------------------------------------------}

{--------------------------------------------------------------}
{ Recognize an Addop }

function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;
{--------------------------------------------------------------}

And now mine in C:

function void 
expression(void)
{
   if (is_add_op(Look))
   {
      emit_ln("xor rax, rax");
   } else
   {
      term();
   }

   while (is_add_op(Look))
   {
      emit_ln("push rax");
      switch (Look)
      {
         case ('+'):
         {
            add();
         } break;
         case ('-'):
         {
            subtract();
         } break;
         default:
         {
            expected("Addop");
         }
      }
   }
}

function bool
is_add_op(char c)
{
   return (StrChrA("+-", c));
}

function bool
is_mul_op(char c)
{
   return (StrChrA("*/", c));
}

Now… Anyone looking at this code who knows even a little bit of assembly can see that the code our parser is generating is way less efficient than code we could just write by hand.

As Crenshaw puts it: “Get used to it.”

This series of tutorials was written in the late 80’s, but as far as I know this still rings true today. Unless we devote A LOT of time writing our own optimizer or relying on a third-party backend like LLVM, we will never ever generate hyper optimized output code… But that doesn’t mean we’re out of the race yet.

Crenshaw mentions another technique that he “invented” (at least he hadn’t seen it mentioned in writing at the time of writing (1988)), but regardless it’s a cool idea!

Since we own the parser, we know exactly what code is being generated when we are parsing expression. This in turn means that at any point during processing, the parser also knows how many items are on the stack as well! We can use a privately managed “stack pointer” (not the built-in RSP) to keep track of what level we are on and what addresses the corresponding register.

Instead of using main memory, we will use all the available registers to create this privately managed stack.

I am generating MASM code for this compiler, so I will have access to 16 possible registers for use. Ideally we won’t have any expressions that will exceed 16, but if and when we do, then we just let it spill over into main memory.

For now these optimizations will be left out, but keep this idea in mind since we are going to implement it later!