There’s this website https://compilers.iecc.com/crenshaw/. Ideally I can use this as a resource to write my own compiler for my own language… Maybe not a production ready language but to learn these concepts will be worth the challenge!
The author, Crenshaw, writes all of the code in Turbo Pascal which I have no interest in doing.
I will be attempting to complete this project in C on Windows, using as much of the native Win32 API as I possibly can.
The reason I am choosing to follow this tutorial series is this quote right here:
“I intend to completely ignore the more theoretical aspects of the subject. What you WILL know is all the practical aspects that one needs to know to build a working system.”
I have started reading through three or four different books on compilation, and as soon as I see any thing start to wade in theory I start to lose interest. Not that theory is not important (in fact I have a ton of respect for the theory programming languages), but wading through tons of theory just to get a working program is not very much fun. And programming should be fun!
The author starts with the following to get us started:
{--------------------------------------------------------------} program Cradle; {--------------------------------------------------------------} { Constant Declarations } const TAB = ^I; {--------------------------------------------------------------} { Variable Declarations } var Look: char; { Lookahead Character } {--------------------------------------------------------------} { Read New Character From Input Stream } procedure GetChar; begin Read(Look); end; {--------------------------------------------------------------} { Report an Error } procedure Error(s: string); begin WriteLn; WriteLn(^G, 'Error: ', s, '.'); end; {--------------------------------------------------------------} { Report Error and Halt } procedure Abort(s: string); begin Error(s); Halt; end; {--------------------------------------------------------------} { Report What Was Expected } procedure Expected(s: string); begin Abort(s + ' Expected'); end; {--------------------------------------------------------------} { Match a Specific Input Character } procedure Match(x: char); begin if Look = x then GetChar else Expected('''' + x + ''''); end; {--------------------------------------------------------------} { Recognize an Alpha Character } function IsAlpha(c: char): boolean; begin IsAlpha := upcase(c) in ['A'..'Z']; end; {--------------------------------------------------------------} { Recognize a Decimal Digit } function IsDigit(c: char): boolean; begin IsDigit := c in ['0'..'9']; end; {--------------------------------------------------------------} { Get an Identifier } function GetName: char; begin if not IsAlpha(Look) then Expected('Name'); GetName := UpCase(Look); GetChar; end; {--------------------------------------------------------------} { Get a Number } function GetNum: char; begin if not IsDigit(Look) then Expected('Integer'); GetNum := Look; GetChar; end; {--------------------------------------------------------------} { Output a String with Tab } procedure Emit(s: string); begin Write(TAB, s); end; {--------------------------------------------------------------} { Output a String with Tab and CRLF } procedure EmitLn(s: string); begin Emit(s); WriteLn; end; {--------------------------------------------------------------} { Initialize } procedure Init; begin GetChar; end; {--------------------------------------------------------------} { Main Program } begin Init; end. {--------------------------------------------------------------}
As someone who has never written a single line of Pascal in their life, my original thought is… “What in the hell am I looking at?” But after looking at it for a couple seconds, I can see that this boiler plate is quite terse, yet descriptive.
Below will be my translation to C:
We’ll start with cradle.h
cradle.h
/** Date :: November 22 2024 10:01 AM **/ #ifndef _CRADLE_H #define _CRADLE_H /** * Structs [CRADLE] * */ /** * Macro|Enum|Variable Declarations [CRADLE] * */ // Macro #define UPPERCASE(C) (~(1<<5) & (C)) // From lotabout -- macro to convert character to upper case #define MAX_BUFF_LEN 64 // Variable global char Look; // Look ahead character /** * Prototypes [CRADLE] * */ // Error reporting function void error(char *str); // Report an error function void lake_abort(char *str); // report error and halt function void expected(char *str); // What was expected // Character and string stuff function void get_char(void); // Read new character from input stream function void match(char c); // Match a specific character input function bool is_alpha(char c); // Check if character is alphabetical function bool is_digit(char c); // Check if character is numerical function char get_name(void); // Get an identifier function char get_num(void); // Get a number function void emit(char *str); // Output a string with tab function void emit_ln(char *str); // Output a string with tab and CRLF // Initialize function void init(void); // Intialize #endif // _CRADLE_H
Then we have our cradle.c
cradle.c
/** Date :: November 22 2024 10:01 AM **/ /** * Implementations [CRADLE] * */ // Error reporting function void error(char *str) { fprintf (stderr, "\nEncountered error: %s", str); } function void lake_abort(char *str) { error (str); exit (1); } function void expected(char *str) { char buff[MAX_BUFF_LEN]; if (sprintf_s(buff, sizeof(buff), "%s Expected", str) > 0) { lake_abort (buff); } else { lake_abort ("Failed to format 'expected()' message"); } } // Character and string stuff function void get_char(void) { Look = (char)getchar(); } function void match(char c) { if (Look == c) { get_char(); } else { char buff[MAX_BUFF_LEN]; if (sprintf_s(buff, sizeof(buff), "' %c '", c) > 0) { expected(buff); } else { expected("Failed to format 'match()' message"); } } } function bool is_alpha(char c) { return (UPPERCASE(c) >= 'A') && (UPPERCASE(c) <= 'Z'); } function bool is_digit(char c) { return (c >= '0') && (c <= '9'); } function char get_name(void) { char c = Look; if (!is_alpha(Look)) { expected("Name"); } get_char(); return UPPERCASE(c); } function char get_num(void) { char c = Look; if (!is_digit(Look)) { expected("Integer"); } get_char(); return c; } function void emit(char *str) { printf("\t%s", str); } function void emit_ln(char *str) { emit(str); printf("\r\n"); } // Initialize function void init(void) { get_char(); }
And then finally our main.c for our main program.
main.c
// Base (Cotains all the includes and typedefs that we might need) #include "base.h" // Dependencies [h] #include "cradle.h" // Core [h/c] // Dependencies [c] #include "cradle.c" // Disable unused variable warnings for now #pragma warning( push ) #pragma warning( disable : 4100 ) int main(int argc, char *argv[]) #pragma warning( pop ) { init(); }
The above structure looks a little different compared to what some people might be used to. But this is how I keep my compilation speeds fast for large projects.
The #include preprocessor directive is basically just copy/paste, so if we keep the order correct we can compile the entire projects into a single translation unit! This allows us to keep things neat (or as neat as they can be) without compromising on compilation time.
#include
My general workflow is edit -> build -> run, and when compilation times exceed anything longer than 10 seconds, I start to wonder about the general architecture of the program I am working on.
edit -> build -> run
And that’s pretty much it for now until the next chapter!
So I have been chipping away at math expression parsing and code generation for those expressions.
This was a fun little challenge to get the correct MASM code generated!
(Crazy that I’m actually learning Pascal while doing this lol)
We started with the simple add and subtract functions (or Procedures in Pascal)
{---------------------------------------------------------------} { This is the version that handles any number of terms! } { Parse and Translate an Expression } procedure Expression; begin Term; while Look in ['+', '-'] do begin EmitLn('MOVE D0,D1'); case Look of '+': Add; '-': Subtract; else Expected('Addop'); end; end; end; {--------------------------------------------------------------} Next, just above Expression enter these two procedures: {--------------------------------------------------------------} { Recognize and Translate an Add } procedure Add; begin Match('+'); Term; EmitLn('ADD D1,D0'); end; {-------------------------------------------------------------} { This is the 'correct' version } { Recognize and Translate a Subtract } procedure Subtract; begin Match('-'); Term; EmitLn('SUB D1,D0'); EmitLn('NEG D0'); end; {-------------------------------------------------------------}
And here is my version in C:
function void expression(void) { if (is_add_op(Look)) { emit_ln("xor rax, rax"); } else { term(); } while (is_add_op(Look)) { emit_ln("push rax"); switch (Look) { case ('+'): { add(); } break; case ('-'): { subtract(); } break; default: { expected("Addop"); } } } } // Binary/Math Operations function void add(void) { match('+'); term(); emit_ln("pop rbx"); emit_ln("add rax, rbx"); } function void subtract(void) { match('-'); term(); emit_ln("pop rbx"); emit_ln("sub rax, rbx"); emit_ln("neg rax"); }
Now we have mulitplication and division or ‘mulops’:
{---------------------------------------------------------------} { Parse and Translate a Math Factor } procedure Factor; begin EmitLn('MOVE #' + GetNum + ',D0') end; {--------------------------------------------------------------} { Recognize and Translate a Multiply } procedure Multiply; begin Match('*'); Factor; EmitLn('MULS (SP)+,D0'); end; {-------------------------------------------------------------} { Recognize and Translate a Divide } procedure Divide; begin Match('/'); Factor; EmitLn('MOVE (SP)+,D1'); EmitLn('DIVS D1,D0'); end; {---------------------------------------------------------------} { Parse and Translate a Math Term } procedure Term; begin Factor; while Look in ['*', '/'] do begin EmitLn('MOVE D0,-(SP)'); case Look of '*': Multiply; '/': Divide; else Expected('Mulop'); end; end; end;
And now my version in C:
function void multiply(void) { match('*'); factor(); emit_ln("pop rbx"); emit_ln("mul rbx"); } function void divide(void) { match('/'); factor(); emit_ln("push rax"); emit_ln("pop rbx"); emit_ln("pop rax"); emit_ln("xor rdx, rdx"); emit_ln("div rbx"); } function void term(void) { factor(); while (is_mul_op(Look)) { emit_ln("push rax"); switch (Look) { case ('*'): { multiply(); } break; case ('/'): { divide(); } break; default: { expected("Mulop"); } } } }
Next we need to handle one teensy little thing when it comes to parsing math expressions: Parentheses!
This will allow us to handle any arbitrarily complex math expression!
First the Pascal version:
{---------------------------------------------------------------} { Parse and Translate a Math Factor } procedure Expression; Forward; procedure Factor; begin if Look = '(' then begin Match('('); Expression; Match(')'); end else EmitLn('MOVE #' + GetNum + ',D0'); end; {--------------------------------------------------------------}
And now for my C version:
function void factor(void) { if (Look == '(') { match('('); expression(); match(')'); } else { char c = get_num(); if (sprintf_s(buffer, sizeof(buffer), "mov rax, %c", c) > 0) { emit_ln(buffer); } MemoryZero(buffer); } }
Lastly we need to bne able handle the unary minus (negating the sign of a number).
We have Crenshaw’s version in Pascal:
{---------------------------------------------------------------} { Parse and Translate an Expression } procedure Expression; begin if IsAddop(Look) then EmitLn('CLR D0') else Term; while IsAddop(Look) do begin EmitLn('MOVE D0,-(SP)'); case Look of '+': Add; '-': Subtract; else Expected('Addop'); end; end; end; {--------------------------------------------------------------} {--------------------------------------------------------------} { Recognize an Addop } function IsAddop(c: char): boolean; begin IsAddop := c in ['+', '-']; end; {--------------------------------------------------------------}
And now mine in C:
function void expression(void) { if (is_add_op(Look)) { emit_ln("xor rax, rax"); } else { term(); } while (is_add_op(Look)) { emit_ln("push rax"); switch (Look) { case ('+'): { add(); } break; case ('-'): { subtract(); } break; default: { expected("Addop"); } } } } function bool is_add_op(char c) { return (StrChrA("+-", c)); } function bool is_mul_op(char c) { return (StrChrA("*/", c)); }
Now… Anyone looking at this code who knows even a little bit of assembly can see that the code our parser is generating is way less efficient than code we could just write by hand.
As Crenshaw puts it: “Get used to it.”
This series of tutorials was written in the late 80’s, but as far as I know this still rings true today. Unless we devote A LOT of time writing our own optimizer or relying on a third-party backend like LLVM, we will never ever generate hyper optimized output code… But that doesn’t mean we’re out of the race yet.
Crenshaw mentions another technique that he “invented” (at least he hadn’t seen it mentioned in writing at the time of writing (1988)), but regardless it’s a cool idea!
Since we own the parser, we know exactly what code is being generated when we are parsing expression. This in turn means that at any point during processing, the parser also knows how many items are on the stack as well! We can use a privately managed “stack pointer” (not the built-in RSP) to keep track of what level we are on and what addresses the corresponding register.
Instead of using main memory, we will use all the available registers to create this privately managed stack.
I am generating MASM code for this compiler, so I will have access to 16 possible registers for use. Ideally we won’t have any expressions that will exceed 16, but if and when we do, then we just let it spill over into main memory.
For now these optimizations will be left out, but keep this idea in mind since we are going to implement it later!