DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
  1. DZone
  2. Data Engineering
  3. Data
  4. Build a WebAssembly Language for Fun and Profit: Lexing

Build a WebAssembly Language for Fun and Profit: Lexing

I decided to create this guide and provide a simple overview designed to help get your feet wet in building languages with wasm, starting with lexing.

Drew Youngwerth user avatar by
Drew Youngwerth
·
Oct. 11, 22 · Tutorial
Like (1)
Save
Tweet
Share
2.78K Views

Join the DZone community and get the full member experience.

Join For Free

WebAssembly (wasm) is a high-performance assembly-like format optimized for the web. Code targeting WebAssembly can run at near-native speeds while still benefiting from the safe environment of a browser VM. Wasm has opened up a whole new world of demanding desktop-class apps that can comfortably run in the browser. For example, AutoCAD was able to port decades of code to the web using wasm. Cases like AutoCAD’s have made it clear that wasm will be a major disruptive force in how web apps are developed.

To facilitate the adoption of wasm, WebAssembly team developed a powerful compiler toolchain library called binaryen. Binaryen does a huge amount of heavy lifting for compiler authors. It offers dead code removal, code size reduction, and various performance optimizations out of the box.

As someone who has long been interested in programming languages, this piqued my interest. Writing compiled languages has always been a daunting task. What I found is binaryen made it incredibly fun and easy to build _new_ languages that are shockingly speedy.

That's why I decided to create this guide and provide a simple overview designed to help get your feet wet in building languages and exploring the inner workings of wasm. 

Here's a quick taste of the lisp-inspired language we'll build, wispy:

Common Lisp
 
(fn fib:i32 [val:i32]
  (if (lt_i32 val 2)
    val
    (add_i32 (fib (sub_i32 val 1)) (fib (sub_i32 val 2)))))

(fn main:i32 [] (fib 15))


This simple function calculates values of the Fibonacci sequence, a sequence of numbers that appears in surprising places through mathematics and nature. It's one of my favorite illustrations of how elegantly patterns of the universe can be described in code.

This guide is designed for intermediate to advanced software developers looking for a fun side

project to challenge themselves with. By the end, we’ll have built a working compiler and runtime for wispy.

The guide will be broken down into three articles:

Setup and Lexing(this article): the process of converting the characters of code into meaningful units called tokens

Parsing: the process of converting the tokens into a logical tree known as an AST.

Compiling (or code generation): the process of converting the AST into the binary instructions run by our computer

Setup

In this guide, we will be using TypeScript and NodeJS. The concepts are highly portable, so

feel free to use the environment you're most comfortable with. Our only major dependency, binaryen, has a simple C API. You are welcome to skip ahead to the next section if you're using a different language.

Requirements:

  •  NodeJS v16+
  • Git

Quick Start

Shell
bash

git clone git@github.com:drew-y/wispy.git

cd wispy

git checkout quickstart

npm i

Manual Setup

I've included manual setup instructions as an alternative to the quick start, in case you want to know exactly how the project was set up or just like doing things from scratch. If you've already done the quick start, skip to the next section.

1. Open a terminal window and make a new directory:

Shell
bash

mkdir wispy

cd wispy


2. Initialize package.json:

Shell
bash

npm init -y # Be sure to have NodeJS 16+ installed


3. Install the project dependencies:

Shell
npm i @types/node binaryen typescript


4. Add these two fields to the package.json

JSON
"type": "module", // Binaryen uses the new module format so we must follow suit

"bin": {

  "wispy": "dist/index.mjs" // This will allow us to easily run the compiler from our terminal

},


5. Create a tsconfig file:

Shell
npx tsc init .


6. Set the following fields in `tsconfig.json`:

JSON
"module": "ES2022",

"rootDir": "./src",

"moduleResolution": "node",

"outDir": "./dist"


Lexing

Lexing is the process of digesting each individual character of our program into a set of tokens. A token is a group of characters that take on a special meaning when put together. Take the following snippet of wispy:

Common Lisp
(add 1 2)


There are five unique tokens in that snippet `(`, `add`, `1`, `2` and `)`. The lexer's job is simply to identify and list those tokens in order. Lexing is typically the first step in turning human-readable code into something closer to what a computer can understand.

Defining Our Tokens

We'll start by defining our tokens in a new file:

Shell
bash

# mts extension is important, it tells typescript to create a corresponding mjs file, so Node knows to use modules 



mkdirp -p src/types/token.mts


First up is the `IntToken`. This token represents whole numbers like `1045`

TypeScript
ts

// src/types/token.mts

export type IntToken = { type: "int"; value: number };


Next up is the `FloatToken`. This token represents numbers that may have a decimal, like `1.8`:

TypeScript
ts

// src/types/token.mts

export type FloatToken = { type: "float"; value: number };



/** Previously defined tokens omitted for brevity */


Now, let's define some identifier tokens. In wispy, an identifier can represent either the name of a function or the name of a function parameter. We have two types of identifier tokens, a standard `IdentifierToken` and a `TypedIdentifierToken`.

An `IdentifierToken` is used in the body of a function to refer to the function's parameters or to call another function.

A `TypedIdentifierToken` is used when defining a function or a parameter. Typed identifiers take the form `identifier:type`. For example, `val:i32` defines a parameter that is a 32-bit integer. When defining a function, the type represents the function's return type. For example, `fib:i32` is a function that returns a 32-bit integer.

Here are the definitions:

TypeScript
ts

// src/types/token.mts

export type IdentifierToken = { type: "identifier"; value: string };

export type TypedIdentifierToken = { type: "typed-identifier"; value: string };



/** Previously defined tokens omitted for brevity */


Up next is `BracketToken`. Wispy uses [S-expression](https://en.wikipedia.org/wiki/S-expression) syntax, like lisp. So brackets are very important. To keep things simple we allow two kinds of brackets `()` and `[]`. To keep things even more simple the compiler will treat `()` and `[]` as interchangeable. In actual use, we will only use `[]` to define parameters.

TypeScript
// src/types/token.mts

export type BracketToken = { type: "bracket"; value: Bracket };

export type Bracket = "(" | ")" | "[" | "]";



/** Previously defined tokens omitted for brevity */

 

Finally, we define the top-level `Token` type:

TypeScript
// src/types/token.mts

export type Token = BracketToken | IntToken | FloatToken | IdentifierToken | TypedIdentifierToken;

/** Previously defined tokens omitted for brevity */


`Token` is a discriminated union. Discriminated Unions are an incredibly powerful programming language construct. They represent a value that can be one of many types. In our case, a `Token` can be any one of the more specific token types we defined earlier, such as `IntToken` or `FloatToken`. You'll notice that each of these tokens have a unique `type` field, such as `type: "int"` in the case of `IntToken`. This is the discriminator. Down the line you can pass a `Token` to a function and that function can use the `type` field to figure out which specific token it's working with.

At this point `src/types/token.mts` is finished and should look like a file. 

To make our new types easily accessible, export them from a new `index.mts` file:

TypeScript
// src/types/index.mts

export * from "./token.mjs";


The Lex Function

Now that we have our tokens defined we can write the actual `lex` function. The lex function will take a `string` (i.e. a .wispy file) and output an array of tokens (`Token[]`):

Make a new lex file:

Shell
bash

mkdirp -p src/lexer.mts


Define the lex function:

TypeScript
// src/lexer.mts

import { Bracket, Token } from "./types/index.mjs";



export const lex = (input: string): Token[] => {

  const chars = input

    // Remove any leading or trailing whitespace for simplicity

    .trim()

    // Break up the file into single characters

    .split("");



  // This array stores our tokens

  const tokens: Token[] = [];



  // The loop continues as long as we have characters to consume

  while (chars.length) {

    // Here, a word is an unidentified token. It is usually any single group of non-whitespace

    // characters such as 123 or 123.4 or im_a_function

    const word = consumeNextWord(chars); // We'll define this function later



    // We ran out of tokens. Break out of the loop.

    if (word === undefined) break;



    const token = identifyToken(word); // We'll define this function later



    // Add the token to our store

    tokens.push(token);

  }



  // Return the tokens

  return tokens;

};


Next we define the `consumeNextWord` function:

TypeScript
// src/lexer.mts



/** previous function(s) omitted for brevity */



const consumeNextWord = (chars: string[]): string | undefined => {

  const token: string[] = [];



  while (chars.length) {

    // Save a preview of the current character without modifying the array

    const char = chars[0];



    // No more characters to read

    if (char === undefined) break;



    // Whitespace characters terminate the token (we'll define the isWhitespace function later)

    if (isWhitespace(char) && token.length) {

      chars.shift(); // Remove the whitespace so it doesn't get included in the next token

      break;

    }



    // Discard leading whitespace characters

    if (isWhitespace(char)) {

      chars.shift();

      continue;

    }



    // Terminator tokens signify the end of the current token (if any). (we'll define the isTerminatorToken function later)

    if (isTerminatorToken(char) && token.length) break;



    // Add the character to the token and discard it from the input

    token.push(char);

    chars.shift();



    // If the only token we've received so far is a single character token, that's our whole token.

    if (isTerminatorToken(char)) break;

  }



  // If we have characters for our token, join them into a single word. Otherwise, return undefined to signal to the lexer

  // that we are finished processing tokens.

  return token.length ? token.join("") : undefined;

};


Now we'll define our `identifyToken` function. As the name suggests, this function takes a word and figures out what token that word represents.

TypeScript
// src/lexer.mts



/** previous function(s) omitted for brevity */



const identifyToken = (word: string): Token => {

  // Don't worry we'll get to all the `is` helper functions in a bit

  if (isInt(word)) return { type: "int", value: parseInt(word) };

  if (isFloat(word)) return { type: "float", value: parseFloat(word) };

  if (isIdentifier(word)) return { type: "identifier", value: word };

  if (isBracket(word)) return { type: "bracket", value: word };

  if (isTypedIdentifier(word)) return { type: "typed-identifier", value: word };



  throw new Error(`Unknown token: ${word}`);

};


Finally, we define our helper functions. These functions all take a string and return `true` if the string passes their test, `false` otherwise. Most are written using regex. If you're unfamiliar with regex, I highly recommend regexone as a resource to learn more. In a nutshell, regex is an expression syntax that's used to extract meaningful information from text. In our case, we'll use it to match words against tokens. 

TypeScript
const isInt = (word: string) => /^[0-9]+$/.test(word);



const isFloat = (word: string) => /^[0-9]+\.[0-9]+$/.test(word);



const isIdentifier = (word: string) => /^[a-zA-Z_][a-zA-Z0-9_\-]*$/.test(word);



const isTypedIdentifier = (word: string) =>

  /^[a-zA-Z_][a-zA-Z0-9_\-]*:[a-zA-Z_][a-zA-Z0-9_\-]*$/.test(word);



const isBracket = (word: string): word is Bracket => /[\(\)\[\]]/.test(word);



/** Brackets are the only terminator tokens for now */

const isTerminatorToken = (word: string): word is Bracket => isBracket(word);



// Not sure why I didn't use regex here ¯\_(ツ)_/¯

const isWhitespace = (char: string) => char === " " || char === "\n" || char === "\t";  


At this point, `src/lexer.mts` is finished and should look something like this file.  

Running the Lexer 

It's time to actually run the lexer. Start by making a new file `src/index.mts`:

TypeScript
#!/usr/bin/env node



// src/index.mts



import { readFileSync } from "fs";

const file = process.argv[2];

const input = readFileSync(file, "utf8");

const tokens = lex(input);

console.log(JSON.stringify(tokens, undefined, 2));

 

Next, create an `example.wispy` file in the project root to compile.

Common Lisp
(fn fib:i32 [val:i32]

  (if (lt_i32 val 2)

    val

    (add_i32 (fib (sub_i32 val 1)) (fib (sub_i32 val 2)))))



(fn main:i32 [] (fib 15))

  

Now build the lexer 

Shell
bash

npx tsc

npm link # This will make wispy available to run as its own command

 

Finally, run the lexer:

Shell
bash

wispy example.wispy



# Note, if npm link failed you can call our compiler directly with this as an alternative:

node dist/index.mjs example.wispy


If everything goes well, `wispy` should output something like this:

JSON
[

  {

    "type": "bracket",

    "value": "("

  },

  {

    "type": "identifier",

    "value": "fn"

  },

  {

    "type": "typed-identifier",

    "value": "fib:i32"

  },

  {

    "type": "bracket",

    "value": "["

  },

  {

    "type": "typed-identifier",

    "value": "val:i32"

  },

  // Omitting the rest for brevity

]


With that, we have a working lexer. We can break our code down into tokens. This is a good place to break for now. In the next article, we’ll move onto parsing these tokens into a logical tree that will ultimately be converted to wasm. 

WebAssembly Build (game engine) Data Types

Published at DZone with permission of Drew Youngwerth. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Top Five Tools for AI-based Test Automation
  • Why Does DevOps Recommend Shift-Left Testing Principles?
  • Debugging Threads and Asynchronous Code
  • A Real-Time Supply Chain Control Tower Powered by Kafka

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: