Build your own programming language

Breakdown

Creating a simple interpreter that can parse a custom command and execute it in C++ is an exciting project for anyone interested in programming languages and compilers. In this post, we’ll walk through the process of building a minimal interpreter that recognizes a single command: tell("Your Message"), which outputs the given message using std::cout.

1. Setting Up the Lexer

The first step in building our interpreter is to create a lexer. The lexer takes the input string and breaks it down into meaningful tokens. Here’s a simple implementation:

1.1 Defining the Token Enumeration

First, we create the Token enum class. This enumeration represents different types of tokens that our lexer will recognize. We include tokens for our command, strings, the end of input, invalid tokens, and parentheses.

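#include <cctype>   // std::isspace
#include <cstdlib>  // std::system
#include <fstream>  // std::ofstream
#include <iostream> // std::cout, std::cerr
#include <string>   // std::string

enum class Token {
	Tell,      // The token for the command "tell"
	String,    // The token for a string literal
	End,       // Token indicating the end of the input
	Invalid,   // Token for invalid input
	OpenParen, // Token for an opening parenthesis
	CloseParen // Token for a closing parenthesis
};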
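
When debugging a lexer it helps to print token types by name. The following is a minimal sketch of such a helper; tokenName is a hypothetical function that is not part of the interpreter itself, and the enum is repeated only so the snippet compiles on its own.

```cpp
#include <string>

enum class Token { Tell, String, End, Invalid, OpenParen, CloseParen };

// Hypothetical debugging helper: maps each Token to a readable name.
std::string tokenName(Token type) {
    switch (type) {
        case Token::Tell:       return "Tell";
        case Token::String:     return "String";
        case Token::End:        return "End";
        case Token::Invalid:    return "Invalid";
        case Token::OpenParen:  return "OpenParen";
        case Token::CloseParen: return "CloseParen";
    }
    return "Unknown"; // Only reached if an out-of-range value is cast to Token
}
```

A call such as tokenName(Token::OpenParen) could then be used in log output while stepping through the lexer.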

1.2 Creating the Lexeme Structure

Then we create a Lexeme struct, which pairs a Token type with the text value associated with it.

struct Lexeme {
	Token type;          // The type of the token
	std::string value;   // The value associated with the token
};

1.3 Implementing the Lexer Class

Now, we implement the Lexer class, which will use the input string to generate tokens. The constructor initializes the lexer with the source string and sets the starting index.

class Lexer {
public:
	Lexer(const std::string& src) : src(src), index(0) {}

1.4 Token Generation Logic

Within the Lexer class, we define the nextToken method. This method scans the input string and generates the next token based on the current index.

Lexeme nextToken() {
	while (index < src.length() && std::isspace(static_cast<unsigned char>(src[index]))) {
	    index++; // Skip whitespace
	}
	
	if (index >= src.length()) {
	    return {Token::End, ""}; // Return End token if we reach the end of the input
	}
	
	// Check for the "tell" command (four characters long)
	if (src.substr(index, 4) == "tell") {
	    index += 4;
	    return {Token::Tell, "tell"};
	}
	
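	// Check for string literals
	if (src[index] == '"') {
	    size_t start = ++index; // First character after the opening quote
	    while (index < src.length() && src[index] != '"') {
	        index++; // Continue until the closing quote
	    }
	    if (index < src.length()) {
	        std::string text = src.substr(start, index - start); // Contents without the quotes
	        index++; // Skip closing quote
	        return {Token::String, text};
	    }
	    return {Token::Invalid, ""}; // Unterminated string literal
	}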
	
	// Check for parentheses
	if (src[index] == '(') {
	    index++;
	    return {Token::OpenParen, "("};
	}
	
	if (src[index] == ')') {
	    index++;
	    return {Token::CloseParen, ")"};
	}
	
	return {Token::Invalid, ""}; // Return Invalid token for unrecognized input
}

1.5 Private Members of the Lexer

Finally, we define the private members of the Lexer class, which include the source string and the current index position.

private:
	std::string src;  // The source input string
	size_t index;     // Current index in the input string
};
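
To see what the lexer produces, it can help to drain it into a list of tokens. The sketch below assumes a compact restatement of the lexer described in this section (a four-character match for tell, and string values stored without their surrounding quotes); tokenize is a hypothetical helper added for illustration only.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

enum class Token { Tell, String, End, Invalid, OpenParen, CloseParen };

struct Lexeme {
    Token type;
    std::string value;
};

// Compact restatement of the lexer so this sketch compiles on its own.
class Lexer {
public:
    explicit Lexer(const std::string& src) : src(src), index(0) {}

    Lexeme nextToken() {
        while (index < src.length() &&
               std::isspace(static_cast<unsigned char>(src[index]))) {
            index++; // Skip whitespace
        }
        if (index >= src.length()) {
            return {Token::End, ""};
        }
        if (src.compare(index, 4, "tell") == 0) {
            index += 4;
            return {Token::Tell, "tell"};
        }
        if (src[index] == '"') {
            std::size_t start = ++index; // First character after the opening quote
            while (index < src.length() && src[index] != '"') {
                index++;
            }
            if (index >= src.length()) {
                return {Token::Invalid, ""}; // Unterminated string literal
            }
            std::string text = src.substr(start, index - start);
            index++; // Skip the closing quote
            return {Token::String, text};
        }
        if (src[index] == '(') { index++; return {Token::OpenParen, "("}; }
        if (src[index] == ')') { index++; return {Token::CloseParen, ")"}; }
        return {Token::Invalid, ""};
    }

private:
    std::string src;
    std::size_t index;
};

// Hypothetical helper: collects tokens until End or Invalid is reached.
// Stopping at Invalid matters because the lexer does not advance past
// unrecognized input, so it would return Invalid forever.
std::vector<Lexeme> tokenize(const std::string& input) {
    Lexer lexer(input);
    std::vector<Lexeme> tokens;
    while (true) {
        Lexeme lexeme = lexer.nextToken();
        tokens.push_back(lexeme);
        if (lexeme.type == Token::End || lexeme.type == Token::Invalid) {
            break;
        }
    }
    return tokens;
}
```

For the input tell("Hi"), this sketch yields five tokens: Tell, OpenParen, String (with the value Hi), CloseParen, and End.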

2. Implementing the Parser

The next step in our interpreter is to create a parser. The parser takes the tokens generated by the lexer and interprets them to perform actions based on the input. Here’s how we can implement it:

2.1 Defining the Parser Class

We start by defining the Parser class. This class will take a Lexer instance and manage the parsing process.

class Parser {
public:
	Parser(Lexer& lexer) : lexer(lexer), currentToken(lexer.nextToken()) {}

2.2 Parsing Logic

In the parse method, we define the logic for interpreting the tokens. We check if the current token is the Tell command and process it accordingly.

void parse() {
	if (currentToken.type == Token::Tell) {
	    currentToken = lexer.nextToken(); // Expect "("
	
	    if (currentToken.type == Token::OpenParen) {
	        currentToken = lexer.nextToken(); // Expect a string literal
	
	        if (currentToken.type == Token::String) {
	            std::string message = currentToken.value;
	            currentToken = lexer.nextToken(); // Expect ")"
	
	            if (currentToken.type == Token::CloseParen) {
	                std::string code = generateCode(message);
	                compileAndRun(code); // Compile and run the generated code
	                return;
	            }
	        }
	    }
	}
	std::cerr << "Syntax error!" << std::endl; // Any other token sequence is rejected
}
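
The nested checks in parse amount to matching one fixed token sequence: Tell, OpenParen, String, CloseParen, End. As a rough sketch of the same idea written as a table-driven check, the hypothetical matchTell function below works on a vector of tokens instead of a live lexer; it is an illustration under that assumption, not the Parser used in this post.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class Token { Tell, String, End, Invalid, OpenParen, CloseParen };

struct Lexeme {
    Token type;
    std::string value;
};

// Hypothetical helper: returns true and fills `message` only when the
// tokens are exactly Tell ( String ) End. `message` is left untouched
// on failure.
bool matchTell(const std::vector<Lexeme>& tokens, std::string& message) {
    static const Token expected[] = {
        Token::Tell, Token::OpenParen, Token::String,
        Token::CloseParen, Token::End
    };
    const std::size_t count = sizeof(expected) / sizeof(expected[0]);

    if (tokens.size() != count) {
        return false;
    }
    for (std::size_t i = 0; i < count; ++i) {
        if (tokens[i].type != expected[i]) {
            return false; // First mismatch is a syntax error
        }
    }
    message = tokens[2].value; // The String token carries the message
    return true;
}
```

Listing the expected sequence in one array makes it easy to see the grammar at a glance, although it stops scaling once the language has more than one command form.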

2.3 Private Members of the Parser

The Parser class also contains private members. These include a reference to the Lexer and the current token being processed.

private:
	Lexer& lexer;        // Reference to the lexer
	Lexeme currentToken; // Current token being processed

2.4 Code Generation Method

We define the generateCode method, which takes a string message and creates C++ code that outputs that message. It also escapes any double quotes in the string to ensure valid syntax.

std::string generateCode(const std::string& message) {
	// Escape double quotes in the message
	std::string escapedMessage = message;
	size_t pos = 0;
	while ((pos = escapedMessage.find("\"", pos)) != std::string::npos) {
	    escapedMessage.insert(pos, "\\"); // Insert a backslash before the double quote
	    pos += 2; // Move past the newly inserted character
	}
	return "#include <iostream>\n"
	       "int main() {\n"
	       "    std::cout << \"" + escapedMessage + "\" << std::endl;\n"
	       "    return 0;\n"
	       "}\n"; // Return the generated C++ code
}
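
The escaping step can also be pulled out into its own function, which makes it easier to test in isolation. The sketch below uses a hypothetical helper named escapeForCppLiteral and assumes that backslashes should be escaped as well as double quotes, so every character of the message is printed literally; that is a design choice, since it means a sequence like \n in the message would be printed as two characters rather than as a newline.

```cpp
#include <string>

// Hypothetical helper: escapes backslashes and double quotes so that
// `text` can be placed between the quotes of a C++ string literal.
std::string escapeForCppLiteral(const std::string& text) {
    std::string escaped;
    escaped.reserve(text.size());
    for (char c : text) {
        if (c == '\\' || c == '"') {
            escaped += '\\'; // Prefix the special character with a backslash
        }
        escaped += c;
    }
    return escaped;
}
```

Building a new string instead of inserting into the existing one avoids having to adjust the search position after each insertion.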

2.5 Compilation and Execution Method

Finally, we implement the compileAndRun method, which compiles the generated C++ code and executes it. It also cleans up the temporary files created during the process.

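void compileAndRun(const std::string& code) {
	std::ofstream outFile("temp.cpp");
	if (!outFile) {
	    std::cerr << "Could not create temp.cpp" << std::endl;
	    return;
	}
	outFile << code; // Write the generated code to a temporary file
	outFile.close();
	std::system("g++ temp.cpp -o temp && ./temp"); // Compile and run the code (requires g++ on the PATH)
	std::system("rm -f temp.cpp temp"); // Clean up temporary files (Unix-like shells)
}
}; // End of the Parser class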

3. Putting It All Together

Finally, in our main function, we set up the lexer and parser, and pass a command to be interpreted.

int main() {
	std::string input = "tell(\"Hello World\")";
	Lexer lexer(input);
	Parser parser(lexer);
	parser.parse(); // Parse and execute the input
	return 0;
}

Conclusion

With this setup, we can parse and execute a simple command that prints a message to the console. This is a foundational step towards creating more complex interpreters and programming languages. By exploring the concepts of lexers and parsers, we gain valuable insights into how programming languages are designed and implemented. Happy coding!