Introduction
I know this doesn’t come up nearly as often as other topics but writing an interpreter (a.k.a. a compiler) can be a very useful concept depending on what exactly you need to do with the input to your program. For example, what if we had a fairly strict data store but we for some reason wanted to access it using something like SQL? We’d have to parse the SQL statements and then find a way to mogrify it until we had something that would allow us to get the data we wanted. There is an easier way involving a little parsing theory. For the purposes of this discussion I’m assuming you are at least somewhat familiar with Context Free Grammars.
Creating the Grammar
The first step to creating a parser (especially when using a parser generator) is to find or craft a definition for the grammar. The grammar we’ll use as an example is the expression grammar from tiny basic. This simple grammar is safe for LR or LL parsing (which is important if you look at a common definition of a language like SQL).
Tiny Basic Expression Grammar
expression ::= (+|-|ε) term ((+|-) term)* term ::= factor ((*|/) factor)* factor ::= number | (expression)
Enter pyparsing
To create a very simple and extensible LL parser I’ve recently stumbled upon pyparsing. This simple four production grammar expands to the following pyparsing implementation:
[sourcecode language="python" wraplines="false"]
expr = Forward()
factor = ( Word(nums) | Group(Suppress(‘(‘) + expr + Suppress(‘)’)) )
term = Group(factor + ZeroOrMore((Literal(‘*’)|Literal(‘/’)) + factor))
expr << Group(Optional(Literal(‘-’)|Literal(‘+’)) + term + ZeroOrMore((Literal(‘-’)|Literal(‘+’)) + term))
[/sourcecode]
This allows us to turn sentences such as ‘5+5*6/3-(47+56)*34‘ into an easy to work with list such as: ‘[[['5'], ‘+’, ['5', '*', '6', '/', '3'], ‘-’, [[[['47'], ‘+’, ['56']]], ‘*’, ’34′]]]‘. There are probably improvements that could be made to this parser so that it auto-collapses expressions and other fun handlers but for the purposes of a simple grammar this will suffice.
Calling the grammar after defining it is a very simple process: `expr.parseString(’5+5*6/3-(47+56)*34′)`.
Testing Parsers an Easier Way
Sure unit testing should be done (and parsers lend themselves to unit tests very well) but there’s something satisfying about seeing your sentences get parsed out in real time. The obvious answer is, “create a mini-shell like environment.” Python also makes this process extremely simple and only requires a few lines of code to get a basically functional shell for your parser (complete with history):
[sourcecode language="python" wraplines="false"]
import rlcompleter
import readline
import so
if not os.access(".history", os.F_OK): open(".history", "w").close()
readline.read_history_file(".history")
buffer = ""
while True:
try: line = raw_input(pycolorize.light_blue("BASIC$ "))
except EOFError:
readline.write_history_file(".history")
print
break
if line.lower() == "exit" or line.lower() == "quit":
readline.write_history_file(".history")
break
buffer += line
result = ACTION_ON_BUFFER
buffer = ""
[/sourcecode]
Putting It Together
The complete script for reference purposes:
[sourcecode language="python" wraplines="false"]
import rlcompleter
import readline
import os
from pyparsing import *
import pprint
if not os.access(".history", os.F_OK): open(".history", "w").close()
readline.read_history_file(".history")
try:
import pycolorize
except:
sys.path.append(os.path.join(os.path.dirname(__file__), "vendor", "pycolorize"))
import pycolorize
class ExpressionParser(object):
def __init__(self):
self._expr = Forward()
factor = ( Word(nums) | Group(Suppress(‘(‘) + self._expr + Suppress(‘)’)) )
term = Group(factor + ZeroOrMore((Literal(‘*’)|Literal(‘/’)) + factor))
self._expr << Group(Optional(Literal(‘-’)|Literal(‘+’)) + term + ZeroOrMore((Literal(‘-’)|Literal(‘+’)) + term))
def _calculate(self, l):
while any([ isinstance(x, list) for x in l]):
for n,i in enumerate(l):
if isinstance(i, list): l[n] = self._calculate(i)
return str(eval(" ".join(l)))
def __call__(self, string):
return self._calculate(self._expr.parseString(string).asList())
buffer = ""
print pycolorize.green("Enter your SQL commands to tokenize:")
print pycolorize.green("Enter a blank line to exit.")
while True:
try: line = raw_input(pycolorize.light_blue("BASIC$ "))
except EOFError:
readline.write_history_file(".history")
print
break
if line.lower() == "exit" or line.lower() == "quit":
readline.write_history_file(".history")
break
buffer += line
result = None
try: result = ExpressionParser()(buffer)
except ParseBaseException, e:
buffer = ""
pycolorize.error(e.line)
pycolorize.error(" "*(e.col – 1) + "^")
pycolorize.error(str(e))
continue
pycolorize.status("Result: %s", result)
buffer = ""
[/sourcecode]
Conclusion
Writing LL parsers is a breeze with pyparsing but it must be kept in mind that any grammar that has any left recursion will cause errors that may take some time to find or remove. Other parser generators (for C and C++) include bison and lemon. These parser generators are LR parser generators.
By coupling the parser with a small CLI quick checks on new features to the grammar (and by extension the parser) can be a breeze. Putting this all together with unit tests and proper grammar analysis can lead to a well written and extensible language to be used for whatever purpose you may have in mind.