The purpose of every compiler is to read the input file in one programming language and convert it to one or more output files. Output file can be a different programming language, object code or executable code. The process of compilation must first examine the input file. This means reading all characters, identifying keywords, expressions, statements and storing all the data into symbol tables for future use. Symbol table is one of the most important data structures in any compiler.
Symbol table stores identifiers and its attributes. Every time the compiler finds a new identifier in the source code it needs to check if this identifier is already in the table, and if not it needs to store it there. This means a lot of searches and comparisons with every symbol table item. Search is always a very time consuming operation. Our goal is to have a fast compiler. Therefore we should find a way to make the searches across the symbol table as fast as possible.
One of the simple yet effective approaches is to use some hash function and to create a hash table. For every identifier in the table we apply the hash function and calculate some number. The hash function is an arbitrary function that for each identifier returns some number. It can be a simple sum of the ASCII codes of the identifier or some more complex one. Then we use, for example, the last 4 bits of this hash value to determine where to search for our identifier. 4 bits of hash value mean that we have 16 different linked lists of identifiers. We search only the list that belongs to the calculated hash value. This means that we only have to search a small list of identifiers which have the same hash value. Using more bits of hash value and consequently having more linked lists means faster search but we need some more space for bigger hash table.
If we don't find our identifier in this list then we can be sure that the identifier is not in the table because all other identifiers have different hash values so they must be different. In such case we simply add the new identifier at the end of the list of identifiers which belongs to the calculated hash value. Using hash functions and hash tables is a very effective way to speed up searches in symbol tables. Hash functions and hash tables are used in almost all compilers because their implementation is pretty simple and the gain in search speed is huge.
There are many excellent books on compiler design. However, the best book on compiler design is the compiler itself. Take a look at Turbo Pascal compiler source code - a Turbo Pascal compiler written in Turbo Pascal. This source code shows all the beauty of the Pascal programming language and reveals all the tricks needed to build a fast and compact compiler for any language, not just Pascal.
No comments:
Post a Comment