The Tokenizer

The job of the tokenizer (tokens.e) is to break the code of a Euphoria file into small pieces (tokens) so that the rest of the program can process it more easily. This version also makes it easy to modify the way the data is processed.

The main function is tokenize. Give it a file number or a file name, and it returns a sequence in the format {{token,line,col},{token,line,col},...}.

To tokenize a file, use:

tokens = tokenize(file)

Name    Description
tokens  A list of tokens in the form {{token,line,col},{token,line,col},...}
file    A file number or a file name
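
For instance, here is a minimal sketch of tokenizing a file by name and printing each token with its position (example.ex is a made-up file name, and toks is just a local variable):

include tokens.e

sequence toks
toks = tokenize("example.ex")  -- a file number returned by open() also works
for i = 1 to length(toks) do
    -- each element is {token, line, col}
    printf(1, "%d:%d  %s\n", {toks[i][2], toks[i][3], toks[i][1]})
end for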

To tokenize a string, use:

tokens = tokenize_string(data)

Name    Description
tokens  A list of tokens in the form {{token,line,col},{token,line,col},...}
data    The string to tokenize
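
For instance, a brief sketch (exactly how the string is split depends on which delimiters are registered):

sequence toks
toks = tokenize_string("x += 1")
for i = 1 to length(toks) do
    puts(1, toks[i][1] & "\n")  -- print just the token text
end for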

Extending the Tokenizer
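
The following procedures register new delimiters and handlers with the tokenizer. A combined setup sketch follows the table.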

Procedure Description
addWhitespaceDelimiter(delim) Add a whitespace delimiter. For example:
addWhitespaceDelimiter(" ")
addWhitespaceDelimiter("\t")
addNewLineDelimiter(delim) Add a new-line delimiter. For example:
addNewLineDelimiter("\n")
addNewLineDelimiter("\r")
addNewLineDelimiter("\r\n")
addIncludedDelimiter(delim) Add an included delimiter, like an operator. For example:
addIncludedDelimiter("+")
addIncludedDelimiter("+=")
addIncludedDelimiter("(")
addIncludedDelimiter("}")
addStringDelimiter(delim) Adds a string delimiter. For example:
addStringDelimiter("'")
addStringDelimiter("\"")
addLineComment(delim) Adds a single-line comment delimiter. For example:
addLineComment("--")
addBlockComment(start,end) Adds block comment syntax, starting with start and ending with end. For example:
addBlockComment("/*","*/")
addNonDelimiter(nondelim) When nondelim is encountered, it is added to the current token. One thing to note about the way the tokenizer works is that it sorts the delimiter list from the longest delimiter to the shortest, so it doesn't match + instead of += when it encounters +=. By calling this procedure with "", you register the fallback that runs when no delimiter matches: the current character is simply added to the current token. It also allows for always adding "a" to the current token, though I don't know of any reason why you would want to. For example:
addNonDelimiter("")
addSpecialDelimiter(delim,routine) When delim is encountered, call routine. For example:
global procedure processLineComment(integer whichOne)
    -- flush any token that was being built when the comment started
    if length(token[1]) then
        tokens = append(tokens,token)
    end if
    token = {"",curline,curcol}
    -- collect everything up to the end of the line (or end of the file)
    while 1 do
        if cchar > length(file_data) then
            exit
        end if
        if isNewLine() then
            exit
        end if
        token[1] = token[1] & file_data[cchar]
        cchar = cchar + 1
        curcol = curcol + 1
    end while
    -- store the comment as a single token and start a fresh one
    tokens = append(tokens,token)
    token = {"",curline,curcol}
end procedure
addSpecialDelimiter("--","processLineComment")
-- or
-- addSpecialDelimiter("--",routine_id("processLineComment"))
addExtendedDelimiter(delim,routine,extra) When delim is encountered, call routine. The extra value is stored in DELIMITERS[whichOne][3], as in:
global procedure processBlockComment(integer whichOne)
    -- flush any token that was being built when the comment started
    if length(token[1]) then
        tokens = append(tokens,token)
    end if
    -- start the comment token with the opening delimiter and skip past it
    token = {DELIMITERS[whichOne][1],curline,curcol}
    curcol = curcol + length(token[1])
    cchar = cchar + length(token[1])
    -- store the closing delimiter (the extra value) for cmp() to match
    c = DELIMITERS[whichOne][3]
    while 1 do
        if cchar > length(file_data) then
            exit
        end if
        if cmp() then
            -- closing delimiter found: include it in the token and stop
            token[1] = token[1] & DELIMITERS[whichOne][3]
            curcol = curcol + length(DELIMITERS[whichOne][3])
            cchar = cchar + length(DELIMITERS[whichOne][3])
            exit
        end if
        token[1] = token[1] & file_data[cchar]
        if isNewLine() then
            -- block comments can span lines, so track line and column here
            cchar = cchar + length(DELIMITERS[isNewLine()][1])
            curcol = 0
            curline = curline + 1
        else
            cchar = cchar + 1
            curcol = curcol + 1
        end if
    end while
    tokens = append(tokens,token)
    token = {"",curline,curcol}
end procedure
addExtendedDelimiter("/*","processBlockComment","*/")
