POSIX Regular Expression Matching and Lexing

Chengsong Tan

Informatics

Student thesis: Doctoral Thesis › Doctor of Philosophy

Abstract

POSIX is the most widely used disambiguation strategy for regular expression matching. The strategy is, however, difficult to implement correctly: according to tests conducted by Kuklewicz, many regular expression matchers implementing it produce incorrect results. This thesis is concerned with a POSIX regular expression matching algorithm introduced by Sulzmann and Lu. This algorithm uses bitcoded regular expressions and is based on the idea of Brzozowski derivatives. The algorithm generates POSIX values, which encode how a regular expression matches a string, that is, which part of the string is matched by which part of the regular expression. This information is needed in the context of lexing in order to extract and classify tokens.
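To make the idea of derivatives and values concrete, here is a minimal sketch. It assumes the plain, injection-based formulation of derivative lexing, not the bitcoded algorithm the thesis verifies, and it performs no simplification. The tuple encoding and the names `nullable`, `der`, `mkeps`, `inj` and `lex` are illustrative choices, not taken from the thesis.

```python
# Regular expressions as tagged tuples (an illustrative encoding):
#   ("zero",) ("one",) ("chr", c) ("alt", r1, r2) ("seq", r1, r2) ("star", r)
# Values use capitalised tags:
#   ("Empty",) ("Chr", c) ("Left", v) ("Right", v) ("Seq", v1, v2) ("Stars", (v, ...))
ZERO, ONE = ("zero",), ("one",)

def nullable(r):
    """True if r can match the empty string."""
    tag = r[0]
    if tag in ("one", "star"):
        return True
    if tag in ("zero", "chr"):
        return False
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # seq

def der(c, r):
    """Brzozowski derivative: what is left of r after reading character c."""
    tag = r[0]
    if tag in ("zero", "one"):
        return ZERO
    if tag == "chr":
        return ONE if r[1] == c else ZERO
    if tag == "alt":
        return ("alt", der(c, r[1]), der(c, r[2]))
    if tag == "seq":
        left = ("seq", der(c, r[1]), r[2])
        return ("alt", left, der(c, r[2])) if nullable(r[1]) else left
    return ("seq", der(c, r[1]), r)  # star

def mkeps(r):
    """Value describing how a nullable r matches the empty string."""
    tag = r[0]
    if tag == "one":
        return ("Empty",)
    if tag == "alt":
        return ("Left", mkeps(r[1])) if nullable(r[1]) else ("Right", mkeps(r[2]))
    if tag == "seq":
        return ("Seq", mkeps(r[1]), mkeps(r[2]))
    if tag == "star":
        return ("Stars", ())
    raise ValueError("not nullable: " + tag)

def inj(r, c, v):
    """Undo one derivative step: turn a value for der(c, r) into a value for r."""
    tag = r[0]
    if tag == "chr":
        return ("Chr", c)
    if tag == "alt":
        side = 1 if v[0] == "Left" else 2
        return (v[0], inj(r[side], c, v[1]))
    if tag == "seq":
        if v[0] == "Right":  # r[1] matched the empty string, c went into r[2]
            return ("Seq", mkeps(r[1]), inj(r[2], c, v[1]))
        if v[0] == "Left":   # unwrap the Left around the sequence value
            v = v[1]
        return ("Seq", inj(r[1], c, v[1]), v[2])
    if tag == "star":
        first, rest = v[1], v[2][1]
        return ("Stars", (inj(r[1], c, first),) + rest)
    raise ValueError("no value can be injected into " + tag)

def lex(r, s):
    """Return a value for how r matches s, or None if r does not match s."""
    if not s:
        return mkeps(r) if nullable(r) else None
    v = lex(der(s[0], r), s[1:])
    return None if v is None else inj(r, s[0], v)
```

For the regular expression (a + ab)(b + ε) and the string "ab", `lex` returns `("Seq", ("Right", ("Seq", ("Chr", "a"), ("Chr", "b"))), ("Right", ("Empty",)))`: the first component takes the longer match "ab" through its right alternative, and the second component matches the empty string. A lexer can read token boundaries and token classes off such a value.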

While a formalised correctness proof for Sulzmann and Lu’s algorithm already exists, this proof does not include any of the crucial simplification rules. These simplification rules are, however, necessary in order to have an acceptable runtime for this algorithm. Our version of the simplification rules includes a number of fixes and improvements: one problem we fix has to do with their use of the nub function, which does not remove non-trivial duplicates. We improve the simplification rules by formulating them as a simple recursive function and also by simplifying more instances of regular expressions. As a result, we can establish a bound on the size of derivatives. Our proofs are formalised in Isabelle/HOL.
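The effect of simplification on derivative size can be illustrated with a small experiment. This is a sketch under stated assumptions: it works on plain (non-bitcoded) regular expressions, and `simp` implements only a few basic language-preserving rules (dropping the empty-language expression, flattening nested alternatives, removing exact duplicate branches). It is not the thesis's rule set. The names `simp`, `flatten`, `size` and `derivative_sizes` are hypothetical.

```python
# Regular expressions as tagged tuples (an illustrative encoding):
#   ("zero",) ("one",) ("chr", c) ("alt", r1, r2) ("seq", r1, r2) ("star", r)
ZERO, ONE = ("zero",), ("one",)

def nullable(r):
    """True if r can match the empty string."""
    tag = r[0]
    if tag in ("one", "star"):
        return True
    if tag in ("zero", "chr"):
        return False
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # seq

def der(c, r):
    """Brzozowski derivative of r with respect to character c."""
    tag = r[0]
    if tag in ("zero", "one"):
        return ZERO
    if tag == "chr":
        return ONE if r[1] == c else ZERO
    if tag == "alt":
        return ("alt", der(c, r[1]), der(c, r[2]))
    if tag == "seq":
        left = ("seq", der(c, r[1]), r[2])
        return ("alt", left, der(c, r[2])) if nullable(r[1]) else left
    return ("seq", der(c, r[1]), r)  # star

def size(r):
    """Number of nodes in the syntax tree of r."""
    return 1 + sum(size(x) for x in r[1:] if isinstance(x, tuple))

def flatten(r):
    """Branches of nested alternatives, left to right."""
    if r[0] == "alt":
        return flatten(r[1]) + flatten(r[2])
    return [r]

def simp(r):
    """Basic simplification (assumed rules, not the thesis's bitcoded ones)."""
    if r[0] == "seq":
        r1, r2 = simp(r[1]), simp(r[2])
        if r1 == ZERO or r2 == ZERO:
            return ZERO
        return r2 if r1 == ONE else ("seq", r1, r2)
    if r[0] == "alt":
        branches = []
        for b in flatten(("alt", simp(r[1]), simp(r[2]))):
            if b != ZERO and b not in branches:  # drop ZERO and exact duplicates
                branches.append(b)
        if not branches:
            return ZERO
        result = branches[-1]
        for b in reversed(branches[:-1]):        # rebuild as right-nested alt
            result = ("alt", b, result)
        return result
    return r

def derivative_sizes(r, s):
    """Node counts of the derivative of r w.r.t. s, without and with simp."""
    plain = simplified = r
    for c in s:
        plain = der(c, plain)
        simplified = simp(der(c, simplified))
    return size(plain), size(simplified)
```

For (a*)* and a run of a's, the simplified derivative settles at a*·(a*)* (6 nodes) after the first character, while the unsimplified derivative grows with every further character. The duplicate check here compares branches for exact equality only, which is the kind of check that can miss duplicates differing in some annotation.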

Date of Award	1 Jan 2024
Original language	English
Awarding Institution	King's College London
Supervisor	Christian Urban (Supervisor)

Documents

2024_Tan_Chengsong_1854034_ethesis
File: application/pdf, 676 KB
Type: Thesis