I’m just landing a
chonky PR in
Quamina whose effect is to enable the + and * regexp
features. As in my
last chapter, this is a disorderly war story, not an essay, and
probably not of general interest. But
as I said then, the people who care about coercing finite automata into doing useful things at scale are My People (there are
dozens of us).
2014-2026 · As I write this, I’m sitting on the same couch in my Mom’s living room in Saskatchewan where, on my first Christmas after joining AWS, I got the first-ever iteration of this software to work. That embryo’s mature form is available to the world as aws/event-ruler. (Thanks, AWS!) Quamina is its direct descendant, which means this story is entering its twelfth year.
Mom is now 95 and good company despite her failing memory. I’ve also arrived at a qualitatively later stage of life than in 2014, but would like to report back from this poorly-lit and often painful landscape: Executable abstractions are still fun, even when you’re old.
Anyhow, the reason I’m writing all this stuff isn’t to expound on the nature of finite automata or regular expressions, it’s to pass on lessons from implementing them.
Lesson: Seek out samples · In a previous episode I used the phrase Sample-driven development to describe my luck in digging up 992 regexp test cases, which reduced the task from intimidating to approachable. I’ve never previously had the luxury of wading into a big software task armed with loads of test cases written by other people, and I can’t recommend it enough. Obviously you’re not always going to dig this sort of stuff up, but give it a sincere effort.
Lesson: Break up deliverables · I decomposed regular expressions into ten unique features, created an enumerated type to identify them, and implemented them by ones and twos. Several of the feature releases used techniques that turned out to be inefficient or just wrong when it came to subsequent features. But they worked, they were useful, and my errors nearly all taught me useful lessons.
Having said that, here’s some cool output that combines this lesson and the one above, from the unit test that runs those test cases. Each case has a regexp then one or more samples each of strings that should and shouldn’t match. I instrumented the test to report the usage of regexp features in the match and non-match cases.
Feature match test counts:
 32 '*' zero-or-more matcher
 27 () parenthetized group
 48 []-enclosed character-class matcher
  7 '.' single-character matcher
 29 |-separated logical alternatives
 16 '?' optional matcher
 29 '+' one-or-more matcher
Feature non-match test counts:
 45 '+' one-or-more matcher
 24 '*' zero-or-more matcher
 31 () parenthetized group
 49 []-enclosed character-class matcher
  6 '.' single-character matcher
 32 |-separated logical alternatives
 21 '?' optional matcher
Of course, since most of the tests combine multiple features, the numbers for all the features get bigger each time I implement a new one. Very confidence-building.
Lesson: Thompson’s construction · This is the classic nineteen-sixties regular-expression implementation by Ken Thompson, described in Wikipedia here and (for me at least) more usefully, in a storytelling style by Russ Cox, here.
On several occasions I rushed ahead and implemented a feature without checking those sources, because how hard could it be? In nearly every case, I had problems with that first cut; then, after I went and consulted the oracle, I could see where I’d gone wrong and how to fix it.
So big thanks, Ken and Russ.
Lesson: List crushing ·
In some particularly nasty regular expressions that
combine [] and ? and + and *, you can
get multiple states connected in complicated ways with epsilon links.
In Thompson’s Construction, traversing an NFA transitions not just from one state to another but from a current set of states
to a next set, repeat until you match or fail. You also compute epsilon closures as you go along; I’m skipping over details
here. The problem was that traversing these pathologically complex regexps with a sufficiently long string was leading to an
exponential explosion in the current-states and next-states sizes — not a figure of speech, I mean
O(2^N). And despite the usage of the word “set” above, these lists weren’t really sets; they contained endless
duplicates.
The best solution would be to study the traversal algorithm and improve it so it didn’t emit fountains of dupes. That would be hard. The next-best would be to turn these things into actual de-duped sets, but that would require a hash table right in the middle of the traversal hot spot, and be expensive.
So what I did was detect whenever the next-steps list grew longer than N, then sort it and crush out all the dupes. As I write, N is 500, based on running benchmarks and finding a value that makes number go down.
The reason this is good engineering is that the condition where you have to crush the list almost never happens, so in the
vast majority of cases the only cost is an if len(nextSteps)>N comparison. Some may find this approach
impure, and they have a point. I would at some point like to go back and find a better upstream approach. But for now it’s still
really fast in practice, so I sleep soundly.
Lesson: Hell’s benchmark · I’ve written before about my struggles with the benchmark where I merge 12,959 wildcard patterns together. Back before I was doing epsilon processing correctly, I had a kludgey implementation that could match patterns in the merged FA at typical Quamina speeds, hundreds of thousands per second. With correct general-purpose epsilon handling, I have so far not been smart enough to find a way to preserve that performance. With the full 13K patterns, Quamina matches at less than 2K/second, and with a mere thousand, at 16K/second. I spent literally days trying to get better results, but decided that it was more valuable to Quamina to handle a large subset of regular expressions correctly than to run idiotically large merges at full speed.
I’m pretty sure that given enough time and consideration, I’ll be able to make it better. Or maybe someone else who’s smarter than me can manage it.
Question: What next? · The still-unimplemented regexp features are:
{lo,hi} : occurrence-count matcher
\p{} : Unicode property matcher
\P{} : Unicode property-complement matcher
[^] : complementary character-class matcher
Now I’m wondering what to do next. I think [^] is pretty easy, and useful; also, I get to invert a
state-transition table. The
{lo,hi} idiom shouldn’t be hard but I’ve been using regexps for longer than most of you have been alive and have
never felt the need for it, thus I don’t feel much urgency. The Unicode properties I think have a good fun factor just because
processing the Unicode character database tables is cool. And, I’ve used them.
Lesson: Coding while aging · I tell people I keep working on code for the sake of preserving my mental fitness. But mostly I do it for fun. Same for writing about it. So, thanks for reading.