Field-Value Automata
2022-05-20 03:00:00 Author: www.tbray.org

When I introduced Quamina, I described the core trick: You prepare an arbitrary JSON blob for automaton-based matching by “Flattening” it into a list of name/value pairs, then sorting them in name order. Today, a closer look at how you work with the flattened data.
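For readers who haven't seen the earlier piece, here is a minimal sketch of what that flattening step could look like. It is not Quamina's flattener: the `Field` type, the `Flatten` and `collect` functions, the "." used to join nested names, and the choice to carry each value as its JSON text are all assumptions made for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"sort"
)

// Field is one flattened name/value pair. Value holds the leaf as JSON text,
// so the string "a" keeps its quotes and the number 1 has none.
type Field struct {
	Name  string
	Value string
}

// Flatten decodes a JSON object and returns its leaves sorted by name.
// Assumptions of this sketch: nested keys are joined with ".", and every
// array element becomes its own Field under the array's name.
func Flatten(blob []byte) ([]Field, error) {
	dec := json.NewDecoder(bytes.NewReader(blob))
	dec.UseNumber() // keep numbers as their original text
	var root map[string]any
	if err := dec.Decode(&root); err != nil {
		return nil, err
	}
	var fields []Field
	if err := collect("", root, &fields); err != nil {
		return nil, err
	}
	// Stable sort: fields that share a name stay in the order collected.
	sort.SliceStable(fields, func(i, j int) bool {
		return fields[i].Name < fields[j].Name
	})
	return fields, nil
}

// collect walks the decoded tree and appends one Field per leaf value.
func collect(name string, v any, out *[]Field) error {
	switch t := v.(type) {
	case map[string]any:
		for key, child := range t {
			childName := key
			if name != "" {
				childName = name + "." + key
			}
			if err := collect(childName, child, out); err != nil {
				return err
			}
		}
	case []any:
		for _, child := range t {
			if err := collect(name, child, out); err != nil {
				return err
			}
		}
	default:
		text, err := json.Marshal(t)
		if err != nil {
			return err
		}
		*out = append(*out, Field{Name: name, Value: string(text)})
	}
	return nil
}

func main() {
	fields, err := Flatten([]byte(`{"y": [1, 2], "x": "a", "z": {"k": true}}`))
	if err != nil {
		panic(err)
	}
	for _, f := range fields {
		fmt.Printf("%s = %s\n", f.Name, f.Value)
	}
}
```

The point to notice is that the array under y turns into two separate y fields, and that x sorts ahead of both no matter where it appeared in the blob.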

This is one of a series of essays on programming topics motivated by my work on eventing services at AWS and, since then, on Quamina, a content-filtering library implemented in Go (GitHub).

By example · Consider the following Pattern:

{"x": ["a"], "y": [1, 2]}

It should match any JSON blob containing a top-level x field whose value is the string "a" and a top-level y field whose value is 1 or 2.

Since we sort the fields of incoming events in order of name, we know how to build the automaton. There are two things that aren’t obvious:

  1. Matching each field is a two-step process: First, see if the field name matches, then the value.

  2. This pattern doesn’t care about any fields except for the top-level x and y. So the automaton has to bypass all the others.

Given that, here’s the little automaton that could:

[Figure: simple finite automaton]

Notes · There are two kinds of things in the picture: field matchers and value matchers, labeled in the diagram with names beginning fm and vm. In Go code, they’re implemented by types named fieldMatcher and valueMatcher.

The matchers transition straightforwardly, looking for an x field with value "a" and then a y field with value 1 or 2. But each of these matchers has a * representing “anything else” that loops back to the field-matcher state. This is how fields that don’t appear in the pattern are ignored. The automaton loops in fm0 until it sees an x field, then moves along to fm1 if the field’s value is "a", otherwise looping back to fm0. If it does get to fm1, it’ll loop there forever until it sees a y field with value 1 or 2, transitioning to fm2 or fm3 respectively and bypassing any other fields we don’t care about.

The big red !T on fm2 and fm3 says that reaching either state means you’ve matched T, the pattern from the example above.
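To make the picture concrete, here is a hand-wired sketch of that automaton in Go, walked in the conventional one-state-at-a-time way. The type names fieldMatcher and valueMatcher come from the text; everything else (the struct fields `byName`, `byValue` and `matches`, the helpers `buildExample` and `walk`, and values carried as JSON text) is assumed for illustration and is not Quamina's code.

```go
package main

import "fmt"

// Field is one flattened name/value pair; events arrive sorted by Name.
type Field struct{ Name, Value string }

// fieldMatcher maps a field name to the valueMatcher that checks its value.
// matches lists the patterns matched by arriving at this state.
type fieldMatcher struct {
	byName  map[string]*valueMatcher
	matches []string
}

// valueMatcher maps a field value to the next fieldMatcher.
type valueMatcher struct {
	byValue map[string]*fieldMatcher
}

// buildExample wires up the states for {"x": ["a"], "y": [1, 2]},
// using the same state names as the diagram.
func buildExample() *fieldMatcher {
	fm2 := &fieldMatcher{matches: []string{"T"}}
	fm3 := &fieldMatcher{matches: []string{"T"}}
	vm1 := &valueMatcher{byValue: map[string]*fieldMatcher{"1": fm2, "2": fm3}}
	fm1 := &fieldMatcher{byName: map[string]*valueMatcher{"y": vm1}}
	vm0 := &valueMatcher{byValue: map[string]*fieldMatcher{`"a"`: fm1}}
	return &fieldMatcher{byName: map[string]*valueMatcher{"x": vm0}} // fm0
}

// walk runs the automaton over the sorted fields. The "*" loopbacks are the
// two continue statements: an unknown name or value leaves the state as is.
func walk(start *fieldMatcher, fields []Field) []string {
	state := start
	var matched []string
	for _, f := range fields {
		vm, ok := state.byName[f.Name]
		if !ok {
			continue // "*" out of the field matcher
		}
		next, ok := vm.byValue[f.Value]
		if !ok {
			continue // "*" out of the value matcher
		}
		state = next
		matched = append(matched, state.matches...)
	}
	return matched
}

func main() {
	event := []Field{{"a", "9"}, {"x", `"a"`}, {"y", "2"}, {"z", "true"}}
	fmt.Println(walk(buildExample(), event)) // prints [T]
}
```

This works for the one-pattern picture. Once an automaton holds several patterns, a single current state is no longer enough to find every match, and this sketch doesn't try to cover that.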

Problem · I’d used finite automata more than once before this, but on every other occasion I was parsing a programming language or Internet data format, which have none of this “skip any fields you’re not looking for” crap. Which meant that the state machine I drew on the whiteboard in my AWS office looked like the one just above, but without those loopbacks labeled *. Yeah, even though they’re implicit in the design. What can I say, I was in a hurry.

So when I sat down to write the code to traverse the automaton, I just couldn’t figure out how to make it work. I felt, as I often feel, that I’d gotten in over my head.

Since I wasn’t smart enough to write down the correct automaton, I had to figure out how to code around the one I had. Sitting on the living-room couch in my Mom’s over-full house at Christmas 2014, I came up with this. Go check it out if you like reading code, but let me try to explain it…

In English · You have an automaton with a start state that looks like the picture above, only without the * back-links. You don’t know which fields in the event (if any) are going to match the pattern. You do know that any match has to begin with the start state.

So, you make a little type called a Proposal, which says, state S might match field F. And you have a pool of Proposals to work on. To start with, that pool contains one proposal for each field, suggesting that the start state might match it.

Then you turn a loop loose that runs as long as there are any proposals in the pool. It reads a proposal and tries matching its state to its field. If it works, which means the field name/value combo transitions to another state, you toss proposals for that state and all the following fields back into the pool. Let’s work a quick example.

Suppose you have 3 fields (F0, F1, F2), so you load the pool with proposals for F0/Start-State, F1/Start-State, and F2/Start-State. Let’s say that neither F1 nor F2 matches the Start-State, but F0 does, transitioning to State-X. So you toss proposals for F1/State-X and F2/State-X into the pool. F1 doesn’t match State-X but F2 does, transitioning to State-Y. State-Y has an annotation that you’ve matched a pattern, so you have something to return to your caller. The pool is now empty and there are no more fields after F2 to build new proposals, so you’re done.

The fact that the fields are sorted by name really matters here; as you work your way through the automaton, you never have to worry about transferring back to a previous state.
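The proposal loop described above can be sketched in a few lines of Go. This is an illustration under assumptions, not the code linked from this piece: the `proposal` struct, the `matchesFor` function, the hand-built example automaton, and the use of a slice as a last-in-first-out pool are all invented here. Note that the automaton has no "*" loopbacks at all.

```go
package main

import "fmt"

// Field is one flattened name/value pair; events arrive sorted by Name.
type Field struct{ Name, Value string }

type fieldMatcher struct {
	byName  map[string]*valueMatcher
	matches []string // patterns matched on arrival at this state
}

type valueMatcher struct {
	byValue map[string]*fieldMatcher
}

// proposal says: state might match the field at index field.
type proposal struct {
	state *fieldMatcher
	field int
}

// matchesFor drains a pool of proposals and returns each matched pattern once.
func matchesFor(start *fieldMatcher, fields []Field) []string {
	var matched []string
	seen := map[string]bool{}

	// Seed the pool: the start state might match any field.
	pool := make([]proposal, 0, len(fields))
	for i := range fields {
		pool = append(pool, proposal{start, i})
	}
	for len(pool) > 0 {
		p := pool[len(pool)-1] // the order proposals are taken in doesn't matter
		pool = pool[:len(pool)-1]

		f := fields[p.field]
		vm, ok := p.state.byName[f.Name]
		if !ok {
			continue // this state has no interest in this field name
		}
		next, ok := vm.byValue[f.Value]
		if !ok {
			continue // right name, wrong value
		}
		for _, id := range next.matches {
			if !seen[id] {
				seen[id] = true
				matched = append(matched, id)
			}
		}
		// Fields are sorted, so only later fields can extend this match.
		for i := p.field + 1; i < len(fields); i++ {
			pool = append(pool, proposal{next, i})
		}
	}
	return matched
}

// buildExample wires up {"x": ["a"], "y": [1, 2]} as pattern "T".
func buildExample() *fieldMatcher {
	fm2 := &fieldMatcher{matches: []string{"T"}}
	fm3 := &fieldMatcher{matches: []string{"T"}}
	vm1 := &valueMatcher{byValue: map[string]*fieldMatcher{"1": fm2, "2": fm3}}
	fm1 := &fieldMatcher{byName: map[string]*valueMatcher{"y": vm1}}
	vm0 := &valueMatcher{byValue: map[string]*fieldMatcher{`"a"`: fm1}}
	return &fieldMatcher{byName: map[string]*valueMatcher{"x": vm0}}
}

func main() {
	event := []Field{{"a", "9"}, {"x", `"a"`}, {"y", "2"}, {"z", "true"}}
	fmt.Println(matchesFor(buildExample(), event)) // prints [T]
}
```

Because every new proposal points at a field strictly after the one that just matched, the pool is guaranteed to drain, and fields the pattern doesn't mention are skipped simply by failing their proposals.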

It’s all in less than fifty lines of code (once again, starting here), which I’m not going to try to squeeze into this skinny blog column. If you’re looking at that code, please ignore (for the moment) the static about “exists:false matches” and “Array conflict”; but both are maybe interesting enough to get a write-up later in this series.

I don’t know if this approach to traversing automata (a) has been investigated, (b) has a name, or (c) is any good. I do know that it worked really well in practice, handling many millions of events per second in multiple AWS services.

I’ve done a bit of pen-and-paper analysis and don’t think the amount of work is meaningfully different from a conventional traversal. But it did occur to me that in principle this approach could be made multithreaded; you could process multiple proposals in parallel on different cores. But anyhow, the profiler says this part of traversing the automaton is hardly visible as part of the total compute. So I left it this way in Quamina for sentimental reasons.

Tables? · In the first cut, the field matcher was just map[string]*valueMatcher and the value-matcher was map[string]*fieldMatcher. It worked OK and the fieldMatcher is still like that but, for reasons I’ll write about later, that’s a bad choice for the valueMatcher.
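As a rough illustration of how map-based matchers like those might be filled in when a pattern is added, here is a sketch. The maps are wrapped in structs so that a state can also record which patterns it completes; that wrapping, along with `patternField`, `addPattern`, `newFieldMatcher` and the field names, is assumed here rather than taken from Quamina. The caller is expected to pass the pattern's fields already sorted by name.

```go
package main

import "fmt"

type fieldMatcher struct {
	byName  map[string]*valueMatcher
	matches []string // patterns completed by reaching this state
}

type valueMatcher struct {
	byValue map[string]*fieldMatcher
}

func newFieldMatcher() *fieldMatcher {
	return &fieldMatcher{byName: map[string]*valueMatcher{}}
}

// patternField is one field of a pattern: a name plus the acceptable
// values, each given as JSON text.
type patternField struct {
	Name   string
	Values []string
}

// addPattern threads a pattern through the automaton, reusing any states
// that already exist and creating the rest. Every state reached after the
// last field is annotated with the pattern's id.
func addPattern(start *fieldMatcher, id string, fields []patternField) {
	states := []*fieldMatcher{start}
	for _, pf := range fields {
		var next []*fieldMatcher
		for _, fm := range states {
			vm, ok := fm.byName[pf.Name]
			if !ok {
				vm = &valueMatcher{byValue: map[string]*fieldMatcher{}}
				fm.byName[pf.Name] = vm
			}
			for _, val := range pf.Values {
				target, ok := vm.byValue[val]
				if !ok {
					target = newFieldMatcher()
					vm.byValue[val] = target
				}
				next = append(next, target)
			}
		}
		states = next
	}
	for _, fm := range states {
		fm.matches = append(fm.matches, id)
	}
}

func main() {
	start := newFieldMatcher()
	addPattern(start, "T", []patternField{
		{Name: "x", Values: []string{`"a"`}},
		{Name: "y", Values: []string{"1", "2"}},
	})
	fm1 := start.byName["x"].byValue[`"a"`]
	fmt.Println(fm1.byName["y"].byValue["1"].matches) // prints [T]
	fmt.Println(fm1.byName["y"].byValue["2"].matches) // prints [T]
}
```

Adding a second pattern that also starts with x being "a" would reuse the existing states for that prefix and only branch where the patterns diverge.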

Smarter than me · At some level, Quamina is. One symptom is places like this in the code, distinguished by extended verbose comments. These are where I got stuck and bashed my head against the wall until I got something that worked, and knew I’d have no chance of understanding it later (nor would any subsequent visitor) unless I could squeeze out a coherent English explanation. As Prof. Feynman said, if you can’t explain something in simple language, you don’t really understand it. The observation that computer programmers can build executable abstractions that work but they then have trouble understanding is not new and not surprising. Lots of our code is smarter than we are.

The code is also smart because at AWS I had extremely talented collaborators who added things that I didn’t think were possible until they worked. The proportion of the useful ideas in here that are actually mine is probably less than 50% now.

Finally, we had the insane luxury of running this in production against millions-per-second event flows and watching what broke. And of hearing from other teams using it about what they had managed to break. I guarantee: Nobody is smart enough to predict the behavior of software under this kind of stress without experiencing it.

News · As of late May, Quamina has picked up a couple of collaborators with way more GitHub expertise than me, and its repo is growing all sorts of bells and whistles, mostly on the CI/CD front. Which means that I hope to do a release next month and see if anyone actually wants to use this. Also, more stuff to write about!



Source: https://www.tbray.org/ongoing/When/202x/2022/05/19/Quamina-Matchers