I think it's a great idea... if you already know regex. After all, it's effectively just a different syntax for the same construct: it doesn't simplify anything, it just makes it more readable. Oh, and it makes escaping a non-issue, which almost sells me on the idea by itself, since it seems that 50% of the time I spend writing regex goes to figuring out what needs escaping and how.
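To make the escaping point concrete, here is a minimal Python sketch using the standard `re` module. The helper name `literal` is made up for illustration; `re.escape` is the real standard-library call that a verbal-style builder could lean on so that nobody has to escape by hand.

```python
import re


def literal(text: str) -> str:
    """Hypothetical helper: return a pattern fragment that matches `text` verbatim."""
    return re.escape(text)


price = "$4.99 (sale)"

# Hand-written: every metacharacter has to be spotted and escaped manually.
by_hand = re.compile(r"\$4\.99 \(sale\)")

# Builder-style: the helper takes care of the escaping.
by_helper = re.compile(literal(price))

matches_by_hand = by_hand.fullmatch(price) is not None
matches_by_helper = by_helper.fullmatch(price) is not None

# Forgetting to escape changes the meaning: "$" becomes an anchor, "." a wildcard,
# and the parentheses become a group, so the raw string does not even match itself.
unescaped_matches = re.fullmatch(price, price) is not None
```

Both compiled patterns accept the literal string, while the unescaped one does not, which is exactly the class of mistake a builder removes.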
Writing regexes is usually not much of an issue (although the many dialects in common use are always a source of frustration), but reading them is always a pain, for me at least. For quick and dirty shell scripts or vim editing, regex is great; for stuff that's supposed to be long-lived and actively maintained in a codebase, I think this verbal approach is a great idea, at least in theory.
Regarding the optimization of the intermediate result: it should only be a problem if you actually need to output these regexes for other uses, or if you need to compile many of them at runtime under performance constraints. If your regexes are pre-compiled, then the resulting DFA (assuming a DFA-based engine) should look the same as far as I can tell.
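A rough way to check the "same result once compiled" intuition, sketched in Python. Two caveats: the `GENERATED` string is only an assumption about what a verbal-style builder might emit (with redundant groups), and Python's `re` is a backtracking engine rather than a DFA, so this compares matching behavior, not the automaton itself.

```python
import re

# Assumed builder output: correct, but with redundant non-capturing groups.
GENERATED = re.compile(r"(?:foo)(?:[^ ]*)$")

# What a person would probably write by hand.
HANDWRITTEN = re.compile(r"foo[^ ]*$")

samples = ["foobar", "foo bar", "xfoo", "food truck", "barfoo123"]

# Both patterns are compiled once at import time and then reused.
generated_hits = [s for s in samples if GENERATED.search(s)]
handwritten_hits = [s for s in samples if HANDWRITTEN.search(s)]

same_behavior = generated_hits == handwritten_hits
```

If the two lists agree on your real inputs, the extra groups only cost something at compile time or when the pattern text is exported for other tools.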
If somebody makes a Rust crate with a similar concept I'll be sure to try it out next time I have to write regexes in a codebase.
> I think it's a great idea... if you already know regex
It's actually a bad idea in this case because regex is mostly the same in every modern language, so if you know it, you know it everywhere. What you don't know is this.
I agree with the common complaint that regex is effectively write-only, but this is only half due to its terse syntax. A pattern can be pretty complex on its own, and complex things are hard to understand. Imagine what code matching the behavior of a complex regex would look like.
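For a sense of scale, here is a sketch of what hand-written matching code looks like for a deliberately tiny pattern, `[a-z]+=[0-9]+`. The function name `matches_by_hand` and the sample strings are invented for illustration; a genuinely complex regex would need backtracking or an explicit state machine on top of this.

```python
import re

PATTERN = re.compile(r"[a-z]+=[0-9]+")


def matches_by_hand(s: str) -> bool:
    """Hand-rolled equivalent of PATTERN.fullmatch(s) for this one pattern."""
    i, n = 0, len(s)

    # [a-z]+ : one or more lowercase ASCII letters
    start = i
    while i < n and "a" <= s[i] <= "z":
        i += 1
    if i == start:
        return False

    # = : a single literal equals sign
    if i == n or s[i] != "=":
        return False
    i += 1

    # [0-9]+ : one or more ASCII digits
    start = i
    while i < n and "0" <= s[i] <= "9":
        i += 1
    if i == start:
        return False

    # Nothing may follow the digits.
    return i == n


samples = ["key=42", "=42", "key=", "key=4x", "Key=1", ""]
by_regex = [PATTERN.fullmatch(s) is not None for s in samples]
by_hand = [matches_by_hand(s) for s in samples]
```

Thirteen characters of pattern turn into roughly twenty lines of code, which suggests the complexity is in the pattern and not only in the notation.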
> It's actually a bad idea in this case because regex is mostly the same in every modern language, so if you know it, you know it everywhere. What you don't know is this.
I disagree. At least in my experience, there are significant differences between the regex engines I use regularly. In no particular order: are parens and other operators treated literally by default or do they need to be escaped? Are character classes like '[:alpha:]' understood, or do I need to write them out explicitly? Similarly, do I have access to \w \W \s and friends? Can I use + to mean {1,} ? Can I use '?' to match 0 or 1 (common) or do I have to use = (vim)? Or maybe just {0,1}? But then should I escape the braces? Do I have recursion? Do I have named captures?
Those are not theoretical concerns; that's stuff I routinely end up getting wrong because I forget that some feature that works in PCRE does not work in vim, or works differently in sed, etc.
> are parens and other operators treated literally by default or do they need to be escaped?
> Can I use + to mean {1,} ? Can I use '?' to match 0 or 1 (common) or do I have to use = (vim)? Or maybe just {0,1}? But then should I escape the braces?
I think that's just older tools like vi and sed. Perl, Python, Java, and JavaScript use a similar modern flavor where + and ? work, and parentheses and braces don't need to be escaped.
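A quick check of that claim in one of the listed languages, using Python's standard `re` module. The sample strings are arbitrary; this only shows the Python behavior and says nothing about vi or sed.

```python
import re

text = "aaa (b)"
words = ("color", "colour", "colouur")

# "+" is shorthand for "{1,}", and the braces need no escaping.
plus = re.match(r"a+", text).group()
braces = re.match(r"a{1,}", text).group()

# "?" is shorthand for "{0,1}".
optional = [bool(re.fullmatch(r"colou?r", w)) for w in words]
optional_braces = [bool(re.fullmatch(r"colou{0,1}r", w)) for w in words]

# Bare parentheses create a capture group...
group = re.search(r"(b)", text).group(1)
# ...and escaped parentheses match literal "(" and ")".
literal_parens = re.search(r"\(b\)", text).group()
```

In this flavor the backslash makes a metacharacter literal, which is the reverse of the basic-regex default where `\(` opens a group.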
> if you know it, you know it everywhere. What you don't know is this.
Right, one language might have anythingBut(" ").endofline() and the next might use a different operator in place of the ., like anythingBut(" ")->endofline(), or it might even require nesting calls. None of these is a significant hurdle, and if we standardize the names (endofline, anythingBut, ...) then you can make the same argument for this approach. It's a chicken-and-egg argument: just use regex because that works everywhere -> this isn't universally implemented -> it won't work everywhere.
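For illustration, a minimal sketch of such a fluent builder in Python. The class `Verbal` and its methods are hypothetical: the names mirror the ones used in this thread (spelled in Python's snake_case) and are not the API of any existing library.

```python
import re


class Verbal:
    """Hypothetical fluent builder that assembles an ordinary regex string."""

    def __init__(self):
        self._parts = []

    def then(self, text):
        # Literal text, escaped so the caller never thinks about metacharacters.
        self._parts.append(re.escape(text))
        return self

    def anything_but(self, chars):
        # Zero or more characters that are not in `chars`.
        self._parts.append("[^" + re.escape(chars) + "]*")
        return self

    def end_of_line(self):
        self._parts.append("$")
        return self

    def source(self):
        # The plain regex text, usable anywhere a pattern string is accepted.
        return "".join(self._parts)

    def compile(self):
        return re.compile(self.source())


pattern = Verbal().then("id=").anything_but(" ").end_of_line().compile()

found = pattern.search("user id=42") is not None
rejected = pattern.search("id=42 rest") is None
```

Porting this to another language would change the call syntax but not the method names, which is the standardization argument made above.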
And aside from that, my experience is similar to the sibling comment's: in some command-line tool that I forget (is it sed? Vim?), the default is that \( is a capture group, whereas in normal regex ( is a capture group. Grep offers you three regex variants to choose from. I have to look up regex syntax or do trial and error every time I use a language that I don't use daily. And I don't know all of regex to begin with. I just know everything I've ever needed, but people posted examples here with (?:x), which I don't know. I once read about it and remembered it for a few days, I think... So anyway, consistent and descriptive method names seem a lot easier, especially when you consider autocompleting IDEs.
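For anyone else who keeps forgetting it: (?:x) is a non-capturing group, meaning it groups like (x) but does not store what it matched. A short Python sketch with the standard `re` module (the date string is just sample data):

```python
import re

date = "2024-05-17"

# Every pair of plain parentheses becomes a numbered capture.
capturing = re.match(r"(\d{4})-(\d{2})-(\d{2})", date)

# (?:...) still groups, but is skipped when captures are numbered.
non_capturing = re.match(r"(?:\d{4})-(\d{2})-(?:\d{2})", date)

all_parts = capturing.groups()
month_only = non_capturing.groups()
```

Here `all_parts` holds three strings while `month_only` holds just the month, even though both patterns match the same text.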