Rendered at 05:01:33 GMT+0000 (Coordinated Universal Time) with Cloudflare Workers.
rtpg 23 hours ago [-]
In Emacs in particular, I suffer so much from basically guessing what needs to be escaped and what doesn't. I know `rx` exists[0] as an alternative, but it's not really fun to use.
Even beyond the regex syntax itself, you often also start running into encoding problems when trying to actually use them. Typing the regex in a shell? Make sure to escape stuff properly. Regex in Python? Make sure it's a raw string. Etc etc etc
It's a modern miracle we're at least within rhyming distance of how to write regexes in most tools.
Grasping at straws, it's kinda convenient that ( and ) match literally if the text being searched is Elisp code!
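To make the raw-string point above concrete, here is a minimal sketch in Python. The sample text is made up for illustration; the only thing it is meant to show is that `"\b"` in an ordinary string literal is a backspace character, while `r"\b"` reaches the regex engine as a word boundary.

```python
import re

# In a normal string literal, "\b" is a backspace character (U+0008),
# so the regex engine never sees a word-boundary escape.
cooked = "\bcat\b"
# In a raw string, the backslash survives and the engine sees \b.
raw = r"\bcat\b"

print(len(cooked), len(raw))                # 5 7
print(re.search(cooked, "a cat sat"))       # None
print(re.search(raw, "a cat sat").group())  # cat
```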
brookst 15 hours ago [-]
Even more fun writing python that generates shell scripts that contain regex, and other nested-different-escaping scenarios.
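For the Python-generates-shell case, one way to avoid hand-escaping is to let the standard library quote the pattern. This is only a sketch: the `grep -E` command line and the file name `prices.txt` are hypothetical, and it covers POSIX shell quoting only, not whatever escaping the regex dialect itself needs.

```python
import shlex

# A pattern with characters that are awkward in a shell: ', $, \, (, ), ?
pattern = r"it's \$[0-9]+(\.[0-9]{2})?$"

# shlex.quote produces a single shell word that expands back to the pattern.
quoted = shlex.quote(pattern)
script_line = f"grep -E {quoted} prices.txt"  # hypothetical command
print(script_line)

# Round-trip check: parsing the line as a POSIX shell would split it
# gives back the original, unmodified pattern.
print(shlex.split(script_line)[2] == pattern)  # True
```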
afiori 8 hours ago [-]
Regexes should have been a structured language, not a hodgepodge of DSLs.
JdeBP 1 days ago [-]
The author is circling around, but not quite reaching, a statement that POSIX Basic Regular Expressions work everywhere, with the caveat that not everyone has caught up with version 8 of the Single Unix Specification, which has slightly changed BREs.
oleganza 13 hours ago [-]
I don't think your comment is fair to the author. If there were no such caveat, there would be no need to write that article.
JdeBP 13 hours ago [-]
The author doesn't actually cover the caveat (3 newly valid backslash escapes in BREs) at all, so that's not the case.
agnishom 1 days ago [-]
A while ago, we wrote a paper about finding regexes which match the same way in both the greedy semantics and the leftmost maximal semantics.
I've always been a stickler for being specific about which regex language your thing accepts, and whether it is to match any substring, or a prefix, or a suffix, or the whole thing, or a line, or a substring of a line, or whatever.
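As a small illustration of why the "substring, prefix, or whole thing" question matters, Python's `re` module exposes the three behaviours as separate calls. This is just a sketch with a made-up input string; other tools pick one of these behaviours implicitly and may not say which.

```python
import re

pattern = re.compile(r"[0-9]+")
text = "abc 123 def"

print(bool(pattern.search(text)))      # True: matches a substring anywhere
print(bool(pattern.match(text)))       # False: must match at the start
print(bool(pattern.fullmatch(text)))   # False: must match the whole string
print(bool(pattern.fullmatch("123")))  # True
```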
Here are some of the [more popular][1] ones, and then there are PCRE and Python.
It took me a while to learn that some of the older ones you see in e.g. grep are [specified by POSIX][2].
Go stdlib regexp package does not support back references, as it uses the RE2 engine. You can use them in replace but not matching.
masklinn 24 hours ago [-]
Regexp does not use re2, it is a separate implementation of the same concepts.
codetiger 24 hours ago [-]
I built my Rust library for JSONLogic and use bindings for other languages after similar frustrations with Rule engines, template engines and IFTTT engines. https://github.com/GoPlasmatic/datalogic-rs
myroon5 1 days ago [-]
JSON schema's docs also have a recommended regular expression subset:
It drives me nuts when a developer documents something or other as being a "regex" but doesn't mention which dialect of regular expression he's talking about. This habit is particularly common in the Rust, JavaScript, and Python communities, which seem to forget that their language's regular expression language isn't universal.
zahlman 1 days ago [-]
Why? Of course it means the dialect that is most directly supported by that language (by builtins or the standard library). And why should they have to consider other dialects? They aren't reading regexes from user input (or they'd be a lot more concerned about sanitization, catastrophic backtracking etc.), and their fellow developers all grok the conventions.
bartread 1 days ago [-]
I’d imagine precisely because they might be collecting regexes from user input such as parameter values or search terms, and the user may not know or care which technology your tool or service is built with. However, they will need to know which regex dialect(s) you support.
And I’d further bet that people who are casual about specifying that are relatively strongly correlated with people who are casual about sanitization, catastrophic backtracking, etc. (At least based on code I’ve seen over the decades.)
quotemstr 24 hours ago [-]
Because I don't know what language your program is even written in! Why should I know or care that you chose, e.g. TypeScript, when I'm trying to use or configure your program and don't know how to spell this or that regex concept?
zahlman 13 hours ago [-]
> It drives me nuts when a developer documents something or other as being a "regex"
> I don't know what language your program is even written in!
I legitimately don't understand how you're in this situation. If the documentation is telling you that something is a regex, and it's not a user-supplied regex, then that's something intended for fellow developers. If configuration expects a regex for some reason, that's a signal that you're expected to be a programmer to use the software; and you're presumably interested in it because you use the same language, or are at least familiar enough with the open source ecosystem to look these things up. If the software were meant to be used by people who can't do these things, it would be designed without those rough edges, but more importantly the documentation would be getting written by a non-developer.
quotemstr 10 hours ago [-]
> If configuration expects a regex for some reason, that's a signal that you're expected to be a programmer to use the software
1) What?
Only programmers are expected to use grep? What? That's absolute nonsense. Even programmers aren't programmers during every waking hour. My being a programmer in general doesn't make me a developer of your project, and I shouldn't have to become one by git cloning it to figure out how to write a config file.
Google Sheets and Excel have a REGEXMATCH. Do I have to be a programmer to use a spreadsheet? And even if so, do I need to guess the implementation language? No, because Google and Microsoft document their regular expression dialects (RE2 and PCRE, respectively), so you don't have to guess.
> If the software were meant to be used by people who can't [develop]...the documentation would be getting written by a non-developer
2) What?
No, that's also nonsense. Developers write programs for non-developers ALL THE TIME without some kind of technical writer intermediary. If the developer is any good, he'll realize that "regex" in documentation is ambiguous and write down the specific language he means.
xigoi 22 hours ago [-]
Same applies to “Markdown”.
pmarreck 24 hours ago [-]
I've become a fan of whatever PCRE2 understands
gilrain 15 hours ago [-]
We must find a way to return to SNOBOL/SPITBOL. It’s so elegant and effective in Ada (where it’s in the standard library).
> In the 1980s and 1990s, its use faded as newer languages such as AWK and Perl made string manipulation by means of regular expressions fashionable. SNOBOL4 patterns include a way to express BNF grammars, which are equivalent to context-free grammars and more powerful than regular expressions. The "regular expressions" in current versions of AWK and Perl are in fact extensions of regular expressions in the traditional sense, but regular expressions, unlike SNOBOL4 patterns, are not recursive, which gives a distinct computational advantage to SNOBOL4 patterns.
rramadass 12 hours ago [-]
Quite Interesting. Have you worked with SNOBOL a lot? Care to share your experiences?
This para caught my eye;
A SNOBOL pattern can be very simple or extremely complex. A simple pattern is just a text string (e.g. "ABCD"), but a complex pattern may be a large structure describing, for example, the complete grammar of a computer language. It is possible to implement a language interpreter in SNOBOL almost directly from a Backus–Naur form expression of it, with few changes. Creating a macro assembler and an interpreter for a completely theoretical piece of hardware could take as little as a few hundred lines, with a new instruction being added with a single line.
Also this;
SNOBOL4 pattern-matching uses a backtracking algorithm similar to that used in the logic programming language Prolog, which provides pattern-like constructs via DCGs. This algorithm makes it easier to use SNOBOL as a logic programming language than is the case for most languages.
Seems like there are some hidden superpowers waiting to be unlocked ;-)
chasil 13 hours ago [-]
Microsoft FINDSTR.EXE supports a subset of these regular expressions.
It does not support the + repetition operator.
galaxyLogic 22 hours ago [-]
2 RegExp problems:
1. You can not compose a bigger regexp out of smaller ones
2. A regexp can not "call" other regexps
wwind123 22 hours ago [-]
To do regex matching efficiently, you need to compile the pattern before using it. That'd exclude dynamically "calling" other regex patterns. But bigger regex pattern strings can be composed from smaller regex pattern strings. You'd just need to do the composition before the compilation.
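A minimal sketch of "compose the strings first, compile once", in Python. The `seq` helper and the date sub-patterns are hypothetical names made up for this example; wrapping each piece in a non-capturing group is one way to keep an alternation inside a piece from leaking into its neighbours.

```python
import re

# Small pattern strings, each usable and testable on its own.
year = r"(?P<year>[0-9]{4})"
month = r"(?P<month>0[1-9]|1[0-2])"
day = r"(?P<day>0[1-9]|[12][0-9]|3[01])"

def seq(*parts: str) -> str:
    """Concatenate pattern strings, isolating each in a non-capturing group."""
    return "".join(f"(?:{p})" for p in parts)

# Composition happens on strings; compilation happens once, afterwards.
iso_date = re.compile(seq(year, "-", month, "-", day))

m = iso_date.fullmatch("2024-02-29")
print(m.group("year"), m.group("month"), m.group("day"))  # 2024 02 29
print(iso_date.fullmatch("2024-13-01"))                   # None

# Already-compiled patterns can be recombined through their .pattern strings.
abc, xyz = re.compile("abc"), re.compile("def")
combined = re.compile(abc.pattern + xyz.pattern)
print(bool(combined.fullmatch("abcdef")))                 # True
```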
galaxyLogic 9 hours ago [-]
I'm just thinking in JavaScript I can do this:
let s = "abc" + "def";
Why can't I do:
let regExp = /abc/ + /def/;
If JavaScript (or some other) interpreter can turn
/abc/ into a RegExp, why can't it do the same for
/abc/ + /def/
?
ystlum 21 hours ago [-]
Also define blocks if all someone wants is to break the pattern up to make it more readable.
woadwarrior01 20 hours ago [-]
Swift has a RegexBuilder[1][2] interface, in addition to the usual string-ey interface that allows composition.
To make it lazy.
It's the match-anything-in-between pattern, so you can put stuff before and after it; it's my most used regex.
"my name is (.*?)$" => my name is k0in
Or values "last (.*?), was great" => last sunday, was great
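To show what the lazy `?` changes in the examples above, here is a short Python sketch. The input sentence is invented so that the trailing text occurs twice; with a single occurrence, greedy and lazy would capture the same thing.

```python
import re

text = "last sunday, was great, really, was great"

# Greedy: .* grabs as much as it can, then backs off to the LAST ", was great".
greedy = re.search(r"last (.*), was great", text).group(1)
# Lazy: .*? grabs as little as it can, stopping at the FIRST ", was great".
lazy = re.search(r"last (.*?), was great", text).group(1)

print(greedy)  # sunday, was great, really
print(lazy)    # sunday
```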
LoganDark 1 days ago [-]
> the special characters . * ^ $
These already do not work in many tools which require those special characters to be escaped to have any meaning. An easy example is GNU grep, sed, etc. which use BRE ("Basic Regular Expressions") by default. The article mentions GNU coreutils but does not explain that `-E` is required to fix that behavior.
jonstewart 1 days ago [-]
Then there’s not just the issue of whether the engine supports a particular syntactical feature but the issue of matching semantics. Perl/PCRE’s semantics are far different from POSIX’s, and some implementations have different semantics altogether (and quite reasonably).
monkamonme 1 days ago [-]
[flagged]
semanticc 12 hours ago [-]
> So for my definition of “everywhere,” with the caveats mentioned above, the following features work everywhere. YMMV.
[0]: https://www.gnu.org/software/emacs/manual/html_node/elisp/Rx...
https://par.nsf.gov/servlets/purl/10534654
[1]: https://cppreference.com/cpp/regex#Regular_expression_gramma...
[2]: https://pubs.opengroup.org/onlinepubs/009696899/basedefs/xbd...
I find it a good read.
https://datatracker.ietf.org/doc/html/rfc9485
Amusing pair of statements.
https://json-schema.org/understanding-json-schema/reference/...
https://en.wikipedia.org/wiki/SNOBOL
[1]: https://github.com/swiftlang/swift-evolution/blob/main/propo...
[2]: https://developer.apple.com/documentation/regexbuilder
- \w, \W, \s, \S - need to use POSIX classes instead: [[:alnum:]_], [^[:alnum:]_], [[:space:]], [^[:space:]]
- \b - need to use [[:<:]] (word start) and [[:>:]] (word end) instead
- \B - (not a word start/end) no alternatives