Biggest Regex In The World
There are two common pattern matching problems that appear simple on the surface, but turn out to be very complex once you think about them: matching emails and URIs in free-form text. Everyone has written a URL or email validation script at one point or another. And I’m willing to bet that 90% of the validation scripts out there are just plain wrong.
The URI matching problem was definitively solved sometime in 1999 by a Perl script that generated the ultimate regex to catch all legal URIs as specified in the RFCs. What is the end result?
Here it is in all of its unholy, unreadable glory: the 7.4 KB regex of doom. It appears that you need exactly 7579 characters to pattern match every possible legal URL out there. Or possibly even more, because this one doesn’t actually account for https:// addresses. And you thought this was an easy issue that could be solved by a one-liner. Shame on you!
In all fairness, how often do we really need the regex of doom though? In most cases (not all, mind you) something as simple as “give me all strings that start with http:// and are delimited by spaces on both sides” will work almost as well, and will probably run much faster.
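For the curious, here is a rough sketch of what that simple approach could look like in Python. The names `SIMPLE_URL` and `find_urls` are made up for illustration, and I’m assuming that “delimited by spaces” means any whitespace or the start/end of the text. This is nowhere near RFC-compliant, and that is the whole point.

```python
import re

# Hypothetical pattern: "http://" at the start of the text or right after
# whitespace, followed by everything up to the next whitespace character.
SIMPLE_URL = re.compile(r"(?<!\S)http://\S+")

def find_urls(text):
    """Return every whitespace-delimited token that starts with http://."""
    return SIMPLE_URL.findall(text)

if __name__ == "__main__":
    sample = "Read http://example.com/a?b=1 and then http://example.org today"
    for url in find_urls(sample):
        print(url)
```

Like the regex of doom, this one ignores https:// addresses. It will also happily swallow trailing punctuation, so a URL at the end of a sentence comes back with the period attached. That is the “not all cases” part.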
Let’s face it. Who wants to have something like that sitting in their codebase? You can’t read it, you can’t verify that it works via code inspection, and regenerating it from scratch using the Perl script included on the linked page is probably the only way you can maintain it. Trying to modify it by hand is just asking for a one-way trip to Painsville, NJ (that’s the fabled fictional town that invented brain pain, if you didn’t know).
via GDR