By Lawyerist (Reporter)

Pattern Recognition: Regular Expressions and You

Thursday, January 12, 2017 5:34

Regular expressions (regex for short) won’t replace associates or paralegals, but they can take a lot off their plate. They sit somewhere between Control-F and the computer on Star Trek.¹ They’re a bit finicky about syntax, but once you learn the “magic words,” you can ask them to read a document and return lists of almost any textual pattern imaginable.

What does that even mean? Well, it means you can ask for a list of all nine-digit numbers found in the document, specifically those with non-numeric separators after the third and fifth digits. You know, anything that looks like this:

Source: Social Security Card from the Social Security Administration.

Spammers use regular expressions to search the web for “words” comprised of text followed by an @ symbol and ending in a domain name. You know, anything that looks like this:

[email protected]

If crosswords are more your speed, and you point regex at a dictionary, you can ask for things like “a list of seven-letter words starting with S and ending in a double F.”

And the coup de grâce: in addition to finding patterns, you can find and replace patterns. So every instance of text including a set of numbers followed by a space, the text “U.S.C.”, another space, a § mark, another space, a set of numbers, and optionally, a year inside a parenthetical, can be replaced with a link (e.g. 17 U.S.C. § 107). That’s right, you can put your Bluebook skills to use automatically creating links to statutes.²

Regular expressions take the edge off many monotonous tasks related to pattern recognition. That being said, I’m a big believer in learning by doing. So let’s have some fun finding and redacting phone numbers, citing the US Code, and creating a crossword puzzle or two. Don’t worry, you won’t have to install any software. At most, I’ll ask you to pull up a website, and maybe if you’re feeling feisty, you’ll open a Word file to play along.

Will I Really Use This? A Personnel Story

Probably. I used to work in a big bureaucratic organization with several hundred staff members. HR would put out a staff directory every quarter, but they insisted on providing it as a PDF. It had a row for every employee and “columns” indicating their names, job titles, phone numbers, and email addresses. Oh, how I wished that directory were a spreadsheet. Then I could filter rows to get mailing lists for department heads, secretaries, and the like (surprisingly, no one kept such lists). But alas, the directory was a PDF. I could, however, copy the text out of the PDF. The result wasn’t pretty, but with a little regex magic I was able to create a text file with a line for each staff member and commas between their info (e.g., David Colarusso, Cog In the Machine, 555-555-5555, [email protected]). I then loaded this into Excel, and presto! I had my own sortable staff directory, and never again did I have to build mailing lists by hand.

If you put in the time to learn regex, you will use regex. Trust me.

Assemble Your Tools

You’ll find support for regular expressions in many text editors and a lot of programming languages. Heck, even Word has a limited implementation. Here we’ll be discussing two flavors of regex: Perl-like³ and the Word implementation.

If you want to play along at home, open Regular Expressions 101 (regex101) and work through the examples below. Here we’ll explore Perl-like expressions. Depending on your screen size, you’ll see either one, two, or three columns. We care about the one with REGULAR EXPRESSION at the top. By default, this column is subdivided into two rows. The first is where you place your pattern to be matched (REGULAR EXPRESSION). The second is where you place the text over which the regex will search (TEST STRING).

At the far right of the first row you’ll see a set of regex options (flags). If you click on the flag icon, you’ll see a list of possible flags. By default the g (global) flag is present. Its presence means the regex will find all matches, not just the first one. Another commonly used flag is the i (case insensitive) flag. Its presence means the case (capitalization) of the text doesn’t have to match the case of your pattern for a match to count.
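If you would rather see the flags in code than in regex101, here is a minimal sketch in Python, whose re module uses a Perl-like syntax. Python has no literal g flag, so the sketch assumes re.search stands in for "first match only" and re.findall for "global"; the sample sentence is made up for illustration.

```python
import re

text = "read the statute, then READ it again, and Read the annotations, then read on."

# No g flag: stop at the first match (re.search returns a single match object).
first = re.search(r"read", text)

# g flag: collect every match (re.findall returns all of them).
every = re.findall(r"read", text)

# g and i flags together: every match, ignoring capitalization.
any_case = re.findall(r"read", text, flags=re.IGNORECASE)

print(first.group(), first.start())
print(every)
print(any_case)
```

Only the last call picks up READ and Read, because only it ignores case.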

Finding Phone (& Social Security) Numbers

Cut and paste the following text into the TEST STRING field.⁴

01110100 555.867.5309 01101000 1.4142135623 01101001 987-01-6661 01110011 202.555.9355 00100000 01101001 3.1415926535897932384626433832795 01110011 00100000 666-12-4895 01100001 202-555-9355 00100000 01101000 (555) 867-5309 01101001 01100100 2.718281828459 01100100 01100101 01101110 00100000 01101101 555-867-5309 01100101 01110011 01110011 555/867-5309 01100001 01100111 01100101

This string of text is our haystack, and Social Security and phone numbers are our needles. If we know exactly what phone number we’re looking for (e.g., 555-867-5309) it’s just like control-F. All we do is place the number in the REGULAR EXPRESSION field.

When you place a character in your pattern and it matches itself, it is called a “literal character” because it literally matches itself. The digits 0-9 by themselves are literal characters and - by itself is a literal character. In addition to literal characters you can combine some characters into tokens (metacharacters) to represent a class of characters, and occasionally single characters take on such roles. For example, [0-9] matches any digit between 0 and 9. So [0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9] matches our phone number and any other phone number making use of hyphens as separators.

Now you may be thinking that [0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9] is a bit unwieldy, and I’d agree. You can modify a token by declaring how many times you would like it to reoccur. Our unwieldy friend can be rewritten as [0-9]{3}-[0-9]{3}-[0-9]{4} where {n} is a modifier and n is the number of times the preceding character/metacharacter should occur.
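As a quick check that the two spellings really are interchangeable, here is a small Python sketch. The sample is a short excerpt in the style of the test string above, and the variable names are illustrative only.

```python
import re

sample = "01101000 555-867-5309 01100001 202-555-9355 00100000 202.555.9355"

# The unwieldy version: one [0-9] token per digit.
long_form = r"[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"

# The same pattern using the {n} modifier.
short_form = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"

long_matches = re.findall(long_form, sample)
short_matches = re.findall(short_form, sample)

print(short_matches)
print(long_matches == short_matches)
```

Both patterns find the two hyphenated numbers and skip the one that uses periods.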

Excitingly, the above works in Word as well. However, you do have to make sure that wildcards are turned on.

But what about all those people who don’t use hyphens? How do I find phone numbers of the form 202.555.9355? Obviously, you could do a second search with period spacers, but maybe we could replace the hyphens with a wildcard. According to Word’s documentation, the question mark matches “Any single character, including space and punctuation characters.” So let’s see what happens if we try [0-9]{3}?[0-9]{3}?[0-9]{4}

Promising, but not quite right. What’s with 1415926535897892384626433832795 0111? It would seem that our wildcard matched numbers as well as hyphens and periods, leading to some issues. Again, we could do two searches, one with the hyphens as separators and one with periods, but here we start to meet the limits of what’s possible in Word. If you want to do more in Word, I recommend reading its regex documentation.

For now, we’re going to shift back to the more robust Perl-like syntax and regex101. It’s worth noting that the question mark has a different meaning in Perl-like regex (more on that in a bit). The generic single-character wildcard is a period. So [0-9]{3}.[0-9]{3}.[0-9]{4} is equivalent to the Word pattern above.
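To see why a bare wildcard is too generous, here is a rough Python sketch of the period-as-wildcard pattern run against a short sample. The sample text is an assumption chosen to include a phone number with periods, a run of pi's digits, and a phone number with hyphens.

```python
import re

sample = "202.555.9355 3.14159265358979 555-867-5309"

# An unescaped period matches any single character, digits included.
wildcard_pattern = r"[0-9]{3}.[0-9]{3}.[0-9]{4}"

matches = re.findall(wildcard_pattern, sample)
print(matches)
```

The middle match is twelve digits of pi: the wildcard happily accepted a digit where a separator was expected.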

What I’d really like to do is to build a pattern that searches for hyphens or periods. We know that the period is not a literal character, that is, it doesn’t match one-to-one with other periods. When regex makes a single character into a special character, you can escape it by placing a \ (backslash) in front of the offending character, causing it to act like a literal character. Consequently, \. will literally match a period.

So how do I match a period or a hyphen? We need a way of saying OR. As it happens, the | (pipe) does this. Of course, we need a way to say what parts of the pattern are included in the or comparison. For this job, we call in the parenthetical. For example, re(a|e)d matches read or reed. So let’s look for separators that are either hyphens or periods, that is, [0-9]{3}(-|\.)[0-9]{3}(-|\.)[0-9]{4}
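Here is a hedged Python sketch of the OR idea, using the separator group (-|\.) with the period escaped. One Python-specific wrinkle, which is an aside and not part of the regex itself: when a pattern contains parentheses, re.findall returns the contents of the groups, so the sketch uses re.finditer and group(0) to get the whole match. The sample strings are illustrative.

```python
import re

# re(a|e)d matches "read" or "reed", but not "red".
words = "read a reed, not a red"
word_matches = [m.group(0) for m in re.finditer(r"re(a|e)d", words)]

# Separators may be a hyphen or a literal (escaped) period.
sample = "555.867.5309 1.4142135623 202-555-9355 555/867-5309"
phone_pattern = r"[0-9]{3}(-|\.)[0-9]{3}(-|\.)[0-9]{4}"
phone_matches = [m.group(0) for m in re.finditer(phone_pattern, sample)]

print(word_matches)
print(phone_matches)
```

The digits of the square root of two no longer sneak in, and the slash-separated number is skipped because / is not on the list of separators.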

Obviously, this means that the parentheses are not literal characters. So if we wanted to match a parenthetical we’d need to use \ to escape each parenthesis. For example, \([0-9]{3}(\) |-|\.)[0-9]{3}(-|\.)[0-9]{4} matches phone numbers where parentheses and a space are used to set off the area code.

Now we’re cookin’. So what does a question mark do in Perl-like regex? I’m glad you asked. It is a modifier like {3}, except it matches when the preceding character/metacharacter occurs zero or one time. For example, watch the first parenthesis given \(?[0-9]{3}(\) |-|\.)[0-9]{3}(-|\.)[0-9]{4}
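A small Python sketch of the optional opening parenthesis follows. It assumes the pattern is written with escaped parentheses and an escaped period, \(?[0-9]{3}(\) |-|\.)[0-9]{3}(-|\.)[0-9]{4}, and it uses a made-up three-number sample.

```python
import re

sample = "(555) 867-5309 202-555-9355 202.555.9355"

# \(? means "an opening parenthesis, zero or one time".
pattern = r"\(?[0-9]{3}(\) |-|\.)[0-9]{3}(-|\.)[0-9]{4}"

matches = [m.group(0) for m in re.finditer(pattern, sample)]
print(matches)
```

All three styles match: the first uses the optional parenthesis, and the other two simply skip it.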

There are more of these modifiers. +, for example, finds a match when the preceding character/metacharacter appears somewhere between one and an infinite number of times, and * matches between zero and infinity.
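The difference between + and * is easiest to see side by side. This is only a sketch, and both sample strings are invented for the purpose.

```python
import re

numbers = "1.4142135623 and 17 and 2.718281828459"
# + means "one or more": each unbroken run of digits is one match.
digit_runs = re.findall(r"[0-9]+", numbers)

dashes = "5309 53-09 53--09"
# * means "zero or more": the hyphen may be missing entirely.
star = re.findall(r"53-*09", dashes)
# + means "one or more": at least one hyphen is required.
plus = re.findall(r"53-+09", dashes)

print(digit_runs)
print(star)
print(plus)
```

The star version accepts 5309 with no hyphen at all, while the plus version does not.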

One of the reasons I like regex101 is that it actually tells you all of this when you mouse over a modifier, literal character, or the like.

If your screen is large enough, you can also find this info in the right column. And if you have no idea where to start, there’s a quick reference with a list of tokens.

Let’s be honest, once you work through this post, you’re going to forget which token stands for what. Additionally, there’s no way I can show them all to you without losing your interest. You need a reference and a cheat sheet. The Quick Reference at regex101 is such a resource, and all you have to remember is the dead simple URL regex101.com. Easy peasy.

Now here’s a handy metacharacter: \d is actually equivalent to [0-9]. So we can rewrite our phone number regex as \(?\d{3}(\) |-|\.)\d{3}(-|\.)\d{4}

Parentheticals are actually a little more special than I let on earlier. They define something called a group, and we’ll talk about them more below, but for the moment, I want you to know that the same modifiers that we used on characters and metacharacters work on groups. So if you want to find phone numbers with no area code, you can place the area code in a group and throw a ? after it (meaning occurrences = 0 or 1). For example, (\(?\d{3}(\) |-|\.))?\d{3}(-|\.)\d{4}
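Here is a rough Python sketch of the optional area-code group, written with \d and escaped parentheses as (\(?\d{3}(\) |-|\.))?\d{3}(-|\.)\d{4}. The sample sentence is hypothetical. One Python caveat: on ordinary strings \d also matches digits from other scripts unless you pass re.ASCII, which the sketch does so that \d behaves exactly like [0-9].

```python
import re

sample = "call 867-5309 or (555) 867-5309 or 202.555.9355"

# The outer group holds the area code and its separator; the trailing ?
# makes the whole group optional.
pattern = r"(\(?\d{3}(\) |-|\.))?\d{3}(-|\.)\d{4}"

# re.ASCII restricts \d to the digits 0-9.
matches = [m.group(0) for m in re.finditer(pattern, sample, flags=re.ASCII)]
print(matches)
```

The seven-digit number now matches alongside the two numbers that carry an area code.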

So is our phone number regex complete? I don’t know. That depends on what you think counts as a valid phone number. Should we look for all dividers between numbers? If so, our wildcard example seems more correct than we first thought. Maybe we should add spaces to our list of dividers? Turns out there’s a token for that: \s. Maybe a phone number is just any string of ten digits, with or without dividers. Or not. There are a lot of possibilities. I actually hid a phone number of the form 555/555-5555 in our test string. Did you notice it?
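One possible way to widen the list of dividers, sketched in Python: add the whitespace token \s and a slash to the alternation. Which dividers belong on the list is a judgment call, so treat this as one assumption among many; the sample is illustrative.

```python
import re

sample = "555/867-5309 555 867 5309 555-867-5309"

# Dividers: hyphen, literal period, any whitespace (\s), or a slash.
pattern = r"\d{3}(-|\.|\s|/)\d{3}(-|\.|\s|/)\d{4}"

matches = [m.group(0) for m in re.finditer(pattern, sample)]
print(matches)
```

The looser the list of dividers, the more likely the pattern is to pick up things that are not phone numbers, so it is worth testing against realistic text.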

In addition to phone numbers, I also sprinkled in some Social Security numbers.⁵ Can you find them?

Redaction & Linking to the US Code

Click on SUBSTITUTION below the TEST STRING field. This should reveal two new fields. The first is for you to place a “replacement value” and the second is a display of your test string with matches replaced.

If you leave the replacement value blank, you’ll notice that all of your matches are gone, replaced with nothing. Add a placeholder, and it fills in the holes.
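The same substitution can be sketched in Python with re.sub. The sample sentence and the [REDACTED] placeholder are assumptions for illustration.

```python
import re

sample = "Call 555-867-5309 or 202.555.9355 today."
pattern = r"\d{3}(-|\.)\d{3}(-|\.)\d{4}"

# An empty replacement value removes every match.
blanked = re.sub(pattern, "", sample)

# A placeholder fills in the holes.
redacted = re.sub(pattern, "[REDACTED]", sample)

print(blanked)
print(redacted)
```

Like the g flag in regex101, re.sub replaces every match unless you limit it with the count argument.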

To make things interesting, replace your old test string with this:⁶

Respondent Acuff-Rose Music, Inc., filed suit against petitioners, the members of the rap music group 2 Live Crew and their record company, claiming that 2 Live Crew’s song, “Pretty Woman,” infringed AcuffRose’s copyright in Roy Orbison’s rock ballad, “Oh, Pretty Woman.” The District Court granted summary judgment for 2 Live Crew, holding that its song was a parody that made fair use of the original song. See Copyright Act of 1976, 17 U.S.C. § 107. The Court of Appeals reversed and remanded, holding that the commercial nature of the parody rendered it presumptively unfair under the first of four factors relevant under § 107; that, by taking the “heart” of the original and making it the “heart” of a new work, 2 Live Crew had, qualitatively, taken too much under the third § 107 factor; and that market harm for purposes of the fourth § 107 factor had been established by a presumption attaching to commercial uses.

Using what we learned, let’s write a regex to find citations to the United States Code. This means we’re looking for a title number followed by U.S.C., the § mark, the section number, and an optional parenthetical indicating the year of the code we’re citing. \d+ will pick up a run of one or more digits. Something like (\d+) U\.S\.C\. § (\d+)( \(\d{4}\))? should do the trick.
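Here is a hedged Python sketch of the citation pattern, written with escaped periods and parentheses as (\d+) U\.S\.C\. § (\d+)( \(\d{4}\))?. The first sentence of the sample comes from the test string above; the second sentence, with its year parenthetical, is invented to exercise the optional group.

```python
import re

text = (
    "See Copyright Act of 1976, 17 U.S.C. § 107. "
    "Compare 17 U.S.C. § 106 (1988) with § 107."  # hypothetical sentence
)

# Group 1 is the title, group 2 the section, group 3 the optional year.
pattern = r"(\d+) U\.S\.C\. § (\d+)( \(\d{4}\))?"

citations = [m.group(0) for m in re.finditer(pattern, text)]
titles_and_sections = [(m.group(1), m.group(2)) for m in re.finditer(pattern, text)]

print(citations)
print(titles_and_sections)
```

The bare "§ 107" at the end is not picked up, because the pattern insists on a title number and U.S.C. first.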

Citation found! So what’s the deal with those groups? Well, you can use the content of your groups to construct replacement text. Each parenthetical is numbered, and you can use something called a “backreference” to get at its content. This is why I placed the title and section numbers in groups. Group one is referenced by placing \1 in the replacement field, group two by \2, and so on. Also worth noting: group zero matches the entire regex. So placing

Source: https://lawyerist.com/136050/regular-expressions/
