Using R for Regular Expressions [*RegEx]

Natural Language Processing

Regular Expression

Demonstration

Author

Simisani Ndaba

Published

April 9, 2022

Meetup Description

R-Ladies Gaborone joined forces with R-Ladies Cologne (on Twitter and Fosstodon) to co-host an event on Using R for Regular Expressions [*RegEx] on Saturday, April 09, 2022, at 6 PM CET/CAT.
Guest speaker Pavitra Chakravarty guided us through learning regular expressions in R using the stringr and stringi packages - essential and useful skills for programming 🚀

About Speaker

Pavitra Chakravarty is a data engineer with a PhD in Cancer Nanotechnology and a member of R-Ladies Dallas.

Contact Speaker

Follow Pavitra on X

Regular Expressions [*RegEx]

Note

Get more from the slides, code and notes in the Regular Expressions GitHub repo

What are regular expressions?

A regular expression is a pattern that describes a set of strings with a common structure
Heavily used for string matching and replacing in most programming languages
The heart and soul of string operations

Regular expression syntax

6 basic canonical characteristics of regular expressions

basic pattern matching: Using functions from stringr package with exact sequence of characters
- str_detect(), str_subset(), str_view(), str_view_all()
anchors: Indicate the start and end of a string
- ^: indicates the start of the string, $: indicates the end of the string
escape characters: some special characters cannot be typed directly into a string
- \: if you want to find strings with a single quote ', “escape” the single quote by preceding it with \

character classes: specify entire classes of characters, such as numbers, letters, etc using either [: and :] around predefined name or \ and a special character
- [:digit:] or \d: digits, 0 1 2 3 4 5 6 7 8 9, equivalent to [0-9]
- \D: non-digits, equivalent to [^0-9]
- [:lower:]: lower-case letters, equivalent to [a-z]
- [:upper:]: upper-case letters, equivalent to [A-Z]
- [:alpha:]: alphabetic characters, equivalent to [[:lower:][:upper:]] or [A-Za-z]
- [:alnum:]: alphanumeric characters, equivalent to [[:alpha:][:digit:]] or [A-Za-z0-9]
- \w: word characters, equivalent to [[:alnum:]_] or [A-Za-z0-9_]
- \W: not word, equivalent to [^A-Za-z0-9_]
- [:blank:]: blank characters, i.e. space and tab
- [:space:]: space characters: tab, newline, vertical tab, form feed, carriage return, space
- \s: any whitespace character
- \S: any non-whitespace character
quantifiers: Quantifiers specify how many repetitions of the pattern
- *: matches at least 0 times
- +: matches at least 1 times
- ?: matches at most 1 times
- {n}: matches exactly n times
- {n,}: matches at least n times
- {n,m}: matches between n and m times
character clusters: Use of parentheses to keep a pattern together
- (): use with pattern-matching characters to create groups
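
To make the list above concrete, here is a minimal sketch that combines anchors, character classes and quantifiers with three stringr functions. The vector x is a made-up toy example, not part of the meetup data.

```r
library(stringr)

# Toy vector (hypothetical, not from the Enron data)
x <- c("apple", "banana 42", "Cherry", "date_7")

# Anchor + character class: does the string start with an upper-case letter?
starts_upper <- str_detect(x, "^[[:upper:]]")
# FALSE FALSE TRUE FALSE

# Character class + quantifier + anchor: keep strings ending in one or more digits
ends_digits <- str_subset(x, "\\d+$")
# "banana 42" "date_7"

# Character class + quantifier: extract the first run of letters from each string
first_letters <- str_extract(x, "[[:alpha:]]+")
# "apple" "banana" "Cherry" "date"
```

Note that inside an R string the backslash itself has to be escaped, so the regular expression \d is written as "\\d".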

Dataset being used today

library(tidyverse)

enron <- read_csv("https://raw.githubusercontent.com/UBC-STAT/stat545.stat.ubc.ca/master/content/data/enron/enron.csv") %>% drop_na()

glimpse(enron)

## Rows: 214,195

## Columns: 3

## $ mail_num <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 10, 10, 10, 1…

## $ person <chr> "allen-p", "allen-p", "allen-p", "allen-p", "allen-p", "allen…

## $ email <chr> "Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>", …

head(enron, n=50)

## # A tibble: 50 x 3

## mail_num person email

## <dbl> <chr> <chr>

## 1 1 allen-p Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>

## 2 1 allen-p Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)

## 3 1 allen-p From: phillip.allen@enron.com

## 4 1 allen-p To: tim.belden@enron.com

## 5 1 allen-p Subject:

## 6 1 allen-p Mime-Version: 1.0

## 7 1 allen-p Content-Type: text/plain; charset=us-ascii

## 8 1 allen-p Content-Transfer-Encoding: 7bit

## 9 1 allen-p X-From: Phillip K Allen

## 10 1 allen-p X-To: Tim Belden <Tim Belden/Enron@EnronXGate>

## # … with 40 more rows

Canonical principle #1: Basic pattern-matching

enron %>% filter(str_detect(enron$person, "Allen"))

## # A tibble: 0 x 3

## # … with 3 variables: mail_num <dbl>, person <chr>, email <chr>
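
The empty result above is most likely down to case: basic pattern matching is case sensitive, and the person values shown in the dataset preview are lower case ("allen-p"). A minimal sketch on a made-up vector (the name person here is just an illustration, not the full column) shows the difference and one way to ignore case with stringr's regex() helper.

```r
library(stringr)

# Hypothetical stand-in for a few values of the person column
person <- c("allen-p", "carson-m", "allen-p")

# Exact, case-sensitive match: capital "A" is not found anywhere
exact <- str_detect(person, "Allen")
# FALSE FALSE FALSE

# Lower-case pattern matches the lower-case data
lower <- str_detect(person, "allen")
# TRUE FALSE TRUE

# regex(..., ignore_case = TRUE) makes the match case insensitive
any_case <- str_detect(person, regex("Allen", ignore_case = TRUE))
# TRUE FALSE TRUE
```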

str_subset(enron$email, "tracy.ngo")

## [1] "To: tracy.ngo@enron.com"

## [2] "To: tracy.ngo@enron.com"

## [3] "To: tim.belden@enron.com, steve.c.hall@enron.com, tracy.ngo@enron.com,"

str_view_all(enron$email, "tracy.ngo")

Canonical principle #2: Anchors

^: matches the start of the string.
$: matches the end of the string.
\b: matches the empty string at either edge of a word. Don’t confuse it with ^ and $, which mark the edges of a string.
\B: matches the empty string provided it is not at an edge of a word.

enron %>% filter(str_detect(enron$email, "@ECT")) %>% select

## # A tibble: 6,524 x 0

enron %>% filter(str_detect(enron$email, "weekend$"))

## # A tibble: 45 x 3

## mail_num person email

## <dbl> <chr> <chr>

## 1 69 allen-p morning I sent you the roll did you get it? Did you need m…

## 2 94 carson-m Subject: This weekend

## 3 94 carson-m Subject: This weekend

## 4 95 carson-m Subject: Re: This weekend

## 5 69 davis-d Subject: Manual JE info for cutover weekend

## 6 69 davis-d Subject: Manual JE info for cutover weekend

## 7 69 davis-d Subject: Manual JE info for cutover weekend

## 8 1 dean-c Subject: RE: This weekend

## 9 1 dean-c Subject: RE: This weekend

## 10 1 dean-c Subject: RE: This weekend

## # … with 35 more rows
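
As a small side-by-side of the anchors listed in this section, the sketch below runs ^, $ and \b against a made-up vector (not the Enron data), so the effect of each anchor is easy to see.

```r
library(stringr)

# Toy strings (hypothetical) to compare the anchors
x <- c("weekend", "weekend plans", "this weekend", "weekends")

# $ : the string must end with "weekend"
ends_with <- str_detect(x, "weekend$")
# TRUE FALSE TRUE FALSE

# ^ : the string must start with "weekend"
starts_with <- str_detect(x, "^weekend")
# TRUE TRUE FALSE TRUE

# \b : "weekend" must appear as a whole word, anywhere in the string
whole_word <- str_detect(x, "\\bweekend\\b")
# TRUE TRUE TRUE FALSE
```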

Canonical principle #3: Escape characters

x <- c("123-456-7890", "(123)456-7890", "(123) 456-7890", "1235-2351")

str_view(x, "\\(\\d\\d\\d\\)\\d\\d\\d-\\d\\d\\d\\d")

123-456-7890
(123)456-7890
(123) 456-7890
1235-2351

str_view("so it goes $^$ here", "\\$\\^\\$")

so it goes $^$ here
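
The same escaping idea applies to the dot, which in a regular expression means "any character". The sketch below, on a made-up vector, compares an unescaped dot, an escaped dot, and stringr's fixed() helper, which treats the whole pattern as literal text.

```r
library(stringr)

# Toy strings (hypothetical): only the first contains a literal dot
x <- c("tracy.ngo", "tracyXngo")

# Unescaped "." matches any character, so both strings match
unescaped <- str_detect(x, "tracy.ngo")
# TRUE TRUE

# "\\." matches only a literal dot
escaped <- str_detect(x, "tracy\\.ngo")
# TRUE FALSE

# fixed() switches off regular expression syntax altogether
literal <- str_detect(x, fixed("tracy.ngo"))
# TRUE FALSE
```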

Canonical principle #4: Character Classes

str_view(stringr::words, "^[yx]", match=TRUE)

year
yes
yesterday
yet
you
young

str_view(stringr::words, "[^e]ed$", match = TRUE)

bed
hundred
red

str_view(c("red", "reed"), "[^e]ed$", match = FALSE)

reed
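
The examples above use hand-written bracket sets. As a complement, here is a minimal sketch of the predefined classes from the syntax list (\d, [:alpha:], [:space:]) and of a negated set, run on made-up strings.

```r
library(stringr)

# Toy strings (hypothetical): letters, digits, a mix, and spaces only
x <- c("abc", "123", "a1", "  ")

# \d : strings made up entirely of digits
all_digits <- str_detect(x, "^\\d+$")
# FALSE TRUE FALSE FALSE

# [[:alpha:]] : strings made up entirely of letters
all_letters <- str_detect(x, "^[[:alpha:]]+$")
# TRUE FALSE FALSE FALSE

# [[:space:]] : strings made up entirely of whitespace
all_space <- str_detect(x, "^[[:space:]]+$")
# FALSE FALSE FALSE TRUE

# [^0-9] : negated set, here used to strip everything that is not a digit
digits_only <- str_replace_all("a1b22c", "[^0-9]", "")
# "122"
```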

Canonical principle #5: Quantifiers

str_view(stringr::words, "^(thr)*", match = TRUE)

a
able
about
absolute
accept
account
achieve
across
act
active
actual
add
address
admit
advertise
affect
afford
after
afternoon
again
against
age
agent
ago
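
Every word shows up as a match here because * allows zero repetitions, so "^(thr)*" is satisfied by the empty string at the start of any word. A minimal sketch on a made-up vector contrasts * with +, which requires at least one "thr".

```r
library(stringr)

# Toy words (hypothetical subset, chosen for illustration)
x <- c("three", "through", "able", "about")

# * : zero or more "thr" at the start, so every string matches
zero_or_more <- str_detect(x, "^(thr)*")
# TRUE TRUE TRUE TRUE

# + : one or more "thr" at the start, so only the "thr" words match
one_or_more <- str_detect(x, "^(thr)+")
# TRUE TRUE FALSE FALSE

thr_words <- str_subset(x, "^(thr)+")
# "three" "through"
```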

x <- c("123-456-7890", "(123)456-7890", "(123) 456-7890", "1235-2351")

str_view(x, "\\([0-9][0-9][0-9]\\)[ ]*[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]")

123-456-7890
(123)456-7890
(123) 456-7890
1235-2351

x <- c("123456-7890", "(123) 456-7890", "(123)456-7890", "1235-2351")

str_view(x, "\\([0-9][0-9][0-9]\\)[ ]?[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]")

123456-7890
(123) 456-7890
(123)456-7890
1235-2351

x <- c("4444-22-22", "test", "333-4444-22")

str_view(x, "\\d{4}-\\d{2}-\\d{2}")

4444-22-22
test
333-4444-22
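
To see exactly what the counted quantifiers pick up, the sketch below swaps str_view() for str_extract() on the same vector, and adds a looser {n,m} pattern for comparison. The second pattern is an illustration only and was not part of the talk.

```r
library(stringr)

x <- c("4444-22-22", "test", "333-4444-22")

# {n}: exactly 4 digits, then 2, then 2; only the first string fits
exact_counts <- str_extract(x, "\\d{4}-\\d{2}-\\d{2}")
# "4444-22-22" NA NA

# {n,m}: 3 to 4 digits, a dash, then 2 to 4 digits (illustrative pattern)
ranged_counts <- str_extract(x, "\\d{3,4}-\\d{2,4}")
# "4444-22" NA "333-4444"
```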

Canonical principle #6: Character Clusters

enron %>% filter(str_detect(email, "@.*\\.(edu|net)")) %>% select(email)

## # A tibble: 1,646 x 1

## email

## <chr>

## 1 "<retwell@mail.sanmarcos.net>"

## 2 "cc: \"Larry Lewter\" <retwell@mail.sanmarcos.net>, \"Claudia L. Crocker\""

## 3 "\"Bob McKinney\" <capstone@texas.net> on 11/27/2000 09:46:13 AM"

## 4 "To: \"Capstone\" <capstone@texas.net>"

## 5 "Brian_Hoskins@enron.net"

## 6 "Brian_Hoskins@enron.net"

## 7 "Brian_Hoskins@enron.net"

## 8 "Brian_Hoskins@enron.net"

## 9 "To: adam.r.bayer@vanderbilt.edu"

## 10 "X-To: \"Adam Bayer\" <adam.r.bayer@vanderbilt.edu> @ ENRON"

## # … with 1,636 more rows

enron %>% filter(str_detect(email, "@.*(ns)\\.(net)")) %>% select(email)

## # A tibble: 6 x 1

## email

## <chr>

## 1 "\"Karen Edson\" <kedson@ns.net> on 07/08/2000 03:06:40 PM"

## 2 "cc: \"Julee Malinowski-Ball (E-mail)\" <jmball@ns.net>, \"Ray McNally (E-mai…

## 3 "kedson@ns.net"

## 4 "<fotinb@bc-mail.com>; \"Bill Hannah\" <hannahs@wans.net>; \"Bill Harvey\""

## 5 "\"Harvey Wax\" <HLWAX@aol.com>; \"J. D Zikuda\" <jdzikuda@netins.net>; \"Jam…

## 6 "<rndyhbnr@midplains.net>; \"Ray Clary\" <rclrec@mindspring.com>; \"Rich Hari…
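
Parentheses do more than hold alternatives like (edu|net) together: each group can also be captured. The sketch below uses str_match() on a few made-up lines in the style of the Enron emails; putting a group around .* is an addition for illustration, not something shown in the talk.

```r
library(stringr)

# Toy lines (hypothetical) in the style of the email column
emails <- c("To: adam.r.bayer@vanderbilt.edu",
            "kedson@ns.net",
            "From: phillip.allen@enron.com")

# Grouping with alternation: does the address end in .edu or .net?
is_edu_or_net <- str_detect(emails, "@.*\\.(edu|net)")
# TRUE TRUE FALSE

# str_match() returns a character matrix: the full match in column 1,
# then one column per group (here the domain name and the ending)
parts <- str_match(emails, "@(.*)\\.(edu|net)")
# row 1: "@vanderbilt.edu" "vanderbilt" "edu"
# row 2: "@ns.net"         "ns"         "net"
# row 3: NA                NA           NA
```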

Material has been borrowed heavily from

the STAT 545 course. This course was started by Jenny Bryan: https://stat545.stat.ubc.ca/notes/notes-b05/
More STAT 545 resources: https://stat545.com/character-vectors.html, https://youtu.be/I0dJ1zpxAtU
R for Data Science chapter on Strings: https://r4ds.had.co.nz/strings.html
Solution set for R4DS on Strings: https://brshallo.github.io/r4ds_solutions/14-strings.html#matching-patterns-w-regex
FUN TIME: Regex Puzzle Builder: https://regexcrossword.com/puzzlebuilder
In Processing Textual Data with Python, from the CODATA Connect Series on Research Skills Enhancement, Raphael Cobe gives a talk on how to use regular expressions to find patterns or remove useless data.

YouTube Link

Watch the full recording of the meetup session and subscribe to the R-Ladies Gaborone channel to get notifications of newly uploaded videos.