Using R for Regular Expressions *RegEx
Meetup Description
R-Ladies Gaborone joined forces with R-Ladies Cologne (on Twitter and Fosstodon) to co-host an event on Using R for Regular Expressions [*RegEx] on Saturday, April 09, 2022, at 6 PM CET/CAT.
Guest speaker Pavitra Chakravarty guided us through learning regular expressions in R using the stringr and stringi packages, essential and useful skills for programming 🚀
About Speaker
Pavitra Chakravarty is a data engineer with a PhD in Cancer Nanotechnology and a member of R-Ladies Dallas.
Contact Speaker
Regular Expressions [*RegEx]
Get more from the slides, code and notes in the Regular Expressions GitHub Repo
What are regular expressions?
A regular expression is a pattern that describes a specific set of strings with a common structure
Heavily used for string matching / replacing in most programming languages
Heart and soul for string operations
Regular expression syntax
6 basic canonical characteristics of regular expressions
basic pattern matching: using functions from the stringr package with an exact sequence of characters
str_detect(), str_subset(), str_view(), str_view_all()
anchors: indicate the start and end of a string
^: indicates the start of the string
$: indicates the end of the string
escape characters: special characters cannot be written directly in a string
\: if you want to find strings containing a single quote ', "escape" the single quote by preceding it with \
character classes: specify entire classes of characters, such as numbers or letters, using either [: and :] around a predefined name, or \ followed by a special character. The predefined names are used inside a bracket expression, e.g. [[:digit:]]. The equivalents below hold for ASCII text.
[:digit:] or \d: digits, 0 1 2 3 4 5 6 7 8 9, equivalent to [0-9]
\D: non-digits, equivalent to [^0-9]
[:lower:]: lower-case letters, equivalent to [a-z]
[:upper:]: upper-case letters, equivalent to [A-Z]
[:alpha:]: alphabetic characters, equivalent to [[:lower:][:upper:]] or [A-Za-z]
[:alnum:]: alphanumeric characters, equivalent to [[:alpha:][:digit:]] or [A-Za-z0-9]
\w: word characters, equivalent to [[:alnum:]_] or [A-Za-z0-9_]
\W: not word, equivalent to [^A-Za-z0-9_]
[:blank:]: blank characters, i.e. space and tab
[:space:]: space characters: tab, newline, vertical tab, form feed, carriage return, space
\s: whitespace
\S: not whitespace
quantifiers: specify how many repetitions of the pattern to match
*: matches at least 0 times
+: matches at least 1 time
?: matches at most 1 time
{n}: matches exactly n times
{n,}: matches at least n times
{n,m}: matches between n and m times
character clusters: use of parentheses to keep a pattern together
()
: use with pattern-matching characters to create groups
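As a rough illustration of the character classes and quantifiers listed above, here is a minimal sketch on a small made-up vector (the strings are hypothetical and not taken from the session's dataset). It also shows why [A-z] is a risky shortcut for letters.

```r
library(stringr)

# Hypothetical sample strings, made up for illustration (not from the Enron data)
x <- c("abc", "ABC_123", "a b", "42")

# POSIX classes go inside a bracket expression: [[:digit:]], not [:digit:]
has_digit_posix <- str_detect(x, "[[:digit:]]")
has_digit_short <- str_detect(x, "\\d")

# Quantifier {2,}: at least two digits in a row
has_two_digits <- str_detect(x, "\\d{2,}")

# [:space:] covers spaces, tabs and newlines
has_space <- str_detect(x, "[[:space:]]")

# [A-z] also matches the ASCII punctuation between "Z" and "a", such as "_"
underscore_in_A_z <- str_detect("_", "[A-z]")          # TRUE
underscore_in_letters <- str_detect("_", "[A-Za-z]")   # FALSE
```

Spelling out [A-Za-z], or using [[:alpha:]], avoids accidentally matching characters such as _ or ^.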
Dataset being used today
library(tidyverse)
enron <- read_csv("https://raw.githubusercontent.com/UBC-STAT/stat545.stat.ubc.ca/master/content/data/enron/enron.csv") %>% drop_na()
glimpse(enron)
## Rows: 214,195
## Columns: 3
## $ mail_num <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 10, 10, 10, 1…
## $ person <chr> "allen-p", "allen-p", "allen-p", "allen-p", "allen-p", "allen…
## $ email <chr> "Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>", …
head(enron, n=50)
## # A tibble: 50 x 3
## mail_num person email
## <dbl> <chr> <chr>
## 1 1 allen-p Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>
## 2 1 allen-p Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
## 3 1 allen-p From: phillip.allen@enron.com
## 4 1 allen-p To: tim.belden@enron.com
## 5 1 allen-p Subject:
## 6 1 allen-p Mime-Version: 1.0
## 7 1 allen-p Content-Type: text/plain; charset=us-ascii
## 8 1 allen-p Content-Transfer-Encoding: 7bit
## 9 1 allen-p X-From: Phillip K Allen
## 10 1 allen-p X-To: Tim Belden <Tim Belden/Enron@EnronXGate>
## # … with 40 more rows
Canonical principle #1: Basic pattern-matching
# Matching is case-sensitive: person values are lower-case ("allen-p"), so "Allen" finds no rows
enron %>% filter(str_detect(person, "Allen"))
## # A tibble: 0 x 3
## # … with 3 variables: mail_num <dbl>, person <chr>, email <chr>
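The empty result above comes from case sensitivity. A minimal sketch of one way around it, using stringr's regex() helper with ignore_case = TRUE on a small made-up vector shaped like the person column (the third value is hypothetical):

```r
library(stringr)

# Hypothetical values shaped like the person column; "Allen-X" is made up
person <- c("allen-p", "carson-m", "Allen-X")

# Default matching is case-sensitive
strict <- str_detect(person, "Allen")

# regex(..., ignore_case = TRUE) matches regardless of case
relaxed <- str_detect(person, regex("allen", ignore_case = TRUE))

print(strict)   # FALSE FALSE  TRUE
print(relaxed)  #  TRUE FALSE  TRUE
```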
str_subset(enron$email, "tracy.ngo")
## [1] "To: tracy.ngo@enron.com"
## [2] "To: tracy.ngo@enron.com"
## [3] "To: tim.belden@enron.com, steve.c.hall@enron.com, tracy.ngo@enron.com,"
str_view_all(enron$email, "tracy.ngo")
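One thing worth noting about the pattern "tracy.ngo": an unescaped . is itself a metacharacter that matches any character, so the pattern is slightly broader than it looks. The sketch below, on two sample strings (the second is made up), compares it with an escaped dot and with stringr's fixed() helper for literal matching.

```r
library(stringr)

# The second address is hypothetical, made up to show the difference
emails <- c("To: tracy.ngo@enron.com", "To: tracy_ngo@example.com")

any_char <- str_detect(emails, "tracy.ngo")         # "." matches any character
escaped  <- str_detect(emails, "tracy\\.ngo")       # "\\." matches a literal dot
literal  <- str_detect(emails, fixed("tracy.ngo"))  # whole pattern taken literally

print(any_char)  # TRUE  TRUE
print(escaped)   # TRUE FALSE
print(literal)   # TRUE FALSE
```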
Canonical principle #2: Anchors
^: matches the start of the string.
$: matches the end of the string.
\b: matches the empty string at either edge of a word. Don't confuse it with ^ and $, which mark the edges of a string.
\B: matches the empty string provided it is not at an edge of a word.
# select() with no columns keeps only the row count, hence the 0-column tibble below
enron %>% filter(str_detect(email, "@ECT")) %>% select()
## # A tibble: 6,524 x 0
enron %>% filter(str_detect(enron$email, "weekend$"))
## # A tibble: 45 x 3
## mail_num person email
## <dbl> <chr> <chr>
## 1 69 allen-p morning I sent you the roll did you get it? Did you need m…
## 2 94 carson-m Subject: This weekend
## 3 94 carson-m Subject: This weekend
## 4 95 carson-m Subject: Re: This weekend
## 5 69 davis-d Subject: Manual JE info for cutover weekend
## 6 69 davis-d Subject: Manual JE info for cutover weekend
## 7 69 davis-d Subject: Manual JE info for cutover weekend
## 8 1 dean-c Subject: RE: This weekend
## 9 1 dean-c Subject: RE: This weekend
## 10 1 dean-c Subject: RE: This weekend
## # … with 35 more rows
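To make the difference between string anchors (^, $) and word boundaries (\b) concrete, here is a small sketch on made-up subject lines (hypothetical values, not rows from the Enron data):

```r
library(stringr)

# Hypothetical subject lines, made up for illustration
subjects <- c("This weekend", "weekend plans", "weekends away")

ends_with   <- str_detect(subjects, "weekend$")       # string ends with "weekend"
starts_with <- str_detect(subjects, "^weekend")       # string starts with "weekend"
whole_word  <- str_detect(subjects, "\\bweekend\\b")  # "weekend" as a whole word

print(ends_with)    #  TRUE FALSE FALSE
print(starts_with)  # FALSE  TRUE  TRUE
print(whole_word)   #  TRUE  TRUE FALSE
```

Note that "weekends away" starts with "weekend" but does not contain it as a whole word, because the trailing s removes the word boundary.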
Canonical principle #3: Escape characters
x <- c("123-456-7890", "(123)456-7890", "(123) 456-7890", "1235-2351")
str_view(x, "\\(\\d\\d\\d\\)\\d\\d\\d-\\d\\d\\d\\d")
123-456-7890
(123)456-7890
(123) 456-7890
1235-2351
str_view("so it goes $^$ here", "\\$\\^\\$")
- so it goes $^$ here
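The doubled backslashes can be confusing: R first interprets the string literal, and only then does the regex engine see the pattern. A minimal sketch of how one might inspect what the engine actually receives, reusing the pattern from the example above:

```r
library(stringr)

# In R source, each regex backslash is written as "\\"
pattern <- "\\$\\^\\$"

# writeLines() shows the string as the regex engine sees it: \$\^\$
writeLines(pattern)

# The string holds 6 characters: three backslashes and three symbols
n_chars <- nchar(pattern)

# Each escaped symbol is matched literally
found <- str_extract("so it goes $^$ here", pattern)
print(found)  # "$^$"
```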
Canonical principle #4: Character Classes
str_view(stringr::words, "^[yx]", match=TRUE)
year
yes
yesterday
yet
you
young
str_view(stringr::words, "[^e]ed$", match = TRUE)
bed
hundred
red
str_view(c("red", "reed"), "[^e]ed$", match = FALSE)
- reed
Canonical principle #5: Quantifiers
str_view(stringr::words, "^(thr)*", match = TRUE)
a
able
about
absolute
accept
account
achieve
across
act
active
actual
add
address
admit
advertise
affect
afford
after
afternoon
again
against
age
agent
ago
x <- c("123-456-7890", "(123)456-7890", "(123) 456-7890", "1235-2351")
str_view(x, "\\([0-9][0-9][0-9]\\)[ ]*[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]")
123-456-7890
(123)456-7890
(123) 456-7890
1235-2351
x <- c("123456-7890", "(123) 456-7890", "(123)456-7890", "1235-2351")
str_view(x, "\\([0-9][0-9][0-9]\\)[ ]?[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]")
123456-7890
(123) 456-7890
(123)456-7890
1235-2351
x <- c("4444-22-22", "test", "333-4444-22")
str_view(x, "\\d{4}-\\d{2}-\\d{2}")
4444-22-22
test
333-4444-22
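Since str_view() output is interactive and does not survive in plain text, the sketch below uses str_extract() and str_detect() on the same vectors to show what the quantified patterns actually match. The compact phone pattern with {3} and {4} is an assumed rewrite of the longer [0-9][0-9][0-9] version, not code from the session.

```r
library(stringr)

x <- c("4444-22-22", "test", "333-4444-22")

# {n}: exactly n repetitions; only the first string has the full 4-2-2 shape
full_match <- str_extract(x, "\\d{4}-\\d{2}-\\d{2}")
print(full_match)    # "4444-22-22" NA NA

# {n,m}: between n and m repetitions; the first run of 3 to 4 digits is returned
digit_runs <- str_extract(x, "\\d{3,4}")
print(digit_runs)    # "4444" NA "333"

# ?: optional space after the area code
phones <- c("123456-7890", "(123) 456-7890", "(123)456-7890", "1235-2351")
is_phone <- str_detect(phones, "\\(\\d{3}\\) ?\\d{3}-\\d{4}")
print(is_phone)      # FALSE  TRUE  TRUE FALSE
```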
Canonical principle #6: Character Clusters
enron %>% filter(str_detect(email, "@.*\\.(edu|net)")) %>% select(email)
## # A tibble: 1,646 x 1
## <chr>
## 1 "<retwell@mail.sanmarcos.net>"
## 2 "cc: \"Larry Lewter\" <retwell@mail.sanmarcos.net>, \"Claudia L. Crocker\""
## 3 "\"Bob McKinney\" <capstone@texas.net> on 11/27/2000 09:46:13 AM"
## 4 "To: \"Capstone\" <capstone@texas.net>"
## 5 "Brian_Hoskins@enron.net"
## 6 "Brian_Hoskins@enron.net"
## 7 "Brian_Hoskins@enron.net"
## 8 "Brian_Hoskins@enron.net"
## 9 "To: adam.r.bayer@vanderbilt.edu"
## 10 "X-To: \"Adam Bayer\" <adam.r.bayer@vanderbilt.edu> @ ENRON"
## # … with 1,636 more rows
enron %>% filter(str_detect(email, "@.*(ns)\\.(net)")) %>% select(email)
## # A tibble: 6 x 1
## <chr>
## 1 "\"Karen Edson\" <kedson@ns.net> on 07/08/2000 03:06:40 PM"
## 2 "cc: \"Julee Malinowski-Ball (E-mail)\" <jmball@ns.net>, \"Ray McNally (E-mai…
## 3 "kedson@ns.net"
## 4 "<fotinb@bc-mail.com>; \"Bill Hannah\" <hannahs@wans.net>; \"Bill Harvey\""
## 5 "\"Harvey Wax\" <HLWAX@aol.com>; \"J. D Zikuda\" <jdzikuda@netins.net>; \"Jam…
## 6 "<rndyhbnr@midplains.net>; \"Ray Clary\" <rclrec@mindspring.com>; \"Rich Hari…
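Parentheses do more than keep alternatives like (edu|net) together: each group is also captured. As a hedged sketch, stringr's str_match() can pull the captured pieces out into a matrix. The sample lines below are copied from outputs shown earlier in this post, and the grouping around .* is added here for illustration.

```r
library(stringr)

# Sample lines taken from the outputs shown earlier in this post
lines <- c("To: adam.r.bayer@vanderbilt.edu",
           "kedson@ns.net",
           "From: phillip.allen@enron.com")

# Column 1 is the full match; columns 2 and 3 are the two groups
m <- str_match(lines, "@(.*)\\.(edu|net)")

domain <- m[, 2]
suffix <- m[, 3]

print(domain)  # "vanderbilt" "ns" NA
print(suffix)  # "edu" "net" NA
```

The third line yields NA because its address ends in .com, which neither alternative in (edu|net) matches.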
Material has been borrowed heavily from
the STAT 545 course. This course was started by Jenny Bryan: https://stat545.stat.ubc.ca/notes/notes-b05/
More STAT 545 resources: https://stat545.com/character-vectors.html, https://youtu.be/I0dJ1zpxAtU
R for Data Science chapter on Strings: https://r4ds.had.co.nz/strings.html
Solution set for R4DS on Strings: https://brshallo.github.io/r4ds_solutions/14-strings.html#matching-patterns-w-regex
FUN TIME: Regex Puzzle Builder: https://regexcrossword.com/puzzlebuilder
In Processing Textual Data with Python, from the CODATA Connect Series on Research Skills Enhancement, Raphael Cobe gives a talk on how to use Regular Expressions to find patterns or remove useless data.
YouTube Link
Watch the full recording of the meetup session, and subscribe to the R-Ladies Gaborone channel to get notifications of newly uploaded videos.