Using R for Regular Expressions *RegEx

R
Natural Language Processing
Regular Expression
Demonstration
Author

Simisani Ndaba

Published

April 9, 2022

Meetup Description

R-Ladies Gaborone joined forces with R-Ladies Cologne (on Twitter and on Fosstodon) to co-host an event on Using R for Regular Expressions [*RegEx] on Saturday, April 09, 2022, at 6 PM CET/CAT.
Guest speaker Pavitra Chakravarty guided us through learning R using the stringr and stringi packages, essential and useful skills for programming 🚀

About Speaker


Pavitra Chakravarty is a data engineer with a PhD in Cancer Nanotechnology and a member of R-Ladies Dallas.

Contact Speaker


Follow Pavitra on X

Regular Expressions [*RegEx]

Note

Get more from the slides, code and notes in the Regular Expressions GitHub repo.

What are regular expressions?

  • A regular expression is a pattern that describes a specific set of strings with a common structure

  • Heavily used for string matching and replacing in most programming languages

  • Heart and soul for string operations

Regular expression syntax

6 basic canonical characteristics of regular expressions

  • basic pattern matching: using functions from the stringr package to match an exact sequence of characters

    • str_detect(), str_subset(), str_view(), str_view_all()
  • anchors: indicate the start and end of a string

    • ^: indicates the start of the string, $: indicates the end of the string
  • escape characters: special characters cannot be written directly in a string

    • \: to find strings containing a single quote ', "escape" the quote by preceding it with \
  • character classes: specify entire classes of characters, such as numbers, letters, etc using either [: and :] around predefined name or \ and a special character

    • [:digit:] or \d: digits, 0 1 2 3 4 5 6 7 8 9, equivalent to [0-9]

    • \D: non-digits, equivalent to [^0-9]

    • [:lower:]: lower-case letters, equivalent to [a-z]

    • [:upper:]: upper-case letters, equivalent to [A-Z]

    • [:alpha:]: alphabetic characters, equivalent to [[:lower:][:upper:]] or [A-Za-z]

    • [:alnum:]: alphanumeric characters, equivalent to [[:alpha:][:digit:]] or [A-Za-z0-9]

    • \w: word characters, equivalent to [[:alnum:]_] or [A-Za-z0-9_]

    • \W: not word, equivalent to [^A-Za-z0-9_]

    • [:blank:]: blank characters, i.e. space and tab

    • [:space:]: space characters: tab, newline, vertical tab, form feed, carriage return, space

    • \s: any whitespace character

    • \S: any non-whitespace character

  • quantifiers: specify how many repetitions of the pattern to match

    • *: matches at least 0 times

    • +: matches at least 1 time

    • ?: matches at most 1 time

    • {n}: matches exactly n times

    • {n,}: matches at least n times

    • {n,m}: matches between n and m times

  • character clusters: use of parentheses to keep a pattern together

    • (): use with pattern-matching characters to create groups
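
As a small illustration of the character-class notation above, the sketch below checks that three spellings of "a digit" select the same elements. The vector `samples` is made up for this example and is ASCII-only. Note that inside an R string literal the backslash itself must be doubled, so the regex \d is typed "\\d".

```r
library(stringr)

# Hypothetical sample strings (not from the Enron data)
samples <- c("room 101", "no digits here", "42", "x_y")

# Three ways to say "contains a digit"
by_escape <- str_detect(samples, "\\d")          # \d, typed with a doubled backslash
by_posix  <- str_detect(samples, "[[:digit:]]")  # POSIX class inside a bracket expression
by_range  <- str_detect(samples, "[0-9]")        # explicit range

print(data.frame(samples, by_escape, by_posix, by_range))
```

On this ASCII input all three columns agree; only "room 101" and "42" contain a digit.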

Dataset being used today

library(tidyverse)

enron <- read_csv("https://raw.githubusercontent.com/UBC-STAT/stat545.stat.ubc.ca/master/content/data/enron/enron.csv") %>% drop_na()

glimpse(enron)

## Rows: 214,195

## Columns: 3

## $ mail_num <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 10, 10, 10, 1…

## $ person <chr> "allen-p", "allen-p", "allen-p", "allen-p", "allen-p", "allen…

## $ email <chr> "Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>", …

head(enron, n=50)
## # A tibble: 50 x 3

## mail_num person email

## <dbl> <chr> <chr>

## 1 1 allen-p Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>

## 2 1 allen-p Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)

## 3 1 allen-p From: phillip.allen@enron.com

## 4 1 allen-p To: tim.belden@enron.com

## 5 1 allen-p Subject:

## 6 1 allen-p Mime-Version: 1.0

## 7 1 allen-p Content-Type: text/plain; charset=us-ascii

## 8 1 allen-p Content-Transfer-Encoding: 7bit

## 9 1 allen-p X-From: Phillip K Allen

## 10 1 allen-p X-To: Tim Belden <Tim Belden/Enron@EnronXGate>

## # … with 40 more rows

Canonical principle #1: Basic pattern-matching

# Matching is case-sensitive: person values are lower-case (e.g. "allen-p"), so "Allen" finds no rows
enron %>% filter(str_detect(person, "Allen"))
## # A tibble: 0 x 3

## # … with 3 variables: mail_num <dbl>, person <chr>, email <chr>

str_subset(enron$email, "tracy.ngo")
## [1] "To: tracy.ngo@enron.com"                                               

## [2] "To: tracy.ngo@enron.com"

## [3] "To: tim.belden@enron.com, steve.c.hall@enron.com, tracy.ngo@enron.com,"

str_view_all(enron$email, "tracy.ngo")
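
To make the behaviour of basic matching easier to see on a small scale, here is a minimal sketch on a made-up vector. The `headers` values are hypothetical, only loosely modelled on the Enron lines. It shows two things worth keeping in mind: patterns are case-sensitive unless told otherwise, and an unescaped "." matches any character.

```r
library(stringr)

# Hypothetical header lines, loosely modelled on the Enron data
headers <- c("To: tracy.ngo@enron.com",
             "From: Phillip.Allen@enron.com",
             "To: tracyXngo@example.com")

# Patterns are case-sensitive by default
case_sensitive   <- str_detect(headers, "allen")
case_insensitive <- str_detect(headers, regex("allen", ignore_case = TRUE))

# An unescaped "." matches any character, so "tracyXngo" is kept as well
loose <- str_subset(headers, "tracy.ngo")

# fixed() treats the pattern as literal text, so only a real dot matches
strict <- str_subset(headers, fixed("tracy.ngo"))

print(loose)
print(strict)
```

Whether `regex(ignore_case = TRUE)` or a lower-cased pattern is the better choice depends on the data; the sketch only shows that both options exist.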

Canonical principle #2: Anchors

  • ^: matches the start of the string.

  • $: matches the end of the string.

  • \b: matches the empty string at either edge of a word. Don’t confuse it with ^ and $, which mark the edges of a string.

  • \B: matches the empty string provided it is not at an edge of a word.
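
The difference between string anchors and word boundaries can be sketched on a few made-up subject lines. The `subjects` vector below is hypothetical and not taken from the Enron data.

```r
library(stringr)

# Hypothetical subject lines (not taken from the Enron data)
subjects <- c("This weekend", "weekend plans", "long weekends ahead")

starts <- str_detect(subjects, "^weekend")       # "weekend" at the start of the string
ends   <- str_detect(subjects, "weekend$")       # "weekend" at the end of the string
whole  <- str_detect(subjects, "\\bweekend\\b")  # "weekend" as a whole word, anywhere

print(data.frame(subjects, starts, ends, whole))
```

The third string contains "weekends", so it fails the \b test: there is no word edge between "d" and "s".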

# select() with no column names keeps the matching rows but returns zero columns
enron %>% filter(str_detect(email, "@ECT")) %>% select()
## # A tibble: 6,524 x 0
enron %>% filter(str_detect(enron$email, "weekend$"))
## # A tibble: 45 x 3

## mail_num person email

## <dbl> <chr> <chr>

## 1 69 allen-p morning I sent you the roll did you get it? Did you need m…

## 2 94 carson-m Subject: This weekend

## 3 94 carson-m Subject: This weekend

## 4 95 carson-m Subject: Re: This weekend

## 5 69 davis-d Subject: Manual JE info for cutover weekend

## 6 69 davis-d Subject: Manual JE info for cutover weekend

## 7 69 davis-d Subject: Manual JE info for cutover weekend

## 8 1 dean-c Subject: RE: This weekend

## 9 1 dean-c Subject: RE: This weekend

## 10 1 dean-c Subject: RE: This weekend

## # … with 35 more rows

Canonical principle #3: Escape characters

x <- c("123-456-7890", "(123)456-7890", "(123) 456-7890", "1235-2351")

# Unescaped parentheses form a group rather than matching "(" and ")", so nothing in x matches
str_view(x, "(\\d\\d\\d)\\d\\d\\d-\\d\\d\\d\\d")

  • 123-456-7890

  • (123)456-7890

  • (123) 456-7890

  • 1235-2351

str_view("so it goes $^$ here", "\\$\\^\\$")
  • so it goes $^$ here
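
A minimal sketch of why escaping matters, using two phone-number strings in the style of the example above. The `phones` vector is made up for illustration. The same pattern is run once with bare parentheses and once with escaped ones.

```r
library(stringr)

# Hypothetical phone numbers in the style of the example above
phones <- c("(123)456-7890", "123456-7890")

# Bare parentheses only group the three digits; they match no literal "(" or ")"
unescaped <- str_detect(phones, "(\\d\\d\\d)\\d\\d\\d-\\d\\d\\d\\d")

# Escaped parentheses match the literal characters
escaped <- str_detect(phones, "\\(\\d\\d\\d\\)\\d\\d\\d-\\d\\d\\d\\d")

print(data.frame(phones, unescaped, escaped))

# writeLines() prints the pattern as the regex engine receives it,
# after R has removed one level of backslashes
writeLines("\\(\\d\\d\\d\\)")
```

The two columns come out as mirror images: the bare version matches only the string without parentheses, the escaped version only the string with them.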

Canonical principle #4: Character Classes

str_view(stringr::words, "^[yx]", match=TRUE)
  • year

  • yes

  • yesterday

  • yet

  • you

  • young

str_view(stringr::words, "[^e]ed$", match = TRUE)
  • bed

  • hundred

  • red

str_view(c("red", "reed"), "[^e]ed$", match = FALSE)
  • reed
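
The bracket patterns used above can be tried on a small vector to see exactly which strings survive. The `sample_words` vector below is made up for this sketch.

```r
library(stringr)

# Hypothetical words chosen to exercise each pattern
sample_words <- c("red", "reed", "bed", "yard", "xylophone")

# [yx] matches one character that is either "y" or "x"
starts_yx <- str_subset(sample_words, "^[yx]")

# [^e] matches one character that is anything except "e"
ed_not_eed <- str_subset(sample_words, "[^e]ed$")

# Counting matches: vowels versus everything else in "regex"
n_vowels <- str_count("regex", "[aeiou]")
n_other  <- str_count("regex", "[^aeiou]")

print(starts_yx)
print(ed_not_eed)
print(c(vowels = n_vowels, other = n_other))
```

Inside square brackets a leading ^ means "not", which is different from the ^ anchor used outside brackets.
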
Canonical principle #5: Quantifiers

str_view(stringr::words, "^(thr)*", match = TRUE)
  • a

  • able

  • about

  • absolute

  • accept

  • account

  • achieve

  • across

  • act

  • active

  • actual

  • add

  • address

  • admit

  • advertise

  • affect

  • afford

  • after

  • afternoon

  • again

  • against

  • age

  • agent

  • ago

x <- c("123-456-7890", "(123)456-7890", "(123) 456-7890", "1235-2351")

str_view(x, "\\([0-9][0-9][0-9]\\)[ ]*[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]")

  • 123-456-7890

  • (123)456-7890

  • (123) 456-7890

  • 1235-2351

x <- c("123456-7890", "(123) 456-7890", "(123)456-7890", "1235-2351")

str_view(x, "\\([0-9][0-9][0-9]\\)[ ]?[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]")

  • 123456-7890

  • (123) 456-7890

  • (123)456-7890

  • 1235-2351

x <- c("4444-22-22", "test", "333-4444-22")

str_view(x, "\\d{4}-\\d{2}-\\d{2}")

  • 4444-22-22

  • test

  • 333-4444-22
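
Since str_view() output loses its highlighting in plain text, the following sketch checks a few quantifiers with str_detect() and str_extract() instead. The `dates` and `spellings` vectors are made up for this example, and the anchors ^ and $ are added so that the whole string has to fit the pattern.

```r
library(stringr)

# Hypothetical date-like strings
dates <- c("4444-22-22", "333-4444-22", "2022-4-9")

# {n}: exactly n repetitions
strict_date <- str_detect(dates, "^\\d{4}-\\d{2}-\\d{2}$")

# {n,m}: between n and m repetitions
loose_date <- str_detect(dates, "^\\d{4}-\\d{1,2}-\\d{1,2}$")

# {n,}: at least n repetitions; str_extract() returns the first such run
first_run <- str_extract(dates, "\\d{3,}")

# ?: the preceding character may appear at most once
spellings <- c("colour", "color", "colouur")
optional_u <- str_detect(spellings, "^colou?r$")

print(data.frame(dates, strict_date, loose_date, first_run))
print(data.frame(spellings, optional_u))
```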

Canonical principle #6: Character Clusters

enron %>% filter(str_detect(email, "@.*\\.(edu|net)")) %>% select(email)
## # A tibble: 1,646 x 1

## email

## <chr>

## 1 "<retwell@mail.sanmarcos.net>"

## 2 "cc: \"Larry Lewter\" <retwell@mail.sanmarcos.net>, \"Claudia L. Crocker\""

## 3 "\"Bob McKinney\" <capstone@texas.net> on 11/27/2000 09:46:13 AM"

## 4 "To: \"Capstone\" <capstone@texas.net>"

## 5 "Brian_Hoskins@enron.net"

## 6 "Brian_Hoskins@enron.net"

## 7 "Brian_Hoskins@enron.net"

## 8 "Brian_Hoskins@enron.net"

## 9 "To: adam.r.bayer@vanderbilt.edu"

## 10 "X-To: \"Adam Bayer\" <adam.r.bayer@vanderbilt.edu> @ ENRON"

## # … with 1,636 more rows

enron %>% filter(str_detect(email, "@.*(ns)\\.(net)")) %>% select(email)
## # A tibble: 6 x 1

## email

## <chr>

## 1 "\"Karen Edson\" <kedson@ns.net> on 07/08/2000 03:06:40 PM"

## 2 "cc: \"Julee Malinowski-Ball (E-mail)\" <jmball@ns.net>, \"Ray McNally (E-mai…

## 3 "kedson@ns.net"

## 4 "<fotinb@bc-mail.com>; \"Bill Hannah\" <hannahs@wans.net>; \"Bill Harvey\""

## 5 "\"Harvey Wax\" <HLWAX@aol.com>; \"J. D Zikuda\" <jdzikuda@netins.net>; \"Jam…

## 6 "<rndyhbnr@midplains.net>; \"Ray Clary\" <rclrec@mindspring.com>; \"Rich Hari…
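
Parentheses do more than hold alternatives together: each group can also be pulled out separately. The sketch below uses str_match() from stringr on a few made-up lines (the `addr` vector is hypothetical, modelled on the output above) to split off the domain name and its ending.

```r
library(stringr)

# Hypothetical lines modelled on the filtered output above
addr <- c("To: adam.r.bayer@vanderbilt.edu",
          "<capstone@texas.net>",
          "kedson@ns.net",
          "phillip.allen@enron.com")

# Alternation inside a group: the address must end in .edu or .net
is_edu_or_net <- str_detect(addr, "@.*\\.(edu|net)")

# str_match() returns a matrix: column 1 is the full match,
# then one column per group; rows without a match are NA
m <- str_match(addr, "@(.*)\\.(edu|net)")
domain <- unname(m[, 2])
ending <- unname(m[, 3])

print(data.frame(addr, is_edu_or_net, domain, ending))
```

Because .* can absorb any run of characters, a pattern such as "@.*(ns)\\.(net)" also accepts domains that merely end in "ns", which is why "wans.net" and "netins.net" appear in the output above.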

Material has been borrowed heavily from