Regular Expressions and Text Processing
Text processing is the foundation of NLP. Before we can analyze language, we need to handle raw text: finding patterns, breaking text into words, and normalizing variations.
The Big Picture
The Problem: Raw text is messy. We need systematic ways to:
- Find patterns in text
- Break text into meaningful units (tokens)
- Handle variations (color vs. colour, ran vs. running)
The Tools:
- Regular expressions: Powerful pattern matching
- Tokenization: Splitting text into words/subwords
- Normalization: Standardizing text formats
Regular Expressions
What Are Regular Expressions?
A regex is a language for specifying text patterns. Think of it as "find" on steroids.
Example: Find all email addresses in a document.
- Without regex: Write complex code with many if statements
- With regex:
/[\w.]+@[\w.]+\.\w+/
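For example, Python's re module can apply this same simplified pattern (real-world email matching is messier); the text and addresses below are made up for illustration:

```python
import re

text = "Contact us at support@example.com or sales@widgets.co.uk for help."

# The simplified pattern above, written as a Python raw string.
emails = re.findall(r"[\w.]+@[\w.]+\.\w+", text)
print(emails)  # ['support@example.com', 'sales@widgets.co.uk']
```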
Basic Patterns
| Pattern | Meaning | Example Match |
|---|---|---|
| /word/ | Exact match | "word" |
| /[wW]ord/ | Either character | "word" or "Word" |
| /[0-9]/ | Any digit | "0", "5", "9" |
| /[a-z]/ | Any lowercase | "a", "m", "z" |
| /[A-Z]/ | Any uppercase | "A", "M", "Z" |
Negation with Caret
[^...] means "NOT these characters":
| Pattern | Meaning |
|---|---|
| /[^A-Z]/ | Not an uppercase letter |
| /[^0-9]/ | Not a digit |
| /[^.]/ | Not a period |
Note: Caret only means negation when it's the FIRST character inside brackets!
Quantifiers (How Many?)
| Symbol | Meaning | Example |
|---|---|---|
| ? | Zero or one | /colou?r/ matches "color" and "colour" |
| * | Zero or more | /a*/ matches "", "a", "aaa" |
| + | One or more | /[0-9]+/ matches "1", "42", "12345" |
| {n} | Exactly n | /a{3}/ matches "aaa" |
| {n,m} | Between n and m | /a{2,4}/ matches "aa", "aaa", "aaaa" |
| {n,} | At least n | /a{2,}/ matches "aa", "aaa", ... |
Wildcards and Anchors
Wildcard: . matches any single character
/beg.n/ matches "begin", "began", "begun"
Anchors (position markers):
| Symbol | Meaning | Example |
|---|---|---|
| ^ | Start of line | /^The/ matches lines starting with "The" |
| $ | End of line | /\.$/ matches lines ending with a period |
| \b | Word boundary | /\bthe\b/ matches "the" but not "there" |
Grouping and Alternatives
Disjunction (|): Match either pattern
/cat|dog/ matches "cat" or "dog"
Parentheses: Group patterns
/gupp(y|ies)/ matches "guppy" or "guppies"
Character Classes (Shortcuts)
| Shortcut | Meaning | Equivalent |
|---|---|---|
| \d | Digit | [0-9] |
| \D | Non-digit | [^0-9] |
| \w | Word character | [a-zA-Z0-9_] |
| \W | Non-word | [^a-zA-Z0-9_] |
| \s | Whitespace | [ \t\n\r] |
| \S | Non-whitespace | [^ \t\n\r] |
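A quick Python sanity check of a few of these shortcuts; the phone-number-style pattern is just a made-up illustration:

```python
import re

# \d{3}-\d{4}: three digits, a hyphen, then four digits
print(re.findall(r"\d{3}-\d{4}", "Call 555-0142 or 555-0199"))
# ['555-0142', '555-0199']

# \w+ grabs runs of word characters; \s+ matches runs of whitespace
print(re.findall(r"\w+", "hello,   world"))   # ['hello', 'world']
print(re.split(r"\s+", "hello   world now"))  # ['hello', 'world', 'now']
```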
Putting It Together
Find standalone "the" or "The":
/(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/
Breaking it down:
- (^|[^a-zA-Z]): start of line OR a non-letter before
- [tT]he: "the" or "The"
- ([^a-zA-Z]|$): a non-letter after OR end of line
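Here is the same pattern tried in Python; because the pattern contains capture groups, re.finditer with group(0) is used to show the full matches:

```python
import re

pattern = r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)"
text = "The other day, the theme was there."

# group(0) is the full match, including the non-letter context characters.
for m in re.finditer(pattern, text):
    print(repr(m.group(0)))
# 'The '
# ' the '
# "theme" and "there" are correctly skipped.
```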
Words and Tokens
What Is a Word?
Seems simple, but it's surprisingly tricky!
Challenges:
- Contractions: Is "don't" one word or two?
- Hyphenation: Is "ice-cream" one word or two?
- Languages without spaces: Chinese, Japanese
- Multi-word expressions: "New York", "kick the bucket"
Key Terminology
| Term | Definition | Example |
|---|---|---|
| Token | An instance of a word/symbol | "the cat sat" has 3 tokens |
| Type | A unique word in vocabulary | "the cat sat on the mat" has 5 types |
| Lemma | Base/dictionary form | "runs", "ran", "running" → "run" |
| Utterance | Spoken equivalent of sentence | What you actually say |
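A tiny Python sketch to make the token/type distinction concrete (whitespace splitting and the lack of any normalization are simplifications):

```python
text = "the cat sat on the mat"

tokens = text.split()   # every occurrence counts as a token
types = set(tokens)     # the set of distinct words (types)

print(len(tokens))  # 6 tokens
print(len(types))   # 5 types ("the" appears twice)
```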
Heaps' Law
Vocabulary size grows with corpus size, but sublinearly: $$V = K \cdot N^\beta$$
Where:
- V = vocabulary size (types)
- N = corpus size (tokens)
- β ≈ 0.5–0.7, K ≈ 10–100
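For intuition, pick illustrative values from these ranges, say K = 40 and β = 0.6: a corpus of N = 1,000,000 tokens gives V ≈ 40 × 1,000,000^0.6 ≈ 40 × 3,981 ≈ 160,000 types, far fewer than linear growth would predict.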
Implication: You'll always encounter new words!
Text Normalization
Three main steps:
- Tokenization: Break into tokens
- Normalization: Standardize format
- Segmentation: Find sentence boundaries
Tokenization Approaches
Rule-Based (Penn Treebank):
- Standard for English NLP
- Specific rules for punctuation, contractions
Regex-Based (NLTK):
- Flexible, customizable
- Good for specific domains (a small regex tokenizer sketch follows this list)
Subword (BPE):
- Data-driven
- Handles unknown words gracefully
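As an illustration of the regex-based approach, here is a minimal tokenizer sketch using Python's re module (NLTK's RegexpTokenizer works along similar lines); the particular pattern is only an assumption for demonstration:

```python
import re

# One possible pattern: keep word-internal apostrophes and hyphens attached,
# and treat any other non-space symbol as a token of its own.
TOKEN_PATTERN = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")

def tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(tokenize("Don't eat the ice-cream, it's John's!"))
# ["Don't", 'eat', 'the', 'ice-cream', ',', "it's", "John's", '!']
```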
Byte Pair Encoding (BPE)
The Problem: What about words we've never seen?
- Traditional tokenizers fail on "unigoogleable"
- OOV (out-of-vocabulary) tokens hurt performance
The Solution: Learn tokens from data!
BPE Algorithm:
- Start: Vocabulary = individual characters
- Count: Find most frequent adjacent pair
- Merge: Add merged pair to vocabulary
- Repeat: Until target vocabulary size reached
Example:
Corpus: "low lower lowest"
Initial vocab: {l, o, w, e, r, s, t, _}
Step 1: Most frequent pair = "lo" → add "lo"
Step 2: Most frequent pair = "low" → add "low"
...
At test time: Apply merges in the same order they were learned.
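A minimal, illustrative sketch of the BPE training loop on the toy corpus above (real tokenizers such as those in SentencePiece or Hugging Face add word-frequency counts, special markers, and many optimizations):

```python
from collections import Counter

def pair_counts(words):
    """Count adjacent symbol pairs across all words."""
    counts = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += 1
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Each word becomes a list of characters plus an end-of-word marker "_".
words = [list(w) + ["_"] for w in "low lower lowest".split()]

merges = []                      # the learned merge list, in order
for _ in range(4):               # target number of merges (an arbitrary choice)
    best = pair_counts(words).most_common(1)[0][0]
    merges.append(best)
    words = merge_pair(words, best)

print(merges)  # starts with ('l', 'o') and ('lo', 'w'), as in the example
print(words)
```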
Benefits:
- No unknown words (can always fall back to characters)
- Common words stay whole
- Rare words split into meaningful pieces
Word Normalization
Case Folding
Convert to lowercase:
- "Apple" → "apple"
- "HELLO" → "hello"
Trade-off: Loses information!
- "US" (country) vs. "us" (pronoun)
- "Apple" (company) vs. "apple" (fruit)
Lemmatization
Reduce to base form:
- "running", "ran", "runs" → "run"
- "better", "best" → "good"
Requires: Understanding of morphology and often part-of-speech.
Stemming
Cruder approach — just chop suffixes:
- "running" → "run"
- "happily" → "happili" (imperfect!)
Porter Stemmer: Most common algorithm for English.
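If NLTK is installed, a short comparison sketch (the WordNet data must be downloaded once; the outputs shown are what these tools typically produce):

```python
# pip install nltk, then once: import nltk; nltk.download("wordnet")
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["running", "happily"]:
    print(word, "->", stemmer.stem(word))
# running -> run
# happily -> happili   (stems are not always real words)

# The lemmatizer benefits from a part-of-speech hint ("v" = verb).
print(lemmatizer.lemmatize("running", pos="v"))  # run
```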
Edit Distance
The Problem
How similar are two strings?
Applications:
- Spell checking
- DNA sequence alignment
- Plagiarism detection
Levenshtein Distance
Minimum number of single-character edits:
- Insert: cat → cats
- Delete: cats → cat
- Substitute: cat → cot
Dynamic Programming Solution
Build a matrix D where D[i,j] = distance between first i chars of string1 and first j chars of string2.
Initialization:
- D[i,0] = i (delete all characters)
- D[0,j] = j (insert all characters)
Recurrence:
```
D[i,j] = min(
    D[i-1,j] + 1,       # deletion
    D[i,j-1] + 1,       # insertion
    D[i-1,j-1] + cost   # substitution (cost = 0 if characters match, else 1)
)
```
Example: Distance between "kitten" and "sitting"
- Answer: 3 (k→s, e→i, +g)
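A compact Python sketch of this dynamic program, using unit costs for all three operations as above:

```python
def edit_distance(s, t):
    """Levenshtein distance between strings s and t."""
    m, n = len(s), len(t)
    # D[i][j] = distance between the first i chars of s and first j chars of t
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                      # delete all i characters
    for j in range(n + 1):
        D[0][j] = j                      # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,         # deletion
                D[i][j - 1] + 1,         # insertion
                D[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return D[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```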
Summary
| Concept | Purpose | Key Tool |
|---|---|---|
| Regex | Find patterns | Pattern language |
| Tokenization | Split text | BPE, rules |
| Normalization | Standardize | Lemmatization, case folding |
| Edit Distance | Measure similarity | Dynamic programming |
Practical Tips
- Always tokenize first before any NLP task
- BPE is the modern standard for neural models
- Be careful with normalization — you might lose important information
- Regex takes practice — use online testers to experiment!