Regex
Regex
- Language for specifying text search strings
- Algebraic notation for characterizing a set of strings
- Basic regular expression
- Match the "word" /word/
- Match the "word" or "Word" /[wW]ord/
- Match single digit /[0-9]/
- Ranges
- Capital Letters /[A-Z]/
- Lower Case Letters /[a-z]/
- Single Digit /[0-9]/
- Caret
- Exclusions
- Not an upper case letter /[^A-Z]/
- Not a period /[^.]/
- If caret is not the first character, it's treated as any other character
- Question Mark
- Preceding character or nothing
- "word" or "words" /words?/
- "colour" or "color" /colou?r/
- Kleene *
- Zero or more occurrences
- Zero or more "a" /a*/
- Zero or more "a"s or "b"s /[ab]*/
- Kleene +
- One or more occurrences
- One or more digits /[0-9]+/
- Wildcard
- Match any single expression
- Any character between "beg" and "n" /beg.n/
- Anchors
- Start of the line ^
- Lines starting with "the" /^The/
- End of the line $
- Lines ending with period /.$/
- Word boundary \b /\bthe\b/
- Grouping
- Disjunction "|"
- Match either cat or dog /cat|dog/
- Parentheses ()
- Match "guppy" or "guppies" /gupp(y|ies)/
- Example
- /(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/
- At the start or a non-alphabetic character
- At the end or non-alphabetic character
- Look for "the" or "The"
- Operators
- Any digit \d
- Any non-digit \D
- Whitespace \s
- Non-whitespace \S
- Any alphanumeric \w
- Non Alpha-numeric \W
- Range
- Zero or more *
- One or more +
- Exactly zero or one ?
- N Occurrences {n}
- N-to-M Occurrences {n,m}
- At least N Occurrences {n,}
- Up to M Occurrences {,m}
Words
- Utterance is the spoken correlate of a sentence
- Disfluency
- Fragments: broken off words
- Fillers or Filled Pauses "um"
- Lemma: Lexical form of the same word (cats vs cat)
- Types (V): Number of distinct words
- Tokens (N): Number of running words
- Heap's Law: $V = K N^\beta$
Text Normalization
- Involves three steps
- Tokenizing Words
- Normalizing word formats
- Segmenting sentences
- Tokenization
- Breaking up an utterance into tokens
- Penn Treebank Tokenization (standard in many NLP applications)
- NLTK Regex Tokenization (flexible rule-based approach)
- Byte Pair Encoding
- Empirically determine the tokens using data
- Useful in dealing with unseen words (OOV problem)
- Use subword tokens which are arbitrary substrings
- Token Learner: Creates vocabulary out of corpus
- Token Segmentor: Applies token learner on raw test data
- BPE Token Learner
- Starts with individual characters as vocab
- Merges the most frequently occurring pairs and adds them to vocab
- Repeats the count and merge process to create longer substrings
- Continues until target vocab size is reached
- No merging across word boundaries
- BPE Token Parser
- Run the token learner on test data
- Apply merges in the same order in which tokens were created
- First split into individual characters
- Merge the characters based on BPE vocab
Word Normalization
- Putting words and tokens in a standard format
- Case Folding: Convert everything to lowercase
- Simple but loses information (US vs. us, Apple vs. apple)
- Lemmatization: Reduce words to roots
- Maps variants to the same base form (am, are, is → be)
- Stemming (removes suffixes like -ing, -ed, etc.)
- Porter Stemming (algorithm for English stemming)
- Typically requires POS information for accuracy
Edit Distance
- Similarity between two strings
- Minimum number of editing operations needed to transform one string into another
- Insertion, Deletion and Substitution
- Levenshtein Distance: All three operations have the same cost (typically 1)
- Dynamic Programming solution:
- Create a matrix D[i,j] representing distance between first i chars of string1 and first j chars of string2
- Initialize: D[i,0] = i, D[0,j] = j
- Fill matrix: D[i,j] = min(D[i-1,j]+1, D[i,j-1]+1, D[i-1,j-1]+cost)
- where cost = 0 if string1[i-1] = string2[j-1], otherwise 1
- Final distance is in D[m,n]
- Viterbi Algorithm is a related DP algorithm used for finding most likely sequence of hidden states