pygarble

Detect gibberish, garbled text, and nonsense with high precision.

A zero-dependency Python library for identifying random character sequences, keyboard mashing, encoding errors, and other forms of text corruption. Uses statistical analysis, phonotactic rules, and pattern matching to distinguish meaningful text from gibberish.

Installation

pip install pygarble

Quick Start

from pygarble import GarbleDetector, EnsembleDetector, Strategy

# Recommended: Use the default ensemble (99.5% precision)
detector = EnsembleDetector()
detector.predict("Hello world")      # False - valid text
detector.predict("asdfghjkl")        # True - keyboard mashing
detector.predict("qxzjkwp")          # True - impossible letter combinations

# Get probability scores (0.0 = valid, 1.0 = gibberish)
detector.predict_proba("Hello world")  # ~0.1
detector.predict_proba("xkqzjwp")      # ~0.9

# Batch processing
texts = ["Hello world", "asdfghjkl", "Normal sentence here"]
results = detector.predict(texts)      # [False, True, False]

Performance

Tested on 1,644 samples (dictionary words, sentences, random strings, keyboard mashing):

Detector             Precision   Recall   F1 Score
EnsembleDetector()   99.5%       78.5%    87.8%
MARKOV_CHAIN         98.8%       86.4%    92.2%
BIGRAM_PROBABILITY   100%        33.6%    50.3%

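The default ensemble prioritizes precision (minimizing false positives) over recall: it misses roughly one in five gibberish samples (78.5% recall) but almost never flags valid text. If recall matters more, MARKOV_CHAIN alone gives up 0.7 points of precision for 86.4% recall and the highest F1 in this table.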
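The F1 column is the harmonic mean of precision and recall, so it can be rechecked from the other two columns. The short sketch below does that with the rounded figures from the table; the helper name `f1_score` is illustrative and not part of pygarble.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the table above.
reported = {
    "EnsembleDetector()": (0.995, 0.785),
    "MARKOV_CHAIN": (0.988, 0.864),
    "BIGRAM_PROBABILITY": (1.000, 0.336),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_score(p, r):.1%}")
# EnsembleDetector(): F1 = 87.8%
# MARKOV_CHAIN: F1 = 92.2%
# BIGRAM_PROBABILITY: F1 = 50.3%
```

Because the harmonic mean is pulled toward the smaller value, BIGRAM_PROBABILITY ends up near 50% despite perfect precision.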

Detection Strategies

Strategy             Description                                             Precision
MARKOV_CHAIN         Character transition probabilities trained on English   98.8%
NGRAM_FREQUENCY      Common English trigram analysis                         96.3%
WORD_LOOKUP          Dictionary of 50K English words                         92.7%
BIGRAM_PROBABILITY   Impossible letter pairs (qx, jj, zx)                    100%
LETTER_POSITION      Letters in impossible positions                         99.0%
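To make the "impossible letter pairs" idea concrete, here is a toy check that flags text containing any bigram from a small hand-picked set. It is only a sketch of the general technique: the set `IMPOSSIBLE_BIGRAMS` and the function `has_impossible_bigram` are hypothetical, and pygarble's actual bigram table and scoring are not shown in this README.

```python
import re

# Hypothetical, hand-picked set: the pairs named in this README (qx, jj, zx, xz).
IMPOSSIBLE_BIGRAMS = {"qx", "jj", "zx", "xz"}


def has_impossible_bigram(text: str) -> bool:
    """Return True if any word contains a bigram from IMPOSSIBLE_BIGRAMS."""
    for word in re.findall(r"[a-z]+", text.lower()):
        for i in range(len(word) - 1):
            if word[i:i + 2] in IMPOSSIBLE_BIGRAMS:
                return True
    return False


print(has_impossible_bigram("hello world"))  # False
print(has_impossible_bigram("qxjjxz"))       # True
```

A rule like this rarely fires on real English, which is consistent with the very high precision reported for this strategy, but it stays silent on any gibberish that happens to avoid the listed pairs, which is consistent with its low recall.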

All Available Strategies

High Precision (v0.5.0)

Core Strategies

Specialized Detectors

Legacy Strategies

Using Individual Strategies

from pygarble import GarbleDetector, Strategy

# Markov chain - best overall performance
detector = GarbleDetector(Strategy.MARKOV_CHAIN)
detector.predict("the quick brown fox")  # False
detector.predict("xkqzjwpmv")            # True

# Highest precision - no false positives on the benchmark above, but low recall
detector = GarbleDetector(Strategy.BIGRAM_PROBABILITY)
detector.predict("hello world")          # False
detector.predict("qxjjxz")               # True (impossible: qx, jj, xz)

# Encoding corruption detection
detector = GarbleDetector(Strategy.MOJIBAKE)
detector.predict("Café")                 # False - valid UTF-8
detector.predict("CafÃ©")                # True - mojibake

# Homoglyph attack detection
detector = GarbleDetector(Strategy.UNICODE_SCRIPT)
detector.predict("paypal")               # False - all Latin
detector.predict("pаypal")               # True - Cyrillic 'а'

Ensemble Detector

Combine multiple strategies for better accuracy:

from pygarble import EnsembleDetector, Strategy

# Default ensemble (recommended)
# Uses: MARKOV_CHAIN, WORD_LOOKUP, NGRAM_FREQUENCY, BIGRAM_PROBABILITY, LETTER_POSITION
# Voting: majority
detector = EnsembleDetector()

# Custom strategies
detector = EnsembleDetector(
    strategies=[
        Strategy.MARKOV_CHAIN,
        Strategy.BIGRAM_PROBABILITY,
        Strategy.KEYBOARD_PATTERN,
    ]
)

# Different voting modes
detector = EnsembleDetector(voting="any")       # High recall - flag if ANY strategy detects
detector = EnsembleDetector(voting="all")       # High precision - flag only if ALL agree
detector = EnsembleDetector(voting="majority")  # Balanced (default)
detector = EnsembleDetector(voting="average")   # Average probabilities

# Weighted voting
detector = EnsembleDetector(
    strategies=[Strategy.MARKOV_CHAIN, Strategy.WORD_LOOKUP],
    voting="weighted",
    weights=[0.7, 0.3]
)
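The voting modes can be pictured as different ways of reducing a list of per-strategy probabilities to one decision. The sketch below is an illustration under stated assumptions, not the library's implementation: it assumes each strategy votes "gibberish" when its probability reaches the threshold, that "majority" means a strict majority, and that weights are normalized by their sum. The function name `combine` is hypothetical.

```python
from typing import List, Optional


def combine(probs: List[float], voting: str = "majority",
            threshold: float = 0.5,
            weights: Optional[List[float]] = None) -> bool:
    """Reduce per-strategy gibberish probabilities to a single decision."""
    votes = [p >= threshold for p in probs]
    if voting == "any":
        return any(votes)
    if voting == "all":
        return all(votes)
    if voting == "majority":
        return sum(votes) > len(votes) / 2  # strict majority (assumption)
    if voting == "average":
        return sum(probs) / len(probs) >= threshold
    if voting == "weighted":
        if weights is None or len(weights) != len(probs):
            raise ValueError("weighted voting needs one weight per strategy")
        weighted_sum = sum(w * p for w, p in zip(weights, probs))
        return weighted_sum / sum(weights) >= threshold
    raise ValueError(f"unknown voting mode: {voting!r}")


probs = [0.9, 0.2, 0.6]  # made-up scores from three strategies
print(combine(probs, "any"))       # True
print(combine(probs, "all"))       # False
print(combine(probs, "majority"))  # True
print(combine(probs, "average"))   # True
```

With these made-up scores, "all" is the only mode that does not flag the text, which matches its description as the highest-precision option.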

API Reference

GarbleDetector

GarbleDetector(
    strategy: Strategy,
    threshold: float = 0.5,    # Probability threshold for predict()
    **kwargs                   # Strategy-specific parameters
)

# Methods
detector.predict(text)         # Returns bool or List[bool]
detector.predict_proba(text)   # Returns float or List[float] (0.0-1.0)

EnsembleDetector

EnsembleDetector(
    strategies: List[Strategy] = None,  # Default: high-precision mix
    threshold: float = 0.5,
    voting: str = "majority",           # "majority", "any", "all", "average", "weighted"
    weights: List[float] = None,        # Required if voting="weighted"
)

# Methods (same as GarbleDetector)
detector.predict(text)
detector.predict_proba(text)

Common Use Cases

Filter User Input

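from pygarble import EnsembleDetector

detector = EnsembleDetector()

def validate_input(text):
    """Return an error message for gibberish input, or None if the text looks valid."""
    if detector.predict(text):
        return "Please enter valid text"
    return None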

Clean Data Pipeline

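from pygarble import GarbleDetector, Strategy

detector = GarbleDetector(Strategy.MARKOV_CHAIN)

# raw_data is any iterable of strings; keep only the entries not flagged as gibberish
clean_data = [text for text in raw_data if not detector.predict(text)]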

Detect Encoding Issues

from pygarble import GarbleDetector, Strategy

detector = GarbleDetector(Strategy.MOJIBAKE)

for text in documents:
    if detector.predict(text):
        print(f"Encoding issue detected: {text[:50]}...")
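For intuition about what such a detector looks for, one common kind of mojibake is UTF-8 text that was decoded as Latin-1. The standard-library sketch below tests for that single case by reversing the mistake; it is a simplified heuristic, not pygarble's MOJIBAKE implementation, and the function name `looks_like_utf8_mojibake` is hypothetical.

```python
def looks_like_utf8_mojibake(text: str) -> bool:
    """Heuristic: True if text looks like UTF-8 bytes mis-decoded as Latin-1."""
    try:
        # Undo the suspected mistake: back to bytes, then decode as UTF-8.
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False  # the round trip is impossible, so not this kind of damage
    return repaired != text


garbled = "Café".encode("utf-8").decode("latin-1")  # simulate the corruption
print(looks_like_utf8_mojibake("Café"))   # False
print(looks_like_utf8_mojibake(garbled))  # True
```

Pure ASCII text round-trips unchanged and is never flagged. Corruption via other code pages such as Windows-1252 is outside the scope of this sketch.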

Detect Phishing/Homoglyphs

from pygarble import GarbleDetector, Strategy

detector = GarbleDetector(Strategy.UNICODE_SCRIPT)

if detector.predict(domain_name):
    print("Warning: Possible homoglyph attack")
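The underlying idea of a script check can be sketched with the standard library alone: a string whose letters come from more than one writing system is suspicious. Python has no built-in Unicode script property, so this sketch approximates the script by the first word of each character's Unicode name. The functions `scripts_used` and `is_mixed_script` are hypothetical and are not how pygarble's UNICODE_SCRIPT strategy is necessarily implemented.

```python
import unicodedata


def scripts_used(text: str) -> set:
    """Approximate each letter's script by the first word of its Unicode name."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])  # e.g. "LATIN", "CYRILLIC"
    return scripts


def is_mixed_script(text: str) -> bool:
    """True if the letters in text span more than one (approximate) script."""
    return len(scripts_used(text)) > 1


spoofed = "p\u0430ypal"  # U+0430 is CYRILLIC SMALL LETTER A
print(is_mixed_script("paypal"))  # False
print(is_mixed_script(spoofed))   # True
```

This approach also flags legitimate mixed-script text, such as a sentence that quotes a Greek word, so it suits short identifiers like domain names better than free-form prose.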

Requirements

Development

git clone https://github.com/brightertiger/pygarble.git
cd pygarble
pip install -e ".[dev]"
pytest tests/ -v

License

MIT License

Changelog

0.5.0

0.4.0

0.3.0