Detect gibberish, garbled text, and nonsense with high precision.
A zero-dependency Python library for identifying random character sequences, keyboard mashing, encoding errors, and other forms of text corruption. Uses statistical analysis, phonotactic rules, and pattern matching to distinguish meaningful text from gibberish.
pip install pygarble
from pygarble import GarbleDetector, EnsembleDetector, Strategy
# Recommended: Use the default ensemble (99.5% precision)
detector = EnsembleDetector()
detector.predict("Hello world") # False - valid text
detector.predict("asdfghjkl") # True - keyboard mashing
detector.predict("qxzjkwp") # True - impossible letter combinations
# Get probability scores (0.0 = valid, 1.0 = gibberish)
detector.predict_proba("Hello world") # ~0.1
detector.predict_proba("xkqzjwp") # ~0.9
# Batch processing
texts = ["Hello world", "asdfghjkl", "Normal sentence here"]
results = detector.predict(texts) # [False, True, False]
Tested on 1,644 samples (dictionary words, sentences, random strings, keyboard mashing):
| Detector | Precision | Recall | F1 Score |
|---|---|---|---|
| EnsembleDetector() | 99.5% | 78.5% | 87.8% |
| MARKOV_CHAIN | 98.8% | 86.4% | 92.2% |
| BIGRAM_PROBABILITY | 100% | 33.6% | 50.3% |
The default ensemble prioritizes precision (minimizing false positives) over recall.
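The F1 column is the harmonic mean of precision and recall, so it can be rechecked from the other two columns. A minimal sketch follows; the helper name `f1_score` is ours for illustration and is not part of pygarble.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs copied from the benchmark table.
reported = {
    "EnsembleDetector()": (0.995, 0.785),
    "MARKOV_CHAIN": (0.988, 0.864),
    "BIGRAM_PROBABILITY": (1.000, 0.336),
}

for name, (precision, recall) in reported.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.1%}")
```

The printed values match the F1 column (87.8%, 92.2%, 50.3%), which also shows how a low recall drags F1 down even at 100% precision.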
| Strategy | Description | Precision |
|---|---|---|
| MARKOV_CHAIN | Character transition probabilities trained on English | 98.8% |
| NGRAM_FREQUENCY | Common English trigram analysis | 96.3% |
| WORD_LOOKUP | Dictionary of 50K English words | 92.7% |
| BIGRAM_PROBABILITY | Impossible letter pairs (qx, jj, zx) | 100% |
| LETTER_POSITION | Letters in impossible positions | 99.0% |
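To give a feel for why an impossible-letter-pair check can be so precise, here is a rough standalone sketch of the idea. It is not pygarble's implementation: the set `IMPOSSIBLE_BIGRAMS` and the function `impossible_bigrams` are hypothetical, and the set contains only the pairs mentioned in this README.

```python
from typing import List

# Hypothetical, tiny set of letter pairs treated as impossible in English.
# A real detector would presumably use a much larger table.
IMPOSSIBLE_BIGRAMS = {"qx", "jj", "zx", "xz"}


def impossible_bigrams(text: str) -> List[str]:
    """Return the impossible letter pairs found in text, in order of appearance."""
    letters = text.lower()
    pairs = (letters[i:i + 2] for i in range(len(letters) - 1))
    return [pair for pair in pairs if pair in IMPOSSIBLE_BIGRAMS]


print(impossible_bigrams("hello world"))  # []
print(impossible_bigrams("qxjjxz"))       # ['qx', 'jj', 'xz']
```

A rule like this fires only on pairs that essentially never occur in real words, which is consistent with high precision, but it stays silent on gibberish that happens to avoid those pairs, which is consistent with low recall.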
High Precision (v0.5.0)
- BIGRAM_PROBABILITY - Impossible letter pairs
- LETTER_POSITION - Invalid letter positions
- CONSONANT_SEQUENCE - Too many consecutive consonants
- VOWEL_PATTERN - Invalid vowel sequences
- LETTER_FREQUENCY - Abnormal letter distribution
- RARE_TRIGRAM - Impossible trigrams

Core Strategies
- MARKOV_CHAIN - Character-level Markov chain (best overall)
- NGRAM_FREQUENCY - Trigram frequency analysis
- WORD_LOOKUP - English dictionary lookup
- PRONOUNCEABILITY - English phonotactic rules
- KEYBOARD_PATTERN - Keyboard row sequences
- ENTROPY_BASED - Shannon entropy analysis
- VOWEL_RATIO - Vowel to consonant ratio

Specialized Detectors
- MOJIBAKE - Encoding corruption (UTF-8 as Latin-1)
- UNICODE_SCRIPT - Homoglyph/script mixing attacks
- HEX_STRING - Hash strings and UUIDs
- SYMBOL_RATIO - Excessive symbols/numbers
- REPETITION - Repeated patterns (ababab)
- COMPRESSION_RATIO - Compression-based detection

Legacy Strategies
- CHARACTER_FREQUENCY, WORD_LENGTH, PATTERN_MATCHING, STATISTICAL_ANALYSIS
- ENGLISH_WORD_VALIDATION (requires pip install pygarble[spellchecker])

from pygarble import GarbleDetector, Strategy
# Markov chain - best overall performance
detector = GarbleDetector(Strategy.MARKOV_CHAIN)
detector.predict("the quick brown fox") # False
detector.predict("xkqzjwpmv") # True
# High precision - zero false positives
detector = GarbleDetector(Strategy.BIGRAM_PROBABILITY)
detector.predict("hello world") # False
detector.predict("qxjjxz") # True (impossible: qx, jj, xz)
# Encoding corruption detection
detector = GarbleDetector(Strategy.MOJIBAKE)
detector.predict("Café") # False - valid UTF-8
detector.predict("CafÃ©") # True - mojibake (UTF-8 bytes read as Latin-1)
# Homoglyph attack detection
detector = GarbleDetector(Strategy.UNICODE_SCRIPT)
detector.predict("paypal") # False - all Latin
detector.predict("pаypal") # True - Cyrillic 'а'
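For readers unfamiliar with mojibake, the sketch below shows how the corruption arises and one simple way to spot it using only the standard library. It is an illustration of the general technique, not the MOJIBAKE strategy's actual code; `looks_like_mojibake` is a hypothetical helper.

```python
def looks_like_mojibake(text: str) -> bool:
    """Heuristic: True if text looks like UTF-8 bytes that were decoded as Latin-1.

    If re-encoding as Latin-1 and decoding as UTF-8 succeeds and changes the
    string, the original was probably mis-decoded.
    """
    try:
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either not representable in Latin-1 or not valid UTF-8: leave as-is.
        return False
    return repaired != text


# Produce mojibake on purpose: encode as UTF-8, then decode with the wrong codec.
garbled = "Café".encode("utf-8").decode("latin-1")
print(garbled)                       # CafÃ©
print(looks_like_mojibake("Café"))   # False
print(looks_like_mojibake(garbled))  # True
```

The round trip is only a heuristic: other wrong codecs (for example Windows-1252) or doubly garbled text would need additional checks.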
Combine multiple strategies for better accuracy:
from pygarble import EnsembleDetector, Strategy
# Default ensemble (recommended)
# Uses: MARKOV_CHAIN, WORD_LOOKUP, NGRAM_FREQUENCY, BIGRAM_PROBABILITY, LETTER_POSITION
# Voting: majority
detector = EnsembleDetector()
# Custom strategies
detector = EnsembleDetector(
    strategies=[
        Strategy.MARKOV_CHAIN,
        Strategy.BIGRAM_PROBABILITY,
        Strategy.KEYBOARD_PATTERN,
    ]
)
# Different voting modes
detector = EnsembleDetector(voting="any") # High recall - flag if ANY strategy detects
detector = EnsembleDetector(voting="all") # High precision - flag only if ALL agree
detector = EnsembleDetector(voting="majority") # Balanced (default)
detector = EnsembleDetector(voting="average") # Average probabilities
# Weighted voting
detector = EnsembleDetector(
    strategies=[Strategy.MARKOV_CHAIN, Strategy.WORD_LOOKUP],
    voting="weighted",
    weights=[0.7, 0.3]
)
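The voting modes can be pictured as different ways of reducing a list of per-strategy probabilities to one decision. The standalone sketch below is one plausible reading of the mode names, not the library's code: the function `combine`, the use of `>=` against the threshold, and the normalization of weights are all assumptions.

```python
from typing import List, Optional


def combine(
    probas: List[float],
    voting: str = "majority",
    threshold: float = 0.5,
    weights: Optional[List[float]] = None,
) -> bool:
    """Reduce per-strategy gibberish probabilities to a single True/False flag."""
    if not probas:
        raise ValueError("at least one probability is required")
    votes = [p >= threshold for p in probas]  # each strategy's own verdict
    if voting == "any":
        return any(votes)
    if voting == "all":
        return all(votes)
    if voting == "majority":
        return sum(votes) > len(votes) / 2
    if voting == "average":
        return sum(probas) / len(probas) >= threshold
    if voting == "weighted":
        if weights is None or len(weights) != len(probas):
            raise ValueError("weighted voting needs one weight per strategy")
        score = sum(w * p for w, p in zip(weights, probas)) / sum(weights)
        return score >= threshold
    raise ValueError(f"unknown voting mode: {voting}")


scores = [0.9, 0.2, 0.6]  # made-up outputs from three strategies
for mode in ("any", "all", "majority", "average"):
    print(mode, combine(scores, voting=mode))
print("weighted", combine([0.2, 0.9], voting="weighted", weights=[0.7, 0.3]))
```

With these made-up scores, "any", "majority", and "average" flag the text while "all" does not, and the weighted example is not flagged because the heavier-weighted strategy reports a low probability.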
GarbleDetector(
    strategy: Strategy,
    threshold: float = 0.5,  # Probability threshold for predict()
    **kwargs                 # Strategy-specific parameters
)
# Methods
detector.predict(text) # Returns bool or List[bool]
detector.predict_proba(text) # Returns float or List[float] (0.0-1.0)
EnsembleDetector(
    strategies: List[Strategy] = None,  # Default: high-precision mix
    threshold: float = 0.5,
    voting: str = "majority",           # "majority", "any", "all", "average", "weighted"
    weights: List[float] = None,        # Required if voting="weighted"
)
# Methods (same as GarbleDetector)
detector.predict(text)
detector.predict_proba(text)
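# Input validation: reject gibberish form input
detector = EnsembleDetector()
def validate_input(text):
    if detector.predict(text):
        return "Please enter valid text"
    return None

# Data cleaning: drop gibberish entries (raw_data is your own list of strings)
detector = GarbleDetector(Strategy.MARKOV_CHAIN)
clean_data = [text for text in raw_data if not detector.predict(text)]

# Encoding checks: flag documents that contain mojibake (documents is your own list of strings)
detector = GarbleDetector(Strategy.MOJIBAKE)
for text in documents:
    if detector.predict(text):
        print(f"Encoding issue detected: {text[:50]}...")

# Security: flag domain names that mix scripts (domain_name is your own string)
detector = GarbleDetector(Strategy.UNICODE_SCRIPT)
if detector.predict(domain_name):
    print("Warning: Possible homoglyph attack")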
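As background for the homoglyph use case, script mixing can be approximated with the standard library alone. This sketch is not how UNICODE_SCRIPT is implemented; `scripts_used` and `mixes_scripts` are hypothetical helpers, and taking the first word of each character's Unicode name is only a rough proxy for its script.

```python
import unicodedata
from typing import Set


def scripts_used(text: str) -> Set[str]:
    """Collect a rough script label for every alphabetic character in text."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")  # e.g. "CYRILLIC SMALL LETTER A"
            if name:
                scripts.add(name.split()[0])
    return scripts


def mixes_scripts(text: str) -> bool:
    """True if the letters in text come from more than one script label."""
    return len(scripts_used(text)) > 1


print(mixes_scripts("paypal"))       # False
print(mixes_scripts("p\u0430ypal"))  # True: U+0430 is CYRILLIC SMALL LETTER A
```

Legitimate multilingual text also mixes scripts, so a check like this suits short identifiers such as domain names better than free-form prose.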
Optional: pyspellchecker, for the ENGLISH_WORD_VALIDATION strategy

git clone https://github.com/brightertiger/pygarble.git
cd pygarble
pip install -e ".[dev]"
pytest tests/ -v
MIT License