Detect gibberish, garbled text, and corrupted content with high accuracy using advanced machine learning techniques.
pygarble is a powerful Python library designed to identify nonsensical, garbled, or corrupted text content that often appears in data processing pipelines, user inputs, or automated systems. Whether you’re dealing with random character sequences, encoding errors, keyboard mashing, or corrupted data streams, pygarble provides multiple detection strategies to filter out unwanted content and maintain data quality. The library uses statistical analysis, entropy calculations, pattern matching, and language detection to distinguish between meaningful text and gibberish with configurable sensitivity levels.
predict()
and predict_proba()
methodsYou can install pygarble using pip:
pip install pygarble
from pygarble import GarbleDetector, Strategy
# Create a detector with character frequency strategy
detector = GarbleDetector(Strategy.CHARACTER_FREQUENCY, threshold=0.5)
# Detect garbled text
texts = ["normal text", "aaaaaaa", "asdfghjkl"]
results = detector.predict(texts)
print(results) # [False, True, True]
# Get probability scores
probabilities = detector.predict_proba(texts)
print(probabilities) # [0.2, 1.0, 0.8]
Each strategy implements a different approach to detect garbled text. All strategies return probability scores between 0.0 and 1.0, where higher scores indicate more likely garbled text.
Note: For sentence and paragraph analysis, Character Frequency, Statistical Analysis, and Entropy Based strategies exclude whitespace characters (spaces, tabs, newlines) from their calculations since whitespace is structural, not content. This ensures consistent analysis regardless of formatting differences.
CHARACTER_FREQUENCY
)Implementation Logic: Analyzes the frequency distribution of alphabetic characters in the text (excluding whitespace). Garbled text often has unusual character frequency patterns (e.g., too many repeated characters or skewed distributions).
Algorithm:
Parameters:
frequency_threshold
(float, default: 0.1): Maximum allowed character frequency ratiodetector = GarbleDetector(
Strategy.CHARACTER_FREQUENCY,
threshold=0.3,
frequency_threshold=0.1 # Character frequency threshold
)
# Examples
detector.predict("aaaaaaa") # True - high frequency of 'a'
detector.predict("normal text") # False - balanced character distribution
detector.predict("aaa aaa aaa") # True - same as above (whitespace ignored)
WORD_LENGTH
)Implementation Logic: Analyzes the average word length in the text. Garbled text often contains unusually long words or lacks proper word boundaries.
Algorithm:
Parameters:
max_word_length
(int, default: 20): Maximum allowed average word lengthdetector = GarbleDetector(
Strategy.WORD_LENGTH,
threshold=0.5,
max_word_length=20 # Maximum average word length
)
# Examples
detector.predict("supercalifragilisticexpialidocious") # True - very long word
detector.predict("short words") # False - normal word lengths
PATTERN_MATCHING
)Implementation Logic: Uses configurable regex patterns to detect suspicious text patterns. Highly customizable with named patterns and override capabilities.
Algorithm:
Default Patterns:
special_chars
: Sequences of 3+ special characters [^a-zA-Z0-9\s]{3,}
repeated_chars
: Characters repeated 5+ times (.)\1{4,}
uppercase_sequence
: 5+ consecutive uppercase letters [A-Z]{5,}
long_numbers
: 8+ consecutive digits [0-9]{8,}
Parameters:
patterns
(dict, optional): Custom patterns dictionary {name: regex_pattern}
override_defaults
(bool, default: False): Whether to skip default patterns# Default patterns only
detector = GarbleDetector(Strategy.PATTERN_MATCHING)
# Custom patterns added to defaults
custom_patterns = {
'email_pattern': r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
'phone_pattern': r'\d{3}-\d{3}-\d{4}'
}
detector = GarbleDetector(Strategy.PATTERN_MATCHING, patterns=custom_patterns)
# Override defaults completely
detector = GarbleDetector(Strategy.PATTERN_MATCHING, patterns=custom_patterns, override_defaults=True)
# Examples
detector.predict("AAAAA") # True - uppercase sequence
detector.predict("@@@@@") # True - special characters
detector.predict("test@example.com") # True - email pattern (if configured)
detector.predict("normal text") # False - no patterns match
STATISTICAL_ANALYSIS
)Implementation Logic: Analyzes the ratio of alphabetic characters to content characters (excluding whitespace). Garbled text often has low alphabetic character ratios.
Algorithm:
Parameters:
alpha_threshold
(float, default: 0.5): Minimum required alphabetic character ratiodetector = GarbleDetector(
Strategy.STATISTICAL_ANALYSIS,
threshold=0.5,
alpha_threshold=0.5 # Minimum alphabetic character ratio
)
# Examples
detector.predict("123456789") # True - low alphabetic ratio
detector.predict("normal text") # False - high alphabetic ratio
detector.predict("123 456 789") # True - same as above (whitespace ignored)
ENTROPY_BASED
)Implementation Logic: Uses Shannon entropy to measure alphabetic character diversity (excluding whitespace). Garbled text often has low entropy due to repetitive patterns.
Algorithm:
Parameters:
entropy_threshold
(float, default: 3.0): Minimum required entropy valuedetector = GarbleDetector(
Strategy.ENTROPY_BASED,
threshold=0.5,
entropy_threshold=3.0 # Minimum entropy threshold
)
# Examples
detector.predict("aaaaaaa") # True - low entropy (repetitive)
detector.predict("normal text") # False - high entropy (diverse)
detector.predict("aaa aaa aaa") # True - same as above (whitespace ignored)
LANGUAGE_DETECTION
)Implementation Logic: Uses FastText language identification to detect if text is in the expected language. Garbled text often fails language detection.
Algorithm:
Parameters:
target_language
(str, default: ‘en’): Expected language codemodel_path
(str, optional): Path to custom FastText modeldetector = GarbleDetector(
Strategy.LANGUAGE_DETECTION,
threshold=0.5,
target_language='en', # Target language code
model_path='/path/to/model.bin' # Optional custom model path
)
# Examples
detector.predict("Hello world") # False - detected as English
detector.predict("asdfghjkl") # True - not detected as English
detector.predict("Bonjour le monde") # True - detected as French, not English
ENGLISH_WORD_VALIDATION
)Implementation Logic: Tokenizes text and validates words against an English dictionary using pyspellchecker. Garbled text often contains many invalid English words.
Algorithm:
Parameters:
valid_word_threshold
(float, default: 0.7): Minimum required ratio of valid English wordsdetector = GarbleDetector(
Strategy.ENGLISH_WORD_VALIDATION,
threshold=0.5,
valid_word_threshold=0.7 # Minimum ratio of valid English words
)
# Examples
detector.predict("hello world this is normal text") # False - all words are valid
detector.predict("asdfghjkl qwertyuiop zxcvbnm") # True - no valid words
detector.predict("hello asdfgh world qwerty") # False - 50% valid words (above threshold)
The Pattern Matching strategy offers extensive configurability for custom use cases:
# 1. Default patterns only
detector = GarbleDetector(Strategy.PATTERN_MATCHING)
# 2. Add custom patterns to defaults
custom_patterns = {
'email_pattern': r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
'phone_pattern': r'\d{3}-\d{3}-\d{4}',
'url_pattern': r'https?://[^\s]+',
'credit_card': r'\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}'
}
detector = GarbleDetector(Strategy.PATTERN_MATCHING, patterns=custom_patterns)
# 3. Override specific default patterns
override_patterns = {
'special_chars': r'[^a-zA-Z0-9\s]{5,}', # More restrictive (5+ chars)
'email_pattern': r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
}
detector = GarbleDetector(Strategy.PATTERN_MATCHING, patterns=override_patterns)
# 4. Use only custom patterns (no defaults)
detector = GarbleDetector(Strategy.PATTERN_MATCHING, patterns=custom_patterns, override_defaults=True)
# 5. Disable all patterns (returns 0.0 probability for everything)
detector = GarbleDetector(Strategy.PATTERN_MATCHING, patterns={}, override_defaults=True)
# Adjust sensitivity
detector = GarbleDetector(Strategy.CHARACTER_FREQUENCY, threshold=0.3)
# Lower threshold = more sensitive (detects more as garbled)
# Higher threshold = less sensitive (detects less as garbled)
detector = GarbleDetector(Strategy.PATTERN_MATCHING)
# Process multiple texts at once
texts = ["normal text", "AAAAA", "123456789", "mixed text"]
predictions = detector.predict(texts)
probabilities = detector.predict_proba(texts)
for text, pred, prob in zip(texts, predictions, probabilities):
status = "GARBLED" if pred else "NORMAL"
print(f"{text:15} -> {status:7} (confidence: {prob:.3f})")
For very large datasets, use multithreading:
from pygarble import GarbleDetector, Strategy
import time
# Large dataset
texts = ["normal text", "AAAAA", "123456789", "mixed text"] * 2000
# Single-threaded processing
detector_single = GarbleDetector(Strategy.LANGUAGE_DETECTION)
start_time = time.time()
predictions_single = detector_single.predict(texts)
single_time = time.time() - start_time
# Multithreaded processing
detector_multi = GarbleDetector(Strategy.LANGUAGE_DETECTION, threads=4)
start_time = time.time()
predictions_multi = detector_multi.predict(texts)
multi_time = time.time() - start_time
print(f"Single-threaded: {single_time:.3f}s")
print(f"Multithreaded: {multi_time:.3f}s")
print(f"Speedup: {single_time/multi_time:.2f}x")
print(f"Results match: {predictions_single == predictions_multi}")
For large datasets, you can enable multithreaded processing:
# Enable multithreading for large datasets
detector = GarbleDetector(Strategy.LANGUAGE_DETECTION, threads=4)
# Process large batch with multiple threads
large_texts = ["text"] * 1000 # 1000 texts
predictions = detector.predict(large_texts)
Note: Multithreading is most beneficial for:
For small datasets or CPU-bound strategies, single-threaded processing may be faster due to threading overhead.
# For environments without internet access
detector = GarbleDetector(
Strategy.LANGUAGE_DETECTION,
model_path='/offline/path/lid.176.bin',
target_language='en',
threshold=0.5
)
The package automatically handles common edge cases:
# Extremely long strings without any whitespace (like base64 data)
detector = GarbleDetector(Strategy.CHARACTER_FREQUENCY, max_string_length=1000)
# These will be detected as garbled (no whitespace)
long_base64 = "SGVsbG9Xb3JsZEhlbGxvV29ybGRIZWxsb1dvcmxk" * 50
long_url = "https://example.com/" + "a" * 1000
long_no_whitespace = "a" * 1001
# These will NOT be detected as garbled (have whitespace)
long_paragraph = "This is a normal sentence. " * 50
long_with_tabs = "word\t" * 250
long_with_newlines = "word\n" * 250
long_mixed_whitespace = "word \t\n " * 200
print(detector.predict(long_base64)) # True
print(detector.predict(long_url)) # True
print(detector.predict(long_no_whitespace)) # True
print(detector.predict(long_paragraph)) # False
print(detector.predict(long_with_tabs)) # False
print(detector.predict(long_with_newlines)) # False
print(detector.predict(long_mixed_whitespace)) # False
GarbleDetector(strategy: Strategy, threshold: float = 0.5, threads: int = None, **kwargs)
Parameters:
strategy
: Detection strategy to usethreshold
: Probability threshold for binary predictions (default: 0.5)threads
: Number of threads for batch processing (default: None, single-threaded)**kwargs
: Strategy-specific parametersMethods:
predict(X)
: Returns binary predictions (True/False)predict_proba(X)
: Returns probability scores (0.0-1.0)class Strategy(Enum):
CHARACTER_FREQUENCY = "character_frequency"
WORD_LENGTH = "word_length"
PATTERN_MATCHING = "pattern_matching"
STATISTICAL_ANALYSIS = "statistical_analysis"
ENTROPY_BASED = "entropy_based"
LANGUAGE_DETECTION = "language_detection"
ENGLISH_WORD_VALIDATION = "english_word_validation"
from pygarble import GarbleDetector, Strategy
# Filter out garbled text from a dataset
detector = GarbleDetector(Strategy.ENTROPY_BASED, threshold=0.7)
texts = [
"This is normal English text",
"asdfghjkl",
"Hello world",
"AAAAA",
"Mixed content with numbers 123"
]
clean_texts = []
for text in texts:
if not detector.predict(text):
clean_texts.append(text)
print("Clean texts:", clean_texts)
from pygarble import GarbleDetector, Strategy
strategies = [
Strategy.CHARACTER_FREQUENCY,
Strategy.PATTERN_MATCHING,
Strategy.ENTROPY_BASED,
Strategy.ENGLISH_WORD_VALIDATION
]
text = "suspicious text here"
votes = 0
for strategy in strategies:
detector = GarbleDetector(strategy, threshold=0.5)
if detector.predict(text):
votes += 1
# Text is considered garbled if majority of strategies agree
is_garbled = votes > len(strategies) // 2
print(f"Text is garbled: {is_garbled}")
from pygarble import GarbleDetector, Strategy
# Validate that text is in English
detector = GarbleDetector(
Strategy.LANGUAGE_DETECTION,
target_language='en',
threshold=0.3
)
user_inputs = [
"Hello, how are you?",
"Bonjour, comment allez-vous?",
"asdfghjkl",
"123456789"
]
for text in user_inputs:
proba = detector.predict_proba(text)
is_english = not detector.predict(text)
print(f"{text:25} -> English: {is_english:5} (confidence: {1-proba:.3f})")
from pygarble import GarbleDetector, Strategy
# Create a detector for validating user input data
validation_patterns = {
'email_pattern': r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
'phone_pattern': r'\d{3}[-.]?\d{3}[-.]?\d{4}',
'ssn_pattern': r'\d{3}-\d{2}-\d{4}',
'credit_card': r'\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}',
'suspicious_chars': r'[^a-zA-Z0-9\s@.-]{3,}' # Override default special_chars
}
detector = GarbleDetector(
Strategy.PATTERN_MATCHING,
patterns=validation_patterns,
threshold=0.1 # Low threshold to catch any suspicious patterns
)
# Test various inputs
test_inputs = [
"john.doe@example.com", # Valid email
"555-123-4567", # Valid phone
"123-45-6789", # Valid SSN
"4532 1234 5678 9012", # Valid credit card
"normal text", # Normal text
"asdfghjkl", # Random characters
"user@domain.com $$$", # Email with suspicious chars
]
for text in test_inputs:
proba = detector.predict_proba(text)
is_suspicious = detector.predict(text)
status = "SUSPICIOUS" if is_suspicious else "VALID"
print(f"{text:25} -> {status:10} (confidence: {proba:.3f})")
from pygarble import GarbleDetector, Strategy
# Create patterns for detecting code-like content
code_patterns = {
'function_call': r'\w+\s*\([^)]*\)',
'variable_assignment': r'\w+\s*=\s*\w+',
'json_pattern': r'\{[^}]*"[^"]*"[^}]*\}',
'sql_pattern': r'(SELECT|INSERT|UPDATE|DELETE)\s+.*FROM',
'html_pattern': r'<[^>]+>',
'long_identifiers': r'\w{20,}', # Very long variable names
}
# Use only custom patterns (no defaults)
detector = GarbleDetector(
Strategy.PATTERN_MATCHING,
patterns=code_patterns,
override_defaults=True,
threshold=0.2
)
# Test mixed content
mixed_content = [
"This is normal text",
"def calculate_total(items):",
"SELECT * FROM users WHERE id = 1",
"<div class='container'>Hello</div>",
"very_long_variable_name_that_exceeds_normal_length",
"user_data = {'name': 'John', 'age': 30}",
]
for text in mixed_content:
proba = detector.predict_proba(text)
is_code_like = detector.predict(text)
status = "CODE-LIKE" if is_code_like else "NORMAL"
print(f"{text:50} -> {status:10} (confidence: {proba:.3f})")
To set up a development environment:
git clone https://github.com/yourusername/py-garble.git
cd py-garble
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
pip install -r requirements-dev.txt
pytest tests/
flake8 pygarble/
black pygarble/
mypy pygarble/
./scripts/test_ci.sh
This project uses GitHub Actions for CI/CD. The workflow runs:
For complete setup instructions, branch protection rules, and publishing guides, see .github/instructions.txt
.
The package uses a modular strategy pattern:
pygarble/
├── __init__.py
├── core.py # Main GarbleDetector class
└── strategies/
├── __init__.py
├── base.py # Abstract base strategy
├── character_frequency.py
├── word_length.py
├── pattern_matching.py
├── statistical_analysis.py
├── entropy_based.py
├── language_detection.py
└── english_word_validation.py
Each strategy implements:
_predict_impl(text)
: Returns boolean prediction_predict_proba_impl(text)
: Returns probability scoregit checkout -b feature/amazing-feature
)git commit -m 'Add some amazing feature'
)git push origin feature/amazing-feature
)To add a new detection strategy:
pygarble/strategies/
BaseStrategy
_predict_impl()
and _predict_proba_impl()
This project is licensed under the MIT License - see the LICENSE file for details.
override_defaults
parameter to skip default patternsthreads
parameter for multithreaded batch processing