tomatocream/tooling

Tooling

A collection of useful command-line tools.

OCR Screenshot Tool

A cross-platform CLI tool that takes a screenshot, runs OCR on it with DocTR (a deep-learning OCR library), and copies the recognized text to the clipboard. It preserves the layout and indentation of the text and can optionally save an annotated copy of the image.

Features

🌍 Cross-platform - Works on Windows, macOS, and Linux
⚡ Multiple screenshot methods - Choose the fastest for your system
🔍 Advanced OCR - Uses DocTR with PARSeq recognition model
📝 Smart formatting - Preserves text layout and indentation
🎨 Image annotation - Visualize detected text regions
📋 Clipboard integration - Automatic text copying

Installation

Basic installation:

pip install .

With cross-platform screenshot support:

# For fastest screenshots (recommended)
pip install ".[screenshot-fast]"

# For full automation features (region selection)
pip install ".[screenshot-full]"

# For maximum compatibility (all backends)
pip install ".[screenshot-all]"

Install specific screenshot libraries:

pip install mss          # Fastest (~30x faster than others)
pip install pyautogui    # Interactive region selection
pip install pyscreenshot # Multiple backends

Usage

Basic Commands

Take a screenshot and perform OCR:

ocr-screenshot

With verbose output and annotation:

ocr-screenshot --verbose --annotate --save-image

Screenshot Methods

Choose your preferred screenshot method:

# Auto-detect best method (default)
ocr-screenshot --screenshot-method auto

# Use MSS (fastest)
ocr-screenshot --screenshot-method mss

# Use PyAutoGUI (supports region selection)
ocr-screenshot --screenshot-method pyautogui

# Use Pillow ImageGrab (built-in)
ocr-screenshot --screenshot-method pillow

# Interactive region selection
ocr-screenshot --screenshot-method interactive

# macOS native (region selection with drag)
ocr-screenshot --screenshot-method macos

Advanced Features

Save screenshot with annotation showing detected text:

ocr-screenshot --save-image --annotate --show-words --show-text

Capture specific monitor (MSS method):

ocr-screenshot --screenshot-method mss --monitor-number 2

Full annotation with all detection levels:

ocr-screenshot --annotate --show-words --show-lines --show-blocks --show-text --save-image
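Annotation boxes have to be drawn in pixel coordinates, while DocTR reports straight bounding boxes as relative coordinates in the range 0 to 1. The sketch below shows one way that conversion could look. The function name and the colour table are illustrative assumptions (the colours follow the list under "How it works"), not the tool's actual code.

```python
from typing import Tuple

# Relative box as ((xmin, ymin), (xmax, ymax)), each value in [0, 1].
Box = Tuple[Tuple[float, float], Tuple[float, float]]

# Colour per detection level, as described in this README (assumed mapping).
LEVEL_COLORS = {"word": "red", "line": "green", "block": "blue"}


def to_pixel_box(geometry: Box, width: int, height: int) -> Tuple[int, int, int, int]:
    """Scale a relative box to (left, top, right, bottom) pixel coordinates."""
    (xmin, ymin), (xmax, ymax) = geometry
    return (
        round(xmin * width),
        round(ymin * height),
        round(xmax * width),
        round(ymax * height),
    )
```

The resulting tuple has the shape that drawing helpers such as Pillow's ImageDraw.rectangle accept, so each level can be outlined in its own colour.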

Screenshot Method Comparison

Method	Speed	Region Selection	Cross-Platform	Notes
mss	⚡⚡⚡ Fastest	❌ (crop after)	✅	~30x faster, recommended
pyautogui	⚡ Slow	✅ Interactive	✅	Best for region selection
pillow	⚡ Slow	✅ Coordinates	✅	Built into Pillow
pyscreenshot	⚡ Variable	✅ Coordinates	✅	Multiple backends
macos	⚡⚡ Fast	✅ Native UI	🍎 macOS only	Native drag selection
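The auto method has to decide which backend is usable on the current machine. A minimal sketch of such a fallback, assuming a "fastest first" preference order, is shown here. The order, the helper names, and the macOS fallback rule are assumptions made for illustration and may differ from what the tool really does.

```python
import importlib.util
import sys

# Assumed preference order: fastest backend first.
PREFERENCE = ["mss", "pyautogui", "pyscreenshot", "pillow"]
# Method names whose import name differs from the CLI name.
IMPORT_NAMES = {"pillow": "PIL"}


def is_installed(method: str) -> bool:
    """Return True if the library backing `method` can be imported."""
    return importlib.util.find_spec(IMPORT_NAMES.get(method, method)) is not None


def pick_method(installed=is_installed, platform: str = sys.platform) -> str:
    """Return the first usable method, or 'macos' as a last resort on macOS."""
    for method in PREFERENCE:
        if installed(method):
            return method
    if platform == "darwin":
        return "macos"
    raise RuntimeError("no screenshot backend available")
```

Passing the `installed` check as a parameter keeps the selection logic testable without installing any of the optional libraries.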

How it works

1. Screenshot: Multiple cross-platform methods available
   - Auto: Tries best method for your platform
   - MSS: Fastest full-screen capture
   - Interactive: Guided region selection
   - macOS: Native drag-to-select interface
2. OCR: Advanced DocTR processing
   - Uses state-of-the-art PARSeq recognition model
   - Preserves text layout and indentation
   - Handles multiple languages
3. Annotation (optional): Visual feedback
   - Word-level bounding boxes (red)
   - Line-level groupings (green)
   - Block-level sections (blue)
   - Text overlay showing detected content
4. Output: Formatted text copied to clipboard
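To make "preserves text layout and indentation" concrete, here is a simplified sketch of how word positions could be turned back into indented text. It assumes each word comes with the relative x position of its left edge and the relative y position of its line. The `Word` tuple, the `columns` width, and the `line_tol` tolerance are hypothetical choices, and the tool's real formatting logic may work differently.

```python
from typing import List, Tuple

# (text, relative x of left edge, relative y of the line) -- assumed input shape.
Word = Tuple[str, float, float]


def layout_text(words: List[Word], columns: int = 80, line_tol: float = 0.01) -> str:
    """Group words into lines by y position and indent each line by its x offset."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: (w[2], w[1])):
        # A word joins the current line if its y is close to the line's first word.
        if lines and abs(lines[-1][0][2] - word[2]) <= line_tol:
            lines[-1].append(word)
        else:
            lines.append([word])

    rendered = []
    for line in lines:
        line.sort(key=lambda w: w[1])  # left to right
        indent = " " * int(line[0][1] * columns)
        rendered.append(indent + " ".join(w[0] for w in line))
    return "\n".join(rendered)
```

With this approach, a line whose first word starts a quarter of the way across the image is indented by a quarter of `columns`, which is how code snippets and nested lists could keep their shape after OCR.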

Command Line Options

ocr-screenshot [OPTIONS]

Options:
  --lang TEXT                     Language code for OCR (default: eng)
  --save-image                    Save the screenshot image
  --output-dir PATH               Directory to save images (default: ~/Desktop)
  --verbose                       Show detailed output
  --annotate                      Create annotated image with detection boxes
  --show-words                    Show word-level boxes (default: True)
  --show-lines                    Show line-level boxes
  --show-blocks                   Show block-level boxes  
  --show-text                     Overlay detected text on image
  --screenshot-method TEXT        Method: auto, mss, pyautogui, pillow, pyscreenshot, macos, interactive
  --monitor-number INTEGER        Monitor to capture (MSS method only, 0=all)
  --help                          Show this message and exit

Examples

Quick OCR with fastest method:

ocr-screenshot --screenshot-method mss

Debug OCR accuracy with annotations:

ocr-screenshot --annotate --show-words --show-text --save-image --verbose

Interactive region selection:

ocr-screenshot --screenshot-method interactive --save-image

Multi-monitor setup (capture monitor 2):

ocr-screenshot --screenshot-method mss --monitor-number 2

Speech-to-Text (STT) Tool

A real-time speech-to-text tool built on RealtimeSTT with wake word activation. The default wake word is "jarvis"; transcription can be displayed live as you speak and optionally saved to a file.

Features

🎙️ Real-time transcription - Live speech-to-text conversion
🎯 Wake word activation - Multiple wake words including "jarvis"
⚡ GPU acceleration - CUDA support for faster processing
🔄 Live display - Real-time transcription preview
💾 File output - Save transcriptions to text files
🎛️ Multiple models - Choose from tiny to large Whisper models
🌍 Multi-language - Support for multiple languages
🧪 Test mode - Test functionality without wake words

Installation

The STT dependencies are included in the base installation:

pip install .

For optimal performance with GPU acceleration:

# For CUDA 11.8
pip install torch==2.5.1+cu118 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118

# For CUDA 12.X
pip install torch==2.5.1+cu121 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

Usage

Basic Commands

Start STT with jarvis wake word:

tooling stt listen

Test STT without wake words:

tooling stt test

Show system information:

tooling stt info

Wake Word Options

Use different wake words:

# Use alexa wake word
tooling stt listen --wake-word alexa

# Use hey google wake word  
tooling stt listen --wake-word "hey google"

# Use computer wake word
tooling stt listen --wake-word computer

Model Selection

Choose different Whisper models for speed vs accuracy:

# Fastest (tiny model)
tooling stt listen --model tiny

# Balanced (base model, default)
tooling stt listen --model base

# Best accuracy (large model)
tooling stt listen --model large-v2

Advanced Features

Save transcriptions to file:

tooling stt listen --save-to-file transcripts.txt

Disable real-time display for better performance:

tooling stt listen --no-realtime

Set custom sensitivity and language:

tooling stt listen --sensitivity 0.8 --language en --verbose

Force CPU usage:

tooling stt listen --device cpu
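The default --device auto implies some form of runtime detection. One plausible way to resolve it, sketched below, is to ask PyTorch whether CUDA is available and fall back to the CPU otherwise. The function name `resolve_device` is hypothetical, and the tool may use a different check.

```python
def resolve_device(requested: str = "auto") -> str:
    """Map 'auto' to 'cuda' or 'cpu'; pass explicit choices through unchanged."""
    if requested != "auto":
        return requested
    try:
        import torch  # optional dependency; absent means CPU only
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

An explicit `--device cpu` skips the check entirely, which is why it helps when the GPU is present but short on memory.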

Available Wake Words

The following wake words are supported:

jarvis (default)
alexa
americano
blueberry
bumblebee
computer
grapefruits
grasshopper
hey google
hey siri
ok google
picovoice
porcupine
terminator

Wake Word Engines

Two wake word engines are supported:

openwakeword (default) - Open source, free to use, good accuracy
pvporcupine - Picovoice's Porcupine engine, highly optimized

Choose the engine based on your requirements:

# Use OpenWakeWord (default)
tooling stt listen --wakeword-engine openwakeword

# Use Porcupine for better performance
tooling stt listen --wakeword-engine pvporcupine

Available Models

Model	Speed	Accuracy	Parameters	Use Case
tiny	⚡⚡⚡	⭐⭐	39M	Testing, low-power devices
base	⚡⚡	⭐⭐⭐	74M	Balanced (default)
small	⚡	⭐⭐⭐⭐	244M	Better accuracy
medium	🐌	⭐⭐⭐⭐⭐	769M	High accuracy
large-v2	🐌🐌	⭐⭐⭐⭐⭐	1550M	Best accuracy

Command Line Options

tooling stt listen [OPTIONS]

Options:
  --wake-word TEXT            Wake word to activate recording [default: jarvis]
  --model TEXT                Whisper model (tiny, base, small, medium, large-v2) [default: base]
  --language TEXT             Language code for transcription (empty for auto-detection)
  --realtime/--no-realtime    Enable real-time transcription display [default: realtime]
  --save-to-file PATH         Save transcriptions to a file
  --sensitivity FLOAT         Wake word sensitivity (0.0 to 1.0) [default: 0.6]
  --device TEXT               Device to use (auto, cuda, cpu) [default: auto]
  --wakeword-engine TEXT      Wake word engine (openwakeword, pvporcupine) [default: openwakeword]
  --verbose                   Show verbose output and configuration
  --help                      Show this message and exit
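These options line up closely with constructor parameters of RealtimeSTT's AudioToTextRecorder. The sketch below shows how the mapping might look. The helper names `recorder_kwargs` and `make_recorder` are hypothetical, the mapping itself is an assumption about how the tool is wired, and `device` is expected to be already resolved to cuda or cpu, since auto is an option of this tool rather than of the library.

```python
from typing import Any, Dict


def recorder_kwargs(
    wake_word: str = "jarvis",
    model: str = "base",
    language: str = "",
    realtime: bool = True,
    sensitivity: float = 0.6,
    device: str = "cpu",
    engine: str = "openwakeword",
) -> Dict[str, Any]:
    """Translate CLI-style options into AudioToTextRecorder keyword arguments."""
    if not 0.0 <= sensitivity <= 1.0:
        raise ValueError("sensitivity must be between 0.0 and 1.0")
    return {
        "wake_words": wake_word,
        "model": model,
        "language": language,  # empty string means auto-detection
        "enable_realtime_transcription": realtime,
        "wake_words_sensitivity": sensitivity,
        "device": device,
        "wakeword_backend": engine,
    }


def make_recorder(**options: Any):
    """Build a recorder; imported lazily because it needs audio hardware."""
    from RealtimeSTT import AudioToTextRecorder

    return AudioToTextRecorder(**recorder_kwargs(**options))
```

A recorder built this way would typically be driven by calling its `text()` method in a loop, which blocks until an utterance has been transcribed.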

Examples

Basic usage with jarvis:

tooling stt listen

Fast transcription with tiny model:

tooling stt listen --model tiny --wake-word computer

High accuracy with file output:

tooling stt listen --model large-v2 --save-to-file meeting_notes.txt --verbose

Quick test without wake words:

tooling stt test --duration 5 --model tiny

Custom language and sensitivity:

tooling stt listen --language es --sensitivity 0.8 --wake-word "hey google"

Use different wake word engine:

tooling stt listen --wakeword-engine pvporcupine --wake-word alexa

How it Works

1. Initialization: Loads the selected Whisper model and sets up audio processing
2. Wake Word Detection: Listens for the specified wake word using Porcupine or OpenWakeWord
3. Voice Activity Detection: Uses WebRTC VAD and Silero VAD for accurate speech detection
4. Real-time Transcription: Processes audio chunks in real-time (optional)
5. Final Transcription: Generates high-quality final transcription when speech ends
6. Output: Displays results and optionally saves to file
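The flow above can be pictured as a small state machine: wait for the wake word, collect speech, and emit the final text once silence is detected. The following is a toy model of that control flow only. The event names and the `Listener` class are invented for illustration, and in the real tool these transitions happen inside RealtimeSTT, driven by the wake word engine and the VAD.

```python
from enum import Enum, auto
from typing import List, Optional


class State(Enum):
    WAITING = auto()    # listening for the wake word
    RECORDING = auto()  # wake word heard, capturing speech


class Listener:
    def __init__(self, wake_word: str = "jarvis") -> None:
        self.wake_word = wake_word
        self.state = State.WAITING
        self.buffer: List[str] = []

    def feed(self, event: str, payload: str = "") -> Optional[str]:
        """Process one event; return the final text when speech ends."""
        if self.state is State.WAITING:
            # Everything except the configured wake word is ignored.
            if event == "wake" and payload == self.wake_word:
                self.state = State.RECORDING
            return None
        if event == "speech":
            self.buffer.append(payload)  # partial (real-time) chunk
            return None
        if event == "silence":
            text = " ".join(self.buffer)  # final transcription
            self.buffer.clear()
            self.state = State.WAITING
            return text
        return None
```

Speech that arrives before the wake word is dropped, which mirrors why nothing is transcribed until "jarvis" (or the chosen wake word) is heard.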

Performance Tips

GPU: Use CUDA for 3-5x faster transcription
Model: Use tiny or base for real-time applications
Sensitivity: Adjust wake word sensitivity based on environment noise
Device: Set --device cpu if experiencing GPU memory issues
Real-time: Pass --no-realtime to disable the live display for better final transcription performance

Troubleshooting

No microphone detected:

# Check audio devices
tooling stt info

CUDA not available:

# Install CUDA-enabled PyTorch
pip install torch==2.5.1+cu121 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

Wake word not detected:

# Increase sensitivity
tooling stt listen --sensitivity 0.8 --verbose

Poor transcription quality:

# Use larger model
tooling stt listen --model large-v2

Development Guide

How to Add New Packages

To add a new production dependency (e.g., 'requests'):

uv add requests

To add a new development dependency (e.g., 'ipdb'):

uv add --dev ipdb

After adding dependencies, always re-generate requirements.txt:

uv pip compile pyproject.toml -o requirements.txt

How to Build Packages

To build your project's distributable packages (.whl, .tar.gz):

python -m build

Or using the virtual environment directly (uv creates it at .venv by default; adjust the path if yours differs):

.venv/bin/python -m build

Offline Build

To build offline packages for deployment:

./dev_scripts/build_offline.sh

This creates offline_packages/ with all dependencies and an install.sh script.