tomatocream/tooling

Last commit: ae7b7d0869 by dingfeng.wong (2025-07-22 22:21:06 +08:00)

Tooling

A collection of useful command-line tools.

OCR Screenshot Tool

A cross-platform CLI tool that takes a screenshot, runs OCR on it with DocTR (a state-of-the-art deep learning OCR library), and copies the recognized text to the clipboard. It preserves the text's layout and can optionally annotate the image with the detected text regions.

Features

🌍 Cross-platform - Works on Windows, macOS, and Linux
⚡ Multiple screenshot methods - Choose the fastest for your system
🔍 Advanced OCR - Uses DocTR with PARSeq recognition model
📝 Smart formatting - Preserves text layout and indentation
🎨 Image annotation - Visualize detected text regions
📋 Clipboard integration - Automatic text copying

Installation

Basic installation:

pip install .

With cross-platform screenshot support:

# For fastest screenshots (recommended)
pip install ".[screenshot-fast]"

# For full automation features (region selection)
pip install ".[screenshot-full]"

# For maximum compatibility (all backends)
pip install ".[screenshot-all]"

Install specific screenshot libraries:

pip install mss          # Fastest (~30x faster than others)
pip install pyautogui    # Interactive region selection
pip install pyscreenshot # Multiple backends
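To see which of the optional backends above are importable in the current environment, a quick check like the following can help. It is a minimal sketch, not part of `ocr-screenshot`; it assumes the import names `mss`, `pyautogui`, `pyscreenshot`, and `PIL` (Pillow's import name), and uses `importlib.util.find_spec` so that nothing is actually imported.

```python
import importlib.util

# Import names of the optional screenshot backends listed above.
# "PIL" is the import name of the Pillow package.
BACKEND_MODULES = {
    "mss": "mss",
    "pyautogui": "pyautogui",
    "pyscreenshot": "pyscreenshot",
    "pillow": "PIL",
}


def installed_backends() -> list[str]:
    """Return the backend names whose modules can be found, without importing them."""
    return [
        name
        for name, module in BACKEND_MODULES.items()
        if importlib.util.find_spec(module) is not None
    ]


if __name__ == "__main__":
    print("Available screenshot backends:", ", ".join(installed_backends()) or "none")
```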

Usage

Basic Commands

Take a screenshot and perform OCR:

ocr-screenshot

With verbose output and annotation:

ocr-screenshot --verbose --annotate --save-image

Screenshot Methods

Choose your preferred screenshot method:

# Auto-detect best method (default)
ocr-screenshot --screenshot-method auto

# Use MSS (fastest)
ocr-screenshot --screenshot-method mss

# Use PyAutoGUI (supports region selection)
ocr-screenshot --screenshot-method pyautogui

# Use Pillow ImageGrab (built-in)
ocr-screenshot --screenshot-method pillow

# Interactive region selection
ocr-screenshot --screenshot-method interactive

# macOS native (region selection with drag)
ocr-screenshot --screenshot-method macos
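The order in which `--screenshot-method auto` tries backends is not documented. The sketch below shows one plausible fallback: it prefers MSS (the fastest method according to this README) and only considers the native macOS method on macOS. The `PREFERENCE` order and the `resolve_method` helper are hypothetical.

```python
from __future__ import annotations

import sys

# Hypothetical preference order, loosely following the speed ranking in this
# README (MSS first). The real tool's "auto" order may differ.
PREFERENCE = ["mss", "macos", "pillow", "pyscreenshot", "pyautogui"]


def resolve_method(requested: str, available: set[str], platform: str = sys.platform) -> str:
    """Map a --screenshot-method value to a concrete, installed backend."""
    if requested != "auto":
        if requested == "macos" and platform != "darwin":
            raise ValueError("the macos method is only available on macOS")
        if requested not in available:
            raise ValueError(f"screenshot method {requested!r} is not available")
        return requested
    for method in PREFERENCE:
        # The native macOS capture only makes sense on macOS.
        if method == "macos" and platform != "darwin":
            continue
        if method in available:
            return method
    raise RuntimeError("no screenshot backend available; try: pip install mss")


print(resolve_method("auto", {"pillow", "pyautogui"}, platform="linux"))  # pillow
```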

Advanced Features

Save screenshot with annotation showing detected text:

ocr-screenshot --save-image --annotate --show-words --show-text

Capture specific monitor (MSS method):

ocr-screenshot --screenshot-method mss --monitor-number 2
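In MSS, `sct.monitors[0]` is a virtual monitor spanning all screens and `sct.monitors[1]` onward are the individual monitors, which is consistent with `--monitor-number 0` meaning "all". The helper below, `select_monitor`, is a hypothetical sketch that validates the number against an MSS-style monitor list; the example geometry is made up.

```python
def select_monitor(monitors: list[dict], number: int) -> dict:
    """Pick an entry from an MSS-style monitor list.

    Index 0 is the virtual screen spanning all monitors; 1..N are the
    individual monitors, matching the "0=all" note for --monitor-number.
    """
    if not 0 <= number < len(monitors):
        raise ValueError(
            f"monitor {number} not found; valid values are 0..{len(monitors) - 1}"
        )
    return monitors[number]


# Example layout: two 1920x1080 screens side by side (values are made up).
example_monitors = [
    {"left": 0, "top": 0, "width": 3840, "height": 1080},     # 0: all monitors
    {"left": 0, "top": 0, "width": 1920, "height": 1080},     # 1: primary
    {"left": 1920, "top": 0, "width": 1920, "height": 1080},  # 2: secondary
]
print(select_monitor(example_monitors, 2))
```

With MSS installed, the chosen entry can be passed straight to `grab()`, for example `with mss.mss() as sct: shot = sct.grab(select_monitor(sct.monitors, 2))`.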

Full annotation with all detection levels:

ocr-screenshot --annotate --show-words --show-lines --show-blocks --show-text --save-image

Screenshot Method Comparison

Method	Speed	Region Selection	Cross-Platform	Notes
mss	⚡⚡⚡ Fastest	❌ (crop after)	✅	~30x faster, recommended
pyautogui	⚡ Slow	✅ Interactive	✅	Best for region selection
pillow	⚡ Slow	✅ Coordinates	✅	Built into Pillow
pyscreenshot	⚡ Variable	✅ Coordinates	✅	Multiple backends
macos	⚡⚡ Fast	✅ Native UI	🍎 macOS only	Native drag selection
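Because MSS captures full frames and any region is cropped afterwards, a drag selection has to be turned into coordinates. This sketch (the `Region` type and `region_from_drag` are hypothetical) normalizes two drag corners given in any order and produces both the `(left, upper, right, lower)` box that Pillow's `Image.crop()` takes and the `left`/`top`/`width`/`height` dict that MSS's `grab()` accepts.

```python
from typing import NamedTuple


class Region(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int

    def as_pillow_box(self) -> tuple[int, int, int, int]:
        """(left, upper, right, lower), the box format Image.crop() expects."""
        return (self.left, self.top, self.right, self.bottom)

    def as_mss_dict(self) -> dict[str, int]:
        """The dict format MSS's grab() accepts for a partial capture."""
        return {
            "left": self.left,
            "top": self.top,
            "width": self.right - self.left,
            "height": self.bottom - self.top,
        }


def region_from_drag(x1: int, y1: int, x2: int, y2: int) -> Region:
    """Normalize two drag corners, in any order, into a left/top/right/bottom region."""
    region = Region(min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))
    if region.right == region.left or region.bottom == region.top:
        raise ValueError("selected region is empty")
    return region


# Dragging from bottom-right to top-left still yields a valid region.
r = region_from_drag(500, 400, 100, 150)
print(r.as_pillow_box())  # (100, 150, 500, 400)
print(r.as_mss_dict())    # {'left': 100, 'top': 150, 'width': 400, 'height': 250}
```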

How it works

1. Screenshot: multiple cross-platform capture methods
   - Auto: tries the best method for your platform
   - MSS: fastest full-screen capture
   - Interactive: guided region selection
   - macOS: native drag-to-select interface
2. OCR: text recognition with DocTR
   - Uses the state-of-the-art PARSeq recognition model
   - Preserves text layout and indentation
   - Handles multiple languages
3. Annotation (optional): visual feedback
   - Word-level bounding boxes (red)
   - Line-level groupings (green)
   - Block-level sections (blue)
   - Text overlay showing the detected content
4. Output: the formatted text is copied to the clipboard
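The layout-preservation step is not documented in detail. One simple heuristic, sketched here under stated assumptions, is to estimate the average character width from the detected lines and turn each line's left offset into leading spaces. It assumes each line has been reduced to `(xmin, xmax, text)` with horizontal positions as fractions of the page width (DocTR reports geometry in relative coordinates) and that lines are already in reading order; `lines_to_text` is a hypothetical helper, not the tool's formatter.

```python
from __future__ import annotations

from statistics import median


def lines_to_text(lines: list[tuple[float, float, str]]) -> str:
    """Rebuild indented text from (xmin, xmax, text) tuples, one per detected line.

    Assumes relative (0..1) horizontal coordinates and lines already in
    reading order. An illustrative heuristic, not the tool's formatter.
    """
    widths = [(xmax - xmin) / len(text) for xmin, xmax, text in lines if text and xmax > xmin]
    if not widths:
        return "\n".join(text for _, _, text in lines)
    # The median is less sensitive than the mean to lines with unusually wide glyphs.
    char_width = median(widths)
    left_margin = min(xmin for xmin, _, _ in lines)
    out = []
    for xmin, _, text in lines:
        # Convert the horizontal offset from the left margin into spaces.
        indent = round((xmin - left_margin) / char_width)
        out.append(" " * indent + text)
    return "\n".join(out)


detected = [
    (0.10, 0.28, "def f(x):"),
    (0.18, 0.34, "return x"),
    (0.10, 0.32, "print(f(1))"),
]
print(lines_to_text(detected))
```

Here every line works out to about 0.02 of the page width per character, so the second line's 0.08 offset becomes four spaces of indentation.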

Command Line Options

ocr-screenshot [OPTIONS]

Options:
  --lang TEXT                     Language code for OCR (default: eng)
  --save-image                    Save the screenshot image
  --output-dir PATH               Directory to save images (default: ~/Desktop)
  --verbose                       Show detailed output
  --annotate                      Create annotated image with detection boxes
  --show-words                    Show word-level boxes (default: True)
  --show-lines                    Show line-level boxes
  --show-blocks                   Show block-level boxes  
  --show-text                     Overlay detected text on image
  --screenshot-method TEXT        Method: auto, mss, pyautogui, pillow, pyscreenshot, macos, interactive
  --monitor-number INTEGER        Monitor to capture (MSS method only, 0=all)
  --help                          Show this message and exit

Examples

Quick OCR with fastest method:

ocr-screenshot --screenshot-method mss

Debug OCR accuracy with annotations:

ocr-screenshot --annotate --show-words --show-text --save-image --verbose

Interactive region selection:

ocr-screenshot --screenshot-method interactive --save-image

Multi-monitor setup (capture monitor 2):

ocr-screenshot --screenshot-method mss --monitor-number 2

Speech-to-Text (STT) Tool

A real-time speech-to-text tool built on RealtimeSTT with wake word activation. It uses "jarvis" as the default wake word and supports live transcription with several output options.

Features

🎙️ Real-time transcription - Live speech-to-text conversion
🎯 Wake word activation - Multiple wake words including "jarvis"
⚡ GPU acceleration - CUDA support for faster processing
🔄 Live display - Real-time transcription preview
💾 File output - Save transcriptions to text files
🎛️ Multiple models - Choose from tiny to large Whisper models
🌍 Multi-language - Support for multiple languages
🧪 Test mode - Test functionality without wake words

Installation

The STT dependencies are included in the base installation:

pip install .

For optimal performance with GPU acceleration:

# For CUDA 11.8
pip install torch==2.5.1+cu118 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118

# For CUDA 12.X
pip install torch==2.5.1+cu121 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

Usage

Basic Commands

Start STT with jarvis wake word:

tooling stt listen

Test STT without wake words:

tooling stt test

Show system information:

tooling stt info

Wake Word Options

Use different wake words:

# Use alexa wake word
tooling stt listen --wake-word alexa

# Use hey google wake word  
tooling stt listen --wake-word "hey google"

# Use computer wake word
tooling stt listen --wake-word computer

Model Selection

Choose different Whisper models for speed vs accuracy:

# Fastest (tiny model)
tooling stt listen --model tiny

# Balanced (base model, default)
tooling stt listen --model base

# Best accuracy (large model)
tooling stt listen --model large-v2

Advanced Features

Save transcriptions to file:

tooling stt listen --save-to-file transcripts.txt
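This README does not document the file format used by `--save-to-file`. As an illustration only, appending each final transcription as a timestamped line could look like the sketch below; the `[YYYY-MM-DD HH:MM:SS]` prefix and the `append_transcript` helper are assumptions.

```python
from __future__ import annotations

import tempfile
from datetime import datetime
from pathlib import Path


def append_transcript(path: str | Path, text: str, when: datetime | None = None) -> None:
    """Append one finished transcription as a timestamped line.

    The "[YYYY-MM-DD HH:MM:SS] text" format is a hypothetical choice.
    """
    when = when or datetime.now()
    line = f"[{when:%Y-%m-%d %H:%M:%S}] {text.strip()}\n"
    # Append mode keeps earlier transcriptions; UTF-8 handles non-English text.
    with open(path, "a", encoding="utf-8") as f:
        f.write(line)


# Demo in a throwaway directory with fixed timestamps.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "transcripts.txt"
    append_transcript(log, "  hello world ", datetime(2025, 1, 2, 3, 4, 5))
    append_transcript(log, "second note", datetime(2025, 1, 2, 3, 5, 0))
    contents = log.read_text(encoding="utf-8")
print(contents, end="")
```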

Disable real-time display for better performance:

tooling stt listen --no-realtime

Set custom sensitivity and language:

tooling stt listen --sensitivity 0.8 --language en --verbose

Force CPU usage:

tooling stt listen --device cpu

Available Wake Words

The following wake words are supported:

jarvis (default)
alexa
americano
blueberry
bumblebee
computer
grapefruits
grasshopper
hey google
hey siri
ok google
picovoice
porcupine
terminator
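Multi-word wake words need quoting on the command line, as in `--wake-word "hey google"`. Whether the tool also normalizes case or extra whitespace is not documented; the hypothetical `normalize_wake_word` below shows one way to do that and to reject words outside the list above.

```python
# Wake words copied from the list above.
SUPPORTED_WAKE_WORDS = {
    "jarvis", "alexa", "americano", "blueberry", "bumblebee", "computer",
    "grapefruits", "grasshopper", "hey google", "hey siri", "ok google",
    "picovoice", "porcupine", "terminator",
}


def normalize_wake_word(raw: str) -> str:
    """Lower-case the input, collapse repeated whitespace, and check it is supported."""
    word = " ".join(raw.lower().split())
    if word not in SUPPORTED_WAKE_WORDS:
        options = ", ".join(sorted(SUPPORTED_WAKE_WORDS))
        raise ValueError(f"unsupported wake word {raw!r}; choose one of: {options}")
    return word


print(normalize_wake_word("  Hey   Google "))  # hey google
```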

Wake Word Engines

Two wake word engines are supported:

openwakeword (default) - Open source, free to use, good accuracy
pvporcupine - Picovoice's Porcupine engine, highly optimized

Choose the engine based on your requirements:

# Use OpenWakeWord (default)
tooling stt listen --wakeword-engine openwakeword

# Use Porcupine for better performance
tooling stt listen --wakeword-engine pvporcupine

Available Models

Model	Speed	Accuracy	Parameters	Use Case
tiny	⚡⚡⚡	⭐⭐	39M	Testing, low-power devices
base	⚡⚡	⭐⭐⭐	74M	Balanced (default)
small	⚡	⭐⭐⭐⭐	244M	Better accuracy
medium	🐌	⭐⭐⭐⭐⭐	769M	High accuracy
large-v2	🐌🐌	⭐⭐⭐⭐⭐	1550M	Best accuracy

Command Line Options

tooling stt listen [OPTIONS]

Options:
  --wake-word TEXT          Wake word to activate recording [default: jarvis]
  --model TEXT              Whisper model (tiny, base, small, medium, large-v2) [default: base]
  --language TEXT           Language code for transcription (empty for auto-detection)
  --realtime/--no-realtime  Enable real-time transcription display [default: realtime]
  --save-to-file PATH       Save transcriptions to a file
  --sensitivity FLOAT       Wake word sensitivity (0.0 to 1.0) [default: 0.6]
  --device TEXT             Device to use (auto, cuda, cpu) [default: auto]
  --wakeword-engine TEXT    Wake word engine (openwakeword, pvporcupine) [default: openwakeword]
  --verbose                 Show verbose output and configuration
  --help                    Show this message and exit
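How these options reach the RealtimeSTT recorder is not shown in this README. The sketch below validates the values and maps them to keyword arguments; the names are intended to mirror RealtimeSTT's `AudioToTextRecorder` constructor, but that mapping is an assumption, so check it against the RealtimeSTT version you have installed. `build_recorder_kwargs` is hypothetical, and it expects `--device auto` to have been resolved to `cuda` or `cpu` already.

```python
from __future__ import annotations

VALID_MODELS = {"tiny", "base", "small", "medium", "large-v2"}
VALID_ENGINES = {"openwakeword", "pvporcupine"}
VALID_DEVICES = {"cuda", "cpu"}


def build_recorder_kwargs(
    wake_word: str = "jarvis",
    model: str = "base",
    language: str = "",
    realtime: bool = True,
    sensitivity: float = 0.6,
    device: str = "cpu",
    engine: str = "openwakeword",
) -> dict[str, object]:
    """Validate `tooling stt listen` option values and map them to recorder kwargs.

    The keyword names are assumed to mirror RealtimeSTT's AudioToTextRecorder.
    The device must already be resolved to "cuda" or "cpu".
    """
    if model not in VALID_MODELS:
        raise ValueError(f"unknown model {model!r}")
    if engine not in VALID_ENGINES:
        raise ValueError(f"unknown wake word engine {engine!r}")
    if device not in VALID_DEVICES:
        raise ValueError(f"unknown device {device!r}")
    if not 0.0 <= sensitivity <= 1.0:
        raise ValueError("sensitivity must be between 0.0 and 1.0")
    return {
        "model": model,
        "language": language,  # empty string: let Whisper auto-detect
        "wake_words": wake_word,
        "wake_words_sensitivity": sensitivity,
        "wakeword_backend": engine,
        "enable_realtime_transcription": realtime,
        "device": device,
    }


print(build_recorder_kwargs(model="tiny", wake_word="computer"))
```

If the names do match your RealtimeSTT version, the resulting dict could be passed on as `AudioToTextRecorder(**kwargs)`.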

Examples

Basic usage with jarvis:

tooling stt listen

Fast transcription with tiny model:

tooling stt listen --model tiny --wake-word computer

High accuracy with file output:

tooling stt listen --model large-v2 --save-to-file meeting_notes.txt --verbose

Quick test without wake words:

tooling stt test --duration 5 --model tiny

Custom language and sensitivity:

tooling stt listen --language es --sensitivity 0.8 --wake-word "hey google"

Use different wake word engine:

tooling stt listen --wakeword-engine pvporcupine --wake-word alexa

How it Works

1. Initialization: loads the selected Whisper model and sets up audio processing
2. Wake word detection: listens for the specified wake word using Porcupine or OpenWakeWord
3. Voice activity detection: uses WebRTC VAD and Silero VAD to detect when speech starts and stops
4. Real-time transcription (optional): transcribes audio chunks while you are still speaking
5. Final transcription: produces the final, higher-quality transcription once speech ends
6. Output: displays the result and optionally saves it to a file
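The listen loop above can be viewed as a small state machine: speech is ignored until the wake word is heard, recording continues until voice activity detection reports silence, and after the final transcription the tool returns to waiting. This is a simplified model for illustration; the state and event names are invented, and the optional real-time preview would run during the recording state.

```python
from enum import Enum, auto


class State(Enum):
    WAITING_FOR_WAKE_WORD = auto()
    RECORDING = auto()
    TRANSCRIBING = auto()


# Allowed transitions for the listen loop described above (simplified).
TRANSITIONS = {
    (State.WAITING_FOR_WAKE_WORD, "wake_word"): State.RECORDING,
    (State.RECORDING, "silence"): State.TRANSCRIBING,
    (State.TRANSCRIBING, "done"): State.WAITING_FOR_WAKE_WORD,
}


def step(state: State, event: str) -> State:
    """Advance the state machine; events that don't apply leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


state = State.WAITING_FOR_WAKE_WORD
for event in ["speech", "wake_word", "speech", "silence", "done"]:
    state = step(state, event)
    print(f"{event:>9} -> {state.name}")
```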

Performance Tips

GPU: Use CUDA for 3-5x faster transcription than on the CPU
Model: Use tiny or base for real-time applications
Sensitivity: Adjust wake word sensitivity to match the background noise level
Device: Set --device cpu if you run into GPU memory issues
Real-time: Pass --no-realtime to turn off the live preview, which can improve final transcription performance
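The `--device auto` setting presumably falls back to the CPU when CUDA is unavailable, while `--device cpu` or `--device cuda` forces a choice. A sketch of that logic follows; `resolve_device` and `cuda_is_available` are hypothetical helpers, and the only library call is PyTorch's `torch.cuda.is_available()`, used only if PyTorch is installed.

```python
from __future__ import annotations


def cuda_is_available() -> bool:
    """Return True if PyTorch is installed and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def resolve_device(requested: str = "auto", cuda_available: bool | None = None) -> str:
    """Resolve a --device value: 'auto' picks CUDA when available, otherwise CPU."""
    if requested not in {"auto", "cuda", "cpu"}:
        raise ValueError(f"unknown device {requested!r}")
    if requested != "auto":
        return requested  # an explicit choice always wins
    if cuda_available is None:
        cuda_available = cuda_is_available()
    return "cuda" if cuda_available else "cpu"


print(resolve_device())
```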

Troubleshooting

No microphone detected:

# Check audio devices
tooling stt info

CUDA not available:

# Install CUDA-enabled PyTorch
pip install torch==2.5.1+cu121 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

Wake word not detected:

# Increase sensitivity
tooling stt listen --sensitivity 0.8 --verbose

Poor transcription quality:

# Use larger model
tooling stt listen --model large-v2

Development Guide

How to Add New Packages

To add a new production dependency (e.g., 'requests'):

uv add requests

To add a new development dependency (e.g., 'ipdb'):

uv add --dev ipdb

After adding dependencies, always regenerate requirements.txt:

uv pip compile pyproject.toml -o requirements.txt
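To catch a forgotten regeneration, a script can compare file modification times. This is a heuristic sketch (the `requirements_stale` helper is hypothetical): it flags any newer edit to `pyproject.toml`, not only dependency changes.

```python
from __future__ import annotations

from pathlib import Path


def requirements_stale(project_dir: str | Path = ".") -> bool:
    """Return True if requirements.txt is missing or older than pyproject.toml.

    Returns False when there is no pyproject.toml to compile from.
    """
    root = Path(project_dir)
    pyproject = root / "pyproject.toml"
    requirements = root / "requirements.txt"
    if not pyproject.exists():
        return False
    if not requirements.exists():
        return True
    return pyproject.stat().st_mtime > requirements.stat().st_mtime


if __name__ == "__main__":
    if requirements_stale():
        print("requirements.txt is out of date; run: uv pip compile pyproject.toml -o requirements.txt")
```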

How to Build Packages

To build your project's distributable packages (.whl, .tar.gz):

python -m build

Or using the virtual environment directly:

./venv/bin/python -m build

Offline Build

To build offline packages for deployment:

./dev_scripts/build_offline.sh

This creates an offline_packages/ directory containing all dependencies and an install.sh script.
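The contents of the generated install.sh are not shown here. Installing from a local package directory without network access is commonly done with pip's `--no-index` and `--find-links` options; the sketch below builds such a command. The `offline_install_command` helper and the package name `tooling` are assumptions.

```python
from __future__ import annotations

import sys
from pathlib import Path


def offline_install_command(
    packages_dir: str | Path = "offline_packages", package: str = "tooling"
) -> list[str]:
    """Build a pip command that installs only from a local package directory."""
    return [
        sys.executable, "-m", "pip", "install",
        "--no-index",                       # never contact PyPI
        "--find-links", str(packages_dir),  # resolve everything from this folder
        package,
    ]


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(..., check=True) to actually install.
    print(" ".join(offline_install_command()))
```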
