# Tooling

A collection of useful command-line tools.

## OCR Screenshot Tool

A cross-platform CLI tool that takes screenshots, performs OCR using DocTR (state-of-the-art deep learning OCR), and copies the result to the clipboard. It preserves text formatting and can optionally annotate the image.

### Features

- 🌍 **Cross-platform** - Works on Windows, macOS, and Linux
- ⚡ **Multiple screenshot methods** - Choose the fastest for your system
- 🔍 **Advanced OCR** - Uses DocTR with PARSeq recognition model
- 📝 **Smart formatting** - Preserves text layout and indentation
- 🎨 **Image annotation** - Visualize detected text regions
- 📋 **Clipboard integration** - Automatic text copying

### Installation

#### Basic installation:
```bash
pip install .
```

#### With cross-platform screenshot support:
```bash
# For fastest screenshots (recommended)
pip install ".[screenshot-fast]"

# For full automation features (region selection)
pip install ".[screenshot-full]"

# For maximum compatibility (all backends)
pip install ".[screenshot-all]"
```

#### Install specific screenshot libraries:
```bash
pip install mss          # Fastest (~30x faster than others)
pip install pyautogui    # Interactive region selection
pip install pyscreenshot # Multiple backends
```

### Usage

#### Basic Commands

Take a screenshot and perform OCR:
```bash
ocr-screenshot
```

With verbose output and annotation:
```bash
ocr-screenshot --verbose --annotate --save-image
```

#### Screenshot Methods

Choose your preferred screenshot method:

```bash
# Auto-detect best method (default)
ocr-screenshot --screenshot-method auto

# Use MSS (fastest)
ocr-screenshot --screenshot-method mss

# Use PyAutoGUI (supports region selection)
ocr-screenshot --screenshot-method pyautogui

# Use Pillow ImageGrab (built-in)
ocr-screenshot --screenshot-method pillow

# Interactive region selection
ocr-screenshot --screenshot-method interactive

# macOS native (region selection with drag)
ocr-screenshot --screenshot-method macos
```

#### Advanced Features

Save screenshot with annotation showing detected text:
```bash
ocr-screenshot --save-image --annotate --show-words --show-text
```

Capture specific monitor (MSS method):
```bash
ocr-screenshot --screenshot-method mss --monitor-number 2
```

Full annotation with all detection levels:
```bash
ocr-screenshot --annotate --show-words --show-lines --show-blocks --show-text --save-image
```

### Screenshot Method Comparison

| Method | Speed | Region Selection | Cross-Platform | Notes |
|--------|-------|------------------|----------------|-------|
| **mss** | ⚡⚡⚡ Fastest | ❌ (crop after) | ✅ | ~30x faster, recommended |
| **pyautogui** | ⚡ Slow | ✅ Interactive | ✅ | Best for region selection |
| **pillow** | ⚡ Slow | ✅ Coordinates | ✅ | Built into Pillow |
| **pyscreenshot** | ⚡ Variable | ✅ Coordinates | ✅ | Multiple backends |
| **macos** | ⚡⚡ Fast | ✅ Native UI | 🍎 macOS only | Native drag selection |

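
The `auto` method is described only as picking the best method for the platform. As a rough illustration of how such a choice could be made, the sketch below ranks backends by the speed column of the comparison table and returns the first one that is installed. The preference order, the function names, and the macOS fallback are assumptions for illustration, not the tool's actual logic.

```python
import importlib.util
import sys

# Assumed preference order, loosely following the speed column of the table.
PREFERENCE = ["mss", "pillow", "pyscreenshot", "pyautogui"]

# Pillow is imported as "PIL"; the other method names match their module names.
MODULE_NAMES = {
    "mss": "mss",
    "pillow": "PIL",
    "pyscreenshot": "pyscreenshot",
    "pyautogui": "pyautogui",
}


def installed_backends():
    """Return the backends whose Python module can be located."""
    return [
        name
        for name in PREFERENCE
        if importlib.util.find_spec(MODULE_NAMES[name]) is not None
    ]


def pick_backend(available, platform=sys.platform):
    """Pick the first preferred backend that is available (hypothetical logic)."""
    for name in PREFERENCE:
        if name in available:
            return name
    # Assumption: on macOS the native method can serve as a last resort.
    if platform == "darwin":
        return "macos"
    raise RuntimeError("no screenshot backend available")


if __name__ == "__main__":
    print(pick_backend(installed_backends()))
```

Keeping the detection (`installed_backends`) separate from the ranking (`pick_backend`) makes the ranking easy to check without installing any screenshot library.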
### How it works

1. **Screenshot**: Multiple cross-platform methods available
   - **Auto**: Tries best method for your platform
   - **MSS**: Fastest full-screen capture
   - **Interactive**: Guided region selection
   - **macOS**: Native drag-to-select interface

2. **OCR**: Advanced DocTR processing
   - Uses state-of-the-art PARSeq recognition model
   - Preserves text layout and indentation
   - Handles multiple languages

3. **Annotation** (optional): Visual feedback
   - Word-level bounding boxes (red)
   - Line-level groupings (green)
   - Block-level sections (blue)
   - Text overlay showing detected content

4. **Output**: Formatted text copied to clipboard

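
To make the "preserves text layout and indentation" step more concrete, here is a minimal sketch of one way to rebuild indented lines from word positions. The input format (text plus pixel coordinates), the `char_width` and `line_tol` parameters, and the function name are all assumptions for illustration; this is neither DocTR's output structure nor necessarily how this tool does it.

```python
from typing import List, Tuple

# Hypothetical word record: (text, left x, top y), both coordinates in pixels.
Word = Tuple[str, float, float]


def layout_text(words: List[Word], char_width: float = 10.0, line_tol: float = 8.0) -> str:
    """Rebuild text lines from word positions, keeping leading indentation."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: (w[2], w[1])):
        # A word joins the current line if its top edge is close to the
        # top edge of that line's first word; otherwise it starts a new line.
        if lines and abs(word[2] - lines[-1][0][2]) <= line_tol:
            lines[-1].append(word)
        else:
            lines.append([word])

    out = []
    for line in lines:
        line.sort(key=lambda w: w[1])  # left-to-right within the line
        # Convert the first word's x offset into a number of leading spaces.
        indent = " " * int(round(line[0][1] / char_width))
        out.append(indent + " ".join(w[0] for w in line))
    return "\n".join(out)


if __name__ == "__main__":
    sample = [("def", 0, 0), ("f():", 40, 2), ("return", 40, 20), ("1", 110, 21)]
    print(layout_text(sample))
```

With the sample input, the second line starts 40 pixels in, which maps to four spaces at the assumed 10 pixels per character.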
### Command Line Options

```bash
ocr-screenshot [OPTIONS]

Options:
  --lang TEXT                 Language code for OCR (default: eng)
  --save-image                Save the screenshot image
  --output-dir PATH           Directory to save images (default: ~/Desktop)
  --verbose                   Show detailed output
  --annotate                  Create annotated image with detection boxes
  --show-words                Show word-level boxes (default: True)
  --show-lines                Show line-level boxes
  --show-blocks               Show block-level boxes
  --show-text                 Overlay detected text on image
  --screenshot-method TEXT    Method: auto, mss, pyautogui, pillow, pyscreenshot, macos, interactive
  --monitor-number INTEGER    Monitor to capture (MSS method only, 0=all)
  --help                      Show this message and exit
```

### Examples

**Quick OCR with fastest method:**
```bash
ocr-screenshot --screenshot-method mss
```

**Debug OCR accuracy with annotations:**
```bash
ocr-screenshot --annotate --show-words --show-text --save-image --verbose
```

**Interactive region selection:**
```bash
ocr-screenshot --screenshot-method interactive --save-image
```

**Multi-monitor setup (capture monitor 2):**
```bash
ocr-screenshot --screenshot-method mss --monitor-number 2
```

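
For readers curious how `--monitor-number` could map onto the `mss` library, the sketch below shows the idea. In `mss`, `sct.monitors[0]` is the combined area of all monitors and indices from 1 upward are individual monitors, which matches the `0=all` convention in the options. The helper names and the default output path are hypothetical, and this is not necessarily how the tool implements the flag.

```python
def select_monitor(monitors, number):
    """Return the monitor entry for `number`; 0 means all monitors combined."""
    if not 0 <= number < len(monitors):
        raise ValueError(
            f"monitor {number} not found; valid range is 0-{len(monitors) - 1}"
        )
    return monitors[number]


def capture(number, path="screenshot.png"):
    """Grab one monitor with mss and write it to `path` as a PNG file."""
    # Imported lazily so select_monitor() works even without mss installed.
    import mss
    import mss.tools

    with mss.mss() as sct:
        shot = sct.grab(select_monitor(sct.monitors, number))
        mss.tools.to_png(shot.rgb, shot.size, output=path)
    return path


if __name__ == "__main__":
    print(capture(0))
```

Validating the index before grabbing gives a clearer error than the `IndexError` a plain list lookup would raise.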
## Speech-to-Text (STT) Tool

A real-time speech-to-text tool built on RealtimeSTT with wake word activation. It uses "jarvis" as the default wake word and supports live transcription with several output options.

### Features

- 🎙️ **Real-time transcription** - Live speech-to-text conversion
- 🎯 **Wake word activation** - Multiple wake words including "jarvis"
- ⚡ **GPU acceleration** - CUDA support for faster processing
- 🔄 **Live display** - Real-time transcription preview
- 💾 **File output** - Save transcriptions to text files
- 🎛️ **Multiple models** - Choose from tiny to large Whisper models
- 🌍 **Multi-language** - Support for multiple languages
- 🧪 **Test mode** - Test functionality without wake words

### Installation

The STT dependencies are included in the base installation:
```bash
pip install .
```

For optimal performance with GPU acceleration:
```bash
# For CUDA 11.8
pip install torch==2.5.1+cu118 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118

# For CUDA 12.X
pip install torch==2.5.1+cu121 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
```

### Usage

#### Basic Commands

Start STT with the jarvis wake word:
```bash
tooling stt listen
```

Test STT without wake words:
```bash
tooling stt test
```

Show system information:
```bash
tooling stt info
```

#### Wake Word Options

Use different wake words:
```bash
# Use alexa wake word
tooling stt listen --wake-word alexa

# Use hey google wake word
tooling stt listen --wake-word "hey google"

# Use computer wake word
tooling stt listen --wake-word computer
```

#### Model Selection

Choose a Whisper model to trade speed against accuracy:
```bash
# Fastest (tiny model)
tooling stt listen --model tiny

# Balanced (base model, default)
tooling stt listen --model base

# Best accuracy (large model)
tooling stt listen --model large-v2
```

#### Advanced Features

Save transcriptions to file:
```bash
tooling stt listen --save-to-file transcripts.txt
```

Disable real-time display for better performance:
```bash
tooling stt listen --no-realtime
```

Set custom sensitivity and language:
```bash
tooling stt listen --sensitivity 0.8 --language en --verbose
```

Force CPU usage:
```bash
tooling stt listen --device cpu
```

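
The `--device` option accepts `auto`, `cuda`, or `cpu`. The sketch below shows one plausible way to resolve `auto`, using PyTorch's `torch.cuda.is_available()` when PyTorch is installed. The function names and the choice to raise an error when `cuda` is requested but unavailable are assumptions, not the tool's documented behavior.

```python
def cuda_available():
    """Report whether PyTorch can see a CUDA device; False if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def resolve_device(requested="auto", has_cuda=None):
    """Map a --device value (auto, cuda, cpu) to the device actually used."""
    if requested not in ("auto", "cuda", "cpu"):
        raise ValueError(f"unknown device: {requested!r}")
    if has_cuda is None:
        has_cuda = cuda_available()
    if requested == "cpu":
        return "cpu"
    if requested == "cuda" and not has_cuda:
        # Assumed behavior: fail loudly rather than silently fall back to CPU.
        raise RuntimeError("CUDA requested but not available")
    return "cuda" if has_cuda else "cpu"


if __name__ == "__main__":
    print(resolve_device("auto"))
```

Passing `has_cuda` explicitly keeps the decision logic checkable on machines without a GPU.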
### Available Wake Words

The following wake words are supported:
- **jarvis** (default)
- alexa
- americano
- blueberry
- bumblebee
- computer
- grapefruits
- grasshopper
- hey google
- hey siri
- ok google
- picovoice
- porcupine
- terminator

### Wake Word Engines

Two wake word engines are supported:

- **openwakeword** (default) - Open source, free to use, good accuracy
- **pvporcupine** - Picovoice's Porcupine engine, highly optimized

Choose the engine based on your requirements:
```bash
# Use OpenWakeWord (default)
tooling stt listen --wakeword-engine openwakeword

# Use Porcupine for better performance
tooling stt listen --wakeword-engine pvporcupine
```

### Available Models

| Model | Speed | Accuracy | Parameters | Use Case |
|-------|-------|----------|------------|----------|
| **tiny** | ⚡⚡⚡ | ⭐⭐ | 39M | Testing, low-power devices |
| **base** | ⚡⚡ | ⭐⭐⭐ | 74M | Balanced (default) |
| **small** | ⚡ | ⭐⭐⭐⭐ | 244M | Better accuracy |
| **medium** | 🐌 | ⭐⭐⭐⭐⭐ | 769M | High accuracy |
| **large-v2** | 🐌🐌 | ⭐⭐⭐⭐⭐ | 1550M | Best accuracy |

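
If you want to choose a model programmatically, a small helper like the one below can pick the largest model whose size figure fits a budget. The numbers are copied from the size column of the table above and treated as plain numbers; the function name and the budget parameter are hypothetical.

```python
# (model name, size figure from the table), ordered from smallest to largest.
MODELS = [
    ("tiny", 39),
    ("base", 74),
    ("small", 244),
    ("medium", 769),
    ("large-v2", 1550),
]


def largest_model_within(budget):
    """Return the largest model whose size figure does not exceed `budget`."""
    chosen = None
    for name, size in MODELS:
        if size <= budget:
            chosen = name  # keep overwriting: later entries are larger
    if chosen is None:
        raise ValueError(f"no model fits within a budget of {budget}")
    return chosen


if __name__ == "__main__":
    print(largest_model_within(300))
```

The result can be passed straight to `--model`.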
### Command Line Options

```bash
tooling stt listen [OPTIONS]

Options:
  --wake-word TEXT            Wake word to activate recording [default: jarvis]
  --model TEXT                Whisper model (tiny, base, small, medium, large-v2) [default: base]
  --language TEXT             Language code for transcription (empty for auto-detection)
  --realtime/--no-realtime    Enable real-time transcription display [default: realtime]
  --save-to-file PATH         Save transcriptions to a file
  --sensitivity FLOAT         Wake word sensitivity (0.0 to 1.0) [default: 0.6]
  --device TEXT               Device to use (auto, cuda, cpu) [default: auto]
  --wakeword-engine TEXT      Wake word engine (openwakeword, pvporcupine) [default: openwakeword]
  --verbose                   Show verbose output and configuration
  --help                      Show this message and exit
```

### Examples

**Basic usage with jarvis:**
```bash
tooling stt listen
```

**Fast transcription with tiny model:**
```bash
tooling stt listen --model tiny --wake-word computer
```

**High accuracy with file output:**
```bash
tooling stt listen --model large-v2 --save-to-file meeting_notes.txt --verbose
```

**Quick test without wake words:**
```bash
tooling stt test --duration 5 --model tiny
```

**Custom language and sensitivity:**
```bash
tooling stt listen --language es --sensitivity 0.8 --wake-word "hey google"
```

**Use different wake word engine:**
```bash
tooling stt listen --wakeword-engine pvporcupine --wake-word alexa
```

### How it Works

1. **Initialization**: Loads the selected Whisper model and sets up audio processing
2. **Wake Word Detection**: Listens for the specified wake word using Porcupine or OpenWakeWord
3. **Voice Activity Detection**: Uses WebRTC VAD and Silero VAD for accurate speech detection
4. **Real-time Transcription**: Processes audio chunks in real time (optional)
5. **Final Transcription**: Generates a high-quality final transcription when speech ends
6. **Output**: Displays results and optionally saves them to a file

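
The control flow of steps 2 through 6 can be pictured as a small state machine. The sketch below simulates it over a list of pre-labelled events; the event kinds (`wake`, `speech`, `silence`) are hypothetical stand-ins for the outputs of the wake word engine and the voice activity detectors. It does not use RealtimeSTT's API and is only meant to show the ordering of the stages.

```python
def run_pipeline(events):
    """Yield one final transcription per utterance from (kind, text) events."""
    listening = False  # True once the wake word has been heard
    chunks = []        # partial transcriptions of the current utterance
    for kind, text in events:
        if kind == "wake":
            # Wake word detected: start collecting a new utterance.
            listening, chunks = True, []
        elif kind == "speech" and listening:
            # A real-time preview of the partial text would be shown here.
            chunks.append(text)
        elif kind == "silence" and listening and chunks:
            # Speech ended: emit the final transcription and wait for
            # the next wake word.
            yield " ".join(chunks)
            listening = False


if __name__ == "__main__":
    demo = [
        ("speech", "ignored before wake word"),
        ("wake", None),
        ("speech", "turn on"),
        ("speech", "the lights"),
        ("silence", None),
    ]
    for line in run_pipeline(demo):
        print(line)
```

Speech that arrives before the wake word is dropped, which mirrors the behavior of `tooling stt listen` as described above.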
### Performance Tips

- **GPU**: Use CUDA for 3-5x faster transcription
- **Model**: Use `tiny` or `base` for real-time applications
- **Sensitivity**: Adjust wake word sensitivity to match the ambient noise level
- **Device**: Set `--device cpu` if you run into GPU memory issues
- **Real-time**: Pass `--no-realtime` to disable the live display for better final transcription performance

### Troubleshooting

**No microphone detected:**
```bash
# Check audio devices
tooling stt info
```

**CUDA not available:**
```bash
# Install CUDA-enabled PyTorch
pip install torch==2.5.1+cu121 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
```

**Wake word not detected:**
```bash
# Increase sensitivity
tooling stt listen --sensitivity 0.8 --verbose
```

**Poor transcription quality:**
```bash
# Use larger model
tooling stt listen --model large-v2
```

## Development Guide

### How to Add New Packages

To add a new production dependency (e.g., 'requests'):
```bash
uv add requests
```

To add a new development dependency (e.g., 'ipdb'):
```bash
uv add --dev ipdb
```

After adding dependencies, always regenerate `requirements.txt`:
```bash
uv pip compile pyproject.toml -o requirements.txt
```

### How to Build Packages

To build your project's distributable packages (.whl, .tar.gz):
```bash
python -m build
```

Or using the virtual environment directly:
```bash
./venv/bin/python -m build
```

### Offline Build

To build offline packages for deployment:
```bash
./dev_scripts/build_offline.sh
```

This creates `offline_packages/`, which contains all dependencies and an `install.sh` script.