# Tooling

A collection of useful command-line tools.

## OCR Screenshot Tool

A cross-platform CLI tool that takes a screenshot, runs OCR on it with DocTR (a deep-learning OCR library), and copies the recognized text to the clipboard. It preserves text layout and indentation, and can optionally save an annotated copy of the image.

### Features

- 🌍 **Cross-platform** - Works on Windows, macOS, and Linux
- ⚡ **Multiple screenshot methods** - Choose the fastest for your system
- 🔍 **Advanced OCR** - Uses DocTR with the PARSeq recognition model
- 📝 **Smart formatting** - Preserves text layout and indentation
- 🎨 **Image annotation** - Visualize detected text regions
- 📋 **Clipboard integration** - Automatic text copying

### Installation

#### Basic installation
```bash
pip install .
```

#### With cross-platform screenshot support
```bash
# For fastest screenshots (recommended)
pip install ".[screenshot-fast]"

# For full automation features (region selection)
pip install ".[screenshot-full]"

# For maximum compatibility (all backends)
pip install ".[screenshot-all]"
```

#### Install specific screenshot libraries
```bash
pip install mss          # Fastest (~30x faster than others)
pip install pyautogui    # Interactive region selection
pip install pyscreenshot # Multiple backends
```

### Usage

#### Basic Commands

Take a screenshot and perform OCR:
```bash
ocr-screenshot
```

With verbose output and annotation:
```bash
ocr-screenshot --verbose --annotate --save-image
```

#### Screenshot Methods

Choose your preferred screenshot method:

```bash
# Auto-detect best method (default)
ocr-screenshot --screenshot-method auto

# Use MSS (fastest)
ocr-screenshot --screenshot-method mss

# Use PyAutoGUI (supports region selection)
ocr-screenshot --screenshot-method pyautogui

# Use Pillow ImageGrab (built-in)
ocr-screenshot --screenshot-method pillow

# Interactive region selection
ocr-screenshot --screenshot-method interactive

# macOS native (region selection with drag)
ocr-screenshot --screenshot-method macos
```
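
How `auto` chooses a backend is not documented here. As a minimal sketch, assuming it simply prefers the fastest installed library, the selection could look like the following. The `PREFERENCE` order and the `pick_method` helper are hypothetical, not the tool's actual code.

```python
import importlib.util

# Hypothetical preference order (method name, importable module), fastest first.
# The real tool's order may differ; treat this list as an assumption.
PREFERENCE = [("mss", "mss"), ("pyautogui", "pyautogui"), ("pillow", "PIL")]


def is_installed(module_name: str) -> bool:
    """Return True if the top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def pick_method(installed=is_installed) -> str:
    """Return the first method whose backing module is available."""
    for method, module_name in PREFERENCE:
        if installed(module_name):
            return method
    raise RuntimeError("No screenshot backend found; try: pip install mss")
```

Passing the `installed` check as a parameter keeps the choice easy to test without installing or removing any library.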

#### Advanced Features

Save screenshot with annotation showing detected text:
```bash
ocr-screenshot --save-image --annotate --show-words --show-text
```

Capture a specific monitor (MSS method only):
```bash
ocr-screenshot --screenshot-method mss --monitor-number 2
```
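
For reference, this is roughly what a per-monitor capture looks like when calling the `mss` library directly. It is a sketch, not the tool's implementation; `select_monitor` and `capture_monitor` are hypothetical helper names. In `mss`, index 0 of `monitors` is the combined area of all displays and indices 1..N are the individual monitors, which matches the `0=all` convention of `--monitor-number`.

```python
from pathlib import Path


def select_monitor(monitors: list, number: int) -> dict:
    """Pick one entry from an mss-style monitor list, validating the index."""
    if not 0 <= number < len(monitors):
        raise ValueError(
            f"monitor {number} not found; valid range is 0-{len(monitors) - 1}"
        )
    return monitors[number]


def capture_monitor(number: int, output: Path) -> Path:
    """Grab one monitor and write it to a PNG file (requires `pip install mss`)."""
    import mss
    import mss.tools

    with mss.mss() as sct:
        region = select_monitor(sct.monitors, number)
        shot = sct.grab(region)
        # to_png expects raw RGB bytes plus the (width, height) size.
        mss.tools.to_png(shot.rgb, shot.size, output=str(output))
    return output
```

For example, `capture_monitor(2, Path("monitor2.png"))` would save the second display, provided one is attached.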

Full annotation with all detection levels:
```bash
ocr-screenshot --annotate --show-words --show-lines --show-blocks --show-text --save-image
```
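
To illustrate what the annotation flags draw, here is a small sketch using Pillow. It assumes boxes arrive as relative `((xmin, ymin), (xmax, ymax))` pairs in the 0-1 range, which is how DocTR reports straight bounding boxes, and uses the per-level colors this README describes. The `annotate` helper and its `boxes_by_level` argument are hypothetical names.

```python
# Outline color per detection level, as described in this README.
LEVEL_COLORS = {"word": "red", "line": "green", "block": "blue"}


def to_pixel_box(geometry, width, height):
    """Convert a relative ((xmin, ymin), (xmax, ymax)) box to pixel coordinates."""
    (xmin, ymin), (xmax, ymax) = geometry
    return (
        round(xmin * width),
        round(ymin * height),
        round(xmax * width),
        round(ymax * height),
    )


def annotate(image, boxes_by_level):
    """Draw box outlines on a Pillow image in place (requires Pillow)."""
    from PIL import ImageDraw

    draw = ImageDraw.Draw(image)
    width, height = image.size
    for level, boxes in boxes_by_level.items():
        for geometry in boxes:
            draw.rectangle(
                to_pixel_box(geometry, width, height),
                outline=LEVEL_COLORS[level],
                width=2,
            )
    return image
```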

### Screenshot Method Comparison

| Method | Speed | Region Selection | Cross-Platform | Notes |
|--------|-------|------------------|----------------|-------|
| **mss** | ⚡⚡⚡ Fastest | ❌ (crop after) | ✅ | ~30x faster, recommended |
| **pyautogui** | ⚡ Slow | ✅ Interactive | ✅ | Best for region selection |
| **pillow** | ⚡ Slow | ✅ Coordinates | ✅ | Built into Pillow |
| **pyscreenshot** | ⚡ Variable | ✅ Coordinates | ✅ | Multiple backends |
| **macos** | ⚡⚡ Fast | ✅ Native UI | 🍎 macOS only | Native drag selection |

### How It Works

1. **Screenshot**: Multiple cross-platform methods available
   - **Auto**: Tries best method for your platform
   - **MSS**: Fastest full-screen capture
   - **Interactive**: Guided region selection
   - **macOS**: Native drag-to-select interface

2. **OCR**: Advanced DocTR processing
   - Uses state-of-the-art PARSeq recognition model
   - Preserves text layout and indentation
   - Handles multiple languages

3. **Annotation** (optional): Visual feedback
   - Word-level bounding boxes (red)
   - Line-level groupings (green)
   - Block-level sections (blue)
   - Text overlay showing detected content

4. **Output**: Formatted text copied to clipboard
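
The layout-preservation step in particular is easier to picture with code. The sketch below is one simple way to do it, not necessarily what this tool does: it takes words with relative coordinates, groups them into lines by vertical position, and maps each word's horizontal position onto a character column. The `render_layout` function, its input tuple shape, and the `columns` and `line_tolerance` defaults are all assumptions made for illustration.

```python
def render_layout(words, columns=80, line_tolerance=0.01):
    """Rebuild plain text from (text, xmin, ymin) tuples with 0-1 coordinates.

    Words whose ymin is within `line_tolerance` of the first word on a line
    share that line. Each word is placed at column round(xmin * columns), or
    one space after the previous word if that column is already occupied.
    """
    lines = []  # each entry: (ymin of the line's first word, [(xmin, text), ...])
    for text, xmin, ymin in sorted(words, key=lambda w: (w[2], w[1])):
        if lines and abs(ymin - lines[-1][0]) < line_tolerance:
            lines[-1][1].append((xmin, text))
        else:
            lines.append((ymin, [(xmin, text)]))

    rendered = []
    for _, line_words in lines:
        row = ""
        for xmin, text in sorted(line_words):
            target = round(xmin * columns)
            if row and target <= len(row):
                target = len(row) + 1  # keep at least one space between words
            row = row.ljust(target) + text
        rendered.append(row)
    return "\n".join(rendered)
```

With `columns=32`, a word at `xmin=0.125` starts at column 4, so an indented line of code keeps its four-space indent.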

### Command Line Options

```bash
ocr-screenshot [OPTIONS]

Options:
  --lang TEXT                     Language code for OCR (default: eng)
  --save-image                    Save the screenshot image
  --output-dir PATH               Directory to save images (default: ~/Desktop)
  --verbose                       Show detailed output
  --annotate                      Create annotated image with detection boxes
  --show-words                    Show word-level boxes (default: True)
  --show-lines                    Show line-level boxes
  --show-blocks                   Show block-level boxes
  --show-text                     Overlay detected text on image
  --screenshot-method TEXT        Method: auto, mss, pyautogui, pillow, pyscreenshot, macos, interactive
  --monitor-number INTEGER        Monitor to capture (MSS method only, 0=all)
  --help                          Show this message and exit
```

### Examples

**Quick OCR with fastest method:**
```bash
ocr-screenshot --screenshot-method mss
```

**Debug OCR accuracy with annotations:**
```bash
ocr-screenshot --annotate --show-words --show-text --save-image --verbose
```

**Interactive region selection:**
```bash
ocr-screenshot --screenshot-method interactive --save-image
```

**Multi-monitor setup (capture monitor 2):**
```bash
ocr-screenshot --screenshot-method mss --monitor-number 2
```

## Speech-to-Text (STT) Tool

A real-time speech-to-text tool using RealtimeSTT with wake word activation. Features the "jarvis" wake word by default and supports live transcription with various output options.

### Features

- 🎙️ **Real-time transcription** - Live speech-to-text conversion
- 🎯 **Wake word activation** - Multiple wake words including "jarvis"
- ⚡ **GPU acceleration** - CUDA support for faster processing
- 🔄 **Live display** - Real-time transcription preview
- 💾 **File output** - Save transcriptions to text files
- 🎛️ **Multiple models** - Choose from tiny to large Whisper models
- 🌍 **Multi-language** - Support for multiple languages
- 🧪 **Test mode** - Test functionality without wake words

### Installation

The STT dependencies are included in the base installation:
```bash
pip install .
```

For optimal performance with GPU acceleration:
```bash
# For CUDA 11.8
pip install torch==2.5.1+cu118 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118

# For CUDA 12.X
pip install torch==2.5.1+cu121 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
```

### Usage

#### Basic Commands

Start STT with jarvis wake word:
```bash
tooling stt listen
```

Test STT without wake words:
```bash
tooling stt test
```

Show system information:
```bash
tooling stt info
```

#### Wake Word Options

Use different wake words:
```bash
# Use alexa wake word
tooling stt listen --wake-word alexa

# Use hey google wake word
tooling stt listen --wake-word "hey google"

# Use computer wake word
tooling stt listen --wake-word computer
```

#### Model Selection

Pick a Whisper model to trade speed against accuracy:
```bash
# Fastest (tiny model)
tooling stt listen --model tiny

# Balanced (base model, default)
tooling stt listen --model base

# Best accuracy (large model)
tooling stt listen --model large-v2
```

#### Advanced Features

Save transcriptions to file:
```bash
tooling stt listen --save-to-file transcripts.txt
```

Disable real-time display for better performance:
```bash
tooling stt listen --no-realtime
```

Set custom sensitivity and language:
```bash
tooling stt listen --sensitivity 0.8 --language en --verbose
```

Force CPU usage:
```bash
tooling stt listen --device cpu
```

### Available Wake Words

The following wake words are supported:
- **jarvis** (default)
- alexa
- americano
- blueberry
- bumblebee
- computer
- grapefruits
- grasshopper
- hey google
- hey siri
- ok google
- picovoice
- porcupine
- terminator

### Wake Word Engines

Two wake word engines are supported:

- **openwakeword** (default) - Open source, free to use, good accuracy
- **pvporcupine** - Picovoice's Porcupine engine, highly optimized

Choose the engine based on your requirements:
```bash
# Use OpenWakeWord (default)
tooling stt listen --wakeword-engine openwakeword

# Use Porcupine for better performance
tooling stt listen --wakeword-engine pvporcupine
```

### Available Models

| Model | Speed | Accuracy | Parameters | Use Case |
|-------|-------|----------|------------|----------|
| **tiny** | ⚡⚡⚡ | ⭐⭐ | 39 M | Testing, low-power devices |
| **base** | ⚡⚡ | ⭐⭐⭐ | 74 M | Balanced (default) |
| **small** | ⚡ | ⭐⭐⭐⭐ | 244 M | Better accuracy |
| **medium** | 🐌 | ⭐⭐⭐⭐⭐ | 769 M | High accuracy |
| **large-v2** | 🐌🐌 | ⭐⭐⭐⭐⭐ | 1550 M | Best accuracy |

### Command Line Options

```bash
tooling stt listen [OPTIONS]

Options:
  --wake-word TEXT        Wake word to activate recording [default: jarvis]
  --model TEXT           Whisper model (tiny, base, small, medium, large-v2) [default: base]
  --language TEXT        Language code for transcription (empty for auto-detection)
  --realtime/--no-realtime    Enable real-time transcription display [default: realtime]
  --save-to-file PATH    Save transcriptions to a file
  --sensitivity FLOAT    Wake word sensitivity (0.0 to 1.0) [default: 0.6]
  --device TEXT          Device to use (auto, cuda, cpu) [default: auto]
  --wakeword-engine TEXT Wake word engine (openwakeword, pvporcupine) [default: openwakeword]
  --verbose              Show verbose output and configuration
  --help                 Show this message and exit
```

### Examples

**Basic usage with jarvis:**
```bash
tooling stt listen
```

**Fast transcription with tiny model:**
```bash
tooling stt listen --model tiny --wake-word computer
```

**High accuracy with file output:**
```bash
tooling stt listen --model large-v2 --save-to-file meeting_notes.txt --verbose
```

**Quick test without wake words:**
```bash
tooling stt test --duration 5 --model tiny
```

**Custom language and sensitivity:**
```bash
tooling stt listen --language es --sensitivity 0.8 --wake-word "hey google"
```

**Use different wake word engine:**
```bash
tooling stt listen --wakeword-engine pvporcupine --wake-word alexa
```

### How it Works

1. **Initialization**: Loads the selected Whisper model and sets up audio processing
2. **Wake Word Detection**: Listens for the specified wake word using Porcupine or OpenWakeWord
3. **Voice Activity Detection**: Uses WebRTC VAD and Silero VAD for accurate speech detection
4. **Real-time Transcription**: Processes audio chunks in real-time (optional)
5. **Final Transcription**: Generates high-quality final transcription when speech ends
6. **Output**: Displays results and optionally saves to file
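
As a rough illustration of how these steps map onto RealtimeSTT, the sketch below builds keyword arguments for its `AudioToTextRecorder` from CLI-style options. The mapping itself is an assumption about how this tool wires things up; `recorder_kwargs` and `listen_once` are hypothetical helpers. RealtimeSTT expects `device` to be `cuda` or `cpu`, so an `auto` value would have to be resolved before this point.

```python
def recorder_kwargs(wake_word="jarvis", model="base", language="",
                    sensitivity=0.6, realtime=True,
                    engine="openwakeword", device="cpu"):
    """Translate CLI-style options into AudioToTextRecorder keyword arguments."""
    return {
        "model": model,
        "language": language,  # empty string lets Whisper auto-detect
        "device": device,
        "wake_words": wake_word,
        "wake_words_sensitivity": sensitivity,
        # RealtimeSTT documents the OpenWakeWord backend under the name "oww".
        "wakeword_backend": "oww" if engine == "openwakeword" else engine,
        "enable_realtime_transcription": realtime,
    }


def listen_once(**options):
    """Wait for the wake word, record one utterance, and return its text."""
    from RealtimeSTT import AudioToTextRecorder  # imported lazily: heavy dependency

    recorder = AudioToTextRecorder(**recorder_kwargs(**options))
    try:
        return recorder.text()  # blocks until speech has ended
    finally:
        recorder.shutdown()
```

RealtimeSTT uses multiprocessing, so a script calling `listen_once()` should do so under an `if __name__ == "__main__":` guard.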

### Performance Tips

- **GPU**: Use CUDA for 3-5x faster transcription
- **Model**: Use `tiny` or `base` for real-time applications
- **Sensitivity**: Adjust wake word sensitivity based on environment noise
- **Device**: Set `--device cpu` if experiencing GPU memory issues
- **Real-time**: Pass `--no-realtime` to disable the live display for better final transcription performance

### Troubleshooting

**No microphone detected:**
```bash
# Check audio devices
tooling stt info
```

**CUDA not available:**
```bash
# Install CUDA-enabled PyTorch
pip install torch==2.5.1+cu121 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
```
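
Before reinstalling, it can help to confirm what PyTorch actually sees. The following sketch checks for a usable GPU and shows one plausible way an `auto` device setting could be resolved. The `cuda_available` and `resolve_device` helpers are hypothetical, not part of this tool.

```python
def cuda_available() -> bool:
    """Return True only if PyTorch is importable and reports a usable GPU."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def resolve_device(requested: str = "auto") -> str:
    """Map 'auto' to 'cuda' or 'cpu'; pass explicit choices through unchanged."""
    if requested == "auto":
        return "cuda" if cuda_available() else "cpu"
    return requested
```

If `cuda_available()` returns `False` on a machine with an NVIDIA GPU, the installed PyTorch build is most likely CPU-only.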

**Wake word not detected:**
```bash
# Increase sensitivity
tooling stt listen --sensitivity 0.8 --verbose
```

**Poor transcription quality:**
```bash
# Use larger model
tooling stt listen --model large-v2
```

## Development Guide

### How to Add New Packages

To add a new production dependency (e.g., `requests`):
```bash
uv add requests
```

To add a new development dependency (e.g., `ipdb`):
```bash
uv add --dev ipdb
```

After adding dependencies, always regenerate `requirements.txt`:
```bash
uv pip compile pyproject.toml -o requirements.txt
```

### How to Build Packages

To build the project's distributable packages (`.whl` and `.tar.gz`):
```bash
python -m build
```

Or using the virtual environment directly:
```bash
./venv/bin/python -m build
```

### Offline Build

To build offline packages for deployment:
```bash
./dev_scripts/build_offline.sh
```

This creates an `offline_packages/` directory containing all dependencies and an `install.sh` script.