# Tooling A collection of useful command-line tools. ## OCR Screenshot Tool A cross-platform CLI tool that takes screenshots, performs OCR using DocTR (state-of-the-art deep learning OCR), and copies the result to clipboard. Features intelligent text formatting preservation and optional image annotation. ### Features - ๐ŸŒ **Cross-platform** - Works on Windows, macOS, and Linux - โšก **Multiple screenshot methods** - Choose the fastest for your system - ๐Ÿ” **Advanced OCR** - Uses DocTR with PARSeq recognition model - ๐Ÿ“ **Smart formatting** - Preserves text layout and indentation - ๐ŸŽจ **Image annotation** - Visualize detected text regions - ๐Ÿ“‹ **Clipboard integration** - Automatic text copying ### Installation #### Basic installation: ```bash pip install . ``` #### With cross-platform screenshot support: ```bash # For fastest screenshots (recommended) pip install ".[screenshot-fast]" # For full automation features (region selection) pip install ".[screenshot-full]" # For maximum compatibility (all backends) pip install ".[screenshot-all]" ``` #### Install specific screenshot libraries: ```bash pip install mss # Fastest (~30x faster than others) pip install pyautogui # Interactive region selection pip install pyscreenshot # Multiple backends ``` ### Usage #### Basic Commands Take a screenshot and perform OCR: ```bash ocr-screenshot ``` With verbose output and annotation: ```bash ocr-screenshot --verbose --annotate --save-image ``` #### Screenshot Methods Choose your preferred screenshot method: ```bash # Auto-detect best method (default) ocr-screenshot --screenshot-method auto # Use MSS (fastest) ocr-screenshot --screenshot-method mss # Use PyAutoGUI (supports region selection) ocr-screenshot --screenshot-method pyautogui # Use Pillow ImageGrab (built-in) ocr-screenshot --screenshot-method pillow # Interactive region selection ocr-screenshot --screenshot-method interactive # macOS native (region selection with drag) ocr-screenshot --screenshot-method macos ``` #### Advanced Features Save screenshot with annotation showing detected text: ```bash ocr-screenshot --save-image --annotate --show-words --show-text ``` Capture specific monitor (MSS method): ```bash ocr-screenshot --screenshot-method mss --monitor-number 2 ``` Full annotation with all detection levels: ```bash ocr-screenshot --annotate --show-words --show-lines --show-blocks --show-text --save-image ``` ### Screenshot Method Comparison | Method | Speed | Region Selection | Cross-Platform | Notes | |--------|-------|------------------|----------------|-------| | **mss** | โšกโšกโšก Fastest | โŒ (crop after) | โœ… | ~30x faster, recommended | | **pyautogui** | โšก Slow | โœ… Interactive | โœ… | Best for region selection | | **pillow** | โšก Slow | โœ… Coordinates | โœ… | Built into Pillow | | **pyscreenshot** | โšก Variable | โœ… Coordinates | โœ… | Multiple backends | | **macos** | โšกโšก Fast | โœ… Native UI | ๐ŸŽ macOS only | Native drag selection | ### How it works 1. **Screenshot**: Multiple cross-platform methods available - **Auto**: Tries best method for your platform - **MSS**: Fastest full-screen capture - **Interactive**: Guided region selection - **macOS**: Native drag-to-select interface 2. **OCR**: Advanced DocTR processing - Uses state-of-the-art PARSeq recognition model - Preserves text layout and indentation - Handles multiple languages 3. **Annotation** (optional): Visual feedback - Word-level bounding boxes (red) - Line-level groupings (green) - Block-level sections (blue) - Text overlay showing detected content 4. **Output**: Formatted text copied to clipboard ### Command Line Options ```bash ocr-screenshot [OPTIONS] Options: --lang TEXT Language code for OCR (default: eng) --save-image Save the screenshot image --output-dir PATH Directory to save images (default: ~/Desktop) --verbose Show detailed output --annotate Create annotated image with detection boxes --show-words Show word-level boxes (default: True) --show-lines Show line-level boxes --show-blocks Show block-level boxes --show-text Overlay detected text on image --screenshot-method TEXT Method: auto, mss, pyautogui, pillow, pyscreenshot, macos, interactive --monitor-number INTEGER Monitor to capture (MSS method only, 0=all) --help Show this message and exit ``` ### Examples **Quick OCR with fastest method:** ```bash ocr-screenshot --screenshot-method mss ``` **Debug OCR accuracy with annotations:** ```bash ocr-screenshot --annotate --show-words --show-text --save-image --verbose ``` **Interactive region selection:** ```bash ocr-screenshot --screenshot-method interactive --save-image ``` **Multi-monitor setup (capture monitor 2):** ```bash ocr-screenshot --screenshot-method mss --monitor-number 2 ``` ## Speech-to-Text (STT) Tool A real-time speech-to-text tool using RealtimeSTT with wake word activation. Features the "jarvis" wake word by default and supports live transcription with various output options. ### Features - ๐ŸŽ™๏ธ **Real-time transcription** - Live speech-to-text conversion - ๐ŸŽฏ **Wake word activation** - Multiple wake words including "jarvis" - โšก **GPU acceleration** - CUDA support for faster processing - ๐Ÿ”„ **Live display** - Real-time transcription preview - ๐Ÿ’พ **File output** - Save transcriptions to text files - ๐ŸŽ›๏ธ **Multiple models** - Choose from tiny to large Whisper models - ๐ŸŒ **Multi-language** - Support for multiple languages - ๐Ÿงช **Test mode** - Test functionality without wake words ### Installation The STT dependencies are included in the base installation: ```bash pip install . ``` For optimal performance with GPU acceleration: ```bash # For CUDA 11.8 pip install torch==2.5.1+cu118 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118 # For CUDA 12.X pip install torch==2.5.1+cu121 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121 ``` ### Usage #### Basic Commands Start STT with jarvis wake word: ```bash tooling stt listen ``` Test STT without wake words: ```bash tooling stt test ``` Show system information: ```bash tooling stt info ``` #### Wake Word Options Use different wake words: ```bash # Use alexa wake word tooling stt listen --wake-word alexa # Use hey google wake word tooling stt listen --wake-word "hey google" # Use computer wake word tooling stt listen --wake-word computer ``` #### Model Selection Choose different Whisper models for speed vs accuracy: ```bash # Fastest (tiny model) tooling stt listen --model tiny # Balanced (base model, default) tooling stt listen --model base # Best accuracy (large model) tooling stt listen --model large-v2 ``` #### Advanced Features Save transcriptions to file: ```bash tooling stt listen --save-to-file transcripts.txt ``` Disable real-time display for better performance: ```bash tooling stt listen --no-realtime ``` Set custom sensitivity and language: ```bash tooling stt listen --sensitivity 0.8 --language en --verbose ``` Force CPU usage: ```bash tooling stt listen --device cpu ``` ### Available Wake Words The following wake words are supported: - **jarvis** (default) - alexa - americano - blueberry - bumblebee - computer - grapefruits - grasshopper - hey google - hey siri - ok google - picovoice - porcupine - terminator ### Wake Word Engines Two wake word engines are supported: - **openwakeword** (default) - Open source, free to use, good accuracy - **pvporcupine** - Picovoice's Porcupine engine, highly optimized Choose the engine based on your requirements: ```bash # Use OpenWakeWord (default) tooling stt listen --wakeword-engine openwakeword # Use Porcupine for better performance tooling stt listen --wakeword-engine pvporcupine ``` ### Available Models | Model | Speed | Accuracy | Memory | Use Case | |-------|-------|----------|--------|----------| | **tiny** | โšกโšกโšก | โญโญ | 39MB | Testing, low-power devices | | **base** | โšกโšก | โญโญโญ | 74MB | Balanced (default) | | **small** | โšก | โญโญโญโญ | 244MB | Better accuracy | | **medium** | ๐ŸŒ | โญโญโญโญโญ | 769MB | High accuracy | | **large-v2** | ๐ŸŒ๐ŸŒ | โญโญโญโญโญ | 1550MB | Best accuracy | ### Command Line Options ```bash tooling stt listen [OPTIONS] Options: --wake-word TEXT Wake word to activate recording [default: jarvis] --model TEXT Whisper model (tiny, base, small, medium, large-v2) [default: base] --language TEXT Language code for transcription (empty for auto-detection) --realtime/--no-realtime Enable real-time transcription display [default: realtime] --save-to-file PATH Save transcriptions to a file --sensitivity FLOAT Wake word sensitivity (0.0 to 1.0) [default: 0.6] --device TEXT Device to use (auto, cuda, cpu) [default: auto] --wakeword-engine TEXT Wake word engine (openwakeword, pvporcupine) [default: openwakeword] --verbose Show verbose output and configuration --help Show this message and exit ``` ### Examples **Basic usage with jarvis:** ```bash tooling stt listen ``` **Fast transcription with tiny model:** ```bash tooling stt listen --model tiny --wake-word computer ``` **High accuracy with file output:** ```bash tooling stt listen --model large-v2 --save-to-file meeting_notes.txt --verbose ``` **Quick test without wake words:** ```bash tooling stt test --duration 5 --model tiny ``` **Custom language and sensitivity:** ```bash tooling stt listen --language es --sensitivity 0.8 --wake-word "hey google" ``` **Use different wake word engine:** ```bash tooling stt listen --wakeword-engine pvporcupine --wake-word alexa ``` ### How it Works 1. **Initialization**: Loads the selected Whisper model and sets up audio processing 2. **Wake Word Detection**: Listens for the specified wake word using Porcupine or OpenWakeWord 3. **Voice Activity Detection**: Uses WebRTC VAD and Silero VAD for accurate speech detection 4. **Real-time Transcription**: Processes audio chunks in real-time (optional) 5. **Final Transcription**: Generates high-quality final transcription when speech ends 6. **Output**: Displays results and optionally saves to file ### Performance Tips - **GPU**: Use CUDA for 3-5x faster transcription - **Model**: Use `tiny` or `base` for real-time applications - **Sensitivity**: Adjust wake word sensitivity based on environment noise - **Device**: Set `--device cpu` if experiencing GPU memory issues - **Real-time**: Disable `--no-realtime` for better final transcription performance ### Troubleshooting **No microphone detected:** ```bash # Check audio devices tooling stt info ``` **CUDA not available:** ```bash # Install CUDA-enabled PyTorch pip install torch==2.5.1+cu121 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121 ``` **Wake word not detected:** ```bash # Increase sensitivity tooling stt listen --sensitivity 0.8 --verbose ``` **Poor transcription quality:** ```bash # Use larger model tooling stt listen --model large-v2 ``` ## Development Guide ### How to Add New Packages To add a new production dependency (e.g., 'requests'): ```bash uv add requests ``` To add a new development dependency (e.g., 'ipdb'): ```bash uv add --dev ipdb ``` After adding dependencies, always re-generate requirements.txt: ```bash uv pip compile pyproject.toml -o requirements.txt ``` ### How to Build Packages To build your project's distributable packages (.whl, .tar.gz): ```bash python -m build ``` Or using the virtual environment directly: ```bash ./venv/bin/python -m build ``` ### Offline Build To build offline packages for deployment: ```bash ./dev_scripts/build_offline.sh ``` This will create offline_packages/ with all dependencies and install.sh