stt

2025-07-22 22:02:48 +08:00
parent 0726aa60ed
commit dcb3f9d368
5 changed files with 775 additions and 9 deletions
@@ -170,6 +170,221 @@ ocr-screenshot --screenshot-method interactive --save-image
 ocr-screenshot --screenshot-method mss --monitor-number 2
 ```

+## Speech-to-Text (STT) Tool
+
+A real-time speech-to-text tool built on RealtimeSTT with wake word activation. It listens for the "jarvis" wake word by default and supports live transcription with an optional on-screen preview and file output.
+
+### Features
+
+- 🎙️ **Real-time transcription** - Live speech-to-text conversion
+- 🎯 **Wake word activation** - Multiple wake words including "jarvis"
+- ⚡ **GPU acceleration** - CUDA support for faster processing
+- 🔄 **Live display** - Real-time transcription preview
+- 💾 **File output** - Save transcriptions to text files
+- 🎛️ **Multiple models** - Choose from tiny to large Whisper models
+- 🌍 **Multi-language** - Support for multiple languages
+- 🧪 **Test mode** - Test functionality without wake words
+
+### Installation
+
+The STT dependencies are included in the base installation:
+```bash
+pip install .
+```
+
+For optimal performance with GPU acceleration:
+```bash
+# For CUDA 11.8
+pip install torch==2.5.1+cu118 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118
+
+# For CUDA 12.X
+pip install torch==2.5.1+cu121 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
+```
+
+### Usage
+
+#### Basic Commands
+
+Start STT with jarvis wake word:
+```bash
+tooling stt listen
+```
+
+Test STT without wake words:
+```bash
+tooling stt test
+```
+
+Show system information:
+```bash
+tooling stt info
+```
+
+#### Wake Word Options
+
+Use different wake words:
+```bash
+# Use alexa wake word
+tooling stt listen --wake-word alexa
+
+# Use hey google wake word
+tooling stt listen --wake-word "hey google"
+
+# Use computer wake word
+tooling stt listen --wake-word computer
+```
+
+#### Model Selection
+
+Choose a Whisper model to trade speed against accuracy:
+```bash
+# Fastest (tiny model)
+tooling stt listen --model tiny
+
+# Balanced (base model, default)
+tooling stt listen --model base
+
+# Best accuracy (large model)
+tooling stt listen --model large-v2
+```
+
+#### Advanced Features
+
+Save transcriptions to file:
+```bash
+tooling stt listen --save-to-file transcripts.txt
+```
+
+Disable real-time display for better performance:
+```bash
+tooling stt listen --no-realtime
+```
+
+Set custom sensitivity and language:
+```bash
+tooling stt listen --sensitivity 0.8 --language en --verbose
+```
+
+Force CPU usage:
+```bash
+tooling stt listen --device cpu
+```
+
+### Available Wake Words
+
+The following wake words are supported:
+- **jarvis** (default)
+- alexa
+- americano
+- blueberry
+- bumblebee
+- computer
+- grapefruits
+- grasshopper
+- hey google
+- hey siri
+- ok google
+- picovoice
+- porcupine
+- terminator
+
+### Available Models
+
+| Model | Speed | Accuracy | Parameters | Use Case |
+|-------|-------|----------|------------|----------|
+| **tiny** | ⚡⚡⚡ | ⭐⭐ | 39M | Testing, low-power devices |
+| **base** | ⚡⚡ | ⭐⭐⭐ | 74M | Balanced (default) |
+| **small** | ⚡ | ⭐⭐⭐⭐ | 244M | Better accuracy |
+| **medium** | 🐌 | ⭐⭐⭐⭐⭐ | 769M | High accuracy |
+| **large-v2** | 🐌🐌 | ⭐⭐⭐⭐⭐ | 1550M | Best accuracy |
+
+### Command Line Options
+
+```bash
+tooling stt listen [OPTIONS]
+
+Options:
+  --wake-word TEXT        Wake word to activate recording [default: jarvis]
+  --model TEXT           Whisper model (tiny, base, small, medium, large-v2) [default: base]
+  --language TEXT        Language code for transcription (empty for auto-detection)
+  --realtime/--no-realtime    Enable real-time transcription display [default: realtime]
+  --save-to-file PATH    Save transcriptions to a file
+  --sensitivity FLOAT    Wake word sensitivity (0.0 to 1.0) [default: 0.6]
+  --device TEXT          Device to use (auto, cuda, cpu) [default: auto]
+  --verbose              Show verbose output and configuration
+  --help                 Show this message and exit
+```
+
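The option constraints listed above can be checked before any audio work starts. The following is a minimal sketch, not the tool's actual implementation: `validate_options`, `SUPPORTED_WAKE_WORDS`, and `SUPPORTED_MODELS` are hypothetical names, and the allowed values are copied from the wake word list, the model table, and the documented sensitivity range.

```python
from typing import List

# Values copied from the "Available Wake Words" and "Available Models" sections.
SUPPORTED_WAKE_WORDS = {
    "jarvis", "alexa", "americano", "blueberry", "bumblebee", "computer",
    "grapefruits", "grasshopper", "hey google", "hey siri", "ok google",
    "picovoice", "porcupine", "terminator",
}
SUPPORTED_MODELS = ("tiny", "base", "small", "medium", "large-v2")


def validate_options(
    wake_word: str = "jarvis",
    model: str = "base",
    sensitivity: float = 0.6,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors: List[str] = []
    # Compare case-insensitively so "Hey Google" matches "hey google".
    if wake_word.strip().lower() not in SUPPORTED_WAKE_WORDS:
        errors.append(f"unsupported wake word: {wake_word!r}")
    if model not in SUPPORTED_MODELS:
        errors.append(f"unsupported model: {model!r}")
    # The documented range is 0.0 to 1.0 inclusive.
    if not 0.0 <= sensitivity <= 1.0:
        errors.append(f"sensitivity out of range: {sensitivity}")
    return errors
```

Collecting all problems in a list, rather than raising on the first one, lets a CLI report every invalid option in a single run.
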
+### Examples
+
+**Basic usage with jarvis:**
+```bash
+tooling stt listen
+```
+
+**Fast transcription with tiny model:**
+```bash
+tooling stt listen --model tiny --wake-word computer
+```
+
+**High accuracy with file output:**
+```bash
+tooling stt listen --model large-v2 --save-to-file meeting_notes.txt --verbose
+```
+
+**Quick test without wake words:**
+```bash
+tooling stt test --duration 5 --model tiny
+```
+
+**Custom language and sensitivity:**
+```bash
+tooling stt listen --language es --sensitivity 0.8 --wake-word "hey google"
+```
+
+### How it Works
+
+1. **Initialization**: Loads the selected Whisper model and sets up audio processing
+2. **Wake Word Detection**: Listens for the specified wake word using Porcupine or OpenWakeWord
+3. **Voice Activity Detection**: Uses WebRTC VAD and Silero VAD for accurate speech detection
+4. **Real-time Transcription**: Processes audio chunks in real-time (optional)
+5. **Final Transcription**: Generates high-quality final transcription when speech ends
+6. **Output**: Displays results and optionally saves to file
+
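The steps above can be sketched directly against RealtimeSTT. This is a hedged illustration rather than the code in this commit: `build_recorder_kwargs` and `listen_once` are hypothetical helpers, and the mapping from CLI options to `AudioToTextRecorder` keyword arguments is an assumption about how the tool is wired.

```python
from typing import Any, Callable, Dict, Optional


def build_recorder_kwargs(
    wake_word: str = "jarvis",
    model: str = "base",
    language: str = "",
    sensitivity: float = 0.6,
    device: str = "cpu",
    on_partial: Optional[Callable[[str], None]] = None,
) -> Dict[str, Any]:
    """Map the documented CLI options onto AudioToTextRecorder arguments."""
    kwargs: Dict[str, Any] = {
        "model": model,                         # step 1: Whisper model to load
        "language": language,                   # empty string means auto-detect
        "device": device,                       # "cuda" or "cpu"
        "wake_words": wake_word,                # step 2: wake word detection
        "wake_words_sensitivity": sensitivity,
    }
    if on_partial is not None:
        # Step 4 is optional: enable live updates only when a callback is given.
        kwargs["enable_realtime_transcription"] = True
        kwargs["on_realtime_transcription_update"] = on_partial
    return kwargs


def listen_once(**options: Any) -> str:
    """Wait for the wake word, record one utterance, return the final text."""
    # Imported lazily so build_recorder_kwargs works without RealtimeSTT.
    from RealtimeSTT import AudioToTextRecorder

    recorder = AudioToTextRecorder(**build_recorder_kwargs(**options))
    try:
        # Blocks until voice activity detection decides speech has ended
        # (step 3), then returns the final transcription (step 5).
        return recorder.text()
    finally:
        recorder.shutdown()
```

With a microphone attached and RealtimeSTT installed, `listen_once(wake_word="computer", model="tiny")` would block until the wake word and one utterance have been heard, then return the transcribed text.
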
+### Performance Tips
+
+- **GPU**: Use CUDA for 3-5x faster transcription
+- **Model**: Use `tiny` or `base` for real-time applications
+- **Sensitivity**: Adjust wake word sensitivity to match the ambient noise level
+- **Device**: Set `--device cpu` if experiencing GPU memory issues
+- **Real-time**: Pass `--no-realtime` to disable the live preview for better final transcription performance
+
+### Troubleshooting
+
+**No microphone detected:**
+```bash
+# Check audio devices
+tooling stt info
+```
+
+**CUDA not available:**
+```bash
+# Install CUDA-enabled PyTorch
+pip install torch==2.5.1+cu121 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
+```
+
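To confirm whether PyTorch can see a GPU before or after reinstalling, a short check like the one below may help. It is a sketch only: `resolve_device` is a hypothetical helper, and the way the tool itself resolves `--device auto` is not shown in this commit.

```python
def resolve_device(requested: str = "auto") -> str:
    """Return "cuda" or "cpu" for a --device value of auto, cuda, or cpu."""
    if requested != "auto":
        # An explicit choice is passed through unchanged.
        return requested
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no CUDA backend to use.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(resolve_device())
```

If this prints `cpu` on a machine with an NVIDIA GPU, the installed PyTorch build is likely a CPU-only wheel or does not match the installed driver.
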
+**Wake word not detected:**
+```bash
+# Increase sensitivity
+tooling stt listen --sensitivity 0.8 --verbose
+```
+
+**Poor transcription quality:**
+```bash
+# Use larger model
+tooling stt listen --model large-v2
+```
+
 ## Development Guide

 ### How to Add New Packages