This commit is contained in:
dingfeng.wong
2025-07-22 22:02:48 +08:00
parent 0726aa60ed
commit dcb3f9d368
5 changed files with 775 additions and 9 deletions
@@ -170,6 +170,221 @@ ocr-screenshot --screenshot-method interactive --save-image
ocr-screenshot --screenshot-method mss --monitor-number 2
```
## Speech-to-Text (STT) Tool
A real-time speech-to-text tool built on RealtimeSTT with wake word activation. The default wake word is "jarvis", and live transcriptions can be shown on screen or saved to a file.
### Features
- 🎙️ **Real-time transcription** - Live speech-to-text conversion
- 🎯 **Wake word activation** - Multiple wake words including "jarvis"
- ⚡ **GPU acceleration** - CUDA support for faster processing
- 🔄 **Live display** - Real-time transcription preview
- 💾 **File output** - Save transcriptions to text files
- 🎛️ **Multiple models** - Choose from tiny to large Whisper models
- 🌍 **Multi-language** - Support for multiple languages
- 🧪 **Test mode** - Test functionality without wake words
### Installation
The STT dependencies are included in the base installation:
```bash
pip install .
```
For GPU acceleration, install the CUDA-enabled PyTorch build that matches your CUDA version:
```bash
# For CUDA 11.8
pip install torch==2.5.1+cu118 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118
# For CUDA 12.x (cu121 wheels, built against CUDA 12.1)
pip install torch==2.5.1+cu121 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
```
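To confirm that the CUDA build was picked up, a quick check from Python can help. The sketch below is a hypothetical helper (`resolve_device` is not part of the tool) that mirrors what a `--device auto` setting would plausibly do: prefer CUDA when PyTorch reports it, otherwise fall back to CPU. Only `torch.cuda.is_available()` is a real PyTorch call; everything else is an assumption made for illustration.

```python
def resolve_device(requested: str = "auto") -> str:
    """Return "cuda" or "cpu" for a requested device of auto, cuda, or cpu."""
    if requested not in ("auto", "cuda", "cpu"):
        raise ValueError(f"unknown device: {requested!r}")
    if requested == "cpu":
        return "cpu"
    try:
        import torch  # optional dependency: only needed to probe for a GPU
        has_cuda = torch.cuda.is_available()
    except ImportError:
        # Without PyTorch there is no way to use CUDA, so treat it as absent.
        has_cuda = False
    if requested == "cuda" and not has_cuda:
        raise RuntimeError("CUDA was requested but is not available")
    return "cuda" if has_cuda else "cpu"
```

If `resolve_device("auto")` returns `"cpu"` on a machine with an NVIDIA GPU, the installed PyTorch wheel is probably a CPU-only build.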
### Usage
#### Basic Commands
Start STT with jarvis wake word:
```bash
tooling stt listen
```
Test STT without wake words:
```bash
tooling stt test
```
Show system information:
```bash
tooling stt info
```
#### Wake Word Options
Use different wake words:
```bash
# Use alexa wake word
tooling stt listen --wake-word alexa
# Use hey google wake word
tooling stt listen --wake-word "hey google"
# Use computer wake word
tooling stt listen --wake-word computer
```
#### Model Selection
Choose a Whisper model to trade speed against accuracy:
```bash
# Fastest (tiny model)
tooling stt listen --model tiny
# Balanced (base model, default)
tooling stt listen --model base
# Best accuracy (large model)
tooling stt listen --model large-v2
```
#### Advanced Features
Save transcriptions to file:
```bash
tooling stt listen --save-to-file transcripts.txt
```
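The on-disk format of the saved transcripts is not specified here. As a rough illustration only, the following sketch appends each final transcription as one timestamped line; the helper name `append_transcript` and the line format are assumptions, not the tool's actual behavior.

```python
from datetime import datetime
from pathlib import Path


def append_transcript(path, text, now=None):
    """Append one transcription as a timestamped line and return that line."""
    # `now` can be injected to make the output deterministic in tests.
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    line = f"[{stamp}] {text.strip()}\n"
    # Append mode keeps earlier transcriptions from previous sessions.
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(line)
    return line
```

Opening the file in append mode means repeated sessions accumulate in the same file rather than overwriting it.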
Disable real-time display for better performance:
```bash
tooling stt listen --no-realtime
```
Set custom sensitivity and language:
```bash
tooling stt listen --sensitivity 0.8 --language en --verbose
```
Force CPU usage:
```bash
tooling stt listen --device cpu
```
### Available Wake Words
The following wake words are supported:
- **jarvis** (default)
- alexa
- americano
- blueberry
- bumblebee
- computer
- grapefruits
- grasshopper
- hey google
- hey siri
- ok google
- picovoice
- porcupine
- terminator
### Available Models
| Model | Speed | Accuracy | Parameters | Use Case |
|-------|-------|----------|------------|----------|
| **tiny** | ⚡⚡⚡ | ⭐⭐ | 39M | Testing, low-power devices |
| **base** | ⚡⚡ | ⭐⭐⭐ | 74M | Balanced (default) |
| **small** | ⚡ | ⭐⭐⭐⭐ | 244M | Better accuracy |
| **medium** | 🐌 | ⭐⭐⭐⭐⭐ | 769M | High accuracy |
| **large-v2** | 🐌🐌 | ⭐⭐⭐⭐⭐ | 1550M | Best accuracy |
### Command Line Options
```bash
tooling stt listen [OPTIONS]
Options:
--wake-word TEXT Wake word to activate recording [default: jarvis]
--model TEXT Whisper model (tiny, base, small, medium, large-v2) [default: base]
--language TEXT Language code for transcription (empty for auto-detection)
--realtime/--no-realtime Enable real-time transcription display [default: realtime]
--save-to-file PATH Save transcriptions to a file
--sensitivity FLOAT Wake word sensitivity (0.0 to 1.0) [default: 0.6]
--device TEXT Device to use (auto, cuda, cpu) [default: auto]
--verbose Show verbose output and configuration
--help Show this message and exit
```
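The constraints on `--wake-word` and `--sensitivity` can be checked before any audio device is opened. The sketch below is a hypothetical validator (the name `validate_options` and the case-insensitive matching are assumptions); the wake word list and the 0.0 to 1.0 range are taken from this README.

```python
SUPPORTED_WAKE_WORDS = {
    "jarvis", "alexa", "americano", "blueberry", "bumblebee", "computer",
    "grapefruits", "grasshopper", "hey google", "hey siri", "ok google",
    "picovoice", "porcupine", "terminator",
}


def validate_options(wake_word="jarvis", sensitivity=0.6):
    """Return a normalized (wake_word, sensitivity) pair or raise ValueError."""
    word = wake_word.strip().lower()  # assumption: matching ignores case
    if word not in SUPPORTED_WAKE_WORDS:
        raise ValueError(f"unsupported wake word: {wake_word!r}")
    if not 0.0 <= sensitivity <= 1.0:
        raise ValueError(f"sensitivity must be within 0.0 to 1.0, got {sensitivity}")
    return word, float(sensitivity)
```

Failing early with a clear message is usually friendlier than letting the wake word engine reject an unknown keyword after the model has already loaded.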
### Examples
**Basic usage with jarvis:**
```bash
tooling stt listen
```
**Fast transcription with tiny model:**
```bash
tooling stt listen --model tiny --wake-word computer
```
**High accuracy with file output:**
```bash
tooling stt listen --model large-v2 --save-to-file meeting_notes.txt --verbose
```
**Quick test without wake words:**
```bash
tooling stt test --duration 5 --model tiny
```
**Custom language and sensitivity:**
```bash
tooling stt listen --language es --sensitivity 0.8 --wake-word "hey google"
```
### How it Works
1. **Initialization**: Loads the selected Whisper model and sets up audio processing
2. **Wake Word Detection**: Listens for the specified wake word using Porcupine or OpenWakeWord
3. **Voice Activity Detection**: Uses WebRTC VAD and Silero VAD for accurate speech detection
4. **Real-time Transcription**: Processes audio chunks in real-time (optional)
5. **Final Transcription**: Generates high-quality final transcription when speech ends
6. **Output**: Displays results and optionally saves to file
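The steps above map fairly directly onto RealtimeSTT's `AudioToTextRecorder`. The following is a minimal sketch of how the CLI options might be translated into recorder arguments; it is not the tool's actual implementation. The function names `build_recorder_kwargs` and `listen_once` are hypothetical, while the keyword names are parameters of `AudioToTextRecorder`.

```python
def build_recorder_kwargs(wake_word="jarvis", model="base", language="",
                          realtime=True, sensitivity=0.6, device="cpu",
                          on_update=None):
    """Translate CLI-style options into AudioToTextRecorder keyword arguments."""
    kwargs = {
        "model": model,
        "language": language,  # an empty string leaves language detection on
        "wake_words": wake_word,
        "wake_words_sensitivity": sensitivity,
        "enable_realtime_transcription": realtime,
        "device": device,
    }
    # The live-preview callback only makes sense when real-time mode is on.
    if realtime and on_update is not None:
        kwargs["on_realtime_transcription_update"] = on_update
    return kwargs


def listen_once(**options):
    """Wait for the wake word, record one utterance, and return its text."""
    # Imported lazily so the option mapping can be used without audio hardware.
    from RealtimeSTT import AudioToTextRecorder

    recorder = AudioToTextRecorder(**build_recorder_kwargs(**options))
    try:
        return recorder.text()  # blocks until the utterance has been transcribed
    finally:
        recorder.shutdown()
```

Because RealtimeSTT relies on multiprocessing, a call such as `listen_once(wake_word="jarvis", model="tiny")` is best placed under an `if __name__ == "__main__":` guard.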
### Performance Tips
- **GPU**: Use CUDA for roughly 3-5x faster transcription, depending on hardware and model size
- **Model**: Use `tiny` or `base` for real-time applications
- **Sensitivity**: Adjust wake word sensitivity based on environment noise
- **Device**: Set `--device cpu` if experiencing GPU memory issues
- **Real-time**: Pass `--no-realtime` to disable the live preview and free resources for the final transcription
### Troubleshooting
**No microphone detected:**
```bash
# Check audio devices
tooling stt info
```
**CUDA not available:**
```bash
# Install CUDA-enabled PyTorch
pip install torch==2.5.1+cu121 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
```
**Wake word not detected:**
```bash
# Increase sensitivity
tooling stt listen --sensitivity 0.8 --verbose
```
**Poor transcription quality:**
```bash
# Use larger model
tooling stt listen --model large-v2
```
## Development Guide
### How to Add New Packages