stt
@@ -170,6 +170,221 @@ ocr-screenshot --screenshot-method interactive --save-image
ocr-screenshot --screenshot-method mss --monitor-number 2
```

## Speech-to-Text (STT) Tool

A real-time speech-to-text tool built on RealtimeSTT with wake word activation. It listens for the "jarvis" wake word by default and supports live transcription with several output options.

### Features

- 🎙️ **Real-time transcription** - Live speech-to-text conversion
- 🎯 **Wake word activation** - Multiple wake words including "jarvis"
- ⚡ **GPU acceleration** - CUDA support for faster processing
- 🔄 **Live display** - Real-time transcription preview
- 💾 **File output** - Save transcriptions to text files
- 🎛️ **Multiple models** - Choose from tiny to large Whisper models
- 🌍 **Multi-language** - Support for multiple languages
- 🧪 **Test mode** - Test functionality without wake words

### Installation

The STT dependencies are included in the base installation:
```bash
pip install .
```

For optimal performance with GPU acceleration:
```bash
# For CUDA 11.8
pip install torch==2.5.1+cu118 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118

# For CUDA 12.X
pip install torch==2.5.1+cu121 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
```

### Usage

#### Basic Commands

Start STT with jarvis wake word:
```bash
tooling stt listen
```

Test STT without wake words:
```bash
tooling stt test
```

Show system information:
```bash
tooling stt info
```

#### Wake Word Options

Use different wake words:
```bash
# Use alexa wake word
tooling stt listen --wake-word alexa

# Use hey google wake word
tooling stt listen --wake-word "hey google"

# Use computer wake word
tooling stt listen --wake-word computer
```

#### Model Selection

Choose a Whisper model to trade speed against accuracy:
```bash
# Fastest (tiny model)
tooling stt listen --model tiny

# Balanced (base model, default)
tooling stt listen --model base

# Best accuracy (large model)
tooling stt listen --model large-v2
```

#### Advanced Features

Save transcriptions to file:
```bash
tooling stt listen --save-to-file transcripts.txt
```
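
The format that `--save-to-file` writes is not described here. As a minimal sketch, assuming each final transcription is appended as one timestamped line, the logic could look like the following. The helper name `append_transcript` and the line format are hypothetical, not the tool's confirmed behavior.

```python
from datetime import datetime
from pathlib import Path


def append_transcript(path, text, now=None):
    """Append one transcription as a timestamped line and return that line.

    Hypothetical helper: the timestamp format is an assumption for illustration.
    """
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    line = f"[{stamp}] {text.strip()}"
    target = Path(path)
    # Create missing parent directories so a fresh output path works.
    target.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps earlier sessions intact instead of overwriting them.
    with target.open("a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    return line
```

Appending rather than truncating means repeated `listen` sessions pointed at the same file would accumulate into one log.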

Disable real-time display for better performance:
```bash
tooling stt listen --no-realtime
```

Set custom sensitivity and language:
```bash
tooling stt listen --sensitivity 0.8 --language en --verbose
```

Force CPU usage:
```bash
tooling stt listen --device cpu
```
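
How `--device auto` chooses between GPU and CPU is not spelled out in this section. One plausible approach, sketched below, probes PyTorch for CUDA and falls back to CPU. The function name `resolve_device` and the fallback rules are assumptions, and the tool's real logic may differ.

```python
def resolve_device(requested="auto"):
    """Map a --device value (auto, cuda, cpu) to a concrete device string.

    Hypothetical helper: assumes "auto" means "CUDA if PyTorch sees it, else CPU".
    """
    requested = requested.lower()
    if requested not in {"auto", "cuda", "cpu"}:
        raise ValueError(f"unsupported device: {requested!r}")
    if requested == "cpu":
        return "cpu"
    try:
        import torch  # optional dependency, only needed to probe for CUDA
        has_cuda = torch.cuda.is_available()
    except ImportError:
        has_cuda = False
    if requested == "cuda" and not has_cuda:
        raise RuntimeError("CUDA requested but not available; try --device cpu")
    return "cuda" if has_cuda else "cpu"
```

Failing loudly when `cuda` is requested explicitly, instead of silently falling back, makes a missing CUDA-enabled PyTorch install easier to notice.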

### Available Wake Words

The following wake words are supported:
- **jarvis** (default)
- alexa
- americano
- blueberry
- bumblebee
- computer
- grapefruits
- grasshopper
- hey google
- hey siri
- ok google
- picovoice
- porcupine
- terminator
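
To illustrate how the list above and the documented sensitivity range (0.0 to 1.0) might be enforced before listening starts, here is a small validation sketch. The names `SUPPORTED_WAKE_WORDS` and `validate_wake_word` are hypothetical, and the tool itself may validate its input differently.

```python
# Wake words copied from the list above.
SUPPORTED_WAKE_WORDS = frozenset({
    "jarvis", "alexa", "americano", "blueberry", "bumblebee", "computer",
    "grapefruits", "grasshopper", "hey google", "hey siri", "ok google",
    "picovoice", "porcupine", "terminator",
})


def validate_wake_word(wake_word="jarvis", sensitivity=0.6):
    """Return a normalized (wake_word, sensitivity) pair or raise ValueError."""
    # Lowercase and collapse repeated whitespace so "Hey  Google" matches.
    normalized = " ".join(wake_word.lower().split())
    if normalized not in SUPPORTED_WAKE_WORDS:
        choices = ", ".join(sorted(SUPPORTED_WAKE_WORDS))
        raise ValueError(f"unknown wake word {wake_word!r}; choose one of: {choices}")
    if not 0.0 <= sensitivity <= 1.0:
        raise ValueError("sensitivity must be between 0.0 and 1.0")
    return normalized, float(sensitivity)
```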

### Available Models

| Model | Speed | Accuracy | Parameters | Use Case |
|-------|-------|----------|------------|----------|
| **tiny** | ⚡⚡⚡ | ⭐⭐ | 39M | Testing, low-power devices |
| **base** | ⚡⚡ | ⭐⭐⭐ | 74M | Balanced (default) |
| **small** | ⚡ | ⭐⭐⭐⭐ | 244M | Better accuracy |
| **medium** | 🐌 | ⭐⭐⭐⭐⭐ | 769M | High accuracy |
| **large-v2** | 🐌🐌 | ⭐⭐⭐⭐⭐ | 1550M | Best accuracy |

### Command Line Options

```bash
tooling stt listen [OPTIONS]

Options:
  --wake-word TEXT          Wake word to activate recording [default: jarvis]
  --model TEXT              Whisper model (tiny, base, small, medium, large-v2) [default: base]
  --language TEXT           Language code for transcription (empty for auto-detection)
  --realtime/--no-realtime  Enable real-time transcription display [default: realtime]
  --save-to-file PATH       Save transcriptions to a file
  --sensitivity FLOAT       Wake word sensitivity (0.0 to 1.0) [default: 0.6]
  --device TEXT             Device to use (auto, cuda, cpu) [default: auto]
  --verbose                 Show verbose output and configuration
  --help                    Show this message and exit
```

### Examples

**Basic usage with jarvis:**
```bash
tooling stt listen
```

**Fast transcription with tiny model:**
```bash
tooling stt listen --model tiny --wake-word computer
```

**High accuracy with file output:**
```bash
tooling stt listen --model large-v2 --save-to-file meeting_notes.txt --verbose
```

**Quick test without wake words:**
```bash
tooling stt test --duration 5 --model tiny
```

**Custom language and sensitivity:**
```bash
tooling stt listen --language es --sensitivity 0.8 --wake-word "hey google"
```

### How it Works

1. **Initialization**: Loads the selected Whisper model and sets up audio processing
2. **Wake Word Detection**: Listens for the specified wake word using Porcupine or OpenWakeWord
3. **Voice Activity Detection**: Uses WebRTC VAD and Silero VAD for accurate speech detection
4. **Real-time Transcription**: Processes audio chunks in real-time (optional)
5. **Final Transcription**: Generates high-quality final transcription when speech ends
6. **Output**: Displays results and optionally saves to file
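
The steps above roughly correspond to options of RealtimeSTT's `AudioToTextRecorder`. The sketch below shows one way the CLI options could be translated into recorder arguments. The helper names `build_recorder_kwargs` and `listen_once` are hypothetical, and whether the tool wires things up this way is an assumption. It also assumes `device` has already been resolved to `cuda` or `cpu`.

```python
def build_recorder_kwargs(wake_word="jarvis", model="base", language="",
                          realtime=True, sensitivity=0.6, device="cpu",
                          on_update=None):
    """Translate CLI-style options into AudioToTextRecorder keyword arguments."""
    kwargs = {
        "model": model,                             # step 1: Whisper model to load
        "language": language,                       # empty string = auto-detect
        "device": device,                           # "cuda" or "cpu"
        "wake_words": wake_word,                    # step 2: wake word detection
        "wake_words_sensitivity": sensitivity,
        "enable_realtime_transcription": realtime,  # step 4: optional live preview
    }
    # Only register the live-preview callback when the preview is enabled.
    if realtime and on_update is not None:
        kwargs["on_realtime_transcription_update"] = on_update
    return kwargs


def listen_once(**options):
    """Wait for the wake word, record one utterance, and return the final text."""
    # Imported lazily so build_recorder_kwargs works without RealtimeSTT installed.
    from RealtimeSTT import AudioToTextRecorder

    with AudioToTextRecorder(**build_recorder_kwargs(**options)) as recorder:
        # text() blocks until speech ends (steps 3 and 5), then returns the result.
        return recorder.text()
```

Calling `listen_once(wake_word="computer", model="tiny")` requires a working microphone and the RealtimeSTT package; `build_recorder_kwargs` can be inspected without either.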

### Performance Tips

- **GPU**: Use CUDA for roughly 3-5x faster transcription
- **Model**: Use `tiny` or `base` for real-time applications
- **Sensitivity**: Adjust wake word sensitivity based on environment noise
- **Device**: Set `--device cpu` if experiencing GPU memory issues
- **Real-time**: Pass `--no-realtime` to disable the live preview for better final transcription performance

### Troubleshooting

**No microphone detected:**
```bash
# Check audio devices
tooling stt info
```

**CUDA not available:**
```bash
# Install CUDA-enabled PyTorch
pip install torch==2.5.1+cu121 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
```

**Wake word not detected:**
```bash
# Increase sensitivity
tooling stt listen --sensitivity 0.8 --verbose
```

**Poor transcription quality:**
```bash
# Use larger model
tooling stt listen --model large-v2
```

## Development Guide

### How to Add New Packages