stt
@@ -170,6 +170,221 @@ ocr-screenshot --screenshot-method interactive --save-image
 ocr-screenshot --screenshot-method mss --monitor-number 2
 ```
 
+## Speech-to-Text (STT) Tool
+
+A real-time speech-to-text tool using RealtimeSTT with wake word activation. Features the "jarvis" wake word by default and supports live transcription with various output options.
+
+### Features
+
+- 🎙️ **Real-time transcription** - Live speech-to-text conversion
+- 🎯 **Wake word activation** - Multiple wake words including "jarvis"
+- ⚡ **GPU acceleration** - CUDA support for faster processing
+- 🔄 **Live display** - Real-time transcription preview
+- 💾 **File output** - Save transcriptions to text files
+- 🎛️ **Multiple models** - Choose from tiny to large Whisper models
+- 🌍 **Multi-language** - Support for multiple languages
+- 🧪 **Test mode** - Test functionality without wake words
+
+### Installation
+
+The STT dependencies are included in the base installation:
+```bash
+pip install .
+```
+
+For optimal performance with GPU acceleration:
+```bash
+# For CUDA 11.8
+pip install torch==2.5.1+cu118 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118
+
+# For CUDA 12.X
+pip install torch==2.5.1+cu121 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
+```
+
+### Usage
+
+#### Basic Commands
+
+Start STT with jarvis wake word:
+```bash
+tooling stt listen
+```
+
+Test STT without wake words:
+```bash
+tooling stt test
+```
+
+Show system information:
+```bash
+tooling stt info
+```
+
+#### Wake Word Options
+
+Use different wake words:
+```bash
+# Use alexa wake word
+tooling stt listen --wake-word alexa
+
+# Use hey google wake word
+tooling stt listen --wake-word "hey google"
+
+# Use computer wake word
+tooling stt listen --wake-word computer
+```
+
+#### Model Selection
+
+Choose different Whisper models for speed vs accuracy:
+```bash
+# Fastest (tiny model)
+tooling stt listen --model tiny
+
+# Balanced (base model, default)
+tooling stt listen --model base
+
+# Best accuracy (large model)
+tooling stt listen --model large-v2
+```
+
+#### Advanced Features
+
+Save transcriptions to file:
+```bash
+tooling stt listen --save-to-file transcripts.txt
+```
+
+Disable real-time display for better performance:
+```bash
+tooling stt listen --no-realtime
+```
+
+Set custom sensitivity and language:
+```bash
+tooling stt listen --sensitivity 0.8 --language en --verbose
+```
+
+Force CPU usage:
+```bash
+tooling stt listen --device cpu
+```
+
+### Available Wake Words
+
+The following wake words are supported:
+- **jarvis** (default)
+- alexa
+- americano
+- blueberry
+- bumblebee
+- computer
+- grapefruits
+- grasshopper
+- hey google
+- hey siri
+- ok google
+- picovoice
+- porcupine
+- terminator
+
+### Available Models
+
+| Model | Speed | Accuracy | Parameters | Use Case |
+|-------|-------|----------|------------|----------|
+| **tiny** | ⚡⚡⚡ | ⭐⭐ | 39M | Testing, low-power devices |
+| **base** | ⚡⚡ | ⭐⭐⭐ | 74M | Balanced (default) |
+| **small** | ⚡ | ⭐⭐⭐⭐ | 244M | Better accuracy |
+| **medium** | 🐌 | ⭐⭐⭐⭐⭐ | 769M | High accuracy |
+| **large-v2** | 🐌🐌 | ⭐⭐⭐⭐⭐ | 1550M | Best accuracy |
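The size figures in the table above match the published Whisper parameter counts (in millions). As a rough sketch of how a script could choose a model from that table, the helper below picks the largest model that fits a parameter budget. `WHISPER_PARAMS_M` and `pick_model` are hypothetical names for illustration and are not part of the `tooling` package.

```python
# Parameter counts in millions, copied from the model table above.
# This mapping and the helper are illustrative assumptions, not project code.
WHISPER_PARAMS_M = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large-v2": 1550,
}


def pick_model(max_params_m: int) -> str:
    """Return the largest model whose parameter count fits the budget."""
    candidates = [
        (size, name)
        for name, size in WHISPER_PARAMS_M.items()
        if size <= max_params_m
    ]
    if not candidates:
        raise ValueError(f"No model fits within {max_params_m}M parameters")
    # Tuples sort by size first, so max() yields the biggest model that fits.
    return max(candidates)[1]
```

The returned name can then be passed to `tooling stt listen --model`.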
+
+### Command Line Options
+
+```bash
+tooling stt listen [OPTIONS]
+
+Options:
+  --wake-word TEXT          Wake word to activate recording [default: jarvis]
+  --model TEXT              Whisper model (tiny, base, small, medium, large-v2) [default: base]
+  --language TEXT           Language code for transcription (empty for auto-detection)
+  --realtime/--no-realtime  Enable real-time transcription display [default: realtime]
+  --save-to-file PATH       Save transcriptions to a file
+  --sensitivity FLOAT       Wake word sensitivity (0.0 to 1.0) [default: 0.6]
+  --device TEXT             Device to use (auto, cuda, cpu) [default: auto]
+  --verbose                 Show verbose output and configuration
+  --help                    Show this message and exit
+```
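For reference, the `auto` value of `--device` is resolved in this commit by checking whether PyTorch reports a CUDA device. The sketch below restates that logic as a standalone function. The name `resolve_device` and the `ValueError` for unknown values are assumptions added for illustration; the committed code does not validate the device string.

```python
import importlib


def resolve_device(requested: str = "auto") -> str:
    """Map a --device value to a concrete device name ("cuda" or "cpu")."""
    # Rejecting unknown values is an assumption; the CLI accepts any string.
    if requested not in {"auto", "cuda", "cpu"}:
        raise ValueError(f"Unsupported device: {requested!r}")
    if requested != "auto":
        return requested
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # Without PyTorch there is no way to probe for CUDA, so use the CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

An explicit `cpu` or `cuda` is returned unchanged, so only `auto` triggers the PyTorch import.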
+
+### Examples
+
+**Basic usage with jarvis:**
+```bash
+tooling stt listen
+```
+
+**Fast transcription with tiny model:**
+```bash
+tooling stt listen --model tiny --wake-word computer
+```
+
+**High accuracy with file output:**
+```bash
+tooling stt listen --model large-v2 --save-to-file meeting_notes.txt --verbose
+```
+
+**Quick test without wake words:**
+```bash
+tooling stt test --duration 5 --model tiny
+```
+
+**Custom language and sensitivity:**
+```bash
+tooling stt listen --language es --sensitivity 0.8 --wake-word "hey google"
+```
+
+### How it Works
+
+1. **Initialization**: Loads the selected Whisper model and sets up audio processing
+2. **Wake Word Detection**: Listens for the specified wake word using Porcupine or OpenWakeWord
+3. **Voice Activity Detection**: Uses WebRTC VAD and Silero VAD for accurate speech detection
+4. **Real-time Transcription**: Processes audio chunks in real-time (optional)
+5. **Final Transcription**: Generates high-quality final transcription when speech ends
+6. **Output**: Displays results and optionally saves to file
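The steps above amount to a small loop: wait for the wake word, record, transcribe, then wait again. The following is a minimal sketch of that loop as a state machine. It does not call RealtimeSTT; the stage and event names are invented for illustration and do not correspond to library callbacks.

```python
from enum import Enum, auto


class Stage(Enum):
    WAITING_FOR_WAKE_WORD = auto()
    RECORDING = auto()
    TRANSCRIBING = auto()


# (current stage, event) -> next stage. Event names are illustrative only.
TRANSITIONS = {
    (Stage.WAITING_FOR_WAKE_WORD, "wake_word_detected"): Stage.RECORDING,
    (Stage.RECORDING, "speech_ended"): Stage.TRANSCRIBING,
    (Stage.TRANSCRIBING, "transcript_ready"): Stage.WAITING_FOR_WAKE_WORD,
}


def advance(stage: Stage, event: str) -> Stage:
    """Return the next stage, ignoring events that do not apply."""
    return TRANSITIONS.get((stage, event), stage)


def run(events):
    """Replay a sequence of events and return every stage visited."""
    stage = Stage.WAITING_FOR_WAKE_WORD
    history = [stage]
    for event in events:
        stage = advance(stage, event)
        history.append(stage)
    return history
```

Events that arrive in the wrong stage, such as `speech_ended` before the wake word, leave the stage unchanged.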
+
+### Performance Tips
+
+- **GPU**: Use CUDA for 3-5x faster transcription
+- **Model**: Use `tiny` or `base` for real-time applications
+- **Sensitivity**: Adjust wake word sensitivity based on environment noise
+- **Device**: Set `--device cpu` if experiencing GPU memory issues
+- **Real-time**: Pass `--no-realtime` to disable the live preview for better final transcription performance
+
+### Troubleshooting
+
+**No microphone detected:**
+```bash
+# Check the installation and CUDA status (this command does not list audio devices)
+tooling stt info
+```
+
+**CUDA not available:**
+```bash
+# Install CUDA-enabled PyTorch
+pip install torch==2.5.1+cu121 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
+```
+
+**Wake word not detected:**
+```bash
+# Increase sensitivity
+tooling stt listen --sensitivity 0.8 --verbose
+```
+
+**Poor transcription quality:**
+```bash
+# Use larger model
+tooling stt listen --model large-v2
+```
+
 ## Development Guide
 
 ### How to Add New Packages
@@ -30,6 +30,7 @@ screenshot-all = [
 
 [project.scripts]
 ocr-screenshot = "tooling.cli:cli_main"
+tooling = "tooling.cli:cli_main"
 
 [build-system]
 requires = ["hatchling"]
+105
-9
@@ -2,34 +2,63 @@
 # uv pip compile pyproject.toml -o requirements.txt
 anyascii==0.3.3
     # via python-doctr
+av==15.0.0
+    # via faster-whisper
 certifi==2025.7.14
     # via requests
+cffi==1.17.1
+    # via soundfile
 charset-normalizer==3.4.2
     # via requests
 click==8.2.1
     # via typer
+colorama==0.4.6
+    # via
+    #   halo
+    #   log-symbols
+coloredlogs==15.0.1
+    # via onnxruntime
+ctranslate2==4.6.0
+    # via faster-whisper
 defusedxml==0.7.1
     # via python-doctr
+enum34==1.1.10
+    # via pvporcupine
+faster-whisper==1.1.1
+    # via realtimestt
 filelock==3.18.0
     # via
     #   huggingface-hub
     #   torch
+flatbuffers==25.2.10
+    # via onnxruntime
 fsspec==2025.7.0
     # via
     #   huggingface-hub
     #   torch
 h5py==3.14.0
     # via python-doctr
+halo==0.0.31
+    # via realtimestt
 hf-xet==1.1.5
     # via huggingface-hub
 huggingface-hub==0.33.4
-    # via python-doctr
+    # via
+    #   faster-whisper
+    #   python-doctr
+    #   tokenizers
+humanfriendly==10.0
+    # via coloredlogs
 idna==3.10
     # via requests
 jinja2==3.1.6
     # via torch
+joblib==1.5.1
+    # via scikit-learn
 langdetect==1.0.9
     # via python-doctr
+log-symbols==0.0.14
+    # via halo
 markdown-it-py==3.0.0
     # via rich
 markupsafe==3.0.2
@@ -42,30 +71,55 @@ networkx==3.5
     # via torch
 numpy==2.3.1
     # via
+    #   ctranslate2
     #   h5py
     #   onnx
+    #   onnxruntime
     #   opencv-python
+    #   pvporcupine
     #   python-doctr
+    #   scikit-learn
     #   scipy
     #   shapely
+    #   soundfile
     #   torchvision
 onnx==1.18.0
     # via python-doctr
+onnxruntime==1.22.1
+    # via
+    #   faster-whisper
+    #   openwakeword
 opencv-python==4.11.0.86
     # via python-doctr
+openwakeword==0.6.0
+    # via realtimestt
 packaging==25.0
-    # via huggingface-hub
+    # via
+    #   huggingface-hub
+    #   onnxruntime
 pillow==11.3.0
     # via
     #   tooling (pyproject.toml)
     #   python-doctr
     #   torchvision
 protobuf==6.31.1
-    # via onnx
+    # via
+    #   onnx
+    #   onnxruntime
+pvporcupine==1.9.5
+    # via realtimestt
+pyaudio==0.2.14
+    # via realtimestt
 pyclipper==1.3.0.post6
     # via python-doctr
+pycparser==2.22
+    # via cffi
 pygments==2.19.2
     # via rich
+pyobjc-core==11.1
+    # via pyobjc-framework-cocoa
+pyobjc-framework-cocoa==11.1
+    # via rumps
 pypdfium2==4.30.0
     # via python-doctr
 pyperclip==1.9.0
@@ -73,34 +127,70 @@ pyperclip==1.9.0
 python-doctr==1.0.0
     # via tooling (pyproject.toml)
 pyyaml==6.0.2
-    # via huggingface-hub
+    # via
+    #   ctranslate2
+    #   huggingface-hub
 rapidfuzz==3.13.0
     # via python-doctr
+realtimestt==0.3.104
+    # via tooling (pyproject.toml)
 requests==2.32.4
-    # via huggingface-hub
+    # via
+    #   huggingface-hub
+    #   openwakeword
 rich==14.0.0
     # via
     #   tooling (pyproject.toml)
     #   typer
-scipy==1.16.0
-    # via python-doctr
+rumps==0.4.0
+    # via tooling (pyproject.toml)
+scikit-learn==1.7.1
+    # via openwakeword
+scipy==1.15.2
+    # via
+    #   openwakeword
+    #   python-doctr
+    #   realtimestt
+    #   scikit-learn
+setuptools==80.9.0
+    # via ctranslate2
 shapely==2.1.1
     # via python-doctr
 shellingham==1.5.4
     # via typer
 six==1.17.0
-    # via langdetect
+    # via
+    #   halo
+    #   langdetect
+soundfile==0.13.1
+    # via realtimestt
+spinners==0.0.24
+    # via halo
 sympy==1.14.0
-    # via torch
+    # via
+    #   onnxruntime
+    #   torch
+termcolor==3.1.0
+    # via halo
+threadpoolctl==3.6.0
+    # via scikit-learn
+tokenizers==0.21.2
+    # via faster-whisper
 torch==2.7.1
     # via
     #   python-doctr
+    #   realtimestt
+    #   torchaudio
     #   torchvision
+torchaudio==2.7.1
+    # via realtimestt
 torchvision==0.22.1
     # via python-doctr
 tqdm==4.67.1
     # via
+    #   faster-whisper
     #   huggingface-hub
+    #   openwakeword
     #   python-doctr
 typer==0.16.0
     # via tooling (pyproject.toml)
@@ -114,3 +204,9 @@ urllib3==2.5.0
     # via requests
 validators==0.35.0
     # via python-doctr
+webrtcvad-wheels==2.0.14
+    # via realtimestt
+websocket-client==1.8.0
+    # via realtimestt
+websockets==15.0.1
+    # via realtimestt
@@ -9,6 +9,7 @@ import typer
 from rich.console import Console
 
 from .ocr_cli import ocr_app
+from .stt_cli import stt_app
 
 # Create main app
 app = typer.Typer(
@@ -22,6 +23,9 @@ console = Console()
 # Add OCR subcommand
 app.add_typer(ocr_app, name="ocr", help="OCR screenshot tools")
 
+# Add STT subcommand
+app.add_typer(stt_app, name="stt", help="Speech-to-text tools with wake word activation")
+
 @app.command()
 def version():
     """Show version information."""
@@ -0,0 +1,450 @@
+#!/usr/bin/env python3
+"""
+Speech-to-Text CLI Tool
+
+A command-line tool that provides real-time speech-to-text transcription
+using RealtimeSTT with wake word activation and various output options.
+"""
+
+import datetime
+import os
+import tempfile
+from pathlib import Path
+from typing import Optional, Callable
+import threading
+import time
+
+import typer
+from rich.console import Console
+from rich.panel import Panel
+from rich.progress import Progress, SpinnerColumn, TextColumn
+from rich.live import Live
+from rich.text import Text
+from rich.table import Table
+
+# Create STT app that can be imported as a subcommand
+stt_app = typer.Typer(
+    name="stt",
+    help="Real-time speech-to-text with wake word activation",
+    rich_markup_mode="rich"
+)
+
+console = Console()
+
+# Global variables for managing the recorder
+_recorder = None
+_recording_active = False
+_transcription_buffer = []
+
+
+class TranscriptionDisplay:
+    """Handle live display of transcriptions."""
+
+    def __init__(self, show_realtime: bool = True):
+        self.show_realtime = show_realtime
+        self.realtime_text = ""
+        self.final_text = ""
+        self.status = "Initializing..."
+
+    def create_display(self) -> Table:
+        """Create the display table."""
+        table = Table.grid(padding=1)
+        table.add_column(style="cyan", no_wrap=False)
+
+        # Status
+        table.add_row(f"[bold blue]Status:[/bold blue] {self.status}")
+        table.add_row("")
+
+        # Realtime transcription
+        if self.show_realtime and self.realtime_text:
+            table.add_row("[bold yellow]🎙️ Live transcription:[/bold yellow]")
+            table.add_row(f"[dim]{self.realtime_text}[/dim]")
+            table.add_row("")
+
+        # Final transcription
+        if self.final_text:
+            table.add_row("[bold green]✅ Final transcription:[/bold green]")
+            table.add_row(self.final_text)
+            table.add_row("")
+
+        return table
+
+    def update_status(self, status: str):
+        """Update the status."""
+        self.status = status
+
+    def update_realtime(self, text: str):
+        """Update realtime transcription."""
+        self.realtime_text = text
+
+    def add_final(self, text: str):
+        """Add final transcription."""
+        if text.strip():
+            timestamp = datetime.datetime.now().strftime("%H:%M:%S")
+            self.final_text += f"[{timestamp}] {text}\n"
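`TranscriptionDisplay` is updated from recorder callbacks and read by the render loop. Assuming those callbacks can fire on a background thread (this diff does not confirm it), one option is to guard the shared fields with a lock. The sketch below is a hypothetical variant called `DisplayState`, not the class from this commit, and it leaves out the Rich rendering.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class DisplayState:
    """Lock-guarded holder for the text shown in the live display."""

    status: str = "Initializing..."
    realtime_text: str = ""
    final_lines: list = field(default_factory=list)
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def update_status(self, status: str) -> None:
        with self._lock:
            self.status = status

    def update_realtime(self, text: str) -> None:
        with self._lock:
            self.realtime_text = text

    def add_final(self, line: str) -> None:
        # Skip blank results, as add_final() in the commit does.
        if line.strip():
            with self._lock:
                self.final_lines.append(line)

    def snapshot(self):
        """Return a consistent copy of the state for rendering."""
        with self._lock:
            return self.status, self.realtime_text, tuple(self.final_lines)
```

The render loop would call `snapshot()` once per frame and build its table from the returned values.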
+
+
+@stt_app.command("listen")
+def listen_cmd(
+    wake_word: str = typer.Option(
+        default="jarvis",
+        help="Wake word to activate recording (jarvis, alexa, hey google, etc.)"
+    ),
+    model: str = typer.Option(
+        default="base",
+        help="Whisper model to use (tiny, base, small, medium, large-v2)"
+    ),
+    language: str = typer.Option(
+        default="",
+        help="Language code for transcription (empty for auto-detection)"
+    ),
+    realtime: bool = typer.Option(
+        default=True,
+        help="Enable real-time transcription display"
+    ),
+    save_to_file: Optional[Path] = typer.Option(
+        default=None,
+        help="Save transcriptions to a file"
+    ),
+    sensitivity: float = typer.Option(
+        default=0.6,
+        help="Wake word sensitivity (0.0 to 1.0)"
+    ),
+    device: str = typer.Option(
+        default="auto",
+        help="Device to use (auto, cuda, cpu)"
+    ),
+    verbose: bool = typer.Option(
+        default=False,
+        help="Show verbose output and configuration"
+    )
+):
+    """Start real-time speech-to-text with wake word activation."""
+
+    try:
+        from RealtimeSTT import AudioToTextRecorder
+    except ImportError:
+        console.print("[bold red]❌ RealtimeSTT not installed.[/bold red]")
+        console.print("Install with: [bold]pip install RealtimeSTT[/bold]")
+        raise typer.Exit(1)
+
+    # Validate wake word
+    valid_wake_words = [
+        "alexa", "americano", "blueberry", "bumblebee", "computer",
+        "grapefruits", "grasshopper", "hey google", "hey siri", "jarvis",
+        "ok google", "picovoice", "porcupine", "terminator"
+    ]
+
+    if wake_word.lower() not in valid_wake_words:
+        console.print(f"[bold red]❌ Invalid wake word: {wake_word}[/bold red]")
+        console.print(f"Valid options: {', '.join(valid_wake_words)}")
+        raise typer.Exit(1)
+
+    # Determine device
+    if device == "auto":
+        try:
+            import torch
+            device = "cuda" if torch.cuda.is_available() else "cpu"
+        except ImportError:
+            device = "cpu"
+
+    # Create transcription display
+    display = TranscriptionDisplay(show_realtime=realtime)
+
+    # File output setup
+    output_file = None
+    if save_to_file:
+        save_to_file.parent.mkdir(parents=True, exist_ok=True)
+        output_file = open(save_to_file, 'a', encoding='utf-8')
+        timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
+        output_file.write(f"\n=== STT Session Started: {timestamp} ===\n")
+        output_file.flush()
+
+    # Show configuration if verbose
+    if verbose:
+        config_table = Table(title="STT Configuration")
+        config_table.add_column("Setting", style="cyan")
+        config_table.add_column("Value", style="green")
+
+        config_table.add_row("Wake Word", wake_word)
+        config_table.add_row("Model", model)
+        config_table.add_row("Language", language if language else "Auto-detect")
+        config_table.add_row("Device", device)
+        config_table.add_row("Realtime Display", str(realtime))
+        config_table.add_row("Sensitivity", str(sensitivity))
+        if save_to_file:
+            config_table.add_row("Output File", str(save_to_file))
+
+        console.print(config_table)
+        console.print()
+
+    # Callback functions
+    def on_realtime_transcription(text: str):
+        """Handle real-time transcription updates."""
+        if realtime:
+            display.update_realtime(text)
+
+    def on_transcription_complete(text: str):
+        """Handle completed transcriptions."""
+        if text.strip():
+            display.add_final(text)
+
+            # Save to file if specified
+            if output_file:
+                timestamp = datetime.datetime.now().strftime("%H:%M:%S")
+                output_file.write(f"[{timestamp}] {text}\n")
+                output_file.flush()
+
+    def on_recording_start():
+        """Called when recording starts."""
+        display.update_status("🎙️ Recording... (speak now)")
+
+    def on_recording_stop():
+        """Called when recording stops."""
+        display.update_status("⏸️ Processing transcription...")
+
+    def on_wakeword_detected():
+        """Called when wake word is detected."""
+        display.update_status(f"🎯 Wake word '{wake_word}' detected!")
+
+    def on_wakeword_timeout():
+        """Called when wake word times out."""
+        display.update_status(f"⏰ Waiting for wake word '{wake_word}'...")
+
+    def on_wakeword_detection_start():
+        """Called when starting to listen for wake words."""
+        display.update_status(f"👂 Listening for wake word '{wake_word}'...")
+
+    try:
+        display.update_status("🔧 Initializing STT engine...")
+
+        # Configure recorder parameters
+        recorder_config = {
+            "model": model,
+            "wake_words": wake_word,
+            "wake_words_sensitivity": sensitivity,
+            "device": device,
+            "on_recording_start": on_recording_start,
+            "on_recording_stop": on_recording_stop,
+            "on_wakeword_detected": on_wakeword_detected,
+            "on_wakeword_timeout": on_wakeword_timeout,
+            "on_wakeword_detection_start": on_wakeword_detection_start,
+        }
+
+        if language:
+            recorder_config["language"] = language
+
+        if realtime:
+            recorder_config.update({
+                "enable_realtime_transcription": True,
+                "on_realtime_transcription_update": on_realtime_transcription,
+            })
+
+        # Initialize recorder
+        recorder = AudioToTextRecorder(**recorder_config)
+
+        # Show initial instructions
+        console.print(Panel(
+            f"[bold]Speech-to-Text Ready![/bold]\n\n"
+            f"• Say '[bold cyan]{wake_word}[/bold cyan]' to activate recording\n"
+            f"• Speak clearly after activation\n"
+            f"• Press [bold red]Ctrl+C[/bold red] to stop\n"
+            f"• Model: [bold]{model}[/bold] | Device: [bold]{device}[/bold]",
+            title="🎤 STT Instructions",
+            border_style="green"
+        ))
+
+        # Start live display
+        with Live(display.create_display(), refresh_per_second=10, console=console) as live:
+            try:
+                while True:
+                    # Get transcription (this will wait for wake word and then record)
+                    text = recorder.text()
+                    if text:
+                        on_transcription_complete(text)
+                        live.update(display.create_display())
+
+                    # Small delay to prevent high CPU usage
+                    time.sleep(0.1)
+
+            except KeyboardInterrupt:
+                display.update_status("🛑 Stopping STT...")
+                live.update(display.create_display())
+                raise
+
+    except KeyboardInterrupt:
+        console.print("\n[bold yellow]⚠️ STT stopped by user.[/bold yellow]")
+    except Exception as e:
+        console.print(f"\n[bold red]❌ STT error: {e}[/bold red]")
+        if verbose:
+            import traceback
+            console.print(f"[dim]{traceback.format_exc()}[/dim]")
+        raise typer.Exit(1)
+    finally:
+        # Cleanup
+        if 'recorder' in locals():
+            try:
+                recorder.shutdown()
+            except Exception:
+                pass
+
+        if output_file:
+            timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
+            output_file.write(f"=== STT Session Ended: {timestamp} ===\n\n")
+            output_file.close()
+            console.print(f"\n[green]💾 Transcriptions saved to: {save_to_file}[/green]")
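`listen_cmd` opens the transcript file by hand and closes it in a `finally` block far from where it was opened. A context manager is one way to keep the session header, footer, and close together. The sketch below is an assumption about how that could look, not code from this commit; `transcript_session` and `write_line` are hypothetical names, and the header and footer formats are copied from the command above.

```python
import datetime
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def transcript_session(path: Path):
    """Append a session header, yield a line writer, always write a footer."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "a", encoding="utf-8") as handle:
        started = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        handle.write(f"\n=== STT Session Started: {started} ===\n")

        def write_line(text: str) -> None:
            # Timestamp and flush each line so a crash loses as little as possible.
            stamp = datetime.datetime.now().strftime("%H:%M:%S")
            handle.write(f"[{stamp}] {text}\n")
            handle.flush()

        try:
            yield write_line
        finally:
            # Runs on normal exit and on exceptions such as Ctrl+C.
            ended = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            handle.write(f"=== STT Session Ended: {ended} ===\n\n")
```

Usage would be `with transcript_session(Path("transcripts.txt")) as write_line:`, followed by a `write_line(text)` call for each completed transcription.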
+
+
+@stt_app.command("test")
+def test_cmd(
+    duration: int = typer.Option(
+        default=10,
+        help="Test duration in seconds"
+    ),
+    model: str = typer.Option(
+        default="tiny",
+        help="Whisper model to use for testing"
+    )
+):
+    """Test STT functionality without wake words."""
+
+    try:
+        from RealtimeSTT import AudioToTextRecorder
+    except ImportError:
+        console.print("[bold red]❌ RealtimeSTT not installed.[/bold red]")
+        console.print("Install with: [bold]pip install RealtimeSTT[/bold]")
+        raise typer.Exit(1)
+
+    console.print(Panel(
+        f"[bold]STT Test Mode[/bold]\n\n"
+        f"• Duration: [bold]{duration}[/bold] seconds\n"
+        f"• Model: [bold]{model}[/bold]\n"
+        f"• No wake word required\n"
+        f"• Start speaking when you see 'Recording...'",
+        title="🧪 Test Configuration",
+        border_style="blue"
+    ))
+
+    try:
+        with Progress(
+            SpinnerColumn(),
+            TextColumn("[progress.description]{task.description}"),
+            console=console,
+        ) as progress:
+
+            init_task = progress.add_task("[cyan]Initializing STT engine...", total=None)
+
+            recorder = AudioToTextRecorder(
+                model=model,
+                wake_words="",  # No wake words for test
+            )
+
+            progress.update(init_task, description="[green]✓ STT engine ready")
+            progress.stop()
+
+        console.print(f"\n[bold green]🎙️ Recording for {duration} seconds...[/bold green]")
+        console.print("[yellow]Start speaking now![/yellow]")
+
+        # Manual recording for test
+        recorder.start()
+
+        # Show countdown
+        for remaining in range(duration, 0, -1):
+            console.print(f"\r[bold cyan]⏰ {remaining} seconds remaining...[/bold cyan]", end="")
+            time.sleep(1)
+
+        console.print(f"\r[bold blue]⏸️ Processing transcription...[/bold blue]")
+
+        recorder.stop()
+        text = recorder.text()
+
+        if text:
+            console.print("\n[bold green]✅ Test completed successfully![/bold green]")
+            console.print(Panel(
+                text,
+                title="📝 Transcribed Text",
+                border_style="green"
+            ))
+        else:
+            console.print("\n[bold yellow]⚠️ No speech detected during test.[/bold yellow]")
+            console.print("[dim]Try speaking louder or check your microphone.[/dim]")
+
+    except KeyboardInterrupt:
+        console.print("\n[bold yellow]⚠️ Test cancelled by user.[/bold yellow]")
+    except Exception as e:
+        console.print(f"\n[bold red]❌ Test failed: {e}[/bold red]")
+        raise typer.Exit(1)
+    finally:
+        if 'recorder' in locals():
+            try:
+                recorder.shutdown()
+            except Exception:
+                pass
+
+
+@stt_app.command("info")
+def info_cmd():
+    """Show STT system information and available options."""
+
+    console.print(Panel(
+        "[bold blue]STT System Information[/bold blue]",
+        border_style="blue"
+    ))
+
+    # Check RealtimeSTT installation
+    try:
+        from RealtimeSTT import AudioToTextRecorder
+        console.print("[green]✅ RealtimeSTT installed[/green]")
+
+        # Check CUDA availability
+        try:
+            import torch
+            cuda_available = torch.cuda.is_available()
+            if cuda_available:
+                console.print(f"[green]✅ CUDA available (GPU: {torch.cuda.get_device_name()})[/green]")
+            else:
+                console.print("[yellow]⚠️ CUDA not available (CPU only)[/yellow]")
+        except ImportError:
+            console.print("[yellow]⚠️ PyTorch not available[/yellow]")
+
+    except ImportError:
+        console.print("[red]❌ RealtimeSTT not installed[/red]")
+        console.print("Install with: [bold]pip install RealtimeSTT[/bold]")
+
+    # Available wake words
+    wake_words = [
+        "alexa", "americano", "blueberry", "bumblebee", "computer",
+        "grapefruits", "grasshopper", "hey google", "hey siri", "jarvis",
+        "ok google", "picovoice", "porcupine", "terminator"
+    ]
+
+    console.print(f"\n[bold cyan]Available Wake Words:[/bold cyan]")
+    console.print(", ".join(wake_words))
+
+    # Available models
+    models = ["tiny", "tiny.en", "base", "base.en", "small", "small.en", "medium", "medium.en", "large-v1", "large-v2"]
+    console.print(f"\n[bold cyan]Available Models:[/bold cyan]")
+    console.print(", ".join(models))
+
+    # Usage examples
+    console.print(f"\n[bold cyan]Usage Examples:[/bold cyan]")
+    examples = [
+        "tooling stt listen # Use jarvis wake word with base model",
+        "tooling stt listen --wake-word alexa # Use alexa wake word",
+        "tooling stt listen --model tiny # Use faster tiny model",
+        "tooling stt test --duration 5 # Test for 5 seconds",
+        "tooling stt listen --save-to-file transcripts.txt # Save to file"
+    ]
+
+    for example in examples:
+        console.print(f"  [dim]$ {example}[/dim]")
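The wake word list appears twice in this file, once in `listen_cmd` and once in `info_cmd`. `listen_cmd` also validates `wake_word.lower()` but then passes the original string to the recorder. A shared constant plus a normalizing helper would address both points. The sketch below uses hypothetical names (`VALID_WAKE_WORDS`, `normalize_wake_word`) and is not part of this commit.

```python
# Single source of truth for the wake words listed in this commit.
VALID_WAKE_WORDS = (
    "alexa", "americano", "blueberry", "bumblebee", "computer",
    "grapefruits", "grasshopper", "hey google", "hey siri", "jarvis",
    "ok google", "picovoice", "porcupine", "terminator",
)


def normalize_wake_word(raw: str) -> str:
    """Lower-case and collapse whitespace, rejecting unsupported wake words."""
    candidate = " ".join(raw.lower().split())
    if candidate not in VALID_WAKE_WORDS:
        options = ", ".join(VALID_WAKE_WORDS)
        raise ValueError(f"Invalid wake word {raw!r}. Valid options: {options}")
    return candidate
```

The CLI could call the helper once, catch `ValueError` to print the error, and pass the normalized value on to the recorder.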
+
+
+# For backward compatibility when run directly
+def cli_main():
+    """Entry point for the STT CLI script when run directly."""
+    stt_app()
+
+
+if __name__ == "__main__":
+    stt_app()