“The Trials of Tiny Transcription: Optimizing Speech-to-Text on the Raspberry Pi Zero 2W”
Speech-to-text (STT) technology has been rapidly advancing, enabling devices like the Raspberry Pi Zero 2W to run sophisticated voice recognition systems. However, optimizing these systems for efficient transcription, especially on resource-constrained hardware, presents significant challenges. This report explores the current limitations of running VOSK STT on a Raspberry Pi Zero 2W, analyzes the problematic output, and provides suggestions for improving the Python code, along with alternative STT engines worth considering.
Introduction to STT and Raspberry Pi
Speech-to-text technology converts spoken words into text. The Raspberry Pi Zero 2W is a low-cost, small form factor single-board computer ideal for prototyping and proof-of-concept (POC) projects. It’s commonly used in voice assistants and other applications where privacy and local processing are preferred.
VOSK STT
VOSK is an open-source STT engine known for its efficiency and support for many languages. It works well on lower-end hardware, making it suitable for devices like the Raspberry Pi Zero 2W. However, VOSK’s performance can be impacted by the buffer size and audio stream handling, as seen in the provided Python code.
Problematic Output Analysis
The provided output shows repetition and stuttering at the end of sentences, indicating issues with buffer management:
Text: oh reallylly
Text: karma we might need to kept those had a certain pointint
Text: yeaheah
Text: just might need to do thathat
This suggests that the current buffer size (CHUNK = 1024) might be too small, leading to overlapping audio frames and causing the repetition.
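For context, the amount of audio each read delivers is simple arithmetic on the settings in the code below; at a 16 kHz sample rate, a 1024-sample chunk holds only 64 ms of audio:

RATE = 16000   # samples per second
CHUNK = 1024   # samples per read
chunk_ms = CHUNK / RATE * 1000
print(f"{chunk_ms:.0f} ms of audio per read")  # prints: 64 ms of audio per read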
Current Python Code
#!/usr/bin/env python3
import pyaudio
from vosk import Model, KaldiRecognizer
import json

# Path to the Vosk model directory
model_path = "./model"  # Replace with your model path

# Audio settings
FORMAT = pyaudio.paInt16  # 16-bit audio
CHANNELS = 1              # Mono
RATE = 16000              # Sample rate (match your model)
CHUNK = 1024              # Buffer size

# Load the Vosk model
model = Model(model_path)

# Initialize PyAudio
audio = pyaudio.PyAudio()

# Start microphone stream
stream = audio.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                    input=True, frames_per_buffer=CHUNK)

# Initialize the recognizer
rec = KaldiRecognizer(model, RATE)

print("Listening... (Ctrl+C to stop)")

# Process the audio stream
try:
    while True:
        data = stream.read(CHUNK, exception_on_overflow=False)
        if rec.AcceptWaveform(data):
            result = json.loads(rec.Result())
            print("Text:", result.get("text", ""))
        else:
            partial_result = json.loads(rec.PartialResult())
            print("Partial:", partial_result.get("partial", ""), end='\r')
except KeyboardInterrupt:
    # Clean up
    stream.stop_stream()
    stream.close()
    audio.terminate()
    final_result = json.loads(rec.FinalResult())
    print("\nFinal Text:", final_result.get("text", ""))
Suggestions for Improvement
- Buffer Size Adjustment: Increase the buffer size so that more audio is processed at once, potentially reducing overlaps:
  CHUNK = 2048  # Increase buffer size (128 ms of audio at 16 kHz)
- Pause Detection: Implement a system to detect pauses in speech. This lets the recognizer finalize each utterance during silence instead of carrying trailing audio into the next result; see the sketch below.
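A minimal sketch of one possible approach, using a hand-rolled RMS gate as a drop-in replacement for the main loop above (it reuses stream, rec, and CHUNK from that script; SILENCE_THRESHOLD and SILENCE_CHUNKS are hypothetical tuning values, rms() is a local helper rather than a VOSK API, and Reset() is available in recent vosk releases):

import json
import math
import struct

SILENCE_THRESHOLD = 500   # hypothetical RMS level; tune for your microphone
SILENCE_CHUNKS = 8        # ~0.5 s of silence at CHUNK=1024, RATE=16000

def rms(data):
    # Root-mean-square level of a chunk of 16-bit little-endian mono samples
    samples = struct.unpack(f"<{len(data) // 2}h", data)
    return math.sqrt(sum(s * s for s in samples) / len(samples))

silent_chunks = 0
while True:
    data = stream.read(CHUNK, exception_on_overflow=False)
    silent_chunks = silent_chunks + 1 if rms(data) < SILENCE_THRESHOLD else 0
    if rec.AcceptWaveform(data):
        print("Text:", json.loads(rec.Result()).get("text", ""))
        silent_chunks = 0
    elif silent_chunks >= SILENCE_CHUNKS:
        # Force the recognizer to flush the current utterance during a pause,
        # then reset it before feeding more audio
        print("Text:", json.loads(rec.FinalResult()).get("text", ""))
        rec.Reset()
        silent_chunks = 0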