Local chat-bot pipeline (STT → LLM → TTS) on your Jetson Orin Nano Developer Kit

To build a fully local chat-bot pipeline (STT → LLM → TTS) on your Jetson Orin Nano Developer Kit, you can combine faster-whisper for speech-to-text (STT), an LLM of your choice (e.g., a lightweight quantized model such as Gemma-2B or Phi-3 Mini via Ollama, for efficiency on this hardware), and Piper for text-to-speech (TTS), all communicating over the Wyoming protocol without Home Assistant. The approach is Python-friendly: the core components install via pip into virtual environments, and a small custom Python script handles the orchestration.

The Jetson Orin Nano (ARM64 architecture with NVIDIA GPU) supports CUDA acceleration, which is key for performance. We’ll use CUDA-enabled versions where possible. The Wyoming protocol allows these services to run as independent servers (e.g., on different TCP ports), with a “satellite” handling audio input/output and a custom Python “master” script tying the pipeline together by processing events (e.g., receiving a transcript from STT, querying the LLM, and sending the response to TTS).

Prerequisites

  • Jetson Setup: Ensure your Jetson is running JetPack 5.x or 6.x (includes CUDA 11.x or 12.x). Install system dependencies:
    sudo apt update
    sudo apt install python3-venv python3-pip git curl alsa-utils libasound2-dev portaudio19-dev
    
  • CUDA: Verify with nvcc --version. If not installed, follow NVIDIA’s JetPack guide.
  • Audio Devices: Test mic/speaker with arecord and aplay (e.g., arecord -r 16000 -c 1 -f S16_LE -t wav test.wav for recording).
  • Hardware Notes: With 8GB RAM and the integrated GPU, aim for small LLMs (2-4B parameters, quantized to 4-bit) to avoid OOM errors. Larger models may require swapping or reduced batch sizes.

Step 1: Install and Run Wyoming Services

These run as background servers. Create separate virtual environments for each to avoid conflicts.

1.1 Wyoming-Faster-Whisper (STT Server)

Uses CUDA for acceleration on Jetson.

git clone https://github.com/rhasspy/wyoming-faster-whisper.git
cd wyoming-faster-whisper
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt -r requirements_cu12.txt  # For CUDA 12; use requirements_cu11.txt for CUDA 11
pip install .  # Install the package
deactivate

Run the server (adjust model/language; tiny-int8 is lightweight and fast):

cd wyoming-faster-whisper
.venv/bin/python3 -m wyoming_faster_whisper --model tiny-int8 --language en --beam-size 5 --uri tcp://0.0.0.0:10300 --data-dir ./data --download-dir ./data --device cuda
  • For even faster performance on Jetson, consider the TensorRT-optimized version (Docker-based, but integrable):
    docker run --gpus all -p 10300:10300 -e MODEL=base -e LANGUAGE=en -e COMPUTE_TYPE=float16 -e DEVICE=cuda captnspdr/wyoming-whisper-trt:latest-igpu
    
  • Test: Use a Wyoming client tool or script to send audio chunks.

1.2 Wyoming-Piper (TTS Server)

git clone https://github.com/rhasspy/wyoming-piper.git
cd wyoming-piper
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install .
deactivate
# Download Piper binary (ARM64-compatible)
curl -L -s "https://github.com/rhasspy/piper/releases/download/v1.2.0/piper_arm64.tar.gz" | tar -zxvf - -C .

Run the server (use a voice like en_US-amy-medium.onnx; download from Piper’s voices repo if needed):

cd wyoming-piper
.venv/bin/python3 -m wyoming_piper --piper ./piper/piper --voice en_US-amy-medium --uri tcp://0.0.0.0:10200 --data-dir ./data --download-dir ./data
  • Piper can use CUDA via onnxruntime-gpu; if you run into issues, fall back to CPU, which is still fast enough for Piper's small voice models.

1.3 (Optional) Wyoming-OpenWakeWord (Wake Word Detection)

For activating on a phrase like “Hey Assistant”.

git clone https://github.com/rhasspy/wyoming-openwakeword.git
cd wyoming-openwakeword
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install .
deactivate

Run:

cd wyoming-openwakeword
.venv/bin/python3 -m wyoming_openwakeword --preload-model ok_nabu --uri tcp://0.0.0.0:10400 --data-dir ./data --download-dir ./data
  • Custom models can be trained via OpenWakeWord tools.

Step 2: Install and Configure Wyoming-Satellite (Audio Handler)

This handles microphone input, VAD (voice activity detection), wake word, and streams to/from services.

git clone https://github.com/rhasspy/wyoming-satellite.git
cd wyoming-satellite
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install pysilero-vad==1.0.0 webrtc-noise-gain==1.2.3  # For VAD and noise reduction
pip install .
deactivate

Run the satellite (adjust mic/snd devices; find with arecord -L/aplay -L):

cd wyoming-satellite
.venv/bin/python3 -m wyoming_satellite \
  --name 'chat-bot-satellite' \
  --uri tcp://0.0.0.0:10700 \
  --mic-command 'arecord -D plughw:0,0 -r 16000 -c 1 -f S16_LE -t raw' \
  --snd-command 'aplay -D plughw:0,0 -r 22050 -c 1 -f S16_LE -t raw' \
  --wake-uri tcp://127.0.0.1:10400 \
  --wake-word-name ok_nabu \
  --stt-uri tcp://127.0.0.1:10300 \
  --tts-uri tcp://127.0.0.1:10200 \
  --vad \
  --mic-auto-gain 5 \
  --mic-noise-suppression 2 \
  --debug
  • The satellite connects to the local STT, TTS, and wake-word services and exposes itself on port 10700 for a “master” to connect and handle events (e.g., transcripts).

Step 3: Set Up the LLM

Use Ollama for easy Python integration (ARM64 builds available; supports CUDA on Jetson).

  • Download/install Ollama: curl -fsSL https://ollama.com/install.sh | sh
  • Pull a lightweight model: ollama pull gemma2:2b (or phi3:mini for ~3.8B params; quantized for efficiency).
  • Run: ollama serve (background with nohup or systemd).
  • Test in Python (pass 'stream': False, since /api/generate streams newline-delimited JSON by default and a bare .json() call would fail):
    import requests
    response = requests.post('http://localhost:11434/api/generate', json={'model': 'gemma2:2b', 'prompt': 'Hello!', 'stream': False}).json()
    print(response['response'])
    

Alternatives: Use llama.cpp with Python bindings for more control (pip install llama-cpp-python --extra-index-url https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels for CUDA).

Step 4: Python Orchestration Script (Custom Master)

This connects to the satellite, processes transcripts with the LLM, and sends responses to TTS. Based on wyoming-satellite/example_event_client.py (adapt it).

Create chat_bot_master.py:

import asyncio

import requests
from wyoming.asr import Transcript
from wyoming.client import AsyncTcpClient
from wyoming.error import Error
from wyoming.satellite import RunSatellite
from wyoming.tts import Synthesize

async def main():
    client = AsyncTcpClient('127.0.0.1', 10700)  # Satellite host/port
    await client.connect()
    await client.write_event(RunSatellite().event())  # Tell the satellite to start

    while True:
        event = await client.read_event()
        if event is None:
            break

        if Transcript.is_type(event.type):
            transcript = Transcript.from_event(event)
            print(f"User: {transcript.text}")

            # Query LLM (adjust prompt for chat context); stream=False returns one JSON object
            llm_response = requests.post(
                'http://localhost:11434/api/generate',
                json={
                    'model': 'gemma2:2b',
                    'prompt': f"Respond briefly to: {transcript.text}",
                    'stream': False,
                },
            ).json()['response']
            print(f"Bot: {llm_response}")

            # Send to TTS
            await client.write_event(Synthesize(text=llm_response).event())

        elif Error.is_type(event.type):
            error = Error.from_event(event)
            print(f"Error: {error.text}")

    await client.disconnect()

if __name__ == "__main__":
    asyncio.run(main())
  • Install deps: pip install wyoming requests
  • Run: python chat_bot_master.py
  • This forms the STT → LLM → TTS loop. Add context/memory by tracking conversation history in a list and injecting it into the LLM prompt.
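One way to sketch that multi-turn memory, assuming Ollama's /api/chat endpoint, is to keep the conversation in a list and replay it on each request:

```python
# Multi-turn chat memory via Ollama's /api/chat endpoint (a sketch).
import requests

history = [{"role": "system", "content": "You are a concise voice assistant."}]

def chat(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    reply = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": "gemma2:2b", "messages": history, "stream": False},
    ).json()["message"]["content"]
    history.append({"role": "assistant", "content": reply})  # remember the answer
    return reply

# Usage: in the master script, call chat(transcript.text) instead of /api/generate.
```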

Running the Full Pipeline

  1. Start services (STT, TTS, wake word) in separate terminals/tmux.
  2. Start satellite.
  3. Run the master script.
  • Speak after the wake word; the pipeline transcribes your speech, queries the LLM, and synthesizes the response.

Optimization and Troubleshooting

  • Performance: Monitor the GPU with tegrastats or jtop (nvidia-smi is generally unavailable on Jetson’s integrated GPU). Use smaller models if latency exceeds 2-3 s.
  • Debug: Add --debug to services; check logs.
  • Customization: Extend the master for multi-turn chat (e.g., history list) or add wake word skipping for always-on.
  • Alternatives if Needed: If Wyoming feels heavyweight, you can call the libraries directly (faster-whisper via pip install faster-whisper, plus piper-tts and ollama) from a single script, but this skips the protocol and its modularity.
  • Resources: Rhasspy docs for Wyoming extensions; NVIDIA forums for Jetson tweaks.

This setup is modular, Python-centric, and avoids HA entirely while using Wyoming for inter-service communication. If you hit hardware-specific issues, provide logs for refinement.