Whisper — Converting Voice Commands to Text

What is Whisper?

OpenAI Whisper is an open-source speech recognition model. It converts spoken audio into text with high accuracy across many languages, including Urdu and other non-English languages.

Best part: Whisper runs completely locally — no API key, no internet, no cost.

Installation

pip install openai-whisper
# Also needs ffmpeg
sudo apt install ffmpeg  # Ubuntu

Basic Transcription

import whisper

# Load model (sizes: tiny, base, small, medium, large)
# "base" is a good balance of speed and accuracy
model = whisper.load_model("base")

# Transcribe an audio file
result = model.transcribe("command.wav")
print(result["text"])
# Output: "Move forward and pick up the red ball"

Real-Time Voice Command Node (ROS 2)

import whisper
import sounddevice as sd
import numpy as np
import rclpy
from rclpy.node import Node
from std_msgs.msg import String

SAMPLE_RATE = 16000
RECORD_SECONDS = 3  # Listen for 3 seconds

class VoiceCommandNode(Node):
    def __init__(self):
        super().__init__('voice_command_node')
        self.publisher = self.create_publisher(String, '/voice_command', 10)
        self.model = whisper.load_model("base")
        self.get_logger().info("Voice command node ready. Listening...")
        # Timer to capture audio every 3 seconds
        self.timer = self.create_timer(4.0, self.capture_and_transcribe)

    def capture_and_transcribe(self):
        self.get_logger().info("Listening...")
        audio = sd.rec(
            int(RECORD_SECONDS * SAMPLE_RATE),
            samplerate=SAMPLE_RATE,
            channels=1,
            dtype='float32'
        )
        sd.wait()

        # Whisper expects float32 mono audio
        audio_flat = audio.flatten()
        result = self.model.transcribe(audio_flat, fp16=False)
        text = result["text"].strip()

        if text:
            self.get_logger().info(f'Heard: "{text}"')
            msg = String()
            msg.data = text
            self.publisher.publish(msg)

def main():
    rclpy.init()
    node = VoiceCommandNode()
    rclpy.spin(node)
    rclpy.shutdown()

Multilingual Support

Whisper supports 99 languages including Urdu:

# Force Urdu transcription
result = model.transcribe("command.wav", language="ur")
print(result["text"])  # Output in Urdu script

Whisper Model Sizes

Model	Parameters	Speed	Accuracy	VRAM
`tiny`	39M	Very fast	Low	~1 GB
`base`	74M	Fast	Good	~1 GB
`small`	244M	Medium	Better	~2 GB
`medium`	769M	Slow	High	~5 GB
`large`	1550M	Very slow	Best	~10 GB

For robot applications on Jetson Orin, use base or small.

Next: LLM Cognitive Planning

What is Whisper?​

Installation​

Basic Transcription​

Real-Time Voice Command Node (ROS 2)​

Multilingual Support​

Whisper Model Sizes​