Skip to main content

Whisper — Converting Voice Commands to Text

What is Whisper?

OpenAI Whisper is an open-source speech recognition model. It converts spoken audio into text with high accuracy across many languages, including Urdu and other non-English languages.

Best part: Whisper runs completely locally — no API key, no internet, no cost.

Installation

pip install openai-whisper
# Also needs ffmpeg
sudo apt install ffmpeg # Ubuntu

Basic Transcription

import whisper

# Load model (sizes: tiny, base, small, medium, large)
# "base" is a good balance of speed and accuracy
model = whisper.load_model("base")

# Transcribe an audio file
result = model.transcribe("command.wav")
print(result["text"])
# Output: "Move forward and pick up the red ball"

Real-Time Voice Command Node (ROS 2)

import whisper
import sounddevice as sd
import numpy as np
import rclpy
from rclpy.node import Node
from std_msgs.msg import String

SAMPLE_RATE = 16000
RECORD_SECONDS = 3 # Listen for 3 seconds

class VoiceCommandNode(Node):
def __init__(self):
super().__init__('voice_command_node')
self.publisher = self.create_publisher(String, '/voice_command', 10)
self.model = whisper.load_model("base")
self.get_logger().info("Voice command node ready. Listening...")
# Timer to capture audio every 3 seconds
self.timer = self.create_timer(4.0, self.capture_and_transcribe)

def capture_and_transcribe(self):
self.get_logger().info("Listening...")
audio = sd.rec(
int(RECORD_SECONDS * SAMPLE_RATE),
samplerate=SAMPLE_RATE,
channels=1,
dtype='float32'
)
sd.wait()

# Whisper expects float32 mono audio
audio_flat = audio.flatten()
result = self.model.transcribe(audio_flat, fp16=False)
text = result["text"].strip()

if text:
self.get_logger().info(f'Heard: "{text}"')
msg = String()
msg.data = text
self.publisher.publish(msg)

def main():
rclpy.init()
node = VoiceCommandNode()
rclpy.spin(node)
rclpy.shutdown()

Multilingual Support

Whisper supports 99 languages including Urdu:

# Force Urdu transcription
result = model.transcribe("command.wav", language="ur")
print(result["text"]) # Output in Urdu script

Whisper Model Sizes

ModelParametersSpeedAccuracyVRAM
tiny39MVery fastLow~1 GB
base74MFastGood~1 GB
small244MMediumBetter~2 GB
medium769MSlowHigh~5 GB
large1550MVery slowBest~10 GB

For robot applications on Jetson Orin, use base or small.


Next: LLM Cognitive Planning