
    How to Do Hotword Detection with Streaming Speech-to-Text and Go

    June 25, 2024

    If you’ve ever wanted to create a personal AI assistant like Siri or Alexa, the first step is to figure out how to trigger the AI using a specific word or phrase (also known as a hotword). All of the prevalent AI systems use a similar approach; for Alexa, the hotword is “Alexa,” and for Siri, the hotword is “Hey Siri.”

    In this tutorial, you’ll learn how to implement a hotword detection system using AssemblyAI’s Streaming Speech-to-Text API. In homage to Iron Man, the assistant in the example is called Jarvis, but you’re welcome to name it whatever you want. This tutorial uses the AssemblyAI Go SDK, but if Go isn’t your preferred language, you’re welcome to use any other supported programming languages.

    Before you start

    To complete this tutorial, you’ll need:

    • Go installed
    • PortAudio installed

    Set up your environment

    You’ll use the Go bindings of PortAudio to get raw audio data from your microphone and the AssemblyAI Go SDK to interface with AssemblyAI.

    Let’s start by creating a new Go project and setting up all of the required dependencies. To do so, navigate to your preferred directory and run the following commands:

    mkdir jarvis
    cd jarvis
    go mod init jarvis
    go get github.com/gordonklaus/portaudio
    go get github.com/AssemblyAI/assemblyai-go-sdk

    If the execution is successful, you should end up with the following directory structure:

    jarvis
    ├── go.mod
    └── go.sum

    Next up, you’ll need an AssemblyAI API key.

    Create an AssemblyAI account

    To get an API key, you first need to create an AssemblyAI account. Go to the sign-up page and fill out the form to get started.

    Once your account is created, you need to set up billing details. Streaming Speech-to-Text is among a few selected APIs that aren’t available on the free plan. You can set up billing by going to Billing and providing valid credit card details.

    Next, go to the dashboard and take note of your API key, as you’ll need it in the next step.

    Record the raw audio data

    With the dependencies set up and the API key in hand, you’re ready to start implementing the core logic of your personal AI assistant. The first step is to figure out how to get raw audio data from the microphone. As mentioned earlier, you’ll be using the Go bindings of the well-known PortAudio I/O library, which makes it easy to get raw data and manipulate low-level options like the sampling rate and the number of frames per buffer. This is important, as AssemblyAI is sensitive to these options and might generate an inaccurate transcript if they aren’t set correctly.

    Create a new recorder.go file in the jarvis directory, import the dependencies, and define a new recorder struct:

    package main

    import (
        "bytes"
        "encoding/binary"

        "github.com/gordonklaus/portaudio"
    )

    type recorder struct {
        stream *portaudio.Stream
        in     []int16
    }

    The recorder struct will hold a reference to the input stream and read data from that stream via the in field of the struct.

    Use the following code to configure a newRecorder function to create and initialize a new recorder struct:

    func newRecorder(sampleRate int, framesPerBuffer int) (*recorder, error) {
        in := make([]int16, framesPerBuffer)

        stream, err := portaudio.OpenDefaultStream(1, 0, float64(sampleRate), framesPerBuffer, in)
        if err != nil {
            return nil, err
        }

        return &recorder{
            stream: stream,
            in:     in,
        }, nil
    }

    This function takes in the required sample rate and frames per buffer to configure PortAudio, opens the default audio input device attached to your computer, and returns a pointer to a new recorder struct. You might want to use OpenStream instead of OpenDefaultStream if you have multiple mics connected to your computer and want to use a specific one.

    Next, define a few methods on the recorder struct pointer that you’ll use in the next step:

    func (r *recorder) Read() ([]byte, error) {
        if err := r.stream.Read(); err != nil {
            return nil, err
        }

        buf := new(bytes.Buffer)

        if err := binary.Write(buf, binary.LittleEndian, r.in); err != nil {
            return nil, err
        }

        return buf.Bytes(), nil
    }

    func (r *recorder) Start() error {
        return r.stream.Start()
    }

    func (r *recorder) Stop() error {
        return r.stream.Stop()
    }

    func (r *recorder) Close() error {
        return r.stream.Close()
    }

    The Read method reads data from the input stream, writes it to a buffer, and then returns that buffer.

    The Start, Stop, and Close methods call similarly named methods on the stream and don’t do anything unique.
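    The int16-to-bytes conversion inside Read can be checked in isolation. Below is a small sketch; samplesToBytes is a hypothetical helper name introduced only for illustration, mirroring the binary.Write call rather than being part of the tutorial’s final code:

    ```go
    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
    )

    // samplesToBytes converts int16 PCM samples to little-endian bytes,
    // mirroring the binary.Write call in the recorder's Read method.
    func samplesToBytes(in []int16) ([]byte, error) {
        buf := new(bytes.Buffer)
        if err := binary.Write(buf, binary.LittleEndian, in); err != nil {
            return nil, err
        }
        return buf.Bytes(), nil
    }

    func main() {
        b, err := samplesToBytes([]int16{1, -1})
        if err != nil {
            panic(err)
        }
        fmt.Println(b) // [1 0 255 255]
    }
    ```

    Each int16 sample becomes two bytes, least significant first, which is the raw 16-bit PCM layout the streaming setup in this tutorial sends over the wire.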

    Create a real-time transcriber

    AssemblyAI divides each session with the streaming API into multiple discrete events and requires an event handler for each of these events. These event handlers are defined as field functions on the RealTimeTranscriber struct type that is provided by AssemblyAI.

    Here are the different field functions that the RealTimeTranscriber accepts:

    transcriber := &assemblyai.RealTimeTranscriber{
        OnSessionBegins: func(event assemblyai.SessionBegins) {
            // ...
        },
        OnSessionTerminated: func(event assemblyai.SessionTerminated) {
            // ...
        },
        OnPartialTranscript: func(event assemblyai.PartialTranscript) {
            // ...
        },
        OnFinalTranscript: func(event assemblyai.FinalTranscript) {
            // ...
        },
        OnSessionInformation: func(event assemblyai.SessionInformation) {
            // ...
        },
        OnError: func(err error) {
            // ...
        },
    }

    You only need to provide implementations for the field functions you want to use. The others can be omitted if they aren’t needed.

    Create a new main.go file and add this transcriber struct:

    package main

    import (
        "fmt"

        "github.com/AssemblyAI/assemblyai-go-sdk"
    )

    var transcriber = &assemblyai.RealTimeTranscriber{
        OnSessionBegins: func(event assemblyai.SessionBegins) {
            fmt.Println("session begins")
        },

        OnSessionTerminated: func(event assemblyai.SessionTerminated) {
            fmt.Println("session terminated")
        },

        OnPartialTranscript: func(event assemblyai.PartialTranscript) {
            fmt.Printf("%s\r", event.Text)
        },

        OnFinalTranscript: func(event assemblyai.FinalTranscript) {
            fmt.Println(event.Text)
        },

        OnError: func(err error) {
            fmt.Println(err)
        },
    }

    This struct has all of the required field functions defined that you’ll be using in this tutorial.

    The events fire as follows:

    • OnSessionBegins fires when the connection is established
    • OnSessionTerminated fires when the connection is terminated
    • OnPartialTranscript fires while AssemblyAI is transcribing a new sentence
    • OnFinalTranscript fires when a sentence is completely transcribed

    OnPartialTranscript fires repeatedly until a sentence is complete, and each invocation contains the complete sentence up to that point. Only after the sentence is completely transcribed does the OnFinalTranscript event fire with the full transcribed sentence.

    The OnError function simply handles any errors that may occur during a session.

    The benefit of using a carriage return (\r) in the OnPartialTranscript function is that each call overwrites the same line in the terminal. This way, the partial output won’t clutter your screen by printing on a new line every time.

    To add hotword detection support, you need to define a hotword variable that’ll be populated via a command line argument and be compared against in the OnFinalTranscript function:

    package main

    import (
        "context"
        "fmt"
        "log"
        "os"
        "os/signal"
        "strings"
        "syscall"

        "github.com/AssemblyAI/assemblyai-go-sdk"
        "github.com/gordonklaus/portaudio"
    )

    var hotword string

    var transcriber = &assemblyai.RealTimeTranscriber{
        // truncated

        OnFinalTranscript: func(event assemblyai.FinalTranscript) {
            fmt.Println(event.Text)

            hotwordDetected := strings.Contains(
                strings.ToLower(event.Text),
                strings.ToLower(hotword),
            )

            if hotwordDetected {
                fmt.Println("I am here!")
            }
        },

        // truncated
    }

    So far, the code doesn’t contain the logic for populating the hotword variable. That’ll be done in the main function that you’ll write next.
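    The case-insensitive matching shown above is easy to verify on its own. The sketch below pulls it into a standalone function; detectHotword is a hypothetical name used for illustration, not part of the final code:

    ```go
    package main

    import (
        "fmt"
        "strings"
    )

    // detectHotword reports whether the hotword appears anywhere in the
    // transcript, ignoring case, matching the check in OnFinalTranscript.
    func detectHotword(transcript, hotword string) bool {
        return strings.Contains(
            strings.ToLower(transcript),
            strings.ToLower(hotword),
        )
    }

    func main() {
        fmt.Println(detectHotword("Hey Jarvis, what's the weather?", "jarvis")) // true
        fmt.Println(detectHotword("Nothing to see here.", "jarvis"))            // false
    }
    ```

    Because strings.Contains matches substrings, a hotword like "art" would also trigger on "start"; if that matters for your assistant, a word-boundary check (for example, with the regexp package) is one way to tighten the match.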

    Stitch everything together

    With all the required pieces in place, the only thing left to do is define a main function, invoke the AssemblyAI API, and pass in the raw audio data. Let’s start the main function by setting up a logger:

    func main() {
        logger := log.New(os.Stderr, "", log.Lshortfile)
    }

    This logger will output the logs to stderr.

    Note: The rest of the code in this section will also be written in the body of the main function.

    Next, you need to initialize PortAudio:

    // Use PortAudio to record the microphone
    portaudio.Initialize()
    defer portaudio.Terminate()

    This initializes some internal data structures and lets you open up the input stream later on, which is required before you can use any PortAudio API functions.

    Let’s populate the hotword variable next from a command line argument:

    hotword = os.Args[1]

    Optionally, you can also add a print statement after this to print the hotword to the screen:

    fmt.Println(hotword)

    Now, you need to set up a few variables for the AssemblyAI API key, input sample rate, and input frames per buffer:

    device, err := portaudio.DefaultInputDevice()
    if err != nil {
        logger.Fatal(err)
    }

    var (
        apiKey = os.Getenv("ASSEMBLYAI_API_KEY")

        // Number of samples per second
        sampleRate = device.DefaultSampleRate

        // Number of samples to send at once
        framesPerBuffer = int(0.2 * sampleRate) // 200 ms of audio
    )

    This code takes the API key from an environment variable and sets up the sampleRate and framesPerBuffer variables by letting PortAudio supply the configured sample rate of the default input device. This way, you don’t have to manually check what the sample rate of the input device is, and it’ll automatically be set correctly.
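    The 0.2 multiplier above is just milliseconds-to-samples arithmetic. A quick sketch of the calculation (framesFor is a hypothetical helper introduced only to show the math):

    ```go
    package main

    import "fmt"

    // framesFor converts a buffer duration in milliseconds into a sample
    // count at the given sample rate, as in framesPerBuffer above.
    func framesFor(ms int, sampleRate float64) int {
        return int(float64(ms) / 1000 * sampleRate)
    }

    func main() {
        fmt.Println(framesFor(200, 44100)) // 8820 samples per buffer
        fmt.Println(framesFor(200, 16000)) // 3200 samples per buffer
    }
    ```

    Smaller buffers mean lower latency but more frequent sends; 200 ms is a middle ground that keeps partial transcripts responsive without flooding the connection.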

    It’s finally time to create an AssemblyAI API client, a new recorder, and send data from the recorder to the API. Use the following code:

    client := assemblyai.NewRealTimeClientWithOptions(
        assemblyai.WithRealTimeAPIKey(apiKey),
        assemblyai.WithRealTimeSampleRate(int(sampleRate)),
        assemblyai.WithRealTimeTranscriber(transcriber),
    )

    ctx := context.Background()

    if err := client.Connect(ctx); err != nil {
        logger.Fatal(err)
    }

    rec, err := newRecorder(int(sampleRate), framesPerBuffer)
    if err != nil {
        logger.Fatal(err)
    }

    if err := rec.Start(); err != nil {
        logger.Fatal(err)
    }

    This code passes the transcriber to AssemblyAI while creating a new real-time client. It also passes the sampleRate of the microphone to AssemblyAI, as the streaming speech-to-text would fail without it. Once the client is created, it opens up a WebSocket connection to AssemblyAI via the client.Connect call. Next, it creates a new recorder using the newRecorder function and starts recording via the rec.Start() method. If anything fails in either of these stages, the code prints the error and exits.

    Now you need to add an infinite for loop that gets the data from the microphone and sends it to AssemblyAI:

    for {
        b, err := rec.Read()
        if err != nil {
            logger.Fatal(err)
        }

        // Send partial audio samples
        if err := client.Send(ctx, b); err != nil {
            logger.Fatal(err)
        }
    }

    If you try running the code you’ve written so far, it should work. However, it lacks proper resource cleanup. Let’s make sure the code catches any termination signals and cleans up its resources appropriately. There are two changes to make. First, create a new channel that’ll be notified in case of a SIGINT or SIGTERM signal. Put this code at the top of your main function:

    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)

    Next, you need to modify your for loop to add a select statement and make sure you clean up the resources in case you receive something in the sigs channel:

    for {
        select {
        case <-sigs:
            fmt.Println("stopping recording...")
            if err := rec.Stop(); err != nil {
                logger.Fatal(err)
            }
            if err := client.Disconnect(ctx, true); err != nil {
                logger.Fatal(err)
            }
            os.Exit(0)
        default:
            b, err := rec.Read()
            if err != nil {
                logger.Fatal(err)
            }

            // Send partial audio samples
            if err := client.Send(ctx, b); err != nil {
                logger.Fatal(err)
            }
        }
    }

    With the select statement in place, each loop iteration first checks whether there is anything in the sigs channel. If so, it runs the cleanup tasks; otherwise, it continues reading data from the microphone and passing it to AssemblyAI.

    As part of the cleanup, the code stops the recording using the Stop() method and disconnects the WebSocket connection to AssemblyAI via the client.Disconnect() method.
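    The select-with-default pattern is worth seeing in isolation: the default branch keeps the loop busy, and the channel case wins as soon as something arrives. Below is a minimal sketch with a plain channel standing in for sigs; runLoop is a hypothetical name used only for this demonstration:

    ```go
    package main

    import "fmt"

    // runLoop mimics the main loop: the default branch does the "work"
    // (standing in for rec.Read and client.Send) until the done channel
    // is closed, then the channel case runs and the loop returns.
    func runLoop(done chan struct{}, stopAfter int) int {
        reads := 0
        for {
            select {
            case <-done:
                return reads
            default:
                reads++
                if reads == stopAfter {
                    close(done)
                }
            }
        }
    }

    func main() {
        fmt.Println(runLoop(make(chan struct{}), 3)) // 3
    }
    ```

    A receive from a closed channel succeeds immediately, which is why closing done is enough to break out of the loop on the next iteration.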

    Review the complete code

    By now, you should have a main.go file and a recorder.go file. The recorder.go file should resemble this:

    package main

    import (
        "bytes"
        "encoding/binary"

        "github.com/gordonklaus/portaudio"
    )

    type recorder struct {
        stream *portaudio.Stream
        in     []int16
    }

    func newRecorder(sampleRate int, framesPerBuffer int) (*recorder, error) {
        in := make([]int16, framesPerBuffer)

        stream, err := portaudio.OpenDefaultStream(1, 0, float64(sampleRate), framesPerBuffer, in)
        if err != nil {
            return nil, err
        }

        return &recorder{
            stream: stream,
            in:     in,
        }, nil
    }

    func (r *recorder) Read() ([]byte, error) {
        if err := r.stream.Read(); err != nil {
            return nil, err
        }

        buf := new(bytes.Buffer)

        if err := binary.Write(buf, binary.LittleEndian, r.in); err != nil {
            return nil, err
        }

        return buf.Bytes(), nil
    }

    func (r *recorder) Start() error {
        return r.stream.Start()
    }

    func (r *recorder) Stop() error {
        return r.stream.Stop()
    }

    func (r *recorder) Close() error {
        return r.stream.Close()
    }

    And the main.go file should resemble this:

    package main

    import (
        "context"
        "fmt"
        "log"
        "os"
        "os/signal"
        "strings"
        "syscall"

        "github.com/AssemblyAI/assemblyai-go-sdk"
        "github.com/gordonklaus/portaudio"
    )

    var hotword string

    var transcriber = &assemblyai.RealTimeTranscriber{
        OnSessionBegins: func(event assemblyai.SessionBegins) {
            fmt.Println("session begins")
        },

        OnSessionTerminated: func(event assemblyai.SessionTerminated) {
            fmt.Println("session terminated")
        },

        OnPartialTranscript: func(event assemblyai.PartialTranscript) {
            fmt.Printf("%s\r", event.Text)
        },

        OnFinalTranscript: func(event assemblyai.FinalTranscript) {
            fmt.Println(event.Text)
            hotwordDetected := strings.Contains(
                strings.ToLower(event.Text),
                strings.ToLower(hotword),
            )
            if hotwordDetected {
                fmt.Println("I am here!")
            }
        },

        OnError: func(err error) {
            fmt.Println(err)
        },
    }

    func main() {
        sigs := make(chan os.Signal, 1)
        signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)

        logger := log.New(os.Stderr, "", log.Lshortfile)

        // Use PortAudio to record the microphone
        portaudio.Initialize()
        defer portaudio.Terminate()

        hotword = os.Args[1]

        device, err := portaudio.DefaultInputDevice()
        if err != nil {
            logger.Fatal(err)
        }

        var (
            apiKey = os.Getenv("ASSEMBLYAI_API_KEY")

            // Number of samples per second
            sampleRate = device.DefaultSampleRate

            // Number of samples to send at once
            framesPerBuffer = int(0.2 * sampleRate) // 200 ms of audio
        )

        client := assemblyai.NewRealTimeClientWithOptions(
            assemblyai.WithRealTimeAPIKey(apiKey),
            assemblyai.WithRealTimeSampleRate(int(sampleRate)),
            assemblyai.WithRealTimeTranscriber(transcriber),
        )

        ctx := context.Background()

        if err := client.Connect(ctx); err != nil {
            logger.Fatal(err)
        }

        rec, err := newRecorder(int(sampleRate), framesPerBuffer)
        if err != nil {
            logger.Fatal(err)
        }

        if err := rec.Start(); err != nil {
            logger.Fatal(err)
        }

        for {
            select {
            case <-sigs:
                fmt.Println("stopping recording...")
                if err := rec.Stop(); err != nil {
                    logger.Fatal(err)
                }
                if err := client.Disconnect(ctx, true); err != nil {
                    logger.Fatal(err)
                }
                os.Exit(0)
            default:
                b, err := rec.Read()
                if err != nil {
                    logger.Fatal(err)
                }

                // Send partial audio samples
                if err := client.Send(ctx, b); err != nil {
                    logger.Fatal(err)
                }
            }
        }
    }

    Run the application

    Let’s see the results of your work in action by running the code. Open up the terminal, cd into the project directory, and set up your AssemblyAI API key as an environment variable:

    export ASSEMBLYAI_API_KEY='***'
    Note: Replace *** in the command above with your AssemblyAI API key.

    Finally, run this command:

    go run . Jarvis

    This will set Jarvis as the hotword, and the code will print I am here! whenever it sees Jarvis in the output from AssemblyAI.

    Conclusion

    In this tutorial, you learned how to create an application that detects a hotword using the AssemblyAI API. You saw how PortAudio makes it easy to get raw data from a microphone and how AssemblyAI allows you to make sense of that data and transcribe it.

    AssemblyAI’s Streaming Speech-to-Text API opens up a world of possibilities for developers seeking to enhance their applications with cutting-edge AI.

    Whether it’s transcribing phone calls for customer service analytics, generating subtitles for video content, or creating accessibility solutions for the hearing impaired, AssemblyAI’s powerful Speech AI models provide a versatile toolkit for developers to innovate and improve user experiences with the power of voice. With its high accuracy, ease of integration, and streaming capabilities, AssemblyAI empowers developers to unlock the full potential of voice data in their applications.
