
    How to build a LiveKit app with real-time Speech-to-Text

    December 20, 2024


    LiveKit is a powerful platform for building real-time audio and video applications. It builds on top of WebRTC, abstracting away the complicated details of real-time infrastructure so that developers can rapidly build and deploy applications for video conferencing, livestreaming, interactive virtual events, and more.

    Beyond the core real-time capabilities, LiveKit also provides a flexible agents system, which allows developers to incorporate programmatic agents into their applications for additional functionality. For example, you can use AI agents to add Speech-to-Text or LLM capabilities and build multimodal, real-time AI applications.

    In this guide, we’ll show you how to add real-time Speech-to-Text to your LiveKit application using AssemblyAI’s new Python LiveKit integration. This allows you to transcribe audio streams in real-time so that you can do backend processing, or so you can display the transcriptions in your application’s UI. Here’s what we’ll build today:



    (Demo video, 0:32)



    We’ll get started with an overview of LiveKit and its constructs, but if you’re already familiar with LiveKit you can jump straight to the code here. You can find the code for this tutorial in this repository. Let’s get started!

    LiveKit basics

    LiveKit is an extremely flexible platform. It is open source, allowing you to self-host your own infrastructure, and it offers a wide range of SDKs for building real-time applications on top of clean, idiomatic interfaces.

    At the core of a LiveKit application is a LiveKit Server. Users connect to the server and can publish streams of data to it. These streams are commonly audio or video, but any arbitrary data stream can be used. Additionally, users can subscribe to streams published by other users.

    [Diagram: Users can publish their own audio/video feeds and subscribe to other participants’ feeds in order to build, for example, a video conferencing application.]

    The LiveKit Server acts as a Selective Forwarding Unit, which is a fancy way of saying that it accepts all of these incoming streams and sends (forwards) them to the appropriate users (i.e. selectively). In this way, the LiveKit Server is a central orchestrator that removes the need for peer-to-peer connections between all users, which would drastically drive up bandwidth and compute requirements for applications with many users. Additionally, the LiveKit Server can send lower bitrate/resolution video for, e.g., thumbnail views, further lowering bandwidth requirements. This approach is what allows LiveKit applications to seamlessly scale to large numbers of users.

    [Diagram: The user (red) sends their audio and video streams to the LiveKit server, which forwards them to the other participants (left). This is in contrast to a peer-to-peer system (right), where the user would have to send their streams to every other participant.]

    Additionally, because LiveKit is unopinionated (it simply provides a mechanism for exchanging real-time data via the publication and subscription of streams), it is flexible enough to support a wide range of real-time applications.

    LiveKit constructs

    With this general context in mind, we can now express this information in LiveKit’s own terminology. LiveKit has three fundamental constructs – participants, tracks, and rooms.

    Participants are members of a LiveKit application, which means they are participating in a real-time session. Participants can be end-users connecting to your application, processes that are ingesting media into or exporting it from your application, or AI agents that can process media in real-time.

    These participants publish tracks, which are the streams of information mentioned above. These streams will generally be audio and video for end-users, but they could also be, for example, streams of text, as we will see in this tutorial, where our AI agent performing Streaming Speech-to-Text publishes a stream of transcripts.

    The participants are members of rooms, which are logical groupings of participants that can publish and subscribe to each other’s tracks (subject, of course, to your application’s permissions and logic). Participants in the same room receive notifications when other participants make changes to their tracks, like adding, removing, or modifying them.
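
    To make these constructs concrete, here is a minimal sketch of connecting to a room and reacting to its events with the LiveKit Python SDK. It is illustrative only and not part of the agent we build below; it assumes you already have a server URL and a signed access token (both covered in the next sections).

    import asyncio

    from livekit import rtc

    async def main(url: str, token: str) -> None:
        room = rtc.Room()

        # Event handlers are bound with the same decorator pattern our agent uses later.
        @room.on("participant_connected")
        def on_participant_connected(participant: rtc.RemoteParticipant):
            print(f"participant joined: {participant.identity}")

        @room.on("track_subscribed")
        def on_track_subscribed(
            track: rtc.Track,
            publication: rtc.TrackPublication,
            participant: rtc.RemoteParticipant,
        ):
            print(f"subscribed to a {track.kind} track from {participant.identity}")

        await room.connect(url, token)
        print(f"connected to room: {room.name}")
        await asyncio.sleep(60)  # stay connected for a minute, then exit

    # asyncio.run(main("wss://your-project.livekit.cloud", "<access token>"))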

    For additional information, including the fields/attributes of the relevant objects, check out LiveKit’s Docs. Now that we have the overarching basics of LiveKit down, let’s see what it actually takes to build a LiveKit application.

    Getting started with LiveKit

    In order to build a LiveKit application with real-time Speech-to-Text, you’ll need three essential components:

    1. A LiveKit Server, to which the frontend will connect
    2. A frontend application, which end-users will interact with
    3. An AI agent that will transcribe the audio streams in real time

    Let’s start by setting up the LiveKit Server.

    Step 1 – Set up a LiveKit server

    LiveKit is open-source, which means you can self-host your own LiveKit server. This is a great option if you want to have full control over your infrastructure, or if you want to customize the LiveKit server to your specific needs. In this tutorial, we’ll use the LiveKit Cloud service, which is a hosted version of LiveKit that is managed by the LiveKit team. This will make it easy for us to get up and running quickly and is free for small applications.
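
    If you would rather self-host for local development, a common approach is to run the official livekit/livekit-server Docker image in dev mode. The command below is a hedged sketch: dev mode uses built-in development credentials, and you should check the LiveKit docs for the exact ports and flags your setup needs.

    # Hedged sketch: local LiveKit server in dev mode (not used in this tutorial)
    docker run --rm -p 7880:7880 -p 7881:7881 -p 7882:7882/udp \
        livekit/livekit-server --dev

    For the rest of this tutorial, though, we’ll stick with LiveKit Cloud.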

    Go to livekit.io and sign up for a LiveKit account. You will be met with a page that prompts you to create your first app. Name your app streaming-stt (streaming Speech-to-Text), and click “Continue”. After answering a few questions about your use-case, you will be taken to the dashboard for your new app:

    Your dashboard shows information about your LiveKit project, which is essentially a management layer for your LiveKit server. You can find usage information and active sessions, as well as what we’re interested in – the server URL and the API keys. Go to Settings > Keys and you will see the default API key that was created when you initialized your project:

    [Screenshot: the Keys page showing your project’s default API key]

    In a terminal, create a project directory and navigate into it:

    mkdir livekit-stt
    cd livekit-stt
    

    Inside your project directory, create a .env file to store the credentials for your application and add the following:

    LIVEKIT_URL=
    LIVEKIT_API_KEY=
    LIVEKIT_API_SECRET=
    

    Back on the Keys page in your LiveKit dashboard, click the default API key for your app. This will display a popup modal where you can copy over each of the values and paste them into your .env file (you will have to click to reveal your secret key):

    [Screenshot: the API key details modal with the URL, API key, and secret]

    Your .env file should now look something like this:

    LIVEKIT_URL=wss://streaming-stt-SOME_NUMBER.livekit.cloud
    LIVEKIT_API_KEY=SHORT_ALPHANUMERIC_STRING
    LIVEKIT_API_SECRET=REALLY_LONG_ALPHANUMERIC_STRING
    

    Note: Your .env file contains your credentials – make sure to keep it secure and never commit it to source control.
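
    If you’re using git, it’s worth keeping both the credentials file and the virtual environment we’ll create shortly out of version control. A minimal .gitignore might look like this:

    # .gitignore
    .env
    venv/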

    Step 2 – Set up the LiveKit Agents Playground

    Now that our server is set up, we can move on to the frontend application. LiveKit has a range of SDKs that make it easy to build in any environment. In our case, we’ll use the LiveKit Agents Playground, a web application for testing the LiveKit agents system, which will let us quickly try out the Speech-to-Text agent we build in the next section. The Agents Playground is open source, so feel free to read through its code for inspiration when you’re building your own project.

    Additionally, we don’t even have to set up the Agents Playground ourselves – LiveKit has a hosted version that we can use. Go to agents-playground.livekit.io and you will either be automatically signed in or be met with a prompt to connect to LiveKit Cloud:

    [Screenshot: the prompt to connect to LiveKit Cloud]

    Sign in if prompted, and select the streaming-stt project to connect to it:

    [Screenshot: selecting the streaming-stt project]

    You will be taken to the Agents Playground, which is connected to a LiveKit server for your streaming-stt project. On the right, you will see the ID of the room you are connected to, as well as your own participant ID.

    [Screenshot: the Agents Playground showing the room ID and your participant ID]

    You can disconnect for now by clicking the button in the top right – it’s time to build our Speech-to-Text agent!

    Step 3 – Build a real-time Speech-to-Text agent

    Before we start writing code, we need to get an AssemblyAI API key – you can get one here. Currently, the free offering includes over 400 hours of asynchronous Speech-to-Text as well as access to Audio Intelligence models, but it does not include Streaming Speech-to-Text. You will need to add a payment method to your account to access the Streaming Speech-to-Text API. Once you have done so, you can find your API key on the front page of your dashboard:

    [Screenshot: the AssemblyAI dashboard with your API key]

    Copy it, and paste it into your .env file:

    ASSEMBLYAI_API_KEY=YOUR-KEY-HERE
    

    Now we’re ready to start coding. Back in your project directory, create a virtual environment:

    # Mac/Linux
    python3 -m venv venv
    . venv/bin/activate
    
    # Windows
    python -m venv venv
    venv\Scripts\activate.bat
    

    Next, install the required packages:

    pip install livekit-agents livekit-plugins-assemblyai python-dotenv
    

    This command installs the LiveKit Agents framework (along with the LiveKit Python SDK it depends on), the AssemblyAI plugin for LiveKit, and the python-dotenv package, which you’ll use to load your environment variables from your .env file.
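
    As an optional sanity check (not part of the agent itself), you can confirm that the variables from your .env file load correctly. The variable names below assume the .env file we created earlier:

    # check_env.py: optional; run with the virtual environment active
    import os

    from dotenv import load_dotenv

    load_dotenv()

    for name in ("LIVEKIT_URL", "LIVEKIT_API_KEY", "LIVEKIT_API_SECRET", "ASSEMBLYAI_API_KEY"):
        print(name, "is set" if os.getenv(name) else "is MISSING")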

    Now it’s time to build the agent, which will be based on an example from LiveKit’s examples repository. Create a new Python file in your project directory called stt_agent.py and add the following:

    import asyncio
    import logging
    
    from dotenv import load_dotenv
    from livekit import rtc
    from livekit.agents import (
        AutoSubscribe,
        JobContext,
        WorkerOptions,
        cli,
        stt,
        transcription,
    )
    from livekit.plugins import assemblyai
    
    load_dotenv()
    
    logger = logging.getLogger("transcriber")
    

    We start with our imports, load our environment variables, and then instantiate a logger for our agent. Now we can move on to writing the main agent code.

    Define the entrypoint function

    We start by defining an entrypoint function which executes when the agent is connected to the room.

    async def entrypoint(ctx: JobContext):
        logger.info(f"starting transcriber (speech to text) example, room: {ctx.room.name}")
        stt_impl = assemblyai.STT()
    
    

    The entrypoint function is an asynchronous function that accepts a JobContext. We then log a message and instantiate an assemblyai.STT() object. This object is responsible for handling the Speech-to-Text and satisfies the LiveKit Agents stt.STT interface.

    Next, still within the entrypoint function, we define an inner function that tells the agent what to do when it subscribes to a new track:

        @ctx.room.on("track_subscribed")
        def on_track_subscribed(
            track: rtc.Track,
            publication: rtc.TrackPublication,
            participant: rtc.RemoteParticipant,
        ):
            if track.kind == rtc.TrackKind.KIND_AUDIO:
                asyncio.create_task(transcribe_track(participant, track))
    

    The decorator indicates which event this function is bound to, in this case track subscription. The function creates a new asynchronous task that transcribes the audio track using the transcribe_track function we’ll add next.

    Add the following inner function to your entrypoint function:

        async def transcribe_track(participant: rtc.RemoteParticipant, track: rtc.Track):
            audio_stream = rtc.AudioStream(track)
            stt_stream = stt_impl.stream()
            stt_forwarder = transcription.STTSegmentsForwarder(
                room=ctx.room, participant=participant, track=track
            )
    
            # Run tasks for audio input and transcription output in parallel
            await asyncio.gather(
                _handle_audio_input(audio_stream, stt_stream),
                _handle_transcription_output(stt_stream, stt_forwarder),
            )
    

    This function first creates an AudioStream object from the track, and then creates an AssemblyAI SpeechStream object using the .stream() method of our assemblyai.STT() object. The SpeechStream object represents the bidirectional communication stream between your LiveKit agent and AssemblyAI – audio segments are forwarded to AssemblyAI, and transcripts are received. Next, the function creates an STTSegmentsForwarder object, which is responsible for forwarding the transcripts to the room so that they can be displayed on the frontend.

    To transcribe the track, we need to do two things in parallel: send the audio we receive from the LiveKit server to AssemblyAI for transcription, and receive the resulting transcripts from AssemblyAI and forward them back to the LiveKit server. We do this using the asyncio.gather function, which runs these two tasks concurrently. We will define these tasks next.
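
    If asyncio.gather is new to you, here is a small, self-contained illustration of the pattern in plain Python, independent of LiveKit: two coroutines run concurrently, and gather only returns once both have finished.

    import asyncio

    async def producer():
        # Stand-in for _handle_audio_input: continuously push data somewhere.
        for i in range(3):
            await asyncio.sleep(0.1)
            print(f"pushed frame {i}")

    async def consumer():
        # Stand-in for _handle_transcription_output: continuously receive results.
        for i in range(3):
            await asyncio.sleep(0.15)
            print(f"received transcript {i}")

    async def main():
        # Both coroutines run concurrently; gather returns when both complete.
        await asyncio.gather(producer(), consumer())

    asyncio.run(main())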

    First, we define _handle_audio_input. Add the following inner function to entrypoint:

        async def _handle_audio_input(
            audio_stream: rtc.AudioStream, stt_stream: stt.SpeechStream
        ):
            """Pushes audio frames to the speech-to-text stream."""
            async for ev in audio_stream:
                stt_stream.push_frame(ev.frame)
    

    This function listens for audio frames from the AudioStream object and pushes them to the SpeechStream object. The AudioStream object is an asynchronous generator that yields audio frames from the subscribed track, which we forward to AssemblyAI using the push_frame method of stt_stream. Now add this inner function to entrypoint:

        async def _handle_transcription_output(
            stt_stream: stt.SpeechStream, stt_forwarder: transcription.STTSegmentsForwarder
        ):
            """Receives transcription events from the speech-to-text service."""
            async for ev in stt_stream:
                if ev.type == stt.SpeechEventType.FINAL_TRANSCRIPT:
                    print(" -> ", ev.alternatives[0].text)
    
                stt_forwarder.update(ev)
    

    This function does the converse of _handle_audio_input – it listens for speech events from the SpeechStream object and forwards them to the STTSegmentsForwarder object, which in turn forwards them to the LiveKit server. When it receives a FINAL_TRANSCRIPT event, it prints the transcript to the console. You can also add logic to, for example, print out INTERIM_TRANSCRIPTs (a variant is sketched below) – you can learn about the difference between interim (or partial) transcripts and final transcripts in this section of our blog on transcribing Twilio calls in real time. You can see all of the speech event types here.
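
    As a hedged example, a variant of the handler that also surfaces interim results might look like the following; it assumes the INTERIM_TRANSCRIPT event type exposed by the livekit-agents stt module:

        async def _handle_transcription_output(
            stt_stream: stt.SpeechStream, stt_forwarder: transcription.STTSegmentsForwarder
        ):
            """Prints interim and final transcripts, and forwards every event."""
            async for ev in stt_stream:
                if ev.type == stt.SpeechEventType.INTERIM_TRANSCRIPT:
                    # Interim (partial) transcripts arrive quickly but may still change.
                    print(" ~  ", ev.alternatives[0].text)
                elif ev.type == stt.SpeechEventType.FINAL_TRANSCRIPT:
                    # Final transcripts are punctuated and formatted.
                    print(" -> ", ev.alternatives[0].text)

                stt_forwarder.update(ev)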

    Finally, add the following line to the entrypoint function (at its root level) to connect to the LiveKit room and automatically subscribe to any published audio tracks:

        await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
    

    To summarize:

    1. The entrypoint function is executed when the agent connects to the LiveKit room
    2. The agent automatically subscribes to every audio track published to the room
    3. For each of these tracks, the agent creates an asynchronous task which simultaneously:
      1. Pushes audio frames to the AssemblyAI Speech-to-Text stream
      2. Receives transcription events from the AssemblyAI Speech-to-Text stream, prints them to the agent server console if they are FINAL_TRANSCRIPTs, and forwards them to the LiveKit room so that they can be sent to participants, in our case to power the “Chat” feature on the frontend.

    So, your entrypoint function should now look like this:

    async def entrypoint(ctx: JobContext):
        logger.info(f"Starting transcriber (speech to text) example, room: {ctx.room.name}")
        stt_impl = assemblyai.STT()
    
        @ctx.room.on("track_subscribed")
        def on_track_subscribed(
            track: rtc.Track,
            publication: rtc.TrackPublication,
            participant: rtc.RemoteParticipant,
        ):
            if track.kind == rtc.TrackKind.KIND_AUDIO:
                asyncio.create_task(transcribe_track(participant, track))
    
        async def transcribe_track(participant: rtc.RemoteParticipant, track: rtc.Track):
            """
            Handles the parallel tasks of sending audio to the STT service and 
            forwarding transcriptions back to the app.
            """
            audio_stream = rtc.AudioStream(track)
            stt_forwarder = transcription.STTSegmentsForwarder(
                room=ctx.room, participant=participant, track=track
            )
    
            stt_stream = stt_impl.stream()
    
            # Run tasks for audio input and transcription output in parallel
            await asyncio.gather(
                _handle_audio_input(audio_stream, stt_stream),
                _handle_transcription_output(stt_stream, stt_forwarder),
            )
    
        async def _handle_audio_input(
            audio_stream: rtc.AudioStream, stt_stream: stt.SpeechStream
        ):
            """Pushes audio frames to the speech-to-text stream."""
            async for ev in audio_stream:
                stt_stream.push_frame(ev.frame)
    
        async def _handle_transcription_output(
            stt_stream: stt.SpeechStream, stt_forwarder: transcription.STTSegmentsForwarder
        ):
            """Receives transcription events from the speech-to-text service."""
            async for ev in stt_stream:
                if ev.type == stt.SpeechEventType.FINAL_TRANSCRIPT:
                    print(" -> ", ev.alternatives[0].text)
    
                stt_forwarder.update(ev)
    
        await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
    

    Define the main loop

    Finally, we define the main loop of our agent, which is responsible for connecting to the LiveKit room and running the entrypoint function. Add the following code to your stt_agent.py file:

    if __name__ == "__main__":
        cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
    

    When the script is run, we use LiveKit’s cli.run_app method to start the agent worker, registering entrypoint (via WorkerOptions) as the function to execute when the agent joins a room.

    Run the application

    Go back to the Agents Playground in your browser and click Connect. Remember, the Playground is connected to your LiveKit project. Now, in your terminal, start the agent with the command below, making sure the virtual environment you created earlier is active:

    python stt_agent.py dev
    

    The agent connects to your LiveKit project by using the credentials in your .env file. In the Playground, you will see the Agent connected status change from FALSE to TRUE after starting your agent.

    Begin speaking, and you will see your speech transcribed in real time. After you complete a sentence, it will be punctuated and formatted, and then a new line will be started for the next sentence in the chat box on the Playground.

    In your terminal where the agent is running, you will see only the final punctuated/formatted utterances printed, because this is the behavior we defined in our stt_agent.py file.

    You can see this process in action here:



    (Demo video, 0:32)



    That’s it! You’ve successfully built a real-time Speech-to-Text agent for your LiveKit application. You can now use this agent to transcribe audio streams in real-time, and display the transcripts in your application’s UI.

    Remember, you can self-host any part of this application, including the LiveKit server and the frontend application. Check out the LiveKit docs for more information on building LiveKit applications and working with AI agents.
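
    If you do build your own frontend, each end-user will need an access token signed with your API key and secret. Here is a hedged sketch using the separate livekit-api package (pip install livekit-api); the room name and identity below are placeholders, not values from this tutorial:

    # mint_token.py: illustrative only, not part of the agent above
    import os

    from dotenv import load_dotenv
    from livekit import api

    load_dotenv()

    token = (
        api.AccessToken(os.environ["LIVEKIT_API_KEY"], os.environ["LIVEKIT_API_SECRET"])
        .with_identity("demo-user")  # placeholder identity
        .with_grants(api.VideoGrants(room_join=True, room="streaming-stt-demo"))  # placeholder room
        .to_jwt()
    )
    print(token)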

    Final words

    In this tutorial, we showed you how to add real-time Speech-to-Text to your LiveKit application using AssemblyAI’s new Python LiveKit integration. We walked through the basics of LiveKit, how to set up a LiveKit server, how to build a real-time Speech-to-Text agent, and how to connect the agent to your LiveKit application.

    Check out AssemblyAI’s docs to learn more about other models we offer beyond Streaming Speech-to-Text. Otherwise, feel free to check out our YouTube channel or blog to learn more about building with AI and AI theory, like this video on building a Chatbot in Python with Claude 3.5 Sonnet.
