Building Real-Time Audio-Video AI Interaction with Amazon Nova and TEN Framework

How to build a low-latency, multimodal real-time audio-video AI assistant using Amazon Nova, Amazon Transcribe, Amazon Polly, and the open-source TEN framework on AWS.

zhuermu · 12 min
Amazon Nova · TEN Framework · Real-time AV · AWS · AI Interaction · EKS

Background

As artificial intelligence advances rapidly, real-time audio-video interaction applications are becoming a major market focus. By combining Amazon Nova foundation models, Amazon Transcribe, Amazon Polly, and other AWS services, you can build a powerful real-time audio-video interaction system with multimodal understanding capabilities.

This post walks through a complete solution for building such a system using the open-source TEN (Transformative Extensions Network) framework to orchestrate all the components. We cover the logical architecture, physical deployment on AWS, the core services involved, key optimization strategies for latency and cost, and a step-by-step deployment guide using Amazon EKS.

Technology Selection

We designed a real-time audio-video interaction solution based on Amazon Nova and the TEN framework. The core advantage of this approach is leveraging Amazon Nova’s multimodal capabilities to support real-time video understanding, while using the TEN framework’s plugin architecture to keep each component modular and replaceable.

The key technologies in play:

  • Amazon Nova — Foundation models with native multimodal support (text, image, video)
  • Amazon Transcribe — Real-time speech-to-text transcription
  • Amazon Polly — High-quality text-to-speech synthesis
  • TEN Framework — Open-source agent framework with a directed graph-based plugin orchestration engine
  • Agora RTC — Real-time audio-video communication network
  • Amazon EKS — Container orchestration for production deployment

Logical Architecture

The solution uses a modular logical architecture where the TEN framework orchestrates all functional modules through a plugin system. Here is how the components fit together.

Frontend User Interaction Module

  • User Terminal: Supports both web browsers and mobile applications
  • Web Server: Serves as the entry point for frontend requests, handling channel creation and authentication

TEN Agent (Core Orchestration)

The TEN Agent is the central module responsible for orchestrating and managing all plugins. It uses a Directed Cyclic Graph (DCG) to implement flexible data flow processing between modules. The key plugins include:

  • RTC Plugin: Handles sending and receiving real-time audio-video data through the Agora RTC network
  • Amazon Transcribe Plugin: Performs real-time speech recognition, converting the user’s spoken audio into text
  • Interrupt Plugin: Monitors voice interruption states so the agent can stop speaking when the user starts talking
  • Amazon Bedrock Plugin: Sends text and video frames to Amazon Nova models for multimodal reasoning and response generation
  • Amazon Polly Plugin: Converts the agent’s text responses back into natural-sounding speech

RTC Network

Real-time communication is handled through Agora’s SD-RTN (Software Defined Real-Time Network), which provides low-latency audio-video transmission globally.

Communication Channel Setup Flow

The process of establishing a real-time session follows these steps:

  1. The user client calls the HTTP endpoint /v1/api/generate to request a channel name and authentication token
  2. The Web Server processes the request and returns the channel name and token
  3. The user client uses the token to establish a communication channel with the RTC Network
  4. The user client calls /v1/api/start to request that the conversation session begin
  5. The Web Server obtains the channel information and passes it to the TEN Agent
  6. The TEN Agent joins the channel and establishes its own connection to the RTC Network

Once all connections are established, audio and video data flows bidirectionally between the user and the agent in real time.
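As a sketch, the two Web Server calls in steps 1 and 4 can be expressed as request builders. The endpoint paths come from the flow above; the JSON field names (`request_id`, `channel_name`, `token`) are illustrative assumptions, not the actual API contract:

```python
import json

def build_generate_request(request_id: str) -> tuple[str, bytes]:
    """Step 1: ask the Web Server for a channel name and auth token.

    The payload shape is a placeholder assumption.
    """
    body = json.dumps({"request_id": request_id}).encode("utf-8")
    return "/v1/api/generate", body

def build_start_request(channel_name: str, token: str) -> tuple[str, bytes]:
    """Step 4: ask the Web Server to start the agent session in the channel."""
    body = json.dumps(
        {"channel_name": channel_name, "token": token}
    ).encode("utf-8")
    return "/v1/api/start", body
```

The returned path and body can be posted with any HTTP client; the Web Server then hands the channel information to the TEN Agent as described in steps 5 and 6.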

Physical Architecture

The entire solution is deployed on AWS cloud services:

  • End users access the system through web browsers or mobile applications
  • Amazon CloudFront serves as the CDN, accelerating static asset delivery and reducing latency for the frontend
  • Application Load Balancer (ALB) routes incoming API requests to the backend services
  • Amazon EKS hosts all core services, including the Web Server and TEN Agent containers
  • Agora SD-RTN handles the actual low-latency audio-video transmission between users and the agent
  • Docker images are built and stored in Amazon ECR (Elastic Container Registry) for deployment to EKS

The backend services running on EKS connect to AWS AI services (Bedrock, Transcribe, Polly) via AWS SDK calls, keeping the architecture clean and the service boundaries well-defined.

Core AWS Services

Amazon Nova — Multimodal AI Engine

Amazon Nova is a family of foundation models that powers the intelligence behind this solution. The model selection depends on your latency, cost, and capability requirements:

  • Nova Micro — Low cost, fast inference; best for text-only tasks where speed matters most
  • Nova Lite — Low cost, handles images and video; best for quick processing of visual inputs
  • Nova Pro — Best balance of performance, speed, and cost; best for production workloads requiring multimodal reasoning
  • Nova Premier — Highest capability for complex reasoning; best for tasks requiring deep analysis and multi-step reasoning

For this real-time interaction solution, Nova Pro is the recommended default. It provides the best balance between response quality and inference latency, which is critical when users are waiting for a conversational response in real time. Nova Lite is a good alternative when cost is the primary concern and the visual understanding requirements are straightforward.
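To make the Bedrock side concrete, here is a hedged sketch of a streaming multimodal call to Nova Pro via the Bedrock Converse API. The `build_message` helper is hypothetical, and the region and model ID are assumptions — confirm the inference profile ID available in your account:

```python
MODEL_ID = "us.amazon.nova-pro-v1:0"  # assumption: verify for your region/account

def build_message(text: str, jpeg_frames: list[bytes]) -> dict:
    """Combine the user's transcribed text with sampled video frames
    into a Converse API message (one content block per modality)."""
    content = [{"text": text}]
    for frame in jpeg_frames:
        content.append(
            {"image": {"format": "jpeg", "source": {"bytes": frame}}}
        )
    return {"role": "user", "content": content}

def stream_response(text: str, jpeg_frames: list[bytes]):
    """Yield text deltas as the model produces them.

    Requires boto3 and AWS credentials with Bedrock access.
    """
    import boto3
    client = boto3.client("bedrock-runtime", region_name="us-east-1")
    response = client.converse_stream(
        modelId=MODEL_ID,
        messages=[build_message(text, jpeg_frames)],
    )
    for event in response["stream"]:
        delta = event.get("contentBlockDelta", {}).get("delta", {})
        if "text" in delta:
            yield delta["text"]
```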

Amazon Transcribe — Speech Recognition

Amazon Transcribe provides real-time speech-to-text capabilities with support for over 100 languages. In this architecture, the Transcribe plugin receives audio frames from the RTC stream and converts them to text in real time, which is then passed to the Amazon Bedrock plugin for reasoning.

Key features used in this solution:

  • Streaming transcription for real-time, low-latency processing
  • Partial result detection to enable interrupt handling (more on this below)
  • Automatic language detection for multilingual scenarios

Amazon Polly — Text-to-Speech

Amazon Polly converts the agent’s text responses into natural-sounding speech. It supports over 40 languages and returns audio via streaming, completing speech generation in under 150 milliseconds. This low latency is essential for maintaining a natural conversational flow — users should not feel like they are waiting for the agent to “speak.”

The streaming capability means Polly begins returning audio before the full text has been processed, allowing the agent to start speaking almost immediately as the response is generated.
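One way to exploit this, sketched below under assumptions: group the LLM’s streamed text deltas into sentence-sized chunks and synthesize each chunk as soon as it completes. The `sentence_chunks` splitter is an illustrative simplification, not the TEN plugin’s code, and the voice choice is an assumption:

```python
import re

def sentence_chunks(text_deltas):
    """Group streamed text deltas into sentence-sized chunks so
    synthesis can start before the full response exists."""
    buffer = ""
    for delta in text_deltas:
        buffer += delta
        while True:
            match = re.search(r"[.!?]\s", buffer)
            if not match:
                break
            end = match.end()
            yield buffer[:end].strip()
            buffer = buffer[end:]
    if buffer.strip():
        yield buffer.strip()

def synthesize(chunk: str) -> bytes:
    """Synthesize one chunk with Polly; requires boto3 and AWS credentials."""
    import boto3
    polly = boto3.client("polly", region_name="us-east-1")
    resp = polly.synthesize_speech(
        Text=chunk,
        OutputFormat="pcm",
        SampleRate="16000",
        VoiceId="Joanna",   # assumption: pick a voice per language
        Engine="neural",
    )
    # Raw PCM bytes, ready to push into the RTC audio stream
    return resp["AudioStream"].read()
```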

Key Technical Challenges and Optimization Strategies

Reducing Latency

Latency is the single most important factor in real-time conversational AI. Users expect responses within a few hundred milliseconds — anything longer breaks the illusion of a natural conversation. We address this at multiple levels:

  • Model selection: Amazon Nova Pro provides low-latency multimodal reasoning. For text-only scenarios, Nova Micro offers even faster inference times.
  • Asynchronous processing: The TEN framework supports fully asynchronous plugin execution, so multiple operations can proceed in parallel rather than sequentially.
  • Streaming throughout the pipeline: Amazon Transcribe streams partial transcription results as the user speaks. Amazon Polly streams audio as the text is generated. The Bedrock API supports streaming responses. This means every stage of the pipeline starts producing output before the previous stage has finished.
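The effect of streaming at every stage can be illustrated with a toy asyncio pipeline. The stage names mirror the flow (speech recognition feeding reasoning); the bodies are stand-ins, not the TEN framework’s implementation:

```python
import asyncio

async def stage(name: str, inbox: asyncio.Queue, outbox: asyncio.Queue):
    """Forward each item downstream as soon as it arrives, instead of
    waiting for the whole input — this is what streaming buys."""
    while True:
        item = await inbox.get()
        if item is None:            # sentinel: end of stream
            await outbox.put(None)
            return
        await asyncio.sleep(0)      # stand-in for per-chunk service latency
        await outbox.put(f"{name}({item})")

async def run_pipeline(chunks):
    q1, q2, q3 = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    tasks = [
        asyncio.create_task(stage("asr", q1, q2)),
        asyncio.create_task(stage("llm", q2, q3)),
    ]
    for c in chunks:
        await q1.put(c)
    await q1.put(None)
    outputs = []
    while (out := await q3.get()) is not None:
        outputs.append(out)
    await asyncio.gather(*tasks)
    return outputs
```

Because each stage emits per chunk, the first output appears after roughly one chunk’s worth of latency rather than the sum of all stages over the full input.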

Cost Optimization

Real-time video understanding is inherently expensive because it requires sending image frames to the model continuously. Two key strategies help control costs:

Frame sampling by time interval: Rather than sending every video frame to the model, frames are sampled at configurable time intervals. This dramatically reduces the volume of data processed while still providing the model with sufficient visual context.

Image compression and history merging: Frames are compressed before being sent to the model, and historical frames are merged to reduce the total number of images in the context window.

Here is the core video frame processing logic from the TEN Agent’s Bedrock plugin:

async def _on_video(self, ten_env: AsyncTenEnv):
    while True:
        # Wait for the next frame pushed by the RTC plugin
        [image_data, image_width, image_height] = await self.image_queue.get()
        # Convert the raw RGB frame to a Base64-encoded JPEG
        frame_buffer = rgb2base64jpeg(image_data, image_width, image_height)
        self.image_buffers.append(frame_buffer)
        # Keep only the most recent MAX_IMAGE_COUNT frames
        while len(self.image_buffers) > MAX_IMAGE_COUNT:
            self.image_buffers.pop(0)
        # Drain frames that accumulated while we slept, so the model
        # always sees the latest visual data rather than stale frames
        while not self.image_queue.empty():
            await self.image_queue.get()
        # Sampling interval: larger values cut cost, lower temporal resolution
        await asyncio.sleep(VIDEO_FRAME_INTERVAL)

This function runs in a continuous loop, pulling video frames from a queue, converting them to Base64-encoded JPEG, and maintaining a sliding window of recent frames (MAX_IMAGE_COUNT). Older frames are discarded as new ones arrive. The VIDEO_FRAME_INTERVAL parameter controls how frequently frames are sampled — increasing this value reduces cost at the expense of temporal resolution. The inner while loop drains any accumulated frames from the queue, ensuring the agent always works with the most recent visual data rather than processing stale frames.

Interrupt and Completion Signal Detection

In natural conversation, people frequently interrupt each other. The agent needs to detect when the user starts speaking and immediately stop its own audio output. This is handled through the Amazon Transcribe plugin’s transcript event processing:

async def handle_transcript_event(self, transcript_event):
    results = transcript_event.transcript.results
    for result in results:
        # Partial results mean the user is still speaking
        is_final = not result.is_partial
        text_result = ""
        for alt in result.alternatives:
            text_result += alt.transcript
        create_and_send_data(
            ten=self.ten,
            text_result=text_result,
            is_final=is_final,
            stream_id=self.stream_id,
        )

The key distinction here is between partial and final transcript results. Partial results indicate that the user is currently speaking — this triggers the Interrupt Plugin to signal the Polly plugin to stop playing audio. Final results indicate that the user has finished a complete utterance, which triggers the full processing pipeline: the text is sent to Nova for reasoning, and the response is converted back to speech.

This partial/final detection mechanism is what makes the conversation feel natural rather than turn-based. Without it, the agent would either talk over the user or wait awkwardly for long pauses before responding.
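The decision logic can be sketched as a small state machine. The action names ("flush", "process") are illustrative placeholders, not the framework’s actual command names:

```python
class InterruptDetector:
    """Decide what to do with each transcript result, given whether
    the agent is currently playing audio."""

    def __init__(self):
        self.agent_speaking = False

    def on_agent_audio_start(self):
        self.agent_speaking = True

    def on_agent_audio_end(self):
        self.agent_speaking = False

    def on_transcript(self, text: str, is_final: bool) -> list[str]:
        actions = []
        # A partial result while the agent is talking means the user
        # barged in: stop playback immediately.
        if not is_final and self.agent_speaking:
            actions.append("flush")
            self.agent_speaking = False
        # A final result is a complete utterance: run the full pipeline
        # (Nova reasoning, then Polly synthesis).
        if is_final and text.strip():
            actions.append("process")
        return actions
```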

Modular Architecture with Hot-Swappable Plugins

The TEN framework’s plugin architecture means every component in the pipeline can be replaced independently. Want to swap Amazon Transcribe for a different speech recognition service? Replace the Transcribe plugin without touching any other code. Want to add a translation step between transcription and reasoning? Insert a new plugin into the directed graph.

This modularity is particularly valuable during development and experimentation. You can test different model configurations, swap between Nova Pro and Nova Lite for A/B testing, or add custom preprocessing plugins without redesigning the architecture.
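As an illustration only — this is not the exact TEN configuration schema, and the addon and message names are invented — a graph definition wiring the plugins in this post might look like:

```json
{
  "nodes": [
    {"type": "extension", "name": "rtc", "addon": "agora_rtc"},
    {"type": "extension", "name": "stt", "addon": "transcribe_asr"},
    {"type": "extension", "name": "llm", "addon": "bedrock_llm"},
    {"type": "extension", "name": "tts", "addon": "polly_tts"}
  ],
  "connections": [
    {"extension": "rtc", "audio_frame": [{"name": "pcm_frame", "dest": [{"extension": "stt"}]}]},
    {"extension": "stt", "data": [{"name": "text_data", "dest": [{"extension": "llm"}]}]},
    {"extension": "llm", "data": [{"name": "text_data", "dest": [{"extension": "tts"}]}]},
    {"extension": "tts", "audio_frame": [{"name": "pcm_frame", "dest": [{"extension": "rtc"}]}]}
  ]
}
```

Swapping the speech recognizer then means changing one node’s addon and leaving every edge intact.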

Deployment Guide

Prerequisites

Before deploying, you need:

  1. An AWS account with appropriate IAM permissions for EKS, ECR, Bedrock, Transcribe, Polly, and CloudFront
  2. An Agora account with RTC service enabled (for the real-time audio-video network)
  3. AWS CLI and eksctl installed and configured on your local machine
  4. Docker installed for building container images
  5. kubectl installed for managing the EKS cluster

Building and Pushing Docker Images

First, authenticate Docker with your Amazon ECR registry and build the TEN Agent image:

# Authenticate Docker with ECR
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS \
  --password-stdin <your_account_id>.dkr.ecr.us-east-1.amazonaws.com

# Clone the TEN Agent repository
git clone https://github.com/zhuermu/TEN-Agent.git
cd TEN-Agent

# Build the Docker image
docker build -t dev/ten_agent_build .

# Tag and push to ECR
docker tag dev/ten_agent_build:latest \
  <your_account_id>.dkr.ecr.us-east-1.amazonaws.com/dev/ten_agent_build:latest
docker push <your_account_id>.dkr.ecr.us-east-1.amazonaws.com/dev/ten_agent_build:latest

Creating the EKS Cluster

Create the EKS cluster using a configuration file:

# cluster-config.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: ten-framework-cluster
  region: us-east-1
  version: "1.31"

managedNodeGroups:
  - name: ten-workers
    instanceType: c5.2xlarge
    desiredCapacity: 2
    minSize: 1
    maxSize: 4
    volumeSize: 100
    iam:
      withAddonPolicies:
        ebs: true
        efs: true

Apply the cluster configuration:

eksctl create cluster -f cluster-config.yaml

Deploying the TEN Agent Services

Create the Kubernetes namespace and deploy the services:

# Create the namespace
kubectl create namespace ten-framework --save-config
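The deployment manifest below reads the Agora credentials from a Secret named ten-agent-secrets, so create it before applying the deployment. A minimal manifest (the filename and placeholder values are assumptions):

```yaml
# secrets.k8s.yaml
apiVersion: v1
kind: Secret
metadata:
  name: ten-agent-secrets
  namespace: ten-framework
type: Opaque
stringData:
  agora-app-id: <your_agora_app_id>
  agora-app-certificate: <your_agora_app_certificate>
```

Apply it with `kubectl apply -f secrets.k8s.yaml`.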

Apply the deployment manifest:

# deployment.k8s.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ten-agent
  namespace: ten-framework
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ten-agent
  template:
    metadata:
      labels:
        app: ten-agent
    spec:
      containers:
        - name: ten-agent
          image: <your_account_id>.dkr.ecr.us-east-1.amazonaws.com/dev/ten_agent_build:latest
          ports:
            - containerPort: 8080
          env:
            - name: AWS_REGION
              value: "us-east-1"
            - name: BEDROCK_MODEL_ID
              value: "us.amazon.nova-pro-v1:0"
            - name: AGORA_APP_ID
              valueFrom:
                secretKeyRef:
                  name: ten-agent-secrets
                  key: agora-app-id
            - name: AGORA_APP_CERTIFICATE
              valueFrom:
                secretKeyRef:
                  name: ten-agent-secrets
                  key: agora-app-certificate
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
            limits:
              cpu: "2"
              memory: "4Gi"

# Apply the deployment and service manifests
kubectl apply -n ten-framework -f deployment.k8s.yaml
kubectl apply -n ten-framework -f service.k8s.yaml

Verifying the Deployment

Once the pods are running, verify the deployment:

# Check pod status
kubectl get pods -n ten-framework

# Check service endpoints
kubectl get svc -n ten-framework

# View logs for troubleshooting
kubectl logs -n ten-framework -l app=ten-agent --tail=100

Wrapping Up

Building a real-time audio-video AI interaction system involves coordinating many moving parts: speech recognition, multimodal reasoning, text-to-speech, real-time communication, and interrupt handling. The combination of Amazon Nova’s multimodal capabilities with the TEN framework’s plugin orchestration makes this significantly more tractable than building everything from scratch.

The key takeaways from this architecture:

  • Amazon Nova Pro provides the best balance of quality and latency for real-time multimodal conversations
  • Frame sampling and compression are essential for controlling costs when processing continuous video streams
  • Streaming at every stage (Transcribe, Bedrock, Polly) is what makes the latency acceptable for real-time interaction
  • The TEN framework’s DCG-based plugin architecture keeps the system modular and each component independently replaceable
  • Amazon EKS provides a solid foundation for production deployment with auto-scaling capabilities

The full source code for the TEN Agent with Amazon Nova integration is available on GitHub. If you are building real-time conversational AI applications on AWS, this architecture gives you a production-ready starting point with well-defined extension points for customization.