Building Real-Time Audio-Video AI Interaction with Amazon Nova and TEN Framework

How to build a low-latency, multimodal real-time audio-video AI assistant using Amazon Nova, Amazon Transcribe, Amazon Polly, and the open-source TEN framework on AWS.

zhuermu · 12 min
Amazon Nova · TEN Framework · Real-time AV · AWS · AI Interaction · EKS

Background

As artificial intelligence advances rapidly, real-time audio-video interaction applications are becoming a major market focus. By combining Amazon Nova foundation models, Amazon Transcribe, Amazon Polly, and other AWS services, you can build a powerful real-time audio-video interaction system with multimodal understanding capabilities.

This post walks through a complete solution for building such a system using the open-source TEN (Transformative Extensions Network) framework to orchestrate all the components. We cover the logical architecture, physical deployment on AWS, the core services involved, key optimization strategies for latency and cost, and a step-by-step deployment guide using Amazon EKS.

Technology Selection

We designed a real-time audio-video interaction solution based on Amazon Nova and the TEN framework. The core advantage of this approach is leveraging Amazon Nova’s multimodal capabilities to support real-time video understanding, while using the TEN framework’s plugin architecture to keep each component modular and replaceable.

The key technologies in play:

  • Amazon Nova — Foundation models with native multimodal support (text, image, video)
  • Amazon Transcribe — Real-time speech-to-text transcription
  • Amazon Polly — High-quality text-to-speech synthesis
  • TEN Framework — Open-source agent framework with a directed graph-based plugin orchestration engine
  • Agora RTC — Real-time audio-video communication network
  • Amazon EKS — Container orchestration for production deployment

Logical Architecture

The solution uses a modular logical architecture where the TEN framework orchestrates all functional modules through a plugin system. Here is how the components fit together.

Frontend User Interaction Module

  • User Terminal: Supports both web browsers and mobile applications
  • Web Server: Serves as the entry point for frontend requests, handling channel creation and authentication

TEN Agent (Core Orchestration)

The TEN Agent is the central module responsible for orchestrating and managing all plugins. It uses a Directed Cyclic Graph (DCG) to implement flexible data flow processing between modules. The key plugins include:

  • RTC Plugin: Handles sending and receiving real-time audio-video data through the Agora RTC network
  • Amazon Transcribe Plugin: Performs real-time speech recognition, converting the user’s spoken audio into text
  • Interrupt Plugin: Monitors voice interruption states so the agent can stop speaking when the user starts talking
  • Amazon Bedrock Plugin: Sends text and video frames to Amazon Nova models for multimodal reasoning and response generation
  • Amazon Polly Plugin: Converts the agent’s text responses back into natural-sounding speech

RTC Network

Real-time communication is handled through Agora’s SD-RTN (Software Defined Real-Time Network), which provides low-latency audio-video transmission globally.

Communication Channel Setup Flow

The process of establishing a real-time session follows these steps:

  1. The user client calls the HTTP endpoint /v1/api/generate to request a channel name and authentication token
  2. The Web Server processes the request and returns the channel name and token
  3. The user client uses the token to establish a communication channel with the RTC Network
  4. The user client calls /v1/api/start to request that the conversation session begin
  5. The Web Server obtains the channel information and passes it to the TEN Agent
  6. The TEN Agent joins the channel and establishes its own connection to the RTC Network

Once all connections are established, audio and video data flows bidirectionally between the user and the agent in real time.
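As a sketch, the two Web Server calls in steps 1 and 4 can be expressed as request builders. The endpoint paths come from the flow above; the JSON field names (`request_id`, `channel_name`, `token`) are illustrative assumptions, not the actual API contract:

```python
import json

def build_generate_request(request_id: str) -> tuple[str, bytes]:
    """Step 1: ask the Web Server for a channel name and auth token.

    The payload shape is a placeholder assumption.
    """
    body = json.dumps({"request_id": request_id}).encode("utf-8")
    return "/v1/api/generate", body

def build_start_request(channel_name: str, token: str) -> tuple[str, bytes]:
    """Step 4: ask the Web Server to start the agent session in the channel."""
    body = json.dumps(
        {"channel_name": channel_name, "token": token}
    ).encode("utf-8")
    return "/v1/api/start", body
```

The returned path and body can be posted with any HTTP client; the Web Server then hands the channel information to the TEN Agent as described in steps 5 and 6.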

Physical Architecture

The entire solution is deployed on AWS cloud services:

  • End users access the system through web browsers or mobile applications
  • Amazon CloudFront serves as the CDN, accelerating static asset delivery and reducing latency for the frontend
  • Application Load Balancer (ALB) routes incoming API requests to the backend services
  • Amazon EKS hosts all core services, including the Web Server and TEN Agent containers
  • Agora SD-RTN handles the actual low-latency audio-video transmission between users and the agent
  • Docker images are built and stored in Amazon ECR (Elastic Container Registry) for deployment to EKS

The backend services running on EKS connect to AWS AI services (Bedrock, Transcribe, Polly) via AWS SDK calls, keeping the architecture clean and the service boundaries well-defined.

Core AWS Services

Amazon Nova — Multimodal AI Engine

Amazon Nova is a family of foundation models that powers the intelligence behind this solution. The model selection depends on your latency, cost, and capability requirements:

  • Nova Micro — Low cost, fast inference; best for text-only tasks where speed matters most
  • Nova Lite — Low cost, handles images and video; best for quick processing of visual inputs
  • Nova Pro — Best balance of performance, speed, and cost; best for production workloads requiring multimodal reasoning
  • Nova Premier — Highest capability for complex reasoning; best for tasks requiring deep analysis and multi-step reasoning

For this real-time interaction solution, Nova Pro is the recommended default. It provides the best balance between response quality and inference latency, which is critical when users are waiting for a conversational response in real time. Nova Lite is a good alternative when cost is the primary concern and the visual understanding requirements are straightforward.
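To make the Bedrock side concrete, here is a hedged sketch of a streaming multimodal call to Nova Pro via the Bedrock Converse API. The `build_message` helper is hypothetical, and the region and model ID are assumptions — confirm the inference profile ID available in your account:

```python
MODEL_ID = "us.amazon.nova-pro-v1:0"  # assumption: verify for your region/account

def build_message(text: str, jpeg_frames: list[bytes]) -> dict:
    """Combine the user's transcribed text with sampled video frames
    into a Converse API message (one content block per modality)."""
    content = [{"text": text}]
    for frame in jpeg_frames:
        content.append(
            {"image": {"format": "jpeg", "source": {"bytes": frame}}}
        )
    return {"role": "user", "content": content}

def stream_response(text: str, jpeg_frames: list[bytes]):
    """Yield text deltas as the model produces them.

    Requires boto3 and AWS credentials with Bedrock access.
    """
    import boto3
    client = boto3.client("bedrock-runtime", region_name="us-east-1")
    response = client.converse_stream(
        modelId=MODEL_ID,
        messages=[build_message(text, jpeg_frames)],
    )
    for event in response["stream"]:
        delta = event.get("contentBlockDelta", {}).get("delta", {})
        if "text" in delta:
            yield delta["text"]
```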

Amazon Transcribe — Speech Recognition

Amazon Transcribe provides real-time speech-to-text capabilities with support for over 100 languages. In this architecture, the Transcribe plugin receives audio frames from the RTC stream and converts them to text in real time, which is then passed to the Amazon Bedrock plugin for reasoning.

Key features used in this solution:

  • Streaming transcription for real-time, low-latency processing
  • Partial result detection to enable interrupt handling (more on this below)
  • Automatic language detection for multilingual scenarios

Amazon Polly — Text-to-Speech

Amazon Polly converts the agent’s text responses into natural-sounding speech. It supports over 40 languages and returns audio via streaming, completing speech generation in under 150 milliseconds. This low latency is essential for maintaining a natural conversational flow — users should not feel like they are waiting for the agent to “speak.”

The streaming capability means Polly begins returning audio before the full text has been processed, allowing the agent to start speaking almost immediately as the response is generated.
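One way to exploit this, sketched below under assumptions: group the LLM’s streamed text deltas into sentence-sized chunks and synthesize each chunk as soon as it completes. The `sentence_chunks` splitter is an illustrative simplification, not the TEN plugin’s code, and the voice choice is an assumption:

```python
import re

def sentence_chunks(text_deltas):
    """Group streamed text deltas into sentence-sized chunks so
    synthesis can start before the full response exists."""
    buffer = ""
    for delta in text_deltas:
        buffer += delta
        while True:
            match = re.search(r"[.!?]\s", buffer)
            if not match:
                break
            end = match.end()
            yield buffer[:end].strip()
            buffer = buffer[end:]
    if buffer.strip():
        yield buffer.strip()

def synthesize(chunk: str) -> bytes:
    """Synthesize one chunk with Polly; requires boto3 and AWS credentials."""
    import boto3
    polly = boto3.client("polly", region_name="us-east-1")
    resp = polly.synthesize_speech(
        Text=chunk,
        OutputFormat="pcm",
        SampleRate="16000",
        VoiceId="Joanna",   # assumption: pick a voice per language
        Engine="neural",
    )
    # Raw PCM bytes, ready to push into the RTC audio stream
    return resp["AudioStream"].read()
```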

Key Technical Challenges and Optimization Strategies

Reducing Latency

Latency is the single most important factor in real-time conversational AI. Users expect responses within a few hundred milliseconds — anything longer breaks the illusion of a natural conversation. We address this at multiple levels:

  • Model selection: Amazon Nova Pro provides low-latency multimodal reasoning. For text-only scenarios, Nova Micro offers even faster inference times.
  • Asynchronous processing: The TEN framework supports fully asynchronous plugin execution, so multiple operations can proceed in parallel rather than sequentially.
  • Streaming throughout the pipeline: Amazon Transcribe streams partial transcription results as the user speaks. Amazon Polly streams audio as the text is generated. The Bedrock API supports streaming responses. This means every stage of the pipeline starts producing output before the previous stage has finished.
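The effect of streaming at every stage can be illustrated with a toy asyncio pipeline. The stage names mirror the flow (speech recognition feeding reasoning); the bodies are stand-ins, not the TEN framework’s implementation:

```python
import asyncio

async def stage(name: str, inbox: asyncio.Queue, outbox: asyncio.Queue):
    """Forward each item downstream as soon as it arrives, instead of
    waiting for the whole input — this is what streaming buys."""
    while True:
        item = await inbox.get()
        if item is None:            # sentinel: end of stream
            await outbox.put(None)
            return
        await asyncio.sleep(0)      # stand-in for per-chunk service latency
        await outbox.put(f"{name}({item})")

async def run_pipeline(chunks):
    q1, q2, q3 = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    tasks = [
        asyncio.create_task(stage("asr", q1, q2)),
        asyncio.create_task(stage("llm", q2, q3)),
    ]
    for c in chunks:
        await q1.put(c)
    await q1.put(None)
    outputs = []
    while (out := await q3.get()) is not None:
        outputs.append(out)
    await asyncio.gather(*tasks)
    return outputs
```

Because each stage emits per chunk, the first output appears after roughly one chunk’s worth of latency rather than the sum of all stages over the full input.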

Cost Optimization

Real-time video understanding is inherently expensive because it requires sending image frames to the model continuously. Two key strategies help control costs:

Frame sampling by time interval: Rather than sending every video frame to the model, frames are sampled at configurable time intervals. This dramatically reduces the volume of data processed while still providing the model with sufficient visual context.

Image compression and history merging: Frames are compressed before being sent to the model, and historical frames are merged to reduce the total number of images in the context window.

Here is the core video frame processing logic from the TEN Agent’s Bedrock plugin:

async def _on_video(self, ten_env: AsyncTenEnv):
    while True:
        # Wait for the next frame pushed by the RTC plugin
        [image_data, image_width, image_height] = await self.image_queue.get()
        # Convert the raw RGB frame to a Base64-encoded JPEG
        frame_buffer = rgb2base64jpeg(image_data, image_width, image_height)
        self.image_buffers.append(frame_buffer)
        # Keep only the most recent MAX_IMAGE_COUNT frames
        while len(self.image_buffers) > MAX_IMAGE_COUNT:
            self.image_buffers.pop(0)
        # Drain frames that accumulated while we slept, so the model
        # always sees the latest visual data rather than stale frames
        while not self.image_queue.empty():
            await self.image_queue.get()
        # Sampling interval: larger values cut cost, lower temporal resolution
        await asyncio.sleep(VIDEO_FRAME_INTERVAL)

This function runs in a continuous loop, pulling video frames from a queue, converting them to Base64-encoded JPEG, and maintaining a sliding window of recent frames (MAX_IMAGE_COUNT). Older frames are discarded as new ones arrive. The VIDEO_FRAME_INTERVAL parameter controls how frequently frames are sampled — increasing this value reduces cost at the expense of temporal resolution. The inner while loop drains any accumulated frames from the queue, ensuring the agent always works with the most recent visual data rather than processing stale frames.

Interrupt and Completion Signal Detection

In natural conversation, people frequently interrupt each other. The agent needs to detect when the user starts speaking and immediately stop its own audio output. This is handled through the Amazon Transcribe plugin’s transcript event processing:

async def handle_transcript_event(self, transcript_event):
    results = transcript_event.transcript.results
    for result in results:
        # Partial results mean the user is still speaking
        is_final = not result.is_partial
        text_result = ""
        for alt in result.alternatives:
            text_result += alt.transcript
        create_and_send_data(
            ten=self.ten,
            text_result=text_result,
            is_final=is_final,
            stream_id=self.stream_id,
        )

The key distinction here is between partial and final transcript results. Partial results indicate that the user is currently speaking — this triggers the Interrupt Plugin to signal the Polly plugin to stop playing audio. Final results indicate that the user has finished a complete utterance, which triggers the full processing pipeline: the text is sent to Nova for reasoning, and the response is converted back to speech.

This partial/final detection mechanism is what makes the conversation feel natural rather than turn-based. Without it, the agent would either talk over the user or wait awkwardly for long pauses before responding.
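The decision logic can be sketched as a small state machine. The action names ("flush", "process") are illustrative placeholders, not the framework’s actual command names:

```python
class InterruptDetector:
    """Decide what to do with each transcript result, given whether
    the agent is currently playing audio."""

    def __init__(self):
        self.agent_speaking = False

    def on_agent_audio_start(self):
        self.agent_speaking = True

    def on_agent_audio_end(self):
        self.agent_speaking = False

    def on_transcript(self, text: str, is_final: bool) -> list[str]:
        actions = []
        # A partial result while the agent is talking means the user
        # barged in: stop playback immediately.
        if not is_final and self.agent_speaking:
            actions.append("flush")
            self.agent_speaking = False
        # A final result is a complete utterance: run the full pipeline
        # (Nova reasoning, then Polly synthesis).
        if is_final and text.strip():
            actions.append("process")
        return actions
```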

Modular Architecture with Hot-Swappable Plugins

The TEN framework’s plugin architecture means every component in the pipeline can be replaced independently. Want to swap Amazon Transcribe for a different speech recognition service? Replace the Transcribe plugin without touching any other code. Want to add a translation step between transcription and reasoning? Insert a new plugin into the directed graph.

This modularity is particularly valuable during development and experimentation. You can test different model configurations, swap between Nova Pro and Nova Lite for A/B testing, or add custom preprocessing plugins without redesigning the architecture.
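As an illustration only — this is not the exact TEN configuration schema, and the addon and message names are invented — a graph definition wiring the plugins in this post might look like:

```json
{
  "nodes": [
    {"type": "extension", "name": "rtc", "addon": "agora_rtc"},
    {"type": "extension", "name": "stt", "addon": "transcribe_asr"},
    {"type": "extension", "name": "llm", "addon": "bedrock_llm"},
    {"type": "extension", "name": "tts", "addon": "polly_tts"}
  ],
  "connections": [
    {"extension": "rtc", "audio_frame": [{"name": "pcm_frame", "dest": [{"extension": "stt"}]}]},
    {"extension": "stt", "data": [{"name": "text_data", "dest": [{"extension": "llm"}]}]},
    {"extension": "llm", "data": [{"name": "text_data", "dest": [{"extension": "tts"}]}]},
    {"extension": "tts", "audio_frame": [{"name": "pcm_frame", "dest": [{"extension": "rtc"}]}]}
  ]
}
```

Swapping the speech recognizer then means changing one node’s addon and leaving every edge intact.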

Deployment Guide

Prerequisites

Before deploying, you need:

  1. An AWS account with appropriate IAM permissions for EKS, ECR, Bedrock, Transcribe, Polly, and CloudFront
  2. An Agora account with RTC service enabled (for the real-time audio-video network)
  3. AWS CLI and eksctl installed and configured on your local machine
  4. Docker installed for building container images
  5. kubectl installed for managing the EKS cluster

Building and Pushing Docker Images

First, authenticate Docker with your Amazon ECR registry and build the TEN Agent image:

# Authenticate Docker with ECR
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS \
  --password-stdin <your_account_id>.dkr.ecr.us-east-1.amazonaws.com

# Clone the TEN Agent repository
git clone https://github.com/zhuermu/TEN-Agent.git
cd TEN-Agent

# Build the Docker image
docker build -t dev/ten_agent_build .

# Tag and push to ECR
docker tag dev/ten_agent_build:latest \
  <your_account_id>.dkr.ecr.us-east-1.amazonaws.com/dev/ten_agent_build:latest
docker push <your_account_id>.dkr.ecr.us-east-1.amazonaws.com/dev/ten_agent_build:latest

Creating the EKS Cluster

Create the EKS cluster using a configuration file:

# cluster-config.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: ten-framework-cluster
  region: us-east-1
  version: "1.31"

managedNodeGroups:
  - name: ten-workers
    instanceType: c5.2xlarge
    desiredCapacity: 2
    minSize: 1
    maxSize: 4
    volumeSize: 100
    iam:
      withAddonPolicies:
        ebs: true
        efs: true

Apply the cluster configuration:

eksctl create cluster -f cluster-config.yaml

Deploying the TEN Agent Services

Create the Kubernetes namespace and deploy the services:

# Create the namespace
kubectl create namespace ten-framework --save-config
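The deployment manifest below reads the Agora credentials from a Secret named ten-agent-secrets, so create it before applying the deployment. A minimal manifest (the filename and placeholder values are assumptions):

```yaml
# secrets.k8s.yaml
apiVersion: v1
kind: Secret
metadata:
  name: ten-agent-secrets
  namespace: ten-framework
type: Opaque
stringData:
  agora-app-id: <your_agora_app_id>
  agora-app-certificate: <your_agora_app_certificate>
```

Apply it with `kubectl apply -f secrets.k8s.yaml`.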

Apply the deployment manifest:

# deployment.k8s.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ten-agent
  namespace: ten-framework
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ten-agent
  template:
    metadata:
      labels:
        app: ten-agent
    spec:
      containers:
        - name: ten-agent
          image: <your_account_id>.dkr.ecr.us-east-1.amazonaws.com/dev/ten_agent_build:latest
          ports:
            - containerPort: 8080
          env:
            - name: AWS_REGION
              value: "us-east-1"
            - name: BEDROCK_MODEL_ID
              value: "us.amazon.nova-pro-v1:0"
            - name: AGORA_APP_ID
              valueFrom:
                secretKeyRef:
                  name: ten-agent-secrets
                  key: agora-app-id
            - name: AGORA_APP_CERTIFICATE
              valueFrom:
                secretKeyRef:
                  name: ten-agent-secrets
                  key: agora-app-certificate
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
            limits:
              cpu: "2"
              memory: "4Gi"

# Apply the deployment and service manifests
kubectl apply -n ten-framework -f deployment.k8s.yaml
kubectl apply -n ten-framework -f service.k8s.yaml

Verifying the Deployment

Once the pods are running, verify the deployment:

# Check pod status
kubectl get pods -n ten-framework

# Check service endpoints
kubectl get svc -n ten-framework

# View logs for troubleshooting
kubectl logs -n ten-framework -l app=ten-agent --tail=100

Wrapping Up

Building a real-time audio-video AI interaction system involves coordinating many moving parts: speech recognition, multimodal reasoning, text-to-speech, real-time communication, and interrupt handling. The combination of Amazon Nova’s multimodal capabilities with the TEN framework’s plugin orchestration makes this significantly more tractable than building everything from scratch.

The key takeaways from this architecture:

  • Amazon Nova Pro provides the best balance of quality and latency for real-time multimodal conversations
  • Frame sampling and compression are essential for controlling costs when processing continuous video streams
  • Streaming at every stage (Transcribe, Bedrock, Polly) is what makes the latency acceptable for real-time interaction
  • The TEN framework’s DCG-based plugin architecture keeps the system modular and each component independently replaceable
  • Amazon EKS provides a solid foundation for production deployment with auto-scaling capabilities

The full source code for the TEN Agent with Amazon Nova integration is available on GitHub. If you are building real-time conversational AI applications on AWS, this architecture gives you a production-ready starting point with well-defined extension points for customization.