Real-time AI avatars on a single GPU: When local beats the cloud API

Open-weights joint audio-video models hit a usable bar this spring, and a single local GPU now runs a real-time conversational avatar end to end. Here's where self-hosting with LTX 2.3 and Daydream Scope beats a third-party cloud API, and where it doesn't.

May 21, 2026 • 15 min read

Let's talk about running a conversational avatar locally: weights on your disk, a 24 GB card doing the work, and the honest tradeoffs of this approach versus using a cloud API.

The typical scenario is simple: you type a question, and several seconds later a face on the screen answers you out loud. That same face was a still portrait a moment ago. The voice and the lips were generated in the same pass by a single 22-billion-parameter open-weights model. And all of it happened on your machine; nothing left the box.

That is what our new LTX 2.3 avatar tutorial walks through, end to end, on a single NVIDIA GPU with 24 GB of VRAM or more. The pipeline is a seven-node graph in Daydream Scope's Workflow Builder. A local language model (the workflow uses SmolLM2-360M-Instruct) writes the reply. Lightricks' LTX 2.3 generates 121 frames of video at 25 fps along with the 48 kHz audio of the spoken reply, in a single joint pass, in eight denoising steps of roughly 0.3 seconds each. The browser receives the result over WebRTC from your own process.
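As a quick sanity check on those figures, the arithmetic can be sketched in a few lines of Python. The variable names are illustrative, and the per-step time is the rough number quoted above, so treat the result as an estimate of the denoising loop alone rather than of the whole generation cycle.

```python
# Figures quoted in the paragraph above; variable names are illustrative.
FRAMES_PER_CLIP = 121
FPS = 25
DENOISING_STEPS = 8
SECONDS_PER_STEP = 0.3  # "roughly 0.3 seconds", so this is approximate

clip_seconds = FRAMES_PER_CLIP / FPS                  # playback length of one generated clip
denoise_seconds = DENOISING_STEPS * SECONDS_PER_STEP  # time spent in the denoising loop only
headroom = clip_seconds / denoise_seconds             # above 1 means denoising outpaces playback

print(f"clip length:    {clip_seconds:.2f} s")
print(f"denoising time: {denoise_seconds:.2f} s")
print(f"headroom:       {headroom:.2f}x")
```

The headroom shrinks once model overhead, the language model, and streaming are added, which is why the full-turn budget matters more than this number.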

Real-time AI video avatar running in Daydream Scope

This post is the companion piece to that tutorial. The tutorial shows you how; this post is about when self-hosted, open-weights, real-time avatars are the right call and when a cloud avatar API still is.

A go-to-market moment more than a tech demo

Three things changed in the last six months, and each one changes what you can realistically build and buy.

First, open-weights joint audio-video crossed a usable quality bar. Lightricks released LTX-2 with open weights on January 6, 2026, and LTX 2.3 on March 5, 2026, both under a community license that is free for commercial use up to $10M in annual revenue.

LTX 2.3 is a 22-billion-parameter diffusion transformer that produces synchronized video and audio in one forward pass, not video plus a TTS dub. On the Artificial Analysis open-weight leaderboard at release, LTX 2.3 ranked first among open video models, with the Elo gap to closed leaders like Seedance 2.0 still real but narrowed. Lightricks also licensed all training data from Getty Images and Shutterstock, removing the ambiguity in training data that has shadowed most open video models.

Second, the consumer GPU floor moved. The NVIDIA RTX 5090 shipped on January 30, 2025 at a $1,999 MSRP with 32 GB of GDDR7. Street prices have been higher (TrackaLacker logged a low of $1,999.99 in August 2025, climbing to $3,049.99 by February 2026), but the point is the same: a single card under your desk now clears the 24 GB VRAM floor that the LTX 2.3 FP8 checkpoint requires. The same model also runs on H100 instances from a range of cloud providers at different price points.

Third, the pipeline tooling matured. Daydream Scope ships a visual node graph, WebRTC streaming, plugin support, integrations with TouchDesigner, Resolume, Unity, and OBS, and support for MIDI, OSC, DMX, NDI, Spout, and Syphon.

Scope 0.2.2 added multi-source, multi-sink graphs; the recent release that introduced the LTX-2 plugin also added ~18× faster model loading and faster prompt changes. There is now a published, forkable workflow at app.daydream.live/workflows/rafal/live-ltx-2-avatar. You don't have to start from a blank graph.

None of the above is the future; all of it had shipped by mid-May 2026.

Where cloud avatar APIs still own the category

If you are building a customer-facing conversational avatar and your priority is sub-second turn-taking with full-duplex emotional rendering, Tavus is one of the strongest APIs on the market right now.

Phoenix-4, which Tavus launched on February 18, 2026, is a Gaussian-diffusion rendering model that runs at 40 fps in 1080p, with a published sub-500ms end-to-end latency target on their Conversational Video Interface and millisecond-level rendering latency.

Their stack has named models for rendering (Phoenix-4), perception (Raven-1), and turn-taking (Sparrow-1). On November 12, 2025, Tavus announced a $40M Series B led by CRV (with Scale Venture Partners, Sequoia, Y Combinator, HubSpot Ventures, and Flex Capital), bringing total funding to roughly $63M. If you want full-duplex active listening that looks like a person nodding while you talk, Tavus is worth considering.

Beyond Presence is the speed-to-deployment option. Their Genesis 1.0 model claims sub-100ms audio-to-video latency and roughly 1.2-second end-to-end agent latency, with the platform integrating LiveKit and supporting bring-your-own LLM. The founders previously sold Presize to Meta in April 2022 in what TechCrunch described as a lower-nine-figure deal, so the technical bench is likely real. Public named-customer evidence is thinner than competitors'; aggregate counts ("2,400+ customers") do appear in directory listings, but individual case studies do not.

Then, there’s HeyGen, which owns async sales and marketing video. Their pay-as-you-go API is now priced at about $1.00 per minute for Avatar III at 1080p and up to $5.00 per minute for Avatar IV Digital Twin at 4K. For mass-produced multilingual outbound that doesn't need to converse back, that pricing is attractive.

Synthesia owns enterprise learning and development. The $200M Series E led by Google Ventures was officially announced on January 26, 2026 at a roughly $4B valuation (first reported by Forbes on October 29, 2025), and the platform now serves more than 90% of the Fortune 100. Named customers include UBS (which rolled out Synthesia avatars to 36 analysts starting January 2025, targeting roughly 5,000 videos per year), Merck KGaA, ServiceNow, Heineken, Zoom, and SAP. Synthesia 3.0, which launched in October 2025, adds interactive conversational Video Agents, though as of writing, those Agents are in limited enterprise beta with no publicly named deployment yet.

D-ID launched V4 Expressive Visual Agents on March 16, 2026 with a sub-0.5-second conversational turn target and a starting price of $5.90 per month. They acquired simpleshow in September 2025 (terms undisclosed) and inherited that customer base.

Named D-ID Agents deployments include PepsiCo's Gatorade Sports Science Institute hydration coach, Southern Illinois University School of Medicine's "Randy" virtual patient, and Rafael's Iron Dome product explainer.

If your buyer wants a managed service, a BAA on the vendor's paper, a roadmap call, and someone to blame when the avatar misbehaves on a Friday afternoon, then you should consider one of these.

Where local and open wins

However, there are at least five scenarios where running LTX 2.3 on your own GPU via Scope beats sending audio to any vendor.

Privacy and data sovereignty. The tutorial pipeline we mentioned keeps weights on your disk, inference on your GPU, and the WebRTC stream from a browser to your own process.

For PHI under HIPAA, the practical bar is a Business Associate Agreement plus controls demonstrating that PHI never leaves your environment. Inworld AI's HIPAA pattern guide for patient intake makes the architecture explicit: SOC 2 Type II, GDPR, on-premises deployment on customer-controlled H100 or A100 GPUs, and BAAs under enterprise contracts.

The same logic applies to MNPI under SEC rules, attorney-client privilege, and CUI in defense work. "Nothing leaves the box" in those cases is a deployment posture, and not a nice-sounding marketing line.

Unit economics at scale. Tavus's published overage on the Starter plan is $0.37 per conversation minute, with discounts as you move up. HeyGen Avatar IV runs $4 to $5 per minute via API. At 100,000 minutes per month at Tavus Starter overage rates, that is about $37,000 per month before any account-management or volume discount.

A single RTX 5090 at $1,999 MSRP, or even at $3,049 street price, plus a workstation around it, amortizes against that bill in under three months if you can keep the card busy.

At 1,000,000 minutes per month, the cloud bill at $0.37 is $370,000 per month, and the math is not subtle. Here's the honest qualifier, though: LTX 2.3 in the Scope avatar workflow produces 4.84 seconds of audio-video per generation cycle, so a single GPU is one conversational stream, not a hundred. You scale by adding GPUs or instances, and the breakeven math is per concurrent stream, not per minute. If you don't have a local GPU of your own, you can likely still benefit from Daydream's hosted inference, which is integrated into Scope and works from Windows, Mac, or Linux machines.
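To make the per-stream framing concrete, here is a rough breakeven sketch. The cloud rate and GPU price are the figures quoted above. The workstation cost and the utilization are assumptions of ours for illustration, and the sketch ignores power, hosting, and staff time, so read it as a back-of-envelope estimate rather than a pricing model.

```python
CLOUD_RATE_PER_MIN = 0.37   # Tavus Starter overage quoted above, USD per conversation minute
GPU_PRICE_USD = 3049.99     # RTX 5090 street price quoted above
WORKSTATION_USD = 2000.0    # ASSUMPTION: rest of the machine; not a figure from this post
UTILIZATION = 0.25          # ASSUMPTION: share of the month the single stream is in use

MINUTES_PER_MONTH = 30 * 24 * 60  # minutes in a 30-day month


def monthly_cloud_cost(minutes: float, rate: float = CLOUD_RATE_PER_MIN) -> float:
    """Cloud bill for a given number of conversation minutes, before discounts."""
    return minutes * rate


def breakeven_months(hardware_usd: float, utilization: float) -> float:
    """Months until one GPU's cost equals the cloud bill for the one stream it replaces."""
    stream_minutes = MINUTES_PER_MONTH * utilization
    return hardware_usd / monthly_cloud_cost(stream_minutes)


hardware = GPU_PRICE_USD + WORKSTATION_USD
print(f"cloud bill at 100,000 min:   ${monthly_cloud_cost(100_000):,.0f}")
print(f"cloud bill at 1,000,000 min: ${monthly_cloud_cost(1_000_000):,.0f}")
print(f"breakeven for one stream:    {breakeven_months(hardware, UTILIZATION):.1f} months")
```

Because one GPU serves one stream, the number to watch is utilization: the less the stream is in use, the longer the hardware takes to pay for itself.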

Model and pipeline control. The whole pipeline is composable: every node in the seven-node graph is swappable, and the graph itself is yours to extend.

For example, you can replace the reference portrait with the face you actually want. You can swap SmolLM2 for a 7B or 13B open-weights LLM if you can spare the VRAM, or point at a remote LLM endpoint, or run your own fine-tuned model with the LoRAs your business case needs.

You can even decouple voice from video if you prefer: generate the audio with a separate TTS model you've chosen for its voice quality, then lip-sync it with LTX-2, which gives you full control over the voice rather than taking whatever the joint pass produces. You can add Whisper to the input side and have a full voice-to-voice pipeline.
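The idea of swappable nodes can be illustrated with a toy pipeline in plain Python. To be clear, this is not Scope's API: every function name here is hypothetical, and the stages are stubs that return placeholder data instead of running real models. The only point is the shape, where changing one stage leaves the rest of the graph untouched.

```python
from typing import Any, Callable, Dict, List

# A stage is any callable that takes the previous stage's output.
# All stage names below are hypothetical stand-ins, not Scope nodes.
Stage = Callable[[Any], Any]


def typed_input(text: str) -> str:
    """Stand-in for the typed-question input node."""
    return text.strip()


def tiny_llm(question: str) -> str:
    """Stand-in for a small local LLM such as SmolLM2."""
    return f"(small model) You asked: {question}"


def remote_llm(question: str) -> str:
    """Stand-in for a remote or fine-tuned LLM endpoint."""
    return f"(remote model) You asked: {question}"


def joint_av(reply: str) -> Dict[str, Any]:
    """Stand-in for the joint audio-video pass; returns metadata only."""
    return {"script": reply, "frames": 121, "fps": 25, "audio_hz": 48_000}


def run_graph(stages: List[Stage], payload: Any) -> Any:
    """Feed the payload through each stage in order."""
    for stage in stages:
        payload = stage(payload)
    return payload


default_graph = [typed_input, tiny_llm, joint_av]
swapped_graph = [typed_input, remote_llm, joint_av]  # only the LLM stage changed

clip = run_graph(swapped_graph, "  What are your opening hours?  ")
print(clip["script"])
```

Adding speech input would follow the same pattern: replace the first stage with one that turns audio into text and leave the rest of the list alone.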

For more customizability, you can even swap LTX-2 itself for something else; lip-sync models work today, and the autoregressive audio-video models landing this year will slot in the same way.

Lastly, you can even consider dropping in a different talking-head LoRA and then piping the output into TouchDesigner. And although it’s a bit more work, you could drive prompts and parameters over OSC or MIDI from a hardware controller.

None of this is on a Tavus or HeyGen roadmap, because that is not their product. Their product is the API, and Daydream Scope's product is the graph and whatever you build on it.

Edge and latency floor. A kiosk in a retail showroom, a conference floor activation, a museum installation: most, if not all, of these run on hardware you control, often on a network you don't trust to reach a cloud API reliably. WebRTC from a local Scope process to a browser on the same LAN is a different latency regime than WebRTC from a browser through the public internet to a vendor's GPU in another region. That difference is worth factoring into the design.

Brand IP that lives somewhere you control. A long-lived avatar with a face, a voice, a name, and a training set of LoRAs can be considered your brand asset. Hosting that asset on someone else's tenant is a strategic dependency.

Soul Machines, founded out of the University of Auckland in 2016 and once a category leader for "digital humans" with Mercedes-Benz and ANZ Bank as customers, entered receivership on February 5, 2026 owing at least NZ$19.6 million after losing those customers to in-house projects. That is the risk picture in one sentence. Open weights on your own infrastructure don't go into receivership.

Verticals with most potential for local real-time AI avatars

It would be dishonest to say that the local, open approach works equally well for every vertical. Use it where its competitive advantage is sharpest. For each vertical below, we list realistic constraints and a starter project you could deploy to test the use case.

The verticals range from regulated healthcare and financial services to multilingual citizen services for local governments.

Regulated healthcare: intake triage, patient education, behavioral health support. The use case that already exists at production scale today is voice-only intake (Inworld's HIPAA pattern, Retell), and video avatars are an emerging layer on top.

Southern Illinois University School of Medicine's "Randy" virtual patient with D-ID is one of the few publicly documented video-avatar deployments in clinical training. The Daydream Scope advantage is real for behavioral health intake and patient education, where you do not want PHI traversing a vendor's GPU, and where you want the avatar's face, voice, and script under your team's control.

Be mindful of a realistic constraint: clinical validation, consent for any cloned likeness, and a clear escalation path to a human are non-negotiable.

Starter project a builder could ship in one to two weeks: a kiosk avatar that explains a procedure in plain language and answers patient questions about it, running on a 5090 in the clinic's own server room, with a Whisper input node, a clinician-approved knowledge base, and an explicit "talk to a person" button.

As long as the avatar is the front door, and not the doctor, this makes sense.

Financial services: regulated communications and advisor avatars. UBS's Synthesia rollout to 36 analysts in January 2025 is the clearest public evidence that big-bank compliance and AI avatars can coexist, and notably, UBS chose a vendor with a deep enterprise security approach.

For real-time conversational advisor avatars rather than scripted research videos, the FINRA and SEC marketing-communications rules push hard toward on-premise processing of MNPI and full audit trails.

Scope's local-first approach, with weights and inference within the bank's perimeter, is one that compliance teams understand.

Again, some realistic constraints: voice cloning consent from any named advisor, model output logging, and a tight allowlist of topics.

Starter project idea: an internal-only research summary avatar for an analyst's morning notes, deployed on a single GPU, with the LLM swapped for the bank's own RAG endpoint, and a daily compliance review of generated outputs.

Corporate learning and development: localized internal training and role-play simulations. Synthesia is the category leader here for a reason, and the case studies are good.

Claire Boger, Senior Director of Customer Training and Enablement at Persado, reported a 95% reduction in production time from two weeks to four hours and more than 120 training videos populated into 40 courses in a year.

Druva built dozens of training videos in six months. AWS uses Synthesia to localize marketing content. If you want a scripted, branded, multilingual avatar video factory with templates and SSO, consider them.

The Daydream advantage in L&D shows up where the asset is interactive and proprietary: role-play simulations with a difficult customer, manager training where the avatar has to react to whatever the trainee says, or sales-objection drills where the persona has to be fine-tuned with confidential transcripts you cannot send to a vendor.

Starter project: a sales-objection role-play avatar with a swappable customer persona, deployed on a single 5090, integrated with the LMS over a basic webhook.

Live broadcast, entertainment, and VJ work. This is the vertical where Scope's design choices look least like a compromise and most like a deliberate feature.

Real-time WebRTC output, Spout, Syphon, and NDI integrations, OSC and MIDI control over node parameters, and audio-reactive workflows all come out of the box. VTubers and live streamers have been the most aggressive adopters of real-time AI video pipelines because the rendering latency is the show.

The LTX 2.3 benefit here is the joint audio-video pass: an avatar that can sing a generated line or react to a chat prompt with both face and voice in sync, in a single pass, without a separate TTS step that drifts.

Constraint: LTX 2.3 is not autoregressive, so each chunk is generated from the reference image. Between answers, the workflow plays a short idle-loop clip, and there is a small seam at the transition unless you solve for it.

For broadcast, that seam is a directing choice (cut on the seam, use it as a beat); for a customer-service avatar, it is a more visible artifact.

Starter project: a Twitch overlay where viewers chat-prompt a co-host avatar that answers in voice and on camera, with the entire pipeline running on the streamer's existing 4090 or 5090.

Public-sector kiosks and multilingual citizen service. The data-residency story is the whole story in this case.

A kiosk in a city hall serving residents in multiple languages, running on the agency's own hardware, with no resident data leaving the building, is a cleaner procurement than any cloud API.

Soul Machines pursued exactly this market with Nadia for Australia's NDIS, and the technical pieces are now reproducible in open source.

Constraint: accessibility, language coverage, and a fallback to a human staff member who is actually present.

Starter project: a single-language kiosk for the most common five citizen questions at one agency, on a 5090 in a locked closet, with NDI output to the kiosk display and a hardware "call a person" button wired through OSC.

What a real deployment looks like

Now the unromantic part: the practical considerations of an actual deployment.

GPU procurement in 2026 is not a one-click decision. RTX 5090 supply has been spotty, with street prices ranging from $1,999 to over $3,000 through 2025 and 2026 per BestValueGPU and TrackaLacker price trackers.

On the other hand, a used H100 with proper cooling and a 1,500W PSU is a real piece of infrastructure, not a desktop anyone can run. Many teams should start with rented capacity, such as Daydream's remote inference.

Another thing to consider is a latency budget: at 4.84 seconds of video per 3 to 4 second generation cycle, your pipeline is comfortably below playback time, but the moment you add Whisper for speech input or a larger LLM for the reply, you eat into that margin.

Budget the full turn (STT + LLM + LTX 2.3 + WebRTC) and target 7 seconds end-to-end, not the model's standalone numbers. P99 handling matters: a single slow generation cycle creates an audible gap on the playback side. Fallback to a pre-rendered idle clip is what the tutorial workflow already does between answers; the same trick covers misbehaving generations and significantly improves overall experience.
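A budget like this is easy to keep honest in code. The sketch below sums per-stage timings against the 7-second target and picks the idle loop when a generation misses its deadline. The stage timings are placeholder assumptions, not measurements, and the function names are ours rather than anything from the tutorial workflow.

```python
TARGET_TURN_SECONDS = 7.0    # end-to-end target named above
CLIP_SECONDS = 121 / 25      # 4.84 s of playback per generated clip

# ASSUMPTION: illustrative per-stage timings in seconds; measure your own pipeline.
stage_seconds = {
    "stt": 0.8,
    "llm": 1.0,
    "ltx_generation": 3.5,   # inside the 3 to 4 second cycle quoted above
    "webrtc_delivery": 0.2,
}


def turn_latency(stages: dict) -> float:
    """Total time for one conversational turn across all stages."""
    return sum(stages.values())


def choose_clip(generation_seconds: float, deadline_seconds: float) -> str:
    """Play the answer if it arrived before the deadline, otherwise the idle loop."""
    return "answer" if generation_seconds <= deadline_seconds else "idle_loop"


total = turn_latency(stage_seconds)
within_budget = total <= TARGET_TURN_SECONDS
print(f"full turn: {total:.1f} s, within budget: {within_budget}")
print("slow cycle plays:", choose_clip(6.0, CLIP_SECONDS))
```

Using the clip length as the deadline reflects the point above: a cycle that takes longer than the clip it follows is the one that produces an audible gap.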

Talent likeness and voice consent: if the reference portrait is a real person, get the consent in writing, scoped to the use, with a retirement clause. The talking-head LoRA in the tutorial (elix3r/LTX-2.3-22b-AV-LoRA-talking-head) uses a fixed trigger token OHWXPERSON, which is a useful technical detail to know but does not substitute for a likeness release.

Synthesia's "three Cs" (consent, control, collaboration) framework is a defensible starting point. Don't make avatars of public figures or dead people just because you can.

Then there's content moderation: a local pipeline has no vendor doing safety classification for you. Wire a moderation pass on the LLM output before it reaches the LTX 2.3 node.

There is also a regulatory layer to design for now. The EU AI Act Article 50 transparency obligations apply from August 2, 2026: deployers of AI systems generating deepfake image, audio, or video content must disclose that the content is artificially generated, and providers must mark outputs in a machine-readable format.

The Colorado Anti-Discrimination in AI Act has slipped to January 1, 2027 after SB 189 amendments, with consumer-notice and appeal obligations rather than the original duty-of-care regime. Build the disclosure label into the UI now; it costs nothing today and is mandatory later. Also follow the regulatory space closely, as it is evolving rapidly.
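Both the moderation pass and the disclosure label can start very small. The sketch below gates the LLM reply before it would reach the video model and attaches a label for the UI. The keyword list is a placeholder assumption (a real deployment would likely use a proper classifier), the label wording is an example rather than legal text, and none of these names come from Scope.

```python
from dataclasses import dataclass

# ASSUMPTION: placeholder terms; swap in a real classifier or policy model.
BLOCKED_TERMS = ("diagnosis", "guaranteed return")
FALLBACK_REPLY = "I can't help with that here. Let me connect you with a person."
DISCLOSURE_LABEL = "AI-generated avatar"  # ASSUMPTION: example wording, not legal text


@dataclass
class GatedReply:
    text: str                            # what the avatar is allowed to say
    blocked: bool                        # True when the original reply was replaced
    disclosure: str = DISCLOSURE_LABEL   # label the UI should always display


def moderate(llm_reply: str) -> GatedReply:
    """Replace replies containing blocked terms; pass everything else through."""
    lowered = llm_reply.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return GatedReply(text=FALLBACK_REPLY, blocked=True)
    return GatedReply(text=llm_reply, blocked=False)


for reply in ("Our hours are 9 to 5.", "This fund has a GUARANTEED RETURN."):
    gated = moderate(reply)
    print(f"[{gated.disclosure}] blocked={gated.blocked}: {gated.text}")
```

Keeping the disclosure on the reply object, not in the UI code, means every consumer of the pipeline receives it by default.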

Honest limitations

First and foremost, LTX 2.3 is not an autoregressive model. This means that each generation is conditioned on the reference image, and not on the previous frame. The workflow stitches answer clips together with a short idle loop in between, and a visible seam appears at the transition. For broadcast and entertainment, you can hide that seam with a cut or a beat. For a customer-service avatar staring you down, the seam is somewhat noticeable.

But this field is evolving, and the autoregressive successors are landing. OmniForcing (arXiv:2603.11647, March 2026) distills the LTX-2 bidirectional architecture into a streaming autoregressive generator at ~25 fps on a single GPU with ~0.7 second time-to-first-chunk, a roughly 35× speedup over the teacher.

Mutual Forcing (arXiv:2604.25819) trains a native, fast-causal audio-video model with dual-mode self-evolution. The OmniForcing team has committed to releasing code and weights within two weeks of the paper. When these models ship and (community) Scope plugins land for them, the seam problem and most of the per-chunk batching go away.

Reply quality from small local LLMs varies. SmolLM2-360M-Instruct is small enough to fit comfortably alongside LTX 2.3 on a 24 GB card, and small enough that its prose is plain and occasionally off. Swap it for a 7B or 8B model if you have the VRAM headroom on a 32 GB card or H100. Prompt-change latency is noticeably higher on a 32 GB card than on an H100, even with the recent ~18× increase in model-loading speed in the LTX-2 plugin.

The LoRA's fixed trigger token (OHWXPERSON) means your prompts must include that token to trigger the talking-head behavior. It is a quirk, not a blocker, and the kind of quirk that is normal for open-weights LoRAs.
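A small guard keeps the token from being forgotten when prompts are built programmatically. This is a minimal sketch: the helper name is ours, and prepending the token is an assumption about placement, so check the LoRA's own notes for where it expects the token to appear.

```python
TRIGGER_TOKEN = "OHWXPERSON"  # fixed trigger token named above


def with_trigger(prompt: str, token: str = TRIGGER_TOKEN) -> str:
    """Return the prompt unchanged if it already contains the token, else prepend it.

    ASSUMPTION: prepending is acceptable; the LoRA may prefer a specific position.
    """
    if token in prompt:
        return prompt
    return f"{token} {prompt}".strip()


print(with_trigger("a woman talking to the camera"))
print(with_trigger("OHWXPERSON smiling and nodding"))
```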

Where to start

Three steps.

Read and run the LTX 2.3 avatar tutorial. If you have a 24 GB card, run it locally. If you don't, run it on Daydream’s hosted/remote inference infra.
Fork the published workflow in Scope. Don't rebuild the graph from scratch on day one.
Swap exactly one node for your use case. The swap that teaches you the most, fastest, is the reference portrait. The next is the LLM. The third could be replacing the typed-input node with a Whisper STT node.

When the autoregressive A/V models land later this year, the seam closes, and the same graph picks up the upgrade with a plugin update.

Until then, the demo in the tutorial is already a shippable product for the verticals named above. Open the tutorial, point it at your own GPU or Daydream’s remote inference, and see what falls out.