Just Dub It AI Tool: Complete Guide to AI Video Dubbing
JUST-DUB-IT is a research method and model release for AI video dubbing. Its main promise is simple: instead of translating audio first and fixing the face later, it adapts an audio-video foundation model so the new speech and the speaker’s visible mouth movement are generated together. That joint approach is why the project matters. Dubbing is not only about words; it is also about rhythm, facial motion, pauses, expression, and whether the viewer still believes the same person is speaking.
Put precisely, JUST-DUB-IT is more than a regular voiceover tool but less than a full consumer localization platform. It is a video-to-video dubbing approach built on LTX audio-video models. It can be used inside a local AI workflow to create target-language or re-voiced videos with better lip synchronization than many modular pipelines, but translation quality, prompt preparation, human review, and responsible use remain outside the model.
Quick Facts
| Point | Explanation |
| Category | AI video dubbing, lip synchronization, and audio-video generation |
| Core idea | Generate target-language speech and matching facial motion together instead of using a long modular dubbing pipeline. |
| Base technology | LTX-2 / LTX-2.3, a diffusion-based audio-video foundation model from Lightricks. |
| Adaptation method | A lightweight IC-LoRA / LoRA adapter trained for lip dubbing. |
| Best fit | Research, local AI workflows, experimental dubbing, video localization testing, and creator tooling. |
| Not the same as | A complete consumer localization suite. Translation, script preparation, quality review, and consent handling still matter. |
Official Links and Resources
The links below are the primary sources for this guide. The original JUST-DUB-IT repository is now archived, so the newer LTX-2 repository and the LTX-2.3 LipDub model card are the better starting points for current usage.
| Resource | Official link | Why it matters |
| Official project page | https://justdubit.github.io/ | Demos, paper links, method overview, and citation. The page lists JUST-DUB-IT as SIGGRAPH 2026 work. |
| Paper | https://arxiv.org/abs/2601.22143 | JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion. |
| Current LTX-2 codebase | https://github.com/Lightricks/LTX-2 | The active Lightricks monorepo that now contains the lip dubbing pipeline. |
| LipDub pipeline | https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-pipelines/src/ltx_pipelines/lipdub.py | The current implementation path for lip dubbing / re-voicing with IC-LoRA and audio reference conditioning. |
| Latest LipDub model | https://huggingface.co/Lightricks/LTX-2.3-22b-IC-LoRA-LipDub | LTX-2.3 22B IC-LoRA LipDub adapter released by Lightricks. |
| Base model | https://huggingface.co/Lightricks/LTX-2.3 | LTX-2.3 audio-video foundation model used as the base. |
| Original JUST-DUB-IT code archive | https://github.com/justdubit/just-dub-it | The original research repository. It is archived and points users to the newer Lightricks repository. |
| Original JUST-DUB-IT model | https://huggingface.co/justdubit/justdubit | Earlier LoRA adapter and usage notes connected with the paper. |
| Training dataset | https://huggingface.co/datasets/justdubit/audiovisual_translation_dub | Paired audiovisual translation dubbing dataset referenced by the model card. |
| ComfyUI support | https://github.com/Lightricks/ComfyUI-LTXVideo | Official LTX-Video support for ComfyUI, including workflows for LTX models. |
What Is JUST-DUB-IT?
JUST-DUB-IT, formally titled “Video Dubbing via Joint Audio-Visual Diffusion,” is a method for generating dubbed video where the output includes both new speech and matching facial motion. The project page summarizes the idea as a joint audio-visual model for video dubbing. The paper argues that modern audio-video foundation models already capture the connection between sound and visuals, so adapting one for dubbing can replace building a separate tool for every stage of the process.
That difference matters because traditional dubbing pipelines are fragile. A typical AI dubbing workflow might transcribe speech, translate text, synthesize a new voice, adjust timing, run a lip-sync model, and then repair the video. Each stage can introduce errors. If the translated sentence is longer than the original, the timing drifts. If the generated voice does not match the face, the clip feels detached. If the lip-sync stage only sees the final audio, it may produce mouth movement that looks technically aligned but emotionally flat.
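To make the timing problem concrete, here is a minimal sketch with hypothetical segment durations. The numbers and the function names `cumulative_drift` and `required_speedup` are illustrative assumptions, not taken from the paper or any dubbing tool. It shows how a modular pipeline that plays synthesized segments back to back falls behind the original timeline when translated speech runs longer.

```python
from itertools import accumulate

def cumulative_drift(original_s, dubbed_s):
    """Running offset in seconds between dubbed and original segment end
    times when segments are simply played back to back."""
    if len(original_s) != len(dubbed_s):
        raise ValueError("segment lists must have the same length")
    original_ends = accumulate(original_s)
    dubbed_ends = accumulate(dubbed_s)
    return [round(d - o, 3) for o, d in zip(original_ends, dubbed_ends)]

def required_speedup(original_s, dubbed_s):
    """Playback-rate factor needed to fit each dubbed segment into the
    time slot of the original segment."""
    return [round(d / o, 3) for o, d in zip(original_s, dubbed_s)]

# Hypothetical durations (seconds) for three sentences.
original = [3.0, 2.5, 4.0]
dubbed = [3.6, 2.5, 4.8]

print(cumulative_drift(original, dubbed))  # [0.6, 0.6, 1.4]
print(required_speedup(original, dubbed))  # [1.2, 1.0, 1.2]
```

In this toy case the dubbed audio ends 1.4 seconds late unless two of the three sentences are sped up by 20 percent, which is the kind of correction a separate timing stage has to make before lip sync even starts.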
JUST-DUB-IT tries to reduce that mismatch by adapting a foundation model that already treats audio and video as connected signals. The model conditions on the input audio-video clip and then generates the target speech and synchronized facial motion together. In practical terms, the output is a new video where the mouth movement is adjusted for the new spoken content instead of remaining locked to the original language.
How It Works
The current public release is best understood as an IC-LoRA / LoRA adapter for LTX-2.3 rather than a separate all-in-one application. LTX-2.3 is Lightricks’ diffusion-based audio-video foundation model. JUST-DUB-IT adapts that type of model for lip dubbing, so the system can use the source video as visual context while generating speech and face motion that match the target prompt.
The paper’s training idea is also important. Instead of relying only on naturally collected multilingual clips of the same person, the researchers use the generative model itself to build paired multilingual training examples. They create clips with language switches, then train the model to inpaint the face and audio so one part matches the language of the other. This gives the adapter examples where identity, pose, background, and timing can stay consistent while the spoken language changes.
For readers, the simple explanation is enough: the model learns how a speaker should look and sound when the spoken content changes. It is not merely pasting a new audio track onto the old video. It uses the original clip as context, then updates the speech region in a way that aims to preserve identity and improve synchronization.
Models, Code, and Components It Uses
The central model family is LTX-2, with the newer public LipDub adapter released as Lightricks/LTX-2.3-22b-IC-LoRA-LipDub on Hugging Face. That model card lists the base model as LTX-2.3, the training type as IC-LoRA, and the control type as video and audio. The active implementation is inside the Lightricks/LTX-2 repository, specifically the lipdub pipeline under packages/ltx-pipelines.
There are two model paths worth distinguishing. The original JUST-DUB-IT release used the justdubit/justdubit Hugging Face model and the justdubit/just-dub-it repository. That repository was archived on May 11, 2026 and points users toward the Lightricks LTX-2 repository for current LTX-2.3 support. The newer Lightricks model card is therefore the cleaner link for readers who want the latest LipDub adapter.
The model also depends on the broader LTX-2.3 ecosystem. Depending on the workflow, users may need the LTX-2.3 base checkpoint, distilled model or LoRA components, spatial upscalers, a Gemma text encoder, and the LTX pipelines package. For ComfyUI users, the official Lightricks ComfyUI-LTXVideo repository is the relevant integration point.
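As a rough sketch of how these pieces could be tracked in a local setup, the snippet below keeps a small manifest of the two Hugging Face repositories named in this guide. The `REPOS` dictionary and the helper names are hypothetical. `snapshot_download` is a real function in the `huggingface_hub` package, but whether these repositories require login or license acceptance, and which files a given workflow actually needs, are not covered here. The text encoder and upscaler sources are deliberately left out because this guide does not name their repositories.

```python
from pathlib import Path

# Repo IDs copied from the resource table in this guide.
REPOS = {
    "base_model": "Lightricks/LTX-2.3",
    "lipdub_adapter": "Lightricks/LTX-2.3-22b-IC-LoRA-LipDub",
}

def hub_url(repo_id: str) -> str:
    """Build the Hugging Face page URL for a repo ID."""
    return f"https://huggingface.co/{repo_id}"

def fetch_all(target_dir: str) -> dict:
    """Download a full snapshot of each repo into target_dir/<key>.

    Assumes `pip install huggingface_hub` and enough disk space; the
    downloads are large. Not called automatically by this script.
    """
    from huggingface_hub import snapshot_download  # imported lazily

    paths = {}
    for key, repo_id in REPOS.items():
        local_dir = str(Path(target_dir) / key)
        paths[key] = snapshot_download(repo_id=repo_id, local_dir=local_dir)
    return paths

if __name__ == "__main__":
    # Only print the manifest; call fetch_all("models") to download.
    for key, repo_id in REPOS.items():
        print(f"{key}: {hub_url(repo_id)}")
```

Keeping the download step separate from the manifest makes it easy to review what will be fetched before committing the disk space.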
What Makes It Different
The main difference is joint generation. Many dubbing systems are modular: one model creates speech, another handles voice conversion, another edits lips, and another cleans the face. Modular systems are flexible, but they can struggle when the clip includes side angles, expressive acting, fast motion, occlusion, or non-standard footage. JUST-DUB-IT’s appeal is that it tries to keep the audio and visual generation inside a single connected process.
The second difference is identity preservation. The paper and model cards describe the goal as preserving speaker identity while improving lip synchronization. That should be worded carefully. It does not mean every output will perfectly clone a person’s voice or replace a dedicated professional dubbing pipeline. It means the model is designed to keep the speaker’s appearance and speaking presence more consistent than a basic translated voiceover.
The third difference is research transparency. Unlike many commercial dubbing tools, JUST-DUB-IT has a paper, project page, code links, model cards, and dataset references. That makes it useful for researchers, developers, and advanced creators who want to understand how the system works or build local workflows around it.
Best Use Cases
JUST-DUB-IT is most relevant for creators and teams exploring AI video localization. A YouTube creator could use this type of workflow to test whether a video feels more natural in another language. A course creator could experiment with multilingual lessons where the instructor’s face stays synchronized with the target speech. A product team could prototype localized founder videos, explainers, or social ads before investing in full manual dubbing.
It is also interesting for AI video researchers because it shows how audio-video foundation models can be adapted for downstream editing tasks. Dubbing is a hard test case: it needs timing, identity, expression, and language control at once. If a model can handle that well, the same foundation may be useful for other video editing tasks such as re-voicing, performance transfer, audio-driven video generation, or selective video retakes.
Practical Limits and Cautions
The biggest practical limit is that JUST-DUB-IT should not be described as a complete one-click translation product. In a real publishing workflow, someone still needs to prepare or verify the target-language script, check pronunciation, review meaning, inspect lip sync, and confirm that the final video is acceptable for the audience. AI dubbing can reduce production work, but it does not remove editorial responsibility.
Hardware is another limit. LTX-2.3 workflows are local and open-weight oriented, but they are not lightweight mobile tools. The latest LipDub adapter sits on top of a large 22B audio-video model, and the surrounding pipeline may require substantial GPU memory, model downloads, and setup time. This is better framed as a developer or power-user workflow than a casual browser tool.
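A back-of-the-envelope calculation shows why. The sketch below assumes "22B" means roughly 22 billion parameters and counts only the memory needed to hold the weights at common numeric precisions. Activations, the text encoder, upscalers, and framework overhead are not included, so real requirements will be higher. The function name is illustrative.

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to store model weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30

PARAMS = 22e9  # assumption: "22B" read as 22 billion parameters

for name, nbytes in [("fp32", 4), ("bf16/fp16", 2), ("fp8/int8", 1)]:
    print(f"{name:>10}: {weight_memory_gib(PARAMS, nbytes):.1f} GiB")
# Prints about 82.0, 41.0, and 20.5 GiB respectively.
```

Even at 8-bit precision the weights alone approach the capacity of a 24 GB consumer GPU, which is why offloading or reduced-precision variants tend to matter for workflows of this size.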
Consent is also essential. A model that can make someone appear to speak new words in a different language should be used only with permission and clear disclosure where appropriate. The same technology that helps education and localization can also create misleading clips if used carelessly. Any article about JUST-DUB-IT should mention that responsible use is part of the tool’s real-world value.
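The review and consent steps in this section can be captured as a simple pre-publish gate. This is a hypothetical sketch of one way a team might track them. The class and field names are invented for illustration and are not part of JUST-DUB-IT or the LTX tooling.

```python
from dataclasses import dataclass, fields

@dataclass
class ReleaseChecklist:
    """Manual checks to complete before publishing a dubbed clip."""
    script_verified: bool = False
    pronunciation_checked: bool = False
    meaning_reviewed: bool = False
    lip_sync_inspected: bool = False
    speaker_consent_recorded: bool = False
    disclosure_decided: bool = False

    def missing(self) -> list[str]:
        """Names of checks that have not been completed yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def ready(self) -> bool:
        """True only when every check has been completed."""
        return not self.missing()

check = ReleaseChecklist(script_verified=True, lip_sync_inspected=True)
print(check.ready())    # False
print(check.missing())  # the four checks still outstanding
```

Treating consent and disclosure as blocking items alongside the quality checks keeps them from being handled as an afterthought.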
How to Explain It to Readers
The cleanest positioning is: JUST-DUB-IT is an open research direction for realistic AI video dubbing. It uses the LTX audio-video model family and a lightweight adapter to generate target speech and matching facial motion together. Compared with traditional AI dubbing pipelines, it is designed to reduce lip-sync mismatch and preserve the speaker’s identity under more realistic video conditions.
Avoid calling it only a translator, a voice cloning app, or a simple lip-sync tool. Those descriptions are incomplete. The stronger description is joint audio-visual dubbing: the voice and visible speech performance are generated as one connected output.
FAQ
Is AI dubbing legal?
Generally yes, provided it is used with proper consent and in compliance with applicable privacy laws.
Is JUST-DUB-IT publicly available?
The original repository and model were released publicly, and the current LTX-2.3 LipDub adapter is available on Hugging Face under the LTX community license. The original repository is archived, so readers should also check the active Lightricks/LTX-2 repository.
Does JUST-DUB-IT translate videos?
The research focuses on generating dubbed audio and synchronized facial motion. Translation or script preparation can be handled separately or through prompts, so it should not be sold as a full translation suite by itself.
Which model does it use?
The latest public LipDub release is trained on top of LTX-2.3-22B as an IC-LoRA adapter. Earlier JUST-DUB-IT resources also reference LTX-2 and the original justdubit model card.
Who is it best for?
It is best for AI researchers, developers, localization teams, and advanced creators who are comfortable with model downloads, GPU workflows, and manual quality control.
JUST-DUB-IT is important because it points toward a more natural future for AI dubbing. Instead of treating translation, speech, lip movement, and video repair as separate problems, it adapts a joint audio-video model so the visible performance and the target speech can be generated together. That makes it one of the more interesting releases for anyone watching video localization, creator tooling, and open-weight audio-video generation.
For an article, the strongest angle is not hype. The strongest angle is clarity: JUST-DUB-IT is a research-backed, model-driven approach to multilingual video dubbing built around LTX-2.3 and IC-LoRA adaptation. It is promising, technically impressive, and useful for experimentation, but it still needs careful setup, human review, and responsible use.
