MIDAS: Multimodal Interactive Digital-humAn Synthesis via Real-time Autoregressive Video Generation

1Kling Team, Kuaishou Technology 2Zhejiang University 3Tsinghua University

* Equal contribution.

Left: Integration with multimodal large language models such as GPT-4o enables conversational digital human systems. Right: With dual-stream audio as input, we realize multimodal digital-human dialogue video generation.

Abstract

Recently, interactive digital human video generation has attracted widespread attention and achieved remarkable progress. However, building a practical system that can interact with diverse input signals in real time remains challenging for existing methods, which often struggle with high latency, heavy computational cost, and limited controllability. In this work, we introduce an autoregressive video generation framework that enables interactive multimodal control and low-latency extrapolation in a streaming manner. With minimal modifications to a standard large language model (LLM), our framework accepts multimodal condition encodings including audio, pose, and text, and outputs spatially and semantically coherent representations to guide the denoising process of a diffusion head. To support this, we construct a large-scale dialogue dataset of approximately 20,000 hours from multiple sources, providing rich conversational scenarios for training. We further introduce a deep compression autoencoder with up to 64× reduction ratio, which effectively alleviates the long-horizon inference burden of the autoregressive model. Extensive experiments on duplex conversation, multilingual human synthesis, and interactive world modeling highlight the advantages of our approach in low latency, high efficiency, and fine-grained multimodal controllability.

Autoregressive Generation Model


Our streaming generation approach processes inputs in chunks.

To enable efficient streaming generation, we organize inputs and outputs into logical chunks, where each chunk contains a concatenated sequence of audio tokens, pose tokens, text tokens and frame tokens. This structured token organization facilitates both streaming control input and sequential output generation, allowing for real-time responsiveness while maintaining contextual coherence across chunks.
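As a concrete illustration, the sketch below shows how one chunk could be assembled; the tensor shapes, the condition-before-frame ordering within a chunk, and the hypothetical `build_chunk` helper are assumptions for exposition rather than the exact released layout.

```python
import torch

def build_chunk(audio_tok, pose_tok, text_tok, frame_tok):
    """audio_tok, pose_tok, text_tok, frame_tok: already-embedded token tensors of
    shape [T_*, d] for the current chunk. Condition tokens are placed before the
    frame tokens so that every frame token in the chunk can attend to them."""
    return torch.cat([audio_tok, pose_tok, text_tok, frame_tok], dim=0)

# A streaming sequence is then the concatenation of successive chunks,
# appended one at a time as new audio / pose / text conditions arrive.
```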

We design a specialized frame-level causal attention mask to optimize for both streaming generation and output quality. This mask permits each token to attend only to tokens from previous frames and to all tokens within its own frame. This hybrid approach—causal attention between frames and full attention within frames—balances temporal consistency with spatial coherence, critically important for high-quality visual outputs.
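A minimal sketch of such a mask is given below, assuming each token carries the index of the frame (or chunk) it belongs to; the helper name and the treatment of condition tokens are illustrative assumptions.

```python
import torch

def frame_causal_mask(frame_ids: torch.Tensor) -> torch.Tensor:
    """frame_ids: [L] integer frame index for every token in the sequence
    (condition tokens can share the index of the frame/chunk they condition).
    Returns a boolean [L, L] mask where True means "may attend": token i attends
    to token j iff frame(j) <= frame(i), i.e. full attention within a frame and
    causal attention across frames."""
    return frame_ids.unsqueeze(1) >= frame_ids.unsqueeze(0)

# Example: two frames with three tokens each.
ids = torch.tensor([0, 0, 0, 1, 1, 1])
mask = frame_causal_mask(ids)
# mask[0, 3] is False (frame 0 cannot see frame 1); mask[3, 0] and mask[3, 5] are True.
```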

For efficient inference, we implement a lightweight diffusion head and utilize flow matching to achieve high sampling efficiency. At inference time, the diffusion head completes generation within 4 sampling iterations, enabling real-time performance.
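The following is a minimal Euler sampler for a flow-matching head under the convention x_t = (1 - t) x_0 + t * noise, matching the 4-step schedule mentioned above; the `head` interface and conditioning signature are assumptions.

```python
import torch

@torch.no_grad()
def sample_flow_matching(head, cond, shape, num_steps=4, device="cuda"):
    """Integrate from pure noise at t=1 down to data at t=0 with Euler steps,
    where head(x, t, cond) predicts the velocity field."""
    x = torch.randn(shape, device=device)                       # x_1 ~ N(0, I)
    ts = torch.linspace(1.0, 0.0, num_steps + 1, device=device)
    for i in range(num_steps):
        t, t_next = ts[i], ts[i + 1]
        v = head(x, t.expand(shape[0]), cond)                   # predicted velocity
        x = x + (t_next - t) * v                                # Euler update (dt < 0)
    return x                                                    # approximate x_0
```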

Highly Compressed Video Tokenizer


Illustration of streaming autoencoding.

For real-time AR video generation, our streaming autoencoder must satisfy two key criteria: (1) maintain high reconstruction fidelity under strong spatial compression (64× in our implementation) to enable efficient processing by the LLM, and (2) preserve and explicitly model the temporal dimension during encoding and decoding to support autoregressive generation with coherent, flicker-free outputs.
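As a rough token-budget illustration (assuming the 64× figure denotes the per-axis spatial downsampling factor and a hypothetical 384×640 input resolution), high compression keeps the per-frame token count small enough for long autoregressive rollouts:

```python
# Illustrative arithmetic only; the resolution and the per-axis interpretation
# of the 64x factor are assumptions for this sketch.
H, W, ds = 384, 640, 64
tokens_per_frame = (H // ds) * (W // ds)   # 6 * 10 = 60 latent tokens per frame
print(tokens_per_frame)                    # 60
```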

Training is performed in three stages: (i) pretraining of the Deep Compression Autoencoder (DC-AE), (ii) temporal module training, where causal 3D convolutions and RoPE-based attention layers are inserted after each spatial layer, and (iii) joint fine-tuning.
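A sketch of the kind of causal temporal convolution inserted in stage (ii) is shown below; causality comes from padding only on the past side of the time axis, while the kernel sizes and channel counts here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """Temporal 3D convolution that never looks at future frames."""
    def __init__(self, channels, kernel_t=3, kernel_s=3):
        super().__init__()
        self.pad_t = kernel_t - 1                      # pad past frames only
        self.pad_s = kernel_s // 2
        self.conv = nn.Conv3d(channels, channels,
                              kernel_size=(kernel_t, kernel_s, kernel_s))

    def forward(self, x):                              # x: [B, C, T, H, W]
        x = F.pad(x, (self.pad_s, self.pad_s,          # width
                      self.pad_s, self.pad_s,          # height
                      self.pad_t, 0))                  # time: left (past) only
        return self.conv(x)
```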

During inference, streaming encoding and decoding are performed frame by frame with cached temporal features (feature maps and key/value caches), leveraging a short history buffer to ensure real-time autoregressive generation with temporal consistency.
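A hedged sketch of this frame-by-frame loop is given below; `decoder_step`, the structure of the cached features, and the history length are placeholders for the actual decoder interface.

```python
import torch

def stream_decode(decoder_step, latent_frames, max_history=4):
    """Decode latents one frame at a time, keeping a short history of cached
    temporal features (feature maps / KV entries) so memory stays bounded."""
    history, frames = [], []
    for z_t in latent_frames:                  # z_t: latent tokens of one frame
        frame, feats = decoder_step(z_t, history)
        frames.append(frame)
        history.append(feats)
        if len(history) > max_history:         # short history buffer
            history.pop(0)
    return torch.stack(frames, dim=0)
```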

Dataset Construction and Processing


Illustration of data collection and processing pipeline.

The training dataset combines single-person and two-person speech content from three sources: (1) publicly available benchmarks (VoxCeleb1/2, TED-LRS); (2) curated online videos including podcasts, interviews, talk shows, and speeches; and (3) custom-recorded sessions featuring controlled two-person interactions.

Data processing consists of three stages: preprocessing, annotation and synthetic data construction, and post-processing.

Preprocessing: Temporal segmentation is performed using shot boundary detection and active speaker detection (ASD), followed by filtering of human subjects via face and body detection. Each segmented clip then undergoes rigorous evaluation for visual quality, audio quality, and lip synchronization.
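The sketch below strings these preprocessing steps together; every detector and scorer (`detect_shots`, `active_speaker_segments`, `has_valid_face_and_body`, `score_visual`, `score_audio`, `score_lipsync`) is a hypothetical placeholder for an off-the-shelf tool, and the thresholds are assumptions.

```python
def preprocess(video, v_thr=0.8, a_thr=0.8, sync_thr=0.7):
    """Hypothetical filtering flow: shot boundaries -> ASD -> subject filtering
    -> quality and lip-sync gates. Thresholds are illustrative only."""
    clips = []
    for shot in detect_shots(video):                      # shot boundary detection
        for seg in active_speaker_segments(shot):         # active speaker detection
            if not has_valid_face_and_body(seg):          # face / body detection filter
                continue
            if (score_visual(seg) > v_thr and score_audio(seg) > a_thr
                    and score_lipsync(seg) > sync_thr):   # quality & sync gates
                clips.append(seg)
    return clips
```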

Annotation and Data Construction: This stage includes quality assessment, captioning, emotion labeling, and automatic speech recognition (ASR) transcription. A portion of the single-person data is adapted into conversational formats using semantic analysis and text-to-speech (TTS) synthesis.

Post-processing: Annotated data undergoes manual review combined with automatic sampling to ensure balanced, high-quality subsets.

The final dataset contains approximately 20,000 hours of pre-training video data and over 400 hours of duplex fine-tuning (SFT) data.

Long video results

After fine-tuning on a designated character, the system supports multilingual audio-driven generation of long-duration videos.

Duplex Data Fine-Tuning

The pretrained model is further adapted on 400 hours of full-duplex conversational data, allowing it to condition on dual-stream audio inputs and produce videos with seamless transitions between talking and listening modes.
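One way to pack the dual-stream condition, consistent with the chunk layout described earlier, is sketched below; the embedder and the per-speaker ordering are assumptions, not the released format.

```python
import torch

def duplex_audio_tokens(audio_a, audio_b, embed_audio):
    """Embed both speakers' audio for the current chunk and concatenate them,
    so the model always sees who is speaking and who is listening (a silent
    stream still contributes tokens, which is what enables listening behavior)."""
    tok_a = embed_audio(audio_a)     # [T, d] speaker A (possibly silence)
    tok_b = embed_audio(audio_b)     # [T, d] speaker B (possibly silence)
    return torch.cat([tok_a, tok_b], dim=0)
```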

Our system enables natural turn-taking dialogue between digital avatars with synchronized audio-visual responses. Each avatar maintains appropriate listening expressions when the other is speaking, and becomes animated with synchronized lip movements and facial expressions when driven by its corresponding audio input. The audio waveforms (visualized in blue and green) clearly indicate the speaking turns. This demonstrates the model's ability to generate contextually appropriate reactions and maintain speaker identity while handling the complex dynamics of conversational interaction.


General Interactive Video Generation

Our model architecture can flexibly accommodate arbitrary modal conditions as inputs, making it seamlessly applicable to general interactive video generation tasks. By reformulating multimodal conditions into directional control signals and training on the Minecraft dataset, our approach effectively serves as a world model. Experimental results demonstrate that our world model achieves strong 3D consistency and exhibits notable memory capabilities.
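For illustration, directional control can be injected in place of the audio/pose tokens of each chunk; the action vocabulary and token ids below are assumptions for the sketch.

```python
# Hypothetical action-to-token mapping for the Minecraft world-model setting.
ACTION_VOCAB = {"forward": 0, "back": 1, "left": 2, "right": 3,
                "turn_left": 4, "turn_right": 5, "jump": 6, "noop": 7}

def actions_to_control_tokens(actions):
    """Map a per-frame action sequence to discrete control-token ids fed to the
    model in place of the audio/pose conditions."""
    return [ACTION_VOCAB.get(a, ACTION_VOCAB["noop"]) for a in actions]

# actions_to_control_tokens(["forward", "forward", "turn_left"]) -> [0, 0, 4]
```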

Limitations

The current model exhibits limitations in generalization capability; when using arbitrary images as the initial frame, the generated videos suffer from issues in identity preservation, temporal consistency, and stability, making it infeasible to perform long-duration inference while maintaining high quality. Addressing these shortcomings may require larger-scale datasets and more powerful models in the future.