The rapid progress of large language models (LLMs) has catalyzed the emergence of multimodal large language models (MLLMs) that unify visual understanding and image generation within a single framework. However, most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development, such as the raster-scan order in image generation and restricted causal context modeling. In this work, we challenge the dominance of AR-based approaches by introducing
FUDOKI, a unified multimodal model based on discrete flow matching, as an alternative to conventional AR paradigms. By leveraging metric-induced probability paths with kinetic-optimal velocities, our framework goes beyond the conventional masking-based corruption process, enabling iterative refinement and richer bidirectional context integration during generation. To mitigate the high cost of training from scratch, we initialize FUDOKI from pre-trained AR-based MLLMs and adaptively transition it to the discrete flow matching paradigm. Experimental results show that FUDOKI achieves performance comparable to state-of-the-art AR-based MLLMs on both visual understanding and image generation tasks, highlighting its potential as a foundation for next-generation unified multimodal models.
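For readers curious what such a sampler can look like in practice, here is a minimal, self-contained PyTorch sketch of a discrete flow matching sampling loop. It assumes a uniform-noise source, a simple linear mixture-path jump schedule, and a stand-in network; the names dfm_sample and ToyDenoiser and the simplified update rule are illustrative assumptions only and do not reproduce FUDOKI's metric-induced probability paths or kinetic-optimal velocities.

import torch

class ToyDenoiser(torch.nn.Module):
    """Stand-in for the real denoising network: returns random logits."""
    def __init__(self, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size

    def forward(self, x, t):
        # A trained model would predict, for each position, a distribution
        # over the clean token x_1 given the current noisy sequence x_t.
        return torch.randn(x.shape[0], self.vocab_size)

def dfm_sample(model, seq_len, vocab_size, num_steps=50):
    # Start from pure noise: every position is a uniformly random token
    # (unlike mask-based diffusion, there is no special [MASK] symbol).
    x = torch.randint(0, vocab_size, (seq_len,))
    ts = torch.linspace(0.0, 1.0, num_steps + 1)
    for i in range(num_steps):
        t, dt = ts[i].item(), (ts[i + 1] - ts[i]).item()
        p1 = torch.softmax(model(x, t), dim=-1)  # predicted clean-token posterior
        # Jump probability under a simple linear mixture path: dt / (1 - t).
        jump_prob = min(1.0, dt / max(1.0 - t, 1e-6))
        jump = torch.rand(seq_len) < jump_prob
        proposals = torch.multinomial(p1, num_samples=1).squeeze(-1)
        # Any position may be rewritten at any step, including positions that
        # were already changed earlier; this is what enables revision.
        x = torch.where(jump, proposals, x)
    return x

print(dfm_sample(ToyDenoiser(vocab_size=16), seq_len=8, vocab_size=16))

In FUDOKI the transition schedule is derived from kinetic-optimal velocities over metric-induced probability paths rather than the fixed 1/(1 - t) rate above; the sketch only conveys the iterative, fully revisable nature of the generation process.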
Discrete flow matching denoises step by step, transforming noise into vivid images.
Tokens can be revised iteratively: even tokens that were already changed can be updated again (red border), while the final answer is marked with a yellow dashed border, illustrating flexible step-by-step reasoning.
Qualitative Results of Visual Generation and Understanding Capabilities of FUDOKI. FUDOKI is built on the discrete flow matching framework for both visual and textual modalities, and performs understanding and generation under one unified paradigm.
Generation processes of different methods:
(a) AR-based Janus: Generates tokens strictly sequentially; if an error is made at an early step, all subsequent outputs propagate this mistake.
(b) D-DiT (mask-based discrete diffusion, MDD): Cannot revise a token once it has been unmasked, making errors irreversible and leading to poor generalization.
(c) FUDOKI (discrete flow matching, DFM): Allows generated tokens to be revised in subsequent steps, enabling step-by-step reasoning and error correction for more accurate answers (a toy sketch contrasting these update rules follows below).
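The toy snippet below, in the same illustrative PyTorch style as the sketch above, contrasts one update step of a mask-based rule, where a token is frozen once revealed, with a DFM-style rule, where every position stays revisable. MASK_ID, mdd_step, and dfm_step are hypothetical names chosen for the example; neither function reflects the actual D-DiT or FUDOKI implementations, and the AR case is omitted because it simply appends one token at a time from left to right.

import torch

MASK_ID = -1  # illustrative id for the special [MASK] token in the MDD branch

def mdd_step(x, proposals, unmask_prob):
    """Mask-based discrete diffusion: only still-masked positions may change,
    and once a token has been revealed it stays fixed for all later steps."""
    still_masked = x == MASK_ID
    reveal = still_masked & (torch.rand(x.shape[0]) < unmask_prob)
    return torch.where(reveal, proposals, x)

def dfm_step(x, proposals, jump_prob):
    """Discrete flow matching: every position may be re-sampled at every step,
    so a mistake made earlier can still be corrected later."""
    jump = torch.rand(x.shape[0]) < jump_prob
    return torch.where(jump, proposals, x)

x_mdd = torch.full((6,), MASK_ID)       # MDD starts fully masked
x_dfm = torch.randint(0, 16, (6,))      # DFM starts from random tokens
proposals = torch.randint(0, 16, (6,))  # model-proposed clean tokens
print(mdd_step(x_mdd, proposals, unmask_prob=0.5))
print(dfm_step(x_dfm, proposals, jump_prob=0.5))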
@article{wang2025fudokidiscreteflowbasedunified,
  title={FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities},
  author={Jin Wang and Yao Lai and Aoxue Li and Shifeng Zhang and Jiacheng Sun and Ning Kang and Chengyue Wu and Zhenguo Li and Ping Luo},
  year={2025},
  eprint={2505.20147},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.20147}
}