Publications
My research focuses on multimodal AI systems, spanning multimodal understanding, generative modeling, and agentic systems. My work has evolved from visual-temporal understanding (CVPR Oral'22, CVPR'23) to generative modeling (3DV'24, NeurIPS'25), efficient multimodal inference (ICLR'26), and agentic systems with reasoning and tool use (WACV'25, ICASSP'26); my current focus is system-algorithm co-design for efficient multimodal modeling and agentic reasoning.
2026
arXiv'2604 Rethinking Model Efficiency: Multi-Agent Inference with Large Models
Analyzes end-to-end VLM latency and shows that output token length dominates inference cost: large models with short outputs outperform small models with long generations, yet reasoning remains essential for complex tasks. Proposes a multi-agent framework that bridges this gap by having a small model generate the reasoning tokens and transfer them to a large model, achieving large-model accuracy with minimal latency.
arXiv'2602 To Think or Not To Think, That is The Question for Large Reasoning Models in Theory of Mind Tasks
Investigates the capabilities of Large Reasoning Models (such as OpenAI o1/o3) on Theory of Mind tasks, revealing that longer reasoning does not always guarantee better socially aware outcomes.
ICASSP'26 Towards Robust Dysarthric Speech Recognition: LLM-Agent Post-ASR Correction Beyond WER
Proposes an LLM-agent-based post-ASR correction framework for dysarthric speech recognition that focuses on semantic accuracy rather than just minimizing Word Error Rate (WER).
2025
ICLR'26 MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
Proposes MMTok, which selects informative vision tokens via a multimodal coverage maximization strategy, significantly accelerating VLM inference while maintaining model performance.
arXiv'2508 LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
Constructs a stress-testing benchmark specifically designed to evaluate and diagnose the stability and limitations of MCP-enabled agents when handling complex, challenging queries.
Introduces an automated data generation framework and evaluation benchmark for complex logical instructions to assess and enhance the multi-step reasoning capabilities of LLMs.
arXiv'2506 Teaching Time Series to See and Speak: Forecasting with Aligned Visual and Textual Perspectives
Proposes TimesCLIP, which aligns time series data with visual and textual multimodal perspectives to improve both short-term and long-term forecasting performance.
Under Review TimesFrame: Multi-Variable Time Series is a Video of Numerical Data
Paper (Coming Soon)
arXiv'2505 Agentic Feature Augmentation: Unifying Selection and Generation with Teaming, Planning, and Memories
Builds an LLM-driven agentic framework that unifies feature selection and generation through collaborative teaming, task planning, and memory mechanisms.
NeurIPS'25 Sculpting Features from Noise: Reward-Guided Hierarchical Diffusion for Task-Optimal Feature Transformation
Proposes a reward-guided hierarchical diffusion model that "sculpts" data representations from noise to generate optimal feature transformations for specific downstream tasks.
Under Review MECT: From Multimodal Knowledge Acquisition To Contrastive Embedding Construction For Generative Feature Transformation
Paper (Coming Soon)
AAAI'26 Efficient Post-Training Refinement of Latent Reasoning in Large Language Models
Proposes an efficient post-training refinement strategy that optimizes the logical reasoning capabilities of LLMs within their latent representation space.
AAAI'26 Brownian Bridge Augmented Surrogate Simulation and Injection Planning for Geological CO2 Storage
Uses Brownian bridge-augmented surrogate models to provide efficient simulation and injection planning for optimizing geological CO2 storage strategies.
IJCAI'25 Unsupervised Feature Transformation via In-Context Generation, Generator-Critic LLM Agents, and Duet-Play Teaming
Presents a fully unsupervised feature transformation framework that pairs generator-critic LLM agents with in-context generation in a duet-play teaming scheme to automatically extract high-quality features.
2024
Introduces the MLLM-Tool framework and contributes a specialized dataset to empower Multimodal Large Language Models (MLLMs) with the ability to understand, learn, and invoke external tools.
2023
CVPR'23 Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos
Proposes a weakly supervised video representation learning approach that utilizes unaligned text for sequential videos, significantly reducing the reliance on fine-grained video-text alignment annotations.
2022
CVPR'22🏆 Oral TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting
Introduces the RepCount dataset and pioneers the use of regression density maps alongside a Transformer-based architecture to encode multi-scale temporal correlations, significantly improving repetitive action counting.
Survey Papers
TKDD'26 Towards Data-Centric AI: A Comprehensive Survey of Traditional, Reinforcement, and Generative Approaches for Tabular Data Transformation
A comprehensive survey mapping the evolution of Data-Centric AI in tabular data transformation, covering traditional statistical methods, reinforcement learning, and generative AI approaches.
arXiv'2502 A Survey on Data-Centric AI: Tabular Learning from Reinforcement Learning and Generative AI Perspective
A systematic review focusing on how state-of-the-art reinforcement learning and generative AI models can be leveraged to improve the quality and efficiency of tabular data learning.