Sixun Dong (Ironieser)
Multimodal Learning, VLM, LLM Agent
Independent Researcher, AZ, USA
Currently, I am an independent researcher. I completed my Master's at ShanghaiTech University under Professor Shenghua Gao.
My research focuses on multimodal AI systems that bridge computer vision, natural language processing, and machine learning, across a range of applications.
Selected Publications
arXiv'2602 To Think or Not To Think, That is The Question for Large Reasoning Models in Theory of Mind Tasks
Investigates the capabilities of Large Reasoning Models (like OpenAI o1/o3) on Theory of Mind tasks, revealing that continuous reasoning does not always guarantee better socially-aware outcomes.
ICASSP'26 Towards Robust Dysarthric Speech Recognition: LLM-Agent Post-ASR Correction Beyond WER
Proposes an LLM-agent-based post-ASR correction framework for dysarthric speech recognition that focuses on semantic accuracy rather than just minimizing Word Error Rate (WER).
ICLR'26 MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
Proposes MMTok to efficiently select informative vision tokens via a multimodal coverage maximization strategy, significantly accelerating VLM inference while maintaining model performance.
arXiv'2506 Teaching Time Series to See and Speak: Forecasting with Aligned Visual and Textual Perspectives
Proposes TimesCLIP, innovatively aligning time series data with visual and textual multi-modal perspectives to effectively enhance both short-term and long-term forecasting performance.
NeurIPS'25 Sculpting Features from Noise: Reward-Guided Hierarchical Diffusion for Task-Optimal Feature Transformation
Proposes a reward-guided hierarchical diffusion model that "sculpts" data representations from noise to generate optimal feature transformations for specific downstream tasks.
Introduces the MLLM-Tool framework and contributes a specialized dataset to empower Multimodal Large Language Models (MLLMs) with the ability to understand, learn, and invoke external tools.
CVPR'23 Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos
Proposes a weakly supervised video representation learning approach that utilizes unaligned text for sequential videos, significantly reducing the reliance on fine-grained video-text alignment annotations.
CVPR'22 Oral TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting
Introduces the RepCount dataset and pioneers the use of regression density maps alongside a Transformer-based architecture to encode multi-scale temporal correlations, significantly improving repetitive action counting.
Experience
GenAI Research Intern
Zoom Inc., GenAI Research Group
May 2025 - Aug 2025
Worked on VLMs and LLM agents. Published one first-author paper on efficient VLM inference and two collaborative papers on LLM evaluation.
Research Intern (Team Leader)
DGene, Digital Human Algorithm Department
Aug 2023 - Jan 2024
Led digital human projects on co-speech gesture generation and 3D human body reconstruction, achieving under 7% measurement error in reconstruction.
Research Intern (Team Leader)
Transsion Holdings, Audio-Video Generation Department
Apr 2023 - Aug 2023
Led research on audio-driven talking-head video generation, achieving state-of-the-art performance on both commercial and academic benchmarks.
Academic Service
Reviewer
Conferences: CVPR (2023–2026), ICCV (2023, 2025), ECCV (2024, 2026), NeurIPS (2025), ICML (2025, 2026), ICLR (2026), ACM MM (2023–2025), ACCV (2024), KDD (2025)
Journals: IEEE Transactions on Multimedia, Neural Networks (Elsevier), ACM Transactions on Knowledge Discovery from Data
Education
M.S. in Computer Science
ShanghaiTech University, China
SVIP-Lab, Advisor: Prof. Shenghua Gao
B.E. in Computer Science (Dual Degree)
Dalian University of Technology, China
B.E. in Process Equipment and Control Engineering
Dalian University of Technology, China