Publications
My research focuses on multimodal AI systems, spanning multimodal understanding, generative modeling, and agentic systems. My work has evolved from visual-temporal understanding (CVPR Oral'22, CVPR'23) to generative modeling (3DV'24, NeurIPS'25), efficient multimodal inference (ICLR'26), and agentic systems with reasoning and tool use (WACV'25, ICASSP'26); my current focus is system-algorithm co-design for efficient multimodal modeling and agentic reasoning.
2026
arXiv'2604 Rethinking Model Efficiency: Multi-Agent Inference with Large Models
Analyzes end-to-end VLM latency and shows that output token length dominates inference cost: large models with short outputs outperform small models with long generations, yet reasoning remains essential for complex tasks. Proposes a multi-agent framework that bridges this gap by having a small model generate the reasoning tokens and transfer them to a large model, achieving large-model accuracy with minimal latency.
arXiv'2602 To Think or Not To Think, That is The Question for Large Reasoning Models in Theory of Mind Tasks
Investigates the capabilities of Large Reasoning Models (such as OpenAI o1/o3) on Theory of Mind tasks, revealing that longer reasoning does not always guarantee better socially aware outcomes.
ICASSP'26 Towards Robust Dysarthric Speech Recognition: LLM-Agent Post-ASR Correction Beyond WER
Proposes an LLM-agent-based post-ASR correction framework for dysarthric speech recognition that focuses on semantic accuracy rather than just minimizing Word Error Rate (WER).
2025
ICLR'26 MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
Proposes MMTok, which selects informative vision tokens via a multimodal coverage maximization strategy, significantly accelerating VLM inference while maintaining model performance.
arXiv'2508 LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
Constructs a stress-testing benchmark specifically designed to evaluate and diagnose the stability and limitations of MCP-enabled agents when handling complex, challenging queries.
Introduces an automated data generation framework and evaluation benchmark for complex logical instructions to assess and enhance the multi-step reasoning capabilities of LLMs.
arXiv'2506 Teaching Time Series to See and Speak: Forecasting with Aligned Visual and Textual Perspectives
Proposes TimesCLIP, which aligns time series data with visual and textual multimodal perspectives to improve both short-term and long-term forecasting performance.
Under Review TimesFrame: Multi-Variable Time Series is a Video of Numerical Data
Paper (Coming Soon)
arXiv'2505 Agentic Feature Augmentation: Unifying Selection and Generation with Teaming, Planning, and Memories
Builds an LLM-driven agentic framework that unifies feature selection and generation through collaborative teaming, task planning, and memory mechanisms.
NeurIPS'25 Sculpting Features from Noise: Reward-Guided Hierarchical Diffusion for Task-Optimal Feature Transformation
Proposes a reward-guided hierarchical diffusion model that "sculpts" data representations from noise to generate optimal feature transformations for specific downstream tasks.
Under Review MECT: From Multimodal Knowledge Acquisition To Contrastive Embedding Construction For Generative Feature Transformation
Paper (Coming Soon)
AAAI'26 Efficient Post-Training Refinement of Latent Reasoning in Large Language Models
Proposes an efficient post-training refinement strategy that optimizes the logical reasoning capabilities of LLMs within their latent representation space.
AAAI'26 Brownian Bridge Augmented Surrogate Simulation and Injection Planning for Geological CO2 Storage
Uses Brownian bridge-augmented surrogate models to provide efficient simulation and injection planning for optimizing geological CO2 storage strategies.
IJCAI'25 Unsupervised Feature Transformation via In-Context Generation, Generator-Critic LLM Agents, and Duet-Play Teaming
Presents a fully unsupervised feature transformation framework that pairs generator-critic LLM agents with in-context generation in a duet-play teaming scheme to automatically extract high-quality features.
2024
Introduces the MLLM-Tool framework and contributes a specialized dataset to empower Multimodal Large Language Models (MLLMs) with the ability to understand, learn, and invoke external tools.
2023
CVPR'23 Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos
Proposes a weakly supervised video representation learning approach that utilizes unaligned text for sequential videos, significantly reducing the reliance on fine-grained video-text alignment annotations.
2022
CVPR'22🏆 Oral TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting
Introduces the RepCount dataset and pioneers the use of regression density maps alongside a Transformer-based architecture to encode multi-scale temporal correlations, significantly improving repetitive action counting.
Survey Papers
TKDD'26 Towards Data-Centric AI: A Comprehensive Survey of Traditional, Reinforcement, and Generative Approaches for Tabular Data Transformation
A comprehensive survey mapping the evolution of Data-Centric AI in tabular data transformation, covering traditional statistical methods, reinforcement learning, and generative AI approaches.
arXiv'2502 A Survey on Data-Centric AI: Tabular Learning from Reinforcement Learning and Generative AI Perspective
A systematic review focusing on how state-of-the-art reinforcement learning and generative AI models can be leveraged to improve the quality and efficiency of tabular data learning.