Tao Jin  (金涛)
Research Interests: Multimedia Analysis, Computer Vision, Natural Language Learning, Transfer Learning
Address: Hangzhou/Ningbo, Zhejiang Province
Email: jint_zju@zju.edu.cn

Education


Work Experience

  • Research Intern at Taobao Research
    Taobao Lab
    June 2020 - Sep 2020       Hangzhou, China
  • Research Intern at Kuake Research
    Kuake Lab
    Nov 2019 - Feb 2020       Hangzhou, China

Supervised and Co-supervised Students

  • School of Software, Zhejiang University
    Wang Lin (2021, linwanglw@zju.edu.cn, National Scholarship, PhD at ZJU),
    Linjun Li (2021, lilinjun21@zju.edu.cn, National Scholarship, Beidou Plan of Meituan),
    Xize Cheng (2021, chengxize@zju.edu.cn, National Scholarship, PhD at ZJU),
    Ye Wang (2021, yew@zju.edu.cn, National Scholarship, Daka of Tencent & 2-1 of ByteDance),
    Zirun Guo (2024),
    Weicai Yan (2024),
    Dongjie Fu (2024),
    Fangming Feng (2024),

Publications (* denotes equal contribution, & denotes corresponding author)

  1. Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning
    Dongjie Fu, Fangming Feng, Xize Cheng, Linjun Li, Zhou Zhao, Tao Jin,
    arXiv

  2. ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks
    Jiayang Xu, Fan Zhuo, Majun Zhang, Changhao Pan, Zehan Wang, Siyu Chen, Xiaoda Yang, Tao Jin, Zhou Zhao,
    arXiv

  3. X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs
    Di Cao, Dongjie Fu, Hai Yu, Siqi Zheng, Xu Tan, Tao Jin,
    arXiv

  4. LLM-I: LLMs are Naturally Interleaved Multimodal Creators
    Zirun Guo, Feng Zhang, Kai Jia, Tao Jin,
    arXiv

  5. Unleashing the Power of Natural Audio Featuring Multiple Sound Sources
    Xize Cheng, Slytherin Wang, Zehan Wang, Rongjie Huang, Tao Jin, Zhou Zhao,
    arXiv

  6. OmniChat: Enhancing Spoken Dialogue Systems with Scalable Synthetic Data for Diverse Scenarios
    Xize Cheng, Dongjie Fu, Tao Jin, Zhou Zhao,
    arXiv

  7. Observe-R1: Unlocking Reasoning Abilities of MLLMs with Dynamic Progressive Reinforcement Learning
    Zirun Guo, Minjie Hong, Tao Jin,
    arXiv

  8. Proact-VL: A Proactive VideoLLM for Real-Time AI Companions
    Weicai Yan, Yuhong Dai, Qi Ran, Haodong Li, Wang Lin, Hao Liao, Xing Xie, Tao Jin, Jianxun Lian,
    ICML, 2026

  9. From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning
    Xiaoda Yang, Yuxiang Liu, Shenzhou Gao, Can Wang, Jingyang Xue, Lixin Yang, Yao Mu, Tao Jin, Shuicheng Yan, Zhimeng Zhang, Zhou Zhao,
    ICML, 2026

  10. Text-Guided Multi-Scale Frequency Representation Adaptation
    Weicai Yan, Xinhua Ma, Wang Lin, Tao Jin,
    ACL, 2026

  11. Rectifying the Emotional Flow: Aligning Priors and Dynamic Guidance for High-Arousal Text-to-Speech
    Fangming Feng, Dongjie Fu, Zequn Xie, Yu Zhang, Yangyang Wu, Zhou Zhao, Tao Jin,
    ACL, 2026

  12. SAME: Signer-Aware Mixture-of-Experts for Test-time Adaptation in Sign Language Translation
    Lujia Yang, Weicai Yan, Yongbo He, Qifei Zhang, Tao Jin, Jinshan Zhang, Meng Xi, Jianwei Yin,
    ACL, 2026

  13. Generative-to-Discriminative Test-Time Adaptation via Manifold-Aware Diffusion and Bayesian Distillation
    Boyun Zhang, Zequn Xie, Fangming Feng, Qifei Zhang, Tao Jin,
    ACL, 2026

  14. Bridging the Pose-Semantic Gap: A Cascade Framework for Text-Based Person Anomaly Search
    Zequn Xie, Guijin Luo, Chuxin Wang, Sihang Cai, Tao Jin, Zhou Zhao, Yixuan Tang,
    ACL, 2026

  15. View-R1: Asymmetric Policy Optimization for Difficulty-Aware Multimodal Reinforcement Learning
    Minjie Hong, Zirun Guo, Yan Xia, Zehan Wang, Ziang Zhang, Tao Jin, Zhou Zhao,
    ACL, 2026

  16. Dual-Pathway and Dual-View Representation Learning for Bridging Information Asymmetry in Text-Video Retrieval
    Zequn Xie, Xin Liu, Fangming Feng, Boyun Zhang, Tao Jin,
    ACL, 2026

  17. Thinking with Programming Vision: Towards a Unified View for Thinking with Images
    Zirun Guo, Minjie Hong, Feng Zhang, Kai Jia, Tao Jin,
    CVPR, 2026

  18. Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation
    Yongbo He, Zirun Guo, Tao Jin,
    CVPR, 2026

  19. SSGaussian: Semantic-Aware and Structure-Preserving 3D Style Transfer
    Jimin Xu, Bosheng Qin, Tao Jin, Zhou Zhao, Zhenhui Ye, Jun Yu, Fei Wu,
    ICME, 2026

  20. Emphasizing Domain Differences through Interactive-Augmented Prompts in Continual AVSR
    Dongjie Fu, Xize Cheng, Tao Jin, Zhongfei Zhang,
    TIP, 2026

  21. MARS-Sep: Multimodal-Aligned Reinforced Sound Separation
    Zihan Zhang, Xize Cheng, Zhennan Jiang, Dongjie Fu, Jingyuan Chen, Zhou Zhao, Tao Jin,
    ICLR, 2026

  22. WorldEdit: Towards Open-World Image Editing with a Knowledge-Informed Benchmark
    Wang Lin, Feng Wang, Majun Zhang, Wentao Hu, Tao Jin, Zhou Zhao, Fei Wu, Jingyuan Chen, Sucheng Ren, Alan Yuille,
    ICLR, 2026

  23. AlignSep: Temporally-Aligned Video-Queried Sound Separation with Flow Matching
    Xize Cheng, Chenyuhao Wen, Tao Jin, Zhou Zhao,
    ICLR, 2026

  24. HVD: Human Vision-Driven Video Representation Learning for Text-Video Retrieval
    Zequn Xie, Xin Liu, Boyun Zhang, Yuxiao Lin, Sihang Cai, Tao Jin,
    ICASSP, 2026

  25. Scene-Aware Spatiotemporal Generalization: Towards Robust Temporal Action Detection Across Domains
    Fangming Feng, Sihang Cai, Zequn Xie, Yangyang Wu, Tao Jin,
    AAAI, 2026

  26. AHa-Bench: Benchmarking Audio Hallucinations in Large Audio-Language Models
    Xize Cheng, Dongjie Fu, Tao Jin, Zhou Zhao,
    NeurIPS, 2025

  27. Multi-talker Audio-Visual Speech Recognition towards Diverse Scenarios
    Yuxiao Lin, Tao Jin, Xize Cheng, Zhou Zhao, Fei Wu,
    FITEE, 2025

  28. TAG: Triple Alignment with Rationale Generation for Knowledge-based Visual Question Answering
    Sihang Cai, Tao Jin, Zhou Zhao, Fei Wu, Jun Yu,
    TBD, 2025

  29. PA-Chat: Persona-Aware Speech Assistant for Multi-party Dialogue
    Dongjie Fu, Xize Cheng, Linjun Li, Tao Jin,
    EMNLP, 2025

  30. Chat-Driven Text Generation and Interaction for Person Retrieval
    Zequn Xie, Chuxin Wang, Sihang Cai, Shulei Wang, Tao Jin,
    EMNLP, 2025

  31. ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting
    Yu Zhang, Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Tao Jin, Zhou Zhao,
    ACM MM, 2025

  32. Parameter-efficient Task-Aware Prompting for Adverse Weather Removal
    Hanting Wang, Shengpeng Ji, Shulei Wang, Hai Huang, Tao Jin,
    ACM MM, 2025

  33. Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation
    Wenrui Liu, Qian Chen, Wen Wang, Guanrou Yang, Weiqin Li, Xiaoda Yang, Tao Jin, Jin Xu, Zemin Liu
    ACM MM, 2025

  34. Open-set Cross Modal Generalization via Multimodal Unified Representation
    Hai Huang, Yan Xia, Shulei Wang, Hanting Wang, Minghui Fang, Shengpeng Ji, Sashuai Zhou, Tao Jin, Zhou Zhao,
    ICCV, 2025

  35. Multimodal Conditional Retrieval with High Controllability
    Xiaoda Yang, Xize Cheng, Zhou Zhao, Tao Jin,
    KDD, 2025

  36. T2A-Feedback: Improving Basic Capabilities of Text-to-Audio Generation via Fine-grained AI Feedback
    Zehan Wang, Ke Lei, Chen Zhu, Jiawei Huang, Xize Cheng, Shengpeng Ji, Zhenhui Ye, Tao Jin, Zhou Zhao,
    ACL (ORAL), 2025

  37. TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis
    Yu Zhang, Wenxiang Guo, Changhao Pan, Dongyu Yao, Zhiyuan Zhu, Ziyue Jiang, Yuhan Wang, Tao Jin, Zhou Zhao,
    ACL, 2025

  38. IRBridge: Solving Image Restoration Bridge with Pre-trained Generative Diffusion Models
    Hanting Wang, Tao Jin, Wang Lin, Zhou Zhao,
    ICML, 2025

  39. Robust Speech-Driven Body Language Generation
    Xize Cheng, Zehan Wang, Rongjie Huang, Huadai Liu, Tao Jin,
    Interspeech (ORAL), 2025

  40. Vela: Scalable Embeddings with Voice Large Language Models for Multimodal Retrieval
    Ruofan Hu, Yan Xia, Mingjie Hong, Jieming Zhu, Xiaoda Yang, Minghui Fang, Tao Jin,
    Interspeech, 2025

  41. Recognize-and-tell: Generating Video Captions with Textual Cue in Scene
    Tao Jin, Wang Lin, Zhou Zhao, Zhongfei Zhang,
    ESWA, 2025

  42. Concept Preservation and Unbinding in Continual Diffusion Customization
    Zirun Guo, Tao Jin,
    CVPR, 2025

  43. Towards Transformer-Based Aligned Generation with Self-Coherence Guidance
    Shulei Wang, Wang Lin, Tao Jin, Zhou Zhao,
    CVPR, 2025

  44. Non-Natural Image Understanding with Advancing Frequency-based Vision Encoders
    Wang Lin, Tao Jin, Zhou Zhao, Jingyuan Chen,
    CVPR, 2025

  45. SpatialCLIP: Learning 3D-aware Image Representations from Spatially Discriminative Language
    Zehan Wang, Sashuai Zhou, Shaoxuan He, Haifeng Huang, Lihe Yang, Ziang Zhang, Tao Jin, Hengshuang Zhao, Zhou Zhao,
    CVPR, 2025

  46. Efficient Prompting for Continual Adaptation to Missing Modalities
    Zirun Guo, Shulei Wang, Wang Lin, Weicai Yan, Yangyang Wu, Tao Jin,
    NAACL, 2025

  47. Omni-Chart: A Comprehensive Dataset of Chart Types for Chart Understanding
    Shulei Wang, Shuai Yang, Wang Lin, Zirun Guo, Sihang Cai, Hai Huang, Ye Wang, Jingyuan Chen, Tao Jin&,
    NAACL, 2025

  48. Chat-3D: Data-Efficiently Learn Large Language Model for Universal 3D Scene Perception
    Zehan Wang, Haifeng Huang, Yang Zhao, Ziang Zhang, Tao Jin, Zhou Zhao,
    NAACL, 2025

  49. Smoothing the Shift: Towards Stable Test-time Adaptation under Complex Multimodal Noises
    Zirun Guo, Tao Jin&,
    ICLR, 2025

  50. Diff-Prompt: Diffusion-driven Prompt Generator with Mask Supervision
    Weicai Yan, Wang Lin, Tao Jin&,
    ICLR, 2025

  51. OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup
    Xize Cheng, Tao Jin, Zhou Zhao,
    ICLR, 2025

  52. Improving Multi-modal Representations via Binding Space in Scale
    Zehan Wang, Ziang Zhang, Minjie Hong, Tao Jin, Hengshuang Zhao, Zhou Zhao,
    ICLR, 2025

  53. VoxDialogue: Can Spoken Dialogue Systems Understand Information Beyond Words?
    Xize Cheng, Tao Jin, Zhou Zhao,
    ICLR, 2025

  54. Curriculum Learning aided Audio-Visual Speech Recognition with Arbitrary Number of Overlapping Speakers
    Yuxiao Lin, Tao Jin&, Xize Cheng, Zhou Zhao, Fei Wu,
    ICASSP, 2025

  55. Bridging the Gap for Test-time Multimodal Sentiment Analysis
    Zirun Guo, Tao Jin&, Wenlong Xu, Wang Lin, Yangyang Wu,
    AAAI, 2025

  56. Low-rank Sequence Adapter for Efficient Multimodal Transfer Learning
    Zirun Guo, Xize Cheng, Yangyang Wu, Tao Jin&,
    AAAI, 2025

  57. Speech Watermarking with Discrete Intermediate Representations
    Shengpeng Ji, Ziyue Jiang, Jialong Zuo, Minghui Fang, Yifu Chen, Tao Jin, Zhou Zhao,
    AAAI, 2025

  58. Exploring Embodied Emotion Through A Large-Scale Egocentric Video Dataset
    Wang Lin, Tao Jin, Zhou Zhao, Chang Yao, Jingyuan Chen,
    NeurIPS, 2024

  59. Extending Multi-modal Contrastive Representations
    Ziang Zhang, Zehan Wang, Luping Liu, Yang Zhao, Tao Jin, Zhou Zhao
    NeurIPS, 2024

  60. Action Imitation in Common Action Space for Customized Action Image Synthesis
    Wang Lin, Jingyuan Chen, Zirun Guo, Tao Jin, Zhou Zhao,
    NeurIPS, 2024

  61. Balancing Multimodal Learning with Classifier-guided Gradient Modulation
    Zirun Guo, Tao Jin&,
    NeurIPS, 2024

  62. AudioVSR: Enhancing Video Speech Recognition with Audio Data
    Xiaoda Yang, Xize Cheng, Tao Jin&,
    EMNLP, 2024

  63. Calibrating Prompt from History for Continual Vision-Language Retrieval and Grounding
    Tao Jin, Weicai Yan, Ye Wang, Zhou Zhao,
    ACM MM, 2024

  64. Boosting Speech Recognition Robustness to Modality-Distortion with Contrast-Augmented Prompts
    Dongjie Fu, Xize Cheng, Xiaoda Yang, Tao Jin&, Zhou Zhao,
    ACM MM (ORAL), 2024

  65. SyncTalklip: Highly Synchronized Lip-Readable Speaker Generation with Multi-Task Learning
    Xiaoda Yang, Xize Cheng, Dongjie Fu, Tao Jin&, Zhou Zhao,
    ACM MM, 2024

  66. Low-rank Prompt Interaction for Continual Vision-Language Retrieval
    Weicai Yan, Ye Wang, Tao Jin&, Zhou Zhao,
    ACM MM, 2024

  67. TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation
    Xize Cheng, Tao Jin, Zhou Zhao,
    ACL, 2024

  68. Uni-Dubbing: Zero-Shot Speech Synthesis from Visual Articulation
    Songju Lei, Xize Cheng&, Tao Jin, Zhou Zhao,
    ACL, 2024

  69. Rethinking the Multimodal Correlation of Multimodal Sequential Learning
    Tao Jin, Zhou Zhao,
    ACL, 2024

  70. Multimodal Prompt Learning with Missing Modalities for Sentiment Analysis and Emotion Recognition
    Zirun Guo, Tao Jin&, Zhou Zhao,
    ACL, 2024

  71. Two-Stream Generative Recommender with Behavior-Semantic Collaboration
    Ye Wang, Jiahao Xun, Minjie Hong, Jieming Zhu, Tao Jin, Zhou Zhao,
    KDD, 2024

  72. Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt
    Yongqi Wang, Ruofan Hu, Rongjie Huang, Zhiqing Hong, Ruiqi Li, Wenrui Liu, Fuming You, Tao Jin, Zhou Zhao,
    NAACL, 2024

  73. Non-confusing Generation of Customized Concepts in Diffusion Models
    Wang Lin, Jingyuan Chen, Tao Jin, Zhou Zhao,
    ICML, 2024

  74. Molecule-Space: Free Lunch in Unified Multimodal Space via Knowledge Fusion
    Zehan Wang, Ziang Zhang, Xize Cheng, Rongjie Huang, Luping Liu, Tao Jin, Zhou Zhao,
    ICML, 2024

  75. MPOD123: One Image to 3D Content Generation Using Mask-enhanced Progressive Optimization
    Jimin Xu*, Tianbao Wang*, Tao Jin&, Zhou Zhao,
    CVPR, 2024

  76. Rethinking Missing Modality Learning from a Decoding Perspective
    Tao Jin, Zhou Zhao,
    ACM MM, 2023

  77. Exploring Group-Based Video Captioning with Efficient Relational Approximation
    Wang Lin*, Tao Jin*, Ye Wang, Zhou Zhao,
    ICCV, 2023

  78. Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition
    Xize Cheng*, Tao Jin*, Linjun Li, Zhou Zhao,
    ICCV, 2023

  79. Multi-Granularity Relational Attention Network for Audio-Visual QA
    Linjun Li*, Tao Jin*, Wang Lin, Hao Jiang, Zhou Zhao,
    TCSVT, 2023

  80. OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment
    Xize Cheng*, Tao Jin*, Linjun Li, Wang Lin, Xinyu Duan,
    ACL (ORAL), 2023

  81. TAVT: Towards Transferable Audio-Visual Text Generation
    Wang Lin*, Tao Jin*, Ye Wang, Wenwen Pan, Xize Cheng, Linjun Li, Zhou Zhao
    ACL, 2023

  82. Semantic-Conditioned Dual Adaptation for Query-based Visual Segmentation
    Ye Wang*, Tao Jin*, Wang Lin, Xize Cheng, Linjun Li, Zhou Zhao
    ACL, 2023

  83. Weakly-Supervised Spoken Video Grounding via Semantic Interaction Learning
    Ye Wang, Wang Lin, Shengyu Zhang, Tao Jin, Zhou Zhao
    ACL (ORAL), 2023

  84. Contrastive Token-Wise Meta-Learning for Unseen Temporal-Aligned Translation
    Linjun Li*, Tao Jin*, Xize Cheng, Ye Wang, Wang Lin, Rongjie Huang, Zhou Zhao,
    ACL, 2023

  85. DATE: Domain Adaptive Product Seeker for E-commerce
    Haoyuan Li, Hao Jiang, Tao Jin, Mengyan Li, Yan Chen, Zhijie Lin, Yang Zhao, Zhou Zhao,
    CVPR, 2023

  86. Gloss Attention for Gloss-free Sign Language Translation
    Aoxiong Yin, Tianyun Zhong, Li Tang, Weike Jin, Tao Jin, Zhou Zhao,
    CVPR, 2023

  87. Interaction Augmented Transformer with Decoupled Decoding for Video Captioning
    Tao Jin, Zhou Zhao, Peng Wang, Jun Yu, Fei Wu,
    Neurocom., 2022

  88. MC-SLT: Towards Low-Resource Signer-Adaptive Sign Language Translation
    Tao Jin, Zhou Zhao, Meng Zhang, Xingshan Zeng,
    ACM MM, 2022

  89. Prior Knowledge and Memory Enriched Transformer for Sign Language Translation
    Tao Jin, Zhou Zhao, Meng Zhang, Xingshan Zeng,
    ACL, 2022

  90. Generalizable Multi-Linear Attention Network
    Tao Jin, Zhou Zhao,
    NeurIPS, 2021

  91. Contrastive Disentangled Meta-Learning for Signer-Independent Sign Language Translation
    Tao Jin, Zhou Zhao,
    ACM MM (ORAL), 2021

  92. Dual Low-Rank Multimodal Fusion
    Tao Jin*, Siyu Huang*, Yingming Li, Zhongfei Zhang
    EMNLP, 2020

  93. SBAT: Video Captioning with Sparse Boundary-Aware Transformer
    Tao Jin, Siyu Huang, Ming Chen, Yingming Li, Zhongfei Zhang
    IJCAI, 2020

  94. Low-Rank HOCA: Efficient High-Order Cross-Modal Attention for Video Captioning
    Tao Jin, Siyu Huang, Yingming Li, Zhongfei Zhang,
    EMNLP, 2019

  95. Recurrent Convolutional Video Captioning with Global and Local Attention
    Tao Jin, Yingming Li, Zhongfei Zhang,
    Neurocom., 2019


Contest