I am currently a Ph.D. student at the University of Science and Technology of China (USTC), supervised by Prof. Linli Xu. My research interests include:

  • Generative Models
  • Multimodal Learning

Recently, I have been focusing on Multimodal LLMs, specifically exploring unified models across text, vision, and speech, in which multimodal inputs and multimodal outputs are processed by a single neural network.

I am currently seeking an internship (our lab is also open to collaborations) to explore the development of an advanced GPT-4o-like voice model that supports streaming, full-duplex speech dialogue with fine-grained perception, though I am not limited to this topic. Please feel free to reach out to me at zyx2016@mail.ustc.edu.cn.

Publications

2024

  • Addressing Representation Collapse in Vector Quantized Models with One Linear Layer.
    Yongxin Zhu, Bocheng Li, Yifei Xin, Linli Xu.
    ArXiv [Link] [code] (Recommended by Jianlin Su) (Integrated into lucidrains’s vector-quantize-pytorch repo)

  • Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective.
    Yongxin Zhu, Bocheng Li, Hang Zhang, Xin Li, Linli Xu, Lidong Bing.
    In Proceedings of the Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS-24). [Link] [code]

  • Generative Pre-Trained Speech Language Model with Efficient Hierarchical Transformer.
    Yongxin Zhu, Dan Su, Liqiang He, Linli Xu, Dong Yu.
    In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL-24). [Link] [code] (Adopted by Moshi)

  • Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction.
    Haoqiu Yan#, Yongxin Zhu#, Kai Zheng, Bing Liu, Haoyu Cao, Deqiang Jiang, Linli Xu.
    In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL-24). (Oral) [Link] [code]

  • VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs.
    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing.
    ArXiv [Link] [code]

  • Difformer: Empowering Diffusion Models on the Embedding Space for Text Generation.
    Zhujin Gao, Junliang Guo, Xu Tan, Yongxin Zhu, Fang Zhang, Jiang Bian, Linli Xu.
    In 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-24). [Link] [code]

  • Few-shot Temporal Pruning Accelerates Diffusion Models for Text Generation.
    Bocheng Li, Zhujin Gao, Yongxin Zhu, Kun Yin, Haoyu Cao, Deqiang Jiang, Linli Xu.
    In Proceedings of the 31st International Conference on Computational Linguistics (COLING-24). [Link] [code]

  • Visual Hallucination Elevates Audio Speech Recognition.
    Fang Zhang, Yongxin Zhu, Xiangxiang Wang, Huang Chen, Xing Sun, Linli Xu.
    In Proceedings of the 38th AAAI Conference on Artificial Intelligence (AAAI-24). [Link]

2023

  • DiffS2UT: A Semantic Preserving Diffusion Model for Textless Direct Speech-to-Speech Translation.
    Yongxin Zhu, Zhujin Gao, Xinyuan Zhou, Zhongyi Ye, Linli Xu.
    In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP-23). [Link]

  • Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA.
    Yongxin Zhu, Zhen Liu, Yukang Liang, Xin Li, Hao Liu, Changcun Bao, Linli Xu.
    In Proceedings of the 37th AAAI Conference on Artificial Intelligence (AAAI-23). (Oral) [Link]

  • Span-level Aspect-based Sentiment Analysis via Table Filling.
    Mao Zhang, Yongxin Zhu, Zhen Liu, Zhimin Bao, Yunfei Wu, Xing Sun, Linli Xu.
    In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL-23). [Link]

  • ItrievalKD: An Iterative Retrieval Framework Assisted with Knowledge Distillation for Noisy Text-to-Image Retrieval.
    Zhen Liu, Yongxin Zhu, Zhujin Gao, Xin Sheng, Linli Xu.
    In the 27th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-23). [Link]

2022

  • Sequence-to-Action: Grammatical Error Correction with Action Guided Sequence Generation.
    Jiquan Li, Junliang Guo, Yongxin Zhu, Xin Sheng, Deqiang Jiang, Bo Ren, Linli Xu.
    In Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI-22). [Link]

Education

  • 2020.09 - present, Ph.D. in Data Science, University of Science and Technology of China, Hefei.
  • 2016.09 - 2020.07, B.Sc. in Statistics, University of Science and Technology of China, Hefei.

Internships

  • 2023.07 - 2023.12, Tencent AI Lab, Beijing.
  • 2023.03 - 2023.06, iFlytek Research, Hefei.
  • 2021.12 - 2022.08, Tencent Youtu Lab, Hefei.