Paper published in a book (Scientific congresses, symposiums and conference proceedings)
VisLingInstruct: Elevating Zero-Shot Learning in Multi-Modal Language Models with Autonomous Instruction Optimization
Zhu, Dongsheng; TANG, Xunzhu; Han, Weidong et al.
2024 • In Duh, Kevin (Ed.) Long Papers
Peer reviewed
 

Files


Full Text
NAACL24-VisLingInstruct.pdf
Author postprint (6.58 MB)

Details



Keywords :
Current; Context learning; In contexts; Instructional texts; Language model; Modal language; Multi-modal; Optimisations; Performance; Quality of instructions; Computer Networks and Communications; Hardware and Architecture; Information Systems; Software
Abstract :
[en] This paper presents VisLingInstruct, a novel approach to advancing Multi-Modal Language Models (MMLMs) in zero-shot learning. Current MMLMs show impressive zero-shot abilities in multi-modal tasks, but their performance depends heavily on the quality of instructions. VisLingInstruct tackles this by autonomously evaluating and optimizing instructional texts through In-Context Learning, improving the synergy between visual perception and linguistic expression in MMLMs. Alongside this instructional advancement, we have also optimized the visual feature extraction modules in MMLMs, further augmenting their responsiveness to textual content. Our comprehensive experiments on MMLMs based on FlanT5 and Vicuna show that VisLingInstruct significantly improves zero-shot performance in visual multi-modal tasks. Notably, it achieves accuracy increases of 13.1% and 9% over the prior state-of-the-art on the TextVQA and HatefulMemes datasets, respectively. Our main code is available at https://github.com/Zhudongsheng75/VisLingInstruct.
Disciplines :
Computer science
Author, co-author :
Zhu, Dongsheng;  Baidu Inc., China
TANG, Xunzhu  ;  University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
Han, Weidong;  Fudan University, China
Lu, Jinghui;  University College Dublin, Ireland
Zhao, Yukun;  Baidu Inc., China
Xing, Guoliang;  Baidu Inc., China
Wang, Junfeng;  Baidu Inc., China
Yin, Dawei;  Baidu Inc., China
External co-authors :
yes
Language :
English
Title :
VisLingInstruct: Elevating Zero-Shot Learning in Multi-Modal Language Models with Autonomous Instruction Optimization
Publication date :
2024
Event name :
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Event place :
Hybrid, Mexico City, Mexico
Event date :
16-06-2024 => 21-06-2024
By request :
Yes
Main work title :
Long Papers
Editor :
Duh, Kevin
Publisher :
Association for Computational Linguistics (ACL)
ISBN/EAN :
9798891761148
Peer reviewed :
Peer reviewed
Name of the research project :
R-AGR-3885 - H2020-ERC-NATURAL - BISSYANDE Tegawendé
Available on ORBilu :
since 02 September 2025

Statistics


Number of views
48 (1 by Unilu)
Number of downloads
8 (0 by Unilu)

Scopus citations®
3
Scopus citations® without self-citations
2
OpenCitations
0
OpenAlex citations
0
