Reddit-V; virality prediction; pre-engagement metadata; zero-shot LLM evaluation; multimodal fine-tuning; AI; Social media
Abstract :
[en] We present Reddit-V, a new dataset designed to advance research on social media virality prediction in natural language processing. The dataset consists of over 27,000 Reddit posts, each enriched with images, textual content, and pre-engagement metadata such as post titles, categories, sentiment scores, and posting times. As an initial benchmark, we evaluate several instruction-tuned large language models (LLMs) in a zero-shot setting, prompting them with post titles and metadata to predict post virality. We then fine-tune two multimodal models, CLIP and IDEFICS, to assess whether incorporating visual context enhances predictive performance. Our results show that zero-shot LLMs perform poorly, whereas the fine-tuned multimodal models achieve better performance. Specifically, CLIP outperforms the best-performing zero-shot LLM (CodeLLaMA) by 3%, while IDEFICS achieves an 7% improvement over the same baseline, highlighting the importance of visual features in virality prediction. We release the Reddit-V dataset and our evaluation results to facilitate further research on multimodal and text-based virality prediction. Our dataset and code will be made publicly available on Github
Disciplines :
Computer science
Author, co-author :
EL-AMRANY, Samir ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
R. Brust, Matthias; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
LAMSIYAH, Salima ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
BOUVRY, Pascal ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
External co-authors :
no
Language :
English
Title :
Reddit-V: A Virality Prediction Dataset and Zero-Shot Evaluation with Large Language Models