From a visual-perception perspective, modern graphical user interfaces (GUIs) comprise a complex graphics-rich two-dimensional visuospatial arrangement of text, images, and interactive objects such as buttons and menus. While existing models can accurately predict regions and objects that are likely to attract attention "on average", no scanpath model has been capable of predicting scanpaths for an individual. To close this gap, we introduce EyeFormer, which utilizes a Transformer architecture as a policy network to guide a deep reinforcement learning algorithm that predicts gaze locations. Our model offers the unique capability of producing personalized predictions when given a few user scanpath samples. It can predict full scanpath information, including fixation positions and durations, across individuals and various stimulus types. Additionally, we demonstrate applications in GUI layout optimization driven by our model.
Disciplines :
Computer science
Author, co-author :
Jiang, Yue ; Aalto University, Finland
Guo, Zixin ; Aalto University, Finland
Rezazadegan Tavakoli, Hamed ; Nokia Technologies, Finland
LEIVA, Luis A. ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
Oulasvirta, Antti ; Aalto University, Finland
External co-authors :
yes
Language :
English
Title :
EyeFormer: Predicting Personalized Scanpaths with Transformer-Guided Reinforcement Learning
Publication date :
13 October 2024
Event name :
Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology
Event place :
Pittsburgh, USA
Event date :
13-10-2024 => 16-10-2024
Audience :
International
Main work title :
UIST 2024 - Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology
HE - 101071147 - SYMBIOTIK - Context-aware adaptive visualizations for critical decision making
FnR Project :
FNR15722813 - BANANA - Brainsourcing For Affective Attention Estimation, 2021 (01/02/2022-31/01/2025) - Luis Leiva
Funders :
European Union
Funding text :
This work was supported by Aalto University's Department of Information and Communications Engineering, the Research Council of Finland (flagship program: Finnish Center for Artificial Intelligence, FCAI, grants 328400, 345604, 341763; Subjective Functions, grant 357578), the Academy of Finland in project 345791, the Meta Research PhD Fellowship, the Horizon 2020 FET program of the European Union (grant CHIST-ERA-20-BCI-001), and the European Innovation Council Pathfinder program (SYMBIOTIK project, grant 101071147).