A. Adadi and M. Berrada. 2018. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access 6 (2018), 52138–52160. https://ieeexplore.ieee.org/document/8466590.
American Psychological Association. 2020. Publication Manual of the American Psychological Association (7th. ed.). American Psychological Association (APA).
P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the CVPR.
G. Angeli, P. Liang, and D. Klein. 2010. A simple domain-independent probabilistic approach to generation. In Proceedings of the EMNLP.
N. Banovic, T. Grossman, J. Matejka, and G. Fitzmaurice. 2012. Waken: Reverse engineering usage information and interface structure from software videos. In Proceedings of the UIST.
J. P. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R. C. Miller, R. Miller, A. Tatarowicz, B. White, S. White, and T. Yeh. 2010. VizWiz: Nearly real-time answers to visual questions. In Proceedings of the UIST.
J. P. Bigham, R. S. Kaminsky, R. E. Ladner, O. M. Danielsson, and G. L. Hempton. 2006. WebInSight: Making web images accessible. In Proceedings of the ASSETS.
M. Borenstein. 2009. Effect sizes for continuous data. In Handbook of Research Synthesis and Meta-Analysis (2nd. ed.), H. Cooper, L. V. Hedges, and J. C. Valentine (Eds.). Russell Sage Foundation.
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. 2020. Language models are few-shot learners. In Proceedings of the NeurIPS.
B. Deka, Z. Huang, C. Franzen, J. Hibschman, D. Afergan, Y. Li, J. Nichols, and R. Kumar. 2017. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the UIST.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the CVPR.
M. Dixon and J. Fogarty. 2010. Prefab: Implementing advanced behaviors using pixel-based reverse engineering of interface structure. In Proceedings of the CHI.
M. Dixon, D. Leventhal, and J. Fogarty. 2011. Content and hierarchy in pixel-based methods for reverse engineering interface structure. In Proceedings of the CHI.
M. Dixon, A. Nied, and J. Fogarty. 2014. Prefab layers and prefab annotations: Extensible pixel-based interpretation of graphical interfaces. In Proceedings of the UIST.
P. L. Dognin, I. Melnyk, Y. Mroueh, J. Ross, and T. Sercu. 2019. Adversarial semantic alignment for improved image captions. In Proceedings of the CVPR.
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the ICLR.
A. Dutta, Y. Verma, and C. V. Jawahar. 2018. Automatic image annotation: The quirks and what works. Multimedia Tools and Applications 77, 24 (2018).
H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, J. Platt, L. Zitnick, and G. Zweig. 2015. From captions to visual concepts and back. In Proceedings of the CVPR.
A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. 2010. Every picture tells a story: Generating sentences from images. In Proceedings of the ECCV.
W. Fedus, B. Zoph, and N. Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23, 120 (2022), 1–39.
J. Garcia. 2011. Ext JS in Action (2nd. ed.). Manning Publications.
A. Gatt and E. Krahmer. 2018. Survey of the state-of-the-art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research 61 (2018). https://www.jair.org/index.php/jair/article/view/11173.
R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. 2019. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In Proceedings of the ICLR.
C. Gleason, A. Pavel, E. McCamey, C. Low, P. Carrington, K. M. Kitani, and J. P. Bigham. 2020. Twitter A11y: A browser extension to make Twitter images accessible. In Proceedings of the CHI.
J. Gu, J. Cai, G. Wang, and T. Chen. 2018. Stack-captioning: Coarse-to-fine learning for image captioning. In Proceedings of the AAAI.
J. Harel, C. Koch, and P. Perona. 2007. Graph-based visual saliency. In Proceedings of the NIPS.
K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In Proceedings of the CVPR.
M. Z. Hossain, F. Sohel, M. F. Shiratuddin, and H. Laga. 2019. A comprehensive survey of deep learning for image captioning. ACM Computing Surveys 51, 6 (2019).
X. Hu, X. Yin, K. Lin, L. Wang, L. Zhang, J. Gao, and Z. Liu. 2021. VIVO: Visual vocabulary pre-training for novel object captioning. In Proceedings of the AAAI.
X. Hua and L. Wang. 2019. Sentence-level content planning and style specification for neural text generation. In Proceedings of the EMNLP.
J. Huang and M. B. Twidale. 2007. Graphstract: Minimal graphical help for computers. In Proceedings of the UIST.
T. Intharah, D. Turmukhambetov, and G. J. Brostow. 2017. Help, it looks confusing: GUI task automation through demonstration and follow-up questions. In Proceedings of the IUI.
R. Kimchi. 1992. Primacy of wholistic processing and the global/local paradigm: A critical review. Psychological Bulletin 112 (1992), 24–38.
R. Kondadadi, B. Howald, and F. Schilder. 2013. A statistical NLG framework for aggregated planning and realization. In Proceedings of the ACL.
A. Kraskov, H. Stögbauer, and P. Grassberger. 2004. Estimating mutual information. Physical Review E 69 (2004), 066138. https://journals.aps.org/pre/abstract/10.1103/PhysRevE.69.066138.
R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. S. Bernstein, and L. Fei-Fei. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (2017), 32–73. https://link.springer.com/article/10.1007/s11263-016-0981-7.
G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. 2011. Baby talk: Understanding and generating image descriptions. In Proceedings of the CVPR.
R. Lebret, D. Grangier, and M. Auli. 2016. Neural text generation from structured data with application to the biography domain. In Proceedings of the EMNLP.
L. A. Leiva, A. Hota, and A. Oulasvirta. 2020a. Enrico: A high-quality dataset for topic modeling of mobile UI designs. In Proceedings of the MobileHCI.
L. A. Leiva, Y. Xue, A. Bansal, H. R. Tavakoli, T. Köroğlu, N. R. Dayama, and A. Oulasvirta. 2020b. Understanding visual saliency in mobile user interfaces. In Proceedings of the MobileHCI.
J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the NAACL.
J. Li, D. Li, C. Xiong, and S. Hoi. 2022. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the ICML.
S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi. 2011. Composing simple image descriptions using web-scale N-grams. In Proceedings of the ACL.
X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, Y. Choi, and J. Gao. 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Proceedings of the ECCV.
Y. Li, G. Li, L. He, J. Zheng, H. Li, and Z. Guan. 2020. Widget captioning: Generating natural language description for mobile user interface elements. In Proceedings of the EMNLP.
P. Liang, M. Jordan, and D. Klein. 2009. Learning semantic correspondences with less supervision. In Proceedings of the ACL/IJCNLP.
T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár. 2014. Microsoft COCO: Common objects in context. In Proceedings of the ECCV.
T. F. Liu, M. Craft, J. Situ, E. Yumer, R. Mech, and R. Kumar. 2018. Learning design semantics for mobile apps. In Proceedings of the UIST.
J. Lu, C. Xiong, D. Parikh, and R. Socher. 2017. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the CVPR.
W. Luo, Y. Li, R. Urtasun, and R. Zemel. 2016. Understanding the effective receptive field in deep convolutional neural networks. In Proceedings of the NIPS.
Z. Luo, Y. Xi, R. Zhang, and J. Ma. 2022. VC-GPT: Visual conditioned GPT for end-to-end generative vision-and-language pre-training. arXiv:2201.12723. Retrieved from https://arxiv.org/abs/2201.12723.
K. R. McKeown. 1985. Text Generation: Using Discourse Strategies and Focus Constraints to Generate Natural Language Text. Cambridge University Press.
S. W. McRoy, S. Channarukul, and S. S. Ali. 2000. YAG: A template-based generator for real-time systems. In Proceedings of the INLG.
T. Miller. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence 267 (2019), 1–38. https://www.sciencedirect.com/science/article/pii/S0004370218305988.
V. S. Morash, Y.-T. Siu, J. A. Miele, L. Hasty, and S. Landau. 2015. Guiding novice web workers in making image descriptions using templates. ACM Transactions on Accessible Computing 7, 4 (2015), 1–21.
R. Moriyon, P. Szekely, and R. Neches. 1994. Automatic generation of help from interface design models. In Proceedings of the UIST.
M. R. Morris, A. Zolyomi, C. Yao, S. Bahram, J. P. Bigham, and S. K. Kane. 2016. “With most of it being pictures now, I rarely use it”: Understanding Twitter’s evolving accessibility to blind users. In Proceedings of the CHI.
D. Navon. 1977. Forest before trees: The precedence of global features in visual perception. Cognitive Psychology 9, 3 (1977), 353–383.
J. Novikova, O. Dušek, and V. Rieser. 2017. The E2E dataset: New challenges for end-to-end generation. In Proceedings of the SIGDIAL.
S. Pangoli and F. Paternò. 1995. Automatic generation of task-oriented help. In Proceedings of the UIST.
S. Pareddy, A. Guo, and J. P. Bigham. 2019. X-Ray: Screenshot accessibility via embedded metadata. In Proceedings of the ASSETS.
B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. 2017. Flickr30K entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. International Journal of Computer Vision 123, 1 (2017), 74–93.
D. Powers. 2011. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies 2, 1 (2011), 37–63.
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the ICML.
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. 2019. Language models are unsupervised multitask learners. Technical Report. OpenAI.
K. Ramnath, S. Baker, L. Vanderwende, M. El-Saban, S. N. Sinha, A. Kannan, N. Hassan, M. Galley, Y. Yang, D. Ramanan, A. Bergamo, and L. Torresani. 2014. AutoCaption: Automatic caption generation for personal photos. In Proceedings of the WACV.
M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. 2016. Sequence level training with recurrent neural networks. In Proceedings of the ICLR.
E. Reiter and R. Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press.
H. Schielzeth, N. J. Dingemanse, S. Nakagawa, D. F. Westneat, H. Allegue, C. Teplitsky, D. Réale, N. A. Dochtermann, L. Z. Garamszegi, and Y. G. Araya-Ajoy. 2020. Robustness of linear mixed-effects models to violations of distributional assumptions. Methods in Ecology and Evolution 11, 9 (2020), 1141–1152.
R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. 2019. Grad-CAM: Visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision 128, 2 (2019), 336–359.
R. J. Senter and E. A. Smith. 1967. Automated Readability Index. Technical Report AMRL-TR-6620. Wright-Patterson Air Force Base.
C. Shen and Q. Zhao. 2014. Webpage saliency. In Proceedings of the ECCV.
K. Simonyan and A. Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of the ICLR.
A. Stangl, M. R. Morris, and D. Gurari. 2020. “Person, shoes, tree. Is the person naked?” What people with vision impairments want in image descriptions. In Proceedings of the CHI.
G. Thomas, R. D. Hartley, and J. P. Kincaid. 1975. Test-retest and inter-analyst reliability of the automated readability index, flesch reading ease score, and the fog count. Journal of Literacy Research 7, 2 (1975), 149–154.
D. Todorovic. 2008. Gestalt principles. Scholarpedia 3, 12 (2008), 5345.
K. Tran, X. He, L. Zhang, J. Sun, C. Carapcea, C. Thrasher, C. Buehler, and C. Sienkiewicz. 2016. Rich image captioning in the wild. In Proceedings of the CVPR.
S. Wiseman, S. Shieber, and A. Rush. 2017. Challenges in data-to-document generation. In Proceedings of the EMNLP.
S. Wiseman, S. M. Shieber, and A. M. Rush. 2018. Learning neural templates for text generation. In Proceedings of the EMNLP.
T. Yeh, T.-H. Chang, and R. C. Miller. 2009. Sikuli: Using GUI screenshots for search and automation. In Proceedings of the UIST.
T. Yeh, T.-H. Chang, B. Xie, G. Walsh, I. Watkins, K. Wongsuphasawat, M. Huang, L. S. Davis, and B. B. Bederson. 2011. Creating contextual help for GUIs using screenshots. In Proceedings of the UIST.
Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. 2016. Image captioning with semantic attention. In Proceedings of the CVPR.
S. Zedeck (Ed.). 2013. APA Dictionary of Statistics and Research Methods (1st. ed.). American Psychological Association (APA).
X. Zhang, L. de Greef, A. Swearngin, S. White, K. I. Murray, L. Yu, Q. Shan, J. Nichols, J. Wu, C. Fleizach, A. Everitt, and J. P. Bigham. 2021. Screen recognition: Creating accessibility metadata for mobile applications from pixels. In Proceedings of the CHI.