Establishing Cognitive Item Models for Fair and Theory-Grounded Automatic Item Generation: A Large-Scale Assessment Study with Image-Based Math Items
Mathematics is a core domain in large-scale assessments (LSA), yet item development remains resource-intensive, limiting scalability and innovation. Automatic Item Generation (AIG) offers a promising solution, but empirical validations remain rare. This study investigates the psychometric functioning and fairness of 48 cognitive item models designed to generate language-reduced, image-based math items for Grades 1, 3, and 5. Treating these models as proto-theories, we generated 612 item instances varying in cognitive demands and contextual features. Using data from Luxembourg's school monitoring (N = 35,058), we found that item difficulty was mainly driven by predefined cognitive factors, with stronger contextual influences in early grades. We introduce Differential Radical Functioning to evaluate whether AIG-based items permit comparable score interpretations across subgroups. Results reveal meaningful differences by cultural background, regardless of language proficiency. These findings highlight the importance of contextual embedding and demonstrate the potential of cognitive modeling in AIG for scalable, valid, and equitable assessments.
Disciplines:
Education & instruction
Author, co-author:
SONNLEITNER, Philipp; University of Luxembourg > Faculty of Humanities, Education and Social Sciences (FHSE) > LUCET
BERNARD, Steve; University of Luxembourg > Faculty of Humanities, Education and Social Sciences (FHSE) > LUCET
MICHELS, Michael A.; University of Luxembourg
INOSTROZA-FERNANDEZ, Pamela; Universidad de los Andes
KELLER, Ulrich; University of Luxembourg > Faculty of Humanities, Education and Social Sciences (FHSE) > LUCET
FNR13650128 - FAIR-ITEMS - Fairness Of Latest Innovations In Item And Test Development In Mathematics, 2019 (01/09/2020-31/08/2023) - Philipp Sonnleitner
Name of the research project:
R-AGR-3682 - C19/SC/13650128/FAIR-ITEMS - SONNLEITNER Philipp