Inostroza Fernandez, Pamela Isabel
Scientific Conference (2023, April 14)

Today’s educational field has a tremendous hunger for valid and psychometrically sound items to reliably track and model students’ learning processes. Educational large-scale assessments, formative classroom assessment, and, lately, digital learning platforms require a constant stream of high-quality and unbiased items. However, traditional development of test items ties up a significant amount of time from subject matter experts, pedagogues, and psychometricians and might no longer be suited to today’s demands. Salvation is sought in automatic item generation (AIG), which provides the possibility of generating multiple items within a short period of time based on the development of cognitively sound item templates by using algorithms (Gierl, Lai & Tanygin, 2021). Using images or other pictorial elements is a prominent way to present mathematical tasks, e.g. in the Trends in International Mathematics and Science Study (TIMSS; Mullis et al., 2009) and the Programme for International Student Assessment (PISA; OECD, 2013). Research on using images in test items shows ambiguous results depending on their function and perception (Hoogland et al., 2018; Lindner et al., 2018; Lindner, 2020). Thus, despite their high importance, the effects of image-based semantic embeddings and their potential interplay with cognitive characteristics of items are hardly studied. The use of image-based semantic embeddings instead of mainly text-based items will increase though, especially in contexts with highly heterogeneous student language backgrounds.
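The template logic behind AIG can be illustrated with a minimal sketch. All names and factor values below are hypothetical, not taken from the study; the point is only that one cognitive item model, crossed with a few problem characteristics and semantic embeddings, yields many item instances.

```python
import itertools

def generate_addition_items(number_ranges, embeddings):
    """Cross one item model's design factors into concrete item instances.

    Each instance fixes a problem characteristic (the addends' number
    range) and a semantic embedding (the cover story the problem is
    wrapped in). Both factor lists are illustrative placeholders.
    """
    items = []
    for (lo, hi), embedding in itertools.product(number_ranges, embeddings):
        items.append({
            "template": f"{embedding}: a + b = ? with a, b in [{lo}, {hi}]",
            "number_range": (lo, hi),
            "embedding": embedding,
        })
    return items

items = generate_addition_items(
    number_ranges=[(1, 10), (1, 20)],
    embeddings=["marbles", "stickers", "no context"],
)
print(len(items))  # 2 ranges x 3 embeddings = 6 item variants
```

Real AIG systems attach far richer constraints to each template (Gierl, Lai & Tanygin, 2021); the crossing principle, however, is the same.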
The present study psychometrically analyses cognitive item models that were developed by a team of national subject matter experts and psychometricians and then used for algorithmically producing items for the mathematical domain of numbers & operations for Grades 1, 3, and 5 of the Luxembourgish school system. Each item model was administered in 6 experimentally varied versions to investigate the impact of a) the context the mathematical problem was presented in, and b) problem characteristics which cognitive psychology identified as influencing the problem-solving process. Based on samples from Grade 1 (n = 5963), Grade 3 (n = 5527), and Grade 5 (n = 5291) collected within the annual Épreuves standardisées, this design allows for evaluating whether psychometric characteristics of the items produced per model a) are stable, b) can be predicted by problem characteristics, and c) are unbiased towards subgroups of students (known to be disadvantaged in the Luxembourgish school system). The developed cognitive models worked flawlessly as a basis for generating item instances. All of the 348 generated items passed ÉpStan quality criteria, which correspond to standard IRT quality criteria (rit > .25; outfit < 1.2). All 24 cognitive models could be fully identified either by cognitive aspects alone or by a mixture of cognitive aspects and semantic embeddings. One model could be fully described by the different embeddings used. Approximately half of the cognitive models could fully explain all generated and administered items from these models, i.e. no outliers were identified. This remained constant over all grades. With the exception of one cognitive model, we could identify those cognitive factors that determined item difficulty. These factors included well-known aspects, such as inverse ordering, tie or order effects in additions, number range, odd or even numbers, borrowing/carry-over effects, or the number of elements to be added.
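The screening rules mentioned above (item-total correlation and outfit) can be sketched as follows. This is a simplified stand-in, not the ÉpStan pipeline: using the standardized total score as an ability proxy is an assumption of the sketch, and real calibrations estimate abilities jointly with difficulties.

```python
import numpy as np

def screen_items(X, beta):
    """Screen a 0/1 response matrix X (persons x items) against the
    quality criteria named above: corrected item-total correlation
    rit > .25 and Rasch outfit below 1.2. `beta` holds 1-PL item
    difficulties; person ability is proxied here by the standardized
    total score, a rough stand-in for a real calibration.
    """
    X = np.asarray(X, dtype=float)
    total = X.sum(axis=1)
    theta = (total - total.mean()) / (total.std() + 1e-9)
    keep = []
    for j in range(X.shape[1]):
        rest = total - X[:, j]                        # rest-score: total without item j
        rit = np.corrcoef(X[:, j], rest)[0, 1]        # corrected item-total correlation
        p = 1.0 / (1.0 + np.exp(-(theta - beta[j])))  # Rasch P(correct)
        z2 = (X[:, j] - p) ** 2 / (p * (1.0 - p))     # squared standardized residuals
        outfit = z2.mean()                            # outfit = mean squared residual
        keep.append(bool(rit > 0.25 and outfit < 1.2))
    return keep
```

Items failing either rule would be flagged for review rather than dropped automatically.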
Especially in Grade 1, the semantic embedding the problem was presented in impacted item difficulty in most models (80%). This clearly decreased in Grades 3 and 5, pointing to older students’ greater ability to focus on the content of mathematical problems. Each identified factor was analyzed in terms of subgroup differences, and about half of the models were affected by such effects. Gender had the most impact, followed by self-concept and socioeconomic status. Interestingly, those differences were mostly found for cognitive factors (23) and less for factors related to the embedding (6). In sum, results are truly promising and show that item development based on cognitive models not only provides the opportunity to apply automatic item generation but also to create item pools with at least approximately known item difficulty. Thus, the majority of the cognitive models developed in this study could be used to generate a huge number of items (> 10,000,000) for the domain of numbers & operations without the need for expensive field trials. A necessary precondition for this is the consideration of the semantic embedding the problems are presented in, especially in lower grades. It also has to be stated that modeling in Grade 1 was more challenging due to unforeseen interactions and transfer effects between items. We will end our presentation by discussing lessons learned from models where prediction was less successful and highlighting differences between the grades.

Sonnleitner, Philipp
Scientific Conference (2023, April 13)

For several decades, researchers have suggested cognitive models as a superior basis for item development (Hornke & Habon, 1986; Leighton & Gierl, 2011).
Such models would make item writing decisions explicit and therefore more valid. By further formalizing such models, even automatic item generation, with its manifold advantages for economical test construction and increased test security, becomes possible. If item characteristics are stable, test equating would be rendered unnecessary, allowing for individual but equivalent tests, or even adaptive or multistage testing without extensive pre-calibration. Finally, validated cognitive models would allow for applying diagnostic classification models that provide fine-grained feedback on students’ competencies (Leighton & Gierl, 2007; Rupp, Templin, & Henson, 2010). Remarkably, despite the constantly growing need for validated items, educational large-scale assessments (LSAs) have largely forgone cognitive models as templates for item writing. Traditional, often inefficient item writing techniques prevail, and participating students are offered a global competency score at best. This may have many reasons, above all the focus of LSAs on the system rather than the individual level. Many domains lack the amount of cognitive research necessary for model development (e.g. Leighton & Gierl, 2011), and test frameworks are mostly based on didactical viewpoints. Moreover, developing an empirically validated cognitive model remains a challenge. Considering the often time-sensitive test development cycles of LSAs, the balance clearly tips against the use of cognitive models. Educational LSAs are here to stay, however, and the question remains whether increased effort and research on this topic might pay off in the long run by leveraging all the benefits cognitive models have to offer. In total, 35 cognitive item models were developed by a team of national subject matter experts and then used for algorithmically producing items for the mathematical domain of numbers & shapes.
Each item model was administered in 6 experimentally varied versions to investigate the impact of problem characteristics which cognitive psychology identified as influencing the problem-solving process. Based on samples from Grade 1 (n = 5963), Grade 3 (n = 5527), Grade 5 (n = 5291), and Grade 7 (n = 3018), this design allowed for evaluating whether psychometric characteristics of the items produced per model are stable and can be predicted by problem characteristics. After item calibration (1-PL model), each cognitive model was analyzed in depth by descriptive comparisons of the resulting IRT parameters and by using the LLTM (Fischer, 1973). In a second step, the same items were analyzed using the G-DINA model (de la Torre & Minchen, 2019) to derive classes of students for the tested subskills. The cognitive models served as the basis for the Q-matrix necessary for applying the diagnostic measurement model. Results make a convincing case for investing the (substantially) increased effort to base item development on fine-grained cognitive models. Model-based manipulations of item characteristics were largely stable and behaved according to previous findings in the literature. Thus, differences in item difficulty could be shaped and were stable over different administrations. This remained true for all investigated grades. The final diagnostic classification models distinguished between different developmental stages in the domain of numbers & operations, on the group as well as on the individual level. Although not all competencies might be backed up by literature from cognitive psychology yet, our findings encourage a more exploratory model building approach given the usual long-term perspective of LSAs.
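The LLTM logic referenced above, item difficulty decomposed into a weighted sum of cognitive design factors, can be shown in miniature. The design matrix and weights below are invented for illustration; in practice the weights are estimated from calibrated response data, not from noise-free difficulties.

```python
import numpy as np

# Item-by-factor design matrix: which cognitive factor each item taps
# (columns could stand for e.g. carry-over needed, inverse ordering,
# high number range -- hypothetical labels).
Q = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
    [0, 0, 1],
])
eta_true = np.array([0.5, 1.0, -0.3])  # difficulty contribution per factor
beta = Q @ eta_true                    # item difficulties implied by the design

# With noise-free, consistent difficulties, least squares recovers the
# factor weights exactly; the real LLTM estimates them from responses.
eta_hat, *_ = np.linalg.lstsq(Q, beta, rcond=None)
print(np.round(eta_hat, 3))  # recovers eta_true: 0.5, 1.0, -0.3
```

When the factor weights reproduce the calibrated difficulties well, new items built from the same design matrix arrive with approximately known difficulty, which is exactly the property the study exploits.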
Sonnleitner, Philipp
Scientific Conference (2022, November 10)

Assessment is probably the central factor in every educational biography: on the one hand through direct consequences for school career decisions, on the other hand through repercussions on each student’s self-concept in the respective subject, on one’s own work behavior, and on the perception of institutional fairness in general. A crucial factor is the subjective, perceived fairness of assessment, which has been shown to influence students’ satisfaction, motivation, and attitudes toward learning (Chory-Assad, 2002; Wendorf & Alexander, 2005). The current study examines how Luxembourgish students experience fairness of assessment on the basis of representative samples of the 7ième (N > 700 students) and 9ième/5ième (N > 2200, 35% of the total cohort) and gives a first insight into the connection with school interest and self-concept. Special attention is given to the heterogeneity of the Luxembourgish student population: the extent to which language background, socioeconomic status, and gender are related to these perceptions of fairness will be analyzed. Data was collected as part of the nationwide Épreuves standardisées in fall 2021 using the Fairness Barometer (Sonnleitner & Kovacs, 2020), a standardized instrument to measure informational and procedural fairness in student assessment. The analyses are theoretically based on classroom justice theory and educational psychology (Chory-Assad & Paulsel, 2004; Chory, 2007; Duplaga & Astani, 2010) and utilize latent variable models (SEM) to study the complex interplay between perceived assessment practices and students’ school-related motivational factors.
The insights offered by this study are internationally unique in their scope and provide a first glimpse of the fairness perceptions of groups of Luxembourgish students in known disadvantaged situations. Results aim to sensitize especially active teachers and educators to the central importance of assessment in schools and to offer concrete advice on how to improve it.

References:
Chory, R. M. (2007). Enhancing student perceptions of fairness: The relationship between instructor credibility and classroom justice. Commun. Educ., 56, 89–105. doi: 10.1080/03634520600994300
Chory-Assad, R. M. (2002). Classroom justice: Perceptions of fairness as a predictor of student motivation, learning, and aggression. Commun. Q., 50, 58–77. doi: 10.1080/01463370209385646
Chory-Assad, R. M., & Paulsel, M. L. (2004). Classroom justice: Student aggression and resistance as reactions to perceived unfairness. Commun. Educ., 53, 253–273. doi: 10.1080/0363452042000265189
Duplaga, E. A., & Astani, M. (2010). An exploratory study of student perceptions of which classroom policies are fairest. Decision Sci. J. Innov. Educ., 8, 9–33. doi: 10.1111/j.1540-4609.2009.00241.x
Sonnleitner, P., & Kovacs, C. (2020). Differences between students’ and teachers’ fairness perceptions: Exploring the potential of a self-administered questionnaire to improve teachers’ assessment practices. Frontiers in Education, 5, 17.
Wendorf, C. A., & Alexander, S. (2005). The influence of individual- and class-level fairness-related perceptions on student satisfaction. Contemp. Educ. Psychol., 30, 190–206. doi: 10.1016/j.cedpsych.2004.07.003

Inostroza Fernandez, Pamela Isabel
Scientific Conference (2022, November)

Educational large-scale assessments aim to evaluate school systems’ effectiveness by typically looking at aggregated levels of students’ performance.
The developed assessment tools or tests are not intended or optimized to be used for diagnostic purposes on an individual level. In most cases, the underlying theoretical framework is based on national curricula and is therefore too blurry for diagnostic test construction, and test length is too short to draw reliable inferences on the individual level. This lack of individual information is often unsatisfying, especially for participating students and teachers who invest a considerable amount of time and effort, not to speak of the tremendous organizational work needed to realize such assessments. The question remains whether the evaluation could not be used in an optimized way to offer more differentiated information on students’ specific skills. The present study explores the potential of diagnostic classification models (DCMs) in this regard, since they offer crucial information for policy makers, educators, and students themselves. Instead of a ranking of, e.g., overall mathematics ability, student mastery profiles of subskills are identified in DCMs, providing a rich base for further targeted interventions and instruction (Rupp, Templin & Henson, 2010; von Davier & Lee, 2019). A prerequisite for applying such models is well-developed and cognitively described items that map the assessed ability on a fine-grained level. In the present study, we drew on 104 items that were developed on the basis of detailed cognitive item models for basic Grade 1 competencies, such as counting, as well as decomposition and addition with low and high numbers (Fuson, 1988; Fritz & Ricken, 2008; Krajewski & Schneider, 2009).
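The link between such cognitive item models and a DCM can be sketched via the Q-matrix: each item is tagged with the subskills it requires, and in the simplest (DINA-style) case a student is expected to solve an item only if all required subskills are mastered. The Q-matrix entries below are illustrative, not the study’s actual coding.

```python
import numpy as np

# Columns: counting, decomposition, addition (low numbers),
# addition (high numbers) -- the Grade 1 subskills named in the text.
Q = np.array([
    [1, 0, 0, 0],  # pure counting item
    [1, 1, 0, 0],  # counting + decomposition
    [1, 0, 1, 0],  # counting + addition with low numbers
    [1, 1, 1, 1],  # draws on all four subskills
])

def ideal_response(profile, Q):
    """DINA-style ideal response pattern: an item is solved iff the
    student masters every attribute the Q-matrix requires for it
    (slipping/guessing parameters are omitted in this sketch)."""
    profile = np.asarray(profile)
    return (Q @ profile == Q.sum(axis=1)).astype(int)

# A student mastering counting and decomposition, but no addition yet:
print(ideal_response([1, 1, 0, 0], Q))  # solves only the first two items
```

A fitted DCM such as G-DINA relaxes this all-or-nothing rule and estimates the probability of each mastery profile from the observed responses.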
Those items were spread over a main test plus 6 different test booklets and administered to a total of 5963 first graders within the Luxembourgish national school monitoring programme Épreuves standardisées. Results of this pilot study are highly promising, giving information about different patterns of student behavior: the final DCM was able to distinguish between different developmental stages in the domain of numbers & operations, on the group as well as on the individual level. Whereas roughly 14% of students did not master any of the assessed competencies, 34% of students mastered all of them, including addition with high numbers. The remaining 52% reached different stages of competency development: 8% of students were classified as mastering only counting, 15% also mastered addition with low numbers, and 20% additionally mastered decomposition. All these patterns reflect developmental models of children’s counting and concept of number (Fritz & Ricken, 2008; see also Braeuning et al., 2021). This information could potentially be used to substantially enhance large-scale assessment feedback and to offer further guidance for teachers on what to focus on when teaching. To conclude, the present results make a convincing case that using fine-grained cognitive models for item development and applying DCMs that are able to statistically capture these nuances in student response behavior might be worth the (substantially) increased effort.

References:
Braeuning, D., et al. (2021). Long-term relevance and interrelation of symbolic and non-symbolic abilities in mathematical-numerical development: Evidence from large-scale assessment data. Cognitive Development, 58. https://doi.org/10.1016/j.cogdev.2021.101008
Fritz, A., & Ricken, G. (2008). Rechenschwäche. utb GmbH.
Fuson, K. C. (1988). Children's counting and concepts of number. Springer-Verlag Publishing.
Rupp, A. A., Templin, J. L., & Henson, R. A. (2010).
Diagnostic measurement: Theory, methods, and applications. New York, NY: Guilford Press.
von Davier, M., & Lee, Y. S. (2019). Handbook of diagnostic classification models. Cham: Springer International Publishing.

Michels, Michael Andreas
Scientific Conference (2021, November 11)

Assessing mathematical skills in national school monitoring programs such as the Luxembourgish Épreuves Standardisées (ÉpStan) creates a constant demand for developing high-quality items that is both expensive and time-consuming. One approach to providing high-quality items more efficiently is automatic item generation (AIG; Gierl, 2013). Instead of creating single items, cognitive item models form the basis for the algorithmic generation of a large number of new items with supposedly identical item characteristics. The stability of item characteristics is questionable, however, when different semantic embeddings are used to present the mathematical problems (Dewolf, Van Dooren, & Verschaffel, 2017; Hoogland et al., 2018). Given culture-specific knowledge differences among students, it is not guaranteed that illustrations showing everyday activities do not differentially impact item difficulty (Martin et al., 2012). Moreover, the prediction of empirical item difficulties based on theoretical rationales has proved to be difficult (Leighton & Gierl, 2011). This paper presents a first attempt to better understand the impact of (a) different semantic embeddings and (b) problem-related variations on mathematics items in Grades 1 (n = 2338), 3 (n = 3835), and 5 (n = 3377) within the context of ÉpStan.
In total, 30 mathematical problems were presented in up to 4 different versions, either using different but equally plausible semantic contexts or altering the problem’s content characteristics. Preliminary results of IRT scaling and DIF analysis reveal substantial effects of both the embedding and the problem characteristics, on general item difficulties as well as on the subgroup level. Further results and implications for developing mathematics items, and specifically for using AIG in the course of ÉpStan, will be discussed.
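One standard way to screen items for DIF of the kind analyzed here is the Mantel-Haenszel common odds ratio; the abstract does not specify the exact procedure used, so the sketch below is a generic illustration rather than the study’s method.

```python
import numpy as np

def mantel_haenszel_or(item, group, total):
    """Mantel-Haenszel common odds ratio for DIF screening.
    `item`: 0/1 responses to the studied item, `group`: 0 = reference,
    1 = focal subgroup, `total`: matching score used to form strata of
    comparable ability.
    """
    item, group, total = map(np.asarray, (item, group, total))
    num = den = 0.0
    for s in np.unique(total):
        m = total == s
        a = np.sum(m & (group == 0) & (item == 1))  # ref correct
        b = np.sum(m & (group == 0) & (item == 0))  # ref incorrect
        c = np.sum(m & (group == 1) & (item == 1))  # focal correct
        d = np.sum(m & (group == 1) & (item == 0))  # focal incorrect
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n
        den += b * c / n
    return num / den if den else float("nan")  # 1.0 = no DIF
```

An odds ratio near 1 indicates no DIF after matching on ability; conventions such as the ETS A/B/C classification then flag deviating items for review.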