Abstract:
For several decades, researchers have proposed cognitive models as a superior basis for item development (Hornke & Habon, 1986; Leighton & Gierl, 2011). Such models make item-writing decisions explicit and therefore more valid. Once further formalized, they even enable automated item generation, with its manifold advantages for economical test construction and increased test security. If item characteristics are stable, test equating becomes unnecessary, allowing for individual yet equivalent tests, or even adaptive or multistage testing without extensive pre-calibration. Finally, validated cognitive models allow the application of diagnostic classification models that provide fine-grained feedback on students' competencies (Leighton & Gierl, 2007; Rupp, Templin, & Henson, 2010).

Remarkably, despite a constantly growing need for validated items, educational large-scale assessments (LSAs) have largely forgone cognitive models as templates for item writing. Traditional, often inefficient item-writing techniques prevail, and participating students are offered a global competency score at best. There are many possible reasons for this, above all the focus of LSAs on the system rather than the individual level. Many domains lack the amount of cognitive research necessary for model development (e.g., Leighton & Gierl, 2011), and test frameworks are mostly based on didactical viewpoints. Moreover, developing an empirically validated cognitive model remains a challenge. Given the often time-sensitive test-development cycles of LSAs, the balance clearly tips against the use of cognitive models. Educational LSAs are here to stay, however, and the question remains whether increased effort and research on this topic might pay off in the long run by leveraging all the benefits cognitive models have to offer.

In total, 35 cognitive item models were developed by a team of national subject-matter experts and then used to algorithmically produce items for the mathematical domain of numbers & shapes. Each item model was administered in six experimentally varied versions to investigate the impact of problem characteristics that cognitive psychology has identified as influencing the problem-solving process. Based on samples from Grade 1 (n = 5963), Grade 3 (n = 5527), Grade 5 (n = 5291), and Grade 7 (n = 3018), this design allowed for evaluating whether the psychometric characteristics of the items produced per model are stable and can be predicted from problem characteristics. After item calibration (1-PL model), each cognitive model was analyzed in depth through descriptive comparisons of the resulting IRT parameters and by means of the LLTM (Fischer, 1973; sketched below). In a second step, the same items were analyzed with the G-DINA model (de la Torre & Minchen, 2019) to derive classes of students for the tested subskills. The cognitive models served as the basis for the Q-matrix required by the diagnostic measurement model (an illustrative example is given below).

The results make a convincing case for investing the (substantially) increased effort to base item development on fine-grained cognitive models. Model-based manipulations of item characteristics were largely stable and behaved in line with previous findings in the literature. Differences in item difficulty could thus be shaped deliberately and remained stable across different administrations, and this held for all investigated grades. The final diagnostic classification models distinguished between different developmental stages in the domain of numbers & operations, on the group as well as on the individual level.
Although not all competencies may yet be backed by the cognitive-psychology literature, our findings encourage a more exploratory model-building approach, given the usual long-term perspective of LSAs.
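
Sketch of the LLTM decomposition referred to above (for illustration only; the weights w_{ik} and basic parameters \eta_k stand for experimentally varied problem characteristics, whose exact specification is not given in this abstract). In the LLTM, the Rasch (1-PL) item difficulty \beta_i is constrained to a weighted sum of basic parameters, so that an item's difficulty is predicted from its design features:

\[
P(X_{vi}=1 \mid \theta_v) \;=\; \frac{\exp(\theta_v - \beta_i)}{1 + \exp(\theta_v - \beta_i)},
\qquad
\beta_i \;=\; \sum_{k=1}^{K} w_{ik}\,\eta_k + c .
\]

Under this reading, a cognitive item model behaves stably to the extent that the estimated \eta_k reproduce the calibrated difficulties across item versions and administrations.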
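Illustrative Q-matrix and G-DINA item response function (a minimal sketch; the two attributes and three items are hypothetical placeholders, not the subskills actually modelled in the study). Each row of Q indicates which subskills an item requires, and the saturated G-DINA model with identity link expresses the success probability of an item requiring attributes \alpha_1 and \alpha_2 as:

\[
Q =
\begin{pmatrix}
1 & 0 \\
0 & 1 \\
1 & 1
\end{pmatrix},
\qquad
P(X_j = 1 \mid \alpha_1, \alpha_2) \;=\; \delta_{j0} + \delta_{j1}\alpha_1 + \delta_{j2}\alpha_2 + \delta_{j12}\alpha_1\alpha_2 .
\]

Here \delta_{j0} is the baseline success probability, the main-effect terms \delta_{j1} and \delta_{j2} capture the gain from mastering each single attribute, and \delta_{j12} the additional interaction gain; the attribute mastery profiles estimated under this model yield the class-based feedback on subskills described above.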