This tutorial covers the basics of using statistical tests to
evaluate and compare search algorithms, in particular when they are
applied to software engineering problems. Search algorithms such as
Hill Climbing and Genetic Algorithms are randomised: running
such an algorithm twice on the same problem can give
different results. It is therefore important to run these algorithms
multiple times and collect average results, so as to avoid publishing
wrong conclusions that were based on luck alone. This raises the
question of how many times the runs should be repeated. Given a set
of n repeated experiments, is n large enough to draw sound
conclusions, or should more experiments have been run? Statistical
tests such as the Wilcoxon-Mann-Whitney U-test can be used to
answer these questions.
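As an illustration of the workflow described in the abstract, the following is a minimal sketch in Python, not code taken from the tutorial itself. It runs two toy randomised algorithms 30 times each on a made-up problem (OneMax, maximising the number of 1-bits), then compares the two samples with a hand-written Wilcoxon-Mann-Whitney U-test and the Vargha-Delaney A12 effect size. All function names, the toy problem, the number of runs, and the seeds are assumptions made for this example. The p-value uses the normal approximation with a tie correction and no continuity correction, so it is only approximate for small n.

```python
import math
import random
from collections import Counter


def hill_climbing(rng, n_bits=30, steps=50):
    """Toy randomised search on OneMax (hypothetical example problem)."""
    bits = [rng.randint(0, 1) for _ in range(n_bits)]
    for _ in range(steps):
        i = rng.randrange(n_bits)
        # Flipping bit i does not decrease fitness only when it is currently 0.
        if bits[i] == 0:
            bits[i] = 1
    return sum(bits)


def random_search(rng, n_bits=30, steps=50):
    """Baseline: best fitness among `steps` uniformly random bit strings."""
    return max(sum(rng.randint(0, 1) for _ in range(n_bits)) for _ in range(steps))


def collect_runs(algorithm, n_runs, seed):
    """Repeat a randomised algorithm n_runs times and return all results."""
    rng = random.Random(seed)
    return [algorithm(rng) for _ in range(n_runs)]


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def compare_runs(a, b):
    """Return (U statistic of a, approximate two-sided p-value, A12 effect size)."""
    m, n = len(a), len(b)
    total = m + n
    pooled = list(a) + list(b)
    ranks = average_ranks(pooled)
    u_a = sum(ranks[:m]) - m * (m + 1) / 2
    a12 = u_a / (m * n)  # estimated probability that a run of a beats a run of b
    tie_term = sum(t**3 - t for t in Counter(pooled).values()) / (total * (total - 1))
    variance = m * n / 12 * ((total + 1) - tie_term)
    if variance <= 0:  # every value is identical: no evidence of a difference
        return u_a, 1.0, a12
    z = (u_a - m * n / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approximation
    return u_a, p_value, a12


hc_scores = collect_runs(hill_climbing, n_runs=30, seed=1)
rs_scores = collect_runs(random_search, n_runs=30, seed=2)
u_demo, p_demo, a12_demo = compare_runs(hc_scores, rs_scores)
print(f"U = {u_demo}, p = {p_demo:.3g}, A12 = {a12_demo:.3f}")
```

An A12 of 0.5 means neither algorithm tends to produce better results, while values close to 1 (or 0) mean the first (or second) algorithm wins in almost every pairing. A small p-value suggests that the observed difference is unlikely to be due to luck alone; a large p-value with few runs may simply indicate that n was too small to detect a difference.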
Disciplines:
Computer science
Author, co-author:
ARCURI, Andrea; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT)
External co-authors:
yes
Document language:
English
Title:
Evaluating Search-Based Techniques With Statistical Tests
A. Arcuri and L. Briand. 2011. A Practical Guide for Using Statistical Tests to Assess Randomized Algorithms in Software Engineering. In ACM/IEEE International Conference on Software Engineering (ICSE). 1-10.
A. Arcuri and L. Briand. 2014. A Hitchhiker's Guide to Statistical Tests for Assessing Randomized Algorithms in Software Engineering. Software Testing, Verification and Reliability (STVR) 24, 3 (2014), 219-250.
M. Cowles and C. Davis. 1982. On the origins of the .05 level of statistical significance. American Psychologist 37, 5 (1982), 553-558.
M. P. Fay and M. A. Proschan. 2010. Wilcoxon-Mann-Whitney or t-test? On assumptions for hypothesis tests and multiple interpretations of decision rules. Statistics Surveys 4 (2010), 1-39.
J. P. A. Ioannidis. 2005. Why most published research findings are false. PLoS medicine 2, 8 (2005), e124.
H. B. Mann and D. R. Whitney. 1947. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. The Annals of Mathematical Statistics 18, 1 (1947), 50-60.
T. V. Perneger. 1998. What's wrong with Bonferroni adjustments. British Medical Journal 316 (1998), 1236-1238.
A. Vargha and H. D. Delaney. 2000. A critique and improvement of the CL common language effect size statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics 25, 2 (2000), 101-132.