Evaluating Search-Based Techniques With Statistical Tests

This tutorial covers the basics of using statistical tests to evaluate and compare search algorithms, in particular when they are applied to software engineering problems. Search algorithms such as Hill Climbing and Genetic Algorithms are randomised: running the same algorithm twice on the same problem can give different results. It is therefore important to run such algorithms multiple times to collect average results, and so avoid publishing wrong conclusions that were based on mere luck. This raises the question of how many times the runs should be repeated. Given a set of n repeated experiments, is n large enough to draw sound conclusions, or should more experiments have been run? Statistical tests such as the Wilcoxon-Mann-Whitney U-test can be used to answer these questions.
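As a rough illustration of the workflow described in the abstract, the following Python sketch runs two configurations of a toy randomised hill climber 30 times each and compares the results with a two-sided Wilcoxon-Mann-Whitney U-test, together with the Vargha-Delaney A12 effect size. It is not taken from the tutorial. The hill climber, the fitness function, the function names, the choice of 30 runs and the use of a normal approximation for the p-value are all assumptions made for the example.

```python
import math
import random


def hill_climb(seed, steps):
    """Hypothetical toy search: maximise f(x) = -(x - 3)^2 from a random start."""
    rng = random.Random(seed)
    x = rng.uniform(-10.0, 10.0)
    best = -(x - 3.0) ** 2
    for _ in range(steps):
        candidate = x + rng.gauss(0.0, 0.5)
        value = -(candidate - 3.0) ** 2
        if value > best:  # accept only improving moves
            x, best = candidate, value
    return best


def rank_with_ties(values):
    """Return 1-based average ranks and the sizes of all tie groups."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def mann_whitney_u(a, b):
    """Two-sided U-test (normal approximation, tie and continuity corrected).

    Returns (U statistic of a, p-value, Vargha-Delaney A12 of a versus b).
    """
    m, n = len(a), len(b)
    ranks, ties = rank_with_ties(list(a) + list(b))
    u_a = sum(ranks[:m]) - m * (m + 1) / 2.0
    a12 = u_a / (m * n)  # estimated probability that a run of a beats a run of b
    total = m + n
    tie_term = sum(t ** 3 - t for t in ties) / (total * (total - 1))
    variance = m * n / 12.0 * ((total + 1) - tie_term)
    if variance == 0:
        return u_a, 1.0, a12  # all observations identical
    z = max(abs(u_a - m * n / 2.0) - 0.5, 0.0) / math.sqrt(variance)
    p_value = math.erfc(z / math.sqrt(2.0))
    return u_a, p_value, a12


# 30 independent runs per configuration, each with its own seed (assumed values).
short_runs = [hill_climb(seed, steps=5) for seed in range(30)]
long_runs = [hill_climb(seed, steps=200) for seed in range(1000, 1030)]

u, p, a12 = mann_whitney_u(long_runs, short_runs)
print(f"U = {u:.1f}, p = {p:.3g}, A12 = {a12:.3f}")
```

An A12 of 0.5 means neither configuration tends to produce better results than the other, while values close to 0 or 1 indicate a large effect. Reporting the effect size next to the p-value shows how large a difference is, not only whether one was detected. For small samples an exact test would be preferable to the normal approximation used here.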

Disciplines :

Computer science

Author, co-author :

ARCURI, Andrea; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT)

External co-authors :

yes

Language :

English

Title :

Evaluating Search-Based Techniques With Statistical Tests

Publication date :

2018

Event name :

The Search-Based Software Testing (SBST) Workshop

Event date :

May

Main work title :

The Search-Based Software Testing (SBST) Workshop

FnR Project :

FNR3949772 - Validation And Verification Laboratory, 2010 (01/01/2012-31/07/2018) - Lionel Briand

Available on ORBilu :

since 18 March 2018

Statistics

Number of views

90 (6 by Unilu)

Number of downloads

3 (3 by Unilu)


Bibliography

A. Arcuri and L. Briand. 2011. A Practical Guide for Using Statistical Tests to Assess Randomized Algorithms in Software Engineering. In ACM/IEEE International Conference on Software Engineering (ICSE). 1-10.
A. Arcuri and L. Briand. 2014. A Hitchhiker's Guide to Statistical Tests for Assessing Randomized Algorithms in Software Engineering. Software Testing, Verification and Reliability (STVR) 24, 3 (2014), 219-250.
M. Cowles and C. Davis. 1982. On the origins of the .05 level of statistical significance. American Psychologist 37, 5 (1982), 553-558.
M. P. Fay and M. A. Proschan. 2010. Wilcoxon-Mann-Whitney or t-test? On assumptions for hypothesis tests and multiple interpretations of decision rules. Statistics Surveys 4 (2010), 1-39.
J. P. A. Ioannidis. 2005. Why most published research findings are false. PLoS Medicine 2, 8 (2005), e124.
H. B. Mann and D. R. Whitney. 1947. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. The Annals of Mathematical Statistics 18, 1 (1947), 50-60.
T. V. Perneger. 1998. What's wrong with Bonferroni adjustments. British Medical Journal 316 (1998), 1236-1238.
A. Vargha and H. D. Delaney. 2000. A critique and improvement of the CL common language effect size statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics 25, 2 (2000), 101-132.