Reproducibility of Empirical Results: Evidence from 1,000 Tests in Finance (with O. Akmansoy, C. Hurlin, A. Menkveld, A. Dreber, F. Holzmeister, J. Huber, M. Johannesson, M. Kirchler, M. Razen, U. Weitzel) New
We analyze the computational reproducibility of more than 1,000 empirical answers to six research questions in finance provided by 168 international research teams. Surprisingly, neither researcher seniority nor the quality of the research paper seems related to the level of reproducibility. Moreover, researchers exhibit strong overconfidence when assessing the reproducibility of their own research and underestimate the difficulty faced by their peers when attempting to reproduce their results. We further find that reproducibility is higher for researchers with better coding skills and for those exerting more effort. It is lower for more technical research questions and for more complex code.
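To make the notion of computational reproducibility concrete, here is a minimal Python sketch of the kind of check involved: re-run a team's code and compare the reproduced numbers to the reported ones within a tolerance. The figures and the tolerance are illustrative assumptions, not the project's actual verification rule.

```python
import numpy as np

# Hypothetical reported estimates (numbers taken from a team's paper) and the
# values obtained when re-running that team's code on the same data.
reported   = np.array([0.142, -0.031, 2.507])
reproduced = np.array([0.142, -0.031, 2.506])

# A simple operational definition of computational reproducibility: every
# reported number is matched by the re-run within a small relative tolerance.
# The 1% tolerance is an assumption for this sketch, not the project's rule.
def is_reproduced(reported, reproduced, rel_tol=1e-2):
    return np.allclose(reported, reproduced, rtol=rel_tol, atol=0.0)

print(is_reproduced(reported, reproduced))   # True: the result reproduces
```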
Non-Standard Errors (with A. Menkveld, A. Dreber, F. Holzmeister, J. Huber, M. Johannesson, M. Kirchler, M. Razen, U. Weitzel et al.)
My contribution: I co-designed and co-implemented the reproducibility verification policy of the #fincap project.
In statistics, samples are drawn from a population in a data-generating process (DGP). Standard errors measure the uncertainty in sample estimates of population parameters. In science, evidence is generated to test hypotheses in an evidence-generating process (EGP). We claim that EGP variation across researchers adds uncertainty: non-standard errors. To study them, we let 164 teams test six hypotheses on the same sample. We find that non-standard errors are sizeable, on par with standard errors. Their size (i) co-varies only weakly with team merits, reproducibility, or peer rating, (ii) declines significantly after peer-feedback, and (iii) is underestimated by participants.
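As a rough illustration of the distinction, the Python sketch below contrasts a standard error (sampling uncertainty within one team's estimate) with a non-standard error (the dispersion of point estimates across teams that analyse the same sample but make different analysis choices). The trimming rules standing in for the teams' EGP choices are hypothetical and unrelated to the six hypotheses tested in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# One fixed sample that every team analyses: the EGP varies, the data do not.
# Illustrative target quantity: the mean of x.
x = rng.normal(loc=1.0, scale=2.0, size=5_000)

# Each "team" makes a different, defensible analysis choice (here, a trimming rule).
def estimate(sample, trim):
    lo, hi = np.quantile(sample, [trim, 1 - trim])
    kept = sample[(sample >= lo) & (sample <= hi)]
    est = kept.mean()
    se = kept.std(ddof=1) / np.sqrt(len(kept))    # standard error: sampling uncertainty
    return est, se

choices = [0.0, 0.01, 0.02, 0.05, 0.10]           # one EGP choice per team
results = [estimate(x, t) for t in choices]
estimates = np.array([r[0] for r in results])
std_errors = np.array([r[1] for r in results])

# Non-standard error: dispersion of point estimates across teams on the same sample.
non_standard_error = estimates.std(ddof=1)

print(f"mean standard error across teams : {std_errors.mean():.4f}")
print(f"non-standard error (across teams): {non_standard_error:.4f}")
```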
The Fairness of Credit Scoring Models (with C. Hurlin and S. Saurin) Updated
In credit markets, screening algorithms aim to discriminate between good-type and bad-type borrowers. However, when doing so, they also often discriminate between individuals sharing a protected attribute (e.g., gender, age, racial origin) and the rest of the population. In this paper, we show how (1) to test whether there exists a statistically significant difference between protected and unprotected groups, which we call a lack of fairness, and (2) to identify the variables that cause the lack of fairness. We then use these variables to optimize the fairness-performance trade-off. Our framework provides guidance on how algorithmic fairness can be monitored by lenders, controlled by their regulators, and improved for the benefit of protected groups.
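As an illustration of step (1), the sketch below tests whether a scoring model's approval rate differs significantly between a protected and an unprotected group, using a two-proportion z-test on simulated decisions. The data, the approval-rate metric, and the test are assumptions made for this example; they are not the fairness statistic developed in the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical data: model approval decisions and a binary protected attribute.
# These arrays stand in for a fitted scoring model's output.
n = 10_000
protected = rng.integers(0, 2, size=n)                  # 1 = protected group
approved = rng.binomial(1, np.where(protected == 1, 0.55, 0.60))

p1 = approved[protected == 1].mean()                    # approval rate, protected
p0 = approved[protected == 0].mean()                    # approval rate, unprotected
n1 = (protected == 1).sum()
n0 = (protected == 0).sum()

# Two-proportion z-test of H0: equal approval rates across the two groups.
p_pool = approved.mean()
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n0))
z = (p1 - p0) / se
p_value = 2 * stats.norm.sf(abs(z))

print(f"approval gap (protected - unprotected): {p1 - p0:+.3f}")
# A small p-value rejects equal treatment: a statistically significant gap,
# i.e. a lack of fairness in this illustrative sense.
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```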
The Economics of Research Reproducibility (with J.-E. Colliard and C. Hurlin)
We investigate why economics displays a relatively low level of research reproducibility. First, we study the benefits and costs of reproducibility for readers (demand side) and authors (supply side), as well as the role of academic journals in matching both sides. Second, we prove that competition between journals to attract authors can lead to a suboptimally low level of reproducibility. Third, we show how to optimize the costs of reproducibility and estimate that reaching the highest level of reproducibility could cost USD 365 per paper. Finally, we discuss how leading journals can move economics out of a low-reproducibility equilibrium.