Racial disparity in academia is a widely acknowledged problem. A quantitative understanding of race-based systemic inequalities is an important step towards a more equitable research system. However, few large-scale analyses have been performed on this topic, mostly because of the lack of robust race-disambiguation algorithms. Identifying information about authors does not generally include their race. Therefore, an algorithm must be employed that uses known information about authors, i.e., their names, to infer their perceived race. Nevertheless, as with any other algorithm, the process of racial inference can generate biases if it is not carefully considered. When the research is focused on understanding race-based inequalities, such biases undermine the objectives of the investigation and may perpetuate inequities. The goal of this article is to assess the biases introduced by the different approaches used in name-based racial inference. We use information from the US census and mortgage applications to infer the race of US author names in the Web of Science. We estimate the effects of using given and family names, thresholds or continuous distributions, and imputation. Our results demonstrate that the validity of name-based inference varies by race and ethnicity and that threshold approaches underestimate Black authors and overestimate White authors. We conclude with recommendations to avoid potential biases. This article fills an important research gap that will allow more systematic and unbiased studies on racial disparity in science.
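The contrast between threshold and continuous-distribution approaches mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the name keys, the probability values, and the function names `threshold_counts` and `fractional_counts` are all hypothetical, and the numbers are made up for illustration rather than taken from census or mortgage data. The sketch only shows why assigning each author to a single most-likely group can drop a group that fractional counting would still register.

```python
from collections import defaultdict

# Hypothetical name-to-race probability distributions.
# These values are invented for illustration; they are NOT census figures.
NAME_PROBS = {
    "name_a": {"White": 0.90, "Black": 0.05, "Hispanic": 0.03, "Asian": 0.02},
    "name_b": {"White": 0.55, "Black": 0.40, "Hispanic": 0.03, "Asian": 0.02},
    "name_c": {"White": 0.50, "Black": 0.45, "Hispanic": 0.03, "Asian": 0.02},
    "name_d": {"White": 0.05, "Black": 0.05, "Hispanic": 0.05, "Asian": 0.85},
}


def threshold_counts(authors, threshold=0.5):
    """Assign each author wholly to the most likely group if its
    probability reaches the threshold; otherwise leave them unassigned."""
    counts = defaultdict(float)
    for name in authors:
        group, prob = max(NAME_PROBS[name].items(), key=lambda kv: kv[1])
        if prob >= threshold:
            counts[group] += 1
    return dict(counts)


def fractional_counts(authors):
    """Distribute each author across all groups in proportion to the
    name's probability distribution (the continuous approach)."""
    counts = defaultdict(float)
    for name in authors:
        for group, prob in NAME_PROBS[name].items():
            counts[group] += prob
    return dict(counts)


authors = ["name_a", "name_b", "name_c", "name_d"]
print("threshold: ", threshold_counts(authors))
print("fractional:", fractional_counts(authors))
```

With these invented values, the threshold approach assigns three of the four authors to White and none to Black, while the fractional approach attributes roughly one author's worth of weight to Black. The direction of the effect depends entirely on the assumed distributions, so the sketch illustrates the mechanism only and is not evidence about real data.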
Research center :
University of Luxembourg
Sociology & social sciences
Author, co-author :
Kozlowski, Diego; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Engineering (DoE)
Murray, Dakota S.; Indiana University Bloomington, IN, USA > School of Informatics, Computing, and Engineering
Bell, Alexis; Berry College, GA, USA > Campbell School of Business
Hulsey, Will; Berry College, GA, USA > Campbell School of Business
Larivière, Vincent; Université de Montréal, Montréal, QC, Canada > École de bibliothéconomie et des sciences de l’information
Monroe-White; Berry College, GA, USA > Campbell School of Business > Assistant Professor of Technology, Entrepreneurship, and Data Analytics
Sugimoto, Cassidy R.; Indiana University Bloomington, IN, USA > School of Informatics, Computing, and Engineering
Title :
Avoiding bias when inferring race using name-based approaches
Event name :
18th International Conference on Scientometrics & Informetrics
Event date :
from 12-07-2021 to 15-07-2021
Main work title :
18th International Conference on Scientometrics & Informetrics, 12–15 July 2021, KU Leuven, Belgium