Sub-group Fair Coding Taken to Scale for Science, Technology, Engineering, and Mathematics Learning
Effective Years: 2021-2026
This project will advance research in an area important to the national need for well-educated scientists, mathematicians, engineers, and technicians through the creation and validation of a process designed to code educational data effectively and fairly. Over its five-year duration, the project will develop and test an approach to coding data on learning in science, technology, engineering, and mathematics (STEM) that takes differences between groups into account without requiring researchers to code data by hand. The approach will thus save time while also helping to answer questions about group differences in STEM learning, and it will be made publicly available for use by researchers coding learning data while accounting for differences across groups.
As a result of these efforts, algorithms will be developed for learning science research that will: (i) produce fair classifiers and fair sets of codes, and identify conceptual shift among subgroups; (ii) support the identification of intersectional subgroups that may be modeled unfairly, without requiring the interactions of group characteristics to be specified in advance; (iii) control for elevated Type I error rates; and (iv) provide an interface usable by researchers who care about fair coding. The technique being tested will enable this to occur with less data than is typically needed outside of learning science research. Though fairness in coding is a well-recognized challenge, particularly in STEM education research, the issue is handled almost exclusively by researchers examining their data by hand to see whether the coding appears fair. This is time-consuming and usually done only when there are strong a priori reasons for concern about fairness. Data science offers relatively sophisticated approaches to fairness in coding, but these have significant limitations (e.g., the need for very large amounts of human-coded data) that are currently insurmountable in learning science research. As such, a validated method to code data that systematically checks for subgroup fairness and can identify the presence of conceptual shift will advance the field of data science in STEM learning research. Dissemination will be assured by providing the protocol and algorithms associated with this process to other researchers through the R Project for Statistical Computing, journal articles, and presentations. This project is funded by the EHR Core Research (ECR) program, which supports work that advances fundamental research on STEM learning and learning environments, broadening participation in STEM, and STEM workforce development.
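To make the subgroup fairness check and Type I error control concrete, the following is a minimal illustrative sketch (not the project's actual algorithm, whose details are not given here): it compares a classifier's agreement with human codes across hypothetical subgroups using two-proportion z-tests, and applies a Holm step-down correction so that the family-wise Type I error rate stays controlled across the multiple subgroup comparisons. The subgroup names and agreement counts are invented toy data.

```python
# Illustrative subgroup fairness audit (toy data, not the project's method):
# flag subgroups whose agreement rate with human codes differs from a
# reference subgroup, controlling the family-wise Type I error rate.
from math import sqrt, erf

def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Two-sided p-value for H0: the two subgroup agreement rates are equal."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

def holm_flags(pvalues, alpha=0.05):
    """Holm step-down procedure: True where H0 is rejected, controlling FWER."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvalues[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

# Toy audit: (agreements, total codings) per subgroup, each non-reference
# subgroup compared against the reference subgroup "A".
agreement = {"A": (90, 100), "B": (88, 100), "C": (70, 100)}
ref_hits, ref_n = agreement["A"]
groups = [g for g in agreement if g != "A"]
pvals = [two_proportion_z(*agreement[g], ref_hits, ref_n) for g in groups]
flagged = [g for g, rej in zip(groups, holm_flags(pvals)) if rej]
print(flagged)  # → ['C']: only subgroup C's agreement rate differs reliably
```

In this toy run, subgroup B's small difference from the reference is not flagged, while subgroup C's larger gap survives the Holm correction; a real audit in the project's setting would presumably use richer fairness metrics and handle intersectional subgroups, which this sketch does not attempt.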
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.