<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3-mathml3.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:ali="http://www.niso.org/schemas/ali/1.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" dtd-version="1.3" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Educ.</journal-id>
<journal-title-group>
<journal-title>Frontiers in Education</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Educ.</abbrev-journal-title>
</journal-title-group>
<issn pub-type="epub">2504-284X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/feduc.2025.1620029</article-id>
<article-version article-version-type="Version of Record" vocab="NISO-RP-8-2008"/>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Original Research</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Advancing textbook evaluation with debiased machine learning: a theoretical and empirical approach</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Bian</surname> <given-names>Yong</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<xref ref-type="author-notes" rid="fn001"><sup>&#x02020;</sup></xref>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Writing &#x2013; review &amp; editing" vocab-term-identifier="https://credit.niso.org/contributor-roles/writing-review-editing/">Writing &#x2013; review &#x00026; editing</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Writing &#x2013; original draft" vocab-term-identifier="https://credit.niso.org/contributor-roles/writing-original-draft/">Writing &#x2013; original draft</role>
<uri xlink:href="https://loop.frontiersin.org/people/1922414"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Fang</surname> <given-names>Zhou</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Writing &#x2013; original draft" vocab-term-identifier="https://credit.niso.org/contributor-roles/writing-original-draft/">Writing &#x2013; original draft</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Writing &#x2013; review &amp; editing" vocab-term-identifier="https://credit.niso.org/contributor-roles/writing-review-editing/">Writing &#x2013; review &#x00026; editing</role>
<uri xlink:href="https://loop.frontiersin.org/people/1256222"/>
</contrib>
</contrib-group>
<aff id="aff1"><label>1</label><institution>Department of Family Medicine and Public Health Sciences, Wayne State University</institution>, <city>Detroit, MI</city>, <country country="us">United States</country></aff>
<aff id="aff2"><label>2</label><institution>Center for Molecular Medicine and Genetics (CMMG), Wayne State University</institution>, <city>Detroit, MI</city>, <country country="us">United States</country></aff>
<aff id="aff3"><label>3</label><institution>Department of Psychological &#x00026; Brain Sciences, Texas A&#x00026;M University</institution>, <city>College Station, TX</city>, <country country="us">United States</country></aff>
<author-notes>
<corresp id="c001"><label>&#x0002A;</label>Correspondence: Yong Bian, <email xlink:href="mailto:oliviabiany@hotmail.com">oliviabiany@hotmail.com</email></corresp>
<fn fn-type="other" id="fn001"><label>&#x02020;</label><p>ORCID: Yong Bian <uri xlink:href="https://orcid.org/0000-0002-0557-3145">orcid.org/0000-0002-0557-3145</uri></p></fn>
</author-notes>
<pub-date publication-format="electronic" date-type="pub" iso-8601-date="2026-01-12">
<day>12</day>
<month>01</month>
<year>2026</year>
</pub-date>
<pub-date publication-format="electronic" date-type="collection">
<year>2025</year>
</pub-date>
<volume>10</volume>
<elocation-id>1620029</elocation-id>
<history>
<date date-type="received">
<day>14</day>
<month>05</month>
<year>2025</year>
</date>
<date date-type="rev-recd">
<day>08</day>
<month>12</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>08</day>
<month>12</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2026 Bian and Fang.</copyright-statement>
<copyright-year>2026</copyright-year>
<copyright-holder>Bian and Fang</copyright-holder>
<license>
<ali:license_ref start_date="2026-01-12">https://creativecommons.org/licenses/by/4.0/</ali:license_ref>
<license-p>This is an open-access article distributed under the terms of the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution License (CC BY)</ext-link>. The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</license-p>
</license>
</permissions>
<abstract>
<sec>
<title>Introduction</title>
<p>Textbooks can substantially influence student achievement, but common evaluation approaches (e.g., linear regression) often depend on strong functional-form assumptions that may misstate causal effects. This study presents Double/Debiased Machine Learning (DML) as a more flexible framework for estimating the causal impact of textbooks on learning outcomes.</p></sec>
<sec>
<title>Methods</title>
<p>We use DML to estimate textbook effects while allowing high-dimensional, non-parametric modeling of outcome and treatment assignment processes. We (1) derive the theoretical advantages of DML-particularly its robustness to model misspecification and its use of orthogonalized estimating equations-and (2) apply the approach to an existing large-scale elementary school mathematics curriculum dataset. We compare DML estimates to those produced by Ordinary Least Squares (OLS) regression and Kernel matching, focusing on precision and efficiency of causal effect estimation.</p></sec>
<sec>
<title>Results</title>
<p>Across the empirical application, DML yields more precise and efficient estimates of textbook effects than OLS and Kernel matching. The approach reduces reliance on restrictive linearity assumptions and improves the stability of estimated causal impacts in settings where relationships between covariates, curriculum assignment, and outcomes are complex.</p></sec>
<sec>
<title>Discussion</title>
<p>These findings indicate that DML is a robust alternative for evaluating educational materials, offering clearer evidence to inform curriculum selection and adoption decisions. More broadly, the study contributes methodologically to learning and intelligence research by strengthening the tools used to measure educational interventions&#x00027; effects on achievement.</p></sec></abstract>
<kwd-group>
<kwd>California Math</kwd>
<kwd>causal</kwd>
<kwd>DML</kwd>
<kwd>estimation</kwd>
<kwd>non-parametric modeling</kwd>
</kwd-group>
<funding-group>
<funding-statement>The author(s) declared that financial support was not received for this work and/or its publication.</funding-statement>
</funding-group>
<counts>
<fig-count count="0"/>
<table-count count="8"/>
<equation-count count="15"/>
<ref-count count="25"/>
<page-count count="9"/>
<word-count count="6626"/>
</counts>
<custom-meta-group>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>STEM Education</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec sec-type="introduction" id="s1">
<label>1</label>
<title>Introduction</title>
<p>Student learning outcomes are significantly influenced by the quality of instructional materials. Among the many factors that affect academic success, textbooks remain one of the most widely available resources for educational advancement. Textbook selection provides schools with an effective way to improve student performance with relatively little effort, in contrast to other interventions that call for intensive training or structural modifications (<xref ref-type="bibr" rid="B2">Arumuru and David, 2024</xref>; <xref ref-type="bibr" rid="B20">Li and Wang, 2024</xref>).</p>
<p>This pattern holds in K-12 mathematics education as well: improving achievement through textbook selection is a common and widely accepted practice (<xref ref-type="bibr" rid="B25">Slavin and Lake, 2008</xref>). Scholars have attempted to quantify the impact of textbook selection on academic performance. However, upon reviewing the existing literature on mathematics textbook selection, we found that the evaluation methodology is not well developed. <xref ref-type="table" rid="T1">Table 1</xref> summarizes key papers that have examined mathematics textbook selection and their methodological approaches.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Information on selected papers studying mathematics textbook selection.</p></caption>
<table frame="box" rules="all">
<thead>
<tr>
<th valign="top" align="left"><bold>Title</bold></th>
<th valign="top" align="left"><bold>Research object</bold></th>
<th valign="top" align="left"><bold>Evaluation method used</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Are first graders&#x00027; arithmetic skills related to the quality of mathematics textbooks? A study on students&#x00027; use of arithmetic principles (<xref ref-type="bibr" rid="B24">Sievert et al., 2021</xref>)</td>
<td valign="top" align="left">2,462 students from 40 schools and 127 classes in Schleswig-Holstein</td>
<td valign="top" align="left">OLS (multilevel)</td>
</tr>
<tr>
<td valign="top" align="left">Curriculum reform in the common core era: evaluating elementary math textbooks across six U.S. States (<xref ref-type="bibr" rid="B7">Blazar et al., 2020</xref>)</td>
<td valign="top" align="left">Over 6,000 schools across six states: California, Louisiana, Maryland, New Jersey, New Mexico, and Washington</td>
<td valign="top" align="left">OLS</td>
</tr>
<tr>
<td valign="top" align="left">The formalized processes districts use to evaluate mathematics textbooks (<xref ref-type="bibr" rid="B23">Polikoff et al., 2019</xref>)</td>
<td valign="top" align="left">34 education leaders</td>
<td valign="top" align="left">Interview</td>
</tr>
<tr>
<td valign="top" align="left">Learning by the book: comparing math achievement growth by textbook in six Common Core states (<xref ref-type="bibr" rid="B8">Blazar et al., 2019</xref>)</td>
<td valign="top" align="left">5,107 schools from 6 states</td>
<td valign="top" align="left">OLS</td>
</tr>
<tr>
<td valign="top" align="left">Mathematics curriculum effects on student achievement in California (<xref ref-type="bibr" rid="B17">Koedel et al., 2017</xref>)</td>
<td valign="top" align="left">5,494 schools in California</td>
<td valign="top" align="left">Kernel matching and restricted OLS</td>
</tr>
<tr>
<td valign="top" align="left">Opportunities to learn: mathematics textbooks and students&#x00027; achievements (<xref ref-type="bibr" rid="B14">Hadar, 2017</xref>)</td>
<td valign="top" align="left">4,040 eighth-grade students in an Arab community</td>
<td valign="top" align="left">OLS (hierarchical)</td>
</tr>
<tr>
<td valign="top" align="left">Big bang for just a few bucks: The impact of math textbooks in California (<xref ref-type="bibr" rid="B18">Koedel and Polikoff, 2017</xref>)</td>
<td valign="top" align="left">&#x0007E;7,600 schools in California state that serve grades K-8</td>
<td valign="top" align="left">Kernel matching, restricted OLS, and remnant-based residualized matching</td>
</tr>
<tr>
<td valign="top" align="left">How well aligned are textbooks to the common core standards in mathematics? (<xref ref-type="bibr" rid="B22">Polikoff, 2015</xref>)</td>
<td valign="top" align="left">7 math books</td>
<td valign="top" align="left">Grading (Alignment index)</td>
</tr>
<tr>
<td valign="top" align="left">Is curriculum quality uniform? Evidence from Florida (<xref ref-type="bibr" rid="B5">Bhatt et al., 2013</xref>)</td>
<td valign="top" align="left">1,205 schools in Florida</td>
<td valign="top" align="left">OLS and Kernel matching</td>
</tr>
<tr>
<td valign="top" align="left">Large-scale evaluations of curricular effectiveness (<xref ref-type="bibr" rid="B4">Bhatt and Koedel, 2012</xref>)</td>
<td valign="top" align="left">716 schools in Indiana</td>
<td valign="top" align="left">OLS, Kernel matching, and LLR matching</td>
</tr></tbody>
</table>
</table-wrap>
<p>Traditional causal effect analysis methods, such as fixed effects, difference-in-differences, and two-stage least squares (2SLS), are commonly used in empirical studies, often through linear models. However, linearity is a strong assumption when modeling aims to uncover causal relationships. Linear partial-effect estimates may fail to capture causal effects that are non-constant or take complicated forms. While some may argue that adding polynomials and interaction terms to a linear model can capture non-constant and heterogeneous effects, the number of regressors is limited by the sample size: the number of parameters must be substantially smaller than the number of observations to obtain precise estimates. OLS performs poorly when the sample size is limited and the number of parameters is large or even grows with the sample size (<xref ref-type="bibr" rid="B21">Liu et al., 2023</xref>).</p>
<p>Linear models serve different purposes depending on the situation. OLS regression captures the overall features of the data but may be less flexible in revealing details (<xref ref-type="bibr" rid="B1">Acito, 2023</xref>). Population relationships can be complex and interact non-linearly, making a linear model difficult to justify. In statistical modeling, the objective is to fit the data and make accurate predictions. One could build local non-linear functions and sum them, as in kernel smoothing methods, but this approach can be tedious and inefficient in empirical economic studies (<xref ref-type="bibr" rid="B3">Batlle et al., 2025</xref>; <xref ref-type="bibr" rid="B16">Hiabu et al., 2019</xref>). Because social scientists are more interested in explanations than in a well-fitted model alone, they often turn to non-parametric methods such as kernel matching (<xref ref-type="bibr" rid="B15">Heckman et al., 1998</xref>), or continue to use a linear model even if it has low predictive power or fits the data poorly, as long as it has good explanatory properties (<xref ref-type="bibr" rid="B9">Breznau, 2022</xref>).</p>
<p>Even if a linear model is the final &#x0201C;best&#x0201D; choice among various empirical techniques for an economics researcher, it may still be challenging to construct a robust linear-based model. The researcher must decide what interaction terms, year dummies, and fixed-effect terms to add, which depends on subjective judgment or convention. For example, researchers often include as many regressors (characteristic variables) as possible in their regressions to reduce bias and control variance (<xref ref-type="bibr" rid="B19">Li and M&#x000FC;ller, 2021</xref>). However, increasing the number of independent variables in an OLS regression can lead to multicollinearity and overfitting issues (<xref ref-type="bibr" rid="B13">Efeizomor, 2023</xref>). Additionally, when the sample size is smaller than the number of parameters, the existence of nuisance parameters can result in poor performance of traditional OLS regression, even if the interest is only in a small part of the model&#x00027;s parameters (<xref ref-type="bibr" rid="B19">Li and M&#x000FC;ller, 2021</xref>).</p>
<p>Although Kernel matching already addresses most of these concerns, we suggest Double/Debiased Machine Learning (DML) as a suitable alternative. DML is a recently developed causal estimation technique that leverages the predictive power of machine learning to obtain consistent causal estimators under high-dimensional covariates (<xref ref-type="bibr" rid="B11">Chernozhukov et al., 2018</xref>). When predicting the treatment effect, DML uses a more general algorithm that integrates all regressors in a non-parametric model, allowing both linear and non-linear functional forms and leading to more precise predictions. As a non-parametric method, DML allows the relationship between the treatment and the controls to take any functional form, making it more general and robust than Kernel matching.</p>
<p>In this study, we will first derive the algorithm mathematically to explain how it addresses the limitations of traditional models. Following this, we will apply a slightly modified DML method to a dataset recently used in an elementary school math curriculum study (<xref ref-type="bibr" rid="B17">Koedel et al., 2017</xref>). The aim of this second step is to compare our results with an existing analysis and demonstrate the superiority of DML under these circumstances. Our causal estimates are more efficient than those of Koedel et al.</p>
<p>We selected this dataset for two reasons. First, in the previous study, the authors used traditional estimation methods of kernel matching and restricted OLS as causal analysis tools, which is a perfect foundation for comparison. Second, the authors expressed concerns about the linear setting in their paper, and our work builds a partially linear model with DML, serving as a legitimate extension of their study.</p>
<p>The remainder of the article is organized as follows. The next section introduces the model and provides a mathematical explanation for its advantages in estimating the textbook effects. Section 3 presents the background, data, and results of the study we aim to compare with, using the new model. Section 4 presents our results using the same dataset and compares them with the previous research. Finally, Section 5 concludes this study.</p></sec>
<sec id="s2">
<label>2</label>
<title>The DML model</title>
<p>In this section, we briefly introduce the models and explain their advantages. Since our algorithm is based on the original work by <xref ref-type="bibr" rid="B11">Chernozhukov et al. (2018)</xref>, reading their paper will be especially helpful for understanding this section. We first use a partially linear model, with the binary treatment variable, <italic>D</italic>, entering linearly alongside a non-parametric function, <italic>g</italic><sub>0</sub>(.), of the control variables, <bold>X</bold>. We also use a more general model in which the binary treatment variable <italic>D</italic> enters a fully non-parametric function. The partially linear model is:</p>
<disp-formula id="EQ1"><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtable style="text-align:axis;" equalrows="false" columnlines="none none none none none none none none none" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:mtext>Y</mml:mtext><mml:mo>=</mml:mo><mml:mtext>D</mml:mtext><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mtext>g</mml:mtext></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>U</mml:mi><mml:mo>,</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mi>&#x1D53C;</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mtext>U</mml:mtext><mml:mo>&#x02223;</mml:mo><mml:mtext>X</mml:mtext><mml:mo>,</mml:mo><mml:mtext>D</mml:mtext></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>D</mml:mtext><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mtext>m</mml:mtext></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>V</mml:mi><mml:mo>,</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mi>&#x1D53C;</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mtext>V</mml:mtext><mml:mo>&#x02223;</mml:mo><mml:mtext>X</mml:mtext></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr><mml:mtr></mml:mtr></mml:mtable></mml:mtd></mml:mtr></mml:mtable></mml:math><label>(1)</label></disp-formula>
<p><xref ref-type="disp-formula" rid="EQ1">Equation 1</xref> imposes a constant causal effect on every observation, whereas the more general non-parametric model allows the effect to be heterogeneous across observations. Letting the binary variable <italic>D</italic> enter the <italic>g</italic><sub>0</sub> function:</p>
<disp-formula id="EQ2"><mml:math id="M2"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtable style="text-align:axis;" equalrows="false" columnlines="none none none none none none none none none" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:mtext>Y</mml:mtext><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mtext>g</mml:mtext></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>D</mml:mi><mml:mo>,</mml:mo><mml:mi>X</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>U</mml:mi><mml:mo>,</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mi>&#x1D53C;</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mtext>U</mml:mtext><mml:mo>&#x02223;</mml:mo><mml:mtext>X</mml:mtext><mml:mo>,</mml:mo><mml:mtext>D</mml:mtext></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>D</mml:mtext><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mtext>m</mml:mtext></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>V</mml:mi><mml:mo>,</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mi>&#x1D53C;</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mtext>V</mml:mtext><mml:mo>&#x02223;</mml:mo><mml:mtext>X</mml:mtext></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>.</mml:mo></mml:mtd></mml:mtr><mml:mtr></mml:mtr></mml:mtable></mml:mtd></mml:mtr></mml:mtable></mml:math><label>(2)</label></disp-formula>
<p>The parameter of interest is the average treatment effect (ATE), which can be derived from <xref ref-type="disp-formula" rid="EQ2">Equation 2</xref>. When the conditional independence assumption (CIA) is satisfied, the ATE equals the model parameter &#x003B8;<sub>0</sub>, as shown in <xref ref-type="disp-formula" rid="EQ3">Equation 3</xref>:</p>
<disp-formula id="EQ3"><mml:math id="M3"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x1D53C;</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>X</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>X</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math><label>(3)</label></disp-formula>
<p><italic>X</italic> affects the treatment through <italic>m</italic><sub>0</sub>(<italic>X</italic>) and the outcome variable through <italic>g</italic><sub>0</sub>(<italic>D, X</italic>). Both the <italic>g</italic><sub>0</sub> and <italic>m</italic><sub>0</sub> functions are non-parametric, unknown, and potentially complicated. Additionally, unconfoundedness or the conditional independence assumption (CIA)<xref ref-type="fn" rid="fn0003"><sup>1</sup></xref> must be satisfied if the goal is to identify the causal effect.</p>
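<p>To make <xref ref-type="disp-formula" rid="EQ2">Equation 2</xref> and <xref ref-type="disp-formula" rid="EQ3">Equation 3</xref> concrete, the following Python sketch simulates a heterogeneous-effect model and recovers the ATE by averaging <italic>g</italic><sub>0</sub>(1, <italic>X</italic>) &#x02212; <italic>g</italic><sub>0</sub>(0, <italic>X</italic>). The particular choices of <italic>g</italic><sub>0</sub> and <italic>m</italic><sub>0</sub> here are arbitrary illustrations, not the specification used in our empirical application.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Covariates and illustrative nuisance functions (our own choices, for demonstration only)
X = rng.normal(size=(n, 2))

def g0(d, x):
    # Non-parametric outcome function g0(D, X) with a heterogeneous treatment effect
    return d * (0.5 + 0.3 * x[:, 0] ** 2) + np.sin(x[:, 1])

m0 = 1 / (1 + np.exp(-X[:, 0]))    # propensity E[D | X]
D = rng.binomial(1, m0)            # binary treatment, so E[V | X] = 0 by construction
Y = g0(D, X) + rng.normal(size=n)  # outcome, with E[U | X, D] = 0

# Equation 3: ATE = E[g0(1, X) - g0(0, X)], here 0.5 + 0.3 * E[X0^2] = 0.8
ate = np.mean(g0(1, X) - g0(0, X))
print(round(ate, 2))
```

<p>CIA holds by construction in this simulation, since treatment assignment depends only on the observed covariates; the population average above is then exactly the causal quantity of interest.</p>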
<p>The main idea of DML is to build a Neyman-orthogonal score function and to split the sample for cross-fitting. A simple approach to estimating &#x003B8;<sub>0</sub> in <xref ref-type="disp-formula" rid="EQ1">Equation 1</xref> is to subtract an estimator of <italic>g</italic><sub>0</sub>(<bold>X</bold>) from <italic>Y</italic> and apply the OLS procedure afterward.<xref ref-type="fn" rid="fn0004"><sup>2</sup></xref> A naive estimator of &#x003B8;<sub>0</sub> is then given by:</p>
<disp-formula id="EQ4"><mml:math id="M4"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>I</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:msubsup><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>I</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>Y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math><label>(4)</label></disp-formula>
<p>In <xref ref-type="disp-formula" rid="EQ4">Equation 4</xref>, <italic>D&#x003B8;</italic><sub>0</sub>&#x0002B;<italic>g</italic><sub>0</sub>(<bold>X</bold>) is a conditional expectation function (CEF). The functional form of <italic>g</italic><sub>0</sub> is unknown and unrestricted and may be non-linear and complicated, whereas the <italic>D&#x003B8;</italic><sub>0</sub> term imposes a linear restriction that may not hold. Since the true CEF is unknown, a more flexible <italic>g</italic><sub>0</sub>(<bold>X</bold>) allows the fitted function to approximate the true CEF as closely as possible.</p>
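<p>As a concrete illustration, the following Python sketch implements the naive estimator of <xref ref-type="disp-formula" rid="EQ4">Equation 4</xref> on simulated data. The data-generating process and the choice to fit the outcome nuisance with a random forest on untreated units (for whom <italic>Y</italic> = <italic>g</italic><sub>0</sub>(<italic>X</italic>) &#x0002B; <italic>U</italic>) are illustrative assumptions, not the procedure used in our empirical application.</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n, p = 4000, 5
theta0 = 0.8  # true constant treatment effect

# Synthetic partially linear model (illustrative choices, not the paper's data)
X = rng.normal(size=(n, p))
g0 = np.sin(X[:, 0]) + X[:, 1] ** 2
D = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = theta0 * D + g0 + rng.normal(size=n)

# Sample splitting: fit g_hat on an auxiliary half, estimate theta on the main half
aux = np.arange(n) < n // 2
main = ~aux
fit_mask = aux & (D == 0)  # untreated units satisfy Y = g0(X) + U
g_hat = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[fit_mask], Y[fit_mask])

# Equation 4: theta_hat = (sum_i D_i^2)^(-1) * sum_i D_i * (Y_i - g_hat(X_i))
Dm, Ym, Xm = D[main], Y[main], X[main]
theta_hat = np.sum(Dm * (Ym - g_hat.predict(Xm))) / np.sum(Dm ** 2)
print(round(theta_hat, 2))
```

<p>Because the machine-learning fit of the nuisance function is regularized, this plug-in estimate inherits regularization bias, which is precisely what the orthogonalized DML score is designed to remove.</p>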
<p>However, this naive estimator of &#x003B8;<sub>0</sub> does not converge properly when machine learning is used to estimate <italic>g</italic><sub>0</sub>(<bold>X</bold>) (<xref ref-type="bibr" rid="B11">Chernozhukov et al., 2018</xref>). The constraints in Lasso and Ridge regression, the penalties in neural networks, and other forms of regularization increase estimation bias in order to control variance. The resulting regularization bias is an unavoidable consequence of the bias&#x02013;variance tradeoff: regularization keeps the variance small but increases the bias. The scaled decomposition of the estimation error of the naive estimator in the partially linear model is given by:</p>
<disp-formula id="EQ5"><mml:math id="M5"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtable style="text-align:axis;" equalrows="false" columnlines="none none none none none none none none none" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:msqrt><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msqrt><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover accent="false"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mi>&#x00026;</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:msub><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>I</mml:mi></mml:mrow></mml:msub></mml:mstyle><mml:msubsup><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac><mml:mstyle 
displaystyle="true"><mml:msub><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>I</mml:mi></mml:mrow></mml:msub></mml:mstyle><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>U</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>&#x00026;</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mo>&#x0002B;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:msub><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>I</mml:mi></mml:mrow></mml:msub></mml:mstyle><mml:msubsup><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:msub><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>I</mml:mi></mml:mrow></mml:msub></mml:mstyle><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr></mml:mtr></mml:mtable></mml:mtd></mml:mtr></mml:mtable></mml:math><label>(5)</label></disp-formula>
<p>the second term of which is:</p>
<disp-formula id="EQ6"><mml:math id="M6"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>I</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:msubsup><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>I</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:msub><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>X</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>i</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mover 
accent="false"><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math><label>(6)</label></disp-formula>
<p>Because of this bias, the sum of the <italic>n</italic> terms of <inline-formula><mml:math id="M7"><mml:msub><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>X</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>i</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> in <xref ref-type="disp-formula" rid="EQ5">Equations 5</xref>, <xref ref-type="disp-formula" rid="EQ6">6</xref> does not have a mean of zero. 
While a well-behaved parametric estimator converges at rate <inline-formula><mml:math id="M8"><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:math></inline-formula> in the root mean squared error sense, the convergence rate of <inline-formula><mml:math id="M9"><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> to <italic>g</italic><sub>0</sub> is slower than 1/2. We denote the convergence rate of <inline-formula><mml:math id="M10"><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> by &#x003C6;<sub><italic>g</italic></sub>, with &#x003C6;<sub><italic>g</italic></sub> &#x0003C; 1/2. The performance of <inline-formula><mml:math id="M11"><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is therefore poor: when the sample size is relatively small, the slow convergence rate may leave the estimator far from the true parameter.</p>
<p>A general moment condition is:</p>
<disp-formula id="EQ7"><mml:math id="M12"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>&#x1D53C;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003C8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>D</mml:mi><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>X</mml:mtext></mml:mstyle><mml:mo>;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:math><label>(7)</label></disp-formula>
<p>where &#x003C8; is a vector of score functions. The score can take many forms, such as a maximum likelihood score function in <xref ref-type="disp-formula" rid="EQ7">Equation 7</xref> or a GMM moment function; &#x003B7;<sub>0</sub> denotes the true value of the nuisance parameters included in <italic>g</italic><sub>0</sub> and <italic>m</italic><sub>0</sub>, with &#x003B7;<sub>0</sub>&#x02208;&#x003C4;, where &#x003C4; is the nuisance parameter space. In addition, the score function must satisfy a further condition: its Gateaux derivative <italic>D</italic><sub><italic>r</italic></sub>[&#x003B7;&#x02212;&#x003B7;<sub>0</sub>] must exist and be insensitive to changes in the nuisance parameter &#x003B7; in any direction. The Gateaux derivative is:</p>
<disp-formula id="EQ8"><mml:math id="M13"><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo>[</mml:mo><mml:mrow><mml:mi>&#x003B7;</mml:mi><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mi>&#x003B7;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow><mml:mo>]</mml:mo><mml:mo>:</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mo>&#x02202;</mml:mo><mml:mi>r</mml:mi></mml:msub><mml:mo>{</mml:mo><mml:mrow><mml:mi mathvariant='double-struck'>E</mml:mi><mml:mo stretchy='false'>[</mml:mo><mml:mi>&#x003C8;</mml:mi><mml:mo>(</mml:mo><mml:mrow><mml:mi>D</mml:mi><mml:mo>,</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>X</mml:mi></mml:mstyle><mml:mo>;</mml:mo><mml:msub><mml:mi>&#x003B8;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x003B7;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:mi>r</mml:mi><mml:mo>(</mml:mo><mml:mrow><mml:mi>&#x003B7;</mml:mi><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mi>&#x003B7;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>]</mml:mo><mml:mo>&#x0007D;</mml:mo><mml:mo>,</mml:mo><mml:mtext>&#x02004;</mml:mtext><mml:mi>&#x003B7;</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math><label>(8)</label></disp-formula>
<p>where <italic>r</italic>&#x02208;[0, 1). Under the Neyman orthogonality condition, <italic>D</italic><sub><italic>r</italic></sub>[&#x003B7;&#x02212;&#x003B7;<sub>0</sub>] exists for all <italic>r</italic>&#x02208;[0, 1) and all &#x003B7;&#x02208;&#x003C4;, and at <italic>r</italic> &#x0003D; 0,</p>
<disp-formula id="EQ9"><mml:math id="M14"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x02202;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msub><mml:mi>&#x1D53C;</mml:mi><mml:mi>&#x003C8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>D</mml:mi><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>X</mml:mtext></mml:mstyle><mml:mo>;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>&#x003B7;</mml:mi><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:math><label>(9)</label></disp-formula>
<p>The two conditions (<xref ref-type="disp-formula" rid="EQ8">Equations 8</xref>, <xref ref-type="disp-formula" rid="EQ9">9</xref>) together constitute the orthogonality of DML estimation.</p>
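<p>A quick numerical check of this orthogonality is possible by simulation. The following Python sketch (all functional forms, perturbation directions, and parameter values are invented for the illustration) evaluates the sample analogue of the map <italic>r</italic> &#x021A6; &#x1D53C;[&#x003C8;(&#x000B7;; &#x003B8;<sub>0</sub>, &#x003B7;<sub>0</sub> &#x0002B; <italic>r</italic>(&#x003B7; &#x02212; &#x003B7;<sub>0</sub>))] for the partially linear score and confirms that its derivative at <italic>r</italic> &#x0003D; 0 is near zero, while a non-orthogonal score fails the same check:</p>

```python
import math
import random

random.seed(0)
n = 100_000
theta0 = 1.5

# Invented nuisance functions of a partially linear model
m0 = lambda x: math.sin(x)   # E[D | X]
g0 = lambda x: x * x         # additive confounding term

# Simulate D = m0(X) + V and Y = D*theta0 + g0(X) + U
X = [random.gauss(0, 1) for _ in range(n)]
V = [random.gauss(0, 1) for _ in range(n)]
U = [random.gauss(0, 1) for _ in range(n)]
D = [m0(x) + v for x, v in zip(X, V)]
Y = [d * theta0 + g0(x) + u for d, x, u in zip(D, X, U)]

# Fixed perturbation directions (eta - eta0) for the two nuisances
dg = lambda x: x
dm = lambda x: 0.5 * x

def mean_score(r):
    """Sample analogue of E[psi] at eta = eta0 + r*(eta - eta0), orthogonal score."""
    return sum((y - d * theta0 - g0(x) - r * dg(x)) * (d - m0(x) - r * dm(x))
               for x, d, y in zip(X, D, Y)) / n

def mean_score_naive(r):
    """Same, for the non-orthogonal score (Y - D*theta0 - g(X)) * D."""
    return sum((y - d * theta0 - g0(x) - r * dg(x)) * d
               for x, d, y in zip(X, D, Y)) / n

h = 0.05  # central finite difference at r = 0
orthogonal_deriv = (mean_score(h) - mean_score(-h)) / (2 * h)
naive_deriv = (mean_score_naive(h) - mean_score_naive(-h)) / (2 * h)
print(f"orthogonal score derivative: {orthogonal_deriv:+.4f}")  # near 0
print(f"naive score derivative:      {naive_deriv:+.4f}")       # clearly nonzero
```

<p>Only the orthogonal score's directional derivative is zero up to Monte Carlo error, which is exactly the content of Equation 9; the naive score's derivative depends on the direction of the nuisance perturbation.</p>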
<p><xref ref-type="disp-formula" rid="EQ10">Equation 10</xref> shows how DML makes the estimator of the true &#x003B8;<sub>0</sub> consistent in a partially linear model. Let <inline-formula><mml:math id="M15"><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> denote the consistent DML estimator. The inference on &#x003B8;<sub>0</sub> relies on the score function:</p>
<disp-formula id="EQ10"><mml:math id="M16"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>&#x003C8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>D</mml:mi><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>X</mml:mtext></mml:mstyle><mml:mo>;</mml:mo><mml:mi>&#x003B8;</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>:</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Y</mml:mi><mml:mo>-</mml:mo><mml:mi>D</mml:mi><mml:mi>&#x003B8;</mml:mi><mml:mo>-</mml:mo><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>X</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>D</mml:mi><mml:mo>-</mml:mo><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>X</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math><label>(10)</label></disp-formula>
<p>which satisfies the moment condition <italic>E</italic>(<italic>VU</italic>) &#x0003D; 0 and the orthogonality condition. After some algebra, the DML estimator <inline-formula><mml:math id="M17"><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is given by,</p>
<disp-formula id="EQ11"><mml:math id="M18"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>I</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>I</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>Y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math><label>(11)</label></disp-formula>
<p>where <inline-formula><mml:math id="M19"><mml:mover accent="false"><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:math></inline-formula> is the ML-estimated residual <inline-formula><mml:math id="M20"><mml:mover accent="false"><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mi>D</mml:mi><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. Note that in <xref ref-type="disp-formula" rid="EQ11">Equation 11</xref>, <inline-formula><mml:math id="M21"><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula><mml:math id="M22"><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> are obtained from the auxiliary sample, while <inline-formula><mml:math id="M23"><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is obtained from the main sample. The scaled decomposed estimation error is then,</p>
<disp-formula id="EQ12"><mml:math id="M24"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtable style="text-align:axis;" equalrows="false" columnlines="none none none none none none none none none" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:msqrt><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msqrt><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x1D53C;</mml:mi><mml:msup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:msub><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>I</mml:mi></mml:mrow></mml:msub></mml:mstyle><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>U</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>&#x0002B;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x1D53C;</mml:mi><mml:msup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:msub><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>I</mml:mi></mml:mrow></mml:msub></mml:mstyle><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr></mml:mtr></mml:mtable></mml:mtd></mml:mtr></mml:mtable></mml:math><label>(12)</label></disp-formula>
<p>Now, this equation can be separated into three parts:</p>
<disp-formula id="EQ13"><mml:math id="M25"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:msup><mml:mi>a</mml:mi><mml:mo>&#x02217;</mml:mo></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi mathvariant='double-struck'>E</mml:mi><mml:msup><mml:mi>V</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:msqrt><mml:mi>n</mml:mi></mml:msqrt></mml:mrow></mml:mfrac><mml:mstyle displaystyle='true'><mml:msub><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>I</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:msub><mml:mi>V</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msub><mml:mi>U</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mstyle></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msup><mml:mi>b</mml:mi><mml:mo>&#x02217;</mml:mo></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi mathvariant='double-struck'>E</mml:mi><mml:msup><mml:mi>V</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:msqrt><mml:mi>n</mml:mi></mml:msqrt></mml:mrow></mml:mfrac><mml:mstyle 
displaystyle='true'><mml:msub><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>I</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mover accent='true'><mml:mi>m</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover><mml:mn>0</mml:mn></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mstyle></mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>g</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mover accent='true'><mml:mrow><mml:msub><mml:mi>g</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow><mml:mo 
stretchy='true'>&#x0005E;</mml:mo></mml:mover><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:msup><mml:mi>c</mml:mi><mml:mo>&#x02217;</mml:mo></mml:msup><mml:mo>=</mml:mo><mml:msub><mml:mi>o</mml:mi><mml:mi>P</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math><label>(13)</label></disp-formula>
<p>The first term from <xref ref-type="disp-formula" rid="EQ13">Equation 13</xref> converges to a normal distribution under mild conditions, <italic>a</italic><sup>&#x0002A;</sup>&#x021DD;<italic>N</italic>(0, &#x003A3;). The second term <italic>b</italic><sup>&#x0002A;</sup> is determined by the estimation errors of both <italic>m</italic><sub>0</sub>(<bold>X</bold>) and <italic>g</italic><sub>0</sub>(<bold>X</bold>), and it contains regularization bias from both. Its convergence rate depends on the specific machine learning methods used and is usually slower than the square-root rate.</p>
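<p>To see why the product form of <italic>b</italic><sup>&#x0002A;</sup> is helpful, consider an illustrative calculation (the rate 0.3 is invented for this example and is not taken from the cited results). If each nuisance estimator converges at rate <italic>n</italic><sup>&#x02212;0.3</sup>, a term carrying a single estimation error diverges after scaling by &#x0221A;<italic>n</italic>, whereas the product of two such errors still vanishes:</p>

```latex
% Illustrative rates only: \varphi_m = \varphi_g = 0.3, so \varphi_m + \varphi_g = 0.6 > 1/2.
\underbrace{\sqrt{n}\cdot n^{-\varphi_g} = n^{0.5-0.3} = n^{0.2} \longrightarrow \infty}_{\text{single error: diverges}}
\qquad
\underbrace{\sqrt{n}\cdot n^{-(\varphi_m+\varphi_g)} = n^{0.5-0.6} = n^{-0.1} \longrightarrow 0}_{\text{product of errors: vanishes}}
```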
<p>The convergence rate of Random Forest estimators depends on its strong features: the rate is of order <inline-formula><mml:math id="M26"><mml:msup><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mfrac><mml:mrow><mml:mo>-</mml:mo><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>75</mml:mn></mml:mrow><mml:mrow><mml:mi>S</mml:mi><mml:mi>log</mml:mi><mml:mn>2</mml:mn><mml:mo>&#x0002B;</mml:mo><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>75</mml:mn></mml:mrow></mml:mfrac></mml:mrow></mml:msup></mml:math></inline-formula>, where <italic>S</italic> is a subset of features (<xref ref-type="bibr" rid="B6">Biau, 2012</xref>). According to the convergence properties of least squares regression under the <italic>L</italic><sub>2</sub> norm (<xref ref-type="bibr" rid="B10">Chen, 2007</xref>), the convergence rate is <inline-formula><mml:math id="M27"><mml:msup><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mfrac><mml:mrow><mml:mo>-</mml:mo><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mi>p</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:mfrac></mml:mrow></mml:msup></mml:math></inline-formula>, where <italic>d</italic> is the dimension of the raw explanatory variables and <italic>p</italic> is the assumed degree of smoothness of the conditional expectation function (e.g., the number of derivatives). We can control the bound by choosing proper <italic>p</italic>&#x00027;s for any given <italic>d</italic>. Although <italic>m</italic><sub>0</sub>(<bold>X</bold>) and <italic>g</italic><sub>0</sub>(<bold>X</bold>) are each estimated at a slower convergence rate, the product of the two estimation errors keeps the whole term within a vanishing upper bound. 
This upper bound is <inline-formula><mml:math id="M28"><mml:msqrt><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msqrt><mml:msup><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003C6;</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C6;</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula>, where &#x003C6;<sub><italic>m</italic></sub> is the convergence rate of <inline-formula><mml:math id="M29"><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> to <italic>m</italic><sub>0</sub> and &#x003C6;<sub><italic>g</italic></sub> is the convergence rate of <inline-formula><mml:math id="M30"><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> to <italic>g</italic><sub>0</sub>. Thus, <italic>b</italic><sup>&#x0002A;</sup> vanishes eventually if &#x003C6;<sub><italic>m</italic></sub>&#x0002B;&#x003C6;<sub><italic>g</italic></sub>&#x0003E;1/2<xref ref-type="fn" rid="fn0005"><sup>3</sup></xref>. 
Another requirement for <inline-formula><mml:math id="M31"><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> to be consistent is to control the remainder terms collected in <italic>c</italic><sup>&#x0002A;</sup> and ensure that <inline-formula><mml:math id="M32"><mml:msup><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. In the partially linear model, terms like:</p>
<disp-formula id="E14"><mml:math id="M33"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>I</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>X</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>are included in <italic>c</italic><sup>&#x0002A;</sup>. Without sample splitting, the model error terms <italic>V</italic><sub><italic>i</italic></sub> and estimation errors <inline-formula><mml:math id="M34"><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>X</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>X</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> are generally related. The reason is that in estimating, <inline-formula><mml:math id="M35"><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> information contained in observation i has already been used, whereas <italic>V</italic><sub><italic>i</italic></sub> also has information from observation <italic>I</italic>; therefore, the relation between them will cause poor performance of <italic>c</italic><sup>&#x0002A;</sup>. 
Conditional on the auxiliary sample and with &#x1D53C;(<italic>V</italic><sub><italic>i</italic></sub>&#x02223;<bold>X</bold><sub><italic>i</italic></sub>) &#x0003D; 0, the term above has a mean of zero and a variance of order <inline-formula><mml:math id="M36"><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mi>n</mml:mi></mml:mfrac><mml:mstyle displaystyle='true'><mml:msub><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>I</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mover accent='true'><mml:mi>g</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover><mml:mn>0</mml:mn></mml:msub><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>X</mml:mi></mml:mstyle><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mi>g</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>X</mml:mi></mml:mstyle><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mstyle></mml:mrow></mml:math></inline-formula>, which converges in probability to zero.</p>
<p>The score has to satisfy both a moment condition and an orthogonality condition to overcome the regularization bias; this stricter requirement on the score function distinguishes DML from traditional methods. Sample splitting is equally important for removing the bias caused by overfitting: the model requires estimates of nuisance parameters such as <italic>g</italic><sub>0</sub>(&#x000B7;) and <italic>m</italic><sub>0</sub>(&#x000B7;) as well as the causal parameter, and these are not estimated simultaneously. Different parts of the data are therefore used to estimate the different components.</p>
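<p>The role of sample splitting can be illustrated with a minimal cross-fitting sketch for the partially linear model. The Python code below is illustrative only (synthetic data; the learners and tuning choices are our assumptions, not the specification used in this study): nuisance functions are fit on the auxiliary folds and evaluated only on the held-out fold, and the Neyman-orthogonal residual-on-residual moment is then solved for the treatment effect.</p>

```python
# Minimal cross-fitting sketch for the partially linear model
#   Y = theta0*D + g0(X) + U,   D = m0(X) + V   (synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n, p = 1000, 5
X = rng.normal(size=(n, p))
g0 = np.sin(X[:, 0]) + X[:, 1] ** 2      # nuisance part of E[Y | X]
m0 = 0.5 * X[:, 0]                       # nuisance E[D | X]
D = m0 + rng.normal(size=n)
theta0 = 0.05                            # true treatment effect
Y = theta0 * D + g0 + rng.normal(size=n)

def dml_plm(Y, D, X, K=5):
    """Cross-fitted, Neyman-orthogonal (residual-on-residual) estimator."""
    res_Y, res_D = np.empty_like(Y), np.empty_like(D)
    for train, test in KFold(K, shuffle=True, random_state=1).split(X):
        # Nuisances are estimated on the auxiliary folds only, then
        # evaluated on the held-out fold: this breaks the dependence
        # between V_i and the estimation error of g0-hat at observation i.
        lY = RandomForestRegressor(n_estimators=100, random_state=1).fit(X[train], Y[train])
        lD = RandomForestRegressor(n_estimators=100, random_state=1).fit(X[train], D[train])
        res_Y[test] = Y[test] - lY.predict(X[test])
        res_D[test] = D[test] - lD.predict(X[test])
    # Solve the orthogonal moment 1/n * sum res_D*(res_Y - theta*res_D) = 0.
    return (res_D @ res_Y) / (res_D @ res_D)

theta_hat = dml_plm(Y, D, X)
```

<p>Averaging the fold-wise moment conditions in this way is exactly the K-fold construction formalized below.</p>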
<p>In the estimation of both models above, the sample of (<italic>D, X</italic>) needs to be i.i.d. (independent and identically distributed), and sample splitting plays an important role. First, divide the sample randomly into <italic>K</italic> folds, such that each subsample <italic>I</italic><sub><italic>k</italic></sub> contains <italic>N</italic>/<italic>K</italic> observations, where <italic>k</italic> &#x02208; {1, &#x02026;, <italic>K</italic>}. The final DML estimator <inline-formula><mml:math id="M37"><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> solves:</p>
<disp-formula id="E15"><mml:math id="M38"><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mtext>K</mml:mtext></mml:mfrac><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mtext>k</mml:mtext><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mtext>K</mml:mtext></mml:munderover><mml:mrow><mml:msub><mml:mi mathvariant='double-struck'>E</mml:mi><mml:mrow><mml:mtext>nk</mml:mtext></mml:mrow></mml:msub></mml:mrow></mml:mstyle><mml:mo>[</mml:mo><mml:mrow><mml:mi>&#x003C8;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>D</mml:mi><mml:mo>,</mml:mo><mml:mi>X</mml:mi><mml:mo>;</mml:mo><mml:msub><mml:mover accent='true'><mml:mi>&#x003B8;</mml:mi><mml:mo>&#x002DC;</mml:mo></mml:mover><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover accent='true'><mml:mi>&#x003B7;</mml:mi><mml:mo stretchy='true'>&#x0005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow></mml:math></disp-formula>
<p>where &#x003C8;(&#x000B7;) is the Neyman-orthogonal score function; <inline-formula><mml:math id="M39"><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents the estimator of the nuisance parameters associated with <italic>g</italic><sub>0</sub>(&#x000B7;) and <italic>m</italic><sub>0</sub>(&#x000B7;); &#x1D53C;<sub><italic>nk</italic></sub> is the empirical expectation over the <italic>k</italic>-th fold of the data. For each subsample <italic>I</italic><sub><italic>k</italic></sub>, its corresponding auxiliary sample is used to construct an ML estimator of <inline-formula><mml:math id="M40"><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>0</mml:mn><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>.</p></sec>
<sec id="s3">
<label>3</label>
<title>Data</title>
<p>Our analysis compares our results with those of a previous study on the effects of curriculum materials on student achievement based on California data (<xref ref-type="bibr" rid="B17">Koedel et al., 2017</xref>). In this section, we briefly introduce the background, data, and results of that study, on which our work is based.</p>
<p>In California, the curriculum adoption process is partially centralized: the state issues a list of textbooks for a particular subject in a given year. Each district has the option to adopt any textbook from the list or choose not to adopt at all (i.e., not use textbooks on the list). In math, the adoption process typically moves in sync with the state&#x00027;s schedule. The authors of the previous study focused on elementary math textbooks adopted in California in the fall of 2008 and 2009. The data were derived from schools&#x00027; 2013 School Accountability Report Cards (SARC)<xref ref-type="fn" rid="fn0006"><sup>4</sup></xref>.</p>
<p><xref ref-type="table" rid="T2">Table 2</xref> provides an overview of the descriptive statistics for the four math textbooks examined by Koedel et al. The test scores presented in this table are standardized averages at both the school and district levels. Specifically, the &#x0201C;California Math&#x0201D; column identifies schools that adopted California Math, with characteristics from either 2007 or 2008 and at least one Grade 3 test score from 2009 to 2013<xref ref-type="fn" rid="fn0007"><sup>5</sup></xref>. The &#x0201C;Composite Alternative&#x0201D; column, on the other hand, displays the average values for schools that adopted the three other textbooks<xref ref-type="fn" rid="fn0008"><sup>6</sup></xref>. It is important to note that the outcome variable <italic>Y</italic> in this study is the Grade 3 math score, while all other characteristic variables, except for the Grade 3 ELA score, are represented by the vector <bold>X</bold>. Additionally, only the school-level data are used in this study.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Descriptive statistics for California math and composite alternative.</p></caption>
<table frame="box" rules="all">
<thead>
<tr>
<th valign="top" align="left"><bold>Variable</bold></th>
<th valign="top" align="center"><bold>California math</bold></th>
<th valign="top" align="center"><bold>Composite alternative</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" colspan="3"><bold>School outcomes</bold></td>
</tr>
<tr>
<td valign="top" align="left">Preadoption Grade 3 math score</td>
<td valign="top" align="center">0.06</td>
<td valign="top" align="center">&#x02212;0.07</td>
</tr>
<tr>
<td valign="top" align="left">Preadoption Grade 3 ELA score</td>
<td valign="top" align="center">0.07</td>
<td valign="top" align="center">&#x02212;0.08</td>
</tr>
<tr>
<td valign="top" align="left" colspan="3"><bold>School characteristics</bold></td>
</tr>
<tr>
<td valign="top" align="left">%Female</td>
<td valign="top" align="center">48.9</td>
<td valign="top" align="center">48.6</td>
</tr>
<tr>
<td valign="top" align="left">%Economically disadvantaged</td>
<td valign="top" align="center">56</td>
<td valign="top" align="center">57.3</td>
</tr>
<tr>
<td valign="top" align="left">%English learner</td>
<td valign="top" align="center">28</td>
<td valign="top" align="center">30.1</td>
</tr>
<tr>
<td valign="top" align="left">%White</td>
<td valign="top" align="center">29.9</td>
<td valign="top" align="center">29.3</td>
</tr>
<tr>
<td valign="top" align="left">%Black</td>
<td valign="top" align="center">6.3</td>
<td valign="top" align="center">7.7</td>
</tr>
<tr>
<td valign="top" align="left">%Asian</td>
<td valign="top" align="center">7.2</td>
<td valign="top" align="center">7.7</td>
</tr>
<tr>
<td valign="top" align="left">%Other ethnicities</td>
<td valign="top" align="center">56.6</td>
<td valign="top" align="center">50.0</td>
</tr>
<tr>
<td valign="top" align="left">Enrollment</td>
<td valign="top" align="center">429.5</td>
<td valign="top" align="center">401.5</td>
</tr>
<tr>
<td valign="top" align="left">2008 adopter</td>
<td valign="top" align="center">53.7</td>
<td valign="top" align="center">48.6</td>
</tr>
<tr>
<td valign="top" align="left" colspan="3"><bold>School-area characteristics (census)</bold></td>
</tr>
<tr>
<td valign="top" align="left">Median household income (log)</td>
<td valign="top" align="center">10.9</td>
<td valign="top" align="center">10.8</td>
</tr>
<tr>
<td valign="top" align="left">Share low education</td>
<td valign="top" align="center">19.3</td>
<td valign="top" align="center">20.0</td>
</tr>
<tr>
<td valign="top" align="left">Share missing census data</td>
<td valign="top" align="center">1.2</td>
<td valign="top" align="center">1.7</td>
</tr>
<tr>
<td valign="top" align="left" colspan="3"><bold>District outcomes</bold></td>
</tr>
<tr>
<td valign="top" align="left">Preadoption Grade 3 math score</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">&#x02212;0.01</td>
</tr>
<tr>
<td valign="top" align="left">Preadoption Grade 3 ELA score</td>
<td valign="top" align="center">&#x02212;0.12</td>
<td valign="top" align="center">&#x02212;0.06</td>
</tr>
<tr>
<td valign="top" align="left" colspan="3"><bold>District characteristics</bold></td>
</tr>
<tr>
<td valign="top" align="left">Enrollment</td>
<td valign="top" align="center">6075.5</td>
<td valign="top" align="center">16022.9</td>
</tr>
<tr>
<td valign="top" align="left">n(Schools)</td>
<td valign="top" align="center">602</td>
<td valign="top" align="center">1276</td>
</tr>
<tr>
<td valign="top" align="left">n(Districts)</td>
<td valign="top" align="center">92</td>
<td valign="top" align="center">224</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>This is summarized from <xref ref-type="bibr" rid="B17">Koedel et al. (2017)</xref> <xref ref-type="table" rid="T1">Table 1</xref>, showing all variables we use. The second column is a weighted average of columns 4, 6, and 7 from <xref ref-type="bibr" rid="B17">Koedel et al. (2017)</xref> <xref ref-type="table" rid="T1">Table 1</xref>.</p>
</table-wrap-foot>
</table-wrap>
<p>In the study, the researchers used traditional techniques to estimate the effects of the four math textbooks on student achievement and pointed out the limitations of their modeling approach. They first calculated the propensity score for each observation using a probit model, and based on the propensity score, they selected a subset of the sample with common support. This meant that only observations from the same range of propensity scores were included in the modeling samples for both the treated and control groups. The researchers then applied kernel matching, restricted OLS, and residualized matching techniques separately to estimate the treatment effect of the four textbooks on student achievement.</p>
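<p>The propensity-score and common-support step described above can be sketched as follows. This is a schematic illustration with synthetic data: we use a logistic model as a stand-in for the probit specification of the original study, and all variable names are hypothetical.</p>

```python
# Illustrative propensity-score / common-support step (logistic stand-in
# for the probit used in the original study; data here are synthetic).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 4))                          # school characteristics
T = (X[:, 0] + rng.normal(size=n) > 0).astype(int)   # 1 = California Math adopter

# Estimated probability of adopting the treatment textbook given X.
ps = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]

# Keep only observations whose propensity score lies in the overlap of
# the treated and control score ranges (the common-support restriction).
lo = max(ps[T == 1].min(), ps[T == 0].min())
hi = min(ps[T == 1].max(), ps[T == 0].max())
support = (ps >= lo) & (ps <= hi)
X_cs, T_cs, ps_cs = X[support], T[support], ps[support]
```

<p>Matching or regression estimators are then run on the trimmed sample only, so treated and control units are compared over the same range of scores.</p>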
<p>Among the four commonly used elementary math textbooks studied, the California Math textbook published by Houghton Mifflin had the highest treatment effect and outperformed the other textbooks. The authors also noted that their restricted OLS model, which imposed a linear form, produced more statistically precise results but introduced bias into their estimation.</p>
<p>Upon discovering that the California Math textbook worked best among the four textbooks studied, the researchers conducted a quasi-experimental study to compare California Math with the composite alternative. They designated adopters of California Math as the treatment group and all other adopters of the three alternative textbooks as the control group.</p>
<p><xref ref-type="table" rid="T3">Table 3</xref> presents the treatment effects observed over 4 years following the adoption of the textbooks. For example, the Year 1 results provide a comparison between the academic performance of students who used the newly adopted textbooks solely in Grade 3. The Year 2 results, on the other hand, compare the performance of students who used the newly adopted textbooks in Grades 2 and 3.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Estimated effects of California math on grade 3 mathematics achievement relative to the composite alternative.</p></caption>
<table frame="box" rules="all">
<thead>
<tr>
<th valign="top" align="left"><bold>Variable</bold></th>
<th valign="top" align="center"><bold>Year 1</bold></th>
<th valign="top" align="center"><bold>Year 2</bold></th>
<th valign="top" align="center"><bold>Year 3</bold></th>
<th valign="top" align="center"><bold>Year 4</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" colspan="5"><bold>Treatment: California math; Control: composite alternative</bold></td>
</tr>
<tr>
<td valign="top" align="left">Treatment effect: Kernel matching</td>
<td valign="top" align="center">0.063 (0.054)</td>
<td valign="top" align="center">0.083<break/> (0.051)</td>
<td valign="top" align="center">0.061 (0.059)</td>
<td valign="top" align="center">0.070<break/> (0.059)</td>
</tr>
<tr>
<td valign="top" align="left">Treatment effect: Restricted OLS</td>
<td valign="top" align="center">0.050<sup>&#x0002A;&#x0002A;</sup> (0.019)</td>
<td valign="top" align="center">0.064<sup>&#x0002A;&#x0002A;</sup><break/> (0.023)</td>
<td valign="top" align="center">0.049<sup>&#x0002A;&#x0002A;</sup> (0.023)</td>
<td valign="top" align="center">0.058<sup>&#x0002A;&#x0002A;</sup><break/> (0.023)</td>
</tr>
<tr>
<td valign="top" align="left">Treatment effect: residualized matching</td>
<td valign="top" align="center">0.050<sup>&#x0002A;&#x0002A;</sup> (0.020)</td>
<td valign="top" align="center">0.065<sup>&#x0002A;&#x0002A;</sup><break/> (0.024)</td>
<td valign="top" align="center">0.052<sup>&#x0002A;&#x0002A;</sup> (0.024)</td>
<td valign="top" align="center">0.060<sup>&#x0002A;&#x0002A;</sup><break/> (0.026)</td>
</tr>
<tr>
<td valign="top" align="left">No. of districts/schools (California Math)</td>
<td valign="top" align="center">92/597</td>
<td valign="top" align="center">89/588</td>
<td valign="top" align="center">91/595</td>
<td valign="top" align="center">90/590</td>
</tr>
<tr>
<td valign="top" align="left">No. of districts/schools (composite alternative)</td>
<td valign="top" align="center">213/1143</td>
<td valign="top" align="center">214/1145</td>
<td valign="top" align="center">216/1146</td>
<td valign="top" align="center">213/1144</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p><sup>&#x0002A;&#x0002A;</sup><italic>p</italic> &#x0003C; 0.05.</p>
</table-wrap-foot>
</table-wrap>
<p>To ensure that the estimates have causal interpretations, the authors conducted falsification tests to justify the conditional independence assumption. They estimated two types of models. In the first model, the authors estimated the effects of the pre-adoption curriculum on students by using test scores from previous adoption years (i.e., 3 to 6 years before adoption) instead of scores after adoption. Schools adopting California Math were treated as the treatment group, and the composite alternative served as the control group. The results showed no effect on student scores for all pre-adoption years.</p>
<p>In the second model, the authors used Grade 3 English test scores as the outcome variable and estimated effects for all pre-adoption years, similar to the first model, and all 4 years after adoption, similar to the main model. They also found no effect. Therefore, the authors argue that the conditional independence assumption is satisfied.</p>
<p>The authors noted their surprise upon discovering that the treatment effects did not increase over time. They provided several potential explanations for this unexpected finding, including a moderate dosage effect, insufficient exposure time in earlier grades to have a significant impact on grade three test scores, and variations in the quality of curriculum materials across different grades.</p>
<p>Additionally, the authors acknowledged that the limitations of linear models may have contributed to the unexpected results. It is difficult to believe that linear models can accurately capture the true conditional expectation function (CEF) or its non-parametric/generalized approximation. Furthermore, the California textbook adoption dataset contains a large number of variables that are not all discrete, making it impossible to construct a saturated model.</p>
<p>To address these concerns, we utilize the newly developed DML method to estimate textbook causal effects. DML allows for non-parametric estimation of the CEF, can handle both discrete and continuous variables, and is capable of dealing with high-dimensional nuisance parameters. Unlike traditional techniques, such as separately calculating propensity scores and then using OLS regression or kernel matching, DML integrates all variables into a non-parametric model to obtain consistent treatment effect estimators under the conditional independence assumption. In the next section, we demonstrate how the use of DML leads to more accurate estimations of treatment effects compared with the original results.</p></sec>
<sec sec-type="results" id="s4">
<label>4</label>
<title>Results</title>
<p><xref ref-type="table" rid="T4">Tables 4</xref>&#x02013;<xref ref-type="table" rid="T7">7</xref> present the results of the DML estimations of the student achievement effect of using California Math in each of the 4 years after adoption. Each table shows the results for a different post-adoption year, and the columns display the six machine learning methods used to obtain the <italic>g</italic><sub>0</sub> and <italic>m</italic><sub>0</sub> estimators, plus a &#x0201C;Best&#x0201D; column.</p>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Effect of California math after 1 year of adoption.</p></caption>
<table frame="box" rules="all">
<thead>
<tr>
<th valign="top" align="left"><bold>Estimates</bold></th>
<th valign="top" align="center"><bold>Lasso</bold></th>
<th valign="top" align="center"><bold>Reg.Trees</bold></th>
<th valign="top" align="center"><bold>Forest</bold></th>
<th valign="top" align="center"><bold>Boosting</bold></th>
<th valign="top" align="center"><bold>Nnet</bold></th>
<th valign="top" align="center"><bold>Ensemble</bold></th>
<th valign="top" align="center"><bold>Best</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" colspan="8"><bold>A. Interactive model</bold></td>
</tr>
<tr>
<td valign="top" align="left">ATE</td>
<td valign="top" align="center">0.051</td>
<td valign="top" align="center">0.040</td>
<td valign="top" align="center">0.039</td>
<td valign="top" align="center">0.047</td>
<td valign="top" align="center">0.094</td>
<td valign="top" align="center">0.065</td>
<td valign="top" align="center">0.084</td>
</tr>
<tr>
<td valign="top" align="left">se(median)</td>
<td valign="top" align="center">0.015</td>
<td valign="top" align="center">0.055</td>
<td valign="top" align="center">0.019</td>
<td valign="top" align="center">0.022</td>
<td valign="top" align="center">0.040</td>
<td valign="top" align="center">0.025</td>
<td valign="top" align="center">0.034</td>
</tr>
<tr>
<td valign="top" align="left">se</td>
<td valign="top" align="center">0.010</td>
<td valign="top" align="center">0.026</td>
<td valign="top" align="center">0.008</td>
<td valign="top" align="center">0.015</td>
<td valign="top" align="center">0.017</td>
<td valign="top" align="center">0.008</td>
<td valign="top" align="center">0.020</td>
</tr>
<tr>
<td valign="top" align="left">clustered.se</td>
<td valign="top" align="center">0.031</td>
<td valign="top" align="center">0.034</td>
<td valign="top" align="center">0.016</td>
<td valign="top" align="center">0.019</td>
<td valign="top" align="center">0.045</td>
<td valign="top" align="center">0.015</td>
<td valign="top" align="center">0.016</td>
</tr>
<tr>
<td valign="top" align="left" colspan="8"><bold>B. Partially linear model</bold></td>
</tr>
<tr>
<td valign="top" align="left">ATE</td>
<td valign="top" align="center">0.046</td>
<td valign="top" align="center">0.024</td>
<td valign="top" align="center">0.047</td>
<td valign="top" align="center">0.047</td>
<td valign="top" align="center">0.042</td>
<td valign="top" align="center">0.029</td>
<td valign="top" align="center">0.053</td>
</tr>
<tr>
<td valign="top" align="left">se(median)</td>
<td valign="top" align="center">0.011</td>
<td valign="top" align="center">0.026</td>
<td valign="top" align="center">0.019</td>
<td valign="top" align="center">0.014</td>
<td valign="top" align="center">0.014</td>
<td valign="top" align="center">0.018</td>
<td valign="top" align="center">0.020</td>
</tr>
<tr>
<td valign="top" align="left">se</td>
<td valign="top" align="center">0.010</td>
<td valign="top" align="center">0.017</td>
<td valign="top" align="center">0.018</td>
<td valign="top" align="center">0.012</td>
<td valign="top" align="center">0.012</td>
<td valign="top" align="center">0.018</td>
<td valign="top" align="center">0.018</td>
</tr></tbody>
</table>
</table-wrap>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>Effect of California math after 2 years of adoption.</p></caption>
<table frame="box" rules="all">
<thead>
<tr>
<th valign="top" align="left"><bold>Estimates</bold></th>
<th valign="top" align="center"><bold>Lasso</bold></th>
<th valign="top" align="center"><bold>Reg.Trees</bold></th>
<th valign="top" align="center"><bold>Forest</bold></th>
<th valign="top" align="center"><bold>Boosting</bold></th>
<th valign="top" align="center"><bold>Nnet</bold></th>
<th valign="top" align="center"><bold>Ensemble</bold></th>
<th valign="top" align="center"><bold>Best</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" colspan="8"><bold>A. Interactive model</bold></td>
</tr>
<tr>
<td valign="top" align="left">ATE</td>
<td valign="top" align="center">0.068</td>
<td valign="top" align="center">0.083</td>
<td valign="top" align="center">0.057</td>
<td valign="top" align="center">0.065</td>
<td valign="top" align="center">0.098</td>
<td valign="top" align="center">0.068</td>
<td valign="top" align="center">0.065</td>
</tr>
<tr>
<td valign="top" align="left">se(median)</td>
<td valign="top" align="center">0.016</td>
<td valign="top" align="center">0.032</td>
<td valign="top" align="center">0.015</td>
<td valign="top" align="center">0.017</td>
<td valign="top" align="center">0.028</td>
<td valign="top" align="center">0.013</td>
<td valign="top" align="center">0.036</td>
</tr>
<tr>
<td valign="top" align="left">se</td>
<td valign="top" align="center">0.014</td>
<td valign="top" align="center">0.026</td>
<td valign="top" align="center">0.009</td>
<td valign="top" align="center">0.015</td>
<td valign="top" align="center">0.021</td>
<td valign="top" align="center">0.009</td>
<td valign="top" align="center">0.023</td>
</tr>
<tr>
<td valign="top" align="left">clustered.se</td>
<td valign="top" align="center">0.026</td>
<td valign="top" align="center">0.027</td>
<td valign="top" align="center">0.018</td>
<td valign="top" align="center">0.019</td>
<td valign="top" align="center">0.039</td>
<td valign="top" align="center">0.019</td>
<td valign="top" align="center">0.018</td>
</tr>
<tr>
<td valign="top" align="left" colspan="8"><bold>B. Partially linear model</bold></td>
</tr>
<tr>
<td valign="top" align="left">ATE</td>
<td valign="top" align="center">0.059</td>
<td valign="top" align="center">0.062</td>
<td valign="top" align="center">0.074</td>
<td valign="top" align="center">0.076</td>
<td valign="top" align="center">0.057</td>
<td valign="top" align="center">0.081</td>
<td valign="top" align="center">0.080</td>
</tr>
<tr>
<td valign="top" align="left">se(median)</td>
<td valign="top" align="center">0.015</td>
<td valign="top" align="center">0.050</td>
<td valign="top" align="center">0.023</td>
<td valign="top" align="center">0.016</td>
<td valign="top" align="center">0.020</td>
<td valign="top" align="center">0.023</td>
<td valign="top" align="center">0.024</td>
</tr>
<tr>
<td valign="top" align="left">se</td>
<td valign="top" align="center">0.012</td>
<td valign="top" align="center">0.018</td>
<td valign="top" align="center">0.019</td>
<td valign="top" align="center">0.014</td>
<td valign="top" align="center">0.014</td>
<td valign="top" align="center">0.019</td>
<td valign="top" align="center">0.019</td>
</tr></tbody>
</table>
</table-wrap>
<table-wrap position="float" id="T6">
<label>Table 6</label>
<caption><p>Effect of California math after 3 years of adoption.</p></caption>
<table frame="box" rules="all">
<thead>
<tr>
<th valign="top" align="left"><bold>Estimates</bold></th>
<th valign="top" align="center"><bold>Lasso</bold></th>
<th valign="top" align="center"><bold>Reg.Trees</bold></th>
<th valign="top" align="center"><bold>Forest</bold></th>
<th valign="top" align="center"><bold>Boosting</bold></th>
<th valign="top" align="center"><bold>Nnet</bold></th>
<th valign="top" align="center"><bold>Ensemble</bold></th>
<th valign="top" align="center"><bold>Best</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" colspan="8"><bold>A. Interactive model</bold></td>
</tr>
<tr>
<td valign="top" align="left">ATE</td>
<td valign="top" align="center">0.011</td>
<td valign="top" align="center">0.042</td>
<td valign="top" align="center">0.041</td>
<td valign="top" align="center">0.061</td>
<td valign="top" align="center">0.077</td>
<td valign="top" align="center">0.057</td>
<td valign="top" align="center">0.049</td>
</tr>
<tr>
<td valign="top" align="left">se(median)</td>
<td valign="top" align="center">0.033</td>
<td valign="top" align="center">0.047</td>
<td valign="top" align="center">0.017</td>
<td valign="top" align="center">0.025</td>
<td valign="top" align="center">0.043</td>
<td valign="top" align="center">0.016</td>
<td valign="top" align="center">0.037</td>
</tr>
<tr>
<td valign="top" align="left">se</td>
<td valign="top" align="center">0.014</td>
<td valign="top" align="center">0.030</td>
<td valign="top" align="center">0.010</td>
<td valign="top" align="center">0.019</td>
<td valign="top" align="center">0.024</td>
<td valign="top" align="center">0.010</td>
<td valign="top" align="center">0.028</td>
</tr>
<tr>
<td valign="top" align="left">clustered.se</td>
<td valign="top" align="center">0.047</td>
<td valign="top" align="center">0.032</td>
<td valign="top" align="center">0.018</td>
<td valign="top" align="center">0.021</td>
<td valign="top" align="center">0.029</td>
<td valign="top" align="center">0.018</td>
<td valign="top" align="center">0.019</td>
</tr>
<tr>
<td valign="top" align="left" colspan="8"><bold>B. Partially linear model</bold></td>
</tr>
<tr>
<td valign="top" align="left">ATE</td>
<td valign="top" align="center">0.041</td>
<td valign="top" align="center">0.036</td>
<td valign="top" align="center">0.054</td>
<td valign="top" align="center">0.056</td>
<td valign="top" align="center">0.086</td>
<td valign="top" align="center">0.039</td>
<td valign="top" align="center">0.061</td>
</tr>
<tr>
<td valign="top" align="left">se(median)</td>
<td valign="top" align="center">0.018</td>
<td valign="top" align="center">0.025</td>
<td valign="top" align="center">0.023</td>
<td valign="top" align="center">0.018</td>
<td valign="top" align="center">0.018</td>
<td valign="top" align="center">0.023</td>
<td valign="top" align="center">0.023</td>
</tr>
<tr>
<td valign="top" align="left">se</td>
<td valign="top" align="center">0.013</td>
<td valign="top" align="center">0.019</td>
<td valign="top" align="center">0.020</td>
<td valign="top" align="center">0.016</td>
<td valign="top" align="center">0.014</td>
<td valign="top" align="center">0.021</td>
<td valign="top" align="center">0.021</td>
</tr></tbody>
</table>
</table-wrap>
<table-wrap position="float" id="T7">
<label>Table 7</label>
<caption><p>Effect of California math after 4 years of adoption.</p></caption>
<table frame="box" rules="all">
<thead>
<tr>
<th valign="top" align="left"><bold>Estimates</bold></th>
<th valign="top" align="center"><bold>Lasso</bold></th>
<th valign="top" align="center"><bold>Reg.Trees</bold></th>
<th valign="top" align="center"><bold>Forest</bold></th>
<th valign="top" align="center"><bold>Boosting</bold></th>
<th valign="top" align="center"><bold>Nnet</bold></th>
<th valign="top" align="center"><bold>Ensemble</bold></th>
<th valign="top" align="center"><bold>Best</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" colspan="8"><bold>A. Interactive model</bold></td>
</tr>
<tr>
<td valign="top" align="left">ATE</td>
<td valign="top" align="center">0.037</td>
<td valign="top" align="center">0.015</td>
<td valign="top" align="center">0.039</td>
<td valign="top" align="center">0.055</td>
<td valign="top" align="center">0.087</td>
<td valign="top" align="center">0.057</td>
<td valign="top" align="center">0.038</td>
</tr>
<tr>
<td valign="top" align="left">se(median)</td>
<td valign="top" align="center">0.024</td>
<td valign="top" align="center">0.049</td>
<td valign="top" align="center">0.018</td>
<td valign="top" align="center">0.028</td>
<td valign="top" align="center">0.059</td>
<td valign="top" align="center">0.017</td>
<td valign="top" align="center">0.046</td>
</tr>
<tr>
<td valign="top" align="left">se</td>
<td valign="top" align="center">0.014</td>
<td valign="top" align="center">0.033</td>
<td valign="top" align="center">0.009</td>
<td valign="top" align="center">0.017</td>
<td valign="top" align="center">0.024</td>
<td valign="top" align="center">0.010</td>
<td valign="top" align="center">0.034</td>
</tr>
<tr>
<td valign="top" align="left">clustered.se</td>
<td valign="top" align="center">0.044</td>
<td valign="top" align="center">0.035</td>
<td valign="top" align="center">0.019</td>
<td valign="top" align="center">0.020</td>
<td valign="top" align="center">0.031</td>
<td valign="top" align="center">0.018</td>
<td valign="top" align="center">0.019</td>
</tr>
<tr>
<td valign="top" align="left" colspan="8"><bold>B. Partially linear model</bold></td>
</tr>
<tr>
<td valign="top" align="left">ATE</td>
<td valign="top" align="center">0.051</td>
<td valign="top" align="center">0.028</td>
<td valign="top" align="center">0.047</td>
<td valign="top" align="center">0.057</td>
<td valign="top" align="center">0.050</td>
<td valign="top" align="center">0.058</td>
<td valign="top" align="center">0.058</td>
</tr>
<tr>
<td valign="top" align="left">se(median)</td>
<td valign="top" align="center">0.014</td>
<td valign="top" align="center">0.027</td>
<td valign="top" align="center">0.022</td>
<td valign="top" align="center">0.015</td>
<td valign="top" align="center">0.015</td>
<td valign="top" align="center">0.023</td>
<td valign="top" align="center">0.021</td>
</tr>
<tr>
<td valign="top" align="left">se</td>
<td valign="top" align="center">0.013</td>
<td valign="top" align="center">0.019</td>
<td valign="top" align="center">0.020</td>
<td valign="top" align="center">0.015</td>
<td valign="top" align="center">0.014</td>
<td valign="top" align="center">0.020</td>
<td valign="top" align="center">0.020</td>
</tr></tbody>
</table>
</table-wrap>
<p>For the &#x0201C;Lasso&#x0201D; method, all characteristic variables listed in <xref ref-type="table" rid="T1">Table 1</xref> are included along with sixth-order polynomials of school-level enrollment, sixth-order polynomials of district-level enrollment, and eighth-order polynomials of income, together with all their second-order interaction terms. For all other methods, the variables enter in levels, without interaction or polynomial terms.</p>
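A minimal sketch of one reading of the &#x0201C;Lasso&#x0201D; feature construction described above: polynomial terms for the two enrollment variables and income, stacked with the remaining covariates, followed by all second-order interactions. All data and variable names here are simulated and illustrative, not the study&#x00027;s actual covariates.

```python
# Illustrative feature construction for the "Lasso" specification (hypothetical
# data): 6th-order polynomials of two enrollment variables, 8th-order
# polynomials of income, plus all second-order interaction terms.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=(n, 4))                    # other characteristic variables
school_enroll = rng.uniform(100, 2000, size=n)
district_enroll = rng.uniform(1000, 50000, size=n)
income = rng.uniform(20, 120, size=n)

# Polynomial terms: 6th order for enrollment variables, 8th order for income
polys = np.column_stack(
    [school_enroll ** k for k in range(1, 7)]
    + [district_enroll ** k for k in range(1, 7)]
    + [income ** k for k in range(1, 9)]
)

X = np.column_stack([base, polys])                # 4 + 6 + 6 + 8 = 24 columns

# All second-order interaction terms of the assembled features:
# 24 original columns plus 24*23/2 = 276 pairwise products -> 300 columns
X_full = PolynomialFeatures(
    degree=2, interaction_only=True, include_bias=False
).fit_transform(X)
```

The resulting high-dimensional design matrix is then suitable for a Lasso fit, which selects among the expanded terms.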
<p>In the &#x0201C;Reg.Trees&#x0201D; method, a single decision tree is fitted, with the hyperparameter chosen by 2-fold cross-validation. The &#x0201C;Forest&#x0201D; method runs random forests and averages over 1,000 trees. The &#x0201C;Boosting&#x0201D; method uses boosted regression trees with 2-fold cross-validation. The &#x0201C;Neural net&#x0201D; method uses two neurons, with a logistic loss function for classification and a linear one for regression. Finally, the &#x0201C;Ensemble&#x0201D; method averages the predictions of &#x0201C;Lasso,&#x0201D; &#x0201C;Boosting,&#x0201D; &#x0201C;Random Forests,&#x0201D; and &#x0201C;Neural Net.&#x0201D;</p>
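The &#x0201C;Ensemble&#x0201D; nuisance estimator can be sketched as follows, averaging out-of-fold predictions from the four learners. Learner settings below are illustrative stand-ins (on simulated data), not the exact configurations used in the study.

```python
# Hypothetical sketch of the "Ensemble" nuisance estimator: average the
# out-of-fold predictions of Lasso, random forests (1,000 trees), boosting,
# and a two-neuron neural network on simulated regression data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=1.0, random_state=0)

learners = {
    "lasso": LassoCV(cv=2),
    "forest": RandomForestRegressor(n_estimators=1000, random_state=0),
    "boosting": GradientBoostingRegressor(random_state=0),
    "neural_net": MLPRegressor(hidden_layer_sizes=(2,), max_iter=5000,
                               random_state=0),
}

# Out-of-fold predictions for each learner (2-fold, mirroring the CV above)
preds = {name: cross_val_predict(m, X, y, cv=2) for name, m in learners.items()}

# "Ensemble": simple average of the four learners' predictions
ensemble_pred = np.mean(np.column_stack(list(preds.values())), axis=1)
```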
<p>The last column, &#x0201C;best,&#x0201D; works differently: at each sample split it selects the method(s) that give the best estimates of <italic>g</italic><sub>0</sub> and <italic>m</italic><sub>0</sub>, and then uses each selected method to estimate them separately. DML may therefore use different methods for the <italic>g</italic><sub>0</sub> and <italic>m</italic><sub>0</sub> estimations in the &#x0201C;best&#x0201D; column.</p>
<p>These tables show how different machine learning methods perform in estimating the treatment effect of using California Math on student achievement, and they can help inform future research and policy decisions.</p>
<p>Each table presents two sets of results: Panel A displays the general interactive DML model and Panel B the partially linear DML model. For each model, an estimate of the average treatment effect (ATE) and its standard error are reported. The &#x0201C;se(median)&#x0201D; row reports standard errors using the median method to adjust for variation across splits<xref ref-type="fn" rid="fn0009"><sup>7</sup></xref>, while the &#x0201C;se&#x0201D; row reports the median standard error across the 10 splits.</p>
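The median aggregation across sample splits can be sketched as below, following definition (3.3) of Chernozhukov et al. (2018): the point estimate is the median across splits, and each split&#x00027;s variance is inflated by that split&#x00027;s squared deviation from the median estimate before the median is taken again. The numbers in the example are illustrative, not values from the tables.

```python
# Sketch of the "se(median)" adjustment for variation across sample splits:
# theta = median of per-split estimates; the adjusted variance is the median
# of (se_k^2 + (theta_k - theta)^2) over splits k.
import numpy as np

def median_aggregate(estimates, std_errors):
    """Aggregate per-split DML estimates and standard errors (median method)."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    theta = np.median(estimates)
    se = np.sqrt(np.median(std_errors ** 2 + (estimates - theta) ** 2))
    return theta, se

# Example with 10 splits; the adjusted se can only exceed the plain se
theta, se = median_aggregate(
    estimates=[0.05, 0.06, 0.055, 0.048, 0.052, 0.058, 0.051, 0.049, 0.053, 0.057],
    std_errors=[0.014] * 10,
)
```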
<p>DML standard errors, both &#x0201C;se(median)&#x0201D; and &#x0201C;se,&#x0201D; are calculated under the assumption of independent and identically distributed (i.i.d.) sampling. For the dataset used in this study, however, clustered standard errors are more appropriate. To address this, we employ a bootstrap procedure in which the entire sample is resampled at the district level to obtain 50 bootstrapped samples. A DML estimate is computed on each bootstrapped sample, and the standard deviation of the 50 estimates gives the clustered standard errors reported in the &#x0201C;clustered.se&#x0201D; row of each result table.</p>
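The district-level cluster bootstrap can be sketched as follows. Here <monospace>estimate_ate</monospace> is a purely illustrative placeholder for a full DML fit, and the simulated district structure is hypothetical.

```python
# Minimal sketch of the district-level cluster bootstrap behind "clustered.se":
# resample whole districts with replacement, re-estimate on each bootstrap
# sample, and take the standard deviation of the 50 estimates.
import numpy as np

rng = np.random.default_rng(0)
districts = np.repeat(np.arange(20), 10)   # 20 districts, 10 schools each
y = rng.normal(size=districts.size)        # stand-in outcome data

def estimate_ate(sample_idx):
    # Placeholder for running the full DML pipeline on the resampled data
    return y[sample_idx].mean()

boot_estimates = []
for _ in range(50):                        # 50 bootstrapped samples, as above
    drawn = rng.choice(np.unique(districts), size=20, replace=True)
    # Keep every school in each drawn district (districts, not schools, resampled)
    idx = np.concatenate([np.flatnonzero(districts == d) for d in drawn])
    boot_estimates.append(estimate_ate(idx))

clustered_se = np.std(boot_estimates, ddof=1)
```

Resampling at the district level preserves the within-district correlation structure that an i.i.d. bootstrap would break.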
<p>In <xref ref-type="table" rid="T8">Table 8</xref>, we present the ensemble DML results for comparison. The main motivation for using DML is to relax the linearity restriction and obtain a more general and representative model than partially linear alternatives.</p>
<table-wrap position="float" id="T8">
<label>Table 8</label>
<caption><p>Effects of California math on grade 3 mathematics achievement relative to the composite alternative: compare with DML results.</p></caption>
<table frame="box" rules="all">
<thead>
<tr>
<th valign="top" align="left"><bold>Variable</bold></th>
<th valign="top" align="center"><bold>Year 1</bold></th>
<th valign="top" align="center"><bold>Year 2</bold></th>
<th valign="top" align="center"><bold>Year 3</bold></th>
<th valign="top" align="center"><bold>Year 4</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" colspan="5"><bold>Treatment: California math; Control: Composite alternative</bold></td>
</tr>
<tr>
<td valign="top" align="left">Treatment effect: Kernel matching</td>
<td valign="top" align="center">0.063 (0.054)</td>
<td valign="top" align="center">0.083<break/> (0.051)</td>
<td valign="top" align="center">0.061 (0.059)</td>
<td valign="top" align="center">0.070<break/> (0.059)</td>
</tr>
<tr>
<td valign="top" align="left">Treatment effect: Restricted OLS</td>
<td valign="top" align="center">0.050<sup>&#x0002A;&#x0002A;</sup> (0.019)</td>
<td valign="top" align="center">0.064<sup>&#x0002A;&#x0002A;</sup> (0.023)</td>
<td valign="top" align="center">0.049<sup>&#x0002A;&#x0002A;</sup> (0.023)</td>
<td valign="top" align="center">0.058<sup>&#x0002A;&#x0002A;</sup> (0.023)</td>
</tr>
<tr>
<td valign="top" align="left">Treatment effect: interactive DML (Ensemble)</td>
<td valign="top" align="center">0.065<sup>&#x0002A;&#x0002A;</sup> (0.015)</td>
<td valign="top" align="center">0.068<sup>&#x0002A;&#x0002A;</sup> (0.019)</td>
<td valign="top" align="center">0.057<sup>&#x0002A;&#x0002A;</sup> (0.018)</td>
<td valign="top" align="center">0.057<sup>&#x0002A;&#x0002A;</sup> (0.018)</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p><sup>&#x0002A;&#x0002A;</sup><italic>p</italic> &#x0003C; 0.05.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec sec-type="discussion" id="s5">
<label>5</label>
<title>Discussion</title>
<p>The results from the interactive DML model show smaller clustered standard errors compared to Kernel Matching and OLS. DML fulfills its promise of being a more efficient non-parametric estimator due to its two core strategies: orthogonalization and sample splitting.</p>
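The two core strategies named above can be illustrated with a minimal cross-fitted partially linear DML sketch on simulated data: out-of-fold predictions residualize both the outcome and the treatment on the covariates (orthogonalization via sample splitting), and the treatment effect is recovered from the residual-on-residual regression. This is a sketch of the estimator&#x00027;s logic, not the paper&#x00027;s exact specification.

```python
# Illustrative cross-fitted partially linear DML on simulated data with a
# known treatment effect of 0.5: residualize Y and D on X with out-of-fold
# random-forest predictions, then regress Y-residuals on D-residuals.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 5))
D = X[:, 0] + rng.normal(size=n)                  # treatment depends on X
Y = 0.5 * D + X[:, 0] ** 2 + rng.normal(size=n)   # true effect = 0.5

# Out-of-fold predictions implement sample splitting (cross-fitting)
m_hat = cross_val_predict(RandomForestRegressor(random_state=0), X, D, cv=2)
ell_hat = cross_val_predict(RandomForestRegressor(random_state=0), X, Y, cv=2)

v, u = D - m_hat, Y - ell_hat                     # orthogonalized residuals
theta = np.sum(v * u) / np.sum(v * v)             # Neyman-orthogonal estimate
```

Because the nuisance predictions are out-of-fold, overfitting in the random forests does not contaminate the residuals, which is what delivers the efficiency discussed above.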
<p>For the point estimates, the interactive DML results are quite similar to Kernel Matching in Year 1 and Year 3; however, the effects are fairly stable across years under interactive DML, in contrast to the rise-and-fall pattern observed under Kernel Matching. The DML results also align with what we would expect in practice, as a stable effect is more plausible than a fluctuating one. From both a substantive and a model-specification perspective, therefore, the DML results offer new insight into the true effect.</p>
<p>Moreover, the difference between OLS and interactive DML in Year 1 highlights the performance gap between the two models. Although linear models can capture the broad features of the data, their lack of flexibility and stability makes it worthwhile to also fit non-parametric models such as DML.</p>
<p>It is important to acknowledge several limitations of this study. First, the causal interpretation of the results depends on the Conditional Independence Assumption holding. While <xref ref-type="bibr" rid="B17">Koedel et al. (2017)</xref> provided falsification tests to support this assumption, the current study does not offer additional validation procedures.</p>
<p>Second, the data structure creates challenges due to hierarchical dependence. Schools are nested within districts, which introduces complex correlation patterns. Although clustered bootstrap methods help address some of these dependencies, they may not fully account for all sources of correlation present in the data.</p>
<p>Finally, the findings come from the specific context of California&#x00027;s textbook adoption process during the period studied. The unique characteristics of this setting may affect results in ways that do not extend to other states, different grade levels, or subjects other than mathematics.</p></sec>
<sec sec-type="conclusion" id="s6">
<label>6</label>
<title>Conclusion</title>
<p>The evaluation of treatment effects is crucial in educational research, covering policies, textbooks, and teacher training (<xref ref-type="bibr" rid="B12">Chingos and Whitehurst, 2012</xref>). The most commonly used tool for this purpose is OLS, owing to its simplicity. In this study, however, we introduce a recently developed method, double/debiased machine learning (DML), which outperforms both OLS and kernel matching, the second most popular tool, in providing statistically significant estimates of textbook performance.</p>
<p>We first provide a mathematical explanation of why DML is superior, and then compare our results to a previous study that used the same dataset. Our findings suggest that DML not only surpasses OLS in overcoming linear restrictions but also outperforms kernel matching in providing a more precise estimate. Despite its advantages, DML has some limitations. It requires larger samples than the other two methods, and small samples can cause it to fail. It also demands significantly more computing power; processing the data can take several days on a personal computer.</p>
<p>Our study makes several contributions: first, we applied the DML method to evaluate textbooks and demonstrated its superiority over OLS and Kernel matching, serving as a template for future studies. Second, we addressed concerns raised by <xref ref-type="bibr" rid="B17">Koedel et al. (2017)</xref> and verified the limitations of linear models for the data, cautioning against potential issues in similar studies. Finally, we extended the built-in DML standard errors to account for clustering, which is a minor but useful contribution to the field of model application.</p></sec>
</body>
<back>
<sec sec-type="data-availability" id="s7">
<title>Data availability statement</title>
<p>The data that support the findings of this study are available from the corresponding author upon reasonable request.</p>
</sec>
<sec sec-type="author-contributions" id="s8">
<title>Author contributions</title>
<p>ZF: Writing &#x02013; review &#x00026; editing, Writing &#x02013; original draft. YB: Writing &#x02013; original draft, Writing &#x02013; review &#x00026; editing.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="ai-statement" id="s10">
<title>Generative AI statement</title>
<p>The author(s) declared that generative AI was not used in the creation of this manuscript.</p>
<p>Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.</p></sec>
<sec sec-type="disclaimer" id="s11">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Acito</surname> <given-names>F.</given-names></name></person-group> (<year>2023</year>). <article-title>&#x0201C;Ordinary least squares regression,&#x0201D;</article-title> in <source>Predictive Analytics with KNIME: Analytics for Citizen Data Scientists</source> (<publisher-loc>New York</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>105</fpage>&#x02013;<lpage>124</lpage>. doi: <pub-id pub-id-type="doi">10.1007/978-3-031-45630-5_6</pub-id></mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Arumuru</surname> <given-names>L.</given-names></name> <name><surname>David</surname> <given-names>T. O.</given-names></name></person-group> (<year>2024</year>). <article-title>The impact of instructional resources on academic achievement: a study of library and information science postgraduates in Nigeria</article-title>. <source>Asian J. Inf. Sci. Technol.</source> <volume>14</volume>, <fpage>54</fpage>&#x02013;<lpage>60</lpage>. doi: <pub-id pub-id-type="doi">10.70112/ajist-2024.14.1.4259</pub-id></mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Batlle</surname> <given-names>P.</given-names></name> <name><surname>Chen</surname> <given-names>Y.</given-names></name> <name><surname>Hosseini</surname> <given-names>B.</given-names></name> <name><surname>Owhadi</surname> <given-names>H.</given-names></name> <name><surname>Stuart</surname> <given-names>A. M.</given-names></name></person-group> (<year>2025</year>). <article-title>Error analysis of kernel/GP methods for nonlinear and parametric PDEs</article-title>. <source>J. Comput. Phys.</source> <volume>520</volume>:<fpage>113488</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.jcp.2024.113488</pub-id></mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Bhatt</surname> <given-names>R. R.</given-names></name> <name><surname>Koedel</surname> <given-names>C.</given-names></name></person-group> (<year>2012</year>). <article-title>Large-scale evaluations of curricular effectiveness</article-title>. <source>Educ. Eval. Policy Anal.</source> <volume>34</volume>, <fpage>391</fpage>&#x02013;<lpage>412</lpage>. doi: <pub-id pub-id-type="doi">10.3102/0162373712440040</pub-id></mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Bhatt</surname> <given-names>R. R.</given-names></name> <name><surname>Koedel</surname> <given-names>C.</given-names></name> <name><surname>Lehmann</surname> <given-names>D.</given-names></name></person-group> (<year>2013</year>). <article-title>Is curriculum quality uniform? Evidence from Florida</article-title>. <source>Econ. Educ. Rev.</source> <volume>34</volume>, <fpage>107</fpage>&#x02013;<lpage>121</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.econedurev.2013.01.014</pub-id></mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Biau</surname> <given-names>G.</given-names></name></person-group> (<year>2012</year>). <article-title>Analysis of a random forests model</article-title>. <source>J. Mach. Learn. Res.</source> <volume>13</volume>, <fpage>1063</fpage>&#x02013;<lpage>1095</lpage>.</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Blazar</surname> <given-names>D.</given-names></name> <name><surname>Heller</surname> <given-names>B.</given-names></name> <name><surname>Kane</surname> <given-names>T. J.</given-names></name> <name><surname>Polikoff</surname> <given-names>M.</given-names></name> <name><surname>Staiger</surname> <given-names>D. O.</given-names></name> <name><surname>Carrell</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>Curriculum reform in the Common Core era: evaluating elementary math textbooks across six U.S. states</article-title>. <source>J. Policy Anal. Manag</source>. <volume>39</volume>, <fpage>966</fpage>&#x02013;<lpage>1019</lpage>. doi: <pub-id pub-id-type="doi">10.1002/pam.22257</pub-id></mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Blazar</surname> <given-names>D.</given-names></name> <name><surname>Heller</surname> <given-names>B.</given-names></name> <name><surname>Kane</surname> <given-names>T.</given-names></name> <name><surname>Polikoff</surname> <given-names>M.</given-names></name> <name><surname>Staiger</surname> <given-names>D.</given-names></name> <name><surname>Carrell</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2019</year>). <source>Learning by the Book: Comparing Math Achievement Growth by Textbook in Six Common Core States</source>. Center for Education Policy Research, Harvard University.</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Breznau</surname> <given-names>N.</given-names></name></person-group> (<year>2022</year>). <article-title>Integrating computer prediction methods in social science: a comment on Hofman et al. (2021)</article-title>. <source>Soc. Sci. Comput. Rev.</source> <volume>40</volume>, <fpage>844</fpage>&#x02013;<lpage>853</lpage>. doi: <pub-id pub-id-type="doi">10.1177/08944393211049776</pub-id></mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>X.</given-names></name></person-group> (<year>2007</year>). <article-title>&#x0201C;Large sample sieve estimation of semi-nonparametric models,&#x0201D;</article-title> in <source>Handbook of Econometrics</source> (<publisher-loc>Amsterdam</publisher-loc>: <publisher-name>Elsevier</publisher-name>), <fpage>5549</fpage>&#x02013;<lpage>5632</lpage>.</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Chernozhukov</surname> <given-names>V.</given-names></name> <name><surname>Chetverikov</surname> <given-names>D.</given-names></name> <name><surname>Demirer</surname> <given-names>M.</given-names></name> <name><surname>Duflo</surname> <given-names>E.</given-names></name> <name><surname>Hansen</surname> <given-names>C.</given-names></name> <name><surname>Newey</surname> <given-names>W.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>Double/debiased machine learning for treatment and structural parameters</article-title>. <source>Econ. J.</source> <volume>21</volume>, <fpage>C1</fpage>&#x02013;<lpage>C68</lpage>. doi: <pub-id pub-id-type="doi">10.1111/ectj.12097</pub-id></mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Chingos</surname> <given-names>M. M.</given-names></name> <name><surname>Whitehurst</surname> <given-names>G. J.</given-names></name></person-group> (<year>2012</year>). <source>Choosing Blindly: Instructional Materials, Teacher Effectiveness, and the Common Core</source>. <publisher-loc>Washington, DC</publisher-loc>: <publisher-name>Brookings Institution</publisher-name>.</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Efeizomor</surname> <given-names>R. O.</given-names></name></person-group> (<year>2023</year>). <article-title>A comparative study of methods of remedying multicolinearity</article-title>. <source>Am. J. Theor. Appl. Stat.</source> <volume>12</volume>, <fpage>87</fpage>&#x02013;<lpage>91</lpage>. doi: <pub-id pub-id-type="doi">10.11648/j.ajtas.20231204.14</pub-id></mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Hadar</surname> <given-names>L. L.</given-names></name></person-group> (<year>2017</year>). <article-title>Opportunities to learn: mathematics textbooks and students&#x00027; achievements</article-title>. <source>Stud. Educ. Eval.</source> <volume>55</volume>, <fpage>153</fpage>&#x02013;<lpage>166</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.stueduc.2017.10.002</pub-id></mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Heckman</surname> <given-names>J. J.</given-names></name> <name><surname>Ichimura</surname> <given-names>H.</given-names></name> <name><surname>Todd</surname> <given-names>P.</given-names></name></person-group> (<year>1998</year>). <article-title>Matching as an econometric evaluation estimator</article-title>. <source>Rev. Econ. Stud.</source> <volume>65</volume>, <fpage>261</fpage>&#x02013;<lpage>294</lpage>. doi: <pub-id pub-id-type="doi">10.1111/1467-937X.00044</pub-id></mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Hiabu</surname> <given-names>M.</given-names></name> <name><surname>Mammen</surname> <given-names>E.</given-names></name> <name><surname>Meyer</surname> <given-names>J. T.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Local linear smoothing in additive models as data projection,&#x0201D;</article-title> in <source>Foundations of Modern Statistics</source>, eds. D. Belomestny, C. Butucea, E. Mammen, E. Moulines, M. Rei&#x000DF;, and V. V. Ulyanov (<publisher-loc>New York</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>197</fpage>&#x02013;<lpage>223</lpage>.</mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Koedel</surname> <given-names>C.</given-names></name> <name><surname>Li</surname> <given-names>D.</given-names></name> <name><surname>Polikoff</surname> <given-names>M. S.</given-names></name> <name><surname>Hardaway</surname> <given-names>T.</given-names></name> <name><surname>Wrabel</surname> <given-names>S. L.</given-names></name></person-group> (<year>2017</year>). <article-title>Mathematics curriculum effects on student achievement in California</article-title>. <source>AERA Open</source> <volume>3</volume>, <fpage>1</fpage>&#x02013;<lpage>22</lpage>. doi: <pub-id pub-id-type="doi">10.1177/2332858417690511</pub-id></mixed-citation>
</ref>
<ref id="B18">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Koedel</surname> <given-names>C.</given-names></name> <name><surname>Polikoff</surname> <given-names>M.</given-names></name></person-group> (<year>2017</year>). <article-title>Big bang for just a few bucks: the impact of math textbooks in California</article-title>. <source>Evid. Speaks Rep.</source> <volume>2</volume>, <fpage>1</fpage>&#x02013;<lpage>7</lpage>.</mixed-citation>
</ref>
<ref id="B19">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>C.</given-names></name> <name><surname>M&#x000FC;ller</surname> <given-names>U. K.</given-names></name></person-group> (<year>2021</year>). <article-title>Linear regression with many controls of limited explanatory power</article-title>. <source>Quant. Econ.</source> <volume>12</volume>, <fpage>405</fpage>&#x02013;<lpage>442</lpage>. doi: <pub-id pub-id-type="doi">10.3982/QE1577</pub-id></mixed-citation>
</ref>
<ref id="B20">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>F.</given-names></name> <name><surname>Wang</surname> <given-names>L.</given-names></name></person-group> (<year>2024</year>). <article-title>A study on textbook use and its effects on students&#x00027; academic performance</article-title>. <source>Discip. Interdiscip. Sci. Educ. Res.</source> <volume>6</volume>:<fpage>4</fpage>. doi: <pub-id pub-id-type="doi">10.1007/978-3-031-52924-5</pub-id></mixed-citation>
</ref>
<ref id="B21">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>C.</given-names></name> <name><surname>Zhao</surname> <given-names>X.</given-names></name> <name><surname>Huang</surname> <given-names>J.</given-names></name></person-group> (<year>2023</year>). <article-title>New tests for high-dimensional linear regression based on random projection</article-title>. <source>Stat. Sin.</source> <volume>33</volume>, <fpage>475</fpage>&#x02013;<lpage>498</lpage>. doi: <pub-id pub-id-type="doi">10.5705/ss.202020.0405</pub-id></mixed-citation>
</ref>
<ref id="B22">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Polikoff</surname> <given-names>M. S.</given-names></name></person-group> (<year>2015</year>). <article-title>How well aligned are textbooks to the Common Core standards in mathematics?</article-title> <source>Am. Educ. Res. J.</source> <volume>52</volume>, <fpage>1185</fpage>&#x02013;<lpage>1211</lpage>. doi: <pub-id pub-id-type="doi">10.3102/0002831215584435</pub-id></mixed-citation>
</ref>
<ref id="B23">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Polikoff</surname> <given-names>M. S.</given-names></name> <name><surname>Campbell</surname> <given-names>S. E.</given-names></name> <name><surname>Rabovsky</surname> <given-names>S.</given-names></name> <name><surname>Koedel</surname> <given-names>C.</given-names></name> <name><surname>Le</surname> <given-names>Q. T.</given-names></name> <name><surname>Hardaway</surname> <given-names>T.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>The formalized processes districts use to evaluate mathematics textbooks</article-title>. <source>J. Curric. Stud.</source> <volume>52</volume>, <fpage>451</fpage>&#x02013;<lpage>477</lpage>. doi: <pub-id pub-id-type="doi">10.1080/00220272.2020.1747116</pub-id></mixed-citation>
</ref>
<ref id="B24">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Sievert</surname> <given-names>H.</given-names></name> <name><surname>van den Ham</surname> <given-names>A.-K.</given-names></name> <name><surname>Heinze</surname> <given-names>A.</given-names></name></person-group> (<year>2021</year>). <article-title>Are first graders&#x00027; arithmetic skills related to the quality of mathematics textbooks? a study on students&#x00027; use of arithmetic principles</article-title>. <source>Learn. Ins.</source> <volume>71</volume>:<fpage>101401</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.learninstruc.2020.101401</pub-id></mixed-citation>
</ref>
<ref id="B25">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Slavin</surname> <given-names>R. E.</given-names></name> <name><surname>Lake</surname> <given-names>C.</given-names></name></person-group> (<year>2008</year>). <article-title>Effective programs in elementary mathematics: a best-evidence synthesis</article-title>. <source>Rev. Educ. Res.</source> <volume>78</volume>, <fpage>427</fpage>&#x02013;<lpage>515</lpage>. doi: <pub-id pub-id-type="doi">10.3102/0034654308317473</pub-id></mixed-citation>
</ref>
</ref-list>
<fn-group>
<fn fn-type="custom" custom-type="edited-by" id="fn0001">
<p>Edited by: <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/2559406/overview">Gladys Sunzuma</ext-link>, Bindura University of Science Education, Zimbabwe</p>
</fn>
<fn fn-type="custom" custom-type="reviewed-by" id="fn0002">
<p>Reviewed by: <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/3199343/overview">Jamiu Idowu</ext-link>, University College London, United Kingdom</p>
<p><ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/3253395/overview">Munir Ahmad</ext-link>, Government of Pakistan, Pakistan</p>
</fn>
</fn-group>
<fn-group>
<fn id="fn0003"><label>1</label><p>Unconfoundedness means that, conditional on <italic>X</italic>, the counterfactuals <italic>Y</italic>(0) and <italic>Y</italic>(1) are uncorrelated with the treatment <italic>D</italic>. CIA means that, conditional on <italic>X</italic>, the choice of <italic>D</italic> is statistically independent of <italic>U</italic>, the model error term. They are similar concepts and are treated as the same in this study.</p></fn>
<fn id="fn0004"><label>2</label><p>Because of the high-dimensional nuisance parameter space, machine learning methods are usually used to estimate <italic>g</italic><sub>0</sub>, which can thus be regarded as an ML estimator.</p></fn>
<fn id="fn0005"><label>3</label><p><xref ref-type="bibr" rid="B11">Chernozhukov et al. (2018)</xref> made this claim in their study. They prove that if the specific machine learning method used in the model has this property, then <italic>b</italic><sup>&#x0002A;</sup> converges as claimed, and they also show good simulation results. In practice, it is hard to establish theoretically how each method converges, but DML has been shown to outperform arbitrarily picking some variables and running a simple regression.</p></fn>
<fn id="fn0006"><label>4</label><p>For more detailed information, please refer to <xref ref-type="bibr" rid="B17">Koedel et al. (2017)</xref>.</p></fn>
<fn id="fn0007"><label>5</label><p>All test scores mentioned here are standardized student test scores, computed from the universe of student data collected by the California Department of Education (CDE).</p></fn>
<fn id="fn0008"><label>6</label><p>Namely, enVision Math California; California Mathematics: Concepts, Skills, and Problem Solving; California HSP Math.</p></fn>
<fn id="fn0009"><label>7</label><p>For details of the median method, see definition (3.3) in <xref ref-type="bibr" rid="B11">Chernozhukov et al. (2018)</xref>.</p></fn>
</fn-group>
</back>
</article>