<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3-mathml3.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:ali="http://www.niso.org/schemas/ali/1.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" dtd-version="1.3" article-type="correction">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Comput. Sci.</journal-id>
<journal-title-group>
<journal-title>Frontiers in Computer Science</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Comput. Sci.</abbrev-journal-title>
</journal-title-group>
<issn pub-type="epub">2624-9898</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fcomp.2025.1734114</article-id>
<article-version article-version-type="Version of Record" vocab="NISO-RP-8-2008"/>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Correction</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Correction: Optimizing architectural-feature tradeoffs in Arabic automatic short answer grading: comparative analysis of fine-tuned AraBERTv2 models</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<collab>Frontiers Production Office</collab>
<xref ref-type="aff" rid="aff1"/>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="https://loop.frontiersin.org/people/20170"/>
</contrib>
</contrib-group>
<aff id="aff1"><institution>Frontiers Media SA</institution>, <city>Lausanne</city>, <country country="ch">Switzerland</country></aff>
<author-notes>
<corresp id="c001"><label>&#x0002A;</label>Correspondence: Frontiers Production Office, <email xlink:href="mailto:production.office@frontiersin.org">production.office@frontiersin.org</email></corresp>
</author-notes>
<pub-date publication-format="electronic" date-type="pub" iso-8601-date="2025-11-12">
<day>12</day>
<month>11</month>
<year>2025</year>
</pub-date>
<pub-date publication-format="electronic" date-type="collection">
<year>2025</year>
</pub-date>
<volume>7</volume>
<elocation-id>1734114</elocation-id>
<history>
<date date-type="received">
<day>28</day>
<month>10</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>28</day>
<month>10</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2025 Frontiers Production Office.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Frontiers Production Office</copyright-holder>
<license>
<ali:license_ref start_date="2025-11-12">https://creativecommons.org/licenses/by/4.0/</ali:license_ref>
<license-p>This is an open-access article distributed under the terms of the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution License (CC BY)</ext-link>. The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</license-p>
</license>
</permissions>
<related-article id="RA1" related-article-type="corrected-article" journal-id="Front. Comput. Sci." journal-id-type="nlm-ta" vol="7" page="1683272" xlink:href="10.3389/fcomp.2025.1683272" ext-link-type="doi">A Correction on <article-title>Optimizing architectural-feature tradeoffs in Arabic automatic short answer grading: comparative analysis of fine-tuned AraBERTv2 models</article-title> by Mahmood, S. A. (2025). <italic>Front. Comput. Sci</italic>. 7:1683272. doi: <object-id>10.3389/fcomp.2025.1683272</object-id></related-article>
<kwd-group>
<kwd>large language models (LLMs)</kwd>
<kwd>AraBERT</kwd>
<kwd>neural network</kwd>
<kwd>Arabic natural language processing</kwd>
<kwd>educational assessment</kwd>
<kwd>Automated Short Answer Grading (ASAG)</kwd>
</kwd-group>
<counts>
<fig-count count="8"/>
<table-count count="7"/>
<equation-count count="0"/>
<ref-count count="0"/>
<page-count count="6"/>
<word-count count="1086"/>
</counts>
<custom-meta-group>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Digital Education</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<p>There was a mistake in the article as published. <xref ref-type="table" rid="T1">Tables 1</xref>&#x02013;<xref ref-type="table" rid="T7">7</xref> and <xref ref-type="fig" rid="F1">Figures 1</xref>&#x02013;<xref ref-type="fig" rid="F8">8</xref> were published as supplementary material when they should have been added to the main article. The corrected figures and tables appear below.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Distribution of answers by question type.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Question type</bold></th>
<th valign="top" align="center"><bold>Question type (In Arabic)</bold></th>
<th valign="top" align="center"><bold>Total questions</bold></th>
<th valign="top" align="center"><bold>Total answers</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Define the scientific term</td>
<td valign="top" align="center"><inline-graphic xlink:href="fcomp-07-1734114-i0001.tif"/></td>
<td valign="top" align="center">6</td>
<td valign="top" align="center">291</td>
</tr>
<tr>
<td valign="top" align="left">Explain</td>
<td valign="top" align="center"><inline-graphic xlink:href="fcomp-07-1734114-i0002.tif"/></td>
<td valign="top" align="center">21</td>
<td valign="top" align="center">830</td>
</tr>
<tr>
<td valign="top" align="left">What are the consequences of</td>
<td valign="top" align="center"><inline-graphic xlink:href="fcomp-07-1734114-i0003.tif"/></td>
<td valign="top" align="center">6</td>
<td valign="top" align="center">282</td>
</tr>
<tr>
<td valign="top" align="left">Justify or give reasons for</td>
<td valign="top" align="center"><inline-graphic xlink:href="fcomp-07-1734114-i0004.tif"/></td>
<td valign="top" align="center">10</td>
<td valign="top" align="center">465</td>
</tr>
<tr>
<td valign="top" align="left">What is the difference between</td>
<td valign="top" align="center"><inline-graphic xlink:href="fcomp-07-1734114-i0005.tif"/></td>
<td valign="top" align="center">5</td>
<td valign="top" align="center">217</td>
</tr>
<tr>
<td valign="top" align="left">Total</td>
<td valign="top" align="center">5 types</td>
<td valign="top" align="center">48</td>
<td valign="top" align="center">2,085</td>
</tr></tbody>
</table>
</table-wrap>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Detailed distribution of randomly sampled responses across selected questions.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Q&#x02013;No</bold>.</th>
<th valign="top" align="left"><bold>Question type</bold></th>
<th valign="top" align="center"><bold>Total answers</bold></th>
<th valign="top" align="center"><bold>Training answers</bold></th>
<th valign="top" align="center"><bold>Test answers</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">1</td>
<td valign="top" align="left">Define the scientific term</td>
<td valign="top" align="center">46</td>
<td valign="top" align="center">36</td>
<td valign="top" align="center">10</td>
</tr>
<tr>
<td valign="top" align="left">26</td>
<td valign="top" align="left">Explain</td>
<td valign="top" align="center">47</td>
<td valign="top" align="center">37</td>
<td valign="top" align="center">10</td>
</tr>
<tr>
<td valign="top" align="left">28</td>
<td valign="top" align="left">What are the consequences of</td>
<td valign="top" align="center">48</td>
<td valign="top" align="center">38</td>
<td valign="top" align="center">10</td>
</tr>
<tr>
<td valign="top" align="left">35</td>
<td valign="top" align="left">Justify or give reasons for</td>
<td valign="top" align="center">51</td>
<td valign="top" align="center">40</td>
<td valign="top" align="center">11</td>
</tr>
<tr>
<td valign="top" align="left">45</td>
<td valign="top" align="left">What is the difference between</td>
<td valign="top" align="center">36</td>
<td valign="top" align="center">28</td>
<td valign="top" align="center">8</td>
</tr></tbody>
</table>
</table-wrap>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Performance evaluation of AraBERTv2 with MLP model using different feature sets: training vs. testing results.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Model</bold></th>
<th valign="top" align="left"><bold>Stage</bold></th>
<th valign="top" align="left"><bold>No. of features</bold></th>
<th valign="top" align="center"><bold>MAE</bold></th>
<th valign="top" align="center"><bold>RMSE</bold></th>
<th valign="top" align="center"><bold>Pearson correlation</bold></th>
<th valign="top" align="center"><bold>Spearman&#x00027;s correlation</bold></th>
<th valign="top" align="center"><bold>Epoch 1&#x02013;5</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">AraBERTv2 with MLP</td>
<td valign="top" align="left">Training</td>
<td valign="top" align="left">2-feature</td>
<td valign="top" align="center">1.14</td>
<td valign="top" align="center">1.51</td>
<td valign="top" align="center">0.847</td>
<td valign="top" align="center">0.85</td>
<td valign="top" align="center">898 &#x02192; 533 &#x02192; 347 &#x02192; 250 &#x02192; 156</td>
</tr>
 <tr>
<td/>
<td/>
<td valign="top" align="left">3-feature</td>
<td valign="top" align="center">1.2</td>
<td valign="top" align="center">1.58</td>
<td valign="top" align="center">0.818</td>
<td valign="top" align="center">0.816</td>
<td valign="top" align="center">1,026 &#x02192; 614 &#x02192; 263 &#x02192; 185</td>
</tr>
 <tr>
<td/>
<td/>
<td valign="top" align="left">4-feature</td>
<td valign="top" align="center">0.18</td>
<td valign="top" align="center">0.2</td>
<td valign="top" align="center">0.999</td>
<td valign="top" align="center">0.998</td>
<td valign="top" align="center">713 &#x02192; 34 &#x02192; 13 &#x02192; 9 &#x02192; 7</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">Testing</td>
<td valign="top" align="left">2-feature</td>
<td valign="top" align="center">1.31</td>
<td valign="top" align="center">1.76</td>
<td valign="top" align="center">0.803</td>
<td valign="top" align="center">0.808</td>
<td/>
</tr>
 <tr>
<td/>
<td/>
<td valign="top" align="left">3-feature</td>
<td valign="top" align="center">1.48</td>
<td valign="top" align="center">1.9</td>
<td valign="top" align="center">0.744</td>
<td valign="top" align="center">0.746</td>
<td/>
</tr>
 <tr>
<td/>
<td/>
<td valign="top" align="left">4-feature</td>
<td valign="top" align="center">1.77</td>
<td valign="top" align="center">2.22</td>
<td valign="top" align="center">0.691</td>
<td valign="top" align="center">0.689</td>
<td/>
</tr></tbody>
</table>
</table-wrap>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Performance evaluation of AraBERTv2 with CNN model using different feature sets: training vs. testing results.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Model</bold></th>
<th valign="top" align="left"><bold>Stage</bold></th>
<th valign="top" align="left"><bold>No. of features</bold></th>
<th valign="top" align="center"><bold>MAE</bold></th>
<th valign="top" align="center"><bold>RMSE</bold></th>
<th valign="top" align="center"><bold>Pearson correlation</bold></th>
<th valign="top" align="center"><bold>Spearman&#x00027;s correlation</bold></th>
<th valign="top" align="center"><bold>Epoch 1&#x02013;5</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">AraBERTv2 with CNN</td>
<td valign="top" align="left">Training</td>
<td valign="top" align="left">2-feature</td>
<td valign="top" align="center">1.22</td>
<td valign="top" align="center">1.59</td>
<td valign="top" align="center">0.849</td>
<td valign="top" align="center">0.843</td>
<td valign="top" align="center">1,092 &#x02192; 610 &#x02192; 427 &#x02192; 306 &#x02192; 227</td>
</tr>
 <tr>
<td/>
<td/>
<td valign="top" align="left">3-feature</td>
<td valign="top" align="center">1.17</td>
<td valign="top" align="center">1.53</td>
<td valign="top" align="center">0.833</td>
<td valign="top" align="center">0.832</td>
<td valign="top" align="center">1,057 &#x02192; 567 &#x02192; 379 &#x02192; 280 &#x02192; 205</td>
</tr>
 <tr>
<td/>
<td/>
<td valign="top" align="left">4-feature</td>
<td valign="top" align="center">0.24</td>
<td valign="top" align="center">0.27</td>
<td valign="top" align="center">0.999</td>
<td valign="top" align="center">0.998</td>
<td valign="top" align="center">773 &#x02192; 28 &#x02192; 12 &#x02192; 8 &#x02192; 6</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">Testing</td>
<td valign="top" align="left">2-feature</td>
<td valign="top" align="center">1.45</td>
<td valign="top" align="center">1.93</td>
<td valign="top" align="center">0.784</td>
<td valign="top" align="center">0.788</td>
<td/>
</tr>
 <tr>
<td/>
<td/>
<td valign="top" align="left">3-feature</td>
<td valign="top" align="center">1.6</td>
<td valign="top" align="center">2.02</td>
<td valign="top" align="center">0.746</td>
<td valign="top" align="center">0.75</td>
<td/>
</tr>
 <tr>
<td/>
<td/>
<td valign="top" align="left">4-feature</td>
<td valign="top" align="center">2.63</td>
<td valign="top" align="center">3.07</td>
<td valign="top" align="center">0.607</td>
<td valign="top" align="center">0.613</td>
<td/>
</tr></tbody>
</table>
</table-wrap>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>Performance evaluation of AraBERTv2 with LSTM model using different feature sets: training vs. testing results.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Model</bold></th>
<th valign="top" align="left"><bold>Stage</bold></th>
<th valign="top" align="left"><bold>No. of features</bold></th>
<th valign="top" align="center"><bold>MAE</bold></th>
<th valign="top" align="center"><bold>RMSE</bold></th>
<th valign="top" align="center"><bold>Pearson correlation</bold></th>
<th valign="top" align="center"><bold>Spearman&#x00027;s correlation</bold></th>
<th valign="top" align="center"><bold>Epoch 1&#x02013;5</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">AraBERTv2 with LSTM</td>
<td valign="top" align="left">Training</td>
<td valign="top" align="left">2-feature</td>
<td valign="top" align="center">1.26</td>
<td valign="top" align="center">1.62</td>
<td valign="top" align="center">0.821</td>
<td valign="top" align="center">0.825</td>
<td valign="top" align="center">1,147 &#x02192; 718 &#x02192; 524 &#x02192; 356 &#x02192; 262</td>
</tr>
 <tr>
<td/>
<td/>
<td valign="top" align="left">3-feature</td>
<td valign="top" align="center">1.27</td>
<td valign="top" align="center">1.66</td>
<td valign="top" align="center">0.811</td>
<td valign="top" align="center">0.818</td>
<td valign="top" align="center">1,141 &#x02192; 675 &#x02192; 456 &#x02192; 349 &#x02192; 267</td>
</tr>
 <tr>
<td/>
<td/>
<td valign="top" align="left">4-feature</td>
<td valign="top" align="center">0.14</td>
<td valign="top" align="center">0.19</td>
<td valign="top" align="center">0.998</td>
<td valign="top" align="center">0.998</td>
<td valign="top" align="center">728 &#x02192; 62 &#x02192; 31 &#x02192; 22 &#x02192; 19</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">Testing</td>
<td valign="top" align="left">2-feature</td>
<td valign="top" align="center">1.48</td>
<td valign="top" align="center">1.86</td>
<td valign="top" align="center">0.757</td>
<td valign="top" align="center">0.759</td>
<td/>
</tr>
 <tr>
<td/>
<td/>
<td valign="top" align="left">3-feature</td>
<td valign="top" align="center">1.6</td>
<td valign="top" align="center">2.03</td>
<td valign="top" align="center">0.757</td>
<td valign="top" align="center">0.77</td>
<td/>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">4-feature</td>
<td valign="top" align="center">3.62</td>
<td valign="top" align="center">4.19</td>
<td valign="top" align="center">0.388</td>
<td valign="top" align="center">0.419</td>
<td/>
</tr></tbody>
</table>
</table-wrap>
<table-wrap position="float" id="T6">
<label>Table 6</label>
<caption><p>Performance comparison of AraBERTv2 fine-tuned models with MLP, CNN, and LSTM architectures using different feature sets.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Fine-tuned models</bold></th>
<th valign="top" align="center"><bold>MAE</bold></th>
<th valign="top" align="center"><bold>RMSE</bold></th>
<th valign="top" align="center"><bold>Pearson correlation</bold></th>
<th valign="top" align="center"><bold>Spearman&#x00027;s correlation</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">2-features-AraBERTv2 with MLP</td>
<td valign="top" align="center"><bold>1.31</bold></td>
<td valign="top" align="center"><bold>1.76</bold></td>
<td valign="top" align="center"><bold>0.803</bold></td>
<td valign="top" align="center"><bold>0.808</bold></td>
</tr>
<tr>
<td valign="top" align="left">2-features-AraBERTv2 with CNN</td>
<td valign="top" align="center">1.45</td>
<td valign="top" align="center">1.93</td>
<td valign="top" align="center">0.784</td>
<td valign="top" align="center">0.788</td>
</tr>
<tr>
<td valign="top" align="left">2-features-AraBERTv2 with LSTM</td>
<td valign="top" align="center">1.48</td>
<td valign="top" align="center">1.86</td>
<td valign="top" align="center">0.757</td>
<td valign="top" align="center">0.759</td>
</tr>
<tr>
<td valign="top" align="left">3-features-AraBERTv2 with MLP</td>
<td valign="top" align="center">1.48</td>
<td valign="top" align="center">1.9</td>
<td valign="top" align="center">0.744</td>
<td valign="top" align="center">0.746</td>
</tr>
<tr>
<td valign="top" align="left">3-features-AraBERTv2 with CNN</td>
<td valign="top" align="center">1.6</td>
<td valign="top" align="center">2.02</td>
<td valign="top" align="center">0.746</td>
<td valign="top" align="center">0.75</td>
</tr>
<tr>
<td valign="top" align="left">3-features-AraBERTv2 with LSTM</td>
<td valign="top" align="center"><bold>1.6</bold></td>
<td valign="top" align="center"><bold>2.03</bold></td>
<td valign="top" align="center"><bold>0.757</bold></td>
<td valign="top" align="center"><bold>0.77</bold></td>
</tr>
<tr>
<td valign="top" align="left">4-features-AraBERTv2 with MLP</td>
<td valign="top" align="center"><bold>1.77</bold></td>
<td valign="top" align="center"><bold>2.22</bold></td>
<td valign="top" align="center"><bold>0.691</bold></td>
<td valign="top" align="center"><bold>0.689</bold></td>
</tr>
<tr>
<td valign="top" align="left">4-features-AraBERTv2 with CNN</td>
<td valign="top" align="center">2.63</td>
<td valign="top" align="center">3.07</td>
<td valign="top" align="center">0.607</td>
<td valign="top" align="center">0.613</td>
</tr>
<tr>
<td valign="top" align="left">4-features-AraBERTv2 with LSTM</td>
<td valign="top" align="center">3.62</td>
<td valign="top" align="center">4.19</td>
<td valign="top" align="center">0.388</td>
<td valign="top" align="center">0.419</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>The bold values represent the optimal results obtained from our experimental analysis.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="T7">
<label>Table 7</label>
<caption><p>Comparative performance evaluation of Arabic Automated Short Answer Grading (ASAG) systems.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Criterion/study</bold></th>
<th valign="top" align="left"><bold>Methodology</bold></th>
<th valign="top" align="left"><bold>Dataset</bold></th>
<th valign="top" align="center"><bold>Best RMSE</bold></th>
<th valign="top" align="center"><bold>Best Pearson/Spearman</bold></th>
<th valign="top" align="center"><bold>Key strength</bold></th>
<th valign="top" align="left"><bold>Primary limitation</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Our study (AraBERTv2)</td>
<td valign="top" align="left">Fine-tuned AraBERTv2 with MLP/CNN/LSTM; tested 2/3/4 feature configurations</td>
<td valign="top" align="left">AR-ASAG (2,133 answers)</td>
<td valign="top" align="center">1.31</td>
<td valign="top" align="left">Pearson: 0.803; Spearman: 0.808</td>
<td valign="top" align="left">Optimal balance between generalizability and accuracy with limited data</td>
<td valign="top" align="left">Performance degradation in LSTM with added features</td>
</tr>
<tr>
<td valign="top" align="left">(4)</td>
<td valign="top" align="left">Latent Semantic Analysis (LSA) with local/hybrid weighting</td>
<td valign="top" align="left">AR-ASAG (2,133 answers)</td>
<td valign="top" align="center">N/A</td>
<td valign="top" align="left">N/A</td>
<td valign="top" align="left">Effective semantic weighting</td>
<td valign="top" align="left">Limited capacity for capturing complex contextual relationships</td>
</tr>
<tr>
<td valign="top" align="left">(19)</td>
<td valign="top" align="left">BERT vs. Word2Vec/AWN comparison; intensive text preprocessing</td>
<td valign="top" align="left">AR-ASAG (2,133); Jordanian History (550)</td>
<td valign="top" align="center">1.00308</td>
<td valign="top" align="left">Pearson: 0.841902</td>
<td valign="top" align="left">Demonstrated BERT&#x00027;s superiority over traditional approaches</td>
<td valign="top" align="left">Heavy dependency on text normalization and stemming</td>
</tr></tbody>
</table>
</table-wrap>
<fig position="float" id="F1">
<label>Figure 1</label>
<caption><p>General workflow of the proposed automated Arabic short-answer grading model using AraBERTv2.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-07-1734114-g0001.tif">
<alt-text content-type="machine-generated">Flowchart illustrating the process of training and testing with the AR-ASAG dataset. The sequence includes dataset loading, preprocessing, and splitting into 80% training and 20% testing. The training subset undergoes feature selection and AraBERT training, leading to finetuned AraBERT models. These models are evaluated and compared, followed by visualization to determine the best AraBERT model. Arrows indicate the workflow and connections among the steps.</alt-text>
</graphic>
</fig>
<fig position="float" id="F2">
<label>Figure 2</label>
<caption><p>The AraBERT_MLP training methodology.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-07-1734114-g0002.tif">
<alt-text content-type="machine-generated">Bar charts compare AraBERTv2 with LSTM across training and testing phases. The top charts show MAE and RMSE, and Pearson and Spearman correlations for different features in training. The bottom charts depict the same metrics for testing, highlighting variations in error values and correlation coefficients across two, three, and four features.</alt-text>
</graphic>
</fig>
<fig position="float" id="F3">
<label>Figure 3</label>
<caption><p>The AraBERT_CNN training methodology.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-07-1734114-g0003.tif">
<alt-text content-type="machine-generated">Bar charts showing AraBERTv2 with CNN performance during training and testing. Training error bars compare MAE and RMSE for 2, 3, and 4 features, with RMSE generally higher. Correlation values for Pearson and Spearman increase with more features. Testing error values increase with more features, while correlation values decrease slightly as features increase.</alt-text>
</graphic>
</fig>
<fig position="float" id="F4">
<label>Figure 4</label>
<caption><p>The AraBERT_LSTM training methodology.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-07-1734114-g0004.tif">
<alt-text content-type="machine-generated">Flowchart depicting the AraBERT with MLP training stage. Feature selection leads to three models: 2-features (red), 3-features (green), and 4-features (purple). Each model connects to AraBERT with CNN, then fine-tuning AraBERT stages, ending with evaluation and comparison.</alt-text>
</graphic>
</fig>
<fig position="float" id="F5">
<label>Figure 5</label>
<caption><p>Performance evaluation of AraBERTv2 with MLP model using different feature sets: training vs. testing results.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-07-1734114-g0005.tif">
<alt-text content-type="machine-generated">Bar charts comparing AraBERTv2 with MLP performance in training and testing phases. In training, 4-feature shows minimal MAE and RMSE, with high Pearson and Spearman correlations. In testing, 2-feature has lower error values than 3-feature and 4-feature, though 4-feature performs slightly better in correlation values.</alt-text>
</graphic>
</fig>
<fig position="float" id="F6">
<label>Figure 6</label>
<caption><p>Performance evaluation of AraBERTv2 with CNN model using different feature sets: training vs. testing results.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-07-1734114-g0006.tif">
<alt-text content-type="machine-generated">Diagram illustrating a machine learning process, titled &#x0201C;AraBERT with MLP Training Stage.&#x0201D; It starts with &#x0201C;Feature Selection,&#x0201D; leads to three models: &#x0201C;2-features model,&#x0201D; &#x0201C;3-features model,&#x0201D; and &#x0201C;4-features model.&#x0201D; Each model goes to &#x0201C;AraBERT with CNN,&#x0201D; followed by &#x0201C;Fine tuning AraBERT,&#x0201D; and ends with &#x0201C;Evaluation and comparison.&#x0201D; Arrows indicate the flow direction.</alt-text>
</graphic>
</fig>
<fig position="float" id="F7">
<label>Figure 7</label>
<caption><p>Performance evaluation of AraBERTv2 with LSTM model using different feature sets: training vs. testing results.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-07-1734114-g0007.tif">
<alt-text content-type="machine-generated">Scatter plot titled &#x0201C;Model Performance: Error vs Spearman Correlation&#x0201D; showing different models' performance using colored markers: AraBERTv2 with MLP, CNN, and LSTM. The x-axis represents MAE (mean absolute error), where lower is better, and the y-axis represents Spearman&#x02019;s rank correlation, where higher is better. The plot uses different shapes to indicate feature numbers. Most points cluster between 1.5 to 2.0 MAE and 0.65 to 0.80 correlation, with one outlier beyond 3.5 MAE and below 0.45 correlation.</alt-text>
</graphic>
</fig>
<fig position="float" id="F8">
<label>Figure 8</label>
<caption><p>Fine-tuned models' performance: MAE vs. Spearman correlation.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-07-1734114-g0008.tif">
<alt-text content-type="machine-generated">Diagram of a machine learning workflow featuring AraBERT with MLP. It starts with feature selection, separating into three models: a 2-feature model in red, a 3-feature model in green, and a 4-feature model in blue. These feed into the AraBERT with MLP stage, which then advances to fine-tuning AraBERT in individual boxes. An evaluation and comparison stage follows, indicated by arrows.</alt-text>
</graphic>
</fig>
<p>All in-text Supplementary Table and Supplementary Figure citations have been changed to Table and Figure citations.</p>
<p>The original version of this article has been updated.</p>
</body>
<back>
<sec sec-type="ai-statement" id="s1">
<title>Generative AI statement</title>
<p>Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.</p></sec>
<fn-group>
<fn fn-type="custom" custom-type="edited-by" id="fn0001">
<p>Approved by: <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/739598/overview">Frontiers Editorial Office</ext-link>, Frontiers Media SA, Switzerland</p>
</fn>
</fn-group>
</back>
</article>