<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Artif. Intell.</journal-id>
<journal-title>Frontiers in Artificial Intelligence</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Artif. Intell.</abbrev-journal-title>
<issn pub-type="epub">2624-8212</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/frai.2023.1230087</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Artificial Intelligence</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Air pollution particulate matter (PM2.5) prediction in South African cities using machine learning techniques</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Morapedi</surname> <given-names>Tshepang Duncan</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/2520929/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Obagbuwa</surname> <given-names>Ibidun Christiana</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2326835/overview"/>
</contrib>
</contrib-group>
<aff><institution>Department of Computer Science and Information Technology, School of Natural and Applied Sciences, Sol Plaatje University</institution>, <addr-line>Kimberley</addr-line>, <country>South Africa</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Serestina Viriri, University of KwaZulu-Natal, South Africa</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Roseline Oluwaseun Ogundokun, Landmark University, Nigeria; Emmanuel Asani, Landmark University, Nigeria; Adekanmi Adegun, University of KwaZulu-Natal, South Africa; Anil Utku, Munzur University, T&#x000FC;rkiye</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Ibidun Christiana Obagbuwa <email>Ibidun.obagbuwa&#x00040;spu.ac.za</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>10</day>
<month>10</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>6</volume>
<elocation-id>1230087</elocation-id>
<history>
<date date-type="received">
<day>28</day>
<month>05</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>04</day>
<month>09</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2023 Morapedi and Obagbuwa.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Morapedi and Obagbuwa</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<sec>
<title>Background</title>
<p>Air pollution is among the most severe environmental and health problems. It arises from industrial emissions and atmospheric contamination driven by climate and traffic factors, fossil fuel combustion, and industrial characteristics. Because this is a global issue, several nations have established air pollution monitoring stations in various cities to track pollutants such as Nitrogen Dioxide (NO2), Ozone (O3), Sulfur Dioxide (SO2), Carbon Monoxide (CO), and Particulate Matter (PM2.5, PM10), and to notify inhabitants when pollution levels surpass the quality threshold. With the rise in air pollution, it is necessary to construct models that capture data on air pollutant concentrations. Compared to other parts of the world, Africa has a scarcity of reliable air quality sensors for monitoring and predicting Particulate Matter (PM2.5). This scarcity leaves room for extending research in air pollution control.</p></sec>
<sec>
<title>Methods</title>
<p>Machine learning techniques were used in this study to detect air pollution and were compared in terms of time, cost, and efficiency, so that different scenarios and systems can select the approach best suited to their needs. To assess and forecast the behavior of Particulate Matter (PM2.5), this study presents a machine learning approach that includes Cat Boost Regressor, Extreme Gradient Boosting Regressor, Random Forest Classifier, Logistic Regression, Support Vector Machine, K-Nearest Neighbor, and Decision Tree.</p></sec>
<sec>
<title>Results</title>
<p>Cat Boost Regressor and Extreme Gradient Boosting Regressor were implemented to predict the latest PM2.5 concentrations for South African cities with recording stations from past recordings. The better-performing of the two models was then used to predict PM2.5 concentrations for South African cities without recording stations and to predict future PM2.5 concentrations for South African cities. K-Nearest Neighbor, Logistic Regression, Support Vector Machine, Decision Tree, and Random Forest Classifier were implemented to build a system that predicts the Air Quality Index (AQI) status.</p></sec>
<sec>
<title>Conclusion</title>
<p>This study investigated various machine learning techniques to analyze and predict air pollution behavior in terms of air quality and air pollutants, and to detect which areas of South African cities are most affected.</p></sec></abstract>
<kwd-group>
<kwd>air pollution</kwd>
<kwd>pollutants</kwd>
<kwd>Particulate Matter (PM2.5)</kwd>
<kwd>air quality</kwd>
<kwd>machine learning</kwd>
<kwd>data analysis</kwd>
<kwd>health</kwd>
</kwd-group>
<contract-sponsor id="cn001">National Research Foundation<named-content content-type="fundref-id">10.13039/501100001321</named-content></contract-sponsor>
<counts>
<fig-count count="15"/>
<table-count count="4"/>
<equation-count count="8"/>
<ref-count count="26"/>
<page-count count="19"/>
<word-count count="7507"/>
</counts>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Medicine and Public Health</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1. Introduction</title>
<p>In recent years, rapid industrial growth has been accompanied by air pollution, which kills millions of people every year and has attracted widespread attention (Guo et al., <xref ref-type="bibr" rid="B9">2020</xref>). According to the World Health Organization (WHO), about 90% of people breathe contaminated air that does not meet WHO air quality criteria (Bekkar et al., <xref ref-type="bibr" rid="B6">2021</xref>; World Health Organization, <xref ref-type="bibr" rid="B23">2021</xref>). Air pollution is a worldwide health issue, causing respiratory disorders, lung problems, eye problems, and skin diseases in people and affecting the ability of plants and animals to thrive. As a result, air pollution control and prevention have become major concerns. Factory smoke, vehicle exhaust, and power plants are the primary causes of air quality degradation (Sultana, <xref ref-type="bibr" rid="B21">2019</xref>). Particulate matter (PM2.5 and PM10), O3, SO2, CO, and NO2 are the five categories of air pollutants (Mao et al., <xref ref-type="bibr" rid="B16">2021</xref>). PM2.5 is the most concerning air pollution component because these particles are small and light, so they can stay in the atmosphere longer and easily bypass the filters in the human nose and throat (Akiladevi et al., <xref ref-type="bibr" rid="B3">2020</xref>). PM2.5 is a standard air quality metric. However, it is usually measured with ground-based sensors (Jonathan et al., <xref ref-type="bibr" rid="B13">2020</xref>). Because of this growing attention, many researchers focus on air pollution, and numerous important papers have been published on it. Due to population and economic expansion, global energy consumption is steadily growing (Heydari et al., <xref ref-type="bibr" rid="B12">2021</xref>).</p>
<p>Traditional statistical approaches have frequently been applied to air quality forecasting problems. These approaches learn from historical data; however, owing to the complexity and variance of time-series data, they can produce poor estimates of air pollution. Several machine learning algorithms have been developed during the last 60 years to help resolve such complexity (Ameer et al., <xref ref-type="bibr" rid="B4">2019</xref>). Ensemble learning, MLR, SVM, RF, ANN, and other hybrid models are the primary machine learning approaches applied to air pollution (Bekkar et al., <xref ref-type="bibr" rid="B6">2021</xref>). However, most prediction approaches focus on model selection, and most current machine learning methods for air quality prediction do not analyze the reasons for changes in air pollutant concentrations (Ameer et al., <xref ref-type="bibr" rid="B4">2019</xref>). Furthermore, since contemporary deep learning frameworks are relatively adaptable, the model may need to be deep and sophisticated to match the dataset. As a result, the many weights in a deep neural network model may cause overfitting.</p>
<p>To assess and forecast the behavior of Particulate Matter (PM2.5), this study presents a machine learning approach that includes Cat Boost Regressor, Extreme Gradient Boosting Regressor, Random Forest Classifier, Logistic Regression, Support Vector Machine, K-Nearest Neighbor, and Decision Tree. It summarizes how these methods work so that the most suitable one can be chosen for a given requirement, and it forecasts air quality to raise public awareness of air quality degradation and its health effects.</p>
<p>The rest of the paper proceeds as follows: Section 2 presents the literature review, Section 3 presents the methodology used for the study, Section 4 presents the experiment and results, Section 5 shows the discussion of results, and Section 6 compares this work with existing research. Finally, Section 7 concludes the paper with a summary of the main points, future directions, and the study&#x00027;s limitations.</p></sec>
<sec id="s2">
<title>2. Literature review</title>
<p>According to Liao et al. (<xref ref-type="bibr" rid="B15">2020</xref>), there are no studies covering adequately long time intervals that include pollutant measurements from all sources, CTMs (Chemistry-Transport Models), data assimilation products, driving meteorological fields, and emission sources. As a result, progress will first require creating such extensive benchmark datasets for testing learning algorithms and designing deep network topologies. In their paper, they examined studies on methods such as RNN, LSTM, GRU, CNN (Convolutional Neural Network), SAE (Sparse Autoencoder), and DBN (Deep Belief Network) for air quality forecasts. They determined that dealing with meteorological factors and pollution measurements from ground-level monitoring networks limits deep-learning research for air quality forecasts. They also looked at attempts to use deep learning techniques to overcome the limitations of standard air quality forecasting methods that use CTMs or shallow statistical methods.</p>
<p>Ameer et al. (<xref ref-type="bibr" rid="B4">2019</xref>) compared four current methods for predicting air pollution in smart cities in their comparative analysis of machine learning techniques for predicting air quality. The methods were RF regression, GBR, DT (Decision Tree) regression, and MLP (Multi-Layer Perceptron) regression, and RF regression emerged as the best. They did not discuss data handling. Sultana compared air pollution detection techniques based on image processing, machine learning, and deep learning, evaluating how each of the three methods operates and is processed in air pollution detection (Sultana, <xref ref-type="bibr" rid="B21">2019</xref>). The comparison determined that the deep learning technique outperforms the other two in efficacy and accuracy. However, it requires a large dataset, and the total expenditure rises with the accuracy level. Data implementation was not discussed (Sultana, <xref ref-type="bibr" rid="B21">2019</xref>).</p>
<p>Guo et al. developed an EN model to forecast PM2.5 concentrations based on previous PM2.5 concentrations, meteorological data, and time stamp data. RNN, GRU, LSTM, and NN (Neural Network) were among the algorithms employed. Human activities and topographical data were missing from the study (Guo et al., <xref ref-type="bibr" rid="B9">2020</xref>). The findings showed that the suggested technique outperforms existing algorithms. Mao et al. used graph convolution and LSTM networks to build a hybrid deep learning framework for spatiotemporal modeling that forecasts various air contaminants (Mao et al., <xref ref-type="bibr" rid="B16">2021</xref>). Models such as MLR and LSTM networks were employed. The findings revealed that the spatial distribution of errors corresponds, to some extent, to the distribution of spatiotemporal correlation strength, highlighting the necessity of modeling spatiotemporal dependency for pollutant prediction. They did not discuss data implementation. Heydari et al. (<xref ref-type="bibr" rid="B12">2021</xref>) predicted and assessed air pollution from Combined Cycle Power Plants by creating a novel hybrid intelligence model based on the MVO (Multi-Verse Optimizer) algorithm and LSTM. They applied the method only to observe the correlation coefficient of NO2 and SO2 pollutants.</p>
<p>Xayasouk and Lee proposed a deep-learning-based technique for fine dust prediction. They used a deep-learning algorithm to construct a spatiotemporal prediction framework that considers the dataset&#x00027;s temporal and geographical relationships during the modeling process (Xayasouk and Lee, <xref ref-type="bibr" rid="B24">2018</xref>). To train and evaluate the data, they employed the Stacked Encoders model, which is unsuitable for learning and training the time series data (Xayasouk and Lee, <xref ref-type="bibr" rid="B24">2018</xref>). Bekkar et al. created a CNN-LSTM that can be used to estimate air quality and can efficiently conduct spatiotemporal prediction (Bekkar et al., <xref ref-type="bibr" rid="B6">2021</xref>). Deep learning models such as LSTM, CNN, GRU, CNN-GRU, CNN-LSTM, Bi-LSTM, and RNN were used (Bekkar et al., <xref ref-type="bibr" rid="B6">2021</xref>). According to the findings of that work, the model can efficiently extract temporal and spatial features using CNN and LSTM, and it also has excellent accuracy and stability. They did not discuss the processing time. Aarthi et al. (<xref ref-type="bibr" rid="B1">2020</xref>) used processed datasets to generate a function that plots the training and validation data for several models, including SV (Support Vector), Lasso, Linear, and DT regression. The authors found that their project raised public awareness, discussed the health effects of air quality degradation, and assisted environmentalists and the government in framing air quality standards and regulations based on hazardous and pathogenic air exposure and health-related risks to human welfare.
They used a decision tree in this experiment, which is not a suitable classifier for time series data (Aarthi et al., <xref ref-type="bibr" rid="B1">2020</xref>).</p>
<p>Aditya et al. (<xref ref-type="bibr" rid="B2">2018</xref>) suggested an approach that would assist ordinary people and meteorologists in detecting and forecasting pollution levels and responding appropriately. Logistic Regression and Autoregression were employed as machine-learning regression approaches. The approach would also assist individuals in establishing a data source for small towns, which are sometimes overlooked compared to major cities. Logistic Regression performed well on prediction, but the study did not explain the constraints (Aditya et al., <xref ref-type="bibr" rid="B2">2018</xref>). Balasubramanian et al. (<xref ref-type="bibr" rid="B5">2021</xref>) developed a technique to predict the AQI (Air Quality Index) for the following 5 h. They employed a Linear regression model, an SV regression model, and an RF regression model for data analysis (Balasubramanian et al., <xref ref-type="bibr" rid="B5">2021</xref>). When the RMSE (Root Mean Square Error) values of all the models were compared, the Stacking Ensemble model had the lowest value. As a result, this model was chosen to predict the AQI for the following 5 h. They did not thoroughly discuss data handling. Dobrea et al. developed a technique that estimates the levels of atmospheric pollutants (PM2.5 and PM10) (Dobrea et al., <xref ref-type="bibr" rid="B8">2020</xref>). Support Vector Regression, Autoregressive Integrated Moving Average, and LSTM are the models employed.
After a comparison of data analysis methods and machine learning algorithms for estimating atmospheric pollutants (PM10 and PM2.5), it was determined that the Support Vector Regression and ARIMA (Auto Regressive Integrated Moving Average) algorithms are the most suitable for forecasting air pollutant concentrations, with correlation coefficients of 96.6% and 92.1% for PM10 and PM2.5, respectively (Dobrea et al., <xref ref-type="bibr" rid="B8">2020</xref>). The experiment only focused on one factor of air pollution.</p>
<p>Akiladevi et al. (<xref ref-type="bibr" rid="B3">2020</xref>) proposed a technique for developing an air quality forecasting system that can predict the main contaminants in various locations. To assess performance on the dataset, ML (Machine Learning) methods such as LR (Linear Regression), NB (Na&#x000EF;ve Bayes), SVM, RF, KNN (K-Nearest Neighbor), and DT were used. Performance measures such as accuracy, recall, F1-score, specificity, and sensitivity were computed for each method, together with the confusion matrix parameters TP (True Positive), TN (True Negative), FP (False Positive), and FN (False Negative). LR had an accuracy of 98%, NB 95%, RF 99%, SVM 70%, KNN 97%, and DT 100%. Of these six ML algorithms, the Decision Tree approach had the best accuracy (Akiladevi et al., <xref ref-type="bibr" rid="B3">2020</xref>). Although the Decision Tree is not a good classifier for time series data, it performed well in this research. Bui et al. (<xref ref-type="bibr" rid="B7">2018</xref>) proposed a deep learning technique for air quality index predictions. The Encoder-Decoder paradigm was employed, as well as Long Short-Term Memory units. Based on historical meteorological data, their suggested model produced substantial results in predicting PM2.5 AQI for the long term. The accuracy was discussed but not the processing time.</p>
<p>Taylan et al. (<xref ref-type="bibr" rid="B22">2021</xref>) developed a method intended to minimize respiratory and cardiovascular deaths that is feasible, robust, and capable of evaluating the cumulative effect of pollutants inside metropolitan areas. They employed the Non-linear Autoregressive with External (NARX) Input and the Levenberg&#x02013;Marquardt (LM) Algorithm. They concluded that managing air pollution entails establishing capacity and monitoring ground-based networks and systems to make suitable strategic and operational decisions. Quality assurance and control, modeling methodologies, and institutional competencies are all required to implement these initiatives. The dataset used was limited.</p>
<p>Kalajdjieski et al. (<xref ref-type="bibr" rid="B14">2020</xref>) developed a data fusion method for using multi-modal data such as weather and pollution measurements obtained by sensors and image data collected by cameras. Basic Convolutional Neural Network, Residual Network Model, Inception Model, and Custom pre-trained Inception were among the predictive models tested. Their trials reveal that their bespoke pre-trained Inception model, paired with their data preparation strategy, outperforms known state-of-the-art approaches in accuracy (Kalajdjieski et al., <xref ref-type="bibr" rid="B14">2020</xref>). The model used was biased. Saleh et al. (<xref ref-type="bibr" rid="B20">2016</xref>) developed a model for predicting CO2 emissions from energy. The Support Vector Machine model was used. They concluded that a prediction model with good accuracy should produce a low RMSE (Root Mean Square Error) value. By monitoring energy use, the model can assist management in developing policies or making decisions to limit the negative environmental impact throughout the manufacturing process. The experiment only focused on CO2 (Saleh et al., <xref ref-type="bibr" rid="B20">2016</xref>).</p>
<p>Popa et al. developed a system model that forecasts temperature changes in a densely populated area of Bucharest, Romania. They employed LR, SVM with Gaussian kernel, and Gaussian process regression with the exponential kernel as well as other techniques (Popa et al., <xref ref-type="bibr" rid="B19">2021</xref>). They concluded that future studies might combine the current findings with camera photos to assess and anticipate air pollution in various large cities or establish a platform to provide traffic suggestions based on air pollution predictions. They only used linear methods for classification.</p>
<p>Based on the reviewed literature on machine learning applications in air pollution, and to the best of our knowledge, no work has analyzed and predicted air pollution in South Africa. Much work has been done in countries such as China (Moursi et al., <xref ref-type="bibr" rid="B18">2019</xref>; Guo et al., <xref ref-type="bibr" rid="B9">2020</xref>; Harishkumar et al., <xref ref-type="bibr" rid="B11">2020</xref>; Balasubramanian et al., <xref ref-type="bibr" rid="B5">2021</xref>; Bekkar et al., <xref ref-type="bibr" rid="B6">2021</xref>; World Health Organization, <xref ref-type="bibr" rid="B23">2021</xref>), India (Aditya et al., <xref ref-type="bibr" rid="B2">2018</xref>; Sultana, <xref ref-type="bibr" rid="B21">2019</xref>; Aarthi et al., <xref ref-type="bibr" rid="B1">2020</xref>; Akiladevi et al., <xref ref-type="bibr" rid="B3">2020</xref>; Masood and Ahmad, <xref ref-type="bibr" rid="B17">2020</xref>), Korea (Bui et al., <xref ref-type="bibr" rid="B7">2018</xref>; Xayasouk and Lee, <xref ref-type="bibr" rid="B24">2018</xref>; Yang et al., <xref ref-type="bibr" rid="B25">2020</xref>), and Iran (Zamani Joharestani et al., <xref ref-type="bibr" rid="B26">2019</xref>). The method proposed in this study analyzes and predicts the behavior of PM2.5, uses historical PM2.5 levels and correlation analysis to predict future PM2.5 levels in South African cities, and evaluates the models to find the one that performs best on the dataset.</p></sec>
<sec id="s3">
<title>3. Methodology</title>
<p>This study used Anaconda Navigator (Jupyter Notebook) on an AMD Ryzen 7 5700U computer with 8 GB of RAM and a 1.80 GHz processor with Radeon graphics. Python 3.6 was used for data cleaning, feature extraction, and the training and testing of the proposed machine learning models. This study aims to investigate various machine learning approaches to air pollution in order to analyze and predict air pollution behavior in terms of air quality and air pollutants (PM2.5), and to detect which areas of South African cities are most affected. All the graphs in this section were created using Python: the data were handled with Pandas, and the charts were plotted with Matplotlib and Seaborn.</p>
<sec>
<title>3.1. Air pollution methodology approach</title>
<p>This study aims to forecast the concentration of a particular pollutant (PM2.5) in South Africa. Most urban residents can suffer adverse effects from exposure to air pollutants such as PM2.5 in ambient air, and concern grows when pollutant concentrations exceed an air quality limit. The problem therefore focuses on determining whether the PM2.5 concentration surpasses a specific threshold, and several classification models are used for this purpose. The proposed models used other air pollutants and meteorological data gathered at various heights above the ground as initial features. Considering multiple time periods of these features yields many features, so the dimensionality of the data is reduced before the classification models are applied. A resampling technique is also used to handle the imbalanced dataset. A complete discussion of the evaluation metrics follows.</p></sec>
<sec>
<title>3.2. Data understanding</title>
<p>The dataset used is available at: <ext-link ext-link-type="uri" xlink:href="https://aqicn.org/data-platform/covid19/">https://aqicn.org/data-platform/covid19/</ext-link>. The statistics for each major city are compiled from the average (median) of several stations. For each air pollutant species, the dataset provides the minimum, maximum, median, and standard deviation, together with meteorological data. All air pollutant species are expressed according to the US EPA (United States Environmental Protection Agency) standard (i.e., no raw concentrations are reported). All dates are in UTC (Coordinated Universal Time). The count column lists the number of samples used to calculate the median and standard deviation. PM2.5 refers to fine inhalable particles with diameters of 2.5 micrometers or less. High levels of PM2.5 have been linked to respiratory problems and other harmful health consequences, and they can constitute a serious health risk to residents. PM2.5 is a standard air quality metric; however, it is usually measured with ground-based sensors. This dataset provides daily pollution estimates from January 2015 to February 2022 for 386 nations worldwide. The clusters in South African cities will be sampled from this dataset. The dataset includes Middelburg, Pretoria, East London, Johannesburg, Bloemfontein, Cape Town, Vereeniging, Durban, Klerksdorp, Richards Bay, Port Elizabeth, and Worcester, which are treated as the stations for South Africa. The estimates will be derived using a model that has been trained on previous data from pollution sensor sites. Several global layers will be used as inputs to the model, including data from Sentinel 5P and meteorological details. The additional global layers are also obtained from the same dataset, whose link is provided above.
To get the exact data for new locations, a dataset with a list of South African cities from <ext-link ext-link-type="uri" xlink:href="https://simplemaps.com/data/za-cities">https://simplemaps.com/data/za-cities</ext-link> is used, but the same process is repeated for other locations as well. The population centers are found using a custom Google Earth Engine script, available here: <ext-link ext-link-type="uri" xlink:href="https://code.earthengine.google.com/6dc3cd0c9cf91ba69592c5ce4c54ff55">https://code.earthengine.google.com/6dc3cd0c9cf91ba69592c5ce4c54ff55</ext-link>.</p>
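<p>As a rough illustration of the sampling step described above, the following Python sketch filters a small, made-up table shaped like the original dataset down to South African PM2.5 rows and attaches coordinates from a city list. The sample values, the approximate coordinates, the country code ZA, the specie label pm25, the lower-case column names, and the helper name sample_south_africa are assumptions made for illustration only; the labels and codes in the downloaded files may differ, and this is not the exact procedure used in the study.</p>
<preformat>
```python
import pandas as pd

# Made-up rows shaped like the original dataset; values are illustrative only.
raw = pd.DataFrame({
    "Date": ["2021-01-01", "2021-01-01", "2021-01-01", "2021-01-02"],
    "Country": ["ZA", "ZA", "CN", "ZA"],
    "City": ["Pretoria", "Pretoria", "Beijing", "Durban"],
    "Specie": ["pm25", "no2", "pm25", "pm25"],
    "median": [55.0, 12.0, 130.0, 42.0],
})

# Made-up city list shaped like the SimpleMaps file (approximate coordinates).
cities = pd.DataFrame({
    "city": ["Pretoria", "Durban"],
    "lat": [-25.75, -29.86],
    "lng": [28.19, 31.03],
})


def sample_south_africa(raw, cities, country="ZA", specie="pm25"):
    """Hypothetical helper: keep PM2.5 rows for one country and add coordinates."""
    subset = raw[raw["Country"] == country]
    subset = subset[subset["Specie"] == specie]
    subset = subset.rename(columns={"median": "Median_PM25"})
    # A left join keeps every recording even if a city has no coordinates.
    merged = subset.merge(cities, left_on="City", right_on="city", how="left")
    merged = merged.rename(columns={"lat": "Lat", "lng": "Long"})
    merged["Date"] = pd.to_datetime(merged["Date"])
    return merged[["Date", "City", "Median_PM25", "Lat", "Long"]]


sampled = sample_south_africa(raw, cities)
print(sampled)
```
</preformat>
<p>The returned columns follow the attribute names listed for the dataset after sampling (Date, City, Median_PM25, Lat, Long).</p>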
<p><xref ref-type="table" rid="T1">Table 1</xref> lists the attributes of the original dataset, of the dataset after sampling, and of the dataset containing the list of South African cities.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Attributes of dataset.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Attribute</bold></th>
<th valign="top" align="left"><bold>Description</bold></th>
</tr>
</thead>
<tbody>
<tr style="background-color:#dee1e1">
<td valign="top" align="left" colspan="2"><bold>Attributes of the original dataset</bold></td>
</tr> <tr>
<td valign="top" align="left">Date</td>
<td valign="top" align="left">Contains the date of the recorded concentration</td>
</tr> <tr>
<td valign="top" align="left">Country</td>
<td valign="top" align="left">Contains the country of the City of the recorded concentration</td>
</tr> <tr>
<td valign="top" align="left">City</td>
<td valign="top" align="left">Contains the City of the recorded concentration</td>
</tr> <tr>
<td valign="top" align="left">Specie</td>
<td valign="top" align="left">Contains the name of the pollutant (NO2, SO2, O3, CO, PM2.5, PM10)</td>
</tr> <tr>
<td valign="top" align="left">Min</td>
<td valign="top" align="left">Contains the minimum concentration of the pollutant on the given date</td>
</tr> <tr>
<td valign="top" align="left">Max</td>
<td valign="top" align="left">Contains the maximum concentration of the pollutant on the given date</td>
</tr> <tr>
<td valign="top" align="left">Median</td>
<td valign="top" align="left">Contains the median of the concentration of the pollutant on the given date</td>
</tr> <tr>
<td valign="top" align="left">Variance</td>
<td valign="top" align="left">Contains the variance of the concentration of the pollutant on the given date</td>
</tr> <tr style="background-color:#dee1e1">
<td valign="top" align="left" colspan="2"><bold>Attributes of the dataset after sampling</bold></td>
</tr> <tr>
<td valign="top" align="left">Date</td>
<td valign="top" align="left">Contains the date of the recorded concentration</td>
</tr> <tr>
<td valign="top" align="left">City</td>
<td valign="top" align="left">Contains the South African City of the recorded concentration</td>
</tr> <tr>
<td valign="top" align="left">Median_PM25</td>
<td valign="top" align="left">Contains the median concentration of the PM2.5</td>
</tr> <tr>
<td valign="top" align="left">Lat</td>
<td valign="top" align="left">Contains the latitude of the City given</td>
</tr> <tr>
<td valign="top" align="left">Long</td>
<td valign="top" align="left">Contains the longitude of the City given</td>
</tr> <tr style="background-color:#dee1e1">
<td valign="top" align="left" colspan="2"><bold>Attributes for a dataset with a list of South African cities</bold></td>
</tr> <tr>
<td valign="top" align="left">City</td>
<td valign="top" align="left">The name of the city/town</td>
</tr> <tr>
<td valign="top" align="left">Lat</td>
<td valign="top" align="left">The latitude of the city/town</td>
</tr> <tr>
<td valign="top" align="left">Lng</td>
<td valign="top" align="left">The longitude of the city/town</td>
</tr> <tr>
<td valign="top" align="left">Country</td>
<td valign="top" align="left">The name of the city/town&#x00027;s country</td>
</tr> <tr>
<td valign="top" align="left">Admin Name</td>
<td valign="top" align="left">The name of the highest-level administrative region of the city/town</td>
</tr> <tr>
<td valign="top" align="left">Population</td>
<td valign="top" align="left">An estimate of the city&#x00027;s urban population</td>
</tr> <tr>
<td valign="top" align="left">id</td>
<td valign="top" align="left">A 10-digit unique id generated by SimpleMaps</td>
</tr></tbody>
</table>
</table-wrap></sec>
<sec>
<title>3.3. Research design</title>
<p>This research follows a deductive approach, drawn from positivism, and uses an experimental design to carry out cluster analysis through the following steps:</p>
<list list-type="simple">
<list-item><p>(i) Data Pre-processing: Missing values, Label Encoding, Normalization.</p></list-item>
</list>
<p>Data pre-processing converts the raw data into a usable format, because real-world data are often incomplete, noisy, and inconsistent. The generalized Dataset undergoes pre-processing to handle missing, null, and duplicate values and to convert the data into numeric format.</p>
<p>Missing values are filled with the mean of the median PM2.5 values. Time Series Cross Validation is used to guard against overfitting and to evaluate model performance.</p>
<p><xref ref-type="fig" rid="F1">Figure 1</xref> shows the result for one of the cities after applying Time Series Cross Validation with five folds.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Applying time series cross validation.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-06-1230087-g0001.tif"/>
</fig>
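<p>The mean-imputation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this study: the function name <monospace>fill_missing_with_mean</monospace> and the sample values are hypothetical, and missing readings are assumed to be represented as <monospace>None</monospace>.</p>
<preformat><![CDATA[
```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Illustrative median PM2.5 readings only (not data from the study).
median_pm25 = [12.0, None, 18.0, 30.0, None]
filled = fill_missing_with_mean(median_pm25)
print(filled)
```
]]></preformat>
<p>Filling with a single global mean is simple, but it flattens seasonal variation; a per-city mean would be a natural variant of the same idea.</p>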
<list list-type="simple">
<list-item><p>(ii) Feature Selection: Air Quality Feature, Meteorological Feature, and Correlation Analysis. The study is quantitative, since it relies on numerical data and experiments.</p></list-item>
</list>
<p>The PM2.5 concentrations of the South African Cities are sampled from the original Dataset, then merged with the Meteorological Data and the population centers found using the location coordinates.</p>
<list list-type="simple">
<list-item><p>(iii) Data Split: Train Set and Test Set.</p></list-item>
</list>
<p>The Dataset was split into training and testing sets. A ratio of 80:20 is a common default, but in this system model the Dataset is split by date. The Train Set consists of the concentrations dated before &#x00027;01-01-2022&#x00027;, and the Test Set consists of those dated on or after &#x00027;01-01-2022&#x00027;.</p>
<p><xref ref-type="fig" rid="F2">Figure 2</xref> shows the split data for 2 of the 12 cities.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>How data was split before making predictions.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-06-1230087-g0002.tif"/>
</fig>
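<p>A date-based split of this kind can be sketched as below. The record layout (dictionaries with <monospace>date</monospace>, <monospace>city</monospace>, and <monospace>median_pm25</monospace> keys), the function name <monospace>split_by_date</monospace>, and the sample rows are assumptions made for illustration; only the cutoff date comes from the text.</p>
<preformat><![CDATA[
```python
from datetime import date

CUTOFF = date(2022, 1, 1)  # split date stated in the text


def split_by_date(records, cutoff=CUTOFF):
    """Return (train, test): rows dated before the cutoff vs. on or after it."""
    train = [r for r in records if r["date"] < cutoff]
    test = [r for r in records if r["date"] >= cutoff]
    return train, test


# Illustrative rows only (hypothetical values, not data from the study).
records = [
    {"date": date(2021, 12, 30), "city": "Durban", "median_pm25": 21.0},
    {"date": date(2021, 12, 31), "city": "Durban", "median_pm25": 24.0},
    {"date": date(2022, 1, 1), "city": "Durban", "median_pm25": 19.0},
    {"date": date(2022, 1, 2), "city": "Durban", "median_pm25": 22.0},
]
train, test = split_by_date(records)
print(len(train), len(test))
```
]]></preformat>
<p>Unlike a random 80:20 split, a chronological cutoff keeps every test observation later than every training observation, so the model is never evaluated on dates that precede its training data.</p>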
<list list-type="simple">
<list-item><p>(iv) Performance Evaluation</p></list-item>
</list>
<p>Models are trained on the Dataset using ML algorithms, including Cat Boost Regressor, Extreme Gradient Boosting Regressor, K-Nearest Neighbor, Logistic Regression, Support Vector Machine, Decision Tree, and Random Forest Classifier.</p>
<p>The performance measurement parameters used in this work are as follows:</p>
<list list-type="order">
<list-item><p>Precision:</p>
<p>Precision is defined as the ratio of true positives (TP) to the sum of true positives (TP) and false positives (FP).</p>
<p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>F</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p> </list-item>
<list-item><p>Recall:</p>
<p>Recall is defined as the ratio of true positives (TP) to the sum of true positives (TP) and false negatives (FN).</p>
<p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M2"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>R</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p> </list-item>
<list-item><p>F1-score:</p>
<p>The F1 score is defined as the harmonic mean of precision and recall.</p>
<p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M3"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>F</mml:mi><mml:mn>1</mml:mn><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mtable style="text-align:axis;" equalrows="false" columnlines="none none none none none none none none none" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>F</mml:mi><mml:mi>P</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p> </list-item>
<list-item><p>Specificity:</p>
<p>Specificity is defined as the ratio of true negatives (TN) to the sum of true negatives (TN) and false positives (FP).</p>
<p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M4"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>s</mml:mi><mml:mi>p</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>T</mml:mi><mml:mi>N</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>F</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p> </list-item>
<list-item><p>Sensitivity:</p>
<p>Sensitivity is the ratio of true positives (TP) to the sum of true positives (TP) and false negatives (FN); it is identical to recall.</p>
<p>
<disp-formula id="E5"><label>(5)</label><mml:math id="M5"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>v</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p> </list-item>
<list-item><p>Confusion matrix:</p>
<p>A confusion matrix is a table that describes the performance of a classification model on a test dataset for which the correct values are known.</p>
<p><inline-graphic xlink:href="frai-06-1230087-i0001.tif"/></p>
</list-item>
<list-item><p>Mean Square Error</p>
<p>The Mean Square Error (MSE) is the average squared difference between the actual values <italic>y</italic><sub><italic>i</italic></sub> and the predicted values <italic>&#x00177;</italic><sub><italic>i</italic></sub> over <italic>N</italic> observations.</p>
<p>
<disp-formula id="E6"><label>(6)</label><mml:math id="M6"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>M</mml:mi><mml:mi>S</mml:mi><mml:mi>E</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x00177;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p> </list-item>
<list-item><p>Mean Absolute Error</p>
<p>The Mean Absolute Error (MAE) is the average magnitude of the errors between the actual values <italic>y</italic><sub><italic>i</italic></sub> and the predicted values <italic>&#x00177;</italic><sub><italic>i</italic></sub>.</p>
<p>
<disp-formula id="E7"><label>(7)</label><mml:math id="M7"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>M</mml:mi><mml:mi>A</mml:mi><mml:mi>E</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x00177;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p> </list-item>
<list-item><p>Root Mean Square Error</p>
<p>The Root Mean Square Error (RMSE) is the square root of the MSE, so it expresses the typical difference between a statistical model&#x00027;s predicted and actual values in the same units as the target.</p>
<p>
<disp-formula id="E8"><label>(8)</label><mml:math id="M8"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>R</mml:mi><mml:mi>M</mml:mi><mml:mi>S</mml:mi><mml:mi>E</mml:mi><mml:mo>=</mml:mo><mml:msqrt><mml:mrow><mml:mi>M</mml:mi><mml:mi>S</mml:mi><mml:mi>E</mml:mi></mml:mrow></mml:msqrt><mml:mo>=</mml:mo><mml:msqrt><mml:mrow><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x00177;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:msqrt></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p> </list-item>
</list>
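<p>The performance measures listed above can be written directly from their definitions. The sketch below is illustrative only: the function names and the example counts and values are assumptions, not outputs of this study.</p>
<preformat><![CDATA[
```python
from math import sqrt


def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    # Sensitivity is computed in exactly the same way.
    return tp / (tp + fn)


def f1_score(tp, fp, fn):
    # Equivalent to the harmonic mean of precision and recall.
    return tp / (tp + 0.5 * (fp + fn))


def specificity(tn, fp):
    return tn / (tn + fp)


def mse(actual, predicted):
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    return sqrt(mse(actual, predicted))


# Hypothetical confusion-matrix counts and regression values.
tp, fp, fn, tn = 8, 2, 2, 8
print(precision(tp, fp), recall(tp, fn), f1_score(tp, fp, fn), specificity(tn, fp))
print(mse([3.0, 5.0], [1.0, 5.0]), mae([3.0, 5.0], [1.0, 5.0]))
```
]]></preformat>
<p>The first four measures apply to the classification models, while MSE, MAE, and RMSE apply to numeric predictions such as PM2.5 concentrations.</p>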
<list list-type="simple">
<list-item><p>(v) Training and Testing the Model</p></list-item>
</list>
<p>The XGB model was trained and tested using time series cross-validation with five splits, a test size of 150, and a gap of 24. The features were the day of the year, the day of the week, and lag variables; the target was the median PM2.5. The regressor base score was set to 0.5, the booster was the gradient boosting tree, and the model used 1,000 estimators, a maximum depth of three, and a learning rate of 0.01.</p>
<p><inline-graphic xlink:href="frai-06-1230087-i0002.tif"/></p>
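<p>To make the cross-validation settings concrete, the sketch below generates expanding-window train and test index ranges for five splits, a test size of 150, and a gap of 24, and builds a simple lag feature. It is a hand-written approximation, assumed to resemble an expanding-window splitter such as scikit-learn&#x00027;s <monospace>TimeSeriesSplit</monospace>; the function names, the series length of 1,000, and the sample values are hypothetical, and the study&#x00027;s own code is not reproduced here.</p>
<preformat><![CDATA[
```python
def time_series_splits(n_samples, n_splits=5, test_size=150, gap=24):
    """Yield (train_range, test_range) pairs for expanding-window validation.

    Each test block holds `test_size` consecutive samples; training data end
    `gap` samples before the test block starts.
    """
    first_test_start = n_samples - n_splits * test_size
    if first_test_start - gap <= 0:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        yield range(0, test_start - gap), range(test_start, test_start + test_size)


def lagged(values, lag):
    """Shift a series forward by `lag` steps, padding the start with None."""
    n = len(values)
    return [None] * min(lag, n) + list(values[: max(n - lag, 0)])


# Hypothetical series length; the study's actual length is not assumed here.
splits = list(time_series_splits(1000))
for train_idx, test_idx in splits:
    print(len(train_idx), test_idx.start, test_idx.stop)

print(lagged([1, 2, 3, 4], 1))
```
]]></preformat>
<p>The gap keeps the most recent training samples away from the test block, which matters when lag variables are used as features, since adjacent samples would otherwise share information across the boundary.</p>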
<list list-type="simple">
<list-item><p>(vi) Predictions.</p></list-item>
</list>
<list list-type="order">
<list-item><p>Predicting the latest PM2.5 concentrations for South African Cities with recording stations using past-dated recordings.</p></list-item>
<list-item><p>Predicting PM2.5 concentrations for South African Cities with no recording stations.</p></list-item>
<list-item><p>Predicting Future PM2.5 Concentrations for South African Cities.</p></list-item>
<list-item><p>Predicting the Air Quality Index (AQI) Status.</p></list-item>
</list></sec>
<sec>
<title>3.4. Data transformation</title>
<p>The clusters for South African cities were sampled from the original Dataset. The clustered Dataset includes the cities Middelburg, Pretoria, East London, Johannesburg, Bloemfontein, Cape Town, Vereeniging, Durban, Klerksdorp, Richards Bay, Port Elizabeth, and Worcester, which are treated as the recording stations for South Africa.</p>
<p><xref ref-type="fig" rid="F3">Figure 3</xref> shows how the clustered data looks by City, Date, and Month.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Clustered data by city, date, and month.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-06-1230087-g0003.tif"/>
</fig>
<p>From the clustered Dataset, only the PM2.5 data were selected and used for predictions. In <xref ref-type="fig" rid="F4">Figure 4</xref>, the right panel is the original map of South Africa with the cities included in the Dataset plotted. The left panel plots the Median_PM25 concentrations by the Longitude and Latitude of the South African Cities.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>South African cities locations based on the maps.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-06-1230087-g0004.tif"/>
</fig>
<p>The saved air quality measurements were augmented with satellite data via GEE (Google Earth Engine) to bring them into a state that is ready for modeling. The same augmentation yields the corresponding data for a new location, which is essential when making predictions for cities with no stations (those not included in the Dataset).</p></sec>
<sec>
<title>3.5. Modeling</title>
<p>The datasets were collected from different sites and were converted into a generalized format so that missing and null values could be handled. The ML algorithms are then applied to extract patterns and to identify the model with the highest accuracy. <xref ref-type="fig" rid="F5">Figure 5</xref> represents the complete workflow of the system modeling.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Workflow for system modeling.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-06-1230087-g0005.tif"/>
</fig>
<p>Cat Boost Regressor and Extreme Gradient Boosting Regressor were used to make PM2.5 predictions, and the better of the two was then selected to make PM2.5 predictions for the cities not included in the Dataset. The Static Variables and Time-series Data for those cities are added, and feature engineering is done during training. K-Nearest Neighbor, Logistic Regression, Support Vector Machine, Decision Tree, and Random Forest Classifier are used to predict the Air Quality Index status, that is, whether the air is &#x00027;Good, Moderate, Severe, Unhealthy, Very Unhealthy or Hazardous&#x00027;, based on the Median PM2.5. A system was created in which the user enters a PM2.5 value and receives the AQI Status as the result.</p></sec>
<sec>
<title>3.6. Hyperparameter tuning</title>
<p>The K-fold cross-validation for the Cat Boost Regressor is set to 5 splits, with 1,000 iterations. The loss function is Root Mean Square Error (RMSE), with 100 early stopping rounds and verbose set to false for the latest and future predictions. The verbosity of the XGB Regressor is set to zero for the latest and future predictions.</p></sec>
<sec>
<title>3.7. Performance evaluation</title>
<p>The two most frequently employed metrics are RMSE (Root Mean Square Error) and MAE (Mean Absolute Error), both of which are based on the discrepancy between the predicted result and the true value. Performance validation introduces bias when the data set is partitioned, trained, and tested only once: the results obtained from the testing dataset might no longer hold if the testing subset is changed.</p>
<p>RMSE (Root Mean Square Error) measures the difference between an estimator&#x00027;s predicted value and the actual value, and it indicates the magnitude of the error. MAE (Mean Absolute Error) is a measure of the errors between paired observations representing the same phenomenon. Sensitivity is the ratio of true positives to the sum of true positives and false negatives. Specificity is the ratio of true negatives to the sum of true negatives and false positives.</p></sec></sec>
<sec id="s4">
<title>4. Experiment and results</title>
<p>This section evaluates the models used for predicting PM2.5 concentrations for South African Cities.</p>
<sec>
<title>4.1. Cat boost regressor</title>
<p><xref ref-type="fig" rid="F6">Figure 6</xref> shows the Model Evaluation for the Cat Boost Regressor, which includes the data shape of the train and test data frames, the RMSE (Root Mean Square Error) for each of the five steps, and the overall mean RMSE of the five steps. It also shows the predictions of the Cat Boost Regressor on the Training data, which are saved under the column named &#x0201C;preds&#x0201D;.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Cat Boost Regressor model evaluation and predictions.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-06-1230087-g0006.tif"/>
</fig>
<sec>
<title>4.1.1. Cat boost actual PM2.5 vs. predicted PM2.5</title>
<p><xref ref-type="fig" rid="F7">Figure 7</xref> depicts the smoothed time series plot of the &#x00027;Predicted (orange) vs. Actual (blue)&#x00027; PM2.5 for the Johannesburg station. The linear regression plot shows that the predicted and actual values are close to each other, which indicates a good correlation.</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Time series and linear regression plots of Cat Boost Actual vs. Predicted PM2.5.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-06-1230087-g0007.tif"/>
</fig></sec></sec>
<sec>
<title>4.2. XGB (extreme gradient boosting) regressor</title>
<p><xref ref-type="fig" rid="F8">Figure 8</xref> shows the Model Evaluation for XGB, which includes the data shape of the train and test data frames, the RMSE (Root Mean Square Error) for each of the five steps, and the mean RMSE of the five steps. In addition, <xref ref-type="fig" rid="F8">Figure 8</xref> shows the predictions of the XGB on the Training data, which are saved under the column named &#x0201C;preds&#x0201D;.</p>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>XGB evaluation and predictions.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-06-1230087-g0008.tif"/>
</fig>
<sec>
<title>4.2.1. XGB actual PM2.5 vs. predicted PM2.5</title>
<p>The linear regression plot of <xref ref-type="fig" rid="F9">Figure 9</xref> shows an excellent correlation between the Actual (Median_PM25) and Predicted (Preds) PM2.5. Furthermore, <xref ref-type="fig" rid="F9">Figure 9</xref> shows the smoothed time series plot of the &#x00027;Predicted (orange) vs. Actual (blue)&#x00027; PM2.5 for the Klerksdorp station.</p>
<fig id="F9" position="float">
<label>Figure 9</label>
<caption><p>Time series and linear regression plots of XGB actual vs. predicted PM2.5.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-06-1230087-g0009.tif"/>
</fig></sec></sec>
<sec>
<title>4.3. Parameter analysis results</title>
<p><xref ref-type="table" rid="T2">Table 2</xref> shows the RMSE of both regression models used when training and testing the dataset. The CBR model performed better, with a lower RMSE (25.72) than the XGBR model (27.64).</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>RMSE of regression models used for predictions.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th/>
<th valign="top" align="center"><bold>CBR</bold></th>
<th valign="top" align="center"><bold>XGBR</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">RMSE</td>
<td valign="top" align="center">25.72</td>
<td valign="top" align="center">27.64</td>
</tr></tbody>
</table>
</table-wrap></sec>
<sec>
<title>4.4. Predictions on South African cities which were not included in the dataset</title>
<p><xref ref-type="fig" rid="F10">Figure 10</xref> shows the mean of the predicted PM2.5 concentrations for the cities that do not have stations. Cat Boost Regressor was used to make these predictions because it had the lower RMSE. These cities had no historical data, so the predictions are based on the recordings of other cities, in particular the neighboring ones. It was therefore best to use the more accurate model for these predictions.</p>
<fig id="F10" position="float">
<label>Figure 10</label>
<caption><p>Predictions of cities with no stations.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-06-1230087-g0010.tif"/>
</fig></sec>
<sec>
<title>4.5. Future predictions on South African cities</title>
<p>Each City&#x00027;s data is clustered from the data with the PM2.5 concentrations for all the cities to make future predictions (<xref ref-type="fig" rid="F11">Figure 11</xref>).</p>
<fig id="F11" position="float">
<label>Figure 11</label>
<caption><p>Clustering the city data.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-06-1230087-g0011.tif"/>
</fig>
<p><xref ref-type="fig" rid="F12">Figure 12</xref> shows the head and tail of the data frame Johannesburg_F_features, which contains the predicted PM2.5 concentrations for Johannesburg from 26 November 2022 to 31 December 2023.</p>
<fig id="F12" position="float">
<label>Figure 12</label>
<caption><p>Future PM2.5 predictions (data frame).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-06-1230087-g0012.tif"/>
</fig>
<p><xref ref-type="fig" rid="F13">Figure 13</xref> shows the Future Predictions of PM2.5 concentration from 26 November 2022 to 31 December 2023, using the XGB Model. Cat Boost and XGB achieved similar accuracy, with little difference between them. Therefore, both were used to make different predictions.</p>
<fig id="F13" position="float">
<label>Figure 13</label>
<caption><p>Future PM2.5 predictions of South African cities (Graph).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-06-1230087-g0013.tif"/>
</fig></sec>
<sec>
<title>4.6. Evaluating models used for predicting the air quality index status</title>
<p>From <xref ref-type="table" rid="T3">Table 3</xref>, Decision Tree and Random Forest have 100% accuracy in predicting the AQI Status. More data would be needed to check whether the accuracy remains the same when the data change.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Results from the models for predicting AQI status.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Model</bold></th>
<th valign="top" align="center"><bold>Accuracy (%)</bold></th>
<th valign="top" align="center"><bold>Sensitivity</bold></th>
<th valign="top" align="center"><bold>Specificity</bold></th>
<th valign="top" align="center"><bold>MAE</bold></th>
<th valign="top" align="center"><bold>MSE</bold></th>
<th valign="top" align="center"><bold>RMSE</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">RF</td>
<td valign="top" align="center">100.0</td>
<td valign="top" align="center">1.0</td>
<td valign="top" align="center">1.0</td>
<td valign="top" align="center">0.0</td>
<td valign="top" align="center">0.0</td>
<td valign="top" align="center">0.0</td>
</tr> <tr>
<td valign="top" align="left">LR</td>
<td valign="top" align="center">98.957</td>
<td valign="top" align="center">1.0</td>
<td valign="top" align="center">1.0</td>
<td valign="top" align="center">0.003862</td>
<td valign="top" align="center">0.003862</td>
<td valign="top" align="center">0.062144</td>
</tr> <tr>
<td valign="top" align="left">SVM</td>
<td valign="top" align="center">98.881</td>
<td valign="top" align="center">1.0</td>
<td valign="top" align="center">0.973</td>
<td valign="top" align="center">0.012040</td>
<td valign="top" align="center">0.012040</td>
<td valign="top" align="center">0.109727</td>
</tr> <tr>
<td valign="top" align="left">KNN</td>
<td valign="top" align="center">99.986</td>
<td valign="top" align="center">1.0</td>
<td valign="top" align="center">1.0</td>
<td valign="top" align="center">0.000227</td>
<td valign="top" align="center">0.000227</td>
<td valign="top" align="center">0.015072</td>
</tr> <tr>
<td valign="top" align="left">DT</td>
<td valign="top" align="center">100.0</td>
<td valign="top" align="center">1.0</td>
<td valign="top" align="center">1.0</td>
<td valign="top" align="center">0.0</td>
<td valign="top" align="center">0.0</td>
<td valign="top" align="center">0.0</td>
</tr></tbody>
</table>
</table-wrap>
<p><xref ref-type="fig" rid="F14">Figure 14</xref> shows the classification reports of the models used for predicting the AQI Status.</p>
<fig id="F14" position="float">
<label>Figure 14</label>
<caption><p>Classification reports of models used.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-06-1230087-g0014.tif"/>
</fig></sec>
<sec>
<title>4.7. Making prediction results for the AQI status</title>
<p><xref ref-type="fig" rid="F15">Figure 15</xref> depicts the AQI threshold, the AQI analysis function (defined based on the AQI Threshold), and the AQI status predictions, respectively. The status labels are encoded as &#x0201C;Good&#x0201D;: 0, &#x0201C;Moderate&#x0201D;: 1, &#x0201C;Severe&#x0201D;: 2, &#x0201C;Unhealthy&#x0201D;: 3, &#x0201C;Very Unhealthy&#x0201D;: 4, &#x0201C;Hazardous&#x0201D;: 5.</p>
<fig id="F15" position="float">
<label>Figure 15</label>
<caption><p>AQI threshold (<ext-link ext-link-type="uri" xlink:href="https://aqicn.org/data-platform/covid19/">https://aqicn.org/data-platform/covid19/</ext-link>); AQI analysis function; AQI status predictions.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-06-1230087-g0015.tif"/>
</fig>
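<p>A threshold-based AQI analysis function of the kind shown in Figure 15 might look like the sketch below. The label encoding follows the text, but the numeric breakpoints (50, 100, 150, 200, 300) are an assumption: the actual thresholds appear only in the figure image. They were chosen so that the PM2.5 input of 125.55 used in Figure 15 maps to class 2 (&#x0201C;Severe&#x0201D;). The names <monospace>UPPER_BOUNDS</monospace> and <monospace>aqi_status</monospace> are hypothetical.</p>
<preformat><![CDATA[
```python
from bisect import bisect_left

# Label encoding as given in the text: index == class code.
AQI_LABELS = ["Good", "Moderate", "Severe", "Unhealthy", "Very Unhealthy", "Hazardous"]

# ASSUMED inclusive upper bounds for classes 0..4; anything above is class 5.
# The real thresholds are shown only in Figure 15 and may differ.
UPPER_BOUNDS = [50, 100, 150, 200, 300]


def aqi_status(pm25):
    """Map a PM2.5 value to (class code, label) using the assumed bounds."""
    if pm25 < 0:
        raise ValueError("PM2.5 value must be non-negative")
    code = bisect_left(UPPER_BOUNDS, pm25)
    return code, AQI_LABELS[code]


print(aqi_status(125.55))
```
]]></preformat>
<p>Because the status is a deterministic function of the PM2.5 value, a classifier trained on labels produced this way is effectively learning the threshold boundaries.</p>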
<p>As shown in the AQI Status Predictions of <xref ref-type="fig" rid="F15">Figure 15</xref>, a PM2.5 input value is required to output a prediction of the AQI status. In <xref ref-type="fig" rid="F15">Figure 15</xref>, the value entered for the PM2.5 median was 125.55; each model produced its own predicted output, and all of them predicted class 2, which means that the air is severe.</p></sec></sec>
<sec id="s5">
<title>5. Discussion of results</title>
<p>The likelihood of PM2.5 surpassing the healthy level is predicted using regression models. Two regression models, Cat Boost Regressor and Extreme Gradient Boosting Regressor, were implemented for the PM2.5 prediction. These models achieved reasonably good accuracy scores of over 0.6 and were both correct over 0.9 of the time. To predict the Air Quality Status from a given PM2.5 level, five classification models were implemented: K-Nearest Neighbor, Logistic Regression, Support Vector Machine, Decision Tree, and Random Forest Classifier. All achieved an excellent accuracy score of over 0.98.</p>
<p>To assess the high-level relevance of the features, the mean RMSE of all models used is compared, and the actual values are compared to the predicted values. The lagged inputs played a significant role in predicting the PM2.5 and the AQI status, as many of them were selected and used by the models when predicting. According to the results, Cat Boost Regressor was the best model for predicting PM2.5. For AQI status, Random Forest Classifier and Decision Tree were equally the best.</p></sec>
<sec id="s6">
<title>6. Comparison of this work with existing research</title>
<p>In this study, SVM, Random Forest, and KNN performed better with accuracy of 98.88%, 100%, and 99.99%, respectively, compared to the same models by Akiladevi et al. (<xref ref-type="bibr" rid="B3">2020</xref>), which achieved the accuracy of 70%, 99%, and 97%, respectively. The Decision Tree performed best in both cases, with an accuracy of 100%.</p>
<p>Under cross-validation, the second fold of XGB had the highest RMSE, 39.86, which is higher than the 13.58 achieved by the XGB of Zamani Joharestani et al. (<xref ref-type="bibr" rid="B26">2019</xref>).</p>
<p>Gupta et al. (<xref ref-type="bibr" rid="B10">2023</xref>) reported an accuracy of 99.88% for CatBoost regression, 92.40% for SVM, and 91.99% for Decision Tree. In contrast, this study obtained an accuracy of 98.88% for SVM and 100% for Decision Tree; CatBoost regression was assessed here by RMSE (<xref ref-type="table" rid="T2">Table 2</xref>).</p>
<p>Overall, the classification models used in this work achieve higher accuracy on our datasets than similar models in existing works, although the XGB RMSE is higher than the one previously reported, as shown in <xref ref-type="table" rid="T4">Table 4</xref>.</p>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Comparing this work with existing work.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Models</bold></th>
<th valign="top" align="left"><bold>This study</bold></th>
<th valign="top" align="left"><bold>Akiladevi et al., <xref ref-type="bibr" rid="B3">2020</xref></bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">SVM</td>
<td valign="top" align="left">98.88%</td>
<td valign="top" align="left">70%</td>
</tr> <tr>
<td valign="top" align="left">Random forest</td>
<td valign="top" align="left">100%</td>
<td valign="top" align="left">99%</td>
</tr> <tr>
<td valign="top" align="left">KNN</td>
<td valign="top" align="left">99.99%</td>
<td valign="top" align="left">97%</td>
</tr> <tr>
<td valign="top" align="left">DT</td>
<td valign="top" align="left">100%</td>
<td valign="top" align="left">100%</td>
</tr> <tr>
<td valign="top" align="left">Models</td>
<td valign="top" align="left">This study</td>
<td valign="top" align="left">Zamani Joharestani et al., <xref ref-type="bibr" rid="B26">2019</xref></td>
</tr> <tr>
<td valign="top" align="left">XGB (RMSE)</td>
<td valign="top" align="left">39.86</td>
<td valign="top" align="left">13.58</td>
</tr> <tr>
<td valign="top" align="left">Models</td>
<td valign="top" align="left">This study</td>
<td valign="top" align="left">Gupta et al., <xref ref-type="bibr" rid="B10">2023</xref></td>
</tr> <tr>
<td valign="top" align="left">SVM</td>
<td valign="top" align="left">98.88%</td>
<td valign="top" align="left">92.40%</td>
</tr> <tr>
<td valign="top" align="left">Decision Tree</td>
<td valign="top" align="left">100%</td>
<td valign="top" align="left">91.99%</td>
</tr></tbody>
</table>
</table-wrap></sec>
<sec id="s7">
<title>7. Conclusion</title>
<p>This study focused on predicting the concentration of PM2.5 pollutants in South African cities. The proposed machine learning models are intended to forecast the probability that PM2.5 will exceed the established threshold. Meteorological data and PM2.5 features at various heights above the ground were taken into account. The forecasting ability of the models may be improved by incorporating additional features into Google Earth Engine that extract further meaningful information from the data. Higher forecast performance may also be possible if more extensive and reliable data become available. With a larger dataset, more complex models, such as deep learning techniques, may improve prediction accuracy.</p>
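<p>As a minimal illustration of the exceedance framing described above, the Python sketch below labels each PM2.5 value as above or below a threshold and then scores predicted labels against observed ones. The threshold of 15 &#x003BC;g/m<sup>3</sup>, the function names <monospace>exceeds</monospace> and <monospace>accuracy</monospace>, and all data values are assumptions made for this example only; they are not the threshold, code, or data used in this study, and the sketch produces hard labels rather than probabilities.</p>
<preformat>
```python
# Hypothetical threshold in micrograms per cubic metre (placeholder value,
# not the threshold used in the study).
THRESHOLD_UG_M3 = 15.0


def exceeds(concentrations, threshold=THRESHOLD_UG_M3):
    """Return 1 where a concentration is strictly above the threshold, else 0."""
    return [1 if c > threshold else 0 for c in concentrations]


def accuracy(true_labels, predicted_labels):
    """Return the fraction of positions where the two label lists agree."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    matches = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return matches / len(true_labels)


# Made-up observed PM2.5 values and made-up model outputs, for illustration.
observed = [8.2, 14.9, 15.1, 40.0]
predicted = [9.0, 16.0, 18.3, 35.5]

y_true = exceeds(observed)
y_pred = exceeds(predicted)
print("observed labels: ", y_true)
print("predicted labels:", y_pred)
print("accuracy:", accuracy(y_true, y_pred))
```
</preformat>
<p>With these made-up values the sketch reports an accuracy of 0.75, because the second predicted value crosses the placeholder threshold while the corresponding observed value does not.</p>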
<p>Several models were used. The regression models were the CatBoost Regressor and the Extreme Gradient Boosting Regressor, evaluated using the root mean square error (RMSE). The classification models were K-Nearest Neighbor, Logistic Regression, Support Vector Machine, Decision Tree, and Random Forest Classifier, which were compared using the mean square error (MSE), mean absolute error (MAE), and RMSE for predicting the Air Quality Index (AQI) status. The results show that the proposed hybrid model is more accurate than the individual models. The suggested method can be used in the future to forecast data from other cities. Prediction may also help to identify polluted areas and their root causes, which matters because some pollutants pose a severe threat to human health.</p></sec>
<sec id="s8">
<title>8. Future work</title>
<p>The data used in this investigation are static, although the source site offers daily updates. Future extensions of this work will consider real-time data analysis in the cloud to improve performance. Moreover, the models used in this work will be evaluated on further datasets covering nitrogen dioxide (NO2), ozone (O3), sulfur dioxide (SO2), and carbon monoxide (CO). Furthermore, deep learning and ensemble methods will be considered for PM2.5, PM10, and the other pollutants listed above.</p></sec>
<sec id="s9">
<title>9. Limitation</title>
<p>Not all South African cities were included in the dataset, because only cities with monitoring stations could be included. Although predictions could be made for the selected cities, a comparison across all cities in South Africa was not possible, since no readings are recorded for some cities.</p></sec>
<sec sec-type="data-availability" id="s10">
<title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.</p></sec>
<sec sec-type="author-contributions" id="s11">
<title>Author contributions</title>
<p>Study conception and design, analysis and interpretation of results, and draft manuscript preparation: IO and TM. Data collection: TM. All authors reviewed the results and approved the final version of the manuscript.</p></sec>
</body>
<back>
<sec sec-type="funding-information" id="s12">
<title>Funding</title>
<p>TM acknowledges the full financial support of the National Research Foundation (NRF) (Ref: MND211129652673).</p>
</sec>
<ack><p>The authors gratefully acknowledge the infrastructure support provided by Sol Plaatje University for this study.</p>
</ack>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s13">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Aarthi</surname> <given-names>A.</given-names></name> <name><surname>Gayathri</surname> <given-names>P.</given-names></name> <name><surname>Gomathi</surname> <given-names>N. R.</given-names></name> <name><surname>Kalaiselvi</surname> <given-names>S.</given-names></name> <name><surname>Gomathi</surname> <given-names>D. V.</given-names></name></person-group> (<year>2020</year>). <article-title>Air quality prediction through regression model</article-title>. <source>Int. J. Sci. Technol. Res.</source> <volume>9</volume>, <fpage>923</fpage>&#x02013;<lpage>928</lpage>.</citation></ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Aditya</surname> <given-names>C. R.</given-names></name> <name><surname>Deshmukh</surname> <given-names>C. R.</given-names></name> <name><surname>Nayana</surname> <given-names>D. K.</given-names></name> <name><surname>Vidyavastu</surname> <given-names>P. G.</given-names></name></person-group> (<year>2018</year>). <article-title>Detection and prediction of air pollution using machine learning models</article-title>. <source>Int. J. Engin. Trends Technol. (IJETT)</source>, <volume>59</volume>, <fpage>204</fpage>&#x02013;<lpage>207</lpage>. <pub-id pub-id-type="doi">10.14445/22315381/IJETT-V59P238</pub-id></citation></ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Akiladevi</surname> <given-names>R.</given-names></name> <name><surname>Devi</surname> <given-names>N.</given-names></name> <name><surname>Karthick</surname> <given-names>N.</given-names></name> <name><surname>Nivetha</surname> <given-names>P.</given-names></name></person-group> (<year>2020</year>). <article-title>Prediction and analysis of pollutants using supervised machine learning</article-title>. <source>Int. J. Recent Technol. Engin.</source> <volume>9</volume>, <fpage>50</fpage>&#x02013;<lpage>54</lpage>. <pub-id pub-id-type="doi">10.35940/ijrte.A2837.079220</pub-id></citation></ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ameer</surname> <given-names>S.</given-names></name> <name><surname>Shah</surname> <given-names>M. A.</given-names></name> <name><surname>Khan</surname> <given-names>A.</given-names></name> <name><surname>Song</surname> <given-names>H.</given-names></name> <name><surname>Maple</surname> <given-names>C.</given-names></name> <name><surname>Islam</surname> <given-names>S. U.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>Comparative analysis of machine learning techniques for predicting air quality in smart cities</article-title>. <source>IEEE Access</source> <volume>7</volume>, <fpage>128325</fpage>&#x02013;<lpage>128338</lpage>. <pub-id pub-id-type="doi">10.1109/ACCESS.2019.2925082</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Balasubramanian</surname> <given-names>S.</given-names></name> <name><surname>Talapala</surname> <given-names>S.</given-names></name> <name><surname>Vinushiya</surname> <given-names>B.</given-names></name> <name><surname>Saraswathi</surname> <given-names>S.</given-names></name></person-group> (<year>2021</year>). <article-title>Air pollution monitoring and prediction using IoT and machine learning</article-title>. <source>Int. J. Comp. Sci. Technol.</source> <volume>12</volume>, <fpage>60</fpage>&#x02013;<lpage>65</lpage>.</citation></ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bekkar</surname> <given-names>A.</given-names></name> <name><surname>Hssina</surname> <given-names>B.</given-names></name> <name><surname>Douzi</surname> <given-names>S.</given-names></name> <name><surname>Douzi</surname> <given-names>K.</given-names></name></person-group> (<year>2021</year>). <article-title>Air-pollution prediction in smart city, deep learning approach</article-title>. <source>J. Big Data</source> <volume>8</volume>, <fpage>1</fpage>&#x02013;<lpage>21</lpage>. <pub-id pub-id-type="doi">10.1186/s40537-021-00548-1</pub-id><pub-id pub-id-type="pmid">34956819</pub-id></citation></ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bui</surname> <given-names>T. C.</given-names></name> <name><surname>Le</surname> <given-names>V. D.</given-names></name> <name><surname>Cha</surname> <given-names>S. K.</given-names></name></person-group> (<year>2018</year>). <article-title>A deep learning approach for forecasting air pollution in South Korea using LSTM</article-title>. <source>arXiv preprint</source> arXiv:1804.07891.</citation></ref>
<ref id="B8">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Dobrea</surname> <given-names>M.</given-names></name> <name><surname>B&#x00103;dicu</surname> <given-names>A.</given-names></name> <name><surname>Barbu</surname> <given-names>M.</given-names></name> <name><surname>Subea</surname> <given-names>O.</given-names></name> <name><surname>B&#x00103;l&#x00103;nescu</surname> <given-names>M.</given-names></name> <name><surname>Suciu</surname> <given-names>G.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>&#x0201C;Machine Learning algorithms for air pollutants forecasting,&#x0201D;</article-title> in <source>2020 IEEE 26th International Symposium for Design and Technology in Electronic Packaging (SIITME)</source>. <publisher-name>IEEE</publisher-name>, <fpage>109</fpage>&#x02013;<lpage>113</lpage>. <pub-id pub-id-type="doi">10.1109/SIITME50350.2020.9292238</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Guo</surname> <given-names>C.</given-names></name> <name><surname>Liu</surname> <given-names>G.</given-names></name> <name><surname>Chen</surname> <given-names>C. H.</given-names></name></person-group> (<year>2020</year>). <article-title>Air pollution concentration forecast method based on the deep ensemble neural network</article-title>. <source>Wireless Commun. Mobile Comp.</source> <volume>2020</volume>, <fpage>1</fpage>&#x02013;<lpage>13</lpage>. <pub-id pub-id-type="doi">10.1155/2020/8854649</pub-id></citation></ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gupta</surname> <given-names>N. S.</given-names></name> <name><surname>Mohta</surname> <given-names>Y.</given-names></name> <name><surname>Heda</surname> <given-names>K.</given-names></name> <name><surname>Armaan</surname> <given-names>R.</given-names></name> <name><surname>Valarmathi</surname> <given-names>B.</given-names></name> <name><surname>Arulkumaran</surname> <given-names>G.</given-names></name> <etal/></person-group>. (<year>2023</year>). <article-title>Prediction of air quality index using machine learning techniques: a comparative analysis</article-title>. <source>J. Environ. Public Health</source> <volume>3</volume>, <fpage>2023</fpage>. <pub-id pub-id-type="doi">10.1155/2023/4916267</pub-id></citation></ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Harishkumar</surname> <given-names>K. S.</given-names></name> <name><surname>Yogesh</surname> <given-names>K. M.</given-names></name> <name><surname>Gad</surname> <given-names>I.</given-names></name></person-group> (<year>2020</year>). <article-title>Forecasting air pollution particulate matter (PM2.5) using machine learning regression models</article-title>. <source>Procedia Comput. Sci.</source> <volume>171</volume>, <fpage>2057</fpage>&#x02013;<lpage>2066</lpage>. <pub-id pub-id-type="doi">10.1016/j.procs.2020.04.221</pub-id></citation></ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Heydari</surname> <given-names>A.</given-names></name> <name><surname>Majidi Nezhad</surname> <given-names>M.</given-names></name> <name><surname>Astiaso Garcia</surname> <given-names>D.</given-names></name> <name><surname>Keynia</surname> <given-names>F.</given-names></name> <name><surname>De Santoli</surname> <given-names>L.</given-names></name></person-group> (<year>2021</year>). <article-title>Air pollution forecasting application based on deep learning model and optimization algorithm</article-title>. <source>Clean Technol. Environ. Policy</source> <volume>8</volume>, <fpage>1</fpage>&#x02013;<lpage>15</lpage>. <pub-id pub-id-type="doi">10.1007/s10098-021-02080-5</pub-id></citation></ref>
<ref id="B13">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Jonathan</surname> <given-names>W.</given-names></name> <name><surname>Yasin</surname> <given-names>A.</given-names></name> <name><surname>Amy</surname> <given-names>B.</given-names></name> <name><surname>Nikhil Kumar</surname> <given-names>M.</given-names></name> <name><surname>Karim</surname> <given-names>K.</given-names></name> <name><surname>Achraf</surname> <given-names>H.</given-names></name> <etal/></person-group> (<year>2020</year>). <source>Daily air quality estimates for urban centers in Africa. Zindi</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://catalogue.saeon.ac.za/records/10.15493/SARVA.301020-2">https://catalogue.saeon.ac.za/records/10.15493/SARVA.301020-2</ext-link> (accessed March 08, 2022).</citation></ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kalajdjieski</surname> <given-names>J.</given-names></name> <name><surname>Zdravevski</surname> <given-names>E.</given-names></name> <name><surname>Corizzo</surname> <given-names>R.</given-names></name> <name><surname>Lameski</surname> <given-names>P.</given-names></name> <name><surname>Kalajdziski</surname> <given-names>S.</given-names></name> <name><surname>Pires</surname> <given-names>I. M.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>Air pollution prediction with multi-modal data and deep neural networks</article-title>. <source>Remote Sens.</source> <volume>12</volume>, <fpage>4142</fpage>. <pub-id pub-id-type="doi">10.3390/rs12244142</pub-id></citation></ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liao</surname> <given-names>Q.</given-names></name> <name><surname>Zhu</surname> <given-names>M.</given-names></name> <name><surname>Wu</surname> <given-names>L.</given-names></name> <name><surname>Pan</surname> <given-names>X.</given-names></name> <name><surname>Tang</surname> <given-names>X.</given-names></name> <name><surname>Wang</surname> <given-names>Z.</given-names></name></person-group> (<year>2020</year>). <article-title>Deep learning for air quality forecasts: a review</article-title>. <source>Curr. Pollut. Rep.</source> <volume>6</volume>, <fpage>399</fpage>&#x02013;<lpage>409</lpage>. <pub-id pub-id-type="doi">10.1007/s40726-020-00159-z</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mao</surname> <given-names>W.</given-names></name> <name><surname>Jiao</surname> <given-names>L.</given-names></name> <name><surname>Wang</surname> <given-names>W.</given-names></name> <name><surname>Wang</surname> <given-names>J.</given-names></name> <name><surname>Tong</surname> <given-names>X.</given-names></name> <name><surname>Zhao</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>A hybrid integrated deep learning model for predicting various air pollutants</article-title>. <source>GIScience Remote Sens.</source> <volume>58</volume>, <fpage>1395</fpage>&#x02013;<lpage>1412</lpage>. <pub-id pub-id-type="doi">10.1080/15481603.2021.1988429</pub-id></citation></ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Masood</surname> <given-names>A.</given-names></name> <name><surname>Ahmad</surname> <given-names>K.</given-names></name></person-group> (<year>2020</year>). <article-title>A model for particulate matter (PM2.5) prediction for Delhi based on machine learning approaches</article-title>. <source>Procedia Comput. Sci.</source> <volume>167</volume>, <fpage>2101</fpage>&#x02013;<lpage>2110</lpage>. <pub-id pub-id-type="doi">10.1016/j.procs.2020.03.258</pub-id></citation></ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Moursi</surname> <given-names>A. S.</given-names></name> <name><surname>Shouman</surname> <given-names>M. A.</given-names></name> <name><surname>Hemdan</surname> <given-names>E. E. D.</given-names></name> <name><surname>El-Fishawy</surname> <given-names>N.</given-names></name></person-group> (<year>2019</year>). <article-title>PM2.5 concentration prediction for air pollution using machine learning algorithms</article-title>. <source>Menoufia J. Electron. Eng. Res</source>. <volume>28</volume>, <fpage>349</fpage>&#x02013;<lpage>354</lpage>. <pub-id pub-id-type="doi">10.21608/mjeer.2019.67375</pub-id></citation></ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Popa</surname> <given-names>C. L.</given-names></name> <name><surname>Dobrescu</surname> <given-names>T. G.</given-names></name> <name><surname>Silvestru</surname> <given-names>C. I.</given-names></name> <name><surname>Firulescu</surname> <given-names>A. C.</given-names></name> <name><surname>Popescu</surname> <given-names>C. A.</given-names></name> <name><surname>Cotet</surname> <given-names>C. E.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Pollution and weather reports: using machine learning for combating pollution in big cities</article-title>. <source>Sensors</source> <volume>21</volume>, <fpage>7329</fpage>. <pub-id pub-id-type="doi">10.3390/s21217329</pub-id><pub-id pub-id-type="pmid">34770634</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Saleh</surname> <given-names>C.</given-names></name> <name><surname>Dzakiyullah</surname> <given-names>N. R.</given-names></name> <name><surname>Nugroho</surname> <given-names>J. B.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Carbon dioxide emission prediction using support vector machine,&#x0201D;</article-title> in <source>IOP Conference Series: Materials Science and Engineering</source>. (<publisher-name>IOP Publishing</publisher-name>) <volume>114</volume>, <fpage>012148</fpage>. <pub-id pub-id-type="doi">10.1088/1757-899X/114/1/012148</pub-id></citation></ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sultana</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). <article-title>A comparison study of air pollution detection using image processing, machine learning, and deep learning approach</article-title>. <source>Global J. Comp. Sci. Technol</source>. <volume>19</volume>, <fpage>2019</fpage>.</citation></ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Taylan</surname> <given-names>O.</given-names></name> <name><surname>Alkabaa</surname> <given-names>A. S.</given-names></name> <name><surname>Alamoudi</surname> <given-names>M.</given-names></name> <name><surname>Basahel</surname> <given-names>A.</given-names></name> <name><surname>Balubaid</surname> <given-names>M.</given-names></name> <name><surname>Andejany</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Air quality modeling for sustainable clean environment using ANFIS and machine learning approaches</article-title>. <source>Atmosphere</source> <volume>12</volume>, <fpage>713</fpage>. <pub-id pub-id-type="doi">10.3390/atmos12060713</pub-id></citation></ref>
<ref id="B23">
<citation citation-type="book"><person-group person-group-type="author"><collab>World Health Organization</collab></person-group> (<year>2021</year>). <source>WHO Global Air Quality Guidelines: Particulate Matter (PM2.5 and PM10), Ozone, Nitrogen Dioxide, Sulfur Dioxide and Carbon Monoxide</source>. <publisher-loc>Geneva</publisher-loc>: <publisher-name>World Health Organization</publisher-name>.</citation></ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xayasouk</surname> <given-names>T.</given-names></name> <name><surname>Lee</surname> <given-names>H.</given-names></name></person-group> (<year>2018</year>). <article-title>Air pollution prediction system using deep learning</article-title>. <source>WIT Trans. Ecol. Environ</source>. <volume>230</volume>, <fpage>71</fpage>&#x02013;<lpage>79</lpage>. <pub-id pub-id-type="doi">10.2495/AIR180071</pub-id></citation></ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>G.</given-names></name> <name><surname>Lee</surname> <given-names>H.</given-names></name> <name><surname>Lee</surname> <given-names>G.</given-names></name></person-group> (<year>2020</year>). <article-title>A hybrid deep learning model to forecast particulate matter concentration levels in Seoul, South Korea</article-title>. <source>Atmosphere</source> <volume>11</volume>, <fpage>348</fpage>. <pub-id pub-id-type="doi">10.3390/atmos11040348</pub-id></citation></ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zamani Joharestani</surname> <given-names>M.</given-names></name> <name><surname>Cao</surname> <given-names>C.</given-names></name> <name><surname>Ni</surname> <given-names>X.</given-names></name> <name><surname>Bashir</surname> <given-names>B.</given-names></name> <name><surname>Talebiesfandarani</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). <article-title>PM2.5 predictions based on random forest, XGBoost, and deep learning using multisource remote sensing data</article-title>. <source>Atmosphere</source> <volume>10</volume>, <fpage>373</fpage>. <pub-id pub-id-type="doi">10.3390/atmos10070373</pub-id></citation></ref>
</ref-list> 
</back>
</article>