<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Physiol.</journal-id>
<journal-title>Frontiers in Physiology</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Physiol.</abbrev-journal-title>
<issn pub-type="epub">1664-042X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">1084420</article-id>
<article-id pub-id-type="doi">10.3389/fphys.2022.1084420</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Physiology</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Heart sound classification based on improved mel-frequency spectral coefficients and deep residual learning</article-title>
<alt-title alt-title-type="left-running-head">Li et al.</alt-title>
<alt-title alt-title-type="right-running-head">
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/fphys.2022.1084420">10.3389/fphys.2022.1084420</ext-link>
</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Li</surname>
<given-names>Feng</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Zhang</surname>
<given-names>Zheng</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Wang</surname>
<given-names>Lingling</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/2044604/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Liu</surname>
<given-names>Wei</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
</contrib-group>
<aff id="aff1">
<sup>1</sup>
<institution>Department of Computer Science and Technology</institution>, <institution>Anhui University of Finance and Economics</institution>, <addr-line>Bengbu</addr-line>, <addr-line>Anhui</addr-line>, <country>China</country>
</aff>
<aff id="aff2">
<sup>2</sup>
<institution>School of Information Science and Technology, University of Science and Technology of China</institution>, <addr-line>Hefei</addr-line>, <addr-line>Anhui</addr-line>, <country>China</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/555442/overview">Lisheng Xu</ext-link>, Northeastern University, China</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1676516/overview">Nivitha Varghees V</ext-link>., Jyothi Engineering College, India</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/356744/overview">Hong Tang</ext-link>, Dalian University of Technology, China</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Lingling Wang, <email>wll@aufe.edu.cn</email>
</corresp>
<fn fn-type="other">
<p>This article was submitted to Computational Physiology and Medicine, a section of the journal Frontiers in Physiology</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>22</day>
<month>12</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>13</volume>
<elocation-id>1084420</elocation-id>
<history>
<date date-type="received">
<day>30</day>
<month>10</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>12</day>
<month>12</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2022 Li, Zhang, Wang and Liu.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Li, Zhang, Wang and Liu</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>Heart sound classification plays a critical role in the early diagnosis of cardiovascular diseases. Although there have been many advances in heart sound classification in the last few years, most of them are still based on conventional segmented features and shallow structure-based classifiers. Therefore, we propose a new heart sound classification method based on improved mel-frequency cepstrum coefficient features and deep residual learning. Firstly, the heart sound signal is preprocessed, and its improved features are computed. Then, these features are used as input features of the neural network. The pathological information in the heart sound signal is further extracted by the deep residual network. Finally, the heart sound signal is classified into different categories according to the features learned by the neural network. This paper presents comprehensive analyses of different network parameters and network connection strategies. The proposed method achieves an accuracy of 94.43% on the dataset in this paper.</p>
</abstract>
<kwd-group>
<kwd>heart sound classification</kwd>
<kwd>cardiovascular</kwd>
<kwd>MFCC</kwd>
<kwd>deep learning</kwd>
<kwd>Resnet</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>Cardiovascular disease is a term used to describe a group of diseases, including coronary heart disease, cerebrovascular disease, and rheumatic heart disease. A patient&#x2019;s blood pressure, blood sugar, and lipid levels can be raised by fried foods, fast foods, alcohol, and tobacco, as well as weight gain and obesity, leading to premature death. Prevention of sudden death from cardiovascular disease can be achieved by finding groups at risk for cardiovascular disease and ensuring they receive the proper treatment. It is possible to reduce the risk of sudden death from cardiovascular disease by reducing alcohol consumption, reducing salt intake, eating more fruits and vegetables, and exercising more.</p>
<p>Heart sounds are produced by the rhythmic contraction and relaxation of the heart. The heart is the powerhouse of the body and its most critical organ, responsible for delivering blood to other organs to provide oxygen and other nutrients and for carrying away the end products of metabolism so that cells can maintain a normal physiological state. The heart has four chambers: the left atrium, left ventricle, right atrium, and right ventricle; the details of the heart structure are shown in <xref ref-type="fig" rid="F1">Figure 1</xref>. Atrioventricular valves prevent blood from flowing backward from the ventricles into the atria <xref ref-type="bibr" rid="B22">Li S. et al. (2020)</xref>.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>The structure of the human heart.</p>
</caption>
<graphic xlink:href="fphys-13-1084420-g001.tif"/>
</fig>
<p>A cardiac cycle spans the interval from one heartbeat to the next and can produce up to four heart sounds: the first, second, third, and fourth heart sounds. Screening for cardiovascular disease by heart sound auscultation is a simple, necessary, and effective method that has been used for over 180&#xa0;years <xref ref-type="bibr" rid="B24">Liu et al. (2016)</xref>. The first heart sound marks the beginning of ventricular systole and is characterized by long duration and high intensity. The second heart sound marks the beginning of ventricular diastole and is shorter and less intense. The third heart sound occurs after the second heart sound. It lasts between 0.04 and 0.05&#xa0;s and has a longer wavelength. It can be heard in most children and about half of young adults, and it does not necessarily indicate abnormality. The fourth heart sound is a long-wave sound that precedes the first heart sound and lasts for about 0.04&#xa0;s. It is a mechanical wave caused by the contraction of the atria and the rapid filling of the ventricles with blood, and it is also known as the atrial sound. In most healthy adults a faint fourth heart sound can be recorded on a phonocardiogram, although it is difficult to detect on general auscultation. Based on the patient&#x2019;s clinical condition, the physician records the four basic heart sounds and analyzes how they differ from the normal case. In clinical practice, it is typically difficult for physicians to determine a patient&#x2019;s condition by heart sound auscultation alone <xref ref-type="bibr" rid="B17">Jiang and Choi (2006)</xref>. Industrialization has made sophisticated machines standard medical tools, and phonocardiograms (PCG) are recorded using acoustic instruments to diagnose and treat patients.
With the growing use of PCG, applying signal processing and artificial intelligence techniques to extract physiological and pathological information from PCG data has become increasingly popular <xref ref-type="bibr" rid="B11">Herzig et al. (2014)</xref>. Benefiting from the development of deep learning in recent years <xref ref-type="bibr" rid="B12">Hinton and Salakhutdinov (2006)</xref>; <xref ref-type="bibr" rid="B43">Yu et al. (2013)</xref>; <xref ref-type="bibr" rid="B30">Ranzato et al. (2006)</xref>; <xref ref-type="bibr" rid="B5">Bengio (2009)</xref>; <xref ref-type="bibr" rid="B13">Hinton and Salakhutdinov (2012)</xref>; <xref ref-type="bibr" rid="B39">Vincent et al. (2010)</xref>; <xref ref-type="bibr" rid="B33">Silver et al. (2016)</xref>; <xref ref-type="bibr" rid="B27">Nair and Hinton (2010)</xref>, a new horizon has been opened for heart sound classification <xref ref-type="bibr" rid="B45">Zhang and Han (2017)</xref>. The convolutional neural network (CNN) is now a mature deep learning framework. It has become a widely used approach in computer vision because its convolutional layers learn local patterns in images. CNNs are also increasingly applied to biomedical signal classification and speech recognition by way of audio processing methods, such as transforming physiological signals into spectrograms. Recurrent neural networks (RNN) are a class of neural networks that specialize in processing sequential data. Gated recurrent units (GRU) and long short-term memory (LSTM) are improved versions of the RNN, and they provide state-of-the-art performance in many applications, including machine translation, speech recognition, and image captioning <xref ref-type="bibr" rid="B1">Abduh et al. (2019)</xref>. Heart sound signals are sequential data with strong temporal correlation, so RNNs are well suited to heart sound classification <xref ref-type="bibr" rid="B28">Nogueira et al.
(2019)</xref>; <xref ref-type="bibr" rid="B16">Ismail et al. (2022)</xref>; <xref ref-type="bibr" rid="B32">Sakib et al. (2019)</xref>. <xref ref-type="fig" rid="F2">Figure 2</xref> shows the waveform representation of the S1, S2, S3, and S4 sounds in the systole and diastole intervals.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>Waveform representation of S1, S2, S3, and S4 sounds in systole and diastole intervals, as shown in <xref ref-type="bibr" rid="B38">Varghees and Ramachandran (2014)</xref>.</p>
</caption>
<graphic xlink:href="fphys-13-1084420-g002.tif"/>
</fig>
<p>In addition, some environmental noise is inevitably collected during the acquisition of heart sounds, and this can greatly affect the classification accuracy of the model. It is therefore crucial to process the original heart sound signal through feature engineering before feeding it into the neural network for training. Several feature extraction methods are commonly used in heart sound classification tasks, including discrete wavelet transform (DWT) coefficients <xref ref-type="bibr" rid="B25">Mei et al. (2021)</xref> and Mel-frequency cepstral coefficients (MFCC) <xref ref-type="bibr" rid="B42">Yang and Hsieh (2016)</xref>. In this paper, MFCCs together with their first- and second-order difference coefficients are used as the input tensor of the neural network. This feature extraction method reduces the effect of noise on the results and allows the neural network to extract the physiological and pathological features in the heart sound signal, resulting in higher classification accuracy. Compared to traditional heart sound classification algorithms, deep learning techniques avoid the problems of manual intervention, complex processes, and poor generalization. <xref ref-type="bibr" rid="B19">Kui et al. (2021)</xref> combined Mel-frequency spectral coefficients (MFSC) and a CNN for the classification of heart sounds. <xref ref-type="bibr" rid="B23">Li et al. (2021)</xref> used short-time Fourier transform (STFT) based features as input to a CNN. <xref ref-type="bibr" rid="B37">Tschannen et al. (2016)</xref> used wavelet-based features and a CNN. <xref ref-type="bibr" rid="B21">Li F. et al. (2020)</xref> extracted 497 features from time series as input to a CNN. <xref ref-type="bibr" rid="B8">Er (2021)</xref> proposed Local Binary Pattern (LBP) and Local Ternary Pattern (LTP) features as input to a CNN. <xref ref-type="bibr" rid="B41">Wu et al. (2019)</xref> used MFCC as input to a CNN. The lack of large authoritative open heart sound datasets restricts the performance of such models.
To address this concern, this paper combines three of the most widely used heart sound datasets, which helps to improve the performance of the deep learning model. Although the above methods perform considerably better than traditional machine learning methods, most of them use shallow structures, and the features they use are insufficient to fully express the information in heart sounds. In this study, we select improved MFCC as input features to represent the static and dynamic characteristics of the heart sound signal more comprehensively. Additionally, we use a residual neural network, which alleviates vanishing gradients and degradation during training. <xref ref-type="fig" rid="F3">Figure 3</xref> summarizes the motivation of our study.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Motivation of the proposed method.</p>
</caption>
<graphic xlink:href="fphys-13-1084420-g003.tif"/>
</fig>
<p>The rest of the paper is structured as follows. <xref ref-type="sec" rid="s2">Section 2</xref> discusses recent research trends and essential methods related to heart sound classification. <xref ref-type="sec" rid="s3">Section 3</xref> describes in detail the preprocessing and feature engineering of heart sound audio, introduces the deep residual neural network structure used in this paper, and analyzes the key principles of convolution and residual learning. In <xref ref-type="sec" rid="s4">Section 4</xref>, we describe the three datasets used in this paper in detail. We hold out 20% of the dataset as the testing set, and all reported metrics are computed on the testing set. Additionally, we compare MFCC, &#x25b5;MFCC, &#x25b5;<sup>2</sup>MFCC, and improved MFCC to explain what the improvements are. To better show the advantages of the proposed method, RNNs and CNNs are used for comparison, and we show the models&#x2019; loss and accuracy during training. We also list results from other published methods for comparison. <xref ref-type="sec" rid="s5">Section 5</xref> summarizes our study and concludes that the proposed method is feasible for the heart sound classification task.</p>
</sec>
<sec id="s2">
<title>2 Related work</title>
<p>At present, heart sound auscultation is one of the leading clinical tools for diagnosing cardiovascular diseases. It is non-invasive, efficient, and convenient, and it provides physiological and pathological information about the heart. However, clinical conditions are complex and noisy, and less experienced physicians are often disturbed by environmental noise, which results in inaccurate diagnoses. In 1929, the German doctor Werner used a catheter to deliver drugs to the heart, opening the door to the use of physical models to study cardiovascular disease. In the 1970s, Dr. Marcus in the United States used angiography to observe the causes of cardiovascular disease, overturning long-held misconceptions about heart disease. In the 1980s, the earliest cardiac defibrillators came into clinical use at Johns Hopkins University, and the earliest telemetry systems were developed so that doctors could observe the vital signs of heart disease patients from a distance. In recent years, devices such as integrated ECG and heart sound analyzers and intelligent electronic stethoscopes have been put into clinical use, but because of unavoidable factors during use, the collected heart sound signals contain various types of noise to varying degrees, which affects the final diagnostic results. At present, digital filters, wavelet decomposition, and empirical mode decomposition are widely used to denoise heart sound signals. With the rise of artificial intelligence, big data, and related technologies, more accurate and effective heart sound detection methods are expected to be realized.</p>
<p>The dataset is one of the fundamental factors affecting the results, and heart sound classification is no exception. In general, the larger the dataset and the more representative its distribution, the better overfitting can be avoided and the better the model generalizes. According to a survey <xref ref-type="bibr" rid="B26">Milani et al. (2022)</xref>, using deep learning techniques for heart sound classification tasks remains challenging due to the lack of a large authoritative open heart sound dataset. In this paper, the PhysioNet heart sound dataset <xref ref-type="bibr" rid="B24">Liu et al. (2016)</xref>, the Pascal heart sound dataset <xref ref-type="bibr" rid="B9">Gomes et al. (2013)</xref>, and the Yaseen heart sound dataset <xref ref-type="bibr" rid="B34">Son and Kwon (2018)</xref> were used to construct a more extensive, less noisy, and more reliable heart sound dataset. An imbalance between positive and negative samples can also affect the performance of the model. Suppose the distribution of positive and negative samples in the feature space is unbalanced. When the neural network learns the mapping, predicting the label with more samples yields a lower loss in most regions of the feature space. Eventually the model fails, because its predictions concentrate on the labels with more samples. That is, the model has very high accuracy on the training set but low accuracy on the validation and test sets, which significantly reduces its generalizability. To solve such problems, researchers usually resample the heart sound data and perform slicing operations <xref ref-type="bibr" rid="B3">Baghel et al. (2020)</xref>; <xref ref-type="bibr" rid="B4">Baydoun et al. (2020)</xref> to balance the samples across labels. <xref ref-type="bibr" rid="B40">Wang et al.
(2021)</xref> used a weighted improvement of the classifier to reduce the impact of the unbalanced dataset on training. In this paper, the heart sound audio is cut during pre-processing and the classes with fewer samples are augmented to avoid the problem of sample imbalance.</p>
<p>In general, a task can be framed as binary classification, multi-class classification, or regression, and how the task is framed also affects the results to some extent. For sequence data with considerable background noise, such as heart sounds, the impact of the acquisition process on the true heart sounds must be considered according to the actual situation of the dataset. In current studies of heart sound classification, most tasks are binary, distinguishing normal from abnormal heart sounds. Few experiments have classified specific conditions, such as aortic stenosis and mitral valve insufficiency, based on medical knowledge. <xref ref-type="bibr" rid="B6">Demir et al. (2019)</xref> used deep convolutional neural networks to perform a four-class task on a Kaggle dataset, and <xref ref-type="bibr" rid="B29">Oh et al. (2020)</xref> performed a five-class task on a heart sound dataset. This paper uses heart sound datasets from three different platforms and takes into account the noise that hardware limitations inevitably introduce during acquisition. Since the heart sound cannot be identified in some recordings, a three-class task is performed with the labels normal, abnormal, and noisy. This choice of task is closer to the actual situation, and it also helps to improve the accuracy and practical applicability of heart sound classification.</p>
<p>Many researchers have used deep learning techniques to solve heart sound classification problems. <xref ref-type="bibr" rid="B19">Kui et al. (2021)</xref> investigated the effect of the discrete cosine transform (DCT) on classification results during MFCC extraction. The Mel-frequency spectral coefficient (MFSC) is an intermediate result of MFCC extraction that omits the DCT step. A CNN is essentially a non-linear transformation of the data, whereas the DCT is a linear transformation that can discard pathological information in the heart sound signal, so MFSC is a feasible input for heart sound classification with deep learning techniques. <xref ref-type="bibr" rid="B18">Krishnan et al. (2020)</xref> obtained an accuracy of 85.74% by directly using the unsegmented PCG signal as the input to a CNN. <xref ref-type="bibr" rid="B44">Zeinali and Niaki (2022)</xref> used a heart sound audio signal processing algorithm to convert one-dimensional temporal features into two-dimensional spectral features. This method achieved 87.0% accuracy in a three-class heart sound classification task. <xref ref-type="bibr" rid="B35">Tian et al. (2022)</xref> trained a neural network directly on raw data from the PhysioNet dataset, without feature engineering, to perform a binary classification of PCG into normal and abnormal heart sounds. <xref ref-type="bibr" rid="B40">Wang et al. (2021)</xref> extracted five classes of features by segmenting the PCG signal, used a recursive feature elimination method to obtain suitable input features, and proposed a combination of XGBoost and LSTM for heart sound classification, obtaining an accuracy of 90.0% on the test set. <xref ref-type="bibr" rid="B23">Li et al. (2021)</xref> segmented the original heart sound signal and then calculated its frequency-domain features by short-time Fourier transform. For training, they proposed a 2D-CNN and achieved an accuracy of 85.70%.
<xref ref-type="bibr" rid="B8">Er (2021)</xref> extracted local binary pattern (LBP) and local ternary pattern (LTP) features of heart sounds and trained a 1D-CNN on them, with an accuracy of 90% on the PhysioNet dataset. <xref ref-type="bibr" rid="B31">Ren et al. (2022)</xref> used the attention mechanism to explore an interpretable heart sound classification algorithm for a three-class task on the PhysioNet dataset and obtained an unweighted average recall of 51.2%. <xref ref-type="bibr" rid="B15">Iqtidar et al. (2021)</xref> obtained 98.3% accuracy on a binary heart sound classification problem using MFCC-based 1D adaptive local ternary patterns and a support vector machine. <xref ref-type="bibr" rid="B20">Lahmiri and Bekiros (2022)</xref> used the discrete wavelet transform with a support vector machine tuned by Bayesian optimization and obtained 89.26% accuracy. In the heart sound classification tasks mentioned above, neural networks with MFCC-based features perform better. To further strengthen the ability of MFCC features to express heart sound signals, this paper calculates first-order and second-order difference coefficients, which express the dynamic properties of heart sound signals.</p>
</sec>
<sec id="s3">
<title>3 Proposed methodology</title>
<p>This section describes the heart sound classification algorithm proposed in this paper in three parts. The first step is dataset fusion, in which the original heart sounds are filtered, downsampled, and cut. The second step is feature engineering, which extracts the standard MFCC, first-order MFCC, and second-order MFCC and fuses them into input feature vectors. In the third step, a deep residual neural network is constructed, and the feature vectors are input for training. Finally, the test samples are predicted using the trained model, and the accuracy is computed. <xref ref-type="fig" rid="F4">Figure 4</xref> shows the workflow of this paper. The innovation of the methodology is threefold: 1) using authoritative heart sound datasets from three different sources, which helps to improve the performance of the deep learning model; 2) selecting improved MFCC as input features to represent the static and dynamic characteristics of the heart sound signal more comprehensively; 3) using a residual neural network, which alleviates vanishing gradients and degradation during training.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Flow chart of the proposed method.</p>
</caption>
<graphic xlink:href="fphys-13-1084420-g004.tif"/>
</fig>
<sec id="s3-1">
<title>3.1 Dataset fusion</title>
<p>The datasets selected in this paper use different labeling standards. Before the data are fed into the neural network, the labels must be unified and all files must be pre-processed. Making full use of heart sound datasets from different sources helps to further improve the generalization of the model. Based on the label types of the datasets, this paper divides the labels of the fused heart sound data into three categories: normal, abnormal, and noise.</p>
<sec id="s3-1-1">
<title>3.1.1 Digital filtering</title>
<p>When heart sound audio is collected, hardware limitations and the background environment inevitably introduce noise into the recording. To reduce the impact of noise on neural network training, this paper filters the heart sound audio. To preserve the low-frequency components of heart sounds, which contain important physiological information, the audio is passed through a fifth-order 400&#xa0;Hz Butterworth low-pass filter to filter out the high-frequency murmurs in the heart sound signal.</p>
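<p>The filtering step can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this paper: it assumes NumPy and SciPy are available, the function name <monospace>lowpass_heart_sound</monospace> is hypothetical, and zero-phase (forward-backward) filtering is an assumption, since the text does not state how the filter is applied.</p>
<preformat><![CDATA[
```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def lowpass_heart_sound(signal, fs, cutoff_hz=400.0, order=5):
    """Fifth-order Butterworth low-pass filter; fs is the sampling rate in Hz."""
    # Second-order sections are numerically more robust than (b, a) coefficients.
    sos = butter(order, cutoff_hz, btype="low", fs=fs, output="sos")
    # Assumption: forward-backward filtering, which gives zero phase shift.
    return sosfiltfilt(sos, signal)


# Synthetic check: a 50 Hz tone (kept) plus a 900 Hz tone (removed).
fs = 2000
t = np.arange(0, 2.0, 1.0 / fs)
low = np.sin(2 * np.pi * 50 * t)
high = np.sin(2 * np.pi * 900 * t)
filtered = lowpass_heart_sound(low + high, fs)
```
]]></preformat>
<p>The cutoff must lie below half the sampling rate of the recording, so for a source sampled at 2000&#xa0;Hz the 400&#xa0;Hz cutoff is valid.</p>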
</sec>
<sec id="s3-1-2">
<title>3.1.2 Down sampling</title>
<p>To reduce the computational complexity of the model and to ensure that heart sound data from different sources generate feature maps of the same size in the subsequent feature engineering, all audio signals are down-sampled to 2000&#xa0;Hz.</p>
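<p>A possible way to resample recordings from different sources to a common rate is sketched below. It assumes SciPy&#x2019;s polyphase resampler, which applies its own anti-aliasing filter; the paper does not state which resampling method was used, and the function name <monospace>downsample</monospace> and the example input rates are illustrative.</p>
<preformat><![CDATA[
```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_FS = 2000  # target sampling rate in Hz, as stated in the text


def downsample(signal, fs_in, fs_out=TARGET_FS):
    """Resample `signal` from fs_in to fs_out (both integer rates in Hz)."""
    g = gcd(int(fs_in), int(fs_out))
    up, down = int(fs_out) // g, int(fs_in) // g
    # Polyphase resampling by the rational factor up/down.
    return resample_poly(signal, up, down)


# One second of synthetic audio at two hypothetical source rates.
y_8k = downsample(np.random.default_rng(0).standard_normal(8000), 8000)
y_44k = downsample(np.zeros(44100), 44100)
```
]]></preformat>
<p>Both outputs contain 2000 samples, so one second of audio has the same length regardless of its source rate.</p>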
</sec>
<sec id="s3-1-3">
<title>3.1.3 Audio cutting</title>
<p>Because the heart sound recordings differ significantly in length, this paper cuts the audio into 2&#xa0;s units, which unifies the audio length while using as much of the existing audio as possible. On the other hand, pathological features in heart sound audio have a strong temporal correlation, and recordings that are too short can hardly express them, so this paper discards heart sound audio shorter than 2&#xa0;s.</p>
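<p>The cutting rule can be illustrated with the short sketch below. It assumes non-overlapping 2&#xa0;s segments and drops any trailing remainder shorter than 2&#xa0;s; the text does not say whether segments overlap, so this is one plausible reading, and the function name <monospace>cut_into_segments</monospace> is hypothetical.</p>
<preformat><![CDATA[
```python
import numpy as np


def cut_into_segments(signal, fs=2000, segment_s=2.0):
    """Split a 1-D signal into non-overlapping segments of segment_s seconds."""
    seg_len = int(round(fs * segment_s))
    n_segments = len(signal) // seg_len
    if n_segments == 0:
        # Recording shorter than one segment: it is discarded.
        return np.empty((0, seg_len), dtype=float)
    usable = np.asarray(signal, dtype=float)[: n_segments * seg_len]
    return usable.reshape(n_segments, seg_len)


segments = cut_into_segments(np.arange(9000))    # 4.5 s at 2000 Hz
too_short = cut_into_segments(np.arange(3000))   # 1.5 s at 2000 Hz
```
]]></preformat>
<p>Here the 4.5&#xa0;s recording yields two segments of 4000 samples each, and the 1.5&#xa0;s recording yields none.</p>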
</sec>
</sec>
<sec id="s3-2">
<title>3.2 Feature engineering</title>
<p>In most cases, deep learning models cannot learn from completely arbitrary data, so it is essential to extract heart sound features through hard-coded feature engineering. To obtain an effective representation of the pathological features of cardiovascular disease, this paper uses an improved feature extraction algorithm based on MFCC <xref ref-type="bibr" rid="B7">Deng et al. (2020)</xref>. The human ear&#x2019;s perception of frequency is logarithmic: it is sensitive to changes in low-frequency bands and insensitive to changes in high-frequency bands. Using linearly distributed spectrograms in feature engineering therefore degrades the model&#x2019;s performance. MFCC reflects the non-linear relationship between the human ear&#x2019;s perception and the sound frequency, which helps to extract the pathological features in the heart sound audio. The Mel scale on which MFCC is based maps a frequency <italic>f</italic> (in Hz) as follows<disp-formula id="e1">
<mml:math id="m1">
<mml:mi mathvariant="normal">M</mml:mi>
<mml:mi mathvariant="normal">e</mml:mi>
<mml:mi mathvariant="normal">l</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>2595</mml:mn>
<mml:mi>lg</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>f</mml:mi>
<mml:mo>/</mml:mo>
<mml:mn>700</mml:mn>
</mml:mrow>
</mml:mfenced>
</mml:math>
<label>(1)</label>
</disp-formula>where lg is defined as the base 10 logarithm.</p>
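<p>As a quick numerical check of Eq. 1, the mapping can be written in a few lines of Python. The inverse function <monospace>mel_to_hz</monospace> is not given in the text; it is obtained here by solving Eq. 1 for <italic>f</italic> and is included only for illustration.</p>
<preformat><![CDATA[
```python
import math


def hz_to_mel(f_hz):
    """Eq. 1: Mel(f) = 2595 * lg(1 + f / 700), with lg the base-10 logarithm."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of Eq. 1, derived algebraically (not stated in the text)."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


mel_400 = hz_to_mel(400.0)    # Mel value at the 400 Hz filter cutoff
mel_1000 = hz_to_mel(1000.0)  # Mel value at the 1000 Hz Nyquist frequency
```
]]></preformat>
<p>Equal steps on the Mel scale correspond to progressively wider steps in Hz, which is why Mel filter banks are dense at low frequencies and sparse at high frequencies.</p>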
<sec id="s3-2-1">
<title>3.2.1 Signal pre-emphasis</title>
<p>In heart sound signals, the high-frequency components generated by cardiovascular activity are weak, while the low-frequency components are strong. This can be explained physically: as sound energy propagates through a medium, higher frequencies are attenuated more easily. Pre-emphasis compensates for this loss of high frequencies and preserves the original heart sound signal. In this paper, the heart sound signal is passed through a high-pass filter to narrow the intensity gap between the high- and low-frequency components of the signal. The operation applied to the signal x[n] is as follows<disp-formula id="e2">
<mml:math id="m2">
<mml:mi>y</mml:mi>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>x</mml:mi>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>&#x3b1;</mml:mi>
<mml:mi>x</mml:mi>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfenced>
</mml:math>
<label>(2)</label>
</disp-formula>where <italic>&#x3b1;</italic> usually takes a value close to 1.</p>
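<p>Eq. 2 can be sketched in Python as below. The value <italic>&#x3b1;</italic> = 0.97 is an assumption chosen only because it is a common choice close to 1; the text does not give the value used. Keeping the first sample unchanged is also an assumption, since x[n &#x2212; 1] is undefined for n = 0.</p>
<preformat><![CDATA[
```python
import numpy as np


def pre_emphasis(x, alpha=0.97):
    """Eq. 2: y[n] = x[n] - alpha * x[n - 1]."""
    x = np.asarray(x, dtype=float)
    if x.size == 0:
        return x
    y = np.empty_like(x)
    y[0] = x[0]                      # assumption: no previous sample at n = 0
    y[1:] = x[1:] - alpha * x[:-1]   # first-order high-pass difference
    return y


# A constant (zero-frequency) signal is almost entirely suppressed.
flat = pre_emphasis([1.0, 1.0, 1.0, 1.0])
```
]]></preformat>
<p>After the first sample, the constant input is reduced to 1 &#x2212; <italic>&#x3b1;</italic>, which shows how the filter attenuates slowly varying components relative to rapidly varying ones.</p>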
</sec>
<sec id="s3-2-2">
<title>3.2.2 Framing windowing</title>
<p>To obtain the distribution of the frequency components in the heart sound audio, the Fourier transform must be applied to the audio signal. The Fourier transform requires the input signal to be stationary, so the audio signal is first framed and windowed. Framing divides the original signal into several small blocks in time, and each block is called a frame. Framing causes spectral leakage, so the spectrum of the framed signal differs considerably from that of the original signal. The Hamming window can effectively reduce this leakage <xref ref-type="bibr" rid="B2">Astuti et al. (2012)</xref>. The Hamming window function W(n) is as follows<disp-formula id="e3">
<mml:math id="m3">
<mml:mi>W</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>&#x3b1;</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>&#x3b1;</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mi>cos</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi>&#x3c0;</mml:mi>
<mml:mi>n</mml:mi>
<mml:mo>/</mml:mo>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>N</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:math>
<label>(3)</label>
</disp-formula>where <italic>N</italic> is the frame length in samples and <italic>&#x3b1;</italic> is set to 0.46, as suggested by <xref ref-type="bibr" rid="B36">Trang et al. (2014)</xref>.</p>
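<p>A minimal sketch of framing followed by Hamming windowing (Eq. 3) is given below. The frame length and hop size are not stated in this section, so the values used here are hypothetical, as are the function names.</p>
<preformat preformat-type="code">
```python
import math


def hamming(N, alpha=0.46):
    """Eq. 3: W(n) = (1 - alpha) - alpha * cos(2*pi*n / (N - 1)); N must be at least 2."""
    return [(1 - alpha) - alpha * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames; trailing samples that do not fill a frame are dropped."""
    return [x[s:s + frame_len] for s in range(0, len(x) - frame_len + 1, hop)]


def window_frames(frames, window):
    """Multiply every frame sample-by-sample with the window."""
    return [[v * w for v, w in zip(frame, window)] for frame in frames]


# Hypothetical sizes: 10 samples, frames of 4 samples, hop of 2 samples.
frames = frame_signal([float(v) for v in range(10)], frame_len=4, hop=2)
win = hamming(4)
windowed = window_frames(frames, win)
print(len(frames), win)
```
</preformat>
<p>With &#x3b1; set to 0.46 the window tapers to 0.08 at both ends of each frame, which is what suppresses the abrupt truncation responsible for leakage.</p>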
</sec>
<sec id="s3-2-3">
<title>3.2.3 Power spectrum</title>
<p>After framing and windowing, the discrete Fourier transform (DFT) is applied to each frame to convert the time-domain signal into the frequency domain. The resulting spectrum X(k) is given as follows<disp-formula id="e4">
<mml:math id="m4">
<mml:mi>X</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:munderover accentunder="false" accent="true">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:mi>x</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:msup>
<mml:mrow>
<mml:mi>e</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>j</mml:mi>
<mml:mn>2</mml:mn>
<mml:mi>&#x3c0;</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>k</mml:mi>
<mml:mo>/</mml:mo>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>N</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:math>
<label>(4)</label>
</disp-formula>The power spectrum P(k) is the squared modulus of the signal spectrum X(k) scaled by 1/N, as shown in Eq. <xref ref-type="disp-formula" rid="e5">5</xref>. The power spectrum expresses the energy characteristics of the heart sound signal more accurately: it retains the amplitude information of the heart sound spectrum and discards its phase information.<disp-formula id="e5">
<mml:math id="m5">
<mml:mi>P</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:mi>X</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">&#x7c;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
<label>(5)</label>
</disp-formula>
</p>
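<p>Equations 4 and 5 can be checked on a short frame with a direct implementation. The sketch below evaluates the DFT sum literally, which costs on the order of N squared operations; a practical implementation would presumably use a fast Fourier transform instead. Function names are illustrative.</p>
<preformat preformat-type="code">
```python
import cmath


def dft(frame):
    """Eq. 4: X(k) = sum_n x(n) * exp(-j*2*pi*n*k/N), evaluated directly."""
    N = len(frame)
    return [
        sum(frame[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
        for k in range(N)
    ]


def power_spectrum(frame):
    """Eq. 5: P(k) = |X(k)|^2 / N. The phase of X(k) is discarded."""
    N = len(frame)
    return [abs(X) ** 2 / N for X in dft(frame)]


# One period of a sampled cosine: all energy falls in bins k=1 and k=3.
p = power_spectrum([1.0, 0.0, -1.0, 0.0])
print([round(v, 6) for v in p])
```
</preformat>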
</sec>
<sec id="s3-2-4">
<title>3.2.4 Mel filter bank</title>
<p>A normal human ear can hear sounds with frequencies from 20&#xa0;Hz to 20,000&#xa0;Hz, which is called the audible frequency range. The sounds we hear comprise many frequencies. When plotted, the Mel filter bank appears as a group of triangular filters. A bank usually contains 20 to 40 triangular filters arranged in ascending order of frequency, with each filter starting at the center of the previous one; because the filters are spaced linearly on the Mel scale, the bank is called a Mel filter bank. At each frequency, the product of P(k) and the filter Hm(k) is calculated. The frequency response Hm(k) of each triangular filter in the bank is defined as follows<disp-formula id="e6">
<mml:math id="m6">
<mml:msub>
<mml:mrow>
<mml:mi>H</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mfenced open="{" close="">
<mml:mrow>
<mml:mtable class="array">
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
</mml:mtd>
<mml:mtd columnalign="center">
<mml:mi>k</mml:mi>
<mml:mo>&#x3c;</mml:mo>
<mml:mi>f</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfenced>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mfrac>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>f</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>f</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
</mml:mtd>
<mml:mtd columnalign="center">
<mml:mi>f</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>f</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mfrac>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>f</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
</mml:mtd>
<mml:mtd columnalign="center">
<mml:mi>f</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>f</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfenced>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="center">
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
</mml:mtd>
<mml:mtd columnalign="center">
<mml:mi>k</mml:mi>
<mml:mo>&#x3e;</mml:mo>
<mml:mi>f</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfenced>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mfenced>
</mml:math>
<label>(6)</label>
</disp-formula>where <italic>m</italic> is the index of the filter, and f(m&#x2212;1), f(m), and f(m&#x2b;1) are the start point, center frequency, and end point of the <italic>m</italic>-th Mel triangular filter, respectively. In the calculations, m takes the values 1, 2, &#x2026;, 13. Summing Hm(k) over all filters gives Eq. <xref ref-type="disp-formula" rid="e7">7</xref>, where <italic>M</italic> is 13.<disp-formula id="e7">
<mml:math id="m7">
<mml:munderover accentunder="false" accent="true">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>M</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mi>H</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:math>
<label>(7)</label>
</disp-formula>
</p>
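<p>The triangular response of Eq. 6 and the sum in Eq. 7 can be illustrated as follows. Several things here are assumptions rather than details from the paper: the Hz-to-Mel conversion formula (a commonly used one), the 0 to 1000&#xa0;Hz band, and the use of frequencies in Hz for both k and f(m) instead of DFT bin indices. The min/max form used below is equivalent to the four cases of Eq. 6.</p>
<preformat preformat-type="code">
```python
import math


def hz_to_mel(hz):
    # Commonly used conversion; assumed here, not taken from the paper.
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_points(f_low, f_high, M):
    """Return f(0), ..., f(M+1) in Hz, equally spaced on the Mel scale."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    return [mel_to_hz(lo + (hi - lo) * i / (M + 1)) for i in range(M + 2)]


def triangular_filter(k, f, m):
    """Eq. 6: rises from f[m-1] to f[m], falls from f[m] to f[m+1], zero elsewhere."""
    rising = (k - f[m - 1]) / (f[m] - f[m - 1])
    falling = (f[m + 1] - k) / (f[m + 1] - f[m])
    return max(0.0, min(rising, falling))


M = 13
f = mel_points(0.0, 1000.0, M)      # hypothetical band edges
k = 0.5 * (f[5] + f[6])             # a frequency between two filter centers
total = sum(triangular_filter(k, f, m) for m in range(1, M + 1))
print(round(total, 6))
```
</preformat>
<p>Because each filter starts at the center of the previous one, the responses of neighbouring filters add up to one for any frequency between the first and last filter centers, which is the property stated in Eq. 7.</p>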
</sec>
<sec id="s3-2-5">
<title>3.2.5 Log spectrum</title>
<p>The logarithmic energy spectrum S(m) of each frame is obtained by taking the logarithm of the filter bank outputs, as follows<disp-formula id="e8">
<mml:math id="m8">
<mml:mi>S</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>ln</mml:mi>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:munderover accentunder="false" accent="true">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:mi>P</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:msub>
<mml:mrow>
<mml:mi>H</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>M</mml:mi>
</mml:math>
<label>(8)</label>
</disp-formula>where ln denotes the natural (base <italic>e</italic>) logarithm.</p>
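<p>As a small worked example of Eq. 8, the sketch below weights a toy power spectrum with two toy filters and takes the natural logarithm of each filter output. The numbers are hypothetical, and the small floor value is an added safeguard against taking the logarithm of zero, not something specified in the paper.</p>
<preformat preformat-type="code">
```python
import math


def log_mel_energies(P, H, floor=1e-10):
    """Eq. 8: S(m) = ln(sum_k P(k) * H_m(k)) for each filter H_m in H.

    floor is an assumed guard so that an all-zero filter output does not give ln(0).
    """
    return [
        math.log(max(floor, sum(p * h for p, h in zip(P, Hm))))
        for Hm in H
    ]


# Toy data: a 4-bin power spectrum and two filters sampled on the same bins.
P = [0.0, 2.0, 0.0, 1.0]
H = [
    [0.0, 1.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 1.0],
]
S = log_mel_energies(P, H)
print(S)
```
</preformat>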
</sec>
<sec id="s3-2-6">
<title>3.2.6 Discrete cosine transform</title>
<p>The discrete cosine transform (DCT) is applied to the log spectrum above to obtain the Mel cepstral coefficients C(n), which form the MFCC feature. The corresponding equation is as follows<disp-formula id="e9">
<mml:math id="m9">
<mml:mi>C</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:munderover accentunder="false" accent="true">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>M</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mi>S</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mi>cos</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mi>n</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>0.5</mml:mn>
</mml:mrow>
</mml:mfenced>
<mml:mo>/</mml:mo>
<mml:mi>M</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
<mml:mspace width="1em"/>
<mml:mi>n</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1,2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>L</mml:mi>
</mml:math>
<label>(9)</label>
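</disp-formula>where <italic>M</italic> is the number of Mel filters and <italic>L</italic> is the number of cepstral coefficients retained.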
</p>
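<p>The DCT step of Eq. 9 can be sketched as below, assuming that the sum runs over the M Mel filters (m = 1, &#x2026;, M) and that L coefficients are kept. The function name is illustrative.</p>
<preformat preformat-type="code">
```python
import math


def mfcc_from_log_energies(S, L):
    """Eq. 9: C(n) = sum_{m=1..M} S(m) * cos(pi * n * (m - 0.5) / M), n = 1..L.

    S is the list of M log filter bank energies; S[m - 1] holds S(m).
    """
    M = len(S)
    return [
        sum(S[m - 1] * math.cos(math.pi * n * (m - 0.5) / M) for m in range(1, M + 1))
        for n in range(1, L + 1)
    ]


# A flat log spectrum carries no spectral shape, so C(1)..C(3) are all (numerically) zero.
flat = mfcc_from_log_energies([1.0, 1.0, 1.0, 1.0], L=3)
# A tilted log spectrum gives a non-zero first coefficient.
tilted = mfcc_from_log_energies([4.0, 3.0, 2.0, 1.0], L=3)
print([round(c, 6) for c in flat], [round(c, 6) for c in tilted])
```
</preformat>
<p>The DCT decorrelates the log filter bank energies, which is why a flat spectrum maps to zero coefficients for n from 1 to M&#x2212;1.</p>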
</sec>
<sec id="s3-2-7">
<title>3.2.7 Dynamic feature extraction</title>
<p>MFCC reflects the static information of the heart sound signal, and the dynamic information of the heart sound signal also contains rich pathological features, which can be used to improve the classification accuracy further. To reflect the dynamic information of the heart sound signal, this paper extracts the first-order difference coefficient D(n) and the second-order difference coefficient D2(n) based on MFCC. The calculation formulas are described as follows<disp-formula id="e10">
<mml:math id="m10">
<mml:mi>D</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mo movablelimits="false" form="prefix">&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msup>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mfrac>
<mml:munderover accentunder="false" accent="true">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mi>i</mml:mi>
<mml:mo>&#x22c5;</mml:mo>
<mml:mi>C</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:math>
<label>(10)</label>
</disp-formula>
<disp-formula id="e11">
<mml:math id="m11">
<mml:msub>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:msubsup>
<mml:mrow>
<mml:mo movablelimits="false" form="prefix">&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msup>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mfrac>
<mml:munderover accentunder="false" accent="true">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mi>i</mml:mi>
<mml:mo>&#x22c5;</mml:mo>
<mml:mi>D</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:math>
<label>(11)</label>
</disp-formula>where k is set to 2 and C(n &#x2b; i) is a frame of MFCC coefficients. <xref ref-type="fig" rid="F5">Figure 5</xref> shows a 2D visualization of these features, where MFCC is the result of Eq. <xref ref-type="disp-formula" rid="e9">9</xref>, &#x25b5;MFCC is the result of Eq. <xref ref-type="disp-formula" rid="e10">10</xref>, and &#x25b5;<sup>2</sup>MFCC is the result of Eq. <xref ref-type="disp-formula" rid="e11">11</xref>. Each of the three features has size (199,13), and they are concatenated into a (199,39) feature that serves as the input to the neural network.</p>
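<p>The construction of the (199,39) input from Eqs 10 and 11 can be sketched as follows. The normalisation follows the two equations as written (with the extra factor of 2 under the square root for the second-order difference). Repeating the first and last frame at the boundaries is an assumption, since the treatment of edge frames is not described, and the ramp input is purely illustrative.</p>
<preformat preformat-type="code">
```python
import math


def delta(frames, k=2, scale=1.0):
    """Difference along time: sum_i i * frames[t + i] / sqrt(scale * sum_i i^2).

    scale=1.0 gives the normalisation of Eq. 10, scale=2.0 that of Eq. 11.
    Edge frames are repeated at the boundaries (an assumption).
    """
    T = len(frames)
    norm = math.sqrt(scale * sum(i * i for i in range(-k, k + 1)))
    out = []
    for t in range(T):
        row = [0.0] * len(frames[t])
        for i in range(-k, k + 1):
            src = frames[min(max(t + i, 0), T - 1)]
            for j, c in enumerate(src):
                row[j] += i * c
        out.append([v / norm for v in row])
    return out


def stack_features(mfcc):
    """Concatenate MFCC, first-order and second-order differences frame by frame."""
    d1 = delta(mfcc, scale=1.0)
    d2 = delta(d1, scale=2.0)
    return [a + b + c for a, b, c in zip(mfcc, d1, d2)]


# Illustrative input: 199 frames of 13 coefficients that grow linearly in time.
mfcc = [[float(t)] * 13 for t in range(199)]
feat = stack_features(mfcc)
print(len(feat), len(feat[0]))
```
</preformat>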
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>2D visualization of the features. <bold>(A)</bold> Normal heart sound. <bold>(B)</bold> Abnormal heart sound. <bold>(C)</bold> Noise</p>
</caption>
<graphic xlink:href="fphys-13-1084420-g005.tif"/>
</fig>
</sec>
</sec>
<sec id="s3-3">
<title>3.3 Resnet</title>
<p>The network structure in this paper is shown in <xref ref-type="fig" rid="F6">Figure 6</xref>.</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption>
<p>Structure of Resnet.</p>
</caption>
<graphic xlink:href="fphys-13-1084420-g006.tif"/>
</fig>
<p>The convolutional neural network (CNN), which developed from the traditional artificial neural network, can learn valuable features from large-scale heart sound spectrograms. A CNN retains the characteristics of a traditional fully connected network but also differs from and improves on it in many ways. CNNs operate on data arranged as a two-dimensional matrix and outperform traditional artificial neural networks at extracting image features. In a CNN, the initial convolutional layer functions similarly to an edge detector and identifies low-level features, while later layers capture more complex or abstract ones. Because of weight sharing, a CNN requires far fewer trainable parameters than a fully connected network: for the same number of layers and the same output size per layer, the number of parameters a CNN needs to process the same data is much lower. Compared with other feature extraction methods, a CNN has a simple structure, strong fitting ability, and good trainability. The principle of convolution calculation in CNN is shown in <xref ref-type="fig" rid="F7">Figure 7</xref>.</p>
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption>
<p>Principle of convolution.</p>
</caption>
<graphic xlink:href="fphys-13-1084420-g007.tif"/>
</fig>
<p>Batch Normalization (BN) was originally designed to address Internal Covariate Shift (ICS), a phenomenon in which the distribution of data at internal nodes changes as the network parameters change. ICS has a greater negative impact on deeper networks, because the distribution changes more often as the number of layers increases, which makes the network harder to train and more prone to overfitting. The BN layer adjusts the distribution by normalizing each batch of data, as shown in <xref ref-type="fig" rid="F8">Figure 8</xref>. Using the BN layer makes the model converge faster, which reduces training time, and helps control vanishing and exploding gradients <xref ref-type="bibr" rid="B14">Ioffe and Szegedy (2015)</xref>. BN is calculated as follows<disp-formula id="e12">
<mml:math id="m12">
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3bc;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>&#x3c3;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3f5;</mml:mi>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mfrac>
</mml:math>
<label>(12)</label>
</disp-formula>where <italic>&#x3bc;</italic>
<sub>
<italic>B</italic>
</sub> is the mean of each batch of data, <inline-formula id="inf1">
<mml:math id="m13">
<mml:msubsup>
<mml:mrow>
<mml:mi>&#x3c3;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula> is the variance of each batch of data, and <italic>&#x3f5;</italic> is a small smoothing term that ensures numerical stability by preventing division by zero.</p>
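<p>For a single feature, the normalization of Eq. 12 reduces to the following sketch. It covers only the normalization step shown in the equation; the learnable scale and shift parameters of a full BN layer are left out, and the value of the smoothing term is an assumed default.</p>
<preformat preformat-type="code">
```python
import math


def batch_norm(batch, eps=1e-5):
    """Eq. 12 for one feature: subtract the batch mean, divide by sqrt(variance + eps).

    eps=1e-5 is an assumed value for the smoothing term.
    """
    mu = sum(batch) / len(batch)
    var = sum((x - mu) ** 2 for x in batch) / len(batch)
    return [(x - mu) / math.sqrt(var + eps) for x in batch]


out = batch_norm([1.0, 2.0, 3.0, 4.0, 5.0])
print([round(v, 4) for v in out])

# With identical inputs the variance is zero; eps keeps the division well defined.
constant = batch_norm([7.0, 7.0, 7.0])
print(constant)
```
</preformat>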
<fig id="F8" position="float">
<label>FIGURE 8</label>
<caption>
<p>Principle of batch normalization.</p>
</caption>
<graphic xlink:href="fphys-13-1084420-g008.tif"/>
</fig>
<p>The residual neural network was originally proposed by <xref ref-type="bibr" rid="B10">He et al. (2016)</xref> to address the degradation phenomenon, that is, the substantial drop in model accuracy that occurs as the depth of the network continues to increase. Degradation prompts a rethinking of non-linear transformation: it significantly improves data classification, but as depth grows, the stacked non-linear layers go so far that they struggle to represent even a simple linear mapping. Training on the data with a plain CNN can quickly run into this bottleneck, so this paper introduces a residual module to address it. Many of the neural networks used in computer vision today are based on Resnet and its variants.</p>
<p>The principle of the residual structure constructed in this paper is shown in <xref ref-type="fig" rid="F9">Figure 9</xref>. A network layer can usually be viewed as y &#x3d; H(x). A residual block is defined as H(x) &#x3d; F(x) &#x2b; x, so F(x) &#x3d; H(x) &#x2212; x. If x is regarded as the observed value and H(x) as the predicted value, then H(x) &#x2212; x, that is, F(x), is the residual, which gives the residual network its name. When a plain deep network propagates forward, the information it retains decreases layer by layer as the network deepens. Resnet deals with this problem through identity mapping: the next layer receives not only the input x of the current layer but also the new information F(x) produced by that layer's non-linear transformation. As a result, the information tends to increase layer by layer rather than decrease, so information loss is much less of a concern. Intuitively, the residual block protects the integrity of the information by passing the input directly around to the output, so the network only needs to learn the difference between input and output, which simplifies the learning objective and reduces its difficulty.</p>
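<p>The relation H(x) &#x3d; F(x) &#x2b; x can be illustrated on a plain feature vector. In the sketch below the two branch functions are hypothetical stand-ins for the convolution and BN layers of the actual residual branch; they only serve to show what the skip connection does.</p>
<preformat preformat-type="code">
```python
def residual_block(x, F):
    """H(x) = F(x) + x: add the branch output to the unchanged input, element by element."""
    return [fx + xi for fx, xi in zip(F(x), x)]


def zero_branch(x):
    # A branch that has learned nothing: F(x) = 0 for every input.
    return [0.0 for _ in x]


def small_branch(x):
    # Hypothetical branch: a ReLU applied to a scaled copy of the input.
    return [max(0.0, 0.1 * v) for v in x]


x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_branch)   # the block reduces to the identity
refined_out = residual_block(x, small_branch)   # the input plus a learned correction
print(identity_out, refined_out)
```
</preformat>
<p>When the branch outputs zero, the block passes its input through unchanged, so adding such a block cannot lose the information carried by x; the branch only has to model the correction F(x).</p>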
<fig id="F9" position="float">
<label>FIGURE 9</label>
<caption>
<p>Residual structure.</p>
</caption>
<graphic xlink:href="fphys-13-1084420-g009.tif"/>
</fig>
</sec>
</sec>
<sec id="s4">
<title>4 Experimental evaluation</title>
<sec id="s4-1">
<title>4.1 Dataset</title>
<p>This paper uses heart sound datasets published on three different platforms: the PhysioNet/CinC Challenge 2016 heart sound database, the heart sound dataset from the Kaggle platform, and the Yaseen heart sound dataset. In 2016, PhysioNet hosted the PhysioNet/Computing in Cardiology (CinC) Challenge 2016 and released the dataset <xref ref-type="bibr" rid="B24">Liu et al. (2016)</xref>. PhysioNet is a resource platform for complex physiological signal research managed by the MIT Computational Physiology Laboratory. The dataset was collected by different research groups under clinical and non-clinical conditions. The recordings were sampled at the same frequency, are large in number, and contain little noise. The labeling is relatively simple, with two categories: normal and abnormal. The audio lengths vary widely, from 5&#xa0;s to 120&#xa0;s, so in this paper the audio was segmented before the classification task. The details of this dataset are shown in <xref ref-type="table" rid="T1">Table 1</xref>.</p>
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>PhysioNet/CinC Challenge dataset.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">File name</th>
<th align="center">Abnormal</th>
<th align="center">Normal</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">Training-a</td>
<td align="center">292</td>
<td align="center">117</td>
</tr>
<tr>
<td align="center">Training-b</td>
<td align="center">104</td>
<td align="center">386</td>
</tr>
<tr>
<td align="center">Training-c</td>
<td align="center">24</td>
<td align="center">7</td>
</tr>
<tr>
<td align="center">Training-d</td>
<td align="center">28</td>
<td align="center">27</td>
</tr>
<tr>
<td align="center">Training-e</td>
<td align="center">183</td>
<td align="center">1958</td>
</tr>
<tr>
<td align="center">Training-f</td>
<td align="center">34</td>
<td align="center">80</td>
</tr>
<tr>
<td align="center">Total</td>
<td align="center">665</td>
<td align="center">2575</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Kaggle is currently one of the largest data science platforms in the world and hosts many high-quality datasets, which are often sponsored by large companies for data science competitions. In 2016, Kaggle held a heart sound classification competition with a dataset based on the Pascal heart sound dataset <xref ref-type="bibr" rid="B17">Jiang and Choi (2006)</xref>, with several description files attached and no modifications to the audio files. For labeling purposes, the dataset used in this paper is the one published by Kaggle. The audio lengths in this dataset range from 1&#xa0;s to 30&#xa0;s, and the details are shown in <xref ref-type="table" rid="T2">Table 2</xref>.</p>
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>Pascal dataset.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">File name</th>
<th align="center">Normal</th>
<th align="center">Murmur</th>
<th align="center">Extrahs</th>
<th align="center">Artifact</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">Set-a</td>
<td align="center">31</td>
<td align="center">34</td>
<td align="center">19</td>
<td align="center">40</td>
</tr>
<tr>
<td align="center">Set-b</td>
<td align="center">320</td>
<td align="center">95</td>
<td align="center">None</td>
<td align="center">None</td>
</tr>
<tr>
<td align="center">Total</td>
<td align="center">351</td>
<td align="center">133</td>
<td align="center">19</td>
<td align="center">40</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The third dataset was open-sourced by <xref ref-type="bibr" rid="B11">Herzig et al. (2014)</xref> on the GitHub platform, and its authors preprocessed the data. All recordings were sampled at the same frequency, have the same length, and contain less murmur. The data were labeled with five categories: normal, aortic stenosis, mitral valve insufficiency, mitral stenosis, and murmur. The latter four are abnormal heart sound signals from patients with cardiovascular disease. The details are shown in <xref ref-type="table" rid="T3">Table 3</xref>.</p>
<table-wrap id="T3" position="float">
<label>TABLE 3</label>
<caption>
<p>Yaseen dataset.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">File name</th>
<th align="center">Normal</th>
<th align="center">Aortic stenosis</th>
<th align="center">Mitral stenosis</th>
<th align="center">Mitral regurgitation</th>
<th align="center">Mitral valve prolapse</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">N</td>
<td align="center">200</td>
<td align="center">None</td>
<td align="center">None</td>
<td align="center">None</td>
<td align="center">None</td>
</tr>
<tr>
<td align="center">AS</td>
<td align="center">None</td>
<td align="center">200</td>
<td align="center">None</td>
<td align="center">None</td>
<td align="center">None</td>
</tr>
<tr>
<td align="center">MS</td>
<td align="center">None</td>
<td align="center">None</td>
<td align="center">200</td>
<td align="center">None</td>
<td align="center">None</td>
</tr>
<tr>
<td align="center">MR</td>
<td align="center">None</td>
<td align="center">None</td>
<td align="center">None</td>
<td align="center">200</td>
<td align="center">None</td>
</tr>
<tr>
<td align="center">MVP</td>
<td align="center">None</td>
<td align="center">None</td>
<td align="center">None</td>
<td align="center">None</td>
<td align="center">200</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4-2">
<title>4.2 Experimental setup</title>
<p>In this study, we use Accuracy, Sensitivity, Specificity, and Precision to evaluate the proposed method. All of them are defined as follows<disp-formula id="e13">
<mml:math id="m14">
<mml:mtext>&#x2009;Accuracy&#x2009;</mml:mtext>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:math>
<label>(13)</label>
</disp-formula>
<disp-formula id="e14">
<mml:math id="m15">
<mml:mtext>&#x2009;Sensitivity&#x2009;</mml:mtext>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:math>
<label>(14)</label>
</disp-formula>
<disp-formula id="e15">
<mml:math id="m16">
<mml:mtext>&#x2009;Specificity&#x2009;</mml:mtext>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:math>
<label>(15)</label>
</disp-formula>
<disp-formula id="e16">
<mml:math id="m17">
<mml:mtext>&#x2009;Precision&#x2009;</mml:mtext>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:math>
<label>(16)</label>
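</disp-formula>where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.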
</p>
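<p>Equations 13 to 16 translate directly into code. The confusion-matrix counts in the example are hypothetical and are not results from this paper.</p>
<preformat preformat-type="code">
```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and precision as defined in Eqs 13-16."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Hypothetical counts: 100 abnormal recordings (90 detected) and 100 normal ones (80 cleared).
m = classification_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in m.items():
    print(name, round(value, 4))
```
</preformat>
<p>In this example the sensitivity (0.9) is higher than the precision (about 0.82), which illustrates how a classifier can find most patients while still raising a number of false alarms.</p>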
<p>To further illustrate the classification performance, we also tested the proposed algorithm on two further deep learning architectures, LSTM and GRU, whose structures are shown in <xref ref-type="table" rid="T4">Table 4</xref>. LSTM(x) represents an LSTM layer, where x is the dimension of the output space. GRU(x) represents a GRU layer, where x is the dimension of the output space. Drop(x) represents a Dropout layer, where x is the probability of dropping a neuron. FC(x) represents a fully connected layer with x neurons. Conv [x, (y, z)] represents a convolution layer, where x is the number of filters and y and z are the width and height of the 2D filter window. BN represents a Batch Normalization layer <xref ref-type="bibr" rid="B14">Ioffe and Szegedy (2015)</xref>. SeparableConv [x, (y, z)] is a depthwise separable convolutional layer. MaxPooling (x, y) is a max pooling layer, where x and y are the pooling sizes. Residual (x) is a residual connection module; it is not a specific layer but marks the position of the output layer. Add represents a residual connection layer, which takes the output of a previous layer as the input of a later one. GlobalAveragePooling() represents the global average pooling layer.</p>
<table-wrap id="T4" position="float">
<label>TABLE 4</label>
<caption>
<p>The parameters of deep learning architecture.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">Model</th>
<th align="center">Structure details</th>
<th align="center">Params</th>
<th align="center">Training time (s)</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">LSTM</td>
<td align="center">LSTM (64)-Drop (0.5)-FC(64)-FC (3)</td>
<td align="center">30,979</td>
<td align="center">75</td>
</tr>
<tr>
<td align="center">GRU</td>
<td align="center">GRU (64)-Drop (0.5)-FC(64)-FC (3)</td>
<td align="center">24,515</td>
<td align="center">55</td>
</tr>
<tr>
<td align="center">CNNa</td>
<td align="left">Conv [16, (3,3)]-MaxPooling (3,3)-Conv [32, (3,3)]-MaxPooling (3,3)-Conv [64, (3,3)]- MaxPooling (3,3)-Conv [128, (3,3)]-MaxPooling (3,3)-Drop (0.5)-GlobalAveragePooling ()-Dense (3)</td>
<td align="center">97,539</td>
<td align="center">55</td>
</tr>
<tr>
<td align="center">CNNb</td>
<td align="left">Conv [16, (3,3)]-MaxPooling (3,3)-Conv [32, (3,3)]-MaxPooling (3,3)-Conv [64, (3,3)]- MaxPooling (3,3)-Conv [128, (3,3)]-MaxPooling (3,3)-Drop (0.5)-GlobalAveragePooling ()-Dense (3)</td>
<td align="center">40,979</td>
<td align="center">200</td>
</tr>
<tr>
<td align="center">Resnet</td>
<td align="left">Conv [8, (3,3)]-BN-Conv [8, (3,3)]-residual {Conv [16, (1,1)]-BN}-SeparableConv [16, (3,3)]-BN-MaxPooling (3,3)-add-residual {Conv [32, (1,1)]-BN}-SeparableConv [32, (3,3)]-BN-SeparableConv [32, (3,3)]-BN-MaxPooling (3,3)-add-residual {Conv [64, (1,1)]-BN}- SeparableConv [64, (3,3)]-BN-SeparableConv [64, (3,3)]-BN-MaxPooling (3,3)-add- residual {Conv [128, (1,1)]-BN}-SeparableConv [128, (3,3)]-BN-MaxPooling (3,3)-add-Conv [3, (3,3)]-GlobalAveragePooling ()</td>
<td align="center">52,339</td>
<td align="center">320</td>
</tr>
</tbody>
</table>
</table-wrap>
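<p>The parameter counts in Table 4 can be cross-checked with a short calculation. The sketch below is illustrative rather than the authors' code: the source does not name the framework, so the GRU formula assumes two bias vectors per gate (as in the default Keras GRU with reset_after enabled), and the input sizes (39 features per frame for the recurrent models, one input channel for CNNa) are assumptions chosen because they reproduce the reported totals. Dropout, pooling and global average pooling layers contribute no trainable parameters.</p>
<preformat>
```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def lstm_params(n_in, units):
    """Four gates, each with input weights, recurrent weights and one bias vector."""
    return 4 * (units * (n_in + units) + units)


def gru_params(n_in, units):
    """Three gates; two bias vectors per gate (assumed Keras-style reset_after variant)."""
    return 3 * (units * (n_in + units) + 2 * units)


def conv2d_params(c_in, filters, k_h, k_w):
    """One k_h x k_w kernel per input channel and filter, plus one bias per filter."""
    return filters * (k_h * k_w * c_in + 1)


N_FEATURES = 39  # assumption: feature dimension per frame, inferred from the totals

# LSTM (64)-Drop (0.5)-FC (64)-FC (3)
lstm_total = lstm_params(N_FEATURES, 64) + dense_params(64, 64) + dense_params(64, 3)

# GRU (64)-Drop (0.5)-FC (64)-FC (3)
gru_total = gru_params(N_FEATURES, 64) + dense_params(64, 64) + dense_params(64, 3)

# CNNa: four 3x3 convolutions followed by Dense (3)
cnna_total = 0
c_in = 1  # assumption: single-channel input
for filters in (16, 32, 64, 128):
    cnna_total += conv2d_params(c_in, filters, 3, 3)
    c_in = filters
cnna_total += dense_params(128, 3)

print(lstm_total, gru_total, cnna_total)  # 30979 24515 97539
```
</preformat>
<p>Under these assumptions the three totals agree with the Params column for the LSTM, GRU and CNNa rows.</p>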
</sec>
<sec id="s4-3">
<title>4.3 Experimental results</title>
<p>To validate the improved MFCC, we compared it with the individual features. MFCC, &#x25b5;MFCC, &#x25b5;<sup>2</sup>MFCC, and improved MFCC were each used to train the neural network separately, and the best epoch was taken as the result for comparison. The results of this experiment are shown in <xref ref-type="fig" rid="F10">Figure 10</xref>. The sensitivity, specificity, and accuracy of improved MFCC are higher than those of the other features, although its precision is lower than that of MFCC. In medical signal recognition, sensitivity and specificity are the more important measures. Sensitivity is especially crucial, because it reflects how many patients are correctly identified.</p>
<fig id="F10" position="float">
<label>FIGURE 10</label>
<caption>
<p>Comparison of heart sound features based on the proposed method.</p>
</caption>
<graphic xlink:href="fphys-13-1084420-g010.tif"/>
</fig>
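<p>For reference, the four measures compared above follow from confusion-matrix counts. The sketch below uses the standard binary definitions; the function name and the counts are hypothetical and are not taken from the experiments.</p>
<preformat>
```python
def binary_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, precision and accuracy as fractions."""
    return {
        "sensitivity": tp / (tp + fn),   # share of actual positives that are detected
        "specificity": tn / (tn + fp),   # share of actual negatives that are rejected
        "precision": tp / (tp + fp),     # share of positive predictions that are correct
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
example = binary_metrics(tp=90, fp=10, tn=190, fn=10)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```
</preformat>
<p>With these counts a method can have higher sensitivity and specificity than another and still a lower precision, because precision also depends on the balance between positive and negative samples.</p>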
<p>
<xref ref-type="fig" rid="F11">Figure 11</xref> shows the experimental results. Resnet alone reaches a higher accuracy but overfits quickly, with overfitting appearing by the 10th round. LSTM resists overfitting better, but it has not reached the accuracy of Resnet by the 10th round, or even by the 30th round. This is probably due to feature engineering: the first- and second-order MFCC features mainly reflect relationships along the time series, a property that suits LSTM and GRU but is less friendly to networks such as Resnet that extract locally relevant features. In addition, the accuracy of GRU is much lower than that of LSTM, although its average training time per round is shorter (55&#xa0;s for GRU versus 75&#xa0;s for LSTM). On the whole, Resnet gives better results.</p>
<fig id="F11" position="float">
<label>FIGURE 11</label>
<caption>
<p>Accuracy and loss of the different networks during training. <bold>(A)</bold> LSTM <bold>(B)</bold> GRU <bold>(C)</bold> CNNa <bold>(D)</bold> CNNb <bold>(E)</bold> Proposed method.</p>
</caption>
<graphic xlink:href="fphys-13-1084420-g011.tif"/>
</fig>
<p>
<xref ref-type="fig" rid="F12">Figure 12</xref> shows the results of the comparison. CNNa has a shallow structure and is the least effective in terms of performance. CNNb removes the residual connections of the Resnet and performs better than CNNa. The accuracy of GRU is again lower than that of LSTM. Resnet achieves the highest score. We therefore conclude that a deep structure and residual connections are both useful for the classification of heart sounds.</p>
<p>The results also show the training process of the RNNs, CNNs and Resnet. The CNNs and Resnet reach a higher accuracy but overfit quickly, by the 10th round. LSTM resists overfitting better, but it has not reached the accuracy of CNNb and Resnet by the 10th round, or even by the 30th round. Overfitting exists in all machine learning problems. Obtaining more authoritative heart sound data is the best solution; adjusting the capacity of the model is another. For a deep learning model, the capacity is the number of parameters it can learn. If the capacity is very large, the model can even learn a dictionary-style mapping of the training data, but such a mapping does not generalize to new data, which is severe overfitting. In that case, the generalization ability of the model needs to be improved by decreasing its capacity and compelling it to learn only the most important patterns.</p>
<p>To reduce the influence of data partitioning on the experimental results, we use 5-fold cross-validation. First, 20% of the whole dataset is set aside as the test set. Second, 80% of the remaining data is selected as the training set and the other 20% as the validation set. The second step is repeated five times so that the validation set rotates through the remaining data, and a new neural network is trained each time. Finally, the average accuracy of the five models on the test set is taken as the study result.</p>
<fig id="F12" position="float">
<label>FIGURE 12</label>
<caption>
<p>Comparison of RNNs and CNNs.</p>
</caption>
<graphic xlink:href="fphys-13-1084420-g012.tif"/>
</fig>
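<p>The data partitioning used for 5-fold cross-validation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the fixed random seed and the plain (non-stratified) shuffle are all hypothetical, and the training routine is passed in as a callable because the source does not give it.</p>
<preformat>
```python
import random


def five_fold_splits(n_samples, test_fraction=0.2, n_folds=5, seed=0):
    """Yield (train, val, test) index lists: one fixed test set, rotating validation fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Step 1: hold out 20% of the whole dataset as the test set.
    n_test = int(n_samples * test_fraction)
    test, remaining = indices[:n_test], indices[n_test:]

    # Step 2, repeated five times: one fifth of the remainder validates, the rest trains.
    for k in range(n_folds):
        val = remaining[k::n_folds]
        val_set = set(val)
        train = [i for i in remaining if i not in val_set]
        yield train, val, test


def cross_validated_accuracy(n_samples, train_and_score):
    """Average the test accuracy of the models trained on each split.

    train_and_score is a hypothetical callable that trains a new network on
    (train, val) and returns its accuracy on test.
    """
    scores = [train_and_score(tr, va, te) for tr, va, te in five_fold_splits(n_samples)]
    return sum(scores) / len(scores)


for train, val, test in five_fold_splits(1000):
    print(len(train), len(val), len(test))  # 640 160 200 on every fold
```
</preformat>
<p>Because the test set is fixed before the folds are formed, none of the five models sees the test samples during training or model selection.</p>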
<p>
<xref ref-type="table" rid="T5">Table 5</xref> compares our results with those of other studies. The essential difference between CNN and Resnet is that Resnet introduces a residual structure, which effectively mitigates the effect of degradation on the training of deep neural networks and therefore makes it more applicable to the heart sound classification problem. In addition to the residual structure, the features are also essential. MFCC is biologically inspired: it simulates the non-linear response of the human ear to sound and thus extracts the physiological and pathological information in heart sounds that reflects heart disease. However, MFCC reflects only the static information of the heart sound signal, whereas the dynamic information also contains rich pathological features that can further improve classification accuracy. We therefore merge the extracted dynamic features with the static features to represent the physiological and pathological information in the heart sounds more fully.</p>
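<p>The merging of static and dynamic features can be illustrated with a small sketch. The regression-style delta below is a common formulation and is an assumption here, as are the window width, the edge handling and the function names; the definition actually used in this work is the one given in the feature extraction section.</p>
<preformat>
```python
def delta(frames, width=2):
    """Regression-based delta of a list of per-frame coefficient vectors."""
    n_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(n_frames):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                nxt = frames[min(t + n, n_frames - 1)][k]  # repeat the last frame at the edge
                prv = frames[max(t - n, 0)][k]             # repeat the first frame at the edge
                acc += n * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out


def stack_static_and_dynamic(mfcc):
    """Concatenate static coefficients with first- and second-order deltas per frame."""
    d1 = delta(mfcc)
    d2 = delta(d1)
    return [static + first + second for static, first, second in zip(mfcc, d1, d2)]


# Toy input: 5 frames of 13 coefficients that grow linearly over time.
toy_mfcc = [[float(t)] * 13 for t in range(5)]
features = stack_static_and_dynamic(toy_mfcc)
print(len(features), len(features[0]))  # 5 39
```
</preformat>
<p>Each frame keeps its static coefficients and gains two equally sized blocks that describe how those coefficients change over time.</p>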
<table-wrap id="T5" position="float">
<label>TABLE 5</label>
<caption>
<p>Comparison of experimental results of different algorithms.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">References</th>
<th align="center">Algorithms</th>
<th align="center">Sensitivity (%)</th>
<th align="center">Specificity (%)</th>
<th align="center">Precision (%)</th>
<th align="center">Accuracy (%)</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">
<xref ref-type="bibr" rid="B23">Li et al. (2021)</xref>
</td>
<td align="center">SFTF and CNN</td>
<td align="center">88.70</td>
<td align="center">86.40</td>
<td align="center">&#x2014;</td>
<td align="center">86.00</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B41">Wu et al. (2019)</xref>
</td>
<td align="center">MFCC and CNN</td>
<td align="center">91.73</td>
<td align="center">87.90</td>
<td align="center">&#x2014;</td>
<td align="center">89.81</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B37">Tschannen et al. (2016)</xref>
</td>
<td align="center">Wavelet and CNN</td>
<td align="center">88.12</td>
<td align="center">76.30</td>
<td align="center">&#x2014;</td>
<td align="center">82.12</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B21">Li F. et al. (2020)</xref>
</td>
<td align="center">497-features and CNN</td>
<td align="center">87.00</td>
<td align="center">72.10</td>
<td align="center">&#x2014;</td>
<td align="center">86.80</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B8">Er (2021)</xref>
</td>
<td align="center">LBP and LTP</td>
<td align="center">91.24</td>
<td align="center">&#x2014;</td>
<td align="center">90.36</td>
<td align="center">91.66</td>
</tr>
<tr>
<td align="center">Ours</td>
<td align="center">Improved MFCC and Resnet</td>
<td align="center">
<bold>92.32</bold>
</td>
<td align="center">
<bold>95.47</bold>
</td>
<td align="center">
<bold>90.55</bold>
</td>
<td align="center">
<bold>94.43</bold>
</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec sec-type="conclusion" id="s5">
<title>5 Conclusion</title>
<p>In this paper, to address the lack of reliable heart sound datasets, we fused datasets from three different platforms, which provided a solid foundation for neural network training. In addition, we used an enhanced feature extraction algorithm based on MFCC, and experiments show that using these features as input to the neural network improves the model&#x2019;s performance. The proposed method speeds up neural network training and improves model generalization, effectively mitigating the negative effects of vanishing gradients and degradation on medical signal recognition. It achieves an accuracy of 94.43% on the constructed dataset, which is higher than that of the compared state-of-the-art methods.</p>
</sec>
</body>
<back>
<sec sec-type="data-availability" id="s6">
<title>Data availability statement</title>
<p>The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/supplementary material.</p>
</sec>
<sec id="s7">
<title>Author contributions</title>
<p>FL improved the dataset processing method, optimized the neural network algorithm, and wrote the manuscript. ZZ collected and processed the dataset, and developed, implemented, and evaluated the neural network algorithm. LW analyzed the results and gave final approval of the manuscript. WL revised it critically for important intellectual content. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec id="s8">
<title>Funding</title>
<p>This work was supported in part by the Innovation Support Program for Returned Overseas Students in Anhui Province under Grant No. 2021LCX032 and by the National Natural Science Foundation of China under Grant No. 62202001.</p>
</sec>
<ack>
<p>We thank the reviewers and editors for their very constructive comments.</p>
</ack>
<sec sec-type="COI-statement" id="s9">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s10">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Abduh</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Nehary</surname>
<given-names>E. A.</given-names>
</name>
<name>
<surname>Wahed</surname>
<given-names>M. A.</given-names>
</name>
<name>
<surname>Kadah</surname>
<given-names>Y. M.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Classification of heart sounds using fractional Fourier transform based mel-frequency spectral coefficients and stacked autoencoder deep neural network</article-title>. <source>J. Med. Imaging Health Inf.</source> <volume>9</volume>, <fpage>1</fpage>&#x2013;<lpage>8</lpage>. <pub-id pub-id-type="doi">10.1166/jmihi.2019.2568</pub-id>
</citation>
</ref>
<ref id="B2">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Astuti</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Sediono</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Aibinu</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Akmeliawati</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Salami</surname>
<given-names>M.-J. E.</given-names>
</name>
</person-group> (<year>2012</year>). &#x201c;<article-title>Adaptive short time Fourier transform (stft) analysis of seismic electric signal (ses): A comparison of hamming and rectangular window</article-title>,&#x201d; in <source>2012 IEEE symposium on industrial electronics and applications</source> (<publisher-name>IEEE</publisher-name>), <fpage>372</fpage>&#x2013;<lpage>377</lpage>.</citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Baghel</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Dutta</surname>
<given-names>M. K.</given-names>
</name>
<name>
<surname>Burget</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Automatic diagnosis of multiple cardiac diseases from pcg signals using convolutional neural network</article-title>. <source>Comput. Methods Programs Biomed.</source> <volume>197</volume>, <fpage>105750</fpage>. <pub-id pub-id-type="doi">10.1016/j.cmpb.2020.105750</pub-id>
</citation>
</ref>
<ref id="B4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Baydoun</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Safatly</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Ghaziri</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>El Hajj</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Analysis of heart sound anomalies using ensemble learning</article-title>. <source>Biomed. Signal Process. Control</source> <volume>62</volume>, <fpage>102019</fpage>. <pub-id pub-id-type="doi">10.1016/j.bspc.2020.102019</pub-id>
</citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bengio</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2009</year>). <article-title>Learning deep architectures for AI</article-title>. <source>Found. Trends&#xae; Mach. Learn.</source> <volume>2</volume>, <fpage>1</fpage>&#x2013;<lpage>127</lpage>. <pub-id pub-id-type="doi">10.1561/2200000006</pub-id>
</citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Demir</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>&#x15e;eng&#xfc;r</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Bajaj</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Polat</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Towards the classification of heart sounds based on convolutional deep neural network</article-title>. <source>Health Inf. Sci. Syst.</source> <volume>7</volume>, <fpage>16</fpage>&#x2013;<lpage>19</lpage>. <pub-id pub-id-type="doi">10.1007/s13755-019-0078-0</pub-id>
</citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Deng</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Meng</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Cao</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Fan</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Heart sound classification based on improved mfcc features and convolutional recurrent neural networks</article-title>. <source>Neural Netw.</source> <volume>130</volume>, <fpage>22</fpage>&#x2013;<lpage>32</lpage>. <pub-id pub-id-type="doi">10.1016/j.neunet.2020.06.015</pub-id>
</citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Er</surname>
<given-names>M. B.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Heart sounds classification using convolutional neural network with 1d-local binary pattern and 1d-local ternary pattern features</article-title>. <source>Appl. Acoust.</source> <volume>180</volume>, <fpage>108152</fpage>. <pub-id pub-id-type="doi">10.1016/j.apacoust.2021.108152</pub-id>
</citation>
</ref>
<ref id="B9">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Gomes</surname>
<given-names>E. F.</given-names>
</name>
<name>
<surname>Bentley</surname>
<given-names>P. J.</given-names>
</name>
<name>
<surname>Pereira</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Coimbra</surname>
<given-names>M. T.</given-names>
</name>
<name>
<surname>Deng</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2013</year>). &#x201c;<article-title>Classifying heart sounds-approaches to the pascal challenge</article-title>,&#x201d; in <source>Healthinf</source>, <fpage>337</fpage>&#x2013;<lpage>340</lpage>.</citation>
</ref>
<ref id="B10">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>He</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Ren</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Deep residual learning for image recognition</article-title>,&#x201d; in <source>Proceedings of the IEEE conference on computer vision and pattern recognition</source>, <fpage>770</fpage>&#x2013;<lpage>778</lpage>.</citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Herzig</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Bickel</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Eitan</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Intrator</surname>
<given-names>N.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>Monitoring cardiac stress using features extracted from S&#x2081; heart sounds</article-title>. <source>IEEE Trans. Biomed. Eng.</source> <volume>62</volume>, <fpage>1169</fpage>&#x2013;<lpage>1178</lpage>. <pub-id pub-id-type="doi">10.1109/TBME.2014.2377695</pub-id>
</citation>
</ref>
<ref id="B12">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hinton</surname>
<given-names>G. E.</given-names>
</name>
<name>
<surname>Salakhutdinov</surname>
<given-names>R. R.</given-names>
</name>
</person-group> (<year>2006</year>). <article-title>Reducing the dimensionality of data with neural networks</article-title>. <source>Science</source> <volume>313</volume>, <fpage>504</fpage>&#x2013;<lpage>507</lpage>. <pub-id pub-id-type="doi">10.1126/science.1127647</pub-id>
</citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hinton</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Salakhutdinov</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>An efficient learning procedure for deep Boltzmann machines</article-title>. <source>Neural Comput.</source> <volume>24</volume>, <fpage>1967</fpage>&#x2013;<lpage>2006</lpage>. <pub-id pub-id-type="doi">10.1162/NECO_a_00311</pub-id>
</citation>
</ref>
<ref id="B14">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ioffe</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Szegedy</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2015</year>). &#x201c;<article-title>Batch normalization: Accelerating deep network training by reducing internal covariate shift</article-title>,&#x201d; in <source>International conference on machine learning</source> (<publisher-loc>Lille</publisher-loc>: <publisher-name>PMLR</publisher-name>), <fpage>448</fpage>&#x2013;<lpage>456</lpage>.</citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Iqtidar</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Qamar</surname>
<given-names>U.</given-names>
</name>
<name>
<surname>Aziz</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Khan</surname>
<given-names>M. U.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Phonocardiogram signal analysis for classification of coronary artery diseases using mfcc and 1d adaptive local ternary patterns</article-title>. <source>Comput. Biol. Med.</source> <volume>138</volume>, <fpage>104926</fpage>. <pub-id pub-id-type="doi">10.1016/j.compbiomed.2021.104926</pub-id>
</citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ismail</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Siddiqi</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Akram</surname>
<given-names>U.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Heart rate estimation in ppg signals using convolutional-recurrent regressor</article-title>. <source>Comput. Biol. Med.</source> <volume>145</volume>, <fpage>105470</fpage>. <pub-id pub-id-type="doi">10.1016/j.compbiomed.2022.105470</pub-id>
</citation>
</ref>
<ref id="B17">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jiang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Choi</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2006</year>). <article-title>A cardiac sound characteristic waveform method for in-home heart disorder monitoring with electric stethoscope</article-title>. <source>Expert Syst. Appl.</source> <volume>31</volume>, <fpage>286</fpage>&#x2013;<lpage>298</lpage>. <pub-id pub-id-type="doi">10.1016/j.eswa.2005.09.025</pub-id>
</citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Krishnan</surname>
<given-names>P. T.</given-names>
</name>
<name>
<surname>Balasubramanian</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Umapathy</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Automated heart sound classification system from unsegmented phonocardiogram (pcg) using deep neural network</article-title>. <source>Phys. Eng. Sci. Med.</source> <volume>43</volume>, <fpage>505</fpage>&#x2013;<lpage>515</lpage>. <pub-id pub-id-type="doi">10.1007/s13246-020-00851-w</pub-id>
</citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kui</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Pan</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Zong</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>W.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Heart sound classification based on log mel-frequency spectral coefficients features and convolutional neural networks</article-title>. <source>Biomed. Signal Process. Control</source> <volume>69</volume>, <fpage>102893</fpage>. <pub-id pub-id-type="doi">10.1016/j.bspc.2021.102893</pub-id>
</citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lahmiri</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Bekiros</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Complexity measures of high oscillations in phonocardiogram as biomarkers to distinguish between normal heart sound and pathological murmur</article-title>. <source>Chaos, Solit. Fractals</source> <volume>154</volume>, <fpage>111610</fpage>. <pub-id pub-id-type="doi">10.1016/j.chaos.2021.111610</pub-id>
</citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Shang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Mathiak</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Cong</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Classification of heart sounds using convolutional neural network</article-title>. <source>Appl. Sci.</source> <volume>10</volume>, <fpage>3956</fpage>. <pub-id pub-id-type="doi">10.3390/app10113956</pub-id>
</citation>
</ref>
<ref id="B22">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Xiong</surname>
<given-names>W.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>A review of computer-aided heart sound detection techniques</article-title>. <source>BioMed Res. Int.</source> <volume>2020</volume>, <fpage>5846191</fpage>. <pub-id pub-id-type="doi">10.1155/2020/5846191</pub-id>
</citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Yin</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Lightweight end-to-end neural network model for automatic heart sound classification</article-title>. <source>Information</source> <volume>12</volume>, <fpage>54</fpage>. <pub-id pub-id-type="doi">10.3390/info12020054</pub-id>
</citation>
</ref>
<ref id="B24">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Springer</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Moody</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Juan</surname>
<given-names>R. A.</given-names>
</name>
<name>
<surname>Chorro</surname>
<given-names>F. J.</given-names>
</name>
<etal/>
</person-group> (<year>2016</year>). <article-title>An open access database for the evaluation of heart sound algorithms</article-title>. <source>Physiol. Meas.</source> <volume>37</volume>, <fpage>2181</fpage>&#x2013;<lpage>2213</lpage>. <pub-id pub-id-type="doi">10.1088/0967-3334/37/12/2181</pub-id>
</citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mei</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Classification of heart sounds based on quality assessment and wavelet scattering transform</article-title>. <source>Comput. Biol. Med.</source> <volume>137</volume>, <fpage>104814</fpage>. <pub-id pub-id-type="doi">10.1016/j.compbiomed.2021.104814</pub-id>
</citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Milani</surname>
<given-names>M. M.</given-names>
</name>
<name>
<surname>Abas</surname>
<given-names>P. E.</given-names>
</name>
<name>
<surname>De Silva</surname>
<given-names>L. C.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>A critical review of heart sound signal segmentation algorithms</article-title>. <source>Smart Health</source> <volume>24</volume>, <fpage>100283</fpage>. <pub-id pub-id-type="doi">10.1016/j.smhl.2022.100283</pub-id>
</citation>
</ref>
<ref id="B27">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Nair</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Hinton</surname>
<given-names>G. E.</given-names>
</name>
</person-group> (<year>2010</year>). &#x201c;<article-title>Rectified linear units improve restricted Boltzmann machines</article-title>,&#x201d; in <source>ICML</source>.</citation>
</ref>
<ref id="B28">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nogueira</surname>
<given-names>D. M.</given-names>
</name>
<name>
<surname>Ferreira</surname>
<given-names>C. A.</given-names>
</name>
<name>
<surname>Gomes</surname>
<given-names>E. F.</given-names>
</name>
<name>
<surname>Jorge</surname>
<given-names>A. M.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Classifying heart sounds using images of motifs, mfcc and temporal features</article-title>. <source>J. Med. Syst.</source> <volume>43</volume>, <fpage>168</fpage>&#x2013;<lpage>213</lpage>. <pub-id pub-id-type="doi">10.1007/s10916-019-1286-5</pub-id>
</citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Oh</surname>
<given-names>S. L.</given-names>
</name>
<name>
<surname>Jahmunah</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Ooi</surname>
<given-names>C. P.</given-names>
</name>
<name>
<surname>Tan</surname>
<given-names>R.-S.</given-names>
</name>
<name>
<surname>Ciaccio</surname>
<given-names>E. J.</given-names>
</name>
<name>
<surname>Yamakawa</surname>
<given-names>T.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Classification of heart sound signals using a novel deep wavenet model</article-title>. <source>Comput. Methods Programs Biomed.</source> <volume>196</volume>, <fpage>105604</fpage>. <pub-id pub-id-type="doi">10.1016/j.cmpb.2020.105604</pub-id>
</citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ranzato</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Poultney</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Chopra</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Cun</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2006</year>). <article-title>Efficient learning of sparse representations with an energy-based model</article-title>. <source>Adv. neural Inf. Process. Syst.</source> <volume>19</volume>.</citation>
</ref>
<ref id="B31">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ren</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Qian</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Dong</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Dai</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Nejdl</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Yamamoto</surname>
<given-names>Y.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <source>Deep attention-based neural networks for explainable heart sound classification</source>. <publisher-loc>Elsevier</publisher-loc>: <publisher-name>Machine Learning with Applications</publisher-name>, <fpage>100322</fpage>.</citation>
</ref>
<ref id="B32">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Sakib</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Ahmed</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Kabir</surname>
<given-names>A. J.</given-names>
</name>
<name>
<surname>Ahmed</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2019</year>). <source>An overview of convolutional neural network: Its architecture and applications</source>.</citation>
</ref>
<ref id="B33">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Silver</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Maddison</surname>
<given-names>C. J.</given-names>
</name>
<name>
<surname>Guez</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Sifre</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Van Den Driessche</surname>
<given-names>G.</given-names>
</name>
<etal/>
</person-group> (<year>2016</year>). <article-title>Mastering the game of Go with deep neural networks and tree search</article-title>. <source>Nature</source> <volume>529</volume>, <fpage>484</fpage>&#x2013;<lpage>489</lpage>. <pub-id pub-id-type="doi">10.1038/nature16961</pub-id>
</citation>
</ref>
<ref id="B34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Son</surname>
<given-names>G.-Y.</given-names>
</name>
<name>
<surname>Kwon</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Classification of heart sound signal using multiple features</article-title>. <source>Appl. Sci.</source> <volume>8</volume>, <fpage>2344</fpage>. <pub-id pub-id-type="doi">10.3390/app8122344</pub-id>
</citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tian</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Lian</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Zeng</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Su</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zang</surname>
<given-names>J.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <article-title>Imbalanced heart sound signal classification based on two-stage trained DSANet</article-title>. <source>Cogn. Comput.</source>, <fpage>1</fpage>&#x2013;<lpage>14</lpage>.</citation>
</ref>
<ref id="B36">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Trang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Loc</surname>
<given-names>T. H.</given-names>
</name>
<name>
<surname>Nam</surname>
<given-names>H. B. H.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>Proposed combination of PCA and MFCC feature extraction in speech recognition system</article-title>,&#x201d; in <source>2014 International Conference on Advanced Technologies for Communications (ATC 2014)</source> (<publisher-name>IEEE</publisher-name>), <fpage>697</fpage>&#x2013;<lpage>702</lpage>.</citation>
</ref>
<ref id="B37">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Tschannen</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Kramer</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Marti</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Heinzmann</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Wiatowski</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Heart sound classification using deep structured features</article-title>,&#x201d; in <source>2016 Computing in Cardiology Conference (CinC)</source> (<publisher-name>IEEE</publisher-name>), <fpage>565</fpage>&#x2013;<lpage>568</lpage>.</citation>
</ref>
<ref id="B38">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Varghees</surname>
<given-names>V. N.</given-names>
</name>
<name>
<surname>Ramachandran</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>A novel heart sound activity detection framework for automated heart sound analysis</article-title>. <source>Biomed. Signal Process. Control</source> <volume>13</volume>, <fpage>174</fpage>&#x2013;<lpage>188</lpage>. <pub-id pub-id-type="doi">10.1016/j.bspc.2014.05.002</pub-id>
</citation>
</ref>
<ref id="B39">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vincent</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Larochelle</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Lajoie</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Bengio</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Manzagol</surname>
<given-names>P.-A.</given-names>
</name>
</person-group> (<year>2010</year>). <article-title>Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion</article-title>. <source>J. Mach. Learn. Res.</source> <volume>11</volume>.</citation>
</ref>
<ref id="B40">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Jin</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>A heart sound classification method based on joint decision of extreme gradient boosting and deep neural network</article-title>. <source>Sheng Wu Yi Xue Gong Cheng Xue Za Zhi (J. Biomed. Eng.)</source> <volume>38</volume>, <fpage>10</fpage>&#x2013;<lpage>20</lpage>. <pub-id pub-id-type="doi">10.7507/1001-5515.202006025</pub-id>
</citation>
</ref>
<ref id="B41">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wu</surname>
<given-names>J. M.-T.</given-names>
</name>
<name>
<surname>Tsai</surname>
<given-names>M.-H.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>Y. Z.</given-names>
</name>
<name>
<surname>Islam</surname>
<given-names>S. H.</given-names>
</name>
<name>
<surname>Hassan</surname>
<given-names>M. M.</given-names>
</name>
<name>
<surname>Alelaiwi</surname>
<given-names>A.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Applying an ensemble convolutional neural network with Savitzky&#x2013;Golay filter to construct a phonocardiogram prediction model</article-title>. <source>Appl. Soft Comput.</source> <volume>78</volume>, <fpage>29</fpage>&#x2013;<lpage>40</lpage>. <pub-id pub-id-type="doi">10.1016/j.asoc.2019.01.019</pub-id>
</citation>
</ref>
<ref id="B42">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Yang</surname>
<given-names>T.-C. I.</given-names>
</name>
<name>
<surname>Hsieh</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Classification of acoustic physiological signals based on deep learning neural networks with augmented features</article-title>,&#x201d; in <source>2016 Computing in Cardiology Conference (CinC)</source> (<publisher-name>IEEE</publisher-name>), <fpage>569</fpage>&#x2013;<lpage>572</lpage>.</citation>
</ref>
<ref id="B43">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yu</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Jia</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>W.</given-names>
</name>
<etal/>
</person-group> (<year>2013</year>). <article-title>Deep learning: Yesterday, today, and tomorrow</article-title>. <source>J. Comput. Res. Dev.</source> <volume>50</volume>, <fpage>1799</fpage>&#x2013;<lpage>1804</lpage>.</citation>
</ref>
<ref id="B44">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zeinali</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Niaki</surname>
<given-names>S. T. A.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Heart sound classification using signal processing and machine learning algorithms</article-title>. <source>Mach. Learn. Appl.</source> <volume>7</volume>, <fpage>100206</fpage>. <pub-id pub-id-type="doi">10.1016/j.mlwa.2021.100206</pub-id>
</citation>
</ref>
<ref id="B45">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Towards heart sound classification without segmentation using convolutional neural network</article-title>,&#x201d; in <source>2017 Computing in Cardiology (CinC)</source> (<publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x2013;<lpage>4</lpage>.</citation>
</ref>
</ref-list>
</back>
</article>