Frontiers in Genetics (Front. Genet.), ISSN 1664-8021, Frontiers Media S.A.

Article 1135260, doi: 10.3389/fgene.2023.1135260

Genetics, Original Research
MSC-CSMC: A multi-objective semi-supervised clustering algorithm based on constraints selection and multi-source constraints for gene expression data

Zeyuan Wang¹, Hong Gu¹, Minghui Zhao¹, Dan Li¹* and Jia Wang²*

¹ Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, Liaoning, China

² Department of Breast Surgery, Second Hospital of Dalian Medical University, Dalian, Liaoning, China

Edited by: Suyan Tian, Jilin University, China

Reviewed by: Guojun Liu, Xi’an University of Finance and Economics, China

Changjing Zhuge, Beijing University of Technology, China

*Correspondence: Dan Li, ldan@dlut.edu.cn; Jia Wang, wangjia77@hotmail.com

This article was submitted to Computational Genomics, a section of the journal Frontiers in Genetics

Received: 31 December 2022; Accepted: 16 February 2023; Published: 27 February 2023

Copyright © 2023 Wang, Gu, Zhao, Li and Wang

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Many clustering techniques have been proposed to group genes based on gene expression data. Among these methods, semi-supervised clustering techniques aim to improve clustering performance by incorporating supervisory information in the form of pairwise constraints. However, noisy constraints inevitably exist in the constraint set obtained from a practical unlabeled dataset, which degrades the performance of semi-supervised clustering. Moreover, existing methods do not integrate multiple information sources into multi-source constraints to improve clustering quality. To this end, this research proposes a new multi-objective semi-supervised clustering algorithm based on constraints selection and multi-source constraints (MSC-CSMC) for unlabeled gene expression data. The proposed method first uses the gene expression data and the gene ontology (GO), which describes gene annotation information, to form multi-source constraints. Then, the multi-source constraints are applied to the clustering by improving the constraint violation penalty weight in the semi-supervised clustering objective function. Furthermore, the constraints selection and cluster prototypes are put into the multi-objective evolutionary framework by adopting a mixed chromosome encoding strategy, which selects pairwise constraints suitable for the clustering task through synergistic optimization and thereby reduces the negative influence of noisy constraints. The proposed MSC-CSMC algorithm is tested on five benchmark gene expression datasets, and the results show that it achieves superior performance.

Keywords: semi-supervised clustering, constraint selection, multi-source constraints, gene expression data, multi-objective optimization

1 Introduction

The rapid development of microarray technology has generated a large amount of gene expression data, and mining the inherent patterns in these data is a major challenge in the current bioinformatics field (Bandyopadhyay et al., 2007; Pirooznia et al., 2008). As an important unsupervised data mining method, clustering has become a powerful tool for gene expression data analysis. One of the main tasks of gene expression data clustering is to identify groups of co-expressed genes, which supports further research on gene function (Bandyopadhyay et al., 2007; Chen et al., 2019). Compared with unsupervised clustering methods, semi-supervised clustering methods use prior information, in the form of data labels or pairwise constraints, to guide the clustering process, which can effectively improve clustering performance (Wagstaff et al., 2001; Bilenko et al., 2004; Yin et al., 2010).

For semi-supervised clustering algorithms, pairwise constraints are usually used to describe whether two data points belong to the same cluster. Specifically, a must-link (ML) constraint means that two data points must be assigned to the same cluster, and a cannot-link (CL) constraint means that two data points must be assigned to different clusters. The quality of the selected pairwise constraints significantly affects the performance of semi-supervised clustering algorithms (Grira et al., 2008; Vu et al., 2012; Masud et al., 2019; Abin and Vu, 2020). The pairwise constraints can be generated by directly using part of the known data labels (Lai et al., 2021) or by using an active learning method (Masud et al., 2019). In practice, most gene expression data are unlabeled, so pairwise constraints cannot be obtained from labels. Vu et al. (2012) indicated that the generation of pairwise constraints should mainly focus on the data samples on the cluster boundaries, which are more likely to be misclassified. To this end, Basu et al. (2004) developed an active learning method based on a farthest-first traversal scheme to obtain pairwise constraints. However, this method has been reported to be sensitive to noise (Davidson and Qi, 2008). Grira et al. (2008) proposed an active learning method that generates pairwise constraints by determining cluster boundary data using the memberships obtained by fuzzy clustering. Vu et al. (2012) identified data in sparse regions based on k-nearest neighbor graphs and constructed pairwise constraints. However, it was claimed that some pairwise constraints might not be generated by this method (Abin and Vu, 2020). Liu et al. (2018) proposed an entropy-based query strategy to select the most uncertain pairwise constraints. Abin (2018) proposed a random walk approach on the adjacency graph of data for querying informative constraints. Masud et al. (2019) used local density estimation to identify the most informative objects as pairwise constraints. Abin and Vu (2020) proposed a density tracking method that takes into account the density relationship between data and uses information about the boundaries and skeleton of clusters to generate the pairwise constraints.

Although the above methods can automatically mine and learn the pairwise constraints of unlabeled datasets through different approaches, the obtained pairwise constraints inevitably contain noisy constraints, i.e., constraints inconsistent with the ground-truth clusters (Yin et al., 2010; Lai et al., 2021). However, the existing semi-supervised clustering algorithms are mostly based on the assumption that pairwise constraints conform to the real cluster information, and they are usually susceptible to noisy constraints. Therefore, it is necessary to implement constraints selection, where noisy constraints are filtered out and only pairwise constraints that are beneficial for semi-supervised clustering are retained. In addition, most of the semi-supervised clustering algorithms based on pairwise constraints were developed for single-source constraints, i.e., the pairwise constraints are obtained only from the data itself. In real-world applications, many data also possess related domain information. For example, Gene Ontology (GO) (Ashburner et al., 2000), which describes gene products in terms of their associated biological processes, cellular components and molecular functions, can provide additional gene annotation information for gene expression data. In this paper, the multi-source constraints are the pairwise constraints formed from both the data itself and the domain information. Compared with single-source pairwise constraints based solely on gene expression data, the multi-source constraints formed by fusing gene ontology information can provide more comprehensive information about the structure of gene clusters and help to guide semi-supervised clustering toward more accurate clustering results.

To reduce the negative impact of noisy constraints and to integrate multi-source constraints for unlabeled gene expression data, this research proposes a multi-objective semi-supervised clustering algorithm based on constraints selection and multi-source constraints (MSC-CSMC). First, the proposed algorithm uses gene expression data and GO information to generate multi-source pairwise constraints. Then, under the multi-objective optimization framework of the Non-dominated Sorting Genetic Algorithm-II (NSGA-II), the constraints selection and the cluster prototypes are optimized collaboratively, so that pairwise constraints suitable for clustering are selected from the multi-source constraints and the accuracy of semi-supervised clustering of gene expression data is improved by reducing the negative impact of noisy constraints.

2 Methods

In this section, the details of our proposed MSC-CSMC algorithm are described. Our proposed method consists of two parts. Firstly, multi-source pairwise constraints are generated by integrating gene expression and gene ontology (GO) information. Then, by using the improved penalty weights as well as mixed chromosome encoding strategy of cluster prototype and constraints selection, multi-objective semi-supervised clustering based on constraints selection and multi-source constraints is performed to identify co-expressed gene groups. The workflow of MSC-CSMC is shown in Figure 1.

FIGURE 1

Workflow of MSC-CSMC. (A) Generation of multi-source pairwise constraints. (B) Multi-objective semi-supervised clustering.

2.1 Generation of multi-source pairwise constraints

Gene expression data and gene ontology (GO) describe genes from two perspectives: the abundance of their mRNA and their functional annotation. Compared with methods that use only gene expression data, combining these two kinds of information can help to further improve the clustering accuracy of gene expression data (Giri and Saha, 2020; Li et al., 2022). In this paper, we use gene expression data and gene ontology information to generate multi-source pairwise constraints for semi-supervised clustering.

In view of the superior performance of the density tracking method (Abin and Vu, 2020), we use this method to generate the initial gene expression constraint set. The method consists of three steps: density estimation, density following, and constraints generation. Let $X = \{x_1, x_2, \ldots, x_n\}$, $x_i \in \mathbb{R}^d$, denote a $d$-dimensional gene expression dataset with $n$ genes. The density of gene $x_i$ is obtained by

$$\mathrm{Density}(x_i) = \frac{1}{\max_{x_j \in N_b(x_i)} \lVert x_i - x_j \rVert_2}, \tag{1}$$

where $N_b(x_i)$ is the set of the $b$ nearest genes of gene $x_i$ and $\lVert \cdot \rVert_2$ is the Euclidean distance. Based on the density in Formula 1, the density tracking method constructs density chains according to the density relationship between data. Specifically, starting from each gene $x_i$, the closest gene $x_j \in N_b(x_i)$ whose density is greater than that of $x_i$ is selected, and the relation between them is recorded as the density chain $x_i \rightarrow x_j$. The tracking then continues from gene $x_j$ until there exists no gene whose density is greater than that of the gene at the end of the chain. Consequently, the density chain $\mathrm{Chains}(x_i)$ can be denoted as $x_i \rightarrow x_j \rightarrow \cdots \rightarrow x_e$. After constructing all the density chains, the total number of times gene $x_i$ appears in all the chains is referred to as its centrality and denoted by $\mathrm{Centrality}(x_i)$. The sum of the centralities of all genes in a density chain is used as the centrality of the density chain. All density chains with a common endpoint are considered connected density chains, and the points belonging to them are considered to be in the same density group. Besides, the impurity of gene $x_i$ is defined as follows:

$$\mathrm{Impurity}(x_i) = \left(1 - \sum_{g=1}^{|Groups|} \left(\frac{\sum_{x_j \in S} I(\mathrm{Group}(x_j) = g)}{b + 1}\right)^{2}\right) \times \left(1 - \frac{\mathrm{Density}(x_i)}{\mathrm{Density}(x_e)}\right), \tag{2}$$

with $|Groups|$ being the total number of groups, $S = \{x_i\} \cup N_b(x_i)$, $\mathrm{Group}(x_j)$ being the group index of $x_j$, and $I$ being the indicator function.
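The density estimation and density following steps can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`nearest`, `densities`, `density_chain`, `centrality`) are hypothetical, points are assumed to be distinct (otherwise the distance in Formula 1 could be zero), and ties in density are treated as "not greater".

```python
import math
from collections import Counter

def nearest(X, i, b):
    """Indices of the b nearest points to X[i], excluding i itself (N_b(x_i))."""
    others = [j for j in range(len(X)) if j != i]
    return sorted(others, key=lambda j: math.dist(X[i], X[j]))[:b]

def densities(X, b):
    """Formula 1: inverse of the distance to the farthest of the b nearest neighbors."""
    return [1.0 / max(math.dist(X[i], X[j]) for j in nearest(X, i, b))
            for i in range(len(X))]

def density_chain(X, i, b, dens):
    """Follow x_i -> x_j -> ... -> x_e, always moving to the closest denser neighbor."""
    chain = [i]
    while True:
        cur = chain[-1]
        denser = [j for j in nearest(X, cur, b) if dens[j] > dens[cur]]
        if not denser:          # no denser neighbor: cur is the chain endpoint x_e
            return chain
        chain.append(min(denser, key=lambda j: math.dist(X[cur], X[j])))

def centrality(chains):
    """Number of times each point appears across all density chains."""
    return Counter(i for chain in chains for i in chain)

# Toy one-dimensional data, used only for illustration.
X = [(0.0,), (1.0,), (1.5,), (4.0,)]
dens = densities(X, b=2)
chains = [density_chain(X, i, 2, dens) for i in range(len(X))]
```

Because the density strictly increases along a chain, each chain terminates; points whose chains share an endpoint would then be placed in the same density group.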

According to the density, impurity, density chain, and density group of the data, the density tracking method proposes three assumptions for mining informative pairwise constraints. Let Ω denote the pairwise constraint set, whose elements satisfy the following key assumptions: (1) providing feasible information about the boundary data of clusters; (2) providing feasible information about the boundary between various clusters; (3) providing feasible information about the skeleton of clusters. Among them, assumptions (1) and (3) are used to generate the must-link constraint set Ω_ML, assumption (2) is used to generate the cannot-link constraint set Ω_CL. With the subsets Ω_ML and Ω_CL, the penalization can be constructed for the cost function of the clustering. The workflow of density tracking method is given in Figure 2. The initial gene expression constraint set Ω = Ω_ML ∪ Ω_CL is generated as follows.

1. For each gene $x_i$, calculate its $\mathrm{Density}(x_i)$ and $\mathrm{Impurity}(x_i)$. Construct the density chain $\mathrm{Chains}(x_i)$ and the density group $\mathrm{Group}(x_i)$, and obtain the centrality of each density chain. Initialize $\Omega_{ML} = \emptyset$, $\Omega_{CL} = \emptyset$;

2. Select gene $x_i$ in descending order of $\mathrm{Impurity}(x_i)$, query the nearest neighbor gene $x_j$ that is not in its density group $\mathrm{Group}(x_i)$, and add the pairwise constraint $(x_i, x_j)$ to the cannot-link constraint set, i.e., $\Omega_{CL} = \Omega_{CL} \cup \{(x_i, x_j)\}$;

3. Select gene $x_i$ in descending order of $\mathrm{Impurity}(x_i)$, and find the next gene $x_j$ along its density chain $\mathrm{Chains}(x_i)$. Let $\varepsilon > 0$ denote the density drop rate. If $\mathrm{Density}(x_j) \ge \varepsilon \times \mathrm{Density}(x_e)$, then add the pairwise constraint $(x_i, x_j)$ to the must-link constraint set, i.e., $\Omega_{ML} = \Omega_{ML} \cup \{(x_i, x_j)\}$;

4. Select the density chain $\mathrm{Chains}(x_i)$ in descending order of the centrality of the density chain, start from the starting gene $x_i$, select the gene $x_j$ with an interval, and add the pairwise constraint $(x_i, x_j)$ to the must-link constraint set, i.e., $\Omega_{ML} = \Omega_{ML} \cup \{(x_i, x_j)\}$.

FIGURE 2

Workflow of density tracking method.

For a set of genes to be analyzed, each gene can be annotated with several GO terms. Thus, the functional similarity between genes can be deduced based on the term similarity. In the proposed MSC-CSMC algorithm, we adopt the aggregate information content (AIC) (Song et al., 2014) to measure the semantic similarity of GO terms $t_1$ and $t_2$:

$$sim_{AIC}(t_1, t_2) = \frac{\sum_{t \in T_{t_1} \cap T_{t_2}} 2 \times SW(t)}{SV(t_1) + SV(t_2)} \tag{3}$$

with

$$SW(t) = \frac{1}{1 + \exp(-1 / IC(t))}, \qquad SV(t) = \sum_{t' \in T_t} SW(t').$$

Here, $T_t$ is the set of ancestors of term $t$ in the GO graph, $p(t)$ is the frequency of the term appearing in the GO database, and $IC(t) = -\log p(t)$ is the information content of term $t$. The higher the annotation frequency, the more general the information contained and the smaller the corresponding IC value. $SW(t)$ normalizes the knowledge reflected by $1/IC(t)$ and describes the semantic weight of term $t$. Consequently, the functional similarity of genes $x_i$ and $x_j$ can be obtained as follows:

$$sim_{GO}(x_i, x_j) = \frac{\sum_{t_2 \in ann(x_j)} sim(x_i, t_2) + \sum_{t_1 \in ann(x_i)} sim(x_j, t_1)}{|ann(x_i)| + |ann(x_j)|} \tag{4}$$

where $sim(x_i, t_2) = \max_{t_1 \in ann(x_i)} sim_{AIC}(t_1, t_2)$ is the similarity of gene $x_i$ and term $t_2$. $ann(x_i)$ and $ann(x_j)$ represent the sets of GO terms that annotate the two genes, and their cardinalities are denoted by $|ann(x_i)|$ and $|ann(x_j)|$, respectively.
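To make Formulas 3 and 4 concrete, the following sketch evaluates them on a tiny made-up ontology. Everything in it is illustrative: the term names, the `PARENTS` graph and the frequencies in `FREQ` are hypothetical, $T_t$ is assumed to contain $t$ itself (so that a term has similarity 1 with itself), and the root term with $p(t) = 1$ is given $SW = 1$, the limiting value of the formula as $IC(t) \to 0$.

```python
import math

# Hypothetical toy ontology: term -> set of parent terms.
PARENTS = {"root": set(), "a": {"root"}, "b": {"root"}, "c": {"a"}, "d": {"a", "b"}}
# Hypothetical annotation frequencies p(t).
FREQ = {"root": 1.0, "a": 0.5, "b": 0.4, "c": 0.1, "d": 0.05}

def ancestors(t):
    """T_t: the term together with all of its ancestors (including t is an assumption)."""
    seen, stack = {t}, [t]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def sw(t):
    """Semantic weight SW(t) = 1 / (1 + exp(-1 / IC(t))), with the limit 1 when IC(t) = 0."""
    ic = -math.log(FREQ[t])
    return 1.0 if ic == 0 else 1.0 / (1.0 + math.exp(-1.0 / ic))

def sv(t):
    """Aggregate weight SV(t): sum of SW over T_t."""
    return sum(sw(a) for a in ancestors(t))

def sim_aic(t1, t2):
    """Formula 3: shared ancestor weight relative to the total weight of both terms."""
    common = ancestors(t1) & ancestors(t2)
    return 2.0 * sum(sw(t) for t in common) / (sv(t1) + sv(t2))

def sim_go(ann_i, ann_j):
    """Formula 4: best-match average over the GO terms annotating two genes."""
    part_i = sum(max(sim_aic(t1, t2) for t1 in ann_i) for t2 in ann_j)
    part_j = sum(max(sim_aic(t2, t1) for t2 in ann_j) for t1 in ann_i)
    return (part_i + part_j) / (len(ann_i) + len(ann_j))
```

For example, `sim_go({"c"}, {"d"})` compares a gene annotated only with term `c` to one annotated only with `d`; the two terms share the ancestors `a` and `root`, so their similarity lies between that of unrelated terms and 1.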

The gene function similarity obtained through GO can also reflect the pairwise constraint relationship between genes to a certain extent. In the proposed MSC-CSMC algorithm, gene pairs with a similarity greater than 0.9 constitute the GO must-link constraint set $\Omega^*_{ML}$, gene pairs with a similarity less than 0.1 constitute the GO cannot-link constraint set $\Omega^*_{CL}$, and together they form the GO pairwise constraint set $\Omega^* = \Omega^*_{ML} \cup \Omega^*_{CL}$. Finally, the gene expression pairwise constraint set $\Omega$ and the gene ontology pairwise constraint set $\Omega^*$ together constitute the multi-source constraints for gene clustering.
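The thresholding step can be sketched as follows. The function `go_constraints` and the toy similarity table are hypothetical; only the thresholds 0.9 and 0.1 come from the text.

```python
from itertools import combinations

def go_constraints(genes, sim, hi=0.9, lo=0.1):
    """Split gene pairs into GO must-link and cannot-link sets by functional similarity."""
    must_link, cannot_link = set(), set()
    for gi, gj in combinations(genes, 2):
        s = sim(gi, gj)
        if s > hi:
            must_link.add(frozenset((gi, gj)))
        elif s < lo:
            cannot_link.add(frozenset((gi, gj)))
    return must_link, cannot_link

# Hypothetical similarity values for three genes, for illustration only.
toy_sim = {frozenset(("g1", "g2")): 0.95,
           frozenset(("g1", "g3")): 0.05,
           frozenset(("g2", "g3")): 0.50}
ml_go, cl_go = go_constraints(["g1", "g2", "g3"],
                              lambda a, b: toy_sim[frozenset((a, b))])
```

Pairs are stored as `frozenset` objects so that $(x_i, x_j)$ and $(x_j, x_i)$ are treated as the same constraint; pairs with intermediate similarity, such as `g2` and `g3` here, produce no GO constraint.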

2.2 Semi-supervised clustering objective functions based on multi-source constraints

At present, multi-objective optimization has gradually become a mainstream method for solving gene expression data clustering problems, and it can achieve better clustering results on gene expression data than single-objective optimization methods. In the unsupervised multi-objective clustering problem of gene expression data, the cluster validity indices $J_{FCM}$ (Bezdek et al., 1981) and XB (Xie and Beni, 1991), which measure the intra-cluster compactness and inter-cluster separation respectively, are commonly used as objective functions to realize the evolution of decision variables based on two conflicting objectives (Bandyopadhyay et al., 2007; Maulik et al., 2009; Mukhopadhyay et al., 2013; Li et al., 2022). In this paper, the proposed MSC-CSMC algorithm uses XB and the function based on quadratic-regularized fuzzy c-means with constraint violation penalty, namely, $J_P$ (Mei, 2019), as the objective functions. Furthermore, the constraint violation penalty weights in $J_P$ are improved to achieve semi-supervised clustering of gene expression data based on the multi-source constraints in the NSGA-II framework. The objective functions XB and $J_P$ are as follows:

$$XB = \frac{\sum_{c=1}^{k} \sum_{i=1}^{n} u_{ic}^{2} \lVert x_i - v_c \rVert_2^2}{n \times \min_{f \ne c} \lVert v_f - v_c \rVert_2^2} \tag{5}$$

$$J_P = \sum_{c=1}^{k} \sum_{i=1}^{n} u_{ic} \lVert x_i - v_c \rVert_2^2 + \frac{\eta}{2} \sum_{c=1}^{k} \sum_{i=1}^{n} u_{ic}^{2} - \frac{\beta}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}\, u_i^{\top} u_j \tag{6}$$

Here, $v_c = \frac{\sum_{i=1}^{n} u_{ic} x_i}{\sum_{i=1}^{n} u_{ic}}$ is the $c$th cluster prototype, $k$ is the number of clusters, and the parameters $\eta$ and $\beta$ control the level of fuzziness and the contribution of the penalty term during clustering, respectively. $u_{ic}$ is the membership degree of the datum $x_i$ belonging to the $c$th cluster, obtained by

$$u_{ic} = \frac{1}{k} + \frac{1}{\eta} \left( u_{ic}^{FCMq} + \beta\, u_{ic}^{P} \right) \tag{7}$$

$$u_{ic}^{FCMq} = \frac{1}{k} \sum_{f=1}^{k} \lVert x_i - v_f \rVert_2^2 - \lVert x_i - v_c \rVert_2^2 \tag{8}$$

$$u_{ic}^{P} = \sum_{j=1}^{n} w_{ij} u_{jc} - \frac{1}{k} \sum_{f=1}^{k} \sum_{j=1}^{n} w_{ij} u_{jf} \tag{9}$$

where $w_{ij} \in W$ is the penalty weight for violating the pairwise constraint $(x_i, x_j)$.
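Formulas 5 to 9 can be written down directly in a few lines of Python. The sketch below is a literal transcription under stated assumptions, not the authors' code: the function names are hypothetical, `U` and `W` are plain nested lists, and the update applies Formula 7 exactly as written, so any additional step that keeps memberships inside $[0, 1]$ is not shown.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def xb_index(X, V, U):
    """Formula 5: fuzzy compactness divided by n times the minimum prototype separation."""
    n, k = len(X), len(V)
    compact = sum(U[i][c] ** 2 * sqdist(X[i], V[c]) for i in range(n) for c in range(k))
    separation = min(sqdist(V[f], V[c]) for c in range(k) for f in range(k) if f != c)
    return compact / (n * separation)

def j_p(X, V, U, W, eta, beta):
    """Formula 6: data fit + quadratic regularizer - constraint agreement term."""
    n, k = len(X), len(V)
    fit = sum(U[i][c] * sqdist(X[i], V[c]) for i in range(n) for c in range(k))
    reg = 0.5 * eta * sum(u ** 2 for row in U for u in row)
    agree = sum(W[i][j] * sum(a * b for a, b in zip(U[i], U[j]))
                for i in range(n) for j in range(n))
    return fit + reg - 0.5 * beta * agree

def update_membership(X, V, U, W, eta, beta):
    """One pass of Formulas 7-9, computed from the current memberships U."""
    n, k = len(X), len(V)
    new_U = []
    for i in range(n):
        d = [sqdist(X[i], V[c]) for c in range(k)]
        pull = [sum(W[i][j] * U[j][c] for j in range(n)) for c in range(k)]
        row = []
        for c in range(k):
            u_fcm = sum(d) / k - d[c]          # Formula 8
            u_pen = pull[c] - sum(pull) / k    # Formula 9
            row.append(1.0 / k + (u_fcm + beta * u_pen) / eta)  # Formula 7
        new_U.append(row)
    return new_U

# Toy data: two tight pairs of points and a crisp partition.
X = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
V = [(0.0, 0.5), (4.0, 0.5)]
U = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
W0 = [[0.0] * 4 for _ in range(4)]
W1 = [row[:] for row in W0]
W1[0][1] = W1[1][0] = 1.0   # must-link between points 0 and 1, which the partition satisfies
```

On this toy input, `j_p(X, V, U, W1, 1.0, 1.0)` is smaller than `j_p(X, V, U, W0, 1.0, 1.0)`, which shows how a satisfied must-link constraint with positive weight lowers the objective. Each row produced by `update_membership` sums to 1 because the terms in Formulas 8 and 9 cancel when summed over clusters.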
In order to simultaneously consider both the gene expression constraint set $\Omega = \Omega_{ML} \cup \Omega_{CL}$ and the gene ontology constraint set $\Omega^* = \Omega^*_{ML} \cup \Omega^*_{CL}$, that is, the multi-source constraints proposed in this paper, we improve the constraint violation penalty weights through the following analysis: (1) if the pairwise constraint $(x_i, x_j)$ exists in both $\Omega_{ML}$ and $\Omega^*_{ML}$, or in both $\Omega_{CL}$ and $\Omega^*_{CL}$, the same category information of the gene pair $(x_i, x_j)$ is obtained from gene expression and gene annotation, so the weight of violating this constraint should be increased; (2) if the pairwise constraint $(x_i, x_j)$ exists in $\Omega_{ML}$ but not in $\Omega^*_{ML}$, or exists in $\Omega_{CL}$ but not in $\Omega^*_{CL}$, the category information of the gene pair $(x_i, x_j)$ is not clear enough, so the penalty weight $w_{ij}$ should be decreased; (3) if the pairwise constraint $(x_i, x_j)$ exists in both $\Omega_{ML}$ and $\Omega^*_{CL}$, or in both $\Omega_{CL}$ and $\Omega^*_{ML}$, it should be regarded as a contradictory constraint and removed from the constraint sets $\Omega$ and $\Omega^*$. Based on the above idea, the MSC-CSMC algorithm proposed in this paper improves the constraint violation penalty weight as follows:

$$w_{ij} = \begin{cases} 1 - \theta, & (x_i, x_j) \in \Omega_{ML} \text{ and } (x_i, x_j) \notin \Omega^*_{ML} \\ -1 + \theta, & (x_i, x_j) \in \Omega_{CL} \text{ and } (x_i, x_j) \notin \Omega^*_{CL} \\ 1 + \theta, & (x_i, x_j) \in \Omega_{ML} \text{ and } (x_i, x_j) \in \Omega^*_{ML} \\ -1 - \theta, & (x_i, x_j) \in \Omega_{CL} \text{ and } (x_i, x_j) \in \Omega^*_{CL} \\ 0, & \text{otherwise} \end{cases} \tag{10}$$

with $\theta > 0$ being the GO action parameter. It can be seen that the improved penalty weights can effectively integrate the gene expression and Gene Ontology information, and provide a reasonable violation penalty for pairwise constraints in semi-supervised clustering.
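The case analysis in Formula 10 maps naturally onto a small function. In this sketch the name `penalty_weight` and the representation of each constraint set as a Python set of `frozenset` pairs are assumptions; contradictory pairs are given weight 0, which corresponds to removing them from both constraint sets.

```python
def penalty_weight(pair, ml, cl, ml_go, cl_go, theta):
    """Formula 10: penalty weight w_ij for one gene pair under multi-source constraints."""
    p = frozenset(pair)
    # Case (3): expression and GO disagree, so the constraint is dropped.
    if (p in ml and p in cl_go) or (p in cl and p in ml_go):
        return 0.0
    if p in ml:
        # Confirmed by GO: stronger weight; not confirmed: weaker weight.
        return 1.0 + theta if p in ml_go else 1.0 - theta
    if p in cl:
        return -1.0 - theta if p in cl_go else -1.0 + theta
    return 0.0

# Hypothetical constraint sets over genes g1..g5.
ml = {frozenset(("g1", "g2")), frozenset(("g1", "g3")), frozenset(("g4", "g5"))}
cl = {frozenset(("g2", "g4")), frozenset(("g3", "g5"))}
ml_go = {frozenset(("g1", "g2"))}
cl_go = {frozenset(("g2", "g4")), frozenset(("g4", "g5"))}
```

With `theta = 0.4`, the pair `("g1", "g2")` receives weight 1.4, `("g1", "g3")` receives 0.6, `("g2", "g4")` receives -1.4, `("g3", "g5")` receives -0.6, and the contradictory pair `("g4", "g5")` receives 0.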

2.3 Mixed chromosome encoding strategy used in MSC-CSMC

For the purpose of co-optimizing the constraints selection and clustering in the process of multi-objective evolution, a mixed encoding strategy combining the constraints selection and cluster prototype is adopted, as shown in Figure 3. Let $P$ denote the genetic population, $N$ be the population size, and $s$ be the number of pairwise constraints to be selected. Considering the existence of noisy constraints in the initial pairwise constraint set and to improve the search efficiency of the algorithm, $2s$ constraints are randomly selected from the initial pairwise constraint set to generate the candidate constraint set $\Omega_p$, and a serial number is assigned to each pairwise constraint. For a gene expression dataset with $k$ clusters $X = \{x_1, x_2, \ldots, x_n\}$, $x_i \in \mathbb{R}^d$, the $r$th individual in the $l$th generation $P_r(l)$ consists of two parts: the cluster prototype $P_r^{(v)}(l)$ and the constraints selection $P_r^{(set)}(l)$. Among them, $P_r^{(v)}(l) = (v_{r,1}, v_{r,2}, \ldots, v_{r,k})$ encodes $k$ cluster prototypes $v_{r,c} = (v_{r,c1}, v_{r,c2}, \ldots, v_{r,cd})$ $(1 \le c \le k)$ with real numbers, and $P_r^{(set)}(l) = (g_{r,1}, g_{r,2}, \ldots, g_{r,s})$ encodes the serial numbers of $s$ pairwise constraints $g_{r,j}$ $(1 \le g_{r,j} \le 2s,\ 1 \le j \le s)$ selected from $\Omega_p$ with integers.

FIGURE 3

The mixed chromosome encoding strategy used in MSC-CSMC.

In the proposed algorithm, the two parts of the chromosomes are initialized separately. For the cluster prototype part, in order to ensure initialization quality and population diversity, half of the individuals are encoded as the $k$ cluster prototypes obtained by the density peak method (Rodriguez and Laio, 2014), and the other half are encoded from randomly generated cluster prototypes. For the constraints selection part of each individual, the components are initialized with non-repeated random integers in $[1, 2s]$.
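A possible in-memory layout for the mixed chromosome and its initialization is sketched below. The dictionary layout and the names `init_individual` and `init_population` are hypothetical, and `seed_prototypes` merely stands in for the output of the density peak method, which is not implemented here. Random prototypes are drawn from $[0, 1]$ under the assumption that the data are normalized.

```python
import random

def init_individual(k, d, s, rng, seed_prototypes=None):
    """Build one mixed chromosome: k real-valued prototypes plus s distinct constraint ids."""
    if seed_prototypes is not None:
        prototypes = [list(v) for v in seed_prototypes]   # e.g. density peak centers
    else:
        prototypes = [[rng.random() for _ in range(d)] for _ in range(k)]
    # Non-repeated serial numbers drawn from the 2s candidates in Omega_p.
    constraints = rng.sample(range(1, 2 * s + 1), s)
    return {"prototypes": prototypes, "constraints": constraints}

def init_population(N, k, d, s, seed_prototypes, rng):
    """First half of the population uses the seeded prototypes, the rest are random."""
    half = N // 2
    return [init_individual(k, d, s, rng, seed_prototypes if r < half else None)
            for r in range(N)]

rng = random.Random(0)
population = init_population(N=6, k=2, d=3, s=5,
                             seed_prototypes=[[0.1, 0.2, 0.3], [0.7, 0.8, 0.9]],
                             rng=rng)
```

Keeping the two parts in separate fields makes it straightforward to apply different crossover and mutation operators to the real-valued and integer-valued segments.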

2.4 Genetic operations

In the genetic evolution process of the MSC-CSMC algorithm, the roulette wheel strategy is first used to implement the selection. Since the NSGA-II algorithm tends to select individuals with lower non-domination ranks, for the $r$th individual $P_r(l)$ of the $l$th generation, the selection probability (Zhou and Zhu, 2018) is calculated as follows:

$$p_s(P_r(l)) = \alpha (1 - \alpha)^{f_{rank} - 1} \tag{11}$$

Here, $\alpha \in (0, 1)$ is the selection parameter, and $f_{rank}$ is the non-domination rank of individual $P_r(l)$.
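As a minimal illustration of Formula 11, the sketch below turns non-domination ranks into roulette-wheel weights. The function names are hypothetical, and the use of `random.Random.choices`, which normalizes the weights internally, is an implementation choice made here rather than something specified in the text.

```python
import random

def selection_prob(rank, alpha):
    """Formula 11: weight decays geometrically with the non-domination rank."""
    return alpha * (1.0 - alpha) ** (rank - 1)

def roulette_select(ranks, alpha, rng):
    """Pick the index of one parent, favoring individuals with lower rank."""
    weights = [selection_prob(r, alpha) for r in ranks]
    return rng.choices(range(len(ranks)), weights=weights, k=1)[0]

rng = random.Random(42)
ranks = [1, 1, 2, 3]
parent = roulette_select(ranks, alpha=0.3, rng=rng)
```

With $\alpha = 0.3$, a rank-1 individual has weight 0.3, a rank-2 individual 0.21, and a rank-3 individual 0.147, so lower ranks are preferred without excluding the others.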

For the parent individuals $P_{r_1}(l)$ and $P_{r_2}(l)$, let the crossover probability be $p_c$. Different crossover operators are used for the cluster prototypes and the constraints selection. Among them, $P_{r_1}^{(v)}(l)$ and $P_{r_2}^{(v)}(l)$ generate offspring through the normal distribution crossover operator (Zhang and Luo, 2009), and the offspring cluster prototypes are:

$$offsp_1^{(v)} = \frac{P_{r_1}^{(v)}(l) + P_{r_2}^{(v)}(l)}{2} + \frac{1.481 \times \left(P_{r_1}^{(v)}(l) - P_{r_2}^{(v)}(l)\right)}{2} \times |N(0,1)| \tag{12}$$

$$offsp_2^{(v)} = \frac{P_{r_1}^{(v)}(l) + P_{r_2}^{(v)}(l)}{2} - \frac{1.481 \times \left(P_{r_1}^{(v)}(l) - P_{r_2}^{(v)}(l)\right)}{2} \times |N(0,1)| \tag{13}$$

where $N(0,1)$ is a random variable of normal distribution. The constraints selections $P_{r_1}^{(set)}(l)$ and $P_{r_2}^{(set)}(l)$ adopt the single-point crossover operator. For a random integer $rand_c$ in $[1, s]$, the offspring constraints selections are:

$$offsp_1^{(set)} = \left(g_{r_1,1}, \ldots, g_{r_1,rand_c}, g_{r_2,rand_c+1}, \ldots, g_{r_2,s}\right) \tag{14}$$

$$offsp_2^{(set)} = \left(g_{r_2,1}, \ldots, g_{r_2,rand_c}, g_{r_1,rand_c+1}, \ldots, g_{r_1,s}\right) \tag{15}$$

If repeated pairwise constraints appear after crossover, non-repeated pairwise constraints are randomly selected from the candidate constraint set $\Omega_p$ as a replacement. For individual $P_r(l)$, different mutation operators are adopted for the two parts. The polynomial mutation operator (Rousseeuw, 1987) is applied to $P_r^{(v)}(l)$, where site $v_{r,ci}$ mutates with probability $p_m$:

$$v'_{r,ci} = v_{r,ci} + \delta \times (v_u - v_l), \quad 1 \le c \le k,\ 1 \le i \le d \tag{16}$$

where $v_u$ and $v_l$ are the upper and lower bounds of the cluster prototype, respectively. For normalized gene expression data, the bounds are set to 1 and 0. $\delta$ is determined as follows (Deb and Tiwari, 2008):

$$\delta = \begin{cases} \left[2 \times rand_m + (1 - 2 \times rand_m)\left(1 - v_{r,ci}\right)^{\eta_m + 1}\right]^{\frac{1}{\eta_m + 1}} - 1, & rand_m < 0.5 \\ 1 - \left[2 \times (1 - rand_m) + 2 \times (rand_m - 0.5)\, v_{r,ci}^{\,\eta_m + 1}\right]^{\frac{1}{\eta_m + 1}}, & rand_m \ge 0.5 \end{cases} \tag{17}$$

Here, $\eta_m$ is the distribution index, and $rand_m$ is a random number in $[0, 1]$.
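The operators in Formulas 12 to 17 can be sketched as follows. This is an illustrative reading rather than the authors' implementation: the function names are hypothetical, the prototype part is treated as a flat list of real values, one $|N(0,1)|$ sample is drawn per component (the text does not say whether it is drawn per component or per individual), and offspring prototypes are not clipped to the bounds.

```python
import random

def ndx_crossover(p1, p2, rng):
    """Formulas 12-13: normal distribution crossover on two flat prototype vectors."""
    c1, c2 = [], []
    for a, b in zip(p1, p2):
        mid = (a + b) / 2.0
        spread = 1.481 * (a - b) / 2.0 * abs(rng.gauss(0.0, 1.0))
        c1.append(mid + spread)
        c2.append(mid - spread)
    return c1, c2

def repair(child, n_candidates, rng):
    """Replace repeated constraint ids with unused ids from the candidate set."""
    present = set(child)
    unused = [g for g in range(1, n_candidates + 1) if g not in present]
    rng.shuffle(unused)
    seen, fixed = set(), []
    for g in child:
        if g in seen:
            g = unused.pop()
        seen.add(g)
        fixed.append(g)
    return fixed

def single_point_crossover(g1, g2, n_candidates, rng):
    """Formulas 14-15 followed by the duplicate repair described in the text."""
    cut = rng.randint(1, len(g1))   # rand_c in [1, s]
    kids = (g1[:cut] + g2[cut:], g2[:cut] + g1[cut:])
    return [repair(kid, n_candidates, rng) for kid in kids]

def poly_delta(v, u, eta_m):
    """Formula 17 for bounds 0 and 1, with u playing the role of rand_m."""
    e = eta_m + 1.0
    if u < 0.5:
        return (2.0 * u + (1.0 - 2.0 * u) * (1.0 - v) ** e) ** (1.0 / e) - 1.0
    return 1.0 - (2.0 * (1.0 - u) + 2.0 * (u - 0.5) * v ** e) ** (1.0 / e)

def mutate_site(v, rng, eta_m=5.0, lower=0.0, upper=1.0):
    """Formula 16: shift one prototype coordinate by delta times the bound width."""
    return v + poly_delta(v, rng.random(), eta_m) * (upper - lower)
```

For a coordinate in $[0, 1]$, `poly_delta` returns $-v$ when `u` is 0 and 0 when `u` is 0.5, and approaches $1 - v$ as `u` approaches 1, so the mutated value stays within the bounds. A caller would apply `mutate_site` to each coordinate with probability $p_m$.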
For $P_r^{(set)}(l)$, random mutation is used: a position in $P_r^{(set)}(l)$ is selected at random, and its value is replaced with a random integer in $[1, 2s]$ that does not repeat any other value in the individual. In summary, the procedure of the MSC-CSMC algorithm is shown as follows:

Input: Gene expression dataset $X$, number of neighbors $b$, density drop rate $\varepsilon$, population size $N$, maximal number of generations $L_{max}$, number of clusters $k$, fuzzy parameter $\eta$, penalty parameter $\beta$, constraint number $s$, GO action parameter $\theta$, selection parameter $\alpha$, crossover probability $p_c$, mutation probability $p_m$, and distribution index $\eta_m$.

Step 1: Generate the gene expression pairwise constraint set $\Omega$ using the density tracking method.

Step 2: Calculate the functional similarity of genes based on AIC, and generate the gene ontology pairwise constraint set $\Omega^*$. Then delete the contradictory constraints, and determine the penalty weight matrix $W$ corresponding to the multi-source constraints based on Formula 10.

Step 3: Randomly select $2s$ pairwise constraints from the initial constraint set to construct the candidate constraint set $\Omega_p$, and initialize the population.

Step 4: For the current generation index $l$ $(l = 1, 2, \ldots, L_{max})$ and each individual $P_r(l)$ $(1 \le r \le N)$, decode the individual to obtain the cluster prototypes and the selected pairwise constraints. Update the membership degree according to Formulas 7-9, and calculate the individual fitness values based on Formulas 5-6.

Step 5: According to the individual fitness values, calculate the non-domination rank and crowding distance of each individual.

Step 6: Apply selection, crossover, and mutation based on Formulas 11-17, and update the individual fitness values according to Formulas 5-6.

Step 7: Merge the parent and offspring populations, and select the next generation according to the elite retention strategy.

Step 8: If $l = 0.5 \times L_{max}$ or $l = 0.8 \times L_{max}$, update the penalty parameter $\beta = 2 \times \beta$ to increase the penalty for violating the currently selected constraints.

Step 9: Set $l = l + 1$, and repeat Steps 4-8 until the maximal number of generations $L_{max}$ is reached.

Output: The Pareto optimal solutions.
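Step 5 relies on non-domination ranks. The sketch below shows one simple way to compute them for a list of objective pairs; it assumes that both objectives (XB and $J_P$) are minimized, uses hypothetical function names, and peels off fronts one at a time rather than reproducing the bookkeeping of NSGA-II's fast non-dominated sort. Crowding distance is not included.

```python
def dominates(a, b):
    """a dominates b if it is no worse in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def nondomination_ranks(objectives):
    """Assign rank 1 to the Pareto front, rank 2 to the next front, and so on."""
    ranks = [0] * len(objectives)
    remaining = set(range(len(objectives)))
    rank = 1
    while remaining:
        front = {i for i in remaining
                 if not any(dominates(objectives[j], objectives[i])
                            for j in remaining if j != i)}
        for i in front:
            ranks[i] = rank
        remaining -= front
        rank += 1
    return ranks

# Hypothetical (XB, J_P) values for five individuals.
objs = [(1.0, 5.0), (2.0, 2.0), (5.0, 1.0), (3.0, 3.0), (6.0, 6.0)]
ranks = nondomination_ranks(objs)
```

The individuals with rank 1 form the current Pareto front; after the final generation, these are the solutions returned as output.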

3 Results

3.1 Datasets

In this study, five benchmark gene expression datasets, namely, Yeast Galactose Metabolism, Yeast Cell Cycle, Yeast Sporulation, Serum, and Arabidopsis are used for the experiment.

The Yeast Galactose Metabolism dataset (Ideker et al., 2001) is composed of 205 genes whose expression patterns reflect four functional categories. The gene expression profiles were measured with four replicate assays across 20 time points. The Yeast Cell Cycle dataset (Cho et al., 1998) contains the expression levels of 384 genes involved in yeast cell cycle regulation at 17 time points, and these data are related to five phases of the cell cycle. The Yeast Sporulation dataset (Chu et al., 1998) contains the expression levels of more than 6,000 genes measured during the sporulation process of budding yeast across seven time points. The genes that showed no significant changes in expression during the harvesting were excluded, and the resulting set consists of 474 genes. The Serum dataset (Iyer et al., 1999) contains the expression levels of 517 human genes. The dataset has 13 dimensions corresponding to 12 time points and 1 unsynchronized sample. The Arabidopsis dataset (Reymond et al., 2000) consists of 138 Arabidopsis thaliana genes. Each gene has eight expression values that correspond to eight time points. The details of the datasets are shown in Table 1.

TABLE 1

Description of datasets.

Dataset	Number of genes	Number of features	Number of clusters
Yeast Galactose Metabolism	205	80	4
Yeast Cell Cycle	384	17	5
Yeast Sporulation	474	7	6
Serum	517	13	6
Arabidopsis	138	8	4

3.2 Model evaluation criteria and parameter assignment

In order to evaluate the effectiveness of the model, the silhouette index (Rousseeuw, 1987) is chosen as the evaluation criterion for the clustering results. For gene $x_i$, the silhouette width is calculated as follows:

$$S(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}, \quad 1 \le i \le n \tag{18}$$

Here, $a(i)$ is the average distance from gene $x_i$ to the other genes in the same cluster, and $b(i)$ is the minimum average distance between gene $x_i$ and the genes in each of the other clusters. The silhouette index SI of dataset $X$ is the mean value of the silhouette widths of all genes, with $SI \in [-1, 1]$. A larger SI value indicates better clustering quality. Besides, as suggested by Saha and Bandyopadhyay (2013), the final solution of MSC-CSMC is selected from the Pareto optimal solutions by using the silhouette index.
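A compact way to check an implementation of Formula 18 is to evaluate it on a toy dataset where the values can be worked out by hand. The sketch below is illustrative: the function name is hypothetical, it assumes Euclidean distance and at least two clusters, and it assigns width 0 to genes in singleton clusters, a common convention that the text does not specify.

```python
import math

def silhouette_index(X, labels):
    """Mean silhouette width (Formula 18) over all points, using Euclidean distance."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    def mean_dist(i, members):
        return sum(math.dist(X[i], X[j]) for j in members) / len(members)

    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue                      # singleton cluster: width taken as 0
        a = mean_dist(i, own)             # a(i): cohesion within the own cluster
        b = min(mean_dist(i, members)     # b(i): nearest other cluster
                for other, members in clusters.items() if other != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(X)

X = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_index(X, [0, 0, 1, 1])
bad = silhouette_index(X, [0, 1, 0, 1])
```

For the well-separated labeling, the first point has $a = 1$ and $b = 10.5$, the second has $a = 1$ and $b = 9.5$, and the remaining two mirror them, so SI is the average of $9.5/10.5$ and $8.5/9.5$. The mixed labeling yields a negative value.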

Following Mei (2019) and Abin and Vu (2020), the parameters of MSC-CSMC are assigned as follows: $\varepsilon = 0.8$, $b = 10$, $\eta = 0.001$, $\beta = 0.1$, $N = 100$, $L_{max} = 300$, $\alpha = 0.3$, $\eta_m = 5$, $p_c = 0.8$, $p_m = 0.1$. The number of pairwise constraints $s$ is chosen as 0, 5, 10, 15, 20, and 25. In gene expression data analysis, the determination of the number of clusters $k$ is an open problem. Generally, there are two approaches to determine the value of $k$. One is to directly set it to the true number of clusters (Yu et al., 2018; Zhao et al., 2021; Li et al., 2022; Liu et al., 2022; Wu and Ma, 2022). The other applies when the true number of clusters is unknown: the range of $k$ is determined first, and the $k$ corresponding to the optimal value of an index (Silhouette index, Dunn index, Davies–Bouldin index, etc.) is chosen as the optimal number of clusters (Gao et al., 2019; Acharya et al., 2020; López-Cortés et al., 2020; Zhang et al., 2022). In this paper, we adopt the first approach, and the number of clusters $k$ is selected according to Table 1. In order to analyze the impact of the GO action parameter $\theta$, we vary $\theta$ from 0.1 to 0.9 in steps of 0.1 with the number of pairwise constraints fixed at 15. The results are shown in Figure 4. It can be seen that the value of SI barely changes as $\theta$ increases, which means that the algorithm is not very sensitive to the value of $\theta$. For Yeast Galactose Metabolism, Yeast Cell Cycle, Yeast Sporulation, Serum, and Arabidopsis, the $\theta$ values are set to 0.4, 0.7, 0.6, 0.5, and 0.4, respectively, which lead to the optimal clustering performance.

FIGURE 4

The impact of parameter θ on SI tested on different datasets. (A) Yeast Galactose Metabolism (B) Yeast Cell Cycle (C) Yeast Sporulation (D) Serum (E) Arabidopsis.

3.3 Result analysis and model comparison

To examine the performance of the proposed MSC-CSMC algorithm, several advanced semi-supervised clustering algorithms based on single-source constraints, including COP-Kmeans (Wagstaff et al., 2001), PCKMeans (Basu et al., 2004), MPCKMeans (Bilenko et al., 2004), PCCA (Grira et al., 2008), PCFCMq (Mei, 2019) and MSC-CS (Zhao and Li, 2022), are used for comparison. Among them, the MSC-CS algorithm is the single-source constrained version of MSC-CSMC, which does not consider the annotation information provided by GO. In the above algorithms, the pairwise constraints are randomly selected from the initial gene expression constraint set $\Omega$. To avoid the influence of randomness, each method is run ten times under the same number of pairwise constraints, and the mean value of the clustering results is taken as the final result. The SI values of all seven algorithms applied to the five datasets are shown in Tables 2–6, where the optimal solutions in each row are highlighted in bold.

TABLE 2

SI values on Yeast Galactose Metabolism with different number of constraints.

s	COP-Kmeans	PCKMeans	MPCKMeans	PCCA	PCFCMq	MSC-CS	MSC-CSMC
0	0.384	0.254	0.305	0.525	0.465	0.566	0.566
5	0.423	0.479	0.258	0.348	0.254	0.583	0.628
10	0.460	0.484	0.471	0.144	0.274	0.592	0.631
15	0.458	0.484	0.463	0.198	0.402	0.645	0.668
20	0.459	0.457	0.370	0.383	0.351	0.645	0.668
25	0.445	0.433	0.413	0.351	0.290	0.645	0.668