Review Article

Deep Learning in Cell Image Analysis



Introduction
Cell image analysis plays an important role in biomedical research; it has become the main strategy for studying how a living system responds to environmental changes. For example, drug discovery is a crucial process of synthesizing and screening potential candidate medications, and it is costly to study the responses of intact cells or entire organisms to specific chemical substances. Phenotypic screening has been demonstrated to be a superior strategy for small-molecule and first-in-class medicines based on phenotypic analysis of responses [1,2]. Generally, to analyze the biological changes in cells, phenotypic screening is performed with a high-content screening (HCS) method, in which cells are stained with multichannel fluorescent probes and cultured in plates with multiple isolated wells subjected to different treatments [3][4][5]. Before entering clinical trials, molecules must be validated in vitro [6], a stage that unfortunately has a high attrition rate. Moreover, previous research [7][8][9] indicates that using HCS to select small molecules has an estimated hit rate of only 0-0.01%, which depends heavily on the professional knowledge of individual biologists and the quality of the screening compound pool. Given these laborious and error-prone procedures, automated cell image analysis is increasingly crucial for accelerating and improving phenotypic screening.
The goal of cell image analysis is to analyze the phenotypic effects of various treatments and to reveal the relationships between them. The most widely studied tasks of cell image analysis include segmentation, tracking, and classification [4][5][6][7][8][9][10]. These tasks have drawn extensive attention from both academia and industry. Recently, a bioimage challenge called the Cell Tracking Challenge [11] for live cell segmentation and tracking was held under the auspices of the IEEE International Symposium on Biomedical Imaging (ISBI). This challenge maintained a benchmark of 20 different treatments, including various imaging methods and cell types. New-generation biotechnology companies, such as Insitro and Recursion, have been using machine learning techniques for large-scale cell image analysis to promote drug discovery. Notably, Recursion also released a series of open-source datasets called RxRx, aiming to extract phenotypic features by classifying the treatments imposed on cells based solely on raw image inputs. To make the quantitative and statistical analyses of cell images automated and high throughput, many software packages are available, such as ImageJ [12], CellProfiler [13], Icy [14], and CellCognition [15], which typically contain plentiful plugins that allow biologists to design a customized pipeline to perform different tasks.
Deep learning, the most extensively used emerging machine-learning technique, has achieved remarkable success in computer vision and natural language processing [16][17][18][19][20]. In deep learning, a deep neural network (DNN) is trained as an end-to-end model to directly infer the desired labels from input data [21]. In contrast to traditional computer vision techniques, a DNN can automatically produce more effective representations than handcrafted ones by learning from a large-scale dataset. In cell images, deep learning-based methods also show promising results in cell segmentation [22][23][24] and tracking [25][26][27]. Such successful applications demonstrate the ability of DNNs to extract high-level features and shed light on the potential of using deep learning to reveal more sophisticated biological principles behind cellular phenotypes [28]. In addition, a vital breakthrough in computer vision called representation learning [29][30][31][32][33][34] also provides confidence that phenotypic features can be learned end-to-end by DNNs more efficiently.
Deep learning has shown a powerful ability to extract useful information from raw inputs; however, it is highly influenced by the quality of the dataset. As shown in Figure 1, a typical deep-learning method consists of two modules: an inference module and a retraining module. When the test environment changes, if the inference module does not achieve satisfactory generalization performance, the DNN must be retrained to adapt to the new data domain using extra annotations. However, the annotations of cell images are considerably more expensive than those of natural-scene images because they require expert knowledge to assign labels, and cell images themselves are difficult to collect. Moreover, rare cases are more valuable in cell image analysis than in natural-scene images. These factors further exacerbate the data hunger of deep-learning methods, limiting their practical application. Because annotation is the most onerous workload, some promising machine-learning technologies, such as active learning, transfer learning, and noisy-label learning, have been proposed to address this problem. These technologies aim to train a more robust and generalizable DNN with minimal supervision.
In this study, we provide a comprehensive survey of the current progress on three critical tasks in cell image analysis. In addition, we discuss the challenges of applying machine-learning algorithms to cell image analysis. In contrast to previous survey studies [28,35], we provide a more technical perspective on deep learning in cell image analysis.

The Current Progress of Computational Methods in Cellular Image Processing

In this section, we discuss the current progress in applying deep learning to three crucial tasks in cell image analysis: segmentation, tracking, and classification.

2.1. Segmentation.
Owing to its high homology with traditional computer vision tasks, cell segmentation is a popular topic in the computer vision community. Traditional cell semantic segmentation methods are based on image processing techniques such as level set [36], watershed [37], graph cut [38], optimization using intensity features [39], and physics-based restoration [40][41][42]. These methods lack flexibility and automation; thus, users regularly need to adjust their parameters to handle different images.
In contrast to cell semantic segmentation, cell instance segmentation requires not only discriminating which pixels belong to the class of interest (i.e., the cell) but also distinguishing each individual cell. The existing popular methods can be divided into anchor- and region-based methods. Anchor-based methods were originally built for general purposes, such as Faster RCNN [43], Mask RCNN [44], and RetinaNet [45], and have been successfully applied to cellular images [46][47][48]. These methods adopt a deep convolutional neural network (CNN) as the feature extractor, such as VGG [49] or ResNet [50], and classify each predefined anchor across the input images. Using the anchor mechanism, each pixel can be assigned to multiple anchors, allowing multiple and even overlapping cells to be detected simultaneously. After the network predicts the probability of a cell being present in each anchor, a postprocessing step known as non-maximum suppression (NMS) [51] is used to select the top-scoring anchors as the final result. Recently, region-based methods have gained increasing attention because of their simple and end-to-end training schemes. Region-based methods transform each pixel into a desired representation, from which subsequent algorithms can recover individual cells. A basic representation is a two-channel feature map that includes a cell probability channel and a cell boundary channel [27], where the cell probability channel indicates whether a pixel belongs to a cell and the cell boundary channel is used to divide cell instances. This feature map is then fed into a post-processing algorithm to produce the final result. Song et al. [52] proposed a shape-based method that calculates a shape prior to refine the masks of touching cells. Alternatively, some methods changed the binary cell probability map into a distance map and transformed the cell segmentation problem into a regression task. Kainz et al. 
[53] defined the output with regard to the Euclidean distance between each pixel and its closest annotated cell center but neglected information about cell boundaries. Bai and Urtasun [54] proposed a method called deep watershed that adopted a center-to-boundary distance map for instance segmentation, followed by a watershed algorithm [55] to produce final masks, and it was successfully applied to cellular images [56][57][58]. Instead of predicting the distance of every pixel inside the cells, Schmidt et al. [59] used a star-convex polygon to represent a cell instance, which only required predicting the distance between the center and the boundary at several particular angles. However, a star-convex polygon is not sufficient to represent nonconvex cells. Recently, a vector-field label was proposed for tracking objects [60] and adopted in instance segmentation [61]. In contrast to the distance map, Stringer et al. [62] proposed a vector-field label called Cellpose, which only focuses on the local gradient of each pixel. A conspicuous property of Cellpose is that each vector points only to the neighboring pixel that is closer to the center and in the cell region. Thus, the resultant representations can easily model extremely nonconvex cells.
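The NMS post-processing step used by anchor-based methods can be sketched as follows; this greedy keep-and-suppress loop is a minimal illustration under simple box representations, not any particular detector's implementation:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression: repeatedly keep the highest-scoring
    box and drop any remaining box overlapping it above the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

Because overlapping cells are common, the IoU threshold must be tuned with care: too low a threshold suppresses genuinely distinct neighboring cells.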
In summary, anchor- and region-based methods have their own advantages. First, anchor-based methods (Figure 2(a)) are generally more accurate than region-based methods (Figure 2(b)) if the Intersection over Union (IoU) threshold is not excessively high. Anchor-based methods focus more on the wholeness of objects, whereas region-based methods focus on details such as accurate boundaries. Second, region-based methods require fewer computing resources than anchor-based methods do, because learning features with more "objectness" often requires a more complicated design and deeper neural networks. Third, region-based methods require significantly more burdensome post-processing than anchor-based methods. Post-processing not only reduces the overall efficiency but also introduces additional hyperparameters. In practice, most biology laboratories have limited computing resources; more accurate and faster algorithms with lower computational consumption remain to be explored.
Deep learning-based methods can achieve remarkable performance in cell segmentation after training with a large-scale and carefully annotated dataset. However, cell images can vary with different treatments, such as different cell types, stains, or even carbon dioxide concentrations.
Moreover, it is very expensive to collect a carefully annotated dataset because the annotation of cell images requires expert knowledge. These barriers cause many researchers to test their methods only on limited and imperfect data, which cannot yield promising results in practice. To this end, an increasing number of large-scale datasets covering multiple experimental environments have been proposed to reflect the practical performance of different algorithms. For example, in cell nucleus segmentation, the Data Science Bowl challenge amassed a dataset of various images of nuclei with up to 30 different treatments or image types [63]. For whole-cell segmentation, the authors of Cellpose [62] proposed a generalized dataset containing roughly 10 different treatments (the value is approximate because the Cellpose dataset contains images collected from Google) and up to 608 images, with 69 images held out for testing. In particular, this dataset also included 184 non-cell images with repeating convex patterns for better generalization. For more specific applications, Greenwald et al. built a large-scale tissue fluorescence dataset called TissueNet across six imaging platforms and nine organs, which also covered different disease states and species [24]. Edlund et al. manually annotated a large-scale dataset for live-cell imaging with label-free and phase-contrast images of 2D cell culture, named LIVECell, which consists of more than 1.6 million annotated cells of eight morphologically distinct cell types, grown from early seeding to full confluence [64]. The performance reported on such large-scale datasets is more convincing for researchers.

2.2. Tracking. Tracking is another fundamental task in cell image analysis. Monitoring cell behaviors throughout the lineage can provide useful information for drug discovery, including quantification of signaling dynamics [65], efforts to understand cell motility [66], and attempts to unravel the laws of bacterial cell growth [67]. Such an analysis must associate each cell entity over time. However, simultaneously tracking thousands of cells is challenging. First, in contrast to general object tracking, the phenotype of a cell is no longer a reliable feature for discriminating between cells because all cells share a similar appearance in this task. Second, during cell culture, some cells, such as stem cells, can undergo serious deformation between frames. Third, owing to phototoxicity and photobleaching during imaging, the frame rate of imaging is often limited.
Before the wide adoption of deep learning-based methods, cell tracking was associated with probabilistic models [68,69] and active contour models [70][71][72]. These methods globally optimize a graph or probability map, where each connection between cells indicates a carefully designed energy function of a particular event, such as normal connection, mitosis, and move-in/out of the field of view. Most deep learning-based methods continue to follow tracking-by-detection schemes [47,[73][74][75] and do not take advantage of the rich information of spatial-temporal cues. For example, the historical dynamics of cells can predict the location of each cell in the current frame, and the dynamics of cells can be accumulated through accurate cell detection. A few methods have been proposed for jointly learning cell detection/segmentation and tracking. Payer et al. proposed a recurrent stacked hourglass network (ConvGRU) that jointly optimized the network with both segmentation labels and tracking information [76], where the network was forced to provide a similar embedding for linked cell pairs. However, this method could only handle cell images with high magnification (for more detailed features) and a high frame rate (for reduced phenotype change over time), which limited its application. Zhou et al. used two variants of U-Net to jointly perform segmentation and tracking [77]. However, they only leveraged multiframe input for better segmentation results and used heuristic functions (such as IoU) for the final tracking results. Thus, the tracking context was not implicitly involved in the network training. Hayashida et al. proposed a vector field map called MPM to simultaneously encode the location and motion of the cells [78]. Similar to another tracking algorithm for general objects [79], the MPM treats objects/cells as points. 
The MPM adopts two successive images as input and produces shifted vectors, where the norm of each vector indicates a cell center and its direction indicates the former location of the cell. However, MPM has only been tested on a small fraction of tracks over a large-scale dataset owing to the lack of annotations. Thus, the practical performance of MPM still must be proven.
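The frame-to-frame association step at the heart of tracking-by-detection can be sketched as follows; the greedy nearest-centroid matching and the distance gate here are illustrative assumptions, not taken from any cited method:

```python
import math

def link_frames(prev_centroids, curr_centroids, max_dist=20.0):
    """Greedy frame-to-frame association: match each current detection to
    the nearest unmatched detection in the previous frame, within a
    distance gate. Returns {current index: previous index or None (new cell)}."""
    pairs = []
    for i, p in enumerate(prev_centroids):
        for j, c in enumerate(curr_centroids):
            d = math.dist(p, c)
            if d <= max_dist:
                pairs.append((d, i, j))
    pairs.sort()  # closest candidate pairs are linked first
    used_prev, used_curr = set(), set()
    links = {j: None for j in range(len(curr_centroids))}
    for d, i, j in pairs:
        if i not in used_prev and j not in used_curr:
            links[j] = i
            used_prev.add(i)
            used_curr.add(j)
    return links
```

Real systems replace this greedy step with global optimization (e.g., over a lineage graph) and add explicit handling for mitosis and cells entering or leaving the field of view.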

2.3. Classification.
Classification often serves as a downstream analysis task for phenotypic screening and cell profiling. After individual cells are located, each cell is converted into a high-dimensional feature vector that contains various types of phenotypic information. Typical applications include classifying different gene mutations [80][81][82] and mechanisms of action (MoA) [83,84]. Before the popularity of deep learning, a classic workflow for processing the feature vector included quality control (to remove outlier samples), preprocessing (normalization, batch-effect correction, etc.), dimensionality reduction (using data-analysis strategies to select useful features while eliminating unnecessary or redundant features), and finally, a classification algorithm. The choice of algorithm depends on whether the goal is to interpret (clustering) or validate (classification) phenotypic features. Classic methods include hierarchical clustering [85,86], nearest neighbors [87], Bayesian matrix factorization, neural networks, and random forests [88]. These methods rely heavily on the quality of the preceding steps.
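The classic workflow above can be sketched in miniature: z-score normalization stands in for the preprocessing stage and a nearest-centroid rule for the classification stage (feature values and class names below are illustrative):

```python
import math

def zscore(profiles):
    """Per-feature z-score normalization across all cell profiles."""
    n, d = len(profiles), len(profiles[0])
    means = [sum(p[k] for p in profiles) / n for k in range(d)]
    stds = [math.sqrt(sum((p[k] - means[k]) ** 2 for p in profiles) / n) or 1.0
            for k in range(d)]
    return [[(p[k] - means[k]) / stds[k] for k in range(d)] for p in profiles]

def nearest_centroid(train, labels, query):
    """Assign the query profile to the class whose mean profile is closest."""
    groups = {}
    for p, y in zip(train, labels):
        groups.setdefault(y, []).append(p)
    best, best_d = None, float("inf")
    for y, pts in groups.items():
        centroid = [sum(v) / len(pts) for v in zip(*pts)]
        d = math.dist(centroid, query)
        if d < best_d:
            best, best_d = y, d
    return best
```

In a real pipeline, each stage (quality control, batch-effect correction, dimensionality reduction) adds its own parameters, which is exactly the brittleness that end-to-end deep models aim to remove.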
With the impressive capability of CNNs to extract abstract features from images, researchers have started using CNNs to analyze cell images end-to-end, replacing the onerous pipeline. However, cell images are often collected in a high-content scheme; thus, each image can contain up to hundreds of cells with different phenotypes (including outliers). To infer the correct type from both positive and negative cells, Kraus et al. used multiple instance learning (MIL) to train an integral CNN to jointly segment and predict the label of cell images [89]. This algorithm could generate class-specific segmentation masks, which demonstrated its capability of filtering out outliers. Godinez et al. [90] adopted a multiscale CNN architecture to eliminate the segmentation step and classify phenotypes directly, which further reduced the annotation effort. They tested their algorithm on a real-world dataset, where it showed a greater capability to distinguish phenotypes than the conventional pipeline using handcrafted features. However, CNN classifiers often must be trained under the supervision of meaningful labels to learn useful features, and such labels are difficult to obtain for cell images. In realistic scenarios, we have only partial or no prior knowledge of the target compounds, which narrows the application of supervised classification.

Recently, a vital breakthrough in computer vision known as representation learning [29][30][31][32][33][34] has aimed to learn a good general representation without task-specific supervision. Extensive methods have been proposed for natural images and have yielded remarkable results. In contrast to natural images, metadata such as batch numbers, compound concentrations, and genetic perturbations are immediately available together with cell image data. Thus, many methods use the metadata of cell images as a pretext or surrogate classification task to train CNNs, expecting to obtain more discriminative feature representations of cell phenotypes [91][92][93]. Specifically, Caicedo et al. [92] used individual cells as inputs rather than the entire image. Individual cells as input can filter out interference from the background and non-cell impurities but also cause the network to neglect global information such as cell densities. Spiegel et al. [93] further investigated the impact of the number of classes in pretext tasks and the difference between implicit and explicit learning [94]. Instead of using metadata for surrogate supervision, Janssens et al. [95] used deep clustering [96] to assign pseudolabels to each feature vector. Both approaches obtained promising results on the BBBC021 dataset [97,98], a public dataset for validating the MoAs of 103 different compound concentrations. More recently, Wang et al. proposed a framework called TEAMs [99] and achieved state-of-the-art performance on three cell-painting datasets [100]. Similar to [93], they also used metadata as supervision and built a framework based on conventional metric learning [94]. They further upgraded it with three modules to handle the negative sampling of metric learning and the distribution shift between training and testing.
Another line of research involves learning cell embeddings by reconstructing cell images. The basic concept is that, given a good representation of the cell phenotype, an accurate cell image can be reconstructed from it. Goldsborough et al. first used a generative adversarial network (GAN) [101] to generate cell images [102]. They used the output of the penultimate layer of the discriminator as a feature representation of cell phenotypes. Because the effectiveness of the discriminator of a GAN is highly dependent on the generator, this method could not obtain satisfactory results. Lu et al. [103] used an encoder-decoder network to reconstruct cells. They used a fully observed cell as an information source for phenotypes to paint an incomplete cell image with the target channel manually concealed. After the network converges, the encoder output is the final representation. Kobayashi et al. further proposed a new pipeline called Cytoself, based on a VQ-VAE-2 [104] model, to reconstruct endogenously tagged fluorescent images and classify tagged proteins simultaneously [105]. They tested their method on a dataset that tagged 1311 different proteins and showed that such self-supervised algorithms could learn useful feature representations encoding the localization information of proteins, which is highly associated with protein functionality [106]. Contrastive learning has also been applied to cell images. For instance, Perakis et al. deployed the simCLR [107] framework to learn cell phenotypic features [108]. However, they achieved only a marginal improvement.
Overall, both types of approaches have been proven to successfully learn useful representations of cell phenotypes for downstream analyses. However, more effective frameworks are yet to be developed. For example, the information contained in different treatments is valuable for the implicit representation of cells, as different molecules can cause different phenotypes. Yang et al. used molecular embeddings to guide cell image synthesis and achieved significant improvements [109]. Using graph neural networks to predict molecular properties also achieves impressive results [109][110][111][112][113]. Thus, fully exploiting multimodal information appears promising for cell image analysis.

The Challenges and Opportunities of Deep-Learning Methods in Cellular Image Processing
As discussed in the previous sections, deep learning has demonstrated an incredible ability to perform cell image analysis. However, there remains a significant performance gap between deep-learning algorithms in academic research and those in practical applications. Generally, sufficient and accurate training data are necessary for deep-learning algorithms to guarantee the generalization of the trained models. However, in practice, it is laborious and demanding to collect exhaustive annotations of cell images, which requires numerous biological experts and their efforts. Consequently, practical cell image datasets, which may contain defective training data, can dramatically degrade the performance of deep-learning algorithms. Thus, to further improve deep learning-based cell image analysis, defects in cell image datasets should also be considered and properly solved. In this section, we discuss the major challenges of cell image analysis and the existing methods from a data perspective. We focus on three aspects of cell image datasets: data quantity, data quality, and data confidence, which are discussed in detail in the following sections.

3.1. Deep Learning with Small but Expensive Datasets.
Currently, although cell images can be collected using microscopy and cameras in a high-throughput fashion, constructing a large-scale cell image dataset remains a strenuous task. Compared with common images, cell images require knowledgeable biological experts to assign labels image by image, which is time-consuming and demanding. Thus, the scale of cell image datasets is often limited by the difficulty of annotation. Fortunately, alternative strategies are available for mitigating this problem. The first strategy is dataset expansion (Section 3.1.1), which aims to increase the quantity of training data from labeled or unlabeled images. Data augmentation, a widely adopted technique in deep learning, can help acquire extra training images from labeled images by performing image transformations. In Section 3.1.1, we focus on a key technique called active learning, which can automatically select valuable unlabeled images to expand the training data by interacting with human experts. The second strategy is knowledge transference (Section 3.1.2), which aims to improve the performance of deep-learning models by transferring knowledge contained in other datasets. In Section 3.1.2, we discuss a corresponding technique called transfer learning, which can be divided into parameter- and feature-based approaches.

3.1.1. Collecting Datasets Efficiently.
As discussed previously, the manual annotation of cell images is laborious and expensive and can only be performed by professionals with rich knowledge. Moreover, although the performance of deep-learning models increases with the amount of data, not all labeled data have the same value for learning effective feature representations. Active learning has been proposed to address these difficulties by selecting training samples with high learning value from the unlabeled data pool for annotation. Under the same cost, active learning does not increase the quantity of labeled data but constructs a more valuable training dataset, aiming to increase the performance of the trained learning algorithms.
To investigate the effectiveness of active learning in reducing the cost of phenotypic classification, Smith and Horvath [114] compared various combinations of active learning methods, such as least confident, vote entropy sampling, and margin sampling, with supervised learning algorithms, such as support vector machine (SVM), naïve Bayes, and random forest. Their experimental results on three phenotyping datasets demonstrated that active learning could achieve a performance similar to that of previous methods while largely reducing the labeling cost. Cell tracking aims to capture the dynamic movement of cells and is a fundamental tool in high-content screening for modern drug discovery. Compared with general object tracking, cell tracking faces some unique challenges, such as high similarity in the appearance of cells, low temporal resolution, and various cellular activities, thus requiring more labeled data. Lou et al. [115] proposed a structured learning model with an active learning strategy for cell tracking that has the advantages of automatic parameter learning, higher feature dimensions, and lower annotation cost. Their active learning strategy included four components: dividing images into representative patches, measuring the uncertainty of unlabeled structured data, updating the parameters of the model, and checking the terminal criteria. They evaluated the proposed cell tracking algorithm on five datasets, and the experimental results showed that their active learning method could only use 17% labeled data to achieve a performance similar to that of the baseline model with all training data. Screening methods have been widely used to identify drug candidates that perturb specific targets. Ideally, conducting experiments to test all combinations would be an effective method to discover the desired drug candidates. However, this approach is usually infeasible. Naik et al. 
[116] showed that an active learning method without any prior knowledge could iteratively select a subset of biology experiments from a pool to learn various compound effects and could outperform the strategies that might be employed by humans. Automatic segmentation of nuclei aims to extract nucleus pixels from entire tissue slides, which is a core step in computer-aided pathology analysis. However, the generalization ability of nucleus segmentation methods with fixed parameters is usually low because the morphology and texture of different types of nuclei vary significantly. Wen et al. [117] proposed applying active learning with different classification methods to measure the quality of nucleus segmentation results. This active learning procedure iteratively improved the generalization ability of the learning model under limited labeled samples. Directly combining a patch-based classifier with active learning makes it difficult to manually annotate the selected patches owing to their small size and lack of context information. An active learning method with a core-set sampling strategy [118] tackled this challenge by merging uncertain patches into regions for annotation; this strategy did not affect the training procedure of the classifier, which still operated on patches. To effectively leverage the information in labeled and unlabeled data, Lai et al. [119] proposed a label-efficient framework with active learning and semi-supervised learning for brain tissue segmentation in gigapixel pathology images, which surpassed fully supervised learning methods using only 0.1% of the annotations.
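The uncertainty-sampling strategies mentioned above (least confident and margin sampling) can be sketched as follows; `select_queries` and its annotation budget are illustrative names, not taken from the cited works:

```python
def least_confident(probs):
    """Least-confident score: 1 - max class probability (higher = more uncertain)."""
    return 1.0 - max(probs)

def margin(probs):
    """Margin score: gap between the two top class probabilities
    (smaller = more uncertain)."""
    top2 = sorted(probs, reverse=True)[:2]
    return top2[0] - top2[1]

def select_queries(unlabeled_probs, budget=2):
    """Pick the `budget` most uncertain unlabeled samples (by margin)
    to send to a human annotator."""
    order = sorted(range(len(unlabeled_probs)),
                   key=lambda i: margin(unlabeled_probs[i]))
    return order[:budget]
```

In a full active learning loop, the model is retrained on the newly annotated samples and the selection step is repeated until a terminal criterion (e.g., a budget or performance plateau) is met.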

3.1.2. Transferring Knowledge from Other Large-Scale Datasets. Transfer learning, which focuses on transferring knowledge across domains, is a promising machine-learning methodology for solving data-scarcity problems. Inspired by how quickly humans learn new knowledge from similar experiences, transfer learning aims to leverage knowledge from related domains (also known as the source domain) to improve learning performance in the target domain while minimizing the number of labeled examples required [120]. Deep transfer learning (DTL), which combines deep-learning architectures with transfer learning, is the most commonly used type of transfer learning in drug discovery [121]. DTL has yielded impressive results in applications such as automatic cell segmentation [122], prediction of protein subcellular localization [123], and prediction of MoA [124]. Compared with traditional machine learning approaches, deep learning uses DNNs with multiple hidden layers that can represent and learn more complex knowledge. Although the transferability of features decreases as the distance between the source and target domains increases, transferring features from distant tasks can still be better than using random features [125]. For example, Khan et al. [126] used an ensemble of three CNNs (GoogleNet [127], VGGNet [49], and ResNet [50]) pretrained on the ImageNet [128] dataset to extract general features from breast cytology images, achieving greater than 97% accuracy in the detection and classification of malignant cells. Studies [123,[128][129][130] have shown that pretrained models deliver better predictive performance with less training time and fewer training samples. Based on current applications in phenotype feature representation for drug discovery, we classify transfer learning methods into parameter- and feature-based approaches.
(1) Parameter-Based Approaches. The parameters of a pretrained model reflect what the model learned in the source domain; therefore, knowledge can be transferred directly at the parameter level. An intuitive method is to use the parameters of the pretrained network directly as a feature extractor without additional training (Figure 1(a)). An image from the target domain is passed through the pretrained model to obtain its features, which serve as inputs to the downstream task. For example, Pawlowski et al. [131] extracted features from ImageNet-pretrained neural networks and evaluated the task of classifying each treatment condition into its MoA using a 1-nearest-neighbor classifier. Phan et al. [132] applied transfer learning from a pretrained network to extract generic features and then used minimum redundancy maximum relevance (mRMR), a feature-selection method, to obtain the most relevant features for classification (Figure 3).
The pretrained model can also be used to initialize the target model and be fine-tuned on the target task, as shown in Figure 1(b). Kraus et al. [123] trained a deep CNN (DeepLoc) for subcellular protein localization prediction. They showed that, in contrast to traditional approaches, the model could be successfully transferred by fine-tuning to datasets with different genetic backgrounds acquired from other laboratories, even those with abnormal cellular morphology.
(2) Feature-Based Approaches. Feature-based approaches transform the original features into a new representation for knowledge transfer. The goal is to find a common latent feature space in which the source and target data have the same probability distribution. The source data can then be used as a training set for target tasks in the latent feature space, helping improve the performance of the model on the target data. There are two common methods of obtaining domain-invariant features. One is to reduce the distribution difference between the source and target domain instances (Figure 2(a)). For example, Bermúdez-Chacón et al. [133] proposed a two-stream U-Net for electron microscopy image segmentation. One stream used source-domain samples, whereas the other used target data. They utilized the maximum mean discrepancy (MMD) and correlation alignment as domain regularization, allowing training data from the source domain to adjust the network weights in the target domain. MMD, which is widely used in transfer learning, quantifies the distribution difference by calculating the distance between the mean embeddings of the instances in a reproducing kernel Hilbert space (RKHS). In addition to MMD, several measurement criteria have been adopted in transfer learning, including the Kullback-Leibler divergence [134], Jensen-Shannon divergence [135], and Wasserstein distance [136] (Figure 4).
The other is an adversarial-based method, which is promising for generating complex samples across different domains (e.g., GANs [101]). The original GAN is composed of a generator G and a discriminator D. The goal of the generator G is to produce counterfeits of the actual data to confuse the discriminator. The discriminator D is fed a mixture of actual data and counterfeits and aims to detect whether each input is real or fake. Motivated by GANs, many transfer learning approaches have been established based on the assumption that a good feature representation contains almost no discriminative information about the original domains of the instances. Figure 2(b) shows an adversarial-based method that typically includes a shared feature transformer, a domain classifier, and source and target classifiers. The feature transformer, similar to the generator, aims to produce a domain-independent feature representation to confuse the domain classifier. The domain classifier plays the role of the discriminator, attempting to detect whether the extracted features come from the source or target domain. The source and target classifiers produce label predictions for the source and target tasks, respectively. Adversarial learning methods have been widely used in recent years not only for transfer learning but also for data augmentation and addressing batch effects. For example, Boyd et al. [137] proposed domain-adversarial autoencoders to promote domain-invariant representations between cell lines, which not only improved the accuracy of MoA prediction but also enabled the comparison of the effects of drugs on different cell lines. Qian et al. [138] proposed a GAN-based batch equalization method that can transfer images from one batch to another while preserving the biological phenotype, addressing the batch effect.
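One common concrete realization of this adversarial scheme is the gradient reversal layer from domain-adversarial training (DANN, Ganin and Lempitsky). The sketch below illustrates the mechanism only; it is a generic device and not necessarily the one used in [137] or [138].

```python
import numpy as np

class GradReverse:
    """Gradient reversal layer: identity in the forward pass,
    multiplies the incoming gradient by -lambda in the backward pass."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad_out):
        return -self.lam * grad_out  # flip the sign for the feature extractor

# The domain classifier is trained to separate source from target features,
# but the reversed gradient drives the feature transformer in the opposite
# direction, toward features the domain classifier cannot separate.
grl = GradReverse(lam=1.0)
features = np.array([0.5, -1.0])
assert np.allclose(grl.forward(features), features)  # identity forward
grad_from_domain_clf = np.array([0.2, -0.3])
grad_to_extractor = grl.backward(grad_from_domain_clf)
print(grad_to_extractor)  # [-0.2, 0.3]
```

This single sign flip lets the minimax game of Figure 2(b) be trained with ordinary backpropagation instead of alternating generator/discriminator updates.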

3.2. Deep Learning with Noisy and Imbalanced Labels. As mentioned previously, annotating cell images requires human annotators with profound biological knowledge. Therefore, the quality of the annotations of cell image datasets is highly dependent on the professional skills of the annotators, which may cause intractable issues influencing the training of deep-learning models. Specifically, assigning incorrect or incomplete labels to training images introduces considerable label noise, damaging the generalization of deep-learning models. Notably, even if the assigned labels are completely accurate, annotator preferences may result in another issue, referred to as label imbalance, where the numbers of labeled images for different classes are highly unbalanced.
To improve the robustness of supervised learning models against noisy labels, existing studies have proposed several types of strategies, such as regularization methods to reduce overfitting on noisy labels, robust architectures to model noise, sample selectors to filter out noise, and loss functions to underweight noisy samples. Caicedo et al. [92] presented an RNN-based regularization to remove unrelated features resulting from noisy labels for weakly supervised single-cell profiling. An unsupervised learning method for nuclear segmentation in brain images was proposed in [139] to iteratively train a Mask R-CNN model with automatically generated noisy instance segmentation masks and refine the labels using an expectation-maximization (EM) procedure. Park et al. [140] proposed a robust neuron segmentation method that leveraged an ADMSE loss to adaptively reduce the weights of noisy labels. Annotating data by multiple experts improves label quality; however, inconsistency among experts can itself be a type of label noise when training models. Xiao et al. [141] resolved this issue for pathological image segmentation by utilizing the surrounding context of pixels to compute the weights of the labels annotated by multiple experts.
Several efforts have also been made to address label imbalance. Resampling and reweighting are two fundamental strategies for rebalancing the distributions in model training, from the perspectives of the input data and the loss function, respectively. Data resampling methods aim to construct a balanced dataset by oversampling the minority classes or undersampling the majority classes. For example, an undersampling strategy based on K-means clustering was used in drug discovery to remove the less important samples of the nonpotential drug class, which was the majority class; the degree of importance was measured by the distance between samples and their cluster centroids [142]. In contrast to data resampling methods, loss reweighting methods retain all data samples while assigning different weights to them to alleviate the effects of imbalance. Because the number of normal cells was considerably larger than the number of abnormal cells, a focal loss [143] was used in [144] to enlarge the weights of hard samples and reduce the weights of easy samples for the morphological classification of red blood cells. Dice loss has also been improved with reweighting strategies for cell segmentation [145] and detection [146], achieving better performance than the vanilla loss. Because reweighting methods only change the loss function and do not alter the network architecture, they can easily be combined with other machine-learning techniques to improve performance on imbalanced tasks. For example, CBCM [147] integrates focal loss with transfer learning to classify images of bone marrow cells that follow a long-tailed distribution. Recently, two-stage training frameworks [148,149] have shown high performance on imbalanced natural image datasets, training first on the original data distribution and then fine-tuning with rebalancing techniques. However, this paradigm has not been validated on imbalanced cell image tasks and may be a direction worth exploring in further studies.
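The focal-loss reweighting described above can be sketched in a few lines: the modulating factor (1 - p_t)^gamma shrinks the loss of easy, confidently classified samples far more than that of hard ones.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss [143]: cross-entropy down-weighted by (1 - p_t)^gamma,
    where p_t is the predicted probability of the true class."""
    p_t = np.where(y == 1, p, 1.0 - p)
    return -((1.0 - p_t) ** gamma) * np.log(p_t + 1e-12)

# An easy example (confident and correct) vs. a hard one (low true-class prob).
easy = focal_loss(np.array([0.95]), np.array([1]))
hard = focal_loss(np.array([0.30]), np.array([1]))
ce_easy = -np.log(0.95)  # plain cross-entropy on the easy example
print(easy[0], hard[0], ce_easy)
```

With gamma = 2, the easy example's loss is suppressed by a factor of (0.05)^2 relative to plain cross-entropy, so abundant easy (majority-class) samples no longer dominate the gradient; setting gamma = 0 recovers ordinary cross-entropy.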
3.3. Uncertainty-Aware Cell Image Analysis. In biological scenarios, deep learning has frequently been used as an efficient tool for processing biological images and producing predictions for subsequent steps. Traditionally, annotations of cell images are deterministic (for instance, deterministic cell boundaries or regions in a cellular image); cell image datasets rarely, if ever, record the confidence of annotations. Consequently, when trained on these datasets, the predictions of traditional plain neural networks are also deterministic. However, it is important to obtain the uncertainty of the predictions of deep-learning models, which can help experts evaluate the confidence of the results and measure the robustness of the models with a probabilistic interpretation. Owing to the lack of confidence information in datasets, traditional plain neural networks are incapable of capturing uncertainty, which can cause unexpected problems. For instance, in a drug discovery procedure, a deep-learning classifier of cell phenotypes may assign a cell image with an unseen phenotype to one of the classes present in the training set, because a plain neural network has no mechanism to reflect the confidence of its classification and thus fails to indicate that the phenotype is new. Therefore, uncertainty-aware learning is crucial for deep-learning applications in biological scenarios. In response, many general methods have been proposed to estimate uncertainty in deep learning. Using a Bayesian method to mathematically model uncertainty, Blundell et al. proposed an algorithm called Bayes by Backprop, which uses variational Bayesian learning to introduce uncertainty into the weights of neural networks [150].
Instead of considering the weights of networks as fixed values, Bayes by Backprop assumes them to be independent Gaussian distributions and learns them using variational Bayesian learning. In addition to acting as an uncertainty estimation method, Bayes by Backprop can also be used to perform regularization based on the compression cost of the weights. However, the training and inference of Bayes by Backprop are time-consuming and memory-inefficient, limiting its use in practical applications. To address this problem, Gal and Ghahramani used dropout, a common network regularization technique, to develop a new theoretical framework for uncertainty [151]. They showed that using dropout correctly is mathematically equivalent to approximating a probabilistic deep Gaussian process and thus proposed Monte Carlo (MC) dropout for uncertainty estimation. MC dropout keeps dropout active during both training and inference, sacrificing neither computational complexity nor inference accuracy. By replacing dropout with other similar techniques (e.g., DropBlock [152], DropConnect [153], and SpatialDropout [154]), the performance of MC dropout can be further improved. Moreover, it is worth noting that Lakshminarayanan et al. proposed a different uncertainty estimation method called deep ensembles, which takes a non-Bayesian approach to capturing uncertainty in deep learning [155]. Deep ensembles are simple to implement and provide high-quality uncertainty estimates; for instance, they can yield higher uncertainty on out-of-distribution data. For a clear comparison, schematics of Bayes by Backprop, MC dropout, and deep ensembles are shown in Figure 5.
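A minimal sketch of MC dropout at inference time follows: dropout stays active, and the spread of predictions across T stochastic passes serves as the uncertainty estimate. The tiny untrained network here is purely illustrative.

```python
import numpy as np

def mc_dropout_predict(x, W1, W2, p_drop=0.5, T=200, rng=None):
    """Run T stochastic forward passes with dropout kept ON at inference.
    Returns the predictive mean and the std across passes (uncertainty)."""
    rng = rng or np.random.default_rng(0)
    preds = []
    for _ in range(T):
        mask = rng.random(W1.shape[1]) >= p_drop            # drop hidden units
        h = np.maximum(x @ W1, 0.0) * mask / (1.0 - p_drop)  # ReLU + dropout
        preds.append(h @ W2)
    preds = np.array(preds)
    return preds.mean(axis=0), preds.std(axis=0)

rng = np.random.default_rng(42)
W1 = rng.normal(size=(4, 32))  # toy, untrained weights for illustration only
W2 = rng.normal(size=(32, 1))
mean, std = mc_dropout_predict(rng.normal(size=(1, 4)), W1, W2, rng=rng)
print(mean, std)
```

A plain network would return a single deterministic output; here, the standard deviation across passes gives each prediction an uncertainty score, which downstream analysis can use to flag low-confidence (e.g., possibly novel-phenotype) inputs.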
Notably, some previous studies have considered uncertainty in practical bioscience applications. Carrieri et al. predicted the host phenotype by maintaining compact representations of genetic material [156]. As one of the evaluation metrics, the uncertainty of the predictions of four different classifiers, estimated via cross-validation and a relevance vector machine (RVM), was used to measure the performance of the proposed workflow. To reduce the cost of high-throughput screening using categorical matrix completion and active learning, Chen et al. designed an algorithm to guide experiments based on chemical compound effects on the subcellular locations of various proteins [157]. In this algorithm, uncertainty estimation is performed for sparse matrix completion and implemented by margin sampling. Some in-depth studies have explored the uncertainty estimation of deep learning in biological scenarios. Using deep Bayesian learning, Gomariz et al. proposed a deep learning-based cell detection framework that outputs the desired probabilistic predictions [158], where Bayesian regression techniques were used in uncertainty-aware density maps. In their study, MC dropout was used to capture the aleatoric and epistemic uncertainty in the training data, which were used to generate spatial epistemic and aleatoric uncertainty maps as additional inputs for the classifier. A neural network that can capture uncertainty can distinguish between seen and unseen examples, because uncertainty decreases as more examples are observed, allowing the network outputs to become more deterministic. Using this characteristic, Dürr et al. proposed exploiting MC dropout to define different uncertainty measures for each phenotype prediction in a real-world biological dataset and showed that these uncertainty measures can be used to recognize new or unclear phenotypes [159].
Thus, the uncertainty estimation of neural networks also shows potential for discovering new phenotypes. To track single cells in colonies without manual intervention, Theorell et al. developed a novel probabilistic tracking paradigm called uncertainty-aware tracking, which is based on a Bayesian approach to lineage hypothesis formation [160]. The introduction of uncertainty in this study improved accuracy and reduced tracking-induced errors.

Conclusion
In this study, we provide a comprehensive survey of three critical tasks in cell image analysis (segmentation, tracking, and classification), showing that deep learning has been widely applied to these tasks and achieves promising results. As a data-driven method, deep learning often suffers from a lack of high-quality datasets in biological scenarios. Consequently, a performance gap often exists between academic research and practical application. From a data perspective, we also discuss the challenges of applying machine-learning algorithms to cell image analysis. We hope that the discussed techniques and concepts provide insights for both the biology and computer vision communities to propose more efficient solutions and promote the application of deep learning in the biomedical and life sciences.

Conflicts of Interest
The authors declare no conflicts of interest.

Authors' Contributions
Junde Xu and Donghao Zhou contributed equally to this work.