Feature Generation and Extraction

Next, I extracted bottleneck features from each phenotype and used those features in machine learning. The bottleneck features were obtained with Inception v3, a pre-trained neural network available in TensorFlow.
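As a minimal sketch of the extraction step, the snippet below pulls the 2048-dimensional bottleneck vector from Inception v3 via Keras. The random input image is a stand-in for a real segmented cell image, and `weights=None` keeps the sketch self-contained (no weight download); in practice you would use `weights="imagenet"` to get the pre-trained filters.

```python
import numpy as np
import tensorflow as tf

# Inception v3 without its classification head; global average pooling
# collapses the final feature map into a 2048-dimensional bottleneck vector.
# In practice use weights="imagenet"; weights=None here avoids the download
# at the cost of random (untrained) filters.
model = tf.keras.applications.InceptionV3(
    include_top=False, weights=None, pooling="avg"
)

# One placeholder 299x299 RGB image standing in for a real segmented cell.
image = np.random.rand(1, 299, 299, 3).astype("float32")
image = tf.keras.applications.inception_v3.preprocess_input(image * 255.0)

features = model.predict(image, verbose=0)
print(features.shape)  # one bottleneck vector per input image
```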


Feature Reduction


Principal Component Analysis

There are approximately 2050 extracted features from the pre-trained neural network. Not all of these features will be useful in creating a classifier, and the excess will slow training. Thus, reducing the number of bottleneck features fed into the algorithm is essential. In principal component analysis (PCA), features are transformed into another feature space and ordered according to variability – think eigenvectors and eigenvalues. One can use PCA to reduce the number of features (independent variables) to a smaller set of explanatory variables to be used for algorithm training and classification.

The chart Explained Variance of Principal Components indicates most variance is contained in fewer than 50 principal components. Therefore, using 50 or fewer components should not hurt classifier performance and will speed up training.
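A sketch of this reduction step, on synthetic data standing in for the bottleneck features (the matrix sizes here are placeholders, not the real dataset): fit PCA, then inspect the cumulative explained variance to decide how many components to keep.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for the bottleneck features: 600 cells x 2048 features.
X = rng.normal(size=(600, 2048))

# Project onto the first 100 principal components.
pca = PCA(n_components=100, random_state=0)
X_reduced = pca.fit_transform(X)

# Cumulative explained variance tells us how much information the
# first k components retain; this is what the chart plots.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(X_reduced.shape)
print(round(float(cumulative[49]), 3))  # variance captured by the first 50 components
```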


Understanding High-Dimensional Space


It is difficult to visualize data in more than 3 dimensions. One approach to visualizing high-dimensional space is pairwise plotting. Pairwise plots can reveal correlations and groupings between features. The pairwise plots shown below are of data after PCA decomposition into 100 components. The data are from three types of brain cells: neurons (TUJ1), oligodendrocytes (RIP), and astrocytes (GFAP).


Pairwise scatter plots of the 11 most variable principal components should provide useful qualitative information. Three pairwise plots are shown in the next image; the 50+ that remain look similar (see full plot). Distinguishable correlations or groupings are not observed for any cell type, so we foresee that a classifier will have difficulty correctly predicting the three cell types.

t-Distributed Stochastic Neighbor Embedding (tSNE)

The PCA pairwise scatter plots show data that are unlikely to produce a good classifier. Let’s look at other techniques to transform the data. t-SNE is a popular method for embedding data so that groups or patterns emerge. The hope is that these patterns match well with each cell type. More info on t-SNE by Laurens van der Maaten here. I embedded the data into 5-component space, and the results do not show clear grouping or patterns among the cell types.
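A sketch of the embedding step with scikit-learn's `TSNE`, again on synthetic stand-in data. One detail worth noting: the default Barnes-Hut algorithm only supports up to 3 output dimensions, so a 5-component embedding (as used here) requires `method="exact"`.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # stand-in for the PCA-reduced features

# Barnes-Hut t-SNE caps out at 3 output dimensions; embedding into
# 5 components requires the (slower) exact method.
tsne = TSNE(n_components=5, method="exact", perplexity=30, random_state=0)
embedding = tsne.fit_transform(X)
print(embedding.shape)
```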

Initial Algorithm Results

The three confusion matrices at the top of this post are the results of an initial round of classification with 100 principal components after PCA. I selected three algorithms: (1) Naive Bayes, (2) Random Forest, and (3) AdaBoost. As you can see, all classifiers performed about the same. I didn’t expect the Naive Bayes classifier to perform well because I can’t imagine a single feature being able to “naively” predict the difference between cells correctly. Thus, I use Naive Bayes here as a reference for the minimum expected performance of a classifier.
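The comparison can be sketched as follows. The data here are synthetic (`make_classification` with the post's 63/12/25 class proportions baked in as an assumption), so the numbers are illustrative only; the structure mirrors the actual experiment of fitting three classifiers and inspecting their confusion matrices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic 3-class data standing in for the 100 principal components,
# with class weights mimicking the imbalance described below.
X, y = make_classification(
    n_samples=600, n_features=100, n_informative=20,
    n_classes=3, weights=[0.63, 0.12, 0.25], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

classifiers = {
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    cm = confusion_matrix(y_test, clf.predict(X_test))
    print(name, cm.sum(axis=1))  # row sums expose the class imbalance
```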
What insight can be gained from the confusion matrices? Well…it appears we have class imbalance: there are many more cells with neurons as the ground-truth classification than other cell types.
Digging further into class imbalance, our data are comprised of 63% neurons (TUJ1), 12% oligodendrocytes (RIP), and 25% astrocytes (GFAP). This was likely causing over-prediction of the neuron (TUJ1) classification. Therefore, I needed to adjust for the imbalanced data. I adjusted for class imbalance with random over-sampling (imbalanced-learn library), as opposed to under-sampling, because I have few (< 10K) observations. After oversampling, I obtained the following results:

Wow! It looks like Random Forest is doing a decent job. However, I will need to dig deeper because Random Forest is prone to overfitting.



Anyhow, it looks like I have some problems to tackle and things to write about for future posts:

1) Classes do not naturally group and are not separable, as seen in the pairwise plots of features. Therefore, I’ll need to confirm segmentation is not cropping out morphological differences that are useful in distinguishing classes/phenotypes.

2) Random Forest is likely overfitting given what we see in the pairwise plots. Even though I am cross-validating my data, I will need to tune my hyperparameters to make sure I am not overfitting. I could also try having a “hold out” set of data that is used neither in training nor in testing during cross-validation.
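The hold-out idea in item 2 can be sketched like this (synthetic data; the `max_depth` value is an arbitrary illustration of a constrained hyperparameter, not a tuned choice): split off a hold-out set before cross-validation, so it is touched neither during training nor during CV scoring, and use it for one final unbiased estimate.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Carve off the hold-out set first; cross-validation never sees it.
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Constrained depth is one hyperparameter lever against overfitting.
clf = RandomForestClassifier(max_depth=5, random_state=0)
cv_scores = cross_val_score(clf, X_dev, y_dev, cv=5)

clf.fit(X_dev, y_dev)
holdout_score = clf.score(X_holdout, y_holdout)
print(cv_scores.mean(), holdout_score)
```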


Aside: challenges


The challenges with cell segmentation


Ideally, one cell per image is extracted during image processing and segmentation. In reality, multiple cells or cell fragments will end up in a single image. Your case may be different: you may have superb immunocytochemical labeling of protein targets that are ubiquitous within a cell, leading to excellent segmentation.


Cell processes and cell bodies connect and/or overlap, making it difficult to determine where a cell or cellular process begins or ends. Isolating individual cells when they are closely connected is therefore challenging for a human, and thus for an algorithm. One way to improve this is to set a maximum object area in pixels at the time images are converted to objects.
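A minimal sketch of that area cutoff, using a toy binary mask and SciPy's connected-component labeling. The `MAX_AREA` value here is a hypothetical threshold; in practice it would be tuned to the magnification and typical cell size of your images.

```python
import numpy as np
from scipy import ndimage

# Toy binary mask: one small "cell" and one large blob (likely merged cells).
mask = np.zeros((20, 20), dtype=bool)
mask[2:5, 2:5] = True      # small object, area 9
mask[8:18, 8:18] = True    # large object, area 100

# Label connected components and measure each object's pixel area.
labeled, n = ndimage.label(mask)
areas = ndimage.sum(mask, labeled, index=range(1, n + 1))

MAX_AREA = 50  # hypothetical pixel-area cutoff; tune for your images
keep = [i + 1 for i, a in enumerate(areas) if a <= MAX_AREA]
filtered = np.isin(labeled, keep)
print(int(filtered.sum()))  # 9: only the small object survives
```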


The challenge with machine learned cell classification


Cells are geometrically irregular; therefore, correct classification is more difficult than, for example, classification of semiconductor components on wafers. Furthermore, cell morphologies are not always distinguishable between cell types. To facilitate classification, immunocytochemical labeling is crucial. Therefore, teaching an algorithm with both immunocytochemical labeling and morphological analysis may lead to better results.