diff --git a/tex/thesis.tex b/tex/thesis.tex index 5e558a6..974ba9e 100644 --- a/tex/thesis.tex +++ b/tex/thesis.tex @@ -116,6 +116,17 @@ \newacronym{cytof}{CyTOF}{cytometry time-of-flight} \newacronym{spade}{SPADE}{spanning-tree progression analysis of density-normalized events} +\newacronym{ml}{ML}{machine learning} +\newacronym{rf}{RF}{random forest} +\newacronym{sr}{SR}{symbolic regression} +\newacronym{gbm}{GBM}{gradient boosted trees} +\newacronym{cif}{CIF}{conditional inference forests} +\newacronym{lasso}{LASSO}{least absolute shrinkage and selection operator} +\newacronym{svm}{SVM}{support vector machines} +\newacronym{plsr}{PLSR}{partial least squares regression} +\newacronym{mse}{MSE}{mean squared error} +\newacronym{loocv}{LOO-CV}{leave-one-out cross validation} +\newacronym{hsqc}{HSQC}{heteronuclear single quantum coherence} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % SI units for uber nerds @@ -1109,7 +1120,7 @@ source data (for example, flagging entries which had a reagent whose manufacturing date was after the date the experiment started, which signifies a human input error). -\subsection{statistical analysis} +\subsection{statistical analysis}\label{sec:statistics} For 1-way \gls{anova} analysis with Tukey multiple comparisons test, significance was assessed using the \inlinecode{stat\_compare\_means} function @@ -2097,12 +2108,12 @@ sample in the rack (approx. 4 uL) was combined to create an internal pool. This material was used for internal controls within each rack as well as metabolite annotation. -NMR spectra were collected on a Bruker Avance III HD spectrometer at 600 MHz -using a 5-mm TXI cryogenic probe and TopSpin software (Bruker BioSpin). +\gls{nmr} spectra were collected on a Bruker Avance III HD spectrometer at 600 +MHz using a 5-mm TXI cryogenic probe and TopSpin software (Bruker BioSpin). One-dimensional spectra were collected on all samples using the noesypr1d pulse -sequence under automation using ICON NMR software. Two-dimensional HSQC and -TOCSY spectra were collected on internal pooled control samples for metabolite -annotation. +sequence under automation using ICON NMR software. Two-dimensional \gls{hsqc} +and TOCSY spectra were collected on internal pooled control samples for +metabolite annotation. One-dimensional spectra were manually phased and baseline corrected in TopSpin. Two-dimensional spectra were processed in NMRpipe37. One dimensional spectra @@ -2123,12 +2134,12 @@ discovered to belong to the same metabolite but were included in further analysis. Two-dimensional spectra collected on pooled samples were uploaded to COLMARm web -server10, where HSQC peaks were automatically matched to database peaks. HSQC -matches were manually reviewed with additional 2D and proton spectra to confirm -the match. Annotations were assigned a confidence score based upon the levels of -spectral data supporting the match as previously described11. Annotated -metabolites were matched to previously selected features used for statistical -analysis. +server10, where \gls{hsqc} peaks were automatically matched to database peaks. +\gls{hsqc} matches were manually reviewed with additional 2D and proton spectra +to confirm the match. Annotations were assigned a confidence score based upon +the levels of spectral data supporting the match as previously described11. +Annotated metabolites were matched to previously selected features used for +statistical analysis. % I'm pretty sure this isn't relevant % Using the list of annotated metabolites obtained above, an approximation of a @@ -2146,17 +2157,19 @@ suggested that some of these unknown features belonged to the same molecules (not shown). Additional multidimensional NMR experiments will be required to determine their identity. -% TODO paraphrase most of this since I didn't do much of the analysis myself +% TODO add footnote saying what I did in this \subsection{machine learning and statistical analysis} -Seven machine learning (ML) techniques were implemented to predict three -responses related to the memory phenotype of the cultured T cells under -different process parameters conditions (i.e. Total Live CD4+ TN and TCM, Total -Live CD8+ TN+TCM, and Ratio CD4+/CD8+ TN+TCM). The ML methods executed were -Random Forest (RF), Gradient Boosted Machine (GBM), Conditional Inference Forest -(CIF), Least Absolute Shrinkage and Selection Operator (LASSO), Partial -Least-Squares Regression (PLSR), Support Vector Machine (SVM), and DataModeler’s -Symbolic Regression (SR). Primarily, SR models were used to optimize process +Linear regression analysis of the \glspl{doe} was performed as described in +\cref{sec:statistics}. + +Seven \gls{ml} techniques were implemented to predict three responses related to +the memory phenotype of the cultured T cells under different process parameters +conditions (i.e. Total Live CD4+ TN and TCM, Total Live CD8+ TN+TCM, and Ratio +CD4+/CD8+ TN+TCM). The \gls{ml} methods executed were \gls{rf}, \gls{gbm}, +\gls{cif}, \gls{lasso}, \gls{plsr}, \gls{svm}, and DataModeler’s +\gls{sr}\footnote{of these seven methods, all except \gls{lasso} were performed + by collaborators}. Primarily, \gls{sr} models were used to optimize process parameter values based on TN+TCM phenotype and to extract early predictive variable combinations from the multi-omics experiments. Furthermore, all regression methods were executed, and the high-performing models were used to @@ -2165,97 +2178,103 @@ critical quality attributes and critical process parameters predictive of T-cell potency, safety, and consistency at the early stages of the manufacturing process. -Symbolic regression (SR) was done using Evolved Analytics’ DataModeler software -(Evolved Analytics LLC, Midland, MI). DataModeler utilizes genetic programming -to evolve symbolic regression models (both linear and non-linear) rewarding -simplicity and accuracy. Using the selection criteria of highest accuracy -(R2>90\% or noise-power) and lowest complexity, the top-performing models were -identified. Driving variables, variable combinations, and model dimensionality -tables were generated. The top-performing variable combinations were used to -generate model ensembles. In this analysis, DataModeler’s SymbolicRegression -function was used to develop explicit algebraic (linear and nonlinear) models. -The fittest models were analyzed to identify the dominant variables using the -VariablePresence function, the dominant variable combinations using the -VariableCombinations function, and the model dimensionality (number of unique -variables) using the ModelDimensionality function. CreateModelEnsemble was used -to define trustable model ensembles using selected variable combinations and -these were summarized (model expressions, model phenotype, model tree plot, -ensemble quality, model quality, variable presence map, ANOVA tables, model -prediction plot, exportable model forms) using the ModelSummaryTable function. -Ensemble prediction and residual performance were respectively assessed via the -EnsemblePredictionPlot and EnsembleResidualPlot subroutines. Model maxima -(ModelMaximum function) and model minima (ModelMinimum function) were calculated -and displayed using the ResponsePlotExplorer function. Trade-off performance of -multiple responses was explored using the MultiTargetResponseExplorer and -ResponseComparisonExplorer with additional insights derived from the -ResponseContourPlotExplorer. Graphics and tables were generated by DataModeler. -These model ensembles were used to identify predicted response values, potential -optima in the responses, and regions of parameter values where the predictions -diverge the most. +\gls{sr} was done using Evolved Analytics’ DataModeler software (Evolved +Analytics LLC, Midland, MI). DataModeler utilizes genetic programming to evolve +symbolic regression models (both linear and non-linear) rewarding simplicity and +accuracy. Using the selection criteria of highest accuracy (R2>90\% or +noise-power) and lowest complexity, the top-performing models were identified. +Driving variables, variable combinations, and model dimensionality tables were +generated. The top-performing variable combinations were used to generate model +ensembles. In this analysis, DataModeler’s SymbolicRegression function was used +to develop explicit algebraic (linear and nonlinear) models. The fittest models +were analyzed to identify the dominant variables using the VariablePresence +function, the dominant variable combinations using the VariableCombinations +function, and the model dimensionality (number of unique variables) using the +ModelDimensionality function. CreateModelEnsemble was used to define trustable +model ensembles using selected variable combinations and these were summarized +(model expressions, model phenotype, model tree plot, ensemble quality, model +quality, variable presence map, \gls{anova} tables, model prediction plot, exportable +model forms) using the ModelSummaryTable function. Ensemble prediction and +residual performance were respectively assessed via the EnsemblePredictionPlot +and EnsembleResidualPlot subroutines. Model maxima (ModelMaximum function) and +model minima (ModelMinimum function) were calculated and displayed using the +ResponsePlotExplorer function. Trade-off performance of multiple responses was +explored using the MultiTargetResponseExplorer and ResponseComparisonExplorer +with additional insights derived from the ResponseContourPlotExplorer. Graphics +and tables were generated by DataModeler. These model ensembles were used to +identify predicted response values, potential optima in the responses, and +regions of parameter values where the predictions diverge the most. -Non-parametric tree-based ensembles were done through the randomForest, gbm, and -cforest regression functions in R, for random forest, gradient boosted trees, -and conditional inference forest models, respectively. Both random forest and -conditional inference forest construct multiple decision trees in parallel, by +Non-parametric tree-based ensembles were done through the +\inlinecode{randomForest}, inlinecode{gbm}, and \inlinecode{cforest} regression +functions in R, for \gls{rf}, \gls{gbm}, and \gls{cif} models, respectively. +Both \gls{rf} and \gls{cif} construct multiple decision trees in parallel, by randomly choosing a subset of features at each decision tree split, in the training stage. Random forest individual decision trees are split using the Gini Index, while conditional inference forest uses a statistical significance test procedure to select the variables at each split, reducing correlation bias. In -contrast, gradient boosted trees construct regression trees in series through an -iterative procedure that adapts over the training set. This model learns from -the mistakes of previous regression trees in an iterative fashion to correct -errors from its precursors’ trees (i.e. minimize mean squared errors). -Prediction performance was evaluated using leave-one-out cross-validation -(LOO)-R2 and permutation-based variable importance scores assessing \% increase -of mean squared errors (MSE), relative influence based on the increase of -prediction error, coefficient values for RF, GBM, and CID, respectively. Partial -least squares regression was executed using the plsr function from the pls -package in R while LASSO regression was performed using the cv.glmnet R package, -both using leave-one-out cross-validation. Finally, the kernlab R package was -used to construct the Support Vector Machine regression models. +contrast, \gls{gbm} construct regression trees in series through an iterative +procedure that adapts over the training set. This model learns from the mistakes +of previous regression trees in an iterative fashion to correct errors from its +precursors’ trees (i.e. minimize \gls{mse}). Prediction performance was +evaluated using \gls{loocv} and permutation-based +variable importance scores assessing \% increase of \gls{mse}, relative +influence based on the increase of prediction error, coefficient values for +\gls{rf}, \gls{gbm}, and \gls{cif}, respectively. \gls{plsr} was executed using +the \inlinecode{plsr} function from the \inlinecode{pls} package in R while +\gls{lasso} regression was performed using the \inlinecode{cv.glmnet} R package, +both using leave-one-out cross-validation. Finally, the \inlinecode{kernlab} R +package was used to construct the \gls{svm} models. Parameter tuning was done for all models in a grid search manner using the train -function from the caret R package using LOO-R2 as the optimization criteria. -Specifically, the number of features randomly sampled as candidates at each -split (mtry) and the number of trees to grow (ntree) were tuned parameters for -random forest and conditional inference forest. In particular, minimum sum of -weights in a node to be considered for splitting and the minimum sum of weights -in a terminal node were manually tuned for building the CIF models. Moreover, -GBM parameters such as the number of trees to grow, maximum depth of each tree, -learning rate, and the minimal number of observations at the terminal node, were -tuned for optimum LOO-R2 performance as well. For PLSR, the optimal number of +function from the \inlinecode{caret} R package using \gls{loocv} as the +optimization criteria. Specifically, the number of features randomly sampled as +candidates at each split (\inlinecode{mtry}) and the number of trees to grow +(\inlinecode{ntree}) were tuned parameters for random forest and conditional +inference forest. In particular, minimum sum of weights in a node to be +considered for splitting and the minimum sum of weights in a terminal node were +manually tuned for building the \gls{cif} models. Moreover, \gls{gbm} parameters +such as the number of trees to grow, maximum depth of each tree, learning rate, +and the minimal number of observations at the terminal node, were tuned for +optimum \gls{loocv} performance as well. For \gls{plsr}, the optimal number of components to be used in the model was assessed based on the standard error of -the cross-validation residuals using the function selectNcomp from the pls -package. Moreover, LASSO regression was performed using the cv.glmnet package -with alpha = 1. The best lambda for each response was chosen using the minimum -error criteria. Lastly, a fixed linear kernel (i.e. svmLinear) was used to build -the SVM regression models evaluating the cost parameter value with best LOO-R2. +the cross-validation residuals using the function \inlinecode{selectNcomp} from +the \inlinecode{pls} package. Moreover, \gls{lasso} regression was performed +using the \inlinecode{cv.glmnet} package with alpha = 1. The best lambda for +each response was chosen using the minimum error criteria. Lastly, a fixed +linear kernel (i.e. \inlinecode{svmLinear}) was used to build the \gls{svm} +regression models evaluating the cost parameter value with best \gls{loocv}. Prediction performance was measured for all models using the final model with -LOO-R2 tuned parameters. Table M2 shows the parameter values evaluated per model -at the final stages of results reporting. +\gls{loocv} tuned parameters. + +% TODO do I need this? +% Table M2 shows the parameter values evaluated per model +% at the final stages of results reporting. \subsection{consensus analysis} Consensus analysis of the relevant variables extracted from each machine learning model was done to identify consistent predictive features of quality at the early stages of manufacturing. First importance scores for all features were -measured across all ML models using varImp with caret R package except for -scores for SVM which rminer R package was used. These importance scores were -percent increase in mean squared error (MSE), relative importance through -average increase in prediction error when a given predictor is permuted, -permuted coefficients values, absolute coefficient values, weighted sum of -absolute coefficients values, and relative importance from sensitivity analysis -determined for RF, GBM, CIF, LASSO, PLSR, and SVM, respectively. Using these -scores, key predictive variables were selected if their importance scores were -within the 80th percentile ranking for the following ML methods: RF, GBM, CIF, -LASSO, PLSR, SVM while for SR variables present in >30\% of the top-performing -SR models from DataModeler (R2>= 90\%, Complexity >= 100) were chosen to -investigate consensus except for NMR media models at day 4 which considered a -combination of the top-performing results of models excluding lactate ppms, and -included those variables which were in > 40\% of the best performing models. -Only variables with those high percentile scoring values were evaluated in terms -of their logical relation (intersection across ML models) and depicted using a -Venn diagram from the venn R package. +measured across all \gls{ml} models using \inlinecode{varImp} with +\inlinecode{caret} R package except for scores for \gls{svm} which +\inlinecode{rminer} R package was used. These importance scores were percent +increase in \gls{mse}, relative importance through average increase in +prediction error when a given predictor is permuted, permuted coefficients +values, absolute coefficient values, weighted sum of absolute coefficients +values, and relative importance from sensitivity analysis determined for +\gls{rf}, \gls{gbm}, \gls{cif}, \gls{lasso}, \gls{plsr}, and \gls{svm}, +respectively. Using these scores, key predictive variables were selected if +their importance scores were within the 80th percentile ranking for the +following \gls{ml} methods: \gls{rf}, \gls{gbm}, \gls{cif}, \gls{lasso}, +\gls{plsr}, \gls{svm} while for \gls{sr} variables present in >30\% of the +top-performing \gls{sr} models from DataModeler (R2>= 90\%, Complexity >= 100) +were chosen to investigate consensus except for \gls{nmr} media models at day 4 +which considered a combination of the top-performing results of models excluding +lactate ppms, and included those variables which were in > 40\% of the best +performing models. Only variables with those high percentile scoring values were +evaluated in terms of their logical relation (intersection across \gls{ml} +models) and depicted using a Venn diagram from the \inlinecode{venn} R package. \section{results}