ENH paraphrase/update the modeling methods

This commit is contained in:
Nathan Dwarshuis 2021-07-31 16:57:56 -04:00
parent 0bf5b03827
commit f0cb320c49
1 changed file with 116 additions and 97 deletions


@@ -116,6 +116,17 @@
\newacronym{cytof}{CyTOF}{cytometry time-of-flight}
\newacronym{spade}{SPADE}{spanning-tree progression analysis of
  density-normalized events}
\newacronym{ml}{ML}{machine learning}
\newacronym{rf}{RF}{random forest}
\newacronym{sr}{SR}{symbolic regression}
\newacronym{gbm}{GBM}{gradient boosted trees}
\newacronym{cif}{CIF}{conditional inference forests}
\newacronym{lasso}{LASSO}{least absolute shrinkage and selection operator}
\newacronym{svm}{SVM}{support vector machines}
\newacronym{plsr}{PLSR}{partial least squares regression}
\newacronym{mse}{MSE}{mean squared error}
\newacronym{loocv}{LOO-CV}{leave-one-out cross validation}
\newacronym{hsqc}{HSQC}{heteronuclear single quantum coherence}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% SI units for uber nerds
@@ -1109,7 +1120,7 @@ source data (for example, flagging entries which had a reagent whose
manufacturing date was after the date the experiment started, which signifies a
human input error).

\subsection{statistical analysis}\label{sec:statistics}

For 1-way \gls{anova} analysis with Tukey multiple comparisons test,
significance was assessed using the \inlinecode{stat\_compare\_means} function
@@ -2097,12 +2108,12 @@ sample in the rack (approx. 4 uL) was combined to create an internal pool. This
material was used for internal controls within each rack as well as metabolite
annotation.

\gls{nmr} spectra were collected on a Bruker Avance III HD spectrometer at 600
MHz using a 5-mm TXI cryogenic probe and TopSpin software (Bruker BioSpin).
One-dimensional spectra were collected on all samples using the noesypr1d pulse
sequence under automation using ICON NMR software. Two-dimensional \gls{hsqc}
and TOCSY spectra were collected on internal pooled control samples for
metabolite annotation.

One-dimensional spectra were manually phased and baseline corrected in TopSpin.
Two-dimensional spectra were processed in NMRpipe37. One-dimensional spectra
@@ -2123,12 +2134,12 @@ discovered to belong to the same metabolite but were included in further
analysis.

Two-dimensional spectra collected on pooled samples were uploaded to COLMARm web
server10, where \gls{hsqc} peaks were automatically matched to database peaks.
\gls{hsqc} matches were manually reviewed with additional 2D and proton spectra
to confirm the match. Annotations were assigned a confidence score based upon
the levels of spectral data supporting the match as previously described11.
Annotated metabolites were matched to previously selected features used for
statistical analysis.

% I'm pretty sure this isn't relevant
% Using the list of annotated metabolites obtained above, an approximation of a
@@ -2146,17 +2157,19 @@ suggested that some of these unknown features belonged to the same molecules
(not shown). Additional multidimensional NMR experiments will be required to
determine their identity.

% TODO add footnote saying what I did in this
\subsection{machine learning and statistical analysis}

Linear regression analysis of the \glspl{doe} was performed as described in
\cref{sec:statistics}.

Seven \gls{ml} techniques were implemented to predict three responses related to
the memory phenotype of the cultured T cells under different process parameter
conditions (i.e. Total Live CD4+ TN and TCM, Total Live CD8+ TN+TCM, and Ratio
CD4+/CD8+ TN+TCM). The \gls{ml} methods executed were \gls{rf}, \gls{gbm},
\gls{cif}, \gls{lasso}, \gls{plsr}, \gls{svm}, and DataModeler's
\gls{sr}\footnote{Of these seven methods, all except \gls{lasso} were performed
by collaborators.}. Primarily, \gls{sr} models were used to optimize process
parameter values based on TN+TCM phenotype and to extract early predictive
variable combinations from the multi-omics experiments. Furthermore, all
regression methods were executed, and the high-performing models were used to
@@ -2165,97 +2178,103 @@ critical quality attributes and critical process parameters predictive of T-cell
potency, safety, and consistency at the early stages of the manufacturing
process.

\gls{sr} was done using Evolved Analytics DataModeler software (Evolved
Analytics LLC, Midland, MI). DataModeler utilizes genetic programming to evolve
symbolic regression models (both linear and nonlinear), rewarding simplicity and
accuracy. Using the selection criteria of highest accuracy ($R^2 > 90\%$ or
noise-power) and lowest complexity, the top-performing models were identified.
Driving variables, variable combinations, and model dimensionality tables were
generated. The top-performing variable combinations were used to generate model
ensembles. In this analysis, DataModeler's \inlinecode{SymbolicRegression}
function was used to develop explicit algebraic (linear and nonlinear) models.
The fittest models were analyzed to identify the dominant variables using the
\inlinecode{VariablePresence} function, the dominant variable combinations using
the \inlinecode{VariableCombinations} function, and the model dimensionality
(number of unique variables) using the \inlinecode{ModelDimensionality}
function. \inlinecode{CreateModelEnsemble} was used to define trustable model
ensembles using selected variable combinations, and these were summarized (model
expressions, model phenotype, model tree plot, ensemble quality, model quality,
variable presence map, \gls{anova} tables, model prediction plot, and exportable
model forms) using the \inlinecode{ModelSummaryTable} function. Ensemble
prediction and residual performance were assessed via the
\inlinecode{EnsemblePredictionPlot} and \inlinecode{EnsembleResidualPlot}
subroutines, respectively. Model maxima (\inlinecode{ModelMaximum} function) and
model minima (\inlinecode{ModelMinimum} function) were calculated and displayed
using the \inlinecode{ResponsePlotExplorer} function. Trade-off performance of
multiple responses was explored using the
\inlinecode{MultiTargetResponseExplorer} and
\inlinecode{ResponseComparisonExplorer}, with additional insights derived from
the \inlinecode{ResponseContourPlotExplorer}. Graphics and tables were generated
by DataModeler. These model ensembles were used to identify predicted response
values, potential optima in the responses, and regions of parameter values where
the predictions diverge the most.

Non-parametric tree-based ensembles were done through the
\inlinecode{randomForest}, \inlinecode{gbm}, and \inlinecode{cforest} regression
functions in R, for \gls{rf}, \gls{gbm}, and \gls{cif} models, respectively.
Both \gls{rf} and \gls{cif} construct multiple decision trees in parallel by
randomly choosing a subset of features at each decision tree split in the
training stage. Individual random forest decision trees are split using the Gini
index, while conditional inference forest uses a statistical significance test
procedure to select the variables at each split, reducing correlation bias. In
contrast, \gls{gbm} construct regression trees in series through an iterative
procedure that adapts over the training set. This model learns from the mistakes
of previous regression trees in an iterative fashion to correct errors from its
precursor trees (i.e. minimize \gls{mse}). Prediction performance was evaluated
using \gls{loocv}, and variable importance was assessed using permutation-based
scores (\% increase in \gls{mse}), relative influence based on the increase in
prediction error, and coefficient values for \gls{rf}, \gls{gbm}, and \gls{cif},
respectively. \gls{plsr} was executed using the \inlinecode{plsr} function from
the \inlinecode{pls} package in R, while \gls{lasso} regression was performed
using the \inlinecode{cv.glmnet} function from the \inlinecode{glmnet} R
package, both using \gls{loocv}. Finally, the \inlinecode{kernlab} R package was
used to construct the \gls{svm} models.

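To make the model-fitting workflow concrete, the sketch below shows roughly how
these base fits could look in R. The data frame \inlinecode{dat}, its response
column \inlinecode{y}, and all hyperparameter values are illustrative
placeholders rather than the settings actually used (tuning is described below).

\begin{verbatim}
## Illustrative sketch only: `dat` is a placeholder data frame containing the
## response column `y` plus predictors; hyperparameter values are arbitrary.
library(randomForest)
library(gbm)
library(party)    # cforest
library(pls)      # plsr
library(glmnet)   # cv.glmnet
library(kernlab)  # ksvm

rf_fit  <- randomForest(y ~ ., data = dat, ntree = 500)
gbm_fit <- gbm(y ~ ., data = dat, distribution = "gaussian",
               n.trees = 500, interaction.depth = 3, shrinkage = 0.01)
cif_fit <- cforest(y ~ ., data = dat,
                   controls = cforest_unbiased(ntree = 500))
pls_fit <- plsr(y ~ ., data = dat, validation = "LOO")

x <- as.matrix(dat[, setdiff(names(dat), "y")])
lasso_fit <- cv.glmnet(x, dat$y, alpha = 1, nfolds = nrow(dat))  # LOO folds
svm_fit   <- ksvm(y ~ ., data = dat, kernel = "vanilladot")      # linear kernel
\end{verbatim}
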
Parameter tuning was done for all models in a grid-search manner using the
\inlinecode{train} function from the \inlinecode{caret} R package with
\gls{loocv} as the optimization criterion. Specifically, the number of features
randomly sampled as candidates at each split (\inlinecode{mtry}) and the number
of trees to grow (\inlinecode{ntree}) were the tuned parameters for random
forest and conditional inference forest. In particular, the minimum sum of
weights in a node to be considered for splitting and the minimum sum of weights
in a terminal node were manually tuned when building the \gls{cif} models.
Moreover, \gls{gbm} parameters such as the number of trees to grow, maximum
depth of each tree, learning rate, and the minimal number of observations at the
terminal node were tuned for optimum \gls{loocv} performance as well. For
\gls{plsr}, the optimal number of components to be used in the model was
assessed based on the standard error of the cross-validation residuals using the
\inlinecode{selectNcomp} function from the \inlinecode{pls} package. Moreover,
\gls{lasso} regression was performed using the \inlinecode{cv.glmnet} function
with \inlinecode{alpha = 1}, and the best lambda for each response was chosen
using the minimum-error criterion. Lastly, a fixed linear kernel
(i.e. \inlinecode{svmLinear}) was used to build the \gls{svm} regression models,
evaluating the cost parameter value with the best \gls{loocv} performance.

Prediction performance was measured for all models using the final model with
\gls{loocv}-tuned parameters.

% TODO do I need this?
% Table M2 shows the parameter values evaluated per model
% at the final stages of results reporting.

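As a hedged illustration of this tuning step, the following sketch uses
\inlinecode{caret} with leave-one-out resampling; the grid values and the
placeholder data frame \inlinecode{dat} with response \inlinecode{y} are
illustrative, not the exact grids evaluated.

\begin{verbatim}
## Illustrative sketch only: grid values and `dat`/`y` are placeholders.
library(caret)
library(pls)
library(glmnet)

loo <- trainControl(method = "LOOCV")

# random forest: grid search over mtry; ntree passed through to randomForest
rf_tuned <- train(y ~ ., data = dat, method = "rf", trControl = loo,
                  ntree = 500, tuneGrid = expand.grid(mtry = c(2, 4, 6)))

# linear-kernel SVM: grid search over the cost parameter C
svm_tuned <- train(y ~ ., data = dat, method = "svmLinear", trControl = loo,
                   tuneGrid = expand.grid(C = c(0.25, 0.5, 1, 2)))

# PLSR: component count from the standard error of the CV residuals
pls_fit <- plsr(y ~ ., data = dat, validation = "LOO")
ncomp   <- selectNcomp(pls_fit, method = "onesigma")

# LASSO: alpha = 1, lambda at the minimum cross-validated error
x <- as.matrix(dat[, setdiff(names(dat), "y")])
lasso_fit   <- cv.glmnet(x, dat$y, alpha = 1, nfolds = nrow(dat))
best_lambda <- lasso_fit$lambda.min
\end{verbatim}
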
\subsection{consensus analysis}

Consensus analysis of the relevant variables extracted from each machine
learning model was done to identify consistent predictive features of quality at
the early stages of manufacturing. First, importance scores for all features
were measured across all \gls{ml} models using \inlinecode{varImp} from the
\inlinecode{caret} R package, except for the \gls{svm} scores, for which the
\inlinecode{rminer} R package was used. These importance scores were the percent
increase in \gls{mse}, the relative importance through the average increase in
prediction error when a given predictor is permuted, permuted coefficient
values, absolute coefficient values, the weighted sum of absolute coefficient
values, and the relative importance from sensitivity analysis for \gls{rf},
\gls{gbm}, \gls{cif}, \gls{lasso}, \gls{plsr}, and \gls{svm}, respectively.
Using these scores, key predictive variables were selected if their importance
scores were within the 80th percentile ranking for \gls{rf}, \gls{gbm},
\gls{cif}, \gls{lasso}, \gls{plsr}, and \gls{svm}. For \gls{sr}, variables
present in $>30\%$ of the top-performing \gls{sr} models from DataModeler
($R^2 \geq 90\%$, complexity $\geq 100$) were chosen to investigate consensus,
except for the \gls{nmr} media models at day 4, which considered a combination
of the top-performing results of models excluding lactate ppms and included
those variables present in $>40\%$ of the best-performing models. Only variables
with these high-percentile scores were evaluated in terms of their logical
relation (intersection across \gls{ml} models) and depicted using a Venn diagram
from the \inlinecode{venn} R package.

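A minimal sketch of this consensus step is shown below; it assumes a named list
\inlinecode{models} of fitted \inlinecode{caret} models and reuses the
80th-percentile cutoff from the text, with the \gls{svm} importance step (done
with \inlinecode{rminer} above) omitted for brevity.

\begin{verbatim}
## Illustrative sketch only: `models` is an assumed named list of caret fits;
## SVM importance (computed with rminer in the text) is omitted here.
library(caret)  # varImp
library(venn)

imp <- lapply(models, function(m) varImp(m)$importance)

# keep variables at or above the 80th percentile of each model's scores
top_vars <- lapply(imp, function(s) {
  rownames(s)[s$Overall >= quantile(s$Overall, 0.80)]
})

# overlap of the per-model selections, drawn as a Venn diagram
venn(top_vars)
\end{verbatim}
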
\section{results}