ENH paraphrase/update the modeling methods

This commit is contained in:
Nathan Dwarshuis 2021-07-31 16:57:56 -04:00
parent 0bf5b03827
commit f0cb320c49
1 changed file with 116 additions and 97 deletions


@@ -116,6 +116,17 @@
\newacronym{cytof}{CyTOF}{cytometry time-of-flight}
\newacronym{spade}{SPADE}{spanning-tree progression analysis of
  density-normalized events}
\newacronym{ml}{ML}{machine learning}
\newacronym{rf}{RF}{random forest}
\newacronym{sr}{SR}{symbolic regression}
\newacronym{gbm}{GBM}{gradient boosted trees}
\newacronym{cif}{CIF}{conditional inference forests}
\newacronym{lasso}{LASSO}{least absolute shrinkage and selection operator}
\newacronym{svm}{SVM}{support vector machines}
\newacronym{plsr}{PLSR}{partial least squares regression}
\newacronym{mse}{MSE}{mean squared error}
\newacronym{loocv}{LOO-CV}{leave-one-out cross validation}
\newacronym{hsqc}{HSQC}{heteronuclear single quantum coherence}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% SI units for uber nerds
@@ -1109,7 +1120,7 @@ source data (for example, flagging entries which had a reagent whose
manufacturing date was after the date the experiment started, which signifies a
human input error).

\subsection{statistical analysis}\label{sec:statistics}

For 1-way \gls{anova} analysis with Tukey multiple comparisons test,
significance was assessed using the \inlinecode{stat\_compare\_means} function
@@ -2097,12 +2108,12 @@ sample in the rack (approx. 4 uL) was combined to create an internal pool. This
material was used for internal controls within each rack as well as metabolite
annotation.

\gls{nmr} spectra were collected on a Bruker Avance III HD spectrometer at 600
MHz using a 5-mm TXI cryogenic probe and TopSpin software (Bruker BioSpin).
One-dimensional spectra were collected on all samples using the noesypr1d pulse
sequence under automation using ICON NMR software. Two-dimensional \gls{hsqc}
and TOCSY spectra were collected on internal pooled control samples for
metabolite annotation.

One-dimensional spectra were manually phased and baseline corrected in TopSpin.
Two-dimensional spectra were processed in NMRpipe37. One-dimensional spectra
@@ -2123,12 +2134,12 @@ discovered to belong to the same metabolite but were included in further
analysis.

Two-dimensional spectra collected on pooled samples were uploaded to COLMARm web
server10, where \gls{hsqc} peaks were automatically matched to database peaks.
\gls{hsqc} matches were manually reviewed with additional 2D and proton spectra
to confirm the match. Annotations were assigned a confidence score based upon
the levels of spectral data supporting the match as previously described11.
Annotated metabolites were matched to previously selected features used for
statistical analysis.

% I'm pretty sure this isn't relevant
% Using the list of annotated metabolites obtained above, an approximation of a
@@ -2146,17 +2157,19 @@ suggested that some of these unknown features belonged to the same molecules
(not shown). Additional multidimensional NMR experiments will be required to
determine their identity.

% TODO add footnote saying what I did in this
\subsection{machine learning and statistical analysis}

Linear regression analysis of the \glspl{doe} was performed as described in
\cref{sec:statistics}.

Seven \gls{ml} techniques were implemented to predict three responses related to
the memory phenotype of the cultured T cells under different process parameter
conditions (i.e. Total Live CD4+ TN and TCM, Total Live CD8+ TN+TCM, and Ratio
CD4+/CD8+ TN+TCM). The \gls{ml} methods executed were \gls{rf}, \gls{gbm},
\gls{cif}, \gls{lasso}, \gls{plsr}, \gls{svm}, and DataModeler's
\gls{sr}\footnote{Of these seven methods, all except \gls{lasso} were performed
by collaborators.}. Primarily, \gls{sr} models were used to optimize process
parameter values based on TN+TCM phenotype and to extract early predictive
variable combinations from the multi-omics experiments. Furthermore, all
regression methods were executed, and the high-performing models were used to
@@ -2165,97 +2178,103 @@ critical quality attributes and critical process parameters predictive of T-cell
potency, safety, and consistency at the early stages of the manufacturing
process.

\gls{sr} was done using Evolved Analytics DataModeler software (Evolved
Analytics LLC, Midland, MI). DataModeler utilizes genetic programming to evolve
symbolic regression models (both linear and nonlinear), rewarding simplicity and
accuracy. Using the selection criteria of highest accuracy ($R^2 > 90\%$ or
noise-power) and lowest complexity, the top-performing models were identified.
Driving variables, variable combinations, and model dimensionality tables were
generated. The top-performing variable combinations were used to generate model
ensembles. In this analysis, DataModeler's \inlinecode{SymbolicRegression}
function was used to develop explicit algebraic (linear and nonlinear) models.
The fittest models were analyzed to identify the dominant variables using the
\inlinecode{VariablePresence} function, the dominant variable combinations using
the \inlinecode{VariableCombinations} function, and the model dimensionality
(number of unique variables) using the \inlinecode{ModelDimensionality}
function. \inlinecode{CreateModelEnsemble} was used to define trustable model
ensembles using selected variable combinations, and these were summarized (model
expressions, model phenotype, model tree plot, ensemble quality, model quality,
variable presence map, \gls{anova} tables, model prediction plot, and exportable
model forms) using the \inlinecode{ModelSummaryTable} function. Ensemble
prediction and residual performance were assessed via the
\inlinecode{EnsemblePredictionPlot} and \inlinecode{EnsembleResidualPlot}
subroutines, respectively. Model maxima (\inlinecode{ModelMaximum} function) and
model minima (\inlinecode{ModelMinimum} function) were calculated and displayed
using the \inlinecode{ResponsePlotExplorer} function. Trade-off performance of
multiple responses was explored using the
\inlinecode{MultiTargetResponseExplorer} and
\inlinecode{ResponseComparisonExplorer}, with additional insights derived from
the \inlinecode{ResponseContourPlotExplorer}. Graphics and tables were generated
by DataModeler. These model ensembles were used to identify predicted response
values, potential optima in the responses, and regions of parameter values where
the predictions diverge the most.

Non-parametric tree-based ensembles were done through the
\inlinecode{randomForest}, \inlinecode{gbm}, and \inlinecode{cforest} regression
functions in R, for \gls{rf}, \gls{gbm}, and \gls{cif} models, respectively.
Both \gls{rf} and \gls{cif} construct multiple decision trees in parallel by
randomly choosing a subset of features at each decision tree split in the
training stage. Individual random forest decision trees are split using the Gini
index, while conditional inference forest uses a statistical significance test
procedure to select the variables at each split, reducing correlation bias. In
contrast, \gls{gbm} construct regression trees in series through an iterative
procedure that adapts over the training set. This model learns from the mistakes
of previous regression trees in an iterative fashion to correct errors from its
precursor trees (i.e. minimize \gls{mse}). Prediction performance was evaluated
using \gls{loocv}, and variable importance was assessed using permutation-based
scores (\% increase in \gls{mse}), relative influence based on the increase in
prediction error, and coefficient values for \gls{rf}, \gls{gbm}, and \gls{cif},
respectively. \gls{plsr} was executed using the \inlinecode{plsr} function from
the \inlinecode{pls} package in R, while \gls{lasso} regression was performed
using the \inlinecode{cv.glmnet} function from the \inlinecode{glmnet} R
package, both using \gls{loocv}. Finally, the \inlinecode{kernlab} R package was
used to construct the \gls{svm} models.

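To make the model-fitting workflow concrete, the sketch below shows roughly how
these base fits could look in R. The data frame \inlinecode{dat}, its response
column \inlinecode{y}, and all hyperparameter values are illustrative
placeholders rather than the settings actually used (tuning is described below).

\begin{verbatim}
## Illustrative sketch only: `dat` is a placeholder data frame containing the
## response column `y` plus predictors; hyperparameter values are arbitrary.
library(randomForest)
library(gbm)
library(party)    # cforest
library(pls)      # plsr
library(glmnet)   # cv.glmnet
library(kernlab)  # ksvm

rf_fit  <- randomForest(y ~ ., data = dat, ntree = 500)
gbm_fit <- gbm(y ~ ., data = dat, distribution = "gaussian",
               n.trees = 500, interaction.depth = 3, shrinkage = 0.01)
cif_fit <- cforest(y ~ ., data = dat,
                   controls = cforest_unbiased(ntree = 500))
pls_fit <- plsr(y ~ ., data = dat, validation = "LOO")

x <- as.matrix(dat[, setdiff(names(dat), "y")])
lasso_fit <- cv.glmnet(x, dat$y, alpha = 1, nfolds = nrow(dat))  # LOO folds
svm_fit   <- ksvm(y ~ ., data = dat, kernel = "vanilladot")      # linear kernel
\end{verbatim}
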
Parameter tuning was done for all models in a grid-search manner using the
\inlinecode{train} function from the \inlinecode{caret} R package with
\gls{loocv} as the optimization criterion. Specifically, the number of features
randomly sampled as candidates at each split (\inlinecode{mtry}) and the number
of trees to grow (\inlinecode{ntree}) were the tuned parameters for random
forest and conditional inference forest. In particular, the minimum sum of
weights in a node to be considered for splitting and the minimum sum of weights
in a terminal node were manually tuned when building the \gls{cif} models.
Moreover, \gls{gbm} parameters such as the number of trees to grow, maximum
depth of each tree, learning rate, and the minimal number of observations at the
terminal node were tuned for optimum \gls{loocv} performance as well. For
\gls{plsr}, the optimal number of components to be used in the model was
assessed based on the standard error of the cross-validation residuals using the
\inlinecode{selectNcomp} function from the \inlinecode{pls} package. Moreover,
\gls{lasso} regression was performed using the \inlinecode{cv.glmnet} function
with \inlinecode{alpha = 1}, and the best lambda for each response was chosen
using the minimum-error criterion. Lastly, a fixed linear kernel
(i.e. \inlinecode{svmLinear}) was used to build the \gls{svm} regression models,
evaluating the cost parameter value with the best \gls{loocv} performance.

Prediction performance was measured for all models using the final model with
\gls{loocv}-tuned parameters.

% TODO do I need this?
% Table M2 shows the parameter values evaluated per model
% at the final stages of results reporting.

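As a hedged illustration of this tuning step, the following sketch uses
\inlinecode{caret} with leave-one-out resampling; the grid values and the
placeholder data frame \inlinecode{dat} with response \inlinecode{y} are
illustrative, not the exact grids evaluated.

\begin{verbatim}
## Illustrative sketch only: grid values and `dat`/`y` are placeholders.
library(caret)
library(pls)
library(glmnet)

loo <- trainControl(method = "LOOCV")

# random forest: grid search over mtry; ntree passed through to randomForest
rf_tuned <- train(y ~ ., data = dat, method = "rf", trControl = loo,
                  ntree = 500, tuneGrid = expand.grid(mtry = c(2, 4, 6)))

# linear-kernel SVM: grid search over the cost parameter C
svm_tuned <- train(y ~ ., data = dat, method = "svmLinear", trControl = loo,
                   tuneGrid = expand.grid(C = c(0.25, 0.5, 1, 2)))

# PLSR: component count from the standard error of the CV residuals
pls_fit <- plsr(y ~ ., data = dat, validation = "LOO")
ncomp   <- selectNcomp(pls_fit, method = "onesigma")

# LASSO: alpha = 1, lambda at the minimum cross-validated error
x <- as.matrix(dat[, setdiff(names(dat), "y")])
lasso_fit   <- cv.glmnet(x, dat$y, alpha = 1, nfolds = nrow(dat))
best_lambda <- lasso_fit$lambda.min
\end{verbatim}
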
\subsection{consensus analysis}

Consensus analysis of the relevant variables extracted from each machine
learning model was done to identify consistent predictive features of quality at
the early stages of manufacturing. First, importance scores for all features
were measured across all \gls{ml} models using \inlinecode{varImp} from the
\inlinecode{caret} R package, except for the \gls{svm} scores, for which the
\inlinecode{rminer} R package was used. These importance scores were the percent
increase in \gls{mse}, the relative importance through the average increase in
prediction error when a given predictor is permuted, permuted coefficient
values, absolute coefficient values, the weighted sum of absolute coefficient
values, and the relative importance from sensitivity analysis for \gls{rf},
\gls{gbm}, \gls{cif}, \gls{lasso}, \gls{plsr}, and \gls{svm}, respectively.
Using these scores, key predictive variables were selected if their importance
scores were within the 80th percentile ranking for \gls{rf}, \gls{gbm},
\gls{cif}, \gls{lasso}, \gls{plsr}, and \gls{svm}. For \gls{sr}, variables
present in $>30\%$ of the top-performing \gls{sr} models from DataModeler
($R^2 \geq 90\%$, complexity $\geq 100$) were chosen to investigate consensus,
except for the \gls{nmr} media models at day 4, which considered a combination
of the top-performing results of models excluding lactate ppms and included
those variables present in $>40\%$ of the best-performing models. Only variables
with these high-percentile scores were evaluated in terms of their logical
relation (intersection across \gls{ml} models) and depicted using a Venn diagram
from the \inlinecode{venn} R package.

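A minimal sketch of this consensus step is shown below; it assumes a named list
\inlinecode{models} of fitted \inlinecode{caret} models and reuses the
80th-percentile cutoff from the text, with the \gls{svm} importance step (done
with \inlinecode{rminer} above) omitted for brevity.

\begin{verbatim}
## Illustrative sketch only: `models` is an assumed named list of caret fits;
## SVM importance (computed with rminer in the text) is omitted here.
library(caret)  # varImp
library(venn)

imp <- lapply(models, function(m) varImp(m)$importance)

# keep variables at or above the 80th percentile of each model's scores
top_vars <- lapply(imp, function(s) {
  rownames(s)[s$Overall >= quantile(s$Overall, 0.80)]
})

# overlap of the per-model selections, drawn as a Venn diagram
venn(top_vars)
\end{verbatim}
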
\section{results}