ENH paraphrase/update the modeling methods

Nathan Dwarshuis 2021-07-31 16:57:56 -04:00
parent 0bf5b03827
commit f0cb320c49
1 changed file with 116 additions and 97 deletions


@ -116,6 +116,17 @@
\newacronym{cytof}{CyTOF}{cytometry time-of-flight}
\newacronym{spade}{SPADE}{spanning-tree progression analysis of
density-normalized events}
\newacronym{ml}{ML}{machine learning}
\newacronym{rf}{RF}{random forest}
\newacronym{sr}{SR}{symbolic regression}
\newacronym{gbm}{GBM}{gradient boosted trees}
\newacronym{cif}{CIF}{conditional inference forests}
\newacronym{lasso}{LASSO}{least absolute shrinkage and selection operator}
\newacronym{svm}{SVM}{support vector machines}
\newacronym{plsr}{PLSR}{partial least squares regression}
\newacronym{mse}{MSE}{mean squared error}
\newacronym{loocv}{LOO-CV}{leave-one-out cross validation}
\newacronym{hsqc}{HSQC}{heteronuclear single quantum coherence}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% SI units for uber nerds
@ -1109,7 +1120,7 @@ source data (for example, flagging entries which had a reagent whose
manufacturing date was after the date the experiment started, which signifies a
human input error).
\subsection{statistical analysis}\label{sec:statistics}
For 1-way \gls{anova} analysis with Tukey's multiple comparisons test,
significance was assessed using the \inlinecode{stat\_compare\_means} function
@ -2097,12 +2108,12 @@ sample in the rack (approx. 4 uL) was combined to create an internal pool. This
material was used for internal controls within each rack as well as metabolite
annotation.
\gls{nmr} spectra were collected on a Bruker Avance III HD spectrometer at 600
MHz using a 5-mm TXI cryogenic probe and TopSpin software (Bruker BioSpin).
One-dimensional spectra were collected on all samples using the noesypr1d pulse
sequence under automation using ICON NMR software. Two-dimensional \gls{hsqc}
and TOCSY spectra were collected on internal pooled control samples for
metabolite annotation.
One-dimensional spectra were manually phased and baseline corrected in TopSpin.
Two-dimensional spectra were processed in NMRPipe [37]. One-dimensional
spectra
@ -2123,12 +2134,12 @@ discovered to belong to the same metabolite but were included in further
analysis.
Two-dimensional spectra collected on pooled samples were uploaded to COLMARm web
server [10], where \gls{hsqc} peaks were automatically matched to database
peaks. \gls{hsqc} matches were manually reviewed with additional 2D and proton
spectra to confirm the match. Annotations were assigned a confidence score
based upon the levels of spectral data supporting the match as previously
described [11].
Annotated metabolites were matched to previously selected features used for
statistical analysis.
% I'm pretty sure this isn't relevant
% Using the list of annotated metabolites obtained above, an approximation of a
@ -2146,17 +2157,19 @@ suggested that some of these unknown features belonged to the same molecules
(not shown). Additional multidimensional NMR experiments will be required to
determine their identity.
\subsection{machine learning and statistical analysis}
Linear regression analysis of the \glspl{doe} was performed as described in
\cref{sec:statistics}.
Seven \gls{ml} techniques were implemented to predict three responses related
to the memory phenotype of the cultured T cells under different process
parameter conditions (i.e. Total Live CD4+ TN and TCM, Total Live CD8+ TN+TCM,
and Ratio CD4+/CD8+ TN+TCM). The \gls{ml} methods executed were \gls{rf},
\gls{gbm}, \gls{cif}, \gls{lasso}, \gls{plsr}, \gls{svm}, and DataModeler's
\gls{sr}\footnote{Of these seven methods, all except \gls{lasso} were performed
by collaborators.}. Primarily, \gls{sr} models were used to optimize process
parameter values based on TN+TCM phenotype and to extract early predictive
variable combinations from the multi-omics experiments. Furthermore, all
regression methods were executed, and the high-performing models were used to
@ -2165,97 +2178,103 @@ critical quality attributes and critical process parameters predictive of T-cell
potency, safety, and consistency at the early stages of the manufacturing
process.
\gls{sr} was done using Evolved Analytics DataModeler software (Evolved
Analytics LLC, Midland, MI). DataModeler utilizes genetic programming to evolve
symbolic regression models (both linear and non-linear), rewarding simplicity
and accuracy. Using the selection criteria of highest accuracy ($R^2 > 90\%$ or
noise power) and lowest complexity, the top-performing models were identified.
Driving variables, variable combinations, and model dimensionality tables were
generated. The top-performing variable combinations were used to generate model
ensembles. In this analysis, DataModeler's \inlinecode{SymbolicRegression}
function was used to develop explicit algebraic (linear and nonlinear) models.
The fittest models were analyzed to identify the dominant variables using the
\inlinecode{VariablePresence} function, the dominant variable combinations
using the \inlinecode{VariableCombinations} function, and the model
dimensionality (number of unique variables) using the
\inlinecode{ModelDimensionality} function. \inlinecode{CreateModelEnsemble} was
used to define trustworthy model ensembles using selected variable
combinations, and these were summarized (model expressions, model phenotype,
model tree plot, ensemble quality, model quality, variable presence map,
\gls{anova} tables, model prediction plot, exportable model forms) using the
\inlinecode{ModelSummaryTable} function. Ensemble prediction and residual
performance were assessed via the \inlinecode{EnsemblePredictionPlot} and
\inlinecode{EnsembleResidualPlot} subroutines, respectively. Model maxima
(\inlinecode{ModelMaximum} function) and model minima
(\inlinecode{ModelMinimum} function) were calculated and displayed using the
\inlinecode{ResponsePlotExplorer} function. Trade-off performance of multiple
responses was explored using \inlinecode{MultiTargetResponseExplorer} and
\inlinecode{ResponseComparisonExplorer}, with additional insights derived from
\inlinecode{ResponseContourPlotExplorer}. Graphics and tables were generated by
DataModeler. These model ensembles were used to identify predicted response
values, potential optima in the responses, and regions of parameter values
where the predictions diverge the most.
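
As an illustrative sketch only (not DataModeler's actual API), the
accuracy-and-complexity selection criterion above could be expressed in R,
assuming a hypothetical data frame \inlinecode{models} with one row per
candidate model and columns \inlinecode{r2} and \inlinecode{complexity}:

\begin{verbatim}
# Illustrative sketch: filter candidate models by the accuracy threshold
# from the text (R^2 > 90%), then keep the accuracy/complexity Pareto
# front (models with no strictly more accurate AND simpler alternative).
select_top_models <- function(models, r2_min = 0.9) {
  candidates <- models[models$r2 > r2_min, ]
  on_front <- vapply(seq_len(nrow(candidates)), function(i) {
    !any(candidates$r2 > candidates$r2[i] &
         candidates$complexity < candidates$complexity[i])
  }, logical(1))
  candidates[on_front, ]
}
\end{verbatim}
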
Non-parametric tree-based ensembles were constructed through the
\inlinecode{randomForest}, \inlinecode{gbm}, and \inlinecode{cforest}
regression functions in R, for \gls{rf}, \gls{gbm}, and \gls{cif} models,
respectively. Both \gls{rf} and \gls{cif} construct multiple decision trees in
parallel, randomly choosing a subset of features at each decision tree split
during the training stage. Individual \gls{rf} decision trees are split using
the Gini index, while \gls{cif} uses a statistical significance test procedure
to select the variables at each split, reducing correlation bias.
In contrast, \gls{gbm} construct regression trees in series through an
iterative procedure that adapts over the training set, learning from the
mistakes of previous regression trees to correct errors from its precursor
trees (i.e. to minimize \gls{mse}). Prediction performance was evaluated using
\gls{loocv} $R^2$, and variable importance was assessed using the
permutation-based \% increase in \gls{mse}, the relative influence based on the
increase in prediction error, and coefficient values for \gls{rf}, \gls{gbm},
and \gls{cif}, respectively. \gls{plsr} was executed using the
\inlinecode{plsr} function from the \inlinecode{pls} package in R, while
\gls{lasso} regression was performed using the \inlinecode{cv.glmnet} R
package, both using \gls{loocv}. Finally, the \inlinecode{kernlab} R package
was used to construct the \gls{svm} models.
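
As a minimal sketch of these three ensemble fits (assuming a hypothetical data
frame \inlinecode{train\_df} with a numeric \inlinecode{response} column plus
predictor columns; the tuning values shown are placeholders, with the actual
grid search described below):

\begin{verbatim}
library(randomForest)  # random forest
library(gbm)           # gradient boosted trees
library(party)         # conditional inference forest (cforest)

# Hypothetical training data: numeric `response` plus predictor columns
rf_fit  <- randomForest(response ~ ., data = train_df,
                        mtry = 3, ntree = 500, importance = TRUE)
gbm_fit <- gbm(response ~ ., data = train_df, distribution = "gaussian",
               n.trees = 500, interaction.depth = 3, shrinkage = 0.01,
               n.minobsinnode = 5)
cif_fit <- cforest(response ~ ., data = train_df,
                   controls = cforest_unbiased(mtry = 3, ntree = 500))
\end{verbatim}
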
Parameter tuning was done for all models in a grid-search manner using the
\inlinecode{train}
function from the \inlinecode{caret} R package with \gls{loocv} as the
optimization criterion. Specifically, the number of features randomly sampled
as candidates at each split (\inlinecode{mtry}) and the number of trees to grow
(\inlinecode{ntree}) were the tuned parameters for \gls{rf} and \gls{cif}. In
addition, the minimum sum of weights in a node to be considered for splitting
and the minimum sum of weights in a terminal node were manually tuned for
building the \gls{cif} models. Moreover, \gls{gbm} parameters such as the
number of trees to grow, the maximum depth of each tree, the learning rate, and
the minimal number of observations at the terminal node were likewise tuned for
optimum \gls{loocv} performance. For \gls{plsr}, the optimal number of
components to be used in the model was assessed based on the standard error of
the cross-validation residuals using the \inlinecode{selectNcomp} function from
the \inlinecode{pls} package. Moreover, \gls{lasso} regression was performed
using the \inlinecode{cv.glmnet} package with \inlinecode{alpha = 1}. The best
\inlinecode{lambda} for each response was chosen using the minimum error
criterion. Lastly, a fixed linear kernel (i.e. \inlinecode{svmLinear}) was used
to build the \gls{svm} regression models, evaluating the cost parameter value
with the best \gls{loocv} performance.
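
A minimal sketch of this tuning workflow (assuming hypothetical inputs
\inlinecode{x}, a predictor matrix, and \inlinecode{y}, a numeric response
vector; the grids shown are placeholders):

\begin{verbatim}
library(caret)   # train() grid search
library(pls)     # plsr(), selectNcomp()
library(glmnet)  # cv.glmnet()
library(kernlab) # backend for caret's svmLinear method

ctrl <- trainControl(method = "LOOCV")

# random forest: tune mtry over a small grid; ntree is passed through
rf_tuned  <- train(x, y, method = "rf", trControl = ctrl,
                   tuneGrid = expand.grid(mtry = c(2, 4, 8)), ntree = 500)

# SVM with a fixed linear kernel: tune only the cost parameter C
svm_tuned <- train(x, y, method = "svmLinear", trControl = ctrl,
                   tuneGrid = expand.grid(C = c(0.1, 1, 10)))

# PLSR: choose the number of components from the CV residual standard error
pls_fit <- plsr(y ~ x, validation = "LOO")
n_comp  <- selectNcomp(pls_fit, method = "onesigma")

# LASSO: alpha = 1, LOO cross-validation, lambda at the minimum CV error
lasso_cv    <- cv.glmnet(x, y, alpha = 1, nfolds = length(y),
                         grouped = FALSE)
best_lambda <- lasso_cv$lambda.min
\end{verbatim}
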
Prediction performance was measured for all models using the final model with
\gls{loocv}-tuned parameters.
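
For reference, the \gls{loocv} $R^2$ used as the performance criterion takes
the standard predictive-$R^2$ form
\[
  R^2_{\mathrm{LOO}} = 1 -
  \frac{\sum_{i=1}^{n} (y_i - \hat{y}_{-i})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\]
where $\hat{y}_{-i}$ is the prediction for sample $i$ from the model fit with
sample $i$ held out and $\bar{y}$ is the mean of the observed responses.
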
% TODO do I need this?
% Table M2 shows the parameter values evaluated per model
% at the final stages of results reporting.
\subsection{consensus analysis}
Consensus analysis of the relevant variables extracted from each \gls{ml} model
was done to identify consistent predictive features of quality at the early
stages of manufacturing. First, importance scores for all features were
measured across all \gls{ml} models using the \inlinecode{varImp} function from
the \inlinecode{caret} R package, except for \gls{svm}, for which the
\inlinecode{rminer} R package was used. These importance scores were the
percent increase in \gls{mse}, the relative importance through the average
increase in prediction error when a given predictor is permuted, permuted
coefficient values, absolute coefficient values, the weighted sum of absolute
coefficient values, and the relative importance from sensitivity analysis for
\gls{rf}, \gls{gbm}, \gls{cif}, \gls{lasso}, \gls{plsr}, and \gls{svm},
respectively. Using these scores, key predictive variables were selected if
their importance scores were within the 80th percentile ranking for \gls{rf},
\gls{gbm}, \gls{cif}, \gls{lasso}, \gls{plsr}, and \gls{svm}. For \gls{sr},
variables present in >30\% of the top-performing \gls{sr} models from
DataModeler ($R^2 \geq 90\%$, complexity $\geq 100$) were chosen to investigate
consensus, except for the \gls{nmr} media models at day 4, which considered a
combination of the top-performing results of models excluding lactate ppms and
included those variables present in >40\% of the best-performing models. Only
variables with these high-percentile scores were evaluated in terms of their
logical relations (intersections across \gls{ml} models) and depicted using a
Venn diagram from the \inlinecode{venn} R package.
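
A minimal sketch of this selection step (assuming a hypothetical named list
\inlinecode{fits} of fitted \inlinecode{caret} models; the \gls{svm} scores
from \inlinecode{rminer} are noted in a comment rather than sketched):

\begin{verbatim}
library(caret)  # varImp() importance scores
library(venn)   # venn() diagram of set intersections

# Hypothetical input: `fits` is a named list of caret model fits.
# For the SVM model, rminer's sensitivity-analysis importance was
# used instead (see text).
top_vars <- lapply(fits, function(fit) {
  imp <- varImp(fit)$importance
  rownames(imp)[imp$Overall >= quantile(imp$Overall, 0.8)]
})

# intersect the per-model selections and draw the Venn diagram
venn(top_vars)
\end{verbatim}
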
\section{results}