ENH paraphrase/update the modeling methods
parent 0bf5b03827
commit f0cb320c49

tex/thesis.tex | 213
@@ -116,6 +116,17 @@
 \newacronym{cytof}{CyTOF}{cytometry time-of-flight}
 \newacronym{spade}{SPADE}{spanning-tree progression analysis of
 density-normalized events}
+\newacronym{ml}{ML}{machine learning}
+\newacronym{rf}{RF}{random forest}
+\newacronym{sr}{SR}{symbolic regression}
+\newacronym{gbm}{GBM}{gradient boosted trees}
+\newacronym{cif}{CIF}{conditional inference forests}
+\newacronym{lasso}{LASSO}{least absolute shrinkage and selection operator}
+\newacronym{svm}{SVM}{support vector machines}
+\newacronym{plsr}{PLSR}{partial least squares regression}
+\newacronym{mse}{MSE}{mean squared error}
+\newacronym{loocv}{LOO-CV}{leave-one-out cross validation}
+\newacronym{hsqc}{HSQC}{heteronuclear single quantum coherence}
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 % SI units for uber nerds
@@ -1109,7 +1120,7 @@ source data (for example, flagging entries which had a reagent whose
 manufacturing date was after the date the experiment started, which signifies a
 human input error).
 
-\subsection{statistical analysis}
+\subsection{statistical analysis}\label{sec:statistics}
 
 For 1-way \gls{anova} analysis with Tukey multiple comparisons test,
 significance was assessed using the \inlinecode{stat\_compare\_means} function
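(For reference, a minimal sketch of the call this paragraph describes, assuming
stat_compare_means comes from the ggpubr package and a hypothetical data frame
df with columns group and value.)

    library(ggpubr)  # provides ggboxplot() and stat_compare_means()

    # hypothetical data: one measured response per culture condition
    df <- data.frame(group = rep(c("A", "B", "C"), each = 6),
                     value = rnorm(18))

    # annotate a boxplot with the global 1-way ANOVA p-value
    ggboxplot(df, x = "group", y = "value") +
      stat_compare_means(method = "anova")

    # Tukey multiple comparisons on the same model (base R)
    TukeyHSD(aov(value ~ group, data = df))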
@@ -2097,12 +2108,12 @@ sample in the rack (approx. 4 uL) was combined to create an internal pool. This
 material was used for internal controls within each rack as well as metabolite
 annotation.
 
-NMR spectra were collected on a Bruker Avance III HD spectrometer at 600 MHz
-using a 5-mm TXI cryogenic probe and TopSpin software (Bruker BioSpin).
+\gls{nmr} spectra were collected on a Bruker Avance III HD spectrometer at 600
+MHz using a 5-mm TXI cryogenic probe and TopSpin software (Bruker BioSpin).
 One-dimensional spectra were collected on all samples using the noesypr1d pulse
-sequence under automation using ICON NMR software. Two-dimensional HSQC and
-TOCSY spectra were collected on internal pooled control samples for metabolite
-annotation.
+sequence under automation using ICON NMR software. Two-dimensional \gls{hsqc}
+and TOCSY spectra were collected on internal pooled control samples for
+metabolite annotation.
 
 One-dimensional spectra were manually phased and baseline corrected in TopSpin.
 Two-dimensional spectra were processed in NMRpipe37. One dimensional spectra
@@ -2123,12 +2134,12 @@ discovered to belong to the same metabolite but were included in further
 analysis.
 
 Two-dimensional spectra collected on pooled samples were uploaded to COLMARm web
-server10, where HSQC peaks were automatically matched to database peaks. HSQC
-matches were manually reviewed with additional 2D and proton spectra to confirm
-the match. Annotations were assigned a confidence score based upon the levels of
-spectral data supporting the match as previously described11. Annotated
-metabolites were matched to previously selected features used for statistical
-analysis.
+server10, where \gls{hsqc} peaks were automatically matched to database peaks.
+\gls{hsqc} matches were manually reviewed with additional 2D and proton spectra
+to confirm the match. Annotations were assigned a confidence score based upon
+the levels of spectral data supporting the match as previously described11.
+Annotated metabolites were matched to previously selected features used for
+statistical analysis.
 
 % I'm pretty sure this isn't relevant
 % Using the list of annotated metabolites obtained above, an approximation of a
@@ -2146,17 +2157,19 @@ suggested that some of these unknown features belonged to the same molecules
 (not shown). Additional multidimensional NMR experiments will be required to
 determine their identity.
 
-% TODO paraphrase most of this since I didn't do much of the analysis myself
+% TODO add footnote saying what I did in this
 \subsection{machine learning and statistical analysis}
 
-Seven machine learning (ML) techniques were implemented to predict three
-responses related to the memory phenotype of the cultured T cells under
-different process parameters conditions (i.e. Total Live CD4+ TN and TCM, Total
-Live CD8+ TN+TCM, and Ratio CD4+/CD8+ TN+TCM). The ML methods executed were
-Random Forest (RF), Gradient Boosted Machine (GBM), Conditional Inference Forest
-(CIF), Least Absolute Shrinkage and Selection Operator (LASSO), Partial
-Least-Squares Regression (PLSR), Support Vector Machine (SVM), and DataModeler’s
-Symbolic Regression (SR). Primarily, SR models were used to optimize process
+Linear regression analysis of the \glspl{doe} was performed as described in
+\cref{sec:statistics}.
+
+Seven \gls{ml} techniques were implemented to predict three responses related to
+the memory phenotype of the cultured T cells under different process parameter
+conditions (i.e. Total Live CD4+ TN and TCM, Total Live CD8+ TN+TCM, and Ratio
+CD4+/CD8+ TN+TCM). The \gls{ml} methods executed were \gls{rf}, \gls{gbm},
+\gls{cif}, \gls{lasso}, \gls{plsr}, \gls{svm}, and DataModeler’s
+\gls{sr}\footnote{of these seven methods, all except \gls{lasso} were performed
+by collaborators}. Primarily, \gls{sr} models were used to optimize process
 parameter values based on TN+TCM phenotype and to extract early predictive
 variable combinations from the multi-omics experiments. Furthermore, all
 regression methods were executed, and the high-performing models were used to
@@ -2165,97 +2178,103 @@ critical quality attributes and critical process parameters predictive of T-cell
 potency, safety, and consistency at the early stages of the manufacturing
 process.
 
-Symbolic regression (SR) was done using Evolved Analytics’ DataModeler software
-(Evolved Analytics LLC, Midland, MI). DataModeler utilizes genetic programming
-to evolve symbolic regression models (both linear and non-linear) rewarding
-simplicity and accuracy. Using the selection criteria of highest accuracy
-(R2>90\% or noise-power) and lowest complexity, the top-performing models were
-identified. Driving variables, variable combinations, and model dimensionality
-tables were generated. The top-performing variable combinations were used to
-generate model ensembles. In this analysis, DataModeler’s SymbolicRegression
-function was used to develop explicit algebraic (linear and nonlinear) models.
-The fittest models were analyzed to identify the dominant variables using the
-VariablePresence function, the dominant variable combinations using the
-VariableCombinations function, and the model dimensionality (number of unique
-variables) using the ModelDimensionality function. CreateModelEnsemble was used
-to define trustable model ensembles using selected variable combinations and
-these were summarized (model expressions, model phenotype, model tree plot,
-ensemble quality, model quality, variable presence map, ANOVA tables, model
-prediction plot, exportable model forms) using the ModelSummaryTable function.
-Ensemble prediction and residual performance were respectively assessed via the
-EnsemblePredictionPlot and EnsembleResidualPlot subroutines. Model maxima
-(ModelMaximum function) and model minima (ModelMinimum function) were calculated
-and displayed using the ResponsePlotExplorer function. Trade-off performance of
-multiple responses was explored using the MultiTargetResponseExplorer and
-ResponseComparisonExplorer with additional insights derived from the
-ResponseContourPlotExplorer. Graphics and tables were generated by DataModeler.
-These model ensembles were used to identify predicted response values, potential
-optima in the responses, and regions of parameter values where the predictions
-diverge the most.
+\gls{sr} was done using Evolved Analytics’ DataModeler software (Evolved
+Analytics LLC, Midland, MI). DataModeler utilizes genetic programming to evolve
+symbolic regression models (both linear and non-linear) rewarding simplicity and
+accuracy. Using the selection criteria of highest accuracy (R2>90\% or
+noise-power) and lowest complexity, the top-performing models were identified.
+Driving variables, variable combinations, and model dimensionality tables were
+generated. The top-performing variable combinations were used to generate model
+ensembles. In this analysis, DataModeler’s SymbolicRegression function was used
+to develop explicit algebraic (linear and nonlinear) models. The fittest models
+were analyzed to identify the dominant variables using the VariablePresence
+function, the dominant variable combinations using the VariableCombinations
+function, and the model dimensionality (number of unique variables) using the
+ModelDimensionality function. CreateModelEnsemble was used to define trustable
+model ensembles using selected variable combinations and these were summarized
+(model expressions, model phenotype, model tree plot, ensemble quality, model
+quality, variable presence map, \gls{anova} tables, model prediction plot, exportable
+model forms) using the ModelSummaryTable function. Ensemble prediction and
+residual performance were respectively assessed via the EnsemblePredictionPlot
+and EnsembleResidualPlot subroutines. Model maxima (ModelMaximum function) and
+model minima (ModelMinimum function) were calculated and displayed using the
+ResponsePlotExplorer function. Trade-off performance of multiple responses was
+explored using the MultiTargetResponseExplorer and ResponseComparisonExplorer
+with additional insights derived from the ResponseContourPlotExplorer. Graphics
+and tables were generated by DataModeler. These model ensembles were used to
+identify predicted response values, potential optima in the responses, and
+regions of parameter values where the predictions diverge the most.
 
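(DataModeler’s interface is proprietary, so no attempt is made to reproduce it
here; the following is only a generic R illustration of the stated selection
rule, highest accuracy with R2 > 90% and then lowest complexity, applied to a
hypothetical table of candidate models.)

    # hypothetical summary of candidate symbolic-regression models
    models <- data.frame(id         = 1:5,
                         r2         = c(0.95, 0.92, 0.97, 0.85, 0.91),
                         complexity = c(120, 80, 300, 60, 150))

    # keep models above the accuracy threshold, then rank by simplicity
    fit <- subset(models, r2 > 0.90)
    fit[order(fit$complexity), ]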
-Non-parametric tree-based ensembles were done through the randomForest, gbm, and
-cforest regression functions in R, for random forest, gradient boosted trees,
-and conditional inference forest models, respectively. Both random forest and
-conditional inference forest construct multiple decision trees in parallel, by
+Non-parametric tree-based ensembles were done through the
+\inlinecode{randomForest}, \inlinecode{gbm}, and \inlinecode{cforest} regression
+functions in R, for \gls{rf}, \gls{gbm}, and \gls{cif} models, respectively.
+Both \gls{rf} and \gls{cif} construct multiple decision trees in parallel, by
 randomly choosing a subset of features at each decision tree split, in the
 training stage. Random forest individual decision trees are split using the Gini
 Index, while conditional inference forest uses a statistical significance test
 procedure to select the variables at each split, reducing correlation bias. In
-contrast, gradient boosted trees construct regression trees in series through an
-iterative procedure that adapts over the training set. This model learns from
-the mistakes of previous regression trees in an iterative fashion to correct
-errors from its precursors’ trees (i.e. minimize mean squared errors).
-Prediction performance was evaluated using leave-one-out cross-validation
-(LOO)-R2 and permutation-based variable importance scores assessing \% increase
-of mean squared errors (MSE), relative influence based on the increase of
-prediction error, coefficient values for RF, GBM, and CID, respectively. Partial
-least squares regression was executed using the plsr function from the pls
-package in R while LASSO regression was performed using the cv.glmnet R package,
-both using leave-one-out cross-validation. Finally, the kernlab R package was
-used to construct the Support Vector Machine regression models.
+contrast, \gls{gbm} construct regression trees in series through an iterative
+procedure that adapts over the training set. This model learns from the mistakes
+of previous regression trees in an iterative fashion to correct errors from its
+precursors’ trees (i.e. minimize \gls{mse}). Prediction performance was
+evaluated using \gls{loocv} and permutation-based
+variable importance scores assessing \% increase of \gls{mse}, relative
+influence based on the increase of prediction error, coefficient values for
+\gls{rf}, \gls{gbm}, and \gls{cif}, respectively. \gls{plsr} was executed using
+the \inlinecode{plsr} function from the \inlinecode{pls} package in R while
+\gls{lasso} regression was performed using \inlinecode{cv.glmnet} from the
+\inlinecode{glmnet} R package, both using leave-one-out cross-validation.
+Finally, \inlinecode{kernlab} was used to construct the \gls{svm} models.
 
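(A condensed sketch of the fits this paragraph describes, assuming a
hypothetical data frame df whose column y is the response and whose remaining
columns are predictors; only standard entry points of the named packages are
used.)

    library(randomForest)  # random forest
    library(gbm)           # gradient boosted trees
    library(party)         # cforest: conditional inference forest
    library(pls)           # plsr: partial least squares regression
    library(glmnet)        # cv.glmnet: LASSO
    library(kernlab)       # ksvm: support vector machine

    # parallel tree ensembles: feature subsets drawn at each split
    rf_fit  <- randomForest(y ~ ., data = df)
    cif_fit <- cforest(y ~ ., data = df,
                       controls = cforest_unbiased(ntree = 500))

    # sequential ensemble: each tree corrects its precursors' errors (MSE)
    gbm_fit <- gbm(y ~ ., data = df, distribution = "gaussian")

    # PLSR and LASSO, both with leave-one-out cross-validation
    pls_fit   <- plsr(y ~ ., data = df, validation = "LOO")
    x         <- as.matrix(df[setdiff(names(df), "y")])
    lasso_fit <- cv.glmnet(x, df$y, alpha = 1,
                           nfolds = nrow(df), grouped = FALSE)  # LOO folds

    # SVM regression with a linear kernel
    svm_fit <- ksvm(y ~ ., data = df, kernel = "vanilladot")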
 Parameter tuning was done for all models in a grid search manner using the train
-function from the caret R package using LOO-R2 as the optimization criteria.
-Specifically, the number of features randomly sampled as candidates at each
-split (mtry) and the number of trees to grow (ntree) were tuned parameters for
-random forest and conditional inference forest. In particular, minimum sum of
-weights in a node to be considered for splitting and the minimum sum of weights
-in a terminal node were manually tuned for building the CIF models. Moreover,
-GBM parameters such as the number of trees to grow, maximum depth of each tree,
-learning rate, and the minimal number of observations at the terminal node, were
-tuned for optimum LOO-R2 performance as well. For PLSR, the optimal number of
+function from the \inlinecode{caret} R package using \gls{loocv} as the
+optimization criterion. Specifically, the number of features randomly sampled as
+candidates at each split (\inlinecode{mtry}) and the number of trees to grow
+(\inlinecode{ntree}) were tuned parameters for random forest and conditional
+inference forest. In particular, minimum sum of weights in a node to be
+considered for splitting and the minimum sum of weights in a terminal node were
+manually tuned for building the \gls{cif} models. Moreover, \gls{gbm} parameters
+such as the number of trees to grow, maximum depth of each tree, learning rate,
+and the minimal number of observations at the terminal node, were tuned for
+optimum \gls{loocv} performance as well. For \gls{plsr}, the optimal number of
 components to be used in the model was assessed based on the standard error of
-the cross-validation residuals using the function selectNcomp from the pls
-package. Moreover, LASSO regression was performed using the cv.glmnet package
-with alpha = 1. The best lambda for each response was chosen using the minimum
-error criteria. Lastly, a fixed linear kernel (i.e. svmLinear) was used to build
-the SVM regression models evaluating the cost parameter value with best LOO-R2.
+the cross-validation residuals using the function \inlinecode{selectNcomp} from
+the \inlinecode{pls} package. Moreover, \gls{lasso} regression was performed
+using the \inlinecode{cv.glmnet} function with alpha = 1. The best lambda for
+each response was chosen using the minimum error criterion. Lastly, a fixed
+linear kernel (i.e. \inlinecode{svmLinear}) was used to build the \gls{svm}
+regression models evaluating the cost parameter value with best \gls{loocv}.
 Prediction performance was measured for all models using the final model with
-LOO-R2 tuned parameters. Table M2 shows the parameter values evaluated per model
-at the final stages of results reporting.
+\gls{loocv} tuned parameters.
+
+% TODO do I need this?
+% Table M2 shows the parameter values evaluated per model
+% at the final stages of results reporting.
 
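(A sketch of the grid search described above, continuing from the previous
snippet; caret’s train() with LOOCV resampling and R-squared as the selection
metric, with placeholder grid values.)

    library(caret)

    ctrl <- trainControl(method = "LOOCV")

    # random forest: tune mtry over a grid; ntree is passed to randomForest()
    rf_tune <- train(y ~ ., data = df, method = "rf",
                     trControl = ctrl, metric = "Rsquared",
                     tuneGrid = expand.grid(mtry = c(2, 4, 6)),
                     ntree = 500)

    # PLSR: component count from the cross-validation residuals' standard error
    n_comp <- selectNcomp(pls_fit, method = "onesigma")

    # LASSO: lambda with the minimum cross-validated error
    best_lambda <- lasso_fit$lambda.min

    # SVM: fixed linear kernel, tuning the cost parameter C
    svm_tune <- train(y ~ ., data = df, method = "svmLinear",
                      trControl = ctrl, metric = "Rsquared",
                      tuneGrid = expand.grid(C = c(0.25, 0.5, 1)))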
 \subsection{consensus analysis}
 
 Consensus analysis of the relevant variables extracted from each machine
 learning model was done to identify consistent predictive features of quality at
 the early stages of manufacturing. First importance scores for all features were
-measured across all ML models using varImp with caret R package except for
-scores for SVM which rminer R package was used. These importance scores were
-percent increase in mean squared error (MSE), relative importance through
-average increase in prediction error when a given predictor is permuted,
-permuted coefficients values, absolute coefficient values, weighted sum of
-absolute coefficients values, and relative importance from sensitivity analysis
-determined for RF, GBM, CIF, LASSO, PLSR, and SVM, respectively. Using these
-scores, key predictive variables were selected if their importance scores were
-within the 80th percentile ranking for the following ML methods: RF, GBM, CIF,
-LASSO, PLSR, SVM while for SR variables present in >30\% of the top-performing
-SR models from DataModeler (R2>= 90\%, Complexity >= 100) were chosen to
-investigate consensus except for NMR media models at day 4 which considered a
-combination of the top-performing results of models excluding lactate ppms, and
-included those variables which were in > 40\% of the best performing models.
-Only variables with those high percentile scoring values were evaluated in terms
-of their logical relation (intersection across ML models) and depicted using a
-Venn diagram from the venn R package.
+measured across all \gls{ml} models using \inlinecode{varImp} from the
+\inlinecode{caret} R package, except for \gls{svm}, for which the
+\inlinecode{rminer} R package was used. These importance scores were percent
+increase in \gls{mse}, relative importance through average increase in
+prediction error when a given predictor is permuted, permuted coefficient
+values, absolute coefficient values, weighted sum of absolute coefficient
+values, and relative importance from sensitivity analysis determined for
+\gls{rf}, \gls{gbm}, \gls{cif}, \gls{lasso}, \gls{plsr}, and \gls{svm},
+respectively. Using these scores, key predictive variables were selected if
+their importance scores were within the 80th percentile ranking for the
+following \gls{ml} methods: \gls{rf}, \gls{gbm}, \gls{cif}, \gls{lasso},
+\gls{plsr}, and \gls{svm}, while for \gls{sr}, variables present in >30\% of the
+top-performing \gls{sr} models from DataModeler (R2>= 90\%, Complexity >= 100)
+were chosen to investigate consensus, except for \gls{nmr} media models at day 4
+which considered a combination of the top-performing results of models excluding
+lactate ppms, and included those variables which were in > 40\% of the best
+performing models. Only variables with those high percentile scoring values were
+evaluated in terms of their logical relation (intersection across \gls{ml}
+models) and depicted using a Venn diagram from the \inlinecode{venn} R package.
 
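(A sketch of the consensus step, reusing the tuned caret models from the
previous snippet; varImp is caret’s importance extractor, the hypothetical
helper applies the 80th-percentile cutoff from the text, and venn() draws the
intersection diagram.)

    library(caret)
    library(venn)

    # names of features scoring at or above the 80th percentile for one model
    top_features <- function(fit) {
      imp <- varImp(fit)$importance
      rownames(imp)[imp$Overall >= quantile(imp$Overall, 0.80)]
    }

    # one set of key variables per ML method (RF and SVM shown)
    sets <- list(RF  = top_features(rf_tune),
                 SVM = top_features(svm_tune))

    # logical relations (intersections) across models
    venn(sets, zcolor = "style")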
 \section{results}
 