ADD methods to aim 2a

This commit is contained in:
Nathan Dwarshuis 2021-07-29 12:48:13 -04:00
parent f6a24398ca
commit 6d9084fe92
1 changed files with 239 additions and 17 deletions


@@ -722,7 +722,7 @@ better retention of memory phenotype compared to current bead-based methods.
\section{methods}
\subsection{dms functionalization}
\subsection{dms functionalization}\label{sec:dms_fab}
\begin{figure*}[ht!]
\begingroup
@@ -778,14 +778,6 @@ was then manually counted to obtain a concentration. Surface area for
\si{\ab\per\um\squared} was calculated using the properties for \gls{cus} and
\gls{cug} as given by the manufacturer {Table X}.
%TODO this bit belongs in the next aim
% In the case of the \gls{doe} experiment where
% variable mAb surface density was utilized, the anti-CD3/anti-CD28 mAb mixture
% was further combined with a biotinylated isotype control to reduce the overall
% fraction of targeted mAbs (for example the 60\% mAb surface density corresponded
% to 3 mass parts anti-CD3, 3 mass parts anti-CD8, and 4 mass parts isotype
% control).
\subsection{dms quality control assays}
Biotin was quantified using the \product{\gls{haba} assay}{Sigma}{H2153-1VL}. In
@@ -848,11 +840,6 @@ depending on media color or a \SI{300}{\mg\per\deci\liter} minimum glucose
threshold. Media glucose was measured using a \product{GlucCell glucose
meter}{Chemglass}{CLS-1322-02}.
% TODO this belongs in aim 2
% In order to remove \glspl{dms} from
% culture, collagenase D (Sigma Aldrich) was sterile filtered in culture media and
% added to a final concentration of \SI{50}{\ug\per\ml} during media addition.
Cells on the \glspl{dms} were visualized by adding \SI{0.5}{\ul}
\product{\gls{stppe}}{\bl}{405204} and \SI{2}{\ul}
\product{\acd{45}-\gls{af647}}{\bl}{368538}, incubating for \SI{1}{\hour}, and
@@ -1047,7 +1034,7 @@ These equations were then used analogously to describe the reaction profile of
% METHOD add the equation governing the washing steps
\subsection{Luminex Analysis}
\subsection{Luminex Analysis}\label{sec:luminex_analysis}
Luminex was performed using a \product{ProcartaPlex kit}{\thermo}{custom} for
the markers outlined in \cref{tab:luminex_panel} with modifications (note that
@@ -1055,14 +1042,21 @@ some markers were run in separate panels to allow for proper dilutions).
Briefly, media supernatants from cells were sampled as desired and immediately
placed in a \SI{-80}{\degreeCelsius} freezer until use. Before use, samples were
thawed at \gls{rt} and vortexed to ensure homogeneity. To run the plate,
\SI{25}{\ul} of magnetic beads were added to the plate and washed 3x using
\SI{25}{\ul} of magnetic beads were added to the plate and washed 3X using
\SI{300}{\ul} of wash buffer. \SI{25}{\ul} of samples or standard were added to
the plate and incubated for \SI{120}{\minute} at \SI{850}{\rpm} at \gls{rt}
before washing analogously 3X with wash. \SI{12.5}{\ul} detection \glspl{mab}
and \SI{25}{\ul} \gls{stppe} were sequentially added, incubated for
\SI{30}{\minute} and vortexed, and washed analogously to the sample step.
Finally, samples were resuspended in \SI{120}{\ul} reading buffer and analyzed
via a Biorad Bioplex 200 plate reader.
via a BioRad Bioplex 200 plate reader. An 8-point log2 standard curve was used,
and all samples were run as single replicates.
Luminex data was preprocessed using R for inclusion in downstream analysis as
follows. Any cytokine level that was over-range (`OOR >' in output spreadsheet)
was set to the maximum value of the standard curve for that cytokine. Any value
that was under-range (`OOR <' in output spreadsheet) was set to zero. All values
that were extrapolated from the standard curve were left unchanged.
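As an illustration only, this preprocessing logic can be sketched in R as
follows (the function name, column layout, and standard-curve maxima are
placeholders rather than the actual script):
\begin{verbatim}
# Sketch only: 'lum' is a data frame of Luminex readings (one column per
# cytokine, read in as character so the "OOR >"/"OOR <" flags survive) and
# 'curve_max' is a named vector of standard-curve maxima per cytokine.
preprocess_luminex <- function(lum, curve_max) {
  for (cyto in names(curve_max)) {
    v <- lum[[cyto]]
    v[grepl("OOR >", v, fixed = TRUE)] <- curve_max[[cyto]]  # over-range -> curve max
    v[grepl("OOR <", v, fixed = TRUE)] <- 0                  # under-range -> zero
    lum[[cyto]] <- as.numeric(v)  # extrapolated values pass through unchanged
  }
  lum
}
\end{verbatim}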
\begin{table}[!h] \centering
\caption{Luminex Panel}
@@ -1119,6 +1113,11 @@ lack-of-fit tests where replicates were present (to assess model fit in the
context of pure error). Statistical significance was evaluated at $\upalpha$ =
0.05.
\subsection{flow cytometry}\label{sec:flow_cytometry}
% METHOD add flow cytometry
% FIGURE add gating strategy
\section{results}
\subsection{DMSs can be fabricated in a controlled manner}
@@ -1944,6 +1943,229 @@ provide these benefits.
\section{introduction}
\section{methods}
\subsection{study design}
The first DOE was a randomized 18-run I-optimal custom design in which each DMS
parameter was evaluated at three levels: IL2 concentration (10, 20, and 30
U/μL), DMS concentration (500, 1500, and 2500 carriers/μL), and functionalized
antibody percent (60\%, 80\%, and 100\%). These 18 runs comprised 14 unique
parameter combinations, 4 of which were replicated twice to assess prediction
error. To further optimize the initial region explored by this DOE in terms of
total live CD4+ TN+TCM cells, a sequential adaptive design-of-experiment (ADOE)
was designed with 10 unique parameter combinations, two of which were
replicated twice for a total of 12 additional samples (Fig.1b). The ADOE
process parameters were evaluated at the following levels: IL2 concentration
(30, 35, and 40 U/μL), DMS concentration (500, 1000, 1500, 2000, 2500, 3000,
and 3500 carriers/μL), and functionalized antibody percent (100\%), as depicted
in Fig.1b. The fusion of cytokine and NMR profiles from the media used to model
these responses included 30 cytokines from a custom Thermo Fisher ProcartaPlex
Luminex kit and 20 NMR features. These 20 spectral features from the NMR media
analysis were selected out of approximately 250 peaks using a variance-based
feature selection approach together with manual inspection.
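The software used to generate the custom design is not described here; purely
as an illustration, a candidate grid over these factor levels could be
enumerated and an I-optimal subset selected in R with the AlgDesign package (an
assumed stand-in, not necessarily the tool actually used):
\begin{verbatim}
# Sketch only (assumes the AlgDesign package; factor levels from the text)
library(AlgDesign)

# full-factorial candidate set over the three DMS process parameters
candidates <- expand.grid(
  il2 = c(10, 20, 30),       # IL2 concentration (U/uL)
  dms = c(500, 1500, 2500),  # DMS concentration (carriers/uL)
  mab = c(60, 80, 100)       # functionalized antibody percent
)

# 14 unique I-optimal runs for a full quadratic model; the 4 replicated
# combinations that complete the 18 runs would be added afterwards
design <- optFederov(~ quad(il2, dms, mab), data = candidates,
                     nTrials = 14, criterion = "I")
design$design
\end{verbatim}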
\subsection{DMS fabrication}
\glspl{dms} were fabricated as described in \cref{sec:dms_fab} with the
following modifications in order to obtain a variable functional \gls{mab}
surface density. During the \gls{mab} coating step, the anti-CD3/anti-CD28 mAb
mixture was further combined with a biotinylated isotype control to reduce the
overall fraction of targeted \glspl{mab} (for example the \SI{60}{\percent}
\gls{mab} surface density corresponded to 3 mass parts \acd{3}, 3 mass parts
\acd{28}, and 4 mass parts isotype control).
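As a small worked example of this mixing arithmetic (a hypothetical helper for
illustration, not part of the protocol), the targeted fraction is split equally
between \acd{3} and \acd{28} and the remainder is made up with isotype control:
\begin{verbatim}
# Hypothetical helper illustrating the mass-part arithmetic described above
mab_mass_parts <- function(surface_density, total_parts = 10) {
  targeted <- surface_density * total_parts  # e.g. 0.6 * 10 = 6 targeted parts
  c(antiCD3  = targeted / 2,                 # split equally between anti-CD3...
    antiCD28 = targeted / 2,                 # ...and anti-CD28
    isotype  = total_parts - targeted)       # remainder is isotype control
}
mab_mass_parts(0.6)  # 3 parts anti-CD3, 3 parts anti-CD28, 4 parts isotype
\end{verbatim}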
\subsection{T cell culture}
T cell culture was performed as described in \cref{sec:tcellculture} with the
following modifications. At days 4, 6, 8, and 11, \SI{100}{\ul} of media was
collected for the Luminex assay and \gls{nmr} analysis. The volume of removed
media was equivalently replaced during the media feeding step, which took place
immediately after sample collection. Additionally, the same media feeding
schedule was followed for the DOE and ADOE to improve consistency, and the same
donor lot was used for both experiments. All cell counts were performed using
\gls{aopi}.
\subsection{flow cytometry}
Flow cytometry was performed analogously to \cref{sec:flow_cytometry}.
\subsection{Cytokine quantification}
Cytokines were quantified via Luminex as described in
\cref{sec:luminex_analysis}.
% TODO paraphrase this entire section since I didn't do it
\subsection{NMR metabolomics}
Prior to analysis, samples were centrifuged at \SI{2990}{\gforce} for
\SI{20}{\minute} at \SI{4}{\degreeCelsius} to clear any debris. 5 μL of 100/3 mM
DSS-D6 in deuterium oxide (Cambridge Isotope Laboratories) were added to 1.7 mm
NMR tubes (Bruker BioSpin), followed by 45 μL of media from each sample, which
was added and mixed for a final volume of 50 μL in each tube. Samples were prepared
on ice and in predetermined, randomized order. The remaining volume from each
sample in the rack (4 μL) was combined to create an internal pool. This
material was used for internal controls within each rack as well as metabolite
annotation.
NMR spectra were collected on a Bruker Avance III HD spectrometer at 600 MHz
using a 5-mm TXI cryogenic probe and TopSpin software (Bruker BioSpin).
One-dimensional spectra were collected on all samples using the noesypr1d pulse
sequence under automation using ICON NMR software. Two-dimensional HSQC and
TOCSY spectra were collected on internal pooled control samples for metabolite
annotation.
One-dimensional spectra were manually phased and baseline corrected in TopSpin.
Two-dimensional spectra were processed in NMRpipe [37]. One-dimensional spectra
were referenced, water/end regions were removed, and spectra were normalized
with the PQN algorithm [38] using an in-house MATLAB (The MathWorks, Inc.)
toolbox (https://github.com/artedison/Edison_Lab_Shared_Metabolomics_UGA).
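For reference, the PQN normalization step amounts to the following (a minimal
sketch in R rather than the in-house MATLAB toolbox; variable names are
illustrative):
\begin{verbatim}
# Probabilistic quotient normalization (PQN), sketched in R.
# 'spectra' is a matrix with one row per sample and one column per spectral
# point, already referenced and with water/end regions removed.
pqn_normalize <- function(spectra) {
  spectra   <- spectra / rowSums(spectra)          # total-area normalization
  reference <- apply(spectra, 2, median)           # median reference spectrum
  quotients <- sweep(spectra, 2, reference, "/")   # point-wise quotients
  dilution  <- apply(quotients, 1, median, na.rm = TRUE)  # per-sample factor
  sweep(spectra, 1, dilution, "/")                 # rescale each spectrum
}
\end{verbatim}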
To reduce the total number of spectral features from approximately 250 peaks and
enrich for those that would be most useful for statistical modeling, a
variance-based feature selection was performed within MATLAB. For each digitized
point on the spectrum, the variance was calculated across all experimental
samples and plotted. Clearly resolved features corresponding to peaks in the
variance spectrum were manually binned and integrated to obtain quantitative
feature intensities across all samples (Supp.Fig.S24). In addition to highly
variable features, several other clearly resolved and easily identifiable
features were selected (glucose, the BCAA region, etc.). Some features were
later discovered to belong to the same metabolite but were nonetheless retained
for further analysis.
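A minimal sketch of this variance-spectrum step, written here in R rather than
MATLAB (bin names and ppm boundaries are placeholders chosen by inspection):
\begin{verbatim}
# 'spectra': matrix of processed 1D spectra (samples x spectral points)
# 'ppm': chemical shift value for each spectral point

# variance of each spectral point across all experimental samples
point_variance <- apply(spectra, 2, var)
plot(ppm, point_variance, type = "l", xlim = rev(range(ppm)),
     xlab = "ppm", ylab = "variance")

# manually chosen bins (placeholder ppm ranges) around clearly resolved
# peaks in the variance spectrum, integrated to give feature intensities
bins <- list(feature1 = c(3.20, 3.28), feature2 = c(1.31, 1.34))
features <- sapply(bins, function(b) {
  idx <- ppm >= b[1] & ppm <= b[2]
  rowSums(spectra[, idx, drop = FALSE])  # integrated intensity per sample
})
\end{verbatim}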
Two-dimensional spectra collected on pooled samples were uploaded to COLMARm web
server [10], where HSQC peaks were automatically matched to database peaks. HSQC
matches were manually reviewed with additional 2D and proton spectra to confirm
the match. Annotations were assigned a confidence score based upon the levels of
spectral data supporting the match as previously described [11]. Annotated
metabolites were matched to previously selected features used for statistical
analysis.
Using the list of annotated metabolites obtained above, an approximation of a
representative experimental spectrum was generated using the GISSMO mixture
simulation tool [39,40]. With the simulated mixture of compounds, generated at 600
MHz to match the experimental data, a new simulation was generated at 80 MHz to
match the field strength of commercially available benchtop NMR spectrometers.
The GISSMO tool allows visualization of the signals contributed by each
individual compound as well as the mixture, enabling annotation of features in
the mixture belonging to specific compounds.
Several low abundance features selected for analysis did not have database
matches and were not annotated. Statistical total correlation spectroscopy [41]
suggested that some of these unknown features belonged to the same molecules
(not shown). Additional multidimensional NMR experiments will be required to
determine their identity.
% TODO paraphrase most of this since I didn't do much of the analysis myself
\subsection{machine learning and statistical analysis}
Seven machine learning (ML) techniques were implemented to predict three
responses related to the memory phenotype of the cultured T cells under
different process parameter conditions (i.e. Total Live CD4+ TN+TCM, Total
Live CD8+ TN+TCM, and the Ratio of CD4+/CD8+ TN+TCM). The ML methods executed were
Random Forest (RF), Gradient Boosted Machine (GBM), Conditional Inference Forest
(CIF), Least Absolute Shrinkage and Selection Operator (LASSO), Partial
Least-Squares Regression (PLSR), Support Vector Machine (SVM), and DataModeler's
Symbolic Regression (SR). Primarily, SR models were used to optimize process
parameter values based on TN+TCM phenotype and to extract early predictive
variable combinations from the multi-omics experiments. Furthermore, all
regression methods were executed, and the high-performing models were used to
perform a consensus analysis of the important variables to extract potential
critical quality attributes and critical process parameters predictive of T-cell
potency, safety, and consistency at the early stages of the manufacturing
process.
Symbolic regression (SR) was done using Evolved Analytics DataModeler software
(Evolved Analytics LLC, Midland, MI). DataModeler utilizes genetic programming
to evolve symbolic regression models (both linear and non-linear), rewarding
simplicity and accuracy. Using the selection criteria of highest accuracy
($R^2 > 90\%$ or noise-power) and lowest complexity, the top-performing models were
identified. Driving variables, variable combinations, and model dimensionality
tables were generated. The top-performing variable combinations were used to
generate model ensembles. In this analysis, DataModeler's SymbolicRegression
function was used to develop explicit algebraic (linear and nonlinear) models.
The fittest models were analyzed to identify the dominant variables using the
VariablePresence function, the dominant variable combinations using the
VariableCombinations function, and the model dimensionality (number of unique
variables) using the ModelDimensionality function. CreateModelEnsemble was used
to define trustable model ensembles using selected variable combinations and
these were summarized (model expressions, model phenotype, model tree plot,
ensemble quality, model quality, variable presence map, ANOVA tables, model
prediction plot, exportable model forms) using the ModelSummaryTable function.
Ensemble prediction and residual performance were respectively assessed via the
EnsemblePredictionPlot and EnsembleResidualPlot subroutines. Model maxima
(ModelMaximum function) and model minima (ModelMinimum function) were calculated
and displayed using the ResponsePlotExplorer function. Trade-off performance of
multiple responses was explored using the MultiTargetResponseExplorer and
ResponseComparisonExplorer with additional insights derived from the
ResponseContourPlotExplorer. Graphics and tables were generated by DataModeler.
These model ensembles were used to identify predicted response values, potential
optima in the responses, and regions of parameter values where the predictions
diverge the most.
Non-parametric tree-based ensembles were done through the randomForest, gbm, and
cforest regression functions in R, for random forest, gradient boosted trees,
and conditional inference forest models, respectively. Both random forest and
conditional inference forest construct multiple decision trees in parallel
during the training stage, randomly choosing a subset of features at each
decision tree split. Individual random forest decision trees are split using the Gini
Index, while conditional inference forest uses a statistical significance test
procedure to select the variables at each split, reducing correlation bias. In
contrast, gradient boosted trees construct regression trees in series through an
iterative procedure that adapts over the training set. This model learns from
the mistakes of previous regression trees to correct the errors of its
precursor trees (i.e. minimize mean squared error).
Prediction performance was evaluated using leave-one-out cross-validation
(LOO-R2), and variable importance was assessed using permutation-based scores:
\% increase in mean squared error (MSE) for RF, relative influence based on the
increase in prediction error for GBM, and coefficient values for CIF. Partial
least-squares regression was executed using the plsr function from the pls
package in R, while LASSO regression was performed using the cv.glmnet function
from the glmnet R package, both with leave-one-out cross-validation. Finally,
the kernlab R package was
used to construct the Support Vector Machine regression models.
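As an illustration of how these R models are typically constructed (the data
frame, column names, and hyperparameter values below are placeholders, not the
tuned settings; CIF and SVM fits follow the same pattern with party::cforest
and kernlab::ksvm):
\begin{verbatim}
library(randomForest)  # random forest
library(gbm)           # gradient boosted trees
library(pls)           # partial least squares
library(glmnet)        # LASSO

# 'dat' is assumed to hold the predictors plus a numeric response column 'y'
rf_fit   <- randomForest(y ~ ., data = dat, ntree = 500, mtry = 5,
                         importance = TRUE)
gbm_fit  <- gbm(y ~ ., data = dat, distribution = "gaussian",
                n.trees = 500, interaction.depth = 3, shrinkage = 0.01)
plsr_fit <- plsr(y ~ ., data = dat, validation = "LOO")
x         <- as.matrix(dat[, setdiff(names(dat), "y")])
lasso_fit <- cv.glmnet(x, dat$y, alpha = 1,                 # alpha = 1 -> LASSO
                       nfolds = nrow(dat), grouped = FALSE) # leave-one-out CV
\end{verbatim}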
Parameter tuning was done for all models in a grid-search manner using the train
function from the caret R package, with LOO-R2 as the optimization criterion.
Specifically, the number of features randomly sampled as candidates at each
split (mtry) and the number of trees to grow (ntree) were tuned parameters for
random forest and conditional inference forest. In particular, the minimum sum
of weights in a node to be considered for splitting and the minimum sum of
weights in a terminal node were manually tuned when building the CIF models. Moreover,
GBM parameters such as the number of trees to grow, maximum depth of each tree,
learning rate, and the minimal number of observations at the terminal node, were
tuned for optimum LOO-R2 performance as well. For PLSR, the optimal number of
components to be used in the model was assessed based on the standard error of
the cross-validation residuals using the function selectNcomp from the pls
package. Moreover, LASSO regression was performed using the cv.glmnet function
with alpha = 1, and the best lambda for each response was chosen using the
minimum-error criterion. Lastly, a fixed linear kernel (i.e. svmLinear) was used
to build the SVM regression models, evaluating the cost parameter value with the
best LOO-R2.
Prediction performance was measured for all models using the final model with
LOO-R2 tuned parameters. Table M2 shows the parameter values evaluated per model
at the final stages of results reporting.
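A minimal sketch of this grid-search tuning with caret (the grids shown are
illustrative placeholders, not the grids reported in Table M2):
\begin{verbatim}
library(caret)
library(pls)

loo <- trainControl(method = "LOOCV")   # leave-one-out cross-validation

# random forest: tune mtry over a small grid, optimizing R-squared
rf_tuned <- train(y ~ ., data = dat, method = "rf",
                  trControl = loo, metric = "Rsquared",
                  tuneGrid = expand.grid(mtry = c(2, 5, 10)),
                  ntree = 500)  # passed through to randomForest

# linear-kernel SVM: tune the cost parameter C
svm_tuned <- train(y ~ ., data = dat, method = "svmLinear",
                   trControl = loo, metric = "Rsquared",
                   tuneGrid = expand.grid(C = c(0.25, 0.5, 1, 2)))

# PLSR: pick the number of components from the CV standard error
plsr_fit <- plsr(y ~ ., data = dat, validation = "LOO")
ncomp    <- selectNcomp(plsr_fit, method = "onesigma")
\end{verbatim}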
\subsection{consensus analysis}
Consensus analysis of the relevant variables extracted from each machine
learning model was done to identify consistent predictive features of quality at
the early stages of manufacturing. First, importance scores for all features
were measured across all ML models using the varImp function from the caret R
package, except for SVM, for which the rminer R package was used. These
importance scores were the percent increase in mean squared error (MSE), the
relative importance through the average increase in prediction error when a
given predictor is permuted, permuted coefficient values, absolute coefficient
values, the weighted sum of absolute coefficient values, and the relative
importance from sensitivity analysis for RF, GBM, CIF, LASSO, PLSR, and SVM,
respectively. Using these scores, key predictive variables were selected if
their importance scores were within the 80th percentile ranking for RF, GBM,
CIF, LASSO, PLSR, and SVM. For SR, variables present in more than 30\% of the
top-performing SR models from DataModeler ($R^2 \geq 90\%$, complexity
$\geq 100$) were chosen to investigate consensus; the exception was the day 4
NMR media models, which considered a combination of the top-performing models
excluding lactate ppms and included those variables present in more than 40\%
of the best-performing models. Only variables with these high-percentile scores
were evaluated in terms of their logical relations (intersection across ML
models) and depicted using a Venn diagram from the venn R package.
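A minimal sketch of this consensus step in R (model objects, score columns, and
the SR variable list are placeholders; the SR variables would come from the
DataModeler output):
\begin{verbatim}
library(caret)
library(venn)

# caret train objects from the fits above (names are placeholders)
models <- list(RF = rf_tuned, SVM = svm_tuned)
scores <- lapply(models, function(m) varImp(m)$importance)

# keep features at or above the 80th percentile of each model's scores
top_features <- lapply(scores, function(s) {
  rownames(s)[s$Overall >= quantile(s$Overall, 0.80)]
})

# SR variables exported separately from DataModeler (hypothetical object)
top_features$SR <- sr_variables_over_30pct

consensus <- Reduce(intersect, top_features)  # intersection across models
venn(top_features)                            # Venn diagram of the overlaps
\end{verbatim}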
\section{results}
\subsection{DOE shows optimal conditions for expanded potent T cells}