Demonstration of replicating results of a paper using publicly available neuroimaging data.

Data ingestion

Data visualization

Demonstration of analyzing neuroimaging data using ML/DL methods.

Model building and model selection

Model validation

First, we use R to replicate some results and figures in an article published on “Nature Protocols” (NP).

paper_header

Figures to replicate include:

First, we will replicate a heatmap of voxel-level neuroimaging features that can differentiate Alzheimer’s diseased (AD) patients to normal control (NC) people like below.

Then, we will explore that connection structures between brain regions of interest (ROI) of AD and NC respectively, try to identify the differences between the two connectomes, and replicate similar figures in the NP paper as following.

Data to use

We are going to use Alzheimer’s Disease Neuroimaging Initiative (ADNI) positron emission tomography (PET) imaging scans from 53 AD patients and 72 NC people. Each imaging scan is a 3D array of dimension 91 x 109 x 91 in the dataset. Each brain scan contains 185,405 effective template voxels, grouped in 116 ROIs. The ROI numbers and corresponding brain region names can be find here

PART I-1. Ingesting neuroimaging data

Neuroconductor is an open-source platform for analysing neuroimaging (and other) data. It hosts multiple packages to analyse data using different software. We will mostly use its base library neurobase. Run the following to install the neurobase package

source("https://neuroconductor.org/neurocLite.R")
neuro_install("neurobase", release = "stable")

Loading and visulizing neuroimaging data

The function readnii reads in a time series of brain imaging scans for one subject – sub-10159 – in file “sub-10159_task-rest_bold_space-MNI152NLin2009cAsym_preproc.nii.gz” as a 4-D array.

library(neurobase)
data <- readnii("sub-10159_task-rest_bold_space-MNI152NLin2009cAsym_preproc.nii.gz")

dim(data)

Each brain imaging scan is of 3-D dimension 65 x 77 x 49, corresponding to the coronal, sagittal, and horizontal planes of the brain.

And the series contains 152 scans.

We can use check_nifti function to check the details of the scans. Those information is stored in the “header” of each imaging file. For example, the Pixel Dimension: 3 x 3 x 4 x 2 tells us that the size of each voxel in space is 3 x 3 x 4 mm. The last number is dimension in time: a scan is taken every 2 seconds.

check_nifti(data)

To visualize the nifti images, we can use the orthographic function. It will only show the 3D scan at one time point. R will automatically show the first timepoint.

orthographic(data)

The argument xyz allows us to specify where to draw the cross-lines.

orthographic(data,xyz=c(20,20,30))

Brain regions of interest

One branch of research interest in neuroimaging studies is to investigate functions of different regions of interest (ROIs) of the brain and connections between them (the so called brain connectome studies). We next do a simple analysis on this by looking at the correlations between the average signal in those brain parcels.

First load the necessary libraries (You will need to install them first! Please google installation of each of these packages).

library(RColorBrewer)
library(scales)
library(lattice)
library(Rcmdr)

Let us then load the anatomical image for subject sub-10159 and plot it using the ‘ortho2` function.

sub10159_anat <- readnii("sub-10159_T1w_space-MNI152NLin2009cAsym_preproc.nii.gz")

ortho2(sub10159_anat)

Next, we read in the atlas that separate the brain into 102 ROIs.

atlas <- readnii("mni_icbm152_CerebrA_tal_nlin_sym_09c.nii")

dim(atlas)
table(as.vector(atlas))

The table from the last syntax line above tells how many voxels are there in each ROI. “0” means voxels outside the brain template.

One can overlay the 3D brain imaging with the ROI template.

rf <- colorRampPalette(rev(brewer.pal(102,'Spectral')))
cols <- rf(102)

ortho2(sub10159_anat,atlas,col.y=alpha(sample(cols),0.5))

One can extract and plot a certain brain region.

broca <- ifelse(atlas==28,1,0)
ortho2(sub10159_anat,broca,xyz=c(150,100,60))

PART I-2. Data visulization and NP paper figure replication

Load in a reformatted ADNI dataset

load("ADNIdata.RData")
ls()
[1] "info" "mask" "PET" 

info is for the clinical information for each subject. mask is the atlas for the template of registered voxels for the brain and the ROIs. PET is the imaging data in a matrix of 125 x 185405

info[1:5,]
   Subject.ID Research.Group DX.Group visit.Time NPISCORE FAQ.Total.Score
4  003_S_1059        Patient       AD   Baseline        2              23
13 005_S_0221        Patient       AD   Baseline       13              22
34 006_S_0547        Patient       AD   Baseline        8              21
55 007_S_0316        Patient       AD   Baseline        3               3
67 007_S_1339        Patient       AD   Baseline        4              13
   MMSE.Total.Score CDR.Total.Score GDS.Total.Score Subject.Sex Subject.Age
4                NA              NA              NA           F       84.72
13               NA              NA              NA           M       67.57
34               NA              NA              NA           M       75.97
55               NA              NA              NA           M       81.01
67               NA              NA              NA           F       79.61
   Age.Qualifier Subject.WeightKg Subject.Sex.Num
4              Y            76.66               0
13             Y           112.94               1
34             Y            94.35               1
55             Y            76.02               1
67             Y            74.39               0

table(info[,3])

    AD Normal 
    53     72 
dim(mask)
[1]  91 109  91

mask_non0 <- as.vector(mask)[which(as.vector(mask)!=0)]
table(mask_non0)

mask_non0
2001 2002 2101 2102 2111 2112 2201 2202 2211 2212 2301 2302 2311 2312 2321 2322 
3526 3381 3599 4056  963  997 4863 5104  888 1015 1038 1399 2529 2151 1690 1707
 
2331 2332 2401 2402 2501 2502 2601 2602 2611 2612 2701 2702 3001 3002 4001 4002 
 990 1331 2147 2371  280  289 2992 2134  719  856  852  745 1858 1770 1400 1313
  
4011 4012 4021 4022 4101 4102 4111 4112 4201 4202 5001 5002 5011 5012 5021 5022 
1941 2203  463  335  932  946  978 1132  220  248 2258 1861 1526 1424 2095 2300 

5101 5102 5201 5202 5301 5302 5401 5402 6001 6002 6101 6102 6201 6202 6211 6212 
1366 1413 3270 2098  941  989 2310 2518 3892 3823 2065 2222 2447 1345 1256 1974
 
6221 6222 6301 6302 6401 6402 7001 7002 7011 7012 7021 7022 7101 7102 8101 8102 
1173 1752 3528 3265 1349  836  962  994 1009 1064  293  280 1100 1057  225  249
 
8111 8112 8121 8122 8201 8202 8211 8212 8301 8302 9001 9002 9011 9012 9021 9022 
2296 3141 1285 1338 4942 4409  755 1187 3200 3557 2603 2648 1894 2117  136  207
 
9031 9032 9041 9042 9051 9052 9061 9062 9071 9072 9081 9082 9100 9110 9120 9130 
1125  861 1694 1795  585  534 1887 2308  869  809  144  159   53  228  665  371 

9140 9150 9160 9170 
 194  243  174  112 

length(table(mask_non0))
[1] 116

dim(PET)
[1]    125 185405

Heatmap of hotspots in the brain that differentiate AD and NC

Calculate the Zscore for comparing the means between AD and NC for each voxel.

Zscore.v <- apply(PET, 2, function(x)((mean(x[1:53])-mean(x[54:125]))/sd(x)))

Zscore <- as.vector(mask)
Zscore[which(as.vector(mask)!=0)] <- Zscore.v 
Zscore <- array(Zscore, dim=c(91, 109, 91))

Below are some R codes I wrote for plotting brain heatmaps.

MyHeatMapADNI <- function(Beta){
 P <-nrow(Beta)
 Q <-ncol(Beta)
 Xlim <- c(0,2*(P+1))
 Ylim <- c(0,2*(Q+1))
 RGBColors <- col2rgb(colors()[1:length(colors())])
 HSVColors <- rgb2hsv( RGBColors[1,], RGBColors[2,], RGBColors[3,], maxColorValue=255)
 HueOrder <- order( HSVColors[1,], HSVColors[2,], HSVColors[3,] )
 uBeta <- unique(as.vector(Beta))
 ruBeta <- rank(uBeta)
 vect <- cbind(uBeta, ruBeta)
 l <- length(unique(ruBeta))
 plot(0, type="n", xlab="", ylab="", xlim=Xlim, ylim=Ylim, cex.lab=1.0, bty="n", axes=F)
for (p in 1:P){
for (q in 1:Q){
        k0 <- ruBeta[which(uBeta==Beta[p,q])]
k <- round(k0*212/l)
if(k==0){ k <-1}
rect(2*(P-p+1)-1,2*(Q-q+1)-1, 2*(P-p+1)+1, 2*(Q-q+1)+1, col=colors()[HueOrder[k]], border=NA, lwd=0)
}
}
}

Use a NC imaging scan as the background.

library(oro.nifti)


length(which(as.vector(mask)!=0))
N_tmplt <- as.vector(mask)
N_tmplt[which(as.vector(mask)!=0)] <- PET[100, ]
N_tmplt <- array(N_tmplt, dim=c(91, 109, 91))

Plot the brain heatmap on the horizontal plan at z=30.

RGBColors <- col2rgb(colors()[1:length(colors())])
HSVColors <- rgb2hsv(RGBColors[1,], RGBColors[2,], RGBColors[3,], maxColorValue=255)
HueOrder <- order( HSVColors[1,], HSVColors[2,], HSVColors[3,] )

#####################################################################
library(fields)

z_cod <- 30

ImgY <- N_tmplt[,,z_cod]
AtlasY <- Zscore[,,z_cod]


rn <- dim(AtlasY)[1]
cn <- dim(AtlasY)[2]

colorN <- 255 #ceiling(max(AtlasY))+1
ColorValue <- tim.colors(colorN)
#AtlasY_Vec <- ceiling(as.vector(AtlasY))


sAtlas <- list(x=c(1:cn), y=c(1:rn), z=t(AtlasY))

MyHeatMapADNI(ImgY)
Beta <- AtlasY
 P <-nrow(Beta)
 Q <-ncol(Beta)
for (p in 1:P){
for (q in 1:Q){
        k0 <- Beta[p,q]
if(k0!=0){
  k <- ceiling(k0*250/max(Beta))+1 #ceiling(k0)+1
          rect(2*(P-p+1)-1,2*(Q-q+1)-1, 2*(P-p+1)+1, 2*(Q-q+1)+1, col=ColorValue[k], border=NA)
}
}
}
text(2*P-30, 10, labels=paste("z=",z_cod,"mm",sep=""), col="white", font=2, cex=1)
sAtlas <- list(x=c((10*cn+1):(11*cn)), y=c(1:rn), z=t(AtlasY))
image.plot(sAtlas, add=TRUE, breaks=44)

heatPlot

Use similar codes to plot the ROI map projected on the horizontal plan at z=30.

ROI <- mask[,,z_cod]
AtlasY <- mask[,,z_cod]

rn <- dim(AtlasY)[1]
cn <- dim(AtlasY)[2]

colorN <- 45 #ceiling(max(AtlasY))+1
ColorValue <- tim.colors(colorN)
#AtlasY_Vec <- ceiling(as.vector(AtlasY))


sAtlas <- list(x=c(1:cn), y=c(1:rn), z=t(AtlasY))

MyHeatMapADNI(ImgY)
Beta <- AtlasY
 P <-nrow(Beta)
 Q <-ncol(Beta)
for (p in 1:P){
for (q in 1:Q){
        k0 <- Beta[p,q]
if(k0!=0){
  k <- ceiling(k0*43/max(Beta))+1 #ceiling(k0)+1
          rect(2*(P-p+1)-1,2*(Q-q+1)-1, 2*(P-p+1)+1, 2*(Q-q+1)+1, col=ColorValue[k], border=NA)
}
}
}
text(2*P-30, 10, labels=paste("z=",z_cod,"mm",sep=""), col="white", font=2, cex=1)
sAtlas <- list(x=c((10*cn+1):(11*cn)), y=c(1:rn), z=t(AtlasY))

image.plot(sAtlas, add=TRUE, col=ColorValue[-1], lab.breaks=names(table(ROI)))

ROImap

From the brain heapmap and ROI map, we can see that both left and right Cerebelum (ROIs 9301 & 9302) cortices are very significant in differentiating AD and NC. There are also differentiating hotspots in cortices Insula (ROIs 3001 & 3002) and Rectus(ROIs 2701 & 2702).

Region-level connectomic analysis for AD and NC

First, calculate the means within each ROI for AD and NC, respectively.

ROI_means_AD <- NULL
ROI_means_NC <- NULL

for(r in 1:116){ 

 ROI_mean_AD_v <- apply(PET[1:53, which(mask_non0==names(table(mask_non0))[r])], 1, function(x)(mean(x, na.rm=TRUE)))
 ROI_means_AD <- cbind(ROI_means_AD, ROI_mean_AD_v)
 
 ROI_mean_NC_v <- apply(PET[54:125, which(mask_non0==names(table(mask_non0))[r])], 1, function(x)(mean(x, na.rm=TRUE)))
 ROI_means_NC <- cbind(ROI_means_NC, ROI_mean_NC_v)

}

Plot the connection levelplots

connect_AD <- cor(ROI_means_AD)
connect_AD[which(abs(connect_AD)<0.8, arr.ind=TRUE)] <- 0
colnames(connect_AD) <- names(table(mask_non0))
rownames(connect_AD) <- names(table(mask_non0))

rgb.palette <- colorRampPalette(c("cyan", "orange"), space = "rgb")
levelplot(connect_AD, scales=list(x=list(cex=0.6, rot=90), y=list(cex=0.6)), xlab="", ylab="", col.regions=rgb.palette(120), main="AD")



connect_NC <- cor(ROI_means_NC)
connect_NC[which(abs(connect_NC)<0.8, arr.ind=TRUE)] <- 0
colnames(connect_NC) <- names(table(mask_non0))
rownames(connect_NC) <- names(table(mask_non0))

levelplot(connect_NC, scales=list(x=list(cex=0.6, rot=90), y=list(cex=0.6)), xlab="", ylab="", col.regions=rgb.palette(120), main="NC")

Plot the 3D brain connection plots

First, you need to install the R package brainconn. See here for more details about the package.

install.packages("remotes")
remotes::install_github("sidchop/brainconn")

Plot out the 3D brain connection plots.

library(brainconn)

mask_coord_roi_new1 <- read.table("mask_coord_roi_new1.txt", header =TRUE)

brainconn(atlas =mask_coord_roi_new1, conmat=abs(connect_AD), node.color = "brown", node.size = 1, edge.width = 1, edge.color.weighted = T, view="ortho")
brainconn(atlas =mask_coord_roi_new1, conmat=abs(connect_NC), node.color = "brown", node.size = 1, edge.width = 1, edge.color.weighted = T, view="ortho")

AD

NC

From the 3D plots or the level plots, we can see that there are more inter-region connections between Frontal_Sup and Frontal_Sup_Orb cortices in the brains of AD patients, compared to for NC people.

Plot the chord diagrams for AD and NC


# install.packages("circlize")
library(circlize)


mat <- abs(connect_AD)

chordDiagram(mat, annotationTrack = "grid", 
    preAllocateTracks = list(track.height = max(strwidth(unlist(dimnames(mat))))))
circos.track(track.index = 1, panel.fun = function(x, y) {
    circos.text(CELL_META$xcenter, CELL_META$ylim[1], CELL_META$sector.index, 
        facing = "clockwise", niceFacing = TRUE, adj = c(0, 0.5))
}, bg.border = NA) 


###

mat <- abs(connect_NC)

chordDiagram(mat, annotationTrack = "grid", 
    preAllocateTracks = list(track.height = max(strwidth(unlist(dimnames(mat))))))
circos.track(track.index = 1, panel.fun = function(x, y) {
    circos.text(CELL_META$xcenter, CELL_META$ylim[1], CELL_META$sector.index, 
        facing = "clockwise", niceFacing = TRUE, adj = c(0, 0.5))
}, bg.border = NA) 

AD

NC

PART II-1. Predictive ML/DL models for neuroimaging data analysis

We used ROI level imaging data for AD/NC classification/prediction.

dim(ROI_means_AD)
[1]  53 116

dim(ROI_means_NC)
[1]  72 116


y <- rep(0, 125)
y[1:53] <- 1

X.m <- rbind(ROI_means_AD, ROI_means_NC)
dim(X.m)
[1] 125 116

1. Lasso model

temp.lasso.cvfit <- cv.glmnet(x=X.m,y=y, family = "binomial", type.measure = "class", alpha = 1)
lamb <- temp.lasso.cvfit$lambda.min
temp.lasso <- glmnet(x=X.m,y=y, alpha = 1, lambda=lamb, family = "binomial")
pred.lasso <- predict(temp.lasso, s='lambda.min', newx=X.m, type="response")
pred.lasso01 <- predict(temp.lasso, s='lambda.min', newx=X.m, type="class")

library(caret)
confusionMatrix(as.factor(y), as.factor(pred.lasso01))

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 72  0
         1  0 53
                                     
               Accuracy : 1          
                 95% CI : (0.9709, 1)
    No Information Rate : 0.576      
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.000      
            Specificity : 1.000      
         Pos Pred Value : 1.000      
         Neg Pred Value : 1.000      
             Prevalence : 0.576      
         Detection Rate : 0.576      
   Detection Prevalence : 0.576      
      Balanced Accuracy : 1.000      
                                     
       'Positive' Class : 0          

Look at which ROI’s are selected as the predictive ROI for AD/NC classification.

> beta_hat <- as.vector(temp.lasso$beta[,1])
> 
> names(beta_hat) <- paste("ROI", c(1:116), sep="")
> 
> beta_hat
       ROI1        ROI2        ROI3        ROI4        ROI5        ROI6 
  0.0000000   0.0000000  11.2834825   0.0000000   0.0000000   0.0000000 
       ROI7        ROI8        ROI9       ROI10       ROI11       ROI12 
  0.0000000   4.0562152   0.0000000   3.5390322   0.0000000   5.0257354 
      ROI13       ROI14       ROI15       ROI16       ROI17       ROI18 
  0.0000000   1.4599871   0.0000000  16.3521690   0.0000000   1.4090088 
      ROI19       ROI20       ROI21       ROI22       ROI23       ROI24 
  0.0000000   0.0000000   1.8918206   8.2886784   0.0000000   0.0000000 
      ROI25       ROI26       ROI27       ROI28       ROI29       ROI30 
  0.0000000   0.0000000   0.0000000 -23.0549599 -17.6783982   0.0000000 
      ROI31       ROI32       ROI33       ROI34       ROI35       ROI36 
  0.0000000   0.0000000   0.0000000  -1.7228266   0.0000000  -9.1364817 
      ROI37       ROI38       ROI39       ROI40       ROI41       ROI42 
 -9.7847000   0.0000000 -32.9475738   0.0000000   0.0000000   0.0000000 
      ROI43       ROI44       ROI45       ROI46       ROI47       ROI48 
  0.0000000   0.0000000   0.0000000   0.0000000   0.0000000  16.6703849 
      ROI49       ROI50       ROI51       ROI52       ROI53       ROI54 
 -0.5371030   0.0000000   0.0000000   4.6004862   0.0000000   0.0000000 
      ROI55       ROI56       ROI57       ROI58       ROI59       ROI60 
  0.0000000   0.0000000   0.1106223  27.6758996   0.0000000   8.8117034 
      ROI61       ROI62       ROI63       ROI64       ROI65       ROI66 
-16.7934707  -1.0258150   8.5639771  -3.7309900   0.0000000 -23.0580959 
      ROI67       ROI68       ROI69       ROI70       ROI71       ROI72 
  0.0000000  -9.3551591   7.1754490   0.0000000   0.0000000   0.0000000 
      ROI73       ROI74       ROI75       ROI76       ROI77       ROI78 
  7.3113457   0.0000000   6.5755033   0.2252385   0.0000000   0.6747557 
      ROI79       ROI80       ROI81       ROI82       ROI83       ROI84 
  0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000 
      ROI85       ROI86       ROI87       ROI88       ROI89       ROI90 
  0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000 
      ROI91       ROI92       ROI93       ROI94       ROI95       ROI96 
  3.9549139   3.0406333   0.0000000   1.1476977   0.0000000   0.0000000 
      ROI97       ROI98       ROI99      ROI100      ROI101      ROI102 
  0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000 
     ROI103      ROI104      ROI105      ROI106      ROI107      ROI108 
  0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000 
     ROI109      ROI110      ROI111      ROI112      ROI113      ROI114 
  2.6401216  23.3079212   0.0000000  -7.0365297   0.0000000   0.0000000 
     ROI115      ROI116 
  0.0000000   0.8033363 
> 
> 
> which(abs(beta_hat)!=0)
  ROI3   ROI8  ROI10  ROI12  ROI14  ROI16  ROI18  ROI21  ROI22  ROI28  ROI29 
     3      8     10     12     14     16     18     21     22     28     29 
 ROI34  ROI36  ROI37  ROI39  ROI48  ROI49  ROI52  ROI57  ROI58  ROI60  ROI61 
    34     36     37     39     48     49     52     57     58     60     61 
 ROI62  ROI63  ROI64  ROI66  ROI68  ROI69  ROI73  ROI75  ROI76  ROI78  ROI91 
    62     63     64     66     68     69     73     75     76     78     91 
 ROI92  ROI94 ROI109 ROI110 ROI112 ROI116 
    92     94    109    110    112    116 

Find the important ROIs names from the dictionary.

> region_code
     V1   V2                   V3
1     1 2001         Precentral_L
2     2 2002         Precentral_R
3     3 2101        Frontal_Sup_L
4     4 2102        Frontal_Sup_R
5     5 2111    Frontal_Sup_Orb_L
6     6 2112    Frontal_Sup_Orb_R
7     7 2201        Frontal_Mid_L
8     8 2202        Frontal_Mid_R
9     9 2211    Frontal_Mid_Orb_L
10   10 2212    Frontal_Mid_Orb_R
11   11 2301   Frontal_Inf_Oper_L
12   12 2302   Frontal_Inf_Oper_R

2. KNN

library(caret)
library(pROC)
ctrl <- trainControl(method="repeatedcv",repeats = 10) 
model_knn <- train(as.data.frame(X.m), unlist(y), method = "knn",
      preProcess = c("center","scale"), trControl = ctrl,
      tuneLength = 20)

pred.knn <- as.numeric(predict(model_knn, X.m)>0.5)
confusionMatrix(factor(pred.knn), factor(y))

#Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 67 26
         1  5 27
                                          
               Accuracy : 0.752           
                 95% CI : (0.6668, 0.8249)
    No Information Rate : 0.576           
    P-Value [Acc > NIR] : 3.13e-05        
                                          
                  Kappa : 0.4643          
                                          
 Mcnemar's Test P-Value : 0.000328        
                                          
            Sensitivity : 0.9306          
            Specificity : 0.5094          
         Pos Pred Value : 0.7204          
         Neg Pred Value : 0.8438          
             Prevalence : 0.5760          
         Detection Rate : 0.5360          
   Detection Prevalence : 0.7440          
      Balanced Accuracy : 0.7200          
                                          
       'Positive' Class : 0               

3. Linear discriminant analysis (LDA)


library(MASS)
model.lda <- lda(y ~ X.m)
pred.lda <- predict(model.lda, as.data.frame(X.m), type="classification")
confusionMatrix(pred.lda$class, factor(y))

#Confusion Matrix and Statistics
#
#          Reference
#Prediction  0  1
#         0 72  0
#         1  0 53
#                                     
#               Accuracy : 1          
#                 95% CI : (0.9709, 1)
#    No Information Rate : 0.576      
#    P-Value [Acc > NIR] : < 2.2e-16  
#                                     
#                  Kappa : 1          
#                                     
# Mcnemar's Test P-Value : NA         
#                                     
#            Sensitivity : 1.000      
#            Specificity : 1.000      
#         Pos Pred Value : 1.000      
#         Neg Pred Value : 1.000      
#             Prevalence : 0.576      
#         Detection Rate : 0.576      
#   Detection Prevalence : 0.576      

4. XGboost


library(xgboost)
label_tr_xgb = as.numeric(unlist(y))
xgb_train <- xgb.DMatrix(data=as.matrix(X.m, nrow=nrow(X.m)),
                         label=label_tr_xgb)
model_xgb <- xgboost(data= xgb_train, nrounds=10,
                     objective = "binary:logistic", class=2,
                     nfold=10, set.seed=2022)
xgbpred<- predict(model_xgb, xgb_train)

confusionMatrix(as.factor(as.numeric(xgbpred>0.5)), factor(y))

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 72  0
         1  0 53
                                     
               Accuracy : 1          
                 95% CI : (0.9709, 1)
    No Information Rate : 0.576      
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.000      
            Specificity : 1.000      
         Pos Pred Value : 1.000      
         Neg Pred Value : 1.000      
             Prevalence : 0.576      
         Detection Rate : 0.576      
   Detection Prevalence : 0.576      
      Balanced Accuracy : 1.000      
                                     
       'Positive' Class : 0          

5. Neural network

#install.packages(c('neuralnet','keras','tensorflow'),dependencies = T)

library(tidyverse)
library(neuralnet)

colnames(X.m) <- paste("ROI", c(1:116), sep="")
Dat <- as.data.frame(cbind(y, X.m))

model = neuralnet(
    y~ .,
data=Dat,
hidden=c(4,2),
linear.output = FALSE
)


nnpred <- predict(model, X.m)

confusionMatrix(as.factor(as.numeric(nnpred>0.5)), factor(y))


Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 70  2
         1  2 51
                                          
               Accuracy : 0.968           
                 95% CI : (0.9201, 0.9912)
    No Information Rate : 0.576           
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.9345          
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.9722          
            Specificity : 0.9623          
         Pos Pred Value : 0.9722          
         Neg Pred Value : 0.9623          
             Prevalence : 0.5760          
         Detection Rate : 0.5600          
   Detection Prevalence : 0.5760          
      Balanced Accuracy : 0.9672          
                                          
       'Positive' Class : 0               

PART II-2. Predictive model validation

1. Model validation via independent datasets

Need to collect independent test dataset when collecting the traning data. Or get the test data from other data cohorts.

2. Cross validation

A self-implemented R function for cross validation for Lasso model.



Lasso.cv <-
function(X, Y, fold=5, seed=1, lamb)
{


  n=nrow(X)
    
 ############################### set seed and generate cross validation seperations
 set.seed(seed)

 
 index.cv<- vector(mode='list', length=fold)
 f.n=floor(n/fold)
 
 for(k in 1:2){
 
    index_set_k <- which(Y==(k-1))
    ran.order_set_k =sample(index_set_k, length(index_set_k), replace=F)
    
    if( length(index_set_k) < fold){
    	err <- sprintf(paste("There are less than ", fold, "samples in class ", k, sep=""))
    }
    f.nk <- floor(length(index_set_k)/fold) 
    
    for (f in 1:(fold-1)){  
     index.cv[[f]]<-c(index.cv[[f]], ran.order_set_k[(f-1)*f.nk+1:f.nk])
    } 
    index.cv[[fold]]<-c(index.cv[[fold]], ran.order_set_k[((fold-1)*f.nk+1):length(index_set_k)])
 
 }
 
 
 ################################ begin cross validation
 err.cv <- NULL
 err.tot <- 0
 
 Y.test.arr <- NULL
 Y.pred.arr <- NULL
 

 for(f in 1:fold)
  {  print(paste("fold=",f))   
     index.cur.cv<-index.cv[[f]]   
     X.m<-X[-(index.cur.cv),]   
     Y.m<-Y[-(index.cur.cv)]   
     X.t<-X[index.cur.cv,]   
     Y.t<-Y[index.cur.cv] 
     
     Y.test.arr[[f]] <- Y.t
     
     mod.train  <- glmnet(x=X.m, y=Y.m, alpha = 1, lambda=lamb, family = "binomial")
     pred.y.t <- as.numeric(predict(mod.train, s=lamb, newx=X.t, type="class"))
     
      
     err.cv.c <- length(which(pred.y.t != Y.t)) 
     err.cv<- c(err.cv, err.cv.c)
     err.tot <- err.tot + err.cv.c
       
     Y.pred.arr[[f]] <- pred.y.t
    
   }###end fold loop
   
   result=list(err.tot=err.tot, err.cv=err.cv, Y.pred.arr=Y.pred.arr, Y.test.arr=Y.test.arr) 
   
   return(result)
}
cv.tmp <- Lasso.cv(X.m, y, lamb=0.0035)

[1] "fold= 1"
[1] "fold= 2"
[1] "fold= 3"
[1] "fold= 4"
[1] "fold= 5"
> 
> 
> cv.tmp
$err.tot
[1] 32

$err.cv
[1] 9 7 4 6 6

$Y.pred.arr
$Y.pred.arr[[1]]
 [1] 0 1 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1 1 1 1 1 0

$Y.pred.arr[[2]]
 [1] 0 0 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 0 1 0 1 0 1 1

$Y.pred.arr[[3]]
 [1] 1 0 1 0 0 0 0 0 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1

$Y.pred.arr[[4]]
 [1] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 1 0

$Y.pred.arr[[5]]
 [1] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 1 0 1 1 0 0 0 1 1


$Y.test.arr
$Y.test.arr[[1]]
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

$Y.test.arr[[2]]
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

$Y.test.arr[[3]]
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

$Y.test.arr[[4]]
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

$Y.test.arr[[5]]
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1

Notice that the cross validation error is 32 out of 125. Remember the prediction error on the training data was 0 (overfitting).