Demonstration of replicating results of a paper using publicly available neuroimaging data.
- Data ingestion
- Data visualization

Demonstration of analyzing neuroimaging data using ML/DL methods.
- Model building and model selection
- Model validation
First, we use R to replicate some results and figures from an article published in Nature Protocols (NP).

Figures to replicate include:
First, we will replicate a heatmap of voxel-level neuroimaging features that differentiate Alzheimer’s disease (AD) patients from normal control (NC) subjects, like the one below.

Then, we will explore the connection structures between brain regions of interest (ROIs) of AD and NC subjects respectively, try to identify the differences between the two connectomes, and replicate similar figures from the NP paper, as shown below.

Data to use
We are going to use Alzheimer’s Disease Neuroimaging Initiative (ADNI) positron emission tomography (PET) imaging scans from 53 AD patients and 72 NC subjects. Each imaging scan is a 3D array of dimension 91 x 109 x 91 in the dataset. Each brain scan contains 185,405 effective template voxels, grouped in 116 ROIs. The ROI numbers and corresponding brain region names can be found here.
PART I-1. Ingesting neuroimaging data
Neuroconductor is an open-source platform for analyzing neuroimaging (and other) data. It hosts multiple packages for analyzing data with different software. We will mostly use its base library, neurobase. Run the following to install the neurobase package:
source("https://neuroconductor.org/neurocLite.R")
neuro_install("neurobase", release = "stable")
Loading and visualizing neuroimaging data
The function readnii reads in a time series of brain imaging scans for one subject – sub-10159 – in file “sub-10159_task-rest_bold_space-MNI152NLin2009cAsym_preproc.nii.gz” as a 4-D array.
library(neurobase)
data <- readnii("sub-10159_task-rest_bold_space-MNI152NLin2009cAsym_preproc.nii.gz")
dim(data)
Each brain imaging scan is a 3-D array of dimension 65 x 77 x 49, corresponding to the coronal, sagittal, and horizontal planes of the brain, and the series contains 152 scans.
We can use the check_nifti function to check the details of the scans. This information is stored in the “header” of each imaging file. For example, the Pixel Dimension: 3 x 3 x 4 x 2 tells us that each voxel is 3 x 3 x 4 mm in space; the last number is the dimension in time: a scan is taken every 2 seconds.
check_nifti(data)
To visualize the NIfTI images, we can use the orthographic function. It shows the 3D scan at a single time point; R will show the first time point by default.
orthographic(data)
The argument xyz allows us to specify where to draw the cross-lines.
orthographic(data,xyz=c(20,20,30))
Brain regions of interest
One branch of research in neuroimaging studies investigates the functions of different regions of interest (ROIs) of the brain and the connections between them (so-called brain connectome studies). We next do a simple analysis along these lines by looking at the correlations between the average signals in those brain parcels.
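Before turning to the real data, here is a minimal base-R sketch of the idea on randomly generated toy data (all names and dimensions are illustrative): average the voxel signals within each parcel at every time point, then correlate the parcel-average time series.

```r
# Toy example: 10 time points, 4 voxels grouped into 2 ROIs
set.seed(1)
signal <- matrix(rnorm(10 * 4), nrow = 10, ncol = 4)  # time x voxel
roi_label <- c(1, 1, 2, 2)                            # voxel-to-ROI assignment

# Average the voxel signals within each ROI at every time point
roi_means <- sapply(sort(unique(roi_label)),
                    function(r) rowMeans(signal[, roi_label == r, drop = FALSE]))

# Correlation between the ROI-average time series ("connectivity")
conn <- cor(roi_means)
```

The resulting conn is a symmetric 2 x 2 matrix with a unit diagonal; with 116 ROIs the same recipe yields the 116 x 116 connectivity matrices used later.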
First, load the necessary libraries. (You will need to install them first! Please google the installation instructions for each of these packages.)
library(RColorBrewer)
library(scales)
library(lattice)
library(Rcmdr)
Let us then load the anatomical image for subject sub-10159 and plot it using the ortho2 function.
sub10159_anat <- readnii("sub-10159_T1w_space-MNI152NLin2009cAsym_preproc.nii.gz")
ortho2(sub10159_anat)
Next, we read in the atlas that separates the brain into 102 ROIs.
atlas <- readnii("mni_icbm152_CerebrA_tal_nlin_sym_09c.nii")
dim(atlas)
table(as.vector(atlas))
The table produced by the last line tells us how many voxels there are in each ROI; “0” marks voxels outside the brain template.

One can overlay the 3D brain image with the ROI template.
rf <- colorRampPalette(rev(brewer.pal(102,'Spectral')))
cols <- rf(102)
ortho2(sub10159_anat,atlas,col.y=alpha(sample(cols),0.5))
One can extract and plot a certain brain region.
broca <- ifelse(atlas==28,1,0)
ortho2(sub10159_anat,broca,xyz=c(150,100,60))
PART I-2. Data visualization and NP paper figure replication
Load in a reformatted ADNI dataset
load("ADNIdata.RData")
ls()
[1] "info" "mask" "PET"
info contains the clinical information for each subject.
mask is the atlas for the template of registered voxels for the brain and the ROIs.
PET is the imaging data, stored as a 125 x 185405 matrix.
info[1:5,]
Subject.ID Research.Group DX.Group visit.Time NPISCORE FAQ.Total.Score
4 003_S_1059 Patient AD Baseline 2 23
13 005_S_0221 Patient AD Baseline 13 22
34 006_S_0547 Patient AD Baseline 8 21
55 007_S_0316 Patient AD Baseline 3 3
67 007_S_1339 Patient AD Baseline 4 13
MMSE.Total.Score CDR.Total.Score GDS.Total.Score Subject.Sex Subject.Age
4 NA NA NA F 84.72
13 NA NA NA M 67.57
34 NA NA NA M 75.97
55 NA NA NA M 81.01
67 NA NA NA F 79.61
Age.Qualifier Subject.WeightKg Subject.Sex.Num
4 Y 76.66 0
13 Y 112.94 1
34 Y 94.35 1
55 Y 76.02 1
67 Y 74.39 0
table(info[,3])
AD Normal
53 72
dim(mask)
[1] 91 109 91
mask_non0 <- as.vector(mask)[which(as.vector(mask)!=0)]
table(mask_non0)
mask_non0
2001 2002 2101 2102 2111 2112 2201 2202 2211 2212 2301 2302 2311 2312 2321 2322
3526 3381 3599 4056 963 997 4863 5104 888 1015 1038 1399 2529 2151 1690 1707
2331 2332 2401 2402 2501 2502 2601 2602 2611 2612 2701 2702 3001 3002 4001 4002
990 1331 2147 2371 280 289 2992 2134 719 856 852 745 1858 1770 1400 1313
4011 4012 4021 4022 4101 4102 4111 4112 4201 4202 5001 5002 5011 5012 5021 5022
1941 2203 463 335 932 946 978 1132 220 248 2258 1861 1526 1424 2095 2300
5101 5102 5201 5202 5301 5302 5401 5402 6001 6002 6101 6102 6201 6202 6211 6212
1366 1413 3270 2098 941 989 2310 2518 3892 3823 2065 2222 2447 1345 1256 1974
6221 6222 6301 6302 6401 6402 7001 7002 7011 7012 7021 7022 7101 7102 8101 8102
1173 1752 3528 3265 1349 836 962 994 1009 1064 293 280 1100 1057 225 249
8111 8112 8121 8122 8201 8202 8211 8212 8301 8302 9001 9002 9011 9012 9021 9022
2296 3141 1285 1338 4942 4409 755 1187 3200 3557 2603 2648 1894 2117 136 207
9031 9032 9041 9042 9051 9052 9061 9062 9071 9072 9081 9082 9100 9110 9120 9130
1125 861 1694 1795 585 534 1887 2308 869 809 144 159 53 228 665 371
9140 9150 9160 9170
194 243 174 112
length(table(mask_non0))
[1] 116
dim(PET)
[1] 125 185405
Heatmap of hotspots in the brain that differentiate AD and NC
Calculate the Zscore for comparing the means between AD and NC for each voxel.
Zscore.v <- apply(PET, 2, function(x)((mean(x[1:53])-mean(x[54:125]))/sd(x)))
Zscore <- as.vector(mask)
Zscore[which(as.vector(mask)!=0)] <- Zscore.v
Zscore <- array(Zscore, dim=c(91, 109, 91))
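The mask round trip above (flatten the mask, fill the in-template voxels, reshape back to 3-D) can be checked on a toy example. The sketch below uses hypothetical dimensions and simulated data; only the pattern matches the real code.

```r
# Toy 2 x 2 x 2 "brain": mask is nonzero inside the template, 0 outside
mask_toy <- array(c(0, 1, 1, 0, 1, 0, 0, 1), dim = c(2, 2, 2))
inside <- which(as.vector(mask_toy) != 0)   # 4 in-template voxels

# Two-sample Z-score per voxel, as in Zscore.v above:
# (mean of group 1 - mean of group 2) / sd over all subjects
x <- matrix(rnorm(6 * 4), nrow = 6)         # 6 subjects x 4 voxels
z <- apply(x, 2, function(v) (mean(v[1:3]) - mean(v[4:6])) / sd(v))

# Embed the voxel-level statistics back into the 3-D volume
vol <- as.vector(mask_toy)
vol[inside] <- z
vol <- array(vol, dim = dim(mask_toy))
```

Voxels outside the template keep the mask value 0, so the reconstructed volume only carries statistics where there is brain tissue.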
Below is an R function I wrote for plotting brain heatmaps.
MyHeatMapADNI <- function(Beta){
  P <- nrow(Beta)
  Q <- ncol(Beta)
  Xlim <- c(0, 2*(P+1))
  Ylim <- c(0, 2*(Q+1))
  RGBColors <- col2rgb(colors()[1:length(colors())])
  HSVColors <- rgb2hsv(RGBColors[1,], RGBColors[2,], RGBColors[3,], maxColorValue=255)
  HueOrder <- order(HSVColors[1,], HSVColors[2,], HSVColors[3,])
  uBeta <- unique(as.vector(Beta))
  ruBeta <- rank(uBeta)
  vect <- cbind(uBeta, ruBeta)
  l <- length(unique(ruBeta))
  plot(0, type="n", xlab="", ylab="", xlim=Xlim, ylim=Ylim, cex.lab=1.0, bty="n", axes=F)
  for (p in 1:P){
    for (q in 1:Q){
      k0 <- ruBeta[which(uBeta==Beta[p,q])]
      k <- round(k0*212/l)
      if(k==0){ k <- 1 }
      rect(2*(P-p+1)-1, 2*(Q-q+1)-1, 2*(P-p+1)+1, 2*(Q-q+1)+1, col=colors()[HueOrder[k]], border=NA, lwd=0)
    }
  }
}
Use an NC imaging scan as the background.
library(oro.nifti)
length(which(as.vector(mask)!=0))
N_tmplt <- as.vector(mask)
N_tmplt[which(as.vector(mask)!=0)] <- PET[100, ]
N_tmplt <- array(N_tmplt, dim=c(91, 109, 91))
Plot the brain heatmap on the horizontal plane at z=30.
RGBColors <- col2rgb(colors()[1:length(colors())])
HSVColors <- rgb2hsv(RGBColors[1,], RGBColors[2,], RGBColors[3,], maxColorValue=255)
HueOrder <- order( HSVColors[1,], HSVColors[2,], HSVColors[3,] )
#####################################################################
library(fields)
z_cod <- 30
ImgY <- N_tmplt[,,z_cod]
AtlasY <- Zscore[,,z_cod]
rn <- dim(AtlasY)[1]
cn <- dim(AtlasY)[2]
colorN <- 255 #ceiling(max(AtlasY))+1
ColorValue <- tim.colors(colorN)
#AtlasY_Vec <- ceiling(as.vector(AtlasY))
sAtlas <- list(x=c(1:cn), y=c(1:rn), z=t(AtlasY))
MyHeatMapADNI(ImgY)
Beta <- AtlasY
P <-nrow(Beta)
Q <-ncol(Beta)
for (p in 1:P){
  for (q in 1:Q){
    k0 <- Beta[p,q]
    if(k0!=0){
      k <- ceiling(k0*250/max(Beta))+1 #ceiling(k0)+1
      rect(2*(P-p+1)-1, 2*(Q-q+1)-1, 2*(P-p+1)+1, 2*(Q-q+1)+1, col=ColorValue[k], border=NA)
    }
  }
}
text(2*P-30, 10, labels=paste("z=",z_cod,"mm",sep=""), col="white", font=2, cex=1)
sAtlas <- list(x=c((10*cn+1):(11*cn)), y=c(1:rn), z=t(AtlasY))
image.plot(sAtlas, add=TRUE, breaks=44)

Use similar code to plot the ROI map projected on the horizontal plane at z=30.
ROI <- mask[,,z_cod]
AtlasY <- mask[,,z_cod]
rn <- dim(AtlasY)[1]
cn <- dim(AtlasY)[2]
colorN <- 45 #ceiling(max(AtlasY))+1
ColorValue <- tim.colors(colorN)
#AtlasY_Vec <- ceiling(as.vector(AtlasY))
sAtlas <- list(x=c(1:cn), y=c(1:rn), z=t(AtlasY))
MyHeatMapADNI(ImgY)
Beta <- AtlasY
P <-nrow(Beta)
Q <-ncol(Beta)
for (p in 1:P){
  for (q in 1:Q){
    k0 <- Beta[p,q]
    if(k0!=0){
      k <- ceiling(k0*43/max(Beta))+1 #ceiling(k0)+1
      rect(2*(P-p+1)-1, 2*(Q-q+1)-1, 2*(P-p+1)+1, 2*(Q-q+1)+1, col=ColorValue[k], border=NA)
    }
  }
}
text(2*P-30, 10, labels=paste("z=",z_cod,"mm",sep=""), col="white", font=2, cex=1)
sAtlas <- list(x=c((10*cn+1):(11*cn)), y=c(1:rn), z=t(AtlasY))
image.plot(sAtlas, add=TRUE, col=ColorValue[-1], lab.breaks=names(table(ROI)))

From the brain heatmap and ROI map, we can see that both the left and right Cerebelum cortices (ROIs 9301 & 9302) are highly significant in differentiating AD and NC. There are also differentiating hotspots in the Insula (ROIs 3001 & 3002) and Rectus (ROIs 2701 & 2702) cortices.
Region-level connectomic analysis for AD and NC
First, calculate the means within each ROI for AD and NC, respectively.
ROI_means_AD <- NULL
ROI_means_NC <- NULL
for(r in 1:116){
  ROI_mean_AD_v <- apply(PET[1:53, which(mask_non0==names(table(mask_non0))[r])], 1, function(x)(mean(x, na.rm=TRUE)))
  ROI_means_AD <- cbind(ROI_means_AD, ROI_mean_AD_v)
  ROI_mean_NC_v <- apply(PET[54:125, which(mask_non0==names(table(mask_non0))[r])], 1, function(x)(mean(x, na.rm=TRUE)))
  ROI_means_NC <- cbind(ROI_means_NC, ROI_mean_NC_v)
}
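The loop above reduces each subject from 185,405 voxels to 116 ROI means. Here is a toy base-R version of the same reduction (hypothetical data and ROI codes), which may help make the indexing clear:

```r
# Toy version of the reduction above: 5 subjects x 6 voxels, 2 ROIs
set.seed(2)
pet_toy <- matrix(rnorm(5 * 6), nrow = 5)
roi_of_voxel <- c(10, 10, 10, 20, 20, 20)   # hypothetical ROI codes per voxel

roi_means <- NULL
for (r in names(table(roi_of_voxel))) {
  # average the voxels belonging to ROI r, separately for each subject (row)
  v <- rowMeans(pet_toy[, roi_of_voxel == as.numeric(r), drop = FALSE])
  roi_means <- cbind(roi_means, v)
}
colnames(roi_means) <- names(table(roi_of_voxel))
```

The result is a subjects x ROIs matrix (5 x 2 here, 125 x 116 for the real data).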
Plot the connection levelplots
connect_AD <- cor(ROI_means_AD)
connect_AD[which(abs(connect_AD)<0.8, arr.ind=TRUE)] <- 0
colnames(connect_AD) <- names(table(mask_non0))
rownames(connect_AD) <- names(table(mask_non0))
rgb.palette <- colorRampPalette(c("cyan", "orange"), space = "rgb")
levelplot(connect_AD, scales=list(x=list(cex=0.6, rot=90), y=list(cex=0.6)), xlab="", ylab="", col.regions=rgb.palette(120), main="AD")
connect_NC <- cor(ROI_means_NC)
connect_NC[which(abs(connect_NC)<0.8, arr.ind=TRUE)] <- 0
colnames(connect_NC) <- names(table(mask_non0))
rownames(connect_NC) <- names(table(mask_non0))
levelplot(connect_NC, scales=list(x=list(cex=0.6, rot=90), y=list(cex=0.6)), xlab="", ylab="", col.regions=rgb.palette(120), main="NC")
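The key step before plotting is the hard threshold: correlations with |r| < 0.8 are zeroed so that only strong connections remain. A tiny base-R illustration of that step, on a made-up 3 x 3 correlation matrix:

```r
# Thresholding a correlation matrix: zero out weak connections (|r| < 0.8)
cm <- matrix(c(1,     0.9,  0.3,
               0.9,   1,   -0.85,
               0.3,  -0.85, 1), nrow = 3, byrow = TRUE)
cm_thr <- cm
cm_thr[abs(cm_thr) < 0.8] <- 0
cm_thr[1, 3]  # 0.3 -> 0 (weak connection removed)
cm_thr[2, 3]  # -0.85 survives (strong negative connection)
```

The cutoff 0.8 is a tuning choice; a lower value keeps more (and weaker) edges in the levelplots and connectome plots.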


Plot the 3D brain connection plots
First, you need to install the R package brainconn. See here for more details about the package.
install.packages("remotes")
remotes::install_github("sidchop/brainconn")
Plot out the 3D brain connection plots.
library(brainconn)
mask_coord_roi_new1 <- read.table("mask_coord_roi_new1.txt", header =TRUE)
brainconn(atlas =mask_coord_roi_new1, conmat=abs(connect_AD), node.color = "brown", node.size = 1, edge.width = 1, edge.color.weighted = T, view="ortho")
brainconn(atlas =mask_coord_roi_new1, conmat=abs(connect_NC), node.color = "brown", node.size = 1, edge.width = 1, edge.color.weighted = T, view="ortho")
From the 3D plots or the level plots, we can see that there are more inter-region connections between the Frontal_Sup and Frontal_Sup_Orb cortices in the brains of AD patients than in those of NC subjects.
Plot the chord diagrams for AD and NC
# install.packages("circlize")
library(circlize)
mat <- abs(connect_AD)
chordDiagram(mat, annotationTrack = "grid",
             preAllocateTracks = list(track.height = max(strwidth(unlist(dimnames(mat))))))
circos.track(track.index = 1, panel.fun = function(x, y) {
  circos.text(CELL_META$xcenter, CELL_META$ylim[1], CELL_META$sector.index,
              facing = "clockwise", niceFacing = TRUE, adj = c(0, 0.5))
}, bg.border = NA)
###
mat <- abs(connect_NC)
chordDiagram(mat, annotationTrack = "grid",
             preAllocateTracks = list(track.height = max(strwidth(unlist(dimnames(mat))))))
circos.track(track.index = 1, panel.fun = function(x, y) {
  circos.text(CELL_META$xcenter, CELL_META$ylim[1], CELL_META$sector.index,
              facing = "clockwise", niceFacing = TRUE, adj = c(0, 0.5))
}, bg.border = NA)
PART II-1. Predictive ML/DL models for neuroimaging data analysis
We use ROI-level imaging data for AD/NC classification/prediction.
dim(ROI_means_AD)
[1] 53 116
dim(ROI_means_NC)
[1] 72 116
y <- rep(0, 125)
y[1:53] <- 1
X.m <- rbind(ROI_means_AD, ROI_means_NC)
dim(X.m)
[1] 125 116
1. Lasso model
library(glmnet)
temp.lasso.cvfit <- cv.glmnet(x=X.m, y=y, family = "binomial", type.measure = "class", alpha = 1)
lamb <- temp.lasso.cvfit$lambda.min
temp.lasso <- glmnet(x=X.m,y=y, alpha = 1, lambda=lamb, family = "binomial")
pred.lasso <- predict(temp.lasso, s='lambda.min', newx=X.m, type="response")
pred.lasso01 <- predict(temp.lasso, s='lambda.min', newx=X.m, type="class")
library(caret)
confusionMatrix(as.factor(pred.lasso01), as.factor(y))  # caret expects (prediction, reference)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 72 0
1 0 53
Accuracy : 1
95% CI : (0.9709, 1)
No Information Rate : 0.576
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.000
Specificity : 1.000
Pos Pred Value : 1.000
Neg Pred Value : 1.000
Prevalence : 0.576
Detection Rate : 0.576
Detection Prevalence : 0.576
Balanced Accuracy : 1.000
'Positive' Class : 0
Look at which ROIs are selected as predictive of AD/NC status.
beta_hat <- as.vector(temp.lasso$beta[,1])
names(beta_hat) <- paste("ROI", c(1:116), sep="")
beta_hat
ROI1 ROI2 ROI3 ROI4 ROI5 ROI6
0.0000000 0.0000000 11.2834825 0.0000000 0.0000000 0.0000000
ROI7 ROI8 ROI9 ROI10 ROI11 ROI12
0.0000000 4.0562152 0.0000000 3.5390322 0.0000000 5.0257354
ROI13 ROI14 ROI15 ROI16 ROI17 ROI18
0.0000000 1.4599871 0.0000000 16.3521690 0.0000000 1.4090088
ROI19 ROI20 ROI21 ROI22 ROI23 ROI24
0.0000000 0.0000000 1.8918206 8.2886784 0.0000000 0.0000000
ROI25 ROI26 ROI27 ROI28 ROI29 ROI30
0.0000000 0.0000000 0.0000000 -23.0549599 -17.6783982 0.0000000
ROI31 ROI32 ROI33 ROI34 ROI35 ROI36
0.0000000 0.0000000 0.0000000 -1.7228266 0.0000000 -9.1364817
ROI37 ROI38 ROI39 ROI40 ROI41 ROI42
-9.7847000 0.0000000 -32.9475738 0.0000000 0.0000000 0.0000000
ROI43 ROI44 ROI45 ROI46 ROI47 ROI48
0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 16.6703849
ROI49 ROI50 ROI51 ROI52 ROI53 ROI54
-0.5371030 0.0000000 0.0000000 4.6004862 0.0000000 0.0000000
ROI55 ROI56 ROI57 ROI58 ROI59 ROI60
0.0000000 0.0000000 0.1106223 27.6758996 0.0000000 8.8117034
ROI61 ROI62 ROI63 ROI64 ROI65 ROI66
-16.7934707 -1.0258150 8.5639771 -3.7309900 0.0000000 -23.0580959
ROI67 ROI68 ROI69 ROI70 ROI71 ROI72
0.0000000 -9.3551591 7.1754490 0.0000000 0.0000000 0.0000000
ROI73 ROI74 ROI75 ROI76 ROI77 ROI78
7.3113457 0.0000000 6.5755033 0.2252385 0.0000000 0.6747557
ROI79 ROI80 ROI81 ROI82 ROI83 ROI84
0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
ROI85 ROI86 ROI87 ROI88 ROI89 ROI90
0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
ROI91 ROI92 ROI93 ROI94 ROI95 ROI96
3.9549139 3.0406333 0.0000000 1.1476977 0.0000000 0.0000000
ROI97 ROI98 ROI99 ROI100 ROI101 ROI102
0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
ROI103 ROI104 ROI105 ROI106 ROI107 ROI108
0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
ROI109 ROI110 ROI111 ROI112 ROI113 ROI114
2.6401216 23.3079212 0.0000000 -7.0365297 0.0000000 0.0000000
ROI115 ROI116
0.0000000 0.8033363
which(abs(beta_hat)!=0)
ROI3 ROI8 ROI10 ROI12 ROI14 ROI16 ROI18 ROI21 ROI22 ROI28 ROI29
3 8 10 12 14 16 18 21 22 28 29
ROI34 ROI36 ROI37 ROI39 ROI48 ROI49 ROI52 ROI57 ROI58 ROI60 ROI61
34 36 37 39 48 49 52 57 58 60 61
ROI62 ROI63 ROI64 ROI66 ROI68 ROI69 ROI73 ROI75 ROI76 ROI78 ROI91
62 63 64 66 68 69 73 75 76 78 91
ROI92 ROI94 ROI109 ROI110 ROI112 ROI116
92 94 109 110 112 116
Find the names of the important ROIs in the dictionary.
region_code
V1 V2 V3
1 1 2001 Precentral_L
2 2 2002 Precentral_R
3 3 2101 Frontal_Sup_L
4 4 2102 Frontal_Sup_R
5 5 2111 Frontal_Sup_Orb_L
6 6 2112 Frontal_Sup_Orb_R
7 7 2201 Frontal_Mid_L
8 8 2202 Frontal_Mid_R
9 9 2211 Frontal_Mid_Orb_L
10 10 2212 Frontal_Mid_Orb_R
11 11 2301 Frontal_Inf_Oper_L
12 12 2302 Frontal_Inf_Oper_R
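The selected indices from the lasso can be matched against this dictionary. The sketch below uses a hypothetical four-row stand-in (region_code_toy) mirroring the printed table; the real region_code has all 116 rows.

```r
# Map selected ROI indices back to region names via the dictionary.
# Toy stand-in for the region_code table shown above.
region_code_toy <- data.frame(V1 = 1:4,
                              V2 = c(2001, 2002, 2101, 2102),
                              V3 = c("Precentral_L", "Precentral_R",
                                     "Frontal_Sup_L", "Frontal_Sup_R"))
selected <- c(3, 4)                     # indices with nonzero lasso betas
region_code_toy$V3[match(selected, region_code_toy$V1)]
```

With the full dictionary, the same one-liner turns which(abs(beta_hat)!=0) into a list of anatomical names.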
2. KNN
library(caret)
library(pROC)
ctrl <- trainControl(method="repeatedcv", repeats = 10)
# y is numeric here, so caret fits a regression-style kNN;
# the numeric predictions are thresholded at 0.5 below to get class labels
model_knn <- train(as.data.frame(X.m), unlist(y), method = "knn",
                   preProcess = c("center","scale"), trControl = ctrl,
                   tuneLength = 20)
pred.knn <- as.numeric(predict(model_knn, X.m)>0.5)
confusionMatrix(factor(pred.knn), factor(y))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 67 26
1 5 27
Accuracy : 0.752
95% CI : (0.6668, 0.8249)
No Information Rate : 0.576
P-Value [Acc > NIR] : 3.13e-05
Kappa : 0.4643
Mcnemar's Test P-Value : 0.000328
Sensitivity : 0.9306
Specificity : 0.5094
Pos Pred Value : 0.7204
Neg Pred Value : 0.8438
Prevalence : 0.5760
Detection Rate : 0.5360
Detection Prevalence : 0.7440
Balanced Accuracy : 0.7200
'Positive' Class : 0
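caret hides the mechanics; under the hood, k-NN classifies a new point by majority vote among its k nearest training points. Below is a self-contained base-R sketch of that rule on toy data (knn_one is an illustrative helper, not caret's implementation):

```r
# What k-NN does under the hood: majority vote among the k nearest
# training points under Euclidean distance.
knn_one <- function(X_train, y_train, x_new, k = 3) {
  d <- sqrt(rowSums((X_train - matrix(x_new, nrow(X_train), ncol(X_train),
                                      byrow = TRUE))^2))
  votes <- y_train[order(d)[1:k]]          # labels of the k nearest points
  as.numeric(names(which.max(table(votes))))
}

X_tr <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
y_tr <- c(0, 0, 0, 1, 1, 1)
knn_one(X_tr, y_tr, c(0.2, 0.2))  # near the first cluster: predicts 0
knn_one(X_tr, y_tr, c(5.5, 5.5))  # near the second cluster: predicts 1
```

Because the vote depends on raw distances, centering and scaling the features (as in the preProcess argument above) matters a great deal for k-NN.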
3. Linear discriminant analysis (LDA)
library(MASS)
model.lda <- lda(y ~ X.m)
pred.lda <- predict(model.lda)  # predictions on the training data
confusionMatrix(pred.lda$class, factor(y))
#Confusion Matrix and Statistics
#
# Reference
#Prediction 0 1
# 0 72 0
# 1 0 53
#
# Accuracy : 1
# 95% CI : (0.9709, 1)
# No Information Rate : 0.576
# P-Value [Acc > NIR] : < 2.2e-16
#
# Kappa : 1
#
# Mcnemar's Test P-Value : NA
#
# Sensitivity : 1.000
# Specificity : 1.000
# Pos Pred Value : 1.000
# Neg Pred Value : 1.000
# Prevalence : 0.576
# Detection Rate : 0.576
# Detection Prevalence : 0.576
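For two classes, LDA amounts to projecting each sample onto the direction w = S_pooled^{-1}(mu1 - mu0) and thresholding at the midpoint of the projected class means. The toy base-R sketch below illustrates that rule on well-separated simulated data (this is not MASS's implementation):

```r
# Two-class LDA rule: project onto w = S_pooled^{-1} (mu1 - mu0),
# threshold at the midpoint of the projected class means.
set.seed(3)
X0 <- matrix(rnorm(20 * 2, mean = 0), ncol = 2)    # class 0
X1 <- matrix(rnorm(20 * 2, mean = 10), ncol = 2)   # class 1, well separated
mu0 <- colMeans(X0); mu1 <- colMeans(X1)
Sp  <- (19 * cov(X0) + 19 * cov(X1)) / 38          # pooled covariance
w   <- solve(Sp, mu1 - mu0)                        # discriminant direction
thr <- sum(w * (mu0 + mu1) / 2)                    # midpoint threshold
pred <- as.numeric(rbind(X0, X1) %*% w > thr)
acc <- mean(pred == rep(c(0, 1), each = 20))       # training accuracy
```

With p = 116 features and only 125 subjects, LDA has enough flexibility to separate the training data perfectly, which is exactly what the confusion matrix above shows.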
4. XGboost
library(xgboost)
set.seed(2022)
label_tr_xgb <- as.numeric(unlist(y))
xgb_train <- xgb.DMatrix(data = as.matrix(X.m), label = label_tr_xgb)
model_xgb <- xgboost(data = xgb_train, nrounds = 10,
                     objective = "binary:logistic")
xgbpred<- predict(model_xgb, xgb_train)
confusionMatrix(as.factor(as.numeric(xgbpred>0.5)), factor(y))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 72 0
1 0 53
Accuracy : 1
95% CI : (0.9709, 1)
No Information Rate : 0.576
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.000
Specificity : 1.000
Pos Pred Value : 1.000
Neg Pred Value : 1.000
Prevalence : 0.576
Detection Rate : 0.576
Detection Prevalence : 0.576
Balanced Accuracy : 1.000
'Positive' Class : 0
5. Neural network
#install.packages(c('neuralnet','keras','tensorflow'),dependencies = T)
library(tidyverse)
library(neuralnet)
colnames(X.m) <- paste("ROI", c(1:116), sep="")
Dat <- as.data.frame(cbind(y, X.m))
model <- neuralnet(y ~ .,
                   data = Dat,
                   hidden = c(4, 2),
                   linear.output = FALSE)
nnpred <- predict(model, X.m)
confusionMatrix(as.factor(as.numeric(nnpred>0.5)), factor(y))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 70 2
1 2 51
Accuracy : 0.968
95% CI : (0.9201, 0.9912)
No Information Rate : 0.576
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9345
Mcnemar's Test P-Value : 1
Sensitivity : 0.9722
Specificity : 0.9623
Pos Pred Value : 0.9722
Neg Pred Value : 0.9623
Prevalence : 0.5760
Detection Rate : 0.5600
Detection Prevalence : 0.5760
Balanced Accuracy : 0.9672
'Positive' Class : 0
PART II-2. Predictive model validation
1. Model validation via independent datasets
One needs to collect an independent test dataset when collecting the training data, or obtain test data from other data cohorts.
2. Cross validation
A self-implemented R function for cross validation of the Lasso model.
Lasso.cv <- function(X, Y, fold=5, seed=1, lamb)
{
  n <- nrow(X)
  ## set seed and generate the cross-validation partition, stratified by class
  set.seed(seed)
  index.cv <- vector(mode='list', length=fold)
  f.n <- floor(n/fold)
  for(k in 1:2){
    index_set_k <- which(Y==(k-1))
    ran.order_set_k <- sample(index_set_k, length(index_set_k), replace=FALSE)
    if(length(index_set_k) < fold){
      stop(paste("There are fewer than", fold, "samples in class", k))
    }
    f.nk <- floor(length(index_set_k)/fold)
    for (f in 1:(fold-1)){
      index.cv[[f]] <- c(index.cv[[f]], ran.order_set_k[(f-1)*f.nk+1:f.nk])
    }
    index.cv[[fold]] <- c(index.cv[[fold]], ran.order_set_k[((fold-1)*f.nk+1):length(index_set_k)])
  }
  ## begin cross validation
  err.cv <- NULL
  err.tot <- 0
  Y.test.arr <- NULL
  Y.pred.arr <- NULL
  for(f in 1:fold){
    print(paste("fold=", f))
    index.cur.cv <- index.cv[[f]]
    X.m <- X[-(index.cur.cv), ]
    Y.m <- Y[-(index.cur.cv)]
    X.t <- X[index.cur.cv, ]
    Y.t <- Y[index.cur.cv]
    Y.test.arr[[f]] <- Y.t
    mod.train <- glmnet(x=X.m, y=Y.m, alpha=1, lambda=lamb, family="binomial")
    pred.y.t <- as.numeric(predict(mod.train, s=lamb, newx=X.t, type="class"))
    err.cv.c <- length(which(pred.y.t != Y.t))
    err.cv <- c(err.cv, err.cv.c)
    err.tot <- err.tot + err.cv.c
    Y.pred.arr[[f]] <- pred.y.t
  } ## end fold loop
  result <- list(err.tot=err.tot, err.cv=err.cv, Y.pred.arr=Y.pred.arr, Y.test.arr=Y.test.arr)
  return(result)
}
cv.tmp <- Lasso.cv(X.m, y, lamb=0.0035)
[1] "fold= 1"
[1] "fold= 2"
[1] "fold= 3"
[1] "fold= 4"
[1] "fold= 5"
cv.tmp
$err.tot
[1] 32
$err.cv
[1] 9 7 4 6 6
$Y.pred.arr
$Y.pred.arr[[1]]
[1] 0 1 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1 1 1 1 1 0
$Y.pred.arr[[2]]
[1] 0 0 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 0 1 0 1 0 1 1
$Y.pred.arr[[3]]
[1] 1 0 1 0 0 0 0 0 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1
$Y.pred.arr[[4]]
[1] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 1 0
$Y.pred.arr[[5]]
[1] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 1 0 1 1 0 0 0 1 1
$Y.test.arr
$Y.test.arr[[1]]
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
$Y.test.arr[[2]]
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
$Y.test.arr[[3]]
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
$Y.test.arr[[4]]
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
$Y.test.arr[[5]]
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1
Notice that the cross-validation error is 32 out of 125 (about 26%), while the prediction error on the training data was 0: a clear sign of overfitting.
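The fold construction inside Lasso.cv is stratified: each class is split into folds separately, so every fold keeps roughly the 53/72 class ratio. The compact base-R variant below illustrates the same idea with round-robin assignment instead of equal-sized blocks (stratified_folds is an illustrative helper, not the function above):

```r
# Stratified fold assignment: split each class into folds separately
# so every fold keeps roughly the overall class ratio.
stratified_folds <- function(y, fold = 5, seed = 1) {
  set.seed(seed)
  idx <- vector("list", fold)
  for (cls in unique(y)) {
    ord <- sample(which(y == cls))                 # shuffle within class
    grp <- rep(1:fold, length.out = length(ord))   # deal out round-robin
    for (f in 1:fold) idx[[f]] <- c(idx[[f]], ord[grp == f])
  }
  idx
}

y_toy <- rep(c(0, 1), c(53, 72))                   # same class sizes as ADNI
folds <- stratified_folds(y_toy)
```

Every sample lands in exactly one fold, so the five test sets partition all 125 subjects, just as in the Lasso.cv output above.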