Statistics to prove anything: 2011

Its been a while. In that time there have been several instances when I've been in need of some of the following functions and I had to go digging through my old homeworks to find specific instances when I used them; so I thought they'd be worth posting on here.

First the proc.time() function is useful to see the runtime of a program. Here's an example that will give the run time in seconds:

time <- proc.time()
# - run some function - #
proc.time()[3] - time[3]

Some of the functions that I wrote for missing data have proven to be quite useful as well. First I always forget the is.na() function and often confuse complete.cases with na.omit

### Ways to find what values are missing
complete.cases(sed) #T/F if row has any missing
na.omit(sed) #gives only rows with no missing values
is.na(sed)==F #have to use is.na as a logical argument, not ==NA

Here are some functions that I think are quite useful and I am surprised they aren't available in R, or if they are, I don't know what they are called

### Identify cols that are entirely NA
NA.cols<-function(X) {
cols<-apply(X,2,function(x) sum(is.na(x)))==nrow(X)
names(cols)<-colnames(X)
na.cols<-which(cols==T)
if(length(na.cols)==0) na.cols<-'Each column has at least one non NA value'
return(na.cols)
}

### Identify rows that are entirely NA (opposite of

### complete.cases)

NA.rows<-function(X) {
rows<-apply(X,1,function(x) sum(is.na(x)))==ncol(X)
names(rows)<-rownames(X)
na.rows<-which(rows==T)
if(length(na.rows)==0) na.rows<-'Each row has at least one non NA value'
return(na.rows)
}

### Identify cols that have no have no NA values (I guess you
### could also transpose the data and do na.omit)
complete.cols<-function(X) {
cols<-apply(X,2,function(x) sum(is.na(x)))==0
names(cols)<-colnames(X)
complete.cols<-which(cols==T)
if(length(complete.cols)==0) complete.cols<-'There are no complete variables'
return(complete.cols)
}

I also have my simple imputation function. For real imputation techniques I recommend the Amelia package (at least thats what I used for the multivariate class, and it seemed to have some nice features). I found using proc.time() that it takes like 22 seconds to run on my machine, mostly because the 'hot deck' imputation could probably be more efficient. The real lesson here is that when you write a function, you get to name it after yourself

### A function that returns some simple imputation methods
Alan.imputations<-function(X) {
require(fields)

mean.x<-rep(NA,ncol(X))
median.x<-rep(NA,ncol(X))
min.x<-rep(NA,ncol(X))
max.x<-rep(NA,ncol(X))

X.mean<-X
X.median<-X
X.min<-X
X.max<-X
X.zero<-X
X.sample<-X

for(j in 1:ncol(X)) {
  if(is.numeric(X[,j])==T) {
   mean.x[j]<-mean(X[,j],na.rm=T)
   median.x[j]<-median(X[,j],na.rm=T)
   min.x[j]<-min(X[,j],na.rm=T)
   max.x[j]<-max(X[,j],na.rm=T)
  }
}
for(j in 1:ncol(X)) {
  X.mean[is.na(X.mean[,j]),j]<-mean.x[j]
  X.median[is.na(X.median[,j]),j]<-median.x[j]
  X.min[is.na(X.min[,j]),j]<-min.x[j]
  X.max[is.na(X.max[,j]),j]<-max.x[j]

  i.na<-is.na(X[,j])
  if(sum(i.na)!=length(i.na))
  X.sample[i.na,j]<-sample(na.omit(X[,j]),sum(i.na),replace=T)
}
X.zero[is.na(X.zero)]<-0

cols<-apply(X,2,function(x) sum(is.na(x)))==nrow(X)
na.cols<-which(cols==T)
comp.X<-na.omit(X[,-na.cols])
numeric.cols<-rep(NA,ncol(comp.X))
for(i in 1:ncol(comp.X)) {
  numeric.cols[i] <- is.numeric(comp.X[,i])
}

X.na<-X[,-na.cols];X.na<-X.na[,numeric.cols]
NA.index<-which(is.na(X.na)==T,arr.ind=T)
sX <- scale(X.na)

X.na.center<-X.na - matrix(attr(sX,"scaled:center"),nrow(X.na),ncol(X.na),byrow=T)
X.na.scaled<-X.na.center/matrix(attr(sX,"scaled:scale"),nrow(X.na.center),ncol(X.na.center),byrow=T)
new.na.cols<-NA.cols(X.na.scaled)
X.na.scaled<-X.na.scaled[,-new.na.cols]
new.complete.cols<-complete.cols(X.na.scaled)

dist<-matrix(NA,nrow(X.na.scaled),nrow(X.na.scaled))
for(i in 1:nrow(X.na.scaled)) {
  dist[i,]<-rdist(X.na.scaled[i,new.complete.cols],X.na.scaled[,new.complete.cols])
}

for(i in 1:nrow(dist)) {
if(complete.cases(X.na.scaled)[i]==F) {
j<-2
   min<-which(dist[i,]==dist[i,order(dist[i,])[j]])
   while(sum(is.na(X.na.scaled[min,is.na(X.na.scaled[i,])]))!=0) {
   min<-which(dist[i,]==dist[i,order(dist[i,])[j]])
   j<-j+1
   }
  X.na.scaled[i,is.na(X.na.scaled[i,])]<- X.na.scaled[min,is.na(X.na.scaled[i,])]
}
}

X.hotdeck<-X.na.scaled*matrix(attr(sX,"scaled:scale")[-new.na.cols],nrow(X.na.scaled),ncol(X.na.scaled),byrow=T)+matrix(attr(sX,"scaled:center")[-new.na.cols],nrow(X.na.scaled),ncol(X.na.scaled),byrow=T)

alan<-list(max=X.max, mean=X.mean, median=X.median, min=X.min, sample=X.sample, zero=X.zero, hotdeck=X.hotdeck)
return(alan)
}

(note that some of the spacing gets messed up, just in case something doesn't run right)

Statistics to prove anything

Sunday, January 9, 2011

Some useful R functions