
Logistic Regression Explained

Logistic regression is a type of regression used when the dependent variable is binary or ordinal (e.g. when the outcome is either “dead” or “alive”). It is commonly used for predicting the probability of occurrence of an event, based on several predictor variables that may be either numerical or categorical. For example, suppose a researcher is interested in how Graduate Record Exam (GRE) scores and grade point average (GPA) affect admission into graduate school. By deriving a logistic regression model from previously observed admissions (we will use a hypothetical dataset from the UCLA Academic Technology Services here), it becomes possible to predict future admissions.

mydata = read.csv(url('http://www.ats.ucla.edu/stat/r/dae/binary.csv'))
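A quick look at the structure of the data shows the binary outcome admit (0/1) together with the predictors gre, gpa and rank:

str(mydata)
head(mydata)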

In this dataset, gre, gpa and rank are the predictor variables and admit is the outcome. A typical logistic regression analysis using R's built-in glm function could then be applied as follows:

mylogit = glm(admit~gre+gpa+as.factor(rank), family=binomial, data=mydata)
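The fitted model can be inspected with the usual tools; for instance, summary() reports the coefficients with their standard errors, exponentiating the coefficients gives odds ratios, and the fitted admission probabilities are obtained with predict():

summary(mylogit)                        # coefficients, standard errors, z-values
exp(coef(mylogit))                      # odds ratios
head(predict(mylogit, type="response")) # fitted admission probabilities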

However, to understand the mechanics of logistic regression we can write out its likelihood function ourselves and employ maximum likelihood estimation (MLE) to find the optimal parameter values, i.e. the unknown regression coefficients.
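For binary outcomes y_i and predictor vectors x_i, the log-likelihood to be maximized is

\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right], \qquad p_i = \frac{1}{1 + \exp(-x_i^\top \beta)}.

The function below evaluates exactly this expression (rewritten in terms of exp(-x%*%beta) and exp(x%*%beta)) and returns its negative, so that optim, which minimizes by default, can be used: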

################################################################################
# Calculates the maximum likelihood estimates of a logistic regression model
#
# fmla : model formula
# data : a [n x p] data frame with the data. Factors should be coded accordingly
#
# OUTPUT
# beta : the estimated regression coefficients
# vcov : the variance-covariance matrix
# ll : -2 ln L (deviance)
#
################################################################################
# Author : Thomas Debray
# Version : 22 dec 2011
################################################################################
mle.logreg = function(fmla, data)
{
    # Define the negative log likelihood function
    logl <- function(theta,x,y){
      x <- as.matrix(x)
      beta <- theta[1:ncol(x)]

      # Use the log-likelihood of the Bernoulli distribution, where p is
      # defined as the logistic transformation of a linear combination
      # of predictors, according to logit(p)=(x%*%beta)
      loglik <- sum(-y*log(1 + exp(-(x%*%beta))) - (1-y)*log(1 + exp(x%*%beta)))
      return(-loglik)
    }

    # Prepare the data
    outcome = rownames(attr(terms(fmla),"factors"))[1]
    dfrTmp = model.frame(data)
    x = as.matrix(model.matrix(fmla, data=dfrTmp))
    y = as.numeric(as.matrix(data[,match(outcome,colnames(data))]))

    # Define initial values for the parameters
    theta.start = rep(0,(dim(x)[2]))
    names(theta.start) = colnames(x)

    # Calculate the maximum likelihood
    mle = optim(theta.start,logl,x=x,y=y,hessian=T)
    out = list(beta=mle$par,vcov=solve(mle$hessian),ll=2*mle$value)
    return(out)
}
################################################################################

We can implement this function as follows:

mydata$rank = factor(mydata$rank) #Treat rank as a categorical variable
fmla = as.formula("admit~gre+gpa+rank") #Create model formula
mylogit = mle.logreg(fmla, mydata) #Estimate coefficients
mylogit
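As a quick sanity check, the estimates can be compared with those returned by glm; they should agree closely (small differences can occur because optim's default Nelder-Mead method is less precise than the iteratively reweighted least squares used by glm):

fit.glm = glm(fmla, family=binomial, data=mydata)
cbind(mle=mylogit$beta, glm=coef(fit.glm))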

Note that the categorical variable rank is modeled as a factor. This implies that a separate regression coefficient is estimated for ranks 2, 3 and 4 (with rank 1 as the reference). Instead of obtaining the observed information matrix from the numerically differentiated Hessian returned by optim, it is also possible to calculate the expected (Fisher) information matrix directly from the data:

################################################################################
# Calculates the maximum likelihood estimates of a logistic regression model
#
# fmla : model formula
# data : a [n x p] data frame with the data. Factors should be coded accordingly
#
# OUTPUT
# beta : the estimated regression coefficients
# vcov : the variance-covariance matrix
# dev : -2 ln L (deviance)
#
################################################################################
# Author : Thomas Debray
# Version : 22 dec 2011
################################################################################
mle.logreg = function(fmla, data)
{
    # Define the negative log likelihood function
    logl <- function(theta,x,y){
      x <- as.matrix(x)
      beta <- theta[1:ncol(x)]

      # Use the log-likelihood of the Bernoulli distribution, where p is
      # defined as the logistic transformation of a linear combination
      # of predictors, according to logit(p)=(x%*%beta)
      loglik <- sum(-y*log(1 + exp(-(x%*%beta))) - (1-y)*log(1 + exp(x%*%beta)))
      return(-loglik)
    }

    # Prepare the data
    outcome = rownames(attr(terms(fmla),"factors"))[1]
    dfrTmp = model.frame(data)
    x = as.matrix(model.matrix(fmla, data=dfrTmp))
    y = as.numeric(as.matrix(data[,match(outcome,colnames(data))]))

    # Define initial values for the parameters
    theta.start = rep(0,(dim(x)[2]))
    names(theta.start) = colnames(x)

    # Calculate the maximum likelihood
    mle = optim(theta.start,logl,x=x,y=y,hessian=F)

    # Obtain regression coefficients
    beta = mle$par
 
    # Calculate the Information matrix
    # The variance of a Bernoulli distribution is given by p(1-p)
    p = 1/(1+exp(-x%*%beta))
    V = array(0,dim=c(dim(x)[1],dim(x)[1]))
    diag(V) = p*(1-p)
    IB = t(x)%*%V%*%x
 
    # Return estimates
    out = list(beta=beta,vcov=solve(IB),dev=2*mle$value)
    return(out)
}
################################################################################
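The returned variance-covariance matrix can be used directly to obtain standard errors and Wald-type 95% confidence intervals for the coefficients:

mylogit = mle.logreg(fmla, mydata)
se = sqrt(diag(mylogit$vcov))
round(cbind(beta = mylogit$beta, se = se,
            lower = mylogit$beta - qnorm(0.975)*se,
            upper = mylogit$beta + qnorm(0.975)*se), 3)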

Finally, in some scenarios it is necessary to constrain the parameter search space. For instance, in stacked regressions it is important to place non-negativity constraints on the regression slopes. This can be achieved with a small modification to the optim call, using the L-BFGS-B method together with a lower bound of zero on the slopes:

################################################################################
# Calculates the maximum likelihood estimates of a logistic regression model
# Slopes are constrained to non-negative values
#
# fmla : model formula
# data : a [n x p] data frame with the data. Factors should be coded accordingly
#
# OUTPUT
# beta : the estimated regression coefficients
# vcov : the variance-covariance matrix
# dev : -2 ln L (deviance)
#
################################################################################
# Author : Thomas Debray
# Version : 22 dec 2011
################################################################################
mle.logreg.constrained = function(fmla, data)
{
    # Define the negative log likelihood function
    logl <- function(theta,x,y){
      x <- as.matrix(x)
      beta <- theta[1:ncol(x)]

      # Use the log-likelihood of the Bernoulli distribution, where p is
      # defined as the logistic transformation of a linear combination
      # of predictors, according to logit(p)=(x%*%beta)
      loglik <- sum(-y*log(1 + exp(-(x%*%beta))) - (1-y)*log(1 + exp(x%*%beta)))
      return(-loglik)
    }

    # Prepare the data
    outcome = rownames(attr(terms(fmla),"factors"))[1]
    dfrTmp = model.frame(data)
    x = as.matrix(model.matrix(fmla, data=dfrTmp))
    y = as.numeric(as.matrix(data[,match(outcome,colnames(data))]))

    # Define initial values for the parameters
    theta.start = rep(0,(dim(x)[2]))
    names(theta.start) = colnames(x)
    
    # Non-negative slopes constraint
    lower = c(-Inf,rep(0,(length(theta.start)-1)))

    # Calculate the maximum likelihood
    mle = optim(theta.start,logl,x=x,y=y,hessian=T,lower=lower,method="L-BFGS-B")

    # Obtain regression coefficients
    beta = mle$par
 
    # Calculate the Information matrix
    # The variance of a Bernoulli distribution is given by p(1-p)
    p = 1/(1+exp(-x%*%beta))
    V = array(0,dim=c(dim(x)[1],dim(x)[1]))
    diag(V) = p*(1-p)
    IB = t(x)%*%V%*%x
 
    # Return estimates
    out = list(beta=beta,vcov=solve(IB),dev=2*mle$value)
    return(out)
}
################################################################################
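The constrained version is called in the same way (reusing fmla and mydata from above). In the admission example the rank coefficients, which are negative in the unconstrained fit, end up on the boundary at zero; note that standard errors derived from the information matrix are not strictly valid for parameters lying on such a boundary.

mylogit.c = mle.logreg.constrained(fmla, mydata)
mylogit.c$beta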



Meta-analysis

Introduction

Effect estimation is an important task in modern research; examples include the identification of risk factors for disease and the evaluation of medical treatments. Usually, researchers are interested in estimating a global, common effect. Since actual effects tend to differ across populations, estimates based on a sample from a particular population seldom generalize well. When several estimates of an effect are available, combining them into a summary can yield a more objective and more precise overall estimate, and meta-analysis has become a popular approach for doing so.

If all studies in the analysis were equally precise, a possible approach would be to simply compute the mean of the effect sizes. However, some studies are usually more precise than others, so it is preferable to assign more weight to the studies that carry more information. This is exactly what happens in a meta-analysis: rather than a simple mean of the effect sizes, a weighted mean is calculated, with more weight given to some studies and less to others.

The question that we need to address, then, is how the weights are assigned. It turns out that this depends on what we mean by a “combined effect”. There are two models used in meta-analysis, the fixed effect model and the random effects model. The two make different assumptions about the nature of the studies, and these assumptions lead to different definitions for the combined effect, and different mechanisms for assigning weights.

Under the fixed effect model we assume that there is one true effect size which is shared by all the included studies. It follows that the combined effect is our estimate of this common effect size.

By contrast, under the random effects model we allow that the true effect could vary from study to study. For example, the effect size might be a little higher if the subjects are older, or more educated, or healthier; or if the study used a slightly more intensive or longer variant of the intervention; or if the effect was measured more reliably; and so on. The studies included in the meta-analysis are assumed to be a random sample of the relevant distribution of effects, and the combined effect estimates the mean effect in this distribution.

Example

We illustrate both approaches using a fictional example from Borenstein where the impact of an intervention on reading scores in children is investigated.

Study name   Effect estimate   Variance
Carroll      0.10              0.03
Grant        0.30              0.03
Peck         0.35              0.05
Donat        0.65              0.01
Stewart      0.45              0.05
Young        0.15              0.02

We can insert this evidence into R as follows:

ests = c(0.10,0.30,0.35,0.65,0.45,0.15)
vars = c(0.03,0.03,0.05,0.01,0.05,0.02)

Fixed effects model

The fixed effects model assumes the absence of between-study heterogeneity: all studies are assumed to measure the same underlying effect. The study estimates are therefore weighted by the inverse of their reported variances.

# Weights
w = 1/vars

# Combined effect
est.fixef = sum(ests*w)/sum(w)

# Standard error of the combined effect
se.fixef = sqrt(1/sum(w))

# The Z-value
z.fixef = est.fixef/se.fixef

The standardized combined effect (z-value) allows us to calculate p-values and a confidence interval for the combined effect.

# p-values
fe_p1t = pnorm(z.fixef,lower.tail=FALSE)   #1-tailed p-value
fe_p2t = 2*pnorm(z.fixef,lower.tail=FALSE) #2-tailed p-value

# 95% confidence interval
fe.lowerconf =  est.fixef + qnorm(0.025)*se.fixef
fe.upperconf =  est.fixef + qnorm(0.975)*se.fixef

If we apply the methodology above to the presented example, we obtain a combined effect of 0.40 (SE: 0.06), which is statistically significant (2-tailed p-value: 2e-10).
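These numbers follow directly from the quantities computed above:

round(c(estimate = est.fixef, se = se.fixef), 2)
c(z = z.fixef, p.2t = fe_p2t)
round(c(lower = fe.lowerconf, upper = fe.upperconf), 2)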

Random effects model
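The random effects model additionally requires an estimate of the between-study variance (tau^2). A common choice is the DerSimonian-Laird estimator; a minimal sketch, reusing ests, vars, w and est.fixef from the fixed effects analysis above, could look as follows:

# Between-study heterogeneity (DerSimonian-Laird estimator)
Q = sum(w*(ests - est.fixef)^2)   # Cochran's Q statistic
df = length(ests) - 1
C = sum(w) - sum(w^2)/sum(w)
tau2 = max(0, (Q - df)/C)

# Random effects weights and combined effect
w.ranef = 1/(vars + tau2)
est.ranef = sum(ests*w.ranef)/sum(w.ranef)
se.ranef = sqrt(1/sum(w.ranef))

When tau2 equals zero the random effects weights reduce to the fixed effects weights and both models yield the same combined effect; otherwise the random effects weights are more similar to each other, so that small studies receive relatively more influence on the summary.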

