Logistic Regression Explained

Logistic regression is a type of regression used when the dependent variable is binary or ordinal (e.g. when the outcome is either “dead” or “alive”). It is commonly used to predict the probability of occurrence of an event from several predictor variables, which may be either numerical or categorical. For example, suppose a researcher is interested in how Graduate Record Exam (GRE) scores and grade point average (GPA) affect admission into graduate school. By deriving a logistic regression model from previously observed admissions (we will use a hypothetical dataset from the UCLA Academic Technology Services here), it becomes possible to predict future admissions.
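At the heart of the model is the logistic (inverse-logit) transformation, which maps any linear combination of predictors onto a probability between 0 and 1. A minimal sketch in R (the function name inv.logit is ours, not part of base R):

```r
# The inverse-logit maps a linear predictor eta = x %*% beta to p in (0,1),
# such that logit(p) = log(p/(1-p)) = eta
inv.logit <- function(eta) 1 / (1 + exp(-eta))

inv.logit(0)             # 0.5: even odds when the linear predictor is zero
inv.logit(c(-4, 0, 4))   # probabilities flatten out towards 0 and 1 in the tails
```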

mydata = read.csv(url('http://www.ats.ucla.edu/stat/r/dae/binary.csv'))

In this dataset, gre, gpa and rank represent the predictor variables, and admit the outcome. A typical regression analysis using R's built-in glm function could then be applied as follows:

mylogit = glm(admit~gre+gpa+as.factor(rank), family=binomial, data=mydata)
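As a quick illustration of what such a fit returns, the model can be inspected with summary(), and exponentiating the coefficients gives odds ratios. The example below uses simulated data so it is self-contained; the coefficient values in the simulation are made up:

```r
set.seed(1)
n   <- 500
gre <- rnorm(n, mean = 580, sd = 110)
gpa <- rnorm(n, mean = 3.4, sd = 0.4)

# Simulate admissions from a known logistic model (hypothetical coefficients)
p     <- 1 / (1 + exp(-(-6 + 0.003 * gre + 1.0 * gpa)))
admit <- rbinom(n, 1, p)

fit <- glm(admit ~ gre + gpa, family = binomial)
summary(fit)     # coefficients, standard errors, z-values
exp(coef(fit))   # odds ratios: multiplicative change in odds per unit increase
```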

However, in order to understand the mechanics of logistic regression, we can write out its likelihood function explicitly. We will employ maximum likelihood estimation (MLE) to find the optimal parameter values, here represented by the unknown regression coefficients:

################################################################################
# Calculates the maximum likelihood estimates of a logistic regression model
#
# fmla : model formula
# x : a [n x p] dataframe with the data. Factors should be coded accordingly
#
# OUTPUT
# beta : the estimated regression coefficients
# vcov : the variance-covariance matrix
# ll : -2ln L (deviance)
#
################################################################################
# Author : Thomas Debray
# Version : 22 dec 2011
################################################################################
mle.logreg = function(fmla, data)
{
    # Define the negative log likelihood function
    logl <- function(theta,x,y){
      x <- as.matrix(x)
      beta <- theta[1:ncol(x)]

      # Use the log-likelihood of the Bernoulli distribution, where p is
      # defined as the logistic transformation of a linear combination
      # of predictors, according to logit(p) = x%*%beta
      loglik <- sum(-y*log(1 + exp(-(x%*%beta))) - (1-y)*log(1 + exp(x%*%beta)))
      return(-loglik)
    }

    # Prepare the data
    outcome = rownames(attr(terms(fmla),"factors"))[1]
    dfrTmp = model.frame(data)
    x = as.matrix(model.matrix(fmla, data=dfrTmp))
    y = as.numeric(as.matrix(data[,match(outcome,colnames(data))]))

    # Define initial values for the parameters
    theta.start = rep(0,(dim(x)[2]))
    names(theta.start) = colnames(x)

    # Calculate the maximum likelihood
    mle = optim(theta.start,logl,x=x,y=y,hessian=TRUE)
    out = list(beta=mle$par,vcov=solve(mle$hessian),ll=2*mle$value)
    return(out)
}
################################################################################

We can apply this function as follows:

mydata$rank = factor(mydata$rank) #Treat rank as a categorical variable
fmla = as.formula("admit~gre+gpa+rank") #Create model formula
mylogit = mle.logreg(fmla, mydata) #Estimate coefficients
mylogit
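A useful sanity check is that this optim-based MLE reproduces what glm gives. The compact, self-contained sketch below (simulated data, hypothetical true coefficients) feeds an algebraically equivalent form of the same negative log-likelihood to optim directly:

```r
set.seed(42)
n <- 1000
x <- cbind(1, rnorm(n), rnorm(n))        # design matrix with intercept column
beta.true <- c(-1, 0.8, -0.5)
y <- rbinom(n, 1, 1 / (1 + exp(-x %*% beta.true)))

# Negative Bernoulli log-likelihood; equivalent to the form used in mle.logreg
negll <- function(beta, x, y) {
  eta <- x %*% beta
  -sum(y * eta - log(1 + exp(eta)))
}

mle <- optim(rep(0, 3), negll, x = x, y = y, method = "BFGS")
fit <- glm(y ~ x[, 2] + x[, 3], family = binomial)

max(abs(mle$par - coef(fit)))            # the two sets of estimates agree closely
```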

Note that the categorical variable rank is modeled as a factor, so that a separate regression coefficient is estimated for ranks 2, 3 and 4 (with rank 1 as the reference category). Instead of obtaining the observed information matrix from the numerically differentiated Hessian (through the hessian argument of optim), it is possible to calculate the expected (Fisher) information matrix directly from the data:

################################################################################
# Calculates the maximum likelihood estimates of a logistic regression model
#
# fmla : model formula
# x : a [n x p] dataframe with the data. Factors should be coded accordingly
#
# OUTPUT
# beta : the estimated regression coefficients
# vcov : the variance-covariance matrix
# dev : -2 ln L (deviance)
#
################################################################################
# Author : Thomas Debray
# Version : 22 dec 2011
################################################################################
mle.logreg = function(fmla, data)
{
    # Define the negative log likelihood function
    logl <- function(theta,x,y){
      x <- as.matrix(x)
      beta <- theta[1:ncol(x)]

      # Use the log-likelihood of the Bernoulli distribution, where p is
      # defined as the logistic transformation of a linear combination
      # of predictors, according to logit(p) = x%*%beta
      loglik <- sum(-y*log(1 + exp(-(x%*%beta))) - (1-y)*log(1 + exp(x%*%beta)))
      return(-loglik)
    }

    # Prepare the data
    outcome = rownames(attr(terms(fmla),"factors"))[1]
    dfrTmp = model.frame(data)
    x = as.matrix(model.matrix(fmla, data=dfrTmp))
    y = as.numeric(as.matrix(data[,match(outcome,colnames(data))]))

    # Define initial values for the parameters
    theta.start = rep(0,(dim(x)[2]))
    names(theta.start) = colnames(x)

    # Calculate the maximum likelihood
    mle = optim(theta.start,logl,x=x,y=y,hessian=FALSE)

    # Obtain regression coefficients
    beta = mle$par
 
    # Calculate the Information matrix
    # The variance of a Bernoulli distribution is given by p(1-p)
    p = 1/(1+exp(-x%*%beta))
    V = array(0,dim=c(dim(x)[1],dim(x)[1]))
    diag(V) = p*(1-p)
    IB = t(x)%*%V%*%x
 
    # Return estimates
    out = list(beta=beta,vcov=solve(IB),dev=2*mle$value)
    return(out)
}
################################################################################
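To see that the analytic information matrix yields the same standard errors that glm reports, one can evaluate it at the fitted coefficients. The self-contained sketch below (simulated data) also avoids building the full n x n matrix V by weighting the rows of x directly:

```r
set.seed(7)
n <- 2000
x <- cbind(1, rnorm(n))                      # design matrix with intercept
y <- rbinom(n, 1, 1 / (1 + exp(-(x %*% c(-0.5, 1)))))

fit  <- glm(y ~ x[, 2], family = binomial)
beta <- coef(fit)

# Fisher information at the MLE: t(x) %*% V %*% x, where V = diag(p*(1-p)).
# Multiplying the rows of x by the weights avoids forming the n x n matrix V.
p  <- as.vector(1 / (1 + exp(-x %*% beta)))
IB <- t(x) %*% (x * (p * (1 - p)))
se <- sqrt(diag(solve(IB)))

cbind(analytic = se, glm = summary(fit)$coefficients[, "Std. Error"])
```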

Finally, in some scenarios it is necessary to constrain the parameter search space. For instance, in stacked regressions it is important to impose non-negativity constraints on the regression slopes. This can be achieved with a small modification of the optim call:

################################################################################
# Calculates the maximum likelihood estimates of a logistic regression model
# Slopes are constrained to non-negative values
#
# fmla : model formula
# x : a [n x p] dataframe with the data. Factors should be coded accordingly
#
# OUTPUT
# beta : the estimated regression coefficients
# vcov : the variance-covariance matrix
# dev : -2 ln L (deviance)
#
################################################################################
# Author : Thomas Debray
# Version : 22 dec 2011
################################################################################
mle.logreg.constrained = function(fmla, data)
{
    # Define the negative log likelihood function
    logl <- function(theta,x,y){
      x <- as.matrix(x)
      beta <- theta[1:ncol(x)]

      # Use the log-likelihood of the Bernoulli distribution, where p is
      # defined as the logistic transformation of a linear combination
      # of predictors, according to logit(p) = x%*%beta
      loglik <- sum(-y*log(1 + exp(-(x%*%beta))) - (1-y)*log(1 + exp(x%*%beta)))
      return(-loglik)
    }

    # Prepare the data
    outcome = rownames(attr(terms(fmla),"factors"))[1]
    dfrTmp = model.frame(data)
    x = as.matrix(model.matrix(fmla, data=dfrTmp))
    y = as.numeric(as.matrix(data[,match(outcome,colnames(data))]))

    # Define initial values for the parameters
    theta.start = rep(0,(dim(x)[2]))
    names(theta.start) = colnames(x)
    
    # Non-negative slopes constraint
    lower = c(-Inf,rep(0,(length(theta.start)-1)))

    # Calculate the maximum likelihood
    mle = optim(theta.start,logl,x=x,y=y,hessian=FALSE,lower=lower,method="L-BFGS-B")

    # Obtain regression coefficients
    beta = mle$par
 
    # Calculate the Information matrix
    # The variance of a Bernoulli distribution is given by p(1-p)
    p = 1/(1+exp(-x%*%beta))
    V = array(0,dim=c(dim(x)[1],dim(x)[1]))
    diag(V) = p*(1-p)
    IB = t(x)%*%V%*%x
 
    # Return estimates
    out = list(beta=beta,vcov=solve(IB),dev=2*mle$value)
    return(out)
}
################################################################################
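To illustrate what the lower bound does, here is a self-contained example (simulated data, hypothetical coefficients) in which one true slope is negative; with a lower bound of 0 on the slopes, L-BFGS-B pins that estimate at the boundary:

```r
set.seed(3)
n <- 500
x <- cbind(1, rnorm(n), rnorm(n))
y <- rbinom(n, 1, 1 / (1 + exp(-(x %*% c(0.5, -1, 1)))))   # second slope truly negative

# Negative Bernoulli log-likelihood, as in mle.logreg.constrained
negll <- function(beta, x, y) {
  eta <- x %*% beta
  -sum(y * eta - log(1 + exp(eta)))
}

lower <- c(-Inf, 0, 0)                        # intercept unconstrained, slopes >= 0
con <- optim(rep(0, 3), negll, x = x, y = y,
             method = "L-BFGS-B", lower = lower)
con$par                                       # the negative slope is held at its bound of 0
```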

  1. #1 by CM on February 24, 2012 - 9:28 am

    Thanks! Very nice solution to my problem (needed to have non-negative constraints on the coefficients).

    However, models such as Y ~ (A1 * A2) + (B1 * B2) trigger an error, whereas Y ~ (A1 : A2) + (B1 : B2) passes. If I reformulate Y ~ (A1 * A2) + (B1 * B2) to Y ~ A1 + A2 + (A1 * A2) + B1 + B2 + (B1 * B2), coefficients can still be negative.

    Can you point me to my mistake or give a hint towards the solution? That would be very nice.

  2. #2 by CM on February 24, 2012 - 9:48 am

    Oops, the reformulated model should read “:”s instead of “*”s, of course. And the mentioned error is “Error in optim(theta.start, logl, x = x, y = y, hessian = T, lower = lower, : L-BFGS-B needs finite values of ‘fn'”

    Sorry for that.

    • #3 by nightlordtw on February 24, 2012 - 12:00 pm

      Hi CM, I think it should work if you change the following (in “prepare the data” section):

       y = as.numeric(as.matrix(data[,match(outcome,colnames(data))]))
      

      This change is necessary when the outcome is defined as a factor or TRUE/FALSE variable. Also, make sure Y is either boolean or a 0/1 variable.

      The example below then works for me:

      set.seed(12)
      x = array(rnorm(4000,0,1),dim=c(1000,4))
      y = rbinom(1000,1,( 1/(1+exp(-(-3.2 + 0.5*x[,1]+1.2*x[,2]-0.7*x[,3]+0.8*x[,4])))))
      mydata = as.data.frame(cbind(x,y))
      colnames(mydata)=c("A1","A2","B1","B2","Y")
      mydata$Y = as.factor(mydata$Y)
      fmla=as.formula("Y ~ (A1 * A2) + (B1 * B2)")
      model=mle.logreg.constrained(fmla,mydata)
      model
      

      This yields the following results for the regression coefficients:

      (Intercept): -2.646302771
      A1: 0.351304056
      A2: 1.007674551
      B1: 0.000000000
      B2: 0.608490582
      A1:A2: 0.001191156
      B1:B2: 0.000000000

      Hope it works for you now!

