Statistics 101: Which Type of Regression Should I Choose?
It Mostly Depends on Your Dependent Variable
Regression is a set of statistical techniques for relating a dependent variable to one or more independent variables. Briefly, a dependent variable (sometimes called an outcome variable) is one that you think is related to the independent variables. Although regression can't prove causation, you usually think that the relationship runs from the independent variable(s) to the dependent variable. (For more on the distinction, see this article.)
The most common kind of regression, and the first one (sometimes the only one) learned in statistics classes, is known as ordinary least squares, or OLS, regression. OLS regression is appropriate when the dependent variable is continuous, at either the interval or the ratio level (if you do not know what those terms mean, see this article). OLS regression also makes other assumptions, but if the dependent variable is not continuous, OLS regression is not appropriate.
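To make the idea concrete, here is a minimal sketch of OLS with a single independent variable, using only plain Python. The function name `ols_fit` and the education and income figures are made up for illustration; in real work you would use a statistics package and check the other assumptions as well.

```python
def ols_fit(x, y):
    """Return (intercept, slope) of the line that minimizes the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: years of education and income in thousands of dollars
years = [10, 12, 14, 16, 18]
income = [25, 31, 37, 43, 49]

intercept, slope = ols_fit(years, income)
print("intercept:", intercept, "slope:", slope)
```

The slope is read as the expected change in the dependent variable for a one-unit change in the independent variable, which only makes sense when the dependent variable is continuous.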
Of course, many variables are continuous, or nearly so: Income, weight, IQ, SAT score and many others. But many are not. They come in various forms.
Some variables are counts: For example, the number of children you have, the number of cars you own, the number of times you have been married. What distinguishes count variables is that they are integers, that is, whole numbers, and that they are never negative. You cannot have a negative number of children or cars. The two most common kinds of regression for count variables are Poisson regression and negative binomial regression. So, for example, if you wanted to look at the number of children people had, and relate it to (say) age, income, racial/ethnic group, and the number of brothers and sisters the parents have, you would use one of these models. The main difference is that Poisson regression makes a very restrictive assumption about the relationship between the conditional mean and the conditional variance (namely, that they are equal), while negative binomial regression relaxes this assumption and allows the variance to exceed the mean. In my experience, the assumption of Poisson regression is almost never met.
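A rough way to see whether the Poisson assumption is plausible is to compare the variance of the counts with their mean. The sketch below does this for a made-up sample; the name `dispersion_ratio` and the data are illustrative assumptions. Note that it looks at the overall mean and variance, while the assumption itself concerns the mean and variance conditional on the independent variables, so treat it as a first look rather than a formal test.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Sample variance divided by sample mean; about 1 is what Poisson expects."""
    return variance(counts) / mean(counts)


# Hypothetical sample: number of children in ten families
children = [0, 0, 1, 1, 2, 2, 2, 3, 5, 8]

ratio = dispersion_ratio(children)
print("variance / mean =", round(ratio, 2))
if ratio > 1:
    print("Variance exceeds the mean; negative binomial may fit better.")
```

A ratio well above 1 is often called overdispersion, and it is the usual reason to prefer negative binomial regression over Poisson regression.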
Many count variables have a lot of zero counts. For example, the number of heart attacks a person has had: Most people have had none. To deal with such variables, there are variations called zero-inflated Poisson regression and zero-inflated negative binomial regression.
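One informal way to spot excess zeros is to compare the share of zeros you actually observe with the share a plain Poisson distribution with the same mean would produce, which is exp(-mean). The following sketch does that; the function name `zero_excess` and the heart attack counts are hypothetical, and the comparison ignores the independent variables, so it is only a rough screen.

```python
import math


def zero_excess(counts):
    """Return (observed share of zeros, share of zeros expected under Poisson)."""
    n = len(counts)
    observed = sum(1 for c in counts if c == 0) / n
    sample_mean = sum(counts) / n
    expected = math.exp(-sample_mean)  # Poisson probability of a zero count
    return observed, expected


# Hypothetical sample: heart attacks among twenty people
heart_attacks = [0] * 17 + [1, 2, 3]

observed, expected = zero_excess(heart_attacks)
print("observed zeros:", observed, "expected under Poisson:", round(expected, 3))
```

When the observed share is clearly larger than the expected share, a zero-inflated model is worth considering.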
Some variables are dichotomies: They can only take two values. These include living vs. dying, getting a particular disease vs. not getting the disease, voting for Obama vs. voting for McCain, and so on. For these variables, the appropriate regression method is binary logistic regression.
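Binary logistic regression models the log-odds of one of the two outcomes as a linear function of the independent variables, and the logistic function turns those log-odds back into a probability between 0 and 1. Here is a small sketch of that last step. The function name `predicted_probability` and the coefficients are invented for illustration; they are not estimates from any real data.

```python
import math


def predicted_probability(intercept, slope, x):
    """Convert a linear predictor (log-odds) into a probability via the logistic function."""
    log_odds = intercept + slope * x
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients for the probability of getting a disease by age
INTERCEPT = -5.0
SLOPE_PER_YEAR = 0.125

for age in (20, 40, 60):
    p = predicted_probability(INTERCEPT, SLOPE_PER_YEAR, age)
    print("age", age, "-> probability", round(p, 3))
```

Whatever the coefficients, the result always stays between 0 and 1, which is one reason OLS is a poor fit for dichotomous outcomes.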
Some variables are nominal or ordinal. Again, see this article for a detailed explanation, but, briefly, ordinal variables have an order but no defined interval between values, and nominal variables do not even have an order. For nominal variables, the right kind of regression is multinomial logistic regression. For ordinal variables, the right method is ordinal logistic regression.
Finally, some variables are times to events: for example, time until death, time until getting a degree, and so on. One key trait here is that such variables are often censored. Although there are different kinds of censoring, the most common is right censoring. This means that some people do not have the event happen while you are studying them, although they may have it after the study is over. There is a whole field of statistics called survival analysis that deals with these variables, but the most common kind of regression for times to events is Cox proportional hazards regression.
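Right censoring is easier to see in data form. Survival analysis typically records two things for each person: how long they were observed, and whether the event was seen during that time. The sketch below builds such records; the function name `observe`, the five-year study length and the event times are all hypothetical.

```python
def observe(event_time, study_end):
    """Return (observed_time, event_seen) for one person.

    event_time is the true time of the event, or None if it never happens.
    Anyone whose event falls after study_end is right censored at study_end.
    """
    if event_time is not None and event_time <= study_end:
        return event_time, True
    return study_end, False


STUDY_END = 5  # hypothetical study length in years

# Hypothetical true event times; None means the event never occurs
true_times = [3, 8, None, 5]

records = [observe(t, STUDY_END) for t in true_times]
print(records)
```

The second and third people both end up with an observed time of 5 and no event, even though one of them does have the event later. Methods such as Cox regression are designed to use these partial observations rather than throw them away.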
This is just a brief survey; typically, courses are offered in each of the topics listed. But I hope it gives you some ideas about the varieties of regression.
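As a quick recap, the pairings described in this article can be written down as a simple lookup. This is only a summary of the text above in code form; the names `REGRESSION_BY_OUTCOME` and `suggest_regression` are made up, and a real choice of model depends on more than the type of dependent variable.

```python
# Type of dependent variable -> kinds of regression named in this article
REGRESSION_BY_OUTCOME = {
    "continuous": ["ordinary least squares (OLS) regression"],
    "count": ["Poisson regression", "negative binomial regression"],
    "count with many zeros": [
        "zero-inflated Poisson regression",
        "zero-inflated negative binomial regression",
    ],
    "dichotomous": ["binary logistic regression"],
    "nominal": ["multinomial logistic regression"],
    "ordinal": ["ordinal logistic regression"],
    "time to event": ["Cox proportional hazards regression"],
}


def suggest_regression(outcome_type):
    """Return the regression methods listed for a given type of dependent variable."""
    key = outcome_type.strip().lower()
    if key not in REGRESSION_BY_OUTCOME:
        raise ValueError("Unknown outcome type: " + outcome_type)
    return REGRESSION_BY_OUTCOME[key]


print(suggest_regression("count"))
```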
Published by Peter Flom
I am a statistician, working with a wide variety of clients, mostly researchers in psychology, education, medicine, social sciences and other fields. I also have given talks and written articles on learning...