Introduction to Correlation and Regression Analysis
In this section we will first discuss correlation analysis, which is used to quantify the association between two continuous variables (e.g., between an independent and a dependent variable or between two independent variables). Regression analysis is a related technique to assess the relationship between an outcome variable and one or more risk factors or confounding variables. The outcome variable is also called the response or dependent variable and the risk factors and confounders are called the predictors, or explanatory or independent variables. In regression analysis, the dependent variable is denoted "y" and the independent variables are denoted by "x".
[NOTE: The term "predictor" can be misleading if it is interpreted as the ability to predict even beyond the limits of the data. Also, the term "explanatory variable" might give an impression of a causal effect in a situation in which inferences should be limited to identifying associations. The terms "independent" and "dependent" variable are less subject to these interpretations as they do not strongly imply cause and effect.]
In correlation analysis, we estimate a sample correlation coefficient, more specifically the Pearson Product Moment correlation coefficient. The sample correlation coefficient, denoted r,
ranges between -1 and +1 and quantifies the direction and strength of the linear association between the two variables. The correlation between two variables can be positive (i.e., higher levels of one variable are associated with higher levels of the other) or negative (i.e., higher levels of one variable are associated with lower levels of the other).
The sign of the correlation coefficient indicates the direction of the association. The magnitude of the correlation coefficient indicates the strength of the association.
For example, a correlation of r = 0.9 suggests a strong, positive association between two variables, whereas a correlation of r = -0.2 suggests a weak, negative association. A correlation close to zero suggests no linear association between two continuous variables.
LISA: [I find this description confusing. You say that the correlation coefficient is a measure of the "strength of association", but if you think about it, isn't the slope a better measure of association? We use risk ratios and odds ratios to quantify the strength of association, i.e., when an exposure is present, how many times more likely the outcome is. The analogous quantity in correlation is the slope, i.e., for a given increment in the independent variable, how much is the dependent variable going to increase? And "r" (or perhaps better R-squared) is a measure of how much of the variability in the dependent variable can be accounted for by differences in the independent variable. The analogous measure for a dichotomous exposure and a dichotomous outcome would be the attributable proportion, i.e., the proportion of Y that can be attributed to the presence of the exposure.]
It is important to note that there may be a non-linear association between two continuous variables, but computation of a correlation coefficient does not detect this. Therefore, it is always important to evaluate the data carefully before computing a correlation coefficient. Graphical displays are particularly useful to explore associations between variables.
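The caveat above can be demonstrated numerically. The sketch below (using illustrative data, not data from this module) computes Pearson's r from first principles; note that a perfectly linear relationship yields r = 1, while a strong but symmetric non-linear relationship (y = x²) yields r = 0 even though the variables are clearly related.

```python
from math import sqrt

def pearson_r(x, y):
    """Sample Pearson correlation coefficient computed from first principles.

    The (n - 1) factors in the sample covariance and variances cancel in r,
    so the sums of deviations are used directly.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance and variances are built from the same deviations from the means
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / sqrt(var_x * var_y)

x = [-3, -2, -1, 0, 1, 2, 3]
# A perfectly linear relationship: r = 1.0
print(pearson_r(x, [2 * xi + 1 for xi in x]))
# A symmetric non-linear relationship (y = x^2): r = 0.0, despite a clear association
print(pearson_r(x, [xi ** 2 for xi in x]))
```

This is exactly why a scatter plot should be inspected before r is computed: the second dataset would look like a parabola, not a cloud of unrelated points.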
The figure below shows four hypothetical scenarios in which one continuous variable is plotted along the X-axis and the other along the Y-axis.
- Scenario 1 depicts a strong positive association (r=0.9), similar to what we might see for the correlation between infant birth weight and birth length.
- Scenario 2 depicts a weaker association (r=0.2) that we might expect to see between age and body mass index (which tends to increase with age).
- Scenario 3 might depict the lack of association (r approximately 0) between the extent of media exposure in adolescence and age at which adolescents initiate sexual activity.
- Scenario 4 might depict the strong negative association (r= -0.9) generally observed between the number of hours of aerobic exercise per week and percent body fat.
Example - Correlation of Gestational Age and Birth Weight
A small study is conducted involving 17 infants to investigate the association between gestational age at birth, measured in weeks, and birth weight, measured in grams.
We wish to estimate the association between gestational age and infant birth weight. In this example, birth weight is the dependent variable and gestational age is the independent variable. Thus y=birth weight and x=gestational age. The data are displayed in a scatter diagram in the figure below.
Each point represents an (x,y) pair (in this case the gestational age, measured in weeks, and the birth weight, measured in grams). Note that the independent variable is on the horizontal axis (or X-axis), and the dependent variable is on the vertical axis (or Y-axis). The scatter plot shows a positive or direct association between gestational age and birth weight. Infants with shorter gestational ages are more likely to be born with lower weights and infants with longer gestational ages are more likely to be born with higher weights.
The formula for the sample correlation coefficient is

r = \frac{\mathrm{Cov}(x,y)}{\sqrt{s_x^2 \, s_y^2}}

where Cov(x,y) is the covariance of x and y, defined as

\mathrm{Cov}(x,y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}

and s_x^2 and s_y^2 are the sample variances of x and y, defined as

s_x^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} \quad \text{and} \quad s_y^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1}

The variances of x and y measure the variability of the x scores and y scores around their respective sample means (\bar{x} and \bar{y}, considered separately). The covariance measures the variability of the (x,y) pairs around the mean of x and mean of y, considered simultaneously.
To compute the sample correlation coefficient, we need to compute the variance of gestational age, the variance of birth weight and also the covariance of gestational age and birth weight.
We first summarize the gestational age data. The mean gestational age is:
To compute the variance of gestational age, we need to sum the squared deviations (or differences) between each observed gestational age and the mean gestational age. The computations are summarized below.
The variance of gestational age is:
Next, we summarize the birth weight data. The mean birth weight is:
The variance of birth weight is computed just as we did for gestational age as shown in the table below.
The variance of birth weight is:
Next, we compute the covariance. To compute the covariance of gestational age and birth weight, we need to multiply the deviation from the mean gestational age by the deviation from the mean birth weight for each participant (i.e., (x_i - \bar{x})(y_i - \bar{y})).
The computations are summarized below. Notice that we simply copy the deviations from the mean gestational age and birth weight from the two tables above into the table below and multiply.
The covariance of gestational age and birth weight is:
We now compute the sample correlation coefficient:
Not surprisingly, the sample correlation coefficient indicates a strong positive correlation.
As we noted, sample correlation coefficients range from -1 to +1. In practice, meaningful correlations (i.e., correlations that are clinically or practically important) can be as small as 0.4 (or -0.4) for positive (or negative) associations. There are also statistical tests to determine whether an observed correlation is statistically significant or not (i.e., statistically significantly different from zero). Procedures to test whether an observed sample correlation is suggestive of a statistically significant correlation are described in detail in Kleinbaum, Kupper and Muller.1
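One common form of the significance test referenced above (described in texts such as Kleinbaum, Kupper and Muller) converts r into a t statistic with n - 2 degrees of freedom. A minimal sketch, using a hypothetical observed correlation rather than a result from this module:

```python
from math import sqrt

def correlation_t_statistic(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

# Hypothetical example: r = 0.82 observed in a sample of n = 17 pairs
t = correlation_t_statistic(0.82, 17)
print(round(t, 2))
# With 15 degrees of freedom, the two-sided critical value of t at
# alpha = 0.05 is about 2.131, so a t statistic this large would be
# statistically significantly different from zero.
```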
Multivariate and Bivariate Analysis
Introduction to Multivariate and Bivariate Analysis
When conducting research, analysts attempt to measure relationships among variables in order to draw conclusions about cause and effect. For example, in order to test whether a drug can reduce appetite, researchers give participants a dose of the drug before each meal. The independent variable (or predictor) is the taking of the drug and appetite is the dependent variable (or outcome). The independent variable is the variable you manipulate in the study. The dependent variable is the variable you measure (appetite, for example).
One group takes the drug before each meal and a control group does not take drugs at all. After several days, the researchers note that the drug-takers have voluntarily reduced their caloric intake by 30%. The researchers conclude that regular consumption of the drug reduces appetite. This type of study is called a univariate study because it examines the effect of the independent variable (drug use) on a single dependent variable (appetite).
Bivariate studies are different from univariate studies because they allow the researcher to analyze the relationship between two variables (often denoted as X, Y) in order to test simple hypotheses of association and causality. For example, if you wanted to know whether there is a relationship between the number of students in an engineering classroom (independent variable) and their grades in that subject (dependent variable), you would use bivariate analysis since it measures two elements based on the observation of data.
There are essentially four steps to conducting bivariate analysis as follows:
Step 1: Define the nature of the relationship
For example, if you were testing the relationship of class size and grades in an engineering class, then you would report the following: “The data show a relationship between class size and grades. Smaller class sizes (20 or fewer students) have a grade point average of 4.4 whereas larger class sizes (21-100 students) have a grade point average of 3.1. This demonstrates that students in smaller classes earn grades that are roughly 40% higher than those in large classes.”
Step 2: Identify the type and direction of the relationship
In order to determine the type and direction of the relationship you must determine which of the four levels of measurement you will use for your data:
- Nominal, which is non-numerical and places an object within a category (ex. male or female)
- Ordinal, which ranks data from lowest to highest
- Interval, which indicates the distance of one object to the next
- Ratio, which contains all of the above, but also has an absolute zero point

In the example above, the variable number of students is ordinal and the grade point average is also ordinal, so it is a correlative relationship.
Correlation describes the relationship or degree of association that exists between variables. We can conclude that small class size has had a positive effect on grades. A decrease in the number of students in a class corresponded to an increase in grades. This is a negative correlation. If an increase in the number of students had led to an increase in grades, then that would have been a positive correlation.
Step 3: Determine if the relationship is statistically significant
Statistical significance is used to determine whether the results are significant enough to truly make a connection. In other words, do we think the results occurred by chance, or do we truly expect to see the same results with another similar study population? In many types of studies, a relationship is considered significant (the association seen in this sample is not occurring randomly or by chance) if it has a significance level of .05. This means that only 5 times in 100 would the observed pattern for these two variables occur by chance.
Step 4: Identify the strength of the relationship
To measure the strength of a bivariate relationship, researchers choose a standard formula depending upon the type of data used. For example, Pearson's correlation coefficient measures the strength of the linear relationship between X and Y. The relationship between two ordinal variables can be measured with a formula known as Spearman's rho. Spearman's rho calculates a correlation coefficient on rankings rather than on the actual data. In our example, we looked at how smaller class sizes led to higher grade point averages. Both the number of students in a class and the grades can be ranked.
Spearman’s rho will vary between –1 and +1, with –1 being a perfect negative correlation (if you rank high on X, you will rank low on Y), +1 being a perfect positive correlation (if you rank high on X, you will rank high on Y), and 0 being no relationship between the two (rank on X tells us nothing about rank on Y).
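Spearman's rho can be sketched as Pearson's correlation applied to the ranks of the data. The class sizes and grade point averages below are hypothetical (the worked example in this section does not list individual observations), and the simple ranking helper assumes no tied values.

```python
from math import sqrt

def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes no tied values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx)
    sy = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(sx * sy)

# Hypothetical class sizes and grade point averages: GPA falls as class
# size rises, so the rankings are perfectly inverted and rho = -1.
sizes = [12, 18, 25, 40, 80]
gpas = [4.4, 4.1, 3.6, 3.3, 3.1]
print(spearman_rho(sizes, gpas))   # -1.0
```

Because rho works on ranks, it captures any monotone relationship, not just a linear one, which is why it suits ordinal data.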
There are several other formulas that can be used to measure association, depending on the type of data used, including Kendall's Tau, Kendall's Tau-B, Tau-C, the Goodman-Kruskal Gamma, Chi-square (χ²), Lambda, the Mann-Whitney U-test, and the Wilcoxon Signed-Rank Test.
Multivariate studies are similar to bivariate studies, but multivariate studies examine more than two variables at once. For example, if an advertiser wanted to examine the effectiveness of three different banner ads on a popular website, the advertiser could measure each ad's click rate for both men and women. Researchers could then use multivariate statistical analysis to examine the relationships between all of the variables.
Multivariate analytical techniques represent a variety of mathematical models used to measure and quantify outcomes, taking into account important factors that can influence this relationship. There are several multivariate analytical techniques that one can use to examine the relationship among variables. The most popular is multiple regression analysis, which helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied while the other independent variables are held fixed. Other techniques include factor analysis, path analysis and multivariate analysis of variance (MANOVA).
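A minimal sketch of multiple regression: the coefficients are found by solving the normal equations (XᵀX)β = Xᵀy, where X is a design matrix with an intercept column and one column per predictor. The data below are synthetic, generated without noise from known coefficients so the fit can be checked; real data would of course only be approximated by the fitted model.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# Synthetic data generated from y = 1 + 2*x1 + 3*x2 (no noise), so the
# fitted coefficients should recover [intercept, slope1, slope2] = [1, 2, 3].
x1 = [0, 1, 2, 3, 4, 5]
x2 = [1, 0, 2, 1, 3, 2]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]

# Design matrix with an intercept column, then the normal equations (X'X) beta = X'y
X = [[1.0, a, b] for a, b in zip(x1, x2)]
XtX = [[sum(row[p] * row[q] for row in X) for q in range(3)] for p in range(3)]
Xty = [sum(row[p] * yi for row, yi in zip(X, y)) for p in range(3)]

beta = solve(XtX, Xty)
print([round(c, 6) for c in beta])   # [1.0, 2.0, 3.0]
```

Each fitted slope estimates the change in y per unit change in that predictor while the other predictor is held fixed, which is exactly the interpretation described above.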
When conducting research, analysts will choose among the univariate, bivariate or multivariate analytical techniques based on their particular study purpose and proposed hypothesis. Each method has its own advantages and uses specific statistical tools to draw conclusions and identify relationships between variables.