Notes on Bivariate Analysis

 

With bivariate analysis, we are testing hypotheses of "association" and causality.  In its simplest form, association refers to the extent to which it becomes easier to know/predict a value for the dependent variable if we know a case's value on the independent variable.

 

A measure of association helps us to quantify this relationship.  Measures of association describe how much better this prediction becomes with knowledge of the IV, or how strongly an independent variable relates to the dependent variable.  We have already discussed this in the more abstract terms of "correlation".  A measure of association often ranges between –1 and 1, where the sign represents the "direction" of the relationship (negative or positive) and the distance from 0 represents the degree or extent of association – the farther the number is from 0, the stronger or "more perfect" the relationship between the IV and DV.

 

Statistical significance relates to the generalizability of the relationship AND, more importantly, the likelihood that the observed relationship occurred BY CHANCE.  In political science, we typically consider a relationship significant (the association seen in this sample is not occurring randomly or by chance) if it has a significance level of .05 – only 5 times in 100 would the pattern of observations we have measured for these two variables occur by chance.  When n (the total number of cases in a sample) is large, significance levels can approach .001 (only 1 time in 1,000 would the observed association occur by chance).

 

Measures of association and statistical significance that are used vary by the level of measurement of the variables analyzed.

 

For categorical and some limited-scale ordinal variables, most analysis begins with a crosstab *see page 1 of the computer printout.  The crosstab or contingency table shows how many cases fall into each cell as well as the marginals for the columns and rows.  These crosstabs show the relationship between the two variables – if people have this value for Variable X, what value are they more likely to have for Variable Y?  We quantify this relationship with the following statistics.
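To see what a crosstab looks like outside the printout, here is a minimal sketch in Python using pandas (the party/opinion data below are invented purely for illustration):

    import pandas as pd

    # Invented example data: party identification (IV) and a yes/no policy opinion (DV).
    df = pd.DataFrame({
        "party":   ["Dem", "Dem", "Rep", "Rep", "Dem", "Rep", "Dem", "Rep"],
        "opinion": ["Yes", "Yes", "No",  "No",  "No",  "Yes", "Yes", "No"],
    })

    # margins=True adds the row and column marginals to the table.
    print(pd.crosstab(df["opinion"], df["party"], margins=True))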

 

For Nominal Variables:

 

The measure of association is lambda.  Literally, lambda is the extent to which guessing the values of the dependent variable is improved by knowing which category a case falls in on the IV.

The formula for Lambda is

            Lambda = (original errors – errors after knowing the IV) / original errors

where "original errors" is the number of prediction errors made by always guessing the modal category of the DV, and "errors after knowing the IV" is the number made by guessing the modal DV category within each category of the IV.

This gives you the proportion by which your prediction improves once you know values on the IV.  Lambda ranges from 0 to 1.  (Remember, there is no order to nominal variables; therefore, you cannot speak of a "direction" of association.)  The higher the number, the stronger the relationship between the two variables.
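As a check on the idea, here is a rough sketch of the lambda calculation in Python (the cell counts are invented; rows are DV categories, columns are IV categories):

    # Hypothetical crosstab: rows = DV categories, columns = IV categories.
    table = [
        [30, 10],
        [10, 30],
    ]
    n = sum(sum(row) for row in table)

    # Original errors: guess the modal DV category for every case.
    e1 = n - max(sum(row) for row in table)

    # Errors after knowing the IV: guess the modal DV category within each column.
    e2 = sum(sum(col) - max(col) for col in zip(*table))

    lam = (e1 - e2) / e1
    print(lam)   # 0.5 for this table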

The measure of statistical significance for nominal variables (and limited-scale ordinal variables) is chi-square.  In fact, chi-square can measure the statistical significance of ANY crosstab.  It tells us how different the values in the cells of a crosstab are from the expected values (the values predicted if no real relationship existed between the two variables – the marginals are used to calculate these expected values).  It is based on three factors: 1) the distribution of cases among the cells (which shows the extent to which differences are observed); 2) the number of cells (degrees of freedom); and 3) the size of the sample (n).

 

To determine chi-square, work through the steps below for each cell in the table.  (The steps are illustrated with a 2 x 2 table whose cells are labeled A through D.)

            A    B
            C    D

 

1. Determine the Expected Value for each cell:

            Expected Value = (Column Marginal)(Row Marginal) / n

 

2. Determine the Observed Value for each Cell – Actual number of cases in that cell

 

3. Determine the Difference between Expected Value and Observed Values for each cell.

 

4.  Square these values for each cell.

 

5.  Divide the squared number for each cell by the expected value for that cell.

 

6.  Sum the numbers found in step five across all the cells.

 

7.  This sum is the Chi-square.

 

8.  Determine the degrees of freedom for the cross tab.

            df = (# of rows – 1) (# of columns – 1)

 

9.  Using the appendix in your text, the degrees of freedom, and the chi-square, determine the level of statistical significance for this crosstab.  Could this particular crosstab – this pattern of relationships – have occurred by chance?
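Here is a sketch of steps 1 through 8 in Python, using an invented 2 x 2 table.  scipy's chi2_contingency (with the continuity correction turned off) reproduces the hand calculation and also reports the significance level:

    from scipy.stats import chi2_contingency

    # Invented observed counts for a 2 x 2 crosstab (cells A, B, C, D).
    observed = [[20, 30],
                [30, 20]]

    n = sum(sum(row) for row in observed)
    row_marg = [sum(row) for row in observed]
    col_marg = [sum(col) for col in zip(*observed)]

    chi_square = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_marg[r] * col_marg[c] / n      # step 1
            diff = observed[r][c] - expected              # steps 2-3
            chi_square += diff ** 2 / expected            # steps 4-6
    df = (2 - 1) * (2 - 1)                                # step 8
    print(chi_square, df)                                 # 4.0 with 1 df

    # The same result from scipy, along with the p-value:
    chi2, p, dof, exp = chi2_contingency(observed, correction=False)
    print(chi2, p, dof)                                   # p is about .046, under .05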

 

 

For ordinal variables:

 

The appropriate measures of association all attempt to measure how the values of ordered variables pair up across the sample of cases: for instance, how many times high values are associated with high values, and how many times they are associated with low values.  They each use concordant and discordant pairs to create a value between –1 and 1.  0 indicates no relationship between how the values for the cases pair up.  The closer to –1, the stronger the negative (inverse) relationship; the closer to 1, the more "perfect" the positive relationship.

There are several measures of association which measure ordinal variables' relationships.

            Somers' d, tau-b, tau-c, and gamma are the most common.  All are slight variations on the idea described in layman's terms above.  While Somers' d, tau-b, and tau-c will usually have very nearly the same value, gamma will usually appear to show a slightly higher (stronger) relationship.

            Measures of statistical significance – beyond our scope – will also be reported when the computer calculates these values.  Use these values (often called p-values) to see whether a given measure of association is significant.
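For instance, scipy can compute tau-b and Somers' d directly (gamma is not built in; somersd requires SciPy 1.7 or later).  A sketch with invented ordinal codes (1 = low, 3 = high):

    from scipy.stats import kendalltau, somersd

    # Invented ordinal codes for an IV (x) and a DV (y).
    x = [1, 1, 2, 2, 3, 3, 1, 2, 3, 3]
    y = [1, 2, 2, 3, 3, 3, 1, 1, 2, 3]

    tau_b, p = kendalltau(x, y)       # tau-b and its p-value
    print(tau_b, p)

    res = somersd(x, y)               # Somers' d, treating y as dependent
    print(res.statistic, res.pvalue)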

 

For interval variables:

 

One statistic that is often used with interval data is the t-test.  This statistic tests whether two subgroups of a sample have a large enough difference between their means that the difference actually exists (is statistically significant).  For instance, suppose you wanted to determine whether or not women and men had different levels of income – do men make more than women?  You are really testing how knowing the independent variable (gender) can help you predict income.  You can do the t-test knowing the mean of each subgroup of the independent variable, the variance within each subgroup (the standard deviation squared), and the size of each subgroup.  This is most often used when the DV is interval and the IV is nominal.

 

Determine the null hypothesis and the hypothesis you are testing.  If your hypothesis is directional, you should use a 1-sided t-test.  If it is not directional – if it just postulates "a relationship" – then use a 2-sided t-test.

 

The formula is

            t = (mean1 – mean2) / sqrt( s1^2/n1 + s2^2/n2 )

where mean1 and mean2 are the subgroup means, s1^2 and s2^2 are the subgroup variances, and n1 and n2 are the subgroup sizes.

 

 

Then, calculate the degrees of freedom.  df=n1 + n2 - 2

 

Then, using a table of t values (in the appendix of your text), the degrees of freedom, and your calculated t, determine whether the difference between the means is statistically significant.
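If you want to check your hand calculation, scipy's ttest_ind runs the same pooled-variance test.  A sketch with invented incomes (in thousands of dollars):

    from scipy.stats import ttest_ind

    # Invented incomes for two subgroups of a sample.
    men   = [42, 55, 38, 61, 47, 50]
    women = [39, 44, 35, 52, 41, 43]

    # Two-sided test (the default pools the variances, matching df = n1 + n2 - 2).
    t, p = ttest_ind(men, women)
    print(t, p)

    # One-sided test for the directional hypothesis "men earn more than women".
    t, p = ttest_ind(men, women, alternative="greater")
    print(t, p)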

 

Next: regression and correlation analysis, which are used to determine the association between 2 or more interval-level variables.

 

Regression analysis attempts to determine a line (formula) whereby our knowledge of a value on the independent variable (X) will help us predict the value of the dependent variable (Y).

This type of analysis will ONLY WORK for variables that have a linear relationship.

Regression analysis draws the line through the scatterplot that minimizes the combined (squared vertical) distances between the line and all of the points on the plot.  It produces an equation that looks like the equation for a line on a two-dimensional coordinate plane.

 


Y = a + bX

 

Y = the predicted value of the dependent variable

X = the value of the independent variable

a = intercept

b = slope of the line.

 

In the printouts, you will often see the b given – it is interpreted like this:  for every one-unit change in X, you may expect a change of b units in Y.

 

You can then actually plug in a case's value on the IV (X) and "predict" what value it will have on the DV (Y).
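A sketch of this in Python, with invented education/income data.  scipy's linregress returns a, b, r, and the p-value all at once (r and r-square are discussed below):

    from scipy.stats import linregress

    # Invented data: years of education (X) and income in thousands (Y).
    x = [8, 10, 12, 12, 14, 16, 16, 18]
    y = [22, 30, 34, 38, 41, 48, 52, 60]

    fit = linregress(x, y)
    print(fit.intercept, fit.slope)    # a and b in Y = a + bX
    print(fit.rvalue, fit.rvalue**2)   # r and r-square
    print(fit.pvalue)                  # significance of the relationship

    # "Plug in" a value of X to predict Y:
    print(fit.intercept + fit.slope * 15)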

 

The regression line allows us to predict values for the Dependent variable, but it doesn’t give you any real idea of how good your line is for predicting the values, or how close the association is between these two variables. 

 

From this line, we can determine r (the correlation coefficient) and r2 (r-square) – the proportion of the variance in Y that is attributable to X.

 

r (the correlation coefficient) tells us how good our line is – it is a measure of how close all the points on the scatterplot are to the line.  It really tells us how much of the variance in the dependent variable is explained by the regression line.

 

r gives us a measure between –1 and 1, indicating the direction of correlation (negative or positive) and the strength of correlation – Closer to –1 or 1 indicates a stronger relationship between the two variables.

 

r2 gives us a measure of the total variance in the DV that is explained by the IV (or by several independent variables).  An r-square of .15 would then be interpreted as meaning that 15 percent of the variance in the DV is explained by the line that we have drawn.  It is often used as a measure of the "goodness of fit" of the model.

 

We must then determine whether the correlation coefficient (r) is statistically significant.  We calculate the df – in this case, df = n – 2.  We then look at Table A.5 in the back of the book to see at what level r is significant.

 

Additionally, the analysis can be extended to examine how multiple independent variables together help to explain the dependent variable.  This is called multiple regression.
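A bare-bones sketch of what multiple regression does – fitting Y = a + b1*X1 + b2*X2 by least squares with numpy.  The data are invented, and a statistics package would also report significance levels and r-square:

    import numpy as np

    # Invented data: education and age (IVs) and income (DV).
    educ = np.array([8, 10, 12, 12, 14, 16, 16, 18])
    age  = np.array([30, 45, 28, 50, 35, 40, 55, 48])
    y    = np.array([22, 30, 34, 38, 41, 48, 52, 60])

    # Columns: constant, educ, age.
    X = np.column_stack([np.ones_like(educ), educ, age])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(coefs)   # a, b1, b2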