Term
Associations between Variables |
|
Definition
• Positively associated if increased values of one variable tend to occur with increased values of the other
• Negatively associated if increased values of one variable tend to occur with decreased values of the other |
|
|
Term
Response Variable |
Definition
A response variable (plotted on the Y-axis) measures an outcome of interest. Also called the dependent variable. |
|
|
Term
Explanatory Variable |
Definition
An explanatory variable (plotted on the X-axis) explains changes in the response variable. Also called the independent variable.
• Explanatory does not mean causal: there are often several possible explanatory variables
• Example: Study of heart disease & smoking
  • Response: death due to heart disease
  • Explanatory: number of cigarettes smoked per day
• Example: City dataset
  • Response: mortality
  • Explanatory: education |
|
|
Term
|
Definition
Some associations are not just positive or negative, but also appear to be linear
A perfect linear relationship is Y = a + bX |
|
|
Term
|
Definition
Can examine relationships of categorical variables this way. |
|
|
Term
Describing/ Examining a Scatterplot |
|
Definition
Look for the overall pattern and for striking deviations from that pattern.
You can describe the overall pattern of a scatterplot by the form, direction, and strength of the relationship.
An important kind of deviation is an outlier |
|
|
Term
Correlation |
Definition
Correlation measures the direction and strength of the linear relationship between two quantitative variables. It is usually written as r.
r is positive when there is a positive association and negative when there is a negative association.
Correlation makes no use of the distinction between explanatory and response variables: it makes no difference which variable you call x and which you call y when calculating it.
Correlation requires that both variables be quantitative. This is a key difference between correlation and association, because association can also apply to categorical variables.
r is always a number between -1 and 1. Values near 0 indicate a very weak linear relationship. The extreme values -1 and 1 occur only when the points in a scatterplot lie exactly along a straight line.
Correlation describes ONLY linear relationships (a curved relationship, no matter how strong, cannot be described by a correlation).
r is not resistant to outliers. |
|
|
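The properties of r listed above can be checked numerically. The following is a minimal sketch, assuming the usual Pearson formula for r (the card itself does not give a formula); the function name `correlation` and all data values are hypothetical and chosen only for illustration.

```python
import math

def correlation(x, y):
    """Pearson correlation r between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products of deviations from the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: points lying exactly on the line y = 1 + 2x.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

r_xy = correlation(x, y)                  # perfect positive linear relationship
r_yx = correlation(y, x)                  # swapping x and y gives the same r
r_neg = correlation(x, [-v for v in y])   # perfect negative linear relationship

# A strong but curved relationship: y = x^2 on a symmetric range of x.
x_curve = [-2, -1, 0, 1, 2]
y_curve = [4, 1, 0, 1, 4]
r_curve = correlation(x_curve, y_curve)   # r is 0 despite the exact pattern

# One outlier added to the perfectly linear data changes r drastically.
r_outlier = correlation(x + [6], y + [-20])

print(r_xy, r_yx, r_neg, r_curve, r_outlier)
```

In this sketch the perfectly linear data give r = 1 (or -1 when y is negated), the curved data give r = 0, and a single outlier is enough to turn r negative, which illustrates why r is not resistant.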
Term
|
Definition
If our X and Y variables do show a linear relationship, we can calculate a best-fit line, Y = a + bX, in addition to the correlation. The values a and b together are called the regression coefficients:
• a = intercept
• b = slope
How do we determine the “best” line (i.e., the best regression coefficients a and b)?
• Square the “Y-residuals” (so that positive and negative residuals do not cancel) and add them up: $\text{total residuals} = \sum_i \left(y_i - (a + b \cdot x_i)\right)^2$ |
|
|
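To make the total-residuals formula concrete, here is a small sketch that evaluates it for two candidate lines. The function name `total_squared_residuals` and the data are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
def total_squared_residuals(x, y, a, b):
    """Sum of squared Y-residuals for the candidate line y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data set.
x = [1, 2, 3, 4]
y = [2, 4, 5, 8]

# Candidate line 1: y = 0 + 2x. Residuals are 0, 0, -1, 0.
line_1 = total_squared_residuals(x, y, a=0.0, b=2.0)

# Candidate line 2: y = 1 + 1x. Residuals are 0, 1, 1, 3.
line_2 = total_squared_residuals(x, y, a=1.0, b=1.0)

print(line_1, line_2)  # the smaller total marks the better-fitting line
```

Here the first line has the smaller total, so by this criterion it fits the data better than the second.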
Term
|
Definition
The line with the smallest total of squared residuals.
It gives the best values for the slope and intercept. |
|
|
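The line with the smallest total of squared residuals can be computed directly. The sketch below assumes the standard closed-form least-squares solution for a single explanatory variable (slope = sum of products of deviations divided by sum of squared x deviations, intercept = mean of y minus slope times mean of x); the card does not state this formula, and the function names and data are hypothetical.

```python
def least_squares_line(x, y):
    """Return (a, b), the intercept and slope minimizing the squared Y-residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx                 # slope
    a = mean_y - b * mean_x       # intercept
    return a, b

def total_squared_residuals(x, y, a, b):
    """Sum of squared Y-residuals for the line y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data set.
x = [1, 2, 3, 4]
y = [2, 4, 5, 8]

a, b = least_squares_line(x, y)
best = total_squared_residuals(x, y, a, b)

# Any other line, such as y = 2x, should have a total at least as large.
other = total_squared_residuals(x, y, a=0.0, b=2.0)

print(a, b, best, other)
```

For these data the fitted slope is about 1.9 and the intercept about 0, and the fitted line's total is smaller than that of the nearby line y = 2x.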