Term
Basic Context for Data (5-6 questions to ask) |
|
Definition
Who, What, When, Where, Why and How? |
|
|
Term
Categorical Variable |
|
Definition
When data answer questions but do not represent a quantity that can be summed or otherwise manipulated arithmetically. Can be represented by a number used only as a label |
|
|
Term
Quantitative Variable |
|
Definition
A variable measured in units, representing an amount of something or a count of occurrences |
|
|
Term
Identifier Variable |
|
Definition
A number assigned to each individual case for sorting purposes |
|
|
Term
Frequency Table/ Relative Frequency Table |
|
Definition
A table listing the categories with their total counts (frequency table), or one that gives each count as a proportion or percent of the whole (relative frequency table) |
|
|
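As a minimal sketch of the two tables described above, the following counts each category and then converts the counts to percents. The function name `frequency_tables` and the color data are made up for illustration.

```python
from collections import Counter

def frequency_tables(values):
    """Return (counts, percents) for a list of categorical values."""
    counts = Counter(values)
    total = sum(counts.values())
    # Relative frequency: each count as a percent of all cases.
    percents = {category: 100 * n / total for category, n in counts.items()}
    return dict(counts), percents

# Invented example data
colors = ["red", "blue", "red", "green", "red", "blue", "red", "green"]
counts, percents = frequency_tables(colors)
```

Here `counts` is the frequency table and `percents` is the relative frequency table; the percents always add up to 100.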
Term
|
Definition
Displays distribution of a categorical variable. NOT a quantitative variable |
|
|
Term
Contingency Table |
|
Definition
A table which represents categories and breaks down the totals into their respective parts. The margins represent the totals |
|
|
Term
Area Principle |
|
Definition
When graphing data, make sure each category has an area proportional to its total in the group |
|
|
Term
Simpson's Paradox |
|
Definition
Unfair averaging across groups that do not share the same conditions or sizes; the combined result can point the opposite way from every individual group |
|
|
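A small numeric sketch may make the unfair averaging above concrete. The counts are invented for illustration: option X does better than option Y under both conditions, yet looks worse overall because most of its attempts were under the hard condition.

```python
# (successes, attempts) under each condition; all numbers are invented
x = {"easy": (9, 10), "hard": (30, 100)}
y = {"easy": (80, 100), "hard": (2, 10)}

def rate(successes, attempts):
    return successes / attempts

def overall(groups):
    """Pool all conditions together, ignoring how unequal the group sizes are."""
    total_successes = sum(s for s, _ in groups.values())
    total_attempts = sum(n for _, n in groups.values())
    return rate(total_successes, total_attempts)

for condition in ("easy", "hard"):
    print(condition, rate(*x[condition]), rate(*y[condition]))
print("overall", overall(x), overall(y))
```

Comparing within each condition is the fair comparison; the pooled rates mix two very different group sizes.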
Term
Histogram |
|
Definition
Only for quantitative data. Looks like a bar graph (which is only for categorical data), except that there is no space between bars unless there is a gap in the data. Good for illustrating a distribution |
|
|
Term
Stem and Leaf Displays (and Dotplots) |
|
Definition
Write the leading digit(s) of each value (the stem) on one side of the table, then list one following digit (the leaf) for each case in that range. Dotplots replace the digits with dots |
|
|
Term
Three things to mention when describing distribution |
|
Definition
Shape: How many modes? Is it symmetric or skewed? Any outliers? Center: Median or mean. Spread: Standard deviation or interquartile range |
|
|
Term
Unimodal /Bimodal/ Multimodal |
|
Definition
With one hump/ two humps/ more than two humps |
|
|
Term
Uniform |
|
Definition
A distribution that is roughly flat: the values are spread fairly evenly, with no clear mode or trend |
|
|
Term
Skewed |
|
Definition
When one tail (a thinner end of the distribution) stretches out farther than the other, the distribution is said to be this, in the direction of the longer tail |
|
|
Term
Interquartile Range (IQR) |
|
Definition
The upper quartile (75th percentile)- lower quartile (25th percentile) |
|
|
Term
Variance |
|
Definition
The sum of the squared differences between each y value and the mean, divided by (n-1)
It is the quantity you take the square root of to find the standard deviation |
|
|
Term
Standard Deviation |
|
Definition
The square root of: (sum of the squared differences between each y value and the mean) / (n-1) |
|
|
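The two formulas above can be sketched directly in code. This is a minimal illustration with made-up data; the function names are hypothetical, and Python's built-in `statistics.variance` and `statistics.stdev` compute the same quantities.

```python
import math

def sample_variance(ys):
    """Sum of squared deviations from the mean, divided by (n - 1)."""
    n = len(ys)
    mean = sum(ys) / n
    return sum((y - mean) ** 2 for y in ys) / (n - 1)

def sample_sd(ys):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(sample_variance(ys))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # invented values; the mean is 5
variance = sample_variance(data)  # squared deviations sum to 32, so 32 / 7
sd = sample_sd(data)
```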
Term
Boxplot |
|
Definition
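Draw a box from the lower quartile to the upper quartile, with a line at the median. Extend whiskers to the most extreme data values within 1.5 times the IQR of the quartiles, and plot any outliers beyond that individually |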
|
|
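The numbers behind such a plot can be sketched as follows. This is an illustration under one assumption worth flagging: quartiles are computed with `statistics.quantiles(..., method="inclusive")`, and textbooks and calculators use slightly different quartile rules, so results may differ a little from hand calculations. The function name and data are invented.

```python
import statistics

def boxplot_summary(values):
    """Quartiles, IQR, whisker ends, and outliers by the 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        # Whiskers stop at the most extreme values that are not outliers.
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }

summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
```

In this made-up data set the value 50 lies beyond the upper fence, so it would be plotted as a separate point rather than included in the whisker.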
Term
Z-Score (Standardized Value) |
|
Definition
(y-the mean of y)/ standard deviation. Written z(x) or z(y) |
|
|
Term
How does standardizing data change data |
|
Definition
Shape: Does not change Center: Makes the mean 0 Spread: The standard deviation becomes 1 |
|
|
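A quick check of the claims above, as a sketch with invented data: after standardizing, the mean is 0 and the standard deviation is 1, while the order and relative spacing of the values (the shape) are unchanged.

```python
import statistics

def standardize(ys):
    """Convert each value to a z-score: (y - mean) / standard deviation."""
    mean = statistics.mean(ys)
    sd = statistics.stdev(ys)  # sample standard deviation, divides by n - 1
    return [(y - mean) / sd for y in ys]

z = standardize([2, 4, 4, 4, 5, 5, 7, 9])
print(statistics.mean(z), statistics.stdev(z))
```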
Term
|
Definition
If the shape of the data's distribution is unimodal and symmetric, you can apply the Normal model. Make a picture to check |
|
|
Term
68-95-99.7 Rule |
|
Definition
In a Normal model, about 68% of the data lie within 1 standard deviation of the mean (0 for z-scores), about 95% within 2, and about 99.7% within 3 |
|
|
Term
Finding Normal Percentiles |
|
Definition
Calculate the z-score, then find the row for its first two digits (ones and tenths) down the left side of the table and the column for the hundredths digit across the top; the entry where they meet is the corresponding Normal percentile |
|
|
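As an alternative to the printed table, the same lookup can be sketched with Python's `statistics.NormalDist`, whose `cdf` method returns the proportion of a Normal model below a given value. The helper names and the example mean of 500 and standard deviation of 100 are assumptions for illustration only.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def normal_percentile(y, mean, sd):
    """Proportion of a Normal model that falls below y."""
    z = (y - mean) / sd  # the z-score you would look up in the table
    return standard_normal.cdf(z)

def share_within(k):
    """Proportion of a Normal model within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

# Hypothetical scores with mean 500 and standard deviation 100
p = normal_percentile(600, mean=500, sd=100)
print(round(p, 4), [round(share_within(k), 3) for k in (1, 2, 3)])
```

A score one standard deviation above the mean sits at roughly the 84th percentile, and the three `share_within` values reproduce the familiar 68, 95, and 99.7 percent figures.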
Term
Normal Probability Plot |
|
Definition
The y-axis is the variable shown on the x-axis of the corresponding histogram (e.g. mpg), and the x-axis is each data point's z-score. For nearly Normal data the points should fall along a diagonal line rising from left to right |
|
|
Term
Things to look for in Scatterplots |
|
Definition
Direction: Is it positive or negative? Form: Is it linear? Curved? Strength: How much does it scatter? Outliers: Any points that stand apart from the overall pattern |
|
|
Term
Predictor/ Explanatory Variable |
|
Definition
The variable plotted on the x-axis, which is believed to inform or predict the y value |
|
|
Term
Response Variable |
|
Definition
The y axis and variable of interest. This is the variable used in St. dev. etc... |
|
|
Term
Correlation Coefficient (r) |
|
Definition
Measures the strength of the linear association between two quantitative variables.
r = (the sum of z(x) times z(y) over all points) / (n-1) |
|
|
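The formula above translates almost word for word into code. This is a minimal sketch with made-up data; the function name `correlation` is hypothetical.

```python
import statistics

def correlation(xs, ys):
    """r = sum of z(x) * z(y) over all points, divided by (n - 1)."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    products = (
        ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
        for x, y in zip(xs, ys)
    )
    return sum(products) / (n - 1)

r_up = correlation([1, 2, 3, 4], [2, 4, 6, 8])    # perfectly linear, rising
r_down = correlation([1, 2, 3, 4], [8, 6, 4, 2])  # perfectly linear, falling
```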
Term
|
Definition
Quantitative Variables Condition: Make sure the data aren't categorical. Straight Enough Condition: It is subjective, but make sure the form isn't clearly non-linear. Outlier Condition: Make sure no outliers are present, as they can distort the correlation dramatically.
Check these conditions with a scatterplot |
|
|
Term
|
Definition
The explanation of why correlation is misleading and does not prove causation |
|
|
Term
Kendall's Tau |
|
Definition
Designed to assess how close the relationship between two variables is to being monotone. A monotone relationship is one that consistently increases or decreases, though not necessarily linearly. A value of -1 means consistently decreasing; 1 means consistently increasing. It's a nonparametric measure |
|
|
Term
Spearman's Rho |
|
Definition
Is less sensitive to outliers. Replaces each x value and each y value with its rank (1, 2, 3, etc.) and finds the correlation of the ranks. Also between -1 and 1. It is a nonparametric measure. |
|
|
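The two rank-based measures above can be sketched as follows. This is a simplified illustration that assumes there are no tied values (real software handles ties with extra rules); the function names and data are invented. The data are perfectly monotone but far from linear because of one extreme y value.

```python
import statistics
from itertools import combinations

def pearson_r(xs, ys):
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / ((len(xs) - 1) * sd_x * sd_y)

def ranks(values):
    """Rank 1 for the smallest value, 2 for the next, and so on (no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(xs, ys):
    """Ordinary correlation computed on the ranks instead of the raw values."""
    return pearson_r(ranks(xs), ranks(ys))

def kendall_tau(xs, ys):
    """(concordant pairs - discordant pairs) / total pairs, assuming no ties."""
    pairs = list(combinations(zip(xs, ys), 2))
    concordant = sum(1 for (x1, y1), (x2, y2) in pairs if (x1 - x2) * (y1 - y2) > 0)
    discordant = sum(1 for (x1, y1), (x2, y2) in pairs if (x1 - x2) * (y1 - y2) < 0)
    return (concordant - discordant) / len(pairs)

xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 1000]  # always increasing, with one extreme value
print(pearson_r(xs, ys), spearman_rho(xs, ys), kendall_tau(xs, ys))
```

Both rank-based measures report a perfect monotone relationship, while the ordinary correlation is pulled well below 1 by the extreme point.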
Term
Residual |
|
Definition
The difference between the observed y value of a point and the y value predicted by the linear regression (the prediction is also referred to as y-hat). |
|
|
Term
|
Definition
Also known as the least squares line |
|
|
Term
Linear Regression equation |
|
Definition
y-hat = b0 + b1x, where b0 is the intercept and b1 is the slope |
|
|
Term
b1 (The slope of linear regression) equation |
|
Definition
r(sy/sx): the correlation times (standard deviation of y / standard deviation of x) |
|
|
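A minimal sketch of fitting the line from summary statistics, using the slope formula above. The intercept formula used here, b0 = (mean of y) - b1 * (mean of x), is the standard least squares result and is stated as an assumption since it is not written out on these cards. The function name and data are invented.

```python
import statistics

def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    r = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / ((n - 1) * sd_x * sd_y)
    b1 = r * sd_y / sd_x       # slope: r times (sy / sx)
    b0 = mean_y - b1 * mean_x  # intercept: the line passes through the point of means
    return b0, b1

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]  # invented measurements
b0, b1 = fit_line(xs, ys)
# Residual = observed y minus predicted y
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
```

For a least squares line the residuals always sum to zero (up to rounding), which is a handy sanity check on the arithmetic.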
Term
|
Definition
|
|
Term
R² (R-Squared) |
|
Definition
Gives the fraction (between 0 and 1) of the data's variation accounted for by the model |
|
|
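One way to see this fraction, sketched with invented data: compare the variation left over in the residuals with the total variation in y. The function name `r_squared` is hypothetical; for a straight-line fit the result equals the square of the correlation r.

```python
def r_squared(xs, ys):
    """1 - (residual variation / total variation) for the least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    residual_variation = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    total_variation = sum((y - mean_y) ** 2 for y in ys)
    return 1 - residual_variation / total_variation

noisy = r_squared([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])  # correlation is 0.8 here
perfect = r_squared([1, 2, 3, 4], [3, 5, 7, 9])      # points exactly on a line
```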
Term
Does the Plot Thicken? Condition |
|
Definition
When you plot the residuals against the predicted values, there should be no discernible pattern. If there is, your model isn't ideal |
|
|
Term
|
Definition
You can't simply rearrange a regression line equation to predict x from y unless the correlation is exactly 1.0 or -1.0. You must apply the b1 and b0 formulas again with the roles of x and y swapped |
|
|
Term
|
Definition
The extent to which a point influences analysis |
|
|
Term
|
Definition
Distinguishable traits of the data that can allow you to fit different regression lines to different segments of information (male/female etc...) |
|
|
Term
|
Definition
1. Make the distribution of a variable more symmetric 2. Make the spread of several groups (as seen in side-by-side boxplots) more alike, even if their centers differ (often achieved with logs) 3. Make the form of a scatterplot more nearly linear 4. Make the scatter in a scatterplot spread out evenly rather than thickening at one end |
|
|
Term
Ladder of Powers: "2" aka Squares |
|
Definition
Try for unimodal, left skewed histograms |
|
|
Term
Ladder of Powers: "0" aka Logs |
|
Definition
This is the go-to. You can't take the log of negative numbers or 0, so add a small constant to all data values to avoid mistakes. Try logging y, then logging x, and if all else fails, log both. |
|
|
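A small sketch of the log re-expression described above, including the "add a small constant" step. The function name, the `shift` parameter, and the data are assumptions for illustration; base 10 is used here, though any base changes the results only by a constant factor.

```python
import math

def log_reexpress(values, shift=0.0):
    """Take log10 of each value after adding an optional constant."""
    shifted = [v + shift for v in values]
    if min(shifted) <= 0:
        # Logs are undefined for zero and negative numbers.
        raise ValueError("all values must be positive after adding the shift")
    return [math.log10(v) for v in shifted]

# Invented, strongly right-skewed values become evenly spaced after logging
logged = log_reexpress([1, 10, 100, 1000, 10000])
# A data set containing 0 needs a shift before logs can be taken
logged_with_zero = log_reexpress([0, 9, 99], shift=1)
```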
Term
Ladder of Powers: "-1/2" |
|
Definition
The negative reciprocal square root. Taking the negative preserves the direction of relationships. Your last resort |
|
|
Term
Ladder of Powers: "-1" aka Reciprocals |
|
Definition
Take the positive or the negative reciprocal, depending on which direction you want the relationship to go. Ratios of two quantities benefit the most. |
|
|
Term
Sample Strategies and Ideals to keep in mind: |
|
Definition
1: Examine a Part of the Whole: Try to avoid bias by representing all parts of the population in proportion to their share of the whole 2: Randomize: When in doubt, select at random so that nothing about the selection can be associated with what you are measuring 3: It's the Sample Size: The fraction of the population sampled doesn't matter, just the actual sample size (2,000 is a good number). |
|
|
Term
Census |
|
Definition
A "sample" that consists of the entire population; often quite inefficient |
|
|
Term
|
Definition
Parameters are real information about the world that we are trying to get at, often in vain. Statistics are anything we calculate from data |
|
|
Term
Simple Random Sample (SRS) |
|
Definition
A method by which every possible sample of the chosen size is equally likely to be selected. The basis for comparison with all other sampling methods |
|
|
Term
Sampling Frame |
|
Definition
The list of individuals from which the sample is drawn |
|
|
Term
Stratified Random Sampling |
|
Definition
Dividing the population into distinct strata, then taking a simple random sample within each stratum. |
|
|
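Both sampling methods can be sketched with Python's `random` module. This is an illustration only: the function names, the `fraction` parameter, and the made-up population of ID numbers are assumptions, and sampling the same fraction from every stratum is just one common choice.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n individuals so that every group of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

def stratified_sample(strata, fraction, seed=None):
    """Take a simple random sample of the same fraction within each stratum."""
    rng = random.Random(seed)
    return {
        name: rng.sample(members, round(len(members) * fraction))
        for name, members in strata.items()
    }

# Hypothetical sampling frame of 100 ID numbers split into two strata
frame = list(range(100))
strata = {"first_year": frame[:60], "senior": frame[60:]}

srs = simple_random_sample(frame, 10, seed=1)
stratified = stratified_sample(strata, fraction=0.1, seed=1)
```

The stratified version guarantees 6 first-years and 4 seniors, while the simple random sample could by chance contain any mix of the two groups.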
Term
Cluster Sampling |
|
Definition
Taking a representative cluster of the population that reflects the population as a whole. If it doesn't represent the population as a whole, it will be biased. Can also be one stage of a multistage sample |
|
|
Term
Systematic Sampling |
|
Definition
When you use a nonrandom but systematic selection of individuals, for example selecting every 20th person in a population. |
|
|
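Selecting every 20th person can be sketched with list slicing. One assumption is added here that the card does not state: the starting position is chosen at random, which is a common way to keep the selection from depending on where the list happens to begin. The function name and the list of ID numbers are invented.

```python
import random

def systematic_sample(frame, step, seed=None):
    """Pick a random starting point, then take every step-th individual."""
    start = random.Random(seed).randrange(step)  # assumed random start
    return frame[start::step]

population = list(range(100))  # hypothetical ID numbers
chosen = systematic_sample(population, step=20, seed=3)
```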
Term
Pilot |
|
Definition
A trial run of a survey before it is employed in a larger group at higher cost. Gives you a chance to recognize flaws in your design |
|
|
Term
Sampling Technique Errors |
|
Definition
Voluntary Response Sample: Because it is self-selected, it is inherently biased Convenience Sampling: Includes only those who are easy to reach, so it rarely gives unbiased information |
|
|
Term
|
Definition
Nonrespondents: It's always a good investment to limit the number of nonrespondents, because their absence can shift the data Response Bias: Anything in the survey that influences responses (the wording of a question, the environment it's taken in) |
|
|
Term
Observational Study |
|
Definition
When people or subjects are viewed in their natural environments. Often retrospective studies |
|
|
Term
Prospective v. Retrospective Studies |
|
Definition
Prospective studies pick individuals in advance and watch them for a given amount of time; retrospective studies look back at data on events that have already happened. Prospective studies are generally favored over retrospective ones |
|
|
Term
Experiment |
|
Definition
When you attempt to isolate very simple variables through random assignment of treatments to subjects. Active manipulation by researchers. |
|
|
Term
The 4 Principles of Experimental Design |
|
Definition
1. Control: Control sources of variation other than what we are testing 2. Randomization: Equalizes the effects of unforeseen or uncontrollable sources of variation 3. Replicate: Results have to be replicated in slightly altered situations to show no bias 4. Block: Sometimes attributes of the subjects affect the outcomes of an experiment, so grouping similar subjects into blocks and randomizing within each block is more accurate |
|
|
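Randomization and blocking can be combined as in the sketch below: subjects are shuffled within each block and treatments are dealt out in turn, so every block ends up with a balanced mix of treatments. The function name, the block labels, and the treatment names are hypothetical.

```python
import random

def randomize_within_blocks(blocks, treatments, seed=None):
    """Randomly assign treatments to subjects separately inside each block."""
    rng = random.Random(seed)
    assignment = {}
    for subjects in blocks.values():
        shuffled = list(subjects)
        rng.shuffle(shuffled)  # random order within the block
        for position, subject in enumerate(shuffled):
            # Deal treatments out in turn so the block stays balanced.
            assignment[subject] = treatments[position % len(treatments)]
    return assignment

# Hypothetical subjects, blocked on an attribute not under study
blocks = {
    "block_a": ["a1", "a2", "a3", "a4"],
    "block_b": ["b1", "b2", "b3", "b4"],
}
assignment = randomize_within_blocks(blocks, ["treatment", "placebo"], seed=7)
```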
Term
Blinding |
|
Definition
Limiting the influence that knowledge can have on the experiment by keeping the treatment assignments a secret from the subject and from the researcher. An experiment is "double blind" when even those who interpret the data are unaware of which treatment each subject received. |
|
|
Term
Matching |
|
Definition
Pairing subjects because they are similar in ways not under study |
|
|