Term
What are the differences between descriptive and inferential statistics? |
|
Definition
Descriptive statistics are tables, graphs, or numbers that organize or summarize a set of data. Inferential statistics are mathematical techniques that allow us to make decisions, estimates, or predictions about a larger group of individuals on the basis of data collected from a much smaller group.
Descriptive statistics: certainty; results may describe a sample or a population; results reflect the same 'level' as the data I have.
Inferential statistics: probability; results describe a population; results reflect a different 'level' than the data I have. |
|
|
Term
What role do inferential statistics play in the scientific process? |
|
Definition
To determine if we can be confident that the results for our sample will hold true for the entire population from which it was drawn. |
|
|
Term
How/why are descriptive statistics used when we conduct inferential statistic tests? |
|
Definition
Descriptive stats are used when we conduct inferential stats tests because scientists typically cannot study entire large populations, so we describe samples and then conduct inferential statistical tests to determine if we can be confident that the results for our sample will hold true for the entire population from which it was drawn. |
|
|
Term
What do we mean when we say the results of an inferential test are "statistically significant?" |
|
Definition
If the results of a study are statistically significant, we are confident that the results we see for our sample will hold true for 1. most other samples drawn from the same population, and thus 2. for the entire population from which the sample was drawn. |
|
|
Term
What's the difference between a sample statistic and a population parameter? How are they similar? |
|
Definition
A descriptive statistic for a sample is called a sample statistic. A descriptive statistic for an entire population is called a population parameter. They are similar because both are descriptive statistics. |
|
|
Term
What is a representative sample? |
|
Definition
A representative sample is a subset of the population that exhibits the important characteristics and diversity of the population. |
|
|
Term
What 3 questions should you ask yourself to determine whether a particular sample described in a study is representative or unrepresentative? |
|
Definition
1. What is the population of interest? 2. What diversity would we expect to find in that population? 3. Does the sample described in the problem include the diversity expected in the population? |
|
|
Term
Why are representative samples critically important in science? |
|
Definition
Representative samples are critically important in science because if a sample is not representative of a particular population, then conclusions we reach about the population based on that sample are not valid. |
|
|
Term
Which 2 sampling methods are likely to yield representative samples? How are these methods similar to and different from each other? |
|
Definition
Two sampling methods that are likely to yield representative samples are the simple random sample (one in which every individual in the population has an equal chance of being chosen for the sample) and the stratified random sample (to ensure that particular groups within a population are adequately represented in a sample, we randomly select individuals from each group until the proportion of individuals from each group in our sample equals the proportion of that group in the larger population). They are similar in that both rely on random selection; they differ in that stratified random sampling first divides the population into groups and then samples randomly within each group. |
|
|
Term
Why do convenience and voluntary response sampling methods often fail to yield representative samples? |
|
Definition
Convenience sampling and voluntary response sampling are likely to yield unrepresentative (biased) samples. Convenience sampling occurs whenever the sample is selected based on the ease of collecting data, rather than using a random method. Voluntary response sampling is a type of convenience sampling in which only those who volunteer are in the sample. Because neither method selects individuals at random, some members of the population are much more likely to be included than others, so the sample often fails to reflect the diversity of the population. |
|
|
Term
What is replication? Why do scientists conduct replication? |
|
Definition
Replication is repeating the study, with essentially the same methodology, on a new sample. Scientists conduct replications to check to see if findings based on a sample are really true for an entire population. |
|
|
Term
Most consumers of scientific information want to understand and predict characteristics of an individual. What caution must we consider when using scientific principles to predict what may be true for individuals? |
|
Definition
Remember that scientific principles are meant to describe what is true in most cases in a population. We know that there will always be individuals that are different, because there will always be variability among individuals in any population. |
|
|
Term
What is the key difference between categorical (qualitative) and numerical (quantitative) data? |
|
Definition
Categorical / qualitative data is not naturally numerical. Numerical / quantitative data is naturally numerical. |
|
|
Term
What characteristics define nominal variables? |
|
Definition
a categorical variable that cannot be ranked |
|
|
Term
What characteristics define ordinal variables? |
|
Definition
a categorical variable that can be ranked; if denoted in numbers, the distances between values are not necessarily equal in interval |
|
|
Term
What characteristics define interval variables? |
|
Definition
a numeric variable that falls on a scale with equal intervals, but does not have an absolute zero point |
|
|
Term
What characteristics define ratio variables? |
|
Definition
a numeric variable that falls on a scale with equal intervals and DOES have an absolute 0 point |
|
|
Term
How does precision of measurement vary along the NOIR scale? |
|
Definition
At each successive level along the NOIR scale (Nominal --> Ordinal --> Interval --> Ratio) we have a MORE precise measurement. Scientists generally choose to use the most precise scale of measurement possible. |
|
|
Term
What is the difference between discrete and continuous variables? |
|
Definition
Discrete variables produce numerical responses, typically from a counting process, and therefore tend to take on only a finite number of real values.
Continuous variables produce numerical responses, typically from a measuring process, and therefore can assume an infinite number of values. |
|
|
Term
Science proceeds from words to numbers to words. When we are measuring categorical variables, the only numbers we can generate are frequencies -- the frequency or relative frequency of individuals who fall in each category. What's the difference between class frequency and class relative frequency? |
|
Definition
Class frequency is the number of observations in the data set falling in a particular class. Class relative frequency is the proportion or percentage of observations falling in a class. |
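A minimal sketch (hypothetical pet-preference data, Python standard library) of computing class frequencies and class relative frequencies:

    from collections import Counter

    # Hypothetical categorical data: favorite pet reported by 10 people
    pets = ["dog", "cat", "dog", "fish", "dog", "cat", "dog", "cat", "dog", "fish"]

    freq = Counter(pets)                                    # class frequencies (counts)
    rel_freq = {k: v / len(pets) for k, v in freq.items()}  # class relative frequencies (proportions)

    print(freq)      # Counter({'dog': 5, 'cat': 3, 'fish': 2})
    print(rel_freq)  # {'dog': 0.5, 'cat': 0.3, 'fish': 0.2}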
|
|
Term
Can bar charts present CF, CRF, or either one? |
|
Definition
Bar charts can be used to represent either CF or CRF. |
|
|
Term
Why are there typically spaces between bars on bar charts? |
|
Definition
The separation between bars communicates to the viewer that each bar represents a distinct category. |
|
|
Term
Can pie charts present CF, CRF, or either one? |
|
Definition
Pie charts can present either CF or CRF. |
|
|
Term
How does the placement of dots on a dot plot differ depending on whether our variable is discrete or continuous? |
|
Definition
If your quantitative variable is discrete, then you put the dots directly above the values on the x-axis. If the variable is continuous, then you have to estimate where each dot should fall relative to the values on the x-axis. |
|
|
Term
When creating a pie chart, in what order do we place the slices if our variable is nominal? In what order do we place the slices if our variable is ordinal? |
|
Definition
If the data are nominal, pie slices are typically placed in order from the category with the highest frequency (largest slice) to the category with the lowest frequency, starting at 0 degrees. If the data are ordinal, the order of slices is typically determined by the order in which the categories naturally fall. |
|
|
Term
Dot plots and histograms are used to display quantitative data -- which one requires that we create measurement classes? |
|
Definition
Histograms require measurement classes. |
|
|
Term
Why are there never spaces between bars on a histogram? |
|
Definition
Bars on a histogram are placed side-to-side because a histogram shows the frequency of individual scores that fall into measurement classes that are continuous along a quantitative scale. |
|
|
Term
Do we typically present continuous numerical variables on bar charts or histograms?
Under what circumstances are we likely to present discrete numerical variables on a histogram? |
|
Definition
Typically, continuous numerical variables are displayed on a histogram.
We create measurement classes for discrete numerical variables when the number of discrete values in our data set exceeds 15. |
|
|
Term
What do measures of central tendency represent? |
|
Definition
Measures of central tendency reflect something about "middleness" -- the middle of a data set. A measure of central tendency is a single numerical measure used to represent an entire set of scores. |
|
|
Term
How do the concepts of mean, median, and mode differ? |
|
Definition
The mean is the average of a set of scores. The mean is calculated by adding up all of the scores and dividing by the total number of scores.
The median is the middle number in a set of measurements. To calculate a median, arrange the N measurements from smallest to largest. If N is odd, the median is the middle number. If N is even, it is the mean of the middle two numbers.
Mode is the most frequently occurring score in a set of scores. In a large data set, we estimate the mode as the midpoint of the modal class. |
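A minimal Python sketch (standard library only, hypothetical scores) showing how the three measures are computed:

    import statistics

    scores = [2, 3, 3, 5, 7, 9, 13]           # hypothetical data set

    mean = sum(scores) / len(scores)          # add all scores, divide by N
    median = statistics.median(scores)        # middle value of the ordered scores
    mode = statistics.mode(scores)            # most frequently occurring score

    print(mean, median, mode)                 # 6.0 5 3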
|
|
Term
If the median of a set of test scores is 84 -- what does that mean about how the other scores in the set are distributed? |
|
Definition
It means that half of the scores are above 84 and half are below it. |
|
|
Term
If, for a given set of scores, the mode = median = mean, what shape is the distribution? |
|
Definition
A symmetric (e.g., normal, bell-shaped) distribution. |
 |
|
Term
Which is the more reliable measure of central tendency: mode or median? Why? |
|
Definition
Generally the median: the mode reflects only the single most frequent value, so it can change dramatically with small changes in the data, while the median is anchored to the middle of the ordered scores. Even so, you have to know how the scores in your data set are distributed before you can know which measure of central tendency best represents your data. |
|
|
Term
Why is the median a more reliable measure of central tendency than the mean when a distribution has outliers? |
|
Definition
The mean is most strongly affected in a skewed distribution. Extreme scores (outliers) in the tail of the distribution "pull" the mean in that direction, and the more extreme the outliers are, the more the mean will be affected. The median is somewhat "pulled" in the direction of the tail, but not as much as the mean, because the median only responds to the number of scores above or below it, not how far above or below the outliers fall. |
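A quick illustration (hypothetical scores) of how a single extreme score pulls the mean much farther than the median:

    import statistics

    scores = [80, 82, 84, 86, 88]
    with_outlier = scores + [200]             # add one extreme high score

    print(statistics.mean(scores), statistics.median(scores))              # 84 84
    print(statistics.mean(with_outlier), statistics.median(with_outlier))  # 103.33... 85.0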
|
|
Term
If you have high outliers in a set of data, what will the distribution be?
If you have low outliers in a dataset, what will the distribution be? |
|
Definition
High outliers = skewed right
Low outliers = skewed left |
|
|
Term
What do measures of variability represent? |
|
Definition
A measure of variability represents how alike or different are the scores in a dataset -- whether they cluster together around the mean or whether they are widely dispersed. |
|
|
Term
When and why can range be an unreliable measure of variability? |
|
Definition
Range can be an unreliable measure of variability when a data set contains a high or low outlier. Because the range is based on only the highest and lowest scores in the data set (just two numbers, no matter how large the set), a single extreme score can make it misleading. |
|
|
Term
What quality do variance and standard deviation have that makes them preferred over range as a measure of variability? |
|
Definition
Range is the least reliable measure of variability because it is based on only 2 scores in the set. Variance and standard deviation are calculated using all of the scores in the set. |
|
|
Term
When using the definitional formula to calculate variance, why can't we just sum up the distances of each score from the mean -- and use that as a measure of variability? |
|
Definition
When we sum the distances of each score from the mean, the sum always equals zero because the mean is the balancing point of any distribution. |
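A tiny demonstration (hypothetical scores) that the deviations from the mean always sum to zero:

    scores = [4, 8, 6, 5, 7]                  # hypothetical data
    mean = sum(scores) / len(scores)          # 6.0

    deviations = [x - mean for x in scores]   # -2.0, 2.0, 0.0, -1.0, 1.0
    print(sum(deviations))                    # 0.0 -- positive and negative distances cancel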
|
|
Term
Why is standard deviation the preferred measure for reporting variability in a set of data, as compared to variance? |
|
Definition
We prefer to report standard deviation instead of variance because standard deviation is in the original units of measurement -- which makes standard deviation easier to interpret than variance (which is expressed as [original units of measurement]^2). |
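A short sketch (hypothetical heights in inches) showing that variance comes back in squared units while standard deviation stays in the original units:

    import statistics

    heights_in_inches = [60, 62, 65, 68, 70]             # hypothetical measurements

    variance = statistics.pvariance(heights_in_inches)   # 13.6 square inches
    std_dev = statistics.pstdev(heights_in_inches)       # about 3.7 inches

    print(variance, std_dev)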
|
|
Term
How does the Empirical Rule help us describe data that are distributed normally? |
|
Definition
If we know that a set of data is distributed normally, and we know the mean and standard deviation of the distribution, we can use the empirical rule to estimate how many measurements will occur within one or two or three standard deviations of the mean. |
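A brief check (hypothetical mean and standard deviation) of the Empirical Rule against exact normal-curve areas, using Python's statistics.NormalDist:

    from statistics import NormalDist

    mu, sigma = 100, 15                       # hypothetical normal distribution
    dist = NormalDist(mu, sigma)

    for k in (1, 2, 3):
        area = dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)
        print(f"within {k} SD of the mean: {area:.3f}")   # ~0.683, 0.954, 0.997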
|
|
Term
What does a measure of relative standing represent? |
|
Definition
A measure of relative standing indicates how a particular score compares to the other scores in a data set. |
|
|
Term
What does a percentile rank represent? |
|
Definition
The percentile rank for a score is the percentage of scores that are less than that score. |
|
|
Term
How do we calculate percentile rank differently depending on whether the score for which we want to know the rank occurs only once or whether it occurs more than once? |
|
Definition
If a score only appears once in a data set, then its percentile rank is simply the percentage of scores that are less than it. If a score appears more than once in a dataset, then its percentile rank is the percentage of scores less than it plus half of the percentage of scores that are equal to that value. |
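A hedged sketch of this two-case rule as a small Python function (the data and function name are illustrative, not from the course):

    def percentile_rank(scores, value):
        """Percentile rank of `value` within `scores`, following the two-case rule."""
        n = len(scores)
        below = sum(s < value for s in scores)
        equal = sum(s == value for s in scores)
        if equal <= 1:
            return 100 * below / n                  # score appears only once
        return 100 * (below + 0.5 * equal) / n      # score appears more than once

    print(percentile_rank([70, 75, 80, 80, 90], 80))   # 60.0
    print(percentile_rank([70, 75, 80, 80, 90], 90))   # 80.0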
|
|
Term
What do z-scores represent? |
|
Definition
A z-score represents the number of standard deviations away from the mean a score falls. |
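As a minimal sketch (hypothetical numbers), the calculation is just the distance from the mean divided by the standard deviation:

    def z_score(x, mu, sigma):
        """How many standard deviations x falls above (+) or below (-) the mean."""
        return (x - mu) / sigma

    print(z_score(130, 100, 15))   # 2.0 -> two standard deviations above the mean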
|
|
Term
When we want to compare scores for individuals who belong to two different distributions why are z-scores useful?
What do "standard normal distributions" have to do with this sort of comparison? |
|
Definition
When we calculate z-scores for raw scores drawn from two different normal distributions, we are standardizing both distributions so that the mean = 0 and standard deviation = 1 for both distributions (the standard normal distribution). We can then compare how far each score falls from the mean of the distribution from which it was drawn. |
|
|
Term
Why is using a z-table preferred as the method for estimating areas beneath the normal curve instead of using the empirical rule? |
|
Definition
A z-table is preferred over the empirical rule for estimating areas beneath the normal curve because the empirical rule can only address questions about relative standing when the score falls EXACTLY on a standard deviation mark. |
|
|
Term
In lesson 2, we used the empirical rule to estimate areas beneath the normal curve to answer questions about proportions and percentages. In this lesson we began to estimate areas beneath the normal curve to answer questions about probability (p) -- why will this be important to keep in mind as we conduct inferential statistical tests? |
|
Definition
Inferential statistics are based on estimates of probability. We draw normal curves and shade in areas beneath the curve because we are reasoning about probability. |
|
|
Term
Do normal distributions all have the same height and width? |
|
Definition
No, normal distributions do not all have the same height and width because they differ in mean and standard deviation. |
|
|
Term
What must we do to normally distributed data so that we can estimate probability using the normal probability curve? |
|
Definition
We must standardize the data by converting raw scores (x) to z-scores: any normal distribution can be transformed in this way to match the qualities of the standard normal probability curve (mean = 0, standard deviation = 1). |
|
|
Term
What do you do to estimate an area in one tail of the distribution? |
|
Definition
Shade the area you are estimating. Look up the area between 0 and z in the z-table. Subtract the area you got from the table from 0.5 to get the area in one tail of the distribution. |
|
|
Term
What do you do to estimate an area that is more than 1/2 of the distribution? |
|
Definition
Shade the area you are estimating. Look up the area between 0 and z in the table. Add 0.5 to the area you got from the table to find an area that is more than 1/2 of the distribution. |
|
|
Term
What do you do to estimate an area that is between a negative z and a positive z? |
|
Definition
Shade the area you are estimating. Look up the area between 0 and the negative z. Look up the area between 0 and the positive z. Add the two areas together. |
|
|
Term
What do you do to estimate an area between 2 negative or 2 positive z-scores? |
|
Definition
Take the absolute values of the two positive or negative z-scores. Look up the area between 0 and the larger z. Look up the area between 0 and the smaller z. Subtract the smaller area from the larger. |
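A compact sketch of these four lookup rules, using Python's statistics.NormalDist in place of a printed z-table (the z values are hypothetical):

    from statistics import NormalDist

    std_normal = NormalDist(0, 1)

    def area_0_to_z(z):
        """Area between the mean (z = 0) and z: what a z-table reports."""
        return abs(std_normal.cdf(z) - 0.5)

    print(0.5 - area_0_to_z(1.5))             # one tail beyond z = 1.5: ~0.067
    print(0.5 + area_0_to_z(1.5))             # more than half (everything below z = 1.5): ~0.933
    print(area_0_to_z(-1) + area_0_to_z(2))   # between z = -1 and z = +2: ~0.819
    print(area_0_to_z(2) - area_0_to_z(1))    # between z = +1 and z = +2: ~0.136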
|
|
Term
Explain the problem-solving steps we use to estimate probabilities of events in a normal distribution. |
|
Definition
1. Convert the question in words to a probability statement using the raw score (x). 2. Label the normal distribution with appropriate scales: (1) z-scores, (2) raw scores (x), and (3) standard deviations. 3. Calculate z-score(s) for the value(s) of x you're given and label the z-scores on the distribution. 4. Shade the area on the distribution corresponding to the probability you want to find. 5. Look up the area between 0 and z in the z-table and label it on the distribution -- always! 6. Separately label any other sections of the distribution that are included in the area you shaded. |
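A worked example of these steps in Python (the exam-score numbers are hypothetical): estimate P(x > 82) for a normal distribution with mu = 70 and sigma = 8.

    from statistics import NormalDist

    mu, sigma = 70, 8                         # hypothetical population parameters
    x = 82                                    # step 1: P(x > 82)

    z = (x - mu) / sigma                      # step 3: convert x to a z-score (1.5)
    p = 1 - NormalDist(0, 1).cdf(z)           # steps 4-6: shaded area in the upper tail

    print(round(z, 2), round(p, 4))           # 1.5 0.0668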
|
|
Term
What do you need to remember about what a percentile rank represents to set up and solve problems that ask you to use a z-table to estimate the percentile rank of a score? |
|
Definition
You must recall that a percentile rank is a measure of relative standing that tells us what percentage of individuals in the distribution score below a particular individual. |
|
|
Term
Explain: To estimate probabilities of events in a normal distribution, we start with an X, solve for a Z, and then obtain a P. |
|
Definition
We begin with a raw score (x), convert it to a z-score using z = (x - mu) / sigma, and then look up that z in the z-table to obtain the area (probability, p) we need. In other words: start with x, solve for z, then obtain p. |
 |
|
Term
What is the key difference in how mystery-z problems are worded, as compared to problems in which you are asked to estimate the probability of a particular event (x)? |
|
Definition
In mystery-z problems we will be given a percentage, proportion, or probability (p) and asked to determine the score (x) that corresponds to the probability given.
Mystery-z: start with p. We then solve for z and then x. Other problems: start with x. We then solve for z and then p. |
|
|
Term
Explain: In mystery-z problems, we start with a P, solve for a z, and then obtain an x. |
|
Definition
We begin with an area (probability, p), read the z-table "inside-out" to find the z-score that bounds that area, and then plug that mystery z into the formula to solve for the raw score: x = mu + z(sigma). In other words: start with p, solve for z, then obtain x. |
 |
|
Term
Explain the problem-solving steps we use to conduct mystery-z problems. |
|
Definition
1. Shade and label the area given in the problem on a normal distribution. 2. Place a "tag" for the mystery z on the x-axis at the boundary of the area you have shaded and labeled -- to remind yourself that it is the mystery z for which you are solving. 3. Calculate the area between 0 and the mystery z and label it on your distribution. 4. Look up the "area between 0 and mystery z" -- you will read the z-table "inside-out" because areas between 0 and z are shown in the body of the z-table. You know the area between 0 and z and must find it IN the body of the table and "read" the z-score from the outer left column and top row. 5. Plug the mystery z into the formula and solve for x. |
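A worked mystery-z example in Python (the numbers are hypothetical): find the score that marks the top 10% of a normal distribution with mu = 70 and sigma = 8.

    from statistics import NormalDist

    mu, sigma = 70, 8                                # hypothetical population parameters
    p_upper_tail = 0.10                              # given: the area above the mystery score

    z = NormalDist(0, 1).inv_cdf(1 - p_upper_tail)   # z-table read "inside-out": area -> z (~1.28)
    x = mu + z * sigma                               # plug the mystery z into the formula, solve for x

    print(round(z, 2), round(x, 1))                  # 1.28 80.3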
|
|
Term
What is a Distribution of Sample Means (DSM)? |
|
Definition
A DSM shows the means of many samples drawn from one particular population. |
|
|
Term
How do we create a DSM from scratch? |
|
Definition
To create a DSM from scratch, we collect many samples from one particular population; calculate the mean of each sample; and then display the sample means (x-bar) on a distribution. |
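A small simulation sketch of this from-scratch process (population values and sample counts are hypothetical):

    import random
    import statistics

    random.seed(1)
    population = [random.gauss(100, 15) for _ in range(10_000)]    # one particular population

    sample_means = []
    for _ in range(500):                                # collect many samples...
        sample = random.sample(population, 25)          # ...each of size n = 25
        sample_means.append(statistics.mean(sample))    # calculate the mean of each sample

    # The DSM is this collection of x-bars; its center falls near the population mean.
    print(round(statistics.mean(sample_means), 1))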
|
|
Term
Do we expect that samples drawn from the same population will have identical means? Why or why not? |
|
Definition
No, it is unusual to draw samples with means (x-bar) that are identical to each other or to the population mean (mu), because each sample contains a different mix of individual scores. |
|
|
Term
Do we expect the mean of a sample drawn at random from a population to equal the mean of the population? Why or why not? |
|
Definition
Not exactly, but we expect it to come close. Sample means (x-bar) tend to resemble the population mean (mu) more than individual scores (x) from the population do. The means of samples (x-bars) drawn from a population will cluster more tightly around the population mean (mu) than will individual scores (x) drawn from the population. |
|
|
Term
When we calculate the standard deviation of all sample means for a DSM, is "n" equal to the total number of samples we collect in order to create the DSM or is "n" equal to the number of individuals in each sample collected? |
|
Definition
n is equal to the total number of samples we collect in order to create the DSM. |
|
|
Term
Why do we create a DSM for a population? |
|
Definition
1. So we can visualize / describe the types of sample means we would expect to draw from a particular population. 2. So we can compare one sample mean (x-bar) to other sample means (x-bars) drawn from the same population -- to say how rare or common that type of sample is in the population. 3. So we can compare a "test sample" to the DSM -- to say whether it is unlikely that the "test sample" came from that population. |
|
|
Term
Which two population parameters do we need to know to create a DSM without having to actually collect 100s of samples?
How do we use those population parameters to get the mean and standard deviation of the DSM? |
|
Definition
We need the population mean (mu) and the population standard deviation (sigma).
We set the mean of all sample means (mu sub x-bar) equal to the mean of the population (mu).
We calculate the standard deviation of all sample means, also called standard error (sigma sub x-bar), using sigma sub x-bar = sigma / square root of n. |
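As a minimal numeric sketch (the parameters are hypothetical):

    import math

    mu, sigma = 100, 15      # population mean and standard deviation
    n = 25                   # size of each sample used to build the DSM

    mu_xbar = mu                             # mean of all sample means equals mu
    standard_error = sigma / math.sqrt(n)    # sigma sub x-bar = sigma / sqrt(n) = 3.0

    print(mu_xbar, standard_error)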
|
|
Term
The standard deviation of sample means is also called what? |
|
Definition
Standard error (sigma sub x-bar). |
 |
|
Term
What can we say will always be true about the relationship between variability among individual scores (sigma) drawn from a population as compared to variability among the means of samples (sigma sub x bar) drawn from the same population? Explain. |
|
Definition
Sigma sub x-bar = sigma / square root of n.
So, standard error -- which represents variability among sample means (x-bar) in a DSM -- will always be LESS THAN the variability among individual scores in a population (sigma).
Why? 1. Each sample mean is a summary statistic that represents the center of that sample --> high and low individual scores in each sample are 'washed out'. 2. Because the mean of each sample is an approximation of the population mean, the x-bars cluster more tightly around the population mean (mu) than do individual scores (x's). |
|
|
Term
If we used both methods to create DSMs for a particular population -- one DSM from scratch and one using population parameters -- which would be most accurate? |
|
Definition
Estimation from population parameters would be more accurate because the number of samples in a sampling distribution is assumed to be infinite. Even though creating a DSM from scratch includes hundreds of sample means, we would need many more to get the exact same results as we get when we estimate from population parameters. |
|
|
Term
The amount of variability among sample means drawn from a population depends on what two values? |
|
Definition
- The size of the samples (n) used to create the DSM
- The amount of variability in individual scores in the population (sigma) |
|
|
Term
As sample size increases, what happens to standard error? As variability of scores in a population increases, what happens to standard error? |
|
Definition
- Standard error (sigma sub x-bar) will be reduced as sample size (n) is increased.
- Standard error (sigma sub x-bar) will increase as variability of scores in a population (population standard deviation, sigma) is increased. |
|
|
Term
The accuracy of sample means (x-bar) as estimates of the population mean (mu) increases as sample size does what? Explain why this is so by explaining how sample size affects standard error and how the standard error of the DSM indicates whether sample means cluster tightly around the mean of the DSM / mean of the population. |
|
Definition
The accuracy of sample means (x-bar) as estimates of the population mean (mu) increases as the size of the samples (n) used to create the DSM increases. The larger the sample size (n), the lower the standard error in the DSM (sigma sub x-bar). The lower the standard error, the more tightly sample means cluster around the mean of the DSM, and thus the more likely it is that a sample mean drawn at random from the population will accurately estimate the population mean. |
|
|
Term
State the exact definition of the Central Limit Theorem. |
|
Definition
When a distribution of sample means is created from large samples (n greater than or equal to 30), the DSM will resemble a normal distribution regardless of whether the samples were drawn from a population that was distributed normally or non-normally. |
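A hedged simulation sketch of the theorem (the skewed population and sample counts are made up for illustration): sample means drawn with n = 30 from a clearly non-normal population still pile up symmetrically around the population mean.

    import random
    import statistics

    random.seed(1)
    population = [random.expovariate(1.0) for _ in range(50_000)]   # strongly skewed population (mu ~ 1, sigma ~ 1)

    means = [statistics.mean(random.sample(population, 30)) for _ in range(2_000)]

    print(round(statistics.mean(means), 2))    # near the population mean (~1.0)
    print(round(statistics.pstdev(means), 2))  # near sigma / sqrt(30) (~0.18)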
|
|
Term
What sample sizes are needed to ensure a normally distributed DSM if individual scores are normally distributed in the population of interest?
What sample sizes are needed to ensure a normally distributed DSM if individual scores are not normally distributed in a population of interest? |
|
Definition
- If the individual scores in a population are distributed normally, then a DSM created for that population will be normal for any/all sample sizes.
- If the individual scores in a population are not distributed normally, a DSM created for that population will be at least approximately normal if we create the DSM using a sample size of n greater than or equal to 30. |
|
|
Term
We want to conduct inferential statistical tests on data taken from many populations that are not distributed normally. Why is the Central Limit Theorem crucially important to ensure the accuracy of our estimates of probability for these tests? |
|
Definition
Individual scores for many populations in which we are interested may NOT be distributed normally. Still, we can conduct our inferential statistical tests using the standard normal distribution because we know that any DSM created with a sample size of 30 or greater will be at least approximately normal -- even when the data in the population from which the samples were drawn are not normally distributed. |
|
|