Term
probability
|
Definition
relative frequency of events |
|
|
Term
binomial distribution
|
Definition
a simple model that assumes only two outcomes are possible. models probability to observe k events among a sample of n individuals. |
|
|
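The binomial probability described above can be computed directly; a minimal Python sketch with hypothetical numbers (in an R-based course, dbinom(k, n, p) does the same):

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of observing exactly k events among a sample of n individuals."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# hypothetical example: 3 diseased animals in a sample of 10 at prevalence 0.2
print(round(binom_pmf(3, 10, 0.2), 4))  # -> 0.2013
```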
Term
Bernoulli distribution
|
Definition
when n = 1 and we are interested in the probability of observing a case with a single draw from a binomially distributed population |
|
|
Term
name three ways to tell if something is Gaussian distributed |
|
Definition
-investigate the histogram -make a Q-Q plot -apply a significance test (Shapiro-Wilk test) |
|
|
Term
Q-Q (quantile-quantile) plot
|
Definition
compares quantiles of an observed frequency distribution to quantiles of an expected distribution. used for testing Gaussian distribution |
|
|
Term
Shapiro-Wilk test
|
Definition
the null hypothesis assumes the sample is Gaussian; if the test is significant we reject the null and conclude the sample is not Gaussian |
|
|
Term
de Moivre-Laplace Theorem |
|
Definition
as the success probability/prevalence of a binomial distribution approaches 0.5, or the sample size n increases, the binomial distribution becomes more symmetric and approaches the normal distribution |
|
|
Term
|
Definition
deviation of each observation from the mean |
|
|
Term
|
Definition
average of the absolute deviations of the observations from the mean (the raw deviations average to zero) |
|
|
Term
coefficient of variation
|
Definition
standard deviation expressed as percentage of the mean |
|
|
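A minimal sketch of the coefficient of variation, using hypothetical measurements:

```python
from statistics import mean, stdev

data = [4.1, 3.8, 4.4, 4.0, 3.7]        # hypothetical measurements
cv = stdev(data) / mean(data) * 100      # SD expressed as a percentage of the mean
print(round(cv, 2))  # -> 6.85
```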
Term
the simplest method in R to estimate the mean and its confidence interval |
|
Definition
t.test(x); its output includes the sample mean and its 95% confidence interval |
|
|
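R's t.test(x) reports the mean and 95% CI directly; a rough Python sketch of the underlying computation, with hypothetical data (the t critical value for df = 4 is hard-coded, since the standard library has no t distribution):

```python
from statistics import mean, stdev
from math import sqrt

x = [4.1, 3.8, 4.4, 4.0, 3.7]   # hypothetical sample
m, s, n = mean(x), stdev(x), len(x)
sem = s / sqrt(n)                # standard error of the mean
t_crit = 2.776                   # 97.5% t quantile for df = n - 1 = 4
ci = (m - t_crit * sem, m + t_crit * sem)
print(round(m, 2), round(ci[0], 2), round(ci[1], 2))  # -> 4.0 3.66 4.34
```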
Term
addition rule (probability) |
|
Definition
when two events are mutually exclusive (cannot occur at the same time), the probability of either occurring is the sum of the probability of each event |
|
|
Term
multiplication rule (probability) |
|
Definition
when two events are independent (occurrence of one does not affect the other), the probability of both events occurring is the product of the individual probabilities |
|
|
Term
conditional probability
|
Definition
when two events are not independent; the probability of A occurring when we know B has occurred |
|
|
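The three probability rules above, sketched with hypothetical numbers:

```python
p_a, p_b = 0.3, 0.5

# addition rule (mutually exclusive events): P(A or B) = P(A) + P(B)
p_either = p_a + p_b

# multiplication rule (independent events): P(A and B) = P(A) * P(B)
p_both = p_a * p_b

# conditional probability (dependent events): P(A|B) = P(A and B) / P(B)
p_a_and_b = 0.2                  # assumed joint probability
p_a_given_b = p_a_and_b / p_b
print(round(p_either, 2), round(p_both, 2), round(p_a_given_b, 2))  # -> 0.8 0.15 0.4
```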
Term
when is the binomial distribution used (list 2) |
|
Definition
-when investigating a binary response (only two possible outcomes) -for analyzing proportions and making inferences about them |
|
|
Term
what do we do when data assumed to be Gaussian is skewed right
|
Definition
use the lognormal distribution (i.e. log-transform the data) |
|
|
Term
properties of Gaussian distribution (6) |
|
Definition
-described by 2 parameters (mean, SD) -unimodal -symmetrical about the mean -mean, median and mode all equal -if SD doesn't change, but mean increases then curve shifts right (decrease and it shifts left) -decrease SD makes curve thinner, increase SD makes it fatter |
|
|
Term
properties of t distribution (3) |
|
Definition
-symmetrical about the mean -characterized by degrees of freedom -when large degrees of freedom, looks like normal distribution |
|
|
Term
properties of chi-squared distribution (2) |
|
Definition
-can only take positive values, highly skewed -characterized by degrees of freedom (approaches normal when large) |
|
|
Term
properties of f-distribution (3) |
|
Definition
-distribution of a ratio -two separate degrees of freedom (numerator and denominator) -tabulated probabilities relate to ratio>1 |
|
|
Term
two distributions used when we are dealing with discrete variables |
|
Definition
the binomial and the Poisson distributions |
|
|
Term
normal distribution is used for what sort of variable |
|
Definition
continuous (numerical) variables |
|
|
Term
when do we use continuity correction in normal distribution |
|
Definition
if we use tables of the normal distribution to approximate the Poisson or binomial distribution |
|
|
Term
what is the sampling distribution of the mean and what does it depend on |
|
Definition
the extent to which a sample mean differs from the population mean; it depends on -the size of the sample (larger sample means less error) -the variability of the observations (error is greater if the sample is more diverse) |
|
|
Term
what are the properties of sampling distribution of the mean (3) |
|
Definition
-normal distribution if the parent distribution is normal (assume normality if sample size >30) -the mean of the sampling distribution of the mean is the same as that of the parent population -its standard deviation is known as the standard error of the mean (smaller with larger sample sizes) |
|
|
Term
what is the difference between standard error of the mean and standard deviation |
|
Definition
-the SD measures the scatter of the observations, whereas the SEM measures the precision of the sample mean as an estimate of the population mean |
|
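A sketch of the distinction with hypothetical data: the SD describes the observations themselves, while the SEM = SD/sqrt(n) shrinks as the sample grows:

```python
from statistics import stdev
from math import sqrt
import random

random.seed(1)
sample = [random.gauss(10, 2) for _ in range(100)]  # hypothetical measurements
sd = stdev(sample)              # scatter of the observations
sem = sd / sqrt(len(sample))    # precision of the sample mean
print(sem == sd / 10)  # -> True  (n = 100, so the SEM is a tenth of the SD)
```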
|
Term
what is a confidence interval (for the mean) |
|
Definition
defined by upper and lower limits, is a range of values within which we expect the true population mean to lie with a certain probability |
|
|
Term
what is a null hypothesis |
|
Definition
the converse of the study hypothesis (usually try to disprove it) |
|
|
Term
what is an alternate hypothesis |
|
Definition
states there is a difference between parameter values but the direction is not known (therefore usually leads to a two tailed test); if we know one treatment can only be better and not worse we may use a one sided test |
|
|
Term
p-value
|
Definition
the chance of getting the observed effect if the null hypothesis is true |
|
|
Term
Type I error
|
Definition
if the two means are equal but we have rejected the null hypothesis when we should not have rejected it -we limit the probability of a Type I error to be less than alpha (the significance level) |
|
|
Term
Type II error
|
Definition
if the two means differ but we have not rejected the null when we should have -the probability of a Type II error is designated by beta -1-beta is the power of the test |
|
|
Term
what are the different types of t-test and give a brief description |
|
Definition
-one sample t-test: comparing a mean/expected value to a reference value -two sample t-test: comparing the means/expected values of two independent populations -Welch's test: a version of the two sample test for when variances are unequal -paired t-test: when data are not independent (paired), so it reduces to a one sample t-test |
|
|
Term
what are the two main assumptions for using a t-test |
|
Definition
-mean of the sample data is Gaussian distributed -unknown variance can be estimated by sample variance |
|
|
Term
describe the one sample t-test |
|
Definition
tests whether the mean/expected value differs from a reference value |
|
|
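The one sample t statistic is just (mean − reference) / SEM; a minimal sketch with hypothetical temperature data:

```python
from statistics import mean, stdev
from math import sqrt

def one_sample_t(x, mu0):
    """t statistic comparing the sample mean to a reference value mu0."""
    return (mean(x) - mu0) / (stdev(x) / sqrt(len(x)))

# hypothetical data: does mean temperature differ from a reference of 38.5?
t = one_sample_t([38.7, 38.9, 38.4, 39.0, 38.8], 38.5)
print(round(t, 2))  # -> 2.53
```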
Term
describe the two sample t-test |
|
Definition
if we have data from independent populations and want to compare the means/expected values |
|
|
Term
when is the Welch's test used |
|
Definition
when using a two sample t-test but the variances are unequal; the standard error of the two sample t-test is modified |
|
|
Term
describe the paired t-test |
|
Definition
when we want to use two sample t-test but the data are paired (not independent), this will reduce it to a one sample t-test |
|
|
Term
two assumptions of the one sample t-test |
|
Definition
-sample data from normally distributed population -values are representative of the population |
|
|
Term
assumptions of the two sample t-test (3) |
|
Definition
-samples must be independent and representative of the population -approx. normally distributed -variances should be approx. equal |
|
|
Term
what is the Wilcoxon rank sum test |
|
Definition
a nonparametric alternative to the two sample t-test that compares two independent samples based on ranks; used when the assumptions of the t-test (e.g. normality) do not hold |
|
|
Term
assumptions of a paired t-test
|
Definition
-the difference between the observations of each pair is approx. normally distributed |
|
|
Term
what are the assumptions of the f test |
|
Definition
-samples are independent and from normally distributed population -samples are representative of the population |
|
|
Term
F test
|
Definition
tests for the equality of two variances |
|
|
Term
what is the Levene's test |
|
Definition
used to compare two or more variances -the test statistic follows the F distribution -less dependent on the assumptions of the F test |
|
|
Term
what does ANOVA stand for and what is it used for |
|
Definition
analysis of variance; compares the means of two or more groups by investigating their variances |
|
|
Term
what does the one way ANOVA do |
|
Definition
it is an extension of the two sample t-test for when we compare the means of more than two groups |
|
|
Term
describe one way repeated measures ANOVA |
|
Definition
extension of the paired t-test when we are comparing three or more treatments |
|
|
Term
two way ANOVA
|
Definition
examines the effect of two factors on a response variable |
|
|
Term
assumptions of the one way ANOVA |
|
Definition
-variable of interest is numerical -samples are independent and come from normally distributed population |
|
|
Term
what is bonferroni's correction used for |
|
Definition
when we reject the null in a one way ANOVA and need to know which of the group means differ; the significance level alpha is divided by the number of pairwise comparisons |
|
|
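A sketch of the correction, assuming four hypothetical treatment groups:

```python
from itertools import combinations

groups = ["A", "B", "C", "D"]                  # hypothetical treatment groups
n_comp = len(list(combinations(groups, 2)))    # 6 pairwise comparisons
alpha_adjusted = 0.05 / n_comp                 # each test is run at alpha / m
print(n_comp, round(alpha_adjusted, 4))  # -> 6 0.0083
```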
Term
what are the most appropriate tests for comparing the mean of one or more populations when we have continuous variables |
|
Definition
t-tests and ANOVA |
|
|
Term
what test should we use for categorical variables (ie binary) |
|
Definition
chi square test, Fisher's exact test, Cochran-Armitage test, McNemar's test |
|
|
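For a 2x2 table the Pearson chi square statistic has a closed form; a sketch with hypothetical counts (in practice a function such as R's chisq.test would be used):

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi square statistic for a 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# hypothetical table: exposure (rows) vs disease (columns)
stat = chi_square_2x2(20, 10, 5, 25)
print(round(stat, 2))  # -> 15.43
```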
Term
what does the Pearson correlation coefficient do |
|
Definition
describes the strength of the linear relation (aka correlation) between two variables |
|
|
Term
what is the purpose of a linear regression model |
|
Definition
describes the linear relationship between two variables by using a mathematical equation |
|
|
Term
what types of distribution is most appropriate for categorical variables |
|
Definition
the binomial distribution (and its extensions, e.g. the multinomial) |
|
|
Term
describe how Fisher's exact test would be used
|
Definition
when we are testing for an association between categorical variables from independent groups of small sample size (<20) |
|
|
Term
when is the Cochran-Armitage test used
|
Definition
when we are testing for a trend in proportions of categorical variables |
|
|
Term
when is McNemar's test used
|
Definition
when we have paired groups of categorical variables and we want to test for agreement |
|
|
Term
name three different types of chi squared tests |
|
Definition
-McNemar's test -Cochran-Armitage test -Fisher's exact test |
|
|
Term
what would the value of the correlation coefficient be if there was perfect correlation |
|
Definition
+1 or -1 |
|
|
Term
what would the value be of the correlation coefficient if there was no correlation |
|
Definition
0 |
|
|
Term
what assumptions need to be made when testing the correlation coefficient |
|
Definition
-both variables (X and Y) are numeric -one of the variables is normally distributed |
|
|
Term
under what circumstances should we not calculate the correlation coefficient |
|
Definition
-when there is a relationship between the variables that is non-linear -observations are not independent -outliers present |
|
|
Term
what is the point of linear regression |
|
Definition
to model a linear relation between an outcome variable and one or more predictor/explanatory variables |
|
|
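A minimal least-squares sketch for the line y = a + b*x, using hypothetical x/y data:

```python
from statistics import mean

def least_squares(x, y):
    """Slope and intercept of the least squares line y = a + b*x."""
    mx, my = mean(x), mean(y)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

# hypothetical data: outcome y vs predictor x
a, b = least_squares([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(round(a, 2), round(b, 2))  # -> 0.05 1.99
```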
Term
the outcome in a linear regression model is the dependent or independent variable |
|
Definition
the dependent variable |
|
|
Term
True or false: a linear correlation proves causation |
|
Definition
false |
|
|
Term
true or false: a linear regression model proves causation |
|
Definition
false |
|
|
Term
residuals
|
Definition
the differences between the observed outcome y and its model predicted value (y^) |
|
|
Term
what assumption needs to be true for linear regression models |
|
Definition
-residuals should be approximately Gaussian distributed -the relationship between x and y is linear -observations are independent -for each value of x, the population values of y are normally distributed |
|
|
Term
what does a linear regression model describe and how |
|
Definition
the relationship between 2 numerical variables by determining a straight line that approximates the data points on a scatter diagram most closely |
|
|
Term
if a data point in a linear regression model has high leverage, what might this imply |
|
Definition
it may be an outlier; any point with leverage greater than 4/n should be investigated |
|
|
Term
Cook's distance
|
Definition
a standardized measure of the change in the parameters of the regression equation if the data point were omitted |
|
|
Term
at what distance according to cooks distance is a point influential |
|
Definition
a Cook's distance greater than 1 is commonly taken to indicate an influential point |
|
|
Term
describe coefficient of determination |
|
Definition
-measures the fit of the regression model -how much variation in the outcome is explained by variation in the predictor variable -it is the square of the correlation coefficient |
|
|
Term
what is the difference between simple and multiple linear regression |
|
Definition
-simple: only one predictor variable -multiple: several predictor variables contribute to the explanation of an outcome in one model |
|
|
Term
what is logistic regression and when is it used |
|
Definition
used when we have a categorical or binary outcome -models the influence of predictor variables -it is an extension of the chi squared/Cochran-Armitage tests between a binary outcome and an ordered predictor variable |
|
|
Term
what are the assumptions of multiple linear regression |
|
Definition
-there is a linear relationship between a response variable and each explanatory variable -residuals are independent (each individual appears once in the sample) -residuals are normally distributed with 0 mean and constant variance |
|
|
Term
what should we do if the regression coefficient in a logistic regression model has a large standard error |
|
Definition
a large standard error suggests possible collinearity among the predictor variables, which should be investigated |
|
|
Term
why is the coefficient of determination not a good measure to compare multiple regression models |
|
Definition
it cannot decrease when more variables are included in the model, so it always favors the larger model |
|
|
Term
what is the adjusted R squared |
|
Definition
can be interpreted as the % variance reduction in the model predicted residuals as opposed to the residuals in the observed data y |
|
|
Term
how can we check the goodness of fit in a multiple regression model |
|
Definition
-check the model assumptions (linearity, Gaussian residuals, variance homogeneity) -check model fit (Wald test p-values; outlier, leverage and influential observations) -compare models (adjusted R squared, ANOVA, AIC) |
|
|
Term
AIC
|
Definition
Akaike information criterion -an alternative to R squared used to compare regression models -the model with the lower value is the better fitting model |
|
|
Term
what is logistic regression |
|
Definition
equivalent to the Pearson chi square test; used to investigate the relation of a binary outcome to multiple predictors |
|
|
Term
true or false: the residuals in logistic regression model are Gaussian |
|
Definition
false; unlike linear regression, they are not Gaussian |
|
|
Term
what does the slope of a logistic regression model represent |
|
Definition
the change in the log odds of the outcome for a one unit change in the predictor |
|
|
Term
what is survival analysis |
|
Definition
the outcome of interest is the time from a certain starting point to the occurrence of an event sometimes called "time to event" analysis |
|
|
Term
censoring
|
Definition
in a survival analysis when some animals never experience the outcome of interest |
|
|
Term
what is an uninformative censor |
|
Definition
the probability that an animal is censored is not related to the probability that it experiences the outcome of interest |
|
|
Term
what is an administrative censor |
|
Definition
a form of right censoring: animals enter the study at different times but the study ends at a fixed time, so not all animals are followed for the same amount of time |
|
|
Term
right censoring
|
Definition
in survival analysis when for part of the study population, the time to the event is not known |
|
|
Term
what is interval censoring |
|
Definition
when the exact time to the event is not known, only that it lies within a certain interval |
|
|
Term
what is the Kaplan-Meier estimator and how is it used |
|
Definition
-an estimator of the survival probability -gives the probability of surviving from a start point to a particular point in time -can be used when survival and censoring times are known exactly |
|
|
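A minimal product-limit sketch with hypothetical follow-up times (assumes no tied times; a 0 in `events` marks a censored animal):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimates; events: 1 = event occurred, 0 = censored."""
    s, at_risk, curve = 1.0, len(times), []
    for t, e in sorted(zip(times, events)):
        if e == 1:                       # survival drops only at event times
            s *= (at_risk - 1) / at_risk
            curve.append((t, round(s, 3)))
        at_risk -= 1                     # censored animals leave the risk set
    return curve

# hypothetical follow-up times in days
print(kaplan_meier([5, 8, 11, 12, 15], [1, 1, 0, 1, 0]))
# -> [(5, 0.8), (8, 0.6), (12, 0.3)]
```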
Term
what does the Kaplan-Meier method assume |
|
Definition
that animals censored (lost to follow-up) at a given time have the same survival prospects as those who remain in the study |
|
|
Term
what does the logrank test allow us to do |
|
Definition
we can compare the survival curves of two groups -the test statistic follows a chi square distribution |
|
|