Basic Medical Statistics

Types of Data

the type of data determines the type of statistical analysis used

Categorical Data

data is grouped into categories based on some qualitative trait

Nominal Data

categories are unordered, and no hierarchy is implied
examples include gender, race, nationality

Ordinal Data

order or hierarchy is implied
examples include educational status, severity of illness, cancer stage

Quantitative Data

data is a set of numbers that are counted or measured

Discrete Data

only certain values are possible
represents counted values
examples include test scores, number of operations performed

Continuous Data

data comes from measurements
theoretically, an infinite number of values is possible
examples include age, height, weight, lab values

Data Visualization

Bar Charts

represents categorical data with rectangular bars with heights or lengths proportional to the values that they represent
bars can be plotted vertically or horizontally
one axis shows the specific categories being compared
the other axis shows a counted value
usually displayed with gaps between the bars to make it easier to distinguish them from histograms

Histograms

represents the distribution of continuous numerical data
the range of values is divided up into a number of buckets or bins
the number of values that fall into each bin is then counted
one dimensional plot (only one variable is being plotted)
there is no best number of bins, and different bin numbers can be used to help reveal different features of the data
valuable to reveal the ‘shape’ of the data: symmetric, skewed left, skewed right

Figure: histogram shapes. Top left: skewed left; top right: skewed right; bottom: symmetric.
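The binning step can be sketched in a few lines of Python. This is a minimal illustration, assuming equal-width bins; the blood pressure values and the helper name `histogram_counts` are made up for the example and are not from any particular library.

```python
# Hypothetical systolic blood pressures (mmHg), invented for illustration.
values = [112, 118, 121, 124, 125, 127, 130, 131, 133, 135, 138, 142, 147, 155, 168]

def histogram_counts(data, num_bins):
    """Divide the data range into equal-width bins and count the values in each."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in data:
        # The maximum value would land one past the last bin, so clamp it.
        index = min(int((x - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts

# The same data viewed with two different bin numbers.
counts_4 = histogram_counts(values, 4)
counts_7 = histogram_counts(values, 7)
print("4 bins:", counts_4)
print("7 bins:", counts_7)
```

With 4 bins the long right tail is compressed into two bins; with 7 bins it is spread out and easier to see.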

Pie Charts

the arc length (and area) of each slice is proportional to the quantity it represents
shows relative comparisons, not actual values
if some of the pie slices are small, it can be very difficult to compare them
not useful for comparing large amounts of data

Box Plots

graphically depicts numerical data through their quartiles
the band inside the box represents the 2nd quartile (median)
lines extending from the box (whiskers) indicate variability outside the upper and lower quartiles
the ends of the whiskers can represent the minimum and maximum values of the data, or values within 1.5 IQR of the upper and lower quartiles
used when the data is skewed and would not accurately be represented by the mean and standard deviation
any data not included between the whiskers can be plotted individually as an outlier
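The quantities a box plot draws can be computed directly. The sketch below uses Python's standard `statistics` module and the 1.5 IQR whisker rule; the length-of-stay data are hypothetical, and the "inclusive" quartile method is only one of several conventions, so other software may give slightly different quartiles.

```python
import statistics

# Hypothetical lengths of stay (days), invented for illustration.
data = [2, 3, 3, 4, 4, 5, 5, 6, 7, 8, 21]

# Quartiles: Q1, Q2 (median), Q3.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences at 1.5 IQR beyond the lower and upper quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data values that are still inside the fences.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Anything beyond the whiskers is plotted individually as an outlier.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1, median, Q3:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```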

Scatter Plots

displays values for two variables
the dependent, or response, variable is plotted on the y-axis
the independent variable is plotted on the x-axis

Correlation

scatter plots can be used to see relationships between two variables
when the Y variable tends to increase as the X variable increases, there is a positive correlation between the variables
when the Y variable tends to decrease as the X variable increases, there is a negative correlation between the variables
if there is no clear pattern between the 2 variables, then there is no correlation between the variables
cause and effect is not implied by correlation: there can be many possible explanations for the patterns seen

Descriptive Statistics

Averages

Mean

central value of a discrete set of numbers (‘balance point’)
calculated by adding up all the values and dividing by the number of values
the population mean (μ) is the mean of the entire underlying population
in medicine, since we are usually sampling from the underlying population, the sample mean (x̄) is what is calculated
the mean is greatly affected by outliers

Median

the middle value of the sorted data; separates the data set into two halves
calculation requires the data to be sorted from lowest to highest
much less affected by outlier values than the mean, so it gives a better estimate of a ‘typical’ value
used primarily for skewed data sets

Mode

most frequent value in a data set; i.e. the value most likely to be sampled
used primarily for categorical data
it is possible for a data set to be bimodal or multimodal
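A short sketch of the three averages, using Python's standard `statistics` module. The data are hypothetical; the point is to show how a single outlier moves the mean much more than the median.

```python
import statistics

# Hypothetical lengths of stay (days), with and without one extreme value.
stays = [2, 3, 3, 4, 5]
stays_with_outlier = stays + [40]

mean_plain = statistics.mean(stays)
mean_outlier = statistics.mean(stays_with_outlier)
median_plain = statistics.median(stays)
median_outlier = statistics.median(stays_with_outlier)

print("mean:  ", mean_plain, "->", mean_outlier)
print("median:", median_plain, "->", median_outlier)

# The mode also works for categorical data; multimode returns every most-frequent value.
blood_types = ["A", "O", "O", "A", "B"]  # hypothetical, bimodal
modes = statistics.multimode(blood_types)
print("modes:", modes)
```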

Variance and Standard Deviation

used to characterize how wide the spread of the data is from its central point

Variance (σ²)

calculated in several steps:

subtract the mean (μ) from each data item (X)
square these differences (converts negative values to positive)
sum up all the squared differences and divide by the number of data items (N)
if the data are a sample, so that the sample mean (x̄) is used in place of μ, divide by N-1 instead

variance is always ≥ 0
since it has units of data², variance is difficult to interpret
if all the data items have the same value, then the variance is 0

Standard Deviation (σ)

square root of the variance
most common measure of spread since it has the same units as the data
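The variance steps and the standard deviation can be followed by hand in Python. This is a minimal sketch on hypothetical data; the last lines cross-check the hand calculation against the standard `statistics` module.

```python
import math
import statistics

data = [4, 8, 6, 5, 7]  # hypothetical measurements

# Step 1: subtract the mean from each data item; step 2: square the differences.
mean = sum(data) / len(data)
squared_diffs = [(x - mean) ** 2 for x in data]

# Step 3: sum the squared differences and divide by N (population) or N-1 (sample).
population_variance = sum(squared_diffs) / len(data)
sample_variance = sum(squared_diffs) / (len(data) - 1)

# Standard deviation is the square root of the variance, in the units of the data.
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print("variance (N, N-1):", population_variance, sample_variance)
print("library check:", statistics.pvariance(data), statistics.variance(data))
```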

Standard Score (z)

may be positive or negative
represents the number of standard deviations that a data point is above or below the mean value
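As a sketch, the standard score is the distance from the mean divided by the standard deviation. The reference mean and standard deviation below are hypothetical values chosen for easy arithmetic, not clinical reference ranges.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical reference distribution: mean 14.0, standard deviation 1.5.
z_high = z_score(17.0, mean=14.0, sd=1.5)
z_low = z_score(12.5, mean=14.0, sd=1.5)
print(z_high, z_low)
```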

The Normal Distribution

data is distributed around a central value and is symmetrical
mean = median = mode
many processes in medicine follow a normal distribution: height, intelligence, lab tests
in the normal distribution, 68% of values lie within one standard deviation of the mean; 95% of values are within two standard deviations; 99.7% of values are within 3 standard deviations
in a normally distributed data set, the mean and standard deviation completely describe the data
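The 68%, 95%, and 99.7% figures can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module; `fraction_within` is a helper name introduced here for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {fraction_within(k):.1%}")
```

The exact values are slightly different from the rounded rule of thumb (for example, about 95.4% within two standard deviations).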

Central Limit Theorem (CLT)

given a sufficiently large sample size (commonly taken as ≥ 30), the mean of the sample means from the same population will approach the mean of the population
for large samples, the distribution of means calculated from repeated sampling will approach the normal distribution
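the average of the sample standard deviations will be approximately the standard deviation of the population; the standard deviation of the sample means (the standard error) is the population standard deviation divided by √n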
the value of the CLT is that manageable sample sizes can be used to make accurate predictions about a population
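A small simulation can make the CLT concrete. The sketch below assumes a deliberately skewed (exponential) population, generated with a fixed seed so the run is repeatable; all numbers are simulated, not real data. It compares the mean of many sample means with the population mean, and the spread of the sample means with the standard error (population standard deviation divided by √n).

```python
import random
import statistics

rng = random.Random(42)  # fixed seed for a repeatable illustration

# A skewed, clearly non-normal population with a mean of roughly 10.
population = [rng.expovariate(1 / 10) for _ in range(100_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

# Draw many samples of size 30 and record each sample mean.
sample_size = 30
sample_means = [
    statistics.fmean(rng.sample(population, sample_size)) for _ in range(2000)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.pstdev(sample_means)
expected_se = pop_sd / sample_size ** 0.5

print(f"population mean {pop_mean:.2f}, mean of sample means {mean_of_means:.2f}")
print(f"SD of sample means {sd_of_means:.2f}, sigma / sqrt(n) {expected_se:.2f}")
```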

Confidence Intervals

used when making conclusions on samples of data
defines a range of values that we are fairly sure our true value lies in
the sample should contain at least 30 values
need to decide how confident we want to be – in medicine, usually 95% is chosen
using a table, find the ‘Z’ value for the chosen confidence interval
the ‘Z’ value is the z-score that bounds the chosen area under the normal curve (about 1.96 for 95%)
a narrow confidence interval = more confidence that our sample mean represents the true mean
as the sample size increases, the confidence interval gets smaller (i.e., we are more confident in the data)
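the interval is written as x̄ ± Z × s/√n, where s is the sample standard deviation and n is the sample size; the value after the ± is called the margin of error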
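A confidence interval for a mean can be sketched from summary statistics. The function name `ci_from_summary` and the example numbers are hypothetical; the sketch assumes the normal (Z) approach described above, with the Z value taken from `NormalDist` rather than a printed table.

```python
import math
from statistics import NormalDist

def ci_from_summary(mean, sd, n, confidence=0.95):
    """Return (lower, upper, margin_of_error) for a sample mean using a Z value."""
    # Z value that leaves (1 - confidence) / 2 in each tail, about 1.96 for 95%.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin, margin

# Hypothetical sample: mean 120, standard deviation 12, 36 values.
lower, upper, margin = ci_from_summary(mean=120, sd=12, n=36)
print(f"95% CI: {lower:.2f} to {upper:.2f} (margin of error {margin:.2f})")

# A larger sample with the same mean and SD gives a narrower interval.
_, _, margin_large = ci_from_summary(mean=120, sd=12, n=144)
print(f"margin with n=144: {margin_large:.2f}")
```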

Hypothesis Testing

Null and Alternative Hypotheses

Null Hypothesis (H₀)

this hypothesis is assumed to be true unless the evidence indicates otherwise; it represents the status quo
typically represents a value in a population that can be measured
an experimenter is typically trying to produce enough evidence to reject the null hypothesis

Alternative Hypothesis (H₁)

research question being answered
null and alternative hypotheses are mutually exclusive
a ‘two-sided’ test is designed to demonstrate that the true value is not equal to the value stated in the null hypothesis
a ‘left-sided’ test is designed to demonstrate that the true value is less than the value stated in the null hypothesis
a ‘right-sided’ test is designed to demonstrate that the true value is greater than the value stated in the null hypothesis

Choose an Appropriate Statistical Test to Analyze the Data

a t-test is used when the population standard deviation is unknown or for small sample sizes

Set the Significance Level (α)

probability level for making decisions about the null hypothesis
set by the researcher before examining any data
most common level in medicine is 0.05 or 5%
if the test data are inconsistent with the null hypothesis, then the null hypothesis is rejected

Analyze the Test Data

an appropriate experiment must be designed, and data collected and analyzed
a test statistic must be calculated that can be compared to the null hypothesis
compute the p-value of the test statistic
the smaller the p-value, the stronger the evidence against the null hypothesis

P-Value

is a measure of the strength of the evidence against the null hypothesis
defined as the probability of getting a test statistic at least as extreme as the observed value if the null hypothesis is actually true
since p-value is a probability, it represents an area under the probability curve and its value can be looked up in a table or calculated with computer software
the null hypothesis H₀ is rejected if p-value ≤ α
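As a sketch of how a p-value is obtained as an area under the curve, the example below runs a one-sample z-test. It assumes the population standard deviation is known, which keeps the example within Python's standard library; when it is unknown, a t-test would be used instead. The function name `z_test_p_value` and all numbers are hypothetical.

```python
import math
from statistics import NormalDist

def z_test_p_value(sample_mean, null_mean, population_sd, n, tail="two-sided"):
    """Return (z, p) for a one-sample z-test; tail is 'two-sided', 'left' or 'right'."""
    z = (sample_mean - null_mean) / (population_sd / math.sqrt(n))
    cdf = NormalDist().cdf
    if tail == "left":
        p = cdf(z)                    # area to the left of z
    elif tail == "right":
        p = 1 - cdf(z)                # area to the right of z
    else:
        p = 2 * (1 - cdf(abs(z)))     # area in both tails
    return z, p

alpha = 0.05  # significance level, set before looking at the data

# Hypothetical study: null mean 120, observed sample mean 124, SD 12, n = 36.
z, p_two_sided = z_test_p_value(124, 120, 12, 36)
_, p_right = z_test_p_value(124, 120, 12, 36, tail="right")
reject_null = p_two_sided <= alpha

print(f"z = {z:.2f}, two-sided p = {p_two_sided:.4f}, right-sided p = {p_right:.4f}")
print("reject H0" if reject_null else "fail to reject H0")
```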

Meaning of P-Value

a low p-value is statistically significant, meaning that there is sufficient evidence to reject the null hypothesis
however, statistically significant does not necessarily mean clinically significant
a low p-value does not tell us how clinically different two treatments are

Error Types

Type I Error (α)

rejection of a true null hypothesis (false positive)
leads one to conclude that an effect exists when it in fact doesn’t
reducing type I errors requires reducing the preset significance level α
for a fixed sample size, reducing type I errors leads to increasing type II errors
confidence level = 1 - α

Type II Error (β)

failure to reject a false null hypothesis (false negative)
related to the statistical power of the test (1 – β)
ways to reduce type II errors include increasing the sample size or relaxing the α level
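The trade-off between the two error types can be explored with a simulation. The sketch below repeatedly runs a two-sided z-test on simulated normal data with a known standard deviation; the effect size, sample sizes, and the helper name `rejection_rate` are assumptions made for illustration. When the null hypothesis is true, the rejection rate estimates the type I error rate; when it is false, the rejection rate estimates the power (1 – β).

```python
import math
import random
from statistics import NormalDist, fmean

def rejection_rate(true_mean, null_mean=0.0, sd=1.0, n=30, alpha=0.05,
                   trials=2000, seed=1):
    """Fraction of simulated experiments in which a two-sided z-test rejects H0."""
    rng = random.Random(seed)  # fixed seed for a repeatable illustration
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        z = (fmean(sample) - null_mean) / (sd / math.sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

# Null hypothesis true: every rejection is a type I error (rate should be near alpha).
type_1_rate = rejection_rate(true_mean=0.0)

# Null hypothesis false (true mean 0.5): rejections are correct, misses are type II errors.
power_n30 = rejection_rate(true_mean=0.5, n=30)
power_n60 = rejection_rate(true_mean=0.5, n=60)
power_strict = rejection_rate(true_mean=0.5, n=30, alpha=0.01)

print(f"type I error rate: {type_1_rate:.3f}")
print(f"type II error rate, n=30: {1 - power_n30:.3f}")
print(f"type II error rate, n=60: {1 - power_n60:.3f}")
print(f"type II error rate, n=30, alpha=0.01: {1 - power_strict:.3f}")
```

In this setup, the larger sample lowers the type II error rate, while the stricter α raises it.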

Linear Regression

linear approach to modeling the relationship between a dependent (response) variable (y) and one or more explanatory (predictor) variables (x)
simple linear regression involves only one predictor variable
multiple linear regression involves two or more predictor variables
used for predicting a 'y' value if the 'x' value(s) are known

Best Fitting Line

basic idea is to find a line that best fits the data
best fit line minimizes the differences in the ‘y’ direction
linear regression line equation: y = bx + a
a is the y intercept, and b is the slope
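A minimal sketch of fitting the line y = bx + a by ordinary least squares, which minimizes the squared differences in the ‘y’ direction. The data points and the function name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Return (b, a): the least-squares slope and y intercept for y = b*x + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept: the line passes through (mean_x, mean_y)
    return b, a

# Hypothetical data lying close to y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [2.9, 5.2, 6.8, 9.1, 11.0]

b, a = fit_line(xs, ys)
predicted = b * 6 + a  # predict y for a new x value of 6
print(f"slope b = {b:.2f}, intercept a = {a:.2f}, prediction at x=6: {predicted:.2f}")
```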

Correlation Coefficient (r)

numerical measure of correlation between 2 continuous variables
values range from +1 (strongest positive correlation) to -1 (strongest negative correlation)
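The correlation coefficient can be sketched as the Pearson formula: the co-variation of x and y divided by the product of their individual variations. The function name `pearson_r` and the data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
r_positive = pearson_r(xs, [2, 4, 6, 8, 10])   # y rises exactly with x
r_negative = pearson_r(xs, [10, 8, 6, 4, 2])   # y falls exactly with x
r_weak = pearson_r(xs, [3, 1, 4, 1, 5])        # no clear pattern

print(r_positive, r_negative, round(r_weak, 2))
```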

Coefficient of Determination (R²)

proportion of the variance for a dependent variable that is explained by an independent variable or multiple variables in a regression model
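The "proportion of variance explained" can be sketched as one minus the ratio of unexplained variation (residuals around the fitted line) to total variation (around the mean of y). The data below are hypothetical and the fit is a simple least-squares line with one predictor.

```python
# Hypothetical data lying close to a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.9, 5.2, 6.8, 9.1, 11.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope and intercept for y = slope * x + intercept.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Variation left unexplained by the line vs. total variation in y.
ss_residual = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
ss_total = sum((y - mean_y) ** 2 for y in ys)

r_squared = 1 - ss_residual / ss_total
print(f"proportion of variance explained: {r_squared:.4f}")
```

With a single predictor, this value equals the square of the correlation coefficient r.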
