Basic Biostatistics


“Statistics is a curious amalgam of mathematics, logic and judgement” – Douglas Altman

Health professionals (particularly doctors) often have a deep-rooted fear of statistics. It is commonly held that statistics is an abstract science with no real-life applications and that one has to have an IQ of around 140 or be a maths whiz to be able to understand statistics! The problem is often worsened by the fact that hardly any statistics is taught during undergraduate medical training.

The truth, as a matter of fact, could hardly be more different. The concepts underlying basic statistics are embarrassingly simple and can be learnt and used by anyone who knows school-level arithmetic! Statistics is a discipline that can be understood using common sense and its power and beauty are appreciated only when it is successfully used to handle real life research problems.


Statistics is the science and art of collecting, summarizing, and analysing data that are subject to random variation (Last, 1995). Biostatistics is the application of statistics to biological problems. Data refers to a collection of items of information. A variable is any quantity that varies: any attribute, phenomenon, or event that can have different values.

Uses of statistics

Statistics is an indispensable tool in epidemiology. All epidemiological studies rely on the quantification of health and disease events in populations. Data collected in epidemiological studies usually involve several observations on several variables. Analysing and presenting such large volumes of raw data can be very cumbersome and painful. Fortunately, statistics, a very powerful tool, comes to the rescue because it allows us to:

* Describe large data sets using only a few numbers (like mean, range, etc.). This is called descriptive statistics. Descriptive statistics are used to present and summarise data in a form that permits the clearest presentation of the most information. For instance, the important aspects of the census data of a country can be summarized in just a few numbers like total population, sex ratio, age distribution, etc.

* Generalize the results of a small sample to the larger population from which the sample is drawn (extrapolation). For example, to find out the immunization coverage of a district, not all the children in the district have to be surveyed – one could take a random sample and still estimate the coverage with good accuracy.

* Compare different variables and test an underlying hypothesis. This is called inferential statistics. For example, is the observed difference between cure rates for two treatment groups in a clinical trial due to a real difference in the treatments or is it due to sheer chance? A statistical test of significance will be needed to answer this.
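The second use above, estimating a population value from a random sample, can be illustrated with a small simulation. This is only a sketch: the district size, the true coverage of 72%, the sample size of 500 and the function name `estimate_coverage` are all assumptions made up for the illustration.

```python
import random

def estimate_coverage(population, sample_size, seed=0):
    """Estimate the proportion immunized from a simple random sample."""
    rng = random.Random(seed)                     # fixed seed so the run is repeatable
    sample = rng.sample(population, sample_size)  # sampling without replacement
    return sum(sample) / sample_size

# Hypothetical district: 20,000 children, 1 = immunized, 0 = not immunized.
population = [1] * 14400 + [0] * 5600
true_coverage = sum(population) / len(population)

estimate = estimate_coverage(population, sample_size=500)
print(f"true coverage = {true_coverage:.3f}, sample estimate = {estimate:.3f}")
```

Surveying 500 children instead of 20,000 would be expected to give an estimate within a few percentage points of the true coverage, which is the sense in which a sample can stand in for the whole population.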

Describing Data

Types of measurements

The basic building blocks of any study are the data – the measurements which describe the factors being studied. There are four types of measurements: nominal, ordinal, interval and ratio.

Nominal variables are observations which can be classified into one of a number of mutually exclusive categories. For instance, a person can have only one of 4 blood groups (A, B, AB & O). Data of this type are also known as categorical or discrete data. An important feature of categorical data is that the different categories are in no sense better or worse than one another. They are simply different from one another. One cannot, therefore, find out an average for categorical data; it would be nonsensical to compute an average blood group! Sex is another nominal variable.

Ordinal variables are slightly more sophisticated measures than nominal data. They involve placing an individual into one of a number of ordered categories. For example, a patient’s disease condition can be described as mild illness, moderate illness or severe illness. There is some gradient in ordinal measurements, but the magnitude of difference between mild illness and moderate illness may not be the same as the magnitude of difference between moderate illness and severe illness. As with nominal data, it is difficult to find averages of ordinal data; it is difficult to compute an average level of illness!

The most sophisticated measures are those where individuals are placed on a continuous scale in which the distance between two measurements is well defined. Ratio and interval scales are thus called continuous data. For example, weight can be measured in kilograms (54 kg, 55 kg and so on), but further refinement is possible if weight is measured in kilograms and grams (e.g. 55 kg and 200 grams). Most biological variables are continuous variables (e.g. blood pressure, serum cholesterol, height, weight, pulse rate etc.). Continuous variables can be collated and averages can be computed (e.g. mean systolic BP).
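As a rough illustration of how the type of measurement decides which summaries make sense, the sketch below uses Python's standard library on three small, made-up data sets (the values and variable names are hypothetical).

```python
from collections import Counter
from statistics import mean, mode

blood_groups = ["A", "O", "B", "O", "AB", "O", "A"]        # nominal
severity = ["mild", "severe", "moderate", "mild", "mild"]  # ordinal
weights_kg = [54.0, 55.2, 61.5, 58.3]                      # continuous

# Nominal data: counts and the most frequent category are meaningful,
# but an "average blood group" is not.
group_counts = Counter(blood_groups)
most_common_group = mode(blood_groups)

# Ordinal data: counts can be reported in their natural order.
severity_order = ["mild", "moderate", "severe"]
counts = Counter(severity)
severity_table = [(level, counts[level]) for level in severity_order]

# Continuous data: an arithmetic mean is meaningful.
mean_weight = mean(weights_kg)

print(group_counts, most_common_group)
print(severity_table)
print(round(mean_weight, 2))
```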

Why is it important to know what type of variable is being measured? Continuous variables and categorical variables are treated very differently and need different statistical tests for their analysis. Continuous variables are analysed using parametric tests (those that assume an underlying normal distribution). Examples of parametric tests are the Z test and Student’s t test. Categorical variables are analysed using non-parametric tests (those that do not assume a normal distribution). Examples of non-parametric tests are the Chi-square test and Fisher’s exact test. For example, if the effect of sex on mortality is being studied (both categorical variables), then the appropriate statistical test would be a non-parametric test like the Chi-square test. Parametric tests like the Z test or t test cannot be used.
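For the sex and mortality example, a Chi-square test on a 2 x 2 table might be sketched as follows. The counts are invented for illustration, the function name `chi_square_2x2` is hypothetical, and the calculation is the plain Pearson statistic without a continuity correction. The P value uses the fact that a Chi-square variable with 1 degree of freedom is the square of a standard normal variable.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    # Expected count in each cell = row total * column total / grand total.
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper-tail probability for 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical data: rows = sex (male, female), columns = (died, survived).
chi2, p = chi_square_2x2(30, 70, 15, 85)
print(f"chi-square = {chi2:.2f}, P = {p:.3f}")
```

A small P value would suggest that a difference in mortality this large is unlikely to arise by chance alone if sex and mortality were unrelated.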

Summarizing data

One of the most useful aspects of statistics is its ability to summarize large data sets using only a few numbers. There are two commonly used measures for summarizing continuously distributed data:

  1. Measures of central tendency
  2. Measures of dispersion

Measures of central tendency

These are commonly called the averages. The mean, median and mode are measures of central tendency. They aim to describe the typical value of a data set.

The mean is simply the average of all the individual observations ($\bar{x} = \sum x / n$), where $\sum x$ is the sum of all the individual values and $n$ is the total number of observations. If all the values were lined up in ascending or descending order, the median would lie exactly in the middle. If 100 observations were lined up, the median would fall midway between the 50th and 51st values (by convention, their average); with 101 observations it would be the 51st value. The mode is the most frequently repeated value in the data.
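The three averages can be computed with Python's `statistics` module. The ten values below are made up for the illustration.

```python
from statistics import mean, median, mode

values = [4, 8, 6, 5, 3, 8, 9, 5, 8, 4]   # hypothetical observations

avg = mean(values)     # sum of the values divided by their number
mid = median(values)   # middle of the sorted values; with an even count,
                       # the average of the two middle values
most = mode(values)    # most frequently repeated value

print("sorted:", sorted(values))
print("mean =", avg, "median =", mid, "mode =", most)
```

Here the sorted values are 3, 4, 4, 5, 5, 6, 8, 8, 8, 9, so the mean is 60/10 = 6, the median is the average of 5 and 6, and the mode is 8.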

The mean is the most commonly used measure of central tendency. However, if the data have extreme values (very high or very low values produce skewness in the data), such values can pull the mean towards one side and make it less useful. The median is less influenced by extreme values and this makes it useful in certain epidemiological studies where skewed distributions are studied.
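A minimal sketch of how a single extreme value affects the two measures, using invented lengths of hospital stay in days:

```python
from statistics import mean, median

stay_days = [2, 3, 3, 4, 4, 5, 5]   # hypothetical lengths of stay
with_outlier = stay_days + [60]     # one patient with a very long stay

mean_before, median_before = mean(stay_days), median(stay_days)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(f"without outlier: mean = {mean_before:.2f}, median = {median_before}")
print(f"with outlier:    mean = {mean_after:.2f}, median = {median_after}")
```

The single long stay pulls the mean from under 4 days to over 10 days, while the median stays at 4 days.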

Measures of dispersion

Describing or summarizing a data set using only measures of central tendency is often not enough to capture the complexity of the data. Though the averages give us some idea about the central value of the data, they do not tell us much about how other values are dispersed around this central value. For example, two data sets may have the same mean but the distributions around them may be very different. In other words, we need to know the spread of the data around the central value. Measures of dispersion tell us just that. The most important measures are range, variance and standard deviation.

Range indicates the distance between the highest and lowest numbers in the data. Though it tells us the overall extent of the spread of values, it depends only on the two most extreme observations and does not really give us a good idea of the shape of the distribution.

Variance ($s^2$) is the mean square deviation. The mean of the entire set ($\bar{x}$) is subtracted from each value in the data set ($x$). The resulting differences are squared, $(x - \bar{x})^2$, and the sum of all the squares, $\sum (x - \bar{x})^2$, is divided by the total number of observations ($n$) to give the variance. Standard deviation ($s$) is nothing but the square root of the variance.

$$s^2 = \frac{\sum (x - \bar{x})^2}{n}$$

Both the variance and standard deviation tell us how widely or closely values are dispersed around the central value (mean). Suppose the mean weight of a group of infants is 8 kg and the standard deviation is 5 kg; this shows that there is a wide dispersion of values around the mean. On the other hand, if the mean is 8 kg and the standard deviation is 1 kg, then it shows that most values are closely packed around the mean value. Note that dividing by n in the above formula averages the squared deviations; it does not make the standard deviation shrink as the sample grows. What decreases as larger samples are studied is the uncertainty of the estimated mean (the standard error, $s/\sqrt{n}$).
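The sketch below applies the variance formula from the text (dividing by n) to two hypothetical sets of infant weights that share the same mean but differ in spread. The data and the helper name `variance_by_formula` are assumptions for the illustration. Python's `statistics.pvariance` and `statistics.pstdev` use the same divide-by-n definition, whereas `statistics.variance` divides by n - 1.

```python
from statistics import mean, pstdev, pvariance

def variance_by_formula(values):
    """Sum of squared deviations from the mean, divided by n (as in the text)."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / len(values)

tight = [7, 8, 8, 8, 9]    # hypothetical weights in kg, closely packed
wide = [2, 5, 8, 11, 14]   # hypothetical weights in kg, widely spread

for name, data in (("tight", tight), ("wide", wide)):
    print(name,
          "mean =", mean(data),
          "range =", max(data) - min(data),
          "variance =", variance_by_formula(data),
          "sd =", round(pstdev(data), 3))
```

Both sets have a mean of 8 kg, but the variance is 0.4 for the first and 18 for the second, which the mean alone would not reveal.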

Patterns in data

Every researcher faces a fundamental problem, that of variation. Any set of data involves variation. If the serum cholesterol levels of 100 people are measured, many different values will be obtained. How does one deal with such variation? Luckily for us, most biological measurements fit into well-understood, mathematical patterns. One such pattern is the Normal or Gaussian distribution. Other less important patterns are the Binomial and Poisson distributions. The normal distribution will be described in greater detail here.

Normal distribution

If we were to collect data on say, serum cholesterol, for a large number of people and depict it as a line diagram, it would look somewhat like the figure shown.

Most individuals will have values close to 200; very few will have values at either extreme. Most continuous variables are distributed like this. If large enough numbers are studied, the pattern will smooth out to produce a symmetrical, bell-shaped curve called the normal distribution. In its standardized form (mean 0, standard deviation 1), the normal distribution is also called the Z (Zee) distribution.

Why is the normal distribution important? The normal curve is actually a mathematical, abstract curve. All its properties have already been worked out. We know that its mean, median and mode are the same and that the area under the normal curve is 1. We also know that if 1 standard deviation is marked off on either side of its mean, the area it encompasses will include roughly 68% of all the observations in the data set. If 1.96 standard deviations are taken, 95% of the observations are included in the area under the curve. This knowledge of the normal curve allows us to use it in statistics. If our data set is large enough, we assume that the data are normally distributed and apply all the above ideas to the data. It is this assumption that allows us to compute well-known numbers like the P value and 95% confidence intervals. It is also the basis for tests of significance which are called parametric tests (tests used for continuous data that assume an underlying normal distribution, like the Z test and Student’s t test).
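The 68% and 95% figures can be checked numerically with `statistics.NormalDist` from Python's standard library (Python 3.8 or later). In the sketch below, the cholesterol mean of 200 follows the example in the text, while the standard deviation of 30 is an assumed value chosen only for illustration.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal curve: mean 0, standard deviation 1

# Area under the curve within 1 and within 1.96 standard deviations of the mean.
within_1sd = z.cdf(1) - z.cdf(-1)
within_196sd = z.cdf(1.96) - z.cdf(-1.96)
print(f"within 1 SD:    {within_1sd:.4f}")
print(f"within 1.96 SD: {within_196sd:.4f}")

# Hypothetical serum cholesterol distribution: mean 200, SD 30 (assumed).
chol = NormalDist(mu=200, sigma=30)
low = chol.mean - 1.96 * chol.stdev
high = chol.mean + 1.96 * chol.stdev
covered = chol.cdf(high) - chol.cdf(low)
print(f"about {covered:.0%} of values lie between {low:.1f} and {high:.1f}")
```

The same proportions hold for any normal curve, whatever its mean and standard deviation, which is why the standard deviation is such a convenient unit for describing spread.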

References & further reading

  1. Last JM (Ed.). A Dictionary of Epidemiology, 3rd Edition. Oxford University Press, 1995.
  2. Hill AB, Hill ID. Bradford Hill’s Principles of Medical Statistics, 12th edition (Indian). Delhi: B.I. Publications, 1993.
  3. Hassard TH. Understanding Biostatistics. St. Louis: Mosby Year Book, 1991.
  4. Altman DG, Bland JM. Statistics Notes: The normal distribution. BMJ 1995;310:298.
  5. Bland M. An Introduction to Medical Statistics. Oxford University Press, 1987.
  6. Swinscow TDV. Statistics at Square One. British Medical Journal, 1981.
  7. Altman DG. Practical Statistics for Medical Research. London: Chapman and Hall, 1991.
  8. Guyatt GH et al. Basic Statistics for Clinicians: 1. Hypothesis testing. Can Med Assoc J 1995;152:27-32.
  9. Guyatt GH et al. Basic Statistics for Clinicians: 2. Interpreting study results: Confidence intervals. Can Med Assoc J 1995;152:169-173.
  10. Jaeschke R et al. Basic Statistics for Clinicians: 3. Assessing the effects of treatment: Measures of association. Can Med Assoc J 1995;152:351-357.
  11. Guyatt GH et al. Basic Statistics for Clinicians: 4. Correlation and regression. Can Med Assoc J 1995;152:497-504.

Dr. Madhukar Pai MD, DNB
Consultant, Community Medicine & Epidemiology