Statistical Computing: an introduction


What is statistical computing?

Statistical computing is, in essence, the analysis of data using computers. The field is relatively recent. With widespread access to computers, data processing has become an easy task. Even dummies like doctors can process data these days! Not only has data analysis become easy, we can also perform advanced statistical analysis (like multivariate analysis) with the utmost ease. Modern epidemiological research, therefore, has come to depend heavily on computers and statistical software packages. Computers are particularly useful if one has to perform the same task several times, if the data set is large, or if the computations required are intensive and complex.

Statistical Software

While database managers like dBASE and FoxPro can easily be used for creating large databases, sorting, and generating reports, complex statistical analyses require more specialised software packages like Epi Info, EpiStat, SPSS for Windows, SAS, BMDP, MINITAB, EGRET, STATA, etc. These packages differ in their complexity and ease of use. Some of them work in the DOS environment while others (the more recent ones) work in the Windows environment. To help researchers choose the right package, The American Statistician, a journal of the American Statistical Association, periodically publishes reviews of the various software packages. Indeed, it has a regular section on “Statistical Computing.”

With so many packages on the menu, how does one choose the best? Which package should a beginner use? While packages like SPSS and SAS are very powerful and great for working with large data sets, they are not very user-friendly [the more recent Windows versions are definitely more user-friendly!]. One look at the fat manuals that come along with these packages is enough to make a beginner sweat! Also, these packages cost the earth. In fact, it is virtually impossible for many individuals to buy them. Institutions tend to buy these packages on lease. It should also be kept in mind that many of these packages are designed for pure statistical and social research and are not specifically tailored for health or medical research.

Recognising the need for a simple yet comprehensive statistical package for health researchers, the Centers for Disease Control (CDC), Atlanta, along with the World Health Organization, have created a software package called Epi Info. Epi Info is fast becoming one of the packages most commonly used by health researchers. The fact that it is specifically designed for health research makes it very useful for medical researchers and doctors. More importantly, Epi Info is in the public domain (it is available free and any number of copies can be made and used by anyone), and is therefore accessible to everyone. Epi Info is indeed a godsend for the student researcher who barely manages to keep body and soul together on a research stipend! Software like SPSS and SAS, on the other hand, is very expensive.

Epi Info is comprehensive because it contains a word processor, a database manager and statistical software, all in one package. Data can be both entered and analysed using Epi Info. Epi Info also has other unique features like outbreak investigation and disease surveillance modules. It is one of the most user-friendly statistical packages. It even has a built-in manual! Before learning Epi Info, one first needs an orientation to statistics, types of data and the basics of database management systems.

Levels of data analysis

In statistical analysis, the first level of analysis of any data set is describing the data set using simple descriptive statistics like proportion, mean, standard deviation, etc. Graphic display of the data is also a part of the first level of analysis.
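As a rough illustration of this first level, the sketch below computes a mean, a standard deviation and a proportion with Python's standard library. The ages and sex codes are made-up values chosen for illustration, and the helper name `proportion` is hypothetical.

```python
import statistics

# Hypothetical data: ages in years and sex codes (1 = male, 2 = female).
ages = [4, 6, 5, 7, 3, 5]
sexes = [1, 2, 2, 1, 2, 2]

def proportion(values, target):
    """Fraction of values equal to target."""
    return sum(1 for v in values if v == target) / len(values)

mean_age = statistics.mean(ages)
sd_age = statistics.stdev(ages)      # sample SD (n - 1 in the denominator)
prop_female = proportion(sexes, 2)

print(f"mean age = {mean_age:.2f}, SD = {sd_age:.2f}, females = {prop_female:.0%}")
```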

The second level of analysis is called ‘bivariate analysis’, where two variables are analysed at a time. This can be done using 2 X 2 tables, from which P values, relative risks, odds ratios and other measures of association can be computed.
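To make the measures of association concrete, here is a minimal sketch of what can be computed from a single 2 X 2 table. The function name `two_by_two` and the counts are assumptions made for illustration. The chi-square is the uncorrected Pearson statistic, so a package that applies a continuity correction will report a slightly different value.

```python
import math

def two_by_two(a, b, c, d):
    """Measures of association for a 2 x 2 table.

                   outcome +   outcome -
    exposed            a           b
    unexposed          c           d
    """
    n = a + b + c + d
    relative_risk = (a / (a + b)) / (c / (c + d))
    odds_ratio = (a * d) / (b * c)
    # Pearson chi-square without continuity correction (1 degree of freedom).
    chi_square = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail P value for a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return relative_risk, odds_ratio, chi_square, p_value

# Hypothetical counts: 30 of 100 exposed and 10 of 100 unexposed are ill.
rr, odds, chi2, p = two_by_two(30, 70, 10, 90)
print(f"RR = {rr:.2f}, OR = {odds:.2f}, chi-square = {chi2:.2f}, P = {p:.4f}")
```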

The most sophisticated analysis would involve multivariable analysis. This includes statistical models like multiple regression, logistic regression, survival analyses, etc. These models should not be attempted until the more basic levels of analyses are completed.

Basic principles of database management systems

An understanding of the distinction between data and information is important to appreciate the concepts behind database management systems (DBMS). Data can be understood as a collection of “facts,” but without any inherent associations or patterns among them. Information, on the other hand, is the end result of data that has been ordered, organised and analysed. When data is processed and analysed to bring out meaningful patterns, the result is information.

Over the last few decades, database management systems have evolved out of a need to generate useful information from data. Some of the best known DBMS packages are dBASE IV, FoxPro, etc. These can run on computers using DOS. Windows-based packages include ACCESS, APPROACH, Visual FoxPro, etc. These days DBMS packages are being rapidly replaced by RDBMS [relational database management system] packages like Oracle and SQL Server. These are really high-end software and are not meant for the individual user.

Structure of a database: The most important part of any DBMS is the database file. Each database file can have up to a billion data records. A data record (or simply a record) is a logical collection of information about a single item that you want to keep track of. For instance, all the clinical and lab results of a particular patient could be contained in one record. Each record contains data about one patient. Within each record you would store information such as the name of the patient, age, sex, hospital number, diagnosis, clinical findings, lab values, etc. Each piece of information is stored in a data field. Thus, a database is a file that contains records, with each record containing fields of information. DBMS files are usually stored with a “.DBF” extension. This helps in distinguishing database files from ordinary text files (which have extensions like *.doc or *.txt).
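The file, record and field hierarchy can be pictured with a small sketch. The class name `PatientRecord`, its field names and the two patients are hypothetical; a real .DBF file stores the same idea in a binary layout that is not shown here.

```python
from dataclasses import dataclass, fields

@dataclass
class PatientRecord:
    """One record: all the data kept about a single patient."""
    hospital_no: str   # each attribute is one field
    age: int
    sex: int           # 1 = male, 2 = female
    diagnosis: str

# A database file is then a collection of such records.
database = [
    PatientRecord("H001", 34, 1, "malaria"),
    PatientRecord("H002", 27, 2, "typhoid"),
]

field_names = [f.name for f in fields(PatientRecord)]
print(len(database), "records with fields:", field_names)
```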

Cleaning a database: Once data is entered into a database, processing begins with “cleaning” of the database. Cleaning means checking for data entry errors and inconsistencies and correcting them. If the same record has been entered twice (duplicate entry), such errors have to be picked up during cleaning. Errors can be picked up by doing a “double entry”: the same data set is entered by two people and the two copies are compared; fields that do not match are obvious errors. Another way of picking up errors is for one person to read the entered data aloud while a second person verifies it against the paper questionnaires.
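The two cleaning checks described above, duplicate detection and double-entry comparison, can be sketched as follows. The function names, the `id` key and the sample records are all assumptions for illustration; this is not how any particular package implements the check.

```python
def find_duplicates(records, key):
    """Return the key values that occur more than once."""
    seen, dupes = set(), []
    for rec in records:
        k = rec[key]
        if k in seen and k not in dupes:
            dupes.append(k)
        seen.add(k)
    return dupes

def compare_entries(first, second, key):
    """Compare two independently keyed copies of the same data set.

    Returns (key, field, value_in_first, value_in_second) for every mismatch.
    """
    second_by_key = {rec[key]: rec for rec in second}
    mismatches = []
    for rec in first:
        other = second_by_key.get(rec[key])
        if other is None:
            mismatches.append((rec[key], None, "present", "missing"))
            continue
        for field, value in rec.items():
            if other.get(field) != value:
                mismatches.append((rec[key], field, value, other.get(field)))
    return mismatches

# Hypothetical double entry: the second operator mistyped the age of child 2.
entry_a = [{"id": 1, "age": 5, "sex": 1}, {"id": 2, "age": 7, "sex": 2}]
entry_b = [{"id": 1, "age": 5, "sex": 1}, {"id": 2, "age": 1, "sex": 2}]

print(compare_entries(entry_a, entry_b, "id"))
print(find_duplicates(entry_a + [entry_a[0]], "id"))
```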

Epi Info Version 6.04

About Epi Info:

Epi Info is a word-processing, data base, and statistical program for public health on microcomputers. It is now in its 6th version. It requires an IBM-compatible computer running DOS, 512 K of RAM, and at least one floppy drive. In short, it can run very well even on a 386 system! With the latest Pentium systems, one can exploit Epi Info exceptionally well. The year 2000 version of Epi Info will be Windows based (a beta version is now available).

Epi Info comes as 3 floppy disks (as compressed files). It can be either downloaded from the Internet [download from:] or can be obtained from any of the WHO offices [addresses at the end of this module] for free. It can also be freely copied from others who have it. Epi Info data files can consist of as many records as DOS and the disk storage can handle (up to two billion). A questionnaire file can have up to 500 lines (as many variables as will fit into 500 lines).

Programs in Epi Info: The software contains a set of programs, each designed for specific tasks.

  • EPED: This is a word processor for creating questionnaires. Creating questionnaires is the first step in entering and analysing data. EPED can be entered from the main menu. Most of the commands used in EPED are simple DOS commands.
  • ENTER: This program produces a data file automatically from a questionnaire created in EPED. It creates a file which has blanks into which data is entered. Once data is entered, the file can then be analysed using the ANALYSIS program.
  • ANALYSIS: This program produces lists, frequencies, cross tabulations and several other statistical information (like p value, odds ratios, relative risks, confidence limits, etc.) by analysing the data set.
  • CHECK: Data entry errors can be minimised by using the CHECK program. For example, if a given field can have values only between 1 and 5, then the CHECK program can be used to deny the entry of any number other than 1 to 5.
  • CSAMPLE: This performs analysis of data from surveys of complex design (cluster design, multistage stratified sampling, etc.).
  • STATCALC: This calculates statistics like Chi-square p value from 2 X 2 (or n X n) tables entered directly from the keyboard.
  • EPITABLE: This is an epidemiological calculator. Statistical tests of significance like t-test, Chi-square test, test of proportions, etc. can be easily performed. Epitable also allows computation of a wide range of measures like odds ratios, relative risks, sensitivity & specificity of a test, incidence rates, etc.
  • EPINUT: This is a program for nutritional anthropometry (heights, weights, mid-arm circumference, etc.) data and provides indices like height for age, weight for age, etc.
  • EXPORT: This program exports data files from Epi Info into 12 other formats (like dBASE, Excel, etc.)
  • IMPORT: Brings in files from other programs (like dBASE, Excel, etc.) so that they can be analysed using Epi Info.
  • MERGE: This program merges files. This allows combining of data files entered on different computers.
  • VALIDATE: Compares two Epi Info files entered by different operators and reports any discrepancies.
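The kind of range check that CHECK enforces during entry can be illustrated in a general-purpose language. The sketch below is not CHECK syntax; the rule table `RANGES`, the field names and the function names are hypothetical.

```python
# Hypothetical rules: field name -> (lowest legal value, highest legal value).
RANGES = {"score": (1, 5), "age": (0, 99)}

def check_value(field, value, ranges=RANGES):
    """Return True if value is legal for field; fields with no rule always pass."""
    if field not in ranges:
        return True
    low, high = ranges[field]
    return low <= value <= high

def check_record(record, ranges=RANGES):
    """List the fields of a record whose values fall outside the legal range."""
    return [f for f, v in record.items() if not check_value(f, v, ranges)]

print(check_record({"score": 7, "age": 12, "sex": 1}))
```

Rejecting an illegal value at the moment of entry is cheaper than finding it later during cleaning, which is why the check sits between the questionnaire and the data file.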

In addition to the above, there are tutorials and help files. The entire manual of the software is available within the program and can be read from the screen.

Levels of sophistication in Epi Info

There are three levels of facilities in Epi Info for processing data.

1. At the simplest level, you can run the main menu and create a new questionnaire using EPED, the word processor. Data can then be entered using the ENTER program. Using the ANALYSIS program, simple analysis like producing frequency counts, lists, graphs, means, etc. can be done.

2. At the next higher level, additional operations like selecting records, recoding variables and carrying out conditional operations using the IF command can be done. Error checking during data entry can be done using the CHECK program. Files can be imported from and exported to other systems like dBASE, Lotus 1-2-3, Excel, etc.

3. At the third level, programs can be written for mathematical operations, logical checks, customizing entry, etc. Customised reports can be generated from ANALYSIS program. Data entry can be validated using double entry and merging.

To analyse data using statistical packages, the data has to be entered first. There are 3 ways of doing this:

  • Create a database using a DBMS like dBASE, import it into Epi Info and start the analysis. Epi Info has an “IMPORT” program which allows it to import files with a *.DBF extension.
  • Enter data into a spreadsheet package like Excel or Lotus 1-2-3, import it into Epi Info and start the analysis.
  • Enter data directly into Epi Info and analyse. Epi Info has a built-in DBMS which allows this.

Creating a questionnaire and entering data: Data can be entered in Epi Info using the following steps:

1. From the main menu, open the EPED program. EPED is a word processing program.

2. Within EPED, create a questionnaire file that contains all the variables that have to be a part of the data set. Essentially, field types can be numeric (only numbers are accepted in these fields), string (alphanumeric), character (only letters are accepted), date (accepts dates: day, month and year) and Yes/No (accepts only Yes or No). In all of the above, the width of each field has to be fixed by the researcher. It is very useful to enter data as numeric fields. Text fields can rarely be analysed.

Here is an example of a small questionnaire on the nutritional status of children. The comments in square brackets are given only for explanation; they do not form part of the questionnaire.

Name ___________ [12 characters]

Age ## [actual age]

Sex # [male=1, female=2]

Date of birth <dd/mm/yy>

Height ### [in cm]

Weight ##.## [in kg]

Malnourished <Y> [yes or no]

3. Having created the questionnaire, save it using a unique file name which ends in a *.QES extension.

4. Exit the EPED program and from the main menu, enter the ENTER data program.

5. In the ENTER data program, give the name of the file which is going to receive the data. This file has to have a *.REC extension. Then, give the *.QES file that you have already created. Choose the option to create a data file from a new *.QES file.

6. ENTER data program will then create a new data entry file with blank fields.

7. Start entering data in the blank fields. At the end of the record, save the record. The next blank record will then automatically come up. Continue till all records have been entered.

8. Check for data entry errors. If clean, start analysis by running the ANALYSIS program from the main menu.

Analysis of data

Once data has been entered, it will be stored as a *.REC file. This REC file is unique to Epi Info. For analysis, follow these steps:

1. From the main menu, enter the ANALYSIS program. At the command prompt EPI6> below the window, type <READ>. This will open up a box which lists all the *.REC files. Choose the file to be analysed by using the arrow keys and hit <ENTER>. Type <BROWSE> to see the data set.

2. The first step in data analysis is to scan the data visually to get an idea of what it looks like. A “line listing” is helpful for this. To produce a line listing of all the data within the database, type: EPI6>LIST.

3. The next step in analysis is to generate frequencies. How many males and females are there in the data set? How many ill people are there in the data set? To generate all these, type EPI6>FREQ *. If the name of a variable is specified, the frequency of only that variable will be shown. For example, if only the sex distribution is needed, type EPI6>FREQ SEX.

4. The next step, after all the frequencies are generated, is to show the data graphically. Graphs that can be generated in Epi Info are bar diagrams, pie diagrams, scatter plots, line diagrams and histograms. For example, to generate a bar diagram of the age distribution, type EPI6>BAR AGE. To show the same data as a pie diagram, type EPI6>PIE AGE.

5. Cross tabulations of two or more variables can be done using the TABLES command. For example, are more males ill than females? To do this cross tabulation, type EPI6>TABLES SEX ILL. The exposure factor (in this case sex) should precede the outcome (illness). This will produce a table and will also automatically generate statistics like the Chi-square P value, odds ratios and relative risks.

6. More sophisticated analysis will involve creating new variables by changing old ones, recoding data to produce new data, selecting only specific records and analysing them, etc.

7. Printing the results of analysis is simple: press <F5> and the output of the analysis will directly go to the printer. Pressing <F5> again will route the output to the screen.
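For readers who want to see what the frequency and cross-tabulation steps above amount to, here is a rough analogue in a general-purpose language. It is not Epi Info code; the records, the variable names SEX and ILL, and the helpers `freq` and `crosstab` are hypothetical.

```python
from collections import Counter

# Hypothetical records; SEX: 1 = male, 2 = female; ILL: "Y" or "N".
records = [
    {"SEX": 1, "ILL": "Y"}, {"SEX": 1, "ILL": "N"}, {"SEX": 1, "ILL": "Y"},
    {"SEX": 2, "ILL": "N"}, {"SEX": 2, "ILL": "Y"}, {"SEX": 2, "ILL": "N"},
]

def freq(rows, var):
    """Frequency count of one variable (a one-way table)."""
    return Counter(row[var] for row in rows)

def crosstab(rows, exposure, outcome):
    """Counts for each (exposure, outcome) pair (a two-way table)."""
    return Counter((row[exposure], row[outcome]) for row in rows)

sex_counts = freq(records, "SEX")
table = crosstab(records, "SEX", "ILL")

print("SEX:", dict(sex_counts))
for sex in (1, 2):
    print(f"SEX={sex}: ill={table[(sex, 'Y')]}, not ill={table[(sex, 'N')]}")
```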

What Epi Info can’t do

Epi Info can handle most of the basic data analysis needed in research. It cannot, however, perform very complicated statistical manipulations like multivariate analysis (logistic regression, survival analysis, etc.). For these requirements, it is best to use high-end software like SPSS, SAS, or STATA.

Ten useful tips to make your life easier!

The key to mastering Epi Info, indeed, any statistical package, is familiarity. Repeatedly using the same package breeds familiarity and familiarity breeds success! It is best to experiment with small data sets and learn to exploit each program and function within the package. Here are some useful tips:

1. While learning to use a package, always enter the data yourself. Nothing teaches you more than the mistakes you make while learning!

2. Try to avoid entering data as string or text variables: they are almost impossible to analyse!

3. Try to convert most variables into numeric fields. For example, sex can be keyed in as M or F, but it is better to key it in as 1 and 2 (1=male, 2=female). Numeric variables are easier to handle if sophisticated analysis (multivariate techniques) needs to be done later.

4. Do not enter anything into the database that you are not planning to analyse (e.g. names).

5. While learning, try out each program and command by experimenting on small data sets.

6. Always clean the data set before analysis. You will be surprised at the number of data entry errors that can creep in during entry!

7. Always back up data by saving the data set on floppies. Data entry is painful and it is not worth losing the whole set because of some hardware/software failure or error! These days it makes a lot of sense to transfer large data sets on to writable CDs.

8. Do not go blindly by the computer output! Try and see whether the results of the analysis are consistent with what you expect. If needed, perform some statistics manually and cross check the computer results.

9. Don’t refuse an opportunity to help others in data analysis (you always learn from the experience!).

10. Remember the golden rule: garbage in is garbage out! No amount of sophisticated data analysis using statistical software can compensate for badly collected data of poor quality (garbage!).

References and further reading

1. Dean AG, et al. Epi Info, Version 6: A Word-Processing, Database, and Statistics Program for Public Health on IBM-compatible Microcomputers. Centers for Disease Control and Prevention, Atlanta, 1995.

2. Dean AG, et al. Epi Info: A general purpose microcomputer program for public health information systems. American Journal of Preventive Medicine 1991;7:178-82.

3. Brown RA, Beck SJ. Medical Statistics on Microcomputers. A guide to the appropriate use of statistical packages. Articles reprinted from the Journal of Clinical Pathology. Published by the BMJ, 1990.

4. Beaglehole R, Bonita R, Kjellstrom T. Basic Epidemiology. Geneva, World Health Organization, 1993.

5. The American Statistician. Reviews on various statistical packages periodically. Published by American Statistical Association.

6. Dean AG. Using a microcomputer for field investigations. In: Gregg M [Ed]. Field Epidemiology. Oxford University Press, 1996.


Q1. A survey was done to measure the haemoglobin levels among a group of pregnant women attending an ante-natal clinic. 126 women were screened and the mean Hb was found to be 10.2 gm%. The standard deviation was 3.2.

Using the Epitable program, compute:

1. The 95% confidence interval (CI) for the point estimate of the mean Hb value.

2. The 95% CI using the same mean and SD but with a sample size of 1260.

3. Comment on the difference in the CI when the sample size was increased.
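A confidence interval for a mean can also be checked by hand. The sketch below uses the normal approximation (mean ± 1.96 × SD/√n); Epitable may use a slightly different method, so small differences are possible. The function name `mean_ci` and the numbers are illustrative and are deliberately not the data from the question.

```python
import math

def mean_ci(mean, sd, n, z=1.96):
    """Approximate confidence interval for a mean: mean +/- z * SD / sqrt(n)."""
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

# Hypothetical survey: mean 12.0, SD 2.0, first with 100 and then 400 subjects.
low, high = mean_ci(mean=12.0, sd=2.0, n=100)
low_big, high_big = mean_ci(mean=12.0, sd=2.0, n=400)

print(f"n=100: {low:.3f} to {high:.3f}")
print(f"n=400: {low_big:.3f} to {high_big:.3f}")
```

Because the half-width is divided by √n, quadrupling the sample size halves the width of the interval.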

Q2. Researchers at SMF recently did a population-based seroepidemiological study on the prevalence of Helicobacter pylori infection in the local community. Simple random sampling was used. Out of 354 respondents, 175 were seropositive by the latex agglutination test.

Using the Epitable program, compute:

1. The prevalence of H. pylori infection.

2. The 95% confidence interval (CI) for the point estimate of the prevalence.
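The same hand check works for a proportion. This minimal sketch uses the simple (Wald) normal approximation, p ± 1.96 × √(p(1 − p)/n); Epitable may report an interval from a different method, so the limits need not agree exactly. The function name `proportion_ci` and the counts are hypothetical and are not those of the question.

```python
import math

def proportion_ci(events, n, z=1.96):
    """Point estimate and approximate (Wald) confidence interval for a proportion."""
    p = events / n
    se = math.sqrt(p * (1 - p) / n)
    return p, p - z * se, p + z * se

# Hypothetical survey: 40 positives among 200 respondents.
p, low, high = proportion_ci(events=40, n=200)
print(f"prevalence = {p:.1%}, 95% CI {low:.1%} to {high:.1%}")
```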

Q3. A study was done to compare the birth weights of babies born to mothers with and without diabetes. The birth weights of 400 babies born to mothers without diabetes and 350 babies born to mothers with diabetes were compared. The table below gives the actual data:

Population             Sample size   Mean birth weight   Variance
Diabetic mothers       350           3.9 kg              0.36
Non-diabetic mothers   400           2.7 kg              0.25

Using the Epitable program:

1. Perform a test of significance to determine whether the observed difference between the two mean birth weights is a real difference or merely due to chance, random variation.

P value:


2. Graph the two distributions and visually determine whether both the distributions could have originated from a single distribution.

Q4. A study was done to compare a rapid serological test for malaria with the standard peripheral blood smear examination. 100 known blood smear positive cases were subjected to the rapid serological test. 200 known blood smear negative patients with other febrile illnesses were also subjected to the same rapid serological test. The data is presented below:

                      Blood smear positive   Blood smear negative   Total
Rapid test positive   72                     44                     116
Rapid test negative   28                     156                    184
Total                 100                    200                    300


1. Using Epitable, compute the sensitivity, specificity, positive and negative predictive values of the rapid test.

Sensitivity 95% CI:

Specificity 95% CI:

PPV 95% CI:

NPV 95% CI:
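The four measures follow directly from the cells of a 2 X 2 table, which makes it easy to cross-check computer output by hand. The sketch below shows the definitions; the function name `diagnostic_measures` and the counts are hypothetical and are not the data from the question.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Accuracy measures of a test against a reference standard.

    tp, fn: reference-positive subjects with a positive / negative test
    fp, tn: reference-negative subjects with a positive / negative test
    """
    return {
        "sensitivity": tp / (tp + fn),   # positives correctly detected
        "specificity": tn / (tn + fp),   # negatives correctly excluded
        "ppv": tp / (tp + fp),           # test-positives who truly have the disease
        "npv": tn / (tn + fn),           # test-negatives who are truly disease-free
    }

# Hypothetical evaluation: 100 diseased and 200 non-diseased subjects.
measures = diagnostic_measures(tp=90, fp=30, fn=10, tn=170)
for name, value in measures.items():
    print(f"{name}: {value:.1%}")
```

Predictive values depend on the proportion of diseased subjects in the sample, so values computed from a study that fixes the numbers of positives and negatives in advance do not carry over to populations with a different prevalence.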

Q5. A study was done to determine whether smokers had a greater risk of a second myocardial infarction as compared to non-smokers. 140 smokers who had had an acute MI were followed up for a period of 3 years and 130 non-smokers who had had an AMI were followed up for 4 years. 26 of the smokers and 12 of the non-smokers developed a second MI during the follow-up period.

1. What kind of a study design was used?

2. Using the Epitable program, compute the Relative Risk (RR) and the 95% CI for the RR.

3. From the above, is there a statistically significant association between smoking and risk of second MI?

4. Graph the risk ratio and see the difference visually.

Q6. To test the hypothesis that BCG protects against leprosy, 200 cases of leprosy and 200 controls with no leprosy (similar age, sex distribution) were examined for the BCG scar. The results were as follows:

Scar + among HD cases : 50

Scar + among controls : 150

The remaining in each group were scar negative.

1. What kind of a study design is this?

2. What is the exposure?

3. What is the outcome?

4. Set up a 2×2 table and compute the Odds Ratio (OR) and its 95% CI using Epitable.

Scar Case Control


5. From this data, can you conclude that BCG protects against leprosy?

Q7. In a study done to pick up inter-observer variation between trained workers who read and measure induration after Mantoux tests, the induration sizes were classified as positive or negative using 10 mm or more as the cut-off. The results of 160 tests read by two observers are shown in the table below:

               Observer 1 +   Observer 1 -
Observer 2 +   40             20
Observer 2 -   20             80


1. What is the observed agreement between the two observers?

2. Compute the kappa coefficient (which is more robust than simple agreement because it takes into account chance agreement) using Epitable program:
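Kappa can be cross-checked by hand as (observed agreement − chance agreement) / (1 − chance agreement), where chance agreement comes from the marginal totals. The sketch below is one way to write that down; the function name `cohen_kappa` and the counts are hypothetical and are not the data from the question.

```python
def cohen_kappa(a, b, c, d):
    """Observed agreement and kappa for a 2 x 2 agreement table.

                    rater 1 +   rater 1 -
    rater 2 +           a           b
    rater 2 -           c           d
    """
    n = a + b + c + d
    observed = (a + d) / n
    # Agreement expected by chance alone, from the row and column totals.
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return observed, (observed - expected) / (1 - expected)

# Hypothetical readings of 100 tests by two raters.
observed, kappa = cohen_kappa(45, 5, 15, 35)
print(f"observed agreement = {observed:.0%}, kappa = {kappa:.2f}")
```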

For copies of Epi Info, write to:

1. World Health Organization, Regional Office for South-East Asia, Indraprastha Estate, Mahatma Gandhi Marg, New Delhi 110002

2. Global Programme for Vaccines, World Health Organization, Geneva 21, Switzerland

3. Epidemiology Program Office, Centers for Disease Control and Prevention, Atlanta, Georgia 30333, USA


1. Specify the sampling method used in each of the following studies. Write A, B, C, or D.

A = simple random sampling

B = stratified random sampling

C = cluster sampling

D = systematic sampling

  • A die is rolled to decide which one of the six volunteers will get a new, experimental vaccine: _____
  • A sample of students in a school is chosen as follows: two students are selected from each batch by picking roll numbers at random from the attendance registers: _____
  • A target population for a telephonic survey is picked by selecting 10 pages from a total of 100 pages from a telephone directory by using a table of random numbers. In each of the selected pages, all listed persons are called for the interview: _____
  • The number 35 is a two-digit random number generated by a calculator. A sample of two-wheelers in a state is selected by picking all those vehicles which have registration numbers ending with 35: _____

2. The 95% confidence interval (CI) for an estimated mean, when expressed in simple language, implies:

  • that the true population mean has a 5% chance of being within the CI
  • that 95% of the sample observations will lie within the CI
  • that there is a 95% chance that the CI will include the true population mean
  • none of the above

3. In a cohort study done to explore the association between milk consumption and fracture of neck of femur, it was found that the incidence of fractures among those who drank less than 200 ml of milk per day was 3% in a 6-year follow-up period. Among those who drank 200 ml or more per day the average annual incidence was 0.5% per year. From this study, can you make out a statistical association between milk consumption and fracture neck of femur?

  • Yes
  • No

4. The most important reason for randomization in RCTs is:

  • avoid information bias
  • facilitate blinded outcome ascertainment
  • make numbers equal in both arms
  • randomly distribute known and unknown confounders between the two arms of the trial

5. Compared to observational study designs, the reason why experimental studies are considered stronger for making causal associations is:

  • experimental designs are repeatable
  • temporal sequence is always very clear in observational studies
  • confounding by other factors is eliminated in experimental studies by a process of randomization
  • biases cannot occur in experimental studies

6. The figure below gives the results of 5 drug trials for lowering blood pressure. It shows the mean reduction in blood pressure with the 95% CI for each estimate. A reduction of at least 5 mm Hg is considered necessary for clinical significance.


Match the drug to the appropriate inference:

Drug __ Statistically not significant. Clinically not significant

Drug __ Statistically significant. Clinically not significant

Drug ___ Statistically significant. Clinically significant

Drug __ Statistically not significant, may be Clinically important


Dr. Madhukar Pai MD, DNB
Consultant, Community Medicine & Epidemiology
Email: [email protected]