KNOWLEDGEBASE - ARTICLE #2191

How do I evaluate if my data meet necessary assumptions before applying parametric tests?

Whether it is appropriate to use various statistical tests depends on the characteristics of the data being analyzed. Even for common parametric analyses comparing two groups, basic assumptions about the data must be fulfilled for these tests to be meaningful. Statistical analysis of data is a powerful tool to compare different groups when interpreting studies and reaching scientific conclusions. For these results and conclusions to be meaningful, the analysis must be done carefully. In the words of Anne Segonds-Pichon, “If we give dodgy data to a statistical test, it will give us back a dodgy p value.”

Assumptions for Parametric Tests

Before conducting any parametric analyses, it is essential to explore the data and validate that certain assumptions are met. Parametric tests include some of the most commonly used analytical tools to compare groups of data with continuous variables, such as the Student’s t test and Analysis of Variance (ANOVA) test. These tests compare the mean values of data in each group, so two primary assumptions are made about data when applying these tests:

  1. Data in each comparison group show a Normal (or Gaussian) distribution

  2. Data in each comparison group exhibit similar variance, known as Homoscedasticity or Homogeneity of Variance

Normality, also known as a Gaussian distribution or a “bell-shaped” curve, refers to the degree to which data show a central tendency and symmetrical distribution relative to the mean. It is important for data in each group being compared to exhibit characteristics of normality. Because parametric tests compare means between data groups, the means must be a faithful representation of the data. The p values from parametric tests are only reliable when the data are at least approximately normally distributed.

One way in which data can depart from normality is by being asymmetrical in one direction or another. Data that are positively skewed have a longer tail in the direction of higher values, whereas data that are negatively skewed have a longer tail in the direction of lower values. Skewness is, therefore, the degree to which data depart from a symmetrical normal distribution, as shown graphically below:

Data may be symmetrical around the mean but still depart from a perfect normal distribution by showing certain flatness or peakedness characteristics. Kurtosis is a statistical measurement used to describe the degree to which values cluster in the tails or the peak of a frequency distribution, ranging from platykurtic to leptokurtic. As shown below, platykurtic refers to a flatter, more uniform distribution. Leptokurtic refers to a narrower and more peaked central distribution. Mesokurtic refers to the kurtosis of a perfectly normal distribution.
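For readers who want to check these two quantities outside Prism, the following is a minimal Python sketch, not the calculation Prism performs. It uses simple moment-based formulas without the small-sample bias corrections that statistical software often applies. The function name and the sample values are hypothetical and chosen only for illustration.

```python
from statistics import fmean

def shape_stats(values):
    """Return (skewness, excess kurtosis) computed from central moments.

    Assumption: plain moment-based estimators with no bias correction.
    """
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 corresponds to mesokurtic
    return skewness, excess_kurtosis

# Hypothetical example data
symmetric = [2, 4, 4, 5, 5, 5, 6, 6, 8]
right_tailed = [1, 1, 2, 2, 2, 3, 3, 4, 12]

print(shape_stats(symmetric))
print(shape_stats(right_tailed))
```

A skewness near zero indicates symmetry, a positive value indicates a longer tail toward higher values, and a negative value indicates a longer tail toward lower values. An excess kurtosis above zero points toward a leptokurtic shape and a value below zero toward a platykurtic shape.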

 

Homoscedasticity, or homogeneity of variance, is the other primary assumption for parametric tests. This refers to the dispersion pattern of data and how similar this pattern is between groups being compared. Not only is it assumed that data in each comparison group demonstrate a normal distribution, but each group is also assumed to exhibit similar levels of variability or “noise” for parametric tests to compare them accurately. When groups exhibit different variances, alternative analytical approaches may be recommended.
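As a rough illustration of what "similar levels of variability" means, the sketch below compares the sample variances of two groups in Python. The ratio is only a descriptive summary and not a formal test of homogeneity of variance. The function name and the data are hypothetical.

```python
from statistics import stdev, variance

def variance_ratio(group_a, group_b):
    """Return the larger sample variance divided by the smaller (always >= 1)."""
    var_a, var_b = variance(group_a), variance(group_b)
    return max(var_a, var_b) / min(var_a, var_b)

# Hypothetical example data
control = [9.8, 10.1, 10.0, 9.9, 10.2]
treated = [8.0, 12.5, 10.1, 7.2, 13.0]

print("SD control:", stdev(control))
print("SD treated:", stdev(treated))
print("Variance ratio:", variance_ratio(control, treated))
```

A ratio close to 1 is consistent with homogeneous variances, while a large ratio suggests that the groups differ in their "noise" and that the assumption deserves a closer look.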

 

Graphical Tools to Explore Data

A powerful and effective way to explore data is by plotting them graphically to inspect data characteristics visually. While a simple bar chart may be suggested initially, other graphing tools are more effective for visualizing all data points and comparing groups of data. The Scatterplot is an excellent tool to explore data because it includes every data point and can include the mean value as a horizontal line within each group. It is possible to scan for the symmetry of the data and identify points that may create a skewed distribution, including potential outliers that need further consideration.

A second useful plotting technique is the Box and Whiskers plot, originally developed by John Tukey. This plotting technique creates a box bounded by the upper and lower quartile values surrounding the median value of the data. Minimum and maximum data values are represented by whiskers protruding from the box. The geometric representations of these values can be qualitatively examined for symmetry and amount of variance. Potential outlying data points can also be identified.
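The values that a Box and Whiskers plot draws can also be computed directly. The Python sketch below returns the minimum, quartiles, median, and maximum, and flags points lying more than 1.5 times the interquartile range beyond the box. The 1.5 multiplier is a common convention assumed here, not something this article specifies, and quartile values can differ slightly between programs depending on the interpolation method. The function name and the data are hypothetical.

```python
from statistics import quantiles

def box_summary(values, k=1.5):
    """Return the values a box plot displays, plus points beyond the fences.

    Assumption: fences at k * IQR beyond the quartiles (k = 1.5 by convention).
    """
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in data if v < low_fence or v > high_fence]
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "flagged": flagged,
    }

# Hypothetical example data with one extreme value
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]
print(box_summary(sample))
```

A flagged point is only a candidate for further consideration. Whether it is a true outlier is a separate judgment.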

A newer graphical tool is the Violin Plot, which represents the data in a unique way that allows visualization of normality and homogeneity of variance by enclosing data in a single shape for each comparison group. Examining the symmetry and size of the shapes can suggest whether the data conform to a normal distribution with homogeneous variability.

 

A final graphical tool that is particularly useful in assessing normality assumptions is the Quantile-Quantile plot, also referred to as the QQ plot. By graphing the actual values of data (along the x-axis) against predicted values of similar hypothetical data that obey a perfect normal distribution (along the y-axis), one can assess normality by simply observing how closely the data points adhere to the diagonal line. It is an elegant way to illustrate if data conform to normality and if either comparison group has outliers or patterns of variability that defy the assumption of homogeneity of variance.
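The points on a QQ plot can be sketched in Python as follows. Each sorted data value (the actual value, for the x-axis) is paired with the value predicted for that rank by a normal distribution having the same mean and standard deviation (for the y-axis). The plotting position (i - 0.5) / n is one common convention assumed here; Prism's exact formula is not given in this article and may differ. The function name and the data are hypothetical.

```python
from statistics import NormalDist, fmean, stdev

def qq_points(values):
    """Return (actual, predicted) pairs for a normal QQ plot.

    Assumption: plotting positions of (i - 0.5) / n for the i-th sorted value.
    """
    data = sorted(values)
    n = len(data)
    mean, sd = fmean(data), stdev(data)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, actual in enumerate(data, start=1):
        p = (i - 0.5) / n
        predicted = mean + sd * standard_normal.inv_cdf(p)
        points.append((actual, predicted))
    return points

# Hypothetical example data
measurements = [4.1, 5.0, 5.2, 4.8, 6.3, 5.5, 4.6]
for actual, predicted in qq_points(measurements):
    print(f"{actual:.2f}  {predicted:.2f}")
```

If the data follow a normal distribution, the actual and predicted values in each pair are close, so the points fall near the line of identity. Systematic curvature away from that line suggests skewness or heavy tails.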

 

Formal Tests to Explore Data

Exploring data graphically is an indispensable qualitative tool to check the assumption of normality before applying parametric tests, but there are also formal tests available within Prism to assess how likely it is that data are adhering to normality. Four formal methods to test data distribution for normality are the following:

  1. Anderson-Darling test
  2. D’Agostino-Pearson omnibus normality test
  3. Shapiro-Wilk normality test
  4. Kolmogorov-Smirnov normality test

Each of these methods calculates a p value for each comparison group, testing the null hypothesis that the data were sampled from a normal distribution. A small p value is evidence that the data depart from normality, whereas a large p value means only that the test found no evidence of a departure. Applying these formal tests can help confirm the normality assumption that was visually assessed during graphical exploration of the data.
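For readers working outside Prism, two of these tests can be run in Python with the SciPy library, as sketched below. This is an illustration under stated assumptions: it requires SciPy to be installed, the simulated data and the function name are hypothetical, and the results are not guaranteed to match Prism's output digit for digit.

```python
import random

from scipy import stats

def normality_p_values(values):
    """Return p values from two normality tests.

    Null hypothesis for both: the data were sampled from a Gaussian distribution.
    """
    _, p_shapiro = stats.shapiro(values)        # Shapiro-Wilk
    _, p_dagostino = stats.normaltest(values)   # D'Agostino-Pearson omnibus
    return {
        "shapiro_wilk": float(p_shapiro),
        "dagostino_pearson": float(p_dagostino),
    }

# Hypothetical simulated data (fixed seed so the run is repeatable)
rng = random.Random(0)
gaussian_like = [rng.gauss(50, 5) for _ in range(200)]
right_skewed = [rng.expovariate(1.0) for _ in range(200)]

print("Gaussian-like:", normality_p_values(gaussian_like))
print("Right-skewed: ", normality_p_values(right_skewed))
```

The strongly skewed sample should produce very small p values, indicating a departure from normality. SciPy also provides stats.anderson and stats.kstest, but their output needs more care to interpret: the former reports critical values by default, and the p value of the latter is not valid when the mean and standard deviation are estimated from the same data.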

Outliers

The consideration of potential outliers is an important topic unto itself, but graphical exploration of data can be a good tool to identify when further consideration is warranted. Data points that are far removed from the group, lying an abnormal distance from the mean, can skew the symmetry of distribution and call into question the assumption of normality. There are several steps to consider in handling potential outlying data points. There is no guaranteed way to separate outliers from values sampled from a Gaussian distribution. There is always a chance that some true outliers will be missed, and that some "good points" will be falsely identified as outliers.

First, these observations should be scrutinized for possible technical errors of measurement, or even recording mistakes that may have produced inaccurate data points. In addition, any biological basis for abnormal values should be evaluated, such as inclusion or exclusion criteria that were not appropriately met. If no such reason for the outlying data is identified, then careful consideration must be given to whether the data should be included or excluded from the analysis. Formal statistical methods have also been developed that can assist in determining outliers, such as Grubbs’ test.
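The following Python sketch shows the standard two-sided form of Grubbs' test for a single outlier. It is an illustration, not Prism's implementation. It assumes SciPy is installed (for the t distribution) and that the remaining values come from a Gaussian distribution. The function name and the data are hypothetical.

```python
from math import sqrt
from statistics import fmean, stdev

from scipy import stats

def grubbs_test(values, alpha=0.05):
    """Two-sided Grubbs' test for one outlier.

    Returns (suspect value, G statistic, critical G, flagged as outlier).
    """
    n = len(values)
    mean, sd = fmean(values), stdev(values)
    # The suspect is the value farthest from the mean
    suspect = max(values, key=lambda v: abs(v - mean))
    g = abs(suspect - mean) / sd
    # Critical value derived from the t distribution with n - 2 degrees of freedom
    t_crit = float(stats.t.ppf(1 - alpha / (2 * n), n - 2))
    g_crit = (n - 1) / sqrt(n) * sqrt(t_crit ** 2 / (n - 2 + t_crit ** 2))
    return suspect, g, g_crit, g > g_crit

# Hypothetical example data
with_extreme = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 15.0]
without_extreme = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]

print(grubbs_test(with_extreme))
print(grubbs_test(without_extreme))
```

A flagged value is a statistical signal, not a verdict. The decision to exclude a point should still rest on the checks described above.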

In summary, exploring data is an essential step in the scientific process. As Anne Segonds-Pichon admonishes, “Applying the wrong statistical tests can have disastrous consequences, as it can mean misinterpreting the experimental results and drawing the wrong conclusions.” Visualizing data graphically using various plotting techniques available in Prism is a powerful tool to confirm basic assumptions about data before applying parametric statistical tests.   



Keywords: data exploration
