Check normality of distributions

Material to support teaching in Environmental Science at The University of Western Australia

Units ENVT3361, ENVT4461, and ENVT5503

We need to load the R package car, as it's needed for the powerTransform() function. For this session we load and use the sv2017 data from Smith's Lake and Charles Veryard Reserves in 2017.

library(car)
git <- "https://github.com/Ratey-AtUWA/Learn-R-web/raw/main/"
sv2017 <- read.csv(paste0(git, "sv2017_original.csv"))

It's often a good idea to make a copy of the data, so we don't lose any information from the original if something goes horribly wrong. We then check our data using the summary() function:

data0 <- sv2017
summary(data0)

##   ICP.Seq.No            Group           Sample              Type              Easting      
##  Length:95          Min.   : 1.000   Length:95          Length:95          Min.   :391064  
##  Class :character   1st Qu.: 3.000   Class :character   Class :character   1st Qu.:391242  
##  Mode  :character   Median : 6.000   Mode  :character   Mode  :character   Median :391338  
##                     Mean   : 6.344                                         Mean   :391301  
##                     3rd Qu.: 9.000                                         3rd Qu.:391384  
##                     Max.   :12.000                                         Max.   :391431  
##                     NA's   :2                                                              
##     Northing         Longitude        Latitude            pH              EC       
##  Min.   :6466246   Min.   :115.8   Min.   :-31.93   Min.   :4.630   Min.   : 29.1  
##  1st Qu.:6466430   1st Qu.:115.8   1st Qu.:-31.93   1st Qu.:6.388   1st Qu.:106.6  
##  Median :6466589   Median :115.9   Median :-31.93   Median :6.820   Median :152.6  
##  Mean   :6466556   Mean   :115.9   Mean   :-31.93   Mean   :6.813   Mean   :188.5  
##  3rd Qu.:6466676   3rd Qu.:115.9   3rd Qu.:-31.93   3rd Qu.:7.280   3rd Qu.:215.1  
##  Max.   :6466771   Max.   :115.9   Max.   :-31.93   Max.   :8.630   Max.   :835.0  
##                                                     NA's   :1       NA's   :34     
##        Al             As                Ba               Ca              Cd        
##  Min.   : 396   Min.   :  0.480   Min.   : 3.990   Min.   :  283   Min.   :0.0200  
##  1st Qu.:2106   1st Qu.:  1.738   1st Qu.: 9.755   1st Qu.: 1189   1st Qu.:0.0500  
##  Median :2503   Median :  2.100   Median :13.550   Median : 1837   Median :0.0700  
##  Mean   :2558   Mean   :  3.758   Mean   :15.102   Mean   : 5768   Mean   :0.1008  
##  3rd Qu.:2933   3rd Qu.:  2.800   3rd Qu.:19.938   3rd Qu.: 6510   3rd Qu.:0.1200  
##  Max.   :8463   Max.   :113.630   Max.   :41.670   Max.   :66286   Max.   :0.6600  
##  NA's   :7      NA's   :7         NA's   :7        NA's   :7       NA's   :19      
##        Ce               Cr               Cu               Fe              Gd        
##  Min.   : 1.190   Min.   : 1.170   Min.   : 2.130   Min.   :  833   Min.   :0.2100  
##  1st Qu.: 3.290   1st Qu.: 5.610   1st Qu.: 4.213   1st Qu.: 2013   1st Qu.:0.3700  
##  Median : 4.270   Median : 7.085   Median : 6.010   Median : 2589   Median :0.4650  
##  Mean   : 4.576   Mean   : 7.249   Mean   :10.794   Mean   : 3132   Mean   :0.5166  
##  3rd Qu.: 5.420   3rd Qu.: 8.450   3rd Qu.:12.200   3rd Qu.: 3362   3rd Qu.:0.6300  
##  Max.   :14.080   Max.   :26.290   Max.   :87.380   Max.   :36052   Max.   :2.4100  
##  NA's   :8        NA's   :7        NA's   :7        NA's   :7       NA's   :9       
##        K               La              Mg                Mn               Mo        
##  Min.   : 41.1   Min.   :0.750   Min.   :  57.34   Min.   :  2.68   Min.   : 0.050  
##  1st Qu.:101.7   1st Qu.:1.560   1st Qu.: 237.21   1st Qu.: 18.89   1st Qu.: 0.170  
##  Median :151.7   Median :1.895   Median : 351.26   Median : 30.82   Median : 0.230  
##  Mean   :173.5   Mean   :2.110   Mean   : 467.74   Mean   : 35.57   Mean   : 0.418  
##  3rd Qu.:221.9   3rd Qu.:2.413   3rd Qu.: 563.90   3rd Qu.: 47.97   3rd Qu.: 0.310  
##  Max.   :634.1   Max.   :7.530   Max.   :3253.43   Max.   :127.42   Max.   :14.470  
##  NA's   :7       NA's   :7       NA's   :7         NA's   :7        NA's   :7       
##        Na                Nd              Ni               P               Pb        
##  Min.   :  27.80   Min.   :0.050   Min.   : 0.250   Min.   : 18.5   Min.   :  3.23  
##  1st Qu.:  70.75   1st Qu.:1.230   1st Qu.: 1.252   1st Qu.:113.0   1st Qu.: 10.16  
##  Median : 107.95   Median :1.620   Median : 2.040   Median :186.8   Median : 15.36  
##  Mean   : 142.02   Mean   :1.692   Mean   : 2.536   Mean   :218.2   Mean   : 28.42  
##  3rd Qu.: 163.43   3rd Qu.:2.110   3rd Qu.: 2.768   3rd Qu.:298.8   3rd Qu.: 27.12  
##  Max.   :1613.70   Max.   :8.540   Max.   :21.340   Max.   :846.9   Max.   :173.78  
##  NA's   :7         NA's   :10      NA's   :7        NA's   :7       NA's   :7       
##        S                  Sr               Th              V                Y         
##  Min.   :   42.96   Min.   :  1.87   Min.   :0.480   Min.   : 1.280   Min.   : 0.440  
##  1st Qu.:  183.44   1st Qu.:  6.24   1st Qu.:0.650   1st Qu.: 4.330   1st Qu.: 1.130  
##  Median :  281.77   Median : 10.37   Median :0.940   Median : 5.360   Median : 1.490  
##  Mean   :  897.30   Mean   : 31.24   Mean   :1.171   Mean   : 5.727   Mean   : 1.808  
##  3rd Qu.:  425.77   3rd Qu.: 27.82   3rd Qu.:1.400   3rd Qu.: 6.400   3rd Qu.: 2.053  
##  Max.   :41484.84   Max.   :473.42   Max.   :3.840   Max.   :33.620   Max.   :12.320  
##  NA's   :7          NA's   :7        NA's   :24      NA's   :7        NA's   :7       
##        Zn        
##  Min.   :  5.59  
##  1st Qu.: 23.96  
##  Median : 38.60  
##  Mean   : 58.71  
##  3rd Qu.: 63.63  
##  Max.   :452.16  
##  NA's   :7

The summary() shows us that the first numeric variable we want to check the distribution for is pH. There are other numeric columns before this (Group, Easting, Northing,Longitude,Latitude) but we don't usually tend to check the distribution for this type of numeric or integer variable, as they are really sample identifiers.

If you use this code for your own data, change the value of v0 to your first numeric variable of interest (i.e. the first you want to check the distribution of).

v0 <- "pH"

Subset the input data from first relevant numeric column to end

data0 <- data0[,which(colnames(data0)==v0):ncol(data0)]

We then create a temporary object names.of.cols storing the names of variables to be transformed, and a vector colz of the column indices which are numeric:

names.of.cols <- names(data0)
colzTF <- unlist(lapply(data0, is.numeric), use.names = FALSE)
colz <- which(colzTF==TRUE)

We make an empty output data frame transf_results to hold the results of our distribution checks:

transf_results <- data.frame("Variable"=rep(NA,length(colz)),
              "W_orig"=rep(NA,length(colz)),
              "p_orig"=rep(NA,length(colz)), "W_log_tr"=rep(NA,length(colz)),
              "p_log_tr"=rep(NA,length(colz)), "W_pow_tr"=rep(NA,length(colz)),
              "p_pow_tr"=rep(NA,length(colz)), "Pow_term"=rep(NA,length(colz)))

We then start a programming loop using for(...) that assesses variable distributions and writes the statistical output to the data frame transf_results we created earlier.

Note that we create R objects to hold the results of statistical tests – this stores them temporarily so that we can write them to the results data frame.
The object pt1 that hold the results of the powerTansform() function contains an item pt1$lambda which is the relevant power term for each variable. Each time we cycle through the for() loop, we write a row of the output data frame transf_results. We use signif() to round to a specified number (4) of significant digits.

for (i in colz) {
  pt1 <- powerTransform(data0[, i])
  sw0 <- shapiro.test(data0[, i])
  sw1 <- shapiro.test(log10(data0[, i]))
  sw2 <- shapiro.test((data0[, i]) ^ as.numeric(pt1$lambda))
  transf_results[i-(colz[1]-1),] <- c(names.of.cols[i], signif(sw0$statistic, 4),
                                 signif(sw0$p.value, 4), signif(sw1$statistic, 4),
                                 signif(sw1$p.value, 4), signif(sw2$statistic, 4),
                                 signif(sw2$p.value, 4), signif(as.vector(pt1$lambda), 4))
}
# output to console (screen)
{cat("Table 1: Shapiro-Wilk statistics and p-values for untransformed (_orig) and transformed
(_log, _pow) variables from soil and sediemnt analysis at Smith's Lake Reserve.\n\n")
print(transf_results, row.names = FALSE)}

## Table 1: Shapiro-Wilk statistics and p-values for untransformed (_orig) and transformed
## (_log, _pow) variables from soil and sediemnt analysis at Smith's Lake Reserve.
## 
##  Variable W_orig    p_orig W_log_tr  p_log_tr W_pow_tr  p_pow_tr  Pow_term
##        pH 0.9766   0.08979   0.9474 0.0008693   0.9867    0.4612     2.002
##        EC   0.76 1.274e-08   0.9823    0.5219   0.9824    0.5283  -0.01806
##        Al 0.8511 6.021e-08   0.8958 3.289e-06   0.9365 0.0003185    0.4553
##        As  0.143 2.892e-20   0.7882 6.415e-10   0.9269 9.749e-05   -0.4294
##        Ba 0.8951 3.067e-06   0.9885    0.6358   0.9885    0.6373 -0.007109
##        Ca 0.5646 9.559e-15   0.9585  0.006584   0.9763    0.1073   -0.1945
##        Cd   0.63 1.528e-12   0.9727    0.1006   0.9788    0.2345   -0.1594
##        Ce 0.8913 2.368e-06    0.985    0.4128    0.987    0.5425    0.1248
##        Cr 0.8176 4.732e-09   0.9192  3.96e-05   0.9406 0.0005403     0.348
##        Cu 0.5644 9.465e-15   0.9318 0.0001761   0.9867    0.5128   -0.5011
##        Fe 0.2707 7.295e-19   0.8549 8.176e-08    0.948  0.001465   -0.4639
##        Gd 0.6443 4.034e-13   0.9495   0.00207   0.9802    0.2102   -0.4718
##         K 0.8554 8.542e-08   0.9911    0.8206   0.9915    0.8426   -0.0517
##        La 0.7681 1.814e-10    0.969   0.03308   0.9889    0.6618   -0.4369
##        Mg 0.6524 4.011e-13   0.9853    0.4248    0.986    0.4653   -0.0477
##        Mn 0.9272 0.0001014   0.9602  0.008538   0.9936    0.9489    0.3702
##        Mo 0.1319 2.222e-20   0.8012 1.514e-09   0.9471  0.001283   -0.3986
##        Na 0.3949 2.606e-17   0.9437 0.0008182   0.9881    0.6041   -0.3711
##        Nd 0.7408 6.027e-11   0.8226 1.038e-08   0.9182 4.772e-05    0.4521
##        Ni 0.5497 5.354e-15   0.9732   0.06437   0.9764    0.1083  -0.07573
##         P 0.9036 7.254e-06   0.9796    0.1789   0.9932    0.9326    0.2506
##        Pb 0.6184 8.783e-14   0.9517  0.002464   0.9854    0.4297   -0.2961
##         S 0.1364 2.473e-20   0.8212 6.144e-09    0.962    0.0112   -0.4024
##        Sr 0.4609 2.194e-16   0.9523  0.002664   0.9822    0.2688   -0.2506
##        Th 0.8023 2.339e-08   0.9431  0.002948   0.9695   0.08177   -0.6177
##         V 0.5028 9.405e-16   0.9073 1.074e-05   0.9168 3.034e-05   -0.1409
##         Y 0.5532 6.128e-15   0.9579  0.006028   0.9829    0.2984   -0.3309
##        Zn 0.6155  7.75e-14   0.9828     0.295   0.9915    0.8421   -0.1493

We can export our results to a csv file for Excel (if desired):

write.csv(transf_results, file = "transformations.csv", row.names = FALSE)

Alternatively, we can make a publication-quality table using the flextable R package (Table 1).

flextable(transf_results) |> bold(part="header") |> width(width=2, unit="cm") |>
  set_caption(caption="Table 1: Table summarising normality tests for distributions of variables when untransformed, log10-transformed, or power-transformed. [W_orig = Shapiro-Wilk statistic (W) for the untransfomed variable; p_orig = Shapiro-Wilk p-value for the untransfomed variable; W_log_tr = W for log10-transformed; p_log_tr = p-value for log10-transformed; W_pow_tr = W for power-transformed; p_pow_tr = p-value for power-transformed; Pow_term = exponent for power transformation.] Data from Smiths-Veryard 2017.", align_with_table=F, fp_p=fp_par(text.align = "left", padding.bottom = 6))

Table 1: Table summarising normality tests for distributions of variables when untransformed, log10-transformed, or power-transformed. [W_orig = Shapiro-Wilk statistic (W) for the untransfomed variable; p_orig = Shapiro-Wilk p-value for the untransfomed variable; W_log_tr = W for log10-transformed; p_log_tr = p-value for log10-transformed; W_pow_tr = W for power-transformed; p_pow_tr = p-value for power-transformed; Pow_term = exponent for power transformation.] Data from Smiths-Veryard 2017.
Variable	W_orig	p_orig	W_log_tr	p_log_tr	W_pow_tr	p_pow_tr	Pow_term
pH	0.9766	0.08979	0.9474	0.0008693	0.9867	0.4612	2.002
EC	0.76	1.274e-08	0.9823	0.5219	0.9824	0.5283	-0.01806
Al	0.8511	6.021e-08	0.8958	3.289e-06	0.9365	0.0003185	0.4553
As	0.143	2.892e-20	0.7882	6.415e-10	0.9269	9.749e-05	-0.4294
Ba	0.8951	3.067e-06	0.9885	0.6358	0.9885	0.6373	-0.007109
Ca	0.5646	9.559e-15	0.9585	0.006584	0.9763	0.1073	-0.1945
Cd	0.63	1.528e-12	0.9727	0.1006	0.9788	0.2345	-0.1594
Ce	0.8913	2.368e-06	0.985	0.4128	0.987	0.5425	0.1248
Cr	0.8176	4.732e-09	0.9192	3.96e-05	0.9406	0.0005403	0.348
Cu	0.5644	9.465e-15	0.9318	0.0001761	0.9867	0.5128	-0.5011
Fe	0.2707	7.295e-19	0.8549	8.176e-08	0.948	0.001465	-0.4639
Gd	0.6443	4.034e-13	0.9495	0.00207	0.9802	0.2102	-0.4718
K	0.8554	8.542e-08	0.9911	0.8206	0.9915	0.8426	-0.0517
La	0.7681	1.814e-10	0.969	0.03308	0.9889	0.6618	-0.4369
Mg	0.6524	4.011e-13	0.9853	0.4248	0.986	0.4653	-0.0477
Mn	0.9272	0.0001014	0.9602	0.008538	0.9936	0.9489	0.3702
Mo	0.1319	2.222e-20	0.8012	1.514e-09	0.9471	0.001283	-0.3986
Na	0.3949	2.606e-17	0.9437	0.0008182	0.9881	0.6041	-0.3711
Nd	0.7408	6.027e-11	0.8226	1.038e-08	0.9182	4.772e-05	0.4521
Ni	0.5497	5.354e-15	0.9732	0.06437	0.9764	0.1083	-0.07573
P	0.9036	7.254e-06	0.9796	0.1789	0.9932	0.9326	0.2506
Pb	0.6184	8.783e-14	0.9517	0.002464	0.9854	0.4297	-0.2961
S	0.1364	2.473e-20	0.8212	6.144e-09	0.962	0.0112	-0.4024
Sr	0.4609	2.194e-16	0.9523	0.002664	0.9822	0.2688	-0.2506
Th	0.8023	2.339e-08	0.9431	0.002948	0.9695	0.08177	-0.6177
V	0.5028	9.405e-16	0.9073	1.074e-05	0.9168	3.034e-05	-0.1409
Y	0.5532	6.128e-15	0.9579	0.006028	0.9829	0.2984	-0.3309
Zn	0.6155	7.75e-14	0.9828	0.295	0.9915	0.8421	-0.1493

CC-BY-SA • All content by Ratey-AtUWA. My employer does not necessarily know about or endorse the content of this website.
Created with rmarkdown in RStudio. Currently using the free yeti theme from Bootswatch.

Material to support teaching in Environmental Science at The University of Western Australia

Units ENVT3361, ENVT4461, and ENVT5503

Check normality of distributions

Many variables at once using a programming loop

Andrew Rate

2025-06-19

Subset the input data from first relevant numeric column to end