UWA logoMaterial to support teaching in Environmental Science at The University of Western Australia

Units ENVT3361, ENVT4461, and ENVT5503

We need to load the R package car, as it's needed for the powerTransform() function. For this session we load and use the sv2017 data from Smith's Lake and Charles Veryard Reserves in 2017.

library(car)
git <- "https://github.com/Ratey-AtUWA/Learn-R-web/raw/main/"
sv2017 <- read.csv(paste0(git, "sv2017_original.csv"))

It's often a good idea to make a copy of the data, so we don't lose any information from the original if something goes horribly wrong. We then check our data using the summary() function:

data0 <- sv2017
summary(data0)
##   ICP.Seq.No            Group           Sample              Type          
##  Length:95          Min.   : 1.000   Length:95          Length:95         
##  Class :character   1st Qu.: 3.000   Class :character   Class :character  
##  Mode  :character   Median : 6.000   Mode  :character   Mode  :character  
##                     Mean   : 6.344                                        
##                     3rd Qu.: 9.000                                        
##                     Max.   :12.000                                        
##                     NA's   :2                                             
##     Easting          Northing         Longitude        Latitude     
##  Min.   :391064   Min.   :6466246   Min.   :115.8   Min.   :-31.93  
##  1st Qu.:391242   1st Qu.:6466430   1st Qu.:115.8   1st Qu.:-31.93  
##  Median :391338   Median :6466589   Median :115.9   Median :-31.93  
##  Mean   :391301   Mean   :6466556   Mean   :115.9   Mean   :-31.93  
##  3rd Qu.:391384   3rd Qu.:6466676   3rd Qu.:115.9   3rd Qu.:-31.93  
##  Max.   :391431   Max.   :6466771   Max.   :115.9   Max.   :-31.93  
##                                                                     
##        pH              EC              Al             As         
##  Min.   :4.630   Min.   : 29.1   Min.   : 396   Min.   :  0.480  
##  1st Qu.:6.388   1st Qu.:106.6   1st Qu.:2106   1st Qu.:  1.738  
##  Median :6.820   Median :152.6   Median :2503   Median :  2.100  
##  Mean   :6.813   Mean   :188.5   Mean   :2558   Mean   :  3.758  
##  3rd Qu.:7.280   3rd Qu.:215.1   3rd Qu.:2933   3rd Qu.:  2.800  
##  Max.   :8.630   Max.   :835.0   Max.   :8463   Max.   :113.630  
##  NA's   :1       NA's   :34      NA's   :7      NA's   :7        
##        Ba               Ca              Cd               Ce        
##  Min.   : 3.990   Min.   :  283   Min.   :0.0200   Min.   : 1.190  
##  1st Qu.: 9.755   1st Qu.: 1189   1st Qu.:0.0500   1st Qu.: 3.290  
##  Median :13.550   Median : 1837   Median :0.0700   Median : 4.270  
##  Mean   :15.102   Mean   : 5768   Mean   :0.1008   Mean   : 4.576  
##  3rd Qu.:19.938   3rd Qu.: 6510   3rd Qu.:0.1200   3rd Qu.: 5.420  
##  Max.   :41.670   Max.   :66286   Max.   :0.6600   Max.   :14.080  
##  NA's   :7        NA's   :7       NA's   :19       NA's   :8       
##        Cr               Cu               Fe              Gd        
##  Min.   : 1.170   Min.   : 2.130   Min.   :  833   Min.   :0.2100  
##  1st Qu.: 5.610   1st Qu.: 4.213   1st Qu.: 2013   1st Qu.:0.3700  
##  Median : 7.085   Median : 6.010   Median : 2589   Median :0.4650  
##  Mean   : 7.249   Mean   :10.794   Mean   : 3132   Mean   :0.5166  
##  3rd Qu.: 8.450   3rd Qu.:12.200   3rd Qu.: 3362   3rd Qu.:0.6300  
##  Max.   :26.290   Max.   :87.380   Max.   :36052   Max.   :2.4100  
##  NA's   :7        NA's   :7        NA's   :7       NA's   :9       
##        K               La              Mg                Mn        
##  Min.   : 41.1   Min.   :0.750   Min.   :  57.34   Min.   :  2.68  
##  1st Qu.:101.7   1st Qu.:1.560   1st Qu.: 237.21   1st Qu.: 18.89  
##  Median :151.7   Median :1.895   Median : 351.26   Median : 30.82  
##  Mean   :173.5   Mean   :2.110   Mean   : 467.74   Mean   : 35.57  
##  3rd Qu.:221.9   3rd Qu.:2.413   3rd Qu.: 563.90   3rd Qu.: 47.97  
##  Max.   :634.1   Max.   :7.530   Max.   :3253.43   Max.   :127.42  
##  NA's   :7       NA's   :7       NA's   :7         NA's   :7       
##        Mo               Na                Nd              Ni        
##  Min.   : 0.050   Min.   :  27.80   Min.   :0.050   Min.   : 0.250  
##  1st Qu.: 0.170   1st Qu.:  70.75   1st Qu.:1.230   1st Qu.: 1.252  
##  Median : 0.230   Median : 107.95   Median :1.620   Median : 2.040  
##  Mean   : 0.418   Mean   : 142.02   Mean   :1.692   Mean   : 2.536  
##  3rd Qu.: 0.310   3rd Qu.: 163.43   3rd Qu.:2.110   3rd Qu.: 2.768  
##  Max.   :14.470   Max.   :1613.70   Max.   :8.540   Max.   :21.340  
##  NA's   :7        NA's   :7         NA's   :10      NA's   :7       
##        P               Pb               S                  Sr        
##  Min.   : 18.5   Min.   :  3.23   Min.   :   42.96   Min.   :  1.87  
##  1st Qu.:113.0   1st Qu.: 10.16   1st Qu.:  183.44   1st Qu.:  6.24  
##  Median :186.8   Median : 15.36   Median :  281.78   Median : 10.37  
##  Mean   :218.2   Mean   : 28.42   Mean   :  897.30   Mean   : 31.24  
##  3rd Qu.:298.8   3rd Qu.: 27.12   3rd Qu.:  425.77   3rd Qu.: 27.82  
##  Max.   :846.9   Max.   :173.78   Max.   :41484.84   Max.   :473.42  
##  NA's   :7       NA's   :7        NA's   :7          NA's   :7       
##        Th              V                Y                Zn        
##  Min.   :0.480   Min.   : 1.280   Min.   : 0.440   Min.   :  5.59  
##  1st Qu.:0.650   1st Qu.: 4.330   1st Qu.: 1.130   1st Qu.: 23.96  
##  Median :0.940   Median : 5.360   Median : 1.490   Median : 38.60  
##  Mean   :1.171   Mean   : 5.727   Mean   : 1.808   Mean   : 58.71  
##  3rd Qu.:1.400   3rd Qu.: 6.400   3rd Qu.: 2.053   3rd Qu.: 63.63  
##  Max.   :3.840   Max.   :33.620   Max.   :12.320   Max.   :452.16  
##  NA's   :24      NA's   :7        NA's   :7        NA's   :7

The summary() shows us that the first numeric variable we want to check the distribution for is pH. There are other numeric columns before this (Group, Easting, Northing,Longitude,Latitude) but we don't usually tend to check the distribution for this type of numeric or integer variable, as they are really sample identifiers.

If you use this code for your own data, change the value of v0 to your first numeric variable of interest (i.e. the first you want to check the distribution of).

v0 <- "pH"

Subset the input data from first relevant numeric column to end

data0 <- data0[,which(colnames(data0)==v0):ncol(data0)]

We then create a temporary object names.of.cols storing the names of variables to be transformed, and a vector colz of the column indices which are numeric:

names.of.cols <- names(data0)
colzTF <- unlist(lapply(data0, is.numeric), use.names = FALSE)
colz <- which(colzTF==TRUE)

We make an empty output data frame transf_results to hold the results of our distribution checks:

transf_results <- data.frame("Variable"=rep(NA,length(colz)),
              "W_orig"=rep(NA,length(colz)),
              "p_orig"=rep(NA,length(colz)), "W_log_tr"=rep(NA,length(colz)),
              "p_log_tr"=rep(NA,length(colz)), "W_pow_tr"=rep(NA,length(colz)),
              "p_pow_tr"=rep(NA,length(colz)), "Pow_term"=rep(NA,length(colz)))

We then start a programming loop using for(...) that assesses variable distributions and writes the statistical output to the data frame transf_results we created earlier.

Note that we create R objects to hold the results of statistical tests – this stores them temporarily so that we can write them to the results data frame.
The object pt1 that hold the results of the powerTansform() function contains an item pt1$lambda which is the relevant power term for each variable. Each time we cycle through the for() loop, we write a row of the output data frame transf_results. We use signif() to round to a specified number (4) of significant digits.

for (i in colz) {
  pt1 <- powerTransform(data0[, i])
  sw0 <- shapiro.test(data0[, i])
  sw1 <- shapiro.test(log10(data0[, i]))
  sw2 <- shapiro.test((data0[, i]) ^ as.numeric(pt1$lambda))
  transf_results[i-(colz[1]-1),] <- c(names.of.cols[i], signif(sw0$statistic, 4),
                                 signif(sw0$p.value, 4), signif(sw1$statistic, 4),
                                 signif(sw1$p.value, 4), signif(sw2$statistic, 4),
                                 signif(sw2$p.value, 4), signif(as.vector(pt1$lambda), 4))
}
# output to console (screen)
{cat("Table 1: Shapiro-Wilk statistics and p-values for untransformed (_orig) and transformed
(_log, _pow) variables from soil and sediemnt analysis at Smith's Lake Reserve.\n\n")
print(transf_results, row.names = FALSE)}
## Table 1: Shapiro-Wilk statistics and p-values for untransformed (_orig) and transformed
## (_log, _pow) variables from soil and sediemnt analysis at Smith's Lake Reserve.
## 
##  Variable W_orig    p_orig W_log_tr  p_log_tr W_pow_tr  p_pow_tr  Pow_term
##        pH 0.9766   0.08979   0.9474 0.0008693   0.9867    0.4612     2.002
##        EC   0.76 1.274e-08   0.9823    0.5219   0.9824    0.5283  -0.01806
##        Al 0.8511 6.021e-08   0.8958 3.289e-06   0.9365 0.0003185    0.4553
##        As  0.143 2.892e-20   0.7882 6.415e-10   0.9269 9.749e-05   -0.4294
##        Ba 0.8951 3.067e-06   0.9885    0.6358   0.9885    0.6373 -0.007109
##        Ca 0.5646 9.559e-15   0.9585  0.006584   0.9763    0.1073   -0.1945
##        Cd   0.63 1.528e-12   0.9727    0.1006   0.9788    0.2345   -0.1594
##        Ce 0.8913 2.368e-06    0.985    0.4128    0.987    0.5425    0.1248
##        Cr 0.8176 4.732e-09   0.9192  3.96e-05   0.9406 0.0005403     0.348
##        Cu 0.5644 9.465e-15   0.9318 0.0001761   0.9867    0.5128   -0.5011
##        Fe 0.2707 7.295e-19   0.8549 8.176e-08    0.948  0.001465   -0.4639
##        Gd 0.6443 4.034e-13   0.9495   0.00207   0.9802    0.2102   -0.4718
##         K 0.8554 8.542e-08   0.9911    0.8206   0.9915    0.8426   -0.0517
##        La 0.7681 1.814e-10    0.969   0.03308   0.9889    0.6618   -0.4369
##        Mg 0.6524 4.011e-13   0.9853    0.4248    0.986    0.4653   -0.0477
##        Mn 0.9272 0.0001014   0.9602  0.008538   0.9936    0.9489    0.3702
##        Mo 0.1319 2.222e-20   0.8012 1.514e-09   0.9471  0.001283   -0.3986
##        Na 0.3949 2.606e-17   0.9437 0.0008182   0.9881    0.6041   -0.3711
##        Nd 0.7408 6.027e-11   0.8226 1.038e-08   0.9182 4.772e-05    0.4521
##        Ni 0.5497 5.354e-15   0.9732   0.06437   0.9764    0.1083  -0.07573
##         P 0.9036 7.254e-06   0.9796    0.1789   0.9932    0.9326    0.2506
##        Pb 0.6184 8.783e-14   0.9517  0.002464   0.9854    0.4297   -0.2961
##         S 0.1364 2.473e-20   0.8212 6.144e-09    0.962    0.0112   -0.4024
##        Sr 0.4609 2.194e-16   0.9523  0.002664   0.9822    0.2688   -0.2506
##        Th 0.8023 2.339e-08   0.9431  0.002948   0.9695   0.08177   -0.6177
##         V 0.5028 9.405e-16   0.9073 1.074e-05   0.9168 3.034e-05   -0.1409
##         Y 0.5532 6.128e-15   0.9579  0.006028   0.9829    0.2984   -0.3309
##        Zn 0.6155  7.75e-14   0.9828     0.295   0.9915    0.8421   -0.1493

We can export our results to a csv file for Excel (if desired):

write.csv(transf_results, file = "transformations.csv", row.names = FALSE)

Alternatively, we can make a publication-quality table using the flextable R package (Table 1).

flextable(transf_results) |> bold(part="header") |> width(width=2, unit="cm") |>
  set_caption(caption="Table 1: Table summarising normality tests for distributions of variables when untransformed, log10-transformed, or power-transformed. [W_orig = Shapiro-Wilk statistic (W) for the untransfomed variable; p_orig = Shapiro-Wilk p-value for the untransfomed variable; W_log_tr = W for log10-transformed; p_log_tr = p-value for log10-transformed; W_pow_tr = W for power-transformed; p_pow_tr = p-value for power-transformed; Pow_term = exponent for power transformation.] Data from Smiths-Veryard 2017.", align_with_table=F, fp_p=fp_par(text.align = "left", padding.bottom = 6))
Table 1: Table summarising normality tests for distributions of variables when untransformed, log10-transformed, or power-transformed. [W_orig = Shapiro-Wilk statistic (W) for the untransfomed variable; p_orig = Shapiro-Wilk p-value for the untransfomed variable; W_log_tr = W for log10-transformed; p_log_tr = p-value for log10-transformed; W_pow_tr = W for power-transformed; p_pow_tr = p-value for power-transformed; Pow_term = exponent for power transformation.] Data from Smiths-Veryard 2017.

Variable

W_orig

p_orig

W_log_tr

p_log_tr

W_pow_tr

p_pow_tr

Pow_term

pH

0.9766

0.08979

0.9474

0.0008693

0.9867

0.4612

2.002

EC

0.76

1.274e-08

0.9823

0.5219

0.9824

0.5283

-0.01806

Al

0.8511

6.021e-08

0.8958

3.289e-06

0.9365

0.0003185

0.4553

As

0.143

2.892e-20

0.7882

6.415e-10

0.9269

9.749e-05

-0.4294

Ba

0.8951

3.067e-06

0.9885

0.6358

0.9885

0.6373

-0.007109

Ca

0.5646

9.559e-15

0.9585

0.006584

0.9763

0.1073

-0.1945

Cd

0.63

1.528e-12

0.9727

0.1006

0.9788

0.2345

-0.1594

Ce

0.8913

2.368e-06

0.985

0.4128

0.987

0.5425

0.1248

Cr

0.8176

4.732e-09

0.9192

3.96e-05

0.9406

0.0005403

0.348

Cu

0.5644

9.465e-15

0.9318

0.0001761

0.9867

0.5128

-0.5011

Fe

0.2707

7.295e-19

0.8549

8.176e-08

0.948

0.001465

-0.4639

Gd

0.6443

4.034e-13

0.9495

0.00207

0.9802

0.2102

-0.4718

K

0.8554

8.542e-08

0.9911

0.8206

0.9915

0.8426

-0.0517

La

0.7681

1.814e-10

0.969

0.03308

0.9889

0.6618

-0.4369

Mg

0.6524

4.011e-13

0.9853

0.4248

0.986

0.4653

-0.0477

Mn

0.9272

0.0001014

0.9602

0.008538

0.9936

0.9489

0.3702

Mo

0.1319

2.222e-20

0.8012

1.514e-09

0.9471

0.001283

-0.3986

Na

0.3949

2.606e-17

0.9437

0.0008182

0.9881

0.6041

-0.3711

Nd

0.7408

6.027e-11

0.8226

1.038e-08

0.9182

4.772e-05

0.4521

Ni

0.5497

5.354e-15

0.9732

0.06437

0.9764

0.1083

-0.07573

P

0.9036

7.254e-06

0.9796

0.1789

0.9932

0.9326

0.2506

Pb

0.6184

8.783e-14

0.9517

0.002464

0.9854

0.4297

-0.2961

S

0.1364

2.473e-20

0.8212

6.144e-09

0.962

0.0112

-0.4024

Sr

0.4609

2.194e-16

0.9523

0.002664

0.9822

0.2688

-0.2506

Th

0.8023

2.339e-08

0.9431

0.002948

0.9695

0.08177

-0.6177

V

0.5028

9.405e-16

0.9073

1.074e-05

0.9168

3.034e-05

-0.1409

Y

0.5532

6.128e-15

0.9579

0.006028

0.9829

0.2984

-0.3309

Zn

0.6155

7.75e-14

0.9828

0.295

0.9915

0.8421

-0.1493

 


CC-BY-SA • All content by Ratey-AtUWA. My employer does not necessarily know about or endorse the content of this website.
Created with rmarkdown in RStudio. Currently using the free yeti theme from Bootswatch.