UWA logoMaterial to support teaching in Environmental Science at The University of Western Australia

Units ENVT3361, ENVT4461, and ENVT5503

We need to load the R package car, as it's needed for the powerTransform() function. For this session we load and use the sv2017 data from Smith's Lake and Charles Veryard Reserves in 2017.

library(car)
git <- "https://github.com/Ratey-AtUWA/Learn-R-web/raw/main/"
sv2017 <- read.csv(paste0(git, "sv2017_original.csv"))

It's often a good idea to make a copy of the data, so we don't lose any information from the original if something goes horribly wrong. We then check our data using the summary() function:

data0 <- sv2017
summary(data0)
##   ICP.Seq.No            Group           Sample              Type              Easting      
##  Length:95          Min.   : 1.000   Length:95          Length:95          Min.   :391064  
##  Class :character   1st Qu.: 3.000   Class :character   Class :character   1st Qu.:391242  
##  Mode  :character   Median : 6.000   Mode  :character   Mode  :character   Median :391338  
##                     Mean   : 6.344                                         Mean   :391301  
##                     3rd Qu.: 9.000                                         3rd Qu.:391384  
##                     Max.   :12.000                                         Max.   :391431  
##                     NA's   :2                                                              
##     Northing         Longitude        Latitude            pH              EC       
##  Min.   :6466246   Min.   :115.8   Min.   :-31.93   Min.   :4.630   Min.   : 29.1  
##  1st Qu.:6466430   1st Qu.:115.8   1st Qu.:-31.93   1st Qu.:6.388   1st Qu.:106.6  
##  Median :6466589   Median :115.9   Median :-31.93   Median :6.820   Median :152.6  
##  Mean   :6466556   Mean   :115.9   Mean   :-31.93   Mean   :6.813   Mean   :188.5  
##  3rd Qu.:6466676   3rd Qu.:115.9   3rd Qu.:-31.93   3rd Qu.:7.280   3rd Qu.:215.1  
##  Max.   :6466771   Max.   :115.9   Max.   :-31.93   Max.   :8.630   Max.   :835.0  
##                                                     NA's   :1       NA's   :34     
##        Al             As                Ba               Ca              Cd        
##  Min.   : 396   Min.   :  0.480   Min.   : 3.990   Min.   :  283   Min.   :0.0200  
##  1st Qu.:2106   1st Qu.:  1.738   1st Qu.: 9.755   1st Qu.: 1189   1st Qu.:0.0500  
##  Median :2503   Median :  2.100   Median :13.550   Median : 1837   Median :0.0700  
##  Mean   :2558   Mean   :  3.758   Mean   :15.102   Mean   : 5768   Mean   :0.1008  
##  3rd Qu.:2933   3rd Qu.:  2.800   3rd Qu.:19.938   3rd Qu.: 6510   3rd Qu.:0.1200  
##  Max.   :8463   Max.   :113.630   Max.   :41.670   Max.   :66286   Max.   :0.6600  
##  NA's   :7      NA's   :7         NA's   :7        NA's   :7       NA's   :19      
##        Ce               Cr               Cu               Fe              Gd        
##  Min.   : 1.190   Min.   : 1.170   Min.   : 2.130   Min.   :  833   Min.   :0.2100  
##  1st Qu.: 3.290   1st Qu.: 5.610   1st Qu.: 4.213   1st Qu.: 2013   1st Qu.:0.3700  
##  Median : 4.270   Median : 7.085   Median : 6.010   Median : 2589   Median :0.4650  
##  Mean   : 4.576   Mean   : 7.249   Mean   :10.794   Mean   : 3132   Mean   :0.5166  
##  3rd Qu.: 5.420   3rd Qu.: 8.450   3rd Qu.:12.200   3rd Qu.: 3362   3rd Qu.:0.6300  
##  Max.   :14.080   Max.   :26.290   Max.   :87.380   Max.   :36052   Max.   :2.4100  
##  NA's   :8        NA's   :7        NA's   :7        NA's   :7       NA's   :9       
##        K               La              Mg                Mn               Mo        
##  Min.   : 41.1   Min.   :0.750   Min.   :  57.34   Min.   :  2.68   Min.   : 0.050  
##  1st Qu.:101.7   1st Qu.:1.560   1st Qu.: 237.21   1st Qu.: 18.89   1st Qu.: 0.170  
##  Median :151.7   Median :1.895   Median : 351.26   Median : 30.82   Median : 0.230  
##  Mean   :173.5   Mean   :2.110   Mean   : 467.74   Mean   : 35.57   Mean   : 0.418  
##  3rd Qu.:221.9   3rd Qu.:2.413   3rd Qu.: 563.90   3rd Qu.: 47.97   3rd Qu.: 0.310  
##  Max.   :634.1   Max.   :7.530   Max.   :3253.43   Max.   :127.42   Max.   :14.470  
##  NA's   :7       NA's   :7       NA's   :7         NA's   :7        NA's   :7       
##        Na                Nd              Ni               P               Pb        
##  Min.   :  27.80   Min.   :0.050   Min.   : 0.250   Min.   : 18.5   Min.   :  3.23  
##  1st Qu.:  70.75   1st Qu.:1.230   1st Qu.: 1.252   1st Qu.:113.0   1st Qu.: 10.16  
##  Median : 107.95   Median :1.620   Median : 2.040   Median :186.8   Median : 15.36  
##  Mean   : 142.02   Mean   :1.692   Mean   : 2.536   Mean   :218.2   Mean   : 28.42  
##  3rd Qu.: 163.43   3rd Qu.:2.110   3rd Qu.: 2.768   3rd Qu.:298.8   3rd Qu.: 27.12  
##  Max.   :1613.70   Max.   :8.540   Max.   :21.340   Max.   :846.9   Max.   :173.78  
##  NA's   :7         NA's   :10      NA's   :7        NA's   :7       NA's   :7       
##        S                  Sr               Th              V                Y         
##  Min.   :   42.96   Min.   :  1.87   Min.   :0.480   Min.   : 1.280   Min.   : 0.440  
##  1st Qu.:  183.44   1st Qu.:  6.24   1st Qu.:0.650   1st Qu.: 4.330   1st Qu.: 1.130  
##  Median :  281.77   Median : 10.37   Median :0.940   Median : 5.360   Median : 1.490  
##  Mean   :  897.30   Mean   : 31.24   Mean   :1.171   Mean   : 5.727   Mean   : 1.808  
##  3rd Qu.:  425.77   3rd Qu.: 27.82   3rd Qu.:1.400   3rd Qu.: 6.400   3rd Qu.: 2.053  
##  Max.   :41484.84   Max.   :473.42   Max.   :3.840   Max.   :33.620   Max.   :12.320  
##  NA's   :7          NA's   :7        NA's   :24      NA's   :7        NA's   :7       
##        Zn        
##  Min.   :  5.59  
##  1st Qu.: 23.96  
##  Median : 38.60  
##  Mean   : 58.71  
##  3rd Qu.: 63.63  
##  Max.   :452.16  
##  NA's   :7

The summary() shows us that the first numeric variable we want to check the distribution for is pH. There are other numeric columns before this (Group, Easting, Northing,Longitude,Latitude) but we don't usually tend to check the distribution for this type of numeric or integer variable, as they are really sample identifiers.

If you use this code for your own data, change the value of v0 to your first numeric variable of interest (i.e. the first you want to check the distribution of).

v0 <- "pH"

Subset the input data from first relevant numeric column to end

data0 <- data0[,which(colnames(data0)==v0):ncol(data0)]

We then create a temporary object names.of.cols storing the names of variables to be transformed, and a vector colz of the column indices which are numeric:

names.of.cols <- names(data0)
colzTF <- unlist(lapply(data0, is.numeric), use.names = FALSE)
colz <- which(colzTF==TRUE)

We make an empty output data frame transf_results to hold the results of our distribution checks:

transf_results <- data.frame("Variable"=rep(NA,length(colz)),
              "W_orig"=rep(NA,length(colz)),
              "p_orig"=rep(NA,length(colz)), "W_log_tr"=rep(NA,length(colz)),
              "p_log_tr"=rep(NA,length(colz)), "W_pow_tr"=rep(NA,length(colz)),
              "p_pow_tr"=rep(NA,length(colz)), "Pow_term"=rep(NA,length(colz)))

We then start a programming loop using for(...) that assesses variable distributions and writes the statistical output to the data frame transf_results we created earlier.

Note that we create R objects to hold the results of statistical tests – this stores them temporarily so that we can write them to the results data frame.
The object pt1 that hold the results of the powerTansform() function contains an item pt1$lambda which is the relevant power term for each variable. Each time we cycle through the for() loop, we write a row of the output data frame transf_results. We use signif() to round to a specified number (4) of significant digits.

for (i in colz) {
  pt1 <- powerTransform(data0[, i])
  sw0 <- shapiro.test(data0[, i])
  sw1 <- shapiro.test(log10(data0[, i]))
  sw2 <- shapiro.test((data0[, i]) ^ as.numeric(pt1$lambda))
  transf_results[i-(colz[1]-1),] <- c(names.of.cols[i], signif(sw0$statistic, 4),
                                 signif(sw0$p.value, 4), signif(sw1$statistic, 4),
                                 signif(sw1$p.value, 4), signif(sw2$statistic, 4),
                                 signif(sw2$p.value, 4), signif(as.vector(pt1$lambda), 4))
}
# output to console (screen)
{cat("Table 1: Shapiro-Wilk statistics and p-values for untransformed (_orig) and transformed
(_log, _pow) variables from soil and sediemnt analysis at Smith's Lake Reserve.\n\n")
print(transf_results, row.names = FALSE)}
## Table 1: Shapiro-Wilk statistics and p-values for untransformed (_orig) and transformed
## (_log, _pow) variables from soil and sediemnt analysis at Smith's Lake Reserve.
## 
##  Variable W_orig    p_orig W_log_tr  p_log_tr W_pow_tr  p_pow_tr  Pow_term
##        pH 0.9766   0.08979   0.9474 0.0008693   0.9867    0.4612     2.002
##        EC   0.76 1.274e-08   0.9823    0.5219   0.9824    0.5283  -0.01806
##        Al 0.8511 6.021e-08   0.8958 3.289e-06   0.9365 0.0003185    0.4553
##        As  0.143 2.892e-20   0.7882 6.415e-10   0.9269 9.749e-05   -0.4294
##        Ba 0.8951 3.067e-06   0.9885    0.6358   0.9885    0.6373 -0.007109
##        Ca 0.5646 9.559e-15   0.9585  0.006584   0.9763    0.1073   -0.1945
##        Cd   0.63 1.528e-12   0.9727    0.1006   0.9788    0.2345   -0.1594
##        Ce 0.8913 2.368e-06    0.985    0.4128    0.987    0.5425    0.1248
##        Cr 0.8176 4.732e-09   0.9192  3.96e-05   0.9406 0.0005403     0.348
##        Cu 0.5644 9.465e-15   0.9318 0.0001761   0.9867    0.5128   -0.5011
##        Fe 0.2707 7.295e-19   0.8549 8.176e-08    0.948  0.001465   -0.4639
##        Gd 0.6443 4.034e-13   0.9495   0.00207   0.9802    0.2102   -0.4718
##         K 0.8554 8.542e-08   0.9911    0.8206   0.9915    0.8426   -0.0517
##        La 0.7681 1.814e-10    0.969   0.03308   0.9889    0.6618   -0.4369
##        Mg 0.6524 4.011e-13   0.9853    0.4248    0.986    0.4653   -0.0477
##        Mn 0.9272 0.0001014   0.9602  0.008538   0.9936    0.9489    0.3702
##        Mo 0.1319 2.222e-20   0.8012 1.514e-09   0.9471  0.001283   -0.3986
##        Na 0.3949 2.606e-17   0.9437 0.0008182   0.9881    0.6041   -0.3711
##        Nd 0.7408 6.027e-11   0.8226 1.038e-08   0.9182 4.772e-05    0.4521
##        Ni 0.5497 5.354e-15   0.9732   0.06437   0.9764    0.1083  -0.07573
##         P 0.9036 7.254e-06   0.9796    0.1789   0.9932    0.9326    0.2506
##        Pb 0.6184 8.783e-14   0.9517  0.002464   0.9854    0.4297   -0.2961
##         S 0.1364 2.473e-20   0.8212 6.144e-09    0.962    0.0112   -0.4024
##        Sr 0.4609 2.194e-16   0.9523  0.002664   0.9822    0.2688   -0.2506
##        Th 0.8023 2.339e-08   0.9431  0.002948   0.9695   0.08177   -0.6177
##         V 0.5028 9.405e-16   0.9073 1.074e-05   0.9168 3.034e-05   -0.1409
##         Y 0.5532 6.128e-15   0.9579  0.006028   0.9829    0.2984   -0.3309
##        Zn 0.6155  7.75e-14   0.9828     0.295   0.9915    0.8421   -0.1493

We can export our results to a csv file for Excel (if desired):

write.csv(transf_results, file = "transformations.csv", row.names = FALSE)

Alternatively, we can make a publication-quality table using the flextable R package (Table 1).

flextable(transf_results) |> bold(part="header") |> width(width=2, unit="cm") |>
  set_caption(caption="Table 1: Table summarising normality tests for distributions of variables when untransformed, log10-transformed, or power-transformed. [W_orig = Shapiro-Wilk statistic (W) for the untransfomed variable; p_orig = Shapiro-Wilk p-value for the untransfomed variable; W_log_tr = W for log10-transformed; p_log_tr = p-value for log10-transformed; W_pow_tr = W for power-transformed; p_pow_tr = p-value for power-transformed; Pow_term = exponent for power transformation.] Data from Smiths-Veryard 2017.", align_with_table=F, fp_p=fp_par(text.align = "left", padding.bottom = 6))
Table 1: Table summarising normality tests for distributions of variables when untransformed, log10-transformed, or power-transformed. [W_orig = Shapiro-Wilk statistic (W) for the untransfomed variable; p_orig = Shapiro-Wilk p-value for the untransfomed variable; W_log_tr = W for log10-transformed; p_log_tr = p-value for log10-transformed; W_pow_tr = W for power-transformed; p_pow_tr = p-value for power-transformed; Pow_term = exponent for power transformation.] Data from Smiths-Veryard 2017.

Variable

W_orig

p_orig

W_log_tr

p_log_tr

W_pow_tr

p_pow_tr

Pow_term

pH

0.9766

0.08979

0.9474

0.0008693

0.9867

0.4612

2.002

EC

0.76

1.274e-08

0.9823

0.5219

0.9824

0.5283

-0.01806

Al

0.8511

6.021e-08

0.8958

3.289e-06

0.9365

0.0003185

0.4553

As

0.143

2.892e-20

0.7882

6.415e-10

0.9269

9.749e-05

-0.4294

Ba

0.8951

3.067e-06

0.9885

0.6358

0.9885

0.6373

-0.007109

Ca

0.5646

9.559e-15

0.9585

0.006584

0.9763

0.1073

-0.1945

Cd

0.63

1.528e-12

0.9727

0.1006

0.9788

0.2345

-0.1594

Ce

0.8913

2.368e-06

0.985

0.4128

0.987

0.5425

0.1248

Cr

0.8176

4.732e-09

0.9192

3.96e-05

0.9406

0.0005403

0.348

Cu

0.5644

9.465e-15

0.9318

0.0001761

0.9867

0.5128

-0.5011

Fe

0.2707

7.295e-19

0.8549

8.176e-08

0.948

0.001465

-0.4639

Gd

0.6443

4.034e-13

0.9495

0.00207

0.9802

0.2102

-0.4718

K

0.8554

8.542e-08

0.9911

0.8206

0.9915

0.8426

-0.0517

La

0.7681

1.814e-10

0.969

0.03308

0.9889

0.6618

-0.4369

Mg

0.6524

4.011e-13

0.9853

0.4248

0.986

0.4653

-0.0477

Mn

0.9272

0.0001014

0.9602

0.008538

0.9936

0.9489

0.3702

Mo

0.1319

2.222e-20

0.8012

1.514e-09

0.9471

0.001283

-0.3986

Na

0.3949

2.606e-17

0.9437

0.0008182

0.9881

0.6041

-0.3711

Nd

0.7408

6.027e-11

0.8226

1.038e-08

0.9182

4.772e-05

0.4521

Ni

0.5497

5.354e-15

0.9732

0.06437

0.9764

0.1083

-0.07573

P

0.9036

7.254e-06

0.9796

0.1789

0.9932

0.9326

0.2506

Pb

0.6184

8.783e-14

0.9517

0.002464

0.9854

0.4297

-0.2961

S

0.1364

2.473e-20

0.8212

6.144e-09

0.962

0.0112

-0.4024

Sr

0.4609

2.194e-16

0.9523

0.002664

0.9822

0.2688

-0.2506

Th

0.8023

2.339e-08

0.9431

0.002948

0.9695

0.08177

-0.6177

V

0.5028

9.405e-16

0.9073

1.074e-05

0.9168

3.034e-05

-0.1409

Y

0.5532

6.128e-15

0.9579

0.006028

0.9829

0.2984

-0.3309

Zn

0.6155

7.75e-14

0.9828

0.295

0.9915

0.8421

-0.1493

 


CC-BY-SA • All content by Ratey-AtUWA. My employer does not necessarily know about or endorse the content of this website.
Created with rmarkdown in RStudio. Currently using the free yeti theme from Bootswatch.