Material to support teaching in Environmental Science at The University of Western AustraliaWe need to load the R package car, as
it's needed for the powerTransform() function. For this
session we load and use the sv2017 data from Smith's Lake
and Charles Veryard Reserves in 2017.
library(car)
git <- "https://github.com/Ratey-AtUWA/Learn-R-web/raw/main/"
sv2017 <- read.csv(paste0(git, "sv2017_original.csv"))It's often a good idea to make a copy of the data, so we don't lose
any information from the original if something goes horribly wrong. We
then check our data using the summary() function:
## ICP.Seq.No Group Sample Type Easting Northing
## Length:95 Min. : 1.000 Length:95 Length:95 Min. :391064 Min. :6466246
## Class :character 1st Qu.: 3.000 Class :character Class :character 1st Qu.:391242 1st Qu.:6466430
## Mode :character Median : 6.000 Mode :character Mode :character Median :391338 Median :6466589
## Mean : 6.344 Mean :391301 Mean :6466556
## 3rd Qu.: 9.000 3rd Qu.:391384 3rd Qu.:6466676
## Max. :12.000 Max. :391431 Max. :6466771
## NA's :2
## Longitude Latitude pH EC Al As Ba
## Min. :115.8 Min. :-31.93 Min. :4.630 Min. : 29.1 Min. : 396 Min. : 0.480 Min. : 3.990
## 1st Qu.:115.8 1st Qu.:-31.93 1st Qu.:6.388 1st Qu.:106.6 1st Qu.:2106 1st Qu.: 1.738 1st Qu.: 9.755
## Median :115.9 Median :-31.93 Median :6.820 Median :152.6 Median :2503 Median : 2.100 Median :13.550
## Mean :115.9 Mean :-31.93 Mean :6.813 Mean :188.5 Mean :2558 Mean : 3.758 Mean :15.102
## 3rd Qu.:115.9 3rd Qu.:-31.93 3rd Qu.:7.280 3rd Qu.:215.1 3rd Qu.:2933 3rd Qu.: 2.800 3rd Qu.:19.938
## Max. :115.9 Max. :-31.93 Max. :8.630 Max. :835.0 Max. :8463 Max. :113.630 Max. :41.670
## NA's :1 NA's :34 NA's :7 NA's :7 NA's :7
## Ca Cd Ce Cr Cu Fe Gd
## Min. : 283 Min. :0.0200 Min. : 1.190 Min. : 1.170 Min. : 2.130 Min. : 833 Min. :0.2100
## 1st Qu.: 1189 1st Qu.:0.0500 1st Qu.: 3.290 1st Qu.: 5.610 1st Qu.: 4.213 1st Qu.: 2013 1st Qu.:0.3700
## Median : 1837 Median :0.0700 Median : 4.270 Median : 7.085 Median : 6.010 Median : 2589 Median :0.4650
## Mean : 5768 Mean :0.1008 Mean : 4.576 Mean : 7.249 Mean :10.794 Mean : 3132 Mean :0.5166
## 3rd Qu.: 6510 3rd Qu.:0.1200 3rd Qu.: 5.420 3rd Qu.: 8.450 3rd Qu.:12.200 3rd Qu.: 3362 3rd Qu.:0.6300
## Max. :66286 Max. :0.6600 Max. :14.080 Max. :26.290 Max. :87.380 Max. :36052 Max. :2.4100
## NA's :7 NA's :19 NA's :8 NA's :7 NA's :7 NA's :7 NA's :9
## K La Mg Mn Mo Na Nd
## Min. : 41.1 Min. :0.750 Min. : 57.34 Min. : 2.68 Min. : 0.050 Min. : 27.80 Min. :0.050
## 1st Qu.:101.7 1st Qu.:1.560 1st Qu.: 237.21 1st Qu.: 18.89 1st Qu.: 0.170 1st Qu.: 70.75 1st Qu.:1.230
## Median :151.7 Median :1.895 Median : 351.26 Median : 30.82 Median : 0.230 Median : 107.95 Median :1.620
## Mean :173.5 Mean :2.110 Mean : 467.74 Mean : 35.57 Mean : 0.418 Mean : 142.02 Mean :1.692
## 3rd Qu.:221.9 3rd Qu.:2.413 3rd Qu.: 563.90 3rd Qu.: 47.97 3rd Qu.: 0.310 3rd Qu.: 163.43 3rd Qu.:2.110
## Max. :634.1 Max. :7.530 Max. :3253.43 Max. :127.42 Max. :14.470 Max. :1613.70 Max. :8.540
## NA's :7 NA's :7 NA's :7 NA's :7 NA's :7 NA's :7 NA's :10
## Ni P Pb S Sr Th V
## Min. : 0.250 Min. : 18.5 Min. : 3.23 Min. : 42.96 Min. : 1.87 Min. :0.480 Min. : 1.280
## 1st Qu.: 1.252 1st Qu.:113.0 1st Qu.: 10.16 1st Qu.: 183.44 1st Qu.: 6.24 1st Qu.:0.650 1st Qu.: 4.330
## Median : 2.040 Median :186.8 Median : 15.36 Median : 281.77 Median : 10.37 Median :0.940 Median : 5.360
## Mean : 2.536 Mean :218.2 Mean : 28.42 Mean : 897.30 Mean : 31.24 Mean :1.171 Mean : 5.727
## 3rd Qu.: 2.768 3rd Qu.:298.8 3rd Qu.: 27.12 3rd Qu.: 425.77 3rd Qu.: 27.82 3rd Qu.:1.400 3rd Qu.: 6.400
## Max. :21.340 Max. :846.9 Max. :173.78 Max. :41484.84 Max. :473.42 Max. :3.840 Max. :33.620
## NA's :7 NA's :7 NA's :7 NA's :7 NA's :7 NA's :24 NA's :7
## Y Zn
## Min. : 0.440 Min. : 5.59
## 1st Qu.: 1.130 1st Qu.: 23.96
## Median : 1.490 Median : 38.60
## Mean : 1.808 Mean : 58.71
## 3rd Qu.: 2.053 3rd Qu.: 63.63
## Max. :12.320 Max. :452.16
## NA's :7 NA's :7
The summary() shows us that the first numeric variable
we want to check the distribution for is pH. There are
other numeric columns before this (Group,
Easting,
Northing,Longitude,Latitude) but
we don't usually tend to check the distribution for this type of numeric
or integer variable, as they are really sample identifiers.
If you use this code for your own data, change the value of
v0 to your first numeric variable of interest
(i.e. the first you want to check the distribution of).
We then create a temporary object names.of.cols storing
the names of variables to be transformed, and a vector colz
of the column indices which are numeric:
names.of.cols <- names(data0)
colzTF <- unlist(lapply(data0, is.numeric), use.names = FALSE)
colz <- which(colzTF==TRUE)We make an empty output data frame transf_results to
hold the results of our distribution checks:
transf_results <- data.frame("Variable"=rep(NA,length(colz)),
"W_orig"=rep(NA,length(colz)),
"p_orig"=rep(NA,length(colz)), "W_log_tr"=rep(NA,length(colz)),
"p_log_tr"=rep(NA,length(colz)), "W_pow_tr"=rep(NA,length(colz)),
"p_pow_tr"=rep(NA,length(colz)), "Pow_term"=rep(NA,length(colz)))We then start a programming loop using for(...) that
assesses variable distributions and writes the statistical output to the
data frame transf_results we created earlier.
Note that we create R objects to hold the results of statistical
tests – this stores them temporarily so that we can write them to the
results data frame.
The object pt1 that hold the
results of the powerTansform() function contains an item
pt1$lambda which is the relevant power term for each
variable. Each time we cycle through the for() loop, we
write a row of the output data frame transf_results. We use
signif() to round to a specified number (4) of
significant digits.
for (i in colz) {
pt1 <- powerTransform(data0[, i])
sw0 <- shapiro.test(data0[, i])
sw1 <- shapiro.test(log10(data0[, i]))
sw2 <- shapiro.test((data0[, i]) ^ as.numeric(pt1$lambda))
transf_results[i-(colz[1]-1),] <- c(names.of.cols[i], signif(sw0$statistic, 4),
signif(sw0$p.value, 4), signif(sw1$statistic, 4),
signif(sw1$p.value, 4), signif(sw2$statistic, 4),
signif(sw2$p.value, 4), signif(as.vector(pt1$lambda), 4))
}
# output to console (screen)
{cat("Table 1: Shapiro-Wilk statistics and p-values for untransformed (_orig) and transformed
(_log, _pow) variables from soil and sediemnt analysis at Smith's Lake Reserve.\n\n")
print(transf_results, row.names = FALSE)}## Table 1: Shapiro-Wilk statistics and p-values for untransformed (_orig) and transformed
## (_log, _pow) variables from soil and sediemnt analysis at Smith's Lake Reserve.
##
## Variable W_orig p_orig W_log_tr p_log_tr W_pow_tr p_pow_tr Pow_term
## pH 0.9766 0.08979 0.9474 0.0008693 0.9867 0.4612 2.002
## EC 0.76 1.274e-08 0.9823 0.5219 0.9824 0.5283 -0.01806
## Al 0.8511 6.021e-08 0.8958 3.289e-06 0.9365 0.0003185 0.4553
## As 0.143 2.892e-20 0.7882 6.415e-10 0.9269 9.749e-05 -0.4294
## Ba 0.8951 3.067e-06 0.9885 0.6358 0.9885 0.6373 -0.007109
## Ca 0.5646 9.559e-15 0.9585 0.006584 0.9763 0.1073 -0.1945
## Cd 0.63 1.528e-12 0.9727 0.1006 0.9788 0.2345 -0.1594
## Ce 0.8913 2.368e-06 0.985 0.4128 0.987 0.5425 0.1248
## Cr 0.8176 4.732e-09 0.9192 3.96e-05 0.9406 0.0005403 0.348
## Cu 0.5644 9.465e-15 0.9318 0.0001761 0.9867 0.5128 -0.5011
## Fe 0.2707 7.295e-19 0.8549 8.176e-08 0.948 0.001465 -0.4639
## Gd 0.6443 4.034e-13 0.9495 0.00207 0.9802 0.2102 -0.4718
## K 0.8554 8.542e-08 0.9911 0.8206 0.9915 0.8426 -0.0517
## La 0.7681 1.814e-10 0.969 0.03308 0.9889 0.6618 -0.4369
## Mg 0.6524 4.011e-13 0.9853 0.4248 0.986 0.4653 -0.0477
## Mn 0.9272 0.0001014 0.9602 0.008538 0.9936 0.9489 0.3702
## Mo 0.1319 2.222e-20 0.8012 1.514e-09 0.9471 0.001283 -0.3986
## Na 0.3949 2.606e-17 0.9437 0.0008182 0.9881 0.6041 -0.3711
## Nd 0.7408 6.027e-11 0.8226 1.038e-08 0.9182 4.772e-05 0.4521
## Ni 0.5497 5.354e-15 0.9732 0.06437 0.9764 0.1083 -0.07573
## P 0.9036 7.254e-06 0.9796 0.1789 0.9932 0.9326 0.2506
## Pb 0.6184 8.783e-14 0.9517 0.002464 0.9854 0.4297 -0.2961
## S 0.1364 2.473e-20 0.8212 6.144e-09 0.962 0.0112 -0.4024
## Sr 0.4609 2.194e-16 0.9523 0.002664 0.9822 0.2688 -0.2506
## Th 0.8023 2.339e-08 0.9431 0.002948 0.9695 0.08177 -0.6177
## V 0.5028 9.405e-16 0.9073 1.074e-05 0.9168 3.034e-05 -0.1409
## Y 0.5532 6.128e-15 0.9579 0.006028 0.9829 0.2984 -0.3309
## Zn 0.6155 7.75e-14 0.9828 0.295 0.9915 0.8421 -0.1493
We can export our results to a csv file for Excel (if desired):
Alternatively, we can make a publication-quality table using the
flextable R package (Table 1).
flextable(transf_results) |> bold(part="header") |> width(width=2, unit="cm") |>
set_caption(caption="Table 1: Table summarising normality tests for distributions of variables when untransformed, log10-transformed, or power-transformed. [W_orig = Shapiro-Wilk statistic (W) for the untransfomed variable; p_orig = Shapiro-Wilk p-value for the untransfomed variable; W_log_tr = W for log10-transformed; p_log_tr = p-value for log10-transformed; W_pow_tr = W for power-transformed; p_pow_tr = p-value for power-transformed; Pow_term = exponent for power transformation.] Data from Smiths-Veryard 2017.", align_with_table=F, fp_p=fp_par(text.align = "left", padding.bottom = 6))Variable | W_orig | p_orig | W_log_tr | p_log_tr | W_pow_tr | p_pow_tr | Pow_term |
|---|---|---|---|---|---|---|---|
pH | 0.9766 | 0.08979 | 0.9474 | 0.0008693 | 0.9867 | 0.4612 | 2.002 |
EC | 0.76 | 1.274e-08 | 0.9823 | 0.5219 | 0.9824 | 0.5283 | -0.01806 |
Al | 0.8511 | 6.021e-08 | 0.8958 | 3.289e-06 | 0.9365 | 0.0003185 | 0.4553 |
As | 0.143 | 2.892e-20 | 0.7882 | 6.415e-10 | 0.9269 | 9.749e-05 | -0.4294 |
Ba | 0.8951 | 3.067e-06 | 0.9885 | 0.6358 | 0.9885 | 0.6373 | -0.007109 |
Ca | 0.5646 | 9.559e-15 | 0.9585 | 0.006584 | 0.9763 | 0.1073 | -0.1945 |
Cd | 0.63 | 1.528e-12 | 0.9727 | 0.1006 | 0.9788 | 0.2345 | -0.1594 |
Ce | 0.8913 | 2.368e-06 | 0.985 | 0.4128 | 0.987 | 0.5425 | 0.1248 |
Cr | 0.8176 | 4.732e-09 | 0.9192 | 3.96e-05 | 0.9406 | 0.0005403 | 0.348 |
Cu | 0.5644 | 9.465e-15 | 0.9318 | 0.0001761 | 0.9867 | 0.5128 | -0.5011 |
Fe | 0.2707 | 7.295e-19 | 0.8549 | 8.176e-08 | 0.948 | 0.001465 | -0.4639 |
Gd | 0.6443 | 4.034e-13 | 0.9495 | 0.00207 | 0.9802 | 0.2102 | -0.4718 |
K | 0.8554 | 8.542e-08 | 0.9911 | 0.8206 | 0.9915 | 0.8426 | -0.0517 |
La | 0.7681 | 1.814e-10 | 0.969 | 0.03308 | 0.9889 | 0.6618 | -0.4369 |
Mg | 0.6524 | 4.011e-13 | 0.9853 | 0.4248 | 0.986 | 0.4653 | -0.0477 |
Mn | 0.9272 | 0.0001014 | 0.9602 | 0.008538 | 0.9936 | 0.9489 | 0.3702 |
Mo | 0.1319 | 2.222e-20 | 0.8012 | 1.514e-09 | 0.9471 | 0.001283 | -0.3986 |
Na | 0.3949 | 2.606e-17 | 0.9437 | 0.0008182 | 0.9881 | 0.6041 | -0.3711 |
Nd | 0.7408 | 6.027e-11 | 0.8226 | 1.038e-08 | 0.9182 | 4.772e-05 | 0.4521 |
Ni | 0.5497 | 5.354e-15 | 0.9732 | 0.06437 | 0.9764 | 0.1083 | -0.07573 |
P | 0.9036 | 7.254e-06 | 0.9796 | 0.1789 | 0.9932 | 0.9326 | 0.2506 |
Pb | 0.6184 | 8.783e-14 | 0.9517 | 0.002464 | 0.9854 | 0.4297 | -0.2961 |
S | 0.1364 | 2.473e-20 | 0.8212 | 6.144e-09 | 0.962 | 0.0112 | -0.4024 |
Sr | 0.4609 | 2.194e-16 | 0.9523 | 0.002664 | 0.9822 | 0.2688 | -0.2506 |
Th | 0.8023 | 2.339e-08 | 0.9431 | 0.002948 | 0.9695 | 0.08177 | -0.6177 |
V | 0.5028 | 9.405e-16 | 0.9073 | 1.074e-05 | 0.9168 | 3.034e-05 | -0.1409 |
Y | 0.5532 | 6.128e-15 | 0.9579 | 0.006028 | 0.9829 | 0.2984 | -0.3309 |
Zn | 0.6155 | 7.75e-14 | 0.9828 | 0.295 | 0.9915 | 0.8421 | -0.1493 |
CC-BY-SA • All content by Ratey-AtUWA. My employer does not necessarily know about or endorse the content of this website.
Created with rmarkdown in RStudio. Currently using the free yeti theme from Bootswatch.