We need to load the R package car
, as
it's needed for the powerTransform()
function. For this
session we load and use the sv2017
data from Smith's Lake
and Charles Veryard Reserves in 2017.
library(car)
git <- "https://github.com/Ratey-AtUWA/Learn-R-web/raw/main/"
sv2017 <- read.csv(paste0(git, "sv2017_original.csv"))
It's often a good idea to make a copy of the data, so we don't lose
any information from the original if something goes horribly wrong. We
then check our data using the summary()
function:
## ICP.Seq.No Group Sample Type Easting
## Length:95 Min. : 1.000 Length:95 Length:95 Min. :391064
## Class :character 1st Qu.: 3.000 Class :character Class :character 1st Qu.:391242
## Mode :character Median : 6.000 Mode :character Mode :character Median :391338
## Mean : 6.344 Mean :391301
## 3rd Qu.: 9.000 3rd Qu.:391384
## Max. :12.000 Max. :391431
## NA's :2
## Northing Longitude Latitude pH EC
## Min. :6466246 Min. :115.8 Min. :-31.93 Min. :4.630 Min. : 29.1
## 1st Qu.:6466430 1st Qu.:115.8 1st Qu.:-31.93 1st Qu.:6.388 1st Qu.:106.6
## Median :6466589 Median :115.9 Median :-31.93 Median :6.820 Median :152.6
## Mean :6466556 Mean :115.9 Mean :-31.93 Mean :6.813 Mean :188.5
## 3rd Qu.:6466676 3rd Qu.:115.9 3rd Qu.:-31.93 3rd Qu.:7.280 3rd Qu.:215.1
## Max. :6466771 Max. :115.9 Max. :-31.93 Max. :8.630 Max. :835.0
## NA's :1 NA's :34
## Al As Ba Ca Cd
## Min. : 396 Min. : 0.480 Min. : 3.990 Min. : 283 Min. :0.0200
## 1st Qu.:2106 1st Qu.: 1.738 1st Qu.: 9.755 1st Qu.: 1189 1st Qu.:0.0500
## Median :2503 Median : 2.100 Median :13.550 Median : 1837 Median :0.0700
## Mean :2558 Mean : 3.758 Mean :15.102 Mean : 5768 Mean :0.1008
## 3rd Qu.:2933 3rd Qu.: 2.800 3rd Qu.:19.938 3rd Qu.: 6510 3rd Qu.:0.1200
## Max. :8463 Max. :113.630 Max. :41.670 Max. :66286 Max. :0.6600
## NA's :7 NA's :7 NA's :7 NA's :7 NA's :19
## Ce Cr Cu Fe Gd
## Min. : 1.190 Min. : 1.170 Min. : 2.130 Min. : 833 Min. :0.2100
## 1st Qu.: 3.290 1st Qu.: 5.610 1st Qu.: 4.213 1st Qu.: 2013 1st Qu.:0.3700
## Median : 4.270 Median : 7.085 Median : 6.010 Median : 2589 Median :0.4650
## Mean : 4.576 Mean : 7.249 Mean :10.794 Mean : 3132 Mean :0.5166
## 3rd Qu.: 5.420 3rd Qu.: 8.450 3rd Qu.:12.200 3rd Qu.: 3362 3rd Qu.:0.6300
## Max. :14.080 Max. :26.290 Max. :87.380 Max. :36052 Max. :2.4100
## NA's :8 NA's :7 NA's :7 NA's :7 NA's :9
## K La Mg Mn Mo
## Min. : 41.1 Min. :0.750 Min. : 57.34 Min. : 2.68 Min. : 0.050
## 1st Qu.:101.7 1st Qu.:1.560 1st Qu.: 237.21 1st Qu.: 18.89 1st Qu.: 0.170
## Median :151.7 Median :1.895 Median : 351.26 Median : 30.82 Median : 0.230
## Mean :173.5 Mean :2.110 Mean : 467.74 Mean : 35.57 Mean : 0.418
## 3rd Qu.:221.9 3rd Qu.:2.413 3rd Qu.: 563.90 3rd Qu.: 47.97 3rd Qu.: 0.310
## Max. :634.1 Max. :7.530 Max. :3253.43 Max. :127.42 Max. :14.470
## NA's :7 NA's :7 NA's :7 NA's :7 NA's :7
## Na Nd Ni P Pb
## Min. : 27.80 Min. :0.050 Min. : 0.250 Min. : 18.5 Min. : 3.23
## 1st Qu.: 70.75 1st Qu.:1.230 1st Qu.: 1.252 1st Qu.:113.0 1st Qu.: 10.16
## Median : 107.95 Median :1.620 Median : 2.040 Median :186.8 Median : 15.36
## Mean : 142.02 Mean :1.692 Mean : 2.536 Mean :218.2 Mean : 28.42
## 3rd Qu.: 163.43 3rd Qu.:2.110 3rd Qu.: 2.768 3rd Qu.:298.8 3rd Qu.: 27.12
## Max. :1613.70 Max. :8.540 Max. :21.340 Max. :846.9 Max. :173.78
## NA's :7 NA's :10 NA's :7 NA's :7 NA's :7
## S Sr Th V Y
## Min. : 42.96 Min. : 1.87 Min. :0.480 Min. : 1.280 Min. : 0.440
## 1st Qu.: 183.44 1st Qu.: 6.24 1st Qu.:0.650 1st Qu.: 4.330 1st Qu.: 1.130
## Median : 281.77 Median : 10.37 Median :0.940 Median : 5.360 Median : 1.490
## Mean : 897.30 Mean : 31.24 Mean :1.171 Mean : 5.727 Mean : 1.808
## 3rd Qu.: 425.77 3rd Qu.: 27.82 3rd Qu.:1.400 3rd Qu.: 6.400 3rd Qu.: 2.053
## Max. :41484.84 Max. :473.42 Max. :3.840 Max. :33.620 Max. :12.320
## NA's :7 NA's :7 NA's :24 NA's :7 NA's :7
## Zn
## Min. : 5.59
## 1st Qu.: 23.96
## Median : 38.60
## Mean : 58.71
## 3rd Qu.: 63.63
## Max. :452.16
## NA's :7
The summary()
shows us that the first numeric variable
we want to check the distribution for is pH
. There are
other numeric columns before this (Group
,
Easting
,
Northing
,Longitude
,Latitude
) but
we don't usually tend to check the distribution for this type of numeric
or integer variable, as they are really sample identifiers.
If you use this code for your own data, change the value of
v0
to your first numeric variable of interest
(i.e. the first you want to check the distribution of).
We then create a temporary object names.of.cols
storing
the names of variables to be transformed, and a vector colz
of the column indices which are numeric:
names.of.cols <- names(data0)
colzTF <- unlist(lapply(data0, is.numeric), use.names = FALSE)
colz <- which(colzTF==TRUE)
We make an empty output data frame transf_results
to
hold the results of our distribution checks:
transf_results <- data.frame("Variable"=rep(NA,length(colz)),
"W_orig"=rep(NA,length(colz)),
"p_orig"=rep(NA,length(colz)), "W_log_tr"=rep(NA,length(colz)),
"p_log_tr"=rep(NA,length(colz)), "W_pow_tr"=rep(NA,length(colz)),
"p_pow_tr"=rep(NA,length(colz)), "Pow_term"=rep(NA,length(colz)))
We then start a programming loop using for(...)
that
assesses variable distributions and writes the statistical output to the
data frame transf_results
we created earlier.
Note that we create R objects to hold the results of statistical
tests – this stores them temporarily so that we can write them to the
results data frame.
The object pt1
that hold the
results of the powerTansform()
function contains an item
pt1$lambda
which is the relevant power term for each
variable. Each time we cycle through the for()
loop, we
write a row of the output data frame transf_results
. We use
signif()
to round to a specified number (4
) of
significant digits.
for (i in colz) {
pt1 <- powerTransform(data0[, i])
sw0 <- shapiro.test(data0[, i])
sw1 <- shapiro.test(log10(data0[, i]))
sw2 <- shapiro.test((data0[, i]) ^ as.numeric(pt1$lambda))
transf_results[i-(colz[1]-1),] <- c(names.of.cols[i], signif(sw0$statistic, 4),
signif(sw0$p.value, 4), signif(sw1$statistic, 4),
signif(sw1$p.value, 4), signif(sw2$statistic, 4),
signif(sw2$p.value, 4), signif(as.vector(pt1$lambda), 4))
}
# output to console (screen)
{cat("Table 1: Shapiro-Wilk statistics and p-values for untransformed (_orig) and transformed
(_log, _pow) variables from soil and sediemnt analysis at Smith's Lake Reserve.\n\n")
print(transf_results, row.names = FALSE)}
## Table 1: Shapiro-Wilk statistics and p-values for untransformed (_orig) and transformed
## (_log, _pow) variables from soil and sediemnt analysis at Smith's Lake Reserve.
##
## Variable W_orig p_orig W_log_tr p_log_tr W_pow_tr p_pow_tr Pow_term
## pH 0.9766 0.08979 0.9474 0.0008693 0.9867 0.4612 2.002
## EC 0.76 1.274e-08 0.9823 0.5219 0.9824 0.5283 -0.01806
## Al 0.8511 6.021e-08 0.8958 3.289e-06 0.9365 0.0003185 0.4553
## As 0.143 2.892e-20 0.7882 6.415e-10 0.9269 9.749e-05 -0.4294
## Ba 0.8951 3.067e-06 0.9885 0.6358 0.9885 0.6373 -0.007109
## Ca 0.5646 9.559e-15 0.9585 0.006584 0.9763 0.1073 -0.1945
## Cd 0.63 1.528e-12 0.9727 0.1006 0.9788 0.2345 -0.1594
## Ce 0.8913 2.368e-06 0.985 0.4128 0.987 0.5425 0.1248
## Cr 0.8176 4.732e-09 0.9192 3.96e-05 0.9406 0.0005403 0.348
## Cu 0.5644 9.465e-15 0.9318 0.0001761 0.9867 0.5128 -0.5011
## Fe 0.2707 7.295e-19 0.8549 8.176e-08 0.948 0.001465 -0.4639
## Gd 0.6443 4.034e-13 0.9495 0.00207 0.9802 0.2102 -0.4718
## K 0.8554 8.542e-08 0.9911 0.8206 0.9915 0.8426 -0.0517
## La 0.7681 1.814e-10 0.969 0.03308 0.9889 0.6618 -0.4369
## Mg 0.6524 4.011e-13 0.9853 0.4248 0.986 0.4653 -0.0477
## Mn 0.9272 0.0001014 0.9602 0.008538 0.9936 0.9489 0.3702
## Mo 0.1319 2.222e-20 0.8012 1.514e-09 0.9471 0.001283 -0.3986
## Na 0.3949 2.606e-17 0.9437 0.0008182 0.9881 0.6041 -0.3711
## Nd 0.7408 6.027e-11 0.8226 1.038e-08 0.9182 4.772e-05 0.4521
## Ni 0.5497 5.354e-15 0.9732 0.06437 0.9764 0.1083 -0.07573
## P 0.9036 7.254e-06 0.9796 0.1789 0.9932 0.9326 0.2506
## Pb 0.6184 8.783e-14 0.9517 0.002464 0.9854 0.4297 -0.2961
## S 0.1364 2.473e-20 0.8212 6.144e-09 0.962 0.0112 -0.4024
## Sr 0.4609 2.194e-16 0.9523 0.002664 0.9822 0.2688 -0.2506
## Th 0.8023 2.339e-08 0.9431 0.002948 0.9695 0.08177 -0.6177
## V 0.5028 9.405e-16 0.9073 1.074e-05 0.9168 3.034e-05 -0.1409
## Y 0.5532 6.128e-15 0.9579 0.006028 0.9829 0.2984 -0.3309
## Zn 0.6155 7.75e-14 0.9828 0.295 0.9915 0.8421 -0.1493
We can export our results to a csv file for Excel (if desired):
Alternatively, we can make a publication-quality table using the
flextable
R package (Table 1).
flextable(transf_results) |> bold(part="header") |> width(width=2, unit="cm") |>
set_caption(caption="Table 1: Table summarising normality tests for distributions of variables when untransformed, log10-transformed, or power-transformed. [W_orig = Shapiro-Wilk statistic (W) for the untransfomed variable; p_orig = Shapiro-Wilk p-value for the untransfomed variable; W_log_tr = W for log10-transformed; p_log_tr = p-value for log10-transformed; W_pow_tr = W for power-transformed; p_pow_tr = p-value for power-transformed; Pow_term = exponent for power transformation.] Data from Smiths-Veryard 2017.", align_with_table=F, fp_p=fp_par(text.align = "left", padding.bottom = 6))
Variable | W_orig | p_orig | W_log_tr | p_log_tr | W_pow_tr | p_pow_tr | Pow_term |
---|---|---|---|---|---|---|---|
pH | 0.9766 | 0.08979 | 0.9474 | 0.0008693 | 0.9867 | 0.4612 | 2.002 |
EC | 0.76 | 1.274e-08 | 0.9823 | 0.5219 | 0.9824 | 0.5283 | -0.01806 |
Al | 0.8511 | 6.021e-08 | 0.8958 | 3.289e-06 | 0.9365 | 0.0003185 | 0.4553 |
As | 0.143 | 2.892e-20 | 0.7882 | 6.415e-10 | 0.9269 | 9.749e-05 | -0.4294 |
Ba | 0.8951 | 3.067e-06 | 0.9885 | 0.6358 | 0.9885 | 0.6373 | -0.007109 |
Ca | 0.5646 | 9.559e-15 | 0.9585 | 0.006584 | 0.9763 | 0.1073 | -0.1945 |
Cd | 0.63 | 1.528e-12 | 0.9727 | 0.1006 | 0.9788 | 0.2345 | -0.1594 |
Ce | 0.8913 | 2.368e-06 | 0.985 | 0.4128 | 0.987 | 0.5425 | 0.1248 |
Cr | 0.8176 | 4.732e-09 | 0.9192 | 3.96e-05 | 0.9406 | 0.0005403 | 0.348 |
Cu | 0.5644 | 9.465e-15 | 0.9318 | 0.0001761 | 0.9867 | 0.5128 | -0.5011 |
Fe | 0.2707 | 7.295e-19 | 0.8549 | 8.176e-08 | 0.948 | 0.001465 | -0.4639 |
Gd | 0.6443 | 4.034e-13 | 0.9495 | 0.00207 | 0.9802 | 0.2102 | -0.4718 |
K | 0.8554 | 8.542e-08 | 0.9911 | 0.8206 | 0.9915 | 0.8426 | -0.0517 |
La | 0.7681 | 1.814e-10 | 0.969 | 0.03308 | 0.9889 | 0.6618 | -0.4369 |
Mg | 0.6524 | 4.011e-13 | 0.9853 | 0.4248 | 0.986 | 0.4653 | -0.0477 |
Mn | 0.9272 | 0.0001014 | 0.9602 | 0.008538 | 0.9936 | 0.9489 | 0.3702 |
Mo | 0.1319 | 2.222e-20 | 0.8012 | 1.514e-09 | 0.9471 | 0.001283 | -0.3986 |
Na | 0.3949 | 2.606e-17 | 0.9437 | 0.0008182 | 0.9881 | 0.6041 | -0.3711 |
Nd | 0.7408 | 6.027e-11 | 0.8226 | 1.038e-08 | 0.9182 | 4.772e-05 | 0.4521 |
Ni | 0.5497 | 5.354e-15 | 0.9732 | 0.06437 | 0.9764 | 0.1083 | -0.07573 |
P | 0.9036 | 7.254e-06 | 0.9796 | 0.1789 | 0.9932 | 0.9326 | 0.2506 |
Pb | 0.6184 | 8.783e-14 | 0.9517 | 0.002464 | 0.9854 | 0.4297 | -0.2961 |
S | 0.1364 | 2.473e-20 | 0.8212 | 6.144e-09 | 0.962 | 0.0112 | -0.4024 |
Sr | 0.4609 | 2.194e-16 | 0.9523 | 0.002664 | 0.9822 | 0.2688 | -0.2506 |
Th | 0.8023 | 2.339e-08 | 0.9431 | 0.002948 | 0.9695 | 0.08177 | -0.6177 |
V | 0.5028 | 9.405e-16 | 0.9073 | 1.074e-05 | 0.9168 | 3.034e-05 | -0.1409 |
Y | 0.5532 | 6.128e-15 | 0.9579 | 0.006028 | 0.9829 | 0.2984 | -0.3309 |
Zn | 0.6155 | 7.75e-14 | 0.9828 | 0.295 | 0.9915 | 0.8421 | -0.1493 |
CC-BY-SA • All content by Ratey-AtUWA. My employer does not necessarily know about or endorse the content of this website.
Created with rmarkdown in RStudio. Currently using the free yeti theme from Bootswatch.