We need to load the R package car
, as
it's needed for the powerTransform()
function. For this
session we load and use the sv2017
data from Smith's Lake
and Charles Veryard Reserves in 2017.
library(car)
git <- "https://github.com/Ratey-AtUWA/Learn-R-web/raw/main/"
sv2017 <- read.csv(paste0(git, "sv2017_original.csv"))
It's often a good idea to make a copy of the data, so we don't lose
any information from the original if something goes horribly wrong. We
then check our data using the summary()
function:
## ICP.Seq.No Group Sample Type
## Length:95 Min. : 1.000 Length:95 Length:95
## Class :character 1st Qu.: 3.000 Class :character Class :character
## Mode :character Median : 6.000 Mode :character Mode :character
## Mean : 6.344
## 3rd Qu.: 9.000
## Max. :12.000
## NA's :2
## Easting Northing Longitude Latitude
## Min. :391064 Min. :6466246 Min. :115.8 Min. :-31.93
## 1st Qu.:391242 1st Qu.:6466430 1st Qu.:115.8 1st Qu.:-31.93
## Median :391338 Median :6466589 Median :115.9 Median :-31.93
## Mean :391301 Mean :6466556 Mean :115.9 Mean :-31.93
## 3rd Qu.:391384 3rd Qu.:6466676 3rd Qu.:115.9 3rd Qu.:-31.93
## Max. :391431 Max. :6466771 Max. :115.9 Max. :-31.93
##
## pH EC Al As
## Min. :4.630 Min. : 29.1 Min. : 396 Min. : 0.480
## 1st Qu.:6.388 1st Qu.:106.6 1st Qu.:2106 1st Qu.: 1.738
## Median :6.820 Median :152.6 Median :2503 Median : 2.100
## Mean :6.813 Mean :188.5 Mean :2558 Mean : 3.758
## 3rd Qu.:7.280 3rd Qu.:215.1 3rd Qu.:2933 3rd Qu.: 2.800
## Max. :8.630 Max. :835.0 Max. :8463 Max. :113.630
## NA's :1 NA's :34 NA's :7 NA's :7
## Ba Ca Cd Ce
## Min. : 3.990 Min. : 283 Min. :0.0200 Min. : 1.190
## 1st Qu.: 9.755 1st Qu.: 1189 1st Qu.:0.0500 1st Qu.: 3.290
## Median :13.550 Median : 1837 Median :0.0700 Median : 4.270
## Mean :15.102 Mean : 5768 Mean :0.1008 Mean : 4.576
## 3rd Qu.:19.938 3rd Qu.: 6510 3rd Qu.:0.1200 3rd Qu.: 5.420
## Max. :41.670 Max. :66286 Max. :0.6600 Max. :14.080
## NA's :7 NA's :7 NA's :19 NA's :8
## Cr Cu Fe Gd
## Min. : 1.170 Min. : 2.130 Min. : 833 Min. :0.2100
## 1st Qu.: 5.610 1st Qu.: 4.213 1st Qu.: 2013 1st Qu.:0.3700
## Median : 7.085 Median : 6.010 Median : 2589 Median :0.4650
## Mean : 7.249 Mean :10.794 Mean : 3132 Mean :0.5166
## 3rd Qu.: 8.450 3rd Qu.:12.200 3rd Qu.: 3362 3rd Qu.:0.6300
## Max. :26.290 Max. :87.380 Max. :36052 Max. :2.4100
## NA's :7 NA's :7 NA's :7 NA's :9
## K La Mg Mn
## Min. : 41.1 Min. :0.750 Min. : 57.34 Min. : 2.68
## 1st Qu.:101.7 1st Qu.:1.560 1st Qu.: 237.21 1st Qu.: 18.89
## Median :151.7 Median :1.895 Median : 351.26 Median : 30.82
## Mean :173.5 Mean :2.110 Mean : 467.74 Mean : 35.57
## 3rd Qu.:221.9 3rd Qu.:2.413 3rd Qu.: 563.90 3rd Qu.: 47.97
## Max. :634.1 Max. :7.530 Max. :3253.43 Max. :127.42
## NA's :7 NA's :7 NA's :7 NA's :7
## Mo Na Nd Ni
## Min. : 0.050 Min. : 27.80 Min. :0.050 Min. : 0.250
## 1st Qu.: 0.170 1st Qu.: 70.75 1st Qu.:1.230 1st Qu.: 1.252
## Median : 0.230 Median : 107.95 Median :1.620 Median : 2.040
## Mean : 0.418 Mean : 142.02 Mean :1.692 Mean : 2.536
## 3rd Qu.: 0.310 3rd Qu.: 163.43 3rd Qu.:2.110 3rd Qu.: 2.768
## Max. :14.470 Max. :1613.70 Max. :8.540 Max. :21.340
## NA's :7 NA's :7 NA's :10 NA's :7
## P Pb S Sr
## Min. : 18.5 Min. : 3.23 Min. : 42.96 Min. : 1.87
## 1st Qu.:113.0 1st Qu.: 10.16 1st Qu.: 183.44 1st Qu.: 6.24
## Median :186.8 Median : 15.36 Median : 281.78 Median : 10.37
## Mean :218.2 Mean : 28.42 Mean : 897.30 Mean : 31.24
## 3rd Qu.:298.8 3rd Qu.: 27.12 3rd Qu.: 425.77 3rd Qu.: 27.82
## Max. :846.9 Max. :173.78 Max. :41484.84 Max. :473.42
## NA's :7 NA's :7 NA's :7 NA's :7
## Th V Y Zn
## Min. :0.480 Min. : 1.280 Min. : 0.440 Min. : 5.59
## 1st Qu.:0.650 1st Qu.: 4.330 1st Qu.: 1.130 1st Qu.: 23.96
## Median :0.940 Median : 5.360 Median : 1.490 Median : 38.60
## Mean :1.171 Mean : 5.727 Mean : 1.808 Mean : 58.71
## 3rd Qu.:1.400 3rd Qu.: 6.400 3rd Qu.: 2.053 3rd Qu.: 63.63
## Max. :3.840 Max. :33.620 Max. :12.320 Max. :452.16
## NA's :24 NA's :7 NA's :7 NA's :7
The summary()
shows us that the first numeric variable
we want to check the distribution for is pH
. There are
other numeric columns before this (Group
,
Easting
,
Northing
,Longitude
,Latitude
) but
we don't usually tend to check the distribution for this type of numeric
or integer variable, as they are really sample identifiers.
If you use this code for your own data, change the value of
v0
to your first numeric variable of interest
(i.e. the first you want to check the distribution of).
We then create a temporary object names.of.cols
storing
the names of variables to be transformed, and a vector colz
of the column indices which are numeric:
names.of.cols <- names(data0)
colzTF <- unlist(lapply(data0, is.numeric), use.names = FALSE)
colz <- which(colzTF==TRUE)
We make an empty output data frame transf_results
to
hold the results of our distribution checks:
transf_results <- data.frame("Variable"=rep(NA,length(colz)),
"W_orig"=rep(NA,length(colz)),
"p_orig"=rep(NA,length(colz)), "W_log_tr"=rep(NA,length(colz)),
"p_log_tr"=rep(NA,length(colz)), "W_pow_tr"=rep(NA,length(colz)),
"p_pow_tr"=rep(NA,length(colz)), "Pow_term"=rep(NA,length(colz)))
We then start a programming loop using for(...)
that
assesses variable distributions and writes the statistical output to the
data frame transf_results
we created earlier.
Note that we create R objects to hold the results of statistical
tests – this stores them temporarily so that we can write them to the
results data frame.
The object pt1
that hold the
results of the powerTansform()
function contains an item
pt1$lambda
which is the relevant power term for each
variable. Each time we cycle through the for()
loop, we
write a row of the output data frame transf_results
. We use
signif()
to round to a specified number (4
) of
significant digits.
for (i in colz) {
pt1 <- powerTransform(data0[, i])
sw0 <- shapiro.test(data0[, i])
sw1 <- shapiro.test(log10(data0[, i]))
sw2 <- shapiro.test((data0[, i]) ^ as.numeric(pt1$lambda))
transf_results[i-(colz[1]-1),] <- c(names.of.cols[i], signif(sw0$statistic, 4),
signif(sw0$p.value, 4), signif(sw1$statistic, 4),
signif(sw1$p.value, 4), signif(sw2$statistic, 4),
signif(sw2$p.value, 4), signif(as.vector(pt1$lambda), 4))
}
# output to console (screen)
{cat("Table 1: Shapiro-Wilk statistics and p-values for untransformed (_orig) and transformed
(_log, _pow) variables from soil and sediemnt analysis at Smith's Lake Reserve.\n\n")
print(transf_results, row.names = FALSE)}
## Table 1: Shapiro-Wilk statistics and p-values for untransformed (_orig) and transformed
## (_log, _pow) variables from soil and sediemnt analysis at Smith's Lake Reserve.
##
## Variable W_orig p_orig W_log_tr p_log_tr W_pow_tr p_pow_tr Pow_term
## pH 0.9766 0.08979 0.9474 0.0008693 0.9867 0.4612 2.002
## EC 0.76 1.274e-08 0.9823 0.5219 0.9824 0.5283 -0.01806
## Al 0.8511 6.021e-08 0.8958 3.289e-06 0.9365 0.0003185 0.4553
## As 0.143 2.892e-20 0.7882 6.415e-10 0.9269 9.749e-05 -0.4294
## Ba 0.8951 3.067e-06 0.9885 0.6358 0.9885 0.6373 -0.007109
## Ca 0.5646 9.559e-15 0.9585 0.006584 0.9763 0.1073 -0.1945
## Cd 0.63 1.528e-12 0.9727 0.1006 0.9788 0.2345 -0.1594
## Ce 0.8913 2.368e-06 0.985 0.4128 0.987 0.5425 0.1248
## Cr 0.8176 4.732e-09 0.9192 3.96e-05 0.9406 0.0005403 0.348
## Cu 0.5644 9.465e-15 0.9318 0.0001761 0.9867 0.5128 -0.5011
## Fe 0.2707 7.295e-19 0.8549 8.176e-08 0.948 0.001465 -0.4639
## Gd 0.6443 4.034e-13 0.9495 0.00207 0.9802 0.2102 -0.4718
## K 0.8554 8.542e-08 0.9911 0.8206 0.9915 0.8426 -0.0517
## La 0.7681 1.814e-10 0.969 0.03308 0.9889 0.6618 -0.4369
## Mg 0.6524 4.011e-13 0.9853 0.4248 0.986 0.4653 -0.0477
## Mn 0.9272 0.0001014 0.9602 0.008538 0.9936 0.9489 0.3702
## Mo 0.1319 2.222e-20 0.8012 1.514e-09 0.9471 0.001283 -0.3986
## Na 0.3949 2.606e-17 0.9437 0.0008182 0.9881 0.6041 -0.3711
## Nd 0.7408 6.027e-11 0.8226 1.038e-08 0.9182 4.772e-05 0.4521
## Ni 0.5497 5.354e-15 0.9732 0.06437 0.9764 0.1083 -0.07573
## P 0.9036 7.254e-06 0.9796 0.1789 0.9932 0.9326 0.2506
## Pb 0.6184 8.783e-14 0.9517 0.002464 0.9854 0.4297 -0.2961
## S 0.1364 2.473e-20 0.8212 6.144e-09 0.962 0.0112 -0.4024
## Sr 0.4609 2.194e-16 0.9523 0.002664 0.9822 0.2688 -0.2506
## Th 0.8023 2.339e-08 0.9431 0.002948 0.9695 0.08177 -0.6177
## V 0.5028 9.405e-16 0.9073 1.074e-05 0.9168 3.034e-05 -0.1409
## Y 0.5532 6.128e-15 0.9579 0.006028 0.9829 0.2984 -0.3309
## Zn 0.6155 7.75e-14 0.9828 0.295 0.9915 0.8421 -0.1493
We can export our results to a csv file for Excel (if desired):
Alternatively, we can make a publication-quality table using the
flextable
R package (Table 1).
flextable(transf_results) |> bold(part="header") |> width(width=2, unit="cm") |>
set_caption(caption="Table 1: Table summarising normality tests for distributions of variables when untransformed, log10-transformed, or power-transformed. [W_orig = Shapiro-Wilk statistic (W) for the untransfomed variable; p_orig = Shapiro-Wilk p-value for the untransfomed variable; W_log_tr = W for log10-transformed; p_log_tr = p-value for log10-transformed; W_pow_tr = W for power-transformed; p_pow_tr = p-value for power-transformed; Pow_term = exponent for power transformation.] Data from Smiths-Veryard 2017.", align_with_table=F, fp_p=fp_par(text.align = "left", padding.bottom = 6))
Variable | W_orig | p_orig | W_log_tr | p_log_tr | W_pow_tr | p_pow_tr | Pow_term |
---|---|---|---|---|---|---|---|
pH | 0.9766 | 0.08979 | 0.9474 | 0.0008693 | 0.9867 | 0.4612 | 2.002 |
EC | 0.76 | 1.274e-08 | 0.9823 | 0.5219 | 0.9824 | 0.5283 | -0.01806 |
Al | 0.8511 | 6.021e-08 | 0.8958 | 3.289e-06 | 0.9365 | 0.0003185 | 0.4553 |
As | 0.143 | 2.892e-20 | 0.7882 | 6.415e-10 | 0.9269 | 9.749e-05 | -0.4294 |
Ba | 0.8951 | 3.067e-06 | 0.9885 | 0.6358 | 0.9885 | 0.6373 | -0.007109 |
Ca | 0.5646 | 9.559e-15 | 0.9585 | 0.006584 | 0.9763 | 0.1073 | -0.1945 |
Cd | 0.63 | 1.528e-12 | 0.9727 | 0.1006 | 0.9788 | 0.2345 | -0.1594 |
Ce | 0.8913 | 2.368e-06 | 0.985 | 0.4128 | 0.987 | 0.5425 | 0.1248 |
Cr | 0.8176 | 4.732e-09 | 0.9192 | 3.96e-05 | 0.9406 | 0.0005403 | 0.348 |
Cu | 0.5644 | 9.465e-15 | 0.9318 | 0.0001761 | 0.9867 | 0.5128 | -0.5011 |
Fe | 0.2707 | 7.295e-19 | 0.8549 | 8.176e-08 | 0.948 | 0.001465 | -0.4639 |
Gd | 0.6443 | 4.034e-13 | 0.9495 | 0.00207 | 0.9802 | 0.2102 | -0.4718 |
K | 0.8554 | 8.542e-08 | 0.9911 | 0.8206 | 0.9915 | 0.8426 | -0.0517 |
La | 0.7681 | 1.814e-10 | 0.969 | 0.03308 | 0.9889 | 0.6618 | -0.4369 |
Mg | 0.6524 | 4.011e-13 | 0.9853 | 0.4248 | 0.986 | 0.4653 | -0.0477 |
Mn | 0.9272 | 0.0001014 | 0.9602 | 0.008538 | 0.9936 | 0.9489 | 0.3702 |
Mo | 0.1319 | 2.222e-20 | 0.8012 | 1.514e-09 | 0.9471 | 0.001283 | -0.3986 |
Na | 0.3949 | 2.606e-17 | 0.9437 | 0.0008182 | 0.9881 | 0.6041 | -0.3711 |
Nd | 0.7408 | 6.027e-11 | 0.8226 | 1.038e-08 | 0.9182 | 4.772e-05 | 0.4521 |
Ni | 0.5497 | 5.354e-15 | 0.9732 | 0.06437 | 0.9764 | 0.1083 | -0.07573 |
P | 0.9036 | 7.254e-06 | 0.9796 | 0.1789 | 0.9932 | 0.9326 | 0.2506 |
Pb | 0.6184 | 8.783e-14 | 0.9517 | 0.002464 | 0.9854 | 0.4297 | -0.2961 |
S | 0.1364 | 2.473e-20 | 0.8212 | 6.144e-09 | 0.962 | 0.0112 | -0.4024 |
Sr | 0.4609 | 2.194e-16 | 0.9523 | 0.002664 | 0.9822 | 0.2688 | -0.2506 |
Th | 0.8023 | 2.339e-08 | 0.9431 | 0.002948 | 0.9695 | 0.08177 | -0.6177 |
V | 0.5028 | 9.405e-16 | 0.9073 | 1.074e-05 | 0.9168 | 3.034e-05 | -0.1409 |
Y | 0.5532 | 6.128e-15 | 0.9579 | 0.006028 | 0.9829 | 0.2984 | -0.3309 |
Zn | 0.6155 | 7.75e-14 | 0.9828 | 0.295 | 0.9915 | 0.8421 | -0.1493 |
CC-BY-SA • All content by Ratey-AtUWA. My employer does not necessarily know about or endorse the content of this website.
Created with rmarkdown in RStudio. Currently using the free yeti theme from Bootswatch.