Load the R packages we need and set a default table style (‘theme’):
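The setup chunk itself did not survive extraction; the sketch below shows what it likely contains, based on the package descriptions that follow. The particular theme function and defaults passed to `set_flextable_defaults()` are assumptions, not necessarily the author's choices.

```r
# Sketch of the setup chunk; the theme and defaults here are assumptions
library(RcmdrMisc)   # for the numSummary() function
library(flextable)   # for publication-quality tables
library(officer)     # for fp_par() and other formatting helpers

# set a default table style ('theme') for all flextables in the document
set_flextable_defaults(theme_fun = theme_zebra,
                       font.size = 10, padding = 3)
```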
In environmental reporting it's common to produce tables containing statistical summaries of the variables which have been measured at a particular location or in a specific environment. The statistical parameters presented often include basic statistics such as mean, median, standard deviation, minimum, and maximum. In many environmental contexts (e.g. assessing environments for pollution or contamination) it's also very useful to know if any samples have concentrations of contaminants that exceed environmental thresholds, and if so, how many.
The first code chunk (above) shows that we will use the `RcmdrMisc` package, as it contains the useful `numSummary()` function, and the `flextable` package for producing nicely-formatted tables. We also load the `officer` package to make use of some formatting options in `flextable`. The rest of the code just makes use of functions in base R.
In this example we will be looking at data on analysis of sediments and soils, for which environmental guideline values exist in Australia. For sediments we use the guidelines provided by Water Quality Australia (2024), which were derived from the interim sediment quality guidelines (ISQG) (hence the name of the file and data frame!). Soil guideline values are from NEPC (2013).
In the code chunk below we make use of the ability to name rows in R data frames. This will be useful later on as we can include the row names in indices to interrogate the data frame for the correct environmental threshold.
ISQG <- read.csv("https://github.com/Ratey-AtUWA/bookChapters/raw/main/ISQG.csv")
row.names(ISQG) <- ISQG$Tox
HIL <- read.csv("https://github.com/Ratey-AtUWA/bookChapters/raw/main/HIL_NEPM.csv")
row.names(HIL) <- HIL$Tox
print(ISQG)
cat("\nNOTES:\nDGV = Default Guideline Value; GV_high = Upper Guideline Value; \n")
cat("Tox = Toxicant\n")
## DGV GV_high Tox
## Ag 1.00 4 Ag
## As 20.00 70 As
## Cd 1.50 10 Cd
## Cr 80.00 370 Cr
## Cu 65.00 270 Cu
## Hg 0.15 1 Hg
## Pb 50.00 220 Pb
## Ni 21.00 52 Ni
## Sb 2.00 25 Sb
## Zn 200.00 410 Zn
##
## NOTES:
## DGV = Default Guideline Value; GV_high = Upper Guideline Value;
## Tox = Toxicant
print(HIL)
cat("\nNOTES:\nTox = Toxicant; Resid = Residential;\n")
cat("Recreat = Recreational / Public Open Space; Industr = Industrial\n")
## Element Tox Resid...A Resid...B Recreat...C Industr...D
## As Arsenic As 100 500 300 3000
## Be Beryllium Be 60 90 90 500
## B Boron B 4500 40000 20000 300000
## Cd Cadmium Cd 20 150 90 900
## Cr Chromium(VI) Cr 100 500 300 3600
## Co Cobalt Co 100 600 300 4000
## Cu Copper Cu 6000 30000 17000 240000
## Pb Lead Pb 300 1200 600 1500
## Mn Manganese Mn 3800 14000 19000 60000
## Hg Mercury_(inorganic) Hg 40 120 80 730
## CH3Hg Methyl_mercury CH3Hg 10 30 13 180
## Ni Nickel Ni 400 1200 1200 6000
## Se Selenium Se 200 1400 700 10000
## Zn Zinc Zn 7400 60000 30000 400000
## CN Cyanide_(free) CN 250 300 240 1500
##
## NOTES:
## Tox = Toxicant; Resid = Residential;
## Recreat = Recreational / Public Open Space; Industr = Industrial
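The row names set above are what make threshold lookup convenient later on. A toy sketch of the idea, with invented values rather than the real guideline tables:

```r
# Toy sketch of row-name lookup (values invented for illustration only)
gv <- data.frame(DGV = c(20, 50), GV_high = c(70, 220),
                 row.names = c("As", "Pb"))
# look up a threshold by element (row) name and column name
gv["Pb", "DGV"]   # 50
```

Because the row names are the element symbols, the same character string can index both the measurement data and the guideline table.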
We can see from the output above that, for reporting, it would be worthwhile using a package (such as `flextable`, but there are several others) that can produce more attractive and readable tables.
We will use a subset of a dataset generated by some of our environmental science classes at The University of Western Australia (see Rate and McGrath, 2022 for an earlier version).
git <- "https://github.com/Ratey-AtUWA/bookChapters/raw/main/"
sedsoil <- read.csv(paste0(git,"afs1923abridged.csv"), stringsAsFactors = TRUE)
# convert Year to a factor, and SampID to character
sedsoil$Year <- factor(sedsoil$Year)
sedsoil$SampID <- as.character(sedsoil$SampID)
summary(sedsoil)
## Year SampID Type DepthType Longitude Latitude
## 2019:91 Length:336 Drain_Sed: 61 Core : 47 Min. :115.9 Min. :-31.92
## 2020:75 Class :character Lake_Sed :121 Surface:289 1st Qu.:115.9 1st Qu.:-31.92
## 2021:65 Mode :character Other : 15 Median :115.9 Median :-31.92
## 2022:31 Saltmarsh:139 Mean :115.9 Mean :-31.92
## 2023:74 3rd Qu.:115.9 3rd Qu.:-31.92
## Max. :115.9 Max. :-31.92
##
## As Cr Cu Mn Ni
## Min. : 0.400 Min. : 1.00 Min. : 2.0 Min. : 6.0 Min. : 1.00
## 1st Qu.: 4.000 1st Qu.:34.00 1st Qu.: 32.0 1st Qu.: 55.7 1st Qu.:13.00
## Median : 6.050 Median :54.70 Median : 61.0 Median : 79.0 Median :19.00
## Mean : 8.159 Mean :49.33 Mean : 111.7 Mean : 143.6 Mean :19.05
## 3rd Qu.: 9.000 3rd Qu.:65.45 3rd Qu.: 165.5 3rd Qu.: 120.5 3rd Qu.:24.80
## Max. :62.000 Max. :84.00 Max. :1007.8 Max. :3503.0 Max. :50.00
## NA's :13 NA's :9 NA's :9
## Pb Zn
## Min. : 1.00 Min. : 1.0
## 1st Qu.: 25.73 1st Qu.: 108.2
## Median : 37.30 Median : 222.5
## Mean : 54.91 Mean : 471.5
## 3rd Qu.: 52.00 3rd Qu.: 359.5
## Max. :831.10 Max. :7556.4
## NA's :4
# first define the variables we want to summarise, and make the basic summary
elem0 <- c("As","Cr","Cu","Mn","Ni","Pb","Zn")
(summ0 <- numSummary(sedsoil[,elem0],
statistics = c("mean","sd","quantiles"),
quantiles = c(0,0.5,1)))
## mean sd 0% 50% 100% n NA
## As 8.158529 7.712881 0.4 6.05 62.0 323 13
## Cr 49.332589 20.581402 1.0 54.70 84.0 336 0
## Cu 111.716208 121.118755 2.0 61.00 1007.8 327 9
## Mn 143.608036 309.044956 6.0 79.00 3503.0 336 0
## Ni 19.049388 7.998509 1.0 19.00 50.0 327 9
## Pb 54.906101 84.550788 1.0 37.30 831.1 336 0
## Zn 471.488554 1016.997243 1.0 222.50 7556.4 332 4
At this stage it's worth inspecting the contents of the list object created using `numSummary()`:
str(summ0)
## List of 5
## $ type : num 3
## $ table : num [1:7, 1:5] 8.16 49.33 111.72 143.61 19.05 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:7] "As" "Cr" "Cu" "Mn" ...
## .. ..$ : chr [1:5] "mean" "sd" "0%" "50%" ...
## $ statistics: chr [1:3] "mean" "sd" "quantiles"
## $ n : Named num [1:7] 323 336 327 336 327 336 332
## ..- attr(*, "names")= chr [1:7] "As" "Cr" "Cu" "Mn" ...
## $ NAs : Named num [1:7] 13 0 9 0 9 0 4
## ..- attr(*, "names")= chr [1:7] "As" "Cr" "Cu" "Mn" ...
## - attr(*, "class")= chr "numSummary"
We can see that the main summary table (`$table`), the count of observations (`$n`), and the count of missing values (`$NAs`) are stored as separate items in the list. We will make use of this in the code below; we make a new data frame to hold the results, since this is what we need to input into `flextable`.
To perform the actual counting of sample numbers, we use base R's `which()` function and the customised column and row names we generated earlier, e.g., `which(sedsoil[,elem0[i]] > ISQG[elem0[i],"DGV"])`.

- `sedsoil[,elem0[i]]` selects the column from `sedsoil` with the same name as the ith element of the list of variables `elem0`, i.e. `elem0[i]`.
- `ISQG[elem0[i],"DGV"]` finds the desired value from the ISQG table using the row index `elem0[i]` and the column index `"DGV"`. So, the variable names in the data must match the toxicant names in the thresholds table!
- `which(sedsoil[,elem0[i]] > ISQG[elem0[i],"DGV"])` will give the row indices in `sedsoil` for which the condition is `TRUE`, and the code then simply counts these using `length(which(sedsoil[,elem0[i]] > ISQG[elem0[i],"DGV"]))`.
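The counting idiom can be checked on a tiny invented dataset (the names and values below are made up for this sketch; the real code loops over `elem0` and the ISQG and HIL tables):

```r
# Minimal sketch of the threshold-counting idiom with invented data
thresh <- data.frame(DGV = 65, row.names = "Cu")   # toy guideline table
dat <- data.frame(Cu = c(12, 70, 300, NA))         # toy measurements

# which() returns indices where the condition is TRUE (NA is dropped),
# so length() of that gives the number of exceedances
length(which(dat[, "Cu"] > thresh["Cu", "DGV"]))   # 2
```

Note that `which()` silently ignores `NA` comparisons, which is why missing values do not inflate the exceedance counts.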
# in case our variables have no NA values, use different ways
# of creating the data frame that holds our results
if(length(summ0$NAs)==0){
OutputDF <- data.frame(summ0$table,n=summ0$n,NAs=rep(0,length(summ0$n)))
} else {
OutputDF <- data.frame(summ0$table,n=summ0$n,NAs=summ0$NAs)
}
# add blank columns in data frame for environmental thresholds
OutputDF$DGV <- rep(NA,length(elem0))
OutputDF$GV_high <- rep(NA,length(elem0))
OutputDF$HIL_C <- rep(NA,length(elem0))
# count the values for each element exceeding the various thresholds
# and populate the data frame
for(i in 1:length(elem0)){
OutputDF$DGV[i] <-
length(which(sedsoil[,elem0[i]] > ISQG[elem0[i],"DGV"]))
OutputDF$GV_high[i] <-
length(which(sedsoil[,elem0[i]] > ISQG[elem0[i],"GV_high"]))
OutputDF$HIL_C[i] <-
length(which(sedsoil[,elem0[i]] > HIL[elem0[i],"Recreational_C"]))
}
# rename the columns to more understandable names
colnames(OutputDF)[c(3:5,8:10)] <- c("min","median","max","n > DGV","n > GV-high", "n > HIL(C)")
print(OutputDF)
## mean sd min median max n NAs n > DGV n > GV-high n > HIL(C)
## As 8.158529 7.712881 0.4 6.05 62.0 323 13 26 0 0
## Cr 49.332589 20.581402 1.0 54.70 84.0 336 0 6 0 0
## Cu 111.716208 121.118755 2.0 61.00 1007.8 327 9 156 26 0
## Mn 143.608036 309.044956 6.0 79.00 3503.0 336 0 0 0 0
## Ni 19.049388 7.998509 1.0 19.00 50.0 327 9 137 0 0
## Pb 54.906101 84.550788 1.0 37.30 831.1 336 0 91 9 4
## Zn 471.488554 1016.997243 1.0 222.50 7556.4 332 4 185 73 0
We could stop here, but this is not an attractively-formatted table. To make a publication-quality table, we first make a new data frame with columns and rows transposed (using `t()`), with the previous column names as the first column.
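The transpose-and-bind step can be illustrated on a toy two-column data frame (values invented for this sketch):

```r
# Toy illustration of the transpose step used below (invented values)
df <- data.frame(As = c(8.158, 0.4), Zn = c(471.49, 1.0),
                 row.names = c("mean", "min"))
# t() swaps rows and columns; cbind() puts the old column names first.
# cbind() with a character vector coerces everything to character,
# which is fine for display purposes.
as.data.frame(cbind(colnames(df), t(signif(df, 3))))
```

The first column of the result is unnamed (it appears as `V1`), which is why the real code below blanks it out with `set_header_labels(values=list(V1=""))`.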
To avoid lots of nesting of `flextable` functions, we find it easiest to pipe the successive lines of code containing formatting modifications. We use the native R pipe operator `|>` (introduced in R 4.1) here, but the older `magrittr` pipe operator `%>%` would work if you prefer. Where we need the `officer` package is to interpret the formatting option `fp_p=fp_par(text.align = "left", padding.bottom = 6)` in the `set_caption()` function.
# make a new data frame with columns and rows transposed, and the
# previous column names as the first column:
ft <- as.data.frame(cbind(colnames(OutputDF),t(signif(OutputDF,3))))
# Then, use flextable to output a table in publication-quality form
flextable(ft) |>
  width(j=c(1:7), width=c(3,rep(2.2,6)), unit="cm") |>
  set_header_labels(values=list(V1="")) |>
  align(j=2:8, align="right", part="all") |>
  padding(i=8, padding.top = 8) |>
  bold(bold=TRUE, part="header") |>
  set_caption(caption=paste0(
      "Table 1: Summary statistics for trace element concentrations (mg/kg) ",
      "in sediment or soil at Ashfield Flats. Abbreviations: n = number of valid ",
      "observations; NAs = number of missing observations; n > DGV is number of ",
      "samples exceeding the sediment Default Guideline Value; n > GV-high is ",
      "number of samples exceeding the sediment upper Guideline Value at which ",
      "toxicity effects might be expected (Water Quality Australia, 2024). HIL(C) ",
      "is the human-health based investigation level for Recreational (public ",
      "open space) land use (NEPC, 2013)."),
    align_with_table=F,
    fp_p=fp_par(text.align = "left", padding.bottom = 6))
|             | As   | Cr   | Cu   | Mn   | Ni  | Pb   | Zn   |
|-------------|------|------|------|------|-----|------|------|
| mean        | 8.16 | 49.3 | 112  | 144  | 19  | 54.9 | 471  |
| sd          | 7.71 | 20.6 | 121  | 309  | 8   | 84.6 | 1020 |
| min         | 0.4  | 1    | 2    | 6    | 1   | 1    | 1    |
| median      | 6.05 | 54.7 | 61   | 79   | 19  | 37.3 | 222  |
| max         | 62   | 84   | 1010 | 3500 | 50  | 831  | 7560 |
| n           | 323  | 336  | 327  | 336  | 327 | 336  | 332  |
| NAs         | 13   | 0    | 9    | 0    | 9   | 0    | 4    |
| n > DGV     | 26   | 6    | 156  | 0    | 137 | 91   | 185  |
| n > GV-high | 0    | 0    | 26   | 0    | 0   | 9    | 73   |
| n > HIL(C)  | 0    | 0    | 0    | 0    | 0   | 4    | 0    |
The final table could now go into a report, and most readers would be happy with its appearance. We could probably do something about the alignment of the numeric columns, but decimal point alignment is not available in `flextable` yet.
The next really useful type of information to obtain would be where the samples which exceed environmental thresholds are. That question leads us to another more basic question: “How can we map our data in R?”, and so further chapters will cover preparing maps in R, and from there we can move on to spatial analysis.
NEPC (National Environment Protection Council). (2013). Schedule B(1): Guideline on the Investigation Levels for Soil and Groundwater. In National Environment Protection (Assessment of Site Contamination) Measure (Amended). Commonwealth of Australia.
Rate, A. W., & McGrath, G. S. (2022). Data for assessment of sediment, soil, and water quality at Ashfield Flats Reserve, Western Australia. Data in Brief, 41, 107970. https://doi.org/10.1016/j.dib.2022.107970
Water Quality Australia. (2024). Toxicant default guideline values for sediment quality. Department of Climate Change, Energy, the Environment and Water, Government of Australia. Retrieved 2024-04-11 from https://www.waterquality.gov.au/anz-guidelines/guideline-values/default/sediment-quality-toxicants
CC-BY-SA • All content by Ratey-AtUWA. My employer does not necessarily know about or endorse the content of this website.
Created with rmarkdown in RStudio using the cyborg theme from Bootswatch via the bslib package, and fontawesome v5 icons.