Summary statistics tables that include sample counts above environmental thresholds

Load the R packages we need and set a default table style (‘theme’):

library(RcmdrMisc)
library(flextable)
set_flextable_defaults(font.family = "Arial", font.size = 11,
                       theme_fun = "theme_booktabs", padding = 1)
library(officer)

Introduction

In environmental reporting it's common to produce tables containing statistical summaries of the variables which have been measured at a particular location or in a specific environment. The statistical parameters presented often include basic statistics such as mean, median, standard deviation, minimum, and maximum. In many environmental contexts (e.g. assessing environments for pollution or contamination) it's also very useful to know if any samples have concentrations of contaminants that exceed environmental thresholds, and if so, how many.

The first code chunk (above) shows that we will use the RcmdrMisc package for its useful numSummary() function, and the flextable package to produce nicely formatted tables. We also load the officer package to access some additional formatting options in flextable. The rest of the code uses functions from base R.

Importing environmental threshold data

In this example we will be looking at data on analysis of sediments and soils, for which environmental guideline values exist in Australia. For sediments we use the guidelines provided by Water Quality Australia (2024), which were derived from the interim sediment quality guidelines (ISQG) (hence the name of the file and data frame!). Soil guideline values are from NEPC (2013).

In the code chunk below we make use of the ability to name rows in R data frames. This will be useful later, as we can use the row names as indices to look up the correct environmental threshold in each data frame.

ISQG <- read.csv("https://github.com/Ratey-AtUWA/bookChapters/raw/main/ISQG.csv")
row.names(ISQG) <- ISQG$Tox
HIL <- read.csv("https://github.com/Ratey-AtUWA/bookChapters/raw/main/HIL_NEPM.csv")
row.names(HIL) <- HIL$Tox
print(ISQG)
cat("\nNOTES:\nDGV = Default Guideline Value; GV_high = Upper Guideline Value; \n")
cat("Tox = Toxicant\n")
##       DGV GV_high Tox
## Ag   1.00       4  Ag
## As  20.00      70  As
## Cd   1.50      10  Cd
## Cr  80.00     370  Cr
## Cu  65.00     270  Cu
## Hg   0.15       1  Hg
## Pb  50.00     220  Pb
## Ni  21.00      52  Ni
## Sb   2.00      25  Sb
## Zn 200.00     410  Zn
## 
## NOTES:
## DGV = Default Guideline Value; GV_high = Upper Guideline Value; 
## Tox = Toxicant
print(HIL)
##                   Element   Tox Resid...A Resid...B Recreat...C Industr...D
## As                Arsenic    As       100       500         300        3000
## Be              Beryllium    Be        60        90          90         500
## B                   Boron     B      4500     40000       20000      300000
## Cd                Cadmium    Cd        20       150          90         900
## Cr           Chromium(VI)    Cr       100       500         300        3600
## Co                 Cobalt    Co       100       600         300        4000
## Cu                 Copper    Cu      6000     30000       17000      240000
## Pb                   Lead    Pb       300      1200         600        1500
## Mn              Manganese    Mn      3800     14000       19000       60000
## Hg    Mercury_(inorganic)    Hg        40       120          80         730
## CH3Hg      Methyl_mercury CH3Hg        10        30          13         180
## Ni                 Nickel    Ni       400      1200        1200        6000
## Se               Selenium    Se       200      1400         700       10000
## Zn                   Zinc    Zn      7400     60000       30000      400000
## CN         Cyanide_(free)    CN       250       300         240        1500
## 
## NOTES:
## Tox = Toxicant; Resid = Residential; 
## Recreat = Recreational / Public Open Space; Industr = Industrial
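With the row names set, a guideline value can be looked up directly by element symbol and column name. A minimal sketch of this indexing, using a two-element stand-in for the full ISQG data frame (the Pb and Zn values are copied from the table above):

```r
# Stand-in for the ISQG data frame read from the CSV (Pb and Zn rows only)
isqg <- data.frame(DGV = c(50, 200), GV_high = c(220, 410),
                   Tox = c("Pb", "Zn"))
row.names(isqg) <- isqg$Tox

# row-name indexing retrieves the threshold for a given element
isqg["Zn", "DGV"]       # default guideline value for zinc: 200
isqg["Pb", "GV_high"]   # upper guideline value for lead: 220
```

This is exactly the pattern used in the counting loop later in the chapter, with the element symbol supplied programmatically.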

We can see from the output above that, for reporting, it would be worthwhile using a package (such as flextable, but there are several others) that can produce more attractive and readable tables.

Importing the environmental data we want to summarise

We will use a subset of a dataset generated by some of our environmental science classes at The University of Western Australia (see Rate and McGrath, 2022 for an earlier version).

git <- "https://github.com/Ratey-AtUWA/bookChapters/raw/main/"
sedsoil <- read.csv(paste0(git,"afs1923abridged.csv"), stringsAsFactors = TRUE)
# convert Year to a factor, and SampID to character
sedsoil$Year <- factor(sedsoil$Year)
sedsoil$SampID <- as.character(sedsoil$SampID)
summary(sedsoil)
##    Year       SampID                 Type       DepthType     Longitude        Latitude     
##  2019:91   Length:336         Drain_Sed: 61   Core   : 47   Min.   :115.9   Min.   :-31.92  
##  2020:75   Class :character   Lake_Sed :121   Surface:289   1st Qu.:115.9   1st Qu.:-31.92  
##  2021:65   Mode  :character   Other    : 15                 Median :115.9   Median :-31.92  
##  2022:31                      Saltmarsh:139                 Mean   :115.9   Mean   :-31.92  
##  2023:74                                                    3rd Qu.:115.9   3rd Qu.:-31.92  
##                                                             Max.   :115.9   Max.   :-31.92  
##                                                                                             
##        As               Cr              Cu               Mn               Ni       
##  Min.   : 0.400   Min.   : 1.00   Min.   :   2.0   Min.   :   6.0   Min.   : 1.00  
##  1st Qu.: 4.000   1st Qu.:34.00   1st Qu.:  32.0   1st Qu.:  55.7   1st Qu.:13.00  
##  Median : 6.050   Median :54.70   Median :  61.0   Median :  79.0   Median :19.00  
##  Mean   : 8.159   Mean   :49.33   Mean   : 111.7   Mean   : 143.6   Mean   :19.05  
##  3rd Qu.: 9.000   3rd Qu.:65.45   3rd Qu.: 165.5   3rd Qu.: 120.5   3rd Qu.:24.80  
##  Max.   :62.000   Max.   :84.00   Max.   :1007.8   Max.   :3503.0   Max.   :50.00  
##  NA's   :13                       NA's   :9                         NA's   :9      
##        Pb               Zn        
##  Min.   :  1.00   Min.   :   1.0  
##  1st Qu.: 25.73   1st Qu.: 108.2  
##  Median : 37.30   Median : 222.5  
##  Mean   : 54.91   Mean   : 471.5  
##  3rd Qu.: 52.00   3rd Qu.: 359.5  
##  Max.   :831.10   Max.   :7556.4  
##                   NA's   :4
# first define the variables we want to summarise, and make the basic summary
elem0 <- c("As","Cr","Cu","Mn","Ni","Pb","Zn")
(summ0 <- numSummary(sedsoil[,elem0],
                     statistics = c("mean","sd","quantiles"),
                     quantiles = c(0,0.5,1)))
##          mean          sd  0%    50%   100%   n NA
## As   8.158529    7.712881 0.4   6.05   62.0 323 13
## Cr  49.332589   20.581402 1.0  54.70   84.0 336  0
## Cu 111.716208  121.118755 2.0  61.00 1007.8 327  9
## Mn 143.608036  309.044956 6.0  79.00 3503.0 336  0
## Ni  19.049388    7.998509 1.0  19.00   50.0 327  9
## Pb  54.906101   84.550788 1.0  37.30  831.1 336  0
## Zn 471.488554 1016.997243 1.0 222.50 7556.4 332  4

At this stage it's worth inspecting the contents of the list object created using numSummary():

str(summ0)
## List of 5
##  $ type      : num 3
##  $ table     : num [1:7, 1:5] 8.16 49.33 111.72 143.61 19.05 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:7] "As" "Cr" "Cu" "Mn" ...
##   .. ..$ : chr [1:5] "mean" "sd" "0%" "50%" ...
##  $ statistics: chr [1:3] "mean" "sd" "quantiles"
##  $ n         : Named num [1:7] 323 336 327 336 327 336 332
##   ..- attr(*, "names")= chr [1:7] "As" "Cr" "Cu" "Mn" ...
##  $ NAs       : Named num [1:7] 13 0 9 0 9 0 4
##   ..- attr(*, "names")= chr [1:7] "As" "Cr" "Cu" "Mn" ...
##  - attr(*, "class")= chr "numSummary"

We can see that the main summary table ($table), the count of observations ($n), and the count of missing values ($NAs) are stored as separate items in the list. We make use of this in the code below, assembling the items into a new data frame to hold the results, since a data frame is what we need as input to flextable.

To perform the actual counting of sample numbers, we use base R's which() function and the customised column and row names we generated earlier, e.g., which(sedsoil[,elem0[i]] > ISQG[elem0[i],"DGV"]).
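It is worth seeing why which() is used here rather than summing the logical comparison directly: which() drops NA values, so missing observations are never counted as exceedances. A toy sketch (the concentrations and threshold below are made up for illustration):

```r
conc <- c(12, 55, 230, NA, 410)   # hypothetical concentrations (mg/kg)
dgv  <- 200                       # hypothetical guideline value

length(which(conc > dgv))     # which() ignores the NA, so the count is 2
sum(conc > dgv)               # sum() propagates the NA, giving NA
sum(conc > dgv, na.rm = TRUE) # equivalent to the which() approach
```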

# in case our variables have no missing values (when summ0$NAs would be
# empty), provide two ways of creating the data frame to hold our results

if(length(summ0$NAs)==0){
  OutputDF <- data.frame(summ0$table,n=summ0$n,NAs=rep(0,length(summ0$n)))
} else {
  OutputDF <- data.frame(summ0$table,n=summ0$n,NAs=summ0$NAs)
}

# add blank columns in data frame for environmental thresholds

OutputDF$DGV <- rep(NA,length(elem0))
OutputDF$GV_high <- rep(NA,length(elem0))
OutputDF$HIL_C <- rep(NA,length(elem0))

# count the values for each element exceeding the various thresholds
# and populate the data frame

for(i in 1:length(elem0)){
  OutputDF$DGV[i] <- 
    length(which(sedsoil[,elem0[i]] > ISQG[elem0[i],"DGV"]))
  OutputDF$GV_high[i] <- 
    length(which(sedsoil[,elem0[i]] > ISQG[elem0[i],"GV_high"]))
  OutputDF$HIL_C[i] <- 
    length(which(sedsoil[,elem0[i]] > HIL[elem0[i],"Recreational_C"]))
}

# rename the columns to more understandable names

colnames(OutputDF)[c(3:5,8:10)] <- c("min","median","max","n > DGV","n > GV-high", "n > HIL(C)")
print(OutputDF)
##          mean          sd min median    max   n NAs n > DGV n > GV-high n > HIL(C)
## As   8.158529    7.712881 0.4   6.05   62.0 323  13      26           0          0
## Cr  49.332589   20.581402 1.0  54.70   84.0 336   0       6           0          0
## Cu 111.716208  121.118755 2.0  61.00 1007.8 327   9     156          26          0
## Mn 143.608036  309.044956 6.0  79.00 3503.0 336   0       0           0          0
## Ni  19.049388    7.998509 1.0  19.00   50.0 327   9     137           0          0
## Pb  54.906101   84.550788 1.0  37.30  831.1 336   0      91           9          4
## Zn 471.488554 1016.997243 1.0 222.50 7556.4 332   4     185          73          0
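As an aside, the explicit loop above could be replaced by an sapply() call that counts exceedances for all elements at once. A sketch using hypothetical toy values in place of sedsoil and the real guideline table:

```r
# hypothetical stand-ins for sedsoil and the DGV thresholds (toy values)
toy <- data.frame(Pb = c(10, 60, 300, NA), Zn = c(100, 250, 500, 80))
dgv <- c(Pb = 50, Zn = 200)

# count exceedances for every element at once, ignoring missing values
sapply(names(dgv), function(el) sum(toy[[el]] > dgv[el], na.rm = TRUE))
```

Either form gives the same counts; the loop used above has the advantage of filling all three threshold columns of OutputDF in a single pass.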

We could stop here, but this is not an attractively formatted table. To make a publication-quality table, we first make a new data frame with columns and rows transposed (using t()), with the previous column names as the first column.

To avoid lots of nesting of flextable functions, we find it easiest to pipe the successive lines of code containing formatting modifications. We use the native R pipe operator |> here, but the older magrittr pipe operator %>% would work if you prefer. The officer package is needed to interpret the formatting option fp_p=fp_par(text.align = "left", padding.bottom = 6) in the set_caption() function.
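For readers unfamiliar with the native pipe (available since R 4.1), it simply passes the left-hand result as the first argument of the right-hand function, so nested calls can be written as a left-to-right chain:

```r
# these two expressions are equivalent
round(sqrt(2), 3)     # nested form
sqrt(2) |> round(3)   # piped form
```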

# make a new data frame with columns and rows transposed, and the 
# previous column names as the first column:
ft <- as.data.frame(cbind(colnames(OutputDF),t(signif(OutputDF,3))))

# Then, use flextable to output a table in publication-quality form

flextable(ft) |>
  width(j=c(1:7), width=c(3,rep(2.2,6)), unit="cm") |>
  set_header_labels(values=list(V1="")) |>
  align(j=2:8, align="right", part="all") |>
  padding(i=8, padding.top = 8) |>
  bold(bold=TRUE, part="header") |>
  set_caption(caption="Table 1: Summary statistics for trace element concentrations (mg/kg) in sediment or soil at Ashfield Flats. Abbreviations: n = number of valid observations; NAs = number of missing observations; n > DGV is number of samples exceeding the sediment Default Guideline Value; n > GV-high is number of samples exceeding the sediment upper Guideline Value at which toxicity effects might be expected (Water Quality Australia, 2024). HIL(C) is the human-health based investigation level for Recreational (public open space) land use (NEPC, 2013).",
              align_with_table=FALSE,
              fp_p=fp_par(text.align = "left", padding.bottom = 6))

(The formatted table produced by flextable is rendered here.)

The final table could now go into a report, and most readers would be happy with its appearance. We could probably do something about the alignment of the numeric columns, but decimal point alignment is not available in flextable yet.

The next really useful type of information to obtain would be where the samples that exceed environmental thresholds are located. That leads us to another, more basic, question: “How can we map our data in R?” Further chapters will therefore cover preparing maps in R, and from there we can move on to spatial analysis.

References

NEPC (National Environment Protection Council). (2013). Schedule B(1): Guideline on the Investigation Levels for Soil and Groundwater. In National Environment Protection (Assessment of Site Contamination) Measure (Amended). Commonwealth of Australia.

Rate, A. W., & McGrath, G. S. (2022). Data for assessment of sediment, soil, and water quality at Ashfield Flats Reserve, Western Australia. Data in Brief, 41, 107970. https://doi.org/10.1016/j.dib.2022.107970

Water Quality Australia. (2024). Toxicant default guideline values for sediment quality. Department of Climate Change, Energy, the Environment and Water, Government of Australia. Retrieved 2024-04-11 from https://www.waterquality.gov.au/anz-guidelines/guideline-values/default/sediment-quality-toxicants

 


CC-BY-SA • All content by Ratey-AtUWA. My employer does not necessarily know about or endorse the content of this website.