Material to support teaching in Environmental Science at The University of Western Australia

Units ENVT3361, ENVT4461, and ENVT5503

This guide gets you started with reading data into R (R Core Team, 2022) from a file, including checking that the data have been read in correctly. We will always be using R in the RStudio environment (Posit Software, 2022).

If you need or would like a more basic introduction to R, you could first read our Guide to R and RStudio for absolute beginners.

Reading the data

We use the read.csv() function – we will mostly supply data to you as .csv files. Sometimes you need to download these into your working directory to use them, and sometimes you can read them directly from a web URL (e.g. https://github.com/.../afs19.csv).

With the type of dataset we usually use, there are columns containing categorical information, which R calls factors. These are typically stored in the file as text, i.e. character strings, or just strings. Since R version 4.0.0, read.csv() leaves such columns as character by default, so to have R recognise the categories we need to use the stringsAsFactors = TRUE argument in the read.csv() function.

The result of the read.csv() function is a data frame object stored in the R environment. We need a data frame, since it is a data structure in R which allows having columns of different classes (e.g. integer, numeric, factor, date, logical, etc.) in the same object (each column contains just one class of data, though).
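As a minimal sketch of what stringsAsFactors does, the example below reads a tiny CSV supplied through the text = argument of read.csv(), so no file is needed. The three rows are invented for illustration and are not the real hubbard data; the object names csv_text, as_text and as_factors are hypothetical.

```r
# a tiny made-up CSV held in a character string (not real data)
csv_text <- "PLOT,Rel.To.Brook,PH
1,North,4.29
2,North,4.66
3,South,4.23"

# text columns kept as character strings
as_text <- read.csv(text = csv_text, stringsAsFactors = FALSE)
# text columns converted to factors
as_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)

sapply(as_text, class)            # class of each column when strings stay as text
sapply(as_factors, class)         # class of each column when strings become factors
levels(as_factors$Rel.To.Brook)   # the categories R has recognised
```

Each column holds a single class, but the classes differ between columns, which is what makes a data frame the right container for this kind of dataset.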

Objects we create in an R session, like data frames, are only stored while our R session is active, and disappear when we close R. Fortunately we can save our whole environment by clicking the 💾 icon on the Environment tab, or by running the save.image() function. Either method creates a file with extension .RData, which we can load in later R sessions using the load() function or by clicking the 📁 icon on the Environment tab in RStudio.
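Saving and loading can be tried safely without touching your real workspace. This sketch assumes a throwaway object called toy and a temporary file name (both hypothetical). It uses save(), which stores only the named objects, as a stand-in for save.image(), which stores everything in the environment.

```r
# 'toy' and the file name are hypothetical, for illustration only
toy <- data.frame(site = factor(c("North", "South", "North")),
                  pH   = c(4.3, 4.7, 4.2))
rdata_file <- file.path(tempdir(), "toy_session.RData")

save(toy, file = rdata_file)  # save.image(file = rdata_file) would save every object
rm(toy)                       # mimic closing R: the object no longer exists
load(rdata_file)              # restores 'toy' under its original name
str(toy)                      # check that the restored object looks right
```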

git <- "https://github.com/Ratey-AtUWA/Learn-R-web/raw/main/"
hubbard <- read.csv(file = paste0(git,"hubbard.csv"), stringsAsFactors = TRUE)
# ... and do a quick check
is.data.frame(hubbard) # check that it worked; if so, the result should be TRUE
## [1] TRUE

These data are from the Hubbard Brook Experimental Forest near Woodstock in New Hampshire, USA (Figure 1).

Figure 1: Location of the Hubbard Brook Experimental Forest in New Hampshire, USA. The data used in this workshop were generated at this site.


First proper check - summarise some of the data

We can use the summary() function to make a quick check of our data, to make sure the file has read correctly (this may not happen if the file is improperly formatted, etc.).

summary(hubbard[,1:9]) # just the first 9 columns to save space!
##       PLOT       Rel.To.Brook    Transect   UTM_EASTING      UTM_NORTHING           PH       
##  Min.   :  1.0   North:135    E278000:44   Min.   :276000   Min.   :4866300   Min.   :2.510  
##  1st Qu.:126.8   South:125    E277000:43   1st Qu.:277000   1st Qu.:4867800   1st Qu.:4.168  
##  Median :254.5                E280000:37   Median :279000   Median :4869000   Median :4.350  
##  Mean   :249.2                E276000:35   Mean   :278996   Mean   :4868769   Mean   :4.322  
##  3rd Qu.:392.2                E279000:34   3rd Qu.:281000   3rd Qu.:4869700   3rd Qu.:4.503  
##  Max.   :460.0                E282000:25   Max.   :283000   Max.   :4871100   Max.   :4.870  
##                               (Other):42                                                     
##   MOISTURE.pct        OM.pct             Cd        
##  Min.   : 0.510   Min.   : 2.240   Min.   :0.0980  
##  1st Qu.: 3.068   1st Qu.: 9.188   1st Qu.:0.3070  
##  Median : 4.131   Median :11.162   Median :0.6350  
##  Mean   : 5.367   Mean   :12.009   Mean   :0.7426  
##  3rd Qu.: 6.324   3rd Qu.:13.582   3rd Qu.:1.1310  
##  Max.   :26.262   Max.   :50.177   Max.   :1.8890  
##                                    NA's   :167

 

The summary() function creates a little table for each column; note that these little tables do not all look the same. Integer or numeric columns get a numeric summary with minimum, mean etc., and sometimes the number of missing (NA) values. Categorical (Factor) columns show the number of samples (rows) in each category (unless there are too many categories). These summaries are useful to check if there are zero or negative values in columns, how many missing observations there might be, and if the data have been read correctly into R.

 

Note: You would usually check the whole data frame, without restricting rows or columns, by running summary(hubbard). We could also:

  • summarise all variables, but only the first 10 rows by running summary(hubbard[1:10,]) (we also call rows 'observations' which often represent separate field samples);
  • summarise a defined range of both rows and columns, e.g. summary(hubbard[1:20,6:10]), which would summarise only the first 20 rows of columns 6 to 10.
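The [rows, columns] indexing used in these examples can be sketched with a small invented data frame (toy and its values are hypothetical). Leaving the rows or the columns position empty means "all of them".

```r
# made-up values; 'toy' is a hypothetical stand-in for a real data frame
toy <- data.frame(id    = 1:6,
                  group = factor(c("a", "b", "a", "b", "a", "b")),
                  x     = c(2.5, NA, 3.1, 4.8, 0.9, 5.2),
                  y     = c(10, 12, 9, 15, 11, 14))

summary(toy)              # all rows, all columns (note the NA count for x)
summary(toy[1:3, ])       # first 3 rows, all columns
summary(toy[, 3:4])       # all rows, columns 3 and 4 only
summary(toy[1:4, 2:3])    # rows 1 to 4 of columns 2 and 3
dim(toy[1:4, 2:3])        # number of rows and columns in that subset
```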

 

Final checks of the data frame

Usually we would not restrict the output as done below with [,1:15]. We only do it here so we're not bored with pages of similar-looking output. You should look at the structure of the whole data frame using str(hubbard) (replacing hubbard with whatever your data object is called). The whole hubbard data frame has 62 variables (i.e. columns), not 15.

str(hubbard[,1:15]) # 'str' gives the structure of an object
## 'data.frame':    260 obs. of  15 variables:
##  $ PLOT        : int  1 2 3 5 6 7 8 9 10 11 ...
##  $ Rel.To.Brook: Factor w/ 2 levels "North","South": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Transect    : Factor w/ 8 levels "E276000","E277000",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ UTM_EASTING : int  280000 280000 280000 280000 280000 280000 280000 280000 280000 280000 ...
##  $ UTM_NORTHING: int  4868400 4868500 4868700 4869000 4869100 4869300 4869400 4869600 4869700 4869900 ...
##  $ PH          : num  4.29 4.66 4.23 4.15 4.49 4.79 4.11 4.52 4.51 4.43 ...
##  $ MOISTURE.pct: num  4.74 7.47 5.55 3.77 4.82 ...
##  $ OM.pct      : num  12.2 10.46 14.88 9.14 12.01 ...
##  $ Cd          : num  0.498 0.207 1.359 0.913 0.099 ...
##  $ Cu          : num  22.1 27.1 18.6 7 17 ...
##  $ Ni          : num  12.25 20.5 14.43 8.72 10.62 ...
##  $ Cr          : num  21.1 28.2 19.6 21.3 22.8 ...
##  $ Co          : num  14.4 17 13.1 11.1 11.4 ...
##  $ Zn          : num  47.6 63.1 47.5 18.2 31.9 ...
##  $ Mn          : num  490 453 578 261 335 ...

We can see that some columns are integer (int) values (e.g. PLOT, UTM_EASTING), some columns contain Factor values i.e. in fixed categories (e.g. Rel.To.Brook, Transect), and some columns are numeric (num) (e.g. PH, OM.pct, Ni). Applying the str() function to a data object is always a good idea, to check that the data have read correctly into R. [NOTE that other variable types are possible such as character chr, date (Date or POSIXct), logical, etc.]
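Two further quick checks complement str(): listing the class of every column, and counting missing values per column. The sketch below uses airquality, a data frame supplied with R, purely as a stand-in; col_classes and na_counts are hypothetical names.

```r
# airquality is a data frame supplied with R, used here as a stand-in
dim(airquality)                            # number of rows and columns
col_classes <- sapply(airquality, class)   # class of each column
na_counts   <- colSums(is.na(airquality))  # number of missing values per column
col_classes
na_counts
```

The same two lines should work on any data frame, for example by replacing airquality with hubbard.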


"Data is like garbage. You'd better know what you are going to do with it before you collect it."

— Mark Twain


The following section describes how to make graphs and plots in base-R and in the ggplot2 package. Go to this page if you want base-R only, without the ggplot2 material.

Base R plotting: x-y plot using plot()

We can use either plot(x, y, ...) OR plot(y ~ x, ...)
In R the ~ symbol means 'as a function of', so ~ indicates a formula.
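To make the formula idea concrete, here is a small sketch with invented values (toy and f are hypothetical names). A formula is an object in its own right, and plot() accepts it together with a data = argument.

```r
# toy values, invented for illustration
toy <- data.frame(PH      = c(4.1, 4.3, 4.5, 4.7),
                  EXCH_Al = c(1.2, 0.9, 0.5, 0.3))

f <- EXCH_Al ~ PH      # y ~ x: a formula object, stored without being evaluated
class(f)               # the class of the object
all.vars(f)            # the variable names the formula refers to

plot(toy$PH, toy$EXCH_Al)   # plot(x, y): x comes first
plot(f, data = toy)         # plot(y ~ x): y comes first; same points are drawn
```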

In R we need to tell the program which 'object' our variables are in. We've just made a data frame (a type of data object) called hubbard.

The following 3 styles of code do exactly the same thing:

  1. Specifying the data frame using with() syntax -- (we recommend this one!)
with(hubbard,
     plot(EXCH_Al ~ PH)
)

...which can just be written on a single line:

with(hubbard, plot(EXCH_Al ~ PH))
Figure 2: Plot of exchangeable Al vs. pH using with() to specify the data frame.


  2. Specifying the data frame using the dollar-sign operator
plot(hubbard$EXCH_Al ~ hubbard$PH) # look at axis titles!
Figure 3: Plot of exchangeable Al vs. pH using dollar-sign syntax to specify the data frame. Notice the axis titles!


  3. Specifying the data frame using attach() and detach() (not recommended)
attach(hubbard)
plot(EXCH_Al ~ PH)
detach(hubbard)
Figure 4: Plot of exchangeable Al vs. pH using `attach()` to specify the data frame.


Without changing any of the (numerous) options or parameters in the plot() function, the base-R plot is not very attractive (e.g. axis titles!).
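The three styles above differ only in where R looks for the column names. A small sketch with an invented data frame (toy, m_with and m_dollar are hypothetical names) shows that with() and the dollar-sign operator reach the same column:

```r
# toy values, invented for illustration
toy <- data.frame(PH      = c(4.1, 4.3, 4.5, 4.7),
                  EXCH_Al = c(1.2, 0.9, 0.5, 0.3))

m_with   <- with(toy, mean(EXCH_Al))  # column is found inside 'toy'
m_dollar <- mean(toy$EXCH_Al)         # column is extracted explicitly
identical(m_with, m_dollar)           # both routes give the same value
```

By contrast, attach() keeps the columns visible until detach() is called, so forgetting detach() can leave names lying around that mask other objects.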

We can also change the overall plot appearance by using the function par() before plotting; par() sets graphics parameters. Let's try some variations:

Alternative plot using ggplot (Wickham, 2016)

library(ggplot2)
ggplot(data=hubbard, aes(x= PH, y = EXCH_Al)) +
  geom_point()
Figure 5: Basic plot of exchangeable Al vs. pH using the `ggplot2` R package.


Plots made using ggplot2, such as Figure 5, do not use the par() settings of base R; instead their appearance is controlled by a default plot style, or theme.

Setting some overall graphics parameters using par()

  • mar= sets margins in 'lines' units: c(bottom,left,top,right) e.g. c(4,4,3,1)
  • mgp= sets distance of text from axes: c(titles, tickLabels, ticks)
  • font.lab= sets font style for axis titles: 2=bold, 3=italic, etc.
    • and within the plot() function itself, xlab= and ylab= set axis titles
par(mar=c(4,4,1,1), mgp=c(2,0.6,0), font.lab=2)
# We'll also include some better axis title text using xlab, ylab
with(hubbard,
     plot(EXCH_Al ~ PH, 
     xlab="Soil pH", 
     ylab="Exchangeable Al (centimoles/kg)")
     )
Figure 6: Plot of exchangeable Al vs. pH improved by changing graphics parameters and including custom axis titles.


This is starting to look a lot better!
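Settings made with par() stay in force for later plots on the same graphics device. One common habit, sketched here as a suggestion rather than a requirement, is to keep the previous settings and restore them afterwards. The name op is hypothetical, and the plotted numbers are arbitrary.

```r
# op stores the values the parameters had BEFORE this call
op <- par(mar = c(4, 4, 1, 1), mgp = c(2, 0.6, 0), font.lab = 2)

par("mar")                       # the margins now in force
plot(1:10, (1:10)^2,             # drawn with the new settings
     xlab = "x", ylab = "x squared")

par(op)                          # put the previous settings back
```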

We can still add more information to the graph; for example, by making use of the factors (categories) in the dataset. We also need to learn these graphics parameters:

  • col = plotting colour(s) - it's easiest to use words like "red", "darkblue" and so on
    see this R colour chart, or just run the R function colors() for a list of all 657 names!

  • pch = plot character(s) - numbers from 0 to 25 (run help(points) or see this page from YaRrr).

par(mar=c(4,4,1,1), mgp=c(2,0.6,0), font.lab=2)
with(hubbard,
     plot(EXCH_Al ~ PH, xlab="Soil pH",
          ylab="Exchangeable Al (centimoles/kg)",
          pch=c(1,16)[Rel.To.Brook], 
          col=c("blue","darkred")[Rel.To.Brook])
     )
Figure 7: Plot of exchangeable Al vs. pH with custom graphics parameters and axis titles, improved by separating points by a Factor.


The parameter pch=c(1,16)[Rel.To.Brook] separates the points by the information in the Factor column Rel.To.Brook, shown inside [ ]. This column is a 2-level factor, so can be one of two categories (North or South), and so we need a vector with two numbers in it (pch=c(1,16)). The code for specifying colour is very similar, except our vector has 2 colour names in it.
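The indexing trick works because a factor is stored internally as integer codes (1 for the first level, 2 for the second, and so on), and those codes are used to pick elements from the vector. A minimal sketch, using an invented five-value factor called loc (a hypothetical name):

```r
# an invented factor with two levels; levels are sorted alphabetically
loc <- factor(c("North", "South", "South", "North", "South"))

levels(loc)                   # "North" is level 1, "South" is level 2
as.integer(loc)               # the integer codes behind the factor
c(1, 16)[loc]                 # one plot symbol per observation
c("blue", "darkred")[loc]     # one colour per observation
```

A factor with three levels would need vectors of three symbols and three colours, in the same order as levels() reports them.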

There is still one thing missing; a graph legend. We can add one using the legend() function. We will use the following options:

  • "topleft" position of legend -- run help(legend) for options, or we can use x-y coordinates
  • legend = a vector of names identifying the plot symbols - we have used the categories in the factor 'Rel.To.Brook', levels(hubbard$Rel.To.Brook), but we could have used legend=c("North","South") instead
  • pch = plot symbols - should be exactly the same vector as in the plot function
  • col = plot colours - should be exactly the same vector as in the plot function
  • title = a title for the legend - optional
par(mar=c(4,4,1,1), mgp=c(2,0.6,0), font.lab=2)
with(hubbard,
       plot(EXCH_Al ~ PH, xlab="Soil pH", 
            ylab="Exchangeable Al (centimoles/kg)",
            pch=c(1,16)[Rel.To.Brook], 
            col=c("blue","darkred")[Rel.To.Brook])
       )
legend("topleft", legend=levels(hubbard$Rel.To.Brook), pch=c(1,16), 
       col=c("blue","darkred"), title="Location")
Figure 8: Plot of exchangeable Al vs. pH by sample location, with custom graphics parameters and axis titles, and a legend to explain the plot symbols.


Alternative plot using ggplot

library(ggplot2)
ggplot(data=hubbard, 
       aes(x= PH, y = EXCH_Al, color=Rel.To.Brook, shape=Rel.To.Brook)) +
  geom_point(size=2.5) + 
  labs(y="Exchangeable Al (cmolc/kg)", x = "Soil pH") +
  theme_bw()
Figure 9: Plot of exchangeable Al vs. pH in the Hubbard Brook soil data using the `ggplot2` R package showing points categorized by the factor `Rel.To.Brook`, and using a non-default plot theme.


In the ggplot2 example in Figure 9 we have used the aes() option to specify different colours and symbols depending on the value of a factor (Rel.To.Brook). A legend is created automatically. We have also used a different theme option (theme_bw()), which is arguably more appropriate than the default which has a grey plot background.

Alternative to base-R plot: scatterplot() (from the car package)

The R package car (Companion to Applied Regression) has many useful additional functions that extend the capability of R. The next two examples produce nearly the same plot as in the previous examples, using the scatterplot() function in the car package.

# load required package(s)
library(car)
# par() used to set overall plot appearance using options within 
# par(), e.g.
#     mar sets plot margins, mgp sets distance of axis title and 
#     tick labels from axis
par(font.lab=2, mar=c(4,4,1,1), mgp=c(2.2,0.7,0.0))
# draw scatterplot with customised options 
# remember pch sets plot character (symbol); 
# we will also use the parameter cex which sets symbol sizes, and
# cex.lab and cex.axis which set the sizes of axis titles and tick labels
scatterplot(EXCH_Al ~ PH, data=hubbard, smooth=FALSE, 
            legend = c(coords="topleft"), 
            cex=1.5, cex.lab=1.5, cex.axis=1.2)
Figure 10: Plot of exchangeable Al vs. pH made using the car::scatterplot() function.


Note that we get some additional graph features by default:

  1. boxplots for each variable in the plot margins – these are useful for evaluating the distribution of our variables and any extreme values
  2. a linear regression line showing the trend of the relationship (it's possible to add this in base R plots, too)
  3. grid lines in the plot area (also available in base R)

We can turn all of these features off if we want - run help(scatterplot) in the RStudio Console, and look under Arguments and Details.

Also, we separately specify the dataset to be used as a function argument, i.e., data=hubbard.
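As noted in the list above, a regression line and grid lines can also be added to a base-R plot. A minimal sketch follows, using cars (a small data frame supplied with R) only as a stand-in dataset; fit is a hypothetical object name.

```r
# 'cars' is a small data frame supplied with R, used here only as a stand-in
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
grid()                                # grid lines in the plot area
fit <- lm(dist ~ speed, data = cars)  # simple linear regression
abline(fit, col = "blue3", lwd = 2)   # add the fitted line to the existing plot
coef(fit)                             # intercept and slope of the line
```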

Scatterplot (car) with groups, Hubbard Brook soil data

# 'require()' loads package(s) if they haven't already been loaded
require(car)
# adjust overall plot appearance using options within par()
# mar sets plot margins, mgp sets distance of axis title and tick 
#   labels from axis
par(font.lab=2, mar=c(4,4,1,1), mgp=c(2.2,0.7,0.0))
# create custom palette with nice colours :)
# this lets us specify colours using numbers - try it!
palette(c("black","red3","blue3","darkgreen","sienna"))
# draw scatterplot with points grouped by a factor (Rel.To.Brook) 
scatterplot(EXCH_Al ~ PH | Rel.To.Brook, data=hubbard, smooth=FALSE,
            legend = c(coords="topleft"), col=c(5,3,1), 
            pch=c(16,0,2), cex=1.2, cex.lab=1.3, cex.axis=1.0)
Figure 11: Plot of exchangeable Al vs. pH with points grouped by location, made using the car::scatterplot() function.


The scatterplot() function creates a legend automatically if we plot by factor groupings (note the different way that the legend position is specified within the scatterplot() function). This is pretty similar to the base R plot above (we can also customise the axis titles in scatterplot(), using xlab= and ylab= as before).


Other types of data presentation: Plot types and Tables

We'll give you some starting code chunks, and the output from them. You can then use the help in RStudio to try to customise the plots according to the suggestions below each plot!

Histograms

Histograms are an essential staple of statistical plots, because we always need to know something about the distribution of our variable(s). Histograms are a good visual way to assess the shape of a variable's distribution, whether it's symmetrical and bell-shaped (normal), left- or right-skewed, bimodal (two 'peaks') or even multimodal. As with any check of a distribution, the 'shape' will be clearer if we have more observations.

with(hubbard, hist(MOISTURE.pct))
Figure 12: Histogram of percent soil moisture content at Hubbard Brook.


Histogram in ggplot

ggplot(data = hubbard, aes(x=MOISTURE.pct)) +
  geom_histogram(fill="lightgray", color="black") +
  theme_bw()
Figure 13: Histogram of water content (%) in the Hubbard Brook soil data using `ggplot2`, and using a non-default ggplot theme.


For histograms, try the following:

  • add suitable axis labels (titles)
  • make bars a different colour (all the same)
  • change the number of cells (bars) on the histogram to give wider or narrower intervals
  • log10-transform the x-axis (horizontal axis)
  • remove the title above the graph (this information would usually go in a caption)
    ...and so on.
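One possible starting point for these suggestions is sketched below. The data are simulated right-skewed values, not the hubbard measurements, and the name moisture and the colour are arbitrary choices.

```r
set.seed(123)                                        # make the random values repeatable
moisture <- rlnorm(200, meanlog = 1.5, sdlog = 0.6)  # invented, right-skewed values

hist(moisture,
     breaks = 20,            # a suggestion only: R picks 'pretty' breaks near this number
     col = "steelblue",      # one fill colour for all bars
     main = "",              # no title above the plot
     xlab = "Moisture content (%)")

# log10-transforming the values first spreads out the crowded low end
hist(log10(moisture), breaks = 20, col = "steelblue", main = "",
     xlab = "log10 of moisture content (%)")
```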

Box plots

Box plots also give us some information about a variable's distribution, but one of their great strengths is in comparing values of a variable between different groups in our data.

Tukey box plots implemented by R have 5 key values; from least to greatest these are:

  1. the lower whisker, which is the smallest observation that is no less than the lower hinge − 1.5 × IQRᵃ (so it equals the minimum if there are no low outliers)
  2. the lower hinge, which is the lower quartile (25th percentile)
  3. the median (i.e. the 50th percentile)
  4. the upper hinge, which is the upper quartile (75th percentile)
  5. the upper whisker, which is the largest observation that is no greater than the upper hinge + 1.5 × IQR (so it equals the maximum if there are no high outliers)

Any points less than the lower whisker, or greater than the upper whisker, are plotted separately and represent potential outliers.

ᵃ IQR is the interquartile range, the difference between the upper and lower quartiles (75th and 25th percentiles).
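The five values can be inspected directly with the base-R function boxplot.stats(). The sketch below uses ten invented numbers (x and bs are hypothetical names), one of which is deliberately extreme. Note that each whisker sits at the most extreme observation that lies within 1.5 × IQR of its hinge.

```r
# ten made-up values; 12.8 is deliberately extreme
x <- c(2.1, 3.4, 3.9, 4.2, 4.4, 4.6, 5.0, 5.3, 6.1, 12.8)

bs <- boxplot.stats(x)
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # values beyond the whiskers, plotted as separate points

# the fences that decide which points count as potential outliers
iqr <- bs$stats[4] - bs$stats[2]
c(lower = bs$stats[2] - 1.5 * iqr, upper = bs$stats[4] + 1.5 * iqr)
```

Here the upper whisker is 6.1 rather than the maximum, because 12.8 lies above the upper fence.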

boxplot(MOISTURE.pct ~ Rel.To.Brook, data=hubbard)
Figure 14: Box plot of percent soil moisture content at Hubbard Brook. Only default `boxplot()` arguments have been used (except for the annotations explaining the boxplot features).


For box plots, try the following:

  • include informative axis labels (titles)
  • plot boxes separated by categories in a Factor
  • make boxes a different colour (all the same, and all different!)
  • add notches to boxes representing approximate 95% confidence intervals around the median
  • give the (vertical) y-axis a log10 scale
    ...and so on.
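A sketch combining several of these options is shown below. The data are simulated (two groups of 100 invented right-skewed values), and the names toy and bp and the colours are arbitrary choices.

```r
set.seed(42)                           # make the random values repeatable
toy <- data.frame(
  group    = factor(rep(c("North", "South"), each = 100)),
  moisture = c(rlnorm(100, 1.5, 0.5), rlnorm(100, 1.8, 0.5))  # invented values
)

bp <- boxplot(moisture ~ group, data = toy,
              notch = TRUE,                  # notches: approx. 95% CI around each median
              log = "y",                     # log10-scaled vertical axis
              col = c("skyblue", "wheat"),   # a different colour for each box
              xlab = "Location", ylab = "Moisture content (%)")
bp$stats   # the five key values for each group, one column per box
```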

Boxplot in ggplot

ggplot(data = hubbard, aes(x=Rel.To.Brook, y=MOISTURE.pct)) +
  geom_boxplot(fill="lightgray", color="black") +
  theme_bw()
Figure 15: Boxplot of water content (%) in the Hubbard Brook soil data using `ggplot2`, and using a non-default ggplot theme.


Strip Charts

Strip charts, or one-dimensional scatter plots, can be a useful companion (or alternative) to box plots, especially when we don't have many observations of a variable.

stripchart(hubbard$OM.pct)
Figure 16: Soil organic matter content (%) at Hubbard Brook: as a one-dimensional scatterplot (`stripchart()`).


Stripchart in ggplot

ggplot(data = hubbard, aes(x=MOISTURE.pct, y="")) +
  geom_jitter(position=position_jitter(0.1)) +
  theme_bw()
Figure 17: Strip chart of water content (%) in the Hubbard Brook soil data using `ggplot2`, and using a non-default ggplot theme.


For stripcharts, try the following:

  • add suitable axis labels (titles), in bold font;
  • make symbols a different shape ± colour (all the same, and all different!);
  • make the strip chart vertical instead of horizontal;
  • apply some 'jitter' (noise) to the symbols so that overlapping points are easier to see;
  • log10-transform the numerical axis, so that overlapping points are easier to see;
  • rotate the y-axis titles 90° clockwise (make sure the left margin is wide enough!);
  • remove the titles above the graphs (this information would usually go in caption/s)
    ...and so on.
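Several of these suggestions can be combined in one call to the base-R stripchart() function. The sketch uses warpbreaks, a data frame supplied with R, only as a stand-in; the symbols and colours are arbitrary choices.

```r
# warpbreaks is supplied with R; 'tension' is a factor with three levels
stripchart(breaks ~ tension, data = warpbreaks,
           method = "jitter", jitter = 0.15,  # add noise so overlapping points separate
           vertical = TRUE,                   # values on the vertical axis
           log = "y",                         # log10-scaled numerical axis
           pch = c(1, 2, 5),                  # one symbol per group
           col = c("blue3", "red3", "darkgreen"),
           xlab = "Tension", ylab = "Number of breaks")
```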

"With data collection, 'the sooner the better' is always the best answer."

Marissa Mayer


Tables

There are a few ways to make useful tables in R to summarise your data. Here are a couple of examples.

Using the tapply() function in base R

We can use the tapply() function to make very simple tables:

# use the cat() [conCATenate] function to make a Table heading 
#     (\n is a line break)
cat("One-way table of means\n")
with(hubbard, tapply(X = EXCH_Ni, INDEX=Transect, 
       FUN=mean, na.rm=TRUE))
cat("\nTwo-way table of means\n")
with(hubbard, tapply(X = EXCH_Ni, 
       INDEX=list(Transect,Rel.To.Brook), 
       FUN=mean, na.rm=TRUE))
## One-way table of means
##     E276000     E277000     E278000     E279000     E280000     E281000     E282000     E283000 
## 0.002588815 0.002136418 0.002827809 0.002813563 0.002528296 0.002262386 0.002221885 0.002957922 
## 
## Two-way table of means
##               North       South
## E276000 0.002652212 0.002559759
## E277000 0.002154745 0.002117219
## E278000 0.003133112 0.002360874
## E279000 0.002733461 0.002927993
## E280000 0.002176569 0.002861511
## E281000 0.002537732 0.001962009
## E282000 0.002219457 0.002224313
## E283000 0.003542597 0.002039146

 

For tapply() tables, try the following:

  • we have used the mean function (FUN=mean) – try another function to get minima, maxima, standard deviations, etc.
  • try copying the output to Word or Excel and using this to make a table in that software
    ...and so on.
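Swapping the summary function only requires changing the FUN argument. A short sketch follows, again using the built-in warpbreaks data frame as a stand-in for hubbard; sd_by_wool and max_two_way are hypothetical names.

```r
# one-way table of standard deviations
sd_by_wool <- with(warpbreaks, tapply(X = breaks, INDEX = wool, FUN = sd))
sd_by_wool

# two-way table of maxima: rows are tension levels, columns are wool types
max_two_way <- with(warpbreaks,
                    tapply(X = breaks, INDEX = list(tension, wool), FUN = max))
max_two_way
```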

Using the numSummary() function in the 'RcmdrMisc' R package

require(RcmdrMisc)
# use the cat() [conCATenate] function to make a Table heading 
#   (\n is a line break)
cat("Summary statistics for EXCH_Ni\n")
numSummary(hubbard$EXCH_Ni)
cat("\nSummary statistics for EXCH_Ni grouped by Rel.To.Brook\n")
numSummary(hubbard$EXCH_Ni, groups=hubbard$Rel.To.Brook)
## Summary statistics for EXCH_Ni
##         mean          sd         IQR         0%         25%         50%         75%       100%   n NA
##  0.002536502 0.001382344 0.001126141 0.00070457 0.001829135 0.002246775 0.002955276 0.01784169 257  3
## 
## Summary statistics for EXCH_Ni grouped by Rel.To.Brook
##              mean          sd         IQR          0%         25%         50%         75%        100% data:n
## North 0.002635924 0.001605190 0.001114533 0.001120626 0.001933342 0.002277016 0.003047875 0.017841689    132
## South 0.002431513 0.001096042 0.001170948 0.000704570 0.001719421 0.002174767 0.002890369 0.008483806    125
##       data:NA
## North       3
## South       0

For numSummary() tables, try the following:

  • generating summary tables for more than one variable at a time
  • generating summary tables with fewer statistical parameters (e.g. omit IQR) or more statistical parameters (e.g. include skewness)
    (use RStudio Help!)
  • try copying the output to Word or Excel and using this to make a table in that software
    ...and so on.

Tables using print() on a data frame

Data frames are themselves tables, and if they already contain the type of summary we need, we can just use the print() function to get output. Let's do something [slightly] fancy (see if you can figure out what is going on here¹):

output <- 
  numSummary(hubbard[,c("PH","MOISTURE.pct","OM.pct","Al.pct","Ca.pct","Fe.pct")],
             statistics = c("mean","sd","quantiles"), quantiles=c(0,0.5,1))
mytable <- t(cbind(output$table,output$NAs))
row.names(mytable) <- c("Mean","Std.dev.","Min.","Median","Max.","Missing")
# here's where we get the output
print(mytable, digits=3)
# note: writing to "clipboard" works on Windows; on other systems write to a file instead
write.table(mytable,"clipboard",sep="\t")
cat("\nThe table has now been copied to the clipboard, so you can paste it into Excel!\n")
##             PH MOISTURE.pct OM.pct Al.pct Ca.pct Fe.pct
## Mean     4.322         5.37  12.01  1.641  0.208  3.163
## Std.dev. 0.293         3.83   5.17  1.478  0.214  1.112
## Min.     2.510         0.51   2.24  0.155  0.002  0.497
## Median   4.350         4.13  11.16  0.970  0.139  3.168
## Max.     4.870        26.26  50.18  7.132  1.107  7.709
## Missing  0.000         0.00   0.00 60.000 62.000 60.000
## 
## The table has now been copied to the clipboard, so you can paste it into Excel!

¹Hints: we have made two objects: output, which holds the results of numSummary(), and mytable, a matrix built from it; t() is the transpose function; there are also some other useful functions which might be new to you like: row.names(), cbind(), print(), write.table() . . .
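The functions named in the hints can be explored on a much smaller scale. In this sketch all numbers are invented and the object names (stats_by_var, n_missing, wide, tall, out_file) are hypothetical. The table is written to a temporary tab-separated file rather than the clipboard, so that it behaves the same way on any operating system.

```r
# invented statistics for two variables (one row per variable)
stats_by_var <- rbind(PH     = c(mean = 4.3,  sd = 0.3),
                      OM.pct = c(mean = 12.0, sd = 5.2))
n_missing <- c(PH = 0, OM.pct = 0)

wide <- cbind(stats_by_var, Missing = n_missing)   # cbind() adds a column
tall <- t(wide)                                    # t() swaps rows and columns
row.names(tall) <- c("Mean", "Std.dev.", "Missing")
print(tall, digits = 3)

# write a tab-separated file that spreadsheet software can open
out_file <- file.path(tempdir(), "mytable.tsv")
write.table(tall, out_file, sep = "\t", quote = FALSE, col.names = NA)
readLines(out_file)                                # inspect what was written
```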

If you want to take this further, we can start making really nice Tables for reports with various R packages. I use the flextable package (Gohel & Skintzos, 2022) and sometimes the kable() function from the knitr package. Here's an example (Table 1) using flextable and the table object mytable made above:

library(flextable)
flextable(data.frame(Statistic=row.names(mytable),signif(mytable,3))) |>
  bold(bold=TRUE, part="header") |>
  set_caption(caption="Table 1: A table created by the `flextable` R package. Many more table formatting, text formatting, and number formatting options are available in this package.")
Table 1: A table created by the `flextable` R package. Many more table formatting, text formatting, and number formatting options are available in this package.

Statistic      PH  MOISTURE.pct  OM.pct  Al.pct  Ca.pct  Fe.pct
Mean        4.320          5.37   12.00   1.640   0.208   3.160
Std.dev.    0.293          3.83    5.17   1.480   0.214   1.110
Min.        2.510          0.51    2.24   0.155   0.002   0.497
Median      4.350          4.13   11.20   0.970   0.139   3.170
Max.        4.870         26.30   50.20   7.130   1.110   7.710
Missing     0.000          0.00    0.00  60.000  62.000  60.000

 

The following two excellent websites can extend your basic knowledge of using R and RStudio:

  • A great free resource for R beginners is An Introduction to R by Alex Douglas, Deon Roos, Francesca Mancini, Ana Couto & David Lusseau.

  • Getting used to R, RStudio, and R Markdown is an awesome (and free) eBook by Chester Ismay which is super-helpful if you want to start using R Markdown for reproducible coding and reporting.


References

Fox J (2022). RcmdrMisc: R Commander Miscellaneous Functions. R package version 2.7-2, https://CRAN.R-project.org/package=RcmdrMisc.

Fox J, Weisberg S (2019). An R Companion to Applied Regression, Third Edition. Thousand Oaks CA: Sage. https://socialsciences.mcmaster.ca/jfox/Books/Companion/ (car package).

Gohel D, Skintzos P (2022). flextable: Functions for Tabular Reporting. R package version 0.8.1, https://CRAN.R-project.org/package=flextable.

Posit Software (2022). RStudio 2022.12.0+353 "Elsbeth Geranium" Release. https://posit.co/products/open-source/rstudio/.

R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.

 


CC-BY-SA • All content by Ratey-AtUWA. My employer does not necessarily know about or endorse the content of this website.