This guide gets you started with reading data into R (R Core Team, 2022) from a file, including checking that the data have been read in correctly. We will always be using R in the RStudio environment (Posit Software, 2022).
If you need or would like a more basic introduction to R, you could first read our Guide to R and RStudio for absolute beginners.
We use the read.csv() function – we will mostly supply data to you as .csv files. Sometimes you need to download these into your working directory to use them, and sometimes you can read them directly from a web URL (e.g. https://github.com/.../afs19.csv).
With the type of dataset we usually use, there are columns containing categorical information, which R calls factors. These are typically stored as text or character information, i.e. character strings, or just strings. R stores factors in a particular way so that the categories are recognised, so we need to use the stringsAsFactors = TRUE argument in the read.csv() function to convert text columns into factors as the file is read.
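As a minimal sketch of what the stringsAsFactors argument changes, the code below reads a tiny CSV from a text string (using the text= argument of read.csv()) instead of from a file. The column names and values are invented for illustration and are not taken from the hubbard data.

```r
# A tiny made-up CSV, supplied as text instead of a file (illustration only)
csv_text <- "Site,Side,pH
1,North,4.29
2,South,4.66
3,North,4.23"

# Without conversion, text columns stay as character strings
as_text <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(as_text$Side)

# With stringsAsFactors = TRUE, text columns become factors
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(as_factor$Side)
levels(as_factor$Side)   # the categories R has recognised
```

The same argument works in the same way when the first argument is a file name or URL.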
The result of the read.csv() function is a data frame object stored in the R environment. We need a data frame, since it is a data structure in R which allows having columns of different classes (e.g. integer, numeric, factor, date, logical, etc.) in the same object (each column contains just one class of data, though).
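To make the idea of one class per column concrete, here is a small sketch that builds a data frame by hand and reports the class of each column. The object name soil and all its values are hypothetical.

```r
# A small made-up data frame with one class per column (illustration only)
soil <- data.frame(
  plot     = 1:3,                                                   # integer
  ph       = c(4.29, 4.66, 4.23),                                   # numeric
  side     = factor(c("North", "South", "North")),                  # factor
  sampled  = as.Date(c("2022-03-01", "2022-03-02", "2022-03-04")),  # Date
  analysed = c(TRUE, FALSE, TRUE)                                   # logical
)

# Report the class of each column
sapply(soil, class)
```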
Objects we create in an R session, like data frames, are only stored while our R session is active, and disappear when we close R. Fortunately we can save our whole environment by clicking the 💾 icon on the Environment tab, or by running the save.image() function. Either of these saving methods will create a file with extension .RData which we can then load in later R sessions using the load() function, or by clicking the 📁 icon in the Environment tab in RStudio.
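The save-and-reload round trip can be sketched as follows. To keep the example tidy it uses save(), which writes only the named object, rather than save.image(), which writes every object in the workspace to the same kind of file. The object soil and the temporary file are assumptions made for the demonstration.

```r
# Made-up object to save (illustration only)
soil <- data.frame(plot = 1:3, ph = c(4.29, 4.66, 4.23))

# Save to a temporary .RData file; save.image(file = ...) would instead
# save every object in the workspace in the same format
rdata_file <- tempfile(fileext = ".RData")
save(soil, file = rdata_file)

# Load into a separate environment so we can compare with the original
restored <- new.env()
load(rdata_file, envir = restored)
identical(restored$soil, soil)   # TRUE if the object survived the round trip
```

In everyday use we would simply run load() with the file name, which restores the saved objects into the current environment under their original names.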
git <- "https://github.com/Ratey-AtUWA/Learn-R-web/raw/main/"
hubbard <- read.csv(file = paste0(git,"hubbard.csv"), stringsAsFactors = TRUE)
# ... and do a quick check
is.data.frame(hubbard) # check that it worked; if so, the result should be TRUE
## [1] TRUE
These data are from the Hubbard Brook Experimental Forest near Woodstock in New Hampshire, USA (Figure 1).
We can use the summary() function to make a quick check of our data, to make sure the file has been read correctly (this may not happen if the file is improperly formatted, etc.).
##     PLOT        Rel.To.Brook    Transect     UTM_EASTING      UTM_NORTHING          PH
## Min.   :  1.0   North:135      E278000:44   Min.   :276000   Min.   :4866300   Min.   :2.510
## 1st Qu.:126.8   South:125      E277000:43   1st Qu.:277000   1st Qu.:4867800   1st Qu.:4.168
## Median :254.5                  E280000:37   Median :279000   Median :4869000   Median :4.350
## Mean   :249.2                  E276000:35   Mean   :278996   Mean   :4868769   Mean   :4.322
## 3rd Qu.:392.2                  E279000:34   3rd Qu.:281000   3rd Qu.:4869700   3rd Qu.:4.503
## Max.   :460.0                  E282000:25   Max.   :283000   Max.   :4871100   Max.   :4.870
##                                (Other):42
##  MOISTURE.pct        OM.pct             Cd
## Min.   : 0.510   Min.   : 2.240   Min.   :0.0980
## 1st Qu.: 3.068   1st Qu.: 9.188   1st Qu.:0.3070
## Median : 4.131   Median :11.162   Median :0.6350
## Mean   : 5.367   Mean   :12.009   Mean   :0.7426
## 3rd Qu.: 6.324   3rd Qu.:13.582   3rd Qu.:1.1310
## Max.   :26.262   Max.   :50.177   Max.   :1.8890
##                                   NA's   :167
The summary() function creates a little table for each column. Note that these little tables do not all look the same. Integer or numeric columns get a numeric summary with minimum, mean, etc., and sometimes the number of missing (NA) values. Categorical (Factor) columns show the number of samples (rows) in each category (unless there are too many categories). These summaries are useful to check if there are zero or negative values in columns, how many missing observations there might be, and if the data have been read correctly into R.
Note: You would usually check the whole data frame, without restricting rows or columns, by running summary(hubbard). We could also:
summarise only the first 10 rows with summary(hubbard[1:10,]) (we also call rows 'observations', which often represent separate field samples);
summarise a range of rows and columns with summary(hubbard[1:20,6:10]), which would summarise only the first 20 rows of columns 6 to 10.
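The row and column restrictions described above can be tried on a small made-up data frame. In this sketch the name soil and its values are hypothetical, and the call to colSums(is.na()) is an extra check that is not part of the original text.

```r
# Made-up data with a factor column and a missing value (illustration only)
soil <- data.frame(
  plot = 1:6,
  side = factor(c("North", "North", "South", "South", "North", "South")),
  ph   = c(4.29, 4.66, NA, 4.15, 4.49, 4.79)
)

summary(soil)            # whole data frame: counts for side, NA's for ph
summary(soil[1:3, ])     # first 3 rows, all columns
summary(soil[1:4, 2:3])  # first 4 rows of columns 2 to 3

# A programmatic check that complements summary()
colSums(is.na(soil))     # number of missing values in each column
```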
Usually we would not restrict the output as done below with [,1:15]. We only do it here so we're not bored with pages of similar-looking output. You should look at the structure of the whole data frame using str(hubbard) (substituting the name of your own data object for hubbard). The whole hubbard data frame has 62 variables (i.e. columns), not 15.
## 'data.frame': 260 obs. of 15 variables:
## $ PLOT : int 1 2 3 5 6 7 8 9 10 11 ...
## $ Rel.To.Brook: Factor w/ 2 levels "North","South": 1 1 1 1 1 1 1 1 1 1 ...
## $ Transect : Factor w/ 8 levels "E276000","E277000",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ UTM_EASTING : int 280000 280000 280000 280000 280000 280000 280000 280000 280000 280000 ...
## $ UTM_NORTHING: int 4868400 4868500 4868700 4869000 4869100 4869300 4869400 4869600 4869700 4869900 ...
## $ PH : num 4.29 4.66 4.23 4.15 4.49 4.79 4.11 4.52 4.51 4.43 ...
## $ MOISTURE.pct: num 4.74 7.47 5.55 3.77 4.82 ...
## $ OM.pct : num 12.2 10.46 14.88 9.14 12.01 ...
## $ Cd : num 0.498 0.207 1.359 0.913 0.099 ...
## $ Cu : num 22.1 27.1 18.6 7 17 ...
## $ Ni : num 12.25 20.5 14.43 8.72 10.62 ...
## $ Cr : num 21.1 28.2 19.6 21.3 22.8 ...
## $ Co : num 14.4 17 13.1 11.1 11.4 ...
## $ Zn : num 47.6 63.1 47.5 18.2 31.9 ...
## $ Mn : num 490 453 578 261 335 ...
We can see that some columns are integer (int) values (e.g. PLOT, UTM_EASTING), some columns contain Factor values, i.e. in fixed categories (e.g. Rel.To.Brook, Transect), and some columns are numeric (num) (e.g. PH, OM.pct, Ni). Applying the str() function to a data object is always a good idea, to check that the data have been read correctly into R. [NOTE that other variable types are possible, such as character (chr), date (Date or POSIXct), logical, etc.]
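If str() shows a categorical column as chr rather than Factor, one possible fix is to convert that column after reading. The sketch below assumes a small invented dataset; as.factor() is a base R function, but using it here is our suggestion rather than a step from the original guide.

```r
# Made-up data read without converting strings to factors (illustration only)
soil <- read.csv(text = "plot,side,ph
1,North,4.29
2,South,4.66
3,North,4.23", stringsAsFactors = FALSE)

str(soil)                           # side is shown as chr
soil$side <- as.factor(soil$side)   # convert one column to a factor
str(soil)                           # side is now a Factor with 2 levels
```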
"Data is like garbage. You'd better know what you are going to do with it before you collect it."
— Mark Twain
The following section describes how to make graphs and plots in base-R and in the ggplot2 package. Go to this page if you want base-R only, without the ggplot2 material.
plot()
We can use either plot(x, y, ...) OR plot(y ~ x, ...). In R the ~ symbol means 'as a function of', so ~ indicates a formula.
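A short sketch of the two calling styles, using invented vectors named ph and al, shows that they draw the same points. Note the reversed order of the variables: x comes first in plot(x, y), but the formula puts y first.

```r
# Made-up values (illustration only)
ph <- c(4.29, 4.66, 4.23, 4.15, 4.49)
al <- c(1.9, 0.8, 2.2, 2.6, 1.3)

plot(ph, al)    # x first, then y
plot(al ~ ph)   # formula: y as a function of x; same points, same axes

class(al ~ ph)  # the object made with ~ has class "formula"
```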
In R we need to tell the program which 'object' our variables are in. We've just made a Data Frame (a type of data object) called hubbard.
The following 3 styles of code do exactly the same thing:
with() syntax (we recommend this one!), which can just be written on a single line
attach() and detach() (not recommended)
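The three styles can be sketched side by side on a small made-up data frame. The first style shown here, naming the data frame with $, is our assumption about the style that is not named in the text above; the with() and attach()/detach() styles are the ones the text mentions.

```r
# Made-up data frame (illustration only)
soil <- data.frame(ph = c(4.29, 4.66, 4.23, 4.15),
                   al = c(1.9, 0.8, 2.2, 2.6))

# Style 1 (assumed): name the data frame for every variable with $
plot(soil$al ~ soil$ph)

# Style 2: with() looks up the variables inside the data frame
with(soil, plot(al ~ ph))

# Style 3: attach() adds the columns to the search path until detach()
attach(soil)
plot(al ~ ph)
detach(soil)
```

One reason attach() is usually discouraged is that it is easy to forget detach(), after which column names can silently clash with other objects of the same name.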
Without changing any of the (numerous) options or parameters in the plot() function, the base-R plot is not very attractive (e.g. axis titles!).
We can also change the overall plot appearance by using the function par() before plotting; par() sets graphics parameters. Let's try some variations:
ggplot (Wickham, 2016)
Plots made using ggplot2 such as Figure 5 set par() options automatically, and come with a default table style, or theme.
In par():
mar= sets margins in 'lines' units: c(bottom,left,top,right), e.g. c(4,4,3,1)
mgp= sets distance of text from axes: c(titles, tickLabels, ticks)
font.lab= sets font style for axis titles: 2=bold, 3=italic, etc.
In the plot() function itself, xlab= and ylab= set axis titles
par(mar=c(4,4,1,1), mgp=c(2,0.6,0), font.lab=2)
# We'll also include some better axis title text using xlab, ylab
with(hubbard,
plot(EXCH_Al ~ PH,
xlab="Soil pH",
ylab="Exchangeable Al (centimoles/kg)")
)
This is starting to look a lot better!
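Settings made with par() stay in force for later plots on the same graphics device. One common pattern, sketched here with made-up data, is to store the value that par() returns (the previous settings) and hand it back to par() when we are finished. The name old_par is arbitrary.

```r
# par() returns the previous values of the settings it changes
old_par <- par(mar = c(4, 4, 1, 1), mgp = c(2, 0.6, 0), font.lab = 2)

plot(1:10, (1:10)^2, xlab = "x", ylab = "x squared")  # made-up data

# Restore the earlier settings so later plots are unaffected
par(old_par)
```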
We can still add more information to the graph; for example, by making use of the factors (categories) in the dataset. We also need to learn these graphics parameters:
col = plotting colour(s) - it's easiest to use words like "red", "darkblue" and so on; see this R colour chart, or just run the R function colors() for a list of all 657 names!
pch = plot character(s) - numbers from 0 to 25 (run help(points) or see this page from YaRrr).
par(mar=c(4,4,1,1), mgp=c(2,0.6,0), font.lab=2)
with(hubbard,
plot(EXCH_Al ~ PH, xlab="Soil pH",
ylab="Exchangeable Al (centimoles/kg)",
pch=c(1,16)[Rel.To.Brook],
col=c("blue","darkred")[Rel.To.Brook])
)
The parameter pch=c(1,16)[Rel.To.Brook] separates the points by the information in the Factor column Rel.To.Brook, shown inside [ ]. This column is a 2-level factor, so each value is in one of two categories (North or South), and so we need a vector with two numbers in it (pch=c(1,16)). The code for specifying colour is very similar, except our vector has 2 colour names in it.
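The indexing trick works because R uses a factor's underlying level numbers when the factor appears inside [ ]. A minimal sketch with a made-up factor called side:

```r
# A made-up two-level factor (illustration only)
side <- factor(c("North", "South", "South", "North"))

levels(side)                 # levels are sorted: "North" is 1, "South" is 2
as.integer(side)             # the level number of each value
c(1, 16)[side]               # one plot symbol per observation
c("blue", "darkred")[side]   # one colour per observation
```

The vector being indexed needs at least as many elements as the factor has levels, otherwise some points would get NA instead of a symbol or colour.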
There is still one thing missing: a graph legend. We can add one using the legend() function. We will use the following options:
"topleft" position of legend -- run help(legend) for options, or we can use x-y coordinates
legend = a vector of names identifying the plot symbols - we have used the categories in the factor 'Rel.To.Brook', levels(hubbard$Rel.To.Brook), but we could have used legend=c("North","South") instead
pch = plot symbols - should be exactly the same vector as in the plot function
col = plot colours - should be exactly the same vector as in the plot function
title = a title for the legend - optional
par(mar=c(4,4,1,1), mgp=c(2,0.6,0), font.lab=2)
with(hubbard,
plot(EXCH_Al ~ PH, xlab="Soil pH",
ylab="Exchangeable Al (centimoles/kg)",
pch=c(1,16)[Rel.To.Brook],
col=c("blue","darkred")[Rel.To.Brook])
)
legend("topleft", legend=levels(hubbard$Rel.To.Brook), pch=c(1,16),
col=c("blue","darkred"), title="Location")
ggplot
library(ggplot2)
ggplot(data=hubbard,
aes(x= PH, y = EXCH_Al, color=Rel.To.Brook, shape=Rel.To.Brook)) +
geom_point(size=2.5) +
labs(y="Exchangeable Al (cmolc/kg)", x = "Soil pH") +
theme_bw()
In the ggplot2 example in Figure 9 we have used the aes() option to specify different colours and symbols depending on the value of a factor (Rel.To.Brook). A legend is created automatically. We have also used a different theme option (theme_bw()), which is arguably more appropriate than the default which has a grey plot background.
scatterplot() (from the car package)
The R package car (Companion to Applied Regression) has many useful additional functions that extend the capability of R. The next two examples produce nearly the same plot as in the previous examples, using the scatterplot() function in the car package.
# load required package(s)
library(car)
# par() used to set overall plot appearance using options within
# par(), e.g.
# mar sets plot margins, mgp sets distance of axis title and
# tick labels from axis
par(font.lab=2, mar=c(4,4,1,1), mgp=c(2.2,0.7,0.0))
# draw scatterplot with customised options
# remember pch sets plot character (symbol);
# we will also use the parameter cex which sets symbol sizes, and
# cex.lab and cex.axis which set the sizes of axis titles and tick labels
scatterplot(EXCH_Al ~ PH, data=hubbard, smooth=FALSE,
legend = c(coords="topleft"),
cex=1.5, cex.lab=1.5, cex.axis=1.2)
Note that we get some additional graph features by default, such as box plots for each variable in the plot margins and a linear regression line.
We can turn all of these features off if we want - run help(scatterplot) in the RStudio Console, and look under Arguments and Details.
Also, we separately specify the dataset to be used as a function argument, i.e., data=hubbard.
Scatterplot (car) with groups, Hubbard Brook soil data
# 'require()' loads package(s) if they haven't already been loaded
require(car)
# adjust overall plot appearance using options within par()
# mar sets plot margins, mgp sets distance of axis title and tick
# labels from axis
par(font.lab=2, mar=c(4,4,1,1), mgp=c(2.2,0.7,0.0))
# create custom palette with nice colours :)
# this lets us specify colours using numbers - try it!
palette(c("black","red3","blue3","darkgreen","sienna"))
# draw scatterplot with points grouped by a factor (Rel.To.Brook)
scatterplot(EXCH_Al ~ PH | Rel.To.Brook, data=hubbard, smooth=FALSE,
legend = c(coords="topleft"), col=c(5,3,1),
pch=c(16,0,2), cex=1.2, cex.lab=1.3, cex.axis=1.0)
The scatterplot() function creates a legend automatically if we plot by factor groupings (note the different way that the legend position is specified within the scatterplot() function). This is pretty similar to the base R plot above (we can also customise the axis titles in scatterplot(), using xlab= and ylab= as before).
We'll give you some starting code chunks, and the output from them. You can then use the help in RStudio to try to customise the plots according to the suggestions below each plot!
Histograms are an essential staple of statistical plots, because we always need to know something about the distribution of our variable(s). Histograms are a good visual way to assess the shape of a variable's distribution, whether it's symmetrical and bell-shaped (normal), left- or right-skewed, bimodal (two 'peaks') or even multimodal. As with any check of a distribution, the 'shape' will be clearer if we have more observations.
ggplot
ggplot(data = hubbard, aes(x=MOISTURE.pct)) +
geom_histogram(fill="lightgray", color="black") +
theme_bw()
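For comparison, a base-R histogram can be sketched with the hist() function. The data here are randomly generated stand-ins, not the hubbard moisture values, and breaks = 20 is only a suggestion that R may adjust to give tidy bin edges.

```r
# Made-up right-skewed values standing in for a moisture column
set.seed(1)                      # makes the random values repeatable
moisture <- rlnorm(200, meanlog = 1.5, sdlog = 0.5)

# Draw the histogram and keep the object that hist() returns
h <- hist(moisture, breaks = 20, col = "lightgray",
          xlab = "Moisture (%)", main = "")

sum(h$counts)   # every observation falls in exactly one bin
```

In the ggplot2 version, the corresponding control is the bins= or binwidth= argument of geom_histogram().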
For histograms, try the following:
Box plots also give us some information about a variable's distribution, but one of their great strengths is in comparing values of a variable between different groups in our data.
Tukey box plots implemented by R have 5 key values; from least to greatest these are:
Any points less than the lower whisker, or greater than the upper whisker, are plotted separately and represent potential outliers.
IQR is the interquartile range, the difference between the lower and upper quartiles (25th and 75th percentiles).
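The five key values, and any points plotted separately, can be inspected with boxplot.stats(), the base R function that boxplot() relies on for its calculations. The vector x below is made up for illustration. R computes the box edges as 'hinges', which can differ slightly from the quartiles returned by quantile().

```r
# Made-up values with one unusually large observation (illustration only)
x <- c(2.1, 3.4, 3.9, 4.2, 4.6, 5.0, 5.3, 6.1, 7.4, 26.3)

bs <- boxplot.stats(x)   # default whisker limit is 1.5 times the box length
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # points beyond the whiskers, which are plotted separately

boxplot(x, horizontal = TRUE)
```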
For box plots, try the following:
Strip charts, or one-dimensional scatter plots, can be a useful companion (or alternative) to box plots, especially when we don't have many observations of a variable.
ggplot
ggplot(data = hubbard, aes(x=MOISTURE.pct, y="")) +
geom_jitter(position=position_jitter(0.1)) +
theme_bw()
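A base-R strip chart can be sketched with the stripchart() function. The values and the grouping factor below are invented; the formula moisture ~ side draws one strip for each group.

```r
# Made-up values for two groups (illustration only)
moisture <- c(3.1, 4.7, 5.5, 3.8, 4.8, 7.5, 6.3, 2.9, 4.1, 5.2)
side <- factor(rep(c("North", "South"), each = 5))

# One strip per group; jitter spreads out points that would overlap
stripchart(moisture ~ side, method = "jitter", jitter = 0.1,
           pch = 1, xlab = "Moisture (%)")
```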
For stripcharts, try the following:
"With data collection, 'the sooner the better' is always the best answer."
There are a few ways to make useful tables in R to summarise your data. Here are a couple of examples.
tapply() function in base R
We can use the tapply() function to make very simple tables:
# use the cat() [conCATenate] function to make a Table heading
# (\n is a line break)
cat("One-way table of means\n")
with(hubbard, tapply(X = EXCH_Ni, INDEX=Transect,
FUN=mean, na.rm=TRUE))
cat("\nTwo-way table of means\n")
with(hubbard, tapply(X = EXCH_Ni,
INDEX=list(Transect,Rel.To.Brook),
FUN=mean, na.rm=TRUE))
## One-way table of means
## E276000 E277000 E278000 E279000 E280000 E281000 E282000 E283000
## 0.002588815 0.002136418 0.002827809 0.002813563 0.002528296 0.002262386 0.002221885 0.002957922
##
## Two-way table of means
## North South
## E276000 0.002652212 0.002559759
## E277000 0.002154745 0.002117219
## E278000 0.003133112 0.002360874
## E279000 0.002733461 0.002927993
## E280000 0.002176569 0.002861511
## E281000 0.002537732 0.001962009
## E282000 0.002219457 0.002224313
## E283000 0.003542597 0.002039146
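To see what tapply() is doing, it can help to run it on values small enough to check by hand. This sketch uses invented vectors named ni and side.

```r
# Made-up values with one missing observation (illustration only)
ni   <- c(1, 2, 3, 4, NA, 6)
side <- factor(c("North", "North", "South", "South", "North", "South"))

# Split ni by side, then apply mean() to each group; na.rm = TRUE is
# passed on to mean() so that the NA is ignored
means <- tapply(X = ni, INDEX = side, FUN = mean, na.rm = TRUE)
means

# Swapping FUN gives a different summary, e.g. the maximum in each group
tapply(X = ni, INDEX = side, FUN = max, na.rm = TRUE)
```

The North mean is calculated from 1 and 2 only, because the third North value is missing.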
For tapply() tables, try the following:
we have used the mean function (FUN=mean) – try another function to get minima, maxima, standard deviations, etc.
numSummary() function in the 'RcmdrMisc' R package
require(RcmdrMisc)
# use the cat() [conCATenate] function to make a Table heading
# (\n is a line break)
cat("Summary statistics for EXCH_Ni\n")
numSummary(hubbard$EXCH_Ni)
cat("\nSummary statistics for EXCH_Ni grouped by Rel.To.Brook\n")
numSummary(hubbard$EXCH_Ni, groups=hubbard$Rel.To.Brook)
## Summary statistics for EXCH_Ni
## mean sd IQR 0% 25% 50% 75% 100% n NA
## 0.002536502 0.001382344 0.001126141 0.00070457 0.001829135 0.002246775 0.002955276 0.01784169 257 3
##
## Summary statistics for EXCH_Ni grouped by Rel.To.Brook
## mean sd IQR 0% 25% 50% 75% 100% data:n
## North 0.002635924 0.001605190 0.001114533 0.001120626 0.001933342 0.002277016 0.003047875 0.017841689 132
## South 0.002431513 0.001096042 0.001170948 0.000704570 0.001719421 0.002174767 0.002890369 0.008483806 125
## data:NA
## North 3
## South 0
For numSummary() tables, try the following:
print() on a data frame
Data frames are themselves tables, and if they already contain the type of summary we need, we can just use the print() function to get output. Let's do something [slightly] fancy (see if you can figure out what is going on here¹):
output <-
numSummary(hubbard[,c("PH","MOISTURE.pct","OM.pct","Al.pct","Ca.pct","Fe.pct")],
statistics = c("mean","sd","quantiles"), quantiles=c(0,0.5,1))
mytable <- t(cbind(output$table,output$NAs))
row.names(mytable) <- c("Mean","Std.dev.","Min.","Median","Max.","Missing")
# here's where we get the output
print(mytable, digits=3)
# note: writing to "clipboard" in this way works on Windows only
write.table(mytable,"clipboard",sep="\t")
cat("\nThe table has now been copied to the clipboard, so you can paste it into Excel!\n")
## PH MOISTURE.pct OM.pct Al.pct Ca.pct Fe.pct
## Mean 4.322 5.37 12.01 1.641 0.208 3.163
## Std.dev. 0.293 3.83 5.17 1.478 0.214 1.112
## Min. 2.510 0.51 2.24 0.155 0.002 0.497
## Median 4.350 4.13 11.16 0.970 0.139 3.168
## Max. 4.870 26.26 50.18 7.132 1.107 7.709
## Missing 0.000 0.00 0.00 60.000 62.000 60.000
##
## The table has now been copied to the clipboard, so you can paste it into Excel!
¹Hints: we have made two objects, one from the output of numSummary(); t() is the transpose function; there are also some other useful functions which might be new to you like: row.names(), cbind(), print(), write.table() . . .
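The functions named in the hints can be tried on a tiny made-up table. Everything in this sketch (the object names and the numbers) is hypothetical; writing to a temporary file is shown as an alternative to the clipboard that does not depend on the operating system.

```r
# Two made-up pieces to combine (illustration only)
stats_tab <- matrix(c(4.3, 5.4, 0.3, 3.8), nrow = 2,
                    dimnames = list(c("PH", "MOISTURE.pct"), c("mean", "sd")))
n_missing <- c(0, 2)

combined <- cbind(stats_tab, n_missing)  # add a column; variables are rows
flipped  <- t(combined)                  # transpose; variables are columns
row.names(flipped) <- c("Mean", "Std.dev.", "Missing")
print(flipped, digits = 3)

# Write a tab-separated file, which can then be opened in a spreadsheet
out_file <- tempfile(fileext = ".tsv")
write.table(flipped, out_file, sep = "\t")
```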
If you want to take this further, we can start making really nice tables for reports with various R packages. I use the flextable package (Gohel & Skintzos, 2022) and sometimes the kable() function from the knitr package. Here's an example (Table 1) using flextable and the table object mytable made above:
library(flextable)
flextable(data.frame(Statistic=row.names(mytable),signif(mytable,3))) |>
bold(bold=TRUE, part="header") |>
set_caption(caption="Table 1: A table created by the `flextable` R package. Many more table formatting, text formatting, and number formatting options are available in this package.")
Statistic | PH | MOISTURE.pct | OM.pct | Al.pct | Ca.pct | Fe.pct |
---|---|---|---|---|---|---|
Mean | 4.320 | 5.37 | 12.00 | 1.640 | 0.208 | 3.160 |
Std.dev. | 0.293 | 3.83 | 5.17 | 1.480 | 0.214 | 1.110 |
Min. | 2.510 | 0.51 | 2.24 | 0.155 | 0.002 | 0.497 |
Median | 4.350 | 4.13 | 11.20 | 0.970 | 0.139 | 3.170 |
Max. | 4.870 | 26.30 | 50.20 | 7.130 | 1.110 | 7.710 |
Missing | 0.000 | 0.00 | 0.00 | 60.000 | 62.000 | 60.000 |
The following two excellent websites can extend your basic knowledge of using R and RStudio:
A great free resource for R beginners is An Introduction to R by Alex Douglas, Deon Roos, Francesca Mancini, Ana Couto & David Lusseau.
Getting used to R, RStudio, and R Markdown is an awesome (and free) eBook by Chester Ismay which is super-helpful if you want to start using R Markdown for reproducible coding and reporting.
Fox J (2022). RcmdrMisc: R Commander Miscellaneous Functions. R package version 2.7-2, https://CRAN.R-project.org/package=RcmdrMisc.
Fox J, Weisberg S (2019). An {R} Companion to Applied Regression, Third Edition. Thousand Oaks CA: Sage. https://socialsciences.mcmaster.ca/jfox/Books/Companion/ (car package).
Gohel D, Skintzos P (2022). flextable: Functions for Tabular Reporting. R package version 0.8.1, https://CRAN.R-project.org/package=flextable.
Posit Software (2022). RStudio 2022.12.0+353 "Elsbeth Geranium" Release. https://posit.co/products/open-source/rstudio/.
R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
CC-BY-SA • All content by Ratey-AtUWA. My employer does not necessarily know about or endorse the content of this website.