vignettes/subset_data.Rmd
The purpose of this vignette is to highlight the subset_data() function, an extremely useful utility function in respR.
When working with a large respirometry dataset, sometimes you don’t want to use all of it within another function. There can be several reasons for this. Maybe it is an intermittent-flow dataset and you want to extract a single replicate, maybe you want to remove extraneous data at the start or end of a dataset, or maybe you just want to plot or inspect a small portion of a dataset for a quick look.
subset_data lets you pass subsets of data to other functions by ranges of time, row, or oxygen. This works not only on data frames, but also on respR inspect() and inspect.ft() objects. Originally, if you wanted to subset these, you would have to subset the original data and then inspect it again, potentially creating a new object. Using subset_data removes these steps.
It is especially useful when used with piping, either the new native |> pipes introduced in R v4.1 or %>% dplyr pipes, for passing portions of a dataset without having to create a new object.
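As a minimal sketch of why piping avoids intermediate objects: the native pipe passes its left-hand side as the first argument of the call on its right. The helper window_by_time() and the data frame dat below are hypothetical stand-ins written in base R (R 4.1 or later), not respR code. With respR loaded, subset_data() would take the helper's place.

```r
# Hypothetical stand-in for a time-range subsetting function (not respR code).
window_by_time <- function(df, from, to) {
  df[df$time >= from & df$time <= to, , drop = FALSE]
}

# Made-up per-second oxygen trace, for illustration only.
dat <- data.frame(time = 0:99, oxygen = seq(8, 7, length.out = 100))

# Without a pipe: an intermediate object is created, then used.
tmp <- window_by_time(dat, from = 10, to = 19)
mean_nested <- mean(tmp$oxygen)

# With the native pipe: the subset flows straight into the next step,
# and no extra object is left in the environment.
mean_piped <- dat |>
  window_by_time(from = 10, to = 19) |>
  with(mean(oxygen))
```

Both routes compute the same value. The piped version simply never names the subset.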
R makes it fairly easy to subset data objects, but this function has several advantages, in that it lets you subset the data easily using the units you are already thinking about, reduces the proliferation of extra objects in the local environment that you need to keep track of, and can be used in loops to subset regions of longer datasets in a consistent way.
The squid.rd dataset is extremely long.
inspect(squid.rd)
#> inspect: Applying column default of 'time = 1'
#> inspect: Applying column default of 'oxygen = 2'
#> inspect: No issues detected while inspecting data frame.
Let’s say we are only interested in the initial stages.
x <- subset_data(squid.rd, from = 1, to = 10000, by = "time")
inspect(x)
Now we can pass this object to further functions.
We can do the same thing using rows. Because this dataset is recorded once per second, the row numbers almost match the time values (time starts at 0, so row 1 is time 0).
x <- subset_data(squid.rd, from = 1, to = 10000, by = "row")
inspect(x)
Let’s say we are only interested in routine metabolic rates in normal conditions. In this long experiment oxygen declines to very low levels, so let’s only select the higher values.
x <- subset_data(squid.rd, from = 8, to = 6.5, by = "oxygen")
inspect(x)
print(x)
#> Time Oxygen
#> <int> <num>
#> 1: 0 7.7264
#> 2: 1 7.7264
#> 3: 2 7.7264
#> 4: 3 7.7264
#> 5: 4 7.7264
#> ---
#> 3941: 3940 6.5035
#> 3942: 3941 6.4954
#> 3943: 3942 6.4954
#> 3944: 3943 6.5035
#> 3945: 3944 6.5035
Note that even though we use a value well above anything in the dataset (8.0), the function simply subsets from the start of the data. Similarly, there is no exact match to the value of 6.5, so the function finds the closest value, in this case 6.5035.
This is perhaps where subset_data comes into its own. We want to subset a single replicate from an intermittent-flow experiment to calculate a rate.
x <- inspect(intermittent.rd)
We have already inspected the dataset and saved the result as x. Now we just subset out one replicate, pipe it to auto_rate and the subsequent functions, and output our final rate, all without having to create a new object.
x |>
  subset_data(from = 2100, to = 3500, by = "time") |>
  auto_rate() |>
  adjust_rate(by = -0.00071) |>
  convert_rate(oxy.unit = "mg/L",
               time.unit = "s",
               output.unit = "mg/h/g",
               volume = 2.379,
               mass = 0.006955) |>
  summary(pos = 1:3)
#>
#> # summary.convert_rate # ----------------
#> Summary of converted rates from entered 'pos' rank(s):
#>
#> rep rank intercept_b0 slope_b1 rsq density row endrow time endtime oxy endoxy rate adjustment rate.adjusted rate.input oxy.unit time.unit volume mass area S t P rate.abs rate.m.spec rate.a.spec output.unit rate.output
#> 1: NA 1 8.52 -0.000608 0.993 8645 162 1262 2261 3361 7.14 6.52 -0.000608 -0.00071 0.0001022 0.0001022 mg/L sec 2.38 0.00696 NA NA NA NA 0.876 0.126 NA mgO2/hr/g 0.126
#> 2: NA 2 8.47 -0.000586 0.972 8411 72 670 2171 2769 7.19 6.85 -0.000586 -0.00071 0.0001235 0.0001235 mg/L sec 2.38 0.00696 NA NA NA NA 1.058 0.152 NA mgO2/hr/g 0.152
#> 3: NA 3 8.55 -0.000619 0.989 7819 162 1033 2261 3132 7.14 6.63 -0.000619 -0.00071 0.0000912 0.0000912 mg/L sec 2.38 0.00696 NA NA NA NA 0.781 0.112 NA mgO2/hr/g 0.112
#> -----------------------------------------
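The introduction mentions that this kind of subsetting can be repeated in a loop to process several regions of a long dataset consistently. The following is only a base R sketch of that idea under stated assumptions: window_by_time() is a hypothetical stand-in for subset_data(), and the data and replicate windows are made up rather than taken from intermittent.rd. A plain linear regression slope stands in for the rate functions used above.

```r
# Hypothetical stand-in for a time-range subsetting function (not respR code).
window_by_time <- function(df, from, to) {
  df[df$time >= from & df$time <= to, , drop = FALSE]
}

# Made-up trace: oxygen falls linearly by 0.001 units per second.
dat <- data.frame(time = 0:2999, oxygen = 8 - 0.001 * (0:2999))

# Made-up replicate windows (start and end times in seconds).
windows <- data.frame(from = c(0, 1000, 2000),
                      to   = c(899, 1899, 2899))

# Apply the same subsetting and slope calculation to every window.
slopes <- numeric(nrow(windows))
for (i in seq_len(nrow(windows))) {
  rep_i <- window_by_time(dat, windows$from[i], windows$to[i])
  slopes[i] <- coef(lm(oxygen ~ time, data = rep_i))[["time"]]
}
```

Keeping the windows in a small table means every replicate is subset by the same rule, and changing a start or end time requires editing only one place.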
We can also use subset_data and pipes just to have a quick look at portions of big datasets without having to save anything.
zeb_intermittent.rd |>
  subset_data(from = 10000, to = 13000, by = "time") |>
  inspect()
subset_data will also work on multi-column data frames (in the case of by = "oxygen" it bases the subsetting on the first column of oxygen only, but all other columns are subset to the same rows).
urchins.rd |>
  subset_data(from = 10, to = 20, by = "time") |>
  print(nrows = 10)
#> time.min a b c d e f g h i j k l m n o p b1 b2
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 10.0 7.58 7.45 7.51 7.36 7.65 7.43 7.51 7.46 7.80 7.46 7.50 7.54 7.61 7.43 6.79 6.89 7.90 7.67
#> 2: 10.2 7.54 7.42 7.48 7.36 7.66 7.47 7.50 7.45 7.80 7.49 7.48 7.59 7.61 7.40 6.81 6.89 7.88 7.65
#> 3: 10.3 7.52 7.46 7.51 7.34 7.65 7.46 7.52 7.41 7.79 7.49 7.46 7.54 7.60 7.36 6.80 6.89 7.88 7.66
#> 4: 10.5 7.53 7.43 7.50 7.38 7.61 7.45 7.47 7.42 7.75 7.48 7.44 7.54 7.61 7.40 6.81 6.85 7.87 7.66
#> 5: 10.7 7.52 7.43 7.50 7.34 7.61 7.42 7.46 7.44 7.74 7.46 7.46 7.49 7.58 7.39 6.78 6.88 7.87 7.67
#> ---
#> 57: 19.3 7.25 7.24 7.40 7.12 7.48 7.29 7.32 7.32 7.60 7.30 7.32 7.37 7.43 7.22 6.70 6.78 7.92 7.64
#> 58: 19.5 7.27 7.23 7.42 7.14 7.45 7.30 7.31 7.33 7.60 7.26 7.30 7.37 7.40 7.20 6.73 6.76 7.90 7.64
#> 59: 19.7 7.27 7.24 7.41 7.13 7.45 7.30 7.28 7.29 7.61 7.24 7.30 7.37 7.38 7.20 6.71 6.76 7.89 7.65
#> 60: 19.8 7.27 7.23 7.40 7.16 7.43 7.27 7.36 7.32 7.63 7.25 7.31 7.37 7.36 7.18 6.69 6.76 7.89 7.66
#> 61: 20.0 7.31 7.20 7.41 7.11 7.45 7.24 7.33 7.37 7.63 7.26 7.30 7.36 7.41 7.19 6.70 6.75 7.89 7.65