Workflow
There are different approaches to working with R Markdown. Before you start writing your thesis, it is important to decide how much code you want to include in the .Rmd file.
You can code everything in the R Markdown file, but I found that this may not be the best idea. As much as I like to keep things in one place, I would recommend doing the ‘hard’ part of the coding in a normal R file and including only light code in the R Markdown file. The reason is twofold. First, there is not much to be gained by doing your data preparation in R Markdown if you are not explicitly using it in your text; it just extends the rendering time. Second, I think the file gets more confusing the bigger it gets.
I worked with three files in total:
The first file (1) was a normal R file for preparing the data set, where I checked the structure of the data and changed the format of variables. All the screening steps I had to take for the random forest model to work took place here. At the end I saved the final data frame as a feather file, which is described below.
The second file (2), also an R file, was used for model estimation and validation. I imported the feather data frame from (1) and ran the models. Every model and its validation were saved as an RDS file, also described below.
The third and final file (3) was the R Markdown .Rmd file. Here I wrote all my text and imported the files saved in (1) and (2). The first chapter of my thesis was an overview of the data set, so I first imported the prepared file from (1) and produced some summary statistics with kable, described in the chapter Tables. I also made boxplots of important variables, described in the chapter Figures. For my results I imported the files from (2) and then plotted the residuals and other results directly in code chunks.
Figure 1: Flowchart for the three different files
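The three-file split can be sketched in a single script. This is only an illustration under stated assumptions: `mtcars` stands in for the thesis data, `lm()` for the random forest, and `saveRDS()` for `write_feather()` so that nothing beyond base R is needed. The file names are hypothetical.

```r
# Sketch of the three-file split, collapsed into one script for illustration.
# Assumptions: mtcars replaces the thesis data, lm() replaces the random
# forest, and saveRDS() replaces write_feather() to stay within base R.

work_dir <- tempdir()  # stand-in for the shared project folder
data_file <- file.path(work_dir, "prepared_data.rds")
model_file <- file.path(work_dir, "model.rds")

# (1) Data preparation script: check the structure, fix formats, save.
prepared <- mtcars
str(prepared)
prepared$cyl <- factor(prepared$cyl)
saveRDS(prepared, data_file)

# (2) Modelling script: load the prepared data, fit the model, save it.
model_data <- readRDS(data_file)
model <- lm(mpg ~ wt + cyl, data = model_data)
saveRDS(model, model_file)

# (3) Code chunks in the .Rmd file: only load saved output and report on it.
thesis_data <- readRDS(data_file)
thesis_model <- readRDS(model_file)
summary(thesis_data$mpg)
summary(resid(thesis_model))
```

In a real project each numbered part would live in its own file, so that rendering the .Rmd file never re-runs parts (1) and (2).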
Feather
Installing the feather package can save you loads of time when dealing with big data sets. Instead of reading in your data with read.csv or read.xls every time, you just use read_feather, which is much faster. Once your data has been read into the global environment for the first time, save it with:
library(feather)
write_feather(your_data, "your_data.feather")
Now your data is saved as a feather file and can be quickly read in your next R session or the .Rmd file with:
your_data <- read_feather("your_data.feather")
Feather is really great for dealing with big data frames, but it cannot save other R objects, such as fitted models.
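If you want to check the speed difference on your own machine, a rough comparison might look like the sketch below. The synthetic data frame and file names are made up for illustration, and the feather part only runs if the package is installed. Timings depend heavily on the data and the disk, so no particular result is implied.

```r
# Sketch: time read.csv against read_feather on a synthetic data frame.
# The data and file names are illustrative assumptions.
set.seed(1)
n <- 1e5
demo_data <- data.frame(
  id = seq_len(n),
  x = rnorm(n),
  group = sample(letters, n, replace = TRUE),
  stringsAsFactors = FALSE
)

csv_path <- file.path(tempdir(), "demo_data.csv")
feather_path <- file.path(tempdir(), "demo_data.feather")

# Baseline: write and re-read the data as CSV.
write.csv(demo_data, csv_path, row.names = FALSE)
csv_time <- system.time(
  from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)
)
cat(sprintf("read.csv:     %.3f s\n", csv_time["elapsed"]))

# Same data via feather, skipped if the package is not installed.
if (requireNamespace("feather", quietly = TRUE)) {
  feather::write_feather(demo_data, feather_path)
  feather_time <- system.time(
    from_feather <- feather::read_feather(feather_path)
  )
  cat(sprintf("read_feather: %.3f s\n", feather_time["elapsed"]))
}
```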
RDS
RDS is a way of caching any R object, including data frames. Although feather is the quicker choice for data frames, you need RDS for saving other R objects. I ran pretty big random forest models for my thesis, and instead of re-running them every time, I saved the output:
saveRDS(big_object, "big_object.rds")
As with feather you can just read the output again with:
big_object <- readRDS("big_object.rds")
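The save-once, read-later idea can also be wrapped in a small check, so that the slow step only runs when no saved file exists yet. This is a minimal sketch, not the setup used for the thesis: `lm()` on `mtcars` stands in for the slow random forest, and `fit_big_object()` and the cache location are hypothetical.

```r
# Sketch of a compute-once cache.
# Assumptions: lm() on mtcars replaces the slow random forest, and the
# cache file lives in a temporary folder.
cache_file <- file.path(tempdir(), "big_object.rds")

# Hypothetical stand-in for the expensive model fit.
fit_big_object <- function() {
  lm(mpg ~ wt + hp, data = mtcars)
}

if (file.exists(cache_file)) {
  # A saved result exists: load it instead of refitting.
  big_object <- readRDS(cache_file)
} else {
  # First run: fit the model and save it for later sessions.
  big_object <- fit_big_object()
  saveRDS(big_object, cache_file)
}
```

With this pattern you have to delete the .rds file by hand whenever the data or the model specification changes; otherwise the stale result is loaded.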
Because I did the heavy coding in different files, I only read in the output with either feather or RDS in my R Markdown file.
Note that this only works if the saved files are in the working directory of both your ‘heavy-lifting’ files and your .Rmd file, because the file names above are relative paths. The simplest solution is to point the working directory of all files to one folder.
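One way to keep all files pointed at the same folder is to build every path from a single variable rather than relying on each script's working directory. In the sketch below, `project_dir` and `project_path()` are hypothetical names, and `tempdir()` is used only so the example runs anywhere; in practice you would put your own thesis folder there.

```r
# Sketch: build every path from one shared folder.
# `project_dir` is a hypothetical location chosen for illustration.
project_dir <- file.path(tempdir(), "thesis_project")
dir.create(project_dir, showWarnings = FALSE, recursive = TRUE)

# Helper that turns a file name into a full path inside the project folder.
project_path <- function(file_name) {
  file.path(project_dir, file_name)
}

# In the 'heavy-lifting' script:
saveRDS(mtcars, project_path("your_data.rds"))

# In the .Rmd file:
your_data <- readRDS(project_path("your_data.rds"))
```

Each script then needs the same definition of `project_dir`, but none of them depends on where R happens to be started.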
Experiments
As your R Markdown file gets bigger, the rendering time increases too. For your final formatting changes, it is probably helpful to create a separate .Rmd file in which you can experiment. Especially for things like tables, your cover page or the ‘Einverständniserklärung’ (declaration of consent), you do not want to render the whole document each time. I did all my experiments with \(\LaTeX\) and tables in a separate file.
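A scratch file of this kind can be very small. The sketch below writes a throwaway .Rmd file containing a single table and renders it if rmarkdown and pandoc are available. The file name, the table and the `html_document` output are assumptions made for illustration; switch to `pdf_document` when you want to test \(\LaTeX\)-specific formatting.

```r
# Sketch: create and render a throwaway .Rmd file with only one table.
# File name, table content and output format are illustrative assumptions.
fence <- strrep("`", 3)  # chunk delimiter used inside the .Rmd file

scratch_lines <- c(
  "---",
  "title: \"Scratch file\"",
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r, echo = FALSE}"),
  "knitr::kable(head(mtcars[, 1:4]), caption = \"Test table\")",
  fence
)

scratch_rmd <- file.path(tempdir(), "scratch.Rmd")
writeLines(scratch_lines, scratch_rmd)

# Render only if rmarkdown and pandoc are available on this machine.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(scratch_rmd, quiet = TRUE)
}
```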