Using .parquet files in Shiny

Requirements

The current version of our Shiny application performs additional data processing to generate part summaries that are utilized by reactive data frames. The custom function, gen_part_metaset(), is located in the R/fct_data_processing.R script. For the purposes of this exercise, we will not try to optimize this specific function (you are certainly welcome to try after the workshop); instead, we will see if we can use the results of the function more efficiently inside our Shiny application.

Plan

Upon closer inspection, we see that the calls to gen_part_metaset() do not take any dynamic parameters when used in the application. In addition, the function is called multiple times inside a set of reactive expressions. A first attempt at removing this bottleneck would be to move the function call to the beginning of the app_server.R logic and feed the resulting object directly into the reactives that consume it, as sketched below.
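A minimal sketch of that first attempt might look like the following; the reactive name filtered_parts and the part_category column used for filtering are hypothetical stand-ins for the application's actual logic:

app_server <- function(input, output, session) {

  # run the expensive processing once per session, rather than
  # repeating it inside each reactive expression that needs the result
  part_meta_df <- gen_part_metaset(min_set_parts = 1)

  # hypothetical downstream reactive consuming the precomputed object
  filtered_parts <- reactive({
    dplyr::filter(part_meta_df, part_category == input$category)
  })
}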

Knowing that the processing function does not rely on any dynamic parameters, we can do even better. To ensure the Shiny application performs only the data processing it absolutely needs to do, we can instead run this function outside of the application and save the result as a .parquet file inside the inst/extdata directory using the {arrow} package.

library(dplyr)
library(arrow)

# gen_part_metaset() is defined in R/fct_data_processing.R
part_meta_df <- gen_part_metaset(min_set_parts = 1)

write_parquet(
  part_meta_df,
  "inst/extdata/part_meta_df.parquet"
)

With the processed data set available in the app infrastructure, we can utilize it inside the application with the following:

part_meta_df <- arrow::read_parquet(
  app_sys("extdata", "part_meta_df.parquet"),
  as_data_frame = FALSE
)

Why do we set the parameter as_data_frame to FALSE in the call above? This ensures the contents of the data file are not read into R’s memory right away; instead, we can build a tidyverse-like data processing pipeline on this file and collect() the results at the end of the pipeline to minimize overhead.
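As an illustration, a sketch of such a pipeline might look like the following; the column names year and theme_name are hypothetical:

library(dplyr)

# part_meta_df is an Arrow Table, so these verbs build an Arrow query
# plan rather than operating on an in-memory data frame
part_summary <- part_meta_df |>
  filter(year >= 2000) |>
  group_by(theme_name) |>
  summarize(n_parts = n()) |>
  collect()   # only here are the results materialized in R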

We could add this call to the top of our app_server.R logic, which would already lead to decreased processing time. For an application that is being used very infrequently, that might be good enough. But if we have an application that is going to be used concurrently by multiple users, we may be able to increase performance by ensuring this data file is read in R once for each process launched that servers the application, instead of once for each R session corresponding to different user’s Shiny sessions. More to come later in the workshop on how we can accomplish this with {golem}!