# Using `.parquet` files in Shiny
## Requirements
The current version of our Shiny application performs additional data processing to generate part summaries that are used by reactive data frames. The custom function is called `gen_part_metaset()` and is located in the `R/fct_data_processing.R` script. For the purposes of this exercise, we are not going to try to optimize this specific function (you are certainly welcome to try after the workshop); instead, we are going to see if we can use the results of the function more efficiently inside our Shiny application.
## Plan
Upon closer inspection, we see that the calls to `gen_part_metaset()` do not take any dynamic parameters when used in the application. In addition, the function is called multiple times inside a set of reactive expressions. A first attempt at removing the bottleneck would be to move this function call to the beginning of the `app_server.R` logic and feed the resulting object directly into the reactives that consume it, as sketched below.
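A rough sketch of that first attempt follows; aside from `gen_part_metaset()` and its `min_set_parts` argument, the reactive, column, and input names are made up purely for illustration:

```r
# app_server.R (sketch): run the expensive processing once, up front,
# and let every reactive consume the same object instead of re-running it
app_server <- function(input, output, session) {
  part_meta_df <- gen_part_metaset(min_set_parts = 1)

  # hypothetical reactive that previously called gen_part_metaset() itself
  filtered_parts <- reactive({
    dplyr::filter(part_meta_df, part_category == input$category)
  })
}
```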
Knowing that the processing function does not rely on any dynamic parameters, we can do even better. In our mission to ensure the Shiny application performs only the data processing that it absolutely needs to do, we can instead run this function outside of the application and save the result of the processing as a `.parquet` file inside the `inst/extdata` directory using the `{arrow}` package:
```r
library(dplyr)
library(arrow)

# run the (expensive) processing once, outside of the Shiny app
part_meta_df <- gen_part_metaset(min_set_parts = 1)

# save the result inside the package's inst/extdata directory
write_parquet(
  part_meta_df,
  "inst/extdata/part_meta_df.parquet"
)
```
With the processed data set available in the app infrastructure, we can utilize it inside the application with the following:

```r
part_meta_df <- arrow::read_parquet(
  app_sys("extdata", "part_meta_df.parquet"),
  as_data_frame = FALSE
)
```
Why do we set the parameter `as_data_frame` to `FALSE` in the call above? This ensures the contents of the data file are not read into R’s memory right away; instead, we can build a tidyverse-like processing pipeline on the Arrow object and only `collect()` the results at the end of the pipeline, minimizing overhead.
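As a sketch of what such a pipeline could look like inside a reactive, assuming hypothetical column names (`part_category`, `quantity`) and a hypothetical input (`input$min_quantity`) purely for illustration:

```r
# hypothetical reactive: filter and summarize the Arrow object lazily,
# then pull only the final, small result into R's memory with collect()
part_summary <- reactive({
  part_meta_df |>
    dplyr::filter(quantity >= input$min_quantity) |>
    dplyr::group_by(part_category) |>
    dplyr::summarize(total_quantity = sum(quantity, na.rm = TRUE)) |>
    dplyr::collect()
})
```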
We could add this call to the top of our `app_server.R` logic, which would already lead to decreased processing time. For an application that is used very infrequently, that might be good enough. But if we have an application that is going to be used concurrently by multiple users, we may be able to increase performance further by ensuring this data file is read into R once for each process that serves the application, instead of once for each user’s Shiny session. More to come later in the workshop on how we can accomplish this with `{golem}`!
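For context on the process-versus-session distinction, here is a sketch using a plain single-file `app.R` layout (not the `{golem}` structure we will use later): code at the top level of the script runs once per R process, while code inside the server function runs once per user session.

```r
# app.R (sketch, non-golem layout)
library(shiny)
library(arrow)

# runs once per R process; every session served by this process
# shares the same part_meta_df object
part_meta_df <- read_parquet(
  "part_meta_df.parquet",  # illustrative path next to app.R
  as_data_frame = FALSE
)

ui <- fluidPage()

server <- function(input, output, session) {
  # runs once per user session; reactives here can all
  # reference the shared part_meta_df object
}

shinyApp(ui, server)
```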