Big Data in R with Arrow

Poll: Arrow

Have you used or experimented with Arrow before today?

Vote using emojis at on the discord channel!

1️⃣ Not yet

2️⃣ Not yet, but I have read about it!

3️⃣ A little

4️⃣ A lot

Hello Arrow
Demo

Some “Big” Data

https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

NYC Taxi Data

big NYC Taxi data set (~40GBs on disk)

open_dataset("s3://voltrondata-labs-datasets/nyc-taxi") |>
  filter(year %in% 2012:2021) |>
  write_dataset(here::here("data/nyc-taxi"), partitioning = c("year", "month"))

tiny NYC Taxi data set (<1GB on disk)

download.file(url = "https://github.com/posit-conf-2023/arrow/releases/download/v0.1/nyc-taxi-tiny.zip",
              destfile = here::here("data/nyc-taxi-tiny.zip"))

unzip(
  zipfile = here::here("data/nyc-taxi-tiny.zip"),
  exdir = here::here("data/")
)

Posit Cloud ☁️

Join the cloud workspace via URL in the workshop Discord channel
You will need to create a (free) Posit Cloud account

Posit Cloud ☁️

Once you have joined you can come and go

Larger-Than-Memory Data

arrow::open_dataset()

NYC Taxi Dataset

library(arrow)

nyc_taxi <- open_dataset(here::here("data/nyc-taxi"))

NYC Taxi Dataset

nyc_taxi |> 
  nrow()

[1] 1150352666

1.15 billion rows 🤯

NYC Taxi Dataset: A question

What percentage of taxi rides each year had more than 1 passenger?

NYC Taxi Dataset: A dplyr pipeline

library(dplyr)

nyc_taxi |>
  group_by(year) |>
  summarise(
    all_trips = n(),
    shared_trips = sum(passenger_count > 1, na.rm = TRUE)
  ) |>
  mutate(pct_shared = shared_trips / all_trips * 100) |>
  collect()

# A tibble: 10 × 4
    year all_trips shared_trips pct_shared
   <int>     <int>        <int>      <dbl>
 1  2012 178544324     53313752       29.9
 2  2013 173179759     51215013       29.6
 3  2014 165114361     48816505       29.6
 4  2015 146112989     43081091       29.5
 5  2016 131165043     38163870       29.1
 6  2017 113495512     32296166       28.5
 7  2018 102797401     28796633       28.0
 8  2020  24647055      5837960       23.7
 9  2021  30902618      7221844       23.4
10  2019  84393604     23515989       27.9

NYC Taxi Dataset: A dplyr pipeline

library(tictoc)

tic()
nyc_taxi |>
  group_by(year) |>
  summarise(
    all_trips = n(),
    shared_trips = sum(passenger_count > 1, na.rm = TRUE)
  ) |>
  mutate(pct_shared = shared_trips / all_trips * 100) |>
  collect()
toc()

6.077 sec elapsed

Your Turn

Calculate the longest trip distance for every month in 2019
How long did this query take to run?

➡️ Hello Arrow Exercises Page

What is Apache Arrow?

A multi-language toolbox for accelerated data interchange and in-memory processing

Arrow is designed to both improve the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another

https://arrow.apache.org/overview/

Apache Arrow Specification

In-memory columnar format: a standardized, language-agnostic specification for representing structured, table-like data sets in-memory.

A Multi-Language Toolbox

Accelerated Data Interchange

Accelerated In-Memory Processing

Arrow’s Columnar Format is Fast

arrow 📦

Today

Module 1: Larger-than-memory data manipulation with Arrow—Part I
Module 2: Data engineering with Arrow
Module 3: Larger-than-memory data manipulation with Arrow—Part II
Module 4: In-memory workflows in R with Arrow

Hello Arrow

Poll: Arrow

Hello ArrowDemo

Some “Big” Data

NYC Taxi Data

Posit Cloud ☁️

Posit Cloud ☁️

Larger-Than-Memory Data

NYC Taxi Dataset

NYC Taxi Dataset

NYC Taxi Dataset: A question

NYC Taxi Dataset: A dplyr pipeline

NYC Taxi Dataset: A dplyr pipeline

Your Turn

What is Apache Arrow?

Apache Arrow Specification

A Multi-Language Toolbox

Accelerated Data Interchange

Accelerated In-Memory Processing

arrow 📦

arrow 📦

Today

Hello Arrow
Demo