Hello Arrow

Poll: Arrow

Have you used or experimented with Arrow before today?

Vote using emojis on the Discord channel!

1️⃣ Not yet

2️⃣ Not yet, but I have read about it!

3️⃣ A little

4️⃣ A lot

Some “Big” Data


NYC Taxi Data

  • big NYC Taxi data set (~40GBs on disk)
open_dataset("s3://voltrondata-labs-datasets/nyc-taxi") |>
  filter(year %in% 2012:2021) |>
  write_dataset(here::here("data/nyc-taxi"), partitioning = c("year", "month"))
  • tiny NYC Taxi data set (<1GB on disk)
download.file(url = "https://github.com/posit-conf-2023/arrow/releases/download/v0.1/nyc-taxi-tiny.zip",
              destfile = here::here("data/nyc-taxi-tiny.zip"))

# Extract the partitioned files from the zip archive
unzip(
  zipfile = here::here("data/nyc-taxi-tiny.zip"),
  exdir = here::here("data/")
)

Posit Cloud ☁️

  • Join the cloud workspace via URL in the workshop Discord channel
  • You will need to create a (free) Posit Cloud account

  • Once you have joined you can come and go

Larger-Than-Memory Data


NYC Taxi Dataset


library(arrow)

nyc_taxi <- open_dataset(here::here("data/nyc-taxi"))

nyc_taxi |> 
  nrow()
[1] 1150352666

1.15 billion rows 🤯

NYC Taxi Dataset: A question

What percentage of taxi rides each year had more than 1 passenger?

NYC Taxi Dataset: A dplyr pipeline


library(dplyr)

nyc_taxi |>
  group_by(year) |>
  summarise(
    all_trips = n(),
    shared_trips = sum(passenger_count > 1, na.rm = TRUE)
  ) |>
  mutate(pct_shared = shared_trips / all_trips * 100) |>
  collect()
# A tibble: 10 × 4
    year all_trips shared_trips pct_shared
   <int>     <int>        <int>      <dbl>
 1  2012 178544324     53313752       29.9
 2  2013 173179759     51215013       29.6
 3  2014 165114361     48816505       29.6
 4  2015 146112989     43081091       29.5
 5  2016 131165043     38163870       29.1
 6  2017 113495512     32296166       28.5
 7  2018 102797401     28796633       28.0
 8  2020  24647055      5837960       23.7
 9  2021  30902618      7221844       23.4
10  2019  84393604     23515989       27.9

library(tictoc)
tic()
nyc_taxi |>
  group_by(year) |>
  summarise(
    all_trips = n(),
    shared_trips = sum(passenger_count > 1, na.rm = TRUE)
  ) |>
  mutate(pct_shared = shared_trips / all_trips * 100) |>
  collect()
toc()

6.077 sec elapsed
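
Part of why a query like this stays manageable is that arrow's dplyr backend is lazy: the verbs build up a query, and nothing is computed until the result is pulled into R with collect(). A minimal sketch of that behaviour, using a small made-up in-memory table (toy_taxi and its values are hypothetical stand-ins, not the real taxi data):

```r
library(arrow)
library(dplyr)

# Hypothetical toy data standing in for the taxi dataset
toy_taxi <- arrow_table(
  year = c(2019L, 2019L, 2020L, 2020L, 2020L),
  passenger_count = c(1L, 2L, 1L, 3L, NA)
)

# Building the pipeline only describes the work; nothing is computed yet
query <- toy_taxi |>
  group_by(year) |>
  summarise(
    all_trips = n(),
    shared_trips = sum(passenger_count > 1, na.rm = TRUE)
  ) |>
  mutate(pct_shared = shared_trips / all_trips * 100)

class(query)  # a query object, not a data frame

# collect() runs the query and returns the result as a tibble
result <- query |>
  arrange(year) |>
  collect()

result
```

Only the small summarised result is brought into R, which is what lets the same pattern work on data that does not fit in memory.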

Your Turn

  1. Calculate the longest trip distance for every month in 2019

  2. How long did this query take to run?

➡️ Hello Arrow Exercises Page

What is Apache Arrow?

A multi-language toolbox for accelerated data interchange and in-memory processing

Arrow is designed to improve both the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another.


Apache Arrow Specification

In-memory columnar format: a standardized, language-agnostic specification for representing structured, table-like data sets in-memory.
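
To make the columnar idea concrete, here is a small sketch using the arrow R package. The table and its column names (vendor, distance) are made up for illustration:

```r
library(arrow)

# Hypothetical example table; each column is stored as its own Arrow array
tbl <- arrow_table(
  vendor = c("A", "B", "A"),
  distance = c(1.2, 3.4, 0.8)
)

tbl$schema          # column names and their Arrow data types
tbl$distance        # a single column, held as a ChunkedArray
as.data.frame(tbl)  # convert to a regular R data frame
```

Each column can be accessed on its own because the data is laid out column by column rather than row by row.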

A Multi-Language Toolbox

Accelerated Data Interchange
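
One way to see the interchange idea from R is to write a data frame to a file in Arrow's file format (Feather V2) and read it back. This is a sketch only: the data frame and the temporary path are hypothetical, and a file written this way could also be opened by Arrow libraries in other languages.

```r
library(arrow)

# Hypothetical example data
df <- data.frame(
  vendor = c("A", "B", "A"),
  distance = c(1.2, 3.4, 0.8)
)

# Write to a temporary file in the Arrow (Feather) format
path <- tempfile(fileext = ".arrow")
write_feather(df, path)

# Read the file back into R
back <- read_feather(path)
back
```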

Accelerated In-Memory Processing

Arrow’s Columnar Format is Fast

arrow 📦



  • Module 1: Larger-than-memory data manipulation with Arrow—Part I
  • Module 2: Data engineering with Arrow
  • Module 3: Larger-than-memory data manipulation with Arrow—Part II
  • Module 4: In-memory workflows in R with Arrow