Data Manipulation Part 1 - Exercises

library(arrow)
library(dplyr)
library(stringr)
nyc_taxi <- open_dataset(here::here("data/nyc-taxi"))
Using compute() and collect()

Use the function nrow() to work out the answers to these questions:

  1. How many taxi fares in the dataset had a total amount greater than $100?

  2. How many distinct pickup locations (distinct combinations of the pickup_latitude and pickup_longitude columns) appear in the dataset for trips from 2016 onwards?

nyc_taxi |>
  filter(total_amount > 100) |>
  nrow()
[1] 1518869
nyc_taxi |>
  filter(year >= 2016) |>
  distinct(pickup_longitude, pickup_latitude) |>
  compute() |>
  nrow()
[1] 29105801
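
To see the difference between compute() and collect() without scanning the full dataset, here is a minimal sketch on a small in-memory Arrow Table. The `toy` table and its values are made up for illustration and are not part of the NYC taxi data; the sketch assumes a version of arrow that provides arrow_table().

```r
library(arrow)
library(dplyr)

# Hypothetical toy data standing in for nyc_taxi
toy <- arrow_table(
  total_amount = c(12.5, 150, 101, 8),
  year = c(2015L, 2016L, 2017L, 2016L)
)

# Building the query does not evaluate anything yet
query <- toy |>
  filter(total_amount > 100)

# compute() evaluates the query and keeps the result as an Arrow Table
as_table <- compute(query)

# collect() evaluates the query and pulls the result into R as a tibble
as_tibble <- collect(query)

class(as_table)
class(as_tibble)
nrow(as_table)
```

Because compute() leaves the result in Arrow memory, it is a reasonable choice when you only need something like a row count and do not want to materialise millions of rows as an R data frame.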
Using the dplyr API in arrow
  1. Use the dplyr::filter() and stringr::str_ends() functions to return the subset of the data where a) the trip is from September 2020, and b) the value in vendor_name ends with the letter “S”.

  2. Try to use the stringr function str_replace_na() to replace any NA values in the vendor_name column with the string “No vendor” instead. What happens, and why?

  3. Bonus question: see if you can find a different way of completing the task in question 2.

nyc_taxi |>
  filter(str_ends(vendor_name, "S"), year == 2020, month == 9) |>
  collect()
nyc_taxi |>
  mutate(vendor_name = stringr::str_replace_na(vendor_name, "No vendor")) |>
  head() |>
  collect()

This won’t work because stringr::str_replace_na() hasn’t been implemented in arrow, so the call cannot be translated into an Arrow query. You could use mutate() with ifelse() instead.

nyc_taxi |>
  mutate(vendor_name = ifelse(is.na(vendor_name), "No vendor", vendor_name)) |>
  head() |>
  collect()

Or, if you only need a subset of the data, you could apply the function after collecting that subset into R memory.

nyc_taxi |>
  filter(year == 2019, month == 10) |> # smaller subset of the data
  collect() |>
  mutate(vendor_name = stringr::str_replace_na(vendor_name, "No vendor"))
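
A further possibility for the bonus question is dplyr::coalesce(), which returns the first non-missing value among its arguments. This is only a sketch: it assumes your installed version of arrow provides a binding for coalesce(), and it uses a made-up `toy` table rather than the real dataset.

```r
library(arrow)
library(dplyr)

# Hypothetical toy data with one missing vendor name
toy <- arrow_table(vendor_name = c("CMT", NA, "VTS"))

# Assumption: arrow can translate coalesce(); NA values are replaced
# by the fallback string before the result is pulled into R
result <- toy |>
  mutate(vendor_name = coalesce(vendor_name, "No vendor")) |>
  collect()

result
```

If this works in your setup, the same mutate() call could be applied to nyc_taxi in place of the ifelse() version, followed by head() and collect() as before.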