library(arrow)
library(dplyr)
library(stringr)

Data Manipulation Part 1 - Exercises
nyc_taxi <- open_dataset(here::here("data/nyc-taxi"))

compute() and collect()
Use the function nrow() to work out the answers to these questions:
How many taxi fares in the dataset had a total amount greater than $100?
How many distinct pickup locations (distinct combinations of the pickup_latitude and pickup_longitude columns) are in the dataset since 2016?
nyc_taxi |>
  filter(total_amount > 100) |>
  nrow()
[1] 1518869
nyc_taxi |>
  filter(year >= 2016) |>
  distinct(pickup_longitude, pickup_latitude) |>
  compute() |>
  nrow()
[1] 29105801
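To make the difference between compute() and collect() concrete, here is a minimal sketch on a small in-memory Arrow table. The `toy` table and its values are hypothetical stand-ins for the taxi dataset, not part of the exercise data. Both functions evaluate the query; the difference is where the result ends up.

```r
library(arrow)
library(dplyr)

# Hypothetical toy data standing in for the taxi dataset
toy <- arrow_table(
  pickup_longitude = c(-73.9, -73.9, -74.0),
  pickup_latitude  = c(40.7, 40.7, 40.8)
)

# An unevaluated query: nothing has been computed yet
query <- toy |>
  distinct(pickup_longitude, pickup_latitude)

in_arrow <- compute(query)  # evaluates the query; the result stays an Arrow Table
in_r     <- collect(query)  # evaluates the query; the result is a tibble in R memory

class(in_arrow)
class(in_r)
nrow(in_arrow)
```

Because compute() keeps the result in Arrow, it is a reasonable choice when you only need a summary such as the row count and do not want to pull every row into R.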
1. Use the dplyr::filter() and stringr::str_ends() functions to return a subset of the data that a) is from September 2020, and b) has a value in vendor_name that ends with the letter “S”.
2. Try to use the stringr function str_replace_na() to replace any NA values in the vendor_name column with the string “No vendor” instead. What happens, and why?
3. Bonus question: see if you can find a different way of completing the task in question 2.
nyc_taxi |>
  filter(str_ends(vendor_name, "S"), year == 2020, month == 9) |>
  collect()

nyc_taxi |>
  mutate(vendor_name = stringr::str_replace_na(vendor_name, "No vendor")) |>
  head() |>
  collect()

This won’t work as stringr::str_replace_na() hasn’t been implemented in Arrow. You could try using mutate() and ifelse() here instead.
nyc_taxi |>
  mutate(vendor_name = ifelse(is.na(vendor_name), "No vendor", vendor_name)) |>
  head() |>
  collect()

Or, if you only needed a subset of the data, you could apply the function after collecting it into R memory.
nyc_taxi |>
  filter(year == 2019, month == 10) |> # smaller subset of the data
  collect() |>
  mutate(vendor_name = stringr::str_replace_na(vendor_name, "No vendor"))
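Another possible answer to the bonus question is dplyr::coalesce(), which returns the first non-missing value from its arguments. The sketch below assumes coalesce() is among the dplyr functions your installed version of arrow can translate, and it uses a hypothetical three-row `toy` table rather than the taxi dataset.

```r
library(arrow)
library(dplyr)

# Hypothetical toy data with one missing vendor name
toy <- arrow_table(vendor_name = c("CMT", NA, "VTS"))

result <- toy |>
  # Replace NA with "No vendor" inside Arrow, before anything is pulled into R
  mutate(vendor_name = coalesce(vendor_name, "No vendor")) |>
  collect()

result$vendor_name
```

As with the ifelse() approach, the replacement runs in Arrow, so it should also apply to the full dataset without collecting it first.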