Data Manipulation Part 1 - Exercises
library(arrow)
library(dplyr)
library(stringr)
nyc_taxi <- open_dataset(here::here("data/nyc-taxi"))
compute() and collect()
Use the function nrow() to work out the answers to these questions:
1. How many taxi fares in the dataset had a total amount greater than $100?
2. How many distinct pickup locations (distinct combinations of the pickup_latitude and pickup_longitude columns) are in the dataset since 2016?
nyc_taxi |>
  filter(total_amount > 100) |>
  nrow()
[1] 1518869
nyc_taxi |>
  filter(year >= 2016) |>
  distinct(pickup_longitude, pickup_latitude) |>
  compute() |>
  nrow()
[1] 29105801
1. Use the dplyr::filter() and stringr::str_ends() functions to return a subset of the data which is a) from September 2020, and b) where the value in vendor_name ends with the letter “S”.
2. Try to use the stringr function str_replace_na() to replace any NA values in the vendor_name column with the string “No vendor” instead. What happens, and why?
3. Bonus question: see if you can find a different way of completing the task in question 2.
nyc_taxi |>
  filter(str_ends(vendor_name, "S"), year == 2020, month == 9) |>
  collect()
nyc_taxi |>
  mutate(vendor_name = stringr::str_replace_na(vendor_name, "No vendor")) |>
  head() |>
  collect()
This won’t work, as stringr::str_replace_na() hasn’t been implemented in Arrow. You could try using mutate() and ifelse() here instead.
nyc_taxi |>
  mutate(vendor_name = ifelse(is.na(vendor_name), "No vendor", vendor_name)) |>
  head() |>
  collect()
Or, if you only needed a subset of the data, you could apply the function after collecting it into R memory.
nyc_taxi |>
  filter(year == 2019, month == 10) |> # smaller subset of the data
  collect() |>
  mutate(vendor_name = stringr::str_replace_na(vendor_name, "No vendor"))
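Another possible answer to the bonus question, offered as a sketch rather than the intended solution, is dplyr::coalesce(), which returns the first non-missing value and which Arrow can translate. The example below assumes a tiny in-memory table, toy_taxi, with made-up vendor values standing in for the full nyc_taxi dataset so that it runs without the data on disk.

```r
library(arrow)
library(dplyr)

# Hypothetical stand-in for nyc_taxi: three rows, one missing vendor name
toy_taxi <- arrow_table(vendor_name = c("CMT", NA, "VTS"))

result <- toy_taxi |>
  # coalesce() keeps vendor_name where present and falls back to "No vendor"
  mutate(vendor_name = coalesce(vendor_name, "No vendor")) |>
  collect()

result$vendor_name
```

The same mutate() call should apply unchanged to nyc_taxi, since the replacement is evaluated by Arrow before collect() pulls the result into R.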