Data Manipulation Part 1 - Exercises

library(arrow)
library(dplyr)
library(stringr)

nyc_taxi <- open_dataset(here::here("data/nyc-taxi"))

compute() and collect()
Use the function nrow() to work out the answers to these questions:
- How many taxi fares in the dataset had a total amount greater than $100? 
- How many distinct pickup locations (distinct combinations of the pickup_latitude and pickup_longitude columns) are in the dataset since 2016?
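As a hint for these questions, the sketch below illustrates the difference between compute() and collect(). It uses a small in-memory Arrow Table built from mtcars as a stand-in for the taxi dataset (an assumption made so the example runs without the workshop data), and the object names are hypothetical.

```r
library(arrow)
library(dplyr)

# Small in-memory Arrow Table standing in for the taxi dataset
# (an assumption so the example runs without the workshop data).
cars_tbl <- arrow_table(mtcars)

# Build a lazy query: nothing is evaluated at this point.
query <- cars_tbl |>
  filter(mpg > 25) |>
  select(mpg, cyl)

# compute() evaluates the query but keeps the result as an Arrow Table.
computed <- compute(query)

# collect() evaluates the query and returns a tibble in R memory.
collected <- collect(query)

class(computed)
class(collected)

# Both results contain the same rows; only where they live differs.
nrow(computed)
nrow(collected)
```

Because compute() keeps the result in Arrow, it is a reasonable choice when you only need something like a row count and do not want to pull a large intermediate result into R.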
nyc_taxi |>
  filter(total_amount > 100) |>
  nrow()
[1] 1518869

nyc_taxi |>
  filter(year >= 2016) |>
  distinct(pickup_longitude, pickup_latitude) |>
  compute() |>
  nrow()
[1] 29105801

- Use the dplyr::filter() and stringr::str_ends() functions to return a subset of the data which a) is from September 2020, and b) has a vendor_name value ending with the letter “S”.
- Try to use the stringr function str_replace_na() to replace any NA values in the vendor_name column with the string “No vendor” instead. What happens, and why?
- Bonus question: see if you can find a different way of completing the task in question 2.
nyc_taxi |>
  filter(str_ends(vendor_name, "S"), year == 2020,  month == 9) |>
  collect()

nyc_taxi |>
  mutate(vendor_name = stringr::str_replace_na(vendor_name, "No vendor")) |>
  head() |>
  collect()

This won’t work because stringr::str_replace_na() hasn’t been implemented in Arrow, so the query cannot be evaluated on the dataset. You could try using mutate() and ifelse() here instead.
nyc_taxi |>
  mutate(vendor_name = ifelse(is.na(vendor_name), "No vendor", vendor_name)) |>
  head() |>
  collect()

Or, if you only need a subset of the data, you could apply the function after collecting it into R memory.
nyc_taxi |>
  filter(year == 2019, month == 10) |> # smaller subset of the data
  collect() |>
  mutate(vendor_name = stringr::str_replace_na(vendor_name, "No vendor"))
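
Another possible answer to the bonus question is dplyr::coalesce(), which returns the first non-missing value across its arguments. The sketch below assumes that coalesce() is supported in Arrow queries in your installed version of arrow, and uses a hypothetical toy table in place of the taxi data so it can be run on its own.

```r
library(arrow)
library(dplyr)

# Hypothetical toy data standing in for the vendor_name column.
vendors <- arrow_table(
  data.frame(vendor_name = c("CMT", NA, "VTS", NA))
)

# coalesce() keeps vendor_name where it is present and falls back to
# "No vendor" where it is NA; the replacement is evaluated by Arrow.
result <- vendors |>
  mutate(vendor_name = coalesce(vendor_name, "No vendor")) |>
  collect()

result$vendor_name
```

The same mutate() call could be applied to nyc_taxi in place of the ifelse() version, since the replacement happens before anything is collected into R.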