Devops for Data Scientists

Part 1: Introductions, setup, & workshop overview

Meet your instructors

Rika
Andrie
David
Gagan
Michael

Solutions Engineer at Posit
Former Data Scientist and Data Engineer

Director of Posit Strategy, former Solutions Engineer

David Aja is a Solutions Engineer at Posit. Before joining Posit, he worked as a data scientist in the public sector.

Gagandeep Singh is a former software engineer and data scientist who has worked in a variety of cross-technology teams before joining Posit as a Solutions Engineer.

Scientist by training turned IT / HPC / Scientific Computing

Solutions Engineering at Posit

Posit’s Solutions Engineering team aims to shrink the distance between the needs of Posit’s customers and our Pro and Open Source offerings, leading with curiosity and technical excellence.
Our customer-facing work helps our customers deploy, install, configure, and use our Pro products.

Special Thanks to Alex Gold!

Author of DevOps for Data Science

Posit Solutions Engineering Team Director

Introduce yourself to your neighbor

Take 5 minutes and introduce yourself to your neighbors.

Word Cloud

💻 If your WIFI isn’t working let us know as soon as possible!

Name: Posit Conf 2023

Password: conf2023

Pre-workshop Install

Required:
https://registry.hub.docker.com/signup
https://github.com/signup

Optional: (for local development)
https://docs.docker.com/desktop/
https://git-scm.com/book/en/v2/Getting-Started-Installing-Git
https://quarto.org/docs/get-started/
https://posit.co/downloads/

Workshop Install

Software	Link	Credentials
Posit Workbench & Connect	http://pos.it/class	Classroom ID: `devops_workshop`
Docker Classroom	https://labs.play-with-docker.com/	Docker Hub credentials

Documentation & Communication

All documents including slides are available on our website: https://posit-conf-2023.github.io/DevOps/

Discord channel - devops-for-data-scientists

During labs and exercises,please place a post-it note on the back of your laptop for:

🟥 - I need help

🟩 - I’m done

Daily Schedule

📅 September 17 and 18, 2023

⏰ 09:00am - 5:00pm

Time	Topic
10:30 - 11:00am	Break
12:30 - 1:30pm	Lunch
3:00 - 3:30pm	Break

Code of Conduct

Everyone who comes to learn and enjoy the experience should feel welcome at posit::conf. Posit is committed to providing a professional, friendly and safe environment for all participants at its events, regardless of gender, sexual orientation, disability, race, ethnicity, religion, national origin or other protected class.
This code of conduct outlines the expectations for all participants, including attendees, sponsors, speakers, vendors, media, exhibitors, and volunteers. Posit will actively enforce this code of conduct throughout posit::conf.

https://posit.co/code-of-conduct/

Workshop Goals

To understand how DevOps can help you in your work as data scientists
To understand the main principles and tools of DevOps
To get hands-on experience putting code into production using common DevOps workflows
To leave the workshop with some “assets” and skills you can use in your work

Agenda & Lab Overview

Section/Time	Topics	Labs
Part 1	Workshop overview Logistics & setup Introductions	Infrastructure & wifi setup
Part 2: DevOps Principles & Tools	Introduction to DevOps Version control & github CI/CD Reproducing workflows and environments	Lab 1: Deploy your own Quarto website on Github Pages & Posit Connect using GitHub Actions \|
Part 3: Docker for Data Scientists	How and why data scientists use docker in production Docker: overview and architecture Building docker images and containers Ports & networking	Lab #2: Write your own Dockerfile to deploy Open Source Shiny Server on Docker playground and host an app on the server
Part 4: Data Science in Production	Choosing your deployment method APIs and when to use them Just enough auth Logging & metrics & testing	Lab #3: Host and secure an API on Posit Connect
Part 5: Discussion	Course feedback Questions for the team	Workshop Survey

Most important Linux commands

Command	Description	Examples
cd	change directory	https://linuxize.com/post/linux-cd-command/
ls	list all files in current working directory; add -lha flag for hidden files	https://linuxize.com/post/how-to-list-files-in-linux-using-the-ls-command/
pwd	print working directory	https://linuxize.com/post/current-working-directory/
touch	create a file	https://linuxize.com/post/linux-touch-command/
mkdir	create a directory	https://linuxize.com/post/how-to-create-directories-in-linux-with-the-mkdir-command/
vim	open a file in Vim text editor	https://linuxize.com/post/how-to-save-file-in-vim-quit-editor/
curl	command-line utility for transferring data from or to a server. Uses one of the supported protocols including HTTP, HTTPS, SCP , SFTP , and FTP	https://linuxize.com/post/curl-command-examples/
echo	print argument to standard output	https://linuxize.com/post/echo-command-in-linux-with-examples/
$PATH	environmental variable that includes colon-delimited list of directories where the shell searches for executable files	https://linuxize.com/post/how-to-add-directory-to-path-in-linux/

What we won’t cover

How to become a DevOps engineer
Python-based workflows
In-depth security and auth practices

Part 2: DevOps Principles & Tools

Section Goals

To understand the main principles of DevOps and how they can improve data science workflows.
To get familiar with the DevOps toolkit.
To get familiar with how the renv package helps create reproducible environments for your R projects.
To get comfortable using the terminal for interacting with git and github.
To understand how to authenticate to github using SSH or HTTPS.
To practice creating a CI/CD workflow using yaml and Github Actions.

“DevOps is a set of cultural norms, practices, and supporting tooling to help make the process of developing and deploying software smoother and lower risk.”

But first…a very brief history lesson

So you want to create an app?

Let’s talk about what it looks like to create an application. On one side you had developers creating some sort of application - let’s say a stock trading application, so your end user are stock traders

Devs would work on creating the code for this app, get requirements, test it and then package up all the source code in some way so that its executable and that it could be deployed somehow to those end users you’d configure the server, install any tools and frameworks needed to run the code, and then launch.

Now your stock traders are using their application - maybe they’re giving feedback somehow or errors are being logged. Another team is monitoring the environment and seeing if servers are able to handle the load of all the users.

But eventually you’d need to fix bugs or release new features or maybe change things so that the app could handle more end users -

your dev team make updates, version those changes, test them, re-launch the app and you do this over and over again in some sort or release process – so you have this continuous cycle of releasing your app, finding something to fix or add, testing that change, integrating that code, and then releasing the app..

Making this continuous process as fast, efficient, accurate, and low risk is the goal of devops.

sounds pretty efficient right -

Problems DevOps tries to solve

So this is the idealized example - but what was it like in practice 20 years ago. You want to make the release process as fast as possible but you also want it to be tested and free of bugs. And this introduces a natural conflict between speed and stability

You have the same goals of speed and accuracy - but this process of continuous integration, of testing, of automating processes so that your users can get their hands on the app - didnt exist.

One problem is that the entire process is siloed between the creation of the code and the deployment of your app. So your developers finish their code and then throw it over the fence to ops. Maybe they then throw it to security or to QA. But there was no formal alliance in place or processes on how the teams work together. Because each team is seemingly working on one part - the code vs. the deploy - there’s also technical knowledge siloes - each team only knows their bit but not anything else.

So imagine this stock trading app - your developers are creating cool new features so that you can easily buy and sell stock, maybe even see data on whats happening in the market. But they’re not necessarily thinking about how the app is secured or if its compliant with federal regulations or if user data is protected. Obvi this is a worst case scenario - but you get the picture.

So as the app is built it keeps getting thrown back and forth between the two teams and leads to one of the major problems that devops is trying to fix - really slow and manual bureaucratic release process.

On the technical side, a lot of the actual work is done manually. So you can imagine one team running tests manually in one environment - maybe the environment itself is created manually, or someone manually sizing up or fixing a server. This takes time, you need to get approval from people, and its not easily reproduced or really documented anywhere. Also, there is a lot of room for error and if things break its not easy to roll back bugs.

Problem becomes how do you automate this release process, make it streamlined, less error prone, improve how all these teams integrate and operate, and at the end of the day make the process of getting the app into the hands of your users much faster.

Common DevOps Principles

🤲 Collaboration

🚚 Continuous Integration & Delivery

🤖 Automation

📔 Reproducibility

👨‍🎤 Culture change

As the field of devops developed, there are different implementations/diff technologies of what devops means - these happen differently at different companies But there have been some common philosophies and best practices that have come out that I’ve seen consistently across the entire ecosystem.

First - everyone that’s involved in the creation of this application should be collaborating and working together. This is my wink wink moment for you all of why devops is so important for data scientists

Then we have this concept of Continuous Integration & Delivery - we’ll take about this in greater length in a few minutes - but this is the process of 1. the coding of the app 2. the versioning of that code is some kind of repository 3. testing those changes 4. integrating them into the app if those tests are passed 5. And then that continuous process of releasing those changes into production

So for the CI/CD process to work and specifically to work quickly - we will need some level of automation as we move through that cyclical process. You don’t want to have to pause things to get approval or manually change things when tests pass - you want it to get to the user as quickly as possible after everything has been tested.

The next piece is reproducibility. You have different teams working on different pieces of the puzzle, but at the end of the day the end result is the same. Everyone is working on that same app - in this case our stock platform. So as jobs go from one team to another team and as your app goes from development to testing to finally your users, you want to make sure that there’s a common ecoystem that everyone is working in. When you’re testing your app you want to make sure that your environment is as similar to what is going to be in real-life or what is called “production”. You also want to make sure that your code itself is reproducible and that it works for users regardless of their location or their computer or their browser maybe.

The last piece that I find really interesting is the concept of culture change. So what needs to change in your organization so that you’re able to function while still holding this tension or conflict between speed and stability. How do you make sure that teams that have different incentives - for example security team vs. a data team - are able to work together.

So let me pause - that’s a lot - everyone got a bit of a history lesson but as far as i’m aware no one here is a software developer or an IT engineer. So that brings me to this very important question: But

Why should we as data scientists care about this?

Has this ever happened to you?

You come back to code from a year ago and now it doesnt run!

Reproducibility

You need to hand off your model to the Engineering team but they only code in Java

Continuous Integration & Delivery & Collaboration

Your boss asks you to share that Shiny app with a client but the ops team is too busy working on their roadmap to help you deploy it somewhere.

Culture Change

We should care because …

Data scientists are developers!
Data science careers have moved from academic sphere to tech and software but education hasnt always followed
Automation, collaboration, testing can dramatically improve data work and improve reproducibility
The many hats of a data scientist
Improve collaboration & communication with other teams

Responsibility of the analyst

I mentioned before that a lot of DS comes from academia and research. So the concept of reproducing your results - especially if those results play a role in how medicines or medical devices are developed - is incredibly important.

2005 essay by John Ioannidis in scientific journal on what is known as the replicability crisis in scientific publishing, arguing that majority medical research studies cannot be replicated.

So as data scientists and we are curators of the data and are responsible for its legitimacy in many ways - obv cant always control it of course.

This second paper started looking a method of reproducing results - namely Jupyter notebooks - and found a lot of ways that reproducibility was improved but also a lot of places where things could have been done better and some best practices for how to do that - in what we call literate programming.

27,271 Notebooks:

11,454 could not declare dependencies
5,429 could not successfully install declared dependencies
9,185 returned an error when ran
324 returned a different result than originally reported

Let’s get back to our principles:

The Devops Process Flow

Proliferation of tools

An opinionated take on tools

CI/CD
Environment Management
Version Control & Git workflows
Package Management
Automation & Build tools
Continuous Monitoring and Logging

The CI/CD Pipeline

As we discussed before CI/CD is an iterative cycle of small steps to quickly build, test, and deploy your code - this is a critical component of devops.
So CI/CD is often said in the same breath - as it makes up a continuous pipeline - but its actually pretty discrete parts so I want to make sure we understand how they differ.

-Continuous integration (CI) - this is where you or anyone who writes code, automatically builds, tests, and commits code changes into a shared repository; This is usually triggered by an automatic process where the code is built, tested and then either passes or fails. This step is focused on improving the actual build or application as quickly as possible.

Different types of tests - from unit tests to integration tests to regression tests.

-Continuous delivery (CD) and deployment are less focused on the build but on the actual installation and distribution of your build. This includes an automated process to deploy across different environments - from development to testing and finally to production. Delivery is the process for the final release - that “push of a button step” to get to prod.

we’ll talk about some examples of popular CI/CD tools in a bit - Github Actions, Jenkins, CircleCI, Gitlab CI

Environment Management

Development	Testing	Production
Exploratory Often local machine Access to R/Python Packages	as similar to prod as possible unit & integration testing data validation “sandbox” with data that’s as close to real as possible	automatic CD isolated from dev & test created with code

Version Control & Workflows

Lab Activity - developing locally using Quarto

🟥 - I need help

🟩 - I’m done

Login to pos.it/class with code devops_workshop

Lab 1 Part 1

Is version control secure?

A short auth teaser

We can use a variety of data sharing “transfer protocols”
Protocols specify what kind of traffic is moving between 2 machines
Use different security mechanisms
Ports on the host and destination specifies where to direct the traffic

HTTPS

Git protocol options

http - 80

https - 443

ssh - 22

we’ll be using https - because … from jenny bryans happy git with R

I find HTTPS easier to get working quickly and strongly recommend it when you first start working with Git/GitHub.

HTTPS is what GitHub recommends, presumably for exactly the same reasons.

The “ease of use” argument in favor of HTTPS is especially true for Windows users.

Another advantage of HTTPS is that the PAT we’ll set up for that can also be used with GitHub’s REST API.

hypertext transfer protocol When a web user wants to load or interact with a web page, their web browser sends an HTTP request to the origin server that hosts the website’s files. These requests are essentially lines of text that are sent via the internet.

https : http encrypted with SSL/TLS - digital certificates that establish an encrypted connected

SSH: secure shell, public-key cryptography to authenticate the client, used for remote logins, command line execution

HTTPS is simpler. For most services besides Github, you just have to enter in your username and password, and you’ll be able to push and pull code.

You don’t have to juggle multiple SSH keys around to use multiple devices.

Port 443, which HTTPS uses, is open in basically any firewall that can access the internet. That isn’t always the case for SSH.

The primary downside for most people is that you must enter your Git password/token every time you push. While it gets added to a cache, it’s not configured to cache permanently (though this can be changed). With SSH keys, it just uses the key file on disk every time.

Where SSH takes the lead is with the authentication factor—the key. The length of it alone makes it harder to accidentally leak, and due to it being unwieldy and unique, it’s generally more secure.

Reproducing your environment

What are the layers that need to be reproduced across your dev, test, and prod environments?

What’s your most difficult reproducibility challenge?

Layers of reproducibility

Packages vs. Libraries vs. Repositories

Package - contains code, functions, data, and documentation.

Library - is a directory where packages are installed.

Repository - a collection of packages. CRAN is a public external repository that is a network of servers that distribute R along with R packages.

`renv` workflow

Example - renv

# install.packages("renv")
renv::init()
renv::snapshot()
lapply(.libPaths(), list.files)

Lab Activity - `renv` workflow & github authentication

🟥 - I need help

🟩 - I’m done

Lab 1: Part 2 & 3

Build tools

Virtualization/Containerization
Infrastructure as code
Automation with Github Actions/Jenkins/CircleCI
Cloud devops tools (Azure Devops, Google Cloud Build)

automate common build and deployment tasks
for example, generating artifacts, deploying to diff environs, provisioning or deprovisioning environments -
some overlap here between these tools and CI/CD

Virtual/Container -prod-identical - help you build isolated environments and tested and debug code; docker, kubernetes, other container services

IaC - quickly re-provision your entire environment with code (chef, puppet, dockr, terraform, ansible) instead of fixing it, just tear it down and restart it - configuration management

tools that automate the process of building and deploying your code once it goes from your dev env to test and prod

cloud devops - run your entire devops stack in a cloud managed cloud managed platform

You also have support tools - that are used across a lot of these tools - I want to discuss one of them here - YAML

Power of YAML

YAML Ain’t Markup Language
communication of data between people and computers
human friendly
configures files across many execution environments

YAML

EmpRecord:  
  emp01:
    name: Michael
    job: Manager
    skills: 
      - Improv
      - Public speaking
      - People management
  emp02:
    name: Dwight
    job: Assistant to the Manager
    skills: 
      - Martial Arts
      - Beets
      - Sales

whitespace indentation denotes structure & hierarchy
Colons separate keys and their values
Dashes are used to denote a list

Example - YAML in action

Docker compose yaml file to spin up all 3 Posit Pro products

Exercise - Inspect your YAML

🟥 - I need help

🟩 - I’m done

Inspect your _quarto.yml file and identify what each part of it does using the quarto site.

Continuous Integration Example

Other GHA Categories

Deployment
Security
Continuous Integration
Automation
Pages
Make your own!

Examples

Quarto Publish Flows

Manually

Continuous Integration

GHA Structure

Open source ecosystem of available actions

Actions Syntax

Code

Lab Activity - CICD with Github Actions

🟥 - I need help

🟩 - I’m done

Lab 1: Part 4

Logging

app.py

import logging

# Configure the log object
logging.basicConfig(
    format='%(asctime)s - %(message)s',
    level=logging.INFO
)

# Log app start
logging.info("App Started")

app.R

# Configure the log object
log <- log4r::logger()

# Log app start
log4r::info(log, "App Started")

Monitoring

Prometheus

OS monitoring toolkit
official Prometheus client in Python
{openmetrics} package to register metrics from a Plumber API or Shiny app.

Grafana

GUI for Prometheus
Get Started with Grafana and Prometheus

Part 3: Docker for Data Scientists

Section Goals

to understand different docker workflows
to learn common docker syntax for building and running containers
to get hands on experience building and running containers
to explore the linux file system

Why use a container at all?

allows you to package up everything you need to reproduce an environment/application
lightweight system without much overhead
share containers with colleagues without requiring them to have to set up their own local machines
quick testing and debugging
allows you to easily version snapshots of your work
scaling up

Our Docker Environment

Workflow

Architecture

On the left we have the client - think of this as your computer or perhaps a computer somewhere in the cloud

on the right you have the server - this the docker host which runs something called the Docker daemon - you’ll be communicating with the Docker daemon via a command line interface.

The host can be remote or it can be on your computer with the client - so for example, you can download something called Docker Engine. In most cases the host is running on a Linux OS. (in different laptop OS, Docker Desktop uses a lightweight VM under the hood)

Docker engine is a container runtime that runs on different OS’s. Sets up the isolated environment for your container.

what the heck is the Docker daemon - its a service that does all the heavy lifting of building and running your containers

registry - upload and download images

Docker Desktop includes the Docker daemon (dockerd), the Docker client (docker), Docker Compose - another client, Docker Content Trust, Kubernetes, and Credential Helper.

Images

Dockerhub Registry

https://hub.docker.com/search?q=

Containers

How can data scientists use docker?

Example - testing & versioning

docker pull postgres:12
docker pull postgres:latest
docker image ls -a
docker run -d -e POSTGRES_PASSWORD=mysecretpassword --name postgres_early imageID
docker run -d -e POSTGRES_PASSWORD=mysecretpassword --name postgres_new imageID
docker container ls -a
docker stop
docker restart

Example - reproduce environments

https://hub.docker.com/u/rocker

# pull the image
docker pull rocker/r-base

# run container
docker run --rm -it rocker/r-base

Example - isolate applications

https://hub.docker.com/u/rstudio

DO NOT RUN!

docker run -it --privileged \
    -p 3939:3939 \
    -e RSC_LICENSE=$RSC_LICENSE \
    rstudio/rstudio-connect:ubuntu2204

Example - Docker as deployment strategy

https://github.com/mickwar/ml-deploy

Example - Docker in CI pipeline

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      -
        name: Checkout
        uses: actions/checkout@v3
      -
        name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      -
        name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
      -
        name: Build and push
        uses: docker/build-push-action@v4
        with:
          context: .
          file: ./Dockerfile
          push: true
          tags: ${{ secrets.DOCKERHUB_USERNAME }}/latest

Modes for running containers

Mode	Run command	Use case
Detached	`docker run -d`	This runs the container in the background so the container keeps running until the application process exits, or you stop the container. Detached mode is often used for production purposes.
Interactive + terminal	`docker run -it`	This runs the container in the foreground so you are unable to access the command prompt. Interactive mode is often used for development and testing.
Remove everything once the container is done with its task	`docker run --rm`	This mode is used on foreground containers that perform short-term tasks such as tests or database backups. Once it is removed anything you may have downloaded or created in the container is also destroyed.

Run Command

Lab Activity - Running containers

🟥 - I need help

🟩 - I’m done

Lab 2: Part 1

Running containers

docker pull ubuntu
docker image ls -a

docker run -it ubuntu bash
ls
whoami
hostname
# exit the container with Ctrl+D or exit

docker run -d ubuntu
docker container ls -a

docker run -d -P --name nginx1 nginx:alpine

docker container stop nginx1

docker run --rm debian echo "hello world"

Container Debugging

docker run -it -d ubuntu
docker container ls -a 
docker exec -it CONTAINER_ID bash

docker container run -d --name mydb \
 --name mydb \
 -e MYSQL_ROOT_PASSWORD=my-secret-pw \ 
 mysql
 
 docker container logs mydb

Lab Activity - Debugging Containers

🟥 - I need help

🟩 - I’m done

Lab 2: Part 2

Port Mapping with `docker run -p host:container`

docker pull httpd:alpine
docker pull httpd:latest

docker inspect --format='{{.Config.ExposedPorts}}' httpd:latest
docker inspect --format='{{.Config.ExposedPorts}}' httpd:alpine

docker run -p DockerHostPort:ApplicationPort

docker run -d -p 80:80 --name httpd-latest httpd:latest
curl http://localhost:81

docker run -d -p 6574:80 --name httpd-alpine httpd:alpine
curl http://localhost:80

Lab Activity - Mapping ports

🟥 - I need help

🟩 - I’m done

Lab 2: Part 3

Persisting data with Docker

State is persistent information that is recorded and recalled later
Containers are designed to be ephemeral e.g. stateless by default
Data applications are usually stateful and more complex to deploy

Options

Storage Type	Description	Use case
Volume mount	create new volume managed by Docker CLI dedicated filesystem on the host (/var/lib/ docker/volumes)	Need to transfer, share, data between multiple containers
Bind mount	use any available host filesystem not managed by Docker CLI	Access hidden files
tmpfs	uses host memory not permanent	sensitive data
External Database	container remains stateless; container connects to external database	Production workflow
External Shared File System	shared network file system int the cloud (NFS)	Managed or multi-cluster deployments Don’t want to limit storage

Lab Activity - Persisting Data

🟥 - I need help

🟩 - I’m done

Lab 2: Part 4

Building Docker Images

Images are build using a Dockerfile or interactively “on-the-fly” for testing

Build Command

Lab Activity - Commit and push container to DockerHub

🟥 - I need help

🟩 - I’m done

Lab 2: Part 5

A few Dockerfile Build Commands

Command	Description
ARG	Define variables passed at build time
FROM	Base image
ENV	Define variable
COPY	Add local file or directory
RUN	Execute commands during build process
CMD	Execute command when you run container; once per Dockerfile
ENTRYPOINT	Execute command to change default entrypoint at runtime
USER	Set username or ID
VOLUME	Mount host machine to container
EXPOSE	Specify port pn which container listens at runtime

Dockerfile Example

Walk through Posit Connect Dockerfile

Activity - Putting it all together in Dockerfile

🟥 - I need help

🟨 - I’m still working

🟩 - I’m done

Complete Lab 2: Part 5

Part 4: Data Science in Production

Data Science in Production

Production “State”

What are some questions that we should ask once our code is ready for production?

Where is it deployed?
Is it secure and accessible?
Does it scale?
Is it tested?

Choosing the right presentation layer

Alex’s flow chart

Choosing a REST API

Lab Activity: Host API on Posit Connect

🟥 - I need help

🟩 - I’m done

Lab 3: Part 1

How is it secured?

Lab Activity: Explore API structure

🟥 - I need help

🟩 - I’m done

Lab 3: Part 2

Lab Activity: Explore Plumber

🟥 - I need help

🟩 - I’m done

Lab 3: Part 3

Where is it deployed?

Lab Activity: Push button deployment

🟥 - I need help

🟩 - I’m done

Lab 3: Part 4

Lab Activity: Access API data

🟥 - I need help

🟩 - I’m done

Lab 3: Part 5

Part 1: Introductions, setup, & workshop overview

Meet your instructors

Solutions Engineering at Posit

Special Thanks to Alex Gold!

Introduce yourself to your neighbor

Word Cloud

💻 If your WIFI isn’t working let us know as soon as possible!

Pre-workshop Install

Workshop Install

Documentation & Communication

Daily Schedule

Code of Conduct

Workshop Goals

Agenda & Lab Overview

Most important Linux commands

What we won’t cover

Part 2: DevOps Principles & Tools

Section Goals

But first…a very brief history lesson

So you want to create an app?

Problems DevOps tries to solve

Common DevOps Principles

Why should we as data scientists care about this?

Has this ever happened to you?

We should care because …

Responsibility of the analyst

The Devops Process Flow

Proliferation of tools

An opinionated take on tools

The CI/CD Pipeline

Environment Management

Version Control & Workflows

Lab Activity - developing locally using Quarto

Is version control secure?

A short auth teaser

Git protocol options

Reproducing your environment

Layers of reproducibility

Packages vs. Libraries vs. Repositories

renv workflow

Example - renv

Lab Activity - renv workflow & github authentication

Build tools

Power of YAML

YAML

Example - YAML in action

Exercise - Inspect your YAML

Continuous Integration Example

Other GHA Categories

Quarto Publish Flows

Manually

Continuous Integration

GHA Structure

Actions Syntax

Lab Activity - CICD with Github Actions

Logging

Monitoring

Part 3: Docker for Data Scientists

Section Goals

Why use a container at all?

Our Docker Environment

Workflow

Architecture

Images

Dockerhub Registry

Containers

How can data scientists use docker?

Example - testing & versioning

Example - reproduce environments

Example - isolate applications

Example - Docker as deployment strategy

Example - Docker in CI pipeline

Modes for running containers

Run Command

Lab Activity - Running containers

Running containers

Container Debugging

Lab Activity - Debugging Containers

Port Mapping with docker run -p host:container

`renv` workflow

Lab Activity - `renv` workflow & github authentication

Port Mapping with `docker run -p host:container`