Spark, Hadoop, and Snowflake for Data Engineering


This course is part of the Applied Python Data Engineering Specialization

Taught in English

Instructors: Noah Gift

5,041 already enrolled

Included with Coursera Plus


Course

Gain insight into a topic and learn the fundamentals

3.7 (22 reviews)

Advanced level


29 hours (approximately)

Flexible schedule

Learn at your own pace


What you'll learn

Create scalable data pipelines (Hadoop, Spark, Snowflake, Databricks) for efficient data handling.
Optimize data engineering with clustering and scaling to boost performance and resource use.
Build ML solutions (PySpark, MLflow) on Databricks for seamless model development and deployment.
Implement DataOps and DevOps practices for continuous integration and deployment (CI/CD) of data-driven applications, including automating processes.


Details to know

Shareable certificate

Add to your LinkedIn profile

Assessments

21 quizzes



Build your subject-matter expertise

This course is part of the Applied Python Data Engineering Specialization

When you enroll in this course, you'll also be enrolled in this Specialization.


There are 4 modules in this course

Gain the skills to build efficient, scalable data pipelines. Explore essential data engineering platforms (Hadoop, Spark, and Snowflake) and learn how to optimize and manage them. Delve into Databricks, a platform for running data analytics and machine learning workloads, while honing your Python data science skills with PySpark. Finally, discover the key concepts of MLflow, an open-source platform for managing the end-to-end machine learning lifecycle, and learn how to integrate it with Databricks.

This course is designed for learners who want to pursue or advance a career in data science or data engineering, and for software developers or engineers who want to grow their data management skill set. Beyond the technologies themselves, you will learn methodologies that sharpen your project management and workflow skills for data engineering, including Kaizen, DevOps, and DataOps best practices. With quizzes to test your knowledge throughout, this course will guide you toward becoming a proficient data engineer.

This week, you will learn how to work with data engineering platforms such as Hadoop and Spark and apply their concepts to real-world scenarios. First, you will explore the fundamentals of using Hadoop to store and process big data. Next, you will delve into Spark concepts: distributed computing, deferred execution, and Spark SQL. By the end of the week, you will have gained hands-on experience with PySpark DataFrames, DataFrame methods, and deferred execution strategies.
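Deferred (lazy) execution can be easier to grasp with a small model. In PySpark, DataFrame methods such as `filter` and `select` are transformations that only describe work, while methods such as `count`, `collect`, and `show` are actions that trigger it. The sketch below imitates that split in plain Python so it runs without a Spark installation. The `LazyPipeline` class and the `double` helper are hypothetical names invented for this illustration; they are not part of Spark's API, and nothing here is taken from the course materials.

```python
class LazyPipeline:
    """Hypothetical stand-in for a lazily evaluated dataset (not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps  # recorded transformations, not yet executed

    def map(self, fn):
        # Transformation: record the step and return a new pipeline.
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: record the step and return a new pipeline.
        return LazyPipeline(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Chain the recorded steps over the source as lazy iterators.
        stream = iter(self._source)
        for kind, fn in self._steps:
            stream = map(fn, stream) if kind == "map" else filter(fn, stream)
        return stream

    def collect(self):
        # Action: execute every recorded step and return the results.
        return list(self._run())

    def count(self):
        # Action: execute every recorded step and count the results.
        return sum(1 for _ in self._run())


calls = []  # records each time the transformation actually runs


def double(x):
    calls.append(x)
    return x * 2


pipeline = LazyPipeline(range(5)).map(double).filter(lambda x: x > 2)
ran_before_action = len(calls)  # still 0: nothing has executed yet
result = pipeline.collect()     # the action triggers the whole chain
ran_after_action = len(calls)

print(ran_before_action, result, ran_after_action)
```

Building the pipeline does no work; `double` runs only once `collect` is called. Each action in this sketch also re-runs the recorded steps from the source.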

What's included

10 videos, 9 readings, 7 quizzes, 2 discussion prompts, 2 ungraded labs

10 videos, total 25 minutes

Meet your Co-Instructor: Kennedy Behrman - 0 minutes
Meet your Co-Instructor: Noah Gift - 1 minute
Overview of Big Data Platforms - 1 minute
Getting Started with Hadoop - 1 minute
Getting Started with Spark - 1 minute
Introduction to Resilient Distributed Datasets (RDD) - 2 minutes
Resilient Distributed Datasets (RDD) Demo - 4 minutes
Introduction to Spark SQL - 1 minute
PySpark Dataframe Demo: Part 1 - 3 minutes
PySpark Dataframe Demo: Part 2 - 7 minutes

9 readings, total 90 minutes

Welcome to Data Engineering Platforms with Python! - 10 minutes
What is Apache Hadoop? - 10 minutes
What is Apache Spark? - 10 minutes
Use Apache Spark in Azure Databricks (optional) - 10 minutes
Choosing between Hadoop and Spark - 10 minutes
What are RDDs? - 10 minutes
Getting Started: Creating RDD's with PySpark - 10 minutes
Spark SQL, Dataframes and Datasets - 10 minutes
PySpark and Spark SQL - 10 minutes

7 quizzes, total 210 minutes

Big Data Platforms - 30 minutes
Apache Hadoop Concepts - 30 minutes
Apache Spark Concepts - 30 minutes
RDD Concepts - 30 minutes
Spark SQL Concepts - 30 minutes
PySpark Dataframe Concepts - 30 minutes
PySpark - 30 minutes

2 discussion prompts, total 20 minutes

Meet and Greet (optional) - 10 minutes
Let Us Know if Something's Not Working - 10 minutes

2 ungraded labs, total 120 minutes

Practice: Creating RDD's with PySpark - 60 minutes
Practice: Reading Data into Dataframes - 60 minutes

This week, you will explore the Snowflake platform and learn its architecture and key concepts. Through hands-on practice in the Snowflake Web UI, you will create tables and manage warehouses, and you will use the Snowflake Python Connector to interact with tables. By the end of this week, you will understand Snowflake's architecture and be able to navigate the platform and use it for data management and analysis.
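The write-then-read flow described above can be sketched with Python's DB-API cursor pattern, which the Snowflake Python Connector also follows (`connect`, `cursor`, `execute`, `fetchall`). So that the example runs without a Snowflake account, it uses an in-memory `sqlite3` connection as a stand-in. The `trips` table, its columns, the sample rows, and the `write_and_read` helper are all hypothetical, and results on a real Snowflake warehouse may differ in types and formatting.

```python
import sqlite3


def write_and_read(conn):
    """Create a table, insert two rows, and read them back via a DB-API cursor."""
    cur = conn.cursor()
    try:
        cur.execute(
            "CREATE TABLE IF NOT EXISTS trips (id INTEGER, city VARCHAR(50), fare FLOAT)"
        )
        cur.execute("INSERT INTO trips (id, city, fare) VALUES (1, 'Durham', 12.5)")
        cur.execute("INSERT INTO trips (id, city, fare) VALUES (2, 'Raleigh', 20.0)")
        conn.commit()
        cur.execute("SELECT id, city, fare FROM trips ORDER BY id")
        return cur.fetchall()
    finally:
        cur.close()


# Stand-in connection so the sketch runs anywhere.
# With the Snowflake connector installed, the connection would instead come from
# something like the following (placeholder credentials, an assumption):
#   import snowflake.connector
#   conn = snowflake.connector.connect(
#       user="...", password="...", account="...",
#       warehouse="...", database="...", schema="...",
#   )
conn = sqlite3.connect(":memory:")
rows = write_and_read(conn)
conn.close()

print(rows)
```

Passing the connection into the helper keeps the SQL logic separate from how the connection is obtained, so the same function can be exercised locally before pointing it at a warehouse.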

What's included

8 videos, 5 readings, 6 quizzes

8 videos, total 27 minutes

What is Snowflake? - 2 minutes
Snowflake Layers - 2 minutes
Snowflake Web UI - 3 minutes
Navigating Snowflake - 3 minutes
Creating a Table in Snowflake - 5 minutes
Snowflake Warehouses - 3 minutes
Writing to Snowflake - 3 minutes
Reading from Snowflake - 2 minutes

5 readings, total 50 minutes

Accessing Snowflake - 10 minutes
Detailed View Inside Snowflake - 10 minutes
Snowsight: The Snowflake Web Interface - 10 minutes
Working with Warehouses - 10 minutes
Python Connector Documentation - 10 minutes

6 quizzes, total 180 minutes

Snowflake Architecture - 30 minutes
Snowflake Layers - 30 minutes
Navigating Snowflake - 30 minutes
Creating a Table - 30 minutes
Writing to Snowflake - 30 minutes
Snowflake - 30 minutes

This week, you will practice the essential skills for managing machine learning workflows with Databricks and MLflow. First, you will create a Databricks workspace and configure a cluster. Next, you will load a sample dataset into the workspace with PySpark for manipulation and exploration. Finally, you will install MLflow, either locally or within the Databricks environment, to manage the machine learning lifecycle. By the end of this week, you will be able to create, track, and manage reproducible machine learning experiments within Databricks.

What's included

16 videos, 7 readings, 4 quizzes, 1 ungraded lab

16 videos, total 71 minutes

Accessing Databricks - 0 minutes
Spark Notebooks with Databricks - 4 minutes
Using Data with Databricks - 4 minutes
Working with Workspaces in Databricks - 3 minutes
Advanced Capabilities of Databricks - 1 minute
PySpark Introduction on Databricks - 7 minutes
Exploring Databricks Azure Features - 3 minutes
Using the DBFS to AutoML Workflow - 4 minutes
Load, Register and Deploy ML Models - 2 minutes
Databricks Model Registry - 2 minutes
Model Serving on Databricks - 2 minutes
What is MLOps? - 12 minutes
Exploring Open-Source MLFlow Frameworks - 5 minutes
Running MLFlow with Databricks - 6 minutes
End to End Databricks MLFlow - 4 minutes
Databricks Autologging with MLFlow - 4 minutes

7 readings, total 70 minutes

What is Azure Databricks? - 10 minutes
Introduction to Databricks Machine Learning - 10 minutes
What is the Databricks File System (DBFS)? - 10 minutes
Serverless Compute with Databricks - 10 minutes
MLOps Workflow on Azure Databricks - 10 minutes
Run MLFlow Projects on Azure Databricks - 10 minutes
Databricks Autologging - 10 minutes

4 quizzes, total 120 minutes

PySpark SQL - 30 minutes
PySpark DataFrames - 30 minutes
MLFlow with Databricks - 30 minutes
DataBricks - 30 minutes

1 ungraded lab, total 60 minutes

ETL-Part-1: Keyword Extractor Tool to HashTag Tool - 60 minutes

This week, you will explore Kaizen, DevOps, and DataOps and how these methodologies work together to support efficient data engineering workflows. Through practical examples, you will learn how Kaizen's continuous improvement philosophy, DevOps' collaborative practices, and DataOps' focus on data quality and integration combine to improve the development, deployment, and management of data engineering platforms. By the end of this week, you will have the knowledge and perspective needed to optimize data engineering processes and deliver scalable, reliable, high-quality solutions.
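One way the DataOps focus on data quality meets continuous integration is a small validation function with tests that run on every commit. The sketch below is a minimal illustration under stated assumptions: `validate_rows`, the `id` and `fare` fields, and the rules themselves are hypothetical and not taken from the course. The tests are plain functions with `assert`, so a runner such as pytest could discover them, and they can also be called directly.

```python
def validate_rows(rows):
    """Return a list of human-readable problems; an empty list means the batch passes."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Rule 1 (assumed): every row needs a unique, non-missing id.
        if row.get("id") is None:
            problems.append(f"row {i}: missing id")
        elif row["id"] in seen_ids:
            problems.append(f"row {i}: duplicate id {row['id']}")
        else:
            seen_ids.add(row["id"])
        # Rule 2 (assumed): fare must be a non-negative number.
        fare = row.get("fare")
        if not isinstance(fare, (int, float)) or fare < 0:
            problems.append(f"row {i}: invalid fare {fare!r}")
    return problems


def test_clean_batch_passes():
    assert validate_rows([{"id": 1, "fare": 9.5}, {"id": 2, "fare": 0}]) == []


def test_bad_batch_is_reported():
    bad = [{"id": 1, "fare": 5.0}, {"id": 1, "fare": -2}, {"fare": "n/a"}]
    assert validate_rows(bad) == [
        "row 1: duplicate id 1",
        "row 1: invalid fare -2",
        "row 2: missing id",
        "row 2: invalid fare 'n/a'",
    ]


if __name__ == "__main__":
    test_clean_batch_passes()
    test_bad_batch_is_reported()
    print("all checks passed")
```

Returning a list of problems, instead of raising on the first one, lets a pipeline log every issue in a batch before deciding whether to fail the run.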

What's included

21 videos, 6 readings, 4 quizzes, 1 ungraded lab

21 videos, total 502 minutes

Kaizen Methodology for Data - 4 minutes
Introducing GitHub CodeSpaces - 9 minutes
Compiling Python in GitHub Codespaces - 18 minutes
Walking through Sagemaker Studio Lab - 28 minutes
Pytest Master Class (Optional) - 166 minutes
What is DevOps? - 2 minutes
DevOps Key Concepts - 35 minutes
Continuous Integration Overview - 32 minutes
Build an NLP in Cloud9 with Python - 43 minutes
Build a Continuously Deployed Containerized FastAPI Microservice - 43 minutes
Hugo Continuous Deploy on AWS - 18 minutes
Container Based Continuous Delivery - 8 minutes
What is DataOps? - 1 minute
DataOps and MLOps with Snowflake - 61 minutes
Building Cloud Pipelines with Step Functions and Lambda - 16 minutes
What is a Data Lake? - 2 minutes
Data Warehouse vs. Feature Store - 2 minutes
Big Data Challenges - 1 minute
Types of Big Data Processing - 1 minute
Real-World Data Engineering Pipeline - 2 minutes
Data Feedback Loop - 0 minutes

6 readings, total 60 minutes

GitHub Codespaces Overview - 10 minutes
Getting Started with Amazon SageMaker Studio Lab - 10 minutes
Teaching MLOps at Scale with GitHub (Optional) - 10 minutes
Getting Started with DevOps and Cloud Computing - 10 minutes
Benefits of Serverless ETL Technologies - 10 minutes
Next Steps - 10 minutes

4 quizzes, total 120 minutes

Kaizen Methodology - 30 minutes
DevOps - 30 minutes
DataOps - 30 minutes
DataOps and Operations Methodologies - 30 minutes

1 ungraded lab, total 60 minutes

ETL-Part2: SQLite ETL Destination - 60 minutes

Instructors

Instructor ratings

3.7 (6 ratings)

Noah Gift

Duke University

40 courses, 93,423 learners

Offered by

Duke University


There are 4 modules in this course

Overview and Introduction to PySpark
Snowflake
Azure Databricks and MLFLow
DataOps and Operations Methodologies

Frequently asked questions

When will I have access to the lectures and assignments?

Access to lectures and assignments depends on your type of enrollment. If you take a course in audit mode, you will be able to see most course materials for free. To access graded assignments and to earn a Certificate, you will need to purchase the Certificate experience, during or after your audit. If you don't see the audit option:

The course may not offer an audit option. You can try a Free Trial instead, or apply for Financial Aid.
The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade. This also means that you will not be able to purchase a Certificate experience.

What will I get if I subscribe to this Specialization?

When you enroll in the course, you get access to all of the courses in the Specialization, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile. If you only want to read and view the course content, you can audit the course for free.

What is the refund policy?

If you subscribed, you get a 7-day free trial during which you can cancel at no penalty. After that, we don’t give refunds, but you can cancel your subscription at any time. See our full refund policy.