Introduction to Spark SQL and DataFrames
**2 sections of the same training running at the same time**


* Apache Spark Basics & Architecture

* Spark SQL

* DataFrames

* Brief Overview of Databricks Certified Developer for Apache Spark


This introductory workshop is aimed at data analysts and data engineers who are new to Apache Spark, and shows them how to analyze big data with Spark SQL and DataFrames.


In this partly instructor-led, partly self-paced workshop, we will cover Apache Spark concepts, and you'll complete hands-on labs for Spark SQL and DataFrames in Databricks Community Edition.


Toward the end, you'll get a glimpse into the newly minted Databricks Certified Developer for Apache Spark certification: what to expect and how to prepare for it.



Prerequisites:

1) Basic or intermediate knowledge of SQL.

2) Basic or intermediate knowledge of a programming language such as Python or Scala.

3) Labs are offered in SQL and in Python or Scala.



Setup:

1) Please bring a laptop with at least 8 GB of RAM.

2) Ensure you have the Chrome or Firefox browser installed.

3) Please bring a set of headphones. Databricks notebooks containing short embedded instructional clips will be used in the workshop; this will be a watch-and-try workshop.

4) Follow these instructions to create a Databricks Community Edition account and import the labs.

Training #1
Silvio Fiorito
12:00 PM - 2:00 PM
370 A/B

Training #2
Ricardo Portilla
12:00 PM - 2:00 PM
100 A

Training #3: What the Hail!? - A crash course in Spark-enabled genomic analytics
Mike Trepanier & Nathan Salmon
12:00 PM - 2:00 PM  
370 C

This training workshop will explore how Apache Spark is being leveraged in the growing field of computational genomics. We will use the open-source genomic analysis tool Hail to walk through some of the more common practices in this field, such as filtering or querying genomic data and running a genome-wide association study. We will also explore migrating this data to a Spark DataFrame and walk through some more familiar ML pipelines in Spark's native environment.

Training #4: Practical techniques for interpreting machine learning models
Patrick Hall
12:00 PM - 2:00 PM
370 D/E

Transparency, auditability, and stability of predictive models and results are typically key differentiators in effective machine learning applications. This tutorial shares approaches learned through implementing interpretable machine learning solutions in industries like financial services, telecom, and health insurance. Using a set of publicly available and highly annotated examples, it will teach several holistic approaches to interpretable machine learning.

The examples use the well-known University of California Irvine (UCI) credit card dataset and popular open-source Python packages to train constrained, interpretable machine learning models and to visualize, explain, and test more complex machine learning models in the context of an example credit-risk application. The instructors will draw on their applied experience to highlight crucial success factors and common pitfalls not typically discussed in blog posts and open-source software documentation, such as the importance of both local and global explanations and the approximate nature of nearly all machine learning explanation techniques.

The tutorial materials are available online, and the tutorial will be taught using a Qwiklabs environment. Audience members need to bring only their laptops to follow along. For audience members who would enjoy a more technical discussion of the presented techniques, consider attending the 40-minute session "A Discussion of Model Explanation Tools with Practical Recommendations."
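The contrast between global and local explanations that the tutorial emphasizes can be sketched with scikit-learn. This is an illustrative stand-in using synthetic data, not the tutorial's own materials or the UCI credit card dataset: a shallow decision tree provides global, model-wide feature importances, while perturbing one feature of a single row gives a crude local, what-if style explanation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data: the target depends strongly on x0 and only weakly on x1.
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Global explanation: impurity-based importances over the whole model.
global_importance = model.feature_importances_
print(global_importance)  # x0 should dominate

# Local, what-if explanation: how does the prediction for one row change
# when a single feature is perturbed?
row = X[:1].copy()
base = model.predict(row)[0]
row_perturbed = row.copy()
row_perturbed[0, 0] += 1.0
local_effect = model.predict(row_perturbed)[0] - base
print(local_effect)
```

As the tutorial stresses, both views are approximate: impurity-based importances and single-point perturbations each summarize the model from a different angle, and neither alone tells the whole story.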