KEYNOTE: Spark and AI at Capital One
Eric Martin - Chief Technical Officer, US Card & Engineering Excellence
Yuri Bogdanov - Director of Software Engineering
9:10 AM - 9:40 AM
Capital One is innovating to disrupt the banking industry and revolutionize financial services. We moved from typical bank challenges (mainframes, slow batch data processes, complex and redundant systems) to efficient marketing strategies, modern underwriting techniques, and connected digital experiences, to name just a few. Learn how we did it by unifying data and AI with Apache Spark, and how we test and prototype, make decisions, and apply a product lens, all as part of our journey.
KEYNOTE: Healthcare - The Final Frontier
Nathan Salmon - Chief Architect for Ember
9:40 AM - 10:10 AM
Nearly 50% of cancer patients are diagnosed too late
Medical errors are the third leading cause of death in the U.S. after heart disease and cancer
It takes an average of 7.6 years in the United States to arrive at a rare disease diagnosis
In the Data Age, Digital Transformation has permeated nearly every industry, from retail and finance to cyber security. This, along with modern advances in machine learning and artificial intelligence, has led to the realization of astounding efficiencies, insight, and automation that were thought to be impossible even 5 to 10 years ago. However, despite being arguably the most globally impactful and altruistic cause, for one reason or another, healthcare lags the pack.
In this talk we’ll survey the challenges facing the healthcare industry’s adoption of a modern data-driven culture, and how other similarly-regulated industries are already solving these problems today. From precision and genomic medicine to population health, we’ll discuss the great opportunities and novel use cases that can literally change lives, and what we can do practically with current technology in healthcare to boldly go where most industries have gone before.
Modern Data Architecture with Cloud Native Technologies
Ugur Tigli - Chief Technical Officer
10:20 AM - 11:00 AM
Modern analytics workloads need to run at the speed of business. As applications generate more data at ever faster rates, storage infrastructure needs to catch up with the analytics platforms. Only then can businesses analyze the data and respond quickly to changing market needs.
Private AWS S3-compatible cloud object storage solutions, such as Minio, allow organizations to adopt object storage as their standard storage platform, while also offering the speed and agility needed to serve data from object storage directly to platforms like Spark in a cloud native framework.
This removes extra data-staging steps such as Extract, Transform and Load (ETL) and simplifies the overall data pipeline. A streamlined data flow makes maintenance easier and cuts infrastructure and staffing costs.
This session will focus on the benefits of deploying Minio object storage services in a private cloud to support Spark and AI use cases. We will walk through how private cloud object storage can help store massive amounts of data behind de facto standard interfaces and optimize the data analysis pipeline with Spark.
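As a rough sketch of what "directly using data from object storage" looks like in practice: pointing Spark at an S3-compatible Minio endpoint mostly comes down to Hadoop S3A configuration. The endpoint, credentials, and bucket below are hypothetical placeholders, not values from the session:

```python
# Hadoop S3A settings that let Spark read/write a Minio bucket directly.
# Endpoint, credentials, and bucket names are hypothetical placeholders.
s3a_conf = {
    "spark.hadoop.fs.s3a.endpoint": "http://minio.example.internal:9000",
    "spark.hadoop.fs.s3a.access.key": "MINIO_ACCESS_KEY",
    "spark.hadoop.fs.s3a.secret.key": "MINIO_SECRET_KEY",
    "spark.hadoop.fs.s3a.path.style.access": "true",  # Minio serves path-style URLs
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
}

def apply_s3a_conf(builder, conf=s3a_conf):
    """Apply the S3A settings to a pyspark SparkSession builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder

# Usage (requires pyspark plus the hadoop-aws package on the classpath):
#   spark = apply_s3a_conf(SparkSession.builder.appName("minio-demo")).getOrCreate()
#   df = spark.read.parquet("s3a://my-bucket/events/")
```

With this configuration in place, `s3a://` paths behave like any other Spark data source, which is what removes the intermediate copy/staging steps.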
A Discussion of Model Explanation Tools with Practical Recommendations
Patrick Hall - Senior Director of Product
10:20 AM - 11:00 AM
This presentation is a technical discussion of several explanatory methods that go beyond the error measurements and plots traditionally used to assess machine learning models. The approaches (decision tree surrogate models, individual conditional expectation (ICE) plots, local interpretable model-agnostic explanations (LIME), partial dependence plots, and Shapley explanations) vary in terms of scope, fidelity, and suitable application domain. Along with descriptions of these methods, a few practical recommendations are also presented. Materials for this presentation are available online: https://github.com/jphall663/jsm_2018_slides. Audience members who would prefer a more hands-on experience should consider attending the tutorial session: Practical Techniques for Interpretable Machine Learning.
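Of the methods listed, Shapley explanations are perhaps the easiest to make concrete: a feature's attribution is its average marginal contribution to the prediction over all orderings of the features. Below is a toy, exact computation on a hypothetical two-feature model; it is illustrative only and not taken from the talk materials:

```python
from itertools import permutations

def exact_shapley(f, x, baseline):
    """Exact Shapley attributions for prediction f(x), relative to a baseline.
    Averages each feature's marginal contribution over all feature orderings.
    Cost grows factorially, so this is only viable for toy models."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        for i in order:
            before = f(current)
            current[i] = x[i]          # switch feature i from baseline to x
            phi[i] += f(current) - before
    return [p / len(orderings) for p in phi]

# Hypothetical model with an interaction term.
f = lambda v: 3 * v[0] + 2 * v[1] + v[0] * v[1]
phi = exact_shapley(f, x=[1.0, 1.0], baseline=[0.0, 0.0])
# By construction the attributions sum to f(x) - f(baseline).
```

Practical tools approximate this computation by sampling orderings or exploiting model structure (as TreeSHAP does for tree ensembles), since exact enumeration is infeasible beyond a handful of features.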
Data Modeling Considerations for State Management of Spark Structured Streaming
Ted Malaska - Director of Enterprise Architecture
11:05 AM - 11:45 AM
As we move from a world of batch processing to streaming, the concept of streams as tables and tables as streams is emerging. In this session we will do a deep dive into this shift, starting with the different state management strategies available in Spark Structured Streaming.
This will include hot topics like:
Triggering & Evicting
Checkpointing and addressing the idea of micro batching
Each topic will be accompanied by real, workable examples posted on GitHub so you can try them at home.
Second, the session will close with a vision for a data ecosystem that expedites the progression from research batch workloads to near-real-time (NRT) actionable value.
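To give a flavor of the eviction topic above: Structured Streaming bounds the size of its keyed state by evicting entries once the event-time watermark passes them, and drops late data that arrives behind the watermark. A toy, pure-Python sketch of that bookkeeping (illustrative only, not Spark's implementation):

```python
class ToyStateStore:
    """Minimal keyed state store with watermark-based eviction,
    mimicking how a streaming engine bounds stateful aggregations."""

    def __init__(self, lateness):
        self.lateness = lateness   # allowed event-time lateness
        self.max_event_time = 0
        self.counts = {}           # (key, event_time) -> count

    def watermark(self):
        return self.max_event_time - self.lateness

    def update(self, key, event_time):
        # Data behind the watermark is too late to aggregate: drop it.
        if event_time < self.watermark():
            return
        self.max_event_time = max(self.max_event_time, event_time)
        self.counts[(key, event_time)] = self.counts.get((key, event_time), 0) + 1
        self._evict()

    def _evict(self):
        # State older than the watermark can never be updated again: evict it.
        wm = self.watermark()
        self.counts = {k: v for k, v in self.counts.items() if k[1] >= wm}
```

In Spark itself the equivalent knobs are `withWatermark("eventTime", "10 minutes")` on the input stream, the query trigger interval, and the `checkpointLocation` option that persists this state across restarts.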
AI assisted Merchandising: Driven by Personalized Recommendations, Powered by Spark
Charmee Patel - Data & Analytics Architect
Vijay Chakilam - Data & Analytics Lead
11:05 AM - 11:45 AM
In the age of information overload and ample choices, many companies are keenly aware of the competitive advantage of better recommendations and are investing to go beyond the most commonly used recommendation algorithms, such as collaborative and content filtering and their various hybrids. The next frontier in recommender systems is real-time contextual and personalized recommendations, where the current user context, as defined by their online and offline experiences, is considered in making personalized recommendations. In this talk, we discuss how we have built a modern recommender system by: 1) Considering clickstream data; 2) Leveraging advanced techniques to create latent features to reduce dimensionality; and 3) Using Apache Spark to support scalable batch and streaming operations.
Azure Databricks and Microsoft Machine Learning for Spark
Chris Hamson - Data Solution Architect
1:15 PM - 2:00 PM
Azure Databricks is a first-party service within Azure: all the goodness of Spark, all the goodness of Databricks for data scientists, and all the ease of management that comes with being a first-party Azure service.
Azure Machine Learning is an SDK that simplifies the process of designing, training, deploying, and monitoring machine learning models. The SDK covers the entire lifecycle of AI models and has a strong model management component. Azure Machine Learning integrates coding and management tools, deep learning frameworks, AI services, and a breadth of hardware options, from the cloud to the edge.
KEYNOTE: Introducing MLflow: Infrastructure for a Complete Machine Learning Life Cycle
Elena Boiarskaia - Senior Technical Solutions Engineer
2:10 PM - 2:40 PM
ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools, and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. To address these problems, many companies are building custom “ML platforms” that automate this lifecycle, but even these platforms are limited to a few supported algorithms and to each company’s internal infrastructure.
In this session, we introduce MLflow, a new open source project from Databricks that aims to design an open ML platform where organizations can use any ML library and development tool of their choice to reliably build and share ML applications. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.
Through a short demo of a complete ML model lifecycle example, you will walk away with:
MLflow concepts and abstractions for models, experiments, and projects
How to get started with MLflow
Using tracking APIs during model training
Using the MLflow UI to visually compare and contrast experimental runs with different tuning parameters and evaluate metrics
Building ML and AI Pipelines with Spark and TensorFlow
Chris Fregly - Founder, CEO
2:55 PM - 3:35 PM
In this talk, I demonstrate how to extend your existing Spark-based data pipelines to include TensorFlow model training and deployment. I present TensorFlow's TFRecord format, including libraries for converting to/from other popular file formats such as Parquet, CSV, JSON, and Avro stored in HDFS and S3. All demos are 100% open source and downloadable as Docker images from http://pipeline.ai.
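As a small illustration of the TFRecord round trip (requires `tensorflow`; the feature names, rows, and file path are made up, standing in for records parsed from CSV or Parquet):

```python
import tensorflow as tf

def to_example(row):
    """Wrap a dict of floats as a tf.train.Example protobuf."""
    feats = {name: tf.train.Feature(float_list=tf.train.FloatList(value=[val]))
             for name, val in row.items()}
    return tf.train.Example(features=tf.train.Features(feature=feats))

# Write two hypothetical rows to a TFRecord file.
path = "demo.tfrecord"
with tf.io.TFRecordWriter(path) as writer:
    for row in ({"x": 1.0, "label": 0.0}, {"x": 2.0, "label": 1.0}):
        writer.write(to_example(row).SerializeToString())

# Read them back as a tf.data pipeline, ready to feed model training.
spec = {"x": tf.io.FixedLenFeature([], tf.float32),
        "label": tf.io.FixedLenFeature([], tf.float32)}
dataset = tf.data.TFRecordDataset(path).map(
    lambda rec: tf.io.parse_single_example(rec, spec))
```

In a Spark pipeline, the write side of this conversion would typically run inside a distributed job so each partition emits its own TFRecord shard.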
Containerization of Spark Workloads – a New Paradigm for Building Simple and Efficient Polyglot Systems in Kubernetes
Hari Rajaram - Director Of Engineering
2:55 PM - 3:35 PM
Merlin International provides a single point of security administration, governance, and visibility into all endpoints critical to the operations of large healthcare and financial organizations. Merlin has built a cybersecurity solution leveraging Spark on Kubernetes for historical analysis and real-time streaming, integrating it with ML and AI pipelines. In this session we will delve into the implementation details of stream processing and the benefits of using containers for Spark workloads. We will briefly cover Kubernetes resource management for Spark workloads, along with the ease with which big data components, including ML and AI pipelines and services, can be deployed and managed on Kubernetes.
Challenge of real time B2C recommendations: Software and data science views
Dr. Alain Briançon - Vice President of Data Science
Victor Potapov - Senior Director of Software
3:40 PM - 4:10 PM
We will present our journey evolving Cerebri AI's machine-learning model for customer experience. The Cerebri Values system is our enterprise AI software that enables companies to predict customer behavior at scale. We will focus on key data engineering design patterns and challenges. We will also demonstrate how we're deploying our Customer's Commitment to Brand model and its industrial applications.
Frontiers at the interface of deep learning and high performance computing
Dr. Eliu Huerta Escudero - Founder and Lead of the Gravity Group at the National Center for Supercomputing Applications
3:40 PM - 4:10 PM
Machine and deep learning have revolutionized industry and technology. This paradigm shift in data science has slowly percolated into scientific research, leading to remarkable breakthroughs. A new wave of innovation that fuses deep learning with high performance computing presents new challenges and opportunities to address grand computational challenges in industry and science that cannot be tackled by deep learning or large-scale computing alone. In this talk I will present recent applications of deep learning at scale, spearheaded by my research team at the National Center for Supercomputing Applications, and will discuss a roadmap of action to create synergies between industry and academia to drive innovation in data science.