Agenda
Registration & Breakfast
Opening session
Keynote – Building ML-Driven Products for Business Users
The Gong product is used by business users (customer-facing professionals) on a daily basis. In this talk, Eilon will share lessons learned about building products based on data and machine learning: which data-driven functionality is well received, and how to internally define and build such functionality. Eilon will give examples from Gong’s lifecycle of areas that worked well and areas that didn’t.
#DataEngineering #DataScience #MLOps #Product
Keynote – The Data Manageability Revolution – How Data Trust Is Becoming the New North Star
Working with data is hard. Intrinsic difficulties that we used to cope with through manual workarounds become massive manageability problems when data is big and diverse, and many of us are working on it in parallel. In this talk we will review the evolution of questions such as: What data do I have, and where is it? How do I ensure high-quality data? How do I cope with my data being transient? We will then see how the answers to these questions evolved into new categories of data tools that have become standard in every data architecture.
#DataEngineering
Coffee Break
Everything You Always Wanted to Know About Data Mesh* (*But Were Afraid to Ask)
Data Mesh is the new buzzword in the data management world. It describes a distributed approach to managing your central data lake and promises to become “the microservices of data lakes.”
In this talk, Erez will explain why you should care about this new buzz, walk you through its core concepts, share his personal experience implementing it, and tell you what to watch out for (not necessarily in that order).
#DataEngineering
Reaching the Top – Can We Train ML Models Which Are Both Accurate and Fair?
We usually optimize ML models for metrics like accuracy or precision. But can we really tell whether this optimization leads us to the best solution for the problem? How can we define what the best model is for a given problem?
In this talk, we will see how ML models that may seem optimal can create discrimination.
We will review common notions of fairness and show why it’s hard to even agree on what is objectively fair.
We will propose a new notion of fairness, named ‘consistency score’, which is tailored to the problem at hand, and will show how to select the top model, optimized for both accuracy and consistency.
We will also share a Python package – the bias detector – that can help any data scientist detect bias in the ML models they develop.
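The group-comparison idea behind bias detection can be sketched in a few lines. This is a minimal illustrative sketch, not the actual API of the bias detector package; `group_accuracy_gap` is a hypothetical helper that compares model accuracy across protected groups:

```python
def group_accuracy_gap(y_true, y_pred, groups):
    """Per-group accuracy and the largest gap between any two groups.

    A large gap suggests the model performs unevenly across protected
    groups, a simple proxy for the kind of discrimination a fairness
    check should surface.
    """
    accs = {}
    for g in set(groups):
        pairs = [(t, p) for t, p, gg in zip(y_true, y_pred, groups) if gg == g]
        accs[g] = sum(t == p for t, p in pairs) / len(pairs)
    return accs, max(accs.values()) - min(accs.values())

# Toy data: the model is perfect for group "a" but always wrong for "b".
accs, gap = group_accuracy_gap(
    y_true=[1, 0, 1, 0, 1, 0],
    y_pred=[1, 0, 1, 1, 0, 1],
    groups=["a", "a", "a", "b", "b", "b"],
)
```

A real check would also compare rates such as false positives per group, but even this crude gap makes the core question concrete: accuracy for whom?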
#DataScience #Product
The Evolution of Meta’s Batch Pipelines Framework
In this talk we’ll see how Meta’s internal Batch Pipelines framework is developed according to DEs’ needs and feedback, resulting in fully automated, privacy-aware data pipelines.
Meta’s large scale enables the company to develop internal frameworks for DEs, focusing on problems unique to the company’s world of content.
In this joint DE-DI talk, we’ll show how the company’s unique structure enabled us to improve our Batch Pipelines Framework.
We’ll first give a bird’s-eye overview of Meta’s Batch Pipelines tooling. We’ll then present the problems DEs face on a daily basis using a fictional use case, highlighting the shortcomings of simple manual solutions.
We’ll iterate over these problems, improving the framework based on user feedback, up to a level that enables us to tackle complex issues such as managed schemas, automatic privacy management, rich types, and unavoidable human errors.
#DataEngineering #BI
For My Next Trick: A Complex AI From Nothing!
If Data Is a Story, Then How Should We Analyze It?
Let’s Make Your CFO Happy; A Practical Guide for Cost Reduction
Take a look at your AWS bill, and you will probably find Hadoop, Spark, and Kafka at the top.
According to Gartner Forecasts, the worldwide end-user spending on public cloud services is forecast to grow by 23% in 2021, to a total of $332B. As organizations evolve and grow, data rates grow too, as do consequent cloud costs.
In this talk, we are going to address exactly this problem. We will understand what we are paying for, how to develop an economic mindset, where we can cut costs, and what we can proactively do to reduce our data infrastructure cost.
#DataEngineering
Data Bias by Perception
The goal of this presentation is to make you question the way you trust yourself around data and numbers. If, after this talk, you pause before automatically jumping to conclusions and take a deeper look, I will consider it a success.
I will take you through some fascinating examples, including nearsighted children, A/B tests, and Nicolas Cage, to demonstrate how even our basic perception of data and numbers can be misleading and lead us to the wrong decision.
#Analytics
Lightning Talks
Getting Work Done Using Task Forces: Examples and Practical Tips, by Moran Brody
Pushing new initiatives forward, beating the backlog, and increasing collaboration are challenges every manager faces. If you have tried tackling them alongside the day-to-day work and failed, it is time to consider some out-of-the-box workflows.
In this talk, I’ll introduce designated task forces and how you can use them to do things differently. I will share task force examples and practical tips from my previous experience as a team lead at Riskified. I hope that by showing you the challenges we were able to tackle, you will be convinced to add task forces to your managerial toolkit.
3 Hiring Mistakes on Your Way to a Data-Driven Culture, by Gil Adirim
Most modern organizations have immense amounts of data, but very few are actually data driven. Becoming data-driven is not impossible, and I can show you how to get there! In this lightning talk I’ll review the 3 most common hiring pitfalls that affect your data culture, and give you practical tools to start transforming your organization tomorrow!
History Always Repeats Itself – And So Do Histograms! By Gilit Saporta
In this short talk, I’d like to breeze through the most common types of histograms that any data researcher should apply when looking for anomalies.
Anomaly detection is the bread and butter of fraud prevention, so with just 3 examples (hourly/daily traffic breakdown, connection type breakdown, RMSE visualization for device/OS), we can demonstrate the power of the histogram for everyday analysis.
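The first example (hourly traffic breakdown) can be sketched as follows. This is an illustrative sketch assuming a simple mean-based threshold rule; the function names are made up for the example, not taken from any specific library:

```python
from collections import Counter
from datetime import datetime, timedelta

def hourly_histogram(timestamps):
    """Bucket event timestamps into a 24-bin hour-of-day histogram."""
    counts = Counter(ts.hour for ts in timestamps)
    return [counts.get(h, 0) for h in range(24)]

def flag_anomalous_hours(hist, factor=3.0):
    """Flag hours whose traffic exceeds `factor` times the hourly mean."""
    mean = sum(hist) / len(hist)
    return [hour for hour, count in enumerate(hist) if count > factor * mean]

# Steady traffic of 10 events per hour, plus a burst at 03:00.
base = datetime(2022, 3, 1)
events = [base + timedelta(hours=h, minutes=m) for h in range(24) for m in range(10)]
events += [base + timedelta(hours=3, seconds=s) for s in range(200)]
hist = hourly_histogram(events)
```

In practice you would plot `hist` and compare it against a historical baseline rather than a flat mean, but the shape of the analysis is the same.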
#DataEngineering #DataScience #BI #Analytics #Product
Multi-Class Matthews Correlation Coefficient
Multi-class prediction has gained popularity in recent years.
Thus, measuring goodness of fit becomes a cardinal question that researchers often have to deal with. There are several metrics commonly used for this task. However, when deciding on the right measurement, one must consider that different use cases impose different constraints that govern this decision.
We suggest generalizing Matthews’ correlation coefficient into multiple dimensions. This generalization is based on a geometrical interpretation of the generalized confusion matrix.
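For reference, one established multi-class generalization of Matthews’ coefficient (Gorodkin’s R_K statistic) can be computed directly from a K×K confusion matrix; the talk’s geometrical generalization may differ in its details. A minimal pure-Python sketch:

```python
import math

def multiclass_mcc(conf):
    """Gorodkin's R_K generalization of Matthews' correlation coefficient.

    `conf[i][j]` counts samples of true class i predicted as class j.
    Returns 1.0 for perfect prediction, 0.0 for chance-level prediction.
    """
    k = len(conf)
    t = [sum(row) for row in conf]                              # true-class totals
    p = [sum(conf[i][j] for i in range(k)) for j in range(k)]   # predicted totals
    c = sum(conf[i][i] for i in range(k))                       # correct predictions
    s = sum(t)                                                  # total samples
    num = c * s - sum(ti * pi for ti, pi in zip(t, p))
    den = math.sqrt((s * s - sum(pi * pi for pi in p)) *
                    (s * s - sum(ti * ti for ti in t)))
    return num / den if den else 0.0
```

For K = 2 this reduces to the familiar binary MCC, which is one reason it is an attractive baseline against which to compare any new generalization.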
#DataScience
Lunch
The Data Practitioners Guide to Metadata
Ever wonder about the secret behind the legendary data-driven cultures of companies like LinkedIn, Airbnb, and others? The answer is metadata!
In this session, Maggie Hays will walk you through emerging best practices for managing metadata across vast, disparate systems. You’ll hear about the common pitfalls that arise within rapidly evolving, fragmented data stacks and why it’s critical to prioritize metadata management early to get ahead of them. Maggie will share top lessons learned from the 3k strong DataHub Community, equipping you with practical next steps so you can begin wrangling your organization’s metadata.
#DataEngineering #DataScience #BI #Analytics #Product
Correlating at Scale – Building Time-Series Clustering and Correlation Service for Big Data
Similarity measurement at large scale can be challenging in real time. Usually, this problem is solved using approximation models calculated in advance (LSH-based) for finding suitable candidates during the serving phase.
We will present how Anodot uses LSH similarity approximation for large-scale time-series clustering and correlation, how Spark is used in our data pipelines, and explain the technical challenges of migrating from Hive to Spark.
Our initial time series clustering solution used AWS EMR service and Hive scripts to aggregate the data, extract feature vectors for each time series and calculate the LSH model signatures.
Later we discovered that Spark could significantly improve data processing performance. Moreover, this discovery enabled us to reduce the model calculation time and Data Lake size by supporting efficient compression methods. It gave our system the flexibility of using the same code base for SaaS and on-prem solutions.
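To illustrate the LSH idea (a sketch for intuition, not Anodot’s actual implementation): each time series is reduced to a feature vector, and random-hyperplane hashing maps vectors with small angular distance to the same signature, so they become clustering and correlation candidates without an all-pairs comparison.

```python
import random

def make_hyperplanes(dim, n_bits, seed=7):
    """Draw `n_bits` random Gaussian hyperplanes for signing vectors."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_signature(vec, planes):
    """One bit per hyperplane: 1 if the vector lies on its positive side.

    Vectors at a small angle agree on most bits, so equal signatures
    are good candidates for the same cluster.
    """
    return tuple(int(sum(p * v for p, v in zip(plane, vec)) >= 0.0)
                 for plane in planes)

planes = make_hyperplanes(dim=4, n_bits=16)
a = lsh_signature([1.0, 2.0, 3.0, 4.0], planes)
b = lsh_signature([2.0, 4.0, 6.0, 8.0], planes)      # same direction, scaled
c = lsh_signature([-1.0, -2.0, -3.0, -4.0], planes)  # opposite direction
```

Bucketing series by signature (e.g. in a dict of lists) then turns clustering into a per-bucket problem, which is what makes the approach amenable to Spark.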
#DataScience #Analytics
Cardinality Control – From Batch to Stream
At AppsFlyer we deal with large volumes of data where some dimensions have very high cardinality, meaning many distinct values. We aggregate our data in order to provide interactive dashboards, but for this aggregation to be effective we must carefully limit the cardinality of the input data.
I will show you how our approach to limiting cardinality has evolved from batch to a new streaming process that leverages mergeable probabilistic data structures.
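The mergeable-sketch idea can be illustrated with a toy HyperLogLog, the classic mergeable distinct-count structure (a sketch for intuition, not AppsFlyer’s production code):

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog: approximate distinct counting with a lossless
    merge, which is what makes it safe to combine sketches computed
    independently on parallel stream partitions."""

    def __init__(self, p=12):
        self.p = p              # 2**p registers
        self.m = 1 << p
        self.regs = [0] * self.m

    def add(self, item):
        x = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = x >> (64 - self.p)                       # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)          # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.regs[idx] = max(self.regs[idx], rank)

    def merge(self, other):
        """Register-wise max; merging sketches equals sketching the union."""
        self.regs = [max(a, b) for a, b in zip(self.regs, other.regs)]

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.regs)
        zeros = self.regs.count(0)
        if est <= 2.5 * self.m and zeros:              # small-range correction
            est = self.m * math.log(self.m / zeros)
        return int(est)
```

With p=12 the sketch uses a few kilobytes yet estimates millions of distinct values to within a couple of percent, and the merge property is exactly what a streaming job needs when partitions are aggregated downstream.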
#DataEngineering
Computer Vision for the Poor: How to Easily Reduce Deep Computer Vision to Shallow NLP
Are There Any Benefits for Before and After Tests Over A/B Tests?
Coffee Break
Vespa or ClickHouse: What to Do After Elasticsearch
Elasticsearch is not dead. Yet.
Give it 5 more years, and it will no longer be the widespread technology that it is today.
I have been working with Elasticsearch since its dawn – I started over 10 years ago with version 0.12 or something like that – and saw Elasticsearch become the de-facto standard technology for search, log analytics, and real-time BI.
Today new technologies are emerging, and for some use cases they might replace Elasticsearch completely. This session is about two of those technologies I consider the most prominent – ClickHouse and Vespa. Come to this talk to learn more 🙂
#DataEngineering
Solving MLOps From First Principles
Selecting which tools to use in your workflow is one of the hardest challenges data teams face. Buyer’s remorse is real, and you continuously hear of new buzzwords you “just have to have in your stack”. In this talk, I’ll present a mental framework for thinking about MLOps challenges, and how to select the best tools for a task.
#DataEngineering #DataScience #MLOps
Self Service: Getting Developers Out of Your Way
Do you feel you are the bottleneck of the development process? Drowning in maintaining fragile data pipelines? Wasting time on explaining data concepts to frontend engineers?
In this talk, we’ll review a few examples of how investing in making DataOps self-serviceable can help you get rid of the mundane work and focus on what really matters. As a long-time developer first-time data engineer at Yotpo, I’ll share with you how little customization can carry you a long way.
#DataEngineering #BI
Deploying Models in a Highly Regulated Industry
As machine learning grows in dominance in high-stakes domains (fintech, healthcare, autonomous driving), Model Risk Management will play an increasingly important role in the years to come.
The financial industry has been using models to make high-stakes decisions for decades and has developed best practices for Model Risk Management.
The framework governs classical risks of poor performance, population drift, and implementation mistakes, as well as specific requirements for reproducibility, explainability, and non-discrimination toward protected classes.
How do you responsibly manage hundreds of ML models, some interlinked in various ways, in production?
How can you provide your users, decision makers and regulators with human understandable explanation of how a complex ML system makes decisions?
How can you make sure your system is not negatively impacting individuals based on factors like gender, ethnicity, and age?
#DataScience #MLOps #Product
Keynote – Huge Language Models and Neuro-Symbolic AI
The term “neuro-symbolic AI” evokes heated debates, with neural-net diehards on one extreme, neuro-skeptics on the other, and the rest trying to have a rational conversation. We’ll have a rational conversation, in the context of natural language.
#DataScience