2025 Agenda
Opening remarks
Stream All the Things — Patterns of Effective Data Stream Processing
Data streaming is a genuinely difficult problem. Despite 10+ years of attempts to simplify it, teams building real-time data pipelines can spend up to 80% of their time optimizing those pipelines or fixing downstream output by handling bad data at the lake. All we want is a service that will be reliable, handle all kinds of data, connect with all kinds of systems, be easy to manage, and scale up and down as our systems change.
Oh, and it should also have super low latency and produce good data. Is that too much to ask?
In this presentation, you’ll learn the basics of data streaming and the architecture patterns, such as the dead letter queue (DLQ), used to tackle these challenges. We will then explore how to implement these patterns using Apache Flink and discuss the challenges that real-time AI applications bring to our infrastructure. Difficult problems are difficult, and we offer no silver bullets. Still, we will share pragmatic solutions that have helped many organizations build fast, scalable, and manageable data streaming pipelines.
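For readers unfamiliar with the DLQ pattern mentioned above, here is a minimal, hedged Python sketch of the idea using kafka-python: records that fail validation are routed to a dead-letter topic instead of blocking the pipeline or silently vanishing. Topic names and the validation rule are illustrative assumptions, not the Flink implementation the talk covers.

```python
# Minimal dead-letter-queue (DLQ) sketch with kafka-python.
# Topic names and the validation logic are illustrative assumptions.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "events",                              # source topic (assumed name)
    bootstrap_servers="localhost:9092",
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def validate(raw: bytes) -> dict:
    """Parse and check a record; raise on bad data."""
    event = json.loads(raw)
    if "user_id" not in event:
        raise ValueError("missing user_id")
    return event

for msg in consumer:
    try:
        event = validate(msg.value)
        producer.send("events-clean", value=event)   # happy path
    except Exception as err:
        # Bad record: park it in the DLQ with error context instead of
        # failing the pipeline or silently dropping it.
        producer.send("events-dlq", value={
            "raw": msg.value.decode("utf-8", errors="replace"),
            "error": str(err),
        })
```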
No Data Left Behind: Handling Late Events & Reference Data in Flink
Apache Flink is a powerful stream processing framework that enables complex real-time data processing. One of the most common use cases in streaming ETL is enriching events with reference data, such as Slowly Changing Dimensions (SCDs). However, real-world streaming systems are anything but perfect: events arrive late, reference data updates unpredictably, and standard join patterns can fall short.
In this session, we go beyond the basics to explore advanced enrichment techniques for handling late-arriving events and evolving reference data. Attendees will learn how to ensure consistency and meet latency requirements, even when dealing with unreliable data sources. We’ll also dive into lesser-known but powerful features of Flink’s API that can help you design resilient, high-performance real-time data pipelines.
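As a rough, framework-agnostic illustration of one enrichment pattern the abstract hints at (buffering events that arrive before their reference data), here is a plain-Python sketch; in Flink this would map onto keyed state and timers, and all names below are assumptions.

```python
# Framework-agnostic sketch of the enrichment pattern (plain Python, not
# the Flink API): keep the latest reference record per key and buffer
# events that arrive before their reference data. All names are assumptions.
from collections import defaultdict

reference = {}                # key -> latest reference record ("dimension" row)
pending = defaultdict(list)   # key -> events waiting for reference data

def emit(enriched):
    print("downstream:", enriched)

def on_reference_update(key, record):
    reference[key] = record
    # Reference data landed: release any events that were waiting on it.
    for event in pending.pop(key, []):
        emit({**event, "ref": record})

def on_event(key, event):
    if key in reference:
        emit({**event, "ref": reference[key]})   # enrich immediately
    else:
        pending[key].append(event)               # hold until reference arrives

on_event("u1", {"amount": 10})                   # arrives before its reference row
on_reference_update("u1", {"segment": "pro"})    # buffered event is now enriched
```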
Building Fast, Scaling Faster - How We’ve Reinvented Our Startup for LinkedIn
Imagine your startup is being acquired by a big tech company. The technology you designed and built now needs to scale to meet the demands of that company. That’s what happened to us 3 years ago, when LinkedIn acquired Oribi. We had to rebuild our system to match LinkedIn’s scale: over 3 million requests per second and more than 45 trillion Kafka messages per day! More than that – we had to make sure we supported a whole new level of compliance and legal requirements, and fast.
In this talk, I’ll share how we redesigned our system from scratch to meet LinkedIn’s requirements & scale, what guided us, the changes we made, and how we successfully completed the project ahead of a very tight deadline.
Whether you’re working at a small startup or a large company, scale can take many forms. Don’t let it catch you by surprise – develop the right mindset and methodologies to prepare in advance!
From Prompts to Pipelines: Using MCP to Automate Stuff (and Impress Your PM)
Model Context Protocol (MCP) is emerging as a lightweight standard for connecting large language models to actual systems—not just chat interfaces. In this fast-paced session, you’ll learn what MCP is, why it matters, and how it can help automate the repetitive glue work that clogs modern data workflows. We’ll walk through a real-world example for automating daily tasks using MCP to make life easier.
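To give a flavor of what wiring an LLM to “actual systems” via MCP can look like, here is a minimal sketch assuming the official `mcp` Python SDK’s FastMCP helper; the tool name and its stubbed logic are invented for illustration.

```python
# A tiny MCP server exposing one tool, sketched with the official `mcp`
# Python SDK (FastMCP). Tool name and logic are illustrative assumptions.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("daily-glue")

@mcp.tool()
def row_count(table: str) -> int:
    """Return the row count for a table (stubbed here for illustration)."""
    fake_catalog = {"orders": 1_204_332, "users": 88_910}
    return fake_catalog.get(table, 0)

if __name__ == "__main__":
    mcp.run()   # serves the tool over stdio so an LLM client can call it
```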
Fast Writes, Furious Reads - Making Near Real-Time Ingestion Work for Data Analysis
At Coralogix, we process massive volumes of observability data (logs, metrics, traces) with a hard requirement: the data must be available for query within five minutes of arrival. To achieve this, we built a near real-time ingestion pipeline into object storage — but one that doesn’t stop at speed.
Our system makes data queryable immediately upon arrival using a custom file format. Meanwhile, background compaction processes kick in — restructuring that data into optimized files for efficient querying.
This talk explores how we make data instantly queryable while still optimizing it later for maximum throughput — all while (trying to) meet both latency and performance goals under strict SLA pressure.
Artificial General Memory Is the New Frontier of the Data World
AI memory is the collection of interactions of an AI agent with humans and other AI agents. As the world of AI rapidly develops, making LLMs smarter and more capable, the idea of unified memory for AI is still undefined. In this talk, Roy will try to convince the audience that 1) Artificial General Memory is “a thing”; 2) there is likely a missing ‘data structure’ in our world; 3) traditional systems like Snowflake and Databricks should be concerned; and 4) there’s opportunity for new players.
This is going to be a fast-paced, 15-minute talk that covers the problem, the solution that exists today, and the opportunity that lies ahead.
Less Is More: The Counterintuitive Secret for Impactful Insights Sharing
This talk will explore the challenges of sharing insights in a way that resonates with stakeholders, and reveal a surprising solution to this common problem. By attending this session, you’ll learn how to overcome the pitfalls of traditional analysis and discover a proven approach to driving impact and engagement through concise and effective communication. Join us to find out what this game-changing solution is and how it can transform your approach to insights sharing.
Pause Can Lead To Innovation: Reimagining DWH Architecture with ClickHouse and S3
What if your DWH could deliver petabyte-scale capabilities directly to applications in real-time, without the cost, complexity, or limits of conventional solutions? We’ve reimagined the DWH by adopting a fully stateless architecture that leverages Kubernetes, Spot Instances, S3, ClickHouse, and Parquet.
Building a stateless DWH is more than just a technical challenge: it’s a strategic shift in our mindset. It requires stepping away from familiar methods, critically evaluating the long-term impact of architectural decisions, and embracing experimentation over assumptions. To innovate, you must stop and think deeply about your infrastructure.
In this talk, we’ll share how deliberate decision-making and experimentation enabled us to build a scalable, cloud-native DWH. By the end, you’ll gain insights into how to move beyond conventional solutions, focus on outcomes, and adopt a bottleneck-free-architecture mindset, as well as what to watch out for when introducing a novel data architecture.
What Does the Fox Say? Actionable Data From Animal Distress
What do you do when your most valuable dataset wasn’t designed, labeled, or even intended to exist?
In this talk, we’ll share how The Haibulans, a volunteer wildlife rescue network, unintentionally built a high-impact dataset from WhatsApp messages, spreadsheets, and field notes. Originally used for coordination, this informal data revealed spatial and temporal patterns in injuries, urban hazards, and human-wildlife interactions.
We’ll show how manual tagging and analysis led to changes in dispatch strategy, public outreach, and collaboration with city agencies — and explore the ethical challenges of working with emotionally charged, unstructured data.
Gain practical insights into extracting value from messy, real-world data, and learn how meaning can emerge even without a formal model. This talk is especially relevant for anyone navigating the space between fieldwork and data work — where meaning emerges before models.
From Handoff to Hand-in-Hand: Building AI Products Together
Building AI products often involves a handoff: data scientists develop models, and then developers figure out how to productionize them. This division can lead to friction, delays, and suboptimal results, especially when working with complex LLMs. As AI becomes more integrated into production environments, the gap between data science and software development must be addressed.
In this talk, I’ll share how to move beyond the traditional handoff model by creating shared ownership across roles. We’ll discuss practical strategies for building heterogeneous teams, the challenges of merging distinct mindsets, and the tangible benefits of this integration.
This isn’t just about team structure—it’s about how people work together, test together, and ship together.
Drawing from real examples, I’ll cover principles, tools, and pitfalls that helped us build LLM-powered products more effectively—without the handoff headaches.
Human Judgment, Machine Edition: LLMs in Data Labeling
LLMs can do a lot, but can they label and evaluate data like a human? Sometimes. This talk shares hard-earned lessons from the front lines of data operations (the people who manage human judgment on data and ensure its quality), where one practitioner set out to recreate textual human-labeled data with the generous help of GenAI, and discovered what works, what doesn’t, and why it’s not as simple as it seems.
Airflow Unleashed: Scaling for the Enterprise
Apache Airflow is a powerful data pipeline orchestrator—but what if it could do more? What if a single Airflow instance on Kubernetes could serve an entire org, empowering teams beyond data engineering?
In this talk, I’ll share how we evolved Airflow into a self-service, org-wide workflow orchestration platform. By scaling its distribution, we replaced two legacy systems and enabled 20+ R&D teams to own their workflows. We also built an internal Airflow community that fostered collaboration and achieved milestones, some shared with the broader Airflow community. A key enabler was “Wrappers”—an abstraction layer bridging user code with Airflow’s core. It let us scale while enforcing best practices for 100+ users.
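As a hedged illustration of what a “Wrapper” abstraction might look like (an assumption for Airflow 2.x, not the talk’s actual implementation), a factory could hand teams DAGs with org-wide defaults such as ownership, retries, and tags already enforced:

```python
# Hypothetical "wrapper" factory: teams call make_team_dag() instead of
# DAG() directly, so ownership, retries, and tagging conventions are
# enforced centrally. Parameter names and defaults are illustrative.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def make_team_dag(dag_id: str, team: str, schedule: str = "@daily") -> DAG:
    return DAG(
        dag_id=f"{team}__{dag_id}",              # namespacing per team
        schedule=schedule,
        start_date=datetime(2025, 1, 1),
        catchup=False,
        tags=[team, "wrapped"],                  # ownership surfaced in the UI
        default_args={
            "owner": team,
            "retries": 2,
            "retry_delay": timedelta(minutes=5),
        },
    )

# A team's DAG file stays tiny: business logic only, platform rules inherited.
with make_team_dag("daily_report", team="growth") as dag:
    PythonOperator(task_id="build_report", python_callable=lambda: print("report built"))
```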
You’ll learn how to run multi-tenant Airflow, integrate Okta, ensure ownership, and streamline local development with lightweight testing. If you’re ready to push Airflow beyond its boundaries, this talk is for you.
The Inner Life of an Export Engine
We’ve all used data export mechanisms in one way or another – whether it’s by clicking an “Export to CSV” button, calling a designated API, or using another common solution to get the data we need. But what if we had to build such a mechanism ourselves?
While integrating cost data reports from major cloud providers into our platform, we realized that their approaches differ widely: AWS, for example, relies on snapshots stored in Parquet files, Oracle Cloud relies on incremental CSV updates, and others like GCP and Azure implement their own unique approaches.
In this talk, we’ll uncover the hidden complexities of building efficient data export tools. We’ll explore trade-offs between snapshot-based and incremental exports, key choices for file formats, and scalability. With real-world lessons and practical approaches, you’ll learn how smart design simplifies workflows, enhances user experience, and future-proofs systems against growing data demands.
Building a Self-Serve Iceberg Lakehouse
See how Unity iAds transformed its data lake into a self-serve Iceberg powerhouse: any engineer now spins up an autoscaling EMR-on-EKS streaming pipeline in minutes, hammering through millions of events per second, with Iceberg maintenance on autopilot and table-level FinOps insight baked in.
In 30 minutes we’ll unpack the key design moves, killer tools, and cultural shifts that turned this dream into reality.
How We Let Our COO Query PBs of Data Without Knowing SQL
Our data is spread across many data sources (Iceberg, Vertica, BigQuery, MySQL, and Druid, just to name a few), thousands of tables, and countless columns. Even if someone knows where the data sits, they don’t necessarily know how to query it, let alone validate their result.
At this point you’re probably thinking, just GPT and RAG it right? Well, it just doesn’t work.
In this talk we’ll describe the different angles we tried, how the usual “best practices” fell short, and how we took it to the next level to not only answer questions, but completely own deep data investigations (why was there a drop in revenue yesterday? In which user segment is variant B losing?).
Modeling Magic: How AI Built Our DWH Tables from a Slide Deck
At Wix, we’ve adopted a 10-step methodology for building high-quality DWH – starting from gathering business questions, through designing logical models, and ending with ensuring real adoption across teams with managed certified tables.
But building a great DWH is still a time-consuming process – especially when translating product and business intent into scalable, governed, and tested tables.
In this talk, I’ll share how we used gen AI to support and enhance this journey. Starting with just a product manager’s presentation, our AI-powered modeling approach extracts meaningful metadata, identifies core business entities and questions, builds ERDs and logical models, writes quality tests, and even helps with visualized documentation and onboarding.
We’ll explore what worked well, what needed refinement, and which parts of the process still rely heavily on human expertise.
Demystifying LLM Development: Apply ML Principles, Not Magic
Breaking the Content-Collaborative Barrier: LLMs as the Bridge for Recommender Systems
For decades, recommender systems faced a fundamental divide: DNN-based collaborative filters excel at user-item interactions but struggle with textual semantics, while content-based approaches miss collaborative patterns.
Enter LLMs: the breakthrough technology redefining the paradigm!
In this deep-tech talk we’ll explore the following core insights:
1. How LLMs transform your architecture by serving as powerful feature encoders for user/item representations
2. Practical implementations of “personalized prompts” that unify multiple recommendation tasks
3. The trade-offs between parameter-efficient fine-tuning and full-model approaches for production systems
I’ll share code patterns demonstrating both ID-based and textual side-information-enhanced representation learning techniques that are transforming systems at Amazon, Netflix, and others.
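To picture what such a hybrid representation can look like, here is a small PyTorch sketch that fuses a learned item-ID embedding with a precomputed text embedding; the dimensions and projection layer are illustrative assumptions, not the exact patterns shown in the talk.

```python
# Hedged sketch: fuse a collaborative ID embedding with an LLM-derived
# text embedding of the item description. Sizes are illustrative.
import torch
import torch.nn as nn

class HybridItemEncoder(nn.Module):
    def __init__(self, num_items: int, id_dim: int = 64, text_dim: int = 384, out_dim: int = 128):
        super().__init__()
        self.id_emb = nn.Embedding(num_items, id_dim)       # collaborative signal
        self.proj = nn.Linear(id_dim + text_dim, out_dim)   # fuse both views

    def forward(self, item_ids: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: precomputed sentence/LLM embeddings of item descriptions
        fused = torch.cat([self.id_emb(item_ids), text_emb], dim=-1)
        return self.proj(fused)

encoder = HybridItemEncoder(num_items=10_000)
item_ids = torch.tensor([3, 17])
text_emb = torch.randn(2, 384)            # stand-in for real text embeddings
print(encoder(item_ids, text_emb).shape)  # torch.Size([2, 128])
```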
This talk is perfect for data practitioners interested in the convergence of NLP and personalization.
Come join this exciting paradigm shift!
Code Review 2.0—Teaming Up with AI Agents
AI is no longer just writing code; it’s transforming how we review it. This session dives into effective strategies for evaluating both AI-generated and human-written code, leveraging AI tools to assist. Discover practical Python techniques to enhance quality, speed, and insight in your evolving code review process.
Key Takeaways (or “Attendees will learn”):
– Navigate the evolving landscape of AI in code review.
– Master strategies for critiquing AI-generated code.
– Leverage AI agents for deeper, context-aware analysis of human code.
– Integrate Python and AI tools for smarter, automated review workflows.
– Boost code quality, review efficiency, and team insight in the AI era.
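As a taste of the Python-plus-AI angle, here is a hedged sketch that pipes a git diff to an LLM reviewer via the OpenAI SDK; the model choice and prompt are assumptions, not the session’s tooling.

```python
# Hedged sketch: ask an LLM to review a diff, using the OpenAI SDK.
# Model choice and the review prompt are illustrative assumptions.
# Requires OPENAI_API_KEY and a git repository with at least one commit.
import subprocess
from openai import OpenAI

diff = subprocess.run(["git", "diff", "HEAD~1"], capture_output=True, text=True).stdout

client = OpenAI()
review = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a strict but constructive code reviewer."},
        {"role": "user", "content": f"Review this diff for bugs, style, and missing tests:\n{diff}"},
    ],
)
print(review.choices[0].message.content)
```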
Hard-Earned Engineering Lessons from Building with LLM-Powered Systems
In the last two years, I’ve designed, implemented, advised on, and debugged lots of LLM-powered applications as an adjunct engineer on our Advanced Analytics and Applied AI team. These are stories from the trenches: what I learned when things crashed in my face. I’ll share engineering lessons and patterns that will hopefully help you suffer less on your next project, plus what data science practitioners need to consider before handing off their “working” notebooks.
Rethinking Fine-Tuning: Building Adaptive Pipelines on Top of Pretrained Models
What if your model isn’t the end of the pipeline—but the beginning of a smarter one? In this talk, we’ll explore building a lightweight, modular adjustment layer on top of pre-trained models, designed to inject strategic signals into your system: business objectives, domain expertise, blind spot corrections, or patterns the model wasn’t even trained on. The result is a flexible, ensemble-style architecture that adapts without retraining the core model.
This approach gives you an easy way to fine-tune model behavior in production to reflect contexts, constraints, or objectives that weren’t explicitly modeled during training.
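One possible (assumed) shape of such an adjustment layer: keep the pretrained model frozen and train a small model on its score plus business signals it never saw, as in this toy scikit-learn sketch with synthetic data.

```python
# Hedged sketch of an adjustment layer: keep the pretrained model frozen,
# and train a small model on [base_score, extra business features].
# Data and features are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

base_scores = rng.uniform(size=1000)                 # frozen model's predictions
biz_features = rng.normal(size=(1000, 2))            # signals the base model never saw
labels = (0.6 * base_scores + 0.3 * biz_features[:, 0]
          + rng.normal(scale=0.2, size=1000)) > 0.5

X = np.column_stack([base_scores, biz_features])
adjuster = LogisticRegression().fit(X, labels)

# At inference: run the frozen model, then let the adjustment layer
# reconcile its score with current business context, no retraining needed.
new_X = np.column_stack([[0.42], [[1.5, -0.3]]])
print(adjuster.predict_proba(new_X)[0, 1])
```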
More with Less: A Personalized Model to the Rescue for Ultra-Small Data
Predictive models typically rely on large datasets, but what if we could achieve real-time insights with minimal data?
This talk introduces a modeling approach that thrives with ultra-small data (starting with just five points!), adapts to new cases without prior history, and enables real-time predictions without full database access.
We address real-world challenges of data scarcity due to privacy, costs, or operational constraints, where traditional models fail to provide accurate predictions.
In a healthcare use case, we built fetal growth curves from 5-8 ultrasound measurements, identifying deviations that signaled neonatal pathologies. Real-time detection could have enabled earlier intervention.
This “more with less” mindset can be extended to other domains, such as fraud detection and recommendation systems. Attendees will learn to build personalized models for low-data environments and cost-effective real-time analytics, challenging the “bigger is better” mindset.
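As a toy illustration of the “more with less” idea (made-up numbers, not the speakers’ clinical model), a per-subject curve can be fit to a handful of measurements and deviations flagged against that subject’s own trend:

```python
# Toy sketch: fit a personal growth curve to ~6 measurements and flag
# points that deviate strongly from the subject's own trend.
# Numbers are made up for illustration only.
import numpy as np

weeks = np.array([20, 24, 28, 31, 34, 37])               # gestational age
measure = np.array([320, 650, 1150, 1700, 2250, 2400])   # e.g. estimated weight (g)

coeffs = np.polyfit(weeks, measure, deg=2)    # personalized quadratic curve
fitted = np.polyval(coeffs, weeks)

residuals = measure - fitted
threshold = 2 * residuals.std()               # crude per-subject deviation rule
flags = np.abs(residuals) > threshold

for w, m, f in zip(weeks, measure, flags):
    print(f"week {w}: {m} g  {'<- check' if f else ''}")
```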
AI-Driven Autonomous Rule Tuning with Synthetic Test Data
What if you could generate realistic test values on demand, without using real data, and have your detection rules fine-tuned automatically?
Manual rule-tweaking is time-consuming, as teams craft spreadsheets of “typical” samples like emails, IDs or log entries, only to see new formats slip through. Public test sets are often too simplistic or unavailable, leading to blind spots or forcing the reuse of scrubbed production data.
In this talk, I’ll share how we built a zero-touch pipeline using OpenAI’s APIs to generate diverse examples, filter poor-quality cases with a two-step validation, and embed results in real-world formats. I’ll show how we score regex and keyword rules, use LLMs for suggestions, and automatically update rules to improve detection, with no custom models or real data needed.
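For flavor, here is a hedged sketch of the two halves of such a loop, with an assumed model, prompt, and rule: generate synthetic samples via the OpenAI SDK, then score a detection regex against them.

```python
# Hedged sketch: generate synthetic test values with the OpenAI SDK,
# then score a detection regex against them. Model, prompt, and the
# rule itself are illustrative assumptions. Requires OPENAI_API_KEY.
import json
import re
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "Return a JSON array of 10 realistic-looking but fake "
                   "email addresses, including a few unusual formats.",
    }],
)
# Sketch-level simplification: assumes the model returns raw JSON.
samples = json.loads(resp.choices[0].message.content)

email_rule = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")   # the rule under test
hits = [s for s in samples if email_rule.match(s)]

# Recall of the rule on synthetic positives; misses hint at blind spots
# worth feeding back into the rule (or into an LLM for suggested fixes).
print(f"matched {len(hits)}/{len(samples)}")
print("missed:", sorted(set(samples) - set(hits)))
```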
Whether you focus on DLP, log parsing, ETL validation, or rule-based detection, this talk will show how to move from reactive pattern babysitting to a scalable, self-evolving, and secure workflow.
Ummm, Actually
From inner and left joins to a model classifying wrong, from ETL to DWH hell, data professionals are excited about a lot of things, but there’s one thing that stands above all else: correcting people. Join our game show and prove that you ARE the true lord of rows, able to classify correctly between the true and false elements of a statement. More details to come as we progress.
Unlocking Feature Engineering with Embedding Models in the Era of LLMs
Feature engineering is traditionally labor-intensive, requiring domain expertise and iteration. With large language models (LLMs), embedding models offer a transformative way to streamline this process.
This talk explores using pre-trained embeddings for classification tasks in fintech, focusing on predicting financial risk from credit history. By capturing the context and semantics of credit reports as high-dimensional vectors, embedding models have achieved competitive results against production models.
We’ll share our journey, from challenges and solutions to performance gains through fine-tuning. Additionally, we’ll discuss new opportunities enabled by embeddings, such as few-shot prompting, clustering, and similar tools.
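The general recipe, embed free text and train a small classifier on the vectors, can be sketched as follows with sentence-transformers and scikit-learn on toy data; the talk’s fintech pipeline is of course far richer.

```python
# Minimal embed-then-classify sketch with toy data. The talk's fintech
# pipeline (credit reports, fine-tuning, etc.) is far richer than this.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

texts = [
    "No missed payments in the last 24 months, low utilization.",
    "Three accounts in collections, two recent charge-offs.",
    "Long credit history, one late payment five years ago.",
    "New credit file, multiple hard inquiries last month.",
]
labels = [0, 1, 0, 1]   # 0 = lower risk, 1 = higher risk (toy labels)

encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(texts)                        # text -> dense vectors

clf = LogisticRegression().fit(X, labels)

query = encoder.encode(["Account recently sent to collections."])
print(clf.predict_proba(query))                  # risk estimate for a new report
```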
Optimizing the Deployment of LLMs for Cost Efficiency
In this talk, we delve into the vital considerations surrounding cost management when deploying LLMs in real-world applications. We explore the nuances of token usage, infrastructure costs, human resources, and ancillary expenses, along with optimization methods including model architecture optimization, fine-tuning, quantization, caching, prefetching, parallelization, and distributed computing. Additionally, we address practical techniques for estimating costs through cost modeling, cost monitoring and management, and budgeting and planning. These insights aim to empower organizations to navigate the financial landscape of LLM deployment effectively, ensuring optimized resource allocation and sustainable operations.
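A back-of-the-envelope cost model can anchor these discussions; the sketch below uses placeholder per-token prices (check your provider’s current rate card) and an assumed cache-hit rate.

```python
# Back-of-the-envelope cost model for an LLM workload. Prices per 1M
# tokens are placeholders, not any provider's actual pricing.
PRICE_PER_1M = {"input": 0.50, "output": 1.50}   # USD, assumed numbers

def monthly_cost(requests_per_day, in_tokens, out_tokens, cache_hit_rate=0.0):
    effective = requests_per_day * 30 * (1 - cache_hit_rate)
    return effective * (in_tokens * PRICE_PER_1M["input"]
                        + out_tokens * PRICE_PER_1M["output"]) / 1_000_000

# 50k requests/day, 800 prompt + 200 completion tokens, 30% cache hits:
print(f"${monthly_cost(50_000, 800, 200, cache_hit_rate=0.3):,.0f}/month")
```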
How I Failed to Build My WhatsApp Agent - But Learned to Love the Challenge
I was frustrated. All I wanted was to find my friends’ trusted recommendation for where to travel with the kids next weekend – buried in months of casual chatter in our local WhatsApp group. Google didn’t help, ChatGPT didn’t know, and re-asking felt silly. I needed something smarter – an agent that could surface what my people had already shared, no matter when or how casually they’d said it.
That simple desire turned into a late-night obsession—a personal project that combined my data science strengths with the messy, unfamiliar world I was eager to explore: backend logic, user interfaces, system design, and bending tools until they (mostly) did what I needed. Because let’s face it—it’s never just about embeddings and semantic search, right?
In this talk, I’ll share how I tried to build the perfect WhatsApp agent, what broke, what it taught me, and why sometimes failure is the best teacher. You’ll leave with tools, insights, and motivation to build your own.
LLM Classification Chaos: How Embracing Complexity Improved Accuracy
What if the best way to solve a problem is to actually make it bigger?
NAICS is an American business classification system with a long-tail challenge: 1,000 codes with fuzzy boundaries. Combining it with web-mined inputs creates a classification nightmare. Traditional ML approaches treat it as a 1,000‑way classifier, often leveling off near 60% accuracy with inconsistent results. So how did we handle such a complicated problem? By embracing complexity and going granular.
In this session, we’ll share how we leveraged ~20,000 detailed descriptors from the NAICS index, embedded them in a vector store, and used retrieval-augmented LLM classification. We’ll show how we synthesized rich business profiles from web data, retrieved the closest descriptors, and increased accuracy while halving errors — all for just 1¢ per case.
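The retrieval-augmented classification shape can be sketched as follows; the descriptors, model names, similarity scoring, and prompt are illustrative assumptions rather than the production pipeline described in the talk.

```python
# Minimal retrieval-augmented classification sketch: embed a set of
# fine-grained descriptors, retrieve the closest ones for a business
# profile, and let an LLM pick the final code. Descriptors, model names,
# and the prompt are illustrative assumptions. Requires OPENAI_API_KEY.
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

descriptors = {
    "722511": "Full-service restaurants serving food to seated customers",
    "722513": "Limited-service restaurants; customers order and pay before eating",
    "445230": "Retailing fruit and vegetables from fixed point-of-sale locations",
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")
codes = list(descriptors)
desc_vecs = encoder.encode([descriptors[c] for c in codes], normalize_embeddings=True)

profile = "Family-owned taqueria where diners order at the counter and take a seat."
q = encoder.encode([profile], normalize_embeddings=True)[0]

top = np.argsort(desc_vecs @ q)[::-1][:2]   # retrieve closest descriptors
candidates = "\n".join(f"{codes[i]}: {descriptors[codes[i]]}" for i in top)

client = OpenAI()
answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": f"Business: {profile}\nCandidate NAICS codes:\n{candidates}\n"
                          "Answer with the single best code."}],
)
print(answer.choices[0].message.content)
```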
Join us to explore how increasing the label space can simplify AI decision-making and learn how to build an LLM classifier pipeline for complex classification challenges.
Surfacing Insights: From Big Data to Action
Discover how LSports built a robust, scalable data infrastructure to transform billions of sports data records into actionable insights and alerts for business users. This session walks participants through real-world examples, success stories, key architectural decisions, and the tools that made it all possible. Learn how a strong data culture, smart processes, and the right technology stack drive high-performance dashboards that matter.
Beyond the Chart: Best Practices and the Strategic Role of the Data Visualization Engineer
Organizations are flooded with data but starved for insight. Users face countless dashboards, yet few drive real decisions. This session addresses how to maximize insight while minimizing user effort, focusing on the Data Visualization Engineer’s role in transforming complex data into clear, actionable stories.
We’ll define the role of the Data Visualization Engineer within the modern data stack and demonstrate how this role works hand-in-hand with data engineers to ensure that data is not only accurate, but also optimized for visualization—modeled, cleaned, and structured with the end user in mind.
Through real-world examples—such as enterprise dashboard redesigns, centralized metric stores, and cross-domain reporting frameworks—you’ll learn how Data Visualization Engineers apply best practices to bring order and impact to BI environments.
Small Effort, Big Impact: Two Questions to Ask in Every Weekly with Your PM
Building a strong partnership with your Product Manager doesn’t require grand gestures, just the right questions. In this session, I’ll share how consistently asking two simple questions in weekly meetings – “What are you working on right now?” and “How can I help you?” – can transform your collaboration, uncover hidden needs, and drive real impact. Through a real-world story, I’ll show how this approach led to the rapid creation of a dashboard that saved hours of manual work and surfaced critical insights. Walk away with actionable tips to strengthen your PM relationship, deliver value fast, and become a better analyst by truly understanding the business.
Product Mindset Makes Good Analysts
When someone comes to you with a data request, a good analyst doesn’t just pull the numbers; they ask “why?”
What decision are you trying to make? Is this a one-time question, or something that will come up again? What follow-up questions might come next?
In this talk, we’ll explore the surprisingly overlapping mindset of great analysts and product managers. We’ll see how analysts who adopt a product mindset can drive much greater impact within their organizations.
We’ll look at practical tools, real-world examples, and common pitfalls.
This is an invitation to rethink the analyst role – not as a service provider, but as a strategic partner.
From Panic to Purpose: Turning dbt Alerts into Actual Actions
Managing large-scale dbt projects often turns into a whack-a-mole of broken models, ambiguous alerts, and frantic Slack pings. At HoneyBook, we hit that wall — with over 1,000 dbt models, 30+ data sources, and a mixed team of data engineers and analysts, we were drowning in alert noise with little direction or ownership.
This talk shares how we transformed that chaos into clarity. We’ll walk through our alerting redesign: how we defined “critical” using both domain tagging and graph-based centrality, how we enriched alert context with Git and query metadata, and how we routed incidents to the right Slack channels and humans. You’ll learn why off-the-shelf tools failed us, and how a lightweight, metadata-driven approach helped us make dbt alerts actually useful — and even empowering — for data teams.
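As a simplified, assumed version of the graph-based centrality idea (the real setup surely weighs more signals), one could rank dbt models by centrality over the dependency graph in the manifest:

```python
# Hedged sketch: rank dbt models by how central they are in the dependency
# graph, using the manifest's child_map. Paths and thresholds are assumptions.
import json
import networkx as nx

with open("target/manifest.json") as f:
    manifest = json.load(f)

G = nx.DiGraph()
for parent, children in manifest.get("child_map", {}).items():
    for child in children:
        G.add_edge(parent, child)

centrality = nx.betweenness_centrality(G)

# Models sitting on many dependency paths are good "critical" candidates:
# a failure there fans out widely, so their alerts deserve louder routing.
ranked = sorted(
    (item for item in centrality.items() if item[0].startswith("model.")),
    key=lambda kv: -kv[1],
)
for node, score in ranked[:10]:
    print(f"{score:.4f}  {node}")
```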
Expect practical insights, battle-tested lessons, and a look at how small changes in metadata and ownership mapping can dramatically improve trust and speed in data operations.
Data Meets Narrative: Unlocking the Power of A/B Testing
Great results, perfect execution… and still, your test analysis gets brushed aside or leads to the wrong call. Sound familiar? In this talk, I’ll share key lessons from running A/B tests as a Product Analyst, where I learned that the real power of experimentation lies in telling the story behind the numbers.
I’ll walk you through the principles that helped me turn cold data into clear decisions—by zooming out to the full user journey, collaborating closely with PMs, and crafting a narrative that resonates with stakeholders. We’ll cover practical tips, cautionary tales, and how to make sure your results don’t just live in a report—but actually shape product strategy.
This session is for anyone who’s ever felt like their “significant result” didn’t get the attention it deserved—or worse, was misinterpreted. You’ll walk away with practical, repeatable tools for turning any analysis—not just A/B tests—into a compelling, decision-shaping narrative, and for becoming a real partner.
From Zero to Insight: Building a Data Platform from the Ground Up
Imagine joining a company with a single database and a burning need for data-driven insights. Imagine a world where you can’t measure KPIs or answer burning questions from management.
In this session we will explore the creation of a mature company’s first data platform, from a Data Product Manager perspective. We will talk about pushing this initiative forward in an incremental way, while providing high-value deliverables.
Learn firsthand how we navigated the challenges of starting from scratch, understanding initial data flows to defining key KPIs, and ultimately enabling a company-wide shift towards a data-informed business model.
This session will provide practical takeaways for anyone facing the daunting task of building a data platform, emphasizing the importance of a product-centric approach, early wins, and the power of a clear vision.
Reimagining Anomaly Detection Through GenAI
Gen AI is a game changer for many industries and businesses. Can the same rules be applied to data engineering?
It is time to see how Gen AI is affecting the world of anomaly detection. How can we leverage generative AI for anomaly detection, and what benefits can we extract by using LLMs in our setup?
In this session I will address all of these issues and show you how we can take our anomaly detection ball game to a whole new level.
How LLMs Transformed Our Analytics
This session explores how we transformed our approach to analytics using the power of Large Language Models. Discover how we reduced deep-dive analytics work from a full week to just a few hours, unlocking faster, smarter decision-making across the organization. We will break down the key building blocks that enabled this shift, highlight common pitfalls to avoid, and share practical insights on how to align your team for success with LLM-driven analytics.
From Idea to MVP in 20 Minutes: Rapid Dashboard Prototyping with MCPs
In this talk, we’ll show how MCP (Model Context Protocol) can transform messy, unstructured Kafka data into a live dashboard in just 20 minutes. By leveraging MCP servers, we eliminate the need for manual data engineering, enabling teams to explore real-time data visually with minimal setup.
This approach modernizes legacy data flows and drastically reduces time to insight.
Ideal for teams looking to accelerate prototyping and decision-making without getting stuck in infrastructure.
Conversational BI - The Death of Traditional BI
Traditional BI tools are becoming outdated—they’re slow, hard to use, and often lead to dashboards no one uses. Conversational BI is the future: instead of clicking through reports, people can simply ask questions and get answers from their data using AI. But for this to work well, there needs to be a strong semantic layer underneath—a smart system that understands business terms and connects them to the right data. It keeps answers consistent, accurate, and trusted, making AI-powered BI possible and useful at scale.
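A toy way to picture that semantic layer (purely illustrative; metric names and SQL are assumptions): business terms resolve to vetted definitions before any AI-generated query touches the warehouse.

```python
# Toy semantic layer: business terms map to vetted SQL fragments, so a
# conversational front end composes trusted definitions instead of
# inventing its own. Metric names and SQL are illustrative assumptions.
SEMANTIC_LAYER = {
    "active_users": {
        "sql": "COUNT(DISTINCT user_id)",
        "table": "analytics.daily_events",
        "filters": ["event_date >= CURRENT_DATE - INTERVAL '30' DAY"],
    },
}

def compile_metric(term: str) -> str:
    m = SEMANTIC_LAYER[term]   # unknown terms fail loudly instead of being guessed
    where = " AND ".join(m["filters"]) or "TRUE"
    return f"SELECT {m['sql']} FROM {m['table']} WHERE {where}"

# "How many active users did we have?" -> consistent, governed SQL every time.
print(compile_metric("active_users"))
```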
Stop Babysitting Your AI: From Junior Assistant to Senior Developer
What if your AI assistant could grow beyond its perpetual junior status? At Lightricks, we confronted the curious “junior paradox” of development tools like Cursor – technically powerful, yet forever stuck repeating the same mistakes. Why can’t AI tools learn team knowledge the way humans do?
Our solution: a registry-based framework with meta-rules and MCPs that orchestrates context-aware guidance. But how exactly did we transform our AI from a talented-yet-needy junior into a self-sufficient senior developer? And what happened when we deployed this system across our BI engineering team?
Join us to uncover how we broke the AI knowledge barrier – and discover how you might free yourself from repeatedly teaching the same standards to your AI assistants. Could your AI finally graduate from junior to senior developer?