Rajiv Shah - rajistics blog

Running Code and Failing Models

Fri, 26 Dec 2025 06:00:00 GMT


Machine learning is a glass cannon. When used correctly, it can be a truly transformative technology, but just a small oversight can cause it to become misleading and even actively harmful. Even if all the code runs and the model seems to be spitting out reasonable answers, it’s possible for a model to encode fundamental data science mistakes that invalidate its results. These errors might seem small, but the effects can be disastrous when the model is used to make decisions in the real world.

The promise and power of AI lead many researchers to gloss over the ways in which things can go wrong when building and operationalizing machine learning models. As a data scientist, one of my passions is to reproduce research papers as a learning exercise. Along the way, I have uncovered cases where the research was published with faulty methodologies. My hope is that this analysis can increase awareness about data science mistakes and raise the standards for machine learning in research. For example, last year I shared an analysis of a project by Harvard and Google researchers that contained fundamental errors. The researchers refused to fix their mistake even when confronted with it directly.

Over the holidays, I used DataRobot to reproduce a few machine learning benchmarks. I found many examples of machine learning code that ran without errors but was built using flawed data science practices. The examples I share in this post come from the world’s best data scientists and affect hundreds of peer-reviewed research publications. As these examples show, errors in machine learning can be subtle. The key to finding these errors is to work with a tool that offers guardrails and insights along the way.

Target Leakage in a fast.ai Example

Deep Learning for Coders with fastai and PyTorch: AI Applications Without a PhD by Jeremy Howard and Sylvain Gugger is a hands-on guide that helps people with little math background understand and use deep learning quickly. In the section about tabular datasets, the authors use the Blue Book for Bulldozers problem, the goal of which is to predict the sale price for heavy equipment at auction. I tried to replicate their machine learning model and wasn’t able to beat their model’s predictive performance, which piqued my interest.

After carefully inspecting their code, I found a mistake in how the validation dataset was built. Their code attempted to create a validation set based on a prediction point of November 1, 2011. The goal was to split the data at this point so that the model is trained only on data known at prediction time. The performance of the model is then measured on a held-out set, which should contain only data from after the prediction point. Unfortunately, the code was not written correctly; there was contamination from the future in the training data.

Leakage.png

The code below might at first look like it separates data before and after November 1, 2011, but there’s a subtle mistake that includes future dates. The use of information in the model training process that would not be expected at prediction time is known as target leakage, and it led to an over-optimistic accuracy. Because I used DataRobot, which requires and validates a date when creating a validation dataset based on time, I was able to find the mistake in the fast.ai book.
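To make this kind of bug concrete, here is a minimal Python sketch. It is a hypothetical illustration, not the book’s actual code: the `sales` dates and the functions `buggy_is_train` and `correct_is_train` are invented for this example. It shows how a year-or-month condition can look like a cutoff at November 1, 2011 while still admitting later dates, and how comparing full dates avoids the problem.

```python
from datetime import date

# Hypothetical sale dates; only the dates matter for this sketch.
sales = [
    date(2010, 12, 5),
    date(2011, 3, 14),
    date(2011, 11, 20),
    date(2012, 2, 9),
    date(2012, 4, 1),
]
CUTOFF = date(2011, 11, 1)  # prediction point

def buggy_is_train(d):
    # Reads like "before November 2011", but the `or` admits any date
    # whose month is before November, whatever the year.
    return d.year < 2011 or d.month < 11

def correct_is_train(d):
    # Compare the full date against the prediction point.
    return d < CUTOFF

buggy_train = [d for d in sales if buggy_is_train(d)]
correct_train = [d for d in sales if correct_is_train(d)]

# Rows in the buggy training set that fall on or after the cutoff.
leaked = [d for d in buggy_train if d >= CUTOFF]
print("leaked into training:", leaked)
```

A cheap guard, under the same assumptions, is to assert that the latest training date is earlier than the earliest validation date before fitting anything.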

After the target leakage was fixed, the fast.ai scores dropped, and I was able to reproduce the results outside of fast.ai. This simple coding mistake led to a notebook and model that appeared valid. If this model were put into production, the results would have been much worse on new data. After I identified this issue, Jeremy Howard agreed to add a note in the course materials.

fastai2.png

SARCOS Dataset Failure

The SARCOS dataset is a widely used benchmark dataset in machine learning. The task is based on predicting the movement of a robotic arm, and SARCOS appears in more than one hundred academic papers. I tested this dataset because it appears in various benchmarks by Google and fast.ai.

The SARCOS dataset is broken into two parts: a training dataset (sarcos_inv) and a test dataset (sarcos_inv_test). Following common data science practices, DataRobot broke the SARCOS training set into a training partition and a validation partition. I treated the SARCOS test set (sarcos_inv_test) as a holdout. When I looked at the results, I immediately noticed something suspicious. Do you see it?

sarcos3.png

The large drop between the validation score and the holdout score indicates that something is very different between the validation and holdout datasets. When I examined the holdout dataset (the SARCOS test set), I found that nearly every row in the test set was in the training data too. After some investigation, I discovered that the holdout dataset was built out of the training dataset: of the 4,449 examples in the test set, 4,445 are also present in the training set. The target leakage here is significant. By overfitting or memorizing the training dataset, it’s possible to get near-perfect results on the test set. Overfitting, a well-known issue in machine learning, is illustrated in the following figure. The test dataset should have been drawn out of sample so that overfitting would show up as a worse test score.

overfit4.png

Target leakage helped to explain the very low scores of the deep learning models. For comparison, a random forest model achieves 2.38 mean squared error (MSE), while a deep learning model overfits and produces 0.038 MSE. Judging from the suspiciously large difference between the models, it appears that the deep learning model just memorized the training data, which is why it had such low error.
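The overlap check described above is simple to run before trusting any test score. The following is a minimal sketch with toy rows standing in for the two SARCOS files; the function name `overlap_count` and the data are assumptions for illustration, and exact tuple matching assumes duplicated rows are stored with identical values.

```python
def overlap_count(train_rows, test_rows):
    """Count test rows that also appear verbatim in the training rows."""
    seen = {tuple(row) for row in train_rows}
    return sum(1 for row in test_rows if tuple(row) in seen)

# Toy stand-ins; real rows would hold the full feature vector and target.
train = [(0.1, 0.2, 1.5), (0.3, 0.4, 2.5), (0.5, 0.6, 3.5)]
test = [(0.1, 0.2, 1.5), (0.5, 0.6, 3.5), (0.9, 0.9, 9.9)]

duplicates = overlap_count(train, test)
print(f"{duplicates} of {len(test)} test rows also appear in training")
```

If most of the test rows show up in the count, the test set cannot tell a model that generalizes from one that memorizes.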

The consequences of this target leakage are far-reaching. More than one hundred journal articles relied on this dataset. Thousands of data scientists have used it to benchmark their machine learning code. Researcher Kai Arulkumaran has already acknowledged this issue and now the research community is dealing with the ramifications of the target leakage.

Why wasn’t this error discovered earlier? When I reproduced the SARCOS benchmarks, I used a tool that includes technical safeguards for proper validation splits and provides transparency in the display of the results of each split. DataRobot’s AutoML was designed by data scientists to prevent these sorts of issues. In contrast, working within code, it was quite easy to overlook this fundamental issue. After all, thousands of data scientists have rerun their code and published their results without a second thought.

Poker Hand Dataset

The Poker Hand dataset is another widely used benchmark dataset in machine learning. It’s used to predict poker hands (for example, a full house from five cards). The fast.ai and Google benchmarks for this model use the accuracy metric. Accuracy is a measurement for assessing the predictive performance of a model (basically, the percentage of predictions that are correct). Although it’s easy to get running code with the accuracy metric, it’s not good data science practice for this problem.

When DataRobot builds a model with the Poker Hand dataset, by default, it uses log loss as an optimization metric. Log loss is a measure of error for a model. At DataRobot, we believe that it isn’t good practice to use accuracy as your metric on a classification project with imbalanced classes. With imbalanced data, you can easily build a highly accurate model that’s useless.

To understand why accuracy isn’t the best metric when classifying imbalanced data, consider the following figure. Minesweeper is a popular game where the goal is to identify a few mines that are scattered across a board. Because there are a lot of squares with no mines, you could generate a very accurate model just by predicting that every square is safe. Although a 99% accurate model for Minesweeper sounds impressive, it’s not very useful.

minesweeper5.png
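The Minesweeper argument can be checked with a few lines of arithmetic. This sketch assumes a hypothetical board of 100 squares with a single mine and a "model" that labels every square as safe; the numbers are illustrative, not taken from the Poker Hand benchmark.

```python
# Hypothetical board: 1 = mine, 0 = safe.
labels = [1] + [0] * 99
predictions = [0] * len(labels)  # always predict "safe"

# Accuracy: share of squares labeled correctly.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall on the class we care about: share of mines actually found.
mines_found = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
recall = mines_found / sum(labels)

print(f"accuracy={accuracy:.2f}, mines found={mines_found}, recall={recall:.2f}")
```

The accuracy is high only because safe squares dominate; the recall shows that the model never finds a mine.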

Automated feature selection in DataRobot provides a more parsimonious feature list. In the Poker Hand dataset, DataRobot created a DR Reduced Features list with only six features. The starting feature list for this dataset, Cat+Cont, contained 15 features. The leaderboard below shows that the simpler DR Reduced Features list performs better than the full Cat+Cont feature list. The model below was optimized on log loss, but I am viewing the accuracy metrics for comparison to the existing benchmarks.

DRreduce6.png

Conclusion

I have shared simple examples of how data scientists can have running code, but failed models. After spending a week going through a half dozen datasets, I am even more convinced that automation with technical safeguards is a required part of building trusted AI. The mistakes I’ve shared here are not isolated incidents.

The issues go beyond the reproducibility crisis for machine learning research. It’s a great first step for researchers to publish their code and make the data available, but as these examples show, sharing code isn’t enough to validate models. So, what should you do about this?

In regulated industries, there are processes in place to validate running code (for example, building a challenger model using a different technical framework). Many organizations use DataRobot to validate models because of its safeguards and transparency. Just rereading or rerunning a project isn’t enough to identify errors.

A Practical Guide to Evaluating Generative AI Applications

Sat, 01 Nov 2025 05:00:00 GMT


Annotated Presentation

Below is an annotated version of Rajiv Shah’s workshop “Hill Climbing: Best Practices for Evaluating LLMs,” with timestamped links to the relevant parts of the video for each slide.

1. Title Slide

Slide 1

(Timestamp: 00:00)

This slide introduces the workshop titled “Hill Climbing: Best Practices for Evaluating LLMs,” presented by Rajiv Shah, PhD, at the Open Data Science Conference (ODSC). The presentation focuses on the technical nuances of Generative AI and how to build effective evaluation workflows.

Rajiv sets the stage by outlining his three main goals for the session: understanding the technical differences in GenAI evaluation, learning a basic introductory workflow for building evaluation datasets, and inspiring practitioners to start “learning by doing” rather than just reading papers.

The concept of “Hill Climbing” refers to the iterative process of improving LLM applications—starting with a baseline and continuously optimizing performance through rigorous testing and error analysis.

2. Evaluating for Gen AI Resources

Slide 2

(Timestamp: 00:06)

This slide provides a QR code and a GitHub URL, directing the audience to the code and resources associated with the talk. It emphasizes that the workshop is practical, with code examples available for attendees to replicate the evaluation techniques discussed.

Rajiv encourages the audience to access these resources to follow along with the technical implementations of the concepts, such as building LLM judges and creating unit tests, which will be covered later in the presentation.

3. Customer Support Use Case

Slide 3

(Timestamp: 00:48)

To motivate the need for evaluation, the presentation introduces a common real-world use case: Customer Support. Generative AI is frequently deployed to help agents compose emails or chat responses based on user inquiries.

This scenario serves as the baseline example throughout the talk. It represents a high-volume task where automation is desirable, but accuracy and tone are critical for maintaining customer satisfaction and brand reputation.

4. Vibe Coding

Slide 4

(Timestamp: 00:59)

This slide introduces the concept of “Vibe Coding”—the initial phase where developers grab a simple prompt, feed it to a model, and get a result that feels right. It highlights the misconception that GenAI is easy because it works “out of the box” for simple demos.

Rajiv notes that while “vibe coding” might work for a quick demo app, it is insufficient for production systems. Relying on a “vibe” that the model is working prevents teams from catching subtle failures that occur at scale.

5. Good Response: Delayed Order

Slide 5

(Timestamp: 01:10)

Here, we see a successful output generated by the LLM. The customer inquired about a delayed order, and the AI generated a polite, relevant response acknowledging the delay and apologizing.

This example reinforces the “Vibe Coding” trap: because the model often produces high-quality, human-sounding text like this, developers can be lulled into a false sense of security regarding the system’s reliability.

6. Good Response: Damaged Product

Slide 6

(Timestamp: 01:12)

This slide provides another example of a “good” response. The AI correctly identifies that the customer received a damaged product and initiates a replacement protocol.

These positive examples establish a baseline of expected behavior. The challenge in evaluation is not just confirming that the model can work, but ensuring it works consistently across all edge cases.

7. Bad Response: Irrelevance

Slide 7

(Timestamp: 01:26)

The presentation shifts to failure modes. In this example, the user asks about an “Order Delay,” but the AI responds with information about a “New Product Launch.”

This illustrates a complete context mismatch. The model failed to attend to the user’s intent, generating a coherent but completely irrelevant response. This type of failure frustrates users and degrades trust in the automated system.

8. Bad Response: Hallucination

Slide 8

(Timestamp: 01:36)

This slide shows a more dangerous failure: Hallucination. The AI apologizes for a defective “espresso machine,” but as the speaker notes, “We don’t actually sell espresso machines.”

This highlights the risk of the model fabricating facts to be helpful. Such errors can lead to logistical nightmares, such as customers expecting replacements for products that do not exist or that the company never sold.

9. Risks of LLM Mistakes

Slide 9

(Timestamp: 01:51)

Rajiv categorizes the risks associated with LLM failures into three buckets: Reputational, Legal, and Financial. He cites the example of Cursor, an IDE company, where a support bot hallucinated a policy restricting users to one device, causing customers to cancel subscriptions.

The slide emphasizes that courts may view AI agents as employees; if a bot makes a promise (like a refund or policy change), the company might be legally bound to honor it. This escalates evaluation from a technical nice-to-have to a business necessity.

10. The Despair of Gen AI

Slide 10

(Timestamp: 02:38)

This visual represents the frustration developers feel when moving from a successful demo to a failing production system. The “despair” comes from the realization that the stochastic nature of LLMs makes them difficult to control.

It serves as an emotional anchor for the audience, acknowledging that while GenAI is exciting, the unpredictability of its failures causes significant stress for engineering teams responsible for deployment.

11. High Failure Rates

Slide 11

(Timestamp: 02:48)

The slide cites an MIT report stating that “95% of GenAI pilots are failing.” While Rajiv notes this number might be overstated, it reflects a trend where executives are demanding ROI and seeing lackluster results.

This shift in 2025 means that evaluation is no longer just for debugging; it is required to prove business value and justify the high costs of running Generative AI infrastructure.

12. Evaluation Improves Applications

Slide 12

(Timestamp: 03:14)

This slide asserts the core thesis: Evaluation helps you build better GenAI applications. It references a previous viral video by the speaker on the same topic, positioning this talk as an updated, condensed version with fresh content.

Rajiv explains that you cannot improve what you cannot measure. Without a robust evaluation framework, developers are essentially guessing whether changes to prompts or models are actually improving performance.

13. Why Evaluation is Necessary

Slide 13

(Timestamp: 03:40)

This concentric diagram illustrates the stakeholders involved in evaluation. It starts with “Things Go Wrong” (technical reality), moves to “Buy-in” (convincing managers/teams), and ends with “Regulators” (external compliance).

Evaluation serves multiple audiences: it helps the developer debug, it provides the metrics needed to convince management that the app is production-ready, and it creates the audit trails required by third-party auditors or regulators.

14. Evaluation Dimensions

Slide 14

(Timestamp: 04:18)

Evaluation must cover three dimensions: Technical (F1 scores, accuracy), Business (ROI, value generated), and Operational (Total Cost of Ownership, latency).

Rajiv highlights that data scientists often focus solely on the technical, but ignoring operational costs (like the expense of hosting GPUs vs. using APIs) can kill a project. A comprehensive evaluation strategy considers the cost-to-quality ratio.

15. Public Benchmarks

Slide 15

(Timestamp: 05:06)

The slide discusses Public Benchmarks (like MMLU, GSM8K). While useful for a general idea of a model’s capabilities (e.g., “Is Llama 3 better than Llama 2?”), they are insufficient for specific applications.

Rajiv warns against using these benchmarks to determine if a model fits your specific use case. Companies promote these numbers for marketing, but they rarely reflect performance on proprietary business data.

16. Custom Benchmarks

Slide 16

(Timestamp: 05:22)

The solution to the limitations of public benchmarks is Custom Benchmarks. This slide defines a benchmark as a combination of a Task, a Dataset, and an Evaluation Metric.

This is a critical definition for the workshop. To “tame” GenAI, you must build a dataset that reflects your specific customer queries and define success metrics that matter to your business logic, rather than relying on generic academic tests.
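The Task, Dataset, and Evaluation Metric definition maps naturally onto a small data structure. The sketch below is one possible shape under stated assumptions: the `Benchmark` class, the intent labels, and `toy_system` are all hypothetical names invented here, not something shown in the talk.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Benchmark:
    task: str                              # what the system is asked to do
    dataset: List[Tuple[str, str]]         # (input, expected output) pairs
    metric: Callable[[str, str], float]    # scores one (output, expected) pair

    def run(self, system: Callable[[str], str]) -> float:
        """Average the metric over every example in the dataset."""
        scores = [self.metric(system(x), expected) for x, expected in self.dataset]
        return sum(scores) / len(scores)

def exact_match(output: str, expected: str) -> float:
    return float(output.strip().lower() == expected.strip().lower())

support_benchmark = Benchmark(
    task="classify support email intent",
    dataset=[
        ("Where is my order?", "order_delay"),
        ("My mug arrived cracked", "damaged_product"),
    ],
    metric=exact_match,
)

def toy_system(text: str) -> str:
    # Placeholder for a real model call.
    return "order_delay" if "order" in text.lower() else "damaged_product"

score = support_benchmark.run(toy_system)
print(f"{support_benchmark.task}: {score:.2f}")
```

Keeping the three parts separate makes it easy to swap the metric or grow the dataset without touching the system under test.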

17. Taming Gen AI

Slide 17

(Timestamp: 05:28)

This title slide signals a transition into the technical “how-to” section of the talk. “Taming” implies that the default state of GenAI is wild and unpredictable.

The goal of the following sections is to bring structure and control to this chaos through rigorous engineering practices and evaluation workflows.

18. Workshop Roadmap

Slide 18

(Timestamp: 05:31)

The roadmap outlines the four main sections of the talk:

1. Basics of Gen AI: Understanding variability and technical nuances.
2. Evaluation Workflow: Building the dataset and running the first tests.
3. More Complexity: Adding unit tests and conducting error analysis.
4. Agents: Evaluating complex, multi-step workflows.

19. Variability in Responses

Slide 19

(Timestamp: 06:00)

This slide visually demonstrates the Non-Determinism of LLMs. It shows two responses to the same prompt generated just minutes apart. While substantively similar, the wording and structure differ slightly.

This variability makes exact string matching (a common software testing technique) unreliable for LLMs. It necessitates semantic evaluation techniques, which complicates the testing pipeline.

20. Input-Model-Output Diagram

Slide 20

(Timestamp: 06:24)

A simple diagram illustrates the flow: Prompt -> Model -> Output. Rajiv uses this to structure the analysis of where variability comes from.

He explains that “chaos” can enter the system at any of these three stages: the input (prompt sensitivity), the model (inference non-determinism), or the output (formatting and evaluation).

21. Inconsistent Benchmark Scores

Slide 21

(Timestamp: 06:44)

The slide presents a discrepancy between benchmark scores tweeted by Hugging Face and those in the official Llama paper. Both used the same dataset (MMLU), but reported different accuracy numbers.

This introduces the problem of Evaluation Harness Sensitivity. Even with standard benchmarks, how you ask the model to take the test changes the score, which shows that evaluation is fragile and implementation-dependent.

22. MMLU Overview

Slide 22

(Timestamp: 07:25)

MMLU (Massive Multitask Language Understanding) is explained here. It is a multiple-choice test covering 57 tasks across STEM, the humanities, and more.

It is a widely used standard for measuring general “intelligence” in models. However, because it is a multiple-choice format, it is susceptible to prompt formatting nuances, as the next slides demonstrate.

23. Prompt Sensitivity

Slide 23

(Timestamp: 07:44)

This slide reveals why the scores in Slide 21 differed. The three evaluation harnesses used slightly different prompt structures (e.g., using the word “Question” vs. just listing the text).

These minor changes resulted in significant accuracy shifts. This shows that LLMs are highly sensitive to syntax, meaning a “better” model might just be one that was prompted more effectively for the test, not one that is actually smarter.

24. Formatting Changes

Slide 24

(Timestamp: 08:22)

Expanding on sensitivity, this slide references Anthropic’s research showing that changing answer choices from (A) to [A] or (1) affects the output.

This level of fragility is a key takeaway: seemingly cosmetic changes in how inputs are formatted can alter the model’s reasoning capabilities or its ability to output the correct token.
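To see how small these formatting differences are, the sketch below renders one multiple-choice question under three templates. The `render` function, the question, and the template names are hypothetical; they are not the actual templates used by any evaluation harness.

```python
def render(question, choices, header="Question: ", label_style="({})"):
    """Build a multiple-choice prompt with a configurable header and label style."""
    lines = [f"{header}{question}"]
    for letter, choice in zip("ABCD", choices):
        lines.append(f"{label_style.format(letter)} {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

question = "What is 2 + 2?"
choices = ["3", "4", "5", "6"]

variants = {
    "question_prefix_parens": render(question, choices, "Question: ", "({})"),
    "no_prefix_parens": render(question, choices, "", "({})"),
    "question_prefix_brackets": render(question, choices, "Question: ", "[{}]"),
}

for name, prompt in variants.items():
    print(f"--- {name} ---\n{prompt}\n")
```

Running every variant through the same model and comparing accuracy is one way to measure how much of a reported score depends on the template.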

25. GPT-4o Performance Drop

Slide 25

(Timestamp: 08:38)

A bar chart demonstrates that this issue persists even in state-of-the-art models like GPT-4o. Subtle changes in wording can lead to a 5-10% drop in performance.

This counters the assumption that newer, larger models have “solved” prompt sensitivity. It remains a persistent variable that evaluators must control for.

26. Tone Sensitivity

Slide 26

(Timestamp: 08:46)

This slide shows that the tone of a prompt (e.g., being polite vs. direct) affects accuracy. Rajiv jokes, “I guess this is why mom always said to be polite.”

The graph indicates that prompt engineering strategies, like adding emotional weight or politeness, can measurably alter model performance, adding another layer of complexity to evaluation.

27. Persistent Sensitivity

Slide 27

(Timestamp: 09:00)

The slide reiterates that despite years of progress, models are still sensitive to specific phrases. It shows a “Prompt Engineering” guide suggesting specific words to use.

The takeaway is that developers cannot treat the prompt as a static instruction; it is a hyperparameter that requires optimization and constant testing.

28. Falcon LLM Bias

Slide 28

(Timestamp: 09:18)

This slide introduces a case study with the Falcon LLM. A user tweet shows the model recommending Abu Dhabi as a technological city with glowing sentiment, which raised suspicions about bias given the model’s origin in the Middle East.

This serves as a detective story: users wondered if the model weights were altered or if specific training data was injected to force this positive association.

29. Potential Cover-up?

Slide 29

(Timestamp: 09:50)

Another tweet speculates about whether the model is “covering up human rights abuses” because it provides different answers for Abu Dhabi compared to other cities.

This highlights how model behavior can be misinterpreted as malicious bias or censorship, when the root cause might be something much simpler in the input stack.

30. Inspecting the System Prompt

Slide 30

(Timestamp: 10:00)

The reveal: The bias wasn’t in the weights, but in the System Prompt. The slide suggests looking at the hidden instructions given to the model.

In Falcon’s case, the system prompt explicitly told the model, “You are a model built in Abu Dhabi.” This context influenced its generation probabilities, causing it to favor Abu Dhabi in its responses.

31. Claude System Prompt

Slide 31

(Timestamp: 10:33)

Rajiv points out that most developers never read the system prompts of the models they use. He highlights the Claude System Prompt, which is 1,700 words long and takes nearly 10 minutes to read.

These extensive instructions define the model’s personality and safety guardrails. Ignoring them means you don’t fully understand the inputs driving your application’s behavior.

32. Complexity of a Single Response

Slide 32

(Timestamp: 11:00)

The diagram is updated to show that a “single response” is actually the result of complex interactions: Tokenization -> Prompt Styles -> Prompt Engineering -> System Prompt.

This visual summarizes the “Input” section of the talk, reinforcing that before the model even processes data, multiple layers of text transformation occur that can alter the result.

33. Inter-text Similarity

Slide 33

(Timestamp: 11:15)

This heatmap compares Inter-text similarity between models. It highlights Llama 70B and Llama 8B. Even though they are from the same family and likely trained on similar data, they are not identical.

This means you cannot swap a smaller model for a larger one (or vice versa) and expect the exact same behavior. Any model change requires a full re-evaluation.

34. Sycophantic Models

Slide 34

(Timestamp: 12:16)

The slide discusses Sycophancy—the tendency of models to agree with the user even when the user is wrong. It mentions how early versions of GPT-4 were sometimes “overly nice.”

This behavior is a specific type of model bias that evaluators must watch for. If a user asks a leading question containing false premises, a sycophantic model might validate the falsehood rather than correct it.

35. Model Drift

Slide 35

(Timestamp: 12:37)

“Model Drift” refers to the phenomenon where commercial API providers (like OpenAI or Anthropic) change their model behavior over time without warning.

Because developers do not control the weights of API-based models, the “ground underneath them” can shift. A prompt that worked yesterday might fail today because the provider updated the backend or the inference infrastructure.

36. Degraded Responses Timeline

Slide 36

(Timestamp: 12:55)

This slide shows a timeline of Degraded Responses from an Anthropic incident. Technical issues like context window routing errors led to corrupted outputs for a period of days.

This illustrates that drift isn’t always about model updates; it can be infrastructure failures. Continuous monitoring is required to detect when an external dependency degrades your application’s performance.

37. Hyperparameters

Slide 37

(Timestamp: 13:33)

The slide lists Hyperparameters like Temperature, Top-P, and Max Length. Rajiv explains that users can control these “knobs” to influence creativity versus determinism.

Setting temperature to 0 makes the model less random, but as the next slides show, it does not guarantee perfect determinism due to hardware nuances.
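The effect of temperature can be illustrated with a softmax over token scores. This is a minimal sketch assuming three hypothetical logits; the function `softmax_with_temperature` is written for this example. A temperature of exactly 0 would divide by zero in this formula, so the sketch uses a small positive value to show the limit, in which sampling approaches always picking the top-scoring token.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

hot = softmax_with_temperature(logits, 1.0)
cold = softmax_with_temperature(logits, 0.1)

print("T=1.0:", [round(p, 3) for p in hot])
print("T=0.1:", [round(p, 3) for p in cold])
```

At the lower temperature nearly all of the probability mass sits on the first token, which is why low temperatures reduce randomness in the sampling step.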

38. Non-Deterministic Inference

Slide 38

(Timestamp: 14:03)

This slide tackles Non-Deterministic Inference. Unlike traditional ML models (e.g., XGBoost) where a fixed seed guarantees identical output, LLMs on GPUs often produce different results for identical inputs.

Causes include floating-point rounding, where the order in which values are accumulated changes the result, and the behavior of Mixture of Experts (MoE) models, where different batches might activate different experts.
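The floating-point part of this is easy to reproduce without a GPU. The sketch below shows, in plain Python, that adding the same numbers in a different order can give a different result; on parallel hardware the accumulation order can vary between runs, which is one way such differences can arise.

```python
# Same three numbers, two groupings.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right, right_to_left, left_to_right == right_to_left)

# A larger-magnitude case: the small term is lost when added to the big one first.
big_first = (1e16 + 1.0) - 1e16
small_last = (1e16 - 1e16) + 1.0
print(big_first, small_last)
```

A tiny difference like this can matter when two candidate tokens have nearly equal scores, because it can change which one ranks first.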

39. Addressing Non-Determinism

Slide 39

(Timestamp: 15:11)

Rajiv references recent work by Thinking Machines and updates to vLLM that attempt to solve the non-determinism problem through correct batching.

While solutions are emerging, the takeaway is that most current setups are non-deterministic by default. Evaluators must design their tests to tolerate this variance rather than expecting bit-wise reproducibility.
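One way to design tests that tolerate this variance is to run the evaluation several times and report a mean and spread instead of a single number. The sketch below simulates that idea; `run_eval_once` is a placeholder that draws random pass/fail outcomes, standing in for a real evaluation run, and the 0.8 pass probability is an arbitrary assumption.

```python
import random
import statistics

def run_eval_once(rng, n_examples=50, pass_probability=0.8):
    """Placeholder for one full evaluation run; returns a pass rate in [0, 1]."""
    outcomes = [rng.random() < pass_probability for _ in range(n_examples)]
    return sum(outcomes) / len(outcomes)

rng = random.Random(0)  # seeded only so the simulation itself is repeatable
scores = [run_eval_once(rng) for _ in range(5)]

mean_score = statistics.mean(scores)
spread = statistics.stdev(scores)
print(f"pass rate {mean_score:.3f} +/- {spread:.3f} over {len(scores)} runs")
```

With a spread in hand, a change in prompt or model can be judged against the run-to-run noise instead of against a single lucky or unlucky score.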

40. Updated Model Diagram

Slide 40

(Timestamp: 15:43)

The diagram expands again. The “Model” box now includes Model Selection, Hyperparameters, Non-deterministic Inference, and Forced Updates.

This visual summarizes the “Model” section, showing that the “black box” is actually a dynamic system with internal variables (weights/architecture) and external variables (infrastructure/updates) that all add noise to the output.

41. Output Format Issues

Slide 41

(Timestamp: 16:01)

Moving to the “Output” stage, this slide uses MMLU again to show how Output Formatting affects evaluation. How do you ask the model to answer a multiple-choice question?

Do you ask it to output just the letter “A”? Or the full text? Or the probability of the token “A”? Different evaluation harnesses use different methods, leading to the score discrepancies seen earlier.
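The scoring side of this can be sketched with two answer-extraction rules applied to the same responses. Both functions and the sample responses are hypothetical; they are not how any particular harness parses output, and they only illustrate that the parsing rule alone can move the score.

```python
import re

def strict_letter(response):
    """Accept only a bare letter such as 'B' (surrounding whitespace allowed)."""
    text = response.strip()
    return text if text in {"A", "B", "C", "D"} else None

def lenient_letter(response):
    """Accept the first standalone capital A-D anywhere in the response."""
    match = re.search(r"\b([ABCD])\b", response)
    return match.group(1) if match else None

responses = ["B", "The answer is B.", "(B) 4", "I think it is four."]
gold = "B"

strict_score = sum(strict_letter(r) == gold for r in responses) / len(responses)
lenient_score = sum(lenient_letter(r) == gold for r in responses) / len(responses)
print(f"strict={strict_score:.2f}, lenient={lenient_score:.2f}")
```

The lenient rule has its own failure mode: a response that begins with the article "A" would be read as choosing option A, so neither rule is simply correct.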

42. Evaluation Harness Variations

Slide 42

(Timestamp: 16:35)

This table details the specific differences in implementation between harnesses (e.g., original MMLU vs. HELM vs. EleutherAI).

It reinforces that there is no standard “ruler” for measuring LLMs. The tool you use to measure the model introduces its own bias and variance into the final score.

43. Score Comparison Table

Slide 43

(Timestamp: 16:56)

A spreadsheet shows the same models scoring differently across different evaluation implementations. The variance is not trivial; it can be large enough to change the ranking of which model is “best.”

This data drives home the point: You must control your own evaluation pipeline. Relying on reported numbers is risky because you don’t know the implementation details behind them.

44. Sentiment Analysis Variance

Slide 44

(Timestamp: 17:09)

This slide shows varying Sentiment Analysis outputs. Different models (or the same model with different prompts) might classify a review as “Positive” while another says “Neutral.”

This introduces the concept that even “simple” classification tasks in GenAI are subject to interpretation and variance, unlike traditional classifiers that have a fixed decision boundary.

45. Tool Use Variance

Slide 45

(Timestamp: 17:23)

Radar charts illustrate variance in Tool Use. Models might be good at using an “Email” tool but fail at “Calendar” or “Terminal” tools.

Furthermore, models exhibit non-determinism in decision making—sometimes they choose to use a tool, and sometimes they try to answer from memory. This adds a layer of logic errors on top of text generation errors.

46. Summary: Why Responses Differ

Slide 46

(Timestamp: 17:49)

This comprehensive slide aggregates all the factors discussed: Inputs (prompts, system prompts), Model (drift, hyperparams), Outputs (formatting), and Infrastructure.

It serves as a checklist for the audience. If your application is behaving inconsistently, investigate these specific layers of the stack to find the source of the noise.

47. Chaos is Okay

Slide 47

(Timestamp: 18:17)

Rajiv reassures the audience that “Chaos is Okay.” The slide presents a chart of evaluation methods ranging from flexible/expensive (human eval) to rigid/cheap (code assertions).

The message is that while the technology is chaotic, there is a spectrum of tools available to manage it. We don’t need to solve every source of variance; we just need a robust process to measure it.

48. From Chaos to Control

Slide 48

(Timestamp: 18:27)

This transition slide marks the beginning of the Evaluation Workflow section. The presentation shifts from describing the problem to prescribing the solution.

The goal here is to move from “Vibe Coding” to a structured engineering discipline where changes are measured against a stable baseline.

49. Build the Evaluation Dataset

Slide 49

(Timestamp: 18:37)

The first step in the workflow is to Build the Evaluation Dataset. The slide lists examples of prompts for tasks like summarization, extraction, and translation.

Rajiv emphasizes that this dataset should reflect your actual use case. It is the foundation of the “Custom Benchmark” concept introduced earlier.

50. Get Labeled Outputs (Gold)

Slide 50

(Timestamp: 18:46)

Step two is to get Labeled Outputs, also known as Gold Outputs, Reference, or Ground Truth. The slide adds a column showing the ideal answer for each prompt.

This is the standard against which the model will be judged. While obtaining these labels can be expensive (requiring human effort), they are essential for calculating accuracy.

51. Compare to Model Output

Slide 51

(Timestamp: 19:00)

Step three is to generate responses from your system and place them alongside the Gold Outputs. The slide adds a “Model Output” column.

This visual comparison allows developers (and automated judges) to see the delta between what was expected and what was produced.

52. Measure Equivalence

Slide 52

(Timestamp: 19:10)

Step four is to Measure Equivalence. Since LLMs rarely produce exact string matches, we use an LLM Judge (another model) to determine if the Model Output means the same thing as the Gold Output.

The slide shows a prompt for the judge: “Are these two responses semantically equivalent?” This converts a fuzzy text comparison problem into a binary (Pass/Fail) metric.
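
A minimal sketch of this step, assuming the judge is any callable that takes a prompt string and returns text. Only the question "Are these two responses semantically equivalent?" comes from the slide; the rest of the template, the PASS/FAIL parsing, and the `toy_judge` stand-in (which merely compares word sets) are illustrative assumptions, not the talk's implementation.

```python
from typing import Callable, List, Tuple

# The first line is the slide's question; the remaining wording is an assumption.
JUDGE_TEMPLATE = (
    "Are these two responses semantically equivalent?\n"
    "Gold: {gold}\n"
    "Model: {output}\n"
    "Answer with exactly one word: PASS or FAIL."
)


def is_equivalent(gold: str, output: str, judge: Callable[[str], str]) -> bool:
    """Ask the judge and map its reply to a boolean."""
    reply = judge(JUDGE_TEMPLATE.format(gold=gold, output=output))
    return reply.strip().upper().startswith("PASS")


def equivalence_score(rows: List[Tuple[str, str]], judge: Callable[[str], str]) -> float:
    """Fraction of (gold, output) pairs the judge marks as equivalent."""
    if not rows:
        return 0.0
    passed = sum(is_equivalent(gold, output, judge) for gold, output in rows)
    return passed / len(rows)


def toy_judge(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: PASS when both texts use the same words."""
    lines = prompt.splitlines()
    gold = lines[1][len("Gold: "):].lower()
    output = lines[2][len("Model: "):].lower()
    return "PASS" if set(gold.split()) == set(output.split()) else "FAIL"


rows = [
    ("Paris is the capital of France", "the capital of France is Paris"),
    ("Paris is the capital of France", "Lyon is the capital of France"),
]
score = equivalence_score(rows, toy_judge)
```

In practice `toy_judge` would be replaced by a function that sends the prompt to a real model; keeping the judge behind a plain callable makes that swap a one-line change.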

53. Optimize Using Equivalence

Slide 53

(Timestamp: 19:57)

Once you have an equivalence metric, you can Optimize. The slide shows Config A vs. Config B. By changing prompts or models, you can track if your “Equivalence Score” goes up or down.

This treats GenAI engineering like traditional hyperparameter tuning. The goal is to maximize the equivalence score on your custom dataset.

54. Why Global Metrics Aren’t Enough

Slide 54

(Timestamp: 20:28)

The slide discusses the limitations of the “Equivalence” approach. While good for a general sense of quality, Global Metrics miss nuances.

Sometimes it’s hard to get a Gold Answer for open-ended creative tasks. Furthermore, a simple “Pass/Fail” doesn’t tell you why the model failed (e.g., was it tone, length, or factuality?).

55. From Global to Targeted Evaluation

Slide 55

(Timestamp: 20:55)

This slide argues for Targeted Evaluation. To maximize performance, you need to dig deeper into the data and identify specific error modes.

This transitions the talk from “Basic Workflow” to “Advanced Testing,” where we break down “Quality” into specific, testable components like tone, length, and safety.

56. Building Tests

Slide 56

(Timestamp: 21:14)

The section title “Building Tests” appears. This is where the presentation moves into the “Unit Testing” philosophy for GenAI.

Just as software engineering relies on unit tests to verify specific functions, GenAI engineering should use targeted tests to verify specific attributes of the generated text.

57. Good vs. Bad Examples

Slide 57

(Timestamp: 21:20)

The slide displays a Good Example and a Bad Example of a response. The bad example is visibly shorter and less polite.

Rajiv asks the audience to identify why it is bad. This exercise is crucial: you cannot build a test until you can articulate exactly what makes a response a failure.

58. Develop an Evaluation Mindset

Slide 58

(Timestamp: 21:46)

To define “Bad,” developers need an Evaluation Mindset. This involves observing real-world user interactions and problems.

Data scientists often want to stay in their “chair” and optimize algorithms, but Rajiv argues that effective evaluation requires understanding the user’s pain points.

59. Collaborate with Experts

Slide 59

(Timestamp: 21:58)

The slide stresses Collaboration. You must talk to domain experts (e.g., the customer support team) to define what a “good” answer looks like.

Naive bootstrapping—pretending to be a user—is a good start, but long-term success requires input from the people who actually know the business domain.

60. Identify and Categorize Failures

Slide 60

(Timestamp: 22:52)

Once you understand the domain, you can Categorize Failure Types. The slide shows a chart grouping errors into categories like “Harmful Content,” “Bias,” or “Incorrect Info.”

This clustering allows you to see patterns. Instead of just knowing “the model failed 20% of the time,” you know “the model has a specific problem with tone.”

61. Define What Good Looks Like

Slide 61

(Timestamp: 23:11)

Using the categorization, you can explicitly Define What Good Looks Like. The slide contrasts the good/bad examples again, but now with labels: “Too short,” “Lacks professional tone.”

This transforms a subjective feeling (“this response sucks”) into objective criteria (“response must be >50 words and use polite honorifics”).

62. Document Every Issue

Slide 62

(Timestamp: 23:32)

The slide shows a spreadsheet where humans evaluate responses and Document Every Issue. Columns track specific attributes like “Is it helpful?” or “Is the tone right?”

This manual annotation is the training data for your automated tests. You need humans to establish the ground truth before you can automate the checking.

63. Evaluation Tooling

Slide 63

(Timestamp: 23:53)

Rajiv mentions that Tooling Can Help. The slide shows a custom chat viewer designed to make human review easier.

However, he warns against getting sidetracked by building fancy tools. Simple spreadsheets often suffice for the early stages. The goal is the data, not the interface.

64. Test 1: Length Check

Slide 64

(Timestamp: 24:05)

Now we build the automated tests. Test 1 is a Length Check. The slide shows Python code asserting that the word count is between 8 and 200.

This is a deterministic test. You don’t need an LLM to count words. Rajiv encourages using simple Python assertions wherever possible because they are fast, cheap, and reliable.
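
The slide's code is not reproduced here, but a check of this kind might look like the sketch below. The function name `length_ok`, the whitespace-based word split, and the inclusive bounds are assumptions; only the 8 to 200 range comes from the slide.

```python
def length_ok(response: str, min_words: int = 8, max_words: int = 200) -> bool:
    """Deterministic check: word count must fall within [min_words, max_words]."""
    word_count = len(response.split())
    return min_words <= word_count <= max_words


good = (
    "Thank you for reaching out. Your refund was issued today "
    "and should arrive within five business days."
)
bad = "Refund sent."

assert length_ok(good)
assert not length_ok(bad)
```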

65. Test 2: Tone and Style

Slide 65

(Timestamp: 24:22)

Test 2 checks Tone and Style. Since “tone” is subjective, we use an LLM Judge (here, an OpenAI model) to classify the response.

The prompt asks the judge to identify the style. This allows us to automate the “vibe check” that humans were previously doing manually.

66. Adding Metrics to Documentation

Slide 66

(Timestamp: 24:41)

The spreadsheet is updated with new columns: Length_OK and Tone_OK. These are the results of the automated tests.

Now, for every row in the dataset, we have granular pass/fail metrics. This helps pinpoint exactly why a specific response failed, rather than just recording a generic failure.

67. Check Judges Against Humans

Slide 67

(Timestamp: 25:12)

A critical step: Check LLM Judges Against Humans. You must verify that your automated “Tone Judge” agrees with your human experts.

If the human says the tone is rude, but the LLM Judge says it’s polite, your metric is useless. You must iterate on the judge’s prompt until alignment is high.
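
One way to quantify "alignment" is sketched below, assuming both the human and the judge produce a pass/fail label per response. The talk does not name a specific statistic; raw agreement and Cohen's kappa (which corrects for chance agreement) are a common pairing and are used here as an assumption. The label lists are made-up illustration data.

```python
from typing import Sequence


def agreement_rate(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Share of items where the judge's label matches the human's."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Agreement corrected for chance, for two raters and binary labels."""
    n = len(human)
    observed = agreement_rate(human, judge)
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:  # both raters always give the same single label
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical Tone_OK labels for eight responses.
human_tone_ok = [True, True, False, False, True, False, True, True]
judge_tone_ok = [True, True, False, True, True, False, True, False]

agreement = agreement_rate(human_tone_ok, judge_tone_ok)
kappa = cohens_kappa(human_tone_ok, judge_tone_ok)
```

A high raw agreement with a low kappa usually means one label dominates the dataset, so the judge can look accurate while adding little information.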

68. Self-Evaluation Bias

Slide 68

(Timestamp: 26:06)

The slide illustrates Self-Evaluation Bias. LLMs tend to rate their own outputs higher than outputs from other models. GPT-4 prefers GPT-4 text.

To mitigate this, Rajiv suggests mixing models—use Claude to judge GPT-4, or Gemini to judge Claude. This helps ensure a more neutral evaluation.

69. Alignment Checks

Slide 69

(Timestamp: 26:46)

This slide reinforces the need for Continuous Alignment. Just because your judge aligned with humans last month doesn’t mean it still does (due to model drift).

Human spot-checks should be a permanent part of the pipeline to ensure the automated judges haven’t drifted.

70. Biases in LLM Judges

Slide 70

(Timestamp: 27:02)

The slide lists known Biases in LLM Judges, such as Position Bias (favoring the first answer presented) or Verbosity Bias (favoring longer answers).

Evaluators must be aware of these. For example, you should shuffle the order of answers when asking a judge to compare two options to cancel out position bias.
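
A sketch of one mitigation for position bias follows. Instead of a random shuffle, it runs the comparison in both orders and keeps the verdict only when the two runs agree; that is a swapped-in variant of the shuffling idea, not necessarily what the slide shows. The judge signature (returning "first" or "second") and both toy judges are hypothetical.

```python
from typing import Callable, Optional

# Assumed interface: judge(question, first_answer, second_answer) -> "first" or "second"
PairwiseJudge = Callable[[str, str, str], str]


def debiased_preference(
    question: str, answer_a: str, answer_b: str, judge: PairwiseJudge
) -> Optional[str]:
    """Return "A" or "B", or None when the verdict flips with presentation order."""
    forward = judge(question, answer_a, answer_b)  # A is shown first
    reverse = judge(question, answer_b, answer_a)  # B is shown first
    winner_forward = "A" if forward == "first" else "B"
    winner_reverse = "B" if reverse == "first" else "A"
    return winner_forward if winner_forward == winner_reverse else None


def always_first(question: str, first: str, second: str) -> str:
    """Toy judge with pure position bias."""
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    """Toy judge with verbosity bias but no position bias."""
    return "first" if len(first) >= len(second) else "second"


biased = debiased_preference("q", "short", "a much longer answer", always_first)
stable = debiased_preference("q", "short", "a much longer answer", prefers_longer)
```

Note that the swap catches position bias but not verbosity bias: the second toy judge returns a consistent verdict for the wrong reason.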

71. Best Practices for LLM Judges

Slide 71

(Timestamp: 27:11)

A summary of Best Practices: Calibrate with human data, use ensembles (multiple judges), avoid asking for “relevance” (too vague), and use discrete rating scales (1-5) rather than continuous numbers.

These tips help stabilize the inherently noisy process of using AI to evaluate AI.
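
As an illustration of the ensemble and discrete-scale tips, the sketch below combines 1 to 5 ratings from several judges. Taking the median is an assumption made here; the slide does not say how to aggregate, and a majority vote or mean would be equally plausible.

```python
from statistics import median
from typing import Sequence

VALID_RATINGS = (1, 2, 3, 4, 5)


def ensemble_rating(ratings: Sequence[int]) -> float:
    """Combine discrete 1-5 ratings from several judges using the median."""
    if not ratings:
        raise ValueError("need at least one rating")
    for rating in ratings:
        if rating not in VALID_RATINGS:
            raise ValueError(f"rating {rating!r} is outside the 1-5 scale")
    return median(ratings)


# Hypothetical ratings from three different judge models for one response.
combined = ensemble_rating([4, 5, 2])
```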

72. Error Analysis Chart

Slide 72

(Timestamp: 27:46)

With tests in place, we move to Error Analysis. The bar chart shows the number of failed cases categorized by error type (Length, Tone, Professional, Context).

This visualization tells you where to focus your efforts. If “Tone” is the biggest bar, you work on the system prompt’s tone instructions. If “Context” is the issue, you might need better Retrieval Augmented Generation (RAG).
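
The counts behind such a chart can be produced from the annotated rows with a few lines of code. In this sketch the `Length_OK` and `Tone_OK` columns follow the spreadsheet described earlier; `Context_OK` and the row values are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence

rows: List[Dict[str, object]] = [
    {"id": 1, "Length_OK": True, "Tone_OK": False, "Context_OK": True},
    {"id": 2, "Length_OK": False, "Tone_OK": False, "Context_OK": True},
    {"id": 3, "Length_OK": True, "Tone_OK": True, "Context_OK": False},
    {"id": 4, "Length_OK": True, "Tone_OK": False, "Context_OK": True},
    {"id": 5, "Length_OK": True, "Tone_OK": True, "Context_OK": True},
]


def failure_counts(rows: Sequence[Dict[str, object]], checks: Sequence[str]) -> Counter:
    """Count how many rows failed each check (a row can fail several)."""
    counts: Counter = Counter()
    for row in rows:
        for check in checks:
            if not row[check]:
                counts[check] += 1
    return counts


counts = failure_counts(rows, ["Length_OK", "Tone_OK", "Context_OK"])
worst_check, worst_count = counts.most_common(1)[0]
```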

73. Comparing Prompts

Slide 73

(Timestamp: 27:58)

The chart can compare Prompt A vs. Prompt B. This allows for A/B testing of prompt engineering strategies.

You can see if a new prompt improves “Tone” but accidentally degrades “Context.” This tradeoff analysis is impossible with a single global score.
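
A small sketch of that tradeoff analysis, assuming each prompt has been run through the same per-category tests. The category names echo the chart; the pass/fail lists are made-up data.

```python
from typing import Dict, List


def pass_rates(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Pass rate per test category."""
    return {name: sum(flags) / len(flags) for name, flags in results.items()}


def deltas(results_a: Dict[str, List[bool]], results_b: Dict[str, List[bool]]) -> Dict[str, float]:
    """Change in pass rate per category when moving from prompt A to prompt B."""
    rates_a, rates_b = pass_rates(results_a), pass_rates(results_b)
    return {name: rates_b[name] - rates_a[name] for name in rates_a}


prompt_a = {"Tone": [True, False, False, True], "Context": [True, True, True, True]}
prompt_b = {"Tone": [True, True, True, True], "Context": [True, True, False, True]}

change = deltas(prompt_a, prompt_b)
regressed = [name for name, delta in change.items() if delta < 0]
```

Here Prompt B fixes tone but loses ground on context. A single averaged score rises from 0.75 to 0.875 and hides that regression.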

74. Explanations Guide Improvement

Slide 74

(Timestamp: 28:14)

Rajiv suggests asking the LLM Judge for Explanations. Don’t just ask for a score; ask for “one sentence explaining why.”

These explanations act as metadata that helps developers understand the judge’s reasoning, making it easier to debug discrepancies between human and AI judgments.

75. Limits to Explanations

Slide 75

(Timestamp: 28:35)

A warning: Explanations are not causal. When an LLM explains why it did something, it is generating a plausible justification, not a trace of its actual neural activations.

Treat explanations as a heuristic or a helpful hint, not as absolute truth about the model’s internal state.

76. The Evaluation Flywheel

Slide 76

(Timestamp: 28:46)

The Evaluation Flywheel describes the iterative cycle: Build Eval -> Analyze -> Improve -> Repeat.

This concept, credited to Hamel Husain, emphasizes that evaluation is not a one-time event but a continuous loop that spins faster as you gather more data and build better tests.

77. Financial Analyst Agent Example

Slide 77

(Timestamp: 29:20)

To demonstrate advanced unit testing, Rajiv introduces a Financial Analyst Agent. The goal is to assess the specific “style” of a financial report.

This is a complex domain where “good” is highly specific (regulated, precise, risk-aware), making it a perfect candidate for granular unit tests.

78. Use a Global Test?

Slide 78

(Timestamp: 29:43)

You could use a Global Test: “Was this explained as a financial analyst would?”

While simple, this test is opaque. If it fails, you don’t know if it was because of compliance issues, lack of clarity, or poor formatting.

79. Global vs. Unit Tests

Slide 79

(Timestamp: 29:54)

The slide contrasts the Global approach with Unit Tests. Instead of one question, we ask six: Context, Clarity, Precision, Compliance, Actionability, and Risks.

This breakdown allows for targeted debugging. You might find the model is great at “Clarity” but terrible at “Compliance.”

80. Scoring Radar Chart

Slide 80

(Timestamp: 30:16)

A Radar Chart visualizes the unit test scores. This allows for a quick visual assessment of the model’s profile.

It facilitates comparison: you can overlay the profiles of two different models to see which one has the better balance of attributes for your specific needs.

81. Analyzing Failures with Clusters

Slide 81

(Timestamp: 30:37)

With enough unit test data, you can use Clustering (e.g., K-Means) to group failures. The slide shows clusters like “Synthesis,” “Context,” and “Hallucination.”

This moves error analysis from reading individual logs to analyzing aggregate trends, helping you prioritize which class of errors to fix first.

82. Designing Good Unit Tests

Slide 82

(Timestamp: 30:52)

Advice on Designing Unit Tests: Keep them focused (one concept per test), use unambiguous language, and use small rating ranges.

Good unit tests are the building blocks of a reliable evaluation pipeline. If the tests themselves are noisy or vague, the entire system collapses.

83. Examples of Unit Tests

Slide 83

(Timestamp: 30:55)

The slide lists specific examples of tests for Legal (Compliance, Terminology), Retrieval (Relevance, Completeness), and Bias/Fairness.

This serves as a menu of options for the audience, showing that unit tests can cover almost any dimension of quality required by the business.

84. Evaluating New Prompts

Slide 84

(Timestamp: 30:58)

A bar chart shows how unit tests are used to Evaluate New Prompts. By running the full suite of unit tests on a new prompt, you get a “scorecard” of its performance.

This data-driven approach removes the guesswork from prompt engineering.

85. Tools - No Silver Bullet

Slide 85

(Timestamp: 31:02)

Rajiv reminds the audience that Tools are No Silver Bullet. You must master the basics (datasets, metrics) first.

He advises logging traces and experiments and practicing Dataset Versioning. Tools facilitate these practices, but they cannot replace the fundamental engineering discipline.

86. Forest and Trees

Slide 86

(Timestamp: 31:04)

An analogy helps structure the analysis: Forest (Global/Integration) vs. Trees (Test Case/Unit Tests).

You need to look at both. The forest tells you the overall health of the app, while the trees tell you specifically what needs pruning or fixing.

87. Change One Thing at a Time

Slide 87

(Timestamp: 31:17)

A crucial scientific principle: Change One Thing at a Time. With so many knobs (prompt, temperature, model, RAG settings), changing multiple variables simultaneously makes it impossible to know what caused the improvement (or regression).

Isolate your variables to conduct valid experiments.
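
One way to enforce this is to generate experiment configs mechanically, so that each differs from the baseline in exactly one setting. The sketch below assumes configs are plain dictionaries; the keys and values are hypothetical.

```python
from typing import Any, Dict, List


def one_at_a_time(
    baseline: Dict[str, Any], alternatives: Dict[str, List[Any]]
) -> List[Dict[str, Any]]:
    """Build variants that each change a single key relative to the baseline."""
    variants = []
    for key, values in alternatives.items():
        for value in values:
            if value == baseline[key]:
                continue  # identical to the baseline, nothing to test
            variant = dict(baseline)
            variant[key] = value
            variants.append(variant)
    return variants


baseline = {"prompt": "v1", "temperature": 0.0, "model": "model-a"}
alternatives = {
    "prompt": ["v2"],
    "temperature": [0.0, 0.7],
    "model": ["model-b"],
}
variants = one_at_a_time(baseline, alternatives)
```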

88. Error Analysis Tips

Slide 88

(Timestamp: 31:32)

A summary of Error Analysis Tips: Use ablation studies (removing parts to see impact), categorize failures, save interesting examples, and leverage logs/traces.

These are the daily habits of successful GenAI engineers.

89. The Evaluation Story

Slide 89

(Timestamp: 32:08)

The slide shows the “Story We Tell”—a linear graph of improvement over time. This is the idealized version of progress often presented in case studies.

It suggests a smooth journey from “Out of the box” to “Specialized” to “User Feedback.”

90. The Reality of Progress

Slide 90

(Timestamp: 32:24)

The Reality is a messy, non-linear graph. You take two steps forward, one step back. Sometimes an “improvement” breaks the model.

Rajiv encourages resilience. Experienced practitioners know that this messy graph is normal and that sticking to the process eventually yields results.

91. Continual Process

Slide 91

(Timestamp: 33:01)

Evaluation is a Continual Process. It involves Problem ID, Data Collection, Optimization, User Acceptance Testing (UAT), and Updates.

Crucially, UAT is your holdout set. Since you don’t have a traditional test set in GenAI, your real users act as the final validation layer.

92. Eating the Elephant

Slide 92

(Timestamp: 34:03)

The metaphor “How do you eat an elephant?” addresses the overwhelming nature of building a comprehensive evaluation suite.

The answer, of course, is “one bite at a time.” You don’t need 100 tests on day one.

93. Adding Tests Over Time

Slide 93

(Timestamp: 34:10)

The slide visualizes the “elephant” being broken down into bites. You start with a few critical tests. As the app matures and you discover new failure modes, you add more tests.

Six months in, you might have 100 tests, but you built them incrementally. This makes the task manageable.

94. Doing Evaluation the Right Way

Slide 94

(Timestamp: 34:39)

A summary slide listing best practices: Annotated Examples, Systematic Documentation, Continuous Error Analysis, Collaboration, and awareness of Generalization.

This concludes the core methodology section of the talk.

95. Agentic Use Cases

Slide 95

(Timestamp: 34:50)

The final section covers Agentic Use Cases, symbolized by a dragon. Agents add a layer of complexity because the model is now making decisions (routing, tool use) rather than just generating text.

This “agency” makes the system harder to track and evaluate.

96. Crossing the River

Slide 96

(Timestamp: 35:06)

A conceptual slide asking, “How should it cross the river?” (Fly, Swim, Bridge?). This represents the decision-making step in an agent.

Evaluating an agent requires evaluating how it made the decision (the router) separately from how well it executed the action.

97. Chat-to-Purchase Router

Slide 97

(Timestamp: 35:22)

A complex flowchart shows a Chat-to-Purchase Router. The agent must decide if the user wants to search for a product, get support, or track a package.

Rajiv suggests breaking this down: evaluate the Router component first (did it pick the right path?), then evaluate the specific workflow (did it track the package correctly?).
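
Evaluating the router on its own can be as simple as a labeled list of messages and expected routes. In the sketch below, the route names, the test cases, and the keyword-based `toy_router` are all hypothetical stand-ins for an LLM-backed router.

```python
from typing import Callable, List, Tuple


def toy_router(message: str) -> str:
    """Stand-in router; a real one would call a model."""
    text = message.lower()
    if "package" in text or "track" in text:
        return "track"
    if "broken" in text or "refund" in text:
        return "support"
    return "search"


def evaluate_router(
    cases: List[Tuple[str, str]], router: Callable[[str], str]
) -> Tuple[float, List[Tuple[str, str, str]]]:
    """Return routing accuracy and the list of (message, expected, actual) misses."""
    misses = []
    for message, expected in cases:
        actual = router(message)
        if actual != expected:
            misses.append((message, expected, actual))
    accuracy = (len(cases) - len(misses)) / len(cases)
    return accuracy, misses


cases = [
    ("Where is my package?", "track"),
    ("I want to buy running shoes", "search"),
    ("My order arrived broken", "support"),
    ("Find me a blue jacket", "search"),
    ("Has my order shipped yet?", "track"),
]
accuracy, misses = evaluate_router(cases, toy_router)
```

The misses list is the useful output: it shows which phrasings the router gets wrong before any downstream workflow is blamed.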

98. Text to SQL Agent

Slide 98

(Timestamp: 36:17)

Another example: Text to SQL Agent. This workflow involves classification, feature extraction, and SQL generation.

You can isolate the “Classification” step (is this a valid SQL question?) and build a test just for that, before testing the actual SQL generation.

99. Evaluating Office-Style Agents

Slide 99

(Timestamp: 36:46)

The slide discusses OdysseyBench, a benchmark for office tasks. It highlights failure modes like “Failed to create folder” or “Failed to use tool.”

Evaluating agents involves checking if they successfully manipulated the environment (files, APIs), which is a functional test rather than a text similarity test.
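
A functional test of this kind inspects the environment rather than the agent's reply. The sketch below is not taken from OdysseyBench; `toy_agent_create_folder` is a hypothetical stand-in for an agent's tool call, and the check simply looks for the folder on disk.

```python
import os
import tempfile


def toy_agent_create_folder(workdir: str, name: str) -> str:
    """Stand-in for an agent action; returns the agent's text reply."""
    os.makedirs(os.path.join(workdir, name), exist_ok=True)
    return f"I created the folder {name}."


def folder_task_succeeded(workdir: str, name: str) -> bool:
    """Functional check: ignore the reply text and look at the file system."""
    return os.path.isdir(os.path.join(workdir, name))


with tempfile.TemporaryDirectory() as workdir:
    reply = toy_agent_create_folder(workdir, "reports")
    succeeded = folder_task_succeeded(workdir, "reports")
    # A folder the agent never created must fail the check, whatever the agent claims.
    missing = folder_task_succeeded(workdir, "invoices")
```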

100. Error Analysis for Agents

Slide 100

(Timestamp: 37:00)

Error Analysis for Agentic Workflows requires assessing the overall performance, the routing decisions, and the individual steps.

It is the same error analysis process, applied recursively to every node in the agent’s decision tree.

101. Evaluating Workflow vs. Response

Slide 101

(Timestamp: 37:19)

This slide distinguishes between evaluating a Response (text) and a Workflow (process). The flowchart shows a conversational flow.

Evaluating a workflow might mean checking if the agent successfully moved the user from “Greeting” to “Resolution,” regardless of the exact words used.

102. Agentic Frameworks

Slide 102

(Timestamp: 37:48)

Rajiv warns that “Agentic Frameworks Help – Until They Don’t.” Frameworks (like LangChain or AutoGen) are great for demos because they abstract complexity.

However, in production, these abstractions can break or become outdated. He often recommends using straight Python for production agents to maintain control and reliability.

103. Abstraction for Workflows

Slide 103

(Timestamp: 38:32)

The slide illustrates the trade-off in Abstraction. You can build rigid workflows (orchestration) where you control every step, or use general agents where the LLM decides.

Orchestration is more reliable but rigid. General agents are flexible but prone to non-deterministic errors.

104. When Abstractions Break

Slide 104

(Timestamp: 38:53)

Model providers are training models to handle workflows internally (removing the need for external orchestration).

However, until models are perfect, developers often need to break tasks down into specific pieces to ensure reliability. The choice between “letting the model do it” and “scripting the flow” depends on the application’s risk tolerance.

105. Lessons from Agent Benchmarks

Slide 105

(Timestamp: 39:15)

The slide lists Lessons from Reproducing Agent Benchmarks: Standardize evaluation, measure efficiency, detect shortcuts, and log real behavior.

These are advanced tips for those pushing the boundaries of what agents can do.

106. Conclusion

Slide 106

(Timestamp: 39:27)

The final slide, “We did it!”, concludes the presentation. Rajiv thanks the audience and provides the QR code again.

His final message is one of empowerment: he hopes the audience now has the confidence to go out, build their own evaluation datasets, and start “hill climbing” their own applications.

This annotated presentation was generated from the talk using AI-assisted tools. Each slide includes timestamps and detailed explanations.

From Vectors to Agents: Managing RAG in an Agentic World

Mon, 27 Oct 2025 05:00:00 GMT

Video

Watch the full video

Annotated Presentation

Below is an annotated version of the presentation, with timestamped links to the relevant parts of the video for each slide.

Here is the slide-by-slide annotated presentation based on the video “From Vectors to Agents: Managing RAG in an Agentic World” by Rajiv Shah.

1. Title Slide

Slide 1

(Timestamp: 00:00)

The presentation begins with the title slide, introducing the core theme: “From Vectors to Agents: Managing RAG in an Agentic World.” The speaker, Rajiv Shah from Contextual, sets the stage for a technical deep dive into Retrieval-Augmented Generation (RAG).

He outlines the agenda, promising to move beyond basic RAG concepts to focus specifically on retrieval approaches. The talk is designed to cover the spectrum from traditional methods like BM25 and Language Models to the emerging field of Agentic Search.

2. ACME GPT

Slide 2

(Timestamp: 00:40)

This slide displays a stylized logo for “ACME GPT,” representing the typical enterprise aspiration. Companies see tools like ChatGPT and immediately want to apply that capability to their internal data, asking questions like, “Can I get the list of board of directors?”

However, the speaker notes a common hurdle: generic models don’t know enterprise-specific knowledge. This sets up the necessity for RAG—injecting private data into the model—rather than relying solely on the model’s pre-trained knowledge.

3. Building RAG is Easy

Slide 3

(Timestamp: 01:10)

The speaker illustrates the deceptively simple workflow of a basic RAG demo. The diagram shows the standard path: a user query is converted to vectors, matched against a database, and sent to an LLM.

Shah acknowledges that building a “hello world” version of this is trivial. He notes, “You can build a very easy RAG demo out of the box by just grabbing some data, using an embedding model, creating vectors, doing the similarity.”

4. Building RAG is Easy (Code Example)

Slide 4

(Timestamp: 01:22)

A Python code snippet using LangChain is displayed to reinforce how accessible basic RAG has become. The code demonstrates loading a document, chunking it, and setting up a retrieval chain in just a few lines.

This slide serves as a foil for the upcoming reality check. While the code works for a demo, it hides the immense complexity required to make such a system robust, accurate, and scalable in a real-world production environment.

5. RAG Reality Check

Slide 5

(Timestamp: 01:35)

The tone shifts to the challenges of production. The slide highlights a sobering statistic: 95% of Gen AI projects fail to reach production. The speaker details the specific reasons why demos fail when scaled: poor accuracy, unbearable latency, scaling issues with millions of documents, and ballooning costs.

He emphasizes a critical, often overlooked factor: Compliance. “Inside an enterprise, not everybody gets to read every document.” A demo ignores entitlements, but a production system cannot.

6. Maybe try a different RAG?

Slide 6

(Timestamp: 03:00)

This slide lists a dizzying array of RAG variants (GraphRAG, RAPTOR, CRAG, etc.) and retrieval techniques. It represents the “analysis paralysis” developers face when scouring arXiv papers for a solution to their accuracy problems.

Shah warns against blindly chasing the latest academic paper to fix fundamental system issues. “The answer is not in here of pulling together like a bunch of arXiv papers.” Instead, he advocates for a structured framework to make decisions.

7. Ultimate RAG Solution

Slide 7

(Timestamp: 03:30)

A humorous cartoon depicts a “Rube Goldberg” machine, representing the “Ultimate RAG Solution.” It mocks the tendency to over-engineer systems with too many interconnected, fragile components in the pursuit of performance.

The speaker uses this visual to argue for simplicity and deliberate design. The goal is to avoid building a monstrosity that is impossible to maintain, urging the audience to think about trade-offs before complexity.

8. RAG as a system

Slide 8

(Timestamp: 03:35)

The speaker introduces a clean system architecture for RAG, broken into four distinct stages: Parsing, Querying, Retrieving, and Generation. This framework serves as the mental map for the rest of the presentation.

He highlights that “Parsing” is vastly overlooked—getting information out of complex documents cleanly is a prerequisite for success. Today’s talk, however, will zoom in specifically on the Retrieving and Querying components.

9. Designing a RAG Solution

Slide 9

(Timestamp: 04:10)

This slide presents a “Tradeoff Triangle” for RAG, balancing Problem Complexity, Latency, and Cost. The speaker advises having a serious conversation with stakeholders about these constraints before writing code.

A key concept introduced here is the “Cost of a mistake.” In coding assistants, a mistake is low-cost (the developer fixes it). In medical RAG systems, the cost of a mistake is high (life or death), which dictates a completely different architectural approach.

10. RAG Considerations

Slide 10

(Timestamp: 05:30)

A detailed table breaks down specific considerations that influence RAG design, such as domain difficulty, multilingual requirements, and data quality. This slide was originally created for sales teams to help scope customer problems.

Shah emphasizes that understanding the nuances of the use case upfront saves heartache later. For instance, knowing if users will ask simple questions or require complex reasoning changes the retrieval strategy entirely.

11. Consider Query Complexity

Slide 11

(Timestamp: 06:15)

The speaker categorizes queries by complexity, ranging from simple Keywords (“Total Revenue”) to Semantic variations (“How much bank?”), to Multi-hop reasoning, and finally Agentic scenarios.

He points out a common failure mode: “The answers aren’t in the documents… all of a sudden they’re asking for knowledge that’s outside.” Recognizing the query complexity determines whether you need a simple search engine or a complex agentic workflow.

12. Retrieval (Highlighted)

Slide 12

(Timestamp: 07:32)

The presentation zooms back into the system diagram, highlighting the “Retrieving” box. This signals the start of the deep technical dive into retrieval algorithms.

Shah notes that this area causes the most confusion due to the sheer number of model choices and architectures available. He aims to provide a practical guide to selecting the right retrieval tool.

13. Retrieval Approaches

Slide 13

(Timestamp: 08:16)

Three primary retrieval pillars are introduced: 1. BM25: The lexical, keyword-based standard. 2. Language Models: Semantic embeddings and vector search. 3. Agentic Search: The new frontier of iterative reasoning.

The speaker emphasizes that documents must be broken into pieces (chunking) because no single model context window is efficient enough to hold all enterprise data for every query.

14. Building RAG is Easy (Code Highlight)

Slide 14

(Timestamp: 08:50)

Returning to the initial code snippet, the speaker highlights the vectorstore and retriever initialization lines. This pinpoints exactly where the upcoming concepts fit into the implementation.

This visual anchor helps developers map the theoretical concepts of BM25 and Embeddings back to the actual lines of code they write in libraries like LangChain or LlamaIndex.

15. BM25

Slide 15

(Timestamp: 09:18)

BM25 (Best Match 25) is explained as a probabilistic lexical ranking function. The slide visualizes an inverted index, mapping words (like “butterfly”) to the specific documents containing them.

Shah explains that this is the 25th iteration of the formula, designed to score documents based on word frequency and saturation. It remains a powerful, fast baseline for retrieval.
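
To make the scoring concrete, here is a compact sketch of Okapi BM25 over an inverted index. The whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the smoothed IDF variant are assumptions chosen for illustration; production libraries differ in these details.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    return text.lower().split()


class BM25Index:
    def __init__(self, docs: List[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        tokenized = [tokenize(doc) for doc in docs]
        self.doc_len = [len(tokens) for tokens in tokenized]
        self.avgdl = sum(self.doc_len) / len(docs)
        # Inverted index: term -> {doc_id: term frequency in that doc}
        self.postings: Dict[str, Dict[int, int]] = defaultdict(dict)
        for doc_id, tokens in enumerate(tokenized):
            for term, tf in Counter(tokens).items():
                self.postings[term][doc_id] = tf

    def idf(self, term: str) -> float:
        n_docs = len(self.doc_len)
        n_with_term = len(self.postings.get(term, {}))
        return math.log((n_docs - n_with_term + 0.5) / (n_with_term + 0.5) + 1.0)

    def search(self, query: str, top_k: int = 3) -> List[Tuple[int, float]]:
        scores: Dict[int, float] = defaultdict(float)
        for term in tokenize(query):
            idf = self.idf(term)
            # Only documents containing the term are visited.
            for doc_id, tf in self.postings.get(term, {}).items():
                length_norm = 1 - self.b + self.b * self.doc_len[doc_id] / self.avgdl
                # tf saturates: extra occurrences add less and less to the score.
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * length_norm)
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


docs = [
    "the butterfly landed on the flower",
    "the doctor examined the patient",
    "a butterfly garden attracts many species",
]
index = BM25Index(docs)
butterfly_hits = index.search("butterfly")
physician_hits = index.search("physician")
```

The query "physician" returns nothing even though a document about a doctor exists, because lexical matching only sees exact terms.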

16. BM25 Performance

Slide 16

(Timestamp: 09:55)

A table compares the speed of a Linear Scan (Ctrl+F style) versus an Inverted Index (BM25) as the document count grows from 1,000 to 9,000.

The data shows that linear search slows down steadily as the corpus grows (taking 3,000 seconds for 1k documents in this synthetic test), while BM25 remains orders of magnitude faster. This efficiency is why lexical search is still widely used in production.

17. BM25 Failure Cases

Slide 17

(Timestamp: 11:08)

The limitations of BM25 are exposed. Because it relies on exact word matches, it fails when users use synonyms. If a user searches for “Physician” but the documents only contain “Doctor,” BM25 will return zero results.

Similarly, it struggles with acronyms like “IBM” vs “International Business Machines.” Despite this, Shah argues BM25 is a “very strong baseline” that often beats complex neural models on specific keyword-heavy datasets.

18. Hands on: BM25s

Slide 18

(Timestamp: 12:14)

For developers wanting to implement this, the slide points to a library called bm25s, a high-performance Python implementation available on Hugging Face.

This reinforces the practical nature of the talk—BM25 isn’t just a legacy concept; it is an active, installable tool that developers should consider using alongside vector search.

19. Enter Language Models

Slide 19

(Timestamp: 12:24)

The talk transitions to Language Models (Embeddings). The slide explains how an encoder model turns text into a dense vector (a list of numbers) that captures semantic meaning.

Because these models are trained on vast amounts of data, they “have an idea of these similar concepts.” This solves the synonym problem that plagues BM25.

20. Embeddings Visualized

Slide 20

(Timestamp: 12:50)

A 2D visualization demonstrates how embeddings group related concepts in latent space. The words “Doctor” and “Physician” would be located very close to each other mathematically.

This spatial proximity allows for Semantic Search: finding documents that mean the same thing as the query, even if they don’t share a single word.
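
"Close" is usually measured with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embedding models produce hundreds or thousands of dimensions, and these numbers do not come from any model.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, hand-picked so related words point the same way.
toy_embeddings = {
    "doctor": [0.90, 0.80, 0.10],
    "physician": [0.85, 0.82, 0.12],
    "butterfly": [0.10, 0.20, 0.95],
}

related = cosine_similarity(toy_embeddings["doctor"], toy_embeddings["physician"])
unrelated = cosine_similarity(toy_embeddings["doctor"], toy_embeddings["butterfly"])
```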

21. Semantic search is widely used

Slide 21

(Timestamp: 13:15)

The speaker validates the importance of semantic search by showing a tweet from Google’s SearchLiaison regarding BERT, and a screenshot of Hugging Face’s model repository.

This confirms that semantic search is the industry standard for modern information retrieval, having been deployed at massive scale by tech giants to improve result relevance.

22. Which language model?

Slide 22

(Timestamp: 13:30)

A scatter plot compares various models based on Inference Speed (X-axis) and NDCG@10 (Y-axis, a measure of retrieval quality).

Shah places BM25 on the right (fast but lower accuracy) to orient the audience. He points out that there is a massive variety of models with different trade-offs between compute cost and retrieval quality.
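
For readers unfamiliar with the Y-axis metric, a simplified NDCG@k can be sketched as follows. It assumes graded relevance labels listed in ranked order, uses linear gain (some definitions use 2^rel - 1), and derives the ideal ordering from the retrieved list only, which is a simplification of how benchmarks compute it.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: relevance discounted by log2 of the rank."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG divided by the DCG of the best possible ordering of the same items."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k([3, 2, 1])   # most relevant results ranked first
reversed_ = ndcg_at_k([1, 2, 3])  # same results, worst order
```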

23. Static Embeddings

Slide 23

(Timestamp: 14:43)

The speaker introduces Static Embeddings (like Word2Vec or GloVe), which sit on the far right of the previous scatter plot: extremely fast, even on CPUs.

These models assign a fixed vector to every word. While efficient, they lack context. The word “bank” has the same vector whether referring to a river bank or a financial bank, which limits their accuracy.

24. Why Context Matters

Slide 24

(Timestamp: 15:16)

A cartoon illustrates the difference between Static Embeddings and Transformers. The Transformer can distinguish between “Model” in a data science context versus “Model” in a fashion context.

This contextual awareness is why modern Transformer-based embeddings (like BERT) generally outperform static embeddings and BM25 in complex retrieval tasks, despite being slower.

25. Many more models!

Slide 25

(Timestamp: 15:55)

Returning to the scatter plot, a red arrow points toward the top-left quadrant—models that are slower but achieve higher accuracy.

The speaker notes that the field is constantly evolving, with “newer generations of models” pushing the boundary of what is possible in terms of retrieval quality.

26. MTEB/RTEB

Slide 26

(Timestamp: 16:35)

To help developers choose, Shah introduces the MTEB (Massive Text Embedding Benchmark) and RTEB (Retrieval Embedding Benchmark). These are leaderboards hosted on Hugging Face.

He highlights a key distinction: MTEB uses public datasets, while RTEB uses private, held-out datasets. This is crucial for avoiding “data contamination,” where models perform well simply because they were trained on the test data.

27. Selecting an embedding model

Slide 27

(Timestamp: 16:48)

The speaker switches to a live browser view (captured in the slide) of the leaderboard. He discusses the bubble chart visualization where size often correlates with parameter count.

He points out an interesting trend: “You’ll see that there’s a bunch of models here that are all the same size… but the performance differs.” This indicates improvements in training strategies and architecture rather than just throwing more compute at the problem.

28. Selecting an embedding model (Other Considerations)

Slide 28

(Timestamp: 19:07)

Beyond the leaderboard score, Shah lists practical selection criteria: Model Size (can it fit in memory?), Architecture (CPU vs GPU), Embedding Dimension (storage costs), and Training Data (multilingual support).

He advises checking whether a model is open source and quantizable, as quantization can significantly reduce latency without a major hit to accuracy.

29. Matryoshka Embedding Models

Slide 29

(Timestamp: 20:53)

A specific innovation is highlighted: Matryoshka Embeddings. These models allow developers to truncate vectors (e.g., from 768 dimensions down to 64) while retaining most of the performance.

This is a “neat kind of innovation” for optimizing storage and search speed. OpenAI’s newer models also support this feature, offering flexibility between cost and accuracy.
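
A minimal sketch of the truncation step, assuming the embedding came from a model trained with a Matryoshka objective (truncating an ordinary embedding this way would likely lose far more quality). The random vector is only a stand-in, and `truncate_embedding` is an illustrative name.

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and re-normalize to unit length."""
    head = np.asarray(vec, dtype=float)[:dim]
    norm = np.linalg.norm(head)
    return head / norm if norm > 0 else head

rng = np.random.default_rng(0)
full = rng.normal(size=768)           # stand-in for a 768-dimensional embedding
small = truncate_embedding(full, 64)  # 12x fewer numbers to store and compare
print(small.shape, round(float(np.linalg.norm(small)), 6))
```

Re-normalizing after truncation keeps cosine and dot-product similarity equivalent on the shortened vectors.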

30. Sentence Transformer

Slide 30

(Timestamp: 21:42)

The Sentence Transformer architecture is described as the dominant approach for RAG. Unlike standard BERT which works on tokens, these are fine-tuned to understand full sentences and paragraphs.

This architecture uses Siamese networks to ensure that semantically similar sentences are close in vector space, making them ideal for the “chunk-level” retrieval required in RAG.

31. Cross Encoder / Reranker

Slide 31

(Timestamp: 22:16)

The concept of a Cross Encoder (or Reranker) is introduced. Unlike the bi-encoder (retriever) which processes query and document separately, the cross-encoder processes them together.

This allows for a much deeper calculation of relevance. It is typically used as a second stage: retrieve 50 documents quickly with vectors, then use the slow but accurate Cross Encoder to rank the top 5.
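
The two-stage flow can be sketched as follows. The scoring functions are hypothetical stand-ins: in a real pipeline `fast_score` would be a bi-encoder similarity and `slow_score` a cross-encoder that reads the query and document together.

```python
def retrieve_then_rerank(query, docs, fast_score, slow_score,
                         n_candidates=50, n_final=5):
    """Stage 1: cheap scoring over all docs. Stage 2: expensive scoring on the shortlist."""
    shortlist = sorted(docs, key=lambda d: fast_score(query, d), reverse=True)[:n_candidates]
    return sorted(shortlist, key=lambda d: slow_score(query, d), reverse=True)[:n_final]

# Toy stand-in for the fast retriever: count shared words.
def word_overlap(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))

# Toy stand-in for the reranker: reward an exact phrase match as well.
def phrase_match(query, doc):
    return word_overlap(query, doc) + (5 if query.lower() in doc.lower() else 0)

docs = [
    "reset your password from the login page",
    "password rules and reset policy",
    "how to reset a router",
    "billing and invoices",
]
top = retrieve_then_rerank("reset your password", docs, word_overlap, phrase_match,
                           n_candidates=3, n_final=1)
print(top)
```

The expensive scorer only ever sees the shortlist, which is what keeps the second stage affordable.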

32. Cross Encoder / Reranker (Duplicate)

Slide 32

(Timestamp: 22:16)

(This slide reinforces the previous diagram, emphasizing the “crossing” of the query and document in the model architecture.)

33. Cross Encoder / Reranker (Accuracy Boost)

Slide 33

(Timestamp: 23:07)

A bar chart quantifies the value of reranking. It shows a significant boost in NDCG (accuracy) when a reranker is added to the pipeline.

The speaker notes that while you get a “bump” in quality, it “doesn’t come for free.” The trade-off is increased latency, as the cross-encoder is computationally expensive.

34. Cross Encoder / Reranker (Execution Flow)

Slide 34

(Timestamp: 23:15)

The execution flow diagram highlights the reranker’s position in the pipeline. It sits between the Vector Store retrieval and the LLM generation.

This visual reinforces the latency implication: the user has to wait for both the initial search and the reranking pass before the LLM even starts generating an answer.

35. Hands On: Retriever & Reranker

Slide 35

(Timestamp: 23:30)

A screenshot of a Google Colab notebook is shown, demonstrating a practical implementation of the Retrieve and Re-rank strategy using the SentenceTransformer and CrossEncoder classes from the sentence-transformers library.

This provides a concrete resource for the audience to test the accuracy vs. speed trade-offs themselves on simple datasets like Wikipedia.

36. Instruction Following Reranker

Slide 36

(Timestamp: 23:48)

Shah mentions a specific advancement: Instruction Following Rerankers (developed by his company, Contextual AI). These allow developers to pass a prompt to the reranker, such as “Prioritize safety notices.”

This adds a “knob” for developers to tune retrieval based on business logic without retraining the model.

37. Combine Multiple Retrievers

Slide 37

(Timestamp: 24:19)

The presentation suggests that you don’t have to pick just one method. You can combine BM25, various embedding models (E5, BGE), and rerankers.

While combining them (Ensemble Retrieval) often yields better recall, Shah warns that “you got to engineer this.” Managing multiple indexes and fusion logic increases operational complexity and compute costs.

38. Cascading Rerankers in Kaggle

Slide 38

(Timestamp: 24:56)

A complex diagram from a Kaggle competition winner illustrates a Cascade Strategy. The solution used three different rerankers, filtering from 64 documents down to 8, and then to 5.

This shows the extreme end of retrieval engineering, where multiple models are chained to squeeze out every percentage point of accuracy.

39. Best practices

Slide 39

(Timestamp: 25:16)

Shah distills the complexity into a recommended Best Practice:

1. Hybrid Search: Combine Semantic Search (Vectors) and Lexical Search (BM25).
2. Reciprocal Rank Fusion: Merge the results.
3. Reranker: Pass the top results through a cross-encoder.

This setup provides a “pretty good standard performance out of the box” and should be the default baseline before trying exotic methods.
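
The fusion step is small enough to sketch directly. This is a minimal version of Reciprocal Rank Fusion, assuming each retriever returns an ordered list of document ids; `k=60` is a commonly used constant, and the ids below are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: each list adds 1 / (k + rank) to a doc's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d2"]     # hypothetical lexical results
vector_hits = ["d1", "d4", "d3"]   # hypothetical semantic results
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Because only ranks are used, the raw BM25 and cosine scores never need to be put on a common scale.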

40. Families of Embedding Models

Slide 40

(Timestamp: 25:42)

A taxonomy slide categorizes the models discussed: Static (Fastest/Low Accuracy), Bi-Encoders (Fast/Good Accuracy), and Cross-Encoders (Slow/Best Accuracy).

This summary helps the audience mentally organize the tools available in their toolbox.

41. Lots of New Models

Slide 41

(Timestamp: 25:50)

Logos for IBM Granite, Google EmbeddingGemma, and others appear. The speaker notes that while new models from major players appear weekly, the improvements are often “incremental.”

He advises against “ripping up” a working system just to switch to a model that is 1% better on a leaderboard.

42. Other retrieval methods

Slide 42

(Timestamp: 26:18)

Alternative methods are briefly listed: SPLADE (Sparse retrieval), ColBERT (Late interaction), and GraphRAG.

Shah acknowledges these exist and may fit specific niches, but warns against chasing the “flavor of the week” before establishing a solid baseline with hybrid search.

43. Operational Concerns

Slide 43

(Timestamp: 27:30)

The talk shifts to operations. Libraries like FAISS are mentioned for efficient vector similarity search.

A key point is that for many use cases, you can simply store embeddings in memory. You don’t always need a complex vector database if your dataset fits in RAM.
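
As a rough illustration of the in-memory option, exact search over a NumPy matrix is a few lines. The random matrix stands in for stored embeddings, and `top_k_cosine` is an illustrative name rather than a library function.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=3):
    """Exact nearest neighbors by cosine similarity over an in-memory matrix."""
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = docs @ q
    order = np.argsort(-sims)[:k]
    return order, sims[order]

rng = np.random.default_rng(0)
doc_matrix = rng.normal(size=(10_000, 384))               # stand-in for stored embeddings
query_vec = doc_matrix[42] + 0.01 * rng.normal(size=384)  # a query close to document 42
idx, sims = top_k_cosine(query_vec, doc_matrix)
print(idx[0])
```

Libraries like FAISS become useful when the matrix grows large enough that brute-force scoring is too slow.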

44. Vector Database Options

Slide 44

(Timestamp: 27:55)

A diagram categorizes storage into Hot (In-Memory), Warm (SSD/Disk), and Cold tiers.

Shah notes there are “tons of vector database options” (Snowflake, Pinecone, etc.). The choice should be governed by latency requirements. If you need sub-millisecond retrieval, you need in-memory storage.

45. Operational Concerns (Datastore Size)

Slide 45

(Timestamp: 28:40)

A graph shows that as Datastore Size increases (X-axis), retrieval performance naturally degrades (Y-axis).

To combat this, the speaker strongly recommends using Metadata Filtering. “If you’re not using something like metadata… it’s going to be very tough.” Narrowing the search scope is essential for scaling to millions of documents.
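
A minimal sketch of metadata filtering, assuming each document carries a small dictionary of metadata alongside its vector. The field name `year` and the helper `filtered_search` are hypothetical.

```python
import numpy as np

def filtered_search(query_vec, doc_matrix, metadata, filters, k=3):
    """Restrict candidates by metadata first, then rank the survivors by dot product."""
    keep = [i for i, meta in enumerate(metadata)
            if all(meta.get(key) == value for key, value in filters.items())]
    if not keep:
        return []
    scores = doc_matrix[keep] @ query_vec
    ranked = np.argsort(-scores)[:k]
    return [keep[i] for i in ranked]

doc_matrix = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]])
metadata = [{"year": 2023}, {"year": 2024}, {"year": 2024}, {"year": 2024}]
hits = filtered_search(np.array([1.0, 0.0]), doc_matrix, metadata, {"year": 2024}, k=2)
print(hits)
```

Document 0 is the closest vector overall, but the filter removes it before any similarity is computed, so the search scope shrinks along with the chance of a wrong match.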

46. Search Strategy Comparison

Slide 46

(Timestamp: 29:22)

The presentation pivots to the “exciting part”: Agentic RAG. A visual compares “Traditional RAG” (a linear path) with “Agentic RAG” (a winding, exploratory path).

This represents the shift from a “one-shot” retrieval attempt to an iterative system that can explore, backtrack, and reason.

47. Tools use / Reasoning

Slide 47

(Timestamp: 29:40)

Reasoning models (like o1 or DeepSeek R1) enable LLMs to use tools effectively. A code snippet shows an agent loop: query -> generate -> “Did it answer the question?”

If the answer is no, the model can “rewrite the query… try to find that missing information, feed that back into the loop.” This self-correction is the core of Agentic RAG.
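
The loop can be sketched as below. This is not the code on the slide: `retrieve`, `generate`, `is_answered`, and `rewrite` are hypothetical stand-ins for a search backend and LLM calls, replaced here with toy lambdas so the control flow is runnable.

```python
def agentic_rag(question, retrieve, generate, is_answered, rewrite, max_steps=3):
    """Retrieve, draft an answer, check it, and rewrite the query if it falls short."""
    query, context, answer = question, [], None
    for _ in range(max_steps):
        context.extend(retrieve(query))
        answer = generate(question, context)
        if is_answered(question, answer):
            break
        query = rewrite(question, answer, context)  # try a different search next round
    return answer

# Toy stand-ins: a one-entry keyword index and trivial "LLM" behavior.
corpus = {"capital france": "Paris is the capital of France."}
retrieve = lambda q: [corpus[q]] if q in corpus else []
generate = lambda question, ctx: ctx[-1] if ctx else "I don't know."
is_answered = lambda question, answer: answer != "I don't know."
rewrite = lambda question, answer, ctx: "capital france"

result = agentic_rag("What is the capital of France?",
                     retrieve, generate, is_answered, rewrite)
print(result)
```

The `max_steps` cap is the simplest guard against a loop that keeps searching without converging.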

48. Agentic RAG (Workflow)

Slide 48

(Timestamp: 30:32)

A flowchart details the Agentic RAG lifecycle. The model thinks through steps: “Oh, this is the query I need to make… based on those results… maybe we should do it a different way.”

This workflow allows the system to synthesize answers from multiple sources or clarify ambiguous queries automatically.

49. Tools use / Reasoning (Detailed Example)

Slide 49

(Timestamp: 30:35)

A specific example of a complex query is shown. The agent breaks the problem down, calls tools, and iterates.

This demonstrates that the “Thinking” time is where the value is generated, allowing for a depth of research that a single retrieval pass cannot match.

50. Open Deep Research

Slide 50

(Timestamp: 31:02)

Shah references “Open Deep Research” by LangChain, an open-source framework where sub-agents go out, perform research, and report back.

This is a specific category of Agentic RAG focused on generating comprehensive reports rather than quick answers.

51. DeepResearch Bench

Slide 51

(Timestamp: 31:30)

A leaderboard for DeepResearch Bench is shown, testing models on “100 PhD level research tasks.”

The speaker warns that this approach “can get very expensive.” Solving a single complex query might cost significant money due to the number of tokens and iterative steps required.

52. Westlaw AI Deep Research

Slide 52

(Timestamp: 31:55)

A real-world application is highlighted: Westlaw AI. In the legal field, thoroughness is worth the latency and cost.

This proves that Agentic RAG isn’t just a toy; it is being commercialized in high-value verticals where accuracy is paramount.

53. Agentic RAG (Self-RAG)

Slide 53

(Timestamp: 32:11)

The concept of Self-RAG is introduced, emphasizing the “Reflection” step. The model critiques its own retrieved documents and generation quality.

Shah notes that this isn’t brand new, but has become practical due to better reasoning models.

54. Agentic RAG (LangChain Reddit)

Slide 54

(Timestamp: 34:04)

A Reddit post is shown where a developer discusses building a self-reflection RAG system. This highlights the community’s active experimentation with these loops.

55. Agentic RAG (Efficiency Concerns)

Slide 55

(Timestamp: 34:15)

The discussion turns to the “Rub”: Inefficiency. Agentic loops can be slow and wasteful, re-retrieving data unnecessarily.

This sets up the trade-off conversation again: Is the extra time and compute worth the accuracy gain?

56. Research: BRIGHT

Slide 56

(Timestamp: 32:11)

Note: The speaker introduces the BRIGHT benchmark around 32:11, slightly out of slide order in the transcript flow, but connects it here.

BRIGHT is a benchmark specifically designed for Retrieval Reasoning. Unlike standard benchmarks that test keyword matching, BRIGHT tests questions that require thinking, logic, and multi-step deduction to find the correct document.

57. BRIGHT #1: DIVER

Slide 57

(Timestamp: 32:48)

The top-performing system on BRIGHT is DIVER. The diagram shows it uses the exact components discussed earlier: Chunking, Retrieving, and Reranking, but wrapped in an iterative loop.

Shah points out, “It probably doesn’t look that crazy to you if you’re used to RAG.” The innovation is in the process, not necessarily a magical new model architecture.

58. BRIGHT #1: DIVER (LLM Instructions)

Slide 58

(Timestamp: 33:31)

The specific prompts used in DIVER are shown. The system asks the LLM: “Given a query… what do you think would be possibly helpful to do?”

This Query Expansion allows the system to generate new search terms that the user didn’t think of, bridging the semantic gap through reasoning.

59. Agentic RAG on WixQA

Slide 59

(Timestamp: 34:36)

Shah shares his own experiment results on the WixQA dataset (technical support):

One Shot RAG: 5 seconds latency, 76% Factuality.
Agentic RAG: Slower latency, 93% Factuality.

This massive jump in accuracy (0.76 to 0.93) is the key takeaway. “That has a ton of implications.” It suggests that the limitation of RAG often isn’t the data, but the lack of reasoning applied to the retrieval process.

60. Rethink your Assumptions

Slide 60

(Timestamp: 37:10)

This is the climax of the technical argument. A graph from the BRIGHT paper shows that BM25 (lexical search) combined with an Agentic loop (GPT-4) outperforms advanced embedding models (Qwen).

“This is crazy,” Shah exclaims. Because the LLM can rewrite queries into many variations, it mitigates BM25’s weakness (synonyms). This implies you might not need complex vector databases if you have a smart agent.

61. Agentic RAG with BM25

Slide 61

(Timestamp: 38:20)

Shah validates the paper’s finding with his own internal data (Financial 10Ks). Agentic RAG with BM25 performed nearly as well as Agentic RAG with Embeddings.

He suggests a radical possibility: “I could throw all that away [vector DBs]… just stick this in a text-only database and use BM25.”

62. Agentic RAG for Code Search

Slide 62

(Timestamp: 39:46)

He connects this finding to Claude Code, which uses a lexical approach (like grep) rather than vectors for code search.

Since code doesn’t have the same semantic ambiguity as natural language, and agents can iterate rapidly, lexical search is proving to be superior for coding assistants.

63. Combine Retrieval Approaches

Slide 63

(Timestamp: 40:15)

A DoorDash case study illustrates a two-tier guardrail system. They use simple text similarity first (fast/cheap). If that fails or is uncertain, they kick it to an LLM (slow/expensive).

This “Tiered” approach optimizes the trade-off between cost and accuracy in production.
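
A rough sketch of the tiered idea, not DoorDash’s actual implementation: the 0.8 threshold, the use of `difflib` for text similarity, and the `llm_check` callable are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

def tiered_check(response, reference, llm_check, threshold=0.8):
    """Cheap text similarity first; escalate to the expensive check only when unsure."""
    similarity = SequenceMatcher(None, response.lower(), reference.lower()).ratio()
    if similarity >= threshold:
        return "pass", "similarity"
    return ("pass" if llm_check(response, reference) else "fail"), "llm"

# Hypothetical stand-in for an LLM judge that rejects everything it is shown.
always_reject = lambda response, reference: False

close = tiered_check("Refunds take 5 days.", "Refunds take 5 days.", always_reject)
far = tiered_check("Your order ships tomorrow.", "Refunds take 5 days.", always_reject)
print(close, far)
```

Only the second call reaches the expensive tier, which is where the cost savings come from.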

64. Hands on: Agentic RAG (Smolagents)

Slide 64

(Timestamp: 41:07)

The speaker points to Smolagents, a Hugging Face library, as a way to get hands-on with these concepts. A Colab notebook is provided for the audience to build their own agentic retrieval loops.

65. Solutions for a RAG Solution

Slide 65

(Timestamp: 41:18)

Shah updates the “Problem Complexity” framework from the beginning of the talk with specific recommendations:

Low Latency (<5s): Use BM25 or Static Embeddings.
High Cost of Mistake: Add a Reranker.
Complex Multi-hop: Use Agentic RAG.

66. Retriever Checklist

Slide 66

(Timestamp: 41:52)

A final checklist summarizes the retrieval hierarchy:

1. Keyword/BM25 (The baseline).
2. Semantic Search (The standard).
3. Agentic/Reasoning (The problem solver).

This provides the audience with a mental menu to choose from based on their specific constraints.

67. RAG as a system (Retrieval with Instruction Following Reranker)

Slide 67

(Timestamp: 42:00)

The system diagram is shown one last time, updated to include the Instruction Following Reranker in the retrieval box, solidifying the modern RAG architecture.

68. RAG - Generation

Slide 68

(Timestamp: 42:10)

Note: The speaker concludes the talk at 42:10, stating “I’m going to end it here.” Slides 68-70 regarding the Generation stage were included in the deck but skipped in the video recording due to time constraints.

This slide would have covered the final stage of RAG: generating the answer. The focus here is typically on reducing hallucinations and ensuring the tone matches the user’s needs.

69. RAG - Generation (Model Selection)

Slide 69

(Timestamp: 42:10)

Skipped in video. This slide illustrates the choice of LLM for generation (e.g., GPT-4 vs Llama 3 vs Claude). The choice depends on the “Cost/Latency budget” and specific domain requirements.

70. Chunking approaches

Slide 70

(Timestamp: 42:10)

Skipped in video. This slide compares Original Chunking (cutting text at fixed intervals) with Contextual Chunking (adding a summary prefix to every chunk). Contextual chunking significantly improves retrieval because every chunk carries the context of the parent document.
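
The difference between the two approaches can be sketched in a few lines. In practice the summary prefix would typically be generated by an LLM; here it is a fixed, made-up string, and both function names are illustrative.

```python
def fixed_chunks(text, size=200):
    """Original chunking: cut the text every `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def contextual_chunks(text, doc_summary, size=200):
    """Contextual chunking: prepend a document-level summary to every chunk."""
    return [f"{doc_summary}\n\n{chunk}" for chunk in fixed_chunks(text, size)]

text = "Revenue grew 12% year over year. " * 20
summary = "From the ACME 2024 annual report, financial results section."  # hypothetical
chunks = contextual_chunks(text, summary)
print(len(chunks))
print(chunks[1][:80])
```

Each chunk now names its parent document, so a chunk that only says “Revenue grew 12%” can still be matched to a query about a specific company and year.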

71. Title Slide (Duplicate)

Slide 71

(Timestamp: 42:10)

The presentation concludes with the title slide. Rajiv Shah thanks the audience, encouraging them to think about trade-offs rather than just chasing the latest models. “Hopefully I’ve given you a sense of thinking about these trade-offs… thank you all.”

This annotated presentation was generated from the talk using AI-assisted tools. Each slide includes timestamps and detailed explanations.

Understanding Sparse Matrices through Interactive Visualizations

Fri, 07 Mar 2025 06:00:00 GMT

When working with machine learning models, preparing data properly is essential. One common preprocessing technique is one-hot encoding, which transforms categorical data into a format algorithms can understand. However, this transformation often creates sparse matrices - dataframes where most values are zero.

Basic One-Hot Encoding

The first animation illustrates the fundamental concept of one-hot encoding. This transformation converts a single categorical column (like “city”) into multiple binary columns, where each column represents one possible category value.

View the basic one-hot encoding animation

This visualization walks through the transformation step-by-step:

Starting with the original dataset containing categorical values
Adding binary indicator columns for each category
Showing how the dataset becomes wider but sparse (mostly filled with zeros)
Demonstrating how the original categorical column becomes redundant

In traditional tabular data processing, we often don’t see this sparsity visually. The animation makes it clear how one-hot encoding dramatically changes the structure of our data.
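
The same transformation can be reproduced in a few lines of pandas. This is a small sketch with made-up data, not the dataset used in the animation.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Austin", "Boston", "Chicago", "Austin", "Denver"],
                   "sales": [10, 12, 9, 14, 7]})

# One binary indicator column per city; the original "city" column is dropped.
encoded = pd.get_dummies(df, columns=["city"], dtype=int)

indicator_cols = [c for c in encoded.columns if c.startswith("city_")]
sparsity = (encoded[indicator_cols] == 0).to_numpy().mean()
print(encoded)
print(f"{sparsity:.0%} of indicator values are zero")
```

Each row has exactly one indicator set to 1, so the share of zeros grows as the number of distinct categories grows.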

The Curse of Dimensionality

The second animation takes the concept further by demonstrating what happens with high-cardinality categorical features - those with many possible values.

View the curse of dimensionality animation

This more advanced visualization shows how one-hot encoding can lead to the “curse of dimensionality”:

Starting with a modest 4-column dataset
Expanding to over 150 columns when encoding a categorical feature with many values
Creating an extremely sparse matrix where 99% of values are zeros
Illustrating the practical challenges this presents for machine learning

Why It Matters

Understanding the sparsity that results from one-hot encoding is crucial for several reasons:

Memory usage: Sparse matrices can consume excessive memory if not properly handled
Computational efficiency: Processing mostly-zero matrices is inefficient
Model performance: Many algorithms struggle with extremely sparse data
Feature selection: With hundreds of binary columns, feature selection becomes critical

For high-cardinality features, consider alternatives like feature hashing, target encoding, or embeddings to avoid the dimensionality explosion shown in the second animation.
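
To put a number on the memory point, here is a sketch comparing dense and sparse storage for a one-hot encoded column. The column is synthetic (10,000 rows drawn from 150 category ids), and the byte counts are computed from the array sizes rather than measured.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
# Hypothetical high-cardinality column: 10,000 rows drawn from 150 category ids.
categories = rng.integers(0, 150, size=(10_000, 1))

encoded = OneHotEncoder().fit_transform(categories)  # SciPy sparse matrix by default
n_rows, n_cols = encoded.shape

dense_bytes = n_rows * n_cols * encoded.dtype.itemsize
sparse_bytes = encoded.data.nbytes + encoded.indices.nbytes + encoded.indptr.nbytes
density = encoded.nnz / (n_rows * n_cols)
print(encoded.shape, f"density={density:.4f}", dense_bytes, sparse_bytes)
```

The sparse format stores only the non-zero entries and their positions, which is why scikit-learn returns it by default for one-hot encoding.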

These visualizations help build intuition about what’s happening “under the hood” when we preprocess data - something that’s often hidden when we use high-level libraries that handle these transformations automatically.

Related videos: Sparsity in AI or Curse of Dimensionality or Reality of Models

Feature Selection Methods and Feature Selection Curves

Tue, 08 Oct 2024 05:00:00 GMT

How to Select the Best Features for Machine Learning!

Let’s deep dive into several feature selection techniques and help you figure out when to use each one. The notebook includes two data sources: the MNIST dataset and the Madelon dataset. The MNIST dataset is a collection of 28x28 pixel images of handwritten digits. The Madelon dataset is a synthetic dataset that you can control.

The notebook uses the following feature selection techniques:

F-statistic
Mutual Information
Logistic Regression
Logistic Regression with Lasso (L1) Regularization
Feature Importance
Boruta
MRMR (Minimum Redundancy Maximum Relevance)
Recursive Feature Elimination
Feature importance rank ensembling (FIRE)

To help visualize the feature selection process, the notebook includes a feature selection curve. The feature selection curve plots the number of features against the accuracy of the model. This helps you understand how many features you need to achieve a certain level of accuracy.
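
A minimal sketch of how such a curve can be computed, not necessarily how the notebook does it: rank the features with one scorer (the F-statistic here), then retrain a simple model on the top k features for increasing k. The synthetic dataset and the choice of logistic regression are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Rank features from most to least relevant according to the chosen scorer.
ranking = np.argsort(-f_classif(X_tr, y_tr)[0])

curve = []
for k in range(1, X.shape[1] + 1):
    cols = ranking[:k]
    model = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
    curve.append((k, model.score(X_te[:, cols], y_te)))

for k, acc in curve[:8]:
    print(k, round(acc, 3))
```

Plotting `k` against accuracy (for example with `plt.plot(*zip(*curve))`) gives the curve; the point where it flattens suggests how many features are worth keeping.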

This notebook is based on the following articles:
Feature Selection: How to Throw Away 95% of Your Features and Get 95% Accuracy and the associated notebook.

A companion video can be found on my YouTube channel, @rajistics: Feature Selection Methods and Feature Selection Curves. It’s about 15 minutes and gives more context to the notebook.

This blog post can be found at http://bit.ly/raj_fs or https://projects.rajivshah.com/blog/Feature_Selection.html

import warnings; warnings.filterwarnings("ignore")

import os
os.environ["KERAS_BACKEND"] = "torch"
import keras
import pandas as pd
import matplotlib.pyplot as plt

Import data

A. MNIST (Very visual dataset)

from keras.datasets import mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()

X_train = X_train.reshape(60000, 28 * 28)
X_test = X_test.reshape(10000, 28 * 28)

print(list(X_train[7, :]))

plt.imshow(X_train[7, :].reshape(28, 28), cmap = 'binary', vmin = 0, vmax = 255)
plt.xticks([])
plt.yticks([])
plt.savefig('sample_image.png')

X_train = pd.DataFrame(X_train)  # MNIST training pixels, one column per pixel
y_train = pd.Series(y_train)
X_test = pd.DataFrame(X_test)  # MNIST test pixels, one column per pixel
y_test = pd.Series(y_test)

B. Madelon (Very high-dimensional dataset that you control)

If you run the following cells, Madelon will be the dataset you use. If you want to use MNIST, you should skip the following cells.

Madelon is a favorite of mine because you know which features are carrying the signal and which ones are noise. In this case, the first 5 features will be informative. Because shuffle=False, the 3 redundant features (linear combinations of the informative ones) come next, in columns 5 through 7, and the remaining columns are noise. I often modify Madelon to include other types of noisy features, interactions, correlations, and then use this dataset to test various machine learning techniques. Since I know what the true signal is, this is very effective at helping me gauge the effectiveness of these methods.

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000,
                           n_features=40,
                           n_informative=5,
                           n_classes=3,
                           n_redundant = 3,
                           random_state=0,
                           flip_y =0.05,
                           class_sep = 0.5,
                           n_clusters_per_class=3,
                           shuffle=False)

X_df = pd.DataFrame(X, columns=[f'{i}' for i in range(X.shape[1])])
y_df = pd.Series(y, name='target')

X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.2, random_state=42)

X_train.columns = X_train.columns.astype(int)
X_test.columns = X_test.columns.astype(int)

X_train

	0	1	2	3	4	5	6	7	8	9	...	30	31	32	33	34	35	36	37	38	39
9254	0.926694	-1.773357	0.172527	0.217298	-1.733944	0.319415	0.185012	1.907097	-0.649596	0.121499	...	0.706101	-0.880984	-0.980460	-1.040219	-1.495820	2.793184	0.206932	-0.357897	-1.633463	-0.358298
1561	1.692096	-1.412235	1.294343	-0.672776	-0.576808	1.088448	0.446408	1.081032	-0.355654	-0.940438	...	-2.334601	-0.446046	-0.577543	-0.692218	-0.311946	0.329447	-1.312834	0.339797	-0.291047	0.931088
1670	-0.721183	-1.430124	0.776395	0.226875	-1.209252	-0.458278	-1.011414	1.682210	-1.048116	-1.783993	...	1.440083	-0.666334	-0.909174	0.377606	1.303421	-0.655019	0.003210	-0.802838	-1.305648	-0.170390
6087	1.429094	1.539467	0.230706	0.256132	-0.478975	-1.493286	1.738055	0.888900	0.164039	-2.488486	...	1.454662	0.493267	0.079875	-1.390000	1.330840	0.212113	1.955695	-0.567808	-0.883676	-0.472567
6669	0.207305	0.600810	0.477484	-0.784978	-0.651178	-0.362503	1.032674	0.369245	-0.659173	-1.210180	...	-3.207426	0.423698	1.538654	-0.856037	0.343482	-0.119711	-0.355270	0.724913	1.702261	-1.597048
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
5734	0.484935	0.695846	1.481478	0.223780	-1.012330	-2.116814	0.613437	2.080326	-0.730603	1.408916	...	0.328236	0.413631	0.337577	-0.747556	0.020008	-0.202360	1.484470	-0.465176	1.391591	0.294199
5191	-0.715630	-1.895183	-1.091845	-0.579646	-0.474871	2.217163	-0.666726	-0.763180	0.261672	1.570425	...	0.280017	0.836381	0.115396	-0.044588	0.516398	-0.630678	0.755802	0.016894	0.183862	-0.401010
5390	1.091513	-1.606975	-1.678945	-0.706068	-0.547585	2.905629	0.827132	-0.997257	-0.983815	-0.609981	...	0.001994	0.411601	-0.809740	-0.163079	0.020689	-0.731637	-0.154384	0.599125	1.094542	-1.020837
860	-1.696127	1.277728	0.043566	0.659020	0.537680	-1.793380	-0.878325	-0.168647	-0.712758	2.285642	...	0.233483	0.551602	0.139338	0.805881	-0.628342	0.532257	-0.107130	1.449110	-0.499819	-0.826810
7270	2.113646	-2.644226	-0.097455	-0.645787	-1.748579	2.288712	0.920350	1.266990	0.545140	-0.652207	...	-0.048207	1.331188	-0.683416	-0.014705	0.185842	1.100054	-0.244144	-1.529671	-1.914142	0.072284

8000 rows × 40 columns

Feature selection

Let’s go through several methods for feature selection.

Feature Selection Methods Comparison

Method	Pros	Cons	Best Used When	Computational Complexity
F-statistic	- Fast and simple - Works well for linear relationships - Easy to interpret	- Assumes linear relationship - Considers features independently - May miss interaction effects	- Initial screening - Linear problems - Need interpretable results	O(n)
Mutual Information	- Captures non-linear relationships - No assumptions about distribution	- Can be computationally intensive - May overfit with small samples	- Non-linear relationships - Complex interactions	O(n log n)
Logistic Regression	- Fast for high-dimensional data - Provides feature coefficients	- Assumes linear decision boundary - Sensitive to correlated features	- Binary classification - Need interpretable coefficients	O(n^2)
Lasso (L1)	- Fast for high-dimensional data - Automatically does feature selection	- May struggle with correlated features - Can be sensitive to outliers	- High-dimensional data - Need sparse solutions	O(n^2)
LightGBM	- Handles non-linear relationships - Considers feature interactions	- Can be computationally intensive - May overfit with small samples	- Complex relationships - Large datasets	O(n log n)
MRMR	- Considers feature redundancy - Good for correlated features	- Can be computationally intensive - May struggle with non-linear relationships	- Datasets with correlated features - Need diverse feature set	O(n^2)
RFE	- Considers feature interactions - Can capture complex relationships	- Computationally intensive - Can be unstable with small changes in data	- When computational cost isn’t an issue - Need very precise feature selection	O(n^2 log n)

1. F-statistic

f_classif relies on the Analysis of Variance (ANOVA) F-statistic to evaluate the relationship between each feature and the target variable. For each feature, it tests whether the feature’s mean value differs significantly across the classes of the target. The higher the F-value, the more likely it is that the feature discriminates between different classes. The test looks only at differences in class means, so it works best when the relationship is roughly linear, and it requires a categorical target.

See: https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection

from sklearn.feature_selection import f_classif

f = f_classif(X_train, y_train)[0]
f

array([2.40810399e+02, 6.64390517e+01, 5.42587843e+01, 1.71099993e-01,
       4.27226542e+01, 9.64735668e+01, 4.72136845e+01, 5.71846457e+01,
       1.27834979e+00, 1.42650284e+00, 3.40020422e-01, 4.08232233e-01,
       1.30120819e-01, 3.36734714e+00, 4.55665866e-01, 3.62300510e-01,
       1.57485437e-01, 6.93572687e-02, 1.64816305e+00, 2.91782944e+00,
       1.24026934e+00, 9.32533895e-01, 7.07099908e-01, 1.87544216e+00,
       1.10130690e+00, 3.54044700e-01, 1.15417945e+00, 2.59156089e-01,
       7.45820681e-01, 7.75403854e-01, 1.35835715e-01, 3.34985292e+00,
       8.36576456e-02, 5.15026453e-02, 4.33788709e-01, 3.12140721e-01,
       3.55118575e+00, 7.37076241e+00, 1.17274619e+00, 4.36532461e+00])
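
The raw F-values are easier to use once they are turned into a ranking. A small sketch on a separate synthetic dataset (so it runs on its own), showing that sorting the scores by hand and using scikit-learn’s `SelectKBest` pick the same features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X_demo, y_demo = make_classification(n_samples=1000, n_features=10, n_informative=3,
                                     n_redundant=0, shuffle=False, random_state=0)

f_scores, p_values = f_classif(X_demo, y_demo)
top3 = np.argsort(-f_scores)[:3]   # indices of the three highest F-values

# SelectKBest wraps the same scoring and keeps the k best columns.
selector = SelectKBest(f_classif, k=3).fit(X_demo, y_demo)
print(sorted(top3.tolist()), selector.get_support(indices=True).tolist())
```

`SelectKBest` is convenient inside a pipeline, while the manual ranking is handy when you want to sweep over many values of k.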

2. Mutual information

mutual_info_classif uses the concept of mutual information, which measures the dependency between each feature and the target variable. Mutual information quantifies the amount of information gained about the target by knowing the value of the feature. It captures both linear and non-linear dependencies.

See: https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection

from sklearn.feature_selection import mutual_info_classif

mi = mutual_info_classif(X_train, y_train)
mi

array([0.05548077, 0.01340894, 0.04092611, 0.00099355, 0.00516085,
       0.06952759, 0.02752171, 0.04043945, 0.00083431, 0.        ,
       0.        , 0.        , 0.        , 0.01146672, 0.        ,
       0.00298292, 0.        , 0.        , 0.0068234 , 0.00961735,
       0.00935105, 0.00586449, 0.00561433, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.00876386, 0.00049355,
       0.        , 0.00478042, 0.00487523, 0.00268551, 0.00118896,
       0.        , 0.00583264, 0.        , 0.        , 0.        ])
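
A small synthetic example illustrates the difference from the F-statistic. Here the class depends on the magnitude of x rather than its sign, so the class means of x are nearly equal and the F-statistic has little to work with, while mutual information can still pick up the dependency. The data and variable names are made up for this sketch.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
noise = rng.normal(size=2000)
y_demo = (np.abs(x) > 1).astype(int)   # class depends on |x|, not on the sign of x
X_demo = np.column_stack([x, noise])

f_scores = f_classif(X_demo, y_demo)[0]
mi_scores = mutual_info_classif(X_demo, y_demo, random_state=0)
print("F: ", f_scores.round(2))
print("MI:", mi_scores.round(3))
```

Comparing the two printouts shows why it can be worth running a non-linear scorer alongside a linear one before discarding a feature.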

3. Logistic regression

Logistic regression is a linear model for classification rather than regression. It is used to estimate the probability that an instance belongs to a particular class. The coefficients of the model can be used to determine feature importance.

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression().fit(X_train, y_train)
logreg.coef_

array([[-0.08669758,  0.06064912, -0.04063592,  0.00704782,  0.05079063,
        -0.02366404, -0.03246196, -0.07613473, -0.02216845,  0.00405511,
         0.00890925, -0.01289459, -0.00446435,  0.00330386,  0.01287983,
        -0.00599418,  0.00494212, -0.00385749,  0.03721175, -0.03849129,
        -0.0032749 , -0.01534965, -0.00908255,  0.02016669, -0.00175419,
         0.00918138,  0.01908963,  0.01357562, -0.01804835,  0.00266229,
         0.00180036, -0.00624841, -0.00351875,  0.00131487, -0.01573702,
        -0.00485053, -0.03744854,  0.05047984,  0.0174477 , -0.00658735],
       [ 0.22615114, -0.09497312, -0.04870109,  0.08462927, -0.00434942,
         0.0845374 ,  0.05186658,  0.05312074,  0.01121456,  0.0271935 ,
        -0.00027397,  0.01737956,  0.0080553 , -0.04770429, -0.00379082,
        -0.00963934,  0.00274429,  0.00688572, -0.03159925,  0.02737958,
        -0.0220797 , -0.00789503, -0.01134082, -0.02724488, -0.02432216,
         0.00543336, -0.02498671, -0.00919296,  0.01043275,  0.01323123,
        -0.00059363, -0.02632891, -0.00159478, -0.00753653,  0.01697607,
         0.01247471,  0.04553467, -0.05091659, -0.02028138,  0.03615948],
       [-0.13945356,  0.03432401,  0.089337  , -0.09167709, -0.04644121,
        -0.06087336, -0.01940462,  0.02301399,  0.0109539 , -0.03124862,
        -0.00863527, -0.00448497, -0.00359095,  0.04440043, -0.00908901,
         0.01563352, -0.00768641, -0.00302823, -0.0056125 ,  0.01111171,
         0.02535461,  0.02324468,  0.02042338,  0.0070782 ,  0.02607635,
        -0.01461474,  0.00589708, -0.00438266,  0.0076156 , -0.01589352,
        -0.00120673,  0.03257732,  0.00511353,  0.00622166, -0.00123904,
        -0.00762418, -0.00808614,  0.00043674,  0.00283369, -0.02957212]])
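
With three classes, coef_ holds one row of coefficients per class, so a single importance score per feature needs some aggregation. One simple option, sketched here on a separate synthetic dataset, is to standardize the features and average the absolute coefficients across classes; this aggregation is an illustrative choice, not the only reasonable one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=2000, n_features=10, n_informative=4,
                                     n_redundant=0, n_classes=3, shuffle=False,
                                     random_state=0)
X_scaled = StandardScaler().fit_transform(X_demo)  # put coefficients on one scale

model = LogisticRegression(max_iter=1000).fit(X_scaled, y_demo)

# coef_ has shape (n_classes, n_features); average magnitudes across classes.
importance = np.abs(model.coef_).mean(axis=0)
ranking = np.argsort(-importance)
print(model.coef_.shape, ranking[:4])
```

Scaling first matters because an unscaled feature with a large range gets a small coefficient even when it is highly predictive.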

3.5 Feature Selection with L1 (Lasso) Regularization

Lasso is a great feature selection technique. It’s fast, easy to use, and works well with high-dimensional data. I have often used it on very wide data, with more than 100 features (or even >10k features), to help pare down the number of features. It uses L1 regularization to penalize the absolute size of the coefficients. This leads to sparse solutions, where many of the coefficients are zero. The features with non-zero coefficients are selected. Lasso can be used for feature selection by setting the regularization parameter to a value that results in a sparse solution. The regularization parameter can be tuned using cross-validation.

Try modifying the regularization parameter to see how it affects the number of features selected.

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Step 4: Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 5: Apply Logistic Regression with L1 regularization for feature selection
logregL1 = LogisticRegression(penalty='l1', solver='saga', multi_class='multinomial', C=0.01)  # C is inverse of regularization strength
logregL1.fit(X_train_scaled, y_train)

# Step 6: Get the selected features using the original DataFrame 'X'
selected_features = X_train.columns[(logregL1.coef_ != 0).any(axis=0)]
print("Selected features: ", selected_features)

# Optional: Check the coefficients
#print("Logistic Regression (L1) coefficients: ", logregL1.coef_)

Selected features:  Index([0, 1, 2, 4, 5, 13, 19, 36, 37], dtype='int64')
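To make the effect of the regularization parameter concrete, here is a minimal sketch that sweeps `C` and counts the surviving features. It runs on synthetic stand-in data (`X_demo`, `y_demo` are assumptions, not the dataset used above), so the exact counts will differ from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; swap in your own X_train / y_train)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=40, n_informative=8, n_redundant=0,
    n_classes=3, random_state=0,
)
X_demo = StandardScaler().fit_transform(X_demo)

n_selected = {}
for C in [0.001, 0.01, 0.1, 1.0]:
    # Smaller C means stronger L1 regularization
    model = LogisticRegression(penalty="l1", solver="saga", C=C, max_iter=2000)
    model.fit(X_demo, y_demo)
    # A feature counts as selected if any class has a non-zero coefficient
    n_selected[C] = int((model.coef_ != 0).any(axis=0).sum())
    print(f"C={C}: {n_selected[C]} features selected")
```

Smaller values of `C` should generally leave fewer features; in practice you would pick `C` by cross-validated performance rather than by the feature count alone.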

4. LightGBM

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed for efficiency and can handle large datasets. It can be used to determine feature importance.

from lightgbm import LGBMClassifier

lgbm = LGBMClassifier(
    objective = 'multiclass',
    metric = 'multi_logloss',
    importance_type = 'gain'
).fit(X_train, y_train)

lgbm.feature_importances_

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001096 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 10200
[LightGBM] [Info] Number of data points in the train set: 8000, number of used features: 40
[LightGBM] [Info] Start training from score -1.108284
[LightGBM] [Info] Start training from score -1.094371
[LightGBM] [Info] Start training from score -1.093252

array([7030.24782425, 4036.90086633, 7197.39050466, 5339.74033117,
       2017.793881  , 8113.67321557, 3905.26838762, 4383.72206521,
        550.03625131,  531.02187729,  472.12624365,  678.28547454,
        526.57803982,  586.75292325,  552.92263156,  433.08122051,
        552.18078488,  534.15573859,  566.58704376,  630.01932001,
        635.43262064,  636.71719581,  560.95981157,  586.52648336,
        553.7755444 ,  563.13766581,  547.99060541,  523.01072556,
        676.76891661,  616.94216621,  634.27822083,  489.91742009,
        680.71264285,  620.95509708,  618.59545827,  418.22946733,
        568.21738124,  592.29172051,  553.43465978,  655.03435677])

5. Boruta

Boruta is an all-relevant feature selection method. It is an extension of the Random Forest algorithm. It selects all features that are relevant to the target variable, rather than just the most important features.

### long training time > 1 hour
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

#boruta = BorutaPy(
#    estimator = RandomForestClassifier(max_depth = 5), 
#    n_estimators = 'auto', 
#    max_iter = 100
#).fit(X_train, y_train)
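Because the full Boruta run is slow, a single-pass sketch of its core idea may help: compare each real feature against shuffled "shadow" copies. This is not the full Boruta algorithm (which repeats the comparison many times and applies a statistical test); the data and variable names here are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# shuffle=False keeps the 5 informative features in the first 5 columns
X_demo, y_demo = make_classification(
    n_samples=500, n_features=15, n_informative=5, n_redundant=0,
    shuffle=False, random_state=0,
)

# Shadow features: each column permuted independently, destroying any link to y
X_shadow = np.column_stack([rng.permutation(col) for col in X_demo.T])
X_both = np.hstack([X_demo, X_shadow])

forest = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)
forest.fit(X_both, y_demo)

n_real = X_demo.shape[1]
real_importance = forest.feature_importances_[:n_real]
shadow_max = forest.feature_importances_[n_real:].max()

# A real feature scores a "hit" when it beats the best shadow feature
hits = real_importance > shadow_max
print("Features beating the best shadow:", np.flatnonzero(hits))
```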

6. MRMR

MRMR (Minimum Redundancy Maximum Relevance) is a feature selection method that selects features based on their relevance to the target variable and their redundancy with other features. It aims to select features that are highly correlated with the target variable but uncorrelated with each other.

There are several implementations of MRMR available in Python:

https://github.com/smazzanti/mrmr
https://koaning.github.io/scikit-lego/api/feature-selection/
https://github.com/AutoViML/featurewiz?tab=readme-ov-file

import pandas as pd
from mrmr import mrmr_classif

#mrmr = mrmr_classif(pd.DataFrame(X_train), pd.Series(y_train), K = 784)
mrmr = mrmr_classif(pd.DataFrame(X_train), pd.Series(y_train), K = X_train.shape[1])

100%|██████████| 40/40 [00:03<00:00, 11.45it/s]
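For intuition about what an MRMR implementation is doing, here is a small greedy sketch of one common variant: relevance measured by the F-statistic, divided by the mean absolute correlation with the features already chosen. The function `mrmr_sketch` and the synthetic data are assumptions for illustration, not the internals of the `mrmr` package used above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

def mrmr_sketch(X: pd.DataFrame, y, k: int) -> list:
    """Greedy MRMR: F-statistic relevance divided by mean absolute correlation."""
    relevance = pd.Series(f_classif(X, y)[0], index=X.columns)
    corr = X.corr().abs()
    selected, remaining = [], list(X.columns)
    for _ in range(k):
        if selected:
            # Redundancy: mean |correlation| with the features already chosen
            redundancy = corr.loc[remaining, selected].mean(axis=1)
        else:
            # First pick: no redundancy yet, so rank by relevance alone
            redundancy = pd.Series(1.0, index=remaining)
        score = relevance.loc[remaining] / redundancy.clip(lower=1e-3)
        best = score.idxmax()
        selected.append(best)
        remaining.remove(best)
    return selected

X_arr, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)
X_demo = pd.DataFrame(X_arr)
top5 = mrmr_sketch(X_demo, y_demo, k=5)
print(top5)
```

The first feature chosen is simply the most relevant one; each later pick is penalized for being correlated with what has already been selected.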

Store results

import numpy as np

ranking = pd.DataFrame(index = range(X_train.shape[1]))

ranking['f'] = pd.Series(f, index = ranking.index).fillna(0).rank(ascending = False)
ranking['mi'] = pd.Series(mi, index = ranking.index).fillna(0).rank(ascending = False)
ranking['logreg'] = pd.Series(np.abs(logreg.coef_).mean(axis = 0), index = ranking.index).rank(ascending = False)
ranking['lasso']= pd.Series(np.abs(logregL1.coef_).mean(axis = 0), index = ranking.index).rank(ascending = False)
ranking['lightgbm'] = pd.Series(lgbm.feature_importances_, index = ranking.index).rank(ascending = False)
#ranking['boruta'] = boruta.support_* 1 + boruta.support_weak_ * 2 + (1 - boruta.support_ - boruta.support_weak_) * X_train.shape[1]
ranking['mrmr'] = pd.Series(
    list(range(1, len(mrmr) + 1)) + [len(mrmr) + 1] * (X_train.shape[1] - len(mrmr)),
    index = mrmr + list(set(ranking.index) - set(mrmr))
).sort_index()


ranking = ranking.replace(to_replace = ranking.max(), value = X_train.shape[1])
#ranking.to_csv('ranking.csv', index = False)

Evaluate Feature Selection Methods

Let’s see how the predictive performance of the model changes as we add more features. We will use the top features selected by each method to train a model and evaluate its performance.

from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
## 22 minutes for mnist

algos = ['f', 'mi', 'logreg', 'lasso', 'lightgbm', 'mrmr'] ##Feel free to change this
ks = [1, 2, 3, 5, 10, 20, 30, 40] ##Feel free to change this

accuracy = pd.DataFrame(index = ks, columns = algos)
roc = pd.DataFrame(index = ks, columns = algos)

for algo in algos:
    print (algo)
    for k in ks:
    
        cols = ranking[algo].sort_values().head(k).index.to_list()
                
        clf = CatBoostClassifier().fit(
            X_train[cols], y_train,
            eval_set=(X_test[cols], y_test),
            early_stopping_rounds = 20,
            verbose = False
        )
                
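        # Store accuracy
        accuracy.loc[k, algo] = accuracy_score(
            y_true=y_test, y_pred=clf.predict(X_test[cols])
        )

        # Store one-vs-rest ROC AUC (assumes a multiclass target)
        roc.loc[k, algo] = roc_auc_score(
            y_test, clf.predict_proba(X_test[cols]), multi_class='ovr'
        )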
        
accuracy.to_csv('accuracyMC.csv', index = True)
roc.to_csv('rocMC.csv', index = True)

f
mi
logreg
lasso
lightgbm
mrmr

Feature Selection Curves

Let’s visualize how the model’s accuracy changes as a function of feature selection.
Notice how, for Madelon, there is an optimal number of features. Too many noise features end up reducing the performance of the model.

import matplotlib.pyplot as plt

for algo, label, color in zip(
    ['mrmr', 'f', 'mi', 'lightgbm', 'logreg',"lasso"],
    ['MRMR', 'F-statistic', 'Mutual Info', 'LightGBM', 'Log Reg','Log Reg (L1/Lasso)'],
    ['orangered', 'blue', 'yellow', 'lime', 'black', 'pink']):
        plt.plot(accuracy.index, accuracy[algo], label = label, color = color, lw = 3)

plt.plot(
    [1, 40], [pd.Series(y_test).value_counts(normalize = True).iloc[0]] * 2, 
    label = '[Random]', color = 'grey', ls = '--', lw = 3
)

plt.legend(fontsize = 13, loc = 'center left', bbox_to_anchor = (1, 0.5))
plt.grid()
plt.xlabel('Number of features', fontsize = 13)
plt.ylabel('Accuracy', fontsize = 13)
plt.savefig('accuracy.png', dpi = 300, bbox_inches = 'tight')

Feature Selection combined with Feature Elimination Techniques

Recursive Feature Elimination

Feature importance from LightGBM is consistently one of the best methods for feature selection, and we can refine it in several ways. Recursive Feature Elimination (RFE) starts from the same kind of model-based feature importance, but iteratively removes the least important features and refits the model. This iterative process requires training a model several times, but it can improve the selection. RFE is widely accepted as a best practice for feature selection. The example below uses XGBoost as the estimator.

from sklearn.feature_selection import RFE
from xgboost import XGBClassifier
from sklearn.svm import SVR
model = XGBClassifier(random_state=42)
#model = SVR(kernel="linear")  #took 3 minutes, ok results but not as good as XGB on Madelon
rfe = RFE(model, n_features_to_select=7, step=1)
#rfe = RFE(model, n_features_to_select=50, step=200,verbose=2) #for MNIST
rfe.fit(X_train, y_train)
rfe.support_

array([ True,  True,  True,  True, False,  True,  True,  True, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False])

# Train an XGBoost model with the selected features from RFE
model_selected = XGBClassifier(random_state=42)
X_selected = X_train.loc[:, rfe.support_]
model_selected.fit(X_selected, y_train)

# Make predictions on the test set using the RFE-selected features
y_pred_selected = model_selected.predict(X_test.loc[:, rfe.support_])
accuracy_selected = accuracy_score(y_test, y_pred_selected)
print(f"Accuracy with selected features: {accuracy_selected:.4f}")

Accuracy with selected features: 0.7135

Compare with perfect on Madelon

perfect = [ True,  True,  True,  True, True,  True,  True,  True, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False]

# Train an XGBoost model with the 'perfect' feature mask instead of the RFE selection
model_selected = XGBClassifier(random_state=42)
X_selected = X_train.loc[:, perfect]
model_selected.fit(X_selected, y_train)

# Make predictions on the test set using the 'perfect' feature mask
y_pred_selected = model_selected.predict(X_test.loc[:, perfect])
accuracy_selected = accuracy_score(y_test, y_pred_selected)
print(f"Accuracy with selected features: {accuracy_selected:.4f}")

Accuracy with selected features: 0.7140

Feature Elimination with FIRE

At DataRobot, we had a mighty AutoML engine that showed you how feature importance aggregated across different models (this is feature importance from four diverse models).

You can use the variance in feature importance across models as part of feature selection. It takes a lot more compute but, in our experiments, it can select features even better. Read more about feature importance rank ensembling (FIRE) at https://docs.datarobot.com/en/docs/api/accelerators/adv-approaches/fire.html and see a code snippet at https://github.com/datarobot-community/examples-for-data-scientists/blob/master/Feature%20Lists%20Manipulation/Python/Advanced%20Feature%20Selection.ipynb
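As a rough illustration of rank ensembling, the sketch below fits four different scikit-learn models, ranks features within each model, and keeps the features with the best median rank. This is my own simplified stand-in, not DataRobot's FIRE implementation; the model choices, the synthetic data, and the cutoff of five features are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier, GradientBoostingClassifier, RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# shuffle=False keeps the 5 informative features in the first 5 columns
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=0,
    shuffle=False, random_state=0,
)
X_demo = StandardScaler().fit_transform(X_demo)

models = {
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
    "et": ExtraTreesClassifier(n_estimators=100, random_state=0),
    "gbm": GradientBoostingClassifier(random_state=0),
    "logreg": LogisticRegression(max_iter=1000),
}

ranks = pd.DataFrame(index=range(X_demo.shape[1]))
for name, model in models.items():
    model.fit(X_demo, y_demo)
    if hasattr(model, "feature_importances_"):
        importance = model.feature_importances_
    else:
        # Linear model: use absolute coefficient size as importance
        importance = np.abs(model.coef_).mean(axis=0)
    ranks[name] = pd.Series(importance).rank(ascending=False)

# Aggregate: median rank across models; lower is better
median_rank = ranks.median(axis=1).sort_values()
keep = median_rank.head(5).index.tolist()
print(keep)
```

Features that rank well under only one model get pulled down by the median, which is the point of aggregating across diverse models.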

Featurewiz

Featurewiz looks like a cool feature selection package, but I wasn’t able to get it to work. It’s worth checking out: https://github.com/AutoViML/featurewiz

Other great feature selection resources:

A classic dataset where many feature selection techniques have been applied is the Kaggle Santander Customer Satisfaction competition.

Feature Selection with Feature Engine

Advance Feature Selection Tutorial

Interpretable Machine Learning Models Simply Explained

Wed, 25 Sep 2024 05:00:00 GMT

Video

Watch the full video

Annotated Presentation

Below is an annotated version of the presentation, with timestamped links to the relevant parts of the video for each slide.

Here is the annotated presentation for “Rules: A Simple & Effective Machine Learning Approach” by Rajiv Shah.

1. Title Slide

Slide 1

(Timestamp: 00:00:00)

The presentation begins by introducing the core topic: Interpretable Models and the use of rules in machine learning. Rajiv Shah sets the stage by contrasting this talk with previous discussions on explainability (using tools to explain complex models). Instead, this session focuses on choosing models that are inherently easy to understand.

Shah expresses his interest in how machine learning helps us understand the world. He notes that while tools like SHAP or LIME help unpack complex models, there is immense value in approaching the problem differently: by selecting model architectures that are transparent by design.

The speaker invites the audience to view this not just as a technical lecture but as a discussion on the trade-offs between model complexity and interpretability, setting a collaborative tone for the presentation.

2. Table of Contents

Slide 2

(Timestamp: 00:02:30)

This slide outlines the roadmap for the presentation. Shah explains that he will begin with the “Big Picture” concepts—specifically the “Why?” and the “Baseline”—before diving into four specific technical approaches to rule-based modeling.

The four specific methods to be covered are Rulefit, GA2M (Generalized Additive Models with interactions), Rule Lists, and Scorecards. This structure moves from theoretical justification to practical application, comparing different algorithms that prioritize transparency.

Shah also mentions that a GitHub repository is available with code examples for everything shown, allowing the audience to reproduce the results for the tabular datasets discussed.

3. Section 1: Why?

Slide 3

(Timestamp: 00:03:09)

This section header introduces the fundamental question: Why do we want rules? The speaker moves past the obvious statement that “AI is important” to investigate the influences that drive data scientists toward complex, opaque models.

Shah prepares to discuss the cultural and competitive pressures in data science that prioritize raw accuracy over usability. This section serves as a critique of the “accuracy at all costs” mindset often found in the industry.

4. Mark Cuban Quote

Slide 4

(Timestamp: 00:03:17)

The slide features a quote from Mark Cuban: “Artificial Intelligence, deep learning, machine learning — whatever you’re doing if you don’t understand it — learn it. Because otherwise you’re going to be a dinosaur within 3 years.”

Shah briefly references this as the “obligatory” acknowledgment of AI’s massive importance in the current landscape. It reinforces that while the field is moving fast, the understanding of these systems is paramount, which ties into the presentation’s focus on interpretability.

5. Influences: Kaggle & Academia

Slide 5

(Timestamp: 00:03:40)

Shah identifies Kaggle competitions and academic research as two primary influences on data scientists. He notes that these platforms heavily incentivize accuracy above all else. For example, in the Zillow Prize, the difference between the top scores is minuscule, yet teams fight for that fraction of a percentage.

He argues that this environment trains data scientists to focus solely on improving metrics (like RMSE or AUC), often ignoring other critical trade-offs like model complexity, deployment difficulty, or explainability.

As he states, “One of the byproducts of Kaggle is a very heavy focus on making sure you improve your models around accuracy… and that’s how you can get a conference paper.” This sets up the problem of complexity creep.

6. The Netflix Prize Winners

Slide 6

(Timestamp: 00:05:39)

This slide shows the winners of the famous Netflix Prize, a competition held about 15 years ago where a team won $1 million for improving Netflix’s recommendation algorithm by 10%.

Shah uses this story to illustrate the peak of the “accuracy” mindset. The competition drew massive interest and drove innovation, but it also encouraged teams to prioritize the leaderboard score over the practicality of the solution.

7. Netflix Prize Progress Graph

Slide 7

(Timestamp: 00:06:14)

The graph displays the progress of teams over time during the Netflix competition. Shah points out that after an initial period of rapid improvement using standard algorithms, progress plateaued.

To break through these plateaus, teams began using Ensembling—combining multiple models together. The winning solution was an ensemble of 107 different models. Shah emphasizes that while this strategy is powerful for eking out the last bit of performance, it creates immense complexity.

8. The Engineering Cost of Complexity

Slide 8

(Timestamp: 00:07:39)

This slide reveals the ironic conclusion of the Netflix Prize: the winning model was never implemented. The engineering costs to deploy an ensemble of 107 models were simply too high compared to the marginal gain in accuracy.

Shah uses this as a cautionary tale: “If your focus is on accuracy… it drives you down towards this complexity… but often you end up with these complex models [that] are often very difficult to implement.” This highlights the disconnect between competitive data science and enterprise reality.

9. Understandable White Box Model (CLEAR-2)

Slide 9

(Timestamp: 00:08:04)

Shah transitions to the alternative: Interpretable Models. This slide shows a simple linear model (CLEAR-2) with only two features. This is a classic “White Box” model where the relationship between inputs and outputs is transparent.

The speaker contrasts this with the “Black Box” nature of complex ensembles. He argues that if you cannot understand what is going on inside a model, you cannot effectively debug it, nor can you easily convince stakeholders to trust it.

10. Complex White Box Model (CLEAR-8)

Slide 10

(Timestamp: 00:11:51)

This slide presents a linear model with eight features (CLEAR-8). While technically still a “White Box” model, Shah implies that as feature counts grow, true understandability diminishes.

He touches on this concept later in the “Caveats” section, noting that even linear models can become confusing if there is multicollinearity (features moving in the same direction). Just because we can see the coefficients doesn’t mean the model is intuitively “explainable” to a human if the variables interact in complex, non-obvious ways.

11. Easy to Understand Decision Tree

Slide 11

(Timestamp: 00:19:15)

Here, a simple Decision Tree is presented. Shah connects this to the history of rule-based learning, noting that early research found that keeping decision trees “short and stumpy” made them very easy for humans to explain.

This visual represents the ideal of interpretability: a clear path of logic (e.g., “If X is less than 3, go left”) that leads to a prediction. This is the foundation for the Rulefit method discussed later.

12. Too Much to Comprehend

Slide 12

(Timestamp: 00:07:46)

Contrasting the previous slide, this image shows a chaotic forest of decision trees. This represents modern ensemble methods like Random Forests or Gradient Boosted Machines.

Shah uses this visual to reinforce the point that while ensembles offer “Better Performance,” the sheer number of decision paths makes them “too much to Comprehend.” You lose the ability to trace the “why” behind a specific prediction, turning the system into a Black Box.

13. Pedro Domingos Tweet

Slide 13

(Timestamp: 00:08:22)

Shah acknowledges the counter-argument by showing a tweet from Pedro Domingos, a prominent machine learning researcher, who suggests that demanding explainability limits the potential of AI.

Shah respectfully disagrees with this stance in the context of enterprise data science. He argues that in the real world, “If you don’t understand what’s going on in your model, it’s hard for you to debug it, it’s hard to convince somebody else to adopt your model.” Practicality and trust often outweigh raw theoretical power.

14. Benefits of Interpretable Models

Slide 14

(Timestamp: 00:09:16)

This slide summarizes the key benefits of using interpretable models, referencing the work of Cynthia Rudin. The main advantages are:

1. Debugging: It is easier to spot weird behaviors.
2. Trust: Stakeholders and legal/risk teams are more likely to approve the model.
3. Deployment: These models can often be deployed as simple SQL queries or basic code, avoiding the need for heavy GPU infrastructure.

Shah emphasizes the deployment aspect: “You don’t have to go out and get a GPU… you can actually deploy directly within a database.”

15. Caveats of Interpretable Models

Slide 15

(Timestamp: 00:11:00)

Shah provides a necessary reality check. He clarifies that selecting an interpretable algorithm is only one part of the process. True interpretability depends on the entire data pipeline.

Issues like data labeling, feature engineering, and multicollinearity can render even a simple model confusing. For example, if two correlated features have opposite coefficients in a linear model, it becomes very difficult to explain the logic to a business user, even if the math is simple.

16. Section 2: Baseline

Slide 16

(Timestamp: 00:12:15)

This slide introduces the Baseline section. Shah advocates for always starting a project with a simple baseline model to establish a performance benchmark.

He shares an anecdote about people spending a year on a project only to be nearly matched by a simple model built in two hours. Establishing a baseline helps determine how much effort should be spent chasing incremental accuracy improvements.

17. The Problem: UCI Adult Dataset

Slide 17

(Timestamp: 00:12:54)

Shah introduces the dataset he will use for all examples in the talk: the UCI Adult Dataset (Census Income). The goal is a binary classification problem: predicting whether someone has a high or low income based on demographics.

He chooses this dataset because it represents typical enterprise tabular data: it has 30,000 rows, a mix of numerical and categorical features, and contains collinearity and interaction effects. This makes it a realistic test bed for the models he will demonstrate.

18. Baseline Models

Slide 18

(Timestamp: 00:13:53)

The speaker outlines the three baseline models he built to bracket the performance possibilities:

1. Logistic Regression: The standard statistical approach.
2. AutoML (H2O): A stacked ensemble of many models (Neural Networks, GBMs, etc.) representing the “maximum” possible performance.
3. OneR: A very simple rule-based algorithm.

These baselines provide the context for evaluating the interpretable models later.

19. Baseline Models Plot

Slide 19

(Timestamp: 00:14:12)

This plot visualizes Complexity vs. AUC (Area Under the Curve).

* OneR is at the bottom (AUC ~0.60) with very low complexity.
* Logistic Regression is in the middle (AUC ~0.91).
* Stacked Ensemble is at the top (AUC ~0.93) but with massive complexity.

Shah notes that while the Stacked Ensemble wins on accuracy, the Logistic Regression is surprisingly close, highlighting that simpler models can often be “good enough.”

20. OneR Example

Slide 20

(Timestamp: 00:15:17)

Shah explains the OneR (One Rule) algorithm. This method finds the single feature in the dataset that best predicts the target. In the example shown (Iris dataset), utilizing just “Petal Width” classifies 96% of instances correctly.

He suggests OneR is a great way to detect Target Leakage—if one feature predicts the target perfectly, it might be “cheating.” It also sets the floor for performance; if a complex model can’t beat OneR, something is wrong.
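A OneR-style baseline is short enough to sketch directly. The version below bins each numeric feature into quantiles, maps every bin to its most frequent class, and keeps the single feature whose rule is most accurate on the training data. The function `one_r` and the choice of three quantile bins are assumptions for illustration; classic OneR implementations use their own discretization, so the accuracy will not match the slide exactly.

```python
import pandas as pd
from sklearn.datasets import load_iris

def one_r(X: pd.DataFrame, y: pd.Series, n_bins: int = 3):
    """Return (best_feature, rule, training accuracy) for a OneR-style model."""
    best = None
    for col in X.columns:
        # Discretize the numeric feature into integer-coded quantile bins
        binned = pd.qcut(X[col], q=n_bins, labels=False, duplicates="drop")
        # The rule maps each bin to its most frequent class
        rule = y.groupby(binned).agg(lambda s: s.mode().iloc[0])
        predictions = binned.map(rule)
        acc = (predictions == y).mean()
        if best is None or acc > best[2]:
            best = (col, rule, acc)
    return best

iris = load_iris(as_frame=True)
feature, rule, acc = one_r(iris.data, iris.target)
print(feature, round(acc, 3))
print(rule)
```

If a single feature comes back with near-perfect accuracy on a real business dataset, that is the cue to check it for target leakage.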

21. Baseline Models Plot (Recap)

Slide 21

(Timestamp: 00:16:36)

Returning to the complexity plot, Shah reiterates the performance gap. The AutoML model sets the “ceiling” at 0.93 AUC.

The goal for the rest of the presentation is to see where the interpretable models (Rulefit, GA2M, etc.) fall on this graph. Can they approach the 0.93 AUC of the ensemble without incurring the massive complexity penalty?

22. Section 3: Rulefit

Slide 22

(Timestamp: 00:18:18)

This slide introduces the first major interpretable technique: Rulefit. Shah mentions familiarity with this from his time at DataRobot and notes that it is a powerful way to combine the benefits of trees and linear models.

23. What is Rulefit?

Slide 23

(Timestamp: 00:18:30)

Rulefit is an algorithm developed by Friedman and Popescu (2008). It works by:

1. Building a random forest of short, “stumpy” decision trees.
2. Extracting each path through the trees as a “Rule.”
3. Using these rules as binary features in a sparse linear model (Lasso).

This approach allows the model to capture interactions (via the trees) while maintaining the interpretability of a linear equation.
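The three steps can be approximated with scikit-learn alone. The sketch below is a simplification, not the H2O implementation discussed in the talk: it treats only root-to-leaf paths as rules, uses synthetic data, and picks the tree count, depth, and `C` arbitrarily.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=4, random_state=0
)

# Step 1: an ensemble of short trees
forest = RandomForestClassifier(n_estimators=30, max_depth=3, random_state=0)
forest.fit(X_demo, y_demo)

# Step 2: each leaf corresponds to one root-to-leaf path, i.e. one rule.
# apply() returns the leaf index each row lands in, for every tree.
leaves = forest.apply(X_demo)
encoder = OneHotEncoder(handle_unknown="ignore")
rule_features = encoder.fit_transform(leaves)  # binary rule-membership matrix

# Step 3: sparse linear model over the rule indicators (L1 penalty)
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
lasso.fit(rule_features, y_demo)

n_rules = rule_features.shape[1]
n_kept = int((lasso.coef_ != 0).sum())
print(f"{n_kept} of {n_rules} candidate rules kept")
```

Every row activates one leaf per tree, so many rules fire for the same observation, and the L1 penalty decides which of them earn a coefficient.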

24. H2O Rulefit Output

Slide 24

(Timestamp: 00:22:19)

Shah displays the output from the H2O Rulefit implementation. The model generates human-readable rules, such as: “If Education < 12 AND Capital Gain < $7000, THEN Coefficient is negative.”

He notes that while the rules are readable, the raw output can look like “computer-ese.” However, it allows a data scientist to identify specific segments of the population (e.g., low education, low capital gain) that strongly drive the prediction.

25. Overlapping Rules

Slide 25

(Timestamp: 00:24:30)

A key characteristic of Rulefit is that the rules overlap. A single data point might satisfy multiple rules simultaneously.

Shah points out that this adds a layer of complexity to interpretability. To understand a prediction, you have to sum up the coefficients of all the rules that apply to that person. This is different from a decision tree where you fall into exactly one leaf node.
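A tiny sketch shows what summing overlapping rules looks like for one person. The rules, coefficients, and intercept below are hypothetical values chosen for illustration, not output from the model in the talk.

```python
import math

# Hypothetical rules and coefficients, for illustration only
rules = [
    ("education_years < 12", lambda p: p["education_years"] < 12, -0.8),
    ("capital_gain < 7000", lambda p: p["capital_gain"] < 7000, -0.5),
    ("age >= 40", lambda p: p["age"] >= 40, 0.6),
]
intercept = -0.2

def explain(person):
    """Return the rules that fire for this person and the resulting probability."""
    fired = [(name, coef) for name, test, coef in rules if test(person)]
    logit = intercept + sum(coef for _, coef in fired)
    return fired, 1 / (1 + math.exp(-logit))

person = {"education_years": 10, "capital_gain": 0, "age": 45}
fired, prob = explain(person)
for name, coef in fired:
    print(f"{name}: {coef:+.1f}")
print(f"probability: {prob:.3f}")
```

All three rules apply to this person at once, so the explanation is the list of fired rules plus their sum, rather than a single leaf.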

26. H2O Rulefit with Linear Terms

Slide 26

(Timestamp: 00:25:55)

One limitation of pure rules is handling continuous variables (like age or miles driven). Rules have to “bin” these variables (e.g., Age < 30, Age 30-40).

Shah explains that H2O Rulefit solves this by including Linear Terms. The model can use rules for non-linear interactions and standard linear coefficients for continuous trends. This hybrid approach boosts the AUC significantly (up to 0.88 in this example) by capturing linear relationships more naturally.

27. Rulefit Results

Slide 27

(Timestamp: 00:27:04)

This slide plots the performance of Rulefit models with varying numbers of rules. Shah demonstrates that by increasing the number of rules (complexity), the AUC climbs closer to the Stacked Ensemble.

He concludes that Rulefit is a versatile tool. You can tune the “dial” of complexity: fewer rules for more interpretability, or more rules for higher accuracy, often getting very competitive performance.

28. Section 4: GA2M

Slide 28

(Timestamp: 00:31:35)

The presentation moves to the second technique: GA2M (Generalized Additive Models with pairwise interactions). Shah notes that while GAMs have existed for a while, modern implementations like Microsoft’s Explainable Boosting Machines (EBM) have made them much more accessible and powerful.

29. What is GA2M?

Slide 29

(Timestamp: 00:32:02)

GA2M is essentially a linear model where features are binned, and pairwise interactions are automatically detected. Shah highlights InterpretML, an open-source library from Microsoft that implements this via EBMs.

The model structure is additive: $g(E[y]) = \beta_0 + \sum_i f_i(x_i) + \sum_{i<j} f_{ij}(x_i, x_j)$. This means the final score is just the sum of individual feature scores and interaction scores, making it very transparent.

30. GA2M Binning

Slide 30

(Timestamp: 00:32:42)

Shah explains how GA2M handles numerical data. Instead of a single slope coefficient (like in logistic regression), the model bins the continuous feature (e.g., dividing “criminal history” into ranges).

Each bin gets its own coefficient. This allows the model to learn non-linear patterns (e.g., risk might go up, then down, then up again as a variable increases) while remaining easy to inspect.
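The effect of binning can be seen with a toy example. The sketch below is not an EBM; it uses scikit-learn's `KBinsDiscretizer` with a logistic regression as a stand-in for "one coefficient per bin", on synthetic data where risk is high at both extremes of the feature. The data and bin count are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
# Synthetic feature with a U-shaped effect: risk is high at both extremes
x = rng.uniform(-3, 3, size=(2000, 1))
y = (rng.random(2000) < 1 / (1 + np.exp(-(x[:, 0] ** 2 - 3)))).astype(int)

# One-hot bins: each bin gets its own coefficient
binner = KBinsDiscretizer(n_bins=6, encode="onehot-dense", strategy="quantile")
x_binned = binner.fit_transform(x)
binned_model = LogisticRegression(max_iter=1000).fit(x_binned, y)

# A single slope cannot represent the U shape
linear_model = LogisticRegression(max_iter=1000).fit(x, y)

bin_coefs = binned_model.coef_[0]
binned_acc = binned_model.score(x_binned, y)
linear_acc = linear_model.score(x, y)
print("per-bin coefficients:", bin_coefs.round(2))
print("binned accuracy:", binned_acc)
print("linear accuracy:", linear_acc)
```

The per-bin coefficients should come out high at the two ends and low in the middle, which is the kind of non-linear pattern a single slope cannot express.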

31. Interactions in GA2M

Slide 31

(Timestamp: 00:33:08)

The “2” in GA2M stands for pairwise interactions. Shah emphasizes that this is the model’s superpower. While standard linear models struggle with interactions (e.g., the combined effect of age and education), GA2M has an efficient algorithm to automatically find the most important pairs.

This allows the model to achieve accuracy levels comparable to complex ensembles (AUC 0.93) because it captures the interaction signal that simple linear models miss.

32. GA2M Visualization

Slide 32

(Timestamp: 00:35:14)

Shah showcases the InterpretML dashboard. It provides clear visualizations of how each feature contributes to the prediction.

In the example, we see the coefficients for different marital statuses. This acts like a “lookup table” for risk. Shah argues that this is very “model risk management friendly” because stakeholders can validate every single coefficient and interaction term to ensure they make business sense.

33. Section 5: Rule Lists

Slide 33

(Timestamp: 00:40:28)

The third approach is Rule Lists. Shah introduces this as a method to solve the “overlapping rules” problem found in Rulefit.

34. What are Rule Lists?

Slide 34

(Timestamp: 00:40:48)

Rule Lists are ordered sets of IF-THEN-ELSE statements. Unlike Rulefit, where you sum up multiple rules, here an observation triggers only the first rule it matches.

Shah mentions implementations like CORELS and SBRL (Scalable Bayesian Rule Lists). The goal is to produce a concise list that a human can read from top to bottom to make a decision.

35. SBRL Process

Slide 35

(Timestamp: 00:41:09)

Creating an optimal rule list is computationally expensive because the algorithm must search through many permutations to find the best order.

Shah explains the logic: The algorithm finds a rule that covers a subset of data, removes those instances, and then finds the next rule for the remaining data. This sequential “peeling off” of data creates the IF-ELSE structure.

36. SBRL Output Example

Slide 36

(Timestamp: 00:41:45)

The output of an SBRL model is shown. It reads like a checklist:

1. IF Capital Gain > $7500 -> High Income (99% prob)
2. ELSE IF Education < 4 -> Low Income (90% prob)
3. ELSE…

Shah highlights the simplicity: “You just go down the list until you find the rule… much easier to explain to those marketing people.” The trade-off is a drop in accuracy (AUC 0.86) compared to GA2M or Rulefit.
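Applying a rule list is just a first-match walk down the list. In the sketch below the two thresholds mirror the slide, while the field names and the fallback prediction are hypothetical, since the slide's final ELSE branch is not shown.

```python
# Ordered rules: (description, condition, prediction)
rule_list = [
    ("capital_gain > 7500", lambda p: p["capital_gain"] > 7500, "high income"),
    ("education < 4", lambda p: p["education"] < 4, "low income"),
]
default_prediction = "low income"  # assumed fallback; the slide's ELSE is elided

def predict(person):
    """Walk the list top to bottom and stop at the first rule that matches."""
    for description, condition, prediction in rule_list:
        if condition(person):
            return prediction, description
    return default_prediction, "default"

# Both rules match this person, but only the first one is used
print(predict({"capital_gain": 9000, "education": 2}))
print(predict({"capital_gain": 0, "education": 12}))
```

Unlike Rulefit, there is nothing to sum: the explanation for any prediction is the single rule that fired.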

37. Section 6: Scorecard

Slide 37

(Timestamp: 00:44:52)

The final approach is the Scorecard. Shah introduces this as perhaps the simplest and most widely recognized format for decision-making in industries like credit and criminal justice.

38. What are Scorecards?

Slide 38

(Timestamp: 00:45:04)

Scorecards are simple additive models where features are assigned integer “points.” To get a prediction, you simply add up the points.

Shah mentions tools like Optbinning and SLIM (Sparse Linear Integer Models). This format is beloved in operations because it can be printed on a physical card or implemented in a basic spreadsheet.

39. Scorecard Example

Slide 39

(Timestamp: 00:46:08)

This slide shows a scorecard built for the Adult dataset.

* Capital Gain > 7000? +29 points.
* Age < 25? -5 points.

Shah expresses a personal preference for this over raw coefficients: “I actually like this better… I think it’s a little easier to understand which features are most important.” The integer points make the “weight” of each factor immediately obvious to a layperson.
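Scoring with a scorecard is plain addition. The two point values below come from the slide; the field names and the zero base score are assumptions, and a real scorecard would have more rows and a decision cutoff.

```python
# (description, condition, points); the two point values are from the slide
scorecard = [
    ("capital_gain > 7000", lambda p: p["capital_gain"] > 7000, 29),
    ("age < 25", lambda p: p["age"] < 25, -5),
]
base_points = 0  # hypothetical starting score

def score(person):
    """Add up the points for every row of the scorecard that applies."""
    return base_points + sum(points for _, test, points in scorecard if test(person))

applicant = {"capital_gain": 8000, "age": 23}
total = score(applicant)
print(total)  # 29 - 5 = 24
```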

40. Summary

Slide 40

(Timestamp: 00:51:11)

Shah begins to wrap up the presentation, preparing to consolidate the four methods (Rulefit, GA2M, Rule Lists, Scorecards) into a final comparison.

41. Complexity vs AUC Summary Plot

Slide 41

(Timestamp: 00:51:13)

This is the definitive comparison graph of the talk. It places all discussed models on the Complexity vs. AUC plane.

* GA2M (EBM) and Rulefit sit high up, offering near-SOTA accuracy with moderate interpretability.
* Scorecards and Rule Lists sit lower on accuracy but offer maximum simplicity.

Shah summarizes the trade-off: “The Rule Lists and Scorecard… you lose a little bit [of accuracy]… but we talked about the trade-offs of being able to easily understand.”

42. Take Away

Slide 42

(Timestamp: 00:52:08)

The final message is a call to action: Try these approaches.

Shah encourages data scientists to add these tools to their toolkit. He asks them to consider the specific needs of their problem: Is it about transparency in calculation (Scorecard)? Or understanding factors (GA2M)? Often, a simple model that gets deployed is far better than a complex model that gets stuck in review.

43. Conclusion

Slide 43

(Timestamp: 00:52:52)

The presentation concludes with Rajiv Shah’s contact information. He mentions an upcoming blog post that will synthesize these topics and invites the audience to reach out with questions or feedback.

He reiterates that these interpretable models are often easier to get “buy-in” for, making them a pragmatic choice for real-world data science success.

This annotated presentation was generated from the talk using AI-assisted tools. Each slide includes timestamps and detailed explanations.

Spark of AI: How Transfer Learning Unlocked AI’s Potential

Fri, 20 Sep 2024 05:00:00 GMT

Video

Watch the full video

Annotated Presentation

Below is an annotated version of the presentation, with timestamped links to the relevant parts of the video for each slide.

1. The Spark of the AI Revolution

Slide 1

(Timestamp: 00:00)

The presentation begins with the title slide, “The Spark of the AI Revolution: Transfer Learning,” presented by Rajiv Shah from Snowflake. This talk was originally given at the University of Cincinnati and recorded later to share the insights with a broader audience.

Rajiv sets the stage by explaining that this is not a deep technical dive into code, but rather a descriptive history and analysis of the drivers behind the current AI boom. The goal is to explain how AI learns and how individuals can start to interrogate and understand these technologies in their own lives.

The core premise is that Transfer Learning is the catalyst that shifted AI from academic curiosity to a revolutionary force. The talk aims to bridge the gap for those unfamiliar with the underlying mechanics of how models like ChatGPT came to be.

2. Sparks of AGI: Early Experiments

Slide 2

(Timestamp: 01:00)

This slide illustrates an early experiment conducted by researchers investigating GPT-4. To understand how the model was learning, they gave it a concept and asked it to draw it using code (SVG). The slide displays a progression of abstract animal figures, showing how the model’s ability to represent concepts improved over time during training.

This references the paper “Sparks of Artificial General Intelligence,” which caused significant waves in the tech community. It suggests that these models were beginning to show signs of Artificial General Intelligence (AGI)—reasoning capabilities that extend beyond narrow tasks.

The visual progression from crude shapes to recognizable forms serves as a metaphor for the rapid evolution of these models. It highlights the mystery and potential power hidden within the training process of Large Language Models (LLMs).

3. Extinction Level Threat?

Slide 3

(Timestamp: 01:36)

The presentation addresses the extreme concerns surrounding the rapid scaling of AI technologies. The slide features a dramatic image reminiscent of the Terminator, referencing fears that unchecked AI development could pose an “extinction-level” threat to humanity.

Rajiv notes that as these technologies scale, there is a segment of the research and safety community worried about catastrophic outcomes. This sets up a contrast between the theoretical existential risks and the practical, everyday reality of how AI is currently being used.

This slide acknowledges the “hype and fear” cycle that dominates the media narrative, validating the audience’s anxiety before pivoting to a more grounded explanation of how the technology actually works.

4. The New AI Overlords

Slide 4

(Timestamp: 01:42)

Shifting to a lighter tone, this slide highlights the widespread adoption of AI by the younger generation. It cites a statistic that 89% of students have used ChatGPT for homework, humorously suggesting that children have already “accepted our new AI overlords.”

The slide points out a discrepancy in honesty, noting that while 89% use it, a significant portion (implied by the “11% are lying” joke) might not admit it. This reflects a fundamental shift in education and information retrieval that has already taken place.

This context emphasizes that the AI revolution is not just a future possibility but a current reality affecting how the next generation learns and works. It underscores the urgency of understanding these tools.

5. Fundamental Questions

Slide 5

(Timestamp: 01:53)

This slide poses the central questions that the presentation will answer: “What is AI doing?” and “How should you think about AI?” It sets the agenda for the technical explanation that follows.

Rajiv transitions here from the societal impact of AI to the mechanics of machine learning. He prepares the audience to look “under the hood” to demystify the “magic” of tools like ChatGPT.

The goal is to move the audience from passive consumers of AI hype to critical thinkers who understand the limitations and capabilities of the technology based on how it is built.

6. How We Teach Computers

Slide 6

(Timestamp: 02:02)

The presentation begins its technical explanation with a fundamental question: “How do we teach computers?” The slide uses imagery of blueprints and tools, likening the traditional process of building AI models to craftsmanship.

This introduces the concept of Supervised Learning in a relatable way. Before discussing neural networks, Rajiv grounds the audience in traditional analytics, where humans explicitly guide the machine on what to look for.

The focus here is on the human element in traditional machine learning—the “artisan” who must carefully select inputs to get a desired output.

7. Identifying Features

Slide 7

(Timestamp: 02:11)

Using a real estate example, this slide explains the concept of Features (or variables). To teach a computer to value a house, one must identify specific characteristics like square footage, number of bedrooms, or closet space.

Rajiv explains that we capture these characteristics and organize them into a tabular format. This process is known as Feature Engineering, where the data scientist decides which attributes are relevant for the problem at hand.

This is the bedrock of traditional enterprise AI: converting real-world objects into structured data points that a machine can process mathematically.

8. Historical Data Patterns

Slide 8

(Timestamp: 02:50)

This slide displays a scatter plot correlating “Sales Price” with “Square Feet.” It illustrates how enterprises gather historical data and look back through it for patterns and relationships.

Rajiv notes that much of traditional analytics is simply looking at this historical data to understand what happened. However, the power of AI lies in using this data for forward-looking purposes.

The visual clearly shows a trend: as square footage increases, the price generally increases. This linear relationship is what the machine needs to “learn.”

9. Learning the Model

Slide 9

(Timestamp: 03:03)

Here, a line is drawn through the data points on the scatter plot. This line represents the Model. Learning, in this context, is simply the mathematical process of fitting this line to the historical data to minimize error.

Rajiv explains that the model “understands the relationships” defined by the data. Instead of a human manually writing rules, the algorithm finds the best-fit trend based on the input features.

This simplifies the concept of training a model down to its essence: finding a mathematical representation of a trend within a dataset.

10. Making Predictions

Slide 10

(Timestamp: 03:16)

This slide demonstrates the utility of the trained model. When a “New House” comes onto the market, the model uses the learned line to predict its value based on its square footage.

This defines the Inference stage of machine learning. The model is no longer learning; it is applying its “knowledge” (the line) to unseen data to generate a prediction.

It highlights the portability of a model—once trained, it can be used to make rapid assessments of new data points without human intervention.
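The fit-then-predict flow from the last two slides can be sketched in a few lines. This is a minimal ordinary least squares example on a single feature. The house data, the `fit_line` and `predict` helpers, and the 1,800 square foot "new house" are all made up for illustration and do not come from the talk.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Inference: apply the learned line to an unseen input."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical historical sales: square feet and sale price in dollars.
square_feet = [1000, 1500, 2000, 2500]
sale_price = [300_000, 400_000, 500_000, 600_000]

model = fit_line(square_feet, sale_price)   # "learning"
print(predict(model, 1800))                 # "making a prediction"
```

Training happens once in `fit_line`; after that, `predict` only reads the two stored numbers, which is why a trained model is cheap to carry around and reuse.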

11. The Domain Limitation

Slide 11

(Timestamp: 03:30)

The presentation introduces a critical limitation of traditional models. The slide shows the model trained on San Francisco data being applied to houses in South Carolina. The result is labeled “Poor Model.”

Rajiv explains that while you can technically take the model with you, it will fail because the underlying relationships between features (size) and targets (price) are different in different domains (geographies).

This illustrates the concept of Domain Shift or lack of generalization. A model is only as good as the data it was trained on, and it assumes the future (or new location) looks exactly like the past.

12. The Thinking Emoji

Slide 12

(Timestamp: 03:43)

This slide reinforces the previous point with a thinking emoji, emphasizing the realization that the existing model is inadequate. The “San Francisco Model” does not fit the “South Carolina Data.”

It serves as a visual pause to let the problem sink in: traditional machine learning is brittle. It requires the data distribution to remain constant.

Rajiv uses this to set up the labor-intensive nature of traditional analytics, where models cannot simply be “transferred” across different contexts.

13. Train New Model

Slide 13

(Timestamp: 03:55)

The solution in the traditional paradigm is presented here: “Train New Model.” To get accurate predictions for South Carolina, one must collect local data and repeat the entire training process from scratch.

This highlights the “Never-Ending Battle” of enterprise analytics. Data scientists are constantly retraining models for every specific region, product line, or use case.

This sets the baseline for why Transfer Learning (introduced later) is such a revolution. In the old way, knowledge was not portable; every problem required a bespoke solution.

14. Artisan AI

Slide 14

(Timestamp: 04:15)

Rajiv coins the term “Artisan AI” to describe this traditional approach. The slide features an image of a craftsman, symbolizing that these models are hand-built and rely heavily on human-crafted features.

This approach is slow and difficult to scale. Just as an artisan can only produce a limited number of goods, a data science team using these methods can only maintain a limited number of models.

It emphasizes that the intelligence in these systems comes largely from the human who engineered the features, not the machine itself.

15. Enterprise AI Use Cases

Slide 15

(Timestamp: 04:23)

This slide lists common Enterprise AI applications: Forecasting, Pricing, Customer Churn, and Fraud. It notes that 80% of production models currently fall into this category.

Rajiv grounds the talk in the reality of today’s business world. Despite the hype around Generative AI, most companies are still running on these “Artisan” structured data models.

This distinction is crucial for understanding the market. There is “Old AI” (highly effective, structured, labor-intensive) and “New AI” (generative, unstructured, scalable), and they solve different problems.

16. The Computer Science Perspective

Slide 16

(Timestamp: 04:41)

The presentation shifts from the enterprise view to the academic Computer Science view. The slide asks, “How should we teach computers?” signaling a move toward more advanced methodologies.

Rajiv indicates that computer scientists were trying to find ways to move beyond the limitations of manual feature engineering. They wanted machines to learn the features themselves.

This transition introduces the concept of Deep Learning and the move toward processing unstructured data like audio, images, and text.

17. Frederick Jelinek’s Insight

Slide 17

(Timestamp: 04:49)

This slide introduces a quote from Frederick Jelinek, a pioneer in speech recognition: “Every time I fire a linguist, the performance of the speech recognizer goes up.”

This provocative quote encapsulates a major shift in AI philosophy. It suggests that human expertise (linguistics) often gets in the way of raw data processing. Instead of hard-coding grammar rules, it is better to let the model learn patterns directly from the data.

Rajiv asks the audience to “chew on that,” as it foreshadows the “Bitter Lesson” of AI: massive compute and data often outperform human domain expertise.

18. Computer Vision in 2010

Slide 18

(Timestamp: 05:47)

The slide depicts the state of Computer Vision around 2010. It shows a process of manual feature extraction (like HOG - Histogram of Oriented Gradients) used to identify shapes and edges.

Rajiv explains that even in vision, researchers were essentially doing “Artisan AI.” They sat around thinking about how to mathematically describe the shape of a car or a truck to a computer.

This illustrates that before the deep learning boom, computer vision was stuck in the same “feature engineering” trap as tabular analytics.

19. SVM Classification

Slide 19

(Timestamp: 06:05)

Following feature extraction, this slide shows a Support Vector Machine (SVM) classifier separating data points (cars vs. trucks). This was the standard approach: extract features manually, then use a simple algorithm to classify them.

This reinforces the previous point about the limitations of the time. The intelligence was in the manual extraction, not the classification model.

Rajiv mentions his own work at Caterpillar, noting that this was exactly how they tried to separate images of machinery—a tedious and specific process.

20. Fei-Fei Li and Big Data

Slide 20

(Timestamp: 06:15)

The slide introduces Professor Fei-Fei Li, a visionary in computer vision. It features a collage of images, hinting at the need for scale.

Rajiv explains that Fei-Fei Li recognized that for computer vision to advance, it needed to move away from tiny datasets (100-200 images) and toward massive scale. She understood that deep learning required vast amounts of data to generalize.

This marks the beginning of the “Big Data” era in AI, where the focus shifted from better algorithms to better and larger datasets.

21. ImageNet

Slide 21

(Timestamp: 06:41)

This slide details ImageNet, the dataset Fei-Fei Li helped create. The full dataset contains roughly 14 million images, and the benchmark subset used in the ImageNet challenge covers 1000 classes.

Rajiv highlights the sheer effort involved, noting the use of Mechanical Turk to crowdsource the labeling of these images. He calls this the “dirty secret” of AI—that it is powered by low-wage human labor labeling data.

ImageNet became the benchmark that drove the AI revolution. It provided the “fuel” necessary for neural networks to finally work.

22. AlexNet and GPUs

Slide 22

(Timestamp: 07:44)

The presentation introduces Alex Krizhevsky, a graduate student under Geoffrey Hinton. The slide mentions “AlexNet” and the use of GPUs (Graphics Processing Units).

Rajiv tells the story of how Alex decided to use NVIDIA gaming cards to train neural networks. Traditional CPUs were too slow for the math required by deep learning.

This moment—combining the massive ImageNet dataset with the parallel processing power of GPUs—was the “big bang” of modern AI.

23. AlexNet Training Details

Slide 23

(Timestamp: 08:05)

This slide provides the technical specs of AlexNet: trained on 1.2 million images, using 2 GPUs, taking roughly 6 days, with 60 million parameters.

Rajiv emphasizes that while 6 days seems long, the result was a model vastly superior to anything else. It proved that neural networks, which had been theoretical for decades, were now practical.

The “60 million parameters” figure is a precursor to the “billions” and “trillions” we see today, marking the start of the parameter scaling race.

24. Crushing the Competition

Slide 24

(Timestamp: 08:27)

A chart displays the results of the ImageNet Large Scale Visual Recognition Challenge. It shows AlexNet achieving a significantly lower error rate than the competitors.

Rajiv notes that the performance jump was so dramatic that by the following year, every competitor had switched to using the AlexNet architecture.

This visualizes the paradigm shift. The “Artisan” methods were instantly obsolete, replaced by Deep Learning.

25. Feature Engineering vs. Deep Learning

Slide 25

(Timestamp: 08:37)

Using a humorous meme format, this slide compares the “Old Way” (Feature Engineering + SVM) with the “New Way” (AlexNet). The AlexNet side is depicted as a powerful, overwhelming force.

This solidifies the takeaway: Deep Learning didn’t just improve upon the old methods; it completely replaced them for unstructured data tasks like vision.

It emphasizes that the model learned the features itself (edges, textures, shapes) rather than having humans manually code them.

26. The 1000 Classes

Slide 26

(Timestamp: 08:51)

This slide shows examples of the 1000 classes in ImageNet, ranging from specific dog breeds to everyday objects.

Rajiv explains that this model learned to identify a vast array of things from the raw pixels. It went from raw vision to understanding textures, shapes, and objects.

However, he sets up the next problem: What if you want to identify something not in those 1000 classes?

27. The Hot Dog Problem

Slide 27

(Timestamp: 09:01)

Referencing a famous scene from the show Silicon Valley, this slide presents the specific challenge of classifying “Hot Dogs.”

Rajiv uses this to ask: How do you help a buddy with a startup who needs to find hot dogs if “hot dog” isn’t one of the primary categories, or if they need a specific type of hot dog? Do you have to start from scratch?

This sets the stage for Transfer Learning, which removes the need for 14 million images every time you have a new problem.

28. Pre-Trained Models

Slide 28

(Timestamp: 09:12)

The slide introduces the concept of a Pre-trained Model. This is the model that has already learned the 1000 classes from ImageNet.

Rajiv explains that this model already “knows” how to see. It understands edges, curves, and textures. This knowledge is contained in the “weights” of the neural network.

The key idea is that we don’t need to relearn how to “see” every time we want to identify a new object.

29. Transfer Learning Mechanics

Slide 29

(Timestamp: 09:30)

This technical slide illustrates how Transfer Learning works. It shows the layers of a neural network. We keep the early layers (which know shapes and textures) and only retrain the final layers for the new task (e.g., identifying boats).

Rajiv explains that we can transfer “most of that knowledge” and only change a small number of parameters (less than 10%).

This is the revolution: You can build a world-class model with a small amount of data by standing on the shoulders of the giant ImageNet model.
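The freeze-and-retrain idea can be illustrated without a deep learning framework. In the sketch below, a fixed random projection with ReLU stands in for the pre-trained early layers (an assumption made so the example stays self-contained; it is not a real ImageNet model), and only a small logistic-regression "final layer" is trained on a made-up dataset for the new task. All names and sizes are hypothetical.

```python
import copy
import math
import random

random.seed(0)

IN_DIM, FEAT_DIM = 16, 16

# Stand-in for a pre-trained backbone (assumption: a fixed random projection
# plus ReLU, not a real ImageNet model). These weights stay frozen.
FROZEN_W = [
    [random.gauss(0, 1) / math.sqrt(IN_DIM) for _ in range(IN_DIM)]
    for _ in range(FEAT_DIM)
]

def backbone(x):
    """Frozen feature extractor: reads FROZEN_W but never updates it."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_W]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_head(examples, epochs=200, lr=0.05):
    """Fit only the new final layer (logistic regression) on frozen features."""
    w, b = [0.0] * FEAT_DIM, 0.0
    feats = [(backbone(x), y) for x, y in examples]
    for _ in range(epochs):
        for f, y in feats:
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)
            g = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b

# Small hypothetical dataset for the new task (label 1 = "boat").
examples = []
for _ in range(40):
    x = [random.gauss(0, 1) for _ in range(IN_DIM)]
    examples.append((x, 1 if sum(x[:4]) > 0 else 0))

frozen_before = copy.deepcopy(FROZEN_W)
head_w, head_b = train_head(examples)

frozen_params = IN_DIM * FEAT_DIM
head_params = len(head_w) + 1
trainable_fraction = head_params / (frozen_params + head_params)
print(f"trainable fraction: {trainable_fraction:.1%}")
```

In this toy setup the head holds well under 10% of all parameters, which mirrors the proportion mentioned on the slide; in a real network the frozen share is typically far larger.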

30. The Revolution

Slide 30

(Timestamp: 09:54)

A graph titled “Transfer Learning Revolution” shows the dramatic improvement in accuracy when using transfer learning versus training from scratch. It includes a quote from Andrew Ng stating that transfer learning will be the next driver of commercial success.

Rajiv emphasizes that this capability allowed startups and companies to build powerful AI without needing Google-sized datasets. It democratized access to high-performance computer vision.

This wraps up the vision section of the talk, establishing Transfer Learning as the “Spark.”

31. The Implications

Slide 31

(Timestamp: 10:11)

The slide shows a YouTube video thumbnail from 2016 featuring Geoffrey Hinton. This transitions the talk to the societal and professional implications of this technology.

Rajiv prepares to share a famous prediction by Hinton regarding the medical field, specifically radiology. It signals a shift from “how it works” to “what it does to jobs.”

32. The Coyote Moment

Slide 32

(Timestamp: 10:19)

The slide displays a webpage for the University of Cincinnati Radiology Fellows. Rajiv quotes Hinton: “Radiologists are like the coyote that’s already over the edge of the cliff but hasn’t yet looked down.”

Hinton suggested people should stop training radiologists because AI interprets images better. Rajiv humorously notes that since he was speaking at U of C, he had to show the “coyotes” in the audience.

This highlights the tension between AI capabilities and human expertise, a recurring theme in the presentation.

33. NLP: The Academic View

Slide 33

(Timestamp: 10:42)

The presentation switches domains from Computer Vision to Natural Language Processing (NLP). The slide depicts a traditional academic setting, representing the text researchers.

Rajiv explains that while Computer Vision was having its revolution with AlexNet, the text folks were still doing things the “Old Way”—crafting features and rules for language.

They saw the success in vision and wondered how to replicate it for text, but language initially proved more difficult to model than images.

34. Traditional NLP Tasks

Slide 34

(Timestamp: 11:04)

This slide lists various NLP tasks: Classification, Information Extraction, and Sentiment Analysis.

Rajiv notes that traditionally, each of these was a separate discipline. You built a specific model for sentiment, a different one for translation, and another for summarization. There was no “one model to rule them all.”

This fragmentation made NLP difficult and resource-intensive, as knowledge didn’t transfer between tasks.

35. The GLUE Benchmark

Slide 35

(Timestamp: 11:25)

The slide introduces the GLUE Benchmark (General Language Understanding Evaluation). This was a collection of different text tasks put together to measure general language ability.

Rajiv explains this was an attempt to push the field toward general-purpose models. Researchers wanted a single metric to see if a model could understand language broadly, not just solve one specific trick.

36. The Transformer Architecture

Slide 36

(Timestamp: 11:37)

This slide marks the turning point for text: the introduction of the Transformer architecture by Google researchers in 2017 (the “Attention Is All You Need” paper).

Rajiv highlights that this architecture was not only more accurate (higher BLEU scores) but, crucially, more efficient.

The Transformer allowed for parallel processing of text, unlike previous sequential models (RNNs/LSTMs), unlocking the ability to train on massive datasets.

37. Lower Training Costs

Slide 37

(Timestamp: 11:46)

The slide emphasizes the Training Cost reduction associated with Transformers.

Rajiv points out that because the architecture used less processing power per unit of data, researchers immediately asked: “What happens if we give it more processing?”

This efficiency paradox—making something cheaper allows you to do vastly more of it—sparked the scaling era of LLMs.

38. Exponential Growth

Slide 38

(Timestamp: 11:56)

A graph demonstrates the exponential growth in the size of Transformer models (measured in parameters) over just a few years. The curve shoots upward vertically.

Rajiv explains that this scaling—simply making the models bigger and feeding them more data—led to the performance of GPT-4.

This visualizes the “Scale” aspect of modern AI. We haven’t necessarily changed the architecture since 2017; we’ve just made it significantly larger.

39. GPT-4 and Images

Slide 39

(Timestamp: 12:11)

The presentation circles back to the GPT-4 generated images from Slide 2.

Rajiv connects the Transformer architecture and scaling directly to these “Sparks of AGI.” The ability to reason and draw emerged from simply predicting the next word at a massive scale.

40. The Era of ChatGPT

Slide 40

(Timestamp: 12:16)

The slide displays the ChatGPT logo, symbolizing the current era where these technical advancements reached the public consciousness.

Rajiv sets up the next section of the talk: explaining exactly how a model like ChatGPT is trained. He moves from history to the “Recipe.”

41. The Learning Process

Slide 41

(Timestamp: 12:20)

A visual diagram outlines the evolutionary stages of ChatGPT. It previews the three steps Rajiv will cover: Pre-training, Fine-tuning, and Alignment.

This roadmap helps the audience understand that ChatGPT isn’t just one static thing; it’s the result of a multi-stage pipeline involving different types of learning.

42. Recipe Step 1: Foundation Model

Slide 42

(Timestamp: 12:27)

The first step identified is the “Foundation Model” (or Base Model).

Rajiv explains that the core capability of these models is Next Word Prediction. Before it can answer questions or be helpful, it must simply learn the statistical structure of language.

43. Predictive Keyboards

Slide 43

(Timestamp: 12:31)

To make the concept relatable, the slide compares LLMs to the predictive text feature on a smartphone keyboard.

Rajiv notes that while the game on your phone is simple, scaling that concept up to the entire internet makes it incredibly powerful. It grounds the “magic” of AI in a familiar user experience.

44. Next Token Prediction

Slide 44

(Timestamp: 12:40)

This technical slide defines “Next Token Prediction.” It explains that the model looks at a sequence of text and calculates the probability of what comes next.

Rajiv emphasizes that this is a hard statistical problem. There are many possibilities for the next word, and the model must learn to weigh them based on context.
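A toy version of this calculation is a bigram model: count which token follows which, then normalize the counts into probabilities. This is a deliberately tiny sketch and not how a Transformer computes its distribution. The corpus, the whitespace tokenizer, and the helper names are assumptions for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for every token, which tokens follow it."""
    tokens = text.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        follows[current][nxt] += 1
    return follows

def next_token_probs(follows, token):
    """Turn raw counts into a probability distribution over next tokens."""
    counts = follows.get(token, Counter())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

corpus = "the cat sat on the mat . the cat ate the fish ."
model = train_bigram(corpus)
print(next_token_probs(model, "the"))
```

Even in this small corpus several continuations compete after "the", which is the weighing problem the slide describes. A large language model solves the same problem with a far longer context and a learned function instead of a count table.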

45. The Homer Simpson Challenge

Slide 45

(Timestamp: 13:05)

Rajiv introduces a specific experiment: Training a Transformer to speak like Homer Simpson. He mentions using 7MB of Simpsons scripts (~7 million tokens).

This serves as a concrete example to show how training data size affects model performance.

46. 4 Million Tokens

Slide 46

(Timestamp: 13:32)

The slide shows the output of the model when trained on only 4 Million tokens. The text is “nonsensical and random.”

Rajiv demonstrates that with insufficient data, the model hasn’t learned grammar or structure yet. It’s just outputting characters.

47. 16 Million Tokens

Slide 47

(Timestamp: 13:37)

At 16 Million tokens, the output improves slightly. It contains random words and incorrect grammar, but it’s recognizable as language.

This illustrates the “grokking” phase where the model starts to pick up on basic syntax but lacks semantic meaning.

48. 64 Million Tokens

Slide 48

(Timestamp: 13:39)

With 64 Million tokens, the model generates text that is “close to a proper sentence” and sounds vaguely like Homer Simpson.

Rajiv uses this progression to prove that these models are statistical engines. With enough data, they mimic the patterns of the training set effectively.

49. GPT-2 Specifications

Slide 49

(Timestamp: 13:54)

The slide details GPT-2 (released in 2019), which had 1.5 Billion parameters.

Rajiv recalls that when GPT-2 came out, he wasn’t excited because it was just a “creative storytelling model.” It wasn’t factually accurate. He wants the audience to remember that at their core, these models are just predicting the next word, not checking facts.

50. Llama 3.1 and Scale

Slide 50

(Timestamp: 14:27)

Updating the timeline, this slide shows Llama 3.1. It highlights the training data: 15 Trillion Tokens and the compute: 40 Million GPU Hours.

Rajiv emphasizes that 15 trillion tokens is an “unfathomable amount of information.” The scale has increased 10,000x since GPT-2.

This underscores the energy and compute intensity of modern AI—it requires massive infrastructure.

51. Hallucinations

Slide 51

(Timestamp: 15:15)

This slide addresses Hallucinations. It uses an example of asking for the “Capital of Mars.” The model will confidently invent an answer.

Rajiv argues that “hallucination” isn’t the right metaphor because the model isn’t malfunctioning. It is doing exactly what it was designed to do: predict the most likely next word. It has no concept of “truth,” only statistical likelihood.

52. GPT-2 Failure on Sentiment

Slide 52

(Timestamp: 16:17)

Rajiv shows an example of trying to use the base GPT-2 model for a specific task: Customer Sentiment. When prompted, the model just continues the story instead of classifying the sentiment.

This illustrates that Base Models are creative but not useful for following instructions. They don’t know they are supposed to solve a problem; they just want to write text.

53. Recipe Step 2: Instruction Fine-Tuned

Slide 53

(Timestamp: 16:38)

This introduces the second step in the ChatGPT recipe: “Instruction Fine-Tuned Model.”

Rajiv explains that to make the model useful, we must teach it to follow orders. This is done via Transfer Learning—taking the base model and training it further on examples of instructions and answers.

54. Fine-Tuning for Sentiment

Slide 54

(Timestamp: 16:46)

The slide shows the process of fine-tuning the language model specifically for Sentiment Analysis.

By showing the model examples of “Sentence -> Sentiment,” we can tweak the parameters so it learns to perform classification rather than just storytelling.

55. Multi-Task Fine-Tuning

Slide 55

(Timestamp: 17:22)

Rajiv expands the concept. We don’t just fine-tune for one task; we fine-tune for Topic Classification as well.

The key insight is that one model can now solve multiple problems. Unlike the “Old NLP” where you needed separate models, the LLM can swap between tasks based on the instruction.

56. Translation Task

Slide 56

(Timestamp: 17:24)

The slide adds Translation to the mix, using about 10,000 examples.

This reinforces the “General Purpose” nature of LLMs. They are Swiss Army knives for text.
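The multi-task idea from the last three slides comes down to putting every task into the same text format, so a single model sees the instruction as part of its input. The sketch below shows one possible layout. The `Instruction/Input/Response` template, the `to_record` helper, and the example sentences are assumptions for illustration, not the format used by any particular model.

```python
def to_record(instruction, text, answer):
    """Format one supervised example as a prompt/completion pair."""
    prompt = f"Instruction: {instruction}\nInput: {text}\nResponse:"
    return {"prompt": prompt, "completion": " " + answer}

# Hypothetical examples covering the three tasks named on the slides.
examples = [
    ("Classify the sentiment as positive or negative.",
     "The battery died after one day.", "negative"),
    ("Classify the topic of the sentence.",
     "The central bank raised interest rates.", "finance"),
    ("Translate the sentence to French.",
     "Good morning.", "Bonjour."),
]

dataset = [to_record(*ex) for ex in examples]
for record in dataset:
    print(record["prompt"] + record["completion"])
    print("---")
```

Because the task is named in the prompt itself, the training objective stays next-token prediction; only the data changes.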

57. Generalization to New Tasks

Slide 57

(Timestamp: 17:30)

Rajiv poses a challenge: What happens if you give the model a task it hasn’t seen before?

The slide indicates the model will try to solve it. This is the breakthrough of Generalization. Because it understands language so well, it can interpolate and attempt tasks it wasn’t explicitly trained on.

58. Practical Applications

Slide 58

(Timestamp: 17:55)

This slide showcases the wide array of use cases: Code explanation, Creative writing, Information extraction, etc.

Rajiv explains that these capabilities exist because we have “trained these models to follow instructions.” This is why we can talk to them via Prompts.

59. Zero Shot Learning

Slide 59

(Timestamp: 18:19)

The slide introduces “Zero shot learning” and “Prompting.”

This is the ability to get a result without showing the model any examples (zero shots). Rajiv notes that there is a “whole language” around prompting, but fundamentally, it’s just giving the model the instruction we trained it to expect.

60. Weeks vs. Days

Slide 60

(Timestamp: 18:41)

A comparison slide contrasts “Training a ML Model (weeks)” with “Prompting a LLM (days).”

Rajiv highlights the efficiency shift. In the old days, solving a sentiment problem meant weeks of data collection and training. Now, it takes minutes to write a prompt. This is a massive productivity booster for NLP tasks.

61. Reasoning and Planning

Slide 61

(Timestamp: 19:25)

The presentation pivots to the limitations of LLMs, specifically regarding Reasoning and Planning. The slide shows a “Block Stacking” puzzle.

Rajiv explains that stacking blocks requires planning several steps ahead. It is not a one-step prediction problem; it requires maintaining a state of the world in memory.

62. Mystery World Failure

Slide 62

(Timestamp: 20:40)

The slide introduces “Mystery World,” a variation of the block problem where the names of the blocks are changed to random words.

While a human (or a 4-year-old) understands that changing the name doesn’t change the physics of stacking, GPT-4 fails (3% accuracy). Rajiv explains that the model gets distracted by the creative aspect of the words and loses the logical thread. It shows these models struggle with abstract reasoning.

63. Recipe Step 3: Aligned Model

Slide 63

(Timestamp: 21:56)

The final step in the recipe is the “Aligned Model.”

Rajiv introduces the need for safety and helpfulness. A model that follows instructions perfectly might follow bad instructions. We need to align it with human values.

64. Galactica: Science LLM

Slide 64

(Timestamp: 22:00)

The slide presents Galactica, a model released by Meta focused on science.

Rajiv describes the intent: a helpful assistant for researchers to write code, summarize papers, and generate scientific content. It was meant to be a specialized tool.

65. Galactica Output

Slide 65

(Timestamp: 22:20)

An example of Galactica’s output shows it generating technical content.

Rajiv highlights the potential utility. It looked like a powerful tool for accelerating scientific discovery.

66. Galactica Pulled

Slide 66

(Timestamp: 22:47)

The slide reveals that Meta pulled the model shortly after release.

Rajiv explains why: users found they could ask it for the “benefits of eating crushed glass” or “benefits of suicide,” and the model would happily generate a scientific-sounding justification. It lacked a safety layer. This incident underscored the necessity of Red Teaming and alignment before release.

67. Learning What is Helpful

Slide 67

(Timestamp: 23:59)

To explain how we define “helpful,” Rajiv shows a Stack Overflow question.

He notes that defining “helpful” mathematically is difficult. Unlike “square footage,” helpfulness is subjective and nuanced.

68. Technical Answer

Slide 68

(Timestamp: 24:05)

The slide shows a detailed technical answer.

Rajiv points out that trying to create a “feature list” for what makes this answer helpful is nearly impossible. We can’t write a rule-based program to detect helpfulness.

69. The Dating App Analogy

Slide 69

(Timestamp: 24:26)

Rajiv uses a humorous Dating App analogy. He compares the “Old Way” (filling out long compatibility forms/features) with the “New Way” (Swiping).

He explains that Swiping is a way of capturing human preferences without asking the user to explicitly define them. This is how we teach AI what is helpful.

70. Collect Human Feedback

Slide 70

(Timestamp: 24:55)

The slide details the process: “Collect Human Feedback.”

We show a human two responses from the model and ask, “Which is better?” By collecting thousands of these “swipes,” we build a dataset of human preferences.
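
A preference dataset of this kind can be sketched as a list of pairwise comparisons. The `Comparison` record, the helper names, and the sample judgments below are all hypothetical, chosen only to show the shape of the data; the talk does not specify a format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    """One human "swipe": which of two responses to a prompt was preferred."""
    prompt: str
    chosen: str
    rejected: str


def record(prompt, response_a, response_b, prefers_a):
    """Turn a single A-versus-B judgment into a stored comparison."""
    if prefers_a:
        return Comparison(prompt, chosen=response_a, rejected=response_b)
    return Comparison(prompt, chosen=response_b, rejected=response_a)


def win_rates(comparisons):
    """Fraction of appearances in which each response was the preferred one."""
    wins, seen = Counter(), Counter()
    for c in comparisons:
        wins[c.chosen] += 1
        seen[c.chosen] += 1
        seen[c.rejected] += 1
    return {response: wins[response] / seen[response] for response in seen}


# Hypothetical judgments, for illustration only.
prompt = "How do I reverse a list in Python?"
terse = "Use reversed()."
detailed = "Use my_list[::-1] for a copy, or my_list.reverse() to reverse in place."
dataset = [
    record(prompt, terse, detailed, prefers_a=False),
    record(prompt, detailed, terse, prefers_a=True),
    record(prompt, terse, detailed, prefers_a=True),
]
rates = win_rates(dataset)
print(rates)
```

Note that nobody had to define what “helpful” means; the labels only say which of two responses a person preferred.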

71. RLHF (Reinforcement Learning from Human Feedback)

Slide 71

(Timestamp: 25:05)

This slide introduces the technical term: RLHF.

Rajiv explains this is the layer that turns a raw instruction-following model into a safe, helpful product like ChatGPT. It is an active curation process, similar to curating an Instagram feed.
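
The talk does not go into the mechanics, but one common way to turn pairwise preferences into a training signal is to fit a score for each response so that preferred responses score higher (a Bradley-Terry style objective). The sketch below assumes that formulation and fits one scalar per response with plain gradient ascent; in practice a neural reward model produces the scores, and a separate reinforcement learning step then tunes the language model against it. The function names and the sample pairs are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_scores(pairs, steps=500, lr=0.1):
    """Fit one score per response from (preferred, rejected) pairs.

    Maximizes the sum of log sigmoid(score[preferred] - score[rejected]),
    an assumed Bradley-Terry style objective.
    """
    scores = {}
    for preferred, rejected in pairs:
        scores.setdefault(preferred, 0.0)
        scores.setdefault(rejected, 0.0)
    for _ in range(steps):
        for preferred, rejected in pairs:
            p = sigmoid(scores[preferred] - scores[rejected])
            step = lr * (1.0 - p)  # gradient of log p w.r.t. the preferred score
            scores[preferred] += step
            scores[rejected] -= step
    return scores


# Hypothetical human judgments over three response styles.
pairs = [
    ("helpful", "blunt"),
    ("helpful", "sycophantic"),
    ("blunt", "sycophantic"),
]
scores = fit_scores(pairs)
print(sorted(scores, key=scores.get, reverse=True))
```

The learned scores recover the ranking implied by the human choices, which is the signal used to steer the model toward preferred behavior.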

72. The Makeover Example

Slide 72

(Timestamp: 25:27)

A “Before and After” makeover image illustrates the effect of RLHF.

The “Before” is the raw model (messy, potentially harmful). The “After” is the aligned model (polished, safe, presentable).

73. Tuning Responses

Slide 73

(Timestamp: 25:44)

The slide shows different ways an AI can answer a question: Sycophantic (sucking up to the user), Baseline Truthful (blunt), or Helpful Truthful.

Rajiv notes we can train models to have specific personalities. We can make them polite, or we can make them “kiss your butt” if the user wants validation.

74. AI Conversations

Slide 74

(Timestamp: 26:17)

This slide references Character.ai and the trend of people spending hours talking to AI personas.

Rajiv mentions research showing people sometimes prefer AI doctors over human ones because the AI is patient, listens, and is polite (due to alignment). This suggests a future where AI handles high-touch conversational roles.

75. The Full Recipe

Slide 75

(Timestamp: 27:19)

The presentation summarizes the full pipeline: Foundation Model -> Instruction Fine-Tuned -> Aligned Model.

This visual recap cements the three-stage process in the audience’s mind.

76. Learning Mechanisms Recap

Slide 76

(Timestamp: 27:23)

Rajiv maps the learning mechanisms to the stages:

1. Next Word Prediction (Foundation)
2. Multi-task Training (Instruction)
3. Human Preferences (Alignment)

He reiterates that understanding these three mechanisms helps explain why the models behave the way they do (hallucinations, ability to code, politeness).

77. Key Takeaways

Slide 77

(Timestamp: 27:43)

The presentation transitions to the conclusion with three main takeaways:

1. Measure Twice
2. Respect Scale
3. Critical Thinking

Rajiv notes in the video that he skimmed these in the original talk, but the slides provide the detail for how to work effectively with AI.