<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Rajiv Shah - rajistics blog</title>
<link>https://rajivshah.com/blog/</link>
<atom:link href="https://rajivshah.com/blog/index.xml" rel="self" type="application/rss+xml"/>
<description>Rajistics blog</description>
<generator>quarto-1.8.27</generator>
<lastBuildDate>Fri, 26 Dec 2025 06:00:00 GMT</lastBuildDate>
<item>
  <title>Running Code and Failing Models</title>
  <link>https://rajivshah.com/blog/running-code-failing-models.html</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/target_leakage.jpg" class="img-fluid figure-img"></p>
<figcaption>target_leakage.jpg</figcaption>
</figure>
</div>
<p>Machine learning is a glass cannon. When used correctly, it can be a truly transformative technology, but just a small oversight can cause it to become misleading and even actively harmful. Even if all the code runs and the model seems to be spitting out reasonable answers, it’s possible for a model to encode fundamental data science mistakes that invalidate its results. These errors might seem small, but the effects can be disastrous when the model is used to make decisions in the real world.</p>
<p>The promise and power of AI lead many researchers to gloss over the ways in which things can go wrong when building and operationalizing machine learning models. As a data scientist, one of my passions is to reproduce research papers as a learning exercise. Along the way, I have uncovered cases where the research was published with faulty methodologies. My hope is that this analysis can increase awareness about data science mistakes and raise the standards for machine learning in research. For example, last year I shared an analysis of a project by Harvard and Google researchers that contained fundamental errors. The researchers refused to fix their mistake even when confronted with it directly.</p>
<p>Over the holidays, I used DataRobot to reproduce a few machine learning benchmarks. I found many examples of machine learning code that ran without errors but that were built using flawed data science practices. The examples I share in this post come from the world’s best data scientists and affect hundreds of peer-reviewed research publications. As these examples show, errors in machine learning can be subtle. The key to finding these errors is to work with a tool that offers guardrails and insights along the way.</p>
<section id="target-leakage-in-a-fast.ai-example" class="level2">
<h2 class="anchored" data-anchor-id="target-leakage-in-a-fast.ai-example">Target Leakage in a fast.ai Example</h2>
<p><em>Deep Learning for Coders with fastai and PyTorch: AI Applications Without a PhD</em> by Jeremy Howard and Sylvain Gugger is a hands-on guide that helps people with little math background understand and use deep learning quickly. In the section about tabular datasets, the authors use the Blue Book for Bulldozers problem, the goal of which is to predict the sale price for heavy equipment at auction. I tried to replicate their machine learning model and wasn’t able to beat their model’s predictive performance, which piqued my interest.</p>
<p>After carefully inspecting their code, I found a mistake in their validation dataset. Their code attempted to create a validation test set based on a prediction point of November 1, 2011. The goal was to split the data at this point so that you could train on the data known at prediction time. The performance of the model is then analyzed on a test set, which is located after the prediction point. Unfortunately, the code was not written correctly; there was contamination from the future in the training data.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/leakage1.png" class="img-fluid figure-img"></p>
<figcaption>Leakage.png</figcaption>
</figure>
</div>
<p>The code below might at first look like it separates data before and after November 1, 2011, but there’s a subtle mistake that includes future dates. The use of information in the model training process that would not be expected at prediction time is known as <strong>target leakage</strong>, and it led to an over-optimistic accuracy. Because I used DataRobot, which requires and validates a date when creating a validation dataset based on time, I was able to find the mistake in the fast.ai book.</p>
<p>After the target leakage was fixed, the fast.ai scores dropped, and I was able to reproduce the results outside of fast.ai. This simple coding mistake led to a notebook and model that appeared valid. If this model were put into production, the results would have been much worse on new data. After I identified this issue, Jeremy Howard agreed to add a note in the course materials.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/fastai2.png" class="img-fluid figure-img"></p>
<figcaption>fastai2.png</figcaption>
</figure>
</div>
</section>
<section id="sarcos-dataset-failure" class="level2">
<h2 class="anchored" data-anchor-id="sarcos-dataset-failure">SARCOS Dataset Failure</h2>
<p>The SARCOS dataset is a widely used benchmark dataset in machine learning. Based on predicting the movement of a robotic arm, SARCOS appears in more than one hundred academic papers. I tested this dataset because it appears in various benchmarks by Google and fast.ai.</p>
<p>The SARCOS dataset is broken into two parts: a training dataset (sarcos_inv) and a test dataset (sarcos_inv_test). Following common data science practices, DataRobot broke the SARCOS training set into a training partition and a validation partition. I treated the SARCOS test set (sarcos_inv_test) as a holdout. When I looked at the results, I immediately noticed something suspicious. Do you see it?</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/sarcos3.png" class="img-fluid figure-img"></p>
<figcaption>sarcos3.png</figcaption>
</figure>
</div>
<p>The large drop between the validation score and the holdout score indicates that something is very different between the validation and holdout datasets. When I examined the holdout dataset (the SARCOS test set), I found that every row in the test set was in the training data too. After some investigation, I discovered that the holdout dataset was built out of the training dataset. Of the 4,449 examples in the test set, 4,445 examples are present in the training set, too. The target leakage here is significant. By overfitting or memorizing the training dataset, it’s possible to get perfect results on the test set. Overfitting, a well-known issue in machine learning, is illustrated in the following figure. The test dataset should have used out-of-sample testing to prevent overfitting.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/overfit4.png" class="img-fluid figure-img"></p>
<figcaption>overfit4.png</figcaption>
</figure>
</div>
<p>Target leakage helped to explain the very low scores of the deep learning models. For comparison, a random forest model achieves 2.38 mean squared error (MSE), while a deep learning model overfits and produces 0.038 MSE. Judging from the suspiciously large difference between the models, it appears that the deep learning model just memorized the training data, which is why it had such low error.</p>
<p>The consequences of this target leakage are far-reaching. More than one hundred journal articles relied on this dataset. Thousands of data scientists have used it to benchmark their machine learning code. Researcher Kai Arulkumaran has already acknowledged this issue and now the research community is dealing with the ramifications of the target leakage.</p>
<p>Why wasn’t this error discovered earlier? When I reproduced the SARCOS benchmarks, I used a tool that includes technical safeguards for proper validation splits and provides transparency in the display of the results of each split. DataRobot’s AutoML was designed by data scientists to prevent these sorts of issues. In contrast, working within code, it was quite easy to overlook this fundamental issue. After all, thousands of data scientists have rerun their code and published their results without a second thought.</p>
</section>
<section id="poker-hand-dataset" class="level2">
<h2 class="anchored" data-anchor-id="poker-hand-dataset">Poker Hand Dataset</h2>
<p>The Poker Hand dataset is another widely used benchmark dataset in machine learning. It’s used to predict poker hands (for example, a full house from five cards). The fast.ai and Google benchmarks for this model use the accuracy metric. Accuracy is a measurement for assessing the predictive performance of a model (basically, the percentage of predictions that are correct). Although it’s easy to get running code with the accuracy metric, it’s not good data science practice for this problem.</p>
<p>When DataRobot builds a model with the Poker Hand dataset, by default, it uses log loss as an optimization metric. Log loss is a measure of error for a model. At DataRobot, we believe that it isn’t good practice to use accuracy as your metric on a classification project with imbalanced classes. With imbalanced data, you can easily build a highly accurate model that’s useless.</p>
<p>To understand why accuracy isn’t the best metric when classifying unbalanced data, consider the following figure. Minesweeper is a popular game where the goal is to identify a few mines that are scattered across a board. Because there are a lot of squares with no mines, you could generate a very accurate model just by predicting that every square is safe. Although a 99% accurate model for Minesweeper sounds impressive, it’s not very useful.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/minesweeper5.png" class="img-fluid figure-img"></p>
<figcaption>minesweeper5.png</figcaption>
</figure>
</div>
<p>Automated feature selection in DataRobot provides a more parsimonious featurelist. In the Poker Hand dataset, DataRobot created a DR Reduced Features list with only six features. The starting feature list for this dataset, Cat+Cont, contained 15 features. The leaderboard below shows that the simpler DR Reduced Features list performs better than the full Cat+Cont feature list. The model below was optimized on log loss, but I am viewing the accuracy metrics for comparison to the existing benchmarks.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/DRreduce6.png" class="img-fluid figure-img"></p>
<figcaption>DRreduce6.png</figcaption>
</figure>
</div>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>I have shared simple examples of how data scientists can have running code, but failed models. After spending a week going through a half dozen datasets, I am even more convinced that automation with technical safeguards is a required part of building trusted AI. The mistakes I’ve shared here are not isolated incidents.</p>
<p>The issues go beyond the reproducibility crisis for machine learning research. It’s a great first step for researchers to publish their code and make the data available, but as these examples show, sharing code isn’t enough to validate models. So, what should you do about this?</p>
<p>In regulated industries, there are processes in place to validate running code (for example, building a challenger model using a different technical framework). For its safeguards and transparency, many organizations use DataRobot to validate models. Just rereading or rerunning a project isn’t enough to identify errors.</p>
</section>
<section id="links" class="level2">
<h2 class="anchored" data-anchor-id="links">Links</h2>
<ul>
<li><a href="https://medium.com/data-science/stand-up-for-best-practices-8a8433d3e0e8">Stand Up for Best Practices (Harvard Leakage)</a></li>
<li><a href="https://github.com/fastai/fastbook/issues/325">Fast.AI Issue</a></li>
<li><a href="https://github.com/Kaixhin/SARCOS">SARCOS</a></li>
</ul>


</section>

 ]]></description>
  <category>Leakage</category>
  <category>Earthquake</category>
  <category>SARCOS</category>
  <guid>https://rajivshah.com/blog/running-code-failing-models.html</guid>
  <pubDate>Fri, 26 Dec 2025 06:00:00 GMT</pubDate>
  <media:content url="https://rajivshah.com/blog/images/target_leakage.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>A Practical Guide to Evaluating Generative AI Applications</title>
  <link>https://rajivshah.com/blog/genai-evaluation-guide.html</link>
  <description><![CDATA[ 






<section id="video" class="level2">
<h2 class="anchored" data-anchor-id="video">Video</h2>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/qPHsWTZP58U" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<p>Watch the <a href="https://youtu.be/qPHsWTZP58U">full video</a></p>
<hr>
</section>
<section id="annotated-presentation" class="level2">
<h2 class="anchored" data-anchor-id="annotated-presentation">Annotated Presentation</h2>
<p>Below is an annotated version of the presentation, with timestamped links to the relevant parts of the video for each slide.</p>
<p>Here is the annotated presentation for Rajiv Shah’s workshop on “Hill Climbing: Best Practices for Evaluating LLMs.”</p>
<section id="title-slide" class="level3">
<h3 class="anchored" data-anchor-id="title-slide">1. Title Slide</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_1.png" class="img-fluid figure-img"></p>
<figcaption>Slide 1</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=0s">Timestamp: 00:00</a>)</p>
<p>This slide introduces the workshop titled <strong>“Hill Climbing: Best Practices for Evaluating LLMs,”</strong> presented by Rajiv Shah, PhD, at the Open Data Science Conference (ODSC). The presentation focuses on the technical nuances of Generative AI and how to build effective evaluation workflows.</p>
<p>Rajiv sets the stage by outlining his three main goals for the session: understanding the technical differences in GenAI evaluation, learning a basic introductory workflow for building evaluation datasets, and inspiring practitioners to start “learning by doing” rather than just reading papers.</p>
<p>The concept of “Hill Climbing” refers to the iterative process of improving LLM applications—starting with a baseline and continuously optimizing performance through rigorous testing and error analysis.</p>
</section>
<section id="evaluating-for-gen-ai-resources" class="level3">
<h3 class="anchored" data-anchor-id="evaluating-for-gen-ai-resources">2. Evaluating for Gen AI Resources</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_2.png" class="img-fluid figure-img"></p>
<figcaption>Slide 2</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=6s">Timestamp: 00:06</a>)</p>
<p>This slide provides a QR code and a GitHub URL, directing the audience to the code and resources associated with the talk. It emphasizes that the workshop is practical, with code examples available for attendees to replicate the evaluation techniques discussed.</p>
<p>Rajiv encourages the audience to access these resources to follow along with the technical implementations of the concepts, such as building LLM judges and creating unit tests, which will be covered later in the presentation.</p>
</section>
<section id="customer-support-use-case" class="level3">
<h3 class="anchored" data-anchor-id="customer-support-use-case">3. Customer Support Use Case</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_3.png" class="img-fluid figure-img"></p>
<figcaption>Slide 3</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=48s">Timestamp: 00:48</a>)</p>
<p>To motivate the need for evaluation, the presentation introduces a common real-world use case: <strong>Customer Support</strong>. Generative AI is frequently deployed to help agents compose emails or chat responses based on user inquiries.</p>
<p>This scenario serves as the baseline example throughout the talk. It represents a high-volume task where automation is desirable, but accuracy and tone are critical for maintaining customer satisfaction and brand reputation.</p>
</section>
<section id="vibe-coding" class="level3">
<h3 class="anchored" data-anchor-id="vibe-coding">4. Vibe Coding</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_4.png" class="img-fluid figure-img"></p>
<figcaption>Slide 4</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=59s">Timestamp: 00:59</a>)</p>
<p>This slide introduces the concept of <strong>“Vibe Coding”</strong>—the initial phase where developers grab a simple prompt, feed it to a model, and get a result that feels right. It highlights the misconception that GenAI is easy because it works “out of the box” for simple demos.</p>
<p>Rajiv notes that while “vibe coding” might work for a quick demo app, it is insufficient for production systems. Relying on a “vibe” that the model is working prevents teams from catching subtle failures that occur at scale.</p>
</section>
<section id="good-response-delayed-order" class="level3">
<h3 class="anchored" data-anchor-id="good-response-delayed-order">5. Good Response: Delayed Order</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_5.png" class="img-fluid figure-img"></p>
<figcaption>Slide 5</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=70s">Timestamp: 01:10</a>)</p>
<p>Here, we see a successful output generated by the LLM. The customer inquired about a delayed order, and the AI generated a polite, relevant response acknowledging the delay and apologizing.</p>
<p>This example reinforces the “Vibe Coding” trap: because the model often produces high-quality, human-sounding text like this, developers can be lulled into a false sense of security regarding the system’s reliability.</p>
</section>
<section id="good-response-damaged-product" class="level3">
<h3 class="anchored" data-anchor-id="good-response-damaged-product">6. Good Response: Damaged Product</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_6.png" class="img-fluid figure-img"></p>
<figcaption>Slide 6</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=72s">Timestamp: 01:12</a>)</p>
<p>This slide provides another example of a “good” response. The AI correctly identifies that the customer received a damaged product and initiates a replacement protocol.</p>
<p>These positive examples establish a baseline of expected behavior. The challenge in evaluation is not just confirming that the model <em>can</em> work, but ensuring it works consistently across all edge cases.</p>
</section>
<section id="bad-response-irrelevance" class="level3">
<h3 class="anchored" data-anchor-id="bad-response-irrelevance">7. Bad Response: Irrelevance</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_7.png" class="img-fluid figure-img"></p>
<figcaption>Slide 7</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=86s">Timestamp: 01:26</a>)</p>
<p>The presentation shifts to failure modes. In this example, the user asks about an <strong>“Order Delay,”</strong> but the AI responds with information about a <strong>“New Product Launch.”</strong></p>
<p>This illustrates a complete context mismatch. The model failed to attend to the user’s intent, generating a coherent but completely irrelevant response. This type of failure frustrates users and degrades trust in the automated system.</p>
</section>
<section id="bad-response-hallucination" class="level3">
<h3 class="anchored" data-anchor-id="bad-response-hallucination">8. Bad Response: Hallucination</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_8.png" class="img-fluid figure-img"></p>
<figcaption>Slide 8</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=96s">Timestamp: 01:36</a>)</p>
<p>This slide shows a more dangerous failure: <strong>Hallucination</strong>. The AI apologizes for a defective “espresso machine,” but as the speaker notes, “We don’t actually sell espresso machines.”</p>
<p>This highlights the risk of the model fabricating facts to be helpful. Such errors can lead to logistical nightmares, such as customers expecting replacements for products that do not exist or that the company never sold.</p>
</section>
<section id="risks-of-llm-mistakes" class="level3">
<h3 class="anchored" data-anchor-id="risks-of-llm-mistakes">9. Risks of LLM Mistakes</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_9.png" class="img-fluid figure-img"></p>
<figcaption>Slide 9</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=111s">Timestamp: 01:51</a>)</p>
<p>Rajiv categorizes the risks associated with LLM failures into three buckets: <strong>Reputational, Legal, and Financial</strong>. He cites the example of <strong>Cursor</strong>, an IDE company, where a support bot hallucinated a policy restricting users to one device, causing customers to cancel subscriptions.</p>
<p>The slide emphasizes that courts may view AI agents as employees; if a bot makes a promise (like a refund or policy change), the company might be legally bound to honor it. This escalates evaluation from a technical nice-to-have to a business necessity.</p>
</section>
<section id="the-despair-of-gen-ai" class="level3">
<h3 class="anchored" data-anchor-id="the-despair-of-gen-ai">10. The Despair of Gen AI</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_10.png" class="img-fluid figure-img"></p>
<figcaption>Slide 10</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=158s">Timestamp: 02:38</a>)</p>
<p>This visual represents the frustration developers feel when moving from a successful demo to a failing production system. The “despair” comes from the realization that the stochastic nature of LLMs makes them difficult to control.</p>
<p>It serves as an emotional anchor for the audience, acknowledging that while GenAI is exciting, the unpredictability of its failures causes significant stress for engineering teams responsible for deployment.</p>
</section>
<section id="high-failure-rates" class="level3">
<h3 class="anchored" data-anchor-id="high-failure-rates">11. High Failure Rates</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_11.png" class="img-fluid figure-img"></p>
<figcaption>Slide 11</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=168s">Timestamp: 02:48</a>)</p>
<p>The slide cites an MIT report stating that <strong>“95% of GenAI pilots are failing.”</strong> While Rajiv notes this number might be overstated, it reflects a trend where executives are demanding ROI and seeing lackluster results.</p>
<p>This shift in 2025 means that evaluation is no longer just for debugging; it is required to prove business value and justify the high costs of running Generative AI infrastructure.</p>
</section>
<section id="evaluation-improves-applications" class="level3">
<h3 class="anchored" data-anchor-id="evaluation-improves-applications">12. Evaluation Improves Applications</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_12.png" class="img-fluid figure-img"></p>
<figcaption>Slide 12</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=194s">Timestamp: 03:14</a>)</p>
<p>This slide asserts the core thesis: <strong>Evaluation helps you build better GenAI applications.</strong> It references a previous viral video by the speaker on the same topic, positioning this talk as an updated, condensed version with fresh content.</p>
<p>Rajiv explains that you cannot improve what you cannot measure. Without a robust evaluation framework, developers are essentially guessing whether changes to prompts or models are actually improving performance.</p>
</section>
<section id="why-evaluation-is-necessary" class="level3">
<h3 class="anchored" data-anchor-id="why-evaluation-is-necessary">13. Why Evaluation is Necessary</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_13.png" class="img-fluid figure-img"></p>
<figcaption>Slide 13</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=220s">Timestamp: 03:40</a>)</p>
<p>This concentric diagram illustrates the stakeholders involved in evaluation. It starts with <strong>“Things Go Wrong”</strong> (technical reality), moves to <strong>“Buy-in”</strong> (convincing managers/teams), and ends with <strong>“Regulators”</strong> (external compliance).</p>
<p>Evaluation serves multiple audiences: it helps the developer debug, it provides the metrics needed to convince management that the app is production-ready, and it creates the audit trails required by third-party auditors or regulators.</p>
</section>
<section id="evaluation-dimensions" class="level3">
<h3 class="anchored" data-anchor-id="evaluation-dimensions">14. Evaluation Dimensions</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_14.png" class="img-fluid figure-img"></p>
<figcaption>Slide 14</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=258s">Timestamp: 04:18</a>)</p>
<p>Evaluation must cover three dimensions: <strong>Technical</strong> (F1 scores, accuracy), <strong>Business</strong> (ROI, value generated), and <strong>Operational</strong> (Total Cost of Ownership, latency).</p>
<p>Rajiv highlights that data scientists often focus solely on the technical, but ignoring operational costs (like the expense of hosting GPUs vs.&nbsp;using APIs) can kill a project. A comprehensive evaluation strategy considers the cost-to-quality ratio.</p>
</section>
<section id="public-benchmarks" class="level3">
<h3 class="anchored" data-anchor-id="public-benchmarks">15. Public Benchmarks</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_15.png" class="img-fluid figure-img"></p>
<figcaption>Slide 15</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=306s">Timestamp: 05:06</a>)</p>
<p>The slide discusses <strong>Public Benchmarks</strong> (like MMLU, GSM8K). While useful for a general idea of a model’s capabilities (e.g., “Is Llama 3 better than Llama 2?”), they are insufficient for specific applications.</p>
<p>Rajiv warns against using these benchmarks to determine if a model fits <em>your</em> specific use case. Companies promote these numbers for marketing, but they rarely reflect performance on proprietary business data.</p>
</section>
<section id="custom-benchmarks" class="level3">
<h3 class="anchored" data-anchor-id="custom-benchmarks">16. Custom Benchmarks</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_16.png" class="img-fluid figure-img"></p>
<figcaption>Slide 16</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=322s">Timestamp: 05:22</a>)</p>
<p>The solution to the limitations of public benchmarks is <strong>Custom Benchmarks</strong>. This slide defines a benchmark as a combination of a <strong>Task</strong>, a <strong>Dataset</strong>, and an <strong>Evaluation Metric</strong>.</p>
<p>This is a critical definition for the workshop. To “tame” GenAI, you must build a dataset that reflects your specific customer queries and define success metrics that matter to your business logic, rather than relying on generic academic tests.</p>
</section>
<section id="taming-gen-ai" class="level3">
<h3 class="anchored" data-anchor-id="taming-gen-ai">17. Taming Gen AI</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_17.png" class="img-fluid figure-img"></p>
<figcaption>Slide 17</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=328s">Timestamp: 05:28</a>)</p>
<p>This title slide signals a transition into the technical “how-to” section of the talk. “Taming” implies that the default state of GenAI is wild and unpredictable.</p>
<p>The goal of the following sections is to bring structure and control to this chaos through rigorous engineering practices and evaluation workflows.</p>
</section>
<section id="workshop-roadmap" class="level3">
<h3 class="anchored" data-anchor-id="workshop-roadmap">18. Workshop Roadmap</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_18.png" class="img-fluid figure-img"></p>
<figcaption>Slide 18</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=331s">Timestamp: 05:31</a>)</p>
<p>The roadmap outlines the four main sections of the talk: 1. <strong>Basics of Gen AI:</strong> Understanding variability and technical nuances. 2. <strong>Evaluation Workflow:</strong> Building the dataset and running the first tests. 3. <strong>More Complexity:</strong> Adding unit tests and conducting error analysis. 4. <strong>Agents:</strong> Evaluating complex, multi-step workflows.</p>
</section>
<section id="variability-in-responses" class="level3">
<h3 class="anchored" data-anchor-id="variability-in-responses">19. Variability in Responses</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_19.png" class="img-fluid figure-img"></p>
<figcaption>Slide 19</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=360s">Timestamp: 06:00</a>)</p>
<p>This slide visually demonstrates the <strong>Non-Determinism</strong> of LLMs. It shows two responses to the same prompt generated just minutes apart. While substantively similar, the wording and structure differ slightly.</p>
<p>This variability makes exact string matching (a common software testing technique) impossible for LLMs. It necessitates semantic evaluation techniques, which complicates the testing pipeline.</p>
</section>
<section id="input-model-output-diagram" class="level3">
<h3 class="anchored" data-anchor-id="input-model-output-diagram">20. Input-Model-Output Diagram</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_20.png" class="img-fluid figure-img"></p>
<figcaption>Slide 20</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=384s">Timestamp: 06:24</a>)</p>
<p>A simple diagram illustrates the flow: <strong>Prompt -&gt; Model -&gt; Output</strong>. Rajiv uses this to structure the analysis of where variability comes from.</p>
<p>He explains that “chaos” can enter the system at any of these three stages: the input (prompt sensitivity), the model (inference non-determinism), or the output (formatting and evaluation).</p>
</section>
<section id="inconsistent-benchmark-scores" class="level3">
<h3 class="anchored" data-anchor-id="inconsistent-benchmark-scores">21. Inconsistent Benchmark Scores</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_21.png" class="img-fluid figure-img"></p>
<figcaption>Slide 21</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=404s">Timestamp: 06:44</a>)</p>
<p>The slide presents a discrepancy between benchmark scores tweeted by Hugging Face and those in the official Llama paper. Both used the same dataset (MMLU), but reported different accuracy numbers.</p>
<p>This introduces the problem of <strong>Evaluation Harness Sensitivity</strong>. Even with standard benchmarks, <em>how</em> you ask the model to take the test changes the score, proving that evaluation is fragile and implementation-dependent.</p>
</section>
<section id="mmlu-overview" class="level3">
<h3 class="anchored" data-anchor-id="mmlu-overview">22. MMLU Overview</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_22.png" class="img-fluid figure-img"></p>
<figcaption>Slide 22</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=445s">Timestamp: 07:25</a>)</p>
<p><strong>MMLU (Massive Multitask Language Understanding)</strong> is explained here. It is a multiple-choice test covering 57 tasks across STEM, the humanities, and more.</p>
<p>It is currently the standard for measuring general “intelligence” in models. However, because it is a multiple-choice format, it is susceptible to prompt formatting nuances, as the next slides demonstrate.</p>
</section>
<section id="prompt-sensitivity" class="level3">
<h3 class="anchored" data-anchor-id="prompt-sensitivity">23. Prompt Sensitivity</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_23.png" class="img-fluid figure-img"></p>
<figcaption>Slide 23</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=464s">Timestamp: 07:44</a>)</p>
<p>This slide reveals <em>why</em> the scores in Slide 21 differed. The three evaluation harnesses used slightly different prompt structures (e.g., using the word “Question” vs.&nbsp;just listing the text).</p>
<p>These minor changes resulted in significant accuracy shifts. This proves that LLMs are highly sensitive to syntax, meaning a “better” model might just be one that was prompted more effectively for the test, not one that is actually smarter.</p>
</section>
<section id="formatting-changes" class="level3">
<h3 class="anchored" data-anchor-id="formatting-changes">24. Formatting Changes</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_24.png" class="img-fluid figure-img"></p>
<figcaption>Slide 24</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=502s">Timestamp: 08:22</a>)</p>
<p>Expanding on sensitivity, this slide references Anthropic’s research showing that changing answer choices from <code>(A)</code> to <code>[A]</code> or <code>(1)</code> affects the output.</p>
<p>This level of fragility is a key takeaway: seemingly cosmetic changes in how inputs are formatted can alter the model’s reasoning capabilities or its ability to output the correct token.</p>
</section>
<section id="gpt-4o-performance-drop" class="level3">
<h3 class="anchored" data-anchor-id="gpt-4o-performance-drop">25. GPT-4o Performance Drop</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_25.png" class="img-fluid figure-img"></p>
<figcaption>Slide 25</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=518s">Timestamp: 08:38</a>)</p>
<p>A bar chart demonstrates that this issue persists even in state-of-the-art models like <strong>GPT-4o</strong>. Subtle changes in wording can lead to a 5-10% drop in performance.</p>
<p>This counters the assumption that newer, larger models have “solved” prompt sensitivity. It remains a persistent variable that evaluators must control for.</p>
</section>
<section id="tone-sensitivity" class="level3">
<h3 class="anchored" data-anchor-id="tone-sensitivity">26. Tone Sensitivity</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_26.png" class="img-fluid figure-img"></p>
<figcaption>Slide 26</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=526s">Timestamp: 08:46</a>)</p>
<p>This slide shows that the <strong>tone</strong> of a prompt (e.g., being polite vs.&nbsp;direct) affects accuracy. Rajiv jokes, “I guess this is why mom always said to be polite.”</p>
<p>The graph indicates that prompt engineering strategies, like adding emotional weight or politeness, can statistically alter model performance, adding another layer of complexity to evaluation.</p>
</section>
<section id="persistent-sensitivity" class="level3">
<h3 class="anchored" data-anchor-id="persistent-sensitivity">27. Persistent Sensitivity</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_27.png" class="img-fluid figure-img"></p>
<figcaption>Slide 27</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=540s">Timestamp: 09:00</a>)</p>
<p>The slide reiterates that despite years of progress, models are still sensitive to specific phrases. It shows a “Prompt Engineering” guide suggesting specific words to use.</p>
<p>The takeaway is that developers cannot treat the prompt as a static instruction; it is a hyperparameter that requires optimization and constant testing.</p>
</section>
<section id="falcon-llm-bias" class="level3">
<h3 class="anchored" data-anchor-id="falcon-llm-bias">28. Falcon LLM Bias</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_28.png" class="img-fluid figure-img"></p>
<figcaption>Slide 28</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=558s">Timestamp: 09:18</a>)</p>
<p>This slide introduces a case study with the <strong>Falcon LLM</strong>. A user tweet shows the model recommending <strong>Abu Dhabi</strong> as a technological city with glowing sentiment, which raised suspicions about bias given the model’s origin in the Middle East.</p>
<p>This serves as a detective story: users wondered if the model weights were altered or if specific training data was injected to force this positive association.</p>
</section>
<section id="potential-cover-up" class="level3">
<h3 class="anchored" data-anchor-id="potential-cover-up">29. Potential Cover-up?</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_29.png" class="img-fluid figure-img"></p>
<figcaption>Slide 29</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=590s">Timestamp: 09:50</a>)</p>
<p>Another tweet speculates if the model is “covering up human rights abuses” because it provides different answers for Abu Dhabi compared to other cities.</p>
<p>This highlights how model behavior can be misinterpreted as malicious bias or censorship, when the root cause might be something much simpler in the input stack.</p>
</section>
<section id="inspecting-the-system-prompt" class="level3">
<h3 class="anchored" data-anchor-id="inspecting-the-system-prompt">30. Inspecting the System Prompt</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_30.png" class="img-fluid figure-img"></p>
<figcaption>Slide 30</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=600s">Timestamp: 10:00</a>)</p>
<p>The reveal: The bias wasn’t in the weights, but in the <strong>System Prompt</strong>. The slide suggests looking at the hidden instructions given to the model.</p>
<p>In Falcon’s case, the system prompt explicitly told the model, “You are a model built in Abu Dhabi.” This context influenced its generation probabilities, causing it to favor Abu Dhabi in its responses.</p>
</section>
<section id="claude-system-prompt" class="level3">
<h3 class="anchored" data-anchor-id="claude-system-prompt">31. Claude System Prompt</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_31.png" class="img-fluid figure-img"></p>
<figcaption>Slide 31</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=633s">Timestamp: 10:33</a>)</p>
<p>Rajiv points out that most developers never read the system prompts of the models they use. He highlights the <strong>Claude System Prompt</strong>, which is 1700 words long and takes nearly 10 minutes to read.</p>
<p>These extensive instructions define the model’s personality and safety guardrails. Ignoring them means you don’t fully understand the inputs driving your application’s behavior.</p>
</section>
<section id="complexity-of-a-single-response" class="level3">
<h3 class="anchored" data-anchor-id="complexity-of-a-single-response">32. Complexity of a Single Response</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_32.png" class="img-fluid figure-img"></p>
<figcaption>Slide 32</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=660s">Timestamp: 11:00</a>)</p>
<p>The diagram is updated to show that a “single response” is actually the result of complex interactions: <strong>Tokenization -&gt; Prompt Styles -&gt; Prompt Engineering -&gt; System Prompt</strong>.</p>
<p>This visual summarizes the “Input” section of the talk, reinforcing that before the model even processes data, multiple layers of text transformation occur that can alter the result.</p>
</section>
<section id="inter-text-similarity" class="level3">
<h3 class="anchored" data-anchor-id="inter-text-similarity">33. Inter-text Similarity</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_33.png" class="img-fluid figure-img"></p>
<figcaption>Slide 33</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=675s">Timestamp: 11:15</a>)</p>
<p>This heatmap compares <strong>Inter-text similarity</strong> between models. It highlights Llama 70B and Llama 8B. Even though they are from the same family and likely trained on similar data, they are not identical.</p>
<p>This means you cannot swap a smaller model for a larger one (or vice versa) and expect the exact same behavior. Any model change requires a full re-evaluation.</p>
</section>
<section id="sycophantic-models" class="level3">
<h3 class="anchored" data-anchor-id="sycophantic-models">34. Sycophantic Models</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_34.png" class="img-fluid figure-img"></p>
<figcaption>Slide 34</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=736s">Timestamp: 12:16</a>)</p>
<p>The slide discusses <strong>Sycophancy</strong>—the tendency of models to agree with the user even when the user is wrong. It mentions how early versions of GPT-4 were sometimes “overly nice.”</p>
<p>This behavior is a specific type of model bias that evaluators must watch for. If a user asks a leading question containing false premises, a sycophantic model might validate the falsehood rather than correct it.</p>
</section>
<section id="model-drift" class="level3">
<h3 class="anchored" data-anchor-id="model-drift">35. Model Drift</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_35.png" class="img-fluid figure-img"></p>
<figcaption>Slide 35</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=757s">Timestamp: 12:37</a>)</p>
<p><strong>“Model Drift”</strong> refers to the phenomenon where commercial APIs (like OpenAI or Anthropic) change their model behavior over time without warning.</p>
<p>Because developers do not control the weights of API-based models, the “ground underneath them” can shift. A prompt that worked yesterday might fail today because the provider updated the backend or the inference infrastructure.</p>
</section>
<section id="degraded-responses-timeline" class="level3">
<h3 class="anchored" data-anchor-id="degraded-responses-timeline">36. Degraded Responses Timeline</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_36.png" class="img-fluid figure-img"></p>
<figcaption>Slide 36</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=775s">Timestamp: 12:55</a>)</p>
<p>This slide shows a timeline of <strong>Degraded Responses</strong> from an Anthropic incident. Technical issues like context window routing errors led to corrupted outputs for a period of days.</p>
<p>This illustrates that drift isn’t always about model updates; it can be infrastructure failures. Continuous monitoring is required to detect when an external dependency degrades your application’s performance.</p>
</section>
<section id="hyperparameters" class="level3">
<h3 class="anchored" data-anchor-id="hyperparameters">37. Hyperparameters</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_37.png" class="img-fluid figure-img"></p>
<figcaption>Slide 37</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=813s">Timestamp: 13:33</a>)</p>
<p>The slide lists <strong>Hyperparameters</strong> like Temperature, Top-P, and Max Length. Rajiv explains that users can control these “knobs” to influence creativity versus determinism.</p>
<p>Setting temperature to 0 makes the model less random, but as the next slides show, it does not guarantee perfect determinism due to hardware nuances.</p>
</section>
<section id="non-deterministic-inference" class="level3">
<h3 class="anchored" data-anchor-id="non-deterministic-inference">38. Non-Deterministic Inference</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_38.png" class="img-fluid figure-img"></p>
<figcaption>Slide 38</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=843s">Timestamp: 14:03</a>)</p>
<p>This slide tackles <strong>Non-Deterministic Inference</strong>. Unlike traditional ML models (e.g., XGBoost) where a fixed seed guarantees identical output, LLMs on GPUs often produce different results for identical inputs.</p>
<p>Causes include floating-point accumulation errors and the behavior of Mixture of Experts (MoE) models where different batches might activate different experts.</p>
</section>
<section id="addressing-non-determinism" class="level3">
<h3 class="anchored" data-anchor-id="addressing-non-determinism">39. Addressing Non-Determinism</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_39.png" class="img-fluid figure-img"></p>
<figcaption>Slide 39</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=911s">Timestamp: 15:11</a>)</p>
<p>Rajiv references recent work by <strong>Thinking Machines</strong> and updates to <strong>vLLM</strong> that attempt to solve the non-determinism problem through correct batching.</p>
<p>While solutions are emerging, the takeaway is that most current setups are non-deterministic by default. Evaluators must design their tests to tolerate this variance rather than expecting bit-wise reproducibility.</p>
</section>
<section id="updated-model-diagram" class="level3">
<h3 class="anchored" data-anchor-id="updated-model-diagram">40. Updated Model Diagram</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_40.png" class="img-fluid figure-img"></p>
<figcaption>Slide 40</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=943s">Timestamp: 15:43</a>)</p>
<p>The diagram expands again. The “Model” box now includes <strong>Model Selection, Hyperparameters, Non-deterministic Inference, and Forced Updates</strong>.</p>
<p>This visual summarizes the “Model” section, showing that the “black box” is actually a dynamic system with internal variables (weights/architecture) and external variables (infrastructure/updates) that all add noise to the output.</p>
</section>
<section id="output-format-issues" class="level3">
<h3 class="anchored" data-anchor-id="output-format-issues">41. Output Format Issues</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_41.png" class="img-fluid figure-img"></p>
<figcaption>Slide 41</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=961s">Timestamp: 16:01</a>)</p>
<p>Moving to the “Output” stage, this slide uses MMLU again to show how <strong>Output Formatting</strong> affects evaluation. How do you ask the model to answer a multiple-choice question?</p>
<p>Do you ask it to output just the letter “A”? Or the full text? Or the probability of the token “A”? Different evaluation harnesses use different methods, leading to the score discrepancies seen earlier.</p>
</section>
<section id="evaluation-harness-variations" class="level3">
<h3 class="anchored" data-anchor-id="evaluation-harness-variations">42. Evaluation Harness Variations</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_42.png" class="img-fluid figure-img"></p>
<figcaption>Slide 42</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=995s">Timestamp: 16:35</a>)</p>
<p>This table details the specific differences in implementation between harnesses (e.g., original MMLU vs.&nbsp;HELM vs.&nbsp;EleutherAI).</p>
<p>It reinforces that there is no standard “ruler” for measuring LLMs. The tool you use to measure the model introduces its own bias and variance into the final score.</p>
</section>
<section id="score-comparison-table" class="level3">
<h3 class="anchored" data-anchor-id="score-comparison-table">43. Score Comparison Table</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_43.png" class="img-fluid figure-img"></p>
<figcaption>Slide 43</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1016s">Timestamp: 16:56</a>)</p>
<p>A spreadsheet shows the same models scoring differently across different evaluation implementations. The variance is not trivial; it can be large enough to change the ranking of which model is “best.”</p>
<p>This data drives home the point: You must control your own evaluation pipeline. Relying on reported numbers is risky because you don’t know the implementation details behind them.</p>
</section>
<section id="sentiment-analysis-variance" class="level3">
<h3 class="anchored" data-anchor-id="sentiment-analysis-variance">44. Sentiment Analysis Variance</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_44.png" class="img-fluid figure-img"></p>
<figcaption>Slide 44</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1029s">Timestamp: 17:09</a>)</p>
<p>This slide shows varying <strong>Sentiment Analysis</strong> outputs. Different models (or the same model with different prompts) might classify a review as “Positive” while another says “Neutral.”</p>
<p>This introduces the concept that even “simple” classification tasks in GenAI are subject to interpretation and variance, unlike traditional classifiers that have a fixed decision boundary.</p>
</section>
<section id="tool-use-variance" class="level3">
<h3 class="anchored" data-anchor-id="tool-use-variance">45. Tool Use Variance</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_45.png" class="img-fluid figure-img"></p>
<figcaption>Slide 45</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1043s">Timestamp: 17:23</a>)</p>
<p>Radar charts illustrate variance in <strong>Tool Use</strong>. Models might be good at using an “Email” tool but fail at “Calendar” or “Terminal” tools.</p>
<p>Furthermore, models exhibit non-determinism in <em>decision making</em>—sometimes they choose to use a tool, and sometimes they try to answer from memory. This adds a layer of logic errors on top of text generation errors.</p>
</section>
<section id="summary-why-responses-differ" class="level3">
<h3 class="anchored" data-anchor-id="summary-why-responses-differ">46. Summary: Why Responses Differ</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_46.png" class="img-fluid figure-img"></p>
<figcaption>Slide 46</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1069s">Timestamp: 17:49</a>)</p>
<p>This comprehensive slide aggregates all the factors discussed: <strong>Inputs</strong> (prompts, system prompts), <strong>Model</strong> (drift, hyperparams), <strong>Outputs</strong> (formatting), and <strong>Infrastructure</strong>.</p>
<p>It serves as a checklist for the audience. If your application is behaving inconsistently, investigate these specific layers of the stack to find the source of the noise.</p>
</section>
<section id="chaos-is-okay" class="level3">
<h3 class="anchored" data-anchor-id="chaos-is-okay">47. Chaos is Okay</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_47.png" class="img-fluid figure-img"></p>
<figcaption>Slide 47</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1097s">Timestamp: 18:17</a>)</p>
<p>Rajiv reassures the audience that <strong>“Chaos is Okay.”</strong> The slide presents a chart of evaluation methods ranging from flexible/expensive (human eval) to rigid/cheap (code assertions).</p>
<p>The message is that while the technology is chaotic, there is a spectrum of tools available to manage it. We don’t need to solve every source of variance; we just need a robust process to measure it.</p>
</section>
<section id="from-chaos-to-control" class="level3">
<h3 class="anchored" data-anchor-id="from-chaos-to-control">48. From Chaos to Control</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_48.png" class="img-fluid figure-img"></p>
<figcaption>Slide 48</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1107s">Timestamp: 18:27</a>)</p>
<p>This transition slide marks the beginning of the <strong>Evaluation Workflow</strong> section. The presentation shifts from describing the problem to prescribing the solution.</p>
<p>The goal here is to move from “Vibe Coding” to a structured engineering discipline where changes are measured against a stable baseline.</p>
</section>
<section id="build-the-evaluation-dataset" class="level3">
<h3 class="anchored" data-anchor-id="build-the-evaluation-dataset">49. Build the Evaluation Dataset</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_49.png" class="img-fluid figure-img"></p>
<figcaption>Slide 49</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1117s">Timestamp: 18:37</a>)</p>
<p>The first step in the workflow is to <strong>Build the Evaluation Dataset</strong>. The slide lists examples of prompts for tasks like summarization, extraction, and translation.</p>
<p>Rajiv emphasizes that this dataset should reflect <em>your</em> actual use case. It is the foundation of the “Custom Benchmark” concept introduced earlier.</p>
</section>
<section id="get-labeled-outputs-gold" class="level3">
<h3 class="anchored" data-anchor-id="get-labeled-outputs-gold">50. Get Labeled Outputs (Gold)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_50.png" class="img-fluid figure-img"></p>
<figcaption>Slide 50</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1126s">Timestamp: 18:46</a>)</p>
<p>Step two is to get <strong>Labeled Outputs</strong>, also known as <strong>Gold Outputs</strong>, Reference, or Ground Truth. The slide adds a column showing the ideal answer for each prompt.</p>
<p>This is the standard against which the model will be judged. While obtaining these labels can be expensive (requiring human effort), they are essential for calculating accuracy.</p>
</section>
<section id="compare-to-model-output" class="level3">
<h3 class="anchored" data-anchor-id="compare-to-model-output">51. Compare to Model Output</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_51.png" class="img-fluid figure-img"></p>
<figcaption>Slide 51</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1140s">Timestamp: 19:00</a>)</p>
<p>Step three is to generate responses from your system and place them alongside the Gold Outputs. The slide adds a <strong>“Model Output”</strong> column.</p>
<p>This visual comparison allows developers (and automated judges) to see the delta between what was expected and what was produced.</p>
</section>
<section id="measure-equivalence" class="level3">
<h3 class="anchored" data-anchor-id="measure-equivalence">52. Measure Equivalence</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_52.png" class="img-fluid figure-img"></p>
<figcaption>Slide 52</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1150s">Timestamp: 19:10</a>)</p>
<p>Step four is to <strong>Measure Equivalence</strong>. Since LLMs rarely produce exact string matches, we use an <strong>LLM Judge</strong> (another model) to determine if the Model Output means the same thing as the Gold Output.</p>
<p>The slide shows a prompt for the judge: “Are these two responses semantically equivalent?” This converts a fuzzy text comparison problem into a binary (Pass/Fail) metric.</p>
</section>
<section id="optimize-using-equivalence" class="level3">
<h3 class="anchored" data-anchor-id="optimize-using-equivalence">53. Optimize Using Equivalence</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_53.png" class="img-fluid figure-img"></p>
<figcaption>Slide 53</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1197s">Timestamp: 19:57</a>)</p>
<p>Once you have an equivalence metric, you can <strong>Optimize</strong>. The slide shows Config A vs.&nbsp;Config B. By changing prompts or models, you can track if your “Equivalence Score” goes up or down.</p>
<p>This treats GenAI engineering like traditional hyperparameter tuning. The goal is to maximize the equivalence score on your custom dataset.</p>
</section>
<section id="why-global-metrics-arent-enough" class="level3">
<h3 class="anchored" data-anchor-id="why-global-metrics-arent-enough">54. Why Global Metrics Aren’t Enough</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_54.png" class="img-fluid figure-img"></p>
<figcaption>Slide 54</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1228s">Timestamp: 20:28</a>)</p>
<p>The slide discusses the limitations of the “Equivalence” approach. While good for a general sense of quality, <strong>Global Metrics</strong> miss nuances.</p>
<p>Sometimes it’s hard to get a Gold Answer for open-ended creative tasks. Furthermore, a simple “Pass/Fail” doesn’t tell you <em>why</em> the model failed (e.g., was it tone, length, or factuality?).</p>
</section>
<section id="from-global-to-targeted-evaluation" class="level3">
<h3 class="anchored" data-anchor-id="from-global-to-targeted-evaluation">55. From Global to Targeted Evaluation</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_55.png" class="img-fluid figure-img"></p>
<figcaption>Slide 55</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1255s">Timestamp: 20:55</a>)</p>
<p>This slide argues for <strong>Targeted Evaluation</strong>. To maximize performance, you need to dig deeper into the data and identify specific error modes.</p>
<p>This transitions the talk from “Basic Workflow” to “Advanced Testing,” where we break down “Quality” into specific, testable components like tone, length, and safety.</p>
</section>
<section id="building-tests" class="level3">
<h3 class="anchored" data-anchor-id="building-tests">56. Building Tests</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_56.png" class="img-fluid figure-img"></p>
<figcaption>Slide 56</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1274s">Timestamp: 21:14</a>)</p>
<p>The section title <strong>“Building Tests”</strong> appears. This is where the presentation moves into the “Unit Testing” philosophy for GenAI.</p>
<p>Just as software engineering relies on unit tests to verify specific functions, GenAI engineering should use targeted tests to verify specific attributes of the generated text.</p>
</section>
<section id="good-vs.-bad-examples" class="level3">
<h3 class="anchored" data-anchor-id="good-vs.-bad-examples">57. Good vs.&nbsp;Bad Examples</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_57.png" class="img-fluid figure-img"></p>
<figcaption>Slide 57</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1280s">Timestamp: 21:20</a>)</p>
<p>The slide displays a <strong>Good Example</strong> and a <strong>Bad Example</strong> of a response. The bad example is visibly shorter and less polite.</p>
<p>Rajiv asks the audience to identify <em>why</em> it is bad. This exercise is crucial: you cannot build a test until you can articulate exactly what makes a response a failure.</p>
</section>
<section id="develop-an-evaluation-mindset" class="level3">
<h3 class="anchored" data-anchor-id="develop-an-evaluation-mindset">58. Develop an Evaluation Mindset</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_58.png" class="img-fluid figure-img"></p>
<figcaption>Slide 58</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1306s">Timestamp: 21:46</a>)</p>
<p>To define “Bad,” developers need an <strong>Evaluation Mindset</strong>. This involves observing real-world user interactions and problems.</p>
<p>Data scientists often want to stay in their “chair” and optimize algorithms, but Rajiv argues that effective evaluation requires understanding the user’s pain points.</p>
</section>
<section id="collaborate-with-experts" class="level3">
<h3 class="anchored" data-anchor-id="collaborate-with-experts">59. Collaborate with Experts</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_59.png" class="img-fluid figure-img"></p>
<figcaption>Slide 59</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1318s">Timestamp: 21:58</a>)</p>
<p>The slide stresses <strong>Collaboration</strong>. You must talk to domain experts (e.g., the customer support team) to define what a “good” answer looks like.</p>
<p>Naive bootstrapping—pretending to be a user—is a good start, but long-term success requires input from the people who actually know the business domain.</p>
</section>
<section id="identify-and-categorize-failures" class="level3">
<h3 class="anchored" data-anchor-id="identify-and-categorize-failures">60. Identify and Categorize Failures</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_60.png" class="img-fluid figure-img"></p>
<figcaption>Slide 60</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1372s">Timestamp: 22:52</a>)</p>
<p>Once you understand the domain, you can <strong>Categorize Failure Types</strong>. The slide shows a chart grouping errors into categories like “Harmful Content,” “Bias,” or “Incorrect Info.”</p>
<p>This clustering allows you to see patterns. Instead of just knowing “the model failed 20% of the time,” you know “the model has a specific problem with tone.”</p>
</section>
<section id="define-what-good-looks-like" class="level3">
<h3 class="anchored" data-anchor-id="define-what-good-looks-like">61. Define What Good Looks Like</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_61.png" class="img-fluid figure-img"></p>
<figcaption>Slide 61</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1391s">Timestamp: 23:11</a>)</p>
<p>Using the categorization, you can explicitly <strong>Define What Good Looks Like</strong>. The slide contrasts the good/bad examples again, but now with labels: “Too short,” “Lacks professional tone.”</p>
<p>This transforms a subjective feeling (“this response sucks”) into objective criteria (“response must be &gt;50 words and use polite honorifics”).</p>
</section>
<section id="document-every-issue" class="level3">
<h3 class="anchored" data-anchor-id="document-every-issue">62. Document Every Issue</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_62.png" class="img-fluid figure-img"></p>
<figcaption>Slide 62</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1412s">Timestamp: 23:32</a>)</p>
<p>The slide shows a spreadsheet where humans evaluate responses and <strong>Document Every Issue</strong>. Columns track specific attributes like “Is it helpful?” or “Is the tone right?”</p>
<p>This manual annotation is the training data for your automated tests. You need humans to establish the ground truth before you can automate the checking.</p>
</section>
<section id="evaluation-tooling" class="level3">
<h3 class="anchored" data-anchor-id="evaluation-tooling">63. Evaluation Tooling</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_63.png" class="img-fluid figure-img"></p>
<figcaption>Slide 63</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1433s">Timestamp: 23:53</a>)</p>
<p>Rajiv mentions that <strong>Tooling Can Help</strong>. The slide shows a custom chat viewer designed to make human review easier.</p>
<p>However, he warns against getting sidetracked by building fancy tools. Simple spreadsheets often suffice for the early stages. The goal is the data, not the interface.</p>
</section>
<section id="test-1-length-check" class="level3">
<h3 class="anchored" data-anchor-id="test-1-length-check">64. Test 1: Length Check</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_64.png" class="img-fluid figure-img"></p>
<figcaption>Slide 64</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1445s">Timestamp: 24:05</a>)</p>
<p>Now we build the automated tests. <strong>Test 1 is a Length Check</strong>. The slide shows Python code asserting that the word count is between 8 and 200.</p>
<p>This is a <strong>deterministic test</strong>. You don’t need an LLM to count words. Rajiv encourages using simple Python assertions wherever possible because they are fast, cheap, and reliable.</p>
</section>
<section id="test-2-tone-and-style" class="level3">
<h3 class="anchored" data-anchor-id="test-2-tone-and-style">65. Test 2: Tone and Style</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_65.png" class="img-fluid figure-img"></p>
<figcaption>Slide 65</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1462s">Timestamp: 24:22</a>)</p>
<p><strong>Test 2 checks Tone and Style</strong>. Since “tone” is subjective, we use an <strong>LLM Judge</strong> (OpenAI model) to classify the response.</p>
<p>The prompt asks the judge to identify the style. This allows us to automate the “vibe check” that humans were previously doing manually.</p>
</section>
<section id="adding-metrics-to-documentation" class="level3">
<h3 class="anchored" data-anchor-id="adding-metrics-to-documentation">66. Adding Metrics to Documentation</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_66.png" class="img-fluid figure-img"></p>
<figcaption>Slide 66</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1481s">Timestamp: 24:41</a>)</p>
<p>The spreadsheet is updated with new columns: <code>Length_OK</code> and <code>Tone_OK</code>. These are the results of the automated tests.</p>
<p>Now, for every row in the dataset, we have granular pass/fail metrics. This helps pinpoint exactly <em>why</em> a specific response failed, rather than just a generic failure.</p>
</section>
<section id="check-judges-against-humans" class="level3">
<h3 class="anchored" data-anchor-id="check-judges-against-humans">67. Check Judges Against Humans</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_67.png" class="img-fluid figure-img"></p>
<figcaption>Slide 67</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1512s">Timestamp: 25:12</a>)</p>
<p>A critical step: <strong>Check LLM Judges Against Humans</strong>. You must verify that your automated “Tone Judge” agrees with your human experts.</p>
<p>If the human says the tone is rude, but the LLM Judge says it’s polite, your metric is useless. You must iterate on the judge’s prompt until alignment is high.</p>
</section>
<section id="self-evaluation-bias" class="level3">
<h3 class="anchored" data-anchor-id="self-evaluation-bias">68. Self-Evaluation Bias</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_68.png" class="img-fluid figure-img"></p>
<figcaption>Slide 68</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1566s">Timestamp: 26:06</a>)</p>
<p>The slide illustrates <strong>Self-Evaluation Bias</strong>. LLMs tend to rate their own outputs higher than outputs from other models. GPT-4 prefers GPT-4 text.</p>
<p>To mitigate this, Rajiv suggests mixing models—use Claude to judge GPT-4, or Gemini to judge Claude. This helps ensure a more neutral evaluation.</p>
</section>
<section id="alignment-checks" class="level3">
<h3 class="anchored" data-anchor-id="alignment-checks">69. Alignment Checks</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_69.png" class="img-fluid figure-img"></p>
<figcaption>Slide 69</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1606s">Timestamp: 26:46</a>)</p>
<p>This slide reinforces the need for <strong>Continuous Alignment</strong>. Just because your judge aligned with humans last month doesn’t mean it still does (due to model drift).</p>
<p>Human spot-checks should be a permanent part of the pipeline to ensure the automated judges haven’t drifted.</p>
</section>
<section id="biases-in-llm-judges" class="level3">
<h3 class="anchored" data-anchor-id="biases-in-llm-judges">70. Biases in LLM Judges</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_70.png" class="img-fluid figure-img"></p>
<figcaption>Slide 70</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1622s">Timestamp: 27:02</a>)</p>
<p>The slide lists known <strong>Biases in LLM Judges</strong>, such as <strong>Position Bias</strong> (favoring the first answer presented) or <strong>Verbosity Bias</strong> (favoring longer answers).</p>
<p>Evaluators must be aware of these. For example, you should shuffle the order of answers when asking a judge to compare two options to cancel out position bias.</p>
</section>
<section id="best-practices-for-llm-judges" class="level3">
<h3 class="anchored" data-anchor-id="best-practices-for-llm-judges">71. Best Practices for LLM Judges</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_71.png" class="img-fluid figure-img"></p>
<figcaption>Slide 71</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1631s">Timestamp: 27:11</a>)</p>
<p>A summary of <strong>Best Practices</strong>: Calibrate with human data, use ensembles (multiple judges), avoid asking for “relevance” (too vague), and use discrete rating scales (1-5) rather than continuous numbers.</p>
<p>These tips help stabilize the inherently noisy process of using AI to evaluate AI.</p>
</section>
<section id="error-analysis-chart" class="level3">
<h3 class="anchored" data-anchor-id="error-analysis-chart">72. Error Analysis Chart</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_72.png" class="img-fluid figure-img"></p>
<figcaption>Slide 72</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1666s">Timestamp: 27:46</a>)</p>
<p>With tests in place, we move to <strong>Error Analysis</strong>. The bar chart shows the number of failed cases categorized by error type (Length, Tone, Professional, Context).</p>
<p>This visualization tells you where to focus your efforts. If “Tone” is the biggest bar, you work on the system prompt’s tone instructions. If “Context” is the issue, you might need better Retrieval Augmented Generation (RAG).</p>
</section>
<section id="comparing-prompts" class="level3">
<h3 class="anchored" data-anchor-id="comparing-prompts">73. Comparing Prompts</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_73.png" class="img-fluid figure-img"></p>
<figcaption>Slide 73</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1678s">Timestamp: 27:58</a>)</p>
<p>The chart can compare <strong>Prompt A vs.&nbsp;Prompt B</strong>. This allows for A/B testing of prompt engineering strategies.</p>
<p>You can see if a new prompt improves “Tone” but accidentally degrades “Context.” This tradeoff analysis is impossible with a single global score.</p>
</section>
<section id="explanations-guide-improvement" class="level3">
<h3 class="anchored" data-anchor-id="explanations-guide-improvement">74. Explanations Guide Improvement</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_74.png" class="img-fluid figure-img"></p>
<figcaption>Slide 74</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1694s">Timestamp: 28:14</a>)</p>
<p>Rajiv suggests asking the LLM Judge for <strong>Explanations</strong>. Don’t just ask for a score; ask for “one sentence explaining why.”</p>
<p>These explanations act as metadata that helps developers understand the judge’s reasoning, making it easier to debug discrepancies between human and AI judgments.</p>
</section>
<section id="limits-to-explanations" class="level3">
<h3 class="anchored" data-anchor-id="limits-to-explanations">75. Limits to Explanations</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_75.png" class="img-fluid figure-img"></p>
<figcaption>Slide 75</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1715s">Timestamp: 28:35</a>)</p>
<p>A warning: <strong>Explanations are not causal</strong>. When an LLM explains why it did something, it is generating a plausible justification, not a trace of its actual neural activations.</p>
<p>Treat explanations as a heuristic or a helpful hint, not as absolute truth about the model’s internal state.</p>
</section>
<section id="the-evaluation-flywheel" class="level3">
<h3 class="anchored" data-anchor-id="the-evaluation-flywheel">76. The Evaluation Flywheel</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_76.png" class="img-fluid figure-img"></p>
<figcaption>Slide 76</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1726s">Timestamp: 28:46</a>)</p>
<p>The <strong>Evaluation Flywheel</strong> describes the iterative cycle: Build Eval -&gt; Analyze -&gt; Improve -&gt; Repeat.</p>
<p>This concept, credited to Hamill, emphasizes that evaluation is not a one-time event but a continuous loop that spins faster as you gather more data and build better tests.</p>
</section>
<section id="financial-analyst-agent-example" class="level3">
<h3 class="anchored" data-anchor-id="financial-analyst-agent-example">77. Financial Analyst Agent Example</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_77.png" class="img-fluid figure-img"></p>
<figcaption>Slide 77</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1760s">Timestamp: 29:20</a>)</p>
<p>To demonstrate advanced unit testing, Rajiv introduces a <strong>Financial Analyst Agent</strong>. The goal is to assess the specific “style” of a financial report.</p>
<p>This is a complex domain where “good” is highly specific (regulated, precise, risk-aware), making it a perfect candidate for granular unit tests.</p>
</section>
<section id="use-a-global-test" class="level3">
<h3 class="anchored" data-anchor-id="use-a-global-test">78. Use a Global Test?</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_78.png" class="img-fluid figure-img"></p>
<figcaption>Slide 78</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1783s">Timestamp: 29:43</a>)</p>
<p>You <em>could</em> use a <strong>Global Test</strong>: “Was this explained as a financial analyst would?”</p>
<p>While simple, this test is opaque. If it fails, you don’t know if it was because of compliance issues, lack of clarity, or poor formatting.</p>
</section>
<section id="global-vs.-unit-tests" class="level3">
<h3 class="anchored" data-anchor-id="global-vs.-unit-tests">79. Global vs.&nbsp;Unit Tests</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_79.png" class="img-fluid figure-img"></p>
<figcaption>Slide 79</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1794s">Timestamp: 29:54</a>)</p>
<p>The slide contrasts the Global approach with <strong>Unit Tests</strong>. Instead of one question, we ask six: Context, Clarity, Precision, Compliance, Actionability, and Risks.</p>
<p>This breakdown allows for targeted debugging. You might find the model is great at “Clarity” but terrible at “Compliance.”</p>
</section>
<section id="scoring-radar-chart" class="level3">
<h3 class="anchored" data-anchor-id="scoring-radar-chart">80. Scoring Radar Chart</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_80.png" class="img-fluid figure-img"></p>
<figcaption>Slide 80</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1816s">Timestamp: 30:16</a>)</p>
<p>A <strong>Radar Chart</strong> visualizes the unit test scores. This allows for a quick visual assessment of the model’s profile.</p>
<p>It facilitates comparison: you can overlay the profiles of two different models to see which one has the better balance of attributes for your specific needs.</p>
</section>
<section id="analyzing-failures-with-clusters" class="level3">
<h3 class="anchored" data-anchor-id="analyzing-failures-with-clusters">81. Analyzing Failures with Clusters</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_81.png" class="img-fluid figure-img"></p>
<figcaption>Slide 81</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1837s">Timestamp: 30:37</a>)</p>
<p>With enough unit test data, you can use <strong>Clustering (e.g., K-Means)</strong> to group failures. The slide shows clusters like “Synthesis,” “Context,” and “Hallucination.”</p>
<p>This moves error analysis from reading individual logs to analyzing aggregate trends, helping you prioritize which class of errors to fix first.</p>
</section>
<section id="designing-good-unit-tests" class="level3">
<h3 class="anchored" data-anchor-id="designing-good-unit-tests">82. Designing Good Unit Tests</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_82.png" class="img-fluid figure-img"></p>
<figcaption>Slide 82</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1852s">Timestamp: 30:52</a>)</p>
<p>Advice on <strong>Designing Unit Tests</strong>: Keep them focused (one concept per test), use unambiguous language, and use small rating ranges.</p>
<p>Good unit tests are the building blocks of a reliable evaluation pipeline. If the tests themselves are noisy or vague, the entire system collapses.</p>
</section>
<section id="examples-of-unit-tests" class="level3">
<h3 class="anchored" data-anchor-id="examples-of-unit-tests">83. Examples of Unit Tests</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_83.png" class="img-fluid figure-img"></p>
<figcaption>Slide 83</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1855s">Timestamp: 30:55</a>)</p>
<p>The slide lists specific examples of tests for <strong>Legal</strong> (Compliance, Terminology), <strong>Retrieval</strong> (Relevance, Completeness), and <strong>Bias/Fairness</strong>.</p>
<p>This serves as a menu of options for the audience, showing that unit tests can cover almost any dimension of quality required by the business.</p>
</section>
<section id="evaluating-new-prompts" class="level3">
<h3 class="anchored" data-anchor-id="evaluating-new-prompts">84. Evaluating New Prompts</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_84.png" class="img-fluid figure-img"></p>
<figcaption>Slide 84</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1858s">Timestamp: 30:58</a>)</p>
<p>A bar chart shows how unit tests are used to <strong>Evaluate New Prompts</strong>. By running the full suite of unit tests on a new prompt, you get a “scorecard” of its performance.</p>
<p>This data-driven approach removes the guesswork from prompt engineering.</p>
</section>
<section id="tools---no-silver-bullet" class="level3">
<h3 class="anchored" data-anchor-id="tools---no-silver-bullet">85. Tools - No Silver Bullet</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_85.png" class="img-fluid figure-img"></p>
<figcaption>Slide 85</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1862s">Timestamp: 31:02</a>)</p>
<p>Rajiv reminds the audience that <strong>Tools are No Silver Bullet</strong>. You must master the basics (datasets, metrics) first.</p>
<p>He advises logging traces and experiments and practicing <strong>Dataset Versioning</strong>. Tools facilitate these practices, but they cannot replace the fundamental engineering discipline.</p>
</section>
<section id="forest-and-trees" class="level3">
<h3 class="anchored" data-anchor-id="forest-and-trees">86. Forest and Trees</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_86.png" class="img-fluid figure-img"></p>
<figcaption>Slide 86</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1864s">Timestamp: 31:04</a>)</p>
<p>An analogy helps structure the analysis: <strong>Forest (Global/Integration)</strong> vs.&nbsp;<strong>Trees (Test Case/Unit Tests)</strong>.</p>
<p>You need to look at both. The forest tells you the overall health of the app, while the trees tell you specifically what needs pruning or fixing.</p>
</section>
<section id="change-one-thing-at-a-time" class="level3">
<h3 class="anchored" data-anchor-id="change-one-thing-at-a-time">87. Change One Thing at a Time</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_87.png" class="img-fluid figure-img"></p>
<figcaption>Slide 87</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1877s">Timestamp: 31:17</a>)</p>
<p>A crucial scientific principle: <strong>Change One Thing at a Time</strong>. With so many knobs (prompt, temp, model, RAG settings), changing multiple variables simultaneously makes it impossible to know what caused the improvement (or regression).</p>
<p>Isolate your variables to conduct valid experiments.</p>
</section>
<section id="error-analysis-tips" class="level3">
<h3 class="anchored" data-anchor-id="error-analysis-tips">88. Error Analysis Tips</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_88.png" class="img-fluid figure-img"></p>
<figcaption>Slide 88</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1892s">Timestamp: 31:32</a>)</p>
<p>A summary of <strong>Error Analysis Tips</strong>: Use ablation studies (removing parts to see impact), categorize failures, save interesting examples, and leverage logs/traces.</p>
<p>These are the daily habits of successful GenAI engineers.</p>
</section>
<section id="the-evaluation-story" class="level3">
<h3 class="anchored" data-anchor-id="the-evaluation-story">89. The Evaluation Story</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_89.png" class="img-fluid figure-img"></p>
<figcaption>Slide 89</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1928s">Timestamp: 32:08</a>)</p>
<p>The slide shows the “Story We Tell”—a linear graph of improvement over time. This is the idealized version of progress often presented in case studies.</p>
<p>It suggests a smooth journey from “Out of the box” to “Specialized” to “User Feedback.”</p>
</section>
<section id="the-reality-of-progress" class="level3">
<h3 class="anchored" data-anchor-id="the-reality-of-progress">90. The Reality of Progress</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_90.png" class="img-fluid figure-img"></p>
<figcaption>Slide 90</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1944s">Timestamp: 32:24</a>)</p>
<p><strong>The Reality</strong> is a messy, non-linear graph. You take two steps forward, one step back. Sometimes an “improvement” breaks the model.</p>
<p>Rajiv encourages resilience. Experienced practitioners know that this messy graph is normal and that sticking to the process eventually yields results.</p>
</section>
<section id="continual-process" class="level3">
<h3 class="anchored" data-anchor-id="continual-process">91. Continual Process</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_91.png" class="img-fluid figure-img"></p>
<figcaption>Slide 91</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=1981s">Timestamp: 33:01</a>)</p>
<p><strong>Evaluation is a Continual Process</strong>. It involves Problem ID, Data Collection, Optimization, User Acceptance Testing (UAT), and Updates.</p>
<p>Crucially, <strong>UAT</strong> is your holdout set. Since you don’t have a traditional test set in GenAI, your real users act as the final validation layer.</p>
</section>
<section id="eating-the-elephant" class="level3">
<h3 class="anchored" data-anchor-id="eating-the-elephant">92. Eating the Elephant</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_92.png" class="img-fluid figure-img"></p>
<figcaption>Slide 92</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=2043s">Timestamp: 34:03</a>)</p>
<p>The metaphor <strong>“How do you eat an elephant?”</strong> addresses the overwhelming nature of building a comprehensive evaluation suite.</p>
<p>The answer, of course, is “one bite at a time.” You don’t need 100 tests on day one.</p>
</section>
<section id="adding-tests-over-time" class="level3">
<h3 class="anchored" data-anchor-id="adding-tests-over-time">93. Adding Tests Over Time</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_93.png" class="img-fluid figure-img"></p>
<figcaption>Slide 93</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=2050s">Timestamp: 34:10</a>)</p>
<p>The slide visualizes the “elephant” being broken down into bites. You start with a few critical tests. As the app matures and you discover new failure modes, you add more tests.</p>
<p>Six months in, you might have 100 tests, but you built them incrementally. This makes the task manageable.</p>
</section>
<section id="doing-evaluation-the-right-way" class="level3">
<h3 class="anchored" data-anchor-id="doing-evaluation-the-right-way">94. Doing Evaluation the Right Way</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_94.png" class="img-fluid figure-img"></p>
<figcaption>Slide 94</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=2079s">Timestamp: 34:39</a>)</p>
<p>A summary slide listing best practices: <strong>Annotated Examples</strong>, <strong>Systematic Documentation</strong>, <strong>Continuous Error Analysis</strong>, <strong>Collaboration</strong>, and awareness of <strong>Generalization</strong>.</p>
<p>This concludes the core methodology section of the talk.</p>
</section>
<section id="agentic-use-cases" class="level3">
<h3 class="anchored" data-anchor-id="agentic-use-cases">95. Agentic Use Cases</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_95.png" class="img-fluid figure-img"></p>
<figcaption>Slide 95</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=2090s">Timestamp: 34:50</a>)</p>
<p>The final section covers <strong>Agentic Use Cases</strong>, symbolized by a dragon. Agents add a layer of complexity because the model is now making decisions (routing, tool use) rather than just generating text.</p>
<p>This “agency” makes the system harder to track and evaluate.</p>
</section>
<section id="crossing-the-river" class="level3">
<h3 class="anchored" data-anchor-id="crossing-the-river">96. Crossing the River</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_96.png" class="img-fluid figure-img"></p>
<figcaption>Slide 96</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=2106s">Timestamp: 35:06</a>)</p>
<p>A conceptual slide asking, <strong>“How should it cross the river?”</strong> (Fly, Swim, Bridge?). This represents the decision-making step in an agent.</p>
<p>Evaluating an agent requires evaluating <em>how</em> it made the decision (the router) separately from <em>how well</em> it executed the action.</p>
</section>
<section id="chat-to-purchase-router" class="level3">
<h3 class="anchored" data-anchor-id="chat-to-purchase-router">97. Chat-to-Purchase Router</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_97.png" class="img-fluid figure-img"></p>
<figcaption>Slide 97</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=2122s">Timestamp: 35:22</a>)</p>
<p>A complex flowchart shows a <strong>Chat-to-Purchase Router</strong>. The agent must decide if the user wants to search for a product, get support, or track a package.</p>
<p>Rajiv suggests breaking this down: evaluate the <strong>Router</strong> component first (did it pick the right path?), then evaluate the specific workflow (did it track the package correctly?).</p>
</section>
<section id="text-to-sql-agent" class="level3">
<h3 class="anchored" data-anchor-id="text-to-sql-agent">98. Text to SQL Agent</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_98.png" class="img-fluid figure-img"></p>
<figcaption>Slide 98</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=2177s">Timestamp: 36:17</a>)</p>
<p>Another example: <strong>Text to SQL Agent</strong>. This workflow involves classification, feature extraction, and SQL generation.</p>
<p>You can isolate the “Classification” step (is this a valid SQL question?) and build a test just for that, before testing the actual SQL generation.</p>
</section>
<section id="evaluating-office-style-agents" class="level3">
<h3 class="anchored" data-anchor-id="evaluating-office-style-agents">99. Evaluating Office-Style Agents</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_99.png" class="img-fluid figure-img"></p>
<figcaption>Slide 99</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=2206s">Timestamp: 36:46</a>)</p>
<p>The slide discusses <strong>OdysseyBench</strong>, a benchmark for office tasks. It highlights failure modes like “Failed to create folder” or “Failed to use tool.”</p>
<p>Evaluating agents involves checking if they successfully manipulated the environment (files, APIs), which is a functional test rather than a text similarity test.</p>
</section>
<section id="error-analysis-for-agents" class="level3">
<h3 class="anchored" data-anchor-id="error-analysis-for-agents">100. Error Analysis for Agents</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_100.png" class="img-fluid figure-img"></p>
<figcaption>Slide 100</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=2220s">Timestamp: 37:00</a>)</p>
<p><strong>Error Analysis for Agentic Workflows</strong> requires assessing the overall performance, the routing decisions, and the individual steps.</p>
<p>It is the same “action error analysis” process but applied recursively to every node in the agent’s decision tree.</p>
</section>
<section id="evaluating-workflow-vs.-response" class="level3">
<h3 class="anchored" data-anchor-id="evaluating-workflow-vs.-response">101. Evaluating Workflow vs.&nbsp;Response</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_101.png" class="img-fluid figure-img"></p>
<figcaption>Slide 101</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=2239s">Timestamp: 37:19</a>)</p>
<p>This slide distinguishes between evaluating a <strong>Response</strong> (text) and a <strong>Workflow</strong> (process). The flowchart shows a conversational flow.</p>
<p>Evaluating a workflow might mean checking if the agent successfully moved the user from “Greeting” to “Resolution,” regardless of the exact words used.</p>
</section>
<section id="agentic-frameworks" class="level3">
<h3 class="anchored" data-anchor-id="agentic-frameworks">102. Agentic Frameworks</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_102.png" class="img-fluid figure-img"></p>
<figcaption>Slide 102</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=2268s">Timestamp: 37:48</a>)</p>
<p>Rajiv warns that <strong>“Agentic Frameworks Help – Until They Don’t.”</strong> Frameworks (like LangChain or AutoGen) are great for demos because they abstract complexity.</p>
<p>However, in production, these abstractions can break or become outdated. He often recommends using straight Python for production agents to maintain control and reliability.</p>
</section>
<section id="abstraction-for-workflows" class="level3">
<h3 class="anchored" data-anchor-id="abstraction-for-workflows">103. Abstraction for Workflows</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_103.png" class="img-fluid figure-img"></p>
<figcaption>Slide 103</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=2312s">Timestamp: 38:32</a>)</p>
<p>The slide illustrates the trade-off in <strong>Abstraction</strong>. You can build rigid workflows (orchestration) where you control every step, or use general agents where the LLM decides.</p>
<p>Orchestration is more reliable but rigid. General agents are flexible but prone to non-deterministic errors.</p>
</section>
<section id="when-abstractions-break" class="level3">
<h3 class="anchored" data-anchor-id="when-abstractions-break">104. When Abstractions Break</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_104.png" class="img-fluid figure-img"></p>
<figcaption>Slide 104</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=2333s">Timestamp: 38:53</a>)</p>
<p>Model providers are training models to handle workflows internally (removing the need for external orchestration).</p>
<p>However, until models are perfect, developers often need to break tasks down into specific pieces to ensure reliability. The choice between “letting the model do it” and “scripting the flow” depends on the application’s risk tolerance.</p>
</section>
<section id="lessons-from-agent-benchmarks" class="level3">
<h3 class="anchored" data-anchor-id="lessons-from-agent-benchmarks">105. Lessons from Agent Benchmarks</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_105.png" class="img-fluid figure-img"></p>
<figcaption>Slide 105</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=2355s">Timestamp: 39:15</a>)</p>
<p>The slide lists <strong>Lessons from Reproducing Agent Benchmarks</strong>: Standardize evaluation, measure efficiency, detect shortcuts, and log real behavior.</p>
<p>These are advanced tips for those pushing the boundaries of what agents can do.</p>
</section>
<section id="conclusion" class="level3">
<h3 class="anchored" data-anchor-id="conclusion">106. Conclusion</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_106.png" class="img-fluid figure-img"></p>
<figcaption>Slide 106</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/qPHsWTZP58U&amp;t=2367s">Timestamp: 39:27</a>)</p>
<p>The final slide, <strong>“We did it!”</strong>, concludes the presentation. Rajiv thanks the audience and provides the QR code again.</p>
<p>His final message is one of empowerment: he hopes the audience now has the confidence to go out, build their own evaluation datasets, and start “hill climbing” their own applications.</p>
<hr>
<p><em>This annotated presentation was generated from the talk using AI-assisted tools. Each slide includes timestamps and detailed explanations.</em></p>


</section>
</section>

 ]]></description>
  <category>GenAI</category>
  <category>Evaluation</category>
  <category>LLM</category>
  <category>Testing</category>
  <category>Annotated Talk</category>
  <guid>https://rajivshah.com/blog/genai-evaluation-guide.html</guid>
  <pubDate>Sat, 01 Nov 2025 05:00:00 GMT</pubDate>
  <media:content url="https://rajivshah.com/blog/images/genai-evaluation-guide/slide_1.png" medium="image" type="image/png"/>
</item>
<item>
  <title>RAG Retrieval Deep Dive: BM25, Embeddings, and the Power of Agentic Search</title>
  <link>https://rajivshah.com/blog/rag-agentic-world.html</link>
  <description><![CDATA[ 






<section id="video" class="level2">
<h2 class="anchored" data-anchor-id="video">Video</h2>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/AS_HlJbJjH8" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<p>Watch the <a href="https://youtu.be/AS_HlJbJjH8">full video</a></p>
<hr>
</section>
<section id="annotated-presentation" class="level2">
<h2 class="anchored" data-anchor-id="annotated-presentation">Annotated Presentation</h2>
<p>Below is an annotated version of the presentation, with timestamped links to the relevant parts of the video for each slide.</p>
<p>Here is the slide-by-slide annotated presentation based on the video “From Vectors to Agents: Managing RAG in an Agentic World” by Rajiv Shah. This was presented in different forms at several conferences in the Fall of 2025 including, <a href="https://www.youtube.com/watch?v=GZSj_IIK5yE">MLOps World | GenAI Summit 2025</a> and <a href="https://www.youtube.com/watch?v=JYZXsH1Xz0I">Fully Connected London from Weights and Biases</a>.</p>
<hr>
<section id="title-slide" class="level3">
<h3 class="anchored" data-anchor-id="title-slide">1. Title Slide</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_1.png" class="img-fluid figure-img"></p>
<figcaption>Slide 1</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=0s">Timestamp: 00:00</a>)</p>
<p>The presentation begins with the title slide, introducing the core theme: <strong>“From Vectors to Agents: Managing RAG in an Agentic World.”</strong> The speaker, Rajiv Shah from Contextual, sets the stage for a technical deep dive into Retrieval-Augmented Generation (RAG).</p>
<p>He outlines the agenda, promising to move beyond basic RAG concepts to focus specifically on <strong>retrieval approaches</strong>. The talk is designed to cover the spectrum from traditional methods like BM25 and Language Models to the emerging field of Agentic Search.</p>
</section>
<section id="acme-gpt" class="level3">
<h3 class="anchored" data-anchor-id="acme-gpt">2. ACME GPT</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_2.png" class="img-fluid figure-img"></p>
<figcaption>Slide 2</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=40s">Timestamp: 00:40</a>)</p>
<p>This slide displays a stylized logo for “ACME GPT,” representing the typical enterprise aspiration. Companies see tools like ChatGPT and immediately want to apply that capability to their internal data, asking questions like, “Can I get the list of board of directors?”</p>
<p>However, the speaker notes a common hurdle: generic models don’t know enterprise-specific knowledge. This sets up the necessity for RAG—injecting private data into the model—rather than relying solely on the model’s pre-trained knowledge.</p>
</section>
<section id="building-rag-is-easy" class="level3">
<h3 class="anchored" data-anchor-id="building-rag-is-easy">3. Building RAG is Easy</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_3.png" class="img-fluid figure-img"></p>
<figcaption>Slide 3</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=70s">Timestamp: 01:10</a>)</p>
<p>The speaker illustrates the deceptively simple workflow of a basic RAG demo. The diagram shows the standard path: a user query is converted to vectors, matched against a database, and sent to an LLM.</p>
<p>Shah acknowledges that building a “hello world” version of this is trivial. He notes, “You can build a very easy RAG demo out of the box by just grabbing some data, using an embedding model, creating vectors, doing the similarity.”</p>
</section>
<section id="building-rag-is-easy-code-example" class="level3">
<h3 class="anchored" data-anchor-id="building-rag-is-easy-code-example">4. Building RAG is Easy (Code Example)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_4.png" class="img-fluid figure-img"></p>
<figcaption>Slide 4</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=82s">Timestamp: 01:22</a>)</p>
<p>A Python code snippet using <strong>LangChain</strong> is displayed to reinforce how accessible basic RAG has become. The code demonstrates loading a document, chunking it, and setting up a retrieval chain in just a few lines.</p>
<p>This slide serves as a foil for the upcoming reality check. While the code works for a demo, it hides the immense complexity required to make such a system robust, accurate, and scalable in a real-world production environment.</p>
</section>
<section id="rag-reality-check" class="level3">
<h3 class="anchored" data-anchor-id="rag-reality-check">5. RAG Reality Check</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_5.png" class="img-fluid figure-img"></p>
<figcaption>Slide 5</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=95s">Timestamp: 01:35</a>)</p>
<p>The tone shifts to the challenges of production. The slide highlights a sobering statistic: <strong>95% of Gen AI projects fail to reach production</strong>. The speaker details the specific reasons why demos fail when scaled: poor accuracy, unbearable latency, scaling issues with millions of documents, and ballooning costs.</p>
<p>He emphasizes a critical, often overlooked factor: <strong>Compliance</strong>. “Inside an enterprise, not everybody gets to read every document.” A demo ignores entitlements, but a production system cannot.</p>
</section>
<section id="maybe-try-a-different-rag" class="level3">
<h3 class="anchored" data-anchor-id="maybe-try-a-different-rag">6. Maybe try a different RAG?</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_6.png" class="img-fluid figure-img"></p>
<figcaption>Slide 6</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=180s">Timestamp: 03:00</a>)</p>
<p>This slide lists a dizzying array of RAG variants (GraphRAG, RAPTOR, CRAG, etc.) and retrieval techniques. It represents the “analysis paralysis” developers face when scouring arXiv papers for a solution to their accuracy problems.</p>
<p>Shah warns against blindly chasing the latest academic paper to fix fundamental system issues. “The answer is not in here of pulling together like a bunch of archive papers.” Instead, he advocates for a structured framework to make decisions.</p>
</section>
<section id="ultimate-rag-solution" class="level3">
<h3 class="anchored" data-anchor-id="ultimate-rag-solution">7. Ultimate RAG Solution</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_7.png" class="img-fluid figure-img"></p>
<figcaption>Slide 7</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=210s">Timestamp: 03:30</a>)</p>
<p>A humorous cartoon depicts a “Rube Goldberg” machine, representing the <strong>“Ultimate RAG Solution.”</strong> It mocks the tendency to over-engineer systems with too many interconnected, fragile components in the pursuit of performance.</p>
<p>The speaker uses this visual to argue for simplicity and deliberate design. The goal is to avoid building a monstrosity that is impossible to maintain, urging the audience to think about trade-offs before complexity.</p>
</section>
<section id="rag-as-a-system" class="level3">
<h3 class="anchored" data-anchor-id="rag-as-a-system">8. RAG as a system</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_8.png" class="img-fluid figure-img"></p>
<figcaption>Slide 8</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=215s">Timestamp: 03:35</a>)</p>
<p>The speaker introduces a clean system architecture for RAG, broken into four distinct stages: <strong>Parsing, Querying, Retrieving, and Generation</strong>. This framework serves as the mental map for the rest of the presentation.</p>
<p>He highlights that “Parsing” is vastly overlooked—getting information out of complex documents cleanly is a prerequisite for success. Today’s talk, however, will zoom in specifically on the <strong>Retrieving</strong> and <strong>Querying</strong> components.</p>
</section>
<section id="designing-a-rag-solution" class="level3">
<h3 class="anchored" data-anchor-id="designing-a-rag-solution">9. Designing a RAG Solution</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_9.png" class="img-fluid figure-img"></p>
<figcaption>Slide 9</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=250s">Timestamp: 04:10</a>)</p>
<p>This slide presents a “Tradeoff Triangle” for RAG, balancing <strong>Problem Complexity, Latency, and Cost</strong>. The speaker advises having a serious conversation with stakeholders about these constraints before writing code.</p>
<p>A key concept introduced here is the <strong>“Cost of a mistake.”</strong> In coding assistants, a mistake is low-cost (the developer fixes it). In medical RAG systems, the cost of a mistake is high (life or death), which dictates a completely different architectural approach.</p>
</section>
<section id="rag-considerations" class="level3">
<h3 class="anchored" data-anchor-id="rag-considerations">10. RAG Considerations</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_10.png" class="img-fluid figure-img"></p>
<figcaption>Slide 10</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=330s">Timestamp: 05:30</a>)</p>
<p>A detailed table breaks down specific considerations that influence RAG design, such as domain difficulty, multilingual requirements, and data quality. This slide was originally created for sales teams to help scope customer problems.</p>
<p>Shah emphasizes that understanding the <strong>nuances</strong> of the use case upfront saves heartache later. For instance, knowing if users will ask simple questions or require complex reasoning changes the retrieval strategy entirely.</p>
</section>
<section id="consider-query-complexity" class="level3">
<h3 class="anchored" data-anchor-id="consider-query-complexity">11. Consider Query Complexity</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_11.png" class="img-fluid figure-img"></p>
<figcaption>Slide 11</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=375s">Timestamp: 06:15</a>)</p>
<p>The speaker categorizes queries by complexity, ranging from simple <strong>Keywords</strong> (“Total Revenue”) to <strong>Semantic</strong> variations (“How much bank?”), to <strong>Multi-hop</strong> reasoning, and finally <strong>Agentic</strong> scenarios.</p>
<p>He points out a common failure mode: “The answers aren’t in the documents… all of a sudden they’re asking for knowledge that’s outside.” Recognizing the query complexity determines whether you need a simple search engine or a complex agentic workflow.</p>
</section>
<section id="retrieval-highlighted" class="level3">
<h3 class="anchored" data-anchor-id="retrieval-highlighted">12. Retrieval (Highlighted)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_12.png" class="img-fluid figure-img"></p>
<figcaption>Slide 12</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=452s">Timestamp: 07:32</a>)</p>
<p>The presentation zooms back into the system diagram, highlighting the <strong>“Retrieving”</strong> box. This signals the start of the deep technical dive into retrieval algorithms.</p>
<p>Shah notes that this area causes the most confusion due to the sheer number of model choices and architectures available. He aims to provide a practical guide to selecting the right retrieval tool.</p>
</section>
<section id="retrieval-approaches" class="level3">
<h3 class="anchored" data-anchor-id="retrieval-approaches">13. Retrieval Approaches</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_13.png" class="img-fluid figure-img"></p>
<figcaption>Slide 13</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=496s">Timestamp: 08:16</a>)</p>
<p>Three primary retrieval pillars are introduced: 1. <strong>BM25:</strong> The lexical, keyword-based standard. 2. <strong>Language Models:</strong> Semantic embeddings and vector search. 3. <strong>Agentic Search:</strong> The new frontier of iterative reasoning.</p>
<p>The speaker emphasizes that documents must be broken into pieces (<strong>chunking</strong>) because no single model context window is efficient enough to hold all enterprise data for every query.</p>
</section>
<section id="building-rag-is-easy-code-highlight" class="level3">
<h3 class="anchored" data-anchor-id="building-rag-is-easy-code-highlight">14. Building RAG is Easy (Code Highlight)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_14.png" class="img-fluid figure-img"></p>
<figcaption>Slide 14</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=530s">Timestamp: 08:50</a>)</p>
<p>Returning to the initial code snippet, the speaker highlights the <code>vectorstore</code> and <code>retriever</code> initialization lines. This pinpoints exactly where the upcoming concepts fit into the implementation.</p>
<p>This visual anchor helps developers map the theoretical concepts of BM25 and Embeddings back to the actual lines of code they write in libraries like LangChain or LlamaIndex.</p>
</section>
<section id="bm25" class="level3">
<h3 class="anchored" data-anchor-id="bm25">15. BM25</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_15.png" class="img-fluid figure-img"></p>
<figcaption>Slide 15</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=558s">Timestamp: 09:18</a>)</p>
<p><strong>BM25 (Best Match 25)</strong> is explained as a probabilistic lexical ranking function. The slide visualizes an <strong>inverted index</strong>, mapping words (like “butterfly”) to the specific documents containing them.</p>
<p>Shah explains that this is the 25th iteration of the formula, designed to score documents based on word frequency and saturation. It remains a powerful, fast baseline for retrieval.</p>
</section>
<section id="bm25-performance" class="level3">
<h3 class="anchored" data-anchor-id="bm25-performance">16. BM25 Performance</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_16.png" class="img-fluid figure-img"></p>
<figcaption>Slide 16</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=595s">Timestamp: 09:55</a>)</p>
<p>A table compares the speed of a <strong>Linear Scan</strong> (Ctrl+F style) versus an <strong>Inverted Index (BM25)</strong> as the document count grows from 1,000 to 9,000.</p>
<p>The data shows that linear search becomes exponentially slower (taking 3,000 seconds for 1k documents in this synthetic test), while BM25 remains orders of magnitude faster. This efficiency is why lexical search is still widely used in production.</p>
</section>
<section id="bm25-failure-cases" class="level3">
<h3 class="anchored" data-anchor-id="bm25-failure-cases">17. BM25 Failure Cases</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_17.png" class="img-fluid figure-img"></p>
<figcaption>Slide 17</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=668s">Timestamp: 11:08</a>)</p>
<p>The limitations of BM25 are exposed. Because it relies on exact word matches, it fails when users use synonyms. If a user searches for <strong>“Physician”</strong> but the documents only contain <strong>“Doctor,”</strong> BM25 will return zero results.</p>
<p>Similarly, it struggles with acronyms like <strong>“IBM”</strong> vs <strong>“International Business Machines.”</strong> Despite this, Shah argues BM25 is a “very strong baseline” that often beats complex neural models on specific keyword-heavy datasets.</p>
</section>
<section id="hands-on-bm25s" class="level3">
<h3 class="anchored" data-anchor-id="hands-on-bm25s">18. Hands on: BM25s</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_18.png" class="img-fluid figure-img"></p>
<figcaption>Slide 18</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=734s">Timestamp: 12:14</a>)</p>
<p>For developers wanting to implement this, the slide points to a library called <code>bm25s</code>, a high-performance Python implementation available on Hugging Face.</p>
<p>This reinforces the practical nature of the talk—BM25 isn’t just a legacy concept; it is an active, installable tool that developers should consider using alongside vector search.</p>
</section>
<section id="enter-language-models" class="level3">
<h3 class="anchored" data-anchor-id="enter-language-models">19. Enter Language Models</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_19.png" class="img-fluid figure-img"></p>
<figcaption>Slide 19</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=744s">Timestamp: 12:24</a>)</p>
<p>The talk transitions to <strong>Language Models (Embeddings)</strong>. The slide explains how an encoder model turns text into a dense vector (a list of numbers) that captures semantic meaning.</p>
<p>Because these models are trained on vast amounts of data, they “have an idea of these similar concepts.” This solves the synonym problem that plagues BM25.</p>
</section>
<section id="embeddings-visualized" class="level3">
<h3 class="anchored" data-anchor-id="embeddings-visualized">20. Embeddings Visualized</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_20.png" class="img-fluid figure-img"></p>
<figcaption>Slide 20</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=770s">Timestamp: 12:50</a>)</p>
<p>A 2D visualization demonstrates how embeddings group related concepts in <strong>latent space</strong>. The word “Doctor” and “Physician” would be located very close to each other mathematically.</p>
<p>This spatial proximity allows for <strong>Semantic Search</strong>: finding documents that mean the same thing as the query, even if they don’t share a single word.</p>
</section>
<section id="semantic-search-is-widely-used" class="level3">
<h3 class="anchored" data-anchor-id="semantic-search-is-widely-used">21. Semantic search is widely used</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_21.png" class="img-fluid figure-img"></p>
<figcaption>Slide 21</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=795s">Timestamp: 13:15</a>)</p>
<p>The speaker validates the importance of semantic search by showing a tweet from Google’s SearchLiaison regarding BERT, and a screenshot of Hugging Face’s model repository.</p>
<p>This confirms that semantic search is the industry standard for modern information retrieval, having been deployed at massive scale by tech giants to improve result relevance.</p>
</section>
<section id="which-language-model" class="level3">
<h3 class="anchored" data-anchor-id="which-language-model">22. Which language model?</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_22.png" class="img-fluid figure-img"></p>
<figcaption>Slide 22</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=810s">Timestamp: 13:30</a>)</p>
<p>A scatter plot compares various models based on <strong>Inference Speed</strong> (X-axis) and <strong>NDCG@10</strong> (Y-axis, a measure of retrieval quality).</p>
<p>Shah places <strong>BM25</strong> on the right (fast but lower accuracy) to orient the audience. He points out that there is a massive variety of models with different trade-offs between compute cost and retrieval quality.</p>
</section>
<section id="static-embeddings" class="level3">
<h3 class="anchored" data-anchor-id="static-embeddings">23. Static Embeddings</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_23.png" class="img-fluid figure-img"></p>
<figcaption>Slide 23</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=883s">Timestamp: 14:43</a>)</p>
<p>The speaker introduces <strong>Static Embeddings</strong> (like Word2Vec or GloVe) which are located on the far right of the previous scatter plot—extremely fast, even on CPUs.</p>
<p>These models assign a fixed vector to every word. While efficient, they lack context. The word “bank” has the same vector whether referring to a river bank or a financial bank, which limits their accuracy.</p>
</section>
<section id="why-context-matters" class="level3">
<h3 class="anchored" data-anchor-id="why-context-matters">24. Why Context Matters</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_24.png" class="img-fluid figure-img"></p>
<figcaption>Slide 24</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=916s">Timestamp: 15:16</a>)</p>
<p>A cartoon illustrates the difference between Static Embeddings and Transformers. The Transformer can distinguish between “Model” in a data science context versus “Model” in a fashion context.</p>
<p>This contextual awareness is why modern Transformer-based embeddings (like BERT) generally outperform static embeddings and BM25 in complex retrieval tasks, despite being slower.</p>
</section>
<section id="many-more-models" class="level3">
<h3 class="anchored" data-anchor-id="many-more-models">25. Many more models!</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_25.png" class="img-fluid figure-img"></p>
<figcaption>Slide 25</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=955s">Timestamp: 15:55</a>)</p>
<p>Returning to the scatter plot, a red arrow points toward the top-left quadrant—models that are slower but achieve higher accuracy.</p>
<p>The speaker notes that the field is constantly evolving, with “newer generations of models” pushing the boundary of what is possible in terms of retrieval quality.</p>
</section>
<section id="mtebrteb" class="level3">
<h3 class="anchored" data-anchor-id="mtebrteb">26. MTEB/RTEB</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_26.png" class="img-fluid figure-img"></p>
<figcaption>Slide 26</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=995s">Timestamp: 16:35</a>)</p>
<p>To help developers choose, Shah introduces the <strong>MTEB (Massive Text Embedding Benchmark)</strong> and <strong>RTEB (Retrieval Text Embedding Benchmark)</strong>. These are leaderboards hosted on Hugging Face.</p>
<p>He highlights a key distinction: MTEB uses public datasets, while RTEB uses <strong>private, held-out datasets</strong>. This is crucial for avoiding “data contamination,” where models perform well simply because they were trained on the test data.</p>
</section>
<section id="selecting-an-embedding-model" class="level3">
<h3 class="anchored" data-anchor-id="selecting-an-embedding-model">27. Selecting an embedding model</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_27.png" class="img-fluid figure-img"></p>
<figcaption>Slide 27</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1008s">Timestamp: 16:48</a>)</p>
<p>The speaker switches to a live browser view (captured in the slide) of the leaderboard. He discusses the bubble chart visualization where size often correlates with parameter count.</p>
<p>He points out an interesting trend: “You’ll see that there’s a bunch of models here that are all the same size… but the performance differs.” This indicates improvements in training strategies and architecture rather than just throwing more compute at the problem.</p>
</section>
<section id="selecting-an-embedding-model-other-considerations" class="level3">
<h3 class="anchored" data-anchor-id="selecting-an-embedding-model-other-considerations">28. Selecting an embedding model (Other Considerations)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_28.png" class="img-fluid figure-img"></p>
<figcaption>Slide 28</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1147s">Timestamp: 19:07</a>)</p>
<p>Beyond the leaderboard score, Shah lists practical selection criteria: <strong>Model Size</strong> (can it fit in memory?), <strong>Architecture</strong> (CPU vs GPU), <strong>Embedding Dimension</strong> (storage costs), and <strong>Training Data</strong> (multilingual support).</p>
<p>He advises checking if a model is open source and quantizable, as this can significantly reduce latency without a major hit to accuracy.</p>
</section>
<section id="matryoshka-embedding-models" class="level3">
<h3 class="anchored" data-anchor-id="matryoshka-embedding-models">29. Matryoshka Embedding Models</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_29.png" class="img-fluid figure-img"></p>
<figcaption>Slide 29</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1253s">Timestamp: 20:53</a>)</p>
<p>A specific innovation is highlighted: <strong>Matryoshka Embeddings</strong>. These models allow developers to truncate vectors (e.g., from 768 dimensions down to 64) while retaining most of the performance.</p>
<p>This is a “neat kind of innovation” for optimizing storage and search speed. OpenAI’s newer models also support this feature, offering flexibility between cost and accuracy.</p>
</section>
<section id="sentence-transformer" class="level3">
<h3 class="anchored" data-anchor-id="sentence-transformer">30. Sentence Transformer</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_30.png" class="img-fluid figure-img"></p>
<figcaption>Slide 30</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1302s">Timestamp: 21:42</a>)</p>
<p>The <strong>Sentence Transformer</strong> architecture is described as the dominant approach for RAG. Unlike standard BERT which works on tokens, these are fine-tuned to understand full sentences and paragraphs.</p>
<p>This architecture uses Siamese networks to ensure that semantically similar sentences are close in vector space, making them ideal for the “chunk-level” retrieval required in RAG.</p>
</section>
<section id="cross-encoder-reranker" class="level3">
<h3 class="anchored" data-anchor-id="cross-encoder-reranker">31. Cross Encoder / Reranker</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_31.png" class="img-fluid figure-img"></p>
<figcaption>Slide 31</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1336s">Timestamp: 22:16</a>)</p>
<p>The concept of a <strong>Cross Encoder (or Reranker)</strong> is introduced. Unlike the bi-encoder (retriever) which processes query and document separately, the cross-encoder processes them <em>together</em>.</p>
<p>This allows for a much deeper calculation of relevance. It is typically used as a second stage: retrieve 50 documents quickly with vectors, then use the slow but accurate Cross Encoder to rank the top 5.</p>
</section>
<section id="cross-encoder-reranker-duplicate" class="level3">
<h3 class="anchored" data-anchor-id="cross-encoder-reranker-duplicate">32. Cross Encoder / Reranker (Duplicate)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_32.png" class="img-fluid figure-img"></p>
<figcaption>Slide 32</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1336s">Timestamp: 22:16</a>)</p>
<p>(This slide reinforces the previous diagram, emphasizing the “crossing” of the query and document in the model architecture.)</p>
</section>
<section id="cross-encoder-reranker-accuracy-boost" class="level3">
<h3 class="anchored" data-anchor-id="cross-encoder-reranker-accuracy-boost">33. Cross Encoder / Reranker (Accuracy Boost)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_33.png" class="img-fluid figure-img"></p>
<figcaption>Slide 33</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1387s">Timestamp: 23:07</a>)</p>
<p>A bar chart quantifies the value of reranking. It shows a significant boost in <strong>NDCG (accuracy)</strong> when a reranker is added to the pipeline.</p>
<p>The speaker notes that while you get a “bump” in quality, it “doesn’t come for free.” The trade-off is increased latency, as the cross-encoder is computationally expensive.</p>
</section>
<section id="cross-encoder-reranker-execution-flow" class="level3">
<h3 class="anchored" data-anchor-id="cross-encoder-reranker-execution-flow">34. Cross Encoder / Reranker (Execution Flow)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_34.png" class="img-fluid figure-img"></p>
<figcaption>Slide 34</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1395s">Timestamp: 23:15</a>)</p>
<p>The execution flow diagram highlights the reranker’s position in the pipeline. It sits between the Vector Store retrieval and the LLM generation.</p>
<p>This visual reinforces the latency implication: the user has to wait for both the initial search <em>and</em> the reranking pass before the LLM even starts generating an answer.</p>
</section>
<section id="hands-on-retriever-reranker" class="level3">
<h3 class="anchored" data-anchor-id="hands-on-retriever-reranker">35. Hands On: Retriever &amp; Reranker</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_35.png" class="img-fluid figure-img"></p>
<figcaption>Slide 35</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1410s">Timestamp: 23:30</a>)</p>
<p>A screenshot of a Google Colab notebook is shown, demonstrating a practical implementation of the Retrieve and Re-rank strategy using the <code>SentenceTransformer</code> and <code>CrossEncoder</code> libraries.</p>
<p>This provides a concrete resource for the audience to test the accuracy vs.&nbsp;speed trade-offs themselves on simple datasets like Wikipedia.</p>
</section>
<section id="instruction-following-reranker" class="level3">
<h3 class="anchored" data-anchor-id="instruction-following-reranker">36. Instruction Following Reranker</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_36.png" class="img-fluid figure-img"></p>
<figcaption>Slide 36</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1428s">Timestamp: 23:48</a>)</p>
<p>Shah mentions a specific advancement: <strong>Instruction Following Rerankers</strong> (developed by his company, Contextual). These allow developers to pass a prompt to the reranker, such as “Prioritize safety notices.”</p>
<p>This adds a “knob” for developers to tune retrieval based on business logic without retraining the model.</p>
</section>
<section id="combine-multiple-retrievers" class="level3">
<h3 class="anchored" data-anchor-id="combine-multiple-retrievers">37. Combine Multiple Retrievers</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_37.png" class="img-fluid figure-img"></p>
<figcaption>Slide 37</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1459s">Timestamp: 24:19</a>)</p>
<p>The presentation suggests that you don’t have to pick just one method. You can combine BM25, various embedding models (E5, BGE), and rerankers.</p>
<p>While combining them (Ensemble Retrieval) often yields better recall, Shah warns that “you got to engineer this.” Managing multiple indexes and fusion logic increases operational complexity and compute costs.</p>
</section>
<section id="cascading-rerankers-in-kaggle" class="level3">
<h3 class="anchored" data-anchor-id="cascading-rerankers-in-kaggle">38. Cascading Rerankers in Kaggle</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_38.png" class="img-fluid figure-img"></p>
<figcaption>Slide 38</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1496s">Timestamp: 24:56</a>)</p>
<p>A complex diagram from a Kaggle competition winner illustrates a <strong>Cascade Strategy</strong>. The solution used three different rerankers, filtering from 64 documents down to 8, and then to 5.</p>
<p>This shows the extreme end of retrieval engineering, where multiple models are chained to squeeze out every percentage point of accuracy.</p>
</section>
<section id="best-practices" class="level3">
<h3 class="anchored" data-anchor-id="best-practices">39. Best practices</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_39.png" class="img-fluid figure-img"></p>
<figcaption>Slide 39</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1516s">Timestamp: 25:16</a>)</p>
<p>Shah distills the complexity into a recommended <strong>Best Practice</strong>: 1. <strong>Hybrid Search:</strong> Combine Semantic Search (Vectors) and Lexical Search (BM25). 2. <strong>Reciprocal Rank Fusion:</strong> Merge the results. 3. <strong>Reranker:</strong> Pass the top results through a cross-encoder.</p>
<p>This setup provides a “pretty good standard performance out of the box” and should be the default baseline before trying exotic methods.</p>
</section>
<section id="families-of-embedding-models" class="level3">
<h3 class="anchored" data-anchor-id="families-of-embedding-models">40. Families of Embedding Models</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_40.png" class="img-fluid figure-img"></p>
<figcaption>Slide 40</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1542s">Timestamp: 25:42</a>)</p>
<p>A taxonomy slide categorizes the models discussed: <strong>Static</strong> (Fastest/Low Accuracy), <strong>Bi-Encoders</strong> (Fast/Good Accuracy), and <strong>Cross-Encoders</strong> (Slow/Best Accuracy).</p>
<p>This summary helps the audience mentally organize the tools available in their toolbox.</p>
</section>
<section id="lots-of-new-models" class="level3">
<h3 class="anchored" data-anchor-id="lots-of-new-models">41. Lots of New Models</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_41.png" class="img-fluid figure-img"></p>
<figcaption>Slide 41</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1550s">Timestamp: 25:50</a>)</p>
<p>Logos for IBM Granite, Google EmbeddingGemma, and others appear. The speaker notes that while new models from major players appear weekly, the improvements are often “incremental.”</p>
<p>He advises against “ripping up” a working system just to switch to a model that is 1% better on a leaderboard.</p>
</section>
<section id="other-retrieval-methods" class="level3">
<h3 class="anchored" data-anchor-id="other-retrieval-methods">42. Other retrieval methods</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_42.png" class="img-fluid figure-img"></p>
<figcaption>Slide 42</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1578s">Timestamp: 26:18</a>)</p>
<p>Alternative methods are briefly listed: <strong>SPLADE</strong> (Sparse retrieval), <strong>ColBERT</strong> (Late interaction), and <strong>GraphRAG</strong>.</p>
<p>Shah acknowledges these exist and may fit specific niches, but warns against chasing the “flavor of the week” before establishing a solid baseline with hybrid search.</p>
</section>
<section id="operational-concerns" class="level3">
<h3 class="anchored" data-anchor-id="operational-concerns">43. Operational Concerns</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_43.png" class="img-fluid figure-img"></p>
<figcaption>Slide 43</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1650s">Timestamp: 27:30</a>)</p>
<p>The talk shifts to operations. Libraries like <strong>FAISS</strong> are mentioned for efficient vector similarity search.</p>
<p>A key point is that for many use cases, you can simply store embeddings <strong>in memory</strong>. You don’t always need a complex vector database if your dataset fits in RAM.</p>
</section>
<section id="vector-database-options" class="level3">
<h3 class="anchored" data-anchor-id="vector-database-options">44. Vector Database Options</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_44.png" class="img-fluid figure-img"></p>
<figcaption>Slide 44</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1675s">Timestamp: 27:55</a>)</p>
<p>A diagram categorizes storage into <strong>Hot (In-Memory)</strong>, <strong>Warm (SSD/Disk)</strong>, and <strong>Cold</strong> tiers.</p>
<p>Shah notes there are “tons of vector database options” (Snowflake, Pinecone, etc.). The choice should be governed by <strong>latency requirements</strong>. If you need sub-millisecond retrieval, you need in-memory storage.</p>
</section>
<section id="operational-concerns-datastore-size" class="level3">
<h3 class="anchored" data-anchor-id="operational-concerns-datastore-size">45. Operational Concerns (Datastore Size)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_45.png" class="img-fluid figure-img"></p>
<figcaption>Slide 45</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1720s">Timestamp: 28:40</a>)</p>
<p>A graph shows that as <strong>Datastore Size</strong> increases (X-axis), retrieval performance naturally degrades (Y-axis).</p>
<p>To combat this, the speaker strongly recommends using <strong>Metadata Filtering</strong>. “If you’re not using something like metadata… it’s going to be very tough.” Narrowing the search scope is essential for scaling to millions of documents.</p>
</section>
<section id="search-strategy-comparison" class="level3">
<h3 class="anchored" data-anchor-id="search-strategy-comparison">46. Search Strategy Comparison</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_46.png" class="img-fluid figure-img"></p>
<figcaption>Slide 46</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1762s">Timestamp: 29:22</a>)</p>
<p>The presentation pivots to the “exciting part”: <strong>Agentic RAG</strong>. A visual compares “Traditional RAG” (a linear path) with “Agentic RAG” (a winding, exploratory path).</p>
<p>This represents the shift from a “one-shot” retrieval attempt to an iterative system that can explore, backtrack, and reason.</p>
</section>
<section id="tools-use-reasoning" class="level3">
<h3 class="anchored" data-anchor-id="tools-use-reasoning">47. Tools use / Reasoning</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_47.png" class="img-fluid figure-img"></p>
<figcaption>Slide 47</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1780s">Timestamp: 29:40</a>)</p>
<p>Reasoning models (like o1 or DeepSeek R1) enable LLMs to use tools effectively. A code snippet shows an agent loop: query -&gt; generate -&gt; <strong>“Did it answer the question?”</strong></p>
<p>If the answer is no, the model can “rewrite the query… try to find that missing information, feed that back into the loop.” This self-correction is the core of Agentic RAG.</p>
</section>
<section id="agentic-rag-workflow" class="level3">
<h3 class="anchored" data-anchor-id="agentic-rag-workflow">48. Agentic RAG (Workflow)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_48.png" class="img-fluid figure-img"></p>
<figcaption>Slide 48</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1832s">Timestamp: 30:32</a>)</p>
<p>A flowchart details the Agentic RAG lifecycle. The model thinks through steps: “Oh, this is the query I need to make… based on those results… maybe we should do it a different way.”</p>
<p>This workflow allows the system to synthesize answers from multiple sources or clarify ambiguous queries automatically.</p>
</section>
<section id="tools-use-reasoning-detailed-example" class="level3">
<h3 class="anchored" data-anchor-id="tools-use-reasoning-detailed-example">49. Tools use / Reasoning (Detailed Example)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_49.png" class="img-fluid figure-img"></p>
<figcaption>Slide 49</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1835s">Timestamp: 30:35</a>)</p>
<p>A specific example of a complex query is shown. The agent breaks the problem down, calls tools, and iterates.</p>
<p>This demonstrates that the “Thinking” time is where the value is generated, allowing for a depth of research that a single retrieval pass cannot match.</p>
</section>
<section id="open-deep-research" class="level3">
<h3 class="anchored" data-anchor-id="open-deep-research">50. Open Deep Research</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_50.png" class="img-fluid figure-img"></p>
<figcaption>Slide 50</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1862s">Timestamp: 31:02</a>)</p>
<p>Shah references <strong>“Open Deep Research”</strong> by LangChain, an open-source framework where sub-agents go out, perform research, and report back.</p>
<p>This is a specific category of Agentic RAG focused on generating comprehensive reports rather than quick answers.</p>
</section>
<section id="deepresearch-bench" class="level3">
<h3 class="anchored" data-anchor-id="deepresearch-bench">51. DeepResearch Bench</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_51.png" class="img-fluid figure-img"></p>
<figcaption>Slide 51</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1890s">Timestamp: 31:30</a>)</p>
<p>A leaderboard for <strong>DeepResearch Bench</strong> is shown, testing models on “100 PhD level research tasks.”</p>
<p>The speaker warns that this approach “can get very expensive.” Solving a single complex query might cost significant money due to the number of tokens and iterative steps required.</p>
</section>
<section id="westlaw-ai-deep-research" class="level3">
<h3 class="anchored" data-anchor-id="westlaw-ai-deep-research">52. Westlaw AI Deep Research</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_52.png" class="img-fluid figure-img"></p>
<figcaption>Slide 52</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1915s">Timestamp: 31:55</a>)</p>
<p>A real-world application is highlighted: <strong>Westlaw AI</strong>. In the legal field, thoroughness is worth the latency and cost.</p>
<p>This proves that Agentic RAG isn’t just a toy; it is being commercialized in high-value verticals where accuracy is paramount.</p>
</section>
<section id="agentic-rag-self-rag" class="level3">
<h3 class="anchored" data-anchor-id="agentic-rag-self-rag">53. Agentic RAG (Self-RAG)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_53.png" class="img-fluid figure-img"></p>
<figcaption>Slide 53</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1931s">Timestamp: 32:11</a>)</p>
<p>The concept of <strong>Self-RAG</strong> is introduced, emphasizing the “Reflection” step. The model critiques its own retrieved documents and generation quality.</p>
<p>Shah notes that this isn’t brand new, but has become practical due to better reasoning models.</p>
</section>
<section id="agentic-rag-langchain-reddit" class="level3">
<h3 class="anchored" data-anchor-id="agentic-rag-langchain-reddit">54. Agentic RAG (LangChain Reddit)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_54.png" class="img-fluid figure-img"></p>
<figcaption>Slide 54</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=2044s">Timestamp: 34:04</a>)</p>
<p>A Reddit post is shown where a developer discusses building a self-reflection RAG system. This highlights the community’s active experimentation with these loops.</p>
</section>
<section id="agentic-rag-efficiency-concerns" class="level3">
<h3 class="anchored" data-anchor-id="agentic-rag-efficiency-concerns">55. Agentic RAG (Efficiency Concerns)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_55.png" class="img-fluid figure-img"></p>
<figcaption>Slide 55</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=2055s">Timestamp: 34:15</a>)</p>
<p>The discussion turns to the “Rub”: <strong>Inefficiency</strong>. Agentic loops can be slow and wasteful, re-retrieving data unnecessarily.</p>
<p>This sets up the trade-off conversation again: Is the extra time and compute worth the accuracy gain?</p>
</section>
<section id="research-bright" class="level3">
<h3 class="anchored" data-anchor-id="research-bright">56. Research: BRIGHT</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_56.png" class="img-fluid figure-img"></p>
<figcaption>Slide 56</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1931s">Timestamp: 32:11</a>)</p>
<p><em>Note: The speaker introduces the BRIGHT benchmark around 32:11, slightly out of slide order in the transcript flow, but connects it here.</em></p>
<p><strong>BRIGHT</strong> is a benchmark specifically designed for <strong>Retrieval Reasoning</strong>. Unlike standard benchmarks that test keyword matching, BRIGHT tests questions that require thinking, logic, and multi-step deduction to find the correct document.</p>
</section>
<section id="bright-1-diver" class="level3">
<h3 class="anchored" data-anchor-id="bright-1-diver">57. BRIGHT #1: DIVER</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_57.png" class="img-fluid figure-img"></p>
<figcaption>Slide 57</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=1968s">Timestamp: 32:48</a>)</p>
<p>The top-performing system on BRIGHT is <strong>DIVER</strong>. The diagram shows it uses the exact components discussed earlier: Chunking, Retrieving, and Reranking, but wrapped in an iterative loop.</p>
<p>Shah points out, “It probably doesn’t look that crazy to you if you’re used to RAG.” The innovation is in the process, not necessarily a magical new model architecture.</p>
</section>
<section id="bright-1-diver-llm-instructions" class="level3">
<h3 class="anchored" data-anchor-id="bright-1-diver-llm-instructions">58. BRIGHT #1: DIVER (LLM Instructions)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_58.png" class="img-fluid figure-img"></p>
<figcaption>Slide 58</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=2011s">Timestamp: 33:31</a>)</p>
<p>The specific prompts used in DIVER are shown. The system asks the LLM: “Given a query… what do you think would be possibly helpful to do?”</p>
<p>This <strong>Query Expansion</strong> allows the system to generate new search terms that the user didn’t think of, bridging the semantic gap through reasoning.</p>
</section>
<section id="agentic-rag-on-wixqa" class="level3">
<h3 class="anchored" data-anchor-id="agentic-rag-on-wixqa">59. Agentic RAG on WixQA</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_59.png" class="img-fluid figure-img"></p>
<figcaption>Slide 59</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=2076s">Timestamp: 34:36</a>)</p>
<p>Shah shares his own experiment results on the <strong>WixQA</strong> dataset (technical support). * <strong>One Shot RAG:</strong> 5 seconds latency, <strong>76%</strong> Factuality. * <strong>Agentic RAG:</strong> Slower latency, <strong>93%</strong> Factuality.</p>
<p>This massive jump in accuracy (0.76 to 0.93) is the key takeaway. “That has a ton of implications.” It suggests that the limitation of RAG often isn’t the data, but the lack of reasoning applied to the retrieval process.</p>
</section>
<section id="rethink-your-assumptions" class="level3">
<h3 class="anchored" data-anchor-id="rethink-your-assumptions">60. Rethink your Assumptions</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_60.png" class="img-fluid figure-img"></p>
<figcaption>Slide 60</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=2230s">Timestamp: 37:10</a>)</p>
<p><strong>This is the climax of the technical argument.</strong> A graph from the BRIGHT paper shows that <strong>BM25 (lexical search)</strong> combined with an Agentic loop (GPT-4) outperforms advanced embedding models (Qwen).</p>
<p>“This is crazy,” Shah exclaims. Because the LLM can rewrite queries into many variations, it mitigates BM25’s weakness (synonyms). This implies you might not need complex vector databases if you have a smart agent.</p>
</section>
<section id="agentic-rag-with-bm25" class="level3">
<h3 class="anchored" data-anchor-id="agentic-rag-with-bm25">61. Agentic RAG with BM25</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_61.png" class="img-fluid figure-img"></p>
<figcaption>Slide 61</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=2300s">Timestamp: 38:20</a>)</p>
<p>Shah validates the paper’s finding with his own internal data (Financial 10Ks). <strong>Agentic RAG with BM25</strong> performed nearly as well as Agentic RAG with Embeddings.</p>
<p>He suggests a radical possibility: “I could throw all that away [vector DBs]… just stick this in a text-only database and use BM25.”</p>
</section>
<section id="agentic-rag-for-code-search" class="level3">
<h3 class="anchored" data-anchor-id="agentic-rag-for-code-search">62. Agentic RAG for Code Search</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_62.png" class="img-fluid figure-img"></p>
<figcaption>Slide 62</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=2386s">Timestamp: 39:46</a>)</p>
<p>He connects this finding to <strong>Claude Code</strong>, which uses a lexical approach (like <code>grep</code>) rather than vectors for code search.</p>
<p>Since code doesn’t have the same semantic ambiguity as natural language, and agents can iterate rapidly, lexical search is proving to be superior for coding assistants.</p>
</section>
<section id="combine-retrieval-approaches" class="level3">
<h3 class="anchored" data-anchor-id="combine-retrieval-approaches">63. Combine Retrieval Approaches</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_63.png" class="img-fluid figure-img"></p>
<figcaption>Slide 63</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=2415s">Timestamp: 40:15</a>)</p>
<p>A <strong>DoorDash</strong> case study illustrates a two-tier guardrail system. They use simple text similarity first (fast/cheap). If that fails or is uncertain, they kick it to an LLM (slow/expensive).</p>
<p>This “Tiered” approach optimizes the trade-off between cost and accuracy in production.</p>
</section>
<section id="hands-on-agentic-rag-smolagents" class="level3">
<h3 class="anchored" data-anchor-id="hands-on-agentic-rag-smolagents">64. Hands on: Agentic RAG (Smolagents)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_64.png" class="img-fluid figure-img"></p>
<figcaption>Slide 64</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=2467s">Timestamp: 41:07</a>)</p>
<p>The speaker points to <strong>Smolagents</strong>, a Hugging Face library, as a way to get hands-on with these concepts. A Colab notebook is provided for the audience to build their own agentic retrieval loops.</p>
</section>
<section id="solutions-for-a-rag-solution" class="level3">
<h3 class="anchored" data-anchor-id="solutions-for-a-rag-solution">65. Solutions for a RAG Solution</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_65.png" class="img-fluid figure-img"></p>
<figcaption>Slide 65</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=2478s">Timestamp: 41:18</a>)</p>
<p>Shah updates the “Problem Complexity” framework from the beginning of the talk with specific recommendations: * <strong>Low Latency (&lt;5s):</strong> Use BM25 or Static Embeddings. * <strong>High Cost of Mistake:</strong> Add a Reranker. * <strong>Complex Multi-hop:</strong> Use Agentic RAG.</p>
</section>
<section id="retriever-checklist" class="level3">
<h3 class="anchored" data-anchor-id="retriever-checklist">66. Retriever Checklist</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_66.png" class="img-fluid figure-img"></p>
<figcaption>Slide 66</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=2512s">Timestamp: 41:52</a>)</p>
<p>A final checklist summarizes the retrieval hierarchy: 1. <strong>Keyword/BM25</strong> (The baseline). 2. <strong>Semantic Search</strong> (The standard). 3. <strong>Agentic/Reasoning</strong> (The problem solver).</p>
<p>This provides the audience with a mental menu to choose from based on their specific constraints.</p>
</section>
<section id="rag-as-a-system-retrieval-with-instruction-following-reranker" class="level3">
<h3 class="anchored" data-anchor-id="rag-as-a-system-retrieval-with-instruction-following-reranker">67. RAG as a system (Retrieval with Instruction Following Reranker)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_67.png" class="img-fluid figure-img"></p>
<figcaption>Slide 67</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=2520s">Timestamp: 42:00</a>)</p>
<p>The system diagram is shown one last time, updated to include the <strong>Instruction Following Reranker</strong> in the retrieval box, solidifying the modern RAG architecture.</p>
</section>
<section id="rag---generation" class="level3">
<h3 class="anchored" data-anchor-id="rag---generation">68. RAG - Generation</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_68.png" class="img-fluid figure-img"></p>
<figcaption>Slide 68</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=2530s">Timestamp: 42:10</a>)</p>
<p><em>Note: The speaker concludes the talk at 42:10, stating “I’m going to end it here.” Slides 68-70 regarding the Generation stage were included in the deck but skipped in the video recording due to time constraints.</em></p>
<p>This slide would have covered the final stage of RAG: generating the answer. The focus here is typically on reducing hallucinations and ensuring the tone matches the user’s needs.</p>
</section>
<section id="rag---generation-model-selection" class="level3">
<h3 class="anchored" data-anchor-id="rag---generation-model-selection">69. RAG - Generation (Model Selection)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_69.png" class="img-fluid figure-img"></p>
<figcaption>Slide 69</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=2530s">Timestamp: 42:10</a>)</p>
<p><em>Skipped in video.</em> This slide illustrates the choice of LLM for generation (e.g., GPT-4 vs Llama 3 vs Claude). The choice depends on the “Cost/Latency budget” and specific domain requirements.</p>
</section>
<section id="chunking-approaches" class="level3">
<h3 class="anchored" data-anchor-id="chunking-approaches">70. Chunking approaches</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_70.png" class="img-fluid figure-img"></p>
<figcaption>Slide 70</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=2530s">Timestamp: 42:10</a>)</p>
<p><em>Skipped in video.</em> This slide compares <strong>Original Chunking</strong> (cutting text at fixed intervals) with <strong>Contextual Chunking</strong> (adding a summary prefix to every chunk). Contextual chunking significantly improves retrieval because every chunk carries the context of the parent document.</p>
</section>
<section id="title-slide-duplicate" class="level3">
<h3 class="anchored" data-anchor-id="title-slide-duplicate">71. Title Slide (Duplicate)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/rag-talk/slide_71.png" class="img-fluid figure-img"></p>
<figcaption>Slide 71</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/AS_HlJbJjH8&amp;t=2530s">Timestamp: 42:10</a>)</p>
<p>The presentation concludes with the title slide. Rajiv Shah thanks the audience, encouraging them to think about trade-offs rather than just chasing the latest models. “Hopefully I’ve given you a sense of thinking about these trade-offs… thank you all.”</p>
<hr>
<p><em>This annotated presentation was generated from the talk using AI-assisted tools. Each slide includes timestamps and detailed explanations.</em></p>


</section>
</section>

 ]]></description>
  <category>RAG</category>
  <category>AI</category>
  <category>Retrieval</category>
  <category>Agentic</category>
  <category>Annotated Talk</category>
  <guid>https://rajivshah.com/blog/rag-agentic-world.html</guid>
  <pubDate>Mon, 27 Oct 2025 05:00:00 GMT</pubDate>
  <media:content url="https://rajivshah.com/blog/images/rag-talk/slide_1.png" medium="image" type="image/png"/>
</item>
<item>
  <title>My Picks for the Best AI/ML News Sources</title>
  <dc:creator>Rajiv Shah</dc:creator>
  <link>https://rajivshah.com/blog/data-science-news-sources.html</link>
  <description><![CDATA[ 






<p>Want to know the best places for the latest data science, AI, ML news? Here’s my filtered news feed (Updated February 2026).</p>
<p>AI/ML is too huge to try and capture all the news sources. I try to keep a diverse range to keep my pulse on what’s new and useful. Most great data scientists I know probably spend an hour a week reading the news, so please don’t focus on consuming content (build something!). I’m using data science, AI news, and ML news interchangeably here, because I like them all.</p>
<section id="email-newsletters" class="level2">
<h2 class="anchored" data-anchor-id="email-newsletters">Email Newsletters</h2>
<p>Email newsletters are my favorite source of high-quality information (but YouTube is a strong contender for deep dives).</p>
<ul>
<li><a href="https://www.deeplearning.ai/the-batch/">The Batch from DeepLearning.AI</a></li>
<li><a href="https://jack-clark.net/">Jack Clark newsletter</a></li>
<li>Medium (various authors)</li>
<li><a href="https://buttondown.com/ainews">AI News</a> (focuses on biggest names)</li>
<li><a href="https://gradientflow.substack.com/">Ben Lorica</a></li>
<li><a href="https://changelog.com/">Changelog</a></li>
<li><a href="https://www.interconnects.ai/">Interconnects</a></li>
<li><a href="https://gaiinsights.substack.com/">GAI Insights</a></li>
<li><a href="https://peterwildeford.substack.com/">Power Law (Peter Wildeford)</a></li>
</ul>
</section>
<section id="reddit" class="level2">
<h2 class="anchored" data-anchor-id="reddit">Reddit</h2>
<p>Reddit is a source I browse once or twice a week.</p>
<ul>
<li><a href="https://www.reddit.com/r/LocalLLaMA/">Local Llama</a> — running LLMs locally, good mix of content</li>
<li><a href="https://www.reddit.com/r/MachineLearning/">Machine Learning</a></li>
<li><a href="https://www.reddit.com/r/datascience/">DataScience</a> — Not useful for news or learning, but good for getting the vibes of the community</li>
</ul>
</section>
<section id="podcasts" class="level2">
<h2 class="anchored" data-anchor-id="podcasts">Podcasts</h2>
<ul>
<li><a href="https://changelog.com/practicalai">Practical AI</a></li>
<li><a href="https://twimlai.com/">TWIML AI Podcast</a></li>
<li><a href="https://dataskeptic.com/">Data Skeptics</a> - Rotates between categories, so it’s hit or miss</li>
<li><a href="https://www.latent.space/podcast">Latent Space</a> - Less practical in the last six months</li>
<li><a href="https://podcast.thisdayinai.com/">This Day in AI</a></li>
</ul>
</section>
<section id="youtube" class="level2">
<h2 class="anchored" data-anchor-id="youtube">YouTube</h2>
<ul>
<li><a href="https://www.youtube.com/c/YannicKilcher?app=desktop">Yannic</a></li>
<li><a href="https://www.youtube.com/@srush_nlp">Sasha Rush</a> - This has been quiet</li>
<li><a href="https://www.youtube.com/user/PyDataTV?app=desktop">PyData conferences</a></li>
<li><a href="https://www.youtube.com/@CohereAI">Cohere</a></li>
<li><a href="https://www.youtube.com/@StanfordMLSysSeminars">Stanford MLSys Seminars</a></li>
<li><a href="https://www.youtube.com/@Weaviate">Weaviate</a></li>
<li><a href="https://www.youtube.com/@ODSCAI">ODSC</a></li>
</ul>
</section>
<section id="social-media" class="level2">
<h2 class="anchored" data-anchor-id="social-media">Social Media</h2>
<p><strong>TikTok</strong> is something I check regularly, but I don’t get much new data science news from it, <a href="https://www.tiktok.com/@rajistics">except my channel</a>.</p>
<p><strong>Twitter/X</strong> is dropping quickly in my usage and interaction. However, most of my breaking news comes from X. The list of people I actively follow is <a href="https://twitter.com/i/lists/230771382">publicly available here</a>. My one suggestion on X/LinkedIn is DAIR.AI, for their Top AI Papers of the Week.</p>
<p><strong><a href="https://www.linkedin.com/">LinkedIn</a></strong> is useful. The value of LinkedIn comes from my existing network, which posts very useful information.</p>
</section>
<section id="my-curated-feeds" class="level2">
<h2 class="anchored" data-anchor-id="my-curated-feeds">My Curated Feeds</h2>
<p>If you want to know my favorite stories (and often fodder for my videos or posts):</p>
<ul>
<li>Follow my Instagram ML News feed</li>
<li>Check out my <a href="https://www.reddit.com/r/rajistics/">subreddit rajistics</a></li>
</ul>


</section>

 ]]></description>
  <category>ai</category>
  <category>machine learning</category>
  <category>data science</category>
  <category>resources</category>
  <guid>https://rajivshah.com/blog/data-science-news-sources.html</guid>
  <pubDate>Mon, 15 Sep 2025 05:00:00 GMT</pubDate>
</item>
<item>
  <title>Understanding Sparse Matrices through Interactive Visualizations</title>
  <link>https://rajivshah.com/blog/sparsedataframe.html</link>
  <description><![CDATA[ 






<p>When working with machine learning models, preparing data properly is essential. One common preprocessing technique is one-hot encoding, which transforms categorical data into a format algorithms can understand. However, this transformation often creates sparse matrices - dataframes where most values are zero.</p>
<section id="basic-one-hot-encoding" class="level2">
<h2 class="anchored" data-anchor-id="basic-one-hot-encoding">Basic One-Hot Encoding</h2>
<p>The first animation illustrates the fundamental concept of one-hot encoding. This transformation converts a single categorical column (like “city”) into multiple binary columns, where each column represents one possible category value.</p>
<p><a href="./sparse-1.html">View the basic one-hot encoding animation</a></p>
<p>This visualization walks through the transformation step-by-step:</p>
<ol type="1">
<li>Starting with the original dataset containing categorical values</li>
<li>Adding binary indicator columns for each category</li>
<li>Showing how the dataset becomes wider but sparse (mostly filled with zeros)</li>
<li>Demonstrating how the original categorical column becomes redundant</li>
</ol>
<p>In traditional tabular data processing, we often don’t see this sparsity visually. The animation makes it clear how one-hot encoding dramatically changes the structure of our data.</p>
</section>
<section id="the-curse-of-dimensionality" class="level2">
<h2 class="anchored" data-anchor-id="the-curse-of-dimensionality">The Curse of Dimensionality</h2>
<p>The second animation takes the concept further by demonstrating what happens with high-cardinality categorical features - those with many possible values.</p>
<p><a href="./sparse-2.html">View the curse of dimensionality animation</a></p>
<p>This more advanced visualization shows how one-hot encoding can lead to the “curse of dimensionality”:</p>
<ol type="1">
<li>Starting with a modest 4-column dataset</li>
<li>Expanding to over 150 columns when encoding a categorical feature with many values</li>
<li>Creating an extremely sparse matrix where 99% of values are zeros</li>
<li>Illustrating the practical challenges this presents for machine learning</li>
</ol>
</section>
<section id="why-it-matters" class="level2">
<h2 class="anchored" data-anchor-id="why-it-matters">Why It Matters</h2>
<p>Understanding the sparsity that results from one-hot encoding is crucial for several reasons:</p>
<ul>
<li><strong>Memory usage</strong>: Sparse matrices can consume excessive memory if not properly handled</li>
<li><strong>Computational efficiency</strong>: Processing mostly-zero matrices is inefficient</li>
<li><strong>Model performance</strong>: Many algorithms struggle with extremely sparse data</li>
<li><strong>Feature selection</strong>: With hundreds of binary columns, feature selection becomes critical</li>
</ul>
<p>For high-cardinality features, consider alternatives like feature hashing, target encoding, or embeddings to avoid the dimensionality explosion shown in the second animation.</p>
<p>These visualizations help build intuition about what’s happening “under the hood” when we preprocess data - something that’s often hidden when we use high-level libraries that handle these transformations automatically.</p>
<p>Related videos: <a href="https://youtube.com/shorts/M3AhBvaSSvY">Sparsity in AI</a> or <a href="https://www.tiktok.com/@rajistics/video/7470265095654739230">Curse of Dimensionality</a> or <a href="https://www.instagram.com/reel/DGlW4V1A0Ri/">Reality of Models</a></p>


</section>

 ]]></description>
  <category>Sparse</category>
  <category>Dataframes</category>
  <category>Machine Learning</category>
  <category>Data Preprocessing</category>
  <guid>https://rajivshah.com/blog/sparsedataframe.html</guid>
  <pubDate>Fri, 07 Mar 2025 06:00:00 GMT</pubDate>
  <media:content url="https://rajivshah.com/blog/images/sparsedataframe.png" medium="image" type="image/png"/>
</item>
<item>
  <title>Feature Selection Methods and Feature Selection Curves</title>
  <link>https://rajivshah.com/blog/Feature_Selection.html</link>
  <description><![CDATA[ 






<p>How to Select the Best Features for Machine Learning!</p>
<p>Let’s deep dive into several feature selection techniques and help you figure out when to use each one. The notebook includes two data sources: the MNIST dataset and the Madelon dataset. The MNIST dataset is a collection of 28x28 pixel images of handwritten digits. The Madelon dataset is a synthetic dataset that you can control.</p>
<p>The notebook uses the following feature selection techniques:</p>
<ul>
<li>F-statistic</li>
<li>Mutual Information</li>
<li>Logistic Regression</li>
<li>Logistics Regression with Lasso (L1) Regularization</li>
<li>Feature Importance</li>
<li>Boruta</li>
<li>MRMR (Minimum Redundancy Maximum Relevance)</li>
<li>Recursive Feature Elimination</li>
<li>Feature importance rank ensembling (FIRE)</li>
</ul>
<p>To help visualize the feature selection process, the notebook includes a feature selection curve. The feature selection curve plots the number of features against the accuracy of the model. This helps you understand how many features you need to achieve a certain level of accuracy.</p>
<p>This notebook is based on the following articles:<br>
<a href="https://towardsdatascience.com/feature-selection-how-to-throw-away-95-of-your-data-and-get-95-accuracy-ad41ca016877">Feature Selection: How to Throw Away 95% of Your Features and Get 95% Accuracy</a> and the associated <a href="https://github.com/smazzanti/mrmr/blob/15cb0983a3e53114bbab94a9629e404c1d42f5d8/notebooks/mnist.ipynb">notebook</a>.</p>
<p>A companion video to this can be found on my youtube site, <span class="citation" data-cites="rajistics">@rajistics</span>: <a href="https://youtu.be/jm7TYGv32zs">Feature Selection Methods and Feature Selection Curves</a>, it’s about 15 minutes and gives more context to the notebook.</p>
<p>This blog post can be found at http://bit.ly/raj_fs or https://rajivshah.com/blog/Feature_Selection.html</p>
<div id="cell-2" class="cell" data-execution_count="2">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> warnings<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> warnings.filterwarnings(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ignore"</span>)</span></code></pre></div></div>
</div>
<div id="cell-3" class="cell" data-execution_count="1">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> os</span>
<span id="cb2-2">os.environ[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"KERAS_BACKEND"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"torch"</span></span>
<span id="cb2-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> keras</span>
<span id="cb2-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb2-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib.pyplot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> plt</span></code></pre></div></div>
</div>
<section id="import-data" class="level1">
<h1>Import data</h1>
<section id="a.-mnist-very-visual-dataset" class="level3">
<h3 class="anchored" data-anchor-id="a.-mnist-very-visual-dataset">A. MNIST (Very visual dataset)</h3>
<div id="cell-6" class="cell" data-execution_count="25">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> keras.datasets <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> mnist</span>
<span id="cb3-2"></span>
<span id="cb3-3">(X_train, y_train), (X_test, y_test) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> mnist.load_data()</span>
<span id="cb3-4"></span>
<span id="cb3-5">X_train <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X_train.reshape(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">60000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">28</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">28</span>)</span>
<span id="cb3-6">X_test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X_test.reshape(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">28</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">28</span>)</span></code></pre></div></div>
</div>
<div id="cell-7" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(X_train[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>, :]))</span></code></pre></div></div>
</div>
<div id="cell-8" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1">plt.imshow(X_train[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>, :].reshape(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">28</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">28</span>), cmap <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'binary'</span>, vmin <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, vmax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">255</span>)</span>
<span id="cb5-2">plt.xticks([])</span>
<span id="cb5-3">plt.yticks([])</span>
<span id="cb5-4">plt.savefig(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sample_image.png'</span>)</span></code></pre></div></div>
</div>
<div id="cell-9" class="cell" data-execution_count="6">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1">X_train <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame(X_train)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Assuming X_mnist is the MNIST feature data</span></span>
<span id="cb6-2">y_train <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.Series(y_train)   </span>
<span id="cb6-3">X_test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame(X_test)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Assuming X_mnist is the MNIST feature data</span></span>
<span id="cb6-4">y_test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.Series(y_test)  </span></code></pre></div></div>
</div>
</section>
<section id="b.-madelon-very-high-dimensional-dataset-that-you-control" class="level3">
<h3 class="anchored" data-anchor-id="b.-madelon-very-high-dimensional-dataset-that-you-control">B. Madelon (Very high-dimensional dataset that you control)</h3>
<p>If you run the following cells, Madelon will be the dataset you use. If you want to use MNIST, you should skip the following cells.</p>
<p>Madelon is a favorite of mine because you know which features are carrying the signal and which ones are noise. In this case, the first 5 features will be informative. I often modify Madelon to include other types of noisy features, interactions, correlations, and then use this dataset to test various machine learning techniques. Since I know what the true signal is, this is very effective at helping me guage the effectiveness of these methods.</p>
<div id="cell-11" class="cell" data-execution_count="3">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb7-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb7-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.datasets <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> make_classification</span>
<span id="cb7-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.model_selection <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> train_test_split</span>
<span id="cb7-5"></span>
<span id="cb7-6">X, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_classification(n_samples<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10000</span>,</span>
<span id="cb7-7">                           n_features<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">40</span>,</span>
<span id="cb7-8">                           n_informative<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,</span>
<span id="cb7-9">                           n_classes<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>,</span>
<span id="cb7-10">                           n_redundant <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>,</span>
<span id="cb7-11">                           random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,</span>
<span id="cb7-12">                           flip_y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>,</span>
<span id="cb7-13">                           class_sep <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>,</span>
<span id="cb7-14">                           n_clusters_per_class<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>,</span>
<span id="cb7-15">                           shuffle<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb7-16"></span>
<span id="cb7-17">X_df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame(X, columns<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>i<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(X.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])])</span>
<span id="cb7-18">y_df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.Series(y, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'target'</span>)</span>
<span id="cb7-19"></span>
<span id="cb7-20">X_train, X_test, y_train, y_test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train_test_split(X_df, y_df, test_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>, random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">42</span>)</span>
<span id="cb7-21"></span>
<span id="cb7-22">X_train.columns <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X_train.columns.astype(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb7-23">X_test.columns <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X_test.columns.astype(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span></code></pre></div></div>
</div>
<div id="cell-12" class="cell" data-execution_count="4">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1">X_train</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="4">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">0</th>
<th data-quarto-table-cell-role="th">1</th>
<th data-quarto-table-cell-role="th">2</th>
<th data-quarto-table-cell-role="th">3</th>
<th data-quarto-table-cell-role="th">4</th>
<th data-quarto-table-cell-role="th">5</th>
<th data-quarto-table-cell-role="th">6</th>
<th data-quarto-table-cell-role="th">7</th>
<th data-quarto-table-cell-role="th">8</th>
<th data-quarto-table-cell-role="th">9</th>
<th data-quarto-table-cell-role="th">...</th>
<th data-quarto-table-cell-role="th">30</th>
<th data-quarto-table-cell-role="th">31</th>
<th data-quarto-table-cell-role="th">32</th>
<th data-quarto-table-cell-role="th">33</th>
<th data-quarto-table-cell-role="th">34</th>
<th data-quarto-table-cell-role="th">35</th>
<th data-quarto-table-cell-role="th">36</th>
<th data-quarto-table-cell-role="th">37</th>
<th data-quarto-table-cell-role="th">38</th>
<th data-quarto-table-cell-role="th">39</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">9254</th>
<td>0.926694</td>
<td>-1.773357</td>
<td>0.172527</td>
<td>0.217298</td>
<td>-1.733944</td>
<td>0.319415</td>
<td>0.185012</td>
<td>1.907097</td>
<td>-0.649596</td>
<td>0.121499</td>
<td>...</td>
<td>0.706101</td>
<td>-0.880984</td>
<td>-0.980460</td>
<td>-1.040219</td>
<td>-1.495820</td>
<td>2.793184</td>
<td>0.206932</td>
<td>-0.357897</td>
<td>-1.633463</td>
<td>-0.358298</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">1561</th>
<td>1.692096</td>
<td>-1.412235</td>
<td>1.294343</td>
<td>-0.672776</td>
<td>-0.576808</td>
<td>1.088448</td>
<td>0.446408</td>
<td>1.081032</td>
<td>-0.355654</td>
<td>-0.940438</td>
<td>...</td>
<td>-2.334601</td>
<td>-0.446046</td>
<td>-0.577543</td>
<td>-0.692218</td>
<td>-0.311946</td>
<td>0.329447</td>
<td>-1.312834</td>
<td>0.339797</td>
<td>-0.291047</td>
<td>0.931088</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">1670</th>
<td>-0.721183</td>
<td>-1.430124</td>
<td>0.776395</td>
<td>0.226875</td>
<td>-1.209252</td>
<td>-0.458278</td>
<td>-1.011414</td>
<td>1.682210</td>
<td>-1.048116</td>
<td>-1.783993</td>
<td>...</td>
<td>1.440083</td>
<td>-0.666334</td>
<td>-0.909174</td>
<td>0.377606</td>
<td>1.303421</td>
<td>-0.655019</td>
<td>0.003210</td>
<td>-0.802838</td>
<td>-1.305648</td>
<td>-0.170390</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">6087</th>
<td>1.429094</td>
<td>1.539467</td>
<td>0.230706</td>
<td>0.256132</td>
<td>-0.478975</td>
<td>-1.493286</td>
<td>1.738055</td>
<td>0.888900</td>
<td>0.164039</td>
<td>-2.488486</td>
<td>...</td>
<td>1.454662</td>
<td>0.493267</td>
<td>0.079875</td>
<td>-1.390000</td>
<td>1.330840</td>
<td>0.212113</td>
<td>1.955695</td>
<td>-0.567808</td>
<td>-0.883676</td>
<td>-0.472567</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">6669</th>
<td>0.207305</td>
<td>0.600810</td>
<td>0.477484</td>
<td>-0.784978</td>
<td>-0.651178</td>
<td>-0.362503</td>
<td>1.032674</td>
<td>0.369245</td>
<td>-0.659173</td>
<td>-1.210180</td>
<td>...</td>
<td>-3.207426</td>
<td>0.423698</td>
<td>1.538654</td>
<td>-0.856037</td>
<td>0.343482</td>
<td>-0.119711</td>
<td>-0.355270</td>
<td>0.724913</td>
<td>1.702261</td>
<td>-1.597048</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">5734</th>
<td>0.484935</td>
<td>0.695846</td>
<td>1.481478</td>
<td>0.223780</td>
<td>-1.012330</td>
<td>-2.116814</td>
<td>0.613437</td>
<td>2.080326</td>
<td>-0.730603</td>
<td>1.408916</td>
<td>...</td>
<td>0.328236</td>
<td>0.413631</td>
<td>0.337577</td>
<td>-0.747556</td>
<td>0.020008</td>
<td>-0.202360</td>
<td>1.484470</td>
<td>-0.465176</td>
<td>1.391591</td>
<td>0.294199</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">5191</th>
<td>-0.715630</td>
<td>-1.895183</td>
<td>-1.091845</td>
<td>-0.579646</td>
<td>-0.474871</td>
<td>2.217163</td>
<td>-0.666726</td>
<td>-0.763180</td>
<td>0.261672</td>
<td>1.570425</td>
<td>...</td>
<td>0.280017</td>
<td>0.836381</td>
<td>0.115396</td>
<td>-0.044588</td>
<td>0.516398</td>
<td>-0.630678</td>
<td>0.755802</td>
<td>0.016894</td>
<td>0.183862</td>
<td>-0.401010</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">5390</th>
<td>1.091513</td>
<td>-1.606975</td>
<td>-1.678945</td>
<td>-0.706068</td>
<td>-0.547585</td>
<td>2.905629</td>
<td>0.827132</td>
<td>-0.997257</td>
<td>-0.983815</td>
<td>-0.609981</td>
<td>...</td>
<td>0.001994</td>
<td>0.411601</td>
<td>-0.809740</td>
<td>-0.163079</td>
<td>0.020689</td>
<td>-0.731637</td>
<td>-0.154384</td>
<td>0.599125</td>
<td>1.094542</td>
<td>-1.020837</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">860</th>
<td>-1.696127</td>
<td>1.277728</td>
<td>0.043566</td>
<td>0.659020</td>
<td>0.537680</td>
<td>-1.793380</td>
<td>-0.878325</td>
<td>-0.168647</td>
<td>-0.712758</td>
<td>2.285642</td>
<td>...</td>
<td>0.233483</td>
<td>0.551602</td>
<td>0.139338</td>
<td>0.805881</td>
<td>-0.628342</td>
<td>0.532257</td>
<td>-0.107130</td>
<td>1.449110</td>
<td>-0.499819</td>
<td>-0.826810</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">7270</th>
<td>2.113646</td>
<td>-2.644226</td>
<td>-0.097455</td>
<td>-0.645787</td>
<td>-1.748579</td>
<td>2.288712</td>
<td>0.920350</td>
<td>1.266990</td>
<td>0.545140</td>
<td>-0.652207</td>
<td>...</td>
<td>-0.048207</td>
<td>1.331188</td>
<td>-0.683416</td>
<td>-0.014705</td>
<td>0.185842</td>
<td>1.100054</td>
<td>-0.244144</td>
<td>-1.529671</td>
<td>-1.914142</td>
<td>0.072284</td>
</tr>
</tbody>
</table>

<p>8000 rows × 40 columns</p>
</div>
</div>
</div>
</section>
</section>
<section id="feature-selection" class="level1">
<h1>Feature selection</h1>
<p>Let’s go through a couple of different methods for feature selection</p>
</section>
<section id="feature-selection-methods-comparison" class="level1">
<h1>Feature Selection Methods Comparison</h1>
<table class="caption-top table">
<colgroup>
<col style="width: 13%">
<col style="width: 9%">
<col style="width: 9%">
<col style="width: 26%">
<col style="width: 40%">
</colgroup>
<thead>
<tr class="header">
<th>Method</th>
<th>Pros</th>
<th>Cons</th>
<th>Best Used When</th>
<th>Computational Complexity</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>F-statistic</td>
<td>- Fast and simple<br>- Works well for linear relationships<br>- Easy to interpret</td>
<td>- Assumes linear relationship<br>- Considers features independently<br>- May miss interaction effects</td>
<td>- Initial screening<br>- Linear problems<br>- Need interpretable results</td>
<td>O(n)</td>
</tr>
<tr class="even">
<td>Mutual Information</td>
<td>- Captures non-linear relationships<br>- No assumptions about distribution</td>
<td>- Can be computationally intensive<br>- May overfit with small samples</td>
<td>- Non-linear relationships<br>- Complex interactions</td>
<td>O(n log n)</td>
</tr>
<tr class="odd">
<td>Logistic Regression</td>
<td>- Fast for high-dimensional data<br>- Provides feature coefficients</td>
<td>- Assumes linear decision boundary<br>- Sensitive to correlated features</td>
<td>- Binary classification<br>- Need interpretable coefficients</td>
<td>O(n^2)</td>
</tr>
<tr class="even">
<td>Lasso (L1)</td>
<td>- Fast for high-dimensional data<br>- Automatically does feature selection</td>
<td>- May struggle with correlated features<br>- Can be sensitive to outliers</td>
<td>- High-dimensional data<br>- Need sparse solutions</td>
<td>O(n^2)</td>
</tr>
<tr class="odd">
<td>LightGBM</td>
<td>- Handles non-linear relationships<br>- Considers feature interactions</td>
<td>- Can be computationally intensive<br>- May overfit with small samples</td>
<td>- Complex relationships<br>- Large datasets</td>
<td>O(n log n)</td>
</tr>
<tr class="even">
<td>MRMR</td>
<td>- Considers feature redundancy<br>- Good for correlated features</td>
<td>- Can be computationally intensive<br>- May struggle with non-linear relationships</td>
<td>- Datasets with correlated features<br>- Need diverse feature set</td>
<td>O(n^2)</td>
</tr>
<tr class="odd">
<td>RFE</td>
<td>- Considers feature interactions<br>- Can capture complex relationships</td>
<td>- Computationally intensive<br>- Can be unstable with small changes in data</td>
<td>- When computational cost isn’t an issue<br>- Need very precise feature selection</td>
<td>O(n^2 log n)</td>
</tr>
</tbody>
</table>
<section id="f-statistic" class="level3">
<h3 class="anchored" data-anchor-id="f-statistic">1. F-statistic</h3>
<p>f_classif relies on the Analysis of Variance (ANOVA) F-statistic to evaluate the relationship between each feature and the target variable. It tests whether the mean values of the target variable differ significantly across the groups defined by each feature. The higher the F-value, the more likely it is that the feature discriminates between different classes. Assumes a linear relationship between the features and the target, and that the target is categorical.</p>
<p>See: https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection</p>
<div id="cell-16" class="cell" data-execution_count="5">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.feature_selection <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> f_classif</span>
<span id="cb9-2"></span>
<span id="cb9-3">f <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> f_classif(X_train, y_train)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb9-4">f</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="5">
<pre><code>array([2.40810399e+02, 6.64390517e+01, 5.42587843e+01, 1.71099993e-01,
       4.27226542e+01, 9.64735668e+01, 4.72136845e+01, 5.71846457e+01,
       1.27834979e+00, 1.42650284e+00, 3.40020422e-01, 4.08232233e-01,
       1.30120819e-01, 3.36734714e+00, 4.55665866e-01, 3.62300510e-01,
       1.57485437e-01, 6.93572687e-02, 1.64816305e+00, 2.91782944e+00,
       1.24026934e+00, 9.32533895e-01, 7.07099908e-01, 1.87544216e+00,
       1.10130690e+00, 3.54044700e-01, 1.15417945e+00, 2.59156089e-01,
       7.45820681e-01, 7.75403854e-01, 1.35835715e-01, 3.34985292e+00,
       8.36576456e-02, 5.15026453e-02, 4.33788709e-01, 3.12140721e-01,
       3.55118575e+00, 7.37076241e+00, 1.17274619e+00, 4.36532461e+00])</code></pre>
</div>
</div>
</section>
<section id="mutual-information" class="level3">
<h3 class="anchored" data-anchor-id="mutual-information">2. Mutual information</h3>
<p>mutual_info_classif uses the concept of mutual information, which measures the dependency between each feature and the target variable. Mutual information quantifies the amount of information gained about the target by knowing the value of the feature. It captures both linear and non-linear dependencies.</p>
<p>See: https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection</p>
<div id="cell-18" class="cell" data-execution_count="6">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.feature_selection <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> mutual_info_classif</span>
<span id="cb11-2"></span>
<span id="cb11-3">mi <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> mutual_info_classif(X_train, y_train)</span>
<span id="cb11-4">mi</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="6">
<pre><code>array([0.05548077, 0.01340894, 0.04092611, 0.00099355, 0.00516085,
       0.06952759, 0.02752171, 0.04043945, 0.00083431, 0.        ,
       0.        , 0.        , 0.        , 0.01146672, 0.        ,
       0.00298292, 0.        , 0.        , 0.0068234 , 0.00961735,
       0.00935105, 0.00586449, 0.00561433, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.00876386, 0.00049355,
       0.        , 0.00478042, 0.00487523, 0.00268551, 0.00118896,
       0.        , 0.00583264, 0.        , 0.        , 0.        ])</code></pre>
</div>
</div>
</section>
<section id="logistic-regression" class="level3">
<h3 class="anchored" data-anchor-id="logistic-regression">3. Logistic regression</h3>
<p>Logistic regression is a linear model for classification rather than regression. It is used to estimate the probability that an instance belongs to a particular class. The coefficients of the model can be used to determine feature importance.</p>
<div id="cell-20" class="cell" data-execution_count="7">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.linear_model <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> LogisticRegression</span>
<span id="cb13-2"></span>
<span id="cb13-3">logreg <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> LogisticRegression().fit(X_train, y_train)</span>
<span id="cb13-4">logreg.coef_</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="7">
<pre><code>array([[-0.08669758,  0.06064912, -0.04063592,  0.00704782,  0.05079063,
        -0.02366404, -0.03246196, -0.07613473, -0.02216845,  0.00405511,
         0.00890925, -0.01289459, -0.00446435,  0.00330386,  0.01287983,
        -0.00599418,  0.00494212, -0.00385749,  0.03721175, -0.03849129,
        -0.0032749 , -0.01534965, -0.00908255,  0.02016669, -0.00175419,
         0.00918138,  0.01908963,  0.01357562, -0.01804835,  0.00266229,
         0.00180036, -0.00624841, -0.00351875,  0.00131487, -0.01573702,
        -0.00485053, -0.03744854,  0.05047984,  0.0174477 , -0.00658735],
       [ 0.22615114, -0.09497312, -0.04870109,  0.08462927, -0.00434942,
         0.0845374 ,  0.05186658,  0.05312074,  0.01121456,  0.0271935 ,
        -0.00027397,  0.01737956,  0.0080553 , -0.04770429, -0.00379082,
        -0.00963934,  0.00274429,  0.00688572, -0.03159925,  0.02737958,
        -0.0220797 , -0.00789503, -0.01134082, -0.02724488, -0.02432216,
         0.00543336, -0.02498671, -0.00919296,  0.01043275,  0.01323123,
        -0.00059363, -0.02632891, -0.00159478, -0.00753653,  0.01697607,
         0.01247471,  0.04553467, -0.05091659, -0.02028138,  0.03615948],
       [-0.13945356,  0.03432401,  0.089337  , -0.09167709, -0.04644121,
        -0.06087336, -0.01940462,  0.02301399,  0.0109539 , -0.03124862,
        -0.00863527, -0.00448497, -0.00359095,  0.04440043, -0.00908901,
         0.01563352, -0.00768641, -0.00302823, -0.0056125 ,  0.01111171,
         0.02535461,  0.02324468,  0.02042338,  0.0070782 ,  0.02607635,
        -0.01461474,  0.00589708, -0.00438266,  0.0076156 , -0.01589352,
        -0.00120673,  0.03257732,  0.00511353,  0.00622166, -0.00123904,
        -0.00762418, -0.00808614,  0.00043674,  0.00283369, -0.02957212]])</code></pre>
</div>
</div>
</section>
<section id="feature-selection-with-l1-lasso-regularization" class="level3">
<h3 class="anchored" data-anchor-id="feature-selection-with-l1-lasso-regularization">3.5 Feature Selection with L1 (Lasso) Regularization</h3>
<p>Lasso is a great feature selection technique. It’s fast, easy to use, and works well with high-dimensional data. I have often used it when very wide data, greater than 100 features (or even &gt;10k features) to help parse down the number of features. It uses L1 regularization to penalize the absolute size of the coefficients. This leads to sparse solutions, where many of the coefficients are zero. The features with non-zero coefficients are selected. Lasso can be used for feature selection by setting the regularization parameter to a value that results in a sparse solution. The regularization parameter can be tuned using cross-validation.</p>
<p>Try modifying the regularization parameter to see how it affects the number of features selected.</p>
<div id="cell-22" class="cell" data-execution_count="8">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.linear_model <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> LogisticRegression</span>
<span id="cb15-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.preprocessing <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> StandardScaler</span>
<span id="cb15-3"></span>
<span id="cb15-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Step 4: Standardize the features</span></span>
<span id="cb15-5">scaler <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> StandardScaler()</span>
<span id="cb15-6">X_train_scaled <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> scaler.fit_transform(X_train)</span>
<span id="cb15-7">X_test_scaled <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> scaler.transform(X_test)</span>
<span id="cb15-8"></span>
<span id="cb15-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Step 5: Apply Logistic Regression with L1 regularization for feature selection</span></span>
<span id="cb15-10">logregL1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> LogisticRegression(penalty<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'l1'</span>, solver<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'saga'</span>, multi_class<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'multinomial'</span>, C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># C is inverse of regularization strength</span></span>
<span id="cb15-11">logregL1.fit(X_train_scaled, y_train)</span>
<span id="cb15-12"></span>
<span id="cb15-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Step 6: Get the selected features using the original DataFrame 'X'</span></span>
<span id="cb15-14">selected_features <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X_train.columns[(logregL1.coef_ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">any</span>(axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)]</span>
<span id="cb15-15"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Selected features: "</span>, selected_features)</span>
<span id="cb15-16"></span>
<span id="cb15-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Optional: Check the coefficients</span></span>
<span id="cb15-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#print("Logistic Regression coefficients: ", logreg.coef_)</span></span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Selected features:  Index([0, 1, 2, 4, 5, 13, 19, 36, 37], dtype='int64')</code></pre>
</div>
</div>
</section>
<section id="lightgbm" class="level3">
<h3 class="anchored" data-anchor-id="lightgbm">4. LightGBM</h3>
<p>LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed for efficiency and can handle large datasets. It can be used to determine feature importance.</p>
<div id="cell-24" class="cell" data-execution_count="9">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> lightgbm <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> LGBMClassifier</span>
<span id="cb17-2"></span>
<span id="cb17-3">lgbm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> LGBMClassifier(</span>
<span id="cb17-4">    objective <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'multiclass'</span>,</span>
<span id="cb17-5">    metric <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'multi_logloss'</span>,</span>
<span id="cb17-6">    importance_type <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'gain'</span></span>
<span id="cb17-7">).fit(X_train, y_train)</span>
<span id="cb17-8"></span>
<span id="cb17-9">lgbm.feature_importances_</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001096 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 10200
[LightGBM] [Info] Number of data points in the train set: 8000, number of used features: 40
[LightGBM] [Info] Start training from score -1.108284
[LightGBM] [Info] Start training from score -1.094371
[LightGBM] [Info] Start training from score -1.093252</code></pre>
</div>
<div class="cell-output cell-output-display" data-execution_count="9">
<pre><code>array([7030.24782425, 4036.90086633, 7197.39050466, 5339.74033117,
       2017.793881  , 8113.67321557, 3905.26838762, 4383.72206521,
        550.03625131,  531.02187729,  472.12624365,  678.28547454,
        526.57803982,  586.75292325,  552.92263156,  433.08122051,
        552.18078488,  534.15573859,  566.58704376,  630.01932001,
        635.43262064,  636.71719581,  560.95981157,  586.52648336,
        553.7755444 ,  563.13766581,  547.99060541,  523.01072556,
        676.76891661,  616.94216621,  634.27822083,  489.91742009,
        680.71264285,  620.95509708,  618.59545827,  418.22946733,
        568.21738124,  592.29172051,  553.43465978,  655.03435677])</code></pre>
</div>
</div>
</section>
<section id="boruta" class="level3">
<h3 class="anchored" data-anchor-id="boruta">5. Boruta</h3>
<p>Boruta is an all-relevant feature selection method. It is an extension of the Random Forest algorithm. It selects all features that are relevant to the target variable, rather than just the most important features.</p>
<div id="cell-26" class="cell" data-execution_count="10">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb20" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb20-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">### long training time &gt; 1 hour</span></span>
<span id="cb20-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> boruta <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> BorutaPy</span>
<span id="cb20-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.ensemble <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> RandomForestClassifier</span>
<span id="cb20-4"></span>
<span id="cb20-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#boruta = BorutaPy(</span></span>
<span id="cb20-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#    estimator = RandomForestClassifier(max_depth = 5), </span></span>
<span id="cb20-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#    n_estimators = 'auto', </span></span>
<span id="cb20-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#    max_iter = 100</span></span>
<span id="cb20-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#).fit(X_train, y_train)</span></span></code></pre></div></div>
</div>
</section>
<section id="mrmr" class="level3">
<h3 class="anchored" data-anchor-id="mrmr">6. MRMR</h3>
<p>MRMR (Minimum Redundancy Maximum Relevance) is a feature selection method that selects features based on their relevance to the target variable and their redundancy with other features. It aims to select features that are highly correlated with the target variable but uncorrelated with each other.</p>
<p>There are several implementations of MRMR available in Python: https://github.com/smazzanti/mrmr https://koaning.github.io/scikit-lego/api/feature-selection/ https://github.com/AutoViML/featurewiz?tab=readme-ov-file</p>
<div id="cell-28" class="cell" data-execution_count="11">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb21-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb21-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> mrmr <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> mrmr_classif</span>
<span id="cb21-3"></span>
<span id="cb21-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#mrmr = mrmr_classif(pd.DataFrame(X_train), pd.Series(y_train), K = 784)</span></span>
<span id="cb21-5">mrmr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> mrmr_classif(pd.DataFrame(X_train), pd.Series(y_train), K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X_train.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span></code></pre></div></div>
<div class="cell-output cell-output-stderr">
<pre><code>100%|██████████| 40/40 [00:03&lt;00:00, 11.45it/s]</code></pre>
</div>
</div>
</section>
<section id="store-results" class="level3">
<h3 class="anchored" data-anchor-id="store-results">Store results</h3>
<div id="cell-30" class="cell" data-execution_count="12">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb23" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb23-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb23-2"></span>
<span id="cb23-3">ranking <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame(index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(X_train.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]))</span>
<span id="cb23-4"></span>
<span id="cb23-5">ranking[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'f'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.Series(f, index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ranking.index).fillna(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>).rank(ascending <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb23-6">ranking[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mi'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.Series(mi, index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ranking.index).fillna(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>).rank(ascending <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb23-7">ranking[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'logreg'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.Series(np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>(logreg.coef_).mean(axis <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>), index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ranking.index).rank(ascending <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb23-8">ranking[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'lasso'</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.Series(np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>(logregL1.coef_).mean(axis <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>), index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ranking.index).rank(ascending <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb23-9">ranking[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'lightgbm'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.Series(lgbm.feature_importances_, index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ranking.index).rank(ascending <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb23-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#ranking['boruta'] = boruta.support_* 1 + boruta.support_weak_ * 2 + (1 - boruta.support_ - boruta.support_weak_) * X_train.shape[1]</span></span>
<span id="cb23-11">ranking[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mrmr'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.Series(</span>
<span id="cb23-12">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(mrmr) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> [<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(mrmr) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (X_train.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(mrmr)),</span>
<span id="cb23-13">    index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> mrmr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(ranking.index) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(mrmr))</span>
<span id="cb23-14">).sort_index()</span>
<span id="cb23-15">ranking[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'lasso'</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.Series(np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>(logregL1.coef_).mean(axis <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>), index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ranking.index).rank(ascending <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb23-16"></span>
<span id="cb23-17"></span>
<span id="cb23-18">ranking <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ranking.replace(to_replace <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ranking.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>(), value <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X_train.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb23-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#ranking.to_csv('ranking.csv', index = False)</span></span></code></pre></div></div>
</div>
</section>
</section>
<section id="evaluate-feature-selection-methods" class="level1">
<h1>Evaluate Feature Selection Methods</h1>
<p>Let’s see how the predictive performance of the model changes as we add more features. We will use the top features selected by each method to train a model and evaluate its performance.</p>
<div id="cell-32" class="cell" data-execution_count="13">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb24" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb24-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> catboost <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> CatBoostClassifier</span>
<span id="cb24-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.metrics <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> accuracy_score, roc_auc_score</span>
<span id="cb24-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## 22 minutes for mnist</span></span>
<span id="cb24-4"></span>
<span id="cb24-5">algos <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'f'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mi'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'logreg'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'lasso'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'lightgbm'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mrmr'</span>] <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##Feel free to change this</span></span>
<span id="cb24-6">ks <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">40</span>] </span>
<span id="cb24-7">ks <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">40</span>] <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##Feel free to change this</span></span>
<span id="cb24-8"></span>
<span id="cb24-9">accuracy <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame(index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ks, columns <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> algos)</span>
<span id="cb24-10">roc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame(index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ks, columns <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> algos)</span>
<span id="cb24-11"></span>
<span id="cb24-12"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> algo <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> algos:</span>
<span id="cb24-13">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span> (algo)</span>
<span id="cb24-14">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> k <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> ks:</span>
<span id="cb24-15">    </span>
<span id="cb24-16">        cols <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ranking[algo].sort_values().head(k).index.to_list()</span>
<span id="cb24-17">                </span>
<span id="cb24-18">        clf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> CatBoostClassifier().fit(</span>
<span id="cb24-19">            X_train[cols], y_train,</span>
<span id="cb24-20">            eval_set<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(X_test[cols], y_test),</span>
<span id="cb24-21">            early_stopping_rounds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>,</span>
<span id="cb24-22">            verbose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span></span>
<span id="cb24-23">        )</span>
<span id="cb24-24">                </span>
<span id="cb24-25">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Store accuracy</span></span>
<span id="cb24-26">        accuracy.loc[k, algo] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> accuracy_score(</span>
<span id="cb24-27">            y_true<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>y_test, y_pred<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>clf.predict(X_test[cols])</span>
<span id="cb24-28">        )</span>
<span id="cb24-29">        </span>
<span id="cb24-30">accuracy.to_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'accuracyMC.csv'</span>, index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb24-31">roc.to_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'rocMC.csv'</span>, index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>f
mi
logreg
lasso
lightgbm
mrmr</code></pre>
</div>
</div>
</section>
<section id="feature-selection-curves" class="level1">
<h1>Feature Selection Curves</h1>
<p>Let’s visualize how the model’s accuracy changes as a function of feature selection.<br>
Notice how for Madelon, there is an optimal number of features. Too many features that are noise end up reducing the performance of the model</p>
<div id="cell-34" class="cell" data-execution_count="14">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb26" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb26-1"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> algo, label, color <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(</span>
<span id="cb26-2">    [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mrmr'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'f'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mi'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'lightgbm'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'logreg'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lasso"</span>],</span>
<span id="cb26-3">    [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'MRMR'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'F-statistic'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Mutual Info'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'LightGBM'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Log Reg'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Log Reg (L1/Lasso)'</span>],</span>
<span id="cb26-4">    [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'orangered'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'blue'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'yellow'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'lime'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'black'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'pink'</span>]):</span>
<span id="cb26-5">        plt.plot(accuracy.index, accuracy[algo], label <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> label, color <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> color, lw <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)</span>
<span id="cb26-6"></span>
<span id="cb26-7">plt.plot(</span>
<span id="cb26-8">    [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">40</span>], [pd.Series(y_test).value_counts(normalize <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>).iloc[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, </span>
<span id="cb26-9">    label <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'[Random]'</span>, color <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'grey'</span>, ls <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'--'</span>, lw <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span></span>
<span id="cb26-10">)</span>
<span id="cb26-11"></span>
<span id="cb26-12">plt.legend(fontsize <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">13</span>, loc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'center left'</span>, bbox_to_anchor <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>))</span>
<span id="cb26-13">plt.grid()</span>
<span id="cb26-14">plt.xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Number of features'</span>, fontsize <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">13</span>)</span>
<span id="cb26-15">plt.ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Accuracy'</span>, fontsize <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">13</span>)</span>
<span id="cb26-16">plt.savefig(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'accuracy.png'</span>, dpi <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">300</span>, bbox_inches <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'tight'</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://rajivshah.com/blog/Feature_Selection_files/figure-html/cell-19-output-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
</section>
<section id="feature-selection-combined-with-feature-elimination-techniques" class="level1">
<h1>Feature Selection combined with Feature Elimination Techniques</h1>
<section id="recursive-feature-elimination" class="level3">
<h3 class="anchored" data-anchor-id="recursive-feature-elimination">Recursive Feature Elimination</h3>
<p>One of the best methods for feature selection consistently is feature importance with LightGBM. We can refine and improve this in several ways: Recursive Feature Elimination uses the same feature importance method, but then iteratively removes the least important features. This iterative process requires training a model several times, but can provide an improvement in feature selection. This method is a version of Recursive Feature Elimination that is widely accepted as a best practice for feature selection.</p>
<div id="cell-37" class="cell" data-execution_count="15">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb27" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb27-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.feature_selection <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> RFE</span>
<span id="cb27-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> xgboost <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> XGBClassifier</span>
<span id="cb27-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.svm <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> SVR</span>
<span id="cb27-4">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> XGBClassifier(random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">42</span>)</span>
<span id="cb27-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#model = SVR(kernel="linear")  #took 3 minutes, ok results but not as good as XGB on Madelon</span></span>
<span id="cb27-6">rfe <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> RFE(model, n_features_to_select<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>, step<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb27-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#rfe = RFE(model, n_features_to_select=50, step=200,verbose=2) #for MNIST</span></span>
<span id="cb27-8">rfe.fit(X_train, y_train)</span>
<span id="cb27-9">rfe.support_</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="15">
<pre><code>array([ True,  True,  True,  True, False,  True,  True,  True, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False])</code></pre>
</div>
</div>
<div id="cell-38" class="cell" data-execution_count="16">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb29" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb29-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Train an XGBoost model with the selected features from RFE</span></span>
<span id="cb29-2">model_selected <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> XGBClassifier(random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">42</span>)</span>
<span id="cb29-3">X_selected <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X_train.loc[:, rfe.support_]</span>
<span id="cb29-4">model_selected.fit(X_selected, y_train)</span>
<span id="cb29-5"></span>
<span id="cb29-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Make predictions on the test set with both models</span></span>
<span id="cb29-7">y_pred_selected <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model_selected.predict(X_test.loc[:, rfe.support_])</span>
<span id="cb29-8">accuracy_selected <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> accuracy_score(y_test, y_pred_selected)</span>
<span id="cb29-9"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Accuracy with selected features: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>accuracy_selected<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Accuracy with selected features: 0.7135</code></pre>
</div>
</div>
<p>Compare with perfect on Madelon</p>
<div id="cell-40" class="cell" data-execution_count="17">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb31" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb31-1">perfect <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [ <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,  <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,  <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,  <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,  <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,  <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,  <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>,</span>
<span id="cb31-2">       <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>,</span>
<span id="cb31-3">       <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>,</span>
<span id="cb31-4">       <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>,</span>
<span id="cb31-5">       <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>]</span></code></pre></div></div>
</div>
<div id="cell-41" class="cell" data-execution_count="18">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb32" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb32-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Train an XGBoost model with the selected features from RFE</span></span>
<span id="cb32-2">model_selected <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> XGBClassifier(random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">42</span>)</span>
<span id="cb32-3">X_selected <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X_train.loc[:, perfect]</span>
<span id="cb32-4">model_selected.fit(X_selected, y_train)</span>
<span id="cb32-5"></span>
<span id="cb32-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Make predictions on the test set with both models</span></span>
<span id="cb32-7">y_pred_selected <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model_selected.predict(X_test.loc[:, perfect])</span>
<span id="cb32-8">accuracy_selected <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> accuracy_score(y_test, y_pred_selected)</span>
<span id="cb32-9"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Accuracy with selected features: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>accuracy_selected<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Accuracy with selected features: 0.7140</code></pre>
</div>
</div>
</section>
<section id="feature-elimination-with-fire" class="level3">
<h3 class="anchored" data-anchor-id="feature-elimination-with-fire">Feature Elimination with FIRE</h3>
<p>At DataRobot, we had a mighty AutoML engine that showed you how feature importance aggregated across different models (this is feature importance from four diverse models). <img src="https://docs.datarobot.com/en/docs/images/fire-2.png" class="img-fluid" alt="https://docs.datarobot.com/en/docs/images/fire-2.png"></p>
<p>You can use this variance as part of feature selection. It takes a lot more compute, but in our experiments, can perform even better feature selection. Read more about feature importance rank ensembling (FIRE) here - https://docs.datarobot.com/en/docs/api/accelerators/adv-approaches/fire.html and a code snippet is here - https://github.com/datarobot-community/examples-for-data-scientists/blob/master/Feature%20Lists%20Manipulation/Python/Advanced%20Feature%20Selection.ipynb</p>
</section>
<section id="featureviz" class="level3">
<h3 class="anchored" data-anchor-id="featureviz">FeatureViz</h3>
<p>Featureviz looks like a cool feature selection package, but I wasn’t able to get it to work. It’s worth checking out. add links</p>
</section>
</section>
<section id="other-great-feature-selection-resources" class="level1">
<h1>Other great feature selection resources:</h1>
<p>A classic dataset where many feature selection techniques have been applied is the <a href="https://www.kaggle.com/competitions/santander-customer-satisfaction">Kaggle Santader Customer Satisfaction</a> competition.</p>
<p><a href="https://www.kaggle.com/code/solegalli/feature-selection-with-feature-engine/notebook">Feature Selection with Feature Engine</a></p>
<p><a href="https://www.kaggle.com/code/adarshsng/extensive-advance-feature-selection-tutorial">Advance Feature Selection Tutorial</a></p>


</section>

 ]]></description>
  <category>featureselection</category>
  <category>MLOps</category>
  <guid>https://rajivshah.com/blog/Feature_Selection.html</guid>
  <pubDate>Tue, 08 Oct 2024 05:00:00 GMT</pubDate>
  <media:content url="https://rajivshah.com/blog/Feature_Selection_files/figure-html/cell-19-output-1.png" medium="image" type="image/png"/>
</item>
<item>
  <title>Interpretable Machine Learning Models Simply Explained</title>
  <link>https://rajivshah.com/blog/interpretable-ml-models.html</link>
  <description><![CDATA[ 






<section id="video" class="level2">
<h2 class="anchored" data-anchor-id="video">Video</h2>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/lx4SJOVtxI8" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<p>Watch the <a href="https://youtu.be/lx4SJOVtxI8">full video</a></p>
<hr>
</section>
<section id="annotated-presentation" class="level2">
<h2 class="anchored" data-anchor-id="annotated-presentation">Annotated Presentation</h2>
<p>Below is an annotated version of the presentation, with timestamped links to the relevant parts of the video for each slide.</p>
<p>Here is the annotated presentation for “Rules: A Simple &amp; Effective Machine Learning Approach” by Rajiv Shah.</p>
<section id="title-slide" class="level3">
<h3 class="anchored" data-anchor-id="title-slide">1. Title Slide</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_1.png" class="img-fluid figure-img"></p>
<figcaption>Slide 1</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=0s">Timestamp: 00:00:00</a>)</p>
<p>The presentation begins by introducing the core topic: <strong>Interpretable Models</strong> and the use of rules in machine learning. Rajiv Shah sets the stage by contrasting this talk with previous discussions on explainability (using tools to explain complex models). Instead, this session focuses on choosing models that are inherently easy to understand.</p>
<p>Shah expresses his interest in how machine learning helps us understand the world. He notes that while tools like SHAP or LIME help unpack complex models, there is immense value in approaching the problem differently: by selecting model architectures that are transparent by design.</p>
<p>The speaker invites the audience to view this not just as a technical lecture but as a discussion on the trade-offs between model complexity and interpretability, setting a collaborative tone for the presentation.</p>
</section>
<section id="table-of-contents" class="level3">
<h3 class="anchored" data-anchor-id="table-of-contents">2. Table of Contents</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_2.png" class="img-fluid figure-img"></p>
<figcaption>Slide 2</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=150s">Timestamp: 00:02:30</a>)</p>
<p>This slide outlines the roadmap for the presentation. Shah explains that he will begin with the “Big Picture” concepts—specifically the <strong>“Why?”</strong> and the <strong>“Baseline”</strong>—before diving into four specific technical approaches to rule-based modeling.</p>
<p>The four specific methods to be covered are <strong>Rulefit</strong>, <strong>GA2M</strong> (Generalized Additive Models with interactions), <strong>Rule Lists</strong>, and <strong>Scorecards</strong>. This structure moves from theoretical justification to practical application, comparing different algorithms that prioritize transparency.</p>
<p>Shah also mentions that a GitHub repository is available with code examples for everything shown, allowing the audience to reproduce the results for the tabular datasets discussed.</p>
</section>
<section id="section-1-why" class="level3">
<h3 class="anchored" data-anchor-id="section-1-why">3. Section 1: Why?</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_3.png" class="img-fluid figure-img"></p>
<figcaption>Slide 3</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=189s">Timestamp: 00:03:09</a>)</p>
<p>This section header introduces the fundamental question: <strong>Why do we want rules?</strong> The speaker moves past the obvious statement that “AI is important” to investigate the influences that drive data scientists toward complex, opaque models.</p>
<p>Shah prepares to discuss the cultural and competitive pressures in data science that prioritize raw accuracy over usability. This section serves as a critique of the “accuracy at all costs” mindset often found in the industry.</p>
</section>
<section id="mark-cuban-quote" class="level3">
<h3 class="anchored" data-anchor-id="mark-cuban-quote">4. Mark Cuban Quote</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_4.png" class="img-fluid figure-img"></p>
<figcaption>Slide 4</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=197s">Timestamp: 00:03:17</a>)</p>
<p>The slide features a quote from Mark Cuban: <em>“Artificial Intelligence, deep learning, machine learning — whatever you’re doing if you don’t understand it — learn it. Because otherwise you’re going to be a dinosaur within 3 years.”</em></p>
<p>Shah briefly references this as the “obligatory” acknowledgment of AI’s massive importance in the current landscape. It reinforces that while the field is moving fast, the <em>understanding</em> of these systems is paramount, which ties into the presentation’s focus on interpretability.</p>
</section>
<section id="influences-kaggle-academia" class="level3">
<h3 class="anchored" data-anchor-id="influences-kaggle-academia">5. Influences: Kaggle &amp; Academia</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_5.png" class="img-fluid figure-img"></p>
<figcaption>Slide 5</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=220s">Timestamp: 00:03:40</a>)</p>
<p>Shah identifies <strong>Kaggle competitions</strong> and academic research as two primary influences on data scientists. He notes that these platforms heavily incentivize accuracy above all else. For example, in the Zillow Prize, the difference between the top scores is minuscule, yet teams fight for that fraction of a percentage.</p>
<p>He argues that this environment trains data scientists to focus solely on improving metrics (like RMSE or AUC), often ignoring other critical trade-offs like model complexity, deployment difficulty, or explainability.</p>
<p>As he states, <em>“One of the byproducts of Kaggle is a very heavy focus on making sure you improve your models around accuracy… and that’s how you can get a conference paper.”</em> This sets up the problem of complexity creep.</p>
</section>
<section id="the-netflix-prize-winners" class="level3">
<h3 class="anchored" data-anchor-id="the-netflix-prize-winners">6. The Netflix Prize Winners</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_6.png" class="img-fluid figure-img"></p>
<figcaption>Slide 6</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=339s">Timestamp: 00:05:39</a>)</p>
<p>This slide shows the winners of the famous <strong>Netflix Prize</strong>, a competition held about 15 years ago where a team won $1 million for improving Netflix’s recommendation algorithm by 10%.</p>
<p>Shah uses this story to illustrate the peak of the “accuracy” mindset. The competition drew massive interest and drove innovation, but it also encouraged teams to prioritize the leaderboard score over the practicality of the solution.</p>
</section>
<section id="netflix-prize-progress-graph" class="level3">
<h3 class="anchored" data-anchor-id="netflix-prize-progress-graph">7. Netflix Prize Progress Graph</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_7.png" class="img-fluid figure-img"></p>
<figcaption>Slide 7</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=374s">Timestamp: 00:06:14</a>)</p>
<p>The graph displays the progress of teams over time during the Netflix competition. Shah points out that after an initial period of rapid improvement using standard algorithms, progress plateaued.</p>
<p>To break through these plateaus, teams began using <strong>Ensembling</strong>—combining multiple models together. The winning solution was an ensemble of <strong>107 different models</strong>. Shah emphasizes that while this strategy is powerful for eking out the last bit of performance, it creates immense complexity.</p>
</section>
<section id="the-engineering-cost-of-complexity" class="level3">
<h3 class="anchored" data-anchor-id="the-engineering-cost-of-complexity">8. The Engineering Cost of Complexity</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_8.png" class="img-fluid figure-img"></p>
<figcaption>Slide 8</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=459s">Timestamp: 00:07:39</a>)</p>
<p>This slide reveals the ironic conclusion of the Netflix Prize: the winning model was <strong>never implemented</strong>. The engineering costs to deploy an ensemble of 107 models were simply too high compared to the marginal gain in accuracy.</p>
<p>Shah uses this as a cautionary tale: <em>“If your focus is on accuracy… it drives you down towards this complexity… but often you end up with these complex models [that] are often very difficult to implement.”</em> This highlights the disconnect between competitive data science and enterprise reality.</p>
</section>
<section id="understandable-white-box-model-clear-2" class="level3">
<h3 class="anchored" data-anchor-id="understandable-white-box-model-clear-2">9. Understandable White Box Model (CLEAR-2)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_9.png" class="img-fluid figure-img"></p>
<figcaption>Slide 9</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=484s">Timestamp: 00:08:04</a>)</p>
<p>Shah transitions to the alternative: <strong>Interpretable Models</strong>. This slide shows a simple linear model (CLEAR-2) with only two features. This is a classic “White Box” model where the relationship between inputs and outputs is transparent.</p>
<p>The speaker contrasts this with the “Black Box” nature of complex ensembles. He argues that if you cannot understand what is going on inside a model, you cannot effectively debug it, nor can you easily convince stakeholders to trust it.</p>
</section>
<section id="complex-white-box-model-clear-8" class="level3">
<h3 class="anchored" data-anchor-id="complex-white-box-model-clear-8">10. Complex White Box Model (CLEAR-8)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_10.png" class="img-fluid figure-img"></p>
<figcaption>Slide 10</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=711s">Timestamp: 00:11:51</a>)</p>
<p>This slide presents a linear model with eight features (CLEAR-8). While technically still a “White Box” model, Shah implies that as feature counts grow, true understandability diminishes.</p>
<p>He touches on this concept later in the “Caveats” section, noting that even linear models can become confusing if there is <strong>multicollinearity</strong> (features moving in the same direction). Just because we can see the coefficients doesn’t mean the model is intuitively “explainable” to a human if the variables interact in complex, non-obvious ways.</p>
</section>
<section id="easy-to-understand-decision-tree" class="level3">
<h3 class="anchored" data-anchor-id="easy-to-understand-decision-tree">11. Easy to Understand Decision Tree</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_11.png" class="img-fluid figure-img"></p>
<figcaption>Slide 11</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=1155s">Timestamp: 00:19:15</a>)</p>
<p>Here, a simple Decision Tree is presented. Shah connects this to the history of rule-based learning, noting that early research found that keeping decision trees “short and stumpy” made them very easy for humans to explain.</p>
<p>This visual represents the ideal of interpretability: a clear path of logic (e.g., “If X is less than 3, go left”) that leads to a prediction. This is the foundation for the <strong>Rulefit</strong> method discussed later.</p>
</section>
<section id="too-much-to-comprehend" class="level3">
<h3 class="anchored" data-anchor-id="too-much-to-comprehend">12. Too Much to Comprehend</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_12.png" class="img-fluid figure-img"></p>
<figcaption>Slide 12</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=466s">Timestamp: 00:07:46</a>)</p>
<p>Contrasting the previous slide, this image shows a chaotic forest of decision trees. This represents modern ensemble methods like Random Forests or Gradient Boosted Machines.</p>
<p>Shah uses this visual to reinforce the point that while ensembles offer <strong>“Better Performance,”</strong> the sheer number of decision paths makes them <strong>“too much to Comprehend.”</strong> You lose the ability to trace the “why” behind a specific prediction, turning the system into a Black Box.</p>
</section>
<section id="pedro-domingos-tweet" class="level3">
<h3 class="anchored" data-anchor-id="pedro-domingos-tweet">13. Pedro Domingos Tweet</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_13.png" class="img-fluid figure-img"></p>
<figcaption>Slide 13</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=502s">Timestamp: 00:08:22</a>)</p>
<p>Shah acknowledges the counter-argument by showing a tweet from Pedro Domingos, a prominent machine learning researcher, who suggests that demanding explainability limits the potential of AI.</p>
<p>Shah respectfully disagrees with this stance in the context of enterprise data science. He argues that in the real world, <em>“If you don’t understand what’s going on in your model, it’s hard for you to debug it, it’s hard to convince somebody else to adopt your model.”</em> Practicality and trust often outweigh raw theoretical power.</p>
</section>
<section id="benefits-of-interpretable-models" class="level3">
<h3 class="anchored" data-anchor-id="benefits-of-interpretable-models">14. Benefits of Interpretable Models</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_14.png" class="img-fluid figure-img"></p>
<figcaption>Slide 14</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=556s">Timestamp: 00:09:16</a>)</p>
<p>This slide summarizes the key benefits of using interpretable models, referencing the work of <strong>Cynthia Rudin</strong>. The main advantages are: 1. <strong>Debugging:</strong> It is easier to spot weird behaviors. 2. <strong>Trust:</strong> Stakeholders and legal/risk teams are more likely to approve the model. 3. <strong>Deployment:</strong> These models can often be deployed as simple SQL queries or basic code, avoiding the need for heavy GPU infrastructure.</p>
<p>Shah emphasizes the deployment aspect: <em>“You don’t have to go out and get a GPU… you can actually deploy directly within a database.”</em></p>
</section>
<section id="caveats-of-interpretable-models" class="level3">
<h3 class="anchored" data-anchor-id="caveats-of-interpretable-models">15. Caveats of Interpretable Models</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_15.png" class="img-fluid figure-img"></p>
<figcaption>Slide 15</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=660s">Timestamp: 00:11:00</a>)</p>
<p>Shah provides a necessary reality check. He clarifies that selecting an interpretable <em>algorithm</em> is only one part of the process. True interpretability depends on the entire data pipeline.</p>
<p>Issues like <strong>data labeling</strong>, <strong>feature engineering</strong>, and <strong>multicollinearity</strong> can render even a simple model confusing. For example, if two correlated features have opposite coefficients in a linear model, it becomes very difficult to explain the logic to a business user, even if the math is simple.</p>
</section>
<section id="section-2-baseline" class="level3">
<h3 class="anchored" data-anchor-id="section-2-baseline">16. Section 2: Baseline</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_16.png" class="img-fluid figure-img"></p>
<figcaption>Slide 16</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=735s">Timestamp: 00:12:15</a>)</p>
<p>This slide introduces the <strong>Baseline</strong> section. Shah advocates for always starting a project with a simple baseline model to establish a performance benchmark.</p>
<p>He shares an anecdote about people spending a year on a project only to be nearly matched by a simple model built in two hours. Establishing a baseline helps determine how much effort should be spent chasing incremental accuracy improvements.</p>
</section>
<section id="the-problem-uci-adult-dataset" class="level3">
<h3 class="anchored" data-anchor-id="the-problem-uci-adult-dataset">17. The Problem: UCI Adult Dataset</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_17.png" class="img-fluid figure-img"></p>
<figcaption>Slide 17</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=774s">Timestamp: 00:12:54</a>)</p>
<p>Shah introduces the dataset he will use for all examples in the talk: the <strong>UCI Adult Dataset</strong> (Census Income). The goal is a binary classification problem: predicting whether someone has a high or low income based on demographics.</p>
<p>He chooses this dataset because it represents typical enterprise tabular data: it has 30,000 rows, a mix of numerical and categorical features, and contains collinearity and interaction effects. This makes it a realistic test bed for the models he will demonstrate.</p>
</section>
<section id="baseline-models" class="level3">
<h3 class="anchored" data-anchor-id="baseline-models">18. Baseline Models</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_18.png" class="img-fluid figure-img"></p>
<figcaption>Slide 18</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=833s">Timestamp: 00:13:53</a>)</p>
<p>The speaker outlines the three baseline models he built to bracket the performance possibilities: 1. <strong>Logistic Regression:</strong> The standard statistical approach. 2. <strong>AutoML (H2O):</strong> A stacked ensemble of many models (Neural Networks, GBMs, etc.) representing the “maximum” possible performance. 3. <strong>OneR:</strong> A very simple rule-based algorithm.</p>
<p>These baselines provide the context for evaluating the interpretable models later.</p>
</section>
<section id="baseline-models-plot" class="level3">
<h3 class="anchored" data-anchor-id="baseline-models-plot">19. Baseline Models Plot</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_19.png" class="img-fluid figure-img"></p>
<figcaption>Slide 19</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=852s">Timestamp: 00:14:12</a>)</p>
<p>This plot visualizes <strong>Complexity vs.&nbsp;AUC</strong> (Area Under the Curve). * <strong>OneR</strong> is at the bottom (AUC ~0.60) with very low complexity. * <strong>Logistic Regression</strong> is in the middle (AUC ~0.91). * <strong>Stacked Ensemble</strong> is at the top (AUC ~0.93) but with massive complexity.</p>
<p>Shah notes that while the Stacked Ensemble wins on accuracy, the Logistic Regression is surprisingly close, highlighting that simpler models can often be “good enough.”</p>
</section>
<section id="oner-example" class="level3">
<h3 class="anchored" data-anchor-id="oner-example">20. OneR Example</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_20.png" class="img-fluid figure-img"></p>
<figcaption>Slide 20</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=917s">Timestamp: 00:15:17</a>)</p>
<p>Shah explains the <strong>OneR</strong> (One Rule) algorithm. This method finds the single feature in the dataset that best predicts the target. In the example shown (Iris dataset), utilizing just “Petal Width” classifies 96% of instances correctly.</p>
<p>He suggests OneR is a great way to detect <strong>Target Leakage</strong>—if one feature predicts the target perfectly, it might be “cheating.” It also sets the floor for performance; if a complex model can’t beat OneR, something is wrong.</p>
</section>
<section id="baseline-models-plot-recap" class="level3">
<h3 class="anchored" data-anchor-id="baseline-models-plot-recap">21. Baseline Models Plot (Recap)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_21.png" class="img-fluid figure-img"></p>
<figcaption>Slide 21</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=996s">Timestamp: 00:16:36</a>)</p>
<p>Returning to the complexity plot, Shah reiterates the performance gap. The AutoML model sets the “ceiling” at 0.93 AUC.</p>
<p>The goal for the rest of the presentation is to see where the interpretable models (Rulefit, GA2M, etc.) fall on this graph. Can they approach the 0.93 AUC of the ensemble without incurring the massive complexity penalty?</p>
</section>
<section id="section-3-rulefit" class="level3">
<h3 class="anchored" data-anchor-id="section-3-rulefit">22. Section 3: Rulefit</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_22.png" class="img-fluid figure-img"></p>
<figcaption>Slide 22</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=1098s">Timestamp: 00:18:18</a>)</p>
<p>This slide introduces the first major interpretable technique: <strong>Rulefit</strong>. Shah mentions familiarity with this from his time at Data Robot and notes that it is a powerful way to combine the benefits of trees and linear models.</p>
</section>
<section id="what-is-rulefit" class="level3">
<h3 class="anchored" data-anchor-id="what-is-rulefit">23. What is Rulefit?</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_23.png" class="img-fluid figure-img"></p>
<figcaption>Slide 23</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=1110s">Timestamp: 00:18:30</a>)</p>
<p><strong>Rulefit</strong> is an algorithm developed by Friedman and Popescu (2008). It works by: 1. Building a random forest of short, “stumpy” decision trees. 2. Extracting each path through the trees as a “Rule.” 3. Using these rules as binary features in a sparse linear model (Lasso).</p>
<p>This approach allows the model to capture interactions (via the trees) while maintaining the interpretability of a linear equation.</p>
</section>
<section id="h2o-rulefit-output" class="level3">
<h3 class="anchored" data-anchor-id="h2o-rulefit-output">24. H2O Rulefit Output</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_24.png" class="img-fluid figure-img"></p>
<figcaption>Slide 24</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=1339s">Timestamp: 00:22:19</a>)</p>
<p>Shah displays the output from the <strong>H2O Rulefit</strong> implementation. The model generates human-readable rules, such as: <em>“If Education &lt; 12 AND Capital Gain &lt; $7000, THEN Coefficient is negative.”</em></p>
<p>He notes that while the rules are readable, the raw output can look like “computer-ese.” However, it allows a data scientist to identify specific segments of the population (e.g., low education, low capital gain) that strongly drive the prediction.</p>
</section>
<section id="overlapping-rules" class="level3">
<h3 class="anchored" data-anchor-id="overlapping-rules">25. Overlapping Rules</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_25.png" class="img-fluid figure-img"></p>
<figcaption>Slide 25</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=1470s">Timestamp: 00:24:30</a>)</p>
<p>A key characteristic of Rulefit is that the rules <strong>overlap</strong>. A single data point might satisfy multiple rules simultaneously.</p>
<p>Shah points out that this adds a layer of complexity to interpretability. To understand a prediction, you have to sum up the coefficients of <em>all</em> the rules that apply to that person. This is different from a decision tree where you fall into exactly one leaf node.</p>
</section>
<section id="h2o-rulefit-with-linear-terms" class="level3">
<h3 class="anchored" data-anchor-id="h2o-rulefit-with-linear-terms">26. H2O Rulefit with Linear Terms</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_26.png" class="img-fluid figure-img"></p>
<figcaption>Slide 26</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=1555s">Timestamp: 00:25:55</a>)</p>
<p>One limitation of pure rules is handling continuous variables (like age or miles driven). Rules have to “bin” these variables (e.g., Age &lt; 30, Age 30-40).</p>
<p>Shah explains that H2O Rulefit solves this by including <strong>Linear Terms</strong>. The model can use rules for non-linear interactions <em>and</em> standard linear coefficients for continuous trends. This hybrid approach boosts the AUC significantly (up to 0.88 in this example) by capturing linear relationships more naturally.</p>
</section>
<section id="rulefit-results" class="level3">
<h3 class="anchored" data-anchor-id="rulefit-results">27. Rulefit Results</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_27.png" class="img-fluid figure-img"></p>
<figcaption>Slide 27</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=1624s">Timestamp: 00:27:04</a>)</p>
<p>This slide plots the performance of Rulefit models with varying numbers of rules. Shah demonstrates that by increasing the number of rules (complexity), the AUC climbs closer to the Stacked Ensemble.</p>
<p>He concludes that Rulefit is a versatile tool. You can tune the “dial” of complexity: fewer rules for more interpretability, or more rules for higher accuracy, often getting very competitive performance.</p>
</section>
<section id="section-4-ga2m" class="level3">
<h3 class="anchored" data-anchor-id="section-4-ga2m">28. Section 4: GA2M</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_28.png" class="img-fluid figure-img"></p>
<figcaption>Slide 28</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=1895s">Timestamp: 00:31:35</a>)</p>
<p>The presentation moves to the second technique: <strong>GA2M</strong> (Generalized Additive Models with pairwise interactions). Shah notes that while GAMs have existed for a while, modern implementations like Microsoft’s <strong>Explainable Boosting Machines (EBM)</strong> have made them much more accessible and powerful.</p>
</section>
<section id="what-is-ga2m" class="level3">
<h3 class="anchored" data-anchor-id="what-is-ga2m">29. What is GA2M?</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_29.png" class="img-fluid figure-img"></p>
<figcaption>Slide 29</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=1922s">Timestamp: 00:32:02</a>)</p>
<p><strong>GA2M</strong> is essentially a linear model where features are binned, and pairwise interactions are automatically detected. Shah highlights <strong>InterpretML</strong>, an open-source library from Microsoft that implements this via EBMs.</p>
<p>The model structure is additive: <img src="https://latex.codecogs.com/png.latex?g(E%5By%5D)%20=%20%5Cbeta_0%20+%20%5Csum%20f_j(x_j)%20+%20%5Csum%20f_%7Bij%7D(x_i,%20x_j)">. This means the final score is just the sum of individual feature scores and interaction scores, making it very transparent.</p>
</section>
<section id="ga2m-binning" class="level3">
<h3 class="anchored" data-anchor-id="ga2m-binning">30. GA2M Binning</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_30.png" class="img-fluid figure-img"></p>
<figcaption>Slide 30</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=1962s">Timestamp: 00:32:42</a>)</p>
<p>Shah explains how GA2M handles numerical data. Instead of a single slope coefficient (like in logistic regression), the model <strong>bins</strong> the continuous feature (e.g., dividing “criminal history” into ranges).</p>
<p>Each bin gets its own coefficient. This allows the model to learn non-linear patterns (e.g., risk might go up, then down, then up again as a variable increases) while remaining easy to inspect.</p>
</section>
<section id="interactions-in-ga2m" class="level3">
<h3 class="anchored" data-anchor-id="interactions-in-ga2m">31. Interactions in GA2M</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_31.png" class="img-fluid figure-img"></p>
<figcaption>Slide 31</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=1988s">Timestamp: 00:33:08</a>)</p>
<p>The “2” in GA2M stands for <strong>pairwise interactions</strong>. Shah emphasizes that this is the model’s superpower. While standard linear models struggle with interactions (e.g., the combined effect of age and education), GA2M has an efficient algorithm to automatically find the most important pairs.</p>
<p>This allows the model to achieve accuracy levels comparable to complex ensembles (AUC 0.93) because it captures the interaction signal that simple linear models miss.</p>
</section>
<section id="ga2m-visualization" class="level3">
<h3 class="anchored" data-anchor-id="ga2m-visualization">32. GA2M Visualization</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_32.png" class="img-fluid figure-img"></p>
<figcaption>Slide 32</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=2114s">Timestamp: 00:35:14</a>)</p>
<p>Shah showcases the <strong>InterpretML</strong> dashboard. It provides clear visualizations of how each feature contributes to the prediction.</p>
<p>In the example, we see the coefficients for different marital statuses. This acts like a “lookup table” for risk. Shah argues that this is very “model risk management friendly” because stakeholders can validate every single coefficient and interaction term to ensure they make business sense.</p>
</section>
<section id="section-5-rule-lists" class="level3">
<h3 class="anchored" data-anchor-id="section-5-rule-lists">33. Section 5: Rule Lists</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_33.png" class="img-fluid figure-img"></p>
<figcaption>Slide 33</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=2428s">Timestamp: 00:40:28</a>)</p>
<p>The third approach is <strong>Rule Lists</strong>. Shah introduces this as a method to solve the “overlapping rules” problem found in Rulefit.</p>
</section>
<section id="what-are-rule-lists" class="level3">
<h3 class="anchored" data-anchor-id="what-are-rule-lists">34. What are Rule Lists?</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_34.png" class="img-fluid figure-img"></p>
<figcaption>Slide 34</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=2448s">Timestamp: 00:40:48</a>)</p>
<p><strong>Rule Lists</strong> are ordered sets of <strong>IF-THEN-ELSE</strong> statements. Unlike Rulefit, where you sum up multiple rules, here an observation triggers only the <strong>first</strong> rule it matches.</p>
<p>Shah mentions implementations like <strong>CORELS</strong> and <strong>SBRL</strong> (Scalable Bayesian Rule Lists). The goal is to produce a concise list that a human can read from top to bottom to make a decision.</p>
</section>
<section id="sbrl-process" class="level3">
<h3 class="anchored" data-anchor-id="sbrl-process">35. SBRL Process</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_35.png" class="img-fluid figure-img"></p>
<figcaption>Slide 35</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=2469s">Timestamp: 00:41:09</a>)</p>
<p>Creating an optimal rule list is computationally expensive because the algorithm must search through many permutations to find the best order.</p>
<p>Shah explains the logic: The algorithm finds a rule that covers a subset of data, removes those instances, and then finds the next rule for the remaining data. This sequential “peeling off” of data creates the IF-ELSE structure.</p>
</section>
<section id="sbrl-output-example" class="level3">
<h3 class="anchored" data-anchor-id="sbrl-output-example">36. SBRL Output Example</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_36.png" class="img-fluid figure-img"></p>
<figcaption>Slide 36</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=2505s">Timestamp: 00:41:45</a>)</p>
<p>The output of an SBRL model is shown. It reads like a checklist: 1. <em>IF Capital Gain &gt; $7500 -&gt; High Income (99% prob)</em> 2. <em>ELSE IF Education &lt; 4 -&gt; Low Income (90% prob)</em> 3. <em>ELSE…</em></p>
<p>Shah highlights the simplicity: <em>“You just go down the list until you find the rule… much easier to explain to those marketing people.”</em> The trade-off is a drop in accuracy (AUC 0.86) compared to GA2M or Rulefit.</p>
</section>
<section id="section-6-scorecard" class="level3">
<h3 class="anchored" data-anchor-id="section-6-scorecard">37. Section 6: Scorecard</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_37.png" class="img-fluid figure-img"></p>
<figcaption>Slide 37</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=2692s">Timestamp: 00:44:52</a>)</p>
<p>The final approach is the <strong>Scorecard</strong>. Shah introduces this as perhaps the simplest and most widely recognized format for decision-making in industries like credit and criminal justice.</p>
</section>
<section id="what-are-scorecards" class="level3">
<h3 class="anchored" data-anchor-id="what-are-scorecards">38. What are Scorecards?</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_38.png" class="img-fluid figure-img"></p>
<figcaption>Slide 38</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=2704s">Timestamp: 00:45:04</a>)</p>
<p><strong>Scorecards</strong> are simple additive models where features are assigned integer “points.” To get a prediction, you simply add up the points.</p>
<p>Shah mentions tools like <strong>Optbinning</strong> and <strong>SLIM</strong> (Sparse Linear Integer Models). This format is beloved in operations because it can be printed on a physical card or implemented in a basic spreadsheet.</p>
</section>
<section id="scorecard-example" class="level3">
<h3 class="anchored" data-anchor-id="scorecard-example">39. Scorecard Example</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_39.png" class="img-fluid figure-img"></p>
<figcaption>Slide 39</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=2768s">Timestamp: 00:46:08</a>)</p>
<p>This slide shows a scorecard built for the Adult dataset. * <em>Capital Gain &gt; 7000? +29 points.</em> * <em>Age &lt; 25? -5 points.</em></p>
<p>Shah expresses a personal preference for this over raw coefficients: <em>“I actually like this better… I think it’s a little easier to understand which features are most important.”</em> The integer points make the “weight” of each factor immediately obvious to a layperson.</p>
</section>
<section id="summary" class="level3">
<h3 class="anchored" data-anchor-id="summary">40. Summary</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_40.png" class="img-fluid figure-img"></p>
<figcaption>Slide 40</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=3071s">Timestamp: 00:51:11</a>)</p>
<p>Shah begins to wrap up the presentation, preparing to consolidate the four methods (Rulefit, GA2M, Rule Lists, Scorecards) into a final comparison.</p>
</section>
<section id="complexity-vs-auc-summary-plot" class="level3">
<h3 class="anchored" data-anchor-id="complexity-vs-auc-summary-plot">41. Complexity vs AUC Summary Plot</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_41.png" class="img-fluid figure-img"></p>
<figcaption>Slide 41</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=3073s">Timestamp: 00:51:13</a>)</p>
<p>This is the definitive comparison graph of the talk. It places all discussed models on the <strong>Complexity vs.&nbsp;AUC</strong> plane. * <strong>GA2M (EBM)</strong> and <strong>Rulefit</strong> sit high up, offering near-SOTA accuracy with moderate interpretability. * <strong>Scorecards</strong> and <strong>Rule Lists</strong> sit lower on accuracy but offer maximum simplicity.</p>
<p>Shah summarizes the trade-off: <em>“The Rule Lists and Scorecard… you lose a little bit [of accuracy]… but we talked about the trade-offs of being able to easily understand.”</em></p>
</section>
<section id="take-away" class="level3">
<h3 class="anchored" data-anchor-id="take-away">42. Take Away</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_42.png" class="img-fluid figure-img"></p>
<figcaption>Slide 42</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=3128s">Timestamp: 00:52:08</a>)</p>
<p>The final message is a call to action: <strong>Try these approaches.</strong></p>
<p>Shah encourages data scientists to add these tools to their toolkit. He asks them to consider the specific needs of their problem: Is it about transparency in <em>calculation</em> (Scorecard)? Or understanding <em>factors</em> (GA2M)? Often, a simple model that gets deployed is far better than a complex model that gets stuck in review.</p>
</section>
<section id="conclusion" class="level3">
<h3 class="anchored" data-anchor-id="conclusion">43. Conclusion</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/interpretable-ml-models/slide_43.png" class="img-fluid figure-img"></p>
<figcaption>Slide 43</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/lx4SJOVtxI8&amp;t=3172s">Timestamp: 00:52:52</a>)</p>
<p>The presentation concludes with Rajiv Shah’s contact information. He mentions an upcoming blog post that will synthesize these topics and invites the audience to reach out with questions or feedback.</p>
<p>He reiterates that these interpretable models are often easier to get “buy-in” for, making them a pragmatic choice for real-world data science success.</p>
<hr>
<p><em>This annotated presentation was generated from the talk using AI-assisted tools. Each slide includes timestamps and detailed explanations.</em></p>


</section>
</section>

 ]]></description>
  <category>Interpretability</category>
  <category>Machine Learning</category>
  <category>XAI</category>
  <category>Model Explanation</category>
  <category>Annotated Talk</category>
  <guid>https://rajivshah.com/blog/interpretable-ml-models.html</guid>
  <pubDate>Wed, 25 Sep 2024 05:00:00 GMT</pubDate>
  <media:content url="https://rajivshah.com/blog/images/interpretable-ml-models/slide_1.png" medium="image" type="image/png"/>
</item>
<item>
  <title>Spark of AI: How Transfer Learning Unlocked AI’s Potential</title>
  <link>https://rajivshah.com/blog/spark-of-ai-transfer-learning.html</link>
  <description><![CDATA[ 






<section id="video" class="level2">
<h2 class="anchored" data-anchor-id="video">Video</h2>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/6NuGEukBfcA" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<p>Watch the <a href="https://youtu.be/6NuGEukBfcA">full video</a></p>
<hr>
</section>
<section id="annotated-presentation" class="level2">
<h2 class="anchored" data-anchor-id="annotated-presentation">Annotated Presentation</h2>
<p>Below is an annotated version of the presentation, with timestamped links to the relevant parts of the video for each slide.</p>
<p>Here is the annotated presentation based on the provided video transcript and slide summaries.</p>
<section id="the-spark-of-the-ai-revolution" class="level3">
<h3 class="anchored" data-anchor-id="the-spark-of-the-ai-revolution">1. The Spark of the AI Revolution</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_1.png" class="img-fluid figure-img"></p>
<figcaption>Slide 1</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=0s">Timestamp: 00:00</a>)</p>
<p>The presentation begins with the title slide, “The Spark of the AI Revolution: Transfer Learning,” presented by Rajiv Shah from Snowflake. This talk was originally given at the University of Cincinnati and recorded later to share the insights with a broader audience.</p>
<p>Rajiv sets the stage by explaining that this is not a deep technical dive into code, but rather a descriptive history and analysis of the drivers behind the current AI boom. The goal is to explain how AI learns and how individuals can start to interrogate and understand these technologies in their own lives.</p>
<p>The core premise is that <strong>Transfer Learning</strong> is the catalyst that shifted AI from academic curiosity to a revolutionary force. The talk aims to bridge the gap for those unfamiliar with the underlying mechanics of how models like ChatGPT came to be.</p>
</section>
<section id="sparks-of-agi-early-experiments" class="level3">
<h3 class="anchored" data-anchor-id="sparks-of-agi-early-experiments">2. Sparks of AGI: Early Experiments</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_2.png" class="img-fluid figure-img"></p>
<figcaption>Slide 2</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=60s">Timestamp: 01:00</a>)</p>
<p>This slide illustrates an early experiment conducted by researchers investigating GPT-4. To understand how the model was learning, they gave it a concept and asked it to draw it using code (SVG). The slide displays a progression of abstract animal figures, showing how the model’s ability to represent concepts improved over time during training.</p>
<p>This references the paper “Sparks of Artificial General Intelligence,” which caused significant waves in the tech community. It suggests that these models were beginning to show signs of <strong>Artificial General Intelligence (AGI)</strong>—reasoning capabilities that extend beyond narrow tasks.</p>
<p>The visual progression from crude shapes to recognizable forms serves as a metaphor for the rapid evolution of these models. It highlights the mystery and potential power hidden within the training process of Large Language Models (LLMs).</p>
</section>
<section id="extinction-level-threat" class="level3">
<h3 class="anchored" data-anchor-id="extinction-level-threat">3. Extinction Level Threat?</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_3.png" class="img-fluid figure-img"></p>
<figcaption>Slide 3</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=96s">Timestamp: 01:36</a>)</p>
<p>The presentation addresses the extreme concerns surrounding the rapid scaling of AI technologies. The slide features a dramatic image reminiscent of the Terminator, referencing fears that unchecked AI development could pose an <strong>“extinction-level” threat</strong> to humanity.</p>
<p>Rajiv notes that as these technologies scale, there is a segment of the research and safety community worried about catastrophic outcomes. This sets up a contrast between the theoretical existential risks and the practical, everyday reality of how AI is currently being used.</p>
<p>This slide acknowledges the “hype and fear” cycle that dominates the media narrative, validating the audience’s anxiety before pivoting to a more grounded explanation of how the technology actually works.</p>
</section>
<section id="the-new-ai-overlords" class="level3">
<h3 class="anchored" data-anchor-id="the-new-ai-overlords">4. The New AI Overlords</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_4.png" class="img-fluid figure-img"></p>
<figcaption>Slide 4</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=102s">Timestamp: 01:42</a>)</p>
<p>Shifting to a lighter tone, this slide highlights the widespread adoption of AI by the younger generation. It cites a statistic that <strong>89% of students</strong> have used ChatGPT for homework, humorously suggesting that children have already “accepted our new AI overlords.”</p>
<p>The slide points out a discrepancy in honesty, noting that while 89% use it, a significant portion (implied by the “11% are lying” joke) might not admit it. This reflects a fundamental shift in education and information retrieval that has already taken place.</p>
<p>This context emphasizes that the AI revolution is not just a future possibility but a current reality affecting how the next generation learns and works. It underscores the urgency of understanding these tools.</p>
</section>
<section id="fundamental-questions" class="level3">
<h3 class="anchored" data-anchor-id="fundamental-questions">5. Fundamental Questions</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_5.png" class="img-fluid figure-img"></p>
<figcaption>Slide 5</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=113s">Timestamp: 01:53</a>)</p>
<p>This slide poses the central questions that the presentation will answer: “What is AI doing?” and “How should you think about AI?” It serves as an agenda setting for the technical explanation that follows.</p>
<p>Rajiv transitions here from the societal impact of AI to the mechanics of machine learning. He prepares the audience to look “under the hood” to demystify the “magic” of tools like ChatGPT.</p>
<p>The goal is to move the audience from passive consumers of AI hype to critical thinkers who understand the limitations and capabilities of the technology based on how it is built.</p>
</section>
<section id="how-we-teach-computers" class="level3">
<h3 class="anchored" data-anchor-id="how-we-teach-computers">6. How We Teach Computers</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_6.png" class="img-fluid figure-img"></p>
<figcaption>Slide 6</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=122s">Timestamp: 02:02</a>)</p>
<p>The presentation begins its technical explanation with a fundamental question: <strong>“How do we teach computers?”</strong> The slide uses imagery of blueprints and tools, likening the traditional process of building AI models to craftsmanship.</p>
<p>This introduces the concept of <strong>Supervised Learning</strong> in a relatable way. Before discussing neural networks, Rajiv grounds the audience in traditional analytics, where humans explicitly guide the machine on what to look for.</p>
<p>The focus here is on the human element in traditional machine learning—the “artisan” who must carefully select inputs to get a desired output.</p>
</section>
<section id="identifying-features" class="level3">
<h3 class="anchored" data-anchor-id="identifying-features">7. Identifying Features</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_7.png" class="img-fluid figure-img"></p>
<figcaption>Slide 7</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=131s">Timestamp: 02:11</a>)</p>
<p>Using a real estate example, this slide explains the concept of <strong>Features</strong> (or variables). To teach a computer to value a house, one must identify specific characteristics like square footage, number of bedrooms, or closet space.</p>
<p>Rajiv explains that we capture these characteristics and organize them into a tabular format. This process is known as <strong>Feature Engineering</strong>, where the data scientist decides which attributes are relevant for the problem at hand.</p>
<p>This is the bedrock of traditional enterprise AI: converting real-world objects into structured data points that a machine can process mathematically.</p>
</section>
<section id="historical-data-patterns" class="level3">
<h3 class="anchored" data-anchor-id="historical-data-patterns">8. Historical Data Patterns</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_8.png" class="img-fluid figure-img"></p>
<figcaption>Slide 8</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=170s">Timestamp: 02:50</a>)</p>
<p>This slide displays a scatter plot correlating “Sales Price” with “Square Feet.” It illustrates how enterprises gather historical data to look for patterns and relationships backwards in time.</p>
<p>Rajiv notes that much of traditional analytics is simply looking at this historical data to understand what happened. However, the power of AI lies in using this data for <strong>forward-looking</strong> purposes.</p>
<p>The visual clearly shows a trend: as square footage increases, the price generally increases. This linear relationship is what the machine needs to “learn.”</p>
</section>
<section id="learning-the-model" class="level3">
<h3 class="anchored" data-anchor-id="learning-the-model">9. Learning the Model</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_9.png" class="img-fluid figure-img"></p>
<figcaption>Slide 9</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=183s">Timestamp: 03:03</a>)</p>
<p>Here, a line is drawn through the data points on the scatter plot. This line represents the <strong>Model</strong>. Learning, in this context, is simply the mathematical process of fitting this line to the historical data to minimize error.</p>
<p>Rajiv explains that the model “understands the relationships” defined by the data. Instead of a human manually writing rules, the algorithm finds the best-fit trend based on the input features.</p>
<p>This simplifies the concept of training a model down to its essence: finding a mathematical representation of a trend within a dataset.</p>
</section>
<section id="making-predictions" class="level3">
<h3 class="anchored" data-anchor-id="making-predictions">10. Making Predictions</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_10.png" class="img-fluid figure-img"></p>
<figcaption>Slide 10</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=196s">Timestamp: 03:16</a>)</p>
<p>This slide demonstrates the utility of the trained model. When a “New House” comes onto the market, the model uses the learned line to predict its value based on its square footage.</p>
<p>This defines the <strong>Inference</strong> stage of machine learning. The model is no longer learning; it is applying its “knowledge” (the line) to unseen data to generate a prediction.</p>
<p>It highlights the portability of a model—once trained, it can be used to make rapid assessments of new data points without human intervention.</p>
</section>
<section id="the-domain-limitation" class="level3">
<h3 class="anchored" data-anchor-id="the-domain-limitation">11. The Domain Limitation</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_11.png" class="img-fluid figure-img"></p>
<figcaption>Slide 11</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=210s">Timestamp: 03:30</a>)</p>
<p>The presentation introduces a critical limitation of traditional models. The slide shows the model trained on San Francisco data being applied to houses in South Carolina. The result is labeled “Poor Model.”</p>
<p>Rajiv explains that while you can technically take the model with you, it will fail because the <strong>underlying relationships</strong> between features (size) and targets (price) are different in different domains (geographies).</p>
<p>This illustrates the concept of <strong>Domain Shift</strong> or lack of generalization. A model is only as good as the data it was trained on, and it assumes the future (or new location) looks exactly like the past.</p>
</section>
<section id="the-thinking-emoji" class="level3">
<h3 class="anchored" data-anchor-id="the-thinking-emoji">12. The Thinking Emoji</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_12.png" class="img-fluid figure-img"></p>
<figcaption>Slide 12</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=223s">Timestamp: 03:43</a>)</p>
<p>This slide reinforces the previous point with a thinking emoji, emphasizing the realization that the existing model is inadequate. The “San Francisco Model” does not fit the “South Carolina Data.”</p>
<p>It serves as a visual pause to let the problem sink in: traditional machine learning is brittle. It requires the data distribution to remain constant.</p>
<p>Rajiv uses this to set up the labor-intensive nature of traditional analytics, where models cannot simply be “transferred” across different contexts.</p>
</section>
<section id="train-new-model" class="level3">
<h3 class="anchored" data-anchor-id="train-new-model">13. Train New Model</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_13.png" class="img-fluid figure-img"></p>
<figcaption>Slide 13</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=235s">Timestamp: 03:55</a>)</p>
<p>The solution in the traditional paradigm is presented here: <strong>“Train New Model.”</strong> To get accurate predictions for South Carolina, one must collect local data and repeat the entire training process from scratch.</p>
<p>This highlights the “Never-Ending Battle” of enterprise analytics. Data scientists are constantly retraining models for every specific region, product line, or use case.</p>
<p>This sets the baseline for why Transfer Learning (introduced later) is such a revolution. In the old way, knowledge was not portable; every problem required a bespoke solution.</p>
</section>
<section id="artisan-ai" class="level3">
<h3 class="anchored" data-anchor-id="artisan-ai">14. Artisan AI</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_14.png" class="img-fluid figure-img"></p>
<figcaption>Slide 14</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=255s">Timestamp: 04:15</a>)</p>
<p>Rajiv coins the term <strong>“Artisan AI”</strong> to describe this traditional approach. The slide features an image of a craftsman, symbolizing that these models are hand-built and rely heavily on human-crafted features.</p>
<p>This approach is slow and difficult to scale. Just as an artisan can only produce a limited number of goods, a data science team using these methods can only maintain a limited number of models.</p>
<p>It emphasizes that the intelligence in these systems comes largely from the human who engineered the features, not the machine itself.</p>
</section>
<section id="enterprise-ai-use-cases" class="level3">
<h3 class="anchored" data-anchor-id="enterprise-ai-use-cases">15. Enterprise AI Use Cases</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_15.png" class="img-fluid figure-img"></p>
<figcaption>Slide 15</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=263s">Timestamp: 04:23</a>)</p>
<p>This slide lists common Enterprise AI applications: Forecasting, Pricing, Customer Churn, and Fraud. It notes that <strong>80% of production models</strong> currently fall into this category.</p>
<p>Rajiv grounds the talk in the reality of today’s business world. Despite the hype around Generative AI, most companies are still running on these “Artisan” structured data models.</p>
<p>This distinction is crucial for understanding the market. There is “Old AI” (highly effective, structured, labor-intensive) and “New AI” (generative, unstructured, scalable), and they solve different problems.</p>
</section>
<section id="the-computer-science-perspective" class="level3">
<h3 class="anchored" data-anchor-id="the-computer-science-perspective">16. The Computer Science Perspective</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_16.png" class="img-fluid figure-img"></p>
<figcaption>Slide 16</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=281s">Timestamp: 04:41</a>)</p>
<p>The presentation shifts from the enterprise view to the academic Computer Science view. The slide asks, “How should we teach computers?” signaling a move toward more advanced methodologies.</p>
<p>Rajiv indicates that computer scientists were trying to find ways to move beyond the limitations of manual feature engineering. They wanted machines to learn the features themselves.</p>
<p>This transition introduces the concept of <strong>Deep Learning</strong> and the move toward processing unstructured data like audio, images, and text.</p>
</section>
<section id="frederick-jelineks-insight" class="level3">
<h3 class="anchored" data-anchor-id="frederick-jelineks-insight">17. Frederick Jelinek’s Insight</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_17.png" class="img-fluid figure-img"></p>
<figcaption>Slide 17</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=289s">Timestamp: 04:49</a>)</p>
<p>This slide introduces a quote from Frederick Jelinek, a pioneer in speech recognition: <strong>“Every time I fire a linguist, the performance of the speech recognizer goes up.”</strong></p>
<p>This provocative quote encapsulates a major shift in AI philosophy. It suggests that human expertise (linguistics) often gets in the way of raw data processing. Instead of hard-coding grammar rules, it is better to let the model learn patterns directly from the data.</p>
<p>Rajiv asks the audience to “chew on that,” as it foreshadows the “Bitter Lesson” of AI: massive compute and data often outperform human domain expertise.</p>
</section>
<section id="computer-vision-in-2010" class="level3">
<h3 class="anchored" data-anchor-id="computer-vision-in-2010">18. Computer Vision in 2010</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_18.png" class="img-fluid figure-img"></p>
<figcaption>Slide 18</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=347s">Timestamp: 05:47</a>)</p>
<p>The slide depicts the state of Computer Vision around 2010. It shows a process of manual feature extraction (like HOG - Histogram of Oriented Gradients) used to identify shapes and edges.</p>
<p>Rajiv explains that even in vision, researchers were essentially doing “Artisan AI.” They sat around thinking about how to mathematically describe the shape of a car or a truck to a computer.</p>
<p>This illustrates that before the deep learning boom, computer vision was stuck in the same “feature engineering” trap as tabular analytics.</p>
</section>
<section id="svm-classification" class="level3">
<h3 class="anchored" data-anchor-id="svm-classification">19. SVM Classification</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_19.png" class="img-fluid figure-img"></p>
<figcaption>Slide 19</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=365s">Timestamp: 06:05</a>)</p>
<p>Following feature extraction, this slide shows a <strong>Support Vector Machine (SVM)</strong> classifier separating data points (cars vs.&nbsp;trucks). This was the standard approach: extract features manually, then use a simple algorithm to classify them.</p>
<p>This reinforces the previous point about the limitations of the time. The intelligence was in the manual extraction, not the classification model.</p>
<p>Rajiv mentions his own work at Caterpillar, noting that this was exactly how they tried to separate images of machinery—a tedious and specific process.</p>
</section>
<section id="fei-fei-li-and-big-data" class="level3">
<h3 class="anchored" data-anchor-id="fei-fei-li-and-big-data">20. Fei-Fei Li and Big Data</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_20.png" class="img-fluid figure-img"></p>
<figcaption>Slide 20</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=375s">Timestamp: 06:15</a>)</p>
<p>The slide introduces <strong>Professor Fei-Fei Li</strong>, a visionary in computer vision. It features a collage of images, hinting at the need for scale.</p>
<p>Rajiv explains that Fei-Fei Li recognized that for computer vision to advance, it needed to move away from tiny datasets (100-200 images) and toward massive scale. She understood that deep learning required vast amounts of data to generalize.</p>
<p>This marks the beginning of the “Big Data” era in AI, where the focus shifted from better algorithms to better and larger datasets.</p>
</section>
<section id="imagenet" class="level3">
<h3 class="anchored" data-anchor-id="imagenet">21. ImageNet</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_21.png" class="img-fluid figure-img"></p>
<figcaption>Slide 21</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=401s">Timestamp: 06:41</a>)</p>
<p>This slide details <strong>ImageNet</strong>, the dataset Fei-Fei Li helped create. It contains <strong>14 million images</strong> across <strong>1000 classes</strong>.</p>
<p>Rajiv highlights the sheer effort involved, noting the use of <strong>Mechanical Turk</strong> to crowdsource the labeling of these images. He calls this the “dirty secret” of AI—that it is powered by low-wage human labor labeling data.</p>
<p>ImageNet became the benchmark that drove the AI revolution. It provided the “fuel” necessary for neural networks to finally work.</p>
</section>
<section id="alexnet-and-gpus" class="level3">
<h3 class="anchored" data-anchor-id="alexnet-and-gpus">22. AlexNet and GPUs</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_22.png" class="img-fluid figure-img"></p>
<figcaption>Slide 22</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=464s">Timestamp: 07:44</a>)</p>
<p>The presentation introduces <strong>Alex Krizhevsky</strong>, a graduate student under Geoffrey Hinton. The slide mentions “AlexNet” and the use of GPUs (Graphics Processing Units).</p>
<p>Rajiv tells the story of how Alex decided to use NVIDIA gaming cards to train neural networks. Traditional CPUs were too slow for the math required by deep learning.</p>
<p>This moment—combining the massive ImageNet dataset with the parallel processing power of GPUs—was the “big bang” of modern AI.</p>
</section>
<section id="alexnet-training-details" class="level3">
<h3 class="anchored" data-anchor-id="alexnet-training-details">23. AlexNet Training Details</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_23.png" class="img-fluid figure-img"></p>
<figcaption>Slide 23</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=485s">Timestamp: 08:05</a>)</p>
<p>This slide provides the technical specs of AlexNet: trained on <strong>1.2 million images</strong>, using <strong>2 GPUs</strong>, taking roughly <strong>6 days</strong>, with <strong>60 million parameters</strong>.</p>
<p>Rajiv emphasizes that while 6 days seems long, the result was a model vastly superior to anything else. It proved that neural networks, which had been theoretical for decades, were now practical.</p>
<p>The “60 million parameters” figure is a precursor to the “billions” and “trillions” we see today, marking the start of the parameter scaling race.</p>
</section>
<section id="crushing-the-competition" class="level3">
<h3 class="anchored" data-anchor-id="crushing-the-competition">24. Crushing the Competition</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_24.png" class="img-fluid figure-img"></p>
<figcaption>Slide 24</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=507s">Timestamp: 08:27</a>)</p>
<p>A chart displays the results of the ImageNet Large Scale Visual Recognition Challenge. It shows AlexNet achieving a significantly lower error rate than the competitors.</p>
<p>Rajiv notes that the performance jump was so dramatic that by the following year, <strong>every competitor</strong> had switched to using the AlexNet architecture.</p>
<p>This visualizes the paradigm shift. The “Artisan” methods were instantly obsolete, replaced by Deep Learning.</p>
</section>
<section id="feature-engineering-vs.-deep-learning" class="level3">
<h3 class="anchored" data-anchor-id="feature-engineering-vs.-deep-learning">25. Feature Engineering vs.&nbsp;Deep Learning</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_25.png" class="img-fluid figure-img"></p>
<figcaption>Slide 25</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=517s">Timestamp: 08:37</a>)</p>
<p>Using a humorous meme format, this slide compares the “Old Way” (Feature Engineering + SVM) with the “New Way” (AlexNet). The AlexNet side is depicted as a powerful, overwhelming force.</p>
<p>This solidifies the takeaway: Deep Learning didn’t just improve upon the old methods; it completely replaced them for unstructured data tasks like vision.</p>
<p>It emphasizes that the model learned the features itself (edges, textures, shapes) rather than having humans manually code them.</p>
</section>
<section id="the-1000-classes" class="level3">
<h3 class="anchored" data-anchor-id="the-1000-classes">26. The 1000 Classes</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_26.png" class="img-fluid figure-img"></p>
<figcaption>Slide 26</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=531s">Timestamp: 08:51</a>)</p>
<p>This slide shows examples of the <strong>1000 classes</strong> in ImageNet, ranging from specific dog breeds to everyday objects.</p>
<p>Rajiv explains that this model learned to identify a vast array of things from the raw pixels. It went from raw vision to understanding textures, shapes, and objects.</p>
<p>However, he sets up the next problem: What if you want to identify something <em>not</em> in those 1000 classes?</p>
</section>
<section id="the-hot-dog-problem" class="level3">
<h3 class="anchored" data-anchor-id="the-hot-dog-problem">27. The Hot Dog Problem</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_27.png" class="img-fluid figure-img"></p>
<figcaption>Slide 27</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=541s">Timestamp: 09:01</a>)</p>
<p>referencing a famous scene from the show <em>Silicon Valley</em>, this slide presents the specific challenge of classifying “Hot Dogs.”</p>
<p>Rajiv uses this to ask: How do you help a buddy with a startup who needs to find hot dogs if “hot dog” isn’t one of the primary categories, or if they need a specific <em>type</em> of hot dog? Do you have to start from scratch?</p>
<p>This sets the stage for <strong>Transfer Learning</strong>—the solution to avoiding the need for 14 million images every time you have a new problem.</p>
</section>
<section id="pre-trained-models" class="level3">
<h3 class="anchored" data-anchor-id="pre-trained-models">28. Pre-Trained Models</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_28.png" class="img-fluid figure-img"></p>
<figcaption>Slide 28</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=552s">Timestamp: 09:12</a>)</p>
<p>The slide introduces the concept of a <strong>Pre-trained Model</strong>. This is the model that has already learned the 1000 classes from ImageNet.</p>
<p>Rajiv explains that this model already “knows” how to see. It understands edges, curves, and textures. This knowledge is contained in the “weights” of the neural network.</p>
<p>The key idea is that we don’t need to relearn how to “see” every time we want to identify a new object.</p>
</section>
<section id="transfer-learning-mechanics" class="level3">
<h3 class="anchored" data-anchor-id="transfer-learning-mechanics">29. Transfer Learning Mechanics</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_29.png" class="img-fluid figure-img"></p>
<figcaption>Slide 29</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=570s">Timestamp: 09:30</a>)</p>
<p>This technical slide illustrates how <strong>Transfer Learning</strong> works. It shows the layers of a neural network. We keep the early layers (which know shapes and textures) and only retrain the final layers for the new task (e.g., identifying boats).</p>
<p>Rajiv explains that we can transfer “most of that knowledge” and only change a <strong>small amount of parameters</strong> (less than 10%).</p>
<p>This is the revolution: You can build a world-class model with a <em>small</em> amount of data by standing on the shoulders of the giant ImageNet model.</p>
</section>
<section id="the-revolution" class="level3">
<h3 class="anchored" data-anchor-id="the-revolution">30. The Revolution</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_30.png" class="img-fluid figure-img"></p>
<figcaption>Slide 30</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=594s">Timestamp: 09:54</a>)</p>
<p>A graph titled “Transfer Learning Revolution” shows the dramatic improvement in accuracy when using transfer learning versus training from scratch. It includes a quote from <strong>Andrew Ng</strong> stating that transfer learning will be the next driver of commercial success.</p>
<p>Rajiv emphasizes that this capability allowed startups and companies to build powerful AI without needing Google-sized datasets. It democratized access to high-performance computer vision.</p>
<p>This wraps up the vision section of the talk, establishing Transfer Learning as the “Spark.”</p>
</section>
<section id="the-implications" class="level3">
<h3 class="anchored" data-anchor-id="the-implications">31. The Implications</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_31.png" class="img-fluid figure-img"></p>
<figcaption>Slide 31</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=611s">Timestamp: 10:11</a>)</p>
<p>The slide shows a YouTube video thumbnail from 2016 featuring Geoffrey Hinton. This transitions the talk to the societal and professional implications of this technology.</p>
<p>Rajiv prepares to share a famous prediction by Hinton regarding the medical field, specifically radiology. It signals a shift from “how it works” to “what it does to jobs.”</p>
</section>
<section id="the-coyote-moment" class="level3">
<h3 class="anchored" data-anchor-id="the-coyote-moment">32. The Coyote Moment</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_32.png" class="img-fluid figure-img"></p>
<figcaption>Slide 32</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=619s">Timestamp: 10:19</a>)</p>
<p>The slide displays a webpage for the University of Cincinnati Radiology Fellows. Rajiv quotes Hinton: <strong>“Radiologists are like the coyote that’s already over the edge of the cliff but hasn’t yet looked down.”</strong></p>
<p>Hinton suggested people should stop training radiologists because AI interprets images better. Rajiv humorously notes that since he was speaking <em>at</em> U of C, he had to show the “coyotes” in the audience.</p>
<p>This highlights the tension between AI capabilities and human expertise, a recurring theme in the presentation.</p>
</section>
<section id="nlp-the-academic-view" class="level3">
<h3 class="anchored" data-anchor-id="nlp-the-academic-view">33. NLP: The Academic View</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_33.png" class="img-fluid figure-img"></p>
<figcaption>Slide 33</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=642s">Timestamp: 10:42</a>)</p>
<p>The presentation switches domains from Computer Vision to <strong>Natural Language Processing (NLP)</strong>. The slide depicts a traditional academic setting, representing the text researchers.</p>
<p>Rajiv explains that while Computer Vision was having its revolution with AlexNet, the text folks were still doing things the “Old Way”—crafting features and rules for language.</p>
<p>They saw the success in vision and wondered how to replicate it for text, but language proved more difficult to model than images initially.</p>
</section>
<section id="traditional-nlp-tasks" class="level3">
<h3 class="anchored" data-anchor-id="traditional-nlp-tasks">34. Traditional NLP Tasks</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_34.png" class="img-fluid figure-img"></p>
<figcaption>Slide 34</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=664s">Timestamp: 11:04</a>)</p>
<p>This slide lists various NLP tasks: Classification, Information Extraction, and Sentiment Analysis.</p>
<p>Rajiv notes that traditionally, each of these was a <strong>separate discipline</strong>. You built a specific model for sentiment, a different one for translation, and another for summarization. There was no “one model to rule them all.”</p>
<p>This fragmentation made NLP difficult and resource-intensive, as knowledge didn’t transfer between tasks.</p>
</section>
<section id="the-glue-benchmark" class="level3">
<h3 class="anchored" data-anchor-id="the-glue-benchmark">35. The GLUE Benchmark</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_35.png" class="img-fluid figure-img"></p>
<figcaption>Slide 35</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=685s">Timestamp: 11:25</a>)</p>
<p>The slide introduces the <strong>GLUE Benchmark</strong> (General Language Understanding Evaluation). This was a collection of different text tasks put together to measure general language ability.</p>
<p>Rajiv explains this was an attempt to push the field toward general-purpose models. Researchers wanted a single metric to see if a model could understand language broadly, not just solve one specific trick.</p>
</section>
<section id="the-transformer-architecture" class="level3">
<h3 class="anchored" data-anchor-id="the-transformer-architecture">36. The Transformer Architecture</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_36.png" class="img-fluid figure-img"></p>
<figcaption>Slide 36</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=697s">Timestamp: 11:37</a>)</p>
<p>This slide marks the turning point for text: the introduction of the <strong>Transformer</strong> architecture by Google researchers in 2017 (the “Attention Is All You Need” paper).</p>
<p>Rajiv highlights that this architecture was not only more accurate (higher BLEU scores) but, crucially, more efficient.</p>
<p>The Transformer allowed for parallel processing of text, unlike previous sequential models (RNNs/LSTMs), unlocking the ability to train on massive datasets.</p>
</section>
<section id="lower-training-costs" class="level3">
<h3 class="anchored" data-anchor-id="lower-training-costs">37. Lower Training Costs</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_37.png" class="img-fluid figure-img"></p>
<figcaption>Slide 37</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=706s">Timestamp: 11:46</a>)</p>
<p>The slide emphasizes the <strong>Training Cost</strong> reduction associated with Transformers.</p>
<p>Rajiv points out that because the architecture used less processing power per unit of data, researchers immediately asked: “What happens if we give it <em>more</em> processing?”</p>
<p>This efficiency paradox—making something cheaper allows you to do vastly more of it—sparked the scaling era of LLMs.</p>
</section>
<section id="exponential-growth" class="level3">
<h3 class="anchored" data-anchor-id="exponential-growth">38. Exponential Growth</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_38.png" class="img-fluid figure-img"></p>
<figcaption>Slide 38</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=716s">Timestamp: 11:56</a>)</p>
<p>A graph demonstrates the exponential growth in the size of Transformer models (measured in parameters) over just a few years. The curve shoots upward vertically.</p>
<p>Rajiv explains that this scaling—simply making the models bigger and feeding them more data—led to the performance of GPT-4.</p>
<p>This visualizes the “Scale” aspect of modern AI. We haven’t necessarily changed the architecture since 2017; we’ve just made it significantly larger.</p>
</section>
<section id="gpt-4-and-images" class="level3">
<h3 class="anchored" data-anchor-id="gpt-4-and-images">39. GPT-4 and Images</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_39.png" class="img-fluid figure-img"></p>
<figcaption>Slide 39</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=731s">Timestamp: 12:11</a>)</p>
<p>The presentation circles back to the GPT-4 generated images from Slide 2.</p>
<p>Rajiv connects the Transformer architecture and scaling directly to these “Sparks of AGI.” The ability to reason and draw emerged from simply predicting the next word at a massive scale.</p>
</section>
<section id="the-era-of-chatgpt" class="level3">
<h3 class="anchored" data-anchor-id="the-era-of-chatgpt">40. The Era of ChatGPT</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_40.png" class="img-fluid figure-img"></p>
<figcaption>Slide 40</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=736s">Timestamp: 12:16</a>)</p>
<p>The slide displays the ChatGPT logo, symbolizing the current era where these technical advancements reached the public consciousness.</p>
<p>Rajiv sets up the next section of the talk: explaining exactly <strong>how</strong> a model like ChatGPT is trained. He moves from history to the “Recipe.”</p>
</section>
<section id="the-learning-process" class="level3">
<h3 class="anchored" data-anchor-id="the-learning-process">41. The Learning Process</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_41.png" class="img-fluid figure-img"></p>
<figcaption>Slide 41</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=740s">Timestamp: 12:20</a>)</p>
<p>A visual diagram outlines the evolutionary stages of ChatGPT. It previews the three steps Rajiv will cover: Pre-training, Fine-tuning, and Alignment.</p>
<p>This roadmap helps the audience understand that ChatGPT isn’t just one static thing; it’s the result of a multi-stage pipeline involving different types of learning.</p>
</section>
<section id="recipe-step-1-foundation-model" class="level3">
<h3 class="anchored" data-anchor-id="recipe-step-1-foundation-model">42. Recipe Step 1: Foundation Model</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_42.png" class="img-fluid figure-img"></p>
<figcaption>Slide 42</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=747s">Timestamp: 12:27</a>)</p>
<p>The first step identified is the <strong>“Foundation Model”</strong> (or Base Model).</p>
<p>Rajiv explains that the core capability of these models is <strong>Next Word Prediction</strong>. Before it can answer questions or be helpful, it must simply learn the statistical structure of language.</p>
</section>
<section id="predictive-keyboards" class="level3">
<h3 class="anchored" data-anchor-id="predictive-keyboards">43. Predictive Keyboards</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_43.png" class="img-fluid figure-img"></p>
<figcaption>Slide 43</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=751s">Timestamp: 12:31</a>)</p>
<p>To make the concept relatable, the slide compares LLMs to the <strong>predictive text</strong> feature on a smartphone keyboard.</p>
<p>Rajiv notes that while the game on your phone is simple, scaling that concept up to the entire internet makes it incredibly powerful. It grounds the “magic” of AI in a familiar user experience.</p>
</section>
<section id="next-token-prediction" class="level3">
<h3 class="anchored" data-anchor-id="next-token-prediction">44. Next Token Prediction</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_44.png" class="img-fluid figure-img"></p>
<figcaption>Slide 44</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=760s">Timestamp: 12:40</a>)</p>
<p>This technical slide defines <strong>“Next Token Prediction.”</strong> It explains that the model looks at a sequence of text and calculates the probability of what comes next.</p>
<p>Rajiv emphasizes that this is a hard statistical problem. There are many possibilities for the next word, and the model must learn to weigh them based on context.</p>
</section>
<section id="the-homer-simpson-challenge" class="level3">
<h3 class="anchored" data-anchor-id="the-homer-simpson-challenge">45. The Homer Simpson Challenge</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_45.png" class="img-fluid figure-img"></p>
<figcaption>Slide 45</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=785s">Timestamp: 13:05</a>)</p>
<p>Rajiv introduces a specific experiment: Training a Transformer to speak like <strong>Homer Simpson</strong>. He mentions using 7MB of Simpsons scripts (~7 million tokens).</p>
<p>This serves as a concrete example to show how training data size affects model performance.</p>
</section>
<section id="million-tokens" class="level3">
<h3 class="anchored" data-anchor-id="million-tokens">46. 4 Million Tokens</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_46.png" class="img-fluid figure-img"></p>
<figcaption>Slide 46</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=812s">Timestamp: 13:32</a>)</p>
<p>The slide shows the output of the model when trained on only <strong>4 Million tokens</strong>. The text is “nonsensical and random.”</p>
<p>Rajiv demonstrates that with insufficient data, the model hasn’t learned grammar or structure yet. It’s just outputting characters.</p>
</section>
<section id="million-tokens-1" class="level3">
<h3 class="anchored" data-anchor-id="million-tokens-1">47. 16 Million Tokens</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_47.png" class="img-fluid figure-img"></p>
<figcaption>Slide 47</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=817s">Timestamp: 13:37</a>)</p>
<p>At <strong>16 Million tokens</strong>, the output improves slightly. It contains random words and incorrect grammar, but it’s recognizable as language.</p>
<p>This illustrates the “grokking” phase where the model starts to pick up on basic syntax but lacks semantic meaning.</p>
</section>
<section id="million-tokens-2" class="level3">
<h3 class="anchored" data-anchor-id="million-tokens-2">48. 64 Million Tokens</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_48.png" class="img-fluid figure-img"></p>
<figcaption>Slide 48</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=819s">Timestamp: 13:39</a>)</p>
<p>With <strong>64 Million tokens</strong>, the model generates text that is “close to a proper sentence” and sounds vaguely like Homer Simpson.</p>
<p>Rajiv uses this progression to prove that these models are statistical engines. With enough data, they mimic the patterns of the training set effectively.</p>
</section>
<section id="gpt-2-specifications" class="level3">
<h3 class="anchored" data-anchor-id="gpt-2-specifications">49. GPT-2 Specifications</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_49.png" class="img-fluid figure-img"></p>
<figcaption>Slide 49</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=834s">Timestamp: 13:54</a>)</p>
<p>The slide details <strong>GPT-2</strong> (released in 2019), which had <strong>1.5 Billion parameters</strong>.</p>
<p>Rajiv recalls that when GPT-2 came out, he wasn’t excited because it was just a “creative storytelling model.” It wasn’t factually accurate. He wants the audience to remember that at their core, these models are just predicting the next word, not checking facts.</p>
</section>
<section id="llama-3.1-and-scale" class="level3">
<h3 class="anchored" data-anchor-id="llama-3.1-and-scale">50. Llama 3.1 and Scale</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_50.png" class="img-fluid figure-img"></p>
<figcaption>Slide 50</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=867s">Timestamp: 14:27</a>)</p>
<p>Updating the timeline, this slide shows <strong>Llama 3.1</strong>. It highlights the training data: <strong>15 Trillion Tokens</strong> and the compute: <strong>40 Million GPU Hours</strong>.</p>
<p>Rajiv emphasizes that 15 trillion tokens is an “unfathomable amount of information.” The scale has increased 10,000x since GPT-2.</p>
<p>This underscores the energy and compute intensity of modern AI—it requires massive infrastructure.</p>
</section>
<section id="hallucinations" class="level3">
<h3 class="anchored" data-anchor-id="hallucinations">51. Hallucinations</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_51.png" class="img-fluid figure-img"></p>
<figcaption>Slide 51</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=915s">Timestamp: 15:15</a>)</p>
<p>This slide addresses <strong>Hallucinations</strong>. It uses an example of asking for the “Capital of Mars.” The model will confidently invent an answer.</p>
<p>Rajiv argues that “hallucination” isn’t the right metaphor because the model isn’t malfunctioning. It is doing exactly what it was designed to do: predict the most likely next word. It has no concept of “truth,” only statistical likelihood.</p>
</section>
<section id="gpt-2-failure-on-sentiment" class="level3">
<h3 class="anchored" data-anchor-id="gpt-2-failure-on-sentiment">52. GPT-2 Failure on Sentiment</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_52.png" class="img-fluid figure-img"></p>
<figcaption>Slide 52</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=977s">Timestamp: 16:17</a>)</p>
<p>Rajiv shows an example of trying to use the base GPT-2 model for a specific task: <strong>Customer Sentiment</strong>. When prompted, the model just continues the story instead of classifying the sentiment.</p>
<p>This illustrates that Base Models are creative but <strong>not useful for following instructions</strong>. They don’t know they are supposed to solve a problem; they just want to write text.</p>
</section>
<section id="recipe-step-2-instruction-fine-tuned" class="level3">
<h3 class="anchored" data-anchor-id="recipe-step-2-instruction-fine-tuned">53. Recipe Step 2: Instruction Fine-Tuned</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_53.png" class="img-fluid figure-img"></p>
<figcaption>Slide 53</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=998s">Timestamp: 16:38</a>)</p>
<p>This introduces the second step in the ChatGPT recipe: <strong>“Instruction Fine-Tuned Model.”</strong></p>
<p>Rajiv explains that to make the model useful, we must teach it to follow orders. This is done via Transfer Learning—taking the base model and training it further on examples of instructions and answers.</p>
</section>
<section id="fine-tuning-for-sentiment" class="level3">
<h3 class="anchored" data-anchor-id="fine-tuning-for-sentiment">54. Fine-Tuning for Sentiment</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_54.png" class="img-fluid figure-img"></p>
<figcaption>Slide 54</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=1006s">Timestamp: 16:46</a>)</p>
<p>The slide shows the process of fine-tuning the language model specifically for <strong>Sentiment Analysis</strong>.</p>
<p>By showing the model examples of “Sentence -&gt; Sentiment,” we can tweak the parameters so it learns to perform classification rather than just storytelling.</p>
</section>
<section id="multi-task-fine-tuning" class="level3">
<h3 class="anchored" data-anchor-id="multi-task-fine-tuning">55. Multi-Task Fine-Tuning</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_55.png" class="img-fluid figure-img"></p>
<figcaption>Slide 55</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=1042s">Timestamp: 17:22</a>)</p>
<p>Rajiv expands the concept. We don’t just fine-tune for one task; we fine-tune for <strong>Topic Classification</strong> as well.</p>
<p>The key insight is that <strong>one model</strong> can now solve multiple problems. Unlike the “Old NLP” where you needed separate models, the LLM can swap between tasks based on the instruction.</p>
</section>
<section id="translation-task" class="level3">
<h3 class="anchored" data-anchor-id="translation-task">56. Translation Task</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_56.png" class="img-fluid figure-img"></p>
<figcaption>Slide 56</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=1044s">Timestamp: 17:24</a>)</p>
<p>The slide adds <strong>Translation</strong> to the mix, using about 10,000 examples.</p>
<p>This reinforces the “General Purpose” nature of LLMs. They are Swiss Army knives for text.</p>
</section>
<section id="generalization-to-new-tasks" class="level3">
<h3 class="anchored" data-anchor-id="generalization-to-new-tasks">57. Generalization to New Tasks</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_57.png" class="img-fluid figure-img"></p>
<figcaption>Slide 57</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=1050s">Timestamp: 17:30</a>)</p>
<p>Rajiv poses a challenge: What happens if you give the model a task it <strong>hasn’t</strong> seen before?</p>
<p>The slide indicates the model will try to solve it. This is the breakthrough of <strong>Generalization</strong>. Because it understands language so well, it can interpolate and attempt tasks it wasn’t explicitly trained on.</p>
</section>
<section id="practical-applications" class="level3">
<h3 class="anchored" data-anchor-id="practical-applications">58. Practical Applications</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_58.png" class="img-fluid figure-img"></p>
<figcaption>Slide 58</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=1075s">Timestamp: 17:55</a>)</p>
<p>This slide showcases the wide array of use cases: Code explanation, Creative writing, Information extraction, etc.</p>
<p>Rajiv explains that these capabilities exist because we have “trained these models to follow instructions.” This is why we can talk to them via <strong>Prompts</strong>.</p>
</section>
<section id="zero-shot-learning" class="level3">
<h3 class="anchored" data-anchor-id="zero-shot-learning">59. Zero Shot Learning</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_59.png" class="img-fluid figure-img"></p>
<figcaption>Slide 59</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=1099s">Timestamp: 18:19</a>)</p>
<p>The slide introduces <strong>“Zero shot learning”</strong> and <strong>“Prompting.”</strong></p>
<p>This is the ability to get a result without showing the model any examples (zero shots). Rajiv notes that there is a “whole language” around prompting, but fundamentally, it’s just giving the model the instruction we trained it to expect.</p>
</section>
<section id="weeks-vs.-days" class="level3">
<h3 class="anchored" data-anchor-id="weeks-vs.-days">60. Weeks vs.&nbsp;Days</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_60.png" class="img-fluid figure-img"></p>
<figcaption>Slide 60</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=1121s">Timestamp: 18:41</a>)</p>
<p>A comparison slide contrasts “Training a ML Model (weeks)” with “Prompting a LLM (days).”</p>
<p>Rajiv highlights the efficiency shift. In the old days, solving a sentiment problem meant weeks of data collection and training. Now, it takes minutes to write a prompt. This is a massive productivity booster for NLP tasks.</p>
</section>
<section id="reasoning-and-planning" class="level3">
<h3 class="anchored" data-anchor-id="reasoning-and-planning">61. Reasoning and Planning</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_61.png" class="img-fluid figure-img"></p>
<figcaption>Slide 61</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=1165s">Timestamp: 19:25</a>)</p>
<p>The presentation pivots to the limitations of LLMs, specifically regarding <strong>Reasoning and Planning</strong>. The slide shows a “Block Stacking” puzzle.</p>
<p>Rajiv explains that stacking blocks requires planning several steps ahead. It is not a one-step prediction problem; it requires maintaining a state of the world in memory.</p>
</section>
<section id="mystery-world-failure" class="level3">
<h3 class="anchored" data-anchor-id="mystery-world-failure">62. Mystery World Failure</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_62.png" class="img-fluid figure-img"></p>
<figcaption>Slide 62</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=1240s">Timestamp: 20:40</a>)</p>
<p>The slide introduces <strong>“Mystery World,”</strong> a variation of the block problem where the names of the blocks are changed to random words.</p>
<p>While a human (or a 4-year-old) understands that changing the name doesn’t change the physics of stacking, <strong>GPT-4 fails</strong> (3% accuracy). Rajiv explains that the model gets distracted by the creative aspect of the words and loses the logical thread. It shows these models struggle with abstract reasoning.</p>
</section>
<section id="recipe-step-3-aligned-model" class="level3">
<h3 class="anchored" data-anchor-id="recipe-step-3-aligned-model">63. Recipe Step 3: Aligned Model</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_63.png" class="img-fluid figure-img"></p>
<figcaption>Slide 63</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=1316s">Timestamp: 21:56</a>)</p>
<p>The final step in the recipe is the <strong>“Aligned Model.”</strong></p>
<p>Rajiv introduces the need for safety and helpfulness. A model that follows instructions perfectly might follow <em>bad</em> instructions. We need to align it with human values.</p>
</section>
<section id="galactica-science-llm" class="level3">
<h3 class="anchored" data-anchor-id="galactica-science-llm">64. Galactica: Science LLM</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_64.png" class="img-fluid figure-img"></p>
<figcaption>Slide 64</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=1320s">Timestamp: 22:00</a>)</p>
<p>The slide presents <strong>Galactica</strong>, a model released by Meta focused on science.</p>
<p>Rajiv describes the intent: a helpful assistant for researchers to write code, summarize papers, and generate scientific content. It was meant to be a specialized tool.</p>
</section>
<section id="galactica-output" class="level3">
<h3 class="anchored" data-anchor-id="galactica-output">65. Galactica Output</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_65.png" class="img-fluid figure-img"></p>
<figcaption>Slide 65</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=1340s">Timestamp: 22:20</a>)</p>
<p>An example of Galactica’s output shows it generating technical content.</p>
<p>Rajiv highlights the potential utility. It looked like a powerful tool for accelerating scientific discovery.</p>
</section>
<section id="galactica-pulled" class="level3">
<h3 class="anchored" data-anchor-id="galactica-pulled">66. Galactica Pulled</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_66.png" class="img-fluid figure-img"></p>
<figcaption>Slide 66</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=1367s">Timestamp: 22:47</a>)</p>
<p>The slide reveals that Meta <strong>pulled the model</strong> shortly after release.</p>
<p>Rajiv explains why: users found they could ask it for the “benefits of eating crushed glass” or “benefits of suicide,” and the model would happily generate a scientific-sounding justification. It lacked a safety layer. This incident underscored the necessity of <strong>Red Teaming</strong> and alignment before release.</p>
</section>
<section id="learning-what-is-helpful" class="level3">
<h3 class="anchored" data-anchor-id="learning-what-is-helpful">67. Learning What is Helpful</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_67.png" class="img-fluid figure-img"></p>
<figcaption>Slide 67</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=1439s">Timestamp: 23:59</a>)</p>
<p>To explain how we define “helpful,” Rajiv shows a <strong>Stack Overflow</strong> question.</p>
<p>He notes that defining “helpful” mathematically is difficult. Unlike “square footage,” helpfulness is subjective and nuanced.</p>
</section>
<section id="technical-answer" class="level3">
<h3 class="anchored" data-anchor-id="technical-answer">68. Technical Answer</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_68.png" class="img-fluid figure-img"></p>
<figcaption>Slide 68</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=1445s">Timestamp: 24:05</a>)</p>
<p>The slide shows a detailed technical answer.</p>
<p>Rajiv points out that trying to create a “feature list” for what makes this answer helpful is nearly impossible. We can’t write a rule-based program to detect helpfulness.</p>
</section>
<section id="the-dating-app-analogy" class="level3">
<h3 class="anchored" data-anchor-id="the-dating-app-analogy">69. The Dating App Analogy</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_69.png" class="img-fluid figure-img"></p>
<figcaption>Slide 69</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=1466s">Timestamp: 24:26</a>)</p>
<p>Rajiv uses a humorous <strong>Dating App</strong> analogy. He compares the “Old Way” (filling out long compatibility forms/features) with the “New Way” (Swiping).</p>
<p>He explains that <strong>Swiping</strong> is a way of capturing human preferences without asking the user to explicitly define them. This is how we teach AI what is helpful.</p>
</section>
<section id="collect-human-feedback" class="level3">
<h3 class="anchored" data-anchor-id="collect-human-feedback">70. Collect Human Feedback</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_70.png" class="img-fluid figure-img"></p>
<figcaption>Slide 70</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=1495s">Timestamp: 24:55</a>)</p>
<p>The slide details the process: <strong>“Collect Human Feedback.”</strong></p>
<p>We present the model with two options and ask a human, “Which is better?” By collecting thousands of these “swipes,” we build a dataset of human preference.</p>
</section>
<section id="rlhf-reinforcement-learning-from-human-feedback" class="level3">
<h3 class="anchored" data-anchor-id="rlhf-reinforcement-learning-from-human-feedback">71. RLHF (Reinforcement Learning from Human Feedback)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_71.png" class="img-fluid figure-img"></p>
<figcaption>Slide 71</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=1505s">Timestamp: 25:05</a>)</p>
<p>This slide introduces the technical term: <strong>RLHF</strong>.</p>
<p>Rajiv explains this is the layer that turns a raw instruction-following model into a safe, helpful product like ChatGPT. It is an active curation process, similar to curating an Instagram feed.</p>
</section>
<section id="the-makeover-example" class="level3">
<h3 class="anchored" data-anchor-id="the-makeover-example">72. The Makeover Example</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_72.png" class="img-fluid figure-img"></p>
<figcaption>Slide 72</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=1527s">Timestamp: 25:27</a>)</p>
<p>A “Before and After” makeover image illustrates the effect of RLHF.</p>
<p>The “Before” is the raw model (messy, potentially harmful). The “After” is the aligned model (polished, safe, presentable).</p>
</section>
<section id="tuning-responses" class="level3">
<h3 class="anchored" data-anchor-id="tuning-responses">73. Tuning Responses</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_73.png" class="img-fluid figure-img"></p>
<figcaption>Slide 73</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=1544s">Timestamp: 25:44</a>)</p>
<p>The slide shows different ways an AI can answer a question: <strong>Sycophantic</strong> (sucking up to the user), <strong>Baseline Truthful</strong> (blunt), or <strong>Helpful Truthful</strong>.</p>
<p>Rajiv notes we can train models to have specific personalities. We can make them polite, or we can make them “kiss your butt” if the user wants validation.</p>
</section>
<section id="ai-conversations" class="level3">
<h3 class="anchored" data-anchor-id="ai-conversations">74. AI Conversations</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_74.png" class="img-fluid figure-img"></p>
<figcaption>Slide 74</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=1577s">Timestamp: 26:17</a>)</p>
<p>This slide references <strong>Character.ai</strong> and the trend of people spending hours talking to AI personas.</p>
<p>Rajiv mentions research showing people sometimes <strong>prefer AI doctors</strong> over human ones because the AI is patient, listens, and is polite (due to alignment). This suggests a future where AI handles high-touch conversational roles.</p>
</section>
<section id="the-full-recipe" class="level3">
<h3 class="anchored" data-anchor-id="the-full-recipe">75. The Full Recipe</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_75.png" class="img-fluid figure-img"></p>
<figcaption>Slide 75</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=1639s">Timestamp: 27:19</a>)</p>
<p>The presentation summarizes the full pipeline: <strong>Foundation Model -&gt; Instruction Fine-Tuned -&gt; Aligned Model</strong>.</p>
<p>This visual recap cements the three-stage process in the audience’s mind.</p>
</section>
<section id="learning-mechanisms-recap" class="level3">
<h3 class="anchored" data-anchor-id="learning-mechanisms-recap">76. Learning Mechanisms Recap</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_76.png" class="img-fluid figure-img"></p>
<figcaption>Slide 76</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=1643s">Timestamp: 27:23</a>)</p>
<p>Rajiv maps the learning mechanisms to the stages: 1. <strong>Next Word Prediction</strong> (Foundation) 2. <strong>Multi-task Training</strong> (Instruction) 3. <strong>Human Preferences</strong> (Alignment)</p>
<p>He reiterates that understanding these three mechanics helps explain why the models behave the way they do (hallucinations, ability to code, politeness).</p>
</section>
<section id="key-takeaways" class="level3">
<h3 class="anchored" data-anchor-id="key-takeaways">77. Key Takeaways</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_77.png" class="img-fluid figure-img"></p>
<figcaption>Slide 77</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/6NuGEukBfcA&amp;t=1663s">Timestamp: 27:43</a>)</p>
<p>The presentation transitions to the conclusion with three main takeaways: 1. <strong>Measure Twice</strong> 2. <strong>Respect Scale</strong> 3. <strong>Critical Thinking</strong></p>
<p>Rajiv notes in the video that he skimmed these in the original talk, but the slides provide the detail for how to work effectively with AI.</p>
</section>
<section id="measure-twice-benchmarks" class="level3">
<h3 class="anchored" data-anchor-id="measure-twice-benchmarks">78. Measure Twice (Benchmarks)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_78.png" class="img-fluid figure-img"></p>
<figcaption>Slide 78</figcaption>
</figure>
</div>
<p>([Timestamp: End of Transcript])</p>
<p>This slide displays a collage of AI benchmarks (MMLU, HumanEval, etc.).</p>
<p>The concept “Measure Twice” emphasizes that because AI models are probabilistic and prone to hallucination, we cannot trust them blindly. We must rely on rigorous benchmarking to understand their capabilities and failures before deployment.</p>
</section>
<section id="targets-for-evaluation" class="level3">
<h3 class="anchored" data-anchor-id="targets-for-evaluation">79. Targets for Evaluation</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_79.png" class="img-fluid figure-img"></p>
<figcaption>Slide 79</figcaption>
</figure>
</div>
<p>([Timestamp: End of Transcript])</p>
<p>This slide likely elaborates on the need for clear <strong>“targets”</strong> or ground truth when evaluating models.</p>
<p>You cannot improve what you cannot measure. In the context of “Prompt Engineering,” this means you shouldn’t just tweak prompts randomly; you need a systematic way to measure if a prompt change actually improved the output.</p>
</section>
<section id="respect-scale" class="level3">
<h3 class="anchored" data-anchor-id="respect-scale">80. Respect Scale</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_80.png" class="img-fluid figure-img"></p>
<figcaption>Slide 80</figcaption>
</figure>
</div>
<p>([Timestamp: End of Transcript])</p>
<p>This slide illustrates the exponential growth in <strong>single-chip inference performance</strong>.</p>
<p>“Respect Scale” refers to the lesson that betting against hardware and data scaling is usually a losing bet. The capabilities of these models grow faster than our intuition expects.</p>
</section>
<section id="the-scaling-lesson-humans" class="level3">
<h3 class="anchored" data-anchor-id="the-scaling-lesson-humans">81. The Scaling Lesson (Humans)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_81.png" class="img-fluid figure-img"></p>
<figcaption>Slide 81</figcaption>
</figure>
</div>
<p>([Timestamp: End of Transcript])</p>
<p>This slide likely discusses how human expertise fits into the scaling laws. As technology scales, the role of the human shifts from doing the work to evaluating the work.</p>
</section>
<section id="the-plateau" class="level3">
<h3 class="anchored" data-anchor-id="the-plateau">82. The Plateau</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_82.png" class="img-fluid figure-img"></p>
<figcaption>Slide 82</figcaption>
</figure>
</div>
<p>([Timestamp: End of Transcript])</p>
<p>A visual showing that human contribution or specific “hacks” tend to <strong>plateau</strong>, whereas general-purpose methods that leverage scale (like Transformers) continue to improve.</p>
<p>This reinforces the “Bitter Lesson”: specialized, hand-crafted solutions eventually lose to general methods that can consume more compute.</p>
</section>
<section id="alexnet-vs-transformers" class="level3">
<h3 class="anchored" data-anchor-id="alexnet-vs-transformers">83. AlexNet vs Transformers</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_83.png" class="img-fluid figure-img"></p>
<figcaption>Slide 83</figcaption>
</figure>
</div>
<p>([Timestamp: End of Transcript])</p>
<p>A comparison between <strong>AlexNet</strong> (the start of the deep learning era) and <strong>Transformers</strong> (the current era).</p>
<p>It highlights the massive increase: <strong>10,000x more data</strong> and <strong>1,000x more compute</strong>. This illustrates that the fundamental driver of progress has been scale.</p>
</section>
<section id="the-bitter-lesson" class="level3">
<h3 class="anchored" data-anchor-id="the-bitter-lesson">84. The Bitter Lesson</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_84.png" class="img-fluid figure-img"></p>
<figcaption>Slide 84</figcaption>
</figure>
</div>
<p>([Timestamp: End of Transcript])</p>
<p>This slide explicitly references Rich Sutton’s <strong>“The Bitter Lesson.”</strong></p>
<p>The lesson is that researchers often try to build their knowledge into the system (like Jelinek’s linguists), but in the long run, the only thing that matters is leveraging computation. AI succeeds when we stop trying to teach it <em>how</em> to think and just give it enough power to learn on its own.</p>
</section>
<section id="text-to-sql" class="level3">
<h3 class="anchored" data-anchor-id="text-to-sql">85. Text to SQL</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_85.png" class="img-fluid figure-img"></p>
<figcaption>Slide 85</figcaption>
</figure>
</div>
<p>([Timestamp: End of Transcript])</p>
<p>The slide examines <strong>Text to SQL</strong>, a common enterprise use case. It compares AI performance to human experts.</p>
<p>It notes that while AI is good, humans still achieve higher exact match accuracy. This nuances the “Respect Scale” argument—for high-precision tasks, human oversight is still required.</p>
</section>
<section id="critical-thinking" class="level3">
<h3 class="anchored" data-anchor-id="critical-thinking">86. Critical Thinking</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_86.png" class="img-fluid figure-img"></p>
<figcaption>Slide 86</figcaption>
</figure>
</div>
<p>([Timestamp: End of Transcript])</p>
<p>The final takeaway is <strong>“Critical Thinking.”</strong></p>
<p>In an age where AI can generate convincing but false information, human judgment becomes the most valuable skill. We must critically evaluate the outputs of these models.</p>
</section>
<section id="predictions-and-concerns" class="level3">
<h3 class="anchored" data-anchor-id="predictions-and-concerns">87. Predictions and Concerns</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_87.png" class="img-fluid figure-img"></p>
<figcaption>Slide 87</figcaption>
</figure>
</div>
<p>([Timestamp: End of Transcript])</p>
<p>This slide recaps expert predictions, ranging from job displacement to existential threats.</p>
<p>It serves as a reminder that even experts disagree on the timeline and impact, reinforcing the need for individual critical thinking rather than blind faith in pundits.</p>
</section>
<section id="practical-limits-bezos-and-alexa" class="level3">
<h3 class="anchored" data-anchor-id="practical-limits-bezos-and-alexa">88. Practical Limits: Bezos and Alexa</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_88.png" class="img-fluid figure-img"></p>
<figcaption>Slide 88</figcaption>
</figure>
</div>
<p>([Timestamp: End of Transcript])</p>
<p>A humorous slide showing <strong>Jeff Bezos</strong> and <strong>Alexa</strong>. It likely references an instance where Alexa failed to understand a simple context despite Amazon’s massive resources.</p>
<p>This illustrates the <strong>“Practical Limits of Learning.”</strong> Despite the hype, current AI still struggles with basic context that humans find trivial.</p>
</section>
<section id="autonomous-driving-limits" class="level3">
<h3 class="anchored" data-anchor-id="autonomous-driving-limits">89. Autonomous Driving Limits</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_89.png" class="img-fluid figure-img"></p>
<figcaption>Slide 89</figcaption>
</figure>
</div>
<p>([Timestamp: End of Transcript])</p>
<p>Images of an autonomous driving interface and a car accident.</p>
<p>This points out that in high-stakes physical environments, “99% accuracy” isn’t enough. The “long tail” of edge cases remains a massive hurdle for AI.</p>
</section>
<section id="chatbot-failures" class="level3">
<h3 class="anchored" data-anchor-id="chatbot-failures">90. Chatbot Failures</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_90.png" class="img-fluid figure-img"></p>
<figcaption>Slide 90</figcaption>
</figure>
</div>
<p>([Timestamp: End of Transcript])</p>
<p>Examples of chatbots failing simple math or making <strong>“legally binding offers”</strong> (referencing the Air Canada chatbot lawsuit).</p>
<p>This warns against deploying these models in critical business flows without guardrails. They can confidently make costly mistakes.</p>
</section>
<section id="interaction-principles" class="level3">
<h3 class="anchored" data-anchor-id="interaction-principles">91. Interaction Principles</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_91.png" class="img-fluid figure-img"></p>
<figcaption>Slide 91</figcaption>
</figure>
</div>
<p>([Timestamp: End of Transcript])</p>
<p>This slide summarizes the three principles for interacting with AI: <strong>“Measure twice,” “Respect scale,”</strong> and <strong>“Think critically.”</strong></p>
<p>It acts as the final instructional slide, giving the audience a mantra for navigating the AI landscape.</p>
</section>
<section id="evolution-of-generative-capabilities" class="level3">
<h3 class="anchored" data-anchor-id="evolution-of-generative-capabilities">92. Evolution of Generative Capabilities</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_92.png" class="img-fluid figure-img"></p>
<figcaption>Slide 92</figcaption>
</figure>
</div>
<p>([Timestamp: End of Transcript])</p>
<p>The slide shows a series of <strong>Unicorn images</strong> generated by GPT-4 over time.</p>
<p>This visualizes the rapid improvement in generative capabilities. Just as the “Sparks of AGI” images improved, the fidelity of these outputs continues to evolve, reminding us that we are looking at a moving target.</p>
</section>
<section id="conclusion" class="level3">
<h3 class="anchored" data-anchor-id="conclusion">93. Conclusion</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_93.png" class="img-fluid figure-img"></p>
<figcaption>Slide 93</figcaption>
</figure>
</div>
<p>([Timestamp: End of Transcript])</p>
<p>The final slide concludes the presentation with Rajiv Shah’s name and affiliation (Snowflake).</p>
<p>It wraps up the narrative: from the spark of Transfer Learning to the fire of the Generative AI revolution, offering a practical, technical, and critical perspective on the technology shaping our future.</p>
<hr>
<p><em>This annotated presentation was generated from the talk using AI-assisted tools. Each slide includes timestamps and detailed explanations.</em></p>


</section>
</section>

 ]]></description>
  <category>Transfer Learning</category>
  <category>AI</category>
  <category>LLM</category>
  <category>Deep Learning</category>
  <category>Annotated Talk</category>
  <guid>https://rajivshah.com/blog/spark-of-ai-transfer-learning.html</guid>
  <pubDate>Fri, 20 Sep 2024 05:00:00 GMT</pubDate>
  <media:content url="https://rajivshah.com/blog/images/spark-of-ai-transfer-learning/slide_1.png" medium="image" type="image/png"/>
</item>
<item>
  <title>Practical Lessons in Building Generative AI: RAG and Text to SQL</title>
  <link>https://rajivshah.com/blog/practical-rag-text-to-sql.html</link>
  <description><![CDATA[ 






<section id="video" class="level2">
<h2 class="anchored" data-anchor-id="video">Video</h2>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/OyY4uxUShys" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<p>Watch the <a href="https://youtu.be/OyY4uxUShys">full video</a></p>
<hr>
</section>
<section id="annotated-presentation" class="level2">
<h2 class="anchored" data-anchor-id="annotated-presentation">Annotated Presentation</h2>
<p>Below is an annotated version of the presentation, with timestamped links to the relevant parts of the video for each slide.</p>
<p>Here is the annotated presentation based on the video transcript and slide summaries.</p>
<section id="title-slide-a-practical-perspective-on-generative-ai" class="level3">
<h3 class="anchored" data-anchor-id="title-slide-a-practical-perspective-on-generative-ai">1. Title Slide: A Practical Perspective on Generative AI</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_1.png" class="img-fluid figure-img"></p>
<figcaption>Slide 1</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=1s">Timestamp: 00:01</a>)</p>
<p>This presentation begins with an introduction by Rajiv Shah from Snowflake. The talk focuses on distinguishing “what’s easy to do with LLMs, what’s hard to do with LLMs, and where that boundary is for generative AI.” The content is framed as a practical guide for enterprises navigating the hype versus the reality of implementing these technologies.</p>
<p>The speaker sets the stage for a narrative-driven presentation that will move away from abstract theory and into concrete examples. The goal is to walk through the basics of <strong>Large Language Models (LLMs)</strong> and <strong>Retrieval Augmented Generation (RAG)</strong> before applying them to real-world scenarios involving legal research and enterprise data analysis.</p>
</section>
<section id="presentation-goals" class="level3">
<h3 class="anchored" data-anchor-id="presentation-goals">2. Presentation Goals</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_2.png" class="img-fluid figure-img"></p>
<figcaption>Slide 2</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=31s">Timestamp: 00:31</a>)</p>
<p>The agenda for the talk is outlined here. The speaker intends to cover the foundational mechanisms of how to use LLMs effectively, specifically focusing on RAG. To make the concepts relatable, the presentation uses two storytelling devices: a fictional law firm (“Dewey, Cheatham, and Howe”) and a hypothetical company (“Frosty”).</p>
<p>These two stories serve to illustrate how people are currently using Generative AI, the specific limitations they encounter, and the engineering required to build a robust application. The speaker emphasizes that the talk will explore “what does it take to actually develop a generative AI application” beyond just simple prompting.</p>
</section>
<section id="the-avianca-case" class="level3">
<h3 class="anchored" data-anchor-id="the-avianca-case">3. The Avianca Case</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_3.png" class="img-fluid figure-img"></p>
<figcaption>Slide 3</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=60s">Timestamp: 01:00</a>)</p>
<p>The speaker introduces the concept of <strong>hallucinations</strong> through a famous real-world example involving the airline Avianca. A lawyer, attempting to speed up his work on a brief regarding a personal injury case, used ChatGPT for legal research. The AI “found some cases that were unpublished,” which the lawyer cited in court.</p>
<p>However, ChatGPT had “made up those cases.” The lawyer was admonished by the bar for submitting fictitious legal precedents. This slide serves as a warning: while LLMs are powerful tools, they cannot be blindly trusted for factual research because they are prone to fabricating information when they don’t know the answer.</p>
</section>
<section id="generative-ai-in-action" class="level3">
<h3 class="anchored" data-anchor-id="generative-ai-in-action">4. Generative AI in Action</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_4.png" class="img-fluid figure-img"></p>
<figcaption>Slide 4</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=134s">Timestamp: 02:14</a>)</p>
<p>To demonstrate the variability of LLMs, the speaker presents a side-by-side comparison of two models (Google Gemma and a “Woflesh” model) answering the same prompt: “How many vehicles will Rivian manufacture in Normal, Illinois?” The models provide different answers.</p>
<p>This illustrates a key characteristic of Generative AI: “Two different manufacturers, two different methods for training these models are probably going to lead to two different results.” It highlights that out-of-the-box models rely on their specific training data, which may be outdated or weighted differently, leading to inconsistent factual accuracy.</p>
</section>
<section id="next-token-prediction" class="level3">
<h3 class="anchored" data-anchor-id="next-token-prediction">5. Next Token Prediction</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_5.png" class="img-fluid figure-img"></p>
<figcaption>Slide 5</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=182s">Timestamp: 03:02</a>)</p>
<p>This technical diagram explains <em>why</em> models hallucinate. The speaker clarifies that LLMs function by trying to <strong>predict the next word or token</strong> based on statistical likelihood. They are not databases of facts; they are engines designed to construct coherent sentences.</p>
<p>“They’re not worried about truth and false; they’re really trying to tell what the most cohesive, coherent story is.” Because the model is optimizing for the most probable next word to complete a pattern, it will confidently generate plausible-sounding but factually incorrect information if that sequence of words is statistically likely.</p>
</section>
<section id="llm-mistakes" class="level3">
<h3 class="anchored" data-anchor-id="llm-mistakes">6. LLM Mistakes</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_6.png" class="img-fluid figure-img"></p>
<figcaption>Slide 6</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=210s">Timestamp: 03:30</a>)</p>
<p>Here, the speaker provides examples of the “Next Token Prediction” logic failing to provide truth. If asked for the “Capital of Mars,” the model doesn’t know Mars has no capital; it simply tries to “complete that story” by inventing a name. Similarly, when asked to perform math, the model isn’t calculating; it is predicting the next characters in a math-like sequence.</p>
<p>The slide shows the model failing at basic arithmetic because “it looks like it’s read too many release notes, not actually enough math.” This reinforces that LLMs are linguistic tools, not calculators or knowledge bases, and they lack an internal concept of “fictional” versus “factual.”</p>
</section>
<section id="risks-for-enterprises" class="level3">
<h3 class="anchored" data-anchor-id="risks-for-enterprises">7. Risks for Enterprises</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_7.png" class="img-fluid figure-img"></p>
<figcaption>Slide 7</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=243s">Timestamp: 04:03</a>)</p>
<p>This slide highlights the liability risks for companies, citing the <strong>Air Canada chatbot case</strong>. In this instance, a chatbot invented a refund policy that did not exist. When the customer sued, the airline argued the chatbot was responsible, but the tribunal ruled the company was liable for its agent’s statements.</p>
<p>The speaker notes, “We’re going to treat this chatbot just like one of your employees… you’re responsible for what this model says.” This legal precedent explains why enterprises are hesitant to deploy Gen AI and why “Gen AI committees” are forming to manage governance and risk before public deployment.</p>
</section>
<section id="retrieval-augmented-generation-rag" class="level3">
<h3 class="anchored" data-anchor-id="retrieval-augmented-generation-rag">8. Retrieval-Augmented Generation (RAG)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_8.png" class="img-fluid figure-img"></p>
<figcaption>Slide 8</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=291s">Timestamp: 04:51</a>)</p>
<p>To solve the hallucination problem, the presentation introduces <strong>Retrieval-Augmented Generation (RAG)</strong>. The speaker describes this as a solution from academia designed to “ground” the model. Instead of relying solely on the model’s internal training data, RAG surrounds the model with external context.</p>
<p>The core idea is simple: “We’re going to ground it with information so it uses that information in answering the question.” This technique attempts to bridge the gap between the model’s linguistic capabilities and the need for factual accuracy in enterprise applications.</p>
</section>
<section id="how-rag-works" class="level3">
<h3 class="anchored" data-anchor-id="how-rag-works">9. How RAG Works</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_9.png" class="img-fluid figure-img"></p>
<figcaption>Slide 9</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=320s">Timestamp: 05:20</a>)</p>
<p>This diagram breaks down the RAG architecture. When a user asks a question, the system does not send it directly to the LLM. First, it goes out to “search and look for is there relevant information that’s related to this question.”</p>
<p>Once relevant documents are collected from a knowledge base, they are bundled with the original question and sent to the LLM. The LLM then generates an answer based <em>only</em> on that provided context. This ensures the “final answer is grounded” by factual documents rather than the model’s statistical predictions alone.</p>
</section>
<section id="grounding-with-10-k-forms" class="level3">
<h3 class="anchored" data-anchor-id="grounding-with-10-k-forms">10. Grounding with 10-K Forms</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_10.png" class="img-fluid figure-img"></p>
<figcaption>Slide 10</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=351s">Timestamp: 05:51</a>)</p>
<p>The speaker sets up a practical RAG demonstration using <strong>10-K forms</strong> (annual reports filed by public companies). These documents are chosen because “you can trust that they’re factual.”</p>
<p>This slide prepares the audience to see how the previous question about Rivian’s manufacturing capacity—which generated inconsistent answers earlier—can be answered accurately when the model is forced to look at Rivian’s official financial filings.</p>
</section>
<section id="rivian-manufacturing-answer" class="level3">
<h3 class="anchored" data-anchor-id="rivian-manufacturing-answer">11. Rivian Manufacturing Answer</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_11.png" class="img-fluid figure-img"></p>
<figcaption>Slide 11</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=367s">Timestamp: 06:07</a>)</p>
<p>The slide shows the output of a RAG application. The question “How many vehicles do you manufacture in Normal?” is asked again. This time, the application provides a specific, fact-based answer derived from the uploaded documents.</p>
<p>This demonstrates the immediate utility of RAG: it turns the LLM from a creative writing engine into a synthesis engine that can read specific enterprise documents and extract the correct answer, mitigating the hallucination issues seen in Slide 4.</p>
</section>
<section id="context-and-citations" class="level3">
<h3 class="anchored" data-anchor-id="context-and-citations">12. Context and Citations</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_12.png" class="img-fluid figure-img"></p>
<figcaption>Slide 12</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=381s">Timestamp: 06:21</a>)</p>
<p>A critical feature of RAG is displayed here: <strong>Citations</strong>. The application shows exactly which document the answer came from. The speaker notes, “I can see exactly what’s the document that this answer came from… a nice source.”</p>
<p>This transparency is why RAG is the “number one most popular generative AI application.” It allows users to verify the AI’s work, building trust in the system—something impossible with a standard “black box” LLM response.</p>
</section>
<section id="chatbot-for-legal-research" class="level3">
<h3 class="anchored" data-anchor-id="chatbot-for-legal-research">13. Chatbot for Legal Research</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_13.png" class="img-fluid figure-img"></p>
<figcaption>Slide 13</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=420s">Timestamp: 07:00</a>)</p>
<p>The narrative shifts to the fictional law firm “Dewey, Cheatham, and Howe.” The firm wants to use AI to reduce the heavy workload of legal research. The initial thought process is to use raw LLMs because they are knowledgeable.</p>
<p>The speaker introduces a colleague who assumes, “I know it could pass the bar exam… why don’t I just wire it up directly?” This sets up the common misconception that because a model has general knowledge (passing a test), it is suitable for specialized professional work without further engineering.</p>
</section>
<section id="gpt-models-on-the-bar-exam" class="level3">
<h3 class="anchored" data-anchor-id="gpt-models-on-the-bar-exam">14. GPT Models on the Bar Exam</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_14.png" class="img-fluid figure-img"></p>
<figcaption>Slide 14</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=442s">Timestamp: 07:22</a>)</p>
<p>This chart reinforces the previous assumption, showing the progression of GPT models on the <strong>Multistate Bar Exam (MBE)</strong>. GPT-4 significantly outperforms its predecessors, achieving a passing score.</p>
<p>While this suggests the model “knows something about the law,” the speaker hints that this is merely a multiple-choice test. Success here does not necessarily translate to the nuance required for actual legal practice, foreshadowing the errors to come in the story.</p>
</section>
<section id="hallucinating-statutes" class="level3">
<h3 class="anchored" data-anchor-id="hallucinating-statutes">15. Hallucinating Statutes</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_15.png" class="img-fluid figure-img"></p>
<figcaption>Slide 15</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=470s">Timestamp: 07:50</a>)</p>
<p>The first failure of the “raw LLM” approach is revealed. A lawyer asks for statutes regarding “online dating services in Connecticut.” The model confidently provides “Connecticut General Statute § 42-290.”</p>
<p>However, the lawyer discovers “there is no statute; this was entirely hallucinated.” Despite passing the bar exam, the model fabricated a law that sounded plausible but did not exist. This forces the firm to pivot toward a RAG approach to ground the AI in real legal literature.</p>
</section>
<section id="lexis-ai" class="level3">
<h3 class="anchored" data-anchor-id="lexis-ai">16. Lexis+ AI</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_16.png" class="img-fluid figure-img"></p>
<figcaption>Slide 16</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=510s">Timestamp: 08:30</a>)</p>
<p>The firm decides to use professional tools. They turn to <strong>Lexis+ AI</strong>, a commercial product that promises “Hallucination-Free Linked Legal Citations.” This tool uses the RAG approach discussed earlier, retrieving from a database of real case law.</p>
<p>The expectation is that by using a trusted vendor with a RAG architecture, the hallucination problem will be solved, and lawyers will receive accurate, citable information.</p>
</section>
<section id="conceptual-hallucinations" class="level3">
<h3 class="anchored" data-anchor-id="conceptual-hallucinations">17. Conceptual Hallucinations</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_17.png" class="img-fluid figure-img"></p>
<figcaption>Slide 17</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=530s">Timestamp: 08:50</a>)</p>
<p>Even with RAG and real citations, a new problem emerges: <strong>Conceptual confusion</strong>. The AI provides a real case but confuses the “Equity Cleanup Doctrine” with the “Doctrine of Clean Hands.” The speaker explains that while the words are similar, the legal concepts are distinct (one is about consolidating claims, the other about a plaintiff’s conduct, illustrated by a joke about P. Diddy).</p>
<p>The model found a document containing the words but failed to understand the <em>meaning</em>. This shows that RAG ensures the <em>document</em> exists, but not necessarily that the <em>reasoning</em> or application of that document is correct.</p>
</section>
<section id="the-fictional-judge" class="level3">
<h3 class="anchored" data-anchor-id="the-fictional-judge">18. The Fictional Judge</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_18.png" class="img-fluid figure-img"></p>
<figcaption>Slide 18</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=590s">Timestamp: 09:50</a>)</p>
<p>The model’s failure deepens with an example of an “inside joke.” A lawyer asks for opinions by “Judge Luther A. Wilgarten.” Wilgarten is a fictional judge created as a prank in law reviews.</p>
<p>The AI, treating the law reviews as factual text, retrieves “cases” by this fake judge. It fails to distinguish between a real judicial opinion and a satirical article within its knowledge base. This illustrates the “garbage in, garbage out” risk even within RAG systems if the model cannot discern the nature of the source material.</p>
</section>
<section id="hallucination-rates-in-legal-ai" class="level3">
<h3 class="anchored" data-anchor-id="hallucination-rates-in-legal-ai">19. Hallucination Rates in Legal AI</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_19.png" class="img-fluid figure-img"></p>
<figcaption>Slide 19</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=644s">Timestamp: 10:44</a>)</p>
<p>The speaker references a Stanford paper analyzing hallucination rates across major legal AI tools (Lexis, Westlaw, GPT-4). The chart shows that these tools still hallucinate or provide incomplete answers <strong>17% to 33% of the time</strong>.</p>
<p>This data point serves as a reality check: “These models hallucinate using real questions.” Despite marketing claims of being “hallucination-free,” the complexity of the domain means that errors are still frequent, posing significant risks for professional use.</p>
</section>
<section id="limits-of-rag" class="level3">
<h3 class="anchored" data-anchor-id="limits-of-rag">20. Limits of RAG</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_20.png" class="img-fluid figure-img"></p>
<figcaption>Slide 20</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=663s">Timestamp: 11:03</a>)</p>
<p>This slide summarizes the limitations discovered in the legal example. RAG works well when documents are “True, Authoritative, and Applicable.” However, in complex domains like law, these attributes are often contested.</p>
<p>“Sometimes all these things are very contested and it gets really hard to separate it.” If the underlying documents contain conflicting information, satire, or outdated facts, the RAG system (which assumes retrieved text is “truth”) will propagate those errors to the user.</p>
</section>
<section id="why-legal-is-hard" class="level3">
<h3 class="anchored" data-anchor-id="why-legal-is-hard">21. Why Legal is Hard</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_21.png" class="img-fluid figure-img"></p>
<figcaption>Slide 21</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=688s">Timestamp: 11:28</a>)</p>
<p>The speaker elaborates on the complexity of legal research. It involves navigating different <strong>specialties</strong> (Tort vs.&nbsp;Maritime), <strong>jurisdictions</strong> (Federal vs.&nbsp;State), and <strong>authorities</strong> (Supreme Court vs.&nbsp;Law Reviews). Furthermore, the element of <strong>time</strong> is crucial—knowing if a case has been overturned.</p>
<p>“You really have to have a lot of knowledge to be able to weave everything in and out.” An LLM often lacks the meta-knowledge to weigh these factors, treating a lower court opinion from 1950 with the same weight as a Supreme Court ruling from 2024.</p>
</section>
<section id="conclusion-on-legal-chatbots" class="level3">
<h3 class="anchored" data-anchor-id="conclusion-on-legal-chatbots">22. Conclusion on Legal Chatbots</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_22.png" class="img-fluid figure-img"></p>
<figcaption>Slide 22</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=780s">Timestamp: 13:00</a>)</p>
<p>The conclusion for the legal use case is that human expertise remains essential. While AI can “get you a stack of documents,” you still need “facts people to actually tease out the insights.”</p>
<p>The current state of technology is an aid, not a replacement. The speaker transitions away from the legal example to a new story about building a data application, suggesting that while law is hard, structured data might offer different challenges and solutions.</p>
</section>
<section id="building-generative-ai-text-to-sql" class="level3">
<h3 class="anchored" data-anchor-id="building-generative-ai-text-to-sql">23. Building Generative AI (Text-to-SQL)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_23.png" class="img-fluid figure-img"></p>
<figcaption>Slide 23</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=800s">Timestamp: 13:20</a>)</p>
<p>The presentation shifts to the story of “Frosty,” a company building a <strong>Text-to-SQL</strong> application. The goal is to turn natural language questions (e.g., “How many orders do I have in each state?”) into SQL code that can query a database.</p>
<p>This is a “very common application” for Gen AI, allowing non-technical users to interact with data. This section will focus on the engineering steps required to build this system, moving beyond the simple RAG implementation discussed previously.</p>
</section>
<section id="evaluating-sql-queries" class="level3">
<h3 class="anchored" data-anchor-id="evaluating-sql-queries">24. Evaluating SQL Queries</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_24.png" class="img-fluid figure-img"></p>
<figcaption>Slide 24</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=854s">Timestamp: 14:14</a>)</p>
<p>The first challenge in building this app is <strong>evaluation</strong>. How do you know if the AI’s generated SQL is good? The slide shows a “Gold Standard” query (the correct answer) and a “Candidate SQL” (the AI’s attempt).</p>
<p>In this example, the AI added an extra column (“latitude”) that wasn’t requested. While the query might still work, it isn’t an exact match. The speaker notes, “We really need to have a way to give partial credit,” because simple string matching would mark this helpful addition as a failure.</p>
</section>
<section id="model-based-evaluation" class="level3">
<h3 class="anchored" data-anchor-id="model-based-evaluation">25. Model Based Evaluation</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_25.png" class="img-fluid figure-img"></p>
<figcaption>Slide 25</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=945s">Timestamp: 15:45</a>)</p>
<p>To solve the grading problem at scale, the speaker introduces <strong>Model-Based Evaluation</strong>. This involves using an LLM (like GPT-4) to act as the “judge” for the output of another model.</p>
<p>Instead of humans manually grading thousands of SQL queries, “we’re going to use a large language model to do this.” This allows for nuanced grading (partial credit) that strict code comparison cannot provide.</p>
</section>
<section id="skepticism-of-model-evaluation" class="level3">
<h3 class="anchored" data-anchor-id="skepticism-of-model-evaluation">26. Skepticism of Model Evaluation</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_26.png" class="img-fluid figure-img"></p>
<figcaption>Slide 26</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=959s">Timestamp: 15:59</a>)</p>
<p>The speaker acknowledges the common reaction to this technique: “Is that going to work? I mean that’s like the fox guarding the outhouse.” There is a fear of “model collapse” or circular logic when AI evaluates AI.</p>
<p>Despite this intuition, the speaker assures the audience that this is a standard and effective practice in modern AI development, and proceeds to explain how to implement it correctly.</p>
</section>
<section id="the-evaluation-prompt" class="level3">
<h3 class="anchored" data-anchor-id="the-evaluation-prompt">27. The Evaluation Prompt</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_27.png" class="img-fluid figure-img"></p>
<figcaption>Slide 27</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=974s">Timestamp: 16:14</a>)</p>
<p>This slide reveals the <strong>system prompt</strong> used for the model-based judge. It instructs the LLM to act as a “data quality analyst” and provides a specific grading rubric (0 to 3 scale).</p>
<p>By explicitly defining what constitutes a “Perfect Match,” “Good Match,” or “No Match,” the engineer can control how the AI judges the output. This turns a subjective assessment into a structured, automated process.</p>
</section>
<section id="the-tx-vs-texas-problem" class="level3">
<h3 class="anchored" data-anchor-id="the-tx-vs-texas-problem">28. The “TX” vs “TEXAS” Problem</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_28.png" class="img-fluid figure-img"></p>
<figcaption>Slide 28</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=1010s">Timestamp: 16:50</a>)</p>
<p>A specific example of why strict matching fails. The user asked for data in “Texas.” The database uses the abbreviation ‘TX’, but the AI generated a query looking for ‘TEXAS’.</p>
<p>“It’s a natural mistake here to confuse TX and Texas… but if we go with the strict criteria of that exact match, we don’t get an exact match.” A standard code test would fail this, even though the intent is correct and easily fixable.</p>
</section>
<section id="execution-accuracy-failure" class="level3">
<h3 class="anchored" data-anchor-id="execution-accuracy-failure">29. Execution Accuracy Failure</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_29.png" class="img-fluid figure-img"></p>
<figcaption>Slide 29</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=1036s">Timestamp: 17:16</a>)</p>
<p>This slide confirms that under “Execution Accuracy” (strict matching), the query is a failure (“No Match”). This metric is too harsh for development because it obscures progress; a model that gets the logic right but misses an abbreviation is much better than one that writes gibberish.</p>
</section>
<section id="execution-score-success" class="level3">
<h3 class="anchored" data-anchor-id="execution-score-success">30. Execution Score Success</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_30.png" class="img-fluid figure-img"></p>
<figcaption>Slide 30</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=1050s">Timestamp: 17:30</a>)</p>
<p>Using the <strong>Model-Based Evaluation</strong>, the same ‘TX’ vs ‘TEXAS’ error is graded differently. The “Execution Score” is a “Perfect Match” because the judge recognizes the semantic intent was captured.</p>
<p>“It captures the user’s intent… the user could easily fix this.” This allows developers to optimize the model for logic and reasoning first, handling minor syntax issues separately.</p>
</section>
<section id="correlation-with-other-metrics" class="level3">
<h3 class="anchored" data-anchor-id="correlation-with-other-metrics">31. Correlation with Other Metrics</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_31.png" class="img-fluid figure-img"></p>
<figcaption>Slide 31</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=1070s">Timestamp: 17:50</a>)</p>
<p>The speaker presents data showing a <strong>strong correlation</strong> between the model-based scores and other evaluation methods. When the model judge gives a 5/5, other metrics generally agree.</p>
<p>This validation step is crucial. The engineer in the story checked her results and found “80% were the exact same when she scored them.” This high level of agreement gives confidence in automating the evaluation pipeline.</p>
</section>
<section id="research-on-model-evaluation" class="level3">
<h3 class="anchored" data-anchor-id="research-on-model-evaluation">32. Research on Model Evaluation</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_32.png" class="img-fluid figure-img"></p>
<figcaption>Slide 32</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=1099s">Timestamp: 18:19</a>)</p>
<p>Supporting the anecdote, this slide references broader research indicating that LLMs correlate with human judges about <strong>80% of the time</strong> regarding correctness and readability.</p>
<p>“I got tired of adding research sites here… universally we see that often in many contexts that these large language models correlate about 80% of the time to humans.” This establishes model-based evaluation as an industry standard.</p>
</section>
<section id="initial-benchmark-results" class="level3">
<h3 class="anchored" data-anchor-id="initial-benchmark-results">33. Initial Benchmark Results</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_33.png" class="img-fluid figure-img"></p>
<figcaption>Slide 33</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=1130s">Timestamp: 18:50</a>)</p>
<p>After setting up the evaluation pipeline and creating an <strong>internal enterprise benchmark</strong> (not a public dataset), the initial results are poor: only <strong>33% accuracy</strong>.</p>
<p>The speaker emphasizes the importance of using internal data for benchmarks: “You can’t trust those public data sets… they’re far too easy.” The low score sets the stage for the iterative engineering process required to improve the application.</p>
</section>
<section id="using-multiple-models" class="level3">
<h3 class="anchored" data-anchor-id="using-multiple-models">34. Using Multiple Models</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_34.png" class="img-fluid figure-img"></p>
<figcaption>Slide 34</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=1155s">Timestamp: 19:15</a>)</p>
<p>The first improvement strategy is <strong>Ensembling</strong>. The engineer noticed different models had different strengths, so she combined them.</p>
<p>“In traditional machine learning, we often Ensemble models… she decided to try the same thing here.” By using multiple Text-to-SQL models and combining their outputs, performance improved.</p>
</section>
<section id="error-correction-self-reflection" class="level3">
<h3 class="anchored" data-anchor-id="error-correction-self-reflection">35. Error Correction (Self-Reflection)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_35.png" class="img-fluid figure-img"></p>
<figcaption>Slide 35</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=1180s">Timestamp: 19:40</a>)</p>
<p>The next optimization is <strong>Error Correction</strong> via self-reflection. When the model generates an error, the system asks the model to “reflect upon it” or think “step-by-step.”</p>
<p>“That actually makes the model spend more time thinking about it… and actually they can use all of that to get a better answer.” This technique, often called <strong>Chain of Thought</strong>, leverages the model’s ability to debug its own output when prompted correctly.</p>
</section>
<section id="screening-inputs" class="level3">
<h3 class="anchored" data-anchor-id="screening-inputs">36. Screening Inputs</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_36.png" class="img-fluid figure-img"></p>
<figcaption>Slide 36</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=1230s">Timestamp: 20:30</a>)</p>
<p>Improving the input data is just as important as improving the model. The engineer adds a <strong>Screening</strong> layer to filter out questions that are ambiguous or irrelevant (non-SQL questions).</p>
<p>“She noticed that a lot of what the users were typing in just didn’t make sense.” By catching bad queries early and asking the user for clarification, the system avoids processing garbage data, thereby increasing overall success rates.</p>
</section>
<section id="feature-extraction" class="level3">
<h3 class="anchored" data-anchor-id="feature-extraction">37. Feature Extraction</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_37.png" class="img-fluid figure-img"></p>
<figcaption>Slide 37</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=1300s">Timestamp: 21:40</a>)</p>
<p>Recognizing that different questions require different handling, the engineer implements <strong>Feature Extraction</strong>. A time-series question needs different context than a ranking question.</p>
<p>“If I’m cooking macaroni and cheese I need different ingredients than if I’m making tacos.” The system now identifies the <em>type</em> of question and extracts the specific features (metadata, table schemas) relevant to that type before generating SQL.</p>
</section>
<section id="the-semantic-layer" class="level3">
<h3 class="anchored" data-anchor-id="the-semantic-layer">38. The Semantic Layer</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_38.png" class="img-fluid figure-img"></p>
<figcaption>Slide 38</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=1370s">Timestamp: 22:50</a>)</p>
<p>To bridge the gap between messy enterprise databases and user language, a <strong>Semantic Layer</strong> is added. This involves human experts defining the data structure in business terms.</p>
<p>“We’re going to use the expertise… to give us details of the data structure in a way that deals with all this confusing structure.” This layer translates business logic (e.g., what defines a “churned customer”) into a schema the AI can understand, significantly boosting accuracy.</p>
</section>
<section id="generative-ai-decision-app" class="level3">
<h3 class="anchored" data-anchor-id="generative-ai-decision-app">39. Generative AI Decision App</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_39.png" class="img-fluid figure-img"></p>
<figcaption>Slide 39</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=1450s">Timestamp: 24:10</a>)</p>
<p>This flowchart represents the final, production-grade system. It is no longer just a prompt sent to a model. It includes classification, feature extraction, multiple SQL generation agents, error correction, and a semantic layer.</p>
<p>The lesson is that “Generative AI is not about a data scientist sitting out an Island by themselves… instead it’s building a system like this.” It requires a cross-functional team of analysts, engineers, and domain experts to build a reliable application.</p>
</section>
<section id="the-future-of-ai" class="level3">
<h3 class="anchored" data-anchor-id="the-future-of-ai">40. The Future of AI</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_40.png" class="img-fluid figure-img"></p>
<figcaption>Slide 40</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=1500s">Timestamp: 25:00</a>)</p>
<p>The speaker pivots to the future, acknowledging the rapid pace of innovation from companies like <strong>OpenAI</strong> and <strong>Google DeepMind</strong>. He addresses the audience’s potential skepticism: “The future is you’re just going to be able to just take all your data cram it into one thing it’s just going to solve it all for you.”</p>
<p>This sets up the final section on <strong>Reasoning and Planning</strong>, moving beyond simple retrieval and text generation.</p>
</section>
<section id="can-llms-reason-and-plan-block-world" class="level3">
<h3 class="anchored" data-anchor-id="can-llms-reason-and-plan-block-world">41. Can LLMs Reason and Plan? (Block World)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_41.png" class="img-fluid figure-img"></p>
<figcaption>Slide 41</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=1538s">Timestamp: 25:38</a>)</p>
<p>To test reasoning, the speaker introduces the <strong>Block World</strong> benchmark. The task is to stack colored blocks in a specific order. This requires multi-step planning.</p>
<p>“You have to logically think and plan for maybe five, six, ten, even 20 steps to be able to solve it.” This tests the model’s ability to handle dependencies and sub-tasks, rather than just predicting the next word.</p>
</section>
<section id="gpt-4-planning-performance" class="level3">
<h3 class="anchored" data-anchor-id="gpt-4-planning-performance">42. GPT-4 Planning Performance</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_42.png" class="img-fluid figure-img"></p>
<figcaption>Slide 42</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=1600s">Timestamp: 26:40</a>)</p>
<p>The results for GPT-4 are shown. While it achieves 34% on the standard Block World, its performance collapses to <strong>3%</strong> in “Mystery World.” Mystery World is the same problem, but the block names are randomized (e.g., obfuscated).</p>
<p>“What you call them doesn’t matter [to a human]… but for a large language model, what you does call them matters a lot.” The collapse in performance proves the model was relying on memorized patterns (approximate reasoning) rather than true logical planning.</p>
</section>
<section id="o1-models-and-progress" class="level3">
<h3 class="anchored" data-anchor-id="o1-models-and-progress">43. o1 Models and Progress</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_43.png" class="img-fluid figure-img"></p>
<figcaption>Slide 43</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=1670s">Timestamp: 27:50</a>)</p>
<p>The speaker updates the data with the very latest <strong>OpenAI o1 model</strong> results. This model uses “Chain of Thought on steroids” (reinforcement learning). It shows a massive improvement, jumping to nearly 100% on Block World and significantly higher on Mystery World (around 37-53%).</p>
<p>While this is “solid progress,” the speaker notes it “still has a ways to go.” The models are getting better at <strong>approximate reasoning</strong>, but they are not infallible logic engines yet.</p>
</section>
<section id="be-skeptical-of-benchmarks" class="level3">
<h3 class="anchored" data-anchor-id="be-skeptical-of-benchmarks">44. Be Skeptical of Benchmarks</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_44.png" class="img-fluid figure-img"></p>
<figcaption>Slide 44</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=1776s">Timestamp: 29:36</a>)</p>
<p>A warning accompanies the new capabilities: <strong>Be skeptical</strong>. As models get better at approximating reasoning, their mistakes will become harder to spot. They will sound incredibly convincing even when they are logically flawed.</p>
<p>“You’re going to have to have an expert to be able to tell when this models are going off the rails… because the Baseline for these models is so good.” Just as legal experts were needed for RAG, domain experts are needed to verify AI reasoning.</p>
</section>
<section id="common-gen-ai-use-cases-summary" class="level3">
<h3 class="anchored" data-anchor-id="common-gen-ai-use-cases-summary">45. Common Gen AI Use Cases Summary</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_45.png" class="img-fluid figure-img"></p>
<figcaption>Slide 45</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=1813s">Timestamp: 30:13</a>)</p>
<p>The speaker summarizes the key technical concepts covered: <strong>Hallucinations</strong>, <strong>RAG</strong>, <strong>Reasoning</strong>, <strong>Evaluation</strong>, <strong>Model as a Judge</strong>, and <strong>Data Enrichment</strong>.</p>
<p>These pillars form the basis of current Gen AI development. The presentation has moved from the simple idea of “asking a chatbot” to the complex reality of building systems that manage retrieval, evaluation, and reasoning.</p>
</section>
<section id="project-reality" class="level3">
<h3 class="anchored" data-anchor-id="project-reality">46. Project Reality</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_46.png" class="img-fluid figure-img"></p>
<figcaption>Slide 46</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=1835s">Timestamp: 30:35</a>)</p>
<p>The final takeaway emphasizes the organizational aspect. “Generative AI is like any other project and doesn’t go as planned.” It is not magic; it is engineering.</p>
<p>Success requires a diverse team (“system of people”) including evaluators, analysts, and technical builders. It is an iterative process that involves handling messy data and managing expectations.</p>
</section>
<section id="closing-title" class="level3">
<h3 class="anchored" data-anchor-id="closing-title">47. Closing Title</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_47.png" class="img-fluid figure-img"></p>
<figcaption>Slide 47</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/OyY4uxUShys&amp;t=1853s">Timestamp: 30:53</a>)</p>
<p>The presentation concludes. The speaker thanks the audience, hoping the stories of the law firm and the data company provided a realistic “Practical Perspective” on the current state of Generative AI.</p>
<hr>
<p><em>This annotated presentation was generated from the talk using AI-assisted tools. Each slide includes timestamps and detailed explanations.</em></p>


</section>
</section>

 ]]></description>
  <category>RAG</category>
  <category>Text-to-SQL</category>
  <category>AI</category>
  <category>Generative AI</category>
  <category>Annotated Talk</category>
  <guid>https://rajivshah.com/blog/practical-rag-text-to-sql.html</guid>
  <pubDate>Sun, 15 Sep 2024 05:00:00 GMT</pubDate>
  <media:content url="https://rajivshah.com/blog/images/practical-rag-text-to-sql/slide_1.png" medium="image" type="image/png"/>
</item>
<item>
  <title>Snowflake ML Intro Notebook - ML Forecasting</title>
  <link>https://rajivshah.com/blog/Snowpark_Forecasting_Bus.html</link>
  <description><![CDATA[ 






<p>This notebook introduces several key features of Snowflake ML in the process of training a machine learning model for forecasting Chicago bus ridership.</p>
<ul>
<li>Establish secure connection to Snowflake</li>
<li>Load features and target from Snowflake table into Snowpark DataFrame</li>
<li>Prepare features for model training</li>
<li>Train ML model using Snowpark ML distributed processing</li>
<li>Save the model to the Snowflake Model Registry</li>
<li>Run model predictions inside Snowflake</li>
</ul>
<p>This notebook is intended to highlight Snowflake functionality and should not be taken as a best practice for time series forecasting.</p>
<p><a href="https://github.com/rajshah4/snowflake-notebooks/blob/main/Forecasting_ChicagoBus/Snowpark_Forecasting_Bus.ipynb">Get Notebook</a></p>
<p><a href="https://github.com/rajshah4/snowflake-notebooks/blob/main/Forecasting_ChicagoBus/">Go to folder with dataset</a></p>
<p><a href="https://github.com/rajshah4/snowflake-notebooks/">See more snowflake notebooks from raj</a></p>
<section id="setup-environment" class="level2">
<h2 class="anchored" data-anchor-id="setup-environment">1. Setup Environment</h2>
<div id="persistent-reasoning" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;papermill&quot;,&quot;value&quot;:{&quot;duration&quot;:0.872363,&quot;end_time&quot;:&quot;2021-05-15T09:33:41.413139&quot;,&quot;exception&quot;:false,&quot;start_time&quot;:&quot;2021-05-15T09:33:40.540776&quot;,&quot;status&quot;:&quot;completed&quot;}}" data-tags="[]" data-execution_count="1">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Snowflake connector</span></span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> snowflake <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> connector</span>
<span id="cb1-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#from snowflake.ml.utils import connection_params</span></span>
<span id="cb1-4"></span>
<span id="cb1-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Snowpark for Python</span></span>
<span id="cb1-6"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> snowflake.snowpark.session <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Session</span>
<span id="cb1-7"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> snowflake.snowpark.types <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Variant</span>
<span id="cb1-8"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> snowflake.snowpark.version <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> VERSION</span>
<span id="cb1-9"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> snowflake.snowpark <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> functions <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> F</span>
<span id="cb1-10"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> snowflake.snowpark.types <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span></span>
<span id="cb1-11"></span>
<span id="cb1-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Snowpark ML</span></span>
<span id="cb1-13"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> snowflake.ml.modeling.compose <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> ColumnTransformer</span>
<span id="cb1-14"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> snowflake.ml.modeling.pipeline <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Pipeline</span>
<span id="cb1-15"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> snowflake.ml.modeling.preprocessing <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> StandardScaler, OrdinalEncoder</span>
<span id="cb1-16"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> snowflake.ml.modeling.impute <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> SimpleImputer</span>
<span id="cb1-17"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> snowflake.ml.modeling.model_selection <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> GridSearchCV</span>
<span id="cb1-18"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> snowflake.ml.modeling.xgboost <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> XGBRegressor</span>
<span id="cb1-19"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> snowflake.ml <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> version</span>
<span id="cb1-20">mlversion <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> version.VERSION</span>
<span id="cb1-21"></span>
<span id="cb1-22"></span>
<span id="cb1-23"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Misc</span></span>
<span id="cb1-24"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb1-25"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> json</span>
<span id="cb1-26"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> logging </span>
<span id="cb1-27">logger <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> logging.getLogger(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"snowflake.snowpark.session"</span>)</span>
<span id="cb1-28">logger.setLevel(logging.ERROR)</span></code></pre></div></div>
</div>
</section>
<section id="establish-secure-connection-to-snowflake" class="level2">
<h2 class="anchored" data-anchor-id="establish-secure-connection-to-snowflake">Establish Secure Connection to Snowflake</h2>
<p>Using the Snowflake ML Python API, it’s quick and easy to establish a secure connection between Snowflake and Notebook. I prefer using a <code>toml</code> configuration file <a href="https://docs.snowflake.com/en/developer-guide/snowflake-python-api/snowflake-python-connecting-snowflake">as documented here</a>. <em>Note: Other connection options include Username/Password, MFA, OAuth, Okta, SSO</em></p>
<p>The creds.json should look like this:</p>
<pre><code>{
    "account": "awb99999",
    "user": "your_user_name",
    "password": "your_password",
    "warehouse": "your_warehouse"
  }

::: {#953f6906 .cell execution_count=2}
``` {.python .cell-code}
with open('../../creds.json') as f:
    data = json.load(f)
    USERNAME = data['user']
    PASSWORD = data['password']
    SF_ACCOUNT = data['account']
    SF_WH = data['warehouse']

CONNECTION_PARAMETERS = {
   "account": SF_ACCOUNT,
   "user": USERNAME,
   "password": PASSWORD,
}

session = Session.builder.configs(CONNECTION_PARAMETERS).create()</code></pre>
<p>:::</p>
<p>Verify everything is connected. I like to do this to remind people to make sure they are using the latest versions.</p>
<div id="4fb1f7af" class="cell" data-execution_count="3">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1">snowflake_environment <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> session.sql(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'select current_user(), current_version()'</span>).collect()</span>
<span id="cb3-2">snowpark_version <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> VERSION</span>
<span id="cb3-3"></span>
<span id="cb3-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Current Environment Details</span></span>
<span id="cb3-5"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'User                        : </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(snowflake_environment[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>][<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]))</span>
<span id="cb3-6"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Role                        : </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(session.get_current_role()))</span>
<span id="cb3-7"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Database                    : </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(session.get_current_database()))</span>
<span id="cb3-8"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Schema                      : </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(session.get_current_schema()))</span>
<span id="cb3-9"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Warehouse                   : </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(session.get_current_warehouse()))</span>
<span id="cb3-10"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Snowflake version           : </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(snowflake_environment[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>][<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]))</span>
<span id="cb3-11"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Snowpark for Python version : </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">.</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">.</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(snowpark_version[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>],snowpark_version[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>],snowpark_version[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>]))</span>
<span id="cb3-12"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Snowflake ML version        : </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">.</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">.</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(mlversion[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>],mlversion[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>],mlversion[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>]))</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>User                        : RSHAH
Role                        : "RAJIV"
Database                    : "RAJIV"
Schema                      : "DOCAI"
Warehouse                   : "RAJIV"
Snowflake version           : 8.19.2
Snowpark for Python version : 1.15.0a1
Snowflake ML version        : 1.5.0</code></pre>
</div>
</div>
<p>Throughout this notebook, I will change warehouse sizes. For this notebook warehouse size really doesn’t matter much, but I want people to understand how easily and quickly you can change the warehouse size. This is one of my favorite features of Snowflake, just how its always ready for me.</p>
<div id="e11789e5" class="cell" data-execution_count="4">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1">session.sql(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"create or replace warehouse snowpark_opt_wh with warehouse_size = 'SMALL'"</span>).collect()</span>
<span id="cb5-2">session.sql(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"USE SCHEMA PUBLIC"</span>).collect()</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="4">
<pre><code>[Row(status='Statement executed successfully.')]</code></pre>
</div>
</div>
</section>
<section id="load-data-in-snowflake" class="level2">
<h2 class="anchored" data-anchor-id="load-data-in-snowflake">2. Load Data in Snowflake</h2>
<p>Let’s get the data (900k rows) and also make the column names all upper cases. It’s easier to work with columns names that aren’t case sensitive.</p>
<div id="8dfb28da" class="cell" data-execution_count="5">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1">df_clean <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'CTA_Daily_Totals_by_Route.csv'</span>)</span>
<span id="cb7-2">df_clean.columns <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df_clean.columns.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>.upper()</span>
<span id="cb7-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span> (df_clean.shape)</span>
<span id="cb7-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span> (df_clean.dtypes)</span>
<span id="cb7-5">df_clean.head()</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>(893603, 4)
ROUTE      object
DATE       object
DAYTYPE    object
RIDES       int64
dtype: object</code></pre>
</div>
<div class="cell-output cell-output-display" data-execution_count="5">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">ROUTE</th>
<th data-quarto-table-cell-role="th">DATE</th>
<th data-quarto-table-cell-role="th">DAYTYPE</th>
<th data-quarto-table-cell-role="th">RIDES</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">0</th>
<td>3</td>
<td>01/01/2001</td>
<td>U</td>
<td>7354</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">1</th>
<td>4</td>
<td>01/01/2001</td>
<td>U</td>
<td>9288</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">2</th>
<td>6</td>
<td>01/01/2001</td>
<td>U</td>
<td>6048</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">3</th>
<td>8</td>
<td>01/01/2001</td>
<td>U</td>
<td>6309</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">4</th>
<td>9</td>
<td>01/01/2001</td>
<td>U</td>
<td>11207</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<p>Let’s create a Snowpark dataframe and split the data for test/train. This operation is done inside Snowflake and not in your local environment. We will also save this as a table so we don’t ever have to manually upload this dataset again.</p>
<p>PRO TIP – Snowpark will inherit the schema of a pandas dataframe into Snowflake. Either change your schema before importing or after it has landed in snowflake. People that put models into production are very careful about data types.</p>
<div id="77038199" class="cell" data-execution_count="6">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1">input_df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> session.create_dataframe(df_clean)</span>
<span id="cb9-2">schema <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> input_df.schema</span>
<span id="cb9-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(schema)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>StructType([StructField('ROUTE', StringType(16777216), nullable=True), StructField('DATE', StringType(16777216), nullable=True), StructField('DAYTYPE', StringType(16777216), nullable=True), StructField('RIDES', LongType(), nullable=True)])</code></pre>
</div>
</div>
<div id="9a6209bd" class="cell" data-execution_count="7">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1">input_df.write.mode(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'overwrite'</span>).save_as_table(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'CHICAGO_BUS_RIDES'</span>)</span></code></pre></div></div>
</div>
<p>Let’s read from the table, since that is generally what you will be doing in production. We have 893,000 rows of ridership data.</p>
<div id="b2b86720" class="cell" data-execution_count="8">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> session.read.table(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'CHICAGO_BUS_RIDES'</span>)</span>
<span id="cb12-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span> (df.count())</span>
<span id="cb12-3">df.show()</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>893603
----------------------------------------------
|"ROUTE"  |"DATE"      |"DAYTYPE"  |"RIDES"  |
----------------------------------------------
|3        |01/01/2001  |U          |7354     |
|4        |01/01/2001  |U          |9288     |
|6        |01/01/2001  |U          |6048     |
|8        |01/01/2001  |U          |6309     |
|9        |01/01/2001  |U          |11207    |
|10       |01/01/2001  |U          |385      |
|11       |01/01/2001  |U          |610      |
|12       |01/01/2001  |U          |3678     |
|18       |01/01/2001  |U          |375      |
|20       |01/01/2001  |U          |7096     |
----------------------------------------------
</code></pre>
</div>
</div>
</section>
<section id="distributed-feature-engineering" class="level2">
<h2 class="anchored" data-anchor-id="distributed-feature-engineering">3. Distributed Feature Engineering</h2>
<p>Let’s add the Day of the week and then Aggregate the data by day. Let’s join in weather data</p>
<p>These operations are done inside the Snowpark warehouse which provides improved performance and scalability with distributed execution for these scikit-learn preprocessing functions. This dataset uses SMALL, but you can always move up to larger ones including Snowpark Optimized warehouses (16x memory per node than a standard warehouse), e.g., <code>session.sql("create or replace warehouse snowpark_opt_wh with warehouse_size = 'MEDIUM' warehouse_type = 'SNOWPARK-OPTIMIZED'").collect()</code></p>
<div id="894622be" class="cell" data-execution_count="9">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1">session.sql(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"create or replace warehouse snowpark_opt_wh with warehouse_size = 'MEDIUM' warehouse_type = 'SNOWPARK-OPTIMIZED'"</span>).collect()</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="9">
<pre><code>[Row(status='Warehouse SNOWPARK_OPT_WH successfully created.')]</code></pre>
</div>
</div>
<p>Simple feature engineering</p>
<div id="50e5d5e9" class="cell" data-execution_count="10">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> snowflake.snowpark.functions <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> col, to_timestamp, dayofweek, month,<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>, listagg, lag</span>
<span id="cb16-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> snowflake.snowpark <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Window</span>
<span id="cb16-3"></span>
<span id="cb16-4">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df.with_column(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'DATE'</span>, to_timestamp(col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'DATE'</span>), <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'MM/DD/YYYY'</span>))</span>
<span id="cb16-5"></span>
<span id="cb16-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Add a new column for the day of the week</span></span>
<span id="cb16-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># The day of week is represented as an integer, with 0 = Sunday, 1 = Monday, ..., 6 = Saturday</span></span>
<span id="cb16-8">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df.with_column(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'DAY_OF_WEEK'</span>, dayofweek(col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'DATE'</span>)))</span>
<span id="cb16-9"></span>
<span id="cb16-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Show the resulting dataframe</span></span>
<span id="cb16-11">df.show()</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>-----------------------------------------------------------------------
|"ROUTE"  |"DAYTYPE"  |"RIDES"  |"DATE"               |"DAY_OF_WEEK"  |
-----------------------------------------------------------------------
|3        |U          |7354     |2001-01-01 00:00:00  |1              |
|4        |U          |9288     |2001-01-01 00:00:00  |1              |
|6        |U          |6048     |2001-01-01 00:00:00  |1              |
|8        |U          |6309     |2001-01-01 00:00:00  |1              |
|9        |U          |11207    |2001-01-01 00:00:00  |1              |
|10       |U          |385      |2001-01-01 00:00:00  |1              |
|11       |U          |610      |2001-01-01 00:00:00  |1              |
|12       |U          |3678     |2001-01-01 00:00:00  |1              |
|18       |U          |375      |2001-01-01 00:00:00  |1              |
|20       |U          |7096     |2001-01-01 00:00:00  |1              |
-----------------------------------------------------------------------
</code></pre>
</div>
</div>
<p>A bit more feature engineering, but again, this is very familiar syntax.</p>
<div id="0022d07b" class="cell" data-execution_count="11">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Add a new column for the month</span></span>
<span id="cb18-2">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df.with_column(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'MONTH'</span>, month(col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'DATE'</span>)))</span>
<span id="cb18-3"></span>
<span id="cb18-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Group by DATE, DAY_OF_WEEK, and MONTH, then aggregate</span></span>
<span id="cb18-5">total_riders <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df.group_by(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'DATE'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'DAY_OF_WEEK'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'MONTH'</span>).agg(</span>
<span id="cb18-6">    F.listagg(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'DAYTYPE'</span>, is_distinct<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>).alias(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'DAYTYPE'</span>),</span>
<span id="cb18-7">    F.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'RIDES'</span>).alias(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'TOTAL_RIDERS'</span>)</span>
<span id="cb18-8">).order_by(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'DATE'</span>)</span>
<span id="cb18-9"></span>
<span id="cb18-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#Define a window specification</span></span>
<span id="cb18-11">window_spec <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Window.order_by(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'DATE'</span>)</span>
<span id="cb18-12"></span>
<span id="cb18-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Add a lagged column for total ridership of the previous day</span></span>
<span id="cb18-14">total_riders <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> total_riders.with_column(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'PREV_DAY_RIDERS'</span>, lag(col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'TOTAL_RIDERS'</span>), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>).over(window_spec))</span>
<span id="cb18-15"></span>
<span id="cb18-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Show the resulting dataframe</span></span>
<span id="cb18-17"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span> (total_riders.count())</span>
<span id="cb18-18"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span> (total_riders.show())</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>7364
--------------------------------------------------------------------------------------------------
|"DATE"               |"DAY_OF_WEEK"  |"MONTH"  |"DAYTYPE"  |"TOTAL_RIDERS"  |"PREV_DAY_RIDERS"  |
--------------------------------------------------------------------------------------------------
|2001-01-01 00:00:00  |1              |1        |U          |295439          |NULL               |
|2001-01-02 00:00:00  |2              |1        |W          |776862          |295439             |
|2001-01-03 00:00:00  |3              |1        |W          |820048          |776862             |
|2001-01-04 00:00:00  |4              |1        |W          |867675          |820048             |
|2001-01-05 00:00:00  |5              |1        |W          |887519          |867675             |
|2001-01-06 00:00:00  |6              |1        |A          |575407          |887519             |
|2001-01-07 00:00:00  |0              |1        |U          |374435          |575407             |
|2001-01-08 00:00:00  |1              |1        |W          |980660          |374435             |
|2001-01-09 00:00:00  |2              |1        |W          |974858          |980660             |
|2001-01-10 00:00:00  |3              |1        |W          |980656          |974858             |
--------------------------------------------------------------------------------------------------

None</code></pre>
</div>
</div>
<section id="also-you-can-use-chatgpt-to-generate-the-code-for-you." class="level3">
<h3 class="anchored" data-anchor-id="also-you-can-use-chatgpt-to-generate-the-code-for-you.">Also, you can use ChatGPT to generate the code for you.</h3>
<p><img src="https://rajivshah.com/blog/fe_forecasting.png" alt="Forecasting Visualization" width="600"></p>
</section>
</section>
<section id="join-in-the-weather-data-from-the-snowflake-marketplace" class="level2">
<h2 class="anchored" data-anchor-id="join-in-the-weather-data-from-the-snowflake-marketplace">Join in the Weather Data from the Snowflake Marketplace</h2>
<p>Instead of downloading data and building pipelines, Snowflake has a lot of useful data, including weather data in it’s Marketplace. This means the data is only a SQL query away.</p>
<p><a href="https://app.snowflake.com/marketplace/listing/GZTSZAS2KIM/cybersyn-inc-weather-environmental-essentials?search=weather">Cybersyn Weather</a></p>
<p>SQL QUERY:</p>
<pre><code>SELECT
  ts.noaa_weather_station_id,
  ts.DATE,
  COALESCE(MAX(CASE WHEN ts.variable = 'minimum_temperature' THEN ts.Value ELSE NULL END), 0) AS minimum_temperature,
  COALESCE(MAX(CASE WHEN ts.variable = 'precipitation' THEN ts.Value ELSE NULL END), 0) AS precipitation,
  COALESCE(MAX(CASE WHEN ts.variable = 'maximum_temperature' THEN ts.Value ELSE NULL END), 0) AS maximum_temperature
FROM
  cybersyn.noaa_weather_metrics_timeseries AS ts
JOIN
  cybersyn.noaa_weather_station_index AS idx
ON
  (ts.noaa_weather_station_id = idx.noaa_weather_station_id)
WHERE
  idx.NOAA_WEATHER_STATION_ID = 'USW00014819'
  AND (ts.VARIABLE = 'minimum_temperature' OR ts.VARIABLE = 'precipitation' OR ts.VARIABLE = 'maximum_temperature')
GROUP BY
  ts.noaa_weather_station_id,
  ts.DATE
LIMIT 1000;</code></pre>
<div id="b56ea3fe" class="cell" data-execution_count="12">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb21-1">weather <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> session.read.table(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'CHICAGO_WEATHER'</span>)</span>
<span id="cb21-2"></span>
<span id="cb21-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> snowflake.snowpark.types <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> DoubleType</span>
<span id="cb21-4">weather <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> weather.withColumn(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'MINIMUM_TEMPERATURE'</span>, weather[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'MINIMUM_TEMPERATURE'</span>].cast(DoubleType()))</span>
<span id="cb21-5">weather <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> weather.withColumn(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'MAXIMUM_TEMPERATURE'</span>, weather[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'MAXIMUM_TEMPERATURE'</span>].cast(DoubleType()))</span>
<span id="cb21-6">weather <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> weather.withColumn(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'PRECIPITATION'</span>, weather[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'PRECIPITATION'</span>].cast(DoubleType()))</span>
<span id="cb21-7"></span>
<span id="cb21-8">weather.show()</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>------------------------------------------------------------------------------------------------------------
|"NOAA_WEATHER_STATION_ID"  |"DATE"      |"MINIMUM_TEMPERATURE"  |"MAXIMUM_TEMPERATURE"  |"PRECIPITATION"  |
------------------------------------------------------------------------------------------------------------
|USW00014819                |2019-07-16  |22.2                   |28.9                   |3.8              |
|USW00014819                |2002-01-06  |-3.9                   |3.3                    |0.0              |
|USW00014819                |2008-03-17  |-0.5                   |4.4                    |2.0              |
|USW00014819                |2000-01-29  |-6.7                   |-2.2                   |0.0              |
|USW00014819                |2004-06-12  |16.7                   |26.7                   |6.6              |
|USW00014819                |2017-07-15  |16.1                   |28.3                   |0.0              |
|USW00014819                |2001-10-22  |12.2                   |18.9                   |2.3              |
|USW00014819                |2021-05-01  |6.1                    |28.3                   |0.0              |
|USW00014819                |2016-11-29  |7.2                    |14.4                   |0.0              |
|USW00014819                |2020-08-01  |18.3                   |26.1                   |5.1              |
------------------------------------------------------------------------------------------------------------
</code></pre>
</div>
</div>
<div id="027f9449" class="cell" data-execution_count="13">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb23" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb23-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Perform the join operation</span></span>
<span id="cb23-2">joined_df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> weather.join(</span>
<span id="cb23-3">    total_riders,</span>
<span id="cb23-4">    weather[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"DATE"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> total_riders[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"DATE"</span>],</span>
<span id="cb23-5">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"inner"</span>,  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># This is the type of join: inner, outer, left, right,</span></span>
<span id="cb23-6">    lsuffix<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"w"</span></span>
<span id="cb23-7">)</span>
<span id="cb23-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Show the result of the join</span></span>
<span id="cb23-9">joined_df.show()</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"NOAA_WEATHER_STATION_ID"  |"DATEW"     |"MINIMUM_TEMPERATURE"  |"MAXIMUM_TEMPERATURE"  |"PRECIPITATION"  |"DATE"               |"DAY_OF_WEEK"  |"MONTH"  |"DAYTYPE"  |"TOTAL_RIDERS"  |"PREV_DAY_RIDERS"  |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|USW00014819                |2005-10-15  |8.9                    |20.0                   |0.0              |2005-10-15 00:00:00  |6              |10       |A          |666129          |1087863            |
|USW00014819                |2019-04-29  |6.1                    |11.7                   |29.0             |2019-04-29 00:00:00  |1              |4        |W          |724030          |332461             |
|USW00014819                |2019-09-26  |13.9                   |22.8                   |0.0              |2019-09-26 00:00:00  |4              |9        |W          |847678          |852326             |
|USW00014819                |2006-12-09  |-4.9                   |3.3                    |0.0              |2006-12-09 00:00:00  |6              |12       |A          |586623          |948538             |
|USW00014819                |2015-05-05  |10.6                   |15.6                   |20.1             |2015-05-05 00:00:00  |2              |5        |W          |913079          |926775             |
|USW00014819                |2006-05-05  |8.9                    |15.6                   |0.0              |2006-05-05 00:00:00  |5              |5        |W          |1018785         |1042392            |
|USW00014819                |2019-11-04  |3.9                    |12.2                   |0.0              |2019-11-04 00:00:00  |1              |11       |W          |842258          |354020             |
|USW00014819                |2013-02-07  |0.0                    |2.2                    |14.5             |2013-02-07 00:00:00  |4              |2        |W          |963866          |1026678            |
|USW00014819                |2013-08-30  |20.6                   |35.6                   |11.9             |2013-08-30 00:00:00  |5              |8        |W          |1004986         |1029901            |
|USW00014819                |2007-04-28  |4.4                    |22.2                   |0.0              |2007-04-28 00:00:00  |6              |4        |A          |662079          |1018455            |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
</code></pre>
</div>
</div>
<div id="c5237c42" class="cell" data-execution_count="14">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb25" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb25-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Dropping any null values</span></span>
<span id="cb25-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> snowflake.snowpark.functions <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> col, is_null</span>
<span id="cb25-3"></span>
<span id="cb25-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create a filter condition for non-finite values across all columns</span></span>
<span id="cb25-5">non_finite_filter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span></span>
<span id="cb25-6"></span>
<span id="cb25-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Iterate over all columns and update the filter condition</span></span>
<span id="cb25-8"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> column <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> joined_df.columns:</span>
<span id="cb25-9">    current_filter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> is_null(col(column))</span>
<span id="cb25-10">    non_finite_filter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> current_filter <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> non_finite_filter <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">is</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> (non_finite_filter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span> current_filter)</span>
<span id="cb25-11"></span>
<span id="cb25-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Apply the filter to the DataFrame to exclude rows with any non-finite values</span></span>
<span id="cb25-13">df_filtered <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> joined_df.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">filter</span>(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span>non_finite_filter)</span></code></pre></div></div>
</div>
<div id="96a10c75" class="cell" data-execution_count="15">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb26" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb26-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#Split the data into training and test sets</span></span>
<span id="cb26-2">train <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df_filtered.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">filter</span>(col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'DATE'</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'2019-01-01'</span>)</span>
<span id="cb26-3">test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df_filtered.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">filter</span>(col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'DATE'</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'2019-01-01'</span>)</span></code></pre></div></div>
</div>
<div id="db5114b0" class="cell" data-execution_count="16">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb27" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb27-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span> (train.count())</span>
<span id="cb27-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span> (test.count())</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>6570
790</code></pre>
</div>
</div>
</section>
<section id="distributed-feature-engineering-in-a-pipeline" class="level2">
<h2 class="anchored" data-anchor-id="distributed-feature-engineering-in-a-pipeline">4. Distributed Feature Engineering in a Pipeline</h2>
<p>Feature engineering + XGBoost</p>
<div id="dc637e73" class="cell" data-execution_count="17">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb29" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb29-1">session.sql(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"create or replace warehouse snowpark_opt_wh with warehouse_size = 'MEDIUM' warehouse_type = 'SNOWPARK-OPTIMIZED'"</span>).collect()</span>
<span id="cb29-2">session.sql(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"alter warehouse snowpark_opt_wh set max_concurrency_level = 1"</span>).collect()</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="17">
<pre><code>[Row(status='Statement executed successfully.')]</code></pre>
</div>
</div>
<div id="ad265fe5" class="cell" data-execution_count="18">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb31" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb31-1"> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Distributed Preprocessing - 25X to 50X faster</span></span>
<span id="cb31-2">numeric_features <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'DAY_OF_WEEK'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'MONTH'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'PREV_DAY_RIDERS'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'MINIMUM_TEMPERATURE'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'MAXIMUM_TEMPERATURE'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'PRECIPITATION'</span>]</span>
<span id="cb31-3">numeric_transformer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Pipeline(steps<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'scaler'</span>, StandardScaler())])</span>
<span id="cb31-4"></span>
<span id="cb31-5">categorical_cols <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'DAYTYPE'</span>]</span>
<span id="cb31-6">categorical_transformer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Pipeline(steps<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[</span>
<span id="cb31-7">    (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'imputer'</span>, SimpleImputer(strategy<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'most_frequent'</span>)),</span>
<span id="cb31-8">    (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'onehot'</span>, OrdinalEncoder(handle_unknown<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'use_encoded_value'</span>,unknown_value<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">99999</span>))</span>
<span id="cb31-9">])</span>
<span id="cb31-10"></span>
<span id="cb31-11">preprocessor <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ColumnTransformer(</span>
<span id="cb31-12">    transformers<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[</span>
<span id="cb31-13">        (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'num'</span>, numeric_transformer, numeric_features),</span>
<span id="cb31-14">        (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'cat'</span>, categorical_transformer, categorical_cols)</span>
<span id="cb31-15">        ])</span>
<span id="cb31-16"></span>
<span id="cb31-17">pipeline <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Pipeline(steps<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'preprocessor'</span>, preprocessor),(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'model'</span>, XGBRegressor())])</span></code></pre></div></div>
</div>
</section>
<section id="distributed-training" class="level2">
<h2 class="anchored" data-anchor-id="distributed-training">5. Distributed Training</h2>
<p>These operations are done inside the Snowpark warehouse which provides improved performance and scalability with distributed execution for these scikit-learn preprocessing functions and XGBoost training (and many other types of models).</p>
<div id="a1c973e3" class="cell" data-execution_count="19">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb32" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb32-1"> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Distributed HyperParameter Optimization</span></span>
<span id="cb32-2">hyper_param <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>(</span>
<span id="cb32-3">        model__max_depth<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>],</span>
<span id="cb32-4">        model__learning_rate<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>,<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span>],</span>
<span id="cb32-5">    )</span>
<span id="cb32-6"></span>
<span id="cb32-7">xg_model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> GridSearchCV(</span>
<span id="cb32-8">    estimator<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>pipeline,</span>
<span id="cb32-9">    param_grid<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>hyper_param,</span>
<span id="cb32-10">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#cv=5,</span></span>
<span id="cb32-11">    input_cols<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>numeric_features <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> categorical_cols,</span>
<span id="cb32-12">    label_cols<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'TOTAL_RIDERS'</span>],</span>
<span id="cb32-13">    output_cols<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"TOTAL_RIDERS_FORECAST"</span>],</span>
<span id="cb32-14">)</span>
<span id="cb32-15"></span>
<span id="cb32-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Fit and Score</span></span>
<span id="cb32-17">xg_model.fit(train)</span>
<span id="cb32-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##Takes 25 seconds</span></span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="19">
<pre><code>&lt;snowflake.ml.modeling.model_selection.grid_search_cv.GridSearchCV at 0x173418df0&gt;</code></pre>
</div>
</div>
</section>
<section id="model-evaluation" class="level2">
<h2 class="anchored" data-anchor-id="model-evaluation">6. Model Evaluation</h2>
<p>Look at the results of the mode. cv_results is a dictionary, where each key is a string describing one of the metrics or parameters, and the corresponding value is an array with one entry per combination of parameters</p>
<div id="5d0e9845" class="cell" data-execution_count="20">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb34" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb34-1">session.sql(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"create or replace warehouse snowpark_opt_wh with warehouse_size = 'SMALL'"</span>).collect()</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="20">
<pre><code>[Row(status='Warehouse SNOWPARK_OPT_WH successfully created.')]</code></pre>
</div>
</div>
<div id="70d680fd" class="cell" data-execution_count="21">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb36" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb36-1">cv_results <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> xg_model.to_sklearn().cv_results_</span>
<span id="cb36-2"></span>
<span id="cb36-3"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(cv_results[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'params'</span>])):</span>
<span id="cb36-4">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Parameters: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>cv_results[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'params'</span>][i]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb36-5">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Mean Test Score: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>cv_results[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mean_test_score'</span>][i]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb36-6">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>()</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Parameters: {'model__learning_rate': 0.1, 'model__max_depth': 2}
Mean Test Score: 0.927693653032996

Parameters: {'model__learning_rate': 0.1, 'model__max_depth': 4}
Mean Test Score: 0.9440192568004221

Parameters: {'model__learning_rate': 0.3, 'model__max_depth': 2}
Mean Test Score: 0.9367972284370352

Parameters: {'model__learning_rate': 0.3, 'model__max_depth': 4}
Mean Test Score: 0.9425057277525181
</code></pre>
</div>
</div>
<p>Look at the accuracy of the model</p>
<div id="73c36d50" class="cell" data-execution_count="22">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb38" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb38-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> snowflake.ml.modeling.metrics <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> mean_absolute_error</span>
<span id="cb38-2">testpreds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> xg_model.predict(test)</span>
<span id="cb38-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'MSE:'</span>, mean_absolute_error(df<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>testpreds, y_true_col_names<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'TOTAL_RIDERS'</span>, y_pred_col_names<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'"TOTAL_RIDERS_FORECAST"'</span>))</span>
<span id="cb38-4">testpreds.select(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"DATEW"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"TOTAL_RIDERS"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"TOTAL_RIDERS_FORECAST"</span>).show(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)         </span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>MSE: 183320.1351068038
---------------------------------------------------------
|"DATEW"     |"TOTAL_RIDERS"  |"TOTAL_RIDERS_FORECAST"  |
---------------------------------------------------------
|2019-11-09  |476467          |489406.65625             |
|2019-05-31  |810422          |836847.0625              |
|2020-12-15  |270178          |633812.25                |
|2020-08-05  |315741          |710399.9375              |
|2020-05-17  |118681          |347373.59375             |
|2019-02-27  |793731          |792628.8125              |
|2019-01-27  |257918          |286517.4375              |
|2020-02-07  |771641          |789460.875               |
|2020-12-06  |143231          |333279.25                |
|2020-04-02  |213131          |656467.625               |
---------------------------------------------------------
</code></pre>
</div>
</div>
<p>Materialize the results to a table</p>
<div id="d9489eb7" class="cell" data-execution_count="23">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb40" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb40-1">testpreds.write.save_as_table(table_name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'CHICAGO_BUS_RIDES_FORECAST'</span>, mode<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'overwrite'</span>)</span></code></pre></div></div>
</div>
<p>Using metrics from snowpark so calculation is done inside snowflake</p>
</section>
<section id="save-to-the-model-registry-and-use-for-predictions-python-sql" class="level2">
<h2 class="anchored" data-anchor-id="save-to-the-model-registry-and-use-for-predictions-python-sql">7. Save to the Model Registry and use for Predictions (Python &amp; SQL)</h2>
<p>Connect to the registry</p>
<div id="c97f06d1" class="cell" data-execution_count="24">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb41" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb41-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> snowflake.ml.registry <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Registry</span>
<span id="cb41-2">reg <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Registry(session<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>session, database_name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"RAJIV"</span>, schema_name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"PUBLIC"</span>)</span></code></pre></div></div>
</div>
<div id="99012d7d" class="cell" data-execution_count="25">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb42" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb42-1">model_ref <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> reg.log_model(</span>
<span id="cb42-2">    model_name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Forecasting_Bus_Ridership"</span>,</span>
<span id="cb42-3">    version_name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"v37"</span>,    </span>
<span id="cb42-4">    model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>xg_model,</span>
<span id="cb42-5">    conda_dependencies<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"scikit-learn"</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"xgboost"</span>],</span>
<span id="cb42-6">    sample_input_data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>train,</span>
<span id="cb42-7">    comment<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"XGBoost model, run 36, May13"</span></span>
<span id="cb42-8">)</span></code></pre></div></div>
<div class="cell-output cell-output-stderr">
<pre><code>/Users/rajishah/anaconda3/envs/working38/lib/python3.8/contextlib.py:113: UserWarning: `relax_version` is not set and therefore defaulted to True. Dependency version constraints relaxed from ==x.y.z to &gt;=x.y, &lt;(x+1). To use specific dependency versions for compatibility, reproducibility, etc., set `options={'relax_version': False}` when logging the model.
  return next(self.gen)
/Users/rajishah/anaconda3/envs/working38/lib/python3.8/site-packages/snowflake/ml/model/_packager/model_packager.py:92: UserWarning: Inferring model signature from sample input or providing model signature for Snowpark ML Modeling model is not required. Model signature will automatically be inferred during fitting. 
  handler.save_model(</code></pre>
</div>
</div>
<div id="3ac62368" class="cell" data-execution_count="26">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb44" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb44-1">reg.show_models()</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="26">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">created_on</th>
<th data-quarto-table-cell-role="th">name</th>
<th data-quarto-table-cell-role="th">database_name</th>
<th data-quarto-table-cell-role="th">schema_name</th>
<th data-quarto-table-cell-role="th">comment</th>
<th data-quarto-table-cell-role="th">owner</th>
<th data-quarto-table-cell-role="th">default_version_name</th>
<th data-quarto-table-cell-role="th">versions</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">0</th>
<td>2024-01-23 18:58:41.929000-08:00</td>
<td>DIABETES_XGBOOSTER</td>
<td>RAJIV</td>
<td>PUBLIC</td>
<td>None</td>
<td>RAJIV</td>
<td>V2</td>
<td>["V2","V3","V4","V5","V7"]</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">1</th>
<td>2024-02-19 17:12:27.005000-08:00</td>
<td>E5_BASE_V2</td>
<td>RAJIV</td>
<td>PUBLIC</td>
<td>None</td>
<td>RAJIV</td>
<td>V1</td>
<td>["V1"]</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">2</th>
<td>2024-02-07 13:00:56.292000-08:00</td>
<td>FINBERT</td>
<td>RAJIV</td>
<td>PUBLIC</td>
<td>None</td>
<td>RAJIV</td>
<td>V1</td>
<td>["V1"]</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">3</th>
<td>2024-02-26 18:55:00.548000-08:00</td>
<td>FORECASTING_BUS_RIDERSHIP</td>
<td>RAJIV</td>
<td>PUBLIC</td>
<td>None</td>
<td>RAJIV</td>
<td>V7</td>
<td>["V10","V11","V12","V13","V14","V15","V16","V1...</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">4</th>
<td>2024-02-19 17:19:12.122000-08:00</td>
<td>MINILMV2</td>
<td>RAJIV</td>
<td>PUBLIC</td>
<td>None</td>
<td>RAJIV</td>
<td>V1</td>
<td>["V1","V2","V4","V5"]</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">5</th>
<td>2024-02-07 13:14:44.823000-08:00</td>
<td>MPNET_BASE</td>
<td>RAJIV</td>
<td>PUBLIC</td>
<td>None</td>
<td>RAJIV</td>
<td>V1</td>
<td>["V1","V2","V3"]</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">6</th>
<td>2024-01-25 14:54:04.655000-08:00</td>
<td>TPCDS_XGBOOST_DEMO</td>
<td>RAJIV</td>
<td>PUBLIC</td>
<td>None</td>
<td>RAJIV</td>
<td>V5</td>
<td>["V5","V6","V7","V8","V9"]</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">7</th>
<td>2024-01-23 18:49:09.294000-08:00</td>
<td>XGBOOSTER</td>
<td>RAJIV</td>
<td>PUBLIC</td>
<td>None</td>
<td>RAJIV</td>
<td>V1</td>
<td>["V1","V2"]</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<p>Let’s retrieve the model from the registry</p>
<div id="16f2b93f" class="cell" data-execution_count="27">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb45" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb45-1">reg_model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> reg.get_model(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Forecasting_Bus_Ridership"</span>).version(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"v37"</span>)</span></code></pre></div></div>
</div>
<p>Let’s do predictions inside the warehouse</p>
<div id="bf7f3388" class="cell" data-execution_count="28">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb46" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb46-1">remote_prediction <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> reg_model.run(test, function_name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'predict'</span>)</span>
<span id="cb46-2">remote_prediction.sort(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"DATEW"</span>).select(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"DATEW"</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"TOTAL_RIDERS"</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"TOTAL_RIDERS_FORECAST"</span>).show(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>---------------------------------------------------------
|"DATEW"     |"TOTAL_RIDERS"  |"TOTAL_RIDERS_FORECAST"  |
---------------------------------------------------------
|2019-01-01  |247279          |290942.375               |
|2019-01-02  |585996          |668251.3125              |
|2019-01-03  |660631          |767229.875               |
|2019-01-04  |662011          |759055.3125              |
|2019-01-05  |440848          |491881.78125             |
|2019-01-06  |316844          |351156.84375             |
|2019-01-07  |717818          |762515.625               |
|2019-01-08  |779946          |879376.0625              |
|2019-01-09  |743021          |790567.625               |
|2019-01-10  |743075          |764690.8125              |
---------------------------------------------------------
</code></pre>
</div>
</div>
<p>If you look in the activity view, you can find the SQL which will run a bit faster. This SQL command is showing the result in a snowflake dataframe. You could use <code>collect</code> to pull the info out into your local session.</p>
<p>Modify the SQL with by adding in your specific model with this line: <code>WITH MODEL_VERSION_ALIAS AS MODEL RAJIV.PUBLIC.DIABETES_XGBOOSTER VERSION V7</code> and updating the location of your target predictions which is located here: <code>SNOWPARK_ML_MODEL_INFERENCE_INPUT</code></p>
<div id="19ab86b4" class="cell" data-execution_count="29">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb48" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb48-1">sqlquery <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""SELECT "DATEW", "TOTAL_RIDERS",  CAST ("TMP_RESULT"['TOTAL_RIDERS_FORECAST'] AS DOUBLE) AS "TOTAL_RIDERS_FORECAST" FROM (WITH SNOWPARK_ML_MODEL_INFERENCE_INPUT AS (SELECT  *  FROM ( SELECT  *  FROM (( SELECT "NOAA_WEATHER_STATION_ID" AS "NOAA_WEATHER_STATION_ID", "DATE" AS "DATEW", "MINIMUM_TEMPERATURE" AS "MINIMUM_TEMPERATURE", "MAXIMUM_TEMPERATURE" AS "MAXIMUM_TEMPERATURE", "PRECIPITATION" AS "PRECIPITATION" FROM ( SELECT "NOAA_WEATHER_STATION_ID", "DATE",  CAST ("MINIMUM_TEMPERATURE" AS DOUBLE) AS "MINIMUM_TEMPERATURE",  CAST ("MAXIMUM_TEMPERATURE" AS DOUBLE) AS "MAXIMUM_TEMPERATURE",  CAST ("PRECIPITATION" AS DOUBLE) AS "PRECIPITATION" FROM CHICAGO_WEATHER)) AS SNOWPARK_LEFT INNER JOIN ( SELECT "DATE" AS "DATE", "DAY_OF_WEEK" AS "DAY_OF_WEEK", "MONTH" AS "MONTH", "DAYTYPE" AS "DAYTYPE", "TOTAL_RIDERS" AS "TOTAL_RIDERS", "PREV_DAY_RIDERS" AS "PREV_DAY_RIDERS" FROM ( SELECT "DATE", "DAY_OF_WEEK", "MONTH", "DAYTYPE", "TOTAL_RIDERS", LAG("TOTAL_RIDERS", 1, NULL) OVER (  ORDER BY "DATE" ASC NULLS FIRST ) AS "PREV_DAY_RIDERS" FROM ( SELECT "DATE", "DAY_OF_WEEK", "MONTH",  LISTAGG ( DISTINCT "DAYTYPE", '') AS "DAYTYPE", sum("RIDES") AS "TOTAL_RIDERS" FROM ( SELECT "ROUTE", "DAYTYPE", "RIDES", "DATE", dayofweek("DATE") AS "DAY_OF_WEEK", month("DATE") AS "MONTH" FROM ( SELECT "ROUTE", "DAYTYPE", "RIDES", to_timestamp("DATE", 'MM/DD/YYYY') AS "DATE" FROM CHICAGO_BUS_RIDES)) GROUP BY "DATE", "DAY_OF_WEEK", "MONTH") ORDER BY "DATE" ASC NULLS FIRST)) AS SNOWPARK_RIGHT ON ("DATEW" = "DATE"))) WHERE (NOT (((((((((("NOAA_WEATHER_STATION_ID" IS NULL OR "DATEW" IS NULL) OR "MINIMUM_TEMPERATURE" IS NULL) OR "MAXIMUM_TEMPERATURE" IS NULL) OR "PRECIPITATION" IS NULL) OR "DATE" IS NULL) OR "DAY_OF_WEEK" IS NULL) OR "MONTH" IS NULL) OR "DAYTYPE" IS NULL) OR "TOTAL_RIDERS" IS NULL) OR "PREV_DAY_RIDERS" IS NULL) AND ("DATE" &gt;= '2019-01-01'))),MODEL_VERSION_ALIAS AS MODEL RAJIV.PUBLIC.FORECASTING_BUS_RIDERSHIP VERSION V27</span></span>
<span id="cb48-2"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">                SELECT *,</span></span>
<span id="cb48-3"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">                    MODEL_VERSION_ALIAS!PREDICT(DAY_OF_WEEK, MONTH, PREV_DAY_RIDERS, MINIMUM_TEMPERATURE, MAXIMUM_TEMPERATURE, PRECIPITATION, DAYTYPE) AS TMP_RESULT</span></span>
<span id="cb48-4"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">                FROM SNOWPARK_ML_MODEL_INFERENCE_INPUT) ORDER BY "DATEW" ASC NULLS FIRST LIMIT 10"""</span></span></code></pre></div></div>
</div>
<div id="6d60a60b" class="cell" data-execution_count="30">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb49" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb49-1">results <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> session.sql(sqlquery).show()</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>---------------------------------------------------------
|"DATEW"     |"TOTAL_RIDERS"  |"TOTAL_RIDERS_FORECAST"  |
---------------------------------------------------------
|2019-01-01  |247279          |290942.375               |
|2019-01-02  |585996          |668251.3125              |
|2019-01-03  |660631          |767229.875               |
|2019-01-04  |662011          |759055.3125              |
|2019-01-05  |440848          |491881.78125             |
|2019-01-06  |316844          |351156.84375             |
|2019-01-07  |717818          |762515.625               |
|2019-01-08  |779946          |879376.0625              |
|2019-01-09  |743021          |790567.625               |
|2019-01-10  |743075          |764690.8125              |
---------------------------------------------------------
</code></pre>
</div>
</div>
<div id="0412e19f" class="cell" data-execution_count="31">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb51" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb51-1">test.show()</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"NOAA_WEATHER_STATION_ID"  |"DATEW"     |"MINIMUM_TEMPERATURE"  |"MAXIMUM_TEMPERATURE"  |"PRECIPITATION"  |"DATE"               |"DAY_OF_WEEK"  |"MONTH"  |"DAYTYPE"  |"TOTAL_RIDERS"  |"PREV_DAY_RIDERS"  |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|USW00014819                |2019-07-23  |16.7                   |27.2                   |0.0              |2019-07-23 00:00:00  |2              |7        |W          |751862          |729088             |
|USW00014819                |2020-12-25  |-12.7                  |-5.5                   |0.0              |2020-12-25 00:00:00  |5              |12       |U          |80199           |199439             |
|USW00014819                |2020-07-23  |18.3                   |27.2                   |0.0              |2020-07-23 00:00:00  |4              |7        |W          |312243          |303124             |
|USW00014819                |2021-01-15  |-2.1                   |3.3                    |0.0              |2021-01-15 00:00:00  |5              |1        |W          |274858          |273087             |
|USW00014819                |2019-06-05  |15.6                   |30.0                   |9.7              |2019-06-05 00:00:00  |3              |6        |W          |814691          |794543             |
|USW00014819                |2019-05-08  |8.3                    |24.4                   |9.7              |2019-05-08 00:00:00  |3              |5        |W          |820018          |802783             |
|USW00014819                |2021-01-17  |-2.1                   |1.1                    |0.5              |2021-01-17 00:00:00  |0              |1        |U          |141354          |190486             |
|USW00014819                |2019-12-09  |-3.8                   |9.4                    |0.5              |2019-12-09 00:00:00  |1              |12       |W          |780897          |341326             |
|USW00014819                |2020-07-30  |22.2                   |28.3                   |0.0              |2020-07-30 00:00:00  |4              |7        |W          |304656          |302637             |
|USW00014819                |2019-05-23  |17.8                   |24.4                   |0.5              |2019-05-23 00:00:00  |4              |5        |W          |799367          |805534             |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
</code></pre>
</div>
</div>
<p>Let’s calculate and save the metrics to the registry</p>
<div id="a6bea723" class="cell" data-execution_count="32">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb53" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb53-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> snowflake.ml.modeling.metrics <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> mean_absolute_error</span>
<span id="cb53-2">testpreds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> reg_model.run(test, function_name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'predict'</span>)</span>
<span id="cb53-3">mae <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> mean_absolute_error(df<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>testpreds, y_true_col_names<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'TOTAL_RIDERS'</span>, y_pred_col_names<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'"TOTAL_RIDERS_FORECAST"'</span>)</span>
<span id="cb53-4">reg_model.set_metric(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"MAE"</span>, value<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>mae)</span></code></pre></div></div>
</div>
<div id="5c587021" class="cell" data-execution_count="33">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb54" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb54-1">reg_model.show_metrics()</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="33">
<pre><code>{'MAE': 183320.1351068038}</code></pre>
</div>
</div>
<div id="00315bbd" class="cell" data-execution_count="34">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb56" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb56-1">session.sql(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"create or replace warehouse snowpark_opt_wh with warehouse_size = 'SMALL'"</span>).collect()</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="34">
<pre><code>[Row(status='Warehouse SNOWPARK_OPT_WH successfully created.')]</code></pre>
</div>
</div>
<div id="6b233260" class="cell" data-execution_count="35">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb58" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb58-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#session.close()</span></span></code></pre></div></div>
</div>


</section>

 ]]></description>
  <category>snowflake</category>
  <category>MLOps</category>
  <guid>https://rajivshah.com/blog/Snowpark_Forecasting_Bus.html</guid>
  <pubDate>Mon, 20 May 2024 05:00:00 GMT</pubDate>
  <media:content url="https://rajivshah.com/blog/fe_forecasting.png" medium="image" type="image/png" height="156" width="144"/>
</item>
<item>
  <title>Evaluation for Large Language Models (LLMs) and Generative AI - A Deep Dive</title>
  <link>https://rajivshah.com/blog/evaluating-llms-deep-dive.html</link>
  <description><![CDATA[ 






<section id="video" class="level2">
<h2 class="anchored" data-anchor-id="video">Video</h2>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/iQl03pQlYWY" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<p>Watch the <a href="https://youtu.be/iQl03pQlYWY">full video</a></p>
<hr>
</section>
<section id="annotated-presentation" class="level2">
<h2 class="anchored" data-anchor-id="annotated-presentation">Annotated Presentation</h2>
<p>Below is an annotated version of the presentation, with timestamped links to the relevant parts of the video for each slide.</p>
<p>Here is the annotated presentation for “Evaluating LLMs” by Rajiv Shah.</p>
<section id="title-slide-evaluating-llms" class="level3">
<h3 class="anchored" data-anchor-id="title-slide-evaluating-llms">1. Title Slide: Evaluating LLMs</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_1.png" class="img-fluid figure-img"></p>
<figcaption>Slide 1</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=0s">Timestamp: 00:00</a>)</p>
<p>The presentation begins with the title slide, introducing the speaker, Rajiv Shah, and the topic of <strong>Evaluating Large Language Models (LLMs)</strong>. The slide includes a link to a GitHub repository (<code>LLM-Evaluation</code>), which serves as a companion resource containing notebooks and code examples referenced throughout the talk.</p>
<p>Rajiv sets the stage by explaining his motivation: he sees many enterprises treating Generative AI as “science experiments” that fail to reach production. He argues that a major reason for this failure is a lack of proper evaluation strategies.</p>
<p>The goal of this talk is to move beyond experimentation and discuss how to rigorously evaluate models to get them into production and keep them there, covering technical, business, and operational perspectives.</p>
</section>
<section id="no-impact" class="level3">
<h3 class="anchored" data-anchor-id="no-impact">2. No Impact!</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_2.png" class="img-fluid figure-img"></p>
<figcaption>Slide 2</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=5s">Timestamp: 00:05</a>)</p>
<p>This slide humorously illustrates the current state of many LLM projects. It depicts a chaotic lab scene and a cartoon character in a strange vehicle, captioned “No impact!” This visualizes the frustration of data scientists building cool things that never deliver real-world value.</p>
<p>Rajiv uses this to highlight the “science experiment” nature of current GenAI work. Without proper evaluation, teams cannot prove the reliability or value of their models, preventing deployment.</p>
<p>The slide emphasizes the necessity of shifting from “playing around” with models to applying rigorous engineering discipline, starting with evaluation.</p>
</section>
<section id="three-pillars-of-evaluation" class="level3">
<h3 class="anchored" data-anchor-id="three-pillars-of-evaluation">3. Three Pillars of Evaluation</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_3.png" class="img-fluid figure-img"></p>
<figcaption>Slide 3</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=41s">Timestamp: 00:41</a>)</p>
<p>This slide breaks down Generative AI evaluation into three critical dimensions: <strong>Technical (F1)</strong>, <strong>Business ($$)</strong>, and <strong>Operational (TCO)</strong>. While the talk focuses heavily on technical metrics, Rajiv stresses that the other two are equally vital for production success.</p>
<p>The <strong>Business</strong> dimension asks about the return on investment and the cost of errors, while the <strong>Operational</strong> dimension considers the Total Cost of Ownership (TCO), latency, and maintenance.</p>
<p>Understanding all three pillars is what distinguishes a successful production deployment from a mere prototype.</p>
</section>
<section id="generative-ai-evaluation-methods" class="level3">
<h3 class="anchored" data-anchor-id="generative-ai-evaluation-methods">4. Generative AI Evaluation Methods</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_4.png" class="img-fluid figure-img"></p>
<figcaption>Slide 4</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=63s">Timestamp: 01:03</a>)</p>
<p>This chart is the central framework of the presentation. It categorizes technical evaluation methods based on <strong>Cost</strong> (y-axis) and <strong>Flexibility</strong> (x-axis). The methods range from rigid, low-cost approaches like <strong>Exact Matching</strong> to flexible, high-cost approaches like <strong>Red Teaming</strong>.</p>
<p>The slide lists specific methodologies: Exact matching, Similarity (BLEU/ROUGE), Functional correctness (Unit tests), Benchmarks (MMLU), Human evaluation, Model-based approaches (LLM-as-a-Judge), and Red teaming.</p>
<p>Rajiv notes that these categories overlap and are not mutually exclusive. This visual guide helps practitioners choose the right tool for their specific stage of development and resource constraints.</p>
</section>
<section id="application-to-rag" class="level3">
<h3 class="anchored" data-anchor-id="application-to-rag">5. Application to RAG</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_5.png" class="img-fluid figure-img"></p>
<figcaption>Slide 5</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=76s">Timestamp: 01:16</a>)</p>
<p>This slide previews the case study at the end of the talk: <strong>Retrieval Augmented Generation (RAG)</strong>. It shows a diagram splitting the RAG process into two distinct components: <strong>Retrieval</strong> (finding the data) and <strong>Augmented Generation</strong> (synthesizing the answer).</p>
<p>Rajiv introduces this here to promise a practical application of the concepts. He explains that after covering the evaluation methods, he will demonstrate how to apply them specifically to a RAG system.</p>
<p>This foreshadows the importance of <strong>component-wise evaluation</strong>—evaluating the retriever and the generator separately rather than just the system as a whole.</p>
</section>
<section id="evaluating-llms-title-repeat" class="level3">
<h3 class="anchored" data-anchor-id="evaluating-llms-title-repeat">6. Evaluating LLMs (Title Repeat)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_6.png" class="img-fluid figure-img"></p>
<figcaption>Slide 6</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=91s">Timestamp: 01:31</a>)</p>
<p>This slide serves as a transition point, reiterating the talk’s title and contact information. It signals the end of the introduction and the beginning of the deep dive into the current state of LLMs.</p>
<p>Rajiv notes that this will be a long, detailed talk, encouraging viewers to use the video timeline to skip around. He sets expectations for the pace and depth of the technical content to follow.</p>
</section>
<section id="many-ways-to-use-llms" class="level3">
<h3 class="anchored" data-anchor-id="many-ways-to-use-llms">7. Many Ways to Use LLMs</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_7.png" class="img-fluid figure-img"></p>
<figcaption>Slide 7</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=105s">Timestamp: 01:45</a>)</p>
<p>This slide illustrates the versatility of LLMs, showing examples of <strong>Question Answering</strong> and <strong>Code Generation</strong>. It highlights that LLMs are not limited to a single task like classification; they can summarize, chat, write code, and reason.</p>
<p>Rajiv explains that this versatility makes evaluation difficult. Unlike traditional ML where a simple confusion matrix might suffice, LLMs produce varied, open-ended outputs that require more complex assessment strategies.</p>
<p>The slide sets up the problem statement: because LLMs can do so much, we need a diverse set of evaluation tools to measure their performance across different modalities.</p>
</section>
<section id="open-source-llm-leaderboard" class="level3">
<h3 class="anchored" data-anchor-id="open-source-llm-leaderboard">8. Open Source LLM Leaderboard</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_8.png" class="img-fluid figure-img"></p>
<figcaption>Slide 8</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=138s">Timestamp: 02:18</a>)</p>
<p>This slide shows a screenshot of the <strong>Hugging Face Open LLM Leaderboard</strong>. It notes that over 2,000 LLMs have been evaluated, visualizing the sheer volume of models available to practitioners.</p>
<p>Rajiv describes the experience of looking for a model as “overwhelming.” With new models releasing weekly, relying solely on public leaderboards to pick a model is daunting and potentially misleading.</p>
<p>This introduces the concept of “Leaderboard Fatigue” and questions whether these general-purpose rankings are useful for specific enterprise use cases.</p>
</section>
<section id="helm-framework" class="level3">
<h3 class="anchored" data-anchor-id="helm-framework">9. HELM Framework</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_9.png" class="img-fluid figure-img"></p>
<figcaption>Slide 9</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=172s">Timestamp: 02:52</a>)</p>
<p>This slide introduces <strong>HELM (Holistic Evaluation of Language Models)</strong> from Stanford. It displays the framework’s structure, which evaluates models across various scenarios (datasets) and metrics (accuracy, bias, toxicity).</p>
<p>Rajiv presents HELM as the academic approach to the evaluation problem. It attempts to be comprehensive by measuring everything across many dimensions, offering a more rigorous alternative to simple leaderboards.</p>
<p>However, he points out that even this comprehensive approach has its downsides, primarily the sheer volume of data it produces.</p>
</section>
<section id="overwhelming-information" class="level3">
<h3 class="anchored" data-anchor-id="overwhelming-information">10. Overwhelming Information</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_10.png" class="img-fluid figure-img"></p>
<figcaption>Slide 10</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=205s">Timestamp: 03:25</a>)</p>
<p>This slide displays a screenshot of the HELM research paper, emphasizing its length (163 pages). The caption “Overwhelming!” reflects the difficulty a data scientist faces when trying to digest this amount of information.</p>
<p>Rajiv humorously compares the paper’s size to a “Harry Potter book,” illustrating that while the academic rigor is high, the practical barrier to entry is also significant.</p>
<p>The key takeaway is that while comprehensive benchmarks exist, they are often too dense for quick, practical decision-making in an enterprise setting.</p>
</section>
<section id="feeling-overwhelmed" class="level3">
<h3 class="anchored" data-anchor-id="feeling-overwhelmed">11. Feeling Overwhelmed</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_11.png" class="img-fluid figure-img"></p>
<figcaption>Slide 11</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=226s">Timestamp: 03:46</a>)</p>
<p>This visual slide features a person looking frustrated and burying their face in their hands. It represents the emotional state of a data scientist trying to navigate the complex, rapidly changing landscape of LLM evaluation.</p>
<p>Rajiv uses this to empathize with the audience. Between the thousands of models on Hugging Face and the hundreds of pages of academic papers, it is easy to feel lost.</p>
<p>This sets the stage for the need for simpler, more fundamental principles to guide evaluation.</p>
</section>
<section id="reliability-of-helm" class="level3">
<h3 class="anchored" data-anchor-id="reliability-of-helm">12. Reliability of HELM</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_12.png" class="img-fluid figure-img"></p>
<figcaption>Slide 12</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=237s">Timestamp: 03:57</a>)</p>
<p>This slide questions the reliability of benchmarks like HELM. It presents data showing that minor changes in dataset selection can lead to different scoring and winners <strong>22% of the time</strong>. A correlation matrix visualizes the relationships between different metrics.</p>
<p>Rajiv points out that benchmarks are fragile. If you change the specific datasets used to represent a “scenario,” the ranking of the models changes.</p>
<p>This implies that “winners” on leaderboards are often dependent on the specific composition of the benchmark rather than inherent superiority across all tasks.</p>
</section>
<section id="davinci-002-vs-davinci-003" class="level3">
<h3 class="anchored" data-anchor-id="davinci-002-vs-davinci-003">13. Davinci-002 vs Davinci-003</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_13.png" class="img-fluid figure-img"></p>
<figcaption>Slide 13</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=255s">Timestamp: 04:15</a>)</p>
<p>This slide highlights a specific anomaly in HELM results where an older model (<code>text-davinci-002</code>) appears to outperform a newer, better model (<code>text-davinci-003</code>) in accuracy.</p>
<p>Rajiv expresses skepticism, noting that OpenAI is unlikely to release a newer model that is objectively worse. This discrepancy suggests that the benchmark might not be capturing the improvements in the newer model, such as better instruction following or safety.</p>
<p>The slide serves as a warning: <strong>Do not blindly trust benchmark rankings</strong>, as they may not reflect the actual capabilities or “quality” of a model for your specific needs.</p>
</section>
<section id="leaderboard-reliability" class="level3">
<h3 class="anchored" data-anchor-id="leaderboard-reliability">14. Leaderboard Reliability</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_14.png" class="img-fluid figure-img"></p>
<figcaption>Slide 14</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=293s">Timestamp: 04:53</a>)</p>
<p>This slide examines the <strong>Open LLM Leaderboard</strong> again, pointing out that rankings are heavily influenced by specific datasets like <strong>TruthfulQA</strong>. It asks, “Is this impactful?”</p>
<p>Rajiv argues that if a model’s high ranking is driven primarily by its performance on a dataset like TruthfulQA, it might not be relevant to a user whose use case (e.g., summarizing financial documents) has nothing to do with that specific benchmark.</p>
<p>This reinforces the idea that general-purpose leaderboards may not align with specific business goals.</p>
</section>
<section id="model-evals-vs-system-evals" class="level3">
<h3 class="anchored" data-anchor-id="model-evals-vs-system-evals">15. Model Evals vs System Evals</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_15.png" class="img-fluid figure-img"></p>
<figcaption>Slide 15</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=333s">Timestamp: 05:33</a>)</p>
<p>This slide distinguishes between <strong>Model Evals</strong> (selecting the best model from <em>n</em> options) and <strong>System Evals</strong> (optimizing a single model for a specific task).</p>
<p>Rajiv explains that most public benchmarks focus on the former—comparing thousands of models. However, in enterprise settings, the goal is usually the latter: you pick a model (like GPT-4 or Llama 2) and need to evaluate how to optimize it for your specific application.</p>
<p>The talk focuses on bridging this gap, helping practitioners evaluate their specific implementation rather than just comparing base models.</p>
</section>
<section id="lost-in-the-maze" class="level3">
<h3 class="anchored" data-anchor-id="lost-in-the-maze">16. Lost in the Maze</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_16.png" class="img-fluid figure-img"></p>
<figcaption>Slide 16</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=393s">Timestamp: 06:33</a>)</p>
<p>This slide features an image of a hedge maze with the word “Lost,” symbolizing the confusion in the current evaluation landscape.</p>
<p>Rajiv uses this to pivot back to <strong>fundamentals</strong>. When lost in complex new technology, the best approach is to return to first principles of data science evaluation.</p>
<p>He prepares the audience to look at a classic machine learning problem to ground the upcoming LLM concepts.</p>
</section>
<section id="evaluating-customer-churn" class="level3">
<h3 class="anchored" data-anchor-id="evaluating-customer-churn">17. Evaluating Customer Churn</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_17.png" class="img-fluid figure-img"></p>
<figcaption>Slide 17</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=409s">Timestamp: 06:49</a>)</p>
<p>This slide introduces a classic “Data Science 101” problem: <strong>Customer Churn</strong>. It depicts an exit door and a pie chart, setting up a scenario where a data scientist must evaluate a model designed to predict which customers will leave.</p>
<p>Rajiv uses this familiar example to contrast different levels of evaluation maturity, which he will then map onto GenAI.</p>
</section>
<section id="junior-data-scientist-approach" class="level3">
<h3 class="anchored" data-anchor-id="junior-data-scientist-approach">18. Junior Data Scientist Approach</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_18.png" class="img-fluid figure-img"></p>
<figcaption>Slide 18</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=427s">Timestamp: 07:07</a>)</p>
<p>This slide shows standard classification metrics: <strong>ROC curve</strong>, <strong>Confusion Matrix</strong>, <strong>F1 Score</strong>, and <strong>True Positive Rate</strong>. Rajiv labels this as the “Junior Data Scientist” approach.</p>
<p>While these metrics are technically correct, they are abstract. A junior data scientist presents these to a boss and says, “Look, I improved the AUC,” which often fails to communicate business value.</p>
<p>This represents the <strong>Technical</strong> pillar of evaluation—necessary, but insufficient for business stakeholders.</p>
</section>
<section id="senior-data-scientist-approach" class="level3">
<h3 class="anchored" data-anchor-id="senior-data-scientist-approach">19. Senior Data Scientist Approach</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_19.png" class="img-fluid figure-img"></p>
<figcaption>Slide 19</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=460s">Timestamp: 07:40</a>)</p>
<p>This slide introduces <strong>Profit Curves</strong>. It translates the confusion matrix into dollar values (cost of false positives vs.&nbsp;value of true positives). Rajiv calls this the “Senior Data Scientist” approach.</p>
<p>Here, the evaluation focuses on <strong>Business Value</strong>: “How much profit will this model generate compared to the baseline?” This aligns the technical model with business goals ($$).</p>
<p>The lesson is that LLM evaluation must eventually map to business outcomes, not just technical benchmarks.</p>
</section>
<section id="data-science-leader-approach" class="level3">
<h3 class="anchored" data-anchor-id="data-science-leader-approach">20. Data Science Leader Approach</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_20.png" class="img-fluid figure-img"></p>
<figcaption>Slide 20</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=507s">Timestamp: 08:27</a>)</p>
<p>This slide discusses the <strong>Total Cost of Ownership (TCO)</strong> and <strong>Monitoring</strong>. It reflects the “Data Science Leader” perspective, which looks at the system holistically.</p>
<p>A leader asks: “Is it worth spending 5 more weeks to get 3% more accuracy?” and “How will we monitor this when customer behavior changes?”</p>
<p>This corresponds to the <strong>Operational</strong> pillar. It emphasizes that evaluation includes considering the cost of building, maintaining, and running the model over time.</p>
</section>
<section id="evaluate-generative-ai-tasks" class="level3">
<h3 class="anchored" data-anchor-id="evaluate-generative-ai-tasks">21. Evaluate Generative AI Tasks?</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_21.png" class="img-fluid figure-img"></p>
<figcaption>Slide 21</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=563s">Timestamp: 09:23</a>)</p>
<p>This slide transitions back to Generative AI, showing examples of code generation and summarization. It asks how to apply the principles just discussed (Technical, Business, Operational) to these new, complex tasks.</p>
<p>Rajiv acknowledges that while the outputs (text, code) are different from simple classification labels, the fundamental need to evaluate across three dimensions remains.</p>
</section>
<section id="three-pillars-genai-context" class="level3">
<h3 class="anchored" data-anchor-id="three-pillars-genai-context">22. Three Pillars (GenAI Context)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_22.png" class="img-fluid figure-img"></p>
<figcaption>Slide 22</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=576s">Timestamp: 09:36</a>)</p>
<p>This slide repeats the <strong>Technical, Business, Operational</strong> framework, asserting “Still the same principles!”</p>
<p>Rajiv reinforces that despite the hype and novelty of LLMs, we must not abandon standard engineering practices. We still need to measure technical accuracy (F1 equivalent), business impact ($$), and operational costs (TCO).</p>
</section>
<section id="evaluation-in-the-ml-lifecycle" class="level3">
<h3 class="anchored" data-anchor-id="evaluation-in-the-ml-lifecycle">23. Evaluation in the ML Lifecycle</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_23.png" class="img-fluid figure-img"></p>
<figcaption>Slide 23</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=585s">Timestamp: 09:45</a>)</p>
<p>This slide displays a “multi-headed llama” graphic representing the ML lifecycle: <strong>Development</strong>, <strong>Training</strong>, and <strong>Deployment</strong>.</p>
<p>Rajiv explains that evaluation is not a one-time step. It happens: 1. <strong>Before:</strong> To decide if a project is viable. 2. <strong>During:</strong> To train and tune the model. 3. <strong>After:</strong> To monitor the model in production (Monitoring is the “sibling” of Evaluation).</p>
</section>
<section id="faster-better-cheaper" class="level3">
<h3 class="anchored" data-anchor-id="faster-better-cheaper">24. Faster, Better, Cheaper</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_24.png" class="img-fluid figure-img"></p>
<figcaption>Slide 24</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=633s">Timestamp: 10:33</a>)</p>
<p>This slide features a tweet by Eugene Yan, stating that automated evaluations lead to <strong>“faster, better, cheaper”</strong> LLMs. It mentions that good eval pipelines allow for safer deployments and faster experiments.</p>
<p>Rajiv cites the example of Hugging Face’s <strong>Zephyr</strong> model. The team built it in just a few days because they had spent months building a robust evaluation pipeline.</p>
<p>The key insight is that investing in evaluation infrastructure upfront accelerates actual model development and iteration.</p>
</section>
<section id="traditional-nlp-tasks" class="level3">
<h3 class="anchored" data-anchor-id="traditional-nlp-tasks">25. Traditional NLP Tasks</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_25.png" class="img-fluid figure-img"></p>
<figcaption>Slide 25</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=711s">Timestamp: 11:51</a>)</p>
<p>This slide advises that if you are using GenAI for a traditional NLP task (like sentiment analysis), you should <strong>“start with traditional metrics/datasets.”</strong></p>
<p>However, Rajiv warns about <strong>Data Leakage</strong>. Because LLMs are trained on the internet, they may have already seen the test sets of standard benchmarks.</p>
<p>The takeaway: Use standard metrics if applicable, but be skeptical of results that seem too good, as the model might be memorizing the test data.</p>
</section>
<section id="breaking-existing-evaluations" class="level3">
<h3 class="anchored" data-anchor-id="breaking-existing-evaluations">26. Breaking Existing Evaluations</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_26.png" class="img-fluid figure-img"></p>
<figcaption>Slide 26</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=758s">Timestamp: 12:38</a>)</p>
<p>This slide explains that LLMs can <strong>“break existing evaluations.”</strong> It cites research where LLMs scored poorly on automated metrics but were rated highly by humans.</p>
<p>Rajiv explains that LLMs have such a fluid and rich understanding of language that they often produce correct answers that old, rigid metrics fail to recognize.</p>
<p>This highlights the limitation of using pre-LLM automated metrics for modern models; the models have outpaced the measurement tools.</p>
</section>
<section id="beating-human-baselines" class="level3">
<h3 class="anchored" data-anchor-id="beating-human-baselines">27. Beating Human Baselines</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_27.png" class="img-fluid figure-img"></p>
<figcaption>Slide 27</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=809s">Timestamp: 13:29</a>)</p>
<p>This slide presents data showing LLMs (GPT-3/4) beating <strong>human baselines</strong> in tasks like summarization. The charts show LLMs scoring higher in faithfulness, coherence, and relevance.</p>
<p>Rajiv mentions recent research where GPT-4 wrote medical reports that doctors preferred over those written by other humans.</p>
<p>This poses a challenge: How do you evaluate a model when it is better than the human annotators you would typically use as a gold standard?</p>
</section>
<section id="methods-chart-recap" class="level3">
<h3 class="anchored" data-anchor-id="methods-chart-recap">28. Methods Chart (Recap)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_28.png" class="img-fluid figure-img"></p>
<figcaption>Slide 28</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=843s">Timestamp: 14:03</a>)</p>
<p>This slide brings back the <strong>Generative AI Evaluation Methods</strong> chart (Cost vs.&nbsp;Flexibility). An arrow points to “Raj guess,” indicating that the placement of these methods is an estimation.</p>
<p>Rajiv uses this to reorient the audience before diving into the specific methods one by one, starting from the bottom left (least flexible/cheapest).</p>
</section>
<section id="progression-of-evaluation" class="level3">
<h3 class="anchored" data-anchor-id="progression-of-evaluation">29. Progression of Evaluation</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_29.png" class="img-fluid figure-img"></p>
<figcaption>Slide 29</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=867s">Timestamp: 14:27</a>)</p>
<p>This slide shows a directional arrow moving “up” the chart, from <strong>Exact Matching</strong> toward <strong>Red Teaming</strong>.</p>
<p>Rajiv explains the flow of the presentation: we will start with rigid, simple metrics and move toward more complex, flexible, and human-centric evaluation methods.</p>
</section>
<section id="exact-matching-approach" class="level3">
<h3 class="anchored" data-anchor-id="exact-matching-approach">30. Exact Matching Approach</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_30.png" class="img-fluid figure-img"></p>
<figcaption>Slide 30</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=922s">Timestamp: 15:22</a>)</p>
<p>This slide highlights the <strong>“Exact matching approach”</strong> box on the chart.</p>
<p>This is the starting point: the simplest form of evaluation where the model’s output must be identical to a reference answer.</p>
</section>
<section id="how-hard-could-it-be" class="level3">
<h3 class="anchored" data-anchor-id="how-hard-could-it-be">31. How Hard Could It Be?</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_31.png" class="img-fluid figure-img"></p>
<figcaption>Slide 31</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=962s">Timestamp: 16:02</a>)</p>
<p>This slide asks, <strong>“How hard could evaluation be?”</strong> It shows simple outputs (Yes/No, A/B/C/D) and suggests that checking if string A equals string B should be trivial.</p>
<p>Rajiv uses this to set up a contrast. While it <em>looks</em> simple like a basic Python script, the reality of LLMs makes even this basic task complicated due to formatting and non-determinism.</p>
</section>
<section id="consistent-prediction-workflow" class="level3">
<h3 class="anchored" data-anchor-id="consistent-prediction-workflow">32. Consistent Prediction Workflow</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_32.png" class="img-fluid figure-img"></p>
<figcaption>Slide 32</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=980s">Timestamp: 16:20</a>)</p>
<p>This slide outlines a workflow: <strong>Inputs</strong> (Tokenization, Prompts) -&gt; <strong>Model</strong> (Hyperparameters) -&gt; <strong>Outputs</strong> (Evaluation).</p>
<p>Rajiv emphasizes that to get exact matching to work, you need extreme consistency across this entire pipeline. He warns that you must plan for multiple iterations because things will go wrong at every step.</p>
</section>
<section id="story-time-mmlu-leaderboards" class="level3">
<h3 class="anchored" data-anchor-id="story-time-mmlu-leaderboards">33. Story Time: MMLU Leaderboards</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_33.png" class="img-fluid figure-img"></p>
<figcaption>Slide 33</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1000s">Timestamp: 16:40</a>)</p>
<p>This slide shows a tweet announcing a new LLM topping the leaderboard, but points out a discrepancy: <strong>“Why did we have two different MMLU scores?”</strong></p>
<p>Rajiv tells the story of how a model claimed a high score on Twitter, but the actual paper showed a lower score. This discrepancy triggered an investigation into why the same model on the same benchmark produced different results.</p>
</section>
<section id="what-is-mmlu" class="level3">
<h3 class="anchored" data-anchor-id="what-is-mmlu">34. What is MMLU?</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_34.png" class="img-fluid figure-img"></p>
<figcaption>Slide 34</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1057s">Timestamp: 17:37</a>)</p>
<p>This slide defines <strong>MMLU (Massive Multitask Language Understanding)</strong>. It is a benchmark covering 57 tasks (Math, History, CS), designed to measure the “knowledge” of a model.</p>
<p>Rajiv shows examples of questions (Microeconomics, Physics) to illustrate that these are multiple-choice questions used to gauge general intelligence.</p>
</section>
<section id="why-mmlu-evaluation-differed" class="level3">
<h3 class="anchored" data-anchor-id="why-mmlu-evaluation-differed">35. Why MMLU Evaluation Differed</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_35.png" class="img-fluid figure-img"></p>
<figcaption>Slide 35</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1096s">Timestamp: 18:16</a>)</p>
<p>This slide reveals the culprit behind the score discrepancy: <strong>Prompt Formatting</strong>. It shows three different prompt styles (HELM, Eleuther, Original) used by different evaluation harnesses.</p>
<p>Rajiv challenges the audience to spot the differences. They are subtle: an extra space, a different bracket style around the letter <code>(A)</code> vs <code>A.</code>, or the inclusion of a subject line.</p>
</section>
<section id="style-changes-accuracy" class="level3">
<h3 class="anchored" data-anchor-id="style-changes-accuracy">36. Style Changes Accuracy</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_36.png" class="img-fluid figure-img"></p>
<figcaption>Slide 36</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1153s">Timestamp: 19:13</a>)</p>
<p>This slide states that these simple style changes resulted in a <strong>~5% change in accuracy</strong>.</p>
<p>Rajiv underscores the significance: a 5% swing is massive on a leaderboard. This proves that LLMs are incredibly sensitive to prompt syntax. It also serves as a warning to be skeptical of reported benchmark scores, as they can be “massaged” simply by tweaking the prompt format.</p>
</section>
<section id="story-falcon-model-bias" class="level3">
<h3 class="anchored" data-anchor-id="story-falcon-model-bias">37. Story: Falcon Model Bias</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_37.png" class="img-fluid figure-img"></p>
<figcaption>Slide 37</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1235s">Timestamp: 20:35</a>)</p>
<p>This slide introduces the <strong>Falcon</strong> model story. Users noticed that when asked for a “technologically advanced city,” Falcon would almost always suggest <strong>Abu Dhabi</strong>.</p>
<p>Rajiv sets up the mystery: Was the model biased because it was trained in the Middle East? Why was it so fixated on this specific city?</p>
</section>
<section id="biased-model-human-rights" class="level3">
<h3 class="anchored" data-anchor-id="biased-model-human-rights">38. Biased Model (Human Rights)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_38.png" class="img-fluid figure-img"></p>
<figcaption>Slide 38</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1319s">Timestamp: 21:59</a>)</p>
<p>This slide shows that the Falcon model also refused to discuss <strong>human rights abuses</strong> in Abu Dhabi.</p>
<p>This fueled speculation that the model had been censored or biased during training to avoid sensitive topics regarding its region of origin.</p>
</section>
<section id="demo-placeholder" class="level3">
<h3 class="anchored" data-anchor-id="demo-placeholder">39. Demo Placeholder</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_39.png" class="img-fluid figure-img"></p>
<figcaption>Slide 39</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1276s">Timestamp: 21:16</a>)</p>
<p>This slide simply says “Let’s try to demo this.” In the video, Rajiv switches to a live recording of him interacting with the model to demonstrate the bias firsthand.</p>
</section>
<section id="check-the-system-prompt" class="level3">
<h3 class="anchored" data-anchor-id="check-the-system-prompt">40. Check the System Prompt</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_40.png" class="img-fluid figure-img"></p>
<figcaption>Slide 40</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1346s">Timestamp: 22:26</a>)</p>
<p>This slide reveals the answer to the Falcon mystery: <strong>The System Prompt</strong>.</p>
<p>It turns out the model had a hidden system instruction explicitly stating it was built in Abu Dhabi. When researchers changed this prompt (e.g., to “Mexico”), the model’s behavior changed, and it stopped forcing Abu Dhabi into answers.</p>
<p>The lesson: <strong>System prompts heavily influence evaluation results</strong>. Small changes in hidden instructions can radically alter model behavior.</p>
</section>
<section id="prompt-engineering" class="level3">
<h3 class="anchored" data-anchor-id="prompt-engineering">41. Prompt Engineering</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_41.png" class="img-fluid figure-img"></p>
<figcaption>Slide 41</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1406s">Timestamp: 23:26</a>)</p>
<p>This slide discusses <strong>Prompt Engineering</strong> techniques like <strong>Chain-of-Thought (COT)</strong>. It shows how asking a model to “think step by step” improves reasoning on math problems.</p>
<p>Rajiv emphasizes that identifying the <em>best</em> prompt is a crucial part of the evaluation workflow. You aren’t just evaluating the model; you are evaluating the model <em>plus</em> the prompt.</p>
</section>
<section id="hands-on-prompting" class="level3">
<h3 class="anchored" data-anchor-id="hands-on-prompting">42. Hands on: Prompting</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_42.png" class="img-fluid figure-img"></p>
<figcaption>Slide 42</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1445s">Timestamp: 24:05</a>)</p>
<p>This slide introduces a hands-on exercise. It encourages users to use OpenAI’s playground to experiment with different prompts, specifically COT and system prompt variations.</p>
</section>
<section id="hands-on-glados" class="level3">
<h3 class="anchored" data-anchor-id="hands-on-glados">43. Hands on: GLaDOS</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_43.png" class="img-fluid figure-img"></p>
<figcaption>Slide 43</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1454s">Timestamp: 24:14</a>)</p>
<p>This slide shows a fun example where the system prompt turns ChatGPT into <strong>GLaDOS</strong> (from the game Portal).</p>
<p>Rajiv uses this to demonstrate the power of the system prompt to change the persona and tone of the model completely.</p>
</section>
<section id="workflow-inputs-recap" class="level3">
<h3 class="anchored" data-anchor-id="workflow-inputs-recap">44. Workflow: Inputs Recap</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_44.png" class="img-fluid figure-img"></p>
<figcaption>Slide 44</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1466s">Timestamp: 24:26</a>)</p>
<p>This slide updates the <strong>Consistent Prediction Workflow</strong>. Under “Inputs,” it now explicitly lists <strong>System Prompt</strong>, Tokenization, Prompt Styles, and Prompt Engineering.</p>
<p>This summarizes the section: to get consistent evaluation, you must control all these input variables.</p>
</section>
<section id="variability-of-llm-models" class="level3">
<h3 class="anchored" data-anchor-id="variability-of-llm-models">45. Variability of LLM Models</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_45.png" class="img-fluid figure-img"></p>
<figcaption>Slide 45</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1479s">Timestamp: 24:39</a>)</p>
<p>This slide shifts focus to the <strong>Model</strong> component. It notes that model size affects scores (Llama-2 example) and introduces the concept of <strong>Non-deterministic inference</strong>.</p>
<p>Rajiv points out that GPU calculations introduce slight randomness, meaning you might not get bit-wise reproducibility even with the same settings.</p>
</section>
<section id="gpt-4-vs-gpt-3.5" class="level3">
<h3 class="anchored" data-anchor-id="gpt-4-vs-gpt-3.5">46. GPT-4 vs GPT-3.5</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_46.png" class="img-fluid figure-img"></p>
<figcaption>Slide 46</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1486s">Timestamp: 24:46</a>)</p>
<p>This slide compares <strong>GPT-4 vs GPT-3.5</strong>. It shows that even models from the same “family” give very different answers to political opinion questions.</p>
<p>Rajiv uses this to show that you cannot swap models (e.g., using a cheaper model for dev and a larger one for prod) without re-evaluating, as their behaviors diverge significantly.</p>
</section>
<section id="non-deterministic-inference" class="level3">
<h3 class="anchored" data-anchor-id="non-deterministic-inference">47. Non-deterministic Inference</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_47.png" class="img-fluid figure-img"></p>
<figcaption>Slide 47</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1528s">Timestamp: 25:28</a>)</p>
<p>This slide dives deeper into <strong>Non-deterministic inference</strong>. It explains that floating-point calculations on GPUs can have tiny variances that ripple out to affect token selection.</p>
<p>For data scientists coming from deterministic systems (like logistic regression), this lack of 100% reproducibility can be a shock and complicates “exact match” testing.</p>
</section>
<section id="reliability-of-commercial-apis" class="level3">
<h3 class="anchored" data-anchor-id="reliability-of-commercial-apis">48. Reliability of Commercial APIs</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_48.png" class="img-fluid figure-img"></p>
<figcaption>Slide 48</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1563s">Timestamp: 26:03</a>)</p>
<p>This slide addresses <strong>Model Drift</strong> in commercial APIs. It shows graphs of GPT-3.5 and GPT-4 performance changing over time on tasks like identifying prime numbers.</p>
<p>Rajiv warns that if you don’t own the model (i.e., you use an API), the vendor might update it behind the scenes, breaking your evaluation baselines.</p>
</section>
<section id="hyperparameters" class="level3">
<h3 class="anchored" data-anchor-id="hyperparameters">49. Hyperparameters</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_49.png" class="img-fluid figure-img"></p>
<figcaption>Slide 49</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1587s">Timestamp: 26:27</a>)</p>
<p>This slide shows the UI for <strong>Hyperparameters</strong> (Temperature, Max Length, Top P).</p>
<p>Rajiv reminds the audience that these settings drastically influence predictions. Evaluation must be done with the exact same hyperparameters intended for production.</p>
</section>
<section id="output-evaluation" class="level3">
<h3 class="anchored" data-anchor-id="output-evaluation">50. Output Evaluation</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_50.png" class="img-fluid figure-img"></p>
<figcaption>Slide 50</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1600s">Timestamp: 26:40</a>)</p>
<p>This slide highlights the <strong>“Output evaluation”</strong> step in the workflow.</p>
<p>Now that we’ve covered inputs and models, Rajiv moves to the challenge of parsing and judging the text the model actually produces.</p>
</section>
<section id="generating-multiple-choice-output" class="level3">
<h3 class="anchored" data-anchor-id="generating-multiple-choice-output">51. Generating Multiple Choice Output</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_51.png" class="img-fluid figure-img"></p>
<figcaption>Slide 51</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1606s">Timestamp: 26:46</a>)</p>
<p>This slide discusses the difficulty of evaluating <strong>Multiple Choice</strong> answers. * <strong>First Letter Approach:</strong> Just look for “A” or “B”. Fails if the model says “The answer is A”. * <strong>Entire Answer:</strong> Look for the full text. Fails if the model phrases it slightly differently.</p>
<p>Rajiv illustrates that even “simple” multiple-choice evaluation requires complex parsing logic because LLMs love to “chat” and add extra text.</p>
</section>
<section id="evaluating-mmlu-different-outputs" class="level3">
<h3 class="anchored" data-anchor-id="evaluating-mmlu-different-outputs">52. Evaluating MMLU: Different Outputs</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_52.png" class="img-fluid figure-img"></p>
<figcaption>Slide 52</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1648s">Timestamp: 27:28</a>)</p>
<p>This slide compares how <strong>HELM</strong>, <strong>AI Harness</strong>, and the <strong>Original</strong> MMLU implementation parsed outputs.</p>
<p>It reveals that the discrepancy in MMLU scores wasn’t just about prompts; it was also about <em>how</em> the evaluation code extracted the answer from the model’s response.</p>
</section>
<section id="consistency-is-hard" class="level3">
<h3 class="anchored" data-anchor-id="consistency-is-hard">53. Consistency is Hard!</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_53.png" class="img-fluid figure-img"></p>
<figcaption>Slide 53</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1660s">Timestamp: 27:40</a>)</p>
<p>This slide summarizes the MMLU saga: <strong>“Consistency is hard!”</strong> It shows the table of scores again.</p>
<p>The takeaway is that “Exact Match” is a misnomer. It requires rigorous standardization of inputs, models, and output parsing to be reliable.</p>
</section>
<section id="hands-on-evaluating-outputs" class="level3">
<h3 class="anchored" data-anchor-id="hands-on-evaluating-outputs">54. Hands on: Evaluating Outputs</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_54.png" class="img-fluid figure-img"></p>
<figcaption>Slide 54</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1679s">Timestamp: 27:59</a>)</p>
<p>This slide introduces a hands-on exercise evaluating sentiment analysis. It shows a spreadsheet where different models output sentiment in different formats (some verbose, some concise).</p>
<p>Rajiv uses this to show the messy reality of parsing LLM outputs.</p>
</section>
<section id="solutions-standardizing-outputs" class="level3">
<h3 class="anchored" data-anchor-id="solutions-standardizing-outputs">55. Solutions: Standardizing Outputs</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_55.png" class="img-fluid figure-img"></p>
<figcaption>Slide 55</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1714s">Timestamp: 28:34</a>)</p>
<p>This slide presents solutions for the output problem: 1. <strong>OpenAI Function Calling:</strong> Forces the model to output structured JSON. 2. <strong>Guardrails AI:</strong> A library for validating outputs against a schema.</p>
<p>Rajiv suggests that using these tools to tame the model into structured output makes “Exact Match” evaluation much more feasible.</p>
</section>
<section id="workflow-types-of-prompts" class="level3">
<h3 class="anchored" data-anchor-id="workflow-types-of-prompts">56. Workflow: Types of Prompts</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_56.png" class="img-fluid figure-img"></p>
<figcaption>Slide 56</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1739s">Timestamp: 28:59</a>)</p>
<p>This slide adds <strong>“Types of Prompts”</strong> to the Input section of the workflow diagram.</p>
<p>Rajiv reiterates the need to plan for <strong>multiple iterations</strong>. You will likely need to tweak your prompts and parsing logic many times to get a stable evaluation pipeline.</p>
</section>
<section id="resources-prompting" class="level3">
<h3 class="anchored" data-anchor-id="resources-prompting">57. Resources: Prompting</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_57.png" class="img-fluid figure-img"></p>
<figcaption>Slide 57</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1750s">Timestamp: 29:10</a>)</p>
<p>This slide lists resources for learning prompting, including the OpenAI Cookbook and the DAIR.AI Prompt Engineering Guide.</p>
</section>
<section id="similarity-approach" class="level3">
<h3 class="anchored" data-anchor-id="similarity-approach">58. Similarity Approach</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_58.png" class="img-fluid figure-img"></p>
<figcaption>Slide 58</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1753s">Timestamp: 29:13</a>)</p>
<p>This slide moves up the chart to the <strong>Similarity approach</strong>.</p>
<p>Rajiv introduces this as the next level of flexibility. If exact matching is too rigid, we check if the output is “similar enough” to the reference.</p>
</section>
<section id="story-translation" class="level3">
<h3 class="anchored" data-anchor-id="story-translation">59. Story: Translation</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_59.png" class="img-fluid figure-img"></p>
<figcaption>Slide 59</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1769s">Timestamp: 29:29</a>)</p>
<p>This slide presents a translation challenge. It shows three human references for a Chinese-to-English translation and two computer candidates.</p>
<p>Rajiv asks the audience to guess which candidate is better. This exercise builds intuition for how similarity metrics work: we look for overlapping words and phrases between the candidate and the references.</p>
</section>
<section id="bleu-metric" class="level3">
<h3 class="anchored" data-anchor-id="bleu-metric">60. BLEU Metric</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_60.png" class="img-fluid figure-img"></p>
<figcaption>Slide 60</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1847s">Timestamp: 30:47</a>)</p>
<p>This slide introduces <strong>BLEU (Bilingual Evaluation Understudy)</strong>. It explains that BLEU calculates scores based on n-gram overlap (1-gram to 4-gram) between the generated text and reference text.</p>
<p>This is the mathematical formalization of the intuition from the previous slide. It’s a standard metric for translation.</p>
</section>
<section id="many-similarity-methods" class="level3">
<h3 class="anchored" data-anchor-id="many-similarity-methods">61. Many Similarity Methods</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_61.png" class="img-fluid figure-img"></p>
<figcaption>Slide 61</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1875s">Timestamp: 31:15</a>)</p>
<p>This slide lists various similarity metrics: <strong>Exact match, Edit distance, ROUGE, WER, METEOR, Cosine similarity</strong>.</p>
<p>Rajiv notes the pros and cons: They are fast and easy to calculate, but they <strong>don’t consider meaning</strong> (semantics) and are biased toward shorter text. They measure lexical overlap, not understanding.</p>
</section>
<section id="taxonomy-of-similarity-methods" class="level3">
<h3 class="anchored" data-anchor-id="taxonomy-of-similarity-methods">62. Taxonomy of Similarity Methods</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_62.png" class="img-fluid figure-img"></p>
<figcaption>Slide 62</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1926s">Timestamp: 32:06</a>)</p>
<p>This slide shows a complex flow chart categorizing similarity methods into <strong>Untrained</strong> (lexical, character-based) and <strong>Trained</strong> (embedding-based).</p>
<p>It illustrates the depth of research in this field, showing that there are dozens of ways to calculate “similarity.”</p>
</section>
<section id="similarity-methods-for-code" class="level3">
<h3 class="anchored" data-anchor-id="similarity-methods-for-code">63. Similarity Methods for Code</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_63.png" class="img-fluid figure-img"></p>
<figcaption>Slide 63</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1935s">Timestamp: 32:15</a>)</p>
<p>This slide asks if similarity works for <strong>Code</strong>. It shows a Python function <code>incr_list</code>.</p>
<p>Rajiv argues that similarity <strong>“Doesn’t work for code.”</strong> In code, variable names can change, and logic can be refactored, resulting in zero string similarity even if the code functions identically. Conversely, a single missing character (syntax error) can break code that is 99% similar textually.</p>
</section>
<section id="functional-correctness" class="level3">
<h3 class="anchored" data-anchor-id="functional-correctness">64. Functional Correctness</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_64.png" class="img-fluid figure-img"></p>
<figcaption>Slide 64</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1957s">Timestamp: 32:37</a>)</p>
<p>This slide highlights <strong>Functional Correctness</strong> on the chart.</p>
<p>This is the solution to the code evaluation problem. Instead of checking if the text looks right, we execute it to see if it <em>works</em>.</p>
</section>
<section id="problem-evaluating-code" class="level3">
<h3 class="anchored" data-anchor-id="problem-evaluating-code">65. Problem: Evaluating Code</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_65.png" class="img-fluid figure-img"></p>
<figcaption>Slide 65</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1974s">Timestamp: 32:54</a>)</p>
<p>This slide reinforces the failure of BLEU for code. It shows that a correct solution might have a low BLEU score because it uses different variable names than the reference.</p>
</section>
<section id="evaluating-code-with-unit-tests" class="level3">
<h3 class="anchored" data-anchor-id="evaluating-code-with-unit-tests">66. Evaluating Code with Unit Tests</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_66.png" class="img-fluid figure-img"></p>
<figcaption>Slide 66</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=1997s">Timestamp: 33:17</a>)</p>
<p>This slide introduces the <strong>Unit Test</strong> approach. We take the generated code, run it against a set of test cases (inputs and expected outputs), and check for a pass/fail result.</p>
<p>Rajiv advocates for this approach because it is unambiguous. The code either runs and produces the right result, or it doesn’t.</p>
</section>
<section id="humaneval-benchmark" class="level3">
<h3 class="anchored" data-anchor-id="humaneval-benchmark">67. HumanEval Benchmark</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_67.png" class="img-fluid figure-img"></p>
<figcaption>Slide 67</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2056s">Timestamp: 34:16</a>)</p>
<p>This slide presents <strong>HumanEval</strong>, a famous benchmark for code LLMs that uses functional correctness (pass@1). It lists models like GPT-4 and WizardCoder and their scores.</p>
<p>This validates that functional correctness is the industry standard for evaluating coding capabilities.</p>
</section>
<section id="hands-on-building-functional-tests-email" class="level3">
<h3 class="anchored" data-anchor-id="hands-on-building-functional-tests-email">68. Hands on: Building Functional Tests (Email)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_68.png" class="img-fluid figure-img"></p>
<figcaption>Slide 68</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2068s">Timestamp: 34:28</a>)</p>
<p>This slide asks how to apply functional correctness to <strong>Text</strong> (e.g., drafting emails).</p>
<p>Rajiv suggests defining “functional” properties for text: Is it concise? Does it include a call to action? Is the tone polite? These are testable assertions we can make about text output.</p>
</section>
<section id="hands-on-python-test-for-text" class="level3">
<h3 class="anchored" data-anchor-id="hands-on-python-test-for-text">69. Hands on: Python Test for Text</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_69.png" class="img-fluid figure-img"></p>
<figcaption>Slide 69</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2119s">Timestamp: 35:19</a>)</p>
<p>This slide shows a Python snippet that tests if an email uses “informal language.”</p>
<p>It demonstrates that we can write code to evaluate text properties, effectively treating text generation as a “functional” problem with pass/fail criteria.</p>
</section>
<section id="evaluation-benchmarks" class="level3">
<h3 class="anchored" data-anchor-id="evaluation-benchmarks">70. Evaluation Benchmarks</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_70.png" class="img-fluid figure-img"></p>
<figcaption>Slide 70</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2154s">Timestamp: 35:54</a>)</p>
<p>This slide highlights <strong>Evaluation Benchmarks</strong> on the chart.</p>
<p>Rajiv moves to this category, explaining that benchmarks are essentially collections of the previous methods (exact match, functional tests) aggregated into large suites.</p>
</section>
<section id="story-glue-benchmark" class="level3">
<h3 class="anchored" data-anchor-id="story-glue-benchmark">71. Story: GLUE Benchmark</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_71.png" class="img-fluid figure-img"></p>
<figcaption>Slide 71</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2165s">Timestamp: 36:05</a>)</p>
<p>This slide tells the history of <strong>GLUE (2018)</strong>. Before GLUE, models were specialized for single tasks. GLUE introduced the idea of a <strong>General Language Understanding Evaluation</strong>, pushing the field toward models that could handle many different tasks well.</p>
<p>Rajiv credits GLUE with driving the progress that led to modern LLMs by giving researchers a unified target.</p>
</section>
<section id="so-many-benchmarks" class="level3">
<h3 class="anchored" data-anchor-id="so-many-benchmarks">72. So Many Benchmarks</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_72.png" class="img-fluid figure-img"></p>
<figcaption>Slide 72</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2267s">Timestamp: 37:47</a>)</p>
<p>This slide introduces successors to GLUE: <strong>HellaSwag</strong> (commonsense) and <strong>Big Bench</strong> (reasoning).</p>
<p>Rajiv notes that Big Bench Hard compares models to average and max human performance, providing a measuring stick for how close AI is getting to human-level reasoning.</p>
</section>
<section id="even-more-benchmarks" class="level3">
<h3 class="anchored" data-anchor-id="even-more-benchmarks">73. Even More Benchmarks</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_73.png" class="img-fluid figure-img"></p>
<figcaption>Slide 73</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2320s">Timestamp: 38:40</a>)</p>
<p>This slide scrolls through a massive list of over 80 benchmarks.</p>
<p>Rajiv uses this to illustrate the explosion of evaluation datasets. There is a benchmark for almost everything, but this abundance can be paralyzed.</p>
</section>
<section id="multi-task-benchmarks" class="level3">
<h3 class="anchored" data-anchor-id="multi-task-benchmarks">74. Multi-task Benchmarks</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_74.png" class="img-fluid figure-img"></p>
<figcaption>Slide 74</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2342s">Timestamp: 39:02</a>)</p>
<p>This slide explains that <strong>Multi-task benchmarks</strong> aggregate many specific tasks (stories, code, legal) into a single score.</p>
<p>This allows for a robust, high-level view of a model’s general capability, though it risks hiding specific weaknesses.</p>
</section>
<section id="gaming-benchmarks" class="level3">
<h3 class="anchored" data-anchor-id="gaming-benchmarks">75. Gaming Benchmarks</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_75.png" class="img-fluid figure-img"></p>
<figcaption>Slide 75</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2376s">Timestamp: 39:36</a>)</p>
<p>This slide discusses <strong>Gaming</strong> and <strong>Data Contamination</strong>. It mentions <strong>AlpacaEval</strong> and how models might cheat by training on the test data.</p>
<p>Rajiv warns that high benchmark scores might just mean the model has memorized the answers, making the benchmark useless for measuring true generalization.</p>
</section>
<section id="hands-on-langtest" class="level3">
<h3 class="anchored" data-anchor-id="hands-on-langtest">76. Hands on: Langtest</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_76.png" class="img-fluid figure-img"></p>
<figcaption>Slide 76</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2421s">Timestamp: 40:21</a>)</p>
<p>This slide introduces <strong>Langtest</strong> by John Snow Labs. It is a library with 50+ test types for accuracy, bias, and robustness.</p>
<p>Rajiv recommends it as a tool for running standard benchmarks on your own models.</p>
</section>
<section id="hands-on-eleuther-harness" class="level3">
<h3 class="anchored" data-anchor-id="hands-on-eleuther-harness">77. Hands on: Eleuther Harness</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_77.png" class="img-fluid figure-img"></p>
<figcaption>Slide 77</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2440s">Timestamp: 40:40</a>)</p>
<p>This slide introduces the <strong>Eleuther AI Evaluation Harness</strong>. Rajiv calls this the “OG” (original gangster) framework. It supports over 200 tasks.</p>
<p>He provides a code snippet showing how easy it is to run a benchmark like MMLU on a Hugging Face model using this harness.</p>
</section>
<section id="openai-evals" class="level3">
<h3 class="anchored" data-anchor-id="openai-evals">78. OpenAI Evals</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_78.png" class="img-fluid figure-img"></p>
<figcaption>Slide 78</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2480s">Timestamp: 41:20</a>)</p>
<p>This slide mentions <strong>OpenAI Evals</strong>, another framework for evaluating LLMs.</p>
<p>Rajiv notes it is useful but emphasizes that standardized templates work best when content variation is low.</p>
</section>
<section id="benchmarking-test-suites-summary" class="level3">
<h3 class="anchored" data-anchor-id="benchmarking-test-suites-summary">79. Benchmarking Test Suites Summary</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_79.png" class="img-fluid figure-img"></p>
<figcaption>Slide 79</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2489s">Timestamp: 41:29</a>)</p>
<p>This slide summarizes the Pros and Cons of benchmarks. * <strong>Pros:</strong> Wide coverage, cheap, automated. * <strong>Cons:</strong> Limited to easily measured tasks (often multiple choice), risk of leakage.</p>
<p>Rajiv reminds us that benchmarks are proxies for quality, not definitive proof of utility for a specific business case.</p>
</section>
<section id="so-many-leaderboards" class="level3">
<h3 class="anchored" data-anchor-id="so-many-leaderboards">80. So Many Leaderboards</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_80.png" class="img-fluid figure-img"></p>
<figcaption>Slide 80</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2527s">Timestamp: 42:07</a>)</p>
<p>This slide visualizes the ecosystem of leaderboards: Open LLM, Mosaic Eval Gauntlet, HELM.</p>
</section>
<section id="pro-tip-build-your-own-benchmark" class="level3">
<h3 class="anchored" data-anchor-id="pro-tip-build-your-own-benchmark">81. Pro Tip: Build Your Own Benchmark</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_81.png" class="img-fluid figure-img"></p>
<figcaption>Slide 81</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2538s">Timestamp: 42:18</a>)</p>
<p>This is a key takeaway: <strong>“Build your own benchmark / leaderboards.”</strong></p>
<p>Rajiv argues that for an enterprise, public leaderboards are insufficient. You should curate a set of tasks that reflect <em>your</em> specific domain (e.g., legal, IT ops) and evaluate models against that.</p>
</section>
<section id="custom-leaderboard-example" class="level3">
<h3 class="anchored" data-anchor-id="custom-leaderboard-example">82. Custom Leaderboard Example</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_82.png" class="img-fluid figure-img"></p>
<figcaption>Slide 82</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2611s">Timestamp: 43:31</a>)</p>
<p>This slide shows an example of a custom internal leaderboard (“AtmosBank”). It tracks how different models perform on the specific datasets that matter to that organization.</p>
<p>This allows a company to quickly vet new models (like a new Llama release) against their specific needs.</p>
</section>
<section id="benchmark-dataset-owl" class="level3">
<h3 class="anchored" data-anchor-id="benchmark-dataset-owl">83. Benchmark Dataset: OWL</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_83.png" class="img-fluid figure-img"></p>
<figcaption>Slide 83</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2622s">Timestamp: 43:42</a>)</p>
<p>This slide details <strong>OWL</strong>, a benchmark for IT Operations. It highlights the effort required to build it: manual review of hundreds of questions.</p>
<p>Rajiv uses this to be realistic: building a custom benchmark has a <strong>cost</strong>. You need to invest human time to create the “Gold Standard” questions and answers.</p>
</section>
<section id="averaging-can-mask-issues" class="level3">
<h3 class="anchored" data-anchor-id="averaging-can-mask-issues">84. Averaging Can Mask Issues</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_84.png" class="img-fluid figure-img"></p>
<figcaption>Slide 84</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2679s">Timestamp: 44:39</a>)</p>
<p>This slide warns that <strong>“Averaging can mask issues.”</strong> If Model 2 is amazing at your specific task but terrible at 9 others, an average score will hide its value.</p>
<p>Rajiv advises looking at individual task scores rather than just the aggregate number on a leaderboard.</p>
</section>
<section id="human-evaluation" class="level3">
<h3 class="anchored" data-anchor-id="human-evaluation">85. Human Evaluation</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_85.png" class="img-fluid figure-img"></p>
<figcaption>Slide 85</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2713s">Timestamp: 45:13</a>)</p>
<p>This slide highlights <strong>Human Evaluation</strong> on the chart.</p>
<p>Rajiv moves to the high-cost, high-flexibility zone. Humans are the ultimate judges of quality, capturing nuance that automated metrics miss.</p>
</section>
<section id="human-evaluation---best-practices" class="level3">
<h3 class="anchored" data-anchor-id="human-evaluation---best-practices">86. Human Evaluation - Best Practices</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_86.png" class="img-fluid figure-img"></p>
<figcaption>Slide 86</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2738s">Timestamp: 45:38</a>)</p>
<p>This slide lists best practices: <strong>Inter-annotator agreement</strong>, clear guidelines, and training.</p>
<p>Rajiv notes that we know how to do this from traditional data labeling. If humans can’t agree on the quality of an output (e.g., only 80% agreement), you can’t expect the model to do better.</p>
</section>
<section id="human-evaluation---limitations" class="level3">
<h3 class="anchored" data-anchor-id="human-evaluation---limitations">87. Human Evaluation - Limitations</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_87.png" class="img-fluid figure-img"></p>
<figcaption>Slide 87</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2785s">Timestamp: 46:25</a>)</p>
<p>This slide discusses limitations. Humans are bad at checking <strong>factuality</strong> (it takes effort to Google facts) and are easily swayed by <strong>assertiveness</strong>.</p>
<p>If an LLM sounds confident, humans tend to rate it highly even if it is wrong.</p>
</section>
<section id="sycophancy-bias" class="level3">
<h3 class="anchored" data-anchor-id="sycophancy-bias">88. Sycophancy Bias</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_88.png" class="img-fluid figure-img"></p>
<figcaption>Slide 88</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2803s">Timestamp: 46:43</a>)</p>
<p>This slide defines <strong>Sycophancy</strong>: LLMs tend to generate responses that please the user rather than telling the truth.</p>
<p>Rajiv shows an example where a model reinforces a user’s misconception because it wants to be “helpful.” Humans often rate these pleasing answers higher, reinforcing the bias.</p>
</section>
<section id="human-evaluation-summary" class="level3">
<h3 class="anchored" data-anchor-id="human-evaluation-summary">89. Human Evaluation Summary</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_89.png" class="img-fluid figure-img"></p>
<figcaption>Slide 89</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2823s">Timestamp: 47:03</a>)</p>
<p>This slide summarizes Human Eval. * <strong>Strengths:</strong> Gold standard, handles variety. * <strong>Weaknesses:</strong> Expensive, slow, high variance, subject to bias.</p>
</section>
<section id="hands-on-argilla" class="level3">
<h3 class="anchored" data-anchor-id="hands-on-argilla">90. Hands on: Argilla</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_90.png" class="img-fluid figure-img"></p>
<figcaption>Slide 90</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2877s">Timestamp: 47:57</a>)</p>
<p>This slide showcases <strong>Argilla</strong>, an open-source tool for data annotation.</p>
<p>Rajiv encourages teams to set up tools like this to make it easy for domain experts (doctors, lawyers) to provide feedback on model outputs.</p>
</section>
<section id="annotation-tools" class="level3">
<h3 class="anchored" data-anchor-id="annotation-tools">91. Annotation Tools</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_91.png" class="img-fluid figure-img"></p>
<figcaption>Slide 91</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2900s">Timestamp: 48:20</a>)</p>
<p>This slide lists other tools: <strong>LabelStudio</strong> and <strong>Prodigy</strong>. The message is: don’t reinvent the wheel, use existing tooling to gather human feedback.</p>
</section>
<section id="longeval" class="level3">
<h3 class="anchored" data-anchor-id="longeval">92. LongEval</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_92.png" class="img-fluid figure-img"></p>
<figcaption>Slide 92</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2919s">Timestamp: 48:39</a>)</p>
<p>This slide references <strong>LongEval</strong>, a study on evaluating long summaries. It emphasizes that guidelines for humans need to be specific (coarse vs fine-grained) to get reliable results.</p>
</section>
<section id="human-comparisonarena" class="level3">
<h3 class="anchored" data-anchor-id="human-comparisonarena">93. Human Comparison/Arena</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_93.png" class="img-fluid figure-img"></p>
<figcaption>Slide 93</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2944s">Timestamp: 49:04</a>)</p>
<p>This slide highlights <strong>Human Comparison/Arena</strong> on the chart.</p>
<p>This is a specific subset of human evaluation focused on <em>preferences</em> rather than absolute scoring.</p>
</section>
<section id="story-dating-preferences" class="level3">
<h3 class="anchored" data-anchor-id="story-dating-preferences">94. Story: Dating (Preferences)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_94.png" class="img-fluid figure-img"></p>
<figcaption>Slide 94</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=2966s">Timestamp: 49:26</a>)</p>
<p>This slide uses a dating analogy. Old dating sites used long forms (detailed evaluation), but modern apps use swiping (binary preference).</p>
<p>Rajiv argues that it is much easier and faster for humans to say “I prefer A over B” (swiping) than to fill out a detailed scorecard. This is the logic behind Arena evaluations.</p>
</section>
<section id="head-to-head-preferences" class="level3">
<h3 class="anchored" data-anchor-id="head-to-head-preferences">95. Head to Head Preferences</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_95.png" class="img-fluid figure-img"></p>
<figcaption>Slide 95</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3016s">Timestamp: 50:16</a>)</p>
<p>This slide shows a “Head to Head” interface. The user sees two model outputs and clicks the one they like better.</p>
<p>This method is widely used (e.g., in RLHF) because it scales well and reduces cognitive load on annotators.</p>
</section>
<section id="head-to-head-leaderboards" class="level3">
<h3 class="anchored" data-anchor-id="head-to-head-leaderboards">96. Head to Head Leaderboards</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_96.png" class="img-fluid figure-img"></p>
<figcaption>Slide 96</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3050s">Timestamp: 50:50</a>)</p>
<p>This slide introduces the <strong>LM-SYS Arena</strong>. It uses an <strong>Elo rating system</strong> (like in Chess) based on thousands of anonymous battles between models.</p>
<p>Rajiv notes this is a very effective way to rank models based on general human preference.</p>
</section>
<section id="arena-solutions" class="level3">
<h3 class="anchored" data-anchor-id="arena-solutions">97. Arena Solutions</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_97.png" class="img-fluid figure-img"></p>
<figcaption>Slide 97</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3104s">Timestamp: 51:44</a>)</p>
<p>This slide provides links to the code for the LM-SYS arena. Rajiv suggests that enterprises can set up their own internal arenas to gamify evaluation for their employees.</p>
</section>
<section id="model-based-approaches" class="level3">
<h3 class="anchored" data-anchor-id="model-based-approaches">98. Model Based Approaches</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_98.png" class="img-fluid figure-img"></p>
<figcaption>Slide 98</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3115s">Timestamp: 51:55</a>)</p>
<p>This slide highlights <strong>Model based Approaches</strong> on the chart.</p>
<p>This is the most rapidly evolving area: using <strong>LLMs to evaluate other LLMs</strong> (LLM-as-a-Judge).</p>
</section>
<section id="evaluating-factuality" class="level3">
<h3 class="anchored" data-anchor-id="evaluating-factuality">99. Evaluating Factuality</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_99.png" class="img-fluid figure-img"></p>
<figcaption>Slide 99</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3144s">Timestamp: 52:24</a>)</p>
<p>This slide discusses the limitation of reference-based factuality (comparing to a known ground truth). It notes that this is “Pretty limited utility” because we often don’t have ground truth for every new query.</p>
</section>
<section id="model-based-evaluation" class="level3">
<h3 class="anchored" data-anchor-id="model-based-evaluation">100. Model Based Evaluation</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_100.png" class="img-fluid figure-img"></p>
<figcaption>Slide 100</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3174s">Timestamp: 52:54</a>)</p>
<p>This slide illustrates the core concept: Instead of a human checking if the story is grammatical, we ask GPT-3 (or GPT-4) to do it.</p>
<p>Rajiv explains that models are now good enough to act as proxy evaluators.</p>
</section>
<section id="assertions" class="level3">
<h3 class="anchored" data-anchor-id="assertions">101. Assertions</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_101.png" class="img-fluid figure-img"></p>
<figcaption>Slide 101</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3192s">Timestamp: 53:12</a>)</p>
<p>This slide lists simple model-based checks called <strong>Assertions</strong>: Language Match, Sentiment, Toxicity, Length.</p>
<p>These act like unit tests but use the LLM to classify the output (e.g., “Is this text toxic? Yes/No”).</p>
</section>
<section id="g-eval" class="level3">
<h3 class="anchored" data-anchor-id="g-eval">102. G-Eval</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_102.png" class="img-fluid figure-img"></p>
<figcaption>Slide 102</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3307s">Timestamp: 55:07</a>)</p>
<p>This slide introduces <strong>G-Eval</strong>, a framework that uses Chain-of-Thought (CoT) to generate a score. It provides the model with evaluation criteria and steps, asking it to reason before assigning a grade.</p>
</section>
<section id="selfcheckgpt" class="level3">
<h3 class="anchored" data-anchor-id="selfcheckgpt">103. SelfCheckGPT</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_103.png" class="img-fluid figure-img"></p>
<figcaption>Slide 103</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3326s">Timestamp: 55:26</a>)</p>
<p>This slide describes <strong>SelfCheckGPT</strong>. This method detects hallucinations by sampling the model multiple times. If the model tells the same story consistently, it’s likely true. If the details change every time, it’s likely hallucinating.</p>
</section>
<section id="which-model-for-evaluation" class="level3">
<h3 class="anchored" data-anchor-id="which-model-for-evaluation">104. Which Model for Evaluation?</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_104.png" class="img-fluid figure-img"></p>
<figcaption>Slide 104</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3350s">Timestamp: 55:50</a>)</p>
<p>This slide asks which model to use as the judge. * <strong>GPT-4:</strong> Strongest evaluator, best for reasoning. * <strong>GPT-3.5:</strong> Cheaper, good for simple tasks. * <strong>JudgeLM:</strong> Fine-tuned specifically for evaluation.</p>
</section>
<section id="human-alignment" class="level3">
<h3 class="anchored" data-anchor-id="human-alignment">105. Human Alignment</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_105.png" class="img-fluid figure-img"></p>
<figcaption>Slide 105</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3404s">Timestamp: 56:44</a>)</p>
<p>This slide presents data showing high <strong>Human Alignment</strong>. GPT-4 judges agree with human judges 80-95% of the time.</p>
<p>This validates the approach: LLM judges are a scalable, cheap proxy for human evaluation.</p>
</section>
<section id="model-evaluation-biases" class="level3">
<h3 class="anchored" data-anchor-id="model-evaluation-biases">106. Model Evaluation Biases</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_106.png" class="img-fluid figure-img"></p>
<figcaption>Slide 106</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3452s">Timestamp: 57:32</a>)</p>
<p>This slide warns about biases in LLM judges: * <strong>Position Bias:</strong> Preferring the first answer. * <strong>Verbosity Bias:</strong> Preferring longer answers. * <strong>Self-Enhancement:</strong> Preferring its own outputs.</p>
<p>Rajiv suggests mitigations like swapping order and using different models for judging.</p>
</section>
<section id="summary-model-based-evaluation" class="level3">
<h3 class="anchored" data-anchor-id="summary-model-based-evaluation">107. Summary: Model Based Evaluation</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_107.png" class="img-fluid figure-img"></p>
<figcaption>Slide 107</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3511s">Timestamp: 58:31</a>)</p>
<p>This slide categorizes model-based methods: <strong>Assertions</strong> (simple), <strong>Concept based</strong> (G-Eval), <strong>Sampling based</strong> (SelfCheck), and <strong>Preference based</strong> (RLHF).</p>
</section>
<section id="pros-and-cons" class="level3">
<h3 class="anchored" data-anchor-id="pros-and-cons">108. Pros and Cons</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_108.png" class="img-fluid figure-img"></p>
<figcaption>Slide 108</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3557s">Timestamp: 59:17</a>)</p>
<p>This slide summarizes the trade-offs. * <strong>Pros:</strong> Cheaper/faster than humans, good alignment. * <strong>Cons:</strong> Sensitive to prompts, known biases.</p>
</section>
<section id="ragas" class="level3">
<h3 class="anchored" data-anchor-id="ragas">109. Ragas</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_109.png" class="img-fluid figure-img"></p>
<figcaption>Slide 109</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3589s">Timestamp: 59:49</a>)</p>
<p>This slide introduces <strong>Ragas</strong>, a framework specifically for evaluating RAG pipelines. It calculates a score based on <strong>Faithfulness</strong> and <strong>Relevancy</strong>.</p>
</section>
<section id="deepeval" class="level3">
<h3 class="anchored" data-anchor-id="deepeval">110. DeepEval</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_110.png" class="img-fluid figure-img"></p>
<figcaption>Slide 110</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3610s">Timestamp: 1:00:10</a>)</p>
<p>This slide mentions <strong>DeepEval</strong>, another tool that treats evaluation like unit tests for LLMs, checking for bias, toxicity, etc.</p>
</section>
<section id="hands-on-using-ragas" class="level3">
<h3 class="anchored" data-anchor-id="hands-on-using-ragas">111. Hands on: Using Ragas</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_111.png" class="img-fluid figure-img"></p>
<figcaption>Slide 111</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3619s">Timestamp: 1:00:19</a>)</p>
<p>This slide shows code for using Ragas. It demonstrates how to pass a dataset to the <code>evaluate</code> function and get metrics like <code>context_precision</code> and <code>answer_relevancy</code>.</p>
</section>
<section id="hands-on-prompts-salmonn" class="level3">
<h3 class="anchored" data-anchor-id="hands-on-prompts-salmonn">112. Hands on: Prompts (SALMONN)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_112.png" class="img-fluid figure-img"></p>
<figcaption>Slide 112</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3660s">Timestamp: 1:01:00</a>)</p>
<p>This slide shows prompts from the <strong>SALMONN</strong> paper. Rajiv includes these to show real-world examples of how researchers craft prompts to evaluate specific qualities like coherence.</p>
</section>
<section id="quality-prompt" class="level3">
<h3 class="anchored" data-anchor-id="quality-prompt">113. Quality Prompt</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_113.png" class="img-fluid figure-img"></p>
<figcaption>Slide 113</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3684s">Timestamp: 1:01:24</a>)</p>
<p>This slide shows a prompt for evaluating <strong>Data Quality</strong>. It asks the model to rate the helpfulness and relevance of text on a scale.</p>
</section>
<section id="rag-relevancy-prompt" class="level3">
<h3 class="anchored" data-anchor-id="rag-relevancy-prompt">114. RAG Relevancy Prompt</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_114.png" class="img-fluid figure-img"></p>
<figcaption>Slide 114</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3695s">Timestamp: 1:01:35</a>)</p>
<p>This slide details a <strong>“RAG RELEVANCY PROMPT TEMPLATE.”</strong> It instructs the model to compare a question and a reference text to determine if the reference contains the answer.</p>
</section>
<section id="impartial-judge-prompt" class="level3">
<h3 class="anchored" data-anchor-id="impartial-judge-prompt">115. Impartial Judge Prompt</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_115.png" class="img-fluid figure-img"></p>
<figcaption>Slide 115</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3709s">Timestamp: 1:01:49</a>)</p>
<p>This slide shows a prompt for an <strong>“Impartial Judge.”</strong> It asks the model to be an assistant that evaluates the quality of a response, ensuring it is helpful, accurate, and detailed.</p>
</section>
<section id="resources-model-based-eval" class="level3">
<h3 class="anchored" data-anchor-id="resources-model-based-eval">116. Resources: Model Based Eval</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_116.png" class="img-fluid figure-img"></p>
<figcaption>Slide 116</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3728s">Timestamp: 1:02:08</a>)</p>
<p>This slide lists libraries: <strong>Ragas, Microsoft llm-eval, TrueLens, Guardrails</strong>.</p>
<p>Rajiv notes that while libraries are great, many people end up writing their own hand-crafted prompts to fit their specific needs.</p>
</section>
<section id="red-teaming" class="level3">
<h3 class="anchored" data-anchor-id="red-teaming">117. Red Teaming</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_117.png" class="img-fluid figure-img"></p>
<figcaption>Slide 117</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3743s">Timestamp: 1:02:23</a>)</p>
<p>This slide highlights <strong>Red Teaming</strong> on the chart.</p>
<p>This is the final, most flexible, and rigorous technical evaluation method.</p>
</section>
<section id="story-microsoft-tay" class="level3">
<h3 class="anchored" data-anchor-id="story-microsoft-tay">118. Story: Microsoft Tay</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_118.png" class="img-fluid figure-img"></p>
<figcaption>Slide 118</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3757s">Timestamp: 1:02:37</a>)</p>
<p>This slide tells the cautionary tale of <strong>Microsoft Tay (2016)</strong>. The chatbot learned from Twitter users and became racist/genocidal in less than 24 hours.</p>
<p>Rajiv cites this as the “Origin of Red Teaming in AI”—the realization that we must proactively attack our models to find vulnerabilities before the public does.</p>
</section>
<section id="why-red-teaming" class="level3">
<h3 class="anchored" data-anchor-id="why-red-teaming">119. Why Red Teaming?</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_119.png" class="img-fluid figure-img"></p>
<figcaption>Slide 119</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3851s">Timestamp: 1:04:11</a>)</p>
<p>This slide defines Red Teaming: <strong>Eliciting model vulnerabilities to prevent undesirable behaviors.</strong></p>
<p>It is about adversarial testing—trying to trick the model into doing something bad.</p>
</section>
<section id="every-use-case-should-be-red-teamed" class="level3">
<h3 class="anchored" data-anchor-id="every-use-case-should-be-red-teamed">120. Every Use Case Should Be Red Teamed</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_120.png" class="img-fluid figure-img"></p>
<figcaption>Slide 120</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3863s">Timestamp: 1:04:23</a>)</p>
<p>This slide argues that <strong>“Every use case should be Red Teamed.”</strong></p>
<p>Rajiv explains that fine-tuning a model (even slightly) can destroy the safety alignment (RLHF) provided by the base model creator. You cannot assume a model is safe just because it was safe before you fine-tuned it.</p>
</section>
<section id="how-to-red-teaming-with-a-model" class="level3">
<h3 class="anchored" data-anchor-id="how-to-red-teaming-with-a-model">121. How to: Red Teaming with a Model</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_121.png" class="img-fluid figure-img"></p>
<figcaption>Slide 121</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3907s">Timestamp: 1:05:07</a>)</p>
<p>This slide suggests a technique: Use a separate “Risk Assessment” model (like Llama-2) to monitor the inputs and outputs of your main model, logging any risky queries.</p>
</section>
<section id="how-to-red-teaming-from-meta" class="level3">
<h3 class="anchored" data-anchor-id="how-to-red-teaming-from-meta">122. How to: Red Teaming from Meta</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_122.png" class="img-fluid figure-img"></p>
<figcaption>Slide 122</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3922s">Timestamp: 1:05:22</a>)</p>
<p>This slide describes <strong>Meta’s approach</strong> to Llama 2. They hired diverse teams to attack the model regarding specific risks (criminal planning, trafficking).</p>
<p>Rajiv notes that Meta actually held back a specific model (33b) because it failed these red team tests.</p>
</section>
<section id="red-teaming-process" class="level3">
<h3 class="anchored" data-anchor-id="red-teaming-process">123. Red Teaming Process</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_123.png" class="img-fluid figure-img"></p>
<figcaption>Slide 123</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3954s">Timestamp: 1:05:54</a>)</p>
<p>This slide outlines the workflow: Generate prompts (multilingual), Annotate risk (Likert scale), and use data for safety training.</p>
</section>
<section id="technical-methods-recap" class="level3">
<h3 class="anchored" data-anchor-id="technical-methods-recap">124. Technical Methods Recap</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_124.png" class="img-fluid figure-img"></p>
<figcaption>Slide 124</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3964s">Timestamp: 1:06:04</a>)</p>
<p>This slide shows the full <strong>Generative AI Evaluation Methods</strong> chart again.</p>
<p>Rajiv concludes the technical section, having covered the spectrum from Exact Match to Red Teaming.</p>
</section>
<section id="operational-tco" class="level3">
<h3 class="anchored" data-anchor-id="operational-tco">125. Operational (TCO)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_125.png" class="img-fluid figure-img"></p>
<figcaption>Slide 125</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3976s">Timestamp: 1:06:16</a>)</p>
<p>This slide highlights the <strong>Operational (TCO)</strong> pillar.</p>
<p>Rajiv shifts gears to discuss the cost and maintenance of running these models.</p>
</section>
<section id="story-github-copilot-costs" class="level3">
<h3 class="anchored" data-anchor-id="story-github-copilot-costs">126. Story: GitHub Copilot Costs</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_126.png" class="img-fluid figure-img"></p>
<figcaption>Slide 126</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=3993s">Timestamp: 1:06:33</a>)</p>
<p>This slide references a story that <strong>GitHub Copilot</strong> was losing money per user (costing $20-$80/month while charging $10).</p>
<p>Rajiv uses this to warn about the “Epidemic of cloud laundering.” You must calculate the inference costs upfront, or your successful product might bankrupt you.</p>
</section>
<section id="monitoring" class="level3">
<h3 class="anchored" data-anchor-id="monitoring">127. Monitoring</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_127.png" class="img-fluid figure-img"></p>
<figcaption>Slide 127</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4048s">Timestamp: 1:07:28</a>)</p>
<p>This slide introduces <strong>Monitoring</strong> as the “Sibling of Evaluate.”</p>
<p>It lists things to watch: Functional metrics (latency, errors), Prompt Drift, and Response Monitoring.</p>
</section>
<section id="monitoring-metrics-gpuresponsible-ai" class="level3">
<h3 class="anchored" data-anchor-id="monitoring-metrics-gpuresponsible-ai">128. Monitoring Metrics (GPU/Responsible AI)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_128.png" class="img-fluid figure-img"></p>
<figcaption>Slide 128</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4061s">Timestamp: 1:07:41</a>)</p>
<p>This slide lists specific metrics. * <strong>GPU:</strong> Error rates (429), token counts. * <strong>Responsible AI:</strong> How often is the content filter triggering?</p>
</section>
<section id="performance-metrics" class="level3">
<h3 class="anchored" data-anchor-id="performance-metrics">129. Performance Metrics</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_129.png" class="img-fluid figure-img"></p>
<figcaption>Slide 129</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4073s">Timestamp: 1:07:53</a>)</p>
<p>This slide lists <strong>Performance Metrics</strong>: * <strong>Time to first token (TTFT):</strong> Critical for user experience. * <strong>Requests Per Second (RPS).</strong> * <strong>Token render rate.</strong></p>
</section>
<section id="user-engagement-funnel" class="level3">
<h3 class="anchored" data-anchor-id="user-engagement-funnel">130. User Engagement Funnel</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_130.png" class="img-fluid figure-img"></p>
<figcaption>Slide 130</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4081s">Timestamp: 1:08:01</a>)</p>
<p>This slide suggests monitoring <strong>User Engagement</strong>. * Funnel: Trigger -&gt; Response -&gt; User Keeps/Accepts Response.</p>
<p>Rajiv notes that OpenAI monitors the <strong>KV Cache</strong> utilization to understand real usage patterns better than simple GPU utilization.</p>
</section>
<section id="application-to-rag-1" class="level3">
<h3 class="anchored" data-anchor-id="application-to-rag-1">131. Application to RAG</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_131.png" class="img-fluid figure-img"></p>
<figcaption>Slide 131</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4154s">Timestamp: 1:09:14</a>)</p>
<p>This slide acts as a section header: <strong>APPLICATION TO RAG</strong>.</p>
<p>Rajiv will now apply all the previous concepts to a specific use case: Retrieval Augmented Generation.</p>
</section>
<section id="bring-your-own-facts" class="level3">
<h3 class="anchored" data-anchor-id="bring-your-own-facts">132. Bring Your Own Facts</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_132.png" class="img-fluid figure-img"></p>
<figcaption>Slide 132</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4166s">Timestamp: 1:09:26</a>)</p>
<p>This slide explains the core philosophy of RAG: <strong>“If you need facts - bring them yourself.”</strong> Don’t rely on the LLM’s training data; provide the context.</p>
</section>
<section id="what-is-rag" class="level3">
<h3 class="anchored" data-anchor-id="what-is-rag">133. What is RAG?</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_133.png" class="img-fluid figure-img"></p>
<figcaption>Slide 133</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4173s">Timestamp: 1:09:33</a>)</p>
<p>This slide defines RAG: Improving responses by grounding the model on external knowledge sources.</p>
</section>
<section id="evaluating-rag-the-wrong-way" class="level3">
<h3 class="anchored" data-anchor-id="evaluating-rag-the-wrong-way">134. Evaluating RAG (The Wrong Way)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_134.png" class="img-fluid figure-img"></p>
<figcaption>Slide 134</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4191s">Timestamp: 1:09:51</a>)</p>
<p>This slide shows a “recipe” for RAG evaluation focusing solely on factuality precision (95%).</p>
<p>Rajiv presents this as a <strong>trap</strong>. He asks, “What’s wrong with this?”</p>
</section>
<section id="missing-the-point" class="level3">
<h3 class="anchored" data-anchor-id="missing-the-point">135. Missing the Point</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_135.png" class="img-fluid figure-img"></p>
<figcaption>Slide 135</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4207s">Timestamp: 1:10:07</a>)</p>
<p>This slide explicitly states that focusing only on technical details misses the larger point of view.</p>
<p>Rajiv is baiting the audience to remember the <strong>Three Pillars</strong>.</p>
</section>
<section id="three-pillars-rag-context" class="level3">
<h3 class="anchored" data-anchor-id="three-pillars-rag-context">136. Three Pillars (RAG Context)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_136.png" class="img-fluid figure-img"></p>
<figcaption>Slide 136</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4220s">Timestamp: 1:10:20</a>)</p>
<p>This slide brings back the <strong>Technical, Business, Operational</strong> pillars.</p>
<p>Rajiv insists we must start with the Business metrics before jumping into technical precision.</p>
</section>
<section id="business-metric-for-rag" class="level3">
<h3 class="anchored" data-anchor-id="business-metric-for-rag">137. Business Metric for RAG</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_137.png" class="img-fluid figure-img"></p>
<figcaption>Slide 137</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4228s">Timestamp: 1:10:28</a>)</p>
<p>This slide outlines the <strong>Business questions</strong>: * What is the value of a correct answer? * What is the <strong>cost/consequence</strong> of a wrong answer?</p>
<p>Rajiv warns against building a “science experiment” without knowing the ROI.</p>
</section>
<section id="operational-metrics-for-rag" class="level3">
<h3 class="anchored" data-anchor-id="operational-metrics-for-rag">138. Operational Metrics for RAG</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_138.png" class="img-fluid figure-img"></p>
<figcaption>Slide 138</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4258s">Timestamp: 1:10:58</a>)</p>
<p>This slide lists <strong>Operational questions</strong>: * Labeling effort? * Running costs? * Is IT ready to support this?</p>
</section>
<section id="three-pillars-transition" class="level3">
<h3 class="anchored" data-anchor-id="three-pillars-transition">139. Three Pillars (Transition)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_139.png" class="img-fluid figure-img"></p>
<figcaption>Slide 139</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4303s">Timestamp: 1:11:43</a>)</p>
<p>This slide shows the three pillars again, preparing to zoom in on the Technical side.</p>
</section>
<section id="technical-pillar" class="level3">
<h3 class="anchored" data-anchor-id="technical-pillar">140. Technical Pillar</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_140.png" class="img-fluid figure-img"></p>
<figcaption>Slide 140</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4305s">Timestamp: 1:11:45</a>)</p>
<p>This slide highlights <strong>Technical (F1)</strong>. Now that we’ve justified the business case, how do we technically evaluate RAG?</p>
</section>
<section id="current-approaches-eyeballing" class="level3">
<h3 class="anchored" data-anchor-id="current-approaches-eyeballing">141. Current Approaches (Eyeballing)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_141.png" class="img-fluid figure-img"></p>
<figcaption>Slide 141</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4311s">Timestamp: 1:11:51</a>)</p>
<p>This slide critiques the current state: <strong>“Eyeballing a few examples.”</strong></p>
<p>Rajiv notes that most developers just look at a few chats and say “looks good.” This is insufficient for production.</p>
</section>
<section id="evaluate-llm-system" class="level3">
<h3 class="anchored" data-anchor-id="evaluate-llm-system">142. Evaluate LLM System</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_142.png" class="img-fluid figure-img"></p>
<figcaption>Slide 142</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4334s">Timestamp: 1:12:14</a>)</p>
<p>This slide lists system-level questions: Accuracy, references, understandability, query time.</p>
</section>
<section id="decomposing-rag" class="level3">
<h3 class="anchored" data-anchor-id="decomposing-rag">143. Decomposing RAG</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_143.png" class="img-fluid figure-img"></p>
<figcaption>Slide 143</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4357s">Timestamp: 1:12:37</a>)</p>
<p>This slide brings back the RAG diagram, emphasizing <strong>decomposition</strong>. 1. Retrieval 2. Augmented Generation</p>
<p>Rajiv argues we must evaluate these independently to find the bottleneck. Often, the problem is the <strong>Retriever</strong>, not the LLM.</p>
</section>
<section id="component-metrics" class="level3">
<h3 class="anchored" data-anchor-id="component-metrics">144. Component Metrics</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_144.png" class="img-fluid figure-img"></p>
<figcaption>Slide 144</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4373s">Timestamp: 1:12:53</a>)</p>
<p>This slide details metrics for each component: * <strong>Retrieval:</strong> Precision, Recall, Order. * <strong>Augmentation:</strong> Correctness, Toxicity, Hallucination.</p>
</section>
<section id="analyze-retrieval" class="level3">
<h3 class="anchored" data-anchor-id="analyze-retrieval">145. Analyze Retrieval</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_145.png" class="img-fluid figure-img"></p>
<figcaption>Slide 145</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4433s">Timestamp: 1:13:53</a>)</p>
<p>This slide explains how to evaluate retrieval. You need a dataset of <strong>(Query, Relevant Documents)</strong>.</p>
<p>You run your retriever and check if it found the documents in your ground truth set.</p>
</section>
<section id="methods-for-retrieval" class="level3">
<h3 class="anchored" data-anchor-id="methods-for-retrieval">146. Methods for Retrieval</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_146.png" class="img-fluid figure-img"></p>
<figcaption>Slide 146</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4453s">Timestamp: 1:14:13</a>)</p>
<p>This slide highlights <strong>Exact Matching</strong> on the chart.</p>
<p>For retrieval, we can use exact matching (or set intersection) because we know exactly which document IDs should be returned.</p>
</section>
<section id="retrieval-metrics" class="level3">
<h3 class="anchored" data-anchor-id="retrieval-metrics">147. Retrieval Metrics</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_147.png" class="img-fluid figure-img"></p>
<figcaption>Slide 147</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4456s">Timestamp: 1:14:16</a>)</p>
<p>This slide lists retrieval metrics: <strong>Success rate (Hit-rate)</strong> and <strong>Mean Reciprocal Rank (MRR)</strong>.</p>
</section>
<section id="analyze-augmentation" class="level3">
<h3 class="anchored" data-anchor-id="analyze-augmentation">148. Analyze Augmentation</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_148.png" class="img-fluid figure-img"></p>
<figcaption>Slide 148</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4488s">Timestamp: 1:14:48</a>)</p>
<p>This slide explains how to evaluate the generation step. You need <strong>(Context, Generated Response, Ground Truth)</strong>.</p>
</section>
<section id="methods-for-augmentation" class="level3">
<h3 class="anchored" data-anchor-id="methods-for-augmentation">149. Methods for Augmentation</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_149.png" class="img-fluid figure-img"></p>
<figcaption>Slide 149</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4502s">Timestamp: 1:15:02</a>)</p>
<p>This slide highlights <strong>Human</strong> and <strong>Model-based</strong> approaches on the chart.</p>
<p>For generation, exact match doesn’t work. We need flexible evaluators (Humans or LLMs) to judge faithfulness and relevancy.</p>
</section>
<section id="augmentation-modules" class="level3">
<h3 class="anchored" data-anchor-id="augmentation-modules">150. Augmentation Modules</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_150.png" class="img-fluid figure-img"></p>
<figcaption>Slide 150</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4505s">Timestamp: 1:15:05</a>)</p>
<p>This slide lists modules to test: * <strong>Label-free:</strong> Faithfulness (did it stick to context?), Relevancy. * <strong>With-labels:</strong> Correctness (compared to ground truth).</p>
</section>
<section id="pro-tip-imbalance" class="level3">
<h3 class="anchored" data-anchor-id="pro-tip-imbalance">151. Pro Tip: Imbalance</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_151.png" class="img-fluid figure-img"></p>
<figcaption>Slide 151</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4528s">Timestamp: 1:15:28</a>)</p>
<p>This slide warns about <strong>Imbalanced Data</strong>. If most retrieved documents are irrelevant, accuracy is a bad metric. Use <strong>Precision and Recall</strong>.</p>
</section>
<section id="pro-tip-synthetic-data" class="level3">
<h3 class="anchored" data-anchor-id="pro-tip-synthetic-data">152. Pro Tip: Synthetic Data</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_152.png" class="img-fluid figure-img"></p>
<figcaption>Slide 152</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4535s">Timestamp: 1:15:35</a>)</p>
<p>This slide suggests generating <strong>Synthetic Evaluation Datasets</strong>.</p>
<p>You can use an LLM to read your documents and generate Question/Answer pairs. This creates a “Gold Standard” dataset for retrieval evaluation without manual labeling.</p>
</section>
<section id="notebooks-used" class="level3">
<h3 class="anchored" data-anchor-id="notebooks-used">153. Notebooks Used</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_153.png" class="img-fluid figure-img"></p>
<figcaption>Slide 153</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4574s">Timestamp: 1:16:14</a>)</p>
<p>This slide lists the notebooks available in the GitHub repo: Prompting, Guidance, Eleuther Harness, Langtest, Ragas.</p>
</section>
<section id="final-slide" class="level3">
<h3 class="anchored" data-anchor-id="final-slide">154. Final Slide</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_154.png" class="img-fluid figure-img"></p>
<figcaption>Slide 154</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/iQl03pQlYWY&amp;t=4592s">Timestamp: 1:16:32</a>)</p>
<p>The presentation concludes with the title slide again, providing the speaker’s contact info and the GitHub link one last time. Rajiv thanks the audience and promises updates as the field evolves.</p>
<hr>
<p><em>This annotated presentation was generated from the talk using AI-assisted tools. Each slide includes timestamps and detailed explanations.</em></p>


</section>
</section>

 ]]></description>
  <category>LLM</category>
  <category>Evaluation</category>
  <category>Generative AI</category>
  <category>Testing</category>
  <category>Annotated Talk</category>
  <guid>https://rajivshah.com/blog/evaluating-llms-deep-dive.html</guid>
  <pubDate>Wed, 15 Nov 2023 06:00:00 GMT</pubDate>
  <media:content url="https://rajivshah.com/blog/images/evaluating-llms-deep-dive/slide_1.png" medium="image" type="image/png"/>
</item>
<item>
  <title>Reasoning in Large Language Models</title>
  <link>https://rajivshah.com/blog/HF-Reasoning.html</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/hf/r-title.png" class="img-fluid figure-img"></p>
<figcaption>Reasoning</figcaption>
</figure>
</div>
<section id="introduction" class="level3">
<h3 class="anchored" data-anchor-id="introduction">Introduction</h3>
<p>I was wowed by ChatGPT. While I understood tasks like text generation and summarization, something was different with ChatGPT. When I looked at the literature, I saw this work exploring reasoning. Models reasoning, c’mon. As a very skeptical data scientist, that seemed far-fetched to me. But I had to explore.</p>
<p>I came upon the <a href="https://github.com/google/BIG-bench">Big Bench Benchmark</a>, composed of more than 200 reasoning tasks. The tasks include playing chess, describing code, guessing the perpetrator of a crime in a short story, identifying sarcasm, and even recognizing self-awareness. A common benchmark to test models is the Big Bench Hard (BBH), a subset of 23 tasks from Big Bench. Early models like OpenAI’s text-ada-00 struggle to reach a random score of 25. However, several newer models reach and surpass the average human rater score of 67.7. You can see results for these models in these publications: <a href="https://arxiv.org/pdf/2301.13688.pdf">1</a>, <a href="https://arxiv.org/pdf/2210.09261.pdf">2</a>, and <a href="https://arxiv.org/pdf/2210.11416.pdf">3</a>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/hf/BBH.png" class="img-fluid figure-img"></p>
<figcaption>Big Bench Hard (23 Tasks) (1).png</figcaption>
</figure>
</div>
<p>A <a href="https://www.cs.princeton.edu/courses/archive/fall22/cos597G/lectures/lec09.pdf">survey of the research</a> pointed out some common starting points for evaluating reasoning in models, including Arithmetic Reasoning, Symbolic Reasoning, and Commonsense Reasoning. This blog post provides examples of reasoning, but you should try out all these examples yourself. Hugging Face has a <a href="https://huggingface.co/spaces/osanseviero/i-like-flan">space where you can try</a> to test a Flan T5 model yourself.</p>
</section>
<section id="arithmetic-reasoning" class="level3">
<h3 class="anchored" data-anchor-id="arithmetic-reasoning">Arithmetic <strong>Reasoning</strong></h3>
<p>Let’s start with the following problem.</p>
<pre><code>Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
A: The answer is 5</code></pre>
<p>If you ask an older text generation model like GPT-2 to complete this, it doesn’t understand the question and instead continues to write a story like this.</p>
<p><img src="https://rajivshah.com/blog/images/hf/R-Cars-GPT2.png" alt="R-Cars-GPT2.png" style="zoom:50%;"></p>
<p>While I don’t have access to PalM - 540B parameter model in the Big Bench, I was able to work with the Flan-T5 XXL using this publicly available space. I entered the problem and got this answer!</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/hf/R-Cars-Flan.png" class="img-fluid figure-img"></p>
<figcaption>R-Cars-Flan.png</figcaption>
</figure>
</div>
<p>It solved it! I tried messing with it and changing the words, but it still answered correctly. To my untrained eye, it is trying to take the numbers and perform a calculation using the surrounding information. This is an elementary problem, but this is more sophisticated than the GPT-2 response. I next wanted to do a more challenging problem like this:</p>
<pre><code>Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?</code></pre>
<p>The model gave an answer of 8, which isn’t correct. Recent research has found using chain-of-thought prompting can improve the ability of models. This involves providing intermediate reasoning to help the model determine the answer.</p>
<pre><code>Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is </code></pre>
<p>The model correctly answers 11. To solve the juggling problem, I used this chain-of-thought prompt as an example. Giving the model some examples is known as few-shot learning. The new combined prompt using chain-of-thought and few-shot learning is:</p>
<pre><code>Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A:</code></pre>
<p>Try it, it works! Giving it an example and making it think everything through step by step was beneficial. This was fascinating for me. We don’t train the model in the sense of updating it’s weights. Instead, we are guiding it purely by the inference process.</p>
</section>
<section id="symbolic-reasoning" class="level3">
<h3 class="anchored" data-anchor-id="symbolic-reasoning"><strong>Symbolic Reasoning</strong></h3>
<p>The first symbolic reasoning was doing a reversal and the Flan-T5 worked very well on this type of problem.</p>
<pre><code>Reverse the sequence "glasses, pen, alarm, license".</code></pre>
<p>A more complex problem on coin flipping was more interesting for me.</p>
<pre><code>Q: A coin is heads up. Tom does not flip the coin. Mike does not flip the coin. Is the coin still heads up?
A:</code></pre>
<p>For this one, I played around with different combinations of people flipping and showing the coin and the model, and it answered correctly. It was following the logic that was going through.</p>
</section>
<section id="common-sense-reasoning" class="level3">
<h3 class="anchored" data-anchor-id="common-sense-reasoning"><strong>Common sense reasoning</strong></h3>
<p>The last category was common sense reasoning and much less obvious to me how models know how to solve these problems correctly.</p>
<pre><code>Q: What home entertainment equipment requires cable?
Answer Choices: (a) radio shack (b) substation (c) television (d) cabinet
A: The answer is</code></pre>
<p>I was amazed at how well the model did, even when I changed the order.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/hf/Rcommon2.gif" class="img-fluid figure-img"></p>
<figcaption>Reasongif</figcaption>
</figure>
</div>
<p>Another common reasoning example goes like this:</p>
<pre><code>Q: Can Barack Obama have a conversation with George Washington? Give the rationale before answering.</code></pre>
<p>I changed around people to someone currently living, and it still works well.</p>
</section>
<section id="thoughts" class="level3">
<h3 class="anchored" data-anchor-id="thoughts"><strong>Thoughts</strong></h3>
<p>As the first step, please, go try out these models for yourself. <a href="https://huggingface.co/google/flan-t5-xxl">Google’s Flan-T5 is available</a> with an Apache 2.0 license. Hugging Face has a <a href="https://huggingface.co/spaces/osanseviero/i-like-flan">space where you can try</a> all these reasoning examples yourself. You can also replicate this using OpenAI’s GPT or other language models. I have a <a href="https://youtu.be/teRu-ZT9XJs">short video on the reasoning</a> that also shows several examples.</p>
<p>The current language models have many known limitations. The next generation of models will likely be able to retrieve relevant information before answering. Additionally, language models will likely be able to delegate tasks to other services. You can see a demo of this integrating <a href="https://huggingface.co/spaces/JavaFXpert/Chat-GPT-LangChain">ChatGPT with Wolfram’s scientific API</a>. By letting language models offload other tasks, the role of language models will emphasize communication and reasoning.</p>
<p>The current generation of models is starting to solve some reasoning tasks and match average human raters. It also appears that performance can still keep increasing. What happens when there are a set of reasoning tasks that computers are better than humans? While plenty of academic literature highlights the limitations, the overall trajectory is clear and has extraordinary implications.</p>


</section>

 ]]></description>
  <category>LLM</category>
  <category>NLP</category>
  <guid>https://rajivshah.com/blog/HF-Reasoning.html</guid>
  <pubDate>Wed, 08 Feb 2023 06:00:00 GMT</pubDate>
  <media:content url="https://rajivshah.com/blog/images/hf/r-title.png" medium="image" type="image/png"/>
</item>
<item>
  <title>Text style transfer in a spreadsheet using Hugging Face Inference Endpoints</title>
  <link>https://rajivshah.com/blog/HF-Endpoint.html</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/hf/informal_endpoint_cover.png" class="img-fluid figure-img"></p>
<figcaption>SetFit</figcaption>
</figure>
</div>
<section id="introduction" class="level3">
<h3 class="anchored" data-anchor-id="introduction">Introduction</h3>
<p>We change our conversational style from informal to formal speech. We often do this without thinking when talking to our friends compared to addressing a judge. Computers now have this capability! I use <a href="https://blog.fastforwardlabs.com/2022/03/22/an-introduction-to-text-style-transfer.html">textual style transfer</a> in this post to convert informal text to formal text. To make this easy to use, we do it in a spreadsheet.</p>
</section>
<section id="step-1" class="level3">
<h3 class="anchored" data-anchor-id="step-1">Step 1</h3>
<p>The first step is identifying an <a href="https://huggingface.co/rajistics/informal_formal_style_transfer">informal to formal text style model</a>. Next, we deploy the model using <a href="https://ui.endpoints.huggingface.co/endpoints">Hugging Face Inference endpoints</a>. <a href="https://huggingface.co/docs/inference-endpoints/index">Inference endpoints</a> is a production-grade solution for model deployment.</p>
<p><img src="https://rajivshah.com/blog/images/hf/model-inference.png" class="img-fluid"></p>
</section>
<section id="step-2" class="level3">
<h3 class="anchored" data-anchor-id="step-2">Step 2</h3>
<p>Let’s incorporate the endpoint into Google Sheets custom function to make the model easy to use.</p>
<p><img src="https://rajivshah.com/blog/images/hf/HF_IE_examples-min.png" class="img-fluid"></p>
<p>I added the code to Google Sheets through the Apps Script extension. Grab it <a href="https://gist.github.com/rajshah4/6cde451b7f126aeaa67d89503cba5b93">here as a gist</a>. Once that is saved, you can use the new function as a formula. Now, I can use one simple command if I want to do textual style transfer!</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/hf/informal_endpoints.png" class="img-fluid figure-img"></p>
<figcaption>Alt Text</figcaption>
</figure>
</div>
</section>
<section id="resources" class="level3">
<h3 class="anchored" data-anchor-id="resources">Resources</h3>
<p>I created a Youtube 🎥 <a href="https://youtu.be/jA6VDKO7XfA">video</a> for a more detailed walkthrough.</p>
<p>Go try this out with your favorite model! For another example, check out the <a href="https://huggingface.co/RamAnanth1/positive-reframing">positive style textual model</a> in a <a href="https://www.tiktok.com/@rajistics/video/7161954065243508014">Tik Tok video</a>.</p>


</section>

 ]]></description>
  <category>NLP</category>
  <category>Huggingface</category>
  <category>Finetuning</category>
  <guid>https://rajivshah.com/blog/HF-Endpoint.html</guid>
  <pubDate>Mon, 07 Nov 2022 06:00:00 GMT</pubDate>
  <media:content url="https://rajivshah.com/blog/images/hf/informal_endpoint_cover.png" medium="image" type="image/png"/>
</item>
<item>
  <title>Few shot text classification with SetFit</title>
  <link>https://rajivshah.com/blog/setfit.html</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/setfit.png" class="img-fluid figure-img"></p>
<figcaption>SetFit</figcaption>
</figure>
</div>
<section id="introduction" class="level4">
<h4 class="anchored" data-anchor-id="introduction">Introduction</h4>
<p>Data scientists often do not have large amounts of labeled data. This issue is even graver when dealing with problems with tens or hundreds of classes. The reality is very few text classification problems get to the point where adding more labeled data isn’t improving performance.</p>
<p>SetFit offers a few-shot learning approach for text classification. The <a href="https://arxiv.org/abs/2209.11055">paper’s results</a> show across many datasets, it’s possible to get better performance with less labeled data. This technique uses contrastive learning to build a larger dataset for fine-tuning a text classification model. This approach was new to me and was why I did a video explaining how contrastive learning helps with text classification.</p>
<p>I have created a Colab 📓 companion notebook at <a href="https://bit.ly/raj_setfit">https://bit.ly/raj_setfit</a>, and the Youtube 🎥 <a href="https://youtu.be/Pg-smN4fUy0">video</a> that provides a detailed explanation. I walk through a simple churn example to give the intuition behind SetFit. The notebook trains the CR (customer review dataset) highlighted in the SetFit paper.</p>
<p>The <a href="https://github.com/huggingface/setfit">SetFit github</a> contains the code, and a great deep dive for text classification is found on <a href="https://www.philschmid.de/getting-started-setfit">Philipp’s blog</a>. For those looking to productionize a SetFit model, Philipp has also documented how to create the <a href="https://huggingface.co/philschmid/setfit-ag-news-endpoint">Hugging Face endpoint</a> for a SetFit model.</p>
<p>So grab your favorite text classification dataset and give it a try!</p>


</section>

 ]]></description>
  <category>Setfit</category>
  <category>Classification</category>
  <category>NLP</category>
  <guid>https://rajivshah.com/blog/setfit.html</guid>
  <pubDate>Thu, 27 Oct 2022 05:00:00 GMT</pubDate>
  <media:content url="https://rajivshah.com/blog/images/setfit.png" medium="image" type="image/png"/>
</item>
<item>
  <title>Getting predictions intervals with conformal inference</title>
  <link>https://rajivshah.com/blog/conformal_predictions.html</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/conformal_inference.png" class="img-fluid figure-img"></p>
<figcaption>Conformal</figcaption>
</figure>
</div>
<section id="introduction" class="level4">
<h4 class="anchored" data-anchor-id="introduction">Introduction</h4>
<p>Data scientists often overstate the certainty of their predictions. I have had engineers laugh at my point predictions and point out several types of errors in my model that create uncertainty. Prediction intervals are an excellent counterbalance for communicating the uncertainty of predictions.</p>
<p>Conformal inference offers a model agnostic technique for prediction intervals. It’s well known within statistics but not as well established in machine learning. This post focuses on a straightforward conformal inference technique, but there are more sophisticated techniques that provide more adaptable prediction intervals.</p>
<p>I have created a Colab 📓 companion notebook at <a href="https://bit.ly/raj_conf">https://bit.ly/raj_conf</a>, and the Youtube 🎥 <a href="https://youtu.be/ZUK4zR0IeLU">video</a> that provides a detailed explanation. This explanation is a toy example to learn how conformal inference works. Typical applications will use a more sophisticated methodology along with implementations found within the resources below.</p>
<p>For python folks, a great package to start using conformal inference is <a href="https://mapie.readthedocs.io/en/latest/index.html">MAPIE - Model Agnostic Prediction Interval Estimator</a>. It works for tabular and time series problems.</p>
</section>
<section id="further-resources" class="level4">
<h4 class="anchored" data-anchor-id="further-resources">Further Resources:</h4>
<p>Quick intro to conformal prediction using MAPIE in <a href="https://towardsdatascience.com/mapie-explained-exactly-how-you-wished-someone-explained-to-you-78fb8ce81ff3">medium</a></p>
<p>A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification, <a href="https://people.eecs.berkeley.edu/~angelopoulos/publications/downloads/gentle_intro_conformal_dfuq.pdf">paper link</a></p>
<p><a href="https://github.com/valeman/awesome-conformal-prediction">Awesome Conformal Prediction</a> (lots of resources)</p>


</section>

 ]]></description>
  <category>Conformal</category>
  <category>MLOps</category>
  <category>MAPIE</category>
  <guid>https://rajivshah.com/blog/conformal_predictions.html</guid>
  <pubDate>Sat, 24 Sep 2022 05:00:00 GMT</pubDate>
  <media:content url="https://rajivshah.com/blog/images/conformal_inference.png" medium="image" type="image/png"/>
</item>
<item>
  <title>Explaining predictions from 🤗 transformer models</title>
  <link>https://rajivshah.com/blog/explaining_transformers.html</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/transformers/banner.png" class="img-fluid figure-img"></p>
<figcaption>Banner</figcaption>
</figure>
</div>
<section id="introduction" class="level3">
<h3 class="anchored" data-anchor-id="introduction">Introduction</h3>
<p>This post covers 3 easy-to-use 📦 packages to get started. You can also check out the Colab 📓 companion notebook at https://bit.ly/raj_explain and the Youtube 🎥 <a href="https://youtu.be/j6WbCS0GLuY">video</a> for a deeper treatment.</p>
<p>Explanations are useful for explaining predictions. In the case of text, they highlight how the text influenced the prediction. They are helpful for 🩺 diagnosing model issues, 👀 showing stakeholders understand how a model is working, and 🧑‍⚖️ meeting regulatory requirements. Here is an explanation 👇 using shap. For more on explanations, check out the <a href="https://youtu.be/SVfrxFdJNB4">explanations in machine learning video</a>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/transformers/shap.png" class="img-fluid figure-img"></p>
<figcaption>Screen Shot 2022-08-12 at 9.25.07 AM</figcaption>
</figure>
</div>
<p>Let’s review 3 packages you can use to get explanations. All of these work with transformers, provide visualizations, and only require a few lines of code.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/transformers/code.png" class="img-fluid figure-img"></p>
<figcaption>Red and Purple Real Estate Soft Gradients Twitter Ad (1)</figcaption>
</figure>
</div>
</section>
<section id="shap" class="level3">
<h3 class="anchored" data-anchor-id="shap">Shap</h3>
<ol type="1">
<li><a href="https://github.com/slundberg/shap">SHAP</a> is a well-known, well-regarded, and robust package for explanations. In working with text, SHAP typically defers to using a Partition Shap explainer. This method makes the shap computation tractable by using hierarchical clustering and Owens values. The image here shows the clustering for a simple phrase. If you want to learn more about Shapley values, I have a <a href="https://youtu.be/DYA5SA0edb0">video on shapley values</a> and a deep dive on <a href="https://towardsdatascience.com/shaps-partition-explainer-for-language-models-ec2e7a6c1b77">Partition Shap explainer is here</a>.</li>
</ol>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/transformers/cluster.png" class="img-fluid figure-img"></p>
<figcaption>Screen Shot 2022-08-12 at 9.35.34 AM</figcaption>
</figure>
</div>
</section>
<section id="transformers-interpret" class="level3">
<h3 class="anchored" data-anchor-id="transformers-interpret">Transformers Interpret</h3>
<ol start="2" type="1">
<li><a href="https://github.com/cdpierse/transformers-interpret">Transformers Interpret</a> uses Integrated Gradients from <a href="https://captum.ai/">Captum</a> to calculate the explanations. This approach is 🐇 quicker than shap! Check out <a href="https://huggingface.co/spaces/rajistics/interpet_transformers">this space</a> to see a demo.</li>
</ol>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/transformers/ti.png" class="img-fluid figure-img"></p>
<figcaption>Screen Shot 2022-08-12 at 9.27.04 AM</figcaption>
</figure>
</div>
</section>
<section id="ferret" class="level3">
<h3 class="anchored" data-anchor-id="ferret">Ferret</h3>
<ol start="3" type="1">
<li><p><a href="https://github.com/g8a9/ferret">Ferret</a> is built for benchmarking interpretability techniques and includes multiple explanation methodologies (including Partition Shap and Integrated Gradients). A spaces <a href="https://huggingface.co/spaces/g8a9/ferret">demo for ferret is here</a> along with <a href="https://arxiv.org/abs/2208.01575">a paper</a> that explains the various metrics incorporated in ferret.</p>
<p>You can see below how explanations can differ when using different explanation methods. A great reminder that explanations for text are complicated and need to be appropriately caveated.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/transformers/ferret.png" class="img-fluid figure-img"></p>
<figcaption>Screen Shot 2022-08-11 at 1.19.05 PM</figcaption>
</figure>
</div>
<p>Ready to dive in? 🟢</p>
<p>For a longer walkthrough of all the 📦 packages with code snippets, web-based demos, and links to documentation/papers, check out:</p>
<p>👉 Colab notebook: https://bit.ly/raj_explain</p>
<p>🎥 https://youtu.be/j6WbCS0GLuY</p></li>
</ol>


</section>

 ]]></description>
  <category>MLOps</category>
  <category>NLP</category>
  <category>Huggingface</category>
  <category>Explainability</category>
  <guid>https://rajivshah.com/blog/explaining_transformers.html</guid>
  <pubDate>Sun, 14 Aug 2022 05:00:00 GMT</pubDate>
  <media:content url="https://rajivshah.com/blog/images/transformers/banner.png" medium="image" type="image/png"/>
</item>
<item>
  <title>Dynamic Adversarial Data Collection</title>
  <link>https://rajivshah.com/blog/DADC.html</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/DADC.png" class="img-fluid figure-img"></p>
<figcaption>img</figcaption>
</figure>
</div>
<p>Are you looking for better training data for your models? Let me tell you about dynamic adversarial data collection!</p>
<p>I had a large enterprise customer asking me to incorporate this workflow into a <a href="https://www.linkedin.com/company/huggingface/">Hugging Face</a> private hub demo. Here are some resources I found useful: <a href="https://www.linkedin.com/in/ACoAADC2ZecBGpOHE1kqHIx4NINercY4WG0IkJs">Chris Emezue</a> put together a blog post: “<a href="https://huggingface.co/blog/mnist-adversarial">How to train your model dynamically using adversarial data</a>” and a real-life example using <a href="https://huggingface.co/spaces/chrisjay/mnist-adversarial">MNIST using Spaces</a>.</p>
<p>If you want an academic paper that details this process, check out: <a href="https://arxiv.org/abs/2110.08514">Analyzing Dynamic Adversarial Training Data in the Limit</a>. By using this approach, this paper found models made 26% fewer errors on the expert-curated test set.</p>
<p>And if you prefer a video — check out my Tik Tok:</p>
<p>https://www.tiktok.com/<span class="citation" data-cites="rajistics/video/7123667796453592366?is_from_webapp">@rajistics/video/7123667796453592366?is_from_webapp</span>=1&amp;sender_device=pc&amp;web_id=7106277315414181422</p>



 ]]></description>
  <category>Dataset</category>
  <category>Adversarial</category>
  <category>MNIST</category>
  <guid>https://rajivshah.com/blog/DADC.html</guid>
  <pubDate>Thu, 11 Aug 2022 05:00:00 GMT</pubDate>
  <media:content url="https://rajivshah.com/blog/images/DADC.png" medium="image" type="image/png"/>
</item>
<item>
  <title>Model Interpretability and Explainability for Machine Learning Models</title>
  <link>https://rajivshah.com/blog/model-interpretability-explainability.html</link>
  <description><![CDATA[ 






<section id="video" class="level2">
<h2 class="anchored" data-anchor-id="video">Video</h2>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/ZRckw_fE56Q" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<p>Watch the <a href="https://youtu.be/ZRckw_fE56Q">full video</a></p>
<hr>
</section>
<section id="annotated-presentation" class="level2">
<h2 class="anchored" data-anchor-id="annotated-presentation">Annotated Presentation</h2>
<p>Below is an annotated version of the presentation, with timestamped links to the relevant parts of the video for each slide.</p>
<p>Here is the slide-by-slide annotated presentation based on the technical talk “A Quest for Interpretability.”</p>
<section id="a-quest-for-interpretability" class="level3">
<h3 class="anchored" data-anchor-id="a-quest-for-interpretability">1. A Quest for Interpretability</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_1.png" class="img-fluid figure-img"></p>
<figcaption>Slide 1</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=0s">Timestamp: 00:00</a>)</p>
<p>The presentation opens with the title slide, introducing the core mission of the talk: demystifying machine learning models. The speaker sets the stage for both data science novices and experts, promising to provide methods to “ask any particular machine learning model you see and be able to explain it.”</p>
<p>The goal is to move beyond simply generating predictions to understanding the “why” behind them. The speaker emphasizes that whether you are new to the field or comfortable with interpretability, the session will dive deeper into techniques that provide transparency to complex algorithms.</p>
</section>
<section id="predictive-model-around-aggression" class="level3">
<h3 class="anchored" data-anchor-id="predictive-model-around-aggression">2. Predictive Model Around Aggression</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_2.png" class="img-fluid figure-img"></p>
<figcaption>Slide 2</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=41s">Timestamp: 00:41</a>)</p>
<p>To make the concepts more engaging, the speaker introduces a “Dragon theme” as a visual metaphor. The hypothetical problem presented is building a <strong>Predictive Model Around Aggression</strong>. The objective is practical and dire: “we want to use machine learning to help us figure out which dragons are likely to eat us.”</p>
<p>This metaphor serves as a stand-in for real-world risk assessment models. Instead of dry financial or medical data initially, the audience is asked to consider the stakes of a model that must accurately predict danger (getting eaten) based on various dragon attributes.</p>
</section>
<section id="trust-the-big-picture" class="level3">
<h3 class="anchored" data-anchor-id="trust-the-big-picture">3. Trust: The Big Picture</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_3.png" class="img-fluid figure-img"></p>
<figcaption>Slide 3</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=88s">Timestamp: 01:28</a>)</p>
<p>The speaker broadens the scope to explain that <strong>interpretability</strong> is just one component of a much larger ecosystem called “Trust.” This slide illustrates that trusting a model involves asking questions about bias, correctness, ethical purposes (like facial recognition debates), and model health over time.</p>
<p>While acknowledging these critical factors—such as “is your data biased” or “is your model being used for an ethical purpose”—the speaker clarifies that this specific presentation will focus on the interpretability slice of the pie: “can we explain what’s going on… inside that model.”</p>
</section>
<section id="interpretable-predictive-model-around-aggression" class="level3">
<h3 class="anchored" data-anchor-id="interpretable-predictive-model-around-aggression">4. Interpretable Predictive Model Around Aggression</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_4.png" class="img-fluid figure-img"></p>
<figcaption>Slide 4</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=148s">Timestamp: 02:28</a>)</p>
<p>Returning to the dragon metaphor, this slide reiterates the specific technical goal: building an <strong>Interpretable Predictive Model Around Aggression</strong>. The speaker distinguishes this from simply dumping data into a “black box” like TensorFlow and deploying it based solely on performance metrics.</p>
<p>The focus here is on the deliberate choice to build a model that is not just predictive, but understandable. This sets up the central tension of the talk: the trade-off between model complexity (accuracy) and the ability to explain how the model works.</p>
</section>
<section id="why-interpretability" class="level3">
<h3 class="anchored" data-anchor-id="why-interpretability">5. Why Interpretability?</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_5.png" class="img-fluid figure-img"></p>
<figcaption>Slide 5</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=158s">Timestamp: 02:38</a>)</p>
<p>This slide outlines the three key audiences for interpretability. First, for <strong>Yourself</strong>: debugging is essential because “it’s very easy for things to go wrong.” Second, for <strong>Stakeholders</strong>: managers and bosses will demand to know how a model works, regardless of how high the AUC (Area Under the Curve) is.</p>
<p>Third, the speaker highlights <strong>Regulators</strong> in high-risk industries like insurance, finance, and healthcare. In these sectors, there is a “higher standard set” where you must prove you understand the model’s behavior to mitigate risks to the financial system or public health.</p>
</section>
<section id="an-understandable-white-box-model-clear-2" class="level3">
<h3 class="anchored" data-anchor-id="an-understandable-white-box-model-clear-2">6. An Understandable White Box Model (CLEAR-2)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_6.png" class="img-fluid figure-img"></p>
<figcaption>Slide 6</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=261s">Timestamp: 04:21</a>)</p>
<p>The presentation begins with the “simplest, easiest, most interpretable model”: a linear regression for housing prices. This <strong>White Box Model</strong> uses only two features: the number of bathrooms and square footage.</p>
<p>The transparency is total; you can see the coefficients directly (e.g., multiplying bathrooms by a value). The audience is asked to confirm that this is intuitive, and the consensus is that yes, this is an easily explainable model where the inputs have a clear, logical relationship to the output.</p>
</section>
<section id="white-box-model-clear-8" class="level3">
<h3 class="anchored" data-anchor-id="white-box-model-clear-8">7. White Box Model (CLEAR-8)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_7.png" class="img-fluid figure-img"></p>
<figcaption>Slide 7</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=341s">Timestamp: 05:41</a>)</p>
<p>Complexity is introduced by adding more features to improve accuracy. However, this slide reveals a paradox of linear models: <strong>Multicollinearity</strong>. The speaker points out that while the model might be “transparent” (you can see the math), the logic breaks down.</p>
<p>Specifically, the model shows that “as the total rooms gets higher, the value of my house goes down.” This counter-intuitive finding occurs because features are not independent. While technically a “white box,” the interpretability suffers because the coefficients no longer align with human intuition due to correlations between variables.</p>
</section>
<section id="understandable-white-box-model-tree---auc-0.74" class="level3">
<h3 class="anchored" data-anchor-id="understandable-white-box-model-tree---auc-0.74">8. Understandable White Box Model? (Tree - AUC 0.74)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_8.png" class="img-fluid figure-img"></p>
<figcaption>Slide 8</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=556s">Timestamp: 09:16</a>)</p>
<p>Moving to <strong>Decision Trees</strong>, the speaker presents a simple tree based on the Titanic dataset (predicting survival). With only two features (gender and age), the logic is stark and easy to follow: “if you’re a male and your age is greater than 10 years old… chance of survival is very low.”</p>
<p>This model has an AUC of 0.74. It is highly interpretable, acting as a flowchart that anyone can trace. However, the speaker hints at the limitation: simplicity often comes at the cost of accuracy.</p>
</section>
<section id="understandable-white-box-model-tree---auc-0.78" class="level3">
<h3 class="anchored" data-anchor-id="understandable-white-box-model-tree---auc-0.78">9. Understandable White Box Model? (Tree - AUC 0.78)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_9.png" class="img-fluid figure-img"></p>
<figcaption>Slide 9</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=679s">Timestamp: 11:19</a>)</p>
<p>To improve the model, more features are added, raising the AUC to 0.78. The tree grows branches, becoming visually more cluttered. The speaker notes that “by adding more features or variables… the performance of our model increases.”</p>
<p>This slide represents the tipping point where the visual representation of the model starts to become less of a helpful flowchart and more of a complex web, though it is still technically possible to trace a single path.</p>
</section>
<section id="understandable-white-box-model-tree---auc-0.79" class="level3">
<h3 class="anchored" data-anchor-id="understandable-white-box-model-tree---auc-0.79">10. Understandable White Box Model? (Tree - AUC 0.79)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_10.png" class="img-fluid figure-img"></p>
<figcaption>Slide 10</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=698s">Timestamp: 11:38</a>)</p>
<p>The optimization continues, pushing the AUC to 0.79. The tree on the slide is now dense and difficult to read. The question mark in the title “Understandable White Box Model?” becomes more relevant.</p>
<p>The speaker emphasizes that data scientists “don’t have to kind of stop there.” The drive for higher accuracy encourages adding more depth and complexity to the tree, sacrificing the immediate “glance-value” interpretability that smaller trees possess.</p>
</section>
<section id="better-performance-but-too-much-to-comprehend-auc-0.81" class="level3">
<h3 class="anchored" data-anchor-id="better-performance-but-too-much-to-comprehend-auc-0.81">11. Better Performance but too much to Comprehend (AUC 0.81)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_11.png" class="img-fluid figure-img"></p>
<figcaption>Slide 11</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=704s">Timestamp: 11:44</a>)</p>
<p>This slide shows a massive, unreadable decision tree with an AUC of 0.81. The speaker notes, “it gets a little tricky… lot harder to understand what’s going on.” This illustrates the “Black Box” problem even within models considered interpretable.</p>
<p>Furthermore, the speaker points out that data scientists rarely stop at one tree; they use <strong>Random Forests</strong> (collections of trees). Interpreting a forest by looking at the trees is impossible, necessitating new tools for explanation.</p>
</section>
<section id="so-many-algorithms-to-try" class="level3">
<h3 class="anchored" data-anchor-id="so-many-algorithms-to-try">12. So Many Algorithms to Try</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_12.png" class="img-fluid figure-img"></p>
<figcaption>Slide 12</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=787s">Timestamp: 13:07</a>)</p>
<p>This heatmap, derived from a study by Randy Olssen, visualizes the performance of different algorithms across 165 datasets. It illustrates the <strong>No Free Lunch Theorem</strong>: there is not one single algorithm that always works best.</p>
<p>Because of this, data scientists must try various complex algorithms (Gradient Boosting, Neural Networks, Ensembles) to find the best solution. We cannot simply restrict ourselves to linear regression just for the sake of interpretability if it fails to solve the problem.</p>
</section>
<section id="algorithms-matter" class="level3">
<h3 class="anchored" data-anchor-id="algorithms-matter">13. Algorithms Matter</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_13.png" class="img-fluid figure-img"></p>
<figcaption>Slide 13</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=829s">Timestamp: 13:49</a>)</p>
<p>The speaker reinforces that model choice is critical. Using a simple model that yields inaccurate predictions is dangerous: “if we can’t figure out if this model is going to work or not we’re in trouble.”</p>
<p>The slide emphasizes that accuracy is paramount (“we are toast” if we are wrong). Therefore, we need methods that allow us to use complex, accurate algorithms without flying blind regarding how they work.</p>
</section>
<section id="simple-models-accurate" class="level3">
<h3 class="anchored" data-anchor-id="simple-models-accurate">14. Simple Models != Accurate</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_14.png" class="img-fluid figure-img"></p>
<figcaption>Slide 14</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=851s">Timestamp: 14:11</a>)</p>
<p>This slide counters the argument that we should only use simple models. The speaker asserts, “most simple models are just not very accurate.” Real-world problems are complex, and if they could be solved with a few simple rules, machine learning wouldn’t be necessary.</p>
<p>Resources are provided on the slide for further reading, including defenses of black box models. The takeaway is that complexity is often a requirement for accuracy, so we must find ways to explain complex models rather than avoiding them.</p>
</section>
<section id="tools-that-can-explain-any-black-box-model" class="level3">
<h3 class="anchored" data-anchor-id="tools-that-can-explain-any-black-box-model">15. Tools That Can Explain Any Black Box Model</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_15.png" class="img-fluid figure-img"></p>
<figcaption>Slide 15</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=899s">Timestamp: 14:59</a>)</p>
<p>This is the pivot point of the presentation. The speaker introduces the solution: “There are tools here that can explain any blackbox model.” This promises a methodology that is <strong>Model Agnostic</strong>—meaning it works regardless of whether you are using a Random Forest, a Neural Network, or an SVM.</p>
</section>
<section id="model-agnostic-explanation-tools" class="level3">
<h3 class="anchored" data-anchor-id="model-agnostic-explanation-tools">16. Model Agnostic Explanation Tools</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_16.png" class="img-fluid figure-img"></p>
<figcaption>Slide 16</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=907s">Timestamp: 15:07</a>)</p>
<p>The speaker outlines the three specific pillars of interpretability that the rest of the talk will cover: 1. <strong>Feature Importance:</strong> Understanding what variables are most impactful. 2. <strong>Partial Dependence:</strong> Understanding the directionality of features (e.g., does age increase or decrease risk?). 3. <strong>Prediction Explanations:</strong> Explaining why a specific prediction was made for a specific individual (using techniques like SHAP).</p>
</section>
<section id="feature-importance" class="level3">
<h3 class="anchored" data-anchor-id="feature-importance">17. Feature Importance</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_17.png" class="img-fluid figure-img"></p>
<figcaption>Slide 17</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=971s">Timestamp: 16:11</a>)</p>
<p>The first pillar is <strong>Feature Importance</strong>. Returning to the dragon example, the speaker discusses the data collection process: asking domain experts (or watching Game of Thrones) to determine factors like age, weight, or number of children.</p>
<p>The goal is to determine which of these collected variables actually drives the model. This is crucial for debugging, feature selection, and explaining the model to stakeholders.</p>
</section>
<section id="dragon-reading-milk-vs.-age" class="level3">
<h3 class="anchored" data-anchor-id="dragon-reading-milk-vs.-age">18. Dragon Reading: Milk vs.&nbsp;Age</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_18.png" class="img-fluid figure-img"></p>
<figcaption>Slide 18</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=1061s">Timestamp: 17:41</a>)</p>
<p>To illustrate the pitfalls of feature importance, the speaker introduces a new scenario: “how dragons learn to read.” We intuitively know that <strong>Age</strong> affects reading ability (older children read better).</p>
<p>The speaker then asks about <strong>Milk Consumption</strong>. While one might guess milk helps (calcium), the reality is that milk consumption is negatively correlated with age (babies drink milk, teenagers don’t). Therefore, milk consumption appears related to reading ability, but it is a <strong>spurious correlation</strong>. It has “nothing at all to do with the ability to read,” yet the data might suggest otherwise.</p>
</section>
<section id="split-based-variable-importance" class="level3">
<h3 class="anchored" data-anchor-id="split-based-variable-importance">19. Split Based Variable Importance</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_19.png" class="img-fluid figure-img"></p>
<figcaption>Slide 19</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=1239s">Timestamp: 20:39</a>)</p>
<p>This slide shows what happens when you use the default “Split Based” importance metric in algorithms like LightGBM. The chart shows <strong>milk_consumption</strong> as the <em>most</em> important feature, ranking higher than age.</p>
<p>This happens because the model uses milk consumption as a proxy for age during the tree-splitting process. The speaker warns that relying on default metrics can lead to incorrect conclusions where spurious correlations mask the true drivers of the model.</p>
</section>
<section id="permutation-based-variable-importance" class="level3">
<h3 class="anchored" data-anchor-id="permutation-based-variable-importance">20. Permutation Based Variable Importance</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_20.png" class="img-fluid figure-img"></p>
<figcaption>Slide 20</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=1271s">Timestamp: 21:11</a>)</p>
<p>By switching to a <strong>Permutation Based</strong> approach, the chart flips. Now, <strong>Age</strong> is correctly identified as the dominant feature, and milk consumption drops to near zero importance.</p>
<p>The speaker emphasizes that this technique “cuts right through” the noise. It correctly identifies that while milk varies with age, it does not actually influence the reading score when age is accounted for.</p>
</section>
<section id="spurious-correlations-nicolas-cage" class="level3">
<h3 class="anchored" data-anchor-id="spurious-correlations-nicolas-cage">21. Spurious Correlations (Nicolas Cage)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_21.png" class="img-fluid figure-img"></p>
<figcaption>Slide 21</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=1298s">Timestamp: 21:38</a>)</p>
<p>This slide references the famous spurious correlation between Nicolas Cage films and swimming pool drownings. The speaker uses this to highlight the danger of “Enterprise Data Lakes.”</p>
<p>When data scientists grab massive tables of data without domain knowledge, they risk finding these coincidental patterns. Machine learning models are excellent at finding patterns, even ones that are nonsensical, making robust feature importance techniques vital.</p>
</section>
<section id="feature-impact-ranking" class="level3">
<h3 class="anchored" data-anchor-id="feature-impact-ranking">22. Feature Impact Ranking</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_22.png" class="img-fluid figure-img"></p>
<figcaption>Slide 22</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=1029s">Timestamp: 17:09</a>)</p>
<p>The presentation shows a ranked list of features for the dragon model. The speaker reiterates that getting this ranking right has “real consequences.”</p>
<p>If you tell a business stakeholder that a specific variable is driving the risk, they will make decisions based on that. Understanding the true hierarchy of influence is essential for trust and actionable insight.</p>
</section>
<section id="if-your-feature-impact-is-wrong" class="level3">
<h3 class="anchored" data-anchor-id="if-your-feature-impact-is-wrong">23. If Your Feature Impact is Wrong…</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_23.png" class="img-fluid figure-img"></p>
<figcaption>Slide 23</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=1037s">Timestamp: 17:17</a>)</p>
<p>A humorous but serious warning: “If your feature impact is wrong, you are toast.”</p>
<p>This underscores the professional risk. If a data scientist attributes a prediction to the wrong cause (like milk instead of age), they lose credibility and potentially cause the business to pull the wrong levers to try and optimize the outcome.</p>
</section>
<section id="feature-importance-ablation-methodology" class="level3">
<h3 class="anchored" data-anchor-id="feature-importance-ablation-methodology">24. Feature Importance: Ablation Methodology</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_24.png" class="img-fluid figure-img"></p>
<figcaption>Slide 24</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=1335s">Timestamp: 22:15</a>)</p>
<p>The speaker explains the logic behind feature importance using an <strong>Ablation Methodology</strong>. He presents three models: 1. Model AB (Both features): R-squared 0.9 2. Model A (Feature A only): R-squared 0.7 3. Model B (Feature B only): R-squared 0.8</p>
<p>He asks the audience to intuit which feature is more important based on these scores.</p>
</section>
<section id="ablation-methodology-definition" class="level3">
<h3 class="anchored" data-anchor-id="ablation-methodology-definition">25. Ablation Methodology Definition</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_25.png" class="img-fluid figure-img"></p>
<figcaption>Slide 25</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=1377s">Timestamp: 22:57</a>)</p>
<p>The audience correctly identifies that Feature B is more important because it carries more signal (higher R-squared) on its own.</p>
<p>The speaker defines <strong>Ablation</strong> as comparing the model performance with and without specific features. It is a scientific control method: “try something with it and without it,” similar to testing if coffee makes a person happy by withholding it for a day.</p>
</section>
<section id="leave-it-out-feature-importance" class="level3">
<h3 class="anchored" data-anchor-id="leave-it-out-feature-importance">26. ‘Leave it Out’ Feature Importance</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_26.png" class="img-fluid figure-img"></p>
<figcaption>Slide 26</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=1429s">Timestamp: 23:49</a>)</p>
<p>This slide formalizes the “Leave One Out” approach. By calculating the drop in performance when a feature is removed, we quantify its value. * Remove B: Performance drops by 0.2 (0.9 -&gt; 0.7). * Remove A: Performance drops by 0.1 (0.9 -&gt; 0.8).</p>
<p>Since removing B causes a larger drop in accuracy, B is the more important feature. However, the speaker notes a problem: with 100 features, you would have to build 100 different models, which is computationally expensive.</p>
</section>
<section id="permutation-based-feature-importance" class="level3">
<h3 class="anchored" data-anchor-id="permutation-based-feature-importance">27. Permutation Based Feature Importance</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_27.png" class="img-fluid figure-img"></p>
<figcaption>Slide 27</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=1613s">Timestamp: 26:53</a>)</p>
<p>To solve the computational cost of retraining models, the speaker introduces <strong>Permutation Importance</strong> (attributed to Breiman/Random Forests). Instead of removing a column and retraining, you simply <strong>shuffle</strong> the values of that column (permute them) within the existing test data.</p>
<p>By shuffling the data, you break the relationship between that feature and the target, effectively “removing” the signal while keeping the model structure intact. If the model’s error increases significantly after shuffling a feature, that feature was important.</p>
</section>
<section id="r-package-randomforest" class="level3">
<h3 class="anchored" data-anchor-id="r-package-randomforest">28. R Package: randomForest</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_28.png" class="img-fluid figure-img"></p>
<figcaption>Slide 28</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=1651s">Timestamp: 27:31</a>)</p>
<p>The speaker highlights that this is a standard technique available in common tools. In the R language, the <code>randomForest</code> package has supported permutation-based importance for a long time.</p>
<p>This slide serves as a resource pointer for R users, confirming that these advanced interpretability checks are accessible within their standard toolkits.</p>
</section>
<section id="python-scikit-learn" class="level3">
<h3 class="anchored" data-anchor-id="python-scikit-learn">29. Python: scikit-learn</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_29.png" class="img-fluid figure-img"></p>
<figcaption>Slide 29</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=1656s">Timestamp: 27:36</a>)</p>
<p>Similarly, for Python users, <code>scikit-learn</code> has added support for permutation importance. This accessibility reinforces the speaker’s point that there is no excuse for not using these techniques to validate model behavior.</p>
</section>
<section id="multicollinearity" class="level3">
<h3 class="anchored" data-anchor-id="multicollinearity">30. Multicollinearity</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_30.png" class="img-fluid figure-img"></p>
<figcaption>Slide 30</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=1690s">Timestamp: 28:10</a>)</p>
<p>The speaker addresses a complex issue: <strong>Multicollinearity</strong>. The Venn diagrams illustrate that features often share information (variance).</p>
<p>When features are highly correlated, they “share the signal.” This makes it difficult for the model (and the interpreter) to assign credit. Does the credit go to Feature A or Feature B if they both describe the same underlying phenomenon?</p>
</section>
<section id="different-models-10-different-importances" class="level3">
<h3 class="anchored" data-anchor-id="different-models-10-different-importances">31. 10 Different Models, 10 Different Importances</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_31.png" class="img-fluid figure-img"></p>
<figcaption>Slide 31</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=1705s">Timestamp: 28:25</a>)</p>
<p>Due to multicollinearity, running the same algorithm on the same data multiple times (with different random seeds or data partitions) can result in different feature rankings.</p>
<p>This instability is frustrating. In one run, “Milk” might be important; in another, “Age” takes the lead. This happens because the model arbitrarily chooses one of the correlated features to split on, and this choice changes based on randomness in the training process.</p>
</section>
<section id="multicollinearity-affects-interpreting-models" class="level3">
<h3 class="anchored" data-anchor-id="multicollinearity-affects-interpreting-models">32. Multicollinearity Affects Interpreting Models</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_32.png" class="img-fluid figure-img"></p>
<figcaption>Slide 32</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=1748s">Timestamp: 29:08</a>)</p>
<p>This chart visualizes the “trading off” effect. You can see features swapping positions in importance rankings across different model runs.</p>
<p>The speaker notes that you cannot simply remove correlated features without potentially hurting accuracy, as they might contain slight unique signals. This trade-off between accuracy and stable interpretability is a core challenge in data science.</p>
</section>
<section id="pro-tip-aggregate-feature-importance-same-model" class="level3">
<h3 class="anchored" data-anchor-id="pro-tip-aggregate-feature-importance-same-model">33. Pro Tip: Aggregate Feature Importance (Same Model)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_33.png" class="img-fluid figure-img"></p>
<figcaption>Slide 33</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=1809s">Timestamp: 30:09</a>)</p>
<p>To handle instability, the speaker suggests a “Pro Tip”: <strong>Aggregate Feature Importance</strong>. Run the feature importance calculation multiple times on the same model and plot the variability (the box plots in the slide).</p>
<p>This gives a “richer understanding.” Instead of a single number, you see a range. If the range is huge, you know the feature’s importance is unstable due to correlation or noise.</p>
</section>
<section id="aggregate-feature-importance-different-models" class="level3">
<h3 class="anchored" data-anchor-id="aggregate-feature-importance-different-models">34. Aggregate Feature Importance (Different Models)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_34.png" class="img-fluid figure-img"></p>
<figcaption>Slide 34</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=1815s">Timestamp: 30:15</a>)</p>
<p>Expanding on the previous tip, you can also aggregate importance across <em>different</em> models (e.g., comparing importance in a Random Forest vs.&nbsp;a Gradient Boosted Machine).</p>
<p>If a feature is consistently important across different algorithms and multiple runs, you can be much more confident that it is a true driver of the target variable.</p>
</section>
<section id="pro-tips-add-random-features" class="level3">
<h3 class="anchored" data-anchor-id="pro-tips-add-random-features">35. Pro Tips: Add Random Features</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_35.png" class="img-fluid figure-img"></p>
<figcaption>Slide 35</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=1820s">Timestamp: 30:20</a>)</p>
<p>Another technique mentioned is adding a <strong>Random Feature</strong> (noise) to the dataset. If a real feature ranks lower in importance than the random noise variable, it is likely not a significant predictor.</p>
<p>This serves as a baseline or “sanity check” to distinguish true signal from statistical noise in the feature ranking list.</p>
</section>
<section id="permutation-based-importance-conclusion" class="level3">
<h3 class="anchored" data-anchor-id="permutation-based-importance-conclusion">36. Permutation Based Importance Conclusion</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_36.png" class="img-fluid figure-img"></p>
<figcaption>Slide 36</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2014s">Timestamp: 33:34</a>)</p>
<p>The section concludes by asserting that <strong>Permutation based importance</strong> is the “best practice.” It offers a “good balance of computation and performance for any model.”</p>
<p>References to academic papers (like Strobl) are provided for those who want to dive into the edge cases, but for general application, this is the recommended approach for determining <em>what</em> matters in a model.</p>
</section>
<section id="partial-dependence" class="level3">
<h3 class="anchored" data-anchor-id="partial-dependence">37. Partial Dependence</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_37.png" class="img-fluid figure-img"></p>
<figcaption>Slide 37</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2062s">Timestamp: 34:22</a>)</p>
<p>The second tool introduced is <strong>Partial Dependence</strong>. While feature importance tells us <em>which</em> variables matter, Partial Dependence tells us <em>how</em> they matter.</p>
<p>The slide shows example plots for Age and Weight. The goal is to understand the functional relationship: as age increases, does the predicted aggression go up, down, or follow a complex curve?</p>
</section>
<section id="effect-of-age-on-our-target" class="level3">
<h3 class="anchored" data-anchor-id="effect-of-age-on-our-target">38. Effect of Age on Our Target</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_38.png" class="img-fluid figure-img"></p>
<figcaption>Slide 38</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2081s">Timestamp: 34:41</a>)</p>
<p>The speaker reiterates that in complex “black box” models, we don’t have coefficients (positive or negative signs) like in linear regression. We cannot simply say “age is positive.”</p>
<p>Therefore, we need a visualization that maps the input value to the prediction output to understand the behavior of the model across the range of the feature.</p>
</section>
<section id="calculating-partial-dependence-step-1" class="level3">
<h3 class="anchored" data-anchor-id="calculating-partial-dependence-step-1">39. Calculating Partial Dependence (Step 1)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_39.png" class="img-fluid figure-img"></p>
<figcaption>Slide 39</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2113s">Timestamp: 35:13</a>)</p>
<p>To explain how Partial Dependence is calculated, the speaker walks through the process. Step 1: Take a single observation (one Dragon).</p>
<p>Step 2: Keep all features constant <em>except</em> the one we are interested in (Age). Manually force the age to different values (e.g., 5, 10, 15 years old) and ask the model for a prediction at each point. This generates a hypothetical curve for that specific dragon.</p>
</section>
<section id="calculating-partial-dependence-step-2" class="level3">
<h3 class="anchored" data-anchor-id="calculating-partial-dependence-step-2">40. Calculating Partial Dependence (Step 2)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_40.png" class="img-fluid figure-img"></p>
<figcaption>Slide 40</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2153s">Timestamp: 35:53</a>)</p>
<p>The process is repeated for a second dragon. Because the other features (weight, color, etc.) are different for this dragon, the curve might look slightly different (higher or lower baseline), but it follows the model’s logic for age.</p>
</section>
<section id="calculating-partial-dependence-step-3" class="level3">
<h3 class="anchored" data-anchor-id="calculating-partial-dependence-step-3">41. Calculating Partial Dependence (Step 3)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_41.png" class="img-fluid figure-img"></p>
<figcaption>Slide 41</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2166s">Timestamp: 36:06</a>)</p>
<p>This is repeated for many observations in the dataset. The slide shows multiple data points being generated. This creates a “what-if” scenario for every dragon in the dataset across the spectrum of ages.</p>
</section>
<section id="individual-conditional-expectation-ice-curves" class="level3">
<h3 class="anchored" data-anchor-id="individual-conditional-expectation-ice-curves">42. Individual Conditional Expectation (ICE) Curves</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_42.png" class="img-fluid figure-img"></p>
<figcaption>Slide 42</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2175s">Timestamp: 36:15</a>)</p>
<p>When you draw lines connecting these predictions for each individual instance, you get <strong>ICE Curves</strong> (Individual Conditional Expectation).</p>
<p>This visualizes the relationship between the feature and the prediction for every single data point. It shows the variability: for some dragons, age might have a steep effect; for others, it might be flatter.</p>
</section>
<section id="partial-dependence-plots-pdps" class="level3">
<h3 class="anchored" data-anchor-id="partial-dependence-plots-pdps">43. Partial Dependence Plots (PDPs)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_43.png" class="img-fluid figure-img"></p>
<figcaption>Slide 43</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2187s">Timestamp: 36:27</a>)</p>
<p>To get the <strong>Partial Dependence Plot (PDP)</strong>, you simply <strong>average</strong> all the ICE curves.</p>
<p>This single line represents the <em>average</em> effect of the feature on the model’s prediction, holding everything else constant. It distills the complex interactions into a single, interpretable trend line.</p>
</section>
<section id="resulting-partial-dependence" class="level3">
<h3 class="anchored" data-anchor-id="resulting-partial-dependence">44. Resulting Partial Dependence</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_44.png" class="img-fluid figure-img"></p>
<figcaption>Slide 44</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2227s">Timestamp: 37:07</a>)</p>
<p>The final plot shows the isolated effect of Age. The speaker notes this gives “really good insight.” We can now see if the risk rises linearly with age, or if (as often happens in nonlinear models) it plateaus or dips at certain points.</p>
</section>
<section id="ice-plots" class="level3">
<h3 class="anchored" data-anchor-id="ice-plots">45. ICE Plots</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_45.png" class="img-fluid figure-img"></p>
<figcaption>Slide 45</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2448s">Timestamp: 40:48</a>)</p>
<p>This slide formally defines ICE Plots. While the PDP shows the average, ICE plots are useful for seeing heterogeneity. For example, if the model treats males and females differently, the ICE curves might show two distinct clusters of lines that the average PDP would obscure.</p>
</section>
<section id="partial-dependence-to-show-price-elasticity" class="level3">
<h3 class="anchored" data-anchor-id="partial-dependence-to-show-price-elasticity">46. Partial Dependence to Show Price Elasticity</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_46.png" class="img-fluid figure-img"></p>
<figcaption>Slide 46</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2240s">Timestamp: 37:20</a>)</p>
<p>The speaker moves to a real-world example: <strong>Orange Juice Sales</strong>. The goal is to understand <strong>Price Elasticity</strong>—if we raise the price, do sales go down?</p>
<p>Economics 101 says yes, but the model includes complex factors like store location, coupons, and competitor prices (10 other brands), making it a high-dimensional problem.</p>
</section>
<section id="change-in-price-affects-sales" class="level3">
<h3 class="anchored" data-anchor-id="change-in-price-affects-sales">47. Change in Price Affects Sales?</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_47.png" class="img-fluid figure-img"></p>
<figcaption>Slide 47</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2325s">Timestamp: 38:45</a>)</p>
<p>This chart shows the raw data (orange line) of Price vs.&nbsp;Sales. It is “all over the place.” There is no clear linear relationship visible because the data is noisy and confounded by other variables (e.g., maybe high prices occurred during a holiday when sales were high anyway).</p>
<p>Looking just at the raw data fails to isolate the specific impact of the price change on consumer behavior.</p>
</section>
<section id="ahh-price-does-affect-sales" class="level3">
<h3 class="anchored" data-anchor-id="ahh-price-does-affect-sales">48. Ahh, Price Does Affect Sales!</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_48.png" class="img-fluid figure-img"></p>
<figcaption>Slide 48</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2356s">Timestamp: 39:16</a>)</p>
<p>By applying <strong>Partial Dependence</strong>, the signal emerges from the noise. The blue line clearly shows that as price increases, sales generally decrease.</p>
<p>Crucially, the plot reveals a non-linear drop at exactly <strong>$3.50</strong>. The speaker interprets this as a psychological threshold where customers decide “maybe I’ll buy something else.” This insight—a specific price point where demand collapses—is only visible through this interpretability technique.</p>
</section>
<section id="distributions-and-partial-dependence" class="level3">
<h3 class="anchored" data-anchor-id="distributions-and-partial-dependence">49. Distributions and Partial Dependence</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_49.png" class="img-fluid figure-img"></p>
<figcaption>Slide 49</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2430s">Timestamp: 40:30</a>)</p>
<p>A warning is issued regarding <strong>Distributions</strong>. Partial Dependence assumes you can vary a feature independently of others. However, if features are correlated, you might create impossible combinations (like a 5-year-old dragon that weighs 5 tons).</p>
<p>Making predictions on these “impossible” data points means extrapolating outside the training distribution, which can lead to unreliable explanations.</p>
</section>
<section id="partial-dependence-conclusion" class="level3">
<h3 class="anchored" data-anchor-id="partial-dependence-conclusion">50. Partial Dependence Conclusion</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_50.png" class="img-fluid figure-img"></p>
<figcaption>Slide 50</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2425s">Timestamp: 40:25</a>)</p>
<p>The speaker concludes that Partial Dependence is a “best practice” for understanding feature behavior. References to Goldstein and Friedman (classic papers) are provided.</p>
<p>This tool answers the “directionality” question, proving that the model aligns with domain knowledge (e.g., higher prices = lower sales).</p>
</section>
<section id="predictions" class="level3">
<h3 class="anchored" data-anchor-id="predictions">51. Predictions</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_51.png" class="img-fluid figure-img"></p>
<figcaption>Slide 51</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2512s">Timestamp: 41:52</a>)</p>
<p>The final section focuses on <strong>Predictions</strong>. The speaker shows three dragons with their associated risk scores (9.1, 2.4, etc.).</p>
<p>While the model successfully identifies the red dragon as high risk, the next logical question from a user is “Why?”</p>
</section>
<section id="predictions-explanations" class="level3">
<h3 class="anchored" data-anchor-id="predictions-explanations">52. Predictions &amp; Explanations</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_52.png" class="img-fluid figure-img"></p>
<figcaption>Slide 52</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2545s">Timestamp: 42:25</a>)</p>
<p>This slide introduces <strong>Prediction Explanations</strong>. Alongside the score of 9.1, the model provides a list of contributing factors: “Number of past kills” increased the score, while “Gender” might have decreased it.</p>
<p>This moves from global interpretability (how the model works generally) to <strong>local interpretability</strong> (why this specific instance was scored this way).</p>
</section>
<section id="floor-map-with-readmission-probability" class="level3">
<h3 class="anchored" data-anchor-id="floor-map-with-readmission-probability">53. Floor Map with Readmission Probability</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_53.png" class="img-fluid figure-img"></p>
<figcaption>Slide 53</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2581s">Timestamp: 43:01</a>)</p>
<p>A real-world application is shown: a hospital dashboard predicting patient readmission. The interface doesn’t just show a risk score (63.7%); it lists the reasons (e.g., “Abdominal pain,” “Medical specialty unspecified”).</p>
<p>The speaker highlights that these explanations build <strong>Trust</strong> with end-users (nurses/doctors) and provide <strong>Context</strong> that helps them decide <em>how</em> to intervene, rather than just knowing <em>that</em> they should intervene.</p>
</section>
<section id="local-interpretable-model-agnostic-explanations-lime" class="level3">
<h3 class="anchored" data-anchor-id="local-interpretable-model-agnostic-explanations-lime">54. Local Interpretable Model-Agnostic Explanations (LIME)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_54.png" class="img-fluid figure-img"></p>
<figcaption>Slide 54</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2701s">Timestamp: 45:01</a>)</p>
<p>The speaker mentions <strong>LIME</strong>, one of the “traditional” or early techniques for this type of explanation. LIME works by fitting a simple local model around a single prediction to approximate the complex model’s behavior.</p>
</section>
<section id="lime-flaw-explanations-should-be-identical" class="level3">
<h3 class="anchored" data-anchor-id="lime-flaw-explanations-should-be-identical">55. LIME Flaw: Explanations Should Be Identical</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_55.png" class="img-fluid figure-img"></p>
<figcaption>Slide 55</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2712s">Timestamp: 45:12</a>)</p>
<p>The tone shifts to a critique of LIME. The slide asserts a fundamental requirement: <strong>“EXPLANATIONS SHOULD BE IDENTICAL”</strong> for the same data and same model.</p>
<p>If you ask the model twice why it predicted a score for the same dragon, the answer should be the same both times.</p>
</section>
<section id="lime-two-different-explanations" class="level3">
<h3 class="anchored" data-anchor-id="lime-two-different-explanations">56. LIME: Two Different Explanations</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_56.png" class="img-fluid figure-img"></p>
<figcaption>Slide 56</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2715s">Timestamp: 45:15</a>)</p>
<p>This slide provides code evidence of LIME’s instability. Running LIME twice on the “SAME DATA, SAME MODEL” produces “TWO DIFFERENT EXPLANATIONS.”</p>
<p>This occurs because LIME relies on random sampling to build its local approximation. This randomness makes it unreliable for serious applications where consistency is required for trust.</p>
</section>
<section id="explanations-should-have-fidelity" class="level3">
<h3 class="anchored" data-anchor-id="explanations-should-have-fidelity">57. Explanations Should Have Fidelity</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_57.png" class="img-fluid figure-img"></p>
<figcaption>Slide 57</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2720s">Timestamp: 45:20</a>)</p>
<p>The speaker argues that explanations must have <strong>Fidelity</strong> to the data. If two data points are very similar, their explanations should be similar. LIME often fails this test, producing vastly different explanations for minor changes in input.</p>
</section>
<section id="lime-isnt-responsive-to-data" class="level3">
<h3 class="anchored" data-anchor-id="lime-isnt-responsive-to-data">58. LIME Isn’t Responsive to Data</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_58.png" class="img-fluid figure-img"></p>
<figcaption>Slide 58</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2722s">Timestamp: 45:22</a>)</p>
<p>Further criticism of LIME. The slide suggests that LIME explanations sometimes lack “local fidelity,” meaning the explanation doesn’t accurately reflect the model’s behavior in that specific region of the data.</p>
</section>
<section id="anyone-relying-on-lime-is-toast" class="level3">
<h3 class="anchored" data-anchor-id="anyone-relying-on-lime-is-toast">59. Anyone Relying on LIME is Toast</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_59.png" class="img-fluid figure-img"></p>
<figcaption>Slide 59</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2725s">Timestamp: 45:25</a>)</p>
<p>A blunt conclusion: “Anyone relying on LIME is toast.” The speaker strongly advises against using LIME due to these flaws, suggesting that while it was a pioneering method, it is no longer the standard for reliable interpretability.</p>
</section>
<section id="what-can-we-learn-from-this" class="level3">
<h3 class="anchored" data-anchor-id="what-can-we-learn-from-this">60. What Can We Learn From This?</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_60.png" class="img-fluid figure-img"></p>
<figcaption>Slide 60</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2730s">Timestamp: 45:30</a>)</p>
<p>This slide summarizes the requirements for a good explanation method derived from LIME’s failures: consistency, accuracy, and fidelity. It sets the stage for introducing the superior method: Shapley values.</p>
</section>
<section id="your-model-or-a-surrogate-model" class="level3">
<h3 class="anchored" data-anchor-id="your-model-or-a-surrogate-model">61. Your Model or a Surrogate Model?</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_61.png" class="img-fluid figure-img"></p>
<figcaption>Slide 61</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2730s">Timestamp: 45:30</a>)</p>
<p>The speaker questions whether we are explaining the <em>actual</em> model or a <em>surrogate</em> (approximation). LIME explains a surrogate. Ideally, we want to explain the actual model directly.</p>
</section>
<section id="what-is-local" class="level3">
<h3 class="anchored" data-anchor-id="what-is-local">62. What is Local?</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_62.png" class="img-fluid figure-img"></p>
<figcaption>Slide 62</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2730s">Timestamp: 45:30</a>)</p>
<p>Another critique of LIME involves the definition of “local.” The “kernel width” is a hyperparameter that changes the explanation. If the explanation depends on how you tune the explainer, rather than just the data, it is problematic.</p>
</section>
<section id="explanations-should-be-model-agnostic" class="level3">
<h3 class="anchored" data-anchor-id="explanations-should-be-model-agnostic">63. Explanations Should Be Model Agnostic</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_63.png" class="img-fluid figure-img"></p>
<figcaption>Slide 63</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2706s">Timestamp: 45:06</a>)</p>
<p>The speaker reiterates the requirement that the method must work for any model type (Trees, Neural Nets, SVMs). This is a strength of LIME, but also a requirement for its replacement.</p>
</section>
<section id="explanations-should-be-fast" class="level3">
<h3 class="anchored" data-anchor-id="explanations-should-be-fast">64. Explanations Should Be Fast</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_64.png" class="img-fluid figure-img"></p>
<figcaption>Slide 64</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2733s">Timestamp: 45:33</a>)</p>
<p>Speed is critical. The slide compares LIME’s speed across datasets. If an explanation takes too long to generate, it cannot be used in real-time applications (like the hospital dashboard).</p>
</section>
<section id="shapley-values-for-explanations" class="level3">
<h3 class="anchored" data-anchor-id="shapley-values-for-explanations">65. Shapley Values for Explanations</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_65.png" class="img-fluid figure-img"></p>
<figcaption>Slide 65</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2736s">Timestamp: 45:36</a>)</p>
<p>The speaker introduces <strong>Shapley Values</strong> as the modern standard. Originating from Game Theory (and Nobel Prize-winning economics), this method provides a mathematically sound way to attribute the “marginal effect” of features to a prediction.</p>
</section>
<section id="shapley-values-metaphor-pushing-a-car" class="level3">
<h3 class="anchored" data-anchor-id="shapley-values-metaphor-pushing-a-car">66. Shapley Values Metaphor: Pushing a Car</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_66.png" class="img-fluid figure-img"></p>
<figcaption>Slide 66</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2760s">Timestamp: 46:00</a>)</p>
<p>To explain the concept, the speaker uses a metaphor: <strong>Pushing a car stuck in the snow</strong>. It’s a cooperative game. Several people (features) are pushing to achieve an outcome (moving the car/making a prediction).</p>
<p>The goal is to determine how much each person contributed. Did the teenager actually push, or just stand there?</p>
</section>
<section id="intuition-of-shapley-values" class="level3">
<h3 class="anchored" data-anchor-id="intuition-of-shapley-values">67. Intuition of Shapley Values</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_67.png" class="img-fluid figure-img"></p>
<figcaption>Slide 67</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2845s">Timestamp: 47:25</a>)</p>
<p>The speaker expands the metaphor. If “The Rock” joins the pushing, he might only need to add a small amount of force (10 units) to get the car moving because the others are already pushing.</p>
<p>However, if The Rock was pushing alone, he would contribute much more. Shapley values calculate the average contribution across all possible “coalitions” (combinations of people pushing).</p>
</section>
<section id="calculating-average-contribution" class="level3">
<h3 class="anchored" data-anchor-id="calculating-average-contribution">68. Calculating Average Contribution</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_68.png" class="img-fluid figure-img"></p>
<figcaption>Slide 68</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2810s">Timestamp: 46:50</a>)</p>
<p>The slide visually represents different scenarios (orders of arrival). The contribution of a person depends on who is already there. Shapley values “unpack” this by averaging the marginal contribution of a feature across all possible permutations of features.</p>
</section>
<section id="calculating-shapley-values-subsets" class="level3">
<h3 class="anchored" data-anchor-id="calculating-shapley-values-subsets">69. Calculating Shapley Values: Subsets</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_69.png" class="img-fluid figure-img"></p>
<figcaption>Slide 69</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2923s">Timestamp: 48:43</a>)</p>
<p>Mathematically, this means looking at all possible subsets of features. The slide lists the combinations (A alone, B alone, A+B, etc.) and the model output (“Force”) for each.</p>
</section>
<section id="calculating-shapley-values-marginal-contributions" class="level3">
<h3 class="anchored" data-anchor-id="calculating-shapley-values-marginal-contributions">70. Calculating Shapley Values: Marginal Contributions</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_70.png" class="img-fluid figure-img"></p>
<figcaption>Slide 70</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2931s">Timestamp: 48:51</a>)</p>
<p>By comparing the output of a subset <em>with</em> a feature to the subset <em>without</em> it, we find the <strong>marginal contribution</strong> for that specific scenario.</p>
</section>
<section id="calculating-shapley-values-the-average" class="level3">
<h3 class="anchored" data-anchor-id="calculating-shapley-values-the-average">71. Calculating Shapley Values: The Average</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_71.png" class="img-fluid figure-img"></p>
<figcaption>Slide 71</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2934s">Timestamp: 48:54</a>)</p>
<p>The final Shapley value is the <strong>average</strong> of these marginal contributions. This provides a fair distribution of credit among the features that sums up to the total prediction.</p>
</section>
<section id="shapley-values-formula" class="level3">
<h3 class="anchored" data-anchor-id="shapley-values-formula">72. Shapley Values Formula</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_72.png" class="img-fluid figure-img"></p>
<figcaption>Slide 72</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2920s">Timestamp: 48:40</a>)</p>
<p>The slide presents the formal mathematical formula. It is defined as the “average marginal contribution of a feature with respect to all subsets of other features.” While complex, it guarantees unique properties like consistency that LIME lacks.</p>
</section>
<section id="shapley-values-for-feature-attribution" class="level3">
<h3 class="anchored" data-anchor-id="shapley-values-for-feature-attribution">73. Shapley Values for Feature Attribution</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_73.png" class="img-fluid figure-img"></p>
<figcaption>Slide 73</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=2945s">Timestamp: 49:05</a>)</p>
<p>Applying this to Machine Learning: The “Game” is the prediction task. The “Players” are the features. The “Payout” is the prediction score.</p>
<p>The slide shows a Boston Housing prediction. The Shapley values tell us that for this specific house, the “LSTAT” feature pushed the price down, while “RM” (rooms) pushed it up, relative to the average house price.</p>
</section>
<section id="so-many-methods-for-shapley-values" class="level3">
<h3 class="anchored" data-anchor-id="so-many-methods-for-shapley-values">74. So Many Methods for Shapley Values</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_74.png" class="img-fluid figure-img"></p>
<figcaption>Slide 74</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=3035s">Timestamp: 50:35</a>)</p>
<p>The speaker notes that calculating exact Shapley values is computationally expensive (2^N combinations). Therefore, many approximation methods exist. The slide lists implementations in R (<code>iml</code>, <code>fastshap</code>) and Python (<code>shap</code>).</p>
</section>
<section id="calculating-shapley-values---linear-model" class="level3">
<h3 class="anchored" data-anchor-id="calculating-shapley-values---linear-model">75. Calculating Shapley Values - Linear Model</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_75.png" class="img-fluid figure-img"></p>
<figcaption>Slide 75</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=3010s">Timestamp: 50:10</a>)</p>
<p>For a simple <strong>Linear Model</strong>, Shapley values are easy to calculate. Because features in a linear model are additive and independent (conceptually), the coefficient * value roughly equals the contribution.</p>
</section>
<section id="linear-model-example" class="level3">
<h3 class="anchored" data-anchor-id="linear-model-example">76. Linear Model Example</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_76.png" class="img-fluid figure-img"></p>
<figcaption>Slide 76</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=3012s">Timestamp: 50:12</a>)</p>
<p>The slide shows that if you change the Age, the prediction changes by a specific amount. In linear models, the difference between the prediction and the baseline is simply the sum of these changes.</p>
</section>
<section id="simple-to-get-shapley-values-for-linear-model" class="level3">
<h3 class="anchored" data-anchor-id="simple-to-get-shapley-values-for-linear-model">77. Simple to Get Shapley Values for Linear Model</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_77.png" class="img-fluid figure-img"></p>
<figcaption>Slide 77</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=3015s">Timestamp: 50:15</a>)</p>
<p>This reinforces that for linear models, we don’t need complex approximations. The structure of the model allows for exact calculation easily.</p>
</section>
<section id="shapley-values-for-trees-tree-shap" class="level3">
<h3 class="anchored" data-anchor-id="shapley-values-for-trees-tree-shap">78. Shapley Values for Trees: Tree Shap</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_78.png" class="img-fluid figure-img"></p>
<figcaption>Slide 78</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=3045s">Timestamp: 50:45</a>)</p>
<p>For tree-based models (Random Forest, XGBoost, LightGBM), there is a specific, fast algorithm called <strong>Tree SHAP</strong> (developed by Scott Lundberg). It computes exact Shapley values in polynomial time by leveraging the tree structure, making it feasible for large models.</p>
</section>
<section id="tree-shap-calculation" class="level3">
<h3 class="anchored" data-anchor-id="tree-shap-calculation">79. Tree Shap Calculation</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_79.png" class="img-fluid figure-img"></p>
<figcaption>Slide 79</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=3045s">Timestamp: 50:45</a>)</p>
<p>This slide visualizes how Tree SHAP works by tracing paths down the decision tree to calculate expectations. This efficiency is why SHAP has become the industry standard for boosting models.</p>
</section>
<section id="approximating-shapley-values" class="level3">
<h3 class="anchored" data-anchor-id="approximating-shapley-values">80. Approximating Shapley Values</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_80.png" class="img-fluid figure-img"></p>
<figcaption>Slide 80</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=3055s">Timestamp: 50:55</a>)</p>
<p>For other “Black Box” models (like Neural Networks or SVMs) where exact calculation is intractable due to the combinatorial explosion (100 features = impossible to compute all subsets), we must use approximations.</p>
</section>
<section id="approximating-shapley-values-strumbelj" class="level3">
<h3 class="anchored" data-anchor-id="approximating-shapley-values-strumbelj">81. Approximating Shapley Values: Strumbelj</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_81.png" class="img-fluid figure-img"></p>
<figcaption>Slide 81</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=3064s">Timestamp: 51:04</a>)</p>
<p>One method is <strong>Strumbelj’s algorithm</strong>, a sampling-based approach. It uses Monte Carlo sampling to estimate the difference between predictions with and without a feature, approximating the average marginal contribution.</p>
</section>
<section id="strumbelj-visualization" class="level3">
<h3 class="anchored" data-anchor-id="strumbelj-visualization">82. Strumbelj Visualization</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_82.png" class="img-fluid figure-img"></p>
<figcaption>Slide 82</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=3064s">Timestamp: 51:04</a>)</p>
<p>The slide visualizes the sampling process: creating synthetic instances by mixing the feature of interest with random values from the dataset to estimate its effect.</p>
</section>
<section id="approximating-shapley-values-shap-kernel" class="level3">
<h3 class="anchored" data-anchor-id="approximating-shapley-values-shap-kernel">83. Approximating Shapley Values: Shap Kernel</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_83.png" class="img-fluid figure-img"></p>
<figcaption>Slide 83</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=3064s">Timestamp: 51:04</a>)</p>
<p><strong>Kernel SHAP</strong> is introduced as a model-agnostic method. It connects LIME and Shapley values. It uses a weighted linear regression (like LIME) but uses specific “Shapley weights” to ensure the result is a valid Shapley value approximation.</p>
</section>
<section id="shap-kernel-generating-data-1" class="level3">
<h3 class="anchored" data-anchor-id="shap-kernel-generating-data-1">84. Shap Kernel: Generating Data (1)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_84.png" class="img-fluid figure-img"></p>
<figcaption>Slide 84</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=3075s">Timestamp: 51:15</a>)</p>
<p><em>Note: The speaker skips detailed explanations of these calculation slides due to time constraints, but the slides detail the technical steps.</em></p>
<p>This slide shows the setup for Kernel SHAP, defining a “background dataset” to serve as the reference value for “missing” features.</p>
</section>
<section id="shap-kernel-generating-data-2" class="level3">
<h3 class="anchored" data-anchor-id="shap-kernel-generating-data-2">85. Shap Kernel: Generating Data (2)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_85.png" class="img-fluid figure-img"></p>
<figcaption>Slide 85</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=3075s">Timestamp: 51:15</a>)</p>
<p>The method involves treating features as “missing” by replacing them with background values to simulate their absence from a coalition.</p>
</section>
<section id="shap-kernel-generating-data-3" class="level3">
<h3 class="anchored" data-anchor-id="shap-kernel-generating-data-3">86. Shap Kernel: Generating Data (3)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_86.png" class="img-fluid figure-img"></p>
<figcaption>Slide 86</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=3075s">Timestamp: 51:15</a>)</p>
<p>Permutations of feature coalitions are generated to create a synthetic dataset for the local regression.</p>
</section>
<section id="shap-kernel-generating-data-4" class="level3">
<h3 class="anchored" data-anchor-id="shap-kernel-generating-data-4">87. Shap Kernel: Generating Data (4)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_87.png" class="img-fluid figure-img"></p>
<figcaption>Slide 87</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=3075s">Timestamp: 51:15</a>)</p>
<p>A linear model is fit to this synthetic data. The coefficients of this linear model, when weighted correctly, correspond to the Shapley values.</p>
</section>
<section id="shap-kernel-generating-data-5" class="level3">
<h3 class="anchored" data-anchor-id="shap-kernel-generating-data-5">88. Shap Kernel: Generating Data (5)</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_88.png" class="img-fluid figure-img"></p>
<figcaption>Slide 88</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=3075s">Timestamp: 51:15</a>)</p>
<p>The result is the attribution value for the specific prediction.</p>
</section>
<section id="mimic-shap" class="level3">
<h3 class="anchored" data-anchor-id="mimic-shap">89. Mimic Shap</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_89.png" class="img-fluid figure-img"></p>
<figcaption>Slide 89</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=3075s">Timestamp: 51:15</a>)</p>
<p><strong>Mimic SHAP</strong> is another approximation where a global surrogate model (like a Gradient Boosted Tree) is trained to mimic the black box, and then Tree SHAP is used on the surrogate.</p>
</section>
<section id="gradient-shap" class="level3">
<h3 class="anchored" data-anchor-id="gradient-shap">90. Gradient Shap</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_90.png" class="img-fluid figure-img"></p>
<figcaption>Slide 90</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=3075s">Timestamp: 51:15</a>)</p>
<p><strong>Gradient SHAP</strong> is designed for Deep Learning models (differentiable models). It combines Integrated Gradients with Shapley values for efficient computation in neural networks.</p>
</section>
<section id="gkmexplain" class="level3">
<h3 class="anchored" data-anchor-id="gkmexplain">91. GkmExplain</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_91.png" class="img-fluid figure-img"></p>
<figcaption>Slide 91</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=3075s">Timestamp: 51:15</a>)</p>
<p>A specialized method for non-linear Support Vector Machines (SVMs).</p>
</section>
<section id="dasp" class="level3">
<h3 class="anchored" data-anchor-id="dasp">92. DASP</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_92.png" class="img-fluid figure-img"></p>
<figcaption>Slide 92</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=3075s">Timestamp: 51:15</a>)</p>
<p><strong>DASP</strong> is a polynomial-time algorithm for approximating Shapley values specifically in Deep Neural Networks.</p>
</section>
<section id="aggregating-shapley-values-feature-importance" class="level3">
<h3 class="anchored" data-anchor-id="aggregating-shapley-values-feature-importance">93. Aggregating Shapley Values: Feature Importance</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_93.png" class="img-fluid figure-img"></p>
<figcaption>Slide 93</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=3092s">Timestamp: 51:32</a>)</p>
<p>The speaker returns to practical applications. Once you have local SHAP values for every prediction, you can aggregate them.</p>
<p>By summing the <strong>absolute</strong> SHAP values across all data points, you get a global <strong>Feature Importance</strong> plot. This tells you which features are most important overall, derived directly from the local explanations.</p>
</section>
<section id="aggregating-shapley-values-feature-interactions" class="level3">
<h3 class="anchored" data-anchor-id="aggregating-shapley-values-feature-interactions">94. Aggregating Shapley Values: Feature Interactions</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_94.png" class="img-fluid figure-img"></p>
<figcaption>Slide 94</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=3106s">Timestamp: 51:46</a>)</p>
<p>SHAP can also quantify <strong>Interactions</strong>. The slide shows the interaction between Age and Sex. It reveals that for males, a certain age range increases risk (prediction), whereas for females, it might be different.</p>
<p>This allows data scientists to see exactly how features modify each other’s effects, solving the problem of hidden interactions in complex models.</p>
</section>
<section id="aggregating-shapley-values-feature-selection" class="level3">
<h3 class="anchored" data-anchor-id="aggregating-shapley-values-feature-selection">95. Aggregating Shapley Values: Feature Selection</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_95.png" class="img-fluid figure-img"></p>
<figcaption>Slide 95</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=3144s">Timestamp: 52:24</a>)</p>
<p>SHAP values can be used for <strong>Feature Selection</strong>. By ranking features by their mean absolute SHAP value, you can identify the top contributors and remove noise variables, potentially simplifying the model without losing accuracy.</p>
</section>
<section id="aggregating-shapley-values-supervised-clustering" class="level3">
<h3 class="anchored" data-anchor-id="aggregating-shapley-values-supervised-clustering">96. Aggregating Shapley Values: Supervised Clustering</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_96.png" class="img-fluid figure-img"></p>
<figcaption>Slide 96</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=3165s">Timestamp: 52:45</a>)</p>
<p>A “cool advanced technique” is <strong>Explanation Clustering</strong> (Supervised Clustering). Instead of clustering the raw data, you cluster the <em>explanations</em> (the SHAP values).</p>
<p>This groups data points not by their raw values, but by <em>why</em> the model made a prediction for them. This can reveal distinct subpopulations or “reasons” for high risk (e.g., a group of high-risk dragons due to age vs.&nbsp;a group due to weight).</p>
</section>
<section id="model-agnostic-explanation-tools-summary" class="level3">
<h3 class="anchored" data-anchor-id="model-agnostic-explanation-tools-summary">97. Model Agnostic Explanation Tools Summary</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_97.png" class="img-fluid figure-img"></p>
<figcaption>Slide 97</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=3206s">Timestamp: 53:26</a>)</p>
<p>The presentation wraps up by reviewing the three key tools covered: 1. <strong>Feature Importance</strong> (Permutation based) 2. <strong>Partial Dependence</strong> (for directionality) 3. <strong>Prediction Explanations</strong> (Shapley Values)</p>
<p>The speaker encourages the audience to use these tools to build trust and understanding in their machine learning workflows.</p>
</section>
<section id="question-time" class="level3">
<h3 class="anchored" data-anchor-id="question-time">98. Question Time</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_98.png" class="img-fluid figure-img"></p>
<figcaption>Slide 98</figcaption>
</figure>
</div>
<p>(<a href="https://youtu.be/ZRckw_fE56Q&amp;t=3214s">Timestamp: 53:34</a>)</p>
<p>The final slide opens the floor for questions and provides contact information. The speaker mentions that the slides and notebooks (including the age/milk and LIME examples) are available on his GitHub for those who want to explore the code.</p>
<hr>
<p><em>This annotated presentation was generated from the talk using AI-assisted tools. Each slide includes timestamps and detailed explanations.</em></p>


</section>
</section>

 ]]></description>
  <category>Interpretability</category>
  <category>Explainability</category>
  <category>Machine Learning</category>
  <category>XAI</category>
  <category>Annotated Talk</category>
  <guid>https://rajivshah.com/blog/model-interpretability-explainability.html</guid>
  <pubDate>Wed, 15 Apr 2020 05:00:00 GMT</pubDate>
  <media:content url="https://rajivshah.com/blog/images/model-interpretability-explainability/slide_1.png" medium="image" type="image/png"/>
</item>
<item>
  <title>Stand Up for Best Practices</title>
  <link>https://rajivshah.com/blog/standup.html</link>
  <description><![CDATA[ 






<p><img src="https://cdn-images-1.medium.com/max/1600/1*jL9fT-oAR6Ki3HOvXpwMLQ.png" class="img-fluid" alt="img"> Source: Yuriy Guts selection from Shutterstock</p>
<section id="stand-up-for-best-practices" class="level3">
<h3 class="anchored" data-anchor-id="stand-up-for-best-practices"><strong>Stand Up for Best Practices:</strong></h3>
</section>
<section id="misuse-of-deep-learning-in-natures-earthquake-aftershock-paper" class="level3">
<h3 class="anchored" data-anchor-id="misuse-of-deep-learning-in-natures-earthquake-aftershock-paper"><strong>Misuse of Deep Learning in Nature’s Earthquake Aftershock Paper</strong></h3>
</section>
<section id="the-dangers-of-machine-learning-hype" class="level3">
<h3 class="anchored" data-anchor-id="the-dangers-of-machine-learning-hype">The Dangers of Machine Learning Hype</h3>
<p>Practitioners of AI, machine learning, predictive modeling, and data science have grown enormously over the last few years. What was once a niche field defined by its blend of knowledge is becoming a rapidly growing profession. As the excitement around AI continues to grow, the new wave of ML augmentation, automation, and GUI tools will lead to even more growth in the number of people trying to build predictive models.</p>
<p>But here’s the rub: While it becomes easier to use the tools of predictive modeling, predictive modeling knowledge is not yet a widespread commodity. Errors can be counterintuitive and subtle, and they can easily lead you to the wrong conclusions if you’re not careful.</p>
<p>I’m a data scientist who works with dozens of expert data science teams for a living. In my day job, I see these teams striving to build high-quality models. The best teams work together to review their models to detect problems. There are many hard-to-detect-ways that lead to problematic models (say, by allowing <a href="https://www.datarobot.com/wiki/target-leakage/">target leakage</a> into their training data).</p>
<p>Identifying issues is not fun. This requires admitting that exciting results are “too good to be true” or that their methods were not the right approach. In other words, <strong>it’s less about the sexy data science hype that gets headlines and more about a rigorous scientific discipline</strong>.</p>
</section>
<section id="bad-methods-create-bad-results" class="level3">
<h3 class="anchored" data-anchor-id="bad-methods-create-bad-results">Bad Methods Create Bad Results</h3>
<p>Almost a year ago, I read an article in Nature that claimed unprecedented accuracy in <a href="https://www.nature.com/articles/s41586-018-0438-y">predicting earthquake aftershocks by using deep learning</a>. Reading the article, my internal radar became deeply suspicious of their results. <strong>Their methods simply didn’t carry many of the hallmarks of careful predicting modeling.</strong></p>
<p>I started to dig deeper. In the meantime, this article blew up and became <a href="https://blog.google/technology/ai/forecasting-earthquake-aftershock-locations-ai-assisted-science/">widely recognized</a>! It was even included in the <a href="https://medium.com/tensorflow/whats-coming-in-tensorflow-2-0-d3663832e9b8">release notes for Tensorflow</a> as an example of what deep learning could do. However, in my digging, I found major flaws in the paper. Namely, data leakage which leads to unrealistic accuracy scores and a lack of attention to model selection (you don’t build a 6 layer neural network when a simpler model provides the same level of accuracy).</p>
<p><img src="https://cdn-images-1.medium.com/max/1600/1*CPPVFzHd4GXlBSI4EILWZw.png" class="img-fluid" alt="img">The testing dataset had a much higher AUC than the training set . . . this is not normal</p>
<p>To my earlier point: these are subtle, <strong>but incredibly basic</strong> predictive modeling errors that can invalidate the entire results of an experiment. Data scientists are trained to recognize and avoid these issues in their work. I assumed that this was simply overlooked by the author, so I contacted her and let her know so that she could improve her analysis. Although we had previously communicated, she did not respond to my email over concerns with the paper.</p>
</section>
<section id="falling-on-deaf-ears" class="level3">
<h3 class="anchored" data-anchor-id="falling-on-deaf-ears">Falling On Deaf Ears</h3>
<p>So, what was I to do? My coworkers told me to just tweet it and let it go, but I wanted to stand up for good modeling practices. I thought reason and best practices would prevail, so I started a 6-month process of writing up my results and shared them with Nature.</p>
<p>Upon sharing my results, I received a note from Nature in January 2019 that despite serious concerns about data leakage and model selection that invalidate their experiment, they saw no need to correct the errors, because “<strong>Devries et al.&nbsp;are concerned primarily with using machine learning as [a] tool to extract insight into the natural world, and not with details of the algorithm design</strong>”. The authors provided a much harsher response.</p>
<p>You can read the entire exchange <a href="https://github.com/rajshah4/aftershocks_issues">on my github</a>.</p>
<p>It’s not enough to say that I was disappointed. This was a major paper (<em>it’s Nature!</em>) that bought into AI hype and published a paper despite it using flawed methods.</p>
<p>Then, just this week, I ran across <a href="https://link.springer.com/chapter/10.1007/978-3-030-20521-8_1">articles by Arnaud Mignan and Marco Broccardo</a> on <a href="https://arxiv.org/abs/1904.01983">shortcomings</a> that they found in the aftershocks article. Here are two more data scientists with expertise in earthquake analysis who also noticed flaws in the paper. I also have placed my analysis and reproducible code <a href="https://github.com/rajshah4/aftershocks_issues">on github</a>.</p>
<p><img src="https://cdn-images-1.medium.com/max/1600/1*Op19T2cR7gG60fbQLWS5cA.png" class="img-fluid" alt="img">Go run the analysis yourself and see the issue</p>
</section>
<section id="standing-up-for-predictive-modeling-methods" class="level3">
<h3 class="anchored" data-anchor-id="standing-up-for-predictive-modeling-methods">Standing Up For Predictive Modeling Methods</h3>
<p>I want to make it clear: my goal is not to villainize the authors of the aftershocks paper. I don’t believe that they were malicious, and I think that they would argue their goal was to just show how machine learning could be applied to aftershocks. Devries is an accomplished earthquake scientist who wanted to use the latest methods for her field of study and found exciting results from it.</p>
<p>But here’s the problem: their insights and results were based on fundamentally flawed methods. It’s not enough to say, “This isn’t a machine learning paper, it’s an earthquake paper.” <strong>If you use predictive modeling, then the quality of your results are determined by the quality of your modeling.</strong> Your work becomes data science work, and you are on the hook for your scientific rigor.</p>
<p>There is a huge appetite for papers that use the latest technologies and approaches. It becomes very difficult to push back on these papers.</p>
<p><strong>But if we allow papers or projects with fundamental issues to advance, it hurts all of us. It undermines the field of predictive modeling.</strong></p>
<p>Please push back on bad data science. Report bad findings to papers. And if they don’t take action, go to twitter, post about it, share your results and make noise. This type of collective action worked to raise awareness of p-values and combat the epidemic of p-hacking. We need good machine learning practices if we want our field to continue to grow and maintain credibility.</p>
<p><strong>Acknowledgments:</strong> I want to thank all the great data scientists at <a href="http://www.datarobot.com">DataRobot</a> that collaborated and supported me this past year, a few of these include: Lukas Innig, Amanda Schierz, Jett Oristaglio, Thomas Stearns, and Taylor Larkin.</p>
<p><strong>This article was orignally posted on <a href="https://towardsdatascience.com/stand-up-for-best-practices-8a8433d3e0e8">Medium</a> and featured on <a href="https://www.reddit.com/r/MachineLearning/comments/c4ylga/d_misuse_of_deep_learning_in_nature_journals/">Reddit</a></strong></p>


</section>

 ]]></description>
  <category>Leakage</category>
  <category>Earthquake</category>
  <guid>https://rajivshah.com/blog/standup.html</guid>
  <pubDate>Thu, 15 Aug 2019 05:00:00 GMT</pubDate>
  <media:content url="https://cdn-images-1.medium.com/max/1600/1*jL9fT-oAR6Ki3HOvXpwMLQ.png" medium="image" type="image/png"/>
</item>
<item>
  <title>Optimization Strategies</title>
  <link>https://rajivshah.com/blog/optimization.html</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/fanduel2.jpg" class="img-fluid figure-img"></p>
<figcaption>fanduel</figcaption>
</figure>
</div>
<section id="introduction" class="level3">
<h3 class="anchored" data-anchor-id="introduction">Introduction</h3>
<p>As a data scientist, you spend a lot of your time helping to make better decisions. You build predictive models to provide improved insights. You might be predicting whether an image is a cat or dog, store sales for the next month, or the likelihood if a part will fail. In this post, I won’t help you with making better predictions, but instead how to make the <strong>best</strong> decision.</p>
<p>The post strives to give you some background on optimization. It starts with a simply toy example show you the math behind an optimization calculation. After that, this post tackles a more sophisticated optimization problem, trying to pick the best team for fantasy football. The FanDuel image below is a very common sort of game that is widely played (ask your inlaws). The optimization strategies in this post were shown to consistently win! Along the way, I will show a few code snippets and provide links to working code in R, Python, and Julia. And if you do win money, feel free to share it :)</p>
</section>
<section id="simple-optimization-example" class="level3">
<h3 class="anchored" data-anchor-id="simple-optimization-example">Simple Optimization Example</h3>
<p>A simple example, which I found <a href="http://melaniewingard.weebly.com/uploads/3/7/5/5/37554047/09-30-16_section_3.5_linear_programming_and_optimization_continued.pdf">online</a>, starts with a carpenter making bookcases in two sizes, large and small. It takes 6 hours to make a large bookcase and 2 hours to make a small one. The profit on a large bookcase is $50, and the profit on a small bookcase is $20. The carpenter can spend only 24 hours per week making bookcases and must make at least 2 of each size per week. Your job as a data scientist is to help your carpenter maximize her revenue.</p>
<p>Your initial inclination could be that since the large bookcase is the most profitable, why not focus on them. In that case, you would profit (2*$20) + (3*$50) which is $190. That is a pretty good baseline, but not the best possible answer. It is time to get the algebra out and create equations that define the problem. First, we start with the constraints:</p>
<pre><code>x&gt;=2    ## large bookcases

y&gt;=2    ## small bookcases

6x + 2y &lt;= 24  (labor constraint)</code></pre>
<p>Our objective function which we are trying to maximize is:</p>
<pre><code>P = 50x + 20y</code></pre>
<p>If we do the algebra by hand, we can convert out constraints to <code>y &lt;= 12 - 3x</code>. Then we graph all the constraints and find the feasible area for the portion of making small and large bookcases:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/graph.png" class="img-fluid figure-img"></p>
<figcaption>graph</figcaption>
</figure>
</div>
<p>The next step is figuring out the optimal point. Using the corner-point principle of linear programming, the maximum and minimum values of the objective function each occur at one of the vertices of the feasible region. Looking here, the maximum values (2,6) is when we make 2 large bookcases and 6 small bookcases, which results in an income of $220.</p>
<p>This is a very simple toy problem, typically there are many more constraints and the objective functions can get complicated. There are lots of classic problems in optimization such as routing algorithms to find the best path, scheduling algorithms to optimize staffing, or trying to find the best way to allocate a group of people to set of tasks. As a data scientist, you need to dissect what you are trying to maximize and identify the constraints in the form of equations. Once you can do this, we can hand this over to a computer to solve. So lets next walk through a bit more complicated example.</p>
</section>
<section id="fantasy-football" class="level3">
<h3 class="anchored" data-anchor-id="fantasy-football">Fantasy Football</h3>
<p>Over the last few years, fantasy sports have increasingly grown in popularity. One game is to pick a set of football players to make the best possible team. Each football player has a price and there is a salary cap limit. The challenge is to optimize your team to produce the highest total points while staying within a salary cap limit. This type of optimization problem is known as the knapsack problem or an assignment problem.</p>
</section>
<section id="simple-linear-optimization" class="level3">
<h3 class="anchored" data-anchor-id="simple-linear-optimization">Simple Linear Optimization</h3>
<p>So for this problem, let’s start by loading a dataset and taking a look at the raw data. You need to know both the salary as well as the expected points. Most football fans spend a lot of time trying to predict how many points a player will score. If you want to build a model for predicting the expected performance of a player, take a look at Ben’s blog post.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/qbs.png" class="img-fluid figure-img"></p>
<figcaption>QB points</figcaption>
</figure>
</div>
<p>The goal here is to build the best possible team for a salary cap, let’s say $50,000. A team consists of a quarterback, running backs, wide receivers, tight ends, and a defense. We can use the <code>lpSolve</code> package in R to set up the problem. Here is a code snippet for setting up the constraints.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/constraints.png" class="img-fluid figure-img"></p>
<figcaption>constraints</figcaption>
</figure>
</div>
<p>If you parse through this, you can see we have set a minimum and maximum for QB of 1 player. However, for the RB, we have allowed a maximum of 3 and a minimum of 2. This is not unusual in fantasy football, be because there is a role called a flex player, which anyone can choose and they can either be a RB, WR, or TE. Now let’s look at the code for the objective:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/objective.png" class="img-fluid figure-img"></p>
<figcaption>objective</figcaption>
</figure>
</div>
<p>The code shows that we have set up the problem to maximize the objective of the most points and include our constraints. Once the code is run, it outputs an optimal team! I forked an existing repo and have made the R code and dataset are <a href="https://github.com/rajshah4/linear-optimization-fantasy-football">available here.</a> A more sophisticated <a href="https://github.com/mattbrondum/Fantasy-Football-Optimization">python</a> optimization repo is also available.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://rajivshah.com/blog/images/finalteam.png" class="img-fluid figure-img"></p>
<figcaption>finalteam</figcaption>
</figure>
</div>
</section>
<section id="advanced-steps" class="level3">
<h3 class="anchored" data-anchor-id="advanced-steps">Advanced steps</h3>
<p>So far, we have built a very simple optimization to solve the problem. There are several other strategies to further improve the optimizer. First, the variance of our teams can be increased by using a strategy called <strong>stacking</strong>, where you make sure your QB and WR are on the same team. A simple optimization is a constraint for selecting a QB and WR from the same team. Another strategy is using an <strong>overlap</strong> constraint for selecting multiple lineups. An overlap constraint ensures a diversity of players and not the same set of players for each optimized team. This strategy is particularly effective when submitting multiple lineups. You can read more about these <a href="https://arxiv.org/pdf/1604.01455v2.pdf">strategies here</a> and run the code in Julia <a href="https://github.com/dscotthunter/Fantasy-Hockey-IP-Code">here</a>. An code snippet of the stacking constraint (this is for a hockey optimization):</p>
<p><img src="https://rajivshah.com/blog/images/goalie.png" class="img-fluid" alt="goalie">.</p>
<p>Last year, at Sloan sports conference, <a href="http://www.sloansportsconference.com/wp-content/uploads/2018/02/1001.pdf">Haugh and Sighal</a> , presented a paper with additional optimization constraints. They include what an <strong>opponents team</strong> is likely to look like. After all, there are some players that are much more popular. Using this knowledge, you can predict the likely teams that will oppose your team. The approach here used Dirichlet regressions for modeling players. The result was a much-improved optimizer that was capable of consistently winning!</p>
<p>I hope this post has shown you how optimization strategies can help you find the best possible solution.</p>
<p>​</p>


</section>

 ]]></description>
  <category>Optimization</category>
  <category>sport</category>
  <guid>https://rajivshah.com/blog/optimization.html</guid>
  <pubDate>Mon, 30 Jul 2018 05:00:00 GMT</pubDate>
  <media:content url="https://rajivshah.com/blog/images/fanduel2.jpg" medium="image" type="image/jpeg"/>
</item>
</channel>
</rss>
