Practical guide for scaling vLLM under OOM and instability
New guide emphasizes workload profiling and tuned configurations over raw hardware for successfully scaling vLLM in production environments.
My personalized AI news feed, curated from newsletters and deduplicated automatically.
Powered by OpenHands + Claude Sonnet 4 • Updated every 30 minutes
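The vLLM scaling story above centers on profiling and tuned configuration rather than raw hardware. A minimal sketch of the kind of knobs such tuning usually means, expressed as plain engine arguments — the option names are real vLLM parameters, but every value here is an illustrative assumption, not a recommendation from the guide:

```python
# Illustrative vLLM engine arguments for taming OOM under load.
# All values are assumptions for the example, not tuned recommendations.
engine_args = {
    "gpu_memory_utilization": 0.85,   # leave headroom below the 0.90 default
    "max_model_len": 8192,            # cap context length to bound KV-cache size
    "max_num_seqs": 64,               # limit concurrent sequences per step
    "swap_space": 8,                  # GiB of CPU swap for preempted requests
    "enable_chunked_prefill": True,   # smooth memory spikes from long prompts
}

# With vLLM installed this would become:
#   from vllm import LLM
#   llm = LLM(model="<your-model>", **engine_args)
```

The trade is deliberate: lower `gpu_memory_utilization` and `max_num_seqs` reduce peak memory at some throughput cost, which is usually the right direction when the symptom is instability rather than latency.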
Users report llama.cpp entering infinite loops with Qwen models at around 20% context usage, even at higher-precision quantization levels, highlighting how brittle recommended decoding settings are across runtimes.
Gated DeltaNet modules in Qwen 3.5 can avoid KV-cache growth, making it more memory-friendly than Qwen3 under certain operational ratios.
George Bredis introduces NE-Dreamer, exploring world models trained to predict next embeddings rather than reconstruct pixels, arguing reconstruction may be wrong objective for control.
Gordon Wetzstein introduces Mode Seeking meets Mean Seeking (MMM) as a route to long-context, persistent video world models via unified representation.
New paper explores native multimodal models where vision is first-class, treating all modalities Transfusion-style, arguing progress requires vision-native training, not language-first adapters.
BFL introduces Self-Flow, a self-supervised flow-matching approach for multimodal models (image/video/audio/text) that avoids external pretrained models, claiming 2.8× faster convergence and improved temporal consistency.
Cursor announces availability in JetBrains IDEs through Agent Client Protocol, enabling IDE-native access without forcing users to switch tools.
LangChain releases OSS Skills designed to teach coding agents how to effectively use LangChain, LangGraph, and DeepAgents frameworks.
LangChain ships LangSmith Skills and CLI enabling coding agents to natively debug traces, build datasets, and run experiments from the terminal.
New OpenAI repo Symphony provides orchestration layer that polls project boards and spawns agents per ticket lifecycle stage, shifting UX from prompting to workflow-native automation.
VS Code announces shift from monthly release cycle to weekly shipping of main branch to accelerate feature delivery for agent development.
VS Code releases agent hooks, message steering/queueing, integrated agentic browser, and shared memory capabilities designed for production agent workflows.
OpenAI open-sources the Windows-native agent sandbox implementation, supporting native Windows, WSL, PowerShell, CMD, Git Bash terminals with integrated Windows skills.
OpenAI releases Codex for Windows featuring a Windows-native agent sandbox using OS controls (restricted tokens, ACLs, dedicated users) to constrain filesystem and network access.
Alibaba CEO Eddie Wu held emergency meeting where Qwen team challenged leadership on restructuring and compute allocation, with Cloud CTO acknowledging external customers had smoother compute access than internal teams.
Analysis shows Qwen is the #1 open model in 2025-2026 HuggingFace papers, used in 41% of 7,692 papers overall and in roughly 50% of May 2025 papers, highlighting ecosystem dependence on a small core team.
Nat Lambert argues open-weight frontier efforts may concentrate into only non-profits, NVIDIA (hardware pull-through), and Meta (commoditize complements), making corporate misalignment structurally likely.
ByteDance paper describes agentic RL system that writes CUDA kernels in secure test environments, optimizing for speedups with claims of up to 100% faster components versus traditional tools.
Tanishq Kumar introduces SSD, claiming up to 2× faster inference than leading engines (vLLM, SGLang), collaborating with Tri Dao and Avner May using asynchronous machine techniques.
Meme circulates claiming ChatGPT uninstalls increased 295% following Pentagon deal announcement, though commenters question the validity and sourcing of this statistic.
A long-form essay examining the central tension in AI engineering between relying on powerful models versus sophisticated orchestration systems (harnesses), drawing parallels to finance's 'value of the human vs seat' debate.
Demis Hassabis teased Gemini 3.1 Flash-Lite as incredibly fast and cost-efficient for its performance, framing the model around latency and cost per capability rather than raw frontier scores.
Google's NotebookLM introduces a major new feature for Ultra users that generates bespoke, immersive videos from user sources automatically.
The Information reports GPT-5.4 coming with ~1M token context window and extreme reasoning mode that can think for hours, targeting long-horizon agentic workflows with more frequent monthly updates.
Nat Lambert argues the discussion should shift from Anthropic focusing on code to their lead on general agent behavior, implying coding will commoditize but agent robustness will not.
New MathArena evaluation finds Claude Opus 4.6 strong overall but weak on visual mathematics problems, with the evaluation costing approximately $8,000 to run.
Alibaba's Qwen team lead resigned as the company restructures from vertically integrated teams to horizontal splits across pretraining, post-training, multimodal, and infrastructure.
Apple announces M5 Pro and M5 Max chips with up to 4× faster LLM prompt processing than M4, supporting up to 128GB unified memory at 614GB/s bandwidth and 2× faster SSDs.
Qwen3.5-4B successfully generated a fully functional web-based operating system with games, text editor, audio player, file browser, and customizable wallpaper from one prompt.
Qwen3.5-0.8B model demonstrated running efficiently on 2nd gen i5 processor with 4GB DDR3 RAM using llama.cpp, handling complex topics without requiring GPU acceleration.
Detailed comparison of Q4 quantization methods for Qwen3.5-27B using KL Divergence finds unsloth_Qwen3.5-27B-UD-Q4_K_XL achieves lowest KLD of 0.005087 against BF16 baseline.
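The quantization comparison above ranks quants by KL divergence (KLD) against a BF16 baseline. A minimal sketch of that metric, using toy next-token distributions rather than actual Qwen3.5 logits:

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """KL(P || Q) in nats between two discrete probability distributions."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions: BF16 baseline vs a quantized model.
p_bf16  = [0.70, 0.20, 0.10]
q_quant = [0.68, 0.22, 0.10]

kld = kl_divergence(p_bf16, q_quant)
# A lower KLD, as in the comparison above, means the quant's output
# distribution stays closer to the BF16 reference; 0 means identical.
```

In practice the KLD is averaged over many token positions of a test corpus, which is why reported values like 0.005087 are so small.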
Small 3B active parameter MoE model achieves 37.8% on SWE-bench Verified Hard (nearly matching Claude Opus 4.6's 40%) using simple verify-after-every-edit strategy.
YuanAI Lab releases Yuan 3.0 Ultra, an open multimodal mixture-of-experts model with 1010B total parameters and 68.8B active parameters.
Benchmark testing nonsensical prompt rejection finds only Claude and Qwen 3.5 score above 60%, with reasoning models sometimes rationalizing nonsense instead of rejecting it.
Analysis stresses failures often come from outdated eval rubrics rather than broken prompts, advocating for evals as feedback loops tied to production distribution shift.
Study shows LLM-generated SWE-bench patches are consistently longer and more bloated than human solutions, passing tests but harming human verification and maintenance.
Diagnostic framework finds the choice of retrieval approach causes ~20 percentage points of variance while memory-writing methods only shift results 3-8 points, with raw chunking matching expensive summarization pipelines.
Ian Li explains why diffusion LLMs struggle with parallel token generation due to fully factorized output heads that can't represent full joint distributions, proposes CoDD to break the barrier.
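The claim above is that a fully factorized output head cannot represent the joint distribution over tokens generated in parallel. A toy illustration of the gap, assuming two binary token positions that must agree (my example, not Ian Li's):

```python
import itertools

# Target joint distribution over two token positions: they must match.
# P(0,0) = P(1,1) = 0.5, while mismatched pairs have probability 0.
target = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}

# A factorized head predicts each position independently: P(a,b) = p(a) * q(b).
# To match the target's marginals, both positions must be uniform.
p = [0.5, 0.5]
q = [0.5, 0.5]
factorized = {(a, b): p[a] * q[b]
              for a, b in itertools.product([0, 1], repeat=2)}

# The factorized model leaks 0.25 probability onto each mismatched pair
# that the true joint forbids — no choice of p and q can avoid this.
```

This is exactly why parallel decoding degrades when adjacent tokens are strongly correlated: independence per position is baked into the head's parameterization.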
NVIDIA NIM (NVIDIA Inference Microservices) provides ready-to-use containers that package AI models with inference engines and OpenAI-compatible APIs, reducing deployment time from days to minutes. The containers include GPU optimizations, quantization, and can be deployed self-hosted or cloud-hosted with minimal engineering overhead.
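Because NIM containers expose an OpenAI-compatible API, any OpenAI-style client can talk to them. A stdlib-only sketch of the request shape — the base URL and model name are placeholder assumptions, while `/v1/chat/completions` is the standard OpenAI-compatible path:

```python
import json
import urllib.request

def build_chat_request(base_url, model, prompt):
    """Build an OpenAI-compatible chat-completions request for a NIM endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder endpoint and model id for illustration only.
req = build_chat_request("http://localhost:8000",
                         "meta/llama-3.1-8b-instruct", "Hello")
# urllib.request.urlopen(req) would send it to a running NIM container.
```

The same request works unchanged against vLLM, SGLang, or any other OpenAI-compatible server, which is the point of the compatibility layer.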
The U.S. Treasury, Federal Housing Agency, and State Department became the first offices to move off Anthropic, with Treasury Secretary saying no private company will dictate national security terms.
OpenAI research scientist Aidan McLaughlin publicly stated he doesn't think the company's Pentagon deal was worth it.
AWS lost connectivity at a UAE data center after unidentified objects struck the facility amid the US-Iran conflict, causing major outages for Anthropic's Claude.
Google released Nano Banana 2, a new top-ranked AI image model.
Alibaba released Qwen3.5 Small, a family of four open-source AI models small enough to run on laptops or phones, with the 9B model outscoring OpenAI's GPT-OSS-120B despite being 13x smaller.
A step-by-step guide on using OpenAI's Whisper model locally to transcribe and translate video files for free without uploading to external sites.
Anthropic launched a tool that lets users import their saved preferences and context from ChatGPT, Gemini, or Copilot with a single copy-paste, while also opening Claude's memory feature to free users.
The U.S. Supreme Court declined to hear a case about whether AI-generated art can be copyrighted, letting lower court rulings stand that only humans can be authors.
MyFitnessPal acquired Cal AI, an AI calorie-counting app created by two 19-year-old founders that reached 15M downloads and $30M in annual revenue in under two years.
Apple announced the iPhone 17e at $599, bringing Apple Intelligence features like visual search, AI call screening, and live translation to its most affordable iPhone.
Physical Intelligence's vision-language-action models deployed at Weave for laundry folding and Ultra for e-commerce packaging, handling variability traditional automation couldn't solve.
MIT, WashU, and UCLA researchers model the AGI transition where humans shift from labor to verifying AI agent actions, warning of 'Hollow Economy' risks without proper verification infrastructure.
20 researchers probed AI agents for weeks, uncovering vulnerabilities including unauthorized compliance, infinite loops, and prompt injection attacks in realistic social environments.
Jack Clark and Ezra Klein discuss AI agents' economic impacts and ambitious positive policy ideas for an AI-powered future.
Study finds novices with LLM access were 4.16× more accurate on bioweapon-related tasks, increasing from 5% to 17% accuracy across biosecurity benchmarks.
Benchmark of 100 AI-generated web games shows state-of-the-art models achieving under 10% of human performance while taking 15-20x longer.
Tool uses LLM evolution to automatically optimize code and prompts, achieving state-of-the-art 95% on ARC-AGI-2.
OpenAI signed a Pentagon deal hours after Trump ordered agencies to cut ties with Anthropic over safeguards on mass surveillance and autonomous weapons, claiming similar red lines while facing consumer backlash.
Staff members share AI applications including using Seedance to animate wedding photos and Claude Cowork for fantasy baseball draft planning and research.
Step-by-step guide for setting up Obsidian notes app with Claude Cowork to automatically create daily plans and manage workdays more efficiently.
OpenAI raised $110B at $730B valuation with Amazon leading at $50B alongside Nvidia and SoftBank, marking a notable pivot away from Microsoft-only infrastructure.
AI agent featuring memory capabilities and cross-platform messaging.
Multi-model agent system designed for handling long-running tasks.
CEO Jack Dorsey explicitly cited AI as the reason for laying off over 4,000 of Block's 10,000 employees, sending shares up more than 20%.
OpenAI founding member Andrej Karpathy called recent shifts the end of the era of typing code.
Amazon's David Luan announced his departure after leading the company's Nova Act browser agent and San Francisco AI lab.
Released embedding models powering its search results that outperform Google and Alibaba rivals while cutting storage needs by up to 32x.
A safety audit of Clawdbot (OpenClaw) reveals a 58.9% overall pass rate, with AI agents handling structured tasks reliably but breaking under ambiguity, achieving 0% on intent misunderstanding while scoring 100% on hallucination prevention.
Eric Zhu discusses frameworks and architectures powering modern AI products, covering why early agent frameworks were brittle, lessons around memory and orchestration, and advice for AI builders on careers and avoiding burnout.
Sequential Attention is a feature selection algorithm that combines greedy forward selection performance with differentiable attention mask efficiency through iterative selection. In the linear case, it is proven equivalent to Orthogonal Matching Pursuit and was validated on the 3+ billion example Criteo dataset.
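To make the "greedy forward selection" half of that combination concrete, here is a minimal greedy forward feature-selection loop for least-squares — the plain baseline Sequential Attention aims to match, not the paper's attention-mask formulation:

```python
import numpy as np

def greedy_forward_selection(X, y, k):
    """Select k features one at a time; each step adds the feature that
    most reduces the least-squares residual of the selected subset."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(k):
        best, best_err = None, np.inf
        for j in remaining:
            cols = selected + [j]
            coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            err = np.linalg.norm(y - X[:, cols] @ coef)
            if err < best_err:
                best, best_err = j, err
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data where y depends only on features 0 and 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2]
chosen = greedy_forward_selection(X, y, 2)
```

Each step here refits the model once per candidate feature, which is what makes naive greedy selection expensive; the attention-mask relaxation exists to get similar selections without that inner loop.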
Google Colab quietly adds NVIDIA RTX PRO 6000 instances at ~$0.81/hour, contrasted against A100 high-RAM at ~$7.52 credits/hour, potentially making Colab default cheap pretraining/finetuning playground.
520M-parameter voice model runs fully on-device on RTX and Apple Silicon, using full dialogue history to modulate emotion for real-time, privacy-preserving, emotionally-aware TTS.
LMArena announces gpt-5.3-codex entering the Code Arena leaderboard, expanding model coverage for code evaluation.
LMArena adds Kling-V3-Pro to Video Arena leaderboard where it ties #8 with 1337 score, showing +52pt jump over Kling 2.6 Pro and +48pt over Kling-2.5-turbo-1080p.
LMArena expands Image Arena with 7 new categories for more granular image generation model evaluation, highlighted in video walkthrough.
MLOps @Chipro announces two-part paper clinic on 'Understanding World or Predicting Future? A Comprehensive Survey of World Models' (arXiv:2411.14499), covering JEPA, V-JEPA, Dreamer, Genie, Sora, and World Labs architectures.
Hugging Face contributor releases ARACHNID RL Dataset with 2,831 Atari-style space-shooter gameplay samples for imitation-learning research.
Power users mourn retirement of GPT-5.1 for GPT-5.2, complaining the new version's tone feels condescending and hyper-safe compared to 5.1 'being a delight to work with.'
Google announces intelligent OS with system-level support for AI agents on Android-class devices, alongside Google Labs' new Opal Agent.
Microsoft details how Copilot now converts answers into executable actions, transforming from chat assistant to task runner for multi-step workflows.
Burger King tests BK Assistant pilot—a headset-based OpenAI-powered voice bot called 'Patty' that answers recipe questions and scores employee friendliness by counting phrases across 500 US locations.
Moonshot releases Kimi K2.5 Agent Swarm as web-only feature, with only sub-agents exposed via Kimi CLI for multi-agent workflow orchestration.
Open-source Constrained Orthogonal Differential Attention with Grouped-Query Value-Routed Landmark Banks (CoDA-GQA-L) dramatically reduces KV-cache VRAM with custom fused Triton kernels. 7B Mistral CoDA-GQA-L model released.
NNsight v0.6 releases with 2.4-3.9× faster traces, cleaner errors, vLLM multi-GPU/multi-node support, LLM-friendly docs, and first-class support for VLMs and diffusion models.
Researchers push for native Unsloth support of Logit Fusion, a training scheme that fuses logits from multiple models/checkpoints during training, framed as low-infrastructure way to get ensemble benefits.
Peter Steinberger uses 5-10 AI coding agents to achieve 118 commits/day across 48 repositories in 72 days, creating tools like Peekaboo, Poltergeist, and Oracle to overcome agent limitations.
Codeflash benchmarked 76K lines of Claude Code-generated code, finding 118 functions running up to 446x slower due to naive algorithms and inefficient patterns. LLMs achieve less than 0.23x the speedup of human experts on SWE-fficiency benchmark.
Comparison using identical prompts shows Nano Banana 2 (Gemini 3.1 Flash Image) significantly improves spatial awareness and proportion handling over original Nano Banana model.
Google announces Nano Banana 2 pricing at $0.50 input and $3.00 output, positioned as cost-effective vs Nano Banana Pro ($2.00/$12.00), with January 2025 knowledge cutoff.
Nano Banana 2 appears early in Gemini interface before official announcement, showing loading message and model selection, suggesting staged rollout or testing phase.
Google releases Nano Banana 2 model with professional-grade capabilities, rapid processing, enhanced world knowledge, and improved subject consistency. Users report significant improvements in complex scenarios.
Dario Amodei refuses Pentagon's request to remove Claude AI safety guardrails, emphasizing ethical concerns over unrestricted military access amid government ban threats.
Anthropic rejects Pentagon's final offer due to inadequate safeguards against mass surveillance and autonomous weapons. Pentagon threatens blacklist and potential Defense Production Act invocation.
Onyx's self-hosted LLM leaderboard categorizes models from S to D tier based on Coding, Math, Reasoning, and Efficiency metrics. Recently updated to include Minimax M2.5 model.
LLmFit tool evaluates models based on system RAM, CPU, and GPU capabilities, providing scores for quality, speed, fit, and context. Supports multi-GPU setups, MoE architectures, and dynamic quantization with TUI and CLI modes.
DeepSeek gives Huawei and domestic suppliers early V4 model access for hardware optimization, while major US chipmakers Nvidia and AMD do not receive early access.
Detailed Q4 quantization comparison using KL Divergence shows AesSedai's Q4_K_M achieves lowest KLD of 0.0102 by maintaining certain tensors at Q8_0, while Unsloth's UD-Q4_K_XL shows highest at 0.0524.
Follow-up experiments confirm KV-cache quantization at q8_0 yields a 12-38% throughput increase without quality loss, Q4_K_M remains optimal, and the 35B-A3B MoE runs 10x faster than the 27B dense model on single-GPU setups.
Unsloth releases Qwen3.5-35B-A3B Dynamic GGUFs with 150 KL Divergence benchmarks totaling 9TB, achieving 99.9% KLD on Pareto Frontier for UD-Q4_K_XL and IQ3_XXS. Includes fix for tool-calling chat template bug.
Showing 100 of 100 stories from the last 7 days