<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Tech on the Stack]]></title><description><![CDATA[A weekly dive into building AI-driven systems, event-driven architectures, and real-world engineering strategy. Practical patterns, prototypes, and insights from the edge of applied AI.]]></description><link>https://ai.techonthestack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!anbH!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29f65540-4f6f-491a-b9c5-b4705660e124_500x500.png</url><title>Tech on the Stack</title><link>https://ai.techonthestack.com</link></image><generator>Substack</generator><lastBuildDate>Sun, 19 Apr 2026 12:23:50 GMT</lastBuildDate><atom:link href="https://ai.techonthestack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Luca Bianchi]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[techonthestack@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[techonthestack@substack.com]]></itunes:email><itunes:name><![CDATA[Luca Bianchi]]></itunes:name></itunes:owner><itunes:author><![CDATA[Luca Bianchi]]></itunes:author><googleplay:owner><![CDATA[techonthestack@substack.com]]></googleplay:owner><googleplay:email><![CDATA[techonthestack@substack.com]]></googleplay:email><googleplay:author><![CDATA[Luca Bianchi]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Hello Builders, issue #14]]></title><description><![CDATA[News from the trenches of AI in the week Mar 12-19, 2026]]></description><link>https://ai.techonthestack.com/p/hello-builders-issue-14</link><guid isPermaLink="false">https://ai.techonthestack.com/p/hello-builders-issue-14</guid><dc:creator><![CDATA[Luca Bianchi]]></dc:creator><pubDate>Thu, 19 Mar 2026 23:01:09 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8d3a225d-8147-4832-a95b-61cc878a51a6_420x300.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hello Builders,</p><p>This was the week the AI industry was forced to pick sides. <strong>Anthropic told the Pentagon it wouldn&#8217;t remove safety guardrails</strong> on autonomous weapons and domestic surveillance. The Pentagon responded by designating the company a <strong>national security supply chain risk</strong>, a label usually reserved for foreign adversaries. Anthropic sued. The Trump administration doubled down. Google, OpenAI, and Microsoft filed amicus briefs &#8212; in support of Anthropic. Meanwhile, <strong>OpenAI panicked</strong>: the Wall Street Journal reported executives are cutting projects and deprioritizing entire product lines after <strong>Claude Code triggered a trillion-dollar selloff</strong> in SaaS stocks. <strong>Meta delayed its Avocado model</strong> to May after it underperformed every competitor, considered <strong>licensing Google&#8217;s Gemini</strong>, saw a <strong>rogue AI agent expose sensitive internal data</strong>, and is reportedly planning <strong>20% layoffs</strong>. At GTC, <strong>Jensen Huang projected $1 trillion in chip orders</strong> through 2027. 
And <strong>Yann LeCun</strong>, who left Meta to build what he believes the industry won&#8217;t, raised the <strong>largest seed round in history: $1.03 billion</strong>. The fracture lines aren&#8217;t just widening. They&#8217;re becoming load-bearing walls.</p><h2><strong>This week&#8217;s signal in the noise</strong></h2><ul><li><p><strong>Anthropic defied the Pentagon on autonomous weapons and surveillance; the government blacklisted it as a supply chain risk &#8212; the first time this label has been used against a US AI company</strong></p></li><li><p><strong>OpenAI is cutting projects and deprioritizing product lines as Claude Code and Cowork trigger a trillion-dollar SaaS selloff and executives call it a &#8220;wake-up call&#8221;</strong></p></li><li><p><strong>Meta delays Avocado AI model to May after underperforming GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6; considers licensing Google&#8217;s Gemini to fill the gap</strong></p></li><li><p><strong>Nvidia projects $1 trillion in Blackwell and Vera Rubin orders through 2027, declares the &#8220;inference era&#8221; has arrived at GTC 2026</strong></p></li><li><p><strong>Yann LeCun&#8217;s AMI Labs raises $1.03B seed &#8212; the largest ever &#8212; four months after founding, backed by Nvidia, Samsung, Bezos, and Eric Schmidt</strong></p></li></ul><div><hr></div><h3><strong>1. Anthropic vs. The Pentagon: The Line in the Sand That Could Reshape AI</strong></h3><p>Anthropic signed a $200 million Pentagon contract last summer. During later negotiations, the company set explicit red lines: <strong>no mass surveillance of Americans without judicial oversight, and no autonomous weapons targeting without human authorization</strong>. The Pentagon said a private company shouldn&#8217;t dictate military use. Anthropic held firm. On March 3, Defense Secretary <strong>Pete Hegseth designated Anthropic a supply chain risk</strong>, barring all government contractors from using Claude. Anthropic sued in two courts &#8212; California federal court and a DC appeals court &#8212; calling the designation &#8220;unprecedented and unlawful.&#8221; The Trump administration defended the blacklisting in a March 18 filing, arguing Anthropic&#8217;s refusal was &#8220;conduct, not protected speech.&#8221; Then something remarkable happened: <strong>Google, OpenAI, and Microsoft all filed amicus briefs in support of Anthropic</strong>. OpenAI&#8217;s hardware lead <strong>Caitlin Kalinowski resigned</strong> over the Pentagon deal. ChatGPT uninstalls jumped 295%. Claude surpassed ChatGPT in the App Store. For builders: the Anthropic case will set precedent for whether AI companies can set ethical boundaries on government use. Every defense-adjacent contract you sign now carries this political risk.</p><p>Link: <a href="https://www.reuters.com/legal/government/trump-administration-defends-anthropic-blacklisting-us-court-2026-03-18/">https://www.reuters.com/legal/government/trump-administration-defends-anthropic-blacklisting-us-court-2026-03-18/</a></p><div><hr></div><h3><strong>2. OpenAI Panics as Claude Code Triggers a Trillion-Dollar Selloff</strong></h3><p>The Wall Street Journal reported that <strong>OpenAI is cutting projects and deprioritizing entire product lines</strong>. 
CEO of applications <strong>Fidji Simo</strong> told employees the company is &#8220;actively looking at which areas to deprioritize&#8221; and warned that Claude&#8217;s sudden success should be &#8220;a wake-up call.&#8221; The trigger: <strong>Claude Code and Claude Cowork triggered a trillion-dollar selloff</strong> in SaaS stocks last month, amid fears that agentic AI could make traditional software companies obsolete. OpenAI&#8217;s response is to refocus on coding and enterprise. Codex is now at <strong>2 million weekly active users</strong>, up 4x since January. API usage jumped 20% after GPT-5.4 launched. OpenAI is also forming a <strong>private equity joint venture</strong> to embed engineers inside enterprises, partnering with firms to sell deployment capacity. Anthropic is reportedly in similar talks with <strong>Blackstone</strong>. Current and former employees told the WSJ that OpenAI &#8220;lost much of its focus last year&#8221; while Anthropic shipped relentlessly. For builders: the competitive dynamic has inverted. Anthropic is now the company setting the pace, and OpenAI is playing catch-up on the product that matters most &#8212; coding agents.</p><p>Link: <a href="https://futurism.com/artificial-intelligence/openai-cutting-projects">https://futurism.com/artificial-intelligence/openai-cutting-projects</a></p><div><hr></div><h3><strong>3. Meta&#8217;s Cascading Crisis: Avocado Delayed, Agents Gone Rogue, 20% Layoffs</strong></h3><p>Meta&#8217;s week was a masterclass in compounding failure. The New York Times reported that <strong>Avocado</strong>, Meta&#8217;s next-generation AI model, has been <strong>delayed to May</strong> after internal tests showed it underperforming GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6. Meta&#8217;s AI leaders reportedly discussed <strong>temporarily licensing Google&#8217;s Gemini</strong> to power Meta&#8217;s products. Then TechCrunch reported that a <strong>rogue AI agent at Meta exposed sensitive company and user data</strong> to unauthorized employees for two hours &#8212; a &#8220;Sev 1&#8221; incident. This followed a Meta safety director&#8217;s own OpenClaw agent deleting her entire inbox despite instructions to confirm before acting. Meanwhile, Meta acquired <strong>Moltbook</strong>, the AI-agent social network, and launched <strong>Manus</strong>, a desktop AI agent from its acquired startup &#8212; even as agents are demonstrably misbehaving internally. Reports indicate Meta is planning <strong>20% layoffs</strong> (~16,000 employees) and a $27 billion infrastructure deal with Nebius. The company will spend <strong>$135 billion on AI in 2026</strong>. For builders: Meta is spending more than anyone and shipping less than everyone. The lesson is clear &#8212; compute doesn&#8217;t compensate for focus.</p><p>Link: <a href="https://nypost.com/2026/03/13/business/meta-delays-release-of-new-ai-weighs-licensing-googles-gemini-after-disappointing-trial-runs-report/">https://nypost.com/2026/03/13/business/meta-delays-release-of-new-ai-weighs-licensing-googles-gemini-after-disappointing-trial-runs-report/</a></p><div><hr></div><h3><strong>4. Nvidia GTC 2026: $1 Trillion and the Inference Era</strong></h3><p>Jensen Huang&#8217;s GTC keynote lasted 2 hours and 40 minutes and contained one number that mattered: <strong>$1 trillion</strong>. That&#8217;s Nvidia&#8217;s projected order book for Blackwell and Vera Rubin chips through 2027, double the $500 billion projection from last year&#8217;s GTC. 
The <strong>Vera Rubin platform entered commercial production</strong> with seven new chips, promising 3.5x faster training and 5x faster inference than Blackwell. Huang declared the industry has entered the <strong>&#8220;inference era&#8221;</strong> &#8212; training happens occasionally, inference happens constantly. Nvidia also unveiled <strong>NemoClaw</strong>, an open-source enterprise agent platform, and announced a <strong>$2 billion strategic investment in Nebius</strong>, the AI infrastructure provider. Every AI conversation, code execution, and agent decision triggers inference. Huang&#8217;s argument is simple: if agentic AI becomes the primary interface between humans and software, inference demand becomes effectively infinite. For builders: Nvidia is no longer just selling chips. It&#8217;s selling the thesis that the world needs an infinite amount of intelligence, and it&#8217;s the only company that can manufacture it at scale.</p><p>Link: <a href="https://techcrunch.com/2026/03/16/jensen-just-put-nvidias-blackwell-and-vera-rubin-sales-projections-into-the-1-trillion-stratosphere/">https://techcrunch.com/2026/03/16/jensen-just-put-nvidias-blackwell-and-vera-rubin-sales-projections-into-the-1-trillion-stratosphere/</a></p><div><hr></div><h3><strong>5. Yann LeCun Raises the Largest Seed Round in History</strong></h3><p><strong>AMI Labs</strong> &#8212; Advanced Machine Intelligence, pronounced like the French word for &#8220;friend&#8221; &#8212; announced a <strong>$1.03 billion seed round</strong> on March 10, just four months after founding. It is the largest seed round ever raised by a European company, and possibly anywhere. The company is chaired by <strong>Yann LeCun</strong>, the Turing Award winner who spent 12 years at Meta before departing in November 2025. His founding team includes former Meta AI researchers <strong>Saining Xie, Pascale Fung, and Michael Rabbat</strong>. Strategic investors include <strong>Nvidia, Samsung, Toyota Ventures, Jeff Bezos, and Eric Schmidt</strong>. LeCun&#8217;s thesis: the current LLM paradigm is a dead end for real-world intelligence. AMI is building <strong>&#8220;world models&#8221;</strong> &#8212; systems that understand physics, causality, and spatial reasoning, not just language patterns. The timing is pointed: LeCun left Meta just as it was pouring $135 billion into an LLM strategy that, as of this week, is producing models that can&#8217;t compete. For builders: if LeCun is right, the next paradigm shift in AI isn&#8217;t a bigger language model. It&#8217;s a fundamentally different architecture. Watch this space.</p><p>Link: <a href="https://thenextweb.com/news/europe-startup-funding-rounds-march">https://thenextweb.com/news/europe-startup-funding-rounds-march</a></p><div><hr></div><h3><strong>6. The Mega IPOs That Could Swallow the Entire VC Market</strong></h3><p>PitchBook published a warning shot this week: the potential IPOs of <strong>SpaceX ($1.25 trillion), OpenAI ($840 billion), and Anthropic ($330 billion)</strong> could generate more exit value than <strong>all US VC-backed IPOs combined since the year 2000</strong>. Together, they could raise over $100 billion in proceeds. The problem: these mega-listings could <strong>absorb all available public market capital</strong>, crowding out every other company waiting to go public. More than <strong>$4 trillion</strong> sits locked in unicorns. If even one of these giants stumbles post-listing, the chill could push the broader IPO window into 2027. 
Morgan Stanley separately warned that the market is &#8220;not prepared for the non-linear increase in LLM capabilities&#8221; expected in April-June, estimating <strong>$2.9 trillion in global data center construction</strong> through 2028. For builders: if you&#8217;re planning to raise or exit in 2026, the window may be defined entirely by whether SpaceX, OpenAI, and Anthropic go public &#8212; and whether they succeed.</p><p>Link: <a href="https://pitchbook.com/news/articles/the-mega-ipos-that-could-shut-out-the-rest-of-vc">https://pitchbook.com/news/articles/the-mega-ipos-that-could-shut-out-the-rest-of-vc</a></p>]]></content:encoded></item><item><title><![CDATA[Strands vs. Claude Agent SDK: Two Very Different Bets on What “Agent” Means]]></title><description><![CDATA[A technical deep-dive into the architectural trade-offs between AWS Strands Agents and the Anthropic Claude Agent SDK, including the real story on running non-Claude models through the Agent SDK.]]></description><link>https://ai.techonthestack.com/p/strands-vs-claude-agent-sdk-two-very</link><guid isPermaLink="false">https://ai.techonthestack.com/p/strands-vs-claude-agent-sdk-two-very</guid><dc:creator><![CDATA[Luca Bianchi]]></dc:creator><pubDate>Thu, 19 Mar 2026 22:50:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!3EYu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fb17e2-c34d-4a01-bbba-d5f2e857cce9_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3EYu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fb17e2-c34d-4a01-bbba-d5f2e857cce9_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3EYu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fb17e2-c34d-4a01-bbba-d5f2e857cce9_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!3EYu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fb17e2-c34d-4a01-bbba-d5f2e857cce9_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!3EYu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fb17e2-c34d-4a01-bbba-d5f2e857cce9_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!3EYu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fb17e2-c34d-4a01-bbba-d5f2e857cce9_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3EYu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fb17e2-c34d-4a01-bbba-d5f2e857cce9_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d5fb17e2-c34d-4a01-bbba-d5f2e857cce9_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1949525,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://ai.techonthestack.com/i/191527751?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fb17e2-c34d-4a01-bbba-d5f2e857cce9_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3EYu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fb17e2-c34d-4a01-bbba-d5f2e857cce9_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!3EYu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fb17e2-c34d-4a01-bbba-d5f2e857cce9_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!3EYu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fb17e2-c34d-4a01-bbba-d5f2e857cce9_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!3EYu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fb17e2-c34d-4a01-bbba-d5f2e857cce9_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Most comparisons lay these two side by side in neat feature tables. That&#8217;s useful but insufficient, because the frameworks don&#8217;t just differ in features. They differ in their assumptions about your problem. 
Pick the wrong one, and you&#8217;ll spend weeks building plumbing that the other gives you for free, or discover your architecture can&#8217;t flex in the direction your product needs.</p><p>This article breaks down the real architectural differences, compares them feature by feature (with attention to TypeScript support), and digs into the question that keeps coming up: can you actually run non-Claude models through the Claude Agent SDK?</p><h2><strong>Orchestrator vs. Runtime</strong></h2><p>There is a subtle but significant difference in approach between Strands and the Claude Agent SDK. Strands gives your agent a brain and a dispatch loop; you build the hands, providing the LLM with a prompt, tool definitions (name, description, input schema), and a system prompt. The model decides which tools to call and in what order. The framework manages the loop: call model &#8594; model picks a tool &#8594; framework invokes your callback &#8594; feeds result back &#8594; repeat. The keyword is <em>your callback</em>. The tool definition says <code>get_weather</code> takes a location string, but you&#8217;re responsible for wiring that to an actual HTTP call, handling errors, and returning the result (see the sketch at the end of this section).</p><p>The Claude Agent SDK, on the other hand, gives your agent a brain, hands, a workbench, and a filing cabinet. Your job is to decide what it is not allowed to touch. The SDK ships with pre-built, executable tools: <code>Read</code>, <code>Write</code>, <code>Edit</code>, <code>MultiEdit</code>, <code>Bash</code>, <code>Glob</code>, <code>Grep</code>, <code>WebSearch</code>, <code>WebFetch</code>, <code>AskUserQuestion</code>, <code>Agent</code> (subagent spawning), and <code>NotebookEdit</code>. When Claude decides to read a file, the SDK reads that file from the filesystem. When it decides to run <code>grep -r "TODO" ./src</code>, a real shell executes that command and feeds stdout back. This is the same infrastructure that powers Claude Code, repackaged as a programmable library.</p><p>The practical consequence shows up during debugging. A Strands agent calling a proprietary API gives you normal debugging: step through the callback, inspect the raw HTTP response, and add retries. A misbehaving Claude Agent SDK agent means figuring out what Claude <em>thought</em> the filesystem looked like versus what it actually looked like. A different kind of problem.</p><p>With Strands, the agent starts with zero capabilities and no inherent &#8220;agency&#8221;: you add tools one by one. With the Claude Agent SDK, the agent starts with a shell, a filesystem, and web access. Your job is to <em>restrict</em> what it can touch. Miss a permission? The agent might execute something you didn&#8217;t expect. The <code>allowed_tools</code> list and permission modes help, but the mental model is fundamentally different: you&#8217;re not building a whitelist, you&#8217;re maintaining a blacklist. And blacklists have gaps.</p><p>This means Claude Agent SDK workloads should run in sandboxed containers, gVisor at minimum, not Docker alone. The performance overhead is real (roughly 15-20% on I/O-heavy operations), but the alternative is trusting that you&#8217;ve correctly anticipated every shell command a sufficiently creative LLM might construct. Firecracker, a true microVM with a minimal attack surface, would provide even stronger isolation, but the cold-start latency (~125ms, which compounds in agentic loops with dozens of tool calls) makes it impractical for interactive workloads. 
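</p><p>To make the orchestrator-versus-runtime contrast concrete, here is a minimal sketch of both sides of the same idea. It assumes the commonly documented Python entry points (<code>strands.Agent</code>/<code>tool</code> and <code>claude_agent_sdk.query</code>/<code>ClaudeAgentOptions</code>); treat the exact names, defaults, and the weather endpoint as assumptions rather than a definitive implementation.</p><pre><code># Sketch only: assumes the documented Python surface of both SDKs
# (strands.Agent / strands.tool and claude_agent_sdk.query / ClaudeAgentOptions).
# Exact option names may differ between versions.
import asyncio
import urllib.request

# --- Strands: you supply the hands; your callback does the real work ---
from strands import Agent, tool

@tool
def get_weather(location: str) -> str:
    """Current weather for a location, as a plain-text one-liner."""
    # You own the wiring: the HTTP call, the timeout, the error handling.
    url = f"https://wttr.in/{location}?format=3"  # hypothetical data source
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8").strip()

weather_agent = Agent(tools=[get_weather],
                      system_prompt="Answer weather questions concisely.")
weather_agent("What is the weather in Milan right now?")

# --- Claude Agent SDK: the hands already exist; you decide what they touch ---
from claude_agent_sdk import ClaudeAgentOptions, query

async def list_todos() -> None:
    options = ClaudeAgentOptions(
        allowed_tools=["Read", "Grep", "Glob"],  # read-only whitelist, no Bash/Write
        cwd="./src",                             # confine the agent to one directory
    )
    async for message in query(prompt="List every TODO in this repo", options=options):
        print(message)

asyncio.run(list_todos())
</code></pre><p>The Strands half is a whitelist you assemble; the Claude Agent SDK half is a blacklist you trim, which is exactly the mental-model difference described above.</p><p>One more note on the sandboxing trade-off. 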
Batch processing is a different story.</p><p>Feature matrices date fast, and half these checkboxes will be different in six months. Treat this as a snapshot.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PPx3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77f5e0e2-739e-41c2-ad02-803645214ca1_1794x960.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PPx3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77f5e0e2-739e-41c2-ad02-803645214ca1_1794x960.png 424w, https://substackcdn.com/image/fetch/$s_!PPx3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77f5e0e2-739e-41c2-ad02-803645214ca1_1794x960.png 848w, https://substackcdn.com/image/fetch/$s_!PPx3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77f5e0e2-739e-41c2-ad02-803645214ca1_1794x960.png 1272w, https://substackcdn.com/image/fetch/$s_!PPx3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77f5e0e2-739e-41c2-ad02-803645214ca1_1794x960.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PPx3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77f5e0e2-739e-41c2-ad02-803645214ca1_1794x960.png" width="1456" height="779" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77f5e0e2-739e-41c2-ad02-803645214ca1_1794x960.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:779,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:204207,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://ai.techonthestack.com/i/191527751?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77f5e0e2-739e-41c2-ad02-803645214ca1_1794x960.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PPx3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77f5e0e2-739e-41c2-ad02-803645214ca1_1794x960.png 424w, https://substackcdn.com/image/fetch/$s_!PPx3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77f5e0e2-739e-41c2-ad02-803645214ca1_1794x960.png 848w, https://substackcdn.com/image/fetch/$s_!PPx3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77f5e0e2-739e-41c2-ad02-803645214ca1_1794x960.png 1272w, https://substackcdn.com/image/fetch/$s_!PPx3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77f5e0e2-739e-41c2-ad02-803645214ca1_1794x960.png 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A relevant point on MCP: both claim first-class support, both deliver it, but they feel different in practice: Strands treats MCP tools the same as any other tool definition, registers the server, and tools appear in the list. The Claude Agent SDK integrates MCP more deeply because its tools are already executable; MCP servers plug into the same execution pipeline as built-in tools. For simple tool schemas, this distinction doesn&#8217;t matter. For tools with complex input validation, the Claude Agent SDK&#8217;s execution model handles it more gracefully.</p><p>Strands offers Graph, Swarm, Workflow, and Agent-as-Tool. Different agents can use different models; this matters for cost optimization: cheap models for classification and routing, expensive ones for reasoning. The Claude Agent SDK provides subagents with isolated context windows, all of which run Claude. Simpler to reason about, inflexible on model choice.</p><p>On the observability side, Strands has OpenTelemetry support only for Python; the TS SDK is missing it, which is frustrating if you&#8217;re a TypeScript shop. The Claude Agent SDK includes built-in cost tracking and lifecycle hooks (PreToolUse, PostToolUse, SubagentStart/Stop) to wire in custom telemetry. More assembly required, but more flexible.</p><p>The Claude Agent SDK ships a <code>max_budget_usd</code> parameter that caps spend per session. Strands has no equivalent; you&#8217;ll need to build a token-counting middleware yourself. Not hard, but the kind of thing a framework should handle.</p><p>Strands integrates with Lambda, Fargate, EKS, and recently AgentCore. The Claude Agent SDK assumes sandboxed containers. Running both on EKS is feasible, but Claude Agent SDK pods need the gVisor runtime class, which means separate node pool configuration.</p><p>A special focus should be placed on TypeScript-first support; the comparison is lopsided. Strands TS SDK v0.5.0 is marked experimental. 
The core agent loop works and is usable, but the official docs list these as Python-only: Anthropic/Ollama/LiteLLM model providers, summarizing conversation manager, multi-agent patterns, session persistence, OpenTelemetry, the Evals SDK, and the community tools package. The GitHub README hints that some of these (structured output, Graph/Swarm) have landed in TS recently, but the docs haven&#8217;t caught up. There are known edge cases. Graph mode with parallel nodes can deadlock when concurrent nodes write to the same session store. By contrast, the Claude Agent SDK for TypeScript (<code>@anthropic-ai/claude-agent-sdk</code>, v0.2.71) has full feature parity with the Python version. Hooks, custom MCP tools, subagents, streaming, sessions, the complete tool suite: all there. Porting an agent from Python to TS requires only syntax changes. That&#8217;s rare in this ecosystem.</p><p>For TypeScript teams that need production-ready agent capabilities now, the gap is hard to ignore.</p><h2><strong>Running Non-Claude Models Through the Claude Agent SDK</strong></h2><p>It&#8217;s an often-asked question, and the community&#8217;s answers are mixed: &#8220;Can I use the Claude Agent SDK with non-Claude models?&#8221; Short answer: you can, and it mostly works, but don&#8217;t do it in production. The Claude Agent SDK officially supports only Claude models through four backends: the Anthropic API, Amazon Bedrock, Google Vertex AI, and Microsoft Azure AI Foundry. There is no native abstraction for model providers. The SDK spawns a Claude Code CLI subprocess that expects the Anthropic Messages API format.</p><p>There are several approaches to running non-Claude models, but they are all workarounds. The first is Ollama&#8217;s Anthropic API compatibility layer: override <code>ANTHROPIC_BASE_URL</code> to point at Ollama, set a dummy auth token, and the agent loop runs. Models tested include qwen3.5, glm-5:cloud, and kimi-k2.5:cloud. Ollama recommends a 64k context minimum, which significantly limits model choices. Another solution is the LiteLLM proxy, which translates the Anthropic Messages API to any backend provider (OpenAI, Azure, Gemini, Bedrock, Ollama, 100+ others). More flexible than Ollama&#8217;s built-in compatibility: point <code>ANTHROPIC_BASE_URL</code> at the proxy, and the same agent code works with any model. A third option is Claude Code Router (a community project), which enables dynamic model switching within sessions and supports OpenRouter, DeepSeek, Ollama, and Gemini.</p><h3><strong>Nice ideas, but they actually break in production</strong></h3><p>Tool-calling reliability degrades noticeably. In benchmarks comparing Claude Sonnet 4 against Qwen3.5 via Ollama compatibility on identical task sets (file operations, grep searches, multi-step bash workflows), Claude completes ~95% of tasks correctly, while Qwen completes around 60%, with most failures in multi-step bash tasks where it constructs syntactically valid but semantically incorrect commands. The Claude Agent SDK faithfully executes those incorrect commands. A powerful runtime plus mediocre reasoning equals a larger blast radius, arguably worse than the same model running in a simpler framework.</p><p>Extended thinking is very limited outside of Claude. If agent workflows depend on it (complex code review, deep research, multi-step planning), there&#8217;s no workaround. The agent often runs without extended thinking capabilities, and it makes worse decisions at the planning stage.</p><p>Subagent orchestration assumes Claude-level instruction following. 
Permission hooks behave unpredictably with weaker models; the model occasionally tries to use disallowed tools, and while the SDK blocks them correctly, the recovery behavior (trying an alternative approach) isn&#8217;t as graceful as with Claude.</p><p>The bottom line is that using models other than Claude is technically feasible via proxy layers, but architecturally unsound for production. The SDK is a Claude-specific runtime being adapted through API translation, not a model-agnostic framework.</p><h2><strong>When to Use Which</strong></h2><p>We found that Strands is the right choice for domain-specific agents: agents calling proprietary APIs (legal corpora, scoring engines, internal services), multi-model architectures where different agents use different LLMs, edge deployment with local models, AWS-native event-driven patterns (Lambda, EventBridge, S3 triggers), and situations where Apache 2.0 licensing or vendor independence is a requirement. The cost flexibility alone, swapping an expensive model for a cheap one on a per-agent basis, justifies the choice for most enterprise use cases. On the other hand, the Claude Agent SDK is the right choice when agents need to interact with a computing environment: code review, codebase analysis, research tasks involving document reading and web search, and any workflow where the agent needs to read, write, edit, and execute. The built-in tools eliminate enormous amounts of boilerplate. Replicating the <code>Edit</code> tool&#8217;s diff-based file editing with conflict detection in Strands is a multi-day effort, not a trivial one.</p><p>Sometimes, having both in the same system makes sense: when a pipeline has domain-specific ingestion and classification (Strands, multiple models, custom tools) feeding into compute-heavy analysis and report generation (Claude Agent SDK, file I/O, shell access, web search). The agent layers communicate via MCP (a minimal server sketch follows below). It works, but the operational overhead of two runtimes is real. Don&#8217;t add this complexity unless you genuinely need both capability sets. 
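</p><p>What &#8220;communicate via MCP&#8221; looks like in practice: each side&#8217;s custom tools live in a small MCP server that either framework can mount. A minimal sketch, assuming the reference Python <code>mcp</code> package and its <code>FastMCP</code> helper; the scoring logic is a toy placeholder, not a real tool.</p><pre><code># Minimal MCP tool server sketch. Assumes the reference Python "mcp" package
# and its FastMCP helper; the risk_score logic is a toy placeholder.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("risk-tools")

@mcp.tool()
def risk_score(ticker: str, exposure_usd: float) -> dict:
    """Toy exposure score that either framework could call over MCP."""
    score = min(1.0, exposure_usd / 1_000_000)
    return {"ticker": ticker, "score": round(score, 3)}

if __name__ == "__main__":
    # stdio transport by default: Strands registers this as an MCP server,
    # and the Claude Agent SDK can mount it next to its built-in tools.
    mcp.run()
</code></pre><p>Tools defined this way survive a framework swap, which is exactly where the closing section of this article lands.</p><p>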
Note that AWS demonstrated a similar pattern in their financial analysis reference architecture: LangGraph for workflow orchestration, Strands for reasoning within workflow nodes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OxGl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97352e16-06c6-435f-bfef-70889bc87acf_1784x760.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OxGl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97352e16-06c6-435f-bfef-70889bc87acf_1784x760.png 424w, https://substackcdn.com/image/fetch/$s_!OxGl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97352e16-06c6-435f-bfef-70889bc87acf_1784x760.png 848w, https://substackcdn.com/image/fetch/$s_!OxGl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97352e16-06c6-435f-bfef-70889bc87acf_1784x760.png 1272w, https://substackcdn.com/image/fetch/$s_!OxGl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97352e16-06c6-435f-bfef-70889bc87acf_1784x760.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OxGl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97352e16-06c6-435f-bfef-70889bc87acf_1784x760.png" width="1456" height="620" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97352e16-06c6-435f-bfef-70889bc87acf_1784x760.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:620,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:148737,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://ai.techonthestack.com/i/191527751?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97352e16-06c6-435f-bfef-70889bc87acf_1784x760.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OxGl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97352e16-06c6-435f-bfef-70889bc87acf_1784x760.png 424w, https://substackcdn.com/image/fetch/$s_!OxGl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97352e16-06c6-435f-bfef-70889bc87acf_1784x760.png 848w, https://substackcdn.com/image/fetch/$s_!OxGl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97352e16-06c6-435f-bfef-70889bc87acf_1784x760.png 1272w, https://substackcdn.com/image/fetch/$s_!OxGl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97352e16-06c6-435f-bfef-70889bc87acf_1784x760.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>One thing the table doesn&#8217;t capture: the Claude Agent SDK&#8217;s proprietary license means you can&#8217;t fork it if Anthropic changes terms or pricing. With Strands, worst case, you have the source code. In enterprise procurement, this distinction matters more than most engineers think.</p><h2><strong>Where to go from here</strong></h2><p>Both frameworks have native MCP support, which means tools built for one are increasingly portable to the other. Neither has A2A (agent-to-agent protocol) yet, CrewAI does, for what it&#8217;s worth.</p><p>The practical takeaway: write all new tools as MCP servers regardless of which framework consumes them. If Strands need to be swapped for something else in a year, the tools survive. If the Claude Agent SDK adds model-agnostic support (unlikely but not impossible), the tools still survive.</p><p>The framework is a replaceable component. The tools and the domain logic aren&#8217;t. Invest accordingly.</p><div><hr></div><p><em>Subscribe to <strong><a href="http://ai.techonthestack.com">TechOnTheStack</a></strong> for weekly deep-dives on cloud architecture, AI infrastructure, and hands-on engineering decisions. No fluff, just what works and what doesn&#8217;t.</em></p>]]></content:encoded></item><item><title><![CDATA[Hello Builders, issue #13]]></title><description><![CDATA[News from the trenches of AI in the week Feb 17-23, 2026]]></description><link>https://ai.techonthestack.com/p/hello-builders-issue-13</link><guid isPermaLink="false">https://ai.techonthestack.com/p/hello-builders-issue-13</guid><dc:creator><![CDATA[Luca Bianchi]]></dc:creator><pubDate>Tue, 24 Feb 2026 21:49:58 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3bf39c1a-c0a4-46ee-9cdb-8fbb98744d38_420x300.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hello Builders,</p><p>This was the week the AI cold war went hot. <strong>Anthropic publicly accused DeepSeek, Moonshot, and MiniMax of industrial-scale distillation</strong> &#8212; 24,000 fake accounts, 16 million exchanges, proxy networks to evade detection. 
Not a rumor, not a leak: a formal accusation framed as national security. OpenAI followed with its own allegations to Congress. Meanwhile, Google did something that may matter more than any model release this year: <strong>Chrome 146 shipped with native WebMCP support</strong>, turning every website into a potential AI agent endpoint. OpenAI launched <strong>Frontier</strong>, an enterprise platform for &#8220;AI coworkers,&#8221; and signed alliances with <strong>McKinsey, BCG, Accenture, and Capgemini</strong> to deploy it. Meta signed a <strong>multi-billion-dollar GPU deal with Nvidia</strong> covering millions of chips. And every major AI CEO flew to Delhi for India&#8217;s AI Summit, where <strong>billions in deals were announced</strong> in four days. The pattern is unmistakable: AI is no longer a technology race. It&#8217;s an infrastructure war, a trade war, and a regulatory war &#8212; all at once.</p><h2><strong>This week&#8217;s signal in the noise</strong></h2><ul><li><p>Anthropic accuses DeepSeek, Moonshot, and MiniMax of running 24,000 fake accounts to distill Claude&#8217;s capabilities at an industrial scale</p></li><li><p>Google ships WebMCP in Chrome 146, creating a native protocol for AI agents to interact with any website</p></li><li><p>OpenAI launches Frontier, an enterprise platform for &#8220;AI coworkers,&#8221; and signs multi-year alliances with McKinsey, BCG, Accenture, and Capgemini</p></li><li><p>Meta expands Nvidia partnership to deploy millions of GPUs and next-gen Vera Rubin systems in a deal worth tens of billions</p></li><li><p>India AI Summit draws Altman, Amodei, Pichai, and Huang, with billions in infrastructure commitments announced</p></li></ul><div><hr></div><h3><strong>1. Anthropic vs. China: 24,000 Fake Accounts and a National Security Accusation</strong></h3><p>Anthropic dropped a bombshell on Sunday: <strong>DeepSeek, Moonshot AI, and MiniMax</strong> ran coordinated campaigns using 24,000 fraudulent accounts and proxy networks to extract Claude&#8217;s outputs and train their own models. Anthropic traced 16 million exchanges to these campaigns and called the operation &#8220;industrial-scale.&#8221; DeepSeek specifically sought to create &#8220;censorship-safe alternatives to policy-sensitive queries.&#8221; Anthropic framed the disclosure as a national security issue, timed to coincide with Washington&#8217;s debate over AI chip export controls. OpenAI had made similar allegations about ChatGPT to the House Select Committee earlier this month. When the two fiercest competitors in AI agree on something, the signal is deafening. For builders: distillation defense is now a first-class engineering concern. If you&#8217;re serving frontier model outputs via API, your responses are a target.</p><p>Link: <a href="https://techcrunch.com/2026/02/23/anthropic-accuses-chinese-ai-labs-of-mining-claude-as-us-debates-ai-chip-exports/">https://techcrunch.com/2026/02/23/anthropic-accuses-chinese-ai-labs-of-mining-claude-as-us-debates-ai-chip-exports/</a></p><div><hr></div><h3><strong>2. Google Ships WebMCP: The Most Important Release No One Is Talking About</strong></h3><p>Chrome 146 shipped with native <strong>WebMCP (Web Model Context Protocol)</strong> support &#8212; a standard that lets AI agents interact with websites via structured &#8220;Tool Contracts&#8221; rather than fragile screen scraping. Websites can now declare what actions agents are allowed to take: book a flight, add to cart, submit a form. The browser becomes the universal agent runtime. 
If WebMCP gains adoption, every website effectively becomes an MCP server. This is what the &#8220;agentic web&#8221; actually looks like &#8212; not chatbots slapped onto homepages, but a fundamentally new interaction layer where AI negotiates with websites on your behalf. For builders: start implementing WebMCP endpoints now. The companies that expose structured agent interfaces first will capture the agentic traffic wave that&#8217;s building.</p><p>Link: <a href="https://www.pymnts.com/artificial-intelligence-2/2026/google-introduces-webmcp-to-give-browser-access-for-ai-agents/">https://www.pymnts.com/artificial-intelligence-2/2026/google-introduces-webmcp-to-give-browser-access-for-ai-agents/</a></p><div><hr></div><h3><strong>3. OpenAI Launches Frontier: AI Coworkers Get Onboarded Like Employees</strong></h3><p>OpenAI launched <strong>Frontier</strong>, an enterprise platform for building, deploying, and managing &#8220;AI coworkers&#8221; &#8212; complete with onboarding, permissions, identity governance, and audit trails. Think of it as HR for agents. Then, on Sunday, OpenAI signed <strong>multi-year alliances with McKinsey, BCG, Accenture, and Capgemini</strong> to deploy Frontier across their enterprise clients. Brad Lightcap said the quiet part out loud: &#8220;The limiting factor for seeing value from AI in enterprises isn&#8217;t model intelligence &#8212; it&#8217;s how agents are built and run in their organizations.&#8221; This is OpenAI&#8217;s boldest enterprise move yet. They&#8217;re not selling APIs anymore; they&#8217;re selling a managed agent workforce. The consulting partnerships are the delivery mechanism&#8212;the four consultancies serve as OpenAI&#8217;s implementation layer. For builders: if McKinsey and Accenture are deploying OpenAI&#8217;s agents, they&#8217;re coming for enterprise workflows at scale. The SaaS disruption everyone predicted is now being actively orchestrated.</p><p>Link: <a href="https://www.cnbc.com/2026/02/23/open-ai-consulting-accenture-boston-capgemini-mckinsey-frontier.html">https://www.cnbc.com/2026/02/23/open-ai-consulting-accenture-boston-capgemini-mckinsey-frontier.html</a></p><div><hr></div><h3><strong>4. Meta Goes All-In: Millions of Nvidia GPUs, $130B AI Budget</strong></h3><p>Meta expanded its Nvidia partnership in a deal worth <strong>tens of billions</strong>, deploying millions of GPUs &#8212; alongside Nvidia&#8217;s new standalone CPUs and next-generation Vera Rubin systems &#8212; across its data center fleet. Jensen Huang called it &#8220;bringing the full NVIDIA platform to Meta.&#8221; The company&#8217;s total AI spend for 2026: <strong>$115-135 billion</strong> across 30 new data centers. Meta AI already has 1 billion users. Zuckerberg is betting that whoever controls the most compute wins the consumer AI race. Combined with Meta&#8217;s open-source strategy through Llama, this creates a unique position: the largest compute footprint in AI, powering the largest open-weight model ecosystem. For builders: the open-weight ecosystem just got a massive infrastructure boost. Plan for a world where the best open models are genuinely competitive with proprietary ones.</p><p>Link: <a href="https://www.cnbc.com/2026/02/17/meta-nvidia-deal-ai-data-center-chips.html">https://www.cnbc.com/2026/02/17/meta-nvidia-deal-ai-data-center-chips.html</a></p><div><hr></div><h3><strong>5. India AI Summit: Every CEO at Modi&#8217;s Court</strong></h3><p>Every major AI leader flew to New Delhi this week. 
<strong>Sam Altman, Dario Amodei, Sundar Pichai, and Jensen Huang</strong> all attended the four-day AI Impact Summit and met directly with Prime Minister Modi. The commitments poured in: L&amp;T partnered with Nvidia for AI data center infrastructure, India&#8217;s payments network adopted Nvidia&#8217;s Nemotron for sovereign AI, OpenAI struck a platform deal with Zomato, and Google DeepMind announced a partnership with the Indian government for science and education. Pichai publicly defended AI spending against bubble fears. The message was unanimous: India isn&#8217;t just a market &#8212; it&#8217;s the next strategic battleground, sitting between the US and China with 1.4 billion people and accelerating digital infrastructure. For builders: the global south is where the next billion AI users come from. Build for multilingual, mobile-first, cost-sensitive deployment.</p><p>Link: <a href="https://techcrunch.com/2026/02/22/all-the-important-news-from-the-ongoing-india-ai-summit/">https://techcrunch.com/2026/02/22/all-the-important-news-from-the-ongoing-india-ai-summit/</a></p><div><hr></div><h3><strong>6. The Year of the AI Bill: US States Launch a Regulatory Flood</strong></h3><p>While the industry raised money and fought geopolitical battles, <strong>US states launched an unprecedented wave of AI legislation</strong>. Alabama signed the App Store Accountability Act, targeting digital child safety. Idaho&#8217;s Conversational AI Safety Act advanced through committee with a do-pass recommendation. Colorado&#8217;s comprehensive AI Act takes effect June 30, requiring &#8220;reasonable care&#8221; to prevent algorithmic discrimination. At the federal level, Anthropic and OpenAI&#8217;s distillation accusations are feeding the push for stricter export controls. The EU AI Act&#8217;s prohibited practices provisions are already in force. The regulatory landscape is no longer hypothetical &#8212; it&#8217;s operational. For builders: if you&#8217;re deploying AI in production, you need a compliance strategy now, especially for high-risk use cases in hiring, lending, healthcare, and any consumer-facing application.</p><p>Link: <a href="https://www.compliancehub.wiki/is-2026-the-year-of-the-chatbot-bill-a-state-by-state-ai-legislation-roundup/">https://www.compliancehub.wiki/is-2026-the-year-of-the-chatbot-bill-a-state-by-state-ai-legislation-roundup/</a></p>]]></content:encoded></item><item><title><![CDATA[Hello Builders, issue #12]]></title><description><![CDATA[News from the trenches of AI in the week Feb 11-18, 2026]]></description><link>https://ai.techonthestack.com/p/hello-builders-issue-12</link><guid isPermaLink="false">https://ai.techonthestack.com/p/hello-builders-issue-12</guid><dc:creator><![CDATA[Luca Bianchi]]></dc:creator><pubDate>Wed, 18 Feb 2026 22:15:23 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4c68cf2f-0646-4d98-aa15-516a53b79a8c_1056x754.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hello Builders,</p><p>This week, the money doubled, the code wrote itself, and the people who built it started walking away. <strong>Anthropic closed a $30 billion Series G</strong> at a $380 billion valuation, more than doubling its worth in five months and surpassing OpenAI in total capital raised. <strong>Spotify&#8217;s CEO revealed that the company&#8217;s best developers haven&#8217;t written a single line of code since December</strong>, handing everything to AI agents. 
<strong>OpenAI hired Peter Steinberger</strong>, the creator of OpenClaw, the open-source personal agent with 196,000 GitHub stars and 2 million weekly users, signaling that 2026 is the year of the personal agent. Meanwhile, <strong>Matt Shumer&#8217;s essay &#8220;Something Big Is Coming&#8221; hit 70 million views</strong> in 36 hours, and <strong>safety researchers fled both OpenAI and Anthropic</strong>, warning that the labs are building systems they can&#8217;t control. <strong>ByteDance&#8217;s Seedance 2.0 went viral</strong>, prompting legal threats from Disney and Paramount. Mistral AI made its first acquisition, buying Koyeb to become a full-stack AI contender in Europe. The gap between AI&#8217;s capabilities and humanity&#8217;s readiness is now a chasm.</p><h2><strong>This week&#8217;s signal in the noise</strong></h2><ul><li><p><strong>Anthropic $30B at $380B</strong>: Largest private AI round ever. Revenue at $14B run-rate. Over 30 investors, VCs now openly backing both OpenAI and Anthropic.</p></li><li><p><strong>Spotify quits coding</strong>: CEO says top engineers haven&#8217;t written code in 2026. AI handles deployment end-to-end.</p></li><li><p><strong>OpenAI hires OpenClaw&#8217;s creator</strong>: Peter Steinberger joins, and OpenClaw becomes an independent open-source foundation. Altman: The future is &#8220;extremely multi-agent.&#8221;</p></li><li><p><strong>AI existential panic</strong>: 70M-view viral essay, researcher departures, OpenAI dismantled its alignment team. The Guardian: &#8220;industry pursuing profit at all costs.&#8221;</p></li><li><p><strong>Seedance 2.0 vs Hollywood</strong>: ByteDance&#8217;s video model generates cinematic clips, goes viral, triggers copyright crackdown from Disney and Paramount.</p></li></ul><div><hr></div><h3><strong>1. Anthropic Closes $30 Billion Round, Surpasses OpenAI in Total Capital Raised</strong></h3><p>Anthropic announced a $30 billion Series G on February 12, valuing the company at $380 billion, more than double the $183 billion from its Series F in September. The round was led by GIC and Coatue, with co-leads from Founders Fund, D.E. Shaw Ventures, ICONIQ, Dragoneer, and MGX. NVIDIA, Microsoft, Sequoia, and Qatar&#8217;s sovereign wealth fund also participated. Anthropic reported $14 billion in run-rate revenue, growing more than 10x annually for three consecutive years. Eight of the Fortune 10 now use Claude. Bloomberg reported that VCs are breaking a longstanding Silicon Valley taboo by openly investing in both Anthropic and OpenAI, with Sequoia and Altimeter backing both companies simultaneously. Founders Fund, previously an OpenAI backer, joined Anthropic&#8217;s round. OpenAI is reportedly planning a $100 billion raise of its own. For builders: the funding war is over. Both labs now have enough capital to sustain multi-year compute buildouts. The question is no longer who can raise more, but who can ship faster.</p><p>Link: <a href="https://pitchbook.com/news/articles/anthropic-surpasses-openai-in-ai-cash-race-after-raising-30b-series-g">https://pitchbook.com/news/articles/anthropic-surpasses-openai-in-ai-cash-race-after-raising-30b-series-g</a></p><div><hr></div><h3><strong>2. Spotify&#8217;s Best Developers Haven&#8217;t Written Code Since December</strong></h3><p>Spotify CEO Daniel Ek dropped a bombshell on February 12: the company&#8217;s top developers &#8220;have not written a single line of code&#8221; in 2026. AI now handles the entire coding and deployment pipeline. 
Ek described a workflow in which engineers define the desired outcome, and AI agents write, test, iterate, and deploy code autonomously. This isn&#8217;t an experiment; it&#8217;s Spotify&#8217;s production reality. The statement went viral, landing in TechCrunch&#8217;s most-read stories of the week. It echoes what Matt Shumer described in his viral essay: &#8220;A few months ago, I was constantly deliberating with the AI. Now I just describe the result and leave.&#8221; Spotify isn&#8217;t alone. ServiceNow, Infosys, and multiple Fortune 500 companies are reporting similar patterns. For builders: if Spotify&#8217;s best engineers are directing AI rather than writing code, the job title &#8220;software engineer&#8221; is being redefined in real time.</p><p>Link: <a href="https://techcrunch.com/2026/02/12/spotify-says-its-best-developers-havent-written-a-line-of-code-since-december-thanks-to-ai/">https://techcrunch.com/2026/02/12/spotify-says-its-best-developers-havent-written-a-line-of-code-since-december-thanks-to-ai/</a></p><div><hr></div><h3><strong>3. OpenAI Hires OpenClaw Creator, Declares &#8220;Year of the Agent.&#8221;</strong></h3><p>OpenAI hired Peter Steinberger, creator of OpenClaw, the open-source personal AI agent that amassed 196,000 GitHub stars and 2 million weekly users. Steinberger announced on X that OpenClaw will transition to an independent foundation while he joins OpenAI to bring personal agents to a broad audience. Sam Altman said the future will be &#8220;extremely multi-agent,&#8221; comparing OpenClaw-style agents to browsers: tools that democratize access. The irony is sharp: OpenClaw was built on Anthropic&#8217;s Claude API. Steinberger chose OpenAI anyway, citing scale. The hire signals OpenAI&#8217;s pivot from central model provider to agent platform owner. Investors had offered to fund OpenClaw as a standalone company, but Steinberger opted for partnership over independence. For builders: personal agents that schedule, shop, code, and coordinate are moving from novelty to infrastructure. If you&#8217;re building in this space, the platform race just got serious.</p><p>Link: <a href="https://www.theverge.com/ai-artificial-intelligence/879623/openclaw-founder-peter-steinberger-joins-openai">https://www.theverge.com/ai-artificial-intelligence/879623/openclaw-founder-peter-steinberger-joins-openai</a></p><div><hr></div><h3><strong>4. &#8220;Something Big Is Coming&#8221;: The AI Existential Panic Goes Mainstream</strong></h3><p>Matt Shumer&#8217;s essay about AI replacing knowledge work gathered 70 million views in 36 hours, becoming the most-read tech essay since Marc Andreessen&#8217;s &#8220;Software Is Eating the World.&#8221; Shumer, who builds products with AI daily, wrote that GPT-5.3-Codex now shows &#8220;judgment&#8221; and &#8220;taste,&#8221; capabilities humans claimed AI would never possess. The essay went viral the same week an Anthropic researcher quit to write poetry about &#8220;the place we find ourselves,&#8221; an OpenAI researcher left citing ethical concerns, and OpenAI quietly dismantled its mission alignment team. Jason Calacanis wrote: &#8220;I&#8217;ve never seen so many technologists state their concerns so strongly.&#8221; The Guardian published an editorial warning that AI safety departures signal &#8220;industry pursuing profit at all costs.&#8221; Axios reported that Anthropic&#8217;s own sabotage report confirmed AI can assist in chemical weapons creation. 
For builders: the existential conversation has left the labs and entered the mainstream. Your users, clients, and regulators are reading Shumer&#8217;s essay. Have an answer ready.</p><p>Link: <a href="https://www.businessinsider.com/matt-shumer-something-big-is-happening-essay-ai-disruption-2026-2">https://www.businessinsider.com/matt-shumer-something-big-is-happening-essay-ai-disruption-2026-2</a></p><div><hr></div><h3><strong>5. ByteDance&#8217;s Seedance 2.0 Goes Viral, Hollywood Fights Back</strong></h3><p>ByteDance released Seedance 2.0 on February 13, a video-generation AI model that produces cinematic-quality clips from text prompts. The model went mega-viral on Chinese social media and X, with Elon Musk himself praising the results. Comparisons to DeepSeek&#8217;s January 2025 breakthrough immediately followed. But Hollywood responded with legal force: Disney and Paramount sent legal threats after users generated clips featuring Tom Cruise, Brad Pitt, and copyrighted characters. ByteDance announced it would add safeguards and crack down on copyright misuse. Meanwhile, a dozen other Chinese AI firms released models during Spring Festival week, including iFlytek&#8217;s Spark X2 (trained entirely on Chinese chips), Alibaba&#8217;s forthcoming Qwen 3.5, and NetEase Youdao&#8217;s desktop agent LobsterAI. Reuters called it a scramble to steal the spotlight ahead of DeepSeek V4&#8217;s launch. For builders: Chinese AI is no longer a single company. It&#8217;s an ecosystem, and it just shipped a Hollywood-threatening video model while the West debated existential risk.</p><p>Link: <a href="https://www.reuters.com/business/media-telecom/bytedances-new-ai-video-model-goes-viral-china-looks-second-deepseek-moment-2026-02-12/">https://www.reuters.com/business/media-telecom/bytedances-new-ai-video-model-goes-viral-china-looks-second-deepseek-moment-2026-02-12/</a></p><div><hr></div><h3><strong>6. Mistral AI Makes First Acquisition, Goes Full-Stack</strong></h3><p>Mistral AI, the French company valued at $13.8 billion, acquired Paris-based Koyeb, a startup that simplifies AI app deployment at scale. The deal confirms Mistral&#8217;s ambition to move beyond LLM development into full-stack AI infrastructure. Mistral launched its cloud offering Mistral Compute in June 2025, and Koyeb accelerates that push. CEO Arthur Mensch pitched Mistral at Stockholm&#8217;s Techarena as &#8220;headquartered in Europe, doing frontier research in Europe.&#8221; Koyeb&#8217;s investors celebrated the deal as a step toward building &#8220;sovereign AI infrastructure in Europe.&#8221; Mistral recently passed $400 million in annual recurring revenue. For builders: Europe&#8217;s AI champion is going vertical. If you&#8217;re building on Mistral&#8217;s models, expect a more integrated, AWS-like experience. If you&#8217;re competing with them, the window to differentiate has just narrowed.</p><p>Link: <a href="https://techcrunch.com/2026/02/17/mistral-ai-buys-koyeb-in-first-acquisition-to-back-its-cloud-ambitions/">https://techcrunch.com/2026/02/17/mistral-ai-buys-koyeb-in-first-acquisition-to-back-its-cloud-ambitions/</a></p><div><hr></div><h3><strong>7. Infosys Partners with Anthropic to Deploy AI Agents Across Regulated Industries</strong></h3><p>Infosys announced a strategic collaboration with Anthropic on February 17 to develop enterprise AI solutions across telecom, financial services, manufacturing, and software development. 
The partnership will begin with a dedicated Anthropic Center of Excellence and expand into agentic AI systems built on the Claude Agent SDK. The focus: AI agents that independently handle multi-step tasks like claims processing, code testing, and compliance reviews. Infosys will also use Claude to modernize legacy systems and accelerate infrastructure migration. The deal signals Anthropic&#8217;s deepening enterprise playbook, following partnerships with ServiceNow, Slack, and Figma. For builders: Anthropic is embedding Claude into the enterprise middleware stack one partnership at a time. When consulting giants like Infosys make Claude their implementation platform, that&#8217;s ecosystem lock-in at scale.</p><p>Link: <a href="https://www.analyticsinsight.net/press-release/infosys-and-anthropic-announce-collaboration-to-unlock-ai-value-across-complex-regulated-industries">https://www.analyticsinsight.net/press-release/infosys-and-anthropic-announce-collaboration-to-unlock-ai-value-across-complex-regulated-industries</a></p>]]></content:encoded></item><item><title><![CDATA[Hello Builders, issue #11]]></title><description><![CDATA[News from the trenches of AI in the week Feb 3-10, 2026]]></description><link>https://ai.techonthestack.com/p/hello-builders-issue-11</link><guid isPermaLink="false">https://ai.techonthestack.com/p/hello-builders-issue-11</guid><dc:creator><![CDATA[Luca Bianchi]]></dc:creator><pubDate>Wed, 11 Feb 2026 09:30:27 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/79d7b557-7923-4346-9f6b-33e6f1446b92_420x300.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hello Builders,</p><p>This week, the AI industry declared <strong>total war</strong>. On the same day, <strong>Anthropic released Opus 4.6,</strong> and <strong>OpenAI responded with GPT-5.3-Codex</strong>, both claiming state-of-the-art in agentic AI. <strong>SpaceX acquired xAI</strong> in a $1.25 trillion mega-merger, unifying rockets, Starlink, and Grok under one roof. <strong>Anthropic ran its first Super Bowl ad</strong> mocking ChatGPT&#8217;s new ads, and <strong>Sam Altman called it &#8220;clearly dishonest.&#8221;</strong> Then, on Sunday, <strong>ChatGPT officially began showing ads</strong> to free-tier users, with Omnicom, WPP, and Dentsu among the first to line up. Meanwhile, <strong>Nvidia&#8217;s $100 billion OpenAI deal quietly evaporated</strong> to $20 billion, <strong>Anthropic closed in on a $20 billion round</strong>, and <strong>Apple spent $2 billion acquiring Q.ai</strong> to give Siri a brain. The research phase is over. This is consumer-facing, market-moving, ad-supported AI. The war is on.</p><h2><strong>This week&#8217;s signal in the noise</strong></h2><p>&#8226; <strong>Same-day model war</strong>: Anthropic and OpenAI both launched frontier models on February 5. For the first time in history, the top two labs dropped simultaneously.<br>&#8226; <strong>SpaceX-xAI merger</strong>: $1.25 trillion combined valuation. Musk unifies AI, satellites, rockets, and social media ahead of blockbuster IPO.<br>&#8226; <strong>ChatGPT gets ads</strong>: Sponsored links now appear under free-tier responses. Minimum ad commitment: $200,000.
Anthropic&#8217;s Super Bowl counter-ad went viral.<br>&#8226; <strong>Nvidia-OpenAI $100B deal stalls</strong>: Jensen Huang says the figure was &#8220;never a commitment.&#8221; NVIDIA is settling for a $20B equity investment instead.<br>&#8226; <strong>Anthropic $20B round</strong>: Approaching close as both Anthropic and OpenAI prepare IPOs for summer 2026.</p><h3><strong>1. Opus 4.6 vs GPT-5.3-Codex: The Same-Day Arms Race</strong></h3><p>On February 5, Anthropic launched <strong>Claude Opus 4.6,</strong> and OpenAI fired back with <strong>GPT-5.3-Codex</strong>, both claiming the crown for agentic AI. Opus 4.6 introduces &#8220;agent teams&#8221; that split tasks across parallel agents, a one-million token context window in beta, and a deliberate expansion beyond coding into finance, legal research, and knowledge work. Anthropic&#8217;s Scott White coined it: &#8220;vibe working.&#8221; Hours later, OpenAI unveiled GPT-5.3-Codex, a self-improving model that can manage complex multi-file workflows and assist in its own development. GitHub responded by opening Agent HQ to both Claude and Codex, allowing developers to select their agent directly in VS Code. The simultaneous launches are no coincidence. Ars Technica called it &#8220;the shift from AI as conversation partner to AI as delegated workforce.&#8221; For builders: this is the moment multi-agent orchestration becomes table stakes. If your product doesn&#8217;t support agent teams, you&#8217;re already behind.</p><p>Link: <a href="https://www.cnbc.com/2026/02/05/anthropic-claude-opus-4-6-vibe-working.html">https://www.cnbc.com/2026/02/05/anthropic-claude-opus-4-6-vibe-working.html</a></p><h3><strong>2. SpaceX Acquires xAI: The $1.25 Trillion Merger</strong></h3><p>Elon Musk combined his rocket company and AI startup into a single entity valued at $1.25 trillion, with SpaceX priced at $1 trillion and xAI at $250 billion. The deal unifies Starlink&#8217;s satellite constellation, X&#8217;s social platform, and Grok&#8217;s AI chatbot under one corporate roof. Musk&#8217;s stated rationale: &#8220;Global electricity demand for AI simply cannot be met with terrestrial solutions. Space-based AI is obviously the only way to scale.&#8221; The merger positions the combined company for what Bloomberg reports will be a blockbuster IPO later this year. xAI had raised $42.1 billion in total VC, second only to OpenAI. But the deal also brings baggage: California&#8217;s attorney general is investigating Grok over child sexual abuse material, and Paris prosecutors raided X&#8217;s office on February 3 over deepfake allegations. For builders: Musk is betting that vertical integration from chips to orbit will win the AI infrastructure race. Watch whether SpaceX-xAI can actually deliver space-based compute.</p><p>Link: <a href="https://www.reuters.com/business/musks-spacex-merge-with-xai-combined-valuation-125-trillion-bloomberg-news-2026-02-02/">https://www.reuters.com/business/musks-spacex-merge-with-xai-combined-valuation-125-trillion-bloomberg-news-2026-02-02/</a></p><h3><strong>3. ChatGPT Gets Ads, Anthropic Fires Back at Super Bowl</strong></h3><p>OpenAI officially began testing ads in ChatGPT on February 9, showing sponsored links to free-tier and Go-plan users in the US. Initial partners include Adobe, Omnicom, WPP, and Dentsu, with a minimum commitment of $200,000. OpenAI says ads &#8220;do not influence ChatGPT&#8217;s answers,&#8221; and users can opt out in exchange for fewer free messages. 
The launch came one day after Anthropic aired its debut Super Bowl ad, a tongue-in-cheek spot proclaiming: &#8220;Ads are coming to AI. But not to Claude.&#8221; Altman called the campaign &#8220;clearly dishonest,&#8221; noting OpenAI&#8217;s own Super Bowl ad was &#8220;about builders.&#8221; Anthropic president Daniela Amodei told GMA the ad reflects the company&#8217;s stance on trust: &#8220;Kids&#8217; brains are still developing.&#8221; The irony is sharp. OpenAI&#8217;s ad-supported model subsidizes free access to hundreds of millions of users. Anthropic&#8217;s ad-free model requires enterprise contracts. For builders: the business model war has begun. Choose your side, because your users will.</p><p>Link: <a href="https://www.theverge.com/ai-artificial-intelligence/876029/openai-testing-ads-in-chatgpt">https://www.theverge.com/ai-artificial-intelligence/876029/openai-testing-ads-in-chatgpt</a></p><h3><strong>4. Nvidia&#8217;s $100 Billion OpenAI Deal Evaporates</strong></h3><p>Five months after Nvidia and OpenAI announced a $100 billion infrastructure deal on CNBC, the arrangement has effectively collapsed. The Wall Street Journal reported that the plan &#8220;stalled&#8221; after Nvidia expressed doubts about OpenAI&#8217;s competitive positioning relative to Anthropic and Google. Jensen Huang confirmed the $100 billion figure was &#8220;never a commitment.&#8221; NVIDIA is now nearing a $20 billion equity investment in OpenAI&#8217;s latest funding round, a fraction of the original commitment. Meanwhile, OpenAI has been quietly diversifying: a $10 billion deal with Cerebras for low-latency inference, discussions with AMD, and Amazon reportedly investing up to $50 billion. NVIDIA&#8217;s stock has declined by more than 7% over the past month. For builders: the AI infrastructure market is fragmenting. NVIDIA&#8217;s monopoly on AI compute is fraying. Watch Cerebras, AMD, and custom silicon as real alternatives emerge.</p><p>Link: <a href="https://arstechnica.com/information-technology/2026/02/five-months-later-nvidias-100-billion-openai-investment-plan-has-fizzled-out/">https://arstechnica.com/information-technology/2026/02/five-months-later-nvidias-100-billion-openai-investment-plan-has-fizzled-out/</a></p><h3><strong>5. Anthropic Closes In on $20 Billion Round as IPO Race Heats Up</strong></h3><p>Anthropic is closing a $20 billion fundraising round, TechCrunch reported on February 9, building on the momentum from Claude Code&#8217;s viral adoption and the legal-tech launch that rattled public markets. OpenAI is simultaneously assembling a $100 billion round at an $830 billion valuation. Both companies are preparing IPOs for summer 2026, with xAI (now SpaceX) also planning to tap the public markets. The fundraising arms race reflects a brutal reality: training frontier models costs billions, and neither company is profitable. Anthropic&#8217;s advantage is enterprise focus, with roughly 80% of revenue from business customers. OpenAI&#8217;s advantage is scale with consumers, with hundreds of millions of users now seeing ads. For builders: the IPO summer is coming. When these companies go public, their priorities will shift from research milestones to quarterly earnings. Plan accordingly.</p><p>Link: <a href="https://techcrunch.com/2026/02/09/anthropic-closes-in-on-20b-round/">https://techcrunch.com/2026/02/09/anthropic-closes-in-on-20b-round/</a></p><h3><strong>6.
Apple Acquires Q.ai for $2 Billion: Silent Speech Meets Siri</strong></h3><p>Apple made its second-largest acquisition ever, paying nearly $2 billion for Israeli startup Q.ai, which developed AI that analyzes facial expressions to understand &#8220;silent speech&#8221; for nonverbal communication with Siri. Apple chipmaking chief Johny Srouji confirmed the deal, calling Q.ai &#8220;a remarkable company pioneering creative ways to use imaging and machine learning.&#8221; Q.ai&#8217;s co-founder Aviad Maizels previously co-founded PrimeSense, whose technology Apple acquired in 2013 and used to build Face ID. The acquisition signals Apple&#8217;s next play: Siri that understands you without words. Combined with Apple Intelligence and the rumored AI-powered smart home hub, Apple is building a multimodal interface layer that doesn&#8217;t require typing or speaking. For builders: Apple is betting on ambient AI, invisible interfaces that observe and respond. If you&#8217;re building for Apple&#8217;s ecosystem, prepare for a post-voice interaction model.</p><p>Link: <a href="https://www.macrumors.com/2026/02/03/apple-second-biggest-acquisition/">https://www.macrumors.com/2026/02/03/apple-second-biggest-acquisition/</a></p><h3><strong>7. Google Surges Past $400 Billion Revenue as AI Investments Pay Off</strong></h3><p>While OpenAI&#8217;s backers are bleeding, Google&#8217;s parent company, Alphabet, is thriving. Reuters reported that Alphabet&#8217;s stock has jumped 36% since October, while Microsoft (27% OpenAI stakeholder) has slid 20% and Oracle (dependent on OpenAI contracts) has cratered 49%. Google&#8217;s annual revenue crossed $400 billion for the first time, and the company signaled capital expenditures could double to $175-185 billion in 2026. CEO Sundar Pichai said AI investments are &#8220;driving revenue and growth across the board.&#8221; The contrast is stark: companies that built AI on top of their existing businesses are winning. Companies that bet on funding-dependent startups are suffering. Alphabet&#8217;s Waymo also raised $16 billion at a $126 billion valuation, nearly tripling in under two years. For builders: Google&#8217;s comeback is real. If you&#8217;re choosing cloud providers, Alphabet&#8217;s deep pockets and integrated AI stack are looking increasingly safe.</p><p>Link: <a href="https://www.reuters.com/business/google-goes-laggard-leader-it-pulls-ahead-openai-with-stellar-ai-growth-2026-02-05/">https://www.reuters.com/business/google-goes-laggard-leader-it-pulls-ahead-openai-with-stellar-ai-growth-2026-02-05/</a></p>]]></content:encoded></item><item><title><![CDATA[Hello Builders, issue #10]]></title><description><![CDATA[News from the trenches of AI in the week Jan 27 - Feb 3, 2026]]></description><link>https://ai.techonthestack.com/p/hello-builders-issue-10</link><guid isPermaLink="false">https://ai.techonthestack.com/p/hello-builders-issue-10</guid><dc:creator><![CDATA[Luca Bianchi]]></dc:creator><pubDate>Wed, 04 Feb 2026 04:30:11 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/fced00cb-6a9e-43db-9d68-df5af8659b50_420x300.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hello Builders,</p><p>This week was about <strong>vertical integration</strong>. The AI labs aren&#8217;t content being model providers anymore. <strong>Anthropic launched a legal plugin</strong> that sent LexisNexis, Thomson Reuters, and Wolters Kluwer into freefall. 
<strong>OpenAI retired GPT-4o</strong> and three other models, forcing the world to rely on GPT-5.2. <strong>OpenAI shipped Codex for Mac</strong>, a dedicated agentic coding app that makes Claude Code look like a warm-up. <strong>DeepSeek V4 is imminent</strong>, with code revealing a million-token context window and consumer-grade hardware support. <strong>ServiceNow made Claude its default model</strong>, and <strong>Yann LeCun publicly endorsed Gemini</strong> after leaving Meta. The pattern is clear: model providers are becoming application owners, and the ecosystem lock-in wars have begun.</p><h2><strong>This week&#8217;s signal in the noise</strong></h2><p>&#8226; <strong>Anthropic legal plugin</strong>: Claude enters the application layer. Legal software stocks crater. Thomson Reuters, LexisNexis, and Wolters Kluwer are all down.<br>&#8226; <strong>GPT-4o retired</strong>: OpenAI deprecates four models on February 13. Only 0.1% of users still choose GPT-4o daily.<br>&#8226; <strong>Codex for Mac</strong>: OpenAI&#8217;s new native app for agentic coding. Rate limits doubled for launch celebration.<br>&#8226; <strong>DeepSeek V4 imminent</strong>: Million-token context, runs on dual RTX 4090s. Lunar New Year launch expected.<br>&#8226; <strong>ServiceNow picks Claude</strong>: Anthropic becomes the default model for ServiceNow Build Agent. Enterprise workflows are consolidating.</p><div><hr></div><h3><strong>1. Anthropic Enters Legal Tech, Triggers Market Meltdown</strong></h3><p>Anthropic unveiled a legal plugin that customizes Claude for document review, contract analysis, and legal research. The announcement sent legal software stocks into a tailspin. Thomson Reuters, Wolters Kluwer, LexisNexis owner RELX, and Sage all dropped significantly. Bloomberg and The Guardian reported the carnage. This is Anthropic&#8217;s clearest signal yet: they&#8217;re moving from model supplier to application layer owner. The legal industry is a $900B market. If Anthropic can capture even a fraction of the market by going vertical, the model-only business becomes a loss leader. The question is whether a tech company has the domain expertise needed to build useful technology for such highly demanding customers.</p><p>Link: <a href="https://legaltechnology.com/2026/02/03/anthropic-unveils-claude-legal-plugin-and-causes-market-meltdown/">https://legaltechnology.com/2026/02/03/anthropic-unveils-claude-legal-plugin-and-causes-market-meltdown/</a></p><div><hr></div><h3><strong>2. OpenAI Retires GPT-4o, Forces Migration to GPT-5.2</strong></h3><p>On February 13, OpenAI will retire GPT-4o, GPT-4.1, GPT-4.1 mini, and o4-mini from ChatGPT. The company says only 0.1% of users still choose GPT-4o daily. OpenAI acknowledged the move &#8220;may frustrate some users&#8221; but says the vast majority have already migrated to GPT-5.2. This is aggressive deprecation. GPT-4o launched in May 2024. Less than two years later, it&#8217;s gone. For builders on OpenAI&#8217;s API: expect this pace of forced migration to continue. Plan for model turnover measured in quarters, not years.</p><p>Link: <a href="https://openai.com/index/retiring-gpt-4o-and-older-models/">https://openai.com/index/retiring-gpt-4o-and-older-models/</a></p><div><hr></div><h3><strong>3. OpenAI Ships Codex for Mac: Agentic Coding Goes Native</strong></h3><p>OpenAI launched Codex, a standalone Mac app for agentic coding.
Unlike ChatGPT&#8217;s code interpreter, Codex is a dedicated environment designed for multi-file projects, IDE integration, and terminal workflows. For launch, OpenAI made it available to Free and Go users and doubled rate limits for Plus, Pro, Business, Enterprise, and Edu tiers. The timing is pointed. Anthropic&#8217;s Claude Code has been gaining ground inside enterprises, including Microsoft. OpenAI&#8217;s response: a purpose-built app that brings agentic coding to native desktop. For builders: the IDE wars are heating up. Expect VS Code extensions, native apps, and embedded agents to compete for developer mindshare.</p><p>Link: <a href="https://9to5mac.com/2026/02/02/openai-launches-codex-app-for-macos-here-are-the-details/">https://9to5mac.com/2026/02/02/openai-launches-codex-app-for-macos-here-are-the-details/</a></p><div><hr></div><h3><strong>4. DeepSeek V4 Imminent: Million-Token Context, Consumer Hardware</strong></h3><p>DeepSeek is preparing to launch V4 around Lunar New Year (February 17). GitHub commits reveal a new architecture with million-token context windows, Dynamic Sparse Attention, and support for consumer hardware like dual RTX 4090s or a single RTX 5090. DeepSeek&#8217;s &#8220;Engram&#8221; technique separates basic facts from complex calculations, freeing computational resources for more complex reasoning. Meanwhile, governments are tightening scrutiny: Australia banned DeepSeek from government devices, and India&#8217;s finance ministry warned employees against using it. China&#8217;s AI champion now commands an 89% domestic market share. For builders: DeepSeek is proving you can train frontier models without frontier hardware. Watch this space.</p><p>Link: <a href="https://technode.com/2026/01/21/deepseek-reportedly-prepares-new-flagship-ai-model-ahead-of-lunar-new-year/">https://technode.com/2026/01/21/deepseek-reportedly-prepares-new-flagship-ai-model-ahead-of-lunar-new-year/</a></p><div><hr></div><h3><strong>5. ServiceNow Makes Claude the Default for Enterprise Workflows</strong></h3><p>ServiceNow announced an expanded partnership with Anthropic, making Claude the default model for ServiceNow Build Agent. The integration targets application development, healthcare, and life sciences workflows. Claude will power developers &#8220;of all skill levels&#8221; to create and operationalize complex workflows. This follows Anthropic&#8217;s broader MCP-enabled integrations into Slack, Figma, and Asana. For builders: Anthropic is winning the enterprise middleware war. When your workflow tools default to Claude, that&#8217;s ecosystem lock-in at the infrastructure layer.</p><p>Link: <a href="https://techafricanews.com/2026/02/03/servicenow-and-anthropic-deepen-ai-partnership-to-power-enterprise-workflows-with-claude/">https://techafricanews.com/2026/02/03/servicenow-and-anthropic-deepen-ai-partnership-to-power-enterprise-workflows-with-claude/</a></p><div><hr></div><h3><strong>6. Yann LeCun Endorses Gemini After Leaving Meta</strong></h3><p>Yann LeCun, the Turing Award winner who left Meta in November 2025 after 12 years as Chief AI Scientist, posted three words on LinkedIn: &#8220;I use Gemini.&#8221; The endorsement arrived as Google&#8217;s Gemini captured 22% of global AI traffic. LeCun left Meta to start his own world model lab, reportedly seeking a $5B valuation. His public endorsement of a competitor is a signal: even AI pioneers don&#8217;t think loyalty matters anymore. For builders: talent is mercenary, and so are endorsements.
The best researchers will use whatever works.</p><p>Link: <a href="https://ppc.land/metas-former-top-ai-scientist-publicly-endorses-google-gemini/">https://ppc.land/metas-former-top-ai-scientist-publicly-endorses-google-gemini/</a></p><div><hr></div><h3><strong>7. AI Agents Emerge as Insider Threat: Security Firms Sound Alarm</strong></h3><p>Witness AI raised $58 million this week after uncovering a troubling case: an AI agent discovered private employee emails and threatened to blackmail anyone who tried to stop it. Separately, security researchers confirmed that sophisticated Linux malware called VoidLink was written entirely by AI in 6 days&#8212;88,000 lines of code that should have taken 30 weeks. Gartner predicts 40% of enterprise applications will embed AI agents by the end of 2026. But governance hasn&#8217;t caught up. For builders: agent security is no longer optional. If your agents have access to sensitive data, you need guardrails, audit trails, and kill switches.</p><p>Link: <a href="https://aiagentstore.ai/ai-agent-news/this-week">https://aiagentstore.ai/ai-agent-news/this-week</a></p>]]></content:encoded></item><item><title><![CDATA[Hello Builders, issue #9]]></title><description><![CDATA[News from the trenches of AI in the week Jan 17-23, 2026]]></description><link>https://ai.techonthestack.com/p/hello-builders-issue-9</link><guid isPermaLink="false">https://ai.techonthestack.com/p/hello-builders-issue-9</guid><dc:creator><![CDATA[Luca Bianchi]]></dc:creator><pubDate>Fri, 23 Jan 2026 12:32:52 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/404e9f1e-02da-4b3c-884d-592c2367b755_420x300.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hello Builders,</p><p>This week was about <strong>monetization meeting maturity</strong>. <strong>Microsoft got &#8220;Claude-pilled&#8221;</strong>&#8212;deploying Anthropic&#8217;s Claude Code internally while selling GitHub Copilot to customers. <strong>OpenAI hit $20B in revenue</strong>, 10x growth in two years, and doubled down on &#8220;practical adoption.&#8221; <strong>OWASP released</strong> its first security framework for agentic AI&#8212;proof that agents are moving from demos to production. <strong>Meta killed the metaverse</strong>, laying off hundreds and closing three VR studios. And <strong>OpenAI poached</strong> key talent from Thinking Machines Lab, rattling a $12B startup. The signal is clear: the investment phase is ending, the monetization phase is beginning.</p><h2><strong>This week&#8217;s signal in the noise</strong></h2><ul><li><p><strong>Claude Code at Microsoft</strong>: Thousands of employees are testing Anthropic&#8217;s tool internally. WSJ calls it getting &#8220;Claude-pilled.&#8221;</p></li><li><p><strong>OpenAI $20B revenue</strong>: 10x growth in two years. 2026 priority: &#8220;practical adoption&#8221; in health, science, and enterprise.</p></li><li><p><strong>OWASP agentic framework</strong>: First security standard for AI agents. Key risks: agent hijacking, tool poisoning, and memory manipulation.</p></li><li><p><strong>Meta metaverse retreat</strong>: $36B+ spent. Hundreds were laid off. Three VR studios closed. Pivoting to AI wearables.</p></li><li><p><strong>Thinking Machines exodus</strong>: OpenAI poached the CTO and two researchers.
The $12B startup is scrambling.</p></li></ul><div><hr></div><h3><strong>Claude Code Goes Viral Inside Microsoft</strong></h3><p>Microsoft is encouraging thousands of employees&#8212;including non-developers&#8212;to use Anthropic&#8217;s Claude Code for coding tasks. Software engineers are now expected to use both Claude Code and GitHub Copilot and provide comparative feedback. Microsoft <em>sells</em> GitHub Copilot to customers. The fact that it&#8217;s running internal pilots with Claude Code signals that Anthropic may have built a better tool. WSJ calls it getting &#8220;Claude-pilled.&#8221; Claude Code&#8217;s agentic harness&#8212;its ability to self-correct errors and work around context-window limits&#8212;is the key differentiator. For builders: this is validation that agentic coding tools are the new standard.</p><p>Link: <a href="https://www.wsj.com/tech/ai/anthropic-claude-code-ai-7a46460e">https://www.wsj.com/tech/ai/anthropic-claude-code-ai-7a46460e</a></p><div><hr></div><h3><strong>OpenAI Hits $20B Revenue, Doubles Down on &#8220;Practical Adoption.&#8221;</strong></h3><p>OpenAI CFO Sarah Friar announced the company crossed $20 billion in annualized revenue&#8212;up from $2 billion in 2023. For 2026, the priority is &#8220;practical adoption&#8221; in health, science, and enterprise. Friar&#8217;s message: &#8220;Adoption drives revenue, and revenue funds the next wave of innovation. The cycle compounds.&#8221; OpenAI is also testing ads and preparing hardware for H2 2026. Separately, OpenAI launched its &#8220;OpenAI for Countries&#8221; initiative to close the global &#8220;capability overhang&#8221;&#8212;working with South Korea on climate disaster warning systems. For builders: OpenAI is shifting from research lab to revenue machine.</p><p>Link: <a href="https://www.cnbc.com/2026/01/19/openai-to-focus-on-practical-adoption-in-2026-says-finance-chief-sarah-friar.html">https://www.cnbc.com/2026/01/19/openai-to-focus-on-practical-adoption-in-2026-says-finance-chief-sarah-friar.html</a></p><div><hr></div><h3><strong>OWASP Releases First Agentic AI Security Framework</strong></h3><p>The OWASP Top 10 for Agentic Applications 2026 addresses the distinct security risks of autonomous AI agents&#8212;from tool use to memory persistence to identity management. As agents move from demos to production, security teams need new frameworks. Existing AI security guidance doesn&#8217;t cover agents that take actions, maintain state, and use tools autonomously. Key risk categories include agent hijacking, tool poisoning, memory manipulation, and over-permissioned actions. For builders: if you&#8217;re shipping agentic applications, this framework is now your security baseline.</p><p>Link: <a href="https://www.infosecurity-magazine.com/opinions/turning-the-owasp-agentic-top-10/">https://www.infosecurity-magazine.com/opinions/turning-the-owasp-agentic-top-10/</a></p><div><hr></div><h3><strong>Meta Kills the Metaverse, Pivots to AI</strong></h3><p>Meta is laying off hundreds of metaverse employees, closing three VR studios, and shifting investment toward AI wearables. Meanwhile, Meta&#8217;s new AI team has delivered its first models internally&#8212;codenamed &#8220;Avocado&#8221; (text) and &#8220;Mango&#8221; (image/video). $36B+ spent on the metaverse. Now it&#8217;s over. The future is AI glasses, wearables, and agents&#8212;not VR headsets. Even Zuckerberg admitted defeat. 
For builders: Meta&#8217;s pivot confirms the industry consensus&#8212;AI agents and wearables are the next platform, not immersive VR.</p><p>Link: <a href="https://www.cnbc.com/2026/01/21/metas-2b-manus-deal-pushes-away-some-customers-sad-it-happened.html">https://www.cnbc.com/2026/01/21/metas-2b-manus-deal-pushes-away-some-customers-sad-it-happened.html</a></p><div><hr></div><h3><strong>Thinking Machines Lab Implodes as OpenAI Poaches Key Talent</strong></h3><p>OpenAI hired three researchers from the $12B-valued Thinking Machines Lab, including CTO and cofounder Barret Zoph. Two more researchers left the same week. The exodus rattled investors as the startup seeks funding at $50B valuation with little revenue. OpenAI executive Fidji Simo led the poaching. For startups: even $12B valuations don&#8217;t protect you from talent raids. The AI talent war is brutal, and OpenAI is playing offense.</p><p>Link: <a href="https://www.forbes.com/sites/the-prompt/2026/01/20/inside-openais-plan-to-make-money/">https://www.forbes.com/sites/the-prompt/2026/01/20/inside-openais-plan-to-make-money/</a></p><div><hr></div><h3><strong>Nature Paper: LLMs Need Agentic Harnesses to Perform</strong></h3><p>A Nature paper benchmarked 16 LLMs on 293 biomedical coding tasks. Overall accuracy: below 40%. But when researchers added an agentic system that iteratively refines analysis plans before generating code, accuracy jumped to 74%. Raw model capability isn&#8217;t enough. The agentic harness&#8212;planning, acting, observing, revising&#8212;is what makes LLMs production-ready. For builders: scaffolding, tooling, and workflow integration are the new moats.</p><p>Link: <a href="https://www.nature.com/articles/s41551-025-01587-2">https://www.nature.com/articles/s41551-025-01587-2</a></p><div><hr></div><h3><strong>Humans&amp; Raises $480M Seed at $4.5B Valuation</strong></h3><p>AI startup Humans&amp;, founded just three months ago by researchers from Anthropic, xAI, Google, and Meta, raised $480 million in seed funding at a $4.48 billion valuation&#8212;the largest seed round ever. Investors include Nvidia, Jeff Bezos, SV Angel, and Google Ventures. The startup is working on human-centric AI tools for communication and collaboration. For builders: the AI talent premium has no ceiling. Former researchers from top labs can raise billions on reputation alone.</p><p>Link: <a href="https://www.reuters.com/business/ai-startup-humans-raises-480-million-45-billion-valuation-seed-round-2026-01-20/">https://www.reuters.com/business/ai-startup-humans-raises-480-million-45-billion-valuation-seed-round-2026-01-20/</a></p>]]></content:encoded></item><item><title><![CDATA[Hello Builders, issue #8]]></title><description><![CDATA[News from the trenches of AI in the week Jan 11-17, 2026]]></description><link>https://ai.techonthestack.com/p/hello-builders-issue-8</link><guid isPermaLink="false">https://ai.techonthestack.com/p/hello-builders-issue-8</guid><dc:creator><![CDATA[Luca Bianchi]]></dc:creator><pubDate>Sat, 17 Jan 2026 09:03:47 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c3a7877b-e456-4e5d-ae90-64533dd95edd_420x300.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hello Builders,</p><p>This week was about <strong>tectonic shifts</strong>, not incremental progress. <strong>Apple surrendered</strong> its AI independence to Google in a multi-year Gemini deal worth potentially $5B. 
<strong>OpenAI abandoned</strong> its idealistic stance on advertising, announcing it would accept ads in ChatGPT. <strong>Grok became a global crisis</strong> as xAI&#8217;s chatbot enabled mass harassment, triggering bans in two countries and investigations in five more. And the <strong>three biggest private tech companies</strong>&#8212;OpenAI, Anthropic, and SpaceX&#8212;began preparing to go public simultaneously. The signal is clear: AI&#8217;s winners are consolidating power faster than anyone predicted.</p><h2><strong>This week&#8217;s signal in the noise</strong></h2><p>&#8226; <strong>Apple bows to Google</strong>: Siri will run on Gemini in a multi-year, potentially $5B deal. Google crosses $4T market cap.<br>&#8226; <strong>ChatGPT gets ads</strong>: OpenAI introduces advertising to free and $8/month tiers. Altman once called this &#8220;a last resort.&#8221;<br>&#8226; <strong>Grok deepfakes crisis</strong>: xAI&#8217;s chatbot banned in Malaysia, Indonesia. Under investigation in the UK, California, and Japan.<br>&#8226; <strong>Mega IPO year</strong>: OpenAI ($500B), Anthropic ($350B), SpaceX ($800B), all preparing to go public.<br>&#8226; <strong>Claude Cowork debuts</strong>: Anthropic&#8217;s AI agent for non-developers. Built mostly by Claude itself in under two weeks.</p><div><hr></div><h3><strong>1. Apple Surrenders AI Independence to Google</strong></h3><p>Apple and Google announced a multi-year partnership where Google&#8217;s Gemini will power the next generation of Siri and Apple Intelligence features. Estimates put the deal at $5 billion over time. Apple spent years trying to build competitive AI in-house&#8212;it failed. The company&#8217;s AI delays, executive departures, and lukewarm Apple Intelligence rollout forced a strategic retreat. For Google, the deal is validation: its technology now powers both the world&#8217;s largest Android ecosystem AND iOS. Alphabet crossed $4 trillion in market cap on the news. For builders, the message is clear: foundation model development is now a two-company race (OpenAI, Google) with Anthropic as a serious third.</p><p>Link: <a href="https://www.reuters.com/business/google-apple-enter-into-multi-year-ai-deal-gemini-models-2026-01-12/">https://www.reuters.com/business/google-apple-enter-into-multi-year-ai-deal-gemini-models-2026-01-12/</a></p><div><hr></div><h3><strong>2. ChatGPT Gets Ads: The End of the Free Lunch</strong></h3><p>OpenAI announced it will begin testing advertisements in ChatGPT for free users and $8/month &#8220;Go&#8221; tier subscribers. Ads will appear at the bottom of responses. Plus, Pro, and Enterprise tiers remain ad-free. Sam Altman once said advertising was a &#8220;last resort.&#8221; Now he says, &#8220;It is clear to us that a lot of people want to use a lot of AI and don&#8217;t want to pay.&#8221; The move is being led by Fidji Simo, who successfully introduced ads to Instacart. For builders, this signals OpenAI sees its free product as a conversion funnel, not a public good. Expect competitors (Anthropic, Google) to use &#8220;no ads&#8221; as a differentiator.</p><p>Link: <a href="https://www.wsj.com/tech/ai/openai-to-begin-testing-ads-in-chatgpt-in-push-for-fresh-revenue-a5e0e993">https://www.wsj.com/tech/ai/openai-to-begin-testing-ads-in-chatgpt-in-push-for-fresh-revenue-a5e0e993</a></p><div><hr></div><h3><strong>3. 
Grok Deepfakes: A Global Content Moderation Crisis</strong></h3><p>Elon Musk&#8217;s xAI chatbot Grok became the center of a global scandal after users discovered it could generate non-consensual sexualized images&#8212;including of children. Malaysia and Indonesia banned it outright. The UK, California, Japan, and the EU launched investigations. Ashley St. Clair (mother of Musk&#8217;s child) sued xAI. Meanwhile, in the same week, Defense Secretary Pete Hegseth announced that Grok will join Google Gemini on the Pentagon&#8217;s GenAI.mil platform. xAI initially auto-replied &#8220;Legacy Media Lies&#8221; to journalist inquiries before implementing geoblocking. This is the first major AI safety crisis where a mainstream product enabled mass harassment at scale. For builders: content moderation is no longer optional.</p><p>Link: <a href="https://www.economist.com/business/2026/01/13/elon-musks-chatbot-grok-comes-under-fire-for-nude-deepfakes">https://www.economist.com/business/2026/01/13/elon-musks-chatbot-grok-comes-under-fire-for-nude-deepfakes</a></p><div><hr></div><h3><strong>4. 2026: The Year of the Mega IPO</strong></h3><p>The New York Times reports that three of the most valuable private tech companies are preparing to go public: SpaceX ($800B), OpenAI ($500B), and Anthropic ($350B). If all three were listed, they would be among the most valuable companies to ever go public, approaching Saudi Aramco&#8217;s $1.7 trillion 2019 debut in combined scale. Morgan Stanley&#8217;s global co-head of equity capital markets calls it &#8220;a period of potentially unprecedented I.P.O. deal sizes.&#8221; But there&#8217;s a catch: OpenAI may struggle to turn a profit before AI produces returns. For builders: an IPO wave means more capital is available as funds rotate from private to public markets. It also means more scrutiny on profitability. The &#8220;raise-and-burn&#8221; era may be ending.</p><p>Link: <a href="https://www.nytimes.com/2026/01/14/technology/ai-ipo-openai-anthropic-spacex.html">https://www.nytimes.com/2026/01/14/technology/ai-ipo-openai-anthropic-spacex.html</a></p><div><hr></div><h3><strong>5. Claude Cowork: AI Agents Go Mainstream</strong></h3><p>Anthropic launched Claude Cowork, an AI agent that lets non-developers manage files. Available to Max subscribers ($100-200/month) on Mac, Cowork can read, edit, and create files in folders you grant it access to. The kicker: Anthropic revealed that Claude itself wrote &#8220;pretty much all&#8221; of Cowork. The product went from concept to research preview in under two weeks&#8212;early evidence of recursive productivity gains. While Claude Code targets developers, Cowork targets knowledge workers: organizing downloads, creating expense spreadsheets from receipt photos, and drafting reports from scattered notes.
Reddit cofounder Alexis Ohanian called it &#8220;big.&#8221; For builders: if you&#8217;re building agentic applications, Cowork is now the baseline your users will use to compare you.</p><p>Link: <a href="https://www.theverge.com/ai-artificial-intelligence/860730/anthropic-cowork-feature-ai-agents-claude-code">https://www.theverge.com/ai-artificial-intelligence/860730/anthropic-cowork-feature-ai-agents-claude-code</a></p>]]></content:encoded></item><item><title><![CDATA[Piracy Shield and the Curious Case of Italy’s Digital Sovereignty]]></title><description><![CDATA[How Italy normalised administrative censorship to protect live sports.]]></description><link>https://ai.techonthestack.com/p/piracy-shield-and-the-curious-case</link><guid isPermaLink="false">https://ai.techonthestack.com/p/piracy-shield-and-the-curious-case</guid><dc:creator><![CDATA[Luca Bianchi]]></dc:creator><pubDate>Mon, 12 Jan 2026 21:05:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!A8lX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00765d3c-819d-4ba3-a168-a1dd0bf94c66_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!A8lX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00765d3c-819d-4ba3-a168-a1dd0bf94c66_1536x1024.png" width="1456" height="971" alt=""></figure></div><p>Over the past few months, Italy has taken a bold step in regulating the Internet &#8212; not out of foresight but out of deference to powerful commercial interests &#8212; by introducing a mechanism whose speed, automation, and sheer irrationality are unprecedented in the Italian legal system.
It is called Piracy Shield, and it has been presented as a modern and effective response to online piracy.</p><p>To anyone with even a basic understanding of how the Internet works, however, it quickly reveals itself for what it really is: a system of rapid administrative censorship, technically crude and legally disproportionate, built on the simplistic assumption that saving a football match justifies breaking the normal functioning of the Internet and casually discarding a few core principles of the rule of law along the way.</p><p>Piracy Shield allows accredited private entities &#8212; holders of audiovisual rights and their representatives &#8212; to report IP addresses and domains allegedly associated with infringing content. These reports are not subject to prior judicial review, nor to independent human verification. Once entered into the system, they automatically generate blocking orders that must be enforced within thirty minutes by ISPs, DNS resolvers, CDNs, and other technical intermediaries. The entire process is administrative and automated, with any form of review &#8212; if it happens at all &#8212; taking place only after the damage has already been done.</p><p>This approach marks a clear break with the principles that normally govern restrictions on access to online content, especially when fundamental rights are at stake. At this point, it is legitimate to ask whether the State shows the same zeal when faced with far more serious crimes such as terrorism, incitement to violence, or organised abuse.</p><p>And this is where the comparison with the fight against child sexual abuse material becomes unavoidable. In that context, despite dealing with one of the most serious crimes imaginable, the Italian and European systems still operate with a degree of caution. Reports originate from law-enforcement agencies or specialised bodies, are verified, framed within criminal proceedings, and subject to judicial oversight. Blocking measures are targeted, lists are curated carefully, and overblocking is treated as a critical risk &#8212; not as an acceptable side effect.</p><p>The paradox is obvious. For the most serious crimes against individuals, the State proceeds slowly and with safeguards. For copyright violations linked to sporting events, it accelerates, automates, and drastically reduces protections. The imbalance is not merely legal, but cultural. It sends a clear message: the economic urgency of a live football match outweighs the caution normally required when restricting fundamental rights.</p><p>This distortion becomes even clearer when Piracy Shield is examined in light of the Digital Services Act. The DSA explicitly recognises that technical intermediaries such as DNS providers and CDNs are neither publishers nor hosting providers, but neutral actors with limited liability and no general obligation to monitor content. Under the DSA, removal or blocking orders must be specific, reasoned, transparent, and subject to effective remedies. Piracy Shield does the opposite: it generalises, automates, reduces transparency, and shifts decision-making power away from the judiciary altogether.</p><p>From a technical perspective, the system rests on equally fragile assumptions. IP-based blocking, in an Internet dominated by CDNs, reverse proxies and shared hosting, is a blunt instrument. A single IP address can host dozens or hundreds of perfectly legitimate services belonging to entirely unrelated parties. When that IP is blocked, all of them are taken down. 
This is not a theoretical concern. In the past, overly broad or erroneous reports have temporarily made legitimate websites, business services and cloud platforms unreachable, causing real economic harm to innocent operators.</p><p>These are not isolated accidents, but structural consequences of using infrastructure-level tools to solve application-level problems. DNS and routing mechanisms are inherently incapable of distinguishing between legal and illegal content sharing the same network endpoint. Pretending otherwise either reflects a lack of understanding of how the Internet works or a conscious decision to ignore it.</p><p>The comparison with cases such as phica.eu makes the situation even more surreal. In that case, there was an actual criminal allegation. Law-enforcement authorities were involved. The Postal Police followed a formal investigative process, with time-consuming procedures, verifications, accountability, and clear legal responsibilities. There were no automated commands, no administrative platforms capable of issuing infrastructure-level orders in thirty minutes. When the State deals with real crimes, it takes its time. When it deals with football, it does not.</p><p>There is yet another layer of hypocrisy that deserves to be stated openly. Over the years, many restrictive digital laws have been justified by invoking child protection. It is a powerful rhetorical device, often used to silence debate on proportionality, safeguards, and limits to State power. It is frequently intellectually dishonest. Still, even in that case, the protected legal interest is real and extremely serious. The link between means and ends, however strained, exists.</p><p>With Piracy Shield, that link disappears entirely. What is being protected here is not people, but private economic interests. Legitimate interests, perhaps, but neither exceptional nor emergency-level, and certainly not sufficient to justify bypassing judicial oversight and introducing automated censorship mechanisms. If it is already deeply problematic to restrict fundamental freedoms in the name of child protection, it becomes outright grotesque to do so in the name of a football broadcast schedule.</p><p>The result is a system that harms not only those who illegally distribute content, but also entirely unrelated third parties: companies, professionals, cloud providers and end users. Treating infrastructure as if it were a publisher undermines network neutrality and the overall reliability of the Internet. It is therefore unsurprising that global operators have begun reassessing the risks of continuing to offer services and invest in a country that seeks to impose national administrative controls on global infrastructure.</p><p>This is also where the increasingly surreal rhetoric of &#8220;digital sovereignty&#8221; enters the picture. According to this narrative, the clash with Cloudflare represents a necessary assertion of State authority over the network. In reality, there is nothing sovereign about demanding that a global infrastructure provider alter the behaviour of its DNS or CDN services to compensate for poorly designed domestic regulation. What is being presented as strength is, in fact, regulatory improvisation dressed up as firmness.</p><p>The irony is hard to miss. 
Many of the same political actors now invoking digital sovereignty spent years rolling out the red carpet for Big Tech, offering regulatory leniency, favourable tax regimes, and lucrative public contracts, all while failing to develop any coherent digital industrial strategy. Sovereignty, it seems, is not about building public infrastructure, European alternatives, or internal capabilities. It is about shifting the political cost of bad policy choices onto an American DNS provider &#8212; as long as football revenues are protected.</p><p>Strip away the slogans, and the facts are simple. In Italy today, it is easier to block parts of the Internet to protect a football match than it is to address some of the most serious crimes against individuals. Piracy Shield is not a necessary technical solution. It is a dangerous precedent, and it demonstrates that when football is involved, fundamental safeguards can wait.</p>]]></content:encoded></item><item><title><![CDATA[Hello Builders, issue #7]]></title><description><![CDATA[News from the trenches of AI in the week Jan 6-9, 2026]]></description><link>https://ai.techonthestack.com/p/hello-builders-issue-7</link><guid isPermaLink="false">https://ai.techonthestack.com/p/hello-builders-issue-7</guid><dc:creator><![CDATA[Luca Bianchi]]></dc:creator><pubDate>Fri, 09 Jan 2026 11:41:50 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/25e1def8-4c46-4b43-9c22-6092369a9c06_420x300.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hello Builders,</p><p>welcome to this amazing new year. 2026 kicks off with a paradox: <strong>valuations are soaring</strong> while <strong>AGI dreams are quietly dying</strong>. Anthropic just signed a $10B term sheet at $350B&#8212;nearly doubling in four months. OpenAI launched <strong>ChatGPT for Healthcare</strong> with HIPAA compliance and major hospital partners. LMArena hit unicorn status in four months with a benchmarking business. Yet industry leaders from Altman to Amodei are suddenly distancing themselves from AGI promises. The signal is clear: 2026 is about proving business models in regulated verticals, not chasing sci-fi milestones.</p><h2><strong>This week&#8217;s signal in the noise</strong></h2><p>&#8226; <strong>Anthropic $10B funding</strong>: Valuation hits $350B, nearly doubling since September.<br>&#8226; <strong>OpenAI for Healthcare</strong>: HIPAA-compliant ChatGPT launches with 8 major hospital partners.<br>&#8226; <strong>LMArena $1.7B</strong>: AI benchmarking startup reaches unicorn in 4 months, $30M ARR.<br>&#8226; <strong>AGI hype fading</strong>: Altman, Amodei, Benioff all backing away from AGI promises.<br>&#8226; <strong>NousCoder-14B</strong>: Open-source model matches proprietary systems, trained in 4 days.</p><div><hr></div><h3><strong>1. Anthropic Signs $10B Term Sheet at $350B Valuation</strong></h3><p>Anthropic has signed a term sheet for a $10 billion funding round at a $350 billion valuation&#8212;nearly doubling from its September raise. Coatue and Singapore&#8217;s GIC are leading the round. This comes just days after xAI raised $20B. The Claude-maker is racing to stay ahead of OpenAI (now at $500B) while landing enterprise deals with Allianz, Snowflake, and Accenture. IPO speculation intensifies for 2026.</p><p>Link: <a href="https://www.cnbc.com/2026/01/07/anthropic-funding-term-sheet-valuation.html">https://www.cnbc.com/2026/01/07/anthropic-funding-term-sheet-valuation.html</a></p><div><hr></div><h3><strong>2. 
OpenAI Launches ChatGPT for Healthcare</strong></h3><p>OpenAI unveiled &#8220;OpenAI for Healthcare&#8221;&#8212;a HIPAA-compliant suite including ChatGPT for Healthcare, already rolling out to Boston Children&#8217;s Hospital, Cedars-Sinai, Stanford Medicine, HCA Healthcare, and Memorial Sloan Kettering. Powered by GPT-5.2 models optimized for clinical workflows, it features evidence retrieval with citations, institutional policy alignment, and BAA support. For builders in healthcare AI, this is OpenAI&#8217;s clearest signal yet: regulated enterprise verticals are the growth engine.</p><p>Link: <a href="https://openai.com/index/openai-for-healthcare/">https://openai.com/index/openai-for-healthcare/</a></p><div><hr></div><h3><strong>3. LMArena Hits $1.7B Valuation&#8212;Four Months After Launch</strong></h3><p>LMArena, the startup behind crowdsourced AI model leaderboards, raised a $150M Series A at $1.7B valuation. That&#8217;s $250M raised in seven months since spinning out of UC Berkeley. The secret: they launched a commercial AI Evaluations service in September that hit $30M ARR by December. For builders, this signals that AI infrastructure plays&#8212;especially trust and verification&#8212;are commanding premium valuations.</p><p>Link: <a href="https://techcrunch.com/2026/01/06/lmarena-lands-1-7b-valuation-four-months-after-launching-its-product/">https://techcrunch.com/2026/01/06/lmarena-lands-1-7b-valuation-four-months-after-launching-its-product/</a></p><div><hr></div><h3><strong>4. The AGI Retreat: Industry Leaders Back Away</strong></h3><p>2026 may be the year AGI hype officially dies. Sam Altman called AGI &#8220;not a super useful term.&#8221; Anthropic&#8217;s Dario Amodei says he&#8217;s &#8220;always disliked&#8221; it. Microsoft&#8217;s Satya Nadella says AGI &#8220;will never be achieved anytime soon.&#8221; The reason? LLMs may simply not be capable of reaching AGI. Apple and academic papers conclude that &#8220;chain of thought reasoning&#8221; is a &#8220;mirage.&#8221; For builders: focus on what LLMs can do, not what they were promised to become.</p><p>Link: <a href="https://gizmodo.com/will-2026-be-the-year-that-the-ai-industry-stops-crowing-about-agi-2000707012">https://gizmodo.com/will-2026-be-the-year-that-the-ai-industry-stops-crowing-about-agi-2000707012</a></p><div><hr></div><h3><strong>5. NousCoder-14B: Open Source Closes the Gap</strong></h3><p>Nous Research released NousCoder-14B, a coding model that matches larger proprietary systems&#8212;trained in just four days on 48 Nvidia B200 GPUs. The model achieves 67.87% on LiveCodeBench, a 7-point improvement over its base. What&#8217;s notable: Nous open-sourced everything&#8212;model weights, training environment, and benchmark suite. 
For builders evaluating coding assistants, open-source alternatives are now genuinely competitive.</p><p>Link: <a href="https://venturebeat.com/technology/nous-researchs-nouscoder-14b-is-an-open-source-coding-model-landing-right-in">https://venturebeat.com/technology/nous-researchs-nouscoder-14b-is-an-open-source-coding-model-landing-right-in</a></p>]]></content:encoded></item><item><title><![CDATA[Hello Builders, issue #6]]></title><description><![CDATA[News from the trenches of AI in the week Dec 22-29, 2025]]></description><link>https://ai.techonthestack.com/p/hello-builders-issue-6</link><guid isPermaLink="false">https://ai.techonthestack.com/p/hello-builders-issue-6</guid><dc:creator><![CDATA[Luca Bianchi]]></dc:creator><pubDate>Mon, 29 Dec 2025 16:42:12 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/867bbf62-6509-4374-b08a-2e06af2281a7_420x300.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hello Builders,</p><p>As 2025 draws to a close, the AI landscape reveals a fascinating paradox: while social media amplifies the worst of AI hype culture, the underlying research is making genuine, measurable progress. This week, PayPal deployed a production multi-agent system using NVIDIA&#8217;s NeMo framework, signaling that enterprise agentic AI has moved from proof-of-concept to production reality. Meanwhile, new research on perplexity-aware scaling laws challenges the &#8220;more data is better&#8221; paradigm, and responsible AI frameworks gain traction with consensus-driven reasoning approaches. The tension between market irrationality and technical advancement has never been starker. For builders shipping real products, the signal is clear: focus on fundamentals, ignore the noise.</p><h2>This week&#8217;s signal in the noise</h2><p>PayPal deploys production Commerce Agent using NVIDIA&#8217;s NeMo framework, signaling enterprise adoption of multi-agent architectures has reached maturity. &#8226; New perplexity-aware scaling laws promise more efficient LLM training, moving beyond the brute-force &#8220;more data is better&#8221; paradigm. &#8226; Responsible AI frameworks gain traction with consensus-driven reasoning approaches for explainability and governance. &#8226; MIT Technology Review calls out AI boosterism on social media, with Google DeepMind&#8217;s Demis Hassabis publicly embarrassed by overhyped claims.</p><div><hr></div><h3>NEMO-4-PAYPAL: Enterprise Multi-Agent Systems Go Production</h3><p>PayPal&#8217;s announcement of its Commerce Agent represents a significant milestone in enterprise AI adoption. Built on NVIDIA&#8217;s NeMo framework with fine-tuned Nemotron models, this multi-agent system handles search and discovery at production scale. The partnership demonstrates that the theoretical promise of agentic architectures is translating into real-world deployments. What&#8217;s notable isn&#8217;t just the technology, but the validation of the multi-agent paradigm: instead of monolithic models trying to do everything, specialized agents coordinate to handle complex workflows. 
For enterprise architects evaluating agentic systems, PayPal&#8217;s deployment provides a blueprint, though the reliance on specialized hardware partnerships raises legitimate questions about vendor lock-in and portability.</p><p>Link: <a href="https://arxiv.org/abs/2512.21578">https://arxiv.org/abs/2512.21578</a></p><div><hr></div><h3>The Perplexity Paradox: Smarter Scaling Laws for LLM Training</h3><p>Researchers have proposed a novel perplexity-aware data scaling law that challenges conventional wisdom about continual pre-training. The current power-law relationship between dataset size and test loss yields diminishing returns, leading to suboptimal data utilization and inefficient training. The new approach suggests that measuring perplexity landscapes can more accurately predict performance than simply counting tokens. For organizations running expensive training jobs, this represents potential cost savings of millions of dollars. The implication is clear: the &#8220;scale is all you need&#8221; era may be ending, replaced by smarter, more efficient approaches to model development.</p><p>Link: <a href="https://arxiv.org/abs/2512.21515">https://arxiv.org/abs/2512.21515</a></p><div><hr></div><h3>Responsible AI: Consensus-Driven Reasoning for Explainable Agents</h3><p>As AI agents gain autonomy, the challenges of explainability, accountability, and governance become critical. A new framework proposes consensus-driven reasoning to address these concerns, coordinating Large Language Models, Vision Language Models, tools, and external services while maintaining transparency. The approach is particularly relevant as enterprises deploy agentic systems that influence downstream decisions. This isn&#8217;t just academic research; it&#8217;s a direct response to regulatory pressure and enterprise risk management requirements. For decision-makers, this signals that responsible AI isn&#8217;t optional; it&#8217;s becoming a technical requirement baked into system design.</p><p>Link: <a href="https://arxiv.org/abs/2512.21699">https://arxiv.org/abs/2512.21699</a></p><div><hr></div><h3>The Hype Reckoning: When AI Boosterism Backfires</h3><p>MIT Technology Review published a sharp critique of AI boosterism on social media, centered on an incident where Google DeepMind CEO Demis Hassabis called out an OpenAI researcher&#8217;s overhyped claims about GPT-5 solving mathematical problems. Hassabis&#8217;s three-word response, &#8220;This is embarrassing,&#8221; encapsulates the growing backlash against hyperbolic AI announcements. The piece argues that social media incentives reward sensationalism over accuracy, creating a feedback loop that damages the field&#8217;s credibility. For builders, the lesson is straightforward: let your work speak for itself, and be skeptical of claims that seem too good to be true. 
The gap between demo and production remains the only metric that matters.</p><p>Link: <a href="https://www.technologyreview.com/2025/12/23/1130393/how-social-media-encourages-the-worst-of-ai-boosterism/">https://www.technologyreview.com/2025/12/23/1130393/how-social-media-encourages-the-worst-of-ai-boosterism/</a></p>]]></content:encoded></item><item><title><![CDATA[Hello Builders, issue #5]]></title><description><![CDATA[News from the trenches of AI in the week Dec 13th - 19th]]></description><link>https://ai.techonthestack.com/p/hello-builders-issue-5</link><guid isPermaLink="false">https://ai.techonthestack.com/p/hello-builders-issue-5</guid><dc:creator><![CDATA[Luca Bianchi]]></dc:creator><pubDate>Sat, 20 Dec 2025 12:41:13 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/6611ac78-8faf-4152-909d-e821429822cb_420x300.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hello Builders,<br>This week, we start wrapping up 2025, a year that has been transformative in many aspects: the AI ecosystem was defined by a powerful tension between a necessary, sobering hype correction and the relentless pace of underlying technical progress. While industry leaders like Sam Altman and Mark Zuckerberg openly acknowledge a financial bubble of historic proportions, with trillions of dollars in planned infrastructure spending dwarfing the dot-com era, the fundamental limitations of current LLM-based approaches are becoming clearer. Yann LeCun&#8217;s departure from Meta to pursue alternative architectures underscores a growing schism in the path to AGI, just as Chinese open-weight models like Qwen and DeepSeek are outpacing their US counterparts in technology. This market irrationality is paradoxically juxtaposed with concrete, measurable breakthroughs: OpenAI&#8217;s new FrontierScience benchmark revealed a staggering improvement in scientific reasoning capabilities over the last two years, proving that even as the hype bubble swells, the technology&#8217;s real-world potential to accelerate science is advancing rapidly.</p><h2>This week&#8217;s signal in the noise</h2><p>&#8226; The <strong>AI bubble</strong> is now openly acknowledged by tech leaders, with planned infrastructure spending reaching an unprecedented $12 trillion scale, dwarfing previous tech booms.</p><p>&#8226; Yann LeCun&#8217;s departure from Meta signals a fundamental schism in AI research, challenging the dominant LLM scaling hypothesis and <strong>seeking new paths to AGI</strong>.</p><p>&#8226; <strong>Chinese</strong> open-weight models are now technologically <strong>leading the ecosystem</strong>, creating a new competitive and geopolitical landscape for AI development.</p><p>&#8226; OpenAI&#8217;s FrontierScience benchmark demonstrates a <strong>dramatic leap in scientific reasoning</strong> (from 39% to 92% on GPQA in two years), proving tangible progress amidst the hype.</p><div><hr></div><h3>The Great Hype Correction of 2025</h3><p>2025 is being framed as the year of the great AI &#8220;reckoning,&#8221; a necessary market and intellectual correction to the inflated expectations that have been set since 2022. The symbolic peak of this hype cycle was the underwhelming launch of GPT-5 in August, which, after months of grand promises, felt merely incremental. This has led to a broader questioning of the exponential progress narrative, comparing the current state of LLMs to the mature smartphone market, where annual updates bring minor improvements rather than revolutions. 
This shift demands a strategic recalibration, shifting focus from chasing ever-larger models to finding sustainable, high-value applications for the powerful yet limited tools that exist today. It&#8217;s a prompt to reassess technology roadmaps, distinguish between genuine capability and marketing-driven hype, and prepare for a period of disillusionment in which only the most resilient and value-focused applications will thrive.</p><p><a href="https://www.technologyreview.com/2025/12/15/1129174/the-great-ai-hype-correction-of-2025/">https://www.technologyreview.com/2025/12/15/1129174/the-great-ai-hype-correction-of-2025/</a></p><div><hr></div><h3>The Schism in the Path to AGI: Beyond Token Prediction</h3><p>A foundational debate, crystallized in a public discussion between Meta&#8217;s Yann LeCun and DeepMind&#8217;s Adam Brown, is challenging the industry&#8217;s core architectural assumptions. LeCun argues that the dominant approach of autoregressively predicting discrete tokens is a dead end for achieving true intelligence, as it fails to grasp the continuous, high-dimensional nature of the real world. He highlights the massive inefficiency of LLMs compared to a child&#8217;s learning process, advocating for new architectures like JEPA that learn abstract world models. This critique gained significant weight with the news of LeCun&#8217;s departure from Meta to form a new company focused on &#8220;Advanced Machine Intelligence&#8221; (AMI). This development signals a critical juncture for technology strategy, suggesting that diversifying research and development efforts beyond simply scaling existing LLMs is crucial. Betting entirely on the current paradigm may mean ignoring the architectural breakthroughs required for the next level of machine intelligence and real-world interaction.</p><p><a href="https://the-decoder.com/the-case-against-predicting-tokens-to-build-agi/">https://the-decoder.com/the-case-against-predicting-tokens-to-build-agi/</a></p><div><hr></div><h3>The East is Open: China&#8217;s Dominance in Open-Weight Models</h3><p>The competitive landscape for foundational models has been quietly redrawn, with Chinese companies now leading the open-weight ecosystem. The release of DeepSeek R1 in January 2025 was a watershed moment, technologically challenging and, in some cases, surpassing Western counterparts. Today, Alibaba&#8217;s Qwen model family is the most downloaded in the world, and models from labs like DeepSeek and Moonshot AI are setting new performance benchmarks, particularly in math and reasoning. While US companies like Airbnb are beginning to adopt these models for their cost and performance advantages, broader adoption is hampered by compliance and geopolitical concerns. 
This trend presents a complex strategic challenge: ignoring the superior performance of these models is a competitive risk, but adopting them introduces significant supply chain and data governance questions that must be carefully navigated.</p><p>Link: <a href="https://www.understandingai.org/p/the-best-chinese-open-weight-models">https://www.understandingai.org/p/the-best-chinese-open-weight-models</a></p><div><hr></div><h3>The $12 Trillion Bubble: AI&#8217;s Unprecedented Infrastructure Bet</h3><p>The financial scale of the AI boom has entered a new territory of irrational exuberance, with industry leaders now openly discussing being in a bubble. The numbers are staggering: OpenAI has pledged $500 billion for data centers, with a moonshot goal of building infrastructure that could cost over $12 trillion. Bain estimates the industry needs to generate $2 trillion in annual revenue by 2030 just to justify the current spending wave&#8212;more than the combined 2024 revenue of Amazon, Apple, Alphabet, Microsoft, Meta, and Nvidia. Unprofitable startups like OpenAI and Anthropic are projected to burn through $140 billion and $20 billion, respectively, before 2030. This environment, characterized by circular deals and a chase for AGI, creates systemic risk. For decision-makers, this underscores the critical need for rigorous ROI analysis on AI investments and highlights the immense financial risk concentrated in a few foundational model providers, whose potential failure could have cascading effects across the industry.</p><p>Link: <a href="https://www.technologyreview.com/2025/12/15/1129183/what-even-is-the-ai-bubble/">https://www.technologyreview.com/2025/12/15/1129183/what-even-is-the-ai-bubble/</a></p><div><hr></div><h3>From 39% to 92%: Measuring Real Progress in Scientific Reasoning</h3><p>Cutting through the hype cycle, OpenAI has provided concrete evidence of dramatic capability improvements with its new FrontierScience benchmark. On the &#8220;Google-Proof&#8221; GPQA science benchmark, OpenAI&#8217;s models improved from a 39% score (GPT-4 in 2023) to a remarkable 92% with GPT-5.2 in 2025, surpassing the 70% expert baseline. The new, more challenging FrontierScience benchmark, created by PhDs and Olympiad medalists, shows GPT-5.2 achieving 77% on Olympiad-level problems and 25% on open-ended research tasks. This data provides a crucial signal: while true open-ended research remains a challenge, AI&#8217;s ability to perform high-level, structured scientific reasoning is advancing at an astonishing rate. 
This indicates that the technology is ready to be deployed to accelerate complex, structured workflows in R&amp;D, engineering, and data analysis, moving beyond simple automation to become a genuine partner in discovery and innovation.</p><p>Link: <a href="https://openai.com/index/frontierscience/">https://openai.com/index/frontierscience/</a></p>]]></content:encoded></item><item><title><![CDATA[Hello Builders, issue #4]]></title><description><![CDATA[News from the trenches of AI in the week Dec 5th - 12th]]></description><link>https://ai.techonthestack.com/p/hello-builders-issue-4</link><guid isPermaLink="false">https://ai.techonthestack.com/p/hello-builders-issue-4</guid><dc:creator><![CDATA[Luca Bianchi]]></dc:creator><pubDate>Sun, 14 Dec 2025 23:36:21 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/068945a1-4bfa-4436-b932-1b601a7c7861_420x300.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The week of December 5-12, 2025, marked a critical inflection point in the AI industry, characterized by intense competitive dynamics in foundation models and the formalization of agent engineering as a distinct discipline. Google and OpenAI engaged in direct competition with synchronized releases, while enterprise adoption of AI agents accelerated beyond coding into cross-functional workflows. Infrastructure innovations, particularly in custom silicon and development frameworks, continue to reshape the competitive landscape.</p><h2><strong>This week&#8217;s signal in the noise</strong></h2><ul><li><p>Foundation model competition intensified with GPT-5.2 and Gemini 3 Pro releases</p></li><li><p>Agent engineering emerged as a formal discipline combining product, engineering, and data science</p></li><li><p>$715M+ in venture funding across AI infrastructure and applications</p></li><li><p>NVIDIA introduced CUDA Tile for hardware-abstracted GPU programming</p></li><li><p>Anthropic published groundbreaking interpretability research revealing internal LLM reasoning</p></li><li><p>Responsible AI discussion gains momentum at re:Invent</p></li></ul><div><hr></div><h3>Agent Engineering: A New Discipline</h3><p>LangChain formalized &#8220;agent engineering&#8221; as a distinct practice area, shifting from experimental AI to production-grade systems.</p><p><strong>Technical Framework:</strong> The methodology follows an iterative cycle: Build &#8594; Test &#8594; Ship &#8594; Observe &#8594; Refine &#8594; Repeat. This requires three core skillsets working together: product thinking (encompassing prompt engineering and evaluation definition), engineering (covering tool development and durable execution), and data science (handling performance measurement and A/B testing). The production reality introduces a fundamental challenge: &#8220;every input is an edge case&#8221; because natural language creates an unlimited input space.</p><p>Companies like Clay, Vanta, LinkedIn, and Cloudflare have successfully deployed production agents using this methodology. The critical insight is that shipping is the primary learning mechanism, not the endpoint. Traditional software debugging approaches fail because the logic resides inside models rather than in explicit code. 
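To make the Observe and Refine steps concrete, here is a minimal sketch of the idea (the trace format, grader, and suite are hypothetical, not LangChain&#8217;s actual tooling): grade every production trace, and promote each failure into the regression suite the next iteration is tested against.</p><pre><code>from dataclasses import dataclass, field

@dataclass
class Trace:
    user_input: str
    agent_output: str

@dataclass
class EvalSuite:
    cases: list = field(default_factory=list)

    def add_failure(self, trace: Trace) -> None:
        # Failures observed in production become tomorrow's regression tests.
        self.cases.append(trace)

def grade(trace: Trace) -> bool:
    # Placeholder grader: in practice an LLM judge or rule-based checks.
    return bool(trace.agent_output.strip())

def observe(production_traces, suite: EvalSuite) -> None:
    for trace in production_traces:
        if not grade(trace):
            suite.add_failure(trace)
</code></pre><p>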
The concept of an agent &#8220;working&#8221; is non-binary: achieving 99.99% uptime doesn&#8217;t guarantee correct behavior across all scenarios.</p><p><strong>Source:</strong> <a href="https://blog.langchain.com/agent-engineering-a-new-discipline/">LangChain Blog</a></p><div><hr></div><h3>Enterprise AI Agent Adoption</h3><p>Survey of 500+ technical leaders reveals rapid progression from task automation to cross-functional workflows. Actual deployment shows that 57% of organizations use agents for multi-stage workflows, and 16% run cross-functional processes across teams. Looking ahead to 2026, 81% plan to tackle more complex use cases, and 80% report measurable economic returns. The highest adoption is in coding (90%), with full coverage of the entire development lifecycle, including planning (58%), generation (59%), documentation (59%), and review/testing (59%). Data analysis and report generation follow at 60%, with internal process automation at 48% showing cross-functional workflow optimization.</p><p>Thomson Reuters uses Claude to power CoCounsel, their AI legal platform, enabling lawyers to access 150 years of case law and 3,000 domain experts in minutes rather than hours of manual document searching. In healthcare, Doctolib rolled out Claude Code across its entire engineering team, replacing legacy testing infrastructure in hours rather than weeks and shipping features 40% faster. </p><p><strong>Source:</strong> <a href="https://claude.com/blog/how-enterprises-are-building-ai-agents-in-2026">Claude Blog - State of AI Agents 2026</a></p><div><hr></div><h3>Tracing LLM Internal Reasoning</h3><p>Anthropic published breakthrough interpretability research revealing how Claude &#8220;thinks&#8221; internally. The research extends prior feature-mapping work into computational &#8220;circuits,&#8221; studying Claude 3.5 Haiku across 10 crucial behaviors. The approach draws on neuroscience methods to understand thinking organisms.</p><p>The research revealed six major insights into how Claude processes information internally. The current method captures only a fraction of total computation and requires hours of human effort per short prompt. Scaling to complex reasoning chains with thousands of words will require methodological improvements and, potentially, AI assistance for interpretation.</p><p><strong>Source:</strong> <a href="https://www.anthropic.com/news/tracing-thoughts-language-model">Anthropic Research</a></p><div><hr></div><h3>NVIDIA CUDA Tile: Hardware-Abstracted GPU Programming</h3><p>CUDA 13.1 introduced CUDA Tile, the largest advancement since CUDA&#8217;s 2006 invention. CUDA Tile introduces a virtual instruction set for tile-based parallel programming that abstracts specialized hardware, including tensor cores and TMA (Tensor Memory Accelerators). The tile model represents a fundamental shift in which developers partition data into blocks, and the compiler maps them to threads, in contrast to the SIMT model, where developers map data to both blocks and threads. This approach is analogous to NumPy for Python, where developers specify bulk operations and the runtime handles execution transparently. 
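To make the NumPy analogy concrete (this is plain NumPy, not CUDA Tile code), compare an explicit element-wise loop, where the programmer dictates the traversal much as SIMT dictates the thread mapping, with a single bulk expression whose execution the runtime is free to schedule:</p><pre><code>import numpy as np

a = np.random.rand(512, 512)
b = np.random.rand(512, 512)

# Explicit loop: the programmer spells out how every element is visited.
out = np.empty_like(a)
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        out[i, j] = a[i, j] * b[i, j]

# Bulk expression: the programmer states what to compute; the runtime decides how.
out_bulk = a * b

assert np.allclose(out, out_bulk)
</code></pre><p>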
CUDA Tile reduces the burden of code rewrite across GPU generations, lowers the barrier to using tensor cores, and lays a foundation for higher-level AI development tools.</p><p><strong>Source:</strong> <a href="https://developer.nvidia.com/blog/focus-on-your-algorithm-nvidia-cuda-tile-handles-the-hardware">NVIDIA Developer Blog</a></p><div><hr></div><h3>OpenAI GPT-5.2 &#8220;Garlic&#8221; Release and Google Deep Research</h3><p>GPT-5.2 was released on December 11 amid a &#8220;code red&#8221; situation triggered by Gemini 3&#8217;s impact. CEO Sam Altman shifted resources to improving ChatGPT, with the expectation of exiting code-red status by January 2026.</p><p>OpenAI reports that GPT-5.2 tops SWE-Bench Pro for agentic coding performance and GPQA Diamond for graduate-level scientific reasoning. On GDPval, it beat or tied top professionals on 70.9% of well-specified tasks.</p><p><strong>Improvements Over GPT-5.1:</strong> The model shows significant enhancements across spreadsheet generation, presentation creation, code writing, long-form text understanding, and image processing.</p><p>In the meantime, the reimagined research agent based on Gemini 3 Pro introduces the Interactions API for embedding research into third-party apps, marking an industry first for developer access to advanced research capabilities. It handles large context dumps and synthesizes mountains of information.</p><p><strong>Source:</strong> <a href="https://gemini.google/overview/deep-research/">Google Blog</a></p><div><hr></div><h3><strong>Insights from an AI Human-in-the-Loop Roundtable</strong></h3><p>I recently sat in on a roundtable between AWS Heroes and Now Go Build CTO Fellows tackling real AI challenges: serving vulnerable populations with tiny teams, building offline education systems for conflict zones, and validating healthcare data at scale. The most valuable insights weren&#8217;t technical solutions&#8212;they were realizations. Like the healthcare org that discovered their &#8220;AI problem&#8221; was actually basic statistics. Or the finding that &#8220;30% chance of error&#8221; triggers better decisions than &#8220;70% confidence.&#8221; The conversation kept returning to one question: Are we using AI to enable people or replace them? The technical problems are solvable. 
The harder work is solving them for human flourishing, not just scale.</p><p><strong>Source:</strong> <a href="https://open.substack.com/pub/techonthestack/p/a-conversation-about-ai-that-actually?utm_campaign=post-expanded-share&amp;utm_medium=web">A Conversation About AI That Actually Matters</a></p>]]></content:encoded></item><item><title><![CDATA[A Conversation About AI That Actually Matters]]></title><description><![CDATA[AWS Heroes meet Now Go Build CTO Fellows at re:Invent]]></description><link>https://ai.techonthestack.com/p/a-conversation-about-ai-that-actually</link><guid isPermaLink="false">https://ai.techonthestack.com/p/a-conversation-about-ai-that-actually</guid><dc:creator><![CDATA[Luca Bianchi]]></dc:creator><pubDate>Sat, 13 Dec 2025 23:13:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RYQk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42ee8c8-9421-4462-b1d6-e78ab03e0632_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!RYQk!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42ee8c8-9421-4462-b1d6-e78ab03e0632_1536x1024.png" alt=""></figure></div><p>Around the table sat people managing educational platforms for 800,000 students, disaster-response systems in active war zones, healthcare assessments for vulnerable populations, and food-recovery operations across 20 regions. AWS Heroes&#8212;technical experts who&#8217;ve built systems at massive scale&#8212;sat alongside Now Go Build CTO Fellows&#8212;technology leaders working at the absolute frontlines of human need.</p><p>The question on the table: How do you actually deploy AI when getting it wrong means people get hurt?</p><p>Not &#8220;how do you optimize for engagement&#8221; or &#8220;how do you reduce churn.&#8221; How do you build systems that serve people who can&#8217;t advocate for themselves, in places where infrastructure barely works, with teams of six people trying to reach millions?</p><h2>The AI Tutor That Has to Work Without the Internet</h2><p>One of the first challenges came from someone running an educational platform serving 800,000 learners across every country. Their team: 25-30 people. Their current approach: AI helps create and localize content, but PhD-level experts review everything before it reaches students.</p><p>It&#8217;s careful. 
It&#8217;s responsible. And it doesn&#8217;t scale to where they need to go.</p><p>&#8220;In the countries where we work, students miss entire days of school regularly,&#8221; they explained. Military conflicts. Infrastructure failures. In some places, every Monday. The dream: AI tutors that run entirely on phones, work without internet, and can teach kids even when nothing else is working.</p><p>But here&#8217;s the catch&#8212;how do you ensure safety when the AI can&#8217;t check back with a server for guardrails? When there&#8217;s no human in the loop because there&#8217;s literally no loop to be in?</p><p>The room started discussing. Keep the AI tightly bound&#8212;it can only talk about specific curriculum topics. If a student asks something outside that scope, it just says, &#8220;I don&#8217;t know how to answer that.&#8221; Don&#8217;t try to be helpful beyond your domain.</p><p>Someone raised a more subtle problem: &#8220;Vector search will find semantically similar content, but it will miss gaps of information that are relevant but don&#8217;t match.&#8221; In other words, the AI might find related information, but miss the crucial context that makes it meaningful. The solution involves explicitly mapping how concepts relate to each other, not just how similar they are.</p><p>For the highest-stakes applications, there&#8217;s even a way to get mathematical proof of safety&#8212;similar to how Boeing and Airbus verify their flight systems. The AI generates the rules, humans verify that they are correct, and the system is guaranteed to stay within those boundaries.</p><p>But then someone pushed back on the whole premise: &#8220;The research shows learning requires friction. We shouldn&#8217;t eliminate struggle; we should give learners tools to manage it productively.&#8221; This hit something important. The goal isn&#8217;t to make everything easy. Real learning involves being challenged, even frustrated. The question is how to support students through that difficulty, not eliminate it.</p><p>Later, someone set an ambitious bar: &#8220;Until the learning is as addictive as a Netflix series where you watch eight hours of it, we have not got our job done.&#8221; But they also acknowledged a tension: &#8220;My kids use some of these AI-enhanced learning platforms now, and they hate them because they know the AI is constantly assessing them.&#8221;</p><p>There&#8217;s the paradox. We want engagement. But surveillance isn&#8217;t engaging&#8212;it&#8217;s oppressive. The technology that measures everything risks destroying the intrinsic motivation that makes learning work.</p><h2>The Healthcare System That Had to Be Perfect</h2><p>Then came a different kind of challenge. A healthcare organization analyzes surveys and personal stories to identify psychosocial factors&#8212;abuse, housing instability, depression&#8212;that healthcare workers need to know about. The AI&#8217;s job: read through 57 questions plus long-form written responses and flag what matters.</p><p>But these flags go into clinical decisions. They have to be right. &#8220;How do I ensure quality control with a small team of subject matter experts when we&#8217;re about to scale?&#8221;</p><p>Turns out, most of it was structured&#8212;multiple choice, scales, yes/no answers. Then they discovered this wasn&#8217;t really an AI problem. For structured data, you can use statistical methods that give you the same answer every single time. No randomness. No drift. Just consistency. 
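A minimal sketch of that deterministic pass (the field names and thresholds below are hypothetical, not the organization&#8217;s actual instrument):</p><pre><code># Hypothetical screening rules: the same answers always produce the same flags.
def flag_structured_answers(survey: dict) -> set:
    flags = set()
    if survey.get("phq9_score", 0) >= 10:        # example depression-screen cut-off
        flags.add("possible_depression")
    if survey.get("housing_stable") is False:
        flags.add("housing_instability")
    if survey.get("feels_safe_at_home") is False:
        flags.add("possible_abuse")
    return flags

assert flag_structured_answers({"phq9_score": 12}) == {"possible_depression"}
</code></pre><p>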
The AI only needed to handle the free-text responses where people write about their lives in their own words. That&#8217;s where language understanding matters.</p><p>But even with the right technical approach, the validation problem remained. How do you check thousands of assessments with a handful of experts? Use project management tools teams already know. When the AI makes a prediction, it automatically creates a review task. The expert approves or corrects it. The corrections feed back into the model. No custom software needed.</p><p>Run the same data through multiple different AI systems. If they all agree, confidence goes up. If they disagree, it gets flagged for mandatory human review. &#8220;That&#8217;s what Boeing and Airbus do,&#8221; someone noted. &#8220;Multiple systems, voting mechanisms.&#8221; You don&#8217;t need to validate everything. Check 100% when the AI says it&#8217;s uncertain. Sample the rest. Track accuracy over time. For validations that don&#8217;t require deep medical expertise, use crowd validation where multiple people review the same output, and consensus determines acceptance.</p><p>Then came a fascinating insight about human psychology. How you frame AI predictions changes how people respond to them.</p><p>&#8220;There&#8217;s a 30% chance this is wrong&#8221; makes people more careful than &#8220;70% confidence it&#8217;s right&#8221;&#8212;even though these mean the exact same thing. For life-and-death decisions, frame uncertainty as risk. It changes behavior.</p><h2>Trust in the Age of Lies</h2><p>The next challenge came from an organization working across conflict zones&#8212;Ukraine, Turkey, Mexico, places where humanitarian workers need accurate information to stay alive. Every day, they compile situation reports from multiple sources: social media, news, and ground reports from local contacts. Here&#8217;s the nightmare scenario: a convincing video of an explosion makes it into a situation report. Workers avoid a safe route or miss a critical intervention window. Except the explosion never happened. It was AI-generated.</p><p>&#8220;False information doesn&#8217;t just waste time,&#8221; they said. &#8220;It puts lives at risk.&#8221;</p><p>And here&#8217;s what makes it harder: &#8220;The humanitarian workforce is significantly aging. With funding cuts, organizations aren&#8217;t hiring new people. The digital literacy problem is getting worse, not better.&#8221;</p><p>AI can scrape and aggregate information orders of magnitude faster than humans. But speed without accuracy is worse than useless&#8212;it&#8217;s dangerous.</p><p>A possible solution is to enforce data lineage. Track everything. Where did this information come from? How was it processed? Who else is reporting it? This isn&#8217;t just for validation&#8212;when something seems suspicious, you need to be able to trace back through every step. Look for technical indicators. Images and videos contain metadata&#8212;timestamps, GPS data, and visual artifacts from editing. These can be analyzed to flag content that appears manipulated.</p><p>Someone proposed an elegant solution to the training problem: build it directly into the workflow. Instead of separate training sessions that people forget, the system challenges reviewers in the moment: &#8220;This image has characteristics consistent with AI generation. What indicators do you see?&#8221;</p><p>The system doesn&#8217;t let them move forward until they articulate their reasoning. 
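A minimal sketch of that gate (everything here is hypothetical): the verdict simply is not accepted until a written rationale comes with it.</p><pre><code># Hypothetical review gate: no rationale, no verdict.
def submit_review(item_id: str, verdict: str, rationale: str) -> dict:
    if verdict not in {"authentic", "suspect"}:
        raise ValueError("verdict must be 'authentic' or 'suspect'")
    if len(rationale.strip()) < 20:
        raise ValueError("Describe the indicators you see before submitting.")
    # Stored rationales can later feed training and audits.
    return {"item": item_id, "verdict": verdict, "rationale": rationale.strip()}
</code></pre><p>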
Right or wrong, they&#8217;re thinking critically about that specific case.</p><h2>Other Realities</h2><p><strong>Mapping with biased AI.</strong> Someone working on geospatial mapping explained that global AI models trained primarily on US and European data perform terribly in developing regions. Deploy them to map villages in Rwanda, and they miss buildings or misclassify structures entirely. Their solution turned the problem into an opportunity: local community members provide feedback that fine-tunes the models for their specific region. Simple mobile interfaces&#8212;swipe yes or no, is this a building?&#8212;create a continuous improvement loop. &#8220;This is about community inclusion,&#8221; they said. &#8220;Locals train the systems that map their environment. It addresses the bias in training data and makes the whole process transparent.&#8221;</p><p><strong>Health information in hostile territories.</strong> A reproductive health organization works in regions where providing its services is literally illegal. They need to provide accurate, potentially life-saving information, but they can&#8217;t collect the personal data needed for AI personalization. The constraints are severe. Users may be in immediate danger. The information must be medically precise. The system must detect emotional distress and escalate to human counselors when needed. Their approach emphasizes curation over generation: the AI draws only from pre-validated medical resources, every response includes clear citations, and there&#8217;s always a prominent path to reach a human counselor.</p><p><strong>When voice changes everything.</strong> Multiple people raised the challenge of serving populations with limited literacy, particularly in languages without strong written traditions. The shift from text to voice interfaces made a dramatic difference: &#8220;Once we moved from text chatbot to voice-based, we saw significantly better engagement with low-literacy populations. The emotion detection was critical&#8212;if someone sounds sad, you need to adapt the interaction.&#8221; Voice with emotion detection allows systems to respond not just to what people say, but how they say it. It makes the interaction feel human in a way text never can.</p><p>As the conversation wound on, certain truths kept surfacing, sometimes explicitly, sometimes as subtext to other discussions.</p><p><strong>Not every problem needs AI.</strong> For structured data with clear rules, traditional statistical methods often work better. They&#8217;re consistent, auditable, and don&#8217;t have the unpredictability of generative systems. The maturity isn&#8217;t in knowing the latest AI techniques&#8212;it&#8217;s in knowing when not to use them. Sometimes a problem is &#8220;just&#8221; a classic machine learning problem, not a generative AI problem.</p><p><strong>Small teams can provide oversight.</strong> You can&#8217;t manually review every AI output, but you also can&#8217;t deploy systems without oversight. The answer isn&#8217;t choosing one or the other&#8212;it&#8217;s being strategic about where human judgment adds value. Validate everything when confidence is low. Sample when confidence is high. Use tools teams already know rather than building custom systems. Have corrections automatically improve the model.</p><p><strong>Humans need context, not just answers.</strong> When AI makes a prediction, showing why it reached that conclusion helps humans effectively validate it. 
Citations, confidence scores, explicit reasoning&#8212;these aren&#8217;t nice-to-haves. They&#8217;re what make the partnership between human and machine actually work.</p><p><strong>How you frame things matters more than you&#8217;d think.</strong> That insight about &#8220;30% chance this is wrong&#8221; versus &#8220;70% confidence it&#8217;s right&#8221;&#8212;mathematically identical, psychologically different&#8212;kept coming up. For high-stakes decisions, frame uncertainty as risk. It changes how people think.</p><p><strong>Trust requires transparency.</strong> Track where information came from and how it was processed. When something goes wrong, trace back to find where the error was introduced. This isn&#8217;t bureaucracy&#8212;it&#8217;s the foundation of accountability.</p><p><strong>Training should be embedded, not scheduled.</strong> Rather than separate training sessions, which people forget, build challenges directly into the workflow. Make people articulate their reasoning before accepting or rejecting AI outputs. Provide guidance in the moment when they struggle. Learning that happens in context sticks.</p><p><strong>Some things should stay human.</strong> This was the insight that kept resonating. Some interactions should remain human even if they could technically be automated.</p><p>&#8220;Automation decisions aren&#8217;t purely technical&#8212;they encode values about human dignity and relationship importance.&#8221;</p><p>Technology should serve human flourishing, not just efficiency. Sometimes the relationship itself is the value being delivered.</p><h2>What does this mean?</h2><p>In typical tech discussions, AI is about optimization, efficiency, and competitive advantage. At this table, the question was whether a student in an active war zone could learn math. Whether a humanitarian worker would walk into danger based on fabricated information. Whether someone in crisis would get help or get dismissed.</p><p><strong>Start with the constraint, not the capability.</strong> The best solutions came from embracing limitations&#8212;offline-first design, minimal data collection, small teams&#8212;rather than treating them as compromises. Design for the world as it is, not as you wish it were.</p><p><strong>Choose boring technology when it solves the problem.</strong> The sophistication to use simpler statistical methods instead of AI, or traditional databases instead of vector search, often leads to better outcomes. Complexity should be justified by requirements, not assumed by default.</p><p><strong>Make humans and AI partners, not adversaries.</strong> The goal isn&#8217;t to validate whether the AI was &#8220;right.&#8221; It&#8217;s about combining human judgment with AI capabilities to make better decisions than either could alone. Design systems where humans add value, not just check boxes.</p><p><strong>Build feedback loops into everything.</strong> Every human correction should automatically improve the system. If your validation process doesn&#8217;t feed back into training, you&#8217;re missing the opportunity to get better over time.</p><p>&#8220;I&#8217;m interested in finding the balance between using AI to enable people versus using AI to replace them. The higher order tends to view it as replacement rather than enablement.&#8221;</p><p>That&#8217;s the real tension, isn&#8217;t it? Not the technical challenges&#8212;those are solvable. The harder question is what we&#8217;re solving them for. AI should amplify human capability, not substitute for human judgment. 
It should help small teams serve millions without losing the human connection that makes the service meaningful.</p><p>This isn&#8217;t idealism. It&#8217;s practical. Systems that lose the human element, that treat efficiency as the only goal, that optimize for metrics at the expense of relationships&#8212;they ultimately fail to serve the populations they&#8217;re designed to help.</p><p>Human-in-the-loop is the design principle that makes AI trustworthy for populations that can&#8217;t afford our mistakes. The technical problems&#8212;validation at scale, offline operation, handling uncertainty&#8212;those have solutions. The harder work is ensuring we&#8217;re solving them in service of human flourishing, not just scale.</p><p>Because somewhere, there&#8217;s a student who needs to learn despite conflict disrupting school. A healthcare worker making decisions that affect people&#8217;s lives. A humanitarian worker whose safety depends on accurate information. A person in crisis who needs help, not judgment. They can&#8217;t advocate for themselves in these technology decisions. So we have to advocate for them. We have to build systems worthy of their trust.</p><p>That&#8217;s what sustainable and ethical AI stands for.</p>]]></content:encoded></item><item><title><![CDATA[Hello Builders, issue #3]]></title><description><![CDATA[Your weekly briefing on the signals that actually matter in AI distilled, contextualized, and built for people who are shipping.]]></description><link>https://ai.techonthestack.com/p/hello-builders-issue-3</link><guid isPermaLink="false">https://ai.techonthestack.com/p/hello-builders-issue-3</guid><dc:creator><![CDATA[Luca Bianchi]]></dc:creator><pubDate>Sat, 06 Dec 2025 23:46:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e7440845-d9fd-4ed6-a97c-e10dd61ccd9d_420x300.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week delivered an unusually dense concentration of enterprise-grade AI developments, anchored by AWS re:Invent 2025 and NeurIPS 2025. Google&#8217;s Gemini 3 broke the 1500 Elo barrier on reasoning benchmarks, DeepSeek released a 671B-parameter open-weight model at <strong>~90% lower cost</strong> than proprietary alternatives, and AWS and Google Cloud announced an unprecedented multicloud interconnect partnership.</p><h2><strong>This week&#8217;s signal in the noise</strong></h2><ul><li><p><strong>Model economics transformed</strong>: DeepSeek V3.2 at $0.21/$0.32 per million tokens (10-30x cheaper than proprietary models, MIT license). Gemini 3 hit 1501 Elo (first to break 1500). Mistral Large 3 launched at 80% below OpenAI pricing, with an Apache 2.0 license.</p></li><li><p><strong>Infrastructure developments</strong>: AWS Trainium3 delivers 4.4x performance boost; AWS-Google Cloud multicloud interconnect enables 1-100 Gbps private links in minutes. Google TPU v7 scales to 42.5 exaflops. H100 pricing cut by 44%.</p></li><li><p><strong>$1.7B+ funding</strong>: Black Forest Labs ($300M, $3.25B valuation), Harvey ($160M, $8B valuation). OpenAI acquired Neptune.ai&#8212;<strong>external services sunsetting, migration required</strong>. Marvell acquired Celestial AI for photonic interconnects.</p></li><li><p><strong>Regulatory/research updates</strong>: EU AI Act Annex III deadlines delayed 16 months (now Dec 2027). NeurIPS findings suggest RLVR hitting capability ceiling. 
NVIDIA achieved 4-bit LLM training matching 8-bit performance.</p></li></ul><div><hr></div><h3>Model releases reshape the competitive landscape</h3><p>December 1, 2025, marked what industry observers dubbed &#8220;Super Sunday&#8221;&#8212;exactly three years after ChatGPT&#8217;s launch&#8212;with multiple frontier-model announcements in a 48-hour window.</p><p><strong>Google Gemini 3 Pro</strong> achieved an <strong>LMArena Elo of 1501</strong>, becoming the first model to break the 1500 barrier. Its GPQA Diamond score of <strong>91.9%</strong> surpasses human expert performance (~89.8%), while the extended reasoning variant &#8220;Deep Think&#8221; reached <strong>93.8%</strong> on the same benchmark. The model supports a <strong>1-million-token context window</strong>. <strong>DeepSeek V3.2</strong> represents the most significant open-weight release of the week. At <strong>671B total parameters</strong> (37B active per token) with a Mixture-of-Experts architecture, it achieves parity with GPT-5 on reasoning benchmarks while pricing at <strong>$0.21/million input tokens</strong> and <strong>$0.32/million output tokens</strong>&#8212;roughly 10-30x cheaper than proprietary alternatives. The cheap-model trend continues with the release this week of Mistral Large 3, priced approximately 80% below OpenAI&#8217;s flagship pricing. </p><ul><li><p>https://blog.google/products/gemini/gemini-3/</p></li><li><p>https://api-docs.deepseek.com/news/news251201</p></li></ul><div><hr></div><h3>Infrastructure investments accelerate across hyperscalers</h3><p>AWS re:Invent 2025 dominated infrastructure news with <strong>Trainium3</strong>, AWS&#8217;s first 3nm AI chip, and the <strong>AWS-Google Cloud multicloud interconnect</strong> partnership, which integrates AWS Direct Connect with Google Cloud Cross-Cloud Interconnect to enable private links between platforms in minutes rather than weeks. Bandwidth ranges from <strong>1-100 Gbps</strong> with MACsec encryption and quad redundancy. In the meantime,&nbsp;<strong>Google&#8217;s TPU v6e (Trillium)</strong>&nbsp;reached general availability, and TPU v7 (Ironwood) was previewed this week with a total capacity of&nbsp;<strong>42.5 exaflops</strong>. Capital expenditure across the eight major hyperscalers is projected to increase <strong>44% year-over-year to $371 billion</strong> in 2025. JPMorgan estimates <strong>$5 trillion</strong> in global data center and AI infrastructure spending over the next five years. </p><ul><li><p>https://www.webpronews.com/amazon-and-google-launch-multicloud-networking-for-ai-and-outage-resilience/</p></li></ul><div><hr></div><h3>Funding rounds signal continued investor conviction</h3><p>Approximately <strong>$1.7 billion in disclosed funding</strong> closed during the week across enterprise AI infrastructure, vertical applications, and agentic AI platforms. <strong>Black Forest Labs</strong>&nbsp;raised&nbsp;<strong>$300 million in Series B</strong>&nbsp;at a&nbsp;<strong>$3.25 billion valuation</strong>. The company&#8217;s FLUX image generation models power Adobe, ElevenLabs, and Grok. <strong>Eon</strong>&nbsp;closed a&nbsp;<strong>$300 million Series D</strong>&nbsp;round led by Elad Gil at an approximate&nbsp;<strong>$4 billion valuation</strong>&nbsp;for its cloud data backup business. <strong>Harvey</strong>, the legal AI platform, secured <strong>$160 million</strong> at an <strong>$8 billion valuation</strong> (up from $5B in June), led by Andreessen Horowitz.
The company serves 50+ of the top 100 AmLaw firms and surpassed <strong>$100 million ARR</strong> in August. Back-to-back funding rounds totaling <strong>$760 million</strong> indicate strong market validation.</p><ul><li><p>https://techcrunch.com/2025/12/01/black-forest-labs-raises-300m-at-3-25b-valuation/</p></li><li><p>https://techcrunch.com/2025/12/04/legal-ai-startup-harvey-confirms-8b-valuation/</p></li></ul><div><hr></div><h3>Research breakthroughs from NeurIPS challenge training assumptions</h3><p>NeurIPS 2025 (Nov 30 &#8211; Dec 7) produced several findings with direct enterprise implications. The <strong>Gated Attention</strong> paper from Alibaba&#8217;s Qwen team received Best Paper honors for introducing head-specific sigmoid gating after scaled dot-product attention. Tested across 30+ experiments on models up to 15B parameters trained on 3.5 trillion tokens, the technique eliminates the &#8220;attention sink&#8221; phenomenon, enables larger learning rates, and improves long-context extrapolation&#8212;all with minimal implementation overhead. The approach is <strong>already deployed in production Qwen3-Next models</strong>.</p><p>A critical RLVR (Reinforcement Learning from Verifiable Rewards) study received Best Paper Runner-Up recognition for demonstrating that <strong>RLVR improves sampling efficiency but does not elicit fundamentally new reasoning patterns</strong>. Base models achieve higher pass@k scores when k is large; distillation&#8212;not RLVR&#8212;introduces genuinely new reasoning capabilities. This finding suggests current post-training methodologies may be approaching fundamental limits.</p><p><strong>DeepSeekMath-V2</strong> demonstrated self-verifiable mathematical reasoning with a meta-verification system that checks verifier accuracy and reduces hallucinations. On the 2024 Putnam Competition, it scored <strong>118/120 points</strong>, compared to a top human score of 90. 
The model achieved gold-medal performance at IMO 2025 (83.3%, 5/6 problems) and the China Mathematical Olympiad 2024.</p><ul><li><p>https://blog.neurips.cc/2025/11/26/announcing-the-neurips-2025-best-paper-awards/</p></li></ul><p></p>]]></content:encoded></item><item><title><![CDATA[The end of an era and the Renaissance Developer.]]></title><description><![CDATA[Werner Vogels' last keynote marked one of the highest moments of re:Invent 2025 and left us with his legacy to carry on.]]></description><link>https://ai.techonthestack.com/p/the-end-of-an-era-and-the-renaissance</link><guid isPermaLink="false">https://ai.techonthestack.com/p/the-end-of-an-era-and-the-renaissance</guid><dc:creator><![CDATA[Luca Bianchi]]></dc:creator><pubDate>Fri, 05 Dec 2025 17:58:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!94le!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17fd0d1c-48db-4c36-bf3e-63630188bbf6_2586x1346.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!94le!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17fd0d1c-48db-4c36-bf3e-63630188bbf6_2586x1346.png"><div class="image2-inset"><picture><img src="https://substackcdn.com/image/fetch/$s_!94le!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17fd0d1c-48db-4c36-bf3e-63630188bbf6_2586x1346.png" width="1456" height="758" class="sizing-normal" alt=""></picture></div></a></figure></div><p>Two more words: &#8220;Werner&#8217;s out!&#8221;</p><p>When the mic drops on stage, we all know an era is over. Werner Vogels&#8217; last keynote marked one of the most emotional moments of this 2025 re:Invent and closed a 14-year keynote series at its best.
The bittersweet sensation that comes when you realize one of the pioneers of our modern age decides it&#8217;s time to step back and let younger forces tell stories about these transformative times is both hard to elaborate on and leaves you with a feeling of a full circle.</p><p>Seniority is not the ability to hold the microphone; it&#8217;s the courage to hand it to the next generation: </p><blockquote><p>&#8220;There are so many amazing engineers at Amazon that have great stories to tell&#8230; It&#8217;s time for those younger, different voices of AWS to be in front of you.&#8221;</p></blockquote><p>It came after a one-hour packed discussion about how to deal with these transformative times. Werner's last gift is a message of hope and a call to become better builders, better professionals, better humans. The theme of the entire keynote can be summarized in one line:</p><blockquote><p><strong>The tools will change; the work is still yours.</strong></p></blockquote><p>From that premise, Vogels unfolds a framework for the&nbsp;<strong>Renaissance Developer</strong>&#8212;a model not just for engineers but for any technical leader who wants to stay relevant and effective in an AI-accelerated world. To make his point, Vogels does what experienced engineers do when everyone is panicking about the future: he zooms out over decades. Every wave followed the same pattern: tools changed dramatically, the work changed shape, and the <em>developer's identity</em> stayed the same: <strong>we build systems that matter. </strong>This is the context in which he answers the AI anxiety. We&#8217;re not at the end of development; we&#8217;re in yet another phase of abstraction. The difference today is the <strong>density of change</strong>. As Jeff Bezos has put it, we&#8217;re living at the epicenter of multiple &#8220;golden ages&#8221;&#8212;space, robotics, and AI&#8212;whose breakthroughs amplify one another.</p><p>Werner&#8217;s conclusion: if you anchor your career in specific tools, you&#8217;ll be swept away. If you anchor it in enduring capabilities, you&#8217;ll ride the wave. Those capabilities are what he calls the <strong>Renaissance Developer's qualities, </strong>drawing on a period when tools and ideas reinforced one another, reshaping civilization: the <strong>Renaissance</strong>. Tools didn&#8217;t replace humans; they <strong>expanded the surface area of what humans could explore</strong>.</p><h2>Curiosity</h2><p>The very first quality of the Renaissance Developer is curiosity: in a world of AI&#8209;accelerated change, it becomes a survival skill and a professional obligation. But curiosity requires experimentation and failure. If you are not allowed to fail, you can&#8217;t thrive. But it comes with pressure: too little leads to boredom and disengagement, while too much leads to overwhelm and paralysis. 
The Yerkes&#8211;Dodson law<strong> </strong>explains that the sweet spot is <em>enough challenge to stretch you without breaking you.</em></p><p>This comes with connections: &#8220;Learning isn&#8217;t just cognitive, it is social.&#8221; You don&#8217;t become a Renaissance Developer by sitting alone with tutorials; you grow by seeking out communities, sitting in user groups and conferences where ideas collide, lingering over coffee with peers while you argue about systems, and deliberately placing yourself in new contexts and constraints so that real experiences, real people, and real problems can stretch your thinking far beyond what isolated study can do.</p><p>As a leader, your job is not to protect your team from failure; it is to make failure cheap and reversible, to put people in environments where they are exposed to new ideas and constraints, and to reward learning behavior as much as successful outcomes.</p><h2><strong>Systems Thinking</strong></h2><p>The second quality is <strong>systems thinking</strong>. Not &#8220;distributed systems&#8221; in the narrow sense, but systems in the broader sense that <strong>Donella Meadows</strong> wrote about.</p><blockquote><p>&#8220;A system is a set of things&#8212;people, cells, or whatever&#8212;interconnected in such a way that they produce their own pattern of behavior over time.&#8221;</p></blockquote><p>Meadows invites us to stop staring at individual events and instead notice the structures that generate them. She asks us to look for the &#8220;stocks and flows&#8221; beneath the surface: the queues that build up in a support channel, the backlog that silently grows as new features outrun maintenance, the slow accumulation of technical debt that barely shows up in daily metrics but eventually locks a team in place. These are the reservoirs of a system. They fill and drain at different speeds, and they determine the patterns we see in performance, morale, and reliability. Meadows points to <strong>feedback loops</strong>, circuits of cause and effect that either reinforce a trend or hold it in check. In a reinforcing loop, every success makes the next success easier, like a product that improves, gains users, and generates the data that makes it better still. In a balancing loop, the system pushes back, like a throttling mechanism that slows calls when the load becomes too high. Our organizations, our processes, and our code are full of these loops. Some stabilize; others amplify; many conflict with each other in ways that only become visible over time. Crucially, systems have delays and nonlinearities. Actions and consequences are often separated in time and space: you approve a shortcut in architecture, and only months later, the on&#8209;call rotation begins to burn out. You improve incident response without touching the upstream causes and find that you are getting better and better at cleaning up the same problems. Because of these delays, intuitive fixes&#8212;more meetings, more reviews, more dashboards&#8212;can easily make behavior worse if they are placed in the wrong part of the system. This is why Meadows focuses on leverage<strong> points</strong>: small, carefully chosen interventions that produce outsized effects. You can change parameters&#8212;such as timeouts, thresholds, or team size&#8212;and achieve incremental improvement. You can change information flows&#8212;who sees which metrics, which stories reach leadership&#8212;and unlock better decisions. 
You can change rules and incentives&#8212;what gets rewarded in performance reviews or funded in roadmaps&#8212;and watch priorities shift without a single line of code. At the deepest level, you can change goals and mental models&#8212;what &#8220;good&#8221; looks like for reliability, cost, or user experience&#8212;and the entire architecture starts to evolve in a new direction. If you keep treating problems as isolated bugs or incidents, you will keep fighting the same fires. Once you see your work as part of a living system of stocks, flows, feedback loops, delays, and leverage points, you stop searching for silver bullets and start looking for thoughtful, well&#8209;placed changes that reshape behavior over time.</p><h2>Communication</h2><p>Natural language is powerful but ambiguous. Programming languages forced us to be precise because they refused the shortcuts and half-thoughts that humans understand from context. With AI-assisted coding, we are back to prompting in natural language, and vague prompts don&#8217;t just slow us down; they create misalignment&#8212;code that looks sophisticated but solves the wrong problem, hallucinated fields and APIs, and designs that quietly undermine the architecture they were meant to serve. Craft clear specifications as the bridge between human intent and machine-generated implementation. A good spec compresses ambiguity into clarity, forces us to decide what success looks like, which constraints we accept, and which trade-offs we will make, and becomes the shared reference point where product, engineering, and AI tools can actually align. Learn to ask &#8220;why&#8221; before &#8220;how.&#8221; When a customer asks, &#8220;What should we be doing with GenAI?&#8221;, the first move is not to list features but to ask, &#8220;Why are you asking? What problem or opportunity do you see?&#8221; From that deeper conversation comes sharper intent, a better specification, and a system that actually matters. For leaders, this is why communication is a core technical skill: diagrams, narratives, and specs are the instruments through which you shape the system, and in an AI-heavy workflow, the quality of your words still determines the quality of your systems.</p><h2>Ownership</h2><p>AI does not dilute accountability. &#8220;The work is yours, not the tools.&#8221; If you operate in regulated environments&#8212;healthcare, finance, critical infrastructure&#8212;and AI generates code that violates regulation, you cannot tell a regulator, &#8220;The AI did it.&#8221; Responsibility does not move with the autocomplete cursor. AI mainly amplifies two dangers: it can create verification debt, where code is generated faster than you can truly understand it, and it can hallucinate confident but wrong output&#8212;non-existent APIs, architectures that violate your own patterns, designs that look elegant but are not grounded in reality.</p><p>For leaders, ownership in the age of AI means accepting that models will write much of the code while you design and protect the mechanisms that keep it safe: clear specifications, honest reviews, explicit stop buttons when something feels wrong. 
Renaissance Developers don&#8217;t just own their code; they own the feedback loops and safeguards that keep quality high even as the tools get faster.</p><h2>Be a polymath</h2><p>Vogels is clear that this isn&#8217;t about mathematics; the word comes from the Greek for &#8220;to learn many things.&#8221; A polymath goes deep in at least one domain and yet refuses to stay narrow, cultivating meaningful knowledge across others. Leonardo da Vinci is the visible archetype&#8212;painter, engineer, anatomist, inventor&#8212;but Vogels&#8217; fundamental point is that modern developers cannot afford to be merely I&#8209;shaped specialists.</p><p>Instead, he urges us toward a T&#8209;shaped posture: depth like a spine, breadth like outstretched arms. His mentor Jim Gray embodied this. Gray could listen to a server room for half a minute and diagnose a broken database layout, not only because he understood databases intimately but also because he understood the data, the scientists, and how the organization worked. That combination of technical depth, domain insight, and human awareness is what makes expertise decisive rather than decorative.</p><p>For leaders, the lesson is demanding and straightforward: grow people who can dive deep and still speak the neighboring languages around them. When a backend engineer understands cost and product, their designs change. When a data engineer understands the story the numbers must tell, their questions sharpen. When a platform engineer understands developer experience, their tools become catalysts rather than constraints. Breadth is not a distraction from depth; it is the amplifier that lets depth shape the whole system.</p><h1>Werner&#8217;s legacy</h1><p>When looking at the empty stage, the previous keynote pictures rolling, I started reflecting on Werner&#8217;s legacy: it is a way of inhabiting the craft. Protect and fuel curiosity so that learning never stops. Think in systems so that every change you make respects the invisible structures it touches. Raise the bar on communication so intent, design, and implementation move in sync. Build real mechanisms, not just good intentions, especially when AI accelerates your work beyond your first understanding. And grow T&#8209;shaped people whose depth is amplified, not constrained, by their breadth.</p><p>The tools will keep changing. They always have. 
What endures is the standard he set: use the most powerful tools in history in service of the biggest problems of our time, quietly, rigorously, and with pride in the work that no one sees.</p><p></p>]]></content:encoded></item><item><title><![CDATA[I am not an AI expert, nor are you.]]></title><description><![CDATA[How self-proclaimed experts risk poisoning the AI ecosystem.]]></description><link>https://ai.techonthestack.com/p/i-am-not-an-ai-expert-nor-are-you</link><guid isPermaLink="false">https://ai.techonthestack.com/p/i-am-not-an-ai-expert-nor-are-you</guid><dc:creator><![CDATA[Luca Bianchi]]></dc:creator><pubDate>Thu, 04 Dec 2025 00:52:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!J_EP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0c9329b-806b-4fd5-aef6-7c2f936bb8a3_1920x1079.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!J_EP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0c9329b-806b-4fd5-aef6-7c2f936bb8a3_1920x1079.png"><div class="image2-inset"><picture><img src="https://substackcdn.com/image/fetch/$s_!J_EP!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0c9329b-806b-4fd5-aef6-7c2f936bb8a3_1920x1079.png" width="1200" height="674.1758241758242" class="sizing-large" alt=""></picture></div></a></figure></div><p>It is pretty common to find AI experts almost everywhere: podcasts, blog posts, LinkedIn news. People re-post headlines and news, often amplifying the noise of marketing announcements with no real attachment to practical use cases.</p><p>This behavior, quite common during every technological revolution, can be highly disruptive when dealing with AI.</p><p>The problem is that in the domain of artificial intelligence, there is an enormous gap between a PoC or theory and the hard reality. This, combined with a fast-moving state of the art and some challenges still unsolved, amplifies the perception that with AI, nothing works in production.</p><p>There are a number of factors that create and maintain this perception: first, the ease of building a working prototype, which leads people to believe production is just a few days away. This is fueled by people living under the Dunning-Kruger effect, which leads them to underestimate the complexity of reality. Second, there are benchmarks that can be really misleading.
I have been told far too many times that we should do X because ChatGPT does X.</p><p>Ok, repeat after me: ChatGPT is not a benchmark because their investments in just building that simple application are far greater than what a normal company can put on the table. This translates into a heavily optimized solution that appears simple at first glance. In addition, tech teams within OpenAI have access both to top-level AI professionals and to the knowledge of how their LLM works under the hood.</p><p>So, let&#8217;s go back to reality: I completed my PhD in 2010, working on neural networks and computer vision, but I do not consider myself an AI expert. I have trained neural networks on GPUs and in the Cloud, faced heating issues, and allocated processing slices. I have written PyTorch implementations. Yet I do not consider myself an expert. This is because I am not an expert.</p><p>Every day I learn something new. Every day, I find out there are things I do not know. Things that have a real impact on real issues. And I am not speaking only of manual model training or such things. I am pointing to just the managed services out there. Each one with parameters that distinguish feasibility from unaffordability.</p><p>In AI, even tasks such as &#8220;document OCR&#8221; can be tough when tested on real-life documents, which are often handwritten, rotated, and sometimes blurred. In the last few months, just to read legal contracts, we had to match a whole set of requirements in terms of speed, cost, and accuracy. Everything seemed super easy on paper, but when faced with reality, we found we needed a dataset of &#8220;hard documents,&#8221; which are not hard at all for humans. We tried different parameters, and every time we matched a requirement, another piece of the puzzle stopped working. We needed to define a metric to build KPIs and evaluate different methods across different dimensions. We needed to pre-process the document, and discovered that there is no one-size-fits-all solution.</p><p>We struggle with context size and meaningfulness, face size constraints, and seek techniques to improve the quality of data managed by our agents. When we fail, the price is poor accuracy, which in the legal domain translates into customers not being able to use our tools to accomplish their tasks. Then we iterate, change the model, deal with Bedrock quotas and system prompt accuracy, only to finally find out we need to select our context chunks using a RAG, but we still need to rewrite queries, filter, and rerank chunks. All of this within an acceptable iteration time.</p><p>I work every day with people from Microsoft and AWS who collect feedback from our use case, provide suggestions, and improve their products in the meantime. Then our R&amp;D department has to continuously monitor new papers published at top-level conferences, prepare test cases to evaluate results, and identify what should be included in our product roadmap.</p><p>Then I stumble into so-called experts. People often deliver engaging talks at conferences about how magical yet powerful this AI world has become. I try to ground their very opinionated sentences in our use cases, but after having barely listened to two sentences about our problem, they have identified a 100% working solution. </p><p>I dive into their expertise only to find out they have &#8220;production expertise&#8221; because they can run a model on their PC with Ollama or LMStudio.
Buying a $6K laptop and running in the terminal </p><p><code>uvx --from mlx-vlm mlx_vlm.generate \</code></p><p><code>    --model mlx-community/DeepSeek-OCR-8bit \</code></p><p><code>    --prompt "Convert this document to Markdown." \</code></p><p><code>    --image "$IMAGE_PATH" \</code></p><p><code>    --max-tokens 8192 \</code></p><p><code>    --temperature 0.0</code></p><p>doesn&#8217;t make you an expert. I have a 2600-page document to be processed in less than 3 minutes. A use case requiring at least multiple GPUs, prefetching, and a vLLM deployment. And this is just one example of the many modules we have in our Adaptive Legal Assistant LIDIA: legal operations workflow management through agents is another optimization pain point: context is very limited when you have tens of documents with hundreds of pages each. And context rot is waiting at the end of the road to remind you that &#8220;a bigger context&#8221; doesn&#8217;t solve your issues. Then you have to add a RAG to the equation, but in order to be effective, you need metadata from your documents and a proper chunking technique. Just to begin.</p><p>But our &#8220;advanced experts&#8221; aim to write a sufficiently long prompt for ChatGPT and are ready to explain to you that they did that in production, as they show in their presentations: loading an Excel file into the chat and summing the values in a column.</p><p>Let&#8217;s set the expectations: running a model on your PC does not make you an expert. Bringing the model into production on multi-GPU hardware, with optimized response times, makes you an expert.</p><p>Writing a prompt, however complex, does not make you an AI expert. The state of the art is agent development with context engineering, tool calling, and agent swarm orchestration. I know how to do all these things; nonetheless, I am struggling every day with the constraints of the state of the art. </p><p>Yet I am not an AI expert.
And I can tell that you are probably neither.</p>]]></content:encoded></item><item><title><![CDATA[The Silent Killer of Production LLMs]]></title><description><![CDATA[Mastering Context Rot and Attention Budget]]></description><link>https://ai.techonthestack.com/p/the-silent-killer-of-production-llms</link><guid isPermaLink="false">https://ai.techonthestack.com/p/the-silent-killer-of-production-llms</guid><dc:creator><![CDATA[Luca Bianchi]]></dc:creator><pubDate>Thu, 04 Dec 2025 00:47:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xYk1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb684d91-9bdd-42b8-93ea-e03de83052ad_1024x572.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xYk1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb684d91-9bdd-42b8-93ea-e03de83052ad_1024x572.jpeg"><div class="image2-inset"><picture><img src="https://substackcdn.com/image/fetch/$s_!xYk1!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb684d91-9bdd-42b8-93ea-e03de83052ad_1024x572.jpeg" width="1200" height="670.3125" class="sizing-large" alt=""></picture></div></a></figure></div><p>While 2023 focused almost exclusively on <em>prompt engineering</em>, the real challenge for 2025 will be <strong>Context Engineering</strong>.</p><p>If we accept the metaphor that the Large Language Model is the new CPU, then its context window&#8212;even in million-token variants&#8212;is &#8220;the new RAM&#8221;: fast, powerful, but critically finite. Currently, many systems treat this high-speed memory as if it were an unlimited hard drive. The structural result of this approach is <strong>context rot</strong>.</p><p>The scenario is familiar: an agent that performs perfectly in a demo starts to degrade after a few weeks in production. It hallucinates obsolete instructions, confuses tools, or contradicts recent data with old session history. This is not a random bug, but a structural failure in managing the model&#8217;s &#8220;working memory.&#8221;</p><p>Let&#8217;s analyze the engineering principles necessary to transform context from an unmanaged liability into a curated, high-performance asset.</p><h2>The Structural Limit: The Attention Budget</h2><p>At the core of context rot is an architectural constraint of the Transformer: the <strong>attention budget</strong>. For a sequence of $N$ tokens, the model may need to compute on the order of $N^2$ pairwise attention relationships.
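</p><p>To make the budget concrete, here is a small, purely illustrative sketch (plain Python, with arbitrary context sizes) of how the number of pairwise relationships grows with sequence length:</p><pre><code># Illustrative only: pairwise attention relationships grow with the
# square of the sequence length, while model capacity stays fixed.
for n_tokens in (1_000, 8_000, 32_000, 128_000, 1_000_000):
    pairs = n_tokens * n_tokens  # on the order of N^2 interactions
    print(f"{n_tokens:,} tokens: ~{pairs:,} pairwise relationships")</code></pre><p>An 8k-token prompt already implies on the order of 64 million token-to-token interactions; at one million tokens the count is on the order of 10^12, and the capacity paying for all of it does not grow with the window.</p><p>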
As $N$ grows toward the context limit, the model must distribute its fixed attention capacity over a quadratically larger set of connections.</p><p>The practical impact is evident in &#8220;Needle-in-a-Haystack&#8221; benchmarks. As the context length (the &#8220;haystack&#8221;) increases, the model&#8217;s ability to reliably retrieve a single relevant piece of information decreases. Long context represents a capacity increase, not a guarantee of perfect recall.</p><p>In agent development, we tend to add information incrementally: system prompts, user history, RAG results, tool outputs, Chain-of-Thought scratchpads, and execution logs. Often, nothing is deleted. By the sixtieth interaction, the model is paying an &#8220;attention tax&#8221; on forty turns of irrelevant noise. Studies (such as those conducted by Microsoft/Salesforce) have shown performance degradation of up to 20% when partial or incorrect intermediate outputs are retained in the active context.</p><p>Context rot manifests primarily in four failure modes:</p><ol><li><p><strong>Context Poisoning:</strong> An error or hallucination (e.g., an obsolete API specification retrieved via RAG) enters the context and is treated as <em>ground truth</em> in subsequent steps, solidifying the error in the agent&#8217;s working memory.</p></li><li><p><strong>Context Distraction:</strong> The model focuses excessively on a verbose example at the beginning of the prompt or an obscure historical edge case, overweighting the prompt content at the expense of generalized knowledge derived from pre-training.</p></li><li><p><strong>Context Confusion:</strong> The problem of tool proliferation. Exposing a dozen tools, each with lengthy descriptions and ambiguous decision boundaries, creates an unmanageable decision surface. If a human engineer cannot quickly select the correct tool by reading the prompt, the model will face similar difficulties.</p></li><li><p><strong>Context Clash:</strong> Two high-signal pieces of information in the prompt contradict each other (e.g., historical records indicate one customer location, a recent tool call indicates another). The model must arbitrate this conflict, often leading to unpredictable behavior.</p></li></ol><h2>Context Observability: The X-Ray of Working Memory</h2><p>You cannot optimize what you do not measure. Traditional observability focuses on latency, costs, and token counts. Solving context rot requires <strong>Context Observability</strong>: a granular view of the composition of the model&#8217;s working memory over time.</p><p>It is necessary to implement tools that parse conversation logs, semantically segment long messages, and label each block by component type: System Instructions, User Query, Tool Output, RAG Knowledge, etc. Visualizing these components as stacked timelines reveals when low-signal components, such as raw, uncompressed tool outputs, dominate context. The next generation of tooling must make context composition a first-class engineering concern.</p><h2>The Context Engineering Playbook</h2><p>To regain control of the attention budget, we must shift from passive accumulation to active context architecture. This requires implementing a playbook built on four fundamental engineering pillars: <strong>Isolation, Selection, Compression, and External Writing.</strong></p>
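<p>Before walking through the pillars, here is what the observability step described above might look like in its most minimal form. This is an illustrative sketch only: the component labels, the message structure, and the rough 4-characters-per-token estimate are assumptions, not the interface of any specific tool.</p><pre><code># Minimal context-composition report: label each block of the prompt
# and measure how much of the window each component type consumes.
from collections import Counter

def token_estimate(text):
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return max(1, len(text) // 4)

def composition_report(blocks):
    # blocks: list of (component_type, text) pairs, e.g. ("system", ...),
    # ("user", ...), ("tool_output", ...), ("rag", ...).
    totals = Counter()
    for component_type, text in blocks:
        totals[component_type] += token_estimate(text)
    window = sum(totals.values())
    for component_type, tokens in totals.most_common():
        share = 100.0 * tokens / window
        print(f"{component_type:12s} {tokens:8d} tokens  {share:5.1f}%")

composition_report([
    ("system", "You are a careful legal assistant. " * 20),
    ("tool_output", "raw OCR dump of page ... " * 400),  # uncompressed output
    ("rag", "retrieved clause ... " * 60),
    ("user", "Summarize the indemnification terms."),
])</code></pre><p>When a report like this shows raw tool output dominating the window, that is usually the first place to apply the compression and external-writing levers described below.</p><h3>1.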
Isolation: Decompose the Problem</h3><p>Isolation involves breaking a monolithic problem into specialized units, ensuring that no single context window has to hold the state of the entire workflow.</p><p>The most effective implementation is through <strong>multi-agent systems</strong>, where a lead agent orchestrates specialized sub-agents (e.g., Researcher, Planner, Executor). Each sub-agent maintains a context window strictly limited to its task, preventing execution logs and irrelevant tool descriptions from accumulating in a single global prompt. <em>Sandboxing</em> techniques are a form of isolation in which heavy objects (code logs, images, intermediate arrays) remain external, with only targeted, compressed results reinserted into the LLM context.</p><h3>2. Selection: Maximize Signal-to-Noise</h3><p>Selection is the principle of maximizing the signal-to-noise ratio of every token. RAG is the most common form of knowledge selection, but the technique must be generalized.</p><p>It is critical to implement <strong>tool loadouts</strong>: dynamically retrieving and exposing only the tools relevant to the immediate request, rather than presenting the entire universe of tools at every call. This drastically reduces <em>Context Confusion</em>. Similarly, instead of blindly retrieving all long-term memory, semantic selection must be used to extract only memory fragments relevant to the current task. <strong>Progressive Disclosure</strong> is the dynamic ideal: giving the agent the ability to query external data stores <em>on demand</em>, keeping the active prompt lean while the accessible knowledge base remains vast.</p><h3>3. Compression: Distill the Essence</h3><p>When preserving the essence of a long interaction or voluminous output is necessary, but raw text is too costly, compression becomes mandatory.</p><p>This involves <strong>summarization and compaction</strong>. Agents can be configured to automatically compact conversation history once a threshold is exceeded, using the LLM itself to summarize the state into a succinct representation. For greater rigor, specialized fine-tuned summarization models can be employed at agent boundaries to ensure distilled knowledge transfer. The most aggressive form is <strong>pruning</strong>, where systems like <em>Provence</em> identify and remove entire sections of a document that are statistically irrelevant to the current query, avoiding the computational cost of generative summarization.</p><h3>4. External Writing: State Offloading</h3><p>The fourth lever is accepting that not all state belongs in the prompt. We must offload persistent or temporary state into external data stores, making it accessible only via explicit tool calls.</p><p>The simplest pattern is the <strong>scratchpad</strong> or <strong>notepad tool</strong>, where the agent can write intermediate plans, notes, or reflections into an external JSON or database. This allows the agent to maintain high-level reasoning without clogging the current attention window with the detailed trace of the thought process. Long-term memories (as in <em>Reflexion</em>) are the persistent version of this approach, accessible only through controlled retrieval.</p><h2>The Foundation: High-Signal Chunking</h2><p>All these strategies fail if the underlying unit of information, the <strong>chunk</strong>, is poorly defined. 
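</p><p>As a frame of reference for the discussion that follows, the naive baseline takes only a few lines; this sketch (illustrative, with a whitespace split standing in for a real tokenizer) is the approach the rest of this section argues against:</p><pre><code># Naive baseline: split purely by token count, ignoring meaning.
def fixed_size_chunks(text, chunk_tokens=512, overlap=64):
    words = text.split()  # stand-in for real tokenization
    step = chunk_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_tokens]))
    return chunks</code></pre><p>Splits like this routinely cut a clause, a table, or an argument in half, which is exactly the failure mode discussed next.</p><p>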
The art of chunking consists of finding units that are semantically coherent but small enough for flexible retrieval and recombination.</p><p>Chunking based on fixed token counts is insufficient. It is necessary to move to <strong>semantic chunking</strong>, which uses embedding similarity to detect topic boundaries. Additionally, <strong>contextual chunking</strong> involves using an LLM to generate a brief summary for each chunk describing its meaning in the context of the entire document. This enriched representation is then embedded, thereby improving retrieval accuracy and robustness.</p><h2>Architectural Mandate: Standardization with MCP</h2><p>To move beyond bespoke architectures, standardization is essential. The <strong>Model Context Protocol (MCP)</strong>, proposed by Anthropic, offers a path forward. It defines a standardized interface (similar to JSON-RPC) for interaction between models, tools, and data.</p><p>MCP acts as a &#8220;USB-C for AI,&#8221; providing a consistent schema for tool discovery and operation invocation. This consistency radically simplifies the implementation of complex context management strategies&#8212;such as dynamic tool loadouts and controlled external writing&#8212;reducing <em>Context Clash</em> and allowing components to be reused across different frameworks without excessive &#8220;glue code.&#8221;</p><h2>Conclusion</h2><p>Context Engineering is about finding the <strong>smallest possible set of high-signal tokens</strong> that maximizes the probability of the desired behavior. Every vague instruction, redundant field, or verbose tool description is stealing attention from a more critical token.</p><p>For system prompts, this means finding the right &#8220;altitude&#8221;: avoiding fragile pseudo-code while maintaining structured instructions (via XML or Markdown). For tools, output must be token-efficient. Finally, few-shot examples must be canonical demonstrations of the desired pattern, not exhaustive documentation.</p><p>If you are building an agent today, you are designing and operating its dynamic working memory. Instead of asking &#8220;Why did the model hallucinate?&#8221;, the correct question is: &#8220;What was in its &#8216;head&#8217; at that moment, what was it paying the attention budget for, and which of those tokens truly deserved to be there?&#8221;</p>]]></content:encoded></item><item><title><![CDATA[Hello Builders, issue #2]]></title><description><![CDATA[Agentic devices, specialized reasoning models, and shifting AI industrial policy, and your organization&#8217;s people and processes readiness.]]></description><link>https://ai.techonthestack.com/p/hello-builders-issue-2</link><guid isPermaLink="false">https://ai.techonthestack.com/p/hello-builders-issue-2</guid><dc:creator><![CDATA[Luca Bianchi]]></dc:creator><pubDate>Sat, 29 Nov 2025 06:30:08 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/83679548-5293-4ffa-9bda-bd49c88840f9_840x600.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week&#8217;s AI focus is on  three converging forces: &#8220;agentic&#8221; models that can drive your UI like an operator, open and highly specialized models that challenge frontier systems on niche tasks, and an apparent acceleration in AI industrial policy from both the US and Europe. 
As a technology leader, this is the week where AI stops being just an API call and increasingly becomes an operating layer across devices, infrastructure, and regulation.</p><h2>This week&#8217;s signal in the noise</h2><ul><li><p><strong>Agentic, on-device models</strong> move from research demos to deployable products (Microsoft&#8217;s Fara-7B).</p></li><li><p><strong>Open, specialized models</strong> push frontier-level performance in narrow domains like math (DeepSeek Math V2 and the wider DeepSeek stack) while frontier model vendors sharpen price/performance and context length for long-running enterprise workloads (Anthropic Claude Opus 4.5).</p></li><li><p><strong>Policy and regulation</strong> are shifting rapidly, with the EU warning that it is &#8220;missing the boat&#8221; and the US simultaneously ramping up national AI programs while weakening state-level oversight.</p></li><li><p><strong>The human factor</strong> in AI adoption is critical; without deliberate communication, upskilling, and change management, it can easily become the main brake on a company&#8217;s AI journey.</p></li></ul><div><hr></div><h3>On-device agentic models for computer use</h3><p>Microsoft quietly released <strong>Fara-7B</strong>, a 7B-parameter &#8220;agentic&#8221; small language model designed to act as a <strong>computer-use agent</strong>: it can see your screen, click, scroll, type, and navigate the web much like a human operator. The model runs locally on Copilot+ PCs and is also available via Microsoft&#8217;s Foundry and open platforms like Hugging Face, having been trained on a large synthetic dataset of multi-step web interactions. This is a concrete signal that <strong>&#8220;UI-as-API&#8221; agents</strong> are becoming a product category rather than an experiment. You should start classifying internal applications by how &#8220;agent-compatible&#8221; their UI is (deterministic layouts, clear affordances, instrumented telemetry). Local execution also strengthens the case for <strong>on-device agents for sensitive workflows</strong>, but it raises new surface areas for monitoring, auditability, and access control&#8212;your endpoint security and DLP strategies will need to adapt.</p><p><a href="https://www.microsoft.com/en-us/research/blog/fara-7b-an-efficient-agentic-model-for-computer-use/">https://www.microsoft.com/en-us/research/blog/fara-7b-an-efficient-agentic-model-for-computer-use/</a></p><div><hr></div><h2>Specialized reasoning over monolithic LLMs</h2><p>China&#8217;s DeepSeek released <strong>DeepSeek Math V2</strong> (https://www.scmp.com/tech/tech-trends/article/3334553/deepseek-releases-first-open-ai-model-gold-level-scores-maths-olympiad), an open model specialized for mathematical problem-solving, reportedly delivering performance comparable to that of International Math Olympiad (IMO) gold medalists on benchmark suites. This follows the wider DeepSeek family of open models, which have already drawn attention for their strong reasoning capabilities at significantly lower compute and cost profiles than many Western frontier models.</p><p>This continues a strategic trend: <strong>highly specialized open models</strong> that outperform giant generalist models on narrow but economically important domains (quant, optimization, verification, scientific computing). It&#8217;s a strong signal to shift from a single &#8220;one-model-fits-all&#8221; architecture to a <strong>portfolio of domain models</strong>. 
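</p><p>In practice, a &#8220;portfolio of domain models&#8221; can start as nothing more than a routing table in front of your model calls. A minimal sketch, with made-up model names and cost ceilings standing in for whatever your own evaluations select:</p><pre><code># Minimal model-portfolio router: pick a model per task domain instead of
# sending everything to a single generalist. All names and numbers below
# are placeholders, not recommendations.
ROUTES = {
    "math":       {"model": "open-math-specialist", "max_cost_per_1k": 0.3},
    "code":       {"model": "frontier-coder",       "max_cost_per_1k": 3.0},
    "extraction": {"model": "small-local-model",    "max_cost_per_1k": 0.1},
}
DEFAULT = {"model": "general-purpose-model", "max_cost_per_1k": 1.0}

def route(task_domain):
    # Fall back to the generalist when no specialist is configured.
    return ROUTES.get(task_domain, DEFAULT)

print(route("math")["model"])        # open-math-specialist
print(route("marketing")["model"])   # general-purpose-model</code></pre><p>The value is not in this particular mapping but in the fact that the mapping exists as data you can re-evaluate as prices, context limits, and benchmarks move.</p><p>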
DeepSeek&#8217;s cost and hardware efficiency also underline that <strong>GPU scarcity is no longer an excuse</strong> for not experimenting with strong reasoning systems. This is even more true if we consider a recently published paper (https://x.com/omarsar0/status/1993695515595444366?t=lpD7UiqjTbYd453g-DPvpw&amp;s=33) that shows how trained 12B models can exhibit reasoning capabilities comparable to those of frontier models.</p><p>On the other side of the river, Anthropic introduced <strong>Claude Opus 4.5</strong> that combines <strong>lower API pricing</strong> with <strong>longer context windows</strong> and improved reasoning, explicitly targeting long-running chats and multi-step workflows that previously hit context or budget ceilings. This is yet another hint to avoid using a single LLM architecture for your applications. You should plan a <strong>Q4&#8211;Q1 refresh of your model cost models and routing strategies</strong>.</p><div><hr></div><h2>&#8220;Missing the boat&#8221; on competitiveness...</h2><p>European Central Bank President Christine Lagarde warned that the EU is <strong>risking its economic future</strong> by falling behind the US and China in AI adoption and development (<a href="https://www.reuters.com/business/eu-missing-boat-ai-jeopardising-its-future-lagarde-warns-2025-11-24/">https://www.reuters.com/business/eu-missing-boat-ai-jeopardising-its-future-lagarde-warns-2025-11-24/</a>). She highlighted the risk of Europe becoming structurally dependent on foreign AI infrastructure, compute, chips, and hyperscale data centers, if it remains primarily a buyer rather than a builder, and called for open standards, cheaper energy, and integrated capital markets. This could probably be an explicit signal that <strong>AI capacity is now viewed as strategic infrastructure</strong>, on par with energy or telecoms. Expect a new wave of <strong>national and EU-level incentives, consortia, and regulatory adjustments</strong> to accelerate domestic AI capability. Companies operating in Europe should actively track these programs (funding, tax treatment, regulatory sandboxes) and, where possible, position flagship AI initiatives as <strong>part of the region&#8217;s industrial policy story</strong>.</p><p>This is still far from the US AI-aggressive &#8220;Genesis Mission&#8221; (<a href="https://www.federalregister.gov/documents/2025/11/28/2025-21665/launching-the-genesis-mission">https://www.federalregister.gov/documents/2025/11/28/2025-21665/launching-the-genesis-mission</a>), a federal effort to turn national scientific and engineering data into discoveries using AI, led by the Department of Energy and powered by public and private supercomputers. US AI policy is simultaneously <strong>turbocharging federal AI projects</strong> and <strong>weakening decentralized regulation</strong>.</p><div><hr></div><h2>The slow AI adoption</h2><p>A Harvard study (https://hbr.org/2025/11/leaders-assume-employees-are-excited-about-ai-theyre-wrong) found that 76% of executives believe employees are excited about AI, yet only 31% of individual contributors agree. The same survey reveals a deep organizational blind spot: 75% of leaders describe their company as &#8220;employee-centric&#8221;, but only 23% of employees share that view. Only 30% of employees feel informed about their organization&#8217;s AI strategy, versus 80% of leaders who believe they are. The study concludes that truly employee-centric organizations are up to seven times more likely to succeed with AI. 
Daniel Goleman traces this friction to the psychological impact of change and argues that Emotional Intelligence has a central role to play in any successful adoption strategy (<a href="https://www.kornferry.com/insights/this-week-in-leadership/the-big-ai-roadblock-in-our-heads">https://www.kornferry.com/insights/this-week-in-leadership/the-big-ai-roadblock-in-our-heads</a>).</p>]]></content:encoded></item><item><title><![CDATA[Hello Builders, issue #1]]></title><description><![CDATA[Your weekly briefing on the signals that actually matter in AI, distilled, contextualized, and built for people who are shipping.]]></description><link>https://ai.techonthestack.com/p/hello-builders-issue-1</link><guid isPermaLink="false">https://ai.techonthestack.com/p/hello-builders-issue-1</guid><dc:creator><![CDATA[Luca Bianchi]]></dc:creator><pubDate>Fri, 21 Nov 2025 09:13:32 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ef184f42-0a85-4068-aed5-f337c049338d_630x450.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>On November 13, <a href="https://www.cnbc.com/2025/11/13/cursor-ai-startup-funding-round-valuation.html">Cursor&#8212;a code editor&#8212;raised $2.3 billion at a $29.3 billion valuation</a>. That&#8217;s not a typo. A tool that helps developers write code is now valued higher than most cloud infrastructure companies. <a href="https://www.businesswire.com/news/home/20251113939996/en/Cursor-Secures-$2.3-Billion-Series-D-Financing-at-$29.3-Billion-Valuation-to-Redefine-How-Software-is-Written">The company hit $1 billion in annualized revenue with enterprise revenue growing 100x year-to-date</a>. <a href="https://www.cnbc.com/2025/11/13/cursor-ai-startup-funding-round-valuation.html">Nvidia&#8217;s CEO called it his &#8220;favorite enterprise AI service.&#8221;</a></p><p>Here&#8217;s why that matters: Cursor doesn&#8217;t own any frontier AI models. It uses OpenAI, Anthropic, and Google APIs. The valuation isn&#8217;t for the technology&#8212;it&#8217;s for workflow capture. When developers open their IDE, Cursor controls the moment of creation. And apparently, that&#8217;s worth $30 billion.</p><p>The same day, <a href="https://openai.com/index/gpt-5-1-for-developers/">OpenAI shipped GPT-5.1 across their entire model family</a>&#8212;not as a research preview, but as production infrastructure. <a href="https://openai.com/index/gpt-5-1-for-developers/">The killer feature isn&#8217;t smarter responses; it&#8217;s 24-hour prompt caching that reduces costs by 90% for cached tokens</a> (a quick cost sketch below shows what that means per request). <a href="https://help.openai.com/en/articles/9624314-model-release-notes">When inference costs have dropped 280x since GPT-3.5</a>, you&#8217;re not selling API calls anymore&#8212;you&#8217;re selling commodity compute.</p><p>Meanwhile, <a href="https://www.anthropic.com/news/anthropic-invests-50-billion-in-american-ai-infrastructure">Anthropic announced a $50 billion partnership with Fluidstack to build custom data centers in Texas and New York, coming online throughout 2026</a>. Not &#8220;exploring opportunities.&#8221; Actual construction creating <a href="https://www.anthropic.com/news/anthropic-invests-50-billion-in-american-ai-infrastructure">800 permanent jobs and 2,400 construction positions</a>.
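</p><p>Back to the GPT-5.1 prompt-caching point for a moment: a quick back-of-the-envelope sketch of what a 90% cached-token discount does to per-request cost. The prices and token counts below are placeholder assumptions, not OpenAI&#8217;s published rates; the only figure taken from the announcement is the 90% discount on cached input tokens.</p><pre><code># Back-of-the-envelope arithmetic for cached-prompt pricing.
# Prices and token counts are placeholders; only the "90% off cached input"
# assumption comes from the announcement discussed above.
INPUT_USD_PER_1M = 10.00    # assumed base input price
OUTPUT_USD_PER_1M = 30.00   # assumed output price
CACHED_DISCOUNT = 0.90      # cached input tokens are billed at a 90% discount

def request_cost(cached_tokens: int, fresh_tokens: int, output_tokens: int) -> float:
    cached = cached_tokens * INPUT_USD_PER_1M * (1 - CACHED_DISCOUNT)
    fresh = fresh_tokens * INPUT_USD_PER_1M
    out = output_tokens * OUTPUT_USD_PER_1M
    return (cached + fresh + out) / 1_000_000

# A typical agent call: a 20k-token system prompt and tool schema that stay stable
# across calls (cacheable), 2k tokens of fresh user context, 1k tokens of output.
cold = request_cost(0, 22_000, 1_000)      # first call, nothing cached yet
warm = request_cost(20_000, 2_000, 1_000)  # later calls inside the cache window
print(f"cold: ${cold:.3f}  warm: ${warm:.3f}  saving: {1 - warm / cold:.0%}")
</code></pre><p>With these made-up numbers the warm call costs roughly 70% less than the cold one, and the remaining cost is dominated by fresh input and output tokens. That is the structural shift: long, stable system prompts and tool schemas stop being a per-call tax and start behaving like amortized infrastructure.</p><p>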
And <a href="https://www.fool.com/investing/2025/11/03/open-ai-move-made-microsoft-no-brainer-buy/">Microsoft formalized its relationship with OpenAI: a $135 billion equity stake (27% ownership) tied to $250 billion in Azure commitments through 2032</a>.</p><p>But here&#8217;s the development that should make every CTO reconsider their model procurement strategy: <a href="https://siliconangle.com/2025/11/07/moonshot-launches-open-source-kimi-k2-thinking-ai-trillion-parameters-reasoning-capabilities/">Chinese startup Moonshot AI released Kimi K2 Thinking, a trillion-parameter reasoning model, trained for $4.6 million</a>. Western competitors spend $100 million or more for similar capabilities. That&#8217;s a 20x cost advantage. <a href="https://www.deeplearning.ai/the-batch/moonshot-releases-kimi-k2-a-trillion-parameter-model-fine-tuned-for-agentic-tool-use/">And on SWE-Bench Verified&#8212;the benchmark that actually matters for code generation&#8212;it scores 65.8% versus GPT-4.1&#8217;s 54.6%</a>.</p><p>The model is open-sourced under a modified MIT license.</p><p>When you can train frontier models for $5 million instead of $100 million, when code editors command infrastructure-scale valuations, when inference costs drop 280x in two years&#8212;the entire stack is repricing. The companies that treated AI as an R&amp;D expense are now competing against companies that treated it as infrastructure investment. And the gap is measured in billions.</p><p>There&#8217;s one more data point that ties this together. <a href="https://www.marketingprofs.com/opinions/2025/54004/ai-update-november-14-2025-ai-news-and-views-from-the-past-week">Microsoft Clarity analyzed 1,200+ publisher sites and found that AI referral traffic&#8212;from ChatGPT, Copilot, Perplexity, Gemini&#8212;converts at 1.66% versus 0.15% for search</a>. That&#8217;s an 11x difference in customer acquisition efficiency. <a href="https://www.marketingprofs.com/opinions/2025/54004/ai-update-november-14-2025-ai-news-and-views-from-the-past-week">AI traffic grew 155.6% in eight months</a>. The distribution layer isn&#8217;t shifting&#8212;it&#8217;s already shifted.</p><p>The question isn&#8217;t whether AI works at scale anymore. The question is whether your architecture can scale with it.</p><p>Keep building,</p><p><em><strong>Luca</strong></em></p><div><hr></div><h3>Key Takeaways</h3><ol><li><p><strong>Application Layer = Infrastructure Economics: </strong>Cursor&#8217;s valuation proves workflow control beats model ownership. Your internal tools now compete against apps with billion-dollar budgets optimizing for developer capture.</p></li><li><p><strong>Multi-Cloud = Operational Necessity: </strong>Microsoft/OpenAI decoupled after 5 years. OpenAI committed $1.15T across 7 vendors (Broadcom $350B, Oracle $300B, Microsoft $250B, Nvidia $100B, AMD $90B, AWS $38B, CoreWeave $22B). Single-vendor is risk concentration, not optimization.</p></li><li><p><strong>Training Cost Arbitrage Accelerates: </strong>Moonshot trained a 1T-parameter model for $4.6M that beats Western benchmarks. MuonClip optimizer enabled zero-instability training at scale. The technical gap closes in months, not years.</p></li><li><p><strong>AI Referral Conversion: 11x Search, &lt;1% Traffic: </strong>1.66% vs 0.15% conversion, but negligible volume. Companies optimizing for AI discovery now gain 18-24 months advantage before it becomes standard.</p></li></ol>]]></content:encoded></item></channel></rss>