1. Mythos Preview, and capabilities Anthropic didn't train for
On April 7, Anthropic announced Claude Mythos Preview, a research preview of a new general-purpose model. The notable claim wasn't a benchmark score but an emergent capability: Mythos autonomously found and exploited zero-day vulnerabilities in major operating systems and browsers, including a 27-year-old bug in OpenBSD. Anthropic launched Project Glasswing to coordinate disclosure and defense, and stated the capability emerged as a downstream consequence of general improvements to code and reasoning, not from targeted training.
The detail worth pausing on isn't the security story. It's that capabilities appear without being trained for. The same emergence pattern is why vision models keep getting better at extracting tables nobody specifically annotated, and why instruction following crept up between releases without anyone advertising it. It's also why your last model upgrade quietly changed how the system handles ambiguous fields.
You don't find out from model cards. You find out from regression tests across document types: the boring evaluation harness that nobody wants to maintain. When the same upgrade that found a 27-year-old OpenBSD bug also subtly changes how a model handles handwritten invoice corrections, you want to know before an AP clerk does. The gap between announced capabilities and actual behavior in your pipeline keeps growing.
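For a sense of how small that harness can start, here is a minimal sketch in Python. It assumes you already store each model version's extracted fields per document; the record layout, document IDs, and field names are hypothetical, not taken from any particular tool.

```python
from collections import Counter


def diff_extractions(baseline, candidate):
    """Compare field-level outputs of two model versions on the same test set.

    Both arguments map doc_id -> {"doc_type": str, "fields": {name: value}}
    (a hypothetical layout). Returns a Counter of changed fields per document
    type and a list of (doc_id, field, old_value, new_value) tuples.
    """
    changed_by_type = Counter()
    changes = []
    for doc_id, base in baseline.items():
        # A document missing from the candidate run counts as all fields changed.
        cand_fields = candidate.get(doc_id, {}).get("fields", {})
        for name in sorted(set(base["fields"]) | set(cand_fields)):
            old = base["fields"].get(name)
            new = cand_fields.get(name)
            if old != new:
                changed_by_type[base["doc_type"]] += 1
                changes.append((doc_id, name, old, new))
    return changed_by_type, changes


# Illustrative data only: last release vs. the upgraded model.
baseline = {
    "inv-001": {"doc_type": "invoice", "fields": {"total": "1,240.00", "currency": "EUR"}},
    "inv-002": {"doc_type": "invoice", "fields": {"total": "88.10", "currency": "USD"}},
    "po-001": {"doc_type": "purchase_order", "fields": {"po_number": "PO-7731"}},
}
candidate = {
    "inv-001": {"doc_type": "invoice", "fields": {"total": "1,240.00", "currency": "EUR"}},
    # Same scan, but the handwritten correction is now read differently.
    "inv-002": {"doc_type": "invoice", "fields": {"total": "83.10", "currency": "USD"}},
    "po-001": {"doc_type": "purchase_order", "fields": {"po_number": "PO-7731"}},
}

changed_by_type, changes = diff_extractions(baseline, candidate)
for doc_id, name, old, new in changes:
    print(f"{doc_id}: {name} changed from {old!r} to {new!r}")
```

The diff doesn't say which version is right. It only tells you where to look, grouped by document type, before someone downstream finds it for you.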
2. Muse Spark, and what "multimodal" actually means in production
Meta Superintelligence Labs released Muse Spark in April, the first model in a new series. The pitch is multimodal perception: pointing a phone at a snack shelf and asking which item has the most protein, scanning a product to compare alternatives, parsing health charts. Meta also showcased parallel subagent execution, where Meta AI launches multiple agents in parallel for a single query. The model now powers the Meta AI app and meta.ai, with API access in private preview for select partners.
The demos are clean. A snack shelf in good lighting. A product label held at the right angle. A chart with axes that make sense. This is where vision models look excellent, and where document processing teams have learned to be cautious.
What benchmarks rarely capture: scans skewed three degrees because the corner of the page lifted. Stamps half-overlapping a key field. A handwritten correction over the final amount. An invoice photographed at a kitchen table with glare across the totals. These are not edge cases. They're Tuesday.
Better multimodal models do help. Vision pipelines that needed bespoke pre-processing two years ago now handle a wider range of inputs out of the box. But the production gap stays roughly constant. Each generation closes ground on benchmarks. The failure modes that actually break production pipelines improve more slowly.
3. Claude Design, and the rise of AI-generated documents
On April 17, Anthropic Labs released Claude Design, powered by Claude Opus 4.7. It produces visual work from prompts (prototypes, slide decks, one-pagers, marketing collateral) and imports from DOCX, PPTX, XLSX, codebases, or live websites. Output exports to Canva, PDF, PPTX, or standalone HTML, with a handoff bundle for Claude Code when a design is ready to build. Teams can supply a design system that the model applies automatically. It is available in research preview for Pro, Max, Team, and Enterprise subscribers.
Most discussion of generative AI in document processing focuses on the extraction side. Tools like Claude Design change the supply side. A larger share of the PDFs and decks flowing through procurement, finance, and legal pipelines will have been drafted by a model, sometimes lightly edited, sometimes shipped straight from a prompt.
Mapping teams who turn one format into another already deal with vendor templates that change silently when a procurement department updates them. Add another layer: AI-generated formats that look polished but follow a different visual grammar than human-designed templates. Field positions drift. Logos render as inline SVGs instead of embedded images. Tables get rebuilt in HTML rather than native PPTX. Not a crisis. A small ongoing tax on every extraction pipeline that quietly assumed documents were authored the way they used to be.
4. DeepSeek V4, and the economics of bulk document processing
DeepSeek released V4 in early May. It's an open-weight model with a 1M-token context window and an MoE architecture (roughly 37B of 685B parameters active per pass), reportedly competitive with GPT-4o and Claude 3.7 Sonnet on coding, reasoning, and language benchmarks. Inference pricing on the official API runs roughly 80–95% below comparable closed models, and weights are downloadable under a permissive license. The lab reports V3 was trained for about $5.5M, though exact totals for V4 are not yet disclosed.
The cost number changes what's worth doing. At three dollars per million tokens, you triage: you run the model only on the documents that already failed simpler heuristics. At a quarter of that, the math flips. You can run two models on every document and compare. You can re-process the back catalog. You can afford to throw a second pass at every low-confidence field before routing it to a human.
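The arithmetic is worth doing once with your own numbers. The sketch below uses the two price points from the paragraph above; the document count, the tokens per document, and the assumption that both models in the comparison run at the cheaper price are all hypothetical placeholders.

```python
def batch_cost(num_docs, tokens_per_doc, usd_per_million_tokens, passes=1):
    """Cost in USD of running `passes` model passes over a batch of documents."""
    total_tokens = num_docs * tokens_per_doc * passes
    return total_tokens / 1_000_000 * usd_per_million_tokens


def fields_to_review(fields_a, fields_b):
    """Names of fields where two models disagree (candidates for a human)."""
    names = set(fields_a) | set(fields_b)
    return sorted(n for n in names if fields_a.get(n) != fields_b.get(n))


NUM_DOCS = 1_000_000     # hypothetical back catalog size
TOKENS_PER_DOC = 2_000   # hypothetical average, input plus output

# One pass at three dollars per million tokens.
single_pass_old = batch_cost(NUM_DOCS, TOKENS_PER_DOC, 3.00)
# Two passes (two models, both assumed to cost a quarter of that).
double_pass_new = batch_cost(NUM_DOCS, TOKENS_PER_DOC, 0.75, passes=2)

print(f"one pass at $3.00/M:    ${single_pass_old:,.0f}")
print(f"two passes at $0.75/M:  ${double_pass_new:,.0f}")

# Illustrative outputs from two models on the same invoice.
model_a = {"total": "88.10", "currency": "USD", "due_date": "2025-06-01"}
model_b = {"total": "83.10", "currency": "USD", "due_date": "2025-06-01"}
print("send to human:", fields_to_review(model_a, model_b))
```

Under these assumptions, running two models on everything costs half of what a single pass used to. The disagreement list is the cheap triage signal: fields where both models agree can skip the queue, at least provisionally.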
The 1M context window matters separately. A full contract pack (master agreement, schedules, side letters, amendments) can sit in one prompt instead of being chunked across calls. That doesn't fix accuracy on its own, but it removes a class of errors caused by stitching context back together across chunks.
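A quick pre-flight check makes the single-prompt option concrete. This is a rough sketch: the four-characters-per-token ratio is a crude heuristic rather than a real tokenizer, and the reserve for instructions and output is an arbitrary placeholder. Swap in the provider's own token counter before trusting the result.

```python
CONTEXT_WINDOW_TOKENS = 1_000_000  # the 1M window discussed above
CHARS_PER_TOKEN = 4                # assumption: crude average for English text


def estimate_tokens(text):
    """Very rough token estimate; a real tokenizer will differ."""
    return len(text) // CHARS_PER_TOKEN + 1


def fits_in_one_prompt(documents, reserve_tokens=50_000):
    """Return (estimated_total_tokens, fits) for a dict of name -> text.

    `reserve_tokens` is a hypothetical allowance for instructions and output.
    """
    total = sum(estimate_tokens(text) for text in documents.values())
    return total, total + reserve_tokens <= CONTEXT_WINDOW_TOKENS


# Placeholder texts standing in for a contract pack.
pack = {
    "master_agreement": "x" * 400_000,
    "schedules": "x" * 200_000,
    "side_letters": "x" * 40_000,
    "amendments": "x" * 80_000,
}

total, fits = fits_in_one_prompt(pack)
print(f"estimated tokens: {total:,}  fits in one prompt: {fits}")
```

If the pack doesn't fit, you are back to chunking, and the stitching errors come back with it.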
Cheap models don't fix the last mile. They make it economically reasonable to try more things while you work on it.
A small confession to close on. I spent more time this quarter re-running last year's test sets against this year's models than I did reading model cards. Half the deltas were quiet improvements. The other half were quiet surprises. That ratio hasn't moved much in the last year, which is its own kind of signal.