What’s new today…

VentureBeat: Transformative tech coverage that matters

  • Why reinforcement learning plateaus without representation depth (and other key takeaways from NeurIPS 2025)
    on January 17, 2026 at 7:00 pm

    Every year, NeurIPS produces hundreds of impressive papers, and a handful that subtly reset how practitioners think about scaling, evaluation and system design. In 2025, the most consequential works weren’t about a single breakthrough model. Instead, they challenged fundamental assumptions that academics and corporations have quietly relied on: Bigger models mean better reasoning, RL creates new capabilities, attention is “solved” and generative models inevitably memorize.

    This year’s top papers collectively point to a deeper shift: AI progress is now constrained less by raw model capacity and more by architecture, training dynamics and evaluation strategy. Below is a technical deep dive into five of the most influential NeurIPS 2025 papers — and what they mean for anyone building real-world AI systems.

    1. LLMs are converging—and we finally have a way to measure it

    Paper: Artificial Hivemind: The Open-Ended Homogeneity of Language Models

    For years, LLM evaluation has focused on correctness. But in open-ended or ambiguous tasks like brainstorming, ideation or creative synthesis, there often is no single correct answer. The risk instead is homogeneity: Models producing the same “safe,” high-probability responses.

    This paper introduces Infinity-Chat, a benchmark designed explicitly to measure diversity and pluralism in open-ended generation. Rather than scoring answers as right or wrong, it measures:

    • Intra-model collapse: How often the same model repeats itself
    • Inter-model homogeneity: How similar different models’ outputs are

    The result is uncomfortable but important: Across architectures and providers, models increasingly converge on similar outputs — even when multiple valid answers exist.

    Why this matters in practice

    For corporations, this reframes “alignment” as a trade-off.
    Preference tuning and safety constraints can quietly reduce diversity, leading to assistants that feel too safe, predictable or biased toward dominant viewpoints.

    Takeaway: If your product relies on creative or exploratory outputs, diversity metrics need to be first-class citizens.

    2. Attention isn’t finished — a simple gate changes everything

    Paper: Gated Attention for Large Language Models

    Transformer attention has been treated as settled engineering. This paper proves it isn’t.

    The authors introduce a small architectural change: Apply a query-dependent sigmoid gate after scaled dot-product attention, per attention head. That’s it. No exotic kernels, no massive overhead.

    Across dozens of large-scale training runs — including dense and mixture-of-experts (MoE) models trained on trillions of tokens — this gated variant:

    • Improved stability
    • Reduced “attention sinks”
    • Enhanced long-context performance
    • Consistently outperformed vanilla attention

    Why it works

    The gate introduces:

    • Non-linearity in attention outputs
    • Implicit sparsity, suppressing pathological activations

    This challenges the assumption that attention failures are purely data or optimization problems.

    Takeaway: Some of the biggest LLM reliability issues may be architectural — not algorithmic — and solvable with surprisingly small changes.

    3. RL can scale — if you scale in depth, not just data

    Paper: 1,000-Layer Networks for Self-Supervised Reinforcement Learning

    Conventional wisdom says RL doesn’t scale well without dense rewards or demonstrations. This paper shows that assumption is incomplete.

    By scaling network depth aggressively from the typical 2 to 5 layers to nearly 1,000 layers, the authors demonstrate dramatic gains in self-supervised, goal-conditioned RL, with performance improvements ranging from 2X to 50X.

    The key isn’t brute force.
    It’s pairing depth with contrastive objectives, stable optimization regimes and goal-conditioned representations.

    Why this matters beyond robotics

    For agentic systems and autonomous workflows, this suggests that representation depth — not just data or reward shaping — may be a critical lever for generalization and exploration.

    Takeaway: RL’s scaling limits may be architectural, not fundamental.

    4. Why diffusion models generalize instead of memorizing

    Paper: Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training

    Diffusion models are massively overparameterized, yet they often generalize remarkably well. This paper explains why.

    The authors identify two distinct training timescales:

    • One where generative quality rapidly improves
    • Another — much slower — where memorization emerges

    Crucially, the memorization timescale grows linearly with dataset size, creating a widening window where models improve without overfitting.

    Practical implications

    This reframes early stopping and dataset scaling strategies. Memorization isn’t inevitable — it’s predictable and delayed.

    Takeaway: For diffusion training, dataset size doesn’t just improve quality — it actively delays overfitting.

    5. RL improves reasoning performance, not reasoning capacity

    Paper: Does Reinforcement Learning Really Incentivize Reasoning in LLMs?

    Perhaps the most strategically important result of NeurIPS 2025 is also the most sobering. This paper rigorously tests whether reinforcement learning with verifiable rewards (RLVR) actually creates new reasoning abilities in LLMs — or simply reshapes existing ones.

    Their conclusion: RLVR primarily improves sampling efficiency, not reasoning capacity.
    At large sample sizes, the base model often already contains the correct reasoning trajectories.

    What this means for LLM training pipelines

    RL is better understood as:

    • A distribution-shaping mechanism
    • Not a generator of fundamentally new capabilities

    Takeaway: To truly expand reasoning capacity, RL likely needs to be paired with mechanisms like teacher distillation or architectural changes — not used in isolation.

    The bigger picture: AI progress is becoming systems-limited

    Taken together, these papers point to a common theme: The bottleneck in modern AI is no longer raw model size — it’s system design.

    • Diversity collapse requires new evaluation metrics
    • Attention failures require architectural fixes
    • RL scaling depends on depth and representation
    • Memorization depends on training dynamics, not parameter count
    • Reasoning gains depend on how distributions are shaped, not just optimized

    For builders, the message is clear: Competitive advantage is shifting from “who has the biggest model” to “who understands the system.”

    Maitreyi Chatterjee is a software engineer. Devansh Agarwal works as an ML engineer at a FAANG company.
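    The gated-attention change highlighted in item 2 above is small enough to sketch in a few lines. The following is a minimal NumPy illustration, not the paper’s implementation; in particular, the gate’s parameterization here (a learned per-head projection of the query, passed through a sigmoid) is an assumption for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention(Q, K, V, W_gate):
    """Scaled dot-product attention followed by a query-dependent
    sigmoid gate applied per head to the attention output (a sketch
    of the idea; the exact gate parameterization is an assumption)."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d)  # (heads, seq, seq)
    out = softmax(scores) @ V                       # vanilla attention output
    gate = 1.0 / (1.0 + np.exp(-(Q @ W_gate)))      # sigmoid gate from the query
    return gate * out                               # elementwise, per head

heads, seq, d = 2, 4, 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, heads, seq, d))
W_gate = 0.1 * rng.normal(size=(heads, d, d))
y = gated_attention(Q, K, V, W_gate)
print(y.shape)  # (2, 4, 8)
```

    Because each gate value lies in (0, 1), the gated output is never larger in magnitude than the vanilla attention output, which is one way to read the “implicit sparsity” the paper describes: heads can learn to suppress pathological activations on a per-query basis.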

  • Black Forest Labs launches open source Flux.2 [klein] to generate AI images in less than a second
    by carl.franzen@venturebeat.com (Carl Franzen) on January 16, 2026 at 11:28 pm

    The German AI startup Black Forest Labs (BFL), founded by former Stability AI engineers, is continuing to build out its suite of open source AI image generators with the release of FLUX.2 [klein], a new pair of small models — one open and one non-commercial — that emphasizes speed and lower compute requirements, generating images in less than a second on an Nvidia GB200. The [klein] series, released yesterday, comes in two parameter counts: 4 billion (4B) and 9 billion (9B). The model weights are available on Hugging Face and code on GitHub.

    While the larger models in the FLUX.2 family ([max] and [pro]), released in November 2025, chase the limits of photorealism and “grounding search” capabilities, [klein] is designed specifically for consumer hardware and latency-critical workflows.

    In great news for enterprises, the 4B version is available under an Apache 2.0 license, meaning they — or any organization or developer — can use it for commercial purposes without paying BFL or any intermediaries a dime. In addition, a number of AI image and media creation platforms, including Fal.ai, have begun offering it at extremely low cost through their application programming interfaces (APIs) and as a direct-to-user tool.

    Already, the model has won strong praise from early users for its speed. What it lacks in overall image quality, it seems to make up for in fast generation, an open license, affordability and a small footprint — benefiting enterprises that want to run image models on their own hardware or at extremely low cost. So how did BFL do it, and how can it benefit you? Read on to learn more.

    The “Pareto Frontier” of Latency

    The technical philosophy behind [klein] is what BFL documentation describes as defining the “Pareto frontier” for quality versus latency.
    In simple terms, they have attempted to squeeze the maximum possible visual fidelity into a model small enough to run on a home gaming PC without a noticeable lag. The performance metrics released by the company paint a picture of a model built for interactivity rather than just batch generation. According to Black Forest Labs’ official figures, the [klein] models are capable of generating or editing images in under 0.5 seconds on modern hardware. Even on standard consumer GPUs like an RTX 3090 or 4070, the 4B model is designed to fit comfortably within approximately 13GB of VRAM.

    This speed is achieved through “distillation,” a process where a larger, more complex model “teaches” a smaller, more efficient one to approximate its outputs in fewer steps. The distilled [klein] variants require only four steps to generate an image. This effectively turns the generation process from a coffee-break task into a near-instantaneous one, enabling what BFL describes on X (formerly Twitter) as “developing ideas from 0 → 1” in real-time.

    Under the Hood: Unified Architecture

    Historically, image generation and image editing have often required different pipelines or complex adapters (like ControlNets). FLUX.2 [klein] attempts to unify these.
    The architecture natively supports text-to-image, single-reference editing, and multi-reference composition without needing to swap models. According to the documentation released on GitHub, the models support:

    • Multi-Reference Editing: Users can upload up to four reference images (or ten in the playground) to guide the style or structure of the output.
    • Hex-Code Color Control: A frequent pain point for designers is getting “that exact shade of red.” The new models accept specific hex codes in prompts (e.g., #800020) to force precise color rendering.
    • Structured Prompting: The model parses JSON-like structured inputs for rigorously defined compositions, a feature clearly aimed at programmatic generation and enterprise pipelines.

    The Licensing Split: Open Weights vs. Open Source

    For startups and developers building on top of BFL’s tech, understanding the licensing landscape of this release is critical. BFL has adopted a split strategy that separates “hobbyist/research” use from “commercial infrastructure.”

    • FLUX.2 [klein] 4B: Released under Apache 2.0. This is a permissive free software license that allows for commercial use, modification, and redistribution. If you are building a paid app, a SaaS platform, or a game that integrates AI generation, you can use the 4B model royalty-free.
    • FLUX.2 [klein] 9B & [dev]: Released under the FLUX Non-Commercial License. These weights are open for researchers and hobbyists to download and experiment with, but they cannot be used for commercial applications without a separate agreement.

    This distinction positions the 4B model as a direct competitor to other open-weights models like Stable Diffusion 3 Medium or SDXL, but with a more modern architecture and a permissive license that removes legal ambiguity for startups.

    Ecosystem Integration: ComfyUI and Beyond

    BFL is clearly aware that a model is only as good as the tools that run it.
    Coinciding with the model drop, the team released official workflow templates for ComfyUI, the node-based interface that has become the standard integrated development environment (IDE) for AI artists. The workflows—specifically image_flux2_klein_text_to_image.json and the editing variants—allow users to drag and drop the new capabilities into existing pipelines immediately.

    Community reaction on social media has centered on this workflow integration and the speed. In a post on X, the official Black Forest Labs account highlighted the model’s ability to “rapidly explore a specific aesthetic,” showcasing a video where the style of an image shifted instantly as the user scrubbed through options.

    Why It Matters For Enterprise AI Decision-Makers

    The release of FLUX.2 [klein] signals a maturation in the generative AI market, moving past the initial phase of novelty into a period defined by utility, integration, and speed.

    For Lead AI Engineers who are constantly juggling the need to balance speed with quality, this shift is pivotal. These professionals, who manage the full lifecycle of models from data preparation to deployment, often face the daily challenge of integrating rapidly evolving tools into existing workflows. The availability of a distilled 4B model under an Apache 2.0 license offers a practical solution for those focused on rapid deployment and fine-tuning to achieve specific business goals, allowing them to bypass the latency bottlenecks that typically plague high-fidelity image generation.

    For Senior AI Engineers focused on orchestration and automation, the implications are equally significant. These experts are responsible for building scalable AI pipelines and maintaining model integrity across different environments, often while working under strict budget constraints. The lightweight nature of the [klein] family directly addresses the challenge of implementing efficient systems with limited resources.
    By utilizing a model that fits within consumer-grade VRAM, orchestration specialists can architect cost-effective, local inference pipelines that avoid the heavy operational costs associated with massive proprietary models.

    Even for the Director of IT Security, the move toward capable, locally runnable open-weight models offers a distinct advantage. For leaders tasked with protecting the organization from cyber threats and managing security operations with limited resources, reliance on external APIs for sensitive creative workflows can be a vulnerability. A high-quality model that runs locally allows security leaders to sanction AI tools that keep proprietary data within the corporate firewall, balancing the operational demands of the business with the robust security measures they are required to uphold.
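    The structured-prompting and hex-code features described in this article lend themselves to programmatic use. Below is a minimal sketch of what a JSON-structured prompt might look like; the field names and schema are hypothetical illustrations, not BFL’s documented format, so consult the official GitHub documentation before building on this.

```python
import json

# Hypothetical structured prompt combining two documented [klein] features:
# hex-code color control and JSON-like structured inputs. The field names
# ("scene", "style", "colors", "references") are illustrative assumptions,
# not BFL's actual schema.
prompt = {
    "scene": "product shot of a ceramic mug on a walnut desk",
    "style": "soft studio lighting, shallow depth of field",
    "colors": {"mug": "#800020", "background": "#F5F5F0"},  # exact shades via hex codes
    "references": ["ref_01.png", "ref_02.png"],  # up to four reference images supported
}

payload = json.dumps(prompt, indent=2)
print(payload)
```

    The appeal of a structured format like this for enterprise pipelines is that prompts become data: they can be validated, templated, versioned and generated programmatically rather than hand-written as free text.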

  • How Google’s ‘internal RL’ could unlock long-horizon AI agents
    by bendee983@gmail.com (Ben Dickson) on January 16, 2026 at 10:41 pm

    Researchers at Google have developed a technique that makes it easier for AI models to learn complex reasoning tasks that usually cause LLMs to hallucinate or fall apart. Instead of training LLMs through next-token prediction, their technique, called internal reinforcement learning (internal RL), steers the model’s internal activations toward developing a high-level step-by-step solution for the input problem. Ultimately, this could provide a scalable path for creating autonomous agents that can handle complex reasoning and real-world robotics without needing constant, manual guidance.

    The limits of next-token prediction

    Reinforcement learning plays a key role in post-training LLMs, particularly for complex reasoning tasks that require long-horizon planning. However, the problem lies in the architecture of these models. LLMs are autoregressive, meaning they generate sequences one token at a time. When these models explore new strategies during training, they do so by making small, random changes to the next single token or action. This exposes a deeper limitation: next-token prediction forces models to search for solutions at the wrong level of abstraction, making long-horizon reasoning inefficient even when the model “knows” what to do.

    This token-by-token approach works well for basic language modeling but breaks down in long-horizon tasks where rewards are sparse. If the model relies solely on random token-level sampling, the probability of stumbling upon the correct multi-step solution is infinitesimally small, “on the order of one in a million,” according to the researchers.

    The issue isn’t just that the models get confused; it’s that they get confused at the wrong level.
    In comments provided to VentureBeat, Yanick Schimpf, a co-author of the paper, notes that in a 20-step task, an agent can get lost in the minute details of a single step, or it can lose track of the overall goal. “We argue that when facing a problem with some abstract structure… [goal-oriented exploration] is what you want,” Schimpf said. By solving the problem at the abstract level first, the agent commits to a path, ensuring it doesn’t “get lost in one of the reasoning steps” and fail to complete the broader workflow.

    To address this, the field has long looked toward hierarchical reinforcement learning (HRL). HRL attempts to solve complex problems by decomposing them into a hierarchy of temporally abstract actions (high-level subroutines that represent different stages of the solution) rather than managing a task as a string of tokens. However, discovering appropriate subroutines remains a longstanding challenge. Current HRL methods often fail to discover proper policies, frequently “converging to degenerate options” that do not represent meaningful behaviors. Even sophisticated modern methods like GRPO (a popular RL algorithm used for sparse-reward tasks) fail in complex environments because they cannot effectively bridge the gap between low-level execution and high-level planning.

    Steering the LLM’s internal thoughts

    To overcome these limitations, the Google team proposed internal RL. Advanced autoregressive models already “know” how to perform complex, multi-step tasks internally, even if they aren’t explicitly trained to do so. Because these complex behaviors are hidden inside the model’s residual stream (i.e., the numerical values that carry information through the network’s layers), the researchers introduced an “internal neural network controller,” or metacontroller.
    Instead of monitoring and changing the output tokens, the metacontroller controls the model’s behavior by applying changes to the model’s internal activations in the middle layers. This nudge steers the model into a specific, useful state. The base model then automatically generates the sequence of individual steps needed to achieve that goal, because it has already seen those patterns during its initial pretraining.

    The metacontroller operates through unsupervised learning and does not require human-labeled training examples. Instead, the researchers use a self-supervised framework where the model analyzes a full sequence of behavior and works backward to infer the hidden, high-level intent that best explains the actions. During the internal RL phase, the updates are applied to the metacontroller, which shifts training from next-token prediction to learning high-level actions that can lead to the solution.

    To understand the practical value of this, consider an enterprise agent tasked with code generation. Today, there is a difficult trade-off: You need “low temperature” (predictability) to get the syntax right, but “high temperature” (creativity) to solve the logic puzzle. “Internal RL might facilitate this by allowing the model to explore the space of abstract actions, i.e. structuring logic and method calls, while delegating the token-level realization of those actions to the robust, lower-temperature distribution of the base model,” Schimpf said. The agent explores the solution space without breaking the syntax.

    The researchers investigated two methods for applying this controller. In the first, the base autoregressive model is pretrained on a behavioral dataset and then frozen, while the metacontroller is trained to steer the frozen model’s residual stream. In the second, the metacontroller and the base model are jointly optimized, with parameters of both networks updated simultaneously.
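    The frozen-model setup can be illustrated with a toy sketch: a tiny fixed network whose behavior is shifted by optimizing only a steering vector added to its middle-layer activations. This is a deliberately simplified stand-in; the paper’s metacontroller is a neural network trained with RL on a real model’s residual stream, and everything below (the two-layer network, the squared-error “goal,” plain gradient descent) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

# A tiny, frozen two-layer "base model": its weights never change below.
W1 = 0.5 * rng.normal(size=(8, 16))
W2 = 0.25 * rng.normal(size=(16, 4))

def base_model(x, steering=None):
    """Forward pass. Any intervention happens only on the middle-layer
    activations, a stand-in for steering the residual stream."""
    h = np.tanh(x @ W1)          # middle-layer activations
    if steering is not None:
        h = h + steering         # internal nudge, not an output edit
    return h @ W2

x = rng.normal(size=(1, 8))                 # a fixed input
goal = np.array([[1.0, 0.0, 0.0, 0.0]])    # desired high-level outcome

# Optimize only the steering vector; the base model stays frozen
# (mirroring the configuration the researchers found worked best).
steer = np.zeros((1, 16))
lr = 0.05
for _ in range(500):
    err = base_model(x, steer) - goal
    steer -= lr * (2.0 * err @ W2.T)        # gradient w.r.t. the steering vector only

err_before = float(np.sum((base_model(x) - goal) ** 2))
err_after = float(np.sum((base_model(x, steer) - goal) ** 2))
print(err_before, err_after)  # steering moves the frozen model toward the goal
```

    The point of the sketch is the division of labor: the frozen network supplies the low-level behavior, while the optimized steering vector selects which high-level state to aim for, so learning happens in a far smaller space than the network’s full parameter count.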
    Internal RL in action

    To evaluate the effectiveness of internal RL, the researchers ran experiments across hierarchical environments designed to stump traditional learners. These included a discrete grid world and a continuous control task where a quadrupedal “ant” robot must coordinate joint movements. Both environments used sparse rewards with very long action sequences.

    While baselines like GRPO and CompILE failed to learn the tasks within a million episodes due to the difficulty of credit assignment over long horizons, internal RL achieved high success rates with a small number of training episodes. By choosing high-level goals rather than tiny steps, the metacontroller drastically reduced the search space. This allowed the model to identify which high-level decisions led to success, making credit assignment efficient enough to solve the sparse reward problem.

    Notably, the researchers found that the “frozen” approach was superior. When the base model and metacontroller were co-trained from scratch, the system failed to develop meaningful abstractions. However, applied to a frozen model, the metacontroller successfully discovered key checkpoints without any human labels, perfectly aligning its internal switching mechanism with the ground-truth moments when an agent finished one subgoal and started the next.

    As the industry currently fixates on reasoning models that output verbose “chains of thought” to solve problems, Google’s research points toward a different, perhaps more efficient future. “Our study joins a growing body of work suggesting that ‘internal reasoning’ is not only feasible but potentially more efficient than token-based approaches,” Schimpf said.
    “Moreover, these silent ‘thoughts’ can be decoupled from specific input modalities — a property that could be particularly relevant for the future of multi-modal AI.”

    If internal reasoning can be guided without being externalized, the future of AI agents may hinge less on prompting strategies and more on how well we can access and steer what models already represent internally. For enterprises betting on autonomous systems that must plan, adapt, and act over long horizons, that shift could matter more than any new reasoning benchmark.

AWS News Blog: Announcements, Updates, and Launches

    Feed has no items.