DeepSeek released its V4 model family this week under a permissive open weights license, publishing full model weights, training data details, and a technical report that walks through the architecture and training pipeline. The flagship model, DeepSeek V4 Max, posted scores on MMLU, GPQA Diamond, SWE Bench Verified, and the newer LiveCodeBench benchmarks that land within a few points of GPT-5.1 and Claude Opus 4.5 on most tests. That is the narrowest gap any open model has achieved to date, and it has forced the AI policy conversation to get real about what open weights frontier models mean.

The technical story is worth understanding because it is not just another incremental release. V4 is a mixture of experts architecture with 680 billion total parameters and roughly 38 billion active per forward pass. The training run reportedly used a mix of Nvidia H100s and domestically produced accelerators, and the efficiency claims in the paper suggest the full training cost landed under 15 million dollars. That number is smaller than what US labs have reported for comparable training runs, and the independent replication work people are doing this week will matter because self reported costs have been wrong before.
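To make the total versus active distinction concrete, here is a minimal sketch of top k expert routing, the mechanism that lets a model carry a huge parameter count while touching only a fraction of it per token. The dimensions are toy values and the routing loop is deliberately naive; this is not V4's actual layer, and production implementations add load balancing losses, capacity limits, and fused expert kernels.

```python
# Minimal top-k mixture of experts sketch. Toy dimensions, not V4's
# configuration; real MoE layers add load balancing and fused kernels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=64, k=4):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                           # x: (n_tokens, d_model)
        scores = self.router(x)                     # (n_tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)  # each token picks k experts
        weights = F.softmax(weights, dim=-1)        # normalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e            # tokens routed to expert e
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

moe = TopKMoE()
total = sum(p.numel() for p in moe.parameters())
# Active per token is roughly k experts' worth of weights, ignoring the router.
active = moe.k * sum(p.numel() for p in moe.experts[0].parameters())
print(f"total ~{total / 1e6:.0f}M params, active per token ~{active / 1e6:.0f}M")
```

The same ratio, scaled up, is how you get 680 billion parameters on disk with roughly 38 billion doing work on any one token. The catch is that all 680 billion still have to sit in memory at inference time, which shapes the hardware requirements discussed below.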

The context window is one million tokens on the flagship variant, with meaningful performance retention across that length, which has been the weak spot of most long context claims over the last year. The tool use and function calling capabilities are stronger than V3's and benchmark within range of the closed labs' models on common agent workflows. Coding is where the model is particularly strong, with SWE Bench scores that will be hard to ignore if they replicate in the field.
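If you want to sanity check the tool use claims yourself, the hosted endpoints generally speak the OpenAI compatible chat completions format, so a function calling request looks something like the sketch below. The base URL, model name, and tool definition are placeholders I invented for illustration; your provider's docs have the real values.

```python
# Hedged sketch of a tool calling request against an OpenAI-compatible
# endpoint of the kind hosted providers expose. base_url, model name, and
# the tool itself are placeholders, not confirmed values.
from openai import OpenAI

client = OpenAI(base_url="https://api.example-provider.com/v1", api_key="...")

tools = [{
    "type": "function",
    "function": {
        "name": "get_ticket_status",  # hypothetical tool for illustration
        "description": "Look up the status of a support ticket by id.",
        "parameters": {
            "type": "object",
            "properties": {"ticket_id": {"type": "string"}},
            "required": ["ticket_id"],
        },
    },
}]

resp = client.chat.completions.create(
    model="deepseek-v4-max",  # placeholder model identifier
    messages=[{"role": "user", "content": "What is the status of ticket 8841?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```

What you are checking is whether the model emits a well formed tool call with the right arguments, not whether it can recite the schema back at you.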

The license is the headline for developers. The DeepSeek V4 license allows commercial use, fine tuning, distillation, and redistribution with modest attribution requirements and no commercial revenue thresholds. That is more permissive than Llama 3's community license and roughly comparable to the Apache style licenses that made earlier open models usable in business contexts. Within 24 hours of release, the weights were available on Hugging Face, running on Together AI and Fireworks, and being integrated into local inference frameworks. The ecosystem picked it up faster than any release I can remember.
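Pulling the weights yourself is the standard Hugging Face workflow. A sketch, with the caveats that the repo id is my guess at the naming convention rather than a confirmed path, and that the full model wants multi GPU hardware or aggressive quantization, since all 680 billion parameters have to be resident even though only 38 billion are active per token:

```python
# Hedged local inference sketch via Hugging Face transformers. The repo id
# is a guess at the naming convention; check the actual model card. The
# full MoE needs its entire parameter count in memory, so device_map="auto"
# shards it across whatever accelerators are visible.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V4"  # hypothetical repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto", trust_remote_code=True
)

inputs = tok("Merge two sorted lists in Python.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tok.decode(out[0], skip_special_tokens=True))
```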

The implications for the closed labs are significant even if the closed models remain ahead on a few dimensions. The value proposition of paying a per token API fee to a frontier lab becomes harder to defend when a comparably capable model is running locally or on commodity cloud infrastructure at a fraction of the cost. Companies that were building products on GPT-5 or Claude Opus APIs are already running cost analyses with DeepSeek V4 in the loop, and the numbers are landing in DeepSeek's favor for high volume use cases. That does not mean the closed labs disappear. It means their pricing power compresses.
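The arithmetic behind those cost analyses is not complicated. Here is the back of the envelope version; every number below is a placeholder assumption rather than a quoted price, so substitute your own rates and volume:

```python
# Illustrative break-even math only. All figures are made-up assumptions,
# not quoted prices; replace them with your actual rates.
api_cost_per_mtok = 10.00         # assumed blended $/1M tokens, closed API
hosted_cost_per_mtok = 1.50       # assumed $/1M tokens, open weights hosting
tokens_per_month = 5_000_000_000  # assumed volume: 5B tokens/month

api_monthly = api_cost_per_mtok * tokens_per_month / 1_000_000
hosted_monthly = hosted_cost_per_mtok * tokens_per_month / 1_000_000
print(f"API: ${api_monthly:,.0f}/mo, hosted: ${hosted_monthly:,.0f}/mo, "
      f"delta: ${api_monthly - hosted_monthly:,.0f}/mo")
```

The gap scales linearly with volume, which is why the high volume use cases flip first.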

The policy angle is messier. A Chinese lab releasing frontier capability under an open license cuts against the framing some in Washington have used to argue for export controls and compute restrictions. The theory of the case was that controlling the supply of advanced chips would control access to cutting edge AI. V4's reported training run used chips that are not supposed to be the frontier and still produced a model that benchmarks near the top. Whether the reported training details are accurate is an open question. Whether the capability is real is less open. You can download the weights and test them yourself.

The safety conversation is going to get louder. Closed labs have built elaborate release processes around capability evaluations, red teaming, refusal training, and rate limiting. An open weights model does not have any of those controls once it is in the wild. Fine tuning can undo most of the built in refusals in hours. That is not new. It has been true of every open model for two years. What is new is that the capability level of the underlying model is now high enough that the misuse scenarios people have been warning about are more plausible than they were a year ago.

For developers, the practical advice is to treat this as a real option and run your own evals. Benchmark scores are useful but they do not capture the messy reality of specific workflows. Pull the weights, run them against your existing test cases, and see how the latency and cost look compared to whatever you are paying for today. If you are building agents, the tool use quality matters more than the headline MMLU number.
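A minimal harness for that is a few dozen lines. This sketch replays your test cases against a candidate endpoint and records pass rate and latency; the endpoint, model name, cases, and assertion logic are placeholders standing in for your real workflow:

```python
# Minimal eval harness sketch. Endpoint, model name, cases, and the check()
# logic are all placeholders; wire in your real test suite and assertions.
import time
from openai import OpenAI

client = OpenAI(base_url="https://api.example-provider.com/v1", api_key="...")

cases = [
    {"prompt": "Rewrite this loop as a list comprehension: ...", "expect": "["},
    # ... your existing test cases
]

def check(output: str, expect: str) -> bool:
    return expect in output  # stand-in for your real assertion logic

results = []
for case in cases:
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="deepseek-v4-max",  # placeholder model identifier
        messages=[{"role": "user", "content": case["prompt"]}],
    )
    latency = time.perf_counter() - start
    results.append((check(resp.choices[0].message.content, case["expect"]), latency))

passed = sum(ok for ok, _ in results)
print(f"{passed}/{len(results)} passed, "
      f"mean latency {sum(t for _, t in results) / len(results):.2f}s")
```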

The frontier is not one company anymore. That was already true, but this week it is harder to deny.