Harness design for long-running application development
…It is worth the cost when the task sits beyond what the current model does reliably solo. Alongside the structural simplification, I also added prompting to improve how the harness built AI…
Partnering with Mozilla to improve Firefox’s security
Policy · Frontier Red Team · Mar 6, 2026
AI models can now independently identify high-severity vulnerabilities in complex software. As we recently documented, Claude…
…first, we have reason for concern about well-resourced malicious actors attempting to gain uplift from our models for highly risky biological research. Second, models now have a greater ability to accomplish…
Building effective agents
Engineering at Anthropic
Over the past year, we've worked with dozens of teams building large language model (LLM) agents across industries. Consistently, the most successful implementations weren't…
…This model could work for other bulk sourcing! 🧅📋 That was until another staffer stepped in to tell the models that this would fall afoul of a 1958 quirk of US law…
…the concerns of recoverable context storage in the session and arbitrary context management in the harness because we can’t predict what specific context engineering will be required in future models. The…
…One particularly useful application would be to monitor models as they are updated. The sycophancy that emerged in OpenAI’s GPT-4o in April 2025 was a concerning behavioral change from a…
Introducing advanced tool use on the Claude Developer Platform
Engineering at Anthropic
The future of AI agents is one where models work seamlessly across hundreds or thousands of tools. An IDE assistant…
…However, these estimates reflect current model capabilities, and all signs suggest that reliability over increasingly long-running tasks will improve.
Tradeoffs in task acceleration
Our estimates suggest that, in general, the more…
…4 Why might actual usage fall short of theoretical capability? Some tasks that are theoretically possible may not show up in usage because of model limitations. Others may be slow to diffuse…