Validating agentic behavior when “correct” isn’t deterministic
… I am a PhD student at UW focused on improving the reliability and maintainability of LLM agents, using best practices from traditional software engineering. …
… And they use many tools for this, the so-called engineering stack. …
… We try to normalize this by tracking LLM API call counts alongside token counts: a constant number of LLM turns per run, combined with falling tokens per call, indicates a genuine efficiency improvement. …
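As a rough illustration of that normalization, the sketch below computes calls-per-run and tokens-per-call from per-run logs. It flags an improvement only when the first stays roughly flat and the second drops. Presumably the aim is to avoid crediting the agent with "efficiency" when total tokens fall only because it made fewer calls. `RunLog`, `summarize`, `is_genuine_efficiency_gain`, and the 5% tolerance are hypothetical choices for this example, not the speaker's actual tooling.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunLog:
    # Hypothetical per-run record for one agent run.
    llm_calls: int     # number of LLM API calls (turns) in the run
    total_tokens: int  # tokens consumed across all of those calls


def summarize(runs):
    """Return (mean LLM calls per run, tokens per call) for a batch of runs."""
    calls_per_run = mean(r.llm_calls for r in runs)
    tokens_per_call = sum(r.total_tokens for r in runs) / sum(r.llm_calls for r in runs)
    return calls_per_run, tokens_per_call


def is_genuine_efficiency_gain(before, after, calls_tolerance=0.05):
    """Heuristic: turns per run roughly constant AND tokens per call falling.

    calls_tolerance is an assumed relative band (5% here) within which
    calls-per-run counts as "constant".
    """
    calls_before, tpc_before = summarize(before)
    calls_after, tpc_after = summarize(after)
    calls_stable = abs(calls_after - calls_before) <= calls_tolerance * calls_before
    return calls_stable and tpc_after < tpc_before


# Illustrative (made-up) numbers, not measured results.
baseline = [RunLog(10, 20_000), RunLog(12, 24_000)]
candidate = [RunLog(11, 16_500), RunLog(11, 16_500)]    # same turns, leaner calls
fewer_calls = [RunLog(6, 12_000), RunLog(5, 10_000)]    # fewer tokens only because fewer turns
```

Here `candidate` passes the check. `fewer_calls` also spends fewer total tokens than `baseline`, but it fails because the drop comes from the agent taking fewer turns, not from cheaper ones.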
… That said, LLMs are much more helpful if you already have a clear idea in mind. …
Kevin: Do you have any advice for folks who are starting out in software engineering or research? …