📝 Notes
### How They Work & Why Performance Varies
* While the basic concept is an LLM in a loop with tools (read/write files, execute commands), the details are crucial (see the sketch after this list).
* The best-performing agents use tools that the **foundation model was specifically trained on**. A mismatch between the agent's tools and the model's training (e.g., using a generic agent with a model like GPT-5) leads to worse performance.
* Agent quality differs significantly even with the same underlying model. Key differentiators include:
* **Safety checks**: Some agents use a second, faster LLM as a "judge" to prevent harmful commands.
* **Error recovery**: Many agents get stuck in loops or give up after a failed command, while more battle-tested ones recover and keep making progress.
* **Tool implementation**: The way an agent executes code and handles processes can vary greatly in quality.
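A minimal sketch of that loop, for illustration only: the model client, the judge, the tool names, and the message format here are hypothetical stand-ins, not any particular agent's implementation. It shows the core pieces the notes mention: tools, a judge check before running commands, feeding errors back to the model, and a hard turn limit.

```python
import subprocess
from pathlib import Path

# Hypothetical stand-ins for whatever LLM API the agent is built on.
def call_model(messages: list[dict]) -> dict:
    """Main coding model: returns {'tool': name, 'args': {...}} or {'answer': text}."""
    raise NotImplementedError  # wire up a real LLM client here

def call_judge(command: str) -> bool:
    """Second, faster model that vets shell commands before execution."""
    raise NotImplementedError  # e.g. a small, cheap model with a safety prompt

TOOLS = {
    "read_file": lambda args: Path(args["path"]).read_text(),
    "write_file": lambda args: f"wrote {Path(args['path']).write_text(args['content'])} chars",
    "run_command": lambda args: subprocess.run(
        args["cmd"], shell=True, capture_output=True, text=True, timeout=120
    ).stdout,
}

def run_agent(task: str, max_turns: int = 30) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):                      # hard cap so the agent can't loop forever
        step = call_model(messages)
        if "answer" in step:                        # model decided it is done
            return step["answer"]
        name, args = step["tool"], step["args"]
        if name == "run_command" and not call_judge(args["cmd"]):
            result = "blocked: judge flagged this command as unsafe"
        else:
            try:
                result = TOOLS[name](args)
            except Exception as exc:                # feed failures back so the model can recover
                result = f"error: {exc}"
        messages.append({"role": "tool", "name": name, "content": str(result)})
    return "gave up: turn limit reached"
```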
### The Challenge of Evaluation
* **Cost is deceptive**. A model with a cheaper per-token price (like GPT-5) can be more expensive overall if the agent needs more tokens and more interactions to solve a problem than a more efficient one (like Claude Code); see the back-of-the-envelope example below.
* Similarly, faster inference speed doesn't matter if the model's output quality is lower.
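A rough illustration of why per-token price misleads, using made-up prices and token counts rather than any real model's numbers: the model that is cheaper per token still costs more per solved task because it burns through far more tokens and turns.

```python
# Illustrative numbers only; real prices and per-task token counts vary widely.
def task_cost(tokens_in: int, tokens_out: int, price_in: float, price_out: float) -> float:
    """Dollar cost of one task, given per-million-token input/output prices."""
    return tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out

# Hypothetical model A: cheaper per token, but needs many more turns and tokens.
cost_a = task_cost(2_000_000, 400_000, price_in=1.25, price_out=10.0)
# Hypothetical model B: pricier per token, but finishes in fewer, tighter turns.
cost_b = task_cost(600_000, 120_000, price_in=3.00, price_out=15.0)

print(f"cheaper-per-token model: ${cost_a:.2f}")  # $6.50
print(f"more efficient model:    ${cost_b:.2f}")  # $3.60
```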
### Models & Pricing
* Open-weight models are improving but are not yet as reliable. **Self-hosting them is currently more expensive** and more technically challenging than using commercial APIs.
* **Costs will go up** as VC subsidies wane and users tackle more complex problems that require more tokens.