# Expensively Quadratic: the LLM Agent Cost Curve

Pop quiz: at what context length do a coding agent's cache reads start costing you half of the next API call? By 50,000 tokens, your conversation's costs are probably being dominated by cache reads.

Let's take a step back. We've previously written about how coding agents work: they post the conversation so far to the LLM, and keep doing that in a loop as long as the LLM is requesting tool calls. When there are no more tools to run, the loop waits for user input, and the whole cycle starts over. Visually:

[Figure: the agentic loop]

Or, in code form:

```python
def loop(llm):
    msg = user_input()
    while True:
        output, tool_calls = llm(msg)
        print("Agent: ", output)
        if tool_calls:
            # Run the requested tools and feed the results back in.
            msg = [handle_tool_call(tc) for tc in tool_calls]
        else:
            msg = user_input()
```

The LLM providers charge for input tokens, cache writes, output tokens, and cache reads. It's a little tricky: you mark in your prompt how far to cache (usually to the end), and those tokens are billed as "cache write" rather than as plain input. The previous turn's output becomes the next turn's cache write. Visually:

[Figure: token costs across LLM calls]

Here, the colors and numbers indicate the costs making up the n-th call to the LLM. Every subsequent call reads the story so far from the cache, writes the previous call's output to the cache (as well as any new input), and gets an output. The area represents the cost, though in this diagram it's not quite drawn to scale. Add up all the rectangles, and that's the total cost. That triangle emerging for cache reads? That's the scary quadratic! (A rough back-of-the-envelope model of it is sketched at the end of this post.)

## How scary is the quadratic? Pretty squarey!

I took a rather ho-hum feature-implementation conversation and visualized it like the diagram above. The area corresponds to cost: the width of every rectangle is the number of tokens, and the height is the cost per token. As the conversation evolves, more and more of the cost is the long thin lines across the bottom that correspond to cache reads.

The whole conversation cost about $12.93 in total. You can see that as the conversation continues, the cache reads dominate. By the end of the conversation, cache reads are 87% of the total cost. They were half the cost at 27,500 tokens!

This conversation is just one example. Does this happen generally? exe.dev's LLM gateway keeps track of the costs we're incurring. We don't store the messages themselves as they pass through, but we do keep track of the number of tokens. The following graph shows the "cumulative cost" visualization for many Shelley conversations, not just my own. I sampled 250 conversations from the data at random. The x-axis is the context length, and the y-axis is the cumulative cost up to that point. The left graph shows all the costs and the right graph shows just the cache reads. You can mouse over to find a given conversation on both graphs. The box plots below show the distribution of input tokens and output tokens.

The cost curves are all different because every conversation is
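For the curious, here is the back-of-the-envelope model referenced above. It is a minimal sketch, not our gateway's actual billing code: the per-million-token prices and the 1,000-tokens-per-turn figure are illustrative assumptions, and it bills every new prompt token as a cache write (matching the charging scheme described earlier). The point is only to show the cache-read term growing quadratically with conversation length while everything else grows linearly.

```python
# Illustrative per-million-token prices; substitute your provider's real rates.
PRICE_PER_MTOK = {
    "input": 3.00,        # uncached input tokens (unused here: new tokens are cached)
    "cache_write": 3.75,  # tokens newly written to the prompt cache
    "cache_read": 0.30,   # tokens re-read from the cache on every call
    "output": 15.00,      # generated tokens
}

def conversation_cost(calls, tokens_per_turn=1_000):
    """Each call re-reads the whole prior context from cache and
    appends ~tokens_per_turn of new input and output to it."""
    context = 0
    totals = {k: 0.0 for k in PRICE_PER_MTOK}
    for _ in range(calls):
        totals["cache_read"] += context * PRICE_PER_MTOK["cache_read"] / 1e6
        totals["cache_write"] += tokens_per_turn * PRICE_PER_MTOK["cache_write"] / 1e6
        totals["output"] += tokens_per_turn * PRICE_PER_MTOK["output"] / 1e6
        context += 2 * tokens_per_turn  # new input + output join the cached context
    return totals

for calls in (10, 50, 100):
    t = conversation_cost(calls)
    share = t["cache_read"] / sum(t.values())
    print(f"{calls:>3} calls: total ${sum(t.values()):6.2f}, cache reads {share:.0%} of cost")
```

With these made-up numbers, cache reads are about 13% of the total after 10 calls and roughly 60% after 100, because the cache-read term is the sum 0 + 1 + 2 + … over the growing context while every other term is a flat per-call charge.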