ebaths | Aug. 10th, 2025

About Vending-Bench

Recently, I read an AI/NLP paper by Andon Labs called “Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents.” Here’s the press release version, and here’s the actual paper itself. The idea is that we are hearing a lot about autonomous AI agents lately, so let’s create a benchmark that lets us compare how different agents do on a simple autonomous task. In this case, it’s running a vending machine, which includes finding good products, ordering those products, stocking and restocking the actual machine, and trying to make as much money as possible. On their press release page, you can try playing the game yourself for a couple of turns; it’s basically a more realistic version of those old browser games where you have to run an ice cream truck or pizza parlor.

The paper is definitely worth the time it takes to read the whole thing. The basic result is that the agents are generally coherent and able to keep the business running, but when failure occurs, it is catastrophic. (This is probably not surprising.)

Fun Failures

( interesting examples of catastrophic failures )

Results, Memory, Context

( the results of the paper, paraphrased )

The Medium Article and Consciousness

( an off-topic tirade about whether chatbots are alive )

Conclusions

I think I went a bit off-topic here, haha. If you’re interested in the Vending-Bench concept, Andon Labs also did an experiment with Anthropic where they had Anthropic’s AI, Claude, run a real vending machine inside Anthropic’s offices. It’s a good read, and covers many of the same topics as the original paper but in more of a blog post format.

I don’t think we’re close to having these tools actually do jobs just as well as humans do. The fear I have, instead, is that they will be able to replace a person in an office job up to a certain point, and the people in charge will see that, say “good enough”, and let the LLMs have control that they shouldn’t.

長々と剥いた

ｌｏｎｇｌｙ　ｐｅｅｌｅｄ

AI Can Barely Play Ice Cream Truck Simulator
