Thread Easy

Explore

Newest first — browse tweet threads


in particular, detecting dependencies automatically (by parsing "imports") is important, since it lets you form a full dependency tree without calling an AI to find the files you need (which adds a lot of latency) and without literally gathering the whole repo (which isn't viable)
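
for illustration, a minimal TypeScript sketch of that idea, assuming plain ES-style relative imports and .ts files (the regex, paths and entry point here are made up, not what refactor.ts actually does):

import * as fs from "fs";
import * as path from "path";

// match relative imports like: import { x } from "./utils"
const IMPORT_RE = /import\s[^'"]*['"](\.{1,2}\/[^'"]+)['"]/g;

// collect the transitive dependency closure of one entry file,
// so every needed file can be sent as context without asking an AI
function dependencyClosure(entry: string, seen: Set<string> = new Set()): Set<string> {
  const file = path.resolve(entry);
  if (seen.has(file)) return seen;
  seen.add(file);
  const src = fs.readFileSync(file, "utf8");
  for (const m of src.matchAll(IMPORT_RE)) {
    const base = path.resolve(path.dirname(file), m[1]);
    const dep = base.endsWith(".ts") ? base : base + ".ts";
    if (fs.existsSync(dep)) dependencyClosure(dep, seen);
  }
  return seen;
}

// hypothetical entry point; everything in the closure becomes context
console.log([...dependencyClosure("src/main.ts")]);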

also, I really think block-based patching is the best possible format. search/replace is error-prone and complex (it eats a lot of the AI's mental space just to format the edits, so it loses some IQ points), and line numbers overwhelm the context with too many lines, though they could work too

the only problem with line/block-based edits is that they aren't robust against files being changed underneath them, but since this workflow is transactional, that doesn't happen. you hand control to the AI, it does its thing, you get control back, in a turn-based fashion
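
a hypothetical sketch of that turn-based transaction (the ask/apply functions are stand-ins, not the actual refactor.ts API):

import * as fs from "fs";

type Snapshot = { path: string; text: string };
type Patch = { id: number; content: string };

// one turn: snapshot the files, let the model answer, apply its patches.
// nothing else touches the files in between, so block ids stay valid.
async function refactorTurn(
  files: string[],
  ask: (snapshot: Snapshot[]) => Promise<Patch[]>, // the AI call, injected
  apply: (patch: Patch) => void                    // the patch applier, injected
): Promise<void> {
  const snapshot: Snapshot[] = files.map(f => ({ path: f, text: fs.readFileSync(f, "utf8") }));
  const patches = await ask(snapshot); // control goes to the AI here
  for (const p of patches) apply(p);   // control comes back; commit the edits
}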



Taelin · Wed Nov 19 14:12:17
a quick demo of this workflow

I really recommend it, way faster than CLI agents



Taelin · Wed Nov 19 14:05:33
→ split all files into sequences of non-empty lines
→ mark each block with an id
→ let the AI patch a block with:

<patch id=123>
new contents here
</patch>

→ to delete: do an empty patch
→ to split: include empty lines
→ to merge: just move contents
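
a toy sketch of that block patcher, assuming per-file ids and the <patch> syntax above (illustrative only, not the actual refactor.ts code):

// blocks are maximal runs of non-empty lines, identified by their position
function toBlocks(text: string): string[] {
  return text.split(/\n\s*\n/);
}

// render the file for the model, tagging each block with its id
function renderBlocks(blocks: string[]): string {
  return blocks.map((b, id) => "#" + id + ":\n" + b).join("\n\n");
}

// apply <patch id=N>...</patch> edits: replace block N with the new contents.
// an empty patch deletes the block; a patch containing blank lines splits it
// into several blocks on the next render; moving contents merges blocks.
function applyPatches(blocks: string[], reply: string): string {
  const re = /<patch id=(\d+)>\n?([\s\S]*?)<\/patch>/g;
  for (const m of reply.matchAll(re)) {
    blocks[Number(m[1])] = m[2].replace(/\n$/, "");
  }
  return blocks.filter(b => b.trim() !== "").join("\n\n") + "\n";
}

// example: block 1 rewritten, block 2 deleted
const file = "const a = 1\n\nconst b = 2\n\nconst c = 3\n";
const reply = "<patch id=1>\nconst b = 42\n</patch>\n<patch id=2>\n</patch>";
console.log(applyPatches(toBlocks(file), reply));
// prints "const a = 1", a blank line, then "const b = 42"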

why are we using search/replace?
that has to be a joke



Taelin · Wed Nov 19 12:39:21
I've now read Aider's diff patching overview and OpenAI's apply_patch manifesto, and I've decided you're all INSANE

I just changed refactor.ts to use block-based patching, and I hereby officially declare it God's chosen patching format

re-train your models



Taelin · Wed Nov 19 12:36:04
most benchmarks suck, but people also misinterpret them

HLE, for example, can easily be cheated / trained for, even unintentionally, because the questions are all over the internet, and the answers being private doesn't really matter: people WILL solve them and the information WILL spread. so a model scoring well on it almost always just means "the AI has seen the answer". I don't like this kind of fixed-question benchmark, and I think it becomes a non-signal as soon as it gets popular. or rather, all it measures is the extent to which the team failed to hide the answers from the model, so, more often than not, higher scores are a bad sign

on VPCT, all questions are at roughly the same difficulty level, so a model going from 10% to 90% doesn't imply it is super-human; just that it broke that specific threshold. even ARC-AGI suffers from this. that's also why a benchmark often stalls at a percentage; usually that means most questions are easy and a select few are super hard (or even wrong), so AIs just stop making progress at that point.

(not bad-mouthing Chase's work in any way; it is a nice idea and a good benchmark, but it is very hard to construct a flawless eval. perhaps a V2 with proper scaling would fix this specific flaw)

I avoid that on my vibe tests by having just a few personal questions in each "difficulty bracket". when AIs get smarter, I just make a harder question. that way, when a new model launches, all I have to do is give it my easiest questions, then a harder question, then a harder one, and so on. it becomes very easy to gauge the actual intelligence of the model. and since I have only a few questions, it is easy to create small variations on the spot if I suspect an AI has just seen the answer
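
a toy sketch of that bracketed approach (the names and structure are made up for illustration):

// ask progressively harder questions and report the highest bracket cleared
type Bracket = { level: number; ask: () => Promise<boolean> }; // true = solved

async function highestBracketCleared(brackets: Bracket[]): Promise<number> {
  let cleared = 0;
  for (const b of [...brackets].sort((x, y) => x.level - y.level)) {
    if (!(await b.ask())) break; // stop at the first bracket the model fails
    cleared = b.level;
  }
  return cleared;
}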

I wish I had time to make an eval



Taelin · Wed Nov 19 11:49:31