Thread Easy

Your all-in-one companion for Twitter threads

Explore

Newest first — browse tweet threads

everyone is much more nervous about capitalizing on my recent stroke of luck than i am and i think it’s funny… i have gotten lucky before and in the end all things that happen fast come down equally fast. i am now enjoying it, and have a plan, and not worrying all day.

curious guy creating things @ https://t.co/HXWladhJaA - up and coming wife guy

jack friks
Mon Nov 03 03:10:51
RT @aidanshandle: I built an art installation for a halloween party this year!

Meet Aura Leo: a 1920s lion sculpture reborn for Halloween.…

investing @a16z // curating https://t.co/ssslqn6eo7

Ryan McEntush
Mon Nov 03 03:09:42
[On using Continuous Latent Space Vectors in the context windows of Transformers and LLMs] #SundayHarangue

There is a lot of chatter about how vectors from continuous latent space can make transformers solve problems efficiently. Some of these arguments run counter to conservation of computational complexity, IMHO. 

The arguments/analogies revolve around viewing these tokens as "superposition" (think union) of discrete tokens.

As background, transformers operate in a latent space L s.t. every (linguistic) token corresponds to a vector in L. This mapping is, however, one-sided: not every vector in L corresponds to a unique token.

You could, however, view these vectors (the ones without a unique token mapping) as linear combinations of token-corresponding vectors. In this way, they can be seen as a union/superposition of those tokens.
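
A minimal NumPy sketch of this point (vocab size, dimensions, and token indices are illustrative): a linear combination of two token embeddings is a perfectly valid point in the latent space, yet no single token maps back to it exactly, and nearest-token decoding simply collapses the mixture.

import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 64                      # hypothetical vocab size and embedding dim
E = rng.normal(size=(V, d))          # rows of E are the "token vectors"

# "Superposition" of tokens 3 and 7: a linear combination of their rows.
latent = 0.5 * E[3] + 0.5 * E[7]

# Decoding by nearest token vector collapses the mixture to one token;
# the mixture itself is not in the image of the token -> vector map.
nearest = int(np.argmin(np.linalg.norm(E - latent, axis=1)))
print(nearest in (3, 7))             # picks one constituent token
print(np.allclose(E[nearest], latent))  # False: no token maps exactly to `latent`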

It should be rather obvious that the operations of the transformer see entities in the context window as just vectors from the embedding space. In particular, the forward pass operation doesn't really care whether the vectors being processed have unique tokens corresponding to them or not.

This means that, as far as the transformer operation is concerned, the context window can have both "token vectors" (i.e., embedding vectors that correspond to unique tokens) and "latent vectors" (i.e., embedding vectors that don't correspond to unique tokens). As mentioned above, these latent vectors can be seen as linear combinations of the token vectors.
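
To make the "the forward pass only sees vectors" observation concrete, here is a hedged sketch of a single self-attention layer (random weights, hypothetical shapes): the computation is identical whether a position in the context window holds a token vector or an arbitrary latent vector.

import numpy as np

rng = np.random.default_rng(1)
V, d = 1000, 64
E = rng.normal(size=(V, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

def attention(X):
    """Single-head self-attention over a (T, d) sequence of vectors."""
    Q, K, Vv = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ Vv

token_vecs = E[[3, 7, 42]]                 # positions backed by actual tokens
latent_vecs = rng.normal(size=(2, d))      # positions with no token preimage
X = np.vstack([token_vecs, latent_vecs])   # mixed (T, d) context window
print(attention(X).shape)                  # (5, 64): same computation either way

The same indifference holds for any layer built from matrix products and pointwise nonlinearities, i.e., the whole forward pass.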

One obvious use of this flexibility is that the intermediate tokens emitted by the transformer can well be these latent vectors; only the solution tokens (the ones being passed on to the end users) need to be token vectors. Indeed, as we argue in https://t.co/f6E3c2j4dm (https://t.co/t4uYw5WTmD), as long as intermediate tokens don't seem to have any end-user semantics anyway, allowing them to be any vector from latent space provides significantly more flexibility for learning appropriate prompt augmentations (c.f. https://t.co/jl0LyWJUys).
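
One way to picture this flexibility (a toy sketch under stated assumptions, not the construction in the linked papers): let the "reasoning" steps append raw output vectors back into the context window, and snap only the final answer to an actual token. The function step() below is a stand-in for one forward pass, not any specific model's API.

import numpy as np

rng = np.random.default_rng(2)
V, d = 1000, 64
E = rng.normal(size=(V, d))
W = rng.normal(size=(d, d)) / np.sqrt(d)

def step(context):
    """Stand-in for a forward pass: map a (T, d) context to the next (d,) vector."""
    return np.tanh(context.mean(axis=0) @ W)

context = E[[3, 7]]                               # prompt given as token vectors
for _ in range(4):                                # latent intermediate "tokens"
    context = np.vstack([context, step(context)]) # append the raw vector, no decoding

answer_vec = step(context)
answer_token = int(np.argmax(E @ answer_vec))     # only the final output is snapped to a token
print(answer_token)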

Another argument that has been made about the use of latent vectors in the intermediate tokens is as a way to "improve efficiency of solving the underlying problems." 

Now, I am pretty skeptical about viewing LLMs as solving problems. Our work shows, for example, that there is little connection between the length of the intermediate tokens and the underlying complexity of the problem (c.f. https://t.co/UKgCwgHKeQ), suggesting that it is more indicative of attempts to bridge the training distribution and the test instance. 

Nevertheless, if we are into looking at transformers as ways of "computing solutions" (even if that is not what is actually happening in pre-trained LLMs), then letting transformers operate on latent vectors vs. token vectors seems to correspond to doing computation on disjunctive representations of entities rather than on single entities. 

Now, operating on disjunctive representations can improve average-case efficiency over specific distributions, but not the worst-case complexity. As a sanity test, abstraction and hierarchy can be viewed as operating on disjunctive representations, and neither changes the worst-case computational complexity of the problem; see https://t.co/aXreC5YKPN or https://t.co/UDzu2Qp7WK for arguments on planning.

This is why I am skeptical of claims that transformers with latent tokens can provably increase efficiency in all cases. For example, a recent paper https://t.co/4oQzEUIFPk argues that transformers with latent tokens can solve graph reachability in time proportional to the diameter of the graph (and throws in some citations to quantum superposition to boot!). This doesn't make sense--certainly not in the worst case--without violating conservation of complexity (or changing what it means to "solve" reachability; the paper's empirical results seem to be happy with less than 100% accuracy, for example).
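
One way to read the "superposition over nodes" intuition, and why it doesn't beat the worst case (my own sketch, not the cited paper's construction): keep the entire reachable set as one boolean vector and expand it by one adjacency step per iteration. The loop terminates within roughly diameter-many steps, but each step does work proportional to the whole graph, so the total work is not reduced below the usual reachability cost.

import numpy as np

def reachable(adj, source):
    """Boolean reachability from `source` via repeated frontier expansion."""
    n = adj.shape[0]
    reached = np.zeros(n, dtype=bool)
    reached[source] = True
    while True:
        # One "superposed" step over all nodes at once: O(n^2) work per iteration.
        expanded = reached | ((adj.T @ reached.astype(int)) > 0)
        if np.array_equal(expanded, reached):   # fixed point within ~diameter steps
            return reached
        reached = expanded

adj = np.zeros((5, 5), dtype=int)
for u, v in [(0, 1), (1, 2), (2, 3)]:           # path 0 -> 1 -> 2 -> 3; node 4 isolated
    adj[u, v] = 1
print(reachable(adj, 0))                        # [ True  True  True  True False]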

When we were discussing this paper in our group meeting on Friday, I told my students about the analogy with the Graphplan planning algorithm--which speeds up STRIPS planning (which is closely connected to reachability). Many years back, we showed that Graphplan's speedups can be understood in terms of doing projection over sets of states rather than individual states. However, if you operate directly over union representations, you can get to a point where the representation might look like it is reaching the goal state, but it may not be possible to actually extract a valid path! (In the case of Graphplan, this extraction involves a decoding step that is exponential in cost, and if it fails, the projection over disjunctive states continues.) This is illustrated in the figure below 👇 and in the original paper at https://t.co/s20cFEOfQk (or Figure 3 and the accompanying discussion in https://t.co/YqN0fh7vp6).
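
A toy illustration of that failure mode (deliberately much simpler than Graphplan itself; no mutex propagation, all names made up): a union projection over atoms says both goal atoms are reachable after one step, yet the extraction/decoding phase finds that no actual action sequence achieves them together.

from itertools import permutations

init = {"coin"}                                   # one coin to spend
actions = {                                       # name: (preconditions, adds, deletes)
    "buy_apple":  ({"coin"}, {"apple"},  {"coin"}),
    "buy_banana": ({"coin"}, {"banana"}, {"coin"}),
}
goal = {"apple", "banana"}

# Union projection: collect everything any applicable action could add.
union_layer = set(init)
for pre, add, _ in actions.values():
    if pre <= init:
        union_layer |= add
print(goal <= union_layer)                        # True: goal *looks* reachable

# Extraction: simulate every actual action sequence; none reaches the goal.
def run(seq, state):
    for name in seq:
        pre, add, dele = actions[name]
        if not pre <= state:
            return None
        state = (state - dele) | add
    return state

plans = [seq for n in (1, 2) for seq in permutations(actions, n)
         if (s := run(seq, set(init))) is not None and goal <= s]
print(plans)                                      # []: no valid plan can be extracted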

tldr: I do believe that latent tokens can considerably increase the flexibility of the prompt augmentations that LLMs can learn in post-training, but I don't quite agree with the claim that they "reduce the complexity of the problems under consideration".

AI researcher & teacher @SCAI_ASU. Former President of @RealAAAI; Chair of @AAAS Sec T. Here to tweach #AI. YouTube Ch: https://t.co/4beUPOmf6y Bsky: rao2z

Subbarao Kambhampati (కంభంపాటి సుబ్బారావు)
Mon Nov 03 03:09:19
Spent Friday and Saturday at RTE 2025 talking with a lot of people about

conversational AI, voice AI, voice AI hardware,
and the present and future of voice AI interaction.

Seasoned engineers, product managers, investors,
people still watching from the sidelines, and plenty of complete beginners.

Learned a lot; I'll be posting some observations and thoughts over the course of this week.

Code / design / ops @TenFramework, an open-source voice AI framework | sharing AI daily | English: @elliotchen200

艾略特
Mon Nov 03 03:08:57
RT @AIEMiami: The world's leading AI Engineering conference is coming to Miami!

AI Engineer: Miami, April 20–21.

Two days. One track. A c…

achieve ambition with intentionality, intensity, & integrity - @dxtipshq - @sveltesociety - @aidotengineer - @latentspacepod - @cognition + @smol_ai

swyx
Mon Nov 03 03:08:03
RT @doodlestein: @growing_daniel One Flew Over the Cuckoo’s Nest and the response to it had absolutely disastrous consequences for American…

Former Quant Investor, now building @lumera (formerly called Pastel Network) | My Open Source Projects: https://t.co/9qbOCDlaqM

Jeffrey Emanuel
Mon Nov 03 03:07:15