Thread Easy

Your all-in-one companion for Twitter threads


Explore


You can be happy until you are no more.

But what about the you that could have remained?

Immortality is about the happiness of selves increasingly removed from our present ones. The "me" 5 seconds from now, 5 years from now, 5 centuries from now deserves it as well.


We're in a race. It's not USA vs China but humans and AGIs vs ape power centralization. @deepseek_ai stan #1, 2023–Deep Time «C’est la guerre.» ®1

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
Mon Nov 10 11:49:25
Yishan shares his brilliant thesis: almost all startups outside the major labs are going to be a flash in the pan.

It is a solid perspective. The lesson: the only way for AI startups to build a long-term advantage is to have a data moat, rather than enriching the BIG FOUR (OAI, ANT, GDM, xAI).

Agent labs like @cursor_ai and @windsurf have taken that lesson to heart and are building models on implicit user feedback (as with the Tab models by Cursor), as well as on RL setups based on real-life problem data and golden answers. This advantage cannot be stolen by the big four.

Find out who @thinkymachines and @appliedcompute are, and read up on what @PrimeIntellect does.

@MohapatraHemant @balajis


AI @amazon. All views personal!

GDP
Mon Nov 10 11:47:58
Are LEDs constantly flashing so that my iPhone camera missed the 7 and other lights in that split second? 🤯
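
The flashing hypothesis in the tweet is plausible: many LED signs and fixtures are driven with pulse-width modulation (PWM), so a very short camera exposure can fall entirely inside an off phase. Below is a rough sketch of that sampling effect; all numbers are assumptions chosen for illustration, not measurements from the photo.

```python
import numpy as np

rng = np.random.default_rng(0)

# All numbers below are assumptions for illustration, not measurements.
pwm_freq_hz = 1000.0      # assumed PWM frequency of the LED driver
duty_cycle = 0.10         # assumed fraction of each cycle the LED is lit
exposure_s = 1 / 8000     # assumed short exposure time of the phone camera

period = 1.0 / pwm_freq_hz
on_time = duty_cycle * period

# Monte Carlo over random exposure start phases within one PWM cycle.
# The exposure overlaps the "on" interval [0, on_time) if it starts inside
# it, or if it wraps past the end of the cycle into the next "on" interval.
# (Assumes the exposure is shorter than one PWM period.)
starts = rng.uniform(0.0, period, size=100_000)
lit = (starts < on_time) | (starts + exposure_s > period)

print(f"fraction of frames in which this LED appears lit: {lit.mean():.2f}")
```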


Photographer & software engineer into publishing. Loves building w/ Nodejs, React, Ruby/Rails, Python - making shipping fun! DM for collabs. ❤️ @JiwonKwak6

Ronald
Mon Nov 10 11:47:21
2025 update: who invented Transformer neural networks (the T in ChatGPT)? Timeline of Transformer evolution in Technical Note IDSIA-11-25 (easy to find on the web):

★ 1991. Original tech report on what's now called the unnormalized linear Transformer (ULTRA)[FWP0][ULTRA]. KEY/VALUE was called FROM/TO. ULTRA uses outer product rules to associate its self-invented KEYs/VALUEs through fast weights [FAST][FWP], and applies the resulting context-dependent attention mappings to incoming queries. ULTRA's computational costs scale linearly in input size, that is, for 1,000 times more text we need 1,000 times more compute, which is acceptable. Like modern quadratic Transformers (see below), the 1991 ULTRA is highly parallelizable. It was a by-product of more general research on neural networks (NNs) that learn to program fast weight changes of other NNs [FWP,FWP0-9,FWPMETA1-10], back then called fast weight controllers [FWP0] or fast weight programmers (FWPs) [FWP]. ULTRA was presented as an alternative to recurrent NNs [FWP0]. The 1991 experiments were similar to today's: predict some effect, given a sequence of inputs [FWP0].
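
A minimal numerical sketch of the outer-product fast-weight principle described above (an illustration written for this note, not code from [FWP0]; the keys, values, and queries are random placeholders that a slow network would normally generate from the input):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 5               # feature dimension and sequence length

# Random placeholders; in a fast weight programmer a slow net would
# generate these keys, values, and queries from the input sequence.
keys = rng.normal(size=(T, d))
values = rng.normal(size=(T, d))
queries = rng.normal(size=(T, d))

W_fast = np.zeros((d, d))  # fast weight matrix, rewritten at every step
outputs = []
for t in range(T):
    # additive outer-product update: associate value_t with key_t
    W_fast += np.outer(values[t], keys[t])
    # context-dependent attention mapping applied to the incoming query
    outputs.append(W_fast @ queries[t])

print(np.stack(outputs).shape)  # (5, 8); per-step cost is O(d^2), linear in T
```

Per step, both the memory update and the query lookup cost O(d^2) regardless of how much context has already been stored, which is where the linear scaling in sequence length comes from.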

★ 1992. Journal publication on ULTRA [FWP1], based on the 1991 tech report. Note that the terminology was different back then.

★ 1993. Recurrent ULTRA extension [FWP2] introducing the terminology of learning "internal spotlights of attention."

★ 2014. End-to-end sequence-to-sequence models [S2Sa,b,c,d] became popular for Natural Language Processing. They were not based on the 1991 unnormalized linear Transformer [ULTRA] above, but on the Long Short-Term Memory (LSTM) recurrent NN from the same lab. In 2014, this approach was combined with an attention mechanism [ATT14] that isn't linearized like the 1991-93 attention [FWP0-2] but includes a nonlinear softmax operation. The first Large Language Models (LLMs) were based on such LSTM-attention systems. See additional work on attention from 2016-17 [ATT16a-17b].

★ 2017. Modern quadratic Transformer ("attention is all you need"), scaling quadratically in input size [TR1], that is, for 1,000 times more text we need 1,000,000 times more compute. Note that in 1991 [ULTRA], no journal would have accepted an NN that scales quadratically, but by 2017, compute was cheap enough to apply the quadratic Transformer (a kind of fast weight programmer [FWP]) to large amounts of data on massively parallel computers. The quadratic Transformer combines the 1991 additive outer product fast weight principle [FWP0-2] and softmax (see 2014 above): attention (query, KEY, VALUE) ~ softmax (query KEY) VALUE.
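
For contrast, a toy sketch of quadratic softmax attention in the sense of [TR1] (again an illustration for this note, with random placeholder matrices): every query is scored against every key, so the score matrix alone has T x T entries.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d = 5, 8
Q = rng.normal(size=(T, d))  # queries
K = rng.normal(size=(T, d))  # keys
V = rng.normal(size=(T, d))  # values

# attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V, as in [TR1]
scores = Q @ K.T / np.sqrt(d)    # T x T score matrix: quadratic in T
out = softmax(scores, axis=-1) @ V

print(out.shape)                 # (5, 8)
```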

★ 2020. New paper [TR5] using the terminology "linear Transformer" for a more efficient Transformer variant that scales linearly, leveraging linearized attention [TR5a].
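
A toy sketch of such linearized attention (an illustration for this note; the feature map phi(x) = elu(x) + 1 follows [TR5], everything else is a placeholder): the running sums act as a fast weight matrix plus a normalizer, so the state stays constant-size and the cost stays linear in sequence length.

```python
import numpy as np

def phi(x):
    # elu(x) + 1, the positive feature map used in [TR5]
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(0)
T, d = 5, 8
Q = rng.normal(size=(T, d))
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))

S = np.zeros((d, d))   # running sum of outer(phi(k_t), v_t): a fast weight matrix
z = np.zeros(d)        # running sum of phi(k_t), used as the normalizer
outputs = []
for t in range(T):
    S += np.outer(phi(K[t]), V[t])
    z += phi(K[t])
    q = phi(Q[t])
    outputs.append((q @ S) / (q @ z + 1e-9))  # normalized linear attention step

print(np.stack(outputs).shape)  # (5, 8); constant-size state, linear cost in T
```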

★ 2021. Paper [FWP6] pointing out that the unnormalised linear Transformer [TR5-6] is actually MATHEMATICALLY EQUIVALENT to the 1991 fast weight controller [FWP0][ULTRA] published when compute was a million times more expensive than in 2021. Overview of ULTRA and FWPs (2021) [FWP]. 

★ 2021-25. Work on extensions of ULTRAs and other FWPs (such as the DeltaNet [FWP6]) has become mainstream research, aiming to develop sequence models that are both efficient and powerful [TR6,TR6a][LT23-25][FWP23-25b].
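
A toy sketch of the delta-rule fast-weight update behind DeltaNet [FWP6] (an illustration for this note; DeltaNet generates a per-step write strength beta_t and the keys and values from the input, whereas here they are fixed placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 5
keys = rng.normal(size=(T, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # unit-norm keys
values = rng.normal(size=(T, d))
beta = 1.0        # fixed write strength here; DeltaNet learns a beta_t per step

W = np.zeros((d, d))
for t in range(T):
    v_hat = W @ keys[t]                                # current retrieval for key_t
    W += beta * np.outer(values[t] - v_hat, keys[t])   # error-correcting write

# With unit-norm keys and beta = 1, the most recent association is retrieved exactly.
print(np.allclose(W @ keys[-1], values[-1]))           # True
```

Compared with the purely additive 1991 update, subtracting the current retrieval lets the memory overwrite stale associations instead of only accumulating them.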

Of course, plain outer products in NNs go back at least to Konorski's informal 1948 rule [HEB48] (later sometimes called the "Hebb rule" [HEB49]) and concrete formal implementations through Steinbuch's Learning Matrix around 1960 [ST61-63][AMH1-2][KOH72][LIT74][PAL80]. See also bidirectional associative memories (1988) [KOS88]. However, these authors described pre-wired rules to associate user-given patterns with each other. Unlike ULTRA and other Transformers since 1991 [ULTRA][TR1], their NNs did not learn to use such rules for associating self-invented KEY/VALUE patterns, by backpropagating errors [BP4] THROUGH the rules, to generate appropriate KEYs/VALUEs at the right times and create useful changes of fast weights. (Neither did early NNs with fast weights by Malsburg (1981) and others [FAST][FASTa,b][DLP].)
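
To make the contrast concrete, here is a toy sketch of such a pre-wired outer-product associative memory (an illustration for this note, not any specific historical system): the pattern pairs are user-given and the storage rule is fixed in advance, so nothing about the keys or values is learned.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 16, 3

# Orthonormal, user-given key patterns (rows of an orthogonal matrix), so
# that retrieval with the fixed outer-product rule is exact.
Qmat, _ = np.linalg.qr(rng.normal(size=(d, d)))
K = Qmat[:n_pairs]
V = rng.normal(size=(n_pairs, d))    # user-given value patterns

# Pre-wired storage rule, fixed in advance: W = sum_i outer(v_i, k_i).
W = sum(np.outer(V[i], K[i]) for i in range(n_pairs))

# Presenting a stored key returns its paired value; nothing was learned
# about which keys or values to use, since they were given by the user.
print(np.allclose(W @ K[0], V[0]))   # True
```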

*********************

SELECTED REFERENCES (remaining references in: Who Invented Transformer Neural Networks? Technical Note IDSIA-11-25, Nov 2025 - easy to find on the web)

[ATT] Juergen's AI Blog (2020, updated 2025): 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. There was both hard attention for foveas (1990) and soft attention in form of Transformers with linearized self-attention (1991-93) [ULTRA]. Today, both types are very popular.

[ATT14] D. Bahdanau, K. Cho, Y. Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. Preprint arXiv:1409.0473, 2014-16.

[FAST] C. v.d. Malsburg. Tech Report 81-2, Abteilung f. Neurobiologie, Max-Planck Institut f. Biophysik und Chemie, Goettingen, 1981. First paper on fast weights or dynamic links.

[FWP] 26 March 1991: Neural nets learn to program neural nets with fast weights—like Transformer variants. 2021: New stuff! AI Blog, 26 March 2021, updated 2025.

[FWP0] J.  Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Technical Report FKI-147-91, TU Munich, 26 March 1991. First paper on neural fast weight programmers (FWPs) that separate storage and control: a slow net learns by gradient descent to compute weight changes of a fast net. The outer product-based version (Eq. 5) is now known as the unnormalized linear Transformer or the "Transformer with linearized self-attention" [ULTRA][FWP].

[FWP1] J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Neural Computation, 4(1):131-139, 1992. Based on [FWP0]. 

[FWP2] J. Schmidhuber. Reducing the ratio between learning complexity and number of time-varying variables in fully recurrent nets. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pages 460-463. Springer, 1993. A recurrent extension of the 1991 unnormalized linear Transformer [ULTRA], introducing the terminology of learning "internal spotlights of attention." First recurrent NN-based fast weight programmer using outer products to program weight matrix changes. 

[FWP6] I. Schlag, K. Irie, J. Schmidhuber. Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint: arXiv:2102.11174. Shows that the unnormalised linear Transformer is actually MATHEMATICALLY EQUIVALENT to the 1991 system [FWP0][ULTRA] published when compute was a million times more expensive than in 2021.

[FWP7] K. Irie, I. Schlag, R. Csordas, J. Schmidhuber. Going Beyond Linear Transformers with Recurrent Fast Weight Programmers. NeurIPS 2021. Preprint: arXiv:2106.06295

[HEB48] J. Konorski (1948). Conditioned reflexes and neuron organization. Translation from the Polish manuscript under the author's supervision. Cambridge University Press, 1948. Konorski published the so-called "Hebb rule" before Hebb [HEB49].

[HEB49] D. O. Hebb. The Organization of Behavior. Wiley, New York, 1949. Konorski [HEB48] published the so-called "Hebb rule" before Hebb. 

[KOS88] B. Kosko. Bidirectional associative memories. IEEE Transactions on Systems, Man, and Cybernetics, 18(1):49-60, 1988. 

[LT20] A. Katharopoulos, A. Vyas, N. Pappas, F. Fleuret. Transformers are RNNs: Fast autoregressive Transformers with linear attention. In Proc. Int. Conf. on Machine Learning (ICML), July 2020.

[LT21] I. Bello. LambdaNetworks: Modeling Long-Range Interactions Without Attention. Preprint arXiv:2102.08602. A linear transformer variant. 

[LT23] K. Irie, R. Csordas, J. Schmidhuber. Practical Computational Power of Linear Transformers and Their Recurrent and Self-Referential Extensions. EMNLP 2023.

[LT24] S. Yang, B. Wang, Y. Zhang, Y. Shen, Y. Kim. Parallelizing Linear Transformers with the Delta Rule over Sequence Length. NeurIPS 2024.

[LT25] S. Yang, J. Kautz, A. Hatamizadeh. Gated Delta Networks: Improving Mamba2 with Delta Rule. ICLR 2025. "Mamba2" is essentially the 1991 ULTRA with a scalar time-decay factor on the fast weight matrix.

[LT25b] R. Grazzi, J. Siems, A. Zela, J. K.H. Franke, F. Hutter, M. Pontil. Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues. ICLR 2025. Shows that the delta-rule extension [FWP6][LT23] is more expressive than the quadratic Transformer and other naive linear Transformers (e.g., it can do parity and modular arithmetics).

[LT25c] J. Siems, T. Carstensen, A. Zela, F. Hutter, M. Pontil, R. Grazzi. DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products. ICLR 2025 Workshop FM-Wild. Extending the DeltaNet [FWP6][LT23] through additional "micro-steps."

[S2Sa] M.L. Forcada and R.P. Ñeco. Recursive hetero-associative memories for translation. International Work-Conference on Artificial Neural Networks, 1997.

[S2Sb] T. Mikolov and G. Zweig. Context dependent recurrent neural network language model. IEEE Spoken Language Technology Workshop (SLT), 2012.

[S2Sc] A. Graves. Sequence transduction with recurrent neural networks. Representation Learning Workshop, Int. Conf. on Machine Learning (ICML), 2012.

[S2Sd] I. Sutskever, O. Vinyals, Quoc V. Le. Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems (NIPS), 2014, 3104-3112. 

[ST61] K. Steinbuch. Die Lernmatrix. Kybernetik, 1(1):36-45, 1961. 

[TR1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin (2017). Attention is all you need. NIPS 2017, pp. 5998-6008. 

[TR2] J. Devlin, M. W. Chang, K. Lee, K. Toutanova (2018). Bert: Pre-training of deep bidirectional Transformers for language understanding. Preprint arXiv:1810.04805.

[TR3] K. Tran, A. Bisazza, C. Monz. The Importance of Being Recurrent for Modeling Hierarchical Structure. EMNLP 2018, p 4731-4736. ArXiv preprint 1803.03585.

[TR4] M. Hahn. Theoretical Limitations of Self-Attention in Neural Sequence Models. Transactions of the Association for Computational Linguistics, Volume 8, p.156-171, 2020. 

[TR5] A. Katharopoulos, A. Vyas, N. Pappas, F. Fleuret. Transformers are RNNs: Fast autoregressive Transformers with linear attention. In Proc. Int. Conf. on Machine Learning (ICML), July 2020.

[TR5a] Z. Shen, M. Zhang, H. Zhao, S. Yi, H. Li. Efficient Attention: Attention with Linear Complexities. WACV 2021.

[TR6] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, et al. Rethinking attention with Performers. In Int. Conf. on Learning Representations (ICLR), 2021.

[TR6a] H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. A. Smith, L. Kong. Random Feature Attention. ICLR 2021.

[TR7] S. Bhattamishra, K. Ahuja, N. Goyal. On the Ability and Limitations of Transformers to Recognize Formal Languages. EMNLP 2020. 

[ULTRA] References on the 1991 unnormalized linear Transformer (ULTRA): original tech report (March 1991) [FWP0]. Journal publication (1992) [FWP1]. Recurrent ULTRA extension (1993) introducing the terminology of learning "internal spotlights of attention" [FWP2]. Modern "quadratic" Transformer (2017: "attention is all you need") scaling quadratically in input size [TR1]. 2020 paper [TR5] using the terminology "linear Transformer" for a more efficient Transformer variant that scales linearly, leveraging linearized attention [TR5a]. 2021 paper [FWP6] pointing out that ULTRA dates back to 1991 [FWP0] when compute was a million times more expensive. Overview of ULTRA and other Fast Weight Programmers (2021) [FWP].


Invented principles of meta-learning (1987), GANs (1990), Transformers (1991), very deep learning (1991), etc. Our AI is used many billions of times every day.

Jürgen Schmidhuber
Mon Nov 10 11:45:04
Found one: Statping-ng, a fork of Statping. An all-in-one website and application status monitoring solution: lightweight and cross-platform, supports MySQL/PostgreSQL/SQLite, friendly to containerized deployment, with built-in Slack/Email notifications and extensible plugins.
https://t.co/74pTbOjH20


🧠 Lay Buddhist | 🥦 Vegetarian | 🏃🏻 Marathon enthusiast | 💰 Expert money-saver | 🪜 Veteran student of VPN/ladder tech | 👨‍💻 Tech geek | 🆕 Compulsive updater | 🆅 "Hexagonal" all-rounder who can't actually fight

Geek
Mon Nov 10 11:44:28
Is this creation? I think it is.

It's just that, beyond the content itself, we're trying to use AI to respond to those ideas of ours that never got a chance to unfold.

Give it a premise and, besides drawing it, it will tell us what kind of world that premise might lead to, and then it can keep writing and keep adapting.

This is really the deeper shift:
the way a generation creates has already been quietly changing.

Creation has gone from a team effort to a one-person-plus-AI conversation: give it one sentence and the AI gives you a comic, which it can then extend on its own.
Perhaps future creators will be younger and more diverse; as long as there is an idea in your head, we can make the storytelling happen.

If you have an idea, open Magic Comics in the 文心 App and try it yourself; let the AI tell you how your story could keep going.
I told the AI I had no regrets, and it drew me an entire comic of what regret looks like. That is roughly the feeling.


On the road of practice. Not seeking quick wins, only real understanding. Dove into AI in 2023, worked out the know-how, ran quite a few money-making projects, stepped in plenty of pits and seen some light. Having stayed inside the walled city long enough, I'm stepping out to talk about the world, about technology, and about making money.

凡人小北
Mon Nov 10 11:39:01