Thread Easy

Your all-in-one companion for Twitter threads


Explore

Newest first — browse tweet threads


RT @elithrar: $5 @Cloudflare Workers plan + $5 @PlanetScale dev node sounding like a winning combination for building the next big thing 😎

vp developers & ai @cloudflare ✨ and how does that error make you feel?

rita kozlov 🐀
Thu Oct 30 15:52:31
The nature is healing.

We're in a race. It's not USA vs China but humans and AGIs vs ape power centralization. @deepseek_ai stan #1, 2023–Deep Time «C’est la guerre.» ®1

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
Thu Oct 30 15:51:18
someone tell me what i'm missing here, because the titled claim seems trivially false to me:

they define an LLM as a function that maps sequence s in V^k to vector in R^d

assume hidden state in n-bit precision.  at some point, there are more inputs possible than hidden states:

|V|^k > 2^{n * d}
k > n d log(2) / log |V|

let's take GPT-2: n=16, d=768, V≈50,000

then collisions *must* happen starting at a context window size of about 788 tokens

this seems actually kind of bad, right?

phd research @cornell // language models, information theory, science of AI

Jack Morris
Thu Oct 30 15:50:00
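
The pigeonhole arithmetic in the thread above is easy to check directly. Below is a minimal sketch in plain Python; the function name is ours, and the numbers are the tweet's own assumptions (16-bit hidden states, d = 768, |V| ≈ 50,000), not values verified against GPT-2's actual configuration:

```python
import math

def collision_context_length(n_bits: int, d: int, vocab_size: int) -> int:
    """Smallest context length k at which distinct inputs must collide.

    A map from V^k into a d-dimensional hidden state stored at n_bits
    precision cannot be injective once |V|^k > 2^(n_bits * d), i.e. once
    k > n_bits * d * ln(2) / ln(|V|)  (pigeonhole principle).
    """
    return math.floor(n_bits * d * math.log(2) / math.log(vocab_size)) + 1

# Numbers quoted in the tweet (assumed, not checked against the model):
# 16-bit hidden states, d = 768, |V| ~ 50,000.
print(collision_context_length(n_bits=16, d=768, vocab_size=50_000))  # -> 788
```

With GPT-2's real vocabulary of 50,257 tokens the bound comes out one token lower, at 787, so the conclusion is insensitive to the exact vocabulary figure.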
June - Container launch
Nov 6 - Containers live in prod 

Come to our TECH Talk: https://t.co/Jx2ayskkta

Have questions, or building something cool with Cloudflare's Developer products? We're here to help. For help with your account, please try @CloudflareHelp

Cloudflare Developers
Thu Oct 30 15:46:53
RT @bigeagle_xd: i am honored to have witnessed this great work over the past year.  
linear attn has great potential in expressiveness but…

We're in a race. It's not USA vs China but humans and AGIs vs ape power centralization. @deepseek_ai stan #1, 2023–Deep Time «C’est la guerre.» ®1

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
Thu Oct 30 15:46:25