«GHC introduces a flexible mechanism that, with lightweight computation, compresses the Over-Width Hidden States to the backbone width before feeding them into the attention or feed-forward modules, and then expands the module outputs back to the Over-Width» crazy chutzpah
it's relatively cheap
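
rough sketch of the compress, apply, expand pattern from the quote, to get a feel for the cost. Big assumption: both steps here are plain linear projections. The quote only says "lightweight computation", so the actual GHC operator may well be something else. All names (`compress_apply_expand`, `w_down`, `w_up`) and the toy sizes are made up for illustration.

```python
import numpy as np

def compress_apply_expand(h_wide, w_down, w_up, module):
    """Hypothetical wrapper: run `module` at backbone width on over-width states.

    ASSUMPTION: compression and expansion are linear maps; the quoted
    text does not specify the actual operator.
    """
    h = h_wide @ w_down   # compress: (seq_len, wide_dim) -> (seq_len, backbone_dim)
    out = module(h)       # attention or feed-forward, run at backbone width
    return out @ w_up     # expand: (seq_len, backbone_dim) -> (seq_len, wide_dim)

rng = np.random.default_rng(0)
seq_len, backbone_dim, wide_dim = 4, 8, 16  # toy sizes, not from the paper

w_down = rng.normal(size=(wide_dim, backbone_dim)) / np.sqrt(wide_dim)
w_up = rng.normal(size=(backbone_dim, wide_dim)) / np.sqrt(backbone_dim)
h_wide = rng.normal(size=(seq_len, wide_dim))

# np.tanh stands in for a real attention or feed-forward block
y = compress_apply_expand(h_wide, w_down, w_up, np.tanh)

# extra parameters added around the module under the linear assumption
extra_params = w_down.size + w_up.size
print(y.shape, extra_params)
```

under that assumption the overhead per wrapped module is two matmuls and `2 * wide_dim * backbone_dim` extra weights, while the module itself still runs at backbone width.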

