In the picture above, the green stripe is the storage needed to hold the sqrt of grad^2 for Adam. Fused Adam computes the updates layer by layer, which causes the stripes. At the top, it applies those updates in one shot. The spike is the temp buffer needed for the division before the update is applied.
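To make the temp buffer concrete, here is a minimal sketch of one plain (unfused) Adam step in NumPy. It is not PyTorch's fused implementation, and `adam_step` and its arguments are hypothetical names chosen for illustration. It only shows that the denominator used in the division is a temporary array as large as the parameter itself.

```python
import numpy as np

def adam_step(param, grad, exp_avg, exp_avg_sq, step,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam step (illustrative sketch, not a fused kernel).

    Updates param, exp_avg and exp_avg_sq in place and returns the size
    in bytes of the temporary denominator buffer.
    """
    # Persistent optimizer state: running averages of grad and grad^2.
    exp_avg *= beta1
    exp_avg += (1 - beta1) * grad
    exp_avg_sq *= beta2
    exp_avg_sq += (1 - beta2) * grad * grad

    # Bias correction for the early steps.
    bias1 = 1 - beta1 ** step
    bias2 = 1 - beta2 ** step

    # Temporary buffer: sqrt of the corrected grad^2 average, plus eps.
    # It has the same shape as param and lives only for this step.
    denom = np.sqrt(exp_avg_sq / bias2) + eps

    # The division happens here, then the update is applied.
    param -= lr * (exp_avg / bias1) / denom
    return denom.nbytes

param = np.zeros(4)
grad = np.ones(4)
exp_avg = np.zeros(4)
exp_avg_sq = np.zeros(4)
temp_bytes = adam_step(param, grad, exp_avg, exp_avg_sq, step=1)
```

In this sketch `temp_bytes` equals `param.nbytes`, so the temporary costs as much memory as the parameter it updates, which is the kind of short-lived allocation that would show up as a spike.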
There is a lot more to say about these beautiful visualizations. Please let me know of any errors or other cool things we should know. More on my blog: How to Get and Interpret GPU Memory Profiling shital.com/blog/gpu-memor…