    Super-Sized Transformers: Scaling BERT to 1M Tokens and Beyond

    Recurrent Memory Transformer Enables Unprecedented Context Length in NLP Models

    • Researchers have applied recurrent memory to BERT, extending the model’s context length to an impressive two million tokens.
    • The method maintains high memory retrieval accuracy while allowing for the storage and processing of local and global information.
    • The Recurrent Memory Transformer (RMT) can enhance long-term dependency handling in NLP tasks and facilitate large-scale context processing in memory-intensive applications.

    A recent technical report presents a significant breakthrough in the field of natural language processing: the successful extension of BERT’s context length to an unprecedented two million tokens. By leveraging the Recurrent Memory Transformer (RMT) architecture, researchers have maintained high memory retrieval accuracy while increasing the model’s context length, enabling enhanced long-term dependency handling in natural language understanding and generation tasks.

    The RMT architecture addresses the quadratic complexity of the attention operation in Transformer models, which makes them increasingly difficult to apply to longer inputs. By incorporating token-based memory storage and segment-level recurrence, the RMT-augmented BERT model can tackle tasks on sequences up to seven times longer than its original 512-token input.
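
    To make the mechanism concrete, the following is a minimal PyTorch sketch of segment-level recurrence with memory tokens in the spirit of RMT. The class name, dimensions, memory-token count, and omission of positional encodings are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of segment-level recurrence with memory tokens, in the
# spirit of the Recurrent Memory Transformer. Names and hyperparameters
# are illustrative assumptions, not the authors' reference implementation.
import torch
import torch.nn as nn

class RecurrentMemorySketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2,
                 num_mem_tokens=10, segment_len=512, vocab_size=30522):
        super().__init__()
        # Positional encodings are omitted for brevity.
        self.embed = nn.Embedding(vocab_size, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(encoder_layer, n_layers)
        # Learnable initial memory tokens, prepended to every segment.
        self.memory = nn.Parameter(torch.randn(num_mem_tokens, d_model))
        self.num_mem_tokens = num_mem_tokens
        self.segment_len = segment_len

    def forward(self, input_ids):
        # Split a long input into fixed-size segments and process them
        # sequentially, carrying the memory states between segments.
        batch = input_ids.size(0)
        memory = self.memory.unsqueeze(0).expand(batch, -1, -1)
        outputs = []
        for segment in input_ids.split(self.segment_len, dim=1):
            tokens = self.embed(segment)
            # Memory tokens and segment tokens attend to each other.
            hidden = self.backbone(torch.cat([memory, tokens], dim=1))
            # Updated memory states are passed to the next segment.
            memory = hidden[:, :self.num_mem_tokens]
            outputs.append(hidden[:, self.num_mem_tokens:])
        return torch.cat(outputs, dim=1)

# Example: four 512-token segments processed with a fixed-size memory.
model = RecurrentMemorySketch()
ids = torch.randint(0, 30522, (1, 4 * 512))
print(model(ids).shape)  # torch.Size([1, 2048, 256])
```

    Because attention at each step spans only one 512-token segment plus a handful of memory tokens, total compute grows with the number of segments rather than with the square of the full sequence length.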

    The study revealed that the trained RMT can successfully extrapolate to tasks of varying lengths, including those exceeding one million tokens, with the required computation scaling linearly. Additionally, analysis of attention patterns revealed the operations RMT performs on its memory, which enable it to handle exceptionally long sequences effectively.

    The application of RMT to BERT demonstrates that handling long texts with Transformers does not necessarily require large amounts of memory. With a recurrent approach and memory tokens, the quadratic complexity of attention can be reduced to linear, and models trained on large inputs can extrapolate their abilities to significantly longer texts.
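
    The reduction is easy to see with rough numbers. The back-of-the-envelope comparison below counts attention score pairs for a single full-attention pass versus segment-by-segment processing; it is illustrative only and ignores memory tokens, attention heads, and constant factors.

```python
# Rough comparison of attention "pair" counts: full attention over the whole
# input is quadratic in length, while processing fixed 512-token segments
# keeps the total roughly linear. Illustrative only.
def full_attention_pairs(n_tokens):
    return n_tokens ** 2

def segmented_pairs(n_tokens, segment_len=512):
    n_segments = n_tokens // segment_len
    return n_segments * segment_len ** 2  # grows linearly with n_tokens

for n in (4_096, 65_536, 2_000_000):
    print(f"{n:>9} tokens: full={full_attention_pairs(n):.3e}  "
          f"segmented={segmented_pairs(n):.3e}")
```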

    The synthetic tasks explored in this study serve as a first milestone toward enabling RMT to generalize to tasks with unseen properties, paving the way for further improvements in effective context size. Future work aims to tailor the recurrent memory approach to the most commonly used Transformers, further enhancing their capabilities in natural language processing tasks.

    Paper

    Github
