The biggest announcements in generative AI this year have centered on the release of powerful new language models from Google and Anthropic. Outside of the commercial players, the Allen Institute for Artificial Intelligence (known as AI2) entered the fray with an entirely open source model under a permissive Apache 2.0 license. For businesses, this means there are now many more LLM options to choose from, with lower cost, better performance, improved safety, and new features that enable entirely new use cases in production environments. At the very end of the quarter, several new 'open source' models were made public in addition to OLMo. Below, I'll summarize what's new, compare the models, and discuss some of the implications for AI deployments in 2024.
In February 2024, Google's AI research lab, Google DeepMind, announced a suite of new language models built on 'the same research and technology' used to create Gemini, including their GPT-4 competitor, Gemini Ultra 1.0 (the Gemini Ultra and Pro models were introduced in December 2023). The smaller Gemma models come in two sizes, 2B and 7B (parameter count in billions), each with pre-trained and instruction-tuned variants. The evaluation benchmarks that the Google team published against similar-sized models from competitors showed superior performance by the Gemmas. For example, there was a substantial gap on the well-known MMLU benchmark, which tests undergraduate-level knowledge across 57 subject areas: Gemma 7B scored 64.3% accuracy versus Meta's Llama-2 7B at 45.3%. These smaller Gemma models can be used commercially under a permissive license and come with developer tools for supervised fine-tuning, inference, and a responsible Gen AI toolkit. Note that these were released as open models (weights available), not open source (i.e., source code and training data not available). This is a trend appearing across the AI landscape, due in part to competitive IP pressures as well as copyright uncertainties and looming legal battles over data usage.
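As a quick illustration of that developer tooling, here is a minimal sketch of running the instruction-tuned Gemma 2B through the Hugging Face transformers library. The checkpoint id is the one published on the Hub; the prompt and generation settings are my own illustrative choices, and downloading the weights assumes you have accepted Google's license terms on the Hub.

```python
# Minimal sketch: greedy generation with instruction-tuned Gemma 2B.
# Assumes `transformers` and `torch` are installed and the Gemma license
# has been accepted on the Hugging Face Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b-it"  # instruction-tuned 2B variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Wrap the user turn in Gemma's chat template so the model sees the
# formatting it was instruction-tuned on.
chat = [{"role": "user", "content": "What does the MMLU benchmark measure?"}]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```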
Not to be outdone, Anthropic came out with three powerful versions of Claude that together make Anthropic a top competitor against OpenAI and Google. Its most capable model, Opus, ranked at the top of the elite models on MMLU performance at 86.8% accuracy, versus GPT-4 (86.4%) and Gemini 1.0 Ultra (83.7%). Anthropic has not disclosed parameter counts for these new Claude versions but offers good documentation on usage and evaluations on its models overview webpage. It is worth noting that Opus is built to accept inputs exceeding 1 million tokens (initially offered to select customers), similar to Gemini 1.5 Pro. For developers, these models are accessed via API, either from Anthropic's web Console or through Amazon Bedrock; the new Opus model runs underneath the Claude Pro chatbot, while the Sonnet model powers the free Anthropic chatbot.
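For a concrete sense of that developer path, here is a minimal sketch of calling Claude 3 Opus through Anthropic's Python SDK. It assumes the `anthropic` package is installed and an API key is set in the environment; the prompt and token limit are illustrative choices, not anything prescribed by Anthropic.

```python
# Minimal sketch: one-shot message to Claude 3 Opus via the Messages API.
# Assumes `pip install anthropic` and ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()  # picks up ANTHROPIC_API_KEY automatically

message = client.messages.create(
    model="claude-3-opus-20240229",  # Opus snapshot from the March 2024 launch
    max_tokens=1024,
    messages=[
        {"role": "user",
         "content": "Summarize the Q1 2024 LLM releases in two sentences."}
    ],
)
print(message.content[0].text)  # responses arrive as a list of content blocks
```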
The announced release of the Allen Institute's Open Language Model (OLMo) was significant because of how much was included for the research community. AI2 packaged up the training data (a pretraining corpus known as Dolma) and the source code for training and evaluation, along with other artifacts for advancing foundation models. OLMo is available in 1B and 7B parameter versions and, impressively, holds its own alongside comparably sized models Llama-2 7B, Falcon 7B, and PaLM 8B.
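To show how accessible the release is, here is a minimal sketch of loading OLMo 7B from the Hugging Face Hub. The checkpoint id is AI2's published one; note that at release time the model used a custom architecture, so the companion `ai2-olmo` package and the `trust_remote_code` flag shown here are assumptions worth verifying against AI2's documentation.

```python
# Minimal sketch: text generation with OLMo 7B.
# Assumes `transformers` plus AI2's companion package (`pip install ai2-olmo`)
# so the custom OLMo architecture can be resolved at load time.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("Open models matter because", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```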
Four notable foundation models were announced in the open source (weights available) space in March 2024: DBRX from Databricks, Samba-CoE v0.2 from SambaNova Systems, Grok 1.5 from xAI, and last but not least, Jamba from AI21 Labs.

The DBRX release is making waves with its mixture of experts (MoE) architecture and its performance versus GPT-3.5 and other MoE models. The design has 16 'experts' and activates 4 of them per input token (see the Mosaic Research team website for the announcement and details); a toy sketch of this routing pattern appears at the end of this section.

Samba-CoE v0.2 is a Composition of Experts model, an approach that combines multiple smaller models under the hood to form one large model. SambaNova is touting excellent performance and optimization at the trillion-plus parameter scale. It is a very interesting new entrant pushing a full-stack platform for Gen AI; head over to SambaNova Systems' site and check it out.

Elon Musk's xAI has published Grok 1.5, a model notable for a relatively huge context window (128K tokens). Grok 1.5 is another model touting performance near the top of the open source category.

Finally, the Jamba model from AI21 Labs was released; like the Mamba architecture it builds on, it features a novel design: a hybrid of transformer layers and a structured state space model (SSM). The hybrid design appears to get around some memory and computational limitations of transformers, whose attention scales poorly (quadratically) with context or sequence length. The Jamba model boasts a 256K context window, and details on the hybrid SSM-transformer design can be found in AI21 Labs' announcement.
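To make the mixture-of-experts idea concrete, here is the toy PyTorch sketch referenced above: top-k routing in the spirit of DBRX's 16-experts, 4-active design. This is illustrative code for the general technique, not Databricks' implementation; every class name and dimension here is hypothetical.

```python
# Toy sketch of top-k mixture-of-experts routing: a gating network scores
# all experts per token, but only the top k actually run. All names and
# sizes are illustrative, not from DBRX.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, n_experts=16, k=4):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)  # the router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                           # x: (tokens, d_model)
        scores = self.gate(x)                       # (tokens, n_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)    # mix only the chosen experts
        out = torch.zeros_like(x)
        # Per-token loop for clarity; real implementations batch this.
        for token in range(x.size(0)):
            for slot in range(self.k):
                expert = self.experts[topk_idx[token, slot]]
                out[token] += weights[token, slot] * expert(x[token])
        return out

layer = ToyMoELayer()
print(layer(torch.randn(3, 64)).shape)  # torch.Size([3, 64])
```

The payoff of this design is that only a fraction of the expert parameters run for any given token, which is how MoE models keep inference cost well below what their total parameter count would suggest.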
As Gen AI technology is adopted and rolls out across organizations large and small around the globe, the tech sector continues to enable deployment for a greater number of business use cases by offering a wider range of models, and thus applications. The availability of multiple options for high-quality language models, commercial as well as open model/open source, is driving down compute costs, decreasing response times, and improving the chances of finding the most performant model for a given task. On that last point, many of these models were trained to excel at selected abilities or scenarios, such as reasoning, domain understanding (math, science, foreign policy), coding, or, more generally, broad overall knowledge.