Models, Rock, Paper, Scissors

2024-06-12

Rock. Paper. Scissors. Shoot!

A game as old as decision making itself - Rock Paper Scissors is a bastion of school yards, road trips, and board rooms the world over. Paper beats rock. Rock beats scissors. Scissors beat paper. Elegant. Simple. And an example of intransitive superiority.

Intransitive superiority is a fascinating concept that changes the way we think about something being the "best" in absolute terms. Paper beats rock, rock beats scissors, and scissors beat paper, so no single choice is best. It also exposes the limitations of side-by-side comparisons of more complex systems or choices.

These counterintuitive systems are not uncommon outside of the playground: you can find the same pattern in evolution (species A outcompetes species B for resources, species B outcompetes species C, species C outcompetes species A), sports leagues (team A beats team B, team B beats team C, team C beats team A), video games (earth magic loses to water, water magic loses to fire, fire magic loses to earth), and - and you knew I was getting there - generative AI foundation models (model A outperforms model B, model B outperforms model C, model C outperforms model A, depending on the task at hand).

These cycles arise when the relationship between elements is non-transitive, leading to cyclical patterns of superiority, where each element is superior to one other element while being inferior to another.

It becomes pretty important to examine the specific pairwise interactions between elements rather than relying solely on overall rankings or presuming transitivity. Failure to account for intransitive superiority can lead to suboptimal decisions and strategies, as the superiority of an element may vary depending on the specific context and the other elements involved.
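
To make the idea of checking pairwise interactions concrete, here is a minimal Python sketch. It assumes the pairwise results are stored as a set of (winner, loser) pairs; the function names find_cycles and is_transitive are illustrative and not from any library.

```python
from itertools import permutations

# Pairwise results as (winner, loser). These are the Rock Paper Scissors rules.
BEATS = {("paper", "rock"), ("rock", "scissors"), ("scissors", "paper")}


def is_transitive(beats):
    """True if a beats b and b beats c always implies a beats c."""
    return all(
        (a, c) in beats
        for a, b in beats
        for b2, c in beats
        if b == b2 and a != c
    )


def find_cycles(beats):
    """Return every three-element cycle a > b > c > a, listed once each."""
    items = sorted({x for pair in beats for x in pair})
    cycles = set()
    for a, b, c in permutations(items, 3):
        in_cycle = (a, b) in beats and (b, c) in beats and (c, a) in beats
        # Keep only the rotation that starts at the smallest name,
        # so the same cycle is not reported three times.
        if in_cycle and a == min(a, b, c):
            cycles.add((a, b, c))
    return sorted(cycles)


if __name__ == "__main__":
    print(is_transitive(BEATS))
    print(find_cycles(BEATS))
```

If is_transitive returns False, a single overall ranking cannot represent the results, and the cycles returned by find_cycles show where the ranking breaks down.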

The same is true for generative AI foundation models. Traditional transitive superiority would have us believe that a single absolute "best" model exists, and that there is a hierarchy of silver and bronze winners after that. Indeed, we kind of fool ourselves into believing this by (over-)indexing on benchmark scores. Transitive assumptions make for great headlines (and a win is a win in any category!), but they are only part of the story, and they miss the broader context, which reveals a counterintuitive truth.

There is no one model to rule them all.

General "world" models like GPT-4o and Claude 3 Opus are great! They have a ton of utility in reasoning across general information and knowledge, analyzing data, interpreting code, and so on. These broad, general language tasks are common - and having access to awesome models like these is a big part of why customers are excited about generative AI in the first place. They even "win" in many benchmarks, and for general tasks will outperform specialized models. But!

A specialized model will beat a general model at a specific task - with better answers, less risk of hallucination, and often at lower cost. But!

An ensemble of specialized models working in concert will outperform a single specialized model at a specialized task (since the ensemble usually has more context to pull from multiple forms of specialization). This is - in part - why mixture of experts models work so well (used - famously - by Mistral AI in Mixtral). But!

General models will beat ensemble models in general tasks. And so the cycle continues.

Assuming model transitivity - that an absolute "best" model exists irrespective of context - often leads to suboptimal performance of an AI system, and higher costs in aggregate. But! If we assume model intransitivity, the value shifts from finding the "best" to using the "many" for the right task at the right time.
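
The difference between picking one "best" model and using the "many" can be sketched in a few lines of Python. Everything here is an assumption for illustration: the model names, the task names, and the scores are placeholders, not benchmark results, and the scores are deliberately arranged so that each model wins exactly one task.

```python
# Placeholder scores out of 100 (higher is better). Illustrative only,
# not measurements of any real model.
SCORES = {
    "model_a": {"task_1": 90, "task_2": 60, "task_3": 75},
    "model_b": {"task_1": 70, "task_2": 90, "task_3": 60},
    "model_c": {"task_1": 60, "task_2": 70, "task_3": 90},
}


def best_single_model(scores):
    """Pick the one model with the highest total score across all tasks."""
    return max(scores, key=lambda model: sum(scores[model].values()))


def route_per_task(scores):
    """Pick the highest-scoring model separately for each task."""
    tasks = next(iter(scores.values())).keys()
    return {
        task: max(scores, key=lambda model: scores[model][task])
        for task in tasks
    }


def total_score(scores, assignment):
    """Sum the scores obtained when each task uses its assigned model."""
    return sum(scores[model][task] for task, model in assignment.items())


if __name__ == "__main__":
    single = best_single_model(SCORES)
    single_assignment = {task: single for task in SCORES[single]}
    routed_assignment = route_per_task(SCORES)
    print(single, total_score(SCORES, single_assignment))
    print(routed_assignment, total_score(SCORES, routed_assignment))
```

With these placeholder numbers, the single overall winner is beaten on two of the three tasks, which is the gap that per-task selection closes. A real comparison would also need to weigh cost and latency, which this sketch ignores.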

Intransitive models in the real world

It is this insight that informed our design of Amazon Bedrock, where different models from different families are available behind a consistent API, with evaluation tools to help pick the right one at the right time. This collection of models creates a wide circle of intransitivity from which you can pick the "best" model for increasingly specialized tasks.

This includes world models like Claude 3 Opus for general tasks, ensemble models like Mixtral, or specialized natural language models such as Titan's text models. Also, customization capabilities let you fine-tune existing models to hone and focus on specialized use cases, further increasing the circumference of the cycle of intransitivity. For each use case or task, you can create or find the best model, and vary that choice as your needs inevitably change.

It's also what informed the architecture of Apple Intelligence, where specialized on-device models are used when possible, specialized ensemble models are combined in the cloud when appropriate (these have been fine-tuned on personal data types like messages, calendars, emails, and so on), and general tasks are routed to generalized world systems like ChatGPT (and others like it in the future).
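
The tiered pattern described above can be sketched as a simple fall-through rule. This is an illustration of the general idea only, not Apple's implementation: the Task fields, the tier names, and the choose_tier function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    description: str
    fits_on_device: bool          # hypothetical flag: a small local model can handle it
    needs_personal_context: bool  # hypothetical flag: touches messages, calendars, emails


def choose_tier(task: Task) -> str:
    """Prefer the most specialized tier that can handle the task."""
    if task.fits_on_device:
        return "on_device_specialized"
    if task.needs_personal_context:
        return "cloud_specialized_ensemble"
    # Anything else falls through to a general world model.
    return "general_world_model"


if __name__ == "__main__":
    print(choose_tier(Task("summarize a notification", True, True)))
    print(choose_tier(Task("plan around my calendar", False, True)))
    print(choose_tier(Task("explain a historical event", False, False)))
```

The order of the checks encodes the preference: the general model is the fallback, not the default.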

Rock. Paper. Scissors. Specialized. General. Ensemble.

I think this intransitive characteristic is likely to remain a stable feature of the menagerie of generative AI models for the foreseeable future.

I would bet we will have many more model families (and that the models in each category will continue to diversify) over time. That makes mastering this intransitive choice, selecting the right model for the right use case, one of the biggest levers most of us can pull to drive successful AI workloads. Organizations that build the muscles to make the right choice based on the right evaluation criteria are going to be well poised to move quickly as the models themselves continue to improve.

Exciting times.

Further reading

You can read about model choice in Amazon Bedrock and Apple's model adaptation approach here:

https://aws.amazon.com/bedrock/developer-experience/

https://machinelearning.apple.com/research/introducing-apple-foundation-models