Understand the Gen AI Innovation Ingredients

Imagine a kitchen with five basic ingredients: flour, rice, pasta, beans, and salmon. For years, you’ve cooked meals with these ingredients. You know their ins and outs, their upsides and limitations. 

Now, something has changed. 

Suddenly, you have a box of fresh ingredients in your kitchen - eggplants, broccoli, cilantro, kale, and chili. How long will it take you to know and master these ingredients? It might take a while.

That’s where we are with generative AI and designing new digital services. We suddenly have a box of new ingredients, and it will take years to figure out how to whip up delicious meals from them. 

In this issue, I want to introduce the three levels of ingredients for innovation with generative AI:

1. The Foundation - Knowledge and conversations
2. Multimodality - image, video, and audio understanding and generation
3. Extensions - web search and actions like code generation 

We’ll take a quick look at each level, explore examples in the wild, and assess their current maturity. We’ll also look at how these ingredients work together to enable novel innovations.

Let’s tie our aprons and sharpen our knives.

1. The Foundation - Knowledge and conversations

Knowledge and conversations are the foundational ingredients of generative AI and the base layer for all other levels. 

Knowledge in generative AI can be roughly divided into two types: 

  1. General knowledge that foundation models (GPT-4, Gemini, etc.) have of the world, compressed from their vast training data into a probabilistic model. This knowledge comes from learning patterns in the training material through next-token prediction. As such, the model doesn’t store actual facts - only predictions of the next token (a small piece of text, video, etc.).    

  2. Augmented knowledge that gen AI models can reference, including documents, databases, and any context shared in prompts. This external knowledge can add accuracy to the general knowledge, deepen it in niche areas, and add information not available in public training materials.  
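The next-token prediction idea behind general knowledge can be sketched with a toy model. The snippet below is a deliberately simplified illustration (a bigram counter, nothing like a real transformer): it learns which word tends to follow which, and "knows" only those statistical patterns, not facts.

```python
from collections import Counter, defaultdict

def train_bigram_model(text):
    """Count how often each token follows another - a toy stand-in
    for probabilistic next-token prediction."""
    tokens = text.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, token):
    """Return the most likely next token, or None if the token
    never appeared in the training text."""
    if token not in counts:
        return None
    return counts[token].most_common(1)[0][0]

corpus = "the cat sat on the mat and the cat slept"
model = train_bigram_model(corpus)
print(predict_next(model, "the"))  # "cat" - it followed "the" most often
```

Real models predict over tens of thousands of tokens with billions of learned parameters, but the principle is the same: the output is the statistically likely continuation, not a retrieved fact.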

There are two interesting trends in the knowledge base of generative AI: larger context windows and Retrieval Augmented Generation (RAG). 

Larger context windows mean that models can take in and remember more of our prompts. Last week, Google announced Gemini 1.5 with a context window of over 1,000,000 tokens (roughly, small chunks of text or other data). In Google’s example, the model accurately analyzed plot points from a 44-minute silent film.

Retrieval-Augmented Generation means the AI model can augment its original training data by referencing specific material - for instance, large PDF documents or databases. Organizations that can put their intellectual property to work as the background knowledge of AI will have an edge over their competitors.
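A minimal sketch of the RAG pattern, under simplifying assumptions: here retrieval is naive keyword overlap (production systems use vector embeddings), and the "generation" step is just assembling the prompt that would be sent to a language model. The documents and query are made up for illustration.

```python
def retrieve(query, documents, top_k=1):
    """Rank documents by keyword overlap with the query - a crude
    stand-in for embedding-based retrieval."""
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_augmented_prompt(query, documents):
    """Prepend the retrieved context to the user's question; this
    combined prompt is what the model would actually receive."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = [
    "Q3 revenue grew 12% driven by subscription renewals.",
    "The office cafeteria menu rotates weekly.",
]
prompt = build_augmented_prompt("How did revenue grow in Q3?", docs)
```

The key design idea survives the simplification: the model’s general knowledge is grounded by whatever material the retrieval step puts in front of it.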

Built on the base layer of knowledge, conversations are the other foundational ingredient of generative AI. 

State-of-the-art models like GPT-4 and Gemini are amazingly fluent in conversations. Expect these conversations to become more nuanced and intelligent as the models advance.

With an understanding of this base layer, a first step for innovation with gen AI is to consider the following:

1. What proprietary deep knowledge do we have in databases and documents?
2. How could our customers or internal stakeholders benefit from this knowledge through a conversation?  

For example, the credit ratings and research firm Moody’s is using gen AI to build a conversational layer on top of its vast databases of proprietary financial data - from disclosures to financial reports. Analysts can draw deeper insights from the data through natural conversations. 

Moody’s is building gen AI applications that combine its proprietary financial data with a conversational layer. Analysts can simply ask questions instead of poring over mountains of data.


2. Multimodality - understand and generate images, video, and audio 


The next layer on top of the foundation is multimodality - the ability to generate and understand different media modalities. 

Generative AI models are increasingly adept at both generating and analyzing images. 

So far, most of the world's attention has been focused on image generation and its negative effects. 

Image recognition (or vision, as OpenAI calls it) has the potential to be equally or even more powerful.

Here are some current and potential uses for image recognition:

- Be My Eyes uses GPT-4’s vision capabilities to let vision-impaired people “see” the world through live explanations and instructions based on their smartphone camera (“take this subway line home”)

- Object detection and instructions - understanding what’s in the picture and answering any questions about it (how should I water this plant?)

- Combining specific knowledge (manuals of complex machines) with conversations and image recognition (how should I fix this broken part of a machine?)   

Be My Eyes combines the ability for an AI to see images with conversations to help the vision impaired navigate the world

AI video generation is now catching up to images. 

Last week, OpenAI shocked the AI-following world with its text-to-video model Sora. It can generate up to a minute of near-perfect video based on any prompt. Sora is still in limited internal test use but shows how quickly AI video is progressing. 

Video recognition is already further along.

FOX Sports is using gen AI to analyze millions of videos and find the relevant clips

In a recent example, FOX Sports has started to use generative AI to analyze and tag millions of videos to find relevant clips for its viewers. Finding the right clips is one of the most manual processes in sports media.  

More sophisticated gen AI models can be built in the future by combining live video recognition, knowledge bases, and conversations. For example, an analysis of a live video feed of a warehouse could be combined with a database to generate contextual guidance (“Here’s how to optimize the flow of goods”).


3. Extensions - web search and other actions 


Extensions like web search and other actions like generating code add another layer to the innovation ingredient mix.

In addition to referencing its training data and external materials, the best gen AI models can also search the web to add current information to their knowledge. So far, Google has done a better job of incorporating search into its results, while OpenAI has gone back and forth on turning its Bing search capability on and off. 

In addition to search, AI models like GPT-4 can generate code and talk to other applications via API requests. Code generation can be handy for complex mathematical problems and for generating documents like CSV files. 
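The "actions" pattern can be sketched without any real model in the loop. In the hypothetical example below, the model is assumed to emit a structured tool call (a name plus JSON arguments) instead of prose, and a small dispatcher runs the matching local function; the tool name, rate data, and call format are all invented for illustration.

```python
import json

def get_exchange_rate(base, target):
    """Hypothetical tool: a stand-in for a real API request,
    returning a hard-coded rate."""
    rates = {("USD", "EUR"): 0.92}
    return rates.get((base, target))

# Registry of functions the model is allowed to invoke.
TOOLS = {"get_exchange_rate": get_exchange_rate}

def dispatch(tool_call_json):
    """Parse a model-emitted tool call and run the matching
    local function with its arguments."""
    call = json.loads(tool_call_json)
    func = TOOLS[call["name"]]
    return func(**call["arguments"])

# What a model's structured output might look like:
result = dispatch(
    '{"name": "get_exchange_rate",'
    ' "arguments": {"base": "USD", "target": "EUR"}}'
)
```

This dispatch loop is the core of what vendor "tools" or "function calling" features automate: the model decides *which* action to take and with *what* arguments, while your code keeps control of what actually runs.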

Some of these powers are already accessible to developers through the OpenAI Assistants API, albeit not cheaply. So far, the big companies behind the leading models have been understandably cautious about rolling out these extensions - sometimes called agentic features. Currently, actions are limited to simple ones, like finding a hotel through ChatGPT’s Custom GPTs.

Extensions are the least mature of the three levels but are the most interesting to follow for future innovations.  

Putting it all together

The first wave of experimentation with generative AI has focused on the foundational level - using tools like ChatGPT in personal workflows and simple chatbots.

In the next wave, organizations will combine ingredients across these levels, such as multimodality and powerful extensions. 

These levels are at different stages of maturity - conversations are already powerful, while search is still faltering. For those of us working in design and innovation, it’s crucial to know these ingredients, follow their development, and experiment responsibly with new, useful services that tap into their potential.

Matias Vaara

I help teams tap into the power of generative AI for design and innovation.

My weekly newsletter, Amplified, shares practical insights on generative AI for design and innovation.
