What can Generative AI *actually* do well?

Over the past two years, Generative AI has become a major trend, fueled by the development of increasingly large machine learning models that focus on generating content. Here is a brief recap of what has happened in the saga of this technology so far.

2022 | 💥 Big Bang: ChatGPT is released in November 2022 and quickly gains public attention. Numerous experiments with the tool reveal the surprising capabilities of the underlying GPT-3.5 model. This initial excitement leads to ambitious visions for the future of the technology.

2023 | Euphoria: The success of ChatGPT accelerates the development of new models, frameworks, and techniques in the area of Generative AI. The technology spreads across all industries. Startups, consultancies, organizations, and researchers build the first proofs of concept to bring the ambitious visions to life.

2024 | 🤔 Reality Check: Technological development slows down a little and the limitations of the technology become clearer. Many proofs of concept struggle with market-readiness, leading to a re-evaluation of their technical feasibility and business returns. The initial euphoria is fading as businesses have to tackle the hard question of how and where Generative AI brings concrete value.

Boom or Bust?

The described development is reflected in Gartner’s Hype Cycle for Emerging Technologies from August 2024. Generative AI has moved beyond the Peak of Inflated Expectations and is entering the Trough of Disillusionment. This transition highlights how inflated expectations around Generative AI are clashing with the practical limitations of the technology. Early visions are being reassessed as organizations question whether the technology will live up to the hype or get stuck at half-baked demos. So, should we worry about the future of Generative AI?

The good: The new wave of machine learning models, which are often referred to as “Generative AI” models, provide real technological potential. These models stand out from previous machine learning models by focusing on content generation, utilizing large-scale datasets, and training for broad, general-purpose understanding. For the first time in history, complex digital content like text, code, images, video, and 3D objects can be generated automatically. In some areas, like text and images, the outputs of generative models are already indistinguishable from human work. Consequently, many workflows in our highly digitized world can be significantly improved by leveraging generative models. Generative models are a great addition to the machine learning toolkit and can power new features, tools, and solutions that greatly improve the efficiency of the digital workforce.

⚠️ The bad: Generative AI models are not magic. You cannot simply apply “AI” to a task and it will somehow solve it. Generative AI models are a technical building block, just like Kafka Streams, Blockchain Storage, 5G Networks, Microservices, or any other technological trend of the past decade. Like any other technology, applying this technical building block to a business case requires a lot of knowledge, effort, talent, and most importantly: a fit between the problem and the technology. Finding fitting use cases for generative models is quite hard, because Generative AI emerged without a specific business problem in mind; the market is now retrospectively trying to find suitable use cases.

💀 The ugly: Although identifying use cases in retrospect is possible, it has become extraordinarily hard due to the hype around Generative AI. To match the technology with the right use cases, there needs to be a realistic view of the current and future capabilities of the technology. In my opinion, this is where it gets ugly, as the technology is often unknowingly but sometimes deliberately misrepresented. The black-box nature of machine learning models leaves a lot of room for speculation about the inner workings of the models. Some players benefitting from the hype exploit this lack of transparency by filling the room for speculation with anthropomorphism, falsely claiming that human-like intelligence can emerge from statistical models that possess no such intelligence.

There is no need to worry about the future of Generative AI as there is real potential at the core of this technology. However, significant effort is needed to leverage it in real use cases. This crucial effort is often misguided by the hype around Generative AI, which makes it hard to see the technology for what it truly is. Due to the lack of transparency, one can easily invest in the wrong projects, wasting valuable time, resources, and manpower on false promises that might never come true. To successfully leverage Generative AI over the long term, one must cut through the hype and focus on use cases that fit the technology. That is why, in this article, I want to offer a realistic answer to the question: “What can Generative AI actually do well?”.

The usefulness and limitations of Generative AI

Before discussing the tasks that generative models can do well, one first needs a clear understanding of what generative models are and what they are not. Defining this boundary highlights the utility but also the limitations of these models, both currently and in the future.

Usefulness

The usefulness of generative models comes from the fact that they can learn patterns from data, just like any other machine learning model. In the field of Generative AI, models are trained on huge amounts of text, images, videos, or other types of content. By picking up on patterns in the content, a model builds a generic understanding of the content it is being trained on.

For example, GPT-4 was trained on a vast amount of internet text and picked up on patterns in this training data. When you ask it a question, it replicates relevant text patterns from the training data to produce a fitting answer, giving the impression of “understanding” your input. GPT-4 Vision operates similarly but with images. If you upload a picture of a Labrador to the model, it recognizes patterns typical for Labrador images and reproduces descriptions that are commonly found alongside similar images on the internet. Since GPT-4V was trained on a wide array of images covering many topics, it recognizes patterns of many common objects, allowing it to describe generic images well.

The important point here is that generative models do not think or reason. Claiming they possess “mental capabilities” or “digital intelligence” is misleading. The only capability they truly possess is content understanding. GPT-4 understands text content (if the language and topic can be found on the internet), and GPT-4V understands image content (if the contents of the image are described on the internet).

Limitations

Generative models are bound to the world of digital content. They only learn to “understand” and reproduce content about a task but never learn how to perform a task itself like a human does. This distinction is extremely important since it reduces the number of tasks where applying an existing generative model or training a new model makes sense.

For example, there is plentiful text and image content about floor planning available online, so we could expect that a text-generation model like GPT-4 or an image-generation model like DALL·E can produce a useful floor plan. Turns out they cannot, as shown by the two images below. Why? Because the task of floor planning goes much deeper than just sketching a layout. Architects carry out site analysis, engage in technical planning, conduct regulatory checks, and integrate findings from all these steps into the final floor plan. Generative models on the other hand only replicate the final content of a floor plan, while being completely oblivious to the actual thinking that went into creating it.

To truly automate floor planning with “AI”, you must develop a narrow mathematical or machine learning model that accurately captures the complexity of the whole task. It is important to capture not only the complexity of the final output (= floor plans), but the complexity of the whole process (= floor planning) to get meaningful results.

To summarize, generative models are and will forever be bound to content. While there lies potential in exploring novel combinations of model architectures and content types, the space to explore is limited, as it is bound by two fundamental requirements that must be met to create useful outputs. First, the task must be representable by digital content, meaning there is little to no difference between executing the task itself and replicating content about the task. Second, there must be a substantial amount of content about this task, sufficient to train a large machine learning model to the point where it reaches a “generalized understanding“ of the content.

What Generative AI actually does well

As established, generative models only operate on a content level, which draws a clear boundary between tasks that realistically could be handled by current or future versions of the available models and tasks that are unrealistic or impossible. In the following, I will lay out the tasks that generative models can do well, considering both the current state of generative models and future developments, including combinations of far more powerful models with novel types of content that could be used for training.

The image below shows an overview of the tasks that generative models can handle. You can directly jump into each task using these quick links: 🎨 Generation | 🔄 Transformation | 🤏🏻 Summarization | 🎯 Decision | ⚖️ Comparison | 🔎 Retrieval | 👾 Digital Navigation | 🦾 Physical Navigation | 🎭 Behavior Simulation

🎨 Generation Tasks

The most obvious use cases are content generation tasks. In this case, we only provide a sparse prompt and the model generates a new piece of content. For example, an image model trained on sunset photos generates new sunset photos by reproducing patterns that are typical for these kinds of images. Replicating digital content works well because it is what generative models were designed to do. Furthermore, it works for all kinds of modalities and formats like texts, poems, emails, code, pictures, videos, 3D models, and other digital formats. I will lay out this task in detail, as it forms the foundation for all others and is essential for understanding the state of various modalities. Rest assured, the remaining tasks will be more concise.

Text Modality

Generative models can be used to generate any type of text. Currently, generating short and medium-length text content like poems, song lyrics, emails, ad copy, and short essays already works well. Topic-wise, the models do best with well-known topics that are widely covered on the internet.
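To make this concrete, here is a minimal sketch of a text generation call, assuming the OpenAI Python SDK; the model name and prompt are purely illustrative, and any capable text-generation model would work the same way.

```python
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

# A sparse prompt is all we provide; the model fills in the rest by
# reproducing text patterns it picked up during training.
response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{
        "role": "user",
        "content": "Write a short, friendly email inviting the team to a Friday demo session.",
    }],
)

print(response.choices[0].message.content)
```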

There are two big limitations of the current models. First, generating complex content like books does not work, as it requires keeping track of an intricate storyline with developing characters, motives, places, etc. Current models are too limited in the amount of data they can handle to pick up on these high-level patterns in training and replicate them. Second, because of the internet-based training data, the models produce generic texts. These texts are often unusable in business scenarios, as internal communication requires an understanding of domain-specific topics.

★ I am optimistic about the future of text generation as the performance and context size of models are increasing fast. Thus, it is plausible that at some point in the future, models will be able to generate highly complex texts like books, movie scripts, and speeches. This progression will likely lead to automation in the media industry and companies’ external communications. The lack of domain knowledge can also be addressed by fine-tuning models with niche data such as internal company documents. This could enable generative models to also automate significant portions of internal company communication and documentation.

Interesting Links

🔗 GPT-4 Technical Report: This paper highlights the written performance of GPT-4 based on academic and professional exams. It demonstrates how GPT-4 can produce academic texts on common subjects like History, Biology, Physics, and more. As previously mentioned, it is important not to confuse this ability to replicate academic content with reasoning on an academic level.

🔗 LLM-Generated Book about the Philosophy of AI: An experiment where an LLM is iteratively prompted to write an entire book. As mentioned in the limitations, the quality of the generated book is not great. Small paragraphs make sense individually, but zooming out reveals that they are random chunks of text glued together without leading up to an overarching argument or storyline.

Code Modality

Generative models can generate code at a granular level. In a previous experiment, I discovered that code generation is effective up to approximately 400 lines of code or keeping track of around five high-level concepts. Thus, the models can autonomously create small stand-alone applications.

Challenges emerge when attempting to generate complex applications. The generated code loses coherence, meaning one part of the code becomes incompatible with another. Any application that requires a sound high-level architecture or complex dependencies cannot be generated autonomously. Furthermore, there are limitations regarding which coding languages the models can understand as they struggle with old, niche, or brand-new languages that are less prevalent on the internet.

★ I believe that generative models will become significantly more powerful in the future and could be capable of handling extensive contexts like an entire codebase. Nonetheless, I don’t believe that code generation will benefit from these advancements in the same way that text generation will as writing code is only a small fraction of a programmer’s job. The majority of the time is spent on understanding the user, designing high-level architectures, and engaging in creative problem-solving—tasks for which the code itself provides little to no insight. Therefore, I believe that in the long term, generative models will become a powerful tool for the sub-task of writing code, but their impact on the broader task of software engineering will remain limited.

Interesting Links

🔗 Github Copilot: An IDE plugin that uses generative models to suggest code autocompletion.

🔗 Cursor: An IDE centered around code generation with generative models.

🔗 Claude Artifacts: The Claude models can produce small code artifacts within the conversation and also run them directly so the user can interact with them.

🔗 Devin: Devin is an attempt at autonomous programming. Despite being marketed as the “first AI Software Engineer”, it fails at complex software engineering tasks and is only useful to generate small snippets or demos. A similar project is ChatDev, which heavily focuses on gamification and a vibrant user interface. It is nice to look at, but it only generates the typical demo applications (Pong, Calculator, Paint). The tool Townie tackles full-stack autonomous programming by coupling frontend, backend, and an SQLite database in one codebase. This setup is clever for small demos but also too simplistic for most real-world applications.

Image Modality

Image generation is one of the generation tasks that current models excel at. Generative models are already capable of generating high-quality images on a wide variety of topics and in a diverse range of styles. For example, you can create stock-photo-like images, photorealistic images, or creative images that follow common art styles.

The main challenge with image generation is not the quality of the images but the level of detail. Image generation models handle generic prompts like “a car driving on a road” well, but struggle to generate a specific car driving in a specific location with particular wheels and headlights. For a precise photo or illustration that matches your exact vision, a photographer or artist is still required. This tendency to produce generic images also complicates their use in business scenarios, as businesses often need very specific illustrations and images that depict internal topics not represented in publicly available images on the internet.

★ I am optimistic that more powerful models will increase the precision of image generation and allow for greater control in the future. Additionally, less common contexts, such as company-specific images, can be addressed by fine-tuning models with niche data. Therefore, I believe that the task of image generation will be highly automated very soon which again has huge implications for the media industry, internal communication, and external communication.

Interesting Links

🔗 Midjourney: One of the best end-user tools for generating images from prompts. The tool excels at creating creative images, but the generation process can be quite challenging to control to achieve exactly what you envisioned.

🔗 Photoshop Firefly: With Firefly, Adobe has integrated generative features directly into their products. Users can take advantage of novel features like generative fill, generative expansion, and generative recolor, which improve efficiency and, together with the already existing features, allow for great control over the final image.

🔗 FLUX: State-of-the-art model that can generate very realistic images and is also great at incorporating texts into images. There is both a proprietary and open-source version available.

🔗 ControlNet: A method to get precise outputs from diffusion models by controlling the generation process with sketches, edges, segments, poses, or depth maps.

🔗 MindEye: A Stable Diffusion-based model that generates images. However, it does not take text prompts as input but fMRI brain activity. Thus, it can roughly reconstruct the image a person was looking at just from the recorded brain activity.

Audio Modality

Audio generation is another generative task current models can handle well. Numerous models are capable of producing high-quality sound effects, music, speech, and ambient noises. Even complex audio compositions, such as entire songs with vocals, are beginning to sound good. More and more AI-generated playlists emerge on Spotify and an AI-generated song even made it into the German charts this summer.

There are only a few limitations in audio generation. Similar to image generation, producing audio that perfectly matches specific requirements can be challenging.

★ In my opinion, audio generation is already mature. Humans are generally more focused on vision than hearing, so we are quite forgiving about artifacts or inaccuracies in audio. Because of this high maturity, I believe there will be wide-ranging automation of audio generation throughout industries. Music composed by musicians might be excluded from this, as their craft is often centered around the connection to the artist, but this is not the case for more transactional audio like ambient music, sound effects, or narration.

Interesting Links

🔗 A Survey of AI Music Generation Tools and Models: This paper provides an overview of the timeline of automated music generation while listing concrete tools and the non-neural and neural models used. A list of other audio generation papers including sound effects, background sounds, vocals, and many more is provided here.

🔗 Suno Music: Popular music generation tool.

🔗 Elevenlabs: A suite of generative audio tools including sound effects and speech generation. Another similar project is Meta AudioCraft, which is a code base from Meta for diverse audio tasks like generating music, sound effects, and audio compression.

🔗 Speech Neuroprosthesis: An amazing paper in which the neural activity of a person with ALS was decoded into words and read aloud using a generative model that was trained on the voice of the person. This enabled the participant to communicate verbally again at a rate of approximately 32 words per minute.

Video Modality

Multiple organizations are working on video generation models, but progress lags behind image and audio generation. The most notable demo is OpenAI’s Sora model, which generates impressive videos up to 1 minute in length. Unfortunately, this model is not publicly available yet. The applications that are already available target simpler tasks like generating short stock footage or stylizing existing video content.

There are significant limitations in video generation. Video content requires much more data than images or audio and comes with the challenge of understanding how objects change over time. Because of these challenges, current models can only produce very short clips, about 5 to 60 seconds long. Furthermore, it is hard to control the generation process to achieve a video that exactly matches the requirements. The generated content is also generic, covering common topics and styles often found in stock footage, YouTube videos, or movies, while niche and business-specific topics are hard to generate.

★ I believe it will take a long time until video generation becomes truly useful. Generating longer content requires massive improvements in model performance to handle the data volume and to “understand” long-form content. This involves keeping track of a coherent storyline, shots, characters, objects, movements, and more. While automatically generating a whole Netflix episode might be possible someday, it seems far off. For now, video generation will mostly impact the media industry and external communications in editing tasks like special effects, adapting mouth movements for another language, etc.

Interesting Links

🔗 OpenAI Sora: State-of-the-art video generation model. Great showcase for what is already possible with video generation models, even though the model is not publicly available yet.

🔗 Runway AI: A popular platform to generate videos from scratch and add visual effects to existing videos using generative models. This blog article nicely describes how traditional editing and generative models can work together.

🔗 AI-Shorts-Creator: A good example of an already available video editing helper. Given a long-form video like a podcast, the tool analyzes the video transcript, cuts out the most interesting parts, and crops them into a horizontal mobile format using face detection.

3D Object Modality / World Modality

Current generative models can create 3D objects from text descriptions or images, including simple objects, human avatars, and basic 3D scenes. Another interesting trend that has surfaced lately is so-called “world models”, where digital actions like keyboard presses are recorded alongside the visual changes they cause in a digital world like a computer game. This data is then used to train a generative model which essentially “simulates” the digital world. The resulting models do not produce a new digital world, but an interactive video stream of a digital world.

AI-generated 3D meshes often lack the detail and structure needed for professional-quality work, requiring significant manual refinement. Additionally, users have limited ability to direct the AI toward specific designs, so outputs may not meet precise requirements. While generative models can generate generic, artistic 3D models, they typically do not consider engineering principles and functional requirements, which are essential for engineering applications.

Future developments may enable more detailed and accurate models with greater user control over the creation process. This progress could significantly accelerate 3D content creation in fields like media, entertainment, and gaming. However, in the domain of CAD, integrating engineering principles into machine learning models remains a complex challenge that will require extensive research. The new wave of generic content generation models is unlikely to help with this problem. While the new trend of world models is quite fun, I do not see any real use cases here since these models only produce a simulation of a digital world but no new artifacts that could be further used.

Interesting Links

🔗 Nvidia XCube: State-of-the-art paper that showcases how 3D objects can be generated from a prompt using voxels.

🔗 Technical Overview of 3D Generative Models: Overview of using generative models to generate 3D models. Showcases how 3D meshes for simple objects can be automatically created. It also showcases methods to generate avatars, environments, and animations.

🔗 Simulating Counter-Strike: A project that uses a neural network to simulate the computer game Counter-Strike in real-time. The user can send mouse and keyboard inputs to the game, and the neural network reacts to these inputs by altering the output video stream.

🔗 SHOW-1 South Park: Another interesting project falling in between video generation and world generation. You first set up a digital world by selecting, prompting, and placing characters on a map. Subsequently, the tool will generate a new South Park episode. While the final output is a video, it is quite impressive how the tool simulates the storyline based on the world you created. You can find an example episode generated by the tool here.

🔄 Transformation Tasks

Transformation tasks are very similar to generation tasks. The main difference is that in transformation, we do not generate content from scratch but instead provide all information that should be incorporated into the content upfront. While generation is focused on creating entirely new content, transformation aims at taking information that already exists and rearranging it into a more useful format.

Multiple Modalities

Generative models can restructure information within a modality (e.g. text-to-text) very well. GPT-4 for example can easily rewrite a blog article into another language, a bullet point format, or a poem format. The basic information stays the same, but the format is transformed into something more useful. This also works great for image-to-image or video-to-video use cases, where generative models can be used to change the style of an input image or video. Furthermore, generative models can also transform content between modalities, for example by turning a text-based blog article into an audio-based narration. The table in the Interesting Links section below shows a matrix of tools that transform content either within a modality or between modalities.
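As a minimal sketch of a within-modality (text-to-text) transformation, the snippet below rewrites an existing article into bullet points in another language; note that, unlike generation, all information is supplied upfront and only the representation changes. It again assumes the OpenAI Python SDK with an illustrative model name.

```python
from openai import OpenAI

client = OpenAI()

article = open("blog_article.txt").read()  # the information already exists

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Rewrite the following article as concise German bullet points. "
                   "Do not add or drop information, only change the format:\n\n" + article,
    }],
)

print(response.choices[0].message.content)  # same content, more useful format
```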

There are only a few limitations when restructuring content within a modality. Even complex formats like video can be manipulated by a model to create a modified version with ease (= video-to-video). Converting between modalities, on the other hand, depends on the output modality. Transforming images (image-to-video) or audio files (audio-to-video) into videos only works for short and generic videos, as the video models are not powerful enough yet.

★ In my opinion, transforming content already works really well. I am certain that in the future, we can change the representation of information on the fly. As boundaries between modalities become more blurry we might start to store information using abstract representations and then turn it into a piece of content ad-hoc based on what is needed at the moment. Automating transformation tasks will have a major effect on knowledge management in all industries, as we are now able to transform, streamline, connect, and learn from information that previously has been scattered all over the place.

Interesting Links

| 🔗 | ➜Text | ➜Code | ➜Image | ➜Audio | ➜Video | ➜3D |
|---|---|---|---|---|---|---|
| Text | ChatGPT | Claude Artifacts | Midjourney | Suno | Sora | Magic 3D |
| Code | GitHub Copilot | Code Reviewer | | | | |
| Image | GPT-4V | Sketch2Code | SDEdit | Image2SFX | Leonardo AI | Toon3D |
| Audio | Whisper | | Music2Image | | WZRD AI | Audio2Face |
| Video | Gemini | Recording2Code | | DeepMind V2A | | Neuralangelo |
| 3D | | | | | | Leo AI |

🤏🏻 Summarization Tasks

Summarization tasks are in the same realm as generation and transformation, but they are different enough to warrant a separate category. Summarization means that a complex piece of content is reduced to its essence, and only the most important information is extracted. This reduction of complexity is also a common task in traditional machine learning, where it goes by the name of dimension reduction. In the same way that dimension reduction shrinks the complexity of structured data, generative models can reduce the complexity of content.

Multiple Modalities

Text summarization works well with models like GPT-4 and is very straightforward up to the 128k context size limit. There are also clever approaches for extracting key information from larger documents. Additionally, models like GPT-4V are great at summarizing the contents of an image, while the Gemini models handle video summaries well.

The main limitation of summarization is the amount of data that models can handle in their input layer. As already pointed out, there are clever ways to work around those limits, but they often make the summaries a little less precise.
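As a rough sketch of one such workaround, the map-reduce approach mentioned in the links below first summarizes fixed-size chunks of a long document and then summarizes the partial summaries. The helper assumes the OpenAI Python SDK; the chunk size and model name are illustrative.

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def map_reduce_summary(document: str, chunk_size: int = 8000) -> str:
    # Map: summarize each chunk so it fits into the model's context window.
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    partial = [ask(f"Summarize the key points of this text:\n\n{chunk}") for chunk in chunks]
    # Reduce: condense the partial summaries into one final summary.
    return ask("Combine these partial summaries into one coherent summary:\n\n" + "\n\n".join(partial))
```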

★ In my experience, summarization is pretty mature, and numerous businesses are already rolling out summarization features. As the amount of content on the internet continuously increases while the quality decreases (in large part due to Generative AI and bots), I am certain that there will be a widespread need for summarization. Whether it’s journalism, media, or communication, smart summarization features will likely become essential to tackle information overload.

Interesting Links

🔗 Levels of LLM Summarizations: A great notebook about the different techniques to summarize text using generative models, including Map Reduce, Semantic Clustering, and Sliding Window.

🔗 Kununu Review Summaries: The rating platform Kununu is a great example of the useful application of the mentioned summarization approaches. Reviews that users leave about a company are automatically summarized into a short text shown at the top of the page, so visitors can instantly get a feeling for the company culture without having to scroll through huge amounts of reviews and determine the recurring themes themselves.

🎯 Decision Tasks

Decision tasks differ from the previous categories. While generation, transformation, and summarization are focused on replicating content, decisions are centered around understanding content. As previously mentioned, generative models build an abstract understanding of the content they were trained on. We can utilize this understanding to let generative models make decisions for us.

For example, instead of asking GPT-4 to write a new email, you can provide an existing email and ask if it is spam. The model then produces a fitting answer based on its understanding of the content. The style and format of this answer are not important, as we only care about the decision that is embedded into it. The general method of describing a situation, having the model make a decision, executing the action, and reflecting the outcome back to the model is often referred to as an agentic pattern. If someone mentions generative agents or tool use, this is what they are referring to.
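To make the agentic pattern concrete, below is a minimal sketch using the tool-calling interface of the OpenAI Python SDK. The refund_item function and the tiny order “database” are made up for illustration; the model only decides that a predefined action should run, it never executes anything itself.

```python
import json
from openai import OpenAI

client = OpenAI()

ORDERS = {"A1": {"status": "delivered"}}  # hypothetical order database

def refund_item(order_id: str) -> dict:
    ORDERS[order_id]["status"] = "refunded"
    return {"order_id": order_id, "status": "refunded"}

tools = [{
    "type": "function",
    "function": {
        "name": "refund_item",
        "description": "Refund a delivered order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

messages = [{"role": "user", "content": "Please refund my order A1, it arrived broken."}]

# 1) Describe the situation and let the model decide on an action.
response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
message = response.choices[0].message

if message.tool_calls:
    call = message.tool_calls[0]
    # 2) Execute the chosen action in the (here: simulated) system ...
    result = refund_item(**json.loads(call.function.arguments))
    # 3) ... and reflect the outcome back to the model for a final answer.
    messages += [message, {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}]
    final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    print(final.choices[0].message.content)
```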

Multiple Modalities

Generative models can make sensible decisions when they are provided with text descriptions of simple and common situations. This text-based decision-making is often used in agents, where a generative model decides to execute pieces of code or call an API to perform actions within a digital system. Decisions can also be made based on images and videos, where generative models can conclude whether an object is present, which objects are in front of others, and what could happen next.

Generative decision-making is only effective in simple, well-documented situations. In unfamiliar contexts, models make nonsensical decisions due to uncertainty about probable outcomes. Additionally, generative models are bound to content, so they do not decide based on the real situation itself but only on content depicting the situation. Many situations are hard or impossible to accurately describe with content like text, images, and videos. The thought process behind decisions is often based on gut feelings and experience—factors that are impossible to document. Both of these hurdles significantly limit the use of generative decision-making for non-obvious decisions.

★ In my experience, making decisions with generative models only works for generic and easily definable situations. The fundamental concept is quite powerful, but to fully realize its potential, the understanding of the models has to be expanded by fine-tuning them with domain knowledge and internal company data. However, the prerequisite for fine-tuning a generative model to make specific decisions is that the situation and courses of action are fully representable with digital content. This is very hard for most decisions as implicit expertise, gut feelings, and knowledge are hard to write down. Thus, I believe that fine-tuning will slightly increase the possibilities for generative decision-making in the future, but it will never factor in implicit domain knowledge, expertise, and gut feelings.

Interesting Links

🔗 OpenAI o1: As you might have guessed, I am not a fan of the marketing around o1, as it focuses on terms like “reasoning,” “mind,” and “thinking.” That said, the model does manage to solve common problems in math, physics, biology, and chemistry by exploring potential solutions step by step until it reaches a final decision.

🔗 LLMs and the consequences of actions: This paper highlights how GPT-4 can understand the consequences of an action in a simple situation with clearly spelled out rules, like a text-based computer game. However, the model struggles with complex scenarios requiring common sense or multi-step reasoning. Thus, it is unreliable for complex real-world situations where the rules are not explicitly written down.

🔗 Swarm AI: A framework for creating LLM-based agents that make decisions and take actions. I picked this framework because it provides concrete multi-agent examples. While examples like “multi-agent customer service assistant” might suggest that generative models are making real-world customer service judgment calls, their actual capabilities are much simpler. In the example implementation, a programmer simply defines a database that stores orders (=situation) and predefined functions like “order Item” and “refund item” (=actions) that manipulate the data. When a user request comes in, the agents decide which function to execute to reflect the user’s wishes in the database. This highlights how current agents are useful for executing pre-programmed actions in simple environments, but far away from real-world decision-making.

⚖️ Comparison Tasks

Comparison tasks are a sub-category of decision tasks. A generative model is given two situations, contexts, or documents and then uses its decision capabilities to determine similarities and differences.

Multiple Modalities

Comparison is a sub-area of decision tasks that already works well. For example, you can provide two texts, images, or videos to a generative model and it outlines similarities and differences. Furthermore, real-world applications of this capability already exist. The startup Harvey AI for instance uses generative models to compare contractual clauses to state law and decide whether the clauses are permitted in a jurisdiction.

The prerequisite for useful results is that the comparison is based on concrete pieces of content. The more you use generative models to compare two abstract situations instead of concrete content, the more hallucinations will be generated.

★ As established, using generative models to compare content works well and already brings value in a few niche scenarios. In the future, fine-tuning will likely lead to more specialized models that understand the content of specific industries. This could help to extend the narrow application area of comparisons further, allowing generative models to impact quality management, knowledge management, purchasing, and compliance.

Interesting Links

🔗 LLM Comparative Assessment: A paper demonstrating that LLMs can compare pairs of text and decide which text has the better quality.

🔗 VisDiff: This paper uses LLMs to generate natural language descriptions to highlight the differences between two sets of images.

🔗 Harvey AI: A tool that leverages generative models to compare contract clauses with relevant state laws, ensuring legal compliance across multiple jurisdictions.

🔎 Retrieval Tasks

Retrieval tasks focus on finding specific information within a large pile of content. While simple information retrieval approaches like keyword searches rely on exact text matches, modern methods use embedding models to find content based on meaning. Retrieval is often combined with generation to create Retrieval-Augmented Generation (RAG) systems, where relevant content is first searched and then an answer is generated based on the found content.
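Below is a rough sketch of the RAG idea: a handful of documents are embedded once, the best match for a query is retrieved via cosine similarity, and an answer is then generated from that match. It assumes the OpenAI Python SDK plus numpy; the model names and the tiny in-memory document list are illustrative.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

documents = [
    "Our return policy allows refunds within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm CET.",
    "Premium subscriptions include priority email support.",
]

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

doc_vectors = embed(documents)  # index the content collection once
query = "When can I get my money back?"
query_vector = embed([query])[0]

# Retrieve: cosine similarity between the query and every document.
scores = (doc_vectors @ query_vector) / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
best_doc = documents[int(np.argmax(scores))]

# Generate: answer the question based only on the retrieved content.
answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": f"Answer using only this context:\n{best_doc}\n\nQuestion: {query}"}],
)
print(answer.choices[0].message.content)
```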

Multiple Modalities

Retrieval works well across multiple modalities. Embedding models can transform text, images, videos, audio, and even 3D content into abstract vector representations, enabling advanced searches that go beyond simple keyword matching. For instance, text can be semantically searched, images can be retrieved based on visual similarity, and videos can be found by thematic content. Thus, we can use these retrieval approaches to find information quickly regardless of the content type.

While current retrieval methods excel in “needle-in-a-haystack” scenarios, where a specific piece of information needs to be found in a vast collection of content, they struggle with queries that require a complete understanding of the content collection. For instance, retrieving or counting all pieces of content related to a specific topic is challenging because, unlike keyword or metadata-based retrieval, there are no clear rules to define category boundaries. Additionally, embedding models struggle to correctly distinguish domain-specific information that uses unique jargon, which often reduces retrieval performance for internal company data.

★ The future of retrieval tasks looks promising, as fine-tuning will likely lead to industry-specific models that can also accurately retrieve domain-specific information. Improvements in multi-modal embeddings will also allow seamless searches across different content types, which will broaden the applications of retrieval tasks. I am certain that these retrieval methods already are or soon will be widely adopted in areas like data management, knowledge management, and recommendation systems.

Interesting Links

🔗 OpenAI File Search: OpenAI’s assistants playground allows users to upload documents and turn on file search. This capability will then search and retrieve relevant information from the documents when needed to answer user questions.

🔗 Perplexity AI: A popular search engine that leverages generative models to retrieve information from the internet.

🔗 Microsoft Recall: An experimental Windows 11 feature that periodically captures the screen of the user. The user can then use natural language queries to search for any content that was previously opened or interacted with on the computer.

🔗 RAG Approaches: A great technical overview of all the different ways retrieval and subsequent generation can be implemented.

👾 Digital Navigation Tasks

The capabilities of generative models to understand visual content and decide on actions can be combined to autonomously navigate digital systems. This process, known as autonomous computer control, involves providing images, video streams, or metadata of a user interface to the model. Based on this visual data, the model executes mouse or keyboard actions to navigate the interface until a task is completed.
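The basic control loop can be sketched as follows. This is a heavily simplified illustration, not a production approach: it assumes the OpenAI Python SDK for the vision model and the pyautogui library for screenshots and mouse clicks, and it asks the model to answer in a small JSON action format invented for this example.

```python
import base64, io, json
import pyautogui
from openai import OpenAI

client = OpenAI()

def screen_as_data_url() -> str:
    image = pyautogui.screenshot()  # capture the current screen
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buffer.getvalue()).decode()

task = "Open the settings menu."
for _ in range(5):  # a few decision steps, then stop
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Task: {task} Reply only with JSON like "
                                         '{"action": "click", "x": 0, "y": 0} or {"action": "done"}.'},
                {"type": "image_url", "image_url": {"url": screen_as_data_url()}},
            ],
        }],
    )
    step = json.loads(response.choices[0].message.content)  # assumes the model returns clean JSON
    if step["action"] == "done":
        break
    pyautogui.click(step["x"], step["y"])  # execute the decided action
```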

Multiple Modalities

Early demonstrations of autonomous computer control show that generative models can navigate websites, click buttons, and enter data autonomously.

The technology operates slowly and only works for simple interfaces and straightforward tasks. A major limitation in this regard is that the vision models misinterpret subtle details and complexities of the interfaces. Additionally, the models struggle with staying on track when executing complex multi-step tasks like booking a flight from start to finish.

★ Using generative models to navigate digital systems is a new and developing field. Many challenges need to be addressed to apply this approach to more complex interfaces and tasks. Identifying real-world use cases may be difficult, as it requires tasks that are repetitive enough to benefit from this type of automation but not so repetitive that creating dedicated automation scripts or APIs would be more efficient.

Interesting Links

🔗 Claude Computer Use: Anthropic is one of the first model providers to offer an API that lets their Claude 3.5 Sonnet model control computer systems.

🔗 Cradle: This paper shows how GPT-4V can interpret screenshots from video games or productivity apps and generate code snippets to perform actions like keyboard presses and mouse movements. Using this setup, the model was able to create a Twitter post and switch weapons in the video game Red Dead Redemption.

🔗 SIMA: A specialized generative model that can execute instructions within video games. It can handle simple actions like “Chop down a tree” on its own, as long as the action can be completed in about 10 seconds.

🔗 Agent Q: This paper explores how generative models can make restaurant bookings on the OpenTable website.

🦾 Physical Navigation Tasks

Just as generative models enable autonomous computer control, their vision and decision-making capabilities can also be used to navigate physical spaces. This approach, known as embodiment, involves equipping models with robotic components that allow them to move around in the real world.

Multiple Modalities

Embodiment utilizes various types of data, including images, video feeds, verbal instructions from a human instructor, and feedback from the robotic components themselves. Current generative models can use this input to react to human instructions, analyze the surroundings, and trigger movement programs.

While generative models can provide high-level navigation decisions and descriptions of the surroundings, they do not address the complex challenges of robotics, like precise movement, dexterity, and real-time responsiveness. These problems are still best tackled using specialized deep learning approaches.

★ The integration of generative models like GPTs into robotics is a relatively new development, building on years of existing machine-learning techniques in the field. The advanced world knowledge and communication capabilities of these models can enhance interactions between robots and humans, as well as improve basic orientation and environmental interpretation. Nevertheless, this is just a tiny building block of embodiment, as the real challenge is replicating the complexity of human motion and engineering human-like dexterity, which has historically proven to be an extremely difficult endeavor. The new wave of models will not provide a huge benefit here.

Interesting Links

🔗 Figure 01: A state-of-the-art humanoid robot that uses generative models to communicate with humans and describe its surroundings. Its movements are controlled by a separate set of neural networks.

🔗 OK Robot: Designed for pick-and-drop tasks in home environments, this robot leverages a Vision-Language Model to understand and identify objects that users refer to in a room.

🎭 Behavior Simulation Tasks

An interesting side effect of training generative models on a lot of human-made content is that they also pick up on behavioral patterns of humans, or at least the descriptions of those. This allows us to ask the model to simulate how a person or group of people would likely behave in a scenario. By giving the generative model a character in the prompt and then iteratively asking it what it wants to do next, you get plausible behaviors.
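A minimal sketch of this prompting pattern, again assuming the OpenAI Python SDK and an invented character description:

```python
from openai import OpenAI

client = OpenAI()

# The character is defined entirely in the prompt; the model then replays
# behavioral patterns it has picked up from human-made training content.
messages = [{
    "role": "system",
    "content": "You are Ada, a curious 19th-century mathematician living in a small village. "
               "Stay in character and describe in one or two sentences only what Ada does next.",
}]

for step in range(3):
    messages.append({"role": "user", "content": "One hour passes. What does Ada do next?"})
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    action = response.choices[0].message.content
    messages.append({"role": "assistant", "content": action})  # keep the simulated behavior consistent
    print(f"Step {step + 1}: {action}")
```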

Multiple Modalities

Letting a generative model simulate behavior works well with text, where the models can describe how a character would behave and what it would do next. It also works with visual content like images or videos, where you can ask the models to describe how the people shown would likely behave.

The characters and situations that generative models can simulate still lack depth and often seem a little cliché. Additionally, letting generative models pretend to be human has, of course, nothing to do with real behavioral patterns, so while it works well in entertainment scenarios, it cannot be used as a replacement for real human participants in fields like psychology.

★ Overall, simulation of behavior works well and is already used in real products. Many of the most valuable Generative AI startups, like Character AI, have specialized in providing digital companions that essentially simulate what it would be like to chat with a specific character. Thus, generative models already have a huge impact on the entertainment and gaming industry, and this impact will continue to strengthen in the future.

Interesting Links

🔗 Interactive Simulacra of Human Behavior: In this paper, a sandbox environment is populated with twenty-five generative agents, each given a specific role. Generative models are then called in a loop to decide how the agents should behave and what they do next. The agents manage to plan their days, share news, form relationships, and even coordinate group activities.

🔗 AIs Determine Who is the Human: A fantastic video where four different generative models (GPT-4, Claude 3 Opus, Llama 3, Gemini Pro) act as Aristotle, Mozart, Da Vinci, and Cleopatra in a virtual world. There is also one human taking part in the conversation, acting as Genghis Khan. The models then have a conversation and ask each other questions to find out who is the human among them.

🔗 Deaddit: A clone of the popular social network Reddit, but it’s fully populated by generative models that talk to themselves in endless bot discussions. Quite an impressive demo, and I encourage you to read the posts and comments as you get a good feeling for how the models can act similarly to human users, but the content they produce is very cliché.

🔗 Character AI: A platform that enables users to create and interact with characters that are powered by generative models. These digital companions can engage in conversations and simulate distinct personalities.

Where to go from here

In this article, I outlined the usefulness of generative models by defining nine tasks. I deliberately chose to categorize the potentials of Generative AI into “tasks” as they can both refer to technical capabilities and business problems. For example, summarization does not only stand for technical approaches like Map-Reduce but can also describe a bundle of similar business processes such as summarizing supplier offers, legal texts, or market trends.

Applying the outlined tasks on a large scale allows us to brainstorm how generative models will impact entire industries. The task of generation, for example, makes it obvious that generative models will have a large impact on the media industry by automating the creation of text, images, and videos. Comparison tasks, on the other hand, are prevalent in law and will likely be automated by letting generative models check contracts against legal standards.

Within a company, the nine tasks help to spot departments and processes that benefit from Generative AI. Many purchasing processes for example consist of retrieving, summarizing, and comparing information to break down offers from suppliers into key facts. Thus, purchasing departments naturally provide many opportunities for the application of Generative AI. Same with customer support departments which heavily benefit from the retrieval and answer generation capabilities of generative models.

Nevertheless, Generative AI is not just about these obvious use cases. As the technology is versatile, it can be applied to nearly all digital work on an individual level. For instance, I previously worked in IT administration, where I checked server setups against guidelines (= comparison task), extracted key points from customer requirements (= summarization task), and combed through old logs and incident tickets to troubleshoot problems (= retrieval task). So even for non-obvious fields, one can use the nine tasks to identify valuable use cases for Generative AI.

That is why I want to encourage you to reflect on which tasks in your own environment fall into these nine categories. By combining a deep business understanding with the technical possibilities of the generative models outlined in this article, I am certain that you can find valuable Generative AI use cases in most processes, roles, departments, or industries.