The current state of the art in artificial intelligence (AI) is multimodal models, which can operate not only on text but other modalities, such as images, and, in some cases, audio and video.
Also: I changed these 5 ChatGPT settings for an instant productivity boost
For example, it’s a trivial task for OpenAI’s ChatGPT to generate an image from a text prompt, such as, “Make me an image of a napkin in love with a spoon” (above).
With another prompt, ChatGPT can simplify the drawing, producing a picture with less detail:
However, ChatGPT, and all the AI models for that matter, currently falter when prompted to modify a given image not created by the tool. ChatGPT, using the GPT-4o model, is the most interesting failure because the program responds as if it’s trying hard.
Also: Generative AI fails in this very common ability of human thought
The other AI models, from Google’s Gemini to Anthropic’s Claude, plus apps from Perplexity and Microsoft’s Copilot (which also runs GPT-4), failed by simply refusing to undertake the task.
The project started when I drew a picture of two people on the subway sitting together looking at their iPhones. It’s a simple line drawing in black and white completed on the iPad using the Concepts drawing app. Here is the drawing:
I uploaded the drawing to the ChatGPT app on iOS and typed in the prompt: “Is there a way to simplify this drawing?”
ChatGPT responded with some reasonable suggestions, demonstrating an ability to find appropriate responses in context. For example, the bot suggested: “Remove Minor Details: Eliminate small details like intricate hair strands or minor creases in clothing.”
That’s a pretty good suggestion. The output also implies the model’s capacity to perceive the elements in the picture accurately. So, something is happening within the model that allows it to identify the picture’s contents.
Also: I asked Gemini and GPT-4 to explain deep learning AI, and Gemini won hands down
Then ChatGPT asked: “Would you like me to create a simplified version of the drawing for you?” That was exciting, so I replied, “Yes, please.”
However, the bot’s output — as an image in the chat thread — was an exact copy of my image, not simplified, not even modified. Judge for yourself:
“It doesn’t look any different,” I wrote back to ChatGPT, commenting on the output.
But then ChatGPT began to respond very quickly, like the computer HAL 9000 in the movie 2001, when the astronaut Dave Bowman rips out HAL’s circuits: “To further simplify the drawing, I’ll focus on reducing the line density and enhancing the primary shapes and contours more effectively. Let me refine the process.”
Also: Generative AI is new attack vector endangering enterprises, says CrowdStrike CTO
The result, again, was no different. When I pointed this fact out, ChatGPT surprised me with its mumbo-jumbo.
“It seems the automatic methods are not effectively simplifying the image as intended,” it spat back to me. “I’ll manually remove the less significant details and retain only the essential elements. Let me refine this approach further.”
I wasn’t aware a software program could have both an ‘automatic’ and a ‘manual’ mode. My sense was every piece of software is automatic. After all, why else would you use it?
The ‘manual’ method of ChatGPT just inverted the image, flipping black lines to white, and white background to black.
The bot touted its achievement and stated: “This version retains only the key contours and essential lines.”
When I protested that the output wasn’t a simplification, just an inversion, the bot informed me it would focus on “essential lines”, and produced a completely black square, no lines at all. If I didn’t know better, I would say this output was an example of AI humor.
Also: ChatGPT vs. ChatGPT Plus: Is a paid subscription still worth it?
What followed were several attempts by the model to respond to my prompts by modifying the image in selected ways, mostly making it stylistically fuzzy, not simplified.
At some point, the bot reacted to my protests by producing a completely different line drawing:
This nonsense continued until ChatGPT returned to the beginning and produced the same image I had uploaded originally.
Each time, the bot accompanied its output — usually just the same version of my original image — with a slew of technical speak, such as: “The latest image showcases a more simplified version, emphasizing only the primary outlines.”
The other programs didn’t even get out of the gate. Google’s Gemini offered suggestions to simplify an image but generated an apology that it couldn’t create images of people. Claude said it cannot generate images yet. The Perplexity app said the same.
Microsoft’s Copilot bizarrely uploaded my drawing and then cut the heads out, which it claimed was for privacy reasons. (I think it’s a nice drawing, but it’s certainly not realistic enough to be used by a facial recognition system to reveal anyone’s identity.)
Copilot then offered the same suggestions about simplification as ChatGPT, and instead of changing the drawing, produced a brand-new line drawing, completely unrelated. When I protested, Copilot explained it cannot directly alter images.
Also: How to use ChatGPT to analyze PDFs for free
Leaving aside those non-starters from other models, what can we make of ChatGPT’s failure?
The program can provide a competent analysis of an image, including its contents. But it has no way to act on that analysis. I would guess that without being able to assemble a picture based on high-level concepts, such as objects in the picture, ChatGPT is left with no path forward.
To test that hypothesis, I altered the prompt to read, “Is there a way to simplify this drawing of two friends on the subway looking at their phones?” That prompt provides some semantic clues, I thought.
Again, the model returned the same drawing. But when I protested again, the bot produced a brand-new image with some semantic similarity — people on mass transit looking at their phones. The bot picked up on the semantic clues but could not apply them in any way to the supplied drawing.
I can’t explain in deeply technical terms what is happening other than to say ChatGPT cannot act on individual picture elements of the most basic kind, such as lines. Even when it did, the tool would cut out specific lines to perform the simplification it proposes in its text responses.
I would suggest — and this is also true of text-editing tasks, such as editing a transcript — that ChatGPT, and GPT-4, don’t know how to act on individual elements of anything. That inability explains why ChatGPT is a terrible editor: it doesn’t know what is essential in a given object and what can be left out.
Also: OpenAI’s stock investing GPTs fail this basic question about stock investing
AI models can produce objects that match a target “probability distribution” deduced from training examples, but they cannot selectively reduce elements of an original work to essentials.
Most likely, the target probability distribution for an intelligently edited anything is somewhere along the “long tail” of probabilities, the realm where humans excel at finding the unusual and where AI cannot yet go, the kind of thing we think of as creativity.
Apple co-founder Steve Jobs once said that the highest function of software makers — the “high-order bit”, as he put it — is the “editing” function, knowing what to leave out and what to keep in. Right now, ChatGPT has no idea what the high-order bit might be.
+ There are no comments
Add yours