Whether it’s hands with seven fingers or extra-long palms, AI just can’t seem to get it right.
This year, artificial intelligence has won art competitions, dominated the internet, passed law exams and proved that it is the technology of the future… but it still can’t accurately make a hand.
Despite all of the work that has gone into AI image generators, hands have become their nemesis, exposing weaknesses in the models.
While this has been noticeable since the rise of Dall-E 2 and the competitors that followed it, the problem became the centre of attention with a Twitter user’s collection of images created by the AI generator Midjourney.
Midjourney is getting crazy powerful—none of these are real photos, and none of the people in them exist. pic.twitter.com/XXV6RUrrAv
— Miles (@mileszim) January 13, 2023
At a glance, they’re impressive, depicting a group of realistic-looking people at a party. And yet, in one photo someone has three hands, in another a person has seven fingers and an extremely long palm, and a third image shows someone with a finger bending backwards through a phone.
So why is such a small detail throwing a spanner in the works? “These are 2D image generators that have absolutely no concept of the three-dimensional geometry of something like a hand,” says Prof Peter Bentley, a computer scientist and author based at University College London.
“They’ve got the hang of the general idea of a hand: it has a palm, fingers and nails, but none of these models actually understand what the full thing is.”
If you’re just trying to get a very generic image of a hand, this wouldn’t be too much of a problem. The issue arises as soon as you give the models context. If a model can’t understand the 3D nature of a hand or the context of a situation, it will struggle to recreate it accurately.
For example, a hand holding an object like a knife or camera, or someone making a symbol with their hand is instantly going to confuse a model that doesn’t have the 3D understanding of a hand or the geometric shape of the object it is holding.
“I asked Dall-E to show a photograph of two hands with their fingers interlaced and I got some bizarre results. It showed me two wrists and a ball of fingers for one of them,” says Bentley.
“But you can understand why. It doesn’t really know what it is doing, and it is just combining all these images that it has seen to meet your textual description as best as it can.”
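Curious readers can run a version of Bentley’s experiment themselves. The sketch below is only an illustration of how such a request might be made through OpenAI’s Python client; the model identifier, image size and reliance on an OPENAI_API_KEY environment variable are assumptions rather than details from the article, and any given run may or may not produce the “ball of fingers” he describes.

```python
# A minimal sketch of reproducing the interlaced-fingers test with OpenAI's
# image API (Python SDK v1+). Model name and size are assumed for illustration.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

response = client.images.generate(
    model="dall-e-2",  # assumed model identifier
    prompt="a photograph of two hands with their fingers interlaced",
    n=1,               # one image is enough to inspect the fingers
    size="1024x1024",
)

# The API returns a temporary URL pointing to the generated image.
print(response.data[0].url)
```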
However, it isn’t just Dall-E 2 that struggles with this. Other popular image models like Midjourney and Stable Diffusion have been hit with the same impossible task of making a normal-looking hand.
Taking a closer look at the picture
While it can often feel like the images these models are creating are near-perfect, they are actually often very flawed. The more you look, the more you are likely to spot a host of inaccurate details.
Part of this comes down to the user and the strength of the prompt that they use, with some people getting near flawless images from their detailed prompts. But realistically, this is mostly a problem within the models themselves.
“When you really look closely, there’s a telltale signal somewhere that the laws of physics are being broken somehow. Maybe there’s an arm through someone’s stomach, or an octopus with too many tentacles, or a tree that is floating off the ground,” says Bentley.
“Because they have just been fed lots and lots of examples of things, it is trying to piece it all together as best as it can.”
This can sometimes produce bizarre outputs, often with a dreamlike feel reminiscent of a Salvador Dalí painting.
“These models are divorced from reality, they don’t have any context and they don’t actually have any knowledge or ability to consider the context of an image. They just sort of combine all of the junk that we’ve given it.”
The major hurdle for AI images
So these models are good, great even… but they are still a long way from creating perfect images. What would have to happen to resolve this problem and finally create a hand that doesn’t look like it was inspired by David Cronenberg?
“This could all change in the future. These networks are slowly being trained on 3D geometries so they can understand the shape behind images. This will give us a more coherent image, even with complicated prompts,” says Bentley.
“Getting enough 3D design data could take time. At the moment, we’re getting the easy results in the form of these 2D images. It is easy to trawl the internet and get a million images without the context.”
This is something OpenAI has started to work on with its Point-E technology, a system that can generate 3D models from text prompts. While it is already usable by the public, it is still a long way from producing accurate results.
However, when results do come, they could lead to highly detailed 3D renderings and even digital worlds. As Bentley explains: “A lot of money is going into things like the metaverse with an interest in 3D models. So it is quite possible with these combined budgets that we could see increasingly impressive 3D models created by AI.”
For now, AI imagery is largely confined to two dimensions, but the leap to convincing 3D, hands and all, may be only a matter of time and training data.
About our expert, Prof Peter Bentley
Peter is a computer scientist and author who is based at University College London. He is the author of books including 10 Short Lessons in Artificial Intelligence and Robotics and Digital Biology.
Read more:
- ChatGPT: Everything you need to know about OpenAI’s GPT-3 tool
- Dall-E mini: Creator explains blurred faces, going viral and the future of the project
- We badly described cartoon characters to an AI. Here’s what it drew