Right now, AIs can tell stories. Those stories can sometimes even be passable.
To pick on an example I heard of recently, Eleanor Konik of Obsidian Iceberg has used them to write short vignettes to read to her baby. I've seen some of them myself, and - in addition to being very short - they're at least as good as the text in many commercial picture books.
But they're also predictable. Nothing surprising happens. There are no particularly adept turns of phrase. They can be completely summarized by the prompt Eleanor used, combined with background knowledge like "what does a plumber actually do?" In other words, they're exactly what you'd expect from the sort of AIs we have today: large language models.
Other people have used AIs to go farther. For instance, someone set an AI to write an (unauthorized) sequel to the A Song of Ice and Fire (aka "Game of Thrones") series. It quickly jumped the shark, losing track of the plot and veering back and forth between writing like it was a story and writing like it was a review article about "Game of Thrones." I haven't read far enough to judge the overall storyline for myself - but from the bit I've read, I don't want to. It reads like a mediocre fanfic. And if I want to read something of that quality, I can read an actual mediocre fanfic written by a human.
I can imagine an AI optimist saying right now that all this will get better. And to their credit, all this has gotten better - today's AIs are much better at everything associated with writing than the AIs of three years ago. Context windows (the length of text you can prompt an AI with and have it take into account in its response) have grown even in the last couple of months (thanks to more memory and processing power), to the point where they can hold an entire short novel. Naively, someone might imagine that AIs will just keep getting better until they're writing much better than humans. A recent article about AIs apparently succeeding at persuasive writing lends credence to this. Will AIs just keep getting better and better?
I do think there's a barrier inherent in what it means to be a large language model.
To develop a large language model, you need to train it on reams of "training data": many different sorts of text. The model mechanically summarizes that training data and weights the different pieces of the summary. Then, when it's running and you pass it a prompt, it responds by generating output that wouldn't be out of place in that same training data.
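To make that concrete, here's a minimal sketch of the mechanical core of that process: at each step, the model assigns a probability to every possible next word (token) based on the patterns it summarized from its training data, and one token is sampled from that distribution. The tokens and probabilities below are made up purely for illustration.

```python
import random

# Toy next-token distribution a model might assign after a prompt like
# "One rainy-day activity in Austin is visiting the ..."
# (made-up tokens and numbers; a real model scores tens of thousands of tokens,
# and the remaining probability mass over other tokens is omitted here)
next_token_probs = {
    "museum": 0.34,
    "Capitol": 0.22,
    "aquarium": 0.18,
    "library": 0.14,
    "dragon": 0.02,  # unlikely, but never strictly impossible
}

def sample_next_token(probs: dict[str, float]) -> str:
    """Pick one token at random, weighted by the model's probabilities."""
    tokens = list(probs)
    weights = list(probs.values())
    return random.choices(tokens, weights=weights, k=1)[0]

print(sample_next_token(next_token_probs))
```

Run this a few times and you mostly get the unremarkable, high-probability continuations - which is the point.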
For example, when my friend and I were recently visiting Austin for the eclipse, he asked ChatGPT for interesting things to do in Austin in the rain. ChatGPT replied with a few paragraphs that wouldn't be at all out of place on a mediocre travel webzine. That's because it had been trained on travel webzines, among many other sorts of text, and it determined this query lined up with travel webzines more than with anything else. If he'd asked something ChatGPT determined was most associated with Game of Thrones fanfic, it would've answered with something that read like a Game of Thrones fanfic. Or, if he'd wanted things to do in Austin in a style other than a travel webzine, he could've asked for that style specifically, and ChatGPT would've given it, because that request would've been associated with something other than travel webzines.
But, like Eleanor's bedtime stories, you won't get any surprises you haven't specifically asked for. If you ask an LLM for a Game of Thrones fanfic, you'll get something with the events that typically happen in Game of Thrones fanfic. You won't get other sorts of events, because they aren't found in the training data. If you ask it for an original fantasy story, you'll get something with all the tropes commonly found in fantasy stories. All this is inherent in LLMs.
In other words, you won't get creativity.
But, my imaginary AI fan might say, we can program AIs for creativity! Right now, we can turn up the temperature dial on ChatGPT and have it give wilder, less predictable answers. It'll hallucinate more, but isn't that sort of what we want in storytelling?
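For context, the "temperature dial" is just a number applied to the model's raw scores before sampling: a high temperature flattens the distribution so unlikely words get picked more often, while a temperature near zero makes the model almost always pick its single most likely word. A toy sketch, with made-up scores:

```python
import math

def apply_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Turn raw model scores (logits) into probabilities, scaled by temperature."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    max_s = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - max_s) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Made-up scores for the word after "The knight drew his ..."
logits = {"sword": 4.0, "dagger": 3.0, "banjo": 0.5}

print(apply_temperature(logits, temperature=0.2))  # nearly always "sword"
print(apply_temperature(logits, temperature=2.0))  # "banjo" becomes a live option
```

Higher temperature buys you surprise, but only the kind of surprise that comes from rolling the dice harder.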
As the Oxford Handbook of Cognitive Psychology puts it, two things make an idea creative: "originality or novelty" and "adaptiveness or utility." A creative idea isn't just something original; it's something original that is also useful in context.
The thing is, AI hallucinations may be original, but they're almost never useful in context. That's because they're random - random draws from a weighted distribution, but still random. This random originality is like the free-association ideas an author might generate in his initial brainstorming, or even while dreaming. That's part of creativity; it's original. But another part, just as important, is knowing which ideas are good. A good author won't just write up his initial brainstorming as-is; he'll consider (perhaps quickly, but still consider) which ideas are good, which have "adaptiveness or utility," and how to improve them still further.
An AI can't do this. It can determine how these ideas align with what's common in the genre, which is indeed one factor an author considers. But the author will also be weighing myriad other factors about what makes the story good and how readers will receive it. ChatGPT doesn't model these. An AI can't, because good stories depend on intangible qualities of the human mind and imagination that we don't know how to put into words ourselves beyond simply writing the stories. The imagined super-AIs of science fiction books can model a whole human brain and determine through experiment how it would appreciate stories. LLMs don't, and won't be able to, because that's not how they're constructed. But that's what it would take to get from this weighted randomness to actual storytelling creativity.
But (an AI fan might reply), even the mediocre stories an AI does write do seem to show some creativity. Even if they're mediocre fanfic, that's still something. So where does that creativity come from? Why can't it improve in the future?
Part of this creativity, of course, comes from the human writing the prompt. If you ask it for a story about a plumber fixing a pipe and explaining his job to a watching kid, you'll get that, and the concept (and sometimes other things you ask for) will be thanks to you. If you add in the creative touch that the plumber bonds with the kid over a shared love of motorcycles, and peppers his plumbing descriptions with motorcycle analogies, ChatGPT will give you that creative concept, and the creativity will have come from you. (Probably it'll give you that; I haven't tested. But if it won't, I'm sure some present or future AI will.) We can perhaps analogize this to the famous selfie produced by a monkey operating a trailcam. The monkey appears to be taking a photo - but the monkey doesn't understand cameras. Most of the creativity comes from the person who put the camera there and the editing he did to the photo (or photos) after the fact.
But another part of the creativity comes from the AI's training data. When the AI is writing the events that normally happen in Game of Thrones fanfic, or the sort of words that normally go in travel webzines, those contain some creativity. It mechanically pulls those words from its training data, so we can say it pulls its creativity from the fanfic and webzines in its training data too. This is similar to what a proverbial hack writer does when he's trying to write a typical genre science fiction story without any original spark. When he applies each genre trope, he's invoking, in some sense, the creativity of all the previous writers in his genre. If the story has any spirit or creativity, it's theirs, not his. Of course, an actual human author can't avoid adding some original touch - but the AI can only add pure randomness.
That said, AI storytelling will get better in the future.
Part of that might come from better training data. It won't just be more data: nearly everything written by humans - from web forum posts to the Google Books crawl - has already been fed into AIs. However, the training data is weighted: the AI is typically set up to weight a New York Times article more heavily than a YouTube comment section. I expect this weighting to improve, and perhaps even to improve separately for different sorts of uses. Perhaps there won't be that many different weighting schemes; training runs are extremely expensive. But the weighting will probably be better than it is today.
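To be concrete about what "weighting" can mean here, one common approach is simply to sample training examples from trusted sources more often (or count them more heavily in the loss). A minimal sketch, with source names and weights made up for illustration:

```python
import random

# Hypothetical per-source sampling weights chosen by a model's developers
source_weights = {
    "newspaper_archive": 5.0,
    "published_books": 4.0,
    "web_forums": 1.0,
    "video_comments": 0.2,
}

def pick_training_source(weights: dict[str, float]) -> str:
    """Choose which corpus the next training example is drawn from."""
    sources = list(weights)
    return random.choices(sources, weights=list(weights.values()), k=1)[0]

# Over many draws, heavily weighted sources dominate the training mix.
draws = [pick_training_source(source_weights) for _ in range(10_000)]
print({source: draws.count(source) for source in source_weights})
```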
There will also be other improvements, such as longer context windows and better meta-prompting systems that help the AI write more consistent stories. I do expect improvements - nowhere near the AIs we see in science fiction stories, but significant improvements.
However, without understanding the human mind and imagination, we can't actually articulate what makes a story good. We humans understand it intuitively. A kid can recognize a good story as good without being able to analyze it and say what elements make it good. AIs can't arrive at that recognition by chance or experiment. Even if they analyzed every story in their training data and how humans rated each one, AIs would only be able to mix and match existing tropes - and that would just be recycling the existing creativity of their training data.
This limit comes from the inherent nature of large language models. A large language model looks at its training data, plus whatever human feedback (RLHF) it was given in training, and nothing else. It can draw connections between bits of training data and pull together different elements, which is how LLMs can answer so many different prompts. This can look like creativity, because it's drawing on the creativity of so many different authors in the training data. But things that aren't in the training data don't get considered.
All the recent advances in AI have come from large language models. They've forced us to ask many fascinating questions, including questions about what it means to be intelligent and what it means to be creative. They can do many useful things. But I believe large language models have to pull whatever creativity they have from elsewhere. In themselves, they can't be creative.
Perhaps someday we'll have an AI designed in a different way, one that can be creative. But that would mean abandoning most of the recent work in the field, which has been done on large language models.
And in the meantime, AIs can recycle the creativity that's already in their training data, and make us think harder about how creativity interacts with story itself.
I think you're right that a big missing piece is for an AI to understand what might be interesting to a human. It doesn't really have that. But I think it does have everything else. You mention that AIs can't write anything creative - all they do is combine things they've been trained on, potentially with a randomness factor. But I'd argue that's basically exactly what humans do. They have their experience to draw from, and they combine novel things together because of their novel set of experiences. The difference is that (some) humans know which of the random ideas they come up with are interesting, and which to throw out. If one could train an AI to recognize interesting things, that would be a very valuable module.
If your child asks for a bedtime story, you have to improvise on the spot - which is what LLMs do. A human author, by contrast, makes plans. A human makes outlines to enforce coherence. (Indeed, the Game of Thrones extension did use multistage outlining, but the GitHub repo has been deleted, so I don't know the details of the strategy.) A human knows which steps require creativity. But an LLM can be instructed to perform these steps; something like Agent-GPT can probably generate them.
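As a rough illustration of "an LLM can be instructed to do these steps," here's a sketch of a plan-then-write loop. complete() is a hypothetical stand-in for whatever text-generation call you have, and the prompts are only illustrative - this is one possible strategy, not the one the Game of Thrones project actually used.

```python
def complete(prompt: str) -> str:
    """Hypothetical wrapper around an LLM call; substitute whatever API you use."""
    raise NotImplementedError

def write_story(premise: str, num_chapters: int = 5) -> str:
    # Step 1: ask for an outline instead of improvising the whole story at once.
    outline = complete(
        f"Write a {num_chapters}-chapter outline for a story about: {premise}"
    )
    # Step 2: draft each chapter with the outline and prior chapters as context,
    # so the model stays anchored to the plan instead of drifting.
    chapters: list[str] = []
    for i in range(1, num_chapters + 1):
        story_so_far = "\n\n".join(chapters)
        chapters.append(complete(
            f"Outline:\n{outline}\n\n"
            f"Story so far:\n{story_so_far}\n\n"
            f"Write chapter {i}, following the outline."
        ))
    return "\n\n".join(chapters)
```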
If you start with the plan of writing a story about explaining plumbing to a child and ask the model to add a detail to mix up the story, what kind of details does it suggest? Is the motorcycle theme so implausible? Many people claim that LLMs are valuable specifically for brainstorming. But those humans do the filtering. Can the model filter good ideas itself? Why not? One way to filter is to just repeat the ideas back and ask which are better. Another way is to ask it to flesh out each idea, or finish each story, and then ask which produced the best story, or even specifically which meshed best with the theme. Maybe the child can recognize good stories without analysis, but the LLM can perform analysis.
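The "repeat the ideas back and ask which are better" filter can be sketched the same way: brainstorm several candidate details, then hand them back to the model to rank. (Same hypothetical complete() wrapper as above; whether the model's ranking is actually any good is exactly the open question.)

```python
def complete(prompt: str) -> str:
    """Same hypothetical LLM wrapper as in the previous sketch."""
    raise NotImplementedError

def brainstorm_and_filter(premise: str, n_ideas: int = 5) -> str:
    # Step 1: generate candidate details (the cheap, random part).
    ideas = [
        complete(f"Suggest one surprising detail to add to this story: {premise}")
        for _ in range(n_ideas)
    ]
    # Step 2: ask the model to judge its own candidates (the contested part).
    numbered = "\n".join(f"{i + 1}. {idea}" for i, idea in enumerate(ideas))
    return complete(
        f"Story premise: {premise}\n\nCandidate details:\n{numbered}\n\n"
        "Which detail would make the best story, and why?"
    )
```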
There's a long history of people claiming that LLMs can't do things that they in fact can do, if you just ask them, or maybe if you ask them right. I don't know that they can do this, but this very theoretical argument seems unconvincing to me. Is human creativity so special? I don't know, but I doubt it. Can LLMs judge good stories? Sounds hard. Now that you've planted your flag, test it. What result would tell you that you're wrong?
Another thing humans do is write drafts. Would this help an LLM, or would it produce the same quality of writing each time? Why do human drafts improve? One possibility is that humans pay attention to different structures in different drafts, e.g., alternating between improving sentences and improving paragraphs. If the LLM has only one level of attention, this wouldn't help, although it could alternate between temperatures. Another possibility is that humans have to take a break to reset the grooves in their brains - their random number generators - so that they can find more options. LLMs can't. A possibility I mentioned above as possibly applicable to LLMs is that some ideas have to be expanded before they can be evaluated.
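For what it's worth, the "alternate between temperatures" version of drafting is also easy to sketch: draft hot, revise cold, repeat. (complete() is again a hypothetical LLM wrapper, here taking a temperature; whether later drafts actually improve is the empirical question.)

```python
def complete(prompt: str, temperature: float = 1.0) -> str:
    """Hypothetical LLM wrapper that also accepts a sampling temperature."""
    raise NotImplementedError

def redraft(premise: str, rounds: int = 3) -> str:
    # High temperature for the first draft: favor surprising material.
    draft = complete(f"Write a short story about: {premise}", temperature=1.2)
    for _ in range(rounds):
        # Low temperature for revision passes: favor coherence over novelty.
        draft = complete(
            "Revise this draft for clarity and consistency, keeping its best ideas:\n\n"
            + draft,
            temperature=0.3,
        )
    return draft
```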