Training HI on Audiobooks
I’m listening to Audiobooks this morning
We talk a lot about training AI on books and the copyright implications of that. But what about HI (human intelligence)? One of my counterarguments to the objections about training AI on copyrighted works is that I was trained on copyrighted works, too.
Every professional writer knows that the second most important thing you need to do to become a better writer is read… a lot (the first is to write… a lot). And that’s what we do. When I was younger, if I wasn’t reading at least a novel a week, I wondered if I might be feeling ill. I didn’t read a novel a week as some sort of labor. I loved reading. I was reading a novel a week before I knew I wanted to write them, before I knew that writing was my calling.
And now I’m listening to books?
I listen to business books, but I prefer to read fiction. But part of my marketing plan for Sodom All Over Again, the sequel/prequel to Hell on $5 A Day, was to create an audiobook version of Hell on $5 A Day and release it a chapter at a time as a podcast.
I had grandiose ideas. I was going to do it as a radio play and use AI to make my character voices more authentic/consistent while I acted all the roles. But getting that kind of STS (Speech-to-Speech) AI, one that preserves the tone of the input while keeping the accent of the voice model, at a price I can afford… not as easy as I’d thought it would be. Meanwhile, TTS (Text-to-Speech) is getting better at emotional context, but it just doesn’t let me guide the voice enough to get the performance I want.
I’ve wasted a lot of time trying to find a tool that does it for free or a reasonable price, but most are focused on doing expensive deepfakes for Hollywood. I don’t have Hollywood money.
Furthermore, I don’t want to be dependent on a commercial service that could go belly-up at any time and leave me hanging in the middle of a project. I wanted to be able to run it locally. That’s why I invested in an Intel Core Ultra 9 285H and an Nvidia 5080 card. And the funniest thing is that the 5080 card is only now becoming supported by some of the tools I was hoping to use, 6 months after release.
So here I am, with a TON of time wasted on my search for the Goldilocks workflow that would work with my 5080, and about 12 weeks to record, edit, and compile 40 chapters into an audiobook to meet my self-imposed deadline. And though I dropped the idea of all the AI-enhanced character voices, I still had a lot of other questions to answer.
- I have no idea how to do Alain’s Louisiana accent well, if at all. Do I just not do it?
- I can do a better French accent, but I do not know enough about regional variations to do them well. How about for Marie?
- Groaner Dad had a sort of simple NY accent, but sometimes I felt I slipped into Boston. How do I keep Vinnie’s consistent?
- Sound effects? Atmospheric background music to enhance tense scenes? A lot, a little, none?
- And the big one…
“What the heck are you doing with that?” Bob asked with a tone halfway between just being nervous and all-out panic.
If I’ve just tried to replicate that tone while reading Bob’s dialogue, do I also read out the “Bob asked with a tone…” part?
Listening to some of the highest-rated books in my genre on Audible (thanks to the free trial and the Plus collection), I found some variation.
Some didn’t use sound effects. Those that did were not like foley in a movie, but VERY targeted at specific effects that enhanced the narrative.
Opening music was hit and miss. Background music was almost nonexistent.
Accents were often not done or done very slightly to avoid making them inaccurate or cartoonish.
In the book, I cite some real songs. I was afraid I’d have to change them all to ones I could create with Suno. Now I figure that, so long as I don’t play the notes or quote the lyrics, I can use the names, because they’re somewhat parodied (for example, as elevator music in Hell).
And this is me training myself on the works of other artists
I’m studying how the narrators do the character voices, how they modulate their own voices during pure narration, and how they pace themselves. I’m listening for all those non-verbal vocal cues, musical cues, sound effects, accents, pacing, tone.
I liked to describe my philosophy in Hell on $5 A Day as a pattern of one chapter of moving the plot forward, then one chapter of running and screaming. I still need to listen to how the narrators change their pace, tone, and volume for a “running and screaming” scene. So I’ll see y’all later.