Novel Update: Still using AI, but differently (more ethically?)

Audio Book: Four Chapters and Counting

If you’ve followed this blog, you know I’ve been exploring ways to use AI to make an audiobook of Hell on $5 A Day. In a recent post, I was really set on using ElevenLabs. I figured I could use their voice conversion feature to provide a guide performance for bits it didn’t get. As I got into chapter 2, I found myself less enamored with it. The readings were just off enough, just often enough, that I realized I might have to guide-read a LOT of it.

I decided to try recording it myself one more time. I got up early and recorded before I had my coffee. I got a take I really liked. Then I decided to try reading Chapter 2 one day after lunch and see if I could reconcile the differences between my 7 a.m. voice and my 2 p.m. voice via voice conversion. I was calling it “speech to speech” or “voice to voice,” but those terms seem to be used more popularly to refer to talking to an AI and having it talk back.

Here’s a demo of what I’m talking about…

7 a.m. voice

2 p.m. voice

2 p.m. voice with voice conversion based on 7 a.m. voice


It’s not a perfect reproduction, but this is in part due to the limitations of the tech. I’m using Chatterbox for the voice conversion with a 30-second sample of chapter 1 as the guide voice for the zero-shot clone. The thing is that it’s “good enough” to keep things consistent. I’m actually tempted to run chapter 1 through the voice conversion process just to tighten up the similarities.

Do I wish I could do a higher-fidelity reproduction? Yes. Could I improve the sound on chapters 2-40 with a little more post-production? Probably. I’ll explore both as I get closer to done. But the big thing is that I’m getting readings that capture the context and emotions I want. Not every bit of narration is going to be emotional, but even just stressing the correct syllable/word to communicate the context of the sentence better is a big improvement.

Is it less work than continuing down the ElevenLabs route? Doubtful. Is it less expensive? Maybe by $200 or so. Do I believe that overall quality is improved by having a human who is familiar with the novel reading every word? Abso-freaking-lutely.

For all the “emotionally intelligent” text-to-speech engines, we’re still far from having the kind of intelligence needed to get dramatic narration and dramatic dialogue right. But while I’m technically “deepfaking” my 7 a.m. voice, it’s still MY voice and MY performance, so I think it’s completely ethical when used this way.

Also, when it comes to energy use, I’m not boiling the ocean. Running Chatterbox locally, converting a 10-minute chapter takes about a minute, so even if my machine drew a full kilowatt the whole time, I’d be using less than 1/60th of a kilowatt-hour to convert ten minutes of audio. That’s about 0.00000017% of the energy OpenAI’s latest 10-gigawatt deal will use in an hour. The whole novel, at an estimated 500 minutes, should use up about 0.0000083%, or 1/12,000,000th, of an hour of that 10-gigawatt deal.
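If you want to check the back-of-the-envelope math yourself, here’s a rough Python sketch. The 1 kW power draw is an assumed worst case for a home PC, not something I measured, and the variable names are just made up for illustration. The other numbers come straight from the paragraph above.

```python
# Rough energy comparison. ASSUMPTION: the PC draws at most 1 kW while
# converting (a generous upper bound, not a measured value).
MACHINE_POWER_KW = 1.0

# From the post: a 10-minute chapter converts in about 1 minute.
CHAPTER_AUDIO_MINUTES = 10
CONVERSION_MINUTES_PER_CHAPTER = 1.0

# From the post: the whole novel is an estimated 500 minutes of audio.
NOVEL_AUDIO_MINUTES = 500

# 10 gigawatts expressed in kilowatts; run for one hour, that is 10,000,000 kWh.
DATACENTER_POWER_KW = 10 * 1_000_000
datacenter_hour_kwh = DATACENTER_POWER_KW * 1.0


def conversion_energy_kwh(audio_minutes: float) -> float:
    """Upper-bound energy (kWh) to convert the given minutes of audio."""
    chapters = audio_minutes / CHAPTER_AUDIO_MINUTES
    conversion_hours = chapters * CONVERSION_MINUTES_PER_CHAPTER / 60
    return MACHINE_POWER_KW * conversion_hours


chapter_kwh = conversion_energy_kwh(CHAPTER_AUDIO_MINUTES)
novel_kwh = conversion_energy_kwh(NOVEL_AUDIO_MINUTES)

# Express each as a percentage of one hour of the 10 GW deal.
chapter_pct = chapter_kwh / datacenter_hour_kwh * 100
novel_pct = novel_kwh / datacenter_hour_kwh * 100

print(f"One chapter: {chapter_kwh:.4f} kWh ({chapter_pct:.8f}% of one hour)")
print(f"Whole novel: {novel_kwh:.4f} kWh ({novel_pct:.7f}% of one hour)")
print(f"Novel is 1/{round(datacenter_hour_kwh / novel_kwh):,} of that hour")
```

A real desktop probably draws well under a kilowatt, so lowering MACHINE_POWER_KW only makes the comparison more lopsided.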

So in multiple ways, even if you don’t consider it completely ethical, it’s at least a more ethical use of AI than many others.
