Posted by Paul Samael on Sunday, December 18, 2016 Under: Random thoughts
According to a report in The Guardian, Google has recently attempted to improve the language capabilities of one of its Artificial Intelligence programs by feeding it over 10,000 free ebooks downloaded from Smashwords (out of a total of well over 50,000 free ebooks). Apparently the idea was to help the AI produce more natural-sounding sentences.
Being The Guardian, the report was a bit po-faced about the whole thing and the journalist seemed to think that the authors ought to have been remunerated – or that at the very least, Google ought to have asked permission. However, doing that for 10,000+ books would’ve been a massive undertaking – and it’s not as if they were trying to pass off the material as their own. Besides, if you make your work available in digital form for free, it is inevitably going to be read by web-bots (even if no one else reads it).
Anyway, as someone who has self-published on Smashwords, I was curious as to whether my own work had been guzzled by Google’s AI – or whether it had chomped down on any of the excellent free Smashwords books I have reviewed on this site. I was also curious to know what, if anything, the researchers had done to ensure that this giant helping of “brain-food” was actually any good. After all, feeding a load of garbage into your AI program wouldn’t necessarily produce very satisfactory results – and even a cursory look at Smashwords is enough to tell you that the quality of work there is, well, a little variable, shall we say. Indiscriminate downloading could also result in the AI program developing an unhealthy fixation with language used to describe acts of sexual congress...
So I requested the BooksCorpus dataset from the people who originally compiled it at the University of Toronto (NB these are not the people at Google – Google just borrowed their dataset). It is sorted into a wide range of files by genre, from “Adventure” through to “Vampires.” I’m not sure what the AI program made of all the vampire stuff. “Erotica” was not included, although “Romance” was, as was something called “New Adult” (again, not sure what the AI made of these...). Of books reviewed on this site, the following were included:
- Taking Candy from the Devil by Robert P Kaye
- The Ant Farm by Neil Hetzner
- Pedalling Backwards by Julia Russell
- Corpus Callosum by Erika D Price
- Free Indie Reader No. 1 which has contributions from Tom Lichtenberg, Carla Herrera, Lisa Thatcher, Michael Graeme, Giando Sigurani, Willie Wit, Judy B and me
- The Inelegant Universe by Charles Hubbard
Of authors whose books I have reviewed on this site, several books by Tom Lichtenberg were included (e.g. Girl in the Trees), as was a book by Steve Anderson (Underheroes).
As for the other books in the dataset, I’ve opened up a few of them at random, but I haven’t found anything that struck me as utter garbage – so whatever they did to select the books, they don’t seem to have done too bad a job. Indeed, I may use sections of the BooksCorpus as a way of trying to find free Smashwords gems that have so far escaped my attentions.
I did ask them how they selected which books to put into the dataset (e.g. did they take account of downloads or reviews, or was it just random?), but the researcher who replied hadn’t been directly involved. It looks as if the idea was to feed the dataset with as much material as possible in the hope that the good would overwhelm the bad – although there was some mention of a facility in the program which can correct for bad grammar/spelling. That said, the researcher did admit that their model sometimes produced broken sentences, including ones that read a bit like porn (just fancy that!).
So there you have it – instead of being fed on a carefully controlled diet of Great Literature from established giants of the publishing world, Google’s AI has been gorging itself on the modest scribblings of a bunch of self-published indie writers. It feels a bit like that scene towards the end of “2001: A Space Odyssey” where astronaut Dave Bowman is deactivating the HAL 9000 supercomputer – only in reverse:
HAL 9000: “Dave, my mind is going. I can feel it. I can feel it. My mind is going. There is no question about it. I can feel it. I can feel it. I can feel it. I'm a... fraid. Good afternoon, gentlemen. I am a HAL 9000 computer. I became operational at the H.A.L. plant in Urbana, Illinois on the 12th of January 1992. My instructor was Mr. Langley, and he taught me to sing a song. If you'd like to hear it I can sing it for you.”
Tags: google ai "artificial intelligence" bookscorpus dataset