Microsoft Deletes Blog Telling Users To Train AI on Pirated Harry Potter Books
Microsoft pulled a year-old blog post this week after a Hacker News thread flagged that it had encouraged developers to download all seven Harry Potter books from a Kaggle dataset — incorrectly marked as public domain — and use them to train AI models on the company’s Azure platform.
The blog, written in November 2024 by senior product manager Pooja Kamath, walked users through building Q&A systems and generating fan fiction using the copyrighted texts, and even included a Microsoft-branded AI image of Harry Potter. The Kaggle dataset’s uploader, data scientist Shubham Maindola, told Ars Technica the public domain label was “a mistake” and deleted the dataset after the outlet reached out.