O’Reilly Media: Use AI to increase discoverability, don’t serve sloppy AI hamburgers

O’Reilly Media has been working with Miso AI to deliver an LLM-based discovery system that helps its users find their way to content steak, not AI hamburger…

Lucky Gunasekara, co-founder and CEO of MISO AI speaking at the Future of Media Technology 2025
These are live-blogged notes from a session at Future of Media Technology conference.

What do the oldest tech media company and a firm named after the founder’s dog have in common? Another panel at the Future of Media Technology conference explored this…

O’Reilly (the oldest tech media company) has been publishing tech books for 45 years, ran the famous Web 2.0 expos, and has evolved into a tech learning platform, with over 2.8 million users. By 2019, it was becoming obvious that the sheer volume of content on oreilly.com — from books, videos, audio and live courses — was overwhelming the existing search tools.

At the time, the founders of Miso (the company named after a dog) were starting to explore deep learning models, the ancestors of today’s LLMs. They built a model to find research papers for them. They showed it to a friend at O’Reilly, who saw the potential for it to solve the content overwhelm problem.

Search content, naturally

Miso’s challenge was to build a natural language search engine for content. And so, weeks of white-boarding ensued, as they started to explore what was needed. They built prototypes, they iterated on them, and eventually landed on both a model and an interface. But to get there, they had to stop seeing books as containers and start thinking of them as assemblages of useful information, and that meant treating information snippets as the basis of the experience.

And that meant a new royalty model, too. They didn’t want their publisher partners and creators to miss out on the value. Could the snippet value be more than an ineffective scroll? The models never regurgitate authors’ works for searchers; they’re there to help users find their way to the right content.

They had to “liquify” the book content for machines to consume. Preparing content for AI is very different from preparing content for search, said Lucky Gunasekara, co-founder and CEO of MISO AI. The machine needs to understand the context of the content, not just the content itself. This is the process they call “vectorisation”:

“It doesn’t need to be perfect, but it needs a baseline of quality to do the work,” he said.
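To make the idea concrete, here is a minimal Python sketch of vectorising snippets together with their context and then matching a natural-language query against them. It is not Miso’s or O’Reilly’s implementation: the `Snippet` fields, the sample library and the word-count “embedding” are all assumptions made for illustration, and a real system would use a learned embedding model instead.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Snippet:
    title: str    # hypothetical source book title
    chapter: str  # hypothetical chapter heading (the "context")
    text: str     # the snippet itself


def tokenize(text: str) -> list[str]:
    # Lowercase and split into alphanumeric word tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorise(snippet: Snippet) -> Counter:
    # Toy stand-in for a learned embedding: word counts over the snippet
    # text plus its surrounding context (title and chapter heading).
    return Counter(tokenize(f"{snippet.title} {snippet.chapter} {snippet.text}"))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, library: list[Snippet]) -> Snippet:
    # Return the snippet whose vector is closest to the query vector.
    q = Counter(tokenize(query))
    return max(library, key=lambda s: cosine(q, vectorise(s)))


# Made-up sample data, purely for illustration.
library = [
    Snippet("Python Basics", "Working with files",
            "Use a context manager so the handle is closed for you."),
    Snippet("Network Operations", "Closing connections",
            "Always release the socket when the session ends."),
]

best = search("how do I close a file in python", library)
# Point the reader to the source rather than reproducing the whole work.
print(f"{best.title}, chapter '{best.chapter}': {best.text}")
```

In this toy example the book title is what connects the query to the right snippet, which is a small-scale version of the point about the machine needing the context of the content, not just the content itself.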

O’Reilly’s AI Answers

Julie Baron, Chief Product Officer at O’Reilly Media, talking at the Future of Media Technology 2025 conference

The result is their Answers product, which is on its 2.0 iteration right now. “ChatGPT is, if you like, a phantom of the original content,” said Julie Baron, Chief Product Officer at O’Reilly Media.

In Answers 2.0 they highlight the sources of the information in a way that allows discovery and then use of the originals. And it’s led to greater usage of their service, with more titles being accessed more often.

“The long tail is something you need to nurture — that depth of archive you have can be making money for you, if you use tools like this,” she said.

And they’re exploring new interaction models, including the ability to highlight sections of text and ask for an explanation of them.

Bringing AI into the back-end

But they’re also using Studion, an internal LLM built on O’Reilly content, to enhance and streamline content production work. Being able to play with different models is an important learning experience, as people figure out which models are good for which tasks.

The forthcoming Answers 2.5 adds more generative abilities, such as creating lists, while maintaining the link to the original sources. It adapts content into step-by-step processes, or rewrites it for lay people. And it can translate languages on the fly.

O’Reilly want to encourage an ecosystem of participation that rewards creators for use of their work in AI. And that’s why they’re not doing deals with the big AI companies. Their founder, Tim O’Reilly, has been very clear on that.

The sticky-fingered AI trainers

Just because they don’t do deals with the AI companies doesn’t mean that their content, even their paywalled material, hasn’t been used for training. They’ve found evidence that some of OpenAI’s models have been trained on O’Reilly’s paywalled content, which raises the question of exactly how they got access.

And OpenAI is far from the only culprit. There are hundreds of bots out there trying to train on premium content. The AI companies are renaming their bots and introducing new ones: it’s an arms race between the blocking services and the bots.
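A rough Python sketch of why this becomes an arms race: a blocklist keyed on user-agent names only catches the bots it already knows about. The tokens, the sample request strings and the `is_blocked` helper below are illustrative assumptions, not anything shown in the session, and real blocking services rely on far more than the user-agent header.

```python
# Illustrative blocklist only: lowercase substrings of crawler user-agent names.
BLOCKED_AGENT_TOKENS = ("gptbot", "ccbot", "claudebot")


def is_blocked(user_agent: str) -> bool:
    # Case-insensitive substring match against the known tokens.
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


# Made-up request headers for illustration.
sample_user_agents = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (compatible; BrandNewCrawler/0.1)",  # hypothetical renamed bot
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/126.0",
]

for ua in sample_user_agents:
    print("block" if is_blocked(ua) else "allow", "-", ua)
```

The hypothetical renamed crawler is allowed through until someone adds its new name to the list, which is why a static list has to be maintained continuously.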

They published the research in the paper Beyond Public Access in LLM Pre-Training Data:

Copyright-Aware AI: Let’s Make It So
On April 22, 2022, I received an out-of-the-blue text from Sam Altman inquiring about the possibility of training GPT-4 on O’Reilly books. We had a call a few days later to discuss the possibility. As I recall our conversation, I told Sam I was intrigued, but with reservations. I explained to him that we could …

“You should be crawling and scraping them, finding evidence they’ve used your content, and talking to your lawyers and sending cease and desist letters,” Gunasekara said.

Content hamburger versus content steak

This is where the hamburger law comes into play. Tim O’Reilly has said that you can turn steak into hamburger, but you can’t turn hamburger back into steak, Gunasekara said. ChatGPT is the hamburger in this example, and O’Reilly content is very much the steak.

“They’re serving hamburgers, you should be serving Michelin-starred meals,” he said. “We provide citations, quotes, and links. They can’t.”


ChatGPT is ‘ghost’ of what publishers provide directly
O’Reilly Media and Miso Technologies discuss their Answers tool, which responds to prompts, at the Future of Media Technology Conference.