GPTBot Is Scanning The Internet: How OpenAI Will Change Content Consumption and the Future of Search
By Maciej Lesiak
- 13 minute read - 2576 words
This article is also available in Polish:
GPTBot skanuje internet: Jak OpenAI zmieni sposób konsumpcji treści i przyszłość wyszukiwania
Recently, Jakub ‘unknow’ Mrugalski shared an interesting observation on his Mastodon profile:
I’ve noticed increased traffic from OpenAI on my server with projects. Yesterday over 787 thousand visits from the ChatGPT bot, and today already over 377k. Fun facts: it downloads not only HTML, but also some graphics + JS files, respects robots.txt, downloads ZIP files. Previously there was much (~100x) less of this. Is OpenAI seriously getting into indexing the web to threaten Google? source
Mrugalski isn’t the only one who has noticed increased OpenAI bot activity. Analysis of server logs on various websites and online stores confirms the bots’ specific interest in the “embed” function in WordPress and REST API endpoints. Most importantly, after checking my servers, I found that it’s mainly GPTBot scanning (used for training models), not OAI-SearchBot (used for search). The bots are primarily scanning WordPress sites. In Magento stores, they mindlessly traverse product parameters and search functions… trying to access login or customer account sections. They seem to be looking for patterns. However, I’ll only discuss WordPress, as that’s where I found interesting anomalies.
What Is OpenAI Really Doing With Your Website?
There are three main OpenAI bots you might find in your logs:
GPTBot (user-agent: GPTBot/1.1 or GPTBot/1.2) - used to train generative AI models. Full user-agent string:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot
OAI-SearchBot (user-agent: OAI-SearchBot/1.0) - used for searching and displaying results in ChatGPT’s search functions. Not used for training models. Full string includes
OAI-SearchBot/1.0; +https://openai.com/searchbot
ChatGPT-User (user-agent: ChatGPT-User/1.0) - used when ChatGPT users or Custom GPTs ask questions and ChatGPT visits sites on their behalf. Full string:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
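If you want to check which of these bots is visiting your own server, a quick pass over the access log is enough. Below is a minimal sketch of such a check in PHP (my illustration, not a production tool); it assumes a standard combined-format Apache/Nginx log, and the log path is an assumption that will differ per setup.

<?php
// Minimal sketch: count hits per OpenAI crawler in a combined access log.
// The log path is an assumption - adjust it to your own server layout.
$log  = '/var/log/nginx/access.log';
$bots = ['GPTBot' => 0, 'OAI-SearchBot' => 0, 'ChatGPT-User' => 0];

$handle = fopen($log, 'r');
if (!$handle) {
    exit("Cannot open $log\n");
}
while (($line = fgets($handle)) !== false) {
    foreach (array_keys($bots) as $name) {
        if (strpos($line, $name) !== false) {
            $bots[$name]++;
        }
    }
}
fclose($handle);

foreach ($bots as $name => $count) {
    echo "$name: $count\n";
}

The same counts can be obtained with a simple grep, but a script like this is easy to extend, e.g., to list which /embed/ URLs the bots request most often.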
GPTBot is much more active and particularly interested in the simplified version of WordPress pages available in embed format, e.g., at addresses like /post-name/embed/, as well as in other REST API endpoints it scans. Why? All signs indicate that OpenAI is preparing to introduce direct citation of internet sources in ChatGPT answers, similar to the function that already exists in Microsoft Copilot. A note on the most popular CMS, WordPress: don't confuse the /embed/ endpoint with the oEmbed function used for embedding multimedia in WordPress content - what matters here is the technical page format the bots are fetching.
Why Is Embed So Interesting for GPTBot?
Embed versions of WordPress pages contain “clean” content without navigational elements, making them ideal for direct citation. GPTBot will likely use these versions to generate fragments or “snippets” directly inserted into ChatGPT answers.
This is a step toward reducing the AI hallucination problem by enabling direct citation of sources instead of their interpretation.
Rich Snippet Effect – A Strategy to Keep Users in the Ecosystem
What really happens when OpenAI introduces direct source citation? We can find the answer by analyzing the Google vs. Wikipedia case, which I described in more detail in my article on digital palimpsests.
Google, by introducing rich snippets from Wikipedia (text fragments displayed directly in search results), caused a drastic decrease in the number of visits to Wikipedia. Users simply read the displayed fragment in Google and didn’t click further. Global statistics showed a decline in traffic to Wikipedia.
By implementing a similar direct citation function, ChatGPT aims to:
- Keep users in its ecosystem - users get an answer with a source citation without needing to leave the ChatGPT interface
- Reduce the hallucination problem - citation instead of interpretation means less risk of misrepresentation
- Build trust in answers - providing sources increases credibility
- Reduce traffic to cited sites - just as with Wikipedia, users can be expected to click on provided sources less frequently
For website owners, this means their content may be read by users but without generating traffic to their site. Their sites will effectively become content providers for AI, without the benefits of ad impressions or conversions. It will be difficult, but not hopeless - let’s be optimists.
What to Do as a WordPress Site Owner?
I'll describe this specific case because WordPress is one of the most popular CMSs. As a site owner, you currently face this dilemma:
Disable embed? Traditionally, disabling the embed function is recommended for improving SEO (avoiding duplicate content) and reducing server load.
Keep embed? If you want your content to be directly cited by ChatGPT, keeping the embed function makes sense.
Maybe a Compromise?
Keep embed, but add rel=canonical and monitor the crawlers, because OpenAI does not provide a webmaster panel for modifying its index - including for removing content.
Importantly, this applies not only to WordPress - WordPress is simply the most popular case. Adding a rel=canonical tag to the embed version, pointing to the original page, solves the duplicate content problem for SEO while keeping the content available for GPTBot:
<link rel="canonical" href="<?php echo get_permalink(); ?>" />
Website optimization also requires work on compatibility and various standards, including refining OpenGraph metadata, while keeping the embed technology that many people disable in WordPress for optimization purposes. Also note that ChatGPT primarily searches in English: it translates your question into English, generates results in English, and then translates them back into Polish. Languages other than English, including Polish, are still treated as secondary in the context of AI development. This will certainly change, but it's worth factoring in this risk.
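As an illustration of the OpenGraph point, here is a minimal, assumed sketch of emitting basic og: tags for single posts from a theme's functions.php; in practice an SEO plugin will usually handle this more completely.

// Minimal sketch: basic OpenGraph tags for single posts. An SEO plugin
// normally covers this (and more) out of the box.
add_action( 'wp_head', function () {
    if ( ! is_singular() ) {
        return;
    }
    printf( '<meta property="og:title" content="%s" />' . "\n", esc_attr( get_the_title() ) );
    printf( '<meta property="og:url" content="%s" />' . "\n", esc_url( get_permalink() ) );
    printf( '<meta property="og:description" content="%s" />' . "\n", esc_attr( wp_strip_all_tags( get_the_excerpt() ) ) );
} );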
What Else You Can Block
- Disable XML-RPC - if you don't use the WordPress mobile apps, disable XML-RPC for security (the bot tries to access this too); see the sketch after this list.
- Block the bots - if you don't want your site to be used by AI, you can add the appropriate declarations to robots.txt, but it's more reliable to block the IP ranges from the specification at the firewall or web server level (Apache/Nginx); examples follow below this list. Personally, I block with ipset/iptables wherever I want blocking.
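Two minimal sketches for the points above (my examples, not official recipes). The first disables XML-RPC using the core xmlrpc_enabled filter in a theme's functions.php; the second is a robots.txt fragment using the user-agent tokens OpenAI documents - keep in mind that robots.txt is a request the crawler chooses to honor, not an enforcement mechanism.

// Minimal sketch: turn off XML-RPC entirely. Skip this if you rely on the
// WordPress mobile apps or other tools that still use XML-RPC.
add_filter( 'xmlrpc_enabled', '__return_false' );

And the robots.txt fragment:

# robots.txt fragment: ask OpenAI crawlers to stay away (honored voluntarily).
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /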
MY UNCONFIRMED SPECULATIONS
WARNING: you are entering the realm of unconfirmed speculation
Increased OpenAI bot activity suggests we're moving toward a new round of browser wars and attempts to push Google out of business, which would mean a complete reconstruction of how content is consumed and searched for on the internet. Not enough? AI will directly cite internet sources instead of interpreting them, and users will forget there's such a thing as a search engine because they'll simply have conversations or receive advice through prediction algorithms. In my opinion, this opens new possibilities for site owners - it's an opportunity to increase visibility, but also a challenge related to controlling their own content, with all the assumptions and risks I wrote about in digital palimpsests regarding the disappearance of content and the manipulation of recommendation algorithms in the context of Google. You'll also find fragments there about prediction and changes in Google's algorithm.
Regardless of the decision, in my opinion, if you have technical competencies, it’s worth monitoring server logs and being aware of how and by whom your content is being used. If someone is interested in blocking IPs, the list of ranges can be found here: Overview of OpenAI Crawlers in the OpenAI specification.
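For those who prefer blocking at the firewall level, a rough ipset/iptables sketch could look like the one below. The CIDR shown is a documentation placeholder - substitute the current ranges from OpenAI's published list.

# Rough sketch: drop traffic from OpenAI crawler ranges with ipset + iptables.
ipset create openai-bots hash:net
ipset add openai-bots 203.0.113.0/24   # placeholder - use OpenAI's published ranges
iptables -I INPUT -m set --match-set openai-bots src -j DROP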
What Awaits Us in the Next 2 Years
The currently observed increase in the activity of AI models and of the OpenAI bots discussed here is just the tip of the iceberg of changes coming over the next few years. As an industry specialist, I'd like to share my fears and thoughts about what awaits us:
1. Search Dominated by AI
Traditional search engines will likely move away from the "10 blue links" model in favor of direct answers generated by AI. Google, Bing, and other platforms will provide ready-made content syntheses instead of referring users to sources - embedded page fragments will be the only trace of the source. The SERP (search results page) will fade into oblivion like Netscape.
2. The End of Traditional SEO
If AI becomes an intermediary between content and user, classical search engine optimization will lose meaning. Companies will have to fight for the attention of AI algorithms rather than of users directly. For two years now I've been gathering practical knowledge about the potential for manipulating AI algorithms - it is possible, and not only through short-term tricks like data poisoning. Data poisoning is the deliberate introduction of incorrect or misleading information into AI training sets in order to influence a model's responses; examples include attempts to manipulate Wikipedia content or the mass generation of news sites designed to mislead language models. SEO is evolving toward optimization for AI, which is a challenge but also an opportunity for creators offering unique content. The main threat I see to such creativity is the SEO heist, meaning content theft using AI - it can hurt creators if their unique materials are used on a massive scale to create generative page clones. That's why I believe such creators will remain a niche: by rising above the average, they attract the attention of spammers.
3. AI-Generated Content
A significant portion of new online content will be created with AI assistance or entirely by AI, which may lead to flooding the web with materials of low substantive value. Everyone fears this; in my opinion, it will draw a line between what was before and the completely generative era. Paradoxically, I see an opportunity for people who possess not only real competencies but also passion…
4. Deepening Information Bubbles
Unfortunately, I don’t have good news here. We’re losing to disinformation. Recommendation algorithms will be so sophisticated that users will function in completely personalized information environments, which will limit diversity and accidental discoveries. Critical thinking may be harder to maintain in an algorithm-dominated environment.
5-Year Perspective: Internet Transformation
1. Multimodal Internet
Working in the industry since 2008, I've seen how devices not only increased user saturation and lowered content quality but also accelerated content consumption and changed presentation formats. Text interfaces may be largely replaced by voice, visual, and mixed interactions (AR/VR), reducing the demand for traditional websites. This is especially evident in the YouTube format and podcasts, but also in the planned metaverse.
2. Platform Consolidation - walled garden
I don't have good news here. A small number of megaplatforms will likely take control of access to most content, acting as "gateways" to the internet. The panicked attempts of some publishers to escape and free themselves from the primacy of the algorithm show clearly that this is already practically impossible. It's similar to the current UberEats model and its complete domination of SERPs and food delivery orders. This is the walled garden effect: closed ecosystems and platform economies that leverage the network effect - the more similar businesses use a solution, the greater its value becomes. Companies that see no way to enter the business independently become dependent on the platform; the threshold for entering alone or exiting is so high that there is no option other than being on the platform. It reminds me a bit of the case of Wyborcza's editorial team migrating from X to BSKY.
3. Disappearance of the “Open” Internet
Free browsing and discovering content may be replaced by a model where AI curators decide what the user sees. A shadow profile built on our interactions will practically lead us by the hand.
4. Content Monetization Crisis
Traditional business models based on ads and organic traffic may become unprofitable, leading to the decline of independent content creators. The threshold for companies entering the internet will be so high that cooperation with intermediaries will be necessary - something like the current UberEats model and its total domination of SERPs and food delivery orders.
5. Regional Internets
The internet may become more geographically fragmented, with separate ecosystems in different regions of the world subject to local regulations. This will be driven by restrictions and differing moderation requirements - something already visible on Facebook, which demotes the visibility of content depending on local culture.
Dead Internet Theory in the Context of AI
The "Dead Internet Theory", which is in fact a conspiracy theory, assumes that a significant part of the internet is already generated by bots and AI programs rather than humans. Although in its extreme form the theory sounds conspiratorial, the current development of generative AI models suggests that its moderate version may become reality. Paradoxically, we would thus watch the conspiracy theory stop being a theory and become fact.
Manipulating AI on a large scale will consist of so-called poisoning (data contamination) combined with high-quality SEO. As a result, we can expect a flood of generative content with low substantive value, but optimized for AI algorithms and search engines.
In such an environment, paradoxically, a huge niche emerges for creators offering authentic, non-generative content. However, maintaining this niche will require:
- Ignoring statistics in the initial phase when generative content will dominate
- More emphasis on community interactions
- Activating readers
- Creating materials of undeniable quality
Not every company or individual creator can afford this. Those who, like the architect mentioned in the text, have moved entirely to social media platforms (e.g., Instagram) and given up their own domain and website will be hit especially hard.
Is There Any Hope?
User rebellion has worked in the past - e.g., the mass exodus from Facebook after the Cambridge Analytica scandal. Despite the gloomy vision, there are, in my opinion, factors that can undeniably change the course of events:
Decentralization and Web3 - Technology such as blockchain, IPFS, or platforms like Mastodon can offer an alternative to corporate dominance. Despite the excitement of a community consisting mainly of computer scientists and activists during each supposed migration from Twitter to alternatives, primarily Mastodon, I don’t yet see signals that Mastodon or niche solutions like NOSTR would threaten the dominance of Big Tech.
Government Regulations - Initiatives like the EU’s AI Act or Digital Services Act try to limit big tech power. This is probably the greatest hope as long as we don’t go in the direction of “Muskian” deregulation. Here I simply refer to EU projects, because probably only the EU constitutes a counterweight to tech bros.
Technological Breakthroughs - You can always count on a miracle, right? Some unforeseen innovations that could overturn the current order. It’s not impossible that new innovative devices or solutions will shake up the cemented system and new players will emerge. As long as they aren’t bought out earlier for hundreds of millions of dollars.
User Rebellion - The least realistic scenario. Although history shows that people sometimes turn away from toxic platforms, which can force companies to change their approach, I wouldn't be so naive as to think Zuckerberg won't want to turn us into free workers. Instagram already runs on content created for free and generates more revenue than Netflix, which spends billions on its own productions.
Practical Conclusions
Don't give up on websites, e.g., by moving entirely to Facebook/Instagram. For site owners and content creators, the important thing in the face of these changes is to build long-term relationships with your audience. Define your model audiences and try to reach them through independent channels (newsletters, communities). Don't create content using AI. It's precisely original and sometimes imperfectly edited content that will be valuable in a world of perfect content, perfect photos, and a perfect life with all the bad edited out. I understand the reluctance toward various formats, but you must test different solutions and ecosystems. As I wrote in the digital palimpsest piece: if you want to survive on the internet and stay afloat, you must adapt.
The internet as we know it – open, chaotic, full of possibilities – is rather heading toward extinction. However, the final shape of this change also depends on the activity and awareness of network users. I’ve repeatedly criticized the Fediverse, but I did so not because I’m against the idea, but because I would like it to constitute a real alternative to Big Tech.
Related
- SEO Spam and Competition Gaming - The Dark Side of AI Content Marketing
- TECH: How AI is Changing the Face of Polish Digital Media in 2024
- AI series: A scenario of how AI can take over recommendation systems, generating and reinforcing conspiracy theories and disinformation
- AI Series: The maieutic method – enhancing AI with prompts
- AI Series: The final warning – AI's Self-Reflection on Its own development
- AI in Service of Conspiracy Theories and Paranoid Thinking
- Bypassing Security Filters in ChatGPT's SVG Generation
- The Illusory Security of BIP: A Brief Technical Analysis of Security Measures