In the past few years, Microsoft has invested heavily in OpenAI, the company behind the well-known generative language model ChatGPT. In our first ChatGPT post, we suspected that companies like OpenAI could be tempted to use the scraper bots of search engines such as Bing or Google to gather data to train their large language models (LLMs), like the one behind ChatGPT. Such an integration would make it much harder for businesses to opt out of data collection without hurting their online presence.
Earlier this year, Microsoft announced that AI would be integrated into its search engine, Bing, so that users could ask it questions directly from the search page. This feature, called the new Bing, is available to Microsoft Edge users and uses GPT-4, the same model as ChatGPT.
You may be wondering how to prevent the new Bing from using your website data for training, or how to stop it from giving users answers that can only be found on your website, since either could negatively impact your business. We looked into how the Bing–GPT integration works and how businesses can opt out of having their data used by the new Bing.
If you perform a search on Bing—for example, “what is DataDome”—a ‘Chat’ section next to a blue icon will show up below the search bar.
If you click on it, it opens a new page with the new Bing interface, which is set up like a chat.
Our “what is DataDome” search query was automatically processed by GPT-4, and the new Bing provided a summary of what DataDome does: protecting businesses against online fraud and bad bots!
As a user, you can ask any question directly in the new Bing chat UI, and Bing will use GPT-4 to answer your questions—meaning you won’t need to visit the websites directly to get your answer. Note, however, that the new Bing still lists its sources in the “Learn more” section.
For the first popular version of ChatGPT, based on GPT-3, OpenAI was quite transparent about the source of its training data. That is no longer the case for the latest versions of GPT: the GPT-4 technical report makes no mention of the training dataset.
As we predicted a few months ago, it’s highly likely that OpenAI is leveraging its relationship with Bing to use the data collected by Bingbot (the scraper Bing uses to index the web) as training data for its LLMs at scale.
We consider this highly likely because of our next finding: what happens when you ask the new Bing to retrieve information from a specific URL?
To conduct our test, we asked the new Bing to summarize the content of a page located on the DataDome website. To try to force a live request to our site, we also asked it to make sure it was using the latest version of the page.
Even though we asked Bing GPT to retrieve the latest version of the URL, we saw no requests to that URL in our logs, regardless of IP address or user-agent.
However, in going over the previous 24 hours of our logs, we observed that Bingbot made several requests to this page (among others on our website). This activity appears to be the standard Bingbot scraper analyzing every public page for display on the search engine.
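For readers who want to run a similar check on their own logs, here is a minimal sketch. It assumes access logs in the common "combined" format used by Nginx and Apache; the sample lines, the `/blog/example-post/` path, and the `requests_for_path` helper are hypothetical and only illustrate the idea, not the tooling we actually used.

```python
import re
from collections import Counter
from datetime import datetime, timedelta, timezone

# Combined log format (an assumption about your server):
# ip - - [time] "METHOD path protocol" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def requests_for_path(lines, target_path, now, window=timedelta(hours=24)):
    """Count requests to target_path per user-agent in the window before `now`."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        when = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
        if match["path"] == target_path and now - window <= when <= now:
            hits[match["agent"]] += 1
    return hits


# Hypothetical log lines, using documentation-reserved IP addresses.
BINGBOT_UA = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
sample = [
    f'203.0.113.7 - - [10/May/2023:08:15:00 +0000] "GET /blog/example-post/ HTTP/1.1" 200 5120 "-" "{BINGBOT_UA}"',
    '198.51.100.4 - - [10/May/2023:09:00:00 +0000] "GET /other/ HTTP/1.1" 200 900 "-" "Mozilla/5.0"',
    f'203.0.113.7 - - [08/May/2023:08:15:00 +0000] "GET /blog/example-post/ HTTP/1.1" 200 5120 "-" "{BINGBOT_UA}"',
]
now = datetime(2023, 5, 10, 12, 0, tzinfo=timezone.utc)
hits = requests_for_path(sample, "/blog/example-post/", now)
for agent, count in hits.items():
    print(count, agent)
```

Grouping by user-agent rather than filtering on it up front matters here: the question is whether anything at all fetched the page, not only whether a client announcing itself as Bingbot did.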
This strongly suggests that the new Bing uses the content already gathered by Bingbot, rather than performing HTTP requests on the fly to fetch the URLs provided in the Chat UI.
In future testing, we could go further by serving a special page only to Bingbot, then checking whether that content is what the new Bing uses when asked about the page in its Chat UI.
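A rough sketch of how such a test page could be served is shown below. It assumes Bingbot is identified by its user-agent and then confirmed with a reverse DNS lookup ending in `search.msn.com` followed by a forward lookup, which is the verification approach Bing documents. The function names, the `canary_body` marker text, and the IP address and hostname in the usage example are all hypothetical.

```python
import socket

BINGBOT_DNS_SUFFIX = ".search.msn.com"


def is_verified_bingbot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` reverse-resolves to a Bing crawler host and back.

    The lookup functions are injectable so the logic can be exercised offline.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse_lookup(ip)
        if not hostname.endswith(BINGBOT_DNS_SUFFIX):
            return False
        # The forward lookup must map back to the same IP to rule out spoofed PTR records.
        return ip in forward_lookup(hostname)
    except OSError:  # covers socket.herror and socket.gaierror
        return False


def choose_body(user_agent, ip, normal_body, canary_body, verify=is_verified_bingbot):
    """Serve the canary page only to a verified Bingbot, the normal page otherwise."""
    if "bingbot" in user_agent.lower() and verify(ip):
        return canary_body
    return normal_body


# Offline usage example with stubbed DNS answers (hypothetical values).
def fake_verify(ip):
    return is_verified_bingbot(
        ip,
        reverse_lookup=lambda addr: "msnbot-203-0-113-10.search.msn.com",
        forward_lookup=lambda host: ["203.0.113.10"],
    )


body = choose_body(
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "203.0.113.10",
    normal_body="regular page",
    canary_body="canary-7f3a: unique marker text",
    verify=fake_verify,
)
print(body)
```

If the unique marker later shows up in a Chat UI answer about that page, the content must have come from what Bingbot collected, since no other client ever received it.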