8 Aug 2023

Web Crawling in the AI Era: OpenAI’s GPTBot and Meta’s Open-Source Strategy

OpenAI recently unveiled GPTBot, a web crawler that will gather information from websites for use in training the company’s future AI systems.

GPTBot gathers data for training OpenAI’s next AI system, potentially named “GPT-5.” The bot collects data from public websites while avoiding restricted content, and website owners can prevent collection from their sites by adding a “disallow” rule for the bot. OpenAI also pre-scans the scraped data to remove personally identifiable information (PII) and content that violates its policies.
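As a rough illustration of the “disallow” rule, the sketch below uses Python’s standard urllib.robotparser module to check what a rule like this would mean for a crawler. It assumes the rule lives in the site’s robots.txt file and that the crawler identifies itself with the user-agent token GPTBot and honors the file; the example.com URL and the is_allowed helper are hypothetical and only there for demonstration.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content for a site that opts out of GPTBot entirely.
# The "GPTBot" user-agent token is an assumption not stated in this article.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper: it parses the rules from a string instead of
    downloading them, so the check runs without network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/some-post"  # placeholder URL
    for agent in ("GPTBot", "Googlebot"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, page))
```

Because the rule names only one user agent, other crawlers are unaffected; a site could also list specific paths after Disallow instead of blocking everything. Note that robots.txt is a convention, so it only works for crawlers that choose to respect it.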

OpenAI recently faced criticism for scraping data without permission to train large language models (LLMs) such as those behind ChatGPT. In response, the company updated its privacy policies in April, and GPTBot follows from that update. Like the crawlers run by search engines such as Google, Bing, and Yandex, GPTBot systematically collects publicly accessible data from websites across the internet. The change is meant to make data collection more ethical and permission-based, addressing the concerns raised about scraping. By following the conventions of established search engine crawlers, OpenAI is taking steps to gather training data for models like ChatGPT more transparently and responsibly.

In contrast to OpenAI’s strategy, Meta has released an open-source large language model (LLM). The model is free to use, subject to certain usage restrictions, and users can fine-tune it on their own datasets.