ChatGPT and GDPR: Balancing AI innovation with data protection

By Feodora Hamza

OpenAI’s ChatGPT has gained widespread attention for its ability to generate human-like text in response to prompts. However, after months of celebration for OpenAI and ChatGPT, the company now faces legal action from several European data protection authorities who believe it has scraped people’s personal data without their consent. The Italian Data Protection Authority has temporarily blocked the use of ChatGPT as a precautionary measure, while French, German, Irish, and Canadian data regulators are also investigating how OpenAI collects and uses data. In addition, the European Data Protection Board has set up an EU-wide task force to coordinate investigations and enforcement concerning ChatGPT. These moves have sparked a heated debate on the use of AI language models and raised important ethical and regulatory questions, particularly around data protection and privacy.

Concerns around GDPR compliance: How can generative AI comply with data protection rules such as GDPR? 

According to Italian authorities, OpenAI’s disclosure regarding its collection of user data during the post-training phase of its system, specifically chat logs of interactions with ChatGPT, is not entirely transparent. This raises concerns about compliance with General Data Protection Regulation (GDPR) provisions that aim to safeguard the privacy and personal data of EU citizens, such as the principles of transparency, purpose limitation, data minimisation, and data subject rights.

As a condition for lifting the ban it imposed on ChatGPT, Italy has outlined the steps OpenAI must take. These steps include obtaining user consent for data scraping or demonstrating a legitimate interest in collecting the data, which is established when a company processes personal data within a client relationship, for direct marketing purposes, to prevent fraudulent activities, or to safeguard the network and information security of its IT systems. In addition, the company must provide users with an explanation of how ChatGPT utilises their data and offer them the option to have their data erased, or refuse permission for the program to use it.

Padlock symbol for computer data protection system. Source: Envato Elements

Steps towards GDPR compliance: OpenAI’s updated privacy policy and opt-out feature

OpenAI has updated its privacy policy, describing its practices for gathering, using, and safeguarding personal data. In a GPT-4 technical paper, the company stated that publicly available personal information may be included in the training data and that OpenAI endeavours to protect people’s privacy by using models to remove personal data from training data ‘where feasible’. In addition, ChatGPT now offers an incognito mode, which strengthens OpenAI’s GDPR compliance efforts, safeguards users’ privacy, and prevents the storage of personal information, giving users greater control over how their data is used.

The company’s choice to offer an opt-out feature comes amid mounting pressure from European data protection regulators concerning the firm’s data collection and usage practices. Italy has demanded OpenAI’s compliance with the GDPR by April 30. In response, OpenAI implemented a user opt-out form and the ability to object to personal data being used in ChatGPT, allowing Italy to restore access to the platform in the country. This move is a positive step towards empowering individuals to manage their data.

Challenges in deleting inaccurate or unwanted information from AI systems remain

However, deleting inaccurate or unwanted information from AI systems in compliance with the GDPR is more challenging. Although some companies have been instructed to delete algorithms developed from unauthorised data, eliminating all personal data used to train models remains difficult. The problem arises because machine learning models often have complex black-box architectures that make it hard to trace how a given data point, or set of data points, is being used. As a result, models often have to be retrained on a smaller dataset in order to exclude specific data, which is time-consuming and costly for companies.
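The retraining problem can be illustrated with a minimal sketch (the record structure and names are hypothetical): removing a data subject’s records from a training corpus is the easy part; the expense lies in retraining the model on what remains, because a black-box model’s weights cannot simply have one person’s influence subtracted out.

```python
# Hypothetical sketch of honouring a GDPR erasure request against a
# training corpus. Record fields are invented for illustration.

def erase_subject(corpus, subject_id):
    """Return a new corpus with every record tied to the data subject removed."""
    return [record for record in corpus if record["subject_id"] != subject_id]

corpus = [
    {"subject_id": "u1", "text": "chat log A"},
    {"subject_id": "u2", "text": "chat log B"},
    {"subject_id": "u1", "text": "chat log C"},
]

reduced = erase_subject(corpus, "u1")
# Filtering is cheap. The costly step is retraining the model on
# `reduced` from scratch, since the contribution of u1's data to the
# existing weights cannot be isolated and deleted directly.
```

This is why regulators’ deletion orders are so burdensome in practice: compliance at the dataset level is trivial, but compliance at the model level can mean repeating the entire training run.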

Data protection experts argue that OpenAI could have saved itself a lot of trouble by building in robust data record-keeping from the start. Instead, it is common practice in the AI industry to build datasets for AI models by scraping the web indiscriminately and then outsourcing the work of removing duplicates and irrelevant data points, filtering out unwanted material, and fixing typos. In AI development, the dominant paradigm has been that more training data is better: OpenAI’s GPT-3 model was trained on a massive 570 GB of data. These methods, and the sheer size of the datasets, mean that tech companies tend not to have a full understanding of what has gone into training their models.

While many criticise the GDPR for being cumbersome and hampering innovation, experts argue that the legislation serves as a model that pushes companies to improve their practices when they are compelled to comply with it. It is presently the sole means available to individuals to exercise any authority over their digital lives and data in a world that is becoming progressively automated.

The impact on the future of generative AI: The need for ongoing dialogue and collaboration between AI developers, users, and regulators

This highlights the need for ongoing dialogue and collaboration between AI developers, users, and regulators to ensure that the technology is used responsibly and ethically. ChatGPT appears to be facing a rough ride with Europe’s privacy watchdogs. The Italian ban may only have been the beginning: since OpenAI has not yet set up a local headquarters in any EU country, it remains exposed to further investigations and bans from any member state’s data protection authority.

However, while EU regulators are still wrapping their heads around the regulatory implications of, and for, generative AI, companies like OpenAI continue to benefit from and monetise the lack of regulation in this area. With the EU’s Artificial Intelligence Act expected to pass soon, the EU aims to address the GDPR’s gaps in regulating AI and to inspire similar initiatives in other countries. The impact of generative AI models on privacy will probably remain on regulators’ agendas for many years to come.

How search engines make money and why being the default search engine matters

By Kaarika Das and Arvin Kamberi

Samsung, the maker of millions of smartphones with Google Search preinstalled, is reportedly in talks to replace Google with Bing as the default search provider on its devices. This marks the first serious threat to Google’s long-standing dominance of the search business. Despite Alphabet’s diversified segments, its core business and the majority of its profit come from Google Search, which accounted for US$162 billion of Alphabet’s US$279.8 billion total revenue last year. Naturally, Google’s top priority is to protect its core business and retain its position as the default search engine on electronic devices such as tablets, mobile phones, and laptops.

A critical question arises about the underlying business model of online search engines like Google, Bing, Baidu, Yandex, and Yahoo. What do these search engines stand to gain by being the default search engine on a device? Let us examine how search engines generate revenue while allowing users to explore the internet for information and content for free.

The profit model of search engines

Search engines make money primarily through advertising; Google alone earns billions of dollars yearly from its Google Ads platform. The mechanism works as follows: whenever a user enters a search query, the search engine returns a list of web pages and other content related to the query, including advertisements. Advertisers pay search engines to display sponsored results when users search for specific keywords. These ads typically appear at the top and/or bottom of search engine results pages (SERPs) and are labelled as ‘sponsored’ or ‘ad’. Search engines get paid based on the number of clicks these ads receive, a model popularly known as pay-per-click (PPC).
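The pay-per-click mechanism can be sketched in a few lines (all figures are made up for illustration): the search engine charges the advertiser only when a sponsored result is actually clicked, not merely when it is shown.

```python
# Minimal sketch of pay-per-click (PPC) billing, with invented figures.

def ppc_revenue(impressions, click_through_rate, cost_per_click):
    """Revenue = clicks x price per click; impressions alone earn nothing."""
    clicks = impressions * click_through_rate
    return clicks * cost_per_click

# 1,000,000 ad impressions, 2% of which are clicked, at $0.50 per click:
revenue = ppc_revenue(1_000_000, 0.02, 0.50)
print(revenue)  # 10000.0
```

The example shows why traffic volume matters so much: revenue scales directly with the number of searches that can generate clicks.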

Apart from sponsored listings, search engines also track user data for targeted advertising, drawing on people’s search history. Search engines can easily gather information about users’ search history, preferences, and behaviours through cookies, IP address tracking, device and browser fingerprinting, and other technologies. They then use these data points to profile users and improve the targeting of advertisements. For example, if a user frequently searches for recipes and food, the search engine may display advertisements for restaurants and related food products. User search history thus also helps improve search engine algorithms and enhances search accuracy by identifying patterns in user behaviour. In capitalising on user data, search engines allow advertisers to manage their advertisements using strategies such as ad scheduling, geotargeting, and device targeting, all made possible by accumulated user history data.
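A toy version of this kind of interest profiling can be sketched as follows (the categories and keyword lists are invented; real systems use far richer signals than keyword matching):

```python
from collections import Counter

# Hypothetical interest categories and keyword lists, for illustration only.
CATEGORY_KEYWORDS = {
    "food": {"recipe", "restaurant", "pasta"},
    "travel": {"flight", "hotel", "visa"},
}

def profile_interests(search_history):
    """Tally which interest categories a user's queries fall into."""
    counts = Counter()
    for query in search_history:
        words = set(query.lower().split())
        for category, keywords in CATEGORY_KEYWORDS.items():
            if words & keywords:  # query mentions at least one category keyword
                counts[category] += 1
    return counts

history = ["best pasta recipe", "cheap flight to rome", "restaurant near me"]
profile = profile_interests(history)
# A user with a high "food" count would be shown restaurant ads.
```

Even this crude tally illustrates the principle: the more history accumulated, the sharper the profile, and the more precisely ads can be targeted.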

Google making money from its search engine. Image generated by DALL-E/OpenAI.

The power of default

Let us now delve into the edge a search engine gains by being the default. Regardless of the default, people can always change the search engine on their devices according to personal preference. Yet despite the absence of any exclusivity, there is massive inertia against changing the default search engine. The effort required to manually navigate to a different search engine makes the transition a hassle, especially for ordinary users. In parallel, less tech-savvy people may not be aware of alternative search engines and may have no explicit preference for any particular one. Even users who know of alternatives may hesitate, because the effectiveness, performance, and security of a search engine not paired with their device remain unverified.

A default search engine therefore provides a sense of security (however misleading), as its performance and device compatibility are assumed to have been vetted by the manufacturer. Being the default is thus highly advantageous: it gives a search engine a broader audience base, increased traffic, and greater brand recognition. Large traffic, in turn, keeps search engines attractive to advertisers, their primary source of revenue: the more users a search engine has, the dearer its advertising space becomes, generating better returns.

For users, however, pre-installed search engines deprive them of the choice to select a preferred alternative, including search engines that do not track user details. In 2019, the European Commission stated that Google had gained an unfair advantage by pre-installing its Chrome browser and Google search app on Android smartphones and notebooks. To address these antitrust concerns, in early 2020 Google enabled Android smartphones and tablets sold in the European Economic Area (EEA) to show a ‘choice screen’ offering users four search engines to choose from.

While Google pays billions to device manufacturers like Samsung and Apple to remain the default search engine, the ongoing AI leap in the industry has enormous ramifications for the future of internet search and its ensuing business model. With unprecedented developments in AI and search engine functionality increasingly integrated with AI, the tussle among search rivals battling for popularity and influence is set to continue.