What ChatGPT Knows about You: OpenAI’s Journey Towards Data Privacy

After the concerns raised by ChatGPT’s data outage on March 20th, reactions from the outside world came quickly. The most forceful one? Italy banning ChatGPT over data privacy concerns.

Nearly one month after the incident, OpenAI has already taken two actions regarding user data privacy: the option to turn off the chat history and the option to export your personal data, i.e., the data they keep from your interactions with ChatGPT.

This article outlines the two major actions that OpenAI has taken regarding data privacy in ChatGPT. We will try both new features and take a closer look at the data that ChatGPT keeps about its users, to help you understand the export format and interpret your own data.

#1. Turn off the Chat History

ChatGPT history is more than a way of storing your conversations with the chatbot so that you can log in at any time and check past conversations: Your chat history is also used to train and improve the models behind ChatGPT.

Chat history was enabled on the 15th of December 2022, and let’s be honest: we all benefit from the storage of our conversations! But it is also true that this feature raised some data privacy concerns: Was ChatGPT keeping conversation data to train its AI models? What if sensitive or personal data was shared in those conversations?

Now OpenAI has given users the power to control this! According to OpenAI’s announcement, as of April 25th, it is possible to disable the chat history so that conversations won’t appear anymore on the sidebar. Moreover, they won’t be used for further training, providing the user with an option to manage their data.

Previously, users could clear their chat history on demand, but any conversation could still be used for fine-tuning. As of now, if the chat history is disabled, conversations are retained for only 30 days, so that they can be reviewed in case of misuse of the tool, and are then permanently deleted.

Disabling the chat history is quite straightforward in the Settings control. To access Settings in the web interface, navigate to the lower-left section on the main page. A small window will pop up and there you will find the control for Chat History & Training:

Self-made screenshot from ChatGPT’s settings window.

At this point, I am sure you will have noticed the catch as well:
Why has OpenAI coupled saving your chat history with using this data to train its AI models?

I guess it is a way of micro-pressuring users to keep using their conversations for training purposes. As a point in favor of OpenAI, from my professional experience, I clearly see the benefits of using this real-world data for training.

#2. Export your Personal Data

OpenAI has also added a second new function in ChatGPT’s Settings: an Export option to get your ChatGPT data and find out what information ChatGPT stores about you.

This new option can be seen as a step towards compliance with the EU General Data Protection Regulation (GDPR). Among other provisions, the GDPR obliges those processing data to give data subjects access to their personal data. That is why platforms that gather personal information, such as Google or Netflix, are obliged to send users the data they hold about them at any time.

In the web interface, exporting personal data is also very straightforward. The Export data button is available just below the Chat History & Training one:

Just a couple of minutes after requesting the export, I received my conversations and other relevant information in the inbox of my registration email.

After confirming the export action, this is what I received in my mailbox:

Self-made screenshot from ChatGPT’s export email.

By clicking the Download button, I got a folder with 5 files in html and json formats.

Letting users request their personal data is part of what makes companies compliant with the aforementioned GDPR. Nevertheless, there is a catch: the file format can make the data unreadable for most people. In this case, we got both html and json files. While html can be read directly in a browser, json files can be more difficult to interpret. I personally think that new regulations should also enforce a readable format for the data. But for the time being…

Let’s explore the files one by one to get the most out of this new feature!

Chat History

The first file is chat.html, which contains my entire chat history with ChatGPT. Conversations are stored with their corresponding title. The user’s questions and ChatGPT’s answers are labeled as user and assistant, respectively.

If you have ever trained an AI model yourself, this labeling system will sound familiar to you.

Let’s observe a sample conversation from my history:

Self-made screenshot from my ChatGPT history. The conversation title is highlighted in blue. User/Assistant labels are highlighted in red and green, respectively.

User Feedback

Have you ever seen the thumbs-up and thumbs-down icons next to any ChatGPT answer?

This information is seen by ChatGPT as the feedback for a given answer, which will then help in the chatbot training.

This information is stored in the message_feedback.json file containing any feedback you provided to ChatGPT using the thumbs icons. Information is stored in the following format:

[{"message_id": <MESSAGE ID>, "conversation_id": <CONVERSATION ID>, "user_id": <USER ID>, "rating": "thumbsDown", "content": "{\"tags\": [\"not-helpful\"]}"}]

The thumbsDown rating accounts for wrongly-generated answers while the thumbsUp accounts for the correctly-generated ones.
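One detail that is easy to miss in this format: the content field is itself a JSON document serialized as a string, so it has to be decoded a second time. Here is a minimal Python sketch of that double decoding, assuming the file follows the sample record shown above. The identifiers are placeholder values and the helper name count_feedback_tags is hypothetical.

```python
import json
from collections import Counter

# Sample record mirroring the structure shown above; all ids are placeholders.
sample = [{
    "message_id": "msg-1",
    "conversation_id": "conv-1",
    "user_id": "user-1",
    "rating": "thumbsDown",
    # "content" is stored as a JSON string nested inside the JSON record.
    "content": json.dumps({"tags": ["not-helpful"]}),
}]
raw = json.dumps(sample)


def count_feedback_tags(raw_json: str) -> Counter:
    """Count (rating, tag) pairs across all feedback records."""
    counts = Counter()
    for record in json.loads(raw_json):
        content = json.loads(record["content"])  # second decode
        for tag in content.get("tags", []):
            counts[(record["rating"], tag)] += 1
    return counts


tag_counts = count_feedback_tags(raw)
print(tag_counts)
```

To try it on your own export, you would read the text of message_feedback.json and pass it to the function instead of the sample string.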

User Data

There is also a file (user.json) containing the following personal data from the user:

{"id": <USER ID>, "email": <USER EMAIL>, "chatgpt_plus_user": [true|false], "phone_number": <USER PHONE>}

Some platforms are known for creating a model of the user based on their usage of the platform. For example, if the Google searches of a user are mostly about programming, Google is likely to infer that the user is a programmer and use this information to show personalized advertisements.

ChatGPT could do the same with the information from the conversations, but they are currently obliged to include this inferred information in the exported data.

FYI, you can see what Google knows about you from Gmail by clicking on Account >> Data & Privacy >> Personalized Ads >> My Ad Center.

Complete Conversation History

There is another file containing the conversation history, and also including some metadata. This file is named conversations.json and includes information such as the creation time, several identifiers, and the model behind ChatGPT, among others.

The metadata provides information about the main data. It may include information such as the origin of the data, its meaning, its location, its ownership, and its creation. Metadata accounts for information related to the main data, but it is not part of it.

Let’s explore the same conversation about the A320 Hydraulic System Failure shown in the first example, this time in json format. The conversation itself consists of the following Q&A:

[user]: What happens when one of the three hydraulic systems of a plane airbus 320 fails?

[assistant]: The Airbus A320 aircraft is equipped with three independent hydraulic systems, each providing hydraulic power to different parts of the aircraft. The hydraulic systems are labeled as Green, Blue, and Yellow […]

[user]: Do you know what pilots will do in case of a dual hydraulic failure?

[assistant]: In the event of a dual hydraulic failure on an Airbus A320 aircraft, the pilots will face a more challenging situation as all three hydraulic systems are affected, and there is no redundancy to fall back on […]

From this simple conversation, OpenAI keeps quite a lot of information. Let’s review what is stored:

{
 "title":"A320 Hydraulic System Failure.",
 "create_time":1682368832.626937,
 "update_time":1682369104.0,
 "mapping": { [+] },
 "moderation_results":[],
 "current_node":"<children_id4>",
 "plugin_ids":null,
 "id":"<conversation_id>"
}
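The create_time and update_time values look like Unix timestamps, i.e., seconds elapsed since January 1st, 1970 (UTC). That is my assumption based on their magnitude, since the export does not document the unit. Under that assumption, a short Python sketch makes them human-readable:

```python
from datetime import datetime, timezone

# Values copied from the conversation entry above.
create_time = 1682368832.626937
update_time = 1682369104.0

# Assumption: both values are Unix epoch seconds, interpreted here in UTC.
created = datetime.fromtimestamp(create_time, tz=timezone.utc)
updated = datetime.fromtimestamp(update_time, tz=timezone.utc)

print("created:", created.isoformat())
print("updated:", updated.isoformat())
print(f"time between creation and last update: {update_time - create_time:.0f} s")
```

Read this way, the conversation dates from late April 2023, which matches the period in which the export feature was released.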

The main fields of the json file contain the following information:

The field moderation_results is empty since no feedback was provided to ChatGPT in this concrete case. In addition, the [+] symbol in the mapping field means that more information is available.

In fact, the mapping field contains all the information about the conversation itself. Since the conversation has four interactions, the mapping stores one children entry per interaction.

{
  "<mapping_id>":{ [+] },
  "<parent_id>":{ [+] },
  "<children_id>":{ [+] },
  "<children_id2>":{ [+] },
  "<children_id3>":{ [+] },
  "<children_id4>":{ [+] }
}

Again, the [+] symbol indicates that more information is available. Let’s review the different entries!

mapping_id: It contains an id for the conversation as well as information about the creation time and the type of content, among others. As far as one can infer, it also creates a parent_id for the conversation and a children_id that corresponds to the following interaction of the user with ChatGPT. Here is an example:

{
   "id":"<mapping_id>",
   "message":{
      "id":"<message_id>",
      "author":{
         "role":"system",
         "name":null,
         "metadata":{
            
         }
      },
      "create_time":1682369079.639335,
      "update_time":null,
      "content":{
         "content_type":"text",
         "parts":[
            ""
         ]
      },
      "end_turn":true,
      "weight":1.0,
      "metadata":{
         
      },
      "recipient":"all"
   },
   "parent":"<parent_id>",
   "children":[
      "<children_id>"
   ]
}

children_idX: A new children entry is created for each interaction either from the user or from the assistant. Since the conversation has four interactions, the json file displays four children entries. Each children entry has the following structure:

{
   "id":"<children_id>",
   "message":{
      "id":"<children_id>",
      "author":{
         "role":"user",
         "name":null,
         "metadata":{
            
         }
      },
      "create_time":1682368832.628375,
      "update_time":null,
      "content":{
         "content_type":"text",
         "parts":[
            "What happens when one of the three hydraulic systems of a plane airbus 320 fails?"
         ]
      },
      "end_turn":null,
      "weight":1.0,
      "metadata":{
         "timestamp_":"absolute",
         "message_type":null
      },
      "recipient":"all"
   },
   "parent":"<mapping_id>",
   "children":[
      "<children_id2>"
   ]
}

The first children entry is nested within the conversation by having the mapping_id as its parent and the second interaction (the answer from ChatGPT) as its child.

Children that correspond to a ChatGPT answer contain additional fields. For example, for the second interaction:

{
   "id":"<children_id2>",
   "message":{
      "id":"<children_id2>",
      "author":{
         "role":"assistant",
         "name":null,
         "metadata":{
            
         }
      },
      "create_time":1682368856.590211,
      "update_time":null,
      "content":{
         "content_type":"text",
         "parts":[
            "The Airbus A320 aircraft is equipped with three independent hydraulic systems, each providing hydraulic power to different parts of the aircraft. The hydraulic systems are labeled as Green, Blue, and Yellow.\n\nIf one of the hydraulic systems of an Airbus A320 aircraft fails, the other two systems are designed to provide redundancy and ensure that the aircraft remains controllable. The remaining systems will automatically take over the functions of the failed system, so there should be no loss of control of the aircraft.\n\nHowever, the failure of one hydraulic system will result in a reduction of hydraulic pressure, which can affect the performance of some aircraft systems. For example, if the Green hydraulic system fails, some components like the slats, flaps, and some spoilers will be affected as they are powered by the Green system only. This can cause the aircraft to experience reduced maneuverability and increased drag, which can affect the ability of the aircraft to climb or maintain altitude.\n\nIn such a situation, the pilots will follow the procedures outlined in the aircraft's emergency checklist to manage the situation and ensure that the aircraft is flown safely to the nearest suitable airport for landing. The exact procedures will depend on the nature of the failure, but the pilots will prioritize maintaining control of the aircraft and ensuring the safety of the passengers and crew."
         ]
      },
      "end_turn":true,
      "weight":1.0,
      "metadata":{
         "message_type":null,
         "model_slug":"text-davinci-002-render-sha",
         "finish_details":{
            "type":"stop",
            "stop":"<|im_end|>"
         },
         "timestamp_":"absolute"
      },
      "recipient":"all"
   },
   "parent":"<children_id>",
   "children":[
      "<children_id3>"
   ]
}

In the case of a ChatGPT answer, we also get information about the model behind ChatGPT and the stop token that ended the generation. The entry also lists the first child as its parent and the third child as the following interaction.
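Because every entry points to its parent and its children, the conversation can be rebuilt by following those links. The Python sketch below walks from the last node back to the root through the parent field and then reverses the result. It is only an illustration under my own assumptions: the mapping is a trimmed-down copy of the entries shown above, the contents of the parent_id entry are guessed (the export sample does not show them), and the helper name linear_thread is hypothetical.

```python
# Trimmed-down mapping using the placeholder ids from the article.
mapping = {
    # Assumed root entry: its real contents are not shown in the article.
    "<parent_id>": {"message": None, "parent": None, "children": ["<mapping_id>"]},
    "<mapping_id>": {
        "message": {"author": {"role": "system"}, "content": {"parts": [""]}},
        "parent": "<parent_id>",
        "children": ["<children_id>"],
    },
    "<children_id>": {
        "message": {
            "author": {"role": "user"},
            "content": {"parts": [
                "What happens when one of the three hydraulic systems "
                "of a plane airbus 320 fails?"
            ]},
        },
        "parent": "<mapping_id>",
        "children": ["<children_id2>"],
    },
    "<children_id2>": {
        "message": {
            "author": {"role": "assistant"},
            "content": {"parts": [
                "The Airbus A320 aircraft is equipped with three "
                "independent hydraulic systems [...]"
            ]},
        },
        "parent": "<children_id>",
        "children": [],
    },
}


def linear_thread(mapping, current_node):
    """Follow parent links from current_node to the root.

    Returns (role, text) pairs ordered from oldest to newest.
    """
    turns = []
    node_id = current_node
    while node_id is not None and node_id in mapping:
        node = mapping[node_id]
        message = node.get("message")
        if message is not None:
            text = "".join(message["content"]["parts"])
            if text:  # skip the empty system placeholder
                turns.append((message["author"]["role"], text))
        node_id = node.get("parent")
    return list(reversed(turns))


thread = linear_thread(mapping, "<children_id2>")
for role, text in thread:
    print(f"[{role}]: {text}")
```

I walk through parent rather than children because children is a list, so a node could presumably have more than one child, while each entry has a single parent. In a real export, the starting point would be the current_node value of the conversation.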

The full file can be found in this GitHub gist.

Model Comparison

Have you ever used the “Regenerate response” button when you were not fully convinced by the response provided by ChatGPT?

Self-made screenshot from the Regenerate response button in ChatGPT.

This feedback information is also stored!

There is one last file, named model_comparisons.json, that contains snippets of the conversations and the consecutive attempts whenever ChatGPT regenerated a response. It contains only the conversation text, without the title, plus some other metadata. Here is the basic structure of this file:

{
  "id":"<id>",
  "user_id":"<user_id>",
  "input":{[+]},
  "output":{[+]},
  "metadata":{[+]},
  "create_time": "<time>"
}

The metadata field contains some important information such as the country and continent where the conversation took place, and information about the https access schema, among others. The interesting part of this file comes in the input/output entries:

Input

The input contains a collection of messages from the original conversation. Interactions are labeled depending on the author and, as in the previous cases, some additional information is also stored. Let’s observe the messages stored for our sample conversation:

[system]: You are ChatGPT, a large language model trained by OpenAI, based on the GPT-3.5 architecture.\n Knowledge cutoff: 2021-09\n Current date: 2023-04-07.

[user]: What happens when one of the three hydraulic systems of a plane airbus 320 fails?

[assistant]: The Airbus A320 aircraft is equipped with three independent hydraulic systems, each providing hydraulic power to different parts of the aircraft. The hydraulic systems are labeled as Green, Blue, and Yellow […]

[user]: Do you know what pilots will do in case of a dual hydraulic failure?

[assistant]: In the event of a dual hydraulic failure on an Airbus A320 aircraft, the pilots will face a more challenging situation as all three hydraulic systems are affected, and there is no redundancy to fall back on […]

User/Assistant entries are expected, but I am sure at this point we are all wondering: why is there a system label?

And moreover, why do they feed an initial statement like this at the beginning of each conversation?

Is ChatGPT pre-fed with the current date in every new conversation?

Yes, those entries are the so-called system messages.

System Messages

System messages give overall instructions to the assistant. They help to set the behavior of the assistant. In the web interface, system messages are hidden from the user, which is why we do not see them directly.

The benefit of the system message is that it allows the developer to tune the assistant without making the request itself part of the visible conversation. System messages can be fed by using the API. For example, if you are building a car sales assistant, one possible system message could be “You are a car sales assistant. Use a friendly tone and ask questions to the users until you understand what they need. Then, explain the available cars that match their preferences”. You could even feed the list of vehicles, specifications, and prices so that the assistant can give this information too.
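To make the car sales example more concrete, here is a minimal Python sketch of how such a conversation could be prepared as a list of role/content messages, the same system/user/assistant labeling we saw in the exported data. The inventory is invented purely for illustration, and the sketch only builds and prints the message list: actually sending it to the API requires an API key and a client library, which I leave out here.

```python
import json

system_prompt = (
    "You are a car sales assistant. Use a friendly tone and ask questions "
    "to the users until you understand what they need. Then, explain the "
    "available cars that match their preferences."
)

# Hypothetical inventory, made up for this example.
inventory = [
    {"model": "City Hatch", "seats": 4, "price_eur": 18500},
    {"model": "Family Van", "seats": 7, "price_eur": 32900},
]

messages = [
    # The system message carries the instructions and the data for the
    # assistant; the end user of the application never sees it.
    {
        "role": "system",
        "content": system_prompt + "\nAvailable cars: " + json.dumps(inventory),
    },
    # The user message is the visible part of the conversation.
    {"role": "user", "content": "Hi, I am looking for a small car for commuting."},
]

print(json.dumps(messages, indent=2))
```

Keeping the instructions and the inventory in the system message means they can be changed without touching the visible conversation at all.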

Output

The output entry contains the responses given by ChatGPT and the consecutive trials every time you hit the Regenerate response button:

{
   "output":{
      "feedback_version":"inline_regen_feedback:a:1.0",
      "ui_feature_name":"inline_regen_feedback",
      "ui_feature_variant":"a",
      "ui_feature_version":"1.0",
      "feedback_step_1":{[+]},
      "feedback_step_2":{
         "original_turn":[
            {
               "id":"<original_turn_id>",
               "author":{[+]},
               "create_time":1680877473.736083,
               "update_time":null,
               "content":{<original_response>},
               "end_turn":true,
               "weight":1.0,
               "recipient":"all"
            }
         ],
         "new_turn":[
            {
               "id":"<new_turn_id>",
               "author":{[+]},
               "create_time":1680877502.81384,
               "update_time":null,
               "content":{<new_response>},
               "end_turn":true,
               "weight":1.0,
               "recipient":"all"
            }
         ],
         "completion_comparison_rating":"new",
         "new_completion_placement":"not-applicable",
         "feedback_start_time":1680877456156,
         "compare_step_start_time":1680877456156,
         "new_completion_load_start_time":1680877456156000,
         "new_completion_load_end_time":1680877502976,
         "frontend_submission_time":1680877507949
      }
   }
}

As observed above, the feedback_step_1 entry stores information about the thumbs-up/thumbs-down feedback mentioned previously.

The regeneration information is stored in the feedback_step_2 entry with the first subentry original_turn for the original response and the retried response under new_turn.
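The timing fields in feedback_step_2 can be compared with a few lines of Python. Take this as a rough sketch under my own assumptions: judging by their magnitude, I assume create_time is expressed in seconds and fields such as feedback_start_time in milliseconds, since the export does not document the units. The dictionary below is a trimmed copy of the entry above with only the fields used here.

```python
# Trimmed copy of the feedback_step_2 entry shown above.
feedback_step_2 = {
    "original_turn": [{"create_time": 1680877473.736083}],
    "new_turn": [{"create_time": 1680877502.81384}],
    "completion_comparison_rating": "new",
    "feedback_start_time": 1680877456156,
    "new_completion_load_end_time": 1680877502976,
}

original = feedback_step_2["original_turn"][0]
regenerated = feedback_step_2["new_turn"][0]

# Assumption: create_time values are Unix timestamps in seconds.
gap_seconds = regenerated["create_time"] - original["create_time"]

# Assumption: these frontend timestamps are in milliseconds.
load_ms = (
    feedback_step_2["new_completion_load_end_time"]
    - feedback_step_2["feedback_start_time"]
)

print(f"seconds between original and regenerated turn: {gap_seconds:.1f}")
print(f"milliseconds from feedback start to load end: {load_ms}")
print("comparison rating:", feedback_step_2["completion_comparison_rating"])
```

The completion_comparison_rating value presumably records which of the two responses was preferred, although the export does not explain it.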

And that is all the information OpenAI keeps about our interactions with ChatGPT! I think having an idea of which information is stored can be useful for two major purposes.

Firstly, in today’s data world, it is important to care about our privacy and be aware of the information that platforms store and infer about us. Secondly, knowing how the information is structured and handled can help us build customized assistants using ChatGPT as a starting point. For example, by looking into our own data, we learned that ChatGPT can be given system messages that steer the assistant towards a specific purpose without the user ever seeing them.

Summary

In this article, we have reviewed the actions taken by OpenAI regarding users’ Data Privacy as a response to the concerns raised during the past months.

Both the possibility of turning off the chat history and the new feature to export your personal data anytime are clear steps towards protecting ChatGPT users. I personally see these steps as a commitment to prioritizing data privacy and adhering to relevant data protection regulations. Transparency and security are key in building trust and ensuring responsible AI usage.

From our perspective as users, I think it is worth being aware of the options we have to manage our data privacy, especially these two new features, which control fundamental points: making sure your interactions with ChatGPT are not used for training if you don’t want them to be, and receiving the exact data a company holds about you.

Of course, there are other risks associated with the usage of this technology. For example, users should also be aware of data retention policies, that is, how long the platform retains the data, which ideally should be the minimum necessary. Understanding the intended use of the data you provide to the AI platform, and knowing whether the platform shares your data with third parties and for what purpose, should also be among our main concerns.

By considering these factors, users can make informed decisions about their data privacy when using ChatGPT or any other Large Language Model.

It’s important to be proactive in understanding how your data is handled and taking steps to protect your privacy rights.

And that is all! Many thanks for reading!

I hope this article helps you understand the information ChatGPT keeps from our conversations, as well as how to use OpenAI’s new data privacy features.

You can also subscribe to my Newsletter to stay tuned for new content. Especially, if you are interested in articles about ChatGPT.

This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.

About Andrea Valenzuela
