After all the concerns raised by ChatGPT’s data outage on March 20th, reactions from the outside world came quickly. The most forceful one? Italy banning ChatGPT over data privacy concerns.
Nearly one month after the incident, OpenAI has already taken some actions regarding user data privacy: The possibility to turn off the chat history and to export your personal data, i.e., the data they keep from your interaction with ChatGPT.
This article outlines the two major actions that OpenAI has taken regarding Data Privacy in the powerful AI-generation technology ChatGPT. We will try the two new features with a closer look at the data that ChatGPT keeps from its users, to help you understand the given format and therefore, interpret your data.
#1. Turn off the Chat History
ChatGPT history is more than a way of storing your conversations with the chatbot so that you can log in at any time and check past conversations: Your chat history is also used to train and improve the models behind ChatGPT.
Chat history was enabled on the 15th of December 2022, and let’s be honest: we all benefit from the storage of our conversations! But it is also true that this feature raised some data privacy concerns: Was ChatGPT keeping conversation data to train its AI models? What if sensitive or personal data was shared in those conversations?
Now OpenAI has given users the power to control this! According to OpenAI’s announcement, as of April 25th, it is possible to disable the chat history so that conversations won’t appear anymore on the sidebar. Moreover, they won’t be used for further training, providing the user with an option to manage their data.
Previously, users could periodically clear their chat history on demand, but any conversation could still be used for fine-tuning. As of now, if the chat history is disabled, conversations are only retained for 30 days. This is done just in case conversations need to be reviewed due to a misuse of the tool, before permanently deleting them.
Disabling the chat history is quite straightforward in the Settings control. To access Settings in the web interface, navigate to the lower-left section on the main page. A small window will pop up and there you will find the control for Chat History & Training:
At this point, I am sure you will have noticed the catch as well:
Why has OpenAI coupled saving your chat history with using this data to train its AI models?
I guess it is a way of micro-pressuring users to keep using their conversations for training purposes. As a point in favor of OpenAI, from my professional experience, I clearly see the benefits of using this real-world data for training.
#2. Export your Personal Data
OpenAI has also added a second new function in ChatGPT’s Settings: an Export option to get your ChatGPT data and find out what information ChatGPT stores about you.
This new option can be seen as a step towards compliance with the EU General Data Protection Regulation (GDPR). The GDPR defines, among other provisions, the obligation of those processing data to give data subjects access to their personal data. That is the reason why platforms gathering personal information, such as Google or Netflix, are now obliged to send users the data they hold about them, at any time.
In the web interface, exporting personal data is also very straightforward. The Export data button is available just below the Chat History & Training one:
Just a couple of minutes after confirming the export, I received an email at my registration address giving access to a file with my conversations and other relevant information. This is what I received in my mailbox:
By clicking the Download button, I got a folder with 5 files in html and json formats.
Letting users request their personal data is how companies comply with the aforementioned GDPR. Nevertheless, there is a catch: the file format can make the data unreadable for most of the population. In this case, we got both html and json files. While html can be read directly, json files can be more difficult to interpret. I personally think that new regulations should also enforce a readable format of the data. But for the time being…
Let’s explore the files one by one to get the most out of this new feature!
Chat History
The first file is chat.html, which contains my entire chat history with ChatGPT. Conversations are stored with their corresponding title. The user’s questions and ChatGPT’s answers are labeled as user and assistant, respectively.
If you have ever trained an AI model yourself, this labeling system will sound familiar to you.
Let’s observe a sample conversation from my history:
User Feedback
Have you ever seen the thumbs-up, thumbs-down icons (👍👎) next to any ChatGPT answer?
This information is seen by ChatGPT as the feedback for a given answer, which will then help in the chatbot training.
This information is stored in the message_feedback.json file, which contains any feedback you provided to ChatGPT using the thumbs icons. Information is stored in the following format:
[{"message_id": <MESSAGE ID>, "conversation_id": <CONVERSATION ID>, "user_id": <USER ID>, "rating": "thumbsDown", "content": "{\"tags\": [\"not-helpful\"]}"}]
The thumbsDown rating accounts for wrongly-generated answers while the thumbsUp rating accounts for the correctly-generated ones.
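As a rough illustration, the feedback file could be summarized with a few lines of Python. This is a minimal sketch, assuming the export keeps the structure shown above; the sample records and the helper name summarize_feedback are made up for the example. Note that the content field is itself a JSON-encoded string, so it needs a second decoding step.

```python
import json
from collections import Counter

# Hypothetical sample mirroring the structure shown above (IDs are placeholders).
SAMPLE = json.dumps([
    {
        "message_id": "msg-1",
        "conversation_id": "conv-1",
        "user_id": "user-1",
        "rating": "thumbsDown",
        "content": "{\"tags\": [\"not-helpful\"]}",
    },
    {
        "message_id": "msg-2",
        "conversation_id": "conv-1",
        "user_id": "user-1",
        "rating": "thumbsUp",
        "content": "{}",
    },
])


def summarize_feedback(raw_json):
    """Count ratings and tags in a message_feedback.json payload."""
    ratings = Counter()
    tags = Counter()
    for record in json.loads(raw_json):
        ratings[record["rating"]] += 1
        # "content" is a JSON string nested inside the JSON record.
        content = json.loads(record.get("content") or "{}")
        tags.update(content.get("tags", []))
    return ratings, tags


ratings, tags = summarize_feedback(SAMPLE)
print(dict(ratings))
print(dict(tags))
```

With a real export, the SAMPLE string would be replaced by the contents of message_feedback.json read from disk.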
User Data
There is also a file (user.json) containing the following personal data from the user:
{"id": <USER ID>, "email": <USER EMAIL>, "chatgpt_plus_user": [true|false], "phone_number": <USER PHONE>}
Some platforms are known for creating a model of the user based on their usage of the platform. For example, if the Google searches of a user are mostly about programming, Google is likely to infer that the user is a programmer and use this information to show personalized advertisements.
OpenAI could do the same with the information from the conversations, but it is currently obliged to include any such inferred information in the exported data.
⚠️ FYI: you can see what Google knows about you from Gmail by clicking on Account >> Data & Privacy >> Personalized Ads >> My Ad Center.
Complete Conversation History
There is another file containing the conversation history, this time including some metadata. This file is named conversations.json and includes information such as the creation time, several identifiers, and the model behind ChatGPT, among others.
⚠️ The metadata provides information about the main data. It may include information such as the origin of the data, its meaning, its location, its ownership, and its creation. Metadata accounts for information related to the main data, but it is not part of it.
Let’s explore the same conversation about the A320 Hydraulic System Failure shown in the first example, now in json format. The conversation itself consists of the following Q&A:
[user]: What happens when one of the three hydraulic systems of a plane airbus 320 fails?
[assistant]: The Airbus A320 aircraft is equipped with three independent hydraulic systems, each providing hydraulic power to different parts of the aircraft. The hydraulic systems are labeled as Green, Blue, and Yellow […]
[user]: Do you know what pilots will do in case of a dual hydraulic failure?
[assistant]: In the event of a dual hydraulic failure on an Airbus A320 aircraft, the pilots will face a more challenging situation as all three hydraulic systems are affected, and there is no redundancy to fall back on […]
From this simple conversation, OpenAI keeps quite some information. Let’s review the stored information:
{
"title":"A320 Hydraulic System Failure.",
"create_time":1682368832.626937,
"update_time":1682369104.0,
"mapping": { [+] },
"moderation_results":[],
"current_node":"<children_id4>",
"plugin_ids":null,
"id":"<conversation_id>"
}
These are the main fields of the json file. The field moderation_results is empty since no feedback was provided to ChatGPT in this concrete case. In addition, the [+] symbol in the mapping field means that more information is available.
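The create_time and update_time values are not very readable as they stand. A minimal sketch for converting them, assuming they are seconds since the Unix epoch (an assumption on my side, the export does not document it); the helper name to_utc is hypothetical:

```python
from datetime import datetime, timezone

# Values copied from the sample conversation above.
create_time = 1682368832.626937
update_time = 1682369104.0


def to_utc(timestamp):
    """Interpret a numeric timestamp as seconds since the Unix epoch (UTC)."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


created = to_utc(create_time)
updated = to_utc(update_time)
print(created.isoformat())
print(updated.isoformat())
print(f"Time between creation and last update: {updated - created}")
```

Under that assumption, the sample conversation dates from late April 2023, which matches the date of my chat.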
In fact, the mapping field contains all the information about the conversation itself. Since the conversation has four interactions, the mapping stores one children entry per interaction.
{
"<mapping_id>":{ [+] },
"<parent_id>":{ [+] },
"<children_id>":{ [+] },
"<children_id2>":{ [+] },
"<children_id3>":{ [+] },
"<children_id4>":{ [+] }
}
Again, the [+] symbol indicates that more information is available. Let’s review the different entries!
mapping_id: It contains an id for the conversation as well as information about the creation time and the type of content, among others. As far as one can infer, it also creates a parent_id for the conversation and a children_id that corresponds to the following interaction of the user with ChatGPT. Here is an example:
{
"id":"<mapping_id>",
"message":{
"id":"<message_id>",
"author":{
"role":"system",
"name":null,
"metadata":{
}
},
"create_time":1682369079.639335,
"update_time":null,
"content":{
"content_type":"text",
"parts":[
""
]
},
"end_turn":true,
"weight":1.0,
"metadata":{
},
"recipient":"all"
},
"parent":"<parent_id>",
"children":[
"<children_id>"
]
}
children_idX: A new children entry is created for each interaction, either from the user or from the assistant. Since the conversation has four interactions, the json file displays four children entries. Each children entry has the following structure:
{
"id":"<children_id>",
"message":{
"id":"<children_id>",
"author":{
"role":"user",
"name":null,
"metadata":{
}
},
"create_time":1682368832.628375,
"update_time":null,
"content":{
"content_type":"text",
"parts":[
"What happens when one of the three hydraulic systems of a plane airbus 320 fails?"
]
},
"end_turn":null,
"weight":1.0,
"metadata":{
"timestamp_":"absolute",
"message_type":null
},
"recipient":"all"
},
"parent":"<mapping_id>",
"children":[
"<children_id2>"
]
}
The first children entry is nested within the conversation by having the mapping_id as its parent and the second interaction (the answer from ChatGPT) as its child.
Children entries that correspond to a ChatGPT answer contain additional fields. For example, for the second interaction:
{
"id":"<children_id2>",
"message":{
"id":"<children_id2>",
"author":{
"role":"assistant",
"name":null,
"metadata":{
}
},
"create_time":1682368856.590211,
"update_time":null,
"content":{
"content_type":"text",
"parts":[
"The Airbus A320 aircraft is equipped with three independent hydraulic systems, each providing hydraulic power to different parts of the aircraft. The hydraulic systems are labeled as Green, Blue, and Yellow.\n\nIf one of the hydraulic systems of an Airbus A320 aircraft fails, the other two systems are designed to provide redundancy and ensure that the aircraft remains controllable. The remaining systems will automatically take over the functions of the failed system, so there should be no loss of control of the aircraft.\n\nHowever, the failure of one hydraulic system will result in a reduction of hydraulic pressure, which can affect the performance of some aircraft systems. For example, if the Green hydraulic system fails, some components like the slats, flaps, and some spoilers will be affected as they are powered by the Green system only. This can cause the aircraft to experience reduced maneuverability and increased drag, which can affect the ability of the aircraft to climb or maintain altitude.\n\nIn such a situation, the pilots will follow the procedures outlined in the aircraft's emergency checklist to manage the situation and ensure that the aircraft is flown safely to the nearest suitable airport for landing. The exact procedures will depend on the nature of the failure, but the pilots will prioritize maintaining control of the aircraft and ensuring the safety of the passengers and crew."
]
},
"end_turn":true,
"weight":1.0,
"metadata":{
"message_type":null,
"model_slug":"text-davinci-002-render-sha",
"finish_details":{
"type":"stop",
"stop":"<|im_end|>"
},
"timestamp_":"absolute"
},
"recipient":"all"
},
"parent":"<children_id>",
"children":[
"<children_id3>"
]
}
In the case of a ChatGPT answer, we also get information about the model behind ChatGPT and the stop sequence that ended the generation. The entry also shows the first children entry as its parent and the third children entry as the following interaction.
The full file can be found in this GitHub gist.
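Since every entry points to its parent and its children, the conversation can be rebuilt by following those links. The sketch below walks from current_node back to the root and reverses the result. It uses a toy mapping with placeholder IDs and texts, and it assumes the chain ends at a null or unknown parent; the function name linearize is hypothetical.

```python
# Toy stand-in for one entry of conversations.json; IDs and texts are placeholders.
conversation = {
    "current_node": "child-2",
    "mapping": {
        "root": {
            "id": "root",
            "message": {
                "author": {"role": "system"},
                "content": {"content_type": "text", "parts": [""]},
            },
            "parent": None,  # assumption: the chain ends at a null or unknown parent
            "children": ["child-1"],
        },
        "child-1": {
            "id": "child-1",
            "message": {
                "author": {"role": "user"},
                "content": {"content_type": "text", "parts": ["Question?"]},
            },
            "parent": "root",
            "children": ["child-2"],
        },
        "child-2": {
            "id": "child-2",
            "message": {
                "author": {"role": "assistant"},
                "content": {"content_type": "text", "parts": ["Answer."]},
            },
            "parent": "child-1",
            "children": [],
        },
    },
}


def linearize(conversation):
    """Walk parent links from current_node up to the root, oldest turn first."""
    mapping = conversation["mapping"]
    node_id = conversation["current_node"]
    turns = []
    while node_id in mapping:
        node = mapping[node_id]
        message = node.get("message")
        if message:
            # Keep only string parts; other part types are ignored in this sketch.
            parts = message["content"].get("parts", [])
            text = "".join(p for p in parts if isinstance(p, str))
            if text:  # skip empty placeholders such as the system node
                turns.append((message["author"]["role"], text))
        node_id = node.get("parent")
    turns.reverse()
    return turns


for role, text in linearize(conversation):
    print(f"[{role}]: {text}")
```

Starting from current_node rather than from the root means that only the branch ending at that node is reconstructed.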
Model Comparison
Have you ever used the “Regenerate response” button when you were not fully convinced by the response provided by ChatGPT?
This feedback information is also stored!
There is a last file named model_comparisons.json that contains snippets of the conversations and the consecutive attempts anytime ChatGPT regenerated the response. The information contains only the text, without the title, but includes some other metadata. Here is the basic structure of this file:
{
"id":"<id>",
"user_id":"<user_id>",
"input":{[+]},
"output":{[+]},
"metadata":{[+]},
"create_time": "<time>"
}
The metadata field contains some important information such as the country and continent where the conversation took place, and information about the https access schema, among others. The interesting part of this file comes in the input/output entries:
Input
The input contains a collection of messages from the original conversation. Interactions are labeled depending on the author and, as in the previous cases, some additional information is also stored. Let’s observe the messages stored for our sample conversation:
[system]: You are ChatGPT, a large language model trained by OpenAI, based on the GPT-3.5 architecture.\n Knowledge cutoff: 2021-09\n Current date: 2023-04-07.
[user]: What happens when one of the three hydraulic systems of a plane airbus 320 fails?
[assistant]: The Airbus A320 aircraft is equipped with three independent hydraulic systems, each providing hydraulic power to different parts of the aircraft. The hydraulic systems are labeled as Green, Blue, and Yellow […]
[user]: Do you know what pilots will do in case of a dual hydraulic failure?
[assistant]: In the event of a dual hydraulic failure on an Airbus A320 aircraft, the pilots will face a more challenging situation as all three hydraulic systems are affected, and there is no redundancy to fall back on […]
User/Assistant entries are expected, but I am sure at this point we are all wondering: why is there a system label?
And moreover, why do they feed an initial statement like this at the beginning of each conversation?
Is ChatGPT pre-fed with the current date in any new conversation?
Yes, those entries are the so-called system messages.
System Messages
System messages give overall instructions to the assistant. They help to set the behavior of the assistant. In the web interface, system messages are hidden from the user, which is why we do not see them directly.
The benefit of the system message is that it allows the developer to tune the assistant without making the request itself part of the conversation. System messages can be fed by using the API. For example, if you are building a car sales assistant, one possible system message could be “You are a car sales assistant. Use a friendly tone and ask questions to the users until you understand their needs. Then, explain the available cars that match their preferences”. You could even feed the list of vehicles, specifications, and prices so that the assistant can give this information too.
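To make the car sales example concrete, here is a minimal sketch of how such a message list could be assembled before sending it to the API. The role/content layout mirrors the system, user and assistant labels seen in the export; the helper build_messages is hypothetical, and the actual API call is intentionally left out.

```python
SYSTEM_PROMPT = (
    "You are a car sales assistant. Use a friendly tone and ask questions "
    "to the users until you understand their needs. Then, explain the "
    "available cars that match their preferences."
)


def build_messages(system_prompt, history, user_input):
    """Return a message list with the system message first."""
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(history)  # earlier user/assistant turns, if any
    messages.append({"role": "user", "content": user_input})
    return messages


messages = build_messages(
    SYSTEM_PROMPT,
    history=[],
    user_input="I need a small car for the city.",
)
for message in messages:
    print(f"[{message['role']}]: {message['content']}")
```

The end user only ever types the last message; the system entry is added by the developer behind the scenes.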
Output
The output entry contains the responses given by ChatGPT and the consecutive trials every time you hit the Regenerate response button:
{
"output":{
"feedback_version":"inline_regen_feedback:a:1.0",
"ui_feature_name":"inline_regen_feedback",
"ui_feature_variant":"a",
"ui_feature_version":"1.0",
"feedback_step_1":{[+]},
"feedback_step_2":{
"original_turn":[
{
"id":"<original_turn_id>",
"author":{[+]},
"create_time":1680877473.736083,
"update_time":null,
"content":{<original_response>},
"end_turn":true,
"weight":1.0,
"recipient":"all"
}
],
"new_turn":[
{
"id":"<new_turn_id>",
"author":{[+]},
"create_time":1680877502.81384,
"update_time":null,
"content":{<new_response>},
"end_turn":true,
"weight":1.0,
"recipient":"all"
}
],
"completion_comparison_rating":"new",
"new_completion_placement":"not-applicable",
"feedback_start_time":1680877456156,
"compare_step_start_time":1680877456156,
"new_completion_load_start_time":1680877456156000,
"new_completion_load_end_time":1680877502976,
"frontend_submission_time":1680877507949
}
}
}
As observed above, the feedback_step_1 entry stores information about the thumbs-up/thumbs-down feedback mentioned previously.
The regeneration information is stored in the feedback_step_2 entry, with the original response under the first subentry original_turn and the retried response under new_turn.
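As a small worked example, the two turns can be compared directly. This sketch uses only the fields shown above, trimmed to what is needed; it assumes the create_time values are in seconds, and the helper name regeneration_summary is hypothetical. What the rating value "new" means exactly is not documented in the export, so it is reported as-is.

```python
# Trimmed copy of the feedback_step_2 entry shown above.
feedback_step_2 = {
    "original_turn": [{"create_time": 1680877473.736083}],
    "new_turn": [{"create_time": 1680877502.81384}],
    "completion_comparison_rating": "new",
}


def regeneration_summary(step):
    """Report the stored rating and the delay between both responses."""
    original = step["original_turn"][0]
    new = step["new_turn"][0]
    # Assumption: create_time is expressed in seconds.
    delay = new["create_time"] - original["create_time"]
    return {
        "rating": step["completion_comparison_rating"],
        "seconds_between_turns": round(delay, 3),
    }


summary = regeneration_summary(feedback_step_2)
print(summary)
```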
And that is all the information OpenAI keeps about our interactions with ChatGPT! I think having an idea of which information is stored can be useful for two major purposes.
Firstly, in today’s data world, it is important to care about our privacy and be aware of the information that the platforms store and infer about us. Secondly, knowing the way information is structured and handled can help us in building customized models using ChatGPT as a starting point. For example, by looking into our own data, we learned that ChatGPT can be fed with system messages that steer the assistant towards the purpose we want, without the user seeing them.
Summary
In this article, we have reviewed the actions taken by OpenAI regarding users’ Data Privacy as a response to the concerns raised during the past months.
Both the possibility of turning off the chat history and the new feature to export your personal data anytime are clear steps towards protecting ChatGPT users. I personally see these steps as a commitment to prioritize data privacy by adhering to relevant data protection regulations. Transparency and security are key in building trust and ensuring responsible AI usage.
From our perspective as users, I think it is worth being aware of the options we have to manage our data privacy, especially these two new features, which control fundamental points: making sure your interactions with ChatGPT are not used for training if you don’t want them to be, and explicitly receiving the exact data a company holds about you.
Of course, there are other risks associated with the usage of this technology. For example, users should also be aware of data retention policies, that is, how long the platform retains the data, which ideally should be the minimum necessary. Understanding the intended use of the data you provide to the AI platform, and being informed about whether the platform shares your data with third parties and for what purpose, should also be among our main concerns.
By considering these factors, users can make informed decisions about their data privacy when using ChatGPT or any other Large Language Model.
It’s important to be proactive in understanding how your data is handled and taking steps to protect your privacy rights.
And that is all! Many thanks for reading!
I hope this article helps you understand the information ChatGPT keeps from our conversations, as well as manage the new OpenAI features for Data Privacy.
This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.