Will AI share my inputs with other users?

Introduction

Anytime we put information into software or an app, it’s a good idea to consider where our data flows. Unless you’re a software engineer, it may not be possible to trace exactly how computer systems communicate with one another. However, there are still some common-sense steps we can take to understand information security.

When it comes to generative AI, the same cybersecurity rules still apply. We need strong passwords, trust in the software companies we use, and vigilance against bad actors. Additionally, understanding how these systems work can help you use them safely and responsibly.

Key Concepts

  1. Data Flow and Cybersecurity Awareness: Understanding where and how your data is transmitted and processed when you input it into software or apps, and the importance of maintaining strong passwords, trusting reputable software companies, and staying vigilant against cyber threats.
  2. AI Training vs. Usage: Differentiating between the process of training an AI, which involves feeding large amounts of data into the system to improve its performance, and using a pre-trained AI, where the AI applies what it has learned to new inputs without further data collection.
  3. Terms of Service Differences: Understanding how the terms of service differ between free and paid AI software, and what each plan allows the company to do with your data.

Where does my data go?

When we first started using computers, before they were connected to the internet, the information we entered stayed on them unless we exported it. As we moved online, using email and migrating to the cloud, it became increasingly important to ensure the entire chain of systems passing our information around the web is secure.

The modern internet is a complex ecosystem of digital systems. Often, when we’re interacting with a single service provider or app, behind the scenes a network of different software systems is performing an array of functions to deliver what we see on the screen. Most reputable software companies have a vested interest in maintaining a secure experience for users, and they do a lot to make sure the data flowing in and out of their systems does so without leaks.

To keep their systems secure, software companies often implement a variety of measures such as encryption, regular security audits, and multi-factor authentication. Encryption ensures that even if data is intercepted, it cannot be read without the proper decryption key. Security audits help identify and fix vulnerabilities before they can be exploited by bad actors. Multi-factor authentication adds an extra layer of security by requiring users to provide two or more verification factors to gain access.
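To make the idea of encryption more concrete, here is a minimal sketch in Python using the widely available cryptography library. The message and variable names are purely illustrative; the point is simply that data scrambled with a secret key is unreadable to anyone who intercepts it without that key.

```python
# A minimal illustration of symmetric encryption using the "cryptography" library.
# Install with: pip install cryptography
from cryptography.fernet import Fernet

# The service holds a secret key; anyone without it cannot read the data.
key = Fernet.generate_key()
cipher = Fernet(key)

message = b"My private note to the app"
ciphertext = cipher.encrypt(message)

print(ciphertext)                   # Unreadable bytes if intercepted in transit
print(cipher.decrypt(ciphertext))   # Only the key holder recovers the original text
```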

On the user side, it’s crucial to use strong, unique passwords for different services, enable multi-factor authentication when available, and stay informed about the best practices for cybersecurity.

Will my inputs be used to train AI?

When talking with someone, we know much of what we say will be remembered, at least in the short term. It makes sense to infer that when interacting with an AI chatbot capable of mimicking human conversation and thought, the system might also store what we tell it. While some AI systems have started to implement memory features to better serve users, let’s first step back and understand how data is used in two distinct phases: training an AI and using it.

AI training data

Modern large language models, a type of AI capable of understanding and producing sophisticated text dialogues and other forms of digital content, are first created by assembling a massive set of training data. For instance, an AI model capable of interacting in English is trained by analyzing billions of samples of English writing. During training, researchers guide the model to recognize patterns in the text: how words relate to one another and the probability that they occur together.
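Real language models learn these patterns with neural networks at an enormous scale, but a toy sketch can convey the intuition. The Python snippet below uses a made-up three-sentence "corpus" and simply counts which word tends to follow which, which is the simplest possible version of learning the probability of co-occurrence.

```python
from collections import Counter, defaultdict

# A tiny stand-in for training data; real models analyze billions of samples.
corpus = "the cat sat on the mat . the dog sat on the rug . the cat chased the dog ."

# Count how often each word follows each other word (a simple "bigram" pattern table).
follows = defaultdict(Counter)
words = corpus.split()
for current_word, next_word in zip(words, words[1:]):
    follows[current_word][next_word] += 1

print(follows["the"])   # "the" is most often followed by "cat" or "dog" in this tiny corpus
print(follows["sat"])   # "sat" is always followed by "on"
```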

AI inputs

Even though a huge amount of data is used in the training phase, it's important to understand that once training is complete, this type of AI model does not reference its training data when producing outputs. This is perhaps the most counterintuitive feature because we are used to interacting with people who remember our conversations and using search engines that scour the internet for specific information. It's difficult to imagine how AI language models can interact with us so effectively without memory or referencing information.

As unlikely as it may seem, large language models don't need any stored information to produce their remarkable outputs. Instead, using the patterns and probabilities learned during training, they generate text and other content by mimicking those patterns and predicting the next word or pixel in a series.
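Continuing the toy sketch from above (again, only an illustration of the idea, not how real models are built), once the pattern table has been extracted, the original text is no longer needed. Generation is just repeated next-word prediction from the learned probabilities.

```python
import random

# A tiny "learned" pattern table like the one built in the earlier sketch.
# Keys are words; values record how often each word followed them during training.
follows = {
    "the": {"cat": 2, "dog": 2, "mat": 1, "rug": 1},
    "cat": {"sat": 1, "chased": 1},
    "dog": {"sat": 1, ".": 1},
    "sat": {"on": 2},
    "on": {"the": 2},
    "chased": {"the": 1},
    "mat": {".": 1},
    "rug": {".": 1},
    ".": {"the": 2},
}

def generate(start_word, length=8):
    # Produce text without ever looking back at the original training corpus.
    word = start_word
    output = [word]
    for _ in range(length):
        options = follows.get(word)
        if not options:
            break
        # Choose the next word in proportion to how often it followed this one.
        word = random.choices(list(options), weights=list(options.values()))[0]
        output.append(word)
    return " ".join(output)

print(generate("the"))  # e.g. "the dog sat on the cat chased the dog"
```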

As mentioned above, some popular AI chatbots have added a memory feature that selectively stores information about user interactions for a more seamless experience. For example, if a user asks for help writing a birthday message for a relative, the system may store details about the occasion and use this information the next time the user mentions the same relative. This memory, however, is separate from the training phase of the AI. It does not alter the underlying AI model and is specific to the user.
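A rough sketch of how such a memory feature could be structured (a simplified, hypothetical design, not any particular product's implementation): the saved notes live in a small per-user store outside the model itself, so adding or deleting them never changes the underlying model.

```python
# Hypothetical sketch: per-user memory kept separate from the AI model.
# The model's learned parameters stay fixed; only this small store changes per user.
user_memories = {}

def remember(user_id, note):
    user_memories.setdefault(user_id, []).append(note)

def build_prompt(user_id, new_message):
    # Saved notes are simply added to the prompt; the underlying model is untouched.
    notes = "\n".join(user_memories.get(user_id, []))
    return f"Known about this user:\n{notes}\n\nUser says: {new_message}"

remember("user_123", "Sister's birthday is June 12")
print(build_prompt("user_123", "Help me write her a birthday card"))
```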

Why are some AI models seemingly plagiarizing other people’s work?

It is true that many users have been able to get mainstream AI language models to produce text and imagery that closely resembles the protected intellectual property of other people and organizations. For the most part, this is likely because the work in question was strongly represented in the AI model’s training data, meaning there was a lot of it. The more often a given piece of content appears in the training data, the more influence it has on the resulting model.

One way to understand this is to examine a specific trait of AI language models: they handle simple mathematics well but struggle with more complex equations. This isn't due to the difficulty of the math problems themselves but rather an artifact of how often correct examples of these problems and their solutions appear in the training data.

Across the internet, many people have written about 2 + 2 equaling 4, and rarely anything else, so a large language model will almost always predict the correct solution. However, it is unlikely that the equation 343,626,247.3 / 97,334.22 appears in the AI’s training data. This means it is highly likely the predicted "answer" will be incorrect. The AI is not solving the equation; it is trying to predict the answer based on similar equations in its training data.
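The contrast is easy to see with ordinary software: a calculator, or a single line of code, computes the exact quotient because it executes the arithmetic rather than predicting likely-looking digits. (The value in the comment is simply what the division works out to, rounded.)

```python
# Ordinary software computes the answer exactly; it does not predict it.
result = 343_626_247.3 / 97_334.22
print(round(result, 2))  # roughly 3530.37
```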

For any digital content widely available on the internet, such as writing by famous authors or media organizations, AI systems can more easily mimic these assets because they have been trained on numerous examples. Notably, many of these individuals and companies have filed lawsuits against leading AI companies, alleging infringement of their intellectual property. This has opened up a new field of legal analysis in the age of AI.

How can I make sure my data isn’t being used to train AI?

For decades, a dominant business model online has been the exchange of free services for user data. From search engines to email to social media, companies have provided their applications free of charge in exchange for the ability to use people’s data to make money.

This business model is also prevalent among generative AI companies and is one area where people should pay close attention to the terms of service they agree to upon signing up to use a free service.

In some cases, the terms will indicate that the AI company reserves the right to use user activity to tune its AI models and improve its products. In this scenario, it is possible that data a user puts into the system could be used to update an AI model. While this does not necessarily mean the data will then be made accessible to or shared with other users, it is an important distinction to be mindful of.

In all cases, it’s a good idea to analyze the terms of service (you can even ask an AI chatbot to help you!) and to give yourself a strong understanding of the data and privacy policies involved. When using free versions of any software, including AI, it’s advisable to assume everything you put into the system is fair game for the software provider to use for its own, legally allowable purposes. If your work requires extra precautions, there are generative AI systems being developed to operate within a closed, more secure environment, and it would be worthwhile to research some of these solutions.

Conclusion

Understanding the flow of your data in the context of generative AI is essential for ensuring your information remains secure. By recognizing the importance of strong passwords, trusted software providers, and vigilance against cyber threats, you can better protect your data. Additionally, differentiating between AI training and usage phases helps clarify how your inputs are utilized and the potential risks involved. While AI models do not reference their training data for generating outputs, some systems may store user-specific information to enhance interactions, which underscores the importance of being aware of such features.

Moreover, the prevalent business model of exchanging free services for user data necessitates a close examination of terms of service agreements. When using free AI services, it is crucial to understand that your data might be employed for AI model tuning and product improvement. This awareness allows you to make informed decisions about the tools you use and the data you share. For those requiring heightened security, exploring closed-environment AI solutions may be beneficial. By following these guidelines, you can navigate the complexities of generative AI responsibly and securely.