Module 3: Your Data and AI
In April 2023, a Samsung semiconductor employee uploaded faulty source code to ChatGPT to get debugging help. Another uploaded equipment optimisation code. A third transcribed a meeting and asked for the minutes to be written up. Three separate employees, within days of each other, each making a reasonable-sounding decision in isolation.
The result: proprietary semiconductor source code, manufacturing process details, and the content of a strategic internal meeting were fed into a commercial AI system. Samsung implemented a company-wide ban on generative AI tools the same week.
This is the data leakage problem. It is not dramatic. It does not involve attackers. It is the ordinary, daily risk of using AI tools in a way that is technically convenient but strategically reckless.
What AI Vendors Actually See
When you type into ChatGPT, Claude, Gemini, or any similar tool, the text you enter goes to that company’s servers. This is obvious when you think about it, but people often behave as though the chat window is private in the way a local application is private. It is not.
What varies significantly — and matters enormously — is what happens to that data after it arrives.
Free tiers typically use your conversations for model training unless you specifically opt out. OpenAI’s free tier collects conversation data for training by default. The opt-out exists but requires finding it in settings. If you have never looked for it, you have not opted out.
Paid personal tiers (ChatGPT Plus, Claude Pro) generally offer better data handling. OpenAI’s paid tiers do not use your conversations for training by default. Anthropic states clearly that Claude Pro conversations are not used for training. This does not mean the data is not stored — it means it is not used in that specific way.
Enterprise and Team tiers are a different category. These typically include zero-retention agreements (conversations are not stored beyond the session), explicit data processing agreements under GDPR or equivalent regulation, and contractual commitments about how data is handled. If your organisation is using AI tools for work involving client information, confidential strategy, or regulated data, this is the appropriate tier.
Local models are a fourth option worth understanding. Tools like Ollama allow you to run AI models entirely on your own hardware. Nothing leaves your machine. The tradeoff: the models are smaller and less capable than the frontier models, but for many tasks — summarising documents, drafting text, answering questions about your own files — they are entirely adequate. For handling genuinely sensitive material, local models deserve serious consideration.
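To make the "nothing leaves your machine" point concrete, here is a minimal sketch of calling a locally running Ollama server from Python. It assumes Ollama is installed and serving on its default port (11434) with a model such as `mistral` pulled; the prompt wording and function names are illustrative, not part of any official client.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(prompt, model="mistral"):
    """Build the JSON payload for a non-streaming Ollama generate call."""
    return {"model": model, "prompt": prompt, "stream": False}

def summarise_locally(text, model="mistral"):
    """Send a summarisation prompt to a locally running Ollama server.

    The request goes to localhost only, so the document text is never
    transmitted to a third-party service.
    """
    payload = build_request(f"Summarise the following document:\n\n{text}", model)
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Because the endpoint is local, this pattern works offline and leaves no vendor-side record of the document's contents.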
The “We Don’t Train on Your Data” Question
AI vendors are aware that data handling is a concern, so their communications tend to emphasise the most favourable interpretation of their policies. “We don’t train on your data” sometimes means exactly that. Sometimes it means “we don’t train on your data on this specific tier” — implying other tiers operate differently. Sometimes it means “we don’t use your data for model training but may use it for safety monitoring, fraud detection, or improving our systems in other ways.”
Anthropic is notably transparent: their privacy policy distinguishes clearly between what is collected, what is retained, and what is used for training across different tiers. OpenAI’s policy is more complex, with different rules for different products. Google Workspace integration with Gemini has separate data handling rules from consumer Gemini.
The practical question to ask for any AI tool you use for work is: what is the data processing agreement? If there is no clear answer, you are probably on a free or consumer tier. Anything sensitive should not be going there.
The Shadow IT Problem
Cisco’s 2024 Data Privacy Benchmark Study, which surveyed 2,600 privacy and security professionals, found that 48% of employees admit entering non-public company information into generative AI tools. Separately, 43% use AI tools at work without telling their employer.
This is shadow IT: technology being used for work purposes outside the organisation’s visibility or control. It is not new — people have been using personal email for work files and personal cloud storage for years. What is new is the scale of data that AI tools can absorb in a single session.
Pasting a client proposal into ChatGPT to improve the writing. Summarising a confidential contract. Asking an AI to analyse a spreadsheet containing customer data. Each of these feels like a productivity shortcut. Each of these may represent a data breach depending on the platform used, the data classification of the information, and the regulatory obligations the organisation is subject to.
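One mitigating habit, when pasting into any cloud tool is unavoidable, is scrubbing obvious identifiers first. The sketch below is illustrative only: two regexes are nowhere near a real data-loss-prevention pass, and the patterns shown are simplistic by design.

```python
import re

# Illustrative patterns only: real redaction needs a proper PII/DLP
# tool, not two regexes. This sketch just demonstrates the habit of
# scrubbing obvious identifiers before pasting text into a chat window.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text):
    """Replace obvious emails and phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

print(redact("Contact Jane Doe at jane.doe@example.com or +44 20 7946 0958."))
# → Contact Jane Doe at [EMAIL] or [PHONE].
```

Note that names, account numbers, and context ("our largest client in Leeds") survive this pass untouched, which is exactly why redaction alone does not make a consumer tier safe for confidential material.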
Healthcare is the clearest example. Five hospitals in Western Australia banned ChatGPT in May 2023 after discovering staff had used it to write private medical notes. Netskope research from 2024 found 71% of healthcare workers still use personal AI accounts for work, and 81% of data policy violations in healthcare involve regulated health data. Most consumer AI tools do not meet HIPAA compliance standards and do not sign Business Associate Agreements — which means patient data entered becomes data the vendor controls.
The legal profession learned a related lesson in June 2023. Attorneys Steven Schwartz and Peter LoDuca submitted a legal brief to a US district court containing fabricated case citations — real-looking cases that did not exist, generated by ChatGPT. Judge Kevin Castel fined them $5,000 collectively. More damaging was the reputational harm to their firm, which was described as severe. A survey of legal professionals found that 11% of data entered by law firm staff into ChatGPT involved information protected by attorney-client privilege.
Can Data Already Entered Be Extracted?
Yes. Carnegie Mellon University and Google DeepMind researchers demonstrated in November 2023 that ChatGPT can be prompted to reproduce verbatim training data. With a $200 research budget, they extracted over 10,000 unique examples of training data including real email signatures with personal contact information, phone numbers, news articles, Stack Overflow source code, and copyrighted legal disclaimers. They estimated adversaries could extract ten times more data with additional queries.
If information you entered into a free-tier AI tool was used in training, that information potentially exists in the model and can potentially be extracted. This is a different risk from a conventional data breach — it is diffuse and hard to quantify — but it is real.
API Keys and Credentials: A Specific Hazard
For developers and anyone who uses AI via API access rather than the chat interface, credentials present a specific, quantifiable risk. In 2024 alone, 39 million API keys and credentials were exposed on GitHub. Over 50,000 publicly leaked OpenAI API keys appeared on GitHub in that period.
The consequences are direct: an attacker with your OpenAI API key can run queries charged to your account, consuming your credits or running up charges. More seriously, API keys often come with the same permissions as the user who created them — meaning a compromised key can expose anything that key has access to.
A researcher at GitGuardian found that tools scanning GitHub for exposed OpenAI API keys can complete a full sweep in under two minutes. If you have ever committed a configuration file containing an API key to a repository, even briefly before deleting it, that key has likely been seen.
Specific rules for API key management are in Module 4’s checklist. The conceptual point: treat API keys as credentials equivalent to passwords. They should not appear in code repositories, chat messages, or any publicly visible file.
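A lightweight version of that habit can be automated. The sketch below checks text for OpenAI-style key prefixes before it is committed or shared; the pattern is an approximation (key formats vary and change over time), and serious scanning should use a dedicated tool such as gitleaks or GitGuardian rather than one regex.

```python
import re

# Approximate pattern for OpenAI-style secret keys ("sk-" prefix).
# Key formats vary and change; this is a rough pre-commit tripwire,
# not a substitute for a dedicated secret scanner.
KEY_PATTERN = re.compile(r"\bsk-[A-Za-z0-9_-]{20,}\b")

def find_suspect_keys(text):
    """Return any substrings of `text` that look like API keys."""
    return KEY_PATTERN.findall(text)

# Hypothetical leaked value in a config file:
sample = 'OPENAI_API_KEY = "sk-' + "x" * 24 + '"'
if find_suspect_keys(sample):
    print("refusing to commit: possible API key found")
```

Wired into a pre-commit hook, even a crude check like this catches the most common failure mode: a key pasted into a config file and pushed before anyone notices.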
When to Use Local Models
The threshold for considering local models is lower than most people assume.
If you are:
- Processing documents that contain client names, contact details, or project specifics
- Working with any data covered by GDPR, HIPAA, or sector-specific regulations
- Handling information your organisation would be uncomfortable seeing in a data breach report
- Using AI to work with proprietary methods, pricing data, or competitive intelligence
Then a local model deserves serious consideration for that specific use case. Tools like Ollama with models like Mistral or Llama run adequately on a modern laptop. They are not as capable as GPT-4 or Claude 3.5. For many tasks — summarisation, drafting, answering questions about documents you provide — the capability difference does not matter for the job at hand.
The practical workflow is: use frontier models (Claude, GPT-4, Gemini) for tasks involving no sensitive data or already public information, and use local models for tasks where data sensitivity is a concern. Both can be running simultaneously and the choice becomes habitual quickly.
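That routing habit can be made explicit. The sketch below is a hypothetical illustration, not a real policy engine: the sensitivity tags and model names are placeholders, and in practice the classification decision belongs in your organisation's data handling policy.

```python
# Hypothetical routing habit: classify the task's data first, then pick
# the model tier. Tags and model names are illustrative placeholders.
SENSITIVE = {"client", "patient", "contract", "pricing", "credentials"}

def choose_model(data_tags):
    """Route to a local model if any tag marks the data as sensitive."""
    if SENSITIVE & set(data_tags):
        return "local:mistral"    # stays on your own hardware
    return "frontier:claude"      # fine for public or non-sensitive work

print(choose_model(["public", "blog-draft"]))  # → frontier:claude
print(choose_model(["client", "proposal"]))    # → local:mistral
```

The value of writing the rule down, even this crudely, is that the default becomes "classify first, then choose", rather than "paste into whichever window is open".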
Module 4 is the practical one: specific things to do, check, and establish based on everything covered so far.
Check Your Understanding
Answer all questions correctly to complete this module.
1. What is the 'shadow IT' problem in the context of AI tools?
2. When should you consider using local models like Ollama instead of cloud-based AI?
3. What did Carnegie Mellon and Google DeepMind researchers demonstrate about ChatGPT's training data?