Foundation models, the bedrock of much-hyped generative AI, are data-hungry. If businesses want to differentiate themselves, they must feed these models proprietary information, including customer and corporate data. But doing so can expose this sensitive material to the outside world – and the bad actors operating in it – potentially contravening the General Data Protection Regulation (GDPR) in the process.
Dr Sharon Richardson, technical director and AI lead at engineering firm Hoare Lea, sums up the situation: “From day one, these models were a very different beast from a security standpoint. It’s hard to bake security into the neural network itself because its strength comes from hoovering up millions of documents. This is not a problem we’ve solved.”
The Open Worldwide Application Security Project, a not-for-profit foundation working to improve cybersecurity, cites data leakage as one of the most significant threats to the large language models (LLMs) on which most GenAI tech is based. This risk drew considerable public attention last year when employees at Samsung accidentally released sensitive corporate information via ChatGPT.
Scrutinise your data inputs
Safeguarding data takes on a new dimension with the latest GenAI tools, because it is hard to control how they process the information they ingest. Training data can be exposed as these systems work to organise unstructured material, which is why some businesses are focusing their efforts on securing inputs. Swiss menswear company TBô, for instance, carefully labels and anonymises customer information before feeding it into its model.
“You want to ensure that your AI doesn’t know things it’s not supposed to know,” advises Allan Perrottet, the firm’s co-founder. “If you don’t prepare your data properly and just throw it straight at OpenAI, Gemini or any of these tools, you’re going to have issues.”
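For illustration, a preparation step of that kind might look like the minimal Python sketch below, which masks direct identifiers before a customer record is sent to any external model. The field names and masking rules are hypothetical assumptions for the example, not a description of TBô's actual pipeline.

```python
import hashlib
import re

# Hypothetical customer record; the field names are illustrative only.
record = {
    "customer_id": "C-10482",
    "name": "Alex Mercer",
    "email": "alex.mercer@example.com",
    "purchase_history": "3x boxer briefs, 1x lounge set",
}

def pseudonymise(value: str) -> str:
    """Replace an identifier with a stable, non-reversible token."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]

def prepare_for_model(rec: dict) -> dict:
    """Strip or mask direct identifiers before the record leaves the business."""
    return {
        # A hashed token lets records be linked without revealing who they belong to.
        "customer_token": pseudonymise(rec["customer_id"]),
        # Free text is scanned for stray email addresses before it is sent anywhere.
        "purchase_history": re.sub(r"\S+@\S+", "[EMAIL]", rec["purchase_history"]),
    }

print(prepare_for_model(record))  # the name and email never reach the model
```

The design principle is simple: anything the model does not need to see is removed or replaced before the data leaves the company's own systems.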
Smart organisations are taking a multi-pronged approach to managing the risk. One measure is permissions-based access for specific GenAI tools, under which only authorised people can view classified data outputs. Another is differential privacy, a statistical technique that allows aggregated data to be shared while protecting the privacy of the individuals in it. A third is to feed models pseudonymised, encrypted or synthetic data, using tools that can randomise data sets effectively.
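To make the differential-privacy idea concrete, the sketch below adds calibrated Laplace noise to a single aggregate count before it is shared, one of the simplest textbook mechanisms. The query, the count and the epsilon value are assumptions chosen for illustration rather than any vendor's implementation.

```python
import numpy as np

def dp_count(true_count: int, epsilon: float = 1.0, sensitivity: float = 1.0) -> float:
    """Release a count with Laplace noise scaled to sensitivity / epsilon.

    Adding or removing one individual changes the count by at most `sensitivity`,
    so this noise scale gives epsilon-differential privacy for this single query.
    """
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Example: share roughly how many customers bought a product
# without the released figure exposing any one of them.
print(dp_count(true_count=1_283, epsilon=0.5))
```

A smaller epsilon means more noise and stronger privacy; the trade-off is that the shared aggregate becomes less precise.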
Data minimisation is vital, stresses Pete Ansell, chief technology officer at IT consultancy Privacy Culture.
“Never push more data into the large language model than you need to,” he advises. “If you don’t have really mature data-management processes, you won’t know what you’re sending to the model.”
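A minimal sketch of that data-minimisation advice might look like the following: only fields on an explicit allow-list ever make it into a prompt, and everything else is withheld. The field names and the task are hypothetical, invented for the example.

```python
# Only fields on this allow-list are ever included in a prompt.
ALLOWED_FIELDS = {"order_status", "delivery_window", "product_category"}

def build_prompt(task: str, record: dict) -> str:
    """Construct a prompt from the minimum set of fields the task needs."""
    minimal = {k: v for k, v in record.items() if k in ALLOWED_FIELDS}
    withheld = sorted(set(record) - ALLOWED_FIELDS)
    if withheld:
        print(f"Withheld from the model: {withheld}")
    return f"{task}\n\nContext: {minimal}"

prompt = build_prompt(
    "Draft a delivery update for the customer.",
    {
        "order_status": "dispatched",
        "delivery_window": "2-3 days",
        "product_category": "loungewear",
        "home_address": "14 Elm Road",   # never sent
        "card_last_four": "4242",        # never sent
    },
)
print(prompt)
```

The allow-list forces the question Ansell raises: if you cannot say which fields a task actually needs, your data-management processes are not yet mature enough to be sending that data to a model.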
Retrieval-augmented generation
Understanding the attack surface that an LLM might expose is also important, which is why retrieval-augmented generation (RAG) is growing in popularity. This is a process in which LLMs reference authoritative data that sits outside the training sources before generating a response.
RAG users don’t share vast amounts of raw data with the model itself. Access is via a secure vector database – a specialised storage system for multi-dimensional data. A RAG system will retrieve sensitive information only when it’s relevant to a query; it won’t hoover up countless data points.
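The retrieval step can be illustrated with a minimal sketch, assuming a toy in-memory store and placeholder embeddings; in a real deployment the vectors would come from an embedding model and sit in a dedicated vector database behind access controls.

```python
import numpy as np

# Toy corpus; in practice the embeddings would come from an embedding model
# and be held in a secured vector database.
documents = [
    "Refund policy: customers may return items within 30 days.",
    "Internal pricing guidance for the current product line.",
    "Data retention schedule for customer support transcripts.",
]

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model (deterministic per input)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(384)

doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query by cosine similarity."""
    q = embed(query)
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

# Only the retrieved passages, never the whole corpus, are placed in the prompt.
question = "What is the refund window?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```

The security benefit comes from what stays out of the prompt: the model only ever sees the handful of passages judged relevant to the query, while the full library remains on the business's own infrastructure.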
“RAG is really good from the perspectives of both data security and intellectual property protection, since the business retains the data and the library of information the LLM is referencing,” Ansell says. “It’s a double win, ensuring that your strategic assets are kept closer to home.”
But he adds that “best practice around identifiable personal information and cybersecurity should also apply to business-level data”.
Such techniques don’t just protect sensitive material from cybercriminals. They also enable businesses to lift and shift learning from one LLM to another since, in practice, it’s not possible to trace the data back to its original source.
There is no doubt that the data security problems posed by training LLMs come down to data maturity: managing information assets with the utmost integrity. In many ways, the issues surrounding GenAI are the challenges of GDPR compliance on steroids.
“If GDPR’s the big stick, the race to utilise AI is a big carrot,” Ansell says.
Other measures that a business can take to improve its AI-related data security include creating a multi-disciplinary steering group, conducting impact assessments, providing AI awareness training and keeping humans in the loop on all aspects of model development.
Is open source the answer?
One of the biggest challenges facing the sector is that sensitive corporate data must still leave local servers to be processed in the cloud, at data centres owned by the tech giants that control most of the popular AI tools.
“For a brief moment, data can be sitting on a server outside your control, which is a potential security breach. There’s still a weakness there,” Richardson says. “The reality is that we’re still in the Wild West phase when it comes to GenAI. There will be unintended consequences. You may think you’ve got it under control, but you probably haven’t.”
This is why open-source models are becoming increasingly popular. They enable IT teams to audit LLMs independently, spot security flaws and have them rectified by the developer community.
Yash Raj Shrestha, assistant professor at HEC Lausanne, argues that open-source AI is “more secure and trustworthy than closed-source AI. That’s because, when things are open, a large number of people can work together to find bugs, which can then be fixed. It’s the future.”