I’m interested in developing a program that can do a comprehensive job of cleansing potentially security-sensitive data before it is pasted into a generative AI tool. ChatGPT provides the following advice on how to secure code input into generative AI:
Question to ChatGPT:
How can I guarantee the security of code input into generative AI for
various purposes, especially when dealing with proprietary code?
Additionally, what sensitive information should I eliminate from the
code to reduce security vulnerabilities?
Answer:
Ensuring the security of code inputs into generative AI involves
several key practices:
- Data Cleansing: Remove sensitive information such as passwords, API
keys, IP addresses, organizational names, and any other proprietary or
confidential data. This minimizes the risk of exposing critical
information.
- Secure Storage: Store sensitive data separately in secure
vaults or credential stores, rather than embedding it directly into
the code. This prevents unauthorized access to confidential
information.
- Anonymization: Anonymize or obfuscate any identifiers,
such as database names, entity names, or internal terminology, that
could reveal sensitive details about your organization or
infrastructure.
- Review and Validation: Thoroughly review the output
generated by the generative AI tool for any potential security
vulnerabilities. Validate the code to ensure it complies with security
best practices and does not introduce new risks.
- Continuous Monitoring: Implement mechanisms for monitoring and auditing code
inputs and outputs to detect and address any security issues that may
arise over time.

By following these practices and consistently
evaluating the security of code inputs and outputs, you can mitigate
the risks associated with using generative AI for proprietary code and
enhance overall security.
This is a good summary. ChatGPT also generated this regex for helping to cleanse data, which is useful but fairly limited:
\b(?:password|secret|api[_ ]?key|token|ip[_ ]?address|organization|db[_ ]?name|entity|private[_ ]?info)\b\s*=\s*["'].*?["'];
ChatGPT explanation:
Overall, this regex pattern aims to identify assignments of sensitive
elements in code, such as password = "abc123", api_key='xyz', etc.
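To see the regex in action, here is a minimal Python sketch that applies it to a snippet of code. The backslashes (`\b`, `\s`) are restored as in the pattern above; the trailing semicolon is made optional as a small tweak, and the `<REDACTED>` placeholder is my own choice, not part of ChatGPT's suggestion.

```python
import re

# The regex suggested by ChatGPT, compiled case-insensitively.
# The final ";?" makes the trailing semicolon optional so that
# languages without semicolons are also covered.
PATTERN = re.compile(
    r"\b(?:password|secret|api[_ ]?key|token|ip[_ ]?address|"
    r"organization|db[_ ]?name|entity|private[_ ]?info)\b"
    r"\s*=\s*[\"'].*?[\"'];?",
    re.IGNORECASE,
)

def redact(code: str) -> str:
    """Replace each matched sensitive assignment with a placeholder."""
    return PATTERN.sub("<REDACTED>", code)

sample = 'api_key = "xyz";\nuser = "alice";'
print(redact(sample))
```

As noted, this only catches simple quoted assignments; secrets passed as function arguments, stored in config files, or built up from fragments slip straight through.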
I want to expand on this by developing a program that performs a basic initial cleanse (a more capable program would also identify insecure code patterns, but in the first instance I’m interested in the basic cleanse). I’ve thought of the following items that could be reliably searched for and potentially removed or replaced. Is there anything else that could be included?
- Secrets, passwords, API keys
- IP addresses or any other addressing information that may be intended only for internal sharing
- A given set of terms that should be removed/replaced
- Any addresses of APIs or accessible resources that may not be publicly available
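The four categories above could be sketched as a single cleansing pass. This is a rough outline under several assumptions: the placeholder tokens, the internal-hostname suffixes (`.internal`, `.local`, `.corp`, `.intranet`), and the caller-supplied term list are all illustrative choices, not a standard.

```python
import ipaddress
import re

# 1. Secrets, passwords, API keys assigned as quoted literals.
SECRET_RE = re.compile(
    r"\b(?:password|passwd|pwd|secret|api[_-]?key|token|auth)\b"
    r"\s*[:=]\s*[\"'][^\"']+[\"']",
    re.IGNORECASE,
)

# 2. IPv4 addresses; only private/internal ranges are redacted.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# 4. URLs whose hostnames look internal (heuristic suffix list).
URL_RE = re.compile(r"https?://[^\s\"']+")
INTERNAL_HOST_RE = re.compile(r"\.(?:internal|local|corp|intranet)\b|localhost")

def cleanse(text: str, terms: list[str]) -> str:
    """Apply the four basic cleansing passes to a block of text."""
    text = SECRET_RE.sub("<SECRET>", text)

    def redact_ip(m: re.Match) -> str:
        try:
            ip = ipaddress.ip_address(m.group())
        except ValueError:
            return m.group()  # e.g. 999.1.1.1 is not a real address
        return "<INTERNAL_IP>" if ip.is_private else m.group()

    text = IPV4_RE.sub(redact_ip, text)

    def redact_url(m: re.Match) -> str:
        return "<INTERNAL_URL>" if INTERNAL_HOST_RE.search(m.group()) else m.group()

    text = URL_RE.sub(redact_url, text)

    # 3. Caller-supplied terms (organization names, project codenames, ...).
    for term in terms:
        text = re.sub(re.escape(term), "<TERM>", text, flags=re.IGNORECASE)
    return text
```

For example, `cleanse('password = "abc123" on 10.0.0.5', ["AcmeCorp"])` would redact both the assignment and the private IP, while a public address such as 8.8.8.8 would be left alone. Using distinct placeholders per category (rather than one generic token) keeps the cleansed code readable enough for the AI tool to still reason about it.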