AI Data Privacy: Classify and Encrypt Data using CIPH3R FPE before Integrating with Gen AI
- Peter
- Architecture, Gen AI, Application
- March 11, 2024
Before integrating Generative AI into your organization, establish and implement an AI Use Policy. The policy should define which internal data AI models may access and guide the integration process, particularly where Personally Identifiable Information (PII) is involved.
One approach is to set up a sandbox environment where data is segregated and which serves as the gateway to Large Language Model (LLM) services. In addition, some use cases require sensitive data to remain under the company's direct control, housed in a trusted environment. This is where a Generative AI technique known as Retrieval Augmented Generation (RAG) fits in: it pulls external knowledge from your databases into the prompt, making the results more precise, domain-specific, and up to date.
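To make the flow concrete, here is a minimal, framework-agnostic sketch of the RAG pattern described above. The helper objects (vector_store, llm_client) and their methods are hypothetical placeholders rather than a specific library API; the point is that the knowledge base stays in your trusted environment and only the retrieved snippets travel to the LLM service.

def answer_with_rag(question, vector_store, llm_client, top_k=4):
    # 1. Retrieve the most relevant internal documents for the question.
    snippets = vector_store.similarity_search(question, k=top_k)
    # 2. Ground the prompt in the retrieved context.
    context = "\n\n".join(doc.text for doc in snippets)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    # 3. Only the prompt (question + retrieved snippets) leaves the trusted environment.
    return llm_client.complete(prompt)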
When working with company data, take careful steps before handing it to chatbots or using it to train generative AI models. Classify Personally Identifiable Information (PII) and choose suitable data encryption or masking methods. Developers must not feed AI algorithms PII, Highly Sensitive Personally Identifiable Information (HSPII), or copyrighted data/intellectual property. In some use cases masking is not feasible because it destroys the contextual integrity of the data; in those cases, Format Preserving Encryption (FPE) is the better choice. The advantages of FPE are elaborated below.
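To see why FPE preserves contextual integrity where plain masking does not, here is a small illustration using the open-source pyffx package purely as a stand-in for CIPH3R's FPE engine; the key, field lengths, and sample values below are made up.

import pyffx

key = b"example-secret-key"

# A 9-digit identifier stays a 9-digit number after encryption, so downstream
# schemas and validations keep working, and the value can be decrypted later.
id_cipher = pyffx.Integer(key, length=9)
encrypted_id = id_cipher.encrypt(123456789)
original_id = id_cipher.decrypt(encrypted_id)   # round-trips back to 123456789

# A lowercase string maps to another string over the same alphabet and length.
name_cipher = pyffx.String(key, alphabet="abcdefghijklmnopqrstuvwxyz", length=6)
encrypted_name = name_cipher.encrypt("alicex")

# Masking, by contrast, is irreversible and often breaks the format as well.
masked_id = "XXXXX6789"

print(encrypted_id, encrypted_name, masked_id)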
Let me walk you through the technical steps of integrating your data with the CIPH3R solution, which safeguards your organization's sensitive data by anonymizing it through encryption. Data ingested through CIPH3R can then feed directly into training datasets, strengthening your data-driven initiatives.
LangChain integration types
Document Loader
Vector stores
Chat Messages Memory
In this blog, we will elaborate on the Document Loader; the other two topics will be covered in later posts.
Technology Stack
LangChain (https://python.langchain.com/docs/get_started/introduction)
- Integration via the langchain-community package
AWS S3 (for data file import/export)
CIPH3R mInjestor
Note: CIPH3R supports various integration types, such as databases and other hypervisors
LangChain Architecture
(Image Credits: https://python.langchain.com)
The scope of this blog is not to explain or implement the key features of LangChain, but to outline how the CIPH3R product integrates with the LangChain AI integration framework.
The full list of document loaders is documented here
As mentioned, we will use the AWS S3 file integration for the document loader.
Install the boto3 Python package:
pip install --upgrade --quiet boto3
Connect the CIPH3R mInjestor to read and process sensitive data using a CIPH3R data conversion schema. Through an automated pipeline or a manual trigger in the CIPH3R portal, CIPH3R encrypts the input data into the FPE format defined in the schema. After processing, the mInjestor writes the output to an AWS S3 bucket.
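Before wiring up the loader, you can confirm that the mInjestor output actually landed in the bucket with a quick boto3 check; the bucket name and key prefix below are placeholders matching the loader example that follows.

import boto3

# List the FPE-processed objects CIPH3R mInjestor wrote to the bucket.
s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="FPE_output_s3bucket", Prefix="processed_fpe")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"], obj["LastModified"])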
Contact CIPH3R to learn how to use the CIPH3R free tier to process your data.
from langchain_community.document_loaders import S3FileLoader

# Load the FPE-processed file that CIPH3R mInjestor wrote to S3.
loader = S3FileLoader("FPE_output_s3bucket", "processed_fpe.csv")
docs = loader.load()
The above code ensures that only the encrypted data is loaded from S3 into your chains.
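As a quick sanity check, you can inspect the loaded documents to confirm that only FPE-encrypted values made it into the pipeline, reusing the docs loaded above:

# The page content should contain only FPE-encrypted values, with no raw PII
# from the source system; the metadata carries the S3 source reference.
print(len(docs))
print(docs[0].page_content[:500])
print(docs[0].metadata)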
In the next blogs, we will cover vector stores and chat messages memory in detail.
Happy learning!