Anonymize your data using Amazon S3 Object Lambda

Anonymization and pseudonymization are two of the techniques commonly adopted to protect data. In both cases, you want to remove the ability to identify someone and, more importantly, the link to their personal information (financial, health, preferences…), while keeping the data practically useful. Anonymization consists of removing any direct (and some indirect) identifying data. Pseudonymization does not remove this information but modifies it so that it can no longer be linked to the original individual.
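
To make the distinction concrete, here is a minimal Python sketch. The record, its field names and the secret key are hypothetical and only serve as an illustration; the keyed hash (HMAC-SHA256) is one possible way to build pseudonyms, not a recommendation for every use case.

```python
import hashlib
import hmac

# Hypothetical record; the field names are illustrative only.
record = {
    "name": "Jane Doe",
    "ssn": "123-45-6789",
    "zip_code": "75011",
    "diagnosis": "asthma",
}

# Fields assumed to identify a person directly.
DIRECT_IDENTIFIERS = ("name", "ssn")


def anonymize(rec):
    """Drop direct identifiers and coarsen one indirect identifier (zip code)."""
    out = {k: v for k, v in rec.items() if k not in DIRECT_IDENTIFIERS}
    out["zip_code"] = out["zip_code"][:2] + "***"
    return out


def pseudonymize(rec, secret_key):
    """Keep every field, but replace direct identifiers with keyed hash tokens."""
    out = dict(rec)
    for field in DIRECT_IDENTIFIERS:
        digest = hmac.new(secret_key, rec[field].encode("utf-8"), hashlib.sha256)
        out[field] = digest.hexdigest()[:16]
    return out
```

The anonymized record loses the identifying fields entirely, while the pseudonymized one keeps a stable token per individual, so records belonging to the same person can still be grouped.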

Multiple papers, algorithms (such as k-anonymity) and techniques exist to perform anonymization and pseudonymization. AWS also provides two functions, available in the Serverless Application Repository (SAR), that use Amazon Comprehend and its ability to detect PII.

For my part, as the input file is pretty straightforward, I don’t need Comprehend to detect sensitive information.

Here is my (naive) approach:
After this process, we should end up with the following data, free of any identifying information (names have been replaced):

Now that we know what we want to do, let’s see it in the context of our workload.
We have 3 main components in our workload:
To provide anonymized data to these applications, we have several options:
Both options add complexity and cost. This is where I introduce S3 Object Lambda, a capability recently announced by AWS that will act as this proxy, except that you don’t have to manage any infrastructure, just your Lambda function(s).

Let’s implement this solution. The first thing to do is to create a Lambda function. To do so, use your preferred framework (SAM, Serverless, CDK, …). I use SAM, and my function is written in Python 3.8.
The function must have permission to call WriteGetObjectResponse in order to provide the response to the downstream application(s). Note that this action is not in the s3 namespace but in s3-object-lambda:
{
  "Action": "s3-object-lambda:WriteGetObjectResponse",
  "Resource": "*",
  "Effect": "Allow",
  "Sid": "WriteS3GetObjectResponse"
}
And here is the code of my function (commented to understand the details):
My Lambda function is really simple; if you would like something more production-ready, I encourage you to have a look at the AWS samples mentioned above.
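As an illustration of the general shape of such a function, here is a minimal sketch. It is not the code from the repository: the column names (first_name, last_name, ssn, email) and the "anon-N" replacement scheme are assumptions. What does come from the service itself is the getObjectContext part of the event (inputS3Url, outputRoute, outputToken) and the write_get_object_response call.

```python
import csv
import io
import urllib.request

# Hypothetical column names; adapt them to the real layout of the input file.
NAME_COLUMNS = ("first_name", "last_name")
DROPPED_COLUMNS = ("ssn", "email")


def anonymize_csv(text):
    """Return CSV text with names replaced and direct identifiers dropped."""
    reader = csv.DictReader(io.StringIO(text))
    kept = [c for c in (reader.fieldnames or []) if c not in DROPPED_COLUMNS]
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=kept, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for index, row in enumerate(reader, start=1):
        for column in NAME_COLUMNS:
            if column in row:
                # Illustrative scheme: names become a per-row placeholder.
                row[column] = f"anon-{index}"
        writer.writerow(row)
    return out.getvalue()


def handler(event, context):
    """S3 Object Lambda entry point: fetch the original, transform it, return it."""
    import boto3  # imported here so anonymize_csv can be tested without the AWS SDK

    ctx = event["getObjectContext"]
    # Presigned URL that lets the function read the original object.
    with urllib.request.urlopen(ctx["inputS3Url"]) as response:
        original = response.read().decode("utf-8")

    # Send the transformed object back to the caller of GetObject.
    boto3.client("s3").write_get_object_response(
        Body=anonymize_csv(original).encode("utf-8"),
        RequestRoute=ctx["outputRoute"],
        RequestToken=ctx["outputToken"],
    )
    return {"status_code": 200}
```

Keeping the transformation in a pure function (anonymize_csv) makes it possible to unit test the anonymization logic locally, without deploying anything.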
Once the function is created and deployed, we need to create an Access Point. Amazon S3 Access Points simplify managing data access for applications using shared data sets on S3, exactly what we want to do here. Using the AWS CLI:
aws s3control create-access-point --account-id 012345678912 --name anonymized-access --bucket my-bucket-with-cid
Then we create the Object Lambda Access Point. It makes the Lambda function act as a proxy in front of your access point. To do so with the AWS CLI, we need a JSON configuration file. Be sure to replace the placeholders with your own account ID, region, access point name (created previously) and function ARN:
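If you prefer to generate this file rather than write it by hand, the following Python sketch builds a configuration with the keys the Object Lambda configuration expects (SupportingAccessPoint and TransformationConfigurations). The function name anonymize-function in the Lambda ARN is a hypothetical placeholder; the account ID, region and access point name reuse the values from the commands in this article.

```python
import json


def build_object_lambda_config(account_id, region, access_point_name, function_arn):
    """Build the configuration for create-access-point-for-object-lambda."""
    supporting_access_point = (
        f"arn:aws:s3:{region}:{account_id}:accesspoint/{access_point_name}"
    )
    return {
        # Standard access point that the Lambda function reads through.
        "SupportingAccessPoint": supporting_access_point,
        "TransformationConfigurations": [
            {
                # Only GetObject requests are routed through the function.
                "Actions": ["GetObject"],
                "ContentTransformation": {
                    "AwsLambda": {"FunctionArn": function_arn}
                },
            }
        ],
    }


if __name__ == "__main__":
    config = build_object_lambda_config(
        account_id="012345678912",
        region="eu-central-1",
        access_point_name="anonymized-access",
        # Hypothetical function name; use the ARN of your own function.
        function_arn="arn:aws:lambda:eu-central-1:012345678912:function:anonymize-function",
    )
    with open("anonymize-lambda-accesspoint.json", "w") as config_file:
        json.dump(config, config_file, indent=2)
```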
Finally, we create the Object Lambda Access Point using the following command:
aws s3control create-access-point-for-object-lambda --account-id 012345678912 --name anonymize-lambda-accesspoint --configuration file://anonymize-lambda-accesspoint.json
And that’s it! You can now test your access point and the anonymization process with a simple get. Note that you don’t perform the get directly on the S3 bucket, but on the Object Lambda Access Point created previously, using its ARN, like this:
aws s3api get-object --bucket arn:aws:s3-object-lambda:eu-central-1:012345678912:accesspoint/anonymize-lambda-accesspoint --key patients.csv ./anonymized.csv
You can now provide this access point ARN to the analytics application so it can retrieve anonymized data and perform whatever it needs to.
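On the application side, the same read could look like the following Python sketch, assuming the application uses boto3 (an assumption, since the analytics application is not described here). The helper name fetch_anonymized is hypothetical; the Object Lambda Access Point ARN is simply passed where a bucket name would normally go.

```python
def fetch_anonymized(s3_client, access_point_arn, key):
    """Read an object through the Object Lambda Access Point and return its text."""
    # The access point ARN is passed in place of the bucket name.
    response = s3_client.get_object(Bucket=access_point_arn, Key=key)
    return response["Body"].read().decode("utf-8")


if __name__ == "__main__":
    import boto3  # requires AWS credentials allowed to use the access point

    arn = (
        "arn:aws:s3-object-lambda:eu-central-1:012345678912:"
        "accesspoint/anonymize-lambda-accesspoint"
    )
    print(fetch_anonymized(boto3.client("s3"), arn, "patients.csv"))
```

Passing the client as a parameter keeps the helper easy to test with a stub, and the application never needs to know the name of the underlying bucket.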
In this article, I’ve shared how to leverage S3 Object Lambda in order to anonymize your data. In just a few commands and a bit of code, we can safely share data containing identifying information with other applications without duplicating it or building a complex infrastructure.
Note that you can use the same technology to enrich data (retrieving information from a database), modify it on the fly (e.g. image resizing), or convert its format (e.g. XML to JSON, CSV to Parquet, …), and I guess you will find some uses of your own too.
The code for this article is available on GitHub, together with a full SAM template to create everything (bucket, access points and Lambda function).