Easily anonymize your data


Let’s imagine two scenarios:

How would you personally go about solving this?

Anonymize df

I’ve been confronted with these scenarios multiple times and had a very ad hoc, quick and dirty, kind of approach to it. Starting from scratch every time, which meant saving time in the short run but losing time and quality in the long run.

When a colleague asked how complicated it would be to make a general-purpose tool for this kind of process, I saw that as an opportunity to finally do something about this and create a project for my public portfolio that is quite useful.

The result being Anonymize df, which is a python package and Alteryx macro that helps you quickly and easily generate realistically fake data from a Pandas DataFrame.

Where to get it

You can install it like you would any other package through pip:

pip install anonymizedf

You can also find it here:

Link to Python code

Link to Alteryx macro

Python usage

import pandas as pd
from anonymizedf.anonymizedf import anonymize

Import the data
df = pd.read_csv("https://query.data.world/s/shcktxndtu3ojonm46tb5udlz7sp3e")

Prepare the data to be anonymized
an = anonymize(df)
Example 1 - just updates df
an.fake_names("Customer Name")
an.fake_ids("Customer ID")
an.fake_whole_numbers("Loyalty Reward Points")
an.fake_categories("Segment")
an.fake_dates("Date")
an.fake_decimal_numbers("Fraction")
df.head()
Customer ID Customer Name Loyalty Reward Points Segment Date Fraction Fake_Customer Name Fake_Customer ID Fake_Loyalty Reward Points Fake_Segment Fake_Date Fake_Fraction
0 AA-10315 Alex Avila 76 Consumer 01/01/2000 7.6 Anne Briggs FYKP18464993584790 715 Segment 1 1988-02-21 81.70
1 AA-10375 Allen Armold 369 Consumer 02/01/2000 36.9 Kathryn Poole-Owens KQLT34683822176548 305 Segment 1 2012-01-21 49.64
2 AA-10480 Andrew Allen 162 Consumer 03/01/2000 16.2 Dorothy Knight-Smith KEKQ23089097589905 723 Segment 1 2017-12-05 45.49
3 AA-10645 Anna Andreadi 803 Consumer 04/01/2000 80.3 Dr. Dennis Lowe JUFR80046496812327 503 Segment 1 1993-08-19 43.85
4 AB-10015 Aaron Bergman 935 Consumer 05/01/2000 93.5 Joan Read ZLEK68784141425071 103 Segment 1 2018-10-26 65.30
Example 2 - method chaining
fake_df = (
    an
    .fake_names("Customer Name", chaining=True)
    .fake_ids("Customer ID", chaining=True)
    .fake_whole_numbers("Loyalty Reward Points", chaining=True)
    .fake_categories("Segment", chaining=True)
    .fake_dates("Date", chaining=True)
    .fake_decimal_numbers("Fraction", chaining=True)
    .show_data_frame()
)
fake_df.head()
Customer ID Customer Name Loyalty Reward Points Segment Date Fraction Fake_Customer Name Fake_Customer ID Fake_Loyalty Reward Points Fake_Segment Fake_Date Fake_Fraction
0 AA-10315 Alex Avila 76 Consumer 01/01/2000 7.6 Matthew Elliott KQPQ33621304584922 62 Segment 1 2011-05-24 96.96
1 AA-10375 Allen Armold 369 Consumer 02/01/2000 36.9 Lynne Harding CLAA15849783691822 494 Segment 1 2000-10-14 20.78
2 AA-10480 Andrew Allen 162 Consumer 03/01/2000 16.2 Dr. Molly Holmes VTWU51877283324210 383 Segment 1 1994-01-30 66.87
3 AA-10645 Anna Andreadi 803 Consumer 04/01/2000 80.3 Mr. Frederick Price MVFX95041828905565 82 Segment 1 2000-01-11 25.77
4 AB-10015 Aaron Bergman 935 Consumer 05/01/2000 93.5 Dean Davies CRXZ11641101775380 786 Segment 1 1996-08-19 38.32
Example 4 - for multiple columns

column_list = ["Segment", "Customer Name", "Customer ID", "Date"]

for column in column_list:
     an.fake_categories(column)
df.head()
Customer ID Customer Name Loyalty Reward Points Segment Date Fraction Fake_Customer Name Fake_Customer ID Fake_Loyalty Reward Points Fake_Segment Fake_Date Fake_Fraction
0 AA-10315 Alex Avila 76 Consumer 01/01/2000 7.6 Customer Name 1 Customer ID 1 62 Segment 1 Date 1 96.96
1 AA-10375 Allen Armold 369 Consumer 02/01/2000 36.9 Customer Name 2 Customer ID 2 494 Segment 1 Date 2 20.78
2 AA-10480 Andrew Allen 162 Consumer 03/01/2000 16.2 Customer Name 3 Customer ID 3 383 Segment 1 Date 3 66.87
3 AA-10645 Anna Andreadi 803 Consumer 04/01/2000 80.3 Customer Name 4 Customer ID 4 82 Segment 1 Date 4 25.77
4 AB-10015 Aaron Bergman 935 Consumer 05/01/2000 93.5 Customer Name 5 Customer ID 5 786 Segment 1 Date 5 38.32
Example 5 - grouping

d2 = {"category": ["low", "low", "high", "high"], "number": [0.1, 1, 10.1, 100.1]}
df2 = pd.DataFrame(data=d2)

an = anonymize(df2)
df2.head()
category number
0 low 0.1
1 low 1.0
2 high 10.1
3 high 100.1
Without grouping - relative relationships lost when generating fake data
an.fake_decimal_numbers("number")
df2.head()
category number Fake_number
0 low 0.1 86.03
1 low 1.0 64.72
2 high 10.1 68.66
3 high 100.1 13.41
With grouping - relative relationships remain the same
an.fake_decimal_numbers("number", "category")
df2.head()
category number Fake_number
0 low 0.1 0.38
1 low 1.0 0.92
2 high 10.1 63.70
3 high 100.1 59.40

Alteryx usage

The Macro can be downloaded from the gallery and used just like any other macro

Workflow overview

The interface for using the macro should be quite straightforward, if not let me know ;)

interface example

If you have any thoughts / comments feel free to let me know

Also let me know if you’d be interested in a “behind the scenes” post / video about how this was created.