Private Set Intersections - Part 2

Private set intersection (PSI) is a privacy-preserving technique which falls under the umbrella of secure multi-party computation (SMPC) technologies. These technologies use cryptographic techniques to enable multiple parties to perform operations on disparate datasets without revealing the underlying data to any party.

PSI itself allows you to determine the overlapping records in two disparate datasets (their intersection) without providing access to the underlying raw data of either dataset. When applied to the Bitfount context, a Data Scientist can perform a PSI across a local dataset and a specified Pod and return the matching records from the local dataset.
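To make the expected result concrete, here is a minimal sketch of what "the matching records from the local dataset" means. It is a plain, non-private intersection in which both lists are visible in one place, which is exactly what PSI avoids. The variable names and values are hypothetical and are not taken from any Bitfount dataset.

```python
# Illustration only: a plain, NON-private intersection.
# Both sides are visible here, which is what PSI is designed to avoid.
# All names and values below are hypothetical.
local_records = [3, 4, 5, 6, 7]   # values held by the Data Scientist
remote_records = [1, 2, 3, 4, 5]  # values held by the other party

remote_lookup = set(remote_records)

# Keep the local records that also appear remotely, preserving local order.
matching_local = [value for value in local_records if value in remote_lookup]
print(matching_local)  # [3, 4, 5]
```

A PSI protocol aims to give the querying party this same answer without either side handing over its raw list.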

Prerequisites

!pip install bitfount

Access Controls

Bitfount does not currently enable a Data Custodian to restrict Data Scientists to performing just PSI operations against a given Pod, so the primary use case for running PSI tasks using Bitfount is to understand the overlap of multiple datasets to which the Data Scientist already has Super Modeller or General Modeller permissions.

For the purposes of this tutorial, we will use the psi-demo-dataset created in the Private Set Intersection - Part 1 tutorial. If the Pod has gone offline since you completed that tutorial, be sure to restart it before proceeding.

Setting Up for PSI Tasks

Import the relevant pieces from the bitfount library:

import logging
import os
import time

import nest_asyncio
import pandas as pd

from bitfount import ComputeIntersectionRSA, DataFrameSource, setup_loggers

nest_asyncio.apply()  # Needed because Jupyter also has an asyncio loop

Set up the loggers:

loggers = setup_loggers([logging.getLogger("bitfount")])

Running PSI Tasks

PSI tasks run using the ComputeIntersectionRSA algorithm, which accepts several optional arguments as outlined in the API documentation. The required arguments are:

  • datasource: The local data to compare against the Pod(s), wrapped in a datasource such as the DataFrameSource used in the example below.
  • pod_identifiers: The identifiers for the Pods you wish to compare.

You will also need to define the columns you wish to compare across Pods as an input variable. In the example below, we use data to define this. Now that we understand the requirements, we can run an example PSI task as outlined below:

# Compare the values in the "col1" column with the remote dataset.
data = {"col1": [3, 4, 5, 6, 7]}
algorithm = ComputeIntersectionRSA()
intersection_indices = algorithm.execute(
    datasource=DataFrameSource(pd.DataFrame(data)), pod_identifiers=["psi-demo-dataset"]
)
print(intersection_indices)

The above should print out the overlap between the values in the variable data and the remote dataset psi-demo-dataset.

Note: Private Set Intersection requires a significant amount of computation. This computation is linear in the size of both the query itself and the database being queried. When running PSI we recommend starting with smaller numbers of entries to understand the way it scales before executing larger queries.
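To give a feel for why the cost grows with both sides, here is a toy sketch of a textbook RSA blind-signature approach to PSI. This is an assumption-laden illustration: this tutorial does not describe how ComputeIntersectionRSA works internally, and the sketch is not Bitfount's implementation. Every name in it (toy_rsa_psi, server_tags and so on) is hypothetical, both parties run in one process, and the key is far too small for real use.

```python
import hashlib
import math
import secrets

# Toy RSA key built from two known Mersenne primes. NOT secure; illustration only.
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q
E = 65537
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent, held by the server only


def hash_to_int(value):
    """Hash a value to an integer modulo N."""
    digest = hashlib.sha256(repr(value).encode()).digest()
    return int.from_bytes(digest, "big") % N


def tag(signature):
    """Second hash applied to a signature before it is compared."""
    return hashlib.sha256(str(signature).encode()).hexdigest()


def server_tags(server_values):
    """Server: sign each of its own values (one exponentiation per value)."""
    return {tag(pow(hash_to_int(v), D, N)) for v in server_values}


def server_sign_blinded(blinded):
    """Server: sign blinded client values without seeing the originals."""
    return [pow(b, D, N) for b in blinded]


def toy_rsa_psi(client_values, server_values):
    """Client: return the client values that the server also holds."""
    factors, blinded = [], []
    for v in client_values:
        # Pick a random blinding factor that is invertible modulo N.
        r = secrets.randbelow(N - 2) + 2
        while math.gcd(r, N) != 1:
            r = secrets.randbelow(N - 2) + 2
        factors.append(r)
        blinded.append(hash_to_int(v) * pow(r, E, N) % N)

    # In a real deployment these two results would cross the network.
    signed = server_sign_blinded(blinded)
    known_tags = server_tags(server_values)

    matches = []
    for v, r, s in zip(client_values, factors, signed):
        unblinded = s * pow(r, -1, N) % N  # equals hash_to_int(v) ** D mod N
        if tag(unblinded) in known_tags:
            matches.append(v)
    return matches


print(toy_rsa_psi([3, 4, 5, 6, 7], [1, 2, 3, 4, 5]))  # [3, 4, 5]
```

In this sketch every queried value and every stored value costs at least one modular exponentiation, so the total work grows linearly with the size of both inputs, which is consistent with the scaling described in the note above.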

You've successfully run a PSI task!

Contact our support team at support@bitfount.com if you have any questions.