An Approach for Generating Realistic Synthetic Transaction Data
Name
Mert Bektas
Abstract
Privacy is a crucial concern in today’s constantly evolving financial technology and banking sector, and banks are turning to creative and secure solutions to address it while improving their performance. Federated Learning (FL) is a novel approach that enables model training across separate organizations. The author collaborated with Swedbank on a project to prepare the application of FL with another bank and an intermediary company, with the goal of enhancing our money laundering detection system while preserving data privacy. Both banks used an updated version of the open-source multi-agent-based simulator AMLSim to generate synthetic data.
This thesis aims to generate synthetic transaction data that is close to real-life transactions, making collaboration between banks on anti-financial crime possible. Based on our real-life transaction data, we created features similar to those in the data generated by AMLSim. Both the real and the synthetic datasets were converted into graphs. The graph evaluation metrics used are In-degree/Out-degree Ratio, PageRank, and Label Propagation. The Snowball sampling algorithm is used to sample the real-life transaction data so that it is comparable in size with the smaller generated synthetic data. The sampling algorithm is evaluated by generating three different subsamples from the same graph and comparing their structure using the aforementioned metrics, in addition to Graph Density and Graph Components, to check whether the subsamples are structurally consistent with each other. Finally, the generated synthetic graphs are evaluated with the same methods to check whether their structures are close to those of the real graphs. The results are used to tune the hyperparameters of AMLSim so that it generates a more realistic dataset.
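The sampling and comparison steps described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the toy transactions, and the choice to follow both incoming and outgoing edges in each wave are assumptions made for illustration, and this variant adds every neighbour in each wave, whereas some Snowball sampling variants cap the number of neighbours taken per node.

```python
from collections import defaultdict


def build_graph(transactions):
    """Build directed adjacency maps from (sender, receiver) pairs."""
    out_edges = defaultdict(set)
    in_edges = defaultdict(set)
    for sender, receiver in transactions:
        out_edges[sender].add(receiver)
        in_edges[receiver].add(sender)
    return out_edges, in_edges


def snowball_sample(out_edges, in_edges, seeds, waves):
    """Start from the seed accounts and add all neighbours for `waves` rounds."""
    sampled = set(seeds)
    frontier = set(seeds)
    for _ in range(waves):
        next_frontier = set()
        for node in frontier:
            # Assumption: direction is ignored when expanding the sample.
            neighbours = out_edges.get(node, set()) | in_edges.get(node, set())
            next_frontier |= neighbours - sampled
        sampled |= next_frontier
        frontier = next_frontier
    return sampled


def induced_edges(transactions, nodes):
    """Keep only the distinct edges whose endpoints are both in the sample."""
    return {(s, r) for s, r in transactions if s in nodes and r in nodes}


def degree_ratio(out_edges, in_edges, node):
    """In-degree divided by out-degree; infinite when there are no outgoing edges."""
    in_degree = len(in_edges.get(node, ()))
    out_degree = len(out_edges.get(node, ()))
    return in_degree / out_degree if out_degree else float("inf")


def density(num_nodes, num_edges):
    """Density of a directed graph without self-loops: m / (n * (n - 1))."""
    if num_nodes < 2:
        return 0.0
    return num_edges / (num_nodes * (num_nodes - 1))


# Toy data: hypothetical account identifiers, not real transactions.
transactions = [("A", "B"), ("B", "C"), ("C", "A"),
                ("C", "D"), ("D", "E"), ("F", "G")]
out_edges, in_edges = build_graph(transactions)
sample_nodes = snowball_sample(out_edges, in_edges, seeds=["A"], waves=1)
sample_edges = induced_edges(transactions, sample_nodes)
sample_density = density(len(sample_nodes), len(sample_edges))
```

Repeating the sampling from different seed accounts and comparing metrics such as `sample_density` across the resulting subsamples mirrors, in a simplified way, the consistency check described above. PageRank and Label Propagation would typically come from a graph library rather than being hand-written.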
Graduation Thesis language
English
Graduation Thesis type
Master - Computer Science
Supervisor(s)
Amnir Hadachi, Alexander Jöhnemark, Jolanta Goldsteine
Defence year
2024