Hypothesis Generation Engine: Fraud Probability Calculation
Four months ago, I was selected as a Software Engineer through the Microsoft Leap Apprenticeship Program. Coming from a non-traditional and historically underrepresented background, I found that the apprenticeship offered me an entry point into tech while letting me draw on the skills and experience I had developed outside traditional paths. The program is an immersive 16-week apprenticeship that begins with in-classroom learning and ends with a hands-on engineering project. Over the past three months I have worked on a team within Microsoft, learned about fraud detection, and completed a project that helps our human investigations team find fraud efficiently. In this post, I will share how I built an end-to-end pipeline that takes in consumer transaction data and outputs a list of potentially fraudulent transactions.
FIGHT is an engineering team under Commerce Engineering Operations that focuses on processing transaction data of commercial and consumer businesses to detect fraud. We offer internal products that empower data investigators to search, sort, and perform analysis on large amounts of data with low latency. Simply put, we work closely with the human investigations team as well as the upstream risk data team to find fraud.
Investigating potentially fraudulent transactions efficiently is pivotal to catching fraudsters. Because numerous transactions are processed every second, our human investigation team must analyze them quickly and diligently. As a result, the Hypothesis Generation Engine project is critical to Microsoft’s business needs as we work to match the pace of ever-advancing fraudsters.
Hypothesis Generation Engine: Fraud Probability Calculation
As part of the Commerce Engineering Operations team, the goals of my apprenticeship project were:
- Leverage and build on top of the existing project and Python codebase.
- Become familiar with internal fraud products and pipelines.
- Utilize Neo4j and understand its capabilities.
- Set up an end-to-end pipeline that:
  - Takes in hashed transaction data.
  - Outputs a ranked fraud list of transactions.
- Have a business impact by generating data insights.
My project centered around Neo4j, a graph database that is instrumental in finding connections within our consumer data. Traditional fraud prevention measures focus on discrete data points (accounts, individuals, devices), but Neo4j can look beyond individual data points and uncover difficult-to-detect patterns. We use Neo4j to augment our existing fraud detection capabilities because it offers the following advantages:
- Data and Relationships: Each data record, or node, stores direct pointers to all the nodes it is connected to, unlike traditional databases, which arrange data in rows, columns, and tables.
- Flexible and Efficient: Data is stored as properties of nodes and can be accessed in real time, and the data model is versatile enough to adapt as fraudsters evolve.
- Cypher Query Language: Uncovering fraud requires complex associations; real-time traversal queries of graphs are essential to detect and prevent fraud.
- Graph Data Science Library: Graph data science augments existing analytics to improve fraud detection.
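To make the "direct pointers" idea concrete, here is a minimal in-memory sketch in plain Python. It is not how Neo4j is implemented, and the node names and toy data are hypothetical. It only shows why following stored connections makes a question like "which other transactions share an attribute with this one?" a short two-hop walk.

```python
from collections import defaultdict

# Hypothetical toy data: each pair links a transaction to an attribute it used.
EDGES = [
    ("txn-1", "device:abc"),
    ("txn-1", "email:x@example.com"),
    ("txn-2", "device:abc"),
    ("txn-3", "email:y@example.com"),
]


def build_adjacency(edges):
    """Store, for every node, the set of nodes it is directly connected to."""
    adjacency = defaultdict(set)
    for left, right in edges:
        adjacency[left].add(right)
        adjacency[right].add(left)
    return adjacency


def linked_transactions(adjacency, txn):
    """Two-hop traversal: transaction -> shared attribute -> other transactions."""
    linked = set()
    for attribute in adjacency[txn]:
        linked |= adjacency[attribute]
    linked.discard(txn)  # a transaction is not "linked" to itself
    return sorted(linked)


adjacency = build_adjacency(EDGES)
print(linked_transactions(adjacency, "txn-1"))  # ['txn-2']
```

In Cypher, a comparable traversal could be written roughly as `MATCH (t:Transaction {transaction_id: $id})-->(a)<--(other:Transaction) RETURN DISTINCT other`, where the `Transaction` label and `transaction_id` property are assumed names rather than our real schema.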
For more information on Neo4j, please visit their docs.
Step 1: Event Hub
To bring our consumer data into the pipeline, we first stream it through an event hub, which serves as the front door and allows us to receive and process millions of events per second. Our evaluation stream file, written in Python, listens for new data and runs it through the Cypher queries.
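A listener of this kind might look roughly like the sketch below. It is not the actual evaluation stream file. The JSON payload shape, the function names, and the use of the default consumer group are assumptions, and the `azure-eventhub` package is imported lazily so that the decoding helper can be tried without it.

```python
import json


def handle_event(body, process):
    """Decode one event body (JSON text or bytes) and pass the record to `process`."""
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    record = json.loads(body)
    process(record)
    return record


def run_listener(connection_string, eventhub_name, process):
    """Block and forward every incoming event to `process`.

    Needs the `azure-eventhub` package. The connection string and hub name
    are supplied by the caller; nothing here reflects our real configuration.
    """
    from azure.eventhub import EventHubConsumerClient

    def on_event(partition_context, event):
        handle_event(event.body_as_str(), process)

    client = EventHubConsumerClient.from_connection_string(
        connection_string,
        consumer_group="$Default",
        eventhub_name=eventhub_name,
    )
    with client:
        # starting_position="-1" reads each partition from its beginning.
        client.receive(on_event=on_event, starting_position="-1")


# Local check of the decoding step with a hypothetical payload.
received = []
handle_event(b'{"transaction_id": "t1"}', received.append)
print(received)  # [{'transaction_id': 't1'}]
```

Here `process` stands in for whatever runs the Cypher queries described in the next step.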
For more information on Azure Event Hubs, please visit their docs.
Step 2: Cypher Queries
Now we are ready to transform our data through our queries:
1. Cypher Query: Create Constraints
The constraints query creates the schema of the nodes. The schema is flexible, which makes it easy for our team to change as we evolve our fraud detection data model. We identified six attributes we consider essential in identifying fraud, and these correspond to our schema constraints. In addition, we designated a main transaction node that contains a unique transaction number; the next query adds further properties to it.
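As an illustration only, constraint statements for a schema like this could be generated as follows. The labels and key properties below are hypothetical placeholders, since the real six attributes are internal, and the `FOR ... REQUIRE` form assumes a recent Neo4j version (older 4.x releases use `ON ... ASSERT` instead).

```python
# Hypothetical labels and key properties; the real schema is not shown in this post.
NODE_KEYS = {
    "Transaction": "transaction_id",
    "Account": "value",
    "Email": "value",
    "Device": "value",
    "IpAddress": "value",
    "PaymentInstrument": "value",
    "Address": "value",
}


def constraint_statements(node_keys):
    """Build one uniqueness-constraint statement per node label."""
    statements = []
    for label, key in node_keys.items():
        name = f"{label.lower()}_{key}_unique"
        statements.append(
            f"CREATE CONSTRAINT {name} IF NOT EXISTS "
            f"FOR (n:{label}) REQUIRE n.{key} IS UNIQUE"
        )
    return statements


for statement in constraint_statements(NODE_KEYS):
    print(statement)
```

With py2neo, each statement could then be sent to the database with `graph.run(statement)` on a connected `Graph` object.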
2. Cypher Query: Create Nodes
The next query targets the main transaction node and writes over 30 attributes we consider influential in identifying fraud. The properties are written in a key-value format and become consequential in running the graph data science algorithms. Below is a snippet of hypothetical property attributes in the main transaction node in table format:
3. Cypher Query: Create Relationships
Up until this point, we have created seven distinct nodes but only added property attributes/data in the main transaction node. The create relationships query adds an attribute in the other six nodes and creates direct relationships from the main transaction node. Below is a visualization of the main transaction node in orange, and the direct relationships to the constraint nodes:
4. Cypher Query: Create Common Property
The last Cypher query targets each of the seven nodes created above and writes a common property on each one. Like the attributes we added as properties in the main transaction node, this common property becomes consequential when the graph data science algorithms are initialized.
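The three write queries above (nodes, relationships, common property) can be sketched as parameterized Cypher built from Python. Everything named here is an assumption for illustration: the labels, the relationship type, the `value` key, and the `graph_tag` common property are placeholders, not our production queries.

```python
COMMON_PROPERTY = "graph_tag"  # hypothetical name for the shared property


def create_transaction_query():
    """Upsert the main transaction node and attach its key-value properties."""
    return (
        "MERGE (t:Transaction {transaction_id: $transaction_id}) "
        "SET t += $properties"
    )


def create_relationship_query(label, rel_type):
    """Upsert one attribute node and link the transaction node to it."""
    return (
        "MATCH (t:Transaction {transaction_id: $transaction_id}) "
        f"MERGE (a:{label} {{value: $value}}) "
        f"MERGE (t)-[:{rel_type}]->(a)"
    )


def common_property_query(label):
    """Write the same property on every node that carries the given label."""
    return f"MATCH (n:{label}) SET n.{COMMON_PROPERTY} = $value"


def transaction_parameters(record, key="transaction_id"):
    """Split a flat record into the key and the remaining properties."""
    properties = {k: v for k, v in record.items() if k != key}
    return {key: record[key], "properties": properties}


print(create_relationship_query("Email", "USED_EMAIL"))
print(transaction_parameters({"transaction_id": "t1", "amount": 10}))
```

Passing values as parameters (`$transaction_id`, `$value`) rather than formatting them into the query text keeps the hashed data out of the query string; only labels and relationship types, which Cypher cannot parameterize, are interpolated.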
Step 3: Graph Data Science Algorithms
Next comes the Neo4j Graph Data Science Library. As mentioned before, its algorithms allow me to incorporate the predictive power of the relationships and network structure in our data. By this step, the Neo4j database has been populated with data through the Cypher queries, and the model is ready to run the algorithms to make sense of that data.
The graph data science library contains many graph algorithms, which are categorized as:
- Centrality algorithms
- Community detection algorithms
- Similarity algorithms
- Path finding algorithms
- Link prediction algorithms
Community Detection Algorithm
To get a better sense of how these algorithms help uncover fraud, we can look in more detail at the community detection algorithms. They are widely used in identifying fraud rings and focus on clustering the graph based on relationships. They can also find both connected and disconnected groups by examining attributes and direct connections. In the snapshot below, we can clearly see the direct connections from transaction nodes 1 and 2, encircled in red. However, the community detection algorithm can also detect whether transaction node 3, encircled in yellow, has any indirect connections to transaction nodes 1 and 2, possibly placing nodes 1, 2, and 3 in the same cluster.
Furthermore, the algorithms can write a new numeric property on each node, which can later be used to group and find transactions that have similar connections.
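The post does not say which community detection algorithm the pipeline runs, so the sketch below uses the simplest one, weakly connected components, implemented in plain Python with a union-find structure. The node names and edges are hypothetical. It mirrors the three-transaction example: transaction 3 has no direct link to transaction 1, yet lands in the same group through a chain of shared attributes.

```python
def connected_components(edges):
    """Assign every node an integer component id using union-find."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for left, right in edges:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    # Number the components in order of first appearance.
    labels = {}
    result = {}
    for node in list(parent):
        root = find(node)
        result[node] = labels.setdefault(root, len(labels))
    return result


# Hypothetical edges: txn-1 and txn-2 share a device, txn-2 and txn-3 share an email.
EDGES = [
    ("txn-1", "device:abc"),
    ("txn-2", "device:abc"),
    ("txn-2", "email:x@example.com"),
    ("txn-3", "email:x@example.com"),
    ("txn-4", "device:zzz"),
]

components = connected_components(EDGES)
print(components["txn-1"], components["txn-3"], components["txn-4"])  # 0 0 1
```

In the Graph Data Science Library, the comparable step would be a procedure such as `gds.wcc.write` with a `writeProperty` setting, which stores the component id on each node in the same way the returned dictionary does here.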
For more information on Neo4j Graph Data Science Library, please visit their docs.
Step 4: Fraud Probability Ranking
Finally, we calculate a fraud percentage for each transaction through one last Cypher query. The query matches transactions on specific properties and then applies our internal percentage calculation. Below is a hypothetical example of the fraud ranking output file using dummy data:
This near-real-time end-to-end pipeline enables our human investigation team to get crucial insights into specific transactions and begin their investigation. Providing a manageable ranked list allows our manual review agents to work efficiently by starting with the transaction that has the highest fraud percentage. As a result, by the time a manual review agent submits a decision, there is a promising chance of reversing the fraudulent transaction. In addition, we can continue to work with our human investigation team and improve our model accuracy as we build out our Neo4j database and gain insight into the results of our data. Finally, as our model improves and its predictive accuracy increases, the benefit will increasingly outweigh the cost.
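The percentage calculation in Step 4 is internal and is not reproduced here. Purely to show the shape of a ranking step, the sketch below uses a stand-in rule: each transaction is scored by the share of transactions in its community that are already labeled as fraud. The field names, the rule, and the dummy records are all assumptions.

```python
from collections import defaultdict


def rank_by_community_fraud_rate(transactions):
    """Score each transaction by its community's known-fraud share, highest first."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for txn in transactions:
        totals[txn["community"]] += 1
        frauds[txn["community"]] += int(txn["known_fraud"])

    ranked = []
    for txn in transactions:
        community = txn["community"]
        pct = round(100 * frauds[community] / totals[community], 1)
        ranked.append({"id": txn["id"], "fraud_pct": pct})

    # Highest percentage first; ties are broken by id for a stable order.
    ranked.sort(key=lambda row: (-row["fraud_pct"], row["id"]))
    return ranked


# Dummy records, not real data.
TRANSACTIONS = [
    {"id": "t1", "community": 0, "known_fraud": True},
    {"id": "t2", "community": 0, "known_fraud": False},
    {"id": "t3", "community": 1, "known_fraud": False},
    {"id": "t4", "community": 1, "known_fraud": False},
    {"id": "t5", "community": 2, "known_fraud": True},
]

for row in rank_by_community_fraud_rate(TRANSACTIONS):
    print(row["id"], row["fraud_pct"])
```

A reviewer working from such a list would start at the top, which is the workflow the ranked output is meant to support.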
My experience working on this project and my exposure to a real industry problem have been instrumental in my development as a software engineer. Many of the technologies in my pipeline were unfamiliar to me, but with the support of my team, I was able to grasp them quickly. In addition, working on a Microsoft engineering team exposed me to real industry processes as well as my team’s Azure infrastructure, DevOps, blob storage, and virtual machines. Lastly, although I started with only minimal knowledge of Python, I was able to learn the language while using the pandas and py2neo libraries.
Many thanks to my mentor Ke Zhang, my engineering manager Xiaokai He, and of course my entire Commerce Engineering Operations team for supporting me throughout my project. Lastly, thank you to Microsoft Leap for sourcing and developing non-traditional talent like myself, and for allowing me to continue my journey in tech as a historically underrepresented minority.
Noel Mendoza is on LinkedIn.