How Robinhood uses Graph Algorithms to prevent Fraud
How Robinhood uses algorithms like Connected Components, PageRank and more. Plus, what makes an outstanding programmer, four important software design principles and more.
Hey Everyone!
Today we’ll be talking about
How Robinhood uses Graph Algorithms to prevent Fraud
The different types of fraud Robinhood has to deal with
Introduction to graphs and graph databases
Vertex-centric algorithms like BFS and DFS
Graph-centric algorithms like Connected Components and Page Rank
Robinhood’s data processing architecture
Tech Snippets
4 Software Design Principles I Learned the Hard Way
Why you should become an Engineering Manager
Basic Things to Implement Early in Projects
How Robinhood uses Graph Algorithms to prevent Fraud
Robinhood is a brokerage and financial services company that was started in 2013.
Since it’s founding, the company has become incredibly popular with younger investors due to their zero-commission trading, gamified experience and intuitive app interface. They have over 23 million accounts on the platform.
As with any financial services company, fraud is a major issue.
Brokerages like Robinhood face three main types of fraud:
Identity Fraud - accounts opened using a stolen identity.
Account takeover - hackers taking over someone’s account and transferring their funds out of Robinhood to the hacker.
First-party Fraud - someone making deliberately reckless trades with borrowed-funds (on margin), where they have no intent of paying the funds back if the trade is unsuccessful (in other words, what r/WallStreetBets does on a daily basis).
When fraud happens, it usually involves multiple accounts and is done in a coordinated way. Due to this coordination, accounts that belong to hackers will often have common elements (same IP address, bank account, home address, etc.).
This makes Graph algorithms extremely effective at tackling this kind of fraud.
Robinhood wrote a great blog post on how they use graph algorithms for fraud detection.
Introduction to Graphs
Graphs are a very popular way of representing relationships and connections within your data.
They’re composed of
Vertices - These represent entities in your data. In Robinhood’s case, vertices/nodes represent individual users.
Edges - These represent connections between nodes. For Robinhood, an edge could represent two users sharing the same attributes (same home address, same device, etc.)
Graphs can also get a lot more complicated with edge weights, edge directions, cyclic/acyclic and more.
Graph Databases
Graph databases exist for storing data that fits into this paradigm (vertices and edges).
You may have heard of databases like Neo4j or AWS Neptune. These are NoSQL databases specifically built to handle graph data.
There are quite a few reasons why you’d want a specialized graph database instead of using MySQL or Postgres (although, Postgres has extensions that give it the functionality of a graph database).
Faster Processing of Relationships - Let’s say you use a relational database to store your graph data. It will use joins to traverse relationships between nodes, which will quickly become an issue (especially when you have hundreds/thousands of nodes)
On the other hand, graph databases use pointers to traverse the underlying graph. Each node has direct references to its neighbors (called index-free adjacency) so traversing from a node to its neighbor will always be a constant time operation in a graph database.
Here’s a really good article that explains exactly why graph databases are so efficient with these traversals.
Graph Query Language - Writing SQL queries to find and traverse edges in your graph can be a big pain. Instead, graph databases employ query languages like Cypher and Gremlin to make queries much cleaner and easier to read.
Here’s an example SQL query and an equivalent Cypher query to find all the directors of Keanu Reeves movies.
### SQL
SELECT director.name, count(*)
FROM person keanu
JOIN acted_in ON keanu.id = acted_in.person_id
JOIN directed ON acted_in.movie_id = directed.movie_id
JOIN person AS director ON directed.person_id = director.id
WHERE keanu.name = 'Keanu Reeves'
GROUP BY director.name
ORDER BY count(*) DESC
### Cypher
MATCH (keanu:Person {name: 'Keanu Reeves'})-[:ACTED_IN]
->(movie:Movie),
(director:Person)-[:DIRECTED]->(movie)
RETURN director.name, count(*)
ORDER BY count(*) DESC
Algorithms and Analytics - Graph databases will come integrated with commonly used algorithms like Djikstra, BFS/DFS, cluster detection, etc.
You can easily and quickly runs tasks for things like
Path Finding - find the shortest path between two nodes
Centrality - measure the importance or influence of a node within the graph
Similarity - calculate the similarity between two nodes
Community Detection - evaluate clusters within a graph where nodes are densely connected with each other
Node Embeddings - compute vector representations of the nodes within the graph
Graph Algorithms at Robinhood
Robinhood’s graph-based fraud detection algorithms can be split into two categories:
Vertex-centric Algorithms
Graph-centric Algorithms
Vertex-centric Algorithms
With these algorithms, computations typically start from one specific node (vertex) and then expand to a subsection of the graph.
Robinhood might start from one vertex that represents a known-attacker and then identify risky neighbor nodes.
Examples of Vertex-centric algorithms Robinhood could use include
Breadth-First Search
Depth-First Search
A Breadth-first search visits all of a node’s immediate neighbors before moving onto the next layer.
Breadth First Search Animated
If you have a certain node that is a known fraudster, then you can analyze all the nodes that share any attributes with the fraudster (IP address, device, home address, bank account details, etc.).
This can be the most efficient way to detect other fraudsters as nodes that share the same IP address/bank account are very likely to be malicious. After analyzing all those nodes, you can repeat the process for all the neighbors of the nodes you just processed (second-degree connections of the original fraudster node).
On the other hand, a DFS will explore as deep as possible along a single branch of the graph before backtracking to explore the next path.
Depth First Search Animated
A DFS could be particularly useful in scenarios where the fraud detection system needs to explore complex, convoluted chains of transactions or relationships that extend very deeply into the graph.
Graph-centric Algorithms
With graph-centric algorithms, Robinhood analyzes the entire graph to search for relationships (so computations usually involve looking at the full graph).
They look for clusters of nodes that might be useful in finding fraudulent actors.
Examples of useful Graph-centric algorithms include
Connected Components - Connected components identifies sub-graphs within the graph where each node is reachable from every other node within the subgraph (each node is connected in some way to every other node in the sub-graph).
Connected components helps Robinhood identify groups of accounts that are linked together even if those accounts don’t directly share attributes.Page Rank - Page Rank is an algorithm that measures the importance or influence of each node within a graph based on the number and quality of incoming connections. It assigns a score to each node, with higher scores indicating greater importance or centrality within the network.
Robinhood can leverage Page Rank to identify the most influential or central accounts within a fraud network. By treating each node as a "page" and the relationships as "links," Page Rank can help Robinhood determine which accounts are likely to be the key players in a fraud scheme. Accounts with high Page Rank scores can be prioritized for investigation and intervention.
Graph Embeddings - Graph Embeddings are techniques that transform a graph's structure and node relationships into a vector representation. This embedding captures the essential properties of the graph while simplifying its representation for further analysis.
Robinhood can utilize Graph Embeddings to convert their complex account relationship graph into a much easier-to-use format. Then, they can apply machine learning algorithms (like k-nearest neighbors or other clustering strategies) to identify potential fraudsters.
Data Processing and Serving
For graph-centric features, Robinhood needs to analyze the entire graph and identify clusters of entities.
This is super resource-intensive so it’s done offline in periodic batch-jobs. After, the results are ingested into online feature stores where it can be used by Robinhood’s fraud detection platform.
For vertex-centric algorithms, Robinhood uses the same batch processing approach. However, they also use a real-time streaming approach to incorporate the most up-to-date fraud data (similar to a Lambda Architecture where they have a batch layer and a speed layer). This lets them reduce lag from hours to seconds.
Results
With these graph algorithms, Robinhood has been able to compute hundreds of features and run many different fraud models.
This has helped them save tens of millions of dollars of potential fraud losses.