How Robinhood uses Graph Algorithms to prevent Fraud

How Robinhood uses graph algorithms like Connected Components and PageRank to prevent fraud. Plus, what makes an outstanding programmer, four important software design principles and more.

Hey Everyone!

Today we’ll be talking about

  • How Robinhood uses Graph Algorithms to prevent Fraud

    • The different types of fraud Robinhood has to deal with

    • Introduction to graphs and graph databases

    • Vertex-centric algorithms like BFS and DFS

    • Graph-centric algorithms like Connected Components and PageRank

    • Robinhood’s data processing architecture

  • Tech Snippets

    • 4 Software Design Principles I Learned the Hard Way

    • Ask HN: What makes an outstanding programmer?

    • Why you should become an Engineering Manager

    • Basic Things to Implement Early in Projects

    • Anthropic’s Prompt Library

With LLMs like GPT-4, Claude and LLaMA, there’s a huge range of new apps you can build.

You could create a bot that looks at the latest headlines on stocks in your portfolio and texts you an alert when there’s something important. Or find all the latest news headlines and use GPT-4 to send you summaries of the articles without clickbait.

You can build both of these—and hundreds more ideas—in a weekend with the Brave Search API.

  • Affordable - It’s much cheaper and easier to set up than the other big tech options. So it’s perfect for everything from small projects to large apps.

  • High Quality - Brave’s index is populated with sites that real people actually visit. No junk or clickbait farms, no SEO spam, and a much more human dataset.

  • Easy to Use - It’s quick and easy to set up. Data is structured for simple implementation across a wide range of apps, from NLP to complex analysis.

  • It’s Fast - The API is optimized for low latency, so it’s ideal for real-time apps like responsive search or chatbots.

You can use the API for free for up to 2,000 queries per month.

sponsored

How Robinhood uses Graph Algorithms to prevent Fraud

Robinhood is a brokerage and financial services company that was started in 2013.

Since its founding, the company has become incredibly popular with younger investors thanks to its zero-commission trading, gamified experience and intuitive app interface. There are over 23 million accounts on the platform.

As with any financial services company, fraud is a major issue.

Brokerages like Robinhood face three main types of fraud:

  • Identity Fraud - accounts opened using a stolen identity.

  • Account Takeover - a hacker taking over someone’s account and transferring the victim’s funds out of Robinhood.

  • First-party Fraud - someone making deliberately reckless trades with borrowed funds (on margin), with no intent of paying the funds back if the trade is unsuccessful (in other words, what r/WallStreetBets does on a daily basis).

When fraud happens, it usually involves multiple accounts and is done in a coordinated way. Due to this coordination, accounts that belong to hackers will often have common elements (same IP address, bank account, home address, etc.).

This makes graph algorithms extremely effective at tackling this kind of fraud.

Robinhood wrote a great blog post on how they use graph algorithms for fraud detection.

Introduction to Graphs

Graphs are a very popular way of representing relationships and connections within your data.

They’re composed of

  • Vertices - These represent entities in your data. In Robinhood’s case, vertices/nodes represent individual users.

  • Edges - These represent connections between nodes. For Robinhood, an edge could represent two users sharing the same attributes (same home address, same device, etc.)

Graphs can also get a lot more complicated with edge weights, edge directions, cyclic/acyclic and more.
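
To make the vertex/edge model concrete, here’s a minimal sketch in Python (with hypothetical names, not Robinhood’s actual data model) of an account graph stored as an adjacency list, where each edge records the attribute two users share:

from collections import defaultdict

# Hypothetical account graph: vertices are user IDs, and an edge means two
# users share an attribute (same device, IP address, bank account, etc.).
class AccountGraph:
    def __init__(self):
        self.adj = defaultdict(list)  # user -> list of (neighbor, shared attribute)

    def add_edge(self, user_a, user_b, shared_attribute):
        # Sharing an attribute is symmetric, so edges are undirected.
        self.adj[user_a].append((user_b, shared_attribute))
        self.adj[user_b].append((user_a, shared_attribute))

    def neighbors(self, user):
        return self.adj[user]

graph = AccountGraph()
graph.add_edge("user_1", "user_2", "ip_address")
graph.add_edge("user_2", "user_3", "bank_account")
print(graph.neighbors("user_2"))
# [('user_1', 'ip_address'), ('user_3', 'bank_account')]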

Graph Databases

Graph databases exist for storing data that fits into this paradigm (vertices and edges).

You may have heard of databases like Neo4j or AWS Neptune. These are NoSQL databases specifically built to handle graph data.

There are quite a few reasons why you’d want a specialized graph database instead of using MySQL or Postgres (although, Postgres has extensions that give it the functionality of a graph database).

  • Faster Processing of Relationships - Let’s say you use a relational database to store your graph data. It will use joins to traverse relationships between nodes, which quickly becomes expensive as traversals span multiple hops (each hop is another join) and the graph grows.

    On the other hand, graph databases use pointers to traverse the underlying graph. Each node has direct references to its neighbors (called index-free adjacency) so traversing from a node to its neighbor will always be a constant time operation in a graph database.

     Here’s a really good article that explains exactly why graph databases are so efficient with these traversals.

  • Graph Query Language - Writing SQL queries to find and traverse edges in your graph can be a big pain. Instead, graph databases employ query languages like Cypher and Gremlin to make queries much cleaner and easier to read.

     Here’s an example SQL query and an equivalent Cypher query to find all the directors of Keanu Reeves movies:

SQL:

SELECT director.name, count(*)
FROM person keanu
  JOIN acted_in ON keanu.id = acted_in.person_id
  JOIN directed ON acted_in.movie_id = directed.movie_id
  JOIN person AS director ON directed.person_id = director.id
WHERE keanu.name = 'Keanu Reeves'
GROUP BY director.name
ORDER BY count(*) DESC

Cypher:

MATCH (keanu:Person {name: 'Keanu Reeves'})-[:ACTED_IN]->(movie:Movie),
      (director:Person)-[:DIRECTED]->(movie)
RETURN director.name, count(*)
ORDER BY count(*) DESC

  • Algorithms and Analytics - Graph databases come integrated with commonly used algorithms like Dijkstra’s shortest path, BFS/DFS, cluster detection, etc.

    You can easily and quickly run tasks like

    • Path Finding - find the shortest path between two nodes

    • Centrality - measure the importance or influence of a node within the graph

    • Similarity - calculate the similarity between two nodes (a small example of this follows the list)

    • Community Detection - evaluate clusters within a graph where nodes are densely connected with each other

    • Node Embeddings - compute vector representations of the nodes within the graph
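
As one concrete example of the similarity task above, here’s a rough sketch (hypothetical code, not any particular database’s built-in implementation) that scores two accounts by Jaccard similarity over the attributes they share:

def jaccard_similarity(attrs_a, attrs_b):
    """Jaccard similarity: shared attributes divided by all attributes across both accounts."""
    union = attrs_a | attrs_b
    return len(attrs_a & attrs_b) / len(union) if union else 0.0

# Hypothetical per-account attribute sets (IP addresses, devices, bank accounts).
account_attrs = {
    "user_1": {"ip:10.0.0.5", "device:abc123", "bank:9921"},
    "user_2": {"ip:10.0.0.5", "device:abc123", "bank:4410"},
    "user_3": {"ip:172.16.4.9", "device:zzz999", "bank:7755"},
}
print(jaccard_similarity(account_attrs["user_1"], account_attrs["user_2"]))  # 0.5
print(jaccard_similarity(account_attrs["user_1"], account_attrs["user_3"]))  # 0.0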

Graph Algorithms at Robinhood

Robinhood’s graph-based fraud detection algorithms can be split into two categories:

  • Vertex-centric Algorithms

  • Graph-centric Algorithms

Vertex-centric Algorithms

With these algorithms, computations typically start from one specific node (vertex) and then expand to a subsection of the graph.

Robinhood might start from one vertex that represents a known attacker and then identify risky neighbor nodes.

Examples of Vertex-centric algorithms Robinhood could use include

  • Breadth-First Search

  • Depth-First Search

A breadth-first search visits all of a node’s immediate neighbors before moving on to the next layer.

[Animation: breadth-first search]

If you have a certain node that is a known fraudster, then you can analyze all the nodes that share any attributes with the fraudster (IP address, device, home address, bank account details, etc.).

This can be the most efficient way to detect other fraudsters as nodes that share the same IP address/bank account are very likely to be malicious. After analyzing all those nodes, you can repeat the process for all the neighbors of the nodes you just processed (second-degree connections of the original fraudster node).
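
As a rough sketch of how this expansion could work (hypothetical code, not Robinhood’s implementation), here’s a breadth-first search that collects every account within a couple of hops of a known fraudster:

from collections import deque

def bfs_suspects(adj, start_fraudster, max_hops=2):
    """Breadth-first search from a known fraudster.

    adj maps each account to the accounts it shares attributes with
    (same IP address, device, bank account, etc.).
    Returns each reachable account mapped to its hop distance from the fraudster.
    """
    hops = {start_fraudster: 0}
    queue = deque([start_fraudster])
    while queue:
        account = queue.popleft()
        if hops[account] == max_hops:
            continue  # don't expand past the hop limit
        for neighbor in adj.get(account, []):
            if neighbor not in hops:
                hops[neighbor] = hops[account] + 1
                queue.append(neighbor)
    return hops

adj = {
    "fraudster": ["user_a", "user_b"],   # first-degree: accounts sharing attributes
    "user_a": ["fraudster", "user_c"],   # user_c is a second-degree connection
    "user_b": ["fraudster"],
    "user_c": ["user_a"],
}
print(bfs_suspects(adj, "fraudster"))
# {'fraudster': 0, 'user_a': 1, 'user_b': 1, 'user_c': 2}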

On the other hand, a DFS will explore as deep as possible along a single branch of the graph before backtracking to explore the next path.

[Animation: depth-first search]

A DFS could be particularly useful in scenarios where the fraud detection system needs to explore complex, convoluted chains of transactions or relationships that extend very deeply into the graph.
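
And a comparable depth-first sketch (again hypothetical) that follows one chain of linked accounts as far as it goes before backtracking, recording the path used to reach each account:

def dfs_chains(adj, start_fraudster):
    """Depth-first search that records the chain of accounts leading to each node."""
    paths = {start_fraudster: [start_fraudster]}

    def explore(account):
        for neighbor in adj.get(account, []):
            if neighbor not in paths:
                paths[neighbor] = paths[account] + [neighbor]  # go deep before backtracking
                explore(neighbor)

    explore(start_fraudster)
    return paths

adj = {
    "fraudster": ["user_a"],
    "user_a": ["fraudster", "user_b"],  # a long chain of linked accounts
    "user_b": ["user_a", "user_c"],
    "user_c": ["user_b"],
}
print(dfs_chains(adj, "fraudster")["user_c"])
# ['fraudster', 'user_a', 'user_b', 'user_c']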

Graph-centric Algorithms

With graph-centric algorithms, Robinhood searches for relationships across the entire graph, so computations usually involve looking at the full graph rather than expanding out from a single node.

They look for clusters of nodes that might be useful in finding fraudulent actors.

Examples of useful Graph-centric algorithms include

  • Connected Components - This algorithm identifies subgraphs where every node is reachable from every other node in the subgraph (every account in the cluster is linked, directly or indirectly, to every other account in that cluster).

    Connected Components helps Robinhood identify groups of accounts that are linked together even if those accounts don’t directly share attributes.

  • PageRank - PageRank is an algorithm that measures the importance or influence of each node within a graph based on the number and quality of incoming connections. It assigns a score to each node, with higher scores indicating greater importance or centrality within the network.

    Robinhood can leverage PageRank to identify the most influential or central accounts within a fraud network. By treating each node as a "page" and the relationships as "links," PageRank can help Robinhood determine which accounts are likely to be the key players in a fraud scheme. Accounts with high PageRank scores can be prioritized for investigation and intervention (a small sketch of Connected Components and PageRank follows this list).

  • Graph Embeddings - Graph Embeddings are techniques that transform a graph's structure and node relationships into a vector representation. This embedding captures the essential properties of the graph while simplifying its representation for further analysis.

    Robinhood can utilize Graph Embeddings to convert their complex account relationship graph into a much easier-to-use format. Then, they can apply machine learning algorithms (like k-nearest neighbors or other clustering strategies) to identify potential fraudsters.
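
Here’s a small, hedged sketch of both graph-centric algorithms mentioned above (toy code on a made-up account graph, not Robinhood’s production implementation): connected components found with a BFS flood fill, and PageRank computed with the standard iterative formula:

from collections import deque

def connected_components(adj):
    """Group accounts into clusters where every account is reachable from every other one."""
    seen, components = set(), []
    for node in adj:
        if node in seen:
            continue
        component, queue = {node}, deque([node])
        seen.add(node)
        while queue:
            current = queue.popleft()
            for neighbor in adj[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

def pagerank(adj, damping=0.85, iterations=50):
    """Iterative PageRank: a node's score is built from the scores of the nodes linking to it."""
    nodes = list(adj)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        for node in nodes:
            share = damping * rank[node] / (len(adj[node]) or 1)
            for neighbor in adj[node]:
                new_rank[neighbor] += share
        rank = new_rank
    return rank

# Toy undirected account graph containing two separate rings of linked accounts.
adj = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"],
    "x": ["y"], "y": ["x"],
}
print(connected_components(adj))   # two clusters: {a, b, c, d} and {x, y}
ranks = pagerank(adj)
print(max(ranks, key=ranks.get))   # 'c' is the most connected (most central) account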

Data Processing and Serving

For graph-centric features, Robinhood needs to analyze the entire graph and identify clusters of entities.

This is super resource-intensive, so it’s done offline in periodic batch jobs. Afterwards, the results are ingested into online feature stores where they can be used by Robinhood’s fraud detection platform.

For vertex-centric algorithms, Robinhood uses the same batch processing approach. However, they also use a real-time streaming approach to incorporate the most up-to-date fraud data (similar to a Lambda Architecture where they have a batch layer and a speed layer). This lets them reduce lag from hours to seconds.
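
As a rough, hypothetical illustration of that batch-plus-speed-layer setup (the store contents and field names below are made up, not Robinhood’s schema), a feature lookup can simply prefer the freshest value available across the two layers:

import time

# Hypothetical feature stores: the batch store is rebuilt by periodic offline
# graph jobs, while the real-time store is updated by a streaming pipeline.
batch_store = {("user_42", "fraud_cluster_size"): {"value": 8, "computed_at": time.time() - 6 * 3600}}
realtime_store = {("user_42", "fraud_cluster_size"): {"value": 11, "computed_at": time.time() - 30}}

def get_feature(user_id, feature_name):
    """Serve the freshest available value for a fraud feature."""
    key = (user_id, feature_name)
    candidates = [store[key] for store in (realtime_store, batch_store) if key in store]
    if not candidates:
        return None
    # The most recently computed value wins, so the speed layer usually takes precedence.
    return max(candidates, key=lambda entry: entry["computed_at"])["value"]

print(get_feature("user_42", "fraud_cluster_size"))  # 11, served from the real-time layer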

Results

With these graph algorithms, Robinhood has been able to compute hundreds of features and run many different fraud models.

This has helped them save tens of millions of dollars in potential fraud losses.

Tech Snippets