How Robinhood uses Graph Algorithms to prevent Fraud

How Robinhood uses graph algorithms like Connected Components and PageRank to prevent fraud. Plus, what makes an outstanding programmer, four important software design principles and more.

Hey Everyone!

Today we’ll be talking about

  • How Robinhood uses Graph Algorithms to prevent Fraud

    • The different types of fraud Robinhood has to deal with

    • Introduction to graphs and graph databases

    • Vertex-centric algorithms like BFS and DFS

    • Graph-centric algorithms like Connected Components and PageRank

    • Robinhood’s data processing architecture

  • Tech Snippets

    • 4 Software Design Principles I Learned the Hard Way

    • Ask HN: What makes an outstanding programmer?

    • Why you should become an Engineering Manager

    • Basic Things to Implement Early in Projects

    • Anthropic’s Prompt Library

With LLMs like GPT-4, Claude and LLaMA, there’s a huge range of new apps you can build.

You could create a bot that looks at the latest headlines on stocks in your portfolio and texts you an alert when there’s something important. Or find all the latest news headlines and use GPT-4 to send you summaries of the articles without clickbait.

You can build both of these—and hundreds more ideas—in a weekend with the Brave Search API.

  • Affordable - It’s much cheaper and easier to set up than the other big tech options. So it’s perfect for everything from small projects to large apps.

  • High Quality - Brave’s index is populated with sites that real people actually visit. No junk or clickbait farms, no SEO spam, and a much more human dataset.

  • Easy to Use - It’s quick and easy to set up. Data is structured for simple implementation across a wide range of apps, from NLP to complex analysis.

  • It’s Fast - The API is optimized for low latency, so it’s ideal for real-time apps like responsive search or chatbots.

You can use the API for free for up to 2,000 queries per month.

sponsored

How Robinhood uses Graph Algorithms to prevent Fraud

Robinhood is a brokerage and financial services company that was started in 2013.

Since its founding, the company has become incredibly popular with younger investors thanks to its zero-commission trading, gamified experience and intuitive app interface. There are over 23 million accounts on the platform.

As with any financial services company, fraud is a major issue.

Brokerages like Robinhood face three main types of fraud:

  • Identity Fraud - accounts opened using a stolen identity.

  • Account Takeover - a hacker taking over someone’s account and transferring the victim’s funds out of Robinhood.

  • First-party Fraud - someone making deliberately reckless trades with borrowed funds (on margin), with no intent of paying the funds back if the trade is unsuccessful (in other words, what r/WallStreetBets does on a daily basis).

When fraud happens, it usually involves multiple accounts and is done in a coordinated way. Due to this coordination, accounts that belong to hackers will often have common elements (same IP address, bank account, home address, etc.).

This makes graph algorithms extremely effective at tackling this kind of fraud.

Robinhood wrote a great blog post on how they use graph algorithms for fraud detection.

Introduction to Graphs

Graphs are a very popular way of representing relationships and connections within your data.

They’re composed of

  • Vertices - These represent entities in your data. In Robinhood’s case, vertices/nodes represent individual users.

  • Edges - These represent connections between nodes. For Robinhood, an edge could represent two users sharing the same attributes (same home address, same device, etc.)

Graphs can also get a lot more complicated with edge weights, edge directions, cyclic/acyclic and more.
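
To make the vertex/edge model concrete, here’s a minimal sketch in Python (with hypothetical names, not Robinhood’s actual data model) of an account graph stored as an adjacency list, where each edge records the attribute two users share:

from collections import defaultdict

# Hypothetical account graph: vertices are user IDs, and an edge means two
# users share an attribute (same device, IP address, bank account, etc.).
class AccountGraph:
    def __init__(self):
        self.adj = defaultdict(list)  # user -> list of (neighbor, shared attribute)

    def add_edge(self, user_a, user_b, shared_attribute):
        # Sharing an attribute is symmetric, so edges are undirected.
        self.adj[user_a].append((user_b, shared_attribute))
        self.adj[user_b].append((user_a, shared_attribute))

    def neighbors(self, user):
        return self.adj[user]

graph = AccountGraph()
graph.add_edge("user_1", "user_2", "ip_address")
graph.add_edge("user_2", "user_3", "bank_account")
print(graph.neighbors("user_2"))
# [('user_1', 'ip_address'), ('user_3', 'bank_account')]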

Graph Databases

Graph databases exist for storing data that fits into this paradigm (vertices and edges).

You may have heard of databases like Neo4j or AWS Neptune. These are NoSQL databases specifically built to handle graph data.

There are quite a few reasons why you’d want a specialized graph database instead of using MySQL or Postgres (although, Postgres has extensions that give it the functionality of a graph database).

  • Faster Processing of Relationships - Let’s say you use a relational database to store your graph data. It will use joins to traverse relationships between nodes, which quickly becomes expensive as traversals span multiple hops (each hop is another join) and the graph grows.

    On the other hand, graph databases use pointers to traverse the underlying graph. Each node has direct references to its neighbors (called index-free adjacency) so traversing from a node to its neighbor will always be a constant time operation in a graph database.

     Here’s a really good article that explains exactly why graph databases are so efficient with these traversals.

  • Graph Query Language - Writing SQL queries to find and traverse edges in your graph can be a big pain. Instead, graph databases employ query languages like Cypher and Gremlin to make queries much cleaner and easier to read.

     Here’s an example SQL query and an equivalent Cypher query to find all the directors of Keanu Reeves movies:

SQL:

SELECT director.name, count(*)
FROM person keanu
  JOIN acted_in ON keanu.id = acted_in.person_id
  JOIN directed ON acted_in.movie_id = directed.movie_id
  JOIN person AS director ON directed.person_id = director.id
WHERE keanu.name = 'Keanu Reeves'
GROUP BY director.name
ORDER BY count(*) DESC

Cypher:

MATCH (keanu:Person {name: 'Keanu Reeves'})-[:ACTED_IN]->(movie:Movie),
      (director:Person)-[:DIRECTED]->(movie)
RETURN director.name, count(*)
ORDER BY count(*) DESC

  • Algorithms and Analytics - Graph databases come integrated with commonly used algorithms like Dijkstra’s shortest path, BFS/DFS, cluster detection, etc.

    You can easily and quickly run tasks like

    • Path Finding - find the shortest path between two nodes

    • Centrality - measure the importance or influence of a node within the graph

    • Similarity - calculate the similarity between two nodes (a small example of this follows the list)

    • Community Detection - evaluate clusters within a graph where nodes are densely connected with each other

    • Node Embeddings - compute vector representations of the nodes within the graph
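
As one concrete example of the similarity task above, here’s a rough sketch (hypothetical code, not any particular database’s built-in implementation) that scores two accounts by Jaccard similarity over the attributes they share:

def jaccard_similarity(attrs_a, attrs_b):
    """Jaccard similarity: shared attributes divided by all attributes across both accounts."""
    union = attrs_a | attrs_b
    return len(attrs_a & attrs_b) / len(union) if union else 0.0

# Hypothetical per-account attribute sets (IP addresses, devices, bank accounts).
account_attrs = {
    "user_1": {"ip:10.0.0.5", "device:abc123", "bank:9921"},
    "user_2": {"ip:10.0.0.5", "device:abc123", "bank:4410"},
    "user_3": {"ip:172.16.4.9", "device:zzz999", "bank:7755"},
}
print(jaccard_similarity(account_attrs["user_1"], account_attrs["user_2"]))  # 0.5
print(jaccard_similarity(account_attrs["user_1"], account_attrs["user_3"]))  # 0.0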

Graph Algorithms at Robinhood

Robinhood’s graph-based fraud detection algorithms can be split into two categories:

  • Vertex-centric Algorithms

  • Graph-centric Algorithms

Vertex-centric Algorithms

With these algorithms, computations typically start from one specific node (vertex) and then expand to a subsection of the graph.

Robinhood might start from one vertex that represents a known attacker and then identify risky neighbor nodes.

Examples of Vertex-centric algorithms Robinhood could use include

  • Breadth-First Search

  • Depth-First Search

A breadth-first search visits all of a node’s immediate neighbors before moving on to the next layer.

[Animation: breadth-first search]

If you have a certain node that is a known fraudster, then you can analyze all the nodes that share any attributes with the fraudster (IP address, device, home address, bank account details, etc.).

This can be the most efficient way to detect other fraudsters as nodes that share the same IP address/bank account are very likely to be malicious. After analyzing all those nodes, you can repeat the process for all the neighbors of the nodes you just processed (second-degree connections of the original fraudster node).
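
As a rough sketch of how this expansion could work (hypothetical code, not Robinhood’s implementation), here’s a breadth-first search that collects every account within a couple of hops of a known fraudster:

from collections import deque

def bfs_suspects(adj, start_fraudster, max_hops=2):
    """Breadth-first search from a known fraudster.

    adj maps each account to the accounts it shares attributes with
    (same IP address, device, bank account, etc.).
    Returns each reachable account mapped to its hop distance from the fraudster.
    """
    hops = {start_fraudster: 0}
    queue = deque([start_fraudster])
    while queue:
        account = queue.popleft()
        if hops[account] == max_hops:
            continue  # don't expand past the hop limit
        for neighbor in adj.get(account, []):
            if neighbor not in hops:
                hops[neighbor] = hops[account] + 1
                queue.append(neighbor)
    return hops

adj = {
    "fraudster": ["user_a", "user_b"],   # first-degree: accounts sharing attributes
    "user_a": ["fraudster", "user_c"],   # user_c is a second-degree connection
    "user_b": ["fraudster"],
    "user_c": ["user_a"],
}
print(bfs_suspects(adj, "fraudster"))
# {'fraudster': 0, 'user_a': 1, 'user_b': 1, 'user_c': 2}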

On the other hand, a DFS will explore as deep as possible along a single branch of the graph before backtracking to explore the next path.

[Animation: depth-first search]

A DFS could be particularly useful in scenarios where the fraud detection system needs to explore complex, convoluted chains of transactions or relationships that extend very deeply into the graph.
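
And a comparable depth-first sketch (again hypothetical) that follows one chain of linked accounts as far as it goes before backtracking, recording the path used to reach each account:

def dfs_chains(adj, start_fraudster):
    """Depth-first search that records the chain of accounts leading to each node."""
    paths = {start_fraudster: [start_fraudster]}

    def explore(account):
        for neighbor in adj.get(account, []):
            if neighbor not in paths:
                paths[neighbor] = paths[account] + [neighbor]  # go deep before backtracking
                explore(neighbor)

    explore(start_fraudster)
    return paths

adj = {
    "fraudster": ["user_a"],
    "user_a": ["fraudster", "user_b"],  # a long chain of linked accounts
    "user_b": ["user_a", "user_c"],
    "user_c": ["user_b"],
}
print(dfs_chains(adj, "fraudster")["user_c"])
# ['fraudster', 'user_a', 'user_b', 'user_c']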

Graph-centric Algorithms

With graph-centric algorithms, Robinhood searches for relationships across the entire graph, so computations usually involve looking at the full graph rather than expanding out from a single node.

They look for clusters of nodes that might be useful in finding fraudulent actors.

Examples of useful Graph-centric algorithms include

  • Connected Components - This algorithm identifies subgraphs where every node is reachable from every other node in the subgraph (every account in the cluster is linked, directly or indirectly, to every other account in that cluster).

    Connected Components helps Robinhood identify groups of accounts that are linked together even if those accounts don’t directly share attributes.

  • PageRank - PageRank is an algorithm that measures the importance or influence of each node within a graph based on the number and quality of incoming connections. It assigns a score to each node, with higher scores indicating greater importance or centrality within the network.

    Robinhood can leverage PageRank to identify the most influential or central accounts within a fraud network. By treating each node as a "page" and the relationships as "links," PageRank can help Robinhood determine which accounts are likely to be the key players in a fraud scheme. Accounts with high PageRank scores can be prioritized for investigation and intervention (a small sketch of Connected Components and PageRank follows this list).

  • Graph Embeddings - Graph Embeddings are techniques that transform a graph's structure and node relationships into a vector representation. This embedding captures the essential properties of the graph while simplifying its representation for further analysis.

    Robinhood can utilize Graph Embeddings to convert their complex account relationship graph into a much easier-to-use format. Then, they can apply machine learning algorithms (like k-nearest neighbors or other clustering strategies) to identify potential fraudsters.
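
Here’s a small, hedged sketch of both graph-centric algorithms mentioned above (toy code on a made-up account graph, not Robinhood’s production implementation): connected components found with a BFS flood fill, and PageRank computed with the standard iterative formula:

from collections import deque

def connected_components(adj):
    """Group accounts into clusters where every account is reachable from every other one."""
    seen, components = set(), []
    for node in adj:
        if node in seen:
            continue
        component, queue = {node}, deque([node])
        seen.add(node)
        while queue:
            current = queue.popleft()
            for neighbor in adj[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

def pagerank(adj, damping=0.85, iterations=50):
    """Iterative PageRank: a node's score is built from the scores of the nodes linking to it."""
    nodes = list(adj)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        for node in nodes:
            share = damping * rank[node] / (len(adj[node]) or 1)
            for neighbor in adj[node]:
                new_rank[neighbor] += share
        rank = new_rank
    return rank

# Toy undirected account graph containing two separate rings of linked accounts.
adj = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"],
    "x": ["y"], "y": ["x"],
}
print(connected_components(adj))   # two clusters: {a, b, c, d} and {x, y}
ranks = pagerank(adj)
print(max(ranks, key=ranks.get))   # 'c' is the most connected (most central) account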

Data Processing and Serving

For graph-centric features, Robinhood needs to analyze the entire graph and identify clusters of entities.

This is super resource-intensive, so it’s done offline in periodic batch jobs. Afterwards, the results are ingested into online feature stores where they can be used by Robinhood’s fraud detection platform.

For vertex-centric algorithms, Robinhood uses the same batch processing approach. However, they also use a real-time streaming approach to incorporate the most up-to-date fraud data (similar to a Lambda Architecture where they have a batch layer and a speed layer). This lets them reduce lag from hours to seconds.
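
As a rough, hypothetical illustration of that batch-plus-speed-layer setup (the store contents and field names below are made up, not Robinhood’s schema), a feature lookup can simply prefer the freshest value available across the two layers:

import time

# Hypothetical feature stores: the batch store is rebuilt by periodic offline
# graph jobs, while the real-time store is updated by a streaming pipeline.
batch_store = {("user_42", "fraud_cluster_size"): {"value": 8, "computed_at": time.time() - 6 * 3600}}
realtime_store = {("user_42", "fraud_cluster_size"): {"value": 11, "computed_at": time.time() - 30}}

def get_feature(user_id, feature_name):
    """Serve the freshest available value for a fraud feature."""
    key = (user_id, feature_name)
    candidates = [store[key] for store in (realtime_store, batch_store) if key in store]
    if not candidates:
        return None
    # The most recently computed value wins, so the speed layer usually takes precedence.
    return max(candidates, key=lambda entry: entry["computed_at"])["value"]

print(get_feature("user_42", "fraud_cluster_size"))  # 11, served from the real-time layer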

Results

With these graph algorithms, Robinhood has been able to compute hundreds of features and run many different fraud models.

This has helped them save tens of millions of dollars in potential fraud losses.

Tech Snippets