How PayPal prevents fraud with Graph Databases
Hey Everyone!
Today we’ll be talking about
How PayPal uses Graph Databases for Fraud Prevention
A brief intro to Graph Databases and their properties
PayPal's real-time, interactive and analytics graph technology stacks
The architecture of PayPal's real-time graph technology stack
Tech Snippets
How to quickly get promoted as a developer - A great interview with the former VP of Engineering at Mixpanel.
How Slack scaled their mobile codebase - A 3 part series on how Slack refactored their iOS and Android codebases.
Command Line Interface Guidelines - An amazing guide that goes over how to write better CLI programs.
Your API is Bad - A great article that goes through the best practices you should employ when building an API.
Do you find these articles interesting?Are the tech summary and tech snippets in today's email interesting to you? Your answer helps me improve Quastor's curation! |
How PayPal uses Graph Databases for Fraud Prevention
PayPal is a fintech company that lets users transfer money online. They have over 400 million active consumers and merchants on the platform and they process thousands of payment transactions every minute.
Detection and prevention of fraud is one of the biggest problems fintech companies have to deal with and PayPal is no exception.
In order to do this, PayPal relies on a graph database. Quinn Zuo is the head of AI/ML Product Management at PayPal and he wrote a great blog post on how PayPal does this.
Here’s a summary
A graph is a collection of nodes (vertices) and relationships (edges) between those nodes.
A graph database management system (graph database) gives you a durable way to store your graph along with an interface to easily perform create/read/update/delete (CRUD) operations on the graph.
Relationships are first class citizens in graph databases, so traversing through nodes and computing various graph algorithms (PageRank, Connected Components, etc.) is much easier to implement compared to doing it with SQL queries on a relational database. It can also be significantly faster than a relational database depending on how the graph database is built.
You can view a comparison of querying for data between SQL and Cyper (Neo4J’s graph query language) here.
If you have a graph database with data about actors and the movies they were involved with, then an SQL query for all the directors of Keanu Reeves movies might look like this…
A Cypher query would be much shorter and easier to interpret…
Properties
When looking at graph database technologies, there are two properties you should be examining: the underlying storage (native vs. non-native storage) and the processing engine (native vs. non-native processing).
Underlying Storage
This is the underlying structure of the database that contains the graph data.
It can be either native or non-native. Native graph storage means that it’s been built specifically for storing graph-like data, which means more efficiency when running graph queries. You can see this with graph databases like Neo4j. For non-native storage, the graph database will serialize the graph data into relational, key-value, document-oriented or some other general-purpose data store.
The benefit of non-native graph storage is that you can build your graph database on a battle-tested backend like Postgres or Cassandra where the scaling characteristics are well understood. You can take advantage of a Graph API without having to rebuild all the sharding, replication, redundancy, etc.
PayPal uses Aerospike as the underlying storage for their graph database. Aerospike is an open source, distributed, key value database.
The Processing Engine
The processing engine runs database operations on your graph, and can be split into native or non-native processing.
The key difference between native and non-native processing is index-free adjacency. Index-free adjacency means that your graph doesn’t have to work with a database index to hop from any node to its neighboring nodes. Each node has direct addresses to all of its neighboring nodes. Native graph processing means using index-free adjacency.
On the other hand, if you’re using a relational database as the underlying storage and you need to first check a database index to find the location of a neighboring node, then you do not have native graph processing.
Here’s a great article that delves into index-free adjacency.
Graph Database
PayPal has a two-sided network with buyers and merchants who are sending each other transactions.
They encode this network as a graph with buyers/sellers modeled as vertices in the graph.
Edges are connections between the vertices. Examples of potential connections are sending a payment, sharing the same IP, having the same home address, etc.
An Overview of PayPal’s Graph Platform
PayPal’s graph platform is an integrated platform that consists of real-time, interactive, and analytics graph technology stacks.
The real-time graph platform will return graph query results very quickly (sub-second). The returned query results can be used in machine learning models for immediate fraud prevention. If a malicious actor tries to create a new PayPal account after getting banned, the real-time graph platform can help identify that user and block his account right after he creates it.
The Interactive graph platform serves use cases where the query latency can be within a few seconds or minutes. This is useful for graph visualization and is suitable for investigations done by PayPal’s fraud-prevention teams.
The analytics graph platform is used to uncover unknown patterns using graph algorithms and training graph ML models. It’s built on HPCs so that training and algorithms can be run quickly whereas the interactive and real-time platforms are both built on commodity servers.
Real Time Graph Database
The real time graph platform is used to query the graph database and identify potential fraudulent activities immediately.
Some of the core requirements for PayPal’s Real Time Graph Platform are
Customizable high performance graph query APIs
Sub-second level query latency and optimization
Seconds level data freshness (near real-time update)
Horizontal scalability with fault tolerance
Million queries per second throughput
Flexible data backfill from offline to online
Here’s the architecture for the Real-Time Graph Stack. You can view a larger image here.
The storage backend for the database uses Aerospike and there’s a Write Path and a Read Path built around it to perform CRUD operations on the database.
Write Path
For the write path, offline data loading (in the lower part of the image above) and event-based data updates are abstracted into a single Graph Data Process Service.
The offline channel is set up for loading snapshots of the data and supports daily or weekly updates.
Event-based, near real-time data comes from a variety of production data services at PayPal. These data sources have been abstracted as events/messages in Kafka. The Graph Data Process Service consumes those messages to create new vertices and edges in the graph database.
Read Path
The Graph Query Service is responsible for handling reads from the underlying Aerospike data store. It provides template APIs that the upstream services (for running the ML models) can use.
Those APIs wrap Gremlin queries that run on a Gremlin Layer. Gremlin is an open source graph query language that can be used for OLTP and OLAP traversals. It’s part of Apache TinkerPop, which is a popular graph computing framework.
The Gremlin layer converts the queries into optimized Aerospike queries, where they can be run against the underlying storage.
For more details on PayPal’s Graph viewer and Graph embeddings, you can read the full article here.
Tech Snippets
How to get promoted as a developer - This is a great video from Rahul Pandey (co-founder of Tech Career Growth) on how to get promoted quickly as a software engineer.
He interviews the former VP of Engineering at Mixpanel on how a software engineer at the company went from Engineer 2 -> Senior Engineer -> Staff Engineer -> Tech Lead -> Principal Tech Lead in just 4 years. The key is to do work that has a great impact on the metrics your company is trying to improve. The video gives some tips on how to do that.
How Slack scaled their Mobile Codebases - Over the years, Slack accumulated a great deal of tech debt for their mobile codebases. By the beginning of 2020, this debt had started to slow development down enough to affect product roadmaps so Slack engineers decided to refactor the codebases.
This is a great series of blog posts on how Slack went about the refactoring
Part 1 - Get rid of the worst of the tech debt and anti-patterns and finish stalled migrations
Part 2 - Break the app monolithic into smaller components to reduce interdependencies, decrease build times and allow for independent development
Part 3 - Adopt forward-looking technologies and design patterns’
Command Line Interface Guidelines - This is an amazing guide that goes over how to write better command line interface programs. It goes over philosophical ideas that you should keep in mind when designing the interface and also gives concrete guidelines and best practices that you should implement.
Examples of some guidelines are
Have full-length versions of all flags - For example, have both - h and - - help. Adding the full version is useful in scripts where you want to be verbose and descriptive, so you don’t have to look up the meaning of flags everywhere.
Put the most important information at the end of the output - The user will usually look at the end of the output first, so put the most important information there.
Make it easy for the user to escape - Don’t do what VIM does. Make it clear how to get out.
Your API is Bad - Paddy Carver is an engineer at LaunchDarkly. Previously, he was a senior software engineer at HashiCorp.
He wrote an awesome article on the best practices you should be following if you want to build an API that’s easy to use.
Interview Question
Write a function that adds two numbers.
You cannot use + or any arithmetic operators!
Previous Question
As a refresher, here’s the previous question
A robot is located at the top-left corner of a m x n grid.
The robot can only move either down or right at any point in time.
The robot is trying to reach the bottom-right corner of the grid.
How many possible unique paths are there?
Solution
The easiest way to solve this question is with combinatorics, specifically permutations.
A permutation is a way of counting the number of ways you can order n distinct objects.
If you want to order 5 people in a line, then you have 5 choices for the first spot in the line.
Then, you have 4 choices for the second spot.
3 choices for the third spot.
2 choices for the fourth spot.
And 1 choice for the last spot.
The number of orderings is therefore 5 * 4 * 3 * 2 * 1 or 5!.
Going back to our question, let’s take the example where m = 4 and n = 8.
In other words, we will have to make 3 moves down and 7 moves right in order to reach the bottom right corner.
You can imagine this as 3 downs and 7 rights, or 3 Ds and 7 Rs, and we have to figure out the total number of orderings of downs and rights.
You can represent this in a string as “DDDRRRRRRR” with 3 Ds and 7 Rs.
Our goal is to find the total number of orderings for that string.
Since we have 10 total items (3 + 7), then the total number of orderings is 10!.
However, each D is not distinct from another D. The same applies for the Rs. Each R is not different from another R.
Therefore, solely doing 10! results in overcounting.
How much are we overcounting by?
Well, we’re double counting by the number of D permutations (which is 3!) and overcounting by the number of R permutations (which is 7!).
Therefore, we can divide 10! by (3! * 7!) to get our answer.
Generalizing this, the formula we can use for our answer is
((m - 1) + (n - 1))! / ((m - 1)! * (n - 1)!)
Here’s the Python 3 code…