Apache Graph Database: Choosing the Right Open Source Tool

April 15, 2025

The idea of a graph database has gained traction in recent years as developers look for better ways to model connected data—like social networks, supply chains, or access logs. Alongside this growing interest, a phrase that often comes up in search engines is “Apache Graph Database.”

But if you’re expecting a single project under that name, you won’t find one.

Instead, several projects in the Apache Software Foundation ecosystem provide building blocks for working with graph data. Some handle graph querying, others support graph processing, and a few focus on interfaces and traversals. Each serves a different purpose, and together they form a loosely connected toolkit for graph workloads.

This post breaks down the main technologies involved—what they do, how they differ, and where each fits. We’ll use a simple social graph as a running example to show how these tools approach the same data model from different angles.

What Does “Apache Graph Database” Really Mean?

The phrase “Apache Graph Database” doesn’t refer to a specific project. Rather, it’s a convenient but vague label that bundles together several Apache projects related to graph computing. These projects fall into three broad categories:

  • Graph Querying – tools that support declarative graph queries, like Apache AGE.
  • Graph Traversal and Interfaces – frameworks like Apache TinkerPop that define how to navigate and interact with graph structures.
  • Graph Processing – systems such as Apache Giraph and Flink Gelly, which enable large-scale computation over graph data, often for analytics or algorithmic purposes.

What they have in common is a focus on nodes, edges, and relationships—but they differ in how they store, query, and process that data.

So when people search for “Apache Graph Database,” they’re often looking for any technology under Apache that supports graph modeling or analysis. This post focuses on five such projects: Apache AGE, Apache TinkerPop, Apache Giraph, Apache Spark GraphX, and Apache Flink Gelly.

Each of these addresses graph data in a different way—and understanding those differences helps clarify what the “Apache Graph Database” ecosystem really looks like.

Apache AGE: Bringing Graph Queries to PostgreSQL

Apache AGE (A Graph Extension) adds graph capabilities to PostgreSQL by letting you define and query property graphs using the openCypher language. Unlike traditional graph databases that require a separate system and ETL processes, AGE lets you work with graph data directly inside your relational database.

This makes it possible to store graph structures—like people and their connections—as regular PostgreSQL rows, while querying them in a graph-native way.

Social Graph Example: Who Follows Whom

Let’s use a simple social graph of six users and a few “follows” relationships. 

Suppose you already have two tables in PostgreSQL:

CREATE TABLE users (
  id SERIAL PRIMARY KEY,
  name TEXT
);

CREATE TABLE follows (
  from_id INTEGER,
  to_id INTEGER,
  since DATE
);

With AGE, we begin by creating a graph:

SELECT create_graph('social_graph');

Insert six users as vertices:

SELECT * FROM cypher('social_graph', $$
  CREATE (:User {id: 1, name: 'Alice'}),
         (:User {id: 2, name: 'Bob'}),
         (:User {id: 3, name: 'Charlie'}),
         (:User {id: 4, name: 'Dana'}),
         (:User {id: 5, name: 'Eve'}),
         (:User {id: 6, name: 'Frank'})
$$) AS (v agtype);

Create a few follow edges:

SELECT * FROM cypher('social_graph', $$
  MATCH (a:User), (b:User)
  WHERE a.name = 'Alice' AND b.name = 'Bob'
  CREATE (a)-[:FOLLOWS {since: '2022-01-01'}]->(b)
$$) AS (e agtype);

You can now query the graph structure:

SELECT * FROM cypher('social_graph', $$
  MATCH (a:User)-[:FOLLOWS]->(b:User)
  RETURN a.name, b.name
$$) AS (a_name agtype, b_name agtype);

This returns all follow relationships in the graph, expressed as directed edges.

Why Use Apache AGE?

The design of Apache AGE offers several practical advantages:

No need for data duplication (zero ETL)

Most graph databases require exporting and transforming your existing data into a new format. AGE avoids this by operating directly inside PostgreSQL. Your source data remains in place, and you define graph structures on top of it using metadata or by creating views. This simplifies both development and data governance.

Graph and relational models in one system

You can combine SQL and Cypher queries. For example, join graph results with tabular data, or use PostgreSQL indexes to optimize graph lookups. This dual model is powerful for applications where part of your data fits cleanly into tables but some relationships need graph traversal.

Familiar infrastructure and tooling

Because AGE is a PostgreSQL extension, it works with standard PostgreSQL tooling—JDBC, psql, pgAdmin, connection pooling, and backup systems. You don’t need to introduce or manage a new database engine.

Cypher-based graph language

AGE uses openCypher, a declarative language purpose-built for expressing graph patterns. It lets you write queries like:

MATCH (a:User)-[:FOLLOWS]->(b:User)
WHERE b.name = 'Bob'
RETURN a.name

This is far more intuitive for pathfinding or relationship queries than equivalent SQL.

Apache TinkerPop: Defining Graph Traversals with Gremlin


While Apache AGE integrates graph querying into PostgreSQL, Apache TinkerPop takes a different approach. It isn’t a graph database, but rather a graph computing framework—a specification and runtime for working with graph data across different systems. Its centerpiece is the Gremlin traversal language, a powerful and expressive tool for navigating property graphs.

TinkerPop is used by many graph databases and data platforms—such as JanusGraph, Amazon Neptune, and Azure Cosmos DB—as their query and traversal layer. This makes it ideal if you need a vendor-neutral graph interface that works across engines.

The Role of Gremlin

Gremlin is a graph traversal language designed to express paths and patterns in a graph. While Cypher is a declarative language similar to SQL, Gremlin is procedural and fluent, designed to be embedded directly in programming languages like Java, Groovy, or Python. Instead of writing standalone queries, you build traversals step by step using the host language’s syntax.

Let’s revisit the social graph example. Once the vertices and edges are loaded, we can write traversals like:

g.V().has('name', 'Alice').out('follows').values('name')

This means: start at the vertex where name is “Alice”, follow outgoing “follows” edges, and return the names of the connected vertices.

It’s equivalent to the Cypher query we saw in Apache AGE:

MATCH (a:User {name: 'Alice'})-[:FOLLOWS]->(b:User)
RETURN b.name
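To make the traversal’s step-by-step semantics concrete, here is a plain-Python sketch of the same walk over our sample edge list. This is an illustration only—not TinkerPop code—and the `out_names` helper is hypothetical:

```python
# Sample social graph as (follower, followee) name pairs.
follows = [
    ("Alice", "Bob"), ("Bob", "Alice"), ("Alice", "Charlie"),
    ("Charlie", "Dana"), ("Dana", "Alice"), ("Eve", "Charlie"),
    ("Frank", "Eve"), ("Eve", "Frank"),
]

def out_names(name, edges):
    """Mimic g.V().has('name', name).out('follows').values('name')."""
    return sorted(dst for src, dst in edges if src == name)

print(out_names("Alice", follows))  # ['Bob', 'Charlie']
```

Each Gremlin step (filter the start vertex, walk outgoing edges, project a property) maps to one stage of the comprehension above.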

Where TinkerPop Fits

Unlike AGE, TinkerPop doesn’t handle data storage or schema—it defines how to traverse and manipulate a graph. You’ll still need a graph engine or database that supports the TinkerPop stack, like PuppyGraph, JanusGraph or Neptune.

That said, its biggest strength is interoperability. You can write Gremlin traversals that run across multiple platforms with minimal changes, making it ideal for teams evaluating or migrating between graph solutions.

TinkerPop is best viewed as the “graph language layer” of the Apache ecosystem. It doesn’t provide storage or processing out of the box, but gives you a unified, expressive traversal language to explore complex graph structures—like our social graph—on top of compatible graph systems.

Apache Giraph: Batch Graph Processing at Scale


Apache Giraph is a distributed graph processing system built on top of Apache Hadoop, designed for running large-scale graph algorithms in a batch-oriented fashion. It follows the Bulk Synchronous Parallel (BSP) model, where computation is broken into coordinated steps called supersteps. This makes it fundamentally different from query-based systems like Apache AGE or Apache TinkerPop—it’s not about retrieving paths or subgraphs interactively, but about computing over the entire graph in parallel.

A Vertex-Centric Programming Model

In Giraph, each vertex is a unit of computation. It holds a value, communicates only with its direct neighbors, and runs a user-defined function during each superstep. Messages sent during one superstep are buffered and delivered at the start of the next. This model supports scalable execution of algorithms such as:

  • PageRank
  • Single-source shortest paths
  • Connected components
  • Label propagation

For example, consider our social graph and assume the edges look like this:

Figure: An example social graph

To compute the set of users reachable from Alice, Giraph would simulate this process in supersteps:

  1. Superstep 0: Alice sends messages to Bob and Charlie.
  2. Superstep 1: Bob and Charlie mark themselves as reached, and send messages to Alice and Dana, respectively.
  3. Superstep 2: Dana marks herself and sends to Alice.
  4. Superstep 3: No new messages—computation halts.
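The supersteps above can be sketched as a small Python loop with a barrier between message rounds, in the spirit of BSP. This is a conceptual simulation, not Giraph’s Java API:

```python
# Directed edges of the sample social graph.
edges = [
    ("Alice", "Bob"), ("Bob", "Alice"), ("Alice", "Charlie"),
    ("Charlie", "Dana"), ("Dana", "Alice"), ("Eve", "Charlie"),
    ("Frank", "Eve"), ("Eve", "Frank"),
]

def bsp_reachable(source, edges):
    """BSP-style reachability: vertices that receive a message in one
    superstep activate and message their out-neighbors in the next."""
    out = {}
    for src, dst in edges:
        out.setdefault(src, []).append(dst)
    reached = {source}
    messages = list(out.get(source, []))  # superstep 0: source sends
    while messages:                       # barrier between supersteps
        next_messages = []
        for v in messages:
            if v not in reached:
                reached.add(v)
                next_messages.extend(out.get(v, []))
        messages = next_messages          # halt when no new messages
    return reached

print(sorted(bsp_reachable("Alice", edges)))  # ['Alice', 'Bob', 'Charlie', 'Dana']
```

Note that Eve and Frank never appear: no directed path reaches them from Alice, mirroring the superstep trace above.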

This model scales efficiently to graphs with billions of vertices and edges, distributed across many machines. Giraph runs on top of Hadoop and integrates with HDFS, but it is strictly batch-oriented: jobs are written once and run over the full graph, so it’s not meant for real-time or ad hoc queries.

Development and Retirement

Giraph was originally developed at Yahoo!, then donated to the Apache Software Foundation, where it became a top-level project in 2012. It gained early attention after Facebook used it to process a trillion-edge graph.

However, over time, development activity declined. In September 2023, Apache Giraph was officially retired and moved to the Apache Attic, meaning it is no longer actively maintained. The source code and documentation are still available, but it’s no longer recommended for production use.

Where It Fits

Giraph was built for offline, large-scale analytics, particularly when working in batch environments powered by Hadoop. Although it’s now inactive, it helped define the distributed graph processing space. Today, alternatives like Apache Flink Gelly and Apache Spark GraphX offer similar capabilities with more active communities and modern runtimes.

Apache Spark GraphX: Scalable Graph Analytics on Spark

Apache Spark GraphX is a distributed graph processing library built into Apache Spark. Like Apache Giraph, GraphX is designed for batch computation over large graphs, but it takes a more flexible approach by integrating with Spark’s general-purpose data abstractions. GraphX represents graphs using RDDs (Resilient Distributed Datasets) and supports both graph-parallel operations and relational-style transformations.

GraphX introduces a Graph[VD, ED] abstraction, where VD is the type of data stored at each vertex, and ED is the type of data stored on each edge. It also includes built-in algorithms like PageRank, connected components, triangle counting, and shortest paths.

Applying GraphX to Our Social Graph

Continuing with the social graph example, we can model it in GraphX by first defining vertices and edges:

val vertices = sc.parallelize(Seq(
  (1L, "Alice"),
  (2L, "Bob"),
  (3L, "Charlie"),
  (4L, "Dana"),
  (5L, "Eve"),
  (6L, "Frank")
))

val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),  // Alice → Bob
  Edge(2L, 1L, "follows"),  // Bob → Alice
  Edge(1L, 3L, "follows"),  // Alice → Charlie
  Edge(3L, 4L, "follows"),  // Charlie → Dana
  Edge(4L, 1L, "follows"),  // Dana → Alice
  Edge(5L, 3L, "follows"),  // Eve → Charlie
  Edge(6L, 5L, "follows"),  // Frank → Eve
  Edge(5L, 6L, "follows")   // Eve → Frank
))

val graph = Graph(vertices, edges)

Suppose we want to compute the number of users each person can reach in one step (out-degree):

val outDegrees = graph.outDegrees
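On our eight sample edges the result is easy to verify by hand; here is a plain-Python check (not GraphX code) using the same numeric IDs:

```python
from collections import Counter

# (src_id, dst_id) pairs matching the GraphX edge list above.
edges = [(1, 2), (2, 1), (1, 3), (3, 4), (4, 1), (5, 3), (6, 5), (5, 6)]

# Out-degree = number of outgoing edges per vertex.
out_degrees = Counter(src for src, _ in edges)
print(dict(out_degrees))  # {1: 2, 2: 1, 3: 1, 4: 1, 5: 2, 6: 1}
```

Alice (1) and Eve (5) each follow two users; everyone else follows exactly one.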

Or run PageRank to assess influence in the network:

val ranks = graph.pageRank(0.001).vertices

This combination of graph and RDD-style APIs allows for flexible data manipulation and computation on large graphs.
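For intuition about what `graph.pageRank` converges toward, the core of PageRank is a short power iteration. The following is a simplified plain-Python version (ranks normalized to sum to 1, damping factor 0.85), not Spark code:

```python
def pagerank(edges, d=0.85, iters=50):
    """Simplified PageRank by power iteration. Every vertex in this sample
    graph has at least one outgoing edge, so no dangling-node handling
    is needed."""
    nodes = sorted({v for e in edges for v in e})
    out = {v: [dst for src, dst in edges if src == v] for v in nodes}
    rank = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):
        new = {v: (1 - d) / len(nodes) for v in nodes}
        for src, dst in edges:
            # Each vertex splits its rank evenly among its out-neighbors.
            new[dst] += d * rank[src] / len(out[src])
        rank = new
    return rank

edges = [(1, 2), (2, 1), (1, 3), (3, 4), (4, 1), (5, 3), (6, 5), (5, 6)]
ranks = pagerank(edges)
# Alice (1) receives the full rank of both Bob (2) and Dana (4),
# so she ends up the most "influential" vertex.
print(max(ranks, key=ranks.get))  # 1
```

Real GraphX runs the same fixed-point idea in parallel over partitioned RDDs, with `0.001` acting as a convergence tolerance rather than a fixed iteration count.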

Where GraphX Fits

GraphX is ideal if you’re already working in the Spark ecosystem and need scalable batch analytics on graph data. It’s especially well-suited for pipelines that mix structured data processing with graph-based algorithms.

However, GraphX does have limitations:

  • It lacks native support for streaming graphs (unlike Flink Gelly),
  • Its API is based on RDDs, which are lower-level than Spark’s newer DataFrame and Dataset APIs,
  • There’s no built-in query language like Cypher or Gremlin.

Still, for large-scale offline graph analysis, GraphX remains one of the most mature and widely used tools in the Apache stack.

Apache Flink Gelly: Graph Processing in Data Streams


Apache Flink Gelly is a graph processing API built into Apache Flink. Like Giraph and GraphX, it focuses on computing over large-scale graphs. Gelly itself is built on Flink’s batch DataSet API, but because it runs on Flink’s streaming-first runtime, it sits closer to incremental and continuously evolving workloads than the Hadoop-era systems it competes with.

Gelly provides a Graph abstraction that models vertices and edges with arbitrary properties. It supports transformations like mapVertices, filterOnEdges, and joinWithVertices, and includes a growing library of graph algorithms, such as PageRank, label propagation, and triangle counting.

Let’s go back to the social graph. In Gelly, you would typically represent this graph using two DataSets—one for vertices and one for edges—plus the execution environment:

val vertices: DataSet[Vertex[Long, String]] = env.fromElements(
  new Vertex(1L, "Alice"),
  new Vertex(2L, "Bob"),
  new Vertex(3L, "Charlie"),
  new Vertex(4L, "Dana"),
  new Vertex(5L, "Eve"),
  new Vertex(6L, "Frank")
)
// Each vertex is (ID, name)

val edges: DataSet[Edge[Long, String]] = env.fromElements(
  new Edge(1L, 2L, "follows"),   // Alice → Bob
  new Edge(2L, 1L, "follows"),   // Bob → Alice
  new Edge(1L, 3L, "follows"),   // Alice → Charlie
  new Edge(3L, 4L, "follows"),   // Charlie → Dana
  new Edge(4L, 1L, "follows"),   // Dana → Alice
  new Edge(5L, 3L, "follows"),   // Eve → Charlie
  new Edge(6L, 5L, "follows"),   // Frank → Eve
  new Edge(5L, 6L, "follows")    // Eve → Frank
)

val graph: Graph[Long, String, String] = Graph.fromDataSet(vertices, edges, env)

Suppose we want to compute which users are reachable from Alice, and how far away they are. You could write a custom BFS-style iteration or use Gelly’s built-in scatter-gather algorithm (note that SingleSourceShortestPaths expects numeric edge values, so the String labels above would first be mapped to weights such as 1.0):

graph.run(new SingleSourceShortestPaths(1L, maxIterations))

This would traverse the graph starting from Alice (ID 1) and propagate distances.
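With unit edge weights, single-source shortest paths reduces to breadth-first search. The result Gelly would compute from Alice (ID 1) can be sketched in plain Python:

```python
from collections import deque

# (src_id, dst_id) pairs matching the Gelly edge list above.
edges = [(1, 2), (2, 1), (1, 3), (3, 4), (4, 1), (5, 3), (6, 5), (5, 6)]

def bfs_distances(source, edges):
    """Hop counts from source; unreachable vertices are omitted."""
    out = {}
    for src, dst in edges:
        out.setdefault(src, []).append(dst)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in out.get(v, []):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

print(bfs_distances(1, edges))  # {1: 0, 2: 1, 3: 1, 4: 2}
```

Eve (5) and Frank (6) are absent from the result because no directed path from Alice reaches them.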

What’s important here is the runtime underneath. Giraph and GraphX are strictly batch-oriented, while Flink’s streaming runtime makes it practical to build graph computations that adapt as new edges arrive. Gelly’s own API is batch-based, so truly streaming graph analytics requires Flink’s DataStream API or extensions built on top of it.
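As a toy illustration of the incremental idea (not Gelly or Flink API—the stream and helper below are hypothetical), reachability from Alice can be maintained as edges arrive one at a time:

```python
def stream_reachability(source, edge_stream):
    """Maintain the set of vertices reachable from `source` as directed
    edges arrive one by one, yielding a snapshot after each edge."""
    out = {}
    reached = {source}
    for src, dst in edge_stream:
        out.setdefault(src, []).append(dst)
        if src in reached and dst not in reached:
            # The new edge extends the frontier: expand from dst, which may
            # also pull in vertices connected by earlier edges.
            stack = [dst]
            while stack:
                v = stack.pop()
                if v not in reached:
                    reached.add(v)
                    stack.extend(out.get(v, []))
        yield set(reached)

stream = [("Alice", "Bob"), ("Charlie", "Dana"), ("Alice", "Charlie")]
snapshots = list(stream_reachability("Alice", stream))
print(snapshots[-1])  # {'Alice', 'Bob', 'Charlie', 'Dana'}
```

Note how the second edge (Charlie → Dana) has no immediate effect, but the third edge retroactively makes Dana reachable—exactly the kind of incremental update a streaming graph system must handle.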

Where Gelly Fits

Gelly is ideal for applications where graph structure is constantly evolving—social media streams, fraud detection systems, or real-time recommendation engines. Its ability to integrate with Flink’s DataStream API makes it suitable for event-driven architectures where relationships form and change continuously.

In contrast to Giraph (now retired) and GraphX (static RDD-based), Gelly provides more flexibility and extensibility, with support for iterative, incremental, and asynchronous computation.

PuppyGraph: A Practical Alternative

While Apache AGE, TinkerPop, Giraph, Gelly, and GraphX each offer valuable capabilities for graph querying and analytics, they also come with trade-offs—such as requiring data to be migrated into specific engines, limited support for streaming or multi-source access, or the need to manage custom infrastructure. These constraints can make graph integration more complex than it needs to be, especially in environments where data already lives in multiple relational or semi-structured systems.

PuppyGraph is a graph query engine that lets companies transform existing relational data stores into a unified graph model in under 10 minutes, bypassing the cost, latency, and maintenance hurdles of traditional graph databases. It addresses the integration issues above by letting you define and query virtual property graphs directly on top of existing data sources—without ETL, data duplication, or backend lock-in. It supports a broad range of sources, including databases, data warehouses, data lakes, and streaming platforms. You define your graph structure using a simple JSON schema and start querying it with either openCypher or Gremlin, without reformatting or relocating your data.

Figure: A demo of a cybersecurity knowledge graph using PuppyGraph.

Designed for petabyte-level data volumes, PuppyGraph decouples storage from computation. Its distributed, columnar architecture allows it to execute complex multi-hop queries across hundreds of millions of edges in seconds, thereby mitigating performance bottlenecks. This ensures that even as data scales, security teams can perform deep historical lookups and real-time threat analyses without delay.

Figure: PuppyGraph Architecture

Conclusion

The term “Apache Graph Database” covers a diverse set of technologies rather than a single product. From Apache AGE’s Cypher support in PostgreSQL, to Apache TinkerPop’s Gremlin-based traversal framework, to distributed graph processors like Giraph, GraphX, and Flink Gelly, the Apache ecosystem offers multiple entry points for working with graph data—each with its own strengths and limitations.

These tools reflect different design goals: some focus on graph querying, others on large-scale computation, and some serve as interfaces or frameworks rather than standalone engines. Understanding these distinctions is essential when choosing the right tool for your workload.

For teams looking to simplify graph access across mixed environments, newer solutions like PuppyGraph aim to unify graph querying without requiring data migration or system reconfiguration—bridging concepts introduced across the Apache landscape into a more streamlined experience. 

If you’re exploring graph technology and want to see what a no-ETL, query-ready engine looks like in practice, try the forever-free Developer Edition, or book a free demo with our graph expert team to see it in action.

Matt Tanner
Head of Developer Relations

Matt is a developer at heart with a passion for data, software architecture, and writing technical content. In the past, Matt worked at some of the largest finance and insurance companies in Canada before pivoting to working for fast-growing startups.
