COMPUTER SCIENCE FUNDAMENTALS SERIES

Graph Data Structures

Adjacency matrix · Adjacency list · CSR format ·
DAGs · Property graphs · Graph libraries

Mid-level software engineer track · 20 slides

Graph Terminology

A graph G = (V, E) consists of a set of vertices (nodes) and a set of edges (links) connecting them.

Term	Definition
Vertex (node)	A fundamental unit -- represents an entity
Edge (link)	A connection between two vertices
Degree	Number of edges incident to a vertex
Path	Sequence of vertices connected by edges
Cycle	A path that starts and ends at the same vertex
Connected	Every vertex reachable from every other
Component	A maximal connected subgraph

5 vertices, 6 edges, connected graph

Graphs generalise many other data structures -- a tree is a connected acyclic graph, and a linked list is a path graph.

Directed & Undirected Graphs

Undirected graphs

Edges have no direction. If (u, v) exists, then (v, u) is implied.

Friendships on a social network -- symmetric
Road segments allowing traffic in both directions
Molecular bonds in chemistry

Directed graphs (digraphs)

Each edge has a source and a target. (u, v) does not imply (v, u).

Twitter follows -- Alice follows Bob ≠ Bob follows Alice
Web hyperlinks -- page A links to page B
Dependencies -- module A depends on module B

Many real-world graphs are directed. Even "undirected" relationships are often stored as two directed edges for implementation simplicity.

Weighted Graphs

A weighted graph assigns a numeric value (weight, cost, capacity) to each edge. Weights represent distance, latency, bandwidth, probability, or any domain metric.

Where weights matter

Shortest path -- Dijkstra, Bellman-Ford, A* all require edge weights
Minimum spanning tree -- Kruskal, Prim select lowest-weight edges
Network flow -- edge weights represent capacities
Negative weights -- Bellman-Ford handles them (and detects negative cycles); Dijkstra does not

An unweighted graph is a weighted graph where every edge has weight 1.
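As a minimal sketch of how edge weights drive a shortest-path computation, here is Dijkstra's algorithm on top of Python's heapq. The graph layout (a dict mapping each vertex to a list of (neighbour, weight) pairs), the function name dijkstra, and the sample weights are illustrative assumptions; the code assumes all weights are non-negative.

```python
import heapq

def dijkstra(graph, source):
    """Return shortest distances from source; assumes non-negative weights."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, a shorter route was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Hypothetical weighted digraph: vertex -> [(neighbour, weight), ...]
graph = {
    "A": [("B", 5), ("C", 2)],
    "B": [("D", 3)],
    "C": [("D", 7)],
    "D": [("E", 1)],
    "E": [],
}
distances = dijkstra(graph, "A")
```

Setting every weight to 1 gives the same distances a plain BFS would report, which is the sense in which an unweighted graph is a special case.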

Adjacency Matrix

For a graph with n vertices, allocate an n x n matrix M. M[i][j] = 1 if edge (i, j) exists, 0 otherwise. For weighted graphs, store the weight.

Vertices: A=0  B=1  C=2  D=3

     A  B  C  D
A [  0  1  1  0 ]
B [  1  0  0  1 ]
C [  1  0  0  1 ]
D [  0  1  1  0 ]

Characteristics

Property	Value
Space	`O(V²)` -- regardless of edge count
Edge lookup	`O(1)` -- direct array access
Add edge	`O(1)`
Iterate neighbours	`O(V)` -- must scan entire row
Add vertex	`O(V²)` -- reallocate matrix

Best for dense graphs where |E| approaches |V|². Wastes memory on sparse graphs.
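The four-vertex example above can be sketched with nested Python lists. The helper names has_edge and neighbours are illustrative assumptions, not part of any library.

```python
# Vertex-to-index mapping from the example: A=0, B=1, C=2, D=3
index = {"A": 0, "B": 1, "C": 2, "D": 3}
names = list(index)
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]

n = len(index)
matrix = [[0] * n for _ in range(n)]
for u, v in edges:
    i, j = index[u], index[v]
    matrix[i][j] = 1
    matrix[j][i] = 1  # undirected graph: mirror the entry

def has_edge(u, v):
    # O(1): a single array access
    return matrix[index[u]][index[v]] == 1

def neighbours(u):
    # O(V): the whole row is scanned, even when the vertex has few edges
    return [names[j] for j, bit in enumerate(matrix[index[u]]) if bit]
```

For a weighted graph, the same layout would store the weight in place of 1 and needs a sentinel (for example None) for "no edge".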

Adjacency List

Each vertex stores a list (array, linked list, or hash set) of its neighbours. The most common representation in practice.

A -> [B, C]
B -> [A, D]
C -> [A, D]
D -> [B, C]

Default choice for most graph algorithms. BFS and DFS run in O(V + E) -- optimal.

Characteristics

Property	Value
Space	`O(V + E)` -- proportional to graph size
Edge lookup	`O(degree)` or `O(1)` with hash set
Add edge	`O(1)` -- append to list
Iterate neighbours	`O(degree)` -- direct traversal
Add vertex	`O(1)` -- append new list
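A minimal sketch of the hash-set variant, assuming an undirected graph. The class name Graph and its methods are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict, deque

class Graph:
    def __init__(self):
        # vertex -> set of neighbours; a set gives O(1) average edge lookup
        self.adj = defaultdict(set)

    def add_edge(self, u, v):
        # undirected: store the edge in both directions
        self.adj[u].add(v)
        self.adj[v].add(u)

    def has_edge(self, u, v):
        return v in self.adj.get(u, ())

    def bfs_order(self, start):
        seen = {start}
        order = []
        queue = deque([start])
        while queue:
            u = queue.popleft()
            order.append(u)
            # sorted() only makes the visiting order deterministic here
            for v in sorted(self.adj[u]):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        return order

g = Graph()
for u, v in [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]:
    g.add_edge(u, v)
```

Swapping the set for a list trades the O(1) lookup for lower memory overhead and stable insertion order.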

Matrix vs List -- Trade-offs

Criterion	Adjacency Matrix	Adjacency List
Space complexity	`O(V²)`	`O(V + E)`
Edge existence check	`O(1)`	`O(degree)`
Iterate all neighbours	`O(V)`	`O(degree)`
Add / remove edge	`O(1)`	`O(1)` add / `O(degree)` remove
Dense graph performance	Excellent	Wasteful pointer overhead
Sparse graph performance	Wasteful memory	Excellent
Cache locality	Good (contiguous)	Poor (pointer chasing)
Matrix operations	Natural (multiply, transpose)	Requires conversion

Prefer the matrix

Dense graphs (|E| > |V|²/4)
Frequent edge-existence queries
Matrix algebra (spectral methods, PageRank)
Small graphs (under ~1000 vertices)

Prefer the list

Sparse graphs (most real-world graphs)
Traversal algorithms (BFS, DFS, Dijkstra)
Dynamic graphs with frequent mutations
Memory-constrained environments

Edge List & Incidence Matrix

Edge list

Store edges as a flat list of (source, target, weight) tuples. Simplest possible representation.

[(A,B,5), (A,C,2), (B,D,3),
 (C,D,7), (D,E,1)]

Space: O(E)
Edge lookup: O(E) -- linear scan
Best for: Kruskal's algorithm (sort by weight), input parsing, serialisation
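To illustrate why an edge list suits Kruskal's algorithm, here is a minimal sketch that sorts the tuples by weight and uses a small union-find. The function name kruskal and the dict-based parent map are assumptions made for this example.

```python
def kruskal(vertices, edges):
    """Return the edges of a minimum spanning forest as (u, v, w) tuples."""
    parent = {v: v for v in vertices}

    def find(x):
        # Follow parent links to the root, halving the path on the way
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:  # endpoints in different components: no cycle is formed
            parent[ru] = rv
            mst.append((u, v, w))
    return mst

edges = [("A", "B", 5), ("A", "C", 2), ("B", "D", 3),
         ("C", "D", 7), ("D", "E", 1)]
mst = kruskal("ABCDE", edges)
```

The algorithm never asks "who are the neighbours of u?", so no adjacency structure is needed at all.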

Incidence matrix

A |V| x |E| matrix. For a directed graph, column j has +1 at the source and -1 at the target (the sign convention varies between texts); for an undirected graph, it has 1 at both endpoints.

       e1  e2  e3  e4
  A [   1   1   0   0 ]
  B [   1   0   1   0 ]
  C [   0   1   0   1 ]
  D [   0   0   1   1 ]

Space: O(V * E)
Use case: theoretical analysis, network flow, circuit modelling

Edge lists are underrated for batch processing. Many distributed graph frameworks (Spark GraphX) use edge-list partitioning internally.

Compressed Sparse Row (CSR)

CSR stores a sparse adjacency structure using three flat arrays, eliminating pointer overhead and maximising cache locality.

Graph:  0->[1,3]  1->[2]  2->[3]  3->[]

row_ptr: [0, 2, 3, 4, 4]
col_idx: [1, 3, 2, 3]
values:  [1, 1, 1, 1]

Array	Size	Purpose
`row_ptr`	`V + 1`	Start index in col_idx for each vertex
`col_idx`	`E`	Concatenated sorted neighbour indices
`values`	`E`	Edge weights (omit for unweighted)

Characteristics

Space: O(V + E) with minimal overhead
Neighbour iteration: contiguous memory scan -- cache-friendly
Edge lookup: binary search within row -- O(log degree)
Weakness: static -- inserting edges requires full rebuild

Standard format for high-performance graph libraries (Boost Graph, SuiteSparse, cuGraph) and GPU graph processing. CSC (Compressed Sparse Column) enables efficient in-neighbour access.
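The three-array layout can be sketched in plain Python lists as follows. The function names build_csr, neighbours, and has_edge are illustrative assumptions; vertices are assumed to be numbered 0..n-1, and the values array is omitted because the example is unweighted.

```python
from bisect import bisect_left

def build_csr(adj, n):
    """Build row_ptr and col_idx from a dict of vertex -> neighbour list."""
    row_ptr = [0]
    col_idx = []
    for u in range(n):
        col_idx.extend(sorted(adj.get(u, [])))  # sorted rows allow binary search
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx

def neighbours(row_ptr, col_idx, u):
    # One contiguous slice of col_idx
    return col_idx[row_ptr[u]:row_ptr[u + 1]]

def has_edge(row_ptr, col_idx, u, v):
    # Binary search restricted to row u: O(log degree)
    lo, hi = row_ptr[u], row_ptr[u + 1]
    pos = bisect_left(col_idx, v, lo, hi)
    return pos < hi and col_idx[pos] == v

adj = {0: [1, 3], 1: [2], 2: [3], 3: []}
row_ptr, col_idx = build_csr(adj, 4)
```

In practice the arrays would be typed buffers (for example NumPy arrays) rather than Python lists; the indexing logic stays the same.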

Implicit Graphs

An implicit graph generates vertices and edges on demand via a function rather than storing them in memory. The graph exists logically but is never fully materialised.

Examples

Grid / lattice

Vertex (r, c) connects to (r±1, c) and (r, c±1). An n x n grid has n² vertices but the adjacency structure is never allocated.
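A minimal sketch of the grid case, assuming 4-connectivity and a square n x n grid. The name grid_neighbours is hypothetical.

```python
def grid_neighbours(r, c, n):
    """Yield the in-bounds 4-connected neighbours of cell (r, c)."""
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < n and 0 <= nc < n:
            yield (nr, nc)

corner = list(grid_neighbours(0, 0, 3))
centre = sorted(grid_neighbours(1, 1, 3))
```

Blocked cells or diagonal moves can be handled by adding a condition or extra offsets inside the loop; nothing is stored per vertex.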

State-space search

Puzzle states (Rubik's cube, 15-puzzle) are vertices; legal moves are edges. State counts run from about 10¹³ (15-puzzle) to about 4.3 x 10¹⁹ (Rubik's cube), but a search explores only a fraction of them.

Game trees

Chess positions are vertices; legal moves are edges. Estimates of the number of legal positions range from roughly 10⁴⁴ to 10⁴⁷ -- fully materialising the graph is impossible.

Neighbour function pattern

from collections import deque

def neighbours(state):
    # legal_moves and apply are problem-specific helpers supplied by the caller
    for move in legal_moves(state):
        yield apply(move, state)

# BFS over an implicit graph; states must be hashable to go in the visited set
def bfs(start, goal):
    queue = deque([start])
    visited = {start}
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nbr in neighbours(node):
            if nbr not in visited:
                visited.add(nbr)
                queue.append(nbr)
    return False

When a graph is too large to store explicitly (as a rough guide, beyond ~10⁸ vertices on a single machine), an implicit representation combined with BFS, DFS, or A* is often the only practical option.

Multigraphs & Hypergraphs

Multigraphs

Allow multiple edges (parallel edges) between the same pair of vertices, and optionally self-loops.

Flight routes -- multiple airlines between same cities
Communication networks -- multiple links between routers
Representation: adjacency list with edge IDs

A ==[flight_101]==> B
A ==[flight_202]==> B
A ==[flight_303]==> B
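One way to sketch "adjacency list with edge IDs" is to store (edge_id, target) pairs, so parallel edges stay distinct. The class name MultiGraph and its methods are assumptions made for this illustration.

```python
from collections import defaultdict

class MultiGraph:
    def __init__(self):
        # adj[u] is a list of (edge_id, v); the same v may appear several times
        self.adj = defaultdict(list)

    def add_edge(self, u, v, edge_id):
        # directed edge u -> v, identified by edge_id
        self.adj[u].append((edge_id, v))

    def edges_between(self, u, v):
        return [eid for eid, w in self.adj[u] if w == v]

g = MultiGraph()
for flight in ("flight_101", "flight_202", "flight_303"):
    g.add_edge("A", "B", flight)
```

Per-edge attributes (airline, price, duration) would then live in a separate dict keyed by edge ID.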

Hypergraphs

A hyperedge connects an arbitrary number of vertices (not just two). Generalises the edge concept.

Database schemas -- a table (hyperedge) relates multiple columns
Co-authorship -- a paper connects all its authors
VLSI design -- a net connects multiple pins

Standard graph algorithms do not apply directly. Common approach: expand to a bipartite graph.
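The bipartite expansion can be sketched as follows: each hyperedge becomes a node of its own, linked to every vertex it contains. The function name expand_to_bipartite and the ("edge", ...) / ("vertex", ...) tagging scheme are assumptions chosen to keep the two sides apart.

```python
def expand_to_bipartite(hyperedges):
    """hyperedges: dict of hyperedge name -> iterable of member vertices."""
    adj = {}
    for name, members in hyperedges.items():
        edge_node = ("edge", name)
        adj.setdefault(edge_node, set())
        for v in members:
            vertex_node = ("vertex", v)
            adj[edge_node].add(vertex_node)
            adj.setdefault(vertex_node, set()).add(edge_node)
    return adj

# Hypothetical co-authorship data: each paper is one hyperedge
papers = {"paper1": ["alice", "bob", "carol"], "paper2": ["bob", "dave"]}
bip = expand_to_bipartite(papers)
```

After the expansion, ordinary BFS or DFS works unchanged; two original vertices share a hyperedge exactly when they are two hops apart.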

Bipartite Graphs

A bipartite graph is one whose vertices can be divided into two disjoint sets U and W such that every edge connects a vertex in U to a vertex in W. No edge joins two vertices in the same set.

Detection: BFS 2-colouring

Pick any unvisited vertex, colour it red
Colour all neighbours blue
Colour their uncoloured neighbours red
If a neighbour already has the same colour as the current vertex -- not bipartite
Repeat for disconnected components

Time complexity: O(V + E). A graph is bipartite iff it contains no odd-length cycle.
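The colouring steps above can be sketched as follows, assuming an undirected graph given as a dict in which every vertex appears as a key. The function name is_bipartite and the 0/1 colour encoding are illustrative choices.

```python
from collections import deque

def is_bipartite(adj):
    colour = {}
    for start in adj:            # outer loop covers disconnected components
        if start in colour:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in colour:
                    colour[v] = 1 - colour[u]   # opposite colour
                    queue.append(v)
                elif colour[v] == colour[u]:
                    return False                # edge inside one colour class
    return True

square = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
triangle = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
```

The square is a 4-cycle (even), the triangle a 3-cycle (odd), matching the odd-cycle criterion.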

Applications

Job matching -- workers to tasks (Hungarian algorithm)
Recommendations -- users to items
Scheduling -- time slots to courses

DAGs -- Directed Acyclic Graphs

A directed acyclic graph is a directed graph with no cycles. Every DAG has at least one topological ordering.

Key properties

At least one vertex with in-degree 0 (source) and one with out-degree 0 (sink)
Longest path is finite and computable in O(V + E)
Number of paths between two vertices can be exponential
Transitive closure and reduction are well-defined
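As a sketch of the O(V + E) longest-path claim, the following processes vertices in topological order and relaxes each edge once. It measures path length in edges, assumes every vertex appears as a key in the adjacency dict, and uses the hypothetical name longest_path_length.

```python
from collections import deque

def longest_path_length(adj):
    """Length (in edges) of the longest path in a DAG."""
    indeg = {u: 0 for u in adj}
    for u in adj:
        for v in adj[u]:
            indeg[v] += 1
    queue = deque(u for u in adj if indeg[u] == 0)
    dist = {u: 0 for u in adj}   # longest path ending at each vertex
    processed = 0
    while queue:
        u = queue.popleft()
        processed += 1
        for v in adj[u]:
            dist[v] = max(dist[v], dist[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if processed != len(adj):
        raise ValueError("graph contains a cycle")
    return max(dist.values(), default=0)

dag = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
longest = longest_path_length(dag)
```

On a general graph with cycles the longest simple path problem is NP-hard, which is why the acyclic restriction matters here.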

Cycle detection

Run DFS and check for back edges -- an edge to a vertex currently on the recursion stack. If found, the graph has a cycle and is not a DAG.
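A minimal sketch of back-edge detection using the usual three-state marking (unvisited, on the recursion stack, finished). The names has_cycle, WHITE, GREY, and BLACK are illustrative; the recursive form assumes the graph is shallow enough for Python's recursion limit.

```python
WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the recursion stack, finished

def has_cycle(adj):
    """adj: dict of vertex -> list of successors; every vertex is a key."""
    state = {u: WHITE for u in adj}

    def visit(u):
        state[u] = GREY
        for v in adj[u]:
            if state[v] == GREY:
                return True              # back edge to a vertex on the stack
            if state[v] == WHITE and visit(v):
                return True
        state[u] = BLACK
        return False

    return any(state[u] == WHITE and visit(u) for u in adj)

dag = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
cyclic = {"A": ["B"], "B": ["C"], "C": ["A"]}
```

Note that the edge C -> D in the first graph reaches an already finished vertex, which is not a back edge and therefore not a cycle.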

Why DAGs matter

Application	Vertices	Edges
Build systems (Make, Bazel)	Tasks	Dependencies
Package managers (npm, pip)	Packages	Version deps
Data pipelines (Airflow)	Stages	Data flow
Git commit history	Commits	Parent ptrs
Spreadsheet formulas	Cells	Cell refs
Neural network layers	Layers	Data flow

Topological order: A, B, C, D, E

Topological Sort

Produces an ordering of vertices such that for every directed edge (u, v), vertex u appears before v. Only possible on DAGs.

Kahn's algorithm (BFS-based)

1. Compute in-degree for every vertex
2. Enqueue all vertices with in-degree 0
3. While queue is not empty:
   a. Dequeue vertex u, append to result
   b. For each neighbour v of u:
      - Decrement in-degree of v
      - If in-degree becomes 0, enqueue v
4. If result length != |V|, cycle exists

Time: O(V + E) Space: O(V)
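The numbered steps of Kahn's algorithm translate almost line for line into Python. This is a sketch under the assumption that the graph is a dict of vertex -> successor list with every vertex present as a key; the name kahn_toposort is hypothetical.

```python
from collections import deque

def kahn_toposort(adj):
    # Step 1: compute in-degrees
    indeg = {u: 0 for u in adj}
    for u in adj:
        for v in adj[u]:
            indeg[v] += 1
    # Step 2: start with all in-degree-0 vertices
    queue = deque(u for u in adj if indeg[u] == 0)
    order = []
    # Step 3: repeatedly remove a vertex and release its successors
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    # Step 4: leftover vertices mean a cycle
    if len(order) != len(adj):
        raise ValueError("cycle detected")
    return order

dag = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
order = kahn_toposort(dag)
```

If the queue ever holds more than one vertex, the ordering is not unique at that point.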

DFS-based approach

1. Run DFS from each unvisited vertex
2. On finishing a vertex (all descendants
   explored), push it onto a stack
3. Pop the stack to get topological order

Time: O(V + E) Space: O(V)
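The DFS-based variant can be sketched as below. It assumes the input is already known to be a DAG (no cycle check is performed) and that every vertex is a key in the adjacency dict; dfs_toposort is a hypothetical name.

```python
def dfs_toposort(adj):
    visited = set()
    stack = []

    def visit(u):
        visited.add(u)
        for v in adj[u]:
            if v not in visited:
                visit(v)
        stack.append(u)   # u is finished: all descendants are already pushed

    for u in adj:
        if u not in visited:
            visit(u)
    return stack[::-1]    # reverse finishing order

dag = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
order = dfs_toposort(dag)
position = {v: i for i, v in enumerate(order)}
```

The result may differ from a BFS-based ordering of the same graph; both are valid as long as every edge points forward in the list.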

Multiple valid orderings may exist. Kahn's algorithm can be adapted to enumerate all orderings, or to detect a unique one (the queue never holds more than one vertex).

Graph Storage in Databases

Native graph databases

Store vertices and edges as first-class objects. Some engines, notably Neo4j, use index-free adjacency -- each node record points directly to its neighbours.

Neo4j · Neptune · TigerGraph · Memgraph

Relational graph storage

CREATE TABLE vertices (
  id    INTEGER PRIMARY KEY,
  label VARCHAR(64),
  props JSONB
);
CREATE TABLE edges (
  src    INTEGER REFERENCES vertices(id),
  dst    INTEGER REFERENCES vertices(id),
  label  VARCHAR(64),
  weight FLOAT,
  PRIMARY KEY (src, dst, label)
);

Trade-offs

Feature	Native Graph DB	Relational
Multi-hop traversal	Fast (pointer chasing)	Slow (recursive JOINs)
Ad-hoc analytics	Limited	Excellent (SQL)
ACID transactions	Varies	Mature
Ecosystem / tooling	Smaller	Enormous
Schema flexibility	High	Rigid
Aggregation queries	Weak	Strong

Use a native graph DB when your core access pattern is multi-hop traversal. Use relational when you also need joins, aggregation, and mature tooling.
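To make the "recursive JOINs" row concrete, here is a sketch of a multi-hop query over an edge table using a recursive common table expression, run through Python's built-in sqlite3 module. The trimmed-down edges table, the sample rows, and the three-hop limit are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (src INTEGER, dst INTEGER, PRIMARY KEY (src, dst));
    INSERT INTO edges VALUES (1, 2), (2, 3), (3, 4), (1, 5);
""")

# All vertices reachable from a start vertex within max_hops hops.
# Each level of recursion performs one more join against edges.
query = """
    WITH RECURSIVE reach(node, depth) AS (
        SELECT ?, 0
        UNION
        SELECT e.dst, r.depth + 1
        FROM reach AS r JOIN edges AS e ON e.src = r.node
        WHERE r.depth < ?
    )
    SELECT node, MIN(depth) FROM reach GROUP BY node ORDER BY node
"""
start, max_hops = 1, 3
rows = conn.execute(query, (start, max_hops)).fetchall()
```

Each extra hop costs another join over the edge table, whereas a store with direct neighbour references follows pointers from the current frontier instead.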

Property Graphs & RDF

Property graph model

Vertices and edges carry labels and arbitrary key-value properties. Dominant model in industry.

(:Person {name:"Alice", age:30})
  -[:FOLLOWS {since:2023}]->
(:Person {name:"Bob", age:28})

Query languages:

Cypher · Gremlin · GQL (ISO 2024)

RDF (Resource Description Framework)

Data as triples: (subject, predicate, object). Resources are identified by IRIs; objects may also be literal values. W3C standard for the Semantic Web.

:Alice  :follows  :Bob .
:Alice  :age      "30"^^xsd:integer .
:Bob    :worksAt  :Acme .

Query language:

SPARQL

Aspect	Property Graph	RDF
Schema	Optional labels + constraints	Ontologies (OWL, RDFS)
Strength	Application development	Data integration, linked data
Ecosystem	Neo4j, TigerGraph, Memgraph	Wikidata, DBpedia, knowledge graphs

Graph Libraries & Frameworks

Python

NetworkX -- pure Python, rich algorithms, great for prototyping
igraph -- C core with Python bindings, much faster than NetworkX
graph-tool -- C++/Boost core, typically among the fastest options in Python

Java / JVM

JGraphT -- pure Java, extensive algorithms, well-documented
TinkerPop / Gremlin -- vendor-neutral graph traversal framework
Neo4j Java Driver -- native access from JVM

C++

Boost Graph Library -- generic, mostly header-only, STL-style
LEMON -- lightweight with LP solver integration

Distributed / Large-scale

Spark GraphX -- distributed on Spark RDDs
Pregel / Giraph -- vertex-centric BSP model
NVIDIA cuGraph -- GPU-accelerated, CSR-based

Selection guidance

Need	Library
Quick prototyping, <100K edges	NetworkX
Large graphs in Python (>1M edges)	igraph or graph-tool
Production Java application	JGraphT
High-performance C++ pipeline	Boost Graph Library
Billion-edge distributed processing	Spark GraphX / Pregel
GPU-accelerated analytics	NVIDIA cuGraph

Choosing the Right Representation

Scenario	Representation	Reason
Sparse, dynamic graph	Adjacency list (hash map)	O(V+E) space, fast traversal, easy mutation
Dense graph or matrix algebra	Adjacency matrix	O(1) edge lookup, spectral methods
Static analysis, HPC, GPU	CSR / CSC	Cache-friendly, minimal memory
Edge-centric algorithms (Kruskal)	Edge list	Sort by weight, union-find
Enormous state space (AI search)	Implicit graph	Generate on demand, never materialise
Persistent storage with queries	Property graph DB	Index-free adjacency, Cypher/Gremlin
Linked data / knowledge graph	RDF triple store	Standards-based, federated queries
Multiple edge types, same nodes	Multigraph adj. list	Edge IDs distinguish parallel edges

Rule of thumb

Start with an adjacency list. Move to CSR when profiling shows memory or cache bottlenecks. Move to a graph database when you need persistence and multi-hop queries at scale.

Applications of Graphs

Social networks

Vertices = users; edges = follows/friendships. Graph algorithms power friend suggestions (common neighbours), influence scoring (PageRank), and community detection (Louvain).

Road maps & navigation

Intersections = vertices; road segments = weighted edges. Dijkstra, A*, and contraction hierarchies enable real-time routing in Google Maps, Waze, and OSRM.

Dependency graphs

Packages, modules, or build targets as vertices; imports as edges. Topological sort determines build order; cycle detection prevents circular dependencies.

Knowledge graphs

Entities and relationships modelled as a graph -- Google Knowledge Graph, Wikidata. Power semantic search, question answering, and recommendation.

Compiler optimisation

Control flow graphs, data flow graphs, SSA form. Graph analyses drive dead code elimination, register allocation, and instruction scheduling.

Bioinformatics & fraud detection

Protein interaction networks, genome assembly (de Bruijn graphs). Transaction graphs reveal suspicious patterns via subgraph matching in fraud detection.

Summary & Further Reading

Key takeaways

A graph G = (V, E) is a highly flexible data structure -- trees and linked lists are special cases
Adjacency list is the default choice -- O(V + E) space, optimal traversal
Adjacency matrix excels at dense graphs and matrix algebra
CSR format is essential for HPC and GPU graph processing
DAGs underpin build systems, data pipelines, version control, and scheduling
Property graph DBs and RDF triple stores serve different persistence needs
Choose representation based on density, mutation frequency, and algorithm requirements

Recommended reading

Source	Description
Cormen et al.	Introduction to Algorithms (CLRS) -- chapters 22-26
Skiena	The Algorithm Design Manual -- practical graph problem taxonomy
Needham & Hodler	Graph Algorithms (O'Reilly) -- Neo4j and Spark
NetworkX Docs	networkx.org
Boost Graph	boost.org/libs/graph
Neo4j Academy	graphacademy.neo4j.com
