COMPUTER SCIENCE FUNDAMENTALS SERIES

Graph Data
Structures

Adjacency matrix · Adjacency list · CSR format ·
DAGs · Property graphs · Graph libraries
Mid-level software engineer track  ·  20 slides
02

Graph Terminology

A graph G = (V, E) consists of a set of vertices (nodes) and a set of edges (links) connecting them.

TermDefinition
Vertex (node)A fundamental unit -- represents an entity
Edge (link)A connection between two vertices
DegreeNumber of edges incident to a vertex
PathSequence of vertices connected by edges
CycleA path that starts and ends at the same vertex
ConnectedEvery vertex reachable from every other
ComponentA maximal connected subgraph
A B C D E

5 vertices, 6 edges, connected graph

Graphs are the most general-purpose data structure -- trees, linked lists, and arrays are all special cases.

03

Directed & Undirected Graphs

Undirected graphs

Edges have no direction. If (u, v) exists, then (v, u) is implied.

  • Friendships on a social network -- symmetric
  • Road segments allowing traffic in both directions
  • Molecular bonds in chemistry
A B

Directed graphs (digraphs)

Each edge has a source and a target. (u, v) does not imply (v, u).

  • Twitter follows -- Alice follows Bob ≠ Bob follows Alice
  • Web hyperlinks -- page A links to page B
  • Dependencies -- module A depends on module B
A B

Most real-world graphs are directed. Even "undirected" relationships are often stored as two directed edges for implementation simplicity.

04

Weighted Graphs

A weighted graph assigns a numeric value (weight, cost, capacity) to each edge. Weights represent distance, latency, bandwidth, probability, or any domain metric.

Where weights matter

  • Shortest path -- Dijkstra, Bellman-Ford, A* all require edge weights
  • Minimum spanning tree -- Kruskal, Prim select lowest-weight edges
  • Network flow -- edge weights represent capacities
  • Negative weights -- Bellman-Ford handles them; Dijkstra does not

An unweighted graph is a weighted graph where every edge has weight 1.

A B C D E 5 2 3 7 1
05

Adjacency Matrix

For a graph with n vertices, allocate an n x n matrix M. M[i][j] = 1 if edge (i, j) exists, 0 otherwise. For weighted graphs, store the weight.

Vertices: A=0  B=1  C=2  D=3

     A  B  C  D
A [  0  1  1  0 ]
B [  1  0  0  1 ]
C [  1  0  0  1 ]
D [  0  1  1  0 ]

Characteristics

PropertyValue
SpaceO(V2) -- regardless of edge count
Edge lookupO(1) -- direct array access
Add edgeO(1)
Iterate neighboursO(V) -- must scan entire row
Add vertexO(V2) -- reallocate matrix

Best for dense graphs where |E| approaches |V|2. Wastes memory on sparse graphs.

06

Adjacency List

Each vertex stores a list (array, linked list, or hash set) of its neighbours. The most common representation in practice.

A -> [B, C]
B -> [A, D]
C -> [A, D]
D -> [B, C]

Default choice for most graph algorithms. BFS and DFS run in O(V + E) -- optimal.

Characteristics

PropertyValue
SpaceO(V + E) -- proportional to graph size
Edge lookupO(degree) or O(1) with hash set
Add edgeO(1) -- append to list
Iterate neighboursO(degree) -- direct traversal
Add vertexO(1) -- append new list
07

Matrix vs List -- Trade-offs

CriterionAdjacency MatrixAdjacency List
Space complexityO(V2)O(V + E)
Edge existence checkO(1)O(degree)
Iterate all neighboursO(V)O(degree)
Add / remove edgeO(1)O(1) add / O(degree) remove
Dense graph performanceExcellentWasteful pointer overhead
Sparse graph performanceWasteful memoryExcellent
Cache localityGood (contiguous)Poor (pointer chasing)
Matrix operationsNatural (multiply, transpose)Requires conversion

Prefer the matrix

  • Dense graphs (|E| > |V|2/4)
  • Frequent edge-existence queries
  • Matrix algebra (spectral methods, PageRank)
  • Small graphs (under ~1000 vertices)

Prefer the list

  • Sparse graphs (most real-world graphs)
  • Traversal algorithms (BFS, DFS, Dijkstra)
  • Dynamic graphs with frequent mutations
  • Memory-constrained environments
08

Edge List & Incidence Matrix

Edge list

Store edges as a flat list of (source, target, weight) tuples. Simplest possible representation.

[(A,B,5), (A,C,2), (B,D,3),
 (C,D,7), (D,E,1)]
  • Space: O(E)
  • Edge lookup: O(E) -- linear scan
  • Best for: Kruskal's algorithm (sort by weight), input parsing, serialisation

Incidence matrix

An |V| x |E| matrix. Column j has +1 at source, -1 at target (directed), or 1 at both endpoints (undirected).

       e1  e2  e3  e4
  A [   1   1   0   0 ]
  B [   1   0   1   0 ]
  C [   0   1   0   1 ]
  D [   0   0   1   1 ]
  • Space: O(V * E)
  • Use case: theoretical analysis, network flow, circuit modelling

Edge lists are underrated for batch processing. Many distributed graph frameworks (Spark GraphX) use edge-list partitioning internally.

09

Compressed Sparse Row (CSR)

CSR stores a sparse adjacency structure using three flat arrays, eliminating pointer overhead and maximising cache locality.

Graph:  0->[1,3]  1->[2]  2->[3]  3->[]

row_ptr: [0, 2, 3, 4, 4]
col_idx: [1, 3, 2, 3]
values:  [1, 1, 1, 1]
ArraySizePurpose
row_ptrV + 1Start index in col_idx for each vertex
col_idxEConcatenated sorted neighbour indices
valuesEEdge weights (omit for unweighted)

Characteristics

  • Space: O(V + E) with minimal overhead
  • Neighbour iteration: contiguous memory scan -- cache-friendly
  • Edge lookup: binary search within row -- O(log degree)
  • Weakness: static -- inserting edges requires full rebuild

Standard format for high-performance graph libraries (Boost Graph, SuiteSparse, cuGraph) and GPU graph processing. CSC (Compressed Sparse Column) enables efficient in-neighbour access.

10

Implicit Graphs

An implicit graph generates vertices and edges on demand via a function rather than storing them in memory. The graph exists logically but is never fully materialised.

Examples

Grid / lattice

Vertex (r, c) connects to (r+/-1, c) and (r, c+/-1). An n x n grid has n2 vertices but the adjacency structure is never allocated.

State-space search

Puzzle states (Rubik's cube, 15-puzzle) are vertices; legal moves are edges. Billions of nodes but BFS explores only a fraction.

Game trees

Chess positions are vertices; legal moves are edges. ~1047 positions -- fully materialising the tree is impossible.

Neighbour function pattern

def neighbours(state):
    for move in legal_moves(state):
        yield apply(move, state)

# BFS with implicit graph
from collections import deque
def bfs(start, goal):
    queue = deque([start])
    visited = {start}
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nbr in neighbours(node):
            if nbr not in visited:
                visited.add(nbr)
                queue.append(nbr)
    return False

If your graph has more than ~108 vertices, implicit representation combined with BFS, DFS, or A* is the only practical option.

11

Multigraphs & Hypergraphs

Multigraphs

Allow multiple edges (parallel edges) between the same pair of vertices, and optionally self-loops.

  • Flight routes -- multiple airlines between same cities
  • Communication networks -- multiple links between routers
  • Representation: adjacency list with edge IDs
A ==[flight_101]==> B
A ==[flight_202]==> B
A ==[flight_303]==> B

Hypergraphs

A hyperedge connects an arbitrary number of vertices (not just two). Generalises the edge concept.

  • Database schemas -- a table (hyperedge) relates multiple columns
  • Co-authorship -- a paper connects all its authors
  • VLSI design -- a net connects multiple pins

Standard graph algorithms do not apply directly. Common approach: expand to a bipartite graph.

12

Bipartite Graphs

A graph whose vertices can be divided into two disjoint sets U and V such that every edge connects a vertex in U to a vertex in V. No edge within the same set.

Detection: BFS 2-colouring

  1. Pick any unvisited vertex, colour it red
  2. Colour all neighbours blue
  3. Colour their uncoloured neighbours red
  4. If a neighbour already has the same colour -- not bipartite
  5. Repeat for disconnected components

Time complexity: O(V + E). A graph is bipartite iff it contains no odd-length cycle.

U u1 u2 u3 V v1 v2

Applications

  • Job matching -- workers to tasks (Hungarian algorithm)
  • Recommendations -- users to items
  • Scheduling -- time slots to courses
13

DAGs -- Directed Acyclic Graphs

A directed acyclic graph is a directed graph with no cycles. Every DAG has at least one topological ordering.

Key properties

  • At least one vertex with in-degree 0 (source) and one with out-degree 0 (sink)
  • Longest path is finite and computable in O(V + E)
  • Number of paths between two vertices can be exponential
  • Transitive closure and reduction are well-defined

Cycle detection

Run DFS and check for back edges -- an edge to a vertex currently on the recursion stack. If found, the graph has a cycle and is not a DAG.

Why DAGs matter

ApplicationVerticesEdges
Build systems (Make, Bazel)TasksDependencies
Package managers (npm, pip)PackagesVersion deps
Data pipelines (Airflow)StagesData flow
Git commit historyCommitsParent ptrs
Spreadsheet formulasCellsCell refs
Neural network layersLayersData flow
A B C D E

Topological order: A, B, C, D, E

14

Topological Sort

Produces an ordering of vertices such that for every directed edge (u, v), vertex u appears before v. Only possible on DAGs.

Kahn's algorithm (BFS-based)

1. Compute in-degree for every vertex
2. Enqueue all vertices with in-degree 0
3. While queue is not empty:
   a. Dequeue vertex u, append to result
   b. For each neighbour v of u:
      - Decrement in-degree of v
      - If in-degree becomes 0, enqueue v
4. If result length != |V|, cycle exists

Time: O(V + E)   Space: O(V)

DFS-based approach

1. Run DFS from each unvisited vertex
2. On finishing a vertex (all descendants
   explored), push it onto a stack
3. Pop the stack to get topological order

Time: O(V + E)   Space: O(V)

Multiple valid orderings may exist. Kahn's can be adapted to find all orderings or detect a unique one.

15

Graph Storage in Databases

Native graph databases

Store vertices and edges as first-class objects with index-free adjacency -- each node physically points to its neighbours.

Neo4j Neptune TigerGraph Memgraph

Relational graph storage

CREATE TABLE vertices (
  id    INTEGER PRIMARY KEY,
  label VARCHAR(64),
  props JSONB
);
CREATE TABLE edges (
  src    INTEGER REFERENCES vertices(id),
  dst    INTEGER REFERENCES vertices(id),
  label  VARCHAR(64),
  weight FLOAT,
  PRIMARY KEY (src, dst, label)
);

Trade-offs

FeatureNative Graph DBRelational
Multi-hop traversalFast (pointer chasing)Slow (recursive JOINs)
Ad-hoc analyticsLimitedExcellent (SQL)
ACID transactionsVariesMature
Ecosystem / toolingSmallerEnormous
Schema flexibilityHighRigid
Aggregation queriesWeakStrong

Use a native graph DB when your core access pattern is multi-hop traversal. Use relational when you also need joins, aggregation, and mature tooling.

16

Property Graphs & RDF

Property graph model

Vertices and edges carry labels and arbitrary key-value properties. Dominant model in industry.

(:Person {name:"Alice", age:30})
  -[:FOLLOWS {since:2023}]->
(:Person {name:"Bob", age:28})

Query languages:

Cypher Gremlin GQL (ISO 2024)

RDF (Resource Description Framework)

Data as triples: (subject, predicate, object). Every entity is a URI. W3C standard for the Semantic Web.

<:Alice>  <:follows>  <:Bob> .
<:Alice>  <:age>      "30"^^xsd:integer .
<:Bob>    <:worksAt>  <:Acme> .

Query language:

SPARQL
AspectProperty GraphRDF
SchemaOptional labels + constraintsOntologies (OWL, RDFS)
StrengthApplication developmentData integration, linked data
EcosystemNeo4j, TigerGraph, MemgraphWikidata, DBpedia, knowledge graphs
17

Graph Libraries & Frameworks

Python

  • NetworkX -- pure Python, rich algorithms, great for prototyping
  • igraph -- C core with Python bindings, much faster
  • graph-tool -- C++/Boost core, best performance in Python

Java / JVM

  • JGraphT -- pure Java, extensive algorithms, well-documented
  • TinkerPop / Gremlin -- vendor-neutral graph traversal framework
  • Neo4j Java Driver -- native access from JVM

C++

  • Boost Graph Library -- generic, header-only, STL-style
  • LEMON -- lightweight with LP solver integration

Distributed / Large-scale

  • Spark GraphX -- distributed on Spark RDDs
  • Pregel / Giraph -- vertex-centric BSP model
  • NVIDIA cuGraph -- GPU-accelerated, CSR-based

Selection guidance

NeedLibrary
Quick prototyping, <100K edgesNetworkX
Large graphs in Python (>1M edges)igraph or graph-tool
Production Java applicationJGraphT
High-performance C++ pipelineBoost Graph Library
Billion-edge distributed processingSpark GraphX / Pregel
GPU-accelerated analyticsNVIDIA cuGraph
18

Choosing the Right Representation

ScenarioRepresentationReason
Sparse, dynamic graphAdjacency list (hash map)O(V+E) space, fast traversal, easy mutation
Dense graph or matrix algebraAdjacency matrixO(1) edge lookup, spectral methods
Static analysis, HPC, GPUCSR / CSCCache-friendly, minimal memory
Edge-centric algorithms (Kruskal)Edge listSort by weight, union-find
Enormous state space (AI search)Implicit graphGenerate on demand, never materialise
Persistent storage with queriesProperty graph DBIndex-free adjacency, Cypher/Gremlin
Linked data / knowledge graphRDF triple storeStandards-based, federated queries
Multiple edge types, same nodesMultigraph adj. listEdge IDs distinguish parallel edges

Rule of thumb

Start with an adjacency list. Move to CSR when profiling shows memory or cache bottlenecks. Move to a graph database when you need persistence and multi-hop queries at scale.

19

Applications of Graphs

Social networks

Vertices = users; edges = follows/friendships. Graph algorithms power friend suggestions (common neighbours), influence scoring (PageRank), and community detection (Louvain).

Road maps & navigation

Intersections = vertices; road segments = weighted edges. Dijkstra, A*, and contraction hierarchies enable real-time routing in Google Maps, Waze, and OSRM.

Dependency graphs

Packages, modules, or build targets as vertices; imports as edges. Topological sort determines build order; cycle detection prevents circular dependencies.

Knowledge graphs

Entities and relationships modelled as a graph -- Google Knowledge Graph, Wikidata. Power semantic search, question answering, and recommendation.

Compiler optimisation

Control flow graphs, data flow graphs, SSA form. Graph analyses drive dead code elimination, register allocation, and instruction scheduling.

Bioinformatics & fraud detection

Protein interaction networks, genome assembly (de Bruijn graphs). Transaction graphs reveal suspicious patterns via subgraph matching in fraud detection.

20

Summary & Further Reading

Key takeaways

  • A graph G = (V, E) is the most flexible data structure -- trees, lists, and matrices are all special cases
  • Adjacency list is the default choice -- O(V + E) space, optimal traversal
  • Adjacency matrix excels at dense graphs and matrix algebra
  • CSR format is essential for HPC and GPU graph processing
  • DAGs underpin build systems, data pipelines, version control, and scheduling
  • Property graph DBs and RDF triple stores serve different persistence needs
  • Choose representation based on density, mutation frequency, and algorithm requirements

Recommended reading

SourceDescription
Cormen et al.Introduction to Algorithms (CLRS) -- chapters 22-26
SkienaThe Algorithm Design Manual -- practical graph problem taxonomy
Needham & HodlerGraph Algorithms (O'Reilly) -- Neo4j and Spark
NetworkX Docsnetworkx.org
Boost Graphboost.org/libs/graph
Neo4j Academygraphacademy.neo4j.com