Scalability of Isochronous Mesh Networking to 2^40 Switches

When discussing mesh networking, the common refrain is “mesh networking is not scalable”. Here is data and code that indicates it can scale ‘enough’ to support a full-scale global isochronous mesh network of 2^40 nodes with mostly stable links. These data were generated by running a mesh routing algorithm on each node in simulated networks sized from 2^7 to 2^23. Note that this algorithm isn’t the full spec’d IsoGrid algorithm, just a framework to test scalability.

Assumptions

This simulation and analysis makes the following assumptions:

The link layer is isochronous
The network layer is source-routed
Latency per hop is bounded by a constant time that is comparable to the time it takes for light to propagate across a hop
The links (on average) stay up for longer than a day (no mobile nodes)
Nodes join the network in a bootstrapping process inspired by S/Kademlia
The simulation of networks sized from 2^7 to 2^23 is enough to extrapolate likely scalability to 2^40
The global network topology is approximated by placing nodes in a 2D grid pattern and linking each node to the 4 nearest nodes (except at the edges and corners).
Each node has 100Gib/s (or more) total switching capacity
Each bootstrapping message is around 1kB
The cost of network transit is directly proportional to physical distance

It follows from #1 & #2 that latency per hop can be bounded by a constant. Specifically, #3 is reasonable at least until 2^47 nodes (~1 node per square meter of the land surface of the earth). Light takes ~5ns to travel 1m in optical fiber. With 128 bit words switched isochronously at an achievable 100Gib/s, each hop in this full-scale scenario has on the order of 1.2ns extra latency.

#4 reveals an acceptance that mesh networks with highly mobile nodes (actually nodes with highly transient links) aren’t likely to be scalable. But the switches that form the network don’t have to be mobile, just like the core routers of the Internet aren’t mobile.

The current TCP/IP Internet routing table is NOT scalable: Each node must know about all other nodes. #6 means that the following is only likely true but isn’t proven: Isochronous&source-routed mesh networks can scale to 2^40 nodes with conservative assumptions about topology and current technology. If, instead, one assumes a more 3D topology would develop naturally in the full-scale scenario (urban environments, satellites, and rooftop wireless laser hops), then it scales beyond 2^50 nodes.

#7 means that taking a simple bisection of the square grid produces the nodes that have the heaviest bootstrapping data load to switch. In order to be a viable network, this load must be a small portion of the available bandwidth of those nodes. Higher physical bandwidth links and more stable links network wide make this better.

Simulation

The simulation code is shared here for verification and reproducibility purposes. This code is NOT production quality. It’s never been code reviewed. It’s more important to move to the next phase of development rather than spend time cleaning up the simulation code.

The simulation was built to have the following scalability properties, and the data confirms it was achieved:

The “Search Rounds” taken by each node to find its best matched nodes (XOR metric) grows as LOG2(N)
The “Highest Traffic on a Link” grows near N^(1/2)

Data

Raw data was collected by simulating the bootstrapping of various networks from 2^7 nodes to 2^23 nodes. A seed is used to drive a pseudo-random number generator to produce the node IDs. With the same seed, the first three columns are produced deterministically, but with a different seed the numbers move around a bit. In order to determine the growth rates with more accuracy, the simulation was repeated with different seeds (up to 2^21 nodes), and then averaged:

Nodes	Hops Traversed Per node	Centerline crossing messages Per node	Search Rounds Per node	Avg. Mem. Per node (B)	Avg. CPU Per node (ms)	Highest Traffic on a Link (kB)
2^7	354.63	28.70	3.647	4916	1.00	324.65
2^9	797.70	29.89	5.607	7471	1.22	676.36
2^11	1718.72	31.80	7.597	11362	1.75	1439.18
2^13	3673.90	34.64	9.596	13538	2.44	3134.81
2^15	7819.45	37.56	11.596	16310	3.17	6799.43
2^17	16684.50	40.66	13.596	19211	3.99	14720.58
2^19	35614.29	43.87	15.596	21949	5.06	31768.34
2^21*	76090.86	47.35	17.60	#	#	68570.64
2^23*	162771.73	50.98	19.60	#	#	147642.81

*The 2^21 sim required a large system and many hours of compute time, so it was only run 7 times. The 2^23 sim required a very large system and two days of compute time, so it was only run once.
#Memory and CPU time aren’t meaningful when tested near the memory limits of the machine, (due to additional Garbage Collection) which occurred on the 2^21 and 2^23 node sims

Analyzing the effect of doubling the network size on each column, from left to right:

If one normalizes the per-node cost (the number of hops traversed per node) by distance, then the per-node cost growth rate upon doubling the network size is low (<6%). This can be done by assuming the growing number of nodes are spread evenly over a non-growing space: Since the network is a square, this normalized growth rate was computed by multiplying by the square root of the network size.
The number of messages per node that have to cross to the other side of the network is growing at only 3.9%
The search rounds per node are tracking LOG2(N) perfectly
The memory is growing at a low rate of 9%
The CPU time is growing at a low rate of 13%
The last column on the right (in bold) is the scalability bottleneck, growing at a rate of 47%

Download the data

Conclusion

As expected, the main bottleneck for mesh networks is the load on the network attributed to link and node updates. The worst case load would be a full re-bootstrap of the entire network, so this analysis focuses on that scenario. The bootstrap data load of the links that cross the centerline of the network goes up by 47% each time the network size doubles. Extrapolating the above simulations to bootstrapping a (truly epic) 2^40 node network:

Each node would take ~140KB of memory
Each node would consume much less than 1 second of Core i5 CPU time
Each centerline node would only have to switch 50-100GB

The last point seems like a lot of data for a simple bootstrap, but this would be 5-10 years in the future where 10Gb/s links would be the bare minimum, and this data can be transmitted in about a minute. Even if these extrapolations are low by a factor of 100, with the assumptions made above, the network would still perform well.

A full-scale global isochronous mesh network can scale to at least 2^40 switches

Discussion on Hacker News!