2021 Progress Update

I originally built all the SoC-side code using C#, but for the switch kernel this was a mistake due to the soft-realtime requirements of a switch. So I ported the switch kernel to Zig. Zig is so fun! The switch works much better now ūüôā

I made a lot of progress on understanding the ethical considerations of deploying the IsoGrid and we now have a plan to keep the specification and source-code implementation private until ethical norms can be established.

There isn’t likely to be any updates to this site until a minimum viable product is ready.

Scalability of Isochronous Mesh Networking to 2^40 Switches

When discussing mesh networking, the common refrain is “mesh networking is not scalable”. Here is data and code that indicates it can scale ‘enough’ to support a full-scale global isochronous mesh network of 2^40 nodes with mostly stable links. These data were generated by running a mesh routing algorithm on each node in simulated networks sized from 2^7 to 2^23. Note that this algorithm isn’t the full spec’d IsoGrid algorithm, just a framework to¬†test scalability.

Assumptions

This simulation and analysis makes the following assumptions:

  1. The link layer is isochronous
  2. The network layer is source-routed
  3. Latency per hop is bounded by a constant time that is comparable to the time it takes for light to propagate across a hop
  4. The links (on average) stay up for longer than a day (no mobile nodes)
  5. Nodes join the network in a bootstrapping process inspired by S/Kademlia
  6. The simulation of networks sized from 2^7 to 2^23 is enough to extrapolate likely scalability to 2^40
  7. The global network topology is approximated by placing nodes in a 2D grid pattern and linking each node to the 4 nearest nodes (except at the edges and corners).
  8. Each node has 100Gib/s (or more) total switching capacity
  9. Each bootstrapping message is around 1kB
  10. The cost of network transit is directly proportional to physical distance

It follows from #1 & #2 that latency per hop can be bounded by a constant. Specifically, #3 is reasonable at least until 2^47 nodes (~1 node per square meter of the land surface of the earth). Light takes ~5ns to travel 1m in optical fiber. With 128 bit words switched isochronously at an achievable 100Gib/s, each hop in this full-scale scenario has on the order of 1.2ns extra latency.

#4 reveals an acceptance that mesh networks with highly mobile nodes (actually nodes with highly transient links) aren’t likely to be scalable. But the switches that form the network don’t have to be mobile, just like the core routers of the Internet aren’t mobile.

The current TCP/IP Internet routing table is NOT scalable: Each node must know about all other nodes. #6 means that the following is only likely true but isn’t proven: Isochronous&source-routed mesh networks¬†can scale¬†to 2^40 nodes with conservative assumptions about topology and current technology. If, instead, one assumes a more 3D topology would develop naturally in the full-scale scenario (urban environments, satellites, and rooftop wireless laser hops), then it scales beyond 2^50 nodes.

#7 means that taking a simple bisection of the square grid produces the nodes that have the heaviest bootstrapping data load to switch. In order to be a viable network, this load must be a small portion of the available bandwidth of those nodes. Higher physical bandwidth links and more stable links network wide make this better.

Simulation

The simulation code is shared here for verification and reproducibility purposes. This code is NOT production quality. It’s never been code reviewed. It’s more important to move to the next phase of development rather than spend time cleaning up the simulation code.

The simulation was built to have the following scalability properties, and the data confirms it was achieved:

  • The¬†“Search Rounds”¬†taken by each node to find its best matched nodes (XOR metric) grows as LOG2(N)
  • The¬†“Highest Traffic on a Link”¬†grows¬†near N^(1/2)

Data

Raw data was collected by simulating the bootstrapping of various networks from 2^7 nodes to 2^23 nodes. A seed is used to drive a pseudo-random number generator to produce the node IDs. With the same seed, the first three columns are produced deterministically, but with a different seed the numbers move around a bit. In order to determine the growth rates with more accuracy, the simulation was repeated with different seeds (up to 2^21 nodes), and then averaged:

 Nodes Hops Traversed
Per node
Centerline crossing messages
Per node
Search Rounds
Per node
Avg. Mem.
Per node
(B)
Avg. CPU
Per node
(ms)
Highest
Traffic on a Link
(kB)
2^7 354.63 28.70 3.647 4916 1.00 324.65
2^9 797.70 29.89 5.607 7471 1.22 676.36
2^11 1718.72 31.80 7.597 11362 1.75 1439.18
2^13 3673.90 34.64 9.596 13538 2.44 3134.81
2^15 7819.45 37.56 11.596 16310 3.17 6799.43
2^17 16684.50 40.66 13.596 19211 3.99 14720.58
2^19 35614.29 43.87 15.596 21949 5.06 31768.34
2^21* 76090.86 47.35 17.60 # # 68570.64
2^23* 162771.73 50.98 19.60 # # 147642.81

*The 2^21 sim required a large system and many hours of compute time, so it was only run 7 times. The 2^23 sim required a very large system and two days of compute time, so it was only run once.
#Memory and CPU time aren’t meaningful when tested near the memory limits of the machine, (due to¬†additional Garbage Collection) which occurred on the 2^21 and 2^23¬†node sims

Analyzing the effect of doubling the network size on each column, from left to right:

  1. If one normalizes the per-node cost (the number of hops traversed per node) by distance, then the per-node cost growth rate upon doubling the network size is low (<6%). This can be done by assuming the growing number of nodes are spread evenly over a non-growing space: Since the network is a square, this normalized growth rate was computed by multiplying by the square root of the network size.
  2. The number of messages per node that have to cross to the other side of the network is growing at only 3.9%
  3. The search rounds per node are tracking LOG2(N) perfectly
  4. The memory is growing at a low rate of 9%
  5. The CPU time is growing at a low rate of 13%
  6. The last column on the right (in bold) is the scalability bottleneck, growing at a rate of 47%

Download the data

Conclusion

As expected, the main bottleneck for mesh networks is the load on the network attributed to link and node updates. The worst case load would be a full re-bootstrap of the entire network, so this analysis focuses on that scenario. The bootstrap data load of the links that cross the centerline of the network goes up by 47% each time the network size doubles. Extrapolating the above simulations to bootstrapping a (truly epic) 2^40 node network:

  • Each node would take ~140KB of memory
  • Each node would consume much less than 1¬†second of¬†Core i5 CPU time
  • Each¬†centerline node would only have to¬†switch 50-100GB

The last point seems like a lot of data for a simple bootstrap, but this would be 5-10 years in the future where 10Gb/s links would be the bare minimum, and this data can be transmitted in about a minute. Even if these extrapolations are low by a factor of 100, with the assumptions made above, the network would still perform well.

A full-scale global isochronous mesh network can scale to at least 2^40 switches

Discussion on Hacker News!

IsoGrid Design Overview

This post outlines the design for a new network protocol with a mesh topology called the IsoGrid.

Layer 0: Physical Layer

  • Options include, but are not limited to:
    • Ethernet, ATM, USB, etc.
    • Tunnels through other networks (like the TCP/IP Internet, etc.)

Layer 1: Link Layer

  • Defines how nodes directly communicate with each other across links
  • Defines how a ¬ĶPkt can be sent between two nodes
  • Provides for mesh-wide frequency synchronization
  • The IsoGrid does NOT mandate a globally-required protocol at this layer
  • The IsoGrid does impose generic requirements at this layer

Layer 2: Network (¬ĶPkt) Layer

  • Defines the extensibility model for ¬ĶPkt types and ¬ĶRoute types
  • Defines how the network routes ¬ĶPkts across the IsoGrid
  • Defines how nodes send credits in exchange for forwarding ¬ĶPkts

Layer 3: Transport (IsoStream) Layer

  • Defines how the network uses source routing for Isochronous Streams (IsoStream)
  • An IsoStream is switched at the word level, one word at a time
  • Defines how nodes send credits in exchange for switching IsoStreams

Layer 4: Session (EccFlow) Layer

  • Defines how remote nodes use IsoStreams to safely and reliably communicate with each other over an arbitrarily long time period
  • Forward Error Correction coding, safety, and multi-path segmentation
  • Defines how nodes use the network transport to distribute routing information

Layer 5: Application Layer

  • Globally-Scalable mesh mapping and routing using HashMatchLogMap (HMLM)
  • Distributed content addressable storage (CAS)
  • Distributed self-certified naming using GetNodeInfoFromLocatorHash
  • Higher Layers: All the normal protocols you would expect to run on networks

IsoGrid vs. TCP/IP Comparison

Layer TCP/IP IsoGrid
Application SSH, FTP, HTTP, etc.
DNS
BGP
CAS
GetNodeInfoFromLocatorHash
HMLM
Session/Transport TCP, UDP, etc. EccFlow
IsoStream
Network IP ¬ĶPkt
Link Point-To-Point, Ethernet, subnet Broadcast, token ring, ATM, etc. Point-To-Point only, Isochronous USB, ATM, Point-to-point Ethernet.
Physical Any Point-To-Point only: Communication mediums that have collisions aren’t well suited to IsoGrid

How it works

The IsoGrid is a mesh network that supports routing of one-way isochronous streams (IsoStreams) and small packets (¬ĶPkts). In order to provide isochronous streams, the IsoGrid runs with a synchronized frequency, very similar to the way an electrical grid runs on Utility Frequency (except much higher frequency). The source provides a series of route instructions to be used at each hop (Source Routing). Each switch along the way uses its route instruction to establish the IsoStream connection. The ¬ĶPkt that starts a connection defines the length of the IsoStream, and how many credits to send per word. The IsoGrid network and transport layers are optimized to support the IsoGrid session layer (EccFlow) which fragments the data, applies a forward error correction code, and sends the data over many paths across the network.

Credits and Micro-Payments

In order to avoid a tragedy of the commons, the IsoGrid allows for exchanging credits between neighbor nodes so as to pay for the data sent or processed. This allows payments to be made among neighbors instead of having to have a centralized payment processor.

In the IsoGrid model, each node owns its outbound links (but not really its inbound links), its CPU hardware, its storage, etc. Each node SHOULD charge credits for use of those resources. Each pair of neighbor nodes SHOULD agree on the settlement instrument that will represent the value of a credit between themselves, and payment SHOULD be by simple agreement (there is no required third party).

Any settlement instrument is possible, but here are some examples:

  • electricity
  • Etherium
  • cash
  • check
  • ACH transfer
  • propane
  • gasoline
  • gold
  • water

Nodes MAY have different costs for different outbound links.

IsoGrid Secondary Limitations

No network is without limits. The design of a protocol standard necessitates making tradeoffs in order to meet the requirements. Here are some of the known limitations due to the design of the IsoGrid Protocol:

  • Links that use a shared physical medium (ex: those with collisions) aren‚Äôt well suited to be used by the network (too much latency).
    • The IsoGrid is best suited for running on top of physical layers that have exclusive access to the underlying communication medium. Like fiber, copper, and point-to-point wireless such as 60 GHz
  • Highly mobile nodes that want to offer isochronous data transit services might have a harder time competing against stationary nodes (because route tracking across transient links might not be scalable)
  • A single IsoStream can use no more than the fastest available slot along a route
    • However, an endpoint can create multiple connections such that nearly 100% utilization with EccFlow is possible
  • Low-rate connections have higher latency
  • Half-Duplex links not supported (except with unacceptably high latency)
  • Streams through nodes that move will have non-trivial buffering/timing requirements
    • The faster it moves, the more demanding the requirements
  • New Links (for example, link tunnels)¬†could take a (longer latency) three-way handshake (and a credit cost) for remote nodes to be able to effectively use the new links

Discussion is at Hacker News!