The Clean, Green, Immutable Dream

Paul Green · Level Up Coding · May 11, 2021


We consume huge quantities of digital information, but at what environmental cost? Is there a more efficient way to share humanity’s data?

Library with lots of books and repetitive architectural features
Photo by Giammarco on Unsplash

Repetition, Repetition, Repetition

Are we there yet?

No. We just left.

Are we there yet?

No! I said it would be an hour or so and it has been 5 minutes.

Are we there yet?

NO! I’ll tell you when we get there.

Are we there yet?

Give me strength…

Anyone with kids will understand this. They have little or no understanding of time or distance. You can’t explain how long something will take or when it will be ready. They simply lack the tools of knowledge and experience to figure it out themselves. Children are not alone here, though; it is a theme replicated throughout our existence, with complexity contributing at every turn.

How long will it take to get there?

About 3 hours.

How much time do we have left?

The traffic is bad, so still 2 hours.

How far is it?

About 70 miles.

It’s taking longer than 2 hours, why aren’t we there yet?

We got lost, but we know where to go now.

Have we arrived?

Yup!

Ah yes, much better! We have some comprehension of the problem space now! There is an exchange of knowledge, based on an understanding of the speed we may be travelling at, the distance remaining and the vagaries of traffic and locations. There are still lots of questions that need answering though.

How long will it take to get there?

About 3 hours.

Are we there yet?

Ask your brother.

How much time do we have left?

The traffic is bad, so still 2 hours.

Are we there yet?

I’ve just told your brother!

How far is it?

About 70 miles.

Are we there yet?

No, we’re about 70 MILES AWAY!

It’s taking longer than 2 hours, why aren’t we there yet?

We got lost, but we know where to go now.

Are we there yet?

Is there an echo in here? No!

Have we arrived?

Yup!

Are we there yet?

Yes, my little darling — give me a cuddle!

Repetition, repetition, repetition. They say that is how you learn, but my goodness, it can be excruciating! Saying the same thing to different people, over and over again, is exhausting. It wastes time and energy. Gnawing your own arm off may seem more appealing at times (even though we do love the cheeky monkeys!).

Man screaming
Photo by Dmitry Vechorko on Unsplash

What has any of this got to do with clean or green energy? It’s looking tenuous so far, I agree. It isn’t about self-driving cars, although that would be some mercy! It isn’t about Alexa fielding questions from the underlings either. No, this is about how to deal with repeated requests for the same data. I bet you didn’t see that one coming!

Let There Be Caching!

A common solution to the same questions being asked over and over is to get something (or someone!) to repeat it for you. It saves you the energy of having to dream up an answer and vocalise it. Maybe I do need to get Alexa in my car…

Caches are great. They usually return the conclusion of a complex question, without having to think about said complexities over and over again. However, they have their limitations, which usually pivot around how often the answer may change depending on when the question is asked.

Asking if we are there yet has a different answer each time. Even asking how far away we are, or what the distance remaining is, changes over time. This requires a calculation to be completed in order to return a valid answer. When the same question is asked moments later, though, the answer can often remain the same, which is where caches are brilliant. Caches love repetition. Caches love repetition. Caches love repetition. Ok, I’ll stop.

Let There Be Immutable Data!

Given caching is great when we know the answer is unchanged, this raises the question: if we can guarantee the answer is unchanged, can we cache it forever? Yes, we can! Given we know all the inputs and the same algorithm is used, we know what the output will be. It is predictable and reusable.

We know that on a given day, when the car is moving at a similar speed, the traffic is similar and we have been travelling for a certain time, the answer will be the same (no, we are NOT there yet!). With computer systems, what ‘similar’ is can be defined and constrained to suit. Sometimes the inputs will be exact anyway. The point is, we know the answer to the question, because it has been asked before.
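To make that concrete, here is a minimal sketch in Python. The function name and inputs are hypothetical; the point is that a deterministic question can be memoised once and answered from cache forever:

```python
from functools import cache

# A deterministic question: the same inputs always produce the same
# answer, so the result can safely be cached indefinitely.
@cache
def miles_remaining(total_miles: int, miles_travelled: int) -> int:
    return total_miles - miles_travelled

print(miles_remaining(70, 20))  # calculated once...
print(miles_remaining(70, 20))  # ...then answered from the cache forever
```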

But Things Change!

Change happens, but if we know the inputs and the algorithm, we know what the output will be. Within an application, we can manage this. We know what we can cache and for how long. We base this on how frequently something may change and how critical it is for the correct answer to be given. This sometimes impacts the internal application design, but it is frequently used to tell downstream applications how long they can use the data before it becomes stale or invalid. This is often in the form of a TTL (Time To Live).

The concept of TTL is easy to grasp. If some data isn’t expected to change frequently, a long TTL can be used. Imagine dictionary definitions of words: they are seldom changed and the TTL could be measured in months or years quite safely. At the other extreme, your favourite news feed will change frequently and would need a very short TTL, perhaps minutes or seconds. There are more extreme cases in both directions and many in the middle too. The choice of TTL is critical to performance, scalability and usability.
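As a rough sketch of how those choices might look in code (the `TTLCache` class is illustrative and the TTL values are assumptions, not recommendations):

```python
import time

class TTLCache:
    """A minimal TTL cache: entries silently expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (value, expiry timestamp)

    def set(self, key, value):
        self.entries[key] = (value, time.monotonic() + self.ttl)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self.entries[key]  # stale: evict, forcing a fresh fetch
            return None
        return value

# A long TTL for slow-moving data, a short one for fast-moving data.
definitions = TTLCache(ttl_seconds=30 * 24 * 60 * 60)  # roughly a month
headlines = TTLCache(ttl_seconds=30)                   # half a minute
```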

But what is change? In terms of data, we frequently think of it as scrubbing the old and replacing it with the new. This is what we observe in the physical world. When we repaint a wall from red to blue, a blue wall is all that remains. The red wall has gone. It has been replaced. It is easy for us to visualise this, which, when combined with storage scarcity, also feels like the most efficient way to deal with change. But is this the best way to work with data changes?

Open pots of paint with brushes
Photo by russn_fckr on Unsplash

Changes Can Be Immutable

Storage is cheap. Really cheap. Keeping all iterations of a record is therefore also cheap. There are exceptions to this rule with complex data sets, but for most data types — text scripts and markup especially — storing iterations is a small cost, especially if iterations are relatively infrequent. For less suitable data types, such as images and videos, iterations are also likely to be more rare (I’d love to see some statistics on this!).

Cost is also relative. Bandwidth is more expensive than storage. Calculations are more expensive than storage (otherwise you would simply recalculate rather than store). What may seem like a saving in the small can be a big cost in the large. Saving a few kilobytes of disk space by overwriting mutable data may end up costing much more to retransmit or recalculate that data when it goes stale.

Given immutable data can have an infinite TTL, there are obvious gains. Each change can be cached for as long as desired. If some sort of LFRU (Least Frequent Recently Used) cache is used, old and unused stuff will drop out of the cache, making way for new and popular stuff.
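Python’s standard library offers LRU (rather than LFRU) eviction out of the box, which is close enough in spirit for a sketch; `fetch_chunk` is a hypothetical stand-in for a network retrieval:

```python
from functools import lru_cache

# With immutable data there is nothing to invalidate, so the only limit
# is cache capacity; a bounded LRU cache lets unused entries drop out
# to make way for new and popular ones.
@lru_cache(maxsize=1024)
def fetch_chunk(fingerprint: str) -> bytes:
    # Placeholder for an expensive network fetch. Because the data is
    # immutable, a cached result never becomes stale.
    return b"chunk bytes for " + fingerprint.encode()
```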

With a big enough cache, offline operation can even be achieved. When you can guarantee that something won’t change, that article you read last week is certain to be the same. The book you started yesterday will still be identical. The video you watched before reading this will be there to watch again, unchanged and in your cache. Or in the cache of the host you requested it from or a proxy in-between.

What if an article has changed since you last read it? If the original was immutable, it should still be available. This opens the doorway to self-archiving applications and websites, where consumers can go back in time and read the original. Think what that could do to counter censorship. Imagine how easily we could preserve today’s information, to learn from it in the future.

Mixing Mutable And Immutable

At the protocol level, the WWW doesn’t guarantee that a response is immutable. While there are non-standard Cache-Control headers¹ to indicate something should be assumed to be immutable, they can’t make it true. Given that you are routed to a host which can change the underlying data at will, how could they? There are mechanisms which let a client application check whether data has changed, but none that guarantee it won’t change.
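For reference, the hint looks something like this when set on an HTTP response (shown here as a plain Python dict; the `immutable` directive is described in RFC 8246, but it remains a promise rather than a guarantee):

```python
# Headers a server might emit for content it promises never to change.
# Clients are told not to revalidate within max-age, but nothing stops
# the host from serving different bytes tomorrow.
response_headers = {
    "Cache-Control": "public, max-age=31536000, immutable",  # one year
    "ETag": '"a1b2c3"',  # lets clients check for changes, guarantees none
}
```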

What we can do is hash data to generate a unique fingerprint for a data item. It is also possible to use this as a way to address data and confirm that it has not changed. Taking things further, we can prevent hosts from manipulating this data unilaterally. I’ve written about the Safe Network² previously and this is exactly what it is attempting to provide: access to immutable data which is guaranteed not to have changed and which, consequently, can be cached forever.
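A minimal sketch of such a fingerprint, using SHA-256 as a stand-in for whatever hash a real network would choose:

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Content address: the same bytes always hash to the same name."""
    return hashlib.sha256(data).hexdigest()

article = b"The Clean, Green, Immutable Dream"
address = fingerprint(article)

# Anyone holding the bytes can verify they match the address they were
# requested by, so a cached copy is as trustworthy as the original.
assert fingerprint(article) == address
```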

By breaking data items down into immutable chunks, addressed by their unique fingerprints, each chunk can be cached for as long as needed. Moreover, storing a data item with common chunks allows those chunks to be re-used. They are de-duplicated: as they are known to be identical, there is no need to store them again. If a client has already cached a chunk for a different data item, they already have it too. This saves storage, bandwidth and compute time.
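Here is one way to sketch that chunk store in Python. Fixed-size chunks keep the example simple (a real system might pick chunk boundaries more cleverly), and `put`/`get` are hypothetical names:

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # 1 MiB; an arbitrary size for illustration
store: dict[str, bytes] = {}  # fingerprint -> chunk, shared by all items

def put(data: bytes) -> list[str]:
    """Split data into chunks; identical chunks are stored only once."""
    addresses = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        address = hashlib.sha256(chunk).hexdigest()
        store.setdefault(address, chunk)  # de-duplication: skip known chunks
        addresses.append(address)
    return addresses

def get(addresses: list[str]) -> bytes:
    """Reassemble a data item from its list of chunk addresses."""
    return b"".join(store[address] for address in addresses)
```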

Woven fabric
Photo by Kate McLean on Unsplash

Of course, there needs to be a way to pull these immutable chunks together. Websites and news feeds change. If these applications served only immutable data, they would be frozen in time and of little use. The application design needs to be able to pull immutable data together in a mutable way. By having mutable indexes, we can achieve this.

How Would Applications Be Designed?

What would practical designs look like? For something like a blog or a news feed, I would envisage a mutable pointer to an immutable index and immutable data items. When the blog or feed is requested, the address of the latest index is returned and the data items can be retrieved. When new data is added as an immutable item, the index is replaced with a new immutable index and the mutable pointer is updated to point to it.

Given the mutable pointer would be a small data item, retrieving it would be simple. It would be tiny and trivial for the network to provide. Once retrieved, the index and the data items are all immutable and would be cached for as long as needed. You could also specify a previous version of the index to see an older snapshot, which you may have already downloaded and cached, potentially offline.
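A toy version of that design, with dictionaries standing in for the network (all names here, `publish_post` included, are hypothetical):

```python
import hashlib
import json

chunks: dict[str, bytes] = {}   # immutable, content-addressed storage
pointers: dict[str, str] = {}   # the only mutable piece: name -> index address

def put_immutable(data: bytes) -> str:
    """Store data at its fingerprint and return the address."""
    address = hashlib.sha256(data).hexdigest()
    chunks[address] = data
    return address

def publish_post(blog: str, post: bytes) -> None:
    post_address = put_immutable(post)
    old_index = pointers.get(blog)
    items = json.loads(chunks[old_index]) if old_index else []
    items.append(post_address)
    # A brand-new immutable index replaces the old one (which remains
    # addressable, giving older snapshots for free)...
    new_index = put_immutable(json.dumps(items).encode())
    # ...and the tiny mutable pointer is the only thing updated in place.
    pointers[blog] = new_index
```

Fetching the blog is then a single small pointer lookup, followed entirely by immutable reads that can be served from any cache.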

In this sort of application a tiny proportion is mutable and only exists to point to the immutable data that is needed. Given many people may access the blog or news feed, caching through the network and on individual clients would distribute the load and minimise duplicated effort.

A Clean, Green, Silver Bullet?

While immutable data still requires energy to store and retrieve, being able to cache data forever brings huge benefits with it. In many ways, it provides similar gains to broadcasting, but with the flexibility for the consumer to choose when to receive it. It may also reduce complexity, with fewer moving parts, fewer timing issues and more consistency.

Could it help to reduce energy wasted on repeating things over and over? Could it help to reduce energy wasted on repeating things over and over? Sorry, I couldn’t resist… but I believe the answer could be yes.

Can common websites and applications be updated to use an immutable web application architecture³? Do we require a more fundamental switch to have a common data layer akin to what the Safe Network is seeking to achieve? Or are dynamic server driven applications an inevitability, along with the associated repetition and complexity they require? I’d love to hear your thoughts.
