I’m sure that you’ve heard the adage that the “Internet is forever” or “the Internet never forgets.” But as it turns out, that’s not quite true—unless, of course, you’re a celebrity who said the wrong thing on MySpace 15 years ago. Then, it sticks around forever.
According to Pew Research Center, 1 in 4 web pages that existed between 2013 and 2023 can no longer be accessed. That includes links to news stories, government pages (how will we track policy changes now?), and so on.
Vain human that I am, I Google myself from time to time, mostly to ensure that nothing overly strange or controversial shows up. Media outlets have a habit of randomly quoting my posts on social media. In the process, I discovered that most of my journalistic work has vanished from the Internet. Or rather, it is no longer being indexed by Google, which isn’t so different from vanishing. Even if those published articles still exist, they cannot be found.
That discovery was alarming to me.
While it’s understandable that some websites will eventually have the plug pulled (like your old GeoCities masterpiece and ode to puzzle games), I worry that we are losing valuable data too: factual, cultural, and historical.
According to Pew, 38% of pages from 2013 are inaccessible today. More recent content is inaccessible at a lower rate, but it’s still high enough to be concerning. Clearly, the Internet is indeed…forgetting.
In the past, libraries served as repositories that documented our lives, including keeping records of past articles on microfilm and in encyclopedias. But as we’ve gone digital, the amount of data has grown vast: hundreds of billions of pages.
On the one hand, the Internet doesn’t require quite so much physical space to store data, but on the other, there’s no central entity tasked with safeguarding it.
Google has always had an algorithm that prioritized some pages over others. It was a necessity for surfacing quality content and ensuring the end user didn’t have to sort through thousands or even millions of pages. But on what basis did they “disappear” content that had previously shown up at the top? Age? If so, then it’s a problem when we can no longer refer back to potentially pertinent information. The publications I had written for still exist.
But perhaps it’s not entirely Google’s decision. Simon Brew writes that some of this content is being scrapped to improve standing with Google’s search algorithms, which favor faster loading times and user experience. Older content often falls by the wayside, especially as publications update their content management systems, orphaning past articles that are no longer compatible and choosing not to invest the resources needed to preserve and transfer them over.
There are, of course, organizations that are attempting to archive as much as possible. Among them is the Internet Archive (archive.org), whose Wayback Machine has kept snapshots of websites on particular dates since 1996. However, it is a non-profit with limited resources, it can only capture data on occasional dates, and we still end up losing a lot. It also requires a lot more digging to find old content. And it was recently hacked: attackers defaced the site, temporarily shut it down, and exposed the data of 31 million users.
The Internet Archive’s Wayback Machine preserves over 900 billion webpages, some of which have even been cited in court as evidence. Fortunately, the archived data itself was left intact by the hackers, but things could have been worse.
A coordinated effort to ensure the preservation of digital history is needed. We must build more fortified systems to protect digital data, just as we preserve libraries and their archives. Perhaps not everything needs the same level of protection, but certainly important documents, government communications, and articles do. It’s essential when it comes to protecting our collective memory.
Then again, in the Stone Age, when cavemen carved out their drawings, perhaps they thought they’d last forever too. Some have, but many more are gone. And like those ancient carvings, the digital traces we leave today may one day crumble into obscurity, lost to the tides of technological change.
The Internet’s memory, too, is fragile. There’s too much content to “never forget.” That is, unless it’s an embarrassing tweet.
I would not blame companies like Google here. Search engine algorithms prioritize the order of results but should not, in general, remove results. What I have seen is that real sources of information are disappearing too: see how many discussion boards lock threads just for being too old, despite containing useful and valid information. Some sites disappear just because the maintainer no longer has the resources to keep them running. Recently a merger of two retailers wiped out two decades of quality information on computer products, just because the new owner did not want to run the site anymore and was not interested in preserving or archiving it. All of this means the information eventually becomes unretrievable, sooner or later.
My other concern is the call to forcibly demonopolize Google as a search engine just because it is too big. See what happened, and is still happening, in the area of TV streaming: if you like a certain genre and would like to see “all” the shows, no single provider offers them, so you have to subscribe to several. And, because of mergers and licensing, content is disappearing there as well. I can see an analogy here, especially when useful content is, or will become, paid.
Honestly, it feels good but also weird to read posts like this, by people who have been writing online for years, only in 2024. But OK, better late than never.
Because this has been a well-known problem for years. For many more details about both the causes and the fixes, check out these posts from me, and the links they contain:
https://mfioretti.substack.com/p/ever-wondered-why-we-are-in-a-digital
https://mfioretti.substack.com/p/google-sucks-snow-white-is-woke-in
And, specifically about professionals who ignore how fragile their online presence is:
https://mfioretti.substack.com/p/what-we-should-all-learn-by-a-famous