Library Of Babel
I have a lot of data. Not as much as some of the folks at /r/DataHoarder,
but a lot nonetheless. Much of it comes from backups of old laptops, desktops, and phones over the years, and many of those backups contain copies or even revisions of the same files. So I need to organize that data. I'd also like to be able to archive and preserve some of it in its original form, since many of these backups represent snapshots in time, and that contextual information is lost if they are all merged into one another.
So the problem is to index everything that exists and identify what is unique and what isn't, so I can know whether it's "safe" to remove. This is especially useful if, say, I want to repurpose an old hard drive that still has some old files on it. How can I easily tell if I have them archived already?
Enter the Library of Babel: a database of file checksums and the URLs/metadata associated with those checksums. The goal is to be able to scan an old hard drive or file store and know how many of its files are unique to where they are now, and where else they can be found. Ideally it would come up with an answer like "this location's contents are found at this other location with a 98% binary match," where the other location could be a compressed tar that had been previously created and scanned into the database.
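To make the idea concrete, here is a minimal sketch of that checksum database using SHA-256 and SQLite. The table layout, function names, and the `path::member` convention for files inside a tar are all my own hypothetical choices for illustration, not the actual implementation:

```python
import hashlib
import os
import sqlite3
import tarfile

def file_checksum(path, chunk_size=65536):
    """SHA-256 of a file, read in chunks so large files never have to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def scan_into_db(conn, root):
    """Walk `root` and record a (checksum, location) row for every regular file."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.abspath(os.path.join(dirpath, name))
            conn.execute(
                "INSERT OR IGNORE INTO files (checksum, location) VALUES (?, ?)",
                (file_checksum(path), path),
            )
    conn.commit()

def scan_tar_into_db(conn, tar_path):
    """Checksum the members inside a tar, so a previously created archive
    counts as a location too (hypothetical 'tar.tar::member' naming)."""
    with tarfile.open(tar_path, "r:*") as tf:
        for member in tf:
            if not member.isfile():
                continue
            h = hashlib.sha256()
            f = tf.extractfile(member)
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
            conn.execute(
                "INSERT OR IGNORE INTO files (checksum, location) VALUES (?, ?)",
                (h.hexdigest(), os.path.abspath(tar_path) + "::" + member.name),
            )
    conn.commit()

def copies_elsewhere(conn, root):
    """For each file under `root`, count how many other known locations
    hold byte-identical content. A count of 0 means the file is unique."""
    prefix = os.path.abspath(root) + os.sep
    return conn.execute(
        """SELECT f.location, COUNT(o.location)
           FROM files f
           LEFT JOIN files o
             ON o.checksum = f.checksum AND o.location != f.location
           WHERE f.location LIKE ? || '%'
           GROUP BY f.location""",
        (prefix,),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files (checksum TEXT, location TEXT, UNIQUE(checksum, location))"
)
```

Note that exact checksums only answer "is this file byte-identical to one I already have?" The fuzzier "98% binary match" between whole locations would need something like chunk-level or rolling hashes, which this sketch doesn't attempt.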