I guess https://web.archive.org/cdx/search/cdx?url=*.blogspot.com will
give you many more names.
And there is also CommonCrawl.

On 11/26/23, devnoname120 <devnoname120@gmail.com> wrote:
Hello,
You may have heard that Google is going to delete a massive number of blogs
from Blogspot and Blogger that it deems
“inactive”: news.ycombinator.com/item?id=38411415
ArchiveTeam, an archiving project best known for backing up Reddit
and Imgur, is currently archiving the blogs at scale before they
effectively disappear from the internet. One of the challenges is collecting
the hostnames of the available blogs in order to archive them. They use
various sources, but unfortunately none is comprehensive (not even close).
I noticed that archive.is knows a huge number of these blog hostnames even
if most of the time just a single page is archived.
Could you send us a list of the subdomains that archive.is is aware of?
This would be massively helpful.
Blogspot and Blogger have a huge number of subdomains, so here is the RegEx
that we recommend for filtering lists:
\S+\.(blogspot|blogger)\.\S+
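For anyone applying this filter to their own list, here is a minimal Python sketch. The function name `filter_blog_hostnames` and the sample data are illustrative assumptions, not part of any ArchiveTeam tooling. It assumes one hostname per input entry and uses a full match, so an entry must match the pattern from start to end.

```python
import re

# The pattern quoted in the message above, compiled once.
BLOG_HOST_RE = re.compile(r"\S+\.(blogspot|blogger)\.\S+")


def filter_blog_hostnames(candidates):
    """Return the sorted, de-duplicated, lowercased entries that match.

    Hypothetical helper: each entry is assumed to be a bare hostname.
    """
    matches = set()
    for raw in candidates:
        host = raw.strip().lower()
        # fullmatch requires the whole entry to fit the pattern,
        # so a bare "blogspot.com" with no subdomain is rejected.
        if host and BLOG_HOST_RE.fullmatch(host):
            matches.add(host)
    return sorted(matches)


# Made-up sample input for illustration only.
sample = [
    "example.blogspot.com",
    "Example.Blogspot.com",
    "another.blogspot.co.uk",
    "news.ycombinator.com",
    "blogspot.com",
]
print(filter_blog_hostnames(sample))
```

Note that the pattern is deliberately loose: it accepts any suffix after `blogspot.` or `blogger.`, so country-code variants are kept, but so would an unrelated domain that merely contains one of those labels.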
You can just send a list of matching hostnames to me, and I will take care
of queuing up the URLs. I will of course credit you as a source unless you
would prefer to remain anonymous.
Your help would be incredibly valuable to preserve these troves of knowledge
and the history of the internet. 🙏
P.S. If you happen to have spare server capacity and are willing to donate
some CPU time, you could additionally help us scrape everything by running
these two Docker
commands: github.com/Archi...-users
Thanks a lot,
devnoname120
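
The reply at the top of this thread points at the Wayback Machine CDX endpoint as a richer source of hostnames. A rough Python sketch of that idea follows. The helper names `build_cdx_url` and `hostnames_from_cdx_lines` are hypothetical, and the choice of the `fl`, `collapse`, and `limit` query parameters is an assumption about what a useful query looks like, not something stated in the thread. The sketch only builds the query URL and parses response lines; fetching the (potentially very large) result is left out.

```python
from urllib.parse import urlencode, urlsplit

# Endpoint taken from the reply in this thread.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(url_pattern, limit=1000):
    """Build a CDX query URL for a pattern such as "*.blogspot.com".

    Assumed parameter choices: return only the original URL field,
    collapse repeated captures of the same URL, and cap the row count.
    """
    params = {
        "url": url_pattern,
        "fl": "original",
        "collapse": "urlkey",
        "limit": limit,
    }
    # Keep "*" unescaped so the URL reads like the one in the reply.
    return f"{CDX_ENDPOINT}?{urlencode(params, safe='*')}"


def hostnames_from_cdx_lines(lines):
    """Extract sorted, unique hostnames from lines holding one URL each."""
    hosts = set()
    for line in lines:
        url = line.strip()
        if not url:
            continue
        # hostname is lowercased and stripped of any port; it is None
        # when the line has no scheme and authority to parse.
        host = urlsplit(url).hostname
        if host:
            hosts.add(host)
    return sorted(hosts)


print(build_cdx_url("*.blogspot.com", limit=5))
```

The extracted hostnames could then be passed through the RegEx filter from the original message before being queued.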