While I was helping a friend with his web site, we made small talk about certain information he had displayed that he wished he had never made public. While removing it from his site was the first order of business, I also presented the idea of making sure the old versions were not living anywhere else on the Internet…namely, at the Internet Archive.
For those unaware, the Internet Archive is an archive of a large portion of the World Wide Web from across the years. It is accessed at http://archive.org, and you use their Wayback Machine with a site’s original URL to locate snapshots of that site from months and years past. Granted, much of the archived material is broken (images are often not available, and anything like scripting does not make the translation), but there is usually enough meat there to find what you are looking for.
To block the Internet Archive from spidering a site and taking snapshots, you simply upload a robots.txt file to the root of your site that disallows access to selected directories or to the entire site. The following blocks the entire site:
User-agent: ia_archiver
Disallow: /
Simple! Your site will no longer be spidered, and any existing archive material will not be displayed either. (It is unknown whether or not they delete what is there, or simply block it.)
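If you want to double-check a robots.txt file before uploading it, here is a minimal sketch using Python's standard `urllib.robotparser` module. The `ia_archiver` rule is the one from this post; the helper name `is_allowed`, the `example.com` URLs, the `SomeOtherBot` agent, and the `/guestbook/` directory are made up for illustration. It only shows how a standards-following parser reads the rules, not what the Archive itself does with them.

```python
from urllib.robotparser import RobotFileParser

# The whole-site rule from this post.
BLOCK_SITE = """\
User-agent: ia_archiver
Disallow: /
"""

# Hypothetical variant that blocks a single directory instead.
BLOCK_DIRECTORY = """\
User-agent: ia_archiver
Disallow: /guestbook/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.com/guestbook/1995.html"
    home = "http://example.com/index.html"

    # Whole-site rule: ia_archiver is blocked, other crawlers are unaffected.
    print(is_allowed(BLOCK_SITE, "ia_archiver", page))
    print(is_allowed(BLOCK_SITE, "SomeOtherBot", page))

    # Directory rule: only paths under /guestbook/ are blocked for ia_archiver.
    print(is_allowed(BLOCK_DIRECTORY, "ia_archiver", page))
    print(is_allowed(BLOCK_DIRECTORY, "ia_archiver", home))
```

Note that the rule names one crawler only. Search engines and other well-behaved spiders are not affected unless you add entries for them as well.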
What really got under my skin, though, was the attitude of some commenters about robots.txt. Some complain about us “evil webmasters” destroying the archives of the Internet one site at a time. Some go even further, as though they have a right to anything archived. Here is a quote from one such post:
Honestly, half of the internet is being missed because some honest do-gooder decided that the garbage that is robots.txt should be followed by this archival service. This needs to stop.
The wayback machine is exempt from copyright issues under fair use doctrine and due to its educational purpose.
Please stop ignoring website because of ignorant, uninformed, or possessive webmasters.
This seems to be typical of Internet mentality today. What’s yours is mine, and I have every right to use it for any purpose whatsoever. Oh, and all the hate spewed through insults? Nice touch. Comments like this simply come across as being stupid.
I won’t get into the gross cluelessness of this particular person about the function that robots.txt serves, nor his/her unabashed hatred towards us “ignorant, uninformed [and] possessive” publishers. But I will touch on other issues regarding the Archive itself. Here are FIVE arguments I have against the Internet Archive.
- Site owners and operators create sites to dispense timely, relevant information. It is up to us to control how it is presented, and where. We frequently update or remove content for valid reasons, especially when information becomes outdated and stale, or is deemed to be inaccurate or irrelevant. We crack down on unauthorized usage while at the same time enabling proper channels so the content can be shared properly. Our information is not a free-for-all, and our sites are presented in the here-and-now, timely and relevant. Allowing past copies of our sites does more harm than good; we have our own “archives” which visitors are more than welcome to browse…on our terms, not a third party’s.
- Some Internet users, especially those who *cough* “borrow” *cough* content from other sites, automatically hide behind the concept of “fair use,” claiming that such usage is for “educational” purposes. They clearly have no understanding of what “fair use” actually means, and use it as a broad paintbrush to cover their unauthorized activities. Their line of thinking is, “I’m not making money from it, therefore it is ‘fair use’.” The Internet Archive really does nothing more than perpetuate this misguided thinking, and provides an endless source for such content.
- I feel that the Internet Archive should not be exempt from copyright laws. Of all the sites I have created or maintained over the years, I have never once given them explicit permission to use my content, nor has anyone else to my knowledge. Some content I may donate if asked, and be glad to do so. Otherwise, no.
- There are perfectly valid security and privacy reasons a page, directory or entire site may need to be removed from the archive. Many busy site publishers may have created dozens of sites over the years, and (present company included) it is possible to overlook some of the earliest.
I found an archive page from one of my sites that was chock full of email addresses, personal names, and even some street addresses, all part of an Internet guestbook I hosted there. Do I care if this is public? Yes. Not so much for myself, but for the visitors who trusted me to keep their information secure. Back in 1995, we never had the concerns we do now about privacy and security. Yes, we were all naive. But my visitors applauded my reasoning behind deleting the page, and it is not fair to them to have that same page perpetually and publicly stored without either their consent or mine.
- Finally, I feel that the Wayback Machine should be strictly an opt-in service, not an entity that simply grabs anything it finds and stores it. I did not give permission for any of my copyrighted work to be accessed and stored elsewhere. I am surprised they have not yet been sued. There really is no reason for archiving the entire Internet if you think about it. Most of it is outdated, and most archived pages and sites are broken anyway.
While the Internet Archive itself may be a worthy cause for archiving some types of content, many of us out here would appreciate it if the Wayback Machine would simply go away. Or at the very least, let us opt in to be archived, as opposed to having our content taken from us.