Archiving the Forums?

Posted: 09 Aug 2016, 21:40
by jlev
Hi duerig and all other site admins,

First, thank you so very much for your ongoing work running the site. This website, and the forums in particular, are, in my mind, one of the lasting gems of the internet. That's not only for the current value of the discussions happening here, but also for the historical value.

I'm wondering whether any ongoing work is being done to archive the materials on this site in a lasting way. For example, beyond daily or weekly backups that you might have as site admins, is there any ongoing process in place to put copies of the site materials in the hands of long-term archivists (for example, through submitting periodic database dumps to the Internet Archive, or through spidering the site and turning it into a locally-viewable HTML zip file / tarball)?

I see this being useful for two reasons: first, in case some sort of catastrophe happens in which all of the servers crash at once, or the domain name gets hijacked somehow. Second, to enable research on what's happened here (I do a lot of text mining in my research, for example, and could see other researchers wanting to know how communication networks here have looked over time, what phrases come up most frequently, etc.).

If there isn't a process in place, but you'd be open to the idea, I'd be really happy to help with this. I'm currently finishing a doctorate in psychology and will thereafter be starting work at a university library, and so could also facilitate working with university institutional repositories, if that would be preferable to using the Internet Archive's services.

Thanks for your consideration!

-Jacob

Re: Archiving the Forums?

Posted: 13 Aug 2016, 09:24
by dtic
Not an admin. Aren't the automatic internet archive wayback machine snapshots sufficient? https://web.archive.org/web/*/http://ww ... org/forum/

Re: Archiving the Forums?

Posted: 13 Aug 2016, 23:07
by jlev
I would say not, for two reasons. First, they don't catch everything / they don't always spider the entire site. Second, if the admins are amenable to enabling research on the site, it would be much more straightforward to just make everything accessible directly (this could be automated, too -- there wouldn't need to be constant human action), either through partial database dumps (no user email addresses or passwords, etc.), or through an already-spidered version with all links turned into relative links (such that the site could be viewed in an offline mode on someone's local machine); I suspect that the Wayback Machine is even more difficult to spider, especially if posts include links that resolve back to this domain name (vs. the Wayback Machine's cached versions). That's my thought, but hearing input like yours, dtic, is also why I started the conversation :)

On another note, I've really admired your work here on the forums!

Re: Archiving the Forums?

Posted: 14 Aug 2016, 23:03
by duerig
I help Scann manage the forums. If somebody was interested in getting periodic backups of the current database (or just the post tables), I think we could pass it along pretty easily. The final say is down to Scann, though. And there would have to be somebody who wanted the backups. :-)

-D

Re: Archiving the Forums?

Posted: 15 Aug 2016, 20:12
by jlev
Hi, duerig! Ha, in that case, I'd like a copy! :)

Re: Archiving the Forums?

Posted: 13 Nov 2017, 16:09
by daniel_reetz
This is an important topic - and one dear to my heart.

One important consideration: we need to be careful about backing up or sharing the databases directly, because people's private messages and other private data are in there.

jlev, did this discussion ever go further? Do you want to help out a bit with our archiving for the long term?

Re: Archiving the Forums?

Posted: 14 Nov 2017, 11:13
by duerig
There was some followup in email. Basically, we need a script to pull out the public information from the database periodically and save it off. This is different from a normal backup because we only want to pull some information from the database. I talked about this some with Jacob, but he hasn't had time as yet to work on this script. I haven't had the time to work on it myself either.

If there is a member of the community who wants to write a cron job to pull information from a database and stuff it into an S3 bucket, I'd be happy to work with them and then set up the script to automatically run.

-Jonathon Duerig
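
For anyone picking this up, here is a minimal sketch of the "public information only" export described above. It is not the forum's actual script. The table and column names (`posts`, `private_messages`, `poster_ip`, and so on) are hypothetical stand-ins loosely modeled on a phpBB-style schema, and an in-memory SQLite database stands in for the real one so the example runs anywhere. The idea is to whitelist the columns to export rather than blacklist the private ones, so that anything not explicitly listed never leaves the database.

```python
import json
import sqlite3

# Hypothetical whitelist of public columns; the real schema may differ.
PUBLIC_COLUMNS = (
    "post_id", "topic_id", "poster_name",
    "post_time", "post_subject", "post_text",
)


def export_public_posts(conn, out_path):
    """Write only the whitelisted post columns to a JSON Lines file.

    Returns the number of posts written. Private tables and columns
    (private messages, IP addresses, emails) are never queried.
    """
    query = "SELECT {} FROM posts ORDER BY post_id".format(
        ", ".join(PUBLIC_COLUMNS)
    )
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for row in conn.execute(query):
            record = dict(zip(PUBLIC_COLUMNS, row))
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def build_demo_db():
    """Create a tiny in-memory database standing in for the forum database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts (post_id INTEGER PRIMARY KEY, topic_id INTEGER, "
        "poster_name TEXT, poster_ip TEXT, post_time INTEGER, "
        "post_subject TEXT, post_text TEXT)"
    )
    conn.execute(
        "CREATE TABLE private_messages (msg_id INTEGER PRIMARY KEY, "
        "message_text TEXT)"
    )
    conn.execute(
        "INSERT INTO posts VALUES (1, 10, 'alice', '203.0.113.5', "
        "1470778800, 'Archiving', 'Hello')"
    )
    conn.execute("INSERT INTO private_messages VALUES (1, 'secret')")
    conn.commit()
    return conn


if __name__ == "__main__":
    written = export_public_posts(build_demo_db(), "public_posts.jsonl")
    print("exported", written, "posts")
```

A cron job could run a script like this on a schedule and then hand the resulting file to whatever upload step is preferred; the upload itself is left out here because it depends on credentials and tooling not discussed in this thread.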

Re: Archiving the Forums?

Posted: 16 Mar 2018, 10:32
by jlev
Hi Daniel and Jonathon,

I just remembered to come back to this, after a time out of contact. I hope that this message finds you both well!

I am still interested in contributing to this. I looked into it a bit more today, and found out about PHPBB-Static, which was recommended from here. That second linked page lists several other options for archiving a copy of a BBForum such as this, including using wget or httrack, which is what I originally had in mind, but which would presumably use a lot of bandwidth unless it were done on an offline copy. PHPBB-Static looks like the most promising way to me currently, but I haven't used it yet.

- Jacob

Re: Archiving the Forums?

Posted: 16 Mar 2018, 10:37
by jlev
Ideally, in my mind, the archive would comprise a static HTML mirror of the entire forum, with internal URLs changed to be relative URLs within the archive. In addition, it could be useful to create a partial database dump of post content for anyone wanting to do, say, text analyses in the future.

There is an issue re: users having the right to delete their posts on the forum, and not having the ability to similarly delete from an archive. Having said that, that is also already the case with Internet Archive mirrors.
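
To illustrate the "internal URLs changed to relative URLs" idea from the first paragraph of this post, here is a rough sketch of what that rewriting could look like. The host name `forum.example.org` and the function name `relativize` are made up for the example, and the regular expression is a deliberate simplification: a real mirror would more likely use a proper HTML parser and would also need to map dynamic page URLs to file names. Tools like wget and httrack already do this kind of link conversion themselves.

```python
import posixpath
import re
from urllib.parse import urlsplit

SITE_HOST = "forum.example.org"  # hypothetical host, for illustration only

# Simplified matcher for double-quoted href/src attributes.
LINK_RE = re.compile(r'(href|src)="([^"]+)"')


def relativize(html, page_path, site_host=SITE_HOST):
    """Rewrite absolute links to site_host as paths relative to page_path.

    page_path is the absolute path of the page being rewritten within the
    site, e.g. "/forum/topic1.html". Links to other hosts, and links that
    are already relative, are left unchanged.
    """
    page_dir = posixpath.dirname(page_path) or "/"

    def fix(match):
        attr, url = match.group(1), match.group(2)
        parts = urlsplit(url)
        if parts.netloc != site_host:
            return match.group(0)  # external or already-relative link
        rel = posixpath.relpath(parts.path or "/", page_dir)
        if parts.query:
            rel += "?" + parts.query
        if parts.fragment:
            rel += "#" + parts.fragment
        return '{}="{}"'.format(attr, rel)

    return LINK_RE.sub(fix, html)


if __name__ == "__main__":
    page = (
        '<a href="http://forum.example.org/forum/topic2.html#p5">next</a> '
        '<a href="https://example.com/a">elsewhere</a>'
    )
    print(relativize(page, "/forum/topic1.html"))
```

With links rewritten this way, the mirrored pages point at each other rather than back at the live domain, which is what would let the archive be browsed offline from a zip file or tarball.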