Is there an API or data feed to get threads from this forum? I'm looking to research themes in the discussions here and would thus like to collect some info in volume. Is there a fast/approved way to do this, or should I resort to scraping?

As an aside, I'm mindful of privacy - the identities of participants are not of interest other than their degree of interaction (e.g. the karma stats). The work is for WebSci research rather than private use/gain.


The site runs OSQA, so it might be worth asking in forums there about any kind of export mechanism. If there is, it might be worth talking to the site admins to see if such an export could be facilitated. How to do all this would be extremely useful to have documented for the time when we're ready to throw OSQA under a bus and move to something with better anti-spam controls and a decent search facility :-) .

Sure - happy to give feedback if I find a way. I have asked on the OSQA forum. Looks like the non-pro version has no API, but I'll update if I get an answer.

I guess you need to crawl it on your own, or you might ask the admins or OSMF to run prepared queries on the DB (though I guess they won't). It would be interesting if you listed/published your results on the research page.

I asked politely on OSQA's meta forum on 12 Aug 14:


I'd like to get lots of threads from a forum using OSQA. Is there any way to read data directly as JSON, XML or whatever? If not, is there a best way to scrape OSQA data - ideally so as not to inconvenience others? I only need to do the sample once (pending any follow up) - it's not a constant process.


After an age in 'moderation' the question has now been rejected without explanation. Bottom line, I guess they don't want to help.

Meanwhile I'm trying - if I get a solution I will of course share.

There is an RSS feed for all questions and a separate feed for each question and its comments. Deriving the individual answers feed from the individual question links in the main RSS seems to be just the addition of "?type=rss&comments=yes" to the link in the main feed.

So it seems you may be able to get "published" versions of all this information in an easily processed format.
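The derivation described above can be sketched in Python as follows. This is only a rough sketch, not a tested scraper: the function names are made up for illustration, it assumes the main feed is ordinary RSS 2.0 with an unnamespaced "link" element inside each "item", and it assumes the question links carry no query string of their own.

```python
import xml.etree.ElementTree as ET

# Suffix reported in this thread; not taken from any OSQA documentation.
FEED_SUFFIX = "?type=rss&comments=yes"


def question_links(rss_xml: str) -> list:
    """Return the <link> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("link", default="").strip() for item in root.iter("item")]


def answers_feed_url(question_link: str) -> str:
    """Append the RSS suffix to a question link.

    Assumes the link has no existing query string.
    """
    return question_link + FEED_SUFFIX
```

Feeding the text of the main feed to question_links and mapping answers_feed_url over the result gives one feed URL per question. Fetching those URLs with a pause between requests would keep the load on the server low.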

I haven't read the Terms of Service recently though, so I have no idea if this is officially permitted.

As the footer of this page shows, the content is Creative Commons licensed, so no problem :-)

The RSS only gives me the last 30 topics. Regardless of the sort type chosen for the web page, the RSS seems to be the 30 most recently added/answered topics. The RSS feed does not seem to accept pagination, e.g. &page=2 and &p2 both have no effect on the base feed.
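One way to check whether a query parameter such as &page=2 changes the feed at all is to compare the set of item links in the two responses. A minimal sketch, assuming both responses are plain RSS 2.0 and have already been downloaded as text (the function names are hypothetical):

```python
import xml.etree.ElementTree as ET


def item_links(rss_xml: str) -> set:
    """Collect the <link> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return {item.findtext("link", default="").strip() for item in root.iter("item")}


def feeds_differ(base_xml: str, variant_xml: str) -> bool:
    """True if the two feeds list different sets of item links."""
    return item_links(base_xml) != item_links(variant_xml)
```

If feeds_differ returns False for the base feed and the feed fetched with the extra parameter, the parameter is probably being ignored. A True result is less conclusive, since a new topic posted between the two fetches would also change the set.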

