

Sampling the porn



a hundred thousand images later ...
or, "I may be gone some time". 

A note on 'sampling'. Usenet exists, like any other aspect of the internet, on 'servers', machines set up to hold messages or data, which respond to requests from the individual user's machine and supply the requested data. Readers familiar with the web have access, via their own service providers, to a network of machines which are all more or less publicly accessible. The individual surfer, effectively, asks the web server at his or her ISP to go to the web server on which a particular website lives, to fetch a specific page, and to pass it on to the surfer's own machine. The requested web page exists on one particular server somewhere in the world, or perhaps is also mirrored onto another specific server to cope with high demand.
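For readers who want to see the mechanics, the web transaction described above boils down to a single HTTP request for one specific page on one specific server. A minimal sketch in Python (the address here is only a placeholder; in practice it would be the address of a particular page on a particular site):

    import urllib.request

    # Ask the server that hosts the page for one specific document;
    # the response is the page itself, passed back to the user's machine.
    with urllib.request.urlopen("http://www.example.com/") as response:
        page = response.read()  # raw bytes of the requested page
        print(response.status, len(page), "bytes received")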

Usenet is slightly different. Some ISPs maintain their own news servers, but most simply buy in a newsfeed from a third party; in either case, for fear of litigation, most don't include the alt.binaries tree. Instead, third-party servers exist which attempt to hold comprehensive Usenet data, and to which the individual surfer connects more or less directly (that is to say, using his or her own ISP purely as a channel to the wider internet). Individual Usenet messages are posted to a particular server, but then 'proliferate' through a system of data-sharing between news servers. No server is actually comprehensive, and proliferation is never complete, even to the servers which routinely take data from the originating server. Added to this, Usenet is not an archived system in the way that the web is. Web sites, and individual web pages, often have a life measured in years. Given the volume of Usenet messages, most servers can only let individual messages stay on their machines for a limited time before they are deleted to make way for new ones. Typically, individual messages are retained for anything between a few days and a few weeks, but this 'retention' varies from server to server, and within servers, from one newsgroup to another.
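The connection described here (the user's machine talking more or less directly to a third-party news server) is an NNTP session. A minimal sketch using the nntplib module that ships with older Python 3 releases, assuming a placeholder hostname (commercial servers generally also require a username and password):

    import nntplib

    # Connect to a news server (hostname is a placeholder).
    with nntplib.NNTP("news.example-provider.com") as server:
        # Select a newsgroup; the server reports how many articles it is
        # currently retaining and the range of article numbers available.
        resp, count, first, last, name = server.group("alt.binaries.pictures.erotica")
        print(f"{name}: {count} articles retained, numbered {first}-{last}")

The count and the number range returned by the server are precisely the 'retention' discussed above: they will differ from server to server, and from one visit to the next.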

Usenet is therefore a much less standardised experience than the web. Two Usenet users are unlikely to have access to the same range of servers; no two servers have exactly the same list of newsgroups; no two servers with a given newsgroup will have received exactly the same list of individual messages; and the same group accessed from the same server at two different times will show different lists of available messages as old ones are deleted and new ones appear.

However, for my purpose, this relatively ephemeral quality has a value in that it provides a practical solution to the problem of sampling, which bedevils most attempts by individual writers to 'survey' the internet. The problem is that, given a virtually infinite field to research, the individual's own sense of what is relevant, appropriate, interesting or contentious inevitably plays a major role in deciding which data to include and which to exclude. Given that most studies of the internet focus primarily on the web, this problem of selection, and the consequent bias, turns the majority of internet studies into essays in subjectivity. Usenet, on the other hand, has indeterminacy built in at the start and does not really suffer from the web's (illusory) appearance of relative stability. If you download the headers for a particular newsgroup from a particular server on a particular date, you get a sample, more or less random, but finite. In a collaborative study (which the present one isn't!) there would probably be a very large degree of overlap between two samples taken by different users in different geographical locations, provided the samples were taken at the same time, and both from servers providing similar retention and coverage. Given sufficient collaborators, it would be possible to assemble (by comparing lists of messages received) both a 'complete' listing of messages (i.e. all the messages that were posted over a given period) and a listing of posts that reached more or less all of Usenet. I know of no academic group which has done this, but it does seem to be a relatively manageable exercise.
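The comparison itself would be trivial. Each collaborator contributes the Message-IDs seen on their server over the agreed period; the union of those lists approximates the 'complete' listing, and the intersection approximates the posts that reached more or less all of Usenet. A sketch, with three invented samples standing in for three collaborators:

    # Each collaborator contributes the set of Message-IDs seen on their server
    # over the same period (these three samples are invented for illustration).
    sample_a = {"<1@poster>", "<2@poster>", "<3@poster>"}
    sample_b = {"<2@poster>", "<3@poster>", "<4@poster>"}
    sample_c = {"<1@poster>", "<2@poster>", "<4@poster>"}

    samples = [sample_a, sample_b, sample_c]

    # Union: every message that was posted and reached at least one sampled server.
    complete_listing = set().union(*samples)
    # Intersection: the messages that propagated to all of the sampled servers.
    fully_propagated = set.intersection(*samples)

    print(len(complete_listing), "messages posted in total")
    print(len(fully_propagated), "messages reached every sampled server")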

To give some idea of the number of posts, I downloaded the current message headers from alt.binaries.pictures.erotica and received approximately 110,000 of them. Another problem arises here. Although downloading the message headers takes only a few minutes on an ISDN connection, reading them clearly takes some time (I mean reading the message headers, not the messages themselves). Individual messages typically contain a few lines of text, and then the attachment, usually a still image, and typically between about 20K and 200K in size. Surveying a newsgroup involves scanning the headers, deciding which messages to look at, then downloading those messages (to see the images). Downloading an individual message may take anything from a few seconds to a couple of minutes (and some images are split between several messages ... this applies even more to video files, which are often split between hundreds of messages and then have to be reassembled). This can easily mean that by the time you have looked at, say, a hundred images (i.e. one tenth of one percent of the total) in order to get some idea of the content, some of the earlier messages have been deleted from the server (and deletion is, effectively, randomised ... the older a message, the more chance of it being deleted). This in turn means you waste time trying to download images which no longer exist, interpreting error messages, etc. So occasionally the list of available messages has to be refreshed, which clears from the user's machine those message headers for which the messages are no longer available, and at the same time adds the headers for messages which have arrived in the meantime. Hence the 'approximately' 100,000 images, a figure which reflects the typical capacity or retention of the particular server I was using (which, being a paid service, is relatively good, certainly much better than most free or public servers).
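To make that workflow concrete, here is a rough sketch of the survey loop, again using nntplib against a placeholder server: download the overview headers, pick a few candidate messages by subject line, and treat a 'no such article' error as a message that has expired in the meantime. Decoding the attachments themselves, and reassembling multi-part posts, are further steps not shown here.

    import nntplib

    # Placeholder hostname; a real survey would also pass a username/password.
    with nntplib.NNTP("news.example-provider.com") as server:
        resp, count, first, last, name = server.group("alt.binaries.pictures.erotica")

        # Step 1: download the headers ("overview" data). A full survey would ask
        # for the whole (first, last) range; this fetches only the newest 1,000.
        resp, overviews = server.over((max(first, last - 1000), last))
        print(len(overviews), "message headers downloaded")

        # Step 2: pick candidate messages from the subject lines and fetch them.
        chosen = [num for num, fields in overviews
                  if "jpg" in fields.get("subject", "").lower()]
        for number in chosen[:10]:
            try:
                resp, article = server.article(number)
            except nntplib.NNTPTemporaryError:
                # The article has already expired from the server since the
                # headers were fetched -- exactly the problem described above.
                print(number, "no longer available")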