It’s the filtering, stupid

I’ve been struggling for a while with a very big concept: how to bring scientific publishing to the modern web world of social, collaborative creation. Yes, the current science publishing system is indeed a social, collaborative effort, but I think it is quite weak at using current online tools. So I’ve been spending a lot of time thinking of all the places where the current system could be helped by the tools I use daily in the social, living, collaborative Web.

Ugh. Big Gulp, indeed. So, the best I can do is break it down into smaller chunks. Partly because different parts of the find-navigate-recombine-contribute cycle of scientific publishing are at different levels of Web-savviness. Partly because it’s easier for me to digest. And partly because some parts are more likely to change sooner than other parts.

Finding and navigating

If you take the progression of content being generated on the Web, publishing just seems to be getting easier and more fine-grained. We started with big publishing houses putting out digital representations of physical units, such as The Article, or The Paper, or The Book. Search and index sites like Google and Yahoo stepped in to help us find and navigate all the Stuff. And the letter to the Webmaster became the feedback channel.

Things quickly got pared down to the blog post size, and a democratization of tools caused an explosion of all sorts of info on the Web. RSS feed readers and personal home pages stepped in to help us manage these morsels of information. Comments and new posts became the feedback channel.

The latest push in data generation has been nano-sized grains of info, flooding us through Facebook, Twitter, and all sorts of status update services. Used to keeping up with things by reading everything, we have become stuck just keeping up with what others might be saying. And our tools to follow this are just not keeping up.

I was banging my head trying to summarize what this was. At Le Web, I found myself hearing things related to “filtering”. I realize now that this thought might have been triggered by a good talk by Clay Shirky (which I only discovered recently through my tweeps) on filter breakdown – if you can’t keep up with the stream then your filter is broken.

That gave me the word I was missing to describe the first and, as I see it, largest issue with the future of science publishing. Indeed, I see filtering as a problem that is relevant to personal social Web use and even to business use of the social Web.

Linearity

The current filter tools we use, such as Google or Technorati, are too linear (an earlier rant of mine on this topic). You need to go through each item in turn, and the hierarchy is linear. There is very little by way of discovering new things or understanding the conceptual relationships between items other than order in a list. While I am at it, blogs are linear too. Once a post has more than five comments, the conversation breaks down into exchanges between the poster and individual commenters. What’s more, it becomes less of a conversation for the poster too, since keeping up with all of it can be difficult.

Personal home pages, such as NetVibes, are not the solution, since they are still set up by the user and still require the user to read things. No help from the tool except managing the multiple streams. Even sites like Alltop seem to be curated pages for multiple feeds. No help from the tool, once more.

I have been watching as multi-dimensional search engines for particular streams have appeared. Since I use Twitter a lot, I have been more keen to see a new tool to follow Twitter (and was happy that Twitter bought Summize). Indeed, for work, I find Twitter useless due to the volume of the data stream and my desire to follow and participate in that stream. There are no tools that do this. The tools I have seen are simple word counters (Twitstat, twitt(url)y, Twitscoop, twitrratr) and can have serious failures (for example in this pic, see how a negative reaction was misconstrued).
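To make that kind of failure concrete, here is a minimal sketch of a word-counting sentiment scorer. The word lists and the function name are invented for illustration; I don’t know how any of the tools above work internally, so treat this as an assumption about the general approach, not a description of them.

```python
import re

# Hypothetical word lists, invented for this illustration only.
POSITIVE = {"love", "great", "good", "awesome"}
NEGATIVE = {"hate", "bad", "awful", "terrible"}

def naive_sentiment(text):
    """Label text by counting positive and negative words, ignoring context."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# A clearly negative reaction that contains two "positive" words.
print(naive_sentiment("I do not love this, it is not great at all"))  # prints "positive"
```

The counter sees “love” and “great” and never notices the “not” in front of each, so a complaint is filed as praise.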

Semantics anyone?

Folks have been talking a long time about a semantic web, where “meaning” added to information makes that information in some way richer. There are a ton of tools out there based on semantics and folks thinking and working on it. And there are some interesting search engines for the sciences, such as DeepDyve, NextBio, and Knewco, all of which layer some form of multi-dimensional interface on top of search data.

In the social Web space, there is one company that I have been talking with a bit, Crimson Hexagon (hopefully, more on them later). They semantically analyze feeds of data for sentiment analysis.

But many of these seem like librarian jobs, where much of the semantics is either hard-coded into the data as it is created and classified, or extracted by data-mining static sets of data. I’d like to see semantics arising out of the use and creation of the data, much like people tagging their photos have added a layer of semantics in Flickr, rather than some librarian in the company data-mining all the time.

The closest analogy I have to explain user-generated semantics versus librarian-style categorization is the difference between Yahoo in 1996, with its cadres of employees manually cataloguing the Web, and Delicious, where the users do it as part of their regular, personal use of the service. Another analogy I like to use is how paths on a commons can be designed: don’t put down paths at first; instead, observe where the grass gets worn down, which shows you the optimal user paths.
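As a rough sketch of semantics arising from use, one could simply count which tags people apply together and let the relationships fall out of the counts, with no librarian involved. The bookmark data and function names here are hypothetical, made up for illustration; this is not how Delicious or Flickr actually do it.

```python
from collections import Counter
from itertools import combinations

# Hypothetical bookmarks: each set is the tags one user gave one item.
bookmarks = [
    {"genomics", "sequencing", "tools"},
    {"genomics", "sequencing"},
    {"genomics", "publishing"},
    {"publishing", "openaccess"},
]

def tag_cooccurrence(tag_sets):
    """Count how often each pair of tags appears on the same item."""
    counts = Counter()
    for tags in tag_sets:
        for pair in combinations(sorted(tags), 2):
            counts[pair] += 1
    return counts

def related_tags(tag, counts):
    """List the tags most often used alongside `tag`, most frequent first."""
    related = Counter()
    for (a, b), n in counts.items():
        if a == tag:
            related[b] += n
        elif b == tag:
            related[a] += n
    return related.most_common()

counts = tag_cooccurrence(bookmarks)
print(related_tags("genomics", counts))
```

With this toy data, “sequencing” comes out as the tag most closely tied to “genomics”, purely because users kept putting the two together: the worn grass, not the paved path.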

Water water everywhere

I think it’s great that there are so many folks working on this. The Semantic Web has been expected for a long time, but we’ve been too busy being geeky rather than applying it to something useful. The services above are all going in a good direction, though, and all of them are trying to get all that stuff on the Web and filter it.

I feel that this year someone will come out with a whiz-bang search tool that throws in some form of semantics (part librarian, a priori, and part user-generated) and simple but powerful visualization and navigation of relationships between results. I think there’s still a hole for a tool that allows individuals or corporations to navigate streams of data. The companies above are all trying it in their own particular way.

Is there a winner in any of them? Or will one arise that takes the most useful features of each of these?