Archive for June, 2007

Searching for people and authors

The problem with people is that they don’t have unique names. To take an extreme case, Yahoo reported a couple of weeks ago that the Chinese authorities were considering a move to try to end the confusion caused by the fact that more than a billion people are now sharing just 100 surnames, and 93 million have the family name Wang.

More pertinently for publishing, the problem of author disambiguation has long been an issue for searching bibliographic databases such as PubMed/Medline. A lot of work is being done on automatic ways to disambiguate author names, for example by using affiliations, email addresses, subjects and co-authorships.
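As a rough illustration of how such clues might be combined, here is a minimal Python sketch. It is not how PubMed or any real disambiguation system works: the record fields, the weights and the example names are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AuthorRecord:
    """One author mention on one paper (all fields are illustrative)."""
    name: str
    affiliation: str = ""
    email: str = ""
    coauthors: frozenset = field(default_factory=frozenset)


def same_person_score(a: AuthorRecord, b: AuthorRecord) -> int:
    """Add up simple clues that two records refer to the same person."""
    if a.name.lower() != b.name.lower():
        return 0
    score = 0
    if a.email and a.email.lower() == b.email.lower():
        score += 2  # assumption: a shared email address is strong evidence
    if a.affiliation and a.affiliation.lower() == b.affiliation.lower():
        score += 1
    if a.coauthors & b.coauthors:
        score += 1  # at least one shared co-author
    return score


# Three hypothetical records that share the same name.
r1 = AuthorRecord("J Wang", "Univ A", "jw@a.example", frozenset({"L Chen"}))
r2 = AuthorRecord("J Wang", "Univ A", "", frozenset({"L Chen", "M Patel"}))
r3 = AuthorRecord("J Wang", "Univ B", "", frozenset({"K Sato"}))

print(same_person_score(r1, r2))  # shared affiliation and a shared co-author
print(same_person_score(r1, r3))  # same name only
```

A real system would need fuzzy matching of names and affiliations and some way of choosing a threshold; the point here is only that several weak signals can be added together.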

However, a more “Web 2.0” way to do this has been suggested in the WikiAuthors proposal. The idea here is that a copy of the database (in this case, Medline initially) would be placed on a wiki, and the authors and their colleagues – that is, the scientific community at large – would do the necessary work.

At present the WikiAuthors proposal appears stalled, pending the development of other WikiMedia projects (e.g. WikiProteins).

I was struck by some similarities with Spock, the current hot new search engine. Spock (currently in private beta, but there’s a good overview on Read/WriteWeb) focuses on people search, that is, it treats all search terms as a request to find matching people. Thus searching for “President of the United States” returns George W Bush and Bill Clinton as its first two hits.

The similarity with the WikiAuthors proposal is that Spock will allow users to add tags (in addition to automatically generated tags).

Spock will, however, be much more semantically rich than is proposed in WikiAuthors. Tags will include name, gender, age, occupation and location, among others. The really interesting bit comes from the “relationship” tag, which will link people together. Thus Spock can offer links to related people – in the case of Bill Clinton, this might be Chelsea Clinton (daughter), Bush (successor), Hillary (wife), Gore (VP). This will be a powerful tool if it works as promised.

Looking back the other way, I wondered if it would be useful to tag relationships between authors in a bibliographic database, for instance co-authors, co-workers, student/supervisor, etc. This could give a whole new way of exploring links in the literature, beyond the current way of using citations.
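To make the idea concrete, relationship tags between authors could be stored as simple (person, relationship, person) triples. The Python sketch below is purely hypothetical: the names, the relationship labels and the helper functions are invented for illustration and are not taken from Spock or WikiAuthors.

```python
from collections import defaultdict

# (person_a, relationship, person_b) triples; all names are made up.
RELATIONSHIPS = [
    ("A. Author", "co-author", "B. Writer"),
    ("A. Author", "supervisor-of", "C. Student"),
    ("B. Writer", "co-worker", "D. Colleague"),
]


def build_index(triples):
    """Index relationships by person so related people can be listed quickly."""
    index = defaultdict(list)
    for a, rel, b in triples:
        index[a].append((rel, b))
        # Simplification: the reverse link reuses the same label, even for
        # directional relationships such as "supervisor-of".
        index[b].append((rel, a))
    return index


def related(index, person, rel=None):
    """Return people linked to `person`, optionally filtered by relationship."""
    return sorted(b for r, b in index.get(person, []) if rel is None or r == rel)


index = build_index(RELATIONSHIPS)
print(related(index, "A. Author"))
print(related(index, "A. Author", rel="co-author"))
```

Browsing from one author to their co-authors, and then on to those people's students or colleagues, is just repeated calls to `related`.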



uBioRSS – Track the latest research by taxon or species

uBioRSS looks a very interesting development. From Matthew Cockerill’s post on the BioMedCentral blog:

uBioRSS is a nifty service from the MBLWHOI Library at Woods Hole, which harvests bibliographic information about new articles from publishers’ RSS feeds, and then passes them through the uBio taxonomic classification system which identifies any species that are mentioned in the article, and classifies the article appropriately.

This makes it possible to browse the literature taxonomically, so that, for example you might view a list of all the latest articles on cetaceans far more easily than can be done using plain text search.

uBioRSS is a great example of the way in which semantic enrichment can add value to the literature

Of course, it’s not new for third parties to add tagging to content, for instance to improve the search experience (e.g. product names in Google Product Search, place names in MetaCarta), but this is a nice example of what can easily be done with STM content. I’m sure this sort of thing will become increasingly common.
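The general idea of tagging feed items by taxon can be sketched in a few lines of Python. This is only a toy: the lookup table, the article titles and the `classify` function are assumptions for illustration, and uBio's actual name recognition and classification will be far more sophisticated.

```python
import re

# Tiny hand-made lookup from species name to a higher taxon (illustrative only).
TAXON_OF = {
    "tursiops truncatus": "Cetacea",
    "orcinus orca": "Cetacea",
    "drosophila melanogaster": "Diptera",
}


def classify(text):
    """Return the sorted list of taxa whose species names appear in `text`."""
    normalised = re.sub(r"\s+", " ", text.lower())
    return sorted({taxon for name, taxon in TAXON_OF.items() if name in normalised})


# Hypothetical article titles, as they might arrive from publishers' RSS feeds.
items = [
    "Vocal learning in Tursiops truncatus",
    "Wing development in Drosophila melanogaster",
    "A survey of online advertising",
]

# Group titles by taxon so the literature can be browsed taxonomically.
by_taxon = {}
for title in items:
    for taxon in classify(title):
        by_taxon.setdefault(taxon, []).append(title)

print(by_taxon)
```

Titles that mention no known species simply fall out of the taxonomic view, which is why the quality of the name dictionary matters so much.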


Online advertising primer

I was a bit surprised to see that, in his presentation “Online Advertising in Scholarly Journals: the Opportunities, Risks, and Rewards” to the STM Spring Conference in April, Richard Newman of the American Medical Association felt it necessary to explain how Google’s AdSense and AdWords programmes work. But on reflection I realised that he was probably right, because many STM publishers have yet to engage with online advertising in any substantial way.

Anyway, if you’re looking for a primer on online advertising, covering the different kinds of advertising, the size and growth of the market, challenges and innovation, current and future trends, and lots of links, I can recommend this recent post on the MediaShift blog: Your Guide to Online Advertising

From time to time, I’ll give an overview of one broad MediaShift topic, annotated with online resources and plenty of tips. The idea is to help you understand the topic, learn the jargon, and take action. I’ve already covered blogging, citizen journalism, presidential campaign videos and various other topics. This week I’ll look at online advertising.


Automated Content Access Protocol (ACAP) Conference

Today sees the first major conference on Automated Content Access Protocol (ACAP).

ACAP is potentially very important for online (i.e., all) publishers. What is it? This is the description from the conference website:

a standard by which the owners of content published on the World Wide Web can provide permissions information (relating to access and use of their content) in a form that can be recognised and interpreted by a search engine “spider”, so that the search engine operator is enabled systematically to comply with the permissions granted by the owner. ACAP will allow publishers, broadcasters and any other to express their individual access and use policies in a language that search engine’s robot “spiders” can be taught to understand.

(For more information, see also the Wikipedia page; the official ACAP website)

ACAP was conceived in January 2006 and born some 9 months later at the Frankfurt Book Fair. As of today, it is mid-way through a first pilot project that is intended to design v1.0 of ACAP. The pilot is due to finish in October 2007, with a final conference scheduled for 29 October in London.

The ACAP partners include leading publishers (STM and others), media and news organisations (including WAN, the World Association of Newspapers) and the British Library. But right now none of the three major search engines is a formal member (though they have participated informally), and clearly ACAP is never going to work without their active endorsement and participation.

So what’s in it for the search engines? On the one hand, they stand to lose access to content, or be barred from certain kinds of re-use they currently enjoy (particularly in news). But on the other hand there are potential gains, with publishers being able to make certain kinds of currently restricted content (books, databases) available to search engines if they feel they have more control over re-use (and potential to monetise that use). There are certainly potential huge gains for end-users here.

But one problem for early adoption of ACAP is that it (at least partly) addresses an area of current tension between content owners and the search engines: copyright. For example, publishers and Google have engaged in legal battles over Google’s interpretation of copyright law in relation to book scanning (e.g. the Authors Guild and the Association of American Publishers have separately sued Google) and news aggregation (e.g. Agence France-Presse). So it wouldn’t be surprising if the search engines decided to play their cards close to their chests at this point.

So what’s new at the London conference? In the opening keynote (pdf), WAN’s Gavin O’Reilly says of the Big Three search engines’ non-membership:

So however perplexing I find the fact that the big three still aren’t full participants in ACAP, it is – for me, probably the sole and minor disappointment among a long and continuing litany of successes and triumphs and I welcome the self-evident operational involvement that we continue to see from some of them.

Francis Cave’s project report (pdf) appears to show that the pilot is roughly on track, with the Use Cases defined and the technology options for the Technical Framework researched and specified. Defining the Use Cases was clearly an important milestone, as they lie at the heart of what ACAP is about and are potentially quite complex. For instance, Mark Maddocks’ presentation Why ACAP? Reed Elsevier Perspective (pdf) gives these examples:

  • Specify permitted use of indexed content (e.g. Limit number of displayed words in the search result; or Require direct link back to publisher site)
  • Exclude certain SE services from using indexed content (e.g. Allow for inclusion in main Google index but not Google Scholar)
  • Exclude specific parts of the page from indexing and/or display (e.g. Paragraphs, images, figures, or tables)
  • Exclude from the index copies of the page not found at the specified URL
  • First Click Free – site is indexed, but provides limited content to the user (e.g. Crawlers are allowed to index pages; A search results page can present search results from these pages; People can link to the page from the search results page but not onward link from that destination page. Instead they are redirected to a registration or other page)
  • Registration – site is indexed & free, but have to register for access (e.g. Crawlers are allowed to index the pages bypassing registration, the pages are flagged as Registration so that the crawler can explain this to the user if they choose to; Users clicking on a link are asked to register before seeing the content)
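To show how a crawler might act on a few of these use cases, here is a small Python sketch. It is emphatically not ACAP syntax (v1.0 is still being designed by the pilot): the `Permissions` fields, the service labels and the `snippet_for` function are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Permissions:
    """Hypothetical per-page policy; field names are not taken from ACAP."""
    allow_index: bool = True
    max_snippet_words: Optional[int] = None  # None means no limit
    excluded_services: set = field(default_factory=set)
    requires_registration: bool = False  # flag a crawler could show to users


def snippet_for(page_text, policy, service):
    """Return the snippet `service` may display, or None if it may not use the page."""
    if not policy.allow_index or service in policy.excluded_services:
        return None
    words = page_text.split()
    if policy.max_snippet_words is not None:
        words = words[: policy.max_snippet_words]
    return " ".join(words)


# Allow the main index a three-word snippet, but exclude a "scholar" service.
policy = Permissions(max_snippet_words=3, excluded_services={"scholar"})
text = "one two three four five"

print(snippet_for(text, policy, "main"))
print(snippet_for(text, policy, "scholar"))
```

Even this toy version shows why the use cases are complex: each one adds another dimension (service, page region, user state) that the permissions language has to express.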


Update/correction on Nature Precedings statistics

Oops. In my note about the launch of Nature Precedings last week, I said incorrectly there were 64 submissions on the launch date and gave the breakdown by subject category.

I made a rather elementary error – my numbers assumed that each submission was in only one subject category, whereas of course many have multiple categories.

Looking at the site a week later, there appear to be 43 submissions, split roughly 40:60 between manuscripts (17) and poster/presentations (26).

Bioinformatics is still by far the most popular single category, though.
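For anyone wanting to avoid the same mistake, this short Python sketch shows why adding up per-category counts overstates the total when items carry several categories. The three sample submissions are invented; only the figures 17 and 26 come from the counts above.

```python
from collections import Counter

# Made-up submissions, each tagged with one or more subject categories.
submissions = {
    "s1": {"Bioinformatics", "Genetics"},
    "s2": {"Bioinformatics"},
    "s3": {"Chemistry", "Bioinformatics", "Genetics"},
}

per_category = Counter(cat for cats in submissions.values() for cat in cats)
summed = sum(per_category.values())  # counts a submission once per category
actual = len(submissions)            # counts each submission exactly once

print(summed, actual)

# Checking the split quoted above: 17 manuscripts and 26 posters/presentations.
manuscripts, posters = 17, 26
total = manuscripts + posters
print(total, round(100 * manuscripts / total), round(100 * posters / total))
```

The sum over categories equals the true total only when every submission has exactly one category.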


Mother Goose & the scientific peer review process

Since we have been covering peer review developments recently (e.g. here) we couldn’t resist posting a link to this (an oldy but goody): Mother Goose & the scientific peer review process (from the Science Creative Quarterly). Extract:

Hey diddle diddle, the cat and the fiddle.
The cow jumped over the moon.
The little dog laughed, to see such a sight.
And the dish ran away with the spoon.

The reviewers felt that not enough data was presented to support your claims. For example – how many times did your group observe the cow jumping over the moon? From the text and supporting figures, it would appear that you base this conclusion on one data point as no calculations regarding standard deviations were presented. As an analytical journal of high repute, the reviewers felt that this is simply not acceptable. In addition, several of the reviewers felt that the word ‘diddle’ was inappropriate, and should have been replaced by the more scientifically correct, ‘Hey fornicate fornicate.’ Because of these, and other problems, we are sorry to inform you that your manuscript has not been accepted for publication.


Web 2.0 for higher education

A couple of reports/blog entries caught my eye recently in the area of Web 2.0 and education.

A new JISC-funded report Web 2.0 for Content for Learning and Teaching in Higher Education by Tom Franklin and Mark van Harmelen (who blogged its release here) was published at the end of last month. It offers recommendations to JISC on how to respond to the opportunities and challenges of Web 2.0. Overall they recommend that:

Recommendation 1: Guidelines should not be so prescriptive as to stifle the experimentation that is needed with Web 2.0 and learning and teaching that is necessary to take full advantage of the possibilities offered by this new technology.

From a publisher’s perspective, these recommendations could be important:

Recommendation 2: JISC should consider funding projects investigating how institutional repositories can be made more accessible for learning and teaching through the use of Web 2.0 technologies, including tagging, folksonomies and social software.

Recommendation 6: JISC should consider funding a study to look at how repositories can be used to provide end-user (i.e. referrer) archiving services for material that is referenced in academic published material, including Internet journal papers. Part of this consideration should extend to copyright issues.

Recommendation 3: JISC should consider funding work looking at the legal aspects of ownership and IPR, including responsibility for infringements in terms of IPR, with the aim of developing good practice guides to support open creation and re-use of material.

Other blog coverage: see Brian Kelly (UKOLN) on UK Web Focus.

Coincidentally, the Read/WriteWeb blog today published a round-up of some of its recent coverage of Web 2.0 in e-learning, in e-learning 2.0: All You Need To Know. This gives a whistle-stop tour with a lot of useful links, of which I found these particularly interesting:

e-learning 2.0 – how Web technologies are shaping education

Elgg – social network software for education

Elgg is an open source social platform that:

provides each user with their own weblog, file repository (with podcasting capabilities), an online profile and an RSS reader. Additionally, all of a user’s content can be tagged with keywords – so they can connect with other users with similar interests and create their own personal learning network. However, where Elgg differs from a regular weblog or a commercial social network (such as MySpace) is the degree of control each user is given over who can access their content. Each profile item, blog post, or uploaded file can be assigned its own access restrictions – from fully public, to only readable by a particular group or individual.

Elgg is being used at a number of universities including Brighton, Leeds and MIT. Coincidentally, there is a detailed case study of the Elgg implementation at Brighton in the Franklin/van Harmelen report discussed above.


