Thursday, September 3, 2009

First thoughts on: Abdullah & Gibb, Students' attitudes towards e-books in a Scottish higher education institution: part 3

Library Review 58.1 (2009): 17-27. doi: 10.1108/00242530910928906 (University of Illinois Access)

Why should you read this: Well written and concise. I personally found it very thought-provoking (although I'm betting there are many, many articles and blog posts out there that suggest largely the same things I do…). Although this article generated some interesting ideas for me (see My Initial Thoughts), I found the study too limited to be of practical, generalized applicability (see In Brief) when considering the performance of the Table of Contents (TOC) versus the back-of-book (BOB) index versus full-text searching (FTS).

In Brief: 45 users (Masters students in the department of Computer and Information Science) were studied as they attempted to locate relevant content in PDF-format e-books (non-fiction, Computer Science, with TOCs and indexes). The tasks were quick, reference-style look-ups rather than in-depth reading: finding quick factual items in the books, or drawing conclusions through analysis of small portions of the e-book. The study contrasted FTS, the TOC, and the BOB index on three measures of effectiveness (as the authors define it): how efficiently/fast the task was completed, how effectively/successfully the task was performed, and how useful the user perceived each feature to be. The study finds that using the index was more efficient (faster) than using either the TOC or FTS, but not necessarily more effective (correct answer located) than either.

My initial thoughts: Limiting the type of e-book studied to PDFs was probably overly restrictive, and the biggest weakness of this study. It makes this more an evaluation of Adobe Acrobat Reader's search within PDF documents versus more traditional (print media) tools for finding information (TOC and index) than an actual evaluation of the comparative usefulness of full-text searching over TOCs and indexes. In other e-book platforms, full-text searching is not necessarily so brute force (find this word/phrase, then keep clicking through each occurrence, in linear fashion, until you find the most relevant section) and often does (or could) employ more advanced relevancy-ranking techniques (even ones based on additional text mark-up practices) than the search functionality in PDF documents. Thus, I don't find the authors' conclusion that indexes are generally more efficient for finding information than full-text searching very persuasive. Even the PDF search tool could (if Adobe chose to do so) be vastly improved to offer a relevancy-ranked set of full-text search results rather than just marching linearly through all matching keywords in the document (so long as additional metadata was available for that particular PDF document). The very back-of-book indexes the authors find so useful could easily, in an electronic format, be used to weight full-text keyword search results more highly, using the data already provided in the BOB index and TOC. In that scenario, if a user searches for a keyword that appears in the index or TOC (or especially in both), the search results would offer first the pages referenced in the TOC and index for that keyword, and only then the rest of the linear results, marrying the convenience of automated full-text searching with the added value of human-created indexes.
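To make that concrete, here's a minimal sketch in Python of what I have in mind. Everything here is hypothetical (the article proposes no algorithm): I'm assuming a real tool would pull `occurrence_pages` from the e-book's full-text search and the two page sets from parsed TOC/index metadata.

```python
# A minimal sketch of the index/TOC-boosted search described above.
# All names are hypothetical illustrations, not any real e-book API.

def rank_hits(occurrence_pages, index_pages, toc_pages):
    """Order full-text hits so pages cited by human-created metadata
    come first: index+TOC, then index-only, then TOC-only, then
    everything else in linear (page) order."""
    def tier(page):
        in_index = page in index_pages
        in_toc = page in toc_pages
        if in_index and in_toc:
            return 0  # strongest signal: cited by both TOC and index
        if in_index:
            return 1
        if in_toc:
            return 2
        return 3      # plain keyword match; fall back to page order

    # Stable sort: within each tier, hits stay in linear page order.
    return sorted(occurrence_pages, key=lambda p: (tier(p), p))


# Example: a keyword matches pages 3, 12, 37, 54, and 88 in the text;
# the BOB index cites pages 37 and 88, and the TOC cites page 37.
hits = rank_hits(
    occurrence_pages=[3, 12, 37, 54, 88],
    index_pages={37, 88},
    toc_pages={37},
)
print(hits)  # [37, 88, 3, 12, 54]
```

The point of the tiering is that the linear keyword march is never lost, it's just demoted: the human-curated pages surface first, and the brute-force results remain available below them.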

However, I agree wholeheartedly with the researchers' assertion that (basically) TOCs and BOB indexes (indexes in particular) are generally really good things. When the age of the e-book finally arrives, we'll still need actual humans to continue to create and apply metadata of all types to books (and many other types of information), particularly in the area of creating indexes.

The authors' argument for including TOC and BOB index information as searchable content in library catalogs does seem like a good one. I'd even go so far as to say that TOC and index terms should probably be given a heavier weight in ranking results than terms in the subject/descriptor fields, especially for narrowly defined tasks like finding a small, particularly relevant bit of information in a book (versus wanting to find and read an entire book about a topic). But I think that would only be a small baby-step towards the inevitable type of information interface users want (or will want): the ability to run a single full-text search across all the relevant content within their domain (or related domains). This means we need a single search tool that provides full-text searching across books, journals, web sites, basically anything the librarians choose to include in the, for lack of a better word, catalog.

Of course, pulling this overwhelming amount of content together and returning the results in a well-sorted, relevancy-ranked way for the end user will require that we do more than simple TF-IDF-style rankings. We'll need to leverage that incredibly useful additional human-generated metadata, like that contained in indexes (and, in the future, in additional mark-up applied directly to the texts, internally, rather than as separate entities like most indexes currently are). I know I would much rather use that mythical catalog for searching, and have it return not only the metadata about the book (title, subjects, etc.) but also show me my keywords in the context of the most relevant page that contains them, based on that ever-so-useful information contained in the TOC, the index, and (someday) internal text mark-up.

Imagine a day when the index of a book doesn't simply point, in serialized linear fashion, to each page that makes significant use of a word/term/concept, but where each entry in the index is individually weighted. In that case, a search might determine that the occurrence of the user's term on page 37 is the #1 most important use of the term in the book, even though the term appears 17 times before that point, and so it lists page 37 first rather than the preceding pages that contain the term.

Combine this with multiple-word searches (or, for truly advanced users, nested Boolean searches) and we can suddenly get very fancy with the ranking algorithms. Say a user searches for [term1 AND term2]. The search/ranking algorithm finds that (along with all the other occurrences of each term) the two terms appear on the same page, or close to each other, only a few times: term1 appears on page 17 at rank #3, page 33 at rank #2, and page 56 at rank #50; term2 appears on page 17 at rank #15, page 33 at rank #48, and page 56 at rank #11. We can calculate that page 17, even though it isn't ranked #1 for either term, may very well be the "best" result to offer the user first, with the same logic applied to sort and organize all the rest of the term matches.
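Here's a toy Python sketch of that arithmetic, using the exact numbers from my example. The scoring rule (sum of per-term ranks, lower is better) is purely my own assumption, chosen just to show the mechanics; a real system would tune or replace it.

```python
# A toy sketch of the weighted-index AND-query ranking described above.
# Data and names are hypothetical, matching the example in the text.

# Per-term index entries: {page: editorial rank}. Rank #1 means the
# human indexer judged that page the most important use of the term.
index_entries = {
    "term1": {17: 3, 33: 2, 56: 50},
    "term2": {17: 15, 33: 48, 56: 11},
}

def rank_and_query(terms, entries):
    """Rank pages where *all* terms appear, by combining each term's
    human-assigned rank for that page (lower combined score = better)."""
    # Pages that satisfy the AND: present in every term's entries.
    shared_pages = set.intersection(*(set(entries[t]) for t in terms))
    # Combined score = sum of the per-term ranks. This is an assumed
    # rule for illustration only.
    scored = {p: sum(entries[t][p] for t in terms) for p in shared_pages}
    return sorted(scored.items(), key=lambda kv: kv[1])

for page, score in rank_and_query(["term1", "term2"], index_entries):
    print(f"page {page}: combined rank score {score}")
# page 17: combined rank score 18  <- best hit, though #1 for neither term
# page 33: combined rank score 50
# page 56: combined rank score 61
```

What I like about this arrangement is the division of labor: the human indexer supplies the per-page judgments, and the machine only combines them, which is exactly the marriage of human-created metadata and automated full-text searching I'm arguing for above.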