In Part 1 of this series we considered the problem of reinvention, the unstructured nature of information, and the role of metadata.
In this post, we will try to answer the question: what are the best ways to turn unstructured content into structured metadata so that we can find information?
One way of solving the pesky problem of making structured data from unstructured information is to make users add metadata as they go along.
For some content management software, it is the only way.
This is called the "red asterisk" approach, because a typical UI instruction is to "Fill in the document properties (metadata), mandatory fields are marked with a red asterisk".
The problem with the red asterisk approach is that it may dissuade users from adding documents to the document repository.
There are two basic rules in an IM system: (1) Make it easy to add content to the repository, and (2) Finding content should always be efficient.
Anything that gets in the way of those rules by adding effort (or making search less efficient) is not a good thing. Faced with mandatory fields, users may revert to sharing documents by email, or resort to adding minimal or arbitrary metadata. In one project,1 for example, users were given a choice of 35 metadata values sorted in alphabetical order on a pull-down menu. Later, it was discovered that 60% of all content had been tagged with the "A" option, the first and easiest choice on the menu. Needless to say, this defeats the goal of the exercise.
An alternative solution is to rely on a powerful enterprise search engine.
Enterprise search technology is a massive topic, and at the risk of massive simplification here is a summary. The goal is to extract all document-level metadata and content from the data sources; import that data in a format the machine can use (XML, for example); index the data as quickly as possible; and provide a search user interface to query the index and return the search results. The data can be sourced from unstructured content, including text extracted from document files (PDF, MS-Word, Excel, HTML), or from structured content, such as other databases.
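The extract, index, and query steps can be illustrated with a toy inverted index. This is only a minimal sketch: it assumes the text has already been extracted from each file, the file names and contents are invented for illustration, and real engines such as Lucene/Solr do far more (analysis, stemming, compression, ranking).

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents: the text is assumed to be already extracted.
documents = {
    "audit-plan.docx": "Quality audit plan for the supplier review",
    "minutes.pdf": "Minutes of the supplier meeting",
    "policy.html": "Document retention policy and audit schedule",
}

index = build_index(documents)
print(sorted(search(index, "supplier audit")))
```

Because every word of every document goes into the index, the query needs no metadata at all: only the one document containing both "supplier" and "audit" is returned.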
If this technology sounds expensive, the answer is that it can be. Fortunately, the open source Apache Lucene/Solr3 project provides a no-cost licensing entry point, and feature-for-feature provides a very credible alternative to the expensive closed source products. More recently, ElasticSearch has joined it as another free and open source solution. These are well-resourced software projects: for example, Lucene/Solr has around 60 code committers (and a great many more contributors).
The key point is the features these search engines provide to end users. Because all the content of documents is indexed, full-text search is possible on every word or phrase. The search ranking algorithm is relevancy-based, which put simply means that matching documents are first retrieved and then scored. Scoring factors that raise a document's rank include: search terms that occur frequently in the document, matches on rare terms, matches on multi-term phrases, and the location of the match in the document.
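Two of these scoring factors, term frequency and rare-term matching, can be sketched with a simplified TF-IDF score. This is an illustration under stated assumptions, not the formula any particular engine uses: the documents are invented, and phrase matching and match location are left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(documents, query):
    """Return (doc_id, score) pairs for matching documents, best first."""
    term_counts = {
        doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()
    }
    n_docs = len(documents)
    scores = {}
    for term in tokenize(query):
        # Document frequency: how many documents contain this term.
        doc_freq = sum(1 for counts in term_counts.values() if term in counts)
        if doc_freq == 0:
            continue
        # Rare terms (low document frequency) receive a higher weight.
        idf = math.log(1 + n_docs / doc_freq)
        for doc_id, counts in term_counts.items():
            if term in counts:
                # Frequent occurrences within a document add to its score.
                scores[doc_id] = scores.get(doc_id, 0.0) + counts[term] * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical documents for illustration only.
documents = {
    "a.txt": "audit audit checklist",
    "b.txt": "audit report for the validation project",
    "c.txt": "audit plan",
}

for doc_id, score in rank(documents, "audit validation"):
    print(doc_id, round(score, 3))
```

In this example "validation" appears in only one document, so matching it counts for more than a second occurrence of the common term "audit", and b.txt is ranked ahead of a.txt.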
Full-text search therefore provides a very useful additional tool for information findability. It avoids problems that using mandatory metadata can bring.
So far, we've argued for the intrinsic value of metadata in IM, but stopped short of saying that IM systems should be entirely metadata-driven. Full-text and other search mechanisms have an important role too.
The final topic I want to consider is whether it is enough to have a search interface with metadata behind it, and leave it to the users to decide what they want to see at any given time by using just that tool. The question is whether document categories or folders bring any additional value.
The topic of folders in an IM system is one that has raised some debate. It often starts from the observation that replicating the directory structure of a shared network drive is a bad way to start building a document folder structure in an IM system. The big problem with the former, we are reminded, is that it often reflects the preferred taxonomy of one person or group, and everybody else has a hard time finding anything as a result.
SharePoint is widely used software for IM, and SharePoint consultants have argued for a move away from folders towards a flatter structure that makes more use of metadata in site columns. A Microsoft TechNet article on Using Folders in SharePoint 2013 recommends that you keep the folder hierarchy as flat and minimal as possible (and also recommends that you don't limit yourself to metadata views exclusively). However, the disadvantages of folders that the article lists are mainly limitations of SharePoint, not of folders.
Another product, M-Files, follows the same reasoning. It adopts a "dynamic metadata" approach to overcome the need for files to exist in more than one location.
"In traditional folder-based systems, users have to define the folder where the document is saved and rely on their memory to find it again."
The sophisticated way of objecting to folders is to deride them as skeuomorphic: a design based too much on elements taken from the physical world. These views would have merit IF the limitations of a file system meant that a file could exist in one place only, and had to be copied in order to exist in more than one place. We can all appreciate that duplicate copies are bad.
But a virtual category on a web page is simply a metaphor for a physical folder. A category name is just a piece of metadata. Any web page can display any list of files. Documents can easily go into more than one category, and document metadata can be used to determine which categories.
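A minimal sketch of this idea follows. It assumes each document record carries a hypothetical "categories" metadata field (the titles and category names are invented); the category pages are then just views computed from that metadata, and no file is ever copied.

```python
from collections import defaultdict

# Hypothetical document records; "categories" is an assumed metadata field.
documents = [
    {"title": "Supplier audit checklist", "categories": ["Quality", "Procurement"]},
    {"title": "Validation master plan", "categories": ["Quality"]},
    {"title": "Purchase order template", "categories": ["Procurement"]},
]


def build_category_views(docs):
    """Group document titles under every category named in their metadata."""
    views = defaultdict(list)
    for doc in docs:
        for category in doc["categories"]:
            views[category].append(doc["title"])
    return dict(views)


views = build_category_views(documents)
for category in sorted(views):
    print(category, "->", views[category])
```

The supplier audit checklist is listed under both Quality and Procurement, yet it is stored only once.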
Another problem with the "folderless" view is that it contravenes an important User Design principle, the one based on cognitive studies of memory that states recognition is easier than recall. People are better at recognizing things based on previous experience than they are at recalling things from memory. There is a good overview on the Nielsen Norman Group website,2 but two key points are that Search requires knowledge of the search domain, and increases memory load on the user.
It is not a coincidence that e-commerce sites rely on category landing pages for navigation, and I believe the problem is the same one faced by someone searching for information within an organisation. Starting with something familiar such as a category structure is a good way to explore information. If some content appears in several places, it matters not a jot as long as it is found.
If search is the way to maximize Findability ("I know it's there"), then perhaps good category assignation is the way to promote Discoverability, where users uncover new content ("I didn't know we did that").4
In this post we've presented a short overview of concepts that affect findability and information management in the corporate workplace. Much has been left out: auto-classification, semantic search, and content enrichment are just a few examples.
The summary of it all is that the elimination of (unintended) reinvention is key to Quality because it minimises the waste of rework and maximises the benefits of re-use. A big factor in avoiding reinvention is Process, and 'the right information at the right time' underpins good processes. This is where navigation (e.g. folders / categories), metadata, and enterprise search play their parts.
When you are next asked to prepare a company for a quality audit you can apply some of these concepts to the IM strategy they present to you. If the information is kept on a shared network drive, then you know it will lack metadata and automated version control, and will be hard to search. If a company has merely copied the shared drive directory structure to the document management system, watch out for content hidden in deep folders (especially if there is no search to rescue the situation). If everything is kept in a free file-sharing site with a flat structure and shared links are used to 'remember' where it is, then it will lack control, folders, metadata, and (maybe) the ability to search efficiently.
These are just some of the ways that IM concepts can help us to appraise Quality.
1 Appears in blog http://www.termset.com/blog/2015/1/16/using-the-end-user-to-apply-metadata-to-your-sharepoint-documents
2 Budiu, R., 2014 "Search Is Not Enough: Synergy Between Navigation and Search" http://www.nngroup.com/articles/search-not-enough/
3 Lucene/Solr, ElasticSearch, SharePoint, M-Files and other trademarks mentioned herein belong to their respective owners.
4 See also "Low Findability and Discoverability: Four Testing Methods to Identify the Causes" http://www.nngroup.com/articles/navigation-ia-tests/
This post was written by Paul Walsh.