How do Taxonomies Underpin Big Data Analysis?

A friend and I chatted (over a cup of English tea) about a project he is working on that involves analysing a large number of news articles to derive useful data about migration. The initiative's specific goal is to demonstrate how immigrants have been represented in UK mass media and to derive trends in public opinion on migration from that analysis (e.g., whether immigrants are seen as a threat to welfare or as culturally enriching). For example, he mentioned searching the texts for metaphors or adjectives that news outlets used to describe immigrants. Upon reflection, I realised that what he is doing is applying a controlled vocabulary (or taxonomy) to automatically index big data. Now that computers are powerful enough to accomplish this task and companies offer the service commercially, the method has become popular.

There are potential benefits to using this strategy within a DAM system. For example, automatically combing through the user-generated tags associated with different asset categories can help taxonomists improve their taxonomies. One might also imagine that automatic categorisation could be used to sort a large number of (textual) documents (see Heather Hedden's slide presentation available here).

Although an automatic text indexing system can be set up to classify documents using a controlled vocabulary, it must be trained and guided by humans. Automatic indexing needs many more non-preferred terms (synonyms, variant terms, misspellings and so on) than human indexing does. A human will recognise that "Bruce Wilis" [sic] is Bruce Willis when the name appears as the star of the 1988 movie Die Hard, but a computer will not infer this unless it has been trained or programmed to.

It should also be noted that auto-classification of complex media such as images and videos does not work at the conceptual level YET. I write "yet" because developments on this front are moving very rapidly. At the moment, machines can tell us what objects, colours, and even people appear in an image or video, but they cannot tell us what that resource is about. Identifying the subject of digital material is best left to us humans, at least for the time being.
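To make the point about non-preferred terms concrete, here is a minimal Python sketch of indexing text against a controlled vocabulary. It is not the method used in my friend's project or in any commercial product. The vocabulary entries and the function names (`build_patterns`, `index_text`) are hypothetical, and simple pattern matching stands in for whatever a real auto-indexing system does.

```python
import re

# Hypothetical controlled vocabulary: each preferred term maps to the
# non-preferred terms (synonyms, variants, misspellings) that a human
# taxonomist has decided should resolve to it.
VOCABULARY = {
    "Bruce Willis": ["Bruce Wilis", "Willis, Bruce"],
    "Immigration": ["immigrants", "migrants", "migration"],
}


def build_patterns(vocabulary):
    """Compile one case-insensitive, whole-word pattern per preferred term."""
    patterns = {}
    for preferred, variants in vocabulary.items():
        # Try longer terms first so they win over shorter overlapping ones.
        terms = sorted([preferred, *variants], key=len, reverse=True)
        alternation = "|".join(re.escape(term) for term in terms)
        patterns[preferred] = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return patterns


def index_text(text, patterns):
    """Return the sorted preferred terms whose term or variant occurs in text."""
    return sorted(
        preferred for preferred, pattern in patterns.items() if pattern.search(text)
    )


patterns = build_patterns(VOCABULARY)

# The listed misspelling resolves to the preferred term.
print(index_text("Bruce Wilis stars in Die Hard (1988).", patterns))

# A misspelling nobody added to the vocabulary is simply missed.
print(index_text("Bruce Wills stars in Die Hard (1988).", patterns))
```

The second call returns an empty list: the program only knows the variants a person has given it, which is why automatic indexing needs a much longer list of non-preferred terms than a human indexer would.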


