Curated by: Luigi Canali De Rossi

Friday, November 7, 2008

Video Metadata Key Strategic Importance For Online Video Publishers - Part 2


Video metadata can make or break your online video success. By understanding what video metadata is and how it needs to be used, you can significantly affect the way your video content is managed, distributed and found online.

Photo credit: badboo

But should metadata be authored by people, as is most of the video production process, or should it be automated? What are the costs involved in doing this?

Here is the second part of this report (see Part 1 for the first).

Intro by Daniele Bazzano


The Currency Of Internet Video - Part 2

The Science - and Art - of Metadata Creation


Given the importance of metadata to Internet video, thoughtful consideration needs to be given to what constitutes good metadata.

Besides enabling the video search and interactivity required for quality Internet experiences, metadata is needed not only to maintain the original intent of the video, but also to enhance the experience in ways not possible with traditional video distribution. Beyond that, metadata can enhance the Internet video experience in ways yet to be conceived.

Metadata authoring is a topic unto itself and the subject of many technical papers and industry initiatives, which are beyond the scope of this paper.

To convey the essence of what constitutes good metadata, we'll rely on a simple illustrative example that addresses a common consideration in authoring metadata: should metadata be authored by people, as is most of the video production process, or should it be automated?


Example - Manual and Automated Metadata

Figure 4

Let's consider searching for a cameo appearance by Brad Pitt in an episode of the NBC sitcom, 'Friends'. Assuming a number of Friends fans have not already spent their valuable time to do this and post it on YouTube, this requires a few things. It requires metadata for the episode in which Brad Pitt appears.

Given his celebrity, it is most likely that Brad Pitt is listed in the content description metadata that was created during production. This data was entered by someone on the production team.

Thereafter, it is possible that a user searching for this episode may watch the entire episode, and the metadata has done its job. More likely, the user may want to watch only the scenes where Brad Pitt appears in the episode. Since the final packaged video does not have the original time code the editors used to edit the video, this information is lost and must be recreated.

In order to search for Brad Pitt within the episode, advanced facial recognition software may be deployed that is trained to recognize Brad Pitt. Assuming it can do the job, it will identify the first frame and subsequent frames that Brad Pitt appears in. Scene change detection software may then be deployed to detect a scene change before the first Brad Pitt frame and mark that as the start of the clip. It may detect the next scene change to mark the end of the clip.
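The boundary logic described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: it assumes a face recognizer and a scene change detector have already run and produced plain lists of timestamps in seconds. The function name `clips_for_detections` and all sample values are hypothetical.

```python
from bisect import bisect_right


def clips_for_detections(scene_changes, face_times, duration):
    """Return the (start, end) scenes, in seconds, that contain a face detection.

    scene_changes: timestamps where a scene change was detected (assumed input).
    face_times: timestamps where the face recognizer fired (assumed input).
    duration: total length of the video in seconds.
    """
    # Treat the start and end of the video as implicit scene boundaries.
    boundaries = [0.0] + sorted(scene_changes) + [duration]
    clips = []
    for t in sorted(face_times):
        # Index of the last boundary at or before the detection.
        i = bisect_right(boundaries, t) - 1
        # Clamp so a detection on the final frame still maps to the last scene.
        i = min(max(i, 0), len(boundaries) - 2)
        clip = (boundaries[i], boundaries[i + 1])
        # Consecutive detections in the same scene yield a single clip.
        if not clips or clips[-1] != clip:
            clips.append(clip)
    return clips


# Hypothetical data: three scene changes, detections clustered in one scene.
example = clips_for_detections([30.0, 95.0, 140.0], [100.2, 101.0, 120.5], 200.0)
```

Note that the sketch only snaps detections to the nearest detected boundaries. Whether those boundaries make for a watchable clip is a separate question.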

Theoretically, this seems to do the job, provided the technologies work reliably. Even the most mature of these technologies, speech to text, is less than 100% reliable (accuracy is generally considered to be 95% in the best case, but reportedly as low as 50% on a broader scale), so the first concern is whether the technology correctly identifies Brad Pitt. Since he has a well-known face, let's assume the system can be trained rigorously for this example; even so, recognition is less than certain. Moreover, training systems to perform voice, face and object recognition is time consuming and requires a tremendous upfront investment of time and resources.

The second concern is whether the resulting clip or clips are watchable as a cinematic experience:

  1. Are the scene changes appropriate for the story, in addition to being accurately detected?
  2. Did the scene boundaries interrupt key dialog?
  3. Do we know the context within which Brad Pitt is introduced into the show?

These are just some of the considerations.

Conceivably, a better place to start the clip would be the prior scene, or maybe further into the scene. A person can make this decision very quickly and intuitively, whereas automation can produce a result that is not only suboptimal but grossly inaccurate. In the end, a person would still need to review and potentially edit the work of the machine.

To make a finer point of automated versus manual metadata creation, consider the following:

  1. If we were trying to locate Ron Perlman in his Hellboy outfit, or Danny DeVito in his Penguin outfit, facial recognition would likely be hopelessly lost, since even humans sometimes cannot recognize the faces behind the outfits. Nevertheless, a human is better suited to this task.
  2. More dramatic contrasts between manual and automated metadata can be demonstrated in sports programming. Sports viewing is a combination of close-ups, long camera angles, fast motion and fast camera transitions. This combination, along with the fact that players are not always facing the camera, makes it impossible to apply facial recognition technologies to create clips automatically. Creating clips of LeBron James' three pointers or Tom Brady's touchdown passes can only be done by a person.

In any event, given the less than 100% accuracy of any automated systems, be they speech recognition, image or facial recognition, scene change detection and such, quality end results are derived through human authoring while using automation to facilitate the process.

A second important consideration in authoring metadata for Internet video, discussed next, is impossible to automate.


Video Is More Than The Sum of Its Parts


Beyond the obvious scene, object, face and speech recognition whether done automatically or manually, video is a complex communication medium.

The creative combination of visuals, sounds, speech, emotions and storytelling inherent in any video makes it so. Inferring the intrinsic appeal of a video program on the Internet for different users can only be done by people.

In the earlier 'serene seascape' example, imagine that the music is from Jaws, but the video has a comic audio commentary lampooning the (irrational) fear of sharks.

The emotion associated with the video is humor, as opposed to fear. The commentary could be educational about sharks, the intent being to inform as opposed to thrill. People can immediately establish such intent and capture it in metadata for their audiences.

Among the successful implementations of metadata listed earlier, alternate navigation schemes - including navigation across different video files - is one where human imagination can be applied to create new, lasting user experiences that are not possible with automated metadata schemes.

Consider multi-threaded programs such as ABC's Lost, reality shows with many participants and events such as Fox's American Idol and CBS' Survivor, or sporting events, wherein users may want to recapture the experience of the original program in many different ways.

Consider the following examples:

  • Lost: Sawyer + Kate + Romantic Scenes: creates a playlist across all episodes of scenes where Sawyer and Kate are together in a romantic setting
  • American Idol: Seasons 1-9 + Winners + Finals: creates a playlist of all the American Idol winners' final performances on the show
  • Sports: Tom Brady + Touchdown passes: creates a playlist of all of Tom Brady's touchdown passes in NFL games.

While the above examples are hypothetical, metadata easily allows users to essentially apply 'Boolean logic' (similar to what users do in web searches) to generate attachment through new experiences. In the absence of such metadata, programmers would need to actually edit and re-encode individual clips, which is a formidable task, if not an impossible one. It is also impossible to successfully create such dynamic playlists and alternate navigation schemes using individually encoded clips.
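As a rough sketch of how such 'Boolean logic' over metadata might work, the snippet below treats each authored scene as a tagged time range within a source file and builds a virtual playlist by requiring every tag in a query (a logical AND). No video is edited or re-encoded; the playlist is just a list of pointers. The `Segment` type, the `build_playlist` function, the video identifiers and the tags are all invented for illustration and do not describe Gotuit's system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A tagged time range inside a video file (hypothetical schema)."""
    video_id: str
    start: float  # seconds from the start of the file
    end: float
    tags: frozenset


def build_playlist(segments, required_tags):
    """Return segments carrying ALL required tags, ordered by file and time."""
    wanted = frozenset(required_tags)
    matches = [s for s in segments if wanted <= s.tags]
    return sorted(matches, key=lambda s: (s.video_id, s.start))


# Invented sample metadata for two episodes.
library = [
    Segment("lost_ep02", 610.0, 700.0, frozenset({"Sawyer", "Kate", "romantic"})),
    Segment("lost_ep01", 120.0, 185.0, frozenset({"Sawyer", "Kate", "romantic"})),
    Segment("lost_ep01", 400.0, 460.0, frozenset({"Sawyer", "Jack", "argument"})),
]

playlist = build_playlist(library, {"Sawyer", "Kate", "romantic"})
```

The same tagged segments can answer any other combination of tags, which is the flexibility that individually encoded clips lack.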

Human imagination remains ahead of technology.

Making metadata choices by what automated technologies allow is inherently more limiting than generating metadata manually, wherein video can be tagged in many different ways, and metadata fields can be created and managed any way that a human operator conceives necessary, intuitive, probable, or even imaginable.


Metadata Has The Lowest Production Cost of All Video Attributes


One of the underlying questions is the cost of authoring metadata and whether one approach is more cost effective than another. This boils down to the question of quality versus quantity.

If accuracy and end-user experience are secondary to processing large volumes of video for a basic search index, then automation is likely to solve the problem better than a human.

Automation tools such as scene-change detection and speech-to-text serve well in the production stage of video. This is because there is a lot of raw footage and the people handling the video are professionals. Their task is to manage the video production, not to consume or monetize the video.

In the case of researchers looking to sift through large video libraries, the same argument applies - the video experience is secondary to the objective of locating a video or a clip within a video asset.

At the risk of being redundant, let's (re)visit some of the commercial applications of video:

  • Search at a file or scene level
  • Create, display and share virtual clips and playlists
  • Create advertising insertion points and advertising logic
  • Generate detailed usage tracking and reporting data

Automating metadata creation for each of these exemplary applications will require mostly disparate processes, in contrast to human authoring which allows all required metadata to be created in a single pass. The cost of human authored metadata is, therefore, not only lower than automated metadata, but it is also insignificant relative to the overall video production costs.
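To illustrate the single-pass idea, the sketch below assumes a person has marked up one episode as a list of labeled scenes, and shows two of the applications above being derived from that same list: a scene-level search index and candidate advertising insertion points. The tuple layout, both function names and the `min_gap` spacing rule are assumptions made for this example only.

```python
def search_index(segments):
    """Map each lowercase word in a scene label to the scenes that mention it."""
    index = {}
    for start, end, label in segments:
        for word in label.lower().split():
            index.setdefault(word, []).append((start, end))
    return index


def ad_insertion_points(segments, min_gap=120.0):
    """Pick scene end times as candidate ad breaks, at least min_gap seconds apart.

    The final scene is skipped so that no break lands at the very end of the video.
    """
    points = []
    last = 0.0
    for _start, end, _label in sorted(segments)[:-1]:
        if end - last >= min_gap:
            points.append(end)
            last = end
    return points


# Invented scene list for one episode: (start, end, label), times in seconds.
scenes = [
    (0.0, 90.0, "Cold open"),
    (90.0, 300.0, "Brad Pitt arrives"),
    (300.0, 380.0, "Dinner argument"),
    (380.0, 600.0, "Closing scene"),
]

index = search_index(scenes)
breaks = ad_insertion_points(scenes)
```

Here both outputs come from one authored list, whereas an automated pipeline would typically need separate processes to produce each.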

Human metadata authoring can typically be accomplished in much less time than the play-out duration of the video.

People don't have to be trained to recognize speech or images like machines do, reducing upfront investment of time and resources.

Lastly, human authored metadata allows for further human creativity and reasoning to be applied to video programming, bringing new elements of creativity to an already creative process with negligible incremental costs.




Conclusions


Metadata is critical to the success of video on the Internet. Publishers need to address metadata creation as an essential part of the video production workflow.

  • Video as a complex medium requires human authored metadata to bring the vernacular of Internet experiences to video on the Internet.
  • Quality metadata to create audience engagement and monetization should be authored with distinct objectives of creating such Internet experiences for video.
  • Such metadata is best authored by people using authoring systems that allow

    1. Flexible and accurate metadata to be applied to video assets, and
    2. Additional creative expression to be brought to the medium of Internet video.

Publishers need to incorporate systems that author and manage metadata towards these objectives as they look to build audiences and advertising with their Internet video strategies.

Check out the first part: Video Metadata Key Strategic Importance For Online Video Publishers - Part 1

N.B.: The implementation examples described earlier in this paper are based on Gotuit's video metadata authoring and management system. These represent among the most advanced uses of metadata and Internet video implementations. The metadata in each case was human-authored either by Gotuit or its customer.

Originally written by the Gotuit Team and first published as "The Currency of Internet Video" on October 1st, 2008.


About the author

Gotuit is a developer of video metadata technology. Founded in 2000, Gotuit is privately held and funded by Highland Capital Partners, Atlas Venture, Motorola, and private investors. The company enables users to add metadata to sections of videos that are uploaded to their site. Gotuit powers video for leading brands such as Lifetime, Fox, Sports Illustrated, Major League Soccer and more. To learn more about how Gotuit can help implement solutions to create greater use and monetization of your video programming over the Internet, visit the company's website or contact the sales team at 781.970.5414.

Photo credits:
The Science - and Art - of Metadata Creation - dragerphot
Example - Manual and Automated Metadata - Gotuit
Video Is More Than The Sum of Its Parts - Kuzma
Metadata Has The Lowest Production Cost of All Video Attributes - Aleksey Poprugin
Conclusions - maxxyustas

posted by Daniele Bazzano on Friday, November 7 2008, updated on Tuesday, May 5 2015
