Sunday, March 16, 2014

Playing Librarian/Curator Made Easier with Video File Metadata

My video file collection is a mess. It contains duplicate content, and the files range in resolution from 1080p through generic HD and DVD-level down to older material that is lower still. Similarly, some files have surround sound tracks, while others have only very low-bitrate stereo tracks. I wanted to know which files might need to be re-encoded, and to be able to sort through the sheer volume of data fairly quickly using the parameters stored in the metadata.

That meant I needed to read the metadata, grab some key information from each file, and be able to make decisions quickly. For reading the metadata, I chose ffprobe, which is part of the ffmpeg Windows binaries made available by Zeranoe.


Then I fired up Cygwin and used the following bash script to grab the stuff I cared about:

for file in *; do
    # Skip the output file itself and anything that is not a regular file.
    [ -f "$file" ] && [ "$file" != list.txt ] || continue
    # Record the file name, then the stream fields of interest.
    printf '%s\n' "$file" >> list.txt
    ffprobe.exe -v quiet -show_streams -show_data -pretty -of json "$file" \
        | grep -E 'codec_name|width|height|bit_rate|channel_layout|sample_rate' >> list.txt
done

The double quotes around "$file" ward off undesirable behavior when file names contain spaces or other characters that would otherwise need to be escaped. The "OR"ed list of field names in the grep pattern is what lets me find out what I need to know about both the video and audio parts of each file.
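
One caveat with a plain "OR"ed pattern is that it matches substrings, so a key such as max_bit_rate (or coded_width, which newer ffprobe builds print) would also get through. Here is a minimal sketch of a stricter filter that matches whole JSON keys only. The function name filter_stream_fields is made up for illustration, and it assumes the input is ffprobe's JSON output with one key per line:

```shell
#!/usr/bin/env bash
# filter_stream_fields (hypothetical helper): reads ffprobe JSON on stdin and
# keeps only lines whose key is exactly one of the wanted field names.
# The surrounding quote and colon stop "width" from matching "coded_width".
filter_stream_fields() {
  grep -E '"(codec_name|width|height|bit_rate|channel_layout|sample_rate)":'
}
```

It would slot into the loop in place of the grep call, for example: ffprobe.exe ... "$file" | filter_stream_fields >> list.txt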

Here's an example of the output for two versions of the same source material that were inadvertently created:

VERSION_A.m4v
            "codec_name": "h264",
            "width": 704,
            "height": 384,
            "bit_rate": "863.986000 Kbit/s",
            "codec_name": "aac",
            "sample_rate": "48000 KHz",
            "channel_layout": "stereo",
            "bit_rate": "164.469000 Kbit/s",
VERSION_B.m4v
            "codec_name": "h264",
            "width": 1920,
            "height": 800,
            "bit_rate": "5.983516 Mbit/s",
            "codec_name": "aac",
            "sample_rate": "48000 KHz",
            "channel_layout": "stereo",
            "bit_rate": "198.056000 Kbit/s",
            "codec_name": "ac3",
            "sample_rate": "48000 KHz",
            "channel_layout": "5.1(side)",
            "bit_rate": "640000 Kbit/s",

(The "48000 KHz" and "640000 Kbit/s" figures should be read as 48 kHz and 640 Kbit/s; the -pretty option seems to attach the unit prefix without scaling values that come out as whole numbers.)

In this case, there's a clear winner. Version A is "DVD quality" and contains only a stereo soundtrack. Version B is "full HD" and includes not only a slightly higher-bitrate stereo audio track, but also a 5.1 surround audio track. As you might guess, the file sizes are significantly different, so you might think you could just keep the larger file. In this case that would work, but the distinctions are not always so clear, and the data makes the job of curating a bit easier.

With a small amount of work, I should be able to get this output file into a CSV format and then quickly merge, sort, and filter the whole list as a spreadsheet.
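
As a rough sketch of what that conversion might look like, the awk-based function below turns the list.txt layout shown above into one CSV row per stream. The name list_to_csv and the column order are my own choices, not something ffprobe provides. It assumes that every stream starts with a "codec_name" line and that any line that is not a "key": value pair is a file name:

```shell
#!/usr/bin/env bash
# list_to_csv (hypothetical helper): reads the list.txt layout on stdin and
# writes CSV on stdout, one row per stream.
# Usage: list_to_csv < list.txt > list.csv
list_to_csv() {
  awk '
    # Quote a field for CSV, doubling any embedded double quotes.
    function q(x) { gsub(/"/, "\"\"", x); return "\"" x "\"" }

    # Print the stream collected so far (if any) and reset the state.
    function emit(   row) {
      if (!have) return
      row = q(file) "," q(s["codec_name"]) "," q(s["width"]) "," q(s["height"])
      row = row "," q(s["sample_rate"]) "," q(s["channel_layout"]) "," q(s["bit_rate"])
      print row
      split("", s)   # portable way to clear the array
      have = 0
    }

    BEGIN { print "file,codec_name,width,height,sample_rate,channel_layout,bit_rate" }

    { sub(/\r$/, "") }        # tolerate Windows line endings
    /^[ \t]*$/ { next }       # ignore blank lines

    # A line of the form   "key": value,
    /^[ \t]*"[a-z_]+":/ {
      line = $0
      sub(/^[ \t]*"/, "", line)               # drop indent and opening quote
      key = line; sub(/".*/, "", key)         # text before the next quote
      val = line; sub(/^[^:]*: */, "", val)   # text after the first colon
      sub(/,[ \t]*$/, "", val)                # drop the trailing comma
      gsub(/"/, "", val)                      # drop quotes around strings
      if (key == "codec_name") { emit(); have = 1 }   # a new stream begins
      s[key] = val
      next
    }

    # Anything else is taken to be a file name.
    { emit(); file = $0 }

    END { emit() }
  '
}
```

Video rows leave the audio columns empty and vice versa, which keeps sorting and filtering in a spreadsheet straightforward. ffprobe also has a CSV output format of its own (-of csv), which might make the intermediate text file unnecessary.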