How to retrieve/extract metadata information from audio files using Java and Apache Tika API?
I guess I'm writing this post after a long time. This time, I'm writing about the Apache Tika API, which a friend of mine and I tried out to extract metadata from the audio file formats it supports: .mp3, .aiff, .au, .midi and .wav.
To make it clear, here's a screenshot of the information shown by Windows Vista about an audio file:
We wanted to extract this using Java and, after some googling, found that Apache Tika would help. We needed this metadata to index audio files so that they are searchable in a search application that we're building using Apache Lucene.
Here's a sample Java program that extracts metadata from an MP3 file:
package singz.samples.search.audio.metadata;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.mp3.Mp3Parser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Extract metadata of an audio file using Apache Tika API
 *
 * @author Singaram Subramanian
 */
public class AudioMetadataExtractorDemo {

    public static void main(String[] args) {
        // This audio file has metadata embedded in XMP (Extensible Metadata Platform) standard
        // created by Adobe Systems Inc. XMP standardizes the definition, creation, and
        // processing of extensible metadata.
        String audioFileLoc = "c:\\pop\\backstreetboys_showmethemeaningofbeinglonely.mp3";

        try {
            InputStream input = new FileInputStream(new File(audioFileLoc));
            ContentHandler handler = new DefaultHandler();
            Metadata metadata = new Metadata();
            Parser parser = new Mp3Parser();
            ParseContext parseCtx = new ParseContext();
            parser.parse(input, handler, metadata, parseCtx);
            input.close();

            // List all metadata
            String[] metadataNames = metadata.names();
            for (String name : metadataNames) {
                System.out.println(name + ": " + metadata.get(name));
            }

            // Retrieve the necessary info from metadata.
            // Names (title, xmpDM:artist etc.) mentioned below are case-sensitive and may
            // differ based on the standard used for processing and storing standardized
            // and/or proprietary information relating to the contents of a file.
            System.out.println("title: " + metadata.get("title"));
            System.out.println("artists: " + metadata.get("xmpDM:artist"));
            System.out.println("genre: " + metadata.get("xmpDM:genre"));
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (TikaException e) {
            e.printStackTrace();
        }
    }
}
Maven pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>singz.samples.search.audio</groupId>
  <artifactId>audiometadataextractor</artifactId>
  <version>0.0.1</version>
  <packaging>jar</packaging>
  <name>audiometadataextractor</name>
  <url>http://maven.apache.org</url>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <dependencies>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
      <version>0.10</version>
    </dependency>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>0.10</version>
    </dependency>
  </dependencies>
</project>
Output
xmpDM:releaseDate: 2001
xmpDM:audioChannelType: Stereo
xmpDM:album: Top 100 Pop
Author: Backstreet Boys
xmpDM:artist: Backstreet Boys
channels: 2
xmpDM:audioSampleRate: 44100
xmpDM:logComment: eng
xmpDM:trackNumber: 04
version: MPEG 3 Layer III Version 1
xmpDM:composer: null
xmpDM:audioCompressor: MP3
title: Show Me The Meaning Of Being Lonely
samplerate: 44100
xmpDM:genre: Pop
Content-Type: audio/mpeg
title: Show Me The Meaning Of Being Lonely
artists: Backstreet Boys
genre: Pop
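Since the key names can differ from one file format or tagging standard to another, and a tag can come back as null (the composer in the output above), it helps to look up each field through a list of candidate keys before handing it to an indexer. Here's a minimal sketch of that idea in plain Java. It does not call Tika: the class name MetadataFieldPicker, the method firstPresent and the sample map are made up for illustration, and the sketch assumes the metadata has already been copied into a Map by looping over metadata.names() and metadata.get(name). Note that Tika's key names are case-sensitive (for example xmpDM:artist).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika.
final class MetadataFieldPicker {

    /** Returns the value of the first candidate key that is present and non-blank, else the fallback. */
    static String firstPresent(Map<String, String> metadata, String fallback, String... keys) {
        for (String key : keys) {
            String value = metadata.get(key);
            // Skip missing keys, null values and whitespace-only values.
            if (value != null && !value.trim().isEmpty()) {
                return value.trim();
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Hand-written sample data that mirrors a few entries of the output above.
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("title", "Show Me The Meaning Of Being Lonely");
        metadata.put("Author", "Backstreet Boys");
        metadata.put("xmpDM:composer", null);

        System.out.println("title: " + firstPresent(metadata, "unknown", "title"));
        // xmpDM:artist is absent from this sample map, so the lookup falls back to Author.
        System.out.println("artist: " + firstPresent(metadata, "unknown", "xmpDM:artist", "Author"));
        // The composer value is null, so the fallback text is used.
        System.out.println("composer: " + firstPresent(metadata, "unknown", "xmpDM:composer"));
    }
}
```

The order of the candidate keys is the design choice here: put the key you trust most first, and keep the fallback text distinct from any real tag value so that missing fields are easy to spot in the index.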
About Apache Tika
http://tika.apache.org/index.html
“The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.”
http://www.lucidimagination.com/devzone/technical-articles/content-extraction-tika#article.tika
“Apache Tika is a content type detection and content extraction framework. Tika provides a general application programming interface that can be used to detect the content type of a document and also parse textual content and metadata from several document formats. Tika does not try to understand the full variety of different document formats by itself but instead delegates the real work to various existing parser libraries such as Apache POI for Microsoft formats, PDFBox for Adobe PDF, Neko HTML for HTML etc.
The grand idea behind Tika is that it offers a generic interface for parsing multiple formats. The Tika API hides the technical differences of the various parser implementations. This means that you don’t have to learn and consume one API for every format you use but can instead use a single API – the Tika API. Internally Tika usually delegates the parsing work to existing parsing libraries and adapts the parse result so that client applications can easily manage variety of formats.
Tika aims to be efficient in using available resources (mainly RAM) while parsing. The Tika API is stream oriented so that the parsed source document does not need to be loaded into memory all at once but only as it is needed. Ultimately, however, the amount of resources consumed is mandated by the parser libraries that Tika uses.
At the time of writing this, Tika supports directly around 30 document formats. See list of supported document formats. The list of supported document formats is not limited by Tika in any way. In the simplest case you can add support for new document formats by implementing a thin adapter that implements the Parser interface for the new document format.”
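The delegation described in the quote can be seen without Tika at all. To my understanding, Tika's parser for .wav, .aiff and .au files leans on the JDK's own javax.sound.sampled package for technical details such as channel count and sample rate. Here's a minimal sketch of that underlying JDK call, not Tika code: the class name AudioHeaderSketch, the methods describe and silentWav, and the map keys (picked to resemble the ones in the output earlier) are all made up for illustration. It builds a tiny silent WAV in memory so it can run without an audio file on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.UnsupportedAudioFileException;

// Hypothetical sketch using only the JDK; it does not touch the Tika API.
final class AudioHeaderSketch {

    /** Reads the audio header from a stream that supports mark/reset. */
    static Map<String, String> describe(InputStream in)
            throws IOException, UnsupportedAudioFileException {
        AudioFileFormat fileFormat = AudioSystem.getAudioFileFormat(in);
        AudioFormat format = fileFormat.getFormat();
        Map<String, String> info = new LinkedHashMap<>();
        info.put("type", fileFormat.getType().toString());
        info.put("channels", String.valueOf(format.getChannels()));
        info.put("samplerate", String.valueOf((int) format.getSampleRate()));
        info.put("bits", String.valueOf(format.getSampleSizeInBits()));
        return info;
    }

    /** Builds a short, silent 16-bit PCM WAV file in memory for the demo. */
    static byte[] silentWav(float sampleRate, int channels, int frames) throws IOException {
        AudioFormat format = new AudioFormat(sampleRate, 16, channels, true, false);
        byte[] pcm = new byte[frames * format.getFrameSize()];
        AudioInputStream audio =
                new AudioInputStream(new ByteArrayInputStream(pcm), format, frames);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        AudioSystem.write(audio, AudioFileFormat.Type.WAVE, out);
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] wav = silentWav(44100f, 2, 100);
        // ByteArrayInputStream supports mark/reset, which the format detection relies on.
        System.out.println(describe(new ByteArrayInputStream(wav)));
    }
}
```

What Tika adds on top of a call like this is the single Parser interface and the common Metadata object, so the calling code stays the same whether the file is a WAV or an MP3.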
About the XMP standard
http://en.wikipedia.org/wiki/extensible_metadata_platform
“The Adobe Extensible Metadata Platform (XMP) is a standard, created by Adobe Systems Inc., for processing and storing standardized and proprietary information relating to the contents of a file. XMP standardizes the definition, creation, and processing of extensible metadata. Serialized XMP can be embedded into a significant number of popular file formats, without breaking their readability by non-XMP-aware applications. Embedding metadata avoids many problems that occur when metadata is stored separately. XMP is used in PDF, photography and photo editing applications.
XMP can be used in several file formats such as PDF, JPEG, JPEG 2000, JPEG XR, GIF, PNG, HTML, TIFF, Adobe Illustrator, PSD, MP3, MP4, Audio Video Interleave, WAV, RF64, Audio Interchange File Format, PostScript, Encapsulated PostScript, and proposed for DjVu. In a typical edited JPEG file, XMP information is typically included alongside Exif and IPTC Information Interchange Model data.”
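To show where a prefix like xmpDM: comes from, here's a rough sketch that reads one property out of a serialized XMP packet with the JDK's XML parser. Serialized XMP is RDF/XML, and xmpDM is the usual prefix for Adobe's Dynamic Media namespace. The packet below is hand-written for illustration (a real one is normally wrapped in xpacket processing instructions and may store simple properties as attributes instead of elements, which this sketch does not handle), and the class name XmpPacketSketch and method xmpDmProperty are made up. This is not how Tika reads MP3 tags; it only illustrates the naming.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical sketch using only the JDK; it does not touch the Tika API.
final class XmpPacketSketch {

    // Namespace URI of the XMP Dynamic Media schema (the "xmpDM" prefix).
    static final String XMP_DM_NS = "http://ns.adobe.com/xmp/1.0/DynamicMedia/";

    /** Returns the text of the first xmpDM element with the given local name, or null. */
    static String xmpDmProperty(String xmpPacket, String localName) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for namespace-based lookups
        Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(xmpPacket.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagNameNS(XMP_DM_NS, localName);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Hand-written sample packet; the values mirror the MP3 used in this post.
        String packet =
                "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
              + "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">"
              + "<rdf:Description rdf:about=\"\" xmlns:xmpDM=\"" + XMP_DM_NS + "\">"
              + "<xmpDM:artist>Backstreet Boys</xmpDM:artist>"
              + "<xmpDM:genre>Pop</xmpDM:genre>"
              + "</rdf:Description></rdf:RDF></x:xmpmeta>";

        System.out.println("artist: " + xmpDmProperty(packet, "artist"));
        System.out.println("genre: " + xmpDmProperty(packet, "genre"));
    }
}
```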
from http://singztechmusings.wordpress.com/2011/10/17/how-to-retrieveextract-metadata-information-from-audio-files-using-java-and-apache-tika-api/