Newspaper Digitization and Information Retrieval – Workflow in Different Areas
Physical Paper Scanning and Digitization – The primary objectives of this project include scanning newspapers of different sizes, fixing the page layout, and handling the implications of OCR.
The steps in the digitization of newspapers include -
1> Planning – Decide how much content, and in which forms, is to be digitized. The type of digitization needed for each data format must be considered, since together these formats make up the organisation's digital collection.
2> Scanning Pages – The physical pages and microfilms are to be digitized into different data formats: first scanned images, then XML and images. The words recognised by OCR will need further correction by human intervention. Archival images should be of higher quality. Either in-house technology and personnel are applied to this work, or an experienced digital data conversion service provider is selected for it.
3> Converting Scanned Images to Digital Text - Scanned images are to be converted to digital text by Optical Character Recognition (OCR) technology. This digital text will then be used for online search and discovery of the information by users, so the accuracy of the OCR mechanism is an important aspect.
Blurry text, paper deterioration and similar factors introduce errors that have to be corrected by human intervention during the digital data conversion process.
An alternative process such as “Crowdsourcing OCR Text Correction” can be applied here. Please refer to http://veridiansoftware.com/text-correction/ for this.
Tesseract (https://code.google.com/p/tesseract-ocr/) is also a possible integration point for extracting text from images.
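One common way to put a number on OCR accuracy is the character error rate: the edit distance between the OCR output and a human-corrected reference, divided by the length of the reference. The following is a minimal sketch of that idea; the function names are illustrative and are not part of Tesseract or any other tool mentioned here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, reference: str) -> float:
    """Edit distance divided by the length of the human-corrected reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(ocr_text, reference) / len(reference)


# One wrong character in an eight-character reference.
print(character_error_rate("Tbe news", "The news"))
```

Tracking this rate on a sample of pages before and after manual or crowdsourced correction gives a rough measure of how much the correction step improves the searchable text.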
4> Using METS/ALTO to Define OCR Text - METS/ALTO provides rich digital objects, which allows for rich digital library interfaces to be built. For example, a typical METS/ALTO object encodes complete logical and physical structure of a document (i.e. chapters, sections, articles, pages, etc., and their associated metadata), and also the full-text content of each section of the document and the physical coordinates of every word in the document. The impact of this on the user’s search experience can be quite significant. It doesn’t typically cost any more to digitize materials to METS/ALTO than to formats like HTML, which contain much less information.
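To illustrate the word coordinates mentioned above, here is a small sketch that reads the words and their bounding boxes from an ALTO fragment using only the Python standard library. The sample XML is hand-written for illustration and far smaller than a real ALTO file, and the helper name `words_with_boxes` is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written ALTO fragment; real files carry many more elements.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="P1"><PrintSpace><TextBlock ID="TB1">
    <TextLine>
      <String CONTENT="Flood" HPOS="120" VPOS="340" WIDTH="95" HEIGHT="30"/>
      <SP/>
      <String CONTENT="warning" HPOS="225" VPOS="340" WIDTH="140" HEIGHT="30"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def words_with_boxes(alto_xml):
    """Return one dict per ALTO String element: its text and bounding box."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        # Compare local names so the sketch works across ALTO namespace versions.
        if element.tag.rsplit("}", 1)[-1] != "String":
            continue
        words.append({
            "text": element.get("CONTENT", ""),
            "x": float(element.get("HPOS", 0)),
            "y": float(element.get("VPOS", 0)),
            "width": float(element.get("WIDTH", 0)),
            "height": float(element.get("HEIGHT", 0)),
        })
    return words


for word in words_with_boxes(SAMPLE_ALTO):
    print(word)
```

With these coordinates a search interface can highlight each hit on the scanned page image, which is one reason the format improves the search experience.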
The National Endowment for the Humanities and the Library of Congress publish and maintain technical guidelines for scanning and text conversion of newspaper pages for the National Digital Newspaper Program (NDNP). These specifications, which stipulate the use of METS/ALTO, are seen as the industry standard. They were written for workflow consistency, and they will be sustained by the Library of Congress as a standard for digital content. Libraries and institutions around the world adhere to these guidelines when creating historic digital newspaper collections to align their work with the established standards.
Please refer to
All data from Steps 3 and 4 is unstructured digital content and should be stored in a NoSQL store (XML/JSON/HDFS), whereas all metadata describing that unstructured data should be stored in a relational database with the related XML tagging.
When the data is stored, the documents should be converted into a convenient data format so that they can be delivered to all devices and digital channels as users require.
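The split between unstructured content and relational metadata can be sketched as follows. This is only an illustration under stated assumptions: a plain dictionary stands in for the NoSQL document store, SQLite stands in for the relational database, and the table and column names are hypothetical.

```python
import json
import sqlite3


def store_article(conn, doc_store, article_id, content, metadata):
    # Unstructured content goes to the document store as JSON text.
    doc_store[article_id] = json.dumps(content)
    # Structured metadata goes to a relational table keyed by the same id.
    conn.execute(
        "INSERT INTO article_metadata (article_id, title, issue_date, section) "
        "VALUES (?, ?, ?, ?)",
        (article_id, metadata["title"], metadata["issue_date"], metadata["section"]),
    )
    conn.commit()


def find_content_by_section(conn, doc_store, section):
    # Query the metadata first, then fetch the matching documents.
    rows = conn.execute(
        "SELECT article_id FROM article_metadata WHERE section = ? ORDER BY issue_date",
        (section,),
    ).fetchall()
    return [json.loads(doc_store[article_id]) for (article_id,) in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article_metadata ("
    "article_id TEXT PRIMARY KEY, title TEXT, issue_date TEXT, section TEXT)"
)
doc_store = {}  # stand-in for a NoSQL document store

store_article(
    conn, doc_store, "a-001",
    {"text": "River levels rose overnight.", "page": 1},
    {"title": "Flood warning", "issue_date": "1923-05-14", "section": "Local"},
)
print(find_content_by_section(conn, doc_store, "Local"))
```

Keeping the shared identifier in both stores is what lets a metadata query lead back to the full content.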
Organizations should be able to take a Big Data approach to managing their information; they can accumulate and update all data from the disparate systems into a single repository, enabling them to streamline search and use data effectively.
The organisation should classify and sub-class its data according to how the metadata is associated with the digital data repository, and store these classifications in RDF format (the ontology creation process).
Please refer to http://gate.ac.uk/sale/tao/splitch14.html#chap:ontologies for information about ontologies and usage of those data.
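As a rough illustration of storing classifications as RDF, the sketch below writes a small class hierarchy and a few asset classifications as N-Triples lines. The `http://example.org/newspaper/` namespace and the class and asset names are hypothetical; only the `rdf:type` and `rdfs:subClassOf` property IRIs are standard.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
BASE = "http://example.org/newspaper/"  # hypothetical namespace


def triple(subject, predicate, obj):
    """Format one N-Triples statement whose three terms are all IRIs."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def classification_triples(hierarchy, assets):
    """hierarchy maps subclass -> parent class; assets maps asset id -> class."""
    lines = [
        triple(BASE + child, RDFS_SUBCLASS, BASE + parent)
        for child, parent in hierarchy.items()
    ]
    lines += [
        triple(BASE + asset, RDF_TYPE, BASE + cls)
        for asset, cls in assets.items()
    ]
    return lines


hierarchy = {"SportsArticle": "Article", "Photograph": "Asset", "Article": "Asset"}
assets = {"a-001": "SportsArticle", "img-042": "Photograph"}
print("\n".join(classification_triples(hierarchy, assets)))
```

A production system would more likely use an RDF library and a richer ontology, but the triples show how sub-classing links each asset to broader categories.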
5> Choosing a Presentation System -
6> EXIF, IPTC and XMP Metadata in Photo Uploading
While storing image data in a digital repository, the following should be kept in mind -
Metadata is the extra information that almost all digital cameras store with pictures. The metadata captured by the camera is called EXIF data, which stands for Exchangeable Image File Format (a data format that is normally not edited by hand).
Two of the most commonly used metadata formats for image files are IPTC and XMP.
IPTC is a standard developed for exchanging information between news organizations, and it has evolved over time. Around 1994, Adobe Photoshop's "File Info" form enabled users to insert and edit IPTC metadata in digital image files.
XMP is the newer XML-based "Extensible Metadata Platform" developed by Adobe in 2001. XMP is an open, public standard, making it easier for developers to adopt the specification in third-party software. XMP metadata can be added to many file types, but for graphic images it is generally stored in JPEG and TIFF files.
For automatic metadata extraction from images, the organisation can use https://code.google.com/p/metadata-extractor/
These open metadata standards should be applied when storing images and their metadata, so that both can be exchanged with any external entities and interfaces.
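Because XMP is XML embedded in the image file, a simple file can be inspected with nothing more than the standard library. The sketch below is not the metadata-extractor library linked above; it is a simplified illustration that assumes the XMP packet is stored as one contiguous block, and the sample bytes are made up rather than taken from a real JPEG.

```python
import re
import xml.etree.ElementTree as ET

XMP_PACKET = re.compile(rb"<x:xmpmeta.*?</x:xmpmeta>", re.DOTALL)
DC = "http://purl.org/dc/elements/1.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Made-up bytes: an XMP packet surrounded by placeholder binary data.
SAMPLE_IMAGE = (
    b"\xff\xd8 binary image data "
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/">'
    b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    b'<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b'<dc:title><rdf:Alt><rdf:li xml:lang="x-default">Flood in the old town'
    b"</rdf:li></rdf:Alt></dc:title>"
    b"</rdf:Description></rdf:RDF></x:xmpmeta>"
    b" more binary data \xff\xd9"
)


def extract_xmp_titles(data):
    """Return the dc:title values found in the first XMP packet, if any."""
    match = XMP_PACKET.search(data)
    if match is None:
        return []
    root = ET.fromstring(match.group(0))
    titles = []
    for title in root.iter(f"{{{DC}}}title"):
        titles.extend(li.text for li in title.iter(f"{{{RDF}}}li") if li.text)
    return titles


print(extract_xmp_titles(SAMPLE_IMAGE))
```

For production ingest a dedicated extraction library is the safer choice, since real files may split or compress metadata in ways this sketch does not handle.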
7> Metadata Preservation -
The metadata of a digital repository is structured information that stores the description of digital assets and makes them accessible to end users via search.
Most images and other non-textual forms of data do not carry descriptive textual information that can be used to find them, whereas textual data does (whether structured or unstructured).
So at the time of digitizing a newspaper, the system should aim to store as much metadata about the organisation's assets as possible, which will help users find the data.
A> When storing textual data, the organisation should store the context of the data, the date and time of that context, the classification of the data, and the relation of the data to earlier or later digital asset contexts.
B> When storing image data, the organisation should aim to store as metadata the context of the image, its relation to the related textual context, the category or classification of the image, the date/time of the image, any related sub-context, information about the creator of the image, and as much other information as possible. This will further facilitate information retrieval when end users search.
C> When storing other forms of data such as video, the organisation should aim to store as metadata the context of the video, its relation to the related textual context, the category or classification of the video, the date/time of the video, any related sub-context, information about the creator of the video, and as much other information as possible. This will further facilitate information retrieval when end users search.
A search on any context should return the targeted textual information, images, videos and other forms of information that exist in the organisation's digital asset repository.
8> Metadata Creation Roles and Components -
The following are the typical roles and components that need to be created and maintained for a successful implementation of digital assets -
Store Creator Information: Entities responsible for making and further contributing to the resource.
Relevance of the Context: The spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant.
Date and Time: A point or period of time associated with an event in the life cycle of the resource.
Descriptive Information of the Resource: Information about the resource, including the language it is written in.
Information Format and Unique Identifier of the Resource: Store metadata about the digital file format, the physical medium and the size of the resource. The resource should also be given a unique identifier together with its related context and classification. A descriptive title, information about the main subject of the resource, and the source from which the resource was generated should be stored here as well.
Relational Metadata: Information that links the resource to other related resources.
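The components listed above can be gathered into one record type. The sketch below is an assumption about how such a record might look: the field names loosely follow Dublin Core element names, which the descriptions above appear to correspond to, and `missing_fields` is a hypothetical helper for spotting incomplete records.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class AssetMetadata:
    identifier: str            # unique identifier of the resource
    title: str                 # descriptive title
    creator: str               # entity responsible for making the resource
    date: str                  # ISO 8601 date of a life-cycle event
    coverage: str = ""         # spatial or temporal relevance
    description: str = ""      # descriptive information
    language: str = ""         # language of the resource
    file_format: str = ""      # digital file format or physical medium
    subject: str = ""          # main subject of the resource
    source: str = ""           # where the resource was generated from
    relations: list = field(default_factory=list)  # ids of related resources

    def missing_fields(self):
        """Names of the fields that are still empty, in declaration order."""
        return [name for name, value in asdict(self).items() if value in ("", [])]


record = AssetMetadata(
    identifier="img-042",
    title="Flood in the old town",
    creator="Staff photographer",
    date="1923-05-14",
    relations=["a-001"],
)
print(record.missing_fields())
```

A report of missing fields per asset gives cataloguers a simple work list during ingest.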
9> File naming standards - File naming will help to optimize asset/metadata entry, manage conflicts, duplicates, and versions; and search in a collaborative workflow. Criteria for file naming should be established by a top level administrator and published for all users to follow.
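A naming standard is easiest to enforce when names are generated and validated by code. The convention in this sketch, `<publication>_<YYYYMMDD>_p<page>_v<version>.<ext>`, is purely hypothetical; each organisation would define its own pattern.

```python
import re
from datetime import date

# Hypothetical convention: <publication>_<YYYYMMDD>_p<page>_v<version>.<ext>
NAME_PATTERN = re.compile(
    r"^(?P<publication>[a-z0-9-]+)_(?P<issue>\d{8})"
    r"_p(?P<page>\d{3})_v(?P<version>\d{2})\.(?P<ext>tif|jp2|xml|pdf)$"
)


def build_name(publication, issue, page, version, ext):
    """Build a file name that follows the convention above."""
    slug = re.sub(r"[^a-z0-9]+", "-", publication.lower()).strip("-")
    return f"{slug}_{issue:%Y%m%d}_p{page:03d}_v{version:02d}.{ext}"


def is_valid_name(filename):
    """True if the file name matches the published convention."""
    return NAME_PATTERN.match(filename) is not None


name = build_name("Daily Herald", date(1923, 5, 14), 7, 1, "tif")
print(name, is_valid_name(name))
print(is_valid_name("scan final (2).tif"))
```

Rejecting non-conforming names at upload time keeps duplicates and version conflicts from entering the repository in the first place.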
10> Controlled vocabularies - The quality of the user experience of a digital asset management (DAM) system depends largely on the metadata keywords used to describe digital assets. The solution is to develop a controlled vocabulary that limits the keywords describing digital media assets to a defined set of terms (values), using controlled field types. The vocabulary is defined by a thesaurus that lists its terms, identifies synonyms, and provides a cross-reference of metadata terminology.
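A controlled vocabulary with synonyms can be sketched as a lookup from every accepted spelling to its preferred term. The thesaurus entries and function names below are hypothetical examples, not taken from any particular DAM product.

```python
# Hypothetical thesaurus: preferred term -> accepted synonyms.
THESAURUS = {
    "football": {"soccer", "association football"},
    "election": {"poll", "ballot", "vote"},
}


def build_lookup(thesaurus):
    """Map every preferred term and synonym to its preferred term."""
    lookup = {}
    for preferred, synonyms in thesaurus.items():
        lookup[preferred] = preferred
        for synonym in synonyms:
            lookup[synonym] = preferred
    return lookup


def normalize_keywords(keywords, lookup):
    """Split keywords into preferred terms and rejected free-text entries."""
    accepted, rejected = [], []
    for keyword in keywords:
        term = lookup.get(keyword.strip().lower())
        if term is None:
            rejected.append(keyword)
        elif term not in accepted:
            accepted.append(term)
    return accepted, rejected


lookup = build_lookup(THESAURUS)
print(normalize_keywords(["Soccer", "football", "ballot", "misc"], lookup))
```

Rejected keywords can be shown back to the cataloguer, or reviewed periodically as candidates for new vocabulary terms.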
11> Importing metadata - Metadata can be automatically populated into a DAM system through several different methods:
A> Extract embedded metadata from a digital file and populate the metadata fields when the file is uploaded
B> A one-time initial import to transfer metadata from another system into the DAM repository
Embedded metadata can be automatically mapped directly to searchable metadata fields within the DAM system. When the asset is uploaded into the system, the metadata should be automatically extracted from the digital file and placed into that asset’s metadata fields in the DAM system.
Metadata is most commonly imported in CSV or XLS format. The unique identifier is commonly the filename of the digital asset. Along with the unique identifier, the metadata fields and values for each asset need to be included.
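A CSV import keyed on the file name might look like the following sketch. The column names, the in-memory `repository` dictionary, and the rule that blank cells do not overwrite existing values are all assumptions made for illustration.

```python
import csv
import io


def import_metadata(csv_text, repository):
    """Merge CSV rows into repository records, matching on the 'filename' column.

    Returns the file names that have no matching asset in the repository.
    """
    unmatched = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        filename = row.pop("filename")
        if filename not in repository:
            unmatched.append(filename)
            continue
        # Skip empty cells so a blank value does not erase existing metadata.
        repository[filename].update({k: v for k, v in row.items() if v})
    return unmatched


CSV_TEXT = (
    "filename,title,photographer\n"
    "herald_p001.tif,Front page,\n"
    "missing.tif,Unknown,J. Doe\n"
)
repository = {"herald_p001.tif": {"photographer": "A. Smith"}}

print(import_metadata(CSV_TEXT, repository))
print(repository)
```

Reporting unmatched file names rather than silently skipping them makes it easier to catch naming mismatches between the source system and the DAM repository.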
12> Applying metadata in bulk - Entering metadata is critical to obtaining full value from a library of digital assets. Batch editing has the potential to greatly reduce the time required to enter common metadata for large numbers of assets. Batch editing makes it possible to apply changes simultaneously across a large number of assets, for example classifying groups of assets as logos, product images, lifestyle images, brochures, etc.
DAM systems should offer several batch editing tools that can be used to substantially reduce the data entry task at the time of ingest or after the assets are already in the system. These tools should make it possible to select multiple files and apply common metadata to all of them simultaneously.
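The core of a batch edit is small: take a selection of assets and apply the same field values to each. The sketch below assumes an in-memory dictionary of records and a hypothetical `overwrite` switch that decides whether existing values are replaced.

```python
def apply_batch(repository, selected_ids, common_metadata, overwrite=False):
    """Apply the same metadata to every selected asset; return changed ids."""
    changed = []
    for asset_id in selected_ids:
        record = repository[asset_id]
        before = dict(record)
        for field_name, value in common_metadata.items():
            # Keep existing values unless the caller explicitly overwrites.
            if overwrite or not record.get(field_name):
                record[field_name] = value
        if record != before:
            changed.append(asset_id)
    return changed


repository = {
    "logo-01.png": {"category": ""},
    "logo-02.png": {"category": "brochure"},
    "photo-09.jpg": {},
}
print(apply_batch(repository, ["logo-01.png", "logo-02.png"], {"category": "logo"}))
print(repository)
```

Defaulting to no overwrite protects metadata that was entered by hand, while the returned list shows which assets the batch actually touched.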
13> Evaluating metadata effectiveness – The effectiveness of metadata needs to be evaluated to determine how easily and quickly users can find the digital assets they are seeking. Search analytics provides a quantifiable way to measure the relevancy of search results, and to show how well or poorly the design and overall user experience perform against the organisation's digital content.
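A few simple search-analytics measures can be computed directly from a search log. The log format and the choice of metrics below (zero-result rate, click-through rate and mean reciprocal rank of the clicked result) are assumptions of this sketch, not a prescribed method.

```python
def search_metrics(search_log):
    """Summarise a log of searches.

    Each entry is a dict with 'query', 'results' (number of hits) and
    'clicked_rank' (1-based rank of the clicked result, or None).
    """
    total = len(search_log)
    if total == 0:
        raise ValueError("search log must not be empty")
    zero_results = sum(1 for entry in search_log if entry["results"] == 0)
    clicked = [entry["clicked_rank"] for entry in search_log if entry["clicked_rank"]]
    return {
        "zero_result_rate": zero_results / total,
        "click_through_rate": len(clicked) / total,
        # Searches without a click contribute zero to the reciprocal rank.
        "mean_reciprocal_rank": sum(1 / rank for rank in clicked) / total,
    }


log = [
    {"query": "flood 1923", "results": 10, "clicked_rank": 1},
    {"query": "mayor election", "results": 5, "clicked_rank": 2},
    {"query": "zeppelin", "results": 0, "clicked_rank": None},
    {"query": "harvest fair", "results": 3, "clicked_rank": None},
]
print(search_metrics(log))
```

A rising zero-result rate, or clicks that land far down the result list, suggests that the metadata or the controlled vocabulary does not match the terms users actually search for.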
14> Analysis and storage of results in the organisation -
Useful activities that can be run as sub-projects of digital content analysis for the organisation include monitoring sentiment, uncovering details of brand activities, finding the individual events responsible for creating or changing trends, presenting data with visualisation tools, creating and managing research data, and searching and discovering information alongside macro trends.