Opera recently released details about its Metadata Analysis and Mining Application (MAMA for short), along with some of the findings the tool has gathered. Zone Leader Schalk Neethling sat down with Brian Wilson, QA Engineer at Opera Software, to learn more about MAMA.
Schalk Neethling: Hi Brian and thank you for granting us this interview. First off, please tell us a little more about yourself and what you do at Opera.
Brian Wilson: I've been involved with software QA for about 15 years, working over the years at Microsoft, RealNetworks and now Opera. The Web has always fascinated me with its ability to help people communicate in so many new and interesting ways - to help bring people closer together. My first position was at Microsoft, working on their first Web browser (it pre-dated MSIE 1.0 by more than six months). It was an add-on extension to Word called Internet Assistant that provided primitive browsing and authoring capabilities.
Separately, I also created an online markup reference Web site about HTML and CSS [www.blooberry.com/indexdot/] that has been quite popular for many years. One of the major features of that site is tracking the historical evolution of HTML and CSS in both the browsers and standards. If you look at MAMA, you can trace direct lines between that topic and what MAMA has become.
I actually came to the attention of Opera's CTO (Hakon W. Lie) because of that site and he offered me a job at Opera in 2004. My first QA assignment involved the desktop release of Opera 7.0, but my tasks have branched out over time...producing and maintaining various tools to meet different needs, from QA automation to issue tracking. When I started at Opera, our QA department was less than 10 people. Six years on, Opera's product offerings have blossomed and our QA tasks have multiplied...as has our QA group. Opera's QA department has been proactive in tackling solutions that make everyone's life at Opera a little easier - something we take great pride in.
MAMA initially began as a side project but has grown to be one of my primary tasks. It is definitely a project to hold up as an example of the passion we hold for QA processes at Opera. Our aim is to help ease the pain in the software development process...and MAMA has the potential to spread that help beyond Opera's boundaries.
Neethling: What is your involvement with the MAMA project at Opera?
Wilson: Simply put, the implementation of MAMA is all me. The genesis of the "MAMA idea" was a joint one with my managers, and they've been generous with the time I have been able to devote to it. I've written the engine, and come up with the methodologies and systems used. I've done everything from planning to the development and QA. I've asked for input and advice from all quarters over the years, but so far MAMA = me.
Neethling: My next question to you is why MAMA?
Wilson: In its current form, the answer to "why MAMA?" is easy, but in the early years it was less clear. MAMA allows us (and, we hope, others outside of Opera) to answer the many questions of "how is X used?" For a browser maker, this is a very important question in many phases of browser software development. The hodge-podge development of various Web page technologies over the years has led to authors generating a huge patchwork of documents, many of which conform to very old standards or (more often) no standard at all. It is this environment that the browser must survive in, so any details we can glean from what is actually used on the Web can only help inform Opera's (and other browser makers') steps forward.
The beginnings of MAMA were not very ambitious. Initially I wrote some simple scripts to analyze a single web page and check for specific factors we wanted to test against in QA. I started asking around for factors to analyze in Web pages, and the laundry list grew quickly. Some of MAMA's current features are a result of expansion and generalization of narrow checks it was previously searching for.
Over time the feature generalization broadened MAMA's scope to the point where I realized it was serving a wider purpose than its humble beginnings.
Checking a single URL at a time would be a fine scope for a tool, but the obvious next step was to automate the analysis and store the results for easy retrieval. This would allow us to locate and test against groups of URLs that satisfied specific criteria.
The long writeup being highlighted on dev.opera.com is simply a natural byproduct of storing all this information...you get statistics for free. That's a great consolation prize!
Neethling: Please give us some background information with regards to similar initiatives from the past.
Wilson: There have been a number of smaller-scale studies (several are mentioned in the MAMA writeup [http://dev.opera.com/articles/view/mama-what-has-come-before/]). Many of these smaller studies looked at limited criteria or analyzed "small" URL sets numbering in the low thousands. As for large-scale studies, there are two that I know of.
The first was released in late 2005 by Ian Hickson (of Google and the WHATWG): http://code.google.com/webstats/. He did this study soon after shifting to Google from Opera. While he was with Opera, I had asked him to suggest some search criteria MAMA could look for, so even before he published the Google study he already had some influence on MAMA. Hickson's Google study covers the largest URL space I'm aware of to date (~1 billion URLs). There are some major areas his study doesn't cover, like external CSS and script, but it is an important first public look at a large sampling of Web pages.
After I started working on MAMA more seriously, someone pointed me to Rene Saarsoo's university study, "Coding practices of Web pages": http://triin.net/2006/06/12/HTML. This was the first documented large-scale study I have seen that tried to paint a complete picture of how a wide variety of factors are used in Web pages (including external CSS and script). The analysis factors in his study bear the closest resemblance to what MAMA is trying to do. I was a little disheartened at the time to find that Saarsoo had already done something similar, but later found that our studies were very complementary.
There was clearly room for MAMA in this space. MAMA digs deeper into more areas than either of these studies did, and its results are intended to be repeatable and live.
Neethling: What gap does MAMA fill with regards to these other studies?
Wilson: MAMA goes further than those two big studies, providing deeper coverage overall, and looking at many more criteria than either provided. MAMA also provides actual URLs that satisfy criteria, which is very useful in any case where you actually want to *do* something with all this data. It will also be a live and evolving entity, which will allow analysis of how things are changing over time.
Neethling: I understand that there are a lot of aspects working together to make MAMA tick but can you give us some explanation of how MAMA works?
Wilson: It consists of several phases:
1. Building a set of URLs
In this study, we specifically used the Open Directory (DMoz) URL set, but URLs can and do come from any source.
2. Getting URLs
Any incoming URLs are vetted for legitimacy and some initial metadata is gathered. This is separate from the analysis phase.
3. Analyzing the URLs
MAMA's URL analysis engine picks a single URL from the vetted set of URLs and grabs the content of the page. To treat a Web page like a user's browsing experience, META refreshes and FRAME/IFRAME URLs are added to an aggregate stack of URLs to be analyzed for the current URL's statistics. External scripts and CSS are analyzed, as well as a few select document dependencies. Many document dependencies are not yet analyzed by MAMA.
4. Validating the markup
The URL is run through a local copy of the W3C markup validator.
5. Storing the results
Data from phases 3 and 4 are stored in the MAMA database.
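MAMA itself has not been released, so for readers curious about what the analysis phase involves, the aggregation described above can be sketched roughly like this. Everything here is a hypothetical illustration (the fake in-memory "Web", the regexes and the function names are all assumptions, not MAMA's actual code):

```python
# Minimal, self-contained sketch of a MAMA-style analysis phase:
# a page's META refresh and FRAME/IFRAME targets are folded into the
# same URL's statistics, mimicking a user's browsing experience.
import re
from collections import Counter

# Fake "Web" so the example runs offline; real code would fetch over HTTP.
FAKE_WEB = {
    "http://example.com/": '<html><head><meta http-equiv="refresh" '
        'content="0;url=http://example.com/home"></head>'
        '<body><iframe src="http://example.com/ad"></iframe>'
        '<p>Hi</p><p>There</p></body></html>',
    "http://example.com/home": "<html><body><ul><li>x</li></ul></body></html>",
    "http://example.com/ad": "<html><body><p>ad</p></body></html>",
}

def fetch(url):
    return FAKE_WEB.get(url, "")

def subdocument_urls(html):
    # META refresh targets plus FRAME/IFRAME sources.
    urls = re.findall(r'content="\d+;url=([^"]+)"', html, re.I)
    urls += re.findall(r'<i?frame[^>]+src="([^"]+)"', html, re.I)
    return urls

def element_counts(html):
    # Count opening tags only (closing tags start with "</").
    return Counter(m.lower() for m in re.findall(r'<([a-zA-Z][a-zA-Z0-9]*)', html))

def analyze(url):
    """Aggregate markup statistics for a URL and its subdocuments."""
    stats, queue, seen = Counter(), [url], set()
    while queue:
        current = queue.pop()
        if current in seen:
            continue
        seen.add(current)
        html = fetch(current)
        stats += element_counts(html)
        queue += subdocument_urls(html)
    return stats

stats = analyze("http://example.com/")
print(stats["p"])  # 3: two from the main page, one from the iframe target
```

A real crawler would of course use a proper HTML parser rather than regexes, plus the validation and database-storage phases described above.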
Neethling: Can you give us an abbreviated run down of the most interesting key findings made about the web pages out there in the real world?
Wilson: On the surface, I thought this would be a difficult question to approach, because with about 30 chapters in this writeup...where would I even begin? But I quickly found that there are things this research makes very obvious. The biggest thing is that the Web is probably not what you think it is. One person questioned MAMA's markup element ranking based solely on the P element ranking lower than they expected. Their own authoring style simply did not match MAMA's results, which indicate that many authors do things a different way.
I have already seen this view a few different times now. It is perfectly fine to question the results, but we must set aside our biases to be open to what MAMA can tell us. I think large parts of MAMA's results here are interesting, so giving you a list could get really long. I'll try focusing on some results that surprised me:
- URLs "die" very often. Specifically, domain parkers have a high representation in DMoz and likely many other URL sources. I'm still trying to find ways to weed these out.
- Many documents use Doctypes, but despite this very few of them validate and only a small percentage trigger Standards rendering mode. The question becomes...if an author inserts a Doctype and yet neither of these factors are in play, are there any other motivating factors for using the Doctype?
- Specifying a document's encoding via the META element was by far the most popular method, far outnumbering the HTTP header (I would have expected the balance to be much more even).
- The OL element is rarely used compared to UL.
- Scripts dynamically writing references to other external scripts occur much more often than I expected. MAMA detected them, but it wasn't known how common the practice would be. MAMA currently makes no attempt to analyze this extra script content, but it is clear that MAMA needs to allow for this authoring behavior.
- Popular script libraries are often very easily identified by the many factors that MAMA tracks. It is a pleasant surprise to find that.
- The Web works largely on inertia in many cases. If the document works, why change it? This gives way to a rather large mix of old technologies still dominant over newer (and usually better) technologies and techniques.
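The point above about scripts dynamically writing other external scripts is worth a small illustration. A crawler might flag the pattern with a heuristic like the one below; this is purely a hypothetical sketch (the regex and function name are assumptions, not MAMA's actual detection logic, and a simple regex will miss obfuscated variants such as split "<scr"+"ipt" strings):

```python
# Hedged sketch: flag document.write() calls that emit a reference
# to another external script. Illustrative only, not MAMA's code.
import re

DYNAMIC_SCRIPT_RE = re.compile(
    r"""document\.write(?:ln)?\s*\(\s*['"].*<script[^>]*\bsrc\s*=""",
    re.IGNORECASE | re.DOTALL,
)

def writes_external_script(js_source):
    """True if the script appears to document.write an external <script src=...>."""
    return bool(DYNAMIC_SCRIPT_RE.search(js_source))

sample = 'document.write("<script src=\'http://stats.example/s.js\'></script>");'
print(writes_external_script(sample))   # True
print(writes_external_script("var x = 1;"))  # False
```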
Neethling: What was the process employed to define the URL set that MAMA based its analysis on?
Wilson: At first, it was all about achieving diversity. In the study results we have released [http://dev.opera.com/articles/view/mama/], I wanted the set to be both large and publicly repeatable. There was really only one possible set for this - the Open Directory Project (DMoz)...it is the only publicly available set of any significant size. I have many plans to expand MAMA's set, including using other public sources, URLs from random hyperlinks in URLs MAMA has already analyzed, and pure Web spidering.
The DMoz set is large, but it has a number of drawbacks. The biggest is domain over-representation. Over 5% of the entire DMoz URL set from late 2007 were URLs from CNN! MAMA needed diversity, so some way was needed to offset this bias. Limiting the number of URLs from any single domain (domain capping) was MAMA's answer, and this strategy will remain its biggest ally moving forward.
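Domain capping as described above is simple to express. Here is a minimal sketch of the idea; the cap value of 30 and the function name are illustrative assumptions, not MAMA's actual parameters:

```python
# Sketch of domain capping: limit how many URLs any single domain
# contributes, so no one site (like CNN in DMoz) dominates the set.
from collections import defaultdict
from urllib.parse import urlsplit

def cap_by_domain(urls, max_per_domain=30):
    counts = defaultdict(int)
    kept = []
    for url in urls:
        domain = urlsplit(url).hostname or ""
        if counts[domain] < max_per_domain:
            counts[domain] += 1
            kept.append(url)
    return kept

urls = [f"http://www.cnn.com/story{i}" for i in range(100)] + \
       ["http://example.org/", "http://example.net/"]
capped = cap_by_domain(urls, max_per_domain=30)
# cnn.com is limited to 30 URLs; the two other domains survive intact.
```

Note that this deliberately does not solve the domain-parking problem mentioned below, since parked pages spread the same templates across many distinct domains.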
It became clear early on that domain parking is a pervasive issue in DMoz, and likely in any URL set MAMA uses. This is a problem because it skirts the domain capping issue entirely while still creating the same result: over-representation of the same repeating patterns in a URL set that do not truly represent the diversity of the Web. It could be argued that domain parking truly does represent some sort of real diversity, but I disagree. If I can, I would like to see pages produced by domain parking eliminated entirely from MAMA's analyzed URL set.
Neethling: Through the findings that MAMA made, how would you describe the web as it relates to standards and accessibility?
Wilson: I started to answer this by listing many of the ways that MAMA could check for accessibility factors, but the list is large. I started going through factors in the WCAG checklist - like specifying alternate content, using accessible CSS media types, Longdesc, Accesskey, Lang and other interesting attributes. But there are just too many factors to consider to make quick blanket statements.
One interesting way to address the question is to look at pages specifically claiming to be accessible or standards-compliant. In the section on markup validation, MAMA looked at the use of W3C markup validation image badges. A side-effect of that analysis also found badges touting WCAG compliance. About 10% of the badges detected related to WCAG.
As an arbitrary bar, one could choose this 10% level as an extremely rough estimate to say something about pages attempting to use features that aid accessibility. Levels for any one of the factors MAMA stored were usually much lower than 10% of MAMA's URL population, so such an estimate is likely on the high side. Unfortunately, intent and success are two VERY different things; authors that use such features often use them incorrectly. MAMA's findings regarding validation badges were one of the most pessimistic parts of this whole process. Only half of the pages claiming to pass markup validation actually did so when checked by MAMA. Some of those failing validation may only have a few errors, but staking a claim of standards-compliance and strictly maintaining it is often difficult.
Authors as a whole are bad at conforming to standards and accessibility needs, but when we compare MAMA's results to similar previous studies (like Saarsoo's), there is some evidence that the Web is slowly inching forward on these points. Will we ever get to a Web that is 100% standards-compliant and entirely accessible? No. Why? Too many authors are of the "good enough for now" school, which does not have any room for anything beyond a rushed, cobbled-together immediate solution.
Neethling: Before we close out the interview, where can developers interested in more detail go to learn more about MAMA?
Wilson: Right now, I'm concentrating on getting the results of MAMA's initial study pushed out on http://dev.opera.com/articles/view/mama/. That will be the best place for the time being. I'm planning to release many more articles in the coming weeks. Feedback, comments and ideas on the project are welcome!
Neethling: When can the public expect to get their hands on MAMA themselves?
Wilson: We are working on feature sets, database optimization and increasing MAMA's URL set right now. There is too much churn in these factors to be able to give concrete numbers yet. Things should be much clearer by the beginning of 2009.
Neethling: Thank you again for granting us this interview, Brian. We look forward to what more MAMA will bring us in the future.
Wilson: Many thanks for the interest (and patience if you've stayed around for all of this!)