
Data Chef ETL Battles: What Can Be Prepared With Today's Data?


Iron Chef involved chefs using experience, creativity, and imagination to transform questionable ingredients into the ultimate meal. That's just like data transformation.


Do you remember the Iron Chef battles?

They were a televised series of cook-offs in which famous chefs rolled up their sleeves to compete in making the perfect dish. Based on a set theme, this involved using all their experience, creativity, and imagination to transform sometimes questionable ingredients into the ultimate meal.

Hey, isn’t that just like data transformation? Or data blending, or data manipulation, or ETL, or whatever new name is trending now? In this new blog series requested by popular vote, we will ask two data chefs to use all their knowledge and creativity to compete in extracting a given dataset's most useful “flavors” via reductions, aggregations, measures, KPIs, and coordinate transformations. Delicious!

A Social Forum: Sentiment vs. Influence

Ingredient Theme: A Social Forum

Today, we have decided to go vintage and show the analysis implemented in the first KNIME whitepaper, by Tobias Koetter, Kilian Thiel, and Phil Winters, in which text processing met network analytics.

We propose data from the year 1999 of the Slashdot news forum. Slashdot (sometimes abbreviated as “/.”) is a social news website, founded in 1997, that focuses on science and technology. Users can post news and stories about diverse topics and receive online comments from other users (cf. Wikipedia).

Some years ago, we started a debate on whether the loudest customers were really as important as everybody (themselves included) thought. We started looking for public data on customer interactions about a given product and stumbled upon the Slashdot dataset. Users in the Slashdot dataset are not strictly customers; they interact via a social forum about a given topic. If the topic were a product, they would be customers. So, assuming that talking about a product is a particular instance of talking about a generic topic, we decided to adopt the Slashdot dataset for the analysis. We propose this same dataset here again for today’s challenge.

The Slashdot dataset collects posts and comments for a number of sub-forums, such as Science Fiction, Linux, Astronomy, etc. Most of the users posted or commented using their username, while some participated anonymously. The biggest sub-forum revolves around politics and contains about 140,000 comments to 496 articles from a total of about 24,000 users. We will focus on this sub-forum for this challenge.

Each new post is assigned a unique thread ID. Title, subdomain, user, date, main topic, and body all refer to this thread ID. A new data row is created for each comment, appending the thread ID, post title, post user, post date, and post body from the seed post to the comment title, user, date, and body.

The original post and its related comments are connected within the dataset by the unique Thread ID. The seed post is the first item in the thread. The title of the thread is the title of the seed post. In Figure 1, you can see the seed post data on the left and the data for the corresponding comments on the right. Notice that multiple comments might refer to the same seed post.

Figure 1: Slashdot dataset. Data from the seed post on the left; data from related comments on the right.

What could we extract from these data? We already know that users talk about politics. What else could be of interest? Let’s see what today’s data chefs will prepare for us!

  • Topic: Social media analysis.

  • Challenge: Identify complainers and happy users as well as influencers and followers.

  • Methods: Sentiment analysis and influence scores.

  • Data manipulation nodes: Text processing and network nodes.

The Competition

We all tend to listen to the loudest complainer. However, this might not always be a good idea. Our data chefs today have focused on detecting the loudest complainers on one side and the most influential users on the other side. We will draw our own conclusions from their results.

In the analysis, only non-anonymous users are considered. Thus, the first step is to remove all data rows where “user” is either “Anonymous Coward” or “An anonymous reader,” where the username is too long, or where there is no post ID. This happens in the “Preprocessing” metanode.
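Outside of KNIME, the same preprocessing could be sketched in a few lines of pandas. The file name, the column names (user, post_id), and the length threshold are assumptions for illustration, not part of the original workflow:

```python
import pandas as pd

# Load the raw posts table; file and column names are hypothetical.
posts = pd.read_csv("slashdot_politics.csv")

ANONYMOUS = {"Anonymous Coward", "An anonymous reader"}
MAX_USERNAME_LEN = 30  # assumed threshold for a "too long" username

posts = posts[
    ~posts["user"].isin(ANONYMOUS)                   # drop anonymous posters
    & (posts["user"].str.len() <= MAX_USERNAME_LEN)  # drop overly long usernames
    & posts["post_id"].notna()                       # drop rows without a post ID
]
```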

Figure 2: Final workflow. The upper part, referred to as "Network Creation and Analysis," calculates influence scores. The lower part, labeled "Text processing and sentiment analysis," measures the sentiment score of each forum user. 

Data Chef Haruto: Influence Scores

Data chef Haruto approaches the analysis by investigating the connections across users. Each post has a post ID (i.e. the thread ID), a title, an author, and a body text. Comment posts also have a field, “post ref,” identifying the post they are responding to. The goal here is to build a network object representing user interactions.

The first step is to prepare the edge table as the basis for the network. An edge table has a source column (the authors of the posts) and a target column (the authors of the referenced posts). Notice that posts can be seed posts or comment posts. Seed posts will have a missing reference post.

The edge table is built by the “Create edge table” metanode in the upper branch of the final workflow shown in Figure 2. A left outer join pairs each post author (source) with the referenced post’s author (target), if any. A GroupBy node then counts the number of occurrences of each connection from source to target. The string “<source user> -> <target user>” is set as the edge ID of the connection. All auto-connections, i.e. users answering themselves, are filtered out.
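For readers who prefer code to nodes, here is a rough pandas equivalent of the metanode, reusing the hypothetical posts table from the sketch above (the post_ref column name is also an assumption):

```python
# Authors of the referenced posts become the "target" column.
authors = posts[["post_id", "user"]].rename(columns={"user": "target"})

# Left outer join: every post author (source) paired with the
# referenced post's author (target), if any.
joined = posts.rename(columns={"user": "source"}).merge(
    authors, left_on="post_ref", right_on="post_id",
    how="left", suffixes=("", "_ref"),
)

edges = (
    joined.dropna(subset=["target"])   # seed posts have no referenced author
          .query("source != target")   # drop auto-connections (self-replies)
          .groupby(["source", "target"])
          .size()
          .reset_index(name="weight")  # occurrences per source -> target pair
)
edges["edge_id"] = edges["source"] + " -> " + edges["target"]
```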

The edge table is now ready to be transformed into a network object. This is the job of the Object Inserter node. This node transforms the source and target users into nodes and connects them via an edge with the number of connection occurrences as value and the edge ID string as label.

After that, the metanode named “Extract largest component” first splits the network into all of its connected components, using a Network to Row node. Each sub-network is represented as a String and occupies a data row. Then it loops through all of the sub-networks and counts the number of edges and nodes, using a Network Analyzer node. Sub-networks are sorted based on their total number of edges and nodes and the largest sub-network is kept for further analysis.
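A networkx sketch of these two steps might look as follows; note that taking the component with the most nodes is a simplification of the metanode's sort by total number of edges and nodes:

```python
import networkx as nx

# Turn the edge table into a directed, weighted graph
# (the counterpart of the Object Inserter node).
g = nx.from_pandas_edgelist(
    edges, source="source", target="target",
    edge_attr="weight", create_using=nx.DiGraph,
)

# Keep only the largest weakly connected component.
largest = max(nx.weakly_connected_components(g), key=len)
g = g.subgraph(largest).copy()
```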

Finally, a second Network Analyzer node calculates the hub and authority scores.

The Network Analyzer node provides a great summary of social media activity. It calculates a number of different statistics on a network graph at both the node and the edge level. Such statistical measures try to establish the importance of each node and edge by the number of its connections, their weight, their neighbors, the distance to their neighbors, and other similar parameters. Two of those important measures are the hub and authority scores.

The concept of hubs and authorities, as described here, is rooted in web search. Results for broad-topic searches tend to include two primary kinds of web pages: authoritative sources of information on the topic (authorities) and pages containing hand-compiled lists of links to authoritative pages on that topic (hubs). Hubs are not themselves authoritative sources of topic-specific information; rather, they direct you to more authoritative pages. The hub/authority score calculation relies on hub pages to discover the authority pages.

To calculate the hub and authority score, the Network Analyzer node implements the HITS algorithm in the JUNG (Java Universal Network/Graph) Framework.
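networkx ships its own HITS implementation, so the same scores can be approximated outside KNIME like this (the layout of the scores table is an assumption for the later join):

```python
# HITS returns one hub score and one authority score per node (user).
hubs, authorities = nx.hits(g, max_iter=1000, normalized=True)

scores = pd.DataFrame({
    "user": list(hubs),
    "hub_score": list(hubs.values()),
    "authority_score": [authorities[u] for u in hubs],
})
```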

Data Chef Momoka: Sentiment Measure

Momoka has decided to go for sentiment, i.e. to characterize each forum user in terms of positivity and negativity rather than authority.

The lower branch of the workflow in Figure 2 creates the list of Documents for each forum user from the posts or comments he/she has written. At the same time, it collects two lists of English words, one negative and one positive, according to the MPQA subjectivity lexicon. All words in all documents are tagged as positive or negative by the two Dictionary Tagger nodes, depending on whether they match any of the words in these two lists. Untagged words are considered neutral.

Each positive word is assigned a +1 value, each negative word a -1 value, each neutral word a 0 value. By summing up all word values across all documents written by each user, we calculate the sentiment score for each user. I wonder what the most negative author might say. Just out of curiosity, we shall draw a word cloud for the most positive and the most negative author (Fig. 3).
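Before we get to the word clouds, here is a minimal Python sketch of this scoring step. positive_words and negative_words stand in for the two MPQA lexicon lists, and the whitespace tokenization is a crude simplification of the Text Processing nodes:

```python
def user_sentiment(documents, positive_words, negative_words):
    """Sum +1 per positive word and -1 per negative word over all of a
    user's documents; untagged (neutral) words contribute 0."""
    score = 0
    for doc in documents:
        for word in doc.lower().split():
            if word in positive_words:
                score += 1
            elif word in negative_words:
                score -= 1
    return score
```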

Word clouds are calculated and displayed through Tag Cloud (JavaScript) nodes in the last wrapped metanode on the right, named “Tag Cloud.” Here, two word clouds are displayed side by side: the word cloud for the most positive author on the left and the word cloud for the most negative author on the right. Positive words are colored green, negative words red, and neutral words gray. Well, it is easy to see why the most negative author has been labeled… well… negatively!

Note: Sentiment score by user is calculated here using the absolute word frequency without taking into account the number of words used in each post.

In general, forum users with a sentiment score above (average + standard deviation) are considered positive authors; forum users with sentiment score below (average – standard deviation) are considered negative; all other users in between are considered neutral. Positive users are color-coded green, negative users red, and neutral users gray.
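Assuming a per-user table sentiments with columns user and sentiment (built, for example, with the scoring function above), the labeling might look like this:

```python
mean = sentiments["sentiment"].mean()
std = sentiments["sentiment"].std()

def label(score):
    if score > mean + std:
        return "positive"   # color-coded green
    if score < mean - std:
        return "negative"   # color-coded red
    return "neutral"        # color-coded gray

sentiments["label"] = sentiments["sentiment"].map(label)
```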

Figure 3: Word clouds for the most positive user (on the left) and the most negative user (on the right). In the midst of all the gray (neutral) words, you can see a predominance of green (positive) words on the left and of red (negative) words on the right. Notice the repeated word “Stupid,” which earned author pNutz the title of most negative post author.

The Jury: Putting It All Together

To put it all together, a Joiner node joins the authority and hub scores with the sentiment score by author.

A Scatter Plot (JavaScript) node, inside the wrapped metanode “Scores and Sentiment on Scatter Plot,” plots the forum users by hub score on the y-axis, authority score on the x-axis, and sentiment score as color (Fig. 4).
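As a static stand-in for this interactive view, a matplotlib version of the same plot, reusing the hypothetical scores and sentiments tables from the sketches above, could read:

```python
import matplotlib.pyplot as plt

# Join network scores and sentiment labels by user, then plot.
merged = scores.merge(sentiments, on="user", how="inner")

colors = {"positive": "green", "negative": "red", "neutral": "gray"}
plt.scatter(merged["authority_score"], merged["hub_score"],
            c=merged["label"].map(colors), s=10)
plt.xlabel("authority score")
plt.ylabel("hub score")
plt.title("Forum users: hub vs. authority, colored by sentiment")
plt.show()
```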

Notice that the loudest complainers in red actually have very low authority scores and therefore cannot be considered influencers. Thus, this plot seems to go against the common belief that you should listen to and pamper the most aggressive complainers. Notice also that the most authoritative users are actually neutral. This neutrality could well be one of the reasons why other users trust them.

The scatter plot view produced by the Scatter Plot (JavaScript) node is interactive. By clicking the “Select mode” button at the top of the view, it is possible to select single points on the scatter plot with a single click. In Figure 4, the “Select mode” button is circled in red, as is the selected point: in this case a green point, i.e. a positive author named Guppy06.

After clicking the Close button and opting to keep the selection as the new default, the selected author is moved to the next visualization wrapped metanode, where all his/her posts are extracted and the word cloud is displayed (Fig. 5).

Selecting single points in the scatter plot is made possible via the Enable Selection option in the View Controls tab of the configuration window of the Scatter Plot (JavaScript) node. Note that other types of selection have been disabled to allow the selection of only one author for the subsequent word cloud.

Note: This same sequence of plots can be visualized on a KNIME WebPortal. There, the Back button allows you to move back to the scatter plot, select a new author, and display his/her word cloud.

The flow variable connection (the red line) between the wrapped metanode named “Tag Cloud” and the wrapped metanode named “Scores and Sentiment on Scatter Plot” is required for the correct page flow on the KNIME WebPortal.

Posts in a forum can be reduced to numbers. In this blog post, they have been reduced to a sentiment score via text processing by data chef Momoka and to an authority/hub score via network graph analytics by data chef Haruto. Both representations produce valuable information. However, the combination of the two has proven to be of invaluable help when trying to isolate the most positive and authoritative users for reward and the most negative and authoritative critics for damage control.

Figure 4: Authors as points on a scatter plot, with authority score on the x-axis and hub score on the y-axis. Authors with a positive sentiment score, i.e. sentiment score > average + 1 * std dev, are color-coded green. Authors with a negative sentiment score, i.e. sentiment score < average - 1 * std dev, are colored red. Authors with a sentiment score in between are labeled as neutral and depicted in gray. In the upper right corner are the buttons for zooming and selection. The circled button enables point/author selection. The bigger point in the plot is the one that was selected by single-click.

Figure 5: Word cloud of the selected author in the previous scatter plot (Fig. 4).

We have reached the end of this competition. Congratulations to both our data chefs for wrangling such interesting features from the raw data ingredients! Oishii!

Topics:
big data, etl, data preparation, data transformation, data blending, data manipulation, sentiment
