{{announcement.body}}
{{announcement.title}}

Amazon Review Data: Spotting Trends and Fake Reviews

DZone 's Guide to

Amazon Review Data: Spotting Trends and Fake Reviews

We take a look at how to use SQL to query data from Amazon's open data sets and find patterns therein.

· Big Data Zone ·
Free Resource

Introduction

Amazon is the leading provider of cloud computing and has a number of interesting open data sets which you can experiment with. I wanted to try them out with a new product my company has developed, so I've been looking at those data sets. One of the most recognizable is Amazon's own review data which is documented at https://registry.opendata.aws/amazon-reviews/

Approach

I wanted to see what sort of questions the review data could answer. At the end of this article, in "How I did it," I'll show the steps to access the data, but for now you will just need to know some SQL to follow the queries used.

Here are the questions I wanted to answer, and the results — note that unless otherwise specified each result came back in less than half a second.

How Do Ratings Vary With verified_purchase?

Those with a verified_purchase are on average rated 0.13 higher:

select verified_purchase,
cast(avg(cast(star_rating as decimal(5,4))) as decimal (3,2)) avg_rating,
count(*)
from opt_reviews
group by 1
order by 1; 
VERIFIED_PURCHASE AVG_RATING COUNT(*)
N 4.11 36046407
Y 4.24 124750163

How Do Ratings Vary With Vine Membership?

Vine reviews are on average 0.13 lower than non-Vine (you can read more about what Amazon Vine is at https://www.amazon.co.uk/gp/vine/help, if you aren't familiar with it):

select verified_purchase,
cast(avg(cast(star_rating as decimal(5,4))) as decimal (3,2)) avg_rating,
count(*)
from opt_reviews
group by 1
order by 1;
VINE AVG_RATING COUNT(*)
N 4.21 160116307
Y 4.08 680263

How Do Ratings Vary With Marketplace (i.e. Region)?

UK reviewers are the most positive (light relief compared to Brexit, perhaps), with an average rating of 0.18 higher than the most negative reviewers, who are from the US marketplace:

select marketplace,
cast(avg(cast(star_rating as decimal(5,4))) as decimal (3,2)) avg_rating,
count(*)
from opt_reviews
group by 1
order by 1;
MARKETPLACE AVG_RATING COUNT(*)
DE 4.25 679121
FR 4.26 254080
JP 4.21 262431
UK 4.38 1707494
US 4.20 157893444

How Have Ratings Altered Over Time?

Discounting the time-traveller who submitted a 5.0 rating review in 1973 according to the data, reviews have generally reduced ratings over time from 1995 to 2004, then oscillated somewhat before improving slightly between 2011 and 2015.

Perhaps this is because people were delighted with any sort of e-commerce offering in the early days of Amazon, then became more demanding; then recent years have seen a material improvement in their experience (including products, ordering, and delivery process).

A cynic might wonder if Amazon has tended to make more positive reviews available — I could not find any literature indicating what percentage of all reviews this data set covers:

select "YEAR",
cast(avg(cast(star_rating as decimal(5,4))) as decimal (3,2)) avg_rating,
count(*)
from opt_reviews
group by 1
order by 1;
YEAR AVG_RATING COUNT(*)
1973 5.00 1
1995 4.62 216
1996 4.61 5045
1997 4.43 39868
1998 4.36 177984
1999 4.29 484469
2000 4.26 1082639
2001 4.20 966851
2002 4.17 1023868
2003 4.13 1155127
2004 4.07 1201054
2005 4.07 1369376
2006 4.11 1428913
2007 4.17 2194714
2008 4.13 2488330
2009 4.12 3167023
2010 4.08 4119699
2011 4.06 6471918
2012 4.13 12403491
2013 4.21 30092459
2014 4.24 46993975
2015 4.25 43929550

Has Verification of Purchases Changed Over Time for Reviews?

select "YEAR", 
sum(case when verified_purchase = 'Y' then 1 else 0 end) verified_reviews, 
(sum(case when verified_purchase = 'Y' then 1 else 0 end) * 100.00) / count(*) percent_verified 
from opt_reviews 
group by 1 
order by 1;
YEAR VERIFIED_REVIEWS PERCENT_VERIFIED
1973 0 0.00
1995 2 0.93
1996 73 1.45
1997 1053 2.64
1998 6300 3.54
1999 21879 4.52
2000 26368 4.28
2001 62433 6.46
2002 84546 8.26
2003 96450 8.35
2004 98388 8.19
2005 167394 12.22
2006 263610 18.45
2007 672073 30.62
2008 843395 33.89
2009 1428024 45.09
2010 2642344 64.14
2011 4453918 68.82
2012 9585869 77.28
2013 25922599 86.14
2014 38851223 82.27
2015 39502222 89.92

We can see how the number of verified reviews increases each year, and its percentage of all reviews is generally increasing too.

How Do Ratings Vary by Product Category?

It appears that Digital Software is the most disappointing category for reviewers, averaging a massive 1.2 points worse than the impeccable Gift Card category:

select product_category,
cast(avg(cast(star_rating as decimal(5,4))) as decimal (3,2)) avg_rating,
count(*)
from opt_reviews
group by 1
order by 2;
PRODUCT_CATEGORY AVG_RATING COUNT(*)
Digital_Software 3.54 102084
Software 3.57 342135
Major_Appliances 3.72 96901
Mobile_Electronics 3.76 105159
... ... ...
Books 4.34 20726160
Music 4.44 6177781
Digital_Music_Purchas 4.64 1852184
Gift_Card 4.73 149086

How Do Customers Report the Helpfulness of Different Star Rating Reviews?

It seems that people consider negative reviews to be more helpful than positive reviews, possibly because they perceive them as valuable in preventing the purchase of an unsuitable product, or because they highlight issues which might be relevant to whether or not the customer then goes on to purchase the product:

select star_rating,
cast(avg(cast (helpful_votes as decimal (18,4))) as decimal (16,2)) AVG_HELP,
count(*)
from opt_reviews
group by 1
order by 1;
STAR_RATING AVG_HELP COUNT(*)
1 3.83 12667601
2 2.57 7700647
3 2.07 12900929
4 1.68 27996469
5 1.69 99530924

Looking at Reviewer Behavior

We can look at things from a more customer-centric point of view, which is more expensive in terms of query time (because we have to do a lot more processing to aggregate over a lot of different reviewers).

As a result, the query below took just under 14 seconds, compared to all the previous ones which ran in less than 0.5 seconds — it is looking across all reviewers to see how many reviews they have done, how many product categories those reviews span, the average rating they give, and how many ratings at each level they have given.

We also check how helpful their average review is, how many Vine reviews they have, and how many of their purchases are verified:

select customer_id, 
count(*), 
count(distinct product_category) cats, 
cast(avg(cast(star_rating as decimal(5,4))) as decimal (3,2)) avg_rating, 
cast(avg(cast (helpful_votes as decimal (18,4))) as decimal (16,2)) avg_help, 
sum(case when star_rating = 1 then 1 else 0 end) one, 
sum(case when star_rating = 2 then 1 else 0 end) two, 
sum(case when star_rating = 3 then 1 else 0 end) three, 
sum(case when star_rating = 4 then 1 else 0 end) four, 
sum(case when star_rating = 5 then 1 else 0 end) five, 
sum(case when vine = 'Y' then 1 else 0 end) vine_reviews, 
sum(case when verified_purchase = 'Y' then 1 else 0 end) verified_purchases 
from opt_reviews 
group by 1 
order by 2 desc;
CUSTOMER_ID COUNT(*) CATS AVG_RATING AVG_HELP ONE TWO THREE FOUR FIVE VINE_REVIEWS VERIFIED_PURCHASES
50122160 59825 14 5.00 4.36 0 0 6 32 59787 0 1
50732546 30922 2 4.63 3.97 0 3 84 11397 19438 72 422
52496677 15067 38 4.66 6.71 38 127 697 3192 11013 91 1977
52004451 10954 8 4.93 1.61 23 79 168 138 10546 0 873
50736950 10857 24 4.00 2.04 847 1132 967 2175 5736 0 1
18116317 9084 28 4.00 3.16 98 406 1807 3894 2879 48 1716
50881246 8901 23 4.05 11.94 577 397 799 3374 3754 187 763
44731853 7604 2 4.76 2.18 21 79 396 719 6389 0 471
38002140 7509 21 3.92 1.31 269 649 1517 2037 3037 0 3
43430756 7344 24 3.35 3.75 560 1284 1986 2033 1481 0 114
... ... ... ... ... ... ... ... ... ... ... ...
7080939 6506 36 5.00 0.03 0 0 0 1 6505 0 15
... ... ... ... ... ... ... ... ... ... ... ...

The results show:

  • you would like your product to be reviewed by customer_id 50122160 rather than customer 43430756, as on average that will give you a star rating 1.65 higher.
  • the reviewer with the highest number of reviews is not a Vine reviewer and has only one verified purchase. You might not want to give their reviews too much credence though, as they seem to be happy with almost every product (maybe they are just an optimist, or perhaps they have some incentive to give good reviews). Note that their average number of helpful votes per review is in line with other reviewers though, which is somewhat surprising.
  • the other very optimistic reviewer, with id 7080939, appears to be treated with more cynicism by other customers; their average helpful votes per review is the lowest of the top 20 reviews.
  • the people with the most helpful votes per review have a better range of star ratings and a higher number of verified purchases than the two very positive reviewers mentioned above.

Switching to look for quality rather than quantity, we can examine customers whose reviews average the highest number of helpful votes:

select customer_id,
count(*),
count(distinct product_category),
cast(avg(cast(star_rating as decimal(5,4))) as decimal (3,2)) avg_rating,
cast(avg(cast (helpful_votes as decimal (18,4))) as decimal (16,2)) avg_help,
sum(case when star_rating = 1 then 1 else 0 end) one,
sum(case when star_rating = 2 then 1 else 0 end) two,
sum(case when star_rating = 3 then 1 else 0 end) three,
sum(case when star_rating = 4 then 1 else 0 end) four,
sum(case when star_rating = 5 then 1 else 0 end) five,
sum(case when vine = 'Y' then 1 else 0 end) vine_reviews,
sum(case when verified_purchase = 'Y' then 1 else 0 end) verified_purchases
from opt_reviews
group by 1
having count(*) > 100
order by 5 desc
limit 10;
CUSTOMER_ID COUNT(*) CATS AVG_RATING AVG_HELP ONE TWO THREE FOUR FIVE VINE_REVIEWS VERIFIED_PURCHASES
53025525 159 18 4.35 271.52 1 1 25 46 86 0 75
50629044 107 24 3.70 253.72 29 5 4 0 69 0 81
52621867 110 23 3.45 165.15 14 17 25 14 40 1 67
43346653 103 7 4.13 158.80 0 9 14 35 45 0 0
42793407 138 25 4.12 140.07 1 11 17 50 59 61 55
16378095 430 28 4.18 131.00 7 18 56 158 191 0 182
49431050 101 5 4.93 111.08 1 1 0 0 99 0 73
30669680 727 24 3.58 103.66 31 102 214 173 207 0 245
41866357 215 28 4.52 100.90 13 3 10 23 166 0 181
42299455 123 11 3.81 92.29 1 16 17 60 29 0 78

The main thing I could see from this is that these users tended to have a much lower number of reviews, and in general had a higher percentage of verified purchases.

Which Review Is the Most Helpful in Amazon?

According to this data, the most helpful review in this data set is a positive one for an Amazon product — it is for the original Kindle, with the review submitted in 2007.

However, note the duplicate rows in the data:

  • with review R2JF7E91JJVHAT for "Fifty Shades of Grey" appearing twice.

  • ...and the same for review R1LT38SAC0FA4G for the same novel in digital form.

I've verified these anomalies existed in the underlying S3 Parquet data, and notified AWS of the issue, but haven't had any feedback yet. The duplication does not exist in the CSV version of the same data set, so something has clearly gone wrong when generating the Parquet version.

There are some other anomalies present in the review text as well (parts of the review text are duplicated, apparently most commonly for long review text) — again, this is evident upon inspection of the underlying Parquet files, and AWS has been notified.

select customer_id, review_id, product_id, product_title, star_rating,
helpful_votes, total_votes, vine, verified_purchase from opt_reviews
order by helpful_votes
desc limit 15;
CUSTOMER_ID REVIEW_ID PRODUCT_ID PRODUCT_TITLE STAR_RATING HELPFUL_VOTES TOTAL_VOTES VINE VERIFIED_PURCHASE
52994074 R3R24QH3CDS83N B000FI73MA Kindle: Amazon's Original Wireless Reading Device (1st generation) 5 47524 48362 N Y
50268169 R19XO9PS38WRWO B004F9QBE6 BIC Cristal For Her Ball Pen, 1.0mm, Black, 16ct (MSLP16-Blk) 5 41393 41866 N N
51121444 R2XKMDXZHQ26YX B000OE2OLU The Mountain Kids 100% Cotton Three Wolf Moon T-Shirt 5 41278 41889 N N
53025525 R2YVZNKUMWGYJ4 B003FSUDM4 Kindle Keyboard 3G, Free 3G + Wi-Fi, 6" E Ink Display 5 31924 32373 N Y
12267963 R1PL1U31XQG8KG B0083PWAPW Kindle Fire HD 7", Dolby Audio, Dual-Band Wi-Fi 4 31417 32166 N Y
51835976 RBUNL9QWDYTD7 B0051VVOB2 Kindle Fire (Previous Generation - 1st) 4 28611 29433 N Y
16378095 R2JF7E91JJVHAT 0345803485 Fifty Shades of Grey: Book One of the Fifty Shades Trilogy (Fifty Shades of Grey Series) 2 27550 28727 N Y
16378095 R2JF7E91JJVHAT 0345803485 Fifty Shades of Grey: Book One of the Fifty Shades Trilogy (Fifty Shades of Grey Series) 2 27550 28727 N Y
9286343 R3PG4OX6C5KVN4 B000IZGIA8 Wheelmate Laptop Steering Wheel Desk 5 23755 24170 N N
50629044 RV0R3AODMRNJZ B00154JDAI Kindle Wireless Reading Device (6" Display, U.S. Wireless) 1 24714 26143 N Y
47364506 RXXPVOUH9NLL3 B00032G1S0 Tuscan Dairy Whole Vitamin D Milk, Gallon, 128 oz 5 23755 24170 N N
52839532 R1JMDSX89JG82B B000X9FLKM The Lord of the Rings: The Motion Picture Trilogy (The Fellowship of the Ring / The Two Towers / The Return of the King Theatrical Editions) [Blu-ray] 1 20928 22982 N N
50423635 R1LT38SAC0FA4G B007J4T2G8 Fifty Shades of Grey (Fifty Shades, Book 1) 1 18429 19368 N N
50423635 R1LT38SAC0FA4G B007J4T2G8 Fifty Shades of Grey (Fifty Shades, Book 1) 1 18429 19368 N N
29167258 R2PKU2MR4GXKG7 B004F9QBE6 BIC Cristal For Her Ball Pen, 1.0mm, Black, 16ct (MSLP16-Blk) 1 18217 18648 N N

Unusual Reviewer Behavior

We've already seen that some reviewers generate a very large number of reviews, and sometimes those are very largely five-star reviews.

Looking for other unusual behaviours we can see, for example, instances of reviewers repeatedly reviewing the same product. I was actually surprised that this was even allowed (have not verified if that is still the case):

select customer_id, product_id, count(*) 
from opt_reviews 
group by 1, 2 
having count(*) > 1 
order by 3 desc;
CUSTOMER_ID PRODUCT_ID COUNT(*)
38118182 B00003CXTG 260
33132919 B0000TB03W 220
29088361 B0000DD7LB 122
23974294 B0000TB03W 114
... ... ...

As we can see above, one product was reviewed 220 times by one reviewer, and 114 times by another reviewer. In the case of the second row above, the 220 reviews occurred between 2003-11-10 and 2003-12-04. Presumably, the reviewer kept thinking of other compliments for this revolutionary product, and could not edit any of their existing reviews for it. You certainly have to admire their persistence.

We can also check for reviewers who are a bit lazy and repeatedly use the same review headline:

select customer_id, review_headline, count(*) 
from opt_reviews 
where char_length(review_headline) > 10 
group by 1, 2
having count(*) > 1
order by 3 desc;
CUSTOMER_ID REVIEW_HEADLINE COUNT(*)
14410556 Courtesy of Teens Read Too 4415
43530996 Not Free SF Reader 2175
43530996 Super Reader 1056
37529167 Good Reading 1049
32865329 Kritters Ramblings 902
43530996 Graphic SF Reader 813
... ... ...

Some reviewers really like to reiterate their point, so they submit multiple reviews for the same product, all with the same headline:

select customer_id, product_id, review_headline, count(*) 
from opt_reviews 
where char_length(review_headline) > 10 
group by 1, 2,3 
having count(*) > 1 
order by 4 desc;
CUSTOMER_ID PRODUCT_ID REVIEW_HEADLINE COUNT(*)
32745072 B00000607F Such an historic band. 29
50281120 1575665379 AMust Buy! 23
38864282 B00007GUIO Excellent Debut Album 23
38503847 0440204887 The Hanbook For Life 21
52906860 8432039799 Un libro incomparable. 20
... ... ... ...

How I Did It

The product I was using was Kognitio's AWS Marketplace offering, which allows you to simply but quickly analyze data stored in Amazon's S3 storage and elsewhere. You can find the link to the product here. From that page, I brought up a system with 4 nodes, each with 64GB RAM and 16 cores, and a spot price of around $4.23 per hour. You also need a tool capable of connecting to Kognitio and running SQL — pretty much any third party tool will be able to do this if you install the Kognitio ODBC or JDBC driver, but I used Kognitio's own SQL submission tool called Console which you can download here.

I then ran the following SQL to create an external table allowing me to query the review data stored in S3. Note that the data is stored in Parquet format, which is an industry standard, and contains information about column names and types to allow them to be automatically generated in the external table:

create external table reviews from s3_parquet
target 'uri_location s3://amazon-reviews-pds, uri_path /parquet/, fmt_filename_partitions product_category, repeatable_results yes, ignore_invalid_records 1';

In the above:

  • 'uri_location' and 'uri_path' tell the software where the data can be found.
  • the 'fmt_filename_partitions' setting provides information on what the Parquet files in S3 are partitioned on (product_category in this case).
  • 'repeatable_results yes' indicates the data isn't changing so you will get the same results if you re-access the data in S3 multiple times in a query.
  • 'ignore_invalid_records 1' specifies that any records that cannot be loaded should be ignored (the failure to load them is logged for further investigation).

I can then query that external table immediately, for example, the following query returns in a few seconds as it can leverage the fact that the data is already partitioned on product_category to eliminate irrelevant partitions:

select count(*)
from reviews
where product_category = 'Camera';

To get better query performance, we can generate an in-memory image of the data, then run against that rather than continually accessing S3. So I defined a view showing the fields I was interested in, and also mapping them to the most appropriate types rather than relying on the types inferred from Parquet — note that I had to quote the column name "YEAR" below, as that is a reserved word in Kognitio as with many databases:

create view opt_reviews as
select cast(marketplace as char(2)) marketplace,
cast(customer_id as bigint) customer_id,
cast(review_id as char (16)) review_id,
cast(product_id as char (12)) product_id,
cast(product_parent as bigint) product_parent, 
product_title, star_rating, helpful_votes, total_votes, 
cast(vine as char(1)) vine, 
cast(verified_purchase as char(1)) verified_purchase, 
review_headline, review_date, "YEAR", product_category 
from reviews;

Then I created a memory image of that view, which took a few minutes as it has to scan all the data in S3 and transfer it to memory:

create view image opt_reviews;

The result was an image with just over 160 million rows in it.

At this point I could ask (hopefully) more interesting questions about the data, as shown earlier in the article — let me know if there are any other questions you'd like answered, or try this out yourself.

Topics:
amazon aws ,big data ,data analysis ,open data ,sql tutorial

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}