Amazon Review Data: Spotting Trends and Fake Reviews
We take a look at how to use SQL to query data from Amazon's open data sets and find patterns therein.
Join the DZone community and get the full member experience.
Join For FreeIntroduction
Amazon is the leading provider of cloud computing and has a number of interesting open data sets which you can experiment with. I wanted to try them out with a new product my company has developed, so I've been looking at those data sets. One of the most recognizable is Amazon's own review data which is documented at https://registry.opendata.aws/amazon-reviews/
Approach
I wanted to see what sort of questions the review data could answer. At the end of this article, in "How I did it," I'll show the steps to access the data, but for now you will just need to know some SQL to follow the queries used.
Here are the questions I wanted to answer, and the results — note that unless otherwise specified each result came back in less than half a second.
How Do Ratings Vary With verified_purchase?
Those with a verified_purchase
are on average rated 0.13 higher:
select verified_purchase,
cast(avg(cast(star_rating as decimal(5,4))) as decimal (3,2)) avg_rating,
count(*)
from opt_reviews
group by 1
order by 1;
VERIFIED_PURCHASE | AVG_RATING | COUNT(*) |
N | 4.11 | 36046407 |
Y | 4.24 | 124750163 |
How Do Ratings Vary With Vine Membership?
Vine reviews are on average 0.13 lower than non-Vine (you can read more about what Amazon Vine is at https://www.amazon.co.uk/gp/vine/help, if you aren't familiar with it):
select verified_purchase,
cast(avg(cast(star_rating as decimal(5,4))) as decimal (3,2)) avg_rating,
count(*)
from opt_reviews
group by 1
order by 1;
VINE | AVG_RATING | COUNT(*) |
N | 4.21 | 160116307 |
Y | 4.08 | 680263 |
How Do Ratings Vary With Marketplace (i.e. Region)?
UK reviewers are the most positive (light relief compared to Brexit, perhaps), with an average rating of 0.18 higher than the most negative reviewers, who are from the US marketplace:
select marketplace,
cast(avg(cast(star_rating as decimal(5,4))) as decimal (3,2)) avg_rating,
count(*)
from opt_reviews
group by 1
order by 1;
MARKETPLACE | AVG_RATING | COUNT(*) |
DE | 4.25 | 679121 |
FR | 4.26 | 254080 |
JP | 4.21 | 262431 |
UK | 4.38 | 1707494 |
US | 4.20 | 157893444 |
How Have Ratings Altered Over Time?
Discounting the time-traveller who submitted a 5.0 rating review in 1973 according to the data, reviews have generally reduced ratings over time from 1995 to 2004, then oscillated somewhat before improving slightly between 2011 and 2015.
Perhaps this is because people were delighted with any sort of e-commerce offering in the early days of Amazon, then became more demanding; then recent years have seen a material improvement in their experience (including products, ordering, and delivery process).
A cynic might wonder if Amazon has tended to make more positive reviews available — I could not find any literature indicating what percentage of all reviews this data set covers:
select "YEAR",
cast(avg(cast(star_rating as decimal(5,4))) as decimal (3,2)) avg_rating,
count(*)
from opt_reviews
group by 1
order by 1;
YEAR | AVG_RATING | COUNT(*) |
1973 | 5.00 | 1 |
1995 | 4.62 | 216 |
1996 | 4.61 | 5045 |
1997 | 4.43 | 39868 |
1998 | 4.36 | 177984 |
1999 | 4.29 | 484469 |
2000 | 4.26 | 1082639 |
2001 | 4.20 | 966851 |
2002 | 4.17 | 1023868 |
2003 | 4.13 | 1155127 |
2004 | 4.07 | 1201054 |
2005 | 4.07 | 1369376 |
2006 | 4.11 | 1428913 |
2007 | 4.17 | 2194714 |
2008 | 4.13 | 2488330 |
2009 | 4.12 | 3167023 |
2010 | 4.08 | 4119699 |
2011 | 4.06 | 6471918 |
2012 | 4.13 | 12403491 |
2013 | 4.21 | 30092459 |
2014 | 4.24 | 46993975 |
2015 | 4.25 | 43929550 |
Has Verification of Purchases Changed Over Time for Reviews?
select "YEAR",
sum(case when verified_purchase = 'Y' then 1 else 0 end) verified_reviews,
(sum(case when verified_purchase = 'Y' then 1 else 0 end) * 100.00) / count(*) percent_verified
from opt_reviews
group by 1
order by 1;
YEAR | VERIFIED_REVIEWS | PERCENT_VERIFIED |
1973 | 0 | 0.00 |
1995 | 2 | 0.93 |
1996 | 73 | 1.45 |
1997 | 1053 | 2.64 |
1998 | 6300 | 3.54 |
1999 | 21879 | 4.52 |
2000 | 26368 | 4.28 |
2001 | 62433 | 6.46 |
2002 | 84546 | 8.26 |
2003 | 96450 | 8.35 |
2004 | 98388 | 8.19 |
2005 | 167394 | 12.22 |
2006 | 263610 | 18.45 |
2007 | 672073 | 30.62 |
2008 | 843395 | 33.89 |
2009 | 1428024 | 45.09 |
2010 | 2642344 | 64.14 |
2011 | 4453918 | 68.82 |
2012 | 9585869 | 77.28 |
2013 | 25922599 | 86.14 |
2014 | 38851223 | 82.27 |
2015 | 39502222 | 89.92 |
We can see how the number of verified reviews increases each year, and its percentage of all reviews is generally increasing too.
How Do Ratings Vary by Product Category?
It appears that Digital Software is the most disappointing category for reviewers, averaging a massive 1.2 points worse than the impeccable Gift Card category:
select product_category,
cast(avg(cast(star_rating as decimal(5,4))) as decimal (3,2)) avg_rating,
count(*)
from opt_reviews
group by 1
order by 2;
PRODUCT_CATEGORY | AVG_RATING | COUNT(*) |
Digital_Software | 3.54 | 102084 |
Software | 3.57 | 342135 |
Major_Appliances | 3.72 | 96901 |
Mobile_Electronics | 3.76 | 105159 |
... | ... | ... |
Books | 4.34 | 20726160 |
Music | 4.44 | 6177781 |
Digital_Music_Purchas | 4.64 | 1852184 |
Gift_Card | 4.73 | 149086 |
How Do Customers Report the Helpfulness of Different Star Rating Reviews?
It seems that people consider negative reviews to be more helpful than positive reviews, possibly because they perceive them as valuable in preventing the purchase of an unsuitable product, or because they highlight issues which might be relevant to whether or not the customer then goes on to purchase the product:
select star_rating,
cast(avg(cast (helpful_votes as decimal (18,4))) as decimal (16,2)) AVG_HELP,
count(*)
from opt_reviews
group by 1
order by 1;
STAR_RATING | AVG_HELP | COUNT(*) |
1 | 3.83 | 12667601 |
2 | 2.57 | 7700647 |
3 | 2.07 | 12900929 |
4 | 1.68 | 27996469 |
5 | 1.69 | 99530924 |
Looking at Reviewer Behavior
We can look at things from a more customer-centric point of view, which is more expensive in terms of query time (because we have to do a lot more processing to aggregate over a lot of different reviewers).
As a result, the query below took just under 14 seconds, compared to all the previous ones which ran in less than 0.5 seconds — it is looking across all reviewers to see how many reviews they have done, how many product categories those reviews span, the average rating they give, and how many ratings at each level they have given.
We also check how helpful their average review is, how many Vine reviews they have, and how many of their purchases are verified:
select customer_id,
count(*),
count(distinct product_category) cats,
cast(avg(cast(star_rating as decimal(5,4))) as decimal (3,2)) avg_rating,
cast(avg(cast (helpful_votes as decimal (18,4))) as decimal (16,2)) avg_help,
sum(case when star_rating = 1 then 1 else 0 end) one,
sum(case when star_rating = 2 then 1 else 0 end) two,
sum(case when star_rating = 3 then 1 else 0 end) three,
sum(case when star_rating = 4 then 1 else 0 end) four,
sum(case when star_rating = 5 then 1 else 0 end) five,
sum(case when vine = 'Y' then 1 else 0 end) vine_reviews,
sum(case when verified_purchase = 'Y' then 1 else 0 end) verified_purchases
from opt_reviews
group by 1
order by 2 desc;
CUSTOMER_ID | COUNT(*) | CATS | AVG_RATING | AVG_HELP | ONE | TWO | THREE | FOUR | FIVE | VINE_REVIEWS | VERIFIED_PURCHASES |
50122160 | 59825 | 14 | 5.00 | 4.36 | 0 | 0 | 6 | 32 | 59787 | 0 | 1 |
50732546 | 30922 | 2 | 4.63 | 3.97 | 0 | 3 | 84 | 11397 | 19438 | 72 | 422 |
52496677 | 15067 | 38 | 4.66 | 6.71 | 38 | 127 | 697 | 3192 | 11013 | 91 | 1977 |
52004451 | 10954 | 8 | 4.93 | 1.61 | 23 | 79 | 168 | 138 | 10546 | 0 | 873 |
50736950 | 10857 | 24 | 4.00 | 2.04 | 847 | 1132 | 967 | 2175 | 5736 | 0 | 1 |
18116317 | 9084 | 28 | 4.00 | 3.16 | 98 | 406 | 1807 | 3894 | 2879 | 48 | 1716 |
50881246 | 8901 | 23 | 4.05 | 11.94 | 577 | 397 | 799 | 3374 | 3754 | 187 | 763 |
44731853 | 7604 | 2 | 4.76 | 2.18 | 21 | 79 | 396 | 719 | 6389 | 0 | 471 |
38002140 | 7509 | 21 | 3.92 | 1.31 | 269 | 649 | 1517 | 2037 | 3037 | 0 | 3 |
43430756 | 7344 | 24 | 3.35 | 3.75 | 560 | 1284 | 1986 | 2033 | 1481 | 0 | 114 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7080939 | 6506 | 36 | 5.00 | 0.03 | 0 | 0 | 0 | 1 | 6505 | 0 | 15 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
The results show:
- you would like your product to be reviewed by customer_id 50122160 rather than customer 43430756, as on average that will give you a star rating 1.65 higher.
- the reviewer with the highest number of reviews is not a Vine reviewer and has only one verified purchase. You might not want to give their reviews too much credence though, as they seem to be happy with almost every product (maybe they are just an optimist, or perhaps they have some incentive to give good reviews). Note that their average number of helpful votes per review is in line with other reviewers though, which is somewhat surprising.
- the other very optimistic reviewer, with id 7080939, appears to be treated with more cynicism by other customers; their average helpful votes per review is the lowest of the top 20 reviews.
- the people with the most helpful votes per review have a better range of star ratings and a higher number of verified purchases than the two very positive reviewers mentioned above.
Switching to look for quality rather than quantity, we can examine customers whose reviews average the highest number of helpful votes:
select customer_id,
count(*),
count(distinct product_category),
cast(avg(cast(star_rating as decimal(5,4))) as decimal (3,2)) avg_rating,
cast(avg(cast (helpful_votes as decimal (18,4))) as decimal (16,2)) avg_help,
sum(case when star_rating = 1 then 1 else 0 end) one,
sum(case when star_rating = 2 then 1 else 0 end) two,
sum(case when star_rating = 3 then 1 else 0 end) three,
sum(case when star_rating = 4 then 1 else 0 end) four,
sum(case when star_rating = 5 then 1 else 0 end) five,
sum(case when vine = 'Y' then 1 else 0 end) vine_reviews,
sum(case when verified_purchase = 'Y' then 1 else 0 end) verified_purchases
from opt_reviews
group by 1
having count(*) > 100
order by 5 desc
limit 10;
CUSTOMER_ID | COUNT(*) | CATS | AVG_RATING | AVG_HELP | ONE | TWO | THREE | FOUR | FIVE | VINE_REVIEWS | VERIFIED_PURCHASES |
53025525 | 159 | 18 | 4.35 | 271.52 | 1 | 1 | 25 | 46 | 86 | 0 | 75 |
50629044 | 107 | 24 | 3.70 | 253.72 | 29 | 5 | 4 | 0 | 69 | 0 | 81 |
52621867 | 110 | 23 | 3.45 | 165.15 | 14 | 17 | 25 | 14 | 40 | 1 | 67 |
43346653 | 103 | 7 | 4.13 | 158.80 | 0 | 9 | 14 | 35 | 45 | 0 | 0 |
42793407 | 138 | 25 | 4.12 | 140.07 | 1 | 11 | 17 | 50 | 59 | 61 | 55 |
16378095 | 430 | 28 | 4.18 | 131.00 | 7 | 18 | 56 | 158 | 191 | 0 | 182 |
49431050 | 101 | 5 | 4.93 | 111.08 | 1 | 1 | 0 | 0 | 99 | 0 | 73 |
30669680 | 727 | 24 | 3.58 | 103.66 | 31 | 102 | 214 | 173 | 207 | 0 | 245 |
41866357 | 215 | 28 | 4.52 | 100.90 | 13 | 3 | 10 | 23 | 166 | 0 | 181 |
42299455 | 123 | 11 | 3.81 | 92.29 | 1 | 16 | 17 | 60 | 29 | 0 | 78 |
The main thing I could see from this is that these users tended to have a much lower number of reviews, and in general had a higher percentage of verified purchases.
Which Review Is the Most Helpful in Amazon?
According to this data, the most helpful review in this data set is a positive one for an Amazon product — it is for the original Kindle, with the review submitted in 2007.
However, note the duplicate rows in the data:
with review R2JF7E91JJVHAT for "Fifty Shades of Grey" appearing twice.
...and the same for review R1LT38SAC0FA4G for the same novel in digital form.
I've verified these anomalies existed in the underlying S3 Parquet data, and notified AWS of the issue, but haven't had any feedback yet. The duplication does not exist in the CSV version of the same data set, so something has clearly gone wrong when generating the Parquet version.
There are some other anomalies present in the review text as well (parts of the review text are duplicated, apparently most commonly for long review text) — again, this is evident upon inspection of the underlying Parquet files, and AWS has been notified.
select customer_id, review_id, product_id, product_title, star_rating,
helpful_votes, total_votes, vine, verified_purchase from opt_reviews
order by helpful_votes
desc limit 15;
CUSTOMER_ID | REVIEW_ID | PRODUCT_ID | PRODUCT_TITLE | STAR_RATING | HELPFUL_VOTES | TOTAL_VOTES | VINE | VERIFIED_PURCHASE |
52994074 | R3R24QH3CDS83N | B000FI73MA | Kindle: Amazon's Original Wireless Reading Device (1st generation) | 5 | 47524 | 48362 | N | Y |
50268169 | R19XO9PS38WRWO | B004F9QBE6 | BIC Cristal For Her Ball Pen, 1.0mm, Black, 16ct (MSLP16-Blk) | 5 | 41393 | 41866 | N | N |
51121444 | R2XKMDXZHQ26YX | B000OE2OLU | The Mountain Kids 100% Cotton Three Wolf Moon T-Shirt | 5 | 41278 | 41889 | N | N |
53025525 | R2YVZNKUMWGYJ4 | B003FSUDM4 | Kindle Keyboard 3G, Free 3G + Wi-Fi, 6" E Ink Display | 5 | 31924 | 32373 | N | Y |
12267963 | R1PL1U31XQG8KG | B0083PWAPW | Kindle Fire HD 7", Dolby Audio, Dual-Band Wi-Fi | 4 | 31417 | 32166 | N | Y |
51835976 | RBUNL9QWDYTD7 | B0051VVOB2 | Kindle Fire (Previous Generation - 1st) | 4 | 28611 | 29433 | N | Y |
16378095 | R2JF7E91JJVHAT | 0345803485 | Fifty Shades of Grey: Book One of the Fifty Shades Trilogy (Fifty Shades of Grey Series) | 2 | 27550 | 28727 | N | Y |
16378095 | R2JF7E91JJVHAT | 0345803485 | Fifty Shades of Grey: Book One of the Fifty Shades Trilogy (Fifty Shades of Grey Series) | 2 | 27550 | 28727 | N | Y |
9286343 | R3PG4OX6C5KVN4 | B000IZGIA8 | Wheelmate Laptop Steering Wheel Desk | 5 | 23755 | 24170 | N | N |
50629044 | RV0R3AODMRNJZ | B00154JDAI | Kindle Wireless Reading Device (6" Display, U.S. Wireless) | 1 | 24714 | 26143 | N | Y |
47364506 | RXXPVOUH9NLL3 | B00032G1S0 | Tuscan Dairy Whole Vitamin D Milk, Gallon, 128 oz | 5 | 23755 | 24170 | N | N |
52839532 | R1JMDSX89JG82B | B000X9FLKM | The Lord of the Rings: The Motion Picture Trilogy (The Fellowship of the Ring / The Two Towers / The Return of the King Theatrical Editions) [Blu-ray] | 1 | 20928 | 22982 | N | N |
50423635 | R1LT38SAC0FA4G | B007J4T2G8 | Fifty Shades of Grey (Fifty Shades, Book 1) | 1 | 18429 | 19368 | N | N |
50423635 | R1LT38SAC0FA4G | B007J4T2G8 | Fifty Shades of Grey (Fifty Shades, Book 1) | 1 | 18429 | 19368 | N | N |
29167258 | R2PKU2MR4GXKG7 | B004F9QBE6 | BIC Cristal For Her Ball Pen, 1.0mm, Black, 16ct (MSLP16-Blk) | 1 | 18217 | 18648 | N | N |
Unusual Reviewer Behavior
We've already seen that some reviewers generate a very large number of reviews, and sometimes those are very largely five-star reviews.
Looking for other unusual behaviours we can see, for example, instances of reviewers repeatedly reviewing the same product. I was actually surprised that this was even allowed (have not verified if that is still the case):
select customer_id, product_id, count(*)
from opt_reviews
group by 1, 2
having count(*) > 1
order by 3 desc;
CUSTOMER_ID | PRODUCT_ID | COUNT(*) |
38118182 | B00003CXTG | 260 |
33132919 | B0000TB03W | 220 |
29088361 | B0000DD7LB | 122 |
23974294 | B0000TB03W | 114 |
... | ... | ... |
As we can see above, one product was reviewed 220 times by one reviewer, and 114 times by another reviewer. In the case of the second row above, the 220 reviews occurred between 2003-11-10 and 2003-12-04. Presumably, the reviewer kept thinking of other compliments for this revolutionary product, and could not edit any of their existing reviews for it. You certainly have to admire their persistence.
We can also check for reviewers who are a bit lazy and repeatedly use the same review headline:
select customer_id, review_headline, count(*)
from opt_reviews
where char_length(review_headline) > 10
group by 1, 2
having count(*) > 1
order by 3 desc;
CUSTOMER_ID | REVIEW_HEADLINE | COUNT(*) |
14410556 | Courtesy of Teens Read Too | 4415 |
43530996 | Not Free SF Reader | 2175 |
43530996 | Super Reader | 1056 |
37529167 | Good Reading | 1049 |
32865329 | Kritters Ramblings | 902 |
43530996 | Graphic SF Reader | 813 |
... | ... | ... |
Some reviewers really like to reiterate their point, so they submit multiple reviews for the same product, all with the same headline:
select customer_id, product_id, review_headline, count(*)
from opt_reviews
where char_length(review_headline) > 10
group by 1, 2,3
having count(*) > 1
order by 4 desc;
CUSTOMER_ID | PRODUCT_ID | REVIEW_HEADLINE | COUNT(*) |
32745072 | B00000607F | Such an historic band. | 29 |
50281120 | 1575665379 | AMust Buy! | 23 |
38864282 | B00007GUIO | Excellent Debut Album | 23 |
38503847 | 0440204887 | The Hanbook For Life | 21 |
52906860 | 8432039799 | Un libro incomparable. | 20 |
... | ... | ... | ... |
How I Did It
The product I was using was Kognitio's AWS Marketplace offering, which allows you to simply but quickly analyze data stored in Amazon's S3 storage and elsewhere. You can find the link to the product here. From that page, I brought up a system with 4 nodes, each with 64GB RAM and 16 cores, and a spot price of around $4.23 per hour. You also need a tool capable of connecting to Kognitio and running SQL — pretty much any third party tool will be able to do this if you install the Kognitio ODBC or JDBC driver, but I used Kognitio's own SQL submission tool called Console which you can download here.
I then ran the following SQL to create an external table allowing me to query the review data stored in S3. Note that the data is stored in Parquet format, which is an industry standard, and contains information about column names and types to allow them to be automatically generated in the external table:
create external table reviews from s3_parquet
target 'uri_location s3://amazon-reviews-pds, uri_path /parquet/, fmt_filename_partitions product_category, repeatable_results yes, ignore_invalid_records 1';
In the above:
- 'uri_location' and 'uri_path' tell the software where the data can be found.
- the 'fmt_filename_partitions' setting provides information on what the Parquet files in S3 are partitioned on (product_category in this case).
- 'repeatable_results yes' indicates the data isn't changing so you will get the same results if you re-access the data in S3 multiple times in a query.
- 'ignore_invalid_records 1' specifies that any records that cannot be loaded should be ignored (the failure to load them is logged for further investigation).
I can then query that external table immediately, for example, the following query returns in a few seconds as it can leverage the fact that the data is already partitioned on product_category
to eliminate irrelevant partitions:
select count(*)
from reviews
where product_category = 'Camera';
To get better query performance, we can generate an in-memory image of the data, then run against that rather than continually accessing S3. So I defined a view showing the fields I was interested in, and also mapping them to the most appropriate types rather than relying on the types inferred from Parquet — note that I had to quote the column name "YEAR" below, as that is a reserved word in Kognitio as with many databases:
create view opt_reviews as
select cast(marketplace as char(2)) marketplace,
cast(customer_id as bigint) customer_id,
cast(review_id as char (16)) review_id,
cast(product_id as char (12)) product_id,
cast(product_parent as bigint) product_parent,
product_title, star_rating, helpful_votes, total_votes,
cast(vine as char(1)) vine,
cast(verified_purchase as char(1)) verified_purchase,
review_headline, review_date, "YEAR", product_category
from reviews;
Then I created a memory image of that view, which took a few minutes as it has to scan all the data in S3 and transfer it to memory:
create view image opt_reviews;
The result was an image with just over 160 million rows in it.
At this point I could ask (hopefully) more interesting questions about the data, as shown earlier in the article — let me know if there are any other questions you'd like answered, or try this out yourself.
Opinions expressed by DZone contributors are their own.
Comments