Inside a Large Retailer’s Web Architecture: Data Extraction and Analytical Insights
In this article, I will go over how to use the Costco API to get purchase information, build some tooling around it, and share some insights.
Recently, I was trying to understand my daily intake of sugar and other nutrients. To do that, I looked for ways to get my receipts sorted and identify the specific products I had bought. Whether shopping at Whole Foods, QFC, or Costco, I wanted my purchase information in an accessible form.
To solve this problem, I probably need an app that reads my receipts and categorizes the items accordingly. There must be an existing app for that, or we can now use AI to automate most of it.
To start somewhere, I looked into my Costco purchases (at the same time, I also got their CITI credit card, which is awesome for Costco purchases). I browsed their website to get my receipts. To my surprise, Costco only allows downloading up to three months of data at a time. This was not sufficient for me, as I wanted to get my data in full.
After I looked into their network calls, I was able to identify that they use GraphQL to fetch this information (the endpoint is authenticated, so this is not a security bug on their end). I was eager to explore this API, which is visible to any logged-in user:
https://ecom-api.costco.com/ebusiness/order/v1/orders/graphql
To make any API calls, you also need the costco-x-authorization header; if you are already logged in, your browser already sends it with each request, so there is no need to work it out separately. We only need it once to download all the information.
The payload they use for getting the information is surprising: it carries the full text of the GraphQL query along with variables such as the start and end dates.
{
"query": "query receiptsWithCounts($startDate: String!, $endDate: String!,$documentType:String!,$documentSubType:String!) {\n receiptsWithCounts(startDate: $startDate, endDate: $endDate,documentType:$documentType,documentSubType:$documentSubType) {\n inWarehouse\n gasStation\n carWash\n gasAndCarWash\n receipts{\n warehouseName receiptType documentType transactionDateTime transactionBarcode warehouseName transactionType total \n totalItemCount\n itemArray { \n itemNumber\n }\n tenderArray { \n tenderTypeCode\n tenderDescription\n amountTender\n }\n couponArray { \n upcnumberCoupon\n } \n }\n}\n }",
"variables": {
"startDate": "9/01/2025",
"endDate": "11/30/2025",
"text": "Last 3 Months",
"documentType": "all",
"documentSubType": "all"
}
}
Now that we know the request takes date parameters, the obvious next step is to change them to some arbitrary dates and see if it still works.
Sample output:
{
"data": {
"receiptsWithCounts": {
"inWarehouse": "",
"gasStation": "",
"carWash": "",
"gasAndCarWash": "",
"receipts": [
{
"warehouseName": "",
"receiptType": "",
"documentType": "",
"transactionDateTime": "",
"transactionBarcode": "",
"transactionType": "",
"total": "",
"totalItemCount": "",
"itemArray": [
{
"itemNumber": ""
}
],
"tenderArray": [
{
"tenderTypeCode": "",
"tenderDescription": "",
"amountTender": ""
}
],
"couponArray": []
}
]
}
}
}
If we look into the response structure, we will find itemNumber values, which are Costco's internally tracked item numbers, for example 1823420. A quick Google search for "costco 1823420" returns, as its first result, a page we can use to identify the product. In this case, it is "Kirkland Signature Rustic Italian Bread, 32 oz." Now that we have a way to determine what an item number represents, we can figure out how to get the ingredients, sugar content, and other nutrition details.
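The lookup step can be sketched in Python. This is a minimal sketch, assuming the response has the shape shown in the sample output; the helper names `unique_item_numbers` and `search_query` are hypothetical, the sample response is made up for illustration, and the query string simply mirrors the manual "costco <itemNumber>" search.

```python
def unique_item_numbers(response):
    """Return distinct item numbers from a receiptsWithCounts response, in first-seen order."""
    receipts = (
        response.get("data", {})
        .get("receiptsWithCounts", {})
        .get("receipts", [])
    )
    seen = {}  # dicts preserve insertion order, so this doubles as an ordered set
    for receipt in receipts:
        for item in receipt.get("itemArray") or []:
            number = item.get("itemNumber")
            if number:
                seen.setdefault(str(number), None)
    return list(seen)


def search_query(item_number):
    """Build the search string used to identify a product by hand."""
    return f"costco {item_number}"


# Illustrative response with made-up receipts (not real purchase data).
sample_response = {
    "data": {
        "receiptsWithCounts": {
            "receipts": [
                {"itemArray": [{"itemNumber": "1823420"}, {"itemNumber": "96716"}]},
                {"itemArray": [{"itemNumber": "1823420"}]},
            ]
        }
    }
}

for number in unique_item_numbers(sample_response):
    print(search_query(number))
```

Deduplicating first keeps the number of manual or automated lookups small, since the same item tends to appear on many receipts.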
Because the start and end dates are exposed to users, I tried expanding the range to see how far back the data would go. By just changing the startDate to some date in 2018, I was able to get my entire purchase history. I decided to write a script to parse this information and get some meaningful insights.
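The widened request body can also be built programmatically. The following is a minimal sketch, not the site's own client code: `format_date` and `build_payload` are hypothetical helpers, the date format (month without a leading zero, two-digit day) is inferred from the example payloads in this article, and the "text" variable is assumed to be only a display label. The query requests the same fields as the one shown earlier, reformatted for readability.

```python
import json
from datetime import date

# Same fields as the query in the payload shown earlier; whitespace is not significant in GraphQL.
RECEIPTS_QUERY = """
query receiptsWithCounts($startDate: String!, $endDate: String!, $documentType: String!, $documentSubType: String!) {
  receiptsWithCounts(startDate: $startDate, endDate: $endDate, documentType: $documentType, documentSubType: $documentSubType) {
    inWarehouse
    gasStation
    carWash
    gasAndCarWash
    receipts {
      warehouseName receiptType documentType transactionDateTime transactionBarcode transactionType total
      totalItemCount
      itemArray { itemNumber }
      tenderArray { tenderTypeCode tenderDescription amountTender }
      couponArray { upcnumberCoupon }
    }
  }
}
"""


def format_date(d):
    """Format a date like the examples above, e.g. 9/01/2025 (assumed format)."""
    return f"{d.month}/{d.day:02d}/{d.year}"


def build_payload(start, end, label="Custom range"):
    """Build the JSON body for the receipts query over an arbitrary date range."""
    if start > end:
        raise ValueError("start date must not be after end date")
    return {
        "query": RECEIPTS_QUERY,
        "variables": {
            "startDate": format_date(start),
            "endDate": format_date(end),
            "text": label,  # assumption: a UI label, not used by the query itself
            "documentType": "all",
            "documentSubType": "all",
        },
    }


payload = build_payload(date(2018, 1, 1), date(2025, 11, 30))
print(json.dumps(payload["variables"], indent=2))
```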
curl 'https://ecom-api.costco.com/ebusiness/order/v1/orders/graphql' \
-H 'Accept: */*' \
-H 'Accept-Language: en-US,en;q=0.9' \
-H 'Cache-Control: no-cache' \
-H 'Connection: keep-alive' \
-H 'Content-Type: application/json-patch+json' \
-H 'Origin: https://www.costco.com' \
-H 'Pragma: no-cache' \
-H 'Referer: https://www.costco.com/' \
-H 'Sec-Fetch-Dest: empty' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Sec-Fetch-Site: same-site' \
-H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36' \
-H 'client-identifier: ' \
-H 'costco-x-authorization: ' \
-H 'costco-x-wcs-clientId: ' \
-H 'costco.env: ecom' \
-H 'costco.service: restOrders' \
-H 'sec-ch-ua: "Not:A-Brand";v="24", "Chromium";v="134"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "Linux"' \
--data-raw $'{"query":"query receiptsWithCounts($startDate: String\u0021, $endDate: String\u0021,$documentType:String\u0021,$documentSubType:String\u0021) {\\n receiptsWithCounts(startDate: $startDate, endDate: $endDate,documentType:$documentType,documentSubType:$documentSubType) {\\n inWarehouse\\n gasStation\\n carWash\\n gasAndCarWash\\n receipts{\\n warehouseName receiptType documentType transactionDateTime transactionBarcode warehouseName transactionType total \\n totalItemCount\\n itemArray { \\n itemNumber\\n }\\n tenderArray { \\n tenderTypeCode\\n tenderDescription\\n amountTender\\n }\\n couponArray { \\n upcnumberCoupon\\n } \\n }\\n}\\n }","variables":{"startDate":"6/01/2017","endDate":"8/31/2025","text":"2025 June - August","documentType":"all","documentSubType":"all"}}'
This is the sample shell command I used to get the order information; changing the start date to a very old date returns the complete history.
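For readers who prefer to stay in Python, the same call can be sketched with the standard library. This is a sketch under assumptions: `build_request` and `fetch_receipts` are hypothetical helpers, and I have not verified which of the headers in the curl command the server actually requires, so the subset kept here is a guess. The token and client ID values must come from your own logged-in browser session.

```python
import json
import urllib.request

GRAPHQL_URL = "https://ecom-api.costco.com/ebusiness/order/v1/orders/graphql"


def build_request(payload, auth_token, client_id, wcs_client_id):
    """Build (but do not send) a POST request mirroring the curl command above."""
    headers = {
        "Content-Type": "application/json-patch+json",
        "Origin": "https://www.costco.com",
        "Referer": "https://www.costco.com/",
        "client-identifier": client_id,
        "costco-x-authorization": auth_token,
        "costco-x-wcs-clientId": wcs_client_id,
        "costco.env": "ecom",
        "costco.service": "restOrders",
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(GRAPHQL_URL, data=body, headers=headers, method="POST")


def fetch_receipts(request):
    """Send the request and decode the JSON response (performs a network call)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Building the request is side-effect free; nothing is sent until fetch_receipts is called.
example = build_request(
    {"query": "...", "variables": {"startDate": "6/01/2017", "endDate": "8/31/2025"}},
    auth_token="<value copied from your browser session>",
    client_id="<client-identifier value>",
    wcs_client_id="<costco-x-wcs-clientId value>",
)
print(example.get_method(), example.full_url)
```

Saving the result of `fetch_receipts` with `json.dump` to output.json would produce the file that the analysis script reads.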
import json
from collections import Counter

# Load JSON from file
with open("output.json", "r") as f:
    data = json.load(f)

receipts = data["data"]["receiptsWithCounts"]["receipts"]

# -----------------------------
# 1. Compute total purchase sum
# -----------------------------
total_purchase = sum(r.get("total", 0) for r in receipts)

# -------------------------------------
# 2. Count item numbers across receipts
# -------------------------------------
item_counter = Counter()
for r in receipts:
    for item in r.get("itemArray", []):
        item_num = item.get("itemNumber")
        if item_num:
            item_counter[item_num] += 1

# Get top N items (default: 20)
top_items = item_counter.most_common(20)

# -----------------------------
# Print results
# -----------------------------
print("Total Purchase Amount:", total_purchase)
print("\nTop Purchased Item Numbers:")
for item, count in top_items:
    print(f"ItemNumber: {item}, Count: {count}")
Sample output from the above code:
ItemNumber: 96716, Count: 39
ItemNumber: 1659031, Count: 37
ItemNumber: 782796, Count: 36
My most frequent purchases from Costco were spinach, milk, and water bottles. I could also dig deeper into the locations, gas purchases, and other inventory, but this was already enough to understand my purchasing patterns at Costco.
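Digging into locations or spend over time takes only a small grouping helper. The following is a minimal sketch: `spend_by`, `by_warehouse`, and `by_year` are hypothetical names, the sample receipts are made up, and I am assuming that transactionDateTime starts with a four-digit year (ISO-style) and that total is a number or a numeric string.

```python
from collections import defaultdict


def spend_by(receipts, key_fn):
    """Sum receipt totals per group, where key_fn maps a receipt to its group."""
    totals = defaultdict(float)
    for receipt in receipts:
        totals[key_fn(receipt)] += float(receipt.get("total") or 0)
    return dict(totals)


def by_warehouse(receipt):
    return receipt.get("warehouseName") or "unknown"


def by_year(receipt):
    # Assumption: the timestamp is ISO-like, so the first four characters are the year.
    return (receipt.get("transactionDateTime") or "")[:4] or "unknown"


# Made-up receipts for illustration only.
sample_receipts = [
    {"warehouseName": "A", "transactionDateTime": "2024-03-02T11:05:00", "total": 120.5},
    {"warehouseName": "B", "transactionDateTime": "2024-07-19T18:40:00", "total": 45.0},
    {"warehouseName": "A", "transactionDateTime": "2025-01-11T09:30:00", "total": 80.25},
]

print(spend_by(sample_receipts, by_warehouse))
print(spend_by(sample_receipts, by_year))
```

Passing the grouping as a function keeps the summation in one place, so other groupings (for example by receiptType) need only another one-line key function.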
To make sure this is not a bug, I filed a bug report and was told that it is not one. That's why I waited a long time before posting this article.
There are a few learnings we can gain from this information. One interesting pattern I observed is that Costco can infer when I was away on vacation (because I would not get my gas or groceries from there). It would also know when I was on a road trip (again, from gas and store patterns). This shows that information is very powerful and can be interpreted in different ways.
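To make the vacation inference concrete, here is a minimal sketch that flags unusually long gaps between purchases. `purchase_gaps` is a hypothetical helper, the 14-day threshold is arbitrary, the timestamps are made up, and I am assuming transactionDateTime values can be parsed as ISO 8601 strings.

```python
from datetime import datetime


def purchase_gaps(timestamps, min_gap_days=14):
    """Return (last_purchase, next_purchase, days) for each gap of at least min_gap_days."""
    days = sorted({datetime.fromisoformat(ts).date() for ts in timestamps})
    gaps = []
    for previous, current in zip(days, days[1:]):
        length = (current - previous).days
        if length >= min_gap_days:
            gaps.append((previous, current, length))
    return gaps


# Made-up timestamps for illustration only.
sample_timestamps = [
    "2024-06-01T10:00:00",
    "2024-06-08T12:30:00",
    "2024-07-02T09:15:00",
    "2024-07-06T17:45:00",
]

for last_seen, next_seen, length in purchase_gaps(sample_timestamps):
    print(f"No purchases for {length} days between {last_seen} and {next_seen}")
```

A gap alone does not prove a vacation, of course; it only marks periods worth a closer look, which is exactly why this kind of data is sensitive.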
As for other implementation learnings to consider, I would still consider these very minor bugs:
- Do not expose start and end dates in your API.
- Probably add validations around the API calls.
- There should be a wrapper around the GraphQL layer to avoid exposing that this system is used internally.
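As an illustration of the validation point, a server-side check could reject ranges wider than what the website offers. This is a hypothetical sketch, not Costco's implementation: `validate_date_range`, `InvalidDateRange`, and the 92-day limit (roughly three months) are my own assumptions.

```python
from datetime import date

MAX_RANGE_DAYS = 92  # assumption: roughly the three-month window the website offers


class InvalidDateRange(ValueError):
    """Raised when a requested receipt range violates the policy."""


def validate_date_range(start, end, today, max_days=MAX_RANGE_DAYS):
    """Raise InvalidDateRange unless start <= end <= today and the span fits max_days."""
    if start > end:
        raise InvalidDateRange("startDate must not be after endDate")
    if end > today:
        raise InvalidDateRange("endDate must not be in the future")
    if (end - start).days > max_days:
        raise InvalidDateRange(f"range must not exceed {max_days} days")


# The three-month request from earlier in the article passes.
validate_date_range(date(2025, 9, 1), date(2025, 11, 30), today=date(2025, 12, 1))

# A range reaching back to 2018 is rejected.
try:
    validate_date_range(date(2018, 1, 1), date(2025, 11, 30), today=date(2025, 12, 1))
except InvalidDateRange as error:
    print("rejected:", error)
```

Whether such a limit is desirable is a product decision; as a customer, I was happy to get my full history in one call.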
Thanks for reading so far; in my next article, I will share a personal surveillance hack I built using my vacuum cleaner and security camera (this is a work in progress and will take some time).