From Receipt to Analysis: How I Scanned My Costco Receipt and Compared Prices Across Major Retailers. Part One
As a recent Costco member, I’ve heard mixed reviews about whether Costco is truly a bargain. Some people claim its prices are inflated, while others swear by the quality and value of its products. But as a data scientist, I know that the only way to get to the truth is through hard data.
That’s why I was intrigued when I stumbled upon a video claiming to prove that buying groceries at Costco is no better than shopping at any other major retailer. As someone who loves a good bargain and is always looking for ways to save money, I couldn’t just take their word for it.
That’s when I had the idea: What if I could put my data science skills to work and compare the prices of the items I buy at Costco with those at other major retailers?
This project not only allowed me to satisfy my curiosity, but also served as a way to develop and showcase my data science skills through a real-world, end-to-end portfolio project.
This article is divided into two parts. In part one, I’ll take you through my journey from scanning my Costco receipts to enriching the data and storing it for later analysis. In part two, we’ll dive into data collection from other major retailers and perform a comparative analysis.
This article is tailored for both bargain hunters and data science enthusiasts. It presents my approach and findings in a clear, concise manner that is easy for non-technical readers to understand. After completing the analysis, I will write a walkthrough article on how the code was developed and used.
Getting Started:
To tackle the project effectively, I took a strategic approach by dividing it into smaller, manageable steps. I also researched different data collection methods and chose the most efficient solution. By following this approach, I was able to gather and clean data from Costco’s website and other retailers, store it for analysis, and focus on the most important aspects of the project.
The following outlines the process I followed for this project:
1: Getting data from Costco’s website:
- First, I obtained the product IDs from my Costco receipts by scanning them.
- Using these product IDs, I scraped additional data from Costco’s website.
- To ensure accuracy, I carefully cleaned and prepared the data using standard data processing methods.
- Finally, I stored the data in a format that would be easy to use for analysis.
2: Getting data from other retailers:
- Using the product names obtained from Costco, I searched other retailer websites to gather pricing information for comparable products.
- I ensured the collected data was accurate and clean by following standard data processing methods.
- Finally, I stored the data for later analysis.
3: Analysis:
- With the data from both Costco and other retailers, I compared the prices to determine if Costco truly offers a better deal.
From Receipt to Data:
Web Scraping:
Think of web scraping as the process of sourcing ingredients. Just as a chef carefully selects and sources high-quality ingredients for a recipe, a data scientist carefully selects and sources high-quality data for a project. In our case, web scraping provides a way to gather data from a website and obtain the most accurate and relevant information.
First and foremost, before scanning the receipt, I had to determine whether web scraping was a feasible method for data collection. Many websites deploy measures to detect and block scrapers, which could have impeded my data collection efforts. To tackle this obstacle, I researched the website’s anti-scraping measures and selected a tool that could work around them.
With the scraping tool and code in place, I tested the approach on a few sample products. Given the sheer amount of data on the website, I needed to carefully identify and extract only the specific fields that were relevant to the analysis. Once I had successfully tested the scraping tool and confirmed that I was collecting the necessary data, I moved on to the next challenge.
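To give a sense of what this step looks like in code, here is a simplified sketch using the requests and BeautifulSoup libraries. The search URL and CSS selectors are illustrative placeholders rather than the exact ones I used, and since the site employs bot detection, a headless browser or a dedicated scraping service may be needed in practice:

```python
# A simplified sketch of the scraping step, not the exact tool I used.
# The search URL and CSS selectors below are illustrative placeholders.
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0"}  # a browser-like header avoids trivial blocks


def scrape_product(product_id: str) -> dict:
    # Hypothetical search-by-item-number URL; inspect the real site to confirm.
    url = f"https://www.costco.com/CatalogSearch?keyword={product_id}"
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    # Placeholder selectors; the real ones come from inspecting the page.
    name = soup.select_one(".product-title")
    price = soup.select_one(".price")
    return {
        "product_id": product_id,
        "name": name.get_text(strip=True) if name else None,
        "price": price.get_text(strip=True) if price else None,
    }


# Test the approach on a few sample products before scaling up.
print(scrape_product("1234567"))
```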
Extracting Data from Receipts:
My next challenge was to extract data from the receipts, which required a different approach from web scraping. I needed image recognition technology that could accurately identify the product codes (SKUs) listed on the receipts. To achieve this, I used Optical Character Recognition (OCR), a technology that extracts text and characters from images.
Instead of dedicating considerable time and resources to developing an OCR model from scratch, which would involve training and optimization, I opted to explore existing solutions. Fortunately, my search led me to a tool called Veryfi, which specializes in extracting data from receipts and invoices. Leveraging this tool streamlined the process, enabling me to achieve the desired outcome efficiently. This is a great example of how using existing tools and resources can be an effective way to rapidly develop a minimum viable product.
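For the curious, here is a minimal sketch of what receipt extraction looks like with Veryfi’s Python client (pip install veryfi). The credentials and file name are placeholders, and the response fields shown reflect my reading of the documentation, so the current schema may differ:

```python
# A minimal sketch of receipt OCR with Veryfi's Python client.
# Credentials are placeholders; see Veryfi's docs for the current schema.
from veryfi import Client

client = Client("CLIENT_ID", "CLIENT_SECRET", "USERNAME", "API_KEY")

# OCR a scanned receipt image and get back structured JSON.
receipt = client.process_document("costco_receipt.jpg")

# Each line item should include the printed item number (SKU) and price.
for item in receipt.get("line_items", []):
    print(item.get("sku"), item.get("description"), item.get("total"))
```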
Cleaning the Data:
Just as a chef needs to clean, chop, and measure ingredients before cooking, we need to clean and prepare the data before analyzing it.
Cleaning and preparing the data involves removing duplicates, filling in missing values, and making sure everything is formatted correctly. This step is critical for ensuring that the data is accurate, reliable, and ready for analysis.
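As an illustration, here is a simplified pandas sketch of this step; the sample records and column names are made up for demonstration:

```python
# An illustrative pandas sketch of the cleaning step.
import pandas as pd

rows = [
    {"product_id": "1234567", "name": " kirkland almond butter ", "price": "$7.99"},
    {"product_id": "1234567", "name": " kirkland almond butter ", "price": "$7.99"},  # duplicate
    {"product_id": "7654321", "name": "organic eggs 24 ct", "price": None},           # missing value
]

df = pd.DataFrame(rows)
df = df.drop_duplicates(subset="product_id")   # remove duplicates
df = df.dropna(subset=["name", "price"])       # drop rows missing key fields
df["price"] = df["price"].str.replace(r"[$,]", "", regex=True).astype(float)  # "$7.99" -> 7.99
df["name"] = df["name"].str.strip().str.title()  # consistent formatting
print(df)
```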
Data Enrichment:
Just as seasoning enhances the taste of a dish, enriching data can unlock hidden insights and extract more value from it. While raw data provides a foundation, it lacks the depth and context that enriching it can provide.
I enriched the data by extracting vital information such as product weights. I also added category and subcategory fields using ChatGPT, which we will see in action in part two of this article. By enriching the data, I laid the foundation for the next step of the analysis.
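Below is an illustrative sketch of both enrichment steps, continuing from the cleaned DataFrame above. The regex pattern, prompt, and model name are assumptions for demonstration rather than the exact ones I used:

```python
# An illustrative sketch of the enrichment step; continues from the
# cleaned DataFrame 'df' above. Prompt and model name are assumptions.
import re

from openai import OpenAI  # pip install openai


def extract_weight(name: str):
    # Matches package sizes like "2 lb", "1.5 kg", or "24 oz" in a product name.
    match = re.search(r"\d+(?:\.\d+)?\s*(?:lb|lbs|oz|kg|g)\b", name, re.IGNORECASE)
    return match.group(0) if match else None


client = OpenAI()  # reads OPENAI_API_KEY from the environment


def categorize(name: str) -> str:
    # Ask the model for a "category, subcategory" pair for one product.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Reply with 'category, subcategory' for this grocery product: {name}",
        }],
    )
    return response.choices[0].message.content


df["weight"] = df["name"].apply(extract_weight)
# Split "Pantry, Nut Butters" into two columns (assumes a comma in each reply).
df[["category", "subcategory"]] = df["name"].apply(categorize).str.split(",", n=1, expand=True)
```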
Saving Data for Analysis:
Once the data has been collected, cleaned, and enriched, the next step is to store it in a format that can be easily used for analysis. This is similar to how a cook would store their prepared ingredients for later use.
For this project, I chose to store the data in CSV (comma-separated values) format. This is a simple and widely used format that can be easily imported into many data analysis tools.
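Continuing from the DataFrame above, saving and reloading the table is a pair of one-liners with pandas:

```python
# Persist the cleaned, enriched table so part two can pick it up directly.
df.to_csv("costco_products.csv", index=False)

# The analysis later reloads it with:
df = pd.read_csv("costco_products.csv")
```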
In future iterations of this project, it may be beneficial to store the data in a more structured format, such as a database. However, for a proof of concept, the simplicity and flexibility of a CSV file is sufficient.
Conclusion:
I’m proud to say that this project has been a truly fulfilling experience, allowing me to apply my data science skills to a real-world problem. Throughout the process, I leveraged OCR technology to extract product SKUs and gathered valuable information from Costco’s website. Additionally, I followed best practices and ensured that the data was thoroughly cleaned and enriched, laying a strong foundation for the upcoming analysis.
As I move forward with this project, my main objective is to use my data science skills to demystify whether Costco is truly a bargain compared to other major retailers. I am excited to explore the insights that can be gained from comparing prices across different retailers, and I look forward to sharing my findings with you in part two of this article.
Follow me to stay tuned for the next article!