👇 Read the project documentation below the dashboard.
▪️ Stock moves by approximately 5-20 violins per day;
▪️ The price gap between professional and advanced violins is around $30,000;
▪️ There is a large inventory of beginner violins, suggesting they are the most frequently sold items;
▪️ The site has a limited number of reviews, all for beginner and intermediate violins, with an average rating of 5 stars;
▪️ November 2024 showed the highest average violin prices across the entire period analyzed;
▪️ Stock tends to increase at the beginning of each month and follows a general upward trend, indicating strong inventory turnover and business growth;
▪️ There is a noticeable correlation between the number of unique violins and the average price;
▪️ In June 2025, the average violin price began to decline while the number of unique violins increased, suggesting a strategic shift toward stocking cheaper violins rather than more expensive ones, which tend to have lower turnover.
To automate the extraction of product data (title, price, rating, stock, etc.) from an online violin store, aiming to build a structured dataset for analysis.
The scraper navigates dynamically loaded pages, interacts with the category buttons ("advanced-violins", "beginner-violins", "electric-violins"), and retrieves content from multiple sections of the website. The final output is cleaned, processed data, ready for use and stored in a data warehouse.
▪️ I had to identify the correct parsing logic at each step of the scraping process;
▪️ Recognize and extract the specific variables for each piece of information (e.g. violin name, price, description);
▪️ Ensure seamless navigation through all paginated content using a dynamic loop that stops automatically at the last page;
▪️ Structure the data into a well-formed DataFrame by parsing the HTML content accurately;
▪️ Simulate human-like behavior during the session, avoiding detection by anti-bot systems and mimicking a real user navigating the site (see the spider sketch after this list).
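To make the pagination loop and the human-like pacing concrete, here is a minimal spider sketch. The domain, URLs, CSS selectors, and field names are hypothetical placeholders, not the real site's markup (which stays undisclosed).

import scrapy


class ViolinSpider(scrapy.Spider):
    name = "violins"
    # One listing URL per category button; domain and paths are placeholders.
    start_urls = [
        "https://violin-store.example.com/beginner-violins",
        "https://violin-store.example.com/advanced-violins",
        "https://violin-store.example.com/electric-violins",
    ]
    custom_settings = {
        # Human-like pacing: a randomized delay between requests and a
        # realistic browser User-Agent reduce the chance of bot detection.
        "DOWNLOAD_DELAY": 3,
        "RANDOMIZE_DOWNLOAD_DELAY": True,
        "USER_AGENT": (
            "Mozilla/5.0 (X11; Ubuntu; Linux x86_64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
        ),
    }

    def parse(self, response):
        # One record per product card (selectors are illustrative).
        for card in response.css("div.product-card"):
            yield {
                "title": card.css("h3.title::text").get(),
                "price": card.css("span.price::text").get(),
                "rating": card.css("span.rating::text").get(),
                "stock": card.css("span.stock::text").get(),
                "category": response.url.rstrip("/").split("/")[-1],
            }
        # Dynamic pagination: keep following the "next" link until it
        # disappears, so the loop stops automatically at the last page.
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Running it with scrapy runspider violin_spider.py -o items.json (filename assumed) would produce the raw records that feed the cleaning step described below.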
▪️ I used Scrapy to perform the web scraping, as I wanted to explore this powerful framework in a real-world scenario. The website's content could be extracted directly from the HTML without relying on JavaScript rendering, which made Scrapy the ideal tool for this project;
▪️ The data is cleaned and converted into a DataFrame using Pandas, ensuring consistency and clarity in the final dataset (see the cleaning sketch after this list);
▪️ A second script handles the insertion of the collected data into a PostgreSQL database, enabling further analysis and dashboard integration (see the loading sketch after this list);
▪️ The pipeline runs automatically every morning, refreshing the data every 24 hours; it has been running since November 2024;
▪️ I also implemented a logger to capture the execution status of each step in the code, storing these logs in the database for tracking and debugging purposes.
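A minimal sketch of the cleaning step, assuming the hypothetical field names from the spider sketch above; the actual parsing rules depend on the site's real price and rating formats.

import pandas as pd


def clean_items(raw_items):
    """Turn the spider's raw dicts into a typed, de-duplicated DataFrame."""
    df = pd.DataFrame(raw_items)
    # Normalize whitespace in text fields.
    df["title"] = df["title"].str.strip()
    # Strip currency symbols and thousands separators, then cast to float.
    df["price"] = df["price"].str.replace(r"[^\d.]", "", regex=True).astype(float)
    # Coerce malformed ratings/stock values to NaN instead of raising.
    df["rating"] = pd.to_numeric(df["rating"], errors="coerce")
    df["stock"] = pd.to_numeric(df["stock"], errors="coerce").astype("Int64")
    # Stamp each daily snapshot so the warehouse can track the time series.
    df["extracted_at"] = pd.Timestamp.today().normalize()
    return df.drop_duplicates(subset=["title", "extracted_at"])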
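And a sketch of the second script: a bulk insert of the daily snapshot plus a log record written to the same database, using psycopg2. The connection string, table names, and columns are assumptions rather than the project's real schema. The morning run could then be scheduled with a crontab entry such as 0 6 * * * python3 /path/to/pipeline.py (time and path assumed).

import psycopg2
from psycopg2.extras import execute_values

DSN = "postgresql://user:password@localhost:5432/violins"  # placeholder


def load_snapshot(df):
    # df: the DataFrame returned by clean_items() above.
    rows = list(df.itertuples(index=False, name=None))
    with psycopg2.connect(DSN) as conn:
        with conn.cursor() as cur:
            # Bulk-insert the day's snapshot into a staging table.
            execute_values(
                cur,
                "INSERT INTO staging_snapshot "
                "(title, price, rating, stock, category, extracted_at) VALUES %s",
                rows,
            )
            # Store the execution status next to the data, so failures
            # can be tracked and debugged from the database itself.
            cur.execute(
                "INSERT INTO scraping_log (step, status, rows_inserted) "
                "VALUES (%s, %s, %s)",
                ("load_snapshot", "success", len(rows)),
            )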
To view the data extraction code, explore the project's repository on GitHub.
[Diagram: full database structure, to review before moving on to Power BI]
[Diagram: logging table structure]
Web Scraping: Python, Scrapy, Pandas
Database: SQL, PostgreSQL
Infrastructure: Docker, Ubuntu Server (VPS), Crontab
Dataviz: Power BI
UX Design: Figma
Version Control: Git, GitHub
To preserve the identity of the website, its official URL will not be disclosed.
The data warehouse was modeled as a star schema; a sketch of the layout follows.
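For illustration only, here is one way that star schema could look: a central fact table of daily price/stock snapshots surrounded by product and date dimensions. All table and column names are assumptions, not the project's real model.

import psycopg2

# Hypothetical star schema: one fact table referencing two dimensions.
DDL = """
CREATE TABLE IF NOT EXISTS dim_product (
    product_id SERIAL PRIMARY KEY,
    title      TEXT NOT NULL,
    category   TEXT NOT NULL
);

CREATE TABLE IF NOT EXISTS dim_date (
    date_id DATE PRIMARY KEY,
    month   INT NOT NULL,
    year    INT NOT NULL
);

CREATE TABLE IF NOT EXISTS fact_stock (
    product_id INT  REFERENCES dim_product (product_id),
    date_id    DATE REFERENCES dim_date (date_id),
    price      NUMERIC(10, 2),
    rating     NUMERIC(2, 1),
    stock      INT
);
"""


def create_schema(dsn="postgresql://user:password@localhost:5432/violins"):
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(DDL)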
▪️ This project allowed me to learn and apply Scrapy for HTML parsing, which proved very effective in this scraping context;
▪️ I'm now prepared to handle future projects using spiders, collectors, and more advanced crawling strategies;
▪️ It became clear that the processing cost for this type of operation is very low, making it significantly more economical than using Selenium, especially when JavaScript rendering is not required;
▪️ Depending on the use case, Scrapy can therefore be a more efficient and scalable solution than browser automation tools;
▪️ A potential future challenge is that if any element on the website changes its name or structure, the code may break; however, the logging step will alert me if anything goes wrong.