Capstone

   From September 2024 to June 2025, I took a part-time introductory course in Data Science with the Lisbon Data Science Academy. Their curriculum, which includes individual assignments and group hackathons, can be found here. The course ends with an individual project, the Capstone, in which each student must analyse data from a company and develop a model to answer given business questions. This year the data came from a Portuguese retailing company, whose objective was to predict the prices of certain product categories at its two main competitors.

   The first part of the project focused on exploring and analysing the shared data, looking to extract business insights and to understand which factors have the biggest impact on prices. The data provided was unusually neat and clean, requiring almost no pre-processing. I was interested in understanding how the data was distributed across the different companies and product categories, and over time. There was more data available for our client than for its competitors, likely because such data is easier to acquire. It was therefore to be expected that the model would perform better when predicting the client's prices than those of its competitors, where the main interest of the project lay. The number of distinct products within each category also varied significantly, which could likewise lead to the model performing better for the categories with more information. All datasets covered the same period, and the amount of information available per competitor was fairly constant throughout this time.

   When assessing the data's seasonality, a slight weekly autocorrelation could be observed, as well as a correlation with the previous day's prices. I carried out further analysis to answer our business questions, covering promotional participation, the impact of ongoing campaigns, and pricing competitiveness. Finally, I wanted to understand whether price changes by one of the three companies were followed by either of the others in the short term (a few days to a week). For this, I studied the correlation between one player's price on a given day and the other players' prices on the following days, as sketched below. No significant correlation was found between the prices of one of the competitors and those of the remaining two companies. There was a correlation between our client's prices and the other competitor's, which was strongest when comparing prices from the same day. With this in mind, it is not clear that one is influencing the other; both might be watching the same signals and being influenced in similar ways by the same external factors.
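   A minimal sketch of that lag analysis with pandas is shown below. The file name and the column names ("date", "company", "price", and the company labels) are illustrative assumptions, not the actual dataset schema.

```python
import pandas as pd

# Assumed schema: one row per product-day, with a "company" label.
# File name and columns are hypothetical, for illustration only.
df = pd.read_csv("daily_prices.csv", parse_dates=["date"])
prices = df.pivot_table(index="date", columns="company",
                        values="price", aggfunc="mean")

# Weekly seasonality: autocorrelation of the client's daily mean
# price at 1-day and 7-day lags.
for lag in (1, 7):
    print(f"client autocorrelation, lag={lag}: "
          f"{prices['client'].autocorr(lag=lag):.3f}")

# Lagged cross-correlation: does one player's price on day t
# correlate with another's on day t+lag, for lags up to a week?
for lag in range(8):
    corr = prices["client"].corr(prices["competitor_a"].shift(-lag))
    print(f"client(t) vs competitor_a(t+{lag}): {corr:.3f}")
```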

   The second phase of the project was modelling the sales price against the variables found to be significant for the task. As the objective was to forecast the sales price for the two main competitors, rather than a sector-wide prediction of price evolution, two separate models were created, one per competitor. Different models were tried and evaluated; the goal was to find the one that minimized both the mean absolute error (MAE) and the root mean squared error (RMSE), while providing the most balanced performance possible across the different product categories and meeting certain practical constraints (such as training time and model size for later deployment). The initial baseline used a linear regression model; a Random Forest regressor and a Gradient Boosting regressor were then explored. Some variations of the input variables were also tested. Given that more data is available for some product categories than others, both under-sampling and over-sampling strategies were tried, but neither significantly improved the model's performance. Finally, a hyperparameter grid search was carried out to assess the impact of the minimum sample split and the maximum depth. The final models used scikit-learn's Random Forest regressor and were based on the product's category and several date-related features, namely the year, month, day, weekday, a cyclic weekday encoding using the sine function, and whether the day falls on a weekend. The hyperparameters chosen were a maximum depth of 12 and a minimum sample split of 5.
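   The sketch below shows what the feature engineering and final model might look like under these choices. The input file and column names are assumptions, and the grid-search candidate values are illustrative; only the final hyperparameters (max_depth=12, min_samples_split=5) come from the text above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Assumed input: one row per product-day with "date", "category"
# and "price" columns; file name and schema are hypothetical.
df = pd.read_csv("competitor_a_prices.csv", parse_dates=["date"])

# Date-related features described above, including the cyclic weekday.
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
df["weekday"] = df["date"].dt.weekday
df["weekday_sin"] = np.sin(2 * np.pi * df["weekday"] / 7)  # Sun wraps to Mon
df["is_weekend"] = (df["weekday"] >= 5).astype(int)

features = ["year", "month", "day", "weekday", "weekday_sin", "is_weekend"]
X = pd.get_dummies(df[features + ["category"]], columns=["category"])
y = df["price"]

# Grid search over the two hyperparameters discussed above
# (candidate values here are illustrative).
grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"max_depth": [8, 12, 16], "min_samples_split": [2, 5, 10]},
    scoring="neg_mean_absolute_error",
    cv=3,
)
grid.fit(X, y)

# Final model with the hyperparameters chosen in the project.
model = RandomForestRegressor(max_depth=12, min_samples_split=5,
                              random_state=42)
model.fit(X, y)
```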

   The models were then deployed, and an API was built, allowing users to obtain predictions of the selling price at these two key competitors for specific products and dates. The API has two endpoints: one returns the forecasted selling prices, and the other can later be used to update the database with the actual price for a given product on a given day. The application performs thorough input validation, to avoid processing and storing erroneous values introduced either by human error or malicious intent. It was built with Flask and connected to a PostgreSQL database, both hosted on Railway. The deployment is no longer available, but you can check the code and further analysis in the project's reports here.
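   A minimal sketch of such a Flask app is shown below. The endpoint names, payload fields, and the sanity bound on the price are hypothetical; the model call and the PostgreSQL insert are stubbed out as comments.

```python
from datetime import date

from flask import Flask, jsonify, request

app = Flask(__name__)

def validate(payload: dict) -> tuple[str, date]:
    """Reject malformed or suspicious input before it reaches the model/DB."""
    product = str(payload.get("product", "")).strip()
    if not product:
        raise ValueError("missing product identifier")
    day = date.fromisoformat(str(payload.get("date", "")))  # raises if malformed
    return product, day

@app.post("/predict")  # endpoint name is an assumption
def predict():
    payload = request.get_json(force=True) or {}
    try:
        product, day = validate(payload)
    except ValueError as exc:
        return jsonify(error=str(exc)), 400
    # model.predict(...) would run here; omitted in this sketch.
    return jsonify(product=product, date=day.isoformat(), price=None)

@app.post("/actuals")  # endpoint name is an assumption
def store_actual():
    payload = request.get_json(force=True) or {}
    try:
        product, day = validate(payload)
        price = float(payload["price"])
        if not (0 < price < 10_000):  # illustrative sanity bound
            raise ValueError("price out of range")
    except (KeyError, ValueError, TypeError) as exc:
        return jsonify(error=str(exc)), 400
    # An INSERT into the PostgreSQL table would go here.
    return jsonify(status="stored"), 201
```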
