PROJECTS

Want to see some of the work from my training projects? It's here!

All reports, summaries and PowerPoint presentations are in French, as the course was taught in French.

PROJECT 2
NOVEMBER 2017

Nutritional data analysis

This first project of the Data Scientist Course covers the Exploration / Cleaning part of the Data Analyst / Data Scientist job.

The dataset contains food products with their nutritional values, and many of those values were missing or incorrect. The purpose of this project was therefore to clean up and filter the incorrect data. In parallel with the cleaning, a univariate and multivariate exploration was carried out to determine the parameters that matter most for healthy dishes (in order to help a company set up healthy recipes).
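
As an illustration of the kind of cleaning described above, here is a minimal sketch in plain Python. The field names (`fat_100g`, `sugars_100g`, `energy_100g`) and the validity bounds are assumptions made for the example, not the rules actually used in the project.

```python
from typing import Optional

# Hypothetical validity bounds per 100 g (energy in kJ); assumptions for this sketch.
BOUNDS = {
    "fat_100g": (0.0, 100.0),
    "sugars_100g": (0.0, 100.0),
    "energy_100g": (0.0, 4000.0),
}


def clean_value(raw: Optional[str], low: float, high: float) -> Optional[float]:
    """Return the value as a float, or None if it is missing, unparsable or out of range."""
    if raw is None or raw.strip() == "":
        return None
    try:
        value = float(raw)
    except ValueError:
        return None
    # NaN and infinities fail this comparison and are rejected as well.
    return value if low <= value <= high else None


def clean_rows(rows):
    """Keep only the rows where every checked field holds a valid value."""
    cleaned = []
    for row in rows:
        values = {
            field: clean_value(row.get(field), low, high)
            for field, (low, high) in BOUNDS.items()
        }
        if all(v is not None for v in values.values()):
            cleaned.append({"name": row.get("name", ""), **values})
    return cleaned
```

Dropping incomplete rows is only one option; imputing missing values is another, and which is preferable depends on how much data remains after filtering.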

DOWNLOAD REPORT
DOWNLOAD POWERPOINT


PROJECT 3
NOVEMBER 2017

Movie recommendation engine

During this second project, the objective was to implement unsupervised algorithms on a dataset of movies. Based on a movie selected by a user, the model must recommend 5 similar movies that the user should also enjoy.

Starting from a simple dataset with only 28 features, a cleaning and data preparation phase was carried out. After that, different clustering algorithms (or manifold learning methods) were tested and evaluated on the relevance of their recommendations for a set of predefined movies.
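
To make the "recommend 5 similar movies" step concrete, here is a minimal sketch that ranks movies by cosine similarity between feature vectors. This is a simple nearest-neighbour stand-in, not the clustering or manifold methods evaluated in the project; the `features` mapping and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


def recommend(title, features, k=5):
    """Return the k titles whose feature vectors are most similar to `title`."""
    query = features[title]
    scored = [
        (cosine_similarity(query, vector), other)
        for other, vector in features.items()
        if other != title
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [other for _, other in scored[:k]]
```

In practice the feature vectors would come out of the cleaning and preparation phase (scaled numeric columns, encoded genres, and so on).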

DOWNLOAD REPORT
DOWNLOAD POWERPOINT


PROJECT 4
DECEMBER 2017

Prediction of flight delays

In this project, we have various information on all flights in the US in 2016. The dataset includes several million flights, each with different information (airline, date and time of departure/arrival, departure and destination airports, etc.). From this dataset, a delay prediction model had to be implemented.

During this project, a restriction on the models was "requested". Despite the amount of data and the relevance of a Recurrent Neural Network for this type of dataset, a linear model was set up. As a comparison, a time-series model (ARIMA) was also tested.

DOWNLOAD SUMMARY
DOWNLOAD REPORT
DOWNLOAD POWERPOINT


PROJECT 5
DECEMBER 2017

Segmentation of customer behavior

From a dataset of items sold to customers, we are asked to segment customer behaviors in order to set up a classifier that can predict, as early as possible, what the behavior of a new customer will be.

As a first step, the articles were clustered in order to build a Bag of Words representation of articles. Then, using various aggregations to create additional features (purchase frequency, average bill amount, time since the last purchase, ...), a new per-customer dataset was set up. Starting from this new dataset, a second clustering was used to segment the customers. After an analysis of the clusters and related behaviors, a classifier was put in place to predict which category a customer belongs to.
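
The per-customer aggregation step could look roughly like the sketch below. The input layout (customer, invoice, date, amount) and the feature names are assumptions for illustration; the project's real columns may differ.

```python
from collections import defaultdict
from datetime import date


def customer_features(purchases, today):
    """Aggregate purchase lines into one feature dict per customer.

    purchases: iterable of (customer_id, invoice_id, purchase_date, amount).
    """
    invoices = defaultdict(lambda: defaultdict(float))  # customer -> invoice -> total
    last_seen = {}  # customer -> date of most recent purchase
    for customer, invoice, day, amount in purchases:
        invoices[customer][invoice] += amount
        if customer not in last_seen or day > last_seen[customer]:
            last_seen[customer] = day

    features = {}
    for customer, bills in invoices.items():
        features[customer] = {
            "n_invoices": len(bills),
            "avg_bill": sum(bills.values()) / len(bills),
            "days_since_last": (today - last_seen[customer]).days,
        }
    return features
```

The resulting dictionaries can then be turned into rows of a feature matrix for the second clustering.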

DOWNLOAD SUMMARY
DOWNLOAD POWERPOINT


PROJECT 6
DECEMBER 2017

Tags prediction for Stack Overflow

Using the Stack Exchange API, thousands of Stack Overflow questions were downloaded with their tags. From this dataset, supervised and unsupervised methods were put in place to build a system that predicts tags for a given question.

After a pre-processing phase (tokenization, lemmatization), the TF and TF-IDF matrices were built. On these, unsupervised methods (Latent Dirichlet Allocation and Non-negative Matrix Factorization) were applied to predict tags using a KNN with the Jensen-Shannon distance. For the supervised method, various regularized models were tested on the TF-IDF matrix, with much better results.
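
For readers unfamiliar with the Jensen-Shannon distance, the sketch below computes it between two topic distributions and uses it in a small KNN vote over tags. The voting scheme, the `corpus` layout and the function names are assumptions for the example; the project's exact procedure is not described here.

```python
import math
from collections import Counter


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence (base-2 logs)."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


def predict_tags(query, corpus, k=3, n_tags=2):
    """Vote over the tags of the k documents closest to `query`.

    corpus: list of (topic_distribution, tags) pairs.
    """
    neighbours = sorted(corpus, key=lambda item: js_distance(query, item[0]))[:k]
    votes = Counter(tag for _, tags in neighbours for tag in tags)
    return [tag for tag, _ in votes.most_common(n_tags)]
```

With base-2 logarithms the distance lies between 0 (identical distributions) and 1 (disjoint supports), which makes it convenient for comparing topic mixtures.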

DOWNLOAD SUMMARY
DOWNLOAD REPORT
DOWNLOAD POWERPOINT


PROJECT 7
JANUARY 2018

Dog breed classification

From a set of 10,000 dog pictures, we are asked to build a classifier that predicts the breed of a dog from its image alone. At first, we have to use only classical methods, and then Convolutional Neural Networks (Transfer Learning and custom CNN).

For the classical methods, a study of SIFT descriptors, colors, moments and image frequencies was done, with very poor results. For the Transfer Learning phase, performance is much better with a classifier such as a neural network. Finally, training from scratch was tested on a few models to evaluate the relevance of Transfer Learning for this type of task.

DOWNLOAD SUMMARY
DOWNLOAD POWERPOINT


PROJECT 8
JANUARY 2018

Technology watch - Recurrent Neural Networks

During this technology watch, I decided to present Recurrent Neural Networks, as no project had used them.

After an explanation of the principle and use cases, I went through the state of the art, from the Simple RNN (1970s) and Long Short-Term Memory (1997 & 2000) to the Gated Recurrent Unit (2014) and the Quasi-Recurrent Neural Network (2017). All of them were compared on a fixed dataset in terms of performance and training time.

DOWNLOAD SUMMARY
DOWNLOAD REPORT
DOWNLOAD POWERPOINT


PROJECT 9
FEBRUARY 2018

Kaggle Competition - Image segmentation

To finish the program, we have to participate in an open Kaggle competition. I decided to work on a competition where we are asked to perform image segmentation of cell nuclei.

After the pre-processing step, a first model based on image processing was tried as a baseline. After that, the U-Net model (the reference model for this task, winner of an ISBI challenge in 2015) was adapted to this specific task. The result was not as good as expected, with training stability issues. As a result, I tried an extended U-Net which takes multiple inputs with different pre-processing steps to help the model segment the image. The result was clearly better. To finish, a post-processing step was implemented to extract every nucleus and prepare the submission file.
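
One common way to "extract every nucleus" from a predicted binary mask is connected-component labelling. The sketch below is a plain-Python version using 4-connectivity; it is an assumption about what such a post-processing step might look like, not the project's actual code, and `label_nuclei` is a hypothetical name.

```python
from collections import deque


def label_nuclei(mask):
    """Label 4-connected foreground regions of a binary mask.

    mask: list of rows, each a list of 0/1 values.
    Returns (labels, count), where labels has the same shape as mask,
    background is 0 and each region gets an integer id starting at 1.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    labels = [[0] * width for _ in range(height)]
    count = 0
    for r in range(height):
        for c in range(width):
            if mask[r][c] and labels[r][c] == 0:
                count += 1
                labels[r][c] = count
                queue = deque([(r, c)])
                # Breadth-first flood fill from the first unlabelled pixel.
                while queue:
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        inside = 0 <= ny < height and 0 <= nx < width
                        if inside and mask[ny][nx] and labels[ny][nx] == 0:
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count
```

Touching nuclei end up merged into one region with this approach, so a real pipeline usually needs an extra separation step before writing the submission file.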

DOWNLOAD SUMMARY
DOWNLOAD REPORT
DOWNLOAD POWERPOINT


PROJECT 10
MARCH 2018

Segmentation of customer behavior (V2)

During Project 6, I discovered some algorithms for working with text data (LDA, LSA and NMF). In Project 5, I knew that a major drawback of my model was the clustering done on articles. As a result, I redid the project after the classes with this new knowledge, and the result is clearly better. The notebook has been shared on Kaggle.

LINK


PROJECT 11
MARCH 2018

Python implementation of DOSNES

In the Movie Explorer project, I wanted to represent data on a sphere. The only algorithm I found that does this was DOubly Stochastic Neighbor Embedding on a Sphere (DOSNES), published by Yao Lu on arXiv in Sep. 2016. As no Python implementation existed, I re-created it from the paper and the authors' MATLAB implementation, using sklearn method naming. The algorithm doesn't perform all checks yet, but it works.
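
The "doubly stochastic" part of the name refers to a similarity matrix whose rows and columns all sum to 1. A standard way to obtain one from a strictly positive square matrix is Sinkhorn-Knopp iteration, sketched below in plain Python. Whether this matches the exact normalization in the paper or in my implementation is an assumption of this example, and the function name is hypothetical.

```python
def sinkhorn_knopp(matrix, n_iter=200):
    """Scale a strictly positive square matrix towards a doubly stochastic one.

    Alternately normalises rows and columns; the input is left unchanged.
    """
    m = [row[:] for row in matrix]
    n = len(m)
    for _ in range(n_iter):
        # Make every row sum to 1.
        for row in m:
            total = sum(row)
            for j in range(n):
                row[j] /= total
        # Make every column sum to 1.
        for j in range(n):
            total = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= total
    return m
```

A fixed iteration count keeps the sketch short; a convergence test on the row sums would be the more careful choice.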

LINK


PROJECT 12
APRIL 2018

Movie Recommender in 3D

This project uses Machine Learning algorithms to represent movies in a 3D space, based on their similarities. From a dataset provided by IMDb, a subset of movies was extracted. The data was completed and improved using the OMDb API. After a cleanup and preparation phase, the t-SNE and DOSNES algorithms were used to represent movies in a lower-dimensional space. The rendering was done in JavaScript using WebGL with three.js.

LINK
WEBSITE URL