
    1 Introduction

    1.1 Motivation

    During the pandemic, cooking at home has become increasingly popular among the younger generation, but it also raises a question: how to cook healthily. Unlike packaged products, raw ingredients such as meat and vegetables do not come with nutrition facts. Some applications and websites help track what you eat, but most of them require manually typing in ingredient names, which is tedious and time-consuming. Kitchen Alchemy is here to help: you only need to upload a picture of what you eat to our website, and we do the rest. We also provide a chart that visualizes the nutrition facts.

    1.2 Project overview

    Our project, Kitchen Alchemy, uses neural networks for image recognition. Unlike the traditional tracking process, our product provides all the necessary nutrition facts conveniently and efficiently. With our program, people can learn more about the nutrition facts of their food and develop healthier eating habits.

    2 Dataset Collection and Processing

    2.1 Dataset Description

    2.1.1 Dataset Source

    We started with multiple candidate datasets from Kaggle, ImageNet, and Google. After comparing them, we chose Fruits-360 from Kaggle as our main raw dataset [1]. To build that dataset, each fruit or vegetable was mounted on the shaft of a low-speed motor (3 rpm) and a 20-second movie was recorded. Because the lighting and background varied, the dataset authors extracted each object from its background. Fruits-360 was first released on 02/25/2017 and has been actively updated; we used the newest version at the time, 2020.05.18.0.


    The nutrition facts dataset was obtained from the USDA and other datasets found online. For fruits that were missing from those sources, we searched for the values manually and entered them by hand.

    2.1.2 Dataset characteristics (the original dataset)

    Total number of images:   90,483
    Training set size:        67,692
    Test set size:            22,688
    Image size:               100x100 pixels
    Total number of classes:  131

    2.2 Data preprocessing

    Some fruits in the dataset are not commonly seen in daily life, so we merged those varieties into broader classes (for example, grouping Kumquats with Orange) to make training easier. After cleaning, we have 28 classes in total for both training and testing. Because different classes were combined, the number of images per class varies from roughly 10^3 to 11x10^3, depending on the class.
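    A minimal sketch of this merging step is shown below. The folder names and mapping entries are illustrative; our real table maps all 131 original Fruits-360 varieties onto the 28 merged classes.

        import shutil
        from pathlib import Path

        # Hypothetical mapping from original variety folders to merged classes;
        # the full mapping covers all 131 Fruits-360 varieties.
        MERGE_MAP = {
            "Kumquats": "Orange",
            "Apple Red 1": "Apple",
            "Apple Golden 1": "Apple",
        }

        def merge_classes(src_root, dst_root):
            """Copy images from each variety folder into its merged class folder."""
            for variety_dir in Path(src_root).iterdir():
                merged = MERGE_MAP.get(variety_dir.name, variety_dir.name)
                dst_dir = Path(dst_root) / merged
                dst_dir.mkdir(parents=True, exist_ok=True)
                for img in variety_dir.glob("*.jpg"):
                    # Prefix with the variety name to avoid file-name collisions.
                    shutil.copy(img, dst_dir / f"{variety_dir.name}_{img.name}")

        merge_classes("fruits-360/Training", "data/train")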

    After cleaning the dataset, we needed to connect the nutrition facts table to the classes. In the nutrition chart, each class is associated with 12 nutrient categories.
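    The lookup itself is a simple join from a predicted class label to its row in the nutrition table. A sketch, assuming a hypothetical nutrition_facts.csv with one row per class and the 12 nutrient columns:

        import pandas as pd

        # Hypothetical file: one row per merged class, 12 nutrient columns
        # (e.g. calories, protein, fat, carbohydrates, fiber, sugar, ...).
        nutrition = pd.read_csv("nutrition_facts.csv", index_col="class")

        def nutrition_for(predicted_class):
            """Return the 12 nutrient values for a predicted class label."""
            return nutrition.loc[predicted_class].to_dict()

        print(nutrition_for("Orange"))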


    3 Method

    3.1 Algorithms and Models

    3.1.1 Convolutional Neural Network – A Transfer Learning Approach

    When using a CNN to tackle a typical image classification task, there are usually two approaches: build a network from scratch, or transfer-learn from a pre-trained model. We chose the latter for a few reasons. First, transfer learning takes advantage of the pre-trained model's structure, weights, and hyperparameters. Second, training a transferred model is more time-efficient than training a new one: transferred models often have fewer trainable weights, which lowers the hardware requirements for training.

    In our project, we prototyped with the Python libraries TensorFlow and Keras, which make manipulating the network much easier. We chose the ImageNet-pretrained MobileNetV2 as our starting point: MobileNetV2 has a considerably smaller model size and weight count than other popular CNNs such as VGG16/19 and InceptionV3. To fit our task, the last layer of the original MobileNetV2 (a dense layer with 1000 output nodes) is removed and replaced with our custom layers, described below.

    To further prevent overfitting, we also applied L2 regularization to both dense layers.
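    A minimal Keras sketch of this setup, assuming a 28-class softmax head; the layer widths and the L2 strength shown here are illustrative, not our tuned values:

        import tensorflow as tf
        from tensorflow.keras import layers, models, regularizers

        NUM_CLASSES = 28

        # MobileNetV2 pre-trained on ImageNet, without its 1000-node top layer.
        base = tf.keras.applications.MobileNetV2(
            input_shape=(100, 100, 3), include_top=False, weights="imagenet")
        base.trainable = False  # keep the transferred weights frozen

        model = models.Sequential([
            base,
            layers.GlobalAveragePooling2D(),
            # The two custom dense layers, each with L2 regularization.
            layers.Dense(128, activation="relu",
                         kernel_regularizer=regularizers.l2(1e-4)),
            layers.Dense(NUM_CLASSES, activation="softmax",
                         kernel_regularizer=regularizers.l2(1e-4)),
        ])

        model.compile(optimizer="adam", loss="categorical_crossentropy",
                      metrics=["accuracy"])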

    Meanwhile, augmentation was applied to the training dataset. With the ImageDataGenerator class from Keras, the images fed into the network include random zooms, shifts in the width-to-height ratio, and so on. This helps the model generalize better to real-world images.
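    Continuing the sketch above, the augmentation and training step might look like this; the augmentation ranges, directory layout, and validation split are assumptions:

        from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
        from tensorflow.keras.preprocessing.image import ImageDataGenerator

        # Random zooms and shifts approximate real-world photo variation.
        datagen = ImageDataGenerator(
            preprocessing_function=preprocess_input,
            zoom_range=0.2,
            width_shift_range=0.1,
            height_shift_range=0.1,
            horizontal_flip=True,
            validation_split=0.2,
        )

        train_flow = datagen.flow_from_directory(
            "data/train", target_size=(100, 100),
            class_mode="categorical", subset="training")
        val_flow = datagen.flow_from_directory(
            "data/train", target_size=(100, 100),
            class_mode="categorical", subset="validation")

        model.fit(train_flow, validation_data=val_flow, epochs=20)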

    With this network structure, training for 20 epochs takes about 50 minutes on a GTX 1060 6GB Max-Q. The final model reaches an accuracy of 96% and a cross-entropy loss of 0.38 on the validation dataset. We then tested the trained model on images outside the validation dataset, and it performed within our expectations.


    3.1.2 Faster R-CNN and Region Proposal Network

    To offer a better user experience by letting users directly photograph their fridge or arranged cooking ingredients, we decided to also implement a segmentation stage whose output feeds the recognition network for nutrition values. Since the user data is photo-based, there is no real-time requirement for image segmentation, so we implemented an improved branch of the R-CNN family, Faster R-CNN. Faster R-CNN uses anchor boxes and a Region Proposal Network (RPN) that runs in parallel with the CNN feature extraction to propose ROIs likely to contain food. For the feature extraction network, which serves as the input to both the RPN and the ROI pooling layer, we used InceptionResNetV2, which should provide decent accuracy without incurring heavy computation cost. However, due to time constraints we were unable to integrate Faster R-CNN and the RPN into our project's final submission.
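    To illustrate the anchor-box idea (a generic illustration, not our Faster R-CNN code), anchors at every feature-map cell can be enumerated from a set of scales and aspect ratios; the RPN then scores and refines each one:

        import numpy as np

        def generate_anchors(fm_h, fm_w, stride=16,
                             scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
            """Enumerate (x1, y1, x2, y2) anchor boxes in image coordinates
            for every cell of an fm_h x fm_w feature map."""
            anchors = []
            for y in range(fm_h):
                for x in range(fm_w):
                    cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # cell center
                    for s in scales:
                        for r in ratios:
                            w, h = s * np.sqrt(r), s / np.sqrt(r)  # area ~ s^2
                            anchors.append([cx - w / 2, cy - h / 2,
                                            cx + w / 2, cy + h / 2])
            return np.array(anchors)

        # A 38x50 feature map with 9 anchors per cell yields 17,100 candidates.
        print(generate_anchors(38, 50).shape)  # (17100, 4)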

    4 Results and Discussion

    4.1 Results presentation

    On the sidebar, a "Choose File" button appears under "View Your Report"; that is where the picture of the fruit is uploaded.


    After a few seconds, the predicted result appears on the right, with the data visualization at the bottom.
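    The "sidebar" and "Choose File" flow suggests a Streamlit-style front end; the sketch below is a hypothetical reconstruction of that flow, not our exact website code. It reuses model and nutrition_for from the earlier sketches, and class_names is assumed to hold the 28 merged class names in training order:

        import numpy as np
        import pandas as pd
        import streamlit as st
        from PIL import Image
        from tensorflow.keras.applications.mobilenet_v2 import preprocess_input

        st.sidebar.header("View Your Report")
        uploaded = st.sidebar.file_uploader("Choose File", type=["jpg", "jpeg", "png"])

        if uploaded is not None:
            # Match the training input size and preprocessing.
            img = Image.open(uploaded).convert("RGB").resize((100, 100))
            x = preprocess_input(np.array(img, dtype="float32")[np.newaxis])
            probs = model.predict(x)[0]                 # model from Section 3.1.1
            label = class_names[int(np.argmax(probs))]  # 28 merged class names
            st.write(f"Prediction: {label}")
            # Visualize the 12 nutrient categories for the predicted class.
            st.bar_chart(pd.Series(nutrition_for(label)))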


    4.2 Limitations and Future Work

    Like most CNN models, ours has limited prediction accuracy for certain fruits that are similar in shape and color: occasionally the model predicts a red apple as a tomato, or an orange as a potato. This is mainly a limitation of the training dataset. As mentioned in the dataset section, the images show the same fruit from different perspectives, so the color, texture, and shape learned from the dataset are limited to that particular fruit or vegetable and sometimes do not generalize well to the varied appearance of real-world objects.

    4.3 Conclusion

    Overall, the main limitation we face right now is the number and variety of images in our dataset, so the next step we are looking toward is gathering more pictures for our classes, which may improve the accuracy of the current classes. Beyond that, we will try other models with more parameters and see whether they help performance. We are also considering adding or finishing other features mentioned in the proposal, such as recommendations.

    5 References

    [1] Muresan, H., & Oltean, M. (2018). Fruit recognition from images using deep learning. Acta Universitatis Sapientiae, Informatica, 10(1), 26-42.

    [2] Song, Y., Elkahky, A. M., & He, X. (2016). Multi-rate deep learning for temporal recommendation. Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '16). doi:10.1145/2911451.2914726

    [3] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84-90. doi:10.1145/3065386

    [4] Xie, S., Kirillov, A., Girshick, R., & He, K. (2019). Exploring randomly wired neural networks for image recognition. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). doi:10.1109/ICCV.2019.00137

    6 Contribution

    Jiajun Mao: Segmentation

    Yue Pan: Model training and tuning

    Wei Xin: Website development, User Interface

    Lisha Yang: Nutrient Fact chart, User Interface

    Tongshu Yang: Data Cleaning