The goal of the project was to recognize landmarks in photographs using machine learning, namely convolutional neural networks. This topic was chosen for the following reasons:
I already had some experience with computer vision tasks
The task sounded as if it could be done quickly, without much effort and, importantly, without a lot of computing resources (all networks were trained in Colab or on Kaggle)
The problem could have some practical application (well, in theory…)
At first it was planned as a purely educational project, but then I got hooked on the idea and decided to take it as far as I could.
In what follows, I will talk about how I approached this task, following the code from the notebook where all the magic happened and explaining some of my decisions along the way. Maybe this will help someone get over their fear of the “blank slate” and see that this kind of thing is really easy to do!
Tools
First things first, let me tell you about the tools I used for this project.
Colab/Kaggle: used to train networks on GPUs.
Weights & Biases: a service where I saved the models and their descriptions and logged losses, metric values, training parameters and preprocessing settings. In short, I kept complete records. You can read the data here. The metadata section changed slightly while I was writing the code; it contains the training and preprocessing parameters. In the files section you can find a description of each network (how its layers are arranged), download the trained weights and look at the loss and metric values.
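To give an idea of what this kind of record keeping can look like, here is a minimal sketch. It is not the code from my notebook: the helper names `build_run_config` and `log_run`, the project name and all parameter values are made up for illustration. The only Weights & Biases calls it relies on are `wandb.init`, `run.log` and `run.finish`.

```python
from typing import Dict, Iterable, Optional


def build_run_config(preprocessing: Dict, training: Dict) -> Dict:
    """Merge preprocessing and training parameters into one flat dict.

    The "preprocessing/" and "training/" prefixes are an illustrative
    convention, not something Weights & Biases requires.
    """
    config = {f"preprocessing/{key}": value for key, value in preprocessing.items()}
    config.update({f"training/{key}": value for key, value in training.items()})
    return config


def log_run(
    project: str,
    config: Dict,
    history: Iterable[Dict[str, float]],
    mode: Optional[str] = None,
) -> None:
    """Create a run, store the config as metadata and log metrics per epoch."""
    # Imported here so build_run_config can be used without wandb installed.
    import wandb

    run = wandb.init(project=project, config=config, mode=mode)
    for epoch_metrics in history:
        # One call per epoch, e.g. {"loss": 1.2, "accuracy": 0.4}
        run.log(epoch_metrics)
    run.finish()


# Hypothetical values, only to show the shape of the data.
config = build_run_config(
    preprocessing={"image_size": 224, "normalize": True},
    training={"epochs": 10, "lr": 1e-3},
)
```

Calling `log_run("landmarks-demo", config, history)` with a list of per-epoch metric dicts would then create one run; passing `mode="offline"` keeps everything local, which is handy for a dry run.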
Training data
Well, I should probably start with choosing the data to train the neural network on. For this I searched for datasets on Kaggle (see here), and this site caught my eye.
As it turned out, Google runs a competition devoted specifically to landmark recognition. This is where the first problem came up: the dataset weighs about 100 GB. Since I knew I would not be training the networks on my own machine, I had to give up on this option. After some more research, I settled on this dataset. It contains 210 classes and about 50 pictures per class. The pictures come in different sizes and were taken from different angles and from different distances. In general, the dataset is not cleaned at all, and so far it is the only data I have worked with.
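A quick way to check figures like "210 classes, about 50 pictures per class" is to count the files on disk. The sketch below assumes the dataset is unpacked as one subfolder per class; I am not claiming that is the exact layout of this dataset. The function names and the list of extensions are my own choices for illustration.

```python
import os
from collections import Counter

# Assumption: images use one of these extensions.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def count_images_per_class(root):
    """Return a Counter mapping class name (subfolder) to image count."""
    counts = Counter()
    for entry in sorted(os.listdir(root)):
        class_dir = os.path.join(root, entry)
        if not os.path.isdir(class_dir):
            continue  # skip stray files next to the class folders
        counts[entry] = sum(
            1
            for name in os.listdir(class_dir)
            if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS
        )
    return counts


def summarize(counts):
    """Condense per-class counts into a few headline numbers."""
    values = list(counts.values())
    return {
        "classes": len(values),
        "images": sum(values),
        "min_per_class": min(values) if values else 0,
        "max_per_class": max(values) if values else 0,
    }
```

Running `summarize(count_images_per_class("path/to/dataset"))` on the unpacked folder shows the number of classes and how uneven they are, which is worth knowing before any training starts.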