On Loss Functions for Deep Neural Networks in Classification

Katarzyna Janocha,

Wojciech Marian Czarnecki

Abstract
Deep neural networks are currently among the most commonly used classifiers. Besides easily achieving very good performance, one of the best selling points of these models is their modular design: one can conveniently adapt their architecture to specific needs, change connectivity patterns, attach specialised layers, and experiment with a large number of activation functions, normalisation schemes and many other components. While one can find an impressively wide spread of configurations for almost every aspect of deep nets, one element is, in the authors' opinion, underrepresented: when solving classification problems, the vast majority of papers and applications simply use log loss. In this paper we investigate how particular choices of loss function affect deep models and their learning dynamics, as well as the robustness of the resulting classifiers to various effects. We perform experiments on classical datasets and provide some additional theoretical insights into the problem. In particular, we show that L1 and L2 losses are, quite surprisingly, justified classification objectives for deep nets, by providing a probabilistic interpretation in terms of expected misclassification. We also introduce two losses which are not typically used as deep net objectives and show that they are viable alternatives to the existing ones.
Keywords: loss function, deep learning, classification theory.
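
The link between the L1 loss and expected misclassification mentioned in the abstract can be illustrated with a small numerical sketch. It is not taken from the paper's experiments. It assumes a softmax output, a one-hot target, and a predicted label sampled from the output distribution; under those assumptions the L1 loss equals twice the probability of a wrong label. The logits, the label and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(label, num_classes):
    """One-hot encoding of an integer class label."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def log_loss(probs, label):
    """Cross-entropy (log loss) for a single example."""
    return -math.log(probs[label])


def l1_loss(probs, label):
    """L1 distance between the output distribution and the one-hot target."""
    target = one_hot(label, len(probs))
    return sum(abs(p - t) for p, t in zip(probs, target))


def l2_loss(probs, label):
    """Squared L2 distance between the output distribution and the target."""
    target = one_hot(label, len(probs))
    return sum((p - t) ** 2 for p, t in zip(probs, target))


# Hypothetical example: three classes, true class is 0.
logits = [2.0, 0.5, -1.0]
label = 0
probs = softmax(logits)

# Probability of drawing a wrong label when sampling from `probs`.
p_error = 1.0 - probs[label]

print("log loss:", log_loss(probs, label))
print("L1 loss: ", l1_loss(probs, label))
print("L2 loss: ", l2_loss(probs, label))
print("2 * P(error):", 2.0 * p_error)
```

In this setting the L1 value and `2 * p_error` agree up to floating-point error, because the target class contributes `1 - p_y` and the remaining classes together contribute the same amount.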
