At LinkedIn, machine learning is everywhere, from People You May Know and the Job Recommendation System to the News Feed. Most existing models were built on traditional machine learning algorithms, which cannot capture deep and complex relationships such as those found in language processing and in image and video understanding. Deep learning has been gaining momentum and popularity as a way to discover and learn these relationships; however, infrastructure support for deep learning is still in its early stages.
LinkedIn's offline infrastructure is based on Hadoop, and there has been no mature solution for running distributed TensorFlow jobs on Hadoop. TonY was born to solve this problem. TonY makes it easy and effective to run distributed deep learning jobs with GPUs on Hadoop clusters. It supports both Hadoop 2.x and 3.x, so it is compatible with most Hadoop clusters.
Keqiu is a staff software engineer in the data analytics platform group at LinkedIn. He currently leads efforts on cluster resource management systems and deep learning training infrastructure. Before joining the big data charter, he worked on mobile infrastructure and was responsible for LinkedIn's mobile continuous delivery systems.