Microsoft Principal Researcher Yu Dong on the Latest Progress in Deep Learning

Leifeng.com: By Zhou Jianding of CSDN. This article covers 1) the latest developments and future trends in deep-learning-based speech recognition, and 2) Microsoft's open-source deep learning toolkit CNTK.

As an important direction in artificial intelligence, speech recognition has made significant breakthroughs in recent years thanks to deep learning, laying the technical foundation for interactive voice applications. The evolution of speech recognition technology, and the methods and results behind it, are knowledge that practitioners need to master and that developers of intelligent applications should understand. Recently, Yu Dong, Principal Researcher at Microsoft Research and first author of "Analyzing Deep Learning: Speech Recognition Practice," gave an exclusive interview to CSDN in which he analyzed the latest technical directions in deep-learning-based speech recognition, shared his team's practical experience, and introduced the iteration plans for Microsoft's open-source deep learning toolkit CNTK.

Yu Dong introduced the latest deep learning models for speech recognition, such as deep CNNs, LFMMI, deep clustering, PIT, and RNN generative models, as well as technologies such as transfer learning and computational networks (CNs) covered in "Analyzing Deep Learning: Speech Recognition Practice." He said that recognition in more difficult environments (such as far-field, high-noise, or accented speech) will be the next problem to solve, and that his team is currently focusing on models with stronger recognition ability across scenarios, such as deep CNNs, and models that can improve far-field recognition rates, such as PIT.

As a researcher, Yu Dong also pays close attention to engineering practicality. The importance of the problem, the potential of the research direction, the generality of the solution, and engineering convenience are his four main criteria for choosing a research direction. From an engineering perspective, he believes that applying computational networks in speech recognition requires considering training difficulty, model size, and runtime speed, latency, and energy consumption. These are in fact the core goals of CNTK's future iterations.

In addition, he noted that deep learning is only one of many artificial intelligence technologies; it is mainly good at nonlinear feature extraction and end-to-end gradient-based optimization. It cannot solve many practical problems on its own, and the best solutions often combine multiple technologies organically.

Yu Dong joined Microsoft in 1998 and is currently a Principal Researcher at Microsoft Research, as well as an adjunct professor at Zhejiang University and a visiting professor at the University of Science and Technology of China. A senior expert in speech recognition and deep learning, he has published two monographs and more than 160 papers, is the inventor of 60 patents, and is one of the initiators and lead authors of the open-source deep learning toolkit CNTK. He won the 2013 IEEE Signal Processing Society Best Paper Award. He is currently a member of the IEEE Speech and Language Processing Technical Committee and has served on the editorial boards of IEEE/ACM Transactions on Audio, Speech, and Language Processing and IEEE Signal Processing Magazine.

The following is a transcript of the interview:

| Latest Developments in Speech Recognition

CSDN: Can you introduce some of the most exciting developments in speech recognition today, and some of the interesting work you are currently doing?

Yu Dong: Since we successfully introduced deep neural networks into large-vocabulary speech recognition systems in 2010, speech recognition research and applications have entered the deep learning era. The field has developed much faster in recent years than we expected, with new and more effective models and methods proposed every year.

Over the past year, several pieces of work have been particularly interesting to me.

The first is the successful application of deep CNNs to large-vocabulary speech recognition, reported by research institutions including IBM, Microsoft, and Shanghai Jiao Tong University. Previously, we generally used convolutional layers only at the bottom of the network. Under that framework, convolutional networks greatly increased the computational load, but their advantage in recognition accuracy was not obvious, so we did not spend much time on this work in the book. But when we applied the kind of deep convolutional networks used in image recognition, such as VGG, GoogLeNet, and ResNet, the recognition rate improved greatly, even surpassing the best-performing deep bidirectional LSTMs. Because of its latency, a bidirectional LSTM cannot be used in real-time systems, whereas the latency of a deep convolutional network is relatively small and controllable, so it can be used in real-time speech recognition.
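To illustrate why the latency of a deep convolutional network is "small and controllable," here is a minimal sketch (an illustration, not Microsoft's implementation) of how the lookahead of a stack of centered convolutions along the time axis adds up, layer by layer:

```python
def lookahead_frames(kernel_sizes, dilations=None):
    """Total future context (in frames) required by a stack of centered
    1-D time convolutions: each layer adds (kernel - 1) // 2 * dilation
    frames of lookahead."""
    if dilations is None:
        dilations = [1] * len(kernel_sizes)
    return sum((k - 1) // 2 * d for k, d in zip(kernel_sizes, dilations))

# Ten stacked 3-frame convolutions need only 10 future frames
# (about 100 ms at a 10 ms frame shift) -- fixed and small, whereas a
# bidirectional LSTM must see the whole utterance before emitting output.
print(lookahead_frames([3] * 10))  # -> 10
```

The latency is thus a design knob: choosing kernel sizes and depth bounds the lookahead in advance, which is what makes such networks usable in streaming recognition.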

The second is lattice-free MMI (LFMMI), led by Dr. Dan Povey of Johns Hopkins University. Building a traditional speech recognition system involves many steps, and to make the construction process more robust, many researchers in recent years have tried to build recognition systems directly through end-to-end optimization, eliminating the intermediate steps. The most influential such work is the LSTM-based Connectionist Temporal Classification (CTC) model, which both Google and Baidu report having applied successfully. However, as far as we know, using CTC directly requires a great deal of tuning, takes longer to build overall, and otherwise performs poorly; in other words, the method is hard to reproduce on new tasks. LFMMI evolved from the traditional MMI sequence training method. It borrows some ideas from CTC and can also be trained end to end, but its construction process is easier to reproduce and more robust.
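For readers unfamiliar with CTC, its key ingredient is a many-to-one mapping from frame-level label paths to label sequences: consecutive repeats are merged, then blank symbols are removed. A minimal sketch of that collapsing rule (illustrative only, not a full CTC implementation):

```python
def ctc_collapse(path, blank=0):
    """Apply CTC's many-to-one mapping to a frame-level label path:
    merge consecutive repeated labels, then drop blank symbols."""
    out = []
    prev = None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# The frame-level path "a a - a b b" (with "-" the blank) collapses
# to "a a b": the blank keeps the two a's distinct.
print(ctc_collapse([1, 1, 0, 1, 2, 2]))  # -> [1, 1, 2]
```

Training with CTC maximizes the total probability of all frame-level paths that collapse to the reference transcript, which is what removes the need for frame-level alignments.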

The third and fourth are work toward solving the cocktail-party problem, a difficult but important problem in speech recognition. Two recent lines of work have let us see the dawn of a solution.

One is the deep clustering method proposed by Dr. John Hershey of MERL. Their method maps each time-frequency bin of the mixed speech, together with its context, to a new embedding space in which bins belonging to the same speaker lie close together and can therefore be clustered.

The other is permutation invariant training (PIT), which we proposed in cooperation with Aalborg University. PIT optimizes speech separation by automatically finding the best match between the separated signals and the annotated source signals. Both methods have their merits. My personal view is that PIT has more potential; the ultimate solution may be some improvement on PIT or some combination of the two methods.
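The core idea of PIT can be sketched in a few lines: evaluate the training loss under every pairing of model outputs to reference speakers and keep the smallest, so the network is never penalized for emitting the sources in a different order. A minimal NumPy illustration with a mean-squared-error loss over hypothetical two-source signals (not the actual PIT implementation):

```python
import itertools

import numpy as np

def pit_mse(estimates, references):
    """Permutation-invariant loss: the MSE under the best pairing of
    estimated and reference sources.
    estimates, references: arrays of shape (num_sources, num_frames)."""
    n = len(references)
    best = float("inf")
    for perm in itertools.permutations(range(n)):
        mse = np.mean((estimates[list(perm)] - references) ** 2)
        best = min(best, mse)
    return best

est = np.array([[0.0, 1.0], [1.0, 0.0]])
ref = np.array([[1.0, 0.0], [0.0, 1.0]])
# The identity pairing gives MSE 1.0, but swapping the two outputs
# matches the references exactly, so PIT assigns zero loss.
print(pit_mse(est, ref))  # -> 0.0
```

Enumerating permutations is feasible because the number of simultaneous speakers is small; the gradient then flows only through the winning assignment.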

In addition, there has recently been some progress in recognition methods based on RNN generative models (such as the sequence-to-sequence model), but overall this research is still at a preliminary stage.

I myself have recently worked on three things:

One is deep CNNs. We discovered the superior performance of deep CNNs in large-vocabulary speech recognition at about the same time as several other research groups.

Another is PIT-based speech separation. I am the leader of and a major contributor to this work.

The third is recognition based on RNN generative models. We have some new ideas, but this work is still in its infancy.

| Deep learning and speech recognition

CSDN: In a nutshell, apart from feature extraction, what role does deep learning play in speech recognition?

Yu Dong: The most important role of deep learning is still feature extraction; even a deep CNN can be seen as a more complex feature extractor. But as you can imagine, its role is not limited to that. For example, the prediction-adaptation-correction (PAC) model we proposed two years ago builds behaviors such as prediction, adaptation, and correction directly into the model. As another example, the noise-aware and speaker-aware adaptation methods mentioned in the book can be implemented directly through adaptive modeling of the network structure. And the PIT model performs speech separation directly inside the deep learning model, while recognition methods based on CTC and RNN generative models produce the recognition result directly from the deep learning model.

CSDN: The book "Analyzing Deep Learning: Speech Recognition Practice," which you wrote with Dr. Deng Li, systematically introduces DNN-based speech recognition technology. Who should read this book, and what will they gain? Is it suitable for beginners? What background knowledge do readers need?

Yu Dong: This book will be helpful to scholars, students, and engineers who are, or would like to be, engaged in speech recognition research or engineering practice, and that is exactly why we wrote it. In it we attempt to lay out the entire framework and major technologies of deep-learning-based speech recognition. Because we have long worked on the front line of research, we are able to provide basic ideas, detailed mathematical derivations, and implementation details and experience. We hope the book can serve as a reference for everyone: different readers can find in it what they want to know. For researchers outside speech recognition, the methods and ideas in the book will also help them solve their own problems, because those methods are general.

The book is equally suitable for beginners. In fact, universities in both North America and Japan have adopted it as a textbook or reference book for undergraduate or graduate speech processing courses. To read it, readers need only basic knowledge of calculus, probability theory, and matrix analysis. Of course, a background in machine learning will give you a better grasp of some of the basic concepts involved, and knowledge of traditional speech recognition systems will help you understand the overall framework of a recognition system and sequence-level discriminative training.

CSDN: You introduce many methods for improving robustness. Which is your favorite?

Yu Dong: From a practical point of view, methods based on auxiliary information, such as noise-aware and speaker-aware models, and adaptation methods based on SVD and KLD regularization are currently the simplest and most effective.

CSDN: The book devotes particular attention to transfer learning and gives several examples, such as the successful transfer from European languages to Mandarin Chinese. What factors determine the boundaries and limitations of the shared-hidden-layer DNN architecture in speech recognition today? What are the challenges of transfer learning in this field?

Yu Dong: Theoretically, multilingual transfer learning based on a shared-hidden-layer DNN architecture has no essential limitation, because you can always find a level at which speech features are very similar even across very different languages, such as Spanish and Chinese. From an engineering perspective, however, there is a trade-off. Transfer learning generally pursues two goals: to learn a new task (here, a new language) quickly, and to reduce the data needed to learn it. So if a new language has enough data and computing resources are not a problem, direct training may work better, because a converged model is harder to adapt to a new language, just as an adult finds it harder to learn a new language than a young child does. But if data or computing resources are lacking, multilingual transfer learning based on the shared-hidden-layer architecture will help your final system.
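The shared-hidden-layer idea can be sketched as follows: hidden layers trained on the source languages are frozen and reused as a language-independent feature extractor, and only a new output layer is trained for the target language. A toy NumPy illustration with hypothetical layer sizes (random weights stand in for trained ones; this is a structural sketch, not a trainable system):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical shared hidden layers, "trained" on source languages
# (random here, purely for illustration).
W1 = rng.standard_normal((40, 256))   # input: 40-dim acoustic features
W2 = rng.standard_normal((256, 256))

def shared_features(x):
    """The language-independent part: frozen hidden layers."""
    return relu(relu(x @ W1) @ W2)

# For a new language, only a fresh softmax layer over its output units
# (e.g. senones) is trained on top of the frozen features.
W_out_new = rng.standard_normal((256, 1000))  # 1000 units, new language

def posteriors(x):
    logits = shared_features(x) @ W_out_new
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

p = posteriors(rng.standard_normal((5, 40)))  # 5 frames of features
print(p.shape)  # -> (5, 1000)
```

Because only `W_out_new` is trained, the new language needs far less data; with abundant data, unfreezing the shared layers (or training from scratch) may work better, as discussed above.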

It is worth pointing out that transfer learning based on a shared-hidden-layer architecture can also be used in hot-word detection to support user-defined wake words. There are similar applications in image recognition: for example, the hidden layers of a classifier trained on ImageNet can be reused for image captioning or other image classification tasks. Note that what we are discussing here is feature-level transfer; transfer at other levels is also possible but much more difficult.

CSDN: The book also emphasizes the role of the computational network (CN) in new speech recognition systems. What issues deserve attention when working with CNs? Is the currently popular LSTM RNN overrated?

Yu Dong: From the perspective of academic research, the most important thing is to analyze the relationships among the variables of the model and then use a computational network to realize those relationships. From an engineering perspective, one must also consider training difficulty, model size, and runtime speed, latency, and energy consumption.

LSTM still plays an important role in many models. However, we have found that other models can come close to or even exceed LSTM performance on certain problems. For example, the deep CNNs mentioned above can surpass LSTMs on some speech recognition tasks, while GRUs and RNNs based on rectified linear units perform similarly to LSTMs on many sequence problems but are simpler.

| Future technology trends

CSDN: Among the future technical directions of speech recognition, which is your team focusing on? How do you decide the direction of a piece of research?

Yu Dong: We believe that recognition in more difficult environments, such as far-field, high-noise, or accented speech, will be the next problem to solve, and our research focuses on these aspects. We are currently concentrating on models with stronger recognition ability across scenarios, such as deep CNNs, and models that can improve far-field recognition rates, such as PIT. We are also watching other new ideas that may trigger technological breakthroughs, such as recognition systems based on RNN generative models.

As for research direction, I personally decide based on the following four considerations:

The importance of the problem. We focus on solving important problems, regardless of whether the problem itself is simple or difficult.

The potential of a direction or method, not just its current performance. If a method does not currently perform well but has great room for extension and imagination, we will push forward in that direction.

The generality of the solution. We prefer methods that can solve a whole class of problems or multiple scenarios, rather than one specific problem or scenario.

Engineering convenience. We prefer simple methods, which are easier to engineer and iterate on quickly.

CSDN: What is the current progress on learning key pronunciation features and on generalization? What do you think it will take for the technology to mature?

Yu Dong: None of our current models makes any assumptions about key pronunciation features; the model parameters are derived entirely from the data.

In machine learning there is a well-known bias-variance dilemma. A model with weak fitting ability generally has a small variance-induced error and does not overfit easily, but it suffers a large bias-induced error. A model with strong fitting ability is the opposite: its bias-induced error is hard to reduce further, but its variance-induced error can be reduced by adding training data. Deep learning models are models with strong fitting ability, so the main way to improve their generalization (that is, to reduce the variance-induced error) is to increase the training data.

However, humans can achieve high recognition rates in varied situations with far less training data; in particular, their ability to generalize to unseen conditions far exceeds that of deep learning. I personally did some research in this area, for example letting each phone learn a template (or mask) that works across a variety of environments, but unfortunately those attempts were not successful. So far we have not found any model with such strong generalization ability. To solve this problem, a machine learning algorithm would have to automatically identify, at every level of a low-dimensional manifold, what different samples have in common and where they differ, and to know which level of features to use for which problem.

CSDN: Over the next three to five years, are there still non-deep-learning methods in speech recognition with untapped potential (or that could be combined with deep learning to achieve better results)?

Yu Dong: In fact, today's mainstream speech recognition technology still integrates traditional methods with deep learning methods. If deep learning is defined as any system with multiple levels of nonlinear processing, then any system containing a deep learning module is a deep learning system; but that does not mean deep learning is everything.

From a broader perspective, deep learning is just one of many artificial intelligence techniques. Its main strengths are nonlinear feature extraction and end-to-end gradient-based optimization, and many problems cannot be solved with deep learning alone. AlphaGo, for example, actually integrates deep learning, reinforcement learning, and Monte Carlo tree search. I personally think each technique should be allowed to do what it is good at; combining multiple techniques organically is often the best solution to practical problems.

| How to keep up with new technology

CSDN: The book has more than 450 references, including many papers, which may be thanks to your IEEE editorial work. With so many papers appearing at conferences and in journals, can you share a general method for quickly selecting and studying papers?

Yu Dong: You will find that although there are many papers, the main progress is still driven by a handful of major research institutions and individuals. If you do not have enough time, tracking those institutions and individuals can be more effective. If you can build good relationships with them, you can even learn of their progress, or obtain a preprint, before their work is officially published. If you have more time, I suggest attending the relevant academic conferences. Conferences are where information is exchanged: you can learn what questions and methods people are discussing, which articles they recommend reading, and which work deserves attention.

Of course, not every article deserves careful study. I read the abstract, introduction, and conclusion to get a general sense of an article, and spend more time only on promising work with new ideas, new methods, new perspectives, or new conclusions.

| CNTK: accelerating model training

CSDN: What do you see as CNTK's advantages for developing speech recognition algorithms?

Yu Dong: As far as I know, many new speech recognition models are built on CNTK. We began developing CNTK mainly for speech recognition research, so although CNTK also conveniently supports processing images, video, text, and other data, its support for speech recognition models is especially good. CNTK is very flexible: it supports mainstream models such as DNNs, CNNs, and LSTMs, and it allows various new models to be customized; the PIT and PAC models, for example, were built entirely with CNTK. In addition, since CNTK is also the main tool on our product lines, we have developed many efficient, high-performance parallel algorithms that greatly speed up training on tasks, such as speech recognition, that require large amounts of training data.

CSDN: Can you describe the progress of CNTK's Python support? What are the future plans for other languages such as MATLAB, R, Java, Lua, and Julia?

Yu Dong: The already released 1.5 and 1.6 versions include Python support. The forthcoming 2.0 release will provide better support, with a more complete and more flexible API. Under the new API framework, adding support for other languages will become easy.

CSDN: CNTK's GPU scaling is commendable, but the power consumption of large-scale GPU deployment is considerable. Given the many current efforts to accelerate with FPGAs and ASICs, will CNTK consider similar extensions?

Yu Dong: In fact, thanks to our engineers' optimization work, all of our current speech recognition systems can run in real time on a single CPU, so on the serving side GPU power consumption is not a problem. However, we foresee single-CPU bottlenecks, so we are also adding support in CNTK for low-precision, low-power CPU/GPU architectures. Of course, we also have colleagues working on FPGAs.

CSDN: Deep learning for speech recognition often takes the form of a hybrid model. Do you think CNTK needs to consider integration with non-deep-learning machine learning systems, such as Yahoo!'s Caffe-on-Spark?

Yu Dong: As far as integration of the operating environment goes, the Philly project led by Dr. Xuedong Huang (who joined Microsoft from Carnegie Mellon University in 1993 to lead speech recognition projects) has done similar work.

CSDN: What important updates will CNTK see in the next six months?

Yu Dong: We will provide a better and more flexible API layer and more comprehensive Python support, further improve training efficiency, better support sparse matrices, and support low-precision computation. Of course, more kinds of complex computation nodes (such as LFMMI) will also be added to the toolkit.

CSDN: Besides CNTK, which open-source deep learning technologies do you like?

Yu Dong: TensorFlow, Torch, MXNet, Theano, and others are all good open-source deep learning tools; each has its own characteristics and strengths.

