A Study on Hyperparameter Optimization Strategies for Deep Neural Networks Utilizing Training Time
Thesis (Master's) -- Seoul National University Graduate School: Department of Transdisciplinary Studies, February 2017. Wonjong Rhee.
While the need for feature engineering is greatly reduced in deep neural networks (DNNs) compared with traditional machine learning (ML), hyperparameter optimization (HPO) of DNNs has emerged as an important problem in its place.
As a DNN becomes deeper, the number of hyperparameters and the training time for each hyperparameter vector tend to increase significantly relative to traditional ML.
HPO algorithms, which are often considered less efficient than manual HPO performed by experienced experts, become more important for DNNs because of the increased complexity of their hyperparameters.
This thesis evaluates existing HPO algorithms on DNNs and analyzes hyperparameter interdependencies from the viewpoints of test error and training time.
Spearmint, an existing Bayesian optimization method that updates the prior distribution from history, performed well when five or fewer hyperparameters were involved.
HPO experiments with seven hyperparameters of LeNet-5, a convolutional neural network (CNN), on MNIST show that the test error distribution over a single hyperparameter is U-shaped, with abrupt changes in test error.
The training time, however, is strongly tied to the number of epochs and the number of neurons in the DNN architecture.
Hence, HPO strategies that utilize the number of epochs and the estimated training time are introduced and investigated in this thesis.
A strategy in this work consists of a coarse optimization that trains for a small number of epochs and a fine optimization that trains for a large number of epochs.
Using a framework developed to provide traceability, extensibility, and comparability for HPO methods, extended HPO methods that apply the fine optimization strategy after the coarse optimization strategy to any HPO method are investigated.
The extended methods were found to reach better performance faster than the original methods.
This thesis reveals that hyperparameter interdependency affects test error and training time variability in a CNN.
Utilizing the training time, which is highly predictable from a hyperparameter vector, also sheds light on how to speed up HPO of DNNs.
I. Introduction
1.1 Background
1.1.1 DNN, DL, ML and AI
1.1.2 HPO of ML
1.2 Research Motivation
1.3 Research Objectives
II. Related Works
2.1 Problem Definition
2.2 Manual HPO
2.3 Automatic HPO
III. Method
3.1 Research Questions
3.2 Unified HPO Framework
3.3 Coarse-Fine Optimization Strategies
IV. Experiments
4.1 Dataset and Model
4.2 Experiment Setup
4.3 Experiment Results
4.3.1 Existing HPO Algorithms Benchmark
4.3.2 Hyperparameter Interdependency
4.3.3 Random Coarse-Fine Optimization
4.3.4 Bayesian Coarse-Fine Optimization
V. Discussions
5.1 Linearity between Architecture Hyperparameters and Training Time
5.2 Interdependency of Hyperparameters
5.3 Reduction of Time to Operation with Coarse-Fine Optimization Algorithm
5.4 Limitations
VI. Conclusion
6.1 Summary
6.2 Contributions
6.3 Future Work
References
VII. Appendix. Additional Figures
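The coarse-fine idea in the abstract above (screen many hyperparameter vectors with few epochs, then retrain only the most promising ones with many epochs) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's framework: `toy_train`, its U-shaped error curve, and all budget values are hypothetical stand-ins for real DNN training.

```python
import random

def toy_train(hparams, epochs):
    """Hypothetical stand-in for training a DNN; returns a test error.

    The error is U-shaped in the learning rate and shrinks with epochs.
    """
    lr = hparams["lr"]
    u_shape = (lr - 0.1) ** 2          # minimum at lr = 0.1 (assumed)
    return u_shape + 1.0 / (1 + epochs)

def coarse_fine_search(n_coarse, n_fine, coarse_epochs, fine_epochs, seed=0):
    rng = random.Random(seed)
    # Coarse stage: many random candidates, few epochs each.
    candidates = [{"lr": rng.uniform(0.0, 1.0)} for _ in range(n_coarse)]
    ranked = sorted(candidates, key=lambda h: toy_train(h, coarse_epochs))
    # Fine stage: retrain only the best coarse candidates for many epochs.
    finalists = ranked[:n_fine]
    scored = [(toy_train(h, fine_epochs), h) for h in finalists]
    return min(scored, key=lambda pair: pair[0])

best_error, best_hparams = coarse_fine_search(
    n_coarse=50, n_fine=5, coarse_epochs=1, fine_epochs=20)
```

With these toy settings the search spends 50 x 1 + 5 x 20 = 150 epochs, versus 1,000 epochs if every candidate were trained for 20 epochs, which is the kind of saving the coarse stage is meant to buy.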
Algorithms for Hyperparameter Optimization of Deep Neural Networks
Thesis (Ph.D.) -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Transdisciplinary Studies (Digital Contents and Information Studies major), August 2021. Wonjong Rhee.
The need to solve complicated optimization and search problems has been sharply increasing, especially in machine learning and deep learning. Compared with traditional machine learning models, Deep Neural Networks (DNNs) are known to be highly sensitive to the choice of hyperparameters. Therefore, Network Architecture Search (NAS) and Hyper-Parameter Optimization (HPO) are two of the most important problems, with NAS generally requiring a resource budget that is orders of magnitude larger. For HPO of DNNs, the time and effort required for manual tuning have been decreasing rapidly for well-developed and commonly used DNN architectures, but DNN hyperparameter optimization will undoubtedly remain a major burden whenever a new DNN architecture needs to be designed, a new task needs to be solved, a new dataset needs to be addressed, or an existing DNN needs to be improved further. For HPO of general machine learning problems, numerous automated solutions have been developed, and some of the most popular optimization algorithms are based on Bayesian Optimization (BO) and Evolutionary Algorithms (EA).
This dissertation addresses meta-heuristics that build on the existing algorithms and can be adapted to a wide set of DNN HPO problems. Six fundamental enhancement strategies are presented for the black-box function optimization methods used for NAS and HPO: pre-evaluated dataset, diversification, parallelization, cooperation, early termination, and cost function transformation. Based on these enhancement strategies, two robust algorithms are provided: DEEP-BO (Diversified, Early termination-Enabled, and Parallel Bayesian Optimization) and B2EA (Cooperating Two Bayesian Optimization Modules Can Improve Evolution Algorithm). When exhaustively evaluated on practical DNN HPO benchmarks consisting of 6 to 14 tasks, DEEP-BO and B2EA mostly outperformed state-of-the-art algorithms such as regularized evolution, BOHB, and BANANAS.
Korean abstract (translated):
In machine learning, and in deep learning, which has recently been an active area of research, the need to solve complicated optimization and search problems is sharply increasing. In particular, compared with traditional machine learning models, DNNs (Deep Neural Networks) are known to vary greatly in performance depending on the choice of hyperparameters. NAS (Network Architecture Search) and HPO (Hyper-Parameter Optimization) are therefore the two most important practical problems in deep learning, and NAS generally requires far more resources and budget. For HPO of DNNs, the required time and effort are being reduced by manually tuning only widely used DNN architectures that were developed at great cost, but DNN hyperparameter optimization will undoubtedly remain a major burden not only when a new DNN architecture must be designed to solve a new problem, but also when an existing DNN is trained on a new dataset or needs further improvement. For HPO of general machine learning problems, a variety of automated solutions have been developed, the most popular of which are based on Bayesian optimization (BO) and evolutionary algorithms (EA). This dissertation first introduces meta-heuristics for stable performance improvement that were found while applying existing algorithms to a variety of DNN HPO problems. It presents six fundamental enhancement strategies for the black-box function optimization methods used for NAS and HPO: pre-evaluated datasets, diversification, parallelization, cooperation, early termination, and cost function transformation are each investigated. Based on these enhancement strategies, two robust algorithms were developed: DEEP-BO (Diversified, Early Termination-Enabled, Parallel Bayesian Optimization) and B2EA (Cooperating Two Bayesian Optimization Modules Can Improve Evolution Algorithm).
When evaluated on practical DNN HPO benchmarks consisting of 6 to 14 diverse tasks, DEEP-BO and B2EA showed strong results, in most cases substantially outperforming state-of-the-art algorithms such as regularized evolution, BOHB, and BANANAS.
Chapter 1. Introduction
1.1 Problem Definition
1.2 Background and Motivation
1.3 Contributions
Chapter 2. DNN HPO Components
2.1 Search Space
2.1.1 Background
2.1.2 Candidate set design
2.1.3 Benchmarks
2.2 Search Method
2.2.1 Background
2.2.2 Bayesian optimization
2.2.3 Evolutionary algorithm
2.2.4 Evolutionary algorithm vs. Bayesian optimization
2.2.5 Gradient-based optimization
2.3 Evaluation Method
2.3.1 Background
2.3.2 Partial training
2.3.3 Weight sharing
Chapter 3. Basic Enhancement Strategies
3.1 Search Space Enhancements
3.1.1 DNN benchmark
3.1.2 Cost function transformation
3.2 Search Method Enhancements
3.2.1 Diversification
3.2.2 Parallelization
3.2.3 Cooperation
3.3 Evaluation Method Enhancement
3.3.1 Early termination
Chapter 4. DNN HPO Algorithms
4.1 DEEP-BO
4.2 B2EA
Chapter 5. Experiments
5.1 Performance Benchmark of DEEP-BO
5.1.1 Experiment settings
5.1.2 Experimental results
5.2 Performance Benchmark of B2EA
5.2.1 Experiment settings
5.2.2 Experimental results
Chapter 6. Discussion
6.1 Additional Explanations
6.1.1 Diversified BO
6.1.2 Cooperation between two BO models
6.2 Ablation Studies
6.2.1 Ablation test of DEEP-BO
6.2.2 Ablation test of B2EA
6.3 Limitations
6.3.1 Meta-hyperparameters
6.3.2 Tuning of HPO algorithm
6.3.3 Surrogate modeling cost
6.4 Future Directions
6.4.1 Encoding of candidate representation
6.4.2 Other benchmark tasks
6.4.3 Robust HPO with weight sharing
6.4.4 Steps forward to AutoML
6.5 Broader Impacts
6.6 Implementation Details
6.6.1 DEEP-BO
6.6.2 B2EA
Chapter 7. Conclusion
Bibliography
Appendices
A Search Space Design of DNN Benchmarks
B Full Benchmark Results
B.1 Performance Comparison for DEEP-BO
B.2 Performance Comparison for B2EA
C User's Guide
C.1 Hyperparameter search space design
C.2 User-defined objective function
C.3 DEEP-BO run configuration
C.4 B2EA run configuration
C.5 Analysis of the results
Acknowledgement
Enhancing Adversarial Robustness through Pixel-Intensity Encryption
Thesis (Master's) -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Intelligence and Information, August 2021. 이윤아.
Neural networks are known to be vulnerable to gradient-based adversarial examples, which are crafted by leveraging input gradients to push the model toward misclassification. Because of these attacks, adversarial defense has become a topic of significant interest in recent years. The most empirically successful defense against such adversarial examples is adversarial training, which incorporates a strong self-attack during training. However, this approach is computationally expensive and hence hard to scale up. As a result, a series of studies has sought to develop gradient masking methods. One such method hides the gradient using encryption, which has been achieved by transforming the location of pixels. However, there have been no studies of how pixel-intensity encryption could work as an adversarial defense.
This study proposes a new defense method that uses pixel-intensity encryption to defend against gradient-based attacks. Furthermore, a new adaptive attack setup for encryption methods is presented to evaluate their effectiveness as an adversarial defense. The experiments show that the proposed defense is more robust under adaptive attack than the defenses of previous studies. Moreover, the correlation coefficient of an image is found to play a key role in the learnability of the model.
I. Introduction
1.1 Terminology
II. Related Work
2.1 Gradient-based Attack
2.2 Gradient Masking Defense
2.2.1 Obfuscated Gradients
2.2.2 Adversarial Encryption Defense
III. Research Questions
IV. Proposed Method
4.1 Pixel-Intensity Encryption
4.1.1 Affine encryption
4.1.2 Pixel-intensity shuffling
4.2 Upgraded Encryption
4.3 Adaptive attack framework for adversarial encryption defense
V. Experiments
5.1 Setup
5.2 Learnability
5.2.1 Experiment Design
5.2.2 Experiment Results
5.3 Adversarial Robustness
5.3.1 Experiment Design
5.3.2 Experiment Results
VI. Discussion
6.1 Discussion
6.2 Limitations and Future Work
VII. Conclusion
References
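The outline above names "Affine encryption" and "Pixel-intensity shuffling" as pixel-intensity encryption schemes, but this listing gives no formulas for them. As a rough illustration only, the sketch below assumes the generic textbook forms: a modular affine map on 8-bit intensities and a keyed permutation of the 256 intensity values. The function names, the key handling, and the use of Python's `random` module as the keyed generator are all assumptions, not the thesis's method.

```python
import random

def affine_encrypt(pixels, a, b):
    """Map each 8-bit intensity p to (a * p + b) mod 256.

    `a` must be odd so that the map is a bijection on 0..255.
    """
    if a % 2 == 0:
        raise ValueError("a must be odd to be invertible modulo 256")
    return [(a * p + b) % 256 for p in pixels]

def affine_decrypt(pixels, a, b):
    a_inv = pow(a, -1, 256)  # modular inverse (Python 3.8+)
    return [(a_inv * (p - b)) % 256 for p in pixels]

def make_shuffle_table(key):
    """Keyed permutation of the 256 intensity values (illustrative only)."""
    table = list(range(256))
    random.Random(key).shuffle(table)
    return table

def shuffle_encrypt(pixels, table):
    # Positions are untouched; only intensity values are remapped.
    return [table[p] for p in pixels]

image = [0, 17, 128, 255]            # hypothetical flattened pixel values
cipher = affine_encrypt(image, a=5, b=42)
restored = affine_decrypt(cipher, a=5, b=42)
table = make_shuffle_table(key=1234)
shuffled = shuffle_encrypt(image, table)
```

Both maps change intensity values while leaving pixel positions in place, which is the contrast with the location-based encryption mentioned in the abstract.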
Improving Bi-encoder Neural Ranking Models Using Knowledge Distillation and Lightweight Fine-Tuning
Thesis (Ph.D.) -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Intelligence and Information, August 2022. Wonjong Rhee.
In recent studies, pre-trained language models, especially bidirectional encoder representations from transformers (BERT), have been essential in enhancing the performance of neural ranking models (NRMs). Various BERT-based NRMs have been proposed, and many have achieved state-of-the-art performance. BERT-based NRMs can be classified according to how the query and document are encoded through BERT's self-attention layers: bi-encoder versus cross-encoder. Bi-encoder models are highly efficient because all the documents can be pre-processed before query time, but their performance is inferior to that of cross-encoder models. Because of their efficiency, bi-encoder models are much more deployable in real search engines and tend to receive more attention from industrial practitioners. Improving the performance of bi-encoder models is therefore a promising research direction. This thesis explores methods to improve bi-encoder NRMs using knowledge distillation and lightweight fine-tuning. We consider a method that transfers the knowledge of a teacher cross-encoder model to a student bi-encoder model using knowledge distillation. Knowledge distillation enables a bi-encoder student to imitate the representation of a cross-encoder teacher and to have the advantages of both types of models. The resulting student bi-encoder achieves improved performance by simultaneously learning from a cross-encoder teacher and a bi-encoder teacher. We also investigate lightweight fine-tuning to improve bi-encoder NRMs. Lightweight fine-tuning is a method of fine-tuning only a small portion of the model weights and is known to have a regularization effect. We demonstrate two approaches for improving the performance of BERT-based bi-encoders using lightweight fine-tuning.
The first approach is to replace the full fine-tuning step with lightweight fine-tuning. The second is to develop semi-Siamese models in which queries and documents are handled with a limited amount of difference. The limited difference is realized by learning two lightweight fine-tuning modules, while the main BERT language model is kept common to both query and document. We provide extensive experimental results, which confirm that both lightweight fine-tuning and semi-Siamese models are considerably helpful for improving BERT-based bi-encoders. Finally, we present a model that uses these two methods simultaneously. Using knowledge distillation and lightweight fine-tuning together, a model can gain the effects of both methods, resulting in further performance improvement over the individual methods. We anticipate that these techniques will be broadly applicable to industrial domains.
Korean abstract (translated):
Recent studies have proposed a variety of BERT-based neural ranking models, and these models achieve state-of-the-art performance. BERT-based ranking models are classified as cross-encoders or bi-encoders according to whether the relationship between the query and the document is computed through BERT's self-attention. Cross-encoder models perform well but are inefficient. Bi-encoder models, in contrast, perform worse than cross-encoders but are highly efficient because the vector representations of all documents can be computed in advance. Because they are efficient, bi-encoder models can be deployed in real search engines, and for this reason they receive more attention from the search industry. As noted above, however, the performance of bi-encoder models does not reach that of cross-encoder models. Improving the performance of bi-encoder models is therefore an attractive problem in domains that want to use ranking models in practice. This study explores methods for improving bi-encoder models using knowledge distillation and lightweight fine-tuning. We study a method that uses knowledge distillation to transfer the knowledge of a cross-encoder model to a bi-encoder model. The resulting bi-encoder model performs better because it uses the knowledge learned from the cross-encoder. We also use lightweight fine-tuning to improve bi-encoder models. Lightweight fine-tuning trains only a small portion of the model weights and is known to have a regularizing effect on the model. We use two approaches to improve BERT-based bi-encoder models with lightweight fine-tuning. The first replaces full fine-tuning with lightweight fine-tuning. The second uses a semi-Siamese model that handles queries and documents differently. Through extensive experiments, we confirmed that lightweight fine-tuning and semi-Siamese models are considerably helpful for improving bi-encoder models. Finally, we present a model that uses knowledge distillation and lightweight fine-tuning together.
Experiments confirmed that the model using both methods outperforms models using either method alone. We expect the proposed methods to benefit the search industry.
Chapter 1. Introduction
1.1 Thesis Outline
1.2 Related Publications
Chapter 2. Background
2.1 Information Retrieval
2.1.1 Text Ranking using Neural Ranking Models
2.2 Ad-hoc Retrieval Problems
2.2.1 The Concept of Relevance
2.2.2 Test Collections
2.2.3 Ranking Metrics
2.3 A Brief History of Ad-hoc Retrieval
2.3.1 The Era of Exact Match
2.3.2 Pre-BERT Neural Ranking Model
2.3.3 BERT-based Neural Ranking Models
2.4 Research Motivation
2.5 Thesis Roadmap
Chapter 3. Bi-encoder Neural Ranking Models with Distillation - TRMD
3.1 Introduction
3.2 Related Works
3.2.1 NRMs before Pre-trained Language Models
3.2.2 NRMs with BERT
3.2.3 Efficient NRMs
3.3 Methodology
3.3.1 Architecture
3.3.2 Learning through Multi-teacher Distillation
3.4 Experimental Result
3.4.1 Experiment
3.4.2 Result and Analysis
3.5 Discussion
3.6 Conclusion
Chapter 4. Bi-encoder Neural Ranking Models with Lightweight Fine-tuning - SS LFT
4.1 Introduction
4.2 Related Works
4.2.1 BERT-based NRMs
4.2.2 Lightweight Fine-Tuning (LFT)
4.2.3 Semi-Siamese (SS) Models
4.3 Methodology
4.3.1 Document Re-ranking
4.3.2 Lightweight Fine-Tuning (LFT)
4.3.3 Semi-Siamese Neural Ranking Model
4.4 Experiment
4.4.1 Experimental Setup
4.4.2 LFT Results for Cross-encoder
4.4.3 LFT Results for Bi-encoders
4.4.4 Semi-Siamese LFT Results for Bi-encoders
4.5 Discussion
4.5.1 Cross-encoder vs. Bi-encoder
4.5.2 Hybrid: Concurrent Learning vs. Sequential Learning
4.6 Conclusion
Chapter 5. Bi-encoder Neural Ranking Models with Knowledge Distillation and Lightweight Fine-tuning
5.1 Introduction
5.2 Related Works
5.2.1 Dense Retriever
5.2.2 Improving Dense Retriever
5.2.3 Lightweight Fine-tuning and semi-Siamese network
5.3 Methodology
5.3.1 Document Ranking
5.3.2 Knowledge Supervision
5.3.3 Lightweight fine-tuning
5.3.4 Semi-Siamese Lightweight fine-tuning
5.3.5 SS LFT with supervision
5.3.6 Training Procedure
5.4 Experiment
5.4.1 Experimental Setup
5.4.2 Results of combination method
5.5 Discussion
5.5.1 Difference between cross-encoder and bi-encoder models
5.5.2 How to overcome the shortage of bi-encoder models
5.6 Conclusion
Chapter 6. Conclusion
6.1 Effectiveness and Efficiency
6.2 Expansion to Text Ranking
6.3 Future Work
Bibliography
Appendices
A Knowledge Distillation Methods
B Variants of SS Prefix-tuning
C Variants of SS LoRA
D Efficiency of Training and Inference
D.1 Training Time
D.2 Inference Time
E Hyper-parameter setting
Acknowledgement
Application of Traditional ML and DNN Techniques on Energy Disaggregation with 10Hz AMI Data
Thesis (Master's) -- Seoul National University Graduate School: Department of Transdisciplinary Studies, February 2017. Wonjong Rhee.
Energy disaggregation is the process of separating a household's total electricity consumption into the energy consumption of individual appliances. It is performed by applying a set of algorithms to aggregated electricity data. Energy disaggregation can be helpful for energy feedback, detection of appliance malfunction, energy incentive design, and demand-response management.
In this thesis, we apply machine learning algorithms to the energy disaggregation problem. Data were measured in 58 Japanese households. In our first study, we formulated energy disaggregation as on-off state classification of appliances. To solve the classification problem, we take two main approaches: a traditional ML approach and a deep neural network approach. In the former, we devised the 'edge' concept, extracted 59 features, and used traditional ML algorithms such as logistic regression, support vector machine, and random forest. In the latter, we applied deep neural networks for automated feature learning. Experiments demonstrate that deep neural network algorithms perform better than the traditional ML approach for appliances with weak signatures. On the other hand, the traditional ML algorithms performed better for appliances with strong signatures. These results imply that the algorithm should be selected according to the kind of household appliance.
The second study was an experiment on sensitivity to sampling rate. Because the classification relies on extracting patterns from the signatures, the sampling rate of the aggregated data is an important issue: the degree to which signatures are revealed depends on the sampling rate. Our experiments studied how the performance of machine learning algorithms varies as the sampling rate changes. The results differ by appliance type, but performance drops drastically once the sampling rate is lowered to one sample per 10 seconds. Even at 1 Hz, however, on-off classification over a 90-second window performs well enough, which implies that 1 Hz is sufficient for industrial settings.
I. Introduction
1.1 Overview of Thesis
II. Machine Learning Algorithms
2.1 Traditional Machine Learning
2.1.1 Feature Engineering
2.1.2 Algorithms
2.2 Deep Learning
2.2.1 Feature Learning
2.2.2 Algorithms
III. Energy Disaggregation
3.1 The Challenges of Energy Disaggregation
3.2 Previous Works
3.3 Public Data Sets
3.3.1 REDD
3.3.2 BLUED
3.3.3 GREEND
3.3.4 UK-DALE
3.3.5 AMPds
3.3.6 ECO
IV. Binary Classification for Energy Disaggregation
4.1 Data
4.2 Methodology
4.2.1 Feature Engineering
4.2.2 Deep Learning
4.3 Results
4.4 Discussion
V. Sensitivity to Sampling Rate
5.1 Data
5.2 Methodology
5.3 Results
5.4 Discussions
VI. Conclusion
6.1 Summary
6.2 Practical Implication
6.3 Theoretical Implication
6.4 Limitations
6.5 Future Works
References
Appendices
A Full Feature List
B TV Binary Classification Full Result
C Washer Binary Classification Full Result
D Cooker Binary Classification Full Result
E Result of Random Forest Sampling Rate Experiment
F Result of CNN Sampling Rate Experiment
Abstract (in Korean)
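The abstract's point that an appliance signature becomes less visible as the sampling rate drops can be illustrated with a small sketch. The listing does not say how the thesis resampled its 10 Hz data, so block averaging is an assumption here, and the signal values are hypothetical.

```python
def downsample(samples, factor):
    """Average consecutive blocks of `factor` samples (incomplete tail dropped)."""
    n_blocks = len(samples) // factor
    return [
        sum(samples[i * factor:(i + 1) * factor]) / factor
        for i in range(n_blocks)
    ]

# Hypothetical 3 seconds of 10 Hz aggregate power in watts: an appliance
# draws an extra 500 W for 0.5 s during the middle second.
signal_10hz = [100.0] * 13 + [600.0] * 5 + [100.0] * 12

signal_1hz = downsample(signal_10hz, 10)           # 10 Hz -> 1 Hz
step_10hz = max(signal_10hz) - min(signal_10hz)    # 500 W jump at 10 Hz
step_1hz = max(signal_1hz) - min(signal_1hz)       # smeared to 250 W at 1 Hz
```

In this toy case the 500 W step shrinks to 250 W after averaging to 1 Hz, and it would shrink further at one sample per 10 seconds, which is the mechanism behind the sensitivity the abstract reports.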
A Study on Improving Fine-Tuning Performance Based on the Isotropy and Rank Properties of Language Model Representations
Thesis (Ph.D.) -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Transdisciplinary Studies (Intelligence and Information major), February 2024. Wonjong Rhee.
In natural language processing, fine-tuning pre-trained language models for downstream tasks is a fundamental approach. Among the many fine-tuning tasks, learning a text embedding (representation) that captures the underlying semantic information of a given text is an essential one. Given the remarkable progress in the linguistic comprehension capabilities of large-scale pre-trained language models (PLMs), there has recently been a significant surge in the development of text embedding models that leverage these PLMs. This dissertation focuses on the representations of language models such as Bidirectional Encoder Representations from Transformers (BERT) and studies techniques to improve the performance of text embeddings. We examine two text embedding tasks: Dense Retrieval (DR) and sentence embedding.
In information retrieval, DR models encode queries and documents into representations, and the relevance between a query and a document is determined from these representations. However, representations of PLMs are known to follow an anisotropic distribution, which can be undesirable for relevance estimation. We show that the representations of popular BERT-based DR models such as ColBERT and RepBERT follow an anisotropic distribution. To cope with the problem, we adopt the unsupervised post-processing methods of Normalizing Flow and whitening, which can effectively enhance the isotropy of representations and thereby improve the performance of DR models. Furthermore, with post-processing methods, we can significantly improve the performance of DR models on out-of-distribution tasks, where the distribution of the test dataset differs from that of the training dataset.
The next task we focus on is sentence embedding. Sentence embedding models estimate the semantic similarity between two given sentences by measuring the similarity between the two sentences' representations.
Unsupervised learning of sentence embeddings aims to learn representations that capture the underlying semantic information of sentences without the need for human annotation. Among the numerous unsupervised models for the sentence embedding task, SimCSE has made significant progress through self-supervised contrastive learning and has become a foundational baseline for subsequent studies in the field. In pursuit of improving sentence embedding performance through self-supervised learning (SSL), we focus on the representations of SimCSE.
Through an in-depth exploration of SimCSE's training dynamics, we uncover a strong correlation between representation rank and performance. Building upon this insight, we introduce the Rank Reduction (RR) regularizer to the fine-tuning of SimCSE. Our experiments reveal that RR not only boosts the performance of SimCSE in sentence embedding tasks but also contributes to the model's stability against changes in random seeds. This result offers valuable insights into the relationship between representation rank and SSL performance in natural language processing, potentially benefiting a wide range of applications.
Chapter 1. Introduction
1.1 Dissertation Outline
1.2 Related Publications
Chapter 2. Background
2.1 Language Model
2.1.1 Training of Language Models
2.1.2 BERT and RoBERTa
2.2 Text Embedding
2.2.1 History of Text Embedding
2.2.2 Dense Retrieval
2.2.3 Sentence Embedding
2.3 Dissertation Roadmap
Chapter 3. Isotropic Representation Can Improve Dense Retrieval
3.1 Introduction
3.2 Contributions
3.3 Related Works
3.3.1 Dense Retrieval and Similarity Function
3.3.2 Anisotropic Distribution of BERT Representations
3.3.3 Enforcing Isotropy for STS Task
3.3.4 Robustness of Ranking Models
3.4 Backgrounds
3.4.1 DR models: ColBERT and RepBERT
3.4.2 Metrics
3.5 Methodology
3.5.1 Enforcing Isotropy
3.5.2 Robustness for Out-Of-Distribution Data
3.6 Experiments
3.6.1 Experimental Settings
3.6.2 Experimental Results
3.7 Discussion
3.7.1 Handling of Outlier Dimensions
3.7.2 Normalizing Flow vs. Whitening
3.7.3 Token-wise vs. Sequence-wise
3.7.4 Robustness and OOD Generalization
3.8 Conclusion
Chapter 4. Improving Fine-Tuning Performance of Sentence Embedding via Representation Rank Reduction
4.1 Introduction
4.2 Contributions
4.3 Related Works
4.3.1 SimCSE training through Contrastive Self-supervised Learning
4.3.2 SimCSE-based Models
4.3.3 Rank and Contrastive Learning
4.4 Experiments: The Impact of Rank Reduction on SimCSE
4.4.1 Rank Reduction Method: Effective Rank Reduction (RR) Regularization
4.4.2 Experimental Setup
4.4.3 Experimental Results
4.5 Analysis
4.5.1 Training Dynamics of SimCSE
4.5.2 PromptBERT
4.6 Discussion
4.6.1 Linguistic Abilities of Representations
4.6.2 Acceleration of Training with RR
4.7 Conclusion
Chapter 5. Conclusion and Limitations
Bibliography
Appendices
A Whitening and Glow
B Training Dynamics of SimCSE
C Hyperparameters for training SimCSE
D Rank and STS Performance of BERT and RoBERTa Models
E Probing Tests for Assessing Linguistic Abilities of Representations
Effective Low-Dimensional Compression of Deep Neural Networks
Thesis (Ph.D.) -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Transdisciplinary Studies (Digital Contents and Information Studies major), August 2023. Wonjong Rhee.
Compression of neural networks has emerged as an essential research topic, especially for edge devices that have limited computation power and storage capacity. The most popular compression methods include quantization, pruning of redundant parameters, knowledge distillation from a large network to a small one, and low-rank compression. Low-rank compression has the potential to be a high-performance compression method, but it has fallen short in practice because the challenge of determining the optimal rank for every layer remains unsolved. This thesis explores two methods to address this challenge and improve compression performance. First, we propose BSR (Beam-search and Stable Rank), a low-rank compression algorithm that embodies an efficient rank-selection method and a unique compression-friendly training method. For rank selection, BSR employs a modified beam search that jointly optimizes the rank allocations over all the layers, in contrast to the previously used heuristic methods. For compression-friendly training, BSR adopts a regularization loss derived from a modified stable rank, which can control the rank while incurring almost no harm in performance. Experimental results confirm that BSR is effective and superior to the existing low-rank compression methods. Second, we propose a fully joint learning framework called LeSS to simultaneously determine filters for filter pruning and ranks for low-rank decomposition. We provide a method for rank selection together with a training method and confirm a significant improvement in performance when it is integrated with an existing, strongly performing pruning method. LeSS does not depend on iterative or heuristic processes, and it satisfies the desired resource budget constraint.
LeSS comprises two learning modules: mask learning for filter pruning and threshold learning for low-rank decomposition. The first module learns masks identifying the importance of the filters, and the second module learns the threshold of the singular values to be removed such that only significant singular values remain. Because both modules are designed to be differentiable, they are easily combined and jointly optimized. LeSS outperforms state-of-the-art methods on a number of benchmarks, demonstrating its effectiveness. Finally, to obtain high performance in transfer learning for fine-grained datasets, we propose mask learning for both rank and filter selection. The mask learning approach suits transfer learning because determining which singular values are useful matters more there than selecting a rank. Our approach to compression for transfer learning yielded performance that was either improved over or comparable to the uncompressed results. We anticipate these techniques will be broadly applicable to industrial domains.
Chapter 1. Introduction
1.1 Thesis Outline
1.2 Related Publications
Chapter 2. Background
2.1 Compression of Deep Neural Networks
2.2 Structured Compression of Deep Neural Networks
2.2.1 Low-Rank Compression
2.2.2 Filter Pruning
2.3 Low-rank decomposition in other fields
2.4 Thesis Roadmap
Chapter 3. An Effective Low-Rank Compression with a Joint Rank Selection Followed by a Compression-Friendly Training
3.1 Introduction
3.2 Contributions
3.3 Related works
3.3.1 Beam search
3.3.2 Stable rank and rank regularization
3.4 The basics of low-rank compression
3.4.1 The basic process
3.4.2 Compression ratio
3.5 Methodology
3.5.1 Overall process
3.5.2 Modified beam-search (mBS) for rank selection
3.5.3 Modified stable rank (mSR) for regularized training
3.6 Experiments
3.6.1 Experimental setting
3.6.2 Experimental results
3.6.3 Analysis of BSR
3.7 Discussion
3.7.1 Combined use with quantization
3.7.2 Limitations and future works
3.8 Conclusion
Chapter 4. Learning to Select a Structured Architecture over Filter Pruning and Low-rank Decomposition
4.1 Introduction
4.2 Contribution
4.3 Related works
4.3.1 Hybrid compression methods
4.4 Background
4.4.1 Selection problem for DNN compression
4.4.2 Tensor Matricization
4.4.3 CNN decomposition scheme
4.5 Learning framework for the selection problem in hybrid compression
4.6 Experiments
4.6.1 Experimental settings
4.7 Analysis and discussion
4.7.1 Learning strategy analysis
4.7.2 Influence of matricization scheme
4.7.3 Data efficiency of LeSS
4.7.4 Extension to higher-order SVD
4.7.5 Extension to transformer architecture
4.7.6 Discussion on the reasons for the improved performance of compressed models compared to the uncompressed baseline model
4.8 Conclusion
Chapter 5. Conclusion and limitations
Bibliography
Appendices
A The SoTA compression methods
B Resource budget definition
C Implementation details
C.1 Hyper-parameter setting
C.2 Tuning details of hyper-parameters
D Full comparison results
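The abstract above refers to a regularizer derived from a modified stable rank and to rank selection for low-rank compression. The thesis's modified definition is not given in this listing, so the sketch below shows only the standard stable rank, the squared Frobenius norm divided by the squared spectral norm, together with the usual parameter count of a rank-r factorization. The helper names and the power-iteration settings are assumptions for illustration.

```python
import math

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def spectral_norm_sq(A, iters=200):
    """Largest eigenvalue of A^T A, estimated by power iteration.

    Assumes A is nonzero and the all-ones start vector is not orthogonal
    to the top right singular vector.
    """
    At = transpose(A)
    v = [1.0] * len(A[0])
    for _ in range(iters):
        w = matvec(At, matvec(A, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = matvec(A, v)
    return sum(x * x for x in Av)

def stable_rank(A):
    """Standard stable rank: ||A||_F^2 / ||A||_2^2 (not the thesis's mSR)."""
    frob_sq = sum(x * x for row in A for x in row)
    return frob_sq / spectral_norm_sq(A)

def low_rank_params(m, n, r):
    """Parameters after factoring an m x n weight into (m x r) and (r x n)."""
    return r * (m + n)

rank_one = [[1.0, 2.0], [2.0, 4.0]]
diagonal = [[3.0, 0.0], [0.0, 4.0]]
ratio = (512 * 512) / low_rank_params(512, 512, 64)
```

The stable rank is at most the true rank and changes smoothly with the weights, which is a common reason to use it as a training-time surrogate; here the rank-one matrix scores 1 and the diagonal example scores 25/16.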
Understanding Mutual Information in Contrastive Representation Learning
Thesis (Ph.D.) -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Transdisciplinary Studies (Digital Contents and Information Studies major), February 2023. Wonjong Rhee.
Contrastive learning has played a pivotal role in the recent success of unsupervised representation learning. It has commonly been explained in terms of instance discrimination and a mutual information loss, and some of the fundamental explanations are based on mutual information analysis. An analysis based on mutual information, however, can be misleading. First, exact quantification of mutual information over a real-world dataset is challenging; the problem has remained unsolved because the true joint distribution of a real-world dataset is not accessible. Second, previous studies have equated the limitations of contrastive learning with those of mutual information estimation without rigorously investigating the relationship between them. Third, what information is actually shared by the two views has been overlooked. Without a careful examination of what information is actually being shared, the interpretation can be completely misleading. In this work, we develop new methods that enable rigorous analysis of mutual information in contrastive learning. We also evaluate the accuracy of variational MI estimators across various data domains, including images and texts. Using these methods, we investigate three existing beliefs and show that they are incorrect. Based on the investigation results, we address two issues in the discussion section. In particular, we question whether contrastive learning is indeed an unsupervised representation learning method, because the current framework of contrastive learning relies on validation performance for tuning the augmentation design.
Chapter 1. Introduction
1.1 Contributions
Chapter 2. Background
2.1 Contrastive representation learning
2.1.1 Previous works to understand contrastive learning
2.2 Mutual Information
2.3 Variational Mutual Information Estimators
2.3.1 Critic function
2.3.2 Limitations of the variational MI estimators
Chapter 3. Same-class Sampling for Positive Pairing
Chapter 4. Understanding the Accuracy of Variational Mutual Information Estimators
4.1 Datasets
4.1.1 Gaussian dataset
4.1.2 Definitions of ds, dr, and Z
4.1.3 Details of generating datasets
4.2 Experimental setup
4.3 Experimental results
4.3.1 Critic architecture
4.3.2 Critic capacity
4.3.3 Choice of the variational MI estimator
4.3.4 Number of information sources
4.3.5 Representation dimension
4.3.6 Nuisance
4.3.7 Deep representations
4.4 Discussion: How can we make use of MI with practical datasets?
4.5 Conclusion
Chapter 5. Examining Three Existing Beliefs on Mutual Information in Contrastive Learning
5.1 Method
5.1.1 Post-training MI estimation
5.1.2 CDP dataset
5.2 Experimental setups
5.2.1 Training
5.2.2 Post-training MI estimation
5.3 Results
5.3.1 A small batch size is a limiting factor for MI estimation but not for contrastive learning.
5.3.2 Augmentation-based MI and other metrics are not effective, but MI class is effective.
5.3.3 Minimizing task-irrelevant information (InfoMin) is not always necessary.
5.4 Discussion
5.5 Conclusion
Chapter 6. Conclusion
6.1 Limitations
6.2 Future works
Bibliography
Appendices
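Section 5.3.1 in the outline above notes that a small batch size limits MI estimation. One well-known reason, shown here as a sketch rather than as the thesis's own analysis, is that the InfoNCE estimate computed from a batch of K pairs can never exceed log K, however good the critic is. The critic scores below are hypothetical.

```python
import math

def infonce_estimate(scores):
    """InfoNCE lower bound from a K x K critic score matrix.

    scores[i][j] is the critic value f(x_i, y_j); the diagonal entries
    are the positive pairs.
    """
    k = len(scores)
    total = 0.0
    for i, row in enumerate(scores):
        m = max(row)  # subtract the row maximum for numerical stability
        log_mean_exp = m + math.log(sum(math.exp(s - m) for s in row) / k)
        total += row[i] - log_mean_exp
    return total / k

def sharp_scores(k, gap):
    """Hypothetical critic that separates positives from negatives by `gap`."""
    return [[gap if i == j else 0.0 for j in range(k)] for i in range(k)]

small_batch = infonce_estimate(sharp_scores(k=8, gap=50.0))    # close to log 8
large_batch = infonce_estimate(sharp_scores(k=64, gap=50.0))   # close to log 64
```

Even with a nearly perfect critic, the estimate saturates near log 8 (about 2.08 nats) for K = 8 and near log 64 (about 4.16 nats) for K = 64, so the batch size caps what the estimator can report.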
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed
- …
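The two counting schemes compared in the abstract above (first author only versus the first 5 authors of each cited work) can be sketched as a single function with a cutoff parameter. This is a minimal illustration on hypothetical data; in particular, treating co-authors of the same cited work as co-cited with each other is an assumption of this sketch, not something the abstract specifies.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(citing_papers, max_authors):
    """Count author pairs that are cited together by the same citing paper.

    citing_papers: list of reference lists; each reference is the ordered
    author list of one cited work. Only the first `max_authors` authors of
    each cited work are taken into account.
    """
    counts = Counter()
    for references in citing_papers:
        cited = set()
        for authors in references:
            cited.update(authors[:max_authors])
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts

# Hypothetical data: two citing papers and the works they cite.
papers = [
    [["Adams", "Baker"], ["Clark"]],
    [["Adams", "Baker"], ["Baker", "Clark"]],
]
first_author_only = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
```

In this toy data, first-author counting never links Baker and Clark, while counting up to five authors co-cites them twice, which hints at why the non-traditional scheme can yield denser author groups.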
