Text Data Mining
Chengqing Zong, Rui Xia, Jiajun Zhang (auth.)
Springer Singapore (imprint: Springer), 1st edition, Singapore, 2021
English [en] · PDF · 8.5MB · 2021 · 📘 Nonfiction · 🚀/lgli/lgrs/nexusstc/upload/zlib
Description
- Focuses on text data mining from an NLP perspective
- Offers a rich blend of fundamental theories, key techniques, and predominant applications
- Presents the latest advances in the field of text data mining

This book discusses various aspects of text data mining. Unlike other books that focus on machine learning or databases, it approaches text data mining from a natural language processing (NLP) perspective. The book offers a detailed introduction to the fundamental theories and methods of text data mining, ranging from preprocessing (for both Chinese and English texts), text representation, and feature selection to text classification and text clustering. It also presents the predominant applications of text data mining, for example, topic modeling, sentiment analysis and opinion mining, topic detection and tracking, information extraction, and automatic text summarization. Bringing all the related concepts and algorithms together, it offers a comprehensive, authoritative, and coherent overview. Written by three leading experts, it is valuable both as a textbook and as a reference resource for students, researchers, and practitioners interested in text data mining. It can also be used for classes on text data mining or NLP.
Alternative filename
nexusstc/Text Data Mining/a3609b59e3f2f264082a374fb12193e9.pdf
Alternative filename
lgli/Text Data Mining(2021)[9789811601002][Zong et al].pdf
Alternative filename
lgrsnf/Text Data Mining(2021)[9789811601002][Zong et al].pdf
Alternative filename
zlib/Computers/Chengqing Zong, Rui Xia, Jiajun Zhang/Text Data Mining_14995452.pdf
Alternative author
Zong, Chengqing, Xia, Rui, Zhang, Jiajun
Alternative author
Chengqing Zong, Rui Xia, Zhang, Jiajun
Alternative author
Chengqing Zong, Rui Xia, Jiajun Zhang
Alternative publisher
Springer Nature Singapore Pte Ltd (formerly Springer Science+Business Media Singapore Pte Ltd)
Alternative edition
Springer Nature (Textbooks & Major Reference Works), Singapore, 2021
Alternative edition
Singapore, Tsinghua University Press, 2021
Alternative edition
Singapore, Singapore
Metadata comment
lg3014838
Metadata comment
producers: Adobe PDF Library 10.0.1
Metadata comment
{"isbns":["9789811600999","9789811601002","9811600996","9811601003"]}
Alternative description
Foreword 5
Preface 7
Acknowledgments 9
Contents 10
About the Authors 15
Acronyms 17
1 Introduction 20
1.1 The Basic Concepts 20
1.2 Main Tasks of Text Data Mining 22
1.3 Existing Challenges in Text Data Mining 25
1.4 Overview and Organization of This Book 28
1.5 Further Reading 31
Exercises 32
2 Data Annotation and Preprocessing 33
2.1 Data Acquisition 33
2.2 Data Preprocessing 38
2.3 Data Annotation 40
2.4 Basic Tools of NLP 43
2.4.1 Tokenization and POS Tagging 43
2.4.2 Syntactic Parser 45
2.4.3 N-gram Language Model 47
2.5 Further Reading 48
Exercises 48
3 Text Representation 50
3.1 Vector Space Model 50
3.1.1 Basic Concepts 50
3.1.2 Vector Space Construction 51
3.1.3 Text Length Normalization 53
3.1.4 Feature Engineering 54
3.1.5 Other Text Representation Methods 56
3.2 Distributed Representation of Words 57
3.2.1 Neural Network Language Model 58
3.2.2 C&W Model 62
3.2.3 CBOW and Skip-Gram Model 64
3.2.4 Noise Contrastive Estimation and Negative Sampling 66
3.2.5 Distributed Representation Based on the Hybrid Character-Word Method 68
3.3 Distributed Representation of Phrases 70
3.3.1 Distributed Representation Based on the Bag-of-Words Model 71
3.3.2 Distributed Representation Based on Autoencoder 71
3.4 Distributed Representation of Sentences 75
3.4.1 General Sentence Representation 76
3.4.2 Task-Oriented Sentence Representation 80
3.5 Distributed Representation of Documents 83
3.5.1 General Distributed Representation of Documents 84
3.5.2 Task-Oriented Distributed Representation of Documents 86
3.6 Further Reading 89
Exercises 89
4 Text Representation with Pretraining and Fine-Tuning 91
4.1 ELMo: Embeddings from Language Models 91
4.1.1 Pretraining Bidirectional LSTM Language Models 92
4.1.2 Contextualized ELMo Embeddings for Downstream Tasks 93
4.2 GPT: Generative Pretraining 94
4.2.1 Transformer 94
4.2.2 Pretraining the Transformer Decoder 96
4.2.3 Fine-Tuning the Transformer Decoder 97
4.3 BERT: Bidirectional Encoder Representations from Transformer 98
4.3.1 BERT: Pretraining 99
4.3.2 BERT: Fine-Tuning 102
4.3.3 XLNet: Generalized Autoregressive Pretraining 102
4.3.4 UniLM 105
4.4 Further Reading 106
Exercises 108
5 Text Classification 109
5.1 The Traditional Framework of Text Classification 109
5.2 Feature Selection 111
5.2.1 Mutual Information 112
5.2.2 Information Gain 115
5.2.3 The Chi-Squared Test Method 116
5.2.4 Other Methods 117
5.3 Traditional Machine Learning Algorithms for Text Classification 118
5.3.1 Naïve Bayes 119
5.3.2 Logistic/Softmax and Maximum Entropy 121
5.3.3 Support Vector Machine 123
5.3.4 Ensemble Methods 126
5.4 Deep Learning Methods 127
5.4.1 Multilayer Feed-Forward Neural Network 127
5.4.2 Convolutional Neural Network 129
5.4.3 Recurrent Neural Network 131
5.5 Evaluation of Text Classification 136
5.6 Further Reading 139
Exercises 140
6 Text Clustering 141
6.1 Text Similarity Measures 141
6.1.1 The Similarity Between Documents 141
6.1.2 The Similarity Between Clusters 144
6.2 Text Clustering Algorithms 145
6.2.1 K-Means Clustering 145
6.2.2 Single-Pass Clustering 149
6.2.3 Hierarchical Clustering 152
6.2.4 Density-Based Clustering 154
6.3 Evaluation of Clustering 157
6.3.1 External Criteria 157
6.3.2 Internal Criteria 158
6.4 Further Reading 159
Exercises 160
7 Topic Model 161
7.1 The History of Topic Modeling 161
7.2 Latent Semantic Analysis 162
7.2.1 Singular Value Decomposition of the Term-by-Document Matrix 163
7.2.2 Conceptual Representation and Similarity Computation 164
7.3 Probabilistic Latent Semantic Analysis 166
7.3.1 Model Hypothesis 166
7.3.2 Parameter Learning 167
7.4 Latent Dirichlet Allocation 169
7.4.1 Model Hypothesis 169
7.4.2 Joint Probability 171
7.4.3 Inference in LDA 174
7.4.4 Inference for New Documents 176
7.5 Further Reading 177
Exercises 178
8 Sentiment Analysis and Opinion Mining 179
8.1 History of Sentiment Analysis and Opinion Mining 179
8.2 Categorization of Sentiment Analysis Tasks 180
8.2.1 Categorization According to Task Output 180
8.2.2 According to Analysis Granularity 181
8.3 Methods for Document/Sentence-Level Sentiment Analysis 184
8.3.1 Lexicon- and Rule-Based Methods 185
8.3.2 Traditional Machine Learning Methods 186
8.3.3 Deep Learning Methods 190
8.4 Word-Level Sentiment Analysis and Sentiment Lexicon Construction 194
8.4.1 Knowledgebase-Based Methods 194
8.4.2 Corpus-Based Methods 195
8.4.3 Evaluation of Sentiment Lexicons 198
8.5 Aspect-Level Sentiment Analysis 199
8.5.1 Aspect Term Extraction 199
8.5.2 Aspect-Level Sentiment Classification 202
8.5.3 Generative Modeling of Topics and Sentiments 207
8.6 Special Issues in Sentiment Analysis 209
8.6.1 Sentiment Polarity Shift 209
8.6.2 Domain Adaptation 211
8.7 Further Reading 214
Exercises 215
9 Topic Detection and Tracking 216
9.1 History of Topic Detection and Tracking 216
9.2 Terminology and Task Definition 217
9.2.1 Terminology 217
9.2.2 Task 218
9.3 Story/Topic Representation and Similarity Computation 221
9.4 Topic Detection 224
9.4.1 Online Topic Detection 224
9.4.2 Retrospective Topic Detection 226
9.5 Topic Tracking 227
9.6 Evaluation 228
9.7 Social Media Topic Detection and Tracking 230
9.7.1 Social Media Topic Detection 231
9.7.2 Social Media Topic Tracking 232
9.8 Bursty Topic Detection 232
9.8.1 Burst State Detection 233
9.8.2 Document-Pivot Methods 236
9.8.3 Feature-Pivot Methods 237
9.9 Further Reading 239
Exercises 240
10 Information Extraction 241
10.1 Concepts and History 241
10.2 Named Entity Recognition 243
10.2.1 Rule-based Named Entity Recognition 244
10.2.2 Supervised Named Entity Recognition Method 245
10.2.3 Semisupervised Named Entity Recognition Method 253
10.2.4 Evaluation of Named Entity Recognition Methods 255
10.3 Entity Disambiguation 256
10.3.1 Clustering-Based Entity Disambiguation Method 257
10.3.2 Linking-Based Entity Disambiguation 262
10.3.3 Evaluation of Entity Disambiguation 268
10.4 Relation Extraction 270
10.4.1 Relation Classification Using Discrete Features 272
10.4.2 Relation Classification Using Distributed Features 279
10.4.3 Relation Classification Based on Distant Supervision 282
10.4.4 Evaluation of Relation Classification 283
10.5 Event Extraction 284
10.5.1 Event Description Template 284
10.5.2 Event Extraction Method 286
10.5.3 Evaluation of Event Extraction 295
10.6 Further Reading 295
Exercises 296
11 Automatic Text Summarization 298
11.1 Main Tasks in Text Summarization 298
11.2 Extraction-Based Summarization 300
11.2.1 Sentence Importance Estimation 300
11.2.2 Constraint-Based Summarization Algorithms 311
11.3 Compression-Based Automatic Summarization 312
11.3.1 Sentence Compression Method 313
11.3.2 Automatic Summarization Based on Sentence Compression 318
11.4 Abstractive Automatic Summarization 320
11.4.1 Abstractive Summarization Based on Information Fusion 320
11.4.2 Abstractive Summarization Based on the Encoder-Decoder Framework 326
11.5 Query-Based Automatic Summarization 329
11.5.1 Relevance Calculation Based on the Language Model 330
11.5.2 Relevance Calculation Based on Keyword Co-occurrence 330
11.5.3 Graph-Based Relevance Calculation Method 331
11.6 Crosslingual and Multilingual Automatic Summarization 332
11.6.1 Crosslingual Automatic Summarization 332
11.6.2 Multilingual Automatic Summarization 336
11.7 Summary Quality Evaluation and Evaluation Workshops 338
11.7.1 Summary Quality Evaluation Methods 338
11.7.2 Evaluation Workshops 343
11.8 Further Reading 345
Exercises 346
References 347
Alternative description
Chapter 1. Introduction -- Chapter 2. Data Annotation and Preprocessing -- Chapter 3. Text Representation -- Chapter 4. Text Representation with Pretraining and Fine-tuning -- Chapter 5. Text classification -- Chapter 6. Text Clustering -- Chapter 7. Topic Model -- Chapter 8. Sentiment Analysis and Opinion Mining -- Chapter 9. Topic Detection and Tracking -- Chapter 10. Information Extraction -- Chapter 11. Automatic Text Summarization.
Date open sourced
2021-05-26
A file's MD5 is a hash computed from the file's contents and is reasonably unique to that content. All of the shadow libraries indexed here primarily use MD5 hashes to identify files.
A file may appear in multiple shadow libraries. For information on the various datasets we have compiled, see the Datasets page.
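As a practical aside, an MD5 can be used to check that a downloaded copy matches the catalog record. Below is a minimal Python sketch; the file path is a placeholder, and the expected digest is copied from the nexusstc filename listed above, which looks like an MD5 hex digest but is an assumption rather than a confirmed value for this file.

```python
import hashlib
from pathlib import Path

def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading in 1 MB chunks
    so large PDFs do not have to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    # Hypothetical path; replace with the actual downloaded file.
    downloaded = Path("Text Data Mining.pdf")
    # Assumed expected value, taken from the nexusstc filename above.
    expected = "a3609b59e3f2f264082a374fb12193e9"
    actual = file_md5(downloaded)
    print("OK" if actual == expected else f"MISMATCH: got {actual}")
```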