A method for predicting title of given text

Thesis Type: Master's

Institution Where the Thesis Was Conducted: İstanbul Ticaret Üniversitesi, Fen Bilimleri Enstitüsü, Türkiye

Thesis Approval Date: 2021

Thesis Language: English

Student: MOHAMED BARRE OMER

Advisor: KASAPBAŞI MUSTAFA CEM

Abstract:

Nowadays, vast text data resources are everywhere, in the form of books, news journals, websites, and many more. Exploring text data and gaining insight into it is crucial.
Titles give a summary of the article and history, and obtaining a semantically and syntactically coherent title is quite a challenging task. In this study, a deep learning system, namely an LSTM (Long Short-Term Memory) neural network, is proposed for predicting the title of a given text (PTT), which is a natural language generation system. Deep neural network architectures have recently gained popularity, and they make text generation easier than previous statistical models did. The study uses a publicly available news dataset from Kaggle, called news summary, for headline generation. A 500 hundred news summary subset is chosen out of 98403 records for efficiency and to reduce processing power requirements. As preprocessing, stop words are removed first, then punctuation is corrected and the text is converted to lower case. The Porter Stemmer is then used to obtain the stems of the words in the text. After tokenization, the text is divided into sequences of 16 words. Word embedding is used to feed numerical values to the LSTM. The proposed LSTM model generated high-quality titles according to human evaluation, based on the results obtained from Recall-Oriented Understudy for Gisting Evaluation (ROUGE):

ROUGE-1: average recall 0.69886, average precision 0.69924, average F1 0.69905
ROUGE-2: average recall 0.69874, average precision 0.69895, average F1 0.69884
ROUGE-L: average recall 0.69829, average precision 0.69829, average F1 0.69829

Keywords: Deep learning, LSTM, NLP, rouge, text generation.

CONTENTS

ABSTRACT
ÖZET
ACKNOWLEDGEMENTS
TABLE OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATION WORDS
1. INTRODUCTION
1.1 Motivation of the Study
1.2 Brief introduction to Summarization
1.3 Summarization Approaches
2. LITERATURE REVIEW
2.1 NLP (Natural Language Processing) Applications
2.1.1 Information extraction
2.1.2 Sentiment analysis
2.1.3 Opinion summarization
2.1.4 Speech recognition
2.1.5 Other applications of NLP
2.2 Preparing data
2.2.1 Data preprocessing
2.2.2 Word embedding
2.3 Brief Scientific Background Information About Text Generation
2.4 Attention Mechanism
3. METHODOLOGY
3.1 Model diagram
3.2 Long short-term memory (LSTM)
3.3 Algorithm Steps
3.3.1 Data set reading
3.3.2 Preprocessing data
3.3.3 Check whether the data is cleaned; if yes, go to the next step, else go back to preprocessing
3.3.4 Encoder decoder with LSTM
3.3.5 Train model
3.3.6 Summary generation
3.3.7 Long Short-Term Memory (LSTM)
3.4 Sequence to Sequence
3.4.1 Language model
3.5 Dataset and Training
3.6 Using Google Colaboratory
3.7 Launching Google Colab
3.8 Benefits of Google Colaboratory
4. EVALUATION RESULTS AND DISCUSSIONS
4.1 ROUGE
4.2 Results and Discussions
5. CONCLUSION AND IMPLICATIONS
REFERENCES
APPENDIX
BIOGRAPHY
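
The preprocessing steps named in the abstract (stop-word removal, punctuation handling, lower-casing, tokenization, and splitting into 16-word sequences) can be sketched roughly as below. This is an illustrative sketch, not the thesis code: the stop-word list is a tiny hypothetical one, lower-casing is applied first so that stop-word matching is case-insensitive, the sequences are assumed to be consecutive non-overlapping chunks, and Porter stemming is left out to keep the block free of third-party dependencies (a stemmer such as NLTK's `PorterStemmer` could be applied to each token).

```python
import string

# Hypothetical, deliberately tiny stop-word list; the abstract does not say
# which list the thesis used.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to", "for"}

SEQ_LEN = 16  # sequence length stated in the abstract


def preprocess(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Whitespace tokenization; stemming is intentionally omitted here.
    return [tok for tok in no_punct.split() if tok not in STOP_WORDS]


def to_sequences(tokens, seq_len=SEQ_LEN):
    """Split tokens into consecutive, non-overlapping chunks (an assumption)."""
    return [tokens[i:i + seq_len] for i in range(0, len(tokens), seq_len)]


def build_vocab(tokens):
    """Map each distinct token to an integer id, reserving 0 for padding."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab) + 1)
    return vocab


if __name__ == "__main__":
    tokens = preprocess("The markets rallied on Monday, and the index closed higher.")
    vocab = build_vocab(tokens)
    # Integer ids like these are what an embedding layer would look up.
    print([[vocab[t] for t in seq] for seq in to_sequences(tokens)])
```

The integer ids produced by `build_vocab` stand in for the input to the word-embedding step; the embedding and LSTM layers themselves are not shown.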
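
The ROUGE-1 and ROUGE-2 figures reported in the abstract are n-gram overlap scores between a generated title and a reference title. A minimal sketch of how recall, precision, and F1 are commonly computed for ROUGE-N follows. It assumes whitespace tokenization and a single reference per candidate, and the function names are illustrative. It is not the evaluation code used in the thesis, and it does not cover ROUGE-L, which is based on the longest common subsequence.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (recall, precision, f1) for n-gram overlap between two strings."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    # Clipped overlap: each n-gram counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    cand_total = sum(cand.values())
    ref_total = sum(ref.values())
    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / cand_total if cand_total else 0.0
    if precision + recall == 0:
        return recall, precision, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1


if __name__ == "__main__":
    generated = "the cat sat"
    reference = "the cat sat on the mat"
    print(rouge_n(generated, reference, n=1))
    print(rouge_n(generated, reference, n=2))
```

Averaging these per-title values over a test set gives average recall, precision, and F1 figures of the kind listed in the abstract.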