Automated thematic dictionary creation using the web based on WordNet, Spacy, and Simhash


Toprak A., TURAN M.

Data and Information Management, 2024 (Scopus)

  • Publication Type: Article / Full Article
  • Publication Date: 2024
  • DOI: 10.1016/j.dim.2024.100088
  • Journal Name: Data and Information Management
  • Journal Indexes: Scopus
  • Keywords: Automatic dictionary creation, Big data, Financial dictionary, Natural Language Processing, Text similarity algorithms
  • İstanbul Ticaret Üniversitesi Affiliated: Yes

Abstract

A dictionary is a helpful tool for most context-based Natural Language Processing research. The words in a dictionary establish the context coverage for a specific application area. In this study, a novel model is proposed to generate a thematic dictionary from web resources. The model draws on different text similarity algorithms to widen dictionary coverage and increase the dictionary's internal similarity. For example, to create a financial dictionary, the algorithm was started with the general seed word “finance”. A web search was executed with this word, and the top three web pages returned by the search engine were processed. The words in the contents of these pages were ranked by importance using the term frequency-inverse document frequency (TF-IDF) metric. The selected words were then inserted into three different dictionaries, each controlled separately by one of the WordNet, Spacy, and Simhash text similarity algorithms. All of the words added to these dictionaries were then used together for further web searches. This process (search and dictionary update) was repeated for each dictionary separately until it reached the upper word count, which was set to 250 words. Finally, the three dictionaries were merged to form the final financial dictionary. This automatic financial dictionary was compared with a manually created financial dictionary in terms of quality. The internal WordNet similarity rate of the words in the automatic financial dictionary was 29.01%, while it was 23.41% in the manual financial dictionary. To measure the similarity between the two dictionaries, the words of each were merged into a full text and the two texts were compared using Simhash, which gave a similarity of 72.30%. Although the two dictionaries contain largely similar words, the automatic dictionary had the stronger internal semantic representation.
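Two of the steps named in the abstract, TF-IDF ranking of page words and Simhash comparison of two texts, can be illustrated with a minimal Python sketch. The abstract does not give implementation details, so everything here is an assumption made for illustration: the function names (`tokenize`, `rank_by_tfidf`, `simhash`, `simhash_similarity`) are hypothetical, the tokenizer keeps only ASCII letters, the IDF is a smoothed variant, and the Simhash fingerprint is built from MD5 token hashes with 64 bits. It is not the authors' implementation.

```python
import hashlib
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def rank_by_tfidf(pages, top_k=5):
    """Return the top_k words across pages, ranked by summed TF-IDF score."""
    docs = [tokenize(page) for page in pages]
    n_docs = len(docs)
    # Document frequency: number of pages in which each word occurs.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = Counter()
    for doc in docs:
        if not doc:
            continue
        for word, count in Counter(doc).items():
            tf = count / len(doc)
            # Smoothed IDF (an assumption) so words on every page keep a nonzero weight.
            idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0
            scores[word] += tf * idf
    return [word for word, _ in scores.most_common(top_k)]


def simhash(text, bits=64):
    """Compute a Simhash fingerprint; each token votes on every bit, weighted by its count."""
    weights = [0] * bits
    for token, count in Counter(tokenize(text)).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            weights[i] += count if (token_hash >> i) & 1 else -count
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def simhash_similarity(text_a, text_b, bits=64):
    """Similarity in [0, 1]: one minus the normalized Hamming distance of the fingerprints."""
    distance = bin(simhash(text_a, bits) ^ simhash(text_b, bits)).count("1")
    return 1.0 - distance / bits
```

In this sketch, merging each dictionary's words into one string and passing the two strings to `simhash_similarity` mirrors the comparison of the automatic and manual dictionaries described above, and `rank_by_tfidf` stands in for the ranking of words from the top three pages.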