Survei Terhadap Pengukuran Kesamaan Teks: Survey of Text Similarity Measurement

Krisna Adiyarta Musodo; Suwasti Broto

doi:10.36080/jk.v1i1.3

Authors

Krisna Adiyarta Musodo, Universitas Budi Luhur
Suwasti Broto, Universitas Budi Luhur

DOI:

https://doi.org/10.36080/jk.v1i1.3

Keywords:

Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity

Abstract

Measuring the similarity between words, sentences, paragraphs, and documents is an important area of research in applications such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short-answer grading, machine translation, and text summarization. This article surveys methods for measuring the similarity of texts or strings. It is organized around three approaches: string-based, corpus-based, and knowledge-based similarity measures. It also presents combinations of these measures.


