This text covers the emerging technologies of document retrieval, information extraction, and text categorization in a way that highlights commonalities in terms of both general principles and practical issues. It seeks to satisfy a need on the part of technology practitioners in the Internet space, faced with having to make difficult decisions as to what research has been done and what the best practices are. It is not intended as a vendor guide (such things are quickly out of date), or as a recipe for building applications (such recipes are very context-dependent). But it does identify the key technologies, the issues involved, and their strengths and weaknesses, with attention to evaluation in every chapter, both in terms of methodology (how to evaluate) and what controlled experimentation and industrial experience have to tell us.
“In general, the book is a very good, concise reference book filled with many theoretical principles and practical guidelines. I recommend this book to anyone who wants to build applications related to text retrieval, information extraction and categorization.”
Zhongdong Zhang, Novator Systems Ltd., Toronto, on Linguist List Vol-14-226, 2003
“The authors had the good idea of not making this book a vendor guide but rather an overview of methodologies and technologies available and the evaluation criteria for the techniques described. I do not believe their goal was to publish a detailed overview but an introduction to the various technologies available. In that regard, the book is very successful and I much appreciate it because key concepts are clearly outlined, which makes it easier to follow the authors through the more complex parts of the book. I would recommend it to anyone who is interested in NLP and its applications to the new challenges brought out by the arrival of the information age.”
Patrick Drouin, University of Montreal, in Terminology 10:1, 2004
“In my view, the book is very practical: certainly, since it is pretty comprehensible and does not go into too profound details, it could serve well as a textbook for an introductory course. However, the book is not intended exclusively as an academic text. It is also aimed at software engineers, project managers, and technology executives who want or need to understand the technology at some level. I think that such people may find it useful, and that it may provoke ideas, discussions, and action in the field of applied research and development.”
Martin Holub, in The Prague Bulletin of Mathematical Linguistics Vol. 83, 2005
“Some special features of the book include solid coverage of evaluation techniques in every chapter, excellent endnotes, and references to exactly the right stuff. However, the most salient feature of this book is the clear and cogent writing. It reads much like a series of well-written review articles and is actually enjoyable to read while not skimping at all on technical detail.”
K. Bretonnel Cohen, University of Colorado, in Language 80(1), 2004
2024. Contextual word disambiguates of Ge'ez language with homophonic using machine learning. Ampersand 12 ► pp. 100169 ff.
Ibañez, Marilyn Minicucci, Reinaldo Roberto Rosa & Lamartine Nogueira Frutuoso Guimarães
2024. Sentiment Analysis in Social Medias for Threats Prediction of Natural Extreme Events. In Encyclopedia of Information Science and Technology, Sixth Edition [Advances in Information Quality and Management], ► pp. 1 ff.
Nnamoko, Nonso, Themis Karaminis, Jack Procter, Joseph Barrowclough & Ioannis Korkontzelos
2024. Automatic language ability assessment method based on natural language processing. Natural Language Processing Journal 8 ► pp. 100094 ff.
Jbel, Mouad, Imad Hafidi & Abdelmoutalib Metrane
2023. MYC: A Moroccan Corpus for Sentiment Analysis. In Advances in Machine Intelligence and Computer Science Applications [Lecture Notes in Networks and Systems, 656], ► pp. 59 ff.
Miok, Kristian, Padraig Corcoran & Irena Spasić
2023. The Value of Numbers in Clinical Text Classification. Machine Learning and Knowledge Extraction 5:3 ► pp. 746 ff.
Nuccio, Massimiliano & Sofia Mogno
2023. Methodology and Empirical Strategy. In Mapping Digital Skills in Cultural and Creative Industries in Italy [Contributions to Management Science], ► pp. 43 ff.
Ibañez, Marilyn Minicucci, Reinaldo Roberto Rosa & Lamartine Nogueira Frutuoso Guimarães
2022. Applying Sentiment Analysis Techniques in Social Media Data About Threat of Armed Conflicts Using Two Times Series Models. In Handbook of Research on Artificial Intelligence Applications in Literary Works and Social Media [Advances in Computational Intelligence and Robotics], ► pp. 220 ff.
Ibañez, Marilyn Minicucci, Reinaldo Roberto Rosa & Lamartine Nogueira Frutuoso Guimarães
2022. Threat Emotion Analysis in Social Media. In Handbook of Research on Opinion Mining and Text Analytics on Literary Works and Social Media [Advances in Web Technologies and Engineering], ► pp. 293 ff.
2021. Note-Taking Evaluation Using Network Illustrations Based on Term Co-occurrence in a Blended Learning Environment. In Note Taking Activities in E-Learning Environments [Behaviormetrics: Quantitative Approaches to Human Behavior, 11], ► pp. 51 ff.
Funkner, Anastasia A. & Sergey V. Kovalchuk
2020. Time Expressions Identification Without Human-Labeled Corpus for Clinical Text Mining in Russian. In Computational Science – ICCS 2020 [Lecture Notes in Computer Science, 12140], ► pp. 591 ff.
2018. SIAAC: Sentiment Polarity Identification on Arabic Algerian Newspaper Comments. In Applied Computational Intelligence and Mathematical Methods [Advances in Intelligent Systems and Computing, 662], ► pp. 139 ff.
Rivolli, Adriano, Carlos Soares & Andre C.P.L.F. de Carvalho
2018. 2018 7th Brazilian Conference on Intelligent Systems (BRACIS), ► pp. 414 ff.
Rivolli, Adriano, Carlos Soares & André C. P. L. F. de Carvalho
2018. Enhancing multilabel classification for food truck recommendation. Expert Systems 35:4
Banerjee, Binayak, Tania Sarkar, Pratap Chakraborty & Alok Ranjan Pal
2017. 2017 2nd IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT), ► pp. 768 ff.
Chadha, Sanchit, Antuan Byalik, Eli Tilevich & Alla Rozovskaya
2017. Facilitating the development of cross-platform software via automated code synthesis from web-based programming resources. Computer Languages, Systems & Structures 48 ► pp. 3 ff.
Nabhan, Rabih Joseph
2017. Stylistic Awareness to Analyze and Comprehend Authentic Discourse in Language Classrooms. Open Journal of Modern Linguistics 07:03 ► pp. 185 ff.
Rivolli, Adriano, Larissa C. Parker & Andre C. P. L. F. de Carvalho
2017. Food Truck Recommendation Using Multi-label Classification. In Progress in Artificial Intelligence [Lecture Notes in Computer Science, 10423], ► pp. 585 ff.
Cohen, Kevin Bretonnel, Benjamin Glass, Hansel M. Greiner, Katherine Holland-Bouley, Shannon Standridge, Ravindra Arya, Robert Faist, Diego Morita, Francesco Mangano, Brian Connolly, Tracy Glauser & John Pestian
2016. Methodological Issues in Predicting Pediatric Epilepsy Surgery Candidates through Natural Language Processing and Machine Learning. Biomedical Informatics Insights 8 ► pp. BII.S38308 ff.
Indu, M & K V Kavitha
2016. 2016 International Conference on Research Advances in Integrated Navigation Systems (RAINS), ► pp. 1 ff.
Byalik, Antuan, Sanchit Chadha & Eli Tilevich
2015. Proceedings of the 2015 ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences, ► pp. 99 ff.
Byalik, Antuan, Sanchit Chadha & Eli Tilevich
2016. Native-2-native: automated cross-platform code synthesis from web-based programming resources. ACM SIGPLAN Notices 51:3 ► pp. 99 ff.
Li, Simon, Kamrun Nahar & Benjamin C. M. Fung
2015. Product customization of tablet computers based on the information of online reviews by customers. Journal of Intelligent Manufacturing 26:1 ► pp. 97 ff.
Tahir, Muhammad Atif, Emdad Khan & Adel Al Salem
2015. 2015 2nd World Symposium on Web Applications and Networking (WSWAN), ► pp. 1 ff.
2014. Toward an enriched understanding of factors influencing Filipino behavior during elections through the analysis of Twitter data. Philippine Political Science Journal 35:2 ► pp. 203 ff.
Amolochitis, Emmanouil, Ioannis T. Christou, Zheng-Hua Tan & Ramjee Prasad
2013. A heuristic hierarchical scheme for academic search and retrieval. Information Processing & Management 49:6 ► pp. 1326 ff.
Lesmo, Leonardo, Alessandro Mazzei, Monica Palmirani & Daniele P. Radicioni
2013. TULSI: an NLP system for extracting legal modificatory provisions. Artificial Intelligence and Law 21:2 ► pp. 139 ff.
Perea-Ortega, José M., Arturo Montejo-Ráez, M. Teresa Martín-Valdivia & L. Alfonso Ureña-López
2013. Semantic tagging of video ASR transcripts using the web as a source of knowledge. Computer Standards & Interfaces 35:5 ► pp. 519 ff.
Perea-Ortega, José M., Arturo Montejo-Ráez, M. Teresa Martín-Valdivia & L. Alfonso Ureña-López
2013. Generating web-based corpora for video transcripts categorization. Expert Systems with Applications 40:1 ► pp. 337 ff.
Anchieta, Rafael T., Rogerio F. de Sousa & Raimundo S. Moura
2012. 2012 XXXVIII Conferencia Latinoamericana En Informatica (CLEI), ► pp. 1 ff.
Kimbrough, Steven O., Thomas Y. Lee & Ulku Oktem
2012. On Deriving Indicators from Texts. In Modeling for Decision Support in Network-Based Services [Lecture Notes in Business Information Processing, 42], ► pp. 196 ff.
Leopold, Henrik, Sergey Smirnov & Jan Mendling
2012. On the refactoring of activity labels in business process models. Information Systems 37:5 ► pp. 443 ff.
Pestian, John P., Pawel Matykiewicz, Michelle Linn-Gust, Brett South, Ozlem Uzuner, Jan Wiebe, K. Bretonnel Cohen, John Hurdle & Christopher Brew
2012. Sentiment Analysis of Suicide Notes: A Shared Task. Biomedical Informatics Insights 5s1 ► pp. BII.S9042 ff.
Wnuk, Krzysztof, Martin Höst & Björn Regnell
2012. Replication of an experiment on linguistic tool support for consolidation of requirements from multiple sources. Empirical Software Engineering 17:3 ► pp. 305 ff.
2011. The SALAH Project: Segmentation and Linguistic Analysis of ḥadīṯ Arabic Texts. In Information Retrieval Technology [Lecture Notes in Computer Science, 7097], ► pp. 538 ff.
Hou, Wen-Juan & Jia-Hao Tsao
2011. Automatic Assessment of Students' Free-Text Answers with Different Levels. International Journal on Artificial Intelligence Tools 20:02 ► pp. 327 ff.
Kim, KyungTae, Sungahn Ko, Niklas Elmqvist & David S. Ebert
2011. 2011 44th Hawaii International Conference on System Sciences, ► pp. 1 ff.
Rattanyu, Kanlaya & Makoto Mizukawa
2011. Emotion Recognition Based on ECG Signals for Service Robots in the Intelligent Space During Daily Life. Journal of Advanced Computational Intelligence and Intelligent Informatics 15:5 ► pp. 582 ff.
Mala, Piotr
2010. ROZWÓJ BADAŃ NAD PRZETWARZANIEM JĘZYKA NATURALNEGO. Zagadnienia Informacji Naukowej - Studia Informacyjne 48:2(96) ► pp. 21 ff.
Spangler, W. S., J. T. Kreulen, Y. Chen, L. Proctor, A. Alba, A. Lelescu & A. Behal
2010. A smarter process for sensing the information space. IBM Journal of Research and Development 54:4 ► pp. 1 ff.
Wu, Qin, Eddie Fuller & Cun-Quan Zhang
2010. Graph Model for Pattern Recognition in Text. In Mining and Analyzing Social Networks [Studies in Computational Intelligence, 288], ► pp. 1 ff.
Chen, Ying, Scott Spangler, Jeffrey Kreulen, Stephen Boyer, Thomas D. Griffin, Alfredo Alba, Amit Behal, Bin He, Linda Kato, Ana Lelescu, Cheryl Kieliszewski, Xian Wu & Li Zhang
2009. 2009 IEEE International Conference on Data Mining Workshops, ► pp. 270 ff.
Hudon, Michèle, Clément Arsenault, Lyne Da Sylva & Dominic Forest
2009. 2. Le traitement du document. In Introduction aux sciences de l'information, ► pp. 53 ff.
Irmak, Utku, Vadim von Brzeski & Reiner Kraft
2009. 2009 IEEE 25th International Conference on Data Engineering, ► pp. 457 ff.
Larson, Martha, Eamonn Newman & Gareth J. F. Jones
2009. Overview of VideoCLEF 2008: Automatic Generation of Topic-Based Feeds for Dual Language Audio-Visual Content. In Evaluating Systems for Multilingual and Multimodal Information Access [Lecture Notes in Computer Science, 5706], ► pp. 906 ff.
MacFarlane, Katrinna & Violeta Holmes
2009. 2009 International Conference on Management and Service Science, ► pp. 1 ff.
Nau, Dana S.
2009. Artificial Intelligence and Automation. In Springer Handbook of Automation, ► pp. 249 ff.
Norouzzadeh, Mohammad S., Ayoub Bagheri & Mohammad H. Saraee
2009. 2009 2nd IEEE International Conference on Computer Science and Information Technology, ► pp. 143 ff.
Alonso, Omar, Premkumar T. Devanbu & Michael Gertz
2008. Proceedings of the 2008 international working conference on Mining software repositories, ► pp. 125 ff.
Gallo, Ignazio & Elisabetta Binaghi
2008. Information Extraction and Classification from Free Text Using a Neural Approach. In Progress in Pattern Recognition, Image Analysis and Applications [Lecture Notes in Computer Science, 4756], ► pp. 921 ff.
Motta, Eduardo, Alexandre Andreatta & Sean Siqueira
2008. Proceedings of the 2008 Euro American Conference on Telematics and Information Systems, ► pp. 1 ff.
Nakayama, Minoru & Yosiyuki Takahasi
2008. Estimation of certainty for responses to multiple-choice questionnaires using eye movements. ACM Transactions on Multimedia Computing, Communications, and Applications 5:2 ► pp. 1 ff.
Spangler, Scott, Larry Proctor & Ying Chen
2008. 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, ► pp. 258 ff.
Stasko, John, Carsten Görg & Zhicheng Liu
2008. Jigsaw: Supporting Investigative Analysis through Interactive Visualization. Information Visualization 7:2 ► pp. 118 ff.
Valette, Mathieu & Monique Slodzian
2008. Sémantique des textes et Recherche d'Information. Revue française de linguistique appliquée Vol. XIII:1 ► pp. 119 ff.
Antunes, Bruno, Nuno Seco & Paulo Gomes
2007. Using Ontologies for Software Development Knowledge Reuse. In Progress in Artificial Intelligence [Lecture Notes in Computer Science, 4874], ► pp. 357 ff.
Behal, Amit, Ying Chen, Cheryl Kieliszewski, Ana Lelescu, Bin He, Jie Cui, Jeffrey Kreulen, James Rhodes & W. Scott Spangler
2007. Business Insights Workbench – An Interactive Insights Discovery Solution. In Human Interface and the Management of Information. Interacting in Information Environments [Lecture Notes in Computer Science, 4558], ► pp. 834 ff.
Caporaso, J. Gregory, William A. Baumgartner, David A. Randolph, K. Bretonnel Cohen & Lawrence Hunter
2007. Rapid Pattern Development for Concept Recognition Systems: Application to Point Mutations. Journal of Bioinformatics and Computational Biology 05:06 ► pp. 1233 ff.
Jankowski, Andrzej & Andrzej Skowron
2007. A Wistech Paradigm for Intelligent Systems. In Transactions on Rough Sets VI [Lecture Notes in Computer Science, 4374], ► pp. 94 ff.
Jo, Taeho & Malrey Lee
2007. 5th ACIS International Conference on Software Engineering Research, Management & Applications (SERA 2007), ► pp. 289 ff.
Netisopakul, Ponrudee & Norapan Siriumpunkul
2007. Educational Service Web Database Prototype. In Advanced Intelligent Computing Theories and Applications. With Aspects of Contemporary Intelligent Computing Techniques [Communications in Computer and Information Science, 2], ► pp. 479 ff.
Rasmussen, Steen, Diana Mangalagiu, Hans Ziock, Johan Bollen & Gordon Keating
2007. 2007 IEEE Symposium on Artificial Life, ► pp. 468 ff.
Soni, Ankit, Nees Jan van Eck & Uzay Kaymak
2007. 2007 IEEE Symposium on Computational Intelligence in Multi-Criteria Decision-Making, ► pp. 205 ff.
Stasko, John, Carsten Gorg, Zhicheng Liu & Kanupriya Singhal
2007. 2007 IEEE Symposium on Visual Analytics Science and Technology, ► pp. 131 ff.
Voloshynovska, Iryna & Nadiya Andreychuk
2007. 2007 9th International Conference - The Experience of Designing and Applications of CAD Systems in Microelectronics, ► pp. 583 ff.
von Brzeski, Vadim, Utku Irmak & Reiner Kraft
2007. Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, ► pp. 691 ff.
Meng, Xian-Jun, Qing-Cai Chen, Xiao-Long Wang & Xiao-Hong Yang
2007. 2007 IEEE International Conference on Systems, Man and Cybernetics, ► pp. 3075 ff.
Banville, Debra L.
2006. Mining chemical structural information from the drug literature. Drug Discovery Today 11:1-2 ► pp. 35 ff.
Banville, Debra L.
2009. Mining Chemical Structural Information from the Literature. In Pharmaceutical Data Mining, ► pp. 521 ff.
Conrad, Jack G. & Cindy P. Schriber
2006. Managing déjà vu: Collection building for the identification of nonidentical duplicate documents. Journal of the American Society for Information Science and Technology 57:7 ► pp. 921 ff.
Fosdick, Howard
2006. Programming languages for library and textual processing. Bulletin of the American Society for Information Science and Technology 31:6 ► pp. 21 ff.
Hunter, Lawrence & K. Bretonnel Cohen
2006. Biomedical Language Processing: What's Beyond PubMed? Molecular Cell 21:5 ► pp. 589 ff.
Jackson, P. & F. Schilder
2006. Natural Language Processing: Overview. In Encyclopedia of Language & Linguistics, ► pp. 503 ff.
Mikeal, Adam, Cody Green, Alexey Maslov, Scott Phillips & John Leggett
2006. 2006 Fourth Latin American Web Congress, ► pp. 162 ff.
Morioka, Nobuyuki & Ashesh Mahidadia
2006. Enhancing Information Retrieval Using Problem Specific Knowledge. In Advances in Knowledge Acquisition and Management [Lecture Notes in Computer Science, 4303], ► pp. 244 ff.
Radovanović, Miloš & Mirjana Ivanović
2006. CatS: A Classification-Powered Meta-Search Engine. In Advances in Web Intelligence and Data Mining [Studies in Computational Intelligence, 23], ► pp. 191 ff.
van Diggelen, Jurriaan, Robbert-Jan Beun, Frank Dignum, Rogier M. van Eijk & John-Jules Meyer
2006. Proceedings of the fifth international joint conference on Autonomous agents and multiagent systems, ► pp. 899 ff.
Wang, Xiaoting, Peng Zhu, Giovanni Felici & Evangelos Triantaphyllou
2006. Some Future Trends in Data Mining. In Data Mining and Knowledge Discovery Approaches Based on Rule Induction Techniques [Massive Computing, 6], ► pp. 695 ff.
Zhang, Dell & Wee Sun Lee
2006. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, ► pp. 474 ff.
Chaudiron, Stéphane
2005. Terminologie, ingénierie linguistique et gestion de l'information. Langages n° 157:1 ► pp. 25 ff.
Dale, R., Li Lei, H. de Vries, M. Gardiner & M. Tilbrook
2005. 2005 International Conference on Natural Language Processing and Knowledge Engineering, ► pp. 651 ff.
Davies, John, Marko Grobelnik & Dunja Mladenić
2005. Automated knowledge discovery in advanced knowledge management. Journal of Knowledge Management 9:5 ► pp. 132 ff.
Dozier, C. & P. Jackson
2005. Mining Text for Expert Witnesses. IEEE Software 22:3 ► pp. 94 ff.
Natarajan, J., D. Berrar, C. J. Hack & W. Dubitzky
2005. Knowledge Discovery in Biology and Biotechnology Texts: A Review of Techniques, Evaluation Strategies, and Applications. Critical Reviews in Biotechnology 25:1-2 ► pp. 31 ff.
Natt och Dag, Johan & Vincenzo Gervasi
2005. Managing Large Repositories of Natural Language Requirements. In Engineering and Managing Software Requirements, ► pp. 219 ff.
Paz-Trillo, Christian, Renata Wassermann & Paula P. Braga
2005. An information retrieval application using ontologies. Journal of the Brazilian Computer Society 11:2 ► pp. 17 ff.
Saric, F., J. Snajder, B.D. Basic & H. Eklic
2005. 27th International Conference on Information Technology Interfaces, 2005, ► pp. 214 ff.
Schulze‐Kremer, Steffen & Barry Smith
2005. Ontologies for the life sciences. In Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics.
Wang, Wei, Diep Bich Do & Xuemin Lin
2005. Term Graph Model for Text Classification. In Advanced Data Mining and Applications [Lecture Notes in Computer Science, 3584], ► pp. 19 ff.
Dag, J.N., V. Gervasi, S. Brinkkemper & B. Regnell
2004. Proceedings. 12th IEEE International Requirements Engineering Conference, 2004, ► pp. 265 ff.
Dale, Robert, Rafael Calvo & Marc Tilbrook
2004. Key Element Summarisation: Extracting Information from Company Announcements. In AI 2004: Advances in Artificial Intelligence [Lecture Notes in Computer Science, 3339], ► pp. 438 ff.
Hartley, James, Eric Sotto & Claire Fox
2004. Clarity Across the Disciplines. Science Communication 26:2 ► pp. 188 ff.
Gao, Xiangzhu, San Murugesan & B. Lo
2004. IEEE/WIC/ACM International Conference on Web Intelligence (WI'04), ► pp. 192 ff.
Dale, Robert, Cecile Paris & Marc Tilbrook
2003. Information Extraction via Path Merging. In AI 2003: Advances in Artificial Intelligence [Lecture Notes in Computer Science, 2903], ► pp. 150 ff.
Jackson, Peter, Khalid Al-Kofahi, Alex Tyrrell & Arun Vachher
2003. Information extraction from case law and retrieval of prior cases. Artificial Intelligence 150:1-2 ► pp. 239 ff.
Mladenić, Dunja & Marko Grobelnik
2003. Text and Web Mining. In Data Mining and Decision Support, ► pp. 15 ff.
Portscher, Edwin, James Geller & Richard Scherl
2003. Using Internet Glossaries to Determine Interests from Home Pages. In E-Commerce and Web Technologies [Lecture Notes in Computer Science, 2738], ► pp. 248 ff.
This list is based on CrossRef data as of 25 October 2024. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers.
Any errors therein should be reported to them.