Document Processing Using Machine Learning

An introduction to the book "Document Processing Using Machine Learning" by Sk Md Obaidullah, K. C. Santosh, Teresa Gonçalves, Nibaran Das, and Kaushik Roy, published by CRC Press in 2020. The book is available in PDF format, in English. "Document Processing Using Machine Learning" is listed under the Uncategorized category.

Document Processing Using Machine Learning presents a set of resources for students and researchers who apply machine learning in the document image analysis (DIA) domain, and it covers multiple document processing problems. Starting with an explanation of how artificial intelligence (AI) plays an important role in this domain, the book goes on to discuss how different machine learning algorithms can be applied to classification/recognition and clustering problems regardless of the type of input data: images or text. In brief, the book offers comprehensive coverage of the most essential topics, including:

- The role of AI for document image analysis
- Optical character recognition
- Machine learning algorithms for document analysis
- Extreme learning machines and their applications
- Mathematical foundation for Web text document analysis
- Social media data analysis
- Modalities for document dataset generation

This book serves both undergraduate and graduate scholars in Computer Science/Information Technology/Electrical and Computer Engineering. It is also a good fit for early-career research scientists and industrialists in the domain.
Cover
Half Title
Title Page
Copyright Page
Table of Contents
Preface
Editors
Contributors

1: Artificial Intelligence for Document Image Analysis
  1.1 Introduction
  1.2 Optical Character Recognition
    1.2.1 Dealing with Noise
    1.2.2 Segmentation
    1.2.3 Applications
      1.2.3.1 Legal Industry
      1.2.3.2 Banking
      1.2.3.3 Healthcare
      1.2.3.4 CAPTCHA
      1.2.3.5 Automatic Number Recognition
      1.2.3.6 Handwriting Recognition
  1.3 Natural Language Processing
    1.3.1 Tokenization
    1.3.2 Stop Word Removal
    1.3.3 Stemming
    1.3.4 Part of Speech Tagging
    1.3.5 Parsing
    1.3.6 Applications
      1.3.6.1 Text Summarization
      1.3.6.2 Question Answering
      1.3.6.3 Text Categorization
      1.3.6.4 Sentiment Analysis
      1.3.6.5 Word Sense Disambiguation
  1.4 Conclusion
  References

2: An Approach toward Character Recognition of Bangla Handwritten Isolated Characters
  2.1 Introduction
  2.2 Proposed Framework
    2.2.1 Database
    2.2.2 Feature Extraction
    2.2.3 Attribute Selection and Classification
  2.3 Results and Discussion
    2.3.1 Comparative Study
  2.4 Conclusion
  References

3: Artistic Multi-Character Script Identification
  3.1 Introduction
  3.2 Literature Review
  3.3 Data Collection and Preprocessing
  3.4 Feature Extraction
    3.4.1 Topology-Based Features
    3.4.2 Texture Feature
  3.5 Experiments
    3.5.1 Estimation Procedure
    3.5.2 Results and Analysis
  3.6 Conclusion
  References

4: A Study on the Extreme Learning Machine and Its Applications
  4.1 Introduction
  4.2 Preliminaries
  4.3 Activation Functions of ELM
    4.3.1 Sigmoid Function
    4.3.2 Hardlimit Function ('Hardlim')
    4.3.3 Radial Basis Function ('Radbas')
    4.3.4 Sine Function
    4.3.5 Triangular Basis Function ('Tribas')
  4.4 Metamorphosis of an ELM
  4.5 Applications of ELMs
    4.5.1 ELMs in Document Analysis
    4.5.2 ELMs in Medicine
    4.5.3 ELM in Audio Signal Processing
    4.5.4 ELM in Other Pattern Recognition Problems
  4.6 Challenges of ELM
  4.7 Conclusion
  References

5: A Graph-Based Text Classification Model for Web Text Documents
  5.1 Introduction
  5.2 Related Works
    5.2.1 English
    5.2.2 Chinese, Japanese and Persian
    5.2.3 Arabic and Urdu
    5.2.4 Indian Languages except Bangla
    5.2.5 Bangla
  5.3 Proposed Methodology
    5.3.1 Data Collection
    5.3.2 Pre-Processing
    5.3.3 Graph-Based Representation
    5.3.4 Classifier
  5.4 Results and Analysis
    5.4.1 Comparison with Existing Methods
  5.5 Conclusion
  Acknowledgment
  References

6: A Study of Distance Metrics in Document Classification
  6.1 Introduction
  6.2 Literature Survey
    6.2.1 Indo-European
    6.2.2 Sino-Tibetan
    6.2.3 Japonic
    6.2.4 Afro-Asiatic
    6.2.5 Dravidian
    6.2.6 Indo-Aryan
  6.3 Proposed Methodology
    6.3.1 Data Collection
    6.3.2 Pre-Processing
    6.3.3 Feature Extraction and Selection
    6.3.4 Distance Measurement
      6.3.4.1 Squared Euclidean Distance
      6.3.4.2 Manhattan Distance
      6.3.4.3 Mahalanobis Distance
      6.3.4.4 Minkowski Distance
      6.3.4.5 Chebyshev Distance
      6.3.4.6 Canberra Distance
  6.4 Results and Discussion
    6.4.1 Comparison with Existing Methods
  6.5 Conclusion
  Acknowledgment
  References

7: A Study of Proximity of Domains for Text Categorization
  7.1 Introduction
  7.2 Existing Work
  7.3 Proposed Methodology
    7.3.1 Data Collection
    7.3.2 Pre-Processing
    7.3.3 Feature Extraction and Selection
    7.3.4 Classifiers
  7.4 Results and Analysis
  7.5 Conclusion
  Acknowledgment
  References

8: Supervised Learning for Aggression Identification and Author Profiling over Twitter Dataset
  8.1 Introduction
  8.2 Overview of Aggression Identification
    8.2.1 Dataset
    8.2.2 Data Characteristics
    8.2.3 Data Preprocessing
    8.2.4 Feature Extraction
    8.2.5 Experimental Setup
    8.2.6 System Modeling
    8.2.7 Results
  8.3 Overview of Author Profiling
    8.3.1 Datasets
    8.3.2 Preprocessing
    8.3.3 Feature Extraction
    8.3.4 Experimental Setup
    8.3.5 Algorithm and Fine-Tuning the Model
    8.3.6 Results
  8.4 Conclusion and Future Work
  References

9: The Effect of Using Features Computed from Generated Offline Images for Online Bangla Handwritten Character Recognition
  9.1 Introduction
  9.2 Literature Review
    9.2.1 Direction Code-Based Feature [50]
    9.2.2 Area and Local Features
      9.2.2.1 Area Feature
      9.2.2.2 Local Feature
    9.2.3 Point-Based Feature [2, 48]
    9.2.4 Transition Count Feature [61]
    9.2.5 Topological Feature [61]
      9.2.5.1 Crossing Point
  9.3 Database Preparation and Pre-processing
    9.3.1 Design of the Data Collection Form
  9.4 Feature Extraction
    9.4.1 Directed Hausdorff Distance (DHD)-Based Features
  9.5 Experimental Results and Analysis
  9.6 Conclusion
  References

10: Handwritten Character Recognition for Palm-Leaf Manuscripts
  10.1 Introduction
  10.2 Palm-Leaf Manuscripts
  10.3 Challenges in OHCR for Palm-Leaf Manuscripts
  10.4 Document Processing and Recognition for Palm-Leaf Manuscripts
    10.4.1 Preprocessing
      10.4.1.1 Binarization
      10.4.1.2 Noise Reduction
      10.4.1.3 Skew Correction
    10.4.2 Segmentation
      10.4.2.1 Text Line Segmentation
      10.4.2.2 Character Segmentation
    10.4.3 Recognition
      10.4.3.1 Segmentation-Based Approach
      10.4.3.2 Segmentation-Free Approach

This book covers the idea of artificial intelligence for document analysis. It discusses optical character recognition techniques with an emphasis on Bangla isolated handwritten characters, script identification from character-level texts, and signature data.
