Learning Spark: Lightning-Fast Big Data Analysis

An introduction to the book "Learning Spark: Lightning-Fast Big Data Analysis" by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia, published by O'Reilly Media in 2015. The book is available in PDF format, in English. "Learning Spark: Lightning-Fast Big Data Analysis" is listed under the category Uncategorized.

Data in all domains is getting bigger. How can you work with it efficiently? Recently updated for Spark 1.3, this book introduces Apache Spark, the open source cluster computing system that makes data analytics fast to write and fast to run. With Spark, you can tackle big datasets quickly through simple APIs in Python, Java, and Scala. This edition includes new information on Spark SQL, Spark Streaming, setup, and Maven coordinates.

Written by the developers of Spark, this book will have data scientists and engineers up and running in no time. You'll learn how to express parallel jobs with just a few lines of code, and cover applications from simple batch jobs to stream processing and machine learning.

- Quickly dive into Spark capabilities such as distributed datasets, in-memory caching, and the interactive shell
- Leverage Spark's powerful built-in libraries, including Spark SQL, Spark Streaming, and MLlib
- Use one programming paradigm instead of mixing and matching tools like Hive, Hadoop, Mahout, and Storm
- Learn how to deploy interactive, batch, and streaming applications
- Connect to data sources including HDFS, Hive, JSON, and S3
- Master advanced topics like data partitioning and shared variables

Source other than Library of Congress

Table of Contents

Foreword

Preface
    Audience; How This Book Is Organized; Supporting Books; Conventions Used in This Book; Code Examples; Safari® Books Online; How to Contact Us; Acknowledgments

Chapter 1. Introduction to Data Analysis with Spark
    What Is Apache Spark?; A Unified Stack; Spark Core; Spark SQL; Spark Streaming; MLlib; GraphX; Cluster Managers; Who Uses Spark, and for What?; Data Science Tasks; Data Processing Applications; A Brief History of Spark; Spark Versions and Releases; Storage Layers for Spark

Chapter 2. Downloading Spark and Getting Started
    Downloading Spark; Introduction to Spark’s Python and Scala Shells; Introduction to Core Spark Concepts; Standalone Applications; Initializing a SparkContext; Building Standalone Applications; Conclusion

Chapter 3. Programming with RDDs
    RDD Basics; Creating RDDs; RDD Operations; Transformations; Actions; Lazy Evaluation; Passing Functions to Spark; Python; Scala; Java; Common Transformations and Actions; Basic RDDs; Converting Between RDD Types; Persistence (Caching); Conclusion

Chapter 4. Working with Key/Value Pairs
    Motivation; Creating Pair RDDs; Transformations on Pair RDDs; Aggregations; Grouping Data; Joins; Sorting Data; Actions Available on Pair RDDs; Data Partitioning (Advanced); Determining an RDD’s Partitioner; Operations That Benefit from Partitioning; Operations That Affect Partitioning; Example: PageRank; Custom Partitioners; Conclusion

Chapter 5. Loading and Saving Your Data
    Motivation; File Formats; Text Files; JSON; Comma-Separated Values and Tab-Separated Values; SequenceFiles; Object Files; Hadoop Input and Output Formats; File Compression; Filesystems; Local/“Regular” FS; Amazon S3; HDFS; Structured Data with Spark SQL; Apache Hive; JSON; Databases; Java Database Connectivity; Cassandra; HBase; Elasticsearch; Conclusion

Chapter 6. Advanced Spark Programming
    Introduction; Accumulators; Accumulators and Fault Tolerance; Custom Accumulators; Broadcast Variables; Optimizing Broadcasts; Working on a Per-Partition Basis; Piping to External Programs; Numeric RDD Operations; Conclusion

Chapter 7. Running on a Cluster
    Introduction; Spark Runtime Architecture; The Driver; Executors; Cluster Manager; Launching a Program; Summary; Deploying Applications with spark-submit; Packaging Your Code and Dependencies; A Java Spark Application Built with Maven; A Scala Spark Application Built with sbt; Dependency Conflicts; Scheduling Within and Between Spark Applications; Cluster Managers; Standalone Cluster Manager; Hadoop YARN; Apache Mesos; Amazon EC2; Which Cluster Manager to Use?; Conclusion

Chapter 8. Tuning and Debugging Spark
    Configuring Spark with SparkConf; Components of Execution: Jobs, Tasks, and Stages; Finding Information; Spark Web UI; Driver and Executor Logs; Key Performance Considerations; Level of Parallelism; Serialization Format; Memory Management; Hardware Provisioning; Conclusion

Chapter 9. Spark SQL
    Linking with Spark SQL; Using Spark SQL in Applications; Initializing Spark SQL; Basic Query Example; SchemaRDDs; Caching; Loading and Saving Data; Apache Hive; Parquet; JSON; From RDDs; JDBC/ODBC Server; Working with Beeline; Long-Lived Tables and Queries; User-Defined Functions; Spark SQL UDFs; Hive UDFs; Spark SQL Performance; Performance Tuning Options; Conclusion

Chapter 10. Spark Streaming
    A Simple Example; Architecture and Abstraction; Transformations; Stateless Transformations; Stateful Transformations; Output Operations; Input Sources; Core Sources; Additional Sources; Multiple Sources and Cluster Sizing; 24/7 Operation; Checkpointing; Driver Fault Tolerance; Worker Fault Tolerance; Receiver Fault Tolerance; Processing Guarantees; Streaming UI; Performance Considerations; Batch and Window Sizes; Level of Parallelism; Garbage Collection and Memory Usage; Conclusion

Chapter 11. Machine Learning with MLlib
    Overview; System Requirements; Machine Learning Basics; Example: Spam Classification; Data Types; Working with Vectors; Algorithms; Feature Extraction; Statistics; Classification and Regression; Clustering; Collaborative Filtering and Recommendation; Dimensionality Reduction; Model Evaluation; Tips and Performance Considerations; Preparing Features; Configuring Algorithms; Caching RDDs to Reuse; Recognizing Sparsity; Level of Parallelism; Pipeline API; Conclusion

Index

About the Authors

Download the book Learning Spark: Lightning-Fast Big Data Analysis