Fast Data Processing with Spark 2 - Third Edition

An introduction to the book "Fast Data Processing with Spark 2 - Third Edition" by Krishna Sankar (TotalBoox; TBX), published by Packt Publishing - ebooks Account in 2016. The book is offered as a 5-page, English-language PDF. "Fast Data Processing with Spark 2 - Third Edition" is listed under the category "Uncategorized".

Learn how to use Spark to process big data at speed and scale for sharper analytics. Put the principles into practice for faster, slicker big data projects.

About This Book
- A quick way to get started with Spark, and reap the rewards
- From analytics to engineering your big data architecture, we've got it covered
- Bring your Scala and Java knowledge, and put it to work on new and exciting problems

Who This Book Is For
This book is for developers with little to no knowledge of Spark, but with a background in Scala/Java programming. It's recommended that you have experience working with big data and a strong interest in data science.

What You Will Learn
- Install and set up Spark in your cluster
- Prototype distributed applications with Spark's interactive shell
- Perform data wrangling using the new DataFrame APIs
- Get to know the different ways to interact with Spark's distributed representation of data (RDDs)
- Query Spark with a SQL-like query syntax
- See how Spark works with big data
- Understand how data scientists and data engineers can use the Spark framework
- Implement machine learning systems with highly scalable algorithms
- Use R, the popular statistical language, to work with Spark
- Apply interesting graph algorithms and graph processing with GraphX

In Detail
When people want a way to process big data at speed, Spark is invariably the solution. With its ease of development (in comparison to the relative complexity of Hadoop), it's unsurprising that it's becoming popular with data analysts and engineers everywhere.

Beginning with the fundamentals, we'll show you how to get set up with Spark with minimum fuss. You'll then get to grips with some simple APIs before investigating machine learning and graph processing; throughout, we'll make sure you know exactly how to apply your knowledge. For cloud deployments of Spark, we will look at EC2 (both traditional and EC2MR).

You will also learn how to use the Spark shell and load data before finding out how to build and run your own Spark applications. Discover how to manipulate your RDD and get stuck into a range of DataFrame APIs. As if that's not enough, you'll also learn some useful machine learning algorithms with the help of Spark MLlib, and how to integrate Spark with R. We'll also make sure you're confident and prepared for graph processing, as you learn more about the GraphX API.

Style and approach
This book is a basic, step-by-step tutorial that will help you take advantage of all that Spark has to offer.

Table of Contents

Cover
Copyright
Credits
About the Author
About the Reviewers
www.PacktPub.com
Table of Contents
Preface

Chapter 1: Installing Spark and Setting Up Your Cluster
  Directory organization and convention
  Installing the prebuilt distribution
  Building Spark from source
  Downloading the source
  Compiling the source with Maven
  Compilation switches
  Testing the installation
  Spark topology
  A single machine
  Running Spark on EC2
  Downloading EC-scripts
  Running Spark on EC2 with the scripts
  Deploying Spark on Elastic MapReduce
  Deploying Spark with Chef (Opscode)
  Deploying Spark on Mesos
  Spark on YARN
  Spark standalone mode
  References
  Summary

Chapter 2: Using the Spark Shell
  The Spark shell
  Exiting out of the shell
  Using Spark shell to run the book code
  Loading a simple text file
  Interactively loading data from S3
  Running the Spark shell in Python
  Summary

Chapter 3: Building and Running a Spark Application
  Building Spark applications
  Data wrangling with iPython
  Developing Spark with Eclipse
  Developing Spark with other IDEs
  Building your Spark job with Maven
  Building your Spark job with something else
  References
  Summary

Chapter 4: Creating a SparkSession Object
  SparkSession versus SparkContext
  Building a SparkSession object
  SparkContext - metadata
  Shared Java and Scala APIs
  Python
  iPython
  Reference
  Summary

Chapter 5: Loading and Saving Data in Spark
  Spark abstractions
  RDDs
  Data modalities
  Data modalities and Datasets/DataFrames/RDDs
  Loading data into an RDD
  Saving your data
  References
  Summary

Chapter 6: Manipulating Your RDD
  Manipulating your RDD in Scala and Java
  Scala RDD functions
  Functions for joining the PairRDD classes
  Other PairRDD functions
  Double RDD functions
  General RDD functions
  Java RDD functions
  Spark Java function classes
  Common Java RDD functions
  Methods for combining JavaRDDs
  Functions on JavaPairRDDs
  Manipulating your RDD in Python
  Standard RDD functions
  The PairRDD functions
  References
  Summary

Chapter 7: Spark 2.0 Concepts
  Code and Datasets for the rest of the book
  Code
  IDE
  iPython startup and test
  Datasets
  Car-mileage
  Northwind industries sales data
  Titanic passenger list
  State of the Union speeches by POTUS
  Movie lens Dataset
  The data scientist and Spark features
  Who is this data scientist DevOps person?
  The Data Lake architecture
  Data Hub
  Reporting Hub
  Analytics Hub
  Spark v2.0 and beyond
  Apache Spark - evolution
  Apache Spark - the full stack
  The art of a big data store - Parquet
  Column projection and data partition
  Compression
  Smart data storage and predicate pushdown
  Support for evolving schema
  Performance
  References
  Summary

Chapter 8: Spark SQL
  The Spark SQL architecture
  Spark SQL how-to in a nutshell
  Spark SQL with Spark 2.0
  Spark SQL programming
  Datasets/DataFrames
  SQL access to a simple data table
  Handling multiple tables with Spark SQL
  Aftermath
  References
  Summary

Chapter 9: Foundations of Datasets/DataFrames - The Proverbial Workhorse for DataScientists
  Datasets - a quick introduction
  Dataset APIs - an overview
  org.apache.spark.sql.SparkSession/pyspark.sql.SparkSession
  org.apache.spark.sql.Dataset/pyspark.sql.DataFrame
  org.apache.spark.sql.{Column,Row}/pyspark.sql.(Column,Row)
  org.apache.spark.sql.Column
  org.apache.spark.sql.Row
  org.apache.spark.sql.functions/pyspark.sql.functions
  Dataset interfaces and functions
  Read/write operations
  Aggregate functions
  Statistical functions
  Scientific functions
  Data wrangling with Datasets
  Reading data into the respective Datasets
  Aggregate and sort
  Date columns, totals, and aggregations
  The OrderTotal column
  Date operations
  Final aggregations for the answers we want
  References
  Summary

Chapter 10: Spark with Big Data
  Parquet - an efficient and interoperable big data format
  Saving files in the Parquet format
  Loading Parquet files
  Saving processed RDDs in the Parquet format
  HBase
  Loading from HBase
  Saving to HBase
  Other HBase operations
  Reference
  Summary

Chapter 11: Machine Learning with Spark ML Pipelines
  Spark's machine learning algorithm table
  Spark machine learning APIs - ML pipelines and MLlib
  ML pipelines
  Spark ML examples
  The API organization
  Basic statistics
  Loading data
  Computing statistics
  Linear regression
  Data transformation and feature extraction
  Data split
  Predictions using the model
  Model evaluation
  Classification
  Loading data
  Data transformation and feature extraction
  Data split
  The regression model
  Prediction using the model
  Model evaluation
  Clustering
  Loading data
  Data transformation and feature extraction
  Data split
  Predicting using the model
  Model evaluation and interpretation
  Clustering model interpretation
  Recommendation
  Loading data
  Data transformation and feature extraction
  Data splitting
  Predicting using the model
  Model evaluation and interpretation
  Hyper parameters
  The final thing
  References
  Summary

Chapter 12: GraphX
  Graphs and graph processing - an introduction
  Spark GraphX
  GraphX - computational model
  The first example - graph
  Building graphs
  The GraphX API landscape
  Structural APIs
  What's wrong with the output?
  Community, affiliation, and strengths
  Algorithms
  Graph parallel computation APIs
  The aggregateMessages() API
  The first example - the oldest follower
  The second example - the oldest followee
  The third example - the youngest follower/followee
  The fourth example - inDegree/outDegree
  Partition strategy
  Case study - AlphaGo tweets analytics
  Data pipeline
  GraphX modeling
  GraphX processing and algorithms
  References
  Summary

Index
