Fast Data Processing with Spark - Second Edition

About the book "Fast Data Processing with Spark - Second Edition" by Krishna Sankar and Holden Karau, published by Packt Publishing - ebooks Account in 2015. The book is offered in 5 pages, in pdf format, in English. "Fast Data Processing with Spark - Second Edition" is listed in the Uncategorized category.

Author: Krishna Sankar and Holden Karau
Publisher: Packt Publishing - ebooks Account
Year of publication: 2015
Number of pages: 5
Format: pdf
Language: en
ISBN: 9781306269971
Category: Uncategorized

Perform real-time analytics using Spark in a fast, distributed, and scalable way

In Detail

Spark is a framework used for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does, but with a fast in-memory approach and a clean functional style API. With its ability to integrate with Hadoop and built-in tools for interactive query analysis (Spark SQL), large-scale graph processing and analysis (GraphX), and real-time analysis (Spark Streaming), it can be interactively used to quickly process and query big datasets.

Fast Data Processing with Spark - Second Edition covers how to write distributed programs with Spark. The book will guide you through every step required to write effective distributed programs from setting up your cluster and interactively exploring the API to developing analytics applications and tuning them for your purposes.

What You Will Learn

Install and set up Spark on your cluster
Prototype distributed applications with Spark's interactive shell
Learn different ways to interact with Spark's distributed representation of data (RDDs)
Query Spark with a SQL-like query syntax
Effectively test your distributed software
Recognize how Spark works with big data
Implement machine learning systems with highly scalable algorithms

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Cover; Copyright; Credits; About the Authors; About the Reviewers; www.PacktPub.com; Table of Contents; Preface
Chapter 1: Installing Spark and Setting up your Cluster; Directory organization and convention; Installing prebuilt distribution; Building Spark from source; Download source; Compile source with Maven; Compilation switches; Testing the installation; Spark topology; Single machine; Running Spark on EC2; Running Spark on EC2 with the scripts; Deploying Spark on Elastic MapReduce; Deploying Spark with Chef (opscode); Deploying Spark on Mesos; Spark on YARN; Spark Standalone mode; Summary
Chapter 2: Using the Spark Shell; Loading a simple text file; Using the Spark shell to run Logistic regression; Interactively Loading data from S3; Running Spark shell in Python; Summary
Chapter 3: Building and Running a Spark Application; Building your Spark project with sbt; Building your Spark job with Maven; Building your Spark job with something else; Summary
Chapter 4: Creating a SparkContext; Scala; Java; SparkContext – metadata; Shared Java and Scala APIs; Python; Summary
Chapter 5: Loading and Saving Data in Spark; RDDs; Loading data into an RDD; Saving your data; Summary
Chapter 6: Manipulating your RDD; Manipulating your RDD in Scala and Java; Scala RDD functions; Functions for joining Pair RDDs; Other PairRDD functions; Double RDD functions; General RDD functions; Java RDD functions; Spark Java function classes; Common Java RDD functions; Methods for combining JavaRDDs; Functions on JavaPairRDDs; Manipulating your RDD in Python; Standard RDD functions; Pair RDD functions; Summary
Chapter 7: Spark SQL; Spark SQL architecture; Spark SQL how-to in a nutshell; Spark SQL programming; SQL access to a simple data table; Handling multiple tables with Spark SQL; Aftermath; Summary
Chapter 8: Spark with Big Data; Parquet – an efficient and interoperable big data format; Saving files to the Parquet format; Loading Parquet files; Saving processed RDD in the Parquet format; Querying Parquet files with Impala; HBase; Loading from HBase; Saving to HBase; Other HBase operations; Summary
Chapter 9: Machine Learning Using Spark MLlib; Spark machine learning algorithm table; Spark MLlib examples; Basic statistics; Linear regression; Classification; Clustering; Recommendation; Summary
Chapter 10: Testing; Testing in Java and Scala; Making your code testable; Testing interactions with SparkContext; Testing in Python; Summary
Chapter 11: Tips and Tricks; Where to find Logs; Concurrency limitations; Memory usage and garbage collection; Serialization; IDE integration; Using Spark with other languages; A quick note on security; Community developed packages; Mailing lists; Summary
Index
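
The description above credits Spark with a "clean functional style API". As a rough illustration of that style only, here is a word count written as a chain of flat-map, map, and reduce-by-key steps in plain Python. This is a single-machine sketch, not Spark code: the function name `word_count` and the sample lines are assumptions made for the example, and nothing here is distributed.

```python
from functools import reduce


def word_count(lines):
    """Count words using flat-map, map, and reduce-by-key style steps."""
    # Flat-map step: turn every line into a stream of individual words.
    words = (word for line in lines for word in line.split())
    # Map step: pair each word with an initial count of 1.
    pairs = ((word, 1) for word in words)

    # Reduce-by-key step: add up the counts that share the same word.
    def merge(counts, pair):
        word, n = pair
        counts[word] = counts.get(word, 0) + n
        return counts

    return reduce(merge, pairs, {})


# Hypothetical sample input, chosen only for illustration.
counts = word_count(["spark is fast", "spark is distributed"])
print(counts)
```

In a distributed framework the same three steps would run over partitions of the data on many machines; the sketch only shows the shape of the computation.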

In Detail

Starling makes it very easy for an ActionScript developer to create cross-platform, multiplayer games. Starling uses the GPU to render all content, which gives excellent performance on a wide range of devices. Multiplayer games have become a very lucrative market, pulling in more and more developers who try to raise the bar for user experience. With the ever-increasing popularity of iOS and Android, the demand for cross-platform games has increased exponentially.

Starling Game Development Essentials takes you step-by-step through the development of a complicated Isometric game. You will learn to create a level editor, AI logic for enemies, and integrate particle effects. Furthermore, you will learn to develop multi-player games that can support multiple players on the same device and would integrate Flox services for efficient user tracking and analytics. Finally, you will understand how to deploy your game to the Web, App Store, and Google Play.

This project-based book starts with the game idea, and an introduction to Game States and Game Loop. You also learn about the working of Isometric projection logic.

You get to explore RenderTexture for dynamically creating game levels and later on easily upgrade to the exceptional QuadBatch for deploying on devices. You will then move on to use Starling Particle extension for explosion effects. Finally, you will develop a simple AI Manager to help the enemy make decisions and use Pathfinder to facilitate grid-based path finding.

Starling Game Development Essentials, with the help of FlagDefense game source code, is an invaluable asset to anyone who wants to create a Starling cross-platform game.

Approach

This is a practical, project-based guide that will help the reader to build Isometric, turn-based games using Starling.

Who this book is for

If you are an ActionScript developer and want to create cross-platform games with Starling, this book is for you. The FlagDefense game covers some complex topics in game development which are beneficial even for those who are already creating games with Starling. Prior knowledge of Starling will help, but is not necessary.

In Detail

Parse using iOS SDK is a new technology, and is the first of its kind in the field of mobile application development. It provides you with a cloud where you can keep your data, host your code, and even host your website without any hassle. It provides an SDK so that you can access your data through your mobile and web applications.

This practical, hands-on guide will help you to instantly get started with Parse iOS. It is packed with step-by-step exercises, which will help you to take advantage of the real power of the Parse iOS cloud backend service, and provides you with an example-based approach to help you build applications using Parse iOS.

Starting with Parse iOS installation, we will move on to integration, and finally, this guide will end with the development of an application using Parse iOS. You will also learn about securing your application data by specifying ACL and Roles for your data objects. We will also learn about configuration in detail, and the implementation of cloud code to make your application lighter on the client side. You can take advantage of iCloud by hosting your website as well.

You will learn everything that you need to know to develop your application using Parse iOS as a backend.

Approach

A practical guide, featuring step-by-step instructions showing you how to use Parse iOS and handle your data in the cloud.

Who this book is for

If you are a developer who wants to build applications quickly using Parse iOS as a back end, this book is ideal for you. It will help you to understand Parse, with examples that help you get familiar with the concepts of Parse iOS.

89 hands-on recipes to help you complete real-world data science projects in R and Python

- Learn about the data science pipeline and use it to acquire, clean, analyze, and visualize data
- Understand critical concepts in data science in the context of multiple projects
- Expand your numerical programming skills through step-by-step code examples and learn more about the robust features of R and Python

As increasing amounts of data are generated each year, the need to analyze and operationalize them is more important than ever. Companies that know what to do with their data will have a competitive advantage over companies that don't, and this will drive a higher demand for knowledgeable and competent data professionals.

Starting with the basics, this book will cover how to set up your numerical programming environment, introduce you to the data science pipeline (an iterative process by which data science projects are completed), and guide you through several data projects in a step-by-step format. By sequentially working through the steps in each chapter, you will quickly familiarize yourself with the process and learn how to apply it to a variety of situations with examples in the two most popular programming languages for data analysis: R and Python.

- A practical tutorial with real-world use cases allowing you to develop your own machine learning systems with Spark
- Combine various techniques and models into an intelligent machine learning system
- Use Spark's powerful tools to load, analyze, clean, and transform your data

Apache Spark is a framework for distributed computing that is designed from the ground up to be optimized for low latency tasks and in-memory data storage. It is one of the few frameworks for parallel computing that combines speed, scalability, in-memory processing, and fault tolerance with ease of programming and a flexible, expressive, and powerful API design.

This book guides you through the basics of Spark's API used to load and process data and prepare the data to use as input to the various machine learning models. There are detailed examples and real-world use cases for you to explore common machine learning models including recommender systems, classification, regression, clustering, and dimensionality reduction. You will cover advanced topics such as working with large-scale text data, and methods for online machine learning and model evaluation using Spark Streaming.

About This Book

Acquire the practical skills required to develop applications in Raspbian
Interact with the Raspbian operating system via its console
Explore the Raspbian GUI and the bundled console applications with this easy-to-follow guide

Who This Book Is For

This book is intended for developers who have worked with the Raspberry Pi and who want to learn how to make the most of the Raspbian operating system and their Raspberry Pi. Whether you are a beginner to the Raspberry Pi or a seasoned expert, this book will make you familiar with the Raspbian operating system and teach you how to get your Raspberry Pi up and running.

Cover; Copyright; Credits; About the Author; Acknowledgments; About the Reviewers; www.PacktPub.com; Table of Contents; Preface
Chapter 1: Getting Up and Running with Spark; Installing and setting up Spark locally; Spark clusters; The Spark programming model; SparkContext and SparkConf; The Spark shell; Resilient Distributed Datasets; Creating RDDs; Spark operations; Caching RDDs; Broadcast variables and accumulators; The first step to a Spark program in Scala; The first step to a Spark program in Java; The first step to a Spark program in Python; Getting Spark running on Amazon EC2; Launching an EC2 Spark cluster; Summary
Chapter 2: Designing a Machine Learning System; Introducing MovieStream; Business use cases for a machine learning system; Personalization; Targeted marketing and customer segmentation; Predictive modeling and analytics; Types of machine learning models; The components of a data-driven machine learning system; Data ingestion and storage; Data cleansing and transformation; Model training and testing loop; Model deployment and integration; Model monitoring and feedback; Batch versus real time; An architecture for a machine learning system; Practical exercise; Summary
Chapter 3: Obtaining, Processing, and Preparing Data with Spark; Accessing publicly available datasets; The MovieLens 100k dataset; Exploring and visualizing your data; Exploring the user dataset; Exploring the movie dataset; Exploring the rating dataset; Processing and transforming your data; Filling in bad or missing data; Extracting useful features from your data; Numerical features; Categorical features; Derived features; Transforming timestamps into categorical features; Text features; Simple text feature extraction; Normalizing features; Using MLlib for feature normalization; Using packages for feature extraction; Summary
Chapter 4: Building a Recommendation Engine with Spark; Types of recommendation models; Content-based filtering; Collaborative filtering; Matrix factorization; Extracting the right features from your data; Extracting features from the MovieLens 100k dataset; Training the recommendation model; Training a model on the MovieLens 100k dataset; Training a model using implicit feedback data; Using the recommendation model; User recommendations; Generating movie recommendations from the MovieLens 100k dataset; Item recommendations; Generating similar movies for the MovieLens 100K dataset; Evaluating the performance of recommendation models; Mean Squared Error; Mean average precision at K; Using MLlib's built-in evaluation functions; RMSE and MSE; MAP; Summary
Chapter 5: Building a Classification Model with Spark; Types of classification models; Linear models; Logistic regression; Linear support vector machines; The naïve Bayes model; Decision trees; Extracting the right features from your data; Extracting features from the Kaggle/StumbleUpon evergreen classification dataset

If you are an aspiring data scientist who wants to learn data science and numerical programming concepts through hands-on, real-world project examples, this is the book for you. Whether you are brand new to data science or you are a seasoned expert, you will benefit from learning about the structure of data science projects, the steps in the data science pipeline, and the programming examples presented in this book. Since the book is formatted to walk you through the projects with examples and explanations along the way, no prior programming experience is required.

Fast Data Processing with Spark - Second Edition is for software developers who want to learn how to write distributed programs with Spark. It will help developers who have had problems that were too big to be dealt with on a single computer. No previous experience with distributed programming is necessary. This book assumes knowledge of either Java, Scala, or Python.
If you are a Scala, Java, or Python developer with an interest in machine learning and data analysis and are eager to learn how to apply common machine learning techniques at scale using the Spark framework, this is the book for you. While it may be useful to have a basic understanding of Spark, no previous experience is required.
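
The recommendation chapter listed above names Mean Squared Error and RMSE among its evaluation metrics. For readers unfamiliar with them, the following is a minimal plain-Python sketch of the standard textbook definitions. It does not use Spark or MLlib and is not taken from the book; the function names and the sample ratings are assumptions made for illustration.

```python
import math


def mean_squared_error(predicted, actual):
    """Average of the squared differences between predictions and true values."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sum(squared_errors) / len(squared_errors)


def root_mean_squared_error(predicted, actual):
    """Square root of the MSE, expressed in the same units as the ratings."""
    return math.sqrt(mean_squared_error(predicted, actual))


# Hypothetical predicted and true ratings, chosen only for illustration.
predicted_ratings = [3.0, 5.0, 2.0, 4.0]
actual_ratings = [4.0, 5.0, 4.0, 3.0]

mse = mean_squared_error(predicted_ratings, actual_ratings)
rmse = root_mean_squared_error(predicted_ratings, actual_ratings)
print(mse, rmse)
```

Because the errors are squared before averaging, a single large miss weighs more heavily than several small ones, which is one reason RMSE is often reported alongside MSE.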

Download the book Fast Data Processing with Spark - Second Edition