Snowflake Data Engineering
An introduction to the book "Snowflake Data Engineering" by Maja Ferle, published by Manning Publications Co. LLC in 2025. The book is available in PDF format, in English. "Snowflake Data Engineering" is listed under the category "Uncategorized".
A practical introduction to data engineering on the powerful Snowflake cloud data platform.

Data engineers create the pipelines that ingest raw data, transform it, and funnel it to the analysts and professionals who need it. The Snowflake cloud data platform provides a suite of productivity-focused tools and features that simplify building and maintaining data pipelines. In Snowflake Data Engineering, Snowflake Data Superhero Maja Ferle shows you how to get started.

In Snowflake Data Engineering you will learn how to:
• Ingest data into Snowflake from both cloud and local file systems
• Transform data using functions, stored procedures, and SQL
• Orchestrate data pipelines with streams and tasks, and monitor their execution
• Use Snowpark to run Python code in your pipelines
• Deploy Snowflake objects and code using continuous integration principles
• Optimize performance and costs when ingesting data into Snowflake

Snowflake Data Engineering reveals how Snowflake makes it easy to work with unstructured data, set up continuous ingestion with Snowpipe, and keep your data safe and secure with best-in-class data governance features. Along the way, you’ll practice the most important data engineering tasks as you work through relevant hands-on examples. Throughout, author Maja Ferle shares design tips drawn from her years of experience to ensure your pipeline follows the best practices of software engineering, security, and data governance.

Foreword by Joe Reis.

About the technology
Pipelines that ingest and transform raw data are the lifeblood of business analytics, and data engineers rely on Snowflake to help them deliver those pipelines efficiently. Snowflake is a full-service cloud-based platform that handles everything from near-infinite storage and fast elastic compute services to inbuilt AI/ML capabilities like vector search, text-to-SQL, code generation, and more. This book gives you what you need to create effective data pipelines on the Snowflake platform.
About the book
Snowflake Data Engineering guides you skill-by-skill through accomplishing on-the-job data engineering tasks using Snowflake. You’ll start by building your first simple pipeline and then expand it by adding increasingly powerful features, including data governance and security, adding CI/CD into your pipelines, and even augmenting data with generative AI. You’ll be amazed how far you can go in just a few short chapters!

What's inside
• Ingest data from the cloud, APIs, or Snowflake Marketplace
• Orchestrate data pipelines with streams and tasks
• Optimize performance and cost

About the reader
For software developers and data analysts. Readers should know the basics of SQL and the Cloud.

About the author
Maja Ferle is a Snowflake Subject Matter Expert and a Snowflake Data Superhero who holds the SnowPro Advanced Data Engineer and the SnowPro Advanced Data Analyst certifications.

Snowflake Data Engineering
brief contents
contents
foreword
preface
acknowledgments
about this book
  Who should read this book
  How this book is organized: A road map
  About the code
  liveBook discussion forum
  Other online resources
about the author
about the cover illustration

Part 1 Introducing data engineering with Snowflake

1 Data engineering with Snowflake
  1.1 Snowflake for data engineering
    1.1.1 Snowflake architecture
    1.1.2 Snowflake features for data engineering
  1.2 Responsibilities of a Snowflake data engineer
    1.2.1 Extracting data from source systems
    1.2.2 Performing data transformations
    1.2.3 Presenting data to downstream consumers
    1.2.4 Applying underlying components
  1.3 Building data pipelines
  1.4 Data engineering with Snowflake applications
  Summary

2 Creating your first data pipeline
  2.1 Setting up your Snowflake account
  2.2 Staging a CSV file
  2.3 Loading data from a staged file into a target table
    2.3.1 Loading data from a staged file into a staging table
    2.3.2 Merging data from the staging table into the target table
  2.4 Transforming data with SQL commands
  2.5 Automating the process with tasks
  Summary

Part 2 Ingesting, transforming, and storing data

3 Best practices for data staging
  3.1 Creating external stages
    3.1.1 Configuring a storage integration
    3.1.2 Creating an external stage using a storage integration
    3.1.3 Creating an external stage using credentials
    3.1.4 Loading data from staged files into a staging table
    3.1.5 Avoiding duplication when loading data from staged files
    3.1.6 Using a named file format
  3.2 Viewing stage metadata with directory tables
  3.3 Preparing data files for efficient ingestion
    3.3.1 File sizing recommendations
    3.3.2 Organizing data by path
  3.4 Building pipelines with external tables
    3.4.1 Querying data in external stages with external tables
    3.4.2 Using materialized views to improve query performance
  Summary

4 Transforming data
  4.1 Ingesting semistructured data from cloud storage
    4.1.1 Creating a storage integration
    4.1.2 Creating an external stage
    4.1.3 Examining the JSON structure
    4.1.4 Ingesting JSON data into a VARIANT data type
  4.2 Flattening semistructured data into relational tables
  4.3 Encapsulating transformations with stored procedures
    4.3.1 Creating a basic stored procedure
    4.3.2 Including a return value in a stored procedure
    4.3.3 Implementing exception handling in stored procedures
  4.4 Adding logging to stored procedures
  4.5 Building robust data pipelines
  Summary

5 Continuous data ingestion
  5.1 Comparing bulk and continuous data ingestion
  5.2 Preparing files in cloud storage
    5.2.1 Creating a storage integration
    5.2.2 Creating an external stage
  5.3 Configuring Snowpipe with cloud messaging
    5.3.1 Configuring event grid messages for blob storage events
    5.3.2 Creating a notification integration
    5.3.3 Creating a pipe object
    5.3.4 Ingesting data continuously
    5.3.5 Flattening the JSON structure to relational format
  5.4 Transforming data with dynamic tables
  Summary

6 Executing code natively with Snowpark
  6.1 Introducing Snowpark
  6.2 Creating a Snowpark procedure in a worksheet
  6.3 Using the SQL API from a local development environment
    6.3.1 Installing and configuring the local development environment
    6.3.2 Creating a Snowflake session
    6.3.3 Providing credentials in a configuration file
    6.3.4 Querying data and executing SQL commands
  6.4 Generating a date dimension in Snowpark Python
  6.5 Working with data frames
  6.6 Ingesting data from a CSV file into a Snowflake table
  6.7 Transforming data with data frames
  Summary

7 Augmenting data with outputs from large language models
  7.1 Configuring external network access
  7.2 Calling an API endpoint from a Snowpark function
    7.2.1 Constructing the UDF that retrieves customer reviews
    7.2.2 Interpreting the results from the UDF
    7.2.3 Storing the customer reviews in a table
  7.3 Deriving customer review sentiments
  7.4 Interpreting order emails using LLMs to save time
    7.4.1 Creating a stored procedure that interprets customer emails
    7.4.2 Constructing the prompt
    7.4.3 Saving the CSV result to a table
    7.4.4 Evaluating the output
  Summary

8 Optimizing query performance
  8.1 Getting data from the Snowflake Marketplace
  8.2 Performing analysis of geographical data
    8.2.1 Snowflake’s geography functions
    8.2.2 Copying data from the shared database
    8.2.3 Viewing query execution parameters using the query profile
  8.3 Understanding Snowflake micro-partitions
    8.3.1 A conceptual example of micro-partitions
    8.3.2 Micro-partition pruning
  8.4 Optimizing storage with clustering
    8.4.1 Viewing clustering information
    8.4.2 Adding clustering keys to a table
    8.4.3 Monitoring the clustering process
    8.4.4 Viewing improved query execution after clustering
  8.5 Improving query performance with search optimization
    8.5.1 Adding search optimization to a table
    8.5.2 Reviewing query performance after adding search optimization
  8.6 General tips for improving query performance
    8.6.1 Writing efficient SQL queries
    8.6.2 Identifying queries that are candidates for optimization
  Summary

9 Controlling costs
  9.1 Understanding Snowflake costs
    9.1.1 Total Snowflake cost
    9.1.2 Compute resources cost
    9.1.3 Virtual warehouse credits
  9.2 Sizing virtual warehouses
    9.2.1 Using persisted query results
    9.2.2 Comparing query statistics between differently sized warehouses
    9.2.3 Optimizing query performance to reduce spilling
  9.3 Optimizing performance with data caching
    9.3.1 Illustrating the metadata cache
    9.3.2 Utilizing the warehouse cache efficiently
  9.4 Reducing query queuing
    9.4.1 Examining queuing
    9.4.2 Limiting concurrently running queries
  9.5 Monitoring compute consumption
  Summary

10 Data governance and access control
  10.1 Role-based access control
    10.1.1 System-defined roles
    10.1.2 Custom roles
    10.1.3 Designing RBAC
  10.2 Securing data with row access policies
  10.3 Protecting sensitive data with masking policies
  Summary

Part 3 Building data pipelines

11 Designing data pipelines
  11.1 Designing data pipelines
    11.1.1 Extracting data
    11.1.2 Comparing data pipeline patterns
    11.1.3 Choosing data transformation layers
    11.1.4 Organizing data warehouse layers
    11.1.5 Creating schemas with access control
  11.2 Building a sample data pipeline
    11.2.1 Implementing the extraction layer
    11.2.2 Implementing the staging layer
    11.2.3 Implementing the data warehouse layer
    11.2.4 Implementing the reporting layer
  Summary

12 Ingesting data incrementally
  12.1 Comparing data ingestion approaches
    12.1.1 Full ingestion
    12.1.2 Incremental ingestion
  12.2 Preserving history with slowly changing dimensions
    12.2.1 SCD type 2
    12.2.2 Append-only strategy
    12.2.3 Designing idempotent data pipelines
  12.3 Detecting changes with Snowflake streams
    12.3.1 Ingesting files from cloud storage incrementally
    12.3.2 Preserving history when ingesting data incrementally
  12.4 Maintaining data with dynamic tables
    12.4.1 Deciding when to use dynamic tables
    12.4.2 Querying historical data
  Summary

13 Orchestrating data pipelines
  13.1 Orchestrating with Snowflake tasks
    13.1.1 Creating a schema to store the orchestration objects
    13.1.2 Designing the orchestration tasks
    13.1.3 Creating tasks with dependencies
  13.2 Sending email notifications
  13.3 Orchestrating with task graphs
    13.3.1 Designing the task graph
    13.3.2 Creating the root task
    13.3.3 Creating the finalizer task
    13.3.4 Viewing the task graph
  13.4 Monitoring data pipeline execution
    13.4.1 Adding logging functionality to tasks
    13.4.2 Summarizing logging information in an email notification
  13.5 Troubleshooting data pipeline failures
  Summary

14 Testing for data integrity and completeness
  14.1 Data testing methods
    14.1.1 Performing data testing as steps in the pipeline
    14.1.2 Performing data testing independently of the pipeline
  14.2 Incorporating data testing steps in the pipeline
    14.2.1 Constructing the partner data quality task
    14.2.2 Constructing the product data quality task
    14.2.3 Executing the pipeline with the data testing tasks
  14.3 Applying the Snowflake data metric functions
    14.3.1 System-defined data metric functions
    14.3.2 User-defined data metric functions
    14.3.3 Viewing data metric function details
  14.4 Alerting users when data metrics exceed thresholds
  14.5 Detecting data volume anomalies
    14.5.1 Generating random data
    14.5.2 Displaying data as a line chart in Snowsight
    14.5.3 Working with the anomaly detection model
  Summary

15 Data pipeline continuous integration
  15.1 Separating the data engineering environments
  15.2 Database change management
    15.2.1 Comparing the imperative and the declarative approach to DCM
    15.2.2 Organizing the code in the repository
  15.3 Configuring Snowflake to use Git
    15.3.1 Creating a Git repository stage
    15.3.2 Executing commands from a Git repository stage
  15.4 Using the Snowflake CLI command line interface
    15.4.1 Installing and configuring Snowflake CLI
    15.4.2 Executing scripts with Snowflake CLI
    15.4.3 Continuous integration with Snowflake CLI
  15.5 Connecting to Snowflake securely
    15.5.1 Configuring key-pair authentication
  15.6 Applying what we learned in real-world scenarios
  Summary

appendix A: Configuring your Snowflake environment
  A.1 Signing up for a Snowflake free trial account
  A.2 Installing and configuring Snowflake CLI

appendix B: Snowflake objects used in the examples
  B.1 Ingesting and transforming data from CSV files (chapter 2)
  B.2 Ingesting data from a cloud storage provider (chapter 3)
  B.3 Ingesting and flattening semi-structured data (chapter 4)
  B.4 Continuous data ingestion and dynamic tables (chapter 5)
  B.5 Executing code natively with Snowpark (chapter 6)
  B.6 Calling API endpoints and LLM functions (chapter 7)
  B.7 Optimizing query performance and controlling cost (chapters 8 and 9)
  B.8 Data governance and access control (chapter 10)
  B.9 Designing data pipelines (chapter 11)
  B.10 Ingesting data incrementally (chapter 12)
  B.11 Orchestrating data pipelines (chapter 13)
  B.12 Testing for data integrity and completeness (chapter 14)
  B.13 Data pipeline continuous integration (chapter 15)

index
In Snowflake Data Engineering you will learn how to:
• Ingest data into Snowflake from both cloud and local file systems
• Transform data using functions, stored procedures, and SQL
• Orchestrate data pipelines with streams and tasks, and monitor their execution
• Use Snowpark to run Python, Java, and Scala code in your pipelines
• Deploy Snowflake objects and code using continuous integration principles
• Optimize performance and costs when ingesting data into Snowflake

With this practical guide you’ll build the skills you need to create effective data pipelines on the Snowflake platform. You’ll see how Snowflake makes it easy to work with unstructured data, set up continuous ingestion with Snowpipe, and keep your data safe and secure with best-in-class data governance features. Along the way, you’ll practice the most important data engineering tasks as you work through relevant hands-on examples.

Purchase of the print book includes a free eBook in PDF and ePub formats from Manning Publications.

About the book
Snowflake Data Engineering teaches data engineering skills using the day-to-day tasks you’ll face on the job. You’ll start working hands-on right from chapter two by building your very first simple pipeline on the Snowflake platform. Then, you’ll improve your pipeline with increasingly complex elements, including performance optimization and augmenting your data with generative AI.