Minimalist Data Wrangling with Python

About the book "Minimalist Data Wrangling with Python", written by Marek Gagolewski and published by Marek Gagolewski. The book is offered in PDF format, in English. "Minimalist Data Wrangling with Python" is listed in the Uncategorized category.

Minimalist Data Wrangling with Python is envisaged as a student's first introduction to data science, providing a high-level overview as well as discussing key concepts in detail. We explore methods for cleaning data gathered from different sources, transforming, selecting, and extracting features, performing exploratory data analysis and dimensionality reduction, identifying naturally occurring data clusters, modelling patterns in data, comparing data between groups, and reporting the results. This textbook is a non-profit project. Its online and PDF versions are freely available at https://datawranglingpy.gagolewski.com/. This is version 1.0.3 of the book (last updated 2023-02-06).

Table of contents:

Preface: The art of data wrangling; Aims, scope, and design philosophy; We need maths; We need some computing environment; We need data and domain knowledge; Structure; The Rules; About the author; Acknowledgements; You can make this book better

I. Introducing Python

Getting started with Python: Installing Python; Working with Jupyter notebooks; Launching JupyterLab; First notebook; More cells; Edit vs command mode; Markdown cells; The best note-taking app; Initialising each session and getting example data; Exercises

Scalar types and control structures in Python: Scalar types; Logical values; Numeric values; Arithmetic operators; Creating named variables; Character strings; F-strings (formatted string literals); Calling built-in functions; Positional and keyword arguments; Modules and packages; Slots and methods; Controlling program flow; Relational and logical operators; The if statement; The while loop; Defining functions; Exercises

Sequential and other types in Python: Sequential types; Lists; Tuples; Ranges; Strings (again); Working with sequences; Extracting elements; Slicing; Modifying elements of mutable sequences; Searching for specific elements; Arithmetic operators; Dictionaries; Iterable types; The for loop; Tuple assignment; Argument unpacking (*); Variadic arguments: *args and **kwargs (*); Object references and copying (*); Copying references; Pass by assignment; Object copies; Modify in place or return a modified copy?; Further reading; Exercises

II. Unidimensional data

Unidimensional numeric data and their empirical distribution: Creating vectors in numpy; Enumerating elements; Arithmetic progressions; Repeating values; numpy.r_ (*); Generating pseudorandom variates; Loading data from files; Some mathematical notation; Inspecting the data distribution with histograms; heights: A bell-shaped distribution; income: A right-skewed distribution; How many bins?; peds: A bimodal distribution (already binned); matura: A bell-shaped distribution (almost); marathon (truncated – fastest runners): A left-skewed distribution; Log-scale and heavy-tailed distributions; Cumulative probabilities and the empirical cumulative distribution function; Exercises

Processing unidimensional data: Aggregating numeric data; Measures of location; Arithmetic mean and median; Sensitive to outliers vs robust; Sample quantiles; Measures of dispersion; Standard deviation (and variance); Interquartile range; Measures of shape; Box (and whisker) plots; Other aggregation methods (*); Vectorised mathematical functions; Logarithms and exponential functions; Trigonometric functions; Arithmetic operators; Vector-scalar case; Application: Feature scaling; Standardisation and z-scores; Min-max scaling and clipping; Normalisation (l2; dividing by magnitude); Normalisation (l1; dividing by sum); Vector-vector case; Indexing vectors; Integer indexing; Logical indexing; Slicing; Other operations; Cumulative sums and iterated differences; Sorting; Dealing with tied observations; Determining the ordering permutation and ranking; Searching for certain indexes (argmin, argmax); Dealing with round-off and measurement errors; Vectorising scalar operations with list comprehensions; Exercises

Continuous probability distributions: Normal distribution; Estimating parameters; Data models are useful; Assessing goodness-of-fit; Comparing cumulative distribution functions; Comparing quantiles; Kolmogorov–Smirnov test (*); Other noteworthy distributions; Log-normal distribution; Pareto distribution; Uniform distribution; Distribution mixtures (*); Generating pseudorandom numbers; Uniform distribution; Not exactly random; Sampling from other distributions; Natural variability; Adding jitter (white noise); Independence assumption; Further reading; Exercises

III. Multidimensional data

From uni- to multidimensional numeric data: Creating matrices; Reading CSV files; Enumerating elements; Repeating arrays; Stacking arrays; Other functions; Reshaping matrices; Mathematical notation; Transpose; Row and column vectors; Identity and other diagonal matrices; Visualising multidimensional data; 2D Data; 3D data and beyond; Scatter plot matrix (pairs plot); Exercises

Processing multidimensional data: Extending vectorised operations to matrices; Vectorised mathematical functions; Componentwise aggregation; Arithmetic, logical, and relational operations; Matrix vs scalar; Matrix vs matrix; Matrix vs any vector; Row vector vs column vector (*); Other row and column transforms (*); Indexing matrices; Slice-based indexing; Scalar-based indexing; Mixed logical/integer vector and scalar/slice indexers; Two vectors as indexers (*); Views of existing arrays (*); Adding and modifying rows and columns; Matrix multiplication, dot products, and Euclidean norm (*); Pairwise distances and related methods (*); Euclidean metric (*); Centroids (*); Multidimensional dispersion and other aggregates (**); Fixed-radius and k-nearest neighbour search (**); Spatial search with K-d trees (**); Exercises

Exploring relationships between variables: Measuring correlation; Pearson linear correlation coefficient; Perfect linear correlation; Strong linear correlation; No linear correlation does not imply independence; False linear correlations; Correlation is not causation; Correlation heat map; Linear correlation coefficients on transformed data; Spearman rank correlation coefficient; Regression tasks (*); K-nearest neighbour regression (*); From data to (linear) models (*); Least squares method (*); Analysis of residuals (*); Multiple regression (*); Variable transformation and linearisable models (**); Descriptive vs predictive power (**); Fitting regression models with scikit-learn (*); Ill-conditioned model matrices (**); Finding interesting combinations of variables (*); Dot products, angles, collinearity, and orthogonality (*); Geometric transformations of points (*); Matrix inverse (*); Singular value decomposition (*); Dimensionality reduction with SVD (*); Principal component analysis (*); Further reading; Exercises

IV. Heterogeneous data

Introducing data frames: Creating data frames; Data frames are matrix-like; Series; Index; Aggregating data frames; Transforming data frames; Indexing Series objects; Do not use [...] directly (in the current version of pandas); loc[...]; iloc[...]; Logical indexing; Indexing data frames; loc[...] and iloc[...]; Adding rows and columns; Modifying items; Pseudorandom sampling and splitting; Hierarchical indexes (*); Further operations on data frames; Sorting; Stacking and unstacking (long/tall and wide forms); Joining (merging); Set-theoretic operations and removing duplicates; ...and (too) many more; Exercises

Handling categorical data: Representing and generating categorical data; Encoding and decoding factors; Binary data as logical and probability vectors; One-hot encoding (*); Binning numeric data (revisited); Generating pseudorandom labels; Frequency distributions; Counting; Two-way contingency tables: Factor combinations; Combinations of even more factors; Visualising factors; Bar plots; Political marketing and statistics; Pareto charts (*); Heat maps; Aggregating and comparing factors; Mode; Binary data as logical vectors; Pearson chi-squared test (*); Two-sample Pearson chi-squared test (*); Measuring association (*); Binned numeric data; Ordinal data (*); Exercises

Processing data in groups: Basic methods; Aggregating data in groups; Transforming data in groups; Manual splitting into subgroups (*); Plotting data in groups; Series of box plots; Series of bar plots; Semitransparent histograms; Scatter plots with group information; Grid (trellis) plots; Kolmogorov–Smirnov test for comparing ECDFs (*); Comparing quantiles; Classification tasks (*); K-nearest neighbour classification (*); Assessing prediction quality (*); Splitting into training and test sets (*); Validating many models (parameter selection) (**); Clustering tasks (*); K-means method (*); Solving k-means is hard (*); Lloyd algorithm (*); Local minima (*); Random restarts (*); Further reading; Exercises

Accessing databases: Example database; Exporting data to a database; Exercises on SQL vs pandas; Filtering; Ordering; Removing duplicates; Grouping and aggregating; Joining; Solutions to exercises; Closing the database connection; Common data serialisation formats for the Web; Working with many files; File paths; File search; Exception handling; File connections (*); Further reading; Exercises

V. Other data types

Text data: Basic string operations; Unicode as the universal encoding; Normalising strings; Substring searching and replacing; Locale-aware services in ICU (*); String operations in pandas; String operations in numpy (*); Working with string lists; Formatted outputs for reproducible report generation; Formatting strings; str and repr; Aligning strings; Direct Markdown output in Jupyter; Manual Markdown file output (*); Regular expressions (*); Regex matching with re (*); Regex matching with pandas (*); Matching individual characters (*); Matching anything (almost) (*); Defining character sets (*); Complementing sets (*); Defining code point ranges (*); Using predefined character sets (*); Alternating and grouping subexpressions (*); Alternation operator (*); Grouping subexpressions (*); Non-grouping parentheses (*); Quantifiers (*); Capture groups and references thereto (**); Extracting capture group matches (**); Replacing with capture group matches (**); Back-referencing (**); Anchoring (*); Matching at the beginning or end of a string (*); Matching at word boundaries (*); Looking behind and ahead (**); Exercises

Missing, censored, and questionable data: Missing data; Representing and detecting missing values; Computing with missing values; Missing at random or not?; Discarding missing values; Mean imputation; Imputation by classification and regression (*); Censored and interval data (*); Incorrect data; Outliers; The 3/2 IQR rule for normally-distributed data; Unidimensional density estimation (*); Multidimensional density estimation (*); Exercises

Time series: Temporal ordering and line charts; Working with date-times and time-deltas; Representation: The UNIX epoch; Time differences; Date-times in data frames; Basic operations; Iterated differences and cumulative sums revisited; Smoothing with moving averages; Detecting trends and seasonal patterns; Imputing missing values; Plotting multidimensional time series; Candlestick plots (*); Further reading; Exercises

Changelog

References

Make sure to check out the free online and PDF versions before making an order.
Note that 0% (zero) profit goes to the author and that the print price is set by Amazon (the author requested that it be as low as possible, which is probably why it is not available on the US marketplace; check out the other ones).
