Data Engineering with Scala and Spark: Build streaming and batch pipelines that process massive amounts of data using Scala
Z**U
Insightful exploration of modern data engineering concepts and practices with Apache Spark
"Data Engineering with Scala and Spark" is a fantastic survey of the key concepts and practices in modern data engineering with Apache Spark and data lake architectures. I'm a data professional in the software industry and have been working with Apache Spark for close to a decade, since before cloud data lakes and platforms like Databricks became mainstream. This book does a great job of establishing the foundational concepts with Scala and Spark in its first few chapters, which gives the reader the necessary tools to experiment and extend their knowledge. The progression of the book is easy to follow, moving on to advanced transformations, data quality, and finally best-practice data engineering patterns. I very much respect its coverage of Spark with the Scala language, as it continues to be the native programming language of Spark itself, and one that has the deepest level of integration and best performance characteristics when it comes to data engineering.
One concept I really appreciate in this book is its coverage, albeit somewhat brief, of Test Driven Development and CI/CD. The data engineering industry, in my opinion, has yet to fully adopt the degree of rigor and engineering discipline that is now pervasive in general software engineering, in both backend and frontend settings. As a result, data pipelines of any real complexity in large organizations eventually become very brittle, difficult to manage, and costly to operate. This book plants a great seed in the minds of its readers: unit and integration testing of data pipelines via CI/CD is a best practice for data engineering and a necessary knowledge area for data engineers in our current environment. I would have loved to see some concrete samples of full integration tests that test the logic of Spark transformations, which is an essential practice that typical Spark engineers lack familiarity with.
In the concluding parts of the book, the author covers orchestration, performance tuning, and end-to-end pipelines for both batch and streaming modalities. These are deep and advanced concepts, and full books could certainly be written on each of these topics alone. I like the broad coverage of several orchestration frameworks, which gives readers an unbiased perspective on how tools like Airflow, Databricks Workflows, and ADF can be used with Spark. I also support the judicious coverage of some of the key concepts in Spark performance tuning, including data skew, partitioning, and right-sizing compute, which are generally the most important concepts to understand when tuning pipelines.
Overall, I recommend this book for readers seeking a deeper understanding of what data engineering is about and how best to achieve it with Apache Spark, in addition to the current set of companion platforms and tooling in the data engineering ecosystem. The reader should expect to be able to construct and support cloud-based or local data pipelines from various source modalities with Apache Spark in an end-to-end fashion, which I think makes this book a worthwhile journey.
L**I
Comprehensive guide for a beginner
"Data Engineering with Scala and Spark" offers a comprehensive guide to navigating the complexities of Apache Spark and modern data engineering practices. From fundamental concepts to advanced optimization techniques, each chapter provides clear explanations and practical insights for building efficient data pipelines. With a focus on real-world applications and best practices, this book is essential reading for data engineers and professionals seeking to harness the full potential of Apache Spark in their projects.
F**O
Too basic
This is a book for a newbie. If you have experience you won’t learn much from it.
O**S
A Journey to Enhance Data Engineering Skills
In "Data Engineering with Scala and Spark," you'll embark on a journey to enhance your data engineering skills using Scala and functional programming techniques. The book focuses on creating continuous and scheduled pipelines for data ingestion, transformation, and aggregation.
Key Features:
Use Scala to transform data reliably.
Learn to build streaming and batch-processing pipelines with clear explanations.
Implement CI/CD best practices and test-driven development (TDD).
The book covers essential topics like setting up development environments, working with Spark APIs (DataFrame, Dataset, and Spark SQL), data profiling, quality assurance, and pipeline orchestration. It also includes insights into performance tuning and best practices for building robust data pipelines.
H**N
Good book
A good resource for anyone looking to master Scala, Spark, and cloud computing for data engineering. The book covers essential concepts and best practices. It guides readers through setting up environments, developing pipelines, and applying test-driven development and CI/CD, and it also covers advanced topics like data transformation, quality checks, and performance tuning with practical examples. Overall, it's a highly valuable resource for anyone aspiring to excel in data engineering.
G**S
Good for beginners
This book gives the basics. If you have some experience with data engineering, you will probably learn nothing new.