Practical Data Science with Hadoop® and Spark: Designing and Building Effective Analytics at Scale
Book description
The Complete Guide to Data Science with Hadoop—For Technical Professionals, Businesspeople, and Students
Demand is soaring for professionals who can solve real data science problems with Hadoop and Spark. Practical Data Science with Hadoop® and Spark is your complete guide to doing just that. Drawing on immense experience with Hadoop and big data, three leading experts bring together everything you need: high-level concepts, deep-dive techniques, real-world use cases, practical applications, and hands-on tutorials.
The authors introduce the essentials of data science and the modern Hadoop ecosystem, explaining how Hadoop and Spark have evolved into an effective platform for solving data science problems at scale. In addition to comprehensive application coverage, the authors provide useful guidance on the important steps of data ingestion, data munging, and visualization.
Once the groundwork is in place, the authors focus on specific applications, including machine learning, predictive modeling for sentiment analysis, clustering for document analysis, anomaly detection, and natural language processing (NLP).
This guide provides a strong technical foundation for those who want to do practical data science, and also presents business-driven guidance on how to apply Hadoop and Spark to optimize ROI of data science initiatives.
Table of contents
- About This E-Book
- Title Page
- Copyright Page
- Contents
- Foreword
- Preface
- Focus of the Book
- Who Should Read This Book
- How to Use This Book
- Book Conventions
- Accompanying Code
- 1. Introduction to Data Science
- What Is Data Science?
- Example: Search Advertising
- A Bit of Data Science History
- Statistics and Machine Learning
- Innovation from Internet Giants
- Data Science in the Modern Enterprise
- The Data Engineer
- The Applied Scientist
- Transitioning to a Data Scientist Role
- Soft Skills of a Data Scientist
- Ask the Right Question
- Data Acquisition
- Data Cleaning: Taking Care of Data Quality
- Explore the Data and Design Model Features
- Building and Tuning the Model
- Deploy to Production
- Big Data—A Driver of Change
- Volume: More Data Is Now Available
- Variety: More Data Types
- Velocity: Fast Data Ingest
- Product Recommendation
- Customer Churn Analysis
- Customer Segmentation
- Sales Leads Prioritization
- Sentiment Analysis
- Fraud Detection
- Predictive Maintenance
- Market Basket Analysis
- Predictive Medical Diagnosis
- Predicting Patient Re-admission
- Detecting Anomalous Record Access
- Insurance Risk Analysis
- Predicting Oil and Gas Well Production Levels
- What Is Hadoop?
- Distributed File System
- Resource Manager and Scheduler
- Distributed Data Processing Frameworks
- Apache Sqoop
- Apache Flume
- Apache Hive
- Apache Pig
- Apache Spark
- R
- Python
- Java Machine Learning Packages
- Cost-Effective Storage
- Schema on Read
- Unstructured and Semi-Structured Data
- Multi-Language Tooling
- Robust Scheduling and Resource Management
- Levels of Distributed Systems Abstractions
- Scalable Creation of Models
- Scalable Application of Models
- 4. Getting Data into Hadoop
- Hadoop as a Data Lake
- The Hadoop Distributed File System (HDFS)
- Direct File Transfer to Hadoop HDFS
- Importing Data from Files into Hive Tables
- Import CSV Files into Hive Tables
- Import CSV Files into Hive Using Spark
- Import a JSON File into Hive Using Spark
- Data Import and Export with Sqoop
- Apache Sqoop Version Changes
- Using Sqoop V2: A Basic Example
- Using Flume: A Web Log Example Overview
- Why Hadoop for Data Munging?
- Data Quality
- What Is Data Quality?
- Dealing with Data Quality Issues
- Using Hadoop for Data Quality
- Choosing the “Right” Features
- Sampling: Choosing Instances
- Generating Features
- Text Features
- Time-Series Features
- Features from Complex Data Types
- Feature Manipulation
- Dimensionality Reduction
- Why Visualize Data?
- Motivating Example: Visualizing Network Throughput
- Visualizing the Breakthrough That Never Happened
- Comparison Charts
- Composition Charts
- Distribution Charts
- Relationship Charts
- R
- Python: Matplotlib, Seaborn, and Others
- SAS
- MATLAB
- Julia
- Other Visualization Tools
- 7. Machine Learning with Hadoop
- Overview of Machine Learning
- Terminology
- Task Types in Machine Learning
- Big Data and Machine Learning
- Tools for Machine Learning
- The Future of Machine Learning and Artificial Intelligence
- Summary
- Overview of Predictive Modeling
- Classification Versus Regression
- Evaluating Predictive Models
- Evaluating Classifiers
- Evaluating Regression Models
- Cross Validation
- Model Training
- Batch Prediction
- Real-Time Prediction
- Tweets Dataset
- Data Preparation
- Feature Generation
- Building a Classifier
- Overview of Clustering
- Uses of Clustering
- Designing a Similarity Measure
- Distance Functions
- Similarity Functions
- k-means Clustering
- Latent Dirichlet Allocation
- Data Ingestion
- Feature Generation
- Running Latent Dirichlet Allocation
- Overview
- Uses of Anomaly Detection
- Types of Anomalies in Data
- Approaches to Anomaly Detection
- Rule-Based Methods
- Supervised Learning Methods
- Unsupervised Learning Methods
- Semi-Supervised Learning Methods
- Data Ingestion
- Building a Classifier
- Evaluating Performance
- Natural Language Processing
- Historical Approaches
- NLP Use Cases
- Text Segmentation
- Part-of-Speech Tagging
- Named Entity Recognition
- Sentiment Analysis
- Topic Modeling
- Small-Model NLP
- Big-Model NLP
- Bag-of-Words
- Word2vec
- Stanford CoreNLP
- Using Spark for Sentiment Analysis
- Automated Data Discovery
- Deep Learning
- Summary
- Quick Command Reference
- General User HDFS Commands
- List Files in HDFS
- Make a Directory in HDFS
- Copy Files to HDFS
- Copy Files from HDFS
- Copy Files within HDFS
- Delete a File within HDFS
- Delete a Directory in HDFS
- Get an HDFS Status Report (Administrators)
- Perform an FSCK on HDFS (Administrators)
- General Hadoop/Spark Information
- Hadoop/Spark Installation Recipes
- HDFS
- MapReduce
- Spark
- Essential Tools
- Machine Learning
Product information
- Title: Practical Data Science with Hadoop® and Spark: Designing and Building Effective Analytics at Scale
- Author(s): Ofer Mendelevitch, Casey Stella, Douglas Eadline
- Release date: December 2016
- Publisher(s): Addison-Wesley Professional
- ISBN: 9780134029733