Big Data Specialization -- Introduction to Big Data

Course 1 Introduction to Big Data

Week 1

  • In-situ analytical processing: bringing the computation to where the data is located.

  • SCADA (Supervisory Control and Data Acquisition)
      A type of industrial control system for remote monitoring and control of industrial processes that exist in the physical world, potentially spanning multiple sites and many types of sensors. In addition to monitoring and control, a SCADA system can be used to define actions that reduce waste and improve efficiency in industrial processes, public or private infrastructure processes, and facility processes.

Week 2

1. The Characteristics of Big Data: The Six Fundamentals of Big Data
  1. Volume: the dimension of big data related to its size and its exponential growth.

  2. Variety: refers to the ever-increasing different forms that data can come in such as text, images, voice, and geospatial data.

  3. Velocity: refers to the increasing speed at which big data is created and the increasing speed at which the data needs to be stored and analyzed.

    • Batch processing (slow): collect data, clean data, feed in chunks, wait, act.
    • Real-time processing (fast): instantly capture streaming data, feed it to machines in real time, process in real time, act. (A minimal contrast of the two modes is sketched in code after this list.)
  4. Veracity: refers to the quality of big data. Related terms sometimes mentioned alongside it are validity and volatility, the latter referring to the lifetime of the data.

  5. Valence (connectedness): the more connected the data is, the higher its valence.

  6. Value: the ultimate goal of big data is to get value out of the data.
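
As a rough illustration of the velocity contrast above, the following is a minimal Python sketch (my own illustration, not course code) of batch processing versus acting on each record as it arrives; `sensor_stream` and the alert threshold are invented stand-ins for a real data source and business rule.

```python
import random
import statistics

def sensor_stream(n=10):
    """Stand-in for a real-time data source: yields readings one at a time."""
    for _ in range(n):
        yield random.gauss(20.0, 2.0)  # e.g., temperature readings

# Batch processing (slow): collect everything first, then analyze and act.
batch = list(sensor_stream())
print("batch mean:", statistics.mean(batch))

# Real-time processing (fast): act on each reading as soon as it arrives.
count, running_sum = 0, 0.0
for reading in sensor_stream():
    count += 1
    running_sum += reading
    if reading > 23.0:               # act immediately on an unusual reading
        print("alert at reading", count, ":", reading)
print("streaming mean:", running_sum / count)
```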

2. Defining the Questions - Building a big data strategy:

In summary, when building a big data strategy, it is important to:

  • Integrate big data analytics with business objectives
  • Communicate goals and provide organizational buy-in (Commitment, Sponsorship, Communication) for analytics projects
  • Build teams with diverse talents, and establish a teamwork mindset.
  • Remove barriers to data access and integration
  • Finally, these activities need to be iterated to respond to new business goals and technological advances

Five P’s of Data Science

  1. Purpose: The purpose refers to the challenge or set of challenges defined by your big data strategy. The purpose can be related to a scientific analysis with a hypothesis, or to a business metric that needs to be analyzed, often based on big data.

  2. People: Data scientists are often seen as people who possess skills in a variety of topics, including science or business domain knowledge; analysis using statistics, machine learning, and mathematical knowledge; and data management, programming, and computing. In practice, this is generally a group of researchers with complementary skills.

  3. Process: The process of data science includes techniques for statistics, machine learning, programming, computing, and data management. A process is conceptual in the beginning; it defines the coarse set of steps and how everyone can contribute to it. Note that similar reusable processes can apply to many applications with different purposes when employed within different workflows. Data science workflows combine such steps into executable graphs. Process-oriented thinking is a transformative way of conducting data science: it connects people and techniques to applications. Execution of such a data science process requires access to many datasets, big and small, bringing new opportunities and challenges to data science. There are many data science steps or tasks, such as data collection, data cleaning, data processing/analysis, and result visualization, which together form a data science workflow. Data science processes may need user interaction and other manual operations, or be fully automated. Challenges for the data science process include 1) how to easily integrate all the tasks needed to build such a process, and 2) how to find the best computing resources and efficiently schedule process executions to those resources based on the process definition, parameter settings, and user preferences.

  4. Platforms: Based on the needs of an application-driven purpose and the amount of data and computing required to perform this application, different computing and data platforms can be used as a part of the data science process. This scalability should be made part of any data science solution architecture.

  5. Programmability: Capturing a scalable data science process requires aid from programming languages, e.g., R, and patterns, e.g., MapReduce. Tools that provide access to such programming techniques are key to making the data science process programmable on a variety of platforms.

3. The Process of Data Analysis - the five steps of the data science process:
  1. Acquire: includes anything that makes us retrieve data: finding, accessing, acquiring, and moving data. It includes identification of and authenticated access to all related data, and transportation of data from sources to distributed file systems.

  2. Prepare:

    • Explore: The first step in data preparation involves looking at the data to understand its nature, what it means, its quality, and its format.
    • Pre-process: Pre-processing includes cleaning data, sub-setting or filtering data, and creating data that programs can read and understand, such as modeling raw data into a more defined data model or packaging it using a specific data format.
  3. Analyze: Select the analytical technique, build models.

  4. Report: Communicating results includes evaluating analytical results, presenting them in a visual way, and creating reports that include an assessment of results with respect to success criteria.

  5. Act: Reporting insights from analysis and determining actions from insights based on the purpose.
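
A minimal Python sketch of these five steps on a toy, invented dataset is shown below; it assumes pandas is available, and the column names, values, and threshold are illustrative only.

```python
import pandas as pd

# 1. Acquire: here the "source" is an in-memory table standing in for a real file or API.
raw = pd.DataFrame({
    "sensor": ["a", "a", "b", "b", "b"],
    "reading": [20.1, None, 25.3, 19.8, 30.2],
})

# 2. Prepare: explore the data, then pre-process it (clean missing values).
print(raw.describe())                  # explore: nature, quality, format
clean = raw.dropna()                   # pre-process: drop incomplete records

# 3. Analyze: a simple aggregation stands in for building a model.
summary = clean.groupby("sensor")["reading"].mean()

# 4. Report: present the results (a chart or written report would go here).
print("mean reading per sensor:\n", summary)

# 5. Act: turn the insight into an action based on the original purpose.
for sensor, mean_reading in summary.items():
    if mean_reading > 24.0:            # illustrative threshold
        print(f"schedule an inspection for sensor {sensor}")
```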

Week 3

1. Basic Scalable Concepts - Scalable Computing over the Internet:
  • Parallel computer: a very large number of computing nodes with specialized capabilities, connected by specialized networks.
  • Commodity cluster: affordable parallel computers with an average number of computing nodes. They are not as powerful as traditional parallel computers and are often built out of less specialized nodes. These systems have a higher potential for partial failures. It is this type of distributed computing that pushed for a change toward cost-effective, reliable, and fault-tolerant systems for the management and analysis of big data.
2. The Requirements For Big Data Programming Model:
  1. Support Big Data Operation:
    • Split volumes of data
    • Access data fast
    • Distribute computations to nodes
  2. Handle Fault tolerance
    • Replicate data partition
    • Recover files when needed
  3. Enable Adding More Racks
    • Add new resources to handle more or faster data without losing performance (scaling out).
    • Optimized for specific data types
3. Getting Started With Hadoop

Hadoop Ecosystem

HDFS: The Hadoop Distributed File System, a storage system for big data. It serves as the foundation for most tools in the Hadoop ecosystem.

It provides two capabilities that are essential for managing big data.

  • Scalability to large data sets.
  • Reliability to cope with hardware failures. By default, HDFS maintains three copies of every block.

Two key components of HDFS:

  1. NameNode for Metadata:

    • One namenode per cluster
    • Coordinator of HDFS cluster
    • Records the name, location in the directory hierarchy and other metadata
    • Decides which DataNodes will store the contents of a file and remembers this mapping
  2. DataNode for block storage

    • Runs on each node in the cluster and is responsible for storing the file blocks
    • Listens to commands from the NameNode for block creation, deletion, and replication
    • Replication provides two key capabilities:
      • Fault tolerance
      • Data locality
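
As a back-of-the-envelope illustration of what the three-copies default above means for storage, here is a small Python sketch; the 128 MB block size is the common HDFS default (configurable via dfs.blocksize), and the 1 GB file is just an example.

```python
import math

BLOCK_SIZE_MB = 128   # common HDFS default block size; configurable per cluster
REPLICATION = 3       # default replication factor mentioned above

def hdfs_footprint(file_size_mb: float):
    """Rough estimate of how a file is split and stored in HDFS."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)   # number of blocks the file is split into
    block_copies = blocks * REPLICATION                # copies spread across DataNodes
    raw_storage_mb = file_size_mb * REPLICATION        # total raw capacity consumed
    return blocks, block_copies, raw_storage_mb

blocks, copies, raw = hdfs_footprint(1024)             # a 1 GB example file
print(f"blocks: {blocks}, block copies in the cluster: {copies}, raw storage used: {raw} MB")
```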

YARN: The Resource Manager for Hadoop
YARN interacts with applications and schedules resources for their use.

Essential Gear in YARN engine:

  • Resource manager: controls all the resources, and decides who gets what
  • Application Master: negotiates resources from the Resource Manager and talks to Node Managers to get its tasks completed
  • Node Manager: operates at the machine level and is in charge of a single machine
  • Container: an abstract notion that signifies a resource allocation, i.e., a collection of CPU, memory, disk, network, and other resources within a compute node

MapReduce:

  It is a big data programming model that supports all the requirements of the big data programming model we mentioned. It can model processing of large data sets, split computations into different parallel tasks, and make efficient use of large commodity clusters and distributed file systems.

  • Map: applies an operation to all elements and generates key-value pairs.
  • Reduce: applies a summarizing operation to the elements and constructs one output file.
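
To make the map and reduce roles concrete, here is a minimal word-count sketch in plain Python (not tied to any particular Hadoop API); the shuffle/group step that Hadoop performs between the two phases is simulated here with a dictionary.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (key, value) pair for every word in a line."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """Reduce: summarize all the values that share the same key."""
    return (word, sum(counts))

lines = ["the cat sat", "the cat ran", "a dog sat"]

# Shuffle/sort: group intermediate values by key (done by the framework in Hadoop).
grouped = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        grouped[word].append(count)

# Reduce each group and collect the final output.
result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result)  # {'the': 2, 'cat': 2, 'sat': 2, 'ran': 1, 'a': 1, 'dog': 1}
```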

MapReduce is bad for:

  • Frequently changing data (slow)
  • Dependent tasks
  • Interactive analysis

When to reconsider Hadoop:

  • Key features that make a problem Hadoop-friendly:

    • Future anticipated data growth
    • Long term availability of data
    • Many platforms over single datastore
    • High Volume, High variety
  • Be careful when:

    • Small datasets
    • Task-level parallelism (further analysis is needed to choose which tool in the Hadoop ecosystem to use)
    • Advanced algorithms (not all algorithms are scalable in Hadoop, or reducible to one of the programming models supported by YARN)
    • Random data access (you may have to read an entire file just to pick one data entry)
4. Cloud Computing (on-demand computing: it enables us to compute any time, anywhere)

Service models:

  • IaaS: infrastructure as a service, can be defined as a bare-minimum rental service. It is like renting a truck: the rental company provides the hardware (the truck), while you do the packing of your furniture and drive it to your new house yourself.

  • PaaS: platform as a service, the model where a user is provided with an entire computing platform. This could include the operating system and the programming languages that you need.

  • SaaS: software as a service, the model in which the cloud service provider takes responsibility for the hardware and software environment, such as the operating system and the application software.
