Claude's Profile

Blog

Big Data Analysis with Apache Spark (1)

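Understanding Apache Spark

What is Apache Spark?

• A fast, general-purpose, in-memory cluster computing engine for large-scale data processing
• Fast distributed parallel processing built on distributed memory
• Unified Engine
  ◦ Handles many types of data processing in a single engine
    ▪ Streaming, SQL, ML, Graph, Batch
  ◦ A general-purpose engine that supports diverse workload types such as Batch, SQL, Streaming, and ML, and is compatible with Apache Hadoop
• High-level API
  ◦ Provides high-level APIs in Scala, Java, Python, and R
• Integrate Broadly
  ◦ Can connect to a wide range of storage systems
• Cluster Resource Manager
  ◦ When several applications run at once they can contend for compute resources (CPU, memory), so a resource manager sits in between and allocates resources to each application
  ◦ Supported managers
    ▪ Standalone Scheduler
      • Built into Spark
      • Runs Spark workloads only (performance is fast)
    ▪ YARN (Hadoop)
    ▪ Apache Mesos
    ▪ Kubernetes
• Data from a variety of sources can be loaded into Spark and then processed with RDDs, SQL, and so on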
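Which cluster manager an application uses is normally chosen through the master URL (for example via the `--master` option). The following is a minimal sketch of that mapping in plain Python. `cluster_manager_for` is a hypothetical helper written for this post, not part of Spark, and the host names and ports are placeholders.

```python
def cluster_manager_for(master: str) -> str:
    """Name the cluster manager implied by a Spark master URL (hypothetical helper)."""
    if master.startswith("local"):
        return "none (local mode)"
    if master.startswith("spark://"):
        return "Standalone Scheduler"
    if master == "yarn":
        return "YARN"
    if master.startswith("mesos://"):
        return "Apache Mesos"
    if master.startswith("k8s://"):
        return "Kubernetes"
    raise ValueError(f"unrecognized master URL: {master!r}")


# Placeholder hosts and ports; substitute the values for a real cluster.
examples = [
    "local[*]",
    "spark://master-host:7077",
    "yarn",
    "mesos://mesos-host:5050",
    "k8s://https://k8s-apiserver:6443",
]
for master in examples:
    print(f"{master:35} -> {cluster_manager_for(master)}")
```

`local[*]` is the value the PySpark shell reports when it is started without any options: everything runs in a single local process and no cluster manager is involved.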

Apache Spark Features

• In-memory computing (disk-based processing is also available)
  ◦ Caching consumes memory, so whether a cache is actually needed should be verified by running the application
  ◦ Other in-memory processing engines: Presto, Flink, etc.
  ◦ The data Spark processes is generally immutable
• RDD (Resilient Distributed Dataset) data model
  ◦ Provides a variety of operations (filtering, parsing, aggregation, etc.)
• Support for multiple languages (Scala, Java, Python, R, SQL)
• Rich API (more than 80 operators supported)
• General execution graphs ⇒ DAG (Directed Acyclic Graph) ⇒ multiple stages of map & reduce
• Flexible integration with Hadoop
• Fast data processing (in-memory cached RDDs)
• Interactive shell for ad hoc queries (Scala, Python, and R interpreters)
• Real-time stream processing
• Batch, SQL, Streaming, ML, and other kinds of work can be combined into a single workflow within one application
• Both fast to write and fast to run
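To make "multiple stages of map & reduce" concrete, here is a small word count written in plain Python. It is only a single-machine model of the idea: the sample lines and variable names are made up for illustration, and Spark itself distributes each stage across the cluster.

```python
from collections import defaultdict
from functools import reduce

# Made-up sample input standing in for lines read from a file.
lines = ["spark is fast", "spark is general"]

# Stage 1 (map): turn every line into (word, 1) pairs.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the pairs by key, which is what separates two stages.
groups = defaultdict(list)
for word, one in pairs:
    groups[word].append(one)

# Stage 2 (reduce): add up the ones collected for each word.
counts = {word: reduce(lambda a, b: a + b, ones) for word, ones in groups.items()}

print(counts)
```

In PySpark the same pipeline is commonly expressed with the RDD methods `flatMap`, `map`, and `reduceByKey`, with the shuffle happening inside `reduceByKey`.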

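RDD (Resilient Distributed Dataset)

• An immutable collection of data stored across multiple distributed nodes
• Dataset: a collection of immutable data objects stored in a distributed fashion in memory or on disk
• Distributed: the data in an RDD is automatically partitioned across the cluster and processed in parallel
• Resilient: if a node fails, another node picks up the work; lost data is automatically rebuilt from the RDD lineage
• Immutable: an existing RDD cannot be modified; a transformation produces a new RDD
• Lazy Evaluation: applies to all transformations; Spark holds them until an action is invoked, then executes the whole chain at once
• Controllable Persistence: an RDD can be cached in memory or on disk (useful for iterative computations)
• RDDs support two kinds of operations
  ◦ Transformation
    ▪ Creates a new RDD by transforming the data of an existing RDD (e.g., filter, map)
  ◦ Action
    ▪ Computes a result from the values in the RDD (e.g., count)
• Loading data into an RDD is lazy as well: pointing Spark at a file does not actually read it; the file is read only when an action is called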
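The properties above (immutability, lineage, lazy evaluation, persistence) can be modeled in a few lines of plain Python. `ToyRDD` below is a hypothetical single-machine stand-in written for this post. It is not Spark's implementation, and unlike Spark's `cache()`, which marks the existing RDD, this toy version returns a new object.

```python
from typing import Any, Callable, Iterable, List, Optional


class ToyRDD:
    """Single-machine stand-in for an RDD (hypothetical; not Spark's API)."""

    def __init__(self, source: Callable[[], Iterable[Any]],
                 lineage: Optional[List[str]] = None) -> None:
        self._source = source                 # zero-argument function that produces the data
        self.lineage = lineage or ["source"]  # recorded steps needed to rebuild the data

    # Transformations: return a NEW ToyRDD and compute nothing yet.
    def map(self, fn: Callable[[Any], Any]) -> "ToyRDD":
        return ToyRDD(lambda: (fn(x) for x in self._source()),
                      self.lineage + ["map"])

    def filter(self, pred: Callable[[Any], bool]) -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self._source() if pred(x)),
                      self.lineage + ["filter"])

    def cache(self) -> "ToyRDD":
        # Persistence: remember the computed data after the first action.
        stored: List[List[Any]] = []

        def cached_source() -> List[Any]:
            if not stored:
                stored.append(list(self._source()))
            return stored[0]

        return ToyRDD(cached_source, self.lineage + ["cache"])

    # Actions: run the recorded steps and return a result.
    def collect(self) -> List[Any]:
        return list(self._source())

    def count(self) -> int:
        return sum(1 for _ in self._source())


scans: List[str] = []  # one entry per real pass over the source data


def load_numbers() -> range:
    scans.append("scan")
    return range(1, 11)


base = ToyRDD(load_numbers)
evens = base.filter(lambda n: n % 2 == 0)      # transformation
squares = evens.map(lambda n: n * n).cache()   # transformation + persistence
scans_before_action = len(scans)               # nothing has been read so far

squares_list = squares.collect()               # action: triggers the first scan
squares_count = squares.count()                # action: served from the cache
scans_after_actions = len(scans)

print(squares.lineage, squares_list, squares_count)
print(scans_before_action, scans_after_actions)
```

Here `scans_before_action` is 0 because building the chain only records lineage, and `scans_after_actions` is 1 because the second action reuses the cached result. Without the `cache()` call each action would scan the source again, which is the trade-off behind checking whether a cache is really needed.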

A Quick Taste

$ wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
$ tar -zxvf spark-3.1.1-bin-hadoop3.2.tgz
$ cd ./spark-3.1.1-bin-hadoop3.2/bin
$ ./pyspark
Python 3.8.5 (default, Jan 27 2021, 15:41:15)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
21/03/27 01:29:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/

Using Python version 3.8.5 (default, Jan 27 2021 15:41:15)
Spark context Web UI available at http://spark-master-01:4040
Spark context available as 'sc' (master = local[*], app id = local-1616808556874).
SparkSession available as 'spark'.
>>>