Performance Optimization Strategies for Fully Utilizing Apache Spark


KIPS Transactions on Computer and Communication Systems, Vol. 7, No. 1, pp. 9-18, Jan. 2018
DOI: 10.3745/KTCCS.2018.7.1.9
Keywords: Apache Spark, Performance Optimization, System Tuning
Abstract

Improving the performance of big data analytics in distributed environments has become an important issue, because most big data applications, such as machine learning techniques and streaming services, rely on distributed computing frameworks. Accordingly, optimizing the performance of such applications on Spark has been actively researched. Optimizing applications in a distributed environment is challenging because it requires not only optimizing the applications themselves but also tuning the configuration parameters of the distributed system. Although prior research has made considerable effort to improve execution performance, most of it focused on only one of three performance optimization aspects: application design, system tuning, or hardware utilization. Thus, it could not handle the orchestration of those aspects. In this paper, we analyze and model the application processing procedure of Spark in depth. Based on the analysis, we propose performance optimization schemes for each step of the procedure: the inner stage and the outer stage. We also propose an appropriate partitioning mechanism by analyzing the relationship between partitioning parallelism and application performance. We applied these three optimization schemes to WordCount, PageRank, and K-means, which are basic big data analytics workloads, and found a nearly 50% performance improvement when all of the schemes were applied.


Cite this article
[IEEE Style]
R. Myung, H. Yu, S. Choi, "Performance Optimization Strategies for Fully Utilizing Apache Spark," KIPS Transactions on Computer and Communication Systems, vol. 7, no. 1, pp. 9-18, 2018. DOI: 10.3745/KTCCS.2018.7.1.9.

[ACM Style]
Rohyoung Myung, Heonchang Yu, and Sukyong Choi. 2018. Performance Optimization Strategies for Fully Utilizing Apache Spark. KIPS Transactions on Computer and Communication Systems, 7, 1, (2018), 9-18. DOI: 10.3745/KTCCS.2018.7.1.9.