Building Clusters Journey

This doc is about how we build "BigData"-related clusters such as Hadoop, Spark, and Kafka. Specifically, we'll look into the following clusters:

  • Hadoop 3 cluster
  • Spark cluster:
    • Spark on Standalone
    • Spark on Hadoop/YARN
    • Spark on Kubernetes
  • Hive on Hadoop/YARN
  • Presto
  • Kafka

Note that all of these are Apache Software Foundation projects except for Presto (which is still open source, under the Apache License 2.0).

Contents (& Progress)

  1. Hadoop 3
  2. Spark cluster
     1. Spark on Hadoop/YARN (plus Spark on Standalone)
     2. Spark on Kubernetes (Not yet)
  3. Hive on Hadoop/YARN
  4. Presto (Not yet)
  5. Kafka cluster (Not yet)


Here are the details you need to prepare when building a cluster:

  • Amazon EC2 instances
    • us-east-1
    • AmazonLinux2 (AMI ID: ami-0947d2ba12ee1ff75)
    • Instance type: m5.xlarge
  • clush command - version: clush 1.8.3
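clush (from ClusterShell) is what lets us run the same command on every node of a cluster in parallel; it resolves named node groups from a local config file. A minimal sketch of that config, assuming three hypothetical worker hosts named node01 through node03 (substitute your EC2 instances' hostnames):

```
# /etc/clustershell/groups.d/local.cfg
# Hypothetical hostnames -- adjust to match your instances'
# DNS names or /etc/hosts entries.
[Main]
workers: node[01-03]
all: node[01-03]
```

With that in place, `clush -b -g workers uname -r` runs the command on every worker (`-b` groups identical output so three matching nodes print one block), and `clush -w node[01-02] <cmd>` targets an explicit node set instead of a named group.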