How to get Hadoop and Spark up and running on AWS
Are you interested in working on high-impact projects and transitioning to a career in data? Sign up to learn more about the Insight Fellows programs and start your application today.

Installing Spark from scratch and getting it to run in a distributed mode on a cloud computing system can be a hurdle for many new data engineers. Through this blog post, I hope to make it easier, or at least provide guidance that reduces the time you must spend on the process.

In the data engineering program at Insight Data Science, some Fellows choose to use Pegasus as a tool to quickly stand up instances on Amazon Web Services and install the necessary distributed computing technologies. Pegasus was written several years ago by an Insight program director and relies on shell scripts that aren't as fault-tolerant, and don't produce messages as meaningful, as we'd like. If we were to update Pegasus today, we'd use provisioning tools, such as Terraform, and configuration managers, such as Puppet and Chef, to provision and install technologies rather than the shell scripts.

Its disadvantages aside, Pegasus can be quite a timesaver in reducing the keystrokes you'll need to stand up a cluster of machines on AWS from the comfort of your own laptop.

That said, we do find that Fellows who rely on Pegasus without really understanding what it does under the covers are at a loss on what to do when they encounter problems. I hope to cover some of what Pegasus (it's open-source and available on Github) does within its shell scripts so that if you encounter an error, you'll have a better idea of how to troubleshoot and overcome what's keeping your installation from succeeding.
At the very least, by learning more about how Pegasus installs Hadoop and Spark, you may get a better appreciation of how Hadoop and Spark are configured, so that if you want to tweak some of the parameters, you'll know where to look.

Be aware that using Pegasus to spin up instances and install Hadoop and Spark will incur AWS charges, so you'll want to keep an eye on your expenses.

If you are unfamiliar with Pegasus, start by reading the Github Readme, which contains detailed instructions on how to get started and use Pegasus, including its many features.

Quick start guide

There are a lot of topics to cover, and it may be best to start with the keystrokes needed to stand up a cluster of four AWS instances running Hadoop and Spark using Pegasus.

1. Clone the Pegasus repository and set the necessary environment variables detailed in the 'Manual' installation section of the Pegasus Readme.
2. Gather information about your AWS VPC, subnet, and security group. The easiest way to get that information is to open your browser to the AWS console and verify the results using Pegasus commands documented here. (Tip: If the VPC, subnet, and security groups that you see on your AWS console don't show up on your terminal, it is very likely that your Pegasus setup is using a different region from your AWS console. In the top right corner of your AWS console, check what region you are using (e.g., US East (N. Virginia) or US West (Oregon)) and then make sure it matches the region shown when you type peg config in the terminal where you run Pegasus.)
3. Create a Pegasus configuration yaml file as explained here. Pegasus provides example yaml files, and to get started, you can copy this one.
4. Now provision instances and install technologies using the following commands, which assume you've named the yaml files for configuring the master and worker instances 'mymaster.yml' and 'myworkers.yml,' respectively.
The instructions below assume you've specified a 'tag_name' of spark_cluster in both yaml files:

    peg up mymaster.yml
    peg up myworkers.yml
    peg fetch spark_cluster
    peg install spark_cluster ssh
    peg install spark_cluster aws
    peg install spark_cluster hadoop
    peg service spark_cluster hadoop start
    peg install spark_cluster spark
    peg service spark_cluster spark start

That is very few commands considering the amount of work happening under the hood. Some Fellows are tempted to put all of these commands into a shell script that they'll execute in a single keystroke. I'd advise against doing that in the beginning because it'd be easy to assume all of the steps worked, but as I've mentioned, Pegasus doesn't have the most robust error handling and notification system yet, and you can miss a very important message embedded in the output of one of these commands.

Now let's go back through the steps listed in this quick start guide to provide more detail, especially on the steps that have tripped up some people in the past. I'll also go over what Pegasus is doing behind these commands so that if you want to tweak your configurations, you'll know where to look.

Preparing for the installation

One of the first things you'll want to do before using Pegasus is to set up your configuration files with the correct AWS VPC, subnet, and security group for your account and project needs. If you don't know what VPCs, subnets, or security groups are, especially in the AWS environment, I'd highly recommend reviewing those materials before proceeding.

There are a number of example configuration files that come with Pegasus in this directory. All of the fields in the configuration files are described in the Readme. Below are two that have sometimes tripped up new users.

Security group settings

Use best practices when setting up your security group so that it's not totally open to the public but permissive enough to run Spark on the cluster of machines you'll provision on AWS.
For this example, here's how I set the inbound traffic for my security group to limit TCP traffic only to instances that operate under the same security group as well as traffic from my laptop. This is not a good practice for production, but for the purposes of this example, it will suffice.

Tip: In the above instance, the first row sets 'All TCP' permissions to my laptop's IP address, but if I physically move locations and connect to a different wifi network, that IP address might change. So if you find that a setup that worked in one location stopped working in another, check that your security group is set properly.

AWS Keypair

You'll also want to download a key pair (.pem file) that will be used to access the instances you create on AWS. Pegasus assumes the keypair file is in a .ssh subdirectory under your home directory (e.g., ~/.ssh/your-name.pem). When listing the name of the keypair file in your yaml file, be sure to leave off the suffix (.pem); e.g., if the file name is your-name.pem, then set key_name: your-name.

Peg up

Once you've correctly set up your .yml files, you're ready to connect to AWS and provision your instances using the Pegasus command 'peg up <name of your yml file>'. Below are examples of what happens when you type these commands in your laptop terminal, assuming you've provided the configuration files mymaster.yml and myworkers.yml and, inside those files, listed the tag_name for your machines as spark_cluster. The first file, mymaster.yml, holds the configuration details to spin up one AWS instance that will act as a 'master' machine, and the second file, myworkers.yml, configures the AWS 'worker' instances.
In this case, in the myworkers.yml file, I specified 3 for the num_instances field, which is why Pegasus spun up three worker machines on AWS.

(Figure: Results of 'peg up mymaster.yml')

(Figure: Results of 'peg up myworkers.yml')

Tip: If you get the message, "A client error (AddressLimitExceeded) occurred when calling the AllocateAddress operation: The maximum number of addresses has been reached.", you will need to open a support case on AWS asking that your limit be increased. AWS should respond fairly quickly, and when your case is resolved, re-run 'peg up'.

Once you've successfully passed this stage, if you navigate your web browser to the AWS console, you should see something like the image below. Notice that under the Name column is the 'tag_name' (in my case, spark_cluster) that you specified in your mymaster.yml and myworkers.yml files. In order for the instances to be associated together, you must list the same 'tag_name' in both files. Also notice that in the Role column, Pegasus has indicated which machine is designated the 'master' and which ones are designated 'worker.' (You may need to tweak the display settings on your AWS console to see the 'Role' column.)

(Figure: AWS console after 'peg up' master and workers)

Now that the instances have been properly set up on AWS, it's time to record all of this information temporarily on your laptop so that when Pegasus starts installing technologies in the next step, it'll know how to reach the master and worker nodes without querying AWS every time. Do that by running 'peg fetch <tag_name>', and you should see something like the following in your laptop terminal:

(Figure: Results of 'peg fetch spark_cluster')

(Remember to replace <tag_name> with whatever name you chose; in this example it's spark_cluster.) You only need to do this once, immediately after you've run 'peg up' for your master and worker nodes.
If you are standing up multiple, different clusters, give them different tag_names, and you'll be able to install different technologies on different clusters (e.g., one can be a Spark cluster and a second can be a Kafka cluster).

Installing prerequisite utilities

Prior to installing Spark or other distributed computing frameworks, you'll need to install some utilities that are necessary for Pegasus.

One of the most important steps is enabling password-less ssh login. This will allow you to log in to different machines without exposing and passing around passwords. You can set up password-less ssh login manually by following various nifty guides on the Internet, such as this one, or you can let Pegasus take care of it for you. It's easier if Pegasus does; just fire off this command (assuming you've used a 'tag_name' of spark_cluster):

    peg install spark_cluster ssh

When you do that, you're basically invoking a Pegasus shell script that types in the same commands that you might type manually if you were to enable password-less ssh login without Pegasus.

(Figure: Results of 'peg install spark_cluster ssh')

If your terminal just hangs on this step, or returns a 'port 22: Operation timed out' error, then it is almost always due to a security group problem. See the section above on security groups.

After you make sure you have password-less ssh login enabled, you'll want to set up your AWS secret key and AWS access key id as environment variables on your master and all of your worker instances so that you can programmatically access AWS resources. That's accomplished with the following Pegasus command, assuming the 'tag_name' is spark_cluster:

    peg install spark_cluster aws

Under the covers, Pegasus is invoking this shell script.

Installing Hadoop using Pegasus

If you've made it this far, that means you're ready to start installing your first distributed computing framework, Hadoop.
While installing Hadoop is not necessary to run Spark, you'll find that having access to the Hadoop Distributed File System (HDFS) will be beneficial, especially if your Spark job will be accessing files on HDFS.

To install Hadoop using Pegasus, if my 'tag_name' is spark_cluster, I'd execute on my laptop's terminal:

    peg install spark_cluster hadoop

Identifying what version to install using Pegasus

The first thing Pegasus will do (as found in this download_tech shell script) is figure out what version of Hadoop you are looking for and then identify where it can find the binary (pre-compiled) distribution. As of this writing, the Hadoop version used by Pegasus is 2.7.6. In the past, Pegasus pulled the compiled binaries directly from an Apache mirror website, but that led to instability because some versions would mysteriously disappear from the Apache site without warning. Sometimes a version might reappear in a few hours, or there'd be an alternative version to choose, but it'd require manual intervention. To improve Pegasus' reliability, Insight downloaded several Apache binaries to an S3 bucket and made them available. The downside is that the S3 bucket won't always have the latest version.

So if you are looking to install a newer Hadoop version, you can do that by navigating to your laptop's Pegasus directory and changing these lines in the download_tech shell script:

1. Find the variable, HADOOP_VER, and change it to your desired version number.
2. Locate the HADOOP_URL variable and point it to where you want to pull the desired version's binary. Of course, you must make sure the binary exists: either move it to your own personal S3 bucket and reference that location, or point it to the Apache site, knowing the instability I pointed out above and then just dealing with it.
In the latter case, you'd update the variable to HADOOP_URL=$HADOOP_VER/hadoop-$HADOOP_VER.tar.gz

If you've already installed Hadoop onto your cluster and you want to overwrite it with this newer version, first execute 'peg uninstall <tag_name> hadoop' to remove Hadoop from your cluster (which just removes the Hadoop directory from each instance). That way, you can be sure that when you retry 'peg install <tag_name> hadoop' to install the newer Hadoop version, there aren't remnants of a previous version lurking in some directory.

The Hadoop installation can produce a lot of output, some of which may make it seem like the installation failed. But if the end of your Hadoop installation looks like the following, then it's likely the Hadoop installation succeeded:

(Figure: Result of 'peg install spark_cluster hadoop')

How Pegasus installs Hadoop under the covers

For a single command, 'peg install <tag_name> hadoop', there's a lot that Pegasus is executing behind the scenes.

First, Pegasus executes its script on the master and worker instances you've provisioned in the previous steps. That script:

- sets the HADOOP_CLASSPATH to the directory where Pegasus unzipped and installed the Hadoop distribution
- updates the $HADOOP_HOME/etc/hadoop/core-site.xml file to identify the namenode (or master) running the default file system
- specifies the s3a protocol to access S3 buckets and passes in the AWS secret key and access key id
- sets the configuration for the resource manager, node manager and MapReduce applications, including extending the Hadoop classpath to enable commands that access S3 buckets

As part of starting Hadoop, Pegasus will also make sure all of the cluster's DNS names are properly listed in the /etc/hosts file on the master and worker instances.
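To make the /etc/hosts step more concrete, here is a minimal sketch of how such entries could be generated. The function name, the sample IP addresses, and the ip-a-b-c-d host naming are assumptions made for illustration (the naming follows the usual EC2 private host name pattern). Pegasus does this work in shell, and its exact output may differ.

```python
def hosts_entries(private_ips):
    """Build /etc/hosts-style lines mapping each private IP to a host name."""
    lines = []
    for ip in private_ips:
        # Assumed EC2-style private host name, e.g. 10.0.0.10 -> ip-10-0-0-10
        hostname = "ip-" + ip.replace(".", "-")
        lines.append(f"{ip} {hostname}")
    return lines


if __name__ == "__main__":
    # Hypothetical private IPs for one master and three workers.
    cluster = ["10.0.0.10", "10.0.0.11", "10.0.0.12", "10.0.0.13"]
    print("\n".join(hosts_entries(cluster)))
```

Every instance in the cluster would carry the same set of lines, so each machine can resolve every other machine's name.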
DNS names will be resolved to the instance's private IP address, which Hadoop will use to reference the instance.

Pegasus will also set where the namenode stores its metadata by initializing the Hadoop configuration file, hdfs-site.xml, and identifying the worker instances in the cluster. Next, Pegasus will identify where the data node will store its data. Finally, Pegasus will format HDFS.

If you'd rather manually install Hadoop instead of using Pegasus, go to the documentation to install the latest stable version. Going through that documentation will also reveal additional configurations you can set beyond the default ones Pegasus initializes.

Starting Hadoop using Pegasus and what that means

While Pegasus does many things to install Hadoop, before you can actually start using the system, you must explicitly tell Pegasus to start Hadoop.

To start Hadoop on your cluster, continuing the example we've used with the 'tag_name' of spark_cluster, you'd issue the Pegasus command:

    peg service spark_cluster hadoop start

When you do that, Pegasus will execute this script, which in turn runs several Hadoop scripts and kicks off the MapReduce JobHistory server. Those Hadoop scripts are also covered in the Apache installation documentation, which helps if you want to do this manually rather than with Pegasus.

(Figure: Result of 'peg service spark_cluster hadoop start')

A quick way to know if you've successfully started Hadoop is to go to the web pages that the Hadoop cluster makes available on its namenode or master instance. I like going to the job tracking page found on port 8088 and then clicking on Nodes to see that my worker instances are showing up, as seen below.

(Figure: Result of http://<master public DNS>:8088/cluster/nodes)

Installing Spark using Pegasus

If your Hadoop installation succeeded, you're ready to install Spark.
To install Spark using Pegasus, if my 'tag_name' is spark_cluster, I'd execute on my laptop's terminal:

    peg install spark_cluster spark

(Figure: Results of 'peg install spark_cluster spark')

Above, when we described installing Hadoop, we mentioned that the first thing Pegasus will do is identify what version to look for and install. The same thing applies to Spark, except that in this case, the Spark binary will also depend on what version of Hadoop you've installed.

As of this writing, Pegasus was pulling the binaries for Apache Spark version 2.4.0 built for Apache Hadoop version 2.7 or later. If you're looking for a later version, find it here and then download it to your S3 bucket, or you can point Pegasus to this Apache site, with the same caveats listed above about how the Apache site sometimes removes Spark versions without prior warning.

How Pegasus installs Spark under the covers

Most of what Pegasus does when it installs Spark is setting up default configuration values on each of the machines that'll make up the Spark cluster.

Spark is highly configurable, with a host of options that can be tweaked to customize and optimize its deployment and operations. Pegasus tries to optimize some of the configurations, but certainly not all. Read up on the plethora of Spark configurations here.

For instance, Pegasus will copy a couple of template configuration files provided by Spark and set certain values, such as the Java home directory and the number of Spark worker cores.

Keep in mind that Spark worker cores are different from machine cores. Spark worker cores can be thought of as the number of Spark tasks (or process threads) that can be spawned by a Spark executor on that worker machine. By default, it'll be set to one per machine core, but that won't get you much throughput, so Pegasus will use the Linux utility, nproc, to identify how many processing units are on a machine and then multiply that by an "oversubscription factor" of 3.
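As a rough illustration of that arithmetic, the sketch below multiplies a machine core count by the oversubscription factor of 3. The function name spark_worker_cores is hypothetical, and os.cpu_count() merely stands in for nproc (nproc reports the processors available to the current process, which can differ). This is not Pegasus's actual script, which is written in shell.

```python
import os

OVERSUBSCRIPTION_FACTOR = 3  # the factor quoted in this article


def spark_worker_cores(machine_cores, factor=OVERSUBSCRIPTION_FACTOR):
    """Return the number of Spark worker cores for a machine."""
    if machine_cores < 1:
        raise ValueError("machine_cores must be at least 1")
    return machine_cores * factor


if __name__ == "__main__":
    # os.cpu_count() can return None, so fall back to a single core.
    cores = os.cpu_count() or 1
    print(f"machine cores: {cores}, Spark worker cores: {spark_worker_cores(cores)}")
```

In a standalone Spark deployment, a value like this is normally supplied through the SPARK_WORKER_CORES setting in conf/spark-env.sh.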
That means Pegasus will take however many machine cores there are on an instance and multiply that by 3 to set the number of Spark worker cores. If you are using m4.large instances, which typically have two machine cores, that'll mean that your Spark worker cores will be set to 6.

For a high-level overview of Spark's cluster architecture, including what the executor does, read this documentation.

Pegasus will also modify Spark's spark-defaults.conf file to extend its classpath to include libraries that will allow you to connect to S3 resources, as well as configure worker instances so that they'll know who the other workers are.

All of the configuration files are stored in the '/usr/local/spark/conf' directory on the master and worker instances, with the list of workers stored in the '/usr/local/spark/conf/slaves' file.

Starting Spark using Pegasus and what that means

Just as with the other distributed computing technologies, installing Spark only puts the necessary binaries and libraries in place without actually kicking off the service, so Pegasus needs to explicitly start Spark through this command:

    peg service spark_cluster spark start

This command simply executes the Spark script, which launches a Spark standalone cluster, including the Spark master and workers.
It'll also start a Jupyter notebook.

You can then navigate your browser to the Spark cluster user interface, which can be found at the master's public DNS on port 8080 (Pegasus will output that information once it's started Spark), and you should see your standalone Spark cluster running with its associated workers.

(Figure: Result of http://<master public DNS>:8080 on your web browser)

If you decide you want to change any configuration values once the Spark cluster has been started, you can either try to change those configurations dynamically, programmatically or on the command line, if possible (e.g., classpath), or edit Spark's configuration files (to change the number of worker cores, for example). If you edit the files, make the changes and then, for the new values to take effect, execute:

    peg service <tag_name> spark stop
    peg service <tag_name> spark start

Replace <tag_name> with the name you chose to reference your cluster, or spark_cluster in the examples I've given.

Running a Spark job

Now that both Hadoop and Spark have been installed and you've started your clusters, you're ready to run a Spark job. Let's submit a word count example.

First, copy and paste the following text into a file on your master machine that you can name alice.txt:

    Alice sits drowsily by a riverbank, bored by the book her older sister reads to her.
    Out of nowhere, a White Rabbit runs past her, fretting that he will be late.
    The Rabbit pulls a watch out of his waistcoat pocket
    and runs across the field and down a hole.

Now you're going to want to copy the alice.txt file onto HDFS so that all of the instances in your cluster will be able to access that file.
On your master node, type the following commands to create a new HDFS directory, into which you'll copy alice.txt:

    $ hdfs dfs -mkdir /user
    $ hdfs dfs -copyFromLocal alice.txt /user/alice.txt

Second, you can take this word count Python Spark code example and copy it into a file on your master machine, which you can name wordcount.py.

Now you're ready to fire off your first spark-submit job. Before you do, you should know about one important option that you can pass to spark-submit: the --master option, which triggers the Spark standalone mode and allows the processing of your code to be distributed. If you don't pass in that option along with spark://<master DNS>:7077, where '<master DNS>' is replaced by the public DNS of your master ec2 instance, your code will run solely on your master machine and the word count won't be truly distributed. And if you don't want to run the computation in a distributed manner, then you should rethink whether you need Spark in the first place.

Finally, because you've loaded the input text file onto HDFS, you'll also want to pass to Spark the HDFS location of that file, using the DNS of your master instance and port 9000.

In other words, you'd type this on your master node (be sure to replace '<master DNS>' with your instance's public DNS):

    $ spark-submit --master spark://<master DNS>:7077 wordcount.py hdfs://<master DNS>:9000/user/alice.txt

While your spark-submit job is executing, be sure to go to your web browser and check out the progress of your job at this address: http://<master public DNS>:8080 and this address: http://<master public DNS>:4040 (replace '<master public DNS>' with your instance's actual public DNS).

(Figure: Web browser result of http://<master public DNS>:4040)

You'll find that the second address is only active while the Spark job is running and will return a "This site can't be reached" error once it concludes.

The terminal in which you fired off the Spark job should be full of informational messages, but in
between those messages will be the output of the word count.

(Figure: Wordcount results)

Congratulations on making it this far. Now go and write some more Spark jobs!

How to get Hadoop and Spark up and running on AWS was originally published in Insight Fellows Program on Medium.
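For reference, a word count script along the lines of the wordcount.py submitted above might look like the following sketch. It is not the code linked from the original post, and the tokenize helper with its lowercase, letters-and-apostrophes splitting rule is an assumption. The Spark calls used (SparkSession, textFile, flatMap, map, reduceByKey, collect) are standard PySpark.

```python
import re
import sys
from operator import add


def tokenize(line):
    """Split a line into lowercase words (assumed rule: letters and apostrophes)."""
    return re.findall(r"[a-z']+", line.lower())


def main(path):
    # Imported here so the helper above can be exercised without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount").getOrCreate()
    # path is the input location, e.g. hdfs://<master DNS>:9000/user/alice.txt
    lines = spark.sparkContext.textFile(path)
    counts = (
        lines.flatMap(tokenize)          # one record per word
        .map(lambda word: (word, 1))     # pair each word with a count of 1
        .reduceByKey(add)                # sum the counts per word across workers
    )
    for word, count in counts.collect():
        print(word, count)
    spark.stop()


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: spark-submit wordcount.py <input path>", file=sys.stderr)
    else:
        main(sys.argv[1])
```

The script takes the input path as its only argument, which matches passing the HDFS location of alice.txt on the spark-submit command line.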