Apache Storm on Verizon 5G Edge
Building your first Tweet processing application at the network edge
Kishor Patil, PMC Chair Apache Storm and Distinguished Software Systems Engineer, Verizon Media
Robbie Belson, Developer Relations, Verizon
An intro to stream processing at the edge
Verizon 5G Edge with AWS Wavelength presents boundless opportunities for distributed applications that demand ultra-low latency at scale. For distributed real-time computation, Apache Storm stands out as an easy-to-use framework for reliably processing unbounded data streams in near real time.
Take a distributed advertising application, for example. Wouldn’t it be valuable to syndicate social media insights in real time, to provide users with personalized ads far faster than is generally done today?
So, in this tutorial, we’ll explore how to launch an Apache Storm cluster in a 5G Edge Wavelength Zone to calculate the top 10 Twitter hashtags at any point in time.
Part 1: Verizon 5G Edge infrastructure
To get started, launch your first EC2 instance in a Wavelength Zone. After the instance is up, modify its security group to allow inbound traffic on ports 80 (HTTP), 8080 (the Storm UI) and 3774 (DRPC).
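If you prefer the AWS CLI to the console, the same ingress rules can be added along these lines (a sketch: the security-group ID is a placeholder for the group attached to your instance, and opening 0.0.0.0/0 is for demo purposes only):

```shell
# Placeholder ID; substitute the security group attached to your instance
SG_ID=sg-0123456789abcdef0

# Allow inbound HTTP (80), Storm UI (8080), and DRPC (3774)
for PORT in 80 8080 3774; do
  aws ec2 authorize-security-group-ingress \
    --group-id "$SG_ID" --protocol tcp --port "$PORT" --cidr 0.0.0.0/0
done
```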
Part 2: Configure Apache Storm
First, install the relevant dependencies, including Java and a Git client, to download the Twitter word count repository.
sudo yum install -y java-1.8.0-openjdk-devel git
Next, create a storm working directory, navigate into it, and install Maven.
mkdir storm && cd storm
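Note that the Maven tarball must already be present in this directory before it can be extracted; if it isn't, it can be fetched from the Apache archive (URL assumed here; verify the version against the Maven download page):

```shell
# Download the Apache Maven 3.6.3 binary distribution from the Apache archive
wget https://archive.apache.org/dist/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
```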
sudo tar -xvf apache-maven-3.6.3-bin.tar.gz -C /usr/lib/
Next, update your PATH variable to point to the binaries associated with Maven. Open up ~/.bash_profile and add the following:
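The snippet itself isn't shown in the original walkthrough; assuming Maven was extracted to /usr/lib as in the previous step, the lines would look something like this:

```shell
# ~/.bash_profile: put the Maven binaries on the PATH
export M2_HOME=/usr/lib/apache-maven-3.6.3
export PATH=$M2_HOME/bin:$PATH
```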
Then, pull down the Twitter word count repository. Note that the -b flag pulls down the twitter-poc branch of the repository.
git clone -b twitter-poc https://github.com/kishorvpatil/storm.git
To confirm your PATH was adjusted correctly, source the bash profile and check the PATH variable.
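For example (mvn -version succeeds only once the Maven binaries are actually on the PATH):

```shell
# Reload the profile so the updated PATH takes effect in this shell
source ~/.bash_profile

# The Maven bin directory should now appear in the PATH,
# and the mvn binary should resolve
echo "$PATH" | grep apache-maven
mvn -version
```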
Next, from the storm directory, build the Storm repo from source using Maven.
mvn clean install -DskipTests -Dgpg.skip=true;
After navigating to the directory containing the Storm binary distribution, build the package (note: the -DskipTests flag excludes the unit tests).
mvn package -DskipTests -Dgpg.skip=true;
Next, untar the package and start the Storm services. To learn more about the individual components of the Storm architecture, visit the docs.
tar -xvf apache-storm-2.3.0-SNAPSHOT.tar.gz
nohup bin/storm dev-zookeeper &
nohup bin/storm nimbus &
nohup bin/storm ui &
nohup bin/storm supervisor &
nohup bin/storm logviewer &
nohup bin/storm drpc &
Then, add the address of your DRPC server to storm.yaml. Search for "Locations of drpc servers" in the conf/storm.yaml file and add the following:
- "localhost" # add this line
To verify the cluster is up and reachable, navigate to the Storm UI in a web browser (port 8080 on your instance's public IP).
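You can also check from the command line against the Storm UI's REST API; the address below is a placeholder for your instance's public IP:

```shell
# Query the cluster summary endpoint exposed by the Storm UI on port 8080
curl -s http://<instance-public-ip>:8080/api/v1/cluster/summary
```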
Next, submit the topology to the storm cluster.
bin/storm jar ../../../../../examples/storm-starter/target/storm-starter-2.3.0-SNAPSHOT.jar org.apache.storm.starter.TopNTweetTopology --consumerKey "your_consumer_key" --consumerSecret "your_consumer_secret" --accessToken "your_access_token" --accessTokenSecret "your_secret_token"
To verify successful submission of the topology, open the topology in the Storm UI and take a look at the top hashtags. If executed properly, the top five hashtags should display in your console (or web browser). Note that you can change the top N from 5 to any number less than 50.
Rapid recap and Apache Storm deep dive
So, what exactly happened? Take a look at the visualization feature of your topology to break it down further.
Essentially, Apache Storm reads a raw stream of near-real-time data (in this case, from Twitter) and passes it through a sequence of small processing units to output meaningful information at the end. Think of the tweet-spout as the source of the stream and a bolt as a logical processing unit. In other words, the parse-tweet bolt ingests each tuple and extracts the hashtags present in a given tweet; the count-bolt counts the individual hashtags; and the intermediate-ranker consolidates the individual hashtag counts, which are then passed to the total-ranker bolt, which ranks the most popular hashtags at any given point in time.
Let’s trace an example tweet through the topology: “I love animals! #dog #cat”
parse-tweet-bolt: Extracts #dog and #cat
count-bolt: Emits the tuples (#dog, 1) and (#cat, 1)
intermediate-ranker: Let’s say the bolt has already seen five dog hashtags; it now maintains two tuples, (#dog, 6) and (#cat, 2), and emits them to the total-ranker.
total-ranker: Do six instances of #dog make the top 10? If so, emit (#hashtag, <value>) as part of the complete list.
Note: For a given hashtag (e.g., #dog) to propagate to the final bolt, the intermediate-ranker bolt must have counted a volume of hashtags that exceeds the top 10 at any given point in time. Thus, not all data (i.e., tuples) will propagate from the spout to the final bolt.
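The parse → count → rank flow can be imitated on a small, static batch of tweets with standard Unix tools. This is only a batch-mode analogy (Storm processes an unbounded stream, tuple by tuple, across distributed bolts), but it mirrors what each stage does:

```shell
# Feed two sample "tweets" through a parse | count | rank pipeline
printf '%s\n' "I love animals! #dog #cat" "Walk time! #dog" |
  grep -o '#[[:alnum:]_]*' |   # parse-tweet-bolt: extract hashtags
  sort | uniq -c |             # count-bolt: count each hashtag
  sort -rn | head -n 10        # total-ranker: keep the top 10
```

Running this prints #dog with a count of 2 ahead of #cat with a count of 1, just as the total-ranker would rank them.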
Additionally, given the distributed nature of the topology, a single bolt can have multiple instances. For example, you could have multiple count-bolt instances and multiple spouts to handle large ingress data volumes, but only one total-ranker bolt.
As a result, these spouts and bolts, connected together, form a topology, to which you can attach a DRPC component that provides external access to query the final state from the report-bolt.
Congratulations! You’ve now completed your first Apache Storm application on 5G Edge.