Unlocking the Potential of Auto-Stopping MediaLive Channels

Mar 30, 2024

Introduction:

With our commitment to excellence in education and our dedication to empowering the youth of India, PhysicsWallah offers a plethora of courses designed to meet the diverse needs of our students. Our mission is to provide quality education that enhances the skills and knowledge of the next generation.
To achieve this goal, we conduct numerous live classes daily across various academic batches. We use AWS Elemental services for live streaming, which lets us deliver our educational content seamlessly to students nationwide. Serving a large and diverse student base requires us to run many live classes simultaneously. However, we encountered a recurring problem: after a live class ends, the MediaLive channel sometimes keeps running because of operational lapses. This oversight has led to substantial unwanted cost with no corresponding benefit.

High-Level Live Streaming Architecture:

The diagram provides an overview of our high-level live streaming architecture. OBS transmits live classes from the studio to AWS Elemental MediaLive. MediaLive encodes and transcodes the stream, enabling adaptive bitrate streaming by generating multiple renditions of the same video at different bitrates and resolutions. The processed stream then goes to MediaPackage, which applies HLS packaging and also harvests the class for live-to-video-on-demand (VOD) conversion. Given our scale, a content delivery network (CDN) is essential to make the streams widely accessible, so we use Amazon CloudFront to deliver the live content to students.

Problem Statement:

The issue arises when MediaLive channels remain active after a live class ends. We conduct many live classes throughout the day for different batches, using approximately 300 MediaLive channels daily. At this volume, it is difficult for the operations team to shut down every channel manually, and in around 10% of cases the channel is not stopped after the class concludes.

This operational challenge results in an unintended and substantial financial burden for us. The prolonged running of these channels post-live sessions contributes to an unnecessary increase in our AWS billing, without corresponding educational or operational benefits.

Possible Options for Addressing the Issue:

Improve Operation: We considered enhancing the efficiency of the operations team by having them manually stop each channel after use. However, this plan was deemed impractical due to the inherent limitations of manual intervention.
Custom Script: We also evaluated writing a custom script. However, this approach is a poor fit for our dynamic environment: classes get rescheduled and operational issues create unpredictable situations, so a fixed script pattern cannot handle these edge cases reliably.

Our Chosen Approach (version 1):

After considering multiple approaches, we settled on a solution that uses Amazon CloudWatch, AWS Elemental MediaLive, Amazon SNS, and AWS Lambda. If a MediaLive channel remains active without receiving any input for more than 30 minutes, it is stopped automatically. The code that stops the channel runs in AWS Lambda.

Architecture:

Service Overview:

Elemental MediaLive — This service provides real-time video encoding, packaging, and delivery in the cloud, facilitating live streaming workflows.
CloudWatch — It is a monitoring service providing insights into metrics, logs, and resource utilization, enabling proactive management of AWS resources.
SNS (Simple Notification Service) — AWS messaging service for sending notifications and messages to a distributed set of recipients.
Lambda — A serverless compute service for running code without provisioning or managing servers.

In our architecture, each MediaLive channel publishes a CloudWatch metric called “Active Alerts.” MediaLive raises alerts for various conditions, such as missing video or audio, or no input at all. We use this Active Alerts metric to define a CloudWatch alarm.

The alarm is configured so that if Active Alerts stays above the set threshold for more than 30 minutes, the alarm transitions to the ALARM state. This state change triggers an action that publishes a notification to an Amazon Simple Notification Service (SNS) topic. A Lambda function is subscribed to this topic.
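The alarm described above could be defined roughly as follows. This is a minimal sketch rather than our exact configuration: the alarm name, the `Maximum` statistic, the one-minute period, the threshold of 0, and the `Pipeline` value are all assumptions made for illustration, and `build_idle_channel_alarm` and `create_alarm` are hypothetical helper names.

```python
from typing import Any, Dict


def build_idle_channel_alarm(channel_id: str, sns_topic_arn: str,
                             pipeline: str = "0",
                             idle_minutes: int = 30) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        # Hypothetical naming scheme, chosen only for this example.
        "AlarmName": f"medialive-idle-{channel_id}-p{pipeline}",
        "Namespace": "AWS/MediaLive",
        "MetricName": "ActiveAlerts",
        "Dimensions": [
            {"Name": "ChannelId", "Value": channel_id},
            {"Name": "Pipeline", "Value": pipeline},
        ],
        "Statistic": "Maximum",      # assumption: any alert within the minute counts
        "Period": 60,                # assumption: one datapoint per minute
        "EvaluationPeriods": idle_minutes,
        "DatapointsToAlarm": idle_minutes,
        "Threshold": 0,              # assumption: alarm when ActiveAlerts > 0
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(channel_id: str, sns_topic_arn: str) -> None:
    """Create or update the alarm; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder can be inspected offline

    params = build_idle_channel_alarm(channel_id, sns_topic_arn)
    boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameter builder separate from the API call makes it easy to review the alarm definition per channel before anything is created in AWS.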

When it receives an event from SNS, the Lambda function extracts the relevant MediaLive channel ID and stops the channel, providing an automated response to a channel running without input.
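A Lambda handler for this step might look like the sketch below. It is illustrative, not our production code: it assumes the channel ID is carried as a `ChannelId` dimension in the alarm notification that CloudWatch publishes to SNS, and the helper names `extract_channel_id` and `_find_channel_id` are hypothetical.

```python
import json
from typing import Any, Dict, List, Optional


def _find_channel_id(dimensions: List[Dict[str, str]]) -> Optional[str]:
    """Return the ChannelId value from a list of alarm dimensions, if present."""
    for dim in dimensions:
        if dim.get("name") == "ChannelId":
            return dim.get("value")
    return None


def extract_channel_id(event: Dict[str, Any]) -> Optional[str]:
    """Pull the MediaLive channel ID out of an SNS-wrapped CloudWatch alarm event."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    trigger = message.get("Trigger", {})

    # Single-metric alarms: dimensions sit directly under Trigger.
    channel_id = _find_channel_id(trigger.get("Dimensions", []))
    if channel_id:
        return channel_id

    # Metric math alarms (assumed shape): dimensions sit under each metric.
    for metric in trigger.get("Metrics", []):
        dims = metric.get("MetricStat", {}).get("Metric", {}).get("Dimensions", [])
        channel_id = _find_channel_id(dims)
        if channel_id:
            return channel_id
    return None


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Optional[str]]:
    channel_id = extract_channel_id(event)
    if channel_id is None:
        return {"stopped": None}  # nothing we can act on

    import boto3  # available in the Lambda Python runtime

    boto3.client("medialive").stop_channel(ChannelId=channel_id)
    return {"stopped": channel_id}
```

The function's execution role would need permission to call `medialive:StopChannel` for this to work.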

Challenges Faced in Version 1:

While our production setup was live and operational, we encountered a specific edge case.
Let’s take a look at the edge case scenario:
We start a MediaLive channel at 11 am.

We had a 30-minute threshold for the channel: if it ran continuously without receiving any input for 30 minutes, it would be stopped automatically. To illustrate, consider the timeline. The channel started at 11 am and ran without any input stream for the first 15 minutes, which generated 15 data points for the alarm. The channel was then stopped for about two minutes, during which the alarm received no data points. Thus, up until 11:17, the alarm had accumulated only 15 data points where Active Alerts was greater than zero.

The channel then resumed and ran for another 13 minutes without any input stream, giving the alarm 13 new data points where Active Alerts was greater than zero. By 11:30, the alarm had received a total of only 28 such data points, so it kept evaluating the following minutes.

From 11:30, the channel ran without input for two more minutes, providing the alarm with two additional data points. This completed the required 30 data points, and the channel was stopped at 11:32.
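The timeline above can be replayed with a small simulation. This is a deliberately simplified model of the behavior we observed, not CloudWatch's actual evaluation algorithm: it assumes one datapoint per minute, that missing datapoints are simply skipped, and that the alarm fires once the most recent 30 available datapoints are all above zero. The function name `first_alarm_minute` is hypothetical.

```python
from typing import List, Optional

Datapoint = Optional[int]  # None stands for "no data reported this minute"


def first_alarm_minute(series: List[Datapoint], required: int = 30,
                       fill_missing_with_zero: bool = False) -> Optional[int]:
    """Return minutes after start at which the simplified alarm would fire."""
    seen: List[int] = []
    for minute, value in enumerate(series):
        if value is None:
            if not fill_missing_with_zero:
                continue  # simplified model: a missing datapoint is skipped
            value = 0     # missing data treated as "no active alerts"
        seen.append(value)
        window = seen[-required:]
        if len(window) == required and all(v > 0 for v in window):
            return minute + 1  # the datapoint for this minute has completed
    return None


# 11:00-11:15 idle (15 alerts), 11:15-11:17 stopped (no data), then idle again.
timeline: List[Datapoint] = [1] * 15 + [None] * 2 + [1] * 43

print(first_alarm_minute(timeline))                               # 32 -> 11:32
print(first_alarm_minute(timeline, fill_missing_with_zero=True))  # 47 -> 11:47
```

In this model, skipping the two missing minutes lets the count carry across the stop, which reproduces the 11:32 stop. Treating the gap as zero resets the count, so the alarm would fire only after 30 uninterrupted minutes from 11:17.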

The graph below shows that there are no Active Alerts data points while the channel is stopped, around 19:57 UTC. This missing data is what leads to the unexpected CloudWatch behavior.

Our Response (Version 2):

Upon deeper investigation, we identified an internal CloudWatch factor referred to as the “evaluation period.” When data was missing, for example because the channel was stopped for 1–2 minutes within the 30-minute window specified for the CloudWatch alarm, the alarm would also consider the next 3–4 data points.

To address the issue caused by missing data, we used a CloudWatch metric math feature, the “fill function” (FILL). It ensures that when MediaLive reports no data, the number of active alerts is treated as 0. With this approach we handled our edge case, since CloudWatch now always has a numeric value for active alerts for any given channel.
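One way to express this is a metric math alarm whose expression is `FILL(alerts, 0)`. The sketch below is illustrative only: the metric ID `alerts`, the alarm name, the `Maximum` statistic, the one-minute period, and the helper name `build_filled_alarm` are assumptions, not our exact configuration.

```python
from typing import Any, Dict


def build_filled_alarm(channel_id: str, sns_topic_arn: str,
                       pipeline: str = "0",
                       idle_minutes: int = 30) -> Dict[str, Any]:
    """Build put_metric_alarm arguments that treat missing ActiveAlerts as 0."""
    return {
        # Hypothetical naming scheme, chosen only for this example.
        "AlarmName": f"medialive-idle-filled-{channel_id}-p{pipeline}",
        "Metrics": [
            {
                "Id": "alerts",        # raw metric, not evaluated directly
                "ReturnData": False,
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/MediaLive",
                        "MetricName": "ActiveAlerts",
                        "Dimensions": [
                            {"Name": "ChannelId", "Value": channel_id},
                            {"Name": "Pipeline", "Value": pipeline},
                        ],
                    },
                    "Period": 60,      # assumption: one datapoint per minute
                    "Stat": "Maximum",
                },
            },
            {
                "Id": "filled",        # the series the alarm evaluates
                "Expression": "FILL(alerts, 0)",
                "Label": "ActiveAlerts (missing as 0)",
                "ReturnData": True,
            },
        ],
        "EvaluationPeriods": idle_minutes,
        "DatapointsToAlarm": idle_minutes,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }
```

The resulting dictionary can be passed to `boto3.client("cloudwatch").put_metric_alarm(**params)`. Only the `filled` expression returns data, so a gap in the raw metric shows up to the alarm as a zero rather than as a missing datapoint.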

The below graph illustrates that in the absence of active alerts, the fill function is applied to substitute zero values for any missing active alert data points.

Conclusion:

In our effort to make AWS Elemental MediaLive work better for us, this setup not only made operations smoother but also saved us money. We run, on average, 100 channels for 4 hours each day, totaling 400 channel-hours. A 5% miss rate results in an additional 100 hours of runtime. By eliminating those missed stops, we save approximately 25% of our total cost. We have also removed the need for manual intervention by the operations team, and with it the possibility of human error.

Even when we faced a surprise hiccup, we found a fix with CloudWatch’s “fill function,” making sure our success stayed on track. It’s not just tech talk — it’s about our team staying strong. As we keep improving, we’re not just cutting costs; we’re showing how to stream smarter on AWS without breaking the bank. Our journey isn’t done; we’re committed to keeping things simple, smart, and budget-friendly in the ever-changing world of AWS live streaming.

Written by — Tejas Gupta, DevOps Team
Tejas Gupta works as a DevOps Engineer at PhysicsWallah and is an AWS Community Builder. He actively contributes to the tech community and has broad knowledge across multiple domains, with expertise in streaming, infrastructure, and security. He has been handling media and infrastructure at PW for the past year.