Enhancing Load Testing Performance: From Kubernetes to EC2
by Abhinav Kumar, DevOps Team
At Physics Wallah, our commitment to delivering seamless educational experiences requires robust and scalable infrastructure. In Part 1, we shared our journey of revamping our load testing approach by leveraging Kubernetes. Today, we’re excited to share how we’ve further evolved this strategy to overcome new challenges and achieve even greater efficiency.
Why We Changed Our Approach
While Kubernetes provided significant advantages, such as automated scaling and better resource control, we encountered several issues:
- Network Congestion: Operating within a private cluster meant that all requests went through a NAT, causing network congestion and instability.
- Port Exhaustion: Multiple JMeter pods scheduling on a single node led to node network choking and port exhaustion.
- Increased Costs and Instability: NAT gateway charges escalated, and pods being scheduled in the middle of a load test destabilized runs.
- Load Balancer Scaling Issues: A significant load coming from a single IP (NAT IP address) seemed to be treated differently by the Application Load Balancer (ALB) compared to load distributed via multiple IPs. The latter scenario allowed the ALB to scale more effectively, as observed during our tests.
- Inaccurate Latency Reports: JMeter reports often showed high API latency because of an inflated ‘Connect Time’ (the time taken to establish a connection to the server), which made every API in the report appear slow.
- Distributed Mode Limitations: Running JMeter in distributed mode in Kubernetes works well with 20–30 machines, and perhaps up to 40–50 with luck. Constrained by the inefficiencies of the RMI protocol JMeter uses for coordination, this setup capped us at roughly 50–100k simulated concurrent users. (A typical distributed invocation is sketched below.)
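For context, distributed mode drives every load generator from a single controller over RMI, which is where the coordination overhead comes from. An invocation looks roughly like the following; the test plan name and worker hostnames are placeholders, not our actual setup:

# Controller ships the plan to each worker over RMI and aggregates samples centrally
jmeter -n -t load-test.jmx -R worker-1,worker-2,worker-3 -l distributed-results.jtl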
The Shift to EC2
To address these challenges, we transitioned our load testing setup from Kubernetes to EC2 machines. Here’s a breakdown of the benefits this shift brought:
- Improved Network Efficiency: By eliminating the NAT and reducing hops, our network congestion issues were resolved. Requests now come from multiple IPs, closely mimicking actual load scenarios and allowing the ALB to scale effectively.
- Dedicated Resources: Running one JMeter instance per EC2 node eliminated the per-node contention and port exhaustion we previously saw when multiple pods shared a node. We’ve also tuned our EC2 instances for better TCP performance, giving us an edge we didn’t have in Kubernetes.
- Simplified Load Distribution: Instead of running JMeter in distributed mode, we now execute parallel standalone JMeter tests and consolidate the results at the end. This approach simplifies our load testing process and improves reliability; a rough sketch of the launch step follows this list.
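To illustrate the launch side (the instance IPs, user, and file names below are placeholders rather than our exact tooling), each EC2 node simply runs the same test plan in non-GUI mode, independently of the others:

# Copy the plan to each load-generator node and start an independent, standalone test
for host in 10.0.1.11 10.0.1.12 10.0.1.13; do
  scp load-test.jmx ec2-user@"$host":~/ &&
    ssh ec2-user@"$host" 'jmeter -n -t ~/load-test.jmx -l ~/result.jtl' &
done
wait  # all nodes generate load in parallel; results are merged afterwards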
Tuning for Optimal Performance
To maximize the efficiency of our new setup, we’ve implemented several performance tuning measures on our EC2 instances:
Increased Local Port Range:
sudo sysctl -w net.ipv4.ip_local_port_range="1024 65000"
This command increases the range of available local ports, allowing more connections to be handled simultaneously.
Fast Recycling TIME_WAIT Sockets:
sudo sysctl -w net.ipv4.tcp_tw_recycle=1
Enabling fast recycling of TIME_WAIT sockets helps quickly reclaim socket resources. (Note that this sysctl exists only on older kernels; it was removed in Linux 4.12.)
Reuse of Sockets:
sudo sysctl -w net.ipv4.tcp_tw_reuse=1
Allowing the kernel to reuse sockets still in the TIME_WAIT state for new outbound connections makes more efficient use of the limited ephemeral port range.
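These settings can also be persisted so they survive reboots; a minimal sketch, assuming a systemd-based distribution where sysctl reads drop-in files from /etc/sysctl.d (the file name is arbitrary):

# Persist the tuning across reboots; tcp_tw_recycle is omitted here because it only exists on pre-4.12 kernels
sudo tee /etc/sysctl.d/99-load-testing.conf <<'EOF'
net.ipv4.ip_local_port_range = 1024 65000
net.ipv4.tcp_tw_reuse = 1
EOF
sudo sysctl --system  # reload all sysctl configuration files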
Addressing Latency Discrepancies
One critical issue we faced in our Kubernetes setup was the discrepancy between the API latency JMeter reported and what we observed in our APM tools. Port exhaustion and network congestion drove up JMeter’s ‘Connect Time’, which skewed the latency results and gave a misleading picture of API performance.
By moving to EC2, where we have more control over network configurations and resource allocation, we’ve mitigated these discrepancies. Running one JMeter instance per node and fine-tuning the network configuration has reduced the ‘Connect Time’ to under 3 milliseconds. This alignment means our JMeter reports now accurately reflect the data shown in our APM tools, providing a true picture of our API performance.
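A quick way to verify this is to inspect the Connect column in the raw JTL results, assuming CSV output with connect-time saving enabled (jmeter.save.saveservice.connect_time=true). The snippet below is an illustrative sketch rather than part of our pipeline:

# Average connect time (ms), reading the column index from the CSV header so column order doesn't matter
awk -F',' 'NR==1 { for (i=1; i<=NF; i++) if ($i=="Connect") c=i; next }
           c     { sum+=$c; n++ }
           END   { if (n) printf "avg connect time: %.1f ms over %d samples\n", sum/n, n }' result.jtl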
Implementation Details
- Parallel JMeter Tests: Each EC2 instance runs a standalone JMeter test. The test plan is copied to all EC2 instances, the tests are executed individually, and the resulting reports are collected and combined into a final JMeter report (see the consolidation sketch after this list). This ensures even distribution of load and efficient use of resources.
- Enhanced Performance Tuning: We’ve fine-tuned our EC2 instances for optimal TCP performance, reducing latency and increasing throughput.
- Scalability: Because each EC2 instance runs an independent test, capacity scales with the number of instances we launch, letting us comfortably load test at 5x or 10x our peak traffic.
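The consolidation step can be as simple as concatenating the per-node JTL files (keeping a single CSV header) and feeding the merged file to JMeter’s report generator. The hostnames and paths below are placeholders for illustration:

# Pull the raw results back from each load-generator node
mkdir -p results
for host in 10.0.1.11 10.0.1.12 10.0.1.13; do
  scp ec2-user@"$host":~/result.jtl "results/${host}.jtl"
done

# Merge the JTL files, keeping only the first file's CSV header
head -n 1 results/10.0.1.11.jtl > merged.jtl
tail -q -n +2 results/*.jtl >> merged.jtl

# Generate the consolidated HTML dashboard from the merged results
jmeter -g merged.jtl -o final-report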
Conclusion
Our transition from Kubernetes to EC2 has proven to be a game-changer. By addressing the limitations of our previous setup, we’ve achieved a more stable, efficient, and scalable load testing environment. This evolution underscores our dedication to continuously improving our infrastructure to support our mission of providing quality education to all.
As we move forward, we will continue to explore and implement innovative solutions to ensure our systems remain robust and capable of handling the growing demands of our platform.
Stay tuned for more insights as we continue to refine our processes and share our learnings with the community.
#EdTech #LoadTesting #Scalability #AWS #EC2 #PerformanceOptimization #PhysicsWallah
Feel free to reach out if you have any questions or would like to learn more about our load testing journey!