Scale Testing with Locust: How We're Going to Crush Our Own API (Before Users Do)

Modern cloud systems are dangerously good at hiding their limits.

Auto-scaling dashboards stay green. Serverless promises "infinite scale." Latency looks fine in staging.

And then one day, a feature gets shared. A reel goes viral. A push notification lands at the wrong time.

That's usually when teams learn — painfully — where their system actually breaks.

At Hoomanely, we don't want that moment to come from users. So we're going to manufacture it ourselves.

In this post, we'll walk through how we're going to build a realistic scale-testing setup using Locust: what exactly we plan to test, which metrics matter, and how the results will guide infrastructure limits, product decisions, and even LLM cost controls.


Why We're Doing This Before We "Need" It

Our application stack looks modern and robust on paper:

We have mobile clients built in Flutter. Our APIs run on AWS Lambda behind API Gateway. Data is stored in DynamoDB. Conversational intelligence is powered by Bedrock.

Each individual component scales well on its own. The risk lives in how they interact under simultaneous load.

The goal is to have hard data on system limits before user growth accelerates, not after.

We have one core question:

What actually happens when 100 real users talk to our system at the same time?

Not synthetic traffic. Not benchmark requests. Real flows, real delays, real cost.

Instead of guessing, we are going to answer this question with data.


What We Mean by "Realistic" Load Testing

Many load tests fail before they begin because they test the wrong thing.

They:

  • Fire the same endpoint in a tight loop
  • Ignore user think time
  • Skip authentication
  • Measure only requests per second

That is not how users behave.

We are going to simulate complete user sessions, not isolated API calls.

Each simulated user will behave like an actual person using the app.


The User Flow We're Going to Simulate

Every virtual user in our test will follow the same journey:

First, they authenticate. Then they start a new chat session. They pause for a few seconds, just like a human thinking. Finally, they ask a question that triggers a full conversational response.
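
A minimal sketch of that journey as a Locust user class might look like the following. The endpoint paths, payloads, and credentials are illustrative placeholders, not our actual API:

```python
from locust import HttpUser, task, between


class ChatUser(HttpUser):
    # Simulated "think time" between actions, like a human pausing
    wait_time = between(3, 8)

    def on_start(self):
        # Authenticate once per virtual user and reuse the token afterwards
        resp = self.client.post(
            "/auth/login",  # placeholder path
            json={"email": "loadtest@example.com", "password": "not-a-real-password"},
        )
        token = resp.json().get("token", "")
        self.client.headers.update({"Authorization": f"Bearer {token}"})

    @task
    def converse(self):
        # Start a new chat session
        session = self.client.post("/chat/sessions", json={}).json()
        session_id = session.get("id", "unknown")

        # Ask a question that triggers a full conversational response
        self.client.post(
            f"/chat/sessions/{session_id}/messages",
            json={"text": "What should I feed my puppy?"},
            # Group all session IDs under a single stats entry
            name="/chat/sessions/[id]/messages",
        )
```

The wait_time and on_start hooks are what make this a session rather than a request loop: each virtual user authenticates once, keeps its token, and pauses between actions like a person would.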

That single question is deceptively expensive.

Behind the scenes, it can involve:

  • Writing conversation metadata
  • Storing chat history
  • Invoking an LLM
  • Recording analytics events
  • Updating user context

This is the exact moment where latency, throughput, and cost collide.

That is why we are designing the test around this flow.


Why We Chose Locust for This

We are going to use Locust because it lets us think in terms of users, not requests.

Locust allows us to:

  • Create stateful virtual users
  • Maintain authentication across requests
  • Introduce natural waiting periods
  • Increase concurrency gradually
  • Observe percentile-based latency

Most importantly, it helps us answer the question we actually care about:

What does the slowest 5% of users experience as concurrency increases?


How We're Going to Structure the Test

We are not going to slam the system with full load from minute one. Instead, we will ramp up deliberately.

We will begin with a small number of concurrent users and slowly increase the load in controlled steps. At each step, we will hold the traffic steady long enough to let the system stabilize.

→ Our ramp plan: Start at 10 concurrent users, add 10 every 2 minutes, hold at each level for 5 minutes, and stop when either error rates exceed 1% or p95 latency crosses 5 seconds.
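
One way to express that ramp is a custom Locust LoadTestShape. The class below is a rough sketch of the plan, folding the 2-minute ramp and 5-minute hold into one 5-minute step per level for simplicity; the error-rate and latency stop conditions are handled separately (see the watcher sketch further down):

```python
from locust import LoadTestShape


class SteppedRamp(LoadTestShape):
    # Start at 10 users, add 10 per step, hold each level for 5 minutes
    step_users = 10
    hold_seconds = 300
    max_users = 100

    def tick(self):
        run_time = self.get_run_time()
        step = int(run_time // self.hold_seconds)
        if step > self.max_users // self.step_users:
            return None  # final level has been held: stop the test
        users = min((step + 1) * self.step_users, self.max_users)
        # Returning (target user count, spawn rate) tells Locust what to do next
        return (users, self.step_users)
```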

At every level, we will record:

  • Median conversation latency
  • 95th percentile latency
  • Error rates
  • Throttling signals
  • Timeout frequency

→ We'll be watching these metrics in real time through CloudWatch, Datadog, and Locust's built-in dashboard. The combination gives us infrastructure-level signals (Lambda throttles, DynamoDB rejections) alongside user-facing latency.
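
The stop conditions can also be automated on the Locust side. This sketch follows the background-greenlet pattern from Locust's documentation and quits the run once either guardrail from our ramp plan is crossed; the polling interval is an arbitrary choice:

```python
import gevent
from locust import events
from locust.runners import STATE_STOPPED, STATE_STOPPING, WorkerRunner


def threshold_watcher(environment):
    # Poll aggregate stats and abort the run once a guardrail is crossed
    while environment.runner.state not in (STATE_STOPPING, STATE_STOPPED):
        gevent.sleep(5)
        total = environment.runner.stats.total
        p95_ms = total.get_current_response_time_percentile(0.95) or 0
        if total.fail_ratio > 0.01 or p95_ms > 5000:
            environment.runner.quit()
            return


@events.init.add_listener
def on_locust_init(environment, **kwargs):
    # Run the watcher on the master (or a local run), never on workers
    if environment.runner is not None and not isinstance(environment.runner, WorkerRunner):
        gevent.spawn(threshold_watcher, environment)
```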

The goal is not to pass the test. The goal is to find the edge of failure.


The One Metric That Matters Most

We are intentionally ignoring flashy metrics like raw throughput.

The primary metric we care about is conversation latency.

Specifically: The time between a user submitting a question and receiving the full response.

This metric matters because:

  • It directly maps to user experience
  • It captures downstream dependencies
  • It degrades before outright failures occur

In most systems, errors appear late. Latency spikes appear first.

We want to catch the system in that uncomfortable middle ground.
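
Locust's built-in response time covers this metric for a simple request-response call. If responses stream back in chunks, though, we will need to time the full exchange ourselves and report it as a custom metric. A rough sketch, with placeholder endpoint paths:

```python
import time

from locust import HttpUser, task, between


class ConversationUser(HttpUser):
    wait_time = between(3, 8)

    @task
    def ask_question(self):
        # Time the whole exchange, including draining a streamed body,
        # and report it as one custom "conversation latency" metric
        start = time.perf_counter()
        exc = None
        length = 0
        try:
            with self.client.post(
                "/chat/sessions/demo/messages",  # placeholder path
                json={"text": "What should I feed my puppy?"},
                stream=True,
                catch_response=True,
                name="/chat/sessions/[id]/messages (stream)",
            ) as resp:
                for chunk in resp.iter_content(chunk_size=1024):
                    length += len(chunk)  # read until the full response arrives
                resp.success()
        except Exception as e:
            exc = e
        elapsed_ms = (time.perf_counter() - start) * 1000
        self.environment.events.request.fire(
            request_type="CONV",
            name="conversation_latency",
            response_time=elapsed_ms,
            response_length=length,
            exception=exc,
            context={},
        )
```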


What We Expect to Break First

We are not going into this blind.

Based on our architecture, we expect pressure to build in three main areas.


DynamoDB Write Capacity

Each conversational turn generates multiple writes.

These writes don't just go to one table. They touch:

  • Conversation state
  • Message history
  • User context
  • Analytics streams

Under concurrent load, write capacity units can be exhausted quickly.

When that happens, DynamoDB doesn't fail loudly. It throttles.

Retries pile up. Latency climbs. User experience quietly degrades.

This test will help us understand:

  • How quickly we hit write limits
  • Whether adaptive capacity can keep up
  • How retry behavior amplifies delays
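
As a reference point for that last question: much of the retry behavior in our Lambda code comes from the AWS SDK's own configuration. A hedged sketch of how a boto3 DynamoDB client's retries can be bounded; the values are placeholders, not our production settings:

```python
import boto3
from botocore.config import Config

# "adaptive" mode backs off and client-side rate-limits when DynamoDB throttles;
# max_attempts caps how far retries can stretch a single request's latency
retry_config = Config(retries={"max_attempts": 3, "mode": "adaptive"})

dynamodb = boto3.client("dynamodb", config=retry_config)
```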


Lambda Concurrency Limits

Every chat request triggers orchestration logic.

Even if each Lambda execution is fast, concurrency multiplies rapidly when many users arrive together.

We want to observe:

  • When cold starts become visible
  • When concurrency limits are reached
  • How throttling manifests at the API layer

Lambda failures are rarely catastrophic at first. They are slow, subtle, and confusing — exactly what we want to surface.


Bedrock Inference Pressure

LLMs introduce a new kind of scaling problem.

They don't just cost compute. They cost tokens.

Every additional user increases:

  • Inference time
  • Token usage
  • Billing exposure

This test is not just about performance.

It is also about financial sustainability.


Why Cost Is Part of the Load Test

Traditional load testing stops at stability.

We are going further.

We are going to estimate what happens if this traffic were real.

If 100 users are chatting concurrently, and each conversation generates multiple LLM responses, the token count adds up fast.

We will use this data to answer questions like:

  • How expensive is one active user per minute?
  • What happens if engagement doubles?
  • What does "going viral" actually cost?
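
A back-of-envelope model built from the measured token counts might look like the sketch below. Every number in it is an assumed placeholder, not real Bedrock pricing or observed usage; the load test is what will replace these guesses with data:

```python
# Back-of-envelope cost model -- every figure here is an assumption,
# not actual Bedrock pricing or measured traffic
CONCURRENT_USERS = 100
RESPONSES_PER_USER_PER_MIN = 2        # assumed chat pace
INPUT_TOKENS_PER_RESPONSE = 1_500     # prompt + history + context
OUTPUT_TOKENS_PER_RESPONSE = 400

PRICE_PER_1K_INPUT_TOKENS = 0.003     # placeholder $/1K tokens
PRICE_PER_1K_OUTPUT_TOKENS = 0.015    # placeholder $/1K tokens

cost_per_response = (
    INPUT_TOKENS_PER_RESPONSE / 1000 * PRICE_PER_1K_INPUT_TOKENS
    + OUTPUT_TOKENS_PER_RESPONSE / 1000 * PRICE_PER_1K_OUTPUT_TOKENS
)
cost_per_user_per_min = cost_per_response * RESPONSES_PER_USER_PER_MIN
cost_per_min = cost_per_user_per_min * CONCURRENT_USERS

print(f"Cost per active user per minute: ${cost_per_user_per_min:.4f}")
print(f"Cost per minute at {CONCURRENT_USERS} concurrent users: ${cost_per_min:.2f}")
print(f"Cost per hour at this load: ${cost_per_min * 60:.2f}")
```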

This information directly informs:

  • Rate limiting policies
  • Token budgets
  • Response verbosity
  • Feature gating

In an LLM-powered system, cost is a first-class metric.


What We'll Do With the Results

This test is not an academic exercise.

Based on what we learn, we expect to make concrete changes.

On the infrastructure side, this may include:

  • Adjusting DynamoDB capacity strategies
  • Introducing write batching or queues
  • Reserving Lambda concurrency
  • Splitting synchronous and asynchronous paths

On the product side, this may include:

  • Limiting rapid consecutive messages
  • Caching repeated questions
  • Shortening default responses
  • Introducing daily or per-session limits

Load testing informs product design just as much as architecture.


Why We're Doing This Before Growth

Most teams run load tests when something already hurts.

We are doing this because we want:

  • Predictability
  • Control
  • Confidence

We want to know where the system bends before it breaks.

We want to design safeguards intentionally, not reactively.

And most importantly, we want users to experience reliability — even when demand spikes unexpectedly.


Key Takeaways

  • We are going to load-test real user flows, not isolated API calls.
  • Conversation latency is the metric that best reflects user experience.
  • Serverless systems have limits — they just show up under concurrency.
  • Latency spikes before errors, making it an early warning signal.
  • LLM features must be tested for cost impact, not just stability.
  • Breaking the system intentionally is safer than learning in production.
