API rate limiting is the practice of restricting how many requests a client can send to an API in a given time window. It is usually enforced per endpoint or per API key to prevent abuse and protect system resources. A common example is allowing 1000 requests per hour and rejecting further requests with an error once that limit is reached. Rate limiting rules are often applied by an [API gateway] or by logic inside an application server sitting behind a load balancer. The system tracks usage counts in fast storage such as a cache or key-value store.
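As a rough sketch of that tracking, a fixed-window limiter can be little more than a counter keyed by client and window. The example below uses a plain Python dict to stand in for the cache or key-value store; the limit, window length, and key scheme are illustrative assumptions, not any particular provider's values:

```python
import time

WINDOW_SECONDS = 3600   # one-hour window (illustrative)
LIMIT = 1000            # max requests per key per window (illustrative)

# In-memory stand-in for a shared cache or key-value store.
_counters: dict[str, int] = {}

def allow_request(api_key: str) -> bool:
    """Return True if this request fits within the current window's limit."""
    window = int(time.time()) // WINDOW_SECONDS   # id of the current window
    key = f"{api_key}:{window}"                   # e.g. "abc123:480213"
    count = _counters.get(key, 0)
    if count >= LIMIT:
        return False                              # limit reached; reject
    _counters[key] = count + 1
    return True
```

In a real deployment the counter would live in shared storage so that every server behind the load balancer sees the same counts.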
Why it matters
Without API rate limiting, a buggy script or a hostile actor could overwhelm your web application or backend services. Limits help preserve capacity for well-behaved clients and make costs more predictable, especially when you pay for the underlying cloud computing resources. Rate limiting also creates a clear contract between API providers and consumers, which reduces surprises when integrating systems.
How it works
Typical algorithms include fixed window, sliding window, and token bucket, each with different trade-offs in fairness and implementation complexity. When a call exceeds the allowed threshold, the API usually returns a 429 Too Many Requests status code, often with a Retry-After header that tells the client when to try again. You can learn how providers present these rules in the lesson Rate Limiting.
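To make the token bucket concrete, here is a minimal sketch: each client gets a bucket that refills at a steady rate and allows short bursts up to its capacity, and a denied request reports how long until a token accrues, which is the kind of guidance a 429 response can carry. The rate and capacity values are hypothetical, and a production limiter would keep bucket state in shared storage:

```python
import time
from dataclasses import dataclass

@dataclass
class TokenBucket:
    rate: float          # tokens added per second
    capacity: float      # maximum burst size
    tokens: float = 0.0
    last_refill: float = 0.0

    def allow(self) -> tuple[bool, float]:
        """Return (allowed, retry_after_seconds) for one request."""
        now = time.monotonic()
        if self.last_refill == 0.0:
            self.tokens = self.capacity   # start with a full bucket
        else:
            # Refill in proportion to elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True, 0.0
        # Not enough tokens: report the wait until one accrues,
        # suitable for a 429 response's Retry-After header.
        return False, (1.0 - self.tokens) / self.rate

bucket = TokenBucket(rate=10.0, capacity=20.0)   # ~10 req/s, bursts of 20
allowed, retry_after = bucket.allow()
if not allowed:
    print(f"429 Too Many Requests, Retry-After: {retry_after:.2f}s")
```

Compared with a fixed window, the token bucket smooths traffic continuously instead of resetting at window boundaries, which is why it tends to be fairer to clients that send requests at a steady pace.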