They needed someone to fix the speed problems, stop the crashes, and make the system strong enough for future growth, all without spending too much money.
My Role
I joined as a backend developer. I studied their existing setup, found the weak points, and made big improvements step by step. Here is exactly what I did and why.
Tech Stack They Were Using
- Backend: Python (FastAPI/Django) + some Node.js microservices
- Frontend: React (web) + Flutter (mobile apps)
- Database: PostgreSQL
- Other: Redis (which I introduced later), Celery/RQ, Docker, AWS/GCP cloud
What I Did – Step by Step (with simple explanations)
- Made the Application Stateless (Very Important) Old problem: each server stored user sessions in its own memory. If that server died, the user got logged out, and it was hard to add more servers. Solution:
- Removed sessions from server memory
- Used Redis to store sessions (very fast)
- Switched to JWT tokens for mobile app login (no server-side session needed). Result: we can now start or stop any number of servers at any time. When traffic comes, more servers are added automatically with no problem.
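The idea behind stateless tokens can be sketched with the standard library alone (the real project would use a JWT library like PyJWT; the secret and claim names here are illustrative, not from the actual codebase). The server signs the user's identity and an expiry into the token itself, so any server can verify a request without a shared session store:

```python
import base64
import binascii
import hashlib
import hmac
import json
import time

SECRET = b"replace-with-a-real-secret"  # hypothetical; load from env/secrets in production

def issue_token(user_id: str, ttl_seconds: int = 3600) -> str:
    """Create a signed, self-contained token (no server-side session)."""
    payload = json.dumps({"sub": user_id, "exp": time.time() + ttl_seconds}).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode()
            + "." + base64.urlsafe_b64encode(sig).decode())

def verify_token(token: str):
    """Return the user id if signature and expiry check out, else None."""
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except (ValueError, binascii.Error):
        return None
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered token
    claims = json.loads(payload)
    if claims["exp"] < time.time():
        return None  # expired token
    return claims["sub"]
```

Because verification needs only the shared secret, every server (old or newly autoscaled) can authenticate the same token, which is what makes the fleet stateless.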
- Put All Static Files on a CDN Old way: images, CSS, JS, and videos were served from the same server, which was a very heavy load. Solution:
- Moved everything static (photos, videos, app icons, JS bundles) to a CDN (Cloudflare or AWS CloudFront)
- The CDN has edge locations across India and worldwide, so files load in 50–100 ms instead of 1–2 seconds. Result: main server load dropped by 40–50%, and pages open much faster.
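For a CDN to help, the origin has to send cache-friendly headers. A minimal NGINX sketch (the `/static/` path is illustrative; this assumes assets are fingerprinted, e.g. `app.3f2a1c.js`, so they can be cached forever):

```nginx
# Long-lived, immutable cache headers on fingerprinted assets so the CDN
# (CloudFront/Cloudflare) can serve them from edge locations without
# re-fetching from the origin.
location /static/ {
    add_header Cache-Control "public, max-age=31536000, immutable";
    expires 1y;
}
```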
- Added Smart Caching with Redis Many things don't change every second (user profiles, product lists, settings). Solution:
- Stored this data in Redis (a super-fast in-memory store)
- Set a time limit (TTL) on each entry, for example 30 minutes for user profiles and product lists, while prices stay live (never cached)
- When a user updates their profile, we clear the old cache entry. Result: 80–90% of read requests are now answered from cache, not the database, so database load dropped a lot.
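This is the classic cache-aside pattern. In production it runs against Redis (`SETEX`/`GET`/`DEL`); the sketch below uses a tiny in-process stand-in so the flow is self-contained, and the key format and 30-minute TTL are illustrative:

```python
import time

class TTLCache:
    """Minimal in-process stand-in for Redis SETEX/GET/DEL (illustration only)."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:
            del self._store[key]  # lazy expiry, like Redis TTL
            return None
        return value

    def setex(self, key, ttl, value):
        self._store[key] = (value, time.time() + ttl)

    def delete(self, key):
        self._store.pop(key, None)

cache = TTLCache()

def get_profile(user_id, load_from_db):
    """Cache-aside read: try cache first, fall back to the DB, then cache for 30 min."""
    key = f"profile:{user_id}"
    profile = cache.get(key)
    if profile is None:
        profile = load_from_db(user_id)      # slow path: hits PostgreSQL
        cache.setex(key, 30 * 60, profile)   # fast path for the next 30 minutes
    return profile

def update_profile(user_id, save_to_db, new_profile):
    """Write path: update the DB, then invalidate the stale cache entry."""
    save_to_db(user_id, new_profile)
    cache.delete(f"profile:{user_id}")
```

Invalidating on write (rather than updating the cache in place) keeps the logic simple: the next read repopulates the cache from the source of truth.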
- Fixed Database Problems (PostgreSQL) Old issues:
- Too many connections from the app → the database hung
- All reads and writes going to the same DB → slow
- Some queries were very slow because proper indexes were missing
What I did:
- Added PgBouncer to manage connections: only a few real connections reach the DB, and the rest wait in the pool.
- Added one read replica: all read queries (user data, feeds, search) go to the replica, while writes still go to the main DB.
- Found slow queries in the logs → added the missing indexes → removed SELECT * → fixed N+1 query patterns. Result: the database became 3–5x faster, with no more connection errors.
- The database now also runs in a separate VPC behind load balancers
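The read/write split can live in a small routing helper in the app. This is a hypothetical sketch (the DSNs, PgBouncer hosts, and port 6432 are made up for illustration), and the check is deliberately naive: a `WITH` query containing a data-modifying CTE would be misrouted, so a real codebase should tag read-only queries explicitly rather than sniff SQL text:

```python
# Hypothetical connection strings; real values come from config/secrets.
# Both go through PgBouncer (conventionally port 6432) for pooling.
PRIMARY_DSN = "postgresql://app@pgbouncer-primary:6432/appdb"
REPLICA_DSN = "postgresql://app@pgbouncer-replica:6432/appdb"

def pick_dsn(sql: str) -> str:
    """Route read-only statements to the replica, everything else to the primary."""
    stripped = sql.lstrip()
    first_word = stripped.split(None, 1)[0].upper() if stripped else ""
    # SELECT (and read-only WITH ... SELECT) can tolerate slight replica lag;
    # INSERT/UPDATE/DELETE must go to the primary.
    return REPLICA_DSN if first_word in ("SELECT", "WITH") else PRIMARY_DSN
```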
- Moved Heavy Work to the Background Old problem: when a user uploaded a photo, the server processed it (resize, thumbnail) while the user waited 10–20 seconds. The same happened for sending emails, notifications, and reports. Solution:
- Used Celery (with Redis or RabbitMQ as queue)
- When a user uploads → the API says "success" in under 1 second → processing happens in a background worker. Result: the API is always fast and users are happy. Even with 1,000 uploads at the same time, there is no slowdown.
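The shape of this pattern can be shown with a plain in-process queue and worker thread (in production the queue is Redis/RabbitMQ and the worker is a Celery process on another machine; the function names here are illustrative):

```python
import queue
import threading

jobs = queue.Queue()   # stand-in for the Redis/RabbitMQ broker
results = []           # stand-in for thumbnails written to storage

def enqueue_thumbnail(photo_id: str) -> dict:
    """API handler: enqueue the heavy work and return immediately."""
    jobs.put(photo_id)
    return {"status": "success", "photo_id": photo_id}  # responds instantly, not after 10-20 s

def worker():
    """Background worker: does the slow processing (a Celery worker in production)."""
    while True:
        photo_id = jobs.get()
        # ...resize image, generate thumbnail, upload to storage (omitted)...
        results.append(photo_id)
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()
```

The key property is that the API's response time no longer depends on how long the processing takes; the queue absorbs bursts and workers drain it at their own pace.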
- Added Rate Limiting & Protection Some bad users or bots were sending 1,000 requests per second, killing the server. Solution:
- Added rate limits in the API gateway / NGINX
- Example: max 100 requests per minute per IP / per user
- Also added basic WAF (Web Application Firewall) rules. Result: attack traffic was stopped, and real users were never affected.
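The 100-requests-per-minute-per-IP rule can be sketched as a sliding-window limiter (in practice NGINX's `limit_req` or the gateway does this; this in-process version is only to show the logic, and the `now` parameter exists so the behavior is easy to test deterministically):

```python
import time
from collections import defaultdict

WINDOW_SECONDS = 60
MAX_REQUESTS = 100  # per client per window, matching the rule above

_hits = defaultdict(list)  # client_ip -> list of request timestamps

def allow_request(client_ip: str, now: float = None) -> bool:
    """Sliding-window limiter: allow at most MAX_REQUESTS per WINDOW_SECONDS per IP."""
    now = time.time() if now is None else now
    # Drop timestamps that have fallen out of the window.
    window = [t for t in _hits[client_ip] if now - t < WINDOW_SECONDS]
    _hits[client_ip] = window
    if len(window) >= MAX_REQUESTS:
        return False  # over the limit: reject (HTTP 429 in the API layer)
    window.append(now)
    return True
```

A request that is rejected here never reaches the application or the database, which is why the limiter protects the whole stack, not just the web tier.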
- Added Monitoring So We Can See Problems Fast Solution:
- Used Prometheus + Grafana dashboard
- Watching: CPU, memory, requests per second, error rate, database connections, cache hit %, slow queries
- Set alerts to WhatsApp/Slack/email when p95 latency > 500 ms or errors > 1% (we plan to remove the p95 alert soon; it costs too much and is complicated). Result: we know about a problem before users complain.
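The two alert thresholds above map naturally to Prometheus alerting rules. This is a hedged sketch: the metric names (`http_request_duration_seconds_bucket`, `http_requests_total`) follow common client-library conventions but are not taken from the actual project:

```yaml
groups:
  - name: api-alerts
    rules:
      - alert: HighP95Latency
        # p95 over the last 5 minutes, computed from a latency histogram
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
        for: 5m
        annotations:
          summary: "p95 latency above 500 ms for 5 minutes"
      - alert: HighErrorRate
        # share of 5xx responses over all responses
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        annotations:
          summary: "error rate above 1% for 5 minutes"
```

The `for: 5m` clause is what keeps a single slow request from paging anyone: the condition must hold continuously before the alert fires.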
- Enabled Autoscaling Solution:
- Put all services in Docker containers
- Used cloud autoscaling (AWS ECS / GCP Cloud Run / Kubernetes)
- Rule: if CPU > 70% or latency is high → add more servers automatically. Result: during a 10x spike (like a Christmas sale), the system handles it automatically without manual work. (I'm still adding more checks to this rule to save more cost.)
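On Kubernetes, the CPU > 70% rule is exactly what a HorizontalPodAutoscaler expresses (the deployment name and replica bounds below are illustrative; ECS and Cloud Run have equivalent settings):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api          # the containerized API service
  minReplicas: 2       # never below 2, so one pod dying is not an outage
  maxReplicas: 20      # cost ceiling during a 10x spike
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add pods when average CPU exceeds 70%
```

Scaling on latency as well (the second half of the rule) needs a custom or external metric, which is one of the extra checks still being worked on.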
What I Delivered in First 15–20 Days (Quick Wins)
- Static files → CDN
- Redis cache for hot data
- PgBouncer + 1 read replica
- Background jobs for email & image processing
- Basic rate limiting
- Simple monitoring dashboard + alerts
After these quick changes → speed improved 4–5x, and there were no crashes during spikes.
Final Result
- Handled 25,000 daily visits plus sudden 10x spikes with ease
- Page load time reduced from 4–6 seconds to 1–2 seconds
- Zero downtime during big traffic days
- Database stable, with no more "too many connections" errors
- The client was very happy: they could run bigger marketing campaigns without fear
- The cost increase was small because with autoscaling we pay for extra servers only when traffic is high
This project taught me that real scaling is not about big tools. We are still migrating the VPC and Redis setup and working on techniques to reduce server cost at high load.