Lazada API Exception Monitoring & Alerts: How to Build a Healthy Operations System

1. Introduction

In the e-commerce world, the Lazada API plays a vital role in product listings, orders, inventory synchronization, and more. As business traffic increases, any failure or latency in API calls can directly affect user experience and revenue. This article walks you through how to build a robust and scalable operations and monitoring system using Prometheus + Alertmanager, ELK/EFK logging stacks, auto-recovery strategies, and distributed observability tools.

2. Overview of Monitoring & Alerting

2.1 What is Monitoring and Alerting?

  • Monitoring: Continuously collecting system metrics, logs, and traces to understand service health.

  • Alerting: Automatically notifying engineers when key indicators hit abnormal thresholds or failures occur.

2.2 Three Pillars of Operational Health

  1. Performance: Response time, throughput.

  2. Availability: API success rate, system uptime.

  3. Stability: Retry success rates, system load trends.

2.3 Design Principles

  • Observability: Internal system states should be externally measurable.

  • Scalability: Monitoring components must scale with traffic and system growth.

  • Reliability: The monitoring system must be highly available with no single point of failure.

3. Core Monitoring Metric Design

| Category | Key Metrics | Purpose |
| --- | --- | --- |
| Availability | HTTP status codes (2xx/4xx/5xx) | Quickly identify error types |
| Performance | Latency P50 / P95 / P99 | Multi-dimensional response analysis |
| Throughput | QPS (queries per second) | Determine system load |
| Resource Usage | CPU, memory, disk I/O, network | Pinpoint bottlenecks |
| Business Metrics | Order success rate, inventory sync rate, retries | SLA-related business health monitoring |
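
The availability, performance, and throughput rows map directly onto the request counter and latency histogram defined later in section 4.2. For the business-metrics row, the sketch below shows one way to record order-sync outcomes with prometheus_client; the metric names and the record_order_sync helper are illustrative, not part of any Lazada SDK.

from prometheus_client import Counter, Gauge

# Illustrative business metrics; names and labels are assumptions, adapt them to your own conventions.
ORDER_SYNC_TOTAL = Counter(
    'lazada_order_sync_total',
    'Order synchronization attempts by outcome',
    ['outcome']  # 'success' or 'failure'
)
INVENTORY_SYNC_LAG = Gauge(
    'lazada_inventory_sync_lag_seconds',
    'Seconds since the last successful inventory sync'
)

def record_order_sync(success: bool):
    ORDER_SYNC_TOTAL.labels(outcome='success' if success else 'failure').inc()

Order success rate then becomes a PromQL expression such as sum(rate(lazada_order_sync_total{outcome="success"}[5m])) / sum(rate(lazada_order_sync_total[5m])).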

4. Prometheus + Alertmanager in Practice

4.1 System Architecture

  • Prometheus Server: Central collector and processor of metrics.

  • Exporters: Metric providers like node_exporter, cadvisor, or custom clients.

  • Pushgateway: For short-lived job metrics (see the sketch after this list).

  • Alertmanager: Groups, routes, and sends alert notifications.
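
Long-running services expose a /metrics endpoint for Prometheus to pull, but short-lived jobs (for example, a one-off inventory reconciliation) may finish before the next scrape. The sketch below pushes such a job's metrics through Pushgateway with prometheus_client; the gateway address, job name, and metric name are assumptions.

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()

# Record when this batch job last finished successfully.
last_success = Gauge('lazada_batch_last_success_timestamp_seconds',
                     'Unix timestamp of the last successful batch run',
                     registry=registry)
last_success.set_to_current_time()

# Push once at the end of the job; Prometheus then scrapes the Pushgateway instead of the job itself.
push_to_gateway('pushgateway:9091', job='lazada_inventory_reconcile', registry=registry)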

4.2 Metric Collection & Exposure

(1) Installing node_exporter

wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz

tar xzf node_exporter-1.5.0.linux-amd64.tar.gz

cd node_exporter-1.5.0.linux-amd64

./node_exporter --web.listen-address=":9100" &
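
Before adding the target to Prometheus' scrape configuration, a quick check confirms the exporter is serving metrics, assuming it listens on localhost:9100 as started above:

import requests

# node_exporter exposes plain-text metrics at /metrics.
resp = requests.get("http://localhost:9100/metrics", timeout=5)
print(resp.status_code, "-", len(resp.text.splitlines()), "metric lines")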

(2) Custom API Metric Example (Python)

from prometheus_client import Counter, Histogram, start_http_server
import time, requests

# Count every Lazada API call, labeled by endpoint and HTTP status code.
REQUEST_COUNT = Counter(
    'lazada_api_requests_total',
    'Total number of Lazada API requests',
    ['endpoint', 'http_status']
)

# Track per-endpoint request latency for P50/P95/P99 analysis.
REQUEST_LATENCY = Histogram(
    'lazada_api_request_latency_seconds',
    'Latency of Lazada API requests',
    ['endpoint']
)

# Expose the metrics at http://localhost:8000/metrics for Prometheus to scrape.
start_http_server(8000)

def call_api(endpoint, params=None):
    start = time.time()
    resp = requests.get(endpoint, params=params, timeout=10)
    latency = time.time() - start
    REQUEST_LATENCY.labels(endpoint=endpoint).observe(latency)
    REQUEST_COUNT.labels(endpoint=endpoint, http_status=resp.status_code).inc()
    return resp.json()
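
A usage sketch for the call_api helper follows. The URL is a stand-in, since real Lazada Open Platform calls require an app key and request signing, but any JSON-returning HTTP endpoint exercises the instrumentation the same way. After a few iterations, http://localhost:8000/metrics shows the counter and histogram series.

if __name__ == '__main__':
    # Placeholder endpoint; substitute a properly signed Lazada API request in production.
    while True:
        try:
            call_api("https://httpbin.org/get")
        except requests.RequestException as exc:
            print("request failed:", exc)
        time.sleep(30)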

4.3 Alertmanager Configuration

alertmanager.yml:

global:
  resolve_timeout: 5m
  # slack_api_url: '<Slack incoming-webhook URL>'  (required before Slack notifications will send)

route:
  receiver: 'team-slack'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h

receivers:
  - name: 'team-slack'
    slack_configs:
      - channel: '#ops-alerts'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal:
      - alertname

Alert rules lazada-alerts.yml:

groups:
  - name: lazada-api
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (instance) (rate(lazada_api_requests_total{http_status=~"5.."}[5m]))
            / sum by (instance) (rate(lazada_api_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} High 5xx error rate"
          description: "More than 5% of requests returned 5xx errors in the last 5 minutes"
      - alert: LatencySpike
        expr: histogram_quantile(0.95, sum by (instance, le) (rate(lazada_api_request_latency_seconds_bucket[5m]))) > 1
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} High P95 latency"
          description: "P95 latency exceeds 1s over the last 5 minutes"

5. Log Aggregation: ELK vs EFK

5.1 Architecture Comparison

| Feature | ELK (Logstash) | EFK (Fluentd) |
| --- | --- | --- |
| Performance | Higher resource usage | Lightweight, high concurrency |
| Plugins | Rich, slower boot time | Flexible, dynamic loading |
| Complexity | Complex Grok parsing | Easier DSL configuration |

5.2 Log Format & Fluentd Setup

Structured log sample:

{
  "timestamp": "2025-04-18T10:20:30Z",
  "level": "ERROR",
  "service": "lazada-client",
  "endpoint": "/orders/create",
  "status": 500,
  "message": "Internal Server Error"
}
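
Logs in this shape can be produced with the standard library alone. A minimal sketch using a custom JSON formatter is shown below; the field names match the sample above, while the logger name is an arbitrary choice.

import json, logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record):
        # Emit one JSON object per line, matching the structured log sample.
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "lazada-client",
            "endpoint": getattr(record, "endpoint", ""),
            "status": getattr(record, "status", 0),
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("lazada-client")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Internal Server Error", extra={"endpoint": "/orders/create", "status": 500})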

Fluentd config (fluent.conf):

<source>
  @type tail
  path /var/log/lazada-client/*.log
  pos_file /var/log/fluentd-lazada-client.pos
  tag lazada.client
  <parse>
    @type json
  </parse>
</source>

<match lazada.client>
  @type elasticsearch
  host es-host
  port 9200
  logstash_format true
  # With logstash_format enabled, the index name is derived from logstash_prefix (lazada-client-YYYY.MM.DD).
  logstash_prefix lazada-client
</match>

5.3 Kibana Visualizations

  • Create index pattern: lazada-client-*

  • Common queries:

    • service: "lazada-client" AND status:500

    • endpoint.keyword: "/orders/create" AND response_time:[1 TO *]
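
The same filters can be run outside Kibana by querying Elasticsearch directly. A sketch using the search API is shown below, assuming the cluster is reachable at es-host:9200 and the daily indices follow the lazada-client-* pattern:

import requests

# Find recent 5xx errors logged by the lazada-client service (mirrors the first Kibana query above).
query = {
    "size": 20,
    "query": {
        "bool": {
            "must": [
                {"match": {"service": "lazada-client"}},
                {"term": {"status": 500}},
            ]
        }
    },
}
resp = requests.post("http://es-host:9200/lazada-client-*/_search", json=query, timeout=10)
for hit in resp.json()["hits"]["hits"]:
    src = hit["_source"]
    print(src.get("timestamp"), src.get("endpoint"), src.get("message"))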

6. Auto-Recovery Strategies

6.1 Common Issues and Fixes

| Scenario | Recovery Action |
| --- | --- |
| Network jitter | Exponential backoff retry |
| API throttling | Retry with delay or rotate backup API accounts |
| Node crash | Auto-restart or scale with Kubernetes |
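
For the network-jitter row, a minimal exponential-backoff wrapper around the call_api helper from section 4.2 could look like the sketch below; the attempt count, base delay, and jitter range are arbitrary starting points.

import random, time
import requests

def call_with_backoff(endpoint, params=None, max_attempts=5, base_delay=0.5):
    # Retry transient network failures with exponentially growing, jittered delays.
    for attempt in range(max_attempts):
        try:
            return call_api(endpoint, params=params)
        except (requests.ConnectionError, requests.Timeout):
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.2))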

6.2 Kubernetes Health Check

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
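
For these probes to pass, the service has to answer on those paths. The standard-library sketch below shows the two endpoints; a real service would usually hang them off its existing web framework, and the READY flag stands in for genuine dependency checks (database, Lazada API connectivity, and so on).

from http.server import BaseHTTPRequestHandler, HTTPServer

READY = True  # flip to False while the service is still warming up

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            self._reply(200, b"ok")  # liveness: the process is up and serving
        elif self.path == "/readyz":
            self._reply(200 if READY else 503, b"ready" if READY else b"not ready")
        else:
            self._reply(404, b"not found")

    def _reply(self, code, body):
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()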

6.3 Auto-Restart Script Example

# auto_recover.py
import json, subprocess, time

def check_and_restart():
    # List pods for the lazada-client app and delete any stuck in a failed state;
    # the owning Deployment/ReplicaSet recreates them automatically.
    result = subprocess.run(["kubectl", "get", "pods", "-l", "app=lazada-client", "-o", "json"],
                            capture_output=True, text=True, check=True)
    for pod in json.loads(result.stdout)["items"]:
        if pod["status"].get("phase") in ("Failed", "Unknown"):
            subprocess.run(["kubectl", "delete", "pod", pod["metadata"]["name"]], check=False)

if __name__ == '__main__':
    while True:
        check_and_restart()
        time.sleep(60)

7. End-to-End Observability

7.1 Distributed Tracing

  • Use OpenTelemetry SDK for tracing.

  • Visualize spans with Jaeger or Tempo.
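
A minimal tracing sketch with the OpenTelemetry Python SDK is shown below. It exports spans to the console for illustration only; a real setup would swap in an OTLP or Jaeger exporter pointed at your collector, and the span and attribute names here are assumptions.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
import requests

# Console exporter for demonstration; replace with an OTLP/Jaeger exporter in production.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("lazada-client")

def traced_call(endpoint):
    # Each API call becomes a span carrying the URL and response code.
    with tracer.start_as_current_span("lazada.api.call") as span:
        span.set_attribute("http.url", endpoint)
        resp = requests.get(endpoint, timeout=10)
        span.set_attribute("http.status_code", resp.status_code)
        return resp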

7.2 Canary Releases

  • Use Prometheus metrics to gate deployments.

  • Automatically roll back on elevated latency or error rates.
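
One way to build such a gate is to query Prometheus' HTTP API from the deployment pipeline and compare the canary's error rate against a threshold. The sketch below assumes a Prometheus server at prometheus:9090, a canary="true" label on canary traffic, and a 2% threshold, all of which are choices specific to your setup.

import requests

PROM_URL = "http://prometheus:9090/api/v1/query"

def canary_error_rate():
    # Ratio of 5xx responses on canary pods over the last 5 minutes.
    expr = ('sum(rate(lazada_api_requests_total{http_status=~"5..",canary="true"}[5m]))'
            ' / sum(rate(lazada_api_requests_total{canary="true"}[5m]))')
    result = requests.get(PROM_URL, params={"query": expr}, timeout=10).json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    # Exit decision for the pipeline: promote when the canary stays under the error budget.
    print("promote" if canary_error_rate() <= 0.02 else "roll back")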

7.3 Scheduled Reporting

  • Use Grafana alert reports to email or Slack team health summaries daily.

8. Performance Optimization

  1. Prefer horizontal scaling over vertical.

  2. Add Redis caching for high-frequency endpoints (a sketch follows this list).

  3. Introduce Resilience4j for rate limiting, fallback, and circuit breaking.
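
For point 2, a cache-aside sketch with redis-py around the call_api helper from section 4.2 is shown below; the key scheme and the 60-second TTL are arbitrary choices.

import json
import redis  # redis-py client

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cached_call_api(endpoint, params=None, ttl=60):
    # Cache-aside: serve from Redis when possible, otherwise call the API and store the result.
    key = "lazada:" + endpoint + ":" + json.dumps(params or {}, sort_keys=True)
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    data = call_api(endpoint, params=params)
    cache.setex(key, ttl, json.dumps(data))
    return data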

9. Real-World Case Study

A large e-commerce platform adopted this system and reduced 5xx errors from 8% to 1.2%. P95 latency dropped from 1.3s to 0.6s, and alert response times improved by 40%.

10. Summary & Best Practices