Building an API Monitoring Dashboard with Grafana and Elasticsearch

A practical guide to building production API monitoring — from Elasticsearch data source setup and Lucene queries to Grafana dashboard panels, alerting rules, and operational best practices.

zhuermu · 18 min
Grafana · Elasticsearch · Monitoring · Observability · Alerting · DevOps

Chinese Version: This article was originally published on CSDN. Read the Chinese original →

Your API is running in production. Users are hitting endpoints, services are calling each other, and everything looks fine — until it isn’t. A payment endpoint starts returning 500s. Latency on the search route creeps from 200ms to 3 seconds. A downstream dependency goes down and your error rate spikes from 0.1% to 12%. Without monitoring, you find out about these problems from angry users. With the right dashboard, you find out in minutes and often before users even notice.

This article walks through building a complete API monitoring dashboard using Grafana 11.x with Elasticsearch as the data source. We’ll cover the architecture, data source configuration, six essential dashboard panels with real query syntax, alerting with Grafana’s Unified Alerting system, and production tips that make the difference between a dashboard people glance at and one that actually prevents outages.


1. Why API Monitoring Matters

Before diving into the tooling, it’s worth grounding the discussion in the framework that modern operations teams use to think about reliability: SLOs, SLIs, and error budgets.

Service Level Indicators (SLIs) are the metrics you actually measure — request latency, error rate, throughput. Service Level Objectives (SLOs) are the targets you set for those metrics — “99.5% of requests complete successfully” or “P95 latency stays below 300ms.” The error budget is the gap between perfection and your SLO: if your SLO is 99.5% success rate, you have a 0.5% error budget per measurement window.
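To make the error-budget arithmetic concrete, a quick back-of-the-envelope calculation (the 30-day window length is illustrative) shows what a 99.5% SLO allows:

```python
# Error budget implied by a 99.5% success-rate SLO over a 30-day window.
slo = 0.995
window_minutes = 30 * 24 * 60  # 43,200 minutes in 30 days

error_budget_fraction = 1 - slo                        # 0.5% of the window
error_budget_minutes = error_budget_fraction * window_minutes

print(round(error_budget_minutes))
```

That works out to 216 minutes (3.6 hours) of total downtime per 30-day window before the SLO is breached.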

The dashboard we’re building in this article directly measures three of the four golden signals defined in the Google SRE book:

  1. Latency — how long requests take (P50, P95, P99)
  2. Traffic — how many requests per minute are flowing through the system
  3. Errors — what percentage of requests are failing

The fourth signal, saturation, typically comes from infrastructure metrics (CPU, memory, disk) rather than access logs. Together, these four signals give you a comprehensive picture of API health.
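Since all three log-derived signals come from the same records, a small sketch (sample records are invented; field names match the nginx JSON log format used later) shows how each is computed:

```python
# Compute traffic, error rate, and P95 latency from parsed access-log records.
# The records below are fabricated sample data for illustration.
records = [
    {"status": 200, "response_time": 0.120},
    {"status": 200, "response_time": 0.180},
    {"status": 200, "response_time": 0.210},
    {"status": 500, "response_time": 1.950},
]

traffic = len(records)                                  # requests in the window
errors = sum(1 for r in records if r["status"] >= 500)
error_rate = errors / traffic * 100                     # percent

latencies = sorted(r["response_time"] for r in records)
# Nearest-rank P95: the value at the 95th-percentile position.
p95 = latencies[max(0, int(round(0.95 * traffic)) - 1)]

print(traffic, f"{error_rate:.1f}%", p95)  # 4 25.0% 1.95
```

Elasticsearch will do these aggregations for you at scale; the point here is only that every panel in this article reduces to this kind of arithmetic over log records.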

The practical value is immediate: when an alert fires at 3 AM because the error rate on /api/v1/payments exceeded 2%, the on-call engineer opens the dashboard, sees that the spike started exactly when a new deployment rolled out, correlates with the latency increase on the same endpoint, and has enough context to either roll back or investigate the specific failure mode — all within minutes.


2. Architecture

The monitoring pipeline is straightforward: API traffic generates access logs, those logs flow into Elasticsearch, and Grafana queries Elasticsearch to power dashboards and alerts.

[Figure: Monitoring Architecture]

Here is what each component does:

API Gateway (nginx, AWS ALB, Kong, etc.) serves as the entry point for all API traffic. The gateway writes structured access logs for every request, including the HTTP method, path, status code, response time, client IP, and request size. The key decision here is log format — use JSON if possible, as it eliminates the need for complex parsing rules downstream.

A typical nginx JSON log configuration looks like this:

log_format json_combined escape=json
  '{'
    '"timestamp":"$time_iso8601",'
    '"method":"$request_method",'
    '"path":"$uri",'
    '"status":$status,'
    '"response_time":$request_time,'
    '"bytes_sent":$bytes_sent,'
    '"remote_addr":"$remote_addr",'
    '"user_agent":"$http_user_agent",'
    '"upstream_response_time":"$upstream_response_time",'
    '"request_id":"$request_id"'
  '}';

access_log /var/log/nginx/api_access.log json_combined;
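With escape=json set, every request becomes one valid JSON object per line. Parsing a fabricated sample line (all values are illustrative) shows the document shape Filebeat will ship downstream:

```python
import json

# A fabricated line in the json_combined format above (values illustrative).
line = (
    '{"timestamp":"2023-01-08T14:32:10+00:00","method":"GET",'
    '"path":"/api/v1/search","status":200,"response_time":0.214,'
    '"bytes_sent":1532,"remote_addr":"203.0.113.7","user_agent":"curl/8.0.1",'
    '"upstream_response_time":"0.210","request_id":"f3a1c2"}'
)

entry = json.loads(line)
print(entry["status"], entry["response_time"])  # 200 0.214
```

Note that status and response_time are emitted unquoted in the log format, so they arrive as numbers rather than strings, which matters for the mapping in Section 3.2.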

Log Shipper (Filebeat or Fluentd) reads the log files, parses them, and sends them to Elasticsearch. Filebeat is the lighter-weight option and integrates natively with the Elastic Stack. Fluentd is more flexible if you need to route logs to multiple destinations.

A minimal Filebeat configuration for shipping nginx JSON logs:

filebeat.inputs:
  - type: filestream
    id: api-access-logs
    paths:
      - /var/log/nginx/api_access.log
    parsers:
      - ndjson:
          target: ""
          add_error_key: true

output.elasticsearch:
  hosts: ["https://elasticsearch:9200"]
  # Note: with setup.ilm.enabled below, Filebeat ignores this index setting
  # and writes to the ILM rollover alias instead, so the resulting indices
  # are named after the alias pattern.
  index: "api-access-%{+yyyy.MM.dd}"
  username: "${ES_USERNAME}"
  password: "${ES_PASSWORD}"

setup.ilm.enabled: true
setup.ilm.rollover_alias: "api-access"
setup.ilm.pattern: "{now/d}-000001"

Elasticsearch stores the access logs and provides the aggregation engine that Grafana queries. Each access log entry becomes a document in a time-series index (e.g., api-access-2023.01.08). Elasticsearch’s aggregation framework — terms, date histogram, percentiles, filters — is what makes it possible to compute metrics like “P99 latency grouped by endpoint for the last hour” in real time.
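As a concrete example, "P99 latency grouped by endpoint for the last hour" corresponds roughly to this search body against api-access-* (a hand-written sketch, not Grafana's exact generated query):

```json
{
  "size": 0,
  "query": { "range": { "timestamp": { "gte": "now-1h" } } },
  "aggs": {
    "by_path": {
      "terms": { "field": "path", "size": 10 },
      "aggs": {
        "p99": { "percentiles": { "field": "response_time", "percents": [99] } }
      }
    }
  }
}
```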

Grafana queries Elasticsearch to render dashboard panels and evaluate alert rules. Grafana’s Elasticsearch data source plugin speaks the Elasticsearch query DSL natively, so you configure queries using Lucene syntax and aggregation builders in the panel editor.


3. Setting Up Grafana with Elasticsearch

3.1 Adding the Elasticsearch Data Source

In Grafana 11.x, navigate to Connections > Data sources > Add data source and select Elasticsearch. The key configuration fields:

  • URL: https://elasticsearch:9200 (use HTTPS in production)
  • Authentication: Basic Auth or API Key (never leave authentication disabled)
  • Index name: api-access-* (the wildcard matches the daily indices)
  • Time field: timestamp (must match your log’s timestamp field)
  • Max concurrent shard requests: 5 (prevents dashboard queries from overloading ES)
  • Min time interval: 1m (matches your log granularity)

If you’re using Amazon OpenSearch Service, the overall flow is the same, but use the dedicated OpenSearch data source plugin (maintained by AWS) rather than the core Elasticsearch one: point it at your OpenSearch domain endpoint and enable IAM-based authentication via the SigV4 auth option.

3.2 Index Pattern and Mapping Considerations

For the dashboard queries to work correctly, your Elasticsearch index mapping needs a few things:

  • The timestamp field must be of type date (not text or keyword).
  • The status field should be integer or keyword. If it’s text, aggregations won’t work.
  • The response_time field must be float or double for percentile calculations.
  • The path field should be keyword (not text) so that terms aggregations return exact paths rather than tokenized fragments.

You can verify the mapping with:

curl -s "https://elasticsearch:9200/api-access-*/_mapping" | jq '.[] .mappings.properties | {timestamp, status, response_time, path}'

If fields are incorrectly mapped as text, create an index template that enforces the correct types:

PUT _index_template/api-access
{
  "index_patterns": ["api-access-*"],
  "template": {
    "mappings": {
      "properties": {
        "timestamp": { "type": "date" },
        "method": { "type": "keyword" },
        "path": { "type": "keyword" },
        "status": { "type": "integer" },
        "response_time": { "type": "float" },
        "bytes_sent": { "type": "long" },
        "remote_addr": { "type": "ip" },
        "user_agent": { "type": "text" },
        "request_id": { "type": "keyword" }
      }
    }
  }
}

4. Building the Dashboard Panels

[Figure: Dashboard Layout]

The dashboard is organized in two rows of three panels each. The top row shows high-level health metrics (rate, success rate, latency), and the bottom row shows detailed breakdowns (status distribution, slowest endpoints, errors by route).

4.1 Request Rate (requests/min)

This panel shows how many requests per minute are flowing through the API, broken down by status code class.

Panel type: Time series

Elasticsearch query configuration:

  • Metric: Count
  • Group by: Date Histogram on timestamp, interval 1m
  • Group by (second): Terms on status (to split by status code)

In the Grafana panel editor, the query looks like this:

Query: *
Metric: Count
Group by: Date Histogram (timestamp) — Interval: 1m
         Terms (status) — Order: Top, Size: 10

For a cleaner view that groups by status code class (2xx, 3xx, 4xx, 5xx) rather than individual codes, use a Lucene query approach with multiple queries:

  • Query A: status:[200 TO 299] — label: “2xx”
  • Query B: status:[300 TO 399] — label: “3xx”
  • Query C: status:[400 TO 499] — label: “4xx”
  • Query D: status:[500 TO 599] — label: “5xx”

Each query uses Count as the metric and a Date Histogram on timestamp with a 1m interval.
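The same breakdown can also be expressed as a single filters aggregation in raw query DSL, which is handy when debugging outside Grafana (a hand-written sketch, not what Grafana emits):

```json
{
  "size": 0,
  "aggs": {
    "per_minute": {
      "date_histogram": { "field": "timestamp", "fixed_interval": "1m" },
      "aggs": {
        "class": {
          "filters": {
            "filters": {
              "2xx": { "range": { "status": { "gte": 200, "lte": 299 } } },
              "3xx": { "range": { "status": { "gte": 300, "lte": 399 } } },
              "4xx": { "range": { "status": { "gte": 400, "lte": 499 } } },
              "5xx": { "range": { "status": { "gte": 500, "lte": 599 } } }
            }
          }
        }
      }
    }
  }
}
```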

4.2 Success Rate (%)

The success rate is the percentage of requests that return a 2xx status code. This is your primary SLI for availability.

Panel type: Stat (for the big number) or Time series (for trend)

Computing a ratio in Grafana with Elasticsearch requires a bit of setup. There are two approaches:

Approach 1: Using Bucket Script (recommended for Grafana 11.x)

Configure a single Elasticsearch query with:

Metric A: Count (all requests)
Metric B: Count with Lucene query filter: status:[200 TO 299]
Pipeline: Bucket Script
  Expression: params.B / params.A * 100
Group by: Date Histogram (timestamp) — Interval: 1m

The Bucket Script pipeline aggregation lets you compute (successful_requests / total_requests) * 100 at each time bucket.
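In raw query DSL, the Bucket Script setup corresponds to something like the following (a sketch; Grafana builds the equivalent request for you, and the sub-aggregation names here are illustrative):

```json
{
  "size": 0,
  "aggs": {
    "per_minute": {
      "date_histogram": { "field": "timestamp", "fixed_interval": "1m" },
      "aggs": {
        "total": { "value_count": { "field": "status" } },
        "ok": { "filter": { "range": { "status": { "gte": 200, "lte": 299 } } } },
        "success_rate": {
          "bucket_script": {
            "buckets_path": { "total": "total", "ok": "ok._count" },
            "script": "params.ok / params.total * 100"
          }
        }
      }
    }
  }
}
```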

Approach 2: Using Grafana Transformations

Create two queries — one for total requests (Count, no filter) and one for successful requests (Count, filter status:[200 TO 299]). Then add a Transform > Calculate field step with the operation Query B / Query A * 100.

For the Stat panel showing the big number, set the Value options > Calculation to “Last” or “Mean” depending on whether you want the current or average success rate. Set thresholds at your SLO boundaries:

{
  "thresholds": {
    "steps": [
      { "color": "red", "value": null },
      { "color": "orange", "value": 99 },
      { "color": "green", "value": 99.5 }
    ]
  }
}

4.3 P50/P95/P99 Latency

Latency percentiles tell you how fast requests are for the median user (P50), the tail (P95), and the extreme tail (P99). The P99 is often where problems hide — the endpoint might feel fast for most users but be painfully slow for 1%.

Panel type: Time series

Elasticsearch query configuration:

Query: *
Metric: Percentiles on field "response_time"
  Percentile values: 50, 95, 99
Group by: Date Histogram (timestamp) — Interval: 1m

Elasticsearch’s percentiles aggregation uses the t-digest algorithm, which is approximate but memory-efficient — suitable for high-cardinality time-series data.

To add a visual SLO reference line (e.g., “P95 must stay below 300ms”), use Dashboard > Annotations > Add annotation query or simply add a threshold line in the panel’s Thresholds configuration set to 300 with a red color.

4.4 HTTP Status Code Distribution

A pie chart showing the overall distribution of HTTP status codes helps you spot unusual patterns at a glance — for example, a sudden spike in 429 (rate limiting) or 401 (auth failures).

Panel type: Pie chart

Elasticsearch query configuration:

Query: *
Metric: Count
Group by: Terms on "status" — Order: Top, Size: 20

Override colors for semantic meaning:

  • 200-299: green
  • 301, 302, 304: blue
  • 400-499: orange
  • 500-599: red

4.5 Top Slowest Endpoints

This panel answers the question: “Which endpoints are the slowest right now?” It’s invaluable for identifying performance regressions after deployments.

Panel type: Table

Elasticsearch query configuration:

Query: *
Metric: Average on field "response_time"
Group by: Terms on "path" — Order by: Average response_time (desc), Size: 15

To make the table more useful, add additional metrics:

Metric A: Average on "response_time" (alias: "Avg Latency (ms)")
Metric B: Percentiles on "response_time", value: 99 (alias: "P99 (ms)")
Metric C: Count (alias: "Request Count")
Group by: Terms on "path" — Order by: Metric A desc, Size: 15

This gives you the average latency, P99 latency, and request volume for each endpoint — enough context to distinguish between “this endpoint is slow because of one outlier” and “this endpoint is consistently slow.”
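"Order by: Metric A desc" compiles down to ordering the terms aggregation by a sub-aggregation; in raw DSL the sketch looks like this (aggregation names are illustrative):

```json
{
  "size": 0,
  "aggs": {
    "by_path": {
      "terms": {
        "field": "path",
        "size": 15,
        "order": { "avg_latency": "desc" }
      },
      "aggs": {
        "avg_latency": { "avg": { "field": "response_time" } },
        "p99": { "percentiles": { "field": "response_time", "percents": [99] } }
      }
    }
  }
}
```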

4.6 Error Rate by Endpoint

This panel shows which specific endpoints have the highest error rates, helping you prioritize investigation.

Panel type: Bar gauge (horizontal)

Elasticsearch query configuration:

This requires a two-step aggregation. First, compute the error count and total count per endpoint, then calculate the ratio:

Query A: status:[400 TO 599]
  Metric: Count
  Group by: Terms on "path" — Order: Top, Size: 10

Query B: *
  Metric: Count
  Group by: Terms on "path" — Order: Top, Size: 10

Then apply a Transform > Join by field on path, followed by Transform > Add field from calculation with the formula Query A / Query B * 100. Alternatively, use the Bucket Script approach within a single query:

Metric A: Count (all requests per path)
Metric B: Count with inline filter: status:[400 TO 599]
Pipeline: Bucket Script — Expression: params.B / params.A * 100
Group by: Terms on "path" — Order by: Bucket Script desc, Size: 10

5. Grafana Alerting

Grafana 9 made Unified Alerting the default, replacing the legacy per-panel alerting system (which was removed entirely in Grafana 11). Unified Alerting is a standalone alerting engine that supports multi-dimensional evaluation, a dedicated rule editor, and a complete notification pipeline. If you’re still on the old system, migrating is worth it — Unified Alerting is significantly more capable.

5.1 Alert Rules

Alert rules in Grafana evaluate a query expression on a schedule and fire when conditions are met. Here’s an example alert rule for high error rate:

# Alert: API Error Rate > 2%
apiVersion: 1
groups:
  - orgId: 1
    name: api-monitoring
    folder: API Alerts
    interval: 1m
    rules:
      - uid: api-error-rate-high
        title: "API Error Rate Exceeds 2%"
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300  # last 5 minutes
              to: 0
            datasourceUid: elasticsearch-ds
            model:
              query: "*"
              metrics:
                - type: count
                  id: "1"
              bucketAggs:
                - type: date_histogram
                  field: timestamp
                  id: "2"
                  settings:
                    interval: 1m
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: elasticsearch-ds
            model:
              query: "status:[400 TO 599]"
              metrics:
                - type: count
                  id: "1"
              bucketAggs:
                - type: date_histogram
                  field: timestamp
                  id: "2"
                  settings:
                    interval: 1m
          - refId: C
            datasourceUid: __expr__
            model:
              type: math
              expression: "$B / $A * 100"
              conditions:
                - evaluator:
                    type: gt
                    params: [2]
        for: 5m  # must breach for 5 consecutive minutes
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "API error rate is {{ $value }}%, exceeding 2% threshold"
          dashboard_url: "https://grafana.example.com/d/api-monitoring"

For latency alerts, a similar pattern works. Monitor when P99 latency exceeds your SLO:

      - uid: api-p99-latency-high
        title: "API P99 Latency Exceeds 500ms"
        condition: B
        data:
          - refId: A
            datasourceUid: elasticsearch-ds
            model:
              query: "*"
              metrics:
                - type: percentiles
                  field: response_time
                  id: "1"
                  settings:
                    percents: ["99"]
              bucketAggs:
                - type: date_histogram
                  field: timestamp
                  id: "2"
                  settings:
                    interval: 1m
          - refId: B
            datasourceUid: __expr__
            model:
              type: threshold
              expression: A
              conditions:
                - evaluator:
                    type: gt
                    params: [500]
        for: 3m
        labels:
          severity: warning
          team: platform

5.2 Contact Points

Contact points define where alerts are sent. Grafana supports Slack, PagerDuty, email, Microsoft Teams, Opsgenie, webhook, and many more.

A Slack contact point configuration:

apiVersion: 1
contactPoints:
  - orgId: 1
    name: platform-team-slack
    receivers:
      - uid: slack-platform
        type: slack
        settings:
          recipient: "#platform-alerts"
          token: "${SLACK_BOT_TOKEN}"
          title: |
            {{ .Status | toUpper }} {{ .CommonLabels.alertname }}
          text: |
            {{ range .Alerts }}
            *{{ .Labels.alertname }}*
            {{ .Annotations.summary }}
            Dashboard: {{ .Annotations.dashboard_url }}
            {{ end }}

For PagerDuty integration (critical alerts that need to page an on-call engineer):

      - uid: pagerduty-platform
        type: pagerduty
        settings:
          integrationKey: "${PAGERDUTY_INTEGRATION_KEY}"
          severity: "{{ .CommonLabels.severity }}"
          class: "api-monitoring"

5.3 Notification Policies and Routing

Notification policies route alerts to the right contact points based on labels. This is where you implement the logic “critical alerts go to PagerDuty, warnings go to Slack”:

apiVersion: 1
policies:
  - orgId: 1
    receiver: platform-team-slack  # default receiver
    group_by: ["alertname", "team"]
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    routes:
      - receiver: pagerduty-platform
        matchers:
          - severity = critical
        continue: true  # also send to the default Slack receiver
      - receiver: platform-team-slack
        matchers:
          - severity = warning

5.4 Silences and Mute Timings

During planned maintenance windows, you don’t want alerts firing. Grafana provides two mechanisms:

Mute timings define recurring windows (e.g., “every Saturday 2AM-6AM UTC”) during which alerts are suppressed:

apiVersion: 1
muteTimes:
  - orgId: 1
    name: weekly-maintenance
    time_intervals:
      - times:
          - start_time: "02:00"
            end_time: "06:00"
        weekdays: ["saturday"]

Silences are one-off suppressions created via the Grafana UI or API for specific maintenance events. They match alerts by label and expire after a specified duration.
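Silences can also be created programmatically. Grafana’s built-in Alertmanager exposes an Alertmanager-compatible API (the path is /api/alertmanager/grafana/api/v2/silences in recent versions; verify against yours), and the POST body looks roughly like this, with illustrative matcher values:

```json
{
  "matchers": [
    { "name": "team", "value": "platform", "isRegex": false }
  ],
  "startsAt": "2023-01-08T02:00:00Z",
  "endsAt": "2023-01-08T06:00:00Z",
  "createdBy": "deploy-bot",
  "comment": "Planned database migration"
}
```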


6. Dashboard as Code

Manually configuring dashboards through the Grafana UI works for prototyping, but production dashboards should be version-controlled and deployed through code. Grafana supports three approaches:

1. JSON Provisioning — Place dashboard JSON files in Grafana’s provisioning directory (/etc/grafana/provisioning/dashboards/). Grafana loads them on startup. This is the simplest approach and works well with Git-based workflows.

# /etc/grafana/provisioning/dashboards/api-monitoring.yaml
apiVersion: 1
providers:
  - name: API Monitoring
    folder: Production
    type: file
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true

2. Terraform Provider — The Grafana Terraform provider lets you manage dashboards, data sources, alert rules, and contact points as Terraform resources. This integrates naturally with infrastructure-as-code workflows:

resource "grafana_dashboard" "api_monitoring" {
  config_json = file("dashboards/api-monitoring.json")
  folder      = grafana_folder.production.id
}

resource "grafana_rule_group" "api_alerts" {
  org_id           = 1
  name             = "api-monitoring"
  folder_uid       = grafana_folder.production.uid
  interval_seconds = 60

  rule {
    name      = "API Error Rate Exceeds 2%"
    condition = "C"
    for       = "5m"
    # ... data and expressions
  }
}

3. Grafonnet (Jsonnet library) — For teams that manage many dashboards with shared patterns, Grafonnet provides a programmatic way to generate dashboard JSON from reusable templates. This avoids the copy-paste drift that plagues hand-edited JSON.


7. Production Tips

7.1 Dashboard Variables for Filtering

Hard-coding values in queries makes dashboards rigid. Use Grafana template variables to let users filter by service, environment, or endpoint dynamically.

Define a variable for environment:

Name: environment
Type: Query
Data source: Elasticsearch
Query: {"find": "terms", "field": "environment.keyword", "size": 20}

Define a variable for API path. With the index template from Section 3.2, path is already mapped as keyword, so query the field directly; under default dynamic mapping you would use path.keyword instead:

Name: path
Type: Query
Data source: Elasticsearch
Query: {"find": "terms", "field": "path", "size": 100}

Then reference them in your panel queries with Lucene syntax:

environment:$environment AND path:$path

This single dashboard now serves every service and every environment. A dropdown at the top lets anyone filter to exactly what they need.
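With Multi-value or Include All enabled on a variable, Grafana’s Elasticsearch data source expands it into a parenthesized OR list, so the same query keeps working. A two-value path selection (values illustrative) interpolates to something like:

```text
environment:production AND path:("/api/v1/search" OR "/api/v1/payments")
```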

7.2 Annotations from CI/CD Deployments

When something breaks, the first question is always “did anything change?” Annotations overlay deployment events directly on your time-series graphs, making it immediately visible when a deploy correlates with a metric change.

Push annotations from your CI/CD pipeline:

curl -X POST "https://grafana.example.com/api/annotations" \
  -H "Authorization: Bearer ${GRAFANA_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "dashboardUID": "api-monitoring",
    "time": '"$(date +%s)000"',
    "tags": ["deploy", "api-service"],
    "text": "Deployed api-service v2.4.1 (commit: abc123)"
  }'

These annotations appear as vertical lines on time-series panels, giving you instant visual correlation between deployments and metric changes.

7.3 Index Lifecycle Management

Access logs can generate enormous volumes of data. Configure Elasticsearch’s Index Lifecycle Management (ILM) to automatically roll over, shrink, and delete old indices:

PUT _ilm/policy/api-access-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}

This keeps your hot data fast and queryable while automatically cleaning up old data.

7.4 Dashboard Organization

As your monitoring grows, organization matters. A practical structure:

Production/
  API Monitoring/
    Overview (the dashboard we built)
    Per-Service Deep Dive
    SLO Tracking
  Infrastructure/
    ECS/EC2 Metrics
    RDS Performance
    ElastiCache Metrics

Use dashboard links to connect related dashboards. The overview dashboard should link to the per-service deep dive, pre-filtered to the relevant service. This creates a natural drill-down flow: alert fires, on-call engineer opens overview, spots the affected service, clicks through to the deep dive for root cause.
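A pre-filtered drill-down link is just a dashboard URL carrying variable values as var-<name> query parameters (the dashboard UID here is made up):

```text
https://grafana.example.com/d/per-service-deep-dive?var-environment=production&var-path=/api/v1/payments&from=now-1h&to=now
```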

7.5 Avoiding Common Pitfalls

A few things we learned the hard way:

Plugin installation on managed Grafana. If you’re running a managed Grafana instance (e.g., Amazon Managed Grafana, Azure Managed Grafana, or a cloud provider’s hosted offering), plugin installation may be restricted. Some unsigned plugins require explicit allowlisting in grafana.ini:

[plugins]
allow_loading_unsigned_plugins = goshposh-metaqueries-datasource

However, with modern Grafana’s built-in Bucket Script and transformation capabilities, the need for third-party MetaQuery plugins has largely disappeared. The Bucket Script pipeline aggregation (shown in Section 4.2) handles ratio calculations natively.

Data source connectivity. When configuring the Elasticsearch data source, ensure the hostname or IP you use is the actual address reachable from the Grafana server. In containerized environments, the “internal IP” shown in a management console may be a load balancer VIP rather than the actual container IP. Always verify connectivity from the Grafana container itself:

# From inside the Grafana container
curl -v "https://elasticsearch-host:9200/_cluster/health"

Query performance. Dashboard panels that query high-cardinality fields (like user_agent or remote_addr) with large terms aggregations can be slow. Keep the size parameter in terms aggregations reasonable (10-20 for dashboard panels), and use the Min time interval setting in the data source configuration to prevent excessively granular queries.


Conclusion

The dashboard we’ve built gives you comprehensive visibility into API health: request volume, success rate, latency percentiles, status code distribution, slowest endpoints, and error rates by route. Combined with Grafana’s Unified Alerting, you get proactive notification when SLOs are at risk, routed to the right people through the right channels.

The real value, though, isn’t in any single panel or alert. It’s in the combination: when an alert fires, the dashboard provides immediate context for investigation. When a deployment goes wrong, the annotations show exactly when it happened. When someone asks “is the API healthy?”, you can point to a single URL instead of running ad-hoc queries.

If you’re starting from scratch, begin with the three most important panels — success rate, P99 latency, and error rate by endpoint — and one alert for each. You can always add more panels later, but those three will catch the vast majority of production issues. Get the alerting right first, make the dashboard useful second, and add polish last.