Monitor bot health

Learn about bot health states, how to monitor bot health with the API, and how bot health notifications work

Khoros monitors the health of a registered bot based on the response to events delivered by the Framework -- the bot should return a 202 (or 200). We use a rolling 2-minute window to determine the health state of the registration:

  • HEALTHY - 100% success (or 0 attempts) in the last 2 minutes
  • DEGRADED - mixed successes and failures, or less than 10 attempts, in the last 2 minutes
  • DOWN - 100% failures with at least 10 attempts in the last 2 minutes
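The rules above can be sketched as a small classifier. This is an illustrative sketch derived from the rules as stated (the function name and code structure are ours, not Khoros code):

```python
def classify_health(successes: int, failures: int) -> str:
    """Classify bot health from delivery counts in the rolling 2-minute window.

    Illustrative sketch of the documented rules, not Khoros's implementation.
    """
    attempts = successes + failures
    if failures == 0:
        # 100% success, or no delivery attempts at all
        return "HEALTHY"
    if successes == 0 and attempts >= 10:
        # 100% failures with at least 10 attempts
        return "DOWN"
    # Mixed successes and failures, or fewer than 10 attempts
    return "DEGRADED"
```

Note that all-failure windows with fewer than 10 attempts come out DEGRADED, not DOWN, which keeps a bot with very little traffic from being marked DOWN on a handful of failures.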

When the Framework determines that a bot is DOWN, it stops attempting to deliver every event. Every 2 minutes, the Framework restarts event delivery and, per the rules above, transitions the bot into one of the states above.

How you’ll learn about bot health changes

  • Email notification - if you have added email contact information to your bot’s registration, we send an email when:
    • the bot health becomes DEGRADED
    • the bot health becomes DOWN
  • Hand-off tags - we automatically apply a Bot Handoff child-tag named "Automatic Bot Handoff", which you can see in Khoros Care and monitor in Analytics. We apply this tag when we automatically hand off a conversation because:
    • the Framework can't deliver an event of type=message for a bot-owned author
    • the bot health is DOWN
  • API - Use /bots/v3/health/appId/{appId} to check the current health of a bot.

Bot health walkthrough

Let's walk through a quick example of how you might use the bot health feature. This example assumes a frequent, steady rate of events because health is evaluated when an event is processed (so the timing will not be exactly 2 minutes) and, as noted above, the rules are slightly different when there are very few events.

Suppose you turn your bot off to deploy a new version without putting the bot in maintenance mode first. Here is what will happen. (For simplicity, assume all events are of type message and are arriving at a steady, frequent rate.)

  1. The first failed event delivery results in an email being sent. If the bot is the owner, the Automatic Bot Handoff tag is applied and control is assigned to AGENT.
  2. For the next 2 minutes, whenever the bot is the owner, the Automatic Bot Handoff tag is applied and control is assigned to AGENT.
  3. At about the 2-minute mark, the bot is marked as DOWN, and another email notification is sent.
  4. For the next 2 minutes, whenever the bot is the owner, the Automatic Bot Handoff tag is applied and control is assigned to AGENT.
  5. After those 2 minutes, the Framework begins trying to deliver events again.
    • If the bot is still down, the events can't be delivered and another email is sent.
    • If the bot is now available, events can be delivered. The bot is now in a DEGRADED state because some events still couldn't be delivered in the previous 2 minutes. A health object is included in the events being delivered.
  6. After 2 minutes in which all delivery attempts succeed, the bot returns to a HEALTHY state and an email notification is sent.

Get the health status with the API

Make a GET request to /bots/v3/health/appId/{appId}, passing the appId defined in the bot registration to retrieve the health status of the bot and the count of failed and successful deliveries during the last 2-minute period.

Your request will look like this:

curl -X GET \
  "https://api.app.lithium.com/bots/v3/health/appId/mybot" \
  -H "Authorization: Bearer [TOKEN]"

Your response will look something like this:

{
  "data": [
    {
      "status": "DEGRADED",
      "failureCount": 2,
      "successCount": 23
    }
  ]
}
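If you poll this endpoint from a script, you might parse the response body like this. A sketch using Python's standard library; the field names match the sample response above, but the helper function and alerting logic are our own:

```python
import json

def parse_health(body: str) -> dict:
    """Extract the first health record from a /bots/v3/health response body."""
    payload = json.loads(body)
    record = payload["data"][0]
    return {
        "status": record["status"],
        "failures": record["failureCount"],
        "successes": record["successCount"],
    }

# Sample response body from the documentation above.
sample = '''
{
  "data": [
    {"status": "DEGRADED", "failureCount": 2, "successCount": 23}
  ]
}
'''

health = parse_health(sample)
if health["status"] != "HEALTHY":
    print(f"bot is {health['status']}: "
          f"{health['failures']} failures / {health['successes']} successes")
```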

Bot Health and Debugging a Bot

If you are developing your bot and pausing in debug mode before responding to an event delivery, you will affect the health of your bot, and you may inadvertently put it into a DEGRADED or DOWN state.

You will notice that we retry delivery of an event once if the first attempt is not acknowledged. This retry simply accounts for network inconsistencies.
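One way to avoid hurting your bot's health while debugging is to acknowledge each event immediately and do the slow, breakpoint-laden processing afterwards. A minimal sketch, assuming your webhook handler can hand events to a background queue (all names here are ours, for illustration):

```python
import queue
import threading

# Events are acknowledged first, then processed at whatever pace debugging allows.
work_queue: "queue.Queue[dict]" = queue.Queue()

def handle_event(event: dict) -> int:
    """Webhook handler sketch: enqueue the event and acknowledge immediately.

    Returning 202 right away keeps the delivery counted as a success even if
    the real processing (where you might sit at a breakpoint) is slow.
    """
    work_queue.put(event)
    return 202  # acknowledge before doing any slow work

def worker() -> None:
    """Background worker that drains the queue."""
    while True:
        event = work_queue.get()
        # ... step through your bot logic here without affecting health ...
        work_queue.task_done()

threading.Thread(target=worker, daemon=True).start()
```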
