Don't let failing APIs get you down. (Zing!) If your software depends on various services, service failures can compromise part of your application or bring your application down entirely.

This simple problem gave rise to a powerful software pattern called a circuit breaker. It allows your software to stay up gracefully, even if it depends on something that is experiencing a failure.

Let’s implement a circuit breaker in our Elixir API with the Erlang library Fuse. We will use a circuit breaker to back up our calls to the fictional SocialMedia API. This is not an install guide for these tools. Instead, it will help you reason about and implement this kind of solution, complementing the install instructions you can find in the Fuse documentation.

Starting with implementation

Let’s focus on the reasoning behind this pattern and perform the implementation first and the configuration last.

Suppose we start with this basic module that handles our social media integration.

defmodule MyApi.SocialMediaAPI do  
  @moduledoc """
  This module contains the functions
  that interface to SocialMediaAPI.
  """

  @doc """
  Returns a list of maps from SocialMediaAPI
  based on the given options.
  """
  @spec get_entries(list) :: list
  def get_entries(options) do
    options
    |> fetch_from_web
    |> Poison.decode!
  end

  @spec fetch_from_web(list) :: binary
  defp fetch_from_web(options) do
    case options |> MyApi.SocialMediaAPI.Fetch.user_timeline do
      {:ok, entries} -> entries
      :error -> "[]"
    end
  end
end  

Our module provides get_entries/1, which, given some options, wraps the API request and decodes the response into an Elixir list with Poison.

This function relies on fetch_from_web/1, which is responsible for executing the API call and handling the outcome. It’s at this point where we see the possibility of getting entries or else an error. To code defensively, we return "[]" in the event of an error, which lets the code avoid raising an exception as it fulfills its contract.

Without a circuit breaker, this seems like a normal strategy for handling a failure of the SocialMediaAPI. However, because "[]" carries no content, it only kicks the can down the road. We now have to find a way to control flow in the case of this empty collection. If we plan to display social media entries in the UI that consumes our Elixir API, we have to bring the failure all the way to the surface with a message like, “No entries found at this time.” The failure spans the full stack. Or, even worse, we may not anticipate the failure and return the dreaded 500 Internal Server Error.

So, let’s move one step closer to solving our problem and introduce caching to serve the last matching call, rather than an empty list. This way, we recover gracefully from the error instead of merely fulfilling our contract with an empty collection. We do this by updating fetch_from_web/1 and introducing some stubbed-out caching functions.

  # Returns the raw JSON body; get_entries/1 still decodes it with Poison.
  @spec fetch_from_web(list) :: binary
  defp fetch_from_web(options) do
    case options |> MyApi.SocialMediaAPI.Fetch.user_timeline do
      {:ok, entries} -> set_cache(entries)
      :error -> get_cache(options)
    end
  end

  # Reads the last cached response for these options.
  @spec get_cache(list) :: binary
  defp get_cache(options), do: MyApi.SocialMediaAPI.Cache.get(options)

  # Stores a fresh response and returns it unchanged.
  @spec set_cache(binary) :: binary
  defp set_cache(entries), do: MyApi.SocialMediaAPI.Cache.set(entries)

We have introduced caching in our case statement. On the happy path, set_cache/1 creates a new cache record and returns the entries. In the failure state, get_cache/1 returns the entries that are already cached. The implementation details underlying the cache depend on your domain. But we have improved upon "[]".
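To make the stubbed cache functions concrete, here is a minimal sketch of what MyApi.SocialMediaAPI.Cache could look like. It is only an assumption for illustration: it uses an Agent to remember the last successful response body, ignores the options entirely, and falls back to "[]" before anything has been cached. A real cache would likely key on the options and expire old records.

```elixir
defmodule MyApi.SocialMediaAPI.Cache do
  @moduledoc """
  Hypothetical single-slot cache that remembers the last successful
  response body. Not part of Fuse or of the original module.
  """
  use Agent

  # Start with an empty JSON list so get/1 always returns something decodable.
  def start_link(_opts \\ []) do
    Agent.start_link(fn -> "[]" end, name: __MODULE__)
  end

  # Store the latest response body and return it unchanged.
  def set(entries) do
    :ok = Agent.update(__MODULE__, fn _old -> entries end)
    entries
  end

  # Options are ignored in this sketch; a real cache would key on them.
  def get(_options) do
    Agent.get(__MODULE__, fn entries -> entries end)
  end
end
```

Because set/1 returns the entries it was given, it drops into the happy path of fetch_from_web/1 without changing what that function returns.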

However, if SocialMediaAPI is down for one hour and we have 30 requests to our Elixir API, we are making 30 calls to the SocialMediaAPI. And all 30 will return :error. Why beat a dead horse? We should stop executing code that we know will fail.

More realistically, the service is down for an unknown amount of time, and the wasted execution costs performance at best and creates unknown problems at worst. The primary motivation for using a circuit breaker is that your software has a strong reason to avoid running that doomed code while the service is down.

The circuit breaker will help us exit as early as possible from that cycle of requests when SocialMediaAPI’s failures meet a threshold. In other words, when the SocialMediaAPI has failed beyond our threshold, the circuit breaks. And then, we stop sending any requests to the SocialMediaAPI. Instead, we go straight to fetching from the cache. With Fuse, here is how we do that:

Putting the circuit breaker in the module

Below is how we need to change the module, specifically get_entries/1 and fetch_from_web/1. The configuration will come next.

  @fuse Application.get_env(:my_api, :social_media_api) # Match the fuse to a module attribute

  @doc """
  Returns a list of maps from SocialMediaAPI
  based on the given options.
  """
  @spec get_entries(list) :: list
  def get_entries(options) do
    case :fuse.ask(@fuse, :sync) do # Step 1: Check conn. status before call by passing in the module attribute
      :ok -> fetch_from_web(options)
      :blown -> get_cache(options)
    end |> Poison.decode!
  end

  @spec fetch_from_web(list) :: binary
  defp fetch_from_web(options) do
    case options |> MyApi.SocialMediaAPI.Fetch.user_timeline do
      {:ok, entries} -> set_cache(entries)
      :error ->
        :fuse.melt(@fuse) # Step 2: Record the failure; enough melts blow the fuse checked in Step 1
        get_cache(options)
    end
  end

Using the module attribute @fuse, we name the identifier of the circuit breaker, which gives us a way to reference its state. In Step 1 in the comment above, we check that state before entering the logic that executes the call to SocialMediaAPI. On the happy path, nothing is new. However, when the SocialMediaAPI returns an error in Step 2, we call :fuse.melt(@fuse) to record the failure, and once failures pass the configured threshold the fuse blows and we stop relying on the endpoint. Until :fuse.ask(@fuse, :sync) returns :ok again, :blown sends us directly to our cache.
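One detail worth sketching: besides :ok and :blown, :fuse.ask/2 can return {:error, :not_found} when no fuse with that name has been installed, and a case with only two clauses would raise on it. The helper below is hypothetical (the module name and source_for/1 are not part of Fuse or of the module above). It takes the result of :fuse.ask/2 as an argument so it can be tried without Fuse, and it assumes a missing fuse should behave like a closed circuit.

```elixir
defmodule MyApi.SocialMediaAPI.Route do
  @moduledoc """
  Hypothetical helper that maps a :fuse.ask/2 result to a data source.
  """

  @spec source_for(:ok | :blown | {:error, :not_found}) :: :web | :cache
  # Circuit closed: call the service.
  def source_for(:ok), do: :web
  # Circuit open: skip the service and read the cache.
  def source_for(:blown), do: :cache
  # Assumption: with no fuse installed, fall back to calling the service.
  def source_for({:error, :not_found}), do: :web
end
```

Whether a missing fuse should route to the service or to the cache is a design choice; routing to the cache would be the more conservative option.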

This shortcut to the cache is essentially what makes the circuit breaker pattern so valuable here. Before continuing with configuration options and setup, it's worth mentioning that this is just one domain for using this pattern. In an interconnected microservices architecture, in an asynchronous worker queue, or in scenarios where throughput is at scale, caching data may not be the right solution. Whatever the actual solution, this pattern saves unnecessary runtime execution of code and prevents requests from being sent to a service that is down.

Configuration

This configuration of Fuse will illuminate some of the fine tuning that makes a circuit breaker intelligent. While there are several options, here are the ones that pertain to this domain, in start/2 of lib/my_api.ex:

fuse_options = {  
  {:standard, 2, 10_000},  # Allow 2 failures within 10 seconds, then fuse is blown
  {:reset, 120_000}        # Retry :blown fuses after 120 seconds
}

fuses = [  
  :social_media_api # Match this atom when setting module attribute @fuse
]

# Perform a list comprehension to add each fuse to the application
for fuse <- Enum.map(fuses, &(Application.get_env(:my_api, &1))) do  
  :fuse.install(fuse, fuse_options)
end  
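Both the comprehension above and the @fuse module attribute look up the fuse name with Application.get_env/2, which returns nil when the key is missing. A matching config entry might look like the fragment below. The value :social_media_api_fuse is an assumption chosen for illustration, and the file path follows the usual Mix layout.

```elixir
# config/config.exs
import Config # older projects use `use Mix.Config` instead

# Assumption: the value is the name the fuse is installed under.
config :my_api, social_media_api: :social_media_api_fuse
```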

In the tuple fuse_options, we define the threshold for failure of a service and the time until we start trying the circuit again. In this case, if the service fails more than two times in a ten-second period, the fuse is blown. Once it is blown, we can start avoiding unnecessary code execution. The :reset option then allows the application to retry the circuit after 2 minutes, to see if the service is back up.
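To reason about how the failure window and the reset timer interact, here is a toy model in plain Elixir. It is not Fuse's implementation, and Fuse's exact bookkeeping may differ. BreakerSketch, melt/2 and ask/2 are hypothetical names, and time is passed in as milliseconds so the behavior is easy to follow by hand.

```elixir
defmodule BreakerSketch do
  @moduledoc """
  Toy model of a failure window plus a reset timer. Not Fuse's code.
  """

  defstruct max_failures: 2,
            window_ms: 10_000,
            reset_ms: 120_000,
            failures: [],
            blown_at: nil

  # Record one failure at time `now` (milliseconds).
  def melt(%__MODULE__{} = breaker, now) do
    # Keep only the failures that still fall inside the window.
    recent =
      Enum.filter([now | breaker.failures], fn t -> now - t < breaker.window_ms end)

    if length(recent) > breaker.max_failures do
      # Too many recent failures: trip the breaker.
      %{breaker | failures: [], blown_at: now}
    else
      %{breaker | failures: recent}
    end
  end

  # A breaker that has never tripped is always :ok.
  def ask(%__MODULE__{blown_at: nil}, _now), do: :ok

  # Report :blown until `reset_ms` has passed since the breaker tripped.
  def ask(%__MODULE__{blown_at: at, reset_ms: reset_ms}, now) do
    if now - at < reset_ms, do: :blown, else: :ok
  end
end
```

With the defaults, which mirror fuse_options above, two failures at 0 ms and 1,000 ms leave ask/2 returning :ok. A third failure at 2,000 ms trips the breaker, so ask/2 returns :blown until 122,000 ms, when the reset period has elapsed.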

Conclusion

I hope this simple introduction to the circuit breaker pattern with Fuse has been useful! Please feel free to suggest improvements or ask questions in the comments below.