
Error Handling

We all know that bad things can (and will) happen. Error handling is not about avoiding errors; it is about making conscious decisions on how the system should react to them. The error handling implemented in the Suite builds on the error handling mechanisms provided by MassTransit.

Idempotence

Info

Idempotence is the property of certain operations in mathematics and computer science whereby they can be applied multiple times without changing the result beyond the initial application. (Source: Wikipedia)

Before reviewing how errors are handled and published, we should first discuss when an error should be produced, and when it shouldn't.

A common scenario when processing messages is to receive a request to perform an operation that was already done. This can happen due to infrastructure errors (problems with message delivery, etc.) or because of a bug on the requesting side.

For example, let's say you have a Saga that listens for a message to complete a Work Order (let's call it CompleteWorkOrder), and when it succeeds it publishes a message (let's call it WorkOrderCompleted). Now imagine you receive a CompleteWorkOrder message. The message is valid and the Work Order can be completed, so the code runs and modifies the entity accordingly, and the database transaction is successfully committed. Right after that, there is a critical system failure (network disconnection, power off, etc.). This means that the CompleteWorkOrder message was not removed from the queue (it was not marked as processed) and the WorkOrderCompleted message was not published.

Info

Consumers/Sagas need to synchronize two systems, the DB and the broker, and doing so is not an atomic operation. Hence, the chances of one being updated without the other are high.

In the scenario above, since the message was left on the queue, the system will try to process it again when it is back online, because the broker has no way to tell that the message was already processed. In these cases it might be tempting to produce an error, for example by throwing some kind of InvalidOperationException, CannotCompleteWorkOrderException, WorkOrderTransitionException, and so on. This is generally not a good idea. Instead, we should adhere to the Idempotence principle and produce the same output for a given message. This generally means publishing the same messages we would have published if it were the first time the message arrived.

Keeping with the example scenario above, when processing the CompleteWorkOrder message a second time (or a third, or a fourth, etc.), we would publish the WorkOrderCompleted message again. There's no need to actually "complete the Work Order" in the DB, since that is already done and we are in the "Completed" state. We just need to make sure the public API reflects what we would have done in the normal consumption of this message.

You might be wondering what happens if there is a service listening for the WorkOrderCompleted message. Wouldn't publishing the same message several times cause problems? The answer is no, as long as said service also adheres to the Idempotence principle.

We also have to consider that the producer of the original message is probably expecting an outcome. Hence, from the point of view of the public API, we need to do exactly what we would have done the first time we processed the message.
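The redelivery scenario above can be sketched in plain C#, without MassTransit, to show the shape of an idempotent handler. Everything here is an assumption made for illustration: the record and class names follow the example messages, a dictionary stands in for the database, and a list stands in for the broker. It is a minimal sketch, not how the Suite implements Sagas.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical message contracts named after the example above.
public record CompleteWorkOrder(Guid WorkOrderId);
public record WorkOrderCompleted(Guid WorkOrderId);

public enum WorkOrderState { Open, Completed }

public class WorkOrder
{
    public Guid Id { get; init; }
    public WorkOrderState State { get; set; } = WorkOrderState.Open;
}

// Hypothetical handler: the dictionary stands in for the DB,
// the Published list stands in for the broker.
public class CompleteWorkOrderHandler
{
    private readonly Dictionary<Guid, WorkOrder> _store;

    public List<object> Published { get; } = new List<object>();

    public CompleteWorkOrderHandler(Dictionary<Guid, WorkOrder> store) => _store = store;

    public void Handle(CompleteWorkOrder message)
    {
        var workOrder = _store[message.WorkOrderId];

        // Only the first delivery changes state; a redelivery finds
        // the Work Order already Completed and skips this step.
        if (workOrder.State != WorkOrderState.Completed)
        {
            workOrder.State = WorkOrderState.Completed;
        }

        // Publish the same outcome on every delivery instead of throwing.
        Published.Add(new WorkOrderCompleted(workOrder.Id));
    }
}
```

Handling the same CompleteWorkOrder twice leaves the Work Order in the Completed state and publishes two equal WorkOrderCompleted messages, which is what a producer waiting for the outcome needs after a redelivery.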

First line of defence: retry

Errors in our applications are often caused by short-lived issues that are quickly resolved by automatic processes or IT personnel. For example, a database might be unreachable for a few seconds due to overload, network issues, etc. In these scenarios, when an operation fails, the problem can probably be solved just by retrying later.

MassTransit includes a mechanism to do this automatically. Whenever an exception is thrown, and not caught by the code in the Consumer or Saga, MassTransit will retry processing the message.

Info

How many times to retry, and how long to wait between attempts, can be configured using the Saga or Consumer Definition. The Suite provides default values when using the Resilient Definitions, which should be good for most cases.

When Retry is not enough

There are two conditions where retry might not be enough to solve the issue.

  1. The infrastructure problem (e.g. database, network, etc.) could not be solved in time.

  2. The issue is related to a business condition that cannot be met (e.g. the data of the message is invalid.)

For the second case, it would be great to be able to short-circuit the retry mechanism, to avoid retrying a message that we know cannot be processed. To support this, the Suite is pre-configured not to retry if the exception thrown is of type IBusinessException.

After giving up on retrying, either because all possible attempts were exhausted or because the error thrown is an IBusinessException, the system will produce a Fault.
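The decision flow described above (retry transient errors, stop immediately on a business exception, and hand over whatever exception remains as the cause of the Fault) can be sketched in plain C#. This is only a conceptual model: the IBusinessException interface below is a hypothetical stand-in for the Suite's type, and RetrySketch is an invented helper, not a MassTransit or Suite API. Real retries also wait between attempts, which is left out here.

```csharp
using System;

// Hypothetical stand-in for the Suite's IBusinessException marker.
public interface IBusinessException { }

public class InvalidUserNameException : Exception, IBusinessException
{
    public InvalidUserNameException(string message) : base(message) { }
}

public static class RetrySketch
{
    // Runs the handler up to maxAttempts times.
    // Returns null on success, or the exception that would end up in a Fault.
    public static Exception Process(Action handler, int maxAttempts, out int attemptsMade)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        attemptsMade = 0;
        Exception last = null;

        while (attemptsMade < maxAttempts)
        {
            attemptsMade++;
            try
            {
                handler();
                return null; // processed successfully
            }
            catch (Exception ex) when (ex is IBusinessException)
            {
                return ex; // business condition: retrying cannot help
            }
            catch (Exception ex)
            {
                last = ex; // possibly transient: try again
            }
        }

        return last; // all attempts exhausted
    }
}
```

With this model, a handler that throws InvalidUserNameException gives up after one attempt, while a handler that keeps throwing TimeoutException uses every attempt before giving up.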

Faults

From MassTransit

When a message consumer throws an exception instead of returning normally, a Fault<T> is produced, which may be published or sent depending upon the context. A Fault<T> is a generic message contract including the original message that caused the consumer to fail, as well as the ExceptionInfo, HostInfo, and the time of the exception.

Basically, producers of a message T can listen to failures of that message by consuming Fault<T>.

Instead of creating custom messages for error operations, we use MassTransit Faults by simply throwing Business Exceptions.
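As a rough illustration of consuming a Fault<T> outside a Saga, a plain consumer might look like the sketch below. It assumes the MassTransit package is referenced; the CreateUserMessage contract and the CreateUserFaultConsumer class are hypothetical names used only for this example, and the consumer still has to be registered with the bus like any other.

```csharp
using System;
using System.Threading.Tasks;
using MassTransit;

// Hypothetical contract for the message whose failures we want to observe.
public class CreateUserMessage
{
    public Guid CorrelationId { get; set; }
    public string UserName { get; set; }
}

// Hypothetical consumer that reacts to failures of CreateUserMessage.
public class CreateUserFaultConsumer : IConsumer<Fault<CreateUserMessage>>
{
    public Task Consume(ConsumeContext<Fault<CreateUserMessage>> context)
    {
        // The Fault carries the original message that could not be processed.
        CreateUserMessage original = context.Message.Message;

        // It also carries information about every exception that was thrown.
        foreach (ExceptionInfo exception in context.Message.Exceptions)
        {
            Console.WriteLine(
                $"CreateUserMessage for '{original.UserName}' failed: " +
                $"{exception.ExceptionType}: {exception.Message}");
        }

        return Task.CompletedTask;
    }
}
```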

Handling Faults

MassTransit will route a Fault based on how the message that caused it was consumed.

When doing Publish/Subscribe, the producer needs to consume Fault<T>. For consumers like Sagas, the Fault is published like any other message, and it can be listened for using the same mechanisms used to consume any other type of message. If an error is raised while consuming a CreateUserMessage message, a Fault<CreateUserMessage> message will be published. This means that the Fault message can be consumed by the same Saga that produced it, or by any other Saga. You just need to declare the Event as you do for any other message.

C#
    public class UserSaga : MassTransitStateMachine<TestSaga>
    {
        public UserSaga()
        {
            InstanceState(x => x.CurrentState, Started);

            // ...

            Event(() => UserCreationFaulted, x =>
            {
                x.CorrelateById(m => (Guid)m.CorrelationId);
            });

            // React to the Fault event declared below.
            During(Started, When(UserCreationFaulted)
                .Then(ctx =>
                {
                    // Do something, like informing end users about the error
                    // or publish another message to start another operation.
                }));
        }

        // Raised when consuming a CreateUserMessage fails.
        public Event<Fault<CreateUserMessage>> UserCreationFaulted { get; private set; }
    }

For Request/Response patterns, the Fault will be automatically sent to the requesting client, where it will manifest itself as a RequestFaultException, which can be handled with a classic try/catch block.

C#
    try
    {
        var response = await this.UserRequestClient
            .GetResponse<UserCreationStartedResponse>(new
            {
                CorrelationId = correlationId,
                UserName = "TEST_USER_001"
            });
    }
    catch (RequestFaultException ex)
    {
        // Do something, like informing end users about the error.
    }

Faults caused by BusinessException

You can use the extension method HasBusinessException() on Fault messages to determine whether the Fault was produced by a BusinessException. If you want to check for a particular BusinessException Code, you can use the HasBusinessExceptionWithCode method.

C#
    public class UserSaga : MassTransitStateMachine<TestSaga>
    {
        public UserSaga()
        {
            InstanceState(x => x.CurrentState, Started);

            // ...

            Event(() => UserCreationFaulted, x =>
            {
                x.CorrelateById(m => (Guid)m.CorrelationId);
            });

            // React to the Fault event declared below.
            During(Started, When(UserCreationFaulted)
                .Then(ctx =>
                {
                    if (ctx.Data.HasBusinessExceptionWithCode(InvalidUserNameException.Code))
                    {
                        // Do something about invalid username
                    }
                }));
        }

        // Raised when consuming a CreateUserMessage fails.
        public Event<Fault<CreateUserMessage>> UserCreationFaulted { get; private set; }
    }

If you need to get all the BusinessException Codes contained in a Fault, you can use the GetBusinessExceptionCodes method.

A note on using the Catch Method

There is a method available when setting up a Saga that can be used to catch exceptions thrown on the pipe. It might be tempting to do something like this:

C#
    .Catch<BusinessException<InvalidUserNameException>>(r =>
        r.Finalize()
        .PublishAsync(ctx => ctx.Init<CreateUserFailed>(new
        {
            CorrelationId = ctx.Instance.CorrelationId,
            Reason = ctx.Exception.Message,
        }))));

This is generally considered an anti-pattern and can cause several issues. If the Saga has a Catch call that handles an issue, then said error is considered, well, handled. This means that the error will not bubble up and MassTransit will consider the operation successful, so any database operation will be committed, even though the exception could have left some fields in an undefined state.

This anti-pattern also adds a lot of unnecessary code that has to be maintained and tested. Generally there is no need to manually publish a custom error message such as CreateUserFailed; the system will automatically publish the corresponding Fault<CreateUserMessage> message.

Important

Avoid using the Catch method. If you think there is a scenario where you really need it, please contact the Suite team to discuss the matter.