Introduction

You can find all the code mentioned in this article, along with more examples, in this GitHub repository.

A powerful feature of the Go programming language is the ability to create and manage lightweight concurrent threads of execution, known as goroutines. In this article, we will explore how to keep them running for extended periods while gracefully recovering from panics. We will first walk through an example in which a panic occurs inside a goroutine, and then show how to recover from the panic and restart the goroutine. Understanding these concepts will help you design and implement robust, fault-tolerant applications using Go’s concurrency features.

Let’s start with the first example, in which a panic happens inside a goroutine.

func main() {
   worker_one := workers.NewWorker("worker_1", 2*time.Second)

   go worker_one.Work()

   select {} // block forever
}

Here, a new Worker instance is created. The constructor takes two arguments: a string identifier (“worker_1”) that names the worker, and a duration (2 seconds in this case) that simulates the time it takes to do the work. We then invoke the Work() method of the worker_one instance in a new goroutine, so the work runs concurrently without blocking the main function.

To keep the main function from exiting, we use an empty select statement, which blocks the main goroutine indefinitely. This keeps the program running so that the worker_one goroutine can continue to execute its tasks in the background. Without this line, the main function would return immediately and the program would exit, terminating the worker goroutine along with it.

Now let’s take a look at the worker package.

In this package we can see the custom type Worker with the following fields.

  • ID: A string identifier for the worker.
  • Err: An error field to store any error that might occur during the worker’s execution.
  • Duration: A time.Duration representing the duration we’ll use to simulate the time it takes to do the work.

The Worker type has a few methods.

  • func (w *Worker) GetError() error returns the error stored in the worker’s Err field.
  • func (w *Worker) GetWorkerID() string returns the worker’s ID.
  • func (w *Worker) GetSleepDuration() time.Duration returns the worker’s sleep duration.

func (w *Worker) Work() (err error) is responsible for executing the worker’s task. It consists of the following steps:

  • Print a message indicating the worker has started.
  • Enter an infinite loop that executes the following actions:
      1. Seed the random number generator with the current time.
      2. Generate a random big integer within the range of 0 to 99.
      3. Print a message indicating the worker is doing work.
      4. Check if the generated number is a prime number. If so, raise a panic with an informative message.
      5. Sleep for the specified duration before starting the next iteration.

Now let’s run this application and see what happens.

Starting Worker : worker_1 

worker_1 doing work ..

worker_1 doing work ..

panic: random 23 is prime 


goroutine 6 [running]:
go-routine-panic-recover/panic_example/workers.(*Worker).Work(0x14000070060)
        /Users/username/dev/go-routine-panic-recover/panic_example/workers/hard_worker.go:44 +0x234
created by main.main
        /Users/username/dev/go-routine-panic-recover/panic_example/app.go:12 +0x78
exit status 2

This output demonstrates what we expected. When Work() is called, it keeps generating random numbers until one of them is prime (23 in this run) and then panics. Since the panic is unhandled, it terminates the entire program.

Now let’s change the implementation so that it recovers from a panic and keeps the program alive. First, here is how the main function has changed.

worker_one := workers.NewWorker("worker_1", 2*time.Second)

// Buffered channel on which a worker reports itself after recovering from a panic.
wrks := make(chan *workers.Worker, 1)

go worker_one.Work(wrks)

// Every value received here is a worker that panicked: log the reason and restart it.
for w := range wrks {
   fmt.Printf("\033[31m---------------- PANIC happened in worker : \033[0m\033[34m%s\033[0m\033[31m because %s\033[0m\n", w.GetWorkerID(), w.GetError().Error())

   fmt.Printf("\033[32m-------------\033[0m \033[34m%s\033[0m \033[32mrecovering ...\033[0m \n", w.GetWorkerID())

   go w.Work(wrks)
}

In this example we have a channel of type *workers.Worker. The channel is passed as an argument to the Work() method, and the worker goroutine uses it to send messages back to main. A for loop then iterates over the wrks channel. Whenever a panic occurs, the worker responsible for the panic sends itself through the channel. The loop receives the worker instance and prints a message indicating that a panic has happened, along with the worker’s ID and the reason for the panic (retrieved using w.GetError().Error()). Still inside the loop, the worker’s Work() method is started again in a new goroutine, with the wrks channel passed as an argument. This ensures that the worker resumes its work after a panic occurs.

Now let’s take a look at what changes were made to the Work() method.

The updated Work() method adds a panic recovery mechanism built from a defer statement and an anonymous function. The function signature has changed: in addition to the *Worker receiver, it now takes a channel of type chan<- *Worker as an argument, and it still returns an error.

A defer statement is added at the beginning of the function. It registers an anonymous function that runs when Work() returns, whether because of a panic or normal completion. That anonymous function calls recover(). If a panic occurred, it checks whether the panic value is of type error and, if so, stores it in the w.Err field; otherwise, it creates a new error from the panic value and stores that in w.Err.

After recovering from the panic and setting w.Err, the Worker instance is sent through the channel, signaling the main goroutine that a panic has occurred. The rest of the function is the same as in the first example, including the for loop and the panic that is triggered when a prime number is found.

In this article, we have explored how an unhandled panic in a goroutine terminates the entire program and how a proper recovery mechanism mitigates this. By using channels, defer statements, and anonymous functions, we have shown how to handle panics gracefully and restart goroutines after a panic occurs. As you continue to build concurrent systems with Go, remember to handle panics responsibly and leverage the powerful concurrency features the language offers to build reliable and efficient software.