New Relic RPM with Resque

New Relic RPM announced that they’ve started to support Sinatra and Rack a few days ago (article). It’s also come up with a better background job monitoring. Good news. So I tried monitoring jobs being handled on Resque which we’ve started using recently for managing background jobs.

Unfortunately RPM 2.10.5 couldn’t monitor jobs of Resque. Resque forks to process a job. RPM agent instance can’t chase the things happend in forked processes which terminates in a short time.

However I succeeded to monitor Resque jobs with RPM with a little hack. I want to share it here, though, it is just a hack. The information might be obsolete soon. So I explain my process I took to find it too. In that way, I hope that it gives you a help when this hack is no longer applicable.

Step 1. Follow the instruction of RPM

Firstly I followed RPM’s instruction (API document) and implemented a sample job like this.

class HelloWorldJob
  include NewRelic::Agent::Instrumentation::ControllerInstrumentation

  @queue = :low
  def self.perform
    HelloWorldJob.new.hello_world
  end

  def hello_world
    RAILS_DEFAULT_LOGGER.info "#{Time.now}: Hello World!"
    RAILS_DEFAULT_LOGGER.flush
    sleep(3)
  end

  add_transaction_tracer :hello_world, :name=>'hello_world'
end

But even if the job is processed on Resque, it is not shown as a transaction in management screen of RPM.

Step 2. Google it

Secondly I googled it. I felt that researching myself is a waste of time if someone has found the way already(This idea is always wrong though). But Google didn’t give me the answer this time. I thank Google for that because it gives me an oppotunity to understand how both Resque and Newrelic RPM work deeply.

Step 3. Set up environment to debug

For Resque: start resque worker with very vobose mode.
e.g.

VVERBOSE=true QUEUE=* rake environment resque:work

For Newrelic: set log level with debug in newrelic.yml

  log_level: debug

Then I tested the job with script/console. You can run the job like this:

% script/console
> Resque.enqueue HelloWorldJob
=> Enqueues a job and it will be executed on Resque worker
> HelloWorldJob.process
=> Executes a job directly without queueing

You should be able to confirm that the job executed directly on console is monitored on RPM and shown in its management screen. But the one proccessed by Resque never comes to the screen.

Step 4. Try to know what happens in Resque

I proceeded the investigation with the following cycle.

1. Run job on Resque and script/console
2. Compare the log of Resque to the one of script/console
3. Make an assumption and put some debug codes in Resque or Newrelic RPM (Thanks god those are opensource!)
4. Restart Resque worker and script/console
5. Back to 1

Sorry but there is no magic and shortcut. But after I spent about 15 minutes or so with the cycle, I found(and remembered at the same time) that Resque uses forking for processing a job and I was convinced that it would cause the issue. So I changed the code on Reque not to fork the process as an experment.

in lib/resque/worker.rb

    def fork
+      return false

Then ran the job via Resque. Yay, the transaction information was shown in RPM.

Step 5. Then a little hack

Of course, you can’t turn off the forking of Resque. That is something that makes Resque awesome. So I reveted the change I made on worker.rb and focused on finding a way to monitor forked process with RPM.

What I tried at first is to call NewRelic::Agent.shutdown which sends the all queued data after the job is done. RPM sends the log every 60 seconds but the job finishes about in 3 seconds. So that the job has to make sure to send the data before ended.

class HelloWorldJob
  include NewRelic::Agent::Instrumentation::ControllerInstrumentation

  @queue = :low
  def self.perform
    HelloWorldJob.new.hello_world
    NewRelic::Agent.shutdown
  end

  def hello_world
    RAILS_DEFAULT_LOGGER.info "#{Time.now}: Hello World!"
    RAILS_DEFAULT_LOGGER.flush
    sleep(3)
  end

  add_transaction_tracer :hello_world, :name=>'hello_world'
end

On console/script, it worked in a way I expected, though, it didn’t on Resque. Dammn it. Next I looked into RPM’s codebase with the cycle I showed on Step 4.

Then I found that RPM resets the connection information if the current process id is different from the one which RPM got when it started. Then, on that situation, it tries to start worker thread which reports the information to RPM server every 60 seconds. But it takes seconds to launch and the job finishes before the thread finishes communicating server.

Anyway you don’t want to create a thread for RPM everytime Resque job is created. It just wastes your machines resource and makes a job process slow in terms of background job. Alternatively it would be nice if the forked process use the parent connection information simply. I monkey-patched onto RPM to behave in that way. This patch shouldn’t affect monitoring for other applications.

in lib/new_relic/agent/agent.rb

    def ensure_worker_thread_started
      return unless control.agent_enabled? && control.monitor_mode? && !@invalid_license

-     if !running?
+     if !running? && !control['resque']

Then changed the codebase of the job like this.

class HelloWorldJob
  include NewRelic::Agent::Instrumentation::ControllerInstrumentation

  @queue = :low
  def self.perform
    NewRelic::Control.instance['resque'] = true
    HelloWorldJob.new.hello_world
    NewRelic::Agent.shutdown
  end

  def hello_world
    RAILS_DEFAULT_LOGGER.info "#{Time.now}: Hello World!"
    RAILS_DEFAULT_LOGGER.flush
    sleep(3)
  end

  add_transaction_tracer :hello_world, :name=>'HelloWorldJob#hello_world'
end

Wheeee, now the job is shown and tracked as a transaction on RPM. You can track the exceptions occurred in Resque jobs as well as slow transactions with RPM.

Step 6. Refactor

To monitor all jobs in same way, I refactor HelloWorldJob with its super class. I hope that RPM is going to support forked background process like Resque officially in a nice way and the hack is no longer needed soon. So it doesn’t make sence to write this ugly hack on every single job. Instead of that, I put all newrelic related code onto a super class.

So here is a final codebase.

class BaseJob
  include NewRelic::Agent::Instrumentation::ControllerInstrumentation

  # This is called from Resque.
  # To be wrapped with the common operation, please describe job in the instance's perform method.
  def self.perform(*args)
    NewRelic::Control.instance['resque'] = true
    self.new.perform_wrapper(*args)
  ensure
    NewRelic::Agent.shutdown
  end

  # Sends a hint to our monitoring system, newrelic rpm
  def add_custom_parameters(key, value)
    NewRelic::Agent.add_custom_parameters(key => value.to_s)
  end

  # Calls :perform with monitoring on newrelic rpm
  def perform_wrapper(*args)
    # :task seems available only for charged plan. When you test this with a free plan, you have to get rid of it.
    # perform_action_with_newrelic_trace(:name=>'perform') do
    perform_action_with_newrelic_trace(:name=>'perform', :category => :task) do
      begin
        perform(*args)
      ensure
        logger.flush if logger && logger.respond_to?(:flush)
      end
    end
  end

  # Override this method instead of the class method.
  def perform(*args)

  end

  def logger
    RAILS_DEFAULT_LOGGER
  end
end
# Test purpose job. It output the information to log.
class HelloWorldJob < BaseJob
  @queue = :low

  def perform(who='World')
    # This sends the information to NewRelic RPM
    add_custom_parameters :whom, who

    logger "#{Time.now}: Hello #{who}!"
    sleep(3)
  end

end

Epilogue
I have to say lastly that reading Resque’s codebase was absolutely fun. It is simple and neat. You might feel that you can write the codebase yourself easily. But there are the good design decisions behind that: which are using only Redis for queue backend and forking for processing a job. Cool.

Advertisement

~ by ono on February 28, 2010.

9 Responses to “New Relic RPM with Resque”

  1. Tatsuya, thanks for the detailed writeup. You have these in a github fork by any chance? In the future feel free to reach out to us at support.newrelic.com, or on #newrelic at freenode.

  2. Thanks Bill for your immediate response! This is great.

    Here is the commit:
    http://github.com/ono/rpm/commit/c50e80b9c6ed7f70b37f681be2e67c24ec49f783

    And here is the sample:
    http://gist.github.com/318256

  3. I might be able to change the agent so that it would do the shutdown step automatically. If such a mode were supported, what would be the best way to specify it? If you pull it from the newrelic.yml, do you risk the setting being applied to other non-resque processes? Would it be better to have a configuration call?

    One option is to make you invoke NewRelic::Agent.manual_start :forked_task => true.

    This would eliminate the need for both agent_enabled=true in the newrelic.yml, and the call to shutdown.

    Is there a way to reliably detect when I’m running a resque forked worker? I could possibly auto-detect this mode.

  4. I wanted making a change as small as possible for the blog entry. There are some better implementations.

    > Is there a way to reliably detect when I’m running a resque forked worker?
    > I could possibly auto-detect this mode.

    I wrote a logic to detect resque child process.
    http://github.com/ono/rpm/tree/resque_auto
    http://github.com/ono/rpm/commit/7ce02b07bd5008512969db37abdfe62085ab64ae

    Resque worker edits $0 when processing e.g. “resque-1.5.0: Processing [queue name] since 1267577227!”. And, if the task is processed as a forked process, the parent process id should be same as @task_loop’s process id. You need this check logic because Resque doesn’t fork if the platform doesn’t support forking. In that case, Resque executes tasks on a worker’s process. …That’s what resque_forked_task? method does.

    I added a logic to skip resetting status in ensure_worker_thread_started for forked tasks on next commit.

    Regarding shutdown, I didn’t add anything so resque task still has to call shutdown. It would be nice if you put the logic at the suitable place by checking the process with forked_task? method.

  5. Sending the data out to RPM after each job sounds like a bad idea. Resque was built for supporting very long queues of small fast jobs. This would slow down these small jobs quite a bit, if they had to wait for the RPM agent to submit the data. It would probably also put a high load on the RPM servers.

  6. That’s true.
    When you have to call shutdown like I do in article, it gives you a possibility NOT to send information. So that you could only send the information when job gets slow or exception.
    Alternatively maybe RPM agent could run as a dedicated process and accept message from job process? It seems more works required, though, that will be nice.

  7. Mick makes a good point (also mentioned in on our support site). There’s not really a way around it. If your jobs are pretty short, the delay may be too much overhead. A separate process is a possibility. We actually have something like this in the agent itself (do mongrel_rpm -h) but it’s not really fleshed out and wouldn’t get you errors and transaction traces.

  8. Good news. I just committed an update of the rpm_contrib gem which includes support for Resque. I got it working well but it hasn’t been tested extensively.

    The upshot is that you just require the rpm_contrib gem (version 1.0.4) and it automatically detects resque, instruments the process method and flushes the data to the server. It all happens synchronously so there’s still a bit of a delay introduced at the end of the job but it should reliably send the data to the background tab.

    You need to remove all the workaround code in your job classes, and remove any settings you have which force the agent to start, like agent_enabled: true in the newrelic.yml. Anything leftover will probably interfere with the agent starting properly.

    This version of the rpm_contrib gem depends on a pre-release of the agent, version 2.11.0.beta.

    Would be grateful to anyone who can try it and let me know how it works.

  9. Brilliant.
    I will try it soon. But maybe it takes one or two weeks to give you feedback since I am bit busy with other projects and my laptop is away for repair.

    I ran into an issue recently which ActiveRecord attributes information is collapsed on Resque+RPM (with my patch). It’s not always happened. I’ve not found yet the certain way to reproduce the issue though, I am sure that it doesn’t occur when RPM is disabled (It’s funny because RPM also helps me to notice the issue quickly!). I think that I need a careful research for this and make sure that it is OK with your fix. I will let you know when I get more information.

    Thanks!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s

 
Follow

Get every new post delivered to your Inbox.