Processing 40k URLs using Laravel Queues


Some context – what really happened

So a week ago I had a P1 task from SEO team with concerns that Google hadn’t been happy about some redirects on the site, not only Google but also some users.

So to temporarily solve this issue we decided to deactivate all the dead redirects temporarily until I found a real solution.

The real problem here was that the site had accumulated 40,000 redirects over time and we couldn’t simply test them manually. We had no idea what URLs were returning 404 errors out of 40k.

A lot of them were set up manually, others using automated tools. It looked like this may have happened because of server misconfiguration and recent changes to database tables. But I wasn’t 100% sure.

So a quick and temporary solution both teams agreed on was to deactivate all dead redirections to buy us some time to investigate and properly fix the issue. This would mean we’d no longer have 404 pages in Google.

These weren’t critical pages so deactivating them made sense. Plus they were already returning 404 so either I had to fix them all QUICKLY or hide them temporarily until I do fix them.

So in this post I’ll talk about how I wrote a quick script in Laravel to go through 40,000 URLs and deactivated the 404 ones.

The idea was simple:

I wanted to write URLs returning non 200 status code to a file called non_200.csv.

Later I could use this list to deactivate the dead links in the system.

This is how I did it..

Creating a job to send HTTP request and write data to files

First I created a job called UrlProcessorJob that sends HTTP requests to a URL and stores the non 200 URLs in a non_200.csv file. 

You’ll notice redirects_processed.csv in the code, more about it later..

class UrlProcessorJob implements ShouldQueue {
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    protected $id;
    protected $url;

    public function __construct($id, $url = null) {
        $this->id = $id;
        $this->url = $url;
    }

    public function handle() {
        $resp = Http::get('nblabla'.$this->url);

        if ($resp->status() !== 200) {
            $fwrite = fopen(storage_path('app/non_200.csv'), "a");
            fwrite($fwrite, $this->id . ',' . $this->url . "\n");
        }

        $processed = fopen(storage_path('app/redirects_processed.csv'), "a");
        fwrite($processed, $this->id . ',' . $this->url . "\n");
    }
}

Pushing 40k jobs to queue

class UrlProcessorController extends Controller {
    public function process() {
        $fileRes = fopen(storage_path('app/redirects_original.csv'), 'r');
        while ($data = fgetcsv($fileRes, null, ",")) {
            UrlProcessorJob::dispatch(id: $data[0], url: $data[1])->delay(now()->addSecond(1));
        }
    }
}

Self explanatory, I’m looping through the data from the file and pushing the each URL into the queue I created in the previous step.

I didn’t have to make any special tweaks here in terms of performance.

It took the While Loop about 30 seconds to push 40k rows into the database queue. Laravel stores them in “jobs” table.

Starting 10 workers

With the job setup and data pushed into queue table (after making sure there’s no memory allocation errors), I finally started 10 Workers, one in each terminal like this. php artisan queue:work (x10).

I was using a database connection so I was able to see 40k entries in the database.

I was now ready to see jobs process and write data to files..

But as soon as I started the workers and they began processing the jobs there was an apparent issue, a lot of the jobs started to fail!

Handling data issues

failed_jobs table:

In a rush I made a mistake of NOT validating the URLs, I expected redirects_original.csv to contain only valid URLs. Bad idea! 

A lot of the requests failed. And because the data had already been passed to the job I couldn’t easily fix it.

See this payload for example, it already has the bad data, there’s no option but to delete the failed jobs and re-push them.

{"uuid":"0f01d90a-1389-477e-a74b-685bf808cfd9","displayName":"App\\Jobs\\UrlProcessorJob","job":"Illuminate\\Queue\\CallQueuedHandler@call","maxTries":null,"maxExceptions":null,"failOnTimeout":false,"backoff":null,"timeout":null,"retryUntil":null,"data":{"commandName":"App\\Jobs\\UrlProcessorJob","command":"O:24:\"App\\Jobs\\UrlProcessorJob\":3:{s:5:\"\u0000*\u0000id\";s:4:\"3968\";s:6:\"\u0000*\u0000url\";s:105:\"https:\/\/www.supremecodr.com\/product-category\/ebooks\/popular\/\";s:5:\"delay\";O:25:\"Illuminate\\Support\\Carbon\":3:{s:4:\"date\";s:26:\"2024-03-28 21:13:06.595291\";s:13:\"timezone_type\";i:3;s:8:\"timezone\";s:3:\"UTC\";}}"}}

Solution

Based on previous experience I knew something like this could happen so to make sure I didn’t have to go through 40k URLs over again I created urls_processed.csv.

Each time the URL was validated successfully, I wrote it in this file. 

This way I knew what had been processed. I could simply paste the URLs from urls_processed.csv into redirects_original.csv and remove the duplicates.

Pretty basic thing to do in excel. This would give me URLs that were yet to be processed.

Some may ask, couldn’t you just re-run the failed job? Well no, if the job has bad data and it’s already been stored in the database I can’t do much. If this was a code related issue I could fix the code and re-run the jobs.

Anyways, I ended removing all the failed jobs from the database and re-running the script with the leftover URLs in redirects_original.csv.

This fixed the issue of bad data and I managed to push unprocessed entries back into the queue.

More than half still to go…

More problems – handling timeout errors

The above trick fixed the bad data bug but with more than half the URLs still left to process I started seeing classic timeout errors. More failed jobs!

Because they were simple timeout errors and I knew they would return 200 at some point if I retried,

I simply pushed the jobs back into the queue for more attempts using: php artisan queue:retry command. 

Eventually they all succeeded.

Total processing time

With the above errors resolved I managed to go through all the URLs and got the URLs returning non 200 status codes appended to non_200.csv file.

Surprisingly there were 1.5k URLs with bad redirects!

This is exactly what I needed to run a query in the database to deactivate them.

It took the system about 40 minutes to go through 40k+ URLs with 10 workers Apple M1 Pro.

Lastly I plucked all the ids from non_200.csv and executed a WHEREIN() query to deactivate the URLs. (always backup your data or at least the table!)

Final notes

I made sure that redirects_original.csv and redirects_processed.csv continued the same number of rows to ensure I didn’t miss anything.

Another thing I didn’t do is to run it on LIVE website, instead replicated it to a Staging or dev server and tested it there. Sending so many HTTP requests might slow down your website for the users.

Regardless of the process you use to do the task you’re still limited to the memory available. Just because you’re running multiple processes doesn’t really mean it will be quicker, if you’re out of memory each job will take longer to process.

Note another thing that I noticed is that any external packages are generally going to cause a lot of speed issues. For example maatwebsite/excel don’t use them if you don’t need to!

Why I avoided using Laravel’s Storage::append() function

The UrlProcessorJob initially used Storage::append(); function instead of fwrite();

To my surprise append() was slightly slow compared to fwrite().

Also note that you cannot pass Stream Resources to Jobs! This means you have to create a new Resource (fopen) in each Job! This really slows things down! But fwrite() still performs better.

Results:

fwrite(): 26.4946 Seconds

while ($data = fgetcsv($fileRes, null, ",")) {
	$res = fopen(storage_path('app/redirects_processed.csv'), "a");

	fwrite($res, $data[0] . ',' . $data[1] .',' . "\n");
}

Stream::append(): 34.3062 Seconds 

$fileRes = fopen(storage_path('app/redirects_original.csv'), 'r');

while ($data = fgetcsv($fileRes, null, ",")) {
	Storage::append('/redirects_processed.csv', $data[0] . ',' . $data[1]);
}

Ps. you can use Laravel Debugbar to get a rough idea about speed.


Leave a Reply

Your email address will not be published. Required fields are marked *