We're hitting our server limits this week

emk
Black Belt - 1st Dan
Posts: 1620
Joined: Sat Jul 18, 2015 12:07 pm
Location: Vermont, USA
Languages: English (N), French (B2+)
Badly neglected "just for fun" languages: Middle Egyptian, Spanish.
Language Log: viewtopic.php?f=15&t=723

We're hitting our server limits this week

Postby emk » Mon Oct 16, 2017 1:02 pm

While working on the recent maintenance issues, I realized that we've maxed out our server for the moment.

In order to keep the costs for this site low ($30/month for everything), we run on an Amazon t2.micro "burst" server. This is normally quite fast for a cheap server, but only because it earns a certain number of "CPU credits" over time. Once it runs out of CPU credits, it slows down tremendously. That model makes sense for a small site like ours: we want pages to load quickly, but we're not serving hundreds of requests per second day in and day out.
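
Roughly, the math works out like this (a back-of-envelope sketch; the accrual rate is AWS's published figure for a t2.micro, but the 25% crawler load is just an assumption for illustration):

    # Back-of-envelope: how long until a t2.micro runs out of credits?
    # One CPU credit = one vCPU running at 100% for one minute, and a
    # t2.micro earns 6 credits per hour (i.e. a 10% baseline).
    EARN_PER_HOUR = 6.0   # credits a t2.micro accrues per hour
    balance = 30.0        # starting credit balance
    crawler_load = 0.25   # ASSUMED average CPU load while being crawled

    burn_per_hour = crawler_load * 60          # 15 credits/hour burned
    net_drain = burn_per_hour - EARN_PER_HOUR  # 9 credits/hour net drain
    hours_to_empty = balance / net_drain       # about 3.3 hours

    print(f"Drains {net_drain:.0f} credits/hour; empty in {hours_to_empty:.1f} hours")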

Which brings me to Yandex, which is apparently a Russian search engine or something. It was crawling our site at high speed, and it used up all of our CPU credits. Here's the graph of our balance over the last day and a half:

[Attachment: llo-cloudwatch-cpu-last-3-days.png (CPU credit balance graph)]


Oops. Well, I've banned Yandex, and I'm going to tell some other search engines to put the brakes on, and we'll see if that helps.

In the longer run, we could upgrade from a t2.micro to a t2.small:

[Attachment: llo-instance-cost.png (EC2 instance type pricing comparison)]


This would raise our current costs from about US$30/month to about US$45/month:

[Attachment: llo-monthly-costs.png (projected monthly costs)]


Or we could be really extravagant and consider an m3.medium, which doesn't have "burst mode" CPU credits, and therefore can't fall off a performance cliff. But it's more expensive, and probably massive overkill.

So I'm going to try to tweak the robots.txt file a bit to slow down the crawlers, and see if the problem goes away. Cross your fingers.
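
For the curious, the tweak will look something like this (just a sketch; the 10-second delay is a starting guess, and Crawl-delay is a non-standard extension that Bing and Yandex honor but Googlebot ignores):

    # Keep Yandex out entirely (the "Yandex" token matches all of their robots)
    User-agent: Yandex
    Disallow: /

    # Ask everyone else to fetch at most one page every 10 seconds
    # (Googlebot ignores Crawl-delay; its rate is set in Search Console)
    User-agent: *
    Crawl-delay: 10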


Re: We're hitting our server limits this week

Postby emk » Mon Oct 16, 2017 1:42 pm

As a temporary workaround, I've replaced our server (again), which gives us a fresh 30-credit CPU credit balance, and I've set an alarm to notify me if the balance drops below 10 credits, so maybe I can catch this problem before it becomes critical. I've also asked most crawlers to slow down their crawl speed, because that looks like the root of our problems: users read a few pages and take their time, which is a good match for the CPU credit model, but crawlers just keep grinding away hour after hour.
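
For anyone who wants to replicate the alarm, it's a CloudWatch alarm on the instance's CPUCreditBalance metric. Here's a minimal sketch using boto3; the instance ID and SNS topic ARN are placeholders, not our real ones:

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    # Alarm when the instance's CPU credit balance drops below 10.
    cloudwatch.put_metric_alarm(
        AlarmName="llo-cpu-credits-low",  # hypothetical name
        Namespace="AWS/EC2",
        MetricName="CPUCreditBalance",
        # Placeholder instance ID:
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        Statistic="Average",
        Period=300,            # 5-minute samples
        EvaluationPeriods=1,
        Threshold=10.0,
        ComparisonOperator="LessThanThreshold",
        # Placeholder SNS topic for the notification:
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:llo-alerts"],
    )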

And of course, I've put in a hard limit of 20 Apache workers, so we can't spin up 150+ of them when things start melting down; we'll shed the excess traffic instead. I've also told Apache to recycle each PHP process after 1,000 requests, so if PHP is leaking memory, the leaks can't accumulate for long.
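
In Apache 2.4 terms (assuming the prefork MPM with mod_php, which is the usual phpBB setup; the file path is the Debian/Ubuntu convention), the relevant knobs look like this:

    # Sketch of /etc/apache2/mods-enabled/mpm_prefork.conf
    <IfModule mpm_prefork_module>
        # Hard cap on simultaneous workers, so we shed traffic instead
        # of spinning up 150+ processes when things melt down
        MaxRequestWorkers 20
        # Recycle each child after 1,000 requests, so a PHP memory
        # leak can't grow without bound
        MaxConnectionsPerChild 1000
    </IfModule>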

But if the site keeps getting slow again, we'll need a bigger server.


Re: We're hitting our server limits this week

Postby emk » Mon Oct 16, 2017 2:34 pm

OK, it looks like our CPU credits have stabilized nicely for the time being, since I banned Yandex crawls and slowed down Bing, etc.:

[Attachment: llo-cpu-credit-balance.png (CPU credit balance after the changes)]


So we can keep running on the cheap server for now, and with any luck performance should stay reasonably high. It shouldn't take a big server to run this forum; we just need to keep an eye on the bots and on tuning, I think. Not that I wouldn't love that t2.small instance: it has more RAM, which would allow us to host more projects like the Super Challenge bot.


Re: We're hitting our server limits this week

Postby emk » Tue Oct 17, 2017 12:06 pm

Just an update on the performance tuning. With Yandex disabled, and several other search engines limited to fetching no more than one page every 10 seconds, our CPU credit balance is accumulating nicely:

[Attachment: llo-cpu-credit-with-new-robots-txt.png (CPU credit balance with the new robots.txt)]

As long as this number stays well above 10 credits or so, the forum should stay snappy. What we can't withstand are multi-hour indexing runs where a search engine requests a page or more every second; those eventually drain the balance.

Serpent
Black Belt - 3rd Dan
Posts: 3657
Joined: Sat Jul 18, 2015 10:54 am
Location: Moskova
Languages: heritage
Russian (native); Belarusian, Polish

fluent or close: Finnish (certified C1), English; Portuguese, Spanish, German, Italian
learning: Croatian+, Ukrainian; Romanian, Galician; Danish, Swedish; Estonian
exploring: Latin, Karelian, Catalan, Dutch, Czech, Latvian

Re: We're hitting our server limits this week

Postby Serpent » Tue Oct 17, 2017 5:35 pm

Um just saying that Yandex is as big a deal in Russia as Google in the US. Isn't it possible to limit its crawling capacity instead of banning it entirely?
LyricsTraining now has Finnish and Polish :)
Corrections welcome

rdearman
Site Admin
Posts: 7231
Joined: Thu May 14, 2015 4:18 pm
Location: United Kingdom
Languages: English (N)
Language Log: viewtopic.php?f=15&t=1836

Re: We're hitting our server limits this week

Postby rdearman » Wed Oct 18, 2017 8:12 am

Serpent wrote:Um just saying that Yandex is as big a deal in Russia as Google in the US. Isn't it possible to limit its crawling capacity instead of banning it entirely?

Yes, it is possible. The snag is that I need to configure it as a (bot) user, so that the forum software serves pages in a nice, consistent way rather than the bot wandering down broken links, etc. I'll configure the bot for Yandex this week, and hopefully we can turn it back on and it will play nice with the forum software.

EDIT: I found their UserAgent-ID on their website and configured a bot, just need to turn it back on and make sure it plays nice.
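
For reference, their main crawler identifies itself with a user-agent string like the one below, and phpBB only needs to match a substring of it (typically "YandexBot"):

    Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)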

My YouTube Channel
The Autodidactic Podcast
My Author's Newsletter

I post on this forum with mobile devices, so excuse short msgs and typos.


Re: We're hitting our server limits this week

Postby emk » Wed Oct 18, 2017 11:11 am

rdearman wrote:
Serpent wrote:Um just saying that Yandex is as big a deal in Russia as Google in the US. Isn't it possible to limit its crawling capacity instead of banning it entirely?

Yes, it is possible. The snag is that I need to configure it as a (bot) user, so that the forum software serves pages in a nice, consistent way rather than the bot wandering down broken links, etc. I'll configure the bot for Yandex this week, and hopefully we can turn it back on and it will play nice with the forum software.

That's probably not going to be enough for Yandex. They crawl our pages much faster than the other search engines, and they do it for many hours at a time. The problem is that this eventually exhausts our "CPU credit" balance, as you can see in the graphs above. Do you see the two points in the graph below where the line starts going almost straight down? That's mostly Yandex, from what I saw in the logs.

[Image: CPU credit balance graph, showing two near-vertical drops]

They just burn through our CPU in a totally irresponsible fashion. Google and Bing are a lot more mellow, and don't hammer our server for hours on end.

I don't think that creating a bot account will be enough to slow down Yandex's crawling. I can try to tune their crawl rate with non-standard robots.txt extensions, but I'm honestly not willing to waste much of my time dealing with abusive bots that try to crawl at ridiculous rates. See this comment by the author of "bad-bot-blocker":

The problem is that both Yandex and Baidu are rather poorly behaved - they hit your website way too fast, downloading large bandwidth files in quick succession. That's actually what led me to the bad-bot-blocker project in the first place. Baidu has also been accused of not respecting robots.txt though I have not personally observed that.
This is the reason they're blocked, not because they're new or non-English.

So once I finish fixing the proxy IP address stuff, I can try re-enabling Yandex with a much slower crawl rate. But if they kill our server one more time, they're gone for good. Ditto for Baidu: either they respect robots.txt and crawl at a reasonable rate, or they're not welcome. Well-behaved bots, on the other hand, are welcome to stay. So I'll give Yandex one more chance as soon as the rest of the admin backlog is sorted out.
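
If we do re-enable them, the robots.txt entry would look something like this (Yandex documents support for the non-standard Crawl-delay directive; the 10-second value is my pick, not their recommendation):

    # Let Yandex back in, but at a much gentler pace
    User-agent: Yandex
    Crawl-delay: 10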

