Making WordPress.org

Opened 5 years ago

Closed 5 years ago

#4138 closed defect (bug) (maybelater)

PROPOSAL: Maintain a blacklist of obviously nefarious traffic sources

Reported by: jonoaldersonwp
Owned by:
Milestone:
Priority: low
Component: General
Keywords: analytics
Cc:

Description

We monitor where traffic to wordpress.org comes from, in order to help understand our marketing performance and to prioritise strategies/tactics. ~40% of our traffic comes in the form of referrals from other sites.

A large proportion of that traffic is obviously fake and/or nefarious by nature: general spam bots, or more maliciously targeted fake traffic designed to confuse, elicit clicks, or obfuscate other behaviours.

Some of this, we can see pretty easily. For example, between Jan 22nd-24th:

Further investigation into the behaviour of these particular visits shows significant evidence that they're bots - browsing patterns and metadata associated with the visits stick out like a sore thumb.

These are bad sites, sending fake users, costing us time and money, and muddying our understanding of _actual_ user behaviour.

My original suggestion was going to be that we should filter out visits from these blacklisted sources/domains in Google Analytics ("ignore traffic with X referring source") - however, upon reflection, perhaps we should block these from the site entirely?

My suggestion, then, is that:

  • We undergo a regular review process of referring sources in GA.
  • If they're obviously _bad_, we add them to a list to block them at the load balancer level, based on the referrer.
  • If it's less obvious what's going on, we consider filtering out in GA on a case-by-case basis.
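To make the referrer-based block concrete, here is a minimal sketch of the matching logic only, assuming the blocklist is a plain set of domains. The domain shown is a placeholder, and the names `BLOCKED_REFERRER_DOMAINS`, `referrer_host` and `is_blocked_referrer` are hypothetical; how a rule like this would actually be expressed on the load balancer is not covered here.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist. The entry below is a placeholder domain,
# not one of the referrers discussed in this ticket.
BLOCKED_REFERRER_DOMAINS = {"spam-referrer.example"}


def referrer_host(referer_header):
    """Return the lowercased hostname from a Referer header, or None."""
    if not referer_header:
        return None
    try:
        host = urlsplit(referer_header.strip()).hostname
    except ValueError:
        # Malformed header values are treated as "no referrer".
        return None
    return host.rstrip(".") if host else None


def is_blocked_referrer(referer_header, blocked=BLOCKED_REFERRER_DOMAINS):
    """True if the referrer's host is a blocked domain or a subdomain of one."""
    host = referrer_host(referer_header)
    if host is None:
        return False
    return any(host == d or host.endswith("." + d) for d in blocked)
```

Matching on the parsed hostname (rather than a substring of the whole header) avoids blocking unrelated sites whose URLs merely contain a blocked name. Requests with no referrer are deliberately let through.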

Thoughts appreciated.

Change History (9)

#1 @jonoaldersonwp
5 years ago

  • Priority changed from normal to low

#2 @tobifjellner
5 years ago

How much of this is REAL traffic?
There are spammers that just send virtual hits directly to GA, without ever actually visiting the site.

#3 @jonoaldersonwp
5 years ago

How much of what - these referrers specifically, or, all referral traffic?

This isn't data which has been submitted via the Measurement Protocol; these are hits triggered by bot visits.

#4 @tobifjellner
5 years ago

Oh. If these are spam bots that do real visits in order to infect the stats with fake referrals, then it would just be right to block them at the perimeter.

#5 @Otto42
5 years ago

Could some of these be browsers aggressively pre-loading links on the page?

I ask because I notice that in the case of gamefullpc net, the referrals mostly go to the two pages linked to from that site in the footer. Note that the themezee link redirects to the w.org page.
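One rough way to quantify that pattern, sketched under the assumption that visits can be exported as (referring host, landing path) pairs. The function name `landing_page_concentration` and the input shape are hypothetical, not something GA provides in this form.

```python
from collections import Counter


def landing_page_concentration(visits, referrer, top_n=2):
    """Fraction of one referrer's visits that land on its top_n pages.

    visits: iterable of (referrer_host, landing_path) pairs (assumed shape).
    Returns 0.0 if the referrer has no visits.
    """
    pages = Counter(path for host, path in visits if host == referrer)
    total = sum(pages.values())
    if total == 0:
        return 0.0
    top_hits = sum(count for _, count in pages.most_common(top_n))
    return top_hits / total
```

A value close to 1.0 for a small top_n would be consistent with visits following (or pre-loading) a couple of fixed links, though it doesn't distinguish between those explanations on its own.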

#6 @jonoaldersonwp
5 years ago

Oh, now that's an interesting idea. I can't see any evidence of it, though. Nothing obvious in headers, network connections, or JS.

#7 @tellyworth
5 years ago

Bot/fraud detection is a specialised area that can easily turn into a game of whack-a-mole. If Google Analytics can't detect these bots, how are we going to successfully manage it going forward? Google has resources and expertise that we don't.

#8 @jonoaldersonwp
5 years ago

Agreed. This is whack-a-mole. However, we don't need to detect the bots; if we just filter out particularly high volume referring domains of obviously bad traffic, we'd easily clean up a big chunk.
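As a rough illustration of how that "big chunk" could be sized, here is a minimal sketch that assumes referral visits can be exported as a list of referring hostnames, one per visit. The helper names `top_referrers` and `flagged_share` are hypothetical.

```python
from collections import Counter


def top_referrers(hits, n=10):
    """Return the n highest-volume referring hosts as (host, count) pairs."""
    return Counter(h.lower() for h in hits).most_common(n)


def flagged_share(hits, flagged):
    """Fraction of referral visits that come from hosts in the flagged set."""
    counts = Counter(h.lower() for h in hits)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    flagged_hits = sum(c for host, c in counts.items() if host in flagged)
    return flagged_hits / total
```

Reviewing `top_referrers` periodically and then checking `flagged_share` for the domains judged to be bad gives an estimate of how much referral traffic a short list would remove, without attempting any bot detection.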

I can do/maintain this via GTM if we'd prefer (nice interface, access control, change logs) and chip away at it happily - but that only hides the issue. Feels like if we're going that far, we should probably have a process to block them from the site, rather than just hide them from our tracking (that would also save the bandwidth/processing overhead, etc.)...

@Otto42 The plot thickens: I think all of the sites in question have been hacked, and are running some nasty obfuscated JS. Check the first script which gamefullpc loads. I can't see any evidence that it's preloading in/from there, but I might be missing something?

Last edited 5 years ago by jonoaldersonwp

#9 @jonoaldersonwp
5 years ago

  • Resolution set to maybelater
  • Status changed from new to closed