Seventy Percent of My Traffic Was Fake, and Here Is How I Proved It With One API Call

The analytics dashboard showed ten million monthly visits. Ten million. That number should have been cause for celebration, and for a while, it was. The traffic graphs pointed upward, the page view counts accumulated impressively, and the bandwidth usage reflected a site that appeared to be thriving. But a persistent, nagging inconsistency refused to go away. The engagement metrics told a completely different story: bounce rates were astronomically high, session durations were suspiciously short, and conversion rates were abysmal relative to the traffic volume. And the bandwidth bills from the hosting provider were staggering, far exceeding what ten million human visitors should reasonably consume, because many of these "visitors" were requesting pages at a rate and in patterns that no human browsing session would produce.

The suspicion started as a quiet hunch and grew into a conviction over months. Something about the traffic was wrong. The server logs showed enormous volumes of requests from user agents claiming to be Googlebot, Bingbot, ChatGPT's crawler, and various other legitimate search engine crawlers. On the surface, this seemed normal. A large site naturally attracts heavy crawler activity. But the volume was disproportionate, and the behavior patterns were strange. Legitimate crawlers follow robots.txt directives, space their requests to avoid overwhelming the server, and come from known IP ranges associated with their respective companies. Much of this traffic did none of those things. It hammered the server relentlessly, ignored crawl-delay directives, and originated from IP addresses that belonged to cloud hosting providers rather than Google or Microsoft.

The definitive test was surprisingly simple. Take the IP address of a request that claims to be Googlebot and check whether it actually belongs to Google. Real Googlebot exclusively originates from IP addresses within Google's autonomous system, AS15169. If a request claims to be Googlebot but comes from an AWS IP address, or a DigitalOcean IP address, or any IP outside of Google's known ranges, it is unequivocally fake. One API call to the bot detection service with the IP address and user agent string, and the verdict came back instantly: not a legitimate Google crawler. That single call, repeated across a sample of the traffic, revealed that roughly seventy percent of all visits were from bots impersonating legitimate crawlers. The ten million monthly visits were closer to three million real ones plus seven million requests from imposters consuming server resources, inflating bandwidth costs, and polluting every analytics metric in the process.
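As a minimal sketch of that single call, here is what the check looks like in Python. The endpoint URL, parameter names, and response fields are assumptions for illustration; the real service's API may differ.

```python
import requests  # third-party: pip install requests

# Hypothetical endpoint and field names, for illustration only; the
# real service's URL, parameters, and response shape may differ.
API_URL = "https://api.example.com/v1/verify-crawler"

def verify_crawler(ip: str, user_agent: str) -> dict:
    """Ask the detection service whether this IP really belongs to the
    crawler its user agent claims to be."""
    resp = requests.get(
        API_URL,
        params={"ip": ip, "user_agent": user_agent},
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json()

# A request claiming to be Googlebot but arriving from a cloud provider:
verdict = verify_crawler(
    "198.51.100.23",  # documentation-range address standing in for a real one
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
)
print(verdict.get("legitimate"))  # False for an imposter
```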

The Moment the Numbers Stopped Making Sense

The realization did not arrive as a sudden epiphany. It accumulated through small observations over months. The first clue was the bandwidth bill. The hosting provider charged for data transfer, and the monthly bill was climbing steadily even though the content on the site had not grown proportionally. More pages were being served, but the content per page had not changed significantly. The additional bandwidth was being consumed by something, and the access logs pointed to crawler traffic as the primary driver. That seemed reasonable for a site of that size, so the concern was filed away as a cost of doing business.

The second clue was the server load. CPU usage during peak traffic hours was consistently higher than expected. The application was well-optimized, with caching at multiple layers, and the hardware specification should have handled the traffic comfortably. But the load averages told a different story. The server was working hard, and the extra work correlated not with user-facing traffic peaks but with sustained, around-the-clock request volume that never dipped to zero. Real human traffic follows predictable patterns. It peaks during business hours, drops at night, and varies by day of the week. Bot traffic runs twenty-four hours a day, seven days a week, at a constant rate, and it was visible in the load graphs as a baseline that never went below a certain threshold.

The third clue, and the one that finally triggered the investigation, was the analytics discrepancy. Google Analytics, which only tracks JavaScript-executing visitors, showed substantially less traffic than the server access logs. The difference between the two numbers was the bot traffic. Real browsers execute JavaScript and register in analytics. Bots that request HTML pages without executing JavaScript show up in server logs but not in analytics. A significant gap between the two is a strong indicator of heavy bot activity, and the gap on this site was enormous.
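As a rough way to quantify that gap, one can count daily requests in the access log and set them against the sessions exported from the analytics tool. A minimal sketch, assuming a combined-format log and hypothetical analytics numbers:

```python
import re
from collections import Counter

# Matches the date portion of a combined-format access log timestamp,
# e.g. [12/Mar/2024:10:15:03 +0000]
LOG_DATE = re.compile(r"\[(\d{2}/\w{3}/\d{4})")

def daily_request_counts(log_path: str) -> Counter:
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            match = LOG_DATE.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts

log_counts = daily_request_counts("access.log")
# Hypothetical daily sessions exported from the analytics tool:
analytics_counts = {"12/Mar/2024": 95_000}

for day, tracked in analytics_counts.items():
    total = log_counts[day]
    if total:
        print(f"{day}: {total} requests, {tracked} tracked, "
              f"{(total - tracked) / total:.0%} untracked")
```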

Armed with these observations, the investigation began in earnest. A sample of one thousand access log entries claiming to be from Googlebot was extracted, and the IP addresses were checked against Google's published IP ranges. The result was damning. Over seven hundred of those one thousand requests came from IP addresses with no association with Google whatsoever. They originated from AWS, Hetzner, OVH, and various other hosting providers. The user agent string said Googlebot, but the IP address said "random server in a data center." Extending the analysis to Bingbot, ChatGPT's crawler, and other claimed identities produced similar results. The traffic was overwhelmingly fake.
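Google publishes Googlebot's IP ranges as a JSON file, which makes this sampling check easy to reproduce. A sketch of the range check (the URL is current as of writing; the sampled addresses below are illustrative stand-ins):

```python
import ipaddress
import json
import urllib.request

# Googlebot's published IP ranges (check Google's documentation if
# this URL has moved).
RANGES_URL = "https://developers.google.com/search/apis/ipranges/googlebot.json"

with urllib.request.urlopen(RANGES_URL) as resp:
    prefixes = json.load(resp)["prefixes"]

NETWORKS = [
    ipaddress.ip_network(p.get("ipv4Prefix") or p["ipv6Prefix"])
    for p in prefixes
]

def is_real_googlebot_ip(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in network for network in NETWORKS)

# Illustrative stand-in for the one thousand sampled addresses:
claimed = ["66.249.66.1", "203.0.113.7"]
fakes = [ip for ip in claimed if not is_real_googlebot_ip(ip)]
print(f"{len(fakes)} of {len(claimed)} claimed Googlebot IPs failed the check")
```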

How One API Call Verifies Any Crawler's Identity

The verification process that revealed the fake traffic is conceptually simple but practically tedious to implement from scratch. Each major search engine and crawler operates from a specific set of IP ranges tied to the operating company's autonomous system numbers. Google uses AS15169. Microsoft uses several ASNs for Bing's infrastructure. OpenAI's crawler uses its own designated ranges. Verifying a crawler means taking the IP address of the incoming request, performing a reverse DNS lookup, confirming the hostname matches the expected pattern, performing a forward DNS lookup to confirm the hostname resolves back to the same IP, and checking whether the IP falls within the expected ASN. This multi-step verification catches sophisticated fakes that might pass one or two checks but fail the complete chain.
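For the DNS portion of that chain, a from-scratch sketch using Google's documented hostname suffixes looks like this (the ASN check, which requires routing data, is omitted here, and socket.gethostbyname only covers IPv4):

```python
import socket

def verify_googlebot_dns(ip: str) -> bool:
    """Reverse DNS, suffix check, then forward DNS back to the same IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
    except socket.herror:
        return False  # no PTR record at all
    # Real Googlebot hosts resolve under googlebot.com or google.com.
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        forward_ip = socket.gethostbyname(hostname)  # forward lookup
    except socket.gaierror:
        return False
    return forward_ip == ip  # the chain must close on the same address

print(verify_googlebot_dns("66.249.66.1"))  # True only for real Googlebot
```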

The bot detection API encapsulates this entire verification chain into a single call. Send the IP address and the claimed user agent string, and the API returns a verdict, legitimate or fake, along with the evidence supporting the determination: the ASN of the IP address, the reverse DNS result, the expected ASN for the claimed identity, and the confidence level of the assessment. For the seventy percent of traffic that was fake, the evidence was unambiguous. The IP addresses belonged to cloud hosting providers, the reverse DNS returned generic hostnames that had nothing to do with Google or Microsoft, and the ASN was completely wrong for the claimed identity.
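A verdict for one of the fake requests might look like the following. The field names here are assumptions for illustration, not the API's documented schema:

```python
# Assumed response shape, for illustration; real field names may differ.
verdict = {
    "legitimate": False,
    "claimed_identity": "googlebot",
    "ip": "198.51.100.23",                     # documentation-range example
    "asn": 14061,                              # DigitalOcean, not Google
    "expected_asn": 15169,                     # Google's AS
    "reverse_dns": "static.example-host.com",  # generic hosting hostname
    "confidence": "high",
}

if not verdict["legitimate"]:
    print(f"Fake: ASN {verdict['asn']} does not match "
          f"expected AS{verdict['expected_asn']}")
```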

What makes this approach definitive rather than heuristic is that it relies on verifiable network infrastructure data, not behavioral analysis. A sophisticated bot can mimic human browsing patterns, randomize its request timing, execute JavaScript, and even solve CAPTCHAs. But it cannot change the autonomous system number of the IP address it connects from. If a request claims to be Googlebot but originates from an AWS data center, it is fake. There is no gray area, no probability score, no false positive concern. The network infrastructure does not lie, and the API simply exposes that truth in a format that can be consumed programmatically.

What Changed After the Fake Traffic Was Identified

Knowing that seventy percent of traffic was fake immediately changed every business decision that had been based on traffic metrics. The actual audience was three million monthly visitors, not ten million. The real conversion rate was more than three times higher than the calculated rate, because the denominator had been inflated by seven million non-existent users. The true engagement metrics were respectable rather than embarrassingly low. Every report that had been generated, every strategy meeting that had referenced traffic numbers, every capacity planning decision that had been based on growth projections was built on a foundation of polluted data. The fake traffic had not just consumed server resources. It had distorted the entire analytical framework of the business.
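The arithmetic behind that correction is worth spelling out. With a hypothetical conversion count, deflating the denominator from ten million to three million multiplies the rate by roughly 3.3:

```python
conversions = 30_000                          # hypothetical count
rate_reported = conversions / 10_000_000      # 0.30% against inflated visits
rate_actual = conversions / 3_000_000         # 1.00% against human visits
print(round(rate_actual / rate_reported, 2))  # 3.33
```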

The immediate technical action was to implement blocking at the server level. Every incoming request that claimed to be a search engine crawler was verified against the API in real time. Requests that failed verification were blocked before they reached the application layer. The effect was dramatic and immediate. Bandwidth consumption dropped sharply. Server CPU usage during off-peak hours fell to a fraction of its previous level. The application response times improved because the server was no longer wasting resources rendering pages for bots that would never index them. The hosting bill decreased proportionally.
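As a minimal sketch of that blocking layer, here is a WSGI middleware built on the hypothetical verify_crawler helper sketched earlier, with a cache so the API is not called on every request. Real deployments usually push this check down to the reverse proxy or CDN instead:

```python
from functools import lru_cache

# User agent fragments that trigger verification; extend as needed.
CRAWLER_TOKENS = ("googlebot", "bingbot", "gptbot", "yandexbot")

@lru_cache(maxsize=100_000)
def is_fake_crawler(ip: str, user_agent: str) -> bool:
    ua = user_agent.lower()
    if not any(token in ua for token in CRAWLER_TOKENS):
        return False  # not claiming to be a known crawler
    # verify_crawler is the hypothetical API helper sketched earlier.
    return not verify_crawler(ip, user_agent).get("legitimate", False)

class BlockFakeCrawlers:
    """Rejects impersonators before they reach the application layer."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        ip = environ.get("REMOTE_ADDR", "")
        ua = environ.get("HTTP_USER_AGENT", "")
        if ip and is_fake_crawler(ip, ua):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return self.app(environ, start_response)
```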

The analytical cleanup took longer but was equally important. With the fake traffic filtered out, the analytics data became trustworthy for the first time. User behavior patterns became visible without the noise floor of bot activity. Actual traffic trends could be identified and correlated with marketing efforts. The content that was genuinely attracting human visitors could be distinguished from the content that was only attracting bots. This clarity transformed decision-making from guesswork based on polluted data to analysis based on reality.

The Scale of the Problem Across the Internet

This experience was not an outlier. Industry estimates consistently place bot traffic at thirty to fifty percent of all internet traffic globally, and for individual sites the proportion can be much higher. Sites with large page counts, high domain authority, or valuable content attract bot traffic disproportionately. Scrapers, fake crawlers, competitive intelligence bots, price monitoring bots, SEO analysis bots, and various flavors of malicious automation all contribute to the total. Most site operators have no visibility into this traffic because they rely on analytics tools that only measure JavaScript-executing visitors, leaving the entire bot layer invisible.

The financial impact extends beyond bandwidth costs. Advertising platforms charge based on impressions and clicks. If bot traffic generates ad impressions, those impressions inflate the numbers and distort campaign performance metrics. A/B testing frameworks that include bot visits in their sample produce unreliable results. Rate limiting and abuse detection systems calibrated against total traffic will be incorrectly tuned if the majority of that traffic is not human. Even SEO strategy can be affected, as server logs showing heavy crawl activity might be mistaken for evidence that search engines are deeply indexing the site, when in reality the crawlers are fakes and the real search engines are allocating a much smaller crawl budget.

The bot detection service was born directly from this experience. The verification logic that was built to clean up one site's traffic was generalized into an API that any site can use to verify crawler identities. The eight specific detectors cover crawlers from Google, Bing, OpenAI, Yandex, DuckDuckGo, Qwant, and Seznam, providing targeted verification for the most commonly impersonated crawlers. The result is that any site operator can run the same investigation that revealed the seventy percent fake traffic figure, and most of them will discover that their own numbers are similarly inflated. The first step toward fixing the problem is proving that it exists, and that proof is one API call away.

Frequently Asked Questions

How can I tell if my site has significant fake bot traffic?

Compare your server access logs against your JavaScript-based analytics. A large gap between the two numbers indicates substantial bot activity. Additionally, check the IP addresses of requests claiming to be from search engines. If many originate from cloud hosting providers rather than the expected company networks, they are fake.

What is the difference between a real Googlebot and a fake one?

Real Googlebot originates exclusively from IP addresses within Google's autonomous system AS15169. Fake Googlebot uses the same user agent string but connects from IP addresses belonging to cloud hosting providers like AWS, DigitalOcean, or Hetzner. The user agent string is trivially easy to spoof, but the IP address reveals the true origin.

Will blocking fake bots affect my search engine rankings?

No. Blocking fake bots only affects requests from IP addresses that do not belong to the legitimate search engine. Real Googlebot, Bingbot, and other legitimate crawlers will continue to access the site normally because they pass the verification check. Only imposters are blocked.

How much bandwidth can be saved by blocking fake bot traffic?

The savings depend on the proportion of fake traffic. Sites with heavy fake bot activity commonly see bandwidth reductions of forty to sixty percent after implementing verification and blocking. For sites with high bandwidth costs, this can translate to significant monthly savings.

Can fake bots execute JavaScript and appear in Google Analytics?

Some sophisticated bots do execute JavaScript, which means they can appear in analytics tools. However, the majority of fake crawlers are simple HTTP request generators that do not render JavaScript. IP-based verification catches both types because it does not rely on behavioral analysis but on the verifiable network origin of the request.

How does the bot detection API handle new or unknown crawlers?

The API includes specific detectors for the eight most commonly impersonated crawlers. For unknown user agents, the API provides ASN information and reverse DNS data that allows the caller to make their own determination. The general principle applies universally: verify the IP address against the claimed identity's known infrastructure.