
Spam filter for Search Engines


How many web pages do you find valuable?

Even though more than 14 billion web pages are indexed on the web, according to WorldWideWebSize, you probably can't name even one million useful pages.

What are all these other pages for?

Most of these pages are spam. We all know about the spam we receive by email; even Bill Gates receives a lot of it. Here, though, I am talking about search-result spam, which is a much larger problem than email spam. Last night, I was watching a discussion between Matt Cutts (Google), Harry Shum (Bing), and Rich Skrenta (Blekko). Even though they have their differences, the majority of their discussion focused on improving search results, and in particular on the challenges they face from spam websites and irrelevant search results.



What can we do?

I thought it was important to discuss this problem with computer engineering graduates on this blog. If you can build good spam filters for search engines, it would be a great help to millions of users. In this article, I will talk about the reasons for spam, what the big companies are doing, and, last but not least, how you can develop a spam-filtering algorithm.

What are these spam websites, and why are they polluting our web?

Spam websites are sites whose content is not relevant to the end user. They pollute our web for MONEY.

For a better perspective, you need to understand the commercial side. Most genuine website owners make money through referral systems or advertisements published on their sites, and the amount they earn is directly proportional to the number of visitors they get. A large chunk of those visitors comes from search engines.

To earn quick money, spam publishers build their sites for search engines rather than for end users. After optimization, the site starts receiving traffic. That doesn't sound so bad, but when a visitor finds no relevant content there, it frustrates them, and they conclude that the search engine is poor or that the query simply has no good results on the internet. The worst part is that those end users are us.

Is it possible to optimize a website without having content?

Yes. It is difficult, but possible. Take the mid-1990s, before Google's PageRank algorithm existed: anyone could optimize a site simply by registering a domain name matching the query and repeating strings.
Example: to optimize a site for the query "XYZ" in the 90s, a spam website owner would buy the domain name XYZ dot com and repeat the string "XYZ" many times on the page. When someone searched for XYZ, that site came up as the first result.


What big companies are doing?

Google tweaks its algorithm in big ways to filter these sites, but the spammers keep getting smarter too; they now target Google's PageRank algorithm directly. Search engines like Blekko are developing algorithms such as AdSpam to block spam sites, but these still need a lot of improvement.

What do spam sites have in common?

Common traits of spam sites are:

  • They all have some type of advertisement or referral associated with them.
  • They do not have more than 20 pages of original content.
  • Their domain names are very long.
  • They link to many irrelevant websites (sometimes more than 20).
  • Sometimes their font color and background color are the same.
  • They send data through URL links (GET parameters).

How to develop a search engine spam filtering System?

As mentioned above, if we can recognize spam, we can easily filter it.
To filter most spam, look for sites that show all the above signals: advertisements or referrals, no more than 20 pages of content, a long domain name, many irrelevant outbound links, or matching font and background colors. From experience I can say that, 99 percent of the time, such a site is not there to help the user but to misguide them for money.

So our spam filtering algorithm should do the following tests:

Count the number of words in the URL (using, e.g., fopen and a count function)

Most big, useful sites have short, memorable domain names, such as Google, Amazon, Bing, Wikipedia, or iProject Ideas. Genuine site owners want visitors to remember their website name; spam sites depend on search engines instead.
So check whether the domain name is unusually long, say longer than about 20 characters (a simple if-else check).
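This check can be sketched in Python. Both the 20-character threshold and the stripping of a leading "www." are my assumptions, not rules from the article:

```python
from urllib.parse import urlparse

def domain_too_long(url: str, max_len: int = 20) -> bool:
    """Return True if the domain name exceeds max_len characters."""
    host = urlparse(url).netloc.lower()
    # Strip a leading "www." so only the memorable part is measured.
    if host.startswith("www."):
        host = host[len("www."):]
    return len(host) > max_len

print(domain_too_long("https://www.wikipedia.org/"))                        # False
print(domain_too_long("https://www.buy-cheap-shoes-online-right-now.com"))  # True
```

A real filter would also look at the number of hyphen-separated words in the host, since keyword-stuffed domains tend to be strings of hyphenated query terms.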

Count the number of links on the page that use the GET method (use a regular expression).

The difference between a simple URL and a GET-method URL: a simple URL looks like www.XYZ.com/ while a GET-method URL looks like www.XYZ.com?abc=4

I put much emphasis on counting URLs with GET parameters because a referral to another site doesn't normally require sending any data in the URL. For example, if I link to a Wikipedia page, I don't need to send GET data with it.
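As a sketch, here is one way to count GET-style links with a regular expression in Python. The sample HTML and the exact pattern are my own; a production crawler would use an HTML parser rather than a regex:

```python
import re

HTML = """
<a href="https://en.wikipedia.org/wiki/Spam">clean referral link</a>
<a href="http://www.xyz.com/?abc=4">GET link</a>
<a href="http://tracker.example.com/click?id=99&ref=12">GET link</a>
"""

# Match href values whose URL carries a query string:
# a '?' followed by at least one key=value pair.
get_links = re.findall(r'href="([^"]*\?[^"]*=[^"]*)"', HTML)

print(len(get_links))  # 2
print(get_links)
```

Comparing the count of GET-style links against the total number of links on the page gives a ratio you can feed into the overall spam score.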

Check font and background color

If the font color and the background color are the same, it clearly signifies that the site owner has something to hide from the user but wants the search engine to read it.
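A minimal sketch of this test, assuming colors are set through simple inline style attributes (real pages mostly set colors via external CSS, so a serious filter would need a proper CSS parser):

```python
import re

def hidden_text_spans(html: str):
    """Return style attributes where font color equals background color."""
    suspicious = []
    for style in re.findall(r'style="([^"]*)"', html):
        # (?<!-) keeps "color:" from also matching inside "background-color:".
        color = re.search(r'(?<!-)color\s*:\s*(#?\w+)', style)
        bg = re.search(r'background(?:-color)?\s*:\s*(#?\w+)', style)
        if color and bg and color.group(1).lower() == bg.group(1).lower():
            suspicious.append(style)
    return suspicious

page = '<p style="color:#ffffff; background-color:#ffffff">keyword keyword</p>'
print(hidden_text_spans(page))  # ['color:#ffffff; background-color:#ffffff']
```

Any non-empty result means the page contains text that is invisible to the visitor but readable by the crawler, which is a strong spam signal.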

Check whether the given links point to relevant websites

You can check their quality by applying the above steps to those linked sites as well.
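Putting the tests together, a toy scoring scheme might look like this. All thresholds and weights here are my own guesses for illustration, not values from the article:

```python
def spam_score(domain_len: int, page_count: int, get_links: int,
               hidden_text: bool, outbound_links: int) -> int:
    """Combine the signals above into a single score."""
    score = 0
    if domain_len > 20:       # unusually long domain name
        score += 1
    if page_count < 20:       # little original content
        score += 1
    if get_links > 5:         # many GET-parameter links (threshold is a guess)
        score += 1
    if hidden_text:           # font color == background color
        score += 2            # weighted higher: a deliberate deception signal
    if outbound_links > 20:   # many irrelevant outbound links
        score += 1
    return score

def is_spam(score: int, threshold: int = 3) -> bool:
    return score >= threshold

print(is_spam(spam_score(36, 5, 12, True, 25)))   # True
print(is_spam(spam_score(13, 200, 1, False, 4)))  # False
```

The same scoring function can then be applied recursively to each outbound link, as the step above suggests, to judge whether a site keeps good or bad company.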

In the paragraphs above, I have tried to give you some help with spam filters. I hope it was useful. If you have any queries regarding this project, you can leave a comment.

Related PDF:
Filtering spam using search engines


1 comment:

Anonymous said...

How do I use regular expressions to check whether there are links called by the GET method? Could you give some reference links?