Proposals/Done/Anti-spam

Revision as of 18:40, 9 May 2011

Problem Areas

User registration

Cutting down on the number of spam user registrations addresses every spam case that requires the user to be logged in. These include sending messages to other users, posting in the forums, and posting feedback to non-public views.

Contact Us
Public view feedback

Current Approach - CAPTCHA

The one in Mahara doesn't work well, and CAPTCHAs in general are becoming less effective against spam.
They can be made harder for spambots, but only at the cost of a lower success rate among real people. This is very frustrating for actual users, and could deter people from using features that require a CAPTCHA. For a business, this could mean lost revenue.
Not accessible (e.g. blind or dyslexic people may be unable to solve a CAPTCHA)

"Better" CAPTCHAs

Audio CAPTCHA

An audio CAPTCHA tries to solve the accessibility problem by reading the CAPTCHA to the user. In addition, the distortions and noise patterns in the image can be increased, since human users can resort to the audio option if needed. However, the audio CAPTCHA is actually far less effective against spammers. Since known words must be used, a dictionary attack could render the CAPTCHA useless. The audio component could also be targeted directly, since voice recognition software is steadily improving.

ReCAPTCHA (http://recaptcha.net/)

ReCAPTCHA uses words from digitized books as the source of the CAPTCHA puzzle. Two words are presented: a word that has been successfully scanned (a "known" word) and a word that hasn't (an "unknown" word). The premise is that if a user correctly types the known word, there is a high probability that they also correctly typed the unknown word. A plus for Mahara is that it helps digitize books, which fits well with Mahara's standing as a tool used in education. A drawback is that ReCAPTCHA is an external service: a Mahara plugin could be written for it, but the CAPTCHA itself couldn't be customized further.

"Math" CAPTCHA

The Math CAPTCHA asks users to solve a simple mathematical or logic puzzle. The puzzle is typically presented in words rather than symbols to prevent a computer from easily solving it. However, even this approach has limited effectiveness. Type "subtract the square root of nine from sixteen squared" into Wolfram Alpha as an example.

A variant on the Math CAPTCHA asks users a common-knowledge question, or a question that all users of the particular service are likely to know the answer to. The same problem applies: computers are increasingly able to answer such questions.

CAPTCHA Conclusions

In general, CAPTCHAs have limited effectiveness, and are an inconvenience to genuine users. An approach that is simultaneously transparent to genuine users and relatively effective would be ideal.

Alternative Methods

Referral system

Only allow a user to register if they have been referred by another user. Similar to invite-based systems like the early Facebook and Gmail. Would be a decent optional feature to provide another barrier against spam, but shouldn't be the primary technique, as it may not be appropriate for many Mahara installations.

Backend "Scoring" (for feedback and contact forms)

Develop an algorithm that "scores" form submissions to determine the likelihood that the submission is spam. Certain keywords can be defined that raise or lower the score of a submission by a specified (possibly even dynamic) amount.
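A minimal sketch of keyword-based scoring follows. The keywords and weights are hypothetical, and the sign convention (negative is more spam-like, positive is more genuine) is an assumption chosen to match the +/- notation used under Scoring Criteria.

```python
import re

# Hypothetical keyword weights: negative values make a submission look more
# like spam, positive values make it look more genuine.
KEYWORD_WEIGHTS = {
    "viagra": -5.0,
    "casino": -3.0,
    "portfolio": 1.0,
    "assignment": 1.0,
}


def score_submission(text, weights=KEYWORD_WEIGHTS):
    """Sum the weight of every known keyword occurrence in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return sum(weights.get(word, 0.0) for word in words)
```

Keeping the weights in a plain table is what would allow them to be adjusted later, either by an administrator or automatically.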

The main challenge with this approach is avoiding false identifications. Obviously, occasionally identifying a spam submission as genuine is preferable to identifying a genuine submission as spam. In both cases, the system should be able to adapt when a false identification is pointed out.

The system could have a "mark as spam" feature that adjusts the algorithm automatically when a false result is identified. Similarly, results can be sorted into searchable "folders" or bins based on score ranges to make it easier to locate and identify genuine submissions that were marked as spam.
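The two ideas in this paragraph could be sketched roughly as below. The folder names, score boundaries, and the fixed adjustment step are all hypothetical, and lower scores are assumed to be more spam-like.

```python
from bisect import bisect_right

# Hypothetical score boundaries and folder names.
BOUNDARIES = [-5.0, 0.0]
FOLDERS = ["spam", "suspect", "genuine"]


def folder_for(score):
    """Pick the folder whose score range contains the given score."""
    return FOLDERS[bisect_right(BOUNDARIES, score)]


def apply_feedback(weights, text, is_spam, step=0.5):
    """Nudge the weight of each distinct word in a misclassified submission.

    Words move towards spam (down) when a moderator marks the submission
    as spam, and towards genuine (up) when it is marked as not spam.
    """
    delta = -step if is_spam else step
    for word in set(text.lower().split()):
        weights[word] = weights.get(word, 0.0) + delta
    return weights
```

A moderator searching the "suspect" folder and correcting a few entries would then gradually shift the weights, without anyone editing them by hand.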

The algorithm would ideally be configurable to be more or less aggressive. This would be done in practice by adjusting the weightings of the various score components.
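One way to picture this is as a weighted sum over per-check scores, with named weighting profiles. The component names, profile names, and numbers below are hypothetical.

```python
# Hypothetical weighting profiles; a larger weight makes that check count more.
PROFILES = {
    "lenient": {"keywords": 0.5, "links": 0.5, "timing": 0.5},
    "strict": {"keywords": 1.5, "links": 2.0, "timing": 1.0},
}


def total_score(components, profile):
    """Combine per-check scores using the weights of the chosen profile."""
    weights = PROFILES[profile]
    return sum(weights[name] * value for name, value in components.items())
```

The same submission then scores lower (more spam-like) under the "strict" profile than under the "lenient" one, without any change to the individual checks.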

This approach could be combined with others. For example, an invisible form field (see Form Tricks) that was filled in could lower the submission score considerably.

A benefit to this approach for Mahara is that it could be highly customized for use specifically with Mahara. Rather than attempting to be an all-purpose anti-spam tool, the algorithm can make use of common characteristics of Mahara users and genuine submissions.

A huge drawback is the limited ability to test the algorithm's effectiveness without exposing it and observing the results. This could cause long development cycles.

Akismet (http://akismet.com/)

Akismet is similar to the scoring method, but implemented as a web service with an API. It is free for use on a "personal blog", and plugins also exist for bulletin boards such as phpBB. In any case, an Akismet plugin would be a very useful addition. Mahara users would be responsible for determining which type of API key they require. Existing plugins: http://akismet.com/development/

SpamAssassin (http://spamassassin.apache.org/)

Similar to Akismet, but not a web service, and completely free and open for all uses. Would have to be installed on a server that the Mahara instance has access to (not necessarily the same server). Not necessarily well suited to applications other than email filtering. Some discussion here: http://wiki.apache.org/spamassassin/BlogSpamAssassin

The actual tests used and their weightings are available: http://spamassassin.apache.org/tests_3_3_x.html

Project Honey Pot http:BL (http://www.projecthoneypot.org/httpbl.php)

Provides a central blacklist, but requires registration and extra links/scripts. Plugins are available for other CMSs (Drupal, Joomla, etc).

Built-in honeypot

A public view intended only for trapping spam. Log and then delete all comments.

Scoring Criteria

Email Address Checks

+ Address is registered on the Mahara instance

+ Address has an associated gravatar

- Address fails validation (could use Email::Valid - see http://search.cpan.org/~rjbs/Email-V...Email/Valid.pm)
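The gravatar and validation checks could start from something like the sketch below. The regular expression is a deliberately loose stand-in, not an equivalent of Email::Valid, and the function names are hypothetical. The URL follows Gravatar's published scheme (MD5 of the trimmed, lowercased address, with d=404 requesting a 404 response when no image exists); the HTTP request itself is left out.

```python
import hashlib
import re

# Loose syntax check only: something@something.tld with no spaces.
SIMPLE_EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def looks_valid(address):
    """Return True if the address passes the loose syntax check."""
    return bool(SIMPLE_EMAIL.match(address.strip()))


def gravatar_probe_url(address):
    """Build a URL that should return HTTP 404 if the address has no gravatar."""
    digest = hashlib.md5(address.strip().lower().encode("utf-8")).hexdigest()
    return f"https://www.gravatar.com/avatar/{digest}?d=404"
```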

Message Body Checks

- Contains many (2+ ?) links

- Contains links that fail Google Safe Browsing API blacklists
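The link-count check might look like the following sketch. The threshold of two links comes from the list above; the penalty value and the URL pattern are assumptions, and the Safe Browsing lookup is not shown.

```python
import re

# Rough pattern for http/https URLs; it will miss bare domains.
LINK_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def link_score(body, threshold=2, penalty=-3.0):
    """Return a penalty if the body contains at least `threshold` links."""
    links = LINK_PATTERN.findall(body)
    return penalty if len(links) >= threshold else 0.0
```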

Timing Checks

- Form submitted very quickly (real users will generally take significantly longer than spammers to complete a form)

- Repeat submissions in a narrow time frame (spammers will often submit many times)
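Both timing checks can be sketched together, assuming the server records when the form was rendered and keeps a short in-memory history per sender. All thresholds, penalties, and names are hypothetical; a real deployment would need persistent storage rather than a module-level dictionary.

```python
from collections import defaultdict, deque

MIN_FILL_SECONDS = 5.0           # faster than this looks automated
REPEAT_WINDOW_SECONDS = 60.0     # window for counting repeat submissions
MAX_SUBMISSIONS_PER_WINDOW = 3   # more than this in the window is penalized

# sender key (e.g. IP address) -> timestamps of recent submissions
_recent = defaultdict(deque)


def timing_score(sender, rendered_at, submitted_at):
    """Score a submission from its fill time and the sender's recent history."""
    score = 0.0
    if submitted_at - rendered_at < MIN_FILL_SECONDS:
        score -= 2.0

    history = _recent[sender]
    history.append(submitted_at)
    # Drop timestamps that have fallen out of the window.
    while history and submitted_at - history[0] > REPEAT_WINDOW_SECONDS:
        history.popleft()
    if len(history) > MAX_SUBMISSIONS_PER_WINDOW:
        score -= 3.0
    return score
```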

Form Tricks

Most of these ideas are from http://nedbatchelder.com/text/stopbots.html or http://www.infinetsoftware.com/blog/...our-web-forms/

Invisible Fields. Most spambots blindly fill in every field in a form. If the form includes an invisible (not to be confused with hidden) field, spambots will likely fill it in, while human users will not. The field can be made invisible in various ways, the simplest being to set display: none in CSS. This is completely transparent to a genuine user, but may have varying effectiveness against spam.
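On the server side, the invisible-field check reduces to testing whether that field came back non-empty. The field name and penalty below are hypothetical.

```python
# Hypothetical name for the invisible field; it should look tempting to a bot.
HONEYPOT_FIELD = "website"


def honeypot_score(form, penalty=-10.0):
    """Return a heavy penalty if the invisible field was filled in."""
    value = form.get(HONEYPOT_FIELD, "")
    return penalty if value.strip() else 0.0
```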

Instead of using descriptive field names like "firstname", "lastname", etc, generate random field names based on a hash. Store the hash in a hidden form field, and use it when the form is submitted to locate data in the form. This stops bots that return to the site expecting the field names they encountered previously.
Check for newlines in single-line fields.
Randomize order of form fields.
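The hashed field names and the newline check could be sketched as follows. This is one possible variant, not a description of an existing Mahara mechanism: it assumes a random per-form token is stored in the hidden field and that field names are derived from it with an HMAC under a site secret. The secret, the token length, and all function names are hypothetical.

```python
import hashlib
import hmac
import secrets

SECRET_KEY = b"replace-with-a-site-secret"  # hypothetical site-wide secret


def new_form_token():
    """Random value stored in a hidden field when the form is rendered."""
    return secrets.token_hex(8)


def field_name(token, logical_name):
    """Derive the obscured field name for a logical name such as 'firstname'."""
    message = f"{token}:{logical_name}".encode("utf-8")
    digest = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    return "f" + digest[:16]


def decode_form(posted, logical_names):
    """Map obscured field names in a submission back to logical names."""
    token = posted["token"]
    return {name: posted.get(field_name(token, name), "") for name in logical_names}


def has_newline(value):
    """Single-line fields should never contain CR or LF characters."""
    return "\n" in value or "\r" in value
```

Because the field names depend on the token, a bot replaying names captured from an earlier page load would submit fields the server no longer recognizes.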

JavaScript trickery. Would stop any bot that doesn't process JavaScript (but also users without JavaScript enabled). Example: display the submit button with JavaScript after the page has loaded.

Subpages

Anti-spam Spec