User Input: Sanitization and Validation

To protect your website from malicious attacks (and also simply to prevent weird errors for users), you need to sanitize and validate user input.

Sanitization

Sanitization means removing or neutralising any potentially malicious content, such as a <script> tag that a user puts in a comment so that it runs, unnoticed, on the page.

Validation

Validation means checking the data the user inputted is correct for your use case (e.g. that an email address has a specific format). This is usually more about data correctness than security, but is still relevant here.
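As a minimal sketch of the email example, PHP's built-in filter_var() can do the format check. The helper name validateEmail is made up for illustration, and a real form may well need stricter or looser rules than this.

```php
<?php
// Hypothetical helper: returns the trimmed address if it has a valid email
// format, or null if it does not.
function validateEmail(string $input): ?string
{
    $trimmed = trim($input);
    // filter_var() returns the filtered value on success and false on failure.
    $result = filter_var($trimmed, FILTER_VALIDATE_EMAIL);
    return $result === false ? null : $result;
}

var_dump(validateEmail('  alice@example.com '));
var_dump(validateEmail('not-an-email'));
```

Note that this only checks the format. It says nothing about whether the address exists, and it does not make the value safe to drop into HTML or SQL.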

The importance of sanitizing and validating user input

All user input is, by definition, out of your control - and thus you cannot trust it. It might contain malicious scripts, or any number of other things you do not want it to contain.

There are two critical places where you need to be careful, though one of them is more of a bygone relic in modern development terms (at least for remotely competent developers).

SQL Injection

If you're at risk of this, you are doing something very, very wrong. Illidan Stormrage will be having a word with you.

So long as you never use string substitution to build SQL queries that involve user input, and parameterize ALL inputs to a SQL query (e.g. using the ? placeholder syntax), SQL injection is not a problem. In the past it was a problem because developers did things like this, using PHP as an example language:

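// VULNERABLE: shown only as an example of what not to do.
// getUserInputFromHTMLForm() stands in for however you read the form value.
$userInput = getUserInputFromHTMLForm();
$query = "SELECT column1 FROM some_table WHERE column2=\"" . $userInput . "\"";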

Here, we're not parameterising our query - we're putting the user input directly into the query string. As a result, there is no way for the SQL engine we are using to differentiate between a value we intended to be used in the query, and other query syntax. A malicious user can therefore submit input that ends the current query and executes another query you did not intend to execute - like dropping your entire database, destroying a bunch of data, or outputting sensitive info onto a page where you did not want it to be visible.
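To make that concrete, here is a small sketch of what the concatenated string looks like once a hostile value is dropped in. The payload is a made-up example, and it shows a milder variant of the attack (widening the WHERE clause rather than running a second query). It assumes a database that accepts double-quoted string literals, as the snippet above does.

```php
<?php
// Hypothetical attacker-supplied value, for illustration only.
$userInput = 'x" OR "1"="1';

// Same concatenation as in the vulnerable snippet: the input becomes SQL text.
$query = "SELECT column1 FROM some_table WHERE column2=\"" . $userInput . "\"";

echo $query, PHP_EOL;
// SELECT column1 FROM some_table WHERE column2="x" OR "1"="1"
```

The quote inside the input closes the string literal early, so everything after it is read as SQL, and the condition now matches every row.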

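Per an (in)famous XKCD comic ("Exploits of a Mom", the one with Little Bobby Tables), this is exactly how a school loses its student records.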

Instead, use a parameterized query and a prepared statement, like so (using the PDO library in PHP as an example):

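// $dsn, $username, $password, $options and $userInput are assumed to be defined elsewhere.
$databaseConnection = new \PDO($dsn, $username, $password, $options);
$selectQuery = "SELECT column1 FROM some_table WHERE column2 = ?";
try {
	// The query text and the value are sent separately; the ? is only ever filled with data.
	$selectStmt = $databaseConnection->prepare($selectQuery);
	$selectStmt->execute([$userInput]);
	$rows = $selectStmt->fetchAll(\PDO::FETCH_ASSOC);
} catch (\PDOException $e) {
	error_log(print_r($e, true));
}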

This way, the SQL engine knows that anything that is passed as a value to the search of column2 is a parameter, and is not to be treated as SQL to be run as part of the command, no matter what it contains.
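A self-contained way to see this for yourself is to run a hostile value through a prepared statement against a throwaway in-memory database. This is only a sketch: it assumes the pdo_sqlite extension is installed, and the rows and payload are invented for the demonstration.

```php
<?php
// Assumes the pdo_sqlite extension is available; table contents are made up.
$db = new \PDO('sqlite::memory:');
$db->setAttribute(\PDO::ATTR_ERRMODE, \PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE some_table (column1 TEXT, column2 TEXT)');
$db->exec("INSERT INTO some_table VALUES ('secret', 'alice'), ('public', 'bob')");

$stmt = $db->prepare('SELECT column1 FROM some_table WHERE column2 = ?');

// Hypothetical malicious input: compared as one plain string, so nothing matches.
$stmt->execute(["x' OR '1'='1"]);
$hostileRows = $stmt->fetchAll(\PDO::FETCH_COLUMN);

// Ordinary input still works as expected.
$stmt->execute(['bob']);
$normalRows = $stmt->fetchAll(\PDO::FETCH_COLUMN);

var_dump($hostileRows, $normalRows);
```

The hostile value returns no rows because the engine looks for a column2 that is literally equal to that whole string, quotes and all.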

Cross-Site Scripting (XSS)

The other security risk is from not removing HTML tags from user input. Suppose you have a form that asks the user to write their name, then shows their name on the page in another box. What if the user writes something like this?

<script>alert('Your mother was a hamster');</script>

If you output this directly onto the page as HTML, it will not be shown as text but executed as a script - causing the user's browser to display a popup with the given message. Of course, that script could do anything - it could connect to another website, send some form submission, and on top of that it has access to the user's cookies and browsing session for the website it is running on. As far as the browser knows, this script came from the website itself, so it's trusted.

Preventing this means sanitising user input so that HTML-specific characters are encoded as HTML entities (for example, < becomes &lt;), so that the browser treats them as plain characters to display rather than markup to execute. You should do this client-side, before displaying anything that contains user input, to ensure malicious scripts are not executed. Doing it client-side at the moment the user submits input is not useful, as the user can easily override the client-side scripts that do this and thus submit unsanitised data to your server anyway.
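Here is a minimal sketch of the encoding step itself, kept in PHP to match the earlier examples and assuming the page is rendered by a PHP template. The helper name escapeForHtml is hypothetical. The point it illustrates is that encoding happens at the moment of output; in browser script the usual equivalent is to assign the value to textContent rather than innerHTML.

```php
<?php
// Hypothetical helper: encode a value so it is safe to use as text inside an
// HTML element. htmlspecialchars() converts &, <, > and (with ENT_QUOTES)
// both kinds of quote into HTML entities.
function escapeForHtml(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

$name = "<script>alert('Your mother was a hamster');</script>";

// The browser now displays the tag as text instead of running it.
echo '<p>Hello, ' . escapeForHtml($name) . '</p>', PHP_EOL;
```

This covers text placed inside an element or a quoted attribute. Values placed into URLs, inline scripts or style blocks need encoding rules of their own.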

Pitfalls

Server-side HTML sanitisation when using a library

Libraries that deal with user input, e.g. rich text editors like TinyMCE, will often expect to receive unsanitised input and then sanitise it themselves. If you sanitise the input server-side, you can disrupt this, making the library unable to parse the user content correctly. In these cases you should avoid sanitising the user input server-side, and instead make certain that the user content is ALWAYS sanitised in your client-side scripts before being shown to the user.