Secure Coding — Validation and Encoding

James Ma
9 min read · Oct 14, 2023

Developing a secure web application is challenging, especially when user verification isn’t always foolproof. One of the most important tasks in securing your web application is to ensure proper validation of input and encoding of output to safeguard against malicious attacks.

Input validation serves as the gatekeeper, ensuring that the data entering your application conforms to the expected data type and structure. By implementing thorough validation checks, you can fortify your application against potentially harmful or malformed input.

Output encoding, on the other hand, transforms data from untrusted sources into a secure format, safeguarding external systems like databases, parsers, and web pages from potential exploitation. This conversion ensures that data remains safe and usable within your application environment.

The key to effective validation and encoding lies in knowing what the expected data should be, and how it will be used within the application environment. A firm grasp of these principles bolsters your application’s security and gives you confidence that the data it relies on adheres to the desired standards.

Here are some key considerations for validation and encoding:

1. Use syntactic validation for correct input syntax

This is a fundamental step in ensuring the security and reliability of software applications. Take steps to validate whether the input data conforms to the expected syntax or a specific data format.

For instance, when users submit data via web forms, APIs, or other interfaces, it is imperative to scrutinize if the information aligns with anticipated data structures. This scrutiny extends to common fields like phone numbers, birthdates, email addresses, or file types.

Syntactic validation helps prevent various security vulnerabilities and errors that can arise from malformed or unexpected input.
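As a sketch, syntactic checks for two common fields might look like the following in Python. The email pattern here is deliberately simplified for illustration; production-grade email validation is considerably more involved.

```python
import re
from datetime import datetime

# Simplified email shape: local part, '@', domain with a dot. Illustrative only.
EMAIL_RE = re.compile(r'^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$')

def is_valid_email(value: str) -> bool:
    """Check that the input matches a basic email shape."""
    return EMAIL_RE.fullmatch(value) is not None

def is_valid_birthdate(value: str) -> bool:
    """Check that the input parses as a real ISO date (YYYY-MM-DD)."""
    try:
        datetime.strptime(value, '%Y-%m-%d')
        return True
    except ValueError:
        return False

print(is_valid_email('alice@example.com'))  # True
print(is_valid_birthdate('2023-02-30'))     # False: February has no 30th day
```

Note that the date check goes beyond a digit pattern: parsing with `strptime` rejects well-formed strings that are not real dates.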

2. Use semantic validation for correct business rule validation

Semantic validation involves checking whether the business logic and rules embedded within your application function as intended. This ensures that the application behaves as expected and doesn’t allow unintended or malicious actions.

A classic example of semantic validation is confirming that a start date always precedes an end date or that prices consistently fall within predefined and anticipated ranges.

By incorporating semantic validation, you install a vigilant guardian within your software, stopping rule violations and deviations from your application’s intended behavior and ensuring compliance with critical business rules.
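The date-ordering and price-range rules above can be sketched as a small check. The specific price bounds are an assumption for illustration; real limits come from your business requirements.

```python
from datetime import date

def validate_booking(start: date, end: date, price: float) -> list:
    """Return a list of business-rule violations (empty means valid)."""
    errors = []
    if start >= end:
        errors.append('start date must precede end date')
    if not (0 < price <= 10_000):  # assumed acceptable price range
        errors.append('price outside expected range')
    return errors

print(validate_booking(date(2023, 5, 2), date(2023, 5, 1), 50.0))
# ['start date must precede end date']
```

Both inputs are already syntactically valid dates and numbers; only semantic validation catches that they violate the business rules.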

3. Use an allowlist validation

Often referred to as a “positive” security approach, allowlists assertively specify what values are permitted to access and engage with your application’s functionality.

Allowlist validation is more robust than blocklist validation. The allowlist serves as the gatekeeper, meticulously defining the patterns and criteria that incoming data must meet to proceed. Any input failing to align with the specified validation pattern is promptly and decisively declined. Allowlists can specify the expected type, length, size, or numeric range, thereby setting clear boundaries for user input deemed suitable for processing.

By adopting an allowlist validation strategy, you proactively fortify your application’s defenses, ensuring that only authorized and conforming data gains access.
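A minimal sketch of the allowlist approach, using a fixed set of permitted values for one field and a strict pattern for another (the field names and limits here are hypothetical):

```python
import re

# Only these sort fields are ever accepted; everything else is rejected.
ALLOWED_SORT_FIELDS = {'name', 'created_at', 'price'}

# Usernames: lowercase letters, digits, underscores; length 3-20.
USERNAME_RE = re.compile(r'^[a-z0-9_]{3,20}$')

def validate_sort_field(field: str) -> str:
    if field not in ALLOWED_SORT_FIELDS:
        raise ValueError(f'unsupported sort field: {field!r}')
    return field

def validate_username(name: str) -> str:
    if not USERNAME_RE.fullmatch(name):
        raise ValueError('username must be 3-20 lowercase letters, digits, or underscores')
    return name
```

Notice that neither function enumerates bad input: anything outside the defined-good set fails by default.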

4. Avoid blocklist validation if you can

Blocklists operate on a negative security model, delineating what should be disallowed while permitting everything else to pass through. The focus is on identifying known malicious characters and patterns, but this leaves the door open for anything that doesn’t precisely match those criteria.

However, blocklist validation is generally discouraged: attackers routinely bypass blocklists with alternate encodings, obfuscation, and payloads the list’s authors never anticipated. An allowlist validation approach is often not only more secure but also more straightforward to implement.

With allowlists, you define what constitutes expected, safe data, and only data meeting those specific criteria is allowed to proceed. Anything that doesn’t align with the predefined allowlist is automatically rejected.

5. Prefer RegEx for input validation

A regular expression, or regex, is a sequence of characters that defines a search pattern, and it is a natural way to express an allowlist for input validation. A regex can be a single character or a far more complicated pattern, and a single pattern can capture the expected type, shape, and length of acceptable input. Regex is a useful but difficult-to-master tool.

Beyond validation, regex can also be used to search, edit, and manipulate text through all kinds of search-and-replace operations.

Do exercise caution, particularly when employing regex for search and replace operations, as these can introduce unintended consequences if not executed with care — especially in scenarios involving multiple rounds of replacements.
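A classic illustration of the replacement pitfall mentioned above: stripping a forbidden token in a single pass can reassemble the very token it tried to remove.

```python
# Naive single-pass removal of <script> tags.
payload = '<scr<script>ipt>alert(1)</scr</script>ipt>'
cleaned = payload.replace('<script>', '').replace('</script>', '')
print(cleaned)  # '<script>alert(1)</script>' — the forbidden tag survives
```

Removing the inner `<script>` fragment splices the surrounding characters back together into a complete tag, which is why validate-and-reject (or context-aware encoding) is safer than destructive replacement.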

6. Select the right kind of RegEx for the job

Overly permissive regular expressions are common flaws that occur when regex is used to validate user input. Common mistakes include not restricting the number of characters allowed, improperly escaping special characters, or overuse of the wildcard feature.

Weaknesses like these can be exploited to launch injection attacks on either the client or the server side. If an attacker manages to get inappropriate values into the backend processing system, finding further injection flaws becomes even easier.

Common examples of overly permissive patterns include failing to anchor the expression to the start and end of the target string and using wildcards instead of defining acceptable character ranges.

When it comes to validating a US phone number, it’s best to avoid a simple zero-to-nine approach. Instead, opt for a regex pattern that mandates the first digit to fall between two and nine, followed by any two digits, a hyphen, any three digits, another hyphen, and finally any four digits.

While various regex patterns can be used for phone number validation, it’s crucial to carefully select the appropriate regex for the specific task at hand.
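One plausible rendering of that description as a pattern, assuming hyphens separate all three digit groups (as in 212-555-0123):

```python
import re

# First digit 2-9, then two digits, hyphen, three digits, hyphen, four digits.
US_PHONE_RE = re.compile(r'^[2-9]\d{2}-\d{3}-\d{4}$')

for number in ['212-555-0123', '123-555-0123', '2125550123']:
    print(number, bool(US_PHONE_RE.fullmatch(number)))
```

The first number passes; the second fails because its area code starts with 1, and the third fails because it omits the hyphens the pattern requires.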

Let’s dive into a practical scenario to illustrate the importance of using the right regular expression for validating an IPv4 address. Imagine you have a simple regular expression that aims to validate IPv4 addresses. It attempts to match any sequence of numbers, each consisting of up to three digits, repeated four times and separated by periods. However, it’s crucial to highlight a significant limitation: IPv4 address numbers must strictly fall within the range of 0 to 255.

Consider this example in Python:

import re

# Overly permissive: any run of 1-3 digits per octet, so 999 passes.
ipv4_regex = r'^(\d{1,3}\.){3}\d{1,3}$'
test_addresses = ['192.168.1.1', '999.999.999.999', '256.0.0.1']

for address in test_addresses:
    if re.match(ipv4_regex, address):
        print(f'{address} is a valid IPv4 address.')
    else:
        print(f'{address} is not a valid IPv4 address.')

In this example, the regular expression ^(\d{1,3}\.){3}\d{1,3}$ is used to validate IPv4 addresses. The issue becomes evident when we test it with addresses like ‘999.999.999.999’, which is outside the valid IPv4 address range. This example demonstrates the importance of crafting precise regular expressions to ensure accurate validation.
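A stricter pattern that constrains each octet to the range 0 to 255 might look like this (one of several possible formulations):

```python
import re

# Each octet: 250-255, 200-249, 100-199, or 0-99 (no leading zeros).
octet = r'(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)'
ipv4_strict = re.compile(rf'^({octet}\.){{3}}{octet}$')

for address in ['192.168.1.1', '999.999.999.999', '256.0.0.1']:
    print(address, bool(ipv4_strict.fullmatch(address)))
```

With this pattern, '192.168.1.1' is accepted while both '999.999.999.999' and '256.0.0.1' are correctly rejected.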

7. Validate free-form Unicode text with care

Validating free-form Unicode text can be a complex task, as Unicode encompasses a wide range of characters from different scripts, languages, and symbols. The validation approach you take will depend on your specific requirements.

Here are some general steps and considerations for validating free-form Unicode text:

  • Specify and validate character encoding: Explicitly specify and validate the character encoding of Unicode text to prevent issues related to data interpretation. Ensuring that both input and output consistently use the correct Unicode encoding (e.g., UTF-8, UTF-16) helps prevent data corruption and security vulnerabilities stemming from malformed or incorrectly encoded text. By confirming that the text adheres to the expected encoding, you establish a foundation for reliable and secure text processing within your software.
  • Implement input sanitization and normalization: To bolster security and data integrity, implement robust input sanitization and normalization procedures. Input sanitization involves filtering and cleaning input data to remove or neutralize potentially harmful characters or sequences, guarding against vulnerabilities like cross-site scripting (XSS) attacks. Normalization, on the other hand, ensures consistent and equivalent representations of Unicode characters, reducing the risk of issues related to character variants. Properly combining these techniques helps safeguard your application from a wide range of security threats and data quality concerns.
  • Conduct cross-cultural testing and use unicode libraries: Validate the handling of diverse Unicode text by conducting thorough cross-cultural testing with inputs from various languages and cultures. This ensures that your software behaves correctly and securely for users worldwide. Additionally, consider leveraging Unicode libraries and functions provided by your programming language to handle text reliably. These libraries often offer built-in support for Unicode character properties, simplifying the implementation of secure text processing while adhering to best practices.
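The normalization and sanitization steps above can be sketched as follows. The specific policy here — NFC normalization, stripping control characters, and a length cap of 1000 — is an assumption for illustration, not a universal rule.

```python
import unicodedata

def clean_text(raw: str, max_len: int = 1000) -> str:
    """Normalize then sanitize free-form Unicode text (assumed policy)."""
    # Normalize so equivalent character sequences compare equal.
    text = unicodedata.normalize('NFC', raw)
    # Drop control characters (category 'Cc') except newline and tab.
    text = ''.join(ch for ch in text
                   if unicodedata.category(ch) != 'Cc' or ch in '\n\t')
    if len(text) > max_len:
        raise ValueError('input too long')
    return text

# 'e' + combining acute accent normalizes to the single code point 'é'.
print(clean_text('Cafe\u0301'))  # 'Café'
```

Normalizing first matters: two visually identical strings can differ at the code-point level, and validation applied before normalization may treat them inconsistently.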

Remember that Unicode is vast, and text validation can be a nuanced task. Your validation criteria should align with your application’s specific needs and use cases. Regularly update and review your validation logic to adapt to evolving requirements and security concerns.

8. Encode all data before rendering it in HTML

Encode all untrusted data before rendering it in HTML templates. This process involves converting special characters such as <, >, &, and quotes into their respective HTML entities (&lt;, &gt;, &amp;, &quot;). This precautionary step helps prevent cross-site scripting (XSS) attacks, where malicious actors attempt to inject and execute their code within your application.

To execute this practice effectively, take advantage of web development frameworks and template engines that offer built-in encoding functions designed for this purpose. These tools streamline the encoding process, making it easier to secure your application’s output.

When using templating languages or engines, verify that they include robust protection against XSS attacks and ensure that you correctly employ their escaping mechanisms.

As a best practice, encode as close to the data consumer as possible. Whether it’s encoding data before sending it to a server or storing it in a database, or encoding data for HTML usage just before displaying it on a web page, the goal is to thwart malicious attempts to execute unauthorized code within your environment.
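In Python, the standard library’s `html.escape` performs exactly this entity conversion:

```python
import html

# Encode untrusted data just before it lands in an HTML template.
comment = '<script>alert("xss")</script>'
safe = html.escape(comment)  # escapes &, <, >, and (by default) quotes
print(safe)
# &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;
```

The browser renders the escaped string as visible text rather than executing it as markup. Most template engines (Jinja2, Django templates, and others) apply this kind of escaping automatically; the manual call is shown here to make the transformation explicit.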

9. Always sanitize user input before storing or rendering it

Never trust user input. It must be rigorously scrutinized and cleansed before it is rendered as HTML.

Following the earlier steps of syntactic and semantic validation, the next vital task is to cleanse the user input thoroughly.

Cleansing operates as the last line of defense against potentially harmful elements lurking within the data. These elements could be exploited by malicious actors for nefarious purposes, ranging from SQL injection to cross-site scripting (XSS) and other injection attacks.

The cleansing process typically entails the meticulous task of escaping or encoding characters that harbor special meanings within the specific context in which the data will be employed.

10. Use content security policy wherever possible

Implement a content security policy (CSP) header in your web application’s response to restrict the sources from which scripts, styles, and various resources can be fetched and executed. CSP can help mitigate the risk of malicious script injection.

CSP directives specify sources of trusted content that are allowed to execute, thereby establishing clear boundaries for the execution of scripts and the loading of resources. This reduces the risk of web-based threats, particularly cross-site scripting (XSS) attacks.

One of the core functions of CSP is script source whitelisting. By using the script-src directive, developers can explicitly specify the sources from which scripts can be loaded and executed on a web page.

CSP goes beyond script control; it extends to other types of resources, such as stylesheets, images, and fonts, through directives like style-src, img-src, and more.
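A sketch of assembling a CSP header value covering several directive types. The directive names are standard CSP; the allowed origins are placeholders for your own trusted sources, and the framework-specific line is illustrative.

```python
# Build a Content-Security-Policy header value.
csp = "; ".join([
    "default-src 'self'",                         # fallback for all resource types
    "script-src 'self' https://cdn.example.com",  # where scripts may load from
    "style-src 'self'",                           # stylesheets
    "img-src 'self' data:",                       # images, including data: URIs
])

# In a typical web framework this would be attached to the response, e.g.:
# response.headers['Content-Security-Policy'] = csp
print(csp)
```

With this policy in place, a script injected from any origin other than your own site or the listed CDN is refused by the browser, even if it makes it into the page.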

11. Avoid inline scripts and styles

Minimize the use of inline scripts and styles as an imperative safeguard in web development. Instead, use external files and scripts whenever possible. If inline scripts are necessary, use nonce-based or hash-based content security policy (CSP) mechanisms to allow only specific trusted scripts to be executed.
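A sketch of the nonce-based mechanism: generate a fresh random value per response, then include it in both the CSP header and the inline script tag, so only that specific script may run. The `initApp` function name is hypothetical.

```python
import secrets

# A fresh, unpredictable nonce must be generated for every response.
nonce = secrets.token_urlsafe(16)

# The header authorizes only inline scripts carrying this nonce.
csp_header = f"script-src 'nonce-{nonce}'"

# The inline script must carry the matching nonce attribute.
script_tag = f'<script nonce="{nonce}">initApp();</script>'
```

An attacker injecting a script tag cannot guess the nonce for the current response, so the browser refuses to execute the injected code while still running your intended inline script.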

Avoiding inline scripts and styles is a critical practice in web development to mitigate the risk of cross-site scripting (XSS) attacks. When scripts or styles are directly embedded within HTML markup, especially when they incorporate user-generated or untrusted data, they serve as entry points for attackers to inject malicious code into web pages.

By choosing to place scripts and styles in separate files while avoiding inline implementation, developers establish a protective barrier. Within this barrier, they can perform thorough input validation and sanitization prior to execution, effectively reducing the susceptibility to XSS exploits and ensuring the safety of both users and web applications from potential threats.

Separating scripts and styles into external files enhances the overall maintainability and readability of code. It allows developers to organize code more efficiently, collaborate effectively, and implement version control.

Furthermore, debugging and troubleshooting become more straightforward when scripts and styles are externalized, as errors can be isolated and addressed more efficiently.

This practice not only promotes security but also contributes to the long-term maintainability and scalability of web applications.

Input validation and output encoding are vital tools in your arsenal to fortify your web application against the ever-evolving landscape of cyber threats. By understanding the intricacies of data expectations and utilization within your application, you empower yourself to create a resilient, secure environment for your users.

Read the full series in Secure Coding for Software Engineers.

Written by James Ma

Tech lead at a digital bank startup in Singapore.
