Codeprinting

Spectral can detect code copies, partial copies, and fuzzy copies. If a piece of sensitive code, configuration, or any other textual asset must live only in a specific, predefined place, you can enforce that by creating a custom detector that looks for stray copies of it, or partial (modified) copies of it.

Use codeprinting to:

  • Locate configuration sprawl: a securely stored, sensitive configuration file that individuals copy and paste between projects, including projects that aren't considered a safe place for it
  • Trace a file that's made it to mobile apps by mistake and is now delivered to many end-user devices as part of an APK build
  • Locate a complete codebase that's been misplaced on a production server, a sandbox computer or other unauthorized devices

Creating effective codeprints

Spectral will help you create codeprints securely and locally -- your code is never transmitted anywhere, and all codeprint hashes are produced with a local and secure hashing algorithm.

A codeprint is a safe one-way hash encoded as a text string. It also carries a comparative trait that Spectral uses to measure code copies, partial copies, and fuzzy copies. You can store codeprints in the clear in your custom detectors to detect code copying.
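To make the idea of a one-way hash with a "comparative trait" concrete, here is a minimal SimHash-style sketch in Python. It is an illustration only, not Spectral's implementation: the 3-word shingles, the 64-bit width, and the names simhash and hamming_distance are all assumptions chosen for the example. The point it shows is that similar inputs yield fingerprints that differ in few bits, while the fingerprint itself does not reveal the text.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Illustrative SimHash over 3-word shingles (bits: multiple of 8, <= 256)."""
    tokens = re.findall(r"\w+", text.lower())
    shingles = [" ".join(tokens[i:i + 3]) for i in range(max(1, len(tokens) - 2))]
    weights = [0] * bits
    for shingle in shingles:
        digest = hashlib.sha256(shingle.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        # Each shingle votes +1 or -1 on every bit position.
        for bit in range(bits):
            weights[bit] += 1 if (value >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(bits):
        if weights[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; smaller means more similar inputs."""
    return bin(a ^ b).count("1")


# Hypothetical sample texts, made up for this sketch.
original = " ".join(f"key{i} = value{i * 7}" for i in range(100))
modified = original + " one extra line"
unrelated = " ".join(f"host{i} port{i * 3}" for i in range(100))

print("identical:", hamming_distance(simhash(original), simhash(original)))
print("modified: ", hamming_distance(simhash(original), simhash(modified)))
print("unrelated:", hamming_distance(simhash(original), simhash(unrelated)))
```

A detector built on this kind of hash would treat a small bit distance as a likely copy; where to set that threshold is a tuning decision this sketch does not address.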

📘

Did you know? Hashes of any type, while not secrets, should be kept private

As with any other hash, keep your codeprints private to your organization. Codeprints are not secrets and cannot be reversed to the original text, but virtually all hashes and one-way functions, such as MD5, SHA256 and others, can be used to "sniff out" some indirect knowledge about an organization.

Quick Start

Creating a codeprint means essentially creating a custom detector with your codeprints in it. Spectral gives you a shortcut:

$HOME/.spectral/spectral fingerprint --codeprint [FILE1] [FILE2] ...

Pick files that represent code, configuration, docs, or other pieces of information that are unique to your organization, or that your policy deems sensitive.

Spectral will generate a detector for you and output it, so you can copy or pipe it to your own .spectral/rules/rules.yaml.

PS: As a security best-practice, we leave the decision for how to distribute or store these custom rules to you. You know best how to securely distribute assets within your organization.
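If you prefer to script this step, the following Python sketch wraps the command shown above and saves its output as a rules file. The binary path and the rules path come from this page; the helper names build_fingerprint_command and write_detector are hypothetical, and the sketch assumes the generated detector is written to standard output (as "copy or pipe" suggests). It refuses to overwrite an existing rules file rather than guessing how to merge YAML.

```python
import subprocess
from pathlib import Path

# Paths taken from the text above; adjust them if your installation differs.
SPECTRAL_BIN = Path.home() / ".spectral" / "spectral"
RULES_FILE = Path(".spectral") / "rules" / "rules.yaml"


def build_fingerprint_command(files):
    """Build the argv for `spectral fingerprint --codeprint FILE1 FILE2 ...`."""
    files = [str(f) for f in files]
    if not files:
        raise ValueError("at least one file is required")
    return [str(SPECTRAL_BIN), "fingerprint", "--codeprint", *files]


def write_detector(files, rules_file=RULES_FILE):
    """Run the command and save its standard output as a new rules file."""
    rules_file = Path(rules_file)
    if rules_file.exists():
        # Assumption: merging into an existing rules file is left to you.
        raise FileExistsError(f"{rules_file} already exists; merge by hand")
    result = subprocess.run(
        build_fingerprint_command(files),
        check=True,
        capture_output=True,
        text=True,
    )
    rules_file.parent.mkdir(parents=True, exist_ok=True)
    rules_file.write_text(result.stdout)
    return rules_file
```

Calling write_detector(["config/prod.yaml"]) (a made-up file name) would run the fingerprint command once and create the rules file if it does not exist yet.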

Detector breakdown

Let's look at the detector Spectral has generated for you:

rules:
  - id: CPRT001
    applies_to:
      - ".*$"
    description: Detect code copies via secure codeprinting
    name: Codeprint detector
    severity: info
    tags:
      - base
      - codeprints
    pattern_group:
      patterns:
        - pattern: ".*"
          match_on_path: true
          pattern_type: single
          test_codeprints:
          - print: ".."
          - print: ".."
  • applies_to - use this to exclude unwanted files from scanning.
  • match_on_path - tells Spectral to look at file paths rather than content; you can use pattern to apply a secondary filtering rule (regex)
  • test_codeprints - these are the actual codeprints

As is typical with Spectral's detector engine and custom detector capabilities -- feel free to experiment, mix and match other testers and use what works for you.

Do's

  • Spectral checks each file it scans against the codeprints in the list, and a match on any one of them counts, so you can add more than one codeprint
  • If you have a sensitive file that you want to codeprint, you can create a detector just for that one
  • If you have a large codebase or a large set of assets to protect, identify the files in it that are most unique to you and create a codeprint for each of them

Don'ts

  • Use a very small file (smaller than 2 KB); it probably doesn't have enough text to be unique
  • Use a public-domain file, or a file that's not originally yours, such as a piece of open source code; you'll just end up with matches on that open source library, which many codebases may use
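The size guideline above is easy to check before you generate codeprints. The sketch below is one way to do it in Python; the helper name split_by_size and the constant MIN_CODEPRINT_BYTES are hypothetical, and the 2 KB threshold is simply the figure from the list above, not a value read from Spectral.

```python
from pathlib import Path

# Threshold from the guideline above ("smaller than 2kb").
MIN_CODEPRINT_BYTES = 2 * 1024


def split_by_size(paths, min_bytes=MIN_CODEPRINT_BYTES):
    """Return (large_enough, too_small) lists of paths based on file size."""
    keep, skip = [], []
    for path in map(Path, paths):
        if path.stat().st_size >= min_bytes:
            keep.append(path)
        else:
            skip.append(path)
    return keep, skip
```

You could pass the first list to the fingerprint command and review the second by hand. Whether a file is originally yours, as opposed to open source, still needs human judgment.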

Security

  • A codeprint is one-way
  • However, a codeprint can be "googled": if the original was public source code, people can trace the hash back to that source. To avoid brute-force googling of the simhashes, we compress and encrypt them (with a constant key that is global to Spectral; in the future this can be a team key)
  • To avoid brute-forcing, a file has to meet a minimum size to be codeprinted