Kali Linux

Octopii : An AI-powered Personal Identifiable Information (PII) Scanner

Octopii is an open-source AI-powered Personal Identifiable Information (PII) scanner that can look for image assets such as Government IDs, passports, photos and signatures in a directory.

Working

Octopii uses Tesseract’s Optical Character Recognition (OCR) and Keras’ Convolutional Neural Networks (CNN) models to detect various forms of personal identifiable information that may be leaked on a publicly facing location. This is done in the following steps:

1. Importing and cleaning image(s)

The image is imported via OpenCV and Python Imaging Library (PIL) and is cleaned, deskewed and rotated for scanning.

2. Performing image classification and Optical Character Recognition (OCR)

A directory is looped over and searched for images. These images are scanned for unique features via the image classifier (done by comparing it to a trained model), along with OCR for finding substrings within the image. This may have one of the following outcomes:

  • Best case (score >=90): The image is sent into the image classifier algorithm to be scanned for features such as an ISO/IEC 7810 card specification, colors, location of text, photos, holograms etc. If it is successfully classified as a type of PII, OCR is performed on it looking for particular words and strings as a final check. When both of these are confirmed, the result from Octopii is extremely reliable.
  • Average case (score >=50): The image is partially/incorrectly identified by the image classifier algorithm, but an OCR check finds contradicting substrings and reclassifies it.
  • Worst case (score >=0): The image is only identified by the image classifier algorithm but an OCR scan returns no results.
  • Incorrect classification: False positives due to a very small model or OCR list may incorrectly classify PIIs, giving inaccurate results.

As a final verification method, images are scanned for certain strings to verify the accuracy of the model.

The accuracy of the scan can determined via the confidence scores in output. If all the mentioned conditions are met, a score of 100.0 is returned.

To train the model, data can also be fed into the model_generator.py script, and the newly improved h5 file can be used.

Usage

  1. Install all dependencies via pip install -r requirements.txt.
  2. Install the Tesseract helper locally via sudo apt install tesseract-ocr -y (for Ubuntu/Debian).
  3. To run Octopii, type python3 octopii.py <location name>, for example python3 octopii.py pii_list/

python3 octopii.py <location to scan> <additional flags>

Octopii currently supports local scanning and scanning S3 directories and open directory listings via their URLs.

Example

owais@artemis ~ $ python3 octopii.py pii_list

Not a valid image format: pii_list/aadhaar/aadhaar-8.gif

[
    {
        "asset_type": Credit and Debit Cards,
        "country_of_origin": "International",
        "confidence": 100,
        "file_name": "credit-card.jpg",
        "extension": "jpg",
        "path": "https://pii-carbonconsole.fra1.digitaloceanspaces.com/credit-card.jpg"
    },
    {
        "asset_type": "PAN",
        "country_of_origin": "IN",
        "confidence": 100,
        "file_name": "dummy-PAN-India.jpg",
        "extension": "jpg",
        "path": "https://pii-carbonconsole.fra1.digitaloceanspaces.com/dummy-PAN-India.jpg"
    },
    {
        "asset_type": Aadhaar,
        "country_of_origin": "IN",
        "confidence": 100,
        "file_name": "dummy-aadhaar.jpg",
        "extension": "jpg",
        "path": "https://pii-carbonconsole.fra1.digitaloceanspaces.com/dummy-aadhaar.jpg"
    },
    {
        "asset_type": Driver License,
        "country_of_origin": "International",
        "confidence": 100,
        "file_name": "dummy-drivers-license-nebraska-us.jpg",
        "extension": "jpg",
        "path": "https://pii-carbonconsole.fra1.digitaloceanspaces.com/dummy-drivers-license-nebraska-us.jpg"
    },
    {
        "asset_type": Passport,
        "country_of_origin": "International",
        "confidence": 100,
        "file_name": "dummy-passport-britain.jpg",
        "extension": "jpg",
        "path": "https://pii-carbonconsole.fra1.digitaloceanspaces.com/dummy-passport-britain.jpg"
    },
    {
        "asset_type": Passport,
        "country_of_origin": "International",
        "confidence": 100,
        "file_name": "dummy-passport-india.jpg",
        "extension": "jpg",
        "path": "https://pii-carbonconsole.fra1.digitaloceanspaces.com/dummy-passport-india.jpg"
    },
    {
        "asset_type": "Signature",
        "country_of_origin": null,
        "confidence": 7,
        "file_name": "dummy-signature.png",
        "extension": "png",
        "path": "https://pii-carbonconsole.fra1.digitaloceanspaces.com/dummy-signature.png"
    }
]
R K

Recent Posts

Useful Bug Bounty And Security Related Write-ups : A Comprehensive Guide For Enthusiasts

This repo contains all variants of information security & Bug bounty & Penetration Testing write-up…

2 hours ago

Admin-Panel-Dorks : Mastering Google Dorks To Uncover Hidden Admin Panels

site:*/sign-in site:*/account/login site:*/forum/ucp.php?mode=login inurl:memberlist.php?mode=viewprofile intitle:"EdgeOS" intext:"Please login" inurl:user_login.php intitle:"Web Management Login" site:*/users/login_form site:*/access/unauthenticated site:account.*.*/login site:admin.*.com/signin/…

2 hours ago

Conduwuit : Pioneering A New Era In Matrix Homeservers

Matrix is an open network for secure and decentralized communication. Users from every Matrix homeserver…

2 hours ago

LSMS – Linux Security And Monitoring Scripts

Linux Security And Monitoring Scripts are a collection of security and monitoring scripts you can…

2 hours ago

Fiber – Using Fibers To Run In-Memory Code

A fiber is a unit of execution that must be manually scheduled by the application…

2 hours ago

XSS-Exploitation-Tool : A Penetration Testing Tool

XSS Exploitation Tool is a penetration testing tool that focuses on the exploit of Cross-Site…

2 hours ago