Octopii is an open-source, AI-powered Personally Identifiable Information (PII) scanner that searches a directory for image assets such as government IDs, passports, photos and signatures.
Octopii uses Tesseract's Optical Character Recognition (OCR) and Keras Convolutional Neural Network (CNN) models to detect various forms of personally identifiable information that may have been leaked to a publicly accessible location. This is done in the following steps:
The image is imported via OpenCV and Python Imaging Library (PIL) and is cleaned, deskewed and rotated for scanning.
A directory is looped over and searched for images. These images are scanned for unique features via the image classifier (done by comparing it to a trained model), along with OCR for finding substrings within the image. This may have one of the following outcomes:
As a final verification method, images are scanned for certain strings to verify the accuracy of the model.
The accuracy of the scan can be determined via the confidence scores in the output. If all the mentioned conditions are met, a score of 100.0 is returned.
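The string check and the resulting confidence score can be illustrated with a small sketch. This is not Octopii's actual implementation: the function name `keyword_confidence`, the keyword list, and the scoring rule (the percentage of expected keywords found in the OCR text) are all assumptions chosen for illustration.

```python
from typing import Sequence


def keyword_confidence(ocr_text: str, expected_keywords: Sequence[str]) -> float:
    """Return the percentage of expected keywords found in the OCR text.

    Hypothetical scoring rule: every keyword counts equally, and matching
    is a case-insensitive substring search.
    """
    if not expected_keywords:
        return 0.0
    text = ocr_text.lower()
    hits = sum(1 for keyword in expected_keywords if keyword.lower() in text)
    return round(100.0 * hits / len(expected_keywords), 1)


# Hypothetical keywords one might expect on a passport scan.
PASSPORT_KEYWORDS = ["passport", "nationality", "date of birth", "place of birth"]

# Stand-in for text that an OCR engine such as Tesseract might return.
sample = "PASSPORT\nNationality: BRITISH CITIZEN\nDate of birth: 01 JAN 1980"

print(keyword_confidence(sample, PASSPORT_KEYWORDS))  # 75.0 (3 of 4 keywords found)
```

Under this rule a score of 100.0 is only reached when every expected string is present, which mirrors the "all conditions met" behaviour described above.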
To train the model, data can also be fed into the model_generator.py script, and the newly improved h5 file can be used.
Install the Python dependencies: pip install -r requirements.txt
Install Tesseract (for Ubuntu/Debian): sudo apt install tesseract-ocr -y
Run the scanner: python3 octopii.py <location name>, for example python3 octopii.py pii_list/
python3 octopii.py <location to scan> <additional flags>
Octopii currently supports local scanning and scanning S3 directories and open directory listings via their URLs.
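To show what scanning an open directory listing involves, here is a minimal sketch that extracts image URLs from a listing page's HTML. It is an assumption-laden illustration, not Octopii's own code: the function name `image_links`, the `IMAGE_EXTENSIONS` set, and the example URL are hypothetical, and the HTML is passed in as a string so that no network access is needed.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed set of extensions worth scanning; adjust to the formats actually supported.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def image_links(listing_html, base_url):
    """Return absolute URLs of links in the listing that look like images."""
    collector = _LinkCollector()
    collector.feed(listing_html)
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        extension = posixpath.splitext(urlparse(absolute).path)[1].lower()
        if extension in IMAGE_EXTENSIONS:
            links.append(absolute)
    return links


# Hypothetical listing page, similar to what a web server's auto-index produces.
listing = (
    '<a href="../">Parent Directory</a>'
    '<a href="passport.jpg">passport.jpg</a>'
    '<a href="notes.txt">notes.txt</a>'
    '<a href="scans/ID.PNG">ID.PNG</a>'
)

for url in image_links(listing, "https://example.com/files/"):
    print(url)
```

Each URL returned this way could then be downloaded and passed through the same classification and OCR steps as a local file.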
owais@artemis ~ $ python3 octopii.py pii_list
Not a valid image format: pii_list/aadhaar/aadhaar-8.gif
[
    {
        "asset_type": "Credit and Debit Cards",
        "country_of_origin": "International",
        "confidence": 100,
        "file_name": "credit-card.jpg",
        "extension": "jpg",
        "path": "https://pii-carbonconsole.fra1.digitaloceanspaces.com/credit-card.jpg"
    },
    {
        "asset_type": "PAN",
        "country_of_origin": "IN",
        "confidence": 100,
        "file_name": "dummy-PAN-India.jpg",
        "extension": "jpg",
        "path": "https://pii-carbonconsole.fra1.digitaloceanspaces.com/dummy-PAN-India.jpg"
    },
    {
        "asset_type": "Aadhaar",
        "country_of_origin": "IN",
        "confidence": 100,
        "file_name": "dummy-aadhaar.jpg",
        "extension": "jpg",
        "path": "https://pii-carbonconsole.fra1.digitaloceanspaces.com/dummy-aadhaar.jpg"
    },
    {
        "asset_type": "Driver License",
        "country_of_origin": "International",
        "confidence": 100,
        "file_name": "dummy-drivers-license-nebraska-us.jpg",
        "extension": "jpg",
        "path": "https://pii-carbonconsole.fra1.digitaloceanspaces.com/dummy-drivers-license-nebraska-us.jpg"
    },
    {
        "asset_type": "Passport",
        "country_of_origin": "International",
        "confidence": 100,
        "file_name": "dummy-passport-britain.jpg",
        "extension": "jpg",
        "path": "https://pii-carbonconsole.fra1.digitaloceanspaces.com/dummy-passport-britain.jpg"
    },
    {
        "asset_type": "Passport",
        "country_of_origin": "International",
        "confidence": 100,
        "file_name": "dummy-passport-india.jpg",
        "extension": "jpg",
        "path": "https://pii-carbonconsole.fra1.digitaloceanspaces.com/dummy-passport-india.jpg"
    },
    {
        "asset_type": "Signature",
        "country_of_origin": null,
        "confidence": 7,
        "file_name": "dummy-signature.png",
        "extension": "png",
        "path": "https://pii-carbonconsole.fra1.digitaloceanspaces.com/dummy-signature.png"
    }
]
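A report like the one above is easier to act on once low-confidence entries are filtered out. The following sketch assumes the findings have been captured as a valid JSON array (without the shell prompt and warning line); the function name `high_confidence` and the threshold of 50 are hypothetical choices, not part of Octopii.

```python
import json


def high_confidence(findings_json, threshold=50.0):
    """Return only the findings whose confidence meets the threshold."""
    findings = json.loads(findings_json)
    return [f for f in findings if f.get("confidence", 0) >= threshold]


# Two entries abridged from the sample output above.
report = """
[
    {"asset_type": "PAN", "country_of_origin": "IN", "confidence": 100,
     "file_name": "dummy-PAN-India.jpg", "extension": "jpg"},
    {"asset_type": "Signature", "country_of_origin": null, "confidence": 7,
     "file_name": "dummy-signature.png", "extension": "png"}
]
"""

for finding in high_confidence(report):
    print(finding["file_name"], finding["asset_type"], finding["confidence"])
# prints: dummy-PAN-India.jpg PAN 100
```

The signature entry is dropped here because its confidence of 7 falls below the assumed threshold; a lower threshold would keep it for manual review.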