A system that reads scanned legacy engineering drawings and turns them into structured, searchable data.
During the 20th century, millions of engineering drawings were created by hand, documenting critical systems across infrastructure, manufacturing, and aerospace. Today, many of these documents exist only as scanned archives, where their contents are effectively locked away.
While working on a large-scale document migration project, I developed a pipeline to automatically extract structured metadata from scanned engineering drawings, turning thousands of static images into searchable, usable records.
To ground the problem, I looked to historical archives such as those from the MIT Instrumentation Laboratory, whose work on the Apollo Guidance Computer is preserved in collections like the AGC Aperture Card Collection. These drawings capture highly specialized knowledge, but without proper indexing, they are extremely difficult to navigate or reuse.
In principle, each drawing carries the metadata needed to organize it: drawing numbers, revision letters, and sheet references, embedded in a title block. In practice, pages are not cleanly organized, the metadata is not machine-readable, and related documents are difficult to link together. Even simple questions, such as how many drawings describe a subsystem or, more importantly, how revisions are linked, become impossible to answer with certainty.
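To make this concrete, a structured title-block record might look like the sketch below. The field names and sample values are my own illustration, not the archive's actual schema:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TitleBlockRecord:
    """One scanned page's extracted title-block metadata (illustrative fields)."""
    drawing_number: str   # e.g. "2005900" (hypothetical value)
    revision: str         # e.g. "C"
    sheet: str            # e.g. "2 OF 5"
    source_file: str      # the scan this record was extracted from
    page: int             # page index within that scan, for traceability

    def to_json(self) -> str:
        # Stable key order keeps records diff-friendly in an archive.
        return json.dumps(asdict(self), sort_keys=True)

record = TitleBlockRecord("2005900", "C", "2 OF 5", "agc_box12.pdf", 3)
print(record.to_json())
```

Keeping the source file and page index on every record is what makes the extracted metadata traceable back to the original image.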
When I encountered a similar challenge at my first internship, processing roughly 30,000 engineering documents, it became clear that manual labeling would not scale, and that without structured metadata, valuable knowledge is effectively lost.
This motivated me to build a system that could extract metadata automatically while preserving accuracy.
My approach was to convert each PDF page into a JPG for optical character recognition (OCR), treating every page as an independent record with all metadata extracted directly from the image.
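Treating each page as an independent record can be sketched as a small helper that assigns every page a stable, traceable ID before any OCR runs. The naming scheme here is my own, not necessarily the one the project used:

```python
from pathlib import Path

def page_record_ids(pdf_path: str, page_count: int) -> list[str]:
    """Assign each page of a scanned PDF its own record ID.

    Each page becomes an independent record, so all downstream
    metadata is keyed to the exact image it was read from.
    """
    stem = Path(pdf_path).stem
    return [f"{stem}_p{page:04d}" for page in range(1, page_count + 1)]

print(page_record_ids("drawings/agc_box12.pdf", 3))
# ['agc_box12_p0001', 'agc_box12_p0002', 'agc_box12_p0003']
```

Zero-padded page numbers keep the records sortable when thousands of pages from one archive box land in the same index.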
The pipeline processes each page through several stages:
1. PDF → Image conversion: High-resolution rendering ensures small printed fields remain legible.
2. Image enhancement: Improves OCR reliability by standardizing scan quality.
3. Page triage: Filters out irrelevant pages (blank, photos, memos), focusing only on useful drawings.
4. Targeted extraction: Isolates the title block, where key metadata lives.
5. OCR + validation: Reads text and verifies it against expected formats.
6. Structured output: Generates clean, traceable metadata records.
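The validation stage (step 5) can be illustrated with simple format checks on OCR'd fields. The patterns below are guesses at common title-block conventions, not the exact rules the pipeline used:

```python
import re

# Hypothetical format checks for OCR'd title-block fields.
REVISION_RE = re.compile(r"[A-HJ-NP-Z]{1,2}")   # letters, commonly skipping I and O
SHEET_RE = re.compile(r"SHEET\s+\d+\s+OF\s+\d+")  # e.g. "SHEET 2 OF 5"
DRAWING_NO_RE = re.compile(r"\d{6,8}")            # e.g. a 7-digit drawing number

def validate_field(kind: str, text: str) -> bool:
    """Return True if an OCR'd field matches its expected format."""
    pattern = {"revision": REVISION_RE,
               "sheet": SHEET_RE,
               "drawing_number": DRAWING_NO_RE}[kind]
    return bool(pattern.fullmatch(text.strip().upper()))

print(validate_field("revision", "C"))           # True
print(validate_field("sheet", "SHEET 2 OF 5"))   # True
print(validate_field("drawing_number", "12AB"))  # False: OCR misread, flag for review
```

A field that fails its check can be flagged for manual review rather than silently written into the index, which is how a mostly-automated pipeline preserves accuracy.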
On a reference sample of 1,000 documents, the metadata tagging system reached 90% accuracy; the resulting files eventually became part of my firm's document management system. In the future, I hope to build more tools like this that automate the process further across more contexts, especially as AI tooling advances.