Preserving visual history in the cloud and making it more accessible with AI and machine learning


The morgue contains photos dating back to the late 19th century, and many of its contents have tremendous historical value; some exist nowhere else in the world. It’s not only the photos’ imagery that contains valuable information: in many cases, the backs of the photos record when and where each photo was taken.

“The morgue is a treasure trove of perishable documents that are a priceless chronicle of not just The Times’s history, but of more than a century of global events that have shaped our modern world,” said Nick Rockwell, chief technology officer, The New York Times.

PRESERVING HISTORY

To preserve this priceless history, and to give The Times the ability to enhance its reporting with even more visual storytelling and historical context, The Times is digitizing its archive, using Cloud Storage to store high-resolution scans of all of the images in the morgue.

Cloud Storage is our durable system for storing objects, and it provides customers like The Times with automatic life-cycle management, storage in geographically distinct regions, and an easy-to-use management interface and API.
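To make that concrete, here is a minimal sketch (not The Times’s actual code) of how a Go service might write a high-resolution scan into a Cloud Storage bucket with the Go client library; the bucket, object, and file names are placeholders.

    package main

    import (
        "context"
        "io"
        "log"
        "os"

        "cloud.google.com/go/storage"
    )

    func main() {
        ctx := context.Background()

        // Create a Cloud Storage client using Application Default Credentials.
        client, err := storage.NewClient(ctx)
        if err != nil {
            log.Fatalf("storage.NewClient: %v", err)
        }
        defer client.Close()

        // Open a local high-resolution scan (placeholder file name).
        f, err := os.Open("penn-station-1942.tif")
        if err != nil {
            log.Fatalf("os.Open: %v", err)
        }
        defer f.Close()

        // Stream the scan into a hypothetical archive bucket.
        w := client.Bucket("nyt-photo-archive").Object("scans/penn-station-1942.tif").NewWriter(ctx)
        if _, err := io.Copy(w, f); err != nil {
            log.Fatalf("io.Copy: %v", err)
        }
        if err := w.Close(); err != nil {
            log.Fatalf("Writer.Close: %v", err)
        }
        log.Println("scan uploaded")
    }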


CREATING AN ASSET MANAGEMENT SYSTEM

Simply storing high-resolution images is not enough to create a system that photo editors can easily use. A working asset management system must let users browse and search for photos easily. The Times built a processing pipeline that stores and processes the photos, and will use cloud technology to recognize text, handwriting, and other details found in the images.

Here’s how it works. Once an image is ingested into Cloud Storage, The New York Times uses Cloud Pub/Sub to kick off the processing pipeline, which accomplishes several tasks. Images are resized through services running on Google Kubernetes Engine (GKE), and the image’s metadata is stored in a PostgreSQL database running on Cloud SQL, Google’s fully-managed database offering.
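As a rough illustration of that trigger step, here is a minimal sketch of a Pub/Sub subscriber in Go that receives a Cloud Storage notification for a newly ingested image and hands it off for processing; the project ID and subscription name are hypothetical.

    package main

    import (
        "context"
        "log"

        "cloud.google.com/go/pubsub"
    )

    func main() {
        ctx := context.Background()

        // Hypothetical project ID.
        client, err := pubsub.NewClient(ctx, "my-project")
        if err != nil {
            log.Fatalf("pubsub.NewClient: %v", err)
        }
        defer client.Close()

        // Hypothetical subscription fed by Cloud Storage object notifications.
        sub := client.Subscription("photo-ingest-sub")
        err = sub.Receive(ctx, func(ctx context.Context, msg *pubsub.Message) {
            // Cloud Storage notifications include the bucket and object name as attributes.
            bucket := msg.Attributes["bucketId"]
            object := msg.Attributes["objectId"]
            log.Printf("new scan: gs://%s/%s", bucket, object)

            // ...resize the image and record its metadata here...

            msg.Ack()
        })
        if err != nil {
            log.Fatalf("Receive: %v", err)
        }
    }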

Cloud Pub/Sub helped The New York Times create its processing pipeline without having to build complex APIs or business process systems. It’s a fully-managed solution, so there’s no time spent maintaining the underlying infrastructure.

To resize the images and add captioning data, The New York Times uses ImageMagick and ExifTool, open-source command-line programs. The team wrapped the two tools with Go services and packaged them in Docker images so they can run on GKE in a horizontally scalable manner with minimal administrative effort. Adding more capacity to process more images is trivial, and The Times can stop or start its Kubernetes cluster when the service is not needed. The images are also stored in Cloud Storage multi-region buckets for availability in multiple locations.
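As a sketch of what such a wrapper might look like (The Times’s actual services are more involved), the Go snippet below shells out to ImageMagick’s convert to resize a scan and to ExifTool to pull out caption metadata; the file names and target width are placeholders.

    package main

    import (
        "fmt"
        "log"
        "os/exec"
    )

    // resize shells out to ImageMagick's convert to produce a smaller rendition.
    func resize(src, dst string, width int) error {
        return exec.Command("convert", src, "-resize", fmt.Sprintf("%dx", width), dst).Run()
    }

    // caption shells out to ExifTool and returns the image's embedded metadata as JSON.
    func caption(src string) ([]byte, error) {
        return exec.Command("exiftool", "-json", src).Output()
    }

    func main() {
        if err := resize("scan.tif", "scan-web.jpg", 2048); err != nil {
            log.Fatalf("resize: %v", err)
        }
        meta, err := caption("scan.tif")
        if err != nil {
            log.Fatalf("caption: %v", err)
        }
        log.Printf("metadata: %s", meta)
    }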

The final piece of the archive is tracking both images and their metadata as they move through The Times’s systems, and Cloud SQL is a great choice. It gives developers a standard PostgreSQL instance as a fully managed service, removing the need to install new versions, apply security patches, or set up complex replication configurations, and it provides a simple way for engineers to use a standard SQL solution.
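For illustration only, here is a minimal sketch of recording a scan’s location and caption text in a PostgreSQL database on Cloud SQL from Go, using database/sql with the lib/pq driver; the table, columns, and connection string are hypothetical.

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/lib/pq" // PostgreSQL driver
    )

    func main() {
        // Connect through the Cloud SQL proxy on localhost (hypothetical DSN).
        db, err := sql.Open("postgres", "host=127.0.0.1 user=archive dbname=archive sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Hypothetical table for tracking each scan as it moves through the pipeline.
        _, err = db.Exec(`CREATE TABLE IF NOT EXISTS photos (
            id SERIAL PRIMARY KEY,
            gcs_uri TEXT NOT NULL,
            caption TEXT,
            scanned_at TIMESTAMPTZ DEFAULT now()
        )`)
        if err != nil {
            log.Fatal(err)
        }

        _, err = db.Exec(
            `INSERT INTO photos (gcs_uri, caption) VALUES ($1, $2)`,
            "gs://nyt-photo-archive/scans/penn-station-1942.tif",
            "Crowded Penn Station in 1942",
        )
        if err != nil {
            log.Fatal(err)
        }
    }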

MACHINE LEARNING FOR ADDITIONAL INSIGHTS

Storing the images is only half of the story. To make an archive like The Times’s morgue even more accessible and useful, it helps to bring in additional GCP features. For The Times, one of the bigger challenges in scanning its photo archive has been adding data about the contents of the images. The Cloud Vision API can help fill that gap.

Let’s take a look at this photo of the old Penn Station from The Times as an example. Here, we are showing you the front and the back of the photo:

It’s a beautiful black-and-white photo, but without additional context, it’s not clear from the front of the photo what it contains. The back of the photo contains a wealth of useful information, and the Cloud Vision API can help us process, store, and read it. When we submit the back of the image to the API with no additional processing, we can see that the Cloud Vision API detects the following text:

      NOV 27 1985
      JUL 28 1992
      Clock hanging above an entrance to the main concourse of Pennsylvania Station in 1942, and, right, exterior of the station before it was demolished in 1963.
      PUBLISHED IN NYC
      RESORT APR 30 ‘72
      The New York Time THE WAY IT WAS - Crowded Penn Station in 1942, an era “when only the brave flew - to Washington, Miami and assorted way stations.”
      Penn Station’s Good Old Days | A Buff’s Journey into Nostalgia
      ( OCT 3194
      RAPR 20072
      PHOTOGRAPH BY The New York Times Crowds, top, streaming into the old Pennsylvania Station in New Yorker collegamalan for City in 1942. The former glowegoyercaptouwd a powstation at what is now the General Postadigesikha designay the firm of Hellmuth, Obata & Kassalariare accepted and financed.
      Pub NYT Sun 5/2/93 Metro
      THURSDAY EARLY RUN o cos x ET RESORT
      EB 11 1988
      RECEIVED DEC 25 1942 + ART DEPT. FILES
      The New York Times Business at rail terminals is reflected in the hotels
      OUTWARD BOUND FOR THE CHRISTMAS HOLIDAYS The scene in Pennsylvania Station yesterday afternoor afternoothe New York Times (Greenhaus)
    

This is the actual output from our Cloud Vision API with no additional preprocessing of the image. Of course, the digital text transcription isn’t perfect, but it’s faster and more cost effective than alternatives for processing millions of images.
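For reference, text detection like this takes only a few lines with the Cloud Vision API Go client. The sketch below is illustrative rather than The Times’s production code, and the file name is a placeholder.

    package main

    import (
        "context"
        "fmt"
        "log"
        "os"

        vision "cloud.google.com/go/vision/apiv1"
    )

    func main() {
        ctx := context.Background()
        client, err := vision.NewImageAnnotatorClient(ctx)
        if err != nil {
            log.Fatalf("NewImageAnnotatorClient: %v", err)
        }
        defer client.Close()

        // Placeholder file name for the scanned back of the photo.
        f, err := os.Open("penn-station-back.jpg")
        if err != nil {
            log.Fatalf("os.Open: %v", err)
        }
        defer f.Close()

        img, err := vision.NewImageFromReader(f)
        if err != nil {
            log.Fatalf("NewImageFromReader: %v", err)
        }

        // Document text detection handles dense printed and handwritten text.
        annotation, err := client.DetectDocumentText(ctx, img, nil)
        if err != nil {
            log.Fatalf("DetectDocumentText: %v", err)
        }
        if annotation != nil {
            fmt.Println(annotation.Text)
        }
    }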

BRINGING THE PAST INTO THE FUTURE

This is only the start of what’s possible for companies with physical archives. They can use the Vision API to identify objects and places within images. For example, if we pass the black-and-white photo above through the Cloud Vision API with Logo Detection, we can see that Pennsylvania Station is recognized. Furthermore, AutoML can be used to better identify images in collections using a corpus of already captioned images.
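As a sketch, logo detection works much like the text detection shown earlier; the snippet below is illustrative, and the file name is a placeholder.

    package main

    import (
        "context"
        "fmt"
        "log"
        "os"

        vision "cloud.google.com/go/vision/apiv1"
    )

    func main() {
        ctx := context.Background()
        client, err := vision.NewImageAnnotatorClient(ctx)
        if err != nil {
            log.Fatalf("NewImageAnnotatorClient: %v", err)
        }
        defer client.Close()

        // Placeholder file name for the front of the photo.
        f, err := os.Open("penn-station-front.jpg")
        if err != nil {
            log.Fatalf("os.Open: %v", err)
        }
        defer f.Close()

        img, err := vision.NewImageFromReader(f)
        if err != nil {
            log.Fatalf("NewImageFromReader: %v", err)
        }

        // Ask for up to five logo annotations.
        logos, err := client.DetectLogos(ctx, img, nil, 5)
        if err != nil {
            log.Fatalf("DetectLogos: %v", err)
        }
        for _, logo := range logos {
            fmt.Printf("%s (score %.2f)\n", logo.Description, logo.Score)
        }
    }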

The Cloud Natural Language API could be used to add semantic information to recognized text. For example, if we pass the text “The New York Time THE WAY IT WAS - Crowded Penn Station in 1942, an era when only the brave flew - to Washington, Miami and assorted way stations.” through the Cloud Natural Language API, it correctly identifies “Penn Station,” “Washington,” and “Miami” as locations, and classifies the entire sentence into the category “travel” and the subcategory “bus & rail.”
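As an illustrative sketch (not The Times’s code), the Go snippet below runs the caption text quoted above through entity analysis and text classification with the Cloud Natural Language API Go client.

    package main

    import (
        "context"
        "fmt"
        "log"

        language "cloud.google.com/go/language/apiv1"
        languagepb "google.golang.org/genproto/googleapis/cloud/language/v1"
    )

    func main() {
        ctx := context.Background()
        client, err := language.NewClient(ctx)
        if err != nil {
            log.Fatalf("language.NewClient: %v", err)
        }
        defer client.Close()

        doc := &languagepb.Document{
            Source: &languagepb.Document_Content{
                Content: "Crowded Penn Station in 1942, an era when only the brave flew - to Washington, Miami and assorted way stations.",
            },
            Type: languagepb.Document_PLAIN_TEXT,
        }

        // Entity analysis picks out locations such as "Penn Station".
        entities, err := client.AnalyzeEntities(ctx, &languagepb.AnalyzeEntitiesRequest{Document: doc})
        if err != nil {
            log.Fatalf("AnalyzeEntities: %v", err)
        }
        for _, e := range entities.Entities {
            fmt.Printf("%s (%v)\n", e.Name, e.Type)
        }

        // Classification assigns categories such as travel and bus & rail.
        classes, err := client.ClassifyText(ctx, &languagepb.ClassifyTextRequest{Document: doc})
        if err != nil {
            log.Fatalf("ClassifyText: %v", err)
        }
        for _, c := range classes.Categories {
            fmt.Printf("%s (%.2f)\n", c.Name, c.Confidence)
        }
    }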

Helping The New York Times transform its photo archive fits perfectly with Google’s mission to organize the world’s information and make it universally accessible and useful. We hope that by sharing what we did, we can inspire more organizations—not just publishers—to look to the cloud, and tools like Cloud Vision API, Cloud Storage, Cloud Pub/Sub, and Cloud SQL, to preserve and share their rich history.

Find out more about the technology that is helping The New York Times breathe new life into their photo archive.