Project Outfits

Problem Statement

To assemble matching outfits, individuals with visual impairment must either rely on sighted helpers to put their outfits together or attach braille tags to all of their clothes to record each item's type, color, and texture.

Below is a video demonstrating how visually impaired individuals identify clothing items.

Our Solution

Our system gives the visually impaired independence by helping them identify clothing items and assemble matching outfits without relying on a sighted individual or attaching tags, identification buttons, or RFID stickers to their clothing. Below are videos demonstrating our applications in use, for both the mobile and web-based versions.

Descriptive World Mobile App:





Descriptive World Cloud-Based App:

How Does It Work?

We use computer vision, natural language processing, and text-to-speech technologies to identify clothing type, texture, and color, then narrate a description of the clothing item back to the user.
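
As a toy illustration of the last step only, here is how a detected (type, texture, color) triple could be turned into speech. This is a sketch, not our production code: pyttsx3 stands in for the text-to-speech services we actually use (e.g., Amazon Polly in our cloud version).

```python
# Illustrative only: pyttsx3 is a stand-in for the production TTS services.
import pyttsx3

def narrate(garment: str, texture: str, color: str) -> None:
    """Compose and speak a description of a detected clothing item."""
    sentence = f"This looks like a {color} {texture} {garment}."
    engine = pyttsx3.init()
    engine.say(sentence)   # queue the sentence
    engine.runAndWait()    # block while it is spoken aloud

narrate("short sleeve top", "striped", "navy")
```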

Approach

Ideation

We began by exploring ideas. Initially, we wanted to develop a system that could narrate real-time events in the field of vision of a user wearing the Descriptive World headset. However, the scope of that task was monumental given the limited time our team had to research, design, develop, and test a solution (10 weeks). After discussing various use cases, including helping shoppers find items in a grocery store and helping the visually impaired navigate a city environment, we decided to restrict our scope to an environment that is mostly controlled and free of distractions (which could interfere with our ability to accurately identify items). This led us to limit our environment to the home, and eventually to narrow our scope to clothing identification as an ideal problem to tackle that could bring value to the lives of the visually impaired.

Research

Our team met with subject matter experts in the field of technology usage for the visually impaired. They advised us to keep our approach simple, so that the learning curve would be minimal, and suggested several possible directions for the project. They also pointed us toward similar products already on the market so we could study their implementations and improve upon their approaches.

Datasets

To build our computer vision models, we evaluated a number of clothing-related datasets to determine which would give the best results for our vision of the project. After searching the internet for several days, we found several clothing datasets accessible for academic/research purposes. Details for each dataset are shown below.

| Dataset | # Classes | # Attributes | # Images | Total Size | Type of Dataset & Annotation | Preprocessing Required? | Viable? |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DeepFashion | 46 | 16 | 289,229 | 16 GB | image + bounding box + landmarks | Data augmentation to increase variability | Yes |
| DeepFashion2 | 13 | 12 | 286,743 | 17.2 GB | image + bounding box + landmarks + segmentation | Limit classes to higher occurrences | Yes |
| Clothing Dataset | 20 | 2 | 5,403 | 7.0 GB | image + metadata | Class balancing | Yes |
| Fabric Patterns* | 7 | 1 | 3,730 | 0.5 GB | image | *Custom scraped and cleansed | Yes |
| Fashion Product Images | 58 | 10 | 44,000 | 16 GB | image + metadata | Yes, unbalanced | No |
| Fashion MNIST | 10 | 1 | 60,000 / 10,000 | 10 MB | URIs + metadata | Download images | No |

After exploring each of the datasets, we decided to work with only the datasets marked viable above, as these best fit the needs of the project. Eventually we settled on a single dataset, DeepFashion2, as it had the best quality images and bounding boxes, which ultimately led to our best performing models.

Additionally, we scraped around 8,000 images from DuckDuckGo Images to build a dataset for texture/pattern detection. This was done in a Python notebook using a web scraping library, and the scrapes for each pattern were then manually verified to ensure only good quality images remained for training.
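
A minimal sketch of what this scraping step can look like. The exact library we used is not named above; this sketch assumes the third-party duckduckgo_search package, and the output directory and pattern subset are illustrative.

```python
# Sketch only: assumes the duckduckgo_search package (the exact scraping
# library used in the project is not specified in this write-up).
from duckduckgo_search import DDGS
import pathlib
import requests

def scrape_pattern(pattern: str, out_dir: str, max_results: int = 200) -> None:
    """Download candidate images for one fabric pattern for later manual review."""
    dest = pathlib.Path(out_dir) / pattern.replace(" ", "_")
    dest.mkdir(parents=True, exist_ok=True)
    with DDGS() as ddgs:
        for i, result in enumerate(
            ddgs.images(f"{pattern} fabric pattern", max_results=max_results)
        ):
            try:
                img = requests.get(result["image"], timeout=10)
                img.raise_for_status()
                (dest / f"{i:04d}.jpg").write_bytes(img.content)
            except requests.RequestException:
                continue  # skip dead links; results are manually verified afterwards

for pattern in ["stripe", "polka dot", "floral"]:  # subset of the pattern classes
    scrape_pattern(pattern, "data/fabric_patterns")
```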

Data Transformations

Below is a diagram of our approach to preparing the dataset for training. This includes filtering/sub-selecting the dataset, formatting the images for ingestion, and transforming/augmenting the images to increase the number and variation of the training samples.

In the end, we had around 130,000 images from DeepFashion2 with which to train our model. Each image was resized to 640 × 640 px for input.

The input image size for the texture/pattern model was 320 × 320 px.
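
The sketch below illustrates the kind of resize/augmentation transforms in the diagram, using torchvision. It is illustrative only: YOLOv5 training applies its own built-in augmentation pipeline, and the jitter values here are assumptions, not our exact settings.

```python
# Illustrative preprocessing/augmentation transforms (values are assumptions).
from torchvision import transforms

clothing_transforms = transforms.Compose([
    transforms.Resize((640, 640)),                         # clothing-type model input
    transforms.RandomHorizontalFlip(p=0.5),                # add left/right variation
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # lighting variation
    transforms.ToTensor(),
])

texture_transforms = transforms.Compose([
    transforms.Resize((320, 320)),                         # texture/pattern model input
    transforms.ToTensor(),
])
```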

Models

While searching for the best performing model, we initially developed a number of classification models (EfficientNet, ResNet-50, ResNet-50v2, and ResNet-101) before realizing that this was really an object detection problem. That realization led us to evaluate four principal object detection architectures: YOLOv5, Faster R-CNN, EfficientDet, and Mask R-CNN. Details of the models built with each architecture are shown below. All of the models tested used pretrained weights for transfer learning, to cut down on training time and improve performance.

| Object Detection Architecture | Object Detection Method | Backbone Model | ML Framework | Layers | Pretraining Dataset | Clothing Dataset(s) Trained On |
| --- | --- | --- | --- | --- | --- | --- |
| YOLOv5 / YOLOR | Bounding box | CSPDarknet | PyTorch | 24 | COCO | DeepFashion, DeepFashion2, Clothing Dataset, Fabric Patterns |
| Faster R-CNN | Bounding box | VGG16 | TensorFlow | 16 | COCO | DeepFashion2 |
| EfficientDet | Bounding box | EfficientNet | PyTorch Lightning, PyTorch | 36 | ImageNet | DeepFashion2 |
| Mask R-CNN | Segmentation | VGG16, MobileNet V2 | Keras, TensorFlow | 16 + 5, 3 | COCO, ImageNet | DeepFashion2 |
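
To make the transfer-learning step concrete, below is a minimal sketch in PyTorch/torchvision: start from COCO-pretrained detection weights and swap the classification head for our clothing classes. This is an analogue for illustration, not our exact code (our Faster R-CNN was built in TensorFlow with a VGG16 backbone, per the table above).

```python
# Transfer-learning sketch: reuse pretrained detection weights, replace the head.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Start from a detector pretrained on COCO...
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

# ...then swap the box-predictor head for the clothing classes
# (13 garment classes + 1 background, matching the table above).
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=14)
```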

Model Evaluation

Below are the evaluation results for each of the model architectures we explored for the primary clothing type identification model, along with our assessment of each model on real-world examples.

| Model Architecture | # of Classes | Image Res | Total Labels | Training Time | Precision | Recall | mAP@0.5 | Real-world Tests |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| YOLOv5 | 11 + background | 640×640 | 120,000 / 9,220 | 16 hours, 50 epochs | 0.67 | 0.756 | 0.74 | Satisfactory |
| EfficientDet | 11 + background | 512×512 | 110,000 | ~40 hours, 40 epochs | n/a | n/a | n/a | Satisfactory (6/7 samples) |
| Faster R-CNN | 13 + background | 640×640 | 480,000 | ~10 hours/epoch | n/a | n/a | n/a | n/a |
| Mask R-CNN | 13 + background | 1024×1024 | 480,000 | ~20 days, 50 epochs | n/a | n/a | n/a | 1/20 |

In the end, only EfficientDet and YOLOv5 produced inferences that could be tested on real-world examples at a satisfactory level. Note that EfficientDet has no values for precision, recall, and mAP@0.5 due to issues that caused training to freeze when calculating these metrics for each epoch. Ultimately, we chose the YOLOv5 architecture for our final model due to its ease of building new models and the well-documented examples from its community, which EfficientDet lacked.

The final clothing type model identifies 11 classes: short sleeve top, long sleeve top, long sleeve outerwear, short sleeve dress, long sleeve dress, sling dress, vest dress, vest, skirt, shorts, and trousers.

The texture/pattern model was also built using the YOLOv5 architecture. Its identifiable patterns include: plain (solid), stripe, polka dot, camouflage, zig-zag, paisley, floral, and plaid.

Final Model Hyperparameter Tuning

We experimented with different hyperparameter settings but ultimately got the best results using the YOLOv5 defaults.
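
For reference, a typical YOLOv5 training invocation with the default hyperparameters looks like the following; the dataset YAML and weights file names here are illustrative, not our exact configuration.

```bash
# YOLOv5's default hyperparameters apply when --hyp is not specified.
python train.py --img 640 --batch 16 --epochs 50 \
    --data deepfashion2.yaml --weights yolov5m.pt
```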

Infrastructure Development

Below is an architecture diagram of the Descriptive World ecosystem, showing the layout of the iOS version (top portion) and the AWS version (bottom portion).

The iOS application was built in Swift and C using native iOS libraries, with our trained YOLO models converted to TorchScript for on-device inference.
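
The YOLOv5 repository ships an export script for this kind of mobile conversion; a typical invocation (the weights file name is illustrative) is shown below.

```bash
# Convert trained PyTorch weights to TorchScript for on-device inference.
python export.py --weights best.pt --include torchscript
```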

The AWS version was built using several AWS services/APIs, including a Kinesis video stream, S3 buckets, AWS Lambda functions, an API Gateway, a Kinesis data stream, a SageMaker endpoint, and Amazon Polly.
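
A hypothetical sketch of the Lambda glue between these services is shown below. The endpoint name, event shape, detection payload, and voice are all illustrative assumptions, not our actual configuration; the boto3 calls themselves are standard.

```python
# Hypothetical Lambda handler: run inference on a frame, then narrate it.
import base64
import json
import boto3

sm_runtime = boto3.client("sagemaker-runtime")
polly = boto3.client("polly")

def handler(event, context):
    # A video frame arrives from the Kinesis pipeline (delivery details omitted).
    frame_bytes = base64.b64decode(event["frame"])

    # Run inference on the SageMaker endpoint hosting the detection model.
    response = sm_runtime.invoke_endpoint(
        EndpointName="descriptive-world-garments",  # illustrative name
        ContentType="application/x-image",
        Body=frame_bytes,
    )
    detections = json.loads(response["Body"].read())

    # Narrate the detected items with Amazon Polly.
    text = ", ".join(d["label"] for d in detections) or "no clothing detected"
    speech = polly.synthesize_speech(
        Text=f"I can see: {text}", OutputFormat="mp3", VoiceId="Joanna"
    )
    return {"narration_mp3": base64.b64encode(speech["AudioStream"].read()).decode()}
```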

Security & Privacy

  • Privacy by Design
    • Videos/images are not saved
    • Backgrounds and faces are removed; no other metadata is stored
    • Video streams sent to the cloud are encrypted
  • Legislation and Governance
    • Opt-in only to save images for improving training
    • Users can download their data and exercise the right to be forgotten
    • Clearly defined Privacy Policy, Acceptable Use Policy, and EULA

Ethical Considerations

We do not believe there are ethical implications of our product. However, we recognize that, because of the dataset we used, our classes (11) are limited and trained on common Western clothing. We do not yet have data on other cultural outfits, and we would like to enrich our dataset with clothing from other cultures and add further datasets in the future.

Product Development

We have treated Descriptive World as a product from day one, as we discuss further in the Go-to-Market section. We gathered user requirements from our subject matter experts and performed user acceptance testing of our prototypes.

Results

Performance

Below are the results for our final clothing type object detection model. Overall, we achieved a mean Average Precision (mAP@0.5) of 71% for identification.

Prototype

We prototyped a composite model of garment, color, and texture identification. To achieve this, we use a pipeline with two YOLOv5 models loaded (garment and texture), with the texture extracted from the center portion of the detected garment. The color of the garment is determined with a Python library that matches the most dominant color to a CSS1 color name. Model sizes range from 30 MB to 185 MB for the largest. Inference takes around 0.1-0.3 s per model.
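
A condensed sketch of that pipeline is shown below. The model paths, the center-crop fraction, and the hand-rolled CSS1 color table are illustrative assumptions (the actual prototype uses an existing Python color library for the name lookup).

```python
# Sketch of the two-model pipeline: garment detection, texture on the center
# crop, and nearest CSS Level 1 color name for the dominant color.
import torch
from PIL import Image

garment_model = torch.hub.load("ultralytics/yolov5", "custom", path="garment.pt")
texture_model = torch.hub.load("ultralytics/yolov5", "custom", path="texture.pt")

CSS1_COLORS = {  # the 16 CSS Level 1 named colors
    "black": (0, 0, 0), "white": (255, 255, 255), "red": (255, 0, 0),
    "lime": (0, 255, 0), "blue": (0, 0, 255), "yellow": (255, 255, 0),
    "aqua": (0, 255, 255), "fuchsia": (255, 0, 255), "silver": (192, 192, 192),
    "gray": (128, 128, 128), "maroon": (128, 0, 0), "olive": (128, 128, 0),
    "green": (0, 128, 0), "purple": (128, 0, 128), "teal": (0, 128, 128),
    "navy": (0, 0, 128),
}

def nearest_css1_name(rgb):
    """Return the CSS1 color name closest to rgb in Euclidean distance."""
    return min(CSS1_COLORS,
               key=lambda n: sum((a - b) ** 2 for a, b in zip(rgb, CSS1_COLORS[n])))

def describe(image_path):
    img = Image.open(image_path).convert("RGB")
    for *box, conf, cls in garment_model(img).xyxy[0].tolist():
        x1, y1, x2, y2 = map(int, box)
        garment = garment_model.names[int(cls)]
        # Texture is classified on the center portion of the detected garment.
        cw, ch = (x2 - x1) // 4, (y2 - y1) // 4
        center = img.crop((x1 + cw, y1 + ch, x2 - cw, y2 - ch))
        tex_det = texture_model(center).xyxy[0].tolist()
        texture = texture_model.names[int(tex_det[0][5])] if tex_det else "plain"
        # Dominant color: most frequent pixel of a downsampled center crop.
        dominant = max(center.resize((32, 32)).getcolors(32 * 32),
                       key=lambda c: c[0])[1]
        yield f"a {nearest_css1_name(dominant)} {texture} {garment}"
```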

Minimum Viable Product

With both of our platforms, we set out to allow a user without the benefit of sight to identify the clothing items in front of them. To do so, we wanted to incorporate the following functionality:

– Accurately identify 11 classes of clothing items (object detection)

– Accurately identify 7 textures (patterns) of a clothing item (texture classification)

– Accurately identify the color of a clothing item (color identification)

– Implement a system to perform the aforementioned using state-of-the-art techniques

We were able to meet these objectives with our models and incorporate them into two unique platforms, developing two MVPs to demonstrate Descriptive World's capabilities: one runs entirely in the cloud on AWS, exposing public APIs fronted by a demo web page; the other is built on Apple iOS, runs on an iPhone, and is completely self-contained.

Our self-contained implementation requires no Internet access to operate and employs a lightweight model capable of running on mobile devices. Our cloud-based solution leverages AWS services to ingest video in real time, perform inference, and provide both a narration of the clothing items in the field of view and images annotated with bounding boxes and confidence levels. Our model can detect any number of clothing items in the field of view. The iPhone app and our web implementation also leverage speech-to-text and natural language processing to handle user commands, so the user can interact with our system without the need for sight.

The demo videos at the top of this page are the best illustrations of how our two MVPs work.

Check out the GitHub tab for all of the code used and tested in this project.

Business Plan

Business Model Canvas

Our business model canvas.

Competitive Analysis

Below is a comparison of how our approach performs against other solutions available on the market.

| Company | Product | Voice Commands | Voice Narration | Clothing Recognition | Color Recognition | Texture Recognition | Mobile Services | 100% Hands Free | Cost |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Descriptive World | Outfits | YES | YES | YES | YES | YES | YES | YES | ~$2,500 |
| Envision | Glasses | YES | YES | YES | YES | NO | NO | NO | ~$3,800 |
| OrCam | MyEye | YES | YES | YES | YES | NO | NO | NO | ~$3,700 |
| eSight | eSight 4 | YES | NO | NO | NO | NO | NO | NO | ~$6,000 |
| NuEyes | NuEyes Pro | YES | NO | NO | NO | NO | NO | NO | ~$6,000 |

SWOT Analysis

A SWOT analysis of where Descriptive World Outfits is positioned in the market.