vAIsual Inc, the company behind the largest visual dataset collection in the world, today launched the first of its Asian diaspora non-biometric datasets, consisting of thousands of images of Asian people and scenes.
The Asian People in Context dataset, with over 20,000 images, will play a crucial role in training AI models to recognize, classify, and analyze Asian scenes and characters. The resulting trained models can contribute to applications for environment detection, generative AI and human identification.
The dataset is the first of many delivered through a partnership with Vietnam-based stock agency Dragonimages.
Machine learning researchers and data scientists can use the datasets for a variety of purposes. All the images are legally cleared with model releases and trademark compliance. The datasets are available to non-European customers only, because the model releases are not GDPR-compliant.
The images feature mostly Asian people, of various ages and genders, in a range of contexts, including streets, cafes, workplaces and retail settings.
The datasets are specially prepared to meet the needs of ML teams, with detailed and consistent metatags, high-resolution images, and, most importantly, legal clearances.
Self-service access to the datasets is via the Dataset Shop, established in 2022 by clean data specialists vAIsual Inc, and specifically catering to research and engineering teams training AI for a range of applications.
According to vAIsual CEO Michael Osterrieder, diversity is king in AI training, and the company's customers have been anticipating access to datasets with Asian identities.
"We are excited to launch this first Asian People in Context dataset that focuses on the Asian diaspora. Using our proprietary dataset-building technology, we can now assemble datasets consisting of tens of thousands of images of a particular theme or subject.
"Being able to collate and package these datasets saves hundreds of hours for engineers to prepare material for AI training," says Osterrieder.
While reducing time is a core benefit, Osterrieder also emphasizes the importance of having full legal clearance.
"We are starting to see dataset disclosure requirements emerging in some jurisdictions, which will mean any AI model trained on scraped data will risk being blocked," says Osterrieder.
The availability of legally clean datasets that also remunerate the original content creators is an important step toward ensuring that companies building AI technology do so ethically and responsibly.
"Offering custom-prepared datasets containing premium visual content, with the consent of the original copyright owners (or their legal representatives), is essential for the AI industry to mature into a truly commercial and viable industry," says Osterrieder.
In the coming weeks, additional datasets will be added to datasetshop.com. The datasets are specially prepared for engineers to add to their workflow for AI training and are commercially available in a variety of resolutions.