Voice SDK

Speaker recognition for stand-alone or Web applications

VeriSpeak voice identification technology is designed for biometric system developers and integrators. The text-dependent speaker recognition algorithm assures system security by checking both voice and phrase authenticity. Voiceprint templates can be matched in 1-to-1 (verification) and 1-to-many (identification) modes.

VeriSpeak is available as a software development kit (SDK) that enables the development of stand-alone and Web-based speaker recognition applications on Microsoft Windows, Linux, macOS, iOS and Android platforms.

Features and Capabilities
Text-dependent algorithm prevents unauthorized access with a covertly-recorded user voice.
Two-factor authentication by checking voice biometrics and pass-phrase authenticity.
Regular microphones and smartphones are suitable for recording user voices.
Available as a multiplatform SDK that supports multiple programming languages.
Reasonable prices, flexible licensing and free customer support.
The VeriSpeak algorithm implements voice enrollment and voiceprint matching using proprietary sound processing technologies:
Text-dependent algorithm. The text-dependent speaker recognition is based on saying the same phrase for enrollment and verification. The VeriSpeak algorithm determines if a voice sample matches the template that was extracted from a specific phrase. During enrollment, one or more phrases are requested from the person being enrolled. Later that person may be asked to pronounce a specific phrase for verification. This method assures protection against the use of a covertly recorded random phrase from that person.
Two-factor authentication with a passphrase. The VeriSpeak voiceprint-matching algorithm can be configured to work in a scenario where each user records a unique phrase (such as a passphrase or an answer to a "secret question" that is known only by the person being enrolled). Later the person is recognized by his or her own specific phrase with a high degree of accuracy. The overall system security increases as both the voice authenticity and the passphrase are checked.
Text-independent algorithm. The phrase-independent speaker recognition uses different phrases for user enrollment and recognition. This method is more convenient, as it does not require each user to remember the passphrase. It may be combined with the text-dependent algorithm to perform faster text-independent search with further phrase verification using the more reliable text-dependent algorithm.
Automatic voice activity detection. VeriSpeak is able to detect when users start and finish speaking.
Liveness detection. A system may request each user to enroll a set of unique phrases. Later the user will be requested to say a specific phrase from the enrolled set. This way the system can ensure that a live person is being verified (as opposed to an impostor who uses a voice recording).
Identification capability. VeriSpeak functions can be used in 1-to-1 matching (verification) and 1-to-many (identification) modes.
Multiple samples of the same phrase. A template may store several voice records with the same phrase to improve recognition reliability. Certain natural voice variations (e.g. a hoarse voice) or environment changes (e.g. office and outdoors) can be stored in the same template.
Fused matching. A system may ask users to pronounce several specific phrases during speaker verification or identification and match each audio sample against records in the database. The VeriSpeak algorithm can fuse the matching results for each phrase together to improve matching reliability.
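
The fused matching idea above can be sketched in a few lines. This is a minimal illustration only: the source does not state which fusion rule VeriSpeak uses, so the arithmetic mean, the score scale, the threshold value and the function name fuse_scores are all assumptions made for the example and are not part of the SDK.

```python
from statistics import mean


def fuse_scores(phrase_scores, threshold):
    """Fuse per-phrase matching scores and compare the result to a threshold.

    phrase_scores: one similarity score per spoken phrase (higher = more similar).
    threshold: hypothetical decision threshold chosen by the integrator.
    Returns (fused_score, accepted).
    """
    if not phrase_scores:
        raise ValueError("at least one phrase score is required")
    fused = mean(phrase_scores)  # assumed fusion rule: arithmetic mean
    return fused, fused >= threshold


# Three phrases matched against one enrolled user; the second phrase scored
# poorly (for example, because of background noise), but the fused result
# still passes the illustrative threshold.
fused, accepted = fuse_scores([62.0, 35.0, 71.0], threshold=48.0)
```

With a rule like this, a single poorly recorded phrase does not decide the outcome on its own, which is the practical benefit the fused matching feature describes.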
SDK Contents

VeriSpeak SDK is based on VeriSpeak voice recognition technology and is designed for biometric systems developers and integrators. The SDK allows rapid development of biometric applications using functions from the VeriSpeak algorithm. VeriSpeak can be easily integrated into the customer's security system. The integrator has complete control over SDK data input and output.

VeriSpeak is available as the following SDKs:
VeriSpeak Standard SDK is designed for PC-based, embedded or mobile biometric application development. It includes Voice Matcher and Extractor component licenses, programming samples and tutorials and software documentation. The SDK enables the development of biometric applications for Microsoft Windows, Linux, macOS, Android and iOS operating systems.
VeriSpeak Extended SDK is designed for biometric Web-based and network application development. It includes all features and components of the Standard SDK with the addition of Voice Client component licenses for PCs and Android devices, sample client applications, tutorials and a ready-to-use matching server component.

License Activation Options

The components are copy-protected. The following license activation options are available:

Serial numbers are used to activate licenses for particular VeriSpeak components on a particular computer or device. Activation is done via the Internet or by email. After activation, a network connection is not required for single computer license usage. Notes:
  • Activation by serial number is not suitable for ARM-Linux, except BeagleBone Black and Raspberry Pi 3 devices.
  • Activation by serial number is not suitable for virtual environments.
Internet activation. A special license file is stored on a computer or a mobile/embedded device; the license file allows particular VeriSpeak components to run on that computer or device after the license is checked over the Internet. An Internet connection should be available periodically for a short amount of time. A single computer license can be transferred to another computer or device by moving the license file there and waiting until the previous activation expires.
Volume License Manager. Licenses may be stored in a volume license manager dongle. License activation may be performed without a connection to the Internet and is suitable for virtual environments. The volume license manager is used on site by integrators or end users to manage licenses for VeriSpeak components in the following ways:
  • Activating single computer licenses – An installation license for a VeriSpeak component will be activated for use on a particular computer. The number of available licenses in the license manager will be decreased by the number of activated licenses.
  • Managing single computer licenses via a LAN or the Internet – The license manager allows the management of installation licenses for VeriSpeak components across multiple computers or mobile/embedded devices in a LAN or over the Internet. The number of managed licenses is limited by the number of licenses in the license manager. No license activation is required and the license quantity is not decreased. Once issued, the license is assigned to a specific computer or device on the network.
  • Using license manager as a dongle – A volume license manager containing at least one license for a VeriSpeak component may be used as a dongle, allowing the VeriSpeak component to run on the particular computer where the dongle is attached.

Additional VeriSpeak component licenses for the license manager may be purchased at any time.

Licenses Validity

All SDK and component licenses are perpetual and do not expire. There are no annual fees or any other fees apart from the license purchase fee. It is possible to move licenses from one computer or device to another. Neurotechnology provides a way to renew the license if the computer undergoes changes due to technical maintenance.

SDK Components

The table below compares VeriSpeak 12.1 Standard SDK and VeriSpeak 12.1 Extended SDK.

VeriSpeak SDK components and licenses

Component                 VeriSpeak 12.1 Standard SDK   VeriSpeak 12.1 Extended SDK
Voice component licenses included with a specific SDK:
Voice Extractor           1 single computer license     1 single computer license
Voice Matcher             1 single computer license     1 single computer license
Voice Client              -                             3 single computer licenses
Mobile Voice Extractor    1 single computer license     1 single computer license
Mobile Voice Matcher      1 single computer license     1 single computer license
Mobile Voice Client       -                             3 single computer licenses
Matching Server           -                             +

VeriSpeak SDK includes programming samples and tutorials that show how to use the components of the SDK to perform voice template extraction or matching against other templates. The samples and tutorials are available for these programming languages and platforms:

                      Windows 32 & 64 bit  Linux 32 & 64 bit  macOS  Android  iOS
Programming samples
  C/C++               +                    +                  +      -        -
  Objective-C         -                    -                  -      -        +
  C#                  +                    -                  -      -        -
  Visual Basic .NET   +                    -                  -      -        -
  Java                +                    +                  +      +        -
Programming tutorials
  C                   +                    +                  +      -        -
  C++                 +                    +                  +      -        -
  C#                  +                    -                  -      -        -
  Visual Basic .NET   +                    -                  -      -        -
  Java                +                    +                  +      +        -
System Requirements

There are specific requirements for each platform which will run VeriSpeak-based applications.

Microsoft Windows Platform Requirements

Microsoft Windows 7 / 8 / 10.
PC or laptop with x86-64 (64-bit) compatible processors.
  • 2 GHz or better processor is recommended.
  • x86 (32-bit) processors can still be used, but the algorithm will not provide the specified performance.
  • AVX2 support is highly recommended. Processors that do not support AVX2 will still run the VeriSpeak algorithms, but in a mode that will not provide the specified performance. Most modern processors support this instruction set, but please check whether a particular processor model supports it.
2 GB of free RAM is recommended for general usage scenarios. It is possible to reduce RAM usage for particular scenarios. Also, additional RAM may be required for applications that perform 1-to-many identification, as all biometric templates need to be stored in RAM for matching.
Microphone. Any microphone that is supported by the operating system can be used.
Database engine or connection with it. VeriSpeak templates can be saved into any DB (including files) supporting binary data saving. VeriSpeak Extended SDK contains the following support modules for Matching Server on Microsoft Windows platform:
  • Microsoft SQL Server;
  • MySQL;
  • Oracle;
  • PostgreSQL;
  • SQLite.
Network/LAN connection (TCP/IP) for client/server applications. A network connection is also required for using the Matching Server component (included in VeriSpeak Extended SDK). VeriSpeak SDK does not encrypt communication with the Matching Server, so integrators should secure the communication themselves.
Microsoft .NET Framework 4.5 or newer (for .NET components usage).
One of the following development environments for application development:
  • Microsoft Visual Studio 2012 or newer (for application development under C/C++, C#, Visual Basic .NET)
  • Java SE JDK 8 or newer
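
The requirement that templates can be saved into any database that supports binary data can be illustrated with a small sketch. It uses Python's standard sqlite3 module and stores each template as a BLOB. The table layout, the function names and the template bytes are assumptions made for this example; how the template bytes are obtained from the SDK is not shown here.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a database with one BLOB column for templates."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS templates ("
        "subject_id TEXT PRIMARY KEY, template BLOB NOT NULL)"
    )
    return conn


def save_template(conn, subject_id, template_bytes):
    """Insert or overwrite the stored template for one subject."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO templates (subject_id, template) VALUES (?, ?)",
            (subject_id, template_bytes),
        )


def load_all_templates(conn):
    """Load every template into memory, as 1-to-many identification requires."""
    rows = conn.execute("SELECT subject_id, template FROM templates")
    return {subject_id: bytes(blob) for subject_id, blob in rows}


conn = open_store()
save_template(conn, "alice", b"\x00\x01\x02")  # placeholder bytes, not a real template
templates = load_all_templates(conn)
```

The same pattern applies to the other listed database engines, since the template is treated as an opaque binary value.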

Android Platform Requirements

A smartphone or tablet that is running Android 5.0 (API level 21) OS or newer.
  • If you have a custom Android-based device or development board, contact us to find out if it is supported.
An ARM-based 1.5 GHz processor is recommended for voiceprint processing in the specified time. Slower processors may also be used, but the voiceprint processing will take longer.
At least 1 GB of free RAM should be available for the application. Additional RAM is required for applications that perform 1-to-many identification, as all biometric templates need to be stored in RAM for matching.
Any smartphone's or tablet's built-in or headset microphone which is supported by Android OS.
Network/LAN connection (TCP/IP) for client/server applications. A network connection is also required for using the Matching Server component (included in VeriSpeak Extended SDK). VeriSpeak SDK does not encrypt communication with the Matching Server, so integrators should secure the communication themselves.
PC-side development environment requirements:
  • Java SE JDK 8 (or higher)
  • Android Studio 4.0 IDE
  • Android SDK with API level 21 or newer
  • Gradle 6.1.1 build automation system or newer
  • Android Gradle Plugin 4.0.0
  • Internet connection for activating VeriSpeak component licenses

iOS Platform Requirements

One of the following devices, running iOS 11.0 or newer:
  • iPhone 5S or newer iPhone.
  • iPad Air or newer iPad models.
At least 1 GB of free RAM should be available for the application. Additional RAM is required for applications that perform 1-to-many identification, as all biometric templates need to be stored in RAM for matching.
Any iPhone's or iPad's built-in or headset microphone which is supported by iOS.
Network/LAN connection (TCP/IP) for client/server applications. Also, network connection is required for using Matching server component (included in VeriSpeak Extended SDK). Communication with Matching server is not encrypted, therefore, if communication must be secured, a dedicated network (not accessible outside the system) or a secured network (such as VPN; VPN must be configured using operating system or third party tools) is recommended.
Development environment requirements:
  • A Mac running macOS 10.12.6 or newer.
  • Xcode 9.x or newer.

macOS Platform Requirements

A Mac running macOS 10.12.6 or newer.
  • 2 GHz or better processor is recommended.
  • AVX2 support is highly recommended. Processors that do not support AVX2 will still run the VeriSpeak algorithms, but in a mode that will not provide the specified performance. Most modern processors support this instruction set, but please check whether a particular processor model supports it.
2 GB of free RAM is recommended for general usage scenarios. It is possible to reduce RAM usage for particular scenarios. Also, additional RAM may be required for applications that perform 1-to-many identification, as all biometric templates need to be stored in RAM for matching.
Microphone. Any microphone that is supported by the operating system can be used.
Database engine or connection with it. VeriSpeak templates can be saved into any DB (including files) supporting binary data saving. VeriSpeak Extended SDK contains SQLite support modules for Matching Server on macOS platform.
Network/LAN connection (TCP/IP) for client/server applications. A network connection is also required for using the Matching Server component (included in VeriSpeak Extended SDK). VeriSpeak SDK does not encrypt communication with the Matching Server, so integrators should secure the communication themselves.
Specific requirements for application development:
  • Xcode 6.x or newer
  • GNU Make 3.81 or newer (to build samples and tutorials)
  • Java SE JDK 8 or newer

Linux x86-64 Platform Requirements

Linux 3.10 kernel or newer is required.
PC or laptop with x86-64 (64-bit) compatible processors.
  • 2 GHz or better processor is recommended.
  • x86 (32-bit) processors can still be used, but the algorithm will not provide the specified performance.
  • AVX2 support is highly recommended. Processors that do not support AVX2 will still run the VeriSpeak algorithms, but in a mode that will not provide the specified performance. Most modern processors support this instruction set, but please check whether a particular processor model supports it.
2 GB of free RAM is recommended for general usage scenarios. It is possible to reduce RAM usage for particular scenarios. Also, additional RAM may be required for applications that perform 1-to-many identification, as all biometric templates need to be stored in RAM for matching.
Microphone. Any microphone that is supported by the operating system can be used.
glibc 2.17 library or newer
alsa-lib 1.1.6 or newer (for voice capture)
libgudev-1.0 219 or newer (for microphone usage)
Database engine or connection with it. VeriSpeak templates can be saved into any DB (including files) supporting binary data saving. VeriSpeak Extended SDK contains the following support modules for Matching Server on Linux platform:
  • MySQL;
  • Oracle;
  • PostgreSQL;
  • SQLite.
Network/LAN connection (TCP/IP) for client/server applications. A network connection is also required for using the Matching Server component (included in VeriSpeak Extended SDK). VeriSpeak SDK does not encrypt communication with the Matching Server, so integrators should secure the communication themselves.
Specific requirements for application development:
  • gcc 4.8 or newer
  • GNU Make 3.81 or newer
  • Java SE JDK 8 or newer

ARM Linux Platform Requirements

We recommend contacting us and reporting the specifications of the target device to find out whether it is suitable for running VeriSpeak-based applications.
The common requirements for the ARM Linux platform are:
A device with ARM-based processor, running Linux 3.2 kernel or newer.
An ARM-based 1.5 GHz processor is recommended for voiceprint processing in the specified time.
  • ARMHF architecture (EABI 32-bit hard-float ARMv7) is required.
  • Lower clock-rate processors may also be used, but the voiceprint processing will take longer.
At least 1 GB of free RAM should be available for the application. Additional RAM is required for applications that perform 1-to-many identification, as all biometric templates need to be stored in RAM for matching.
Microphone. Any microphone that is supported by the operating system can be used.
glibc 2.17 or newer.
alsa-lib 1.1.6 or newer (for voice capture)
libgudev-1.0 219 or newer (for microphone usage)
Network/LAN connection (TCP/IP) for client/server applications. A network connection is also required for using the Matching Server component (included in VeriSpeak Extended SDK). VeriSpeak SDK does not encrypt communication with the Matching Server, so integrators should secure the communication themselves.
Development environment specific requirements:
  • gcc 4.8 or newer
  • GNU Make 3.81 or newer
  • Java SE JDK 8 or newer
Technical Specifications
General recommendations:
  • The speaker recognition accuracy of VeriSpeak depends on the audio quality during enrollment and identification.
  • Voice samples at least 2 seconds long are recommended to assure speaker recognition quality.
  • A passphrase should be kept secret and not spoken in an environment where others may hear it if the speaker recognition system is used in a scenario with unique phrases for each user.
  • Text-independent speaker recognition may be vulnerable to an attack with a covertly recorded phrase from a person. Passphrase verification or two-factor authentication (e.g. a requirement to type a password) will increase the overall system security.
Microphones – there are no particular constraints on models or manufacturers when using regular PC microphones, headsets or the built-in microphones in laptops, smartphones and tablets. However, these factors should be noted:
  • The same microphone model is recommended (if possible) for use during both enrollment and recognition, as different models may produce different sound quality. Some models may also introduce specific noise or distortion into the audio, or may include certain hardware sound processing, which will not be present when using a different model. This is also the recommended procedure when using smartphones or tablets, as different device models may alter the recording of the voice in different ways.
  • The same microphone position and distance is recommended during enrollment and recognition. Headsets provide optimal distance between user and microphone; this distance is recommended when non-headset microphones are used.
  • Built-in webcam microphones should be used with care, as they are usually positioned at a rather long distance from the user and may provide lower sound quality. The sound quality may be affected if users subsequently change their position relative to the webcam.
Sound settings:
  • Settings for clear sound must be ensured; some audio software, hardware or drivers may have sound modification enabled by default. For example, the Microsoft Windows OS usually has, by default, sound boost enabled.
  • A minimum 11025 Hz sampling rate, with at least 16-bit depth, should be used during voice recording.
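
The recording recommendations above (at least 11025 Hz, at least 16-bit depth, and voice samples at least 2 seconds long) can be checked before a recording is passed on for enrollment. The sketch below uses Python's standard wave module on an uncompressed WAV file. The function names and the choice of WAV as the container are assumptions for illustration; the SDK's own capture and validation functions are not shown.

```python
import io
import wave

MIN_SAMPLE_RATE_HZ = 11025   # minimum sampling rate from the recommendations
MIN_SAMPLE_WIDTH_BYTES = 2   # 16-bit depth
MIN_DURATION_S = 2.0         # recommended minimum voice sample length


def check_recording(wav_file):
    """Return a list of problems found in a WAV file (empty list means OK).

    wav_file may be a file path or a binary file-like object.
    """
    with wave.open(wav_file, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        duration = wav.getnframes() / rate
    problems = []
    if rate < MIN_SAMPLE_RATE_HZ:
        problems.append(f"sampling rate {rate} Hz is below {MIN_SAMPLE_RATE_HZ} Hz")
    if width < MIN_SAMPLE_WIDTH_BYTES:
        problems.append(f"bit depth {8 * width} is below 16 bits")
    if duration < MIN_DURATION_S:
        problems.append(f"duration {duration:.2f} s is below {MIN_DURATION_S} s")
    return problems


def make_silence(rate, width, seconds):
    """Build an in-memory mono WAV of silence, used here only for demonstration."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (width * int(rate * seconds)))
    buf.seek(0)
    return buf


good = check_recording(make_silence(16000, 2, 3))  # meets all three limits
bad = check_recording(make_silence(8000, 1, 1))    # violates all three limits
```

A check like this only validates the file parameters; it says nothing about noise, clipping or sound modification applied by drivers, which the settings above also warn about.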
Environment constraints – the VeriSpeak speaker recognition engine is sensitive to noise or loud voices in the background; they may interfere with the user's voice and affect the recognition results. The following solutions may be considered to reduce or eliminate these problems:
  • A quiet environment for enrollment and recognition.
  • Several samples of the same phrase recorded in different environments can be stored in a biometric template. Later the user will be matched against these samples with much higher recognition quality.
  • Close-range microphones (like those in headsets or smartphones) that are not affected by distant sources of sound.
  • Third-party or custom solutions for background noise reduction, such as using two separate microphones for recording user voice and background sound, and later subtracting the background noise from the recording.
User behavior and voice changes:
  • Natural voice changes may affect speaker recognition accuracy:
    • a temporarily hoarse voice caused by a cold or other sickness;
    • different emotional states that affect the voice (e.g. a cheerful voice versus a tired voice);
    • different pronunciation speeds during enrollment and identification.
  • The aforementioned voice and user behavior changes can be managed in two ways:
    • separate enrollments for the altered voice, storing the records in the same person's template;
    • a controlled, neutral voice during enrollment and identification.

All voice templates should be loaded into RAM before identification, thus the maximum voice template database size is limited by the amount of available RAM.

The voiceprint template size has linear dependence on the voice sample length. For example, when using voice samples that are 2 times shorter, the template size values will be 2 times smaller.
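
The two statements above (templates are held in RAM, and record size scales linearly with sample length) allow a rough capacity estimate. The sketch below takes the 3,500 to 4,500 byte range that the specification table gives for 5-second samples and scales it linearly. The function names are hypothetical, and the estimate deliberately ignores any per-template or per-process overhead, which the source does not quantify.

```python
REFERENCE_SAMPLE_S = 5.0                 # sample length the table figures refer to
REFERENCE_RECORD_BYTES = (3500, 4500)    # record size range from the specification table


def record_size_range(sample_seconds):
    """Scale the reference record size linearly with the sample length."""
    scale = sample_seconds / REFERENCE_SAMPLE_S
    low, high = REFERENCE_RECORD_BYTES
    return round(low * scale), round(high * scale)


def database_ram_bytes(subjects, records_per_subject, sample_seconds):
    """Upper-bound estimate of the RAM needed to hold all voiceprint records."""
    _, high = record_size_range(sample_seconds)
    return subjects * records_per_subject * high


half_length = record_size_range(2.5)             # samples 2 times shorter
worst_case = database_ram_bytes(100_000, 3, 5.0)  # 100,000 subjects, 3 records each
```

For the example values, halving the sample length halves the record size range, and 100,000 subjects with three 5-second records each need about 1.35 GB for the records alone.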

VeriSpeak 12.1 text-dependent engine can perform template matching in two modes:
Fixed phrase – each subject in the database has recorded the same phrase. This mode provides faster matching, but lower reliability.
Unique phrase – each subject in the database has recorded a unique phrase. This mode provides higher reliability, but slower matching.

The VeriSpeak biometric template extraction and matching algorithm is designed to run on multi-core processors, allowing it to reach the maximum possible performance on the hardware in use.

VeriSpeak 12.1 text-dependent voiceprint engine specifications

                                            Android-based platform                  PC-based platform
Template extraction components              Mobile Voice        Mobile Voice        Voice               Voice
                                            Extractor           Client              Extractor           Client
Template extraction time (seconds)          1.34 (1)            1.20 (1)            1.34 (2)            0.60 (2)
Template matching components                Mobile Voice Matcher                    Voice Matcher
Template matching speed, fixed phrase mode  100 (1)                                 8,000 (2)
  (voiceprints per second)
Template matching speed, unique phrase mode 20 (1)                                  1,700 (2)
  (voiceprints per second)
Single voiceprint record size in a template 3,500 - 4,500
  with 5-second voice samples (bytes)

Notes:
(1) Requires an Android device based on at least a Snapdragon S4 system-on-chip with a Krait 300 processor (4 cores, 1.51 GHz).
(2) Requires a PC or laptop with at least an Intel Core i7-8700K processor.
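
The matching speeds in the table translate directly into an estimate of how long a 1-to-many search takes. The sketch below simply divides the number of stored voiceprints by the published speed. It assumes hardware equivalent to notes (1) and (2) and a search that scans every voiceprint; the function name and dictionary keys are made up for the example.

```python
# Voiceprints per second, copied from the specification table.
MATCH_SPEED = {
    ("pc", "fixed"): 8000,
    ("pc", "unique"): 1700,
    ("android", "fixed"): 100,
    ("android", "unique"): 20,
}


def identification_seconds(voiceprints, platform, mode):
    """Estimated time for one full scan of the voiceprint database."""
    return voiceprints / MATCH_SPEED[(platform, mode)]


pc_fixed = identification_seconds(40_000, "pc", "fixed")
pc_unique = identification_seconds(1_700, "pc", "unique")
android_unique = identification_seconds(1_000, "android", "unique")
```

Under these assumptions, a PC scans 40,000 fixed-phrase voiceprints in about 5 seconds, while an Android device needs about 50 seconds for 1,000 unique-phrase voiceprints, which suggests keeping large identification databases on the server side.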

Reliability Tests

The VeriSpeak 12.1 algorithm has been tested with voice samples taken from the XM2VTS Database, as well as with voice samples from Neurotechnology's internal database.

These voice template matching experiments were performed with the VeriSpeak 12.1 text-dependent engine:

Experiment 1 used voice samples from the XM2VTS database. All samples included the same fixed phrase pronounced by all subjects.
Experiment 2 used voice samples from Neurotechnology's internal voice database 1. All samples included the same fixed phrase pronounced by all subjects.
Experiment 3 used voice samples from Neurotechnology's internal voice database 2. Each subject pronounced a unique phrase during his/her recording.
[Figure, Experiment 1: VeriSpeak ROC chart calculated using voice samples from the XM2VTS database]
[Figure, Experiments 2 and 3: VeriSpeak ROC chart calculated using voice samples from Neurotechnology's internal databases]

Receiver operating characteristic (ROC) curves are commonly used to demonstrate the recognition quality of an algorithm. ROC curves show the dependence of the false rejection rate (FRR) on the false acceptance rate (FAR). Charts with ROC curves for each of the experiments are shown above.
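
For readers who want to reproduce this kind of evaluation on their own data, the sketch below shows how FAR, FRR and an approximate equal error rate (EER) can be computed from two lists of matching scores. It is a generic illustration, not Neurotechnology's test procedure: the scores are made-up toy values, higher scores are assumed to mean greater similarity, and the EER is approximated at the tested threshold where FAR and FRR are closest.

```python
def far_frr(genuine, impostor, threshold):
    """Error rates at one threshold; scores >= threshold count as a match."""
    far = sum(s >= threshold for s in impostor) / len(impostor)  # impostors accepted
    frr = sum(s < threshold for s in genuine) / len(genuine)     # genuine users rejected
    return far, frr


def roc_points(genuine, impostor):
    """(threshold, FAR, FRR) for every distinct score used as a threshold."""
    thresholds = sorted(set(genuine) | set(impostor))
    return [(t, *far_frr(genuine, impostor, t)) for t in thresholds]


def approximate_eer(genuine, impostor):
    """Mean of FAR and FRR at the threshold where the two rates are closest."""
    _, far, frr = min(roc_points(genuine, impostor), key=lambda p: abs(p[1] - p[2]))
    return (far + frr) / 2


# Toy scores for illustration only (not measured data).
genuine_scores = [0.9, 0.8, 0.7, 0.4]    # same speaker, same phrase
impostor_scores = [0.1, 0.2, 0.3, 0.6]   # different speaker

rates_at_half = far_frr(genuine_scores, impostor_scores, 0.5)
eer = approximate_eer(genuine_scores, impostor_scores)
```

Plotting the FRR against the FAR from roc_points gives a curve of the kind shown in the charts, and rows such as "FRR at 0.1 % FAR" correspond to reading the FRR at the threshold that yields that FAR.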

VeriSpeak 12.1 text-dependent algorithm tests with XM2VTS and Neurotechnology's internal databases

                                        Exp. 1      Exp. 2      Exp. 3
Total voice samples in the database     2360        309         305
Subjects in the database                295         42          42
Recording sessions per subject          8           1 - 10      1 - 10
Average voice sample length (seconds)   7.112       4.975       6.214
EER                                     0.5647 %    1.1750 %    0.1720 %
FRR at 0.1 % FAR                        1.5630 %    3.2000 %    0.2854 %
FRR at 0.01 % FAR                       4.1560 %    6.9110 %    0.3806 %
FRR at 0.001 % FAR                      11.230 %    8.3490 %    0.4757 %