HMM-BASED SEQUENCE-TO-FRAME MAPPING FOR VOICE CONVERSION

ABSTRACT
Voice conversion can be reduced to the problem of finding a transformation function between the corresponding speech sequences of two speakers. Perhaps the most popular voice conversion methods are GMM-based statistical mapping methods [1, 2]. However, the classical GMM-based mapping is frame-to-frame and cannot take account of the contextual information that exists over a speech sequence. It is well known that the HMM provides an efficient way to model the density of a whole speech sequence and has found great success in speech recognition and synthesis. Inspired by this fact, this paper studies how to use HMMs for voice conversion. We derive an HMM-based sequence-to-frame mapping function by statistical analysis. Unlike previous HMM-based voice conversion methods [3, 4, 5], which use forced alignment for segmentation and transform the frames aligned to a state with that state's associated linear transformation, our method has a soft mapping function that is a weighted summation of linear transformations. The weights are calculated as the HMM posterior probabilities of frames. We also propose and compare two methods for learning the parameters of our mapping function, namely least square error estimation and maximum likelihood estimation. We carried out experiments to examine the proposed HMM-based method for voice conversion.
Index Terms: Voice conversion, sequence-to-frame mapping, HMM, speech synthesis
1. INTRODUCTION
Voice conversion (VC) aims at transforming a speaker's voice to make it sound like another speaker's without changing the linguistic content. VC has many important practical applications and is receiving more and more attention. Since the utterances of two speakers differ in many aspects, such as speech rate, duration, pitch, formant frequencies and speaking style, an ideal VC technique should take account of all of them. However, this is difficult in practice, because some of these features are difficult to calculate and some are difficult to convert. For this reason, many VC techniques focus on the transformation of spectral features and make only simple modifications to prosodic features such as f0.
The GMM-based statistical mapping technique proposed by Stylianou et al. [1] has been widely used to convert spectral features between different speakers. Techniques of this kind use a GMM to model the density of source cepstral vectors [1] or joint cepstral vectors [6]. The mapping function is a weighted summation of linear transformations, one for each Gaussian component, and the weights are calculated as the posterior probabilities of the source vectors. The parameters of the linear transformations are estimated by minimizing squared errors. The efficiency of GMM-based mapping and its advantages over other spectral conversion methods, such as mapping codebooks and artificial neural networks, have been demonstrated in many previous studies [1, 6, 2, 7]. However, a GMM only describes the density of frame vectors and cannot take account of contextual (dynamic) information. Although one can incorporate delta or delta-delta features into a GMM, these features still provide only local dynamic information. An HMM, on the other hand, is a density model for sequences, and its transition probabilities allow it to take account of the dynamics in speech. This paper studies an HMM-based mapping method for voice conversion. We derive the formulas for HMM-based sequence-to-frame mapping by statistical analysis. We use the least square error (LSE) and maximum likelihood (ML) criteria to estimate the parameters of the mapping function. We find that the LSE estimation has a closed-form solution, while the ML estimation leads to a nonlinear optimization problem. For this reason, we develop an EM-based algorithm for the ML estimation of the HMM-based mapping. We conduct experiments to examine the performance of LSE estimation and ML estimation for HMM-based voice conversion. The results show the usefulness of the proposed method.
HMMs have been applied to voice conversion in previous studies [3, 4, 5]. Kim et al. [3] introduced a hidden Markov VQ model for voice conversion, in which the mapping function is determined by the codebook and the optimal states of a source utterance. In contrast to this method, we use a standard HMM, and our mapping function is a weighted summation of several linear transformations. Duxans et al. [4] used HMMs to model the densities of source vectors and joint vectors, and estimated a linear transformation for each state of the HMM to convert an input utterance. In [5], Wu et al. proposed the duration-embedded DeBi-HMM for expressive voice conversion. Unlike the methods in [4] and [5], where the mapping functions depend only on the optimal states obtained by forced alignment, the mapping function of our method is derived by combining the linear transformations of different states, weighted by the posterior probabilities of the states. We hope that this 'soft' mapping function can partly deal with the problem of spectral jumps at segment boundaries resulting from forced alignment [3, 4].
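To make the 'soft' mapping concrete, the following Python/NumPy sketch shows one way it could be realized; it is an illustration under stated assumptions, not the derivation given in the paper. It assumes the mapping has the form y_t = sum_i gamma_t(i) (W_i x_t + c_i), where gamma_t(i) is the posterior probability of state i at frame t given the whole source sequence. The names state_posteriors, soft_map and fit_lse, and the symbols W_i and c_i, are hypothetical. The least squares fit further assumes that source and target frames have already been time-aligned, which this excerpt does not specify.

```python
import numpy as np

def state_posteriors(log_b, pi, A):
    """Scaled forward-backward pass.
    log_b: (T, N) log emission likelihoods, pi: (N,) initial probabilities,
    A: (N, N) transition matrix. Returns gamma (T, N) with
    gamma[t, i] = P(state_t = i | x_1, ..., x_T)."""
    T, N = log_b.shape
    # Per-frame rescaling of the emissions cancels when gamma is normalized.
    b = np.exp(log_b - log_b.max(axis=1, keepdims=True))
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = pi * b[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * b[t]
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (b[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

def soft_map(X, gamma, W, c):
    """Assumed soft mapping: y_t = sum_i gamma[t, i] * (W[i] @ x_t + c[i]).
    X: (T, d), gamma: (T, N), W: (N, d, d), c: (N, d)."""
    per_state = np.einsum('nij,tj->tni', W, X) + c[None]  # (T, N, d)
    return np.einsum('tn,tni->ti', gamma, per_state)

def fit_lse(X, Y, gamma):
    """Least squares fit of W and c for the assumed mapping above.
    X, Y: (T, d) time-aligned source and target frames (an assumption)."""
    T, d = X.shape
    N = gamma.shape[1]
    X1 = np.hstack([X, np.ones((T, 1))])  # append a bias column
    # Each row holds gamma[t, i] * [x_t, 1] for every state i, side by side.
    Phi = (gamma[:, :, None] * X1[:, None, :]).reshape(T, N * (d + 1))
    Theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    Theta = Theta.reshape(N, d + 1, d)
    W = Theta[:, :d, :].transpose(0, 2, 1)
    c = Theta[:, d, :]
    return W, c
```

Because gamma is computed from the whole sequence, each converted frame depends on all source frames, which is what distinguishes a sequence-to-frame mapping from a frame-to-frame one. Since the assumed model is linear in W_i and c_i once gamma is fixed, the fit reduces to an ordinary linear least squares problem, consistent with the closed-form LSE solution mentioned above.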
2. HMM-BASED VOICE CONVERSION
Voice conversion requires finding a mapping from an utterance of a source speaker to that of a target speaker. Let Y = F(X) denote the mapping function, where X and Y represent the speech sequences of the source and target speakers, respectively. Let X = [x1, x2, ..., xT], where xt (1 ≤ t ≤ T) represents a d-dimensional frame vector. However, finding a direct mapping between two sequences is very difficult, because a sequence usually contains a large number of elements and the lengths of the sequences X and Y can differ. Therefore, many researchers have reduced the sequence mapping to a frame-to-frame conversion problem, denoted by yt = f(xt).
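For contrast, a frame-to-frame mapping yt = f(xt) of the GMM-based kind described in the introduction can be sketched as below. This is a simplified illustration, not the exact formulation of [1]: it assumes diagonal covariances and a generic per-component linear transformation, and the names gmm_posteriors and frame_map, together with the parameters W and c, are hypothetical.

```python
import numpy as np

def gmm_posteriors(X, weights, means, variances):
    """Posterior p(m | x_t) of each diagonal-covariance Gaussian component.
    X: (T, d), weights: (M,), means: (M, d), variances: (M, d)."""
    diff = X[:, None, :] - means[None]  # (T, M, d)
    log_p = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                    + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_p += np.log(weights)
    log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def frame_map(X, post, W, c):
    """y_t = sum_m post[t, m] * (W[m] @ x_t + c[m]); uses frame t only."""
    per_comp = np.einsum('mij,tj->tmi', W, X) + c[None]  # (T, M, d)
    return np.einsum('tm,tmi->ti', post, per_comp)
```

Here the weights for frame t are computed from xt alone, so reordering the input frames simply reorders the output frames. No information from neighbouring frames enters the conversion, which is the limitation that motivates a sequence-based model.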


Download full report
http://ieeexplore.ieeeiel5/5487364/54948...er=5495141