
seminar class · 06-05-2011, 11:07 AM

ABSTRACT
Voice conversion can be reduced to the problem of finding a transformation function between the corresponding speech sequences of two speakers. Perhaps the most widely used voice conversion methods are GMM-based statistical mapping methods [1, 2]. However, the classical GMM-based mapping is frame-to-frame and cannot take account of the contextual information that exists over a speech sequence. It is well known that the HMM provides an efficient way to model the density of a whole speech sequence and has found great success in speech recognition and synthesis. Inspired by this fact, this paper studies how to use HMMs for voice conversion. We derive an HMM-based sequence-to-frame mapping function through statistical analysis. Unlike previous HMM-based voice conversion methods [3, 4, 5], which use forced alignment for segmentation and transform the frames aligned to a state with that state's associated linear transformation, our method has a soft mapping function that is a weighted summation of linear transformations. The weights are calculated as the HMM posterior probabilities of frames. We also propose and compare two methods to learn the parameters of our mapping function, namely least square error estimation and maximum likelihood estimation. We carried out experiments to examine the proposed HMM-based method for voice conversion.
Index Terms: Voice conversion, sequence-to-frame mapping, HMM, speech synthesis
1. INTRODUCTION
Voice conversion (VC) aims at transforming a speaker's voice to make it sound like another speaker's without changing the linguistic content. VC has many important practical applications and is receiving more and more attention. Since utterances of two speakers differ from each other in many aspects, such as speech rate, duration, pitch, formant frequencies and speaking style, an ideal VC technique should take account of all these aspects. However, this is difficult in practice: some of these features are difficult to calculate and some are difficult to convert. For this reason, many VC techniques focus on the transformation of spectral features, and only conduct simple modifications for prosody features such as f0.
The GMM-based statistical mapping technique proposed by Stylianou et al. [1] has been widely used to convert spectral features between different speakers. These techniques use a GMM to model the densities of source cepstral vectors [1] or joint cepstral vectors [6]. The mapping function is a weighted summation of linear transformations, one for each Gaussian component, and the weights are calculated as posterior probabilities of source vectors. The parameters of the linear transformations are estimated by minimizing squared errors. The efficiency of GMM-based mapping and its advantages over other spectral conversion methods, such as mapping codebooks and artificial neural networks, have been demonstrated in many previous studies [1, 6, 2, 7]. However, a GMM only describes the density of frame vectors and cannot take account of the contextual (dynamic) information. Although one can incorporate delta or delta-delta features into a GMM, these features still only provide local dynamic information. On the other hand, an HMM is a density model for sequences, and its transition probabilities allow it to take account of the dynamics in speech. This paper studies an HMM-based mapping method for voice conversion. We derive the formulas for sequence-to-frame mapping based on HMM by using statistical analysis. We use least square error (LSE) and maximum likelihood (ML) criteria to estimate the parameters of the mapping function. We find that the LSE estimation has a closed-form solution, while the ML estimation leads to a nonlinear optimization problem. For this reason, we develop an EM-based algorithm for the ML estimation of HMM-based mapping. We conduct experiments to examine the performance of LSE estimation and ML estimation for HMM-based voice conversion. The results show the usefulness of the proposed method.
HMM has been applied to voice conversion in previous studies [3, 4, 5]. Kim et al. [3] introduced a hidden Markov VQ model for voice conversion, where the mapping function is determined by the codebook and the optimal states of a source utterance. Different from this method, we use a normal HMM, and our mapping function is a weighted summation of several linear transformations. Duxans et al. [4] used HMM to model the densities of source vectors and joint vectors, and estimated a linear transformation for each state of the HMM to convert an input utterance. In [5], Wu et al. proposed a duration-embedded DeBi-HMM for expressive voice conversion. Unlike the methods in [4] and [5], where the mapping functions depend only on the optimal states obtained by forced alignment, the mapping function of our method is derived by combining the linear transformations of different states, weighted by the posterior probabilities of the states. We hope that this 'soft' mapping function can partly deal with the problem of spectral jumps at segment boundaries that results from forced alignment [3, 4].
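The 'soft' mapping described in the introduction can be illustrated with a small sketch. This is not the paper's derivation, which is not reproduced in this excerpt; it only assumes that the weights are the usual forward-backward state posteriors and that each state carries one linear transformation. To stay short it uses scalar features instead of d-dimensional frame vectors, and every name in it (state_posteriors, soft_convert, slopes, offsets) is hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def state_posteriors(xs, init, trans, means, variances):
    """Scaled forward-backward pass.

    Returns gamma, where gamma[t][i] = P(state at frame t is i | whole sequence xs).
    """
    T, N = len(xs), len(init)
    emit = [[gaussian_pdf(x, means[i], variances[i]) for i in range(N)] for x in xs]

    # Forward pass, normalising each frame to avoid numerical underflow.
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            row = [init[i] * emit[0][i] for i in range(N)]
        else:
            row = [
                emit[t][j] * sum(alpha[t - 1][i] * trans[i][j] for i in range(N))
                for j in range(N)
            ]
        c = sum(row)
        scale.append(c)
        alpha.append([v / c for v in row])

    # Backward pass, reusing the forward scale factors.
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(
                trans[i][j] * emit[t + 1][j] * beta[t + 1][j] for j in range(N)
            ) / scale[t + 1]

    # Combine and renormalise so each row is a proper distribution over states.
    gamma = []
    for t in range(T):
        row = [alpha[t][i] * beta[t][i] for i in range(N)]
        total = sum(row)
        gamma.append([v / total for v in row])
    return gamma


def soft_convert(xs, init, trans, means, variances, slopes, offsets):
    """Sequence-to-frame mapping: each output frame is a posterior-weighted
    sum of per-state linear transformations (slopes[i] * x + offsets[i])."""
    gamma = state_posteriors(xs, init, trans, means, variances)
    return [
        sum(g * (a * x + b) for g, a, b in zip(gamma[t], slopes, offsets))
        for t, x in enumerate(xs)
    ]
```

Because the weights for frame t depend on the whole sequence through the forward and backward passes, the output at a frame near a state boundary blends the transformations of the neighbouring states rather than switching abruptly, which is the behaviour the 'soft' mapping is hoped to provide.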
2. HMM-BASED VOICE CONVERSION
Voice conversion requires finding a mapping from an utterance of a source speaker to that of a target speaker. Let Y = F(X) denote the mapping function, where X and Y represent the speech sequences of the source and target speakers, respectively. Let X = [x1, x2, ..., xT], where xt (1 ≤ t ≤ T) represents a d-dimensional frame vector. However, finding a direct mapping between two sequences is very difficult, because a sequence usually contains a large number of elements and the lengths of the sequences X and Y can differ. Therefore, many researchers have reduced the sequence mapping to a frame-to-frame conversion problem, denoted by yt = f(xt).
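As a rough illustration of frame-to-frame conversion yt = f(xt), the sketch below applies a GMM-style mapping in which each frame is converted by a posterior-weighted sum of per-component linear transformations. It is a simplification under stated assumptions: features are scalars rather than d-dimensional vectors, the transformation for each component is a generic slope and offset rather than the exact parameterisation used in [1], and all function and parameter names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmm_posteriors(x, weights, means, variances):
    """Posterior probability of each Gaussian component given one frame x."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]


def convert_frame(x, weights, means, variances, slopes, offsets):
    """f(x): posterior-weighted sum of per-component linear transformations."""
    post = gmm_posteriors(x, weights, means, variances)
    return sum(p * (a * x + b) for p, a, b in zip(post, slopes, offsets))


def convert_sequence(xs, weights, means, variances, slopes, offsets):
    """Apply f to every frame independently; no context is shared across frames."""
    return [convert_frame(x, weights, means, variances, slopes, offsets) for x in xs]
```

Since convert_sequence looks at one frame at a time, reordering the input frames simply reorders the outputs. That independence is what makes the frame-to-frame formulation tractable, and it is also why such a mapping cannot use contextual information from the rest of the sequence.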

Download full report
http://ieeexplore.ieeeiel5/5487364/54948...er=5495141
