[API analysis] How to call the Microsoft Edge browser Read Aloud feature

1. Sources

  • GitHub: MsEdgeTTS, edge-TTS-record
  • 吾爱破解 forum thread: 微软语音助手免费版 (a free Microsoft voice assistant with multiple features, claimed as a first release)

2. Preparation

  • Feature under study: the Edge browser
  • Packet capture: Fiddler
  • Request replay: Postman

3. Analysis steps

  • Step 1: work out how Edge's Read Aloud feature can be invoked from JavaScript. Fiddler captured no traffic for these calls.
const voices = speechSynthesis.getVoices()

function speakbyvoice(text, voice) {
  var utter = new SpeechSynthesisUtterance(text)
  for (let v of voices) {
    if (v.name.includes(voice)) {
      utter.voice = v
      break
    }
  }
  speechSynthesis.speak(utter)
  return utter
}

speakbyvoice("hello world", "Xiaoxiao")
  • Step 2: capture traffic from edge-TTS-record instead. This caught one HTTP request and one WebSocket connection. Comparing them against the MsEdgeTTS source code shows:
/*
 * Reproduced successfully in Postman.
 * Fetches the list of available voices; equivalent to speechSynthesis.getVoices().
 * http url: =6A5AA1D4EAFF4E9FB37E23D68491D6F4
 * method: GET
 */
{
  uri: "",
  query: { trustedclienttoken: "6A5AA1D4EAFF4E9FB37E23D68491D6F4" },
  method: "GET"
}

/*
 * Reproduced successfully in Postman.
 * Opens a wss connection carrying the text and the audio data; equivalent to speechSynthesis.speak(utter).
 * wss url: wss://speech.platform.bing/consumer/speech/synthesize/readaloud/edge/v1?TrustedClientToken=
 * send: two messages - first the desired audio format, then the SSML text
 *       (generate a random request id by taking a GUID and removing the "-" separators)
 * receive: the webm audio bytes sit in the body of messages carrying the same request id;
 *          use "Path:audio\r\n" to locate where the body starts
 * Known issues:
 *   1. In the first (audio format) message, only webm-24khz-16bit-mono-opus connects successfully;
 *      any other format causes an immediate disconnect.
 *   2. The SSML message does not support the mstts namespace - this is a cut-down version of the
 *      Azure speech service, so markup such as xmlns:mstts="****", <mstts:express-as/>, <p/> and <s/> is rejected.
 */
{
  uri: "",
  query: { trustedclienttoken: "6A5AA1D4EAFF4E9FB37E23D68491D6F4" },
  sendmessage: {
    audioformat: `
X-Timestamp:Mon Jul 11 2022 17:50:42 GMT+0800 (China Standard Time)
Content-Type:application/json; charset=utf-8
Path:speech.config

{"context":{"synthesis":{"audio":{"metadataoptions":{"sentenceBoundaryEnabled":"false","wordBoundaryEnabled":"true"},"outputFormat":"webm-24khz-16bit-mono-opus"}}}}`,
    ssml: `
X-RequestId:7e956ecf481439a86eb1beec26b4db5a
Content-Type:application/ssml+xml
X-Timestamp:Mon Jul 11 2022 17:50:42 GMT+0800 (China Standard Time)Z
Path:ssml

<speak version='1.0' xmlns='' xml:lang='en-US'><voice name='Microsoft Server Speech Text to Speech Voice (zh-CN, XiaoxiaoNeural)'><prosody pitch='+0Hz' rate='+0%' volume='+0%'>hello world</prosody></voice></speak>`
  }
}

4. Writing the code

  • WebSocket library: WebSocketSharp. If the latest version fails to install, drop back to an older one; the latest prerelease at the time of writing is 1.0.3-rc11.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using WebSocketSharp; // NuGet package: WebSocketSharp (author: sta; version used here: 1.0.3-rc10)

namespace ConsoleTest
{
    internal class Program
    {
        static string ConvertToAudioFormatWebSocketString(string outputformat)
        {
            return "Content-Type:application/json; charset=utf-8\r\nPath:speech.config\r\n\r\n{\"context\":{\"synthesis\":{\"audio\":{\"metadataoptions\":{\"sentenceBoundaryEnabled\":\"false\",\"wordBoundaryEnabled\":\"false\"},\"outputFormat\":\"" + outputformat + "\"}}}}";
        }

        static string ConvertToSsmlText(string lang, string voice, string text)
        {
            return $"<speak version='1.0' xmlns='' xmlns:mstts='' xml:lang='{lang}'><voice name='{voice}'>{text}</voice></speak>";
        }

        static string ConvertToSsmlWebSocketString(string requestId, string lang, string voice, string msg)
        {
            return $"X-RequestId:{requestId}\r\nContent-Type:application/ssml+xml\r\nPath:ssml\r\n\r\n{ConvertToSsmlText(lang, voice, msg)}";
        }

        static void Main(string[] args)
        {
            var url = "wss://speech.platform.bing/consumer/speech/synthesize/readaloud/edge/v1?trustedclienttoken=6A5AA1D4EAFF4E9FB37E23D68491D6F4";
            var Language = "en-US";
            var Voice = "Microsoft Server Speech Text to Speech Voice (zh-CN, XiaoxiaoNeural)";
            var audioOutputFormat = "webm-24khz-16bit-mono-opus";
            var binary_delim = "Path:audio\r\n";
            var msg = "Hello world";
            var sendRequestId = Guid.NewGuid().ToString().Replace("-", "");
            var dataBuffers = new Dictionary<string, List<byte>>();

            var webSocket = new WebSocket(url);
            webSocket.SslConfiguration.ServerCertificateValidationCallback = (sender, certificate, chain, sslPolicyErrors) => true;
            webSocket.OnOpen += (sender, e) => Console.WriteLine("[Log] WebSocket Open");
            webSocket.OnClose += (sender, e) => Console.WriteLine("[Log] WebSocket Close");
            webSocket.OnError += (sender, e) => Console.WriteLine("[Error] error message: " + e.Message);
            webSocket.OnMessage += (sender, e) =>
            {
                if (e.IsText)
                {
                    var data = e.Data;
                    var requestId = Regex.Match(data, @"X-RequestId:(?<requestId>.*?)\r\n").Groups["requestId"].Value;
                    if (data.Contains("Path:turn.start"))
                    {
                        // Start-of-turn signal; nothing to do.
                    }
                    else if (data.Contains("Path:turn.end"))
                    {
                        // End-of-turn signal; safe to close the socket here.
                        // dataBuffers[requestId] = null;
                        // Do not copy the line above from MsEdgeTTS: after the audio has been sent,
                        // one final text message still arrives to mark the end of the audio.
                        webSocket.Close();
                    }
                    else if (data.Contains("Path:response"))
                    {
                        // Context response; nothing to do.
                    }
                    else
                    {
                        Console.WriteLine("unknown message: " + data); // should normally never happen
                    }
                }
                else if (e.IsBinary)
                {
                    var data = e.RawData;
                    var requestId = Regex.Match(e.Data, @"X-RequestId:(?<requestId>.*?)\r\n").Groups["requestId"].Value;
                    if (!dataBuffers.ContainsKey(requestId))
                        dataBuffers[requestId] = new List<byte>();
                    if (data[0] == 0x00 && data[1] == 0x67 && data[2] == 0x58)
                    {
                        // Last (empty) audio fragment - marks the end of the audio stream.
                    }
                    else
                    {
                        var index = e.Data.IndexOf(binary_delim) + binary_delim.Length;
                        dataBuffers[requestId].AddRange(data.Skip(index));
                    }
                }
            };

            webSocket.Connect();
            var audioconfig = ConvertToAudioFormatWebSocketString(audioOutputFormat);
            webSocket.Send(audioconfig);
            webSocket.Send(ConvertToSsmlWebSocketString(sendRequestId, Language, Voice, msg));
            while (webSocket.IsAlive) { } // busy-wait until the server closes the connection
            Console.WriteLine("Received audio byte length: " + dataBuffers[sendRequestId].Count);
            Console.ReadKey(true);
        }
    }
}
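The per-frame header splitting that the OnMessage handler performs can also be sketched on its own: find the "Path:audio\r\n" delimiter in a binary frame and keep only the bytes after it. A minimal Node.js sketch (the `extractAudio` helper is hypothetical, for illustration only):

```javascript
// Everything after the "Path:audio\r\n" header line is audio payload.
const DELIM = Buffer.from("Path:audio\r\n");

function extractAudio(frame) {
  const i = frame.indexOf(DELIM);
  if (i === -1) return null; // this frame carries no audio body
  return frame.subarray(i + DELIM.length);
}

// Example: a fabricated frame with text headers followed by 4 audio bytes.
const frame = Buffer.concat([
  Buffer.from("X-RequestId:abc123\r\nContent-Type:audio/webm\r\nPath:audio\r\n"),
  Buffer.from([0x1a, 0x45, 0xdf, 0xa3]),
]);
console.log(extractAudio(frame)); // the 4 audio bytes
```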

5. Conclusion

The simulated WebSocket request works. The drawback, as the Postman tests showed, is that the audio outputFormat parameter only works as webm-24khz-16bit-mono-opus, so a library such as ffmpeg is still needed to convert the audio into other formats. I have not yet found a convenient library for that, so I am recording my progress here for now.
