
Taking a critical look at market and technology development around the enterprise space.


ellementK: (ĕll'ǝ-mǝnt-kā) noun - A fundamental, essential, or irreducible constituent of a composite entity. Middle English, from Old French, from Latin elementum. In this case, also related to the modern French mentir, to lie. (adapted from Dictionary.com)


About Eleanor Kruszewski: I'm known variously as Eleanor or Elle. My last name is like that coach from Duke - kru-shef-ski.

Based in Menlo Park, CA, I work for Yahoo! in their Developer Network. The easiest description of what I do is the MBA shin kicker, handling community, marketing, commercial programs and sundry backend stuff.

Disclaimer: I've done big corps, midcorps, and startups, so I overstate and oversimplify as much as anyone else. These opinions are my own, not my employer's.


Transcript of Jon Udell podcast on IVR

Transcript of Monday’s podcast on IVR applications, a conversation between InfoWorld columnist Jon Udell and Ron Owens, director of software application engineering and professional services for Intervoice (audio here).

  • Owens – Well, essentially what we do is, you can think of it as self-service automation primarily accomplished through voice recognition. So if you look at companies that utilize websites or other things in terms of a self-service strategy, our goal is to make voice an additional channel to that self-service automation so that clients can use these systems to gain access to information, conduct business, complete transactions.
  • Udell – Right, so basically the IVR story. So why don’t we start with applications, because these discussions can tend to get a bit abstract. I’m just going to read off the ones you listed as packaged applications. This one is interesting, you have something you call an identifier, which is evidently an IVR-based password changing application. Is that correct? Why don’t you talk about how that works, and what’s happening on the back end to enable that?
  • Owens – So we’ve tried to create a few applications that are geared to a horizontal market, that are reusable, and that allow the interface into client databases. Essentially they go through a series of questions to identify who the caller claims to be and then authenticate, via either knowledge or speaker verification, whether they are who they claim to be.
  • Udell – So when you say “speaker verification”, is the point here that this is multifactor authentication and one of the factors is verification including voiceprint?
  • Owens – Yes
  • Udell – Now is this in use anywhere now? I’ve never run into this, but it’s happening out there in the world someplace. Where is this being used?
  • Owens – It’s actually used at Ameritrade, and we have done press releases with them so we can talk about it.
  • Udell – Oh, because I actually have an account at Ameritrade, which I have not visited for a long time. Now if I go there, where would I find this feature?
  • Owens – They actually implemented it when they purchased Datek. So I’ll be honest, I’m an Ameritrade client, but I am an old Ameritrade client, and I have never accessed it via the telephone either, but I believe it was given as an option to the Datek customers that were converted and they went through an enrollment process. It’s one of the largest customer-facing verification deployments that’s available.
  • Udell – Now the voice print is subject to replay attack if it’s captured, right?
  • Owens – Actually, the voiceprint itself, what gets created is not anything that exists in terms of a wav file that can be reverse engineered. It’s just a series, it’s a numerical representation that’s housed in a database. You could break in and steal all of the voiceprints from any company that’s out there and you have something that’s absolutely worthless.
  • Udell – OK, but if I speak the magic phrase on one hand, or if I play you a recording of the magic phrase on the other hand, you, listening, aren’t going to know the difference.
  • Owens – Well, it depends on how you capture the recording.
  • Udell – OK well, let’s talk about that a little, because that’s interesting.
  • Owens – Well, if you capture the recording with a high-quality microphone and it’s a digital recording that you play back over the phone, you might be able to spoof the system, but sometimes they’re still able to detect fraud because some of the systems say ‘this is too close to the original, so it must be a fake’. A certain amount of variance is expected each time we speak a word.
  • Udell – Interesting, interesting. We’ve sort of gone down a rat hole - this is extremely interesting, but I want to back up and get the larger context as well. So can we broaden this out to the kinds of services that your platform makes available to developers, the kinds of tools that are used to build these IVR applications, and the environment that all this fits into in terms of application servers and standards like SALT and VoiceXML?
  • Owens – So to start at a platform level, we have platforms that support VoiceXML 2.0 and SALT standards, the Microsoft speech server. From a development environment…
  • Udell – Well, can I just… probably most people are as vaguely familiar with those two things as I am so… I mean, to prepare for this call I went back to refresh my memory as to what’s going on.
  • Owens – That doesn’t entitle you to ask hard questions, though (laughs). So can you explain?
  • Udell – Well it seems like we can’t have anything in this century without a war between two competing standards, right? And it’s almost ludicrous, it takes you 5 minutes just to get to the members’ page of the SALT Forum on one hand, and the members’ page of the VoiceXML Forum on the other hand. And the icons tell the story. In fact, I’m on those pages now, and I’m looking at the SALT Forum and see Microsoft, Cisco, Intel, Philips, ScanSoft and Comverse. And on the VoiceXML side – AT&T, Lucent, IBM, Motorola, HP, Oracle, Verizon, and a couple of others. What jumps out at me is hmm… phone companies versus not. Not knowing anything about the politics, obviously you guys have to play both sides of the fence, but I’m curious if it’d be possible for you to condense it all into a brief overview.
  • Owens – It’s possible for me to give you my opinion of the whole situation and where it stands. So, the first thing that I try to tell clients is that there was a lot of publicity on VoiceXML a couple of years ago and how it was going to revolutionize voice automation and that you had to have VoiceXML to do voice recognition and a lot of other things. The bottom line is all of those justifications for doing it were hyped. Because we actually had speech recognition systems literally years before the first VoiceXML companies were in existence, and some very sophisticated systems too. What VoiceXML as a standard did do is introduce some application portability and extension outside of the call center space, moving into the IT infrastructure space – how to make it available and accessible in visual media, how to extend web pages into new channels – how do you connect self service on the internet with the telephone – how people want to get services. SALT came in when VoiceXML was getting near 2.0. VoiceXML was trying to do for voice what HTML did for the web [make it usable and presentable]. SALT was even more explicit, to make it multimodal [allow interaction with web pages via multiple modes]. SALT is less of a linear programming language than VoiceXML, it’s much more like an XML namespace that can be woven into XML or HTML documents [therefore SALT is more embeddable into existing tools and infrastructure]. VoiceXML has much more of its own syntax and own tool environment.
    Intervoice has its own VoiceXML browser, conformant to the VoiceXML specifications, which can load any VoiceXML file whether it was developed with Intervoice’s tool or someone else’s. The browser interprets the file and turns it into instructions for the IVR. The same thing happens on the SALT side – Microsoft provides the SALT browser here, but the underlying technology is the same, just the parsing technology comes from Microsoft. In both cases the document is turned into neutral output for consumption by the IVR software. Intervoice has its own IVR, which is both a virtual and a physical machine that drives the IVR functions. Microsoft’s SALT browser is the equivalent of IE: it takes SALT documents and converts them into something that is digestible by voice [IVR].
  • Udell – To what extent is it possible to build an application once for multimodal deployment? Can you write web pages that can be expressed vocally with one code base? Does it meet Microsoft’s vision of SALT?
  • Owens – Not now; not many consumer devices are ready to take this input as multimodal. We don’t have phones where you browse to an item and then speak commands while browsing the web [use both a cursor and a voice command in an arbitrary manner]. It’s still Microsoft’s vision, but they’ve realized that people are going to start with using simple speech recognition first and as consumer devices catch up to take advantage of the functionality, then they can build and extend out to more multimodality.
  • Udell – What does supporting Microsoft Speech Server mean for you?
  • Owens – Basically we sell it as a solution – Microsoft Speech Server and our system work together. We can also use ScanSoft ASR and Nuance ASR for speech recognition technology. On its own, Intervoice’s system will give you touchtone (DTMF) interactivity, but no voice. They do not have their own ASR engine, but [that’s ok since they are enabling the first real commercial applications] – speech server is theoretical until you have Intervoice making it a commercially viable deployment. They make it into an application companies can use.
  • Udell – What does your product really add?
  • Owens – Speech servers do not connect to telephony on their own. They won’t take a call, can’t handle calls, their progression through the system, transfers, CTI – we provide the telephony or IP infrastructure. [In their scenarios] the application is driven from the application server, which calls into the Intervoice server. Intervoice brings in speech recognition as needed, sending the utterance off to the ASR engine for processing, and then handling the returned response and the general infrastructure.
  • Udell – I’m trying to understand the architecture. Are pictures or a schematic available on your website?
  • Owens – Probably not, we’ll have to talk to the marketing people.
  • Udell – Let’s talk about applications. Tell me about the Amtrak system. What is the architecture?
  • Owens – Calls come through the call center and hit the switch. Then they’re greeted by the IVR system and the application walks the caller through getting the info they need. The speech recognition component is ScanSoft ASR running on a separate server. The call flow is mapped out with prompts, responses are captured, the transaction goes against the database to get the info required [the schedule], then the data is transformed into concatenated speech and played back for the caller. This is built with Intervoice’s tool called Envision, which can develop code either in their legacy proprietary environment or as VoiceXML-standard compliant code. It’s GUI-based with drag-and-drop tools and pre-built components – things like dates and digits are already there to help you focus on the logic flow.
  • Udell – If you wanted to reduce the GUI model to text, can you? [I have no idea why this is important]
  • Owens – Yes, the system does output VoiceXML as text that you can view, after it’s been generated. If you’re using Microsoft’s Speech Server, it can be output in VB.NET, ASP, C# out of Visual Studio with Speech Server modules.

    Intervoice offers professional services to help create usable and effective voice implementations. It’s tricky to create a model that works for voice, not just a voice reading all the options on a given website. The mode of interaction determines what info should be presented and how. We resell ASR products, and provide a layer of integration.
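
The Amtrak call flow Owens describes (prompt, capture the response, look up the schedule, play back concatenated speech) can be sketched roughly as below. This is only an illustration in Python, not what Envision generates: the schedule data, the `recognize` stand-in for the ASR engine, and every function name are hypothetical.

```python
from typing import List, Optional, Set

# Hypothetical schedule data standing in for the back-end database.
SCHEDULE = {
    ("boston", "new york"): ["6:10 AM", "9:15 AM", "1:40 PM"],
}


def recognize(utterance: str, grammar: Set[str]) -> Optional[str]:
    """Stand-in for the ASR engine: accept the utterance only if the grammar allows it."""
    text = utterance.strip().lower()
    return text if text in grammar else None


def lookup_departures(origin: str, destination: str) -> List[str]:
    """The 'transaction against the database' step."""
    return SCHEDULE.get((origin, destination), [])


def build_playback(origin: str, destination: str, times: List[str]) -> List[str]:
    """Concatenated speech: fixed prompt fragments interleaved with dynamic values."""
    segments = ["trains from", origin, "to", destination, "depart at"]
    for i, departure in enumerate(times):
        if i > 0 and i == len(times) - 1:
            segments.append("and")
        segments.append(departure)
    return segments


def handle_call(utterances: List[str]) -> List[str]:
    """Walk one caller through the flow; fall back to a live agent on any failure."""
    cities = {city for pair in SCHEDULE for city in pair}
    origin = recognize(utterances[0], cities)
    destination = recognize(utterances[1], cities)
    if origin is None or destination is None:
        return ["transfer to agent"]
    times = lookup_departures(origin, destination)
    if not times:
        return ["transfer to agent"]
    return build_playback(origin, destination, times)
```

Calling `handle_call(["Boston", "New York"])` returns the list of segments to play in order, while an unrecognized city routes the caller to an agent. A real deployment would also handle retries, confirmations and barge-in, which are left out here.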

  • Udell – I’m trying to understand what specific value you bring…
  • Owens – Technically you need to control the telephony card, which our products do. If you didn’t have us, you’d need to build that to allow transfers, passing protocols, passing instructions to flow commands to CTI. You’d need to build a monitoring system to watch the history of calls and the overall flow. On the services end, we provide the expertise to move through [connect] business goals, translate them into a call flow, and create the backend transaction model. We have the business logic and business rules to build voice applications for customers [that work…. Udell really didn’t get the value of the company at this point, and Owens was not doing a good job pitching it; as you will see, the conversation takes a much different turn in the next exchange]. We take data out of the enterprise systems our customers have and link it into the process to create a full voice solution.
  • Udell – Looking through my list, we’ve talked about a lot. What have we not talked about yet?
  • Owens – Usability. Whether it’s usable or not determines the success of the project. [The voice of] “Amtrak Julie” was a deliberate product, constructed with impact and branding in mind – her characteristics, personality, age, demographics, and nuance. You must have the voice and the text and the script and the words match the expectations of the callers when they call. You can read a lot in the area of “persona development”. The key factors determining [the appropriate] voice are the customer age demographics, the formality of the prompts, selecting the gender, and trying to convey something consistent with the image of the company. This is part of the design process we take our [professional services] customers through. Our voice user interface design engineers take the goals of the business – typical ones are to get the shortest length of call possible, the highest possible automation rate, and the highest possible customer satisfaction – and balance that with what end users need – enough instruction to feel comfortable (which may lengthen the call), easy access to a rep if the call doesn’t fit what the self-service system is designed to handle, but not so easy that people won’t try to use the system – customers want to keep the caller satisfaction level up but don’t want them to feel trapped in the system. Intervoice’s engineers take the end user goals and the organization goals and try to strike a balance. That’s what creates the personality of the system, how the scripting is done and how the logic flows. This gets tuned over time, since, like a web page, the system is tracked as people use it.
  • Udell – What are the kinds of opportunities for packaged applications that haven’t been seen in the past but are becoming possible going forward?
  • Owens – What we’re trying to find is an area where… if you think about how Microsoft products work, if you’re working in Excel, PowerPoint or Word, you look for consistency of tools – File, Edit, Save, etc. That’s our definition of a horizontal function – it is always the same process to accomplish common goals. If you look at voice, horizontal applications are where companies don’t find a competitive advantage in being the world’s best. Like in changing your address, there’s no market there to differentiate yourself. The reverse is actually true – having the same process across all, say, credit card companies, is a benefit. Customers can learn over time that there is a common way to go about this common task, even though it is with different companies. Utilities and magazine subscription companies have this [address change] problem too. Familiarity will drive increased acceptance of automation and successful task completion. Then this is not just for address changes, but things like identity verification, password resets, things that cross multiple industries. Things for which voice is an appropriate, or even the preferred, medium.
  • Udell – When is voice the preferred medium for data-driven interaction?
  • Owens – It’s very interesting – all the data I’ve read shows there is no end to transactions that people want to complete over the phone. It’s kind of like the debit card phenomenon resulting in the death of checks – there are fewer checks going through financial institutions, but banks spend a lot of money on the increased level of financial transactions [driven by debit cards]. A net balance. The internet and phone are not so much substitutes, because, depending on circumstances, you might choose one or the other no matter what your preferences are. But really what that means is that companies have to deal with more interactions with their clients. Clients have more ways to interact and it’s our (Intervoice’s) job to make sure that they interact successfully and profitably for the organization.
  • Udell – This reminds me of earlier when we were talking about multimodality with Microsoft’s approach with Speech Server. I want to circle back on one point. In addition to the possibility of being able to interact with a web page through direct manipulation or speech, there is this underlying notion that you’d like to develop an interface once, one interface, that would be at least minimally accessible by phone. The web may be preferable, but it would be possible to interact with it by phone if necessary. To what extent is it possible to provide, in a relatively automatic way, at least minimal access to functions on a web page out of a development environment that is not dual-track (you don’t have to build it twice)? Developing it twice is very hard; every time you do it two different ways it’s a huge impediment unless you have an extreme motivation [i.e., because of the cost, effort and time of maintaining two applications].
  • Owens – I’m going to take a slightly different view, and I may be biased, but I draw a different demarcation from the one you did. At the application server level, when the data is being served up to be presented to a web page, at that point, that data in that format can be presented to a voice system – that’s the demarcation of commonality. And I think that companies who try to skimp, either on web page development to make it a very clean and user-friendly presentation layer via the web, or on a very clean and aesthetically pleasing voice interface, if they try to skimp at that presentation layer – they’ve made a critical and strategic error. Where they want to make the investment in the commonality is to make that data available. Once that data’s available, if it’s in an XML format – the cost of the presentation layer relative to that infrastructure is minimal.
  • Udell – The point is well taken, and is actually the point I’ve been making, for, well, years now, which is that ultimately it is just a historical accident that we have so much presentation logic that is directly intertwined with the production of HTML. The point of the web services philosophy is to separate those two things, that in fact, in my view, if everyone could just flip a switch tomorrow, we’d all be better off if every web-facing application were encapsulated as neutral XML which is then rendered by HTML production technology, and which could also be rendered by voice production technology.
  • Owens – Exactly. It’s no different than the spaghetti code in COBOL we used to do years ago. Theoretically, we should have made all those host interactions in a modular form that could have been called from multiple applications, even though there’s no analogous presentation layer the principle is the same. Did we violate that? Regularly. We embedded the access and the data into whatever we were trying to accomplish. And it always has and always will cost companies millions. We’re at the point now where standards themselves don’t solve problems for companies, the implementation of standards can help accelerate a lot of things and help companies to be efficient and take advantage of it, but VoiceXML infrastructure for some clients won’t be the fix unless they do exactly what you said. And that takes some time to figure out how am I going to get the data and share that data across multiple channels.
  • Udell – Nothing new here. Just common sense.
  • Owens – Common sense that’s incredibly uncommon!
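
The separation both speakers land on, neutral XML from the application server with thin presentation layers on top, might look something like the following. It is a minimal sketch in Python under stated assumptions: the `<schedule>` document format and all function names are invented for illustration, and the voice renderer emits only a bare VoiceXML 2.0 skeleton (`vxml`, `form`, `block`, `prompt`).

```python
import xml.etree.ElementTree as ET

# Hypothetical neutral XML as an application server might serve it.
NEUTRAL_XML = """
<schedule origin="Boston" destination="New York">
  <departure>6:10 AM</departure>
  <departure>9:15 AM</departure>
</schedule>
"""


def parse_schedule(document: str) -> dict:
    """Shared step: both channels read the same neutral data."""
    root = ET.fromstring(document)
    return {
        "origin": root.get("origin"),
        "destination": root.get("destination"),
        "departures": [d.text for d in root.findall("departure")],
    }


def render_html(data: dict) -> str:
    """Web presentation layer: a heading plus a list of departure times."""
    div = ET.Element("div")
    heading = ET.SubElement(div, "h1")
    heading.text = f"{data['origin']} to {data['destination']}"
    ul = ET.SubElement(div, "ul")
    for departure in data["departures"]:
        ET.SubElement(ul, "li").text = departure
    return ET.tostring(div, encoding="unicode")


def render_vxml(data: dict) -> str:
    """Voice presentation layer: one spoken prompt inside a VoiceXML form."""
    vxml = ET.Element(
        "vxml", {"version": "2.0", "xmlns": "http://www.w3.org/2001/vxml"}
    )
    block = ET.SubElement(ET.SubElement(vxml, "form"), "block")
    prompt = ET.SubElement(block, "prompt")
    prompt.text = (
        f"Trains from {data['origin']} to {data['destination']} "
        f"depart at {', '.join(data['departures'])}."
    )
    return ET.tostring(vxml, encoding="unicode")
```

Both `render_html(parse_schedule(NEUTRAL_XML))` and `render_vxml(parse_schedule(NEUTRAL_XML))` consume the same parsed data, so adding a channel means adding a renderer rather than rebuilding the application. As Owens cautions, a good voice interface still needs its own design; the renderer here shows only where the shared boundary sits.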

#end#

This entry was posted on Thursday, December 23rd, 2004 at 4:10 pm and is filed under Emergent.




This work is licensed under a Creative Commons License.
