Tumbled Logic

Dec 22 2009

From the archives: Building an Internet TV platform

The below is lifted from a couple of posts I wrote on my old blog (and didn’t bother to migrate) back in April 2007. Some may find them interesting, mostly it’s just brain-dump stuff. (I got caught up in other things and didn’t ever write Part Four).

Part One

You can say ‘Internet TV’ to ten different people and get ten (in fact, probably more than ten) suggestions back as to what you might mean by it. Does it mean ‘using IP to transport cable TV’? Does it mean ‘distributing user-generated content’? Does it mean ‘distributing broadcast television programming over the Internet (legally or otherwise)’? Does it mean ‘utilising open Internet standards to distribute television programming’?

The answer is ‘yes’, which doesn’t help anybody a whole lot. What’s interesting about this is that it indicates there’s a lot of give in the Internet TV arena, and that people en masse don’t readily identify with one particular technology, never mind a specific product or service. My take on this is that an Internet TV platform probably has to cover at least the majority of the bases in order to succeed.

This is where Joost fails. Lots of people focus on the Joost client, which I’m starting to think is perhaps a little short-sighted. With Skype, the big barrier to VoIP adoption was a decent user agent, and Skype provided that. There aren’t actually many things the majority of users want to do with a VoIP platform (although there are plenty of deal-breakers for the minority), so the fact the protocol was largely secret and there was no interoperability with other VoIP networks wasn’t a big deal for anybody outside of the industry. Tell a Skype user that Skype is flawed because there’s a limit to the peer review of the security, that if Skype goes out of business the whole network will collapse, that there’s so much more to VoIP than the facilities offered by Skype, and they won’t care. “I’m not passing national secrets! eBay bought Skype out. Do you think eBay are going out of business any time soon? Who cares about the extra features? My brother can call me on a local number, even when I’m away on holiday in New Zealand! Better still, I can call my friends when I’m in New Zealand as though I were still at home.” Skype disrupted, because it provided what people wanted, even if it did it in a way which—from a technical perspective, at least—was less than ideal. By shunning open standards they gained themselves a business model. You want to do PSTN breakout? Only one place you can get that: Skype itself. Skype didn’t just disrupt, but it disrupted an industry which was itself considered to be inherently disruptive.

Joost is different, though. People are embracing Internet TV right now and have been—in some form or another, and on a big scale—for a long time. In real terms, what does Joost give you? A nicer UI than YouTube, with a load of content that most people are likely only watching because it’s there. There’s some good stuff being distributed via Joost, but there’s not very much of it. The UI is not quite intuitive (and buggy, and—depending on who you’re talking to—slow and laggy to the point of being unusable); the protocols are completely closed; there’s no room for user-generated content in any of the senses we’ve grown used to; and most importantly: it’s just not as good as broadcast television (in fact, it doesn’t even come close).

I’m not saying these things couldn’t be fixed, but I doubt any of them will be by the time Joost 1.0 is released. On the one hand, Joost have built a user interface that’s squarely aimed at full-screen, possibly hooked up to the TV, use. On the other, there’s no compelling reason to actually do that instead of tuning into broadcast content on your DVB, cable or satellite box. The only part of the package that Joost has got right is the abandonment of schedules, and even that isn’t suited to all of the people all of the time.

Of all of the things we associate with ‘Internet TV’ right now, Joost doesn’t really tick many boxes. There’s no open API (not yet, at any rate). You can’t upload your own programming. You can’t watch the broadcast TV programme that you missed last night because you were out. You can’t even distribute the TV programme you recorded last night so that others who did miss it can watch it. For all the DRM and secrecy, Joost doesn’t actually allow people to do anything they can’t already do in a less restrictive (but varyingly less legal) fashion. Joost doesn’t disrupt.

The buzz at the moment is that Joost has signed up ABC as a content provider. ABC gives us shows like Grey’s Anatomy, Desperate Housewives, and Lost. This is good news for Joost, and that’s an understatement. What people forget is that Joost already has region controls. Right now, there are channels on Joost which are US-only. What makes anybody think that the ABC deal would be any different? My money’s on ABC’s flagship shows only being available within the United States, and if they go beyond that, it’ll likely only be older episodes. An open platform doesn’t solve the content issue, of course, but it does make it easier to solve. It makes it easier to get content providers on board, because anybody can do it, not just the people who run the network.

Then there’s the advertising. “If the platform’s open, people will skip the adverts”.

Yes. So? Here’s the real deal on advertising, though: good adverts get remembered. People will watch them. They’ll want to go back and watch them again. They’ll tell their friends. They’ll blog about them. Good adverts go viral overnight. Remember when Honda’s ‘Cog’ was first aired, during the F1 Grand Prix? The buzz about that advert was immense. Sony’s Bravia advertising team blogged about the production of the advert they shot in Glasgow, and everybody knew about it, and lots and lots of people wanted to see it.

Bad adverts, though, people will skip. People will skip them within a fraction of a second of them appearing. Mediocre adverts? Some will skip, most won’t. Think about it, though: what happens if you apply all that ‘user-centric’ theory to advertising? Let people rate the ads. Publish real-time statistics about the popularity of adverts. Was the new Sony or Orange advert well-received? Were people sick of yet another Capital One advert? Was there an unexpected underground cult following behind an ad (think: Cillit Bang)?

If the platform’s open, it’s possible to go beyond the original intentions. You start off with a desktop client, letting you stream, download and rate video. Next, you find out that some technology firm you’ve never heard of has produced a set-top box that connects to the platform you’d created and makes all of it available without a PC.

Right now, and probably for some time to come, Joost doesn’t make any of that possible (and yes, Joost’s adverts suck). It might do, in future, but none of us really knows.

If I get as far as following up this post, I’ll describe how I’d build such an open system.

Part Two

The requirements:

  • Allow user-generated as well as network-prescribed content
  • Use open standards and protocols where appropriate
  • Not be ultimately reliant on a single company for longevity
  • Be inherently cross-platform
  • Allow the broadcast of live (or near-live) streaming content as well as on-demand stored programming

In effect, this gives us a sort of combination of Joost, Democracy, Veoh, and a whole host of other applications and platforms, not least the likes of uStream, YouTube and iTunes. Where do you begin?

Let’s start with the low-level transports, and work upwards. Note that I haven’t designed this hypothetical system: this is an exercise as much for me as it is for anybody else; what you’re reading here is mostly off the top of my head, so if it’s coherent and makes sense, it’s a testament to my thought processes rather than hours spent planning anything.

A well-put-together Internet TV platform really needs to be able to source content in a variety of different ways, and from an array of sources. Joost’s hybrid peer-to-peer model is, in my view, one that has the most merit. I’d suggest designing a system such that the client is capable of:

  • Connecting to designated servers either via unicast or multicast IP
  • Using multicast DNS (e.g., Apple’s Bonjour) to locate peers on the local network and behave intelligently with them
  • Being able to use a combination of peer-to-peer protocols alongside more “classical” distribution protocols (e.g., RTP/RTSP and HTTP), selecting the most appropriate protocol for the peer and local network conditions (e.g., firewalling and NAT).
  • Separating control channels from content distribution channels, such that the control channels are effectively generic session management conduits between peers or between a client and a server.
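As a rough illustration of the "selecting the most appropriate protocol" point in the list above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `NetworkConditions` fields, the transport labels, and the ordering rules are invented, not part of any real client.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NetworkConditions:
    """Hypothetical summary of what the client has learned about its network."""
    multicast_available: bool  # can we join multicast groups at all?
    behind_nat: bool           # are inbound connections likely to be blocked?
    peers_on_lan: int          # peers discovered via multicast DNS


def choose_transports(cond: NetworkConditions) -> List[str]:
    """Return transport labels in the order the client should try them."""
    order: List[str] = []
    if cond.peers_on_lan > 0:
        order.append("p2p-lan")        # nearby peers are cheapest to use
    if cond.multicast_available:
        order.append("rtp-multicast")
    if not cond.behind_nat:
        order.append("p2p-wan")        # assumed to need inbound connectivity
        order.append("rtp-unicast")
    order.append("http")               # assumed to work almost anywhere
    return order
```

A client stuck behind NAT with no multicast and no local peers would end up with `["http"]` only, which is the "classical" fallback the list describes.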

With that in mind, I’d use SIP as the basis for the network. Although SIP has had its fair share of problems, the requirement for interoperability in this scenario is relatively limited: as part of the production of this system, it would be entirely feasible to publish details of a specific subset (or clarification of behaviours) of SIP which should be used. SIP has a lot of advantages, too: it’s pretty well-understood, it’s lightweight, it can be transported over pretty much any low-level protocol (e.g., TCP, UDP, SCTP), and there are lots of implementations of SIP clients and servers out there right now, although many are aimed specifically at VoIP applications.

Building upon SIP, there are a few basic operations that we’d want to perform:

  • Negotiating with a peer or a server regarding available content and supported distribution methods
  • Establishing data channels for the transfer of content
  • Obtaining lists of nearby peers from a server
  • Establishing data channels for the retrieval of non-video content (e.g., overlays, channel information, interactive features, subtitling/closed captioning)
  • Transporting non-video-related sessions (such as instant messaging)
  • Establishing unicast or multicast streams of near-real-time broadcast content
  • Authenticating with a server or peer in order to access features

Of these, SIP (and existing protocols built upon SIP) can already deal with a reasonable amount of the required functionality: SDP over SIP is used for establishing RTP sessions between peers in SIP-based VoIP networks. SIMPLE provides instant messaging facilities to a SIP session. SIP already handles authentication, in a very similar fashion to HTTP. SIP could, without difficulty due to its HTTP-like open-ended structure, be extended to allow for the remainder of our requirements, although SIP wouldn’t of course be actually transporting the content itself.

At this point, there’s a little less detail that we can draw on in terms of existing protocols. My inclination would be to utilise SIP in much the same way that HTTP is used today on BitTorrent networks to contact a tracker: the client makes a simple request to the tracker, and the tracker returns a list of peers hosting the required content. There’s no technical reason why such requests couldn’t be made over SIP in our system.

If this sounds like a heady combination of SIP-controlled RTP/RTSP streaming and SIP-controlled BitTorrent, you wouldn’t be far wrong. The intelligence in this respect isn’t really down to the protocols (although getting the protocols right makes a lot of difference), but how they’re used. How they’re used precisely depends upon what we’re trying to get from a server or peer at a given instant.

Next time, I’ll discuss content distribution strategies, the balance between peers and servers, and why “live” channels probably won’t be: in other words, how the SIP-controlled protocols will work together to get the TV channels to the client.

Part Three

In part 2, I discussed some of the low-level protocols that I intended to use for this system. In this post, I want to talk about how we’ll be using them.

I find it’s helpful to go over the logical entities involved in the system, their relationships, and how an end-user actually obtains references to them.

Starting from the top, we have a channel catalogue: in simple terms, a list of channels. A client would ship pre-configured with the URIs of the network’s channel catalogues, or alternatively a user could enter them manually (much as users can add their own podcast URIs to clients such as iTunes or Democracy). My gut instinct is that channel catalogues are just RDF, listing the source URIs of each provided channel: each catalogue and channel should be identified by a UUID (and where multiple sources exist for the same channel or catalogue, the UUID is repeated, making it easy for clients to organise the catalogue properly).
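To show why repeating the UUID makes life easy for the client, here is a small Python sketch that folds catalogue entries into one record per channel. The entry format (pairs of UUID and source URI) and the sample UUIDs and hostnames are made up for the example; a real catalogue would be parsed from RDF first.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_uuid(entries: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collect every source URI that shares a channel UUID."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for uuid, source_uri in entries:
        grouped[uuid].append(source_uri)
    return dict(grouped)


# Hypothetical catalogue entries: the first channel is listed twice,
# once per source, with the same UUID each time.
ENTRIES = [
    ("urn:uuid:00000000-0000-0000-0000-000000000001", "http://tv.example.com/channels/test"),
    ("urn:uuid:00000000-0000-0000-0000-000000000001", "http://mirror.example2.com/channels/test"),
    ("urn:uuid:00000000-0000-0000-0000-000000000002", "http://tv.example.com/channels/other"),
]

CHANNELS = group_by_uuid(ENTRIES)
```

Here `CHANNELS` has two keys, and the first maps to both of its source URIs in the order they were listed.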

Next, we have a channel. A “channel” could have several definitions, depending upon how you want to define it: is it one (or more) live RTSP-based streams? Is it a series of pre-recorded MPEG4 videos? Is there channel meta-data and interactive content? My solution would be to not care at a high level: use another RDF sequence to define a channel, with the type of content simply defined by the kind and type of URI supplied. You would probably want to add some additional information that doesn’t fit into the generic RDF schema, though: perhaps a channel “type” to define how the channel’s content should be presented to the user: on-demand video obviously has UI considerations which are different to live streams, and you can’t necessarily infer the type of UI from the channel contents (you might have pre-recorded content, but want it to be presented to users as though it were streamed). Similarly, for individual items of content, it might be necessary to add a specific attribute to indicate whether they should be exposed to the end-user or not (for example, interactive services might use video elements which shouldn’t be presented as “programmes” in their own right, but should still be obtained using the same content retrieval strategies used for other things).

Using RDF gives us quite a lot of flexibility: the Dublin Core schema gives us elements for creating a human-readable description of a resource, and using selected parts of the XHTML schema means such descriptions can be slightly richer than they otherwise would be. In practice, it might be useful for the channel catalogue to duplicate some of the information contained within the individual channels: channel descriptions, thumbnail images, web-site URIs, EPG URIs, and so forth. Doing so would allow access to the information without the need for the user-agent to go away and retrieve it for each individual channel separately. In a similar vein, although a channel might provide EPG data, it would no doubt be useful for the individual programme sequences to contain the EPG data related to those programmes, which means the EPG data is available for any programme that’s currently available, even if the full EPG information has yet to be obtained.

In part 2, I talked about using SIP as a base protocol for retrieval of programming, but SIP is just a framework: none of the content itself is carried by SIP—SIP is just used for authorisation, discovery, event notification, and to control the setup and tear-down of the data channels. With that in mind, how do we fit SIP into our framework?

First, we can describe programmes in terms of SIP URIs. Consider the following RDF fragment:

<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:tv="http://tv.example.com/ns/tv#"
xmlns:epg="http://tv.example.com/ns/epg#">
<rdf:Description rdf:about="http://tv.example.com/channels/test/2007/04/test-programme/5" tv:presentation="programme">
<dc:title>Test Programme</dc:title>
<tv:subtitle>Episode 5: Testing again</tv:subtitle>
<dc:creator>Example Productions, Ltd.</dc:creator>
<dc:rightsholder>Example Holdings International Plc.</dc:rightsholder>
<dc:publisher>Example Networks</dc:publisher>
<dc:rights>http://tv.example.com/channels/test/copyright</dc:rights>
<dc:type>MovingImage</dc:type>
<dc:identifier>urn:uuid:6a9ef3d1-2f0b-4fab-bdfc-3acfa142d4ff</dc:identifier>
<dc:language>en-GB</dc:language>
<epg:start>2007-04-29T13:25:00Z</epg:start>
<epg:end>2007-04-29T13:29:59Z</epg:end>
<epg:description xmlns:html="http://www.w3.org/1999/xhtml">
This week, Test Programme explores RDF namespaces. <html:strong>Not suitable for viewers aged 102 or over.</html:strong>
</epg:description>
<tv:source tv:href="rtsp://live.example.com/6a9ef3d1-2f0b-4fab-bdfc-3acfa142d4ff" tv:type="live" tv:pref="2" />
<tv:source tv:href="sip:6a9ef3d1-2f0b-4fab-bdfc-3acfa142d4ff@livesip.example.com" tv:type="live" tv:pref="1" />
<tv:source tv:href="sip:6a9ef3d1-2f0b-4fab-bdfc-3acfa142d4ff@sip1.example.com" tv:pref="1" />
<tv:source tv:href="sip:6a9ef3d1-2f0b-4fab-bdfc-3acfa142d4ff@sipmirror.example2.com" tv:pref="2" />
<tv:preview tv:type="image/png" tv:href="http://tv.example.com/channels/test/2007/04/test-programme/preview.png" />
</rdf:Description>
</rdf:RDF>
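For what it's worth, here is a minimal Python sketch of how a client might pull the title and sources out of a description like this one. It uses the standard library's plain XML parser rather than a real RDF toolkit, and works on a trimmed copy of the fragment. The `read_programme` name is invented, and defaulting a missing `tv:pref` to 1 is an assumption of mine.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
TV = "http://tv.example.com/ns/tv#"

# Trimmed copy of the programme description, keeping two of the sources.
FRAGMENT = """<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns:tv="http://tv.example.com/ns/tv#">
<rdf:Description rdf:about="http://tv.example.com/channels/test/2007/04/test-programme/5">
<dc:title>Test Programme</dc:title>
<tv:source tv:href="rtsp://live.example.com/6a9ef3d1-2f0b-4fab-bdfc-3acfa142d4ff" tv:type="live" tv:pref="2" />
<tv:source tv:href="sip:6a9ef3d1-2f0b-4fab-bdfc-3acfa142d4ff@sip1.example.com" tv:pref="1" />
</rdf:Description>
</rdf:RDF>"""


def read_programme(xml_text):
    """Return (title, sources) for the first rdf:Description in xml_text."""
    root = ET.fromstring(xml_text)
    desc = root.find(f"{{{RDF}}}Description")
    title = desc.findtext(f"{{{DC}}}title")
    sources = []
    for src in desc.findall(f"{{{TV}}}source"):
        sources.append({
            "href": src.get(f"{{{TV}}}href"),
            # Sources without a type are treated as on-demand.
            "type": src.get(f"{{{TV}}}type", "demand"),
            # Assumption: a missing preference counts as 1.
            "pref": int(src.get(f"{{{TV}}}pref", "1")),
        })
    return title, sources


TITLE, SOURCES = read_programme(FRAGMENT)
```

Treating the document as plain XML only works while the catalogue sticks to this one serialisation; a real client would probably want a proper RDF parser.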

With this description, we have a wealth of information about our “Test Programme, Episode 5”, including multiple sources for the programme itself, copyright and publisher’s information, start and end times, a description of the programme (including HTML mark-up), the language the programme is broadcast in, and the URI for more information on the particular episode. There is a lot of information missing from this sample, though: as it’s an episode in a series, there’s nothing relating it to other episodes in the same series; there’s no production date or information; and there’s no structured information on people related to the episode (producer, director, actors, narrators, etc.). All of this, and probably a few more things besides, would ideally be included in a real programme resource. Note also that, although the SIP URIs use the same UUID used as the programme’s identifier, there’s no absolute reason for this to be the case: nothing should make the assumption that the programme’s identifier is related to anything in particular, and the same applies to any URIs included in the programme meta-data.

So, now we have a set of source URIs, ordered by type, then preference. What do we do with them? First of all, let us assume that sources without a ‘type’ are defaulted to ‘demand’, and that ‘demand’ sources are treated separately from ‘live’ sources. With ‘demand’ sources, we can pre-fetch content and cache it locally, whereas ‘live’ sources must be streamed when the user indicates that they wish to watch the channel. Let us examine ‘live’ sources first. We have two strategies for retrieving the content: a SIP URI and an RTSP URI, with the SIP URI being preferred. In both cases, it’s likely that the underlying transport used for content would be RTP: in the SIP case, we place a “call” to the URI given (sip:6a9ef3d1-2f0b-4fab-bdfc-3acfa142d4ff@livesip.example.com) using INVITE (authenticating along the way if required), and SDP is used to set up the RTP stream. In the RTSP case, the user agent would DESCRIBE, SETUP and PLAY as appropriate. Note that RTSP’s DESCRIBE method also uses SDP to describe the media streams that are available.
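The "ordered by type, then preference" step might look something like the following Python sketch. The dictionary layout and the `order_sources` name are invented for the example; the URIs and preference values are the ones from the sample description, and I'm assuming that a lower `pref` number means "more preferred" (which matches the SIP live source, at 1, being preferred over the RTSP one, at 2).

```python
def order_sources(sources):
    """Split sources into (live, demand), each sorted by preference."""
    live, demand = [], []
    for source in sources:
        # Sources without a type default to 'demand'.
        kind = source.get("type") or "demand"
        (live if kind == "live" else demand).append(source)

    def by_pref(source):
        return source["pref"]  # assumption: lower number is tried first

    return sorted(live, key=by_pref), sorted(demand, key=by_pref)


# The four sources from the sample programme description.
SAMPLE_SOURCES = [
    {"href": "rtsp://live.example.com/6a9ef3d1-2f0b-4fab-bdfc-3acfa142d4ff", "type": "live", "pref": 2},
    {"href": "sip:6a9ef3d1-2f0b-4fab-bdfc-3acfa142d4ff@livesip.example.com", "type": "live", "pref": 1},
    {"href": "sip:6a9ef3d1-2f0b-4fab-bdfc-3acfa142d4ff@sip1.example.com", "pref": 1},
    {"href": "sip:6a9ef3d1-2f0b-4fab-bdfc-3acfa142d4ff@sipmirror.example2.com", "pref": 2},
]

LIVE, DEMAND = order_sources(SAMPLE_SOURCES)
```

With the sample data, the first live source is the `livesip.example.com` SIP URI and the first on-demand source is `sip1.example.com`, with the mirror as its fallback.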

With such similarities between using SIP and RTSP for initiating a stream, what’s the benefit of supporting SIP at all? RTSP is used throughout the streaming media world already, and is supported by a number of clients for precisely this purpose. The answer lies in the end-to-end architecture: we support RTSP because lots of software already uses it, and because there are streaming platforms already built upon it. SIP, on the other hand, gives us a lot more flexibility: built primarily as a telephony protocol, it gives us the ability to perform stream trunking, to provide instant messaging and presence services, and even perform voice or video calls over our Internet TV network (a slightly strange contrast with many telecoms providers’ strategies which are focussing on building IP-based voice networks that can also carry TV: in reality, the structure is the same, just looked at from the other side).

Ultimately, if your network architecture is built around SIP, adding a set of RTSP streaming services could well be a difficult proposition. Whilst adding support for both SIP and RTSP is relatively cheap at the user-agent side, it’s comparatively expensive from a network architecture perspective.

For a live stream, you’re limited in how you can optimise the viewing process: if there’s a transport problem, the stream will stutter and skip. Although we have a provision for multiple stream sources, attempting to utilise more than one source simultaneously would likely just compound the problems! A user-agent could be proactive about such issues, though: establish connections to the various available sources (or a selection of them), and whilst streaming from one, periodically ‘ping’ the others and query them regarding projected capacity and latency. Doing this would require extensions to both SIP and RTSP, so is possibly limited in its feasibility (and, of course, media streams don’t necessarily have to be served from the same location as the SIP/SDP or RTSP/SDP sources which describe and initiate them), but it could provide a mechanism for a user-agent to effectively second-guess the health of a stream (or even fall back in a fashion more intelligent than round-robin).
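To make the "more intelligent than round-robin" idea a little more concrete, here is a Python sketch of the selection step only. The `probe` callback stands in for the hypothetical SIP/RTSP extension described above (nothing like it exists in either protocol as far as I know), and the 500 ms cut-off is an arbitrary number picked for the example.

```python
from typing import Callable, Iterable, Optional


def pick_fallback(
    current: str,
    candidates: Iterable[str],
    probe: Callable[[str], Optional[float]],
    max_latency_ms: float = 500.0,
) -> Optional[str]:
    """Return the healthiest alternative to `current`, or None if there is none.

    `probe(uri)` is assumed to return a projected latency in milliseconds,
    or None when the source does not answer.
    """
    best_uri, best_latency = None, None
    for uri in candidates:
        if uri == current:
            continue  # we are already streaming from this one
        latency = probe(uri)
        if latency is None or latency > max_latency_ms:
            continue  # unreachable or too slow to be worth switching to
        if best_latency is None or latency < best_latency:
            best_uri, best_latency = uri, latency
    return best_uri


# Fake probe results standing in for real measurements.
FAKE_LATENCIES = {
    "sip:a@livesip.example.com": 40.0,
    "rtsp://live.example.com/a": 120.0,
    "sip:a@slow.example.com": 900.0,
}

FALLBACK = pick_fallback("sip:a@livesip.example.com", FAKE_LATENCIES, FAKE_LATENCIES.get)
```

In this run the client is streaming from the first source, the third is too slow, and so the RTSP source is chosen as the fallback.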

In part 4, I’ll discuss the ‘on-demand’ side of things, which I suspect is what most people are really interested in. After that, I’m planning on covering extended features beyond raw video streaming (EPGs, presence/IM, two-way streams, interactive features), and the user interface issues concerned with tying all of this together in a useable fashion.

