N.B. THIS IS A ROUGH DRAFT, AND IT MAY NEED TO BE RESTRUCTURED

News organizations are expected to provide more coverage with fewer employees. Jeff Jarvis advises: "Cover what you do best, link to the rest."

A variation on this is the aggregation of various news sources. To do so, many media organizations are expected to turn in three directions for quality content:

  • Toward established providers of news content such as Reuters on the international scale and national news services
  • Toward ad-hoc networks of news focused on a single topic or major news event
  • Toward bloggers and citizen journalism initiatives

As this occurs, media organization staff will need to manage these multiple sources of data. This will require improved tools both for setting up sources and rules and for managing the information once it is ingested.

We have already dabbled with imports of external content through our Purgatory implementation, and now we have proper XML Import.

The idea is to create a module or a server, residing on the same machine or a separate one, which would aggregate feeds from external sources: newswires, RSS, Atom, even data from custom parser scripts (like the ones used in the Purgatory).

The server/module would then parse the feeds into forms that can be used as follows:

Use Case 1

Content is imported into the Campsite database and associated with a specific stage in the Campsite workflow (Pending, New, Submitted, Published). Conditions may be set for the automatic import of content.

For example, if there are certain keywords and the article comes from a trusted source, the article may be imported into a section with status ‘Published’.

  • Example 1: An article from Reuters in NewsML with the keyword ‘business’ goes to the business section with a status of ‘Published’, and the photo associated with it is also imported.
  • Example 2: A local sports blogger writes a blog post which is preset to go into the sports section with a status of ‘Submitted’. A video embedded in the blog post is imported with the embed pointing to its URI.
  • Example 3: A photo from a local photographer’s RSS feed goes to the news section with a status of ‘New’, where a new caption is written for it.
  • Example 4: An article from an affiliated newspaper’s Atom feed goes to the news section with a status of ‘Published’.
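The conditions in the examples above could be expressed as an ordered list of rules. The following is only a minimal sketch: the names Item, Rule and route, and the fallback to ‘Pending’ when nothing matches, are assumptions made for illustration and are not part of Campsite.

```python
from dataclasses import dataclass
from typing import Optional

# Workflow stages named in this document.
STAGES = {"Pending", "New", "Submitted", "Published"}

@dataclass
class Item:
    """An incoming item (hypothetical shape)."""
    source: str                      # e.g. "Reuters"
    fmt: str                         # e.g. "NewsML", "RSS", "Atom"
    keywords: frozenset = frozenset()

@dataclass
class Rule:
    """Send items from one source to a section with a given status."""
    source: str
    section: str
    status: str
    keyword: Optional[str] = None    # None means "any keyword"

    def __post_init__(self):
        if self.status not in STAGES:
            raise ValueError(f"unknown workflow stage: {self.status!r}")

    def matches(self, item: Item) -> bool:
        if item.source != self.source:
            return False
        return self.keyword is None or self.keyword in item.keywords

def route(item, rules, default_status="Pending"):
    """Return (section, status) from the first matching rule.

    Items that match no rule get no section and the default status
    (an assumption; the draft does not say what should happen).
    """
    for rule in rules:
        if rule.matches(item):
            return rule.section, rule.status
    return None, default_status

RULES = [
    Rule("Reuters", "business", "Published", keyword="business"),
    Rule("Local sports blog", "sports", "Submitted"),
]

print(route(Item("Reuters", "NewsML", frozenset({"business"})), RULES))
# prints ('business', 'Published')
```

Keeping the rules in an ordered list means the first match wins, so more specific rules can simply be placed ahead of general ones.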

Use Case 2

As a set of data for the front end. This would include:

  • Weather forecast data
  • Currency exchange rate information (as we did with data parsed in the Purgatory)
  • Sports scores
  • Stock and financial information
  • Entertainment listing information from either an in-house database or external provider
  • Mashup output

This information would bypass workflow and the Campsite database, and would be handled directly by the frontend. The base information may come in a recognized XML interchange format such as Atom, SportsML or NITF, or may require further pre-processing before being worked with.

Use Case 3

As source material for writing articles, i.e. not in the workflow or in the Campsite database, but in a separate repository (possibly its own database) accessible from the Campsite backend. The idea behind this separation is to allow for easier maintenance and management of accumulated data from sources such as wire services, which may be used later. By definition, Campsite data is there to stay, whereas this source material may be saved or purged according to certain criteria.

Pre-defined XML structures

We know that there is a fairly limited set of XML structures in which we can expect content to arrive. They include:

  • IPTC G2 specification (including NewsML G2, EventsML G2, SportsML G2)
  • NITF
  • Adobe InDesign XML export (from older versions); see InDesign for more details
  • RSS
  • Atom

Formats that probably should be supported at some point:

  • OpenOffice XML
  • Microsoft OpenXML
  • Adobe InDesign CS4 IDML
  • QuarkXPress XML

Our ingest module should be able to accept a number of different pre-defined document type definitions for such formats, but it should also allow third parties to add new DTDs as necessary, provided the DTDs can be mapped to something Campsite can understand. The formats should be preset, so that only defined formats are accepted for import – no AI should be required.
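One way to keep the set of accepted formats preset but extensible is a registry keyed by the root element of the incoming document. The sketch below is an assumption about how this could look, not an existing Campsite interface; FORMATS, register_format and detect_format are hypothetical names, and only RSS and Atom are registered here.

```python
import xml.etree.ElementTree as ET

# Preset formats, keyed by the (namespace-qualified) root element tag.
FORMATS = {
    "rss": "RSS",
    "{http://www.w3.org/2005/Atom}feed": "Atom",
}

def register_format(root_tag, name):
    """Let a third party add a format the importer should accept."""
    FORMATS[root_tag] = name

def detect_format(xml_text):
    """Return the registered format name, or refuse the document."""
    root = ET.fromstring(xml_text)
    try:
        return FORMATS[root.tag]
    except KeyError:
        raise ValueError(f"unsupported format: root element {root.tag!r}")

print(detect_format('<rss version="2.0"><channel/></rss>'))  # prints RSS
```

Anything whose root element is not in the registry is rejected outright, which matches the requirement that no guessing should be involved.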

Pre-import checks

The module should provide users with a means of previewing the import before it occurs. This should allow users to change the XML tag mapping or to cancel the import before it occurs. Such a pre-import check would simply list the article type field name along with the text that is supposed to be imported.
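A pre-import check of this kind could be as simple as pairing each source tag with its destination field and the text found in it. This is a sketch under assumptions: the function name preview, the tag names and the field names ‘Title’ and ‘Body’ are made up for illustration.

```python
import xml.etree.ElementTree as ET

def preview(article_xml, mapping):
    """List (article type field, text) pairs for one article.

    `mapping` maps source XML tag -> article type field name.
    Tags without a mapping are shown as "(unmapped)" so the user
    can spot them before importing.
    """
    article = ET.fromstring(article_xml)
    rows = []
    for child in article:
        field = mapping.get(child.tag, "(unmapped)")
        text = "".join(child.itertext()).strip()
        rows.append((field, text))
    return rows

xml_text = (
    "<article><headline>Rates rise</headline>"
    "<fulltext>The bank <b>said</b> so.</fulltext>"
    "<byline>A. Writer</byline></article>"
)
for field, text in preview(xml_text, {"headline": "Title", "fulltext": "Body"}):
    print(f"{field}: {text}")
```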

Tag remapping

If an import preview does not look correct to the user, they should have the option of changing the mapping of their tags. This could be handled by a two-column layout, with destination fields in one column and source tags in the next, and individual pulldowns for each of the article types.
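The data behind such a two-column layout might look like the following sketch. The function names remap and two_column_rows and the field names are hypothetical; the only point is that a remap is validated against the fields the article type actually has.

```python
def remap(mapping, source_tag, new_field, valid_fields):
    """Return a copy of `mapping` with `source_tag` sent to `new_field`."""
    if new_field not in valid_fields:
        raise ValueError(f"{new_field!r} is not a field of this article type")
    updated = dict(mapping)
    updated[source_tag] = new_field
    return updated

def two_column_rows(mapping, valid_fields):
    """Rows for the layout: (destination field, source tag or None)."""
    by_field = {field: tag for tag, field in mapping.items()}
    return [(field, by_field.get(field)) for field in valid_fields]

fields = ["Title", "Intro", "Body"]
mapping = {"headline": "Title", "fulltext": "Intro"}
mapping = remap(mapping, "fulltext", "Body", fields)
print(two_column_rows(mapping, fields))
# prints [('Title', 'headline'), ('Intro', None), ('Body', 'fulltext')]
```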

Working better with Adobe InDesign

Experiments with Adobe InDesign CS2 and its XML export features show that it is possible to export content to Campsite without errors. However, a large amount of work (several hours per issue for a newspaper) is required to prepare the document for import into Campsite. For example, InDesign has a feature for mapping object styles to XML tags, but in practice each paragraph is given its own redundant tag, as can be seen in how it handles the body text style. An example of default XML export with no user preparation is included here.

The ingest module needs to clean up the XML as it comes from InDesign, especially for adjacent redundant </fulltext><fulltext> tags nested on the same level inside an element. When it sees such multiple tags, it should merge them into a single tag. This function alone would probably greatly increase the likelihood of successful imports.
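The merge step could be sketched as below. This is an illustration rather than a tested importer: the function name merge_adjacent is hypothetical, and joining the merged paragraphs with a newline is an assumption, since the draft does not say how paragraph breaks should be preserved.

```python
import xml.etree.ElementTree as ET

def merge_adjacent(parent, tag="fulltext", sep="\n"):
    """Merge runs of adjacent <tag> siblings under `parent`, recursively.

    Two siblings count as adjacent when only whitespace separates them.
    """
    prev = None
    for child in list(parent):           # copy: we remove while iterating
        merge_adjacent(child, tag, sep)
        adjacent = (
            prev is not None
            and prev.tag == tag
            and child.tag == tag
            and not (prev.tail or "").strip()
        )
        if adjacent:
            # Append the later element's leading text to the earlier one.
            if len(prev):
                last = prev[-1]
                last.tail = (last.tail or "") + sep + (child.text or "")
            else:
                prev.text = (prev.text or "") + sep + (child.text or "")
            prev.extend(list(child))     # carry over any inline children
            prev.tail = child.tail
            parent.remove(child)
        else:
            prev = child

story = ET.fromstring(
    "<story><fulltext>One.</fulltext><fulltext>Two.</fulltext>"
    "<title>T</title><fulltext>Three.</fulltext></story>"
)
merge_adjacent(story)
print([c.tag for c in story])   # prints ['fulltext', 'title', 'fulltext']
```

Elements separated by a different tag (here the title) are deliberately left apart, since only redundant neighbours on the same level should be merged.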

Also, by default, InDesign uses the <root> and <story> tags instead of <articles> and <article>. The import should check for these tags at a fixed position. <root> always follows the <?xml version="1.0" encoding="UTF-8" standalone="yes"?> declaration and its trailing line feed (LF). We should check for this and replace it if necessary.
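A sketch of the tag replacement follows. It works on the parsed tree rather than on a fixed text position, which is a swapped-in approach and not what the paragraph above literally describes; the function name normalise_indesign is hypothetical.

```python
import xml.etree.ElementTree as ET

def normalise_indesign(xml_text):
    """Rename InDesign's default <root>/<story> to <articles>/<article>."""
    root = ET.fromstring(xml_text)
    if root.tag == "root":
        root.tag = "articles"
    # Only direct children of the root are treated as articles.
    for story in root.findall("story"):
        story.tag = "article"
    return ET.tostring(root, encoding="unicode")

print(normalise_indesign("<root><story><title>T</title></story></root>"))
# prints <articles><article><title>T</title></article></articles>
```

Documents that already use <articles> and <article> pass through unchanged, so the step is safe to run on every import.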

Import directory of files

The module should also allow users an option to select a directory of exported files with an option to recurse subdirectories.
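Collecting the files could be sketched with the standard library as follows. The function name find_export_files and the default ‘*.xml’ pattern are assumptions.

```python
from pathlib import Path

def find_export_files(directory, recurse=False, pattern="*.xml"):
    """Return matching files in `directory`, optionally recursing."""
    base = Path(directory)
    matches = base.rglob(pattern) if recurse else base.glob(pattern)
    return sorted(p for p in matches if p.is_file())
```

Calling find_export_files("exports", recurse=True) would also pick up files in subdirectories of a (hypothetical) exports directory.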

Flexible Campsite tag export

Campsite should provide an export file made up of all of the field names as defined in all of the article types. The content.xml file is good as of 3.2.1, but it does not account for individual article type differences; it only uses a limited number of fields, which may or may not correspond to the content a publication actually has. Therefore, either in the Import XML menu or the Article Types menu (or both), users should have the ability to export the list of article types currently in use on the instance. This file could then be used either by third-party developers in their own XSLT transforms or by InDesign users seeking to match their style tags with Campsite’s article type fields.
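Such an export could be generated along these lines. The element names articleTypes, articleType and field are invented for this sketch and do not describe the existing content.xml layout; the article types and fields are placeholder data.

```python
import xml.etree.ElementTree as ET

def export_article_types(article_types):
    """Serialise {article type name: [field names]} to an XML string."""
    root = ET.Element("articleTypes")
    for type_name, fields in article_types.items():
        type_el = ET.SubElement(root, "articleType", name=type_name)
        for field_name in fields:
            ET.SubElement(type_el, "field", name=field_name)
    return ET.tostring(root, encoding="unicode")

print(export_article_types({"news": ["Title", "Body"], "photo": ["Caption"]}))
```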

API

These functions should also be accessible through an XML-RPC and/or REST API for scripting and mashups, including connection to news services. (We’ve used SOAP for some Purgatory data aggregation purposes.)

TODO: PHOTOS AND MULTIMEDIA

  • Photos from wire services
  • Photos from blogs
  • Photos from other RSS/Atom feeds (Flickr)
  • News graphics from wire services
  • Video and animation from wire services

FEATURES AND THE ADMIN INTERFACE

Ingest Server/Module Administration

  1. Individual user feeds (come up with a better name) YES/NO

Administering feeds:

Global feeds (Publication/Section wide)

  1. Add feed
  2. Delete feed
  3. Feed Name and Number? (to be made accessible by the template engine)
  4. Select feed parser script (dropdown: RSS, Atom, NewsML, … custom)
    • Upload
    • Edit (Opens up the online editor, preferably the one described in #2805)
    • Delete script
  5. Feed status ON/OFF
  6. Update frequency
  7. Feed statistics (last successful retrieval, retrieval history)
    • Warning if feed retrieval fails.
  8. Parsed feed (parsing results) goes to:
    • Database Import of items from the feed
      • Automatic/Manual (if Automatic, define interval?)
      • Association with a Campsite workflow stage (Pending, New, Submitted, Published)
    • Text file (or whatever was used in the purgatory) – specify file name
    • Repository (Magpie, DB, whatever).
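The per-feed settings listed above could be held in a record such as the one sketched here. All names (Feed, record, is_due and so on) and the 15-minute default are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Tuple

@dataclass
class Feed:
    name: str
    number: int                       # exposed to the template engine
    parser: str                       # "RSS", "Atom", "NewsML", or custom
    enabled: bool = True              # feed status ON/OFF
    update_every: timedelta = timedelta(minutes=15)
    destination: str = "repository"   # "database", "file" or "repository"
    history: List[Tuple[datetime, bool]] = field(default_factory=list)

    def record(self, when, ok):
        """Add one retrieval attempt to the history."""
        self.history.append((when, ok))

    @property
    def last_success(self):
        for when, ok in reversed(self.history):
            if ok:
                return when
        return None

    @property
    def warning(self):
        """True when the most recent retrieval failed."""
        return bool(self.history) and not self.history[-1][1]

    def is_due(self, now):
        """True when the feed is ON and the update interval has elapsed."""
        if not self.enabled:
            return False
        if not self.history:
            return True
        return now - self.history[-1][0] >= self.update_every
```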

Individual User Feeds (RSS and Atom feeds added by regular users):

  1. Add feed
  2. Delete feed
  3. Feed Name and Number (to be made accessible by the template engine?)
  4. Feed type: dropdown (RSS/Atom), or just print the type once the feed is detected, as these should be easily identifiable automatically?
  5. Suggest feed to administrator (for inclusion to global feeds)

ACCESSING FEED DATA FROM ARTICLE EDIT SCREEN

This section describes how data from Use Case 3 will be used in article editing.

We could basically have three panes (IMPORTANT: design with R-T-L in mind as well):

  1. PANE 1: Feeds (with two views, toggled):
    • View one: tree list of feeds and individual items of those feeds
    • View two: flat list of items from all feeds, most recent on top
      • Item title | Source (Feed) | Date and Time
      • Sort by Title, Source, Date and Time
      • Search, Search by Title, Source, Date and Time
      • Ability to display or hide Global Feeds or Individual User feeds (checkboxes?).
  2. PANE 2: For displaying items whose title is selected in PANE 1
  3. PANE 3: Article edit pane: our revamped Article Edit Screen
  4. Interaction between PANE 2 and PANE 3
    • “Copy” button that copies selected text from PANE 2 into the selected field in PANE 3
    • Drag and drop
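The flat list in PANE 1 (view two) amounts to a filter followed by a sort. A minimal sketch, assuming a hypothetical FeedItem record and flat_view function; the search here only looks at title and source, and the item data is invented:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class FeedItem:
    title: str
    source: str          # name of the feed the item came from
    published: datetime
    is_global: bool      # True for global feeds, False for user feeds

def flat_view(items, sort_by="published", show_global=True,
              show_user=True, search=""):
    """Filter by feed kind and search text, then sort.

    Sorting by date puts the most recent item on top; sorting by
    title or source is ascending.
    """
    needle = search.lower()
    visible = [
        it for it in items
        if (show_global if it.is_global else show_user)
        and (needle in it.title.lower() or needle in it.source.lower())
    ]
    newest_first = sort_by == "published"
    return sorted(visible, key=lambda it: getattr(it, sort_by),
                  reverse=newest_first)

items = [
    FeedItem("Bank raises rates", "Reuters", datetime(2010, 1, 1, 9), True),
    FeedItem("Local derby report", "Sports blog", datetime(2010, 1, 2, 9), False),
]
print([it.title for it in flat_view(items)])
# prints ['Local derby report', 'Bank raises rates']
```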
