
Needling the Old Guard: XML in Prosopography

For the last few weeks we have been discussing the ongoing debate in the digital humanities between textual markup and databases. Reading K.S.B. Keats-Rohan’s “Prosopography for Beginners” on her Prosopography Portal (http://prosopography.modhist.ox.ac.uk/index.htm), I found it interesting that the tutorial focuses initially and primarily on markup. Essentially, Keats-Rohan outlines three stages to prosopography:
1. “Data modelling”—For Keats-Rohan, this stage is accomplished by marking up texts with XML tags “to define the groups or groups to be studied, to determine the sources to be used from as wide a range as possible, and to formulate the questions to be asked.” It does far more than that, however, since the tags identify the particular features of sources that need to be recorded. Keats-Rohan covers this activity extensively with eleven separate exercises, each with its own page.
2. “Indexing”—This stage calls for the creation of indexes based on the tag set or DTD developed in stage one. These indexes collect specific types of information, such as “names”, “persons” and “sources”. The indexes are then massaged, with the addition of biographical data, into a “lexicon”, to which a “questionnaire” (i.e. a set of questions used to query your data points) is applied. Ideally, it is suggested, this is done through the creation of a relational database with appropriately linked tables. A single page is devoted to the explanation of this stage, with the following apology:

It is not possible in the scope of this tutorial to go into detail about issues relating to database design or software options. Familiarity with the principles of a record-and-row relational database has been assumed, though nothing more complex than an Excel spreadsheet is required for the exercises.

…11 lengthy exercises for XML, but you’re assumed to appreciate how relational databases work by filling out a few spreadsheets?
3. “Analysis”—This is, of course, the work of the researcher, once the data collection is complete. This section of the tutorial includes a slightly longer page than stage two, with four sample exercises designed to teach users how prosopographical analysis can be conducted.
It strikes me as incongruous that, for a research method that relies so heavily on the proper application of a relational database model, so little time is devoted to discussing its role in processing data. Instead, Keats-Rohan devotes the majority of her tutorial to formulating an XML syntax that, when all is said and done, really only adds an unnecessary level of complexity to processing source data. You could quite easily do away with stage one altogether, create your index categories in stage two as database tables, and process (or “model”) your data at that point, simply by entering it into your database. What purpose does markup serve as a means of organizing your content, if you’re just going to reorganize it into a more versatile database structure?
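To make the comparison concrete, here is a minimal sketch, in Python with SQLite, of what stage two could look like with no stage one at all; the table names and sample data are my own, not Keats-Rohan’s. The index categories become tables, and the “questionnaire” becomes a handful of queries.

# A sketch only: my own table names and sample data, not Keats-Rohan's.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE source  (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE person  (id INTEGER PRIMARY KEY, name TEXT);
-- each row records one mention of a person in a source
CREATE TABLE mention (
    person_id INTEGER REFERENCES person(id),
    source_id INTEGER REFERENCES source(id),
    role TEXT
);
""")
cur.execute("INSERT INTO source VALUES (1, 'Domesday Book')")
cur.execute("INSERT INTO person VALUES (1, 'Robert of Mortain')")
cur.execute("INSERT INTO mention VALUES (1, 1, 'landholder')")

# The "questionnaire" is then simply a set of queries against the tables.
for row in cur.execute("""
    SELECT person.name, source.title, mention.role
    FROM mention
    JOIN person ON person.id = mention.person_id
    JOIN source ON source.id = mention.source_id
"""):
    print(row)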
Keats-Rohan’s focus on markup starkly illustrates how much more highly XML is valued than databases by humanities scholars. Since the two are useful for quite different purposes, and relational databases have so much to offer humanities scholarship—as prosopographies prove—I am baffled that such a bias persists.


The Implications of Database Design

In studying the database schema for the Prosopography of Anglo-Saxon England (PASE), several features of the design are immediately apparent[1]. Data is organized around three principal tables, or data points: the Person (i.e. the historical figure mentioned in a source), the Source (i.e. a text or document from which information about historical figures is derived), and the Factoid (i.e. the dynamic set of records associated with a particular reference in a source about a person). There are a number of secondary tables as well, such as the Translation, Colldb and EditionInfo tables, which provide additional contextual data about the source, and the Event, Person Info, Status, Office, Occupation and Kinship tables, among others, which provide additional data to the Factoid table. Looking at these organizational structures, it is clear that the database is designed to pull out information about historical figures based on Anglo-Saxon texts.

I admire the versatility of the design and the way it interrelates discrete bits of data (even more impressive when tested using the web interface at http://www.pase.ac.uk), but I can’t help but recognize an inherent bias in this structure. In reading John Bradley and Harold Short’s article “Using Formal Structures to Create Complex Relationships: The Prosopography of the Byzantine Empire—A Case Study”, I found myself wondering about the choices made in the design of both databases. The PBE database structure appears to be very similar, if not identical, to that of PASE.

Perhaps it’s my background as an English major—rather than a History major—but I found the structure especially unhelpful in one particular instance: how do I find and search the information associated with a unique author? With its focus on the historical figures written about in sources, rather than the authors of those sources, the creators made a conscious choice to value historical figures over authors and sources. To be fair, the structure does not necessarily preclude the possibility of searching author information, which appears in the Source table, and there is likely something to be said about the anonymous and possibly incomplete nature of certain Anglo-Saxon texts. In examining the PASE interface, the creators appear to have resolved this issue somewhat by allowing users to browse by source, and by listing the author’s name in place of the title of the source (which, no doubt, is done by default when the source document has no official title). It is then possible to browse references within the source and to match the author’s name to a person’s name[2].

The decision to organize information in this way, however, de-emphasizes the role of the author and his historical significance, and reduces him to a faceless and neutral authority. This may be to facilitate interpretation; Bradley & Short discuss the act of identifying factoid assertions about historical figures as an act of interpretation, in which the researcher must make a value judgment about what the source is saying about a particular person (8). Questions about the author’s motives would only problematize this act. The entire organization of the database, in fact, results in the almost complete erasure of authorial intent.
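To illustrate, here is a rough, hypothetical sketch of the structure as I read it, boiled down to three tables with invented columns and sample data (this is emphatically not the real PASE schema). Persons and the factoids about them are first-class, linked records, while the author survives only as a name attached to a source.

# A hypothetical, simplified approximation; not the real PASE schema.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE person  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE source  (id INTEGER PRIMARY KEY, title TEXT, author_name TEXT);
CREATE TABLE factoid (id INTEGER PRIMARY KEY,
                      person_id INTEGER REFERENCES person(id),
                      source_id INTEGER REFERENCES source(id),
                      note TEXT);
""")
cur.execute("INSERT INTO person VALUES (1, 'Aldhelm 3')")
cur.execute("INSERT INTO source VALUES (1, 'Aldhelm', 'Aldhelm')")
cur.execute("INSERT INTO factoid VALUES (1, 1, 1, 'authorship')")

# What a source says about a person is a pair of foreign-key joins away...
print(cur.execute("""
    SELECT person.name, source.title FROM factoid
    JOIN person ON person.id = factoid.person_id
    JOIN source ON source.id = factoid.source_id
""").fetchall())

# ...but finding the author *as a person* means matching a name string on the
# source against the person table: an interpretive act, not a foreign key.
print(cur.execute("""
    SELECT person.id, person.name FROM source
    JOIN person ON person.name LIKE source.author_name || '%'
""").fetchall())

In the real system the path appears to run through the Authorship and Colldb tables rather than a crude name match (see note 2 below), but the asymmetry is the same: the person is a record, the author is an attribute.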
What this analysis of PASE highlights for me is how important it is to be aware of the implications of our choices in designing databases and creating database interfaces. The creators of PASE might not have intended to render the authors of their sources so impotent, but the decisions they made in the construction of their database tables, in the design of the user interface, and in the approach to entering factoid data had that ultimate result.

References:

Bradley, J. and Short, H. (n.d.). Using Formal Structures to Create Complex Relationships: The Prosopography of the Byzantine Empire. Retrieved from http://staff.cch.kcl.ac.uk/~jbradley/docs/leeds-pbe.pdf

PASE Database Schema. (n.d.). [PDF]. Retrieved from http://huco.artsrn.ualberta.ca/moodle/file.php/6/pase_MDB4-2.pdf

Prosopography of Anglo-Saxon England. (2010, August 18). [Online database]. Retrieved from http://www.pase.ac.uk/jsp/index.jsp


[1] One caveat: As I am no expert, what is apparent to me may not be what actually is.  This analysis is necessarily based on what I can understand of how PASE and PBE are designed, both as databases and as web interfaces, and it’s certainly possible I’ve made incorrect assumptions based on what I can determine from the structure.  Not unlike the assumptions researchers must make when identifying factoid assertions (Bradley & Short, 8).
[2] For example, clicking the source “Aldhelm” will list all the persons found in Aldhelm, including Aldhelm 3, bishop of Malmesbury, the eponymous author of the source (or rather, collection of sources). Clicking Aldhelm 3 will provide the Person record, or factoid—Aldhelm, as historical figure. The factoid lists all of the documents attributed to him under “Authorship”. Authorship, incidentally, is a secondary table linked to the Factoid table; based on the structure, it seems this information is derived from the Colldb table, which links to the Source table. All this to show that it is possible, but by no means obvious, to search for author information.

Shapiro’s Shakespeare and the “Generative Dance” of his Research

Perhaps the most interesting thing about James Shapiro’s A Year in the Life of William Shakespeare: 1599 is the kind of scholarship it represents.  Drawing upon dozens—likely hundreds—of sources, Shapiro presents a credible depiction of Shakespeare’s life in 1599.  Rather than limiting himself to sources that are exclusively about Shakespeare or his plays, Shapiro gathers a mountain of data about Elizabethan England.  He consults collections of public records that shed light either on Shakespeare’s own life or on the lives of his contemporaries, not just to identify the historical inspiration and significance of his plays, but to give us an idea of what living in London as a playwright in 1599 would have been like.  This, to me, is a fascinating use of documentary evidence that few have successfully undertaken.

Before I go on, I should note that I’m currently working on a directed study in which I am being thoroughly steeped in the objects and principles of knowledge management.  It is in light of this particular theoretical context that I read Shapiro and think, “he’s really on to something here.”   In their seminal article “Bridging Epistemologies: The Generative Dance Between Organizational Knowledge and Organizational Knowing”, Cook & Brown present a framework in which “knowledge”—the body of skills, abilities, expertise, information, understanding, comprehension and wisdom that we possess—and “knowing”—the act of applying knowledge in practice—interact to generate new knowledge.  Drawing upon Michael Polanyi’s distinction between tacit and explicit knowledge, they distinguish several forms of knowledge—tacit, explicit, individual and group.  They then advance the notion of “productive inquiry”, in which these different forms of knowledge can be employed as tools in an activity—such as riding a bicycle, or writing a book about an Elizabethan dramatist—to generate new knowledge, in forms that perhaps were not possessed before.  It is this interaction between knowledge and knowing, which produces new knowledge, that constitutes the “generative dance”.

Let’s return for a moment to Polanyi’s tacit and explicit knowledge.  The sources Shapiro is working with are, by their nature, explicit, since he is working with documents.  The book itself is explicit, since it too is a document, and the knowledge it contains is fully and formally expressed.  The activity of taking documentary evidence from multiple sources, interpreting each piece of evidence in the context of the other sources, and finally synthesizing all of it into a book, represents more epistemic work than is contained in either the book or the sources by themselves.  The activity itself is what Cook & Brown describe as “knowing”, or the “epistemology of practice”.  The notions of recognizing context and of interpretation, however, suggest that there’s even more going on here than meets the eye.  In this activity, Shapiro is merging these disparate bits of explicit knowledge to develop a hologram of Shakespeare’s 1599.  This hologram is tacit—it is an image he holds in his mind that grows more and more sophisticated the more historical relational evidence he finds.  Not all of the patterns and connections he uncovers are even expressible until he begins the synthesis, the act of writing his book.  Throughout this process, then, new knowledge is constantly being created—both tacit and explicit.

Let’s also consider for a moment Cook & Brown’s “individual” and “group” knowledge.  Shapiro’s mental hologram can safely be classified as individual knowledge.  And each piece of evidence from a single source is also individual knowledge (though, certainly, some of Shapiro’s sources might represent popular stories or widely known facts, and thus group knowledge).  The nature of Shapiro’s work, however, the collective merging of disparate sources, problematizes the individual/group distinction.  What arises from his scholarship is neither group knowledge (i.e. knowledge shared among a group of people) nor individual knowledge (i.e. knowledge possessed by an individual), but some sort of hybrid that is not so easily understood.

From a digital humanist perspective, we can think of Shapiro’s scholarship as a relational database (and we have).  All of the data and documentary evidence gets plugged into the database, and connections no one even realized existed are then discovered.  We might have many people adding data to the database, sharing bits of personal knowledge.  And everyone with access to the database can potentially discover new connections and patterns, and in doing so create new knowledge.  Would such a collective effort be considered group knowledge?  Would individual discoveries be individual knowledge?  Would the perception of connections be tacit or explicit?  It is not altogether clear, because there are interactions occurring at a meta-level: interactions between data, between sources, and between users/readers and the sources and the patterns of interacting sources.  What is clear is that this interactive “dance” is constantly generating additional context, new forms of knowledge, new ways of knowing.
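To picture the analogy, here is a toy sketch in Python with SQLite; the contributors, sources and rows are invented for illustration. A shared evidence table lets a simple query surface a connection that no single contributor deposited explicitly.

# A toy example: contributors and evidence rows are invented for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""CREATE TABLE evidence (
    contributor TEXT, source TEXT, person TEXT, year INTEGER, note TEXT)""")
cur.executemany("INSERT INTO evidence VALUES (?,?,?,?,?)", [
    ("A", "Stationers' Register", "Shakespeare", 1599, "entry for a play"),
    ("B", "parish record",        "Shakespeare", 1599, "property dealings"),
    ("C", "court record",         "Jonson",      1599, "an unrelated suit"),
])

# Pairs of sources linked through a shared person and year: a connection
# neither contributor A nor contributor B deposited explicitly.
for row in cur.execute("""
    SELECT e1.source, e2.source, e1.person, e1.year
    FROM evidence e1
    JOIN evidence e2
      ON e1.person = e2.person AND e1.year = e2.year AND e1.source < e2.source
"""):
    print(row)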


Cook, S. D. N., and Brown, J. S. (1999). Bridging Epistemologies: The Generative Dance between Organizational Knowledge and Organizational Knowing, Organization Science 10(4), 381-400.

Shapiro, J. (2006).  A Year in the Life of William Shakespeare: 1599.  New York: Harper Perennial.  394p.

Review Paper 1: Wrapping our Heads Around KM

In this week’s readings, Prusak and Nunamaker Jr. et al. successfully provide a solid and informed definition of ‘knowledge management’ (KM) and an explanation of why it is important.  Prusak establishes from the get-go that KM is not just about managing information, but about providing and maintaining access to “knowledge-intensive skills” (1003).  He also identifies the pitfall of reducing KM to simply “moving data and documents around”, and the critical value of supporting less digitized, and less digitizable, tacit knowledge (1003).  Prusak chooses to define KM based on its disciplinary origins, noting economics, sociology, philosophy and psychology as its “intellectual antecedents”, rather than defining it from a single perspective or from its current application alone (1003-1005).

Nunamaker Jr. et al. take a different approach, defining KM first in the context of IT, that is, KM as a system or technology, and then presenting a hierarchical framework from which to understand its role.  In this framework, data, information, knowledge and wisdom all exist on a scale of increasing application of context (2-5).  Apart from this first theoretical framework, however, Nunamaker Jr. et al. risk falling into the trap Prusak warns against; they define KM as the effort to organize information so that it is “meaningful” (1).  But what is “meaningful”?  Only context can determine meaning; fortunately, Nunamaker Jr. et al. at least account for this unknown quantity in their framework (3-4).  They also propose a unit to measure organizational knowledge: intellectual bandwidth.  This measurement combines their KM framework with a similar framework for collaborative information systems (CIS), and is defined as “a representation of all the relevant data, information, knowledge and wisdom available from a given set of stakeholders to address a particular issue” (9).  It is clear from their efforts to quantify KM, and from the manner in which they frame KM as a system, that Nunamaker Jr. et al. are writing for a particular audience of technicians and IT specialists, while Prusak writes for a more general audience of practitioners.

One thing I felt was lacking from both articles was a clear statement, and challenge, of the assumptions involved in systematizing knowledge.  Nunamaker Jr. et al.’s argument for “intellectual bandwidth” is compelling, but I cannot help but be skeptical of any attempt to measure concepts as fuzzy as “wisdom” and “collective capability” (8-9).  Even Prusak clearly states that, as in economics, an essential knowledge management question is “what is the unit of analysis and how do we measure it?” (1004).  The underlying assumption is that knowledge can, in fact, be measured.  I am dubious about this claim (incidentally, this is also why I am dubious of similar claims often made in economic theory).  Certainly, there are other, qualitative forms of analysis that do not require a formal unit of measurement.  Assuming (a) that knowledge is quantifiable, and (b) that such a quantity is required in order to examine it properly, seems to me to lead down a dangerous and not altogether useful path.  The danger is that, in focusing on how to measure knowledge in a manner that lends itself to quantitative analysis, one becomes absorbed in the activity of designing metrics and forgets that the purpose of KM is primarily to capture, organize and communicate the knowledge and knowledge skills within an organizational culture.  Perhaps this danger should be considered alongside, and as an extension of, Prusak’s pitfall of understanding KM merely as “moving data and documents around”.

Both of these articles, as well as the foundational article by Nonaka also under discussion this week, are valuable insofar as they lay the groundwork for knowledge management as a theoretical perspective.  Nunamaker Jr. et al. present much food for thought on how knowledge is formally conceptualized with their proposed frameworks, while Prusak provides a sound explanation of the origins of KM and forecasts the future of the field by suggesting one of two possible outcomes: either it will become so embedded in organizational practice as to be invisible, like the quality movement, or it will be hijacked by opportunists (the unscrupulous, profit-seeking consultants Prusak disdains at the beginning of his article, 1002), like the re-engineering movement (1006).  Both papers were published in 2001, and a decade later neither of these predictions appears to have been fulfilled.  KM has been adopted by organizations much as the quality movement was, but I suspect that knowledge workers are still trying to wrap their heads around how it is to be implemented and what it actually means.


Cited References


Nunamaker Jr., J. F., Romano Jr., N. C. and Briggs, R. O. (2001). A Framework for Collaboration and Knowledge Management, Proceedings of the 34th Hawaii International Conference on System Sciences – 2001. 1-12.


Prusak, L. (2001). Where did knowledge management come from? IBM Systems Journal 40(4), 1002-1007.

Too Much Information, Part 2: Recontextualization

The second article I want to discuss is “Data as a natural energy source” by Matthew Aslett, which deals principally with the idea of transforming data—records decontextualized—into products (records recontextualized as commodities).  Aslett introduces the concept of the “data factory”, a place where data is “manufactured”.  He also frames this in the context of “Big Data”—the current trend of accommodating larger and larger collections of information.  The problem is, “Big Data” are useless unless you can process them, analyze them, contextualize them.  Aslett suggests that the next big trend will be “Big Data Analytics”, which will focus on harnessing data sources and transforming them into products.  Assigning meaning to the raw, free-floating information, as it were.

One of the things I like about Aslett’s article is his analogy between data resources and energy resources, comparing the “data factory” with the oil industry.  Data is the new oil; useable data can be very valuable, as eBay and Facebook (Aslett’s two main examples) demonstrate.  What’s interesting about both eBay and Facebook, and why Aslett draws attention to them in particular, is that they don’t themselves produce the data: they harness pre-existing data streams (the data “pipeline”), building on transactions that already take place, automating those transactions for their users, and parsing the resulting user data into saleable products.  In the case of Facebook, this comes in the form of ad revenue from targeted marketing, based on the most comprehensive demographic information available online (a user base of 500+ million); for eBay, it is the combination of transactional and behavioural data that identifies its top sellers and leads to increased revenue for them.  If Facebook or eBay didn’t exist, as Aslett points out, people would still communicate, share photos, buy and sell products.  The two companies have simply automated these interactions, and acquired the transaction records associated with them along the way.
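To make that recontextualization a little more concrete, here is a deliberately simplified sketch, with invented records and categories; the “product” is not any single transaction record but the aggregated segments packaged for an advertiser.

# A simplified illustration: the records and categories are invented.
from collections import Counter

transactions = [
    {"user": "u1", "age_band": "25-34", "category": "camera gear"},
    {"user": "u2", "age_band": "25-34", "category": "camera gear"},
    {"user": "u3", "age_band": "45-54", "category": "gardening"},
]

# The saleable "product" is not any single record but the aggregate:
# which segments buy what, packaged for an advertiser.
segments = Counter((t["age_band"], t["category"]) for t in transactions)
for (age_band, category), count in segments.most_common():
    print(f"{age_band} / {category}: {count} buyers")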

This makes me wonder about the ownership implications, once again, and about the Facebook terms of use I trotted out in a previous blog entry.  Is it fair for Facebook to profit off your personal information in this way?  To control your data?  Isn’t it a little worrisome that eBay and Amazon track what I buy online well enough to make quite accurate recommendations?  In terms of IAPP discussed in the last class and of David Flaherty’s list of individual rights, it is troubling to consider that, if the countless disparate traces of me online were somehow pulled together and processed, someone could construct a reasonable facsimile of me, my personality, my identity.  And isn’t this what Aslett is really talking about when he uses the word “analytics”?

Aslett, M. (2010, November 18).  Data as a natural energy source.  Too Much Information. Retrieved on November 26, 2010 from http://blogs.the451group.com/information_management/2010/11/18/data-as-a-natural-energy-source/

Too Much Information, Part 1: e-Disclosure

Today I’m going to write about a RIM blog I have discovered thanks to the ARMA website links, “Too Much Information” by The 451 Group.  In particular, I want to discuss two articles from different authors, on quite different topics.  Given the word length limit on entries for the journal assignment, I’ll be splitting my writing up into two separate entries.

The first article, by Nick Patience, is a review of the topics discussed at the 6th Annual e-Disclosure Forum in London, dealing primarily with UK law.  Patience identifies key themes that came up during the forum.  The first of these is “Practice Direction 31B”, an amendment to the rules of civil procedure governing the disclosure of electronic documents.  Among the changes, Patience highlights the addition of a 23-question questionnaire to be used in cases that involve a large number of documents, and emphasizes how this would be useful both in getting parties organized for proceedings and as a pre-emptive method for organizations to prepare records in the event of future litigation.

In Canada we have some standard guidance in the form of the Sedona Canada Principles, the Sedona Working Group, and provincial task forces working on refining e-Disclosure practices.  I suspect there are discrepancies in practices between provinces, simply due to the nature of the Canadian legal system, which might make it difficult to apply a detailed questionnaire as a common resource (conjecture on my part, since I’m certainly not an expert in law), but I certainly agree with Patience about the potential benefits of such a resource.  In reviewing the case law digests, it is clear that one of the great challenges of e-Disclosure is limiting the scope of what constitutes evidence, which is, I believe, at the court’s discretion.  Examples that I’ve found are:

Dulong v. Consumers Packaging Inc., [2000] O.J. No. 161, January 21, 2000, OSCJ Commercial List, Master Ferron. The court held that a broad request from a plaintiff that the corporate defendant search its entire computer systems for e-mail relating to matters in issue in the litigation was properly refused on the grounds that such an undertaking would, “having regard to the extent of the defendant’s business operations, be such a massive undertaking as to be oppressive” (para 21).

Optimight Communications Inc. v. Innovance Inc., 2002 CanLII 41417 (ON C.A.), Parallel citations: (2002), 18 C.P.R. (4th) 362; (2002), 155 O.A.C. 202, 2002-02-19 Docket: C37211. Moldaver, Sharpe and Simmons JJ.A. The appellants appeal a Letter of Request issued in a California court seeking the assistance of Ontario courts in enforcing an order for production of 34 categories of documents by Innovance, Inc. Appellate Court limited the scope of production and discovery. Schedule A details the electronic sources and search terms.

Sourian v. Sporting Exchange Ltd., 2005 CanLII 4938 (ON S.C.) 2005-03-02 Docket: 04-CV-268681CM 3. Master Calum U.C. MacLeod. Production of information from an electronic database. An electronic database falls within the definition of “document” in our (Ontario) rules. The challenge in dealing with a database, however, is that a typical database would contain a great deal of information that is not relevant to the litigation. Unless the entire database is to be produced electronically together with any necessary software to allow the other party to examine its contents, what is produced is not the database but a subset of the data organized in readable form. This is accomplished by querying the database and asking the report writing software to generate a list of all data in certain fields having particular characteristics. Unlike other documents, unless such a report is generated in the usual course of business, the new document, the requested report (whether on paper or on CD ROM) would have to be created or generated. Ordering a report to be custom written and then generated is somewhat different than ordering production of an existing document. I have no doubt that the court may make such an order because it is the only way to extract the subset of relevant information from the database in useable form. On the other hand such an order is significantly more intrusive than ordinary document production. A party must produce relevant documents but it is not normally required to create documents. Accordingly such an order is discretionary and the court should have regard for how onerous the request may be when balanced against its supposed relevance and probative value. (Italics P.D.)

[These only represent the first three cases I found in the LexUM Canadian E-Discovery Case Law Digests (Common Law) online, under “Scope of production and discovery”. http://lexum.org/e-discovery/digests-common.html#Scope]
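As an aside, the “subset report” Master MacLeod describes in Sourian is easy to picture in code. Here is a minimal sketch, with an invented schema and invented data, of querying a database for only the relevant rows and fields and generating the new document a court must decide whether to order:

# A minimal sketch with an invented schema and data.
import csv
import io
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY, customer TEXT, product TEXT, order_date TEXT)""")
cur.executemany("INSERT INTO orders VALUES (?,?,?,?)", [
    (1, "Acme Ltd.", "widget", "2004-06-01"),
    (2, "Other Co.", "gadget", "2004-07-15"),
    (3, "Acme Ltd.", "widget", "2004-09-30"),
])

# Only the rows and fields relevant to the litigation are extracted...
relevant = cur.execute("""
    SELECT id, customer, order_date FROM orders
    WHERE customer = 'Acme Ltd.'
      AND order_date BETWEEN '2004-01-01' AND '2004-12-31'
""").fetchall()

# ...and written out as a new, readable document: the custom report the
# court must decide whether to order.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["id", "customer", "order_date"])
writer.writerows(relevant)
print(out.getvalue())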

What this news about UK policy makes me wonder, though, is precisely why we haven’t implemented a better national standard.  The Sedona Principles are wonderful for what they are—recommendations from a think tank drawing on the experience of lawyers, law-makers, and technology and information professionals—but in order for them to really mean anything, they have to be enacted in policy.  Naturally, that kind of legislation doesn’t happen overnight.

Another theme Patience identifies is the growing trend of cloud computing, and the problems therein.  This sort of comes back to my frequent rants about web records; the conference participants agreed that service level agreements (SLAs—precisely the kind of agreements I noted in my last entry) offered by cloud service providers did not provide a sufficient guarantee as to the control and security of a user’s records (in this case, the user being an organization).  Patience describes this quality of the SLA as lacking the “necessary granularity”—you need to know that you can search for, find, and retrieve your data in a form that you can use.  As Patience says, not having that guarantee is a “dealbreaker”.  This seems like a very important counterpoint to the ever-growing buzz about cloud computing, and reinforces the need for organizations to exercise caution before making decisions about how they want to manage their data.


Resources:

ARMA International

E-Discovery Canada

Patience, N.  (2010, November 16). e-Disclosure – cooperation, questionnaires and cloud. Too Much Information. Retrieved on November 26, 2010 from http://blogs.the451group.com/information_management/2010/11/16/e-disclosure-cooperation-questionnaires-and-cloud/