Discovering latent topical structure by second-order similarity analysis

Cribbin, T

Please use this identifier to cite or link to this item: http://bura.brunel.ac.uk/handle/2438/6702

Full metadata record

DC Field	Value	Language
dc.contributor.author	Cribbin, T	-
dc.date.accessioned	2012-09-21T15:13:22Z	-
dc.date.available	2012-09-21T15:13:22Z	-
dc.date.issued	2011	-
dc.identifier.citation	Journal of the American Society for Information Science and Technology, 62(6): 1188 - 1207, Jun 2011	en_US
dc.identifier.issn	1532-2882	-
dc.identifier.uri	http://onlinelibrary.wiley.com/doi/10.1002/asi.21519/abstract	en
dc.identifier.uri	http://bura.brunel.ac.uk/handle/2438/6702	-
dc.description	This is the post-print of the Article - Copyright @ 2011 ASIS&T	en_US
dc.description.abstract	Document similarity models are typically derived from a term-document vector space representation by comparing all vector-pairs using some similarity measure. Computing similarity directly from a ‘bag of words’ model can be problematic because term independence causes the relationships between synonymous and related terms and the contextual influences that determine the ‘sense’ of polysemous terms to be ignored. This paper compares two methods that potentially address these problems by modelling the higher-order relationships that lie latent within the original vector space. The first is latent semantic analysis (LSA), a dimension reduction method which is a well known means of addressing the vocabulary mismatch problem in information retrieval systems. The second is the lesser known, yet conceptually simple approach of second-order similarity (SOS) analysis, where similarity is measured in terms of profiles of first-order similarities as computed directly from the term-document space. Nearest neighbour tests show that SOS analysis produces similarity models that are consistently better than both first-order and LSA derived models at resolving both coarse and fine level semantic clusters. SOS analysis has been criticised for its cubic complexity. A second contribution is the novel application of vector truncation to reduce the run-time by a constant factor. Speed-ups of four to ten times are found to be easily achievable without losing the structural benefits associated with SOS analysis.	en_US
dc.language.iso	en	en_US
dc.publisher	American Society for Information Science and Technology	en_US
dc.title	Discovering latent topical structure by second-order similarity analysis	en_US
dc.type	Article	en_US
dc.identifier.doi	http://dx.doi.org/10.1002/asi.21519	-
pubs.organisational-data	/Brunel	-
pubs.organisational-data	/Brunel/Brunel Active Staff	-
pubs.organisational-data	/Brunel/Brunel Active Staff/School of Info. Systems, Comp & Maths	-
pubs.organisational-data	/Brunel/Brunel Active Staff/School of Info. Systems, Comp & Maths/IS and Computing	-
pubs.organisational-data	/Brunel/University Research Centres and Groups	-
pubs.organisational-data	/Brunel/University Research Centres and Groups/School of Information Systems, Computing and Mathematics - URCs and Groups	-
pubs.organisational-data	/Brunel/University Research Centres and Groups/School of Information Systems, Computing and Mathematics - URCs and Groups/Multidisciplinary Assessment of Technology Centre for Healthcare (MATCH)	-
pubs.organisational-data	/Brunel/University Research Centres and Groups/School of Information Systems, Computing and Mathematics - URCs and Groups/People and Interactivity Research Centre	-
Appears in Collections:	Publications Computer Science Department of Computer Science Research Papers
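
The abstract describes second-order similarity (SOS) as comparing documents by their profiles of first-order similarities rather than by their term vectors directly. The following is a minimal sketch of that idea under stated assumptions, not the paper's implementation: cosine is assumed as the similarity measure at both levels, the toy matrix and all function names are hypothetical, and the truncation rule (keep the k largest entries of each profile, zero the rest) is only one plausible reading of "vector truncation". The dense lists used here would not show any speed-up; any saving would depend on storing the truncated profiles sparsely.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_matrix(vectors):
    """All-pairs cosine similarity between the rows of `vectors`."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

def truncate(profile, k):
    """Keep the k largest entries of a profile and zero the rest (assumed scheme)."""
    order = sorted(range(len(profile)), key=lambda i: profile[i], reverse=True)
    keep = set(order[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(profile)]

def second_order_similarity(doc_term, k=None):
    """Compare documents by their first-order similarity profiles.

    If k is given, each profile is truncated before the second comparison.
    """
    first = similarity_matrix(doc_term)
    if k is not None:
        first = [truncate(row, k) for row in first]
    return similarity_matrix(first)

# Hypothetical toy corpus: rows are documents, columns are term counts.
docs = [
    [1, 1, 0, 0],  # doc 0 uses terms a, b
    [0, 1, 1, 0],  # doc 1 uses terms b, c
    [0, 0, 1, 1],  # doc 2 uses terms c, d
]

first = similarity_matrix(docs)
second = second_order_similarity(docs)

# Docs 0 and 2 share no terms, so their first-order similarity is zero,
# but both resemble doc 1, so their second-order similarity is positive.
print(first[0][2], second[0][2])
```

In this toy case documents 0 and 2 have no terms in common, yet they receive a non-zero second-order similarity because both are similar to document 1; this is the kind of higher-order relationship the abstract refers to.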

Files in This Item:

File	Description	Size	Format
Fulltext.pdf		1.16 MB	Adobe PDF
