- two groups of challenges: (1) conceptual challenges about the design ideas behind the systems we evaluate, (2) practical details that matter a lot; addressing both is crucial to advance our field
- we are getting much better at doing evaluations, but concerns remain (e.g., Munzner's 2008 paper on pitfalls in visualization evaluation)
- these concerns are shared by research in usability evaluation (Gray and Salzman, 1998) and by discussions of methodological concerns in CHI (e.g., Greenberg and Buxton, 2008)
- conceptual challenges
- win-lose evaluations: the outcome is largely determined before the evaluation; with well-selected tasks it is obvious beforehand which interface will lose
- point designs: out of a huge sea of design possibilities we pick one point and dive in, but never map the results back into the larger design space
- lack of clarity in key constructs: e.g., a paper that claims to evaluate a fisheye interface but contains no real fisheye
- lack of theory-driven experiments: there are some in perceptual studies, but in general we have no real deep theories
Eg the concept of overview
- overview: definitions (read from slide); surprising that there are relatively few definitions of overview
- bottom-up approach to get a definition: looked at 60 studies, found 3 uses:
-- technical (overview as an interface component, e.g., overview+detail, O+D)
-- user-centered (the user forming an overview)
-- other ("gives an overview of ...")
- does an overview (the component) actually give the user an overview? we have no clue; to tackle this question we need to do more of this style of evaluation
Eg Fisheye interfaces
- results are quite mixed: some optimistic, some pessimistic (yes, Heidi and Tamara...)
- example: using fisheye in programming environments
- is a priori interest useful? maybe, but it is hard for users to understand; an a priori interest algorithm can be useful but is harder to understand than plain distance; the fisheye concept has been around for roughly 24 years, yet it is mostly untested; we have a few hints but have made very little progress at the concept level
Strong inference
- Platt (1964) introduced the notion of strong inference, with these steps:
-- devise alternative hypotheses that are equally plausible ways of going down each branch; use more theory-driven experiments that pit equally probable hypotheses against each other. We see very few results of the form "our great idea didn't work", because those don't get published; with good theoretical motivation, either outcome is interesting
-- devise a crucial experiment which will rule out some hypotheses
Radical solutions
- Newman (1994) reviewed CHI publications and found that 25% were radical solutions
- this has become a publication style: take a weird problem, build a nice interface, show it works in some way, then find a new weird problem and move on. Newman suggests instead carrying on the work, or verifying it
Practical challenges - simple outcome measures
- we mostly use binary task completion or relatively simple error measures; only very few studies use expert grading of products; we should be more interested in comprehension/learning as an outcome
- simple process measures
- time is a summary measure of the process; how should we interpret task completion time?
- eDoc study: 20% higher task completion time with O+D; we looked at which parts of the text people had visible and made progression maps
- further exploration: people found the answers and then kept working; participants spent more of their time in further exploration with O+D, so the extra time is not necessarily bad since it promotes checking etc.; this made a case for the more complex interface
- standing on the shoulders of giants
- it is a challenge to select data, tasks, measures, and interfaces. People in this community mostly evaluate their own interfaces, and independent researchers have found different effects than interested parties. Fixing and generating shared resources for at least some of these is crucial for repeating experiments
- people keep inventing new facets of satisfaction ("I feel ...", "the system is ...", on a 1-5 scale), but we have no clue how these relate
- questionnaire use and reliability: studies that use standardized questionnaires obtain higher reliability than those using homegrown ones
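A standard reliability statistic for such questionnaires is Cronbach's alpha over the items; a minimal Python sketch with hypothetical 1-5 ratings (not data from any of the studies discussed):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a participants x items matrix of ratings."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                          # number of questionnaire items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 1-5 ratings from 6 participants on 4 satisfaction items.
ratings = np.array([
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 2],
    [4, 4, 5, 4],
    [3, 2, 3, 3],
])
print(f"alpha = {cronbach_alpha(ratings):.2f}")  # ~0.94, i.e., high internal consistency
```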
Eg selecting tasks - tasks are crucial; task effects are strong while other effects are weaker; task-level testing (Catherine)
- tasks are chosen ad hoc to match the evaluation, or out of habit
- e.g., how do studies evaluate overview? most used navigation tasks yet started drawing conclusions about psychological aspects; monitoring tasks are rarely used
What to do? - more strong, theoretically motivated comparisons (e.g., TILCS?)
- more complex measures of outcome and process, coupled with richer data
q and a: - more rigorous experimentation takes a long time, and technology in our field develops fast, so the interface is no longer modern by the time the study is done; solutions? run the experiment on an interface that is not the most modern but embodies the essence of the basic idea (e.g., fisheye in a programming environment). This is not a factorial psychology experiment. Also important: real-work studies of adoption and integration. Go for the more conceptual type of study.
- reproducibility of study results: evaluating your own work (your baby) is a bit tainted, but is it practical to build two systems? Dunno. Options: competitions where you don't evaluate your own system, or pooled systems (?). Would you, as a reviewer, accept a paper that replicates experiments? The large-scale repetition of graph perception studies (Heer) made a contribution, but replications mostly don't work out that way; we have to be more clever. I hate experiments with only two interfaces, since one will be your darling; experiments using more interfaces are more interesting. In any controlled experiment you can bias things any way you want, but even more so in uncontrolled ones
- reviewers will reject repetition papers, so we need a separate venue; novelty is cool and attracts attention. A journal?
- papers often do not have enough information to repeat; make the datasets and tasks we use public to allow others to repeat experiments
- SEMVAST: all the datasets from VAST; like the CHI'97 tree browse-off, use the same dataset; the SpaceTree code, data, and tasks are available
- higher-level question: how would you extend the proposal to corporate and industrial researchers? they would have to test their own products, and sharing data is different there. No clue. Conceptual replication is probably not relevant to industry, which needs to figure out whether things work in real life
- the replication issue is about the maturity of the whole field; we could agree on fundamental issues or problems and push toward some sort of replication framework around central problems, but the field also has to work on harder, higher-level problems that very few address; we know so little about interaction and visualization and how to trade one off against the other. Before replications, try to attack the higher-level problems
- in other domains like architecture, replication is not important: they create something new, and moving targets are hard to evaluate. Architecture builds on engineering principles; what are our principles?
- run experiments on representative vis (e.g., treemaps); how do we tease apart interaction and visualization? we have some clue of things to do, but don't know enough to do it. In the programming environment study, think-aloud compared what people wanted to do vs. what they did. Not sure how to do it in general
- situation awareness research is well documented: people are interrupted during the task and asked to report what they know; a nice procedure
- focus on qualitative metrics of utility rather than usability
- interested in the analytic process, the use of vis in this process, and the quality of the results of the process
- VAST developed datasets with ground truth, so you know whether you are right
- reviewed all the reviews submitted for the mini challenges (42)
- reviewers were asked three questions; the analysis focused on the second
- material ok
- (focus) what reviewers think is important
- selected 2 types of reviewers: professional analysts (hard to find) and vis experts
- reviewers were given the video submissions, screenshots, and text descriptions
- reviewers commented on clarity: the clearer the submission, the better the scores obtained; of the unclear ones, only 5% got a score higher than 5
- comments fell into analytical and vis categories
- why a particular vis was selected; whether it is intuitive for the analysis
- comments on vis: complexity; if it is too complex, reviewers can't see the answer; find the line between the complexity of the vis and showing enough for people to see what's happening
- people don't like having to mouse
- careful about showing relationships
- comparison is very important
q and a
- how many reviewers?
- will comments be available to next year's contestants?
- comments may be too prescriptive; only comments made multiple times were included (no frequency statistics)
- some comments may not be right, but it's a starting point
- have to be careful about how to phrase them; they shouldn't be applied without thinking
- video quality: at InfoVis/VAST there is a high correlation between acceptance and availability of a video; in the challenge, the explanation is more strongly correlated with scores than the video
- will there be metrics for the contest in the future that reviewers will use?
- different data this year, so different comments
- accessibility: not explicitly stated in the guidelines or metrics given; submissions did not cater to color-blind / blind people
- time and error are outcome measures
- look inside the black box at how people solve their tasks, not only which tool is better / worse
- insight is already inside the black box; what is happening in the process? one part is the problem-solving process, which should be matched to the vis / visual analytics tool
- problem solving: two types of problems, well-defined vs. ill-defined (more than one correct solution); both have multiple ways to get to a solution. Ill-defined problems are more relevant in the real world, but research focuses on well-defined ones
- the study looked at both well- and ill-defined problems
- well-defined: strategies differed for different data and tools
- ill-defined: different people used different data sources; solution quality correlated with the number of data sources; better solutions used more strategies
q and a
- this analysis of problem solving is very tool-centered; a more task-centered view can look beyond interaction with the tool, since part of the problem-solving process is not tool related, e.g., using prior knowledge
- how do you get your data? think-aloud, interaction logs, eye gaze
- how fruitful are the logs? they give proof of which level of the data was used, and are very useful for getting at the strategies
- strategy depends on the task; more general problem-solving strategies exist for more general tasks, e.g., locating a specific date, but they differ from tool to tool because of different interaction possibilities; don't think you can have a meta problem-solving strategy
- using more sources meant having the right idea, not spending more time; those who found no solution spent more time
- did not link the number of info sources to strategy
- missing data? say what you know and don't know; are all data sources relevant to the problem? one could get a solution without using all the data sources
- tradeoff between how information is presented and how much? with gaze patterns you can tell what people used
- using working memory to evaluate info vis tools
- evaluations are usually developed for a single tool/user/dataset; here cognitive resources are used to create standardized metrics: a well-designed interface should reduce cognitive burden, freeing up resources to make sense of the data
- interested in working memory because it has limited capacity and is measurable
- dual-task methodology: primary task (interaction with the interface), secondary task (test of working memory)
- secondary task: auditory memory task (remembering letters)
- also uses NASA-TLX to develop convergent evidence
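A minimal sketch of how one dual-task trial could be run and logged; the primary task here is a placeholder callable, and the letter-recall secondary task is only an illustration of the methodology described above, not the speakers' actual protocol:

```python
import random
import string
import time

def run_dual_task_trial(primary_task, memory_load: int = 4) -> dict:
    """One dual-task trial: present letters, run the primary task, then test recall."""
    letters = random.sample(string.ascii_uppercase, memory_load)
    print("Remember these letters:", " ".join(letters))   # secondary task: encode

    start = time.time()
    primary_accuracy = primary_task()                     # primary task: use the vis
    completion_time = time.time() - start

    recalled = input("Type the letters you remember: ").upper().split()
    recall_accuracy = sum(a == b for a, b in zip(recalled, letters)) / memory_load

    # Lower recall at equal primary-task performance suggests the interface
    # consumed more working memory.
    return {
        "completion_time_s": completion_time,
        "primary_accuracy": primary_accuracy,
        "recall_accuracy": recall_accuracy,
    }

if __name__ == "__main__":
    # Placeholder standing in for a real visualization task.
    print(run_dual_task_trial(lambda: 1.0))
```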
q and a:
- possible divergence: dual-task is used to study encoding, but in visual analytics people may put more thought into the vis than into remembering the letters; the plan is to keep the primary task to the mechanics of interacting with the vis, testing usability of interfaces rather than deep analysis; later, tease out interaction strategies and outcomes
- interesting to include a baseline showing the impact of the secondary task on the primary task; a student is getting more rigorous insight into working memory: when people think hard the pupil dilates, and you can reliably track that somebody is thinking hard, so down the road this may give interesting measurements
- risk of going fishing: if you change the nature of the secondary task (changing the fish, etc.) you also have to change the primary task; the nature of running a control is really critical; change the nature of the secondary task to see which characteristics interfere; will do pilots
- the choice of an auditory task is slightly weird, since auditory memory may not interfere with visual memory; the aim is to tap into basic working memory, but auditory may be too distracting from the primary task; the encoding involved is more complicated than just auditory or visual
- it is not true that visualization alone is easy to grasp; text provides context, which is not a bad feature
- looks at cognitive and personality traits, experience with visualization, familiarity with the data, quality of the visualization, compatibility, and contextual factors
- determine a promising set of measures
q and a
- which factors might be more important than others? spatial ability and perceptual speed show some potential, but are not critical
- to what extent is this related only to infovis? not sure; a lot of the literature comes from HCI; given the novelty of vis tools, other metrics not looked at in HCI may be more important
- people approach vis in different ways; most important is to think about simple tests that figure out a user's category and test categories differently; without some categorization of users we cannot clearly trust results
- do come up with measures; active evaluation is kind of tricky; maybe find correlations and simplify; the maximum testing time is 30 minutes
- metrics about metrics: how base-level is a measure (e.g., age is an aggregation)
- how should I redesign studies with these in mind? implications: include the measure as a covariate in the analysis (linear?), treat separate groups differently, use blocking, or give each participant the configuration that works best for them (see the covariate sketch after this list); not crazy about the idea of adapting the vis to people, since people are adaptive; covariates also add some statistical power
- what does it mean to have these metrics from a scientific point of view, when different experiments give different results?
- to see the limit / best case of a vis or interaction, it is interesting to eliminate people who will take a long time to get to the best case; when comparing two interactions, first see whether there is a difference in the best case
- training: in a multi-trial study it takes 10-12 trials
- increasing the number of participants would make these factors matter less, but how many participants are needed?
- quality of the measures: it is quick to reach for the foobar test for x, but some tests may not be good; take them with a grain of salt; we need wisdom about the right measures, and guidelines for junior researchers
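A minimal sketch of the "measure as covariate" option mentioned above, using statsmodels OLS on simulated data; the column names, effect sizes, and the spatial-ability covariate are all invented for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 40

# Hypothetical experiment: two interfaces, one spatial-ability score per participant.
data = pd.DataFrame({
    "interface": np.repeat(["baseline", "new"], n // 2),
    "spatial_ability": rng.normal(50, 10, n),
})
# Simulated completion times: faster with the new interface and with higher ability.
data["completion_time"] = (
    120
    - 10 * (data["interface"] == "new")
    - 0.5 * data["spatial_ability"]
    + rng.normal(0, 8, n)
)

# Interface effect adjusted for the individual-differences covariate.
model = smf.ols("completion_time ~ C(interface) + spatial_ability", data=data).fit()
print(model.summary().tables[1])
```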
- Gestalt psychology: the concept of insight comes from a holistic theory of perception and reasoning that stresses the similarity between perception and reasoning; perception is problem solving (e.g., selection); there is a continuum between perception and problem solving that cannot be divided into simple paths but has an underlying structure
- juggling elements on the screen until you get an insight
- making connections between what you find out now and what you already know (selective comparison)
q and a:
- gist: when I walk into a room, do I have a prototype of the room (e.g., a meeting room)? same for vis: "oh yes, there is a treemap, I expect this and that"; analogy relies on such templates, an interesting factor to test; also whether people use past experience as analogies in infovis
- linking interaction patterns with cognitive processes: is it possible to triangulate data sources with interactions and logs? also did other research comparing log files for differences and similarities
- how to design vis to support insight formation: Davidson's processes (encoding, combination, and comparison) should be supported; what people do may not be the same as the cognitive processes, and we don't see them in log files; it is hard to figure out what's going on in people's heads
- visual literacy: what if users were better trained in problem solving, even without vis? the vis field focuses on the vis, but problem solving is overlooked; this is really a problem of problem solving, not just of interpreting a scene
- measuring the complexity of a visualization via its shortest description length
- information theory: the size of the description is a measure of goodness; for a matrix, you can swap rows to get better compression (see the sketch below)
- future work: pixel-based techniques; need a 1:1, reversible, lossless mapping to the original data
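A crude illustration of the idea, using gzip output size as a stand-in for description length and a greedy row reordering; the matrix and the ordering heuristic are invented here and are not the speaker's actual technique:

```python
import gzip
import numpy as np

def compressed_size(matrix: np.ndarray) -> int:
    """gzip size of the raw matrix bytes, a rough proxy for description length."""
    return len(gzip.compress(matrix.astype(np.uint8).tobytes()))

rng = np.random.default_rng(1)
# Block-structured 0/1 matrix whose rows are then shuffled.
blocks = np.kron(np.eye(4, dtype=np.uint8), np.ones((8, 8), dtype=np.uint8))
shuffled = blocks[rng.permutation(blocks.shape[0])]

# Greedy reordering: repeatedly append the remaining row most similar to the last one.
remaining = list(range(shuffled.shape[0]))
order = [remaining.pop(0)]
while remaining:
    last = shuffled[order[-1]].astype(int)
    nxt = min(remaining, key=lambda r: int(np.abs(shuffled[r].astype(int) - last).sum()))
    order.append(nxt)
    remaining.remove(nxt)

print("shuffled :", compressed_size(shuffled), "bytes")
print("reordered:", compressed_size(shuffled[order]), "bytes")  # usually smaller
```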
q and a:
- categorical data dimensions can be reordered, quantitative ones cannot; a time series is a series of numbers, but you can lay it out as a spiral; the problem is how to encode and map the data into a matrix
- is adjusting the phase a kind of reordering? it only reveals seasonality, not more
- Kolmogorov complexity: we are dealing with something connected to perception, which is not necessarily correlated with gzip size; agree with the technique, but it is not based on any description language and needs to take perception into account; (answer) agree; with a universal description language you measure the number of bits needed to encode
- lossless encoding affects which is best (yes); what about lossy compression? it depends on whether the outliers are important; lossy comes closer to perception; losslessness is extremely important because the mapping has to be reversible
- this gets away from subjective vs. mathematical preference; all such measures are included in Kolmogorov complexity
- is this not fundamental mathematical complexity?
- it is hard to find expert users; proposed solution: cross-domain analysis, where people solve tasks from their own domain and from other domains, with a story about the data
- insight gatherers describe the data and its characteristics; insight hunters search actively, driven by prior knowledge (not related to domain expertise)
q and a:
- is this dependent on task/problem/domain, i.e., is it ok for abstract tasks only? it is hard to compare and argue about non-expert domains; it did not work in a scientific domain but business data works; automotive domain; maybe there are levels of expertise; maybe it depends on the task; we had different tasks, some very goal-directed, and there was not much difference
- do people behave the same way? it depends on personality
- analogy: use HCI, domain, or combined expertise; domain expertise uncovers interesting problems; look at the types of insights and analyze whether they could have occurred without domain expertise, as a way to qualify what expertise affects
- hunters vs. gatherers: is one better than the other at using vis? no distinction in the number or quality of insights
- the insight method finds out what users really learned from a vis, in contrast to benchmark tasks that prescribe what users should learn from it
- for some tasks, the benchmark tasks were done well but people did not get the corresponding insights; for others, the tasks were not done well but people could still find insights
- some things came up in the insight tests that were not tested in the benchmark (e.g., nodes that are unlike any other)
- subjectivity in each: task method, the choice of the tasks (ecological validity threat); insight method, coding the insights (repeatability threat)
- benchmark tasks show what the vis supports; the insight method shows what it promotes
- controlled ethnography: get rich insight into the process and quantify it; possible to add an intervention with large-scale displays
q and a:
- on the juxtaposition: it is more of a continuum than a juxtaposition; don't you need support before you can promote? that explains why there is a blank box
- give them tasks with think-aloud to solve, plus a free exploration phase, to combine both methodologies? sounds good, but be careful: the tasks bias the insights (you can no longer count insights, though you can still tell how to improve the vis) since you told them what to look for; the reverse order influences time
- simple abstract tasks can be supported but not promoted, so is it only safe to test them with benchmark tasks? ok...
- a little concerned about promoting tasks: the intelligence community does not adopt vis at a rapid pace; how far do you go to promote a task when the world and the data change with time (and it is not a new tool)? promoting is good, but does it deter me from doing other things? (answer) the granularity is low, so there is not much difference over time
- questions about whether people will even do supported tasks if they are not easy (e.g., finding topology)? we need to define what "easy" means; currently the task is "fast and accurate", and we need a different definition
- identified a set of visual tasks based on the task of finding conflicts in air traffic
- a vocabulary for people to design things with
q and a:
- one way to test this is gaze-tracking data, but aren't there lots of individual differences? the goal is not to validate the model but to understand what we ask users to accomplish; we don't really know what we ask users to do
- is a shorter scan path better? the point is for the designer to explain why different scanning tasks are harder
- why even use visualization if it can be completely automated? the strip is interactive and so partly automated (cells will be highlighted), but people still have to check with their eyes since the system may not be aware of orders
- moving from paper to electronic: there is often a reluctance to move to electronic strips because of the ritual of handing off information physically
- scenarios, tasks, and ground truth, 2009 added video
- synthetic? not computer generated, but with ground truth inserted into the data
- no people or license plates
q and a:
- we don't want to generate data, and use real data if possible; it's a problem that the data we use is nowhere near as noisy as real data; the goal is to push the field
- the datasets are small; will the tools work on real datasets? one of the metrics for reviewers is scalability
- some uncertainty: min/max cognitive load should not be taken alone; put the measures together: frustration plus load is not good, pleasure plus load is good
- strategies: how people approach the task; medical imaging uses eye tracking to see how experts read x-rays in order to teach students how to read them (e.g., concentrating on certain points on the x-ray)
- are there one or two measures that are more promising in the short term that we should focus on? no, use all of them; the correlation between all of them will give us the answer (triangulation); eye tracking is easier to analyze right now; if you have significant results you will see them in the physiological measures, if not, the signals are hard to see
- do participants have to wear a lot of gear? for now yes, but you have to start somewhere
- individual differences? for each person it is about 90% effective with training; it's a within-subject thing
- a multidimensional understanding: most measures are correlated, and some may be impacted by parasympathetic/sympathetic activity; it is going to take many years to tease these apart; as with interviews, don't just collect the data and look for something; start with a hypothesis to verify, especially one based on significant results from other studies
- brain activity: can we provide visual stimuli that systematically generate insights, and catch them when they happen? a tricky problem
- good for micro-level design issues; a surgical tool to use after people have done a usability evaluation
- mouse? talk? recruit different number of people
- difficult to do eye tracking on rich client interfaces
- data? qualitative or quantitative?
- heatmaps? great as a first, short-term view, but more interested in temporal changes; can we extract scanning strategies?
- (started talking about studies)
q and a:
- problems solved with statistics
- visual stimuli are so small; at arm's length you can see about a thumbnail's width (a couple of cm on the screen), so a buffer area is needed around each AOI
- strategies differ between spider and bar graphs; when does a scanpath differ? e.g., reading across a graph vs. scanning between dimensions: when are they different enough for the paths to count as different? hierarchical clustering is used to find strategies
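A minimal sketch of clustering scanpaths into strategies, assuming each path has already been reduced to a string of AOI labels; the edit-distance metric and the example paths are invented for illustration:

```python
from itertools import combinations

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two AOI-label sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Hypothetical scanpaths: each letter is an AOI (e.g., A = axis, L = legend, D = data).
scanpaths = ["ADDDLD", "ADDLDD", "LLLADD", "LLADDD", "ADLDLD"]

# Condensed pairwise distance vector, then average-linkage hierarchical clustering.
dists = [float(levenshtein(p, q)) for p, q in combinations(scanpaths, 2)]
tree = linkage(np.array(dists), method="average")
labels = fcluster(tree, t=2, criterion="maxclust")  # cut into 2 strategy clusters
print(dict(zip(scanpaths, labels)))
```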
- which chart types? it depends, very task dependent
- automatically selecting graph types depending on the task
- interactive vis? with an interactive UI, whether affordances or a simple menu, you record a movie and look at segments of it; would love to have a video-based analysis tool; not impossible
- look at eye tracking across the entire view of more complicated vis analysis tools
- break the vis into views and track what's going on when each view is used
- integrate eye tracking into Improvise
q and a:
- example of use? track the pattern of gazes between a filtering control and the filtered results, to see how the filtering is done and how the control is observed (multiple controls may have changed)
- anticipate user actions and prepare the system for them
- do you need eye tracking, or can you use mouse tracking? the two together will tell you a lot more, but we don't know yet how the two combine; mouse and eye may be separate
- eye-tracking data tends to be noisier than mouse data; the eye tends to lead mouse movement
- multiple levels of processing? basic scanning of the vis, then action
- what about the high-level stuff? too complicated? try to tune the design and spacing, e.g., more space between views
- how does this differ from a focus group? a focus group typically tracks a single salient feature that may not represent actual use; here we look at utility, e.g., the mapping and the effectiveness of the vis in general
- free-form exploration and conversation (participants add to each other, the enlightened explain to the confused) give a strong sense of utility and remove potential usability confounds
- caveats: the domain is general, so participants are easy to find; we got ecological validity and great users, but no usability findings
q and a:
- sounds like users inspecting a prototype; you need a formal setting / ways to conduct a focus group; how different is this in terms of quality? it is not a focus group; we also have interviews
- one problem found with some focus groups is a dominant person who can sway the rest of the people; did you find that, and how did you deal with it? we had that to some extent; 4 participants from 3 communities were interested in different things but still able to communicate with each other; we found out what should be supported and promoted, and who needs these tasks; the different interests balanced the group
- a good experience with focus groups is that they bring up controversy; did you see that? no, but we can see how that could happen
- realistic tasks vs. abstract tasks that capture the essence
- realistic tasks have ecological validity but are not canonical
- do both: mutually linked studies can recycle the mechanics, reusing the design and participants
q and a: - in both cases ColorLens was compared to sliders for tone mapping - does order matter? it may; we didn't control for that; the noise task came first and the photo task after, since participants needed to find particular photos - difference between the two studies? ColorLens is better in both studies, but the numbers were different
- existing metrics (time and error) are not something that dns can buy into
- metrics of interest: whether novice analysts become experts (users gaining knowledge, insight)
- learning: ask what they have learned using the vis via tests, or ask directly whether they learned
- typical design: pre-test, training, tasks (time and error, logging), post-test (subjective preferences); the focus tends to be on users solving tasks with the vis
- instead, spend time testing their knowledge, via a questionnaire or by having them solve a new task (quantitative score); the second time around, measure time and error and look at the change
- types of learning: (1) gaining interface knowledge; (2) gaining knowledge about the data; (3) increasing expertise in the domain
- look at evaluation from the perspective of a client
q and a: - look at the users themselves and what they consider metrics; look at how the vis users are themselves evaluated (put it in their terms); this glosses over how to test learning; focusing on testing at the end may give something more tangible
- a new dataset at VAST reduces the vis process to people's ability to use their own tools; aren't some people simply better at reasoning? the current way is similar, but it does not measure how fast they solve it; the tool has to be mature enough to run another dataset and solve the problem (otherwise you are reprogramming the tool while trying to solve it); the return on investment is not good
- beyond the interface, the system is designed by an engineer; there is a system model one has to learn in order to be proficient with the tool, and that is more than just the interface
- small jobs on Amazon Mechanical Turk
- automatic recruitment and payment; prevents people from doing the tasks multiple times
- pretty good gender balance, broader age pool
- turkers were faster at tasks, but there were some unexplained differences
- personality may differ (unusual population)
- design: there may be cheater (accuracy/quality) problems, maybe solved by a bonus system; build checks into the data collection to filter them out (see the sketch after this list)
- good for: simple interaction, short performance time (less drop-out), responses that are hard to fake
- demographics: cannot be trusted
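A minimal sketch of the "build checks into the data collection" idea: catch trials with known answers and an arbitrary accuracy threshold; the records and field names are invented for illustration:

```python
from collections import defaultdict

# Hypothetical response records; "is_check" marks catch trials with a known answer.
responses = [
    {"worker": "w1", "is_check": True,  "answer": "B", "truth": "B"},
    {"worker": "w1", "is_check": False, "answer": "A", "truth": None},
    {"worker": "w2", "is_check": True,  "answer": "C", "truth": "B"},
    {"worker": "w2", "is_check": False, "answer": "D", "truth": None},
]

def filter_workers(records, min_check_accuracy=0.8):
    """Keep only responses from workers who pass enough catch trials."""
    checks = defaultdict(list)
    for r in records:
        if r["is_check"]:
            checks[r["worker"]].append(r["answer"] == r["truth"])
    keep = {w for w, oks in checks.items() if sum(oks) / len(oks) >= min_check_accuracy}
    return [r for r in records if r["worker"] in keep]

print(filter_workers(responses))  # only w1's responses survive the filter
```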
q and a:
- on self-report: people are more willing to self-report face to face, so it may not be a population difference
- cheaper: a 6-10x cost saving, but the time saving matters even more
- human-subjects approval? yes, and it is pretty fast; will you publish the IRB protocol?
- faster, but what about accuracy? no difference
- more intensive tasks for more money? long tasks have high drop-out; we paid $2; we need to figure out how much to pay for long tasks
- in vis research the monitor, graphics card, etc. are critical and not controlled? that doesn't go away with any online study; make sure those factors are not critical for your study; most people have similar pixel counts and default colors
- do participants self-report their setup? no, but we should
- connectivity speed
- you can collect a lot of client info with JavaScript
- turkers are mostly US-based; in our case we are mostly looking for perceptual / cognitive effects, so it is pretty culturally neutral, but it is a consideration; a benefit is that you could get people from different countries and compare, though it is not clear whether IP logging is reliable
- how do you account for questions you don't know the answers to, with no ground truth? add some ground truth and use inter-subject agreement; it is task dependent
- for most vis, is there a better design? engage designers in evaluation
- analytical framework for vis criticism
- cost-effective evaluation of vis techniques: crowdsourced experiments
- replicated Cleveland's graphical perception studies from the '80s using Mechanical Turk
- higher variance but the same pattern holds; turkers were more accurate and faster
q and a:
- is there a cost to pay for this becoming mainstream? people do fun things for a while, but will it settle? it matters more in system design and defaults
- is the reward structure different for designers and infovis researchers? they can learn from each other; playfulness may be good
- is it easier on the web to propose a new design since you only need to draw it? data is a stakeholder in vis as much as the users are, so prototyping for vis is different from other interfaces
- criticism of a bad vis is not helpful? criticism here means constructive criticism: strengths and weaknesses
- on the other side, is Tufte too far toward the extreme, too designed? there are other books
- evaluate utility in real environment, field study
q and a:
- organizational obstacles: they don't like you connecting to the database (BofA), and going from prototype to integration takes a really long time since they need to reimplement? it depends; the database requires reimplementation, but you can connect to the database with read-only rights (lots of lunches!); we did not have access to the real database; we measured time and the tool was significantly faster, and BMW funded it even though we didn't learn why the tool is faster; integration takes time (> 3 years)
- do a smaller version and let it prove itself? that's one way; it takes lots of social skills
- used graduate students; to what extent are the results valid, and how would they change if we were to use domain experts?
- wanted to get both quantitative and qualitative results, but reviewers wanted a clearer definition
- 16 participants in total; no statistical significance, but it is hard to do in-depth work with more participants
- a smaller set of documents is not ecologically valid, but the document set had to be reduced to make the study feasible
- a big system means a big learning curve for participants; they overlooked functionality of the tool, and training takes too much time
- many investigative analyses use a focused scenario; it is difficult to develop strategies for more general tasks
q and a:
- the objection is to CS students, not to students in general: if you match the task subject with the students' school subjects, they are novices but still closer to domain experts? we used grad students for their ability to analyze
- based on visual analytics tools
q and a: - cognitive load: there is a main task with a secondary task, plus the flow of interaction; how much cognitive load is required to manipulate the flow of cognition? with so many factors in the interface it is very complicated; could you remove the main task, focus on cognitive load, and do the experiment in a controlled way? a good suggestion to follow, regarding flow of cognition; most interfaces focus on rich interaction but need to maintain the flow of cognition