Methodology for Distributed Usability Evaluation in Collaborative Virtual Environments
Jolanda G. Tromp
Communications Research Group, Department of Computer Science, University of Nottingham, University Park, Nottingham, NG7 2RD, U.K.
Tel: 0115 951 4226
Fax: 0115 951 4254
This paper describes the methodological implications for evaluation research on collaborative virtual environments. The information is based on experiences drawn from a longitudinal study of MASSIVE-1, a virtual reality tele-conferencing system. Benefits and drawbacks of the constraints on methodology imposed by the developmental stage of collaborative virtual environment technology are identified and discussed.
Keywords: collaborative virtual environments, methodology, usability, evaluation, presence, multiple tasks, multiple users.
1. Introduction
The need for the development of a special tool kit for Virtual Reality (VR) evaluation has been expressed by many researchers, while at the same time there has been a tendency to ignore or minimise the evaluation of VR applications (Durlach & Mavor, 1995). The work presented in this paper is based on the hypothesis that there is a need for the development of a methodological approach to the evaluation of VR, and more specifically of Collaborative Virtual Environments (CVEs). Therefore the author identifies the position of CVE evaluation in the empirical cycle of scientific inquiry. From this methodological positioning appropriate research methods for CVE evaluation are derived. Implications are drawn about the constraints within which this particular research needs to take place. We integrate socio-technological phenomena in our evaluation by using methods taken from sociological, psychological and HCI research disciplines. The work presented in this paper is based on lessons learned from a longitudinal evaluation of a CVE called MASSIVE-1.
In the next section CVE evaluation is placed within the empirical cycle of scientific inquiry. Section 3 discusses the constraints that CVE technology imposes on the methodology of CVE evaluation research. Section 4 presents the experiences with doing evaluations in MASSIVE-1. From these experiences recommendations are made for similar research. Finally, in the conclusions the merits and drawbacks of the methodological approach are summarised.
2. Positioning of CVE evaluation within Empirical Cycle of Scientific Inquiry
This section describes the position of CVE evaluation research within the empirical cycle of scientific inquiry. Development of CVEs is still in the early phases. CVE technology is a relatively young, multidisciplinary science (Kalawsky, 1993). Like any new science, it aims to discover new concepts to develop, such as presence in a virtual environment, virtual embodiments, and working together in a virtual space. These concepts become constructs in a model of human behaviour and this human behaviour is explored in order to come to operational definitions for further empirical research.
The empirical cycle consists of 5 phases (Groot, 1969). Phase 1 is the 'Observation' phase. It consists of the collection and grouping of empirical materials, and the (tentative) formation of hypotheses. Phase 2 is the 'Induction' phase and consists of the formulation of hypotheses. Phase 3 is the 'Deduction' phase and consists of the derivation of specific consequences from the hypotheses, in the form of testable predictions. Phase 4 is the 'Testing' of the hypotheses against new empirical materials, by way of checking whether or not the predictions are fulfilled. Phase 5 is the final 'Evaluation' of the outcome of the testing procedure with respect to the hypotheses or theories stated, as well as with a view to subsequent, continued, or related investigations.
2.1 From Abstract Concept to Measurable Behaviour
Operational definitions are essential for the measurement of human behaviour. An operational definition specifies how to measure a construct, so that it becomes a variable in a theory which can be assigned values, such as high, medium, and low. This theory can then be tested by performing experiments (Judd, Smith, & Kidder, 1991). For instance, we might want to find out if a high degree of presence in a virtual environment has a positive effect on performance. In order to do this we first need to be able to describe what we understand by the constructs presence and performance. Based on this description we can define operational definitions for these constructs. What does it mean to experience presence in a virtual environment? How can we measure presence? Before we can measure whether a construct such as presence is high, medium, or low, we first need to specify what we understand by presence; we need to explore the construct.
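As a minimal sketch of what an operational definition can look like in practice, the Python fragment below turns one subject's answers to a set of presence statements into one of the levels high, medium, or low. The five-point scale, the function name `presence_level`, and the cut-off values are assumptions made purely for illustration; they are not taken from the MASSIVE-1 study.

```python
from statistics import mean


def presence_level(item_scores, low_cutoff=2.5, high_cutoff=3.5):
    """Map one subject's Likert answers (1-5) onto 'low', 'medium' or 'high'.

    The cut-offs are hypothetical; a real study would have to justify them.
    """
    if not item_scores:
        raise ValueError("at least one item score is required")
    if any(not 1 <= s <= 5 for s in item_scores):
        raise ValueError("scores must lie on the 1-5 scale")
    score = mean(item_scores)  # average over all presence statements
    if score < low_cutoff:
        return "low"
    if score < high_cutoff:
        return "medium"
    return "high"
```

Once a construct has been given a rule of this kind, it can serve as a variable in a testable prediction, for example that subjects in the high group complete a task faster than subjects in the low group.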
2.2 CVE Developmental Phase Needs Exploratory Studies
The development and demands of social behaviour in CVE will only become clear by people using the software, and by exploring the needs of these users. Examples of constructs which have been identified so far are presence, and co-presence, immersion, awareness, spatial phenomena, wayfinding, subjective views, believability, collaborative work, etc. The concepts co-presence and collaborative work are already known from CSCW research. Likewise, the concepts presence and telepresence are already known from research in the field of robotics.
Many of the constructs which need to be addressed to fully exploit the capabilities of CVE software and hardware are either unknown, or unexplored and untested. Thus, because CVE development is still in its early stages, we often need to employ an exploratory approach to identify the aspects of human behaviour which affect performance and satisfaction. Psychological exploratory techniques and sociological ethnographic techniques are therefore particularly suitable at this phase of the empirical cycle of inquiry.
Ethnographic CVE investigation entails extended CVE participant observation with and within the CVE. The rationale is that a prolonged period of intense immersion in a culture best enables the ethnographer to experience the world of his or her subjects, and hence to grasp the significance of their language and actions. For CVE development and evaluation this means that ethnographic inquiry is aimed at trying to interpret the hidden assumptions of the designers, their pioneering behaviours, their experiences and wishes with the CVE technology, by describing the field, and registering the phenomena. By focusing on uncovering these hidden assumptions there is a lot of design information to be found.
Psychological exploratory investigations are not primarily aimed at describing phenomena; instead, investigations of the field are made to articulate and select possible hypotheses. Phenomena are recorded and ordered, in order to establish relationships between them and to come to would-be hypotheses. For CVE explorations this means that the researcher lets the subjects 'speak for themselves', in order to gather as much concrete data as possible. The concrete data is analysed to find the elements which are causing or related to the phenomena observed. This will then help the CVE researcher to decide what data are to be used, what is to be measured, and roughly what relationships are to be studied in future evaluations of performance and satisfaction with CVEs.
3. Constraints of Method for CVE evaluations
This section describes the constraints on the methodology of CVE evaluations that are caused by the peculiarities of CVE technology and its developmental phase. First the types of usability testing for CVEs are identified; secondly the goal of the software is described. Thirdly, a number of threats to the validity and reliability of the research setting and findings are identified. Lastly, recommendations are made to come to solutions for these problems.
3.1 Two Types of Usability Testing
CVE usability testing is similar to HCI usability testing in that evaluation tasks can be roughly divided into two groups: 1) evaluation of system characteristics and performance and 2) observation and measurement of human behaviour and performance with the application. This is where the similarity ends however. A major difference for CVE evaluation is that the evaluation task of measuring human performance can take place on two levels. We are interested in human behaviour and performance with the application. At the same time, we are also interested in human behaviour and performance inside the CVE, answering questions about perception of 3D computer generated space, navigation, presence, and awareness, etc. Thus, in general observations and experiments can and need to be performed both from outside the CVE and from inside the CVE.
3.2 Two goals of CVE software
Another way in which evaluation of CVEs is different from traditional HCI evaluations is that the goal of the application is two-fold. The general goal of CVEs is to create a place for people to interact, while the specific goal always involves managing multiple collaborative tasks. Both these goals have implications for the process of scientific inquiry, because in order to establish whether users can achieve their goal with the application, we need to understand these goals, and be able to measure the success the users have with reaching these goals.
General Goal: Creating a Sense of Presence
The general goal of CVEs is to create a place for people to interact. This is a goal all CVEs have in common. CVEs must create a 3 dimensional space for their users, because of the intention of providing a place for users to manage their activities (Harrison & Dourish, 1996). It seems common sense to assume that the users need to feel present in this space in order to make sense of this 3 dimensional environment. For this reason presence and telepresence have received considerable attention in the VR community (Tromp, 1995). CVEs also need to create a sense of co-presence, and a sense of being present in two or more environments at the same time. One of the consequences of presence in CVEs is that users of a typical CVE may have more than one embodiment in more than one CVE, and they always have to share their attentional resources between their real body in their real environment, and their virtual embodiment in the CVE.
Specific Goal: Multiple Tasks
The specific goal of CVEs always involves multiple tasks, not one single task. Users are working in a shared space in which they have to coordinate multiple activities. Because of the collaborative nature of CVEs, it is not sufficient to test for usability of one task. Instead, usability testing needs to address how multiple users are handling multiple, simultaneous tasks, and additionally it needs to look at their satisfaction and performance on executing and switching between these tasks. Which particular tasks these are depends on the specific goal of the application. For instance, in a CVE intended for virtual conferencing, the tasks become introducing oneself, establishing relationships, running a meeting, writing on the blackboard, observing behaviour of other participants, distributing information amongst the participants, etc.
3.3. Threats to Validity and Reliability of the Measures
Because CVEs are typically evaluated while they are still being developed, the product under evaluation is often a prototype or demonstrator. Boundaries of the existing technology are pushed to create new ways of doing things, and as a result even more new things become possible. One of the side-effects of doing evaluations on a prototype which is still under development is that there are no manuals, no fully functioning application, and few opportunities for using properly representative subjects from the population of intended users for the usability studies.
Thus it can occur that the researcher has to use the subjects s/he can get, and often these subjects are the developers of the software and their nearest colleagues. One of the consequences of this constraint is that it is not possible to use a random sample of subjects. In order to generalise the findings from an experiment with a small group of subjects to a larger population, the sample of subjects must be a random choice from the set of representative members of that larger population. Obviously, developers of CVEs are not representative members of the future group of intended users of that CVE, and they have not been randomly chosen. Actually, these developers are a highly specific selection of subjects, and the data gathered from their behaviour will have to be interpreted with this knowledge in mind.
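The contrast between a random sample and a sample of whoever happens to be available can be sketched in a few lines of Python. The population list and all names below are hypothetical; the sketch only illustrates that, in a simple random sample, every member of the intended user population has the same chance of being selected.

```python
import random


def draw_random_sample(population, sample_size, seed=None):
    """Draw a simple random sample without replacement."""
    if sample_size > len(population):
        raise ValueError("sample cannot be larger than the population")
    rng = random.Random(seed)  # the seed only makes the draw repeatable
    return rng.sample(population, sample_size)


# Hypothetical population of intended end users (not the developers).
intended_users = [f"user_{i:03d}" for i in range(200)]
subjects = draw_random_sample(intended_users, 10, seed=1)
```

A group of developers and their colleagues cannot be produced by such a draw, which is why findings obtained from them should not be generalised to the intended users without qualification.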
Physical Distribution of Users
Another characteristic of CVEs is that the users are geographically distributed. A CVE allows multiple users to interact simultaneously within the CVE in real-time, regardless of the physical location of these users. This means that if the CVE is tested for usability, it has to be tested in its distributed functioning for multiple users, as well as for the direct interface offered to the single user. One of the implications of the distributed character of the CVE application and its users, is that it becomes more difficult to conduct proper controlled experiments.
Threats to Internal Validity due to Distributed Setting
The physical distribution of the users becomes a constraint on the degree to which the evaluation researcher can control the experimental setting. A typical concern for experimental set-ups is that they should be as similar as possible for each subject in each condition of the study. In order to be able to claim with confidence that the observed difference in behaviour is attributable to a specific different implementation of a construct in the CVE, the researcher needs to rule out any influences on the user other than the desired ones. This means that the environments of the users should be as similar as possible, the researcher has to behave as similarly as possible with each subject in the study, questionnaires should be answered at times that are as similar as possible, subjects within one group should be as similar as possible, subjects within one group should receive similar treatment, etc. When conducting distributed usability studies this becomes a complicated task, because the researcher cannot be in all places simultaneously in order to guarantee a similar treatment of all subjects.
Independent Variable can not be Manipulated
Another consequence of evaluating prototypes is that it is often not feasible within the time and effort available to create two (or more) different situations for an experiment. This means that the independent variables cannot always be manipulated. For instance, a researcher may have found that having a personal shadow in a CVE may assist orientation and wayfinding. In order to find out what kind of shadow is most effective the researcher needs at least two versions of the CVE, each one with a different kind of shadow, and preferably one CVE without any shadows at all, for the control group. These three versions of the CVE constitute the manipulation of the variable 'shadow'. The group that does best on the orientation and wayfinding tasks points to the CVE with the best shadow. Obviously this is an informative, but labour intensive way of gathering knowledge, which may not always be possible. Thus, sometimes the researcher is limited in the way of researching possible CVE design solutions, to such an extent that an experimental design is impossible.
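The shadow example can be sketched as follows, assuming three versions of the CVE exist. The condition labels, the function names, and the idea of a single numeric wayfinding score per subject are all assumptions introduced for illustration; the sketch shows only the random assignment of subjects to conditions and the comparison of group means.

```python
import random
from statistics import mean

# Hypothetical labels for the three versions of the CVE.
CONDITIONS = ("no_shadow", "shadow_a", "shadow_b")


def assign_conditions(subject_ids, seed=None):
    """Randomly spread subjects over the conditions in near-equal groups."""
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    return {sid: CONDITIONS[i % len(CONDITIONS)]
            for i, sid in enumerate(shuffled)}


def mean_score_per_condition(assignment, scores):
    """Average a task score (e.g. wayfinding) within each condition."""
    grouped = {c: [] for c in CONDITIONS}
    for sid, condition in assignment.items():
        grouped[condition].append(scores[sid])
    # Conditions without any subjects are left out of the result.
    return {c: mean(v) for c, v in grouped.items() if v}
```

A difference between group means is only a first indication; with groups as small as those available for prototype studies, a statistical test would be needed before attributing the difference to the shadow.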
Work Within Constraints
There are numerous threats to the validity and generalisability of research findings of human behaviour. The threats that are particularly hazardous for CVE evaluation, because of the nature of CVEs, have been listed above. The solution to these problems is to regard them as constraints, not to abandon the research. We employ the best research design we can use when we recognise these limitations and attempt to overcome them as ably as we can. A rich source of information becomes available by exploring human behaviour with and within CVEs, which is an important component for phases 2 and 3 of the empirical cycle of scientific inquiry and subsequently for future empirical research of CVEs.
4. Longitudinal Exploratory Study of a CVE called MASSIVE
The results of the study described here are part of a longitudinal survey of a CVE called MASSIVE-1. The results are described in terms of methodology of evaluation issues, as relevant to the constraints mentioned above. Readers interested in a more detailed description of the findings are referred to Tromp & Snowdon (1997).
4.1 Description of MASSIVE-1
MASSIVE-1 (Greenhalgh & Benford, 1995) is a real-time, distributed 3D graphical desktop VR system. It provides facilities to support geographically isolated users at graphical workstations with a tele-conferencing system for interaction and cooperation via text, audio and graphics media, using the Internet. The environment consists of several rooms ('worlds') and portals which connect these worlds, through which participants can move by pointing with their mouse. Each user has a graphical virtual embodiment, and can move freely and independently through the CVE. Users can communicate by talking into their microphone, so that all participants in the same world can hear each other, and additionally by typing messages in the textual window. Several worlds contain a white-board on which users can write, also by typing in the textual window.
4.2 Description of ITW Longitudinal Study
During the Inhabiting The Web (ITW) project 20 official virtual meetings took place, using MASSIVE-1 over Super JANET, the UK's high-speed academic network. The meetings took place on a fortnightly basis over the period of a year. The meetings involved between 3 and 10 simultaneous participants connecting from British Telecommunications at Ipswich, and 5 UK universities: Nottingham, Lancaster, Manchester, Leeds, and UCL. The subjects in the trials were 10 programmers and developers of CVE technology and computer science students. The early trials involved getting the software to work at each site, and getting all subjects to see and hear each other in the CVE. During this time network activity measurements, video recordings, and ethnographic observations were made. After the software was stable enough, twelve sessions were dedicated to usability studies; in addition to exploring CVE related constructs, network activity was again measured, and video recordings and ethnographic observations were made. Each trial took approximately 1 hour.
A replicated time-series design was used, in which the repetitive collection of data involved constructs such as satisfaction, experience, ease of use, group involvement and awareness. In addition each session was used to explore new constructs such as dealing with multiple virtual embodiments, distributed awareness, switching between the virtual environment and the real environment. In order to measure network activity, and usability issues, the subjects performed such tasks as team word games, hide and seek in a maze, team exploration in a noisy environment, a dance contest, project meetings, and presentations.
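The repeated part of such a design can be sketched as a small aggregation step: the same constructs are measured in every session, and the per-session means form one time series per construct. The tuple layout `(session, construct, score)` and the function name `construct_trends` are assumptions for illustration and do not describe the data format actually used in the ITW trials.

```python
from collections import defaultdict
from statistics import mean


def construct_trends(responses):
    """Group repeated measurements into one time series per construct.

    `responses` is an iterable of (session, construct, score) tuples.
    Returns {construct: [(session, mean score), ...]} ordered by session.
    """
    buckets = defaultdict(list)
    for session, construct, score in responses:
        buckets[(construct, session)].append(score)

    trends = defaultdict(list)
    # Sorting the (construct, session) keys keeps sessions in order.
    for construct, session in sorted(buckets):
        trends[construct].append((session, mean(buckets[(construct, session)])))
    return dict(trends)
```

Plotting or tabulating each series then shows whether a construct such as satisfaction changes over the course of the sessions.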
Data Collection Procedures
Human behaviour and experience were captured using questionnaires, and video recordings were made of each trial session. Network activity logs were made at each site for each trial and sent by email to the coordinator of the project. The analysis of the network activity data, together with the video recordings and questionnaire data, has provided an interesting and useful insight into human behaviour in CVEs (Tromp, 1997), which will be used to predict network load for large scale CVEs (Greenhalgh, 1997).
The questionnaires consisted of attitude statements with Likert-scales, and open-ended questions. These questionnaires were made available on the WEB, as forms, so that the subjects could answer and send them back through the WEB as well. It took on average about 30 minutes to answer each questionnaire. In addition, the fax was used for sending drawings, in such tasks where drawing a map of the virtual space was asked of the subjects. Interviews were used as a follow-up for some interesting behaviours observed in the CVE. The project WEB pages were also used to create a source for tutorials of special features of the software, for task descriptions in the trials, and for keeping statistics and results of the trials.
4.3 Problems and Solutions Encountered
Having described the nature of the constraints of CVE evaluations, we now turn to the results of the experiences with doing evaluation studies within these constraints. Problems with the constraints are described, and below each description recommendations are made on how to overcome these problems.
Single Researcher, Many Sites
Because of the distributed nature of the CVE subjects it is difficult to ensure that all subjects get a similar treatment. For instance, problems may occur with the software of the interface, for which the subject will ask help from others in the workplace who are unknown to the researcher. Or, subjects may have misunderstood part of the task because of a distraction at their workplace, or might simply be helped by somebody unknown to the researcher in their workplace. In the first place it is ideal to find other researchers who can assist from the site of the subjects in assuring that all subjects are in similar conditions during the trial. To ensure that all assisting researchers give the same treatment, help and explanation to the subjects, a description of the task can be sent out, which is then used by the assisting researchers. The assistant researchers can also perform an important task in trial cases where the subjects need to be debriefed after the experiment.
Waiting time occurs naturally, because of the distributed nature of the CVE participants, and it has spontaneously been used for informal interaction, which also proved beneficial to team building. So, secondly, it is suggested to build in an official 'waiting-time' for each trial session, to make sure all late-comers have arrived before the actual task has started. In addition it is recommended to ask the subjects to describe their physical settings, in order to establish whether they differ in ways that matter. Also, make some space in the questionnaire to ask routinely whether any help was received, or what interruptions, or confusions were encountered, etc.
Motivation of subjects to attend
It has not always proved easy to make sure that subjects who had committed to be present at a trial actually turned up. Subjects may forget or may be ill; in addition, subjects may try to turn up but be unable to because of network problems which are out of their control. Because of the small number of participants, missing out on a few subjects may be disastrous for the proper conduct of the experiment. It has proved effective to use team-building games and competitive games from management training to make the participants in the trials feel more involved. Attendance rates and subjective enjoyment of the task have been observed to improve after 2 game sessions. It is recommended to make sure the subjects are happy being subjects, by making them feel an essential part of the trials, and by asking them feedback questions about their experiences and frustrations with being a subject. Also rewards could be used, e.g. a chocolate bar was promised to the winner of one of the competitive games, which increased morale notably.
In general team building games were found effective in making the subjects more committed to attending meetings, and to providing a back-up person in case of illness who understands enough of the software and interface to be a similar subject. It is also highly recommended to urge the subjects to establish a routine by giving them an estimate of the time the experiment takes, and keeping this time roughly similar for each session, so that the subjects can plan to arrive in plenty of time for the trial, and schedule in enough time for the total task.
Motivation of subjects to return questionnaire
Because of the distributed nature of the subjects in the sample, it is difficult to make them all understand that it is important to return the questionnaire directly after the trial has ended. In any questionnaire survey it is a well known problem to get a proper response rate, but in the case of very few, distributed, subjects it becomes incredibly important to get as many returned questionnaires as possible.
It is recommended to indicate the time it takes to answer the questionnaire. Remind the subjects to answer the questionnaire right after the end of the trial. It has also been found useful to weed out any irritations or frustrations with being a subject by asking the users feedback questions about answering the questionnaires. At the bottom of each questionnaire a space was reserved for subjects to add anything they wanted, and subjects were encouraged to use this space. Quite often this option was used, and several times important additional information was gathered.
In general the effort to answer a questionnaire should be kept as low as possible. The need to keep this in mind was illustrated by the findings from the feedback questionnaires. Subjects felt they needed short questionnaires, with easy to answer questions, because they did not want to spend a lot of time on the activity. They also wanted clear introductions of the topics of the questions, because the questions sometimes addressed topics which they had never thought about before. On the whole, the general consensus was that it was interesting and even a learning experience to answer the questionnaires.
Explanation of task important
There is rather a heavy cognitive load on the subjects while entering the virtual space. They have to coordinate software and hardware to work properly, to interact with the other subjects on an informal basis, and in addition the subjects might need to listen to a description of their task in the CVE. It has been observed that subjects have not understood the task properly because they were distracted, or because their connection gave them problems, or because the task was so complicated that they simply could not understand what to do after only one explanation.
It has proved useful to have a description of the task available on the WEB, and to encourage the subjects to read it while the experimenter reads the task aloud to them within the CVE. It is also useful to ask the subjects whether they have any questions, because it is still difficult to observe confusion on virtual faces.
Training & Documentation of Interface
Often the CVE application is in a very early state, so that the researcher and coordinator need to create their own manuals and training documentation for the software. Especially when new features are introduced or old features constantly adapted during the design process it becomes important to make sure all participants in principle have the same level of knowledge and skills.
Providing training at the start of some of the trial sessions has proved to be a successful and highly appreciated activity from the subjects' point of view. Documentation can be provided via WEB pages, and seems to be appreciated in two forms: i) short overviews of all commands, which can be used as references while being engaged in the task, ii) longer cook-book like explanations of commands and pictures of their results which can be studied any time.
5. Conclusions
In this final section a summary is made of the types of data which have been gathered by doing exploratory studies of a CVE. First the benefits of the exploratory approach are listed. Secondly the drawbacks are discussed, and it is argued the drawbacks can be turned into an advantage for the data gathering process when looked at in a different light. Lastly general conclusions about the methodological positioning approach described in this paper are discussed and based on these conclusions, recommendations for future research are made.
Direct benefits observed with the ITW longitudinal study are that the 2 types of usability testing, i.e. of the network and of human behaviour, go hand in hand very well. Combining network data, video data, and questionnaire data can lead to a rich set of interesting insights into human behaviour with CVEs. The types of data which have been found can be summarised as:
User interface misunderstandings identified: Some options in the user interface were not clear or misunderstood in terms of their functionality. The questionnaires proved insightful in identifying those issues.
User interface improvements identified: Many suggestions from the subjects as to what interface issues should be addressed first and foremost for improvements have been identified successfully. The tasks and questionnaires which were focused on one particular aspect of the CVE proved insightful in identifying these issues.
Insight in human behaviour with CVE: Information has been gathered about the behaviour of subjects caused by the introduction of the new CVE technology, such as how subjects can and must work together to solve problems even though they are in geographically distributed places. The experiences of the coordinator and the feedback reports from the subjects, observations and diaries provided the insightful information.
Insight in human behaviour in CVE: Using focused tasks for the usability studies has proved effective for recording and eliciting new constructs to investigate concerning the behaviour of users in CVEs. Questionnaires, video recordings and network activity logs provided the insightful information.
Insight in methodology of CVE evaluation: Information has been gathered on the methodology of setting up and conducting distributed usability studies. The experiences of the coordinator and the experimenter, the results from the questionnaires, and the feedback from the subjects provided the insightful information.
The distributed nature of CVE usability testing creates the largest difficulty for doing empirical research. Even though many problems caused by this drawback have been identified and solutions have been found, by far the best solution is to invite other researchers who are locals at the geographically remote sites to assist in the studies. This solution could be regarded as an advantage, because not only can the researchers share the work-load of running the trials, the assistant researchers can use the trial sessions to gather data for their own purposes, or take responsibility for designing an experiment.
Another drawback is the fact that the subject group is highly specific. Developers of CVEs are not typical users of the finished product. It is therefore dangerous to assume that the findings from usability studies with the developers will be sufficient to develop a finished product. However, because of the early developmental stage of the CVE technology, it seems rather useful to elicit opinions of the CVE developers. These people are pioneers of the technology, they have thought about it and worked on it for many hours, if not years, and as such they can provide us with an insight in the technology that no other type of subject can give us. The kind of information CVE developers can give about CVEs is biased, but in a useful and interesting way, and a rich set of data can be gathered from them.
5.3 General Conclusions
Overall it can be concluded that it is quite possible to work within the constraints of the CVE technology while still gathering useful and interesting data which answers questions on many levels of scientific inquiry. Seeming constraints can be used to the advantage of the researcher and the research issues, if the approach is broad-minded and multidisciplinary.
By capturing network activity, video recordings of human behaviour with the CVE software, and within the CVE, and combining the questionnaire data with observational techniques from sociology and psychology a rich set of data can be gathered.
As a final recommendation the author wishes to urge researchers of human behaviour in CVEs to pool the results of their respective explorations of constructs systematically, so that the next phases of the empirical cycle can be entered with an informed, large set of data.
This longitudinal study has been funded by BT/JISC's 'Inhabiting The Web' project. The methods have been further developed within the ACTS COVEN project. Ideas presented have been discussed with colleagues through the mailing list on this topic (URL: www.crg.cs.nott.ac.uk/~jgt/ic3d2.html) and with members of the USINACTS project.