WEBVTT Kind: captions; language: en-us NOTE Accuracy: 81% (HIGH) 00:00:02.100 --> 00:00:10.800 So today is psychometrics and we'll see how far we get. First of all, this is what I want to talk about. 00:00:10.800 --> 00:00:18.200 We will have an intro. This is for two lectures, so we'll see how we go and where we get stuck. 00:00:18.200 --> 00:00:25.700 So first, I introduce psychometrics, what it is, why we need it. I will explain a little bit more about 00:00:25.700 --> 00:00:31.850 what it is exactly and the COSMIN framework, which is a psychometric framework, NOTE Accuracy: 74% (MEDIUM) 00:00:31.850 --> 00:00:39.400 the international consensus-based framework. Then I'll give you some examples about what 00:00:39.400 --> 00:00:43.900 psychometric properties we are talking about, and I give you examples and we discuss them. I talk 00:00:43.900 --> 00:00:51.500 about the study methodological quality checklist, which is the COSMIN checklist. That has got to do with, 00:00:51.500 --> 00:00:58.700 it's like a CAT, a critical appraisal tool, but then for studies that are psychometric, 00:00:58.700 --> 00:01:01.800 psychometric studies that discuss these NOTE Accuracy: 80% (HIGH) 00:01:01.800 --> 00:01:08.200 psychometric properties of a measure. I explain a little bit what a psychometric review is. 00:01:08.200 --> 00:01:15.100 We talked about diagnostic reviews, systematic reviews, but what is a psychometric review? And then 00:01:15.100 --> 00:01:19.700 I give you a little bit like, okay, now we know all of this, so how to select a screen or an 00:01:19.700 --> 00:01:25.900 assessment? Now all of this doesn't fit into one lecture. If it does fit in one lecture, 00:01:25.900 --> 00:01:31.800 I've talked too fast. So let's just start with an introduction. Now if we talk about NOTE Accuracy: 86% (HIGH) 00:01:31.800 --> 00:01:37.500 measurements then you've got screenings.
We did that yesterday and the lecture before that, 00:01:37.500 --> 00:01:44.200 and that can be at school entry, or you want to screen kids for dyslexia or literacy, 00:01:44.200 --> 00:01:51.600 you name it. There are screens and they are only meant to identify those at risk. And if a 00:01:51.600 --> 00:01:58.400 child or an adult is at risk, you want to refer them for further assessment. Then you've got... if we 00:01:58.400 --> 00:02:01.300 talk about assessment, and this is just a screen. NOTE Accuracy: 81% (HIGH) 00:02:01.300 --> 00:02:08.000 Screening is, again, to detect persons at risk, and it is the first step in decision making. But I 00:02:08.000 --> 00:02:13.300 just want to go back a little bit, I went too quick. So we've got ourselves the screen and this is all 00:02:13.300 --> 00:02:19.400 assessment. First, I briefly repeat the screen. And that was to detect 00:02:19.400 --> 00:02:27.200 persons at risk, the first step of decision making, and if you fail, you need further assessment. 00:02:27.200 --> 00:02:31.750 Now, we discussed how you select a screen. And we said, NOTE Accuracy: 90% (HIGH) 00:02:31.750 --> 00:02:38.300 find yourself a screen in the literature. You want good study quality. That means that the article 00:02:38.300 --> 00:02:45.100 that describes your screen should be of good methodological quality. And the screen should have good 00:02:45.100 --> 00:02:51.600 diagnostic performance. And then we talk about sensitivity, specificity and the whole thing. If 00:02:51.600 --> 00:02:58.600 either of those two is poor, the screen is out. But if both are okay, then you can consider implementing 00:02:58.600 --> 00:03:01.700 it in clinics and research. NOTE Accuracy: 78% (HIGH) 00:03:01.700 --> 00:03:07.800 We talked about critical appraisal tools. This is a short one from Cochrane for screens, and there 00:03:07.800 --> 00:03:15.500 were items on validity, items on generalizability and on reliability.
Now, the whole diagnostic 00:03:15.500 --> 00:03:21.500 performance is based on this crosstab. So you've got yourself your reference test, your gold 00:03:21.500 --> 00:03:27.600 standard, and you compare your screen with that reference test. If you do that, you end up with all 00:03:27.600 --> 00:03:31.800 these true positives and true negatives. And you hope that almost NOTE Accuracy: 81% (HIGH) 00:03:31.800 --> 00:03:37.000 all of your participants are in the green cells, but that's not life. So you need to determine 00:03:37.000 --> 00:03:43.550 diagnostic performance. And there are two other important characteristics of a screen. One is 00:03:43.550 --> 00:03:49.500 feasibility, referring to time and complexity. A screen should be simple. It should be 00:03:49.500 --> 00:03:56.800 quick. And it should be reliable. So that means if I do the screen today in my classroom and 00:03:56.800 --> 00:04:01.750 I repeat it tomorrow, I want similar results. So we're talking about intra-rater and inter-rater NOTE Accuracy: 81% (HIGH) 00:04:01.750 --> 00:04:09.800 reliability and test-retest reliability. All of that is important when we select our screen. 00:04:09.800 --> 00:04:17.200 You consider a reference test as a gold standard actually, and that 00:04:17.200 --> 00:04:23.900 brings us to COSMIN, to today. That is actually criterion validity. Criterion validity is one of the 00:04:23.900 --> 00:04:31.750 forms of validity and it refers to the degree to which scores of a health-related NOTE Accuracy: 85% (HIGH) 00:04:31.750 --> 00:04:38.150 instrument or educational instrument are an adequate reflection of 00:04:38.150 --> 00:04:44.200 a gold standard. And you can see that term, HR-PRO, you will see that term more often. 00:04:44.200 --> 00:04:50.400 It stands for health-related patient-reported outcome, but you can replace that by educational 00:04:50.400 --> 00:04:57.500 outcome. It's all the same, but it is criterion validity, the gold standard.
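NOTE The crosstab comparison of a screen against a gold-standard reference test described above can be sketched in a few lines of Python. This is only an illustrative sketch: the `Crosstab` class, the function names and the counts are made up for the example and do not come from the lecture slides. It assumes, as the lecture does, that the reference test itself is error-free.

```python
from dataclasses import dataclass


@dataclass
class Crosstab:
    """2x2 table of screen result versus gold-standard result (hypothetical helper)."""
    true_pos: int   # screen positive, gold standard positive
    false_pos: int  # screen positive, gold standard negative
    false_neg: int  # screen negative, gold standard positive
    true_neg: int   # screen negative, gold standard negative


def sensitivity(t: Crosstab) -> float:
    """Share of truly at-risk participants that the screen flags."""
    return t.true_pos / (t.true_pos + t.false_neg)


def specificity(t: Crosstab) -> float:
    """Share of not-at-risk participants that the screen clears."""
    return t.true_neg / (t.true_neg + t.false_pos)


# Made-up counts, for illustration only.
example = Crosstab(true_pos=40, false_pos=10, false_neg=10, true_neg=140)
print(f"sensitivity = {sensitivity(example):.2f}")
print(f"specificity = {specificity(example):.2f}")
```

The "green cells" in the lecture correspond to `true_pos` and `true_neg`; the more participants fall outside them, the lower these two proportions become.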
You compare a measure or 00:04:57.500 --> 00:04:59.750 screen with the gold standard. NOTE Accuracy: 90% (HIGH) 00:04:59.750 --> 00:05:07.000 And your gold standard has no error: sensitivity a hundred percent, 00:05:07.000 --> 00:05:12.900 specificity a hundred percent. No false positives or false negatives. Of course, that is actually not true. 00:05:12.900 --> 00:05:18.800 We're making life a little bit more beautiful than it is. But this is how you deal with the gold standard. 00:05:18.800 --> 00:05:26.500 When we determine diagnostic performance, we actually assume that there is no error 00:05:26.500 --> 00:05:29.900 in our gold standard, and that is how you determine the diagnostic performance NOTE Accuracy: 91% (HIGH) 00:05:29.900 --> 00:05:38.200 using that criterion, that gold standard. Now, back to this one. This is assessment, and there are 00:05:38.200 --> 00:05:45.500 different types of assessment. You've got gold standard assessment. Well, that is not always clear in 00:05:45.500 --> 00:05:51.600 education. You can doubt it sometimes. For instance, faculty outcomes, exams, final exams are 00:05:51.600 --> 00:05:59.100 considered to be gold standards. Or if you talk about assessment, then you've got parent report, 00:05:59.100 --> 00:06:00.100 teacher observation, NOTE Accuracy: 74% (MEDIUM) 00:06:00.100 --> 00:06:06.800 you've got special needs education assessment... It's all assessment, and then there's such a 00:06:06.800 --> 00:06:13.100 thing as self-report. And then we talk about functional health status or health-related quality of 00:06:13.100 --> 00:06:20.500 life, things like that, where you ask a child or you ask the target population to report on 00:06:20.500 --> 00:06:28.800 certain things. And then it's, of course, self-report. Now, for all of that you need a CAT, 00:06:28.800 --> 00:06:29.950 a critical appraisal tool. NOTE Accuracy: 83% (HIGH) 00:06:29.950 --> 00:06:37.800 You decide what? How do you report? How do you evaluate?
What steps do I need? So that is about 00:06:37.800 --> 00:06:43.000 selecting checklists and reporting guidelines. And we've seen already before, you've got this 00:06:43.000 --> 00:06:52.000 website with more tools than you want. So you need to decide: you go through 00:06:52.000 --> 00:06:58.100 this flow chart and you decide, well, is it animals or not? Is it quantitative or qualitative, etc., 00:06:58.100 --> 00:06:59.950 and you end up, you can NOTE Accuracy: 84% (HIGH) 00:06:59.950 --> 00:07:07.650 decide on all of these names, STROBE, CONSORT, PRISMA, that are all guidelines and checklists, 00:07:07.650 --> 00:07:11.800 critical appraisal tools or guidelines on reporting. NOTE Accuracy: 79% (HIGH) 00:07:11.800 --> 00:07:19.700 If a study has got poor study quality, then it's out. We know that already. So that's why you use a 00:07:19.700 --> 00:07:26.500 critical appraisal tool to decide whether the studies, the articles, the manuscripts are actually 00:07:26.500 --> 00:07:33.100 of good quality, methodological quality. You're not there yet. So that is all about quality 00:07:33.100 --> 00:07:39.600 assessment, but we are not there because we also want to talk about psychometrics, and that is about 00:07:39.600 --> 00:07:41.600 the measure itself. NOTE Accuracy: 81% (HIGH) 00:07:41.600 --> 00:07:49.400 The critical appraisal tools are to determine whether a study is good: whether the 00:07:49.400 --> 00:07:55.200 method in the study is correct, whether there's no bias, no confounding, etc. But now 00:07:55.200 --> 00:08:01.900 we're going to talk about psychometrics, and that relates to the assessment. So psychometrics, what 00:08:01.900 --> 00:08:06.900 is it? Well, if you start quick and easy, we go to Wikipedia. You're never allowed to do that. 00:08:06.900 --> 00:08:12.500 But sometimes it is easy. And this is what Wikipedia tells us.
NOTE Accuracy: 81% (HIGH) 00:08:13.600 --> 00:08:18.800 Psychometrics is a field of study concerned with the theory and technique of psychological 00:08:18.800 --> 00:08:25.200 measurement. It comes out of psychology, but it's also used a lot in 00:08:25.200 --> 00:08:31.100 education. You have a lot of studies and interesting research done, but in general it refers to the 00:08:31.100 --> 00:08:38.299 field in psychology and education, and it is about testing, measurement, assessment, all of that. 00:08:38.299 --> 00:08:43.600 And you want to have objective measurements of skills and knowledge, ability, etc. NOTE Accuracy: 79% (HIGH) 00:08:43.600 --> 00:08:50.700 So it focuses on the construction and validation of assessment instruments. Now, that is 00:08:50.700 --> 00:08:56.300 not that clear in itself, so, okay, why do we even care? What is 00:08:56.300 --> 00:09:03.000 psychometrics? Well, the reason why psychometrics is important is that, as researchers and 00:09:03.000 --> 00:09:11.000 practitioners, we use assessments and we need psychometric properties to support evidence-based 00:09:11.000 --> 00:09:13.550 practice and research. NOTE Accuracy: 91% (HIGH) 00:09:13.550 --> 00:09:20.500 If a measurement is not valid or not reliable, we've got a problem. You can potentially have 00:09:20.500 --> 00:09:27.800 error in your clinical judgment or in your research because you can't trust your assessment. So if 00:09:27.800 --> 00:09:33.200 you do anything in measurement, you need to know that your measure is measuring what it is supposed 00:09:33.200 --> 00:09:39.700 to measure, that it is doing that in a reliable way, that it is sensitive to change. All of that is 00:09:39.700 --> 00:09:41.400 psychometrics.
NOTE Accuracy: 84% (HIGH) 00:09:41.400 --> 00:09:48.800 So, knowing that a measure has robust psychometric properties enables us to use it in 00:09:48.800 --> 00:09:54.900 clinical settings or in research, and we can then determine the effectiveness of intervention programs. 00:09:54.900 --> 00:10:01.800 If you use a measurement with poor psychometrics, your whole research is based on nothingness, you can 00:10:01.800 --> 00:10:08.400 throw it away. It is not valid. It's not reliable. So it is pretty important, when you do any 00:10:08.400 --> 00:10:11.349 research, that you are sure your outcome measures are valid and reliable. NOTE Accuracy: 87% (HIGH) 00:10:11.349 --> 00:10:17.400 That is what we will be talking about in the next two lectures. Now, these are just some 00:10:17.400 --> 00:10:24.550 examples to explain where we went wrong or where we had problems. So this is a psychometric review on 00:10:24.550 --> 00:10:32.700 health-related quality of life measures in swallowing problems. We did a review and the outcome was... 00:10:32.700 --> 00:10:38.500 Actually, you should use the SWAL-QOL, which is the measure used the most. It was supposed to 00:10:38.500 --> 00:10:41.300 be the best one. We looked at NOTE Accuracy: 80% (HIGH) 00:10:41.300 --> 00:10:46.800 psychometric criteria and we said, the SWAL-QOL, at that time, and that was in 2014, we said 00:10:46.800 --> 00:10:54.700 that is the one with the strongest ratings and strongest psychometric properties. Then a few years later we said, 00:10:54.700 --> 00:11:00.100 well, actually we're going to have a look at this SWAL-QOL, and we used different stats, 00:11:00.100 --> 00:11:05.800 and that is item response theory. I'm not going to give you all the details, but then we concluded there was an 00:11:05.800 --> 00:11:11.150 urgent need to further investigate the underlying structure. We've got a problem.
NOTE Accuracy: 82% (HIGH) 00:11:11.150 --> 00:11:16.300 And that has got to do with this: in the past we used classical test theory, but item 00:11:16.300 --> 00:11:22.100 response theory is actually what we use now more and more when we look at instrument development. 00:11:22.100 --> 00:11:27.800 So a few years ago we said the SWAL-QOL is the best, and now we know already we've got a problem. 00:11:27.800 --> 00:11:34.600 I'll give you another example. And this is just in my area, but it is in every area. So we looked at the MBSImP. 00:11:34.600 --> 00:11:40.300 And one of the authors, Martin-Harris, she came with the MBSImP and she said: well, I've 00:11:40.300 --> 00:11:41.300 got this fantastic tool NOTE Accuracy: 91% (HIGH) 00:11:41.300 --> 00:11:48.200 that you can use. Remember that I showed you the gold standard in swallowing. It is 00:11:48.200 --> 00:11:55.700 radiography. You make an x-ray recording of swallowing. Gold standard assessment. She's got a visual 00:11:55.700 --> 00:12:00.500 perceptual measure, because you need to rate the recordings. She came with the measure. She said, 00:12:00.500 --> 00:12:07.800 this one is fantastic for the inter-rater reliability, content, external validity, etc. Okay, use 00:12:07.800 --> 00:12:11.349 that one. So then we did a psychometric review. NOTE Accuracy: 88% (HIGH) 00:12:11.349 --> 00:12:20.000 We looked at visual perceptual measures. Those are measures used to evaluate either 00:12:20.000 --> 00:12:26.550 endoscopic or these radiographic recordings. And we looked at all of them. And we said, there's 00:12:26.550 --> 00:12:34.400 insufficient evidence to recommend any measure. There are big problems in psychometrics. Which is 00:12:34.400 --> 00:12:40.800 actually totally not aligned with what was claimed for this MBSImP. So we said it's not a good one.
NOTE Accuracy: 84% (HIGH) 00:12:40.800 --> 00:12:47.200 We've got a lot of discussions going on currently, and these are just examples from 00:12:47.200 --> 00:12:55.600 my area, but people will write and present manuscripts and say "my measure is fantastic". But if you 00:12:55.600 --> 00:13:00.300 then compare it and you look at other articles and you compare the psychometric properties, 00:13:00.300 --> 00:13:07.500 well, actually there's bias, it's not okay. And in order to be able to compare all these 00:13:07.500 --> 00:13:11.000 assessments and outcomes we need a framework, NOTE Accuracy: 79% (HIGH) 00:13:11.000 --> 00:13:20.300 an international consensus-based framework in psychometrics. And that is COSMIN, 00:13:20.300 --> 00:13:25.900 and COSMIN stands for COnsensus-based 00:13:25.900 --> 00:13:31.100 Standards for the selection of health Measurement INstruments. Again, health and education. It is 00:13:31.100 --> 00:13:38.200 also used in education. And this is currently the most frequently used international consensus, 00:13:38.200 --> 00:13:41.250 because it was based on Delphi studies, you name it. 00:13:42.700 --> 00:13:52.100 So in 2010, they published a framework. They had done their Delphi study, which is a way to 00:13:52.100 --> 00:13:59.200 achieve international consensus. In a number of rounds, you ask experts to 00:13:59.200 --> 00:14:05.300 agree on definitions, on a framework, whatever you decide that you want to discuss with 00:14:05.300 --> 00:14:11.200 experts. They did that with psychometrics because in the literature you will see that people use many 00:14:12.300 --> 00:14:17.600 different terms for the same psychometric property, and that's a problem. So let's have a look at this 00:14:17.600 --> 00:14:25.400 COSMIN framework.
And there's a whole website, so if you want to know more about it, the website is 00:14:25.400 --> 00:14:30.500 www.cosmin.nl. It happens to be Dutch, but that is not the reason why I talk about it. 00:14:30.500 --> 00:14:36.400 You will find a lot on that website. And one of the things that you will find there is their 00:14:36.400 --> 00:14:42.250 framework. In psychometrics there are domains, and within the NOTE Accuracy: 90% (HIGH) 00:14:42.250 --> 00:14:46.700 domains we've got psychometric properties, and that's what we're going to talk about now. So first, 00:14:46.700 --> 00:14:53.100 we've got reliability. Well, most of us have got an idea about reliability, and it is the degree to 00:14:53.100 --> 00:14:58.600 which the measurement is free from measurement error. Now, you can see within the domain, we've got 00:14:58.600 --> 00:15:05.600 different psychometric properties. We talk about those later. The other domain is validity, and validity 00:15:05.600 --> 00:15:12.300 has got to do with the degree to which an instrument measures the construct it purports to measure. NOTE Accuracy: 91% (HIGH) 00:15:12.300 --> 00:15:18.700 So are you measuring what you think you're measuring? And then we've got responsiveness. 00:15:18.700 --> 00:15:25.800 And that is the ability to detect change over time. And that is important, because especially if you use 00:15:25.800 --> 00:15:32.300 an assessment in interventions, then you must make sure that your assessment is sensitive to 00:15:32.300 --> 00:15:37.300 change. If there is a change in the condition of the kids, of the children, you need to make sure 00:15:37.300 --> 00:15:42.250 that your measure can pick up on that.
Now, in the corner you can see NOTE Accuracy: 72% (MEDIUM) 00:15:42.250 --> 00:15:49.500 interpretability, which is not a psychometric property, but it is important because 00:15:49.500 --> 00:15:57.100 it is the degree to which you can assign qualitative meaning to quantitative scores, and that has got to 00:15:57.100 --> 00:16:04.700 do with the distribution of your scores, floor and ceiling effects, etc. But the whole framework is three 00:16:04.700 --> 00:16:11.800 domains. So you can see the three domains, and interpretability sits a little bit next to that. NOTE Accuracy: 85% (HIGH) 00:16:11.800 --> 00:16:19.200 Now, these are my three domains, again: reliability, validity and responsiveness. Within the domains, we've got 00:16:19.200 --> 00:16:26.300 psychometric properties. So this is reliability. You see internal consistency, reliability and 00:16:26.300 --> 00:16:33.000 measurement error. Within validity you can see content validity, structural validity, hypothesis 00:16:33.000 --> 00:16:42.200 testing, cross-cultural validity or measurement invariance, and criterion validity. And responsiveness is the third domain. 00:16:42.200 --> 00:16:48.900 Now, just to give you a little bit more: these three, structural validity, hypothesis testing and 00:16:48.900 --> 00:16:56.100 cross-cultural validity, are together also called construct validity. They've got to do with how the 00:16:56.100 --> 00:17:03.900 measure is constructed, are there any subscales, etc. That is construct validity. But this is the framework of COSMIN: 00:17:03.900 --> 00:17:10.900 nine psychometric properties within three domains, each with definitions, etc. 00:17:12.349 --> 00:17:19.400 We are going to have a look at some examples. There we are. I'll give them to you one by one. 00:17:19.400 --> 00:17:26.250 We will talk through all these nine psychometric properties. Before we do that, are there any questions?
00:17:26.250 --> 00:17:32.800 Melissa: Yeah, I have a question about the articles that you showed us. 00:17:32.800 --> 00:17:39.050 So you realized that there was, like, a bit of maybe bias and stuff? NOTE Accuracy: 85% (HIGH) 00:17:39.050 --> 00:17:48.000 Okay, so this is just to go one slide back. We have the nine properties and we're 00:17:48.000 --> 00:17:54.700 going to take them one by one. So first, internal consistency. What is internal consistency? 00:17:54.700 --> 00:18:00.000 That has got to do with, if you've got yourself items in a questionnaire, how much are they related 00:18:00.000 --> 00:18:06.700 to each other, these items? It's about the interrelatedness among items, and there should be some 00:18:06.700 --> 00:18:09.650 relatedness, because otherwise you are NOTE Accuracy: 78% (HIGH) 00:18:09.650 --> 00:18:15.350 combining concepts or domains that should not be in one measure. You don't want to put in one measure 00:18:15.350 --> 00:18:21.800 how much do you earn and how many pets have you got. It doesn't make any sense. So, there should be 00:18:21.800 --> 00:18:29.300 some correlation. But if the correlation is too high, then there is redundancy. Meaning, you are 00:18:29.300 --> 00:18:35.500 measuring the same thing. And that is a waste of everybody's time. If it is too low, 00:18:35.500 --> 00:18:39.550 if there is too little correlation, NOTE Accuracy: 72% (MEDIUM) 00:18:39.550 --> 00:18:45.500 that means you are actually measuring different concepts. Now, how do we measure internal 00:18:45.500 --> 00:18:51.000 consistency? There are different measures, and one of the measures is Cronbach's alpha, and the rule of 00:18:51.000 --> 00:18:57.750 thumb is that alpha should be somewhere between 0.70 and 0.95. Too high, redundancy. 00:18:57.750 --> 00:19:04.600 Too low, not enough interrelatedness.
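NOTE The rule of thumb above can be illustrated with a small Python sketch. The lecture does not give the formula; the sketch assumes the standard definition of Cronbach's alpha, alpha = k/(k-1) * (1 - sum of item variances / variance of the total score), and the function names and the toy data are invented for the example.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(scores[0])                      # number of items
    items = list(zip(*scores))              # one tuple of scores per item
    item_var_sum = sum(variance(item) for item in items)
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_var_sum / total_var)


def within_rule_of_thumb(alpha, low=0.70, high=0.95):
    """True if alpha falls inside the 0.70 to 0.95 band mentioned in the lecture."""
    return low <= alpha <= high


# Toy data: three items that move in lockstep (redundant items).
redundant = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
# Toy data: two items that are unrelated across respondents.
unrelated = [[1, 1], [1, 2], [2, 1], [2, 2]]

for name, data in (("redundant", redundant), ("unrelated", unrelated)):
    alpha = cronbach_alpha(data)
    print(name, round(alpha, 2), within_rule_of_thumb(alpha))
```

With these toy data the lockstep items land above the band (redundancy) and the unrelated items land below it, which mirrors the "too high" and "too low" cases in the lecture. The same function would be applied to the total scale and to each subscale separately.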
So your measure is okay, internal consistency is okay, if you are 00:19:04.600 --> 00:19:09.600 somewhere in between, and that is for the measure in total NOTE Accuracy: 88% (HIGH) 00:19:09.600 --> 00:19:16.700 and for any subscale that is present, if there are subscales in your measure. Again, you don't want 00:19:16.700 --> 00:19:22.800 to measure apples and oranges. That's why it is important to look at internal consistency. Now, this 00:19:22.800 --> 00:19:29.100 is an example, the validity of the Special Needs Education Assessment Tool, or SNEAT, and 00:19:29.100 --> 00:19:34.500 it is a newly developed scale for children with disabilities. Now, I'm going to talk you through 00:19:34.500 --> 00:19:39.550 this table. So here you see this SNEAT, and the SNEAT has got eleven items NOTE Accuracy: 83% (HIGH) 00:19:39.550 --> 00:19:45.700 and three constructs. You can see here the three constructs: 00:19:45.700 --> 00:19:51.200 physical functioning, mental health and social functioning. And you can see here the Cronbach's alpha. 00:19:51.200 --> 00:19:58.500 Here we are, and you can see that for each subscale they determined the Cronbach's alpha, 00:19:58.500 --> 00:20:05.000 and for all scales together, the total scale. And the conclusion is, since the cutoff is between 00:20:05.000 --> 00:20:09.600 0.70 and 0.95, that is good internal consistency. NOTE Accuracy: 90% (HIGH) 00:20:09.600 --> 00:20:15.800 It's nicely between those limits. So that's good. So you can tick off that psychometric property. 00:20:15.800 --> 00:20:22.600 Now, if you go to another psychometric property, we talk about reliability. And if we talk 00:20:22.600 --> 00:20:30.100 about reliability, you've got different types of reliability. We've got intra-rater, inter-rater and 00:20:30.100 --> 00:20:39.500 test-retest. Intra-rater is if the same person rates something.
Again, you NOTE Accuracy: 83% (HIGH) 00:20:39.500 --> 00:20:47.700 have two moments of measurement, usually a week, max 10 days or so, apart. If you 00:20:47.700 --> 00:20:53.200 score a group, you don't want any changes in the group. So you are looking 00:20:53.200 --> 00:21:02.600 at error in the rater. So you yourself rate the group, or the same recordings, 00:21:02.600 --> 00:21:09.000 twice within a few days and you check whether you are a reliable rater. That is if it 00:21:09.000 --> 00:21:09.700 is the same rater. NOTE Accuracy: 84% (HIGH) 00:21:09.700 --> 00:21:16.100 Inter-rater is if you've got several raters rating the same stuff and you compare them and you just 00:21:16.100 --> 00:21:23.500 check how they are doing. And test-retest is if I use the same assessment in a group with a couple 00:21:23.500 --> 00:21:30.199 of days in between. Then I expect that the group has not changed. There's no change in the 00:21:30.199 --> 00:21:37.700 participants, so I would like to find the same outcome. So, intra-rater, same rater. 00:21:37.700 --> 00:21:39.550 Inter-rater, between raters. NOTE Accuracy: 90% (HIGH) 00:21:39.550 --> 00:21:45.400 Test-retest, in the same group, the same measurement, usually with a couple of days in between. 00:21:45.400 --> 00:21:51.400 That is all reliability, and it's got to do with the proportion of the total variance in the measurements 00:21:51.400 --> 00:21:58.600 which is due to true differences between patients, meaning not error. Okay, now this is 00:21:58.600 --> 00:22:04.900 reliability. Reliability has got to do with consistency. This one is nicely also valid, which is next, 00:22:04.900 --> 00:22:09.650 because it's in the center, but it's consistent. We talked about that previously. And if I look NOTE Accuracy: 75% (MEDIUM) 00:22:09.650 --> 00:22:16.100 at this one on the right, I take two raters and compare their scoring. It's like a cloud, and that is a 00:22:16.100 --> 00:22:21.900 mess.
There's not enough correlation between the two raters. So, the inter-rater reliability is very 00:22:21.900 --> 00:22:30.100 poor on the right. Now, back to my SNEAT. So these were the constructs, and I've got an ICC, an intraclass 00:22:30.100 --> 00:22:36.900 correlation coefficient. So that is one of the statistics that you can use if you want 00:22:36.900 --> 00:22:39.550 to know about reliability. And again, NOTE Accuracy: 78% (HIGH) 00:22:39.550 --> 00:22:46.000 we are looking at these subscales and the total scores. You can see here, my ICCs have been determined, 00:22:46.000 --> 00:22:53.700 and there are also rules for this, and the rule says your ICC should be above 0.7, 00:22:53.700 --> 00:22:59.200 and all of these rules come from COSMIN, by the way. So the ICC is something that can vary between 0 and 1, 00:22:59.200 --> 00:23:08.600 1 is ideal, and you would like an ICC above 0.7. So, if I look at that, 00:23:08.600 --> 00:23:09.550 I think, well, NOTE Accuracy: 90% (HIGH) 00:23:09.550 --> 00:23:16.100 it's kind of okay, they're above. That one is, nah. That one just made it. That's how you look 00:23:16.100 --> 00:23:26.700 at reliability. And in this case, we are looking at the ICC. Okay, measurement error. 00:23:26.700 --> 00:23:32.950 That's another psychometric property. It has got to do with the systematic and random error of a score 00:23:32.950 --> 00:23:39.450 that is not attributed to true changes. We really talk about error in your data. NOTE Accuracy: 84% (HIGH) 00:23:39.450 --> 00:23:46.500 And we always have error in our data. And I'll give you an example of what error is. So you've got some terms: 00:23:46.500 --> 00:23:54.700 standard error of measurement, we call it SEM, and smallest detectable change, that is SDC. Now, the 00:23:54.700 --> 00:24:00.000 error has an impact on the smallest detectable change. I'll give you an example. I've got two 00:24:00.000 --> 00:24:08.000 pictures here, beautiful high resolution, and low resolution.
Now if you've got a lot of noise or error, 00:24:09.550 --> 00:24:17.600 you can't see the details. So, that means, if you've got a high SEM, high error, you miss small changes. 00:24:17.600 --> 00:24:24.500 You can't see them. That's the idea. So, if there is a lot of error in your data, then the smallest 00:24:24.500 --> 00:24:30.800 detectable change is not that small anymore, because it is being hindered, it's got a 00:24:30.800 --> 00:24:33.050 problem with the error in your data. NOTE Accuracy: 85% (HIGH) 00:24:33.050 --> 00:24:40.400 I'll give you an example. I'm going to measure ladybirds. Why? Because. And I'm going to do that with a 00:24:40.400 --> 00:24:46.700 ruler. And this ruler has got one-centimeter marks. So the smallest detectable change on my ruler is one 00:24:46.700 --> 00:24:51.900 centimeter. Well, if I'm going to measure this one, that's not really useful. What I actually would 00:24:51.900 --> 00:24:59.000 like is a ruler with millimeters, and then my minimal important change here is one 00:24:59.000 --> 00:25:03.250 millimeter. If I were to use this one, the smallest I can NOTE Accuracy: 81% (HIGH) 00:25:03.250 --> 00:25:10.500 measure is centimeters. If I use this one, I've got millimeters. So again, the minimal important 00:25:10.500 --> 00:25:19.000 change, if I measure ladybirds, is probably the millimeter. 00:25:19.000 --> 00:25:27.400 So you say my minimal important change is one millimeter, but if I were to use 00:25:27.400 --> 00:25:33.199 that ruler there, my smallest detectable change would be 1 centimeter. The smallest detectable NOTE Accuracy: 89% (HIGH) 00:25:33.199 --> 00:25:40.500 change should be smaller than the minimal important change. I hope that is clear. So 00:25:40.500 --> 00:25:48.800 what you can detect should be smaller than what you want to detect. 00:25:48.800 --> 00:25:50.900 Is that clear for everybody?
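NOTE The relationship between measurement error and the smallest detectable change can be sketched as below. The lecture does not state the formulas; this sketch assumes two commonly used ones, SEM = SD * sqrt(1 - ICC) and SDC = 1.96 * sqrt(2) * SEM, and all function names and numbers are hypothetical illustrations rather than values from any study mentioned here.

```python
import math


def sem_from_icc(sd: float, icc: float) -> float:
    """Standard error of measurement from the score SD and a reliability coefficient."""
    return sd * math.sqrt(1.0 - icc)


def smallest_detectable_change(sem: float) -> float:
    """Smallest change that exceeds measurement error with about 95% confidence."""
    return 1.96 * math.sqrt(2.0) * sem


def can_detect_important_change(sdc: float, mic: float) -> bool:
    """The lecture's rule: what you can detect must be smaller than what matters."""
    return sdc < mic


# Made-up example: score SD of 10 points and an ICC of 0.91.
sem = sem_from_icc(sd=10.0, icc=0.91)
sdc = smallest_detectable_change(sem)
print(f"SEM = {sem:.2f}, SDC = {sdc:.2f}")

# Ruler analogy, in millimeters: a centimeter ruler cannot show a 1 mm change,
# while a hypothetical ruler with 0.5 mm marks can.
print(can_detect_important_change(sdc=10.0, mic=1.0))
print(can_detect_important_change(sdc=0.5, mic=1.0))
```

More error (a lower ICC or a larger SD) gives a larger SEM, which in turn pushes the SDC up, so the measure becomes blind to smaller changes.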
NOTE Accuracy: 91% (HIGH) 00:25:51.400 --> 00:26:00.950 Melissa: Can you repeat that last sentence you said? Renee: Okay. So SDC has got to do with 00:26:00.950 --> 00:26:08.600 what you can detect. So what you can detect should be smaller than what is important, what you want 00:26:08.600 --> 00:26:16.500 to detect. So if I can detect only one-centimeter differences, but what I want or need to detect is 00:26:16.500 --> 00:26:22.100 one millimeter, I've got a problem because my SDC is not smaller than my minimal important change. 00:26:24.000 --> 00:26:26.700 Is that clear? NOTE Accuracy: 91% (HIGH) 00:26:28.100 --> 00:26:30.800 Is that okay? NOTE Accuracy: 80% (HIGH) 00:26:31.900 --> 00:26:42.600 Melissa: Yeah, maybe I'm a little bit lost there. If you go back to 00:26:42.600 --> 00:26:49.500 the ladybug example... Renee: Okay, I'm gonna stop the recording, then we can have a chat. NOTE Accuracy: 85% (HIGH) 00:26:50.100 --> 00:26:59.700 So all I'm saying is, the detectable change is what your measure can identify, what 00:26:59.700 --> 00:27:07.300 small changes it can still pick up. And you need to decide, as an educationalist or clinician, what is 00:27:07.300 --> 00:27:15.600 an important change or not. But the detectable change should always be smaller than what 00:27:15.600 --> 00:27:20.400 I'm interested in, because otherwise I can't identify a change that NOTE Accuracy: 86% (HIGH) 00:27:20.400 --> 00:27:26.800 is important to me. That is the main issue. Now, I'm going to make your life a little bit more complex 00:27:26.800 --> 00:27:33.500 before the break. So here's an example. This is the Pragmatic Observational Measure, POM. It was a new 00:27:33.500 --> 00:27:40.400 measure, and they were interested in measurement error. You don't need to 00:27:40.400 --> 00:27:44.300 understand all the details, but I'm going to explain to you what is behind measurement error. 00:27:44.300 --> 00:27:50.350 So don't just let it go, just hear me out.
So, the first step is you need NOTE Treffsikkerhet: 87% (H?Y) 00:27:50.350 --> 00:27:56.200 to determine the standard error of measurement, the SEM. There's a formula for it, so we do that. 00:27:56.200 --> 00:28:04.300 So the SEM was 0.blah, it's a fact. So the intra reliability, the measurement error is that, then 00:28:04.300 --> 00:28:09.500 you want to know the smallest detectable change. There's a rule for how you do that and we say, 00:28:09.500 --> 00:28:15.900 well, that is the smallest detectable change. That is what my measure still can do. Then the other 00:28:15.900 --> 00:28:20.400 thing is my minimal important change and that is NOTE Treffsikkerhet: 89% (H?Y) 00:28:20.400 --> 00:28:26.750 something where you say: I've got minimum and maximum scores of 27 and 108, that is the range. 00:28:26.750 --> 00:28:34.400 And my SDC versus the minimum is that, and my SDC versus the maximum is that. 00:28:34.400 --> 00:28:43.000 And then you say, well, the SDC is too small. So small to represent any 00:28:43.000 --> 00:28:50.300 clinically important change. That's how you try. And if you go to this, that means here are the rules. NOTE Treffsikkerhet: 83% (H?Y) 00:28:50.300 --> 00:29:00.300 We're not going to get any deeper, but you compare errors and the smallest 00:29:00.300 --> 00:29:06.050 detectable change and what you think is important, the minimal important change. All of that, 00:29:06.050 --> 00:29:11.200 you try to compare. I'm not going any deeper, I will not ask you to determine anything 00:29:11.200 --> 00:29:17.250 like it, but it is just for you to understand a little bit the idea behind measurement error. NOTE Treffsikkerhet: 86% (H?Y) 00:29:17.250 --> 00:29:23.800 And that brings me to the next one. And that is where I would like to stop right now before we go to 00:29:23.800 --> 00:29:33.600 the next one. Let's see, stop sharing. *came from break* Okay, this was where we were, content validity.
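The measurement-error steps walked through before the break (SEM first, then SDC, then comparison with the minimal important change) can be sketched in a few lines. This is only an illustration: the lecture does not say which formula or rule was used, so the sketch assumes the commonly used textbook versions, SEM = SD x sqrt(1 - reliability) and SDC = 1.96 x sqrt(2) x SEM. The standard deviation, the reliability coefficient and the minimal important change below are invented numbers, not values from the POM study.

```python
import math

def sem_from_reliability(sd, reliability):
    """Standard error of measurement (assumed formula: SD * sqrt(1 - reliability))."""
    return sd * math.sqrt(1.0 - reliability)

def smallest_detectable_change(sem, z=1.96):
    """Smallest detectable change for one individual (assumed rule: z * sqrt(2) * SEM)."""
    return z * math.sqrt(2.0) * sem

def can_detect(sdc, mic):
    """True when the measure can pick up a change that is considered important."""
    return sdc < mic

# Hypothetical inputs, for illustration only.
sd = 10.0            # spread of the scores in the sample
reliability = 0.91   # e.g. an intraclass correlation coefficient
mic = 10.0           # minimal important change chosen by clinicians

sem = sem_from_reliability(sd, reliability)
sdc = smallest_detectable_change(sem)
print(f"SEM = {sem:.2f}, SDC = {sdc:.2f}, SDC < MIC: {can_detect(sdc, mic)}")
```

The point of the comparison is the same as in the ruler example: if `can_detect` is false, the instrument is the centimeter ruler and the change you care about is the millimeter.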
00:29:33.600 --> 00:29:41.000 So, content validity is one of the most important psychometric properties according to the Cosmin framework 00:29:41.000 --> 00:29:45.500 and it's got to do with whether your measure, your instrument, is an adequate reflection of the 00:29:45.500 --> 00:29:46.750 construct that you want to measure. NOTE Treffsikkerhet: 84% (H?Y) 00:29:46.750 --> 00:29:51.200 And of course, it is very important because if you are measuring something that you don't 00:29:51.200 --> 00:29:57.400 want to measure or it is measuring something slightly different, you've got a problem. 00:29:57.400 --> 00:30:04.550 So content validity is really number one. If content validity is poor then Cosmin actually 00:30:04.550 --> 00:30:11.100 suggests the whole measure is already out. So content validity has got to do with the following 00:30:11.100 --> 00:30:16.800 three components. It's got to do with relevance, comprehensiveness and comprehensibility. NOTE Treffsikkerhet: 89% (H?Y) 00:30:16.800 --> 00:30:25.050 Now again, you see here the overall definition, but relevance has got to do with 00:30:25.050 --> 00:30:31.000 the degree to which all items of an instrument are relevant for the construct of interest 00:30:31.000 --> 00:30:37.300 with the target population. Now, one of the ways to make sure you've got good relevance is to have a 00:30:37.300 --> 00:30:43.200 Delphi study, and everybody can have their say about your items and about what should be added or revised. 00:30:43.200 --> 00:30:53.800 Comprehensiveness is the degree to which all key concepts of the construct are included. So if you do your Delphi, 00:30:53.800 --> 00:31:01.300 you ask whether items are relevant or not. But you also ask, is anything missing, what other items 00:31:01.300 --> 00:31:09.100 should we add? And then the third step, comprehensibility, has got to do with how easily your measure is 00:31:09.100 --> 00:31:16.800 understood by the respondent.
Meaning if I design a self-report measure for children, NOTE Treffsikkerhet: 91% (H?Y) 00:31:16.800 --> 00:31:23.600 then the child must understand the items. They must understand the sentence, you want to know how to 00:31:23.600 --> 00:31:30.100 interpret it. So very often you pilot that in a small group of children and you talk to them, you've 00:31:30.100 --> 00:31:35.100 got semi-structured interview and you ask: ''How do you interpret this? How do 00:31:35.100 --> 00:31:41.350 you rate this? '' So, all three needs to be ticked off when we talk about content validity. NOTE Treffsikkerhet: 80% (H?Y) 00:31:41.350 --> 00:31:48.600 Now, the Cosmin has got ten criteria for good content validity. You can see, 00:31:48.600 --> 00:31:56.300 this is relevance. Then we've got comprehensiveness, anything missing and comprehensibility. How do 00:31:56.300 --> 00:32:03.000 they interpret the target population? How do they interpret your items? How is the wording? Is the 00:32:03.000 --> 00:32:08.800 response option okay? Sometimes you think yes and they say no no, life is not 00:32:08.800 --> 00:32:11.750 that black and white. They want a different response option. NOTE Treffsikkerhet: 88% (H?Y) 00:32:11.750 --> 00:32:20.000 That has got to do with comprehensibility. Now, structural validity is the one that has got to do 00:32:20.000 --> 00:32:27.500 with whether your items, your scores on your items, are adequate reflection of the dimensionality of 00:32:27.500 --> 00:32:33.600 your construct. We're talking about are there for instance sub scores and things like that. 00:32:33.600 --> 00:32:40.500 I'm going to give you subscales. I'm going to give you an example. So this is about structural 00:32:40.500 --> 00:32:41.699 validity and then we NOTE Treffsikkerhet: 91% (H?Y) 00:32:41.699 --> 00:32:48.900 talk about things like factor analysis. And you can see here, all these terms. Well, I don't want 00:32:48.900 --> 00:32:55.900 you to bother too much. 
This is all stats, but I want you to understand a little bit what is factor 00:32:55.900 --> 00:33:03.400 analysis because factor analysis is what you use to determine to identify different factors, 00:33:03.400 --> 00:33:08.900 different subscales in your scale if present. It could also be that it's uni-dimensional, your 00:33:08.900 --> 00:33:11.700 measure. Meaning, there's only one factor. NOTE Treffsikkerhet: 71% (MEDIUM) 00:33:11.700 --> 00:33:18.800 But using factor analysis, you can see your group items together. Sometimes it can even be 00:33:18.800 --> 00:33:26.400 an overlap. But you kind of say, well these items seem to be linked to the same underlying concept, 00:33:26.400 --> 00:33:31.900 could be health-related quality of life. Or you say, well these items are more functional 00:33:31.900 --> 00:33:37.699 health status. So and you can say, well this is about economic. So it's totally different 00:33:37.699 --> 00:33:41.350 factors and factor analysis NOTE Treffsikkerhet: 78% (H?Y) 00:33:41.350 --> 00:33:48.300 kind of groups them together, tries to say: Well, this is really where I would like a cut off, 00:33:48.300 --> 00:33:54.200 however there's an overlap, a gray area and then you need to decide either to delete the 00:33:54.200 --> 00:34:02.400 item or maybe to group it in one of the subscales. So it is about the underlying clusters, factors of 00:34:02.400 --> 00:34:10.000 items, of variables. And factor analysis uses correlations between these variables to group them. 00:34:10.000 --> 00:34:11.699 And then you could decide if I had data NOTE Treffsikkerhet: 91% (H?Y) 00:34:11.699 --> 00:34:18.100 like this, I would say probably there are three subscales in my measure. Now, you've got two 00:34:18.100 --> 00:34:25.600 different type of factor analysis. You've got exploratory. You've got confirmatory. And exploratory 00:34:25.600 --> 00:34:31.400 is, what you do, you've got your data, you just want to know... 
Okay, you group them, how 00:34:31.400 --> 00:34:40.900 many factors are there? How are the items divided? But confirmatory is you decided already, I've got two 00:34:40.900 --> 00:34:41.649 factors, these items, NOTE Treffsikkerhet: 91% (H?Y) 00:34:41.649 --> 00:34:51.250 in other words, confirmatory is I decided I've got these two or three subscales and 00:34:51.250 --> 00:35:01.100 I want to know how well factor analysis can confirm my model. With exploratory factor 00:35:01.100 --> 00:35:08.950 analysis, I do not have a model, I ask my analysis: Hey, you give me a good grouping, and then 00:35:08.950 --> 00:35:11.850 it comes up with a model that NOTE Treffsikkerhet: 87% (H?Y) 00:35:11.850 --> 00:35:19.100 fits the best for your data. So confirmatory would be if you designed a measure and I want to 00:35:19.100 --> 00:35:26.600 check how good are your dimensions? How good is your structural validity? Then I would use confirmatory 00:35:26.600 --> 00:35:32.300 factor analysis. If I develop something, then I would say, well, let's see how it looks. So I don't 00:35:32.300 --> 00:35:40.700 have a fixed idea yet. So structural validity has got to do with underlying groupings, underlying factors, 00:35:41.650 --> 00:35:50.000 and you use such a thing as factor analysis to identify those different factors. And a factor is usually 00:35:50.000 --> 00:35:56.200 a subscale. If you have everything grouped together, that means you've got a unidimensional scale, 00:35:56.200 --> 00:35:58.050 there's only one factor. NOTE Treffsikkerhet: 86% (H?Y) 00:35:58.050 --> 00:36:05.450 Another psychometric property is hypothesis testing for construct validity and that's got to do with 00:36:05.450 --> 00:36:11.800 scores on your measure. And what you are going to do is you've got ideas about your scores. You say, 00:36:11.800 --> 00:36:18.500 well, I think that these scores are correlating with an existing measure.
So you develop the measure 00:36:18.500 --> 00:36:25.600 in dyslexia, there are other measures in dyslexia and you say: well, 00:36:25.600 --> 00:36:27.550 I think actually the scores on my measure NOTE Treffsikkerhet: 87% (H?Y) 00:36:27.550 --> 00:36:34.300 will be highly correlating with the scores on another measure. And that 00:36:34.300 --> 00:36:41.900 is a hypothesis that you formulate and beforehand and then you do your analysis and 00:36:41.900 --> 00:36:51.100 you try to confirm that hypothesis. Assumption of relation to other measures 00:36:51.100 --> 00:36:57.500 for instance. Now, I give you an example. The validation of the social inclusion scale with NOTE Treffsikkerhet: 76% (H?Y) 00:36:57.500 --> 00:37:04.900 students. There's no gold standards. That means you can't do Criterion validity, but convergent 00:37:04.900 --> 00:37:11.700 validity is when several different assessments or methods obtain the same information and 00:37:11.700 --> 00:37:17.600 you think they will have the same output. That could be two different measures in 00:37:17.600 --> 00:37:25.000 social inclusion. You expect they would correlate, that means you are determining convergent 00:37:25.000 --> 00:37:27.500 validity. So my hypothesis NOTE Treffsikkerhet: 91% (H?Y) 00:37:27.500 --> 00:37:33.200 in this case, in this article, was that social inclusion scale will show positive 00:37:33.200 --> 00:37:39.600 Correlations with other social inclusion measures, and you also have another hypothesis saying 00:37:39.600 --> 00:37:46.500 greater social inclusion is associated with greater mental well-being. So, these two hypotheses are 00:37:46.500 --> 00:37:53.200 part of hypothesis testing. So you do that in advance. Now, how does that look like? First, very short 00:37:53.200 --> 00:37:57.100 on correlations. You might have seen this before. NOTE Treffsikkerhet: 75% (MEDIUM) 00:37:57.100 --> 00:38:04.400 These are all positive correlations. 
This is a maximum positive correlation, R is 1. This is pretty 00:38:04.400 --> 00:38:09.750 good R is 0.8 . It's a little bit scattered, but it still has got that tendency that is 00:38:09.750 --> 00:38:18.900 lined up from left to top bottom. This is ideal again, except this one outlier. So correlation, 00:38:18.900 --> 00:38:26.200 you can see how it's really affected just by that one single outlier. Now, then there is also negative correlations, 00:38:26.750 --> 00:38:34.200 then that means, if one measure has got higher values, 00:38:34.200 --> 00:38:39.300 the other one has also got higher values. It's the other way around, higher values on one measure, 00:38:39.300 --> 00:38:46.450 lower values on the other. Yeah, high temperatures and the amount of ice cream, 00:38:46.450 --> 00:38:53.900 that's negative correlation. So this is a ideal negative correlation of minus 1. This is, there is a 00:38:53.900 --> 00:38:56.750 correlation, but it's rather weak, is moderate. NOTE Treffsikkerhet: 78% (H?Y) 00:38:56.750 --> 00:39:04.000 You can still draw a line through it and will be negative, but it's quite scattered. And that one is a 00:39:04.000 --> 00:39:11.450 total cloud. Meaning that is a zero. There's no correlation whatsoever in it. So ideal positive, 00:39:11.450 --> 00:39:22.100 ideal negative, and that is zero correlation. Okay, so R score can be ranging 00:39:22.100 --> 00:39:25.750 anywhere between minus 1 and plus 1. NOTE Treffsikkerhet: 85% (H?Y) 00:39:25.750 --> 00:39:33.500 Okay. Now, here we are. I'll talk you through it. These are the scales set. So you've 00:39:33.500 --> 00:39:38.900 got the social inclusion scale, and these are subscales. And apparently there's also a short 00:39:38.900 --> 00:39:47.500 form, then here we've one hypothesis. 00:39:47.500 --> 00:39:53.400 First one said SIS will show positive correlations with other social inclusion measures. So that is one to five. 
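The scatterplot slides can be mimicked with a small calculation. The sketch below is an illustration with made-up data, not the data on the slides: it computes Pearson's r by hand and shows an ideal positive correlation, an ideal negative correlation, and how a single outlier can pull an otherwise perfect positive correlation down.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient; ranges from -1 to +1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cross / (spread_x * spread_y)

# Made-up scores on two measures for five people.
x = [1, 2, 3, 4, 5]
ideal_positive = [2, 4, 6, 8, 10]   # higher on one, higher on the other
ideal_negative = [10, 8, 6, 4, 2]   # higher on one, lower on the other
one_outlier = [2, 4, 6, 8, 0]       # same as ideal_positive except the last person

r_pos = pearson_r(x, ideal_positive)
r_neg = pearson_r(x, ideal_negative)
r_out = pearson_r(x, one_outlier)
print(round(r_pos, 2), round(r_neg, 2), round(r_out, 2))
```

With only five data points the outlier effect is exaggerated, but it is the same mechanism as on the slide: one person far off the line changes r a lot.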
00:39:53.400 --> 00:39:56.300 We've got here a number of other NOTE Treffsikkerhet: 83% (H?Y) 00:39:56.300 --> 00:40:04.900 social inclusion scales and the 6th one, that is mental well-being. That's another scale. Now, how do 00:40:04.900 --> 00:40:09.900 you determine whether your hypothesis is confirmed or not? So you have all these correlations 00:40:09.900 --> 00:40:16.900 between all these subscales, the total mean, the short form. What you do is you check the 00:40:16.900 --> 00:40:23.500 significance. The yellow ones are significant and the red ones are not significant. Now, you can see 00:40:23.500 --> 00:40:25.950 the majority confirms my hypothesis, NOTE Treffsikkerhet: 90% (H?Y) 00:40:25.950 --> 00:40:33.100 the yellow ones. So you can see how you interpret it here. And so the conclusion is 00:40:33.100 --> 00:40:40.200 hypothesis confirmed, because by far the majority confirms the correlations, and that was linked to the 00:40:40.200 --> 00:40:46.100 two hypotheses I had. And that means that your hypothesis testing is positive. It's good. 00:40:46.100 --> 00:40:53.500 You confirm the hypothesis. That brings me to cross-cultural validity. That's the next one. NOTE Treffsikkerhet: 77% (H?Y) 00:40:53.500 --> 00:41:00.700 And cross-cultural validity has got to do with the performance of items on a translated or 00:41:00.700 --> 00:41:09.900 culturally adapted instrument. And for instance, in the Netherlands we had a vocabulary test and it 00:41:09.900 --> 00:41:17.250 was a very old one and they used pictures of hamburgers. And hotdogs. It was an American 00:41:17.250 --> 00:41:23.350 assessment.
At that time those terms were not common in the NOTE Treffsikkerhet: 85% (H?Y) 00:41:23.350 --> 00:41:30.700 Dutch language, especially not those pictures for young children, so that had to be adapted and we 00:41:30.700 --> 00:41:36.800 had to replace that with the typically Dutch bread, boring bread, like in Norway by the way, it's 00:41:36.800 --> 00:41:42.500 quite the same. So we changed it, we adapted it. But if you do that, you need to check whether it's 00:41:42.500 --> 00:41:48.600 still valid in the same way as the original measure. Well, there is also such a thing as 00:41:48.600 --> 00:41:53.400 measurement invariance. I'm not going to teach you too much about that but it's NOTE Treffsikkerhet: 91% (H?Y) 00:41:53.400 --> 00:42:00.200 got to do with if you translate a measure, if you adjust measures for Norway, and you probably 00:42:00.200 --> 00:42:07.000 will have to do that because the Norwegian language is of course a relatively small language 00:42:07.000 --> 00:42:13.700 compared to English. You will probably want to cross-validate many of these measures, and this is part 00:42:13.700 --> 00:42:17.600 of the Cosmin checklist. What you need to do, you need to have a look at, among other stats, whether the sample is 00:42:17.600 --> 00:42:23.350 similar to the one you want to compare it with. So the Norwegian sample NOTE Treffsikkerhet: 83% (H?Y) 00:42:23.350 --> 00:42:30.700 compared to the original sample, is the approach okay, did you use the correct analysis, 00:42:30.700 --> 00:42:37.000 are there any particular flaws? So all of that, and the Cosmin has got a whole book work 00:42:37.000 --> 00:42:43.600 behind it. So you can check whether you are doing the right stats to compare your measure in the 00:42:43.600 --> 00:42:48.700 Norwegian language to the original measure. It's got to do with cross-cultural validity. NOTE Treffsikkerhet: 83% (H?Y) 00:42:48.700 --> 00:42:54.300 Now, this one we talked already about, Criterion validity.
When we talked about diagnostic 00:42:54.300 --> 00:43:03.100 performance, you compare a index test your screen with a gold standard, your reference test. That is in 00:43:03.100 --> 00:43:10.200 fact, Criterion validity. So if you've got a new measure, you developed a new measure and we are 00:43:10.200 --> 00:43:15.600 developing several measures. You want to compare it with anything that's considered to be a gold 00:43:15.600 --> 00:43:18.250 standard that is present. NOTE Treffsikkerhet: 79% (H?Y) 00:43:18.250 --> 00:43:26.300 Sometimes you need to say there is no gold standard in God knows what, in the example 00:43:26.300 --> 00:43:33.400 I told you in shyness there was no gold standard. But there may be in language. There may be at 00:43:33.400 --> 00:43:41.600 least gold standards or at least test that most clinicians experts say this is a very important 00:43:41.600 --> 00:43:46.600 language test and then you can say, well, how about I consider that to be my gold standard. 00:43:46.600 --> 00:43:48.350 So the gold standard is something NOTE Treffsikkerhet: 83% (H?Y) 00:43:48.350 --> 00:43:56.200 clinicians determine. Not just you, 00:43:56.200 --> 00:44:02.600 I mean the whole group of Educationalist, they decide this is the most important test in 00:44:02.600 --> 00:44:07.900 my area. Now there's another thing that you also say regarding Criterion validity. Sometimes 00:44:07.900 --> 00:44:15.300 you've got long versions of instrument and then researchers want to shorten it. Then you 00:44:15.300 --> 00:44:18.250 want to compare the shortened version with the long version NOTE Treffsikkerhet: 75% (MEDIUM) 00:44:18.250 --> 00:44:23.700 and see if it's still doing the same thing. So that is a particular case of still 00:44:23.700 --> 00:44:25.850 Criterion validity. NOTE Treffsikkerhet: 87% (H?Y) 00:44:25.850 --> 00:44:33.100 I give you an example again here. 
So this is the reliability and validity of student peer assessment 00:44:33.100 --> 00:44:38.600 in medical education. This is also a review that I did and I wanted to know is peer assessment 00:44:38.600 --> 00:44:46.600 actually useful or not. So what I did, I use the gold standard, faculty assessment, meaning are 00:44:46.600 --> 00:44:53.149 compared peer assessment anywhere in medical or alike health education and I compared was that 00:44:53.149 --> 00:44:55.200 anyway related to the final exams. NOTE Treffsikkerhet: 81% (H?Y) 00:44:55.200 --> 00:45:02.900 So this is an example, I've got here some participant groups, the geomedical, single 00:45:02.900 --> 00:45:10.900 measurement, 18 students, etc. Etc. There was a physician performance and I had a peer assessment. 00:45:10.900 --> 00:45:18.300 So this was the faculty assessment and this was students rating each other and the conclusion was 00:45:18.300 --> 00:45:25.250 peer and faculty rating weakly correlate, still statistically significant, but 0.3 isn't very strong. NOTE Treffsikkerhet: 91% (H?Y) 00:45:25.250 --> 00:45:33.000 So conclusion actually here is peer assessment is not really linked or hardly linked to 00:45:33.000 --> 00:45:40.350 your final outcomes. So why would you use it? That was a whole discussion?Okay, Criterion validity. 00:45:40.350 --> 00:45:46.100 Hang in there. There are nine psychometric properties. We're getting there. 00:45:46.100 --> 00:45:53.100 That brings me to responsiveness. Now I talked already a number of times about responsiveness and 00:45:53.100 --> 00:45:54.950 responsiveness is NOTE Treffsikkerhet: 78% (H?Y) 00:45:54.950 --> 00:46:01.800 about the change over time in a construct to be measured. So I'm looking here at my thermometer and 00:46:01.800 --> 00:46:07.200 you can see if I've got very high temperatures, it's not sensitive anymore because of the 00:46:07.200 --> 00:46:14.900 thermometer is incapable of measuring any higher temperatures. 
Also down there, when it's 00:46:14.900 --> 00:46:22.200 getting colder, my thermometer is doing a bad job. So this thermometer, in the middle here, 00:46:22.200 --> 00:46:27.200 it's very responsive. But the edges are not responsive anymore. NOTE Treffsikkerhet: 78% (H?Y) 00:46:27.200 --> 00:46:31.900 Now that is important in interventions because you want to make sure that 00:46:31.900 --> 00:46:36.600 your measure is sensitive enough to the changes you are interested in. NOTE Treffsikkerhet: 86% (H?Y) 00:46:37.300 --> 00:46:46.300 So responsiveness is assessed with hypothesis testing based on correlations of absolute changes in 00:46:46.300 --> 00:46:54.700 scores. You formulate your hypotheses before data analysis, and then you have expected correlations 00:46:54.700 --> 00:47:01.800 based on literature, on clinical experience, or consensus with authors. This is how you 00:47:01.800 --> 00:47:07.100 interpret the correlations, and you say there's good responsiveness if more than 75% of NOTE Treffsikkerhet: 80% (H?Y) 00:47:07.100 --> 00:47:13.800 your hypotheses can be confirmed. It's a little bit like hypothesis testing, but this is 00:47:13.800 --> 00:47:21.200 now hypothesis testing linked to change: hypotheses about changes, and you want changes to correlate. 00:47:21.200 --> 00:47:29.200 So an example is: the change in total score of measure X shows at least a very strong 00:47:29.200 --> 00:47:34.600 positive correlation with the change in total score of measure Y. So, I've got two measures that are 00:47:34.600 --> 00:47:37.050 correlated. That will be hypothesis testing. NOTE Treffsikkerhet: 81% (H?Y) 00:47:37.050 --> 00:47:45.000 But now I want to know whether the change in both measures is correlated. And that is 00:47:45.000 --> 00:47:48.600 responsiveness. Is that okay? NOTE Treffsikkerhet: 91% (H?Y) 00:47:50.100 --> 00:47:58.400 That brings me to the last one, interpretability.
And as it says in the title, that is not a 00:47:58.400 --> 00:48:03.600 psychometric property, not a measurement property. Psychometric property, measurement property 00:48:03.600 --> 00:48:10.700 means the same thing, but it still is important. Interpretability is the degree to which you can 00:48:10.700 --> 00:48:17.200 assign qualitative meaning to your scores. Now, what does that mean? I'll give you an example, 00:48:18.200 --> 00:48:24.900 interpretability of single scores is got to do with the 00:48:24.900 --> 00:48:31.200 distribution. You've got the distribution of a study population and then you've got a score on your 00:48:31.200 --> 00:48:38.900 child, one child only and you want to know, is that good or is it actually bad. So you can by 00:48:38.900 --> 00:48:45.300 comparing that score to the population score or relevant subgroups, you can interpret the data for 00:48:45.300 --> 00:48:47.899 that child. You can interpret this NOTE Treffsikkerhet: 78% (H?Y) 00:48:47.899 --> 00:48:54.300 objective score and give it a qualitative meaning. The other thing is that we talked about minimal 00:48:54.300 --> 00:49:02.700 important changes and in response shift, that is little bit complex, but sometimes there are changes 00:49:02.700 --> 00:49:09.500 in one self-evaluation of target construct and that is has got to do with internal standards that 00:49:09.500 --> 00:49:16.450 may change over time, that may change about the patient's value, how important something is to him 00:49:16.450 --> 00:49:18.050 and the comfort and quality of life NOTE Treffsikkerhet: 91% (H?Y) 00:49:18.050 --> 00:49:25.700 that may change, constructs like that, or a redefinition of the Target. And we're talking 00:49:25.700 --> 00:49:33.300 actually totally about reconceptualization. But let's have a look at the distribution of scores, an 00:49:33.300 --> 00:49:39.200 example. So I've got a distribution of scores and that looks like that. 
Yeah, so it is a study 00:49:39.200 --> 00:49:45.000 population or other relevant subgroups, and we're gonna have a look at clustering of scores and 00:49:45.000 --> 00:49:48.100 floor and ceiling effects. Now, this NOTE Treffsikkerhet: 87% (H?Y) 00:49:48.100 --> 00:49:55.300 is a histogram and that's how you can see it. You just visualize your data. This is the frequency 00:49:55.300 --> 00:50:00.900 on a Deglutition Handicap Index. This is a total score. You can see there's not much there. 00:50:00.900 --> 00:50:09.300 So there are no ceiling effects, and there are rules, something like less than 15%. That depends on the 00:50:09.300 --> 00:50:14.600 criteria you use, but you usually take something like 15 or 10 percent and they say, well, actually less 00:50:14.600 --> 00:50:18.050 than that percentage is totally on the left, NOTE Treffsikkerhet: 91% (H?Y) 00:50:18.050 --> 00:50:23.900 so there is no floor effect. That would be a floor effect if you had a lot of clustering on 00:50:23.900 --> 00:50:31.000 the left, some tests have got that, meaning they don't distinguish at the bottom or they can't 00:50:31.000 --> 00:50:38.100 distinguish in the higher data, but this is a nice spread. So that gives an idea about the 00:50:38.100 --> 00:50:45.000 interpretability, in particular the floor and ceiling effects of this measure. 00:50:45.000 --> 00:50:48.700 So, no floor or ceiling effects is a good thing. NOTE Treffsikkerhet: 76% (H?Y) 00:50:49.200 --> 00:50:57.700 So here we are, if you look at assessment properties of a 00:50:57.700 --> 00:51:04.700 measure, the most important thing according to Cosmin is content validity. If content 00:51:04.700 --> 00:51:11.500 is a problem, if that is not well documented, then the measure is out.
Then they say, okay, have a look 00:51:11.500 --> 00:51:17.950 at the internal structure, that it's got to do with subscales, with how do items correlate, NOTE Treffsikkerhet: 82% (H?Y) 00:51:17.950 --> 00:51:25.200 internal consistency, cross-cultural validity if applicable, of course if you don't 00:51:25.200 --> 00:51:29.900 translate it you don't need to tick that box. And then we've got the remaining measurement 00:51:29.900 --> 00:51:35.400 properties, and that is reliability. And that is measurement error, Criterion validity, 00:51:35.400 --> 00:51:43.300 hypothesis testing for construct validity and responsiveness. Now, all of that together is doing 00:51:43.300 --> 00:51:48.000 psychometrics, and yes you do need all nine. NOTE Treffsikkerhet: 90% (H?Y) 00:51:48.000 --> 00:51:54.200 You can't say this is where I stopped. You need to check all nine. If one of them is very bad, 00:51:54.200 --> 00:52:00.200 for instance very poor reliability, it's not like the other psychometric properties can compensate for 00:52:00.200 --> 00:52:09.200 that. You've got a problem. It's not reliable. So again, this is your domain of validity. You've got 00:52:09.200 --> 00:52:14.500 content validity, structural hypothesis testing, cross culture and Criterion validity. All of that 00:52:14.500 --> 00:52:17.950 together is validity. So that is why NOTE Treffsikkerhet: 87% (H?Y) 00:52:17.950 --> 00:52:25.200 it is wrong if you read an article and someone says, okay, we compared with gold standard, 00:52:25.200 --> 00:52:32.100 fantastic validity. No, No. Fantastic Criterion validity only. And that's just part of the 00:52:32.100 --> 00:52:40.400 whole domain. Same with measurement property. 
If we talk about reliability, there is more, you will 00:52:40.400 --> 00:52:47.000 see many articles that just do something like test-retest or only inter-rater and intra-rater 00:52:47.000 --> 00:52:47.950 and they claim to have NOTE Treffsikkerhet: 91% (H?Y) 00:52:47.950 --> 00:52:55.400 a reliable measurement, a reliable measure, a reliable assessment. That is not, that's half the truth. 00:52:55.400 --> 00:53:02.800 There is way more in the domain reliability, and people will, every time again, just state 00:53:02.800 --> 00:53:10.200 that the measure is valid and reliable while they only did a very small bit of the total domain reliability. 00:53:10.200 --> 00:53:17.950 And the final one, responsiveness, and that one is extremely important when NOTE Treffsikkerhet: 89% (H?Y) 00:53:17.950 --> 00:53:23.300 you use an assessment in an intervention, and that's probably how you very often want to use your assessments. 00:53:23.300 --> 00:53:28.200 Now, I'll stop for a moment. NOTE Treffsikkerhet: 74% (MEDIUM) 00:53:32.000 --> 00:53:39.300 Okay. So I am going to give you an example, I'm going to look at what psychometric 00:53:39.300 --> 00:53:46.400 property we are actually talking about. So these are abstracts, and if you can, read it, 00:53:46.400 --> 00:53:52.800 so we're just going to have a look at this one. This is an example and what I want to know is 00:53:52.800 --> 00:53:59.500 what psychometric property we are talking about. Now, take a little bit of time. Just scan it. 00:53:59.500 --> 00:54:02.700 Don't learn it by heart. Just scan it. NOTE Treffsikkerhet: 72% (MEDIUM) 00:54:13.300 --> 00:54:18.800 So really scan, and I'm talking you through, I gave the answer in the beginning but then we were 00:54:18.800 --> 00:54:24.700 discussing other problems. So I'm going to talk you through. So what they are doing, 00:54:24.700 --> 00:54:33.350 the objective is they want to have an oral motor assessment, SOMA, compared with videofluoroscopy.
00:54:33.350 --> 00:54:41.000 Now, we know that that is the same as when we talked about the gold standard, because videofluoroscopy was a 00:54:41.000 --> 00:54:46.000 gold standard. What would you then call the psychometric property? 00:54:49.600 --> 00:54:59.700 Anybody? NOTE Treffsikkerhet: 83% (H?Y) 00:55:10.600 --> 00:55:23.300 Melissa: Criterion validity. Renee: Bingo. Yeah, this is because the clinicians, we decided 00:55:23.300 --> 00:55:27.700 this is a gold standard. So that's why I say you just scan it, don't learn it by heart. So I 00:55:27.700 --> 00:55:33.300 just want to know this is what they do. If they do that we call it Criterion validity, because it is the 00:55:33.300 --> 00:55:39.500 degree to which scores of an instrument are an adequate reflection of a gold standard. 00:55:39.500 --> 00:55:39.850 Gold standard again, NOTE Treffsikkerhet: 91% (H?Y) 00:55:39.850 --> 00:55:47.200 something that we as clinicians and educationalists, we determine that. And we say this is 00:55:47.200 --> 00:55:59.500 the best one. So that's Criterion validity. So I'm going to give you a few more. Okay, just scan it. 00:56:00.600 --> 00:56:06.900 And what are words that are important when we talk about psychometrics, just recognize the words. 00:56:06.900 --> 00:56:10.100 I want to know what we are doing here. NOTE Treffsikkerhet: 80% (H?Y) 00:56:11.900 --> 00:56:20.600 Melissa: Reliability testing. Testing and retesting. Renee: Yeah. So that is the paragraph we're 00:56:20.600 --> 00:56:26.400 interested in. Everybody agrees? So I just want you to scan that. It doesn't matter what 00:56:26.400 --> 00:56:33.150 it is but we just want to scan. So they say they talk about test-retest, 00:56:33.150 --> 00:56:38.600 internal reliability and construct validity. That's what they say. Yeah, that's all in this text.
00:56:38.600 --> 00:56:41.800 So they say here, they do test-retest, NOTE Treffsikkerhet: 76% (H?Y) 00:56:41.800 --> 00:56:47.900 internal reliability, construct validity. I'm going to give you some examples. So this is what the 00:56:47.900 --> 00:56:53.700 authors say. Now I'm going to explain to you what they're doing and what I would call it if we talk 00:56:53.700 --> 00:57:00.200 about COSMIN and what they call it, because this is an example showing that authors actually use different 00:57:00.200 --> 00:57:04.900 terminology and you need to understand what they are doing. So I'm going to give you, this is 00:57:04.900 --> 00:57:10.800 another bit and I'm going to help you. So this is about the explanation of what they were doing. 00:57:11.600 --> 00:57:18.000 They said test-retest was done for both clinical and non-clinical subjects four to six weeks after the first 00:57:18.000 --> 00:57:26.100 administration. Cases were chosen by blabla. So a retest was done. So that sounds like test-retest 00:57:26.100 --> 00:57:35.850 reliability. Yeah, everybody agrees? Yes. Okay. I really appreciate that at least I hear one. Thank you. 00:57:35.850 --> 00:57:41.850 Okay, we go to the next one. This is still the same article. Now here they are saying NOTE Treffsikkerhet: 90% (H?Y) 00:57:41.850 --> 00:57:49.400 to assess internal consistency, Cronbach's alpha were computed. What are we talking about 00:57:49.400 --> 00:57:53.100 here? What would we call that if we talk about COSMIN? NOTE Treffsikkerhet: 73% (MEDIUM) 00:57:54.400 --> 00:58:02.900 Just use your slides to cheat. Melissa: Is it internal consistency? Renee: Exactly. So this is indeed internal 00:58:02.900 --> 00:58:07.900 consistency because, you know, you can see the Cronbach's alpha. And this is still the same 00:58:07.900 --> 00:58:18.100 term as we use. It is about the interrelatedness among items.
But now we get the next one 00:58:18.100 --> 00:58:23.700 and this is a whole story and they talked about the NOTE Treffsikkerhet: 79% (H?Y) 00:58:23.700 --> 00:58:30.000 construct validity, they're talking about construct validity but they referred to 00:58:30.000 --> 00:58:39.100 comparisons of this measure and they've got hypotheses. So what they call construct validity, 00:58:39.100 --> 00:58:47.900 COSMIN calls hypothesis testing. And that is why you need to, that's a little bit confusing, 00:58:47.900 --> 00:58:53.400 you need to make sure that whatever the authors call it, you check what they are actually doing 00:58:53.649 --> 00:58:59.200 and then you know whether this seems reasonable or not. 00:58:59.200 --> 00:59:06.600 So this is what they called it: test-retest, internal reliability, testing construct validity. 00:59:06.600 --> 00:59:15.800 COSMIN would say reliability, that's test-retest, internal reliability or internal consistency, and 00:59:15.800 --> 00:59:20.100 hypothesis testing. So the terms are slightly different. NOTE Treffsikkerhet: 91% (H?Y) 00:59:20.100 --> 00:59:26.700 Okay. Now, I think that this is for next time, we're not going to do that now, because I feel a 00:59:26.700 --> 00:59:32.900 little bit guilty that I've got one person who's doing the whole talking.